How I manage the infrastructure of skeeled alone
How do I manage the infrastructure of skeeled alone? Plot twist: I'm not really alone!
Over the past five years, I built and managed what is today skeeled's infrastructure. I worked alone, but with the trust and guidance of Rui, my manager, who knows infrastructure well and gave me full ownership of this task. All the while, I stayed confident in our stack and slept well at night.
This was accomplished through two main approaches:
- Keeping it simple with standards and "boring" technology
- Standing on the shoulders of giants (mentors and open source)
Keeping it simple
We follow industry standards and use well-tested boring technology. Our infrastructure is coded simply and clearly.
This approach let us migrate environments, join clusters, and adapt when tools like Rancher Fleet added features we needed. We could revert changes and try again seamlessly.
Declaring everything as code gives us an always up-to-date recipe for our infrastructure. We're more than halfway to having a complete Disaster Recovery Plan (DRP).
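To give a concrete flavour of what "everything declared as code" can look like, here is a minimal sketch of a Rancher Fleet GitRepo. The repository URL, paths, and cluster selector are hypothetical placeholders, not our actual setup.

```yaml
# Hypothetical Fleet GitRepo: Fleet watches a Git repository and applies
# the manifests under the listed paths to the clusters matching the selector.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: infrastructure                 # placeholder name
  namespace: fleet-default
spec:
  repo: https://example.com/our-org/infrastructure.git  # placeholder repository
  branch: main
  paths:
    - monitoring                       # e.g. Prometheus/Grafana manifests
    - ingress
  targets:
    - clusterSelector:
        matchLabels:
          env: production              # placeholder cluster label
```

The specific tool matters less than the property: with the whole recipe in Git, reverting a bad change is one commit away.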
When problems arise, we move quickly because we have a solid foundation, not a castle of sand. We have proper disaster recovery and data backups.
Standing on the shoulders of giants
I stand and work on the shoulders of giants: mentors and the open source community.
Mentors
Having mentors accelerates your career and knowledge.
When I joined Farfetch, my team forced (sorry, inspired) me to read two technical books:
- Site Reliability Engineering: How Google Runs Production Systems
- Hands-On Infrastructure Monitoring with Prometheus by Joel Bastos and Pedro Araújo
The second book is special, and I'm admittedly biased, because it was written by two team members who became important mentors.
My daily routine started with reading, taking notes, doing exercises, and asking questions. They were kind yet demanding in their explanations and expectations.
Before that, during my interview, they recommended The Phoenix Project, which I bought and read before starting.
They taught me to read open source code and understand how it works. When issues appear, check the source code—you might find a bug and contribute a fix.
When I joined skeeled knowing little about Kubernetes, the team again provided courses, books, and most importantly, time to properly learn.
Even after changing companies, I still contact my mentors for implementation advice. I pay it forward by helping friends, or even strangers who ask for help. These moments of helping matter more than all the solo coding hours. I still remember the feeling of helping a friend's wife who works at TripAdvisor with a Git issue, or helping a childhood friend with his company's GitHub Actions.
There's joy in getting help and learning from others, and even more joy in teaching and helping others.
Open source community
My job would be impossible without volunteer developers making the world better, usually for free. The world runs on software written by developers who provide it freely with source code we can learn from and improve.
This creates both a learning source and high-quality industry standard tools.
Our infrastructure is set up and runs almost exclusively on open source tools: Kubernetes, Helm, Prometheus, Grafana, Thanos, OpenTelemetry, SigNoz, Rancher Fleet, and Terraform[1].
Rather than diving into technical details, I'll focus on the important decisions these tools enabled.
Kubernetes is complex, but it's learnable complexity. Once you understand it, it saves significant work and provides sensible defaults.
Running Prometheus and Grafana in each cluster is essential (thanks to mentor advice). You can have all the global monitoring you want, but when network issues hit, you'll thank yourself for having local tools to understand the problem.
Trust me, it's always DNS.
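To make the per-cluster setup concrete, here is a minimal sketch using the community kube-prometheus-stack Helm chart. The release name, namespace, and values are illustrative assumptions, not our exact configuration.

```yaml
# Hypothetical values.yaml for the kube-prometheus-stack Helm chart:
# a Prometheus and a Grafana living inside each cluster, so the dashboards
# keep working even when outbound networking is broken.
# Example install:
#   helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml
grafana:
  enabled: true
alertmanager:
  enabled: true
prometheus:
  prometheusSpec:
    retention: 15d   # short local retention; long-term storage is Thanos's job
```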

Thanos provides global metric storage, letting us compare application behavior across years. This helped us understand strange behavior from a deployment months earlier.
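Under the hood, Thanos talks to object storage through a small configuration file. A minimal sketch, assuming an S3-compatible endpoint; the bucket, endpoint, and credentials are placeholders and would normally live in a secret:

```yaml
# Hypothetical Thanos object storage configuration (objstore.yml),
# passed to the Thanos sidecar / store gateway via --objstore.config-file.
type: S3
config:
  bucket: thanos-metrics                    # placeholder bucket
  endpoint: s3.region.example-provider.com  # placeholder S3-compatible endpoint
  access_key: REPLACE_ME
  secret_key: REPLACE_ME
```

Because this is just an S3-compatible endpoint, moving the long-term metrics to a different provider is mostly a matter of changing this file and copying the data.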
We use Dead Man's Snitch[2] for watchdog alerts. Prometheus should normally be silent so it doesn't interrupt your work, but then you need to know the monitoring itself still works. A watchdog alert gives you that peace of mind.
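The watchdog pattern itself is small: one alert that always fires, routed to an external service that pages you if it ever stops arriving. A minimal sketch, assuming a standard Prometheus rule file and an Alertmanager webhook receiver; the snitch URL and intervals are placeholders:

```yaml
# Hypothetical Prometheus rule: an alert that is always firing.
groups:
  - name: meta-monitoring
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: Always-firing alert proving the alerting pipeline works
---
# Hypothetical Alertmanager snippet (separate file): route the Watchdog
# to Dead Man's Snitch; if the pings stop arriving, the snitch notifies us.
route:
  receiver: default                     # placeholder default receiver
  routes:
    - matchers:
        - alertname = Watchdog
      receiver: deadmanssnitch
      repeat_interval: 5m
receivers:
  - name: default
  - name: deadmanssnitch
    webhook_configs:
      - url: https://nosnch.in/REPLACE_ME   # placeholder snitch URL
```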
These industry standards gave us flexibility to:
- Migrate from Datadog to SigNoz in days using OpenTelemetry (saving money; see the sketch after this list)
- Move Thanos data from AWS S3 to OVH Object Storage (saving money)
- Migrate clusters from Scaleway to OVH easily (saving money)
- Switch Kafka from Confluent to OVH (saving a lot of money, reducing costs to 20% of the original bill)
- Migrate away from MongoDB Atlas
- Move freely between cloud providers (AWS → OVH → Scaleway → Azure → GCP)
- Migrate regions after a datacenter fire
- Recover from an accidental database drop in production
- And more...
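The Datadog-to-SigNoz migration mentioned above was quick because our applications only ever speak OpenTelemetry; swapping vendors means changing the collector's exporter, not re-instrumenting every service. A rough sketch of an OpenTelemetry Collector pipeline, assuming an in-cluster SigNoz OTLP endpoint (the address is a placeholder):

```yaml
# Hypothetical OpenTelemetry Collector configuration: applications send OTLP,
# and only the exporters section decides where the data ends up.
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  batch: {}
exporters:
  otlp:
    endpoint: signoz-otel-collector.observability:4317  # placeholder SigNoz endpoint
    tls:
      insecure: true   # assuming in-cluster traffic; use TLS across networks
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```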
Key takeaways
- You're never truly alone
- Keep it simple—use boring, well-tested technologies and standards
- Get a mentor and mentor others
- Read technical books and manuals (RTFM)
- Invest in education; managers should provide learning time and resources
- Use and contribute to open source (start small even with documentation typos or examples)
- Define everything as code, including disaster recovery and backups
- Choose options that avoid vendor lock-in and enable movement
- Accept challenges even when you're not ready yet and do your best to learn
[1] Terraform now uses the Business Source License rather than being fully open source.
[2] Dead Man's Snitch is not open source. Let me know if you know a good open alternative.
What should you do next?
Comment or give me feedback via email. Follow the site via RSS or Email.