How I manage the infrastructure of skeeled alone
How do I manage the infrastructure of skeeled alone? Plot twist: I'm not really alone!
Over the past five years, I built and managed what is today skeeled's infrastructure. I worked alone, but with the trust and guidance of Rui, my manager, who knows infrastructure well and gave me full ownership of this task. All the while, I stayed confident in our stack and slept well at night.
This was accomplished through two main approaches:
- Keeping it simple with standards and "boring" technology
- Standing on the shoulders of giants (mentors and open source)
Keeping it simple
We follow industry standards and use well-tested boring technology. Our infrastructure is coded simply and clearly.
This approach let us migrate environments, join clusters, and adapt when tools like Rancher Fleet added features we needed. We could revert changes and try again seamlessly.
Declaring everything as code gives us an always up-to-date recipe for our infrastructure. We're more than halfway to having a complete Disaster Recovery Plan (DRP).
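To give a concrete flavour of what "everything declared as code" can look like, here is a minimal sketch of a Rancher Fleet GitRepo. The repository URL, paths, and cluster selector are hypothetical placeholders, not our actual setup.

```yaml
# Hypothetical Fleet GitRepo: Fleet watches a Git repository and applies
# the manifests under the listed paths to the clusters matching the selector.
apiVersion: fleet.cattle.io/v1alpha1
kind: GitRepo
metadata:
  name: infrastructure                 # placeholder name
  namespace: fleet-default
spec:
  repo: https://example.com/our-org/infrastructure.git  # placeholder repository
  branch: main
  paths:
    - monitoring                       # e.g. Prometheus/Grafana manifests
    - ingress
  targets:
    - clusterSelector:
        matchLabels:
          env: production              # placeholder cluster label
```

The specific tool matters less than the property: with the whole recipe in Git, reverting a bad change is one commit away.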
When problems arise, we move quickly because we have a solid foundation, not a castle of sand. We have proper disaster recovery and data backups.
Standing on the shoulders of giants
I stand and work on the shoulders of giants: mentors and the open source community.
Mentors
Having mentors accelerates your career and knowledge.
When I joined Farfetch, my team forced (sorry, inspired) me to read two technical books:
- Site Reliability Engineering: How Google Runs Production Systems
- Hands-On Infrastructure Monitoring with Prometheus by Joel Bastos and Pedro Araújo
The second book is special, and I'm admittedly biased, because it was written by two team members who became important mentors.
My daily routine started with reading, taking notes, doing exercises, and asking questions. They were kind yet demanding in their explanations and expectations.
Before that, during my interview, they recommended The Phoenix Project, which I bought and read before starting.
They taught me to read open source code and understand how it works. When issues appear, check the source code—you might find a bug and contribute a fix.
When I joined skeeled knowing little about Kubernetes, the team again provided courses, books, and most importantly, time to properly learn.
Even after changing companies, I still contact my mentors for implementation advice. I pay it forward by helping friends, or even strangers who ask for help. These moments of helping matter more than all the solo coding hours. I still remember the feeling of helping a friend's wife who works at TripAdvisor with a Git issue, or helping a childhood friend with his company's GitHub Actions.
There's joy in getting help and learning from others, and even more joy in teaching and helping others.
Open source community
My job would be impossible without volunteer developers making the world better, usually for free. The world runs on software written by developers who provide it freely with source code we can learn from and improve.
This creates both a learning source and high-quality industry standard tools.
Our infrastructure is set up and runs almost exclusively on open source tools: Kubernetes, Helm, Prometheus, Grafana, Thanos, OpenTelemetry, SigNoz, Rancher Fleet, and Terraform[1].
Rather than diving into technical details, I'll focus on the important decisions these tools enabled.
Kubernetes is complex, but it's learnable complexity. Once you understand it, it saves significant work and provides sensible defaults.
Running Prometheus and Grafana in each cluster is essential (thanks to mentor advice). You can have all the global monitoring you want, but when network issues hit, you'll thank yourself for having local tools to understand the problem.
Trust me, it's always DNS.
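To make the per-cluster setup concrete, here is a minimal sketch using the community kube-prometheus-stack Helm chart. The release name, namespace, and values are illustrative assumptions, not our exact configuration.

```yaml
# Hypothetical values.yaml for the kube-prometheus-stack Helm chart:
# a Prometheus and a Grafana living inside each cluster, so the dashboards
# keep working even when outbound networking is broken.
# Example install:
#   helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring -f values.yaml
grafana:
  enabled: true
alertmanager:
  enabled: true
prometheus:
  prometheusSpec:
    retention: 15d   # short local retention; long-term storage is Thanos's job
```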

Thanos provides global metric storage, letting us compare application behavior across years. This helped us understand strange behavior from a deployment months earlier.
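Under the hood, Thanos talks to object storage through a small configuration file. A minimal sketch, assuming an S3-compatible endpoint; the bucket, endpoint, and credentials are placeholders and would normally live in a secret:

```yaml
# Hypothetical Thanos object storage configuration (objstore.yml),
# passed to the Thanos sidecar / store gateway via --objstore.config-file.
type: S3
config:
  bucket: thanos-metrics                    # placeholder bucket
  endpoint: s3.region.example-provider.com  # placeholder S3-compatible endpoint
  access_key: REPLACE_ME
  secret_key: REPLACE_ME
```

Because this is just an S3-compatible endpoint, moving the long-term metrics to a different provider is mostly a matter of changing this file and copying the data.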
We use Dead Man's Snitch[2] for watchdog alerts. Prometheus should normally be silent so it doesn't interrupt your work, but then you need to know the monitoring itself still works. A watchdog alert gives you that peace of mind.
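The watchdog pattern itself is small: one alert that always fires, routed to an external service that pages you if it ever stops arriving. A minimal sketch, assuming a standard Prometheus rule file and an Alertmanager webhook receiver; the snitch URL and intervals are placeholders:

```yaml
# Hypothetical Prometheus rule: an alert that is always firing.
groups:
  - name: meta-monitoring
    rules:
      - alert: Watchdog
        expr: vector(1)
        labels:
          severity: none
        annotations:
          summary: Always-firing alert proving the alerting pipeline works
---
# Hypothetical Alertmanager snippet (separate file): route the Watchdog
# to Dead Man's Snitch; if the pings stop arriving, the snitch notifies us.
route:
  receiver: default                     # placeholder default receiver
  routes:
    - matchers:
        - alertname = Watchdog
      receiver: deadmanssnitch
      repeat_interval: 5m
receivers:
  - name: default
  - name: deadmanssnitch
    webhook_configs:
      - url: https://nosnch.in/REPLACE_ME   # placeholder snitch URL
```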
These industry standards gave us flexibility to:
- Migrate from Datadog to SigNoz in days using OpenTelemetry (saving money; see the sketch after this list)
- Move Thanos data from AWS S3 to OVH Object Storage (saving money)
- Migrate clusters from Scaleway to OVH easily (saving money)
- Switch Kafka from Confluent to OVH (saving a lot of money, reducing costs to 20% of the original bill)
- Migrate away from MongoDB Atlas
- Move freely between cloud providers (AWS → OVH → Scaleway → Azure → GCP)
- Migrate regions after a datacenter fire
- Recover from an accidental database drop in production
- And more...
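The Datadog-to-SigNoz migration mentioned above was quick because our applications only ever speak OpenTelemetry; swapping vendors means changing the collector's exporter, not re-instrumenting every service. A rough sketch of an OpenTelemetry Collector pipeline, assuming an in-cluster SigNoz OTLP endpoint (the address is a placeholder):

```yaml
# Hypothetical OpenTelemetry Collector configuration: applications send OTLP,
# and only the exporters section decides where the data ends up.
receivers:
  otlp:
    protocols:
      grpc: {}
processors:
  batch: {}
exporters:
  otlp:
    endpoint: signoz-otel-collector.observability:4317  # placeholder SigNoz endpoint
    tls:
      insecure: true   # assuming in-cluster traffic; use TLS across networks
service:
  pipelines:
    traces:
      receivers: [otlp]
      processors: [batch]
      exporters: [otlp]
```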
Key takeaways
- You're never truly alone
- Keep it simple—use boring, well-tested technologies and standards
- Get a mentor and mentor others
- Read technical books and manuals (RTFM)
- Invest in education; managers should provide learning time and resources
- Use and contribute to open source (start small even with documentation typos or examples)
- Define everything as code, including disaster recovery and backups
- Choose options that avoid vendor lock-in and enable movement
- Accept challenges even when you're not ready yet and do your best to learn
[1] Terraform now uses the Business Source License rather than being fully open source.
[2] Dead Man's Snitch is not open source. Let me know if you know a good open alternative.
What should you do next?
Comment or give me feedback via email. Follow the site via RSS or Email.