Building Engineering Capability: The 3-6 Person Model

The 'you need a huge DevOps team' myth prevents organizations from trying. With Kubernetes, GitOps, and AI assistance, 3-6 engineers can run production infrastructure. The key: a sustainable on-call rotation (1 week in every 3 to 6, depending on team size) and smart tooling choices.

By Jurg van Vliet

Published Aug 25, 2025

The Real Constraint

After years of infrastructure transitions, I'm convinced the hard part isn't technical. It's organizational.

"Nobody gets fired for buying AWS" is the modern version of the IBM adage. It's true—there's safety in choosing the dominant vendor. If something goes wrong, you made the safe choice.

Choosing a different path means accepting responsibility for outcomes you could have outsourced. That's a real consideration, not something to dismiss.

The "Too Small to Run Kubernetes" Myth

The most persistent objection: "You need a huge DevOps team to run Kubernetes."

This was true in 2015. Kubernetes was immature, operationally complex, poorly documented. Running it required specialists.

It's not true in 2025.

What changed:

  • Managed Kubernetes eliminates control plane operations (no etcd management, no API server upgrades)
  • GitOps (Flux, ArgoCD) reduces manual operations and prevents drift
  • Mature ecosystem provides patterns, tools, community support (Stack Overflow has answers)
  • AI assistance accelerates learning and reduces toil

The operational burden has dropped dramatically. The knowledge required is more accessible. The tooling is better.

The 3-6 Person Model

This is our actual operational model. Not hypothetical—this is what works for us.

Why 3 people minimum?

On-call rotation. If you want sustainable 24/7 coverage with reasonable quality of life, you need at least 3 people rotating.

With 3 people:

  • Each person: 1 week on-call, 2 weeks off (rotation every 3 weeks)
  • Coverage during vacation (2 people can cover while 1 is away)
  • Knowledge sharing (rotate, learn from each other)
  • Resilience to attrition (losing 1 person doesn't break coverage)

With 2 people:

  • Each person: 1 week on, 1 week off (unsustainable)
  • No vacation coverage (when one person is away, the other is on-call the whole time)
  • No resilience (1 person leaving breaks everything)
  • Burnout guaranteed
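
The arithmetic behind these rotations is easy to check. The sketch below assumes one primary on-call engineer at a time and a weekly handoff; the `rotation` function and its field names are illustrative, not part of any tooling described here.

```python
def rotation(team_size: int) -> dict:
    """Weekly single-primary rotation: each person is on-call 1 week in `team_size`."""
    if team_size < 1:
        raise ValueError("team_size must be at least 1")
    return {
        "weeks_on": 1,
        "weeks_off": team_size - 1,       # weeks between two of your own shifts
        "cycle_weeks": team_size,         # length of one full rotation
        "on_call_weeks_per_year": round(52 / team_size, 1),
        # People left to share the pager while one person is on vacation.
        "cover_during_vacation": team_size - 1,
    }


for n in (2, 3, 6):
    print(n, rotation(n))
```

Under these assumptions a 3-person team gives each engineer two weeks off between shifts, and a 6-person team gives five.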

Why 6 people maximum?

Communication overhead. Beyond 6 people, we find that coordination costs start to outweigh the productivity gains. Small teams move faster, make decisions more easily, and maintain shared context naturally.

Brooks's Law (adding people to a late project makes it later) rests on the same effect: the number of communication paths grows much faster than headcount. Keep the team small enough to maintain high communication bandwidth.
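
The communication-overhead argument can be made concrete. In a team of n people there are n(n-1)/2 distinct pairs who may need to stay in sync, which is the intercommunication count Brooks uses in The Mythical Man-Month. A minimal sketch (the function name is chosen for illustration):

```python
def communication_paths(team_size: int) -> int:
    """Number of distinct person-to-person pairs in a team: n * (n - 1) / 2."""
    if team_size < 0:
        raise ValueError("team_size must not be negative")
    return team_size * (team_size - 1) // 2


for n in (3, 6, 12):
    print(f"{n} people -> {communication_paths(n)} pairs")
```

Doubling a team from 6 to 12 more than quadruples the number of pairs (15 to 66), which is one way to see why shared context gets harder to maintain as headcount grows.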

What 3-6 People Can Actually Do

With modern tooling, this team size can operate:

Infrastructure:

  • Multi-cluster Kubernetes (management, test, production)
  • GitOps deployment pipelines (Flux reconciling from Git)
  • Centralized monitoring and observability (Prometheus, Grafana, Loki)
  • Secrets management (SOPS encryption, cert-manager certificates)

Operations:

  • On-call rotation (1 week in every 3 to 6 per person, depending on team size)
  • Incident response (runbooks, clear escalation)
  • System maintenance (updates, capacity planning)
  • Security patching (we responded to a CVE in 3 hours)

Development:

  • Feature implementation (new endpoints, UI improvements)
  • Test coverage (264 API tests, comprehensive E2E suite)
  • Documentation (architecture decisions, operational guides)
  • Refactoring and technical debt management

This isn't theoretical. This is what we actually do.

The Key: Leverage

Small teams succeed through leverage—making each person's effort go further.

1. Managed services (selective use)

Don't run Kubernetes control planes. Don't manage PostgreSQL at the disk level. Don't operate email servers.

Use managed offerings that reduce operational burden while maintaining portability:

  • Managed Kubernetes (Scaleway Kapsule, OVHcloud Managed K8s)
  • Managed PostgreSQL (provider-managed, or automated in-cluster with the CloudNativePG operator)
  • Transactional email (Scaleway TEM, SendGrid)

Avoid managed services that create lock-in:

  • Proprietary serverless (Lambda, Cloud Functions)
  • Proprietary databases (DynamoDB, CosmosDB)
  • Proprietary queues (SQS, Cloud Pub/Sub)

2. GitOps automation

Flux reconciles automatically. Clusters pull their own configuration. Manual deployment is eliminated.

The system maintains itself within defined boundaries:

  • Drift is corrected automatically at the next reconciliation, not hunted down and fixed by hand
  • Changes are auditable (Git history shows everything)
  • Rollback is clean (git revert + wait for reconciliation)

This reduces operational toil significantly. No "deploy to production" procedures. Just commit to Git.
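
To make the reconciliation idea concrete, here is a toy model of a pull-based reconcile loop. It is not Flux's implementation or API: the dictionaries standing in for the Git repository and the cluster, and the `reconcile` function itself, are assumptions made purely for illustration.

```python
from typing import Dict, List

State = Dict[str, str]  # resource name -> version (hypothetical, simplified model)


def reconcile(desired: State, actual: State) -> List[str]:
    """Make `actual` match `desired` in place and return the actions taken."""
    actions = []
    for name, spec in desired.items():
        if name not in actual:
            actions.append(f"create {name}={spec}")
        elif actual[name] != spec:
            actions.append(f"update {name}: {actual[name]} -> {spec}")
        actual[name] = spec
    # Anything running that Git does not declare is drift and gets removed.
    for name in sorted(set(actual) - set(desired)):
        actions.append(f"delete {name}")
        del actual[name]
    return actions


git = {"api": "v1.4.0", "web": "v2.1.0"}  # what the repository declares
cluster = {"api": "v1.4.0", "web": "v2.0.9", "debug-pod": "latest"}  # live state with drift

print(reconcile(git, cluster))  # first pass corrects the drift
print(reconcile(git, cluster))  # second pass finds nothing to do
```

The second call returns an empty list because the live state already matches. A rollback follows the same path: revert the commit, the declared state changes back, and the next reconcile applies the difference.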

3. Comprehensive monitoring

Good observability means you understand problems faster. Our centralized monitoring:

  • Single pane of glass (one Grafana, all environments visible)
  • Complete data (no sampling, year-plus retention)
  • Fast debugging (correlate metrics and logs in seconds)

During incidents, time to understanding matters. Good monitoring reduces MTTR (mean time to recovery) dramatically.
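
MTTR is only useful if it is actually measured. A minimal sketch of the calculation, assuming each incident is recorded as a (detected, resolved) pair of timestamps; the incident data below is made up for illustration:

```python
from datetime import datetime, timedelta


def mean_time_to_recovery(incidents):
    """Average of (resolved - detected) over a list of (detected, resolved) pairs."""
    if not incidents:
        raise ValueError("need at least one incident")
    total = sum((resolved - detected for detected, resolved in incidents), timedelta())
    return total / len(incidents)


# Made-up incident timestamps, for illustration only.
incidents = [
    (datetime(2025, 1, 10, 9, 0), datetime(2025, 1, 10, 9, 40)),
    (datetime(2025, 2, 3, 22, 15), datetime(2025, 2, 3, 23, 35)),
    (datetime(2025, 3, 21, 14, 5), datetime(2025, 3, 21, 14, 35)),
]
print(mean_time_to_recovery(incidents))
```

Tracking this number per quarter is one way to tell whether changes to dashboards and runbooks are paying off.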

4. AI assistance

Claude helps us understand complex systems faster. It's not replacing engineers—it's multiplying their effectiveness.

Real examples:

  • Debugging multi-cluster networking issues
  • Writing OpenTofu modules that follow best practices
  • Understanding certificate chain validation failures
  • Transforming architecture discussions into documentation

5. Focus on high-value work

Automate toil. The team should work on:

  • Architecture and design decisions
  • Feature development
  • System improvements
  • Knowledge building

Not on:

  • Manual deployments (automated via GitOps)
  • Configuration drift fixes (reverted automatically by Flux)
  • Routine monitoring (automated alerts)
  • Repetitive operations (scripted or eliminated)

The Sustainable On-Call Model

On-call rotation must be sustainable. Burnout prevents long-term success. Here's our model:

Rotation schedule:

  • 1 week on-call, then 2 weeks off with 3 people (up to 5 weeks off with 6)
  • Predictable schedule (planned months in advance)
  • No "always on-call" culture
  • Clear handoff procedures

Escalation paths:

  • Primary on-call person handles initial response
  • Secondary on-call for escalation (clear criteria for when to escalate)
  • Full team escalation only for critical incidents affecting revenue
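
A schedule that is planned months in advance can be generated rather than maintained by hand. The sketch below is one possible approach, not a description of our tooling: the engineer names, the start date, and the rule that the secondary is next week's primary are all assumptions.

```python
from datetime import date, timedelta


def build_schedule(engineers, start, weeks):
    """Return (week_start, primary, secondary) tuples for `weeks` consecutive weeks."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    schedule = []
    for i in range(weeks):
        primary = engineers[i % len(engineers)]
        # Assumption: the secondary is next week's primary, so they begin
        # their own shift already aware of any open issues.
        secondary = engineers[(i + 1) % len(engineers)]
        schedule.append((start + timedelta(weeks=i), primary, secondary))
    return schedule


# Hypothetical names and start date.
team = ["ana", "bram", "chloe"]
for week_start, primary, secondary in build_schedule(team, date(2025, 9, 1), 6):
    print(week_start.isoformat(), primary, secondary)
```

Note that when both roles are staffed from a 3-person team, each engineer is completely off the pager only one week in three, which is worth weighing when deciding how the secondary role is defined.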

Runbooks for common issues:

  • Clear procedures reduce decision-making during stress
  • Known problems have documented solutions
  • Links to relevant dashboards and log queries
  • Escalation criteria defined explicitly

Blameless postmortems:

  • Learning, not punishment
  • What went wrong? What went right?
  • How can we prevent recurrence?
  • What should we improve?

This prevents burnout. Engineers can plan their lives, take vacations, have weekends—not be constantly anxious about alerts.

Skills vs Headcount

Small teams need breadth of skills, not narrow specialization.

T-shaped skill requirements:

  • Deep expertise in one area (frontend, backend, infrastructure, databases)
  • Working knowledge across stack (can debug, understand tradeoffs, collaborate)

This isn't about "10x engineers"—it's about team members who can work across the stack when needed.

Practical example:

  • Frontend engineer can debug API issues (understands HTTP, REST, auth)
  • Backend engineer can troubleshoot Kubernetes pods (understands containers, networking)
  • Infrastructure engineer can review application code (understands application requirements)

This breadth enables small teams to move fast without constant handoffs.

What Doesn't Work

Don't try to build everything. Small teams fail when they:

1. Run their own Kubernetes control planes

Use managed Kubernetes. Control plane operations (etcd management, API server upgrades, scheduler tuning) are toil that doesn't differentiate your business.

2. Build custom CI/CD from scratch

Use GitHub Actions, GitLab CI, or similar. Good CI/CD exists. Don't rebuild it.

3. Implement every feature themselves

Use mature open source tools (Prometheus, Grafana, PostgreSQL, cert-manager). Stand on the shoulders of giants.

4. Operate 24/7 with 2 people

Burnout guaranteed. You need a minimum of 3 people for a sustainable on-call rotation.

5. Optimize prematurely

Start with a simple architecture. Add complexity only when you have evidence it's needed. "Might need to scale to millions" isn't evidence.

Recruiting for Small Teams

Curious engineers want interesting problems.

"We configure managed services" is less compelling than "We build and operate our own infrastructure."

The engineers we want to attract:

  • Care about understanding how systems actually work (not just clicking buttons)
  • Value building over assembling (create, don't just compose)
  • Think about long-term consequences (architecture that lasts)
  • Are motivated by mission (sovereignty, privacy, sustainability matter)

Running on European infrastructure, building on open standards, maintaining portability—these aren't just operational choices. They're signals about engineering culture.

Engineers who care about these things often care about:

  • Code quality and craftsmanship
  • System understanding and debugging skill
  • Long-term thinking and sustainability
  • Impact beyond just shipping features

This is the talent you want. The mission helps attract them.

The Honest Tradeoffs

Small team benefits:

  • Fast decision-making (no lengthy approval processes)
  • High context (everyone knows everything)
  • Direct communication (talk, don't email)
  • Shared ownership (everyone responsible for everything)

Small team costs:

  • Limited specialization (everyone wears multiple hats)
  • Vacation coverage challenge (the 3-person minimum exists for a reason)
  • Knowledge concentration risk (bus factor)
  • Hiring pressure (each hire has huge impact)
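
Knowledge concentration can be checked mechanically if the team keeps even a rough map of who can operate what. A minimal sketch, with hypothetical area names and people:

```python
def single_points_of_knowledge(coverage, minimum=2):
    """Return areas known by fewer than `minimum` people, sorted by name."""
    return sorted(
        area for area, people in coverage.items() if len(set(people)) < minimum
    )


# Hypothetical areas and names.
coverage = {
    "flux": ["ana", "bram"],
    "postgres": ["chloe"],
    "monitoring": ["ana", "bram", "chloe"],
    "ci": ["bram"],
}
print(single_points_of_knowledge(coverage))  # ['ci', 'postgres']
```

Areas that show up in this list are natural candidates for the next pairing session or documentation pass.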

For European cloud independence, small teams are often ideal: you can move fast, make bold architectural choices, and build capability without bureaucracy.

Making It Work

Clear responsibilities: Even in small teams, someone needs to own each area. Ownership doesn't mean "only person who works on it"—it means "person responsible for ensuring it works."

Regular knowledge sharing: Weekly technical discussions, architecture reviews, postmortems. Keep everyone's context current.

Documentation discipline: Small teams are tempted to skip documentation ("we all know this"). Don't. You'll forget. New people will join. Document as you build.

Sustainable pace: Marathon, not sprint. Protect against burnout. Enforce reasonable on-call rotations. Take vacations.

Key Points:

  • 3-6 engineers can run production Kubernetes infrastructure
  • Managed services + GitOps + AI assistance = force multipliers
  • Sustainable on-call: a minimum 3-person rotation (1 week on, 2 weeks off), with longer breaks as the team grows
  • T-shaped skills: breadth across stack, depth in one area
  • Small teams move fast with right tooling and patterns
  • Mission matters: sovereignty resonates with purpose-driven engineers

#leadership #engineering #teambuilding #capability #sustainability