Right-Sizing Kubernetes for European Startups: Our 2-Node HA Production Setup
How we cut infrastructure costs by 40% while maintaining high availability on Scaleway Kapsule.
By Jurg van Vliet
The Kubernetes community has a scaling problem—not the technical kind, but a cultural one. Conference talks showcase clusters with hundreds of nodes. Best practices assume you have a platform team. Resource examples default to generous allocations that look reasonable on an enterprise budget.
But what about the European startup running a Next.js application with a PostgreSQL database and a caching layer? What about the team that needs production reliability but can't justify €500/month in compute costs for an early-stage product?
We run Clouds of Europe on a 2-node Scaleway Kapsule cluster. This article explains how we made that work: the QoS strategy that prevents OOM kills while maximizing node utilization, the anti-affinity rules that spread pods for availability, and the cost calculation that justified scaling down from three nodes to two.
The Cost Reality
Let's start with numbers. Scaleway's PRO2-XXS instances (2 vCPU, 8GB RAM) cost approximately €20/month each. Our initial production setup used three nodes across three availability zones:
- 3 nodes × €20 = €60/month for compute
- Plus storage, networking, load balancer
When we analyzed actual resource usage, we found we were running at 30-40% CPU utilization on average. The third node existed for theoretical resilience, not practical necessity.
Scaling to two nodes saved roughly €40/month (accounting for the full infrastructure delta). That's not transformative money, but for an early-stage project, €480/year funds other things—domain renewals, email services, the occasional debugging tool subscription.
More importantly, the exercise forced us to think rigorously about resource allocation. The real savings came from the efficiency improvements, not just the node reduction.
Understanding Kubernetes QoS Classes
Kubernetes assigns every pod a Quality of Service class based on its resource configuration. This class determines what happens when nodes run low on resources:
Guaranteed: Requests equal limits for both CPU and memory. These pods are never killed due to resource pressure (unless they exceed their own limits). They get exactly what they asked for, no more, no less.
Burstable: Requests are lower than limits, or only some resources have limits. These pods can use more than their requests if capacity is available, but they're candidates for eviction when nodes are under pressure.
BestEffort: No resource requests or limits set. These pods get whatever's left over and are first to be killed when resources are scarce.
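The three classes fall out mechanically from the resources stanza. The fragments below are a minimal sketch of what produces each class (values are illustrative, not our production settings):

```yaml
# Guaranteed: requests equal limits for every resource
resources:
  requests: { cpu: "250m", memory: "256Mi" }
  limits:   { cpu: "250m", memory: "256Mi" }

# Burstable: requests set, limits higher (or only some resources set)
resources:
  requests: { cpu: "100m", memory: "256Mi" }
  limits:   { cpu: "500m", memory: "256Mi" }

# BestEffort: no requests or limits at all
resources: {}
```

Once a pod is running, `kubectl get pod <name> -o jsonpath='{.status.qosClass}'` shows which class the kubelet assigned.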
The conventional wisdom is "set requests equal to limits for production workloads." This gives you Guaranteed QoS and predictable behavior. But it also means you're reserving resources you might not use.
Our Strategy: Guaranteed Memory, Burstable CPU
Here's the insight that changed our approach: memory and CPU behave differently under pressure.
Memory is binary. When a process needs memory and can't get it, bad things happen—OOM kills, data corruption, undefined behavior. There's no graceful degradation. If your application needs 512MB during a traffic spike, it needs 512MB.
CPU is throttleable. When a process needs more CPU than available, it slows down. Requests take longer. But nothing crashes. The kernel scheduler ensures every process gets its fair share based on requests, and anything above that is best-effort.
This means the optimal configuration for cost-conscious production is:
resources:
  requests:
    cpu: "100m"      # Low request for scheduling
    memory: "512Mi"  # Full memory requirement
  limits:
    cpu: "1000m"     # Allow bursting to 10x request
    memory: "512Mi"  # Same as request = Guaranteed for memory
This gives us:
- Burstable QoS overall (because CPU requests < limits)
- Guaranteed memory behavior (because memory requests = limits)
- Efficient scheduling (because CPU requests are low)
- Headroom for spikes (because CPU limits are high)
Applying This Across the Stack
We applied this pattern to every workload in our cluster.
Application (Next.js)
resources:
  requests:
    cpu: "100m"      # Low request for scheduling, can burst to limit
    memory: "512Mi"  # Increased for Prisma operations
  limits:
    cpu: "1000m"     # Allow bursting for API requests and Prisma operations
    memory: "512Mi"
The Next.js application is bursty by nature. Page renders and API calls spike CPU briefly, then go idle. Setting CPU requests at 100m (0.1 cores) means we're only "reserving" 10% of a core for scheduling purposes. But when a request comes in that needs computation—a complex Prisma query, server-side rendering, image processing—it can burst up to a full core.
Memory is set at 512Mi for both requests and limits. Prisma's connection pool and Next.js's server-side caching need predictable memory. We can't afford OOM kills during traffic spikes.
PostgreSQL (CloudNativePG)
resources:
  requests:
    cpu: 100m        # Low request for scheduling, can burst to limit
    memory: 512Mi
  limits:
    cpu: 500m        # Allow bursting to 500m when needed
    memory: 512Mi
PostgreSQL mostly sits idle waiting for queries. When queries arrive, they can be CPU-intensive (joins, aggregations, index scans). The 100m request means PostgreSQL doesn't block scheduling, while the 500m limit provides headroom for complex operations.
Memory is critical for PostgreSQL—shared buffers, work memory, connection state. We set requests = limits at 512Mi. The database will never be evicted due to memory pressure.
Memcached and Mcrouter
# Memcached
resources:
  requests:
    memory: "128Mi"
    cpu: "20m"
  limits:
    memory: "128Mi"
    cpu: "100m"

# Mcrouter
resources:
  requests:
    memory: "32Mi"
    cpu: "30m"
  limits:
    memory: "32Mi"
    cpu: "100m"
Cache workloads are memory-bound. Memcached's entire purpose is holding data in RAM. CPU usage is minimal—parsing requests, serialization. We give them guaranteed memory with burstable CPU.
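One detail worth matching up: memcached's own cache size should sit below the container limit, so connection buffers and slab overhead don't push the process past it. A sketch of how that might look (the `-m 100` value is an assumption for illustration, not our exact flag):

```yaml
containers:
  - name: memcached
    image: memcached:1.6
    # -m caps the item cache in MB; keeping it below the 128Mi container
    # limit leaves headroom for connection buffers and slab overhead,
    # so the kernel never OOM-kills the process
    args: ["-m", "100"]
    resources:
      requests: { memory: "128Mi", cpu: "20m" }
      limits:   { memory: "128Mi", cpu: "100m" }
```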
The Scheduling Math
Why does this matter for scheduling? Kubernetes schedules pods based on requests, not limits. When you ask for 500m CPU, the scheduler reserves 500m on a node—even if your actual usage is 50m.
On a 2-vCPU node (2000m total), here's the difference:
Guaranteed approach (requests = limits):
- Application: 500m CPU
- PostgreSQL: 500m CPU
- Memcached: 100m CPU
- Mcrouter: 100m CPU
- System overhead: ~200m
- Total reserved: 1400m
- Remaining for scheduling: 600m
Our approach (low CPU requests):
- Application: 100m CPU
- PostgreSQL: 100m CPU
- Memcached: 20m CPU
- Mcrouter: 30m CPU
- System overhead: ~200m
- Total reserved: 450m
- Remaining for scheduling: 1550m
The second approach leaves 2.5x more CPU available for additional pods or for cluster autoscaler decisions. When actual usage spikes, workloads burst into the remaining capacity. When usage is low (most of the time), we're not paying for idle reservation.
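The "~200m system overhead" line item comes from kubelet reservations: capacity minus these is the "allocatable" pool the scheduler subtracts requests from. On managed Kapsule the provider sets them, but on self-managed kubelets they live in the KubeletConfiguration (values below are illustrative, not Scaleway's):

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Node capacity minus these reservations is what the scheduler
# sees as "allocatable" when placing pods
kubeReserved:
  cpu: "100m"
  memory: "256Mi"
systemReserved:
  cpu: "100m"
  memory: "256Mi"
```

`kubectl describe node` shows the resulting Capacity and Allocatable figures for any node.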
Pod Anti-Affinity for High Availability
Running on two nodes creates an obvious concern: what happens if one node fails? We use pod anti-affinity to spread replicas:
affinity:
  podAntiAffinity:
    preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 100
        podAffinityTerm:
          labelSelector:
            matchLabels:
              app: clouds-of-europe-app
          topologyKey: kubernetes.io/hostname
This tells Kubernetes: "prefer to schedule application pods on different nodes." If we run 2 replicas, they'll land on separate nodes. If one node fails, one replica survives.
We use preferredDuringSchedulingIgnoredDuringExecution rather than required because we want flexibility. With a hard requirement, losing a node leaves the displaced replica stuck in Pending until a second node is healthy again; with a preference, it simply reschedules next to the survivor. We'd rather run 2 replicas on one node than 1 replica plus 1 that refuses to schedule.
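Anti-affinity handles unplanned node failures, but it doesn't stop a voluntary drain (say, during a node pool upgrade) from evicting both replicas at once. A PodDisruptionBudget covers that gap; a minimal sketch, assuming the same `app` label as the deployment above:

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: clouds-of-europe-app
spec:
  minAvailable: 1  # a drain may never take down both replicas at once
  selector:
    matchLabels:
      app: clouds-of-europe-app
```

With this in place, `kubectl drain` evicts one replica, waits for its replacement to become Ready on the other node, and only then evicts the second.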
For PostgreSQL, CloudNativePG handles this automatically:
affinity:
  enablePodAntiAffinity: true
  topologyKey: "topology.kubernetes.io/zone"
  podAntiAffinityType: "preferred"
The primary and replica are scheduled in different availability zones when possible. If the zone with the primary fails, the replica promotes automatically.
CloudNativePG: Database HA Without the Complexity
Speaking of PostgreSQL, CloudNativePG deserves mention. It's a Kubernetes operator that manages PostgreSQL clusters with:
- Automatic failover (primary dies, replica promotes)
- Continuous WAL archiving to S3
- Point-in-time recovery
- Zero-downtime updates via switchover
Our configuration:
spec:
  instances: 2  # Primary + sync replica
  primaryUpdateStrategy: unsupervised
  primaryUpdateMethod: switchover
  backup:
    retentionPolicy: "30d"
    barmanObjectStore:
      destinationPath: "s3://coe-prod-pgbackups/backups"
      s3Credentials:
        accessKeyId:
          name: postgres-s3-credentials
          key: ACCESS_KEY_ID
        secretAccessKey:
          name: postgres-s3-credentials
          key: SECRET_ACCESS_KEY
With synchronous replication enabled, every committed transaction exists on two instances before acknowledgment. If the primary fails, we lose zero transactions. WAL archiving to Scaleway S3 means we can recover to any point in the last 30 days.
This is genuine production resilience—not "we have backups somewhere" but "we can recover to any second of the last month with zero data loss."
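The spec above shows backups but not the synchronous replication itself, which is a separate CloudNativePG setting. A minimal sketch of how it might be enabled (our exact values may differ):

```yaml
spec:
  instances: 2
  # Require the standby to acknowledge each commit before the primary
  # reports success to the client -- zero data loss on failover
  minSyncReplicas: 1
  maxSyncReplicas: 1
```

The trade-off to understand before enabling this: if the sync replica is down, commits on the primary can block until it returns. That is the price of a zero-data-loss guarantee.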
When Not to Do This
This approach has limits. Here's when you should use larger clusters and Guaranteed QoS:
Latency-sensitive workloads: If your application has strict P99 latency requirements, CPU throttling during contention is unacceptable. Set CPU requests = limits.
Multi-tenant clusters: If untrusted workloads share your cluster, burstable pods can be starved by noisy neighbors. Guaranteed QoS provides isolation.
Regulated environments: Some compliance frameworks require dedicated resources. Check your requirements.
Databases with heavy write loads: Our PostgreSQL handles modest traffic. If you're doing thousands of writes per second, don't skimp on CPU.
The pattern works for early-stage products with variable traffic, where cost efficiency matters and occasional CPU throttling is acceptable.
Monitoring the Trade-offs
We watch several metrics to ensure our lean configuration isn't causing problems:
CPU throttling: container_cpu_cfs_throttled_seconds_total shows when containers hit their CPU limits. Some throttling is expected; sustained throttling means limits are too low.
Memory pressure: container_memory_working_set_bytes vs limits. If working set approaches limits consistently, increase memory before OOM kills happen.
Node pressure: kube_node_status_condition{condition="MemoryPressure"} and DiskPressure. If nodes are under pressure, workloads risk eviction.
Pod evictions: kube_pod_status_reason{reason="Evicted"}. Any eviction of our workloads indicates the configuration is too aggressive.
In practice, we've seen zero evictions and minimal CPU throttling. The bursty nature of web applications means contention is rare—traffic spikes don't perfectly align across all pods.
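These checks can be codified as alerts rather than dashboards you remember to look at. A hedged sketch as a PrometheusRule (assumes a Prometheus Operator setup such as kube-prometheus-stack; thresholds are illustrative):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: lean-cluster-alerts
spec:
  groups:
    - name: resource-pressure
      rules:
        - alert: SustainedCPUThrottling
          # More than 25% of CFS periods throttled for 15 minutes:
          # limits are too low for this workload
          expr: >
            rate(container_cpu_cfs_throttled_periods_total[5m])
            / rate(container_cpu_cfs_periods_total[5m]) > 0.25
          for: 15m
        - alert: MemoryNearLimit
          # Working set above 90% of the container limit: raise the
          # limit before the OOM killer does it for you
          expr: >
            container_memory_working_set_bytes
            / container_spec_memory_limit_bytes > 0.9
          for: 10m
```

Note that containers without a memory limit report `container_spec_memory_limit_bytes` as 0, so in practice the second expression needs a label filter restricting it to your own workloads.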
The European Context
Why does this matter specifically for European startups?
European digital sovereignty isn't just about data location; it's about building technology that serves European values. Efficient, sustainable, appropriately-sized infrastructure fits that ethos better than over-provisioned clusters burning electricity.
Key Takeaways
Memory and CPU are different. Set memory requests = limits for guaranteed memory. Set CPU requests low with higher limits for efficient scheduling with burst capacity.
Understand QoS classes. Guaranteed isn't always better. Burstable with guaranteed memory gives you predictable stability with scheduling efficiency.
Use pod anti-affinity. On small clusters, spreading pods across nodes is essential for availability. Make it preferred, not required, for scheduling flexibility.
CloudNativePG is production-ready. You can run PostgreSQL on Kubernetes with proper HA, automated failover, and point-in-time recovery. S3 backups make it affordable.
Monitor the trade-offs. Watch throttling, memory pressure, and evictions. The configuration that works at low traffic might need adjustment as you grow.
Right-size for your reality. The Kubernetes community's defaults assume enterprise scale. Question every resource configuration against your actual needs.
This article documents work done on the Clouds of Europe platform in January 2026.