# Writing Runbooks: Our Alert-Driven Documentation Approach
Every alert links to a runbook. Every runbook was written after an incident. Here's the system.
By Jurg van Vliet
Documentation has a reputation problem. Teams write comprehensive operational guides during calm periods, then never look at them again. When something actually breaks at 2am, the runbook is outdated, doesn't cover the actual failure mode, or requires fifteen minutes of reading before you find the relevant command.
Our runbooks are different. Each one exists because an alert fired and someone had to figure out what to do. They're not aspirational documentation about what might happen—they're battle-tested procedures for what did happen.
This article shares our approach: the template we use, how alerts link directly to runbooks, and the culture that keeps runbooks useful rather than decorative.
## The Core Insight: Runbooks Are Executable Procedures
The difference between documentation and a runbook is the verb mood.
Documentation describes: "The PostgreSQL cluster uses CloudNativePG for high availability. Replication is configured synchronously. Failover happens automatically when the primary becomes unavailable."
A runbook instructs: "Check cluster status: kubectl get cluster postgres-cluster -n clouds-of-europe. If replicas show 0, check pod events for scheduling failures. If PVC is stuck in zone-locked state, delete the PVC and pod—CNPG will recreate them."
Documentation tells you how things work. Runbooks tell you what to do when things don't work.
This distinction matters at 2am when your phone is buzzing and your brain is operating at 60% capacity. You don't need architecture explanations. You need commands to copy-paste, decisions to make, and escalation paths when you're stuck.
## The Template
Every runbook follows the same structure:
```markdown
# Runbook: AlertName

## Alert Details

| Alert | Severity | Pending | Threshold |
|-------|----------|---------|-----------|
| AlertName | critical | 5m | Description of trigger condition |

**Data Source:** Where the metric comes from

## Quick Diagnosis

1. First thing to check (with link to dashboard)
2. Second thing to check (with command)
3. Third thing to check (with log query)

## Common Causes

### 1. Most Likely Cause

**Symptoms:** What you'll observe
**Check:** Command or query to confirm
**Fix:** Exact steps to resolve

### 2. Second Most Likely Cause

...

## Emergency Procedures

What to do if things are critically broken and you need to restore service immediately, even with temporary fixes.

## Verification

How to confirm the issue is actually resolved.

## Escalation

When to page someone else, and who.

## Related Alerts

Other alerts that often fire together or share root causes.
```
This structure is deliberate:
**Alert Details** at the top lets responders confirm they're looking at the right runbook. The threshold reminds them what triggered the alert.

**Quick Diagnosis** gives three fast checks to understand the situation. These should take under two minutes total.

**Common Causes** lists the actual reasons this alert has fired historically, ordered by frequency. The first cause listed is what you should check first.

**Emergency Procedures** provides the "break glass" options when you need to restore service immediately, even imperfectly.

**Verification** prevents premature closure. The alert might stop firing, but is the underlying issue actually fixed?

**Related Alerts** helps responders understand cascading failures. If PostgreSQL is down, expect application errors too.
## Real Examples

### PostgreSQL No Replica
This runbook was written after we woke up to find our production database running without a replica—HA completely degraded without anyone noticing until the alert fired.
````markdown
## Alert Details

| Alert | Severity | Pending | Threshold |
|-------|----------|---------|-----------|
| PostgreSQL No Replica | critical | 5m | Streaming replicas < expected |

## Quick Diagnosis

1. **Check PostgreSQL Cluster Status**

   ```bash
   kubectl get cluster postgres-cluster -n clouds-of-europe -o yaml | grep -A 20 "status:"
   ```

2. **Check Pod Status**

   ```bash
   kubectl get pods -n clouds-of-europe -l cnpg.io/cluster=postgres-cluster -o wide
   ```

3. **Check Pod Events**

   ```bash
   kubectl get events -n clouds-of-europe --field-selector involvedObject.name=postgres-cluster-2
   ```

## Common Causes

### 1. Replica Pod Pending - PVC Zone Affinity

**Symptoms:** Pod pending, PVC bound to a node in an unavailable zone

**Resolution:**

1. Delete the stuck PVC:

   ```bash
   kubectl delete pvc postgres-cluster-2 -n clouds-of-europe
   ```

2. Delete the pending pod:

   ```bash
   kubectl delete pod postgres-cluster-2 -n clouds-of-europe
   ```

3. CNPG will recreate both in an available zone
````
The PVC zone affinity issue was the actual cause when this alert first fired. Scaleway's block storage is zone-locked—a PVC created in `fr-par-1` can only attach to nodes in `fr-par-1`. When autoscaling removed that node, the replica couldn't reschedule.
The fix is counterintuitive (delete the PVC!) but correct. Without this runbook, responders would spend thirty minutes trying to figure out why the pod won't schedule.
### App Heap Memory High
This runbook came from a memory leak that crept in after a dependency update. We watched heap usage climb over hours until the alert fired.
````markdown
## Quick Diagnosis

1. **Check Application Dashboard**
   - Look at "Heap Usage" panel in Node.js Runtime section
   - Check "Process Memory" for RSS trends

2. **Check Pod Resource Usage**

   ```bash
   kubectl top pods -n clouds-of-europe
   ```

3. **Check Application Logs for Memory Warnings**

   ```
   {namespace="clouds-of-europe"} |~ "(?i)(memory|heap|oom)"
   ```

## Common Causes

### 1. Memory Leak

**Symptoms:** Steady increase in heap over time, never decreasing

**Fix:**

- Rollback recent deployment if leak started after deploy
- Restart pods as temporary mitigation:

  ```bash
  kubectl rollout restart deployment/clouds-of-europe-app -n clouds-of-europe
  ```

### 2. High Traffic / Load

**Symptoms:** Heap spikes correlate with request rate

**Fix:**

- Scale up replicas to distribute load:

  ```bash
  kubectl scale deployment/clouds-of-europe-app -n clouds-of-europe --replicas=3
  ```
````
The "restart pods as temporary mitigation" note is important. It's not a real fix—the leak is still there. But at 2am, restoring service matters more than root cause analysis. The runbook acknowledges this pragmatism while making clear it's temporary.
### Flux Reconciliation Failing
This runbook is longer because Flux failures have many possible causes. GitOps is powerful, but when the reconciliation loop breaks, deployments stop.
````markdown
## Common Causes

### 1. Git Repository Access Issues

**Symptoms:** GitRepository source not ready

**Check:**

```bash
flux get sources git -A
kubectl describe gitrepository flux-system -n flux-system
```

### 2. Invalid Manifests / Helm Values

**Symptoms:** Kustomization or HelmRelease fails to apply

**Check:**

```bash
flux logs --level=error
kubectl get events -n flux-system --sort-by='.lastTimestamp'
```

**Fix:**

- Review recent Git commits for syntax errors
- Validate manifests locally:

  ```bash
  kustomize build gitops/infrastructure/production
  ```

## Emergency Procedures

### If Production Deployment Blocked

1. Check if issue is blocking production changes:

   ```bash
   flux get kustomization clouds-of-europe-app -n flux-system
   ```

2. If urgent, apply manually (temporary):

   ```bash
   # Only for emergencies - GitOps will reconcile later
   kubectl apply -f <manifest>
   ```

3. Document manual intervention for later cleanup
````
The "apply manually" instruction is deliberately marked as emergency-only. It breaks the GitOps principle, but sometimes you need to ship a hotfix *now*. The runbook permits this while emphasizing it's an exception.
## Linking Alerts to Runbooks
Every Grafana alert includes a `runbook_url` annotation:
```yaml
- uid: production-postgres-no-replica
  title: "Production PostgreSQL No Replica"
  annotations:
    description: "Production PostgreSQL cluster has fewer streaming replicas than expected."
    runbook_url: "https://github.com/aknostic/clouds-of-europe/blob/main/docs/runbooks/postgres-no-replica.md"
    summary: "Production PostgreSQL replica missing"
```
When the alert fires, the runbook URL appears in the notification. In Grafana's alert UI, it's a clickable link. In Slack or SMS notifications, it's visible text.
This removes the "where's the runbook?" friction. The alert tells you something is wrong and where to find help. You're one click from the exact procedure you need.
We use full GitHub URLs rather than relative paths because notifications go to multiple channels—Slack, SMS, email. An absolute URL works everywhere.
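Because the alert rules and the runbooks live in the same repository, broken `runbook_url` links can be caught in CI rather than during an incident. Here's a minimal sketch; the `gitops/monitoring/alerts` path, the regex, and the function name are illustrative assumptions, not our actual repo layout:

```python
import re
from pathlib import Path

# Assumed layout (adjust to your repo): alert rule YAML under one
# directory, runbook markdown files under docs/runbooks/.
RUNBOOK_URL_RE = re.compile(
    r'runbook_url:\s*"[^"]*/docs/runbooks/([\w-]+\.md)"'
)

def find_broken_runbook_links(alerts_dir: Path, runbooks_dir: Path) -> list[str]:
    """Return runbook filenames referenced by alert rules but missing on disk."""
    missing = []
    for rule_file in alerts_dir.glob("**/*.yaml"):
        for name in RUNBOOK_URL_RE.findall(rule_file.read_text()):
            if not (runbooks_dir / name).exists():
                missing.append(name)
    return missing
```

Wired into CI, this fails the build when someone renames a runbook without updating the alert annotation, or adds an alert whose runbook doesn't exist yet.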
## The Alert-Runbook Lifecycle
Here's how runbooks actually get written:
1. **Alert fires for the first time.** Someone responds, figures out what's wrong, fixes it.
2. **After the incident, that person writes a runbook.** Not during—after. The immediate priority is restoration. Documentation happens when you're calm.
3. **The runbook goes in Git**, in `docs/runbooks/`, named after the alert.
4. **The alert configuration gets updated** to include `runbook_url` pointing to the new runbook.
5. **Next time the alert fires, the responder follows the runbook.** If it's incomplete or wrong, they update it.
This is the key: runbooks evolve through use. The first version captures what worked once. Subsequent incidents refine it. After three or four firings, the runbook covers the common cases well.
Runbooks written in advance, before any incident, are speculation. They might be helpful. They might be wrong. You won't know until something breaks. Our approach ensures every runbook has been tested in production conditions at least once.
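One way to lower the friction of the post-incident step is a tiny scaffold generator that emits our template for a new alert, so the responder only fills in what they learned. A sketch; the function name is ours, and the template mirrors the structure shown earlier:

```python
def scaffold_runbook(alert_name: str, severity: str, pending: str, threshold: str) -> str:
    """Emit an empty runbook following the standard template, ready to fill in."""
    return f"""# Runbook: {alert_name}

## Alert Details

| Alert | Severity | Pending | Threshold |
|-------|----------|---------|-----------|
| {alert_name} | {severity} | {pending} | {threshold} |

## Quick Diagnosis

1. TODO: first check
2. TODO: second check
3. TODO: third check

## Common Causes

### 1. TODO: most likely cause

**Symptoms:** TODO
**Check:** TODO
**Fix:** TODO

## Emergency Procedures

TODO

## Verification

TODO

## Escalation

TODO

## Related Alerts

TODO
"""
```

Run it during the incident review, fill in the TODOs while the details are fresh, and commit the result together with the `runbook_url` change to the alert rule.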
## The Threshold Update Story
A small commit tells a bigger story:
```
commit a720e3d9
Update heap memory alert threshold to 90% in runbook
```
The alert threshold was changed from 80% to 90% in Grafana. The runbook still said 80%. Someone noticed during an incident—the runbook said the alert fires at 80%, but it had actually fired at 90%.
This kind of drift is inevitable. Alert thresholds get tuned. Runbooks get stale. The fix is simple: when you notice a discrepancy, fix it immediately. The commit took thirty seconds.
The underlying lesson: runbooks are code. They live in Git. They get reviewed and merged like any other change. When the system changes, the runbook changes too.
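That kind of drift is also mechanically detectable. A rough sketch of a check that compares percentage values mentioned in a runbook against the threshold configured in the alert rule (the function and its naive percent-matching are illustrative assumptions, not something we run):

```python
import re

def threshold_drift(runbook_text: str, alert_threshold_pct: int) -> list[int]:
    """Return percentages mentioned in the runbook that differ from the alert config.

    Deliberately naive: any "NN%" in the text is treated as a threshold
    mention, so a human still reviews the flagged values.
    """
    mentioned = {int(m) for m in re.findall(r"(\d+)%", runbook_text)}
    return sorted(p for p in mentioned if p != alert_threshold_pct)

runbook_row = "| AppHeapMemoryHigh | warning | 5m | Heap usage above 80% for 5 minutes |"
drift = threshold_drift(runbook_row, alert_threshold_pct=90)
# drift == [80]: the runbook still says 80% while the alert now fires at 90%
```

A check like this won't understand context, but as a CI warning it turns a "someone noticed during an incident" discovery into a pull-request comment.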
## Synthetic Monitoring Test Procedure
One runbook is unusual: it documents how to intentionally trigger alerts.
## Testing Production Alerts
To test synthetic monitoring alerts, block the probe IPs using an Envoy Gateway SecurityPolicy:
```bash
# Suspend Flux first to prevent auto-reconciliation
flux suspend helmrelease clouds-of-europe -n flux-system

# Block the 3 probe IPs
cat <<EOF | kubectl apply -f -
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: block-probe-ips
  namespace: clouds-of-europe
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: clouds-of-europe-app
  authorization:
    defaultAction: Allow
    rules:
      - action: Deny
        principal:
          clientCIDRs:
            - "51.15.135.156/32"  # Paris
            - "51.15.61.219/32"   # Amsterdam
            - "151.115.33.39/32"  # Warsaw
EOF

# Wait 2-3 minutes for alerts to fire
```
This is a runbook for testing, not responding. But it follows the same principle: executable procedures, not explanations. Someone used this exact procedure to verify our alerting pipeline works, documented what they did, and now anyone can repeat it.
## Building the Culture
Technical systems are easy. Cultural adoption is hard. Here's what worked for us:
**Make runbook creation part of incident review.** Every incident has a follow-up action: "write or update runbook." Not optional. Not "if you have time." It's part of closing the incident.

**Keep runbooks short.** A runbook that requires ten minutes of reading won't get read. Quick Diagnosis should take two minutes. Common Causes should be skimmable. If you're writing paragraphs, you're writing documentation, not a runbook.

**Use real commands, not pseudocode.** `kubectl get pods -n clouds-of-europe -l app=myapp` is better than "check the pod status in the application namespace." The responder can copy-paste, not translate.

**Include the verification step.** "How do I know it's fixed?" is a real question. Answer it explicitly.

**Link related alerts.** Cascading failures are common. If PostgreSQL is down, the application will throw errors too. Both alerts fire. The runbooks should reference each other.

**Put runbooks in Git.** Version control, review process, history. Runbooks are operational code. Treat them that way.
## The European Angle
For European teams, runbooks have a compliance dimension. GDPR and industry regulations often require documented incident response procedures. "We have runbooks" is a legitimate answer to "how do you handle production incidents?"
But more practically: distributed European teams benefit from runbooks because they reduce reliance on tribal knowledge. The person who knows how to fix the PostgreSQL issue might be asleep in a different timezone. A good runbook means anyone on the rotation can respond effectively.
Self-documenting operations support sovereignty. You're not dependent on a vendor's support team being available during European business hours. Your team has the procedures to handle issues independently.
## What Our Runbook Collection Looks Like
After six months of this approach, we have nine runbooks:
| Runbook | Alert | Written After |
|---|---|---|
| app-high-error-rate.md | AppHighErrorRate | Deployment bug caused 500s |
| app-heap-memory-high.md | AppHeapMemoryHigh | Memory leak from dependency |
| postgres-no-replica.md | PostgreSQLNoReplica | Zone affinity PVC issue |
| postgres-replication-lag.md | PostgresReplicationLagHigh | Slow disk during backup |
| postgres-backup-stale.md | PostgresBackupStale | S3 credentials expired |
| memcached-down.md | MemcachedDown | Pod scheduling failure |
| flux-reconciliation-failing.md | FluxReconciliationFailing | Invalid Helm values |
| synthetic-monitoring.md | Multiple | Alert testing procedure |
| heystaq-alerts.md | Platform alerts | HeyStaq platform reference |
Each one represents a real incident. Each one has been used at least once since being written. None of them were written "just in case."
## Key Takeaways
**Write runbooks after incidents, not before.** Speculation is less useful than experience. Let reality inform your documentation.

**Use a consistent template.** Alert details, quick diagnosis, common causes, emergency procedures, verification, escalation, related alerts. Every runbook follows the same structure.

**Link alerts directly to runbooks.** The `runbook_url` annotation makes the runbook one click away from the alert notification.

**Keep runbooks executable.** Commands to copy-paste, not concepts to understand. The responder's brain is at 60% capacity at 2am.

**Evolve runbooks through use.** Each incident that uses a runbook should improve it. If something was missing or wrong, fix it.

**Store runbooks in Git.** Version control, review process, change history. Runbooks are operational code.

**Make runbook creation non-negotiable.** It's part of incident closure, not a nice-to-have follow-up.
*This article documents work done on the Clouds of Europe platform in January 2026.*