Centralized Monitoring: One Pane of Glass, Better Sleep
Instead of Prometheus and Grafana in every environment, we built a management cluster that monitors all clusters via remote write. Single pane of glass. Reduced resource usage. But the real win: reduced operational complexity.
By Jurg van Vliet
Published Oct 15, 2025
The Per-Environment Monitoring Problem
Standard Kubernetes monitoring: every cluster runs Prometheus, Grafana, and associated infrastructure. This is recommended practice—monitoring should be reliable even when the application is failing.
For a single cluster, this works fine. For multiple environments (development, test, production), it becomes repetitive:
- Three Prometheus instances, each scraping their own cluster
- Three Grafana instances, each with their own dashboards
- Three sets of alerting rules, which inevitably drift out of sync
- Three places to look during an incident
Resource cost is real—Prometheus and Grafana aren't lightweight. But the cognitive cost is larger. During an incident, which Grafana are you looking at? Are the dashboards identical? Is the alert configuration the same?
The Centralized Alternative
We run a management cluster—a small Kubernetes cluster whose only job is monitoring and GitOps control for other clusters.
Each workload cluster runs Prometheus Agent (not full Prometheus). The agent scrapes metrics and remote-writes them to Mimir in the management cluster. Grafana in the management cluster queries Mimir for metrics from all environments.
What this looks like:
# In each workload cluster: Prometheus Agent
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-agent-config
data:
  prometheus.yml: |
    remote_write:
      - url: https://mimir.mgmt.example.com/api/v1/push
        basic_auth:
          username: <cluster-name>
          password_file: /etc/prometheus/secrets/password
Prometheus Agent is much lighter than full Prometheus—no local storage, no query engine, just scraping and forwarding. Memory footprint drops from 2-4GB to 200-500MB.
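Agent mode is built into Prometheus itself (v2.32 and later) and is enabled with a single flag. A hedged Deployment excerpt, with an illustrative image tag and resource requests rather than our exact manifest:
# Deployment excerpt (illustrative): Prometheus running in agent mode.
spec:
  template:
    spec:
      containers:
        - name: prometheus-agent
          image: prom/prometheus:v2.53.0          # version illustrative
          args:
            - --enable-feature=agent              # agent mode: WAL + remote write only
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.agent.path=/var/lib/prometheus/wal
          resources:
            requests:
              memory: 256Mi                       # vs. multi-GB for full Prometheus
              cpu: 100m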
Resource Savings
Per cluster, we eliminated:
- Full Prometheus: ~2-4GB RAM, 2+ CPU cores
- Grafana: ~500MB RAM, 1 CPU core
- Persistent storage for metrics: 50-100GB
Across three clusters (dev, test, production), that's roughly:
- 6-12GB RAM saved
- 6-9 CPU cores freed
- 150-300GB storage eliminated
The management cluster runs Mimir (for metrics storage) and Grafana (for visualization), but that's shared across all environments. One Grafana instance serves dashboards for all clusters.
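Wiring that single Grafana to Mimir is just datasource provisioning. A sketch of what it can look like; the service URL, tenant IDs, and the use of Mimir's tenant federation are assumptions about our setup:
# Grafana datasource provisioning: one Prometheus-type datasource backed by Mimir.
apiVersion: 1
datasources:
  - name: Mimir (all clusters)
    type: prometheus
    access: proxy
    url: http://mimir-gateway.mimir.svc/prometheus   # assumed in-cluster service
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: dev|test|prod                # multi-tenant query; needs tenant federation enabled in Mimir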
Total resource reduction: roughly 60-70%. Not revolutionary, but meaningful at scale.
The Real Win: Cognitive Simplicity
During an incident at 3 AM, you're half-awake, stressed, and need answers fast. With centralized monitoring:
- One URL to remember: monitoring.example.com
- One dashboard showing all environments
- One set of alerts in one place
- Easy cross-cluster correlation (did test see this issue earlier?)
Before centralization, I'd have three browser tabs open, trying to remember which Grafana showed production. It sounds trivial. At 3 AM, it's friction you don't need.
The reduction in cognitive load is real. One interface means less mental overhead; unified dashboards mean consistent visualization across environments. In our experience, that translates into noticeably faster incident response.
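Cross-cluster correlation does depend on each agent labeling the metrics it forwards. The config shown earlier omits this; one common way to do it, assuming a cluster external label, looks like:
# Assumed addition to each agent's prometheus.yml: stamp every series
# with the environment it came from.
global:
  external_labels:
    cluster: production        # "development" / "test" in the other clusters
# A single Grafana panel can then compare environments, e.g. (job name hypothetical):
#   sum by (cluster) (rate(http_requests_total{job="api"}[5m]))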
Security: mTLS for Remote Write
Remote write means metrics leave the workload cluster and travel to the management cluster. This is cross-network traffic; it needs security.
We use mTLS (mutual TLS) for Prometheus remote write:
- Each Prometheus Agent has a client certificate
- Mimir requires valid certificates
- Certificates are issued by cert-manager from a cluster CA
- Traffic is encrypted and authenticated
Configuration example:
remote_write:
  - url: https://mimir.mgmt.example.com/api/v1/push
    tls_config:
      cert_file: /etc/prometheus/certs/tls.crt
      key_file: /etc/prometheus/certs/tls.key
      ca_file: /etc/prometheus/certs/ca.crt
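The client certificate and key referenced above come from cert-manager. A sketch of the Certificate resource, with names and namespace assumed:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: prometheus-remote-write-client
  namespace: monitoring
spec:
  secretName: prometheus-remote-write-tls    # mounted into the agent pod at /etc/prometheus/certs
  commonName: prometheus-agent.production
  usages:
    - client auth                            # client certificate, not a serving cert
  issuerRef:
    name: cluster-ca-issuer                  # hypothetical ClusterIssuer backed by the cluster CA
    kind: ClusterIssuer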
This isn't just security theater. In regulated environments, encrypting and authenticating metrics traffic is a requirement. mTLS provides both with standard tooling.
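On the receiving side, something has to verify those client certificates. One option, assuming TLS is terminated on Mimir itself rather than on an ingress in front of it, is Mimir's server TLS settings (paths assumed):
# Sketch: Mimir's HTTP server requiring valid client certificates.
server:
  http_tls_config:
    cert_file: /etc/mimir/tls/tls.crt
    key_file: /etc/mimir/tls/tls.key
    client_ca_file: /etc/mimir/tls/ca.crt
    client_auth_type: RequireAndVerifyClientCert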
Tradeoffs: Single Point of Failure
The obvious concern: if the management cluster fails, do you lose all monitoring?
Yes and no. Prometheus Agent buffers metrics locally. If remote write fails (network issue, management cluster down), the agent queues metrics and retries. Once connectivity returns, metrics backfill automatically.
How long can it buffer? That depends on available disk and the agent's WAL retention settings. We configure 1-2 hours of buffer. For most incidents, that's sufficient.
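The knobs that govern this are the remote-write queue and the agent's WAL retention. An illustrative sketch; the values are assumptions, not our exact settings:
# Buffering and retry behavior for remote write.
remote_write:
  - url: https://mimir.mgmt.example.com/api/v1/push
    queue_config:
      capacity: 10000              # samples buffered per shard before reads from the WAL block
      max_shards: 10
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5m              # keep retrying while the management cluster is unreachable
# On-disk buffering is bounded by the agent's WAL retention
# (e.g. --storage.agent.retention.max-time) and available disk.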
What if the management cluster is down longer? You still have application logs and Kubernetes events in the workload cluster. You can deploy Grafana temporarily to that cluster if needed. It's not ideal, but it's a fallback.
The tradeoff is acceptable: simplified operations 99% of the time, with a known recovery path for the 1% edge case.
Why This Matters for Small Teams
Large organizations can afford dedicated monitoring specialists. Small teams need infrastructure that reduces overhead, not increases it.
Centralized monitoring means:
- One system to learn instead of three
- One place to maintain dashboards and alerts
- Lower resource costs (matters when you're cost-conscious)
- Better incident response (matters when you're on call)
The pattern scales down effectively. Even two clusters benefit from centralization. The cognitive simplification pays for itself.
#monitoring #prometheus #grafana #centralizedinfrastructure #observability