Centralized Monitoring: One Pane of Glass, Better Sleep
Instead of Prometheus and Grafana in every environment, we built a management cluster that monitors all clusters via remote write. Single pane of glass. Reduced resource usage. But the real win: reduced operational complexity.
By Jurg van Vliet
Published Oct 15, 2025
The Per-Environment Monitoring Problem
Standard Kubernetes monitoring: every cluster runs Prometheus, Grafana, and associated infrastructure. This is recommended practice—monitoring should be reliable even when the application is failing.
For a single cluster, this works fine. For multiple environments (development, test, production), it becomes repetitive:
- Three Prometheus instances, each scraping their own cluster
- Three Grafana instances, each with their own dashboards
- Three sets of alerting rules, which inevitably drift out of sync
- Three places to look during an incident
Resource cost is real—Prometheus and Grafana aren't lightweight. But the cognitive cost is larger. During an incident, which Grafana are you looking at? Are the dashboards identical? Is the alert configuration the same?
The Centralized Alternative
We run a management cluster—a small Kubernetes cluster whose only job is monitoring and GitOps control for other clusters.
Each workload cluster runs Prometheus Agent (not full Prometheus). The agent scrapes metrics and remote-writes them to Mimir in the management cluster. Grafana in the management cluster queries Mimir for metrics from all environments.
What this looks like:
# In each workload cluster: Prometheus Agent
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-agent-config
data:
  prometheus.yml: |
    remote_write:
      - url: https://mimir.mgmt.example.com/api/v1/push
        basic_auth:
          username: <cluster-name>
          password_file: /etc/prometheus/secrets/password
Prometheus Agent is much lighter than full Prometheus—no local storage, no query engine, just scraping and forwarding. Memory footprint drops from 2-4GB to 200-500MB.
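Agent mode is built into Prometheus itself (v2.32 and later) and is enabled with a single flag. A hedged Deployment excerpt, with an illustrative image tag and resource requests rather than our exact manifest:
# Deployment excerpt (illustrative): Prometheus running in agent mode.
spec:
  template:
    spec:
      containers:
        - name: prometheus-agent
          image: prom/prometheus:v2.53.0          # version illustrative
          args:
            - --enable-feature=agent              # agent mode: WAL + remote write only
            - --config.file=/etc/prometheus/prometheus.yml
            - --storage.agent.path=/var/lib/prometheus/wal
          resources:
            requests:
              memory: 256Mi                       # vs. multi-GB for full Prometheus
              cpu: 100m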
Resource Savings
Per cluster, we eliminated:
- Full Prometheus: ~2-4GB RAM, 2+ CPU cores
- Grafana: ~500MB RAM, 1 CPU core
- Persistent storage for metrics: 50-100GB
Across three clusters (dev, test, production), that's roughly:
- 6-12GB RAM saved
- 6-9 CPU cores freed
- 150-300GB storage eliminated
The management cluster runs Mimir (for metrics storage) and Grafana (for visualization), but that's shared across all environments. One Grafana instance serves dashboards for all clusters.
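Wiring that single Grafana to Mimir is just datasource provisioning. A sketch of what it can look like; the service URL, tenant IDs, and the use of Mimir's tenant federation are assumptions about our setup:
# Grafana datasource provisioning: one Prometheus-type datasource backed by Mimir.
apiVersion: 1
datasources:
  - name: Mimir (all clusters)
    type: prometheus
    access: proxy
    url: http://mimir-gateway.mimir.svc/prometheus   # assumed in-cluster service
    jsonData:
      httpHeaderName1: X-Scope-OrgID
    secureJsonData:
      httpHeaderValue1: dev|test|prod                # multi-tenant query; needs tenant federation enabled in Mimir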
Total resource reduction: roughly 60-70%. Not revolutionary, but meaningful at scale.
The Real Win: Cognitive Simplicity
During an incident at 3 AM, you're half-awake, stressed, and need answers fast. With centralized monitoring:
- One URL to remember: monitoring.example.com
- One dashboard showing all environments
- One set of alerts in one place
- Easy cross-cluster correlation (did test see this issue earlier?)
Before centralization, I'd have three browser tabs open, trying to remember which Grafana showed production. It sounds trivial. At 3 AM, it's friction you don't need.
The reduction in cognitive load is real. One interface means less mental overhead; unified dashboards mean consistent visualization across environments. In our experience, that translates into noticeably faster incident response.
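Cross-cluster correlation does depend on each agent labeling the metrics it forwards. The config shown earlier omits this; one common way to do it, assuming a cluster external label, looks like:
# Assumed addition to each agent's prometheus.yml: stamp every series
# with the environment it came from.
global:
  external_labels:
    cluster: production        # "development" / "test" in the other clusters
# A single Grafana panel can then compare environments, e.g. (job name hypothetical):
#   sum by (cluster) (rate(http_requests_total{job="api"}[5m]))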
Security: mTLS for Remote Write
Remote write means metrics leave the workload cluster and travel to the management cluster. This is cross-network traffic; it needs security.
We use mTLS (mutual TLS) for Prometheus remote write:
- Each Prometheus Agent has a client certificate
- Mimir requires valid certificates
- Certificates are issued by cert-manager from a cluster CA
- Traffic is encrypted and authenticated
Configuration example:
remote_write:
  - url: https://mimir.mgmt.example.com/api/v1/push
    tls_config:
      cert_file: /etc/prometheus/certs/tls.crt
      key_file: /etc/prometheus/certs/tls.key
      ca_file: /etc/prometheus/certs/ca.crt
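The client certificate and key referenced above come from cert-manager. A sketch of the Certificate resource, with names and namespace assumed:
apiVersion: cert-manager.io/v1
kind: Certificate
metadata:
  name: prometheus-remote-write-client
  namespace: monitoring
spec:
  secretName: prometheus-remote-write-tls    # mounted into the agent pod at /etc/prometheus/certs
  commonName: prometheus-agent.production
  usages:
    - client auth                            # client certificate, not a serving cert
  issuerRef:
    name: cluster-ca-issuer                  # hypothetical ClusterIssuer backed by the cluster CA
    kind: ClusterIssuer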
This isn't just security theater. In regulated environments, encrypting and authenticating metrics traffic is a requirement. mTLS provides both with standard tooling.
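On the receiving side, something has to verify those client certificates. One option, assuming TLS is terminated on Mimir itself rather than on an ingress in front of it, is Mimir's server TLS settings (paths assumed):
# Sketch: Mimir's HTTP server requiring valid client certificates.
server:
  http_tls_config:
    cert_file: /etc/mimir/tls/tls.crt
    key_file: /etc/mimir/tls/tls.key
    client_ca_file: /etc/mimir/tls/ca.crt
    client_auth_type: RequireAndVerifyClientCert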
Tradeoffs: Single Point of Failure
The obvious concern: if the management cluster fails, do you lose all monitoring?
Yes and no. Prometheus Agent buffers metrics locally. If remote write fails (network issue, management cluster down), the agent queues metrics and retries. Once connectivity returns, metrics backfill automatically.
How long can it buffer? That depends on available disk and the agent's WAL retention settings. We configure 1-2 hours of buffer. For most incidents, that's sufficient.
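The knobs that govern this are the remote-write queue and the agent's WAL retention. An illustrative sketch; the values are assumptions, not our exact settings:
# Buffering and retry behavior for remote write.
remote_write:
  - url: https://mimir.mgmt.example.com/api/v1/push
    queue_config:
      capacity: 10000              # samples buffered per shard before reads from the WAL block
      max_shards: 10
      batch_send_deadline: 5s
      min_backoff: 30ms
      max_backoff: 5m              # keep retrying while the management cluster is unreachable
# On-disk buffering is bounded by the agent's WAL retention
# (e.g. --storage.agent.retention.max-time) and available disk.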
What if the management cluster is down longer? You still have application logs and Kubernetes events in the workload cluster. You can deploy Grafana temporarily to that cluster if needed. It's not ideal, but it's a fallback.
The tradeoff is acceptable: simplified operations 99% of the time, with a known recovery path for the 1% edge case.
Why This Matters for Small Teams
Large organizations can afford dedicated monitoring specialists. Small teams need infrastructure that reduces overhead, not increases it.
Centralized monitoring means:
- One system to learn instead of three
- One place to maintain dashboards and alerts
- Lower resource costs (matters when you're cost-conscious)
- Better incident response (matters when you're on call)
The pattern scales down effectively. Even two clusters benefit from centralization. The cognitive simplification pays for itself.
#monitoring #prometheus #grafana #centralizedinfrastructure #observability