
Building European Observability: Multi-Location Synthetic Monitoring with GitOps

How we implemented synthetic monitoring across Paris, Amsterdam, and Warsaw using Grafana Operator CRDs and GitOps principles: open source, independent, and proudly European.

By Jurg van Vliet


When your platform serves a European audience, monitoring from US-East-1 tells you almost nothing useful. Network latency, regional routing policies, and even DNS resolution behave differently when your traffic doesn't cross the Atlantic. We learned this the hard way when our synthetic probes showed everything was fine while actual users in France reported occasional slow page loads.

This article walks through how we built multi-location synthetic monitoring for Clouds of Europe—probing our endpoints every 30 seconds from Paris, Amsterdam, and Warsaw. But the interesting part isn't the monitoring itself. It's how we built it: infrastructure-as-code with Grafana Operator CRDs, consensus-based alerting that reduces false positives, and a testing procedure that uses Envoy Gateway SecurityPolicies to simulate real outages.

Why European Vantage Points Matter

Most managed monitoring services run their probes from a handful of locations, often concentrated in North America. This creates a blind spot: you're measuring performance through a lens that doesn't match your users' experience.

For a European digital independence platform, this mismatch is both technical and philosophical. If your mission is European independence from US cloud infrastructure, it's incongruous to rely on American probe servers to tell you whether your site is up.

We use Heystaq, a European managed observability platform built by Aknostic on Mimir, Loki and Tempo, which runs Blackbox Exporter instances in three European cities:

Location             Purpose
Paris (fr-par)       Western Europe, major internet exchange
Amsterdam (nl-ams)   Northern Europe, AMS-IX proximity
Warsaw (pl-waw)      Eastern Europe, emerging tech hub

This geographic distribution means we detect region-specific issues—a BGP misconfiguration affecting German traffic, a Scaleway datacenter hiccup in Paris—before they become widespread user complaints.

The GitOps Pattern: Synthetic Targets as Configuration

Traditional monitoring setups involve clicking through web UIs to add probe targets. This works until you need to reproduce your configuration after a disaster, audit who changed what, or promote changes through test environments.

We define our synthetic targets in a ConfigMap that lives in Git:

apiVersion: v1
kind: ConfigMap
metadata:
  name: synthetic-targets-clouds-of-europe
  labels:
    heystaq.com/synthetic-targets: "true"
    heystaq.com/tenant: "clouds-of-europe"
data:
  targets.yaml: |
    - targets: ["placeholder"]
      labels:
        name: "coe-homepage"
        address: "https://clouds-of-europe.eu"
        module: "http_2xx"
        environment: "production"
        priority: "critical"
        __scrape_interval__: "30s"

    - targets: ["placeholder"]
      labels:
        name: "coe-health-api"
        address: "https://clouds-of-europe.eu/api/health"
        module: "http_2xx"
        environment: "production"
        priority: "critical"
        __scrape_interval__: "30s"

    - targets: ["placeholder"]
      labels:
        name: "coe-content-api"
        address: "https://clouds-of-europe.eu/api/content/cached?limit=1"
        module: "http_2xx"
        environment: "production"
        priority: "warning"
        __scrape_interval__: "30s"
        __probe_timeout__: "15s"

The placeholder in targets is a Heystaq convention—the actual target URL comes from the address label. What matters here is that every configuration decision is visible: the 30-second scrape interval, the 15-second timeout for the content API, the priority labels that determine alerting severity.
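Under the hood this follows the standard Blackbox Exporter scrape pattern, where relabeling turns the address label into the probe parameter. The actual scrape configuration is managed on the Heystaq side; a minimal sketch of what such relabeling typically looks like (the exporter address and the instance mapping are assumptions for illustration):

```yaml
# Illustrative only: the real scrape config lives with the Heystaq probes.
relabel_configs:
  - source_labels: [address]
    target_label: __param_target          # URL the exporter should probe
  - source_labels: [module]
    target_label: __param_module          # e.g. http_2xx
  - source_labels: [__param_target]
    target_label: instance                # keep the probed URL as the instance label
  - target_label: __address__
    replacement: blackbox-exporter:9115   # scrape the exporter, not the target
```

This is why the "placeholder" target is harmless: the value in targets is overwritten before the probe ever runs.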

When we need to add a new endpoint to monitor, we edit this file, commit it, and push. Flux picks up the change and applies it. No clicking, no screenshots of "how to configure monitoring," no wondering if production matches staging.

Debugging the Paris DNS Problem

A few weeks after deploying this setup, we noticed intermittent probe failures specifically from Paris. The homepage and health API were fine, but the content API endpoint was failing roughly 10% of the time—always from the French probe.

The metrics told the story: probe_duration_seconds for successful probes was occasionally spiking to 9-10 seconds. Our default 10-second timeout wasn't leaving any margin.

Digging deeper, we found the culprit: DNS resolution. The content API endpoint's request chain triggered additional DNS lookups, and from Paris those lookups occasionally took 5+ seconds, possibly due to resolver congestion or routing to a distant upstream server.
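Blackbox Exporter breaks each probe down by phase, which is what let us isolate DNS rather than guessing. Queries along these lines surface the problem (the metric names are standard Blackbox Exporter ones; the probe_location label is an assumption about how the probes are tagged):

```promql
# DNS lookup time per probe location for the affected endpoint
avg by (probe_location) (probe_dns_lookup_time_seconds{target_name="coe-content-api"})

# Total probe duration, to see how much of it DNS accounts for
avg by (probe_location) (probe_duration_seconds{target_name="coe-content-api"})
```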

The fix was simple once we understood it:

__probe_timeout__: "15s"

This is the kind of regional quirk that US-based monitoring would never catch. Your American probes see 200ms DNS resolution consistently. Your French users see 5-second spikes that make your site feel broken.

Consensus-Based Alerting

The most valuable insight from multi-location monitoring isn't "is my site up?" but "is my site up for most users?" A single failed probe might indicate a problem with the probe itself, a regional network issue, or a transient blip.

Our alerting rules use consensus: we only fire critical alerts when the majority of locations report failure.

- uid: af9j72o3pay2ob
  title: "EndpointDown"
  condition: C
  for: 1m
  labels:
    severity: "critical"
  annotations:
    description: "{{ $labels.instance }} is unreachable from majority of probe locations - likely service outage."
    summary: "{{ $labels.instance }} down - less than 50% of locations can reach it"
  data:
    - refId: A
      model:
        expr: avg by (instance, target_name, priority) (probe_success)
    # refIds B (reduce) and C (threshold) are omitted here for brevity;
    # condition C fires when the averaged success rate drops below 0.5

The PromQL expression avg by (instance, target_name, priority) (probe_success) computes the average success rate across all probe locations for each target. When this drops below 50%, we know the problem isn't localized: something is genuinely wrong.

We also have a "LocationDown" alert for single-location failures, but it's severity "warning" rather than "critical." This gives us visibility into regional issues without waking someone at 3am for a Paris-specific network blip.
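The consensus arithmetic itself is trivial, which is the point. A minimal sketch in Python (the thresholds mirror the two rules above; the function and location names are illustrative, not part of the actual alerting stack):

```python
def consensus_state(results: dict[str, bool]) -> str:
    """Map per-location probe results to an alert severity.

    Mirrors the alerting logic described above: critical when the
    average success rate across locations drops below 50%, warning
    when any single location fails, ok otherwise.
    """
    success_rate = sum(results.values()) / len(results)
    if success_rate < 0.5:
        return "critical"   # majority of locations see a failure
    if success_rate < 1.0:
        return "warning"    # localized failure, e.g. one probe down
    return "ok"

print(consensus_state({"fr-par": False, "nl-ams": True, "pl-waw": True}))   # warning
print(consensus_state({"fr-par": False, "nl-ams": False, "pl-waw": True}))  # critical
```

With three locations, one failing probe yields a 67% success rate (warning), two failing probes yield 33% (critical).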

Grafana Operator CRDs: Alerts as Code

The traditional way to configure Grafana alerts involves the web UI: clicking through forms, hoping you don't fat-finger a threshold, and having no record of what changed. For a GitOps platform, this is untenable.

We converted our entire alerting configuration to Grafana Operator Custom Resource Definitions. Here's what a synthetic monitoring alert group looks like:

apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaAlertRuleGroup
metadata:
  name: synthetic-monitoring
  namespace: org-clouds-of-europe
spec:
  instanceSelector:
    matchLabels:
      org: clouds-of-europe
  folderRef: synthetic-monitoring
  interval: 1m
  rules:
    - uid: cf9j72llgh0cgf
      title: "LocationDown"
      condition: C
      for: 2m
      labels:
        severity: "warning"
      annotations:
        runbook_url: "https://github.com/..."

This transformation was significant: 17 alert rules converted to 5 GrafanaAlertRuleGroup resources, 7 dashboards to GrafanaDashboard CRDs, plus notification policies and contact points. Everything lives in Git, everything is versioned, everything is reproducible.
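All of these resources are wired into Flux through an ordinary kustomization; the file names below are illustrative, not our actual layout:

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  - alert-rule-groups.yaml      # 5 GrafanaAlertRuleGroup resources
  - dashboards.yaml             # 7 GrafanaDashboard resources
  - notification-policies.yaml
  - contact-points.yaml
```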

The runbook_url annotation is key. Every alert links directly to documentation explaining what the alert means and how to respond. When the pager goes off at 2am, you're not guessing—you're following a procedure.

Testing Alerts Without Breaking Production

How do you test your alerting pipeline without actually taking down production? We developed a procedure using Envoy Gateway SecurityPolicies.

The idea is simple: block the probe IP addresses temporarily, verify alerts fire, then remove the block. But the implementation requires care—you don't want Flux to reconcile your temporary change away.

# Suspend Flux first to prevent auto-reconciliation
flux suspend helmrelease clouds-of-europe -n flux-system

# Block the 3 probe IPs
cat <<EOF | kubectl apply -f -
apiVersion: gateway.envoyproxy.io/v1alpha1
kind: SecurityPolicy
metadata:
  name: block-probe-ips
  namespace: clouds-of-europe
spec:
  targetRefs:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      name: clouds-of-europe-app
  authorization:
    defaultAction: Allow
    rules:
      - action: Deny
        principal:
          clientCIDRs:
            - "51.15.135.156/32"   # Paris
            - "51.15.61.219/32"   # Amsterdam
            - "151.115.33.39/32"  # Warsaw
EOF

# Wait 2-3 minutes for alerts to fire
# Verify in Grafana and your notification channel

The probe IPs are documented (51.15.135.156 for Paris, 51.15.61.219 for Amsterdam, 151.115.33.39 for Warsaw), so you know exactly what to block. After verifying alerts fired correctly:

# Remove the SecurityPolicy
kubectl delete securitypolicy block-probe-ips -n clouds-of-europe

# Resume Flux
flux resume helmrelease clouds-of-europe -n flux-system

The SecurityPolicy successfully blocked all three probe IPs, the Blackbox Exporter dashboard showed DOWN status, alerts fired and routed to GoAlert, and restoration was clean. Now we know the entire pipeline works—not just in theory, but verified in production.
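During the drill, a quick way to confirm that all three locations see the outage is to count failing series per target. A sketch (target_name comes from the ConfigMap labels shown earlier; expect a value of 3 while the block is active):

```promql
# Number of probe locations currently failing for each target
count by (target_name) (probe_success == 0)
```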

The Path to Self-Hosted Observability

A question worth addressing: if this is about European digital independence, why use a managed service at all? Why not run your own Prometheus, Mimir, Loki, and Grafana?

The honest answer: we (Aknostic) ran exactly that stack, and extracted it into a shared platform because it proved useful for our customers as well. Observability infrastructure is operationally intensive. The OOM kills, the storage calculations, the retention policies, the upgrade treadmill: these consume engineering time that early-stage platforms should spend on their actual product. Sharing these resources limits waste (#lessismore).

We called the result Heystaq, and it runs entirely on European infrastructure. Our metrics never leave EU jurisdiction. But more importantly, we're using this managed phase to learn the patterns:

  • Which dashboards do we actually look at?
  • What alert thresholds make sense for our traffic patterns?
  • How do we want to structure our PromQL queries?

By defining everything as Grafana Operator CRDs and synthetic target ConfigMaps now, we're building infrastructure-as-code that will work identically when we eventually migrate to a self-hosted stack. The CRDs are portable. The ConfigMaps are portable. The runbooks document real operational experience.

This is the European way: pragmatic sovereignty. Use managed services from European providers while building the skills and patterns for eventual independence. Don't let perfect be the enemy of operational.

Key Takeaways

Monitor from where your users are. European platforms need European probes. Regional performance issues are invisible to global monitoring services.

Treat monitoring configuration as code. Synthetic targets in ConfigMaps, alerts as Grafana Operator CRDs, everything in Git. This isn't just about reproducibility—it's about making changes reviewable and auditable.

Use consensus for critical alerts. Single-location failures shouldn't wake you up. Multiple locations agreeing something is wrong should.

Test your alerting pipeline. A fire drill that proves alerts actually reach the right people is worth the brief production disruption.

Build for portability. Even if you're using managed services today, structure your configuration so it can move tomorrow. Grafana Operator CRDs work the same whether your Grafana is managed or self-hosted.


This article documents work done on the Clouds of Europe platform in January 2026.