Managed Observability in Europe: Why We Chose HeyStaq Over Self-Hosted Prometheus
The true cost of running your own observability stack, and when a European managed service makes more sense.
By Jurg van Vliet
We ran self-hosted Prometheus, Loki, and Grafana for three months. Then we deleted 38,000 lines of configuration and migrated to a managed service. This article isn't a vendor pitch—it's an honest accounting of what self-hosted observability actually costs, and why a European managed platform made sense for our stage.
The commit that removed our self-hosted stack deleted 84 files across Mimir, Loki, Grafana, Tempo, blackbox-exporter, and their supporting infrastructure. Each of those files represented decisions, debugging sessions, and operational knowledge accumulated over weeks. We walked away from all of it.
Here's why.
The Self-Hosted Dream
The appeal of self-hosted observability is obvious, especially for a European sovereignty platform:
- Complete control: You own your metrics, your retention policies, your access controls
- No vendor lock-in: Standard protocols (Prometheus remote_write, Loki push API) mean you can move
- Cost predictability: Compute and storage costs, not per-metric pricing
- Sovereignty by default: Data never leaves your infrastructure
We built a comprehensive stack on our management cluster:
- Mimir for long-term metrics storage (Prometheus-compatible)
- Loki for log aggregation
- Grafana for visualization and alerting
- Tempo for distributed tracing
- Prometheus Agent on each workload cluster, shipping via remote_write
- Promtail for log shipping
- Blackbox Exporter for synthetic monitoring
Each component had its own Helm chart, ConfigMaps for configuration, Secrets for credentials, ServiceMonitors for self-monitoring, and dashboards for visibility. The GitOps repository grew accordingly.
The Hidden Costs
What the architecture diagrams don't show is the operational tax.
Memory Tuning
Mimir and Loki are memory-hungry applications. Not in the "allocate 4GB and forget it" way, but in the "tune your ingester batch sizes, query concurrency, and compaction schedules or watch OOM kills" way.
Our Mimir ingesters would periodically get OOM-killed during compaction. The solution wasn't more memory—it was understanding the interaction between:
- `-ingester.max-global-series-per-user`
- `-blocks-storage.bucket-store.max-chunk-pool-bytes`
- `-server.grpc-max-recv-msg-size-bytes`
- Pod memory limits
Getting this right took multiple iterations, each requiring a deployment, observation period, and adjustment.
Storage Math
Prometheus metrics aren't small. A typical Kubernetes cluster generates thousands of time series. Each series has samples at your scrape interval (usually 15-30 seconds). Retention compounds quickly.
Our calculations:
- ~50,000 active series across both clusters
- 15-second scrape interval
- 8 bytes per sample (timestamp + value)
- 30-day retention
That's roughly 70GB of raw data, before accounting for indexes and metadata. Mimir's block storage format is efficient, but you're still looking at significant S3 costs and query latency for historical data.
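The arithmetic behind that estimate is simple enough to sketch, using the figures listed above (raw samples only; indexes, metadata, and replication come on top):

```python
# Back-of-the-envelope storage estimate for raw Prometheus samples,
# using the figures from our setup above.
active_series = 50_000
scrape_interval_s = 15
bytes_per_sample = 8          # timestamp + value, pre-compression
retention_days = 30

samples_per_series_per_day = 86_400 // scrape_interval_s   # 5,760
raw_bytes = active_series * samples_per_series_per_day * bytes_per_sample * retention_days

print(f"{raw_bytes / 1e9:.0f} GB raw")  # prints "69 GB raw"
```

Mimir's compressed block format stores far less than 8 bytes per sample in practice, but the raw figure is the right starting point for capacity planning.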
We spent more time on retention policies than we'd like to admit. Which metrics warrant 30 days? Which can be downsampled? Which can be dropped after a week? These questions don't have universal answers—they depend on your debugging patterns and compliance requirements.
The Upgrade Treadmill
Observability components release frequently. Mimir, Loki, and Grafana each have their own release cadence, their own breaking changes, their own deprecation timelines.
Staying current isn't optional. Security patches, performance improvements, and bug fixes matter for infrastructure you depend on. But each upgrade requires:
- Reading release notes for breaking changes
- Testing in a non-production environment (if you have one)
- Planning the upgrade sequence (some components have ordering dependencies)
- Executing the upgrade
- Verifying everything still works
- Rolling back if it doesn't
This isn't a one-time cost. It's a recurring tax on engineering time, every few weeks.
Configuration Complexity
Our prometheus-agent configuration alone was 300 lines. Metric relabeling rules, cAdvisor filtering, ServiceMonitor selectors, resource limits. Each line represented a decision and potential failure mode.
When metrics stopped appearing in Grafana, the debugging path included:
- Is the target being scraped? (Check Prometheus targets page)
- Are ServiceMonitor labels matching? (Check label selectors)
- Is remote_write working? (Check Prometheus logs)
- Is Mimir accepting the samples? (Check Mimir logs)
- Is the series being dropped by a relabeling rule? (Check relabel configs)
- Is the series exceeding cardinality limits? (Check Mimir limits)
- Is the query correct? (Check PromQL syntax)
Multiply this by every component in the stack.
The Migration Decision
After three months, we had a working stack. It ingested metrics, stored logs, fired alerts. But we also had:
- A significant portion of our GitOps repository dedicated to observability
- Regular time spent on observability maintenance instead of product work
- Occasional outages where our monitoring was down and we couldn't monitor our monitoring
The irony is not lost on anyone who has lived it: when your observability stack breaks, nothing is left to tell you it's broken.
We evaluated our options:
- Keep self-hosting: Accept the operational cost as part of sovereignty
- Use a US-based managed service: Datadog, Grafana Cloud US regions
- Use a European managed service: Find a provider with EU data residency
The third option hadn't been obvious initially. The observability market is dominated by US companies. But we found HeyStaq, a multi-tenant observability platform built by Aknostic, a European company running European infrastructure.
What HeyStaq Provides
HeyStaq is built on the same open-source components we were running:
- Mimir for metrics (Prometheus-compatible)
- Loki for logs
- Grafana for visualization and alerting
- Multi-location synthetic monitoring from European cities
The difference: Aknostic operates it. They handle the OOM tuning, the storage scaling, the upgrades, the high availability. We ship metrics and logs via standard protocols.
Endpoints:
| Service | URL |
|---|---|
| Mimir (metrics) | https://mimir.heystaq.com/api/v1/push |
| Loki (logs) | https://loki.heystaq.com/loki/api/v1/push |
Multi-tenancy: The X-Scope-OrgID: clouds-of-europe header isolates our data from other tenants. We can't see their metrics; they can't see ours.
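To make the tenancy model concrete, here is a minimal sketch of a read-side request carrying that header. The query URL path is illustrative (only the push endpoints in the table above come from our configuration); the point is that every request is scoped by the tenant header:

```python
import urllib.request

# Hypothetical read-side request against a Prometheus-compatible query API.
# The URL path is illustrative; what matters is the tenant header, which
# scopes the request to our organisation's data and nobody else's.
req = urllib.request.Request(
    "https://mimir.heystaq.com/prometheus/api/v1/query?query=up",
    headers={"X-Scope-OrgID": "clouds-of-europe"},
)

# Without this header, a multi-tenant Mimir rejects the request outright.
assert any(v == "clouds-of-europe" for _, v in req.header_items())
```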
Authentication: mTLS client certificates for metric and log shipping. Google OAuth for Grafana access.
The Migration
Migrating was straightforward because we were already using standard protocols. Prometheus Agent's remote_write and Promtail's Loki client just needed new endpoints and credentials.
Prometheus Agent Configuration
```yaml
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: https://mimir.heystaq.com/api/v1/push
        headers:
          X-Scope-OrgID: clouds-of-europe
        writeRelabelConfigs:
          - targetLabel: environment
            replacement: production
          - targetLabel: kubernetes_cluster
            replacement: production
        queueConfig:
          capacity: 10000
          maxShards: 10
          maxSamplesPerSend: 5000
          sampleAgeLimit: 5m
        tlsConfig:
          cert:
            secret:
              name: heystaq-client-mtls
              key: tls.crt
          keySecret:
            name: heystaq-client-mtls
            key: tls.key
```
The sampleAgeLimit: 5m was learned through experience—after a prometheus-agent restart, it would try to re-send buffered samples, some of which were older than Mimir's out-of-order window. Dropping samples older than 5 minutes prevents rejection errors.
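The effect of that setting can be illustrated with a small simulation (our sketch of the behaviour, not Prometheus's actual queue code):

```python
# Toy model of what sampleAgeLimit: 5m does to a replay buffer after a
# restart: samples older than the limit are dropped before sending, so
# they never hit Mimir's out-of-order window and trigger rejections.
SAMPLE_AGE_LIMIT_S = 5 * 60

def filter_buffered(samples, now):
    """Keep only samples young enough to send; samples are (timestamp, value)."""
    return [(ts, v) for ts, v in samples if now - ts <= SAMPLE_AGE_LIMIT_S]

now = 10_000.0
buffered = [
    (now - 30, 1.0),    # 30s old: sent
    (now - 240, 2.0),   # 4m old: sent
    (now - 900, 3.0),   # 15m old: dropped instead of being rejected upstream
]
print(filter_buffered(buffered, now))  # only the first two samples remain
```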
mTLS Setup
HeyStaq uses mutual TLS for authentication. We receive a client certificate and key, store them as a Kubernetes Secret (SOPS-encrypted in Git), and reference them in the prometheus-agent and promtail configurations.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: heystaq-client-mtls
  namespace: monitoring
type: kubernetes.io/tls
stringData:
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    <Client certificate>
    -----END CERTIFICATE-----
  tls.key: |
    -----BEGIN RSA PRIVATE KEY-----
    <Client private key>
    -----END RSA PRIVATE KEY-----
```
One gotcha: HeyStaq's servers use a public CA (Let's Encrypt). The CA certificate they provide is for server-side client verification, not for verifying the server. Including it in the client's CA bundle causes TLS failures. We learned this through debugging, not documentation.
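On the Promtail side the same lesson applies. A sketch of the client block we ended up with (the mount paths are illustrative, not from our actual configuration; `tenant_id` is Promtail's way of setting the X-Scope-OrgID header):

```yaml
clients:
  - url: https://loki.heystaq.com/loki/api/v1/push
    tenant_id: clouds-of-europe      # sent as the X-Scope-OrgID header
    tls_config:
      cert_file: /etc/promtail/tls/tls.crt
      key_file: /etc/promtail/tls/tls.key
      # Deliberately no ca_file: the server certificate comes from a
      # public CA (Let's Encrypt), so the default system trust store
      # verifies it. Adding the provided CA file here broke TLS for us.
```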
Deleting the Old Stack
With shipping confirmed working, we deleted the self-hosted stack:
84 files changed, 38,373 deletions(-)
Mimir configuration: gone. Loki configuration: gone. Grafana deployment: gone. Tempo: gone. Postgres cluster for Grafana state: gone. S3 credentials for metric storage: gone. Alert rules: migrated to HeyStaq's Grafana.
The GitOps repository became dramatically simpler.
What We Gained
Operational Simplicity
We no longer debug OOM kills in our observability stack. We no longer plan Mimir upgrades. We no longer calculate retention storage. Someone else does that—someone whose job is operating observability infrastructure, not building a community platform.
Reliability
HeyStaq runs redundant infrastructure across multiple availability zones. Our self-hosted stack ran on a single management cluster. When that cluster had issues, our monitoring had issues.
Now, our monitoring is independent of our workload infrastructure. When our production cluster has problems, we can still see metrics and logs to diagnose them.
Synthetic Monitoring
HeyStaq runs Blackbox Exporter instances in Paris, Amsterdam, and Warsaw. They probe our endpoints every 30 seconds from three European vantage points. This is genuinely useful for detecting regional issues—and it's infrastructure we don't operate.
Grafana Operator CRDs
We converted our dashboards and alerts to Grafana Operator Custom Resource Definitions. These live in Git, are deployed via Kustomize, and provision directly into HeyStaq's Grafana instance:
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaAlertRuleGroup
metadata:
  name: synthetic-monitoring
  namespace: org-clouds-of-europe
spec:
  instanceSelector:
    matchLabels:
      org: clouds-of-europe
  folderRef: synthetic-monitoring
  interval: 1m
  rules:
    - uid: af9j72o3pay2ob
      title: "EndpointDown"
      condition: C
      for: 1m
      labels:
        severity: "critical"
```
This gives us GitOps for observability configuration while using managed infrastructure for execution. The CRDs are portable—if we eventually return to self-hosting, the same files work.
What We Lost
Complete Control
We can't modify Mimir's ingestion limits. We can't add custom recording rules at the Mimir level. We can't change Loki's retention beyond what the platform offers. These constraints haven't mattered yet, but they could.
Cost Transparency
Self-hosted costs were predictable: compute, storage, network. Managed platform pricing is less transparent—we pay based on our plan, and scaling is a conversation rather than a Terraform variable.
Debugging Depth
When something's wrong with metric ingestion on a self-hosted stack, you can read Mimir's logs, check its metrics, examine its configuration. With a managed platform, you open a support ticket. For a small team, this is fine. For larger organizations with dedicated observability teams, it might be limiting.
The Sovereignty Calculation
For a European digital sovereignty platform, the managed vs. self-hosted question has an additional dimension: where does data go?
With HeyStaq:
- Metrics and logs are stored on European infrastructure
- The company operating the infrastructure is European
- Data never leaves EU jurisdiction
- GDPR applies to the provider relationship
This is meaningfully different from using Grafana Cloud's EU region (still a US company, subject to US jurisdiction) or Datadog (same concern).
We're still dependent on a third party. That's a trade-off against pure sovereignty. But the alternative—operating all infrastructure ourselves—has its own risks: reliability issues, security gaps from slow patching, and opportunity cost from infrastructure work instead of product work.
When to Self-Host
Our decision isn't universal. Self-hosted observability makes sense when:
You have a platform team: Dedicated engineers who can own the operational burden, respond to issues, and stay current on upgrades.
You have strict compliance requirements: Some regulations require not just data residency but operational control. Managed platforms may not satisfy auditors.
You need custom capabilities: Recording rules, exotic metric types, integration with internal systems that don't play well with multi-tenant platforms.
Cost scales better: At very high metric volumes, self-hosted can be cheaper than managed per-metric pricing. But the crossover point is higher than most expect once you account for engineering time.
For an early-stage European platform with a small team, none of these applied. The managed path was clearly better.
The Hybrid Model
We didn't fully abandon self-hosted. Our architecture is:
- Shipping: Self-managed Prometheus Agent and Promtail in each cluster
- Storage and query: Managed HeyStaq (Mimir, Loki)
- Visualization and alerting: Managed Grafana with GitOps-provisioned configuration
- Synthetic monitoring: Managed blackbox-exporter probes
This hybrid gives us control over what gets shipped (our metric relabeling rules, our ServiceMonitors) while offloading the stateful, complex parts (storage, query, high availability).
The Prometheus Agent's remote_write is a portability guarantee. If we outgrow HeyStaq or want to return to self-hosting, we change an endpoint URL. The shipping configuration stays the same.
Key Takeaways
Self-hosted observability is operationally expensive. Memory tuning, storage calculation, upgrade management, configuration debugging—these costs are real and recurring.
European managed observability exists. You don't have to choose between sovereignty and operational simplicity. Look for European providers with EU infrastructure.
Standard protocols enable migration. Prometheus remote_write and Loki push API mean you're not locked in. Your shipping configuration works with self-hosted or managed backends.
GitOps works with managed platforms. Grafana Operator CRDs let you version-control dashboards and alerts while using managed Grafana. You get the benefits of infrastructure-as-code without operating the infrastructure.
Know your stage. Early-stage teams should focus on product, not observability infrastructure. As you grow, the calculus may change. Build with portability in mind.
*This article documents work done on the Clouds of Europe platform in January 2026.*