Managed Observability in Europe: Why We Chose HeyStaq Over Self-Hosted Prometheus
The true cost of running your own observability stack, and when a European managed service makes more sense.
By Jurg van Vliet
We ran self-hosted Prometheus, Loki, and Grafana for three months. Then we deleted 38,000 lines of configuration and migrated to a managed service. This article isn't a vendor pitch—it's an honest accounting of what self-hosted observability actually costs, and why a European managed platform made sense for our stage.
The commit that removed our self-hosted stack deleted 84 files across Mimir, Loki, Grafana, Tempo, blackbox-exporter, and their supporting infrastructure. Each of those files represented decisions, debugging sessions, and operational knowledge accumulated over weeks. We walked away from all of it.
Here's why.
The Self-Hosted Dream
The appeal of self-hosted observability is obvious, especially for a European sovereignty platform:
- Complete control: You own your metrics, your retention policies, your access controls
- No vendor lock-in: Standard protocols (Prometheus remote_write, Loki push API) mean you can move
- Cost predictability: Compute and storage costs, not per-metric pricing
- Sovereignty by default: Data never leaves your infrastructure
We built a comprehensive stack on our management cluster:
- Mimir for long-term metrics storage (Prometheus-compatible)
- Loki for log aggregation
- Grafana for visualization and alerting
- Tempo for distributed tracing
- Prometheus Agent on each workload cluster, shipping via remote_write
- Promtail for log shipping
- Blackbox Exporter for synthetic monitoring
Each component had its own Helm chart, ConfigMaps for configuration, Secrets for credentials, ServiceMonitors for self-monitoring, and dashboards for visibility. The GitOps repository grew accordingly.
The Hidden Costs
What the architecture diagrams don't show is the operational tax.
Memory Tuning
Mimir and Loki are memory-hungry applications. Not in the "allocate 4GB and forget it" way, but in the "tune your ingester batch sizes, query concurrency, and compaction schedules or watch OOM kills" way.
Our Mimir ingesters would periodically get OOM-killed during compaction. The solution wasn't more memory—it was understanding the interaction between:
- `-ingester.max-global-series-per-user`
- `-blocks-storage.bucket-store.max-chunk-pool-bytes`
- `-server.grpc-max-recv-msg-size-bytes`
- Pod memory limits
Getting this right took multiple iterations, each requiring a deployment, observation period, and adjustment.
Storage Math
Prometheus metrics aren't small. A typical Kubernetes cluster generates thousands of time series. Each series has samples at your scrape interval (usually 15-30 seconds). Retention compounds quickly.
Our calculations:
- ~50,000 active series across both clusters
- 15-second scrape interval
- 8 bytes per sample (timestamp + value)
- 30-day retention
That's roughly 70GB of raw data, before accounting for indexes and metadata. Mimir's block storage format is efficient, but you're still looking at significant S3 costs and query latency for historical data.
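The arithmetic behind that estimate is simple enough to sketch, using the figures listed above (raw samples only; indexes, metadata, and replication come on top):

```python
# Back-of-the-envelope storage estimate for raw Prometheus samples,
# using the figures from our setup above.
active_series = 50_000
scrape_interval_s = 15
bytes_per_sample = 8          # timestamp + value, pre-compression
retention_days = 30

samples_per_series_per_day = 86_400 // scrape_interval_s   # 5,760
raw_bytes = active_series * samples_per_series_per_day * bytes_per_sample * retention_days

print(f"{raw_bytes / 1e9:.0f} GB raw")  # prints "69 GB raw"
```

Mimir's compressed block format stores far less than 8 bytes per sample in practice, but the raw figure is the right starting point for capacity planning.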
We spent more time on retention policies than we'd like to admit. Which metrics warrant 30 days? Which can be downsampled? Which can be dropped after a week? These questions don't have universal answers—they depend on your debugging patterns and compliance requirements.
The Upgrade Treadmill
Observability components release frequently. Mimir, Loki, and Grafana each have their own release cadence, their own breaking changes, their own deprecation timelines.
Staying current isn't optional. Security patches, performance improvements, and bug fixes matter for infrastructure you depend on. But each upgrade requires:
- Reading release notes for breaking changes
- Testing in a non-production environment (if you have one)
- Planning the upgrade sequence (some components have ordering dependencies)
- Executing the upgrade
- Verifying everything still works
- Rolling back if it doesn't
This isn't a one-time cost. It's a recurring tax on engineering time, every few weeks.
Configuration Complexity
Our prometheus-agent configuration alone was 300 lines. Metric relabeling rules, cAdvisor filtering, ServiceMonitor selectors, resource limits. Each line represented a decision and potential failure mode.
When metrics stopped appearing in Grafana, the debugging path included:
- Is the target being scraped? (Check Prometheus targets page)
- Are ServiceMonitor labels matching? (Check label selectors)
- Is remote_write working? (Check Prometheus logs)
- Is Mimir accepting the samples? (Check Mimir logs)
- Is the series being dropped by a relabeling rule? (Check relabel configs)
- Is the series exceeding cardinality limits? (Check Mimir limits)
- Is the query correct? (Check PromQL syntax)
Multiply this by every component in the stack.
The Migration Decision
After three months, we had a working stack. It ingested metrics, stored logs, fired alerts. But we also had:
- A significant portion of our GitOps repository dedicated to observability
- Regular time spent on observability maintenance instead of product work
- Occasional outages where our monitoring was down and we couldn't monitor our monitoring
The irony is not lost on anyone who has lived it: when your observability stack breaks, nothing is left to tell you it's broken.
We evaluated our options:
- Keep self-hosting: Accept the operational cost as part of sovereignty
- Use a US-based managed service: Datadog, Grafana Cloud US regions
- Use a European managed service: Find a provider with EU data residency
The third option hadn't been obvious initially. The observability market is dominated by US companies. But we found HeyStaq, a multi-tenant observability platform built by Aknostic, a European company running European infrastructure.
What HeyStaq Provides
HeyStaq is built on the same open-source components we were running:
- Mimir for metrics (Prometheus-compatible)
- Loki for logs
- Grafana for visualization and alerting
- Multi-location synthetic monitoring from European cities
The difference: Aknostic operates it. They handle the OOM tuning, the storage scaling, the upgrades, the high availability. We ship metrics and logs via standard protocols.
Endpoints:
| Service | URL |
|---|---|
| Mimir (metrics) | https://mimir.heystaq.com/api/v1/push |
| Loki (logs) | https://loki.heystaq.com/loki/api/v1/push |
Multi-tenancy: The X-Scope-OrgID: clouds-of-europe header isolates our data from other tenants. We can't see their metrics; they can't see ours.
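To make the tenancy model concrete, here is a minimal sketch of a read-side request carrying that header. The query URL path is illustrative (only the push endpoints in the table above come from our configuration); the point is that every request is scoped by the tenant header:

```python
import urllib.request

# Hypothetical read-side request against a Prometheus-compatible query API.
# The URL path is illustrative; what matters is the tenant header, which
# scopes the request to our organisation's data and nobody else's.
req = urllib.request.Request(
    "https://mimir.heystaq.com/prometheus/api/v1/query?query=up",
    headers={"X-Scope-OrgID": "clouds-of-europe"},
)

# Without this header, a multi-tenant Mimir rejects the request outright.
assert any(v == "clouds-of-europe" for _, v in req.header_items())
```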
Authentication: mTLS client certificates for metric and log shipping. Google OAuth for Grafana access.
The Migration
Migrating was straightforward because we were already using standard protocols. Prometheus Agent's remote_write and Promtail's Loki client just needed new endpoints and credentials.
Prometheus Agent Configuration
```yaml
prometheus:
  prometheusSpec:
    remoteWrite:
      - url: https://mimir.heystaq.com/api/v1/push
        headers:
          X-Scope-OrgID: clouds-of-europe
        writeRelabelConfigs:
          - targetLabel: environment
            replacement: production
          - targetLabel: kubernetes_cluster
            replacement: production
        queueConfig:
          capacity: 10000
          maxShards: 10
          maxSamplesPerSend: 5000
          sampleAgeLimit: 5m
        tlsConfig:
          cert:
            secret:
              name: heystaq-client-mtls
              key: tls.crt
          keySecret:
            name: heystaq-client-mtls
            key: tls.key
```
The sampleAgeLimit: 5m was learned through experience—after a prometheus-agent restart, it would try to re-send buffered samples, some of which were older than Mimir's out-of-order window. Dropping samples older than 5 minutes prevents rejection errors.
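The effect of that setting can be illustrated with a small simulation (our sketch of the behaviour, not Prometheus's actual queue code):

```python
# Toy model of what sampleAgeLimit: 5m does to a replay buffer after a
# restart: samples older than the limit are dropped before sending, so
# they never hit Mimir's out-of-order window and trigger rejections.
SAMPLE_AGE_LIMIT_S = 5 * 60

def filter_buffered(samples, now):
    """Keep only samples young enough to send; samples are (timestamp, value)."""
    return [(ts, v) for ts, v in samples if now - ts <= SAMPLE_AGE_LIMIT_S]

now = 10_000.0
buffered = [
    (now - 30, 1.0),    # 30s old: sent
    (now - 240, 2.0),   # 4m old: sent
    (now - 900, 3.0),   # 15m old: dropped instead of being rejected upstream
]
print(filter_buffered(buffered, now))  # only the first two samples remain
```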
mTLS Setup
HeyStaq uses mutual TLS for authentication. We receive a client certificate and key, store them as a Kubernetes Secret (SOPS-encrypted in Git), and reference them in the prometheus-agent and promtail configurations.
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: heystaq-client-mtls
  namespace: monitoring
type: kubernetes.io/tls
stringData:
  tls.crt: |
    -----BEGIN CERTIFICATE-----
    <Client certificate>
    -----END CERTIFICATE-----
  tls.key: |
    -----BEGIN RSA PRIVATE KEY-----
    <Client private key>
    -----END RSA PRIVATE KEY-----
```
One gotcha: HeyStaq's servers use a public CA (Let's Encrypt). The CA certificate they provide is for server-side client verification, not for verifying the server. Including it in the client's CA bundle causes TLS failures. We learned this through debugging, not documentation.
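On the Promtail side the same lesson applies. A sketch of the client block we ended up with (the mount paths are illustrative, not from our actual configuration; `tenant_id` is Promtail's way of setting the X-Scope-OrgID header):

```yaml
clients:
  - url: https://loki.heystaq.com/loki/api/v1/push
    tenant_id: clouds-of-europe      # sent as the X-Scope-OrgID header
    tls_config:
      cert_file: /etc/promtail/tls/tls.crt
      key_file: /etc/promtail/tls/tls.key
      # Deliberately no ca_file: the server certificate comes from a
      # public CA (Let's Encrypt), so the default system trust store
      # verifies it. Adding the provided CA file here broke TLS for us.
```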
Deleting the Old Stack
With shipping confirmed working, we deleted the self-hosted stack:
84 files changed, 38,373 deletions(-)
Mimir configuration: gone. Loki configuration: gone. Grafana deployment: gone. Tempo: gone. Postgres cluster for Grafana state: gone. S3 credentials for metric storage: gone. Alert rules: migrated to HeyStaq's Grafana.
The GitOps repository became dramatically simpler.
What We Gained
Operational Simplicity
We no longer debug OOM kills in our observability stack. We no longer plan Mimir upgrades. We no longer calculate retention storage. Someone else does that—someone whose job is operating observability infrastructure, not building a community platform.
Reliability
HeyStaq runs redundant infrastructure across multiple availability zones. Our self-hosted stack ran on a single management cluster. When that cluster had issues, our monitoring had issues.
Now, our monitoring is independent of our workload infrastructure. When our production cluster has problems, we can still see metrics and logs to diagnose them.
Synthetic Monitoring
HeyStaq runs Blackbox Exporter instances in Paris, Amsterdam, and Warsaw. They probe our endpoints every 30 seconds from three European vantage points. This is genuinely useful for detecting regional issues—and it's infrastructure we don't operate.
Grafana Operator CRDs
We converted our dashboards and alerts to Grafana Operator Custom Resource Definitions. These live in Git, are deployed via Kustomize, and provision directly into HeyStaq's Grafana instance:
```yaml
apiVersion: grafana.integreatly.org/v1beta1
kind: GrafanaAlertRuleGroup
metadata:
  name: synthetic-monitoring
  namespace: org-clouds-of-europe
spec:
  instanceSelector:
    matchLabels:
      org: clouds-of-europe
  folderRef: synthetic-monitoring
  interval: 1m
  rules:
    - uid: af9j72o3pay2ob
      title: "EndpointDown"
      condition: C
      for: 1m
      labels:
        severity: "critical"
```
This gives us GitOps for observability configuration while using managed infrastructure for execution. The CRDs are portable—if we eventually return to self-hosting, the same files work.
What We Lost
Complete Control
We can't modify Mimir's ingestion limits. We can't add custom recording rules at the Mimir level. We can't change Loki's retention beyond what the platform offers. These constraints haven't mattered yet, but they could.
Cost Transparency
Self-hosted costs were predictable: compute, storage, network. Managed platform pricing is less transparent—we pay based on our plan, and scaling is a conversation rather than a Terraform variable.
Debugging Depth
When something's wrong with metric ingestion on a self-hosted stack, you can read Mimir's logs, check its metrics, examine its configuration. With a managed platform, you open a support ticket. For a small team, this is fine. For larger organizations with dedicated observability teams, it might be limiting.
The Sovereignty Calculation
For a European digital sovereignty platform, the managed vs. self-hosted question has an additional dimension: where does data go?
With HeyStaq:
- Metrics and logs are stored on European infrastructure
- The company operating the infrastructure is European
- Data never leaves EU jurisdiction
- GDPR applies to the provider relationship
This is meaningfully different from using Grafana Cloud's EU region (still a US company, subject to US jurisdiction) or Datadog (same concern).
We're still dependent on a third party. That's a trade-off against pure sovereignty. But the alternative—operating all infrastructure ourselves—has its own risks: reliability issues, security gaps from slow patching, and opportunity cost from infrastructure work instead of product work.
When to Self-Host
Our decision isn't universal. Self-hosted observability makes sense when:
You have a platform team: Dedicated engineers who can own the operational burden, respond to issues, and stay current on upgrades.
You have strict compliance requirements: Some regulations require not just data residency but operational control. Managed platforms may not satisfy auditors.
You need custom capabilities: Recording rules, exotic metric types, integration with internal systems that don't play well with multi-tenant platforms.
Cost scales better: At very high metric volumes, self-hosted can be cheaper than managed per-metric pricing. But the crossover point is higher than most expect once you account for engineering time.
For an early-stage European platform with a small team, none of these applied. The managed path was clearly better.
The Hybrid Model
We didn't fully abandon self-hosted. Our architecture is:
- Shipping: Self-managed Prometheus Agent and Promtail in each cluster
- Storage and query: Managed HeyStaq (Mimir, Loki)
- Visualization and alerting: Managed Grafana with GitOps-provisioned configuration
- Synthetic monitoring: Managed blackbox-exporter probes
This hybrid gives us control over what gets shipped (our metric relabeling rules, our ServiceMonitors) while offloading the stateful, complex parts (storage, query, high availability).
The Prometheus Agent's remote_write is a portability guarantee. If we outgrow HeyStaq or want to return to self-hosting, we change an endpoint URL. The shipping configuration stays the same.
Key Takeaways
Self-hosted observability is operationally expensive. Memory tuning, storage calculation, upgrade management, configuration debugging—these costs are real and recurring.
European managed observability exists. You don't have to choose between sovereignty and operational simplicity. Look for European providers with EU infrastructure.
Standard protocols enable migration. Prometheus remote_write and Loki push API mean you're not locked in. Your shipping configuration works with self-hosted or managed backends.
GitOps works with managed platforms. Grafana Operator CRDs let you version-control dashboards and alerts while using managed Grafana. You get the benefits of infrastructure-as-code without operating the infrastructure.
Know your stage. Early-stage teams should focus on product, not observability infrastructure. As you grow, the calculus may change. Build with portability in mind.
*This article documents work done on the Clouds of Europe platform in January 2026.*