When Atlassian Goes Full SaaS-Monolith, Go Open Source
We are replacing OpsGenie with GoAlert and built a Kubernetes operator for GitOps provisioning and a CLI to make model-based root cause analysis possible. Both are open source.
By Jurg van Vliet
The OpsGenie Problem
Atlassian is retiring OpsGenie as a standalone product, absorbing it into the Jira Service Management monolith. For European infrastructure teams, that means buying into an ever-growing bundle of non-EU SaaS services just to keep your on-call schedules running. We chose not to follow.
GoAlert
We are replacing OpsGenie with GoAlert, an open-source on-call management system originally built by Target Corporation. It does what OpsGenie does: on-call scheduling, escalation policies, SMS and voice notifications with two-way acknowledgment. It's a single Go binary backed by PostgreSQL. We run it on Scaleway in France, next to our Grafana stack.
Grafana Alert Rules → Grafana Alertmanager → GoAlert Webhook → Notifications
                                                  ↑
                                          SMS ack (1a/1c)
Grafana fires alerts, GoAlert routes them to whoever is on call, Twilio delivers SMS and voice. Engineers acknowledge by replying 1a to the SMS (or 1c to close). When Grafana resolves the alert, GoAlert auto-closes it.
What convinced us was the simplicity. GoAlert has a clean GraphQL API, handles deduplication by alert summary, and supports generic webhook ingestion. Any system that can send an HTTP POST can create alerts. We use this for forwarding Datadog and CloudWatch alerts from customers who haven't migrated their monitoring yet.
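Creating an alert through that generic webhook really is just an HTTP POST. A minimal sketch in Go of building such a request (host, token, and field values are placeholders; check the GoAlert documentation for the exact parameter set your version accepts):

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
	"strings"
)

// buildGoAlertRequest builds a POST against GoAlert's generic integration
// endpoint. summary and dedup drive GoAlert's deduplication; sending
// action=close on the same dedup key would resolve the alert.
func buildGoAlertRequest(host, token, summary, details, dedup string) (*http.Request, error) {
	form := url.Values{
		"summary": {summary},
		"details": {details},
		"dedup":   {dedup},
	}
	endpoint := fmt.Sprintf("%s/api/v2/generic/incoming?token=%s", host, url.QueryEscape(token))
	req, err := http.NewRequest(http.MethodPost, endpoint, strings.NewReader(form.Encode()))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/x-www-form-urlencoded")
	return req, nil
}

func main() {
	req, err := buildGoAlertRequest("https://goalert.example.com", "INTEGRATION-KEY",
		"CPU high on node-3", "usage above 95% for 10m", "cpu-node-3")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL.Path)
}
```

This is the same shape of request a Datadog or CloudWatch forwarder ends up sending; only the payload mapping differs.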
GitOps for Incident Management
GoAlert's weakness is operational management. Services, escalation policies, schedules, and user notification rules are configured through a web UI or direct API calls. For a team that manages everything through Git, that wasn't going to work.
So we built goalert-provisioning, a Kubernetes operator that reconciles GoAlert configuration from Custom Resource Definitions:
apiVersion: goalert.heystaq.com/v1alpha1
kind: GoAlertService
metadata:
  name: atlas-critical
  namespace: org-atlas
spec:
  serviceName: "Atlas CRITICAL"
  escalationPolicyRef:
    name: aknostic-critical
    namespace: aknostic
  integrationKeys:
    - name: datadog-critical
      secretRef:
        name: atlas-integration-keys
        key: datadog-critical
You push to Git, Flux syncs to the cluster, the operator creates the service in GoAlert and writes the integration key token to a Kubernetes Secret. Onboarding a new tenant takes a pull request. We onboarded five tenants this way, each with critical and non-critical alert routing, in under an hour per tenant.
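The reconcile step itself follows the standard operator pattern: compare the desired state from the CRD against what the GoAlert API reports, then create or update. A simplified sketch of that diff logic, with hypothetical types standing in for the real CRD structs and GraphQL client:

```go
package main

import "fmt"

// DesiredService mirrors the relevant fields of a GoAlertService CRD.
// These types are illustrative; the real operator uses generated CRD structs.
type DesiredService struct {
	Name             string
	EscalationPolicy string
}

// ActualService is what a lookup against the GoAlert API would return;
// a nil pointer means the service does not exist yet.
type ActualService struct {
	ID               string
	EscalationPolicy string
}

// reconcile returns the action the operator would take for one service.
func reconcile(desired DesiredService, actual *ActualService) string {
	switch {
	case actual == nil:
		return "create service " + desired.Name
	case actual.EscalationPolicy != desired.EscalationPolicy:
		return "update escalation policy for " + desired.Name
	default:
		return "in sync"
	}
}

func main() {
	want := DesiredService{Name: "Atlas CRITICAL", EscalationPolicy: "aknostic-critical"}
	fmt.Println(reconcile(want, nil))
	fmt.Println(reconcile(want, &ActualService{ID: "svc-1", EscalationPolicy: "aknostic-critical"}))
}
```

Because the loop only compares and converges, re-running it is safe: a second reconcile of an already-provisioned tenant is a no-op.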
The operator covers the full lifecycle: services and integration keys, escalation policies, on-call schedules, rotations, user accounts and contact methods, even system-level admin configuration. This mirrors the pattern we already use for Grafana resources through the Grafana Operator. The entire incident management stack, from metric collection to alert routing to on-call notification, lives in version control.
gctl: a CLI for Model-Based RCA
The operator solved provisioning, but we also wanted to bring large language models into incident response. That required a programmatic interface to GoAlert, so we built gctl.
gctl oncall
gctl alert list
gctl query '{ services { nodes { name } } }'
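Each of these commands boils down to wrapping a GraphQL query in a JSON envelope and posting it to GoAlert's /api/graphql endpoint. A sketch of that request construction in Go (the bearer-token header is an assumption for illustration; gctl's actual authentication handling may differ):

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// newGraphQLRequest wraps a raw GraphQL query in the JSON body that
// GoAlert's GraphQL endpoint expects.
func newGraphQLRequest(host, token, query string) (*http.Request, error) {
	body, err := json.Marshal(map[string]string{"query": query})
	if err != nil {
		return nil, err
	}
	req, err := http.NewRequest(http.MethodPost, host+"/api/graphql", bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("Authorization", "Bearer "+token) // auth scheme is an assumption
	return req, nil
}

func main() {
	req, _ := newGraphQLRequest("https://goalert.example.com", "TOKEN",
		"{ services { nodes { name } } }")
	fmt.Println(req.URL.Path)
}
```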
It's a thin GraphQL client for checking who's on call, listing active alerts, or querying services. But the real reason we built it is to expose GoAlert to our LLM agents. An RCA investigation usually starts at the end, with the alerts themselves, and the model needs to find patterns in the incident-response history. That makes GoAlert as important to the investigation as any other component of our observability stack.
The model gets access to the alert context, related metrics in Mimir, logs in Loki, and traces in Tempo. It correlates signals across these sources and proposes a root cause. It doesn't replace the engineer's judgment, but it handles the tedious part: cross-referencing dashboards, crafting log queries, chasing through trace spans. The model operates within the same tenant boundaries enforced by the platform, querying Mimir and Loki with the correct X-Scope-OrgID header.
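Tenant isolation in that setup comes down to one header on every query. A sketch of how such a scoped Loki request might look (host and tenant names are illustrative; the path matches Loki's HTTP query API, and Mimir queries are scoped the same way):

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// tenantQuery builds a Loki query_range request restricted to a single
// tenant via the X-Scope-OrgID header.
func tenantQuery(host, tenant, logQL string) (*http.Request, error) {
	q := url.Values{"query": {logQL}}
	req, err := http.NewRequest(http.MethodGet,
		host+"/loki/api/v1/query_range?"+q.Encode(), nil)
	if err != nil {
		return nil, err
	}
	req.Header.Set("X-Scope-OrgID", tenant)
	return req, nil
}

func main() {
	req, _ := tenantQuery("https://loki.example.com", "org-atlas",
		`{namespace="org-atlas"} |= "error"`)
	fmt.Println(req.Header.Get("X-Scope-OrgID"))
}
```

Because the platform injects this header rather than the model, an agent investigating one tenant's incident can never read another tenant's logs or metrics.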
What We Gained
Replacing OpsGenie wasn't just about avoiding a forced SaaS migration. It changed how we work. Alert rules, escalation policies, on-call schedules, and contact points all live in one repository with one review process and one audit log. When someone asks why an alert went to a particular person, the answer is a Git commit.
Each customer gets a Kubernetes namespace containing their GoAlert CRDs, Grafana CRDs, and SOPS-encrypted secrets. Onboarding is a directory copy with find-and-replace. GoAlert is Apache 2.0 licensed, the operator and gctl are open source, and the underlying infrastructure runs on standard Kubernetes with S3-compatible storage. Alert data, on-call schedules, and notification logs are all stored in PostgreSQL on Scaleway France. The one remaining US dependency is Twilio for SMS and voice delivery, which we might replace, but there are more important workloads to repatriate first.
The goalert-provisioning operator and gctl are available at gitlab.aknostic.com/aknostic/goalert-provisioning. The operator installs via Helm and works with any GoAlert instance. gctl installs with go install. If you're running GoAlert and want GitOps provisioning, or if you're looking at alternatives to OpsGenie, we'd welcome contributors and feedback.
GoAlert Provisioning Operator · GoAlert: https://goalert.me/
This is a follow-up to Building European Observability: Multi-Location Synthetic Monitoring with GitOps, documenting the next phase of the Heystaq platform.