Public-facing uptime surface. Paired with
incident-response.md (internal playbook)
and sla-template.md (commitment).
Buyers expect “is it up right now, what broke recently, what’s scheduled” on day one. The recipes below cover that floor.
Status: recipe ready-to-apply; hosting-account provisioning
remains owner-side. The helm sidecar option (below) lands the
self-hosted path as a single values.yaml toggle; the UptimeRobot
path is fully scripted as curl recipes. The two outstanding manual
steps are (a) flipping the toggle / running the curl against a real
account, and (b) pointing DNS at the result.
Target hostname: https://status.example.com — agreed in Wave 14
(the product-site nav badge + footer link reference this URL; the
product site itself lives in the private metis-pro repo). DNS, TLS
cert, and the actual deploy do not exist yet.
Selected path (per Wave 11): Tier B — Uptime Kuma self-hosted,
deployed via the helm chart’s statusPage.enabled toggle. Rationale:
matches the open-core posture (“you run it”); avoids per-deployment
SaaS account provisioning; the documented Tier B trade-off
(cluster-wide outage takes the status page down) is acceptable for a
v1 single-region deployment.
Tier A (UptimeRobot / Better Stack) remains documented as the off-failure-domain alternative for buyers whose compliance posture requires it. The two paths are not mutually exclusive — running both in parallel doubles cost (~$30/mo) and removes the failure-domain risk.
Steps NOT automated by this chart/doc; the operator runs them once per deployment:
1. Set statusPage.enabled: true in values.yaml (Tier B) OR provision an UptimeRobot / Better Stack account (Tier A). Both can coexist.
2. Point status.<your-domain> at the chosen surface (Ingress for Tier B; the SaaS provider’s CNAME for Tier A).
3. Issue a synthetic-traffic key with metis gateway issue-key --daily-cap-usd 0.50 --allow-model anthropic:claude-haiku-4-5 and bundle it into the probe.
4. Use status-page-config.yaml as the exact paste artifact for the owner-side setup ticket.

Pick one tier. Most buyers start with the external tier and add the self-hosted tier once their compliance team asks where the status data itself lives.
Tier A (external SaaS): cheap, no infra, off your gateway’s failure domain. Right for pre-pilot and pilot. Free tiers (UptimeRobot 50 monitors / 5-min; Better Stack 10 monitors / 3-min) cover a single gateway; Atlassian Statuspage (~$30/mo) is the polished public-page option.
Two probes, the only signals worth a status page on v1:
1. https://gateway.example.com/healthz expect 200
2. https://gateway.example.com/v1/messages (synthetic) expect 200/401
The second probe needs a real POST /v1/messages against a
dedicated synthetic-traffic key — issue one with
--daily-cap-usd 0.50 and --allow-model anthropic:claude-haiku-4-5.
Send a 1-token request every 5 min; monthly cost lands under $1.
Tier B (self-hosted Uptime Kuma): for when the buyer’s compliance team objects to “status reported by a SaaS we don’t control.” Trade-off: a cluster-wide outage takes the status page down too; the fix is cross-region, at which point you’re rebuilding Statuspage.io for free.
Two install paths — both ship Uptime Kuma; pick by who owns the release:
A. Helm sidecar (recommended — bundled with the gateway release).
Wave 11 ships an opt-in statusPage.enabled toggle on the gateway
chart. Single helm release, one upgrade story, the status page lives
or dies with the gateway pod’s cluster:

    helm upgrade --install metis-gateway ./infra/gateway/helm/ \
      --namespace metis-gateway --create-namespace \
      --set provider.existingSecret=metis-providers \
      --set keystore.existingSecret=metis-keystore \
      --set statusPage.enabled=true \
      --set statusPage.ingress.enabled=true \
      --set statusPage.ingress.host=status.example.com \
      --set statusPage.ingress.className=nginx \
      --set 'statusPage.ingress.tls[0].secretName=status-page-tls' \
      --set 'statusPage.ingress.tls[0].hosts[0]=status.example.com'

What you get: a sibling Deployment + Service + PVC (and optional
Ingress) carrying app.kubernetes.io/name=<release>-status-page so
the gateway’s Service can’t accidentally route to the Kuma pod. PVC
defaults to 1Gi; Kuma’s SQLite DB + monitor configs land there.
Resource defaults are conservative (100m CPU / 128Mi memory request)
because a status page is rarely a hot path. All knobs under
statusPage.* in infra/gateway/helm/values.yaml.
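
The same toggle can live in a values file instead of a chain of --set flags. A minimal sketch, assuming the statusPage.* keys mirror the --set paths used in the helm command above; the file name status-page-values.yaml is hypothetical, and infra/gateway/helm/values.yaml remains the authoritative key list:

```shell
#!/bin/sh
# Write a values overlay that mirrors the --set flags from the helm command above.
# Assumption: key names follow the --set paths; confirm against values.yaml first.
cat > status-page-values.yaml <<'EOF'
statusPage:
  enabled: true
  ingress:
    enabled: true
    host: status.example.com
    className: nginx
    tls:
      - secretName: status-page-tls
        hosts:
          - status.example.com
EOF

# Apply it alongside the existing secret flags (not run here):
#   helm upgrade --install metis-gateway ./infra/gateway/helm/ \
#     --namespace metis-gateway --create-namespace \
#     --set provider.existingSecret=metis-providers \
#     --set keystore.existingSecret=metis-keystore \
#     -f status-page-values.yaml
```

Keeping the overlay in version control makes the status-page settings reviewable, which a one-off --set invocation is not.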
Verify the install renders:
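
    # "name:" must keep its two-space indent to match metadata.name;
    # -B1 prints the kind line that precedes each matching name.
    helm template test ./infra/gateway/helm/ \
      --set provider.anthropicApiKey=sk-test \
      --set statusPage.enabled=true \
      | grep -E "^(kind:|  name:)" | grep -B1 status-page
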
You should see PVC + Service + Deployment all with the -status-page
suffix. The Recreate strategy on the Deployment is deliberate —
Uptime Kuma’s SQLite DB doesn’t tolerate two concurrent writers
during a rolling upgrade.
B. Upstream chart (alternative — separate release, separate release cadence).

    helm repo add uptime-kuma https://uptime-kuma-helm.github.io/uptime-kuma
    helm install kuma uptime-kuma/uptime-kuma \
      --namespace metis-ops --create-namespace \
      --set persistence.enabled=true --set persistence.size=1Gi

Use this when the status page is owned by a separate ops team or when you need features (e.g. clustered HA) the sidecar chart doesn’t expose. The Wave 11 helm-sidecar option is a strict subset of what the upstream chart can do; pick the upstream path if you outgrow it.
In both paths, expose on a separate hostname (status.example.com)
with its own TLS cert; do not co-locate with the gateway ingress.
Same probes as Tier A — see “Monitoring checks” below. Optionally
wire a Kuma “Push” monitor to the trace-DB SQL probe from
incident-response.md —
the cron pushes only when the SQL probe returns clean. Restrict the
admin UI to a buyer VPN range via NetworkPolicy; the public
read-only view stays open.
Four probes. Paste these into the Kuma UI on first boot, or post them
via the UptimeRobot v2 API (recipes at the end of this section).
Each probe maps to a status-page component listed under “What to
publish” below; the gateway-key liveness check is new for Wave 10/11
and assumes the Prometheus /metrics endpoint shipped in Wave 11.
Probe 1 (/healthz)
The canonical “is the gateway up” signal. Two consecutive failures
in 60s = SEV1 per
incident-response.md §On-call.
| Field | Value |
|---|---|
| Monitor type | HTTP(s) |
| URL | https://gateway.example.com/healthz |
| Method | GET |
| Interval | 60s |
| Timeout | 10s |
| Accepted statuses | 200 |
| Retries | 1 (≥2 failures = page) |
| Component | Gateway (HTTP liveness) |
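
For operators who want to reproduce this check by hand, the "retry once, page only on two consecutive failures" rule can be sketched in shell. This is an illustration under stated assumptions, not the Kuma implementation: the helper names healthz_ok and check_twice and the 5-second retry delay are hypothetical.

```shell
#!/bin/sh
# Sketch of the "one retry, page on two consecutive failures" rule from the table.
GATEWAY_URL="${GATEWAY_URL:-https://gateway.example.com}"
RETRY_DELAY="${RETRY_DELAY:-5}"   # seconds between attempts (assumed value)

# healthz_ok: succeeds only when /healthz answers 200 within the 10s timeout.
healthz_ok() {
  code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 10 "$GATEWAY_URL/healthz")
  [ "$code" = "200" ]
}

# check_twice CMD [ARGS...]: run the check; on failure wait, then retry once.
# Returns non-zero only when both attempts fail.
check_twice() {
  "$@" && return 0
  sleep "$RETRY_DELAY"
  "$@"
}

# Example (not run here):
#   check_twice healthz_ok || echo "page: gateway down"
```

A single failed request stays silent; only the second consecutive failure produces a non-zero exit, which is the point at which the table says to page.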
Probe 2 (synthetic POST /v1/messages)
Catches the failure mode /healthz misses: the process is up but
the routing chain or provider adapter is broken. Costs under $1/mo
at 1 call per 5 min on haiku.
| Field | Value |
|---|---|
| Monitor type | HTTP(s) keyword OR Kuma “Push” with curl cron |
| URL | https://gateway.example.com/v1/messages |
| Method | POST |
| Headers | x-api-key: $SYNTHETIC_KEY, anthropic-version: 2023-06-01, content-type: application/json |
| Body | {"model": "anthropic:claude-haiku-4-5", "max_tokens": 1, "messages": [{"role":"user","content":"ping"}]} |
| Keyword | "type":"message" |
| Interval | 300s |
| Accepted statuses | 200, 401 |
| Component | Gateway (Anthropic shape) |
A second copy hitting /v1/chat/completions with an OpenAI-shape
body covers Gateway (OpenAI shape).
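
When this probe runs as a cron-driven Kuma Push monitor, the request and the up/down decision can be sketched as follows. The function names are hypothetical, SYNTHETIC_KEY and KUMA_PUSH_URL are expected in the environment, and the sketch assumes the Push endpoint accepts status=down as well as status=up.

```shell
#!/bin/sh
# Sketch of the synthetic POST probe feeding a Kuma Push monitor.
GATEWAY_URL="${GATEWAY_URL:-https://gateway.example.com}"

# Request body from the table above: one token on the allowed haiku model.
probe_body() {
  printf '%s' '{"model":"anthropic:claude-haiku-4-5","max_tokens":1,"messages":[{"role":"user","content":"ping"}]}'
}

# Map an HTTP status code to up/down using the accepted statuses (200, 401).
probe_verdict() {
  case "$1" in
    200|401) echo up ;;
    *)       echo down ;;
  esac
}

# POST the synthetic request and print the HTTP status code.
# curl prints 000 when no HTTP response was received.
probe_status() {
  curl -s -o /dev/null -w '%{http_code}' --max-time 10 \
    -X POST "$GATEWAY_URL/v1/messages" \
    -H "x-api-key: $SYNTHETIC_KEY" \
    -H "anthropic-version: 2023-06-01" \
    -H "content-type: application/json" \
    -d "$(probe_body)"
}

# Report to Kuma (not run here):
#   code=$(probe_status)
#   curl -fsS "$KUMA_PUSH_URL?status=$(probe_verdict "$code")&msg=http=$code"
```

Separating the verdict from the request keeps the accepted-status list in one place, so the OpenAI-shape copy only needs a different path and body.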
Probe 3 (/metrics heartbeat, Wave 11)
The gateway’s Prometheus exposition endpoint is also the cheapest
“the process is healthy enough to report metrics” canary:
/healthz only checks the ASGI app boots; /metrics exercises the
event-bus subscriber path and the MetricsCollector registry. If
the bus has stalled, this probe fails.
| Field | Value |
|---|---|
| Monitor type | HTTP(s) keyword |
| URL | https://gateway.example.com/metrics |
| Method | GET |
| Keyword | metis_gateway_keys_active |
| Interval | 60s |
| Timeout | 10s |
| Accepted statuses | 200 |
| Component | Gateway (metrics surface) |
Note: production deployments should put /metrics behind a
NetworkPolicy that only allows scraping from the Prometheus pod’s
namespace. The status-page probe lives inside that namespace, so
this is compatible — what’s not compatible is exposing /metrics
to the public internet just so an external Tier-A probe can read
it. For Tier A, lift this probe out and rely on probes 1, 2, 4 only.
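
From inside the allowed namespace, the keyword check amounts to finding one sample line in the exposition text. A minimal sketch, assuming the metric is exposed as an unlabelled sample such as "metis_gateway_keys_active 3"; the helper name metric_value is hypothetical:

```shell
#!/bin/sh
# metric_value NAME: read Prometheus text exposition on stdin and print the
# integer part of the first sample whose name is exactly NAME.
# Exits non-zero when no such sample exists (the "keyword missing" case).
# Assumption: the sample carries no labels.
metric_value() {
  awk -v name="$1" '
    $1 == name && !found { value = int($2); found = 1 }
    END { if (found) print value; else exit 1 }
  '
}

# Example (not run here), from a pod allowed to reach /metrics:
#   curl -fsS http://metis-gateway.metis-gateway.svc:8422/metrics \
#     | metric_value metis_gateway_keys_active
```

Matching on the first field rather than a substring avoids false positives from the # HELP and # TYPE comment lines, which also contain the metric name.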
Probe 4 (gateway-key liveness)
Catches the “keystore is empty / corrupt” failure mode where the
gateway boots, /healthz returns 200, but no key resolves so every
real client request 401s. Reads metis_gateway_keys_active (the
Wave 11 metric, gauge of status="active" keys in the keystore).
| Field | Value |
|---|---|
| Monitor type | Kuma “Push” wired to a cron, OR HTTP keyword |
| Push interval | 60s |
| Cron probe | see cron recipe below |
| Component | Keystore (active keys) |
Cron recipe (paste under a Kuma Push monitor’s “Push URL” callback, or run from the same node that hosts the Prometheus scrape target):
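
    #!/bin/sh
    # Save as a script (example path: /usr/local/bin/metis-keys-probe.sh) and
    # schedule it every minute. Cron entries cannot span lines, so the entry
    # only calls the script:
    #   * * * * * gateway-ops /usr/local/bin/metis-keys-probe.sh
    # SLACK_HOOK and KUMA_PUSH_URL must be set in the script's environment.
    ACTIVE=$(curl -fsS http://metis-gateway.metis-gateway.svc:8422/metrics \
      | awk '/^metis_gateway_keys_active /{print int($2)}' | head -1)
    if [ "${ACTIVE:-0}" -lt 1 ]; then
      # No heartbeat is pushed, so the Kuma Push monitor goes down on its own.
      curl -fsS -X POST "$SLACK_HOOK" -d '{"text":"gateway keystore empty"}'
    else
      curl -fsS "$KUMA_PUSH_URL?status=up&msg=keys=$ACTIVE"
    fi
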
Threshold: < 1 active key = SEV1 (gateway is effectively offline
for paying tenants). metis_gateway_keys_active dropping by ≥ 50%
within 5 min without a corresponding gateway.key_rotated audit
event = SEV2 (unexpected mass-revocation; check
audit-log.md for the cause).
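
The two thresholds can be sketched as a small classifier over two samples of metis_gateway_keys_active taken about 5 minutes apart. The function name classify_keys is hypothetical, and how the previous sample and the gateway.key_rotated audit lookup are obtained is left open here.

```shell
#!/bin/sh
# classify_keys PREV CUR [ROTATED]: apply the thresholds above.
#   PREV, CUR  integer key counts, sampled about 5 minutes apart
#   ROTATED    "yes" if a gateway.key_rotated audit event explains the change
classify_keys() {
  prev=$1
  cur=$2
  rotated=${3:-no}
  if [ "$cur" -lt 1 ]; then
    echo SEV1   # no active keys: gateway is effectively offline
  elif [ "$rotated" != "yes" ] && [ "$prev" -gt 0 ] && [ $((cur * 2)) -le "$prev" ]; then
    echo SEV2   # dropped by at least 50% with no matching rotation event
  else
    echo OK
  fi
}

# Examples:
#   classify_keys 10 0       prints SEV1
#   classify_keys 10 5       prints SEV2
#   classify_keys 10 5 yes   prints OK
```

Comparing cur * 2 against prev keeps the 50% test in integer arithmetic, which is all POSIX sh offers.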
If you flip Tier A on instead of (or alongside) Tier B, probes 1 and
3 can be created via curl (probe 3 only where /metrics is reachable
from the provider; see the note under that probe). Stash
UPTIMEROBOT_API_KEY from your account’s “API Settings” page first:

    # Probe 1: /healthz (type=1 is an HTTP(s) monitor).
    # interval is in seconds; the free tier's floor is 5 min, so use 300 there.
    curl -X POST https://api.uptimerobot.com/v2/newMonitor \
      -d "api_key=$UPTIMEROBOT_API_KEY&format=json&type=1" \
      -d "url=https://gateway.example.com/healthz" \
      -d "friendly_name=Metis%20Gateway%20-%20healthz" \
      -d "interval=60"

    # Probe 3: /metrics keyword (type=2 is a keyword monitor).
    # Per the UptimeRobot v2 docs, keyword_type selects when the monitor is
    # flagged down (1 = keyword exists, 2 = keyword missing); a heartbeat wants 2.
    curl -X POST https://api.uptimerobot.com/v2/newMonitor \
      -d "api_key=$UPTIMEROBOT_API_KEY&format=json&type=2" \
      -d "url=https://gateway.example.com/metrics" \
      -d "keyword_type=2&keyword_value=metis_gateway_keys_active" \
      -d "friendly_name=Metis%20Gateway%20-%20metrics%20heartbeat" \
      -d "interval=60"

Probes 2 (synthetic POST with body + headers) and 4 (gateway-key liveness) need a paid UptimeRobot plan for POST-with-body and keyword monitors over headers; on the free tier, run them as Kuma Push monitors driven by a curl cron on a workstation. Better Stack supports POST-with-body on its free 10-monitor tier.
- Overall status: operational / degraded / partial-outage / major-outage / maintenance. Set by hand at triage.
- Components: Gateway (OpenAI shape), Gateway (Anthropic shape), Analytics surface, Status page itself.
- (see sla-template.md exclusions).
- Do not publish: gateway_key_id values (side-channel tenant identifier), raw cost numbers in USD (“quota exceeded” is fine; “$8,432.10 over budget” is not).

Incident posts are plain text. Substitute the bracketed fields. Times in ISO 8601 UTC.
Initial (within 15 min of detection):
[INVESTIGATING] <Component> — <one-line user-visible symptom>
Posted: <YYYY-MM-DDTHH:MMZ>
Investigating reports of <symptom> affecting <component>. Customers
may experience <impact>. Next update by <YYYY-MM-DDTHH:MMZ>.
Identified (within 1 hour):
[IDENTIFIED] <Component> — <symptom>
Posted: <YYYY-MM-DDTHH:MMZ>
Cause: <plain-English summary; no internal component names>. We are
<mitigating action — failover / rollback / restart>. Next update by
<YYYY-MM-DDTHH:MMZ>.
Mitigating (every 30 min until resolved):
[MITIGATING] <Component> — <symptom>
Posted: <YYYY-MM-DDTHH:MMZ>
<Action in progress>. <Optional: % of traffic restored, ETA>. Next
update by <YYYY-MM-DDTHH:MMZ>.
Resolution:
[RESOLVED] <Component> — <symptom>
Posted: <YYYY-MM-DDTHH:MMZ>
Resolved as of <YYYY-MM-DDTHH:MMZ>. Duration: <HH:MM>. Root cause:
<one paragraph>. Post-mortem by <within 7 days>. Service-credit
claims per SLA at <link>.
Scheduled maintenance (≥48h before):
[SCHEDULED] <Component> — <e.g. "trace-DB upgrade">
Window: <YYYY-MM-DDTHH:MMZ> to <YYYY-MM-DDTHH:MMZ>
Expected impact: <none / brief degradation / brief unavailability>
During <unavailability window if any>, requests to <endpoint> will
<queue / fail / return 503>. Update posted when complete.
The templates above are by stage (initial / identified / etc.).
The four below are the same content pre-instantiated for each
severity from incident-response.md §Severity levels.
Paste them as saved incident templates in your status-page provider
(Statuspage.io / Better Stack support named templates; Uptime Kuma
1.x doesn’t — keep them as a copy-paste cheat-sheet in the operator
runbook). Overall status maps strictly:
| Severity | Overall status | Cadence | Initial-update target |
|---|---|---|---|
| SEV1 | major-outage | 30 min | 15 min from detection |
| SEV2 | partial-outage | 30 min | 1 hour from detection |
| SEV3 | degraded | 4 hours | 1 business day |
| SEV4 | (not status-page-worthy unless user-visible impact) | n/a | n/a |
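
For operators scripting their posts, the severity mapping can be sketched as a pair of lookups. The function names are hypothetical; the values come straight from the table.

```shell
#!/bin/sh
# sev_status SEV: overall status for a severity, per the table above.
sev_status() {
  case "$1" in
    SEV1) echo major-outage ;;
    SEV2) echo partial-outage ;;
    SEV3) echo degraded ;;
    *)    return 1 ;;   # SEV4 and unknown values are not published
  esac
}

# sev_cadence_min SEV: minutes between public updates.
sev_cadence_min() {
  case "$1" in
    SEV1|SEV2) echo 30 ;;
    SEV3)      echo 240 ;;
    *)         return 1 ;;
  esac
}

# next_update_epoch NOW_EPOCH SEV: epoch seconds of the next promised update.
next_update_epoch() {
  mins=$(sev_cadence_min "$2") || return 1
  echo $(( $1 + mins * 60 ))
}
```

With GNU date, the result can be rendered in the templates' timestamp format via date -u -d "@$(next_update_epoch "$(date +%s)" SEV1)" +%Y-%m-%dT%H:%MZ; BSD date takes -r instead of -d @.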
SEV1 (major-outage)
[INVESTIGATING] <Component> — <user-visible symptom>
Posted: <YYYY-MM-DDTHH:MMZ>
Investigating reports of <symptom> affecting <component>. Customers
are unable to <e.g. "complete LLM requests via the Anthropic-shape
gateway endpoint">. We are <mitigating action — rollback / failover /
restart>. Next update by <YYYY-MM-DDTHH:MMZ (set 30 min out)>.
Overall status set to: major-outage
Affected components: <list — Gateway (HTTP liveness), Gateway
(Anthropic shape), etc.>
SEV1 triggers per incident-response.md:
total gateway outage, irrecoverable trace-DB corruption, suspected
key compromise with active exploitation, prompt/completion exposure,
provider bill threshold breached.
SEV2 (partial-outage)
[INVESTIGATING] <Component> — <user-visible symptom>
Posted: <YYYY-MM-DDTHH:MMZ>
Investigating elevated <error rate / latency> on <component — e.g.
"the OpenAI-shape inbound" or "Anthropic upstream calls">. Customers
using <impact scope — e.g. "the OpenAI shape" or "models routed to
Anthropic"> may experience <impact>. Other components are operating
normally. Next update by <YYYY-MM-DDTHH:MMZ (set 30 min out)>.
Overall status set to: partial-outage
Affected components: <one-of, not all>
SEV2 triggers: one inbound shape down; one upstream provider down with no failover; per-key analytics rollup wrong by ≥10%; ingress TLS expired.
SEV3 (degraded)
[INVESTIGATING] <Component> — <user-visible symptom>
Posted: <YYYY-MM-DDTHH:MMZ>
Elevated <latency / non-default-model unavailability / quota-alert
volume> on <component>. Customer impact is <minimal / single-tenant
/ confined to non-default routes>. Working on <mitigation>. Next
update by <YYYY-MM-DDTHH:MMZ (set 4 hours out)>.
Overall status set to: degraded
Affected components: <component>
SEV3 triggers: elevated latency (p95 > 2× baseline); trace-DB near disk-full; a non-default model unavailable; one tenant’s quota-alert spamming.
SEV4 is cosmetic / log noise / doc errors. It is not
status-page-worthy unless it crosses into user-visible impact, at
which point re-evaluate per incident-response.md; typical
reclassification is SEV4 → SEV3.
If you find yourself reaching for “publish a SEV4,” the post itself will erode trust faster than the issue does. Internal channel only.
- Statuspage.io / Better Stack: create named templates metis-sev1-major-outage / metis-sev2-partial-outage / metis-sev3-degraded; paste each body. Bind each template to the matching default impact (Major / Partial / Minor) and the affected components listed under “What to publish”.
- Uptime Kuma 1.x: keep the bodies somewhere operators can reach (e.g. #metis-incidents) and copy-paste into the “Add Incident” form on the status page when posting. v2.x reportedly adds templates; check the release notes when you upgrade.
- UptimeRobot: use https://api.uptimerobot.com/v2/setPSPMessage for status-page text; on free / Solo tiers, paste manually.
- See incident-response.md.