Operational playbook for a Metis gateway / agent-server deployment.
Presumes the single-tenant single-region shape from
docs/gateway-deployment.md. Metis
ships open-core; you run the gateway and own the incident loop.
See also status-page.md (external comms) and
sla-template.md (downstream SLA).
Pick the highest matching row.
| Severity | Criteria | Ack target | Mitigation target | Resolution target |
|---|---|---|---|---|
| SEV1 | Total gateway outage; irrecoverable trace-DB corruption; suspected key compromise with active exploitation; exposure of customer prompts / completions; provider bill threshold breached | 15 min, 24/7 | 1 hour | 4 hours |
| SEV2 | One inbound shape down; one upstream provider down with no failover; per-key analytics rollup wrong by ≥10%; ingress TLS expired | 1 hour, 24/7 | 4 hours | 1 business day |
| SEV3 | Elevated latency (p95 > 2× baseline); trace-DB near disk-full; a non-default model unavailable; one tenant’s quota-alert spamming | 1 business day | 3 business days | 1 week |
| SEV4 | Cosmetic dashboard bug, log noise, doc errors, deprecation warnings | Best effort | Best effort | Next sprint |
Re-evaluate severity at every status update. SEV2 → SEV1 mid-incident is normal; it tightens the comms cadence.
The gateway emits no native paging signal; alerts derive from three sources. Wire each to whatever paging provider you already use (PagerDuty, Opsgenie, Splunk On-Call, plain email + SMS).
| Source | Detection | Suggested rule |
|---|---|---|
GET /healthz (gateway), GET /health (server) |
External probe (UptimeRobot / Pingdom / Uptime Kuma) | Two consecutive failures in 60 s → SEV1; one → SEV2 |
| Container / pod logs (stdout) | Loki / CloudWatch / Stackdriver — filter traceback, quota_exceeded, key_revoked |
Any ERROR traceback within 5 min → SEV2; sustained quota_exceeded > 1/min/key → SEV3 |
| Trace-DB SQL probe | Cron (sqlite3 over the mounted PVC) every 1–5 min |
Per-model error rate > 5% over 10 min → SEV2; daily cost projection > 2× budget → SEV3 |
One primary on-call rotation. SEV1 wakes them; SEV2 pages in business hours, escalates out-of-hours; SEV3 / SEV4 go to a chatops queue. Trace-DB probe recipe (PagerDuty Events API v2; swap to your provider’s heartbeat URL as needed):
* * * * * metis /usr/local/bin/sqlite3 /var/lib/metis/metis.db \
"SELECT 100.0*SUM(json_extract(payload,'\$.stop_reason')='error')/COUNT(*) \
FROM events WHERE type='llm.call_completed' \
AND timestamp > datetime('now','-10 minutes');" \
| awk '$1 > 5 { exit 1 }' \
|| curl -fsS -X POST https://events.pagerduty.com/v2/enqueue -H 'content-type: application/json' \
-d "{\"routing_key\":\"$PD_KEY\",\"event_action\":\"trigger\",
\"payload\":{\"summary\":\"Metis LLM error rate > 5%\",
\"severity\":\"warning\",\"source\":\"metis-gateway\"}}"
Four beats. Time-box each; if you blow the box, escalate (page secondary, open a war room).
/healthz from a known-good vantage point. 200 = up; 5xx = sick
app; timeout = ingress / network broken.kubectl logs -n metis-gateway deploy/metis-gateway --tail=100 -c gateway
(or docker compose logs --tail=100 gateway).Classify blast radius.
route.decided.chain[*].verdict='unavailable' in the trace DB over
the last 15 min — that’s how routing flags a provider down.SELECT key_id, COUNT(*) FROM events WHERE
type='llm.call_completed' AND timestamp > datetime('now','-15
minutes') GROUP BY key_id.Set severity; post the initial status-page update
(status-page.md).
Get traffic flowing. The right move is rarely a fix — it’s the fastest revert. Order of preference:
helm rollback metis-gateway or
docker compose down && docker compose up -d with the previous tag.METIS_GATEWAY_GLOBAL_DEFAULT=openrouter:anthropic/claude-haiku-4-5
and restart. Per-(provider, model) availability tracking routes
around the dead provider for keys that don’t pin model; keys that
pin keep failing until flipped.kubectl rollout restart deploy/metis-gateway.respond 503, nginx return 503;) while you investigate.Do not fix-forward in the first hour unless the fix is one line and locally reproduced. Mitigation first, root cause after.
Post on every status change and at least every 30 min — even if the
update is “still investigating.” Templates in
status-page.md. Internal channel
(#metis-incidents): same updates plus the raw commands you ran.
Blameless, structured, runs ≤ 1 week after resolution. Use this exact heading set so post-mortems are scannable.
# Incident YYYY-MM-DD — <short title>
**Severity:** SEV<n> **Duration:** HH:MM **Customer impact:** <plain English>
## Summary
Two sentences. What broke, what was user-visible, when it ended.
## Timeline (UTC)
- HH:MM — first signal (alert / customer report)
- HH:MM — ack by <on-call>
- HH:MM — mitigation X attempted (outcome)
- HH:MM — root cause identified
- HH:MM — full resolution; status page updated
## Impact
- Affected tenants: <list or "all">
- Affected endpoints: <list>
- Requests dropped: <count from trace DB>
- Estimated revenue / SLA-credit impact: <USD>
## Root cause
What actually broke. Code path, config, infra event. One paragraph.
Reference commit / line if applicable.
## Contributing factors
What made this worse than it had to be — slow detection, unclear
runbook, missing alert.
## What went well
Genuine. Fast detection, good rollback, clear comms.
## What went poorly
Equally genuine. No blame on individuals; focus on systems and process.
## Action items
| Item | Owner | Due | Tracking |
|--------------------------------------|---------|------------|---------------|
| Add alert for X | @alice | 2026-MM-DD | issue #123 |
Action items must have an owner, a date, and a tracking link. An item without a tracking link does not exist.
All llm.call_completed events for one provider carry
stop_reason='error'; routing has marked (provider, model)
unavailable and slot 6 / 7 is flipping to the next default.
Detect. SELECT json_extract(payload,'$.chain') FROM events
WHERE type='route.decided' AND timestamp > datetime('now','-5
minutes') ORDER BY timestamp DESC LIMIT 20 — look for
verdict='unavailable'.
Mitigate.
METIS_GATEWAY_GLOBAL_DEFAULT to a healthy provider’s
canonical id and restart. Clients pinning the dead provider still
fail; tell their owners to flip or drop the pin.OPENROUTER_API_KEY set,
route via openrouter:anthropic/claude-haiku-4-5. Same wire shape
client-side; higher latency, overlay pricing.Recover. Per-(provider, model) availability auto-clears on the next successful adapter probe.
Pod / container restart loop with
sqlite3.OperationalError: database disk image is malformed or
database or disk is full.
Detect.
df -h on the PVC mount. SQLite WAL grows metis.db-wal
until checkpoint; > 100 MB is normal at moderate traffic.sqlite3 /var/lib/metis/metis.db 'PRAGMA integrity_check'.Mitigate.
kubectl scale deploy/metis-gateway --replicas=0
(or docker compose stop gateway). Restoring under an active writer
is unsafe per
gateway-deployment.md.DELETE FROM events WHERE timestamp < '2026-04-01T00:00:00Z'; VACUUM;.
Resize the volume, restart.metis restore /backup/daily/metis.<latest>.db --db-path /var/lib/metis/metis.db --force.
You lose events between the backup and the crash — document under
“Impact” in the post-mortem.Prevent. Daily backup cron from
gateway-deployment.md
plus PVC-size alert at 80%.
A gw_… token leaked. SEV1 until scope confirmed.
Detect. Scope spend via the trace DB:
SELECT json_extract(payload,'$.model') AS model, COUNT(*) AS calls,
ROUND(SUM(json_extract(payload,'$.usage.cost_usd')),4) AS cost
FROM events WHERE type='llm.call_completed'
AND json_extract(payload,'$.gateway_key_id')='gk_01HXYZ...'
AND timestamp > datetime('now','-24 hours')
GROUP BY model;
Mitigate.
metis gateway revoke-key gk_01HXYZ....
Subsequent requests return 401 code='key_revoked' per
gateway.md §11.2; no pod restart needed.metis gateway issue-key --name "client-x-v2" --workspace /workspace --allow-model anthropic:claude-haiku-4-5 --daily-cap-usd 5.00.--daily-cap-usd was unset, cross-check the SQL against billing
and file a provider refund claim if their TOS allows.Prevent. Always issue with --daily-cap-usd and --allow-model.
Rotate quarterly via metis gateway rotate-key (default 24h grace).
One key’s spend rate is 10× normal. quota.alert fired (soft
threshold) but the hard breaker hasn’t, because the key is below
daily_cap_usd — or has no daily_cap_usd set.
Detect.
SELECT json_extract(payload, '$.gateway_key_id') AS key,
ROUND(SUM(json_extract(payload, '$.usage.cost_usd')), 4) AS cost_1h
FROM events WHERE type='llm.call_completed'
AND timestamp > datetime('now', '-1 hour')
GROUP BY key ORDER BY cost_1h DESC LIMIT 10;
Mitigate.
/analytics/by_key; the spike may be a legitimate
CI burst. Confirm before revoking — false-positive revocations
destroy trust faster than cost overruns destroy margin.Prevent. Always set --daily-cap-usd and --monthly-cap-usd at
issuance. quota.alert fires at 80% of the cap; tail it from the
trace DB into the same paging system as /healthz failures.
status-page.md, sla-template.md.../gateway-deployment.md — install, TLS, backup/restore, helm chart.../specs/gateway.md §11 — key lifecycle; ../specs/event-bus-and-trace-catalog.md §7.5 — backup/restore contract.