Status: v1 (Wave 14a, 2026-05-16)
Audience: SRE / platform operators running metis-gateway or metis-server in production.
Companion to incident-response.md: that doc tells you what to do when something goes wrong; this one tells you what your graphs are showing you and how to read them. It pairs 1:1 with the Prometheus alert rules and the Grafana dashboard that ship with the helm chart.

The spec contract for the metric surface itself is docs/specs/observability.md; this doc covers the operational contract: meaning, alert thresholds, runbook entries.
This page lists every metric the gateway and server expose: what it counts, the alert that fires on it, and the first thing to check when the alert pages.

| Metric | Type | Labels | What it means |
|---|---|---|---|
| metis_llm_calls_total | counter | provider, model, status | One increment per LLM API call. status is ok for completions, or the 8-value LLMErrorClass for failures. |
| metis_llm_call_errors_total | counter | provider, model, error_class | Failure-only counter split out so error-rate alerts don’t have to sum across status labels. |
| metis_llm_call_latency_seconds | histogram | provider, model | Wall-time per call, both success and failure paths. Bucket range covers 50 ms through 120 s. |
| metis_llm_cost_usd_total | counter | provider, model | Cumulative spend. Decimal is converted to float at the export boundary. |
| metis_gateway_key_cost_usd_total | counter | gateway_key_id | Per-key cost attribution. Agent-loop traffic (no key) buckets under gateway_key_id="null". |

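
The two export-boundary behaviours in the table above (Decimal-to-float conversion and the "null" bucket for keyless traffic) can be sketched as follows. This is a minimal illustration, not the gateway's actual code: `key_label` and `export_cost` are hypothetical helper names, and the cost values are made up.

```python
from decimal import Decimal
from typing import Optional


def key_label(gateway_key_id: Optional[str]) -> str:
    """Label value for metis_gateway_key_cost_usd_total (hypothetical helper).

    Agent-loop traffic carries no gateway key, so it is bucketed under the
    literal string "null" instead of an empty or missing label.
    """
    return gateway_key_id if gateway_key_id is not None else "null"


def export_cost(cost_usd: Decimal) -> float:
    """Convert an exact Decimal cost to float only at the export boundary."""
    return float(cost_usd)


# Accumulate in Decimal so repeated additions stay exact, then convert once.
total = sum((Decimal("0.0031"), Decimal("0.0042")), Decimal("0"))
exported = export_cost(total)
```

Because the counter is exported as a float, it suits rate-based alerting better than exact reconciliation.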

| Metric | Type | Labels | What it means |
|---|---|---|---|
| metis_routing_decisions_total | counter | winning_slot, chosen_model | One increment per route.decided event. winning_slot is the 7-value RoutingPolicyName literal. |
| metis_routing_decision_latency_seconds | histogram | (none) | Wall-time of the routing engine itself. Sub-millisecond in steady state; tails out under K-NN cluster-tightening regimes. |
| metis_pattern_matches_total | counter | chose_model, fingerprint_version | Slot-4 (pattern store) wins only. |
| metis_tool_call_latency_seconds | histogram | tool_name | Tool dispatcher wall-time, drained from both tool.completed and tool.failed. The collector correlates tool_name from the prior tool.called via a bounded LRU. |
| metis_tool_failures_total | counter | tool_name, error_class | Tool failures only, with the 8-value ToolErrorClass. |

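
The bounded-LRU correlation mentioned for metis_tool_call_latency_seconds can be pictured with the sketch below. It is an assumption-laden illustration rather than the collector's real code: the correlation key (`call_id`), the `BoundedLRU` class, the `label_for` helper, and the "unknown" fallback label are all hypothetical.

```python
from collections import OrderedDict
from typing import Optional


class BoundedLRU:
    """Maps a tool-call id to its tool_name, evicting the oldest entry when full."""

    def __init__(self, capacity: int) -> None:
        self._capacity = capacity
        self._items: "OrderedDict[str, str]" = OrderedDict()

    def put(self, call_id: str, tool_name: str) -> None:
        self._items[call_id] = tool_name
        self._items.move_to_end(call_id)
        if len(self._items) > self._capacity:
            self._items.popitem(last=False)  # drop the least recently written entry

    def pop(self, call_id: str) -> Optional[str]:
        # A completed or failed call is finished, so remove it on lookup.
        return self._items.pop(call_id, None)


def label_for(cache: BoundedLRU, call_id: str) -> str:
    # "unknown" is an assumed fallback for evicted or never-seen calls.
    return cache.pop(call_id) or "unknown"


cache = BoundedLRU(capacity=2)
cache.put("c1", "shell")      # tool.called
cache.put("c2", "read_file")  # tool.called
cache.put("c3", "grep")       # tool.called, evicts c1
```

The bound keeps memory flat if completion events never arrive; the cost is that very old in-flight calls lose their tool_name label.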

| Metric | Type | Labels | What it means |
|---|---|---|---|
| metis_gateway_auth_failures_total | counter | reason | Auth-time rejection counter. Three reasons: missing_token, invalid_token, key_revoked. |
| metis_gateway_keys_active | gauge | (none) | Number of is_active(now) keys in the keystore, polled at scrape. |
| metis_gateway_keys_revoked | gauge | (none) | Total minus active, so grace-period-expired keys count here even before the next admin sweep persists them. |
| metis_quota_used_ratio | gauge | identity_kind, identity_id | Per-identity (key/user/team) most-recent quota usage ratio. Pinned to 1.0 when gateway.quota_exceeded fires. |


| Metric | Type | Labels | What it means |
|---|---|---|---|
| metis_session_count | gauge | (none) | Server-only. Active in-memory sessions. |
| metis_eval_verdicts_total | counter | judge_kind, subject_kind | Evaluator verdicts (eval.completed only; eval.failed is not counted here). |
| metis_trace_wal_bytes | gauge | (none) | Trace-DB WAL file size, polled at scrape. |
| metis_pattern_embedding_cache_hit_ratio | gauge | workspace_id | v2 embedding-cache hit ratio per workspace. |

The helm chart ships four PrometheusRule alert templates under
prometheus-rules.yaml.
Each is off by default; enable via monitoring.prometheusRules.enabled: true and
tune individual thresholds in values.yaml. After enablement, triage
according to the runbook entries below.
Alert: MetisLLMCallLatencyP99High
Default threshold: p99 > 30 s for 5 min
PromQL:

    histogram_quantile(0.99,
      sum by (provider, model, le) (rate(metis_llm_call_latency_seconds_bucket[5m]))
    ) > 30

What it means. A specific (provider, model) pair has 1% of its calls
taking more than 30 seconds wall-time. The threshold matches the worst-case
turn-latency budget in sla-template.md; your SLA may
warrant tighter.
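
To see what the p99 expression computes, the sketch below re-implements, in simplified form, the linear interpolation that PromQL's histogram_quantile applies to cumulative buckets. The intermediate bucket bounds (1 s, 5 s, 30 s) and all counts are invented for illustration; only the 50 ms and 120 s endpoints come from the metric table.

```python
import math
from typing import List, Tuple


def histogram_quantile(q: float, buckets: List[Tuple[float, float]]) -> float:
    """Estimate a quantile from cumulative (le, count) buckets.

    Simplified version of the PromQL function; the last bucket must be le=+Inf.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_le, prev_count = 0.0, 0.0
    for le, count in buckets:
        if count >= rank:
            if math.isinf(le):
                # Quantile falls in the overflow bucket: report the highest finite bound.
                return prev_le
            # Assume observations are spread uniformly inside the bucket.
            return prev_le + (le - prev_le) * (rank - prev_count) / (count - prev_count)
        prev_le, prev_count = le, count
    return prev_le


# Cumulative counts per upper bound (seconds); illustrative numbers only.
sample = [(0.05, 0), (1.0, 50), (5.0, 90), (30.0, 98), (120.0, 100), (math.inf, 100)]
p99 = histogram_quantile(0.99, sample)
```

With these counts the 99th-percentile rank lands halfway through the 30 s to 120 s bucket, so the estimate is about 75 s and the alert condition (> 30) holds. The result is an interpolation, so its precision is limited by bucket width.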
First-action checklist (in priority order):

- Check metis_llm_call_errors_total for the same (provider, model). If errors are also up, you’re latency-bound because of retries: the upstream is degraded, not just slow. If errors are flat, the upstream is simply slow; consider failover.

Mitigations:

- Set METIS_GATEWAY_GLOBAL_DEFAULT to a healthy provider / model (incident-response.md §"Upstream LLM API outage" step 2).
- openrouter:...) if the direct provider is healthy.

False-positive patterns: legitimately long single calls (large thinking blocks, complex tool-use chains, long output). If the alert fires once a week for ~15 minutes and the dashboard shows a single tall bar in the histogram heat-map, it’s probably real-but-rare. Bump the threshold or the for clause.
Alert: MetisLLMErrorRateHigh
Default threshold: error rate > 5% for 10 min
PromQL:

    sum by (provider, model) (rate(metis_llm_call_errors_total[10m]))
    /
    sum by (provider, model) (rate(metis_llm_calls_total[10m])) > 0.05

What it means. A (provider, model) pair is failing more than 5% of
the time over a 10-minute window.
First-action checklist:

- Break the failures down by error_class. The five common shapes:
  - rate_limit: provider throttling. The client / agent is the cause (too high a sustained burst), not the gateway. Tell the client to back off, or raise the relevant provider’s account-level rate limit.
  - auth: provider key invalid or revoked. Rotate ANTHROPIC_API_KEY / OPENAI_API_KEY / OPENROUTER_API_KEY.
  - server_error (5xx): upstream provider issue. Status page first; failover if confirmed.
  - network: connection / TLS / DNS issue. Probe from inside the pod: kubectl exec deploy/metis-gateway -- curl -v https://api.anthropic.com. If repeated network errors trip provider-wide unavailability, restart the pod to reset adapter availability state.
  - context_overflow: client sent a request larger than the model’s window. The client’s responsibility; surface to them.

Mitigations: see incident-response.md §"Upstream LLM API outage".

False-positive patterns: a low-traffic model on a single canary client will spike its error rate to 100% on one bad call (numerator: 1, denominator: 1). The for: 10m clause filters most of these; raise the for duration for fleets with many low-traffic models.
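
The low-traffic false positive is easiest to see with numbers. The sketch below computes the same errors-over-calls ratio as the alert and adds a minimum-traffic guard; `min_calls` is a hypothetical parameter, not a shipped setting, and the sample counts are invented.

```python
def error_ratio(errors: float, calls: float) -> float:
    """Error ratio over a window; 0.0 when there was no traffic."""
    return errors / calls if calls > 0 else 0.0


def should_alert(errors: float, calls: float,
                 threshold: float = 0.05, min_calls: float = 20) -> bool:
    """Fire only when the ratio exceeds the threshold AND traffic is meaningful.

    min_calls is an assumed guard for illustration, not part of the shipped rule.
    """
    return calls >= min_calls and error_ratio(errors, calls) > threshold


canary = (1, 1)    # one call, one failure: ratio 1.0 on a denominator of 1
busy = (12, 200)   # ratio 0.06 on real traffic
```

Here the canary pair has a 100% error ratio but is suppressed by the traffic guard, while the busy pair alerts at 6%. In PromQL a comparable guard could be expressed by combining the ratio with a minimum call rate using the `and` operator, if a longer for duration alone is not enough.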
Alert: MetisGatewayAuthFailureRateHigh
Default threshold: > 0.1 failures/sec (≈ 360/hour) for 5 min
PromQL:

    sum(rate(metis_gateway_auth_failures_total[5m])) > 0.1

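
When tuning this threshold it helps to convert between the per-second value PromQL uses and the per-minute or per-hour figures people reason in. A small sketch of that arithmetic (the helper name `per_second_to` is hypothetical):

```python
def per_second_to(rate_per_sec: float) -> dict:
    """Express a per-second rate in per-minute and per-hour terms."""
    return {
        "per_minute": rate_per_sec * 60,
        "per_hour": rate_per_sec * 3600,
    }


# The default threshold of 0.1 failures/sec.
converted = per_second_to(0.1)
```

For the default threshold this gives roughly 6 failures per minute and 360 per hour, matching the figures quoted for this alert.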
What it means. The gateway is rejecting authentication at a sustained rate well above the steady-state baseline (~0 failures/sec under healthy operation; the alert fires above ~6 failures/min). This is the signature of:

- /v1/chat/completions (large reason="invalid_token" series, varied source IPs).
- token_hash_prefix showing up repeatedly in the audit log: metis audit export --event-type gateway.auth_failed).
- reason="key_revoked" series, source IPs all internal).
- reason="missing_token" from one IP).

First-action checklist:
Open the dashboard’s "Gateway auth failures by reason" panel. Which reason dominates? That decides triage:

| Dominant reason | Investigation |
|---|---|
| invalid_token | External attack OR internal client with wrong key. Check the token_hash_prefix distribution in the trace DB (SELECT json_extract(payload, '$.token_hash_prefix'), COUNT(*) FROM events WHERE type='gateway.auth_failed' GROUP BY 1 ORDER BY 2 DESC LIMIT 20;). If one IP / prefix dominates: external attacker; consider rate-limit middleware (gateway-hardening.md §3). If diffuse: a deploy regression. |
| key_revoked | Recent rotation. Cross-reference gateway.key_revoked events around the same timestamp; the rotation didn’t propagate to all clients. Reach out before re-issuing. |
| missing_token | Client misconfig. Check user-agent in the access log of your TLS terminator (Ingress / Caddy). |


- monitoring.rateLimit.enabled: true in helm values) to give the in-process limiter a chance to slow the scanner before it exhausts resources. This is not a substitute for an edge WAF: the buyer’s CDN / WAF should be the first line of defense. The rate-limit middleware protects against attackers who bypass the WAF.
- The gateway.auth_failed event is audit-flagged, so it survives the 90-day trace-retention sweep. Pull a long-window export for the security team: metis audit export /tmp/auth-failures.jsonl --event-type gateway.auth_failed --since 2026-04-01.

Mitigations:

- Set monitoring.rateLimit.enabled: true per gateway-hardening.md §3. The per-IP bucket defaults to 1000 RPM, well above any well-behaved client.

False-positive patterns: a deploy that flips the keystore without warning produces a brief key_revoked spike that resolves itself once clients catch up. If the alert resolves within 15 minutes after a planned key rotation, it’s the rotation. If it lingers, it is real.
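
The invalid_token triage query can be tried against a throwaway database before running it on production data. The sketch below uses Python's sqlite3 module with an assumed two-column events table (type, payload as JSON text); the real trace DB schema has more columns, the sample rows are invented, and SQLite's JSON functions must be available in your build.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
# Assumed minimal schema for illustration; the real trace DB differs.
conn.execute("CREATE TABLE events (type TEXT, payload TEXT)")

rows = [
    ("gateway.auth_failed", {"reason": "invalid_token", "token_hash_prefix": "ab12"}),
    ("gateway.auth_failed", {"reason": "invalid_token", "token_hash_prefix": "ab12"}),
    ("gateway.auth_failed", {"reason": "invalid_token", "token_hash_prefix": "ab12"}),
    ("gateway.auth_failed", {"reason": "invalid_token", "token_hash_prefix": "cd34"}),
    ("llm.call_completed", {"model": "example-model"}),
]
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(event_type, json.dumps(payload)) for event_type, payload in rows],
)

# Same shape as the runbook query: failures grouped by token prefix.
top_prefixes = conn.execute(
    """
    SELECT json_extract(payload, '$.token_hash_prefix') AS prefix,
           COUNT(*) AS n
    FROM events
    WHERE type = 'gateway.auth_failed'
    GROUP BY prefix
    ORDER BY n DESC
    LIMIT 20
    """
).fetchall()

# Share of failures owned by the most frequent prefix.
total_failures = sum(n for _, n in top_prefixes)
top_share = top_prefixes[0][1] / total_failures
```

A high `top_share` means one token accounts for most failures; a flat distribution means the failures are diffuse.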
Alert: MetisGatewayKeyCostSpike
Default threshold: > $10/hour per single gateway key, sustained 10 min
PromQL:

    sum by (gateway_key_id) (
      rate(metis_gateway_key_cost_usd_total{gateway_key_id!="null"}[1h]) * 3600
    ) > 10

What it means. One gateway key’s burn rate is above $10/hour over the
last 60 minutes. This catches runaway spend BEFORE the per-key daily /
monthly hard cap (quota.alert / gateway.quota_exceeded) fires.
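
The burn-rate arithmetic behind the expression is a per-second rate over the window, scaled to an hour. A minimal sketch, assuming two samples of the cumulative cost counter and no counter reset in between (PromQL's rate() handles resets; this helper does not). The function name and sample values are illustrative.

```python
def hourly_burn_usd(cost_start: float, cost_end: float, window_seconds: float) -> float:
    """Average $/hour over the window from two samples of a cumulative cost counter."""
    per_second = (cost_end - cost_start) / window_seconds
    return per_second * 3600


# Counter read 41.50 an hour ago and 54.00 now: $12.50 spent in the window.
burn = hourly_burn_usd(cost_start=41.50, cost_end=54.00, window_seconds=3600)
over_threshold = burn > 10
```

Because the window is a full hour, a short burst is averaged down; a key has to sustain elevated spend before the value crosses $10/hour.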
First-action checklist:

- Query the per-key analytics endpoint:

      curl "http://gateway/analytics/by_key?key=<gateway_key_id>" | jq .

- Break the key's spend down by model over the last hour in the trace DB:

      SELECT json_extract(payload, '$.model') AS model,
             COUNT(*) AS calls,
             ROUND(SUM(json_extract(payload, '$.cost_usd')), 4) AS cost
      FROM events
      WHERE type='llm.call_completed'
        AND json_extract(payload, '$.gateway_key_id')='<key>'
        AND timestamp > datetime('now', '-1 hour')
      GROUP BY model;

Mitigations:

- metis gateway revoke-key, then issue-key --daily-cap-usd 5.00 --allow-model anthropic:claude-haiku-4-5).
- --daily-cap-usd / --monthly-cap-usd at issuance) on similarly-scoped keys.

False-positive patterns: the first hour of a key’s lifetime when a client is bulk-loading a fresh embedding cache. The for: 10m clause filters short bursts; longer-running legitimate workloads may need a per-key threshold override.
The Grafana dashboard JSON ships at
infra/gateway/helm/dashboards/metis-gateway.json.
Import into Grafana 9+ via “Dashboards → Import → Upload JSON file”; bind
the DS_PROMETHEUS datasource to your existing Prometheus instance.
Layout (top-to-bottom):

- metis_quota_used_ratio), useful for budgeting and capacity planning.
- metis_trace_wal_bytes over time. Sustained growth above ~3× the auto-checkpoint threshold means a long-running reader is holding the checkpoint barrier; see trace-performance.md §WAL for the SQL probe.

- monitoring.enabled=true and monitoring.prometheusRules.enabled=false. The dashboard renders, no alerts fire.
- metis_llm_call_latency_seconds p99 → llmLatencyP99.threshold is ~2-3x that.
- metis_llm_call_errors_total / metis_llm_calls_total → llmErrorRate.threshold is ~2x that.
- rate(metis_gateway_auth_failures_total[5m]) → gatewayAuthFailureRate.threshold is ~3-5x that.
- gatewayKeyCostSpike.threshold is ~3-5x that.
- values.yaml, flip monitoring.prometheusRules.enabled=true, redeploy.
- key_revoked count) so you confirm the paging path works end-to-end.

- parent_event_id; use the trace DB metis evaluate for per-turn deep dives.
- metis_*_slo_compliant boolean metrics: operators wire SLO calculations in Prometheus itself or Grafana SLO panels against the raw counters and histograms.
- /analytics/cost, /analytics/by_key, /analytics/by_user, /analytics/by_team REST endpoints for budget reporting. The metis_gateway_key_cost_usd_total counter is for alerting on cost anomalies, not for monthly billing reconciliation.

- docs/specs/observability.md: the metric-surface contract.
- docs/operations/incident-response.md: what to do once an alert pages you.
- docs/operations/trace-performance.md: WAL gauge interpretation and SQLite-level tuning.
- docs/operations/sla-template.md: buyer-facing SLA that the latency / error thresholds are derived from.
- infra/gateway/helm/templates/prometheus-rules.yaml: alert rule definitions.
- infra/gateway/helm/dashboards/metis-gateway.json: Grafana dashboard JSON.