Spec Changes

This file tracks breaking and significant changes to specs in docs/specs/. Its purpose is to prevent cross-spec drift: when one spec changes a contract, this log records which other specs reference that contract and need verification.

How to use this file

When making a substantive change to a spec, add an entry below with:

Date — when the change was made.
Spec — which spec changed.
Change — one-line description.
Type — breaking (consumers must update) or additive (consumers can ignore).
References to verify — which other specs reference the changed contract and must be checked for consistency.
Status — pending review until cross-references are verified, then verified.

Trivial edits (typos, wording) don’t need entries. Use judgment.

When working on a spec PR, scan this file for pending review entries against specs you depend on; verify them before landing.
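
The scan described above can be sketched in Python. This is a minimal sketch, assuming each entry starts with a heading line that begins with an ISO date and carries a single `Status:` line, as the entries in the change log do; `pending_entries` is a hypothetical helper name, not part of the Metis tooling.

```python
import re

# Assumption: an entry heading is any line that starts with YYYY-MM-DD.
HEADING = re.compile(r"^\d{4}-\d{2}-\d{2}\b")


def pending_entries(text: str) -> list[str]:
    """Return the headings of entries whose Status is still 'pending review'."""
    pending: list[str] = []
    current = None
    for line in text.splitlines():
        if HEADING.match(line):
            current = line.strip()  # remember which entry we are inside
        elif current and line.startswith("Status:"):
            status = line[len("Status:"):].strip().lower()
            if status.startswith("pending review"):
                pending.append(current)
    return pending
```

Filtering the result down to the specs a PR depends on is left out; matching spec names against each entry's `Specs:` line would be one way to do it.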

Specs in scope

canonical-message-format.md — messages, content blocks, tool definitions, persistence.
event-bus-and-trace-catalog.md — bus interface, event catalog, trace store.
routing-engine.md — routing pipeline, rule format, delegate() contract.
streaming-protocol.md — WebSocket protocol, snapshot/replay, cancellation.
(planned) provider-adapter-contract.md — adapter interface, wire-format translation.
(planned) tool-dispatcher.md — tool registry, side-effect handling, validation.
(planned) server-api.md — REST endpoints, attach handshake, session lifecycle.
analytics-api.md — read-only /analytics/* namespace backing the dashboard.
benchmark.md — reproducible workload suite + measurement methodology backing the savings counterfactual.
deployment-shape.md — recommendation for the replacement-agent / gateway / hybrid fork. Resolves the project strategy (private) when signed off.
gateway.md — skeleton for the transparent HTTP gateway surface (paired with deployment-shape.md).
context-assembler.md — v1 covers prompt-cache breakpoint placement; v2 adds the minimum-cacheable-prefix padding rule; v3 adds skill activation (explicit + pre-activation paths, per-session budget, no auto-activation in v3); history compression remains later.
pattern-store.md — per-workspace bounded SQLite store of task fingerprints + outcomes that powers routing slot 4 (PATTERN_RECOMMENDATION). Phase 2.5.
skill-format.md — retrospective v1 (2026-05-13) of the existing skills loader / store / tools; conforms to agentskills.io.
evaluator.md — heuristic + hybrid LLM-as-judge feedback loop; emits eval.* events; resolves the project strategy (private) when signed off. Phase 3.
multi-user.md — per-user / per-team identity layer on top of the shipped per-key cost attribution; analytics rollups, routing-rule predicates, gateway-level circuit breakers. Drafted 2026-05-14; Phase 3 implementation pending.
delegation.md — Phase 4 design for worker sessions and the delegate() tool: slot 5 (DELEGATE_REQUEST) consumer contract, worker lifecycle, isolation, cost attribution, integration with pattern store + evaluator. Drafted 2026-05-14; Phase 4 implementation pending.
pricing.md — commercial pricing model recommendation (open-core gateway + per-seat Pro + reserved enterprise %-of-savings add-on); surveys candidate models, names trade-offs, composes with multi-user.md §5. Drafted 2026-05-14; awaiting owner ratification — does not close the project strategy (private).
skill-curator.md — periodic auxiliary-model maintenance of agent-authored skills (pin / archive / consolidate / edit); shared BudgetTracker with the evaluator, sidecar JSON state, archive-not-delete; pattern lifted from hermes-agent agent/curator.py. Drafted 2026-05-14; gated on agent-authored skills (Phase 2.5) landing first.
api-versioning.md — pins the versioning posture for Metis’s two HTTP surface categories: provider-shape endpoints (frozen by upstream SDK contracts; /v1/chat/completions, /v1/messages) and Metis-owned endpoints (versioned by us via the Metis-API-Version header). v1 enforcement live as of Wave 11 (2026-05-15): below-min / past-sunset return HTTP 410; Metis-API-Versions-Supported discovery header on every response; OPTIONS pre-flight short-circuits with 204; request.state.metis_api_version plumbed to handlers.
observability.md — Prometheus /metrics endpoint shipped on both metis-server and metis-gateway; bus-driven MetricsCollector projects catalog events onto a bounded counter/gauge/histogram set. Drafted 2026-05-15; v1 shipped. v1.1 (Wave 14a, 2026-05-16) adds production-grade extensions: latency-percentile histograms for routing + tool dispatch, dedicated LLM/tool error counters, gateway auth-failure tracking via the new gateway.auth_failed audit-flagged event, per-key cost counter, four PrometheusRule alert templates + Grafana dashboard JSON + observability-runbook.md.
gateway-hardening.md — perimeter posture: TLS termination options (Caddy / nginx-ingress / cloud LB / in-process), per-key + per-IP token-bucket rate limiting (off by default), connection-rate cap (default 1000), alert-only abuse detection, gateway-key leak detection, DDoS delegated to the buyer’s edge. Drafted 2026-05-15; v1 shipped (Wave 13): loopback-only bind constraint lifted in favor of explicit opt-in via --host 0.0.0.0, in-process TLS via --tls-cert/--tls-key, SO_REUSEPORT socket for graceful restart.
audit-log.md — filtered projection of the trace store flagged as security/compliance-relevant: 9-type subset (key lifecycle + quota + policy + eviction + confirmation), JSONL / CSV deterministic export, metis audit export CLI. Drafted 2026-05-15; v1 shipped. Retention sweep (12a-2) reads AUDIT_EVENT_TYPES to exempt audit rows.
trace-retention.md — sliding-window retention for the trace DB: TraceStore.purge_older_than(cutoff, dry_run=True) with AUDIT_EVENT_TYPES exemption, trace.swept audit-flagged eviction event, metis trace prune CLI (--days 90 default, --dry-run opt-in), optional Helm CronJob template. Drafted 2026-05-15; v1 shipped (Wave 12a-2).
redaction.md — canonical redaction policy + four-mode EventRedactor (passthrough / pseudonymize / redact_private / aggregate_only) for trace exports; per-event-type identity-pseudonymization + PRIVATE-tier text-strip rules; layered on top of 12a-2’s existing Redactor Protocol + PseudonymizingRedactor default. Wires --redact <mode> into metis audit export (12a-1) and adds a dry-run “would affect N events” preflight to metis user forget. Drafted 2026-05-15; v1 shipped (Wave 12a-3).
credentials.md — CredentialResolver Protocol + 5-step resolution chain (CLI flag → env var → ~/.metis/credentials.yaml → ~/.metis/.env legacy dotenv → OS keychain (deferred)); structured YAML file with mode-0o600 enforcement + atomic write; metis auth {add,list,remove,test,doctor} CLI surface; runtime hookup in both cli/runtime.py and gateway/runtime.py. Drafted + shipped 2026-05-20 (v1).

Cross-reference map

A snapshot of which specs reference which (refresh when adding a spec):

| Source spec | Depends on |
|---|---|
| `canonical-message-format.md` | (none — foundation) |
| `event-bus-and-trace-catalog.md` | canonical-message-format, routing-engine |
| `routing-engine.md` | canonical-message-format, event-bus-and-trace-catalog |
| `streaming-protocol.md` | canonical-message-format, event-bus-and-trace-catalog, routing-engine |
| `provider-adapter-contract.md` (planned) | canonical-message-format, event-bus-and-trace-catalog, streaming-protocol |
| `tool-dispatcher.md` (planned) | canonical-message-format, event-bus-and-trace-catalog |
| `server-api.md` (planned) | canonical-message-format, event-bus-and-trace-catalog, streaming-protocol |
| `analytics-api.md` | canonical-message-format, event-bus-and-trace-catalog, server-api |
| `benchmark.md` | analytics-api, event-bus-and-trace-catalog, canonical-message-format, provider-adapter-contract |
| `deployment-shape.md` | the project strategy (private), market-research/synthesis.md (rationale only — no contract dependency) |
| `gateway.md` | canonical-message-format, provider-adapter-contract, routing-engine, event-bus-and-trace-catalog, server-api, analytics-api |
| `context-assembler.md` | canonical-message-format, provider-adapter-contract (planned), analytics-api |
| `pattern-store.md` | canonical-message-format, event-bus-and-trace-catalog, routing-engine, memory-store, analytics-api, evaluator |
| `skill-format.md` | canonical-message-format, event-bus-and-trace-catalog, tool-dispatcher, context-assembler |
| `evaluator.md` | event-bus-and-trace-catalog, canonical-message-format, analytics-api, benchmark, routing-engine, pattern-store (planned) |
| `multi-user.md` | canonical-message-format, event-bus-and-trace-catalog, gateway, routing-engine, analytics-api |
| `delegation.md` | canonical-message-format, event-bus-and-trace-catalog, routing-engine, streaming-protocol, server-api, tool-dispatcher, context-assembler, pattern-store, evaluator, analytics-api |
| `pricing.md` | the project strategy (private), deployment-shape, multi-user, analytics-api, gateway, canonical-message-format (rationale + composability — no contract dependency) |
| `skill-curator.md` | skill-format, event-bus-and-trace-catalog, evaluator, canonical-message-format, analytics-api, memory-store, multi-user (planned), pattern-store (planned) |
| `api-versioning.md` | gateway, analytics-api, server-api (planned) |
| `gateway-hardening.md` | gateway, multi-user, server-api, event-bus-and-trace-catalog |
| `observability.md` | event-bus-and-trace-catalog, gateway, server-api (planned), multi-user, evaluator, pattern-store |
| `redaction.md` | event-bus-and-trace-catalog, audit-log, analytics-api, multi-user, canonical-message-format |
| `audit-log.md` | event-bus-and-trace-catalog, canonical-message-format, gateway, multi-user, analytics-api |
| `trace-retention.md` | event-bus-and-trace-catalog, audit-log, STRATEGY (rationale) |

When changing a spec, check its dependents: every spec in the left column whose right-column "Depends on" list names the changed spec.
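
Finding the specs to re-check is a reverse lookup over the map: collect every source spec whose "Depends on" list names the changed spec. A minimal sketch, assuming the map has been copied by hand into a dict (only a few rows are shown, and `dependents_of` is a hypothetical helper):

```python
# Assumption: a hand-copied subset of the cross-reference map, keyed by spec stem.
DEPENDS_ON = {
    "canonical-message-format": [],
    "event-bus-and-trace-catalog": ["canonical-message-format", "routing-engine"],
    "routing-engine": ["canonical-message-format", "event-bus-and-trace-catalog"],
    "streaming-protocol": [
        "canonical-message-format",
        "event-bus-and-trace-catalog",
        "routing-engine",
    ],
    "trace-retention": ["event-bus-and-trace-catalog", "audit-log"],
}


def dependents_of(changed: str, depends_on: dict[str, list[str]]) -> list[str]:
    """Specs whose 'Depends on' list names the changed spec, sorted by name."""
    return sorted(src for src, deps in depends_on.items() if changed in deps)


print(dependents_of("routing-engine", DEPENDS_ON))
# ['event-bus-and-trace-catalog', 'streaming-protocol']
```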

Change log

2026-05-20 — credentials.md v1 shipped: `CredentialResolver` + `metis auth` CLI + runtime hookup

Specs: credentials.md bumps from Draft v1 to Shipped v1; §9 open questions 1 and 2 resolved (error-message only, no pre-call cost disclosure); §6.2 ProviderSpec sketch extended with auth_header_name / auth_header_value_template / extra_headers to accommodate Anthropic’s x-api-key header shape; deviations recorded in a new “v1 implementation deviations” subsection.
Change: New module metis.core.credentials (protocol.py, providers.py, resolver.py, file.py, errors.py) implements the 5-step resolution chain (CLI flag → env var → ~/.metis/credentials.yaml → ~/.metis/.env legacy dotenv → keychain (deferred)). New CLI surface metis auth {add,list,remove,test,doctor} (packages/metis/src/metis/cli/auth.py) handles setup + diagnostics; add uses stdlib getpass so keys are not echoed; list / doctor render only the <first 8>...<last 4> truncation per spec §5.2; test pings the per-provider validate endpoint. Both cli/runtime.py and gateway/runtime.py drop their direct os.environ.get calls and instantiate a DefaultCredentialResolver instead; resolver injection is accepted via a new credentials_resolver= keyword for tests / Pro overlays. File operations enforce mode 0o600 (CredentialsFileInsecure raised on load if wider) and use the same write-temp-then-rename pattern as gateway.keystore_admin.atomic_write_keystore. schema_version=1 is enforced (CredentialsFileSchemaUnknown is raised for unknown versions; migration is forward-only). 37 new tests (16 unit + 21 CLI) across packages/metis/tests/core/credentials/ and packages/metis/tests/cli/test_auth_cli.py.
Type: additive. Existing env-var workflows are unchanged — the resolver finds env vars on step 2 of the chain. A pre-existing ANTHROPIC_API_KEY setup with no ~/.metis/credentials.yaml produces byte-identical behavior. The error message when zero credentials are configured changes from “set ANTHROPIC_API_KEY, OPENAI_API_KEY, and/or OPENROUTER_API_KEY (in env or .env)” to “no credentials configured. Run metis auth add anthropic (or set ANTHROPIC_API_KEY in env / .env).” — narrow surface, not a contract change.
References to verify:
- AGENTS.md “What works” — new entry pointing at the metis auth surface. ✓
- README.md Quick-start — metis auth add anthropic listed first; env-var path preserved as the “12-factor / CI” alternative. ✓
- docs/specs/credentials.md — status header reads “Shipped v1”; §9 records the v1 resolutions; new “v1 implementation deviations” subsection records the ProviderSpec extension. ✓
Status: verified. Full suite green: uv run pytest -q → 1808 passed, 1 skipped (37 new + 1771 prior).
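
The first-match-wins resolution order and the `<first 8>...<last 4>` truncation from this entry can be illustrated as follows. This is a sketch under stated assumptions, not the shipped `metis.core.credentials` code: `resolve`, `build_steps`, and `mask` are hypothetical names, the file-backed steps are stood in for by plain mappings, and the handling of very short keys is a guess.

```python
from typing import Callable, Mapping, Optional

Lookup = Callable[[str], Optional[str]]


def build_steps(
    cli_flags: Mapping[str, str],
    env: Mapping[str, str],
    yaml_keys: Mapping[str, str],   # stand-in for ~/.metis/credentials.yaml
    dotenv: Mapping[str, str],      # stand-in for ~/.metis/.env
) -> list[tuple[str, Lookup]]:
    """Order the lookups as in the chain: flag, env var, YAML file, dotenv."""
    def env_name(provider: str) -> str:
        return f"{provider.upper()}_API_KEY"

    return [
        ("cli-flag", lambda p: cli_flags.get(p)),
        ("env", lambda p: env.get(env_name(p))),
        ("credentials.yaml", lambda p: yaml_keys.get(p)),
        ("dotenv", lambda p: dotenv.get(env_name(p))),
        # Step 5 (OS keychain) is deferred in v1, so it is omitted here.
    ]


def resolve(provider: str, steps: list[tuple[str, Lookup]]) -> Optional[tuple[str, str]]:
    """Return (source, key) from the first step that yields a non-empty value."""
    for source, lookup in steps:
        value = lookup(provider)
        if value:
            return source, value
    return None


def mask(key: str) -> str:
    """Show only the first 8 and last 4 characters of a key."""
    # Assumption: keys too short to truncate safely are hidden entirely.
    if len(key) <= 12:
        return "*" * len(key)
    return f"{key[:8]}...{key[-4:]}"
```

With only `ANTHROPIC_API_KEY` set in `env`, `resolve("anthropic", steps)` returns the env value at step 2, which matches the "existing env-var workflows are unchanged" claim in the Type line.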

2026-05-16 — Wave 16 GA launch sync: Phase 3 shipped sign-off, billing self-service, first-customer concierge artifacts, launch collateral, and operational readiness

Specs: gateway.md §13 documents the Wave 15 + Wave 16 billing HTTP surface (/account/billing, /account/billing/portal, /account/billing/plan, legacy subscription/payment-method/cancel/pause/resume, /webhooks/stripe) plus plan and failed-payment semantics. This is also a top-level launch / implementation sync, plus one owner decision recorded in the phase-claim proposal (private) §8. No provider-shape endpoint or existing Metis-owned versioned contract is removed.
Change: (a) Owner ratified the phase-claim proposal’s Position B: Phase 3 shipped, no “Phase 4 v1 started” claim. AGENTS.md and README status mirrors now use that posture and record Wave 16 as the GA launch milestone. (b) Billing self-service adds GET /account/billing/portal and POST /account/billing/plan, failed-payment grace / post-grace free-tier downgrade, payment-succeeded restoration, enterprise add-on attach/remove, and a billing operator guide. (c) First-customer concierge tooling adds metis customer-report --anonymize, gitignored benchmarks/customers/ scaffolding, industry case-study templates, and the first-customer runbook (private). (d) Launch artifacts add a product-site GA announcement, homepage / compare / pricing refresh, README launch callout, sales one-pager / FAQ / competitive-comparison refresh. (e) Operational readiness adds status-page-config.yaml, launch-day playbook, pre-launch dry-run checklist, and support-channel templates. (f) §A3 task-domain workloads are explicitly deferred post-GA; delegation remains the validated routing-surface GTM lever.
Type: additive docs + additive billing/CLI behavior. Existing billing routes stay mounted under the same opt-in --enable-billing posture. Existing metis customer-report output is unchanged unless --anonymize is supplied. The phase-claim edit is a status decision, not a spec contract.
References to verify:
- AGENTS.md Status sentence — Phase 3 shipped + Wave 16 GA milestone + 1841 tests. ✓
- AGENTS.md “What works” — Wave 16 bullets for billing self-service, first-customer concierge artifacts, launch artifacts, operational readiness, and Wave 16 docs sync. ✓
- AGENTS.md “What’s NOT built” — post-GA deferrals only, including the optional §A3 task-domain wedge. ✓
- README.md Status block + launch callout + pricing / sales / operations pointers — refreshed for GA. ✓
- docs/the project strategy (private) — GA launch posture and first-customer proof-of-savings gap updated. ✓
- docs/the project strategy (private) — Phase 3 shipped + Wave 16 GA launch entries appended. ✓
- docs/specs/gateway.md §13 — billing endpoints + plan/grace semantics match the Wave 16 implementation. ✓
- the phase-claim proposal (private) §8 — owner sign-off recorded. ✓
Status: verified. Full suite green: uv run pytest -q → 1841 passed, 1 skipped in 47.15s on 2026-05-16.

2026-05-16 — AGENTS.md / README.md test-count + Wave-15 doc-sync (GA-blocker fixes + concierge + status-page + observability v1.1 + billing); phase-claim posture preserved

Specs: none touched. Pure high-level doc housekeeping after the Wave 15 batch landed (15a-1 NETWORK refinement + 15a-2 model normalization + 15a-3 concierge-onboarding flow + 15a-4 status-page live-deployment recipe + 15a-6 observability v1.1 + 15a-7 billing module + §A3-rev7 completion + pricing.md §5.5.4 ratification). The earlier dated entries in this CHANGES file (the seven 2026-05-16 — … entries below this one) are the load-bearing spec / event-catalog / payload-registry changes; this entry is the top-level AGENTS.md / README.md / the project strategy (private) reflection of those landings.
Change: (a) AGENTS.md Status sentence extended through Wave 15 (GA-blocker-1 NETWORK refinement + GA-blocker-2 model normalization + concierge tools + billing module per ratified pricing.md §5.5.4 + status-page live-deployment recipe + customer_tier keystore extension + observability v1.1). Test count bumped 1780 → 1829 (in both the Status sentence and the # Tests (...) comment in the “Running things” block). Phase-claim posture explicitly preserved: AGENTS.md still says “ready-for-review” because the phase-claim proposal (private) is unchanged from Wave 13 and no owner sign-off is recorded. Five new “What works” entries land between the Wave-14a Production-grade-observability-extensions bullet and the Wave-14-docs-sync bullet: NETWORK error refinement (Wave 15, 15a-1) / Gateway model normalization (Wave 15, 15a-2) / Concierge onboarding tooling (Wave 15, 15a-3) / Status-page live deployment recipe (Wave 15, 15a-4) / Billing module (Wave 15, 15a-7). A sixth “Wave 15 docs sync” entry follows the Wave-14-docs-sync entry. (b) README.md Status block mirrors AGENTS.md: Wave 15 capability summary appended; 1780 → 1829 test count; the “1829 tests” bullet picks up customer_tier keystore extension + billing module subscription lifecycle + webhook idempotency + tier-axis quota composition + metis customer-report + metis trial-status mentions. (c) the project strategy (private) picks up a Wave-15 consolidating dated entry summarizing GA-blocker closures + concierge tools + billing module + observability v1.1 + status-page recipe + phase-claim-stays-ready-for-review posture. The §A3-rev7 completion + Adopt-pricing-model entries (added by other 2026-05-16 sessions) are preserved verbatim.
Type: docs-only. No event-catalog change. No spec contract change. No new event payload registry membership. (1) AGENTS.md “What works” gains five new bullets — each is a self-contained summary; the existing “Wave 14a” / “Wave 14b” labels on prior bullets are preserved. (2) README.md test count line is the only structural surface that mirrors AGENTS.md exactly. (3) the project strategy (private) is append-only; no existing rows modified. (4) Phase-claim-proposal.md is explicitly NOT modified — the doc-sync rule per the phase-claim proposal (private) §7 requires owner sign-off to bump.
References to verify:
- AGENTS.md Status sentence — Wave 15 work appended; phase-claim posture unchanged (“ready-for-review”); test count 1829 tests passing. ✓
- AGENTS.md “What works” — five new bullets land between the Wave-14a observability bullet and the Wave-14-docs-sync bullet; Wave-15-docs-sync bullet follows. ✓
- AGENTS.md “Running things” block — # Tests (1829 currently — …) matches the live uv run pytest count (1829 passed, 1 skipped on 2026-05-16). ✓
- README.md Status block — phase-claim posture mirrors AGENTS.md; test count 1829; capability summary picks up GA-blocker closures + billing + concierge + status-page recipe. ✓
- README.md “1829 tests” bullet — test family list picks up customer_tier keystore extension + billing module + metis customer-report + metis trial-status mentions. ✓
- the project strategy (private) — Wave 15 consolidating entry appended; §A3-rev7 completion + Adopt-pricing-model entries preserved. ✓
- the phase-claim proposal (private) — unchanged from Wave 13; owner sign-off not recorded. ✓
Status: verified. Full suite green: uv run pytest -q | tail -1 → 1829 passed, 1 skipped in 42.31s on 2026-05-16 (the skipped test is the same unrelated SO_REUSEPORT platform skip in apps/gateway/tests/test_run_gateway_bind.py flagged in prior Wave-15 entries).

2026-05-16 — pricing.md ratified (§5.5.4 open-core + per-seat Pro + reserved enterprise %-of-savings); Wave 15 billing module shipped

Specs: docs/specs/pricing.md status line bumped from Draft v1 — recommendation, awaiting owner sign-off / commercial decision to Ratified — §5.5.4 open-core gateway + per-seat Pro + reserved enterprise %-of-savings add-on (2026-05-16); §14 retired the placeholder the project strategy (private) edit queue, replaced with the dated ratification record. docs/the project strategy (private) gains a 2026-05-16 dated entry; docs/the project strategy (private) retired with Resolved 2026-05-16: open-core gateway + per-seat Pro, with reserved enterprise %-of-savings add-on. See pricing.md. docs/specs/gateway.md §13 new — “Billing (Wave 15)” — documents the Stripe-backed billing surface mounted under /account/billing/* + the webhook listener at /webhooks/stripe. docs/specs/event-bus-and-trace-catalog.md §6.14 gains six new audit-flagged billing event types: billing.customer_created / billing.subscription_created / billing.subscription_updated / billing.subscription_canceled / billing.invoice_paid / billing.invoice_payment_failed. docs/specs/audit-log.md §5.1 AUDIT_EVENT_TYPES extends with the same six.
Change: Lands the Wave-15 billing module per the just-ratified §5.5.4 hybrid. (a) NEW apps/gateway/src/metis_gateway/billing/ — module-scoped Stripe integration. client.py wraps the Stripe HTTP API behind a thin BillingClient Protocol with two implementations: StripeBillingClient (real stripe.Customer.create / stripe.Subscription.create / stripe.SubscriptionItem.create_usage_record / stripe.PaymentMethod.attach / stripe.Webhook.construct_event) and FakeBillingClient (in-memory event log; the test substrate). store.py is the SQLite-backed BillingStore persisting customer_records (Stripe customer id ↔ Metis account id ↔ workspace ↔ tier) + subscription_records (subscription id, status, current_period_end, item_ids keyed by pro_seat and enterprise_savings_metered) + processed_events (webhook idempotency). subscriptions.py orchestrates the per-account tier-transition state machine: create_pro_subscription(account_id, seats) → calls Stripe, persists, emits billing.subscription_created; attach_enterprise_addon(account_id, savings_rate_pct, monthly_cap_usd) → adds the metered usage SubscriptionItem; record_savings_usage(account_id, savings_usd) → posts a metered usage record keyed on the cents-of-savings the buyer recouped this billing cycle. webhooks.py handles the four Stripe events Wave 15 cares about (customer.subscription.updated, customer.subscription.deleted, invoice.payment_succeeded, invoice.payment_failed); webhook signature verification is via stripe.Webhook.construct_event against STRIPE_WEBHOOK_SECRET; processed-event dedupe by Stripe event id keeps replays idempotent. routes.py mounts /account/billing/* (subscription view / payment-method update / cancel-or-pause) and /webhooks/stripe. 
(b) Billing-tier axis on quota enforcement: apps/gateway/src/metis_gateway/quota.py QuotaConfig gains tier: Literal["free", "pro", "enterprise"] + tier_overrides: dict[str, QuotaCaps] so the existing per-key / per-user / per-team / per-workspace caps now compose with the billing tier. Free tier ships daily_cap_usd=0.17 / monthly_cap_usd=5.0 defaults (per spec); Pro / Enterprise default unlimited at the tier level (still bounded by the per-(user/team/key/workspace) caps). The enforcement order is tier_cap ≥ workspace_cap ≥ team_cap ≥ user_cap ≥ key_cap; a request that would breach any one returns the existing quota.alert event + 429. (c) /account/billing routes: GET /account/billing returns the current subscription summary (Stripe status, current period end, line items, payment method last4 / brand if attached); POST /account/billing/payment-method accepts a Stripe payment method id (created client-side via Stripe.js) and attaches it; POST /account/billing/cancel and POST /account/billing/pause operate on the Pro subscription (cancel-at-period-end vs. immediate pause via pause_collection.behavior=void). All routes require a valid Metis session token from the existing signup flow (auth.py require_session). (d) Audit events: six new payload structs in packages/metis-core/src/metis_core/events/payloads.py — BillingCustomerCreated / BillingSubscriptionCreated / BillingSubscriptionUpdated / BillingSubscriptionCanceled / BillingInvoicePaid / BillingInvoicePaymentFailed; all PSEUDONYMOUS sensitivity (carry account_id not email); all listed in AUDIT_EVENT_TYPES so they survive the trace-retention sweep. (e) CLI: new metis billing subcommand group (apps/cli/src/metis_cli/billing.py) — metis billing status --account-id <id> (operator’s view of any account’s subscription), metis billing usage-record --account-id <id> --savings-usd <amount> (manual metered-usage post for the enterprise add-on; the recurring auto-post is via metis billing sweep-usage running periodically). 
(f) Helm: infra/gateway/helm/values.yaml gains a billing block (enabled: false default, stripeApiKeySecret, stripeWebhookSecret, storePath: /var/lib/metis-gateway/billing.db); infra/gateway/helm/templates/secret-billing.yaml new template gated on billing.enabled. (g) Tests: 38 new in apps/gateway/tests/test_billing/ — happy-path subscription creation against FakeBillingClient (5 tests), webhook signature verification + idempotent replay (8 tests), customer.subscription.updated / customer.subscription.deleted / invoice.payment_succeeded / invoice.payment_failed end-to-end (12 tests), Free-tier $5/mo cap enforcement composed with per-key caps (6 tests), Pro tier unlimited-at-tier + per-user cap composition (4 tests), Enterprise metered-usage post (3 tests).
Type: additive across the board. (1) Billing is opt-in: GatewayConfig.billing: BillingConfig | None = None; mounting the routes + webhook listener requires --enable-billing (or helm billing.enabled: true). Pre-Wave-15 deployments byte-identical. (2) Six new event types — additive to the payload registry + AUDIT_EVENT_TYPES. (3) QuotaConfig.tier defaults to "free" to match the spec’s “free tier is the entry point” framing; existing deployments that don’t enable billing keep the new tier="free" default but the tier_overrides block defaults to unlimited at the tier level for backwards-compat (the $5/mo cap is only active when billing is enabled AND the account is on the free plan). (4) BillingStore is a new SQLite file (~/.metis/gateway/billing.db default) — never reads from or writes to the trace DB; observability is via the new audit events. (5) Stripe is a hard runtime dependency only when billing is enabled: pyproject.toml apps/gateway gains stripe>=9 as an optional [billing] extra; without the extra installed, importing metis_gateway.billing raises a clean ImportError with the install hint. (6) The FakeBillingClient lets the full test suite run without stripe installed; the CI matrix exercises both.
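
How the tier axis composes with the existing caps can be sketched as a walk over the scopes in the enforcement order given above. This is an illustration only: `first_breached_scope` is a hypothetical function, `None` standing for "unlimited" is an assumption, and whether the real check uses "exceeds" or "reaches" the cap is not stated in this entry.

```python
from typing import Mapping, Optional

# Order taken from the entry above: tier, workspace, team, user, key.
SCOPES = ("tier", "workspace", "team", "user", "key")


def first_breached_scope(
    spent_usd: Mapping[str, float],
    caps_usd: Mapping[str, Optional[float]],
    request_cost_usd: float,
) -> Optional[str]:
    """Return the first scope whose cap this request would exceed, else None."""
    for scope in SCOPES:
        cap = caps_usd.get(scope)  # missing or None is treated as unlimited
        if cap is not None and spent_usd.get(scope, 0.0) + request_cost_usd > cap:
            return scope
    return None
```

A non-`None` result corresponds to the path that emits `quota.alert` and returns 429; an unlimited tier (`None`) simply falls through to the per-workspace, per-team, per-user, and per-key caps.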
References to verify:
- docs/specs/pricing.md §14 — status ratified, dated entry quotes §5.5.4 verbatim. ✓
- docs/the project strategy (private) 2026-05-16 entry — quotes pricing.md §5.5.4 + retires §6.8. ✓
- docs/the project strategy (private) — retired with dated resolution pointing at pricing.md. ✓
- docs/specs/gateway.md §13 — new “Billing (Wave 15)” section names the 4 routes + webhook listener + 6 audit events. ✓
- docs/specs/event-bus-and-trace-catalog.md §6.14 — 6 new payload schemas + sensitivities match payloads.py. ✓
- docs/specs/audit-log.md §5.1 — AUDIT_EVENT_TYPES frozenset literal extends with the 6 billing event names (Wave 15 comment). ✓
- apps/gateway/src/metis_gateway/quota.py — QuotaConfig.tier + tier_overrides integrate cleanly with the existing per-(user/team/key/workspace) cap composition; the $5/mo free-tier cap defaults match pricing.md’s “free-tier spend cap floor” framing. ✓
- apps/cli/src/metis_cli/main.py — metis billing subcommand group wired alongside metis gateway. ✓
- infra/gateway/helm/values.yaml billing block — matches the BillingConfig shape one-for-one. ✓
Status: verified at the unit-test level. 38 new tests pass against FakeBillingClient; ruff clean; helm chart lint + template clean with billing.enabled=true,billing.stripeApiKeySecret=stripe-creds,billing.stripeWebhookSecret=whsec-test. Stripe live-mode validation deferred — Wave 15 uses Stripe test mode exclusively (no live API spend per the task brief).
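
The processed-event dedupe that keeps webhook replays idempotent can be sketched with a primary-key insert. This is a minimal sketch, assuming a single-column `processed_events` table; the real `BillingStore` schema is not shown in this entry, `open_store` and `handle_once` are hypothetical names, and signature verification is out of scope here.

```python
import sqlite3
from typing import Callable


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the store and make sure the dedupe table exists."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS processed_events (event_id TEXT PRIMARY KEY)"
    )
    return conn


def handle_once(conn: sqlite3.Connection, event_id: str, handler: Callable[[], None]) -> bool:
    """Run handler for a new event id; return False if the id was seen before."""
    with conn:  # commits on success, rolls back if the handler raises
        try:
            conn.execute(
                "INSERT INTO processed_events (event_id) VALUES (?)", (event_id,)
            )
        except sqlite3.IntegrityError:
            return False  # replay: the id is already recorded
        handler()
    return True
```

Recording the id and running the handler inside one transaction means a handler failure rolls the insert back, so a later redelivery of the same event is processed rather than silently dropped.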

2026-05-16 — observability.md v1.1: latency percentiles, error-rate counters, gateway auth-failure tracking + PrometheusRule + Grafana dashboard + observability-runbook.md (Wave 14a)

Specs: docs/specs/observability.md bumps v1 → v1.1 with a new §3.2 “Wave 14a extensions” enumerating six additive metrics (one routing-latency histogram, one tool-latency histogram, one LLM-error counter, one tool-failure counter, one gateway-auth-failure counter, one per-key cost counter) + a new §9 “Alert rule templates” pointing at the helm PrometheusRule template + Grafana dashboard. §3.1 cardinality discipline extends with error_class / tool_name / reason enums and documents the gateway_key_id="null" bucket convention. docs/specs/event-bus-and-trace-catalog.md §6.13 gains gateway.auth_failed with the GatewayAuthFailureReason 3-value enum payload schema. docs/specs/audit-log.md §5.1 AUDIT_EVENT_TYPES gains gateway.auth_failed.
Change: Production-grade observability extensions per the GA-readiness operational checklist. (a) NEW event gateway.auth_failed with payload {reason, inbound_shape, token_hash_prefix, gateway_key_id} — emitted at the gateway’s auth gate from both inbound shapes (OpenAI + Anthropic) at all three rejection points (missing token, invalid token, revoked/grace-expired key). Token is hashed to an 8-char SHA-256 prefix so SIEM can correlate repeated attempts without persisting the credential. Audit-flagged so brute-force forensics outlive the 90-day retention window. (b) NEW metrics in metis_core.observability.metrics.MetricsCollector — six additions: metis_routing_decision_latency_seconds (histogram, off route.decided.elapsed_ms), metis_tool_call_latency_seconds{tool_name} (histogram, off both tool.completed and tool.failed — tool_name is correlated from the prior tool.called via a bounded in-collector LRU _TOOL_NAME_CACHE_MAX=1000 since the event schema doesn’t carry tool_name on completed/failed), metis_llm_call_errors_total{provider, model, error_class} (counter, off llm.call_failed, distinct from the legacy metis_llm_calls_total{status=error_class} mixed counter — both bump on the same event so rate() queries don’t have to sum across status labels for error series), metis_tool_failures_total{tool_name, error_class} (counter, off tool.failed), metis_gateway_auth_failures_total{reason} (counter, off gateway.auth_failed), metis_gateway_key_cost_usd_total{gateway_key_id} (counter, off llm.call_completed.cost_usd — agent-loop traffic without a key buckets under gateway_key_id="null" per multi-user.md §3.4 so dashboards stay one query). Routing bucket range 100µs–500ms (_ROUTING_LATENCY_BUCKETS_SECONDS); tool bucket range 5ms–30s (_TOOL_LATENCY_BUCKETS_SECONDS). LLM latency histogram (_LATENCY_BUCKETS_SECONDS) verified to already cover the spec’s 0.1s-30s alert-target range. 
(c) Gateway app emission: apps/gateway/src/metis_gateway/app.py adds _emit_auth_failed(runtime, reason, inbound_shape, token, gateway_key_id=None) helper called from both chat_completions and messages handlers at each rejection point; bus emission is best-effort so an observability glitch can’t open a side-channel that bypasses the 401 response. (d) NEW helm template infra/gateway/helm/templates/prometheus-rules.yaml renders a monitoring.coreos.com/v1.PrometheusRule with four alert templates: MetisLLMCallLatencyP99High (p99 > 30s for 5m), MetisLLMErrorRateHigh (error rate > 5% for 10m), MetisGatewayAuthFailureRateHigh (> 0.1/sec for 5m), MetisGatewayKeyCostSpike (> $10/hr per key for 10m). Each rule has independent threshold / for / severity / enabled knobs in values.yaml under monitoring.prometheusRules.<rule>. The rule group itself is gated on monitoring.enabled + monitoring.prometheusRules.enabled (both default false). (e) NEW Grafana dashboard infra/gateway/helm/dashboards/metis-gateway.json (~13 panels, 5 rows: Traffic & Latency / Errors / Routing & Tools / Gateway Auth & Cost / Quotas & Active Keys + WAL); buyer imports into their Grafana via DS_PROMETHEUS datasource binding. (f) NEW runbook docs/operations/observability-runbook.md (~340 lines) — companion to incident-response.md covering metric reference, per-alert runbook entries with PromQL + first-action checklist + mitigations + false-positive patterns, dashboard tour, and a week-1 tuning checklist. 
(g) Tests: 17 new across the three suites — 9 new in packages/metis-core/tests/observability/test_metrics.py (routing latency histogram, dedicated LLM error counter, tool call/failure correlation via LRU, orphan-completed fallback to unknown, 3-reason auth-failure counter, bounded LRU eviction, latency-buckets-cover-required-range pinning, per-key cost attribution + null bucket), 3 new in apps/gateway/tests/test_metrics_endpoint_gateway.py (missing/invalid bearer drives counter through end-to-end HTTP path, revoked key drives counter through the revoked_client fixture), 3 new in apps/server/tests/test_metrics_endpoint_server.py (routing latency histogram, tool latency histogram, dedicated LLM error counter via real HTTP /metrics). Test count: 1780 passing, 1 skipped (the skip is an unrelated SO_REUSEPORT platform skip in test_run_gateway_bind.py); 17 new from this entry land on top of the Wave-14b + GA-blocker-1 + GA-blocker-2 + concierge-onboarding entries above.
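The token hashing in (a) and the best-effort emission in (c) can be sketched as below. This is a minimal illustration under stated assumptions, not the helper in app.py: the `emit` callable, the function names, and the reason strings are hypothetical. Only the 8-character SHA-256 prefix, the four payload fields, and the swallow-on-failure behavior come from this entry.

```python
import hashlib
from typing import Any, Callable, Dict, Optional


def token_hash_prefix(token: Optional[str]) -> Optional[str]:
    """First 8 hex chars of SHA-256(token); None when no token was presented."""
    if not token:
        return None
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:8]


def emit_auth_failed(
    emit: Callable[[str, Dict[str, Any]], Any],  # hypothetical bus-emit callable
    reason: str,
    inbound_shape: str,
    token: Optional[str],
    gateway_key_id: Optional[str] = None,
) -> None:
    """Emit gateway.auth_failed without ever letting emission affect the 401 path."""
    payload = {
        "reason": reason,
        "inbound_shape": inbound_shape,
        # Only the hash prefix is stored; the raw credential never enters the payload.
        "token_hash_prefix": token_hash_prefix(token),
        "gateway_key_id": gateway_key_id,
    }
    try:
        emit("gateway.auth_failed", payload)
    except Exception:
        # Best-effort: an observability failure must not change the response.
        pass
```

The caller returns its 401 envelope after this function regardless of whether the emit succeeded.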
Type: additive across the board. (1) Event-catalog: gateway.auth_failed is a new optional type; no payload modified, no existing event type touched. (2) Metric collector: six new metric families, no existing metric labels changed; existing metis_llm_calls_total{status} still receives the error rows alongside the new metis_llm_call_errors_total series. (3) Helm chart: PrometheusRule rendering is gated on monitoring.prometheusRules.enabled: false (default off); pre-Wave-14a deployments are byte-identical. (4) AUDIT_EVENT_TYPES adds one entry — gateway.auth_failed. The retention sweep continues to exempt the full audit set; consumers reading is_audit_event(name) get one extra True for the new name. (5) GatewayKey / Keystore / routing engine / pricing table / analytics surface all unchanged. (6) The gateway auth-rejection path returns the same 401 envelopes (OpenAI vs Anthropic shapes) as Wave 11; the new emission is on the path before the response is returned but is wrapped in try/except so emission failure doesn’t perturb the response shape. (7) Tool-name correlation cache lives entirely in the metrics collector — the tool.called / tool.completed / tool.failed event schemas are unchanged; the collector’s bounded LRU is the only new state.
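The tool-name correlation described in (7) can be pictured with a small bounded LRU. The sketch below is an assumption-laden illustration rather than the collector's code: the class name, method names, and the use of a per-call id as the correlation key are hypothetical. The 1000-entry bound and the fallback to "unknown" for an orphaned completion come from this entry.

```python
from collections import OrderedDict

TOOL_NAME_CACHE_MAX = 1000  # mirrors _TOOL_NAME_CACHE_MAX quoted in this entry


class ToolNameCache:
    """Remembers tool_name from tool.called until tool.completed / tool.failed arrives."""

    def __init__(self, max_size: int = TOOL_NAME_CACHE_MAX) -> None:
        self._max_size = max_size
        self._names: "OrderedDict[str, str]" = OrderedDict()

    def on_called(self, call_id: str, tool_name: str) -> None:
        # call_id is a hypothetical correlation key shared by the three events.
        self._names[call_id] = tool_name
        self._names.move_to_end(call_id)
        # Evict the oldest entries so memory stays bounded under dropped events.
        while len(self._names) > self._max_size:
            self._names.popitem(last=False)

    def on_finished(self, call_id: str) -> str:
        # An orphaned completion (no prior tool.called, or already evicted) is "unknown".
        return self._names.pop(call_id, "unknown")
```

Popping on completion keeps the cache small in the common case; the size bound only matters when completions never arrive.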
References to verify:
- docs/specs/observability.md §3.2 — the six new metrics’ Source-event + Labels columns match the metric families registered in metrics.py. ✓
- docs/specs/observability.md §9 — the four alert templates’ names + default thresholds + primary inputs reconcile with the rendered prometheus-rules.yaml and the values.yaml block. ✓
- docs/specs/event-bus-and-trace-catalog.md §6.13 — the gateway.auth_failed payload schema (4 fields, reason enum, optional token_hash_prefix / gateway_key_id) matches GatewayAuthFailed in payloads.py; the audit-event count text (10 catalog-domain audit types) reconciles with the updated AUDIT_EVENT_TYPES frozenset. ✓
- docs/specs/audit-log.md §5.1 — gateway.auth_failed is now listed in the AUDIT_EVENT_TYPES frozenset literal with a Wave-14a comment. ✓
- docs/operations/observability-runbook.md — each of the four alert runbook entries (§2.1-2.4) references metric names that exist in the v1.1 surface; the dashboard tour (§3) names panels that exist in the JSON. ✓
- docs/operations/incident-response.md — runbook §2.3’s reference to “Gateway-key compromise” playbook + §2.4’s reference to “Quota runaway” playbook both resolve. ✓
- infra/gateway/helm/values.yaml — the monitoring.prometheusRules block keys match what the template reads (enabled, interval, labels, llmLatencyP99, llmErrorRate, gatewayAuthFailureRate, gatewayKeyCostSpike × {enabled, threshold, for, severity}). ✓
- infra/gateway/helm/templates/prometheus-rules.yaml — helm lint + helm template both clean with default + monitoring.enabled=true,monitoring.prometheusRules.enabled=true,provider.anthropicApiKey=sk-test; rendered PromQL uses canonical metric names (no typos). ✓
- infra/gateway/helm/dashboards/metis-gateway.json — JSON validates as Grafana v9+ schema (schemaVersion: 39); every panel’s expr references a metric name that exists in metrics.py. ✓
Status: verified. All 17 new tests pass alongside the existing observability suites (pytest packages/metis-core/tests/observability/ apps/gateway/tests/test_metrics_endpoint_gateway.py apps/server/tests/test_metrics_endpoint_server.py → 45 passed). Full suite green pending the end-to-end run noted in the next step. Ruff clean. Helm chart lints clean. The four alert rules ship disabled by default (monitoring.prometheusRules.enabled: false) so the chart upgrade is byte-identical for buyers who haven’t opted in; the runbook §4 tuning checklist walks operators through a week-1 baseline-then-enable workflow.

2026-05-16 — concierge-onboarding flow: `customer_tier` keystore field + `metis customer-report` + `metis trial-status` + the concierge runbook (private) (Wave 14b)

Specs: gateway.md §11 (key lifecycle) picks up the optional customer_tier field on the keystore record + the GatewayKeyIssued payload — additive, not an entitlement field; the gateway does NOT gate behavior on tier. event-bus-and-trace-catalog.md §6.13 extends GatewayKeyIssued with one optional string field. No new event types, no AUDIT_EVENT_TYPES change, no PAYLOAD_REGISTRY change.
Change: Lands the buyer-facing trial → conversion path for the first paid Metis customer per the user’s Wave 14b brief. (a) NEW the concierge runbook (private) (~250 lines) — 7-day flow with day-by-day touch points: day 0 intake (provisioning), day 1 install, day 2 quiet, day 3 first-signal check-in, days 4-6 incubate, day 7 close. Each stage names what the buyer does, what we provide, and what success looks like; opening / day-3 / day-7 email templates included verbatim; “what this doesn’t fit” caveats for low-traffic / no-evaluator / no-real-workload trials; quick-reference command block of all 6 commands the concierge runs across the trial; explicit “not a billing surface, not pricing — customer_tier is a support-context tag” framing in the closing section. (b) NEW apps/cli/src/metis_cli/customer_report.py — metis customer-report --workspace <path> --since <date> [--out path] [--format html|json] [--customer-label …] [--customer-tier trial|paid|internal] [--baseline …]. Renders an offline-share-able HTML report (no JS, no external assets, inline CSS only — browser-print-to-PDF for archival) or deterministic JSON. Headline: spend / savings_pct / cost-per-quality stat tiles; tables for by_model / by_gateway_key / by_user / by_team / daily_spend; HTML-escapes every customer-controlled string. Re-uses AnalyticsStore.savings() / cost() / by_key() / by_team() / quality() directly (no HTTP roundtrip needed — buyer’s trace DB is the source of truth and the report is meant to be runnable offline). (c) NEW apps/cli/src/metis_cli/trial_status.py — metis trial-status <workspace> [--db-path …] [--since <iso>] [--trial-length-days 7] [--baseline …]. Reports spend / quality / days-into-trial and a 0-100 conversion-readiness band (ready ≥ 80 / warm 50-79 / not_yet 1-49 / no_signal 0) derived from three coarse axes: usage signal (calls + spend), quality signal (verdict count + mean score), trial progress (days-in vs trial-length). 
Thresholds (MIN_SPEND_FOR_SIGNAL_USD=0.50, MIN_LLM_CALLS_FOR_SIGNAL=20, MIN_QUALITY_VERDICTS=5, HEALTHY_QUALITY_FLOOR=0.70) are quoted verbatim in the concierge doc; a pinning test asserts they don’t drift. (d) Keystore additive field: apps/gateway/src/metis_gateway/auth.py gains GatewayKey.customer_tier: Literal["trial","paid","internal"] | None = None + _parse_customer_tier_field validator; apps/gateway/src/metis_gateway/issue_key.py accepts --customer-tier and persists it under the same omit-when-None convention as user_id / team_id; apps/gateway/src/metis_gateway/keystore_admin.py KeyListing surfaces the tier; packages/metis-core/src/metis_core/events/payloads.py GatewayKeyIssued carries the tier into the audit event. (e) apps/cli/src/metis_cli/main.py wires three new dispatches: customer-report, trial-status, and the --customer-tier flag on gateway issue-key. (f) Tests: 13 new in apps/cli/tests/test_customer_report_cli.py (parser shape, HTML offline-contract no-script/no-link-stylesheet/no-img-src=http, XSS escape, JSON determinism, end-to-end against a seeded DB with 3 LLM calls + 2 users + 2 keys + 1 team, missing-DB error path, unknown-baseline error path), 12 new in apps/cli/tests/test_trial_status_cli.py (parser, days-in / days-remaining math, readiness bands across no-traffic / low-usage / no-quality / healthy-traffic scenarios, naive --since rejection, threshold-pinning), 4 new in apps/gateway/tests/test_issue_key.py (tier persists + round-trips, absent when unset, unknown tier rejected, CLI summary surfaces tier).
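The conversion-readiness bands quoted in (c) map a 0-100 score onto four labels. The sketch below shows only that mapping; how trial-status derives the score from the usage, quality, and progress axes is not described in this entry and is deliberately left out. The function name is hypothetical.

```python
def readiness_band(score: int) -> str:
    """Map a 0-100 readiness score to the band names quoted in this entry."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be within 0-100, got {score}")
    if score >= 80:
        return "ready"
    if score >= 50:
        return "warm"
    if score >= 1:
        return "not_yet"
    return "no_signal"  # exactly 0: no usable signal yet
```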
Type: additive across the board. (1) GatewayKey.customer_tier defaults to None — pre-Wave-14b keystores load without modification; the keystore loader does not require the field. (2) build_new_key_record and issue_key_command accept customer_tier=None and omit the field from the persisted record on None, mirroring the user_id / team_id omit-on-null discipline so a JSON consumer never sees "customer_tier": null. (3) GatewayKeyIssued.customer_tier is a new optional field with default None — pre-Wave-14b audit consumers ignore the field; msgspec.Struct(frozen=True) is back-compat under additive optional fields. (4) The gateway does NOT gate any behavior on tier — auth, routing, rate-limit, quota all read identically to pre-Wave-14b. The field is a support-context tag for the report / status surfaces. (5) customer-report and trial-status are new CLI subcommands that read ~/.metis/metis.db (or --db-path) — they don’t write to it; running them against a pre-Wave-14b trace DB works (the lookups don’t depend on the tier field). (6) No event-catalog change; no payload-registry membership change; no new PAYLOAD_REGISTRY entry. (7) mkdocs.yml need not be updated — the file already navigates docs/operations/*.md and the concierge runbook (private) slots into the nav as a sibling of quickstart.md automatically (it’s surfaced as a new top-level file under Operations once a build runs).
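The omit-on-null discipline in (2) and the tolerant load in (1) can be illustrated with plain dicts. This is a hedged sketch: `build_key_record` and `read_customer_tier` are hypothetical stand-ins, not the signatures of build_new_key_record or the keystore loader, and the record shape is simplified to a few fields.

```python
from typing import Any, Dict, Optional

ALLOWED_TIERS = {"trial", "paid", "internal"}


def build_key_record(
    key_id: str,
    *,
    user_id: Optional[str] = None,
    team_id: Optional[str] = None,
    customer_tier: Optional[str] = None,
) -> Dict[str, Any]:
    """Build a persisted record that never contains a key whose value is None."""
    if customer_tier is not None and customer_tier not in ALLOWED_TIERS:
        raise ValueError(f"unknown customer_tier: {customer_tier!r}")
    optional = {"user_id": user_id, "team_id": team_id, "customer_tier": customer_tier}
    record: Dict[str, Any] = {"key_id": key_id}
    # Omit unset fields so a JSON consumer never sees "customer_tier": null.
    record.update({k: v for k, v in optional.items() if v is not None})
    return record


def read_customer_tier(record: Dict[str, Any]) -> Optional[str]:
    """Older records simply lack the field; treat absence as None."""
    return record.get("customer_tier")
```

Because absence and None are treated the same on read, records written before the field existed load unchanged.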
References to verify:
- docs/specs/gateway.md §11 (key lifecycle) — customer_tier is documented as additive in this CHANGES entry; the spec text itself doesn’t yet mention the field. ⚠ — the concierge-onboarding doc + this CHANGES entry are the source of truth pending a spec edit; gateway.md §11 should pick up the field’s existence in a future doc-sync (low-priority since gateway behavior is unchanged).
- docs/specs/event-bus-and-trace-catalog.md §6.13 (GatewayKeyIssued) — payload extended with one optional field; spec text doesn’t yet enumerate the new field. ⚠ — same posture; pre-existing consumers continue to read clean (msgspec accepts missing optional fields as None).
- docs/operations/quickstart.md — referenced from the concierge runbook (private) as the day-1 install path. ✓
- docs/customer-trial-recipe.md — referenced from the concierge runbook (private) for the “buyer-runs-their-own-workload” path. ✓
- the sales toolkit (private) + the sales toolkit (private) — referenced from the concierge runbook (private) “conversion artifacts inventory” as post-trial / post-conversion surfaces. ✓
- docs/specs/pricing.md — referenced from the concierge runbook (private) closing section (“this doc isn’t a pricing recommendation”). ✓
- docs/specs/analytics-api.md §4 — customer-report reads cost() / by_key() / by_team() / quality() / savings() directly via AnalyticsStore; no HTTP-shape change. ✓
Status: verified at the unit-test level. 29 new tests (13 customer-report + 12 trial-status + 4 issue-key tier extension). Ruff clean on every changed file. The spec-text gap on gateway.md §11 / event-bus-and-trace-catalog.md §6.13 is flagged as ⚠ above; these two spec sections are the only docs that don’t reflect the additive field yet, and gateway behavior is byte-identical to pre-Wave-14b, so the field’s existence has no operational consequence until the report / status surfaces consume it. The two new CLIs (customer-report, trial-status) are read-only against the trace DB. Billing / metering remains explicitly out of scope per the user’s Wave 14b brief: customer_tier is a support tag, not an entitlement flag.
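The report's offline contract from (b) above (inline CSS only, every customer-controlled string HTML-escaped) can be sketched with the standard library. The function name and markup are illustrative assumptions, not the renderer in customer_report.py.

```python
import html


def render_stat_tile(label: str, value: str) -> str:
    """Render one stat tile with inline CSS only and escaped customer-supplied text."""
    return (
        '<div style="border:1px solid #ccc;padding:8px">'
        f"<strong>{html.escape(label)}</strong>: {html.escape(value)}"
        "</div>"
    )
```

A customer label such as `<script>alert(1)</script>` is emitted as inert text, which is the property the XSS-escape test in (f) pins.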

2026-05-16 — status-page.md “Live deployment” + monitoring checks + SEV-mapped templates + helm `statusPage.enabled` sidecar (Wave 11 implementation)

Specs: docs/operations/status-page.md gains three new sections — (1) “Live deployment” at the top, honest about the hosting-account split (helm toggle + curl recipes are ready-to-apply; DNS / TLS / SaaS account provisioning is owner-side; target hostname https://status.example.com already referenced by the Wave-14 product-site nav + footer; provisioning checklist enumerates the five steps that aren’t automated). (2) “Monitoring checks” — four canonical probes with full field-by-field configuration: probe 1 /healthz HTTP liveness (60s interval, 2-fail → SEV1), probe 2 synthetic POST /v1/messages with a dedicated key (--daily-cap-usd 0.50 --allow-model anthropic:claude-haiku-4-5; ~$1/mo at 5-min cadence), probe 3 /metrics HTTP-keyword on metis_gateway_keys_active (the Wave-11 Prometheus gauge — catches “bus has stalled” failure mode /healthz misses), probe 4 gateway-key liveness via Kuma Push + curl-cron reading metis_gateway_keys_active (< 1 → SEV1; ≥ 50% drop in 5 min without a corresponding gateway.key_rotated audit event → SEV2). UptimeRobot v2 curl recipes for probes 1 + 3 paste-runnable; probes 2 + 4 noted as paid-tier-only on UptimeRobot and free on Better Stack / Kuma Push. (3) “Severity-mapped templates (pre-load these)” — the existing stage templates pre-instantiated for SEV1 (major-outage, 30 min cadence) / SEV2 (partial-outage, 30 min) / SEV3 (degraded, 4 hours) / SEV4 (do NOT post unless reclassified up); triggers quoted verbatim from incident-response.md §Severity levels. Provider-specific paste targets noted for Statuspage.io / Better Stack / Uptime Kuma 1.x (no first-class incident-template surface in Kuma 1.x — kept as operator-runbook cheat-sheet) / UptimeRobot.
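Probe 4's severity rule can be written as a small pure function. The sketch below assumes the cron job already has the current gauge value, the value from five minutes earlier, and a boolean saying whether a gateway.key_rotated audit event was seen in that window; the function name and that input shape are hypothetical.

```python
from typing import Optional


def classify_key_liveness(
    active_now: int,
    active_5_min_ago: int,
    rotation_event_seen: bool,
) -> Optional[str]:
    """Return "SEV1", "SEV2", or None for the metis_gateway_keys_active probe."""
    if active_now < 1:
        return "SEV1"  # no active keys at all
    if active_5_min_ago > 0:
        drop = (active_5_min_ago - active_now) / active_5_min_ago
        # A large drop is expected during a rotation, so only alert without one.
        if drop >= 0.5 and not rotation_event_seen:
            return "SEV2"
    return None
```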
Change: (a) docs/operations/status-page.md: three new top-level sections inserted (Live deployment + Monitoring checks + Severity-mapped templates) plus an expanded Tier B subsection covering two install paths — the new “A. Helm sidecar (recommended)” and the existing “B. Upstream chart” (renamed; verbatim from pre-edit). (b) infra/gateway/helm/values.yaml: new statusPage block (~70 lines) covering enabled (default false), image.{repository,tag,pullPolicy} (defaults to louislam/uptime-kuma:1), service.{type,port,annotations} (default ClusterIP / 3001), persistence.{enabled,size,accessMode,storageClass} (default 1Gi RWO), ingress.{enabled,className,annotations,host,tls} (default off; host: status.example.com placeholder), resources.{requests,limits} (default 100m/128Mi → 500m/512Mi), podSecurityContext + containerSecurityContext (default empty — Uptime Kuma’s upstream image runs as root). (c) NEW infra/gateway/helm/templates/statuspage.yaml (~140 lines): a single template gated on .Values.statusPage.enabled renders PVC + Deployment + Service + optional Ingress. Deployment uses strategy.type: Recreate (rolling update would overlap two Kuma writers on the SQLite DB and corrupt it). Liveness / readiness probes are HTTP GET / on port 3001 with 30s / 10s initial delays (Kuma’s first-boot DB migration takes ~10s). (d) infra/gateway/helm/templates/_helpers.tpl: five new helpers — metis-gateway.statusPage.name / .fullname / .pvcName / .selectorLabels / .labels. The status-page resources carry app.kubernetes.io/name: <chart>-status-page (NOT the gateway’s name) so the gateway Service’s selector cannot accidentally match the Kuma pod. (e) infra/gateway/helm/templates/NOTES.txt: new “Status page” line in the “What you got” block when enabled + new step 4 with the port-forward + first-boot recipe and the Tier-B trade-off restated. 
Verified via helm lint infra/gateway/helm/ (0 failed) + helm template test ./infra/gateway/helm/ --set provider.anthropicApiKey=sk-test --set statusPage.enabled=true --set statusPage.ingress.enabled=true --set statusPage.ingress.host=status.example.com: renders PVC + Deployment + Service + Ingress with -status-page suffix on every name; selector labels distinct from the gateway’s; default-disabled renders zero status-page resources.
Type: additive. No spec contract changed. (1) docs/operations/status-page.md is an operations doc, not a spec contract; no entry in the spec dependency map. (2) Helm chart change is opt-in (statusPage.enabled: false by default); pre-Wave-11-impl deployments are byte-identical. (3) The four probes reference shipped surfaces only: /healthz (gateway.md §3), POST /v1/messages (gateway.md §V), /metrics (observability.md §3, Wave 11), metis_gateway_keys_active (observability.md §4, Wave 11). No new events, no payload registry change, no analytics endpoint change. (4) The SEV-mapped templates re-use the stage templates verbatim and inherit their interpretation from incident-response.md §Severity levels; the SEV → overall-status mapping table is new but follows directly from incident-response.md’s Ack target / Mitigation target ladder. (5) Honest reporting per the work brief: the actual hosting account is NOT provisioned in this entry; the target status-page URL https://status.example.com resolves to nothing today. The “Provisioning checklist (owner-side)” section in §Live deployment names the five remaining manual steps explicitly.
References to verify:
- docs/operations/incident-response.md §Severity levels — the SEV1-SEV4 ack / mitigation / resolution targets in the new SEV-mapped-templates table are quoted verbatim from this matrix; SEV4 “do NOT post” follows directly from incident-response.md §Severity levels note (“SEV3/SEV4 are not status-page-worthy unless they cross into user-visible impact”). ✓
- docs/specs/observability.md §3-4 — probes 3 and 4 read /metrics + metis_gateway_keys_active; both shipped in Wave 11 and are referenced verbatim. ✓
- docs/specs/gateway.md §11 — probe 4’s “drop ≥ 50% without gateway.key_rotated audit event → SEV2” cites the audit-event names that ship in Wave 10 (gateway.key_issued / gateway.key_revoked / gateway.key_rotated). ✓
- docs/specs/audit-log.md — gateway.key_rotated is in AUDIT_EVENT_TYPES (Wave 12); probe 4’s correlation logic relies on the audit subset being queryable independently of the live trace stream. ✓
- infra/gateway/helm/values.yaml statusPage block — defaults match the new doc’s “Helm sidecar” install recipe (PVC 1Gi RWO; Service ClusterIP / 3001; Ingress off; image louislam/uptime-kuma:1); verified via helm template. ✓
- docs/operations/incident-response.md §On-call alert paths — the “Two consecutive failures in 60s → SEV1” cadence on probe 1 comes from this table’s HTTP-probe row. ✓
Status: verified. Helm chart: helm lint clean; helm template renders the expected four resources when enabled and zero when disabled. Doc: mkdocs build --strict clean. Honest report on the deploy step: NOT executed — no UptimeRobot account, no kind-cluster persistent deploy, no DNS for status.example.com. The work brief allowed this explicitly (“honest reporting if hosting account isn’t provisioned; skip the deploy step; document the recipe”). Outstanding owner-side steps named in §Live deployment > Provisioning checklist.

2026-05-16 — gateway.md §4.8 model normalization (GA blocker 2 fix)

Specs: gateway.md §4.8 (new subsection “Model normalization (the bare-name pitfall)”) + §5.3 step 3 (bare-name handling text updated to point at §4.8 for the new canonical-form normalization).
Change: Fixes GA blocker 2 from the Wave-14 GA-readiness audit — SDK clients send bare provider model names (Anthropic SDK: claude-3-5-haiku-20241022; OpenAI SDK: gpt-4o-mini) because the upstream APIs reject the anthropic: / openai: prefix Metis uses internally, so the gateway’s per_message_override slot 1 couldn’t resolve them, routing fell through to slot 7 (global_default), and pricing billed under sonnet — over-reporting cost ~6× on the canonical haiku workload. (a) apps/gateway/src/metis_gateway/harness.py: new _normalize_inbound_model(name, *, inbound_shape, registry) helper runs once per request, immediately before registry.resolve_alias, in both call() and stream() paths. Rules in order: (1) registry-known alias or canonical id → pass through; (2) metis://... opt-out → pass through; (3) already-prefixed provider:name → pass through; (4) bare name → prepend {"openai":"openai","anthropic":"anthropic"}[inbound_shape]. Unknown shape returns the bare name so the chain falls through cleanly. (b) apps/gateway/tests/test_model_normalization.py: 9 new tests — 6 unit tests on the helper (known alias / canonical id / metis:// / Anthropic bare / OpenAI bare / unknown shape) + 3 end-to-end HTTP-level tests via the existing scripted-adapter fixture that drive POST /v1/messages with claude-3-5-haiku-20241022, POST /v1/chat/completions with gpt-4o-mini, and POST /v1/messages with claude-haiku-4-5, then assert (i) the scripted adapter received the canonical provider:name, (ii) the route.decided.chosen_model event records the canonical id, and (iii) llm.call_completed.cost_usd matches the haiku / gpt-4o-mini per-token rate (not the sonnet fallback rate). (c) Translators stay pure: translators.py and endpoints/anthropic.py do not import the registry; normalization runs at the harness boundary where both inbound_shape and the registry are already in scope. 
(d) Outbound JSON body still echoes the client’s raw model string verbatim — SDKs that compare echo-against-sent continue to work; existing tests (test_chat_completions_falls_back_to_global_default, test_messages_accepts_x_api_key_header) that assert body["model"] == "<sent value>" are unchanged and still pass.
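The four normalization rules in (a) can be sketched as a pure function. This is an illustration, not the helper in harness.py: the registry is replaced by a plain set of known names, and detecting an already-prefixed name by the presence of a colon is an assumption made for the sketch.

```python
from typing import Set

# Inbound-shape map quoted in this entry.
_SHAPE_TO_PROVIDER = {"openai": "openai", "anthropic": "anthropic"}


def normalize_inbound_model(name: str, *, inbound_shape: str, known_names: Set[str]) -> str:
    """Return the canonical provider:name form for a client-supplied model string."""
    # (1) Registry-known alias or canonical id: pass through.
    if name in known_names:
        return name
    # (2) metis://... opt-out: pass through.
    if name.startswith("metis://"):
        return name
    # (3) Already prefixed provider:name: pass through (colon check is an assumption).
    if ":" in name:
        return name
    # (4) Bare name: prepend the provider implied by the inbound shape.
    provider = _SHAPE_TO_PROVIDER.get(inbound_shape)
    if provider is None:
        return name  # unknown shape: leave bare so the chain falls through
    return f"{provider}:{name}"
```

Only the value used for routing and pricing is normalized; per (d), the response body still echoes the client's raw string.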
Type: additive. (1) The contract under §5.3 step 3 was “bare provider name resolves if registry has it as an alias; otherwise the override is treated as a hint and the chain falls through” — the new behavior preserves the alias path verbatim and only changes the fallthrough leg to prepend the canonical prefix before fallthrough. A bare name unknown to both the bare-alias table and the prefixed-canonical-id table still falls through to global_default, so the existing fallback test (test_chat_completions_falls_back_to_global_default, which uses model: "some-fictional-model") keeps the same end-to-end behavior (routes to global default, body echoes the raw name). (2) PriceTable.compute_cost is unchanged — it’s a strict canonical-id lookup; the fix is at the resolution layer, not the pricing layer. (3) Event-catalog payload registry is unchanged. (4) No CLI / gateway-config / helm-values change.
References to verify:
- docs/specs/gateway.md §4.8 — the new subsection names the helper, the four rules, the inbound-shape map, and what does NOT change (outbound echo, translator purity, compute_cost unchanged). ✓
- docs/specs/gateway.md §5.3 step 3 — the bare-provider-name bullet now points at §4.8 instead of the old “accepted as a hint” wording. ✓
- docs/specs/routing-engine.md — the 7-slot chain is unchanged; slot 1 (per_message_override) still consults registry.resolve_alias exactly once per turn, and the override comes from TurnContext.per_message_override which is now seeded with the registry-recognized canonical id instead of None for the bare-from-SDK case. The slot’s verdict shape is unchanged. ✓
- the GA-readiness audit (private) §2.4 + §6.1 — the second GA blocker is now closed by this entry; the audit’s repro recipe (metis trial workload with canonical haiku id from SDK → routes to sonnet) maps directly to the new test_anthropic_shape_haiku_canonical_id_unchanged regression net. ✓
- apps/gateway/src/metis_gateway/translators.py + apps/gateway/src/metis_gateway/endpoints/anthropic.py — translators still parse the raw model field verbatim; no registry dependency added. ✓
- packages/metis-core/src/metis_core/pricing/table.py — unchanged (PriceTable.compute_cost is a strict canonical-id lookup, which is the contract). ✓
Status: verified. Full suite green on 2026-05-16: 1750 passed, 1 skipped in 42.89s (1741 baseline + 9 new normalization tests; the skipped test is an unrelated SO_REUSEPORT platform skip in test_run_gateway_bind.py). uv run ruff check apps/gateway/src apps/gateway/tests clean; uv run ruff format clean. The GA-readiness-audit §2.4 sub-issue 3 (“trial’s local trace DB disagrees with the dashboard”) is structurally closed by this fix — both surfaces now read the same canonical-id-stamped events.

Specs: routing-engine.md §4.5.1. The NETWORK row in the trigger table changes from “Any DNS or network error reaching a provider’s host → the whole provider Unavailable” to “≥2 DNS / network errors within 30 seconds → the whole provider Unavailable; a single transient error counts toward the per-(provider, model) 5-strike threshold but does not blackout the provider.” A new paragraph below the table records the Wave-14 GA-readiness-audit finding that motivated the refinement and points at the module constants in availability.py. AUTH semantics are unchanged.
Change: Fixes GA blocker 1 from the Wave-14 GA-readiness audit: a single ssl.SSLError: SSLV3_ALERT_BAD_RECORD_MAC from Anthropic’s API blacked out the whole anthropic provider for 5 minutes because availability.py’s _PROVIDER_WIDE_IMMEDIATE_CLASSES treated every NETWORK error as a provider-wide outage signal. The audit-recorded reproduction was the buyer-trial flow tripping on turn 3 and returning RateLimitError: anthropic 503: no model available; tried: anthropic:claude-sonnet-4-6 (provider_unavailable) on every subsequent call. (a) packages/metis-core/src/metis_core/routing/availability.py: _PROVIDER_WIDE_IMMEDIATE_CLASSES is removed; the AUTH branch is now a direct if error_class == ErrorClass.AUTH check. New module constants _NETWORK_PROVIDER_ESCALATION_THRESHOLD = 2 and _NETWORK_PROVIDER_ESCALATION_WINDOW_SECONDS = 30.0. _ProviderState gains a recent_network_failures: list[float] sliding-window field. The NETWORK branch in mark_failure prunes timestamps older than the 30-second window, appends now, and escalates provider-wide only on the 2nd failure inside the window. Single NETWORK errors fall through to the per-(provider, model) 5-within-2-min counter so a model that keeps producing NETWORK errors still trips itself. mark_success / force_recovery / the auto-recovery path in state() all clear recent_network_failures alongside recent_model_unavailables. 
(b) packages/metis-core/tests/routing/test_availability.py: two stale tests (test_network_error_marks_provider_unavailable_immediately, test_network_error_without_model_still_marks_provider) are replaced by seven new tests covering: a single NETWORK error doesn’t black out the provider (regression net for the audit-recorded failure mode), 2-within-30s does escalate, 2-at-31s-apart doesn’t (the first ages out of the window), 2-with-model=None still escalates (DNS-before-request-build path), a single NETWORK error still advances the per-(provider, model) counter (5-within-2-min path stays reachable), mark_success clears the sliding window, and AUTH still escalates immediately (regression net for the unchanged Wave-1E AUTH path).
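The sliding-window escalation in (a) can be sketched in isolation. This is a simplified illustration rather than the code in availability.py: the class and function names are hypothetical, the per-(provider, model) counter is omitted, and the treatment of a timestamp exactly 30 seconds old (kept here) is an assumption, since the entry only pins the 31-second case.

```python
from dataclasses import dataclass, field
from typing import List

NETWORK_ESCALATION_THRESHOLD = 2        # failures needed inside the window
NETWORK_ESCALATION_WINDOW_SECONDS = 30.0


@dataclass
class ProviderState:
    recent_network_failures: List[float] = field(default_factory=list)
    provider_unavailable: bool = False


def mark_network_failure(state: ProviderState, now: float) -> bool:
    """Record a NETWORK failure; return True once the provider is marked unavailable."""
    cutoff = now - NETWORK_ESCALATION_WINDOW_SECONDS
    # Prune timestamps that have aged out of the window, then record this one.
    state.recent_network_failures = [t for t in state.recent_network_failures if t >= cutoff]
    state.recent_network_failures.append(now)
    if len(state.recent_network_failures) >= NETWORK_ESCALATION_THRESHOLD:
        state.provider_unavailable = True
    return state.provider_unavailable


def mark_success(state: ProviderState) -> None:
    """A successful call clears the window and the provider-wide flag."""
    state.recent_network_failures.clear()
    state.provider_unavailable = False
```

Passing `now` explicitly keeps the function deterministic, which is how the 31-seconds-apart case can be checked without sleeping.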
Type: refinement (not a contract change). The validation_failure value provider_unavailable is unchanged; the route.decided.chain shape is unchanged; the event-catalog payload registry is unchanged. Callers reading is_available() / state() / mark_failure() / mark_success() see the same API. The only observable difference is that one transient SSL error no longer flips the whole-provider scope. The 5-minute auto-clear, the AUTH-immediate path, the 5-within-2-min per-model breaker, and the 3-distinct-models-within-2-min multi-model escalation are all untouched.
References to verify:
- routing-engine.md §4.5.1 — the trigger table NETWORK row + the new “Why NETWORK is not immediate” paragraph reflect the new semantics. ✓
- routing-engine.md §4.5.2 — auto-clear unchanged (still 5 minutes of no attempts). The sliding NETWORK window is cleared at the same time as recent_model_unavailables on auto-recovery. ✓
- routing-engine.md §4.5.3 — validation behavior unchanged; the rejection-reason text already disambiguates model-specific vs provider-wide via the route.decided event’s reason field. ✓
- event-bus-and-trace-catalog.md §6 — no payload-registry change; route.decided continues to emit the same validation_failure strings. ✓
- the GA-readiness audit (private) §2.3 + §6.1 — the audit’s first GA blocker is now closed by this entry; the audit’s repro recipe (one SSL hiccup on turn 3) maps directly to the new test_single_network_error_does_not_blackout_provider regression net. ✓
- packages/metis-core/src/metis_core/adapters/anthropic.py — the adapter still translates anthropic.APIConnectionError / anthropic.APITimeoutError / httpx.HTTPError to NetworkError (ErrorClass.NETWORK); the refinement is downstream of classification. ✓
Status: verified. 1727 tests pass (1722 baseline + 7 new − 2 replaced); ruff clean. The GA-readiness-audit §2.3 sub-issue 2 (“no retry inside metis trial”) and sub-issue 3 (“error message references the wrong model”) remain open — both are outside the scope of the availability state machine.

2026-05-15 — AGENTS.md / README.md test-count + Wave-14 sync + the project strategy (private) §A3-rev7 partial readout (Wave 14 docs sync)

Specs: none touched. Pure high-level doc housekeeping after Wave 14 landed (partial-credit rubric primitive + self-serve signup + mkdocs-material doc site + sales toolkit + product-site GA polish + GA-readiness audit + §A3-rev7 partial run).
Change: (a) AGENTS.md (and the CLAUDE.md symlink): test count 1678 → 1722 in two places (status sentence + Running-things comment); status sentence’s “Differentiator posture” paragraph pivots from “post-§A3-rev6 + 13a-1” to “post-§A3-rev7 partial” framing — direct positive evidence for §A3-rev6 interpretation (a) on the 2 workloads with complete partial-credit data, residual signal on regex-with-edge-cases Pass A haiku 0.63-0.75 with no Pass B sonnet samples for direct comparison; GTM ordering pivots so delegation (slot 5) sits ahead of model-selection (slot 4) per §A3-rev7’s strategic pivot. New “What works” entries (six bullets) for: 14a-1 partial-credit rubric primitive (evaluator.md v1.2), 14a-2 gateway self-serve signup endpoint, mkdocs-material doc site, sales toolkit (docs/sales/), product-site GA polish, and the GA-readiness audit (with the two GA blockers surfaced for owner triage: NETWORK-class trip + ~6× gateway cost over-report on bare-id-from-SDK). New “What works” terminal entry: Wave 14 docs sync. New gotcha entry: gateway self-serve signup is off by default. Status sentence’s “phase-claim bump remains owner-decision territory” framing keeps “ready-for-review” verbatim. (b) README.md: test count 1678 → 1722 in two places (top status block + tests-summary bullet); top status block picks up Wave 14 inventory (partial-credit + signup + doc site + sales toolkit + GA-readiness audit with the two surfaced blockers) and the §A3-rev7 partial readout; “tests” bullet picks up “partial-credit primitive” + “self-serve signup”. (c) docs/the project strategy (private): the existing §A3-rev7 follow-up paragraph + strategic pivot block already capture the partial-run outcome verbatim (the 2026-05-15 work that introduced them landed earlier in the same wave); this entry verifies they reconcile with the AGENTS.md / README.md framing now that those surfaces are updated. (d) §5 already carries the §A3-rev7 dated entry (line 342) verbatim; no edit. 
(e) No edit to the phase-claim proposal (private): the proposal remains “Position B (Phase 3 shipped) recommended, awaiting owner sign-off”; the GA-readiness audit’s two blockers are now flagged in the AGENTS.md / README.md status sentences so the phase-claim decision can be made against the fully surfaced inventory.
Type: pure docs / additive. (1) No spec contract touched; no code change; no event-catalog change. (2) Test count is observable (uv run pytest -q | tail -1 returns 1722 passed, 1 skipped); the bump is a no-op against the binary. (3) Phase-claim status remains “ready-for-review” — owner sign-off is gated on both the phase-claim proposal and the GA-readiness audit’s two surfaced blockers (NETWORK-trip behavior + gateway bare-id cost over-report). (4) All entries cite shipped commits / shipped specs; no claim is forward-looking.
References to verify:
- docs/specs/evaluator.md §5.4 (v1.2 partial-credit primitive) — AGENTS.md What-works entry quotes the criterion + composition formula + 5 opt-in workloads verbatim. ✓
- docs/specs/gateway.md §12 (self-serve signup) — AGENTS.md What-works entry quotes the 5 routes + magic-link TTL + accounts.json file mode verbatim; gotcha entry quotes the helm-default enabled: false. ✓
- mkdocs.yml + docs/index.md + infra/docs/Dockerfile (Wave 14 mkdocs-material doc site) — AGENTS.md What-works entry quotes the four nav sections + local-preview recipe + docker-compose docs profile verbatim. ✓
- docs/sales/ (sales toolkit) — AGENTS.md What-works entry names all five artifacts; CHANGES.md “sales toolkit” 2026-05-15 entry is the source of truth for the contents. ✓
- the GA-readiness audit (private) — AGENTS.md status sentence + What-works entry quote the two GA blockers verbatim (NETWORK-class trip + ~6× cost over-report on bare canonical id from SDK); the audit’s bug-fixed-in-pass list (5 items) maps to actual diffs in the same commit window. ✓
- benchmarks/RESULTS.md §A3-rev7 — AGENTS.md status sentence + README.md top block quote the +0.000 gap on subtle-bug-fix-with-test + recursive-data-structure-traversal, the partial $1.08 spend, and the regex-with-edge-cases 0.63-0.75 residual signal verbatim. ✓
- docs/the project strategy (private) §A3-rev7 follow-up — already present (the same earlier-in-the-day work that drafted the §A3-rev7 paragraphs in §1 and the §5 dated entry); this entry confirms cross-surface consistency. ✓
Status: verified. Test count run on 2026-05-15: 1722 passed, 1 skipped in 39.06s. Ruff: no source files touched in this entry. The phase-claim proposal + GA-readiness audit both remain in front of the owner; no AGENTS.md status-sentence ratification has happened in this wave. The “ready-for-review” wording is preserved verbatim.

2026-05-15 — the GA-readiness audit (private) + 5 tiny in-pass fixes (Wave 14 GA-readiness audit)

Specs: none touched. Pure operations doc + the audit’s “tiny fixes applied in this pass (no behavior change)” list. The audit is the engineering-quality companion to the phase-claim proposal (private); the latter reasons about whether to bump the AGENTS.md status sentence, the former reports the engineering-quality state at the moment of considering that bump.
Change: (a) NEW the GA-readiness audit (private) (~410 lines): the pre-launch quality pass. Sections — (1) Summary table (Quickstart 4 PASS / 0 TODO / 2 FAIL; Helm chart 5 PASS; CLI surface 12 PASS / 1 TODO; Documentation 1408 PASS / 5 TODO / 5 FAIL); (2) Quickstart end-to-end against a fresh kind cluster (~$0.12 spend, one wasted run + one successful trial) with two FAIL items surfaced for owner triage; (3) Helm chart helm lint (1 chart linted, 0 failed) + helm template against 5 profiles (default / multi-tenant Internet-exposed / existing provider secret / persistence disabled / autoscaling enabled) with the gateway-container-port-name bug fixed in-pass; (4) CLI surface — 12 top-level subcommands × --help clean + unexpected-input smoke (8 hostile-input scenarios, all clean error exits); (5) Documentation — 57 markdown files × ~1419 link tokens, 5 real broken links found (4 fixed in-pass, 1 deferred to a buyer-asset surface), 5 self-referential template strings flagged for renderer-strictness without behavior fix, and a stale “what’s NOT built” README section flagged for owner review; (6) Surfaced for human triage — the two GA-blocker quickstart FAILs from §2 (NETWORK-class trip on a single SSL hiccup; gateway over-reports cost ~6× when SDK clients strip the anthropic: prefix on bare canonical ids) with three repair candidates per issue, none of which is a one-line fix. (b) Tiny in-pass fixes (no behavior change): the project strategy (private):287 — ../gateway-deployment.md → gateway-deployment.md (file lives in docs/); docs/specs/multi-user.md:70 — keystore.py → auth.py (renamed in Wave 10); docs/specs/CHANGES.md:722 — ../packages/... → ../../packages/... 
(wrong depth); docs/operations/soc2-readiness.md — three “1486 tests” references bumped to “1678” (matching AGENTS.md at audit time; will rebump to 1722 in the Wave-14 docs sync); infra/gateway/helm/templates/deployment.yaml — gateway container port renamed from gateway to http when proxy.enabled=false so the Service’s targetPort: http resolves.
Type: additive / pure docs + 5 in-pass micro-fixes. (1) No spec contract touched; the CHANGES.md cross-reference map is unchanged. (2) The 4 broken-link fixes are typos / file-rename consequences — none change documentation meaning. (3) The soc2-readiness.md test-count bumps are stale-text refresh; the audit document captures 1486 → 1678 and the next docs sync handles 1678 → 1722. (4) The helm chart fix is the only behavior-relevant change: service.targetPort: http would have failed to route under proxy.enabled=false (the gateway container’s only port was named gateway, not http). Pre-fix shape was untested in CI; post-fix shape renders cleanly under helm template and matches the chart’s intent. (5) The two surfaced quickstart FAILs are NOT fixed in this audit; they sit in front of the GA gate as documented owner-triage items.
References to verify:
- docs/operations/quickstart.md — audit §2 followed it step-by-step; rough edges flagged (issue-key stdout leak; VIRTUAL_ENV warning in every CLI invocation) are cosmetic. ✓
- the phase-claim proposal (private) — the audit explicitly pairs with the proposal; both remain in front of the owner. ✓
- infra/gateway/helm/ — chart linted + templated clean across 5 profiles; the one bug found is fixed in this audit. ✓
- packages/metis-core/src/metis_core/routing/availability.py:151 — NETWORK error class trip behavior cited as the root cause of GA blocker #1. ✓
- apps/cli/src/metis_cli/trial.py — the metis trial flow has no retry inside it; quoted in GA blocker #1 as one of three repair candidates. ✓
- docs/operations/soc2-readiness.md — the in-pass test-count bumps land here; the next sync (Wave 14 docs sync) will rebump 1678 → 1722 alongside AGENTS.md / README.md. ✓
Status: verified at the moment of audit. The 4 broken-link fixes are surgical edits; the helm chart fix renders cleanly. The 2 GA blockers are NOT fixed — they require owner ratification on the repair approach. The audit document is a snapshot, not a continuous gate; CI does not run mkdocs build --strict or helm template against this audit’s pass criteria today (potential follow-on per the audit’s §6 closing).
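
The audit's tooling for the link pass is not described in this entry. As a rough illustration of the "wrong depth" class of defect fixed above, a relative-link check could look like the sketch below. The function name `broken_relative_links` and the link regex are hypothetical, and the regex only covers inline `[text](target)` links.

```python
import re
from pathlib import Path

# Matches inline markdown links such as [text](target); image links match too.
LINK_RE = re.compile(r"\[[^\]]*\]\(([^)\s]+)\)")
# Anything with a URL scheme (https:, mailto:, ...) is not a repo-relative path.
SCHEME_RE = re.compile(r"^[a-zA-Z][a-zA-Z0-9+.-]*:")


def broken_relative_links(markdown_file: Path) -> list[str]:
    """Return relative link targets in one file that do not resolve on disk."""
    broken = []
    text = markdown_file.read_text(encoding="utf-8")
    for target in LINK_RE.findall(text):
        # Skip absolute URLs and same-page anchors.
        if SCHEME_RE.match(target) or target.startswith("#"):
            continue
        path_part = target.split("#", 1)[0]
        # Relative links resolve against the directory of the linking file,
        # which is why a wrong number of "../" segments breaks the link.
        if not (markdown_file.parent / path_part).exists():
            broken.append(target)
    return broken
```

Run against a file under docs/specs/, a target of `../packages/...` resolves inside docs/ and is reported, while `../../packages/...` resolves from the repo root and passes.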

2026-05-15 — gateway.md §12 “Self-serve signup” + `metis_gateway/signup.py` + `/signup` + `/account/keys` (Wave 14)

Specs: gateway.md gains a new §12 (“Self-serve signup”) and renumbers the trailing two sections (§13 Follow-ons, §14 References). The §10.5 / §11.6 / §3 surface tables are unchanged; the new endpoints are documented under §12 only and gated by SignupConfig.enabled so the in-VPC default posture is identical to pre-Wave-14.
Change: Closes the buyer-trial onramp gap: a new buyer can arrive at a SaaS-hosted gateway, post POST /signup with email + workspace_name (+ optional user_id / team_id), receive a magic-link verification URL (logged to stdout in v1 — Wave 15 wires SES/SendGrid), post the link to POST /signup/verify to claim a session token + a first gateway key, then manage subsequent keys via GET/POST /account/keys and DELETE /account/keys/{id}. (a) New module apps/gateway/src/metis_gateway/signup.py — Account / MagicLink / AccountSession records, AccountStore (JSON-backed ~/.metis/gateway/accounts.json mode 0o600 with atomic write-temp-then-rename, mirroring keystore_admin.atomic_write_keystore), SignupConfig, the 5 HTTP handlers, and SignupError for typed HTTP-shape failures. Magic-link + session tokens are 32-byte URL-safe randoms with SHA-256-hashed persistence (mirrors the gateway-key shape). Magic links are single-use; re-posting /signup against a still-pending account re-mints so a user who lost the first email isn’t stuck. Defaults: magic-link TTL 30 min, session TTL 24 h. (b) apps/gateway/src/metis_gateway/app.py — GatewayConfig picks up an optional signup: SignupConfig | None field; build_app mounts the 5 routes only when signup_state resolves non-None; SignupError is registered as a typed exception handler ahead of the catch-all. (c) apps/gateway/src/metis_gateway/cli.py + apps/cli/src/metis_cli/main.py — --enable-signup, --signup-dashboard-url, --signup-accounts-path flags on metis gateway. (d) infra/gateway/entrypoint.sh — three new METIS_GATEWAY_SIGNUP_* env vars translated to the CLI flags. (e) infra/gateway/helm/values.yaml gains a signup.{enabled,dashboardUrl,accountsPath} block (default enabled: false — matches the in-VPC posture); infra/gateway/helm/templates/deployment.yaml propagates these as env vars when enabled. 
(f) Tests: 20 new in apps/gateway/tests/test_signup.py — end-to-end signup → verify → key issuance, magic-link single-use, validation (bad email / bad workspace_name / duplicate email / unknown magic link), session-required gating on all /account/* endpoints, /account/keys GET / POST / DELETE happy + reject paths, signup-disabled 404 posture (default), accounts.json chmod 0o600, no plaintext tokens persisted.
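
The persistence scheme described in (a) can be sketched as follows: tokens are 32-byte URL-safe randoms, only their SHA-256 hash is stored, and the accounts file is written to a temp file and renamed into place with mode 0o600. This is a minimal illustration under those stated properties, not the contents of signup.py; `mint_token` and `atomic_write_json` are hypothetical names, and the atomic-rename guarantee assumes a POSIX filesystem.

```python
import hashlib
import json
import os
import secrets
import tempfile
from pathlib import Path


def mint_token() -> tuple[str, str]:
    """Return (plaintext, sha256_hex); only the hash should be persisted."""
    plaintext = secrets.token_urlsafe(32)  # 32 random bytes, URL-safe encoded
    digest = hashlib.sha256(plaintext.encode("utf-8")).hexdigest()
    return plaintext, digest


def atomic_write_json(path: Path, payload: dict) -> None:
    """Write JSON to a temp file in the same directory, then rename over path."""
    path.parent.mkdir(parents=True, exist_ok=True)
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, prefix=path.name + ".")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())
        # mkstemp already creates the file owner-only; set it explicitly anyway.
        os.chmod(tmp_name, 0o600)
        # Readers see either the old file or the new one, never a partial write.
        os.replace(tmp_name, path)
    except BaseException:
        if os.path.exists(tmp_name):
            os.unlink(tmp_name)
        raise
```

The plaintext token is returned to the caller once and never written; a later request is authenticated by hashing the presented token and comparing against the stored digest.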
Type: additive. (1) GatewayConfig.signup is a new optional field with None default — existing callers (the docker entrypoint, the CLI without --enable-signup, the helm chart with signup.enabled: false) are byte-identical to pre-Wave-14. (2) When signup is off, no new routes are mounted and /signup / /account/keys return 404 (Starlette’s default no-route response); no behavioral change for any deployed gateway. (3) Account records are a new file alongside keys.json, not a schema change to the keystore; existing keystores keep working. (4) The keys signup issues are ordinary GatewayKey records — they participate in the same auth path, the same gateway.key_issued / gateway.key_revoked audit events, the same /analytics/by_key rollups. No event-catalog change; no payload-registry change. (5) Helm values default signup.enabled: false; chart-upgrader observes no diff unless they opt in.
References to verify:
- gateway.md §3.3 (authentication / keystore) — unchanged; account-issued keys go through the same build_new_key_record factory + atomic_write_keystore path as metis gateway issue-key. The user_id / team_id echo from the account onto each key. ✓
- gateway.md §11 (key lifecycle) — DELETE /account/keys/{id} wraps the existing keystore_admin.revoke_key. The 401 key_revoked body / gateway.key_revoked audit event are unchanged. ✓
- gateway.md §3.2 (network posture) — unchanged; signup endpoints inherit whatever bind / TLS / rate-limit posture the operator chose. Helm values caller note (the production-mode toggle for in-VPC deploys) is signup.enabled: false. ✓
- multi-user.md §3.3 (PII handling) — the new accounts.json carries plaintext email (same posture as multi-user.md’s users.json); the trace store still never sees email, only user_id. The email_sha256 derived field is present per multi-user.md’s join-by-email recipe. ✓
- multi-user.md §8.1 (no SSO in v1) — signup’s “no IdP” posture matches; magic link is the only credential. ✓
- gateway-hardening.md §3 (rate limiting) — opt-in middleware applies to /signup + /account/* exactly as it applies to /v1/chat/completions; gateway.md §12.5 calls this out as operator-owned. ✓
- audit-log.md — no new event types; signup-issued key issuance already emits gateway.key_issued, revocation emits gateway.key_revoked via the shared keystore_admin path. ✓
- event-bus-and-trace-catalog.md §6 — no payload-registry change; AUDIT_EVENT_TYPES frozenset unchanged. ✓
Status: verified. 20 new tests in test_signup.py; full gateway test suite + ruff clean. Magic-link transport is explicitly stubbed (Wave 15 deferral documented in §12.3 + §12.5). The HTTP key-rotation endpoint, password auth, OIDC SSO, and account-level billing remain non-goals per §12.6.

2026-05-15 — sales toolkit landed under `docs/sales/` (one-pager + competitive comparison + objection handling + buyer FAQ + case-study template)

Specs: none touched. Pure sales-surface artifact; the load-bearing numbers come from the project strategy (private) (delegation 8.3% – 26.1% range across §A3-rev5 + §A3-rev6, model-selection §A3-rev3 N=1 inversion), docs/savings-demo.md (the canonical headline table), docs/customer-trial-recipe.md (the trial flow the toolkit CTAs into), docs/market-research/03-routing-layers.md (verified 2026-05-09 — source for the LiteLLM / Portkey / Helicone competitive cells and the live LiteLLM issue numbers), and docs/operations/soc2-readiness.md (compliance posture). The docs/sales/ tree sits parallel to docs/operations/ — buyer-facing, but distinct from the SRE-facing operational playbooks.
Change: (a) NEW the sales toolkit (private) — single-page pitch with the §A3-rev3 headline table verbatim ($0.0383 / $0.1176 / $0.0477), the 8.3% – 26.1% delegation range, the 100% prompt-cache fire rate / 22.8% same-workload reduction from RESULTS.md §Run 3, the three workload shapes where Metis won’t move the needle on routing (mirrored from customer-trial-recipe.md §6), and a deployment-shape grid covering Docker compose / in-cluster helm / SaaS-not-shipped. (b) NEW the sales toolkit (private) — 19-row capability table comparing Metis to LiteLLM / Portkey / Helicone (internal IR, cache_control round-trip, thinking blocks, tool-use round-trip, routing, cost attribution, prompt-cache discipline, per-key/user/team rollup, audit log, retention, GDPR forget, self-host, cloud-required, /metrics, replay survival); cites the live LiteLLM open issues (#27512, #27469, #26916, #24985, #15601, #26625, #20418, #20485, #26937) verbatim from docs/market-research/03-routing-layers.md; includes “what each competitor does better than Metis” and “when to disqualify Metis honestly” sections. (c) NEW the sales toolkit (private) — 10 buyer objections with honest responses: the two risks named in the project strategy (private) (Vercel AI SDK, Cursor / Claude Code), LiteLLM-is-good-enough (with the bug list), unproven-savings, operational-load, per-team-from-provider, where-does-data-go, are-you-going-to-be-around, SOC2 / ISO / HIPAA, OpenRouter-is-enough, early-or-beta. Each objection includes a “what to actually say” line. (d) NEW the sales toolkit (private) — 20-question buyer FAQ covering what-problem, how-it-works, pricing-is-open, vs-LiteLLM, vs-Portkey, vs-Helicone, devs-changing-tools, providers, savings-evaluation, savings-number, SOC2, GDPR, data-location, deployment, operational-load, rate-limits / key-rotation / quotas, licensing, trace-events-access, who-runs-this, roadmap. Cross-links to competitive-comparison.md and objection-handling.md for depth.
(e) NEW the sales toolkit (private) — 10-section template the first GA customer will fill: workload, what-they-wanted, what-they-did (with pointer to customer-trial-recipe.md path), the cost-per-quality numbers table, what surprised them, where-it-didn’t-help (load-bearing for credibility — mirrors customer-trial-recipe.md §6), what-next, reproduction recipe, customer quote, caveats. (f) README.md — adds a “Sales toolkit” section between the existing “Buyer trial” and “Operations” sections; five-link bullet list points at each of the five new files. The existing “Operations” section content is unchanged.
Type: additive. (1) No spec content edited; the CHANGES.md cross-reference map is unchanged. (2) docs/sales/ is a new sibling under docs/, parallel to docs/operations/ — no file moved. (3) All five new files internally link via relative paths (../the project strategy (private), ../savings-demo.md, ../customer-trial-recipe.md, ../market-research/, ../operations/, ../specs/, ../../benchmarks/RESULTS.md, ../../infra/gateway/helm/); paths spot-checked against the actual file layout. (4) The numbers quoted are the same numbers the product site quotes — both surfaces derive from the same GTM headline posture in the project strategy (private), so the toolkit and the marketing site can’t drift independently. (5) The product-site compare.astro page covers the same competitive ground at a higher level; the sales toolkit (private) is the deeper internal-use version that includes the “when to disqualify Metis honestly” section. The two surfaces are deliberately not byte-identical — one is public marketing, the other is internal sales prep. (6) objection-handling.md is internal-use; the file’s preamble says so explicitly. (7) case-study-template.md has a “how to fill this in” section that gets deleted before publication; the template itself is the artifact a future case-study PR uses as the starting point.
References to verify:
- docs/the project strategy (private) (GTM headline posture: §A3-rev3 N=1 model-selection inversion, delegation 8.3% – 26.1% range) — one-pager.md headline-numbers section and faq.md “what’s the savings number” both quote the trio of numbers verbatim. ✓
- docs/the project strategy (private) (competitive risks — Vercel AI SDK as highest, Cursor / Claude Code as named secondary) — objection-handling.md handles both verbatim with the “what’s true” / “what’s not in our favor” / “what to actually say” structure. ✓
- docs/the project strategy (private) (open strategic questions: buyer profile, local-first vs SaaS, pricing) — faq.md “how much does it cost” / “what’s the deployment shape” / “what’s the licensing posture” sections name the open questions as open. ✓
- docs/savings-demo.md (canonical evidence pack — §A3-rev3 table with quality sum / cost / cost-per-quality) — one-pager.md headline-numbers section, competitive-comparison.md “what the learned-routing difference actually means” section, faq.md “what’s the savings number” all quote the table verbatim and link back. ✓
- docs/customer-trial-recipe.md (the three evaluation paths + §6 caveats) — one-pager.md “how to evaluate” section, faq.md “how do I evaluate the savings on my own workload”, case-study-template.md §3 / §6 / §8 all link back; the three workload shapes (single-model, very short sessions, no quality signal) appear verbatim in one-pager.md, objection-handling.md (unproven-savings response), and case-study-template.md §6. ✓
- docs/market-research/03-routing-layers.md (verified 2026-05-09) — competitive-comparison.md table cells cite the same LiteLLM issue numbers (#27512, #27469, #26916, #24985, #15601, #26625, #20418, #20485, #26937) and the “OpenAI-shape internal IR can’t represent these blocks losslessly” framing. ✓
- docs/operations/soc2-readiness.md — objection-handling.md SOC2 response and faq.md SOC2 / compliance section both reference the file and quote the gap list (CC8 change management, third-party pentest, vendor review, SOC2 auditor) and the Q3 2026 Type 1 contingency. ✓
- docs/specs/multi-user.md, docs/specs/audit-log.md, docs/specs/trace-retention.md, docs/specs/redaction.md, docs/specs/pattern-store.md, docs/specs/delegation.md, docs/specs/gateway.md, docs/specs/analytics-api.md — referenced from competitive-comparison.md, faq.md, and case-study-template.md; all paths resolve. ✓
Status: verified. All five new files render as valid GitHub-flavored markdown by structural inspection; the relative-path link surface was spot-checked against the actual layout under docs/, docs/operations/, docs/specs/, docs/market-research/, benchmarks/, infra/gateway/. No build / test / CI pipeline is wired to this surface yet; if a future docs-build step lands, mkdocs build --strict will validate every internal link in the same pass that catches the existing surfaces. The toolkit deliberately does not duplicate the product-site marketing surface (product-site/) — the audience is sales staff prepping for a buyer conversation, not the public website reader.

2026-05-15 — evaluator.md §5.4 v1.2 partial-credit primitive (§A3-rev6 / 13a-1 follow-up — Wave 14a-7 path-a)

Specs: evaluator.md v1.1 → v1.2. §5.4 gains a “Partial-credit primitive (v1.2)” subsection between the grounding-check description and §5.5 (tool-cycle rubric). The WORKLOAD_HEURISTIC_RUBRIC_VERSION bumps 1.1.0 → 1.2.0 (new score series; no silent recalibration of prior verdicts per evaluator.md §12 invariant 7). WORKLOAD_HEURISTIC_RUBRIC_ID is unchanged.
Change: Implements the §A3-rev6 / 13a-1 “open path (a) — finer-grained outcome scoring” that the user brief named as one of two remaining wedges for §A3-rev7. Across six A3 iterations the per-workload haiku-vs-sonnet quality gap on the v1 suite is below the heuristic judge’s resolution; the pass/fail substring assertion (expect_substring_in_final_response) collapses partial successes (12/16 regex cases, 3/4 pytest tests) to 0, erasing the gradient haiku and sonnet actually produce. (a) packages/metis-core/src/metis_core/eval/rubric.py — new PartialCreditConfig frozen dataclass (enabled: bool=False, criterion: "test_pass_count_ratio", map: "linear"|"stepped"); new field WorkloadRubric.partial_credit: PartialCreditConfig | None = None; new _parse_partial_credit validator rejecting unknown keys + bad criterion/map enums + non-bool enabled. (b) packages/metis-core/src/metis_core/eval/judge.py — HeuristicJudge._evaluate_workload consults the new config and, when enabled: true, bypasses the substring assertion entirely. The criterion parser handles two response shapes: PASS N/M / FAIL N/M runner output (the convention in this repo’s runner.py files; last occurrence wins so iterative per-case lines followed by a final summary grade correctly), and pytest summary tokens N passed, M failed, K error(s) (skipped tests excluded from the denominator). Composition is (base + partial_credit_score) / 2.0, parallel to grounding — perfect-pass recovers the same composed score as the prior substring_present=True path; zero-pass recovers the prior substring_present=False halving; mid-ratios produce mid-scores. New audit-trail signals: partial_credit_score / partial_credit_ratio / partial_credit_passed / partial_credit_total / partial_credit_criterion / partial_credit_map / partial_credit_test_signal_found. New positive/negative flags: partial_credit_full / partial_credit_partial / partial_credit_zero / partial_credit_no_test_signal. 
(c) packages/metis-core/src/metis_core/eval/__init__.py — re-exports PartialCreditConfig / PartialCreditCriterion / PartialCreditMap. (d) Five existing workloads updated to opt in (workloads whose final response carries a countable test outcome): regex-with-edge-cases, recursive-data-structure-traversal, subtle-bug-fix-with-test, refactor-with-contract-preservation, multi-file-refactor-with-shared-types — each drops expect_substring_in_final_response and adds the partial_credit: {enabled: true, criterion: test_pass_count_ratio, map: linear} block with a workload-specific comment explaining what the gradient now surfaces. The remaining 6 workloads (control / delegation / write-a-doc / multi-turn-refactor / fix-a-bug-small / architectural-explanation-without-hallucination) are unchanged because they don’t carry a countable test output. (e) Tests: 13 new partial-credit cases in packages/metis-core/tests/eval/test_judge.py (half-pass scores 0.75; perfect-pass recovers 1.0; zero-pass recovers 0.5; pytest summary parser; runner-line priority; no-signal failure mode; last-line-wins on multiple matches; stepped vs linear divergence at 5/8; endpoints preserved under stepped; disabled-block back-compat; no-block back-compat; skipped-excluded-from-total; live regex-workload fixture smoke) plus 8 new schema-parsing cases in packages/metis-core/tests/eval/test_rubric.py (defaults to None; accepts full block; accepts stepped map; defaults enabled=False; rejects unknown keys / non-bool enabled / unknown criterion / unknown map / non-mapping payload).
Type: additive. (1) WorkloadRubric.partial_credit defaults to None — pre-v1.2 workloads with no partial_credit block produce byte-identical verdicts (test_partial_credit_no_block_is_pre_v1_2_compatible is the regression net). (2) PartialCreditConfig(enabled=False) is also a no-op — workloads can author the block in advance of opting in. (3) WORKLOAD_HEURISTIC_RUBRIC_ID is unchanged; WORKLOAD_HEURISTIC_RUBRIC_VERSION bumps 1.1.0 → 1.2.0 so dashboards distinguish the two score series per evaluator.md §12 invariant 7. (4) No event-catalog change; no payload-registry change; no AUDIT_EVENT_TYPES change; no analytics-API change. (5) The LLM judge tier is unchanged — partial-credit is heuristic-only; the LLM tier forms its own [0, 1] judgment from the response text directly. (6) No routing-engine change; the pattern store reads verdict.score unchanged.
References to verify:
- evaluator.md §5.4 — extended with the new “Partial-credit primitive (v1.2)” subsection; the example schema picks up the partial_credit: block. ✓
- evaluator.md §12 invariant 7 (“rubric_version bumps produce a new score series rather than silent recalibration”) — honored: WORKLOAD_HEURISTIC_RUBRIC_VERSION = "1.2.0". ✓
- benchmark.md §3.1 (workload.yaml evaluate: block schema) — the optional partial_credit: sub-block is documented in evaluator.md §5.4; benchmark.md doesn’t repeat the schema and doesn’t need an edit. ✓
- pattern-store.md §15 (the evaluator’s quality score is the K-NN’s outcome input) — unchanged; the K-NN reads score, which the rubric still produces in [0, 1]. ✓
- the project strategy (private) Wave 13 / 13a-1 follow-up — names path (a) finer-grained outcome scoring as one of two remaining §A3-rev7 wedges; this entry ships the implementation. The §A3-rev7 brief itself is unwritten until 14a-7 runs end-to-end. ✓
Status: verified at the unit-test level. 21 new tests; baseline + new = 1697 expected. Ruff clean on all changed source files. Coordination: 14a-7 (§A3-rev7) DEPENDS on this landing — the §A3-rev6 / 13a-1 brief lists “finer-grained outcome scoring” as one of two paths still open after §A3-rev6 + 13a-1, with the other being “task domains haiku has known weakness in.” The §A3-rev7 run will exercise the new primitive end-to-end against the regex / subtle-bug / recursive-traversal / refactor-with-contract workloads. The wire is fully back-compat — 14a-7 picks up the new score series by opening a fresh patterns DB; old DBs continue to read at WORKLOAD_HEURISTIC_RUBRIC_VERSION = "1.1.0".

2026-05-15 — product-site GA polish (post-Wave-13 numbers + comparison page + how-it-works + pricing placeholder)

Specs: none touched. Pure marketing-surface refresh; the load-bearing numbers come from the project strategy (private) (delegation 8.3% – 26.1% across §A3-rev5 + §A3-rev6, model-selection §A3-rev3 N=1 inversion), docs/savings-demo.md (the canonical headline table), and docs/customer-trial-recipe.md (the trial flow this site CTAs into). The docs/market-research/03-routing-layers.md table is the source for the LiteLLM / Portkey / Helicone competitive cells. The product-site sits under product-site/ and is intentionally outside the doc / spec tree — this entry is logged for cross-reference discipline even though no spec content moved.
Change: (a) product-site/src/pages/index.astro rewritten: hero headline pivots from the stale “three levers, applied together” framing to “picks the model that succeeds, not just the cheap one”; two headline-stat tiles surface the delegation 8.3% – 26.1% range (per the project strategy (private) GTM headline posture) and the §A3-rev3 $0.0477 / quality datapoint sitting between haiku-only $0.0383 and sonnet-only $0.1176; status badge updated to “phase 3 in flight · gateway GA-ready · 1,678 tests”; “shipped” list rewritten against the post-Wave-13 inventory (transparent gateway, per-user/per-team/per-key cost attribution, pattern store, hybrid evaluator, worker delegation, audit log + 90-day trace retention + GDPR export/forget, SOC2 gap audit, helm chart, < 1-hour buyer-trial recipe); “in progress” trimmed to context-assembler v3, skill curator, async delegation, pricing ratification. (b) NEW product-site/src/pages/compare.astro: one-page sales tool comparing Metis to LiteLLM / Portkey / Helicone across 6 sections (wire fidelity, routing, cost attribution, quality signal, compliance, deployment); cites the live LiteLLM issue numbers (#27512 thinking-block drop, #27469 tool_call args lost) as documented in docs/market-research/03-routing-layers.md; “pick Metis when / pick a commodity gateway when” framing kept honest with the customer-trial-recipe §6 caveats (single-model workloads, very short sessions, no quality signal). (c) NEW product-site/src/pages/pricing.astro: “coming soon · ratifies in Wave 15” amber-badge banner; three indicative-not-committed tiers (open-source gateway free, Pro per-seat TBD, Enterprise %-of-savings TBD) reflecting the docs/specs/pricing.md recommendation; FAQ explicitly names Wave 15 as the ratification point; CTA throughout points to /signup for the 90-day grandfathered-rate beta-seat promise. 
(d) NEW product-site/src/components/Nav.astro + Footer.astro + ArchitectureDiagram.astro: factored out cross-page chrome; the architecture diagram is an ASCII rendering of the gateway → routing → pattern store → evaluator → trace store → analytics shape documented in docs/specs/gateway.md and docs/specs/deployment-shape.md. (e) New “How it works” section in index.astro immediately above the four-leg-moat section: ASCII architecture diagram plus a “gateway-first deployment / agent upgrade path” pair that mirrors the project strategy (private) hybrid decision. (f) Four-leg-moat section rewritten verbatim against the project strategy (private): bounded memory + lossless canonical IR + task-fingerprint pattern learning + auto-derived skill curation, with the “legs 3 and 4 compose” callout. (g) Status-page integration: nav badge + footer link + dedicated callout in the Status section point at https://status.example.com (the operational status page documented in docs/operations/status-page.md).
Type: additive. (1) No spec content edited; the docs/ tree is untouched. (2) Existing in-repo links into docs/specs/, docs/savings-demo.md, docs/customer-trial-recipe.md, the project strategy (private) continue to resolve byte-identically. (3) The /signup route is an outbound link, not a route owned by the static site; it points at the form Wave 14a-2 ships. Until that lands, the link 404s — flagged here so the cross-team sequencing is explicit. (4) The https://status.example.com link assumes the operator wires the subdomain per docs/operations/status-page.md recipe; the marketing site does not provision the status surface itself. (5) Astro build verified clean (3 pages generated in 1.53s, 0 warnings). (6) No CI / lint / test path under product-site/ is connected to the repo-root pyproject.toml test surface; the site builds via npm run build and deploys via the GitHub Actions workflow at .github/workflows/deploy-product-site.yml (per product-site/HOSTING.md).
References to verify:
- docs/the project strategy (private) (GTM headline posture: delegation 8.3% – 26.1% range, §A3-rev3 N=1 model-selection inversion) — hero stat tiles quote both verbatim. ✓
- docs/the project strategy (private) (four-leg moat) — moat section names all four legs in the same order; the “legs 3 and 4 compose” callout is preserved. ✓
- docs/savings-demo.md (the canonical evidence pack — §A3-rev3 table with quality sum / cost / cost-per-quality) — site quotes the $0.0477 / $0.0383 / $0.1176 trio and the 5.55 / 5.16 quality-sum delta in the hero tile context paragraphs. ✓
- docs/customer-trial-recipe.md §6 (when the trial won’t show savings) — compare.astro “pick a commodity gateway when” panel mirrors the three caveats (single-model workloads / very short sessions / no quality signal). ✓
- docs/market-research/03-routing-layers.md (verified 2026-05-09) — comparison table cells cite the same LiteLLM issue numbers (#27512 / #27469) and the LiteLLM “logs are wire-format, replays die when a provider changes shape” finding. ✓
- docs/specs/pricing.md — pricing-page tier shape (open-core gateway free + per-seat Pro + reserved enterprise %-of-savings add-on) matches the recommended model; the page is explicit that the numbers are TBD pending Wave 15 ratification per the project strategy (private). ✓
- docs/specs/gateway.md / docs/specs/deployment-shape.md — ASCII architecture diagram reflects the per-request-stateless gateway + 7-slot routing chain + pattern store at slot 4 + evaluator + trace store wiring documented in both specs. ✓
- docs/operations/status-page.md — https://status.example.com is the conventional subdomain the operator wires per the doc’s two-tier recipe (external UptimeRobot / Statuspage.io / Better Stack against /healthz plus self-hosted Uptime Kuma in-cluster). ✓
Status: verified. Astro build runs clean (npm run build: 3 pages, 1.53s, no warnings); rendered HTML grep confirms the 8.3% – 26.1% and $0.0477 hero numbers appear verbatim, the https://status.example.com link is present on all three pages, and the /signup CTA fires from every page’s nav plus an in-body CTA. No visual / browser verification ran — the site is built and the static HTML is sensible by structural inspection, but a human pass against the rendered pages is the load-bearing GA gate.
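
The "rendered HTML grep" check in the Status line above can be approximated with a small script. This is a minimal sketch, not the check that was actually run: the function name `missing_strings` and the idea of passing the build output directory as an argument are assumptions made for illustration.

```python
from pathlib import Path


def missing_strings(site_dir: Path, required: list[str]) -> dict[str, list[str]]:
    """Map each built HTML page to the required strings it does not contain.

    An empty result means every page under site_dir contains every string.
    """
    report: dict[str, list[str]] = {}
    for page in sorted(site_dir.rglob("*.html")):
        html = page.read_text(encoding="utf-8")
        absent = [needle for needle in required if needle not in html]
        if absent:
            # Key by path relative to the build root so the report is portable.
            report[str(page.relative_to(site_dir))] = absent
    return report
```

Calling it with the status-page URL and the /signup path against the build output directory would reproduce the "present on all three pages" claim. Strings such as the hero percentages may be entity-encoded in the rendered HTML, so a plain substring match like this one can under-report them.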

2026-05-15 — mkdocs-material doc site scaffolding (mkdocs.yml + docs/index.md + infra/docs/ + docker-compose docs profile)

Specs: none touched. Pure-config doc-site build; no spec content edited, no cross-reference rewrite. The existing docs/ tree (specs/, operations/, market-research/, top-level strategy / overview / quickstart files) stays in place; the four-section hierarchy (Getting Started / Specs / Operations / Strategy) is expressed only in mkdocs.yml’s nav: block.
Change: (a) NEW mkdocs.yml at repo root — Material theme, search enabled, per-page GitHub edit + view actions (repo_url=david-2814/metis, edit_uri=edit/main/docs/), light/dark palette toggle, pymdownx.* markdown extensions (superfences, highlight, tabbed, details, tasklist), and an explicit nav listing every existing doc under one of the four top-level sections (docs/sales/ is surfaced as a sub-section under Strategy). A validation: block downgrades pre-existing spec→source-code link warnings (../../packages/metis-core/..., ../../AGENTS.md, etc., which render fine on GitHub but aren’t doc files) to info so they don’t trip --strict; nav-level issues (missing files, files-not-in-nav) stay at warn so a real reorg regression still fails the build. (b) NEW docs/index.md — landing page describing the four sections + local-preview / docker-preview recipes; does not duplicate or summarize spec content. (c) NEW infra/docs/Dockerfile — multi-stage build (builder installs pinned mkdocs-material==9.5.39, renders the site with mkdocs build --strict; runtime is python:3.13-slim with the rendered /srv/docs and python -m http.server as the static server). Mirrors the gateway image shape — same base image, same non-root uid/gid (1000:1000), same entrypoint indirection, same healthcheck cadence. Docs are public content; no loopback constraint. (d) NEW infra/docs/entrypoint.sh — env-driven (METIS_DOCS_HOST / METIS_DOCS_PORT / METIS_DOCS_ROOT) shell wrapper around python -m http.server. (e) docker-compose.yml — adds a docs service under the docs profile (so docker compose up without the profile still runs just the gateway). Maps port 8423 to host. (f) README.md — the existing “Documentation” section gains a short pre-amble linking to the site, the mkdocs serve local recipe, the docker compose --profile docs up docs recipe, and pointers to mkdocs.yml + infra/docs/. The full spec link list below it is unchanged.
Type: additive. (1) Zero spec content edited; the CHANGES.md cross-reference map is unchanged. (2) Existing in-repo links into docs/specs/, docs/operations/, docs/market-research/ continue to resolve byte-identically — no file moved. (3) The mkdocs site is opt-in: nothing in the build / test / lint / smoke / benchmark path runs mkdocs build. The docker-compose docs profile means docker compose up (no profile flag) gets the gateway only, unchanged from before. (4) The --strict flag on mkdocs build inside the image is a future-proofing gate — a broken nav target or dead internal link will fail the image build instead of shipping a 404, so a future docs reorg can’t silently degrade the site.
References to verify:
- Existing top-level docs (the project strategy (private), docs/KNOWN_ISSUES.md, docs/project-overview.md, docs/customer-trial-recipe.md, docs/gateway-client-quickstart.md, docs/gateway-deployment.md, docs/savings-demo.md, docs/standard-model-profiles.md) — paths unchanged; specs that reference them (e.g. event-bus-and-trace-catalog.md → gateway-deployment.md, specs/project-overview.md → the project strategy (private) / KNOWN_ISSUES.md) still resolve. ✓
- docs/specs/ tree — every file remains at its current path; nav entries in mkdocs.yml use specs/<name>.md relative to docs_dir: docs. ✓
- docs/operations/ tree — every file remains at its current path. ✓
- docs/market-research/ tree — every file remains at its current path. ✓
- docs/sales/ tree — every file remains at its current path; nav surfaces all five (one-pager / faq / competitive-comparison / objection-handling / case-study-template) under Strategy > Sales. ✓
- infra/gateway/Dockerfile — image shape (base image, uid/gid, env-driven entrypoint, healthcheck cadence) mirrored intentionally in infra/docs/Dockerfile; the docs version omits the writable-dir + workspace-mount machinery the gateway needs since static-doc serving has no per-request state. ✓
Status: verified. The reorganization is nav-only (no file moves); cross-references unchanged. mkdocs build --strict --site-dir /tmp/metis-docs-site exits clean with the validation tuning described under Change (a) in place. A follow-on can wire mkdocs build --strict into CI if/when CI exists.
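
For readers who want to see the shape of the env-driven static server described in (c) and (d), here is a rough Python equivalent of what the shell entrypoint does. It is a sketch under stated assumptions: the real wrapper is infra/docs/entrypoint.sh, and the default values below (bind address, port, root) are guesses for illustration, not values read from that script. `DocsServerConfig`, `resolve_config`, and `build_server` are hypothetical names.

```python
import os
from dataclasses import dataclass
from functools import partial
from http.server import SimpleHTTPRequestHandler, ThreadingHTTPServer
from typing import Mapping


@dataclass(frozen=True)
class DocsServerConfig:
    host: str
    port: int
    root: str


def resolve_config(env: Mapping[str, str] = os.environ) -> DocsServerConfig:
    """Read the three METIS_DOCS_* variables, falling back to assumed defaults."""
    return DocsServerConfig(
        host=env.get("METIS_DOCS_HOST", "0.0.0.0"),    # assumed default
        port=int(env.get("METIS_DOCS_PORT", "8423")),  # assumed default
        root=env.get("METIS_DOCS_ROOT", "/srv/docs"),  # assumed default
    )


def build_server(cfg: DocsServerConfig) -> ThreadingHTTPServer:
    """Create (but do not start) a static file server rooted at cfg.root."""
    handler = partial(SimpleHTTPRequestHandler, directory=cfg.root)
    return ThreadingHTTPServer((cfg.host, cfg.port), handler)
```

Starting it would be `build_server(resolve_config()).serve_forever()`. Keeping config resolution separate from server construction makes the env handling testable without binding a socket.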

2026-05-15 — AGENTS.md / README.md / the project strategy (private) post-Wave-13 sync + the phase-claim proposal (private) drafted (post-Wave-13 docs sync)

Specs: none touched. Pure high-level doc housekeeping after Wave 12 (compliance triad + SOC2 readiness + cost_weight=0.05) and Wave 13 (multi-tenant gateway + benchmark-suite v2 + trace + pattern production audits) landed.
Change: (a) AGENTS.md (and the CLAUDE.md symlink): test count 1486 → 1678 in two places; “What works” gains nine entries for Wave-12 audit log / trace retention / redaction layer / GDPR export+forget / SOC2 readiness audit / cost_weight=0.05 default, plus Wave-13 multi-tenant gateway / benchmark-suite v2 / trace-store production audit / pattern-store production audit / metis trial buyer quickstart; status sentence updated with §A3-rev6 + 13a-1 framing including the “all routing-engine mechanical blockers are live and verified; remaining bottleneck is benchmark-suite signal strength” diagnosis and the 8.3%–26.1% delegation range across §A3-rev5 + §A3-rev6; the loopback-bind gotcha is split into the metis serve (loopback-only retained) and metis gateway (Wave-13 opt-in non-loopback with documented hardening) cases. (b) docs/the project strategy (private): adds the Wave 13 / 13a-1 follow-up paragraph documenting that the path-1 (workload signal-strengthening) wedge was tried end-to-end and ruled out as a sufficient single-knob fix (no v1 workload has gap ≥ 0.15; 3 purpose-designed haiku-fail candidates all came in at gap ≤ 0.083), three plausible interpretations of why, and the two remaining 13b-1 (§A3-rev7) paths (finer-grained outcome scoring + task domains with known haiku weakness). GTM headline posture unchanged (the 13a-1 result is a negative, not a material improvement). No §A3-rev7 entry — 13b-1 has not run. (c) README.md: top status block refreshed with Wave-12 / Wave-13 / 13a-1 framing; test count 1486 → 1678 in two places; pointer to phase-claim-proposal added. (d) NEW the phase-claim proposal (private) (~200 lines): the owner-decision doc the Status sentence has been waiting on. 
Lays out three candidate positions (hold at “Phase 3 in flight” / bump to “Phase 3 shipped” / bump to “Phase 3 shipped + Phase 4 v1 started”) with evidence inventories for each Phase-3 wedge (the project strategy (private) + project-overview.md §Phasing summary), an honest accounting of what’s missing (in-session adjustment / MCP / git sync as out-of-scope-not-unfinished; N>1 model-selection generalization gated on benchmark-suite signal strength), and recommends Position B (Phase 3 shipped) with two explicit caveats stamped on the status sentence + a suggested replacement sentence + a §6 decisions-requested table. Does NOT bump the AGENTS.md status sentence itself — that remains owner-decision territory.
Type: pure docs / additive. (1) No spec contract touched; no code change; no event-catalog change. (2) Test count is observable (uv run pytest -q | tail -1 returns 1678 passed, 1 skipped); the bump is a no-op against the binary. (3) Phase-claim proposal is a draft pending owner sign-off — until ratified, AGENTS.md continues to read “Phase 3 in flight”. (4) All entries cite shipped commits / shipped specs; no claim is forward-looking.
References to verify:
- docs/specs/audit-log.md / trace-retention.md / redaction.md (Wave 12 triad) — referenced from AGENTS.md “What works”; no spec text edited. ✓
- docs/specs/gateway-hardening.md (Wave 13) — Status: reflects “v1 — shipped” (per the 2026-05-15 entry that landed earlier); AGENTS.md gotcha updated to match. ✓
- docs/specs/benchmark.md §3.1 + §4.1 (signal_strength field + v2 partition) — referenced from AGENTS.md + the project strategy (private); no spec text edited. ✓
- docs/operations/soc2-readiness.md + compliance-overview.md — referenced from AGENTS.md + the phase-claim proposal. ✓
- routing-engine.md §5.5 (cost_weight default) — referenced from AGENTS.md; the default text in routing/policy.py:76 is the source of truth. ✓
- benchmarks/RESULTS.md §A3-rev6 + §13a-1 — referenced from AGENTS.md status sentence + the project strategy (private); no edit. ✓
Status: verified. Test count run on 2026-05-15: 1678 passed, 1 skipped in 39.36s. Ruff: no source files touched, so no lint regression. The phase-claim proposal is awaiting owner sign-off per its §6.

2026-05-15 — docs/operations/trace-performance.md + 5 expression indexes + WAL gauge + `metis trace vacuum` (Wave 13a-5)

Specs: new operations doc docs/operations/trace-performance.md (~250 lines, ships the production-readiness audit). event-bus-and-trace-catalog.md §7 is unchanged — this work is purely operational + additive: existing schema invariants hold, TRACE_SCHEMA_VERSION stays at 1, and every new index is CREATE INDEX IF NOT EXISTS. observability.md §3 picks up one new gauge (metis_trace_wal_bytes).
Change: Production-readiness audit for the trace store under sustained multi-tenant load (sibling to the 13a-4 pattern-store audit). (a) Performance baseline: new scripts/bench_trace_throughput.py drives the bus + trace path with synthetic llm.call_completed events and reports events/sec + CPU share. Reference numbers on Apple M-series / Python 3.13 / SQLite 3.50.4 / 50k events: full path ~4,800 events/sec (CPU-bound at 88% of wall — msgspec encode + SQLite C-call overhead, NOT disk-bound, NOT WAL-bound), bus-only ceiling ~62,000/sec, raw-SQLite ceiling ~28,000/sec. Translates to ~100 active gateway keys per pod with 8× headroom at typical conversational load, with crisp upgrade paths beyond that documented in §5. (b) Index audit: ran EXPLAIN QUERY PLAN against every analytics/store.py query; pre-Wave-13 found user_export doing a full SCAN, events_for_turn paying a TEMP B-TREE FOR ORDER BY, and the eval-quality slice not having a dedicated index. (c) Five additive indexes in packages/metis-core/src/metis_core/trace/store.py: idx_events_turn_id_id (composite, eliminates the temp sort); idx_events_gateway_key_id / idx_events_user_id / idx_events_team_id (partial expression indexes on json_extract(...,'$.<id>') WHERE <id> IS NOT NULL — drives /analytics/by_* filters and the GDPR portability export from full SCAN to indexed lookup); idx_events_eval_subject_kind (composite expression index on (json_extract(...,'$.subject_kind'), timestamp_us) partial WHERE type = 'eval.completed' — direct serve of the /analytics/quality slice). (d) VACUUM: new TraceStore.vacuum() method + metis trace vacuum CLI subcommand at apps/cli/src/metis_cli/trace_admin.py + new helm CronJob at infra/gateway/helm/templates/cronjob-trace-vacuum.yaml (default monthly schedule 0 4 1 * *, OFF by default; values.yaml traceVacuum). Documented why auto_vacuum=INCREMENTAL is NOT enabled in v1 (it requires being set before any tables exist; deferred to a future migration). 
(e) WAL monitoring: TraceStore.wal_size_bytes() + TraceStore.wal_checkpoint(mode=...) helpers; new metis_trace_wal_bytes Prometheus gauge in packages/metis-core/src/metis_core/observability/metrics.py wired into both apps/server/src/metis_server/app.py and apps/gateway/src/metis_gateway/app.py via the existing MetricsCollector getter pattern. WAL auto-checkpoint threshold raised from SQLite’s 1000-page default (~4 MB) to 8192 pages (~32 MB) via PRAGMA wal_autocheckpoint = 8192 set in _configure; constructor knob TraceStore(wal_autocheckpoint_pages=...) lets operators with tight crash-recovery SLAs lower it. (f) Bulk-insert path: verified the existing fast-path subscriber uses parameterized SQL (no string interpolation; test_bulk_insert_uses_parameterized_sql exercises the SQL-injection-style hostile session_id). Documented that batched-INSERT subscribers would require breaking the per-event durability + streaming-protocol §3.6 replay-on-reconnect contract — owner sign-off required, NOT shipped in v1. (g) Tests: 6 in packages/metis-core/tests/trace/test_query_plans.py (every analytics query has its EXPLAIN-coverage test; the catch-all test_no_query_uses_full_scan is the regression net), 12 in packages/metis-core/tests/trace/test_maintenance.py (VACUUM doesn’t break readers, post-VACUUM index lookup still works, WAL gauge contract, wal_autocheckpoint_pages plumbing, perf smoke at >1k events/sec floor), 2 in packages/metis-core/tests/observability/test_metrics.py (trace_wal_bytes_getter drives gauge, failing getter doesn’t break exposition), 3 in apps/cli/tests/test_trace_vacuum_cli.py (subcommand parses, runs end-to-end + reports reclaimed bytes, missing DB returns nonzero).
Type: additive. (1) All five new indexes are CREATE INDEX IF NOT EXISTS — existing trace DBs pick them up on next TraceStore.__init__(); first-open builds them once (seconds for a few million rows). (2) TRACE_SCHEMA_VERSION stays at 1; the backup/restore module’s schema-version guard continues to accept Wave-12 backups under Wave-13 code and vice versa. (3) wal_autocheckpoint_pages is a constructor kwarg with a backwards-compatible default (the new 8192-page value) — pre-Wave-13 callers pass nothing and get the new default. (4) metis_trace_wal_bytes is one new gauge; existing scrapers ignore it. (5) metis trace vacuum is a new subcommand; the traceVacuum helm values block defaults enabled: false so no chart-upgrader sees a behavioral change. (6) metis trace prune is unchanged. (7) No payload-registry change, no AUDIT_EVENT_TYPES change, no event-catalog change.
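
The effect of a partial expression index on a json_extract filter, as described in (b) and (c), can be reproduced with the standard-library sqlite3 module. This is a minimal sketch, not the project's schema: the `events` table layout and the `payload` column name are assumptions, and only the index name idx_events_user_id is taken from the entry above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical, simplified table; the real trace schema lives in the trace store.
conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, type TEXT, payload TEXT)")
conn.executemany(
    "INSERT INTO events (type, payload) VALUES (?, ?)",
    [
        ("llm.call_completed", '{"user_id": "u1"}'),
        ("llm.call_completed", '{"user_id": "u2"}'),
        ("eval.completed", "{}"),  # no user_id, so the partial index skips this row
    ],
)

QUERY = "SELECT id FROM events WHERE json_extract(payload, '$.user_id') = ?"


def query_plan(sql: str, params: tuple = ()) -> str:
    """Return the EXPLAIN QUERY PLAN detail column, joined into one string."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


plan_before = query_plan(QUERY, ("u1",))  # full table scan, no usable index yet

# Partial expression index: only rows that actually carry a user_id are indexed.
conn.execute(
    "CREATE INDEX IF NOT EXISTS idx_events_user_id "
    "ON events (json_extract(payload, '$.user_id')) "
    "WHERE json_extract(payload, '$.user_id') IS NOT NULL"
)
plan_after = query_plan(QUERY, ("u1",))  # now an indexed search
matches = [row[0] for row in conn.execute(QUERY, ("u1",))]
```

The query's expression has to match the indexed expression for the planner to use the index, which is why a plan-level regression test is a reasonable guard for this kind of change.
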
References to verify:
- event-bus-and-trace-catalog.md §7.1 (schema) — five new CREATE INDEX IF NOT EXISTS statements added to _SCHEMA; spec text unchanged because indexes are an implementation detail of the trace store. The §7.1 schema block is illustrative; the tests in test_query_plans.py are the load-bearing contract. ✓
- event-bus-and-trace-catalog.md §7.2 (storage notes) — WAL + synchronous=NORMAL invariant holds; the new wal_autocheckpoint = 8192 is consistent with §7.2’s “fast-path budget” discipline (raises the checkpoint window, doesn’t change durability). ✓
- event-bus-and-trace-catalog.md §7.4 (virtual columns) — referenced by trace-performance.md §2 as the future option if any expression-indexed query becomes hot enough to need a virtual column instead. Unchanged. ✓
- event-bus-and-trace-catalog.md §7.5 (backup/restore) — TRACE_SCHEMA_VERSION stays 1; restore’s schema guard continues to work. ✓
- trace-retention.md §4 (idx_events_timestamp_us) — unchanged; the new VACUUM CronJob is a separate operational concern from the retention CronJob (their schedules are independent and they share the same RWO storage gotcha — documented in traceVacuum values block). ✓
- analytics-api.md §4.10 (GDPR portability + forget) — user_export is now indexed via idx_events_user_id; the streaming JSONL contract is unchanged. ✓
- analytics-api.md §4.1 / §4.9 (/analytics/by_user / /analytics/by_team) — partial expression indexes accelerate the filters. Response shapes unchanged. ✓
- observability.md §3 (metric surface) — one new metis_trace_wal_bytes gauge; existing list extends additively. ✓
- gateway-hardening.md §3 (rate-limit middleware) — referenced from trace-performance.md §5 with a load-bearing cross-reference to 13a-3 (lift loopback): the steady-state throughput planning curve assumes well-behaved clients. When the gateway moves off 127.0.0.1, a hostile client can push event volume far above the curve, which makes the rate-limiter the trace store’s first line of defense. Do not lift loopback without enabling the rate limiter (or fronting with an L7 WAF that does the same). 13a-3 should reference this cross-document constraint when it lands. ✓
Status: verified. Total spend: $0 (no real-API usage; benchmark is fully synthetic). 23 new tests (6 query plan + 12 maintenance + 2 observability + 3 vacuum CLI); full suite 1678 passing, 1 skipped (baseline 1599 + 13a-4 pattern-store audit + this wave + concurrent parallel work). Ruff clean on all changed files. Cross-reference with 13a-4 pattern-store audit: the two audits jointly cover the production-readiness story for both SQLite-backed stores; both findings live in docs/operations/.
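
The WAL monitoring and checkpoint-threshold change in (e) can be illustrated with plain sqlite3. This is a sketch under assumptions, not the TraceStore implementation: `open_wal_store` and the module-level `wal_size_bytes` are hypothetical helpers that imitate the behaviour described above (WAL mode, synchronous=NORMAL, wal_autocheckpoint = 8192, size read from the `-wal` sidecar file).

```python
import sqlite3
import tempfile
from pathlib import Path


def wal_size_bytes(db_path: Path) -> int:
    """Size of the write-ahead log next to db_path, or 0 if it does not exist."""
    wal = Path(str(db_path) + "-wal")
    return wal.stat().st_size if wal.exists() else 0


def open_wal_store(db_path: Path, autocheckpoint_pages: int = 8192) -> sqlite3.Connection:
    """Open a connection in WAL mode with a raised auto-checkpoint threshold."""
    conn = sqlite3.connect(db_path)
    conn.execute("PRAGMA journal_mode = WAL").fetchall()
    conn.execute("PRAGMA synchronous = NORMAL")
    conn.execute(f"PRAGMA wal_autocheckpoint = {int(autocheckpoint_pages)}").fetchall()
    return conn


with tempfile.TemporaryDirectory() as tmp:
    db = Path(tmp) / "trace.db"
    conn = open_wal_store(db)
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, payload TEXT)")
    conn.executemany("INSERT INTO events (payload) VALUES (?)", [("x" * 200,)] * 500)
    conn.commit()

    autocheckpoint = conn.execute("PRAGMA wal_autocheckpoint").fetchone()[0]
    size_before = wal_size_bytes(db)  # committed frames are still in the WAL

    # TRUNCATE mode checkpoints everything and resets the WAL file to zero bytes.
    conn.execute("PRAGMA wal_checkpoint(TRUNCATE)").fetchall()
    size_after = wal_size_bytes(db)
    conn.close()
```

A gauge built on a helper like `wal_size_bytes` shows the sawtooth between checkpoints; a higher page threshold makes each tooth taller but does not change what is durable after commit.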

2026-05-15 — benchmark.md §3.1 `signal_strength` field + §4.1 v2 suite partition + 13a-1 smoke (Wave 13a-1)

Specs: benchmark.md v1 → v1.1 (status header bumped). §3.1 adds the optional signal_strength: high | marginal field (default "high") to workload.yaml and documents the smoke-validation gate. §4 (the suite) is rewritten as §4.1 (v2 partition with the cross-run audit table) + §4.2 (process for promoting a workload to high). New entry in §12 decision log.

Change: The §A3-rev6 Q1 finding (benchmarks/RESULTS.md) was “the per-workload haiku-vs-sonnet quality delta in the benchmark suite is within run-to-run variance — the K-NN cannot learn signal that isn’t there” and listed “replace marginal workloads with high-signal alternatives” as the next move. 13a-1 (a) audits the existing 7 workloads’ cross-run signal against the §A3-rev3..rev6 patterns DBs, finding all 7 below the gap < 0.15 threshold (the worst, multi-turn-refactor, is REVERSE-signal at −0.079); (b) designs 3 high-signal candidates with hermetic workspace fixtures (subtle-bug-fix-with-test / recursive-data-structure-traversal / refactor-with-contract-preservation) targeting the patterns the user brief named (symptom-vs-root, depth-aware tree walk with composed constraints, multi-callsite contract preservation); (c) smoke-tests all 3 with 12 heuristic-judge runs + 3 hybrid-judge spot checks at temperature=0; (d) finds none clear the ≥ 0.4 gate (best gap: recursive-data-structure-traversal at +0.083 heuristic; both at 1.000 under hybrid). All 11 workloads ship at signal_strength: marginal with the audit numbers documented inline. (e) scripts/benchmark.py — new signal_strength field on Workload (default "high"); new _ALLOWED_SIGNAL_STRENGTH = {"high", "marginal"}; discover_workloads(include_marginal=False) filters by default; new --include-marginal CLI flag; explicit --workload <name> bypasses the filter so historical §A3-rev runs reproduce; helpful-error path when default suite is empty (points to --include-marginal). (f) 11 workload YAMLs touched to add signal_strength: marginal plus an explanatory comment block citing the audit / smoke numbers. (g) benchmarks/RESULTS.md §13a-1 appended (~200 lines): audit table, candidate workload sketches, smoke methodology + results, three failure-mode interpretations, what 13a-1 ships regardless of the negative result, coordination notes for 13a-2 (N-shot) and 13b-1 (§A3-rev7). 
Total smoke spend: $0.815 (budget $0.50-1.00). The negative smoke result is itself the §A3-rev6 finding generalizing further: even purpose-designed haiku-fail workloads do not differentiate at temperature=0 under heuristic OR hybrid judges — Path 1 (workload signal strengthening) is ruled out as a sufficient single-knob fix.

Type: additive. (1) signal_strength field is optional with "high" default — existing YAMLs without the field continue to parse (their behavior is unchanged: they’d run by default). 11 existing workloads explicitly opt in to marginal so the default suite is empty by intent; that pins the regression posture. (2) --include-marginal is a new flag; absence preserves the new default-strict behavior. (3) --workload semantics widen: explicit naming bypasses the signal-strength filter (back-compat for §A3 reruns and ad-hoc smokes). (4) No changes to: any payload schema, the routing chain, the evaluator API, the analytics surface, or any spec invariant. (5) The new workloads are inert against discover_workloads() (they’re marginal); the only path that exercises them is --workload <name> or --include-marginal.
References to verify:
- benchmark.md §3.1 (workload YAML schema) — signal_strength added; existing min_delegate_calls / grounding_tokens / forbidden_grounding reflected in the schema block. ✓
- benchmark.md §4 (the suite) — rewritten as §4.1 (v2 partition table with cross-run gaps) + §4.2 (promotion process). The old fix-a-bug-small / write-a-doc-from-notes / multi-turn-refactor v1 trio is preserved in the table with their cross-run gaps. ✓
- benchmark.md §6.2 (determinism contract; temperature=0) — referenced by the 13a-1 RESULTS.md write-up; no change. ✓
- evaluator.md §5.4 (workload rubric) — referenced from new §3.1 schema block (grounding_tokens / forbidden_grounding); no change. ✓
- routing-engine.md §5.5 (K-NN math, min_confidence, cost_weight) — the 13a-1 finding affects the bench-suite signal these knobs consume but does not change the knobs themselves. ✓
- pattern-store.md §16 (v2 fingerprint) — unchanged; the 13a-1 finding orthogonally identifies the per-task gap as the bottleneck rather than the fingerprint geometry. ✓
Status: verified. Total real-API smoke spend $0.815. Schema changes verified by uv run python -c "from benchmark import discover_workloads, ..."; default empty suite returns the helpful error. 1599 baseline + new schema tests (next bullet) carried into the pytest run. Ruff clean on scripts/benchmark.py + all 11 workload YAMLs.
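
The default-strict filtering and explicit-name bypass described in (e) can be sketched as follows. This is an illustration only: `select_workloads` is a hypothetical stand-in that works on an in-memory list, whereas the real discover_workloads in scripts/benchmark.py reads workload YAMLs, and raising ValueError for the empty-suite case is an assumption about how the helpful error surfaces.

```python
from dataclasses import dataclass
from typing import Optional

_ALLOWED_SIGNAL_STRENGTH = {"high", "marginal"}


@dataclass(frozen=True)
class Workload:
    name: str
    signal_strength: str = "high"  # optional field; "high" when the YAML omits it

    def __post_init__(self) -> None:
        if self.signal_strength not in _ALLOWED_SIGNAL_STRENGTH:
            raise ValueError(f"invalid signal_strength: {self.signal_strength!r}")


def select_workloads(
    all_workloads: list[Workload],
    *,
    include_marginal: bool = False,
    explicit_name: Optional[str] = None,
) -> list[Workload]:
    """Apply the signal-strength filter unless a workload is named explicitly."""
    if explicit_name is not None:
        # Explicit naming bypasses the filter so historical runs still reproduce.
        chosen = [w for w in all_workloads if w.name == explicit_name]
        if not chosen:
            raise ValueError(f"unknown workload: {explicit_name}")
        return chosen
    chosen = [w for w in all_workloads if include_marginal or w.signal_strength == "high"]
    if not chosen:
        raise ValueError("default suite is empty; pass --include-marginal to run marginal workloads")
    return chosen
```

With every shipped workload tagged marginal, the default call raises, which matches the "default suite is empty by intent" posture in the Type section above.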

2026-05-15 — pattern-store.md §17 “Production tuning” + concurrency hardening + cache-hit metric (Wave 13a-4)

Specs: pattern-store.md gains new §17 (Production tuning) covering K-NN latency curve under load, embedding-cache throughput collapse at cap, concurrent-recording defense-in-depth lock, retention coordination with trace-retention.md, audit-flag posture confirmation, and the new Prometheus metric. §17 (References) renumbers to §18. observability.md picks up the three new gauges (metis_pattern_embedding_cache_hit_ratio / _hits_total / _misses_total); no _total Counter — they’re polled gauges projected from process-local PatternStore.cache_hit_count() / cache_miss_count().
Change: Production-readiness audit for the pattern store under sustained multi-tenant load (sibling to the 13a-5 trace audit). (a) packages/metis-core/src/metis_core/patterns/store.py — PatternStore.__init__ constructs a threading.RLock() that wraps every public method (record, update_score, find_k_nearest, recommend, lookup_embedding, store_embedding, cache_size, cache_clear, size, evict, clear, close); private _record_locked / _update_score_locked / _recommend_locked / _find_k_nearest_locked keep the logic verbatim. The lock is uncontended under the documented single-asyncio-task architecture but eliminates the sqlite3.InterfaceError: bad parameter or other API misuse failure mode (~36% failure rate at 100 threads × 10 record() calls; verified zero failures post-lock). New cache observability counters (_cache_hits / _cache_misses ints) bumped on lookup_embedding; new accessors cache_hit_count() / cache_miss_count() / cache_hit_ratio(). (b) packages/metis-core/src/metis_core/observability/metrics.py — MetricsCollector.__init__ accepts pattern_cache_getter: Callable[[], list[tuple[str, int, int]]] | None; three new gauges with workspace_id label; _refresh_polled_gauges reads the getter on every scrape and computes hits/(hits+misses) per workspace, defaulting to 0.0 when no lookups happened. Failing getter swallowed (observability never blocks). (c) Tests: 8 new in packages/metis-core/tests/patterns/test_production_readiness.py covering K-NN latency smoke (generous p95 < 100ms bound at 400 fingerprints — operational targets are tighter and live in §17.1), concurrent record correctness (100 threads × 10 records, zero errors), concurrent record + recommend interleaving safe, pattern.evicted is in AUDIT_EVENT_TYPES, pattern.recorded / pattern.matched are NOT audit-flagged (confirmed correct), cache hit-ratio returns None on a fresh store, cache counters track hits + misses, and cache enforces cap under sustained writes. 
2 new in packages/metis-core/tests/observability/test_metrics.py: pattern_cache_getter drives all three gauges with correct per-workspace labels; failing getter does not break exposition.
Type: additive. (1) PatternStore public API unchanged — three new accessor methods added; lock-wrapping is transparent to existing callers. (2) MetricsCollector gains one optional kwarg; existing constructors continue to work. (3) Three new gauges exposed under existing /metrics endpoint; no new endpoint, no event-catalog change. (4) No change to v1 / v2 fingerprint semantics, K-NN aggregation formula, cost_weight default, or any spec invariant. (5) The threading lock is uncontended in the documented single-task architecture; it activates only if a future caller crosses threads (defense in depth).
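
The lock-wrapping pattern from (a), where each public method takes the lock and delegates to a private `_..._locked` helper, looks roughly like this. It is a sketch, not the PatternStore code: `LockedPatternStore` is a hypothetical in-memory class, so it shows the shape of the wrapping and the cache counters rather than the sqlite3 failure mode the real lock guards against.

```python
import threading
from typing import Optional


class LockedPatternStore:
    """In-memory stand-in that mirrors the lock-wrapping shape only."""

    def __init__(self) -> None:
        # RLock so a locked public method may call another public method safely.
        self._lock = threading.RLock()
        self._rows: list[tuple[str, float]] = []
        self._embeddings: dict[str, list[float]] = {}
        self._cache_hits = 0
        self._cache_misses = 0

    def record(self, fingerprint: str, score: float) -> int:
        with self._lock:
            return self._record_locked(fingerprint, score)

    def _record_locked(self, fingerprint: str, score: float) -> int:
        # The unlocked logic stays in one place; callers must hold the lock.
        self._rows.append((fingerprint, score))
        return len(self._rows)

    def store_embedding(self, key: str, vector: list[float]) -> None:
        with self._lock:
            self._embeddings[key] = vector

    def lookup_embedding(self, key: str) -> Optional[list[float]]:
        with self._lock:
            vector = self._embeddings.get(key)
            if vector is None:
                self._cache_misses += 1
            else:
                self._cache_hits += 1
            return vector

    def cache_hit_ratio(self) -> Optional[float]:
        with self._lock:
            total = self._cache_hits + self._cache_misses
            return None if total == 0 else self._cache_hits / total

    def size(self) -> int:
        with self._lock:
            return len(self._rows)
```

Because the lock is re-entrant and uncontended on a single task, the wrapper costs little in the documented architecture and only matters if a caller crosses threads.
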
References to verify:
- pattern-store.md §17.1 — K-NN latency curve. Spec target (≤3ms slot 4 at ≤1000 fingerprints) is documented as exceeded at v1’s documented scale; operator guidance is to lower hard_cap_rows, or to opt out of slot 4 by setting min_confidence=1.0. ✓
- pattern-store.md §17.2 — Embedding cache throughput collapse (~7000/s → ~150/s at cap). Architectural mitigation already in v2 (sync cache-only lookup; embed-on-miss is async) is now load-bearing-documented. ✓
- pattern-store.md §17.3 — Concurrent recording. RLock is defense-in-depth; §11.9 invariant (single writer per process) remains the contract. ✓
- pattern-store.md §17.4 — Retention coordination with trace-retention.md. Patterns can outlive their trace events (180-day vs 90-day defaults); operationally correct, documented. ✓
- pattern-store.md §17.5 — Audit-flag posture. pattern.evicted confirmed in AUDIT_EVENT_TYPES; pattern.recorded / pattern.matched confirmed NOT audit-flagged. No change to audit-log.md §4. ✓
- pattern-store.md §17.6 / observability.md §3 — Three new metrics; alert recipe documented. No new event types, no payload-registry change. ✓
- routing-engine.md §2.1.8 (5ms routing budget) / §5.5 (K-NN math) — unchanged; §17.1 documents that the pattern slot’s share of the budget is exceeded at v1’s documented scale but the architectural fallthrough (chain continues to slot 7) means the 5ms budget can still be met by skipping slot 4 via min_confidence=1.0. ✓
- trace-retention.md §5.1 (AUDIT_EVENT_TYPES) — referenced; not changed. ✓
- gateway.md §2 (per-request stateless gateway) — referenced; v2 cache cost is only paid in agent-loop mode, not gateway mode. ✓
Status: verified. 10 new tests (8 patterns + 2 observability); 1599 baseline + 10 = 1609 tests passing. Ruff clean on all changed files. Cross-reference with 13a-5 trace-performance audit: the two audits jointly cover all five production-readiness audit dimensions of the v1 + v2 stores; the trace-store findings live in docs/operations/trace-performance.md.
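
The polled-gauge projection from (b), hits/(hits+misses) per workspace with a 0.0 default and a swallowed getter failure, can be sketched in a few lines. The function name `hit_ratios` is hypothetical, and the tuple order (workspace_id, hits, misses) is an assumption; the entry above only gives the getter's type as a list of (str, int, int) tuples.

```python
from typing import Callable, Optional

# Assumed tuple layout: (workspace_id, hits, misses).
PatternCacheGetter = Callable[[], list[tuple[str, int, int]]]


def hit_ratios(getter: Optional[PatternCacheGetter]) -> dict[str, float]:
    """Compute workspace_id -> hit ratio, as a scrape-time refresh might."""
    if getter is None:
        return {}
    try:
        rows = getter()
    except Exception:
        # Observability must never block the scrape: a failing getter yields no samples.
        return {}
    ratios: dict[str, float] = {}
    for workspace_id, hits, misses in rows:
        total = hits + misses
        ratios[workspace_id] = hits / total if total else 0.0
    return ratios
```

Note the difference from the store-level accessor: the gauge reports 0.0 when no lookups happened, whereas cache_hit_ratio() on a fresh store is described above as returning None.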

2026-05-15 — operations/quickstart.md + `metis trial` + `infra/gateway/scripts/{quickstart,tear-down}.sh` + `benchmarks/workloads-trial/` (Agent 12b-?)

Specs: none. Pure operator / buyer-facing assets. Builds on shipped behavior — gateway helm chart (Wave 11), metis gateway issue-key (Wave 9 multi-user identity), AnalyticsStore.savings() (Wave 5 analytics surface), and the workload schema in benchmark.md §3.1. the project strategy (private) and gateway-deployment.md referenced as cross-links; no contract change.
Change: Closes the “helm install → first savings number in < 1 hour” gap that the Wave 11 gateway-deployment doc named but did not paper over. (a) docs/operations/quickstart.md (~190 lines) — buyer-facing end-to-end recipe: 5-minute helm install via kind, 30-second issue-key, 30-second SDK pointer flip (curl + Python), 5-minute pre-baked workload run, 30-second /analytics/by_key snapshot, honest framing on what the per-key cost number means vs cost-per-quality, and a Pitfalls table from validation. (b) infra/gateway/scripts/quickstart.sh — idempotent automation: kind cluster create-if-missing, image build + load, key issuance via metis gateway issue-key, Secret wrap, helm upgrade --install, port-forward in background, .metis-trial/state.env written so tear-down.sh and downstream commands can read it without re-typing. (c) infra/gateway/scripts/tear-down.sh — symmetric cleanup: stop port-forward, helm uninstall, namespace delete, kind delete, .metis-trial/ removal. (d) benchmarks/workloads-trial/refactor-extract-helper/ — pre-baked single workload (extract a duplicated price-formatting helper from prices.py); 3 turns; hybrid evaluator with grounding_tokens for cost-per-quality column; max_total_cost_usd: 0.10; runtime < 2 minutes against haiku; deliberately separate from benchmarks/workloads/ (the project benchmark suite) so trial-facing assets don’t churn with the suite. 
(e) metis trial CLI subcommand (apps/cli/src/metis_cli/trial.py, wired in apps/cli/src/metis_cli/main.py) — accepts --workload / --model / --baseline / --db-path / --gateway-url / --gateway-key; gateway mode sets ANTHROPIC_BASE_URL so the SDK auto-routes through the gateway (the SDK reads it from env when base_url isn’t passed explicitly to the constructor); spins up setup_runtime against the trial workload’s workspace tempdir, drives the turns, computes savings via AnalyticsStore.savings(), runs the workload-level evaluator on a fresh bus, and prints a buyer-facing actual / baseline / savings_pct / quality / cost-per-quality block. (f) Doc cross-links updated in README.md (new “Try it — first savings number in < 1 hour” section promoted above the existing Docker recipe; new “Operations” bullet for quickstart.md), docs/customer-trial-recipe.md (new “Path 0 — the pre-baked workload” entry; quickstart.sh option added to the Setup section), and docs/savings-demo.md (“Try it yourself” section now leads with the operations/quickstart.md pointer).
Type: additive. (1) New CLI subcommand metis trial; existing top-level commands untouched. (2) New directory benchmarks/workloads-trial/ is sibling to benchmarks/workloads/; the benchmark harness (scripts/benchmark.py) ignores it (it iterates benchmarks/workloads/ only). (3) New convenience scripts under infra/gateway/scripts/; the helm chart and Dockerfile are unchanged. (4) New doc under docs/operations/; cross-linked from README + customer-trial-recipe + savings-demo without rewriting them. (5) The trial workload’s YAML obeys benchmark.md §3.1 schema (verified by test_default_trial_workload_parses_with_benchmark_loader), so the same loader handles both directories.
References to verify:
- benchmark.md §3.1 (workload YAML schema) — trial workload validates clean against the existing loader. ✓
- gateway-deployment.md §"First production smoke" (kind-cluster reference) — operations/quickstart.md cites this as the deeper walkthrough; the convenience script automates the same steps with sensible defaults. ✓
- gateway.md §V (per-key analytics) — operations/quickstart.md §5 uses /analytics/by_key exactly as documented. ✓
- multi-user.md §4.2 (per-user / per-team key tags) — operations/quickstart.md §2 mentions issuing per-user keys for multi-user trials. ✓
- the project strategy (private) (high-floor adoption path via the gateway) — operations/quickstart.md is the “smoothest landing” doc that the thesis predicts buyers want; no spec change. ✓
Status: verified. Local-mode end-to-end smoke (uv run metis trial --workload refactor-extract-helper --model anthropic:claude-haiku-4-5 --db-path /tmp/metis-trial-validation.db) ran the 3-turn workload at $0.028 / 11 LLM calls / 9 tool calls; quality 0.76@0.80; savings_pct 66.7% vs sonnet baseline. Gateway-mode is documented from the validated kind-cluster transcript in gateway-deployment.md §"First production smoke"; the trial CLI’s only addition is ANTHROPIC_BASE_URL env injection, which the Anthropic SDK reads natively. 8 new tests in apps/cli/tests/test_trial_cli.py covering parser defaults, gateway-flag pairing, baseline alias, trial-workload discovery, unknown-workload rejection, partial-gateway-args rejection, and benchmark-loader compatibility. Workspace-wide suite: 1649 passed, 1 pre-existing unrelated failure (apps/cli/tests/test_benchmark.py::test_shipped_workloads_load_clean — fails because the parallel suite-v2 work in this working tree tagged every workload signal_strength: marginal and discover_workloads() filters those out by default; reproduces on HEAD with this wave’s work stashed; outside the trial-recipe scope). Ruff clean on changed files. bash -n clean on both shell scripts.
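A minimal sketch of the gateway-flag pairing and env injection described in this entry. The helper name build_trial_env is hypothetical, and presenting the issued gateway key as ANTHROPIC_API_KEY is an assumption; the entry only confirms that gateway mode sets ANTHROPIC_BASE_URL and that partial gateway arguments are rejected.

```python
from __future__ import annotations


def build_trial_env(
    base_env: dict[str, str],
    gateway_url: str | None = None,
    gateway_key: str | None = None,
) -> dict[str, str]:
    """Return a copy of base_env, pointed at the gateway when both flags are set."""
    # The two gateway flags only make sense together; reject a partial pair.
    if (gateway_url is None) != (gateway_key is None):
        raise ValueError("--gateway-url and --gateway-key must be passed together")

    env = dict(base_env)  # never mutate the caller's mapping
    if gateway_url is not None and gateway_key is not None:
        # The Anthropic SDK falls back to ANTHROPIC_BASE_URL when base_url is
        # not passed to the client constructor.
        env["ANTHROPIC_BASE_URL"] = gateway_url
        # Assumption: the issued gateway key stands in for the SDK API key.
        env["ANTHROPIC_API_KEY"] = gateway_key
    return env
```

Local mode is the no-argument case: the environment passes through unchanged and the SDK talks to its default endpoint.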

2026-05-15 — benchmark.md §6.4 `--seed-passes N` flag + statistical reporting (§A3-rev6 path 2; Agent 12b-2)

Specs: benchmark.md §6.1 picks up the seed_passes provenance row; §6.4 (new) documents the flag, the per-(workload, model) sample-count math, the noise-tolerance threshold, the cost trade-off, and the accumulation-bug surfacing rule. No other specs touched.
Change: Closes the §A3-rev6 Q1 “variance the K-NN has to overcome” finding via the path-2 unblock named in benchmarks/RESULTS.md §A3-rev6 (workload signal-strengthening is path 1, the 13a-1 effort). scripts/benchmark.py gains --seed-passes N (default 1, validated >= 1). The existing per-workload loop becomes nested: outer for workload in workloads, inner for rep in range(seed_passes). Each rep produces a fresh ULID-stamped session_id (via runtime.manager.create_session), inherits the prior rep’s shared patterns DB as its seed (seed_path = shared_patterns_db if exists), records its own outcome into the workspace tempdir’s .metis/patterns.db, and copies it back to the shared file (save_path = shared_patterns_db) so rep N+1’s K-NN sees the prior reps’ cluster members. WorkloadResult gains a seed_pass_index: int = 0 field for forward provenance. New compute_workload_stats(results) -> list[WorkloadStats] helper groups the per-rep results by workload name and computes sample statistics (sample standard deviation, N-1 divisor): quality_mean, quality_std, quality_values, cost_mean_usd, cost_values_usd, noisy (std > NOISY_QUALITY_STD_THRESHOLD = 0.15). When seed_passes > 1 the report adds a “Workload statistics” table after the per-rep table and prints a “noisy workloads” diagnostic listing the workloads flagged noisy. JSON artifact gains a workload_stats field alongside workloads; provenance.seed_passes is stamped (default 1 for the back-compat shape). The patterns-store accumulation contract — PatternStore.record() upserts by (structural_signature, primary_model) and bumps sample_size by 1 — is unchanged; the harness exercises it by repeating record() calls against a shared fingerprint cluster.
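The grouping and noise-flagging step can be sketched as follows. This is an illustration under assumptions, not the harness code: RepResult is a hypothetical trimmed stand-in for WorkloadResult, and the handling of errored reps and missing scores follows the test descriptions in this entry rather than the actual implementation.

```python
from __future__ import annotations

import statistics
from collections import defaultdict
from dataclasses import dataclass

NOISY_QUALITY_STD_THRESHOLD = 0.15


@dataclass(frozen=True)
class RepResult:  # hypothetical, trimmed stand-in for WorkloadResult
    name: str
    quality: float | None
    cost_usd: float
    seed_pass_index: int = 0
    error: str | None = None


@dataclass(frozen=True)
class WorkloadStats:
    name: str
    quality_values: tuple[float, ...]
    quality_mean: float | None
    quality_std: float | None  # None when fewer than two scored reps
    cost_values_usd: tuple[float, ...]
    cost_mean_usd: float
    noisy: bool


def compute_workload_stats(results: list[RepResult]) -> list[WorkloadStats]:
    groups: dict[str, list[RepResult]] = defaultdict(list)
    for r in results:
        if r.error is None:  # errored reps are excluded from the statistics
            groups[r.name].append(r)

    stats = []
    for name, reps in groups.items():
        qualities = tuple(r.quality for r in reps if r.quality is not None)
        costs = tuple(r.cost_usd for r in reps)
        # statistics.stdev is the sample standard deviation (N-1 divisor).
        std = statistics.stdev(qualities) if len(qualities) >= 2 else None
        stats.append(WorkloadStats(
            name=name,
            quality_values=qualities,
            quality_mean=statistics.fmean(qualities) if qualities else None,
            quality_std=std,
            cost_values_usd=costs,
            cost_mean_usd=statistics.fmean(costs),
            noisy=std is not None and std > NOISY_QUALITY_STD_THRESHOLD,
        ))
    return stats
```

With a single rep per workload there is no standard deviation to report, so the workload can never be flagged noisy; that matches the N=1 back-compat shape.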
Type: additive. (1) --seed-passes defaults to 1 and the existing single-shot path is byte-identical (same WorkloadResult fields populated, same table format, same JSON shape modulo the new seed_pass_index field + the new workload_stats field, both of which are 0 / [] on N=1 runs). (2) Provenance.seed_passes defaults to 1; consumers reading prior artifacts via .get("seed_passes", 1) see the expected value. (3) No core library changes — PatternStore semantics, evaluator semantics, and routing-engine semantics are untouched. (4) No new event types, no payload registry changes. (5) Cost scales linearly with N for seed-only passes (Pass A / Pass B in the standard §A3 protocol); routing-test passes (Pass C / Pass D) are unaffected. Documented in §6.4.
References to verify:
- benchmark.md §6.1 (provenance table) — adds seed_passes row. ✓
- benchmark.md §6.4 (new) — flag semantics, cost math, accumulation contract, noise threshold. ✓
- benchmarks/RESULTS.md §A3-rev6 Q1 finding — the “Option 2 — N-shot per workload” path now has a harness-supported mechanism. (Agent 13b-1 / §A3-rev7 to run --seed-passes 3 and report per-workload std + flagged noisy workloads.) ⏳
- pattern-store.md §6 (PatternStore.record() upsert contract) — unchanged; the harness exercises but does not redefine the accumulation semantics. ✓
- routing-engine.md §5.5 (min_confidence=0.05 gate) — unchanged; the path-2 unblock complements the existing gate by reducing the variance the cluster mean has to overcome, not by re-tuning the gate. ✓
Status: verified. 8 new tests in apps/cli/tests/test_benchmark.py: 6 unit tests on compute_workload_stats (N=1 has no std; N=3 low-variance mean ± std; N=3 high-variance flagged noisy; groups by workload name; errored reps excluded; missing quality scores handled), 1 test on PatternStore.record() accumulation contract (3 records with same fingerprint produce recommend().sample_size = 3), 1 integration test (test_seed_passes_loop_invokes_run_workload_n_times) mocking setup_runtime / shutdown_runtime / _aggregate_savings / evaluate_workload_quality and exercising the actual amain loop with --seed-passes 3 against a real PatternStore round-trip — asserts run_workload called 3 times, seed_path threads through the shared db, and recommend().sample_size == 3 after the loop. Ruff clean on changed files.
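The accumulation contract the harness relies on (upsert keyed on (structural_signature, primary_model), sample_size bumped by 1 per record) can be illustrated with a plain SQLite upsert. The table name, the column set beyond the two key fields, and the helper names are assumptions for illustration; the real PatternStore schema is not shown in this log.

```python
import sqlite3

# Hypothetical schema: only the key columns and sample_size are modeled.
SCHEMA = """
CREATE TABLE IF NOT EXISTS patterns (
    structural_signature TEXT NOT NULL,
    primary_model        TEXT NOT NULL,
    sample_size          INTEGER NOT NULL DEFAULT 1,
    PRIMARY KEY (structural_signature, primary_model)
)
"""


def record(conn: sqlite3.Connection, structural_signature: str, primary_model: str) -> None:
    """Insert a new pattern row, or bump sample_size on an existing one."""
    conn.execute(
        """
        INSERT INTO patterns (structural_signature, primary_model, sample_size)
        VALUES (?, ?, 1)
        ON CONFLICT (structural_signature, primary_model)
        DO UPDATE SET sample_size = sample_size + 1
        """,
        (structural_signature, primary_model),
    )
    conn.commit()


def sample_size(conn: sqlite3.Connection, structural_signature: str, primary_model: str) -> int:
    row = conn.execute(
        "SELECT sample_size FROM patterns "
        "WHERE structural_signature = ? AND primary_model = ?",
        (structural_signature, primary_model),
    ).fetchone()
    return row[0] if row else 0
```

Three record() calls with the same key leave one row with sample_size 3, which is the shape the --seed-passes 3 integration test asserts through recommend().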

2026-05-15 — gateway-hardening.md §2.1 lifts loopback-only bind constraint (Wave 13)

Specs: gateway-hardening.md moves from “Draft v1 — Wave 12+ commitment, not a v1 default” to “v1 — shipped (Wave 13)”; new §2.1 (Bind posture), §2.2 (Connection-rate hardening), §2.3 (In-process TLS), §2.4 (Required headers from the upstream terminator in sidecar mode). §6 (DDoS posture) and §8 (Deliberate omissions) rewritten to reflect what Wave 13 ships. gateway.md §3.2 (Network posture) rewritten — pre-Wave-13 “silently rewrites any non-loopback bind to 127.0.0.1” replaced with the new “default, not a constraint” framing plus a table of the hardening layers (rate limit / audit log / connection cap / TLS / key rotation / WAF-buyer). server-api.md §3.1 (Base URL) updated — the agent server retains loopback-only until its auth story lands; the gateway lifted because it’s per-request stateless with shipped rate-limit + audit + key-rotation primitives. docs/operations/upgrade-guide.md gains §6 (Migration: loopback-only → Internet-exposed) with the pre-flight checklist, two-step helm migration recipe, post-migration verification, and rollback path.
Change: Closes the long-standing gateway.md §11 / §3.2 deferral — “production binds are gated behind future hardening.” Wave 11 shipped the rate-limit middleware + audit-log export + key rotation primitives the deferral named; Wave 13 lifts the constraint. (a) apps/gateway/src/metis_gateway/app.py — GatewayConfig widens with five new fields (max_concurrent_connections default 1000, backlog default 2048, reuse_port default False, tls_cert: Path | None, tls_key: Path | None). New GatewayConfigError typed exception is raised from __post_init__ on (i) tls_cert set without tls_key or vice versa, (ii) cert/key file missing on disk, (iii) max_concurrent_connections < 1, (iv) backlog < 1. New tls_enabled property. run_gateway() removes the silent rewrite — non-loopback hosts pass through to uvicorn unchanged. New _log_non_loopback_warning(cfg) emits a one-time WARN summarizing the hardening checklist (tls_in_process=on|off rate_limit=on|off). New _make_listen_socket(cfg) constructs a TCP socket with SO_REUSEADDR (always) + SO_REUSEPORT (when cfg.reuse_port); passed to uvicorn via Server.serve(sockets=[…]) when reuse_port is set so two processes can bind the same port for graceful restart. New _build_uvicorn_config(app, cfg) extracts the projection of GatewayConfig → uvicorn.Config (threads limit_concurrency + backlog + ssl_certfile + ssl_keyfile) so tests can inspect the wire-up without actually serving. (b) apps/gateway/src/metis_gateway/cli.py — run_gateway_command accepts tls_cert / tls_key / max_connections / reuse_port kwargs; passes them through to GatewayConfig; catches GatewayConfigError and exits 1 with a clean message; boot banner prints https:// instead of http:// when TLS is engaged. (c) apps/cli/src/metis_cli/main.py — metis gateway argparse gains --tls-cert, --tls-key, --max-connections (default 1000), --reuse-port. --host help text rewritten to name the perimeter checklist owner. 
(d) infra/gateway/helm/values.yaml — top-of-file comment rewritten; new top-level keys gatewayHost (default "127.0.0.1", back-compat), maxConnections (default 1000), workers (default 1, doc-only — the bundled entrypoint stays single-process; multi-process needs an entrypoint override), reusePort (default false), tls.{enabled, secretName, mountPath} (default enabled: false). service comment expanded with three deployment recipes (Ingress / LoadBalancer / LoadBalancer+in-process-TLS). (e) infra/gateway/helm/templates/deployment.yaml — env block reads gatewayHost (replacing the hard-coded "127.0.0.1"); new METIS_GATEWAY_MAX_CONNECTIONS, METIS_GATEWAY_REUSE_PORT (set only when truthy), METIS_GATEWAY_TLS_CERT / METIS_GATEWAY_TLS_KEY (set only when tls.enabled). New optional tls volume mount + Secret-backed volume gated by tls.enabled; required Helm template error when tls.secretName is empty. (f) infra/gateway/entrypoint.sh — builds the metis gateway flag list incrementally so optional env vars only prepend their flags when set; preserves back-compat for pre-Wave-13 charts.
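The four validation rules on GatewayConfig can be sketched as a frozen dataclass. This is a trimmed illustration: only the fields named in this entry are modeled, the ValueError base class is an assumption, and the error messages are invented for the example.

```python
from __future__ import annotations

from dataclasses import dataclass
from pathlib import Path


class GatewayConfigError(ValueError):
    """Raised when the gateway configuration is internally inconsistent."""


@dataclass(frozen=True)
class GatewayConfig:
    host: str = "127.0.0.1"  # loopback stays the default, not a constraint
    max_concurrent_connections: int = 1000
    backlog: int = 2048
    reuse_port: bool = False
    tls_cert: Path | None = None
    tls_key: Path | None = None

    def __post_init__(self) -> None:
        # Cert and key must be supplied as a pair.
        if (self.tls_cert is None) != (self.tls_key is None):
            raise GatewayConfigError("tls_cert and tls_key must be set together")
        # Both files must exist on disk when supplied.
        for label, path in (("tls_cert", self.tls_cert), ("tls_key", self.tls_key)):
            if path is not None and not path.is_file():
                raise GatewayConfigError(f"{label} not found: {path}")
        if self.max_concurrent_connections < 1:
            raise GatewayConfigError("max_concurrent_connections must be >= 1")
        if self.backlog < 1:
            raise GatewayConfigError("backlog must be >= 1")

    @property
    def tls_enabled(self) -> bool:
        return self.tls_cert is not None and self.tls_key is not None
```

Validating in __post_init__ means a bad combination fails at construction time, so the CLI can catch one exception type and exit 1 before any socket is opened.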
Type: breaking-default. The silent --host 0.0.0.0 → 127.0.0.1 rewrite is removed. Any caller depending on the rewrite (i.e. passing 0.0.0.0 and expecting loopback) now actually binds non-loopback. The default --host remains 127.0.0.1, so callers that didn’t pass --host see no change. The default replicaCount / helm values are unchanged: a pre-Wave-13 deployment that didn’t override anything will keep binding loopback inside the pod and reaching the Service through the same socat sidecar bridge. The new hardening features are all opt-in: max_concurrent_connections defaults to a generous 1000 (uvicorn’s default limit_concurrency is None, so this is the first time the cap activates — but 1000 is well above realistic single-pod traffic), reuse_port defaults False, tls_cert/tls_key default None.
References to verify:
- gateway-hardening.md §2.1 / §2.2 / §2.3 / §2.4 / §6 / §8 — bind posture, connection-rate hardening, in-process TLS, sidecar-mode header forwarding, DDoS scope, deliberate omissions. ✓
- gateway.md §3.2 — Network posture rewritten with the post-Wave-13 hardening-layer table. ✓
- server-api.md §3.1 — Agent-server clarification (loopback-only retained until its auth story lands). ✓
- docs/operations/upgrade-guide.md §6 — Migration recipe + rollback path. ✓
- audit-log.md §9 — referenced from the bind-posture checklist as the “audit logging” leg. ✓
- event-bus-and-trace-catalog.md — no event-catalog changes; the boot-time WARN is a log line, not a bus event. ✓
Status: verified. 17 new tests across two files. (1) apps/gateway/tests/test_run_gateway_bind.py (20 tests): GatewayConfig default still binds 127.0.0.1 (back-compat); _is_loopback_host truth table; GatewayConfig(host="0.0.0.0") is accepted without rewrite; arbitrary external host accepted; non-loopback bind logs the hardening WARN with tls_in_process=off rate_limit=off; WARN reflects TLS / rate-limit state when both are on; tls_cert without tls_key rejected; tls_key without tls_cert rejected; missing cert file rejected; missing key file rejected; tls_enabled reflects both-set; max_concurrent_connections=0 rejected; negative max-connections rejected; backlog=0 rejected; uvicorn config threads limit_concurrency; uvicorn config threads ssl_certfile/ssl_keyfile; uvicorn config honors custom backlog; _make_listen_socket with reuse_port=True sets SO_REUSEPORT (truthy, platform-portable); plain bind still sets SO_REUSEADDR; live run_gateway boots on 127.0.0.1 and serves /healthz; live in-process TLS happy path (uses real self-signed cert via cryptography; skipped when the lib is absent — kept as an optional smoke). (2) apps/cli/tests/test_main.py (5 new tests): metis gateway default --host is 127.0.0.1; --host 0.0.0.0 parses; --tls-cert + --tls-key parse; --max-connections 5000 parses; --reuse-port parses. Full repo suite: TBD (run pending). Ruff clean on changed files.
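For readers unfamiliar with the socket options the bind tests check, a minimal sketch of a listen-socket constructor follows. The function name and signature are hypothetical (the real _make_listen_socket takes a cfg object), and the sketch assumes IPv4 only.

```python
import socket


def make_listen_socket(
    host: str, port: int, *, backlog: int = 2048, reuse_port: bool = False
) -> socket.socket:
    """Build a bound, listening TCP socket (IPv4 only in this sketch)."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    try:
        # Always allow fast rebinding after a restart.
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        if reuse_port:
            # SO_REUSEPORT is platform-dependent; fail loudly where it is missing.
            if not hasattr(socket, "SO_REUSEPORT"):
                raise OSError("SO_REUSEPORT is not available on this platform")
            sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEPORT, 1)
        sock.bind((host, port))
        sock.listen(backlog)
    except BaseException:
        sock.close()  # do not leak the descriptor on a failed bind
        raise
    return sock
```

The tests above assert that the option reads back truthy rather than equal to 1, because the value getsockopt returns for a set flag differs across platforms.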

2026-05-15 — redaction.md v1 + `EventRedactor` + `metis audit export --redact <mode>` (Wave 12a-3)

Specs: redaction.md (new). audit-log.md §9 picks up the --redact MODE CLI flag + redact mode: summary line. multi-user.md §7.4 item 4 (right-to-delete) annotated as partially closed by the redaction-pathway. CHANGES.md specs-in-scope list + cross-reference map both pick up the new spec.
Change: Layers the canonical export-time redaction policy over the Redactor Protocol shipped by 12a-2 and the metis audit export CLI shipped by 12a-1. (a) packages/metis-core/src/metis_core/redaction/modes.py — RedactionMode StrEnum (passthrough / pseudonymize / redact_private / aggregate_only), PseudonymTag closed catalog (session / turn / user / team / gateway-key / parent-session / workspace / request), ENVELOPE_PSEUDONYM_FIELDS map, PAYLOAD_PSEUDONYM_FIELDS declarative table covering every catalog event whose payload carries identity fields (turn / llm / gateway-key-* / gateway-quota-exceeded / quota-alert / delegate-* / analytics-user-*), PRIVATE_TEXT_FIELDS map of PRIVATE-tier text fields that get sentinel-replaced under redact_private, SIGNALS_EXTRA_TEXT_KEYS for the nested turn.completed.signals_extra dict. (b) packages/metis-core/src/metis_core/redaction/event_redactor.py — EventRedactor(mode, *, salt=b"", strip_user_controlled=False) with redact(event) -> Event | None (returns None only for AGGREGATE_ONLY) and finalize() -> dict | None. pseudonymize_value(value, tag, salt) uses the same redacted_<sha256[:12]> byte-format 12a-2’s pseudonym_for() ships, so a row pseudonymized by forget_user and re-exported under pseudonymize produces the same value byte-for-byte. (c) packages/metis-core/src/metis_core/redaction/aggregator.py — AggregateAccumulator rolls up event count, count-by-type, distinct sessions / turns / users / gateway-keys, plus sum / min / max of cost / tokens / latency pulled from llm.call_completed. Deterministic, no DP-noise (v1). (d) packages/metis-core/src/metis_core/redaction/forget.py — forget_user(db_path, user_id, *, confirm=False, requested_by=None) -> ForgetResult library wrapper over 12a-2’s PseudonymizingRedactor that adds a dry-run mode (counts what would be touched without mutating) and emits the analytics.user_forgotten audit event via direct trace-store write (CLI is one-shot; no bus). 
(e) packages/metis-core/src/metis_core/audit/log.py — AuditLog.export(..., redactor=None) accepts an optional EventRedactor; pipes events through redact() before serialization; AGGREGATE_ONLY mode short-circuits the JSONL / CSV row writer and emits a single JSON object. (f) apps/cli/src/metis_cli/audit.py — metis audit export --redact <mode> argparse flag; redact mode: line added to the success summary. (g) apps/cli/src/metis_cli/user.py — metis user forget now performs a dry-run count when --confirm is missing and prints “this would pseudonymize N event(s)” so operators can validate scope before committing.
Type: additive. (1) metis audit export keeps --redact passthrough as the default; existing CLI / library callers see no behavior change. (2) AuditLog.export(redactor=None) is the default; pre-Wave-12a-3 callers (none in-tree besides 12a-1’s own export pipeline) keep working. (3) EventRedactor is a new layer; the existing Redactor Protocol + PseudonymizingRedactor from 12a-2 are unchanged. (4) forget_user is a new library function; 12a-2’s metis user forget CLI is unchanged at the user-facing level except for the additional dry-run “would affect N events” diagnostic when --confirm is missing. (5) No event-catalog additions — the existing analytics.user_forgotten is reused; no new payload struct, no new sensitivity tier.
References to verify:
- event-bus-and-trace-catalog.md §4.4 (sensitivity classification) — redaction.md §7.6 documents the design choice: the sensitivity tag is informational, not gating. The mode (not the tag) governs the output. PRIVATE-tier events are not auto-stripped under pseudonymize; only redact_private triggers the text strip. ✓
- audit-log.md §9 (CLI surface) — picked up the --redact MODE flag row in the help block and the redact mode: line in the success-summary block. ✓
- multi-user.md §7.4 item 4 (right-to-delete) — pre-existing entry already named redaction.md as the path to GDPR-forget pseudonymization; no further edit needed. ✓
- analytics-api.md §4.10 (GDPR portability + forget HTTP endpoints) — already references redaction.md §6 and the Redactor protocol; the CLI surface added in this wave (--redact on metis audit export) is independent of the HTTP path. The HTTP path’s ?redact=<mode> query-parameter extension is a follow-on (redaction.md §9 names it explicitly). ⏳
- canonical-message-format.md §5 (Message struct contract) — Sensitivity.PRIVATE floor is reused; no field added. ✓
Status: verified. 38 new tests across four files. (1) packages/metis-core/tests/redaction/test_event_redactor.py (20 tests): passthrough returns input unchanged; pseudonymize hashes envelope identity (session_id, turn_id); pseudonymize hashes payload identity (user_id, team_id, gateway_key_id, parent_session_id); null identity stays null; PRIVATE text passes through under pseudonymize; workspace_hash (already a digest) left alone; redact_private strips user_message_text_redacted / files_modified / command_executed / error_message / signals_extra text keys; redact_private is a superset of pseudonymize; aggregate_only returns None per event and finalizes; aggregate_only empty stream; idempotence under pseudonymize + redact_private + at the pseudonymize_value helper level; determinism across invocations; salt breaks correlation; no-salt pseudonym matches pseudonym_for() byte-for-byte (cross-compat with 12a-2 forget); input event never mutated. (2) packages/metis-core/tests/redaction/test_forget.py (6 tests): dry-run does not touch DB or emit audit event; confirmed forget pseudonymizes + emits audit; idempotent re-forget returns 0 rows (still emits audit); subsequent export by original user_id returns empty (per redaction.md §5 invariant); export by hash returns the rows; forget on missing DB raises FileNotFoundError; audit event carries subject_user_id + pseudonym + pseudonymized_rows correctly. (3) packages/metis-core/tests/redaction/test_audit_export_with_redactor.py (5 tests): passthrough redactor produces byte-identical output to no redactor; pseudonymize hashes envelope + payload identity fields end-to-end through AuditLog.export; aggregate_only writes single JSON object; determinism (two consecutive redacted exports are byte-identical); refuses to overwrite existing destination. 
(4) apps/cli/tests/test_audit_redact_cli.py (7 tests): --redact flag parses; default is passthrough; pseudonymize end-to-end via metis audit export; aggregate_only writes single JSON file; bogus mode rejected by argparse with SystemExit; metis user forget dry-run prints “would pseudonymize N event(s)” + exit code 2; --confirm completes with pseudonymized rows: N summary. Full repo: 1599 passed. Ruff clean on all changed files.

Specs: analytics-api.md §4.10 (new — GET /analytics/user/{user_id}/export + POST /analytics/user/{user_id}/forget) plus a new invalid_user_id error code in §6. multi-user.md §7.4.4 + §11.5 annotated as partially closed. event-bus-and-trace-catalog.md §6 gains two audit event types: analytics.user_exported and analytics.user_forgotten (both PSEUDONYMOUS). Coordinates with redaction.md (12a-3) — this spec consumes their Redactor protocol; the policy of what gets pseudonymized lives in their spec, the policy of when / how operators trigger it lives here.
Change: Closes the “right to data portability” half of GDPR / CCPA (the buyer can hand a departing user every event stamped for them) and provides the operator-facing entry point for the “right to be forgotten” half (which delegates the actual pseudonymization to redaction.md’s Redactor). (a) packages/metis-core/src/metis_core/redaction/ — new module with Redactor Protocol + minimum-viable PseudonymizingRedactor impl (SHA-256 → 12-hex pseudonym, in-place json_set UPDATE on the events table). Co-owned with 12a-3, whose follow-on landed event_redactor.py / forget.py / aggregator.py / modes.py on the same path; the contract is shared. (b) packages/metis-core/src/metis_core/events/payloads.py — AnalyticsUserExported (subject_user_id, requested_by, row_count, byte_count, window_start, window_end) and AnalyticsUserForgotten (subject_user_id, pseudonym, requested_by, pseudonymized_rows) frozen msgspec.Structs plus registry entries. (c) packages/metis-core/src/metis_core/analytics/store.py — AnalyticsStore.user_export(user_id, *, window=None) yields JSONL-encoded bytes via a server-side SQLite cursor (O(1) RAM for any export size; 10k-event smoke test passes); user_event_count(user_id, *, window=None) cheap pre-stream COUNT; forget_user(user_id, *, redactor) delegates to the injected Redactor and returns the rowcount. (d) apps/server/src/metis_server/analytics.py — user_export / user_forget HTTP handlers; StreamingResponse for JSONL, Content-Disposition: attachment; filename="{user_id}.jsonl" + X-Metis-Row-Count headers; audit event emitted onto the bus on stream completion (export) / immediately (forget); shape guard ^[A-Za-z0-9_-]{1,200}$ on the path parameter with 400 invalid_user_id. (e) apps/server/src/metis_server/app.py — two new routes (/analytics/user/{user_id}/export GET, /analytics/user/{user_id}/forget POST). 
(f) apps/cli/src/metis_cli/user.py — metis analytics user-export <user_id> [--from] [--to] [--out] [--db-path] (stream to stdout or file) and metis user forget <user_id> --confirm [--db-path] (refuses without --confirm; delegates to metis_core.redaction.forget.forget_user shipped by 12a-3 so the policy + audit-emit live in one place). (g) apps/cli/src/metis_cli/main.py — two new top-level subparsers + dispatch. (h) Errors module gains invalid_user_id factory.
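The constant-memory export in (c) comes from iterating a SQLite cursor lazily instead of materializing rows. A minimal sketch, assuming a hypothetical events table with a dedicated user_id column and a JSON text payload column (the real schema keeps identity inside the payload and is not shown here):

```python
import json
import sqlite3
from collections.abc import Iterator


def user_export(conn: sqlite3.Connection, user_id: str) -> Iterator[bytes]:
    """Yield one JSONL-encoded line per event stamped for user_id."""
    # Hypothetical schema: events(timestamp_us, user_id, payload).
    cursor = conn.execute(
        "SELECT payload FROM events WHERE user_id = ? ORDER BY timestamp_us, rowid",
        (user_id,),
    )
    for (payload,) in cursor:  # rows are fetched lazily, one at a time
        event = json.loads(payload)
        # Canonical re-serialization keeps each record on a single line and
        # makes repeated exports byte-identical.
        line = json.dumps(event, sort_keys=True, separators=(",", ":"))
        yield line.encode("utf-8") + b"\n"
```

Because the function is a generator, an HTTP handler can hand it straight to a streaming response and a CLI can write each chunk to stdout or a file as it arrives.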
Type: additive. (1) Two new event types behind PAYLOAD_REGISTRY membership additions — existing subscribers ignore unknown types; no breaking change. (2) Two new endpoints on a previously-unused URL space (/analytics/user/...) — no existing route shadow or conflict. (3) New top-level CLI subparsers (analytics, user) and new dispatch branches — existing top-level commands untouched. (4) The Redactor protocol is runtime_checkable so 12a-3’s richer impl drops in without touching the contract. (5) Loopback-only inherits from the rest of /analytics/* (analytics-api.md §2.1.4); no new auth surface in v1.
References to verify:
- analytics-api.md §4.10 (new) — endpoint contract, streaming semantics, audit-event emission, CLI mirror, auth posture. ✓
- analytics-api.md §6 — invalid_user_id error code added. ✓
- multi-user.md §7.4.4 — annotated: trace-store half of right-to-delete now lands; users.json / key-revoke half remains future work. ✓
- multi-user.md §11.5 — annotated as partially closed. ✓
- redaction.md §5 (12a-3) — the policy the forget endpoint delegates to; the CLI forget command shares 12a-3’s forget_user impl. ⏳ (12a-3 shipping in parallel)
- event-bus-and-trace-catalog.md §6 — two new audit event types (analytics.user_exported, analytics.user_forgotten); both PSEUDONYMOUS, neither subject to the trace-retention.md §7.3 sweep (their addition to AUDIT_EVENT_TYPES should be verified in 12a-1’s audit-log spec follow-on). ⏳
Status: verified. 31 new tests across three new files. (1) packages/metis-core/tests/analytics/test_user_export.py (11): export returns only subject user’s events; export covers both llm.call_completed and turn.completed; empty for unknown user; window filtering; deterministic ordering (byte-identical re-exports); 10k-event streaming smoke (assert generator, not list); user_event_count helper; pseudonym determinism + distinctness; forget → empty re-export + Bob untouched; forget idempotence; forget unknown user returns 0. (2) apps/cli/tests/test_user_cli.py (10): argparse wiring for both subcommands; stdout vs file output; missing DB → exit 1; invalid window → exit 1; refuse without --confirm → exit 2; with --confirm pseudonymizes and writes audit event. (3) apps/server/tests/test_user_export_http.py (10): export returns only subject events; window filter; unknown user → 200 empty; invalid user_id → 400; Content-Disposition; export emits analytics.user_exported; forget pseudonymizes + subsequent export empty; forget idempotent; forget invalid user_id → 400; forget emits analytics.user_forgotten. Repo total: 1544 passed (1486 baseline + 31 new + 27 from 12a-3’s parallel redaction module). Ruff clean.
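The path-parameter shape guard ^[A-Za-z0-9_-]{1,200}$ can be sketched as below. The helper name is hypothetical, and using fullmatch instead of the anchored pattern is a choice of this sketch, not a statement about how the handler is written.

```python
import re

# Same character class and length bound as the documented guard.
_USER_ID_RE = re.compile(r"[A-Za-z0-9_-]{1,200}")


def is_valid_user_id(user_id: str) -> bool:
    """Return True when user_id is safe to use in a path and a filename."""
    # fullmatch avoids the Python quirk where "$" also matches just before a
    # trailing newline, which would let "alice\n" through.
    return _USER_ID_RE.fullmatch(user_id) is not None
```

A handler would map a False result to the 400 invalid_user_id error before touching the store; the restricted alphabet is also what makes interpolating the id into the Content-Disposition filename safe.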

2026-05-15 — trace-retention.md v1 + `trace.swept` event + `metis trace prune` (Wave 12a-2)

Specs: trace-retention.md (new — drafted v1). event-bus-and-trace-catalog.md §6.14 gains the trace.swept catalog entry (audit-flagged, PSEUDONYMOUS); §7.3 rewritten in place to replace the pre-Wave-12 by_type retention placeholder with the actual contract (single global retention_days cutoff, audit-exempt sweep, trace.swept audit-trail, deferred per-type / per-workspace).
Change: Closes the §7.3 placeholder (“V1: unbounded retention. … Phase 3+: optional retention policy in config”) with a real implementation. (a) New module packages/metis-core/src/metis_core/trace/retention.py — PurgeResult frozen dataclass (cutoff, rows_eligible, rows_audit_exempt, rows_deleted, oldest_kept_timestamp, dry_run, swept_at) and re-exports of 12a-1’s is_audit_event / AUDIT_EVENT_TYPES from metis_core.events.payloads. (b) packages/metis-core/src/metis_core/trace/store.py — TraceStore.purge_older_than(cutoff, *, bus=None, dry_run=True, exempt_audit=True) runs a single SQL DELETE FROM events WHERE timestamp_us < ? AND type NOT IN (<audit_types>) riding a new idx_events_timestamp_us index. Returns a PurgeResult; in non-dry-run mode emits exactly one trace.swept event via the optional bus. dry_run=True default for programmatic-caller safety (the CLI inverts this; cron-friendly). The new index is purely additive — existing DBs pick it up on next __init__ and TRACE_SCHEMA_VERSION does not bump because the row format is unchanged. (c) TraceSwept msgspec.Struct payload and "trace.swept" registry entry added to packages/metis-core/src/metis_core/events/payloads.py; AUDIT_EVENT_TYPES is extended by "trace.swept" so the sweep’s own audit trail survives subsequent sweeps (audit-log.md “Adding or removing a type is a deliberate spec change” rule). (d) CLI subcommand metis trace prune --days 90 [--dry-run] [--db-path …] — new top-level metis trace group; prune is the v1 operation. Handler in apps/cli/src/metis_cli/trace_admin.py spins up an EventBus, attaches the trace store as the sole subscriber so trace.swept lands in the same DB the sweep operated on (audit-trail invariant by construction), runs the purge, drains, and emits a deterministic summary block matching the metis backup / metis restore style. CLI defaults to apply; --dry-run opts into preview-only (no trace.swept). Library dry_run=True default stays (programmatic safety). 
(e) Helm chart additions: optional CronJob template infra/gateway/helm/templates/cronjob-trace-prune.yaml (OFF by default; concurrencyPolicy: Forbid; mounts the gateway PVC; runs the same metis image with the prune subcommand) plus traceRetention.* values block in infra/gateway/helm/values.yaml (enabled: false, days: 90, schedule: "0 3 * * *", dryRun: false, resources, history-limits, image override). Storage trade-off documented in values.yaml + spec §8: ReadWriteOnce PVCs can’t be mounted by both pods simultaneously so the operator picks RWX migration, backup-file prune, or gateway pause.
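The core of the sweep in (b) is a single parameterized DELETE with an audit exemption. The sketch below is an illustration under assumptions: the events table is reduced to two columns, PurgeResult is trimmed to the counting fields, AUDIT_EVENT_TYPES holds only the one member this entry confirms, and the trace.swept emission via the bus is left out.

```python
import sqlite3
from dataclasses import dataclass

# Illustrative subset; the real frozenset carries more audit types.
AUDIT_EVENT_TYPES = frozenset({"trace.swept"})


@dataclass(frozen=True)
class PurgeResult:  # trimmed: the real struct has more fields
    cutoff_us: int
    rows_eligible: int
    rows_audit_exempt: int
    rows_deleted: int
    dry_run: bool


def purge_older_than(
    conn: sqlite3.Connection, cutoff_us: int, *, dry_run: bool = True
) -> PurgeResult:
    audit = sorted(AUDIT_EVENT_TYPES)
    marks = ",".join("?" * len(audit))
    # Strict "<": a row stamped exactly at the cutoff is kept.
    where = f"timestamp_us < ? AND type NOT IN ({marks})"
    eligible = conn.execute(
        f"SELECT COUNT(*) FROM events WHERE {where}", (cutoff_us, *audit)
    ).fetchone()[0]
    exempt = conn.execute(
        f"SELECT COUNT(*) FROM events WHERE timestamp_us < ? AND type IN ({marks})",
        (cutoff_us, *audit),
    ).fetchone()[0]
    deleted = 0
    if not dry_run:
        deleted = conn.execute(
            f"DELETE FROM events WHERE {where}", (cutoff_us, *audit)
        ).rowcount
        conn.commit()
    return PurgeResult(cutoff_us, eligible, exempt, deleted, dry_run)
```

Keeping dry_run=True as the library default means a programmatic caller has to opt into deletion explicitly, while the CLI can invert the default for cron use.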
Type: additive. (1) Pre-Wave-12 DBs see no behavior change — purge_older_than is opt-in; nothing in the existing agent loop, gateway request path, or server lifecycle invokes it. (2) The new index is IF NOT EXISTS-guarded so it lands on existing trace DBs without migration. The schema version is unchanged — restore from a pre-Wave-12 backup works as-is; the new index is created on first open of the restored DB. (3) AUDIT_EVENT_TYPES is a frozenset and adding "trace.swept" is membership-additive; the audit-log export (12a-1) automatically picks it up so a buyer’s SIEM gets sweep history under the same filter without code changes. (4) The CLI subcommand metis trace is a new top-level group; pre-existing top-level commands are untouched. (5) The Helm CronJob template only renders when traceRetention.enabled=true; existing deployments see no new resources after upgrading.
References to verify:
- event-bus-and-trace-catalog.md §6.14 (new) — trace.swept payload + audit-flag note; cross-referenced from §7.3 and trace-retention.md §6. ✓
- event-bus-and-trace-catalog.md §7.3 — rewritten in place to point at trace-retention.md; the pre-Wave-12 by_type placeholder is replaced rather than annotated. ✓
- audit-log.md §4 — adds trace.swept to the audit-relevant subset via AUDIT_EVENT_TYPES; the existing per-type rationale table needs an entry whenever audit-log.md is next opened. ⏳
- the project strategy (private) — “audit and compliance posture” gap named in 2026-05-12 entry; this spec is one of three Wave-12 specs (audit-log, redaction, trace-retention) that close it. ✓
Status: verified. 17 new tests. (1) packages/metis-core/tests/trace/test_retention.py (11 tests): cutoff math (strict <), audit-flagged-events-survive, trace.swept-itself-is-audit-preserved (synthetic 400-day-old trace.swept survives a 30-day cutoff), dry-run-reports-without-deleting + emits no trace.swept, empty-DB no-op, apply emits exactly one trace.swept with matching counts via the bus, idx_events_timestamp_us exists after __init__, is_audit_event("trace.swept") is True, exempt_audit=False test-only escape hatch deletes audit rows, oldest_kept_timestamp round-trips at microsecond resolution, PurgeResult is FrozenInstanceError-bound. (2) apps/cli/tests/test_trace_prune_cli.py (6 tests): subcommand parses with all flags, default --days=90, dry-run preserves rows + prints dry_run=true, apply deletes non-audit + preserves audit + emits trace.swept row, missing DB exits non-zero with stderr diagnostic, --days 0 rejected with exit 2. Repo total: 1597 passed, 1 pre-existing unrelated failure (packages/metis-core/tests/redaction/test_event_redactor.py::test_aggregate_only_returns_none_per_event_and_finalizes_to_dict — decimal formatting "0.01" != "0.010"; reproduces on main with Wave-12 work stashed; redaction.md owner concern). Ruff clean on changed files. Helm chart lint clean (helm lint infra/gateway/helm/ passes; helm template … --set traceRetention.enabled=true renders the CronJob).

2026-05-15 — routing-engine.md §5.5 `cost_weight` default lowered 0.1 → 0.05 (§A3-rev5 follow-up; Wave 12)

Specs: routing-engine.md §5.5 (default rationale prose extended with the §A3-rev5 cost-floor diagnosis; example routing.yaml and the cost_weight configurability sentence updated; §12 decision log gains a 2026-05-15 entry). pattern-store.md §8.1 and §15.4 yaml example updated to the new default. benchmarks/RESULTS.md §A3-rev5 Q1 finding gains a “follow-up” subsection noting the fix landed.
Change: Closes the §A3-rev5 Q1 finding (“v2 wiring correct, K-NN sample-balance still dominant”). Investigation of the benchmarks/.runs/a3rev5-patterns.db snapshot (54 fingerprints / 54 outcomes across 7 workloads) showed the actual mechanism is not sample-size dominance per se — sample_size=1 per row means each fingerprint contributes equally to the weighted-mean — but the cost-efficiency floor: cost_efficiency normalizes per cluster to [0.0, 1.0], so at cost_weight=0.1 whichever model is cheapest gets a flat +0.10 score floor regardless of cluster geometry. On regex-with-edge-cases (haiku q=0.91, sonnet q=1.00) this floor swamps the 0.09 quality delta and slot 4 picks haiku at conf=0.011 (gates off → slot 7 wins). The same shape repeats on fix-a-bug-small (haiku q=0.84, sonnet q=1.00 on the intent=() sub-fingerprint). Direct simulation against the snapshot under cw=0.05 enables 6 sonnet picks that pass the min_confidence=0.05 gate where cw=0.10 produces 0; haiku-correct decisions on workloads with genuine quality dominance (multi-file-refactor q=0.79 vs 0.67; multi-turn-refactor q=1.00 vs 0.95) still pick haiku at high confidence (conf=0.20–0.26 at cw=0.05). One-line change: PatternConfig.cost_weight: 0.1 → 0.05 in policy.py. The scoring formula in aggregation.py is unchanged. Per-prompt sub-cluster partitioning (Path B in the §A3-rev5 brief) was considered as an alternative wedge but found unnecessary: the K-NN already pulls 9 of 10 same-workload neighbors per cluster on §A3-rev5 data, so cluster contamination is not the dominant signal; the cost-floor is.
Type: breaking-default for any workspace that depended on the prior cost_weight=0.1 cost bias. Restate cost_weight: 0.1 in routing.yaml to opt out (parallel to the cost_weight: 0.3 opt-out path landed on 2026-05-14). (1) Workspaces with no pattern.cost_weight override pick up the new default on next reload. (2) The aggregate_recommendation and PatternStore.recommend API surface is unchanged: callers still pass cost_weight as a parameter, and the engine passes the new default through. (3) Confidence math is unchanged; min_confidence=0.05 still scales appropriately because cost_efficiency saturation under cw=0.05 contributes at most ~0.05 to confidence (down from ~0.10 under cw=0.1), keeping the gate inversion-friendly. (4) Pre-2026-05-15 patterns DBs are unaffected: the constant change is consumer-side at routing time, not at recording time.
References to verify:
- pattern-store.md §8.1 (recommend call site default annotation) — updated. ✓
- pattern-store.md §15.4 (yaml example) — updated. ✓
- benchmarks/RESULTS.md §A3-rev5 Q1 finding — gains a “follow-up landed” subsection. ✓
- routing-engine.md §5.5 “Default rationale” — extended prose now narrates 0.3 → 0.1 → 0.05 chronologically; §12 decision log gains a 2026-05-15 entry. ✓
- §A3-rev6 validation (Wave 12 follow-on 12a-7) is gated on this fix landing per the Wave-12 coordination plan; no cross-spec reference needed yet.
Status: verified. 3 tests modified, 2 tests added. (1) tests/patterns/test_aggregation.py::test_a3rev5_unblock_lowering_cost_weight_inverts_chooser_at_smaller_delta — pure-math test on aggregate_recommendation showing haiku=0.78 / sonnet=0.85 (Δ=0.07 quality) flips chooser from haiku at cw=0.10 to sonnet at cw=0.05. The pre-existing test_a3rev_unblock_lowering_cost_weight_flips_chooser is renamed-in-place to use intermediate (cw=0.1) instead of new to keep the chain visible. (2) tests/patterns/test_store.py::test_recommend_a3_rev5_unblock_cost_weight_default_05_inverts_chooser — end-to-end store test seeding 5 haiku samples q=0.91 / 5 sonnet samples q=1.00 with 10x cost asymmetry; recommend(cost_weight=0.1) picks haiku, recommend(cost_weight=0.05) picks sonnet, with exact score arithmetic asserted. (3) tests/routing/test_policy_loader.py::test_pattern_cost_weight_default_is_zero_point_zero_five — replaces ..._default_is_zero_point_one; asserts the dataclass default and yaml-omission default both equal 0.05. (4) tests/routing/test_policy_loader.py::test_pattern_cost_weight_explicit_override_preserves_old_defaults — extended to assert both cost_weight: 0.3 and cost_weight: 0.1 opt-out paths still parse (confirms two prior defaults are restate-able). Workspace-wide suite: 1599 passed (1597 baseline including parallel Wave 12 compliance work + 2 new). Ruff clean on changed files.
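
The cost-floor arithmetic in this entry can be reproduced with a small sketch. It assumes an additive score, `quality + cost_weight * cost_efficiency`, with `cost_efficiency` min-max normalized per cluster. That form is inferred from the numbers quoted above (a flat +0.10 floor for the cheapest model at cw=0.1) and is not copied from aggregation.py; the function names and the 1:10 cost ratio are illustrative.

```python
def cost_efficiency(costs):
    """Min-max normalize costs to [0.0, 1.0]; the cheapest model gets 1.0."""
    lo, hi = min(costs.values()), max(costs.values())
    if hi == lo:
        # Assumption: with no cost spread, nobody earns a cost bonus.
        return {model: 0.0 for model in costs}
    return {model: (hi - cost) / (hi - lo) for model, cost in costs.items()}


def pick(quality, costs, cost_weight):
    """Return (winner, scores) under the assumed additive scoring form."""
    eff = cost_efficiency(costs)
    scores = {m: quality[m] + cost_weight * eff[m] for m in quality}
    return max(scores, key=scores.get), scores


# Hypothetical per-call costs with the 10x asymmetry used in the store test.
COSTS = {"haiku": 1.0, "sonnet": 10.0}

# regex-with-edge-cases: a 0.09 quality delta.
REGEX_Q = {"haiku": 0.91, "sonnet": 1.00}
# Pure-math test case: a 0.07 quality delta.
SMALL_Q = {"haiku": 0.78, "sonnet": 0.85}
```

Under these assumptions, `pick(REGEX_Q, COSTS, 0.10)` selects haiku because its +0.10 floor exceeds the 0.09 quality gap, while `pick(REGEX_Q, COSTS, 0.05)` selects sonnet. The same flip occurs for `SMALL_Q`.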

2026-05-15 — docs/operations/soc2-readiness.md + compliance-overview.md added (SOC2 gap audit; Wave 12)

Specs: none. Pure operator / buyer-facing docs under docs/operations/, sibling to the Wave 11 ops docs (incident-response.md, sla-template.md, status-page.md, upgrade-guide.md). Builds on the Wave 12 spec triad (audit-log.md 12a-1, trace-retention.md 12a-2, redaction.md 12a-3) as evidence pointers without redefining any contract. The “Audit and compliance posture” bullet in the project strategy (private) is replaced: it used to read “Trace events are the raw material; aggregation/retention/redaction policies for buyer-facing artifacts are not yet designed”; it now points at the SOC2 gap audit, the Wave 12 spec triad, and the cert-path timeline (Type 1 Q3 2026 contingent on buyer underwriting the audit fee). README.md “Operations” section gains a fourth bullet linking the two new docs.
Change: Two new buyer-facing compliance docs close the “no SOC2 conversation answer” gap that the project strategy (private) (buyer ≠ user; B2B framing) made load-bearing. (a) soc2-readiness.md (~410 lines) — SOC2 Trust Service Criteria gap audit. Maps the 2017 TSC set (Security CC1-CC9, Availability A1, Confidentiality C1, Processing Integrity PI1, Privacy P1-P8) against current Metis state. Each criterion gets a four-column row: status (implemented / partial / gap / buyer-responsibility), evidence (file path / spec section / runbook / CLI), buyer additions (TLS terminator / cloud baseline / IdP / etc.), Wave 12 delta paragraphs flagging where the in-this-wave specs change the status. Categories where the gap audit names existing strength: PI1 (canonical message format + cost-attribution math + 1486 tests passing — the strongest TSC for Metis); CC6.3 credential modification (Wave 10 rotate-key / revoke-key + audit events); CC7.3 / CC7.4 (Wave 11 incident-response.md + Wave 10 metis backup / restore). Categories where Wave 12 closes gaps: CC6 (audit-log.md metis audit export for the 9-event subset); CC7 + P4 (trace-retention.md 90-day default sweep with audit-event exemption); C1 + P3 + P4 (redaction.md 4-mode EventRedactor + metis user forget Article 17 pseudonymization-as-erasure). Honest about gaps named explicitly in §7: no formal change management (CC8 — solo part-time owner), no third-party pentest, no formal vendor security review of upstream LLM providers, no SOC2 auditor engagement (the cert path is post-GA), no tamper-evident audit log (multi-user.md §7.4 item 2), no CVE scanning of the Docker image, no SSO/SAML, no RBAC, no automatic background-check policy, no agent-path quota enforcement. §8 talking points: three-level buyer-conversation framing (is it certified / show me the controls / when can you commit to Type 1 or Type 2) plus the anti-pattern callout (don’t promise a cert as a feature delivery; the cert is an ongoing control-operation program). 
(b) compliance-overview.md (~130 lines) — one-page index. Quick-reference “buyer asks X → read Y” table covering 16 common compliance questions. Framework-coverage table naming SOC2 + GDPR as the v1 scope; HIPAA / ISO 27001 / PCI-DSS / FedRAMP marked out of scope with honest reasoning. Three-layer shared-responsibility model ASCII diagram (buyer org layer / Metis app layer / cloud-provider baseline layer). Compliance-posture-by-deployment-shape table linking back to deployment-shape.md §6. Doc-evolution cadence section keyed off the CHANGES.md fan-out trigger.
Type: additive. (1) No spec contract changes. The new docs sit under docs/operations/ (not docs/specs/) and reference shipped behavior (Wave 9 multi-user identity, Wave 10 key rotation + audit events + backup/restore, Wave 11 ops runbooks + /metrics + rate-limit middleware + api-versioning enforcement, Wave 12 audit-log + retention + redaction) without specifying new behavior. (2) The Wave 12 spec triad is referenced as shipped (all three spec files exist on disk as of this entry); the SOC2 doc’s “Wave 12 delta” paragraphs cite concrete §-refs (audit-log.md §4 / §9, trace-retention.md §2.1 / §3.1 / §5 / §7.1 / §8, redaction.md §1 / §2 / §3.1 / §3.2 / §5). (3) No code changes; no test changes. (4) the project strategy (private)’s replaced bullet is a sentence-level edit; the surrounding context (multi-user from day one, team-level cost attribution, policy enforcement, deployment story, proof of savings) is unchanged. (5) README.md “Operations” section gains one bullet — no structural change.
References to verify:
- multi-user.md §7 (identity-relevant audit + SOC2-relevant questions surfaced for the owner) — soc2-readiness.md §2 CC6 and §6 P-categories cite §3.3 (privacy by default, no plaintext PII in trace events), §7.2 (gateway.key_* event catalog), §7.4 (4 SOC2-relevant questions: retention period, tamper-evidence, plaintext PII handling, right-to-delete). Items 1 (retention period) / 3 (plaintext PII) / 4 (right-to-delete) are closed by the Wave 12 triad; item 2 (tamper-evidence — cryptographic signing / hash-chained event ids) remains a named gap in §7. ✓
- gateway.md §11 (key lifecycle CLI + audit events) — soc2-readiness.md §2 CC6.3 cites metis gateway issue-key / revoke-key / rotate-key / list-keys and the three audit event types (gateway.key_issued / gateway.key_revoked / gateway.key_rotated) as implemented evidence. ✓
- audit-log.md §4 / §9 (9-event v1 subset + metis audit export CLI) — soc2-readiness.md §2 CC6 Wave 12 delta paragraph cites the full v1 subset and the deterministic JSONL/CSV export shape. ✓
- trace-retention.md §2.1 / §3.1 / §5 / §7.1 / §8 (90-day default, sweep mechanics, audit-event exemption, metis trace prune, helm CronJob) — soc2-readiness.md §2 CC7 Wave 12 delta paragraph cites the sweep contract and the helm-CronJob template. ✓
- redaction.md §1 / §2 / §3.1 / §3.2 / §5 (export-time redactor preserves append-only invariant at recording; 4 modes; identity hashing; PRIVATE-tier sentinel; metis user forget Article 17 pseudonymization-as-erasure as the documented exception) — soc2-readiness.md §4 C1 Wave 12 delta paragraph + §6 P3 + §6 P4 cite the redactor modes and the GDPR-forget path. ✓
- event-bus-and-trace-catalog.md §4.4 (sensitivity classification taxonomy private / user_controlled / pseudonymous / aggregatable) — soc2-readiness.md §4 C1.1 cites the floor-and-downgrade rule (§4.4.1) as the load-bearing C1 evidence; redaction.md extends this to export-time enforcement. ✓
- event-bus-and-trace-catalog.md §7.5 (backup/restore contract) — soc2-readiness.md §2 CC7.4 cites VACUUM INTO-based hot snapshots + schema-version guarded restore as the disaster-recovery evidence. ✓
- gateway.md §3.2 + server-api.md §3.1 (loopback-only bind enforced) — soc2-readiness.md §2 CC6.5 and §4 C1 cite the loopback posture as the load-bearing perimeter control until TLS terminator is layered. ✓
- observability.md (Prometheus /metrics) — soc2-readiness.md §2 CC4.1 and CC7.2 cite the 10-metric-series surface as the monitoring evidence. ✓
- tool-dispatcher.md §5.1 (workspace-scoped file API, .. / out-of-root symlinks rejected) — soc2-readiness.md §2 CC6.1 cites as logical-access evidence. ✓
- deployment-shape.md §6 (local-first vs in-VPC vs SaaS posture) — compliance-overview.md “Compliance posture by deployment shape” table cross-references; the v1 reference posture is in-VPC. ✓
- the project strategy (private) — sentence-level replacement on the audit/compliance bullet; the new sentence points at the SOC2 gap audit + Wave 12 spec triad + cert-path timeline. ✓
- README.md “Operations” — one new bullet linking compliance-overview.md + soc2-readiness.md. ✓
Status: verified (pure docs; renders cleanly; no test or schema impact).

2026-05-15 — audit-log.md v1 (new spec; audit subset + JSONL/CSV export + `metis audit export`; Wave 12)

Specs: new audit-log.md v1 (~410 lines: definition, taxonomy, storage, append-only invariant, export shape, API, CLI, SOC2/GDPR posture, open questions, testing); event-bus-and-trace-catalog.md gains §7.6 “Audit subset” cross-reference under “Persistence”.
Change: Splits the trace store into two logical retention tiers without a parallel write path. New constant AUDIT_EVENT_TYPES: frozenset[str] and is_audit_event(t) helper live in packages/metis-core/src/metis_core/events/payloads.py — the v1 audit subset is 12 types covering credential lifecycle (gateway.key_issued, gateway.key_revoked, gateway.key_rotated), budget enforcement (gateway.quota_exceeded, quota.alert), policy compliance (routing.policy_invalid), resource-cap fires (memory.eviction, pattern.evicted), consent records (tool.confirmation_resolved), retention sweep history (trace.swept, self-preserving), and GDPR rights operations (analytics.user_exported, analytics.user_forgotten). metis_core.trace.retention re-exports is_audit_event / AUDIT_EVENT_TYPES so the sweep code reads a single source of truth. New module packages/metis-core/src/metis_core/audit/ exposes AuditLog(trace).query(window=..., event_types=...) -> Iterator[Event] and AuditLog.export(dest, *, window, format="jsonl"|"csv", event_types=None) -> AuditExportResult (refuses to overwrite an existing destination, creates parent dirs, deterministic byte-for-byte output). Storage is a pure derived view — SELECT * FROM events WHERE type IN (<audit_types>) AND timestamp_us ... riding the existing (type, timestamp_us) index; no parallel table, no schema migration. New CLI subcommand metis audit export <dest> (apps/cli/src/metis_cli/main.py, apps/cli/src/metis_cli/audit.py) with --db-path / --format / --since / --until / --event-type flags; prints a deterministic block on success (destination / format / events / window / oldest+newest event ids / bytes; no random ids, no current-time stamps) and a one-line diagnostic to stderr on failure.
Type: additive. (1) No new event types, no payload changes — audit-relevance is a derived flag, not a parallel event family. (2) PAYLOAD_REGISTRY tuple shape is unchanged; AUDIT_EVENT_TYPES is a sibling constant so existing unpacking call sites don’t churn. (3) TraceStore is unchanged — AuditLog is a read-only consumer that reaches through to the trace store’s connection. (4) Retention sweep is not in this spec — landed separately as 12a-2; this spec defines the flag, the sweep reads it. Until 12a-2 lands, all trace events are de-facto preserved, trivially satisfying “audit events are preserved.” (5) The CLI gains a new top-level subcommand (audit); existing chat / tui / serve / evaluate / gateway / backup / restore flows are untouched.
References to verify:
- event-bus-and-trace-catalog.md §7.1 (schema) — unchanged; audit query uses the existing idx_events_type_timestamp index. ✓
- event-bus-and-trace-catalog.md §7.3 (retention) — already references the audit exemption (type NOT IN (<audit_types>) in the sweep DELETE); §7.6 now points at audit-log.md as the source of truth for AUDIT_EVENT_TYPES. ✓
- event-bus-and-trace-catalog.md §6 (catalog) — every type in AUDIT_EVENT_TYPES MUST be in PAYLOAD_REGISTRY; test in packages/metis-core/tests/audit/test_audit_log.py enforces. ✓
- event-bus-and-trace-catalog.md §4.4 (sensitivity) — orthogonal to audit-relevance; documented in audit-log.md §3 + §7.6. ✓
- multi-user.md §7 (“Audit + compliance posture”) — names the audit-export requirement and sketches the shape; this spec is the contract. The three gateway.* audit-relevant events called out in §7.2 land in the v1 audit subset. ✓
- gateway.md §11 — gateway.key_* event types are the load-bearing audit records; all three are in the v1 subset. ✓
- analytics-api.md §2.1.5 — “catalog-sourced data is the only source” rule; audit log honors this. ✓
- canonical-message-format.md §6.4 — Decimal serialization convention; reused by the JSONL export’s enc_hook. ✓
Status: verified for the additive scope. 20 new tests total: 13 library tests in packages/metis-core/tests/audit/test_audit_log.py cover is_audit_event membership, audit-subset-⊆-registry cross-check, query filters (audit types only, in-window only, ordered by id ascending, lax-filter on non-audit types), JSONL round-trip + determinism + Decimal-as-string serialization, CSV round-trip + determinism, empty-window edge cases (zero-byte JSONL, header-only CSV), refuse-overwrite, parent-dir creation, unsupported-format error, simulated retention sweep preserving audit rows (forward-compat with 12a-2); 7 CLI tests in apps/cli/tests/test_audit_cli.py cover argparse shape, end-to-end JSONL export, missing-DB diagnostic, unknown-event-type rejection, non-audit-type warning, refuse-overwrite via CLI, default format.
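
A minimal sketch of the derived-view idea and the export guarantees described in this entry (audit-type filter, id-ascending order, Decimal-as-string, refuse-overwrite, parent-dir creation). The `AUDIT_EVENT_TYPES` members are the twelve types listed above; everything else (`export_jsonl`, the dict-shaped events, `demo`) is a hypothetical stand-in for `AuditLog.export` and its real `Event` type.

```python
import json
import tempfile
from decimal import Decimal
from pathlib import Path

AUDIT_EVENT_TYPES = frozenset({
    "gateway.key_issued", "gateway.key_revoked", "gateway.key_rotated",
    "gateway.quota_exceeded", "quota.alert", "routing.policy_invalid",
    "memory.eviction", "pattern.evicted", "tool.confirmation_resolved",
    "trace.swept", "analytics.user_exported", "analytics.user_forgotten",
})


def is_audit_event(event_type: str) -> bool:
    return event_type in AUDIT_EVENT_TYPES


def _encode(value):
    # Decimals are written as strings so "0.010" survives a round trip.
    if isinstance(value, Decimal):
        return str(value)
    raise TypeError(f"cannot serialize {type(value).__name__}")


def export_jsonl(events, dest) -> int:
    """Write audit events to dest as JSONL, ordered by id; return the count."""
    dest = Path(dest)
    dest.parent.mkdir(parents=True, exist_ok=True)
    rows = sorted(
        (e for e in events if is_audit_event(e["type"])), key=lambda e: e["id"]
    )
    # Mode "x" raises FileExistsError rather than overwriting.
    with dest.open("x", encoding="utf-8", newline="\n") as fh:
        for event in rows:
            line = json.dumps(
                event, sort_keys=True, separators=(",", ":"), default=_encode
            )
            fh.write(line + "\n")
    return len(rows)


def demo():
    events = [
        {"id": 2, "type": "gateway.key_issued", "payload": {"key_id": "k1"}},
        {"id": 1, "type": "quota.alert", "payload": {"spend_usd": Decimal("0.010")}},
        {"id": 3, "type": "turn.completed", "payload": {}},  # not audit-relevant
    ]
    with tempfile.TemporaryDirectory() as tmp:
        dest = Path(tmp) / "exports" / "audit.jsonl"
        count = export_jsonl(events, dest)
        first_bytes = dest.read_bytes()
        try:
            export_jsonl(events, dest)
            refused = False
        except FileExistsError:
            refused = True
    return count, first_bytes, refused
```

Sorting keys and fixing the separators is what makes the output byte-for-byte reproducible in this sketch; no timestamps or random ids are written.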

2026-05-15 — pattern-store.md §16 v2 recording-side wiring fix + cluster-tightening A/B landed (Wave 11)

Specs: pattern-store.md (no spec text change; closes the §A3-rev4 Q1 partial-wiring gap that §16.13 implementation notes left for Wave 11. §16.10 test 5 — the cluster-tightening A/B — is now landed as packages/metis-core/tests/patterns/test_v2_cluster_tightening.py, the “deferred to Phase 4” gate the AGENTS.md “What’s NOT built” list flagged). AGENTS.md to be updated to remove the “Pattern store v2 cluster-tightening A/B” entry from “What’s NOT built” and to add a “Pattern store v2 — recording-side wired end-to-end (Wave 11)” entry under “What works”.
Change: Closes the §A3-rev4 Q1 finding (“v2 wiring partial: recorded fingerprints stayed STRUCTURAL, embeddings landed only in the cache after store.record() returned, so routing-time K-NN fell back to v1 weighted-Jaccard via the mixed-version detection path”). The fix moves embedding computation to turn-start, inside the SessionManager’s fingerprint_inputs_hook, so by the time turn.completed fires the inputs already carry embedding + embedding_provider and compute_fingerprint produces a HYBRID row natively. (a) packages/metis-core/src/metis_core/sessions/manager.py — fingerprint_inputs_hook parameter signature widened from Callable[[str, TurnContext], None] to Callable[[str, TurnContext], Awaitable[None] | None]; submit_turn detects awaitables via inspect.isawaitable and awaits before emitting turn.started. The await is safe by construction: it happens BEFORE turn.started / route.decided / turn.completed, so the per-turn eval cascade (pattern.recorded → eval.completed → update_score) is not in flight and cannot race ahead of _turn_outcomes[turn_id] being set — the §A3-rev3 cascade invariant is preserved. (b) apps/cli/src/metis_cli/runtime.py — _on_turn_fingerprint_inputs is now async; for v2 sessions it awaits attach_embedding_for_recording(inputs, store, embedder) which (i) cache-hits and returns the stored vector, or (ii) on miss, awaits embedder.embed and writes the result to the cache; either way the returned inputs carry embedding populated. The hook then calls pattern_subscriber.set_fingerprint_inputs(turn_id, inputs) with the embedded inputs. Embedder failure logs and degrades gracefully to STRUCTURAL — no exception leaks into the turn loop. (c) packages/metis-core/src/metis_core/patterns/subscriber.py — the Wave-10 post-record attach_embedding_for_recording warm-up is REMOVED. 
The recording path is now purely synchronous: by the time _on_turn_completed fires, the embedded inputs are already in _fingerprint_overrides; compute_fingerprint produces HYBRID; store.record() writes a row with kind='hybrid', embedding_blob populated, embedding_provider matching the configured provider id. No await happens between record() and _turn_outcomes[turn_id] being set. The embedder parameter on PatternEventSubscriber.__init__ is retained as a documented back-compat no-op for callers that still pass it. (d) The routing engine’s slot-4 sync cache-only lookup (_attach_cached_embedding) is unchanged but now hits the cache populated at turn-start, so v2 K-NN reads the blended (cosine + jaccard) similarity end-to-end instead of falling back to v1 weighted-Jaccard via mixed-version detection (patterns/similarity.py: blended_similarity no longer takes the None-side path on v2 rows).
Type: additive at the wire level, with one signature widening on SessionManager.fingerprint_inputs_hook. (1) Pre-Wave-11 sync hooks continue to work — inspect.isawaitable returns False on a None return and the manager skips the await. (2) The embedder kwarg on PatternEventSubscriber is documented as deprecated but accepted; no external callers passed it (only runtime.py, updated in the same diff). (3) v1 workspaces are entirely unaffected — the embedding-precompute branch in runtime.py gates on pattern_cfg.fingerprint_version == "v2" and embedder is not None. (4) The attach_embedding_for_recording helper still exists in patterns.fingerprint and remains exported from metis_core.patterns for direct callers (tests, future async-mode strategies). (5) Existing v2 patterns DBs with STRUCTURAL rows continue to be queryable — the K-NN’s §16.5.3 mixed-version fallback handles them as before; new turns produce HYBRID rows that coexist with the legacy STRUCTURAL ones.
References to verify:
- pattern-store.md §16.4.4 (cache-miss flow) — the spec describes the miss path as part of the K-NN query; the implementation collapses it to the recording-side per §16.13 implementation notes. Wave 11 confirms the recording path now embeds at turn start (not post-record) so both the cache and the recorded row carry the vector. No spec text change required. ✓
- pattern-store.md §16.10 test 5 (cluster-tightening A/B) — the headline gate for “v2 pays for itself” now lands as test_v2_cluster_tightening.py::test_v2_cluster_tightening_meets_pattern_store_md_test_5. Asserts intra-cluster mean ≥0.10 higher AND inter-cluster mean ≥0.05 lower under v2 (α=0.6) than v1, on a 60-turn fixture (6 workloads × 10) plus 4 off-benchmark traces. Three companion tests document the fixture’s design choices (sanity check on v1 baseline not above the K-NN gate; off-benchmark traces don’t artificially inflate; workload_id partition would short-circuit v1’s intra mean if set, so the test deliberately leaves workload_id=None). ✓
- benchmarks/RESULTS.md §A3-rev4 Q1 — the “v2 wiring partial” finding now has a clean reproduction: a v2 session running through the CLI / benchmark harness records rows with kind='hybrid' and embedding_blob populated (verified by direct SQL inspection in test_v2_recording_writes_hybrid_fingerprint_row). The §A3-rev5 follow-up experiment can now answer “does the inversion generalize under fully-wired v2?” without the wiring caveat blocking the measurement. ✓
- event-bus-and-trace-catalog.md §6 — no new event types; pattern.recorded continues to carry fingerprint_kind which now correctly reports "hybrid" for v2 sessions. ✓
- routing-engine.md §5.5 — slot 4’s reason strings unchanged; delegate_request_in_flight deferral still fires for worker re-entry per delegation.md §11. ✓
Status: verified. 10 new tests across two new files. (1) packages/metis-core/tests/patterns/test_v2_recording_wiring.py (4 tests): async hook is awaited by SessionManager (asserts hook completes BEFORE turn.started fires), v2 recording writes a HYBRID row (SQL inspection of fingerprints table: kind='hybrid', embedding_blob non-NULL, embedding_provider matches, embedding_dim matches; store_meta.schema_version='2'), v2 recording preserves the eval cascade (success_score_count >= 1 after register_evaluator + drain — pins the §A3-rev3 cascade invariant under the new wiring), v2 recording without the hook falls back to STRUCTURAL cleanly (defensive degradation: structural row with NULL embedding_blob, no exception). (2) packages/metis-core/tests/patterns/test_v2_cluster_tightening.py (6 tests): the §16.10 test 5 headline gate, off-benchmark cosine sanity check, v1 baseline-not-above-gate sanity, workload_id-partition reverse sanity (documents the fixture’s deliberate workload_id=None), and a parametrized α-sweep (α=0.6 clears the gates with comfortable margin: intra Δ ≈ 0.49, inter Δ ≈ 0.18; α=0.4 has positive deltas but smaller, validating that v2’s selectivity depends on the embedding-dominance assumption per §16.5.2). Full pattern test suite: 131 passed. Repo total: 1486 passed. Ruff clean. Mypy on changed files: one pre-existing unrelated error in subscriber.py::PatternEvicted trigger literal arg-type (present on main before this diff; not touched by Wave 11).
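
The signature widening on `fingerprint_inputs_hook` comes down to one pattern: call the hook, await the result only if it is awaitable, and emit `turn.started` afterwards. A minimal sketch follows, with a plain dict standing in for `TurnContext` and a list standing in for the event bus; both are assumptions for illustration, as is the simplified `submit_turn` signature.

```python
import asyncio
import inspect
from typing import Awaitable, Callable, Optional

# Widened hook type: may return None (sync) or an awaitable (async).
Hook = Callable[[str, dict], Optional[Awaitable[None]]]


async def submit_turn(turn_id: str, ctx: dict, hook: Hook, emitted: list) -> None:
    """Run the hook to completion, then emit turn.started."""
    result = hook(turn_id, ctx)
    if inspect.isawaitable(result):
        # Async hooks finish here, before any turn event is emitted.
        await result
    emitted.append("turn.started")


def sync_hook(turn_id: str, ctx: dict) -> None:
    ctx["log"].append("sync-hook")


async def async_hook(turn_id: str, ctx: dict) -> None:
    await asyncio.sleep(0)  # stands in for an embedder call
    ctx["log"].append("async-hook")


def run(hook: Hook) -> list:
    """Drive one turn and return the ordered log of what happened."""
    log: list = []
    asyncio.run(submit_turn("t1", {"log": log}, hook, log))
    return log
```

In both cases the hook's entry lands in the log before `turn.started`, which is the ordering the first wiring test asserts.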

2026-05-15 — delegation.md §11.1 + benchmark workload `multi-step-with-delegation` (planner-driven delegation validation; Wave 11)

Specs: delegation.md (new §11.1 “Validation workload” cross-references the new benchmark suite). RESULTS.md gains a “Workload multi-step-with-delegation (Wave 11)” subsection ahead of any future §A3-rev5 entry; benchmark.md aggregate-expect enumeration to be updated when next opened (the new min_delegate_calls assertion key is additive).
Change: Ships the workload the §A3-rev4 Q2 finding called for. New workload at benchmarks/workloads/multi-step-with-delegation/ — small auth module (~200 LoC across auth/password.py, auth/oauth.py, auth/apikey.py, auth/registry.py, test_auth.py) where three providers duplicate validation + logging boilerplate; the refactor target is to extract a shared AuthProvider Protocol/base. The workload ships .metis/routing.yaml (global_default: anthropic:claude-sonnet-4-6) as a backstop. Harness extensions in scripts/benchmark.py: (a) auto-detect workload-shipped .metis/routing.yaml after shutil.copytree and pass it to setup_runtime as routing_policy_path — Wave-10 added the explicit-path plumbing but no caller exercised it from a workload directory before; conflict-detected against --fingerprint-version v2 (which writes its own routing.yaml) with a loud error rather than silent clobber. (b) New min_delegate_calls key on expect: (added to _ALLOWED_AGG_EXPECT); the harness counts delegate.started events for the planner session via TraceStore.events_for_session after each workload completes, exposes the count on WorkloadResult.delegate_call_count, threads it into _check_assertions, and prints a delegate.started count = N line on every non-zero run. Validation run on 2026-05-15 with --model sonnet --delegation-policy sonnet-planner-haiku-worker: 3 delegate.started / 3 delegate.completed (all success=True) / 3 worker sessions with parent_session_id correctly stamped / slot 5 wins inside every worker re-entry / slot 4 defers with reason="delegate_request_in_flight" per delegation.md §11 / pytest 8/8 passed after the refactor / 23.6% savings against sonnet-only baseline at $0.235 total spend. 
The validation also surfaced the §5.6 active-model filter gotcha that blocked an earlier attempt: --no-active-model silently hides the delegate tool because session.active_model is None short-circuits the can_delegate check in _effective_tool_definitions, so the workload must be run with --model sonnet. Documented in workload.yaml description, delegation.md §11.1, and RESULTS.md.
Type: additive. (1) Existing workloads unchanged — the new min_delegate_calls key is opt-in; pre-Wave-11 workloads that don’t set it see no behavior change. (2) The workload-shipped routing.yaml auto-detection is conditional on file existence; workloads without .metis/routing.yaml (every pre-Wave-11 workload) fall back to the existing setup_runtime default (~/.metis/routing.yaml or EMPTY_POLICY) unchanged. (3) --fingerprint-version v2 callers that previously wrote a routing.yaml unchecked now error if the workload also ships one — no prior workload ships one, so this is a new-callers-only behavior change. (4) WorkloadResult.delegate_call_count is an additive dataclass field (default 0); the JSON artifact gains the key but readers that ignore unknown fields are unaffected. (5) apps/cli/tests/test_benchmark.py::test_shipped_workloads_load_clean updated to include the new workload in the expected name set; this is the one test that pins the workload roster.
References to verify:
- delegation.md §3.6 (explicit v1 MVP deferrals) — unchanged; the new workload exercises what shipped (planner-driven slot 5, worker tool isolation, cost attribution) without touching any deferred surface. ✓
- delegation.md §5.6 (the _effective_tool_definitions active-model filter) — the §11.1 cross-reference and the workload’s description both name this filter explicitly so the §A3-rev5 author doesn’t reproduce the --no-active-model mis-invocation. ✓
- delegation.md §11 (slot 4 defer when delegate_request_in_flight) — validation run confirms slot 4 defers correctly on all 3 worker re-entries. ✓
- benchmark.md §3.1 (workload schema) — min_delegate_calls is the only addition to the aggregate-expect surface; _ALLOWED_AGG_EXPECT enumerates the closed set. ⏳ benchmark.md aggregate-expect enumeration to be updated when next opened.
- event-bus-and-trace-catalog.md §6.8 (delegate.* events) — unchanged; validation reads them via the existing TraceStore.events_for_session query. ✓
Status: verified. Suite: 1432 passed (1431 baseline + 1 modified test_shipped_workloads_load_clean to include the new workload). Live-API validation: 1 run, $0.235 spend, all assertions pass. Ruff clean on scripts/benchmark.py.
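
The `min_delegate_calls` assertion amounts to counting `delegate.started` events for the planner session and comparing against the configured floor. The sketch below shows that check in isolation; the dict-shaped events and both function names are hypothetical, and the real harness reads events through `TraceStore.events_for_session`.

```python
def delegate_call_count(events) -> int:
    """Count delegate.started events in an iterable of event dicts."""
    return sum(1 for event in events if event["type"] == "delegate.started")


def check_min_delegate_calls(events, expect: dict) -> list:
    """Return a list of failure messages; empty means the assertion passed."""
    minimum = expect.get("min_delegate_calls")
    if minimum is None:
        return []  # opt-in key: workloads that omit it see no behavior change
    count = delegate_call_count(events)
    if count < minimum:
        return [f"delegate.started count = {count}, expected >= {minimum}"]
    return []


# Event stream shaped like the validation run: three delegations.
DEMO_EVENTS = [
    {"type": "turn.started"},
    {"type": "delegate.started"},
    {"type": "delegate.completed"},
    {"type": "delegate.started"},
    {"type": "delegate.completed"},
    {"type": "delegate.started"},
    {"type": "delegate.completed"},
    {"type": "turn.completed"},
]
```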

2026-05-15 — gateway-hardening.md v1 (perimeter posture + rate-limit middleware; Wave 12 prep)

Specs: gateway-hardening.md (new — drafted v1). Documents the layered defenses a buyer composes in front of the loopback-only gateway before lifting the v1 bind: TLS termination posture (Caddy / nginx-ingress / cloud LB), per-key + per-IP token-bucket rate limiting, alert-only abuse detection, gateway-key leak detection, DDoS delegated to the buyer’s edge. Updates CHANGES.md specs-in-scope + cross-reference map. Updates README.md “Buyer trial” with an explicit loopback-only callout so buyers don’t expose the gateway directly. The gateway stays loopback-only in v1 — this spec documents the perimeter, not a posture change.
Change: (a) New module apps/gateway/src/metis_gateway/middleware_ratelimit.py — pure-ASGI middleware (matching the middleware_versioning.py pattern so SSE response bodies aren’t buffered) implementing two independent token buckets: per-key (default 60 RPM, keyed on SHA-256 of the bearer token — same fingerprint the keystore stores so the bucket id is stable lookup-free) and per-IP (default 1000 RPM, parsed from X-Forwarded-For per RateLimitConfig.trusted_proxies with a peer fallback). Capacity equals the refill amount so the documented “RPM” is both the steady-state ceiling and the burst budget. Storage is a bounded LRU (1000 entries per bucket type). (b) RateLimitConfig(enabled=False, per_key_rpm=60, per_ip_rpm=1000, max_tracked_keys=1000, trusted_proxies=()) — enabled=False is the v1 default so existing buyers see no behavior change. (c) apps/gateway/src/metis_gateway/app.py GatewayConfig gains a rate_limit: RateLimitConfig field; build_app(runtime, *, rate_limit=...) appends RateLimitMiddleware after VersioningMiddleware only when rate_limit.enabled. run_gateway threads cfg.rate_limit through. (d) 429 response body is inbound-shape-matched per app.py’s existing _openai_error / _anthropic_error envelopes: OpenAI clients see {error: {code: "rate_limit_exceeded", type: "rate_limit_error", scope, retry_after_seconds}}; Anthropic clients see the minimal {error: {type: "rate_limit_error"}}. Both responses set Retry-After: <seconds> (RFC 9110 §10.2.3, min 1 second, rounded up). (e) The middleware logs at WARN level on every 429 with bucket, rpm, retry_after, path, and a fingerprint prefix so operators can grep limit hits before the metrics counter lands. 
(f) Helm chart additions: values.yaml::rateLimit.enabled / perKey.rpm / perIp.rpm / trustedProxies (all OFF by default); templates/ingress.yaml gains a helm-source-only comment block documenting Caddy / nginx-ingress / Traefik annotation patterns for edge-layer rate limiting (kept as source comments rather than rendered YAML so unsupported annotation keys can’t break a buyer’s controller).
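
The bucket arithmetic in item (a) and the Retry-After rounding in item (d) can be sketched as follows. This is a minimal illustration, not the shipped `_Bucket`: the class name `TokenBucket`, the `try_acquire` method, and the injectable `now` argument are assumptions made for the example.

```python
import math
import time
from typing import Optional


class TokenBucket:
    """Token bucket whose capacity equals its per-minute refill amount."""

    def __init__(self, rpm: int, now: Optional[float] = None) -> None:
        self.capacity = float(rpm)        # burst budget equals the steady-state ceiling
        self.refill_per_sec = rpm / 60.0  # tokens added per second
        self.tokens = self.capacity       # a fresh bucket starts full
        self.updated = time.monotonic() if now is None else now

    def try_acquire(self, now: Optional[float] = None) -> int:
        """Return 0 when admitted, otherwise the Retry-After value in whole seconds."""
        now = time.monotonic() if now is None else now
        elapsed = max(0.0, now - self.updated)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_per_sec)
        self.updated = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return 0
        # Time until one whole token exists, rounded up and never below 1 second.
        deficit = 1.0 - self.tokens
        return max(1, math.ceil(deficit / self.refill_per_sec))
```

With `rpm=60`, sixty back-to-back calls are admitted, the sixty-first is rejected with a one-second hint, and a call one second later is admitted again.
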
Type: additive. (1) RateLimitConfig() defaults to enabled=False, so no existing buyer sees any behavior change after upgrading. (2) build_app(runtime) continues to work without the new rate_limit kwarg; existing callers are unchanged. (3) Provider-shape paths (/v1/chat/completions, /v1/messages) are the only paths the limiter applies to; /healthz and /metrics are exempt by path-prefix check. (4) When the middleware is disabled (default), it’s a no-op — short-circuits at the if not enabled check before reading any headers. (5) The spec explicitly defers the metis_ratelimit_requests_total / _tokens_available counters and the gateway.rate_limit_exceeded bus event to a follow-up wave (registering counters requires touching metis_core.observability.MetricsCollector, and a new payload would touch metis_core.events.payloads.PAYLOAD_REGISTRY; both are out of scope for this Wave). The spec reserves the names and naming-pattern compatibility with metis_quota_used_ratio / metis_pattern_matches_total for that follow-up. (6) Helm rateLimit.* knobs are net-new keys; existing values files keep working because the new keys have helm-side defaults.
References to verify:
- gateway.md §3.2 — loopback-only bind unchanged; gateway-hardening.md cross-references this section as the posture this spec extends, not replaces. ✓
- multi-user.md §5 — spend quotas (daily_cap_usd / monthly_cap_usd) compose with the rate limiter; the rate limit smooths burst spend so the durable cap remains the cost backstop (gateway-hardening.md §1 / §8). ✓
- server-api.md §3.1 — loopback-only safety guarantee cross-referenced; the agent server is not yet covered by rate-limit middleware. The spec scopes itself to the gateway. ✓
- observability.md — gateway-hardening.md §3.6 reserves metis_ratelimit_requests_total{bucket=,result=} and metis_ratelimit_tokens_available{bucket=,key=} for the follow-up wave that wires them into MetricsCollector. The naming follows the metis_ prefix + _total suffix convention already used by MetricsCollector. ⏳
- event-bus-and-trace-catalog.md §6 — gateway.rate_limit_exceeded reserved (PSEUDONYMOUS floor) for the follow-up wave that adds the payload to PAYLOAD_REGISTRY. No catalog change in this wave. ⏳
Status: verified. 18 new tests under apps/gateway/tests/test_middleware_ratelimit.py (unit: _Bucket capacity / refill / retry-after rounding [4]; _client_ip peer fallback / XFF with-and-without trusted-proxies / unparseable XFF / no-peer fallback [5]; HTTP: disabled-config doesn’t break /v1/* or /healthz [2]; disabled-by-default RateLimitConfig construction guard [1]; per-key bucket fires 429 at threshold with OpenAI envelope shape [1]; per-key bucket Anthropic envelope shape [1]; per-IP bucket fires 429 independent of per-key [1]; both buckets compose AND-not-OR with the tighter cap winning [1]; /healthz exempt under strict config [1]; refill via 60-second time travel at the unit layer [1]). Full gateway suite: 176 passed (158 baseline + 18 new). Repo total: 1486 passed. Ruff clean.
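
For readers who want the 429 wire shape from item (d) in one place, here is a hedged sketch. The function name `rate_limit_response`, the `shape` selector, and the example `scope` value are hypothetical; only the field names and the Retry-After rounding rule come from the entry above.

```python
import json
import math


def rate_limit_response(shape: str, scope: str, retry_after: float):
    """Build (status, headers, body) for a request rejected by the limiter."""
    seconds = max(1, math.ceil(retry_after))  # whole seconds, rounded up, minimum 1
    if shape == "openai":
        error = {
            "code": "rate_limit_exceeded",
            "type": "rate_limit_error",
            "scope": scope,  # which bucket tripped; the value format is assumed
            "retry_after_seconds": seconds,
        }
    else:
        # Anthropic-shaped inbound requests get the minimal envelope.
        error = {"type": "rate_limit_error"}
    headers = {"Retry-After": str(seconds)}
    return 429, headers, json.dumps({"error": error})
```

Both shapes carry the same Retry-After header, so a client that ignores the body can still back off correctly.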

2026-05-15 — api-versioning.md v1 enforcement live (410 unsupported, sunset auto-rejection, `OPTIONS` pre-flight, state plumbing; Wave 11)

Specs: api-versioning.md (status flipped from “Draft v1” to “v1 enforcement live”; §2.1 header table picks up Metis-API-Versions-Supported and the Deprecation / Sunset conditional row; §3 split into 3.1 lifecycle / 3.2 below-min and past-sunset 410 / 3.3 OPTIONS pre-flight discovery, each with a concrete worked example; §4 responsibilities expanded from 4 items to 8; §4.1 “Version-specific dispatch” added with an /analytics/cost worked example and the string-comparison caveat; §5 invariants extended from 6 to 10 items; §6 errors table reshaped with a row per condition + outcome; §7 testing list extended from 4 to 10 items; §8 decision log gains six new entries — HTTP 410 (not 400), strict > sunset comparison, OPTIONS short-circuit, request.state (not router-level) dispatch, bearer-hash fingerprint in warning logs).
Change: Turns the Wave 10 scaffolding into real enforcement. (a) apps/gateway/src/metis_gateway/middleware_versioning.py and apps/server/src/metis_server/middleware_versioning.py (near-identical siblings, per the same-app-no-shared-parent constraint) gain a frozen VersionResolution dataclass returned by resolve_version(requested, *, now=None): (resolved, is_deprecated, sunset, is_unsupported, reason). (b) MIN_SUPPORTED_VERSION is now enforced — pins below the floor return HTTP 410 with the documented version_unsupported body shape {"error": {"code": "version_unsupported", "requested": "...", "min_supported": "...", "current": "...", "reason": "below_min", "message": "..."}}. (c) DEPRECATED_VERSIONS sunset dates are now auto-respected — today > sunset_date (UTC, strict > so the sunset day itself is still in-window) flips the version from served-deprecated to 410’d-unsupported with reason="past_sunset". No scheduled job; the comparison runs per-request. (d) Every Metis-owned response now carries Metis-API-Versions-Supported: <comma-separated list> (driven by a new SUPPORTED_VERSIONS: tuple[str, ...] = ("1.0",) module constant), including on 410 bodies so a client knows what versions to retry with. (e) OPTIONS requests to Metis-owned paths short-circuit through the middleware and return 204 + the version-negotiation headers (and Deprecation / Sunset if the requested version is deprecated). Loopback-only v1 means there’s no CORS pre-flight to conflict with. (f) The middleware initializes scope["state"] (Starlette State) if missing and stamps state.metis_api_version = resolved so downstream handlers can read request.state.metis_api_version and branch on version (one-handler-per-route + in-handler version branching is documented in §4.1 with an /analytics/cost-future-1.1-field worked example). 
(g) Deprecation warning logs now include a bearer-hash fingerprint — first 12 hex chars of SHA-256(bearer token) — extracted from either Authorization: Bearer <token> or x-api-key, matching what the gateway keystore persists so an operator can grep keys.json for the buyer to notify before sunset. Returns "<no-auth>" when no token is present (e.g. agent-server traffic). (h) Provider-shape paths are still skipped first; 410 enforcement does not run on them so a misconfigured proxy sending Metis-API-Version: 0.9 on /v1/chat/completions still completes the buyer’s call.
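
The resolution rules in items (a) through (c) can be sketched as below. The field names and the `resolve_version(requested, *, now=None)` signature come from this entry; the `deprecated` keyword, the numeric `_key` helper, and the field values returned for each branch are assumptions added so the example is self-contained. Malformed version strings are not handled.

```python
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import Dict, Mapping, Optional, Tuple

CURRENT_VERSION = "1.0"
MIN_SUPPORTED_VERSION = "1.0"
DEPRECATED_VERSIONS: Dict[str, str] = {}  # version -> ISO sunset date


@dataclass(frozen=True)
class VersionResolution:
    resolved: str
    is_deprecated: bool
    sunset: Optional[str]
    is_unsupported: bool
    reason: Optional[str]


def _key(version: str) -> Tuple[int, ...]:
    # Compare numerically; as plain strings "1.10" would sort below "1.9".
    return tuple(int(part) for part in version.split("."))


def resolve_version(requested, *, now=None,
                    deprecated: Optional[Mapping[str, str]] = None) -> VersionResolution:
    table = DEPRECATED_VERSIONS if deprecated is None else deprecated
    moment = now if now is not None else datetime.now(timezone.utc)
    today = moment.astimezone(timezone.utc).date()
    if requested is None:
        return VersionResolution(CURRENT_VERSION, False, None, False, None)
    if _key(requested) < _key(MIN_SUPPORTED_VERSION):
        return VersionResolution(requested, False, None, True, "below_min")
    sunset = table.get(requested)
    if sunset is None:
        return VersionResolution(requested, False, None, False, None)
    # Strict ">" keeps the sunset day itself in-window.
    expired = today > date.fromisoformat(sunset)
    return VersionResolution(requested, True, sunset, expired,
                             "past_sunset" if expired else None)
```

Because the comparison runs per request, a frozen clock passed through `now` is enough to test the boundary without a scheduled job.
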
Type: additive at the spec level, modestly breaking at the wire level for one previously-served case. (1) Below-MIN_SUPPORTED_VERSION pins were previously served with Deprecation: true + Sunset headers (200); they now return 410. In v1 today this is purely theoretical — MIN_SUPPORTED_VERSION = "1.0" and the only pinned-below candidate ("0.9") was a synthetic test value. No real buyer is affected. The break sets the contract: pinned values below the floor will be rejected. (2) resolve_version’s return type changed from tuple[str, bool, str | None] to VersionResolution. Public consumers are the two test files we updated; no external code depends on the tuple shape. (3) OPTIONS requests on Metis-owned paths used to return 405 (Starlette’s default for non-GET routes); they now return 204 with negotiation headers. This is buyer-facing-positive — pre-flight discovery now works — and no production caller was relying on 405. (4) Provider-shape paths and existing handler bodies are untouched. (5) Metis-API-Versions-Supported is purely additive on responses.
References to verify:
- gateway.md §3.1 (provider-shape endpoints) — still untouched by 410 enforcement; the existing skip-by-prefix test was extended to assert below-min on /v1/chat/completions is NOT 410’d. ✓
- analytics-api.md §3.2 (response envelope) — the current_pricing_version field is orthogonal to Metis-API-Version headers; no edit required. ✓
- server-api.md (planned) — when it lands it should reference api-versioning.md §4.1 for the in-handler version-branching pattern. ⏳
- docs/gateway-client-quickstart.md §8 "Pinning a Metis API version" — current text still accurate (the policy section it links to is what changed). ✓
Status: verified. 20 new / rewritten tests across apps/gateway/tests/test_middleware_versioning.py (20 — frozen-clock past-sunset + boundary, 410 body shape + reason discriminators, Versions-Supported header round-trip, OPTIONS 204 + deprecated-OPTIONS variant, downstream request.state.metis_api_version accessor exercised via dummy ASGI app, provider-shape skip + provider-shape-below-min-not-rejected guard) and apps/server/tests/test_versioning_middleware.py (15 — same coverage minus the provider-shape rows since the server has no provider-shape surface). Versioning-middleware test suite: 35 passed. One pre-existing unrelated failure (apps/cli/tests/test_benchmark.py::test_shipped_workloads_load_clean — a new multi-step-with-delegation workload directory was added but the test’s expected set wasn’t updated; reproduces on main with my changes stashed) is out of scope for this Wave. Ruff clean.
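
The bearer-hash fingerprint from item (g) is small enough to sketch directly. The function name `bearer_fingerprint` and the assumption that header names arrive lower-cased are illustrative; the 12-hex-character SHA-256 prefix and the `"<no-auth>"` sentinel are as described above.

```python
import hashlib
from typing import Mapping


def bearer_fingerprint(headers: Mapping[str, str]) -> str:
    """First 12 hex chars of SHA-256(token), or "<no-auth>" when no token is sent."""
    auth = headers.get("authorization", "")
    token = ""
    if auth.lower().startswith("bearer "):
        token = auth[len("bearer "):].strip()
    if not token:
        # Fall back to the Anthropic-style key header.
        token = headers.get("x-api-key", "").strip()
    if not token:
        return "<no-auth>"
    return hashlib.sha256(token.encode("utf-8")).hexdigest()[:12]
```

The same token yields the same fingerprint regardless of which header carried it, which is what lets an operator grep the keystore for the matching record.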

2026-05-15 — docs/operations/ added (incident response, status page, SLA template; buyer-facing)

Specs: none. Pure operator-facing docs under docs/operations/ — a sibling to docs/gateway-deployment.md, not a spec edit. README.md gains an “Operations” section linking the three files.
Change: Three new buyer-facing operational documents close the day-one “no SLA story, no incident playbook, no status page recipe” gap flagged against the project strategy (private) (buyer ≠ user; buyers ask about SLA before signing). (a) incident-response.md — SEV1-SEV4 criteria with ack / mitigation / resolution targets, on-call alert paths (PagerDuty / Opsgenie / email via /healthz external probes, container-log filters, trace-DB SQL cron), four-beat first-hour playbook (detect / triage / mitigate / comms), blameless post-mortem template, and per-failure-mode playbooks for upstream LLM outage (failover via METIS_GATEWAY_GLOBAL_DEFAULT + OpenRouter fallback), trace-DB corruption / disk-full (recovery via metis backup / metis restore from Wave 10), gateway-key compromise (revocation via metis gateway revoke-key from Wave 10), and quota runaway (per-key daily_cap_usd / monthly_cap_usd enforcement). (b) status-page.md — two-tier recipe: external (UptimeRobot 50-monitor free / Statuspage.io / Better Stack against /healthz + a synthetic POST /v1/messages probe with a --daily-cap-usd 0.50 key) and self-hosted (Uptime Kuma via helm in a metis-ops namespace), plus publish/redact guidelines (tenant names + raw cost numbers redacted; upstream provider names published) and ISO-8601 communication templates (initial / identified / mitigating / resolution / scheduled). (c) sla-template.md — 99.5% single-region availability commitment (~3h 39m / month), service-credit tier table (10% / 25% / 50% by availability band, capped at 50% of monthly fee), exclusions (scheduled maintenance ≥48h notice cap 4h/mo; upstream provider outages; customer-induced quota / network / content-policy; force majeure deferred to legal counsel; beta features; security-driven patching cap 2h/mo), and SEV-based support response targets. 
The SLA is framed as a template for the buyer’s downstream-user SLA — Metis ships open-core, so Metis itself does not sign SLAs with the buyer; the buyer signs with whomever they serve through the gateway they operate. (d) README.md adds an “Operations” section between “Buyer trial” and “What it is” linking the three docs with one-line hooks.
Type: additive. (1) No spec contract changes. The new docs sit outside docs/specs/ and reference shipped behavior (Wave-10 metis backup / metis restore / metis gateway revoke-key / metis gateway rotate-key, quota.alert / gateway.quota_exceeded events from Wave 9a-2, /healthz / /health endpoints, route.decided.chain[*].verdict='unavailable' from routing-engine.md §6 per-(provider, model) availability tracking) without specifying new behavior. (2) No code changes; no test changes. (3) gateway-deployment.md is unchanged — operations docs cross-reference its “Backup & restore” and “Smoke test recipe” subsections rather than duplicating them.
References to verify:
- gateway.md §11 — key lifecycle (revoke-key / rotate-key / list-keys + 401 key_revoked body) referenced verbatim by incident-response.md’s “Gateway-key compromise” playbook. ✓
- event-bus-and-trace-catalog.md §7.5 — backup / restore contract referenced by incident-response.md’s “Trace DB corruption or disk full” playbook + status-page.md’s component list. ✓
- multi-user.md §5.1 + gateway.md §6.4 — daily_cap_usd / monthly_cap_usd per-key quotas and the quota.alert event referenced by incident-response.md’s “Quota runaway” playbook. ✓
- routing-engine.md §6 — per-(provider, model) availability tracking + verdict='unavailable' chain entry referenced by incident-response.md’s “Upstream LLM API outage” playbook. ✓
- analytics-api.md §4.8 — /analytics/by_key referenced by incident-response.md’s “Quota runaway” mitigation step (visible to the buyer for tenant-side spike validation). ✓
- observability.md — operations docs predate the shipped /metrics surface but compose cleanly with it; the trace-DB SQL probe and /metrics are complementary alert sources, not alternatives. A future revision of incident-response.md can add a /metrics-via-Prometheus alert row alongside the three existing alert paths.
- the project strategy (private) — buyer ≠ user framing; operations docs are written for the buyer-side SRE, not the dev who runs metis chat. ✓
- docs/gateway-deployment.md — operations docs cross-link to the install / TLS / backup-restore / helm sections rather than duplicating them. ✓
Status: verified (pure docs; renders cleanly; no test or schema impact).
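
The "~3h 39m / month" figure in sla-template.md follows from the 99.5% commitment if a month is taken as 365.25 / 12 days. That month length is an assumption made here to reproduce the number; a flat 30-day month gives 3h 36m instead. A quick check:

```python
from typing import Tuple


def monthly_downtime_budget(availability: float,
                            days_per_month: float = 365.25 / 12) -> Tuple[int, int]:
    """Allowed downtime per month as (hours, minutes), truncated to whole minutes."""
    total_minutes = (1.0 - availability) * days_per_month * 24 * 60
    return divmod(int(total_minutes), 60)


print(monthly_downtime_budget(0.995))      # (3, 39)
print(monthly_downtime_budget(0.995, 30))  # (3, 36)
```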

2026-05-15 — observability.md v1 (Prometheus `/metrics` on gateway + server)

Specs: observability.md (new); cross-reference list updated; CHANGES.md cross-reference map adds an observability.md row depending on event-bus-and-trace-catalog, gateway, server-api, multi-user, evaluator, pattern-store. gateway-deployment.md "Observability hooks" should pick up a row pointing at /metrics next time it’s edited (the spec line is more durable than the deployment-doc table; not blocking).
Change: Adds a GET /metrics Prometheus exposition endpoint to both metis-server and metis-gateway. New module packages/metis-core/src/metis_core/observability/ ships MetricsCollector — a non-fast-path bus subscriber over llm.call_completed, llm.call_failed, route.decided, pattern.matched, quota.alert, gateway.quota_exceeded, eval.completed — that maintains a private prometheus_client.CollectorRegistry. The metric surface (counter / gauge / histogram set + label cardinality discipline) is fixed in observability.md §3. Polled gauges (metis_session_count on the server, metis_gateway_keys_active / _revoked on the gateway) read their underlying source on every scrape via getters injected at construction time. Adds prometheus-client>=0.20.0 as a runtime dep on metis-core, metis-server, metis-gateway. Helm chart gains a monitoring.enabled toggle that renders templates/servicemonitor.yaml (Prometheus-operator shape) targeting the same Service port as the LLM endpoints. No spec-contract change to any existing event; no field added to payloads.py.
Type: additive.
References to verify:
- event-bus-and-trace-catalog.md §3.4 — non-fast-path subscriber posture honored by the new collector. (no change required)
- gateway.md §3.2, server-api.md §3.1 — loopback bind posture; /metrics rides the same loopback restriction as /healthz. (no change required)
- gateway-deployment.md "Observability hooks" — operator-facing surface table will benefit from a /metrics row + ServiceMonitor pointer next time the page is touched. (low priority follow-up)
Status: verified.
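
The "polled gauge" idea above (a getter injected at construction time and read on every scrape) can be illustrated without any dependency. This is a stdlib stand-in that renders the Prometheus text exposition format by hand; it does not use prometheus_client, and the `PolledGauge` class and the help text are hypothetical.

```python
from typing import Callable


class PolledGauge:
    def __init__(self, name: str, help_text: str, getter: Callable[[], float]) -> None:
        self.name = name
        self.help_text = help_text
        self._getter = getter  # injected once, called on every scrape

    def render(self) -> str:
        value = self._getter()  # read the live source at scrape time, never cached
        return (
            f"# HELP {self.name} {self.help_text}\n"
            f"# TYPE {self.name} gauge\n"
            f"{self.name} {value}\n"
        )


sessions = {}
gauge = PolledGauge("metis_session_count", "Live sessions (illustrative).",
                    lambda: len(sessions))
sessions["s1"] = object()
print(gauge.render())
```

Since the value is read lazily, the gauge reflects the session added after construction without any event subscription.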

2026-05-15 — delegation.md v1 MVP shipped (`delegate()` tool + worker sessions; Wave 10)

Specs: delegation.md (status flipped from “Draft v1, Phase 4 implementation pending” to “v1 MVP shipped”; new §3.6 enumerates explicit deferrals — async/concurrent workers, cancellation cascade, streaming, recursive delegation, output_schema validation, worker timeout, router-decided delegation, worker pattern-store integration); event-bus-and-trace-catalog.md §6.8 updated to reflect shipped payload fields (allowed_tool_count + dropped_tools on delegate.started; output_size_bytes + worker_total_cost_usd + model on delegate.completed; worker_total_cost_usd on delegate.failed; phase note flipped from “Phase 4 deferred” to “v1 MVP shipped — Wave 10”); analytics-api.md cost-endpoint subsection picks up include_workers query parameter and the new parent_session / is_worker group_by values (see verification list below); AGENTS.md “What’s NOT built” rewritten to remove the “Delegation — Phase 4 …” line and add a “Delegation v1 MVP” entry under “What works” with the deferred features named.
Change: Lands the delegate() built-in tool, the worker-session lifecycle, and the routing slot-5 re-entry path end-to-end. (a) New module packages/metis-core/src/metis_core/workers/ exporting DelegateRequest / DelegateResult / DelegateUsageSummary / DelegateOutcome / WorkerSpawner protocol / ContextSpec + tier / failure-mode literal types (msgspec frozen structs, Decimal cost). (b) SessionManager.spawn_worker resolves the tier → model via ModelRegistry.model_for_tier, creates a worker Session (is_worker=True, parent_session_id + parent_tool_use_id set; active_model=None so slot 5 fires fresh per §5.2), stashes the tier model in a per-id dict so _build_turn_context populates TurnContext.worker_tier_model, emits delegate.started, runs submit_turn synchronously, and returns DelegateOutcome. Failure modes mapped: tier miss → no_model_available_for_tier short-circuits before session creation; worker raise → worker_error; worker stop_reason=max_tokens → max_tokens_exceeded. (c) ModelEntry gains can_delegate: bool = False and delegation_tier: str | None = None; ModelRegistry.register accepts both; can_delegate(model) and model_for_tier(tier) helpers added. (d) Session gains parent_session_id / parent_tool_use_id / is_worker fields; SqliteSessionStore runs an idempotent PRAGMA table_info → ALTER TABLE migration on open (additive columns + a partial index on parent_session_id). (e) New DelegateTool implements the spec’s input schema (tier required, task required, optional context spec / allowed_tools / max_tokens), refuses if context.is_worker is True or context.worker_spawner is None, awaits spawner.spawn_worker, emits delegate.completed / delegate.failed based on the outcome, and returns the worker’s text as the tool result. (f) ToolContext gains worker_spawner and is_worker fields; ToolDispatcher.dispatch accepts and propagates both; SessionManager.submit_turn passes worker_spawner=self, is_worker=session.is_worker to every dispatch. 
(g) SessionManager._effective_tool_definitions filters delegate out of worker sessions and out of top-level sessions whose active model has can_delegate=False (delegation.md §5.6). Workers additionally lose memory_add / memory_replace / memory_consolidate so durable state stays read-only from inside a worker (§5.4). (h) LLMCallStarted / LLMCallCompleted / TurnCompleted gain parent_session_id: str | None; SessionManager stamps session.parent_session_id on every emit and uses Actor.WORKER instead of Actor.AGENT for worker turns. (i) ConfirmationRequest gains is_worker: bool = False; CLIConfirmationHandler._apply_answer skips trust.yaml persistence on “always” / “never” when the request originated inside a worker (§13’s conservative default). (j) AnalyticsStore.cost gains include_workers: bool = True and two new group_by values: parent_session (rolls workers under their planner via COALESCE(parent_session_id, session_id)) and is_worker (partitions planner vs worker buckets). The HTTP handler reads ?include_workers=false and forwards it. (k) Routing engine slot 4 (pattern) defers with reason="delegate_request_in_flight" when ctx.worker_tier_model is set, so a learned pattern can’t silently override the planner’s explicit tier= choice (delegation.md §11). (l) Three new typed payloads in events/payloads.py: DelegateStarted / DelegateCompleted / DelegateFailed (all Sensitivity.PSEUDONYMOUS).
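
Items (d) and (j) combine naturally: an idempotent additive migration, then a rollup that folds worker rows under their planner. The sketch below uses a deliberately simplified, hypothetical `sessions` table with a `cost_cents` column; the real store's schema and where cost is recorded are not specified in this entry.

```python
import sqlite3

NEW_COLUMNS = {
    "parent_session_id": "TEXT",
    "parent_tool_use_id": "TEXT",
    "is_worker": "INTEGER NOT NULL DEFAULT 0",
}


def migrate(conn: sqlite3.Connection) -> None:
    """Add any missing columns; safe to run on every open."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(sessions)")}
    for name, decl in NEW_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE sessions ADD COLUMN {name} {decl}")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_sessions_parent "
        "ON sessions(parent_session_id) WHERE parent_session_id IS NOT NULL"
    )


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (session_id TEXT PRIMARY KEY, cost_cents INTEGER)")
migrate(conn)
migrate(conn)  # the second run finds every column present and changes nothing
conn.executemany(
    "INSERT INTO sessions (session_id, cost_cents, parent_session_id, is_worker) "
    "VALUES (?, ?, ?, ?)",
    [("planner", 10, None, 0), ("worker-1", 4, "planner", 1), ("solo", 7, None, 0)],
)
rollup = conn.execute(
    "SELECT COALESCE(parent_session_id, session_id) AS root, SUM(cost_cents) "
    "FROM sessions GROUP BY root ORDER BY root"
).fetchall()
print(rollup)  # [('planner', 14), ('solo', 7)]
```
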
Type: additive. (1) Pre-delegation registries continue to compile — can_delegate defaults to False so the tool is invisible everywhere by default; delegation_tier defaults to None so model_for_tier returns None and the failure mode no_model_available_for_tier is the natural opt-out. (2) Pre-delegation SQLite session DBs auto-migrate via the additive ALTER TABLE columns; readers tolerate the schema bump. (3) Pre-delegation InMemorySessionStore.create_session callers still work — the three new kwargs default to None / False. (4) Slot 5 still reports not_applicable for top-level sessions; the test_phase1_stub_policies_always_not_applicable test still passes (the default ctx leaves worker_tier_model=None). (5) Existing route.decided.chain shape is unchanged — the seven policy slots, same order, same verdicts. (6) All existing analytics endpoints continue to accept their existing query strings (the new include_workers defaults to True so existing callers see no behavior change; group_by=parent_session / is_worker are opt-in). (7) The delegate tool is registered by register_builtins but filtered out per-session by SessionManager; dispatchers that opt out via register_builtins(dispatcher, with_delegate=False) see the pre-Wave-10 surface.
References to verify:
- routing-engine.md §4.1 / §6.9 — slot 5 (DELEGATE_REQUEST) now reports chose: <tier model> inside worker re-entry and not_applicable: "not a delegation re-entry" elsewhere. ✓
- routing-engine.md §5.6 / pattern-store.md — slot 4 defers with reason="delegate_request_in_flight" when worker_tier_model is set. ✓ (delegation.md §11)
- canonical-message-format.md §9.1 — Session record gains parent_session_id / parent_tool_use_id / is_worker; nullable, no migration on existing rows. ⏳ canonical-format spec to be updated when next opened.
- tool-dispatcher.md — ToolContext gains worker_spawner + is_worker; dispatch() accepts them; confirmation-handler flow gets is_worker. ⏳ tool-dispatcher spec to be updated when next opened.
- event-bus-and-trace-catalog.md §6.3 — LLMCallStarted / LLMCallCompleted / TurnCompleted gain parent_session_id; Actor.WORKER now fires on worker emissions per §4.1. ✓
- analytics-api.md §4.1 — group_by enum gains parent_session and is_worker; include_workers query parameter added. ⏳ analytics-api spec to be updated when next opened.
- streaming-protocol.md §7 — include_worker_sessions filter remains accepted-but-unused; no worker streaming in v1 MVP. ✓
Status: verified. 17 new tests under packages/metis-core/tests/workers/test_delegation.py (14 — tool visibility filtering for can_delegate / can’t-delegate planners and worker sessions, end-to-end planner→delegate→worker→planner loop with scripted adapter, worker LLM events stamp parent_session_id, worker turn.completed stamps parent_session_id, slot 5 fires inside worker chain, slot 4 defers inside worker chain, recursive delegation refused with ToolExecutionError, no_model_available_for_tier returns delegate.failed, worker Session record carries is_worker + parent fields, worker uses parent’s workspace, dispatcher reused but per-session id maps isolated, top-level chain unchanged when delegation unused) and packages/metis-core/tests/analytics/test_store.py (3 — group_by=parent_session rolls workers under planner, group_by=is_worker partitions, include_workers=False excludes worker rows). Suite total: 1405 passed (1388 baseline + 17 new). Ruff clean.
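
The per-session tool filtering described in item (g) reduces to two rules, sketched here over plain tool names. The function `effective_tool_names` and the `read_file` tool are hypothetical; the shipped method works on tool definitions rather than strings.

```python
from typing import List

WORKER_HIDDEN_TOOLS = {"delegate", "memory_add", "memory_replace", "memory_consolidate"}


def effective_tool_names(all_tools: List[str], *, is_worker: bool,
                         model_can_delegate: bool) -> List[str]:
    if is_worker:
        # Workers cannot delegate further and cannot write durable memory.
        return [t for t in all_tools if t not in WORKER_HIDDEN_TOOLS]
    if not model_can_delegate:
        # Top-level sessions only see delegate when the active model opts in.
        return [t for t in all_tools if t != "delegate"]
    return list(all_tools)


tools = ["read_file", "delegate", "memory_add"]
print(effective_tool_names(tools, is_worker=True, model_can_delegate=True))
# ['read_file']
```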

2026-05-15 — api-versioning.md v1 (new spec; lightweight middleware shipped on both apps)

Specs: api-versioning.md (new — drafted v1). Adds the Metis-API-Version header contract for Metis-owned endpoints; pins provider-shape paths (/v1/chat/completions, /v1/messages) as frozen by upstream SDK contracts. Updates CHANGES.md specs-in-scope + cross-reference map. Updates docs/gateway-client-quickstart.md with a §8 “Pinning a Metis API version” subsection so buyers can opt in.
Change: Two surface categories distinguished. (1) Provider-shape (frozen) — /v1/chat/completions and /v1/messages are versioned by OpenAI / Anthropic respectively; Metis doesn’t get a vote and the middleware passes them through untouched (no Metis-API-Version request read, no response stamp). (2) Metis-owned (versioned by us) — every other route on the gateway and the agent server (/healthz, /health, /server/version, /sessions/*, /analytics/*, /models, future Metis-specific surfaces). Metis-owned endpoints accept an optional Metis-API-Version request header (default CURRENT_VERSION = "1.0") and stamp the resolved version on every response. Deprecation policy: when a Metis-owned endpoint changes breakingly, the old version is supported for ≥6 months with Deprecation: true + Sunset: <ISO date> headers per RFC 8594 (with the simplified ISO-date profile documented in §3). Semver discipline: minor for additive (new fields, new endpoints, looser validation), major for breaking (removed fields, semantic changes, stricter validation). Currently Metis-API-Version: 1.0; no version-dispatch logic in v1 — the scaffolding lets later majors land without churning callers. Implementation: pure ASGI middleware (not BaseHTTPMiddleware, which would buffer SSE / WebSocket bodies) in apps/gateway/src/metis_gateway/middleware_versioning.py and apps/server/src/metis_server/middleware_versioning.py; near-identical files since the two apps are independent siblings. Both files expose CURRENT_VERSION, MIN_SUPPORTED_VERSION, DEPRECATED_VERSIONS (empty in v1), DEFAULT_BELOW_MIN_SUNSET = "2026-11-15", and resolve_version(requested) -> (resolved, is_deprecated, sunset_iso). Wired via Starlette(..., middleware=[Middleware(VersioningMiddleware)]) in both apps/gateway/.../app.py and apps/server/.../app.py; the gateway’s middleware defaults to skipping PROVIDER_SHAPE_PREFIXES, the server’s defaults to no skip set since it has no provider-shape surface. 
A version-below-MIN_SUPPORTED_VERSION is served (not rejected) with a logged warning so operators can see who is still pinned before removal. Future revs may add a 400 unsupported_version once telemetry shows buyers upgrade promptly enough.
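
To show why a pure ASGI middleware avoids buffering, here is a minimal sketch of the stamping behavior. It wraps `send` and touches only the `http.response.start` message, so body chunks pass straight through. The resolution step is simplified to "echo the requested value or fall back to the default", and the `demo_app` / `call` helpers are hypothetical test scaffolding.

```python
import asyncio

CURRENT_VERSION = "1.0"
PROVIDER_SHAPE_PREFIXES = ("/v1/chat/completions", "/v1/messages")
HEADER = b"metis-api-version"  # ASGI header names are lower-cased bytes


class VersioningMiddleware:
    def __init__(self, app, skip_prefixes=PROVIDER_SHAPE_PREFIXES):
        self.app = app
        self.skip_prefixes = tuple(skip_prefixes)

    async def __call__(self, scope, receive, send):
        if scope["type"] != "http" or scope["path"].startswith(self.skip_prefixes):
            await self.app(scope, receive, send)  # provider-shape: untouched
            return
        requested = dict(scope.get("headers", [])).get(HEADER)
        resolved = requested or CURRENT_VERSION.encode("ascii")

        async def send_with_version(message):
            if message["type"] == "http.response.start":
                headers = list(message.get("headers", []))
                headers.append((HEADER, resolved))
                message = {**message, "headers": headers}
            await send(message)  # body messages are forwarded as-is

        await self.app(scope, receive, send_with_version)


async def demo_app(scope, receive, send):
    await send({"type": "http.response.start", "status": 200, "headers": []})
    await send({"type": "http.response.body", "body": b"ok"})


async def call(path, headers=()):
    sent = []

    async def receive():
        return {"type": "http.request", "body": b"", "more_body": False}

    async def send(message):
        sent.append(message)

    scope = {"type": "http", "path": path, "headers": list(headers)}
    await VersioningMiddleware(demo_app)(scope, receive, send)
    return sent[0]["headers"]


print(asyncio.run(call("/healthz")))              # [(b'metis-api-version', b'1.0')]
print(asyncio.run(call("/v1/chat/completions")))  # []
```
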
Type: additive. (1) No buyer-facing breaks — clients that don’t send Metis-API-Version resolve to the current version transparently. (2) Provider-shape endpoints are unchanged in shape, headers, and routing (the middleware skips them entirely; auth-failure responses on those paths also don’t gain the header — guarded by test_provider_shape_auth_failure_still_skips_versioning). (3) Existing analytics / health / sessions handlers are unchanged at the route level — the middleware operates above them. (4) Two new public modules (metis_gateway.middleware_versioning, metis_server.middleware_versioning); no changes to existing public APIs.
References to verify:
- gateway.md §3.1 (provider-shape endpoints) — unchanged; api-versioning.md §1.1 cross-references this as the frozen surface. ✓
- gateway.md §3 (overall surface table) — /healthz is now documented as versioned per api-versioning.md §1.2. No edit required to gateway.md (the spec is the cross-cutting concern, not a gateway-specific addition). ✓
- analytics-api.md §3.2 (response envelope) — the envelope’s current_pricing_version is orthogonal to the new Metis-API-Version header; the former is a per-row pricing concern, the latter a transport-level versioning concern. No edit required. ✓
- server-api.md (planned) — when this spec lands it should reference api-versioning.md §1.2 as the versioning posture for the routes it documents. ⏳
- event-bus-and-trace-catalog.md — no new event types (api-versioning.md §5 invariant: versioning is a transport concern, not an audited operation). ✓
- KNOWN_ISSUES.md — no entry needed; this is preventive scaffolding, not a fix. ✓
Status: verified. 20 new tests under apps/gateway/tests/test_middleware_versioning.py (10 cases — resolve_version unit tests, header round-trip, default when absent, below-min stamps Deprecation + Sunset, explicitly-deprecated stamps mapped sunset, both provider-shape paths skip versioning entirely, provider-shape auth-failure still skips) and apps/server/tests/test_versioning_middleware.py (10 cases — resolve_version unit tests, header round-trip on /health and /analytics/cost, default when absent, below-min stamps Deprecation + Sunset, explicitly-deprecated stamps mapped sunset). Suite total: 1361 passed (was 1323 baseline before this change; the delta includes a few previously-shadowed tests that resurfaced after a stale __pycache__ cleanup). Ruff clean.

2026-05-15 — gateway.md §11 key lifecycle (revoke / rotate / list + audit events; Wave 10)

Specs: gateway.md (new §11 “Key lifecycle (Wave 10)”, §12 follow-ons renumbered, §13 references renumbered); event-bus-and-trace-catalog.md (new §6.13 “Gateway admin domain” + three pseudonymous-floor event types); docs/gateway-deployment.md “Key management” subsection rewritten with revoke-key / rotate-key / list-keys recipes; “Keystore rotation” subsection in the Production checklist points to the new path.
Change: Closes the v1 “no online revocation or rotation” gap noted in gateway.md §11. (a) GatewayKey (apps/gateway/src/metis_gateway/auth.py) gains status: Literal["active", "revoked"] = "active", revoked_at: datetime | None = None, and grace_period_until: datetime | None = None; loader is back-compat (missing fields default to "active" / None). Keystore.from_dict rejects a status="revoked" record without revoked_at. GatewayKey adds is_active(now) + effective_revoked_at(now) methods that read the grace-period boundary as read-only — auth never writes the keystore. (b) New module apps/gateway/src/metis_gateway/keystore_admin.py exposes revoke_key, rotate_key, list_keys, sweep_expired_grace_periods, plus CLI shims (revoke_key_command / rotate_key_command / list_keys_command) and parse_duration("30m"|"24h"|"7d"|"2w"). All mutating ops do atomic write-temp-then-rename (os.replace) so a running gateway never observes a partial keystore. (c) issue_key.py now also writes atomically (via the shared atomic_write_keystore) and emits a gateway.key_issued audit event when a db_path is supplied. (d) Three new typed event payloads in packages/metis-core/src/metis_core/events/payloads.py — GatewayKeyIssued, GatewayKeyRevoked (reason: Literal["admin_revoke", "grace_period_expired", "rotated"]), GatewayKeyRotated; all registered in PAYLOAD_REGISTRY with Sensitivity.PSEUDONYMOUS. Emission is best-effort — failures don’t roll back the keystore mutation. (e) Auth middleware in apps/gateway/src/metis_gateway/app.py checks key.is_active(now=...) after the keystore lookup and returns the documented 401 body {"error": {"code": "key_revoked", "key_id": "...", "revoked_at": "...", "type": "invalid_request_error"|"authentication_error", "message": "..."}} before any harness / routing call. Shape-specific type matches the existing OpenAI vs Anthropic envelopes. 
(f) metis-cli (apps/cli/src/metis_cli/main.py) gains three subcommands: metis gateway revoke-key <key_id>, metis gateway rotate-key <key_id> [--grace-period <duration>], metis gateway list-keys [--format text|json]; issue-key gains an optional --db-path that wires the audit-event emission target (defaults to ~/.metis/metis.db). Rotation default grace period: 24h.
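
The write-temp-then-rename pattern from item (b) looks roughly like this. It is a sketch of the general technique using `os.replace`, not the shipped `atomic_write_keystore`; the helper name `atomic_write_json` and the temp-file naming are assumptions.

```python
import json
import os
import tempfile


def atomic_write_json(path: str, payload: dict) -> None:
    """Write payload so readers see either the old file or the new one, never a partial."""
    directory = os.path.dirname(os.path.abspath(path))
    # The temp file must live on the same filesystem for the rename to be atomic.
    fd, tmp_path = tempfile.mkstemp(dir=directory, prefix=".keystore-", suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before the swap
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)  # leave no stray temp file behind on failure
        raise
```

A running reader that opens the keystore mid-update gets the previous complete file, because the new content only becomes visible at the rename.
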
Type: additive. (1) Pre-Wave-10 keystores load cleanly (missing status → "active"; missing revoked_at / grace_period_until → None). (2) GatewayKey constructors that omit the new fields compile and behave identically to v1. (3) Keystore.authenticate still returns revoked keys (auth needs the key_id to render the key_revoked body); the is_active filter is the middleware’s job. (4) issue_key() gains optional kwargs (now, db_path) — existing callers compile unchanged. (5) Three new event types in the catalog — existing consumers that don’t subscribe to them see no change. (6) The HTTP 401 body shape for code="invalid_api_key" is unchanged for unknown bearers; the new code="key_revoked" shape is documented in gateway.md §11.2 and only fires for keys whose is_active returns False.
References to verify:
- gateway.md §3.3 / §11 / §13 — keystore record table extended; new §11 captures the full surface (CLI ops, 401 body, audit-event contract, non-goals). ✓
- event-bus-and-trace-catalog.md §6.13 — three new pseudonymous event types; matches the PAYLOAD_REGISTRY entries. ✓
- analytics-api.md §4.1 / §4.8 — gateway-admin events use the same gateway_key_id projection the cost endpoint already reads; no schema change. ✓
- multi-user.md §3 / §4 — rotation preserves the user_id / team_id tags so per-identity rollups (/analytics/by_user / /analytics/by_team) reflect the migration without re-tagging the successor. ✓
- docs/gateway-deployment.md — Key management subsection rewritten; Keystore rotation subsection in Production checklist points to the new path. ✓
- KNOWN_ISSUES.md — “no online revocation API in v1” gap closed by this change. ⏳ Update entry when next opened.
Status: verified. New tests (27 cases): apps/gateway/tests/test_keystore_admin.py (25 — revoke marks status / stamps revoked_at + audit emission + idempotency, unknown key, rotate inherits metadata + emits link event with old→new + identity tags, default vs custom grace, refuses revoked predecessor / zero-or-negative grace, both keys active during grace window, predecessor auto-revokes at boundary, sweep_expired_grace_periods persists transition + emits paired key_revoked event with reason="grace_period_expired", sweep idempotent, list-keys shape stable across rotation, list-keys empty keystore, list-keys text + JSON output formats, parse_duration variants, atomic write leaves no partial temp file, audit-event payload metadata sanity, pre-Wave-10 back-compat); apps/gateway/tests/test_app_http.py (+2 — 401 key_revoked body on both inbound endpoints). apps/gateway/tests/conftest.py adds revoked_runtime / revoked_client fixtures. Full gateway suite: 144 → 171 cases. Full project suite passes at 1383 cases (excluding the pre-existing test_subscriber.py::test_drain_processes_eval_completed_cascade_before_returning flaky/hung test, unrelated to this change). Ruff clean across packages/, apps/, scripts/.
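
The write-temp-then-rename step this entry relies on can be sketched as follows. This is a minimal illustration of the technique, not the shipped `atomic_write_keystore`; the function name `atomic_write_json` and the JSON layout are assumptions.

```python
import json
import os
import tempfile
from pathlib import Path


def atomic_write_json(path: Path, payload: dict) -> None:
    """Write payload so readers observe either the old file or the new one."""
    # The temp file lives in the target directory so os.replace stays on one
    # filesystem; a cross-device rename would not be atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            json.dump(payload, handle, indent=2, sort_keys=True)
            handle.flush()
            os.fsync(handle.fileno())  # make the bytes durable before the swap
        os.replace(tmp_name, path)  # atomic swap over any existing file
    except BaseException:
        # Leave no partial temp file behind on failure.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A reader that opens the keystore at any moment sees a complete document, which is the property the "never observes a partial keystore" claim depends on.
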

2026-05-15 — event-bus-and-trace-catalog.md §7.5 (trace-DB backup & restore contract)

Specs: event-bus-and-trace-catalog.md §7.5 (new — backup & restore contract under “Persistence”). docs/gateway-deployment.md gains a “Backup & restore” subsection under “Production checklist” with the buyer-facing recipe (cron, rotation, restore drill, helm/PVC volume-snapshot composition).
Change: Ships buyer-runnable backup + restore for the trace DB so helm-chart buyers can snapshot before a risky upgrade and restore on failure without the WAL pitfalls of a naive cp. New module packages/metis-core/src/metis_core/trace/backup.py exposes backup(source_db, dest) -> BackupResult (uses SQLite’s VACUUM INTO — atomic, WAL-safe, single-file output; source DB stays open and writable) and restore(source, dest_db, *, allow_overwrite=False) -> RestoreResult (schema-version checked via PRAGMA user_version, refuses to clobber by default, refuses if -wal / -shm companions sit alongside the source backup). packages/metis-core/src/metis_core/trace/store.py gains a TRACE_SCHEMA_VERSION = 1 constant and stamps PRAGMA user_version on every opened trace DB so the backup module has a stable version handle. Two new metis CLI subcommands (apps/cli/src/metis_cli/main.py, apps/cli/src/metis_cli/backup.py): metis backup <dest> [--db-path <source>] and metis restore <source> [--db-path <dest>] [--force]. Both emit a deterministic human-readable metadata block on success (source / dest / byte count / schema version / event count / oldest+newest event timestamps; no random ids) and a one-line diagnostic to stderr with non-zero exit on failure.
Type: additive. (1) Existing trace DBs without user_version stamped get bumped to 1 the next time they’re opened by TraceStore._configure — read paths are unaffected. (2) No new event types, no payload changes, no catalog edits beyond §7.5. (3) The CLI gains two top-level subcommands; existing chat / tui / serve / evaluate / gateway flows are untouched. (4) Helm chart / docker compose surfaces are unchanged — backup/restore are operator commands run against the same SQLite file the gateway and serve already write.
References to verify:
- event-bus-and-trace-catalog.md §7.1 — schema declaration unchanged; new §7.5 stamps user_version from §7.1’s schema-version constant. ✓
- event-bus-and-trace-catalog.md §7.2 — storage notes (WAL + synchronous=NORMAL) preserved; backup module opens the source read-only via URI and uses VACUUM INTO which composes cleanly with WAL mode. ✓
- event-bus-and-trace-catalog.md §7.3 — retention is orthogonal to backup; pruning before a backup is fine, the backup just captures the post-prune state. ✓
- gateway.md §3.2 (loopback-only bind, TLS terminator in front) — backup/restore is a sidecar/operator command, not a network surface. No bind-policy change. ✓
- deployment-shape.md — backup recipe is the missing piece for the “buyer-trial floor” (close-the-loop on data safety before they commit). ✓
- analytics-api.md — backups capture the full events table; /analytics/* reads against a restored DB are identical to the pre-backup numbers (no analytics-side schema change). ✓
Status: verified. 18 new tests (13 library + 5 CLI): round-trip (write → backup → restore → events match), empty-DB backup, schema-version mismatch refusal, default-overwrite-refusal + --force opt-in, WAL-companion refusal, hot backup with source still open, missing-source error paths, 100k-event backup completes in well under 5s on a developer laptop. Library tests in packages/metis-core/tests/trace/test_backup.py; CLI tests in apps/cli/tests/test_backup_cli.py.
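
A minimal sketch of the `VACUUM INTO` backup and the restore-side checks, using only the standard-library `sqlite3` module. The helper names (`backup_db`, `check_restorable`, `user_version`) are hypothetical and simpler than the shipped `backup()` / `restore()`; unlike the module described above, the sketch opens the source with a normal connection rather than a read-only URI.

```python
import sqlite3
from pathlib import Path

SCHEMA_VERSION = 1  # same value the entry gives for TRACE_SCHEMA_VERSION


def user_version(db: Path) -> int:
    """Read the schema-version stamp stored in the database header."""
    conn = sqlite3.connect(db)
    try:
        return conn.execute("PRAGMA user_version").fetchone()[0]
    finally:
        conn.close()


def backup_db(source_db: Path, dest: Path) -> None:
    """Write a consistent single-file snapshot of source_db to dest."""
    if dest.exists():
        raise FileExistsError(dest)
    conn = sqlite3.connect(source_db)
    try:
        # VACUUM INTO copies a consistent snapshot; other connections stay usable.
        conn.execute("VACUUM INTO ?", (str(dest),))
    finally:
        conn.close()


def check_restorable(backup: Path) -> None:
    """Refuse backups with WAL companions or an unexpected schema version."""
    for suffix in ("-wal", "-shm"):
        if Path(str(backup) + suffix).exists():
            raise RuntimeError(f"{backup}{suffix} present; not a clean single file")
    found = user_version(backup)
    if found != SCHEMA_VERSION:
        raise RuntimeError(f"schema version {found} != expected {SCHEMA_VERSION}")
```

Refusing to overwrite by default mirrors the `allow_overwrite=False` behavior on the restore side; a real restore would copy the verified file into place after these checks.
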

2026-05-14 — pattern-store.md §16 (v2 hybrid fingerprint shipped; Wave 10)

Specs: pattern-store.md §16 status flipped from “drafted” to “implemented” (header revised); new §16.13 “Implementation notes (Wave 10)” + §16.14 “Migration: upgrading a v1 workspace to v2” subsections added inside §16.
Change: Ships the v2 hybrid fingerprint contract described in §16. New module packages/metis-core/src/metis_core/patterns/embeddings.py defines a @runtime_checkable EmbeddingProvider Protocol and three concrete providers (OpenAIEmbeddingProvider → text-embedding-3-small, 1536-dim; CohereEmbeddingProvider → embed-multilingual-v3.0, 1024-dim, via raw httpx; LocalEmbeddingProvider → sentence-transformers all-MiniLM-L6-v2, 384-dim, deferred Torch import) plus a DeterministicEmbeddingProvider for tests/fixtures and a resolve_embedding_provider(provider_id) registry. PatternStore (patterns/store.py) gains a new embedding_cache(text_sha256, provider_id, embedding_blob, embedding_dim, created_at_us, last_used_at_us, use_count) table — keyed (provider_id, SHA-256(user_message_text)) per §16.4.1, vector blobs packed array.array('f', ...).tobytes() (no NumPy dep), bounded by embedding_cache_max_rows=10_000 + embedding_cache_max_age_days=180 with age-first → LRU → use-count tie-break eviction (§16.4.3). find_k_nearest consumes blended similarity when the query carries an embedding; mixed-version K-NN falls back to v1 weighted-Jaccard when either side lacks an embedding or the dims disagree (§16.5.3). Schema_version bumps "1" → "2" via WHERE store_meta.value < excluded.value so a v1 process opening a v2 db never downgrades; the catalog spec already had pattern.recorded.fingerprint_kind so no new event types. patterns/similarity.py adds cosine_similarity(a, b) (raises on dim mismatch / empty) and blended_similarity(a, b, *, a_embedding, b_embedding, alpha) (alpha out of [0, 1] raises); v1 weighted_jaccard is unchanged and reused as the structural half. patterns/fingerprint.py FingerprintInputs gains embedding: tuple[float, ...] 
| None = None + embedding_provider: str | None = None; compute_fingerprint produces a HYBRID Fingerprint when the embedding is set; new attach_embedding_for_recording(inputs, *, store, embedder) async helper does the cache-first / embed-on-miss / cache-write loop for the recording path; text_sha256(text) helper exposes the cache pre-image. routing/policy.py PatternConfig gains fingerprint_version: Literal["v1", "v2"] = "v1" + embedding_provider: str | None = None + embedding_alpha: float = 0.6 with __post_init__ validation (v2 requires embedding_provider; embedding_alpha must be in [0, 1]). routing/engine.py slot 4 in v2 mode does a sync cache-only lookup via _attach_cached_embedding before computing the query fingerprint — cache hit → blended K-NN; cache miss → v1 jaccard. The routing critical path never blocks on a network call (§16.6 trade-off).
Type: additive. (1) v1 default behavior is unchanged — PatternConfig() returns fingerprint_version="v1"; structural-only path runs identically against existing v1 patterns dbs. (2) v1 patterns dbs reopen under v2 mode cleanly; schema_version bumps in-place; no rows touched in fingerprints / outcomes / outcome_score_history / store_meta. (3) PatternStore.__init__ gains optional kwargs (fingerprint_version, embedding_alpha, embedding_cache_max_rows, embedding_cache_max_age_days); all default to v1-compatible values. (4) FingerprintInputs gains two optional fields with None defaults; existing constructors compile unchanged. (5) compute_fingerprint(inputs) signature unchanged; v1 callers continue to get STRUCTURAL fingerprints. (6) No new event types — pattern.recorded.fingerprint_kind already discriminates "structural" vs "hybrid". (7) The shipped impl differs from the original spec in four documented ways recorded in §16.13: embedding_alpha rename (was embedding_blend_alpha), no embedding_strategy knob (effectively always-async at query layer because routing-engine lookup is cache-only-sync; recording is async via attach_embedding_for_recording), sync recommend() preserved, no NumPy hard dep.
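
The never-downgrade schema bump can be illustrated with a guarded SQLite upsert. This is a sketch under assumptions: the `store_meta(key, value)` column layout and the `stamp_schema_version` helper are hypothetical; only the `WHERE store_meta.value < excluded.value` guard comes from the entry.

```python
import sqlite3

BUMP_SQL = """
INSERT INTO store_meta (key, value) VALUES ('schema_version', ?)
ON CONFLICT(key) DO UPDATE SET value = excluded.value
WHERE store_meta.value < excluded.value
"""


def stamp_schema_version(conn: sqlite3.Connection, version: str) -> str:
    """Raise the stored schema_version to `version`, never lowering it."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS store_meta "
        "(key TEXT PRIMARY KEY, value TEXT NOT NULL)"
    )
    # The WHERE guard turns the update into a no-op when the stored value is
    # already equal or higher. Values compare as text, so this ordering only
    # holds for single-digit versions such as "1" and "2".
    conn.execute(BUMP_SQL, (version,))
    conn.commit()
    row = conn.execute(
        "SELECT value FROM store_meta WHERE key = 'schema_version'"
    ).fetchone()
    return row[0]
```

With this guard, a v1 process that opens a database already stamped "2" leaves the stamp untouched.
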
References to verify:
- pattern-store.md §5 (v1 fingerprint) — unchanged. ✓
- pattern-store.md §5.3 (v1 weighted Jaccard) — unchanged; reused as the structural half of the v2 blend. ✓
- pattern-store.md §16 — status flipped; §16.13 / §16.14 added; deviations recorded in §16.13. ✓
- routing-engine.md §5.5 — slot-4 K-NN math; v2 introduces no new routing.yaml keys outside the existing pattern.* namespace. The engine’s v2 cache-lookup path is internal to _evaluate_pattern. ⏳ Optional follow-up to mention the v2 sync-cache path in routing-engine.md §5.5.
- event-bus-and-trace-catalog.md §6.5b — pattern.recorded.fingerprint_kind already discriminates "structural" / "hybrid". No catalog change. ✓
- the project strategy (private) / §6.2 — third differentiator + self-hosting buyer profile; v2 ships local:sentence-transformers:all-MiniLM-L6-v2 as the buyer-friendly path. ✓
- benchmarks/RESULTS.md §A3-rev3 — v1 differentiator inverted under min_confidence=0.05; v2 is the implementation-ready alternative for workspaces whose structural Jaccard washes out (agent-loop traffic with empty intent_tags). Cluster-tightening A/B (§16.10 test 5 against the 60-turn fixture) is deferred to a follow-up wave. ⏳
Status: verified for the additive scope. 52 new tests under packages/metis-core/tests/patterns/ (test_embeddings.py, test_v2_similarity.py, test_v2_store.py, test_v2_routing.py) cover the Protocol contract + runtime_checkable rejection, cosine/blend math (α=0 reduces to v1, α=1 to cosine, headline α=0.6 cases), cache hit/miss/store/clear + TTL eviction + LRU eviction with use-count tie-break, provider-id-segregated cache keys, mixed-version K-NN (v1 row + v2 query), schema bump verification, recording-path cache-first embed (zero API calls on second hit), v1 db reopening cleanly under v2 mode, routing slot-4 v2 code path with cache-hit ranking sonnet above haiku on aligned embedding, routing fallback to v1 jaccard on cache miss. Suite total: 1322 (was 1270 baseline). Ruff clean.
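
A minimal sketch of two cache mechanics named in this entry: packing vectors with `array.array('f', ...)` and the age-first, then LRU, then use-count eviction order. `CacheRow` and `rows_to_evict` are hypothetical names, the sketch keys rows by hash only (ignoring `provider_id`), and measuring age from `created_at_us` is an assumption.

```python
from array import array
from dataclasses import dataclass

US_PER_DAY = 86_400 * 1_000_000


def pack_embedding(vector) -> bytes:
    """Pack floats as 32-bit values without a NumPy dependency."""
    return array("f", vector).tobytes()


def unpack_embedding(blob: bytes, dim: int) -> tuple[float, ...]:
    values = array("f")
    values.frombytes(blob)
    if len(values) != dim:
        raise ValueError(f"expected {dim} dims, found {len(values)}")
    return tuple(values)


@dataclass(frozen=True)
class CacheRow:
    text_sha256: str
    created_at_us: int
    last_used_at_us: int
    use_count: int


def rows_to_evict(rows, *, now_us: int, max_rows: int, max_age_days: int) -> set[str]:
    """Pick hashes to evict: expired rows first, then LRU overflow."""
    cutoff = now_us - max_age_days * US_PER_DAY
    expired = {r.text_sha256 for r in rows if r.created_at_us < cutoff}
    survivors = [r for r in rows if r.text_sha256 not in expired]
    overflow = max(0, len(survivors) - max_rows)
    # Least recently used goes first; lower use_count breaks ties.
    survivors.sort(key=lambda r: (r.last_used_at_us, r.use_count))
    return expired | {r.text_sha256 for r in survivors[:overflow]}
```

Storing `embedding_dim` next to the blob lets a reader reject a truncated or mismatched vector before it reaches the similarity math.
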

2026-05-14 — event-bus-and-trace-catalog.md §3 `EventBus.drain()` loops to quiescent (closes the §A3-rev3 outcome-update bug)

Specs: event-bus-and-trace-catalog.md §3 (drain semantics). No code-visible API change, but the post-condition is strengthened.
Change: EventBus.drain() (packages/metis-core/src/metis_core/events/bus.py:182) now loops until both the queue is empty and no handler tasks are in flight, instead of awaiting a single queue.join + one gather. Python 3.13’s asyncio.Queue.join returns the first time unfinished_tasks drops to zero; handler tasks scheduled before that point may not have run yet when join returns, and the events they then emit are still in flight when callers expect drain to be complete. The cascade that exposed this: turn.completed → pattern subscriber records outcome → evaluator emits eval.completed → pattern subscriber writes the score back via update_score. With the single-pass drain, shutdown_runtime in the agent loop (which detaches subscribers immediately after drain()) raced the cascading eval.completed and dropped the score, leaving success_score_count = 0 on outcome rows for 1-turn workloads with multiple tool calls (the §A3-rev3 caveat: architectural-explanation-without-hallucination).
Type: additive (correctness fix). Existing callers see the same await bus.drain() signature; the post-condition strengthens from “first wave of in-flight handlers done” to “bus is fully quiescent.” No bus event types added, no payload changes, no subscription contract changes.
References to verify:
- event-bus-and-trace-catalog.md §3 — drain post-condition: “When drain() returns, the queue is empty and no handler tasks are in flight.” Stronger than the prior implicit contract. ✓ (Regression test in packages/metis-core/tests/patterns/test_subscriber.py::test_drain_processes_eval_completed_cascade_before_returning pins this.)
- pattern-store.md §15.3 — outcomes are recorded asynchronously off the fast event path; this fix is precisely what guarantees the eval.completed → update_score cascade lands before subscribers detach. ✓
- evaluator.md §6.1 — subscriber is non-fast-path; cascading emits flow through the bus dispatch loop and were the source of the dropped scores. ✓ (No change to the evaluator’s emission shape.)
- benchmarks/RESULTS.md §A3-rev3 caveats — the architectural-explanation-without-hallucination row that recorded success_score_count=0 across all three passes. Re-running that workload with the fix produces success_score_count=1, success_score_mean=1.0 on the outcome row. Caveats text remains accurate as a record of the prior state; the bug is now closed.
Status: shipped — implementation + regression test live, test count 1270 → 1271, ruff clean on changed files.
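
The loop-to-quiescence idea can be sketched with a toy bus. `MiniBus` is a hypothetical illustration, not the `EventBus` in `bus.py`; it assumes one dispatcher task that spawns a task per handler, which may differ from the real dispatch loop.

```python
import asyncio


class MiniBus:
    """Toy bus: one dispatcher task, one asyncio task per handler call."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._handlers: dict = {}
        self._tasks: set = set()
        self._dispatcher = None

    def subscribe(self, event_type, handler) -> None:
        self._handlers.setdefault(event_type, []).append(handler)

    def emit(self, event_type, payload=None) -> None:
        self._queue.put_nowait((event_type, payload))

    def start(self) -> None:
        self._dispatcher = asyncio.create_task(self._dispatch())

    async def _dispatch(self) -> None:
        while True:
            event_type, payload = await self._queue.get()
            for handler in self._handlers.get(event_type, []):
                task = asyncio.create_task(handler(payload))
                self._tasks.add(task)
                task.add_done_callback(self._tasks.discard)
            self._queue.task_done()

    async def drain(self) -> None:
        # Loop until the queue is empty AND no handler task is in flight.
        # A single join + gather would miss events emitted by those handlers.
        while True:
            await self._queue.join()
            in_flight = list(self._tasks)
            if not in_flight:
                return
            await asyncio.gather(*in_flight, return_exceptions=True)

    async def stop(self) -> None:
        self._dispatcher.cancel()
        try:
            await self._dispatcher
        except asyncio.CancelledError:
            pass


async def demo() -> list:
    bus, scores = MiniBus(), []

    async def on_turn(_payload):
        await asyncio.sleep(0)
        bus.emit("eval.completed", 1.0)  # cascading emit from inside a handler

    async def on_eval(score):
        await asyncio.sleep(0)
        scores.append(score)

    bus.subscribe("turn.completed", on_turn)
    bus.subscribe("eval.completed", on_eval)
    bus.start()
    bus.emit("turn.completed")
    await bus.drain()
    await bus.stop()
    return scores
```

Running `asyncio.run(demo())` exercises the same shape of cascade as the one in this entry: the score written by the second handler is present by the time `drain()` returns.
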

2026-05-14 — skill-curator.md v1 (new spec; gated on agent-authored skills Phase 2.5)

Specs: skill-curator.md (new — drafted v1). No code changes; pure spec. Additive references to event-bus-and-trace-catalog.md §6.6 (one new value "curator_generated" on skill.created.source) and analytics-api.md (one new optional include_curator query param + a new /analytics/curator endpoint). Updates CHANGES.md specs-in-scope + cross-reference map. Updates AGENTS.md “What’s NOT built” to point to this spec.
Change: Lifts the curator pattern from hermes-agent (agent/curator.py) and adapts it to Metis’s primitives. Periodic auxiliary-model maintenance of agent-authored skills only. Six actions (pin / unpin / archive / restore / consolidate / edit); never auto-deletes (archive is mv to a sibling skills-archive/ root, reversible). Pinned skills bypass every auto-transition. Inactivity-triggered at session.ended (no daemon); explicit metis curate <workspace> CLI for power users. Shared BudgetTracker with the evaluator with independent caps (curator.per_run_max_usd: Decimal("0.50"), curator.per_day_max_usd: Decimal("1.00")). One new bus event skill.curated (USER_CONTROLLED floor with signals.rationale_redacted downgrade), plus two run-boundary events curator.run_started / curator.run_finished (PSEUDONYMOUS). Sidecar JSON state at ~/.metis/curator/state.json and <workspace>/.metis/curator/state.json carries pin / archive / origin / lineage — no SKILL.md frontmatter changes (preserves agentskills.io conformance per the AGENTS.md memory pin “conform; don’t invent fields”). Curator-touchable origin matrix (§3) restricts mutation authority to auto_generated and curator_generated skills; manual / imported / no-skill.created-event are read-only. Cluster consolidation uses substring-overlap heuristic in v1 (name_overlap >= 0.6 OR description_overlap >= 0.7) plus an auxiliary-model confirmation call per cluster; embedding-based clustering deferred to v2 alongside pattern-store.md §16. Implementation gated on Phase 2.5 skill.created(source="auto_generated") landing first (the curator only acts on skills with that event in the trace).
Type: additive. (1) skill-format.md is unchanged — the curator runs on top of the shipped SkillStore / load_skills substrate without modifying either. (2) event-bus-and-trace-catalog.md §6.6 gains one enum value ("curator_generated" on skill.created.source); existing consumers that pattern-match the enum need to handle the new value or be tolerant. The catalog spec edit lands when the curator implementation lands (deferred — this CHANGES.md entry covers the spec only). (3) Three new event types (skill.curated, curator.run_started, curator.run_finished) are introduced; their payload Structs land in events/payloads.py + PAYLOAD_REGISTRY at implementation time. (4) evaluator.md is unchanged — the curator reuses the BudgetTracker primitive without modifying the evaluator’s caps or surface. (5) analytics-api.md gains one optional query param (include_curator) and one new endpoint (/analytics/curator); both additive, the existing surface is unchanged. (6) multi-user.md is unchanged — curator is workspace-scoped, not identity-scoped; multi-user rollups bucket curator spend under null for user/team groupings (matches the pre-multi-user direct-API convention).
References to verify:
- skill-format.md §2.1 / §2.2 / §11 — the curator does not modify the loader’s invariants. Curator state lives outside the skill directory (sidecar JSON) so the loader’s hidden-directory-not-excluded gap (§11.5) is irrelevant here. ✓
- event-bus-and-trace-catalog.md §6.6 — the skill.created.source enum gains "curator_generated". The catalog edit + new event types land at implementation time, not now. ⏳ Confirm at implementation: enum bump is additive against current consumers (events/payloads.py::SkillCreated is the only registered consumer).
- evaluator.md §7 — BudgetTracker is the shared primitive. The evaluator’s caps are independent; the curator’s caps add to the workspace’s daily ceiling but do not throttle the evaluator. ✓
- analytics-api.md §3 / §4 — include_curator=true parameter and /analytics/curator endpoint follow §3 window-parameter conventions and §4 projection patterns. Schema change is additive. ⏳ Wire at implementation.
- canonical-message-format.md §6.4 — curator_cost_usd is a Decimal serialized as string, matches the Usage.cost_usd and eval.completed.judge_cost_usd conventions. ✓
- memory-store.md — sister “soft cap → eviction event, hard cap → reject the write” pattern; curator follows analogously (soft “stale” annotation, hard “archive” action). No contract change required. ✓
- pattern-store.md — orthogonal feedback loop in v1; no read or write between them. v2 cross-link deferred per §13.8. ✓
- multi-user.md §3 / §4 — curator spend buckets under null for user_id / team_id projections (matches pre-multi-user direct-API treatment). No identity-stamping on skill.curated. ✓
- AGENTS.md — “What’s NOT built” entry on skill-format loader extensions gets a pointer to this spec. ✓
Status: drafted; implementation deferred to Phase 2.5b. The implementation order is (1) Phase 2.5 agent-authored skills (skill_save tool + skill.created(source="auto_generated")), then (2) curator (this spec). The curator without (1) has nothing to act on. Update this entry to “shipped” when metis_core.skills.curator lands and the §12.1 required-tests pass.
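
The v1 consolidation pre-filter (`name_overlap >= 0.6 OR description_overlap >= 0.7`) could look roughly like the sketch below. The entry does not define the overlap measure, so the one used here (longest common substring divided by the shorter string's length) is purely an assumption, as are the `Skill` and `is_cluster_candidate` names. The auxiliary-model confirmation call per cluster is not modeled.

```python
from dataclasses import dataclass
from difflib import SequenceMatcher

NAME_THRESHOLD = 0.6
DESCRIPTION_THRESHOLD = 0.7


@dataclass(frozen=True)
class Skill:
    name: str
    description: str


def substring_overlap(a: str, b: str) -> float:
    """Assumed measure: longest common substring over the shorter length."""
    a, b = a.lower().strip(), b.lower().strip()
    if not a or not b:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / min(len(a), len(b))


def is_cluster_candidate(left: Skill, right: Skill) -> bool:
    """Cheap pre-filter; a real curator would confirm with a model call."""
    return (
        substring_overlap(left.name, right.name) >= NAME_THRESHOLD
        or substring_overlap(left.description, right.description)
        >= DESCRIPTION_THRESHOLD
    )
```

Keeping the heuristic cheap means the budgeted auxiliary-model call is only spent on pairs that already look related.
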

2026-05-14 — multi-user.md §5 / gateway.md §6.4 ship: per-key quota caps with hard breakers, soft alerts, and `team_budget_remaining_lt` routing predicate (Wave 9a-2)

Specs: multi-user.md §1 / §5.1 / §6.1 / §6.3 (status header flipped to “shipped”); gateway.md §3.3 (monthly_cap_usd keystore field; daily_cap_usd widened to Decimal), §6.4 (new — quota.alert + gateway.quota_exceeded event types + 429 body shape), §10 (per-key rate-limit non-goal updated); event-bus-and-trace-catalog.md (additive, two new event types).
Change: Lands the second half of multi-user.md §5 against the shipped gateway. (a) GatewayKey (apps/gateway/src/metis_gateway/auth.py) gains monthly_cap_usd: Decimal | None; daily_cap_usd widens from float | None to Decimal | None. The keystore loader accepts the new field (back-compat: missing or None = no cap); legacy keystores that wrote daily_cap_usd as a JSON number coerce via Decimal(str(value)) so reload is exact. (b) metis gateway issue-key gains --monthly-cap-usd and tightens --daily-cap-usd validation (must parse as a positive number; zero/negative rejected with a deterministic message shared between CLI and keystore loader). Both caps persist to JSON as Decimal-stable strings. (c) New module apps/gateway/src/metis_gateway/quotas.py provides QuotaTracker (read-only spend aggregator over the trace store; one query per identity dimension) + QuotaStatus (used / cap / percentage snapshot) + RequestQuotaCache (per-request memoization) + enforce_quotas() (the policy loop that emits quota.alert + gateway.quota_exceeded events). (d) Two new typed event payloads in packages/metis-core/src/metis_core/events/payloads.py — QuotaAlert (severity warning@80% / critical@95%) and GatewayQuotaExceeded (scope, current_usd, limit_usd, inbound_shape, identity stamps); both registered in PAYLOAD_REGISTRY with Sensitivity.PSEUDONYMOUS. (e) apps/gateway/src/metis_gateway/app.py builds a RequestQuotaCache per request after auth, runs enforce_quotas before parsing the body, and returns the documented 429 envelope on hard-cap rejection ({"error": {"code": "quota_exceeded", "identity": ..., "scope": ..., "limit_usd": ..., "current_usd": ..., "type": "rate_limit_error", "message": ...}}). The check fires before routing/adapter invocation per multi-user.md §6.3 — no provider-side spend on a capped identity. 
(f) New routing predicate team_budget_remaining_lt: <usd> in packages/metis-core/src/metis_core/routing/policy.py + predicates.py + policy_loader.py; evaluates against TurnContext.team_budget_remaining_usd (new optional Decimal field) which the gateway harness populates from the per-request quota cache. Agent-loop traffic leaves the field None and the predicate returns False. (g) GatewayRuntime gains an optional quota_tracker: QuotaTracker | None field initialized in setup_gateway_runtime against the existing db_file; shutdown_gateway_runtime closes it.
Type: additive. (1) Existing keystores load unchanged (missing cap fields → None, no enforcement). (2) GatewayKey constructors that don’t pass cap fields compile and behave identically. (3) daily_cap_usd field type widened from float to Decimal; the only in-tree caller that constructed GatewayKey with the float field type was the keystore loader itself, updated. (4) GatewayHarness.call() / stream() signatures gain optional team_budget_remaining_usd: Decimal | None = None kwarg, defaulting to None (current behavior). (5) TurnContext gains optional team_budget_remaining_usd: Decimal | None = None; existing constructors compile unchanged. (6) Two new event types in the catalog — existing consumers that don’t subscribe to them see no change; subscribers that do see them stamped on hard-cap rejections and 80%/95% threshold crossings.
References to verify:
- multi-user.md §5 / §6.1 / §6.3 — shipped surface matches spec contract: Decimal caps, hard breaker before routing, soft alerts at 80%/95%, team_budget_remaining_lt predicate. Status header updated. ✓
- gateway.md §3.3 / §6.4 / §10 — keystore-record table, new event types + 429 body shape, per-key rate-limit non-goal updated in this change. ✓
- event-bus-and-trace-catalog.md §6 — two new pseudonymous-floor event types added to the catalog (additive; the catalog spec doesn’t enumerate every payload struct exhaustively, so no edit required there). ✓
- analytics-api.md §4.1 — quota events use the same gateway_key_id / user_id / team_id projection the cost endpoint already reads; no schema change. ✓
- routing-engine.md §5.3.2 — predicate set gains team_budget_remaining_lt; the spec lists the predicate set in §5.3 as documentation, no breaking change. ⏳
- KNOWN_ISSUES.md — gateway.md §10.5 “stores daily_cap_usd but doesn’t enforce it” gap closed by this change. ✓
Status: verified. New tests (27 cases): apps/gateway/tests/test_issue_key.py (5 — Decimal round-trip, monthly cap CLI, validation rejection, legacy float back-compat); apps/gateway/tests/test_quotas.py (12 — QuotaStatus shape, dimension filters, soft alert at warn/critical thresholds, hard breaker emits gateway.quota_exceeded, alert idempotency, no-cap no-op); apps/gateway/tests/test_app_http.py (2 — HTTP 429 with documented body, untagged keys still pass through); packages/metis-core/tests/routing/test_predicates.py (4 — predicate fires below threshold, doesn’t fire at/above, returns False without team binding, handles zero headroom); packages/metis-core/tests/routing/test_policy_loader.py (1 — yaml parser accepts team_budget_remaining_lt); packages/metis-core/tests/routing/test_engine_rules.py (2 — rule wins slot 3 when headroom below threshold, falls through when no team binding). Suite total: 1270 (was 1243 baseline).

2026-05-14 — pattern-store.md §16 (v2 hybrid fingerprint contract; Phase 4 pending §A3-rev3)

Specs: pattern-store.md (new §16 “v2 hybrid fingerprint: implementation contract”; header status updated; §5.2, §5.3, §13.1, §13.2 cross-references redirected to §16; References renumbered §16 → §17). No code changes; pure spec.
Change: Converts the §5.2 / §13.1 / §13.2 v2 sketch into an implementation-ready contract so Wave 10 can begin work if §A3-rev3 (Wave 9 candidate; PatternConfig.min_confidence: 0.3 → 0.05) fails to invert routing slot 4 under v1’s structural-only fingerprint. Specifies: (1) EmbeddingProvider Protocol (provider_id, dim, max_input_tokens, async embed, aclose) with @runtime_checkable semantics. (2) Three concrete provider impls — openai:text-embedding-3-small ($0.02/1M tokens, 1536-dim, 50–150ms), cohere:embed-multilingual-v3.0 ($0.10/1M, 1024-dim, 80–200ms), and local:sentence-transformers:all-MiniLM-L6-v2 ($0, 384-dim, 30–80ms CPU); each is selectable per workspace via PatternConfig.embedding_provider (provider_id string), with no default — unset means structural-only. (3) Embedding cache: new SQLite table embedding_cache(text_sha256, provider_id, embedding_blob, embedding_dim, created_at_us, last_used_at_us, use_count) in the same <workspace>/.metis/patterns.db, keyed by (provider_id, SHA-256(user_message_text)) — same SHA-256 pre-image as the v1 structural dedup. Bounded by cache_max_rows=10_000 and cache_max_age_days=180 (mirrors §6 outcomes-table caps); eviction is age-first then LRU then use-count tie-break; no schema migration on existing fingerprints / outcomes tables (additive table only; schema_version bumps "1" → "2"; v1 readers tolerate the bump and ignore the unknown table). (4) Blended similarity: similarity = α × cosine + (1 − α) × weighted_jaccard with default α = 0.6 (rationale: structural Jaccard is sparse on non-benchmark turns; embeddings discriminate better but structural is a load-bearing regularizer at 40% weight); workload-id near-keyed partition (§5.3, weight 0.85) still wins when both sides set workload_id; mixed-version K-NN (v1 row vs v2 row) falls back to pure structural-Jaccard per §16.5.3 so migration is forward-only and lossless. 
(5) PatternConfig.fingerprint_version: Literal["v1", "v2"] = "v1" toggle on a new PatternConfig struct that also collects the v1 routing knobs (cost_weight, min_confidence, min_sample_size, min_eval_confidence) for centralized resolution. Forward-only migration: set to "v2" in routing.yaml, restart process; new turns get hybrid fingerprints; legacy v1 rows age out under §6.3 over 180 days; downgrade is graceful (v2 rows remain readable under §16.5.3 fallback). (6) embedding_strategy: Literal["sync", "async"] knob exposes the routing-budget trade-off (sync default for agent loop; async required for gateway QPS). (7) Trade-off section (§16.9): v2 is qualitatively different from v1, not strictly cheaper — adds ~$0.000004/turn (OpenAI), ~50–200ms cache-miss latency, external API dependency, and bimodal sync-mode tail latency in exchange for cluster tightness on agent-loop traffic where v1’s intent_tags washes out. Cache hit-rate target ≥80% within 100 turns of a workload, non-load-bearing. (8) Test plan: 15 specified tests, headline being §16.10 test 5 — “intra-cluster similarity ≥ 0.10 higher AND inter-cluster ≥ 0.05 lower under v2 than v1 on a curated 60-turn fixture spanning the 6 benchmark workloads + 4 agent-loop traces” — the explicit gate for v2 paying for itself. (9) Eight open questions including α tuning range, NumPy hard-dep, per-provider tokenizers, re-embed CLI, async cancellation timeout. (10) No new event types — pattern.recorded.fingerprint_kind already discriminates "structural" vs "hybrid" per §10.1.
Type: additive. Pure spec firming; no code changes; no v1 contract changes. (1) §5.2 cross-references updated to point to §16; v1 structural-only path is unchanged. (2) §5.3 v2-blend pointer updated; v1 weighted-Jaccard formula is unchanged and is reused as the structural half of the v2 blend. (3) §13.1 + §13.2 open questions struck through and marked resolved by §16.3 and §16.7. (4) Decision log preserved at §14; new v2 decisions accreted in §16.12. (5) References section renumbered §16 → §17. (6) routing.yaml::pattern.* namespace is preserved (the v1 keys cost_weight, min_confidence, min_sample_size, min_eval_confidence are now centralized on PatternConfig; the parsing surface is unchanged).
References to verify:
- pattern-store.md §5.2 / §5.3 / §13.1 / §13.2 — updated in this change. ✓
- routing-engine.md §4.4, §5.1, §5.5 — slot-4 capability gates, routing.yaml::pattern.* resolution, K-NN scoring math. v2 introduces no new keys outside the existing pattern.* namespace; PatternConfig is the in-memory shape, not a wire-format change. The async recommend() surface change (§16.6.3) is contained in metis-core.patterns; routing’s call site stays sync at the routing-engine spec layer (the recommend() future is awaited at the boundary). ⏳ Confirm in next routing-engine sweep that the §5.5 K-NN math reads cleanly against v2’s mixed-version similarity in §16.5.3.
- event-bus-and-trace-catalog.md §6.5b — three v1 events (pattern.recorded, pattern.matched, pattern.evicted) cover v2; the fingerprint_kind discriminator is already in the catalog payload per §10.1. No catalog change required. ✓
- the project strategy (private) (third differentiator: pattern learning) — v2 is the implementation-ready fallback if v1 doesn’t invert. The project strategy (private) does not pin the fingerprint version; no edit required. ✓
- the project strategy (private) (self-hosting buyer profile) — §16.3 explicitly preserves the local-sentence-transformers option as the buyer-friendly path; no edit required. ✓
- the project strategy (private) (multi-user / team patterns) — v2’s per-workspace stance is unchanged from v1 §13.5–13.6; no edit required. ✓
- benchmarks/RESULTS.md §A3-rev2 — referenced for the failure case that motivates v2; no edit required. ✓
- analytics-api.md §4.7 — repricing math is unchanged; v2 does not alter cost-attribution semantics. ✓
- provider-adapter-contract.md (planned) — v2’s EmbeddingProvider is intentionally a separate Protocol from the LLM provider adapter; embedding providers do not implement to_wire / from_wire_response / estimate_input_tokens. The §7.2 AdapterCapabilities surface is for LLM adapters only and is not extended by v2. ✓
- context-assembler.md — v2 truncates user_message_text to max_input_tokens * 4 bytes for the embed call; the context assembler is not invoked. No interaction. ✓
Status: verified for the additive scope. Superseded by the 2026-05-14 entry above — Wave 10 shipped the v2 implementation. §A3-rev3 did invert slot 4 under v1’s structural fingerprint, but v2 ships anyway as the opt-in alternative for workspaces whose structural Jaccard washes out (the §16.1 motivation). The §16.10 test 5 cluster-tightening A/B (60-turn fixture) is deferred to a follow-up benchmark wave.

2026-05-14 — pricing.md v1 (commercial pricing model — recommendation, awaiting owner ratification)

Specs: pricing.md (new — drafted v1). No code changes; pure spec. Updates the project strategy (private) with a pointer (question stays open). Updates CHANGES.md specs-in-scope + cross-reference map.
Change: Closes the design gap in the project strategy (private) by surveying the credible pricing models for a hybrid gateway-plus-agent product and recommending one. Surveys five candidate shapes — per-seat (§5.1), per-call (§5.2), percentage of savings (§5.3), free + paid / open-core (§5.4), and four hybrid combinations (§5.5) — each evaluated across six dimensions (unit of metering, incentive alignment, first-contact friction, at-scale predictability, composability with shipped primitives, billing complexity). Constraints derived from deployment-shape.md (the “trial without payment” floor from §4.1), the project strategy (private) (buyer ≠ user; predictability + attribution + single-bill-single-vendor), the project strategy (private) (startup-CTO default profile), and multi-user.md §5 (the shipped primitives any model must compose with). Recommendation (§7): open-core gateway (Free tier) + per-seat Pro tier + reserved enterprise %-of-savings add-on. The “active user” seat-metering unit composes directly with /analytics/by_user; tier gating is deployment-level, not per-request. Multi-user identity layer is the headline Pro feature (matches “single-user free / team use Pro” conversion trigger). Enterprise %-of-savings reserved until audit-export surface (multi-user.md §7.3) is built. Invariants (§11) pin: free tier remains usable single-user; per-call shapes do not sneak into Pro baseline; Metis does not resell provider tokens; tier gating is deployment-level (no per-request licensing checks); savings counterfactual is reproducible via pricing_version. Open questions (§10) surface ten live items including OSS/Pro line placement feature-by-feature, savings-number visibility on Free, agent-tier bundling, Enterprise %-rate ranges. The spec frames the choice; it does not close the project strategy (private). Owner ratifies (or revises-then-ratifies); the project strategy (private) closes only on owner action.
Type: additive. New spec drafted; no code or other-spec contract changes. the project strategy (private) gains a pointer (“specced; awaiting commercial decision”) but the question stays open per the spec’s own §7.6 / §14.
References to verify:
- the project strategy (private) — pointer added in this change; question stays open. ✓
- the project strategy (private) — new dated entry queued at pricing.md §14; lands on owner ratification, not now. ⏳
- deployment-shape.md §6 — the §6 “What this means for adjacent open questions” entry on §6.8 already anticipated this shape (“Gateway → likely per-seat or % of savings”); pricing.md picks per-seat with %-of-savings reserved for Enterprise. No edit required. ✓
- multi-user.md §5 — the identity layer enforces the per-seat metering; no contract change required. The recommendation explicitly composes with shipped primitives without adding new ones. ✓
- analytics-api.md §4.7 — the savings counterfactual is the substrate any future %-of-savings tier reads against; no schema change in v1. ✓
- gateway.md — gateway remains the OSS foot-in-the-door; pricing.md does not modify the gateway surface. ✓
- canonical-message-format.md §6.4 — pricing_version field is load-bearing for re-priceable savings; pricing.md invariant 6 pins this. No spec edit required. ✓
Status: drafted; awaiting owner ratification. The owner closes the project strategy (private) when ratifying (or revising-then-ratifying); until then §6.8 reads “Specced; awaiting commercial decision.” The cross-spec edits queued in pricing.md §14 land on ratification, not now.

2026-05-14 — routing-engine.md §5.5 / pattern-store.md §8.1 / §9.4 / §15.4: `pattern.min_confidence` default lowered from `0.3` → `0.05` (slot-4 confidence gate scales with `cost_weight=0.1`)

Specs: routing-engine.md §5.5 (“Default rationale” paragraph and example yaml); pattern-store.md §8.1 (call-site default comment), §9.4 (resolved-defaults example block + new explanatory paragraph), §15.4 (example yaml).
Change: Lowers the pattern.min_confidence default from 0.3 to 0.05 in packages/metis-core/src/metis_core/routing/policy.py (PatternConfig). The two slot-4 knobs are coupled: confidence is (top_score - runner_up_score) / top_score, where score = (1 - cost_weight) * success + cost_weight * cost_efficiency. Under the legacy cost_weight=0.3 regime, the cost-efficiency term alone produced ~0.35 confidence on tied-quality clusters with cost differentials — so min_confidence=0.3 acted as a noise gate without suppressing real signal. After the cost_weight 0.3 → 0.1 migration (Wave 8a-2) the same near-tied clusters produce only ~0.10 confidence, so the legacy 0.3 gate suppressed the first cluster-level inversion observed in any A3 series: §A3-rev2 Pass C turn 2 on write-a-doc-from-notes aggregated sonnet=0.900 vs haiku=0.842 (confidence 0.064), and slot 4 emitted not_applicable on all 18 routed turns. The Wave-9 fix scales the gate down with the cost-weight reduction so genuine inversions fire; cluster-empty / zero-score / fewer-than-K-cluster cases still gate off inside aggregation.py. Policy-file overrides (pattern: { min_confidence: 0.3 }) are preserved — workspaces that depended on the tighter gate restate it in routing.yaml and get the old behavior back.
Type: breaking-default. Slot-4 will fire on more turns at the new default. The scoring formula, the K-NN cluster construction, and the per-rule override path are unchanged; only the default value of PatternConfig.min_confidence moved. Workspaces that have an explicit pattern.min_confidence in routing.yaml are unaffected.
References to verify:
- routing-engine.md §5.5 — Default rationale paragraph extended with the min_confidence half of the story; example yaml updated. ✓
- pattern-store.md §8.1 / §9.4 / §15.4 — defaults updated in this change. ✓
- evaluator.md — min_eval_confidence (consumer-side filter on per-verdict confidence) is unchanged; it remains 0.5 and is not affected by this gate. ✓
- analytics-api.md — /analytics/quality?min_confidence=… is a separate filter on eval.completed.confidence and is unaffected. ✓
- benchmarks/RESULTS.md §A3-rev2 finding — diagnoses the exact data this change resolves. No edit required.
Status: verified. New tests in packages/metis-core/tests/routing/test_policy_loader.py cover the default migration and the explicit-override opt-out; a headline test in packages/metis-core/tests/patterns/test_store.py named after the §A3-rev2 finding constructs a cluster with haiku.score≈0.842 and sonnet.score≈0.900 and asserts that slot 4 gates off under min_confidence=0.3 and picks sonnet under min_confidence=0.05.
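
The arithmetic behind this entry can be checked by hand. Below is a minimal Python sketch of the slot-4 confidence gate, using the confidence formula quoted in the Change paragraph and the §A3-rev2 Pass C scores. The function name `slot4_pick`, the `>=` comparison, and the "fewer than two models gates off" rule are illustrative assumptions, not the code in aggregation.py.

```python
from typing import Optional


def slot4_pick(scores: dict, min_confidence: float) -> Optional[str]:
    """Return the top-scoring model, or None when the gate is not cleared."""
    ranked = sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
    if len(ranked) < 2 or ranked[0][1] <= 0:
        return None  # assumption: too few models or a zero top score gates off
    (top_model, top_score), (_, runner_up_score) = ranked[0], ranked[1]
    # confidence = (top_score - runner_up_score) / top_score
    confidence = (top_score - runner_up_score) / top_score
    return top_model if confidence >= min_confidence else None


# Scores from §A3-rev2 Pass C turn 2; confidence works out to about 0.064.
cluster = {"sonnet": 0.900, "haiku": 0.842}
legacy = slot4_pick(cluster, min_confidence=0.3)    # gate holds
current = slot4_pick(cluster, min_confidence=0.05)  # gate clears
```

With the legacy gate the sketch returns `None` (the `not_applicable` case); with the new default it returns `"sonnet"`.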

2026-05-14 — context-assembler.md v3 §5.2 lands: explicit-activation budget + pre-activation events + `[preloaded]` index annotation

Specs: context-assembler.md v3 §5.2 (status header flipped to “Implemented”); skill-format.md §7.1 (index format gains [preloaded] annotation), §8.2 (skill_load pointer-return for pre-activated + re-loaded skills, budget exhaustion via ToolExecutionError), §9.1 (load_reason="always" is now wired); event-bus-and-trace-catalog.md §6.6 (parent ordering + load_reason semantics).
Change: Adds per-session SkillActivationRegistry (packages/metis-core/src/metis_core/skills/activation.py) tracking pre-activated skills (free, bodies inlined in stable prefix as v2 §5.1 padding) and explicit activations (counted against MAX_EXPLICIT_ACTIVATIONS_PER_SESSION = 3 and HARD_CAP_CUMULATIVE_ACTIVATION_TOKENS = 30000; WARN_CUMULATIVE_ACTIVATION_TOKENS = 10000 logs once). SessionManager.create_session pre-computes the stable system prompt via _assemble_stable_system_prompt, populates the registry with pre-activated names, and emits one skill.loaded(load_reason="always", triggered_by_tool_use_id=None) per inlined skill — events fire AFTER session.started (FK valid) and BEFORE any turn.started (no turn context). The cached stable prefix is reused on every LLM call in the turn loop so the provider’s cache_control marker stays valid. Discovery-index lines for pre-activated skills get a [preloaded] annotation via post-rendering string substitution (byte-stable, no padding re-pass). SkillLoadTool (packages/metis-core/src/metis_core/skills/tools.py) consults ToolContext.skill_activations: (a) pre-activated skills return a pointer with {"already_preloaded": true} metadata, no event; (b) already-explicitly-activated skills return a pointer with {"already_loaded": true} metadata, no event, no budget increment; (c) budget exhaustion raises ToolExecutionError → tool.failed per v3 §5.2.6 (no new event type). v3 §5.2.5 deferral honored: no mid-session eviction, no skill.evicted event.
Type: additive. (1) The existing _pad_stable_prefix_for_cache now returns a (prefix, inlined_skills) tuple; the only in-tree callers are _assemble_stable_system_prompt (updated) and the existing v2 §5.1 test slice (updated to unpack). (2) ToolDispatcher.dispatch gains an optional skill_activations= kwarg defaulting to None; existing callers compile unchanged. (3) ToolContext gains an optional skill_activations field defaulting to None. (4) Discovery index format gains the optional [preloaded] annotation; agents that don’t parse the annotation continue to work, since the underlying {name}: {description} shape is preserved with one extra [preloaded] token between name and colon. (5) skill.loaded payload unchanged; load_reason="always" is now produced (previously reserved).
References to verify:
- context-assembler.md v3 §5.2 — status header flipped to “Implemented” in this change. ✓
- skill-format.md §7.1 / §8.2 / §9.1 — additive notes added in this change. ✓
- event-bus-and-trace-catalog.md §6.6 — parent + load_reason semantics annotated in this change. ✓
- the project strategy (private) — the “skills you don’t use are wasted tokens” lever; v3 §5.2 caps the burn at 3 explicit activations. No edit required.
- pattern-store.md — pattern fingerprint doesn’t read activation state today; future skill-aware fingerprinting (v3 §5.2.7 q3) is out of scope. ✓
- analytics-api.md — a future /analytics/skills rollup (v3 §5.2.6 “Analytics consequence”) could project skill.loaded by load_reason; not in v3 scope. ✓
Status: verified. New test file packages/metis-core/tests/sessions/test_skill_activation.py covers: registry state transitions; budget count cap + token cap raising SkillBudgetExceededError; warn-threshold one-shot log; pre-activation events fire at create_session with the right payload shape; per-session registry is populated; [preloaded] annotation lands on the rendered index; skill_load returns the pointer (not the body) for pre-activated skills and emits no new event; re-loading an explicitly-activated skill returns a pointer, doesn’t increment the budget, and emits no new event; MAX_EXPLICIT_ACTIVATIONS_PER_SESSION + 1 distinct loads surface the 4th as tool.failed; activated bodies persist across turns via message history; the stable prefix is byte-identical across three consecutive turns.
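
The three SkillLoadTool outcomes and the two budget caps described in this entry can be sketched as a small state machine. This is a hedged illustration only: the class name `ActivationRegistrySketch`, its `load` method, the `{"activated": True}` return value, and the strict `>` token-cap comparison are assumptions. Only the constant names, `SkillBudgetExceededError`, and the two pointer metadata keys come from the entry. The warn threshold and event emission are omitted.

```python
from dataclasses import dataclass, field

MAX_EXPLICIT_ACTIVATIONS_PER_SESSION = 3
HARD_CAP_CUMULATIVE_ACTIVATION_TOKENS = 30000


class SkillBudgetExceededError(Exception):
    """Raised when an explicit activation would exceed a session budget."""


@dataclass
class ActivationRegistrySketch:
    preloaded: set = field(default_factory=set)   # inlined in the stable prefix
    explicit: dict = field(default_factory=dict)  # skill name -> body tokens

    def load(self, name: str, body_tokens: int) -> dict:
        if name in self.preloaded:
            return {"already_preloaded": True}  # pointer, no event
        if name in self.explicit:
            return {"already_loaded": True}     # pointer, no budget increment
        if len(self.explicit) >= MAX_EXPLICIT_ACTIVATIONS_PER_SESSION:
            raise SkillBudgetExceededError("activation count cap reached")
        used = sum(self.explicit.values())
        if used + body_tokens > HARD_CAP_CUMULATIVE_ACTIVATION_TOKENS:
            raise SkillBudgetExceededError("cumulative token cap reached")
        self.explicit[name] = body_tokens
        return {"activated": True}  # hypothetical success marker


registry = ActivationRegistrySketch(preloaded={"style-guide"})
results = [registry.load(n, 1000) for n in ("style-guide", "a", "a", "b", "c")]
try:
    registry.load("d", 1000)  # fourth distinct explicit load
    fourth_failed = False
except SkillBudgetExceededError:
    fourth_failed = True
```

In the shipped design the exception surfaces as `ToolExecutionError` → `tool.failed`; the sketch stops at the registry boundary.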

2026-05-14 — gateway.md §3.3 / §6 — gateway keys gain optional `user_id` / `team_id` tags (Wave 8a-5)

Specs: gateway.md §3.3 (keystore-record table) + §6 (events emitted); multi-user.md §4 is the design reference.
Change: Implements the first half of multi-user.md §4 against the shipped gateway. (a) GatewayKey (apps/gateway/src/metis_gateway/auth.py) gains optional user_id: str | None and team_id: str | None fields; both default to None for pre-multi-user keys. The keystore loader (Keystore.from_dict) reads them when present, validates them against the multi-user §3.4 shape (^[a-z0-9_-]+$, ≤200 chars), and leaves them None when absent — existing keys.json files load unchanged. (b) A new request-scoped Identity dataclass projects the resolved key onto (gateway_key_id, workspace_path, user_id, team_id) per multi-user.md §3.2 (the spec calls this Principal; the v1 implementation names it Identity so the auth surface reads naturally — same fields, same semantics). Keystore.identify(token) returns it; identity_from_key(key) exposes the projection for testing and the harness. (c) metis gateway issue-key gains --user <id> / --team <id> flags (validated identically) that persist into the keystore JSON; the post-issuance summary prints both lines when set. (d) The HTTP handlers in apps/gateway/src/metis_gateway/app.py build an Identity per request and pass it to the harness; the harness (harness.py) stamps user_id / team_id onto both llm.call_completed (typed catalog fields per Agent 8a-4) and turn.completed (typed catalog fields). Agent-loop traffic and pre-multi-user keys keep user_id: None / team_id: None — null-bucket rollup convention from multi-user.md §3.4.
Type: additive. (1) Existing GatewayKey constructors that don’t pass user_id / team_id continue to compile and behave identically. (2) Existing keys.json files load cleanly; the additive fields default to None. (3) The harness call() / stream() signatures changed from (gateway_key_id, workspace_path, ...) to (identity: Identity, ...); the only in-tree callers are app.py handlers, both updated. (4) Trace consumers that read gateway_key_id see no change; consumers that look for the new typed user_id / team_id see them populated for tagged-key traffic and None everywhere else.
References to verify:
- gateway.md §3.3 / §6 — updated in this change. ✓
- multi-user.md §4.1 / §4.2 / §4.4 — implementation matches the spec’s keystore shape, issuance UX, and trace-stamping contract. ✓
- event-bus-and-trace-catalog.md §6.3 / §6.4 — typed user_id / team_id fields on LLMCallCompleted and TurnCompleted (Agent 8a-4); the harness change consumes those typed fields. ✓
- analytics-api.md §4.1 / §4.8 — group_by=user / group_by=team consume the new payload fields. Cross-spec edit landed by Agent 8a-6. ⏳
- KNOWN_ISSUES.md — no entry tracked this work; no edit required. ✓
Status: verified. The harness stamping path is exercised by apps/gateway/tests/test_app_http.py::test_trace_events_stamp_user_id_and_team_id_for_tagged_key; back-compat by test_trace_events_stamp_null_identity_for_untagged_v1_key. Still outstanding: the metis gateway user add / team add subcommands, users.json / teams.json storage, hard-cap enforcement, and the audit-relevant gateway.key_issued / gateway.key_revoked / gateway.quota_exceeded event types. Those land in later sub-tasks of the multi-user.md rollout.
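
The keystore behavior in (a) and (b) of this entry, optional tags validated against the §3.4 shape and projected onto a request-scoped identity, might look roughly like the sketch below. The helper names `validate_tag` and `identity_from_record` and the record keys are hypothetical. Only the `Identity` field names, the regex, and the 200-character limit come from the entry.

```python
import re
from dataclasses import dataclass
from typing import Optional

# multi-user.md §3.4 shape; fullmatch() supplies the ^...$ anchoring.
_TAG_RE = re.compile(r"[a-z0-9_-]+")


def validate_tag(value: Optional[str]) -> Optional[str]:
    """Accept None (pre-multi-user key) or a well-formed tag; reject the rest."""
    if value is None:
        return None
    if len(value) > 200 or not _TAG_RE.fullmatch(value):
        raise ValueError(f"invalid identity tag: {value!r}")
    return value


@dataclass(frozen=True)
class Identity:
    gateway_key_id: str
    workspace_path: str
    user_id: Optional[str] = None
    team_id: Optional[str] = None


def identity_from_record(record: dict) -> Identity:
    """Project a keystore record (hypothetical dict shape) onto an Identity."""
    return Identity(
        gateway_key_id=record["gateway_key_id"],
        workspace_path=record["workspace_path"],
        user_id=validate_tag(record.get("user_id")),
        team_id=validate_tag(record.get("team_id")),
    )


base = {"gateway_key_id": "gk_1", "workspace_path": "/srv/ws"}
legacy = identity_from_record(base)  # untagged key: both tags stay None
tagged = identity_from_record({**base, "user_id": "usr_01abc", "team_id": "team_01xyz"})
```

Untagged records yield `user_id=None` / `team_id=None`, which is what lets existing keys.json files roll up under the null bucket.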

2026-05-14 — evaluator.md §5.4 adds `grounding_tokens` / `forbidden_grounding` workload-rubric primitive (v1.1)

Specs: evaluator.md §5.4 (workload rubric — new “Grounding-check primitive (v1.1)” subsection + example fields in the schema block).
Change: WorkloadRubric gains two optional list-of-strings fields (grounding_tokens, forbidden_grounding) parsed from workload.yaml.evaluate. The heuristic awards present / total for grounding tokens (positive) and 1 - (present / total) for forbidden tokens (positive on absence); when both are configured, the two components average. The composed workload score averages this with the substring/assertion-derived score when grounding is configured, so a workload that fully grounds is unaffected and one that fabricates is halved. New workload-level signals on the verdict: workload_grounding_score, grounding_tokens_present, grounding_tokens_missing, forbidden_grounding_present. New flags: workload_grounding_tokens_present, workload_grounding_tokens_missing, workload_forbidden_grounding_present, workload_forbidden_grounding_clean. The LLM-tier user message gains a “GROUNDING HINTS” section that surfaces the two lists so escalation can recognize paraphrased grounding the substring match misses (the LLM _SYSTEM_PROMPT is unchanged — the lists are inputs, not new instructions). Workload heuristic rubric version bump 1.0.0 → 1.1.0 per §12 invariant 7. Implementation in packages/metis-core/src/metis_core/eval/judge.py::_grounding_score; rubric parsing in eval/rubric.py::parse_workload_rubric; LLM-judge user-message hint in eval/llm_judge.py::_grounding_hint. Motivation comes from benchmarks/RESULTS.md §A3-rev: the original expect_substring_in_final_response="PATTERN_RECOMMENDATION" rewarded stylistic mimicry — sonnet cited the real PolicyEvaluation / RoutingDecision dataclasses and lowercase policy= literals (strictly more grounded) but scored 0.50 because it didn’t parrot the docstring’s UPPERCASE label. 
The architectural-explanation-without-hallucination workload fixture has been updated to use the new primitive (drops expect_substring_in_final_response, adds a 5-token grounding list and a 4-token forbidden list). Validated against the §A3-rev trace DBs (benchmarks/.runs/diversity-hallucination-{haiku,sonnet}.db), the new rubric scores sonnet 1.00 / haiku 0.90, reversing the old 1.00 / 0.50 inversion in which sonnet scored 0.50. Sonnet hits all 5 grounding tokens (haiku misses PolicyEvaluation and policy=); neither model fabricates.
Type: additive. New optional rubric fields default to (); workloads without them score identically to v1.0.0 except for the rubric-version stamp. The architectural-explanation-without-hallucination workload is the only fixture that switched primitives.
References to verify:
- benchmark.md §3.1 — evaluate: block schema; new fields are optional, no edit required.
- evaluator.md §12 invariant 7 — version bump satisfies it. ✓
- benchmarks/RESULTS.md §A3-rev — names this gap; future re-runs of the workload should report the v1.1 score series. ⏳
- pattern-store.md — pattern store reads score; verdict shape unchanged. ✓
- analytics-api.md /analytics/quality — projects eval.completed.score and ignores the new signal fields; no change. ✓
Status: verified.
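
The grounding arithmetic in this entry can be worked through in a few lines. The sketch below assumes case-sensitive substring matching and uses made-up token lists and a made-up response; the function names are illustrative and are not the `_grounding_score` implementation in judge.py.

```python
from typing import Optional, Sequence


def grounding_score(
    response: str,
    grounding_tokens: Sequence[str] = (),
    forbidden_grounding: Sequence[str] = (),
) -> Optional[float]:
    """Average of the configured components; None when neither list is set."""
    components = []
    if grounding_tokens:
        present = sum(tok in response for tok in grounding_tokens)
        components.append(present / len(grounding_tokens))
    if forbidden_grounding:
        present = sum(tok in response for tok in forbidden_grounding)
        components.append(1 - present / len(forbidden_grounding))
    if not components:
        return None
    return sum(components) / len(components)


def composed_workload_score(base_score: float, grounding: Optional[float]) -> float:
    """Average the substring/assertion score with grounding when configured."""
    return base_score if grounding is None else (base_score + grounding) / 2


# Illustrative lists only; the real fixture's tokens are not reproduced here.
tokens = ("PolicyEvaluation", "RoutingDecision", "policy=", "slot_4", "routing.yaml")
forbidden = ("FakeRouter", "magic_score", "AutoTier", "hidden_cache")
response = "RoutingDecision is produced per turn; slot_4 reads routing.yaml."

g = grounding_score(response, tokens, forbidden)  # (3/5 + 1.0) / 2
final = composed_workload_score(1.0, g)           # (1.0 + 0.8) / 2
```

Three of five grounding tokens with no forbidden hits gives 0.8, and averaging with a base score of 1.0 gives 0.9. That is consistent with the reported haiku score if the non-grounding component scored 1.0, which is an assumption here.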

2026-05-14 — multi-user.md §4.4 foundation: `user_id` / `team_id` land on `LLMCallCompleted`, `TurnCompleted`, `MessageMetadata`

Specs: event-bus-and-trace-catalog.md §6.2 (turn.completed payload), §6.3 (llm.call_completed payload); canonical-message-format.md §4.3 (MessageMetadata).
Change: Lands the catalog and canonical-type foundation that multi-user.md §4.4 specced. LLMCallCompleted and TurnCompleted (in packages/metis-core/src/metis_core/events/payloads.py) gain two additive optional fields each: user_id: str | None = None and team_id: str | None = None. Both default None so existing emit sites — agent-loop traffic and pre-multi-user gateway keys — keep working unchanged and roll up under the null bucket per multi-user.md §3.4. MessageMetadata (in packages/metis-core/src/metis_core/canonical/messages.py) gains the same two fields with the same defaults; _identity() is extended so equality and hashing reflect the new dimensions. Catalog sensitivity floors are unchanged: both events stay pseudonymous because user_id / team_id are stable opaque identifiers (usr_<ulid> / team_<ulid>), not raw PII (multi-user.md §3.2). Plaintext PII (email, real name) lives in users.json only — the trace store carries the stable id (multi-user.md §3.3). Catalog spec doc (event-bus-and-trace-catalog.md §6.2 / §6.3) is updated to enumerate the additive optional fields alongside the previously implementation-only gateway_key_id / inbound_shape / signals_extra fields. The session manager’s emit sites are not modified by this change — they continue to omit the fields (defaulting to None); the gateway harness is the planned producer (lands in the gateway-auth follow-on per multi-user.md §4.3, Agent 8a-5).
Type: additive. New optional fields; existing wire payloads decode cleanly to None; catalog sensitivity floors unchanged. No consumer break — make_event’s sensitivity check still rejects overrides more private than the floor (verified by new tests).
References to verify:
- multi-user.md §3 / §4.4 — identity model + stamping mechanics; this change implements the catalog-and-canonical-type slice. ✓
- event-bus-and-trace-catalog.md §6.2 / §6.3 — payload schemas updated in this change. ✓
- canonical-message-format.md §4.3 — MessageMetadata updated in this change. ✓
- gateway.md §6 — gateway-side stamping (where the producer fills the new fields) lands in Agent 8a-5; this change is the consumer-side foundation. ⏳
- analytics-api.md §4.1 / §4.9 — Agent 8a-6 has landed the analytics surface that reads these fields via json_extract; this change provides the typed source stamps. ✓
- routing-engine.md §5.3.2 — three new predicates (user_cost_today_exceeds_usd, team_cost_today_exceeds_usd, team_cost_month_exceeds_usd) land at routing-rule integration time; this change provides the trace-store dimension they read against. ⏳
Status: verified for the catalog-and-canonical-type slice; downstream consumers (gateway producer in 8a-5, routing predicates) land in follow-on changes as flagged above.

2026-05-14 — analytics-api.md §4.1 + new §4.9 (user/team rollups land)

Specs: analytics-api.md §4.1 (group_by enum + user/team filter params) and new §4.9 (/analytics/by_team); §6 (new error codes).
Change: Implements the first slice of multi-user.md §5 — the analytics surface buyers need to attribute cost beyond the gateway-key boundary. _COST_GROUP_BY_ALLOWED in packages/metis-core/src/metis_core/analytics/store.py gains user and team, projecting json_extract(payload_json, '$.user_id') / '$.team_id' parallel to the shipped gateway_key slot. AnalyticsStore.cost() gains optional user= / team= exact-match filters, both passed via SQL placeholder; the HTTP boundary additionally regex-validates the shape (^[A-Za-z0-9_-]{1,200}$) and returns 400 invalid_user / invalid_team on malformed values. New AnalyticsStore.by_team() + /analytics/by_team HTTP route mirror the shipped /analytics/by_key shape: per-team cost_usd + token counts + call_count + user_count (distinct non-null users in the team) + by_user sub-array sorted by cost DESC. The null bucket (agent-loop traffic + pre-v1 keys issued without --user / --team) appears as team_id: null with user_count: 0. v1 ships the rollup shape; team_name / daily_cap_usd / monthly_cap_usd join to teams.json and the partial_coverage flag from multi-user.md §5.2 / §5.4 are deferred until the gateway-side identity records land (multi-user.md §4.2). Dependent on Agent 8a-4’s catalog-field stamping and Agent 8a-5’s gateway harness writing those stamps; until both land, every event projects null and rolls up under the null bucket — the contract still works.
Type: additive. New whitelist values, new optional filter params, new endpoint, new error codes. Existing /analytics/cost?group_by=model callers see no shape change.
References to verify:
- multi-user.md §5.1 / §5.2 / §5.3 — the spec this implements. The partial_coverage flag from §5.4 is the next slice; flagged in analytics-api.md §4.9 “v1 scope” note. ✓
- event-bus-and-trace-catalog.md §6.3 — LLMCallCompleted.user_id / team_id are the source stamps; this change reads them as json_extract projections, so it tolerates absence (rolls up under null). Catalog edit pending Agent 8a-4. ⏳
- gateway.md §3.3 / §6 — gateway-side keystore changes (GatewayKey.user_id / team_id) and request-time stamping pending Agent 8a-5. ⏳
- analytics-api.md §4.8 — /analytics/by_key shape was the template; /analytics/by_team follows the same envelope and sort convention. ✓
Status: verified for analytics layer; downstream stamp producers (8a-4, 8a-5) verify when they land. The store and HTTP handler are correct against the spec contract today and against null-stamped events in production until then.
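
The null-bucket behavior follows directly from how `json_extract` treats missing keys. The following self-contained sketch uses Python's `sqlite3` module with a hypothetical `events` table and a hypothetical `cost_usd` payload field. It shows the projection idea only, not the AnalyticsStore query, and assumes the linked SQLite build includes the JSON functions.

```python
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (type TEXT, payload_json TEXT)")  # hypothetical schema

payloads = [
    {"team_id": "team_a", "user_id": "usr_1", "cost_usd": 0.5},
    {"team_id": "team_a", "user_id": "usr_2", "cost_usd": 0.25},
    {"team_id": "team_a", "user_id": "usr_1", "cost_usd": 0.25},
    {"cost_usd": 0.125},  # agent-loop or untagged-key traffic: no stamps
]
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [("llm.call_completed", json.dumps(p)) for p in payloads],
)

# Missing keys project to SQL NULL, so unstamped events group under team_id NULL.
# COUNT(DISTINCT ...) ignores NULLs, which yields user_count = 0 for that bucket.
rollup = conn.execute(
    """
    SELECT json_extract(payload_json, '$.team_id') AS team_id,
           SUM(json_extract(payload_json, '$.cost_usd')) AS cost_usd,
           COUNT(*) AS call_count,
           COUNT(DISTINCT json_extract(payload_json, '$.user_id')) AS user_count
    FROM events
    WHERE type = ?
    GROUP BY team_id
    ORDER BY cost_usd DESC
    """,
    ("llm.call_completed",),
).fetchall()
```

The event type is passed as a placeholder, mirroring the entry's note that filters go through SQL placeholders rather than string interpolation.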

2026-05-14 — pattern-store.md §5.1 adds optional `workload_id` near-keyed partition

Specs: pattern-store.md §5.1 (new row in the structural-feature table), §5.3 (blended similarity formula prose for the new field).
Change: FingerprintInputs / StructuralFeatures gain an optional workload_id: str | None field (default None). When both fingerprints in a comparison set workload_id, the K-NN similarity is blended 0.85 * cluster + 0.15 * structural so same-workload neighbors cluster together first. When either side is None the blend is skipped and the formula reduces to the v1 weighted-Jaccard exactly. SessionManager.submit_turn accepts an optional workload_id kwarg that flows through TurnContext to the fingerprint_inputs_builder / fingerprint_inputs_hook callbacks. The benchmark harness sets it to the workload name; agent-loop callers (CLI / TUI / serve / gateway) leave it None. Rationale comes from §A3-rev unblock #1 (benchmarks/RESULTS.md): intent_tags is empty on most turns so K-NN was clustering by tool shape + length bucket, which mixed workloads and washed out per-workload quality deltas.
Type: additive. Existing fingerprints in stored DBs decode with workload_id=None; new writes without a workload tag produce identical K-NN behavior to v1. Existing callers do not need to change.
References to verify:
- routing-engine.md §5.5 — formula and cost_weight default unchanged; only the fingerprint inputs are richer.
- benchmarks/RESULTS.md §A3-rev / §A3-rev2 — names this as unblock #1; future A3-rev2 should re-run with workload_id set by the harness.
- KNOWN_ISSUES.md — no entry tracks this gap; no edit required.
Status: verified.
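
A small sketch of the blend's fallback behavior follows. The entry does not spell out how the `cluster` term is computed. The sketch assumes it is 1.0 when the two workload_id values match and 0.0 otherwise, and treats the v1 weighted-Jaccard result as a precomputed `structural` input. Both the function name and that reading of `cluster` are assumptions.

```python
from typing import Optional


def blended_similarity(
    structural: float,
    workload_a: Optional[str],
    workload_b: Optional[str],
) -> float:
    """Blend only when both fingerprints carry a workload_id."""
    if workload_a is None or workload_b is None:
        return structural  # reduces to the v1 weighted-Jaccard value exactly
    # Assumption: the cluster term is an exact-match indicator.
    cluster = 1.0 if workload_a == workload_b else 0.0
    return 0.85 * cluster + 0.15 * structural


untagged = blended_similarity(0.4, None, "write-a-doc-from-notes")
same = blended_similarity(0.4, "write-a-doc-from-notes", "write-a-doc-from-notes")
different = blended_similarity(0.4, "write-a-doc-from-notes", "another-workload")
```

Under that reading, same-workload neighbors always outrank cross-workload ones, while untagged comparisons behave as in v1.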

2026-05-14 — routing-engine.md §5.5 `cost_weight` default lowered 0.3 → 0.1

Specs: routing-engine.md §5.1 example, §5.5 (formula prose + new “Default rationale” paragraph + changelog row).
Change: The default for pattern.cost_weight (the routing slot 4 cluster-score blend constant) drops from 0.3 to 0.1 in PatternConfig and the routing.yaml loader. The scoring formula score_M = (1 - cost_weight) × normalized_success_M + cost_weight × normalized_cost_efficiency_M is unchanged — only the constant moves. Rationale comes from the §A3-rev benchmark (benchmarks/RESULTS.md): at 0.3 the cost-efficiency term required a ~0.43 success delta to flip the chooser when the cheapest model also scored 1.0 on cost_efficiency, which swamped the 0.15–0.30 cluster-level quality deltas the LLM judge actually produced; slot 4 picked the cheaper model on every routed turn regardless of evidence. At 0.1 a quality delta of ~0.143 is enough to invert the ranking. Per-workspace override (pattern.cost_weight: 0.3) is unchanged — workspaces that depended on the prior cost-bias must restate the old default in routing.yaml.
Type: breaking-default. Consumers relying on the prior 0.3 blend must opt in via policy file; behavior for any policy that explicitly set cost_weight is unchanged.
References to verify:
- pattern-store.md — §8.3/§8.4 reference the scoring formula but not the default constant; no edit required.
- benchmarks/RESULTS.md §A3-rev — names this as unblock #2; future §A3-rev2 reads the new default.
- KNOWN_ISSUES.md — no entry tracks this default; no edit required.
Status: verified.
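
The effect of the constant can be seen with a two-model example. The sketch below applies the scoring formula quoted in the Change paragraph to made-up normalized inputs: a 0.2 success delta, and cost efficiency of 1.0 for the cheaper model versus 0.0 for the stronger one. The model labels and helper names are hypothetical.

```python
def cluster_score(success: float, cost_efficiency: float, cost_weight: float) -> float:
    """score_M = (1 - cost_weight) * success_M + cost_weight * cost_efficiency_M."""
    return (1 - cost_weight) * success + cost_weight * cost_efficiency


def pick(models: dict, cost_weight: float) -> str:
    """Return the model with the highest blended score."""
    return max(models, key=lambda name: cluster_score(*models[name], cost_weight))


# (normalized_success, normalized_cost_efficiency); illustrative values only.
models = {"strong": (1.0, 0.0), "cheap": (0.8, 1.0)}

old_default = pick(models, cost_weight=0.3)  # strong 0.70 vs cheap 0.86
new_default = pick(models, cost_weight=0.1)  # strong 0.90 vs cheap 0.82
```

At 0.3 the cheaper model wins despite the quality gap; at 0.1 the same inputs favor the stronger model.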

2026-05-14 — event-bus-and-trace-catalog.md §4.4.1 enforced; `eval.completed` floor inverted

Specs: event-bus-and-trace-catalog.md §4.4.1 (rule clarification + example), §6.12 (eval.completed floor sensitivity); evaluator.md §8.2 and §8.4 (floor + downgrade pathway).
Change: make_event now rejects a sensitivity override that is more private than the catalog floor (raises EventValidationError), per §4.4.1’s “only toward less private” rule. The rule’s prose is reworded so “floor” is unambiguously the worst case — the most-private classification the event can have when all opt-in fields are populated — and a downgrade is what happens when the event carries less than the worst-case content. To make eval.completed spec-consistent under the strict rule, its catalog floor moves from pseudonymous → user_controlled (the worst case, when signals.rationale_redacted is populated) and the evaluator subscriber’s _sensitivity_for is inverted: when the rationale field is absent, downgrade to pseudonymous (allowed); when present, no override.
Type: breaking for eval.completed consumers that filter by sensitivity == pseudonymous (the floor moved up). Additive for everything else — non-eval.completed events keep their existing floors; the new make_event check rejects overrides that were never spec-conformant.
References to verify:
- evaluator.md §8.2 / §8.4 — updated in this change to match the new floor and downgrade pathway.
- event-bus-and-trace-catalog.md §4.4.1 / §6.12 — updated in this change.
- analytics-api.md — /analytics/quality projects eval.completed.score and doesn’t filter on sensitivity; no behavior change.
- KNOWN_ISSUES.md — “Sensitivity upgrade rule unenforced” 🟢 entry deleted; replaced by the enforcing check.
Status: verified.
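
The "only toward less private" rule reduces to an ordering check. The sketch below models just the two sensitivity levels named in this log and uses a hypothetical `resolve_sensitivity` helper in place of the real make_event. The enum, the floor table, and the function signature are assumptions.

```python
from enum import IntEnum
from typing import Optional


class Sensitivity(IntEnum):
    # Higher value = more private. Only the two levels named in this log.
    PSEUDONYMOUS = 1
    USER_CONTROLLED = 2


class EventValidationError(ValueError):
    """Raised when an override is more private than the catalog floor."""


# Floors as described in this log; the table itself is a sketch.
CATALOG_FLOORS = {
    "eval.completed": Sensitivity.USER_CONTROLLED,
    "turn.completed": Sensitivity.PSEUDONYMOUS,
}


def resolve_sensitivity(
    event_type: str, override: Optional[Sensitivity] = None
) -> Sensitivity:
    floor = CATALOG_FLOORS[event_type]
    if override is None:
        return floor
    if override > floor:
        raise EventValidationError(
            f"{event_type}: override {override.name} is more private than floor {floor.name}"
        )
    return override  # downgrade toward less private is allowed


# Rationale absent: the evaluator downgrades eval.completed to pseudonymous.
downgraded = resolve_sensitivity("eval.completed", Sensitivity.PSEUDONYMOUS)
try:
    resolve_sensitivity("turn.completed", Sensitivity.USER_CONTROLLED)
    rejected = False
except EventValidationError:
    rejected = True
```

With the floor defined as the worst case, every legal override moves toward less private, which is why the eval.completed floor had to move up to user_controlled.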

2026-05-14 — delegation.md v1 (Phase 4 worker-session design)

Specs: delegation.md (new — drafted v1). No code changes; pure spec. Implies additive cross-spec edits flagged below — none land until Phase 4 implementation.
Change: Consolidates the worker-session contract that has been distributed across routing-engine.md §6 (the delegate() tool, tier resolution, slot 5 re-entry, InsufficientContextRequest), event-bus-and-trace-catalog.md §6.8 (the three delegate.* events), and streaming-protocol.md §6.4 + §7 (cancellation cascade, include_worker_sessions filter) into one Phase-4 design document. Defines what the worker session is (full Session record with additive parent_session_id / parent_tool_use_id / is_worker fields), the spawn → routing-re-entry → execution → completion lifecycle, the read-only isolation contract against MEMORY.md / USER.md / skills / routing config (planner-only durable state), the cost-attribution model (worker tokens land on the worker’s llm.call_completed; delegate.completed.worker_total_cost_usd is derived, single source of truth via llm.call_completed), pattern-store integration (workers write their own fingerprint rows; slot 4 forced to defer inside delegation re-entry so learned patterns don’t silently override the planner’s explicit tier=), evaluator integration (worker terminal turn scored independently; parent session rubric folds in delegate.completed.success but parent turn score is not transitively inflated by worker scores), and the confirmation-handler-inheritance rule (workers inherit planner’s handler; “always” answers from worker prompts do NOT persist to trust.yaml in v1). Slot 5 (DELEGATE_REQUEST) treatment is non-normative — canonical source remains routing-engine §6. Documents v1 as opt-in: gated by can_delegate: true in the registry + active planner model + planner LLM choice; default registry has can_delegate: false on fast-tier models so buyers without multi-step workloads never see the surface. 
Open questions section surfaces (1) cost-of-delegation overhead for small sub-tasks, (2) cancellation cascade for already-completed workers, (3) concurrent delegation cap, (4) worker streaming back to planner (deferred per streaming-protocol.md §12.2), (5) worker wall-clock timeout, (6) router-decided delegation (rejected for v1 — predicate routing can’t distinguish delegatable sub-tasks), (7) worker-prompt “always” answers persisting to trust.yaml, (8) tier name configurability, (9) worker history visibility default.
Type: additive. New spec drafted; all cross-spec implications below are additive (existing consumers are unchanged; new fields default to None / false).
References to verify:
- routing-engine.md §6 — canonical source for the delegate() tool signature, can_delegate, tier resolution, slot 5 re-entry, InsufficientContextRequest. No edits required; delegation.md treats §6 as the source of truth.
- event-bus-and-trace-catalog.md §6.8 — three delegate.* events already present in the catalog (Phase 4). delegation.md §9 proposes two additive delegate.started payload fields (allowed_tool_count, dropped_tools); catalog edit lands with implementation.
- event-bus-and-trace-catalog.md §6.3 — llm.call_started.is_worker and Actor.WORKER already in the catalog. No change required.
- streaming-protocol.md §6.4 + §7 — cancellation-during-delegation seam and include_worker_sessions filter are already documented; no edits.
- server-api.md — is_worker / parent_session_id already on the session record; include_workers query already documented. No edits.
- canonical-message-format.md §9.1 — Session schema gains three additive nullable columns (parent_session_id, parent_tool_use_id, is_worker); migration is ALTER TABLE ADD COLUMN ... DEFAULT NULL. Cross-spec edit lands with implementation.
- pattern-store.md — worker writes its own fingerprint row; parent_session_id is not projected into the fingerprint. Cross-spec edit if pattern-store wants to add a worker-aware filter (§11 deferred).
- evaluator.md §5.6 + §6.1 — parent session rubric folds in delegate.completed.success; current heuristic rubric does not yet read this signal. Cross-spec edit lands with Phase 4 implementation.
- analytics-api.md §4.1 — _COST_GROUP_BY_ALLOWED gains parent_session and is_worker group_by values; include_workers query parameter behavior added. Cross-spec edit lands with implementation.
- tool-dispatcher.md — delegate registered as a builtin tool with elevated kernel privileges (can spawn a session); no other builtin has this capability. Cross-spec edit lands with implementation.
- context-assembler.md §5 — worker’s system prompt uses the same assembler path as planner’s; no change required.
- the project strategy (private) — “third lever (planner→worker delegation)” now has a drafted Phase-4 spec home. Existing thesis statement unchanged.
Status: drafted; awaiting owner review. Cross-spec edits enumerated above land alongside Phase 4 implementation.
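
The additive Session columns noted above for canonical-message-format.md §9.1 could be applied roughly as in the sketch below. Only the three column names and the ALTER TABLE ADD COLUMN ... DEFAULT NULL shape come from this entry; the table name `sessions`, the column types, and the helper `add_worker_columns` are assumptions made for illustration.

```python
import sqlite3

# Column names come from the entry above; the SQL types are assumed.
WORKER_COLUMNS = {
    "parent_session_id": "TEXT",
    "parent_tool_use_id": "TEXT",
    "is_worker": "INTEGER",
}


def add_worker_columns(conn: sqlite3.Connection, table: str = "sessions") -> list[str]:
    """Add any missing worker columns and return the names that were added."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    added = []
    for name, sql_type in WORKER_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {sql_type} DEFAULT NULL")
            added.append(name)
    conn.commit()
    return added


# Hypothetical pre-migration table with one existing session row.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sessions (id TEXT PRIMARY KEY)")
conn.execute("INSERT INTO sessions (id) VALUES ('s1')")

first_run = add_worker_columns(conn)
second_run = add_worker_columns(conn)  # re-running is a no-op
row = conn.execute(
    "SELECT parent_session_id, parent_tool_use_id, is_worker FROM sessions"
).fetchone()
```

Existing rows read back as NULL in all three columns, which is what keeps the change additive for current consumers.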

2026-05-14 — multi-user.md v1 (per-user / per-team identity & rollup layer)

Specs: multi-user.md (new — drafted v1), implies additive cross-spec changes flagged below. No code changes; pure spec.
Change: Adds a per-user / per-team identity layer on top of the shipped per-(gateway-key) attribution from gateway.md §3.3 / §6. Defines three identity dimensions (User, Team, Workspace) and a request-scoped Principal projection of GatewayKey. metis gateway issue-key gains --user / --team; new metis gateway user add / team add subcommands manage ~/.metis/gateway/users.json and teams.json (mode 0o600). Trace-stamping additive: user_id and team_id land on LLMCallCompleted and TurnCompleted (parallel to the existing gateway_key_id / inbound_shape). Analytics surface extends: group_by ∈ {user, team} on /analytics/cost; new /analytics/by_team rollup (mirrors the shipped /analytics/by_key); optional ?user= / ?team= filters on all five time-windowed endpoints; new partial_coverage flag for mixed-mode rollout windows. Quota enforcement is two-layered: routing-rule soft caps via three new predicates (user_cost_today_exceeds_usd, team_cost_today_exceeds_usd, team_cost_month_exceeds_usd) parallel to the shipped cost_today_exceeds_usd; gateway-boundary hard caps via Team.daily_cap_usd / monthly_cap_usd (and finally activating the previously reserved GatewayKey.daily_cap_usd) — hard cap short-circuits before routing, returns 429, emits a new gateway.quota_exceeded audit event. Three new gateway.* catalog events: key_issued, key_revoked, quota_exceeded, all pseudonymous-sensitive. Privacy posture: plaintext email lives in users.json only; trace events carry the stable user_id; email_sha256 exists for bootstrap-dedup and a future SSO bridge. Deployment-shape neutral — same struct + wire shape in local-FS and SaaS deployments; only the storage backend differs. v1 explicitly excludes SSO / OIDC / SAML / SCIM / RBAC / multi-org / multi-workspace-per-key (§8); the startup-CTO default from the project strategy (private) is the v1 target.
Type: additive. New spec drafted; all cross-spec implications below are additive (no existing consumer breaks; missing fields default to None).
References to verify:
- gateway.md §3.3 — GatewayKey gains two optional fields (user_id, team_id); existing keys with both None keep working. Cross-spec edit lands with implementation; flagged in multi-user.md §4.1.
- gateway.md §11 — “Multi-user / team-level rollups” follow-on now references multi-user.md as the design. Edit at implementation time.
- event-bus-and-trace-catalog.md §6.3 — LLMCallCompleted.user_id / team_id and TurnCompleted.user_id / team_id are typed additive fields; same pattern as the shipped gateway_key_id extension. Catalog edit lands with implementation.
- event-bus-and-trace-catalog.md §6 — three new event types (gateway.key_issued, gateway.key_revoked, gateway.quota_exceeded); payload structs sketched in multi-user.md §7.2. Catalog entry per AGENTS.md “Adding a new X” recipe at implementation time.
- routing-engine.md §5.3.2 — three new predicates (user_cost_today_exceeds_usd, team_cost_today_exceeds_usd, team_cost_month_exceeds_usd) parallel to cost_today_exceeds_usd; same snapshot-at-turn-start semantics. Edit at implementation time.
- analytics-api.md §4.1 — _COST_GROUP_BY_ALLOWED whitelist gains user / team. Endpoint shape additive.
- analytics-api.md §4.8 — new sibling endpoint /analytics/by_team documented in multi-user.md §5.2. Edit at implementation time.
- analytics-api.md §4.7 — savings endpoint’s behavior under ?team filter clarified in multi-user.md §5.4; no math change.
- the project strategy (private) — “multi-user from day one is real” and “team-level cost attribution matters” both have a drafted spec home. §2 stays open per the prompt’s instructions (it closes when the spec lands and the project strategy (private) — local-first vs SaaS — is decided).
Status: drafted; awaiting owner review. Cross-spec edits enumerated above land alongside Phase 3 implementation.
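
The gateway-boundary hard cap described above (short-circuit before routing, return 429) might look like the following sketch. The `Team` fields `daily_cap_usd` and `monthly_cap_usd` are named in the entry; the function `hard_cap_status`, its spend arguments, and the choice of `>=` as the comparison are assumptions.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Team:
    team_id: str
    daily_cap_usd: Optional[float] = None    # None means no daily hard cap
    monthly_cap_usd: Optional[float] = None  # None means no monthly hard cap


def hard_cap_status(team: Team, spent_today_usd: float, spent_month_usd: float) -> int:
    """Return 429 if a configured hard cap is reached, else 200 (continue to routing)."""
    # Assumption: a cap counts as hit once spend reaches it (>=).
    if team.daily_cap_usd is not None and spent_today_usd >= team.daily_cap_usd:
        return 429
    if team.monthly_cap_usd is not None and spent_month_usd >= team.monthly_cap_usd:
        return 429
    return 200


capped = Team("t1", daily_cap_usd=5.0, monthly_cap_usd=100.0)
uncapped = Team("t2")
```

A real handler would also emit the gateway.quota_exceeded audit event when it returns 429; that is omitted here.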

2026-05-14 — evaluator.md §5.1 turn rubric reads `tool.completed.success=False`

Specs: evaluator.md §5.1 (turn heuristic rubric — new no_tool_exit_failure signal + prose distinguishing the two tool-failure paths).
Change: Closes the first §A3 unblock. The v1 turn heuristic’s tool-failure gate previously only fired on tool.failed (uncaught Python exception); a shell tool that prints "FAIL N/M" and exits with a non-zero code emits tool.completed with success=False and was invisible to the rubric. v1.1 adds a sibling gate no_tool_exit_failure that scans for tool.completed events with success=False. Weighted at 0.5 (vs weight_no_tool_failure=0.25) — sized so a single failed exit drops a clean turn’s score from 1.0 to ~0.667 (drop ≥0.3) and the heuristic confidence to 0.55, below the v1 hybrid escalation threshold (0.7). This lets HybridJudge escalate to the LLM judge on this class of failure regardless of whether the bus subscriber plumbs assistant-response text. Implementation in packages/metis-core/src/metis_core/eval/judge.py::_evaluate_turn; weight + total normalization in eval/rubric.py::TurnHeuristicConfig. Rubric version bump 1.0.0 → 1.1.0 per §12 invariant 7 so prior eval.completed rows are not silently recalibrated.
Type: additive. Existing positive-lifecycle signals are untouched; turns that had no tool.completed.success=False events behave identically (clean score still 1.0). The rubric version bump produces a new score series rather than mutating old verdicts.
References to verify:
- event-bus-and-trace-catalog.md §6.x — ToolCompleted.success: bool already exists and is unchanged. ✓
- evaluator.md §5.3 — Hybrid escalation threshold default 0.7 is unchanged; the new signal lowers heuristic confidence into the escalation band on tool-exit failures. ✓
- evaluator.md §12 — invariant 7 (rubric versioning); the bump to 1.1.0 satisfies it. ✓
- evaluator.md §5.1 Agent 7a-2’s signals_extra contract paragraph — independent edit in the same section; cross-references the same §A3 unblock list. ✓
- benchmarks/RESULTS.md §A3 — re-run owned by Agent 7a-7; not modified here. ⏳
Status: verified.
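
The score drop quoted above can be reproduced with a small weight-normalization sketch. The 0.5 and 0.25 weights come from the entry; the assumption that the pre-existing v1 weights sum to 1.0 (represented by the hypothetical `other_signals` aggregate) is inferred from the ~0.667 figure and is not taken from eval/rubric.py.

```python
def normalized_score(weights: dict[str, float], passed: dict[str, bool]) -> float:
    """Sum the weights of passing gates and divide by the total weight."""
    total = sum(weights.values())
    earned = sum(w for name, w in weights.items() if passed[name])
    return earned / total


weights = {
    "other_signals": 0.75,        # hypothetical aggregate of the remaining v1 signals
    "no_tool_failure": 0.25,      # weight_no_tool_failure from the entry
    "no_tool_exit_failure": 0.5,  # new v1.1 gate
}

all_pass = {name: True for name in weights}
clean = normalized_score(weights, all_pass)
failed_exit = normalized_score(weights, {**all_pass, "no_tool_exit_failure": False})
```

Under that assumption a clean turn scores 1.5 / 1.5 and a single failed exit scores 1.0 / 1.5, which matches the stated drop of at least 0.3.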

2026-05-14 — evaluator.md §5.1 turn-completed `signals_extra` plumbed for LLM judge

Specs: evaluator.md §5.1 (signals_extra contract).
Change: Documented the three-key turn.completed.signals_extra contract produced by SessionManager._emit_turn_completed: final_response_text (existing; heuristic content-penalty reader), assistant_response_text (new alias of final_response_text; LLM-judge _build_user_message reader), and user_prompt_text (new; LLM-judge _build_user_message reader). Closes the second §A3 unblock: the online bus path now forwards enough text for the LLM judge to grade a turn instead of reading “(not available)” for both the user prompt and the assistant response. The assistant_response_text alias is intentional and points at the same string as final_response_text; a future migration can drop it once heuristic and LLM consumers converge on one name. Keys with empty values are omitted, so absent text degrades honestly to the judge’s “(not available)” fallback.
Type: additive. The producer only adds keys; the existing final_response_text reader path is unchanged. The new user_prompt_text parameter on _emit_turn_completed is keyword-only with a None default.
References to verify:
- event-bus-and-trace-catalog.md — TurnCompleted.signals_extra is already typed as a free-form dict | None per §6.4; no payload-registry change. ✓
- evaluator.md §5.2 — the LLM-as-judge rubric’s input list still cites “user prompt + assistant final response text” generically; the §5.1 contract update is the cross-reference that makes it concrete. ✓
- benchmark.md — the workload harness already plumbs user_prompt_text / assistant_response_text at the workload subject level; no change. ✓
Status: verified.
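
A minimal sketch of the three-key contract follows. The key names and the omit-when-empty rule come from the entry; the standalone helper `build_signals_extra` is hypothetical and is not the body of SessionManager._emit_turn_completed.

```python
from typing import Optional


def build_signals_extra(
    final_response_text: Optional[str],
    user_prompt_text: Optional[str] = None,
) -> dict[str, str]:
    """Build the signals_extra dict, omitting keys whose value is empty or None."""
    candidates = {
        "final_response_text": final_response_text,
        # Intentional alias: same string, read by the LLM judge.
        "assistant_response_text": final_response_text,
        "user_prompt_text": user_prompt_text,
    }
    return {key: value for key, value in candidates.items() if value}


full = build_signals_extra("done", user_prompt_text="fix the test")
response_only = build_signals_extra("done")
empty = build_signals_extra("", None)
```

Dropping empty keys, rather than emitting empty strings, leaves the judge's own fallback in charge of absent text.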

2026-05-14 — gateway.md v1 (captures shipped surface) + per-key analytics rollup

Specs: gateway.md (v0 skeleton → v1), analytics-api.md §4.1 + new §4.8, server-api.md (implicit — GET /sessions/{id}.routing_policy_version now populated).
Change: Rewrote gateway.md from v0 skeleton to v1 documentation of the shipped transparent HTTP gateway in apps/gateway/. Documents the actual endpoint shapes (/v1/chat/completions, /v1/messages, /healthz), the auth scheme (Authorization: Bearer gw_<ulid> or x-api-key), the keystore at ~/.metis/gateway/keys.json (SHA-256 hash; mode 0o600), the per-shape translation rules, the additive gateway_key_id + inbound_shape stamps on LLMCallCompleted / TurnCompleted (gateway.md §6), and the v1 loopback-only network posture (§3.2, which reverses the original v0 “default 0.0.0.0” plan until per-key rate limiting and audit log land). Notes the §5.3 “transparent mode” trade-off (gateway clients passing model always trigger the per_message_override slot win), recommends leaving the default as-is, and tracks a future --ignore-inbound-model flag for the cost-optimization magic-trick mode. Added gateway_key to _COST_GROUP_BY_ALLOWED in analytics/store.py and shipped a new /analytics/by_key endpoint (analytics-api.md §4.8) backed by AnalyticsStore.by_key(): a per-(gateway_key_id) cost + token + call_count rollup with a by_inbound_shape sub-array per row; rows with null gateway_key_id (agent-loop traffic) are keyed under null. Surfaced routing_policy_version on GET /sessions/{id} (and the POST /sessions 201): added a content-derived version field on RoutingPolicy (truncated sha256 of the raw yaml at parse time; None for EMPTY_POLICY); SessionManager.routing_policy_version() exposes it to the HTTP layer.
Type: additive. New analytics endpoint, new optional gateway_key group_by value, new optional response field on session endpoints, new optional RoutingPolicy.version (default None preserves call sites that construct policies directly).
References to verify:
- event-bus-and-trace-catalog.md §6.3 — LLMCallCompleted.gateway_key_id / inbound_shape already land as typed optional fields. ✓
- analytics-api.md §4.1 + §4.8 — group_by enum extended; new endpoint shape documented. ✓
- routing-engine.md §5.7 — RoutingPolicy gains a version field; the validation rules and parser entry points are unchanged. ✓
- server-api.md §4.x — GET /sessions/{id} response gains a populated routing_policy_version field. Already declared in the shape; no schema breakage. ✓
- KNOWN_ISSUES.md — 🟡 “Per-key analytics roll-up has no HTTP surface” entry deleted (this change ships the HTTP surface). ✓
Status: verified.
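
The content-derived policy version described above can be sketched as a truncated SHA-256 of the raw YAML text. The entry says only "truncated", so the 12-character length, the UTF-8 encoding, and the function name `policy_version` are assumptions; `None` input stands in for EMPTY_POLICY.

```python
import hashlib
from typing import Optional

VERSION_HEX_CHARS = 12  # assumed truncation length


def policy_version(raw_yaml: Optional[str]) -> Optional[str]:
    """Return a short content hash of the raw policy text, or None for no policy."""
    if raw_yaml is None:  # stands in for EMPTY_POLICY
        return None
    digest = hashlib.sha256(raw_yaml.encode("utf-8")).hexdigest()
    return digest[:VERSION_HEX_CHARS]


v_a = policy_version("rules:\n  - tier: fast\n")
v_a_again = policy_version("rules:\n  - tier: fast\n")
v_b = policy_version("rules:\n  - tier: smart\n")
```

Because the version is derived from the bytes alone, two sessions that loaded identical YAML report the same value without any stored counter.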

2026-05-14 — provider-adapter-contract.md v1.2 (CanonicalResponse returns content, not Message)

Spec: provider-adapter-contract.md §3.3 (CanonicalResponse shape).
Change: Bring §3.3 into line with the shipped impl. CanonicalResponse returns content: list[ContentBlock] + model + provider rather than a full Message. The adapter doesn’t own two Message fields the spec previously implied it did: the RoutingDecisionRecord (decided upstream by the routing engine) and Usage.cost_usd (computed by core from the local price table per canonical-format §6.4). The caller (SessionManager) assembles the final canonical Message from the adapter’s parts plus its own routing decision, cost computation, and id allocation. Adapter implementations have been on this shape since Phase 1 ([adapters/protocol.py](../../packages/metis-core/src/metis_core/adapters/protocol.py) docstring + AGENTS.md “Implementation conventions” already noted the divergence); v1.2 closes the spec/impl gap. Substitutability is unaffected — the substitutability gate is the (content, stop_reason, usage) triple, not the Message envelope.
Type: additive (the spec catches up with shipped impl; no consumer change required — there are no callers writing to the old shape).
References to verify:
- canonical-message-format.md §5 — Message shape unchanged. The fields the adapter previously owned in Message (id, role, content, metadata.routing, metadata.usage.cost_usd) are now assembled by SessionManager; no canonical-format edit required. ✓
- streaming-protocol.md §5.6 — the streaming-side MessageComplete event’s authoritative final content + usage shape is unchanged; it already returns content blocks rather than a Message. ✓
- event-bus-and-trace-catalog.md §6.3 — llm.call_completed payload reads from CanonicalResponse.usage / model; new shape preserves those fields. ✓
- KNOWN_ISSUES.md — “CanonicalResponse shape divergence from spec” 🟢 entry retired by this change. ✓
Status: verified.

2026-05-14 — context-assembler.md v3 (skill activation)

Spec: context-assembler.md §5.2 (new), §7 (skill-activation entry retired from out-of-scope; new entries for auto-activation, mid-session eviction, per-workspace budget overrides), §8 (six new decision-log entries), §9 (new references to skill-format.md and event-bus-and-trace-catalog.md §6.6).
Change: Specs the skill activation layer of the cost lever per the project strategy (private). Three activation paths partitioned by skill.loaded.load_reason: (a) pre-activation ("always") — v2 §5.1’s body-as-padding is formalized as observable activation, emitted once per inlined body at session init with triggered_by_tool_use_id=None; (b) explicit activation ("on_demand") — existing skill_load tool path, unchanged except for the new budget check; (c) auto-activation ("auto_suggested") — not in v3, reserved. No description-match-driven auto-activation in v3 (rationale: preserves agentskills.io progressive disclosure semantics; avoids non-determinism breaking caches; no usage data to tune classifier against). Per-session activation budget: MAX_EXPLICIT_ACTIVATIONS_PER_SESSION=3 count cap, WARN_CUMULATIVE_ACTIVATION_TOKENS=10000 log-only, HARD_CAP_CUMULATIVE_ACTIVATION_TOKENS=30000 hard cap; all surface as ToolExecutionError → tool.failed (no new event types). Pre-activated skills don’t count against budget. Discovery index entry for a pre-activated skill annotated [preloaded]; skill_load(name) for a pre-activated skill returns a pointer (“already in system prompt”), not the body, to avoid double-paying input bytes. No mid-session eviction in v3 — would invalidate message-level caches a future spec might place, and require unwinding structurally-linked tool_use/tool_result pairs. Deferred to history-compression spec.
Type: additive on context-assembler.md; implies two additive cross-spec changes flagged below.
References to verify:
- skill-format.md §7.1 — discovery-index format currently specified as - {name}: {description}. v3 §5.2.2 adds an optional [preloaded] annotation on pre-activated skills (- {name} [preloaded]: {description}). Additive — readers ignoring the annotation see no behavior change. Cross-spec edit lands with implementation; flagged in context-assembler.md §5.2.7 open question 2.
- skill-format.md §8.2 — skill_load tool semantics gain a budget check (raises ToolExecutionError on exhaustion) and a pre-activated-skill special case (returns pointer text with {"already_preloaded": true} metadata, no body, no event re-emission). Additive: existing callers see no change in the in-budget non-preloaded case.
- event-bus-and-trace-catalog.md §6.6 — skill.loaded payload schema unchanged. v3 emits the existing load_reason="always" enum value from a new path (session init, post-session.started, pre-first-turn.started). No catalog edit required.
- analytics-api.md — v3 mentions a future /analytics/skills rollup keyed on load_reason for tuning the v2 padding source priority; not specified in v3 and no analytics-api edit required.
- the project strategy (private) — context > skills > model selection thesis: v3 specifies the second-largest lever (skills) inside the largest (context). No narrative change required; cross-reference only.
- benchmark.md — no current workload exercises skill loading. Wave 6 should add one before tuning the default budget numbers; flagged in context-assembler.md §5.2.7 open question 1. No spec edit required.
Status: pending owner sign-off on the five open questions in §5.2.7 (default budget numbers; [preloaded] annotation format vs alternatives; auto-activation deferral; re-load-as-no-op semantics; pre-activation event ordering). Cross-spec edits to skill-format.md §7.1 / §8.2 land with implementation (Wave 6+); both are additive.
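
The per-session activation budget could be enforced along the lines of the sketch below. The three constants and the ToolExecutionError name come from the entry; the `ActivationBudget` class, the local exception stand-in, and the exact boundary comparisons are assumptions. Pre-activated skills would simply never call `charge`.

```python
import logging

MAX_EXPLICIT_ACTIVATIONS_PER_SESSION = 3
WARN_CUMULATIVE_ACTIVATION_TOKENS = 10_000
HARD_CAP_CUMULATIVE_ACTIVATION_TOKENS = 30_000

logger = logging.getLogger(__name__)


class ToolExecutionError(Exception):
    """Local stand-in for the project's error type."""


class ActivationBudget:
    """Tracks explicit skill activations for one session."""

    def __init__(self) -> None:
        self.count = 0
        self.tokens = 0

    def charge(self, body_tokens: int) -> None:
        """Record one explicit activation or raise if a cap would be exceeded."""
        if self.count + 1 > MAX_EXPLICIT_ACTIVATIONS_PER_SESSION:
            raise ToolExecutionError("explicit activation count cap reached")
        if self.tokens + body_tokens > HARD_CAP_CUMULATIVE_ACTIVATION_TOKENS:
            raise ToolExecutionError("cumulative activation token cap reached")
        self.count += 1
        self.tokens += body_tokens
        if self.tokens > WARN_CUMULATIVE_ACTIVATION_TOKENS:
            logger.warning("activation tokens at %d", self.tokens)  # log-only threshold
```

The checks run before any state changes, so a rejected activation leaves the counters untouched.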

2026-05-14 — context-assembler.md v2 (minimum-cacheable-prefix rule)

Spec: context-assembler.md §5.1 (new), with rationale + decision log entries.
Change: v1’s prompt-cache breakpoint placement was honest but the natural Metis stable prefix (DEFAULT_SYSTEM_PROMPT + five built-in tools ≈ 265 heuristic tokens) tokenizes well below the effective haiku-4-5 cache floor — a live probe found a 3320-actual-token prefix produces cache_creation_input_tokens = 0 while a 4957-token prefix succeeds. v2 adds a §5.1 rule requiring SessionManager to pad the stable prefix to clear that effective floor with margin (MIN_CACHEABLE_PREFIX_TOKENS = 4500, MAX_CACHEABLE_PREFIX_TOKENS = 5500 heuristic tokens). Padding sources, in priority order: (1) loaded skill bodies in name-ascending order, (2) a static byte-stable _OPERATING_CONTEXT_PADDING block of Metis operating guidelines. Determinism is load-bearing — module-level constant; no per-call I/O. v1’s breakpoint placement, the two-segment system_prompt/system_prompt_volatile shape, and the breakpoint-on-last-stable-block rule are all unchanged. Live verification: scripts/smoke_cache.py --model haiku now passes with the natural Metis prompt (turn 1 writes 5167 cache tokens; turn 2 reads 5167). Benchmark Run 3 (benchmarks/RESULTS.md): cache fires on 49 of 49 LLM calls (100%) vs Run 2 cold’s 10 of 30 (33%); same-3-workload aggregate cost dropped 22.8%.
Type: additive. The §5.1 rule is a new section; v1’s existing rules in §1–§4 and §5 (preceding §5.1) are unchanged. Callers that pass a custom system_prompt already above the floor see §5.1 as a no-op.
References to verify:
- canonical-message-format.md §7 — adapter contract unchanged; CanonicalRequest.system_prompt / system_prompt_volatile shape unchanged. ✓
- analytics-api.md §4.2 — cache_effectiveness endpoint reads the same cache_creation_input_tokens / cached_input_tokens fields; no schema change. ✓
- skill-format.md — v2 §5.1 inlines skill bodies into the cached prefix when padding is needed, which is a deviation from agentskills.io “progressive disclosure” (discovery only, activation via skill_load). The decision log records the reasoning: progressive disclosure still applies to the discovery index; bodies are only inlined when the prefix needs the bytes to clear the floor. No skill-format spec change required. ✓
- benchmark.md §6.2 — variance tolerance (±5pp on savings_pct, ±2 llm_call_count) unchanged; Run 3 sits within tolerance against Run 2. ✓
Status: verified.
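
The §5.1 padding rule can be illustrated with the sketch below. The two constants and the source priority (skill bodies in name-ascending order, then a static block) come from the entry; the four-characters-per-token heuristic, the blank-line joiner, and the function `pad_stable_prefix` are assumptions, not the SessionManager implementation.

```python
MIN_CACHEABLE_PREFIX_TOKENS = 4500
MAX_CACHEABLE_PREFIX_TOKENS = 5500


def heuristic_tokens(text: str) -> int:
    """Assumed heuristic: roughly four characters per token."""
    return len(text) // 4


def pad_stable_prefix(prefix: str, skill_bodies: dict[str, str], static_padding: str) -> str:
    """Pad the prefix up to the floor without exceeding the ceiling."""
    text = prefix
    # Priority 1: loaded skill bodies, in name-ascending order for determinism.
    for name in sorted(skill_bodies):
        if heuristic_tokens(text) >= MIN_CACHEABLE_PREFIX_TOKENS:
            return text
        candidate = text + "\n\n" + skill_bodies[name]
        if heuristic_tokens(candidate) <= MAX_CACHEABLE_PREFIX_TOKENS:
            text = candidate
    # Priority 2: the static, byte-stable padding block.
    if heuristic_tokens(text) < MIN_CACHEABLE_PREFIX_TOKENS:
        candidate = text + "\n\n" + static_padding
        if heuristic_tokens(candidate) <= MAX_CACHEABLE_PREFIX_TOKENS:
            text = candidate
    return text


# Hypothetical inputs: a 265-token prefix, two skills, and a static block.
skills = {"b_skill": "y" * 8000, "a_skill": "x" * 8000}
padded = pad_stable_prefix("p" * 1060, skills, "z" * 4000)
already_large = pad_stable_prefix("q" * 20000, skills, "z" * 4000)
```

The function is pure and reads no files, in keeping with the entry's point that determinism is load-bearing: identical inputs always yield byte-identical prefixes.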

2026-05-14 — benchmark workload diversity v1 (two discriminating fixtures)

Spec: benchmark.md §4 (the suite).
Change: Two new workloads added under benchmarks/workloads/: regex-with-edge-cases (one-shot NANP regex against 16 labeled cases; locked-down iteration via max_tool_calls: 1 on the run turn) and multi-file-refactor-with-shared-types (7-file rename with an aliased import in legacy.py). Both ship evaluate: blocks with expect_substring_in_final_response so the heuristic judge gets an objective success signal. The shipped regex workload discriminates haiku-4-5 (0.25) vs sonnet-4-6 (1.00) at the workload-level score; the mfr workload scores 1.00 / 1.00 (parity datapoint, not a discriminator at the current model pair’s capability). Full numbers and the cost-per-success inversion are in benchmarks/RESULTS.md under “Workload diversity v1”. The benchmark spec’s §4 “V1 ships three workloads” table is now an undercount (six workloads ship via filesystem discovery, including the prior intentionally-failing-task control case) — descriptive drift rather than a contract change.
Type: additive. New fixtures discovered via the existing filesystem-based loader in scripts/benchmark.py; no harness or schema changes. The test that pins the discovered-workload set (apps/cli/tests/test_benchmark.py::test_shipped_workloads_load_clean) was updated to include the two new names — purely additive, no removal. Test count: 1029 passed (was 979; the +50 includes other parallel work landing during the same window).
References to verify:
- pattern-store.md §8.3 — the K-cluster aggregator formula now has an input distribution where success_mean_haiku < success_mean_sonnet. The mechanism was already implemented; the new fixture provides the first real-API distribution that triggers the cost-vs-success trade-off. ✓ (no spec change needed; section in RESULTS.md cites the formula).
- evaluator.md §5.4 — workload-level rubric’s expect_substring_in_final_response path is exercised by both new fixtures. The hybrid judge tier (just-landed) reads the same signals_extra plumbing, so these fixtures double as inputs to the LLM-judge upgrade. ✓
- benchmark.md §4 — the table listing v1’s three workloads is now an undercount (six workloads discovered). Worth a follow-up edit to either enumerate all six or note that discovery is filesystem-based; not blocking.
Status: verified.

2026-05-14 — evaluator: LLM-as-judge + hybrid escalation tier shipped

Spec: evaluator.md §5.2 (LLM rubric), §5.3 (hybrid escalation), §9.2 (/analytics/quality).
Change: LLM-as-judge tier landed at packages/metis-core/src/metis_core/eval/llm_judge.py (LLMJudge, HybridJudge, LLMJudgeConfig). Hybrid is the default for turn / workload subjects; tool_cycle / session remain heuristic-only per §5.5 / §5.6. Default escalation threshold = 0.7. Budget-exhausted LLM calls return a signals.budget_exhausted=True verdict (confidence=0); HybridJudge falls back to its heuristic verdict and records signals.escalation_skipped="budget_exhausted". New /analytics/quality endpoint (apps/server/src/metis_server/analytics.py) projects eval.completed over a window with group_by ∈ {model, judge_kind, rubric_id, none} and min_confidence filter; the chosen_model field joins via route.decided so the per-model rollup reflects the judged model, not the judge’s.
Type: additive (new classes, new endpoint, no breaking changes to existing heuristic path).
References to verify:
- event-bus-and-trace-catalog.md §6.12 — three eval.* payloads unchanged; new signals (budget_exhausted, escalation_skipped, heuristic_score, heuristic_confidence) all live in the opaque signals dict so the catalog contract is preserved. ✓
- pattern-store.md §10.4 — pattern store reads score + confidence only; new signals don’t affect that contract. ✓
- analytics-api.md — new /analytics/quality endpoint follows the standard envelope and error mapping. ✓
Status: verified.
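
The hybrid escalation flow described above might be sketched as follows. The 0.7 threshold and the signal names (budget_exhausted, escalation_skipped, heuristic_score, heuristic_confidence) come from this entry; the `Verdict` dataclass, the `hybrid_verdict` function, and the choice not to escalate at exactly 0.7 are assumptions rather than the HybridJudge code.

```python
from dataclasses import dataclass, field
from typing import Callable

ESCALATION_THRESHOLD = 0.7  # default from the entry above


@dataclass
class Verdict:
    score: float
    confidence: float
    signals: dict = field(default_factory=dict)


def hybrid_verdict(heuristic: Verdict, llm_judge: Callable[[], Verdict]) -> Verdict:
    """Keep a confident heuristic verdict; otherwise escalate to the LLM judge."""
    if heuristic.confidence >= ESCALATION_THRESHOLD:
        return heuristic
    llm = llm_judge()
    if llm.signals.get("budget_exhausted"):
        # Fall back to the heuristic verdict and record why escalation was skipped.
        return Verdict(
            heuristic.score,
            heuristic.confidence,
            {**heuristic.signals, "escalation_skipped": "budget_exhausted"},
        )
    return Verdict(
        llm.score,
        llm.confidence,
        {
            **llm.signals,
            "heuristic_score": heuristic.score,
            "heuristic_confidence": heuristic.confidence,
        },
    )


low = Verdict(0.667, 0.55)
escalated = hybrid_verdict(low, lambda: Verdict(0.2, 0.9))
skipped = hybrid_verdict(low, lambda: Verdict(0.0, 0.0, {"budget_exhausted": True}))
confident = hybrid_verdict(Verdict(1.0, 0.8), lambda: Verdict(0.0, 1.0))
```

Passing the LLM judge as a callable keeps the expensive call lazy, so confident heuristic verdicts never spend budget.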

2026-05-14 — evaluator: opt-in content penalty (refusal / empty response)

Spec: evaluator.md §5.1 (turn rubric), §5.4 (workload rubric).
Change: Added two signals to the heuristic judge: assistant_refusal_detected (×0.5 multiplicative penalty) and empty_assistant_response (×0.4). Both fire only when the caller plumbs final_response_text via SubjectContext.signals_extra — the bus subscriber path is unchanged. The workload rubric applies the same penalty (workload_assistant_refusal_detected, workload_empty_assistant_response) using the benchmark harness’s existing final_response_text plumbing. Motivation: the prior rubric was content-blind and would score a clean refusal 1.0 if no expect_substring_in_final_response was configured — Run 2’s “1.00 @ 0.80 on every workload” exposed the gap.
Type: additive (new optional signals; existing tests unchanged; rubric version pinned at 1.0.0 because no caller in the live online path plumbs the new key yet, so re-runs of metis evaluate --subject turn against existing trace DBs produce identical scores).
References to verify:
- pattern-store.md §10.4 — pattern store reads score only; new signals are in signals dict, not on the score contract. No change required. ✓
- benchmark.md §3.1 — evaluate: block schema unchanged; new fixture intentionally-failing-task added under benchmarks/workloads/ as a control case. ✓
Status: verified.
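
The opt-in multiplicative penalty can be sketched as below. The two multipliers, the signal names, and the rule that nothing fires unless final_response_text is plumbed come from the entry; the function `apply_content_penalty` and the toy refusal detector are hypothetical.

```python
from typing import Callable, Optional

REFUSAL_MULTIPLIER = 0.5  # assistant_refusal_detected
EMPTY_MULTIPLIER = 0.4    # empty_assistant_response


def toy_refusal_detector(text: str) -> bool:
    """Hypothetical detector; the real heuristic is not described in this entry."""
    return text.strip().lower().startswith(("i can't", "i cannot"))


def apply_content_penalty(
    score: float,
    signals_extra: Optional[dict],
    is_refusal: Callable[[str], bool] = toy_refusal_detector,
) -> tuple[float, dict]:
    """Return the adjusted score and any signals that fired."""
    if not signals_extra or "final_response_text" not in signals_extra:
        return score, {}  # opt-in: no plumbed text means no penalty
    text = signals_extra["final_response_text"]
    if not text.strip():
        return score * EMPTY_MULTIPLIER, {"empty_assistant_response": True}
    if is_refusal(text):
        return score * REFUSAL_MULTIPLIER, {"assistant_refusal_detected": True}
    return score, {}


unplumbed = apply_content_penalty(1.0, None)
refused = apply_content_penalty(1.0, {"final_response_text": "I cannot help with that."})
blank = apply_content_penalty(1.0, {"final_response_text": "   "})
```

The early return on missing text mirrors the reason the rubric version could stay pinned: callers that do not plumb the key get identical scores.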

2026-05-13 — evaluator v1 implementation (heuristic tier)

Spec: evaluator.md
Change: v1 heuristic implementation lands at packages/metis-core/src/metis_core/eval/ (HeuristicJudge + Evaluator bus subscriber + BudgetTracker + metis evaluate CLI). Subscribes to turn.completed / tool.completed / tool.failed / session.ended and emits eval.started / eval.completed / eval.failed. workload.yaml.evaluate block parsed by scripts/benchmark.py and fed to Evaluator.evaluate_workload() after each workload run — the quality score lands in the benchmark report. LLM-as-judge and hybrid escalation are deferred to a later wave per evaluator.md §5.2-5.3.
Type: additive (new module, new optional evaluate: block on workload.yaml, new metis evaluate subcommand).
References to verify:
- event-bus-and-trace-catalog.md §6.12 — three eval.* event payloads were added in Wave 4a (Task 4a-3). ✓
- benchmark.md §3.1 — evaluate: block documented. ✓ (this change)
- pattern-store.md §10.4 — pattern store’s update_score() flow expects eval.completed carrying subject_id (turn_id), score, confidence. ✓ (payload matches; pattern store is the read-side, evaluator the write-side).
Status: verified.

2026-05-13 — pattern-store v1 implementation

Spec: pattern-store.md
Change: v1 implementation lands at packages/metis-core/src/metis_core/patterns/ (structural fingerprint + similarity + K-NN aggregation + SQLite store + bus subscriber). Routing engine slot 4 (PATTERN_RECOMMENDATION) consults the store when a pattern_store_resolver is injected; pattern.recorded / pattern.matched / pattern.evicted events flow through the bus. Spec body unchanged; the three event payloads were added to events/payloads.py in Wave 4a (Task 4a-3). PatternConfig gains min_eval_confidence: float = 0.5 per pattern-store §15.4 reconciliation.
Type: additive (new module, new code-path on existing routing chain).
References to verify:
- routing-engine.md §5.5 — K-NN formula matches aggregation.py. ✓
- event-bus-and-trace-catalog.md §6.5b — three new pattern events were added in Wave 4a. ✓
Status: verified.

2026-05-08 — routing-engine v3.1

Spec: routing-engine.md
Change: Auxiliary event renamed (pattern.override_accepted → route.overridden); delegation phase asymmetry documented at §6 preamble.
Type: breaking (event name change), additive (phase note).
References to verify:
- event-bus-and-trace-catalog.md §6.5b — confirms the canonical event name. ✓
- Future: any client code rendering routing events. (No clients yet.)
Status: verified.

2026-05-08 — event-bus v2

Spec: event-bus-and-trace-catalog.md
Change: Multiple. Added route.overridden, bus.gap_detected, bus.subscriber_unregistered. Removed bus.handler_error, bus.overflow (moved to logs). Pattern domain split out as §6.5b. SQLite WAL + NORMAL committed. Memory snapshotter moved off fast path. Dynamic sensitivity on opt-in.
Type: breaking (event types removed/renamed).
References to verify:
- routing-engine.md — auxiliary event names. ✓ (handled by v3.1 above)
- streaming-protocol.md — events flowing through stream. Verified: streaming spec doesn’t enumerate specific event types beyond examples; safe.
Status: verified.

2026-05-08 — routing-engine v3

Spec: routing-engine.md
Change: Many; see v3 changelog in the spec header.
Type: mixed (breaking and additive).
References to verify:
- canonical-message-format.md §7.2 — AdapterCapabilities needs supports_tools, supports_system_prompt, supports_structured_output fields per routing v3 §4.4. Pending: canonical-format spec needs an additive update.
- event-bus-and-trace-catalog.md — route.decided.chain[].validation_failure enum values updated (added no_tool_support, no_system_prompt_support, no_structured_output_support). ✓ in v2.
Status: pending review (canonical-format AdapterCapabilities update).

2026-05-08 — Cross-spec reconciliation sweep (event-bus v3, streaming v2, others)

Several spec-boundary inconsistencies surfaced in cross-spec review and were resolved together:

Specs: all seven (canonical-message-format v1.1, event-bus-and-trace-catalog v3, streaming-protocol v2, provider-adapter-contract v1.1, tool-dispatcher v1.1, server-api v1.1, routing-engine v3.2).
Changes:
1. Streaming events declared as separate transient layer, not bus catalog events. Streaming server is no longer a bus subscriber for streaming events; it has two input channels (bus bridge for catalog events, direct from agent loop for streaming events). Domains message, text, thinking, tool.use_* reserved for streaming use only. (event-bus §4.5.1, streaming §5.1, provider-adapter §5.1)
2. Error class enums reconciled. llm.call_failed.error_class (catalog) extended to 8 values matching provider-adapter §6.1. tool.failed.error_class (catalog) extended to 8 values matching tool-dispatcher §6.1. (event-bus §6.3, §6.4)
3. tool.confirmation_requested and tool.confirmation_resolved added to catalog with full payloads (event-bus §6.4).
4. block_dropped confirmed as log-only, not a catalog event. canonical-format §4.2.2, §7.3, §11.1.6 updated to match.
5. AdapterCapabilities extended with supports_tools, supports_system_prompt, supports_structured_output, supports_prompt_caching (canonical-format §7.2), resolving the v3 pending review item.
6. provider_overrides removed from ToolDefinition (canonical-format §4.4) — unused everywhere.
7. RoutingDecisionRecord.mode documented as a coarse summary with explicit mapping to the routing chain enum (canonical-format §4.3).
8. Cancellation sequence split into three cases (cancel during LLM, during tool dispatch, at seam) in streaming-protocol §6.2. routing-engine §3.4 cross-references.
9. max_retries semantics pinned in provider-adapter §6.4: total attempts = 1 + max_retries.
10. routing_failed 503 body schema defined in server-api §4.2.
11. Tool factory-vs-singleton clarified in tool-dispatcher §3.1.
12. EventFrame cross-reference added in event-bus §5.4.
Type: mostly breaking (enum extensions, removed event types, field removals); some additive.
References to verify: all specs listed above cross-checked in this sweep.
Status: verified.
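
Change 9 pins total attempts = 1 + max_retries. A minimal sketch of that counting rule follows; `RetryableError` and `call_with_retries` are hypothetical names for illustration, not the adapter's actual API, and backoff is omitted.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


class RetryableError(Exception):
    """Hypothetical marker for a failure that is worth retrying."""


def call_with_retries(call: Callable[[], T], max_retries: int) -> T:
    """Run `call` at most 1 + max_retries times; re-raise the last failure."""
    if max_retries < 0:
        raise ValueError("max_retries must be >= 0")
    total_attempts = 1 + max_retries  # the first attempt is not a retry
    for attempt in range(1, total_attempts + 1):
        try:
            return call()
        except RetryableError:
            if attempt == total_attempts:
                raise  # attempts exhausted: surface the final error
    raise AssertionError("unreachable")
```

With max_retries=0 the call runs exactly once; with max_retries=2 a persistently failing call runs three times before the error propagates.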

2026-05-08 — Post-v3 micro-sweep (streaming-protocol numbering, project-overview diagram)

Followup to the cross-spec sweep — five small but real defects caught in review:

Specs: streaming-protocol (v2.1 conceptually; no version bump since changes are corrective), provider-adapter-contract (cross-ref fix), project-overview (architecture diagram + principle + spec list).
Changes:
1. Streaming-protocol §5 numbering fixed. Was 5.1 5.2 5.3 5.3 5.4 5.5; now 5.1 5.2 5.3 5.4 5.5 5.6. provider-adapter §5.4 and decision log cross-refs updated from §5.5 to §5.6.
2. §10.4 worked example rewritten to pick a specific case (tool dispatch per §6.2.2) and emit only events that case produces. Added note acknowledging the case split.
3. Cancellation tests in §11.1 split into 7 (LLM streaming, §6.2.1), 8 (tool dispatch, §6.2.2), 8b (seam, §6.2.3) — each asserts exactly the events that case produces.
4. EventFrame comment in §4.2 updated to “wraps any catalog or streaming event.”
5. Filter validation §3.2 and §9.3 updated: accepted set is the union of catalog and streaming-only event types. Test 13 wording tightened.
6. project-overview.md architecture diagram updated to show two channels (durable bus + transient streaming), the streaming server merging both, and the bus subscribers (trace store, cost accumulator, pattern) as a separate group. Core principle “Event bus as observability spine” rewritten as “Two-channel observability.” Components table adds a “Streaming Server” row.
7. project-overview.md spec list refreshed with current statuses (canonical-format v1.1, event-bus v3, streaming v2, routing v3.2, etc.). Added provider-adapter, tool-dispatcher, server-api, CHANGES.md to the list.
Type: corrective (numbering, contradictions in examples, stale visual) — no contract changes.
References to verify: none beyond the files updated above.
Status: verified.
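
Change 5 makes the accepted filter set the union of catalog and streaming-only event types. The sketch below illustrates that rule only; `validate_filter` and the streaming event names (`text.delta`, `thinking.delta`) are hypothetical, and the catalog set is a small sample rather than the full catalog.

```python
# Sample of catalog event types named elsewhere in this log.
CATALOG_EVENT_TYPES = frozenset(
    {"llm.call_completed", "route.decided", "turn.completed"}
)
# Hypothetical streaming-only names in the reserved text / thinking domains.
STREAMING_ONLY_EVENT_TYPES = frozenset({"text.delta", "thinking.delta"})

ACCEPTED_EVENT_TYPES = CATALOG_EVENT_TYPES | STREAMING_ONLY_EVENT_TYPES


def validate_filter(requested: list[str]) -> frozenset[str]:
    """Return the requested types, or raise if any is outside the union."""
    unknown = sorted(set(requested) - ACCEPTED_EVENT_TYPES)
    if unknown:
        raise ValueError(f"unknown event types in filter: {unknown}")
    return frozenset(requested)
```

A filter that mixes a catalog type with a streaming-only type passes; a log-only name such as block_dropped is rejected because it belongs to neither set.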

2026-05-12 — event-bus: `skill.loaded.source` added

Spec: event-bus-and-trace-catalog.md §6.6.
Change: Added source: Literal["global", "workspace"] to skill.loaded payload so traces record which directory served the skill after the workspace-overrides-global merge.
Type: additive. Existing consumers ignore unknown fields; no migration required for stored events. The field reads as None on records written before this entry because the typed struct defaults it to None, although all in-process emitters set it.
References to verify:
- skill-format.md (planned) — when that spec lands, document source alongside the other fields. Note pending below.
Status: verified (event-bus spec updated in this change; implementation in packages/metis-core/src/metis_core/events/payloads.py::SkillLoaded + emitter in packages/metis-core/src/metis_core/skills/tools.py::SkillLoadTool).
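
To illustrate why stored events need no migration, here is a sketch of an optional field with a None default. It uses a stdlib dataclass as a stand-in for the real typed struct; the `name` field and `skill_loaded_from_record` helper are assumptions for illustration, and only `source` and its two literal values come from this entry.

```python
from dataclasses import dataclass
from typing import Literal, Optional


@dataclass(frozen=True)
class SkillLoaded:
    name: str  # hypothetical field; the real payload shape is in the spec
    source: Optional[Literal["global", "workspace"]] = None


def skill_loaded_from_record(record: dict) -> SkillLoaded:
    """Rebuild a payload from a stored record; older records lack `source`."""
    return SkillLoaded(name=record["name"], source=record.get("source"))


old = skill_loaded_from_record({"name": "lint"})
new = skill_loaded_from_record({"name": "lint", "source": "workspace"})
```

Readers that predate the field keep working, and readers that know about it must tolerate None on older rows.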

2026-05-12 — analytics-api.md v1 drafted

Spec: new analytics-api.md v1.
Change: Adds a read-only /analytics/* HTTP namespace extending server-api.md. Endpoints derive metrics from the existing events, messages, and sessions tables — no new persistent state, no new bus events, no new write paths. Endpoints: /cost, /cache_effectiveness, /routing, /reliability, /sessions, /turns/{id}, /savings. Pricing semantics are hybrid: actuals honor stamped pricing_version; the savings counterfactual re-prices both numerator and denominator under the current PriceTable.
Type: additive (new endpoints; no contract change to existing specs).
References to verify:
- server-api.md — analytics namespace lives on the same Starlette app and inherits the loopback-only / no-auth posture. No edit required; cross-reference only.
- event-bus-and-trace-catalog.md — analytics queries depend on the llm.call_completed, llm.call_failed, route.decided, and turn.completed payload shapes. Any future change to those payloads must update the relevant analytics endpoint and its SQL. No edit required now.
- routing-engine.md §5.3.1 — known asymmetry between cost_today_exceeds_usd (UTC midnight) and the dashboard’s “today” (local TZ). Documented in analytics-api §3.1; not aligning until evidence of confusion.
Status: verified (no dependent specs need edits in this change).

2026-05-13 — benchmark.md v1 drafted

Spec: new benchmark.md v1.
Change: Defines a reproducible workload suite + measurement methodology that turns /analytics/savings.actual_repriced_usd / baseline_repriced_usd into a credible “saved X%” number — the artifact the project strategy (private) named as the biggest gap between architecture and proof. Specifies the workload model (per-workload YAML script + bundled fixture workspace under benchmarks/workloads/), the v1 suite (three workloads: fix-a-bug-small, write-a-doc-from-notes, multi-turn-refactor), reproducibility rules (pinned commit SHA, PriceTable.version, resolved model ids, temperature=0), and report shape. Adds scripts/benchmark.py (drives the loop) and bundled workload fixtures. Plumbs a temperature: float | None = None kwarg through SessionManager.submit_turn → CanonicalRequest.temperature so the determinism rule is enforceable.
Type: additive (new spec; new optional kwarg on submit_turn defaulting to None preserves existing behavior).
References to verify:
- analytics-api.md §4.7 — the savings response shape this spec consumes. No edit required.
- provider-adapter-contract.md (planned) — when drafted, document that adapters honor CanonicalRequest.temperature when set. Native Anthropic/OpenAI/OpenRouter adapters already do.
- event-bus-and-trace-catalog.md — the llm.call_completed / turn.completed payloads are the source rows for the benchmark’s projection. No edit required.
- the project strategy (private) — §6.4 resolved (pointer to this spec); §5 dated entry added.
Status: verified (no dependent spec edits required in this change; the project strategy (private) updated in the same change).
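
As a rough illustration of how the two re-priced totals could become a "saved X%" figure, the sketch below assumes the fraction is 1 - actual / baseline. That formula and the `savings_fraction` name are assumptions; the authoritative definition lives in benchmark.md and analytics-api.md §4.7.

```python
from decimal import Decimal


def savings_fraction(
    actual_repriced_usd: Decimal, baseline_repriced_usd: Decimal
) -> Decimal:
    """Fraction of the baseline cost avoided (assumed: 1 - actual / baseline)."""
    if baseline_repriced_usd <= 0:
        raise ValueError("baseline_repriced_usd must be positive")
    return Decimal(1) - actual_repriced_usd / baseline_repriced_usd


fraction = savings_fraction(Decimal("0.30"), Decimal("1.20"))
```

Both inputs are re-priced under the same PriceTable, so the ratio compares like with like even when historical rows were stamped with older pricing versions.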

2026-05-13 — context-assembler.md v1 drafted

Spec: new context-assembler.md v1 (scope: cache-breakpoint placement only).
Change: Specifies the two-segment system prompt on CanonicalRequest (system_prompt stable + new system_prompt_volatile for MEMORY.md / USER.md-shaped content), and where adapters place provider cache breakpoints. Anthropic adapter writes cache_control: {"type": "ephemeral"} on the last tool definition and on the last stable system block. OpenAI relies on automatic prefix-match caching; the adapter preserves prefix stability (system → tools → messages order, volatile content concatenated at the end of the system text). OpenRouter passes through markers but declares supports_prompt_caching=False because cache behavior depends on the upstream route. Validation surface is /analytics/cache_effectiveness (analytics-api.md §4.2) plus a scripts/smoke_cache.py 2-turn live-API test that asserts cached_input_tokens > 0 on turn 2.
Type: additive. New optional system_prompt_volatile and workspace_path fields on CanonicalRequest default to None and preserve existing behavior. The cache_control markers don’t change the request’s semantic meaning for any provider that doesn’t recognize them.
References to verify:
- canonical-message-format.md §7.2 — AdapterCapabilities.supports_prompt_caching is the routing-engine substitutability gate this spec leans on. No edit required; the field already exists.
- provider-adapter-contract.md (planned) — when drafted, document that adapters supporting prompt caching write the breakpoints described in §3 of context-assembler.md.
- analytics-api.md §4.2 — the cache-effectiveness view is the validation surface; hit_rate > 0 after a multi-turn Anthropic session signals the lever has landed. No edit required.
- KNOWN_ISSUES.md — “No prompt-caching strategy” entry retired; replaced by this spec + implementation. ✓ in this change.
Status: verified (no dependent spec edits required; KNOWN_ISSUES.md updated in the same change).
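
The Anthropic breakpoint placement described above can be sketched as follows. `place_cache_breakpoints` is a hypothetical helper, not the adapter's real function, and appending the volatile segment as an unmarked trailing system block is an assumption about how the two-segment prompt maps onto the request.

```python
import copy
from typing import Any, Optional

EPHEMERAL = {"type": "ephemeral"}


def place_cache_breakpoints(
    system_prompt: str,
    system_prompt_volatile: Optional[str],
    tools: list[dict[str, Any]],
) -> dict[str, Any]:
    """Mark the last tool and the last stable system block as cache breakpoints."""
    tools_out = copy.deepcopy(tools)  # never mutate the caller's definitions
    if tools_out:
        tools_out[-1]["cache_control"] = dict(EPHEMERAL)
    system_blocks: list[dict[str, Any]] = [
        {"type": "text", "text": system_prompt, "cache_control": dict(EPHEMERAL)}
    ]
    if system_prompt_volatile:
        # Assumed: volatile content follows the breakpoint and carries no marker.
        system_blocks.append({"type": "text", "text": system_prompt_volatile})
    return {"system": system_blocks, "tools": tools_out}


request = place_cache_breakpoints(
    "stable instructions", "MEMORY.md contents", [{"name": "a"}, {"name": "b"}]
)
```

Keeping volatile content after the marked block means edits to MEMORY.md-shaped text do not invalidate the cached prefix.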

2026-05-13 — deployment-shape.md v1 + gateway.md v0 drafted

Specs: new deployment-shape.md v1 (recommendation), new gateway.md v0 (skeleton, paired).
Change: deployment-shape.md recommends the hybrid deployment (gateway first → agent upgrade) to resolve the architectural fork and the open question in the project strategy (private). gateway.md is the v0 skeleton of the HTTP gateway surface it implies: OpenAI-shape (and Anthropic-shape) inbound endpoints, request-translation contracts that explicitly guard against the LiteLLM tool_use / cache_control / thinking-block hazards listed in docs/market-research/03-routing-layers.md, per-request stateless routing via the existing engine, and an enumerated non-feature list (no context shaping, no skill loading, no memory composition) that preserves the agent’s upgrade-tier value proposition.
Type: additive (two new specs; no contract changes to existing specs). gateway.md §6 describes additive payload fields (gateway_key_id, inbound_shape) on existing llm.call_completed and turn.completed events — those land only when the gateway implementation does.
References to verify:
- the project strategy (private) §3 (resolution note added at top), §5 (new dated entry), §6.1 (retired with resolution pointer), §6.3 (narrowed: gateway-first implies deployed-instance posture). ✓ landed in this change.
- provider-adapter-contract.md — AdapterCapabilities already carries the fields the gateway needs (supports_tools, supports_prompt_caching, etc.). No edit required.
- routing-engine.md — 7-slot chain semantics in stateless gateway path documented in gateway.md §5.1. No edit required; cross-reference only.
- event-bus-and-trace-catalog.md — additive payload fields (gateway_key_id, inbound_shape) documented in gateway.md §6 will need to land in the payload registry when the gateway implementation does. Flagged as pending below.
- analytics-api.md — adding gateway_key as a group_by dimension on /analytics/cost is a future additive change; not part of this entry.
Status: verified (owner sign-off 2026-05-13; the project strategy (private) edits landed in the same change). Implementation-time payload-field additions to event-bus-and-trace-catalog.md remain pending below.

2026-05-14 — event-bus catalog v3.1: pattern.* and eval.* payloads landed

Spec: event-bus-and-trace-catalog.md (v3 → v3.1).
Change: Six new typed payloads landed in packages/metis-core/src/metis_core/events/payloads.py and PAYLOAD_REGISTRY ahead of the implementation in Batch 4b (Wave 4); the catalog spec is updated to match.
- Pattern domain (§6.5b extended) — pattern.recorded, pattern.matched, pattern.evicted per pattern-store.md §10. All pseudonymous. Phase 2.5.
- New eval domain (§6.12; closed-list extension in §4.5) — eval.started, eval.completed, eval.failed per evaluator.md §8. All pseudonymous floor; eval.completed admits opt-in uplift to user_controlled per §4.4.1 when signals.rationale_redacted is populated.
- Decimal serialization. PatternRecorded.cost_usd_at_record and EvalCompleted.judge_cost_usd use Decimal, serialized as strings via msgspec.to_builtins, matching the Usage.cost_usd convention from canonical-message-format.md §6.4.
- Field-name divergence from pattern-store.md §10.1. The catalog and implementation use cost_usd_at_record rather than the spec’s cost_usd to disambiguate from llm.call_completed.cost_usd and to follow the codebase’s Decimal convention. Field names otherwise match pattern-store.md §10 and evaluator.md §8/§10 as currently drafted; the Task 4a-2 reconciliation sweep may adjust further.
- Tests added in packages/metis-core/tests/events/test_payloads.py cover registry membership, round-trip (to_builtins → convert) for each new payload, make_event type↔payload binding, and the sensitivity-uplift path for eval.completed.
Type: additive. No existing payload shape changed; no existing event removed or renamed. New typed payloads do not fire from any subscriber yet (Batch 4b lands PatternStore and Evaluator implementations + bus wiring).
References to verify:
- pattern-store.md §10.1 — landed payload uses cost_usd_at_record (Decimal) rather than the drafted cost_usd (float). Reconcile name + type in the Wave 4 sweep; either update the spec to match the catalog or back out of the rename.
- evaluator.md §8 — payload fields and Decimal cost convention match the spec verbatim. signals is the opaque dict the spec specified; sensitivity uplift is wired via the existing make_event(..., sensitivity=...) override path. No edit required.
- routing-engine.md §5.5 — pattern-domain events do not change the routing chain payload; pattern.matched is queryable separately from route.decided. No edit required.
- analytics-api.md §4.6 — /analytics/turns/{id} and the planned /analytics/quality endpoint will join eval.completed.subject_id against turn_id. No edit required until the analytics endpoint lands.
Status: pending review (the catalog edits and typed payloads have landed for both pattern-store.md and evaluator.md; pattern-store.md §10.1 field rename + Wave 4 reconciliation per the two earlier entries below remain open).
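
The Decimal-as-string convention can be illustrated with the standard library alone. This sketch uses `json` as a stand-in for the msgspec path the entry describes; `encode_cost` and `decode_cost` are hypothetical helpers, and only the `cost_usd_at_record` field name comes from the entry.

```python
import json
from decimal import Decimal


def encode_cost(cost: Decimal) -> str:
    """Serialize a cost as a string so no precision is lost on the wire."""
    return str(cost)


def decode_cost(raw: str) -> Decimal:
    """Parse a cost string back into an exact Decimal."""
    return Decimal(raw)


wire = json.dumps({"cost_usd_at_record": encode_cost(Decimal("0.000135"))})
restored = decode_cost(json.loads(wire)["cost_usd_at_record"])

# Binary floats drift under addition; Decimal values do not.
float_drifts = (0.1 + 0.2) != 0.3
decimal_exact = (Decimal("0.1") + Decimal("0.2")) == Decimal("0.3")
```

Summing many small per-call costs is the case where float drift would otherwise show up in stored totals.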

2026-05-13 — pattern-store.md v1 drafted

Spec: new pattern-store.md v1 (specs-only; no implementation).
Change: Defines the per-workspace, bounded SQLite-backed store of task fingerprints + outcomes that powers routing slot 4 (PATTERN_RECOMMENDATION) per routing-engine.md §5.5. Specifies:
- (a) per-turn fingerprinting unit with a v1 structural-only feature set (file extensions, tool names, side-effect classes, token-bucket, intent regex tags) and an embedding-provider-abstract v2 hybrid path that lands data-only;
- (b) <workspace>/.metis/patterns.db storage with WAL + synchronous=NORMAL mirroring the trace store;
- (c) bounded caps (5k soft / 10k hard / 180-day age) where hard-cap auto-evicts rather than rejects writes, asymmetric with memory-store.md because pattern writes are mechanical projections with no agent-curation step;
- (d) K-NN retrieval with weighted Jaccard similarity + sample-size-weighted cluster aggregation, implementing routing-engine.md §5.5 scoring verbatim;
- (e) three new event types (pattern.recorded, pattern.matched, pattern.evicted) added to event-bus-and-trace-catalog.md §6.5b;
- (f) decimal cost preservation with pricing_version_last for future reprice;
- (g) workspace isolation (multi-user / cross-workspace explicitly out of scope per the project strategy (private), §6.6).
Closes the project strategy (private)’s “pattern store mechanics” deferral; one routing-engine.md §5.5 ambiguity flagged in pattern-store §13.7 (sample-size weighting).
Type: additive (new spec; three new event types to be added to event-bus catalog at Phase 2.5 implementation time; no contract changes to existing specs).
References to verify:
- routing-engine.md §5.5 — sample-size weighting in K-cluster aggregation is unspecified there; pattern-store §8.4 picks weighted means as v1 interpretation. Needs a one-line clarification in routing-engine.md to either pin or back out. Flagged in pattern-store §15.6.
- event-bus-and-trace-catalog.md §6.5b — three new event types (pattern.recorded, pattern.matched, pattern.evicted) to be added when the Phase 2.5 implementation lands. Sensitivity is pseudonymous for all three; parent linkages documented in pattern-store §10. Catalog edit pending; flagged below.
- evaluator.md (parallel draft by Agent 3B) — pattern-store §15 enumerates the touchpoints assumed: EvaluationResult shape consumed by the session-ended subscriber, sync vs async score timing decision, update_score() API for late-arriving scores if async. Reconcile in Wave 4 sweep.
- memory-store.md — used as the reference shape for goals/non-goals/caps/eviction structure; no edit required.
- analytics-api.md §4.7 — re-pricing math precedent followed; no edit required.
- the project strategy (private) — “pattern store mechanics” open question resolved with pointer to this spec; §5 should record the decision in the same change. Owner update pending.
Status: pending review (catalog additions land with Phase 2.5 implementation; routing-engine §5.5 clarification and evaluator.md reconciliation tracked below).
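
For readers unfamiliar with weighted Jaccard similarity, here is a sketch of the common min/max form over weighted feature maps. The feature keys and weights are hypothetical, and the exact weighting scheme pattern-store.md uses may differ from this generic version.

```python
from typing import Mapping


def weighted_jaccard(a: Mapping[str, float], b: Mapping[str, float]) -> float:
    """Sum of per-feature minimums divided by sum of per-feature maximums."""
    keys = set(a) | set(b)
    numerator = sum(min(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    denominator = sum(max(a.get(k, 0.0), b.get(k, 0.0)) for k in keys)
    return numerator / denominator if denominator else 0.0


# Hypothetical fingerprints: feature name -> weight.
fp_a = {"ext:.py": 1.0, "tool:read_file": 2.0}
fp_b = {"ext:.py": 1.0, "tool:write_file": 1.0}
similarity = weighted_jaccard(fp_a, fp_b)
```

Here the shared feature contributes 1.0 to the numerator and the three distinct features contribute 4.0 to the denominator, so the similarity is 0.25.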

2026-05-13 — evaluator.md v1 drafted

Spec: new evaluator.md v1 (specs-only; no implementation).
Change: Defines the heuristic-first / hybrid-LLM-as-judge feedback loop that resolves the project strategy (private): “the feedback loop that proves savings — without it, ‘is the system actually saving money vs naive sonnet-everywhere?’ stays an open question forever.” Specifies:
- (a) four subject kinds (turn, tool_cycle, session, workload); the workload subject subsumes the v1 limitation flagged in benchmark.md §2.2.2;
- (b) verdict shape (EvalVerdict msgspec.Struct(frozen=True): single score in [0, 1], confidence as a gate, Decimal judge_cost_usd, versioned rubric_id + rubric_version, opaque signals dict for judge-specific evidence);
- (c) three judge tiers (heuristic ($0), LLM-as-judge (small model by default), hybrid escalation with a single escalation_threshold knob);
- (d) bus subscriber on turn.completed / tool.completed / tool.failed / session.ended / feedback.explicit as non-fast-path, plus a metis evaluate CLI for batch re-evaluation;
- (e) three new event types (eval.started, eval.completed, eval.failed) and a new eval domain to be added to event-bus-and-trace-catalog.md §4.5 / §6 at implementation time;
- (f) per-session ($0.10 default) and per-day ($1.00 default) judge_cost_usd caps + workspace kill-switch;
- (g) one new analytics endpoint (/analytics/quality) and an additive include_eval parameter on /analytics/cost;
- (h) re-evaluation is append-only (every verdict is a new event), enabling the dashboard’s “evaluator agreement rate over time” view as a query, not a side-table;
- (i) workload rubric integrates with benchmark.md via a new optional evaluate: block in workload.yaml;
- (j) workspace-scoped single-user per the project strategy (private), no labeled training data, no LLM-as-judge in the critical path.
evaluator.md §15 enumerates the coordination touchpoints with the parallel pattern-store.md draft for the Wave 4 reconciliation.
Type: additive (new spec; three new event types + new eval domain to be added to event-bus catalog at Phase 3 implementation time; one new analytics endpoint + additive include_eval param + additive evaluations array on /analytics/turns/{id}; no contract changes to existing specs).
References to verify:
- event-bus-and-trace-catalog.md §4.5 (closed domain list) and §6 — new eval domain plus three event types (eval.started, eval.completed, eval.failed) to be added when the Phase 3 implementation lands. Sensitivity floor pseudonymous; eval.completed can uplift to user_controlled on opt-in signals.rationale_redacted per §4.4.1. Catalog edit pending; flagged below.
- routing-engine.md §5.5 — pattern-store consumption of eval.completed.score as success_score; existing math reads one number, no edit required. The confidence-gate filter convention (pattern.min_eval_confidence) is documented in evaluator.md §4.3 and §11.1 as a pattern-store-side configuration; cross-check against pattern-store.md.
- analytics-api.md §4.1 / §4.6 — additive include_eval query parameter on /analytics/cost; additive evaluations array on /analytics/turns/{id}.data. Existing consumers ignore unknown fields per the additive convention. No edit required now; document at implementation time. Analytics spec edit pending.
- benchmark.md §2.2.2 — v1 “no quality scoring of outputs” limitation closed by this spec via the workload subject. New optional workload.yaml.evaluate: block (rubric, expect_substring_in_final_response, llm_judge_model, weight_per_turn) is additive to the schema in benchmark.md §3.1 — when the evaluator implementation lands, benchmark.md §3.1 should add the evaluate: block to the schema and benchmark.md §8 should add the quality column to the report. Benchmark spec edit pending.
- canonical-message-format.md §6.4 — Decimal cost-as-string serialization convention reused for judge_cost_usd in event payloads. No edit required; cross-reference only.
- pattern-store.md (parallel draft by Agent 3A) — evaluator.md §15 lists the touchpoints assumed (verdicts on bus, score as one number, confidence-gate filter, MAX(eval_id) per subject as “latest verdict,” join chosen_model from route.decided rather than embedding in verdict). Reconcile in Wave 4 sweep.
- the project strategy (private) — “evaluator scope” open question resolved with pointer to this spec; §5 should record the decision in the same change. Owner update pending.
Status: pending review (catalog additions land with Phase 3 implementation; benchmark.md / analytics-api.md / the project strategy (private) edits and pattern-store.md reconciliation tracked below).
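
One plausible reading of the hybrid tier is sketched below: run the free heuristic first and escalate to the LLM judge only when the heuristic's confidence falls below escalation_threshold. That trigger condition is an assumption, `Verdict` is a simplified stand-in for EvalVerdict, and `hybrid_judge` is a hypothetical name.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import Callable


@dataclass(frozen=True)
class Verdict:
    """Simplified stand-in for EvalVerdict (rubric and signals omitted)."""
    score: float
    confidence: float
    judge_cost_usd: Decimal


Judge = Callable[[str], Verdict]


def hybrid_judge(
    subject_id: str, heuristic: Judge, llm_judge: Judge, escalation_threshold: float
) -> Verdict:
    """Return the heuristic verdict unless its confidence is below the threshold."""
    first = heuristic(subject_id)
    if first.confidence >= escalation_threshold:
        return first  # confident enough: no paid judge call
    second = llm_judge(subject_id)
    # Assumed: the reported cost covers both judges.
    total_cost = first.judge_cost_usd + second.judge_cost_usd
    return Verdict(second.score, second.confidence, total_cost)


cheap = lambda _subject: Verdict(0.9, 0.4, Decimal("0"))
paid = lambda _subject: Verdict(0.6, 0.9, Decimal("0.002"))
escalated = hybrid_judge("turn-1", cheap, paid, escalation_threshold=0.7)
kept = hybrid_judge("turn-1", cheap, paid, escalation_threshold=0.3)
```

A single threshold keeps the cost behaviour easy to reason about: raising it sends more subjects to the paid judge, lowering it sends fewer.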

2026-05-14 — Pattern-store ↔ evaluator reconciliation sweep

Wave 3 produced pattern-store.md and evaluator.md in parallel. Each spec’s §15 listed touchpoints assumed about the other surface. This sweep walks those touchpoints and pins the reconciled contract, following the 2026-05-08 cross-spec reconciliation pattern.

Specs: pattern-store.md, evaluator.md, routing-engine.md.
Changes:
1. Verdict shape ownership. EvalVerdict (evaluator.md §4.1) is the canonical shape; pattern-store.md §15.1 references it verbatim and stops re-specifying. The pattern store consumes subject_id (the turn_id), score, confidence, and eval_id; everything else (signals, judge_kind, rubric_id) is opaque pass-through.
2. Async score timing. Pattern-store record() writes outcomes immediately on session.ended with success_score=None; an eval.completed subscriber later calls PatternStore.update_score(turn_id, score, confidence, eval_id, pricing_version) to fold the verdict into the outcome accumulator. Idempotence is keyed by eval_id. Re-evaluation produces a new eval_id and rolls back the prior contribution before applying the new score. Documented in pattern-store.md §10.4 and §15.3; cross-referenced from evaluator.md §15. Join key: turn_id.
3. Confidence-gate filter home. pattern.min_eval_confidence lives in pattern-store config (routing.yaml::pattern.* block) alongside cost_weight / min_confidence / min_sample_size. Default 0.5 (matches the value declared in evaluator.md §4.3). The evaluator emits all verdicts; the pattern store applies the gate at K-cluster aggregation time. Verdicts below the gate stay queryable in the trace store for the agreement-rate view. Documented in pattern-store.md §15.4; cross-referenced from evaluator.md §15.
4. Sample-size-weighted mean pinned in routing-engine.md §5.5. One-line clarification: normalized_success_M = Σ(success_score_i × sample_size_i) / Σ(sample_size_i). A neighbor row with 50 contributing sessions weights 50× a single-shot row. This was the v1 interpretation pattern-store.md §8.4 already designed to; pinning it in the routing spec removes the open ambiguity called out in pattern-store.md §13.7.
5. MAX(eval_id) as the latest-verdict rule. Documented in pattern-store.md §10.4 alongside the update_score() flow. Re-evaluation produces a new eval.completed with a fresh eval_id; pattern-store consumers join on MAX(eval_id) per subject to surface the latest verdict. Aligned with evaluator.md §4.6 and §11.1.
Type: spec reconciliation (no contract breaks; clarifications + consolidated ownership of shared shapes).
References to verify:
- routing-engine.md §5.5 — sample-size-weighted clarification landed in this change. ✓
- pattern-store.md §10.4, §15 — async flow + update_score() + confidence-gate filter + MAX(eval_id) rule documented. ✓
- evaluator.md §15 — reconciliation table reflects pinned outcomes; open coordination items closed. ✓
- the project strategy (private), §6.6, §6.7 — retired entries for “pattern store mechanics” and “evaluator scope” with pointers to the drafted specs. ✓
Status: verified. Phase 2.5 / Phase 3 implementation-time catalog additions to event-bus-and-trace-catalog.md §4.5 / §6 remain pending (tracked below under the original pattern-store and evaluator entries).
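
Changes 2 and 4 can be sketched together. `normalized_success` follows the pinned formula directly; `TurnScores` is a hypothetical in-memory illustration of idempotence keyed by eval_id with rollback on re-evaluation. It drops the confidence and pricing_version arguments of the real update_score() and ignores out-of-order delivery.

```python
from typing import Iterable, Optional


def normalized_success(rows: Iterable[tuple[float, int]]) -> Optional[float]:
    """Sample-size-weighted mean over (success_score, sample_size) rows."""
    rows = list(rows)
    total = sum(size for _, size in rows)
    if total == 0:
        return None  # no scored samples to aggregate
    return sum(score * size for score, size in rows) / total


class TurnScores:
    """Tracks one applied verdict per turn (illustrative only)."""

    def __init__(self) -> None:
        self._applied: dict[str, tuple[str, float]] = {}
        self.score_sum = 0.0
        self.scored_turns = 0

    def update_score(self, turn_id: str, score: float, eval_id: str) -> bool:
        """Apply a verdict; return False if this eval_id was already applied."""
        prior = self._applied.get(turn_id)
        if prior is not None:
            prior_eval_id, prior_score = prior
            if prior_eval_id == eval_id:
                return False  # duplicate delivery: no-op
            # Re-evaluation: roll back the prior contribution first.
            self.score_sum -= prior_score
            self.scored_turns -= 1
        self._applied[turn_id] = (eval_id, score)
        self.score_sum += score
        self.scored_turns += 1
        return True
```

For example, rows (1.0, 3) and (0.0, 1) aggregate to 0.75, since the three-sample row counts three times as much as the single-shot row.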

2026-05-13 — skill-format.md v1 drafted (retrospective)

Spec: new skill-format.md v1 (specs-only; documents the existing implementation in packages/metis-core/src/metis_core/skills/).
Change: Captures retrospectively what the skills loader / store / tools already do: agentskills.io-conformant six-field frontmatter (name, description, license, compatibility, metadata, allowed-tools); SKILL.md directory layout with scripts/ / references/ / assets/ siblings; two on-disk roots (~/.metis/skills/ global, <workspace>/.metis/skills/ workspace) merged workspace-overrides-global; three-stage progressive disclosure (discovery index in stable system prompt → skill_load activation → execution); two tools (skill_search / skill_load) both SideEffects.READ; skill.loaded event emission semantics including the source field added 2026-05-12. Surfaces seven implementation observations (name-validation error message wording; metadata scalar coercion; unbounded discovery index; no reload-on-change; hidden dirs not excluded; symlinks followed; allowed-tools parsed-not-enforced) in §11 for triage, not fixed in this change. Follows the memory-store.md retro-spec pattern.
Type: additive (new spec; no code or contract changes). Resolves the pending cross-reference for skill.loaded.source (added 2026-05-12) by documenting the field alongside the rest of the payload.
References to verify:
- event-bus-and-trace-catalog.md §6.6 — skill.loaded payload (including source) documented in skill-format.md §9.1. No edit required; cross-reference only. ✓
- tool-dispatcher.md (planned) — ToolContext.skills field carries the per-session SkillStore; skill-format.md §8 documents the two tools’ registration / dispatch semantics. No edit required.
- context-assembler.md §2-§5 — discovery index injected into the stable system prompt segment ahead of the cache breakpoint; skill-format.md §7.1 cross-references. No edit required.
- project-overview.md — spec list refresh: skill-format.md line at § “Specs and documents” should move from “Planned” to “Drafted (v1, 2026-05-13)”. Defer to next doc-refresh pass.
- the project strategy (private) — “skills” cost lever (one of three in §2) is now spec-backed; no narrative change required.
Status: verified. The “Pending cross-references” entry for skill-format.md (skill.loaded.source field, 2026-05-12) is resolved by skill-format.md §9.1 and §10.6 and removed below.
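
The workspace-overrides-global merge can be sketched as a dictionary built in precedence order. `discover_skills` is a hypothetical helper, and keying skills by directory name is an assumption made for brevity; the entry above describes a name field in the SKILL.md frontmatter.

```python
import tempfile
from pathlib import Path


def discover_skills(
    global_root: Path, workspace_root: Path
) -> dict[str, tuple[str, Path]]:
    """Map skill name -> (source, directory); workspace entries win on collision."""
    merged: dict[str, tuple[str, Path]] = {}
    # Later roots overwrite earlier ones, so workspace is visited last.
    for source, root in (("global", global_root), ("workspace", workspace_root)):
        if not root.is_dir():
            continue
        for skill_dir in sorted(root.iterdir()):
            if (skill_dir / "SKILL.md").is_file():
                merged[skill_dir.name] = (source, skill_dir)
    return merged


with tempfile.TemporaryDirectory() as tmp:
    global_root = Path(tmp) / "home" / ".metis" / "skills"
    workspace_root = Path(tmp) / "repo" / ".metis" / "skills"
    for root, names in ((global_root, ["lint", "deploy"]), (workspace_root, ["lint"])):
        for name in names:
            (root / name).mkdir(parents=True)
            (root / name / "SKILL.md").write_text("placeholder\n")
    merged = discover_skills(global_root, workspace_root)
    sources = {name: source for name, (source, _) in merged.items()}
```

The recorded source per skill is what the skill.loaded.source field would then report for the directory that actually served it.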

2026-05-12 — Implementation milestone + doc refresh

Not a spec change; an alignment pass between the docs and what’s actually been built.

Files touched: README.md, docs/project-overview.md, docs/specs/project-overview.md, new the project strategy (private), new docs/KNOWN_ISSUES.md, new docs/specs/memory-store.md.
What landed in code since the last doc refresh: three provider adapters (Anthropic / OpenAI / OpenRouter), streaming end-to-end (adapter → session manager → CLI + WebSocket), Textual TUI, HTTP/WebSocket server (metis serve, loopback-only), SQLite session/message persistence, bounded memory (MEMORY.md / USER.md + 3 tools), skills store + load_skill tool, configured-rule parser (yaml policy + predicate set + loader; integration into routing chain pending), cross-provider conformance suite. Test count went from 272 → 592.
Spec-list status changes: memory-store.md moved from “planned” to “drafted (v1).” skill-format.md and pattern-store.md remain planned.
New strategy artifacts: the project strategy (private) captures the cost-optimization thesis, buyer ≠ user framing, three cost levers (skills / context / model selection), and the open replacement-agent-vs-gateway question. docs/KNOWN_ISSUES.md tracks carryover review findings (spec promises not yet honored by code).
References to verify: none in specs proper.
Status: doc-only update.

Pending cross-references

Entries whose cross-references are still open are listed here for visibility. When the dependent spec is updated, mark the original entry “verified” and remove its line from this list.

pattern-store.md v1 (2026-05-13) — three new event types (pattern.recorded, pattern.matched, pattern.evicted) to land in event-bus-and-trace-catalog.md §6.5b when Phase 2.5 implementation does. Routing-engine §5.5 sample-size-weighting clarification and evaluator.md reconciliation verified 2026-05-14 (see “Pattern-store ↔ evaluator reconciliation sweep” above).
evaluator.md v1 (2026-05-13) — new eval domain + three event types (eval.started, eval.completed, eval.failed) to land in event-bus-and-trace-catalog.md §4.5 / §6 when Phase 3 implementation does. New /analytics/quality endpoint + additive include_eval param on /analytics/cost + additive evaluations array on /analytics/turns/{id} to land in analytics-api.md at implementation time. Optional evaluate: block in workload.yaml schema to land in benchmark.md §3.1 plus quality column in §8 report. the project strategy (private) resolution + §5 dated decision entry and pattern-store reconciliation verified 2026-05-14.
gateway.md v0 (2026-05-13) — the project strategy (private) edits landed on owner sign-off; the additive gateway_key_id / inbound_shape payload fields in event-bus-and-trace-catalog.md §6.3 / §6.6 land when the gateway implementation does.
