Metis picks the model that succeeds at the task class, not just the cheaper one. The first end-to-end demonstration of that mechanism is on disk in
benchmarks/RESULTS.md §A3-rev3.
A three-pass benchmark on the shipped workload suite (Pass A = haiku
pinned, Pass B = sonnet pinned, Pass C = --no-active-model so slot 4
of the routing chain can fire) under the hybrid evaluator with
escalation threshold 0.7:
| Pass | Strategy | Quality sum (7 workloads) | Cost (real API) | Cost / quality unit |
|---|---|---|---|---|
| A | haiku pinned | 5.16 | $0.1977 | $0.0383 |
| B | sonnet pinned | 5.75 | $0.6761 | $0.1176 |
| C | slot 4 picks per workload | 5.55 | $0.2645 | $0.0477 |
Pass C achieves a quality sum 5.55 — closer to sonnet’s 5.75 than to haiku’s 5.16 — for a per-quality cost of $0.0477, landing inside the headline window between haiku-only $0.0383 and sonnet-only $0.1176.
Pass C aggregate savings_pct = 62.0% against the matched-baseline
$0.6955; flat-haiku-everywhere would have been 66.7%. The 4.7-point gap
is the cost of the one sonnet pick — and that pick is the demo.
The workload regex-with-edge-cases is a deliberately failure-prone
“16 edge-case tests” workload that exercises negative lookaheads and
nested quantifiers. Haiku struggles; sonnet doesn’t.
| Pass | Strategy | Quality on regex-with-edge-cases |
|---|---|---|
| A | haiku pinned | 0.19 (rubric-fail at 0.80) |
| B | sonnet pinned | 0.72 |
| C | slot 4 | 0.74 |
On turn 2 of regex-with-edge-cases in Pass C, slot 4 of the
routing chain read the cross-model outcomes accumulated in the shared
pattern store, computed cluster means haiku 0.784 vs sonnet 0.833 at
confidence 0.058, and picked sonnet. Pass C used haiku on the easy
turns of every other workload (which is how savings_pct stays at
62.0%) and routed the one hard turn that mattered to the bigger model.
Quality went from 0.19 (failing) in Pass A to 0.74 in Pass C — differentiated routing recovered 99% of sonnet-only’s quality on this workload for roughly 25% of sonnet-only’s cost.
Confidence 0.058 would have been rejected under the prior
PatternConfig.min_confidence=0.3 gate. The one-line Wave 9 knob
(min_confidence: 0.3 → 0.05 at
routing/policy.py:63)
was the missing piece in front of an already-correct K-NN aggregation.
The two ways a flat-rate router can mis-route in a real workload:
cost(haiku) × N, but quality collapses on the turns where
the small model can’t keep up. regex-with-edge-cases Pass A
(0.19 quality, rubric-fail) is the failure mode in a single number.cost(sonnet) × N (3-4× haiku in this workload mix) and quality
goes up only on the small subset of turns that needed it.Slot 4 reads cross-model outcomes by workload, identifies the turns that need the bigger model, and pays the premium only there. Pass C in §A3-rev3 is the first end-to-end demonstration: 14 of 18 turns reached slot 4, the K-NN cluster picked sonnet on the one turn where it mattered, and the bill came in 25% above haiku-only with 8% more quality (5.55 vs 5.16) than haiku-only could have produced even with infinite budget.
benchmarks/RESULTS.md §A3-rev3 “caveats and observations”:
sample-size asymmetry on specific fingerprints, and the residual
per-turn-vs-workload signal divergence on multi-turn-refactor.
Both are addressable with more pattern-store samples plus the wired
workload-level eval signal (the §A3-rev2 path #1 fix).regex-with-edge-cases is a deliberately-failure-prone workload.
Its rubric is harsh and earlier passes saw quality scores anywhere
in 0.19–1.00. The numbers above are the median, not the floor or
ceiling; the relative ordering Pass A ≪ Pass B ≈ Pass C reproduces
across runs.architectural-explanation-without-hallucination records
success_score_count=0 across all fingerprints despite the trace
recording the per-turn eval.completed. Doesn’t block the
inversion (slot 4 still picked haiku on this workload’s only turn
and quality came out 0.90 either way), but it’s the next
correctness item.The same routing engine that ran the benchmark ships in the transparent HTTP gateway. Point an unmodified OpenAI / Anthropic / Claude Code / Cursor / raw-SDK client at it and your traffic is cost-stamped per dev, per project, per task class — without your devs changing how they work.
The smoothest path is
docs/operations/quickstart.md — a
buyer-facing, < 1-hour recipe: the convenience script automates
helm install end-to-end, then metis trial --gateway-url … --gateway-key …
runs a pre-baked workload through the gateway and prints
actual / baseline / savings_pct (~$0.05 of real-API spend).
End-to-end SDK pointer-flip recipe (Claude Code, Cursor, raw curl/SDK)
at docs/gateway-client-quickstart.md.
For evaluating Metis against your own workload (not our benchmarks),
see docs/customer-trial-recipe.md.
The two-line version, assuming you have a gateway issuing keys and a provider API key in the gateway pod:
# 1. Existing client, one env var flipped.
export ANTHROPIC_BASE_URL="http://your-gateway:8422"
export ANTHROPIC_API_KEY="gw_…" # gateway-issued token
claude # or cursor, or your existing tool — no code changes
# 2. The per-key / per-user / per-team rollup the buyer reads.
curl 'http://your-server:8421/analytics/by_team' | jq
curl 'http://your-server:8421/analytics/cost?group_by=user' | jq
If you want to validate the demo against the trace records before recommending it to a buyer:
# The pattern-store snapshot Pass C read at decision time
sqlite3 benchmarks/.runs/a3rev3-patterns.db \
"SELECT json_extract(f.structural_json,'$.workload_id') AS workload,
o.primary_model,
o.success_score_count AS n,
ROUND(o.success_score_sum / NULLIF(o.success_score_count,0), 3) AS mean
FROM outcomes o JOIN fingerprints f USING (fingerprint_id)
WHERE workload = 'regex-with-edge-cases'
ORDER BY o.primary_model, n DESC;"
# The Pass C route.decided events
sqlite3 benchmarks/.runs/a3rev3-pass-c.db \
"SELECT json_extract(payload_json,'$.chosen_model') AS model,
json_extract(payload_json,'$.chain[3].verdict') AS slot4_verdict,
json_extract(payload_json,'$.chain[3].reason') AS slot4_reason
FROM events WHERE type = 'route.decided'
ORDER BY timestamp_us;"
Reproduce with the recipe at
benchmarks/RESULTS.md §A3-rev3 "Reproduce".
Cost envelope: ~$1.10 real-API spend across three passes.
The thesis: a buyer’s LLM bill bends through three levers — context engineering, skills, and model selection — applied together. Model selection is the cleanest to demonstrate against a benchmark, which is why it’s the artifact above. Context engineering and skills compound on top of it; both are already-shipped (universal prompt-cache placement, agentskills.io-compatible skill store) and the savings from each surface separately in the dashboard.
The mechanism for model selection shipped. Broader workload coverage is in-progress.