Status: Draft v1.1 (signal_strength partition added 2026-05-15)
Last updated: 2026-05-15
Defines the workload suite, baseline, and measurement methodology that turn
analytics-api.md /analytics/savings.actual_repriced_usd and baseline_repriced_usd into a credible “Metis saved you X%” number, the artifact the project strategy (private) names as “currently the biggest gap between ‘the architecture should work’ and ‘we can show it works.’”
/analytics/savings returns the right shape: actual_repriced_usd,
baseline_repriced_usd, savings_usd, savings_pct. But the numbers are only
as meaningful as the workload they sum over. With no defined workload, an
operator pointing at a personal trace DB asks reasonable questions the dashboard
can’t answer:
This spec closes that gap by pinning the workload. A benchmark suite is a versioned set of scripted user-turn scripts (“workloads”) and the harness that runs them. Re-running the suite on a clean trace DB produces:
/analytics/savings
directly against that trace store.

The same trace DB, served by metis serve <workspace> --db-path <db>, renders
the same numbers on the dashboard. This holds by construction, since scripts/benchmark.py
calls the in-process AnalyticsStore.savings() method that backs the HTTP
handler.
This spec depends on:
- analytics-api.md §4.7 for the savings response shape and
  re-pricing semantics.
- provider-adapter-contract.md (planned) for
  CanonicalRequest.temperature.
- event-bus-and-trace-catalog.md for the
  llm.call_completed / turn.completed events whose presence we assert.
- canonical-message-format.md §9.1 for the
  on-disk trace + session schema.

- PriceTable version → same per-workload report within a documented
  tolerance. Determinism is approximate, not absolute (LLMs are not strictly
  deterministic even at temperature=0); see §6 for the tolerance.
- /analytics/savings.savings_pct over the same window.
- packages/metis-core/tests/analytics/
  already cover the SQL projection logic; benchmark v1 is the end-to-end
  path.

A workload is a fixture directory under benchmarks/workloads/
containing:
benchmarks/workloads/<name>/
  workload.yaml      # script + assertions
  workspace/         # files the agent will operate on (the workspace_path)
    ...
workload.yaml schema

name: <slug, matches the directory name>
description: <one-line human summary>
suite_version: 1                  # benchmark suite schema version
signal_strength: high             # high | marginal; default high (§4.1)
turns:                            # ordered list of user turns
  - prompt: "..."
    expect:                       # optional, per-turn
      min_tool_calls: 1
      max_tool_calls: 20
      contains_substring: "..."   # optional text assertion on assistant_text
      stop_reason: end_turn       # optional, defaults to end_turn
expect:                           # optional, aggregate across the workload
  max_total_cost_usd: 0.50
  min_llm_calls: 1
  max_hard_failures: 0            # /analytics/routing.hard_failures over the window
  min_delegate_calls: 0           # optional, planner-scoped delegate.started count
evaluate:                         # optional, workload-level quality rubric
  rubric: heuristic               # heuristic | llm | hybrid; default heuristic
  expect_substring_in_final_response: "..."    # passthrough to heuristic signals
  llm_judge_model: anthropic:claude-haiku-4-5  # required when rubric != heuristic
  weight_per_turn: 1.0            # how turns aggregate (default 1.0)
  grounding_tokens: ["..."]       # optional, see evaluator.md §5.4
  forbidden_grounding: ["..."]    # optional, hallucination-detection list
The signal_strength field. Optional; defaults to "high". Pins
whether the workload produces a stable haiku-vs-sonnet quality gap large
enough for the routing K-NN to learn from (§4.1). Values:
- high: smoke-validated gap ≥ 0.4 between the cheap-model mean and
  the strong-model mean across ≥ 4 runs per (workload, model) pair at
  temperature=0. Methodology: RESULTS.md §13a-1.
- marginal: gap is within run-to-run variance (typically < 0.15 in
  the §A3-rev6 audit; the smoke gate is “not validated as high”). Kept
  on disk for §A3 reruns and as a regression-sentinel suite, but
  excluded from the default scripts/benchmark.py run.

scripts/benchmark.py defaults to signal_strength=high workloads
only; --include-marginal opts the marginal-signal workloads back in.
An explicit --workload <name> bypasses the filter so any workload on
disk can be run by name (preserves §A3-rev reproducibility).
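The filter rules above can be sketched in a few lines. This is an illustration only: the Workload dataclass and select_workloads function are hypothetical names, not the harness's actual API.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class Workload:
    # Hypothetical in-memory view of a parsed workload.yaml.
    name: str
    signal_strength: str = "high"  # "high" | "marginal"; schema default is "high"


def select_workloads(
    workloads: List[Workload],
    include_marginal: bool = False,
    only: Optional[str] = None,
) -> List[Workload]:
    """Apply the default high-signal filter plus the two documented overrides."""
    if only is not None:
        # An explicit --workload <name> bypasses the signal_strength filter.
        picked = [w for w in workloads if w.name == only]
        if not picked:
            raise ValueError(f"no workload named {only!r} on disk")
        return picked
    if include_marginal:
        # --include-marginal opts the marginal workloads back in.
        return list(workloads)
    return [w for w in workloads if w.signal_strength == "high"]
```

With every shipped workload currently marked marginal, the default call returns an empty list, which is the case where the harness is described as emitting its error pointing to --include-marginal.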
Schema enforcement. The harness validates the YAML against this shape at
load time. Unknown top-level keys, unknown expect keys, and unknown
evaluate keys are rejected. This forces schema migrations to flow through
this spec.
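As a rough sketch of the rejection rule, the check below works on an already-parsed mapping using plain set arithmetic. The decision log names msgspec.yaml as the validator; this stand-in deliberately avoids that dependency, and the function names are hypothetical. Placing grounding_tokens and forbidden_grounding under evaluate is an assumption.

```python
ALLOWED_TOP = {"name", "description", "suite_version", "signal_strength",
               "turns", "expect", "evaluate"}
ALLOWED_TURN = {"prompt", "expect"}
ALLOWED_TURN_EXPECT = {"min_tool_calls", "max_tool_calls",
                       "contains_substring", "stop_reason"}
ALLOWED_EXPECT = {"max_total_cost_usd", "min_llm_calls",
                  "max_hard_failures", "min_delegate_calls"}
# Assumption: the two grounding lists are evaluate keys.
ALLOWED_EVALUATE = {"rubric", "expect_substring_in_final_response",
                    "llm_judge_model", "weight_per_turn",
                    "grounding_tokens", "forbidden_grounding"}


def _reject_unknown(where: str, mapping: dict, allowed: set) -> None:
    unknown = sorted(set(mapping) - allowed)
    if unknown:
        raise ValueError(f"{where}: unknown key(s): {', '.join(unknown)}")


def validate_workload(doc: dict) -> None:
    """Raise ValueError on unknown keys or an unsupported suite_version."""
    _reject_unknown("workload", doc, ALLOWED_TOP)
    if doc.get("suite_version") != 1:
        raise ValueError("suite_version must be 1 in v1")
    if doc.get("signal_strength", "high") not in {"high", "marginal"}:
        raise ValueError("signal_strength must be 'high' or 'marginal'")
    for i, turn in enumerate(doc.get("turns", [])):
        _reject_unknown(f"turns[{i}]", turn, ALLOWED_TURN)
        _reject_unknown(f"turns[{i}].expect", turn.get("expect", {}),
                        ALLOWED_TURN_EXPECT)
    _reject_unknown("expect", doc.get("expect", {}), ALLOWED_EXPECT)
    _reject_unknown("evaluate", doc.get("evaluate", {}), ALLOWED_EVALUATE)
```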
The evaluate: block. Optional; omitting it is equivalent to
rubric: heuristic with no substring assertion. The harness passes the
parsed evaluate to the evaluator with subject_kind=workload after each
workload run; the resulting eval.completed.score shows up in the
benchmark report’s quality column and feeds the “savings on successful
work” headline (evaluator.md §5.4). v1 ships
rubric: heuristic only — llm / hybrid parse but defer to the
heuristic until the LLM-as-judge tier lands.
Assertions are soft floors / hard ceilings. min_* and max_* bound a
window of acceptable behavior — if the model gets cheaper at the same task,
max_total_cost_usd does not break the run. The intent is to catch
regressions (cost ballooned, tool calls went wild), not to pin behavior so
tightly that an unrelated model release breaks the suite.
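One way to read the min_* / max_* convention is as a generic bound check over observed metrics. The sketch below assumes the observed metric name is the assertion key with its min_ or max_ prefix removed; that mapping and the function name are illustrative, not taken from the harness.

```python
from typing import Dict, List


def check_expectations(observed: Dict[str, float],
                       expect: Dict[str, float]) -> List[str]:
    """Return one message per violated bound; an empty list means the run held."""
    failures = []
    for key, bound in expect.items():
        kind, _, metric = key.partition("_")  # "max_total_cost_usd" -> "max", "total_cost_usd"
        value = observed[metric]
        if kind == "min" and value < bound:
            failures.append(f"{key}: observed {value} is below floor {bound}")
        elif kind == "max" and value > bound:
            failures.append(f"{key}: observed {value} is above ceiling {bound}")
    return failures
```

A run that gets cheaper stays inside the window, while a run whose cost balloons past max_total_cost_usd produces a failure message.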
workspace/ directory

A real workspace tree the agent treats as its working directory. The fixture ships with the files in place; the agent reads / edits / runs them as it would any project. The harness:
- Copies the workspace/ subtree to a fresh tempdir at run start. The
  in-tree fixture is never mutated by a run.
- Passes the copy as workspace_path on SessionManager.create_session().

This is what makes workloads hermetic. The agent can edit_file freely;
nothing persists outside the run.
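The copy-then-discard pattern can be sketched with the standard library alone. The context manager name hermetic_workspace is hypothetical; the real harness may structure the cleanup differently.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path
from typing import Iterator


@contextmanager
def hermetic_workspace(fixture_dir: Path) -> Iterator[Path]:
    """Yield a throwaway copy of <fixture_dir>/workspace; remove it on exit."""
    with tempfile.TemporaryDirectory(prefix="metis-bench-") as tmp:
        copy = Path(tmp) / "workspace"
        # copytree requires that the destination does not exist yet.
        shutil.copytree(fixture_dir / "workspace", copy)
        # The caller would pass this path as workspace_path.
        yield copy
    # TemporaryDirectory cleans up here, even if the body raised.
```

Edits made inside the with block touch only the copy, so the in-tree fixture stays byte-identical across runs.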
YAML files diff cleanly in PRs, support multi-line strings (the prompts are prose), and read better at PR-review time than JSON. The trade-off is indentation sensitivity, which the schema validator catches.
After §A3-rev6 (RESULTS.md), the suite is
partitioned by signal_strength. The §A3-rev6 Q1 finding is that across
six end-to-end §A3-series runs the per-workload haiku-vs-sonnet quality
gap stayed within run-to-run variance — the K-NN cannot learn signal
that isn’t there. The 13a-1 follow-up tested whether purpose-designed
“haiku-fail” workloads could clear a ≥ 0.4 gap; the result was no (see
RESULTS.md §13a-1).
| Workload | Signal | Shape | §A3-rev6 cross-pass gap |
|---|---|---|---|
| fix-a-bug-small | marginal | Find + fix a bug in a tiny python module | +0.070 |
| write-a-doc-from-notes | marginal | Read raw notes, produce a structured doc | +0.000 |
| multi-turn-refactor | marginal | Rename a function across 4 files | -0.079 (reverse) |
| regex-with-edge-cases | marginal | One-shot NANP regex over 16 cases | +0.119 |
| multi-file-refactor-with-shared-types | marginal | Rename a dataclass across 7 files | +0.043 |
| architectural-explanation-without-hallucination | marginal | Grounded explanation control case (hallucination detector) | +0.000 |
| intentionally-failing-task | marginal | Evaluator-control low-score case | +0.000 |
| multi-step-with-delegation | marginal | Planner-driven delegation exercise (Q2-shaped, not Q1) | n/a |
| subtle-bug-fix-with-test (13a-1 candidate) | marginal | Root-cause vs symptom config-loader bug | -0.029 (13a-1) |
| recursive-data-structure-traversal (13a-1 candidate) | marginal | Shortest-chain tree walk with tombstoned pruning | +0.083 (13a-1) |
| refactor-with-contract-preservation (13a-1 candidate) | marginal | Keyword-only signature refactor preserving 7 callers | +0.000 (13a-1) |
No workload is currently signal_strength: high. The default
scripts/benchmark.py run with no flags emits a helpful error
pointing to --include-marginal. This is intentional: shipping a
workload as “high-signal” requires smoke-validating a ≥ 0.4 gap, and
no such workload has emerged from the candidate set as of 2026-05-15.
The candidate path is:
- temperature=0) and sonnet reliably passes (target: 0.9+).
- signal_strength: marginal initially. Open a PR with
  the workload files + the 4×2 smoke run scoreboard.
- signal_strength: high and updates the §4.1 table.
- RESULTS.md
  and keep the workload at marginal. The §A3-rev series may still
  find it useful as a regression sentinel.

The bar for “high” comes from the §A3-rev6 finding: the K-NN’s
cluster math needs a stable per-workload delta well above the
heuristic judge’s resolution (which clusters at 0.833 / 0.917 /
1.000 in the 13a-1 smoke). 0.4 is the rough minimum a K-NN of 5–10
neighbors can preserve through aggregation while leaving room for
the min_confidence gate to fire.
Real-API costs apply. Approximate per-run figures, all in USD, with the default actual=haiku / baseline=sonnet configuration:
| Run mode | Actual cost | Baseline cost (counterfactual, not billed) |
|---|---|---|
| Single workload (smoke) | ~$0.05–0.20 | n/a (no API call) |
| Full suite (3 workloads) | ~$0.30–1.00 | n/a |
| Full suite at --model sonnet actuals | ~$1.00–3.00 | n/a |
| Full suite at --model opus actuals | ~$3.00–5.00 | n/a |
The baseline does not make API calls. /analytics/savings re-prices each
recorded row’s token counts under the baseline model’s PriceTable rates;
no second LLM run happens. This is what makes the suite cheap enough to run
weekly.
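The re-pricing arithmetic can be illustrated as follows. The rates, model names, and row fields below are made up for the example and do not reflect the real PriceTable or trace schema; the point is only that the baseline figure is computed from recorded token counts, with no second model call.

```python
from decimal import Decimal
from typing import Dict, List

# Hypothetical USD-per-million-token rates, for illustration only.
PRICES = {
    "cheap-model": {"input": Decimal("1.00"), "output": Decimal("5.00")},
    "strong-model": {"input": Decimal("3.00"), "output": Decimal("15.00")},
}
MILLION = Decimal(1_000_000)


def row_cost(row: Dict, model: str) -> Decimal:
    """Price one recorded call's token counts under the given model's rates."""
    rates = PRICES[model]
    return (row["input_tokens"] * rates["input"]
            + row["output_tokens"] * rates["output"]) / MILLION


def savings(rows: List[Dict], baseline_model: str) -> Dict[str, Decimal]:
    # Actual: each row priced under the model that served it.
    actual = sum((row_cost(r, r["model"]) for r in rows), Decimal(0))
    # Baseline: the same token counts priced under the baseline model.
    baseline = sum((row_cost(r, baseline_model) for r in rows), Decimal(0))
    saved = baseline - actual
    pct = saved / baseline * 100 if baseline else Decimal(0)
    return {"actual_repriced_usd": actual, "baseline_repriced_usd": baseline,
            "savings_usd": saved, "savings_pct": pct}
```

Using Decimal throughout keeps per-workload sums exactly additive, which matters when an aggregate is compared against the sum of its parts.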
Document these numbers in the harness’s --help. If a run blows past 2× the
upper bound for its mode, that’s a signal to investigate (regressed prompt,
exploded tool-cycle count, etc.).
The harness records and prints provenance so reports are comparable across runs and machines.
| Field | Where it comes from | Why it matters |
|---|---|---|
| suite_version | workload.yaml.suite_version (must be 1 in v1) | Schema migration gate |
| metis_commit_sha | git rev-parse HEAD at run start | Identifies the agent’s behavior |
| metis_branch | git rev-parse --abbrev-ref HEAD | Context for the SHA |
| metis_dirty | git status --porcelain non-empty → true | Flags “ran against uncommitted code” |
| pricing_version | PriceTable.version at report time | The number on the report |
| actual_model | Canonical id resolved at run start (alias → id) | Pinned model version |
| baseline_model | Canonical id resolved at run start | Pinned baseline |
| actual_provider | Resolved from the registry | Sanity-check column |
| python_version | sys.version | Reproducibility nicety |
| started_at | UTC ISO timestamp | Report header |
| ended_at | UTC ISO timestamp | Total wall time |
| temperature | Configured per run (default 0.0) | Determinism control |
| seed_passes | --seed-passes N (default 1) | N reps per (workload, model); see §6.4 |
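A minimal sketch of how the git-derived and interpreter-derived fields in the table might be gathered. The function name and the injectable run parameter are assumptions made so the example can be exercised without a repository; the git commands themselves are the ones listed in the table.

```python
import subprocess
import sys
from datetime import datetime, timezone
from typing import Callable, Dict


def _git(*args: str, run: Callable = subprocess.run) -> str:
    # check=True surfaces "not a git repo" as an exception instead of an empty string.
    result = run(["git", *args], capture_output=True, text=True, check=True)
    return result.stdout.strip()


def collect_provenance(run: Callable = subprocess.run) -> Dict[str, object]:
    """Collect the subset of provenance fields that need no Metis internals."""
    return {
        "metis_commit_sha": _git("rev-parse", "HEAD", run=run),
        "metis_branch": _git("rev-parse", "--abbrev-ref", "HEAD", run=run),
        # Any porcelain output means uncommitted changes.
        "metis_dirty": _git("status", "--porcelain", run=run) != "",
        "python_version": sys.version,
        "started_at": datetime.now(timezone.utc).isoformat(),
    }
```

Fields such as pricing_version and the resolved model ids depend on the PriceTable and the registry, so they are left out of this sketch.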
The harness sets temperature=0 by default (overridable via --temperature).
SessionManager.submit_turn(...) accepts a temperature kwarg that threads
through to CanonicalRequest.temperature; adapters that support the parameter
honor it. This is not strict determinism — providers reserve the right to
vary outputs even at temperature=0 (especially on tool-call branches and
under load) — but it’s the strongest reproducibility lever available
without going to a recorded-fixture playback model.
Documented expected variance, run-over-run with all pins held:
| Metric | Tolerance |
|---|---|
| savings_pct aggregate | ±5 absolute pp |
| Per-workload actual_repriced_usd | ±25% relative |
| llm_call_count per workload | ±2 calls (tool-cycle branching) |
If two consecutive clean-DB runs disagree by more than these, suspect a real behavior change, not noise.
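The tolerance table translates into a small comparison between two reports. The report layout used here (a savings_pct number plus a workloads mapping) is an assumed shape for illustration, and the relative cost check is taken against the earlier run, which the spec does not specify.

```python
from typing import Dict, List


def compare_runs(prev: Dict, curr: Dict) -> List[str]:
    """List every metric where two clean-DB runs differ by more than tolerance."""
    problems = []
    if abs(curr["savings_pct"] - prev["savings_pct"]) > 5.0:
        problems.append("savings_pct moved by more than 5 pp")
    for name, p in prev["workloads"].items():
        c = curr["workloads"][name]
        base = p["actual_repriced_usd"]
        # Relative drift is measured against the earlier run (assumption).
        if base and abs(c["actual_repriced_usd"] - base) / base > 0.25:
            problems.append(f"{name}: actual_repriced_usd moved by more than 25%")
        if abs(c["llm_call_count"] - p["llm_call_count"]) > 2:
            problems.append(f"{name}: llm_call_count moved by more than 2")
    return problems
```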
The harness defaults --db-path to benchmarks/.runs/benchmark-<UTC-ts>.db
and rejects existing files (so the savings projection never mixes in unrelated
events). The default location is git-ignored to keep large trace files out of
commits.
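A sketch of the default path and the fresh-file guard, assuming the timestamp format seen in the example report path (benchmark-2026-05-13T10-22-08Z.db). Both function names are hypothetical.

```python
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def default_db_path(now: Optional[datetime] = None,
                    root: Path = Path("benchmarks/.runs")) -> Path:
    """Build benchmarks/.runs/benchmark-<UTC-ts>.db for the given instant."""
    now = now or datetime.now(timezone.utc)
    # Hyphens in the time part keep the name valid on filesystems that reject ':'.
    stamp = now.strftime("%Y-%m-%dT%H-%M-%SZ")
    return root / f"benchmark-{stamp}.db"


def require_fresh(db_path: Path) -> Path:
    """Refuse to reuse a trace DB so the projection never sees unrelated events."""
    if db_path.exists():
        raise FileExistsError(f"{db_path} already exists; pick a new --db-path")
    return db_path
```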
--seed-passes N

When seeding a shared patterns DB for cross-pass K-NN comparisons (§A3
series), single-shot per-(workload, model) sampling produces 1–2 outcomes
per fingerprint cluster — small enough that one stochastic agent failure
can flip the cluster mean by 0.1–0.3, exceeding the min_confidence=0.05
gate that drives slot-4 inversions (see benchmarks/RESULTS.md §A3-rev6 Q1
finding).
--seed-passes N (default 1) loops each workload N times in the current
pass against the same --patterns-db-path. Each rep:
- session_id (new runtime.manager.create_session).
- .metis/patterns.db from the shared file (after
  rep 1) so the K-NN at rep N sees the prior reps’ outcomes.

After the loop, the harness reports per-workload quality_mean ± quality_std
across the N reps. Workloads with std > 0.15 are flagged “NOISY” in the
report and surfaced as candidates for replacement under the 13a-1
signal-strength gate (the workload is too noisy for the K-NN to ever learn
a stable preference; fix the workload, not the routing knob).
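The per-workload summary and the NOISY flag can be sketched with the statistics module. Whether the harness uses a population or a sample standard deviation is not stated; this example assumes population (pstdev), and the function name is hypothetical.

```python
from statistics import mean, pstdev
from typing import Dict, List

NOISY_STD = 0.15  # threshold from the text above


def summarize_quality(scores: List[float]) -> Dict[str, object]:
    """Summarize the quality scores of N reps of one workload."""
    std = pstdev(scores)  # 0.0 for a single rep
    return {
        "quality_mean": mean(scores),
        "quality_std": std,
        "noisy": std > NOISY_STD,
    }
```

For example, three reps scoring 1.0, 0.5, and 1.0 have a population standard deviation of about 0.24, so that workload would be flagged.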
Cost trade-off. N seed-passes multiplies the actual cost of each seed-only pass (Pass A / Pass B in the typical §A3 protocol) by N. Pass C (routing test on the now-richer patterns DB) and Pass D (delegation) are unaffected. For the typical §A3 four-pass run:
| N | Pass-A+B cost (haiku+sonnet seed) | Pass-C+D cost | Total | Use case |
|---|---|---|---|---|
| 1 (default) | ~$0.50 | ~$0.50 | ~$1.00 | single-shot, fastest signal |
| 3 (recommended for A3) | ~$1.50 | ~$0.50 | ~$2.00 | noise-reduced cluster means |
| 5 (cluster-tightening A/B) | ~$2.50 | ~$0.50 | ~$3.00 | std reporting + signal-strength gate |
Per-workload sample size after seeding scales with N: each rep adds one record per turn fingerprint, so a 3-turn workload at N=3 contributes 3 records per turn fingerprint (9 in total) to the patterns DB per model. That is enough for the K-NN cluster mean to absorb a single stochastic failure without flipping the chooser.
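The cost table and the sample-size rule reduce to simple arithmetic. The sketch below takes the table's approximate N=1 figures as defaults; the function name and parameters are illustrative only.

```python
from typing import Dict


def seed_pass_estimate(n: int,
                       seed_cost_at_n1: float = 0.50,
                       routing_cost: float = 0.50) -> Dict[str, float]:
    """Estimate a four-pass §A3 run: only the seed passes (A+B) scale with N."""
    seed_cost = seed_cost_at_n1 * n
    return {
        "seed_cost_usd": seed_cost,                   # Pass A + Pass B
        "total_cost_usd": seed_cost + routing_cost,   # plus Pass C + Pass D
        "records_per_fingerprint": n,                 # one record per rep
    }
```

Calling it with n=3 reproduces the recommended row of the table: about $1.50 for seeding and about $2.00 in total.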
Validating accumulation. The harness verifies the shared DB grows: if
after the loop the K-NN’s cluster sample size for the test fingerprint
is less than N, that’s a recording-path bug (silently dropped record)
and the operator should investigate. The unit test
test_seed_passes_loop_invokes_run_workload_n_times exercises this
invariant against the scripted-adapter substrate.
To rerun against a previously-captured DB, pass it explicitly with
--db-path <existing> and --skip-execute (a no-API mode that only runs the
analytics projection on the existing DB). This is the “did I lose the print”
escape hatch; it doesn’t make a new API call.
Beyond the headline savings_pct, the report includes per-workload:
- llm_call_count: from the turn.completed events.
- tool_call_count: same.
- total_wall_time_seconds: clock time from started_at to ended_at of
  the run, not summed turn latencies (intentional: wall time is the user’s
  experience).
- cache_hit_rate: read from /analytics/cache_effectiveness if the
  adapter emits cache metadata (currently none do; see
  KNOWN_ISSUES.md “No prompt-caching strategy”).

These are diagnostic, not pass/fail. The benchmark is a savings benchmark; the other columns answer “why did savings move?”
The harness prints a per-workload table and an aggregate summary to stdout,
and writes the full report to benchmarks/.runs/benchmark-<UTC-ts>.json.
Stdout (example):
=== Metis benchmark suite ===
commit: d79564b (clean)
suite_version: 1
actual_model: anthropic:claude-haiku-4-5
baseline_model: anthropic:claude-sonnet-4-6
pricing_version: 2026-05-08
temperature: 0.0
db: benchmarks/.runs/benchmark-2026-05-13T10-22-08Z.db
Per-workload:
  workload                 turns  llm  tool  actual_$  baseline_$  saved_$  saved_%
  fix-a-bug-small              3    5     4    0.0142      0.0421   0.0279    66.2%
  write-a-doc-from-notes       2    2     1    0.0081      0.0238   0.0157    65.9%
  multi-turn-refactor          5    9    11    0.0418      0.1320   0.0902    68.3%
Aggregate:
  rows_total:                    16
  rows_missing_from_price_table: 0
  actual_repriced_usd:           0.0641
  baseline_repriced_usd:         0.1979
  savings_usd:                   0.1338
  savings_pct:                   67.6%
Run the dashboard against this DB to verify:
uv run metis serve $(pwd) --db-path benchmarks/.runs/benchmark-2026-05-13T10-22-08Z.db
open http://127.0.0.1:8421/dashboard
JSON artifact carries the provenance from §6.1 plus
the per-workload and aggregate fields above plus the raw
/analytics/savings response per workload.
- scripts/benchmark.py
  instantiates AnalyticsStore directly against the trace DB; it does not
  start the HTTP server. This avoids spinning up uvicorn for a one-shot
  report. The dashboard agreement §2.1.4 is by construction
  because the HTTP handler delegates to the same AnalyticsStore.savings().
- started_at / ended_at micros; the aggregate report calls
  AnalyticsStore.savings(window=(min(start), max(end)), baseline=...) to
  cover the full suite, and per-workload it calls with the workload’s own
  window. SQLite’s BETWEEN over the indexed (type, timestamp_us) covers
  this in a single scan.
- shutil.copytree(...) into a TemporaryDirectory and the
  workspace_path passed to SessionManager.create_session() points at the
  copy. The copy is removed in a try / finally regardless of run outcome.
- max_total_cost_usd exceeded) or if any turn
  raised. Exit codes:
  - 0: every workload ran clean, assertions held.
  - 1: one or more assertions failed; the report still printed.
  - 2: setup error (missing API key, bad workload file).

V1’s “tests” for the spec are the harness and its workloads. Unit tests cover
the schema validator (see §3.1): accept the three
shipped workloads, reject malformed YAML, reject unknown keys. End-to-end is
a real-API smoke (one workload at --model haiku) on demand, not in CI, per
§2.2.
- benchmarks/workloads/ loads without error.
- foo: bar field at the top level raises a clear error.
- expect keys. Same for per-turn and aggregate
  expect fields.
- actual_repriced_usd
  equals the sum of per-workload actual_repriced_usd (exact Decimal).

These are live. Do not unilaterally close them.
- multi-turn-refactor use the real edit_file tool? Currently
  yes; the alternative is to mock the tool for determinism. Picking yes
  because the savings story includes the cost of tool-cycle iterations, and
  mocking would erase that.
- metis serve --db-path be advertised? Today the dashboard
  path requires knowing the DB. Adding metis benchmark as a CLI subcommand
  (apps/cli/src/metis_cli/main.py)
  would close the loop. Out of scope for v1.
- code-explanation, shell-driven-debug,
  long-multi-doc-summarize are all plausible. Wait for the v1 suite’s
  numbers to settle before adding noise.

| Date | Decision | Rationale |
|---|---|---|
| 2026-05-13 | YAML for workload files | Multi-line prose prompts and PR diffs read better than JSON; msgspec.yaml validates shape. |
| 2026-05-13 | Bundled fixture workspaces, not the metis repo | Hermetic; results don’t drift with repo state. |
| 2026-05-13 | Baseline is re-priced, not re-executed | A re-run baseline doubles cost; analytics-api.md §4.7 already re-prices honestly. |
| 2026-05-13 | temperature=0 by default, plumbed via submit_turn | Strongest reproducibility lever without recorded-playback. |
| 2026-05-13 | Three workloads in v1 | Enough variation to avoid single-workload accident; few enough to stay under the cost ceiling. |
| 2026-05-13 | Soft floors / hard ceilings, no goldens | LLM variance breaks goldens; tolerance windows catch real regressions without churn. |
| 2026-05-13 | Run analytics in-process (not via HTTP) | Avoids uvicorn lifecycle in a one-shot script; dashboard agreement is by construction. |
| 2026-05-13 | Quality scoring deferred to the evaluator | Benchmark v1 measures spend, not correctness; that is the evaluator’s job per the project strategy (private). |
| 2026-05-15 | signal_strength: high \| marginal partition + --include-marginal flag | §A3-rev6 Q1 finding (RESULTS.md): the per-workload haiku-vs-sonnet quality gap in the v1 suite is within run-to-run variance. v2 splits the suite by smoke-validated gap so the default run trains the K-NN only on high-signal workloads. 13a-1 smoke (2026-05-15) tested 3 candidate workloads; none cleared the 0.4 gate, so the default suite ships empty pending future candidates. |
- analytics-api.md: /analytics/savings response shape;
  this spec’s headline number is one field of one of its responses.
- event-bus-and-trace-catalog.md:
  llm.call_completed, turn.completed are the rows the savings projection
  sums.
- canonical-message-format.md: on-disk
  schema for events, messages, sessions.
- provider-adapter-contract.md (planned):
  the contract for CanonicalRequest.temperature honored across adapters.
- ../the project strategy (private): the open question this spec
  closes.
- ../KNOWN_ISSUES.md: prompt-caching gap; the
  benchmark’s cache_hit_rate column doubles as a forcing function.
- scripts/smoke.py,
  scripts/smoke_cross_provider.py:
  shape reference for the real-API harness pattern this spec extends.