metis

Evaluator Specification

Status: v1 (heuristic + LLM + hybrid tiers shipped) Last updated: 2026-05-14

Defines the feedback loop that turns “was this turn successful?” into a recorded signal the pattern store and the analytics surface can read. This closes the open question in the project strategy (private): without an evaluator, “is the system actually saving money vs naive sonnet-everywhere?” stays an open question forever — savings without quality is just a smaller bill for worse work.

v1 is heuristics-first with an opt-in LLM-as-judge tier. No labeled training data, no fine-tuned classifier, no SaaS dependency. The judge outputs a numeric score in [0, 1] with a confidence, written as a bus event so every consumer (pattern store, /analytics/*, the dashboard) reads from the same record.

1. Purpose

The build records spend honestly today: llm.call_completed.cost_usd is stamped at write time, /analytics/savings re-prices the counterfactual, benchmark.md bounds the workload. None of that proves the agent’s output was good — and the savings-vs-naive-sonnet pitch collapses the moment a buyer asks “how do you know haiku produced an answer worth keeping?”

The evaluator answers that question. It consumes the events already in the bus (turn.completed, tool.completed, feedback.*, route.decided) and emits a verdict per subject — turn, tool cycle, session, or benchmark workload. The verdict is a single numeric score with a confidence and the provenance of the judge that produced it.

This spec depends on:

event-bus-and-trace-catalog.md for the events the evaluator subscribes to (turn.completed, tool.completed, tool.failed, feedback.explicit, feedback.implicit, route.decided) and the catalog the new eval.* events join.
canonical-message-format.md §9.1 for the trace store schema the evaluator reads and writes through.
analytics-api.md for the projection conventions the new /analytics/quality endpoint follows.
benchmark.md §2.2 for the v1 limitation this spec closes (quality scoring deferred to the evaluator).
routing-engine.md §5.5 for the pattern store’s consumption shape (success_score in [0, 1], weighted with cost).

This spec coordinates with the planned pattern-store.md (drafted in parallel; see §15). Touchpoints are listed there so the two specs reconcile before either implementation lands.

2. Goals and non-goals

2.1 Goals

Closes the project strategy (private) Every turn (and every workload run, and every flagged tool cycle) yields a recorded verdict. “Is the system saving money on tasks that succeeded?” becomes a join, not a guess.
Heuristic floor, LLM ceiling. v1 ships heuristic judges that cost $0/turn and a hybrid escalation tier where an LLM-as-judge is invoked only when the heuristic’s confidence is below a threshold. The expensive judge is opt-in, not the default.
Re-evaluatable. Every verdict is a record, not a mutation. Re-running the evaluator over an older window produces new verdicts joined to the same subjects — the dashboard’s “evaluator agreement rate over time” tile (§9.2) is a query, not a side-table.
Single-user / local-first. Workspace-scoped by default per the project strategy (private); the verdict store lives next to the trace store. No external service, no labeling pipeline.
Budget-explicit. LLM-as-judge calls cost money. The evaluator’s own spend is recorded (eval.completed.judge_cost_usd), capped per session and per day, and surfaced under /analytics/cost?group_by=model&include_eval=true so an operator can see “how much did the judge cost vs how much did it save.”
Bus-as-spine. Output is eval.started / eval.completed / eval.failed on the existing bus. No private side-channels; every consumer is a normal subscriber.
Verdict shape is small and stable. The numeric score is a single field, so the pattern store (routing-engine.md §5.5) plugs in without bespoke aggregation. Multi-dimensional rubrics are an additive extension under signals — never a breaking change to the score field.

2.2 Non-goals

No supervised training, no labeled corpus. The single-user local deployment has no labeling pipeline. Calibration is by self-consistency (the agreement-rate view) and by spot-check against benchmark.md workloads with hand-asserted expectations, not by fitting a model to a labeled set.
No quality scoring inside the routing chain. The evaluator runs after turn.completed. Routing decisions for the current turn still come from the chain in routing-engine.md §4. Verdicts feed the pattern store (slot 4) which influences future turns — never the current one. This preserves the turn-locked model invariant (AGENTS.md “Gotchas”).
No verdict overriding the user. A low score never silently degrades the user’s chosen model, never auto-escalates mid-turn, never refuses to route. The evaluator is an observer; the user (or a future pattern-disagreement banner per routing-engine.md §5.6) is the actor.
No real-time LLM-as-judge in v1. The heuristic judge is fast-enough to run inline on the turn.completed event. The LLM judge runs out-of-band (batch / on-demand), not on the fast path. Promotion to a faster tier is a §13 open question.
No multi-judge ensembling. v1 picks one judge per verdict and records it. Ensembling (running two judges and combining their scores) is tempting and a v2 question.
No quality scoring of intermediate model output (tokens mid-stream). The judge sees the completed turn / cycle / session — the atomic unit. Token-level scoring is a different research project.
No remote / multi-tenant rollups. Per the project strategy (private); deferred until the gateway path (deployment-shape.md) needs it.

3. What the evaluator evaluates

The evaluator produces a verdict per subject. Four subject kinds, ordered by inclusion (each kind is the next-level aggregation of the prior):

Subject kind	Identifier	Bus trigger	Typical judge	v1?
`turn`	`turn_id`	`turn.completed`	heuristic; LLM optional	yes
`tool_cycle`	`tool_use_id`	`tool.completed` / `tool.failed`	heuristic only	yes
`session`	`session_id`	`session.ended`	heuristic-over-turns aggregation	yes
`workload`	`workload_run_id`	benchmark harness final report	heuristic + optional LLM	yes

A tool_cycle verdict scopes one tool invocation — did the dispatch return a useful result, or did the agent re-call the same tool with different arguments three calls later (tool thrash)? Tool-cycle verdicts roll up into the parent turn verdict’s signals but do not arithmetic-average into the turn score; the heuristic rubric (§5.1) decides how much weight each signal carries.

A session verdict is the per-turn weighted average plus session-scoped signals (explicit feedback, manual /model swaps inside the session, etc.).

A workload verdict subsumes benchmark.md §2.2.2’s “no quality scoring in v1” gap. The benchmark harness calls the evaluator with subject_kind=workload after the suite run; the workload-level rubric is defined per workload in workload.yaml (§5.4).

No mid-turn evaluation. The evaluator never fires inside a turn. A turn that’s still running has no turn.completed event yet, so the subscription filter never matches it. This is what keeps the evaluator off the fast path and out of the turn-locked model contract.

4. The verdict

4.1 Shape

EvalVerdict is a msgspec.Struct(frozen=True) carried as the payload of eval.completed:

class EvalVerdict(msgspec.Struct, frozen=True):
    eval_id: str                                    # monotonic ULID
    subject_kind: Literal["turn", "tool_cycle", "session", "workload"]
    subject_id: str                                 # turn_id / tool_use_id / session_id / workload_run_id
    score: float                                    # in [0.0, 1.0]; 1.0 = clear success, 0.0 = clear failure
    confidence: float                               # in [0.0, 1.0]; judge's confidence in `score`
    judge_kind: Literal["heuristic", "llm", "hybrid"]
    judge_model: str | None                         # canonical id when llm or hybrid used the LLM tier; else None
    judge_cost_usd: Decimal                         # 0 for heuristic; > 0 for llm/hybrid
    judge_pricing_version: str | None               # set when judge_cost_usd > 0
    judge_latency_ms: int                           # wall time for this verdict alone
    rubric_id: str                                  # which rubric produced this (e.g. "turn-heuristic-v1")
    rubric_version: str                             # rubric's own version string
    signals: dict[str, object]                      # judge-specific evidence (see §4.4)
    parent_eval_id: str | None                      # for tool_cycle → turn / turn → session rollups
    created_at: str                                 # ISO 8601 UTC

4.2 The `score` field

score is a single number in [0.0, 1.0]. Higher is better. This shape is deliberate:

The pattern store’s (routing-engine.md §5.5) normalized_success_M formula expects success_score in [0, 1]. One number, one consumer contract.
Multi-dimensional rubrics (correctness × completeness × verbosity × …) are expressible as signals and collapse to one score via the rubric’s own weights. The rubric is versioned (rubric_version) so a weight change is observable as a new score series.
Single numbers compose. Session score = weighted average of turn scores. Workload score = weighted average of session scores. Multi-dimensional scores require a per-dimension policy at every join — needless complexity for v1.

The “score is one number” rule is the only structural commitment to the pattern store consumer. Everything else under signals is judge-internal and may evolve without breaking routing.

4.3 The `confidence` field

confidence is the judge’s stated confidence in the score it produced. Distinct from the score:

A heuristic judge with all signals firing cleanly (no tool.failed, no manual_swap follow-up, stop_reason=end_turn, low retry similarity) emits high confidence even when the heuristic’s rubric is coarse — it’s saying “the signals I have are unambiguous.”
A heuristic judge with conflicting signals (stop_reason clean but a feedback.implicit.type=retry followed) emits low confidence — “I see the signal but I’m not sure how to read it.”
An LLM judge emits a confidence that reflects its own self-reported certainty (rubric-prompted; see §5.2).

Confidence is a gate, not a score modifier. Consumers (pattern store, analytics) filter by confidence threshold before aggregating. The score itself is not down-weighted by confidence — that would conflate two distinct signals.

Routing-side gate. The pattern store ignores verdicts with confidence < pattern.min_eval_confidence (default 0.5, configured per the planned pattern-store.md — see §15). Low-confidence verdicts still record (for the agreement-rate view) but don’t drive routing.

4.4 The `signals` dict

Free-form, judge-specific evidence. Stable conventions:

Keys are snake_case strings.
Values are JSON-roundtrippable (no Decimal, no objects). Numeric weights belong in the rubric, not in signals.
Required-by-convention keys per judge_kind:
- Heuristic: flags: list[str] — the heuristic flags that fired (e.g. ["stop_reason_clean", "no_tool_failure"]). flags_negative: list[str] for flags that fired against the subject.
- LLM: rationale_hash: str (SHA-256 of the judge’s natural-language rationale), rationale_redacted: str | None (populated only on opt-in, similar to turn.started.user_message_text_redacted per event-bus-and-trace-catalog.md §4.4.1).
- Hybrid: heuristic_score: float, heuristic_confidence: float, escalated: bool, plus the LLM keys if escalated.

signals is opaque to the score; it exists for the audit trail and for re-evaluation (the next time this subject is judged, the new judge can see what the prior signals were).

4.5 The `judge_cost_usd` field

Decimal, computed via the existing PriceTable.compute_cost (pricing/table.py), stamped with judge_pricing_version. Same convention as canonical-message-format.md §6.4: aggregate in Decimal, serialize as JSON number with 6 decimal places at the wire boundary per analytics-api.md §5.1.

For heuristic judges, judge_cost_usd is exactly Decimal("0") and judge_pricing_version is None. This is deliberate — pricing semantics don’t apply to code that did no inference.

4.6 Re-evaluation

A subject may have many verdicts over time. Re-running the evaluator on a past turn_id produces a new EvalVerdict with a fresh eval_id; the old verdict is not replaced or invalidated. The verdict table is append-only by construction (it’s the trace store) and the analytics projection queries it like any other event (§9).

The (subject_kind, subject_id, eval_id) triple is the natural sort key. “Latest verdict” is ORDER BY eval_id DESC LIMIT 1 per subject. “Agreement rate” joins distinct verdicts across runs (§9.2).

This is why the verdict is on the bus, not a column on turn.completed: the latter would make re-evaluation a destructive operation.

5. The judge

5.1 Turn heuristic rubric

rubric_id = "turn-heuristic-v1". Cost: $0. Latency: <1ms.

Inputs (all derived from events already in the trace store for turn_id):

Signal name	Source	Direction
`stop_reason_clean`	`turn.completed.stop_reason == "end_turn"`	positive
`no_llm_failure`	No `llm.call_failed` in turn	positive
`no_tool_failure`	No `tool.failed` in turn (uncaught Python exception path)	positive
`no_tool_exit_failure`	No `tool.completed` with `success=False` in turn (clean-exit-nonzero path; e.g. shell-tool nonzero return code)	positive (strong; single failure must drop a clean turn’s score by ≥0.3)
`no_max_tokens_hit`	No `llm.call_completed.stop_reason == "max_tokens"` in turn	positive
`tool_cycle_count_reasonable`	`turn.completed.tool_call_count` ≤ a configured threshold (default 20)	positive
`assistant_refusal_detected`	`signals_extra.final_response_text` begins with a refusal phrase (e.g. “I cannot help”, “I’m unable to”) within the first 160 chars	negative (×0.5)
`empty_assistant_response`	`signals_extra.final_response_text` is whitespace-only	negative (×0.4)
`no_retry_implicit`	No `feedback.implicit.type == "retry"` whose `subject_turn_id == turn_id` within next 5 user messages	positive
`no_manual_swap_after`	No `feedback.implicit.type == "manual_swap"` whose `subject_turn_id == turn_id`	positive
`no_edit_followup`	No `feedback.implicit.type == "edit_followup"` whose `subject_turn_id == turn_id`	positive
`explicit_thumbs_up`	A `feedback.explicit.rating == "thumbs_up"` with `subject_turn_id == turn_id`	positive (heavy weight)
`explicit_thumbs_down`	A `feedback.explicit.rating == "thumbs_down"` with `subject_turn_id == turn_id`	negative (heavy weight)

The score is the rubric-weighted sum of fired signals normalized to [0, 1]. Concrete v1 weights are an implementation detail of the rubric file (rubrics/turn-heuristic-v1.yaml, not specified here); the contract is that the score is bounded and that explicit feedback dominates implicit signals dominates lifecycle signals.

Two distinct tool-failure signals. v1 distinguishes tool.failed (an uncaught Python exception raised inside a Tool.execute body — the dispatcher catches it and emits tool.failed) from tool.completed with success=False (the tool ran cleanly to completion but returned a non-success outcome — the canonical case is the shell tool reporting a non-zero exit code). Both are real failures from the agent’s perspective; the rubric reads them as two independent gates so a shell tool that prints "FAIL N/M" and exits 1 (success=False, no exception) lowers the score by the same shape as an uncaught exception would. The no_tool_exit_failure weight is sized so that a single failed exit drops a clean turn’s score by ≥0.3 and the resulting confidence below the v1 hybrid escalation threshold (0.7, see §5.3), so HybridJudge escalates to the LLM judge on this class of failure without depending on assistant-text content signals.

Content penalty (opt-in). assistant_refusal_detected and empty_assistant_response apply as multiplicative penalties on the normalized score (×0.5 and ×0.4 respectively), not as weighted lifecycle signals. They fire only when the caller plumbs final_response_text via SubjectContext.signals_extra. The refusal regex is anchored to the first 160 chars of the stripped response so substantive answers that incidentally quote a refusal phrase don’t false-positive.

signals_extra contract. The session manager’s _emit_turn_completed stamps three text keys onto turn.completed.signals_extra when the underlying string is non-empty (any missing string is omitted so the judge’s “(not available)” fallback fires honestly):

Key	Source	Reader
`final_response_text`	last assistant text block in the turn	heuristic content-penalty path (this section)
`assistant_response_text`	alias of `final_response_text`	LLM judge `_build_user_message` (see §5.2)
`user_prompt_text`	first text block of the persisted user message	LLM judge `_build_user_message` (see §5.2)

The two assistant-text keys are intentionally aliased to the same string so producer and consumer evolved independently — the heuristic content-penalty path was wired before the LLM judge tier landed and reads the older name; the LLM judge ships with the newer one. A future migration can drop the alias once the consumer side converges. The benchmark workload harness (see §5.4) also populates these keys at the workload subject level, so workload-level evaluation exercises the same readers.

Confidence is high when ≥ N signals fire in the same direction with no conflict; low when signals contradict (e.g. clean stop reason but implicit retry detected later). Concrete threshold lives in the rubric file.

Lookahead window. Some signals (no_retry_implicit, no_manual_swap_after, no_edit_followup) require seeing user messages after the turn being judged. The heuristic judge waits until either (a) 5 user messages follow in the same session, (b) the session ends, or (c) 24 hours pass without progress, then commits. v1 ships the heuristic with the lookahead window configurable per workspace; the trade-off is verdict latency vs verdict richness, and the dashboard’s “pending” tile makes the backlog visible.

5.2 LLM-as-judge rubric

rubric_id = "turn-llm-v1". Cost: typically $0.001–$0.01 per turn, depending on judge model and turn size. Latency: 500–3000ms.

The LLM judge ingests:

The user’s prompt text and assistant’s final response text for the turn (from the canonical messages table per canonical-message-format.md §9.1).
The list of tool.called / tool.completed events with their hashes and side-effect classifications.
The turn’s route.decided.chosen_model.

It prompts a small model (default anthropic:claude-haiku-4-5, configurable; the same model the routing pipeline considers cheap) with a fixed rubric prompt asking for:

A success score in [0, 1] with one decimal.
A self-reported confidence in [0, 1] with one decimal.
A one-sentence rationale.

The judge response is parsed against a msgspec schema; parse failures emit eval.failed.failure_mode="judge_output_invalid" and the verdict is not written. Retries are bounded (default 1 retry on parse failure) — beyond that, the heuristic verdict (§5.1) stands as the only record for the subject.

The rubric prompt is shipped as rubrics/turn-llm-v1.md and versioned with the spec — changing the prompt is a rubric_version bump that produces a new score series on the dashboard.

Why a small model for the judge. The judge’s job is “did this look like a successful turn” — a classification, not synthesis. Spending opus to grade haiku’s work would invert the cost story. The configuration must allow a bigger judge (a buyer running benchmarks may opt in), but the default is cheap.

5.3 Hybrid escalation

rubric_id = "turn-hybrid-v1". Default judge for v1 turn evaluations.

Algorithm:

Run the heuristic rubric. Get (h_score, h_confidence).
If h_confidence >= hybrid.escalation_threshold (default 0.7), emit the heuristic verdict and stop. Cost: $0.
Otherwise, run the LLM judge. Emit a verdict with judge_kind="hybrid", score=l_score, confidence=l_confidence, signals.heuristic_score=h_score, signals.heuristic_confidence=h_confidence, signals.escalated=true.

The threshold is configurable per workspace. The session- and workload- level rubrics follow the same pattern; tool-cycle is heuristic-only in v1 (the LLM judge there would cost more than the action it’s grading).

This is the cost-vs-truth knob. escalation_threshold = 0 is “always run the LLM judge” (maximum cost, maximum signal); escalation_threshold = 1 is “never run the LLM judge” (zero cost, heuristic-only). The default lands in between, with the dashboard’s agreement-rate view (§9.2) as the calibration surface.

Implementation status (2026-05-14). LLM tier landed at packages/metis-core/src/metis_core/eval/llm_judge.py. Hybrid escalation knob default 0.7 is configurable via HybridJudge(..., escalation_threshold=...). Budget-exhausted LLM calls return a signals.budget_exhausted=True verdict (confidence=0); HybridJudge falls back to its heuristic verdict and records signals.escalation_skipped="budget_exhausted". The LLM judge also delegates to the heuristic for tool_cycle / session subjects so the v1 heuristic-only commitment for those kinds holds even when an LLM judge is wired in.

5.4 Workload rubric

For benchmark workloads (benchmark.md §3), the rubric is authored per-workload in the existing workload.yaml schema as a new optional evaluate: block:

name: fix-a-bug-small
...
evaluate:
  rubric: heuristic        # heuristic | llm | hybrid; default heuristic
  expect_substring_in_final_response: "..."   # passthrough to heuristic signals
  llm_judge_model: anthropic:claude-haiku-4-5  # only when rubric != heuristic
  weight_per_turn: 1.0                           # how turns in the workload aggregate
  grounding_tokens: ["RoutingEngine", "policy=", "PolicyEvaluation"]   # v1.1
  forbidden_grounding: ["PATTERN_LOOKUP", "RouterChain", "ModelSelector"] # v1.1
  partial_credit:                                    # v1.2
    enabled: true
    criterion: test_pass_count_ratio
    map: linear

The benchmark harness (benchmark.md §9) calls the evaluator with subject_kind=workload after the suite run. The resulting EvalVerdict is the workload’s quality score; the benchmark’s savings_pct multiplied (or filtered) by the workload’s score becomes the headline “saved X% on successful work” number a buyer can quote without qualification.

The benchmark v1’s expect.contains_substring (benchmark.md §3.1) is the special-case heuristic for the workload rubric: a present substring contributes positively to the score; an absent one negatively. New workload rubric primitives are added as evaluate.* fields, not as new schema versions.

The workload rubric also applies the same content penalty as the turn rubric — workload_assistant_refusal_detected (×0.5) and workload_empty_assistant_response (×0.4) — using the harness-supplied final_response_text. Without this, a workload whose evaluate: block has no expect_substring_in_final_response would score 1.0 on a clean refusal (lifecycle is fine; substring isn’t asserted). The intentionally-failing-task workload under benchmarks/workloads/ is the control case that exercises this — it scores < 0.8 when the agent refuses or returns nothing.

Grounding-check primitive (v1.1)

grounding_tokens and forbidden_grounding are the rubric inputs for workloads that probe hallucination / source-grounding rather than task completion. The motivating case is documented in benchmarks/RESULTS.md §A3-rev: the architectural-explanation-without-hallucination workload used a single expect_substring_in_final_response="PATTERN_RECOMMENDATION" assertion; sonnet’s response cited the real PolicyEvaluation / RoutingDecision dataclasses and lowercase policy= literals — strictly more grounded than haiku — but scored 0.50 because it didn’t parrot the UPPERCASE PATTERN_RECOMMENDATION label from the engine.py module docstring. The substring check rewarded stylistic mimicry over real grounding.

Semantics:

grounding_tokens: a list of substrings that should appear in the final response. Each one is a real symbol the agent must cite to count as grounded — class names, function names, real string-literal values the source uses. The heuristic awards present / total as a positive score component.
forbidden_grounding: a list of substrings that should not appear. Each one is a plausible-but-fabricated name a hallucinating agent would invent. The heuristic awards 1 - (present / total) as a positive score component (i.e. it pays for absence).
The two lists are independent. A workload may set either, both, or neither. When both are set, the heuristic averages the two components.

The rubric exposes a workload-level signal workload_grounding_score (plus grounding_tokens_present, grounding_tokens_missing, forbidden_grounding_present for the audit trail). The composed workload score averages this with the substring/assertion-derived score when grounding is configured — so a workload that fully grounds in real symbols and avoids fabricated ones is unaffected, and one that misses all expected symbols and contains forbidden ones is halved.

LLM tier escalation: when rubric: llm or rubric: hybrid is set, the configured grounding_tokens and forbidden_grounding lists are surfaced to the judge LLM in the user message (under a “GROUNDING HINTS” section). The LLM tier can recognize paraphrased grounding (citing a real symbol with different capitalization or via a synonym) and partial fabrications (a real prefix joined to a fake suffix) that the heuristic substring match would miss. The LLM judge’s score remains a single [0, 1] number; the grounding hints are inputs, not a separate axis.

Cost discipline: heuristic-tier grounding is $0; LLM-tier grounding escalation is one judge call per workload, governed by the same BudgetTracker caps as the per-turn LLM judge.

Partial-credit primitive (v1.2)

partial_credit is the rubric input for workloads where the agent’s final response carries a count (test pass/fail tallies, sub-task scoreboards) rather than a single boolean substring. The motivating case is documented in benchmarks/RESULTS.md §A3-rev6 / §13a-1: across six A3 iterations the per-workload haiku-vs-sonnet quality gap on the v1 suite is below the heuristic judge’s resolution. Pass/fail substring detection collapses partial successes — 12/16 regex cases, 3/4 pytest tests — to 0; partial-credit surfaces the mid-range gradient haiku and sonnet actually produce.

Schema:

evaluate:
  rubric: heuristic
  partial_credit:
    enabled: true
    criterion: test_pass_count_ratio   # only criterion in v1
    map: linear                        # "linear" | "stepped"

Semantics:

enabled: false (default): partial-credit is off; the workload falls back to the pre-v1.2 substring path.
enabled: true: the heuristic parses final_response_text for the configured criterion, applies map, and folds the resulting score into the composed workload score in place of the pass/fail substring assertion. expect_substring_in_final_response is bypassed when partial-credit is active — pick one or the other.
criterion: test_pass_count_ratio: the parser recognizes two shapes, preferring whichever produces an explicit total:
1. PASS N/M / FAIL N/M runner output (per the runner.py convention used in this repo’s workloads). The last occurrence in the text wins, so iterative per-case lines followed by the final summary line are graded correctly.
2. Pytest summary tokens: N passed, M failed, K error(s). Total is passed + failed + errors; skipped tests are excluded from the denominator (a skip isn’t a pass or a fail). The ratio is passed / total. When the response contains no parseable test signal (e.g. the agent never reached the test step), the ratio is 0.0 and the partial_credit_no_test_signal negative flag fires — a missing signal is treated as a failure.
map: linear (default): pass the ratio through unchanged. Mid-ratios produce mid-scores; perfect-pass produces 1.0 (recovering the same composed score as the prior substring_present=True path); zero-pass produces 0.0 (matching the prior substring_present=False halving).
map: stepped: round the ratio to the nearest 0.25 (so the only possible mapped scores are 0.0, 0.25, 0.5, 0.75, 1.0). Useful when the caller wants a stable bucketed score rather than the continuous version. Endpoints (0/N and N/N) stay exact.

Composition: when partial-credit is active, the composed score is (base + partial_credit_score) / 2.0, parallel to how the grounding score folds in. The rubric exposes workload-level signals partial_credit_score, partial_credit_ratio, partial_credit_passed, partial_credit_total, partial_credit_criterion, partial_credit_map, and partial_credit_test_signal_found for the audit trail. Pre-v1.2 workloads (no partial_credit block) are unaffected.

Cost discipline: heuristic-tier partial-credit is $0 (pure regex over the response text). The LLM tier does not consult partial-credit — it forms its own [0, 1] judgment.

5.5 Tool-cycle rubric

rubric_id = "tool-cycle-heuristic-v1". Heuristic only in v1.

Signals:

tool_succeeded: tool.completed with success=true.
output_size_in_window: tool.completed.output_size_bytes within a per-tool reasonable range (the range is rubric-internal).
no_immediate_re_call_same_input: the next tool.called in the turn doesn’t have the same input_hash for the same tool_name.
no_thrash_in_window: within the next 3 tool calls, the same tool_name is not called with input_hash differing by a small Hamming-style threshold (catches “re-call with one arg tweaked”).

Tool-cycle verdicts attach to their parent turn via parent_eval_id. They do not arithmetic-average into the turn score; the turn rubric reads signals.tool_cycles_with_score_below_threshold and applies its own weight.

5.6 Session rubric

rubric_id = "session-aggregate-v1". Heuristic-only (LLM at session scale is expensive and the per-turn signal is already in the bus).

Algorithm:

Aggregate child turn verdicts: mean_turn_score, min_turn_score, turns_with_explicit_thumbs_down.
Add session-scoped signals: feedback.explicit with scope="session", number of distinct models swapped via /model, session ended with disposition="abandoned".
Emit a session-level verdict with signals carrying the child eval_ids.

6. When the evaluator runs

6.1 Bus subscriber (online)

The evaluator registers a non-fast-path subscriber per event-bus-and-trace-catalog.md §3.4:

Filter event	Action
`turn.completed`	Queue a `turn`-subject evaluation (delayed until the lookahead window resolves; §5.1)
`tool.completed`	Queue a `tool_cycle`-subject evaluation
`tool.failed`	Queue a `tool_cycle`-subject evaluation
`session.ended`	Queue a `session`-subject evaluation
`feedback.explicit`	Re-queue the matching turn / session for re-evaluation (the verdict gains a thumb signal)

Non-fast-path is the right choice: the heuristic judge is fast (~ms), the LLM judge is slow (seconds), and neither is on a critical user-facing path. A backlog is observable as eval.started - eval.completed over time on /analytics/quality.backlog.

6.2 Batch re-evaluation (offline)

The dashboard’s agreement-rate view (§9.2) depends on re-evaluating subjects on demand. The evaluator exposes a re_evaluate(window, subject_kind, rubric_id) entry point (CLI subcommand in v1: metis evaluate --since <ts> --subject turn --rubric turn-hybrid-v1) that runs the judge across the trace store window and emits fresh eval.completed events.

Re-evaluation is bounded by the same per-day and per-session judge_cost_usd caps (§7) as online evaluation — the cap is across both modes.

6.2.1 Batch submission mode (`--batch-mode` / `--collect-batches`)

For offline re-evaluation runs over a wide window, the evaluator opts in to the provider’s batch API at a flat 50% discount per provider-adapter-contract.md §4.6. The CLI exposes two complementary flags on metis evaluate:

--batch-mode — walks the trace store for in-window subjects, builds an LLM-judge CanonicalRequest per subject, bundles them into one call to adapter.submit_batch(requests), persists the returned BatchHandle rows to a small evaluator_batch_handles table on the trace DB, and exits without waiting. Anthropic’s SLA is best-effort 24h, so blocking is incompatible with operator workflow. STDOUT prints the submitted batch_id(s) + request count for the operator’s log.
--collect-batches — reads pending evaluator_batch_handles rows, polls each via adapter.poll_batch, and for completed batches calls adapter.fetch_batch. Each result row is parsed into an EvalVerdict via the LLM judge’s _parse_response helper and emitted as an eval.completed event with signals.pricing_mode='batch' so the savings dashboard’s group_by=pricing_mode partition lights up. The handle row transitions to status='ingested' with ingested_at_ms stamped. Idempotent — a second invocation skips ingested rows and emits no duplicate events.

Two-pass workflow:

# Tuesday afternoon: kick off the backfill, walk away.
metis evaluate --db-path ~/.metis/metis.db --subject turn \
    --since 2026-05-01T00:00:00Z --batch-mode

# Wednesday morning: ingest whatever's ready (idempotent).
metis evaluate --db-path ~/.metis/metis.db --collect-batches

The persisted handle table is evaluator_batch_handles on the existing trace DB (additive; TRACE_SCHEMA_VERSION unchanged because CREATE TABLE IF NOT EXISTS makes the migration a no-op on existing DBs). Schema: one row per custom_id (== canonical request_id), carrying batch_id, provider, submitted_at_ms, subject_kind, subject_id, session_id, turn_id, judge_model, status (pending|ingested), and ingested_at_ms.

Per-request failures inside a successfully-completed batch surface as eval.failed events, NOT as exceptions — same posture as the sync LLM judge’s bounded retry. Batch-level failures (entire batch expired or aborted before any results were produced) raise AdapterError; the handle row stays pending so a later --collect-batches re-tries.

6.3 No mid-turn evaluation

Per §2.2.2. The evaluator never reads in-flight state. This is what keeps it off the turn-locked-model path (AGENTS.md “Gotchas”) and out of the routing chain.

7. Budget and safety

The LLM judge spends real money. Three caps apply:

Cap	Default	Configurable	Effect
`eval.per_session_max_usd`	`Decimal("0.10")`	yes	After spend, eval queue drops `judge_kind in (llm, hybrid)` for the session — heuristic still runs.
`eval.per_day_max_usd`	`Decimal("1.00")`	yes	After spend, eval queue drops `judge_kind in (llm, hybrid)` workspace-wide for the day.
`eval.escalation_threshold`	`0.7`	yes	Hybrid judge escalates to LLM only when heuristic confidence is below this.

When a cap throttles a verdict, an eval.completed is still emitted with judge_kind="heuristic" (the heuristic always runs) and signals.throttled_reason: Literal["session_cap", "daily_cap"]. No verdict is silently dropped.

Kill switch: workspace config eval.disabled = true skips both judges. No eval.* events emitted for that workspace; the bus subscriber unregisters. Useful for benchmarking the cost story without the evaluator’s own cost confounding it.

Why both per-session and per-day caps. A single chatty session can exhaust the daily budget alone (LLM judge on every escalated turn × tens of turns × $0.005 ≈ pennies-fast). The per-session cap prevents one session from starving the rest of the day’s evaluation; the per-day cap is the hard ceiling the operator sees on their bill.

8. Events

Three new bus catalog events. All payloads are msgspec.Struct(frozen=True) defined in packages/metis-core/src/metis_core/events/payloads.py when the implementation lands (this spec describes the contract only).

8.1 `eval.started`

Sensitivity: pseudonymous Phase: 3 Actor: SYSTEM Parent: turn.completed / tool.completed / tool.failed / session.ended / feedback.explicit

{
    "eval_id": str,                                   # monotonic ULID
    "subject_kind": Literal["turn", "tool_cycle", "session", "workload"],
    "subject_id": str,
    "rubric_id": str,
    "rubric_version": str,
    "judge_kind_planned": Literal["heuristic", "llm", "hybrid"],
    "trigger": Literal["bus", "batch", "feedback_arrived", "benchmark"],
}

8.2 `eval.completed`

Sensitivity: user_controlled (floor; downgrades to pseudonymous per event-bus-and-trace-catalog.md §4.4.1 when signals.rationale_redacted is absent) Phase: 3 Actor: SYSTEM Parent: eval.started

{
    "eval_id": str,
    "subject_kind": Literal["turn", "tool_cycle", "session", "workload"],
    "subject_id": str,
    "score": float,                                   # in [0.0, 1.0]
    "confidence": float,                              # in [0.0, 1.0]
    "judge_kind": Literal["heuristic", "llm", "hybrid"],
    "judge_model": str | None,
    "judge_cost_usd": str,                            # Decimal serialized as string (canonical: same as Usage.cost_usd)
    "judge_pricing_version": str | None,
    "judge_latency_ms": int,
    "rubric_id": str,
    "rubric_version": str,
    "signals": dict,                                  # see §4.4
    "parent_eval_id": str | None,
}

Cost is serialized as a string (mirrors Usage.cost_usd in canonical-message-format.md §6.4) so the JSON envelope round-trips through the trace store without Decimal loss.

Sensitivity floor. The catalog floor is user_controlled — the worst case, when signals.rationale_redacted is populated (the user opted into capturing LLM judge rationales) and the event carries user-derived text. When the rationale field is absent (heuristic verdict, or LLM verdict without rationale opt-in), the subscriber passes pseudonymous to make_event — a downgrade toward less private, which the dynamic-sensitivity rule in event-bus-and-trace-catalog.md §4.4.1 allows.

8.3 `eval.failed`

Sensitivity: pseudonymous Phase: 3 Actor: SYSTEM Parent: eval.started

{
    "eval_id": str,
    "subject_kind": Literal["turn", "tool_cycle", "session", "workload"],
    "subject_id": str,
    "failure_mode": Literal[
        "judge_output_invalid",       # LLM response didn't parse against the rubric schema
        "judge_call_failed",          # LLM call hit a hard error (provider down, auth, etc.)
        "throttled_no_heuristic",     # caps fired AND heuristic also unavailable (shouldn't happen in v1; defensive)
        "subject_not_found",          # subject_id resolved to no events
        "rubric_invalid",             # rubric file failed to load
    ],
    "error_message": str,
    "judge_latency_ms": int,
}

throttled_no_heuristic is defensive — v1 always has a heuristic fallback, so the live path never emits it. Reserved for future configurations where a rubric is LLM-only.

8.4 Sensitivity classifications

Summary of the three new events in the event-bus-and-trace-catalog.md §4.4 frame:

Event	Floor sensitivity	Downgrade pathway
`eval.started`	`pseudonymous`	(no opt-in fields)
`eval.completed`	`user_controlled`	`pseudonymous` when `signals.rationale_redacted` is absent
`eval.failed`	`pseudonymous`	(no opt-in fields)

The eval domain joins the closed domain list in event-bus-and-trace-catalog.md §4.5. This is an additive domain (no collision with feedback, which describes user-supplied signal; the evaluator describes the system’s assessment).

Payload registry. The three payloads land in PAYLOAD_REGISTRY (event-bus-and-trace-catalog.md §6) when the implementation lands, per the same convention used for the gateway’s pending payload-field additions (CHANGES.md 2026-05-13 gateway entry). The catalog spec gains §6.11 (the eval domain) at implementation time.

9. Analytics surface

The evaluator’s data is consumable from /analytics/* per analytics-api.md §2.1 (“read-only and derived”). One new endpoint, plus an additive field on the existing cost view.

9.1 `GET /analytics/cost` — additive `include_eval` parameter

GET /analytics/cost?group_by=model&include_eval=false   # default: only llm.call_completed rows
GET /analytics/cost?group_by=model&include_eval=true    # add eval.completed.judge_cost_usd to model totals

Default false keeps the savings story honest (eval spend is metis’s overhead, not the buyer’s agent workload). The dashboard’s overhead tile sets include_eval=true and renders the eval cost as a separate column. SPA-side: subtract for “agent-only spend,” include for “total Metis spend.”

9.2 `GET /analytics/quality`

New endpoint. Aggregates eval.completed events over a time window.

Query parameters:

Parameter	Type	Required	Default
`from`,`to`	ISO 8601 UTC	no	last 7d
`subject_kind`	`turn` \| `tool_cycle` \| `session` \| `workload`	no	`turn`
`group_by`	`model` \| `judge_kind` \| `rubric_id` \| `none`	no	`model`
`min_confidence`	float	no	`0.0`

Response shape (group_by=model):

{
  "window": {...},
  "current_pricing_version": "...",
  "data": [
    {
      "chosen_model": "anthropic:claude-haiku-4-5",
      "verdict_count": 142,
      "mean_score": 0.82,
      "p50_score": 0.85,
      "p10_score": 0.50,
      "mean_confidence": 0.71,
      "judge_cost_usd_total": 0.0823,
      "thumbs_down_count": 3
    }
  ]
}

chosen_model is joined from the route.decided event of the subject turn — the model whose work is being judged, not the judge’s model. The join walks subject_id (turn_id) → route.decided.chosen_model; per analytics-api.md §4.3 the routing event is one per turn so this is a 1:1.

Agreement rate tile (computed from the same source). When two distinct verdicts exist for the same subject_id (one online, one from a batch re-eval; or two with different rubric_ids), the dashboard computes “verdict agreement” as the fraction whose |score_a - score_b| <= agreement_window (default 0.15, configurable client-side). This is a SPA- side computation over the verdict rows; the API surface is just the eval.completed event projection. No new endpoint needed.

Backlog tile. verdict_count versus the count of eval.started without a matching eval.completed / eval.failed in the window. SPA queries /analytics/quality?subject_kind=turn and counts open eval_ids client-side.

9.3 Drill-down: turn analytics gains an `evaluations` field

analytics-api.md §4.6 (GET /analytics/turns/{id}) gets an additive evaluations array in data. Each entry is the EvalVerdict shape from §4.1. No breaking change; existing consumers ignoring the field continue to work.

This is the only place a signals.rationale_redacted value can surface to the UI (under the opt-in sensitivity uplift in §8.4).

9.4 Negative space

Not in v1:

/analytics/quality_trend (time series of mean score by week). Cheap to add later via group_by=day; not worth the SPA tile yet.
Cross-workload quality dashboards (the benchmark harness emits its own workload-level verdict; the SPA can fetch via subject_kind=workload).
Per-rubric comparison views. Add when there are multiple rubrics shipping (v1 ships exactly one per subject_kind).

10. Storage

Verdicts are bus events, persisted in the existing trace store (canonical-message-format.md §9.1, events table). No new table.

Indexes:

The existing (type, timestamp_us) index covers the time-windowed projection in /analytics/quality.
A new index idx_events_eval_subject on (json_extract(payload_json, '$.subject_kind'), json_extract(payload_json, '$.subject_id'), id) is optional — at single-user scale (≤10K verdicts) the full scan over type='eval.completed' rows is fast enough. The index lands as a Phase 3 follow-up if the agreement-rate query measurably slows the dashboard.

Following analytics-api.md §2.1: the source of truth is the bus + trace store; analytics is a projection. No rollup, no materialized verdict table.

11. The feedback loop

The verdict feeds future turn routing through the pattern store. Three consumers:

11.1 Pattern store (slot 4 of the routing chain)

The pattern store (planned spec: pattern-store.md; see §15 for coordination) reads eval.completed.score as the success_score in outcome.primary_model clusters. Per routing-engine.md §5.5:

normalized_success_M = mean(success_score) for neighbors with primary_model = M

The pattern store’s K-nearest aggregation reads from eval.completed events directly (not from a separate outcome table). Verdicts with confidence < pattern.min_eval_confidence (default 0.5) are excluded from the aggregation but stay in the trace store (they’re still useful for the agreement-rate view).

Which verdict wins when there are multiple. The pattern store reads the latest eval.completed per (subject_kind, subject_id) — MAX(eval_id) per subject. Re-evaluation, by construction, supersedes older verdicts for routing purposes.

11.2 Analytics-driven escalation (Phase 3+, surface only)

When a model accumulates mean_score < quality_floor (default 0.6) over a configurable window on /analytics/quality?group_by=model, the dashboard surfaces it (“This task type is underperforming on haiku — consider escalating”). The surface is a banner, not an automatic rule change. Auto-escalation belongs to the pattern store, not the analytics view.

11.3 Benchmark headline

The benchmark harness (benchmark.md §8) gains a quality column in its report: each workload’s score, the suite-level mean, and the gating note “savings on successful work: $X of $Y total.” This is what turns “saved 67%” into “saved 67% on work the evaluator judged as successful at confidence ≥ 0.5.” The exact format lands in a follow-up benchmark.md amendment when the evaluator implementation does.

12. Invariants

Append-only. eval.completed events are never updated or deleted. Re-evaluation produces new events.
One score field. The score is one number in [0, 1]. Multi- dimensional rubrics collapse into signals, never into the score field.
Heuristic always available. Every subject for which the heuristic has all required input events produces a verdict. Throttling never silently drops; it downgrades the planned judge_kind.
Eval cost is recorded. judge_cost_usd is always set (zero for heuristic; positive for LLM / hybrid that escalated). Pricing version is stamped when cost > 0.
No routing-chain involvement. The evaluator is not a chain slot. It does not appear in route.decided.chain. It feeds the pattern store (which is a slot) via the bus, not via direct call.
No mid-turn execution. Subscribers fire on terminal events (turn.completed, etc.), never inside a turn.
Rubric is versioned. Every verdict carries rubric_id and rubric_version. Changing the rubric is a version bump, not a silent recalibration.
Confidence gates routing, not display. The dashboard shows all verdicts; the pattern store filters by confidence threshold.

13. Open questions

These are live. Do not unilaterally close them.

Hybrid escalation threshold default. 0.7 is a guess. The right value is whatever makes the agreement-rate view show heuristic and LLM agreeing ≥ 95% of the time at that threshold — observable only after the dashboard ships and data accumulates.
Should explicit feedback.explicit overwrite a prior verdict’s score? Today a thumbs-down arriving after a verdict triggers a re-evaluation that produces a new verdict (the old verdict stays recorded). Alternative: the new feedback updates the latest verdict’s signals without producing a new verdict. The current shape is cleaner for the agreement-rate view; the alternative is cheaper in event count.
LLM judge prompt-cache discipline. Per context-assembler.md, the LLM judge’s rubric prompt is stable and a perfect cache candidate. The implementation should mark it for caching; the spec doesn’t yet pin the cache contract for the judge’s request shape.
Workload-rubric primitives. v1 leans on expect_substring_in_final_response plus the LLM judge. The benchmark workloads may want richer primitives (assert tests pass, assert a file’s contents match a pattern). Wait for the v1 suite to settle before adding noise.
Tool-cycle LLM judging. v1 ships heuristic-only at the tool-cycle level. The thrash detector catches the obvious case; subtle cases (tool succeeded but the answer was wrong) require the LLM judge. The cost calculus is unfavorable at v1 scale — revisit when there’s evidence of missed signal.
Cross-rubric agreement as a calibration signal. Two heuristic rubrics scoring the same subjects (e.g. “lenient” and “strict”) produce an agreement series independent of the LLM. This is a v2 tool for rubric maintenance; not on the v1 surface.
Per-user calibration. A future signal: “user X consistently thumbs-down turns the heuristic scored ≥ 0.8.” The eval rubric could learn a per-user offset. Defer until multi-user / multi-profile lands per the project strategy (private).
Evaluating the evaluator. Spot-checking via benchmark workloads with hand-asserted expectations is the v1 calibration plan. A more rigorous “golden verdict” corpus (a sampled set of turns the operator manually scores) is plausible but expensive — and would itself need a labeling UI. Open until evidence shows the v1 plan is insufficient.
Cost of re-evaluation at scale. Batch re-evaluation under the per-day cap is bounded, but the cap could throttle an honest investigation. Caps are workspace config; a one-off metis evaluate --override-cap may be needed. Wait for actual friction.

14. Testing strategy

V1’s tests cover the contract, not the rubric weights (those live in the rubric files and may evolve without the spec needing to change).

14.1 Required tests

Heuristic verdict shape. Seed a clean turn.completed + supporting events; assert eval.completed carries judge_kind="heuristic", judge_cost_usd == Decimal("0"), judge_pricing_version is None, and score is in [0, 1].
Hybrid escalation threshold. A turn whose heuristic produces confidence < escalation_threshold triggers the LLM judge (mocked via the scripted adapter, per conftest.py / tests_shared/); the verdict carries judge_kind="hybrid", signals.escalated=true, and judge_cost_usd > 0.
Hybrid no-escalation. A turn whose heuristic produces confidence >= escalation_threshold produces a verdict with judge_kind="heuristic", no LLM call, judge_cost_usd == 0.
Per-session cap downgrade. A configured eval.per_session_max_usd = Decimal("0") causes a judge_kind="hybrid" plan to emit a heuristic verdict with signals.throttled_reason="session_cap".
Per-day cap downgrade. Same shape, signals.throttled_reason ="daily_cap". Caps are independent.
Re-evaluation produces a new event. Re-running the evaluator on a prior turn_id produces a second eval.completed with a fresh eval_id; both are queryable; pattern-store-style “latest” query returns the newer.
Confidence gate filters analytics aggregation. Verdicts below min_confidence are excluded from mean_score on /analytics/quality but still present in verdict_count (or vice versa, per the response contract; pin this in implementation tests).
Subject not found emits eval.failed. Triggering an eval against a nonexistent turn_id produces eval.failed.failure_mode ="subject_not_found" and no eval.completed.
Invalid LLM judge response. A scripted LLM judge whose response fails the rubric schema produces eval.failed.failure_mode ="judge_output_invalid" after the bounded retry; no eval.completed.
Rubric versioning. Two verdicts on the same subject with different rubric_version strings both persist; /analytics/quality?group_by =rubric_id shows two rows.
Tool-cycle attaches to parent turn. A tool.completed triggers a tool_cycle-subject verdict whose parent_eval_id references the turn’s eval (when the turn has been evaluated; null otherwise).
Session aggregation reads child turn verdicts. A session.ended with three child turn verdicts produces a session-subject verdict whose signals.child_eval_ids lists all three.
Sensitivity uplift. Setting signals.rationale_redacted produces a recorded event whose sensitivity == "user_controlled"; omitting the field keeps it at "pseudonymous".
Eval cost folds into /analytics/cost?include_eval=true. Sum of eval.completed.judge_cost_usd matches the delta between the two endpoint calls.
No mid-turn fire. The subscriber filter doesn’t match llm.call_completed events; verifying via subscription registration introspection.

14.2 Property tests

Score boundedness. All emitted eval.completed.score are in [0.0, 1.0].
Heuristic determinism. Same input events → same heuristic verdict (score, confidence, signals.flags) byte-equal. LLM judges are excluded (provider variance).
Cost monotonicity in re-evaluation. Re-evaluating a turn under a more-expensive judge never reduces total eval spend recorded.

15. Coordinates with `pattern-store.md`

Authored 2026-05-13 in parallel with Agent 3A’s draft of pattern-store.md. Reconciled 2026-05-14 — see CHANGES.md “Pattern-store ↔ evaluator reconciliation sweep.” The table below reflects the reconciled contract; the open coordination items originally listed have been closed (see §15.1).

The pattern-store spec (pattern-store.md §15) imports the verdict shape and consumption semantics from this spec verbatim. This section lists the load-bearing touchpoints and their reconciled status.

Touchpoint	Reconciled outcome (2026-05-14)	Where it lives in this spec
Verdict shape (`EvalVerdict`) ownership	Evaluator owns it; pattern store consumes verbatim and does not re-specify.	§4.1
Score timing (sync vs async)	Async. Pattern-store writes outcome immediately on `session.ended` with `success_score=None`; the `eval.completed` subscriber later calls `PatternStore.update_score(turn_id, ...)`. Join key: `turn_id`.	§6.1, pattern-store §10.4, §15.3
Confidence-gate filter home + default	Lives in pattern-store config (`routing.yaml::pattern.min_eval_confidence`); default `0.5`. Evaluator emits all verdicts; pattern store filters at K-cluster aggregation time.	§4.3, pattern-store §15.4
Sample-size weighting in K-cluster aggregation	Pinned in `routing-engine.md §5.5` (2026-05-14 clarification): `Σ(success_score_i × sample_size_i) / Σ(sample_size_i)`.	routing-engine §5.5
Latest-verdict rule when multiple verdicts exist	`MAX(eval_id)` per `(subject_kind, subject_id)` — re-evaluation supersedes. Pattern store rolls back prior contribution to its outcome accumulator before applying the new score (pattern-store §10.4).	§4.6, §11.1, pattern-store §10.4
`outcome.primary_model` join	Joined client-side from `route.decided.chosen_model` of the subject turn; not embedded in `eval.completed` payload.	§9.2
Pattern domain vs eval domain	Distinct domains. Pattern store does not emit `eval.` events; evaluator does not* emit `pattern.*` events.	§2.2.2, §12
Fingerprint computation independence	Fingerprint = task shape (pattern-store concern); verdict = outcome (evaluator concern). No overlap.	(no overlap)
Cost source for K-cluster `avg_cost_M`	Sourced from `llm.call_completed.cost_usd` summed over the turn — not from `eval.completed.judge_cost_usd` (that’s the judge’s cost, surfaced separately under `/analytics/cost?include_eval=true`).	§4.5, §9.1
Session-level vs turn-level verdicts	Pattern-store K-nearest is turn-level only in v1. Session verdicts are not a pattern-store input; they surface on `/analytics/quality?subject_kind=session` for dashboard use.	§5.6

15.1 Closed coordination items

The three open items listed in the original draft are closed:

Confidence-gate filter as pattern-store override. Resolved: the filter is a pattern-store config knob (pattern.min_eval_confidence), not an evaluator-side concern. The evaluator emits unfiltered; consumers (pattern store, analytics) decide their own thresholds.
Pattern-store outcome rollup losslessness. Resolved: the eval.completed payload (§8.2) carries subject_id (the turn_id), score, confidence, eval_id, and judge_pricing_version — sufficient for the pattern store’s update_score() flow (pattern-store §10.4). No additional fields needed in v1.
Session verdicts as pattern-store input. Resolved: turn-level only in v1. Pattern-store §10.4 reads eval.completed events filtered to subject_kind=turn.

16. Decision log

Date	Decision	Rationale
2026-05-13	Numeric `score` in `[0, 1]` as the only structural commitment	Pattern store consumes one number; multi-dim rubrics expressible via `signals` collapsed by the rubric.
2026-05-13	`confidence` is a gate, not a score modifier	Conflating confidence and score loses signal; downstream consumers (pattern store, analytics) can filter or weight independently.
2026-05-13	Heuristic-first, LLM-as-judge gated by hybrid escalation	Default-cheap; LLM judge is opt-in via a single threshold knob the operator can tune from the dashboard.
2026-05-13	Verdicts are append-only bus events, not a mutable verdict table	Re-evaluation must not destroy the old verdict (it’s the source data for the agreement-rate view).
2026-05-13	Three new bus events (`eval.started/completed/failed`)	Bus-as-spine consistency; trace store gets re-eval data for free; consumers (pattern store, analytics) subscribe normally.
2026-05-13	Non-fast-path subscriber	Heuristic is ms-fast; LLM judge is seconds; neither belongs on a user-facing path.
2026-05-13	Per-session AND per-day cost caps	One chatty session can exhaust a daily budget alone; both caps are needed.
2026-05-13	LLM judge defaults to a small model (haiku-class)	Spending opus to grade haiku inverts the cost story; small-model classification is the right tier.
2026-05-13	No mid-turn evaluation	Preserves turn-locked-model invariant; the routing chain stays out of the evaluator’s path.
2026-05-13	Single-user / local-first / per-workspace by default	Per the project strategy (private); multi-user is downstream of the gateway / replacement-agent fork.
2026-05-13	Re-evaluation produces new verdict, doesn’t mutate old	Agreement-rate-over-time is a query, not a side-table; preserves audit trail.
2026-05-13	`judge_cost_usd` is `Decimal`, serialized as string in event payload	Matches `Usage.cost_usd` convention from `canonical-message-format.md §6.4`.
2026-05-13	Rubrics versioned via `rubric_id` + `rubric_version`	Changing the rubric is a version bump that produces a new score series; old verdicts remain comparable.
2026-05-13	One new analytics endpoint (`/analytics/quality`), additive `include_eval` on `/cost`	Minimal surface increase; SPA composition (agreement-rate, backlog) computed client-side from one event projection.
2026-05-13	Workload rubric is per-workload in `workload.yaml.evaluate`	The benchmark harness already owns the workload contract; the evaluator extends it rather than building a parallel surface.

17. References

event-bus-and-trace-catalog.md — the catalog these events join; non-fast-path subscriber contract; sensitivity classifications; dynamic sensitivity for opt-in payloads.
canonical-message-format.md — Decimal cost convention, trace store schema, ULID generation.
routing-engine.md §5.5 — success_score consumption shape (the pattern-store side of the contract).
analytics-api.md — projection conventions, response envelope, SQL-injection-safe parameter whitelisting, Decimal serialization at the wire boundary.
benchmark.md — v1 limitation closed by this spec (quality scoring deferred to the evaluator); workload rubric extension.
context-assembler.md — prompt-cache discipline the LLM judge should follow (open question 13.3).
memory-store.md — sibling spec for shape reference.
../the project strategy (private) — the open question this spec closes.
../project-overview.md — Evaluator’s role in the architecture diagram; Phase 3 (“full evaluator”) deliverable.
../../AGENTS.md — turn-locked-model invariant the evaluator preserves by never running mid-turn.
pattern-store.md (planned, drafted in parallel) — the primary consumer; see §15 for coordination touchpoints.

This site is open source. Improve this page.