Status: Draft v1.2 Last updated: 2026-05-14 Owner: your name
v1.2 changes:
CanonicalResponse returns content: list[ContentBlock] + model + provider rather than a full Message (§3.3). The adapter doesn’t see the routing decision or the cost, so it returns the parts it knows and the caller (SessionManager) assembles the final canonical Message. Substitutability is unchanged: any two adapters returning the same (content, stop_reason, usage) triple still produce identical downstream Messages.
v1.1 changes: Clarified that streaming events emit to a separate streaming-only channel, not through the bus (§5.1). Pinned max_retries semantics (§6.4): total attempts = 1 + max_retries.
Throughout: paths shown use ~/.yourtool/ as a placeholder for the final config directory.
This document specifies the contract every LLM provider adapter implements — the Python interface, the wire-format translation rules, streaming normalization, error classification, cost reporting, and capability declaration.
Without this contract, adapters built in parallel (Anthropic, OpenAI, eventually Ollama and OpenRouter) will diverge structurally in subtle ways: different tool-result shapes, different cancellation semantics, different cost computations, different stream-chunk handling. The canonical-format guarantee (lossless round-trip across providers, mid-session swap survives) depends on adapters being substitutable at the contract level.
Two adapters built without this spec will pass tests individually but break when a session swaps between them. This spec is the substitutability contract.
This spec depends on:
- canonical-message-format.md for Message, ContentBlock, ToolDefinition, Usage, AdapterCapabilities.
- event-bus-and-trace-catalog.md for llm.call_* events and the error_class enum.
- streaming-protocol.md for the canonical streaming events (text.delta, tool.use_start, etc.) the adapter must emit.
- routing-engine.md for capability validation requirements (§4.4).

- error_class: rate_limit.
- cancel(request_id) that aborts an in-flight call.
- logit_bias, etc. are not in the canonical interface. Adapters may use them internally for performance but cannot require them in the canonical API.

Every adapter implements this Python protocol:
class ProviderAdapter(Protocol):
    """Implemented by every provider adapter."""

    name: str  # "anthropic" | "openai" | "ollama" | ...
    capabilities: AdapterCapabilities

    def __init__(self, config: AdapterConfig) -> None: ...

    async def complete(
        self,
        request: CanonicalRequest,
    ) -> CanonicalResponse:
        """Non-streaming call. Returns once the response is fully received.
        Raises AdapterError subclasses on failure (see §6)."""

    async def stream(
        self,
        request: CanonicalRequest,
    ) -> AsyncIterator[StreamEvent]:
        """Streaming call. Yields canonical StreamEvents in order until the
        response completes or is cancelled. See §5 for event sequence rules."""

    def estimate_input_tokens(
        self,
        messages: list[Message],
        tools: list[ToolDefinition],
        system_prompt: str | None,
    ) -> int:
        """Pre-flight token estimate for routing decisions. Does not call
        the provider; uses local tokenizer or heuristic. Accuracy: ±10%
        is acceptable."""

    async def cancel(self, request_id: str) -> bool:
        """Abort an in-flight request. Returns True if the request was
        cancelled cleanly, False if it had already completed or wasn't
        found. Idempotent."""

    async def close(self) -> None:
        """Release adapter resources (HTTP client connection pool, etc.).
        Called at server shutdown."""
class AdapterConfig:
    api_key: str | None                 # may be None for local adapters
    base_url: str | None                # override default endpoint; for proxies/Ollama
    timeout_seconds: float = 600        # overall request timeout
    max_retries: int = 2                # bounded retry within the adapter; see §6.4
    extra_headers: dict[str, str] = {}  # custom headers (e.g. for OpenRouter)
    # Adapter-specific options accepted but not required:
    options: dict = {}
options is a pass-through for adapter-specific knobs (e.g., Anthropic’s anthropic-beta headers, OpenAI’s organization field). Core code never reads from options; only the specific adapter does.
The adapter sees canonical inputs and produces canonical outputs. It does not see other adapters’ types, even indirectly.
class CanonicalRequest:
    request_id: str                   # ULID, generated by core; passed to cancel()
    messages: list[Message]           # canonical messages, in order
    tools: list[ToolDefinition]       # tools to expose; may be empty
    system_prompt: str | None         # composed by context assembler; nullable
    model: str                        # provider:name canonical id
    max_output_tokens: int            # required; adapter must honor
    stop_sequences: list[str] = []
    temperature: float | None = None
    output_schema: dict | None = None # for structured output; in v1, used only for delegation
    # Streaming-only:
    stream: bool = False              # True = use stream(); False = use complete()

class CanonicalResponse:
    request_id: str
    model: str                        # canonical "provider:name" — the actual model that served the call
    provider: str                     # adapter.name; for trace-side bookkeeping
    content: list[ContentBlock]       # the assistant's reply blocks, in order
    stop_reason: StopReason
    usage: TokenUsage                 # raw token counts, no cost
    latency_ms: int                   # wall-clock for the call

class StopReason(StrEnum):
    END_TURN = "end_turn"
    MAX_TOKENS = "max_tokens"
    STOP_SEQUENCE = "stop_sequence"
    TOOL_USE = "tool_use"
    CANCELLED = "cancelled"
    ERROR = "error"

class TokenUsage:
    input_tokens: int
    output_tokens: int
    cached_input_tokens: int = 0          # cache hit (reads from cache)
    cache_creation_input_tokens: int = 0  # cache write (creates cache entry)
    # Cost is NOT reported here; computed by core from price table.
The adapter returns content rather than a full Message because it does
not own two of the required Message fields: the RoutingDecisionRecord
(decided upstream by the routing engine) and Usage.cost_usd (computed by
the core from the local price table per canonical-format §6.4). The caller
(SessionManager) assembles the final Message by combining the adapter’s
content + model + provider with its own routing decision, cost
computation, and id allocation. Adapters never see Message on the
response side. Substitutability is unaffected: two adapters returning the
same (content, stop_reason, usage) triple produce identical downstream
Messages.
Every adapter declares its capabilities. Per routing-engine.md §4.4, routing validates against these before dispatch.
class AdapterCapabilities:
    # Content type support
    supports_images: bool
    supports_thinking: bool
    supports_tools: bool
    supports_system_prompt: bool
    supports_structured_output: bool
    # Streaming
    supports_streaming: bool
    supports_streaming_tool_calls: bool  # whether tool_use_input_delta is meaningful
    supports_parallel_tool_calls: bool   # multiple tool_use blocks in one assistant turn
    # Caching
    supports_prompt_caching: bool
    # Limits
    max_context_tokens: int
    max_output_tokens: int
    # Image format support (only meaningful if supports_images)
    accepted_image_media_types: list[str]
Declarations MUST be honest. If a model technically supports a feature but the adapter implementation doesn’t expose it, declare false. The capability surface is the substitutability boundary; lying about it breaks mid-session swaps.
For example, if Ollama’s API supports tools but the specific local model loaded doesn’t tool-call reliably, declare supports_tools: false for that model. Routing will skip it for tool turns.
This is where most of the work lives. Per provider, the adapter translates canonical → wire on request and wire → canonical on response.
Tool calls and system prompts are where the Anthropic and OpenAI wire formats diverge most. The canonical format is a superset; adapters project losslessly onto each provider’s accepted shape.
| Aspect | Canonical | Anthropic | OpenAI |
|---|---|---|---|
| Tool definition | ToolDefinition with name, description, input_schema | {name, description, input_schema} direct | {type: "function", function: {name, description, parameters}} |
| Tool call (in message) | ToolUseBlock in ASSISTANT message | tool_use content block | tool_calls[] array on the message; function.arguments is JSON-stringified |
| Tool result (separate role) | ToolResultBlock in TOOL message | tool_result content block in USER message | message with role: tool, tool_call_id, content |
| Input data type | dict (validated against schema) | dict | JSON-stringified; adapter parses on wire → canonical, stringifies on canonical → wire |
| Tool ids | Canonical tu_<ulid>; bidirectional map per session | toolu_* (provider-issued) | call_* (provider-issued) |
Adapters maintain a per-session bidirectional map between canonical and provider-issued tool ids per canonical-format §6.2. When parsing wire → canonical, look up or create the canonical id; when serializing canonical → wire, look up the provider id (or generate if first use of this canonical id with this provider).
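The look-up-or-create behaviour described above can be sketched as a small two-way dictionary. This is a minimal illustration, not the spec’s implementation: ToolIdMap, canonical_for, and provider_for are hypothetical names, one instance per (session, provider) is an assumption, and a counter stands in for the ULID that a real canonical id would carry.

```python
from dataclasses import dataclass, field


@dataclass
class ToolIdMap:
    """Bidirectional canonical <-> provider tool-id map (hypothetical helper)."""

    _to_provider: dict[str, str] = field(default_factory=dict)
    _to_canonical: dict[str, str] = field(default_factory=dict)
    _counter: int = 0

    def _link(self, canonical_id: str, provider_id: str) -> None:
        self._to_provider[canonical_id] = provider_id
        self._to_canonical[provider_id] = canonical_id

    def canonical_for(self, provider_id: str) -> str:
        """Wire -> canonical: look up, or create on first sight."""
        if provider_id not in self._to_canonical:
            self._counter += 1
            # Assumption: a real adapter would mint tu_<ulid>; a counter keeps
            # this sketch dependency-free.
            self._link(f"tu_{self._counter:06d}", provider_id)
        return self._to_canonical[provider_id]

    def provider_for(self, canonical_id: str, prefix: str) -> str:
        """Canonical -> wire: look up, or generate on first use with this provider."""
        if canonical_id not in self._to_provider:
            self._counter += 1
            self._link(canonical_id, f"{prefix}{self._counter:06d}")
        return self._to_provider[canonical_id]
```

Both directions are written in a single step, so a generated provider id always resolves back to the canonical id it was created for.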
| Canonical | Anthropic | OpenAI |
|---|---|---|
| SYSTEM role messages in list | Top-level system parameter | First message in messages with role: system |
The adapter hoists / injects as needed. Multiple SYSTEM messages in the canonical list are concatenated (with \n\n separator) before placement.
Endpoint: POST https://api.anthropic.com/v1/messages
Request shape (high level):
{
  "model": <wire model name, derived from canonical id>,
  "max_tokens": request.max_output_tokens,
  "system": <hoisted system prompt or omitted>,
  "messages": [
    # USER, ASSISTANT, TOOL messages translated; SYSTEM hoisted out
  ],
  "tools": [<tool defs>] or omitted,
  "stop_sequences": request.stop_sequences,
  "temperature": request.temperature,
  "stream": request.stream,
}
Message translation:
- USER → Anthropic user. Content blocks pass through (text, image).
- ASSISTANT → Anthropic assistant. Content blocks pass through (text, tool_use, thinking).
- TOOL → Anthropic user with tool_result content blocks. The tool_use_id is mapped to the provider’s stored id via the per-session map.

Thinking blocks: Anthropic returns these natively for extended-thinking models. The adapter passes them through as ThinkingBlock and stashes the opaque signature in provider_raw for round-trip fidelity (per canonical-format §6.5).
Token caching: The adapter MAY add cache_control markers to messages or system prompt for prompt caching. This is performance optimization; users don’t see it in the canonical surface. Cache token counts are reported in TokenUsage.cached_input_tokens and cache_creation_input_tokens.
Endpoint: POST https://api.openai.com/v1/chat/completions (or /v1/responses for newer models).
Request shape:
{
  "model": <wire model name>,
  "max_completion_tokens": request.max_output_tokens,
  "messages": [
    # SYSTEM as first role:system message; USER, ASSISTANT, TOOL as their respective roles
  ],
  "tools": [{"type": "function", "function": {...}}] or omitted,
  "stop": request.stop_sequences,
  "temperature": request.temperature,
  "stream": request.stream,
  # if request.output_schema:
  "response_format": {"type": "json_schema", "json_schema": {...}},
}
Message translation:
- SYSTEM → OpenAI system. First message; if multiple canonical SYSTEMs, concatenated.
- USER → OpenAI user. Content blocks pass through; images use OpenAI’s image_url shape.
- ASSISTANT → OpenAI assistant. Tool uses become tool_calls[] on the message; function.arguments is JSON-stringified from the canonical dict.
- TOOL → OpenAI tool. The tool_call_id is mapped via the per-session id map. Content is the tool result text (multiple content blocks concatenated).

Thinking blocks: OpenAI’s reasoning models use a different mechanism. The adapter MUST drop canonical ThinkingBlock and RedactedThinkingBlock on the way out (with a WARN-level log entry per canonical-format §7.3). On the way in, OpenAI’s reasoning content is not mapped to canonical thinking blocks in v1 (the formats are too different). This is a known asymmetry: a session that originated on Anthropic and swaps to OpenAI loses thinking-block content; a session that originated on OpenAI and swaps to Anthropic doesn’t gain thinking blocks.
Caching: OpenAI’s prompt cache is applied automatically by the provider. The adapter reports cached_input_tokens from response usage; cache_creation_input_tokens is always 0 (OpenAI doesn’t separately report cache creation).
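To make the ASSISTANT rule concrete, here is a minimal sketch of projecting canonical assistant blocks onto an OpenAI assistant message. It assumes a simplified dict form for blocks ({"type": "text" | "tool_use" | "thinking", ...}) in place of the canonical ContentBlock classes; assistant_to_openai and provider_id_for are hypothetical names.

```python
import json
from typing import Callable


def assistant_to_openai(
    blocks: list[dict],
    provider_id_for: Callable[[str], str],
) -> tuple[dict, list[str]]:
    """Return (wire message, dropped block types) for one assistant turn."""
    text_parts: list[str] = []
    tool_calls: list[dict] = []
    dropped: list[str] = []
    for block in blocks:
        if block["type"] == "text":
            text_parts.append(block["text"])
        elif block["type"] == "tool_use":
            tool_calls.append({
                "id": provider_id_for(block["id"]),
                "type": "function",
                "function": {
                    "name": block["name"],
                    # OpenAI expects arguments as a JSON string, not a dict.
                    "arguments": json.dumps(block["input"]),
                },
            })
        else:
            # Unrepresentable content (e.g. thinking): the caller logs at WARN.
            dropped.append(block["type"])
    message: dict = {"role": "assistant", "content": "".join(text_parts) or None}
    if tool_calls:
        message["tool_calls"] = tool_calls
    return message, dropped
```

Returning the dropped block types separately leaves the WARN logging to the caller, which has the session_id and message_id this function does not see.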
When canonical content cannot be represented in a provider’s wire format, the adapter MUST:
- Drop the block.
- Log at WARN with: session_id, message_id, block type, adapter name, reason. (Not a bus event; this is bus diagnostics, per the reasoning in event-bus §3.5.)

Examples:
- ThinkingBlock sent to OpenAI: dropped, logged.
- RedactedThinkingBlock cross-provider (any direction not Anthropic→Anthropic): dropped, logged.
- ImageBlock sent to a model whose supports_images: false: should never reach the adapter (routing rejects), but if it does, dropped and logged. The session manager should treat this as a bug.

Provider stream chunks are translated to the canonical streaming events from streaming-protocol.md §5.3. The adapter is the translation layer.
Channel note: streaming events (message.start, text.delta, tool.use_start, etc.) flow on a separate channel from the bus, directly to the streaming server. They are NOT bus catalog events and are NOT persisted in the trace store (per event-bus-and-trace-catalog.md §4.5.1 and streaming-protocol.md §5.1). Bus events emitted by the adapter (llm.call_started, llm.call_completed, llm.call_failed) flow through the bus normally. The adapter is responsible for emitting on the right channel for each event family.
Anthropic stream chunks (server-sent events with named types):
| Anthropic event | Canonical event |
|---|---|
| message_start | llm.call_started (bus, already emitted at request init); message.start (streaming) |
| content_block_start (type: text) | implicit (incremented content_block_index) |
| content_block_start (type: tool_use) | tool.use_start (streaming) with tool_use_id, tool_name |
| content_block_start (type: thinking) | implicit (incremented content_block_index) |
| content_block_delta (delta.type: text_delta) | text.delta |
| content_block_delta (delta.type: input_json_delta) | tool.use_input_delta with partial_json |
| content_block_delta (delta.type: thinking_delta) | thinking.delta |
| content_block_stop (text block) | implicit |
| content_block_stop (tool_use block) | tool.use_end with final_input (parsed from accumulated deltas) |
| content_block_stop (thinking block) | thinking.delta final with signature populated |
| message_delta (with usage) | accumulated for message.complete |
| message_stop | message.complete with final_content, usage |
OpenAI stream chunks (server-sent events with data: payloads):
| OpenAI chunk shape | Canonical event |
|---|---|
| First chunk with choices[0].delta.role == "assistant" | message.start |
| choices[0].delta.content (string) | text.delta with content_block_index = 0 |
| choices[0].delta.tool_calls[i].id (first appearance) | tool.use_start |
| choices[0].delta.tool_calls[i].function.arguments (string fragment) | tool.use_input_delta with partial_json |
| choices[0].finish_reason set | tool.use_end for each accumulated tool_call (with parsed JSON), then message.complete |
| usage field in final chunk (or via stream_options: {include_usage: true}) | populated in message.complete.usage |
OpenAI’s stream is more compressed than Anthropic’s; the adapter buffers per-tool-call argument fragments to emit tool.use_end at the right time.
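The buffering can be sketched as a generator over already-decoded chunk dicts. This is an illustration under simplifying assumptions: events are plain (name, payload) tuples rather than StreamEvent objects, provider tool ids are used directly instead of going through the canonical id map, and usage handling is omitted. normalize_openai_chunks is a hypothetical name.

```python
import json
from collections.abc import Iterable, Iterator


def normalize_openai_chunks(chunks: Iterable[dict]) -> Iterator[tuple[str, dict]]:
    """Yield (event_name, payload) tuples from simplified OpenAI-style chunks."""
    calls: dict[int, dict] = {}  # tool_call index -> {"id", "args"}
    started = False
    for chunk in chunks:
        choice = chunk["choices"][0]
        delta = choice.get("delta", {})
        if not started and delta.get("role") == "assistant":
            started = True
            yield ("message.start", {})
        if delta.get("content"):
            yield ("text.delta", {"text": delta["content"], "content_block_index": 0})
        for tc in delta.get("tool_calls", []):
            fn = tc.get("function", {})
            if tc.get("id"):  # first appearance of this tool call
                calls[tc["index"]] = {"id": tc["id"], "args": ""}
                yield ("tool.use_start",
                       {"tool_use_id": tc["id"], "tool_name": fn.get("name")})
            fragment = fn.get("arguments")
            if fragment:
                call = calls[tc["index"]]
                call["args"] += fragment  # buffered only for tool.use_end
                # Consumers receive the raw fragment, never a parsed object.
                yield ("tool.use_input_delta",
                       {"tool_use_id": call["id"], "partial_json": fragment})
        if choice.get("finish_reason"):
            for call in calls.values():
                yield ("tool.use_end",
                       {"tool_use_id": call["id"],
                        "final_input": json.loads(call["args"] or "{}")})
            yield ("message.complete", {"finish_reason": choice["finish_reason"]})
```

The accumulated argument string is parsed only once, when finish_reason arrives; until then each fragment is forwarded unchanged.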
Regardless of provider, the canonical event sequence MUST satisfy:
- message.start precedes any deltas for that message.
- tool.use_start, zero or more tool.use_input_delta, exactly one tool.use_end. In that order.
- tool.use_end.final_input is a valid JSON object (parsed from accumulated deltas, or the provider’s authoritative final input if available).
- message.complete is the last event for a message; carries final_content reflecting all deltas seen plus any provider-authoritative state.
- text.delta, thinking.delta, tool.use_* events for the same message_id carry monotonically non-decreasing content_block_index values. Multiple events at the same index are fine (multiple deltas to one block).

These invariants are the contract streaming-protocol.md clients rely on. Adapters MUST validate their own output against these in tests.
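One way an adapter test could check these invariants is a small validator over a recorded event list. The sketch below assumes a single message per list and events as plain dicts with a "type" key plus optional tool_use_id, content_block_index, and final_input; those field names follow the prose above but the dict shape itself is an assumption.

```python
def check_stream_invariants(events: list[dict]) -> list[str]:
    """Return a list of violations; an empty list means the sequence is valid."""
    problems: list[str] = []
    if not events or events[0]["type"] != "message.start":
        problems.append("message.start must come first")
    if not events or events[-1]["type"] != "message.complete":
        problems.append("message.complete must be last")
    open_tools: set[str] = set()
    closed_tools: set[str] = set()
    last_index = -1
    for ev in events:
        kind = ev["type"]
        tool_id = ev.get("tool_use_id")
        if kind == "tool.use_start":
            if tool_id in open_tools or tool_id in closed_tools:
                problems.append(f"duplicate tool.use_start for {tool_id}")
            open_tools.add(tool_id)
        elif kind == "tool.use_input_delta" and tool_id not in open_tools:
            problems.append(f"delta outside start/end for {tool_id}")
        elif kind == "tool.use_end":
            if tool_id not in open_tools:
                problems.append(f"tool.use_end without start for {tool_id}")
            elif not isinstance(ev.get("final_input"), dict):
                problems.append(f"final_input is not a JSON object for {tool_id}")
            open_tools.discard(tool_id)
            closed_tools.add(tool_id)
        if "content_block_index" in ev:
            if ev["content_block_index"] < last_index:
                problems.append("content_block_index decreased")
            last_index = ev["content_block_index"]
    problems.extend(f"tool.use_start without end for {t}" for t in sorted(open_tools))
    return problems
```

Returning every violation, rather than raising on the first, makes a failing cassette easier to diagnose in one pass.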
When cancel(request_id) is called mid-stream:
- Emit message.complete with stop_reason: cancelled and the partial final_content accumulated so far.
- For any in-flight tool use, emit tool.use_end with final_input set to whatever JSON parses cleanly from the accumulated deltas, or {} if nothing parses.
- Do not emit llm.call_failed from inside the stream; the session manager’s cancellation handler (per routing-engine.md §3.4 and streaming-protocol.md §6) is responsible for higher-level event emission.

The stream iterator MUST terminate after cancellation (raise StopAsyncIteration); it must not hang.
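The fallback for an in-flight tool call can be sketched as below. It takes the strict reading of “parses cleanly”: the joined fragments must form a complete JSON object, otherwise the result is {}. final_input_on_cancel is a hypothetical helper name, and a more lenient partial-recovery strategy would be a different design.

```python
import json


def final_input_on_cancel(fragments: list[str]) -> dict:
    """Parse accumulated partial_json fragments; fall back to {} on failure."""
    try:
        parsed = json.loads("".join(fragments))
    except json.JSONDecodeError:
        return {}
    # final_input must be a JSON object, so non-object values also fall back.
    return parsed if isinstance(parsed, dict) else {}
```

For example, a stream cancelled after the fragment `{"path": "REA` yields {}, which keeps the start/end pairing intact without inventing input the model never finished.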
Per streaming-protocol.md §5.6, v1 streams raw partial JSON strings without best-effort parsing. The adapter MUST emit tool.use_input_delta.partial_json as the literal fragment received from the provider, not as a best-effort parsed object.
The adapter MAY internally accumulate fragments to detect when a complete JSON object has been received (for emitting tool.use_end with final_input). This internal accumulation is for the adapter’s own bookkeeping; the streaming events emitted to consumers carry the raw fragments.
Adapters MUST classify all errors into one of these classes (matching event-bus §6.3 llm.call_failed.error_class):
class ErrorClass(StrEnum):
    RATE_LIMIT = "rate_limit"              # provider returned a rate-limit signal
    AUTH = "auth"                          # 401, 403, invalid API key
    SERVER_ERROR = "server_error"          # 5xx other than rate limit
    NETWORK = "network"                    # DNS, connection refused, timeout pre-response
    CONTEXT_OVERFLOW = "context_overflow"  # request exceeds model's context window
    INVALID_REQUEST = "invalid_request"    # 4xx other than auth (bad params, etc.)
    CANCELLED = "cancelled"                # client called cancel()
    OTHER = "other"                        # anything else
Adapters apply these mappings as a starting point, then adjust based on provider error bodies:
| HTTP status | Default class | Provider-body adjustments |
|---|---|---|
| 401, 403 | AUTH | |
| 408 | NETWORK | |
| 413 | CONTEXT_OVERFLOW | Some providers use 400 with body indicating overflow; remap. |
| 429 | RATE_LIMIT | |
| 5xx | SERVER_ERROR | Some providers use 529 specifically; same class. |
| Connection refused, DNS error, TLS error | NETWORK | Pre-response errors. |
| 4xx other | INVALID_REQUEST | Anthropic returns error.type like "invalid_request_error" or "overloaded_error"; adjust class. |
Per-provider error-body conventions:
- Anthropic: {error: {type, message}}. Use error.type as a hint:
  - "overloaded_error" → RATE_LIMIT (even if HTTP 529).
  - "rate_limit_error" → RATE_LIMIT.
  - "authentication_error", "permission_error" → AUTH.
  - "invalid_request_error" with message containing “context” or “tokens exceeds” → CONTEXT_OVERFLOW.
  - "api_error" → SERVER_ERROR.
- OpenAI: {error: {type, code, message}}. Use:
  - error.code == "rate_limit_exceeded" → RATE_LIMIT.
  - error.code == "context_length_exceeded" → CONTEXT_OVERFLOW.
  - error.code == "invalid_api_key" → AUTH.
  - error.type == "server_error" → SERVER_ERROR.

class AdapterError(Exception):
    """Base. All adapter exceptions inherit."""
    error_class: ErrorClass
    provider_status: int | None  # HTTP status if applicable
    provider_message: str        # raw provider message, possibly redacted
    retryable: bool              # True for transient classes the adapter retries internally (§6.4)
    request_id: str

class RateLimitError(AdapterError):
    retry_after_seconds: float | None  # if provider provided a hint

class AuthError(AdapterError): pass
class ServerError(AdapterError): pass
class NetworkError(AdapterError): pass
class ContextOverflowError(AdapterError): pass
class InvalidRequestError(AdapterError): pass
class CancelledError(AdapterError): pass
Adapters raise the most specific subclass. Code in the core catches AdapterError for general handling; specific subclasses for targeted recovery.
Adapters retry transient errors with bounded exponential backoff:
- Retryable: RATE_LIMIT, SERVER_ERROR, NETWORK.
- Not retryable: AUTH, CONTEXT_OVERFLOW, INVALID_REQUEST, CANCELLED. Raise immediately.
- Max retries: config.max_retries (default 2). This is the number of additional attempts after the first; total attempts = 1 + max_retries. With the default of 2, a request can be attempted up to 3 times before raising.
- retry_after: If a RATE_LIMIT response includes a retry-after hint, sleep for that duration (capped at 60 seconds) before retry.

After exhausting retries, raise the appropriate subclass with retryable=True so the caller knows it was a transient class. Sustained failure is the routing-engine’s availability state machine’s concern (§4.5), not the adapter’s.
When an error occurs mid-stream:
- text.delta).
- Emit message.complete with stop_reason: error and the partial content accumulated.
- Raise the AdapterError subclass after the iterator yields the final event.

The session manager catches the exception and emits the llm.call_failed event; the adapter does not emit it directly.
Adapters report raw token counts in TokenUsage. They do NOT compute USD cost. Cost is the core’s responsibility, computed from the local price table per canonical-format §6.4.
class TokenUsage:
    input_tokens: int
    output_tokens: int
    cached_input_tokens: int = 0
    cache_creation_input_tokens: int = 0
The core, on receiving a CanonicalResponse from the adapter:
- Looks up pricing_version and per-model rates from the local price table.
- Computes cost_usd = input_tokens * input_rate + output_tokens * output_rate + cached_input_tokens * cached_rate + cache_creation_input_tokens * cache_creation_rate.
- Records Message.metadata.usage.cost_usd and pricing_version.

This separation lets the core retroactively reprice (walk traces, recompute) and handle synthetic providers (Ollama at zero cost, OpenRouter with provider-resolved rates).
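As a worked instance of the formula, the sketch below computes cost from a usage dict and a rate record. The Rates class and its per-token USD fields are hypothetical, and the numbers in the usage note are made up for arithmetic clarity, not real prices.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """USD per token for one model (hypothetical price-table row)."""

    input_rate: float
    output_rate: float
    cached_rate: float
    cache_creation_rate: float


def cost_usd(usage: dict, rates: Rates) -> float:
    """Apply the four-term cost formula to raw token counts."""
    return (
        usage["input_tokens"] * rates.input_rate
        + usage["output_tokens"] * rates.output_rate
        + usage.get("cached_input_tokens", 0) * rates.cached_rate
        + usage.get("cache_creation_input_tokens", 0) * rates.cache_creation_rate
    )
```

With 1000 input, 500 output, 200 cached, and 100 cache-creation tokens at illustrative rates of 1e-6, 2e-6, 0.5e-6, and 1.5e-6 USD per token, the terms are 0.001 + 0.001 + 0.0001 + 0.00015, for a total of about 0.00225 USD. A zero-cost provider is simply a Rates row of zeros.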
Both Anthropic and OpenAI report tokens in their response bodies:
- Anthropic: usage: {input_tokens, output_tokens, cache_creation_input_tokens, cache_read_input_tokens}. Map directly: cache_read_input_tokens → cached_input_tokens.
- OpenAI: usage: {prompt_tokens, completion_tokens, prompt_tokens_details: {cached_tokens}}. Map: prompt_tokens → input_tokens, completion_tokens → output_tokens, prompt_tokens_details.cached_tokens → cached_input_tokens. cache_creation_input_tokens = 0 (OpenAI doesn’t separately report it).

For streaming responses, both providers send usage in the final stream chunk (OpenAI requires stream_options: {include_usage: true} in the request). Adapters MUST request usage in streaming mode and propagate it via message.complete.usage.
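The two mappings can be sketched as plain functions over the provider usage dicts. The function names are hypothetical and the output is a dict with the TokenUsage field names rather than the class itself. One point that may be worth verifying against provider documentation: OpenAI’s prompt_tokens appears to include cached tokens, while Anthropic’s input_tokens does not include cache reads, so the mapping as written could count cached tokens twice on the OpenAI side when priced. This sketch follows the mapping exactly as specified and does not adjust for that.

```python
def usage_from_anthropic(u: dict) -> dict:
    """Map an Anthropic usage object onto TokenUsage field names."""
    return {
        "input_tokens": u["input_tokens"],
        "output_tokens": u["output_tokens"],
        "cached_input_tokens": u.get("cache_read_input_tokens") or 0,
        "cache_creation_input_tokens": u.get("cache_creation_input_tokens") or 0,
    }


def usage_from_openai(u: dict) -> dict:
    """Map an OpenAI usage object onto TokenUsage field names."""
    details = u.get("prompt_tokens_details") or {}
    return {
        "input_tokens": u["prompt_tokens"],
        "output_tokens": u["completion_tokens"],
        "cached_input_tokens": details.get("cached_tokens") or 0,
        # OpenAI does not report cache creation separately.
        "cache_creation_input_tokens": 0,
    }
```

The `or 0` guards cover fields that are absent or null, so both functions always return four integer counts.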
If usage is unavailable for some reason (provider didn’t send it; rare), the adapter MAY set input_tokens and output_tokens to estimate_input_tokens()’s output and the streamed-token count respectively, with a WARN log noting the estimation. The core’s analytics layer flags estimated usage as such.
The core maintains a registry mapping canonical model ids to (adapter, provider-specific config). Example:
# ~/.yourtool/models.yaml
adapters:
  anthropic:
    type: anthropic
    api_key_env: ANTHROPIC_API_KEY
    base_url: https://api.anthropic.com
    timeout_seconds: 600
    max_retries: 2
  openai:
    type: openai
    api_key_env: OPENAI_API_KEY
    base_url: https://api.openai.com
    timeout_seconds: 600
    max_retries: 2
models:
  anthropic:claude-opus-4-7:
    adapter: anthropic
    wire_name: claude-opus-4-7
    tier: deep
    can_delegate: true
    aliases: [opus, deep]
  anthropic:claude-sonnet-4-6:
    adapter: anthropic
    wire_name: claude-sonnet-4-6
    tier: balanced
    can_delegate: true
    aliases: [sonnet, balanced]
  anthropic:claude-haiku-4-5:
    adapter: anthropic
    wire_name: claude-haiku-4-5
    tier: fast
    can_delegate: false
    aliases: [haiku, fast]
  openai:gpt-5:
    adapter: openai
    wire_name: gpt-5
    tier: balanced
    can_delegate: true
    aliases: [gpt5]
Each model entry maps to an adapter instance and carries wire_name (the actual model string the adapter sends to the provider), tier, can_delegate, and aliases (per routing-engine.md §6.8 and §9.2).
The registry is loaded at server startup. Hot reload on config change is desirable but deferred to Phase 2 (the routing.yaml hot reload covers the more common case).
api_key_env references an environment variable. Direct api_key in config is also accepted but discouraged (key in plaintext config file). Missing API key → adapter fails to register; models routed through that adapter fail validation with not_configured.
close() called on every adapter; connection pools drain.

Canonical request:
CanonicalRequest(
    request_id="req_01HZ...",
    model="anthropic:claude-sonnet-4-6",
    messages=[
        Message(role=USER, content=[TextBlock("Read README.md and summarize")]),
    ],
    tools=[ToolDefinition(name="read_file", input_schema={...}, ...)],
    system_prompt="You are a helpful assistant.",
    max_output_tokens=2048,
    stream=True,
)
Adapter serializes to Anthropic wire:
{
  "model": "claude-sonnet-4-6",
  "max_tokens": 2048,
  "system": "You are a helpful assistant.",
  "messages": [
    {"role": "user", "content": [{"type": "text", "text": "Read README.md and summarize"}]}
  ],
  "tools": [{"name": "read_file", "description": "...", "input_schema": {...}}],
  "stream": true
}
Anthropic streams back message_start, content_block_start (text), content_block_delta (text_delta), content_block_stop, content_block_start (tool_use), content_block_delta (input_json_delta) ×N, content_block_stop, message_delta, message_stop.
Adapter emits canonical events: message.start, text.delta ×N, tool.use_start, tool.use_input_delta ×N, tool.use_end (with parsed final input), message.complete (with usage).
Session has 4 prior messages (USER, ASSISTANT with tool_use, TOOL with result, ASSISTANT with text). All produced on Anthropic. User runs /model openai:gpt-5. Next turn, OpenAI adapter must serialize the entire history.
Translation of the history:
- SYSTEM (composed): hoisted as messages[0] with role: system.
- USER: messages[1] with role: user.
- ASSISTANT with text + tool_use blocks: messages[2] with role: assistant, content: <text>, tool_calls: [{id: <provider-id>, type: "function", function: {name: <tool_name>, arguments: <JSON-stringified input>}}]. The provider id is fetched from the per-session map (or generated if first cross-provider use of this canonical id).
- TOOL: messages[3] with role: tool, tool_call_id: <provider-id>, content: <result text>.
- ASSISTANT with text: messages[4] with role: assistant, content: <text>.

If the original ASSISTANT message had a ThinkingBlock, the adapter drops it on serialize (WARN log entry; rationale in §4.4).
OpenAI processes the request and streams back deltas. The adapter normalizes them to canonical events same as in §9.1.
Adapter calls Anthropic; receives 529. Adapter classifies as RATE_LIMIT (per the body’s error.type: overloaded_error). Sleeps with backoff (1s + jitter). Retries.
Second attempt: 529 again. Sleeps 2s + jitter. Retries.
Third attempt (max_retries=2 means 2 retries after the first failure): 200 OK, normal response.
The session sees no failure — the retries are internal. The trace store sees three llm.call_started events (the original plus two retries) but only one llm.call_completed. The first two have llm.call_failed events with error_class: rate_limit, retry_count: 0 and retry_count: 1.
If the third attempt also failed, the adapter raises RateLimitError. The session manager catches it, emits llm.call_failed with retry_count: 2. Routing’s availability state machine (per routing-engine.md §4.5) sees the failure pattern; if rules trigger, the (provider, model) or provider transitions to Unavailable.
Adapter is mid-stream on Anthropic, having emitted 200 text.delta events and started a tool.use_start (no tool.use_end yet — tool input still streaming).
Client sends cancel via WebSocket (per streaming-protocol.md §6). Session manager calls adapter.cancel(request_id).
Adapter:
- Emits tool.use_end for the in-flight tool: final_input = {} (nothing parses cleanly from partial JSON).
- Emits message.complete with stop_reason: cancelled and partial final_content (the 200 text deltas reconstructed plus the cancelled tool_use with empty input).

Session manager handles the higher-level cancellation events per streaming-protocol.md §6.2.
- ToolUseBlock with various input shapes (nested objects, arrays, all primitive types) → wire format → back to canonical → assert equality.
- false, assert the adapter rejects or surfaces failure cleanly.
- tool.use_input_delta.partial_json matches the raw provider fragment, not a parsed object.
- ErrorClass, construct a recorded response (HTTP status + body) and verify the correct class is raised.
- retry_after honored. 429 with retry-after header; verify adapter sleeps for the indicated duration before retry (capped at 60s).
- TokenUsage matches.
- include_usage; verify final message.complete.usage matches non-streaming equivalent.

Beyond per-adapter tests, the contract is enforced by a cross-adapter conformance suite:
- ErrorClass regardless of provider.

HTTP cassettes are committed to the repo per canonical-format §11.2. Re-record when:
Cassettes are reviewed in PRs the same as code.
- cache_control markers for optimal cache hits is a heuristic. v1 caches the system prompt and tool definitions only. Phase 2 may add session-history caching once we have data on access patterns.
- response_format. v1 uses for delegation only (when output_schema is set on CanonicalRequest). Other use cases (general structured agent output) deferred.
- tools only on the first turn of a session. Deferred: premature optimization.
- response_format validation. When output_schema is set, OpenAI’s response_format: {type: "json_schema"} enforces schema. Anthropic doesn’t have an equivalent strict mode in the same way; the adapter currently passes the schema as a hint in the system prompt. Inconsistency worth flagging.
- supports_tools: false for a specific local model). Not implemented in v1; spec accommodates.

| Date | Decision | Rationale |
|---|---|---|
| 2026-05-08 | Adapters report token counts only; cost computed by core | Pricing is a core concern; adapters stay simple; retroactive reprice possible. |
| 2026-05-08 | Per-session bidirectional tool-id map maintained by adapter | Cross-provider tool id consistency without provider-id pollution in canonical layer. |
| 2026-05-08 | Bounded transient retry inside adapter; sustained failure to routing | Hide trivial transient errors; escalate sustained patterns to routing’s availability machine. |
| 2026-05-08 | Capability declarations are honest, not theoretical | Substitutability depends on declared capability matching actual implementation. |
| 2026-05-08 | Lossy projection rules drop unrepresentable content with WARN log | Mid-session swap remains resilient; observability over hard failure. |
| 2026-05-08 | Streaming partial JSON is raw fragments, not best-effort parsed | Per streaming-protocol §5.6; provider-portable; clients render placeholder until tool.use_end. |
| 2026-05-08 | Cancellation emits tool.use_end with empty input for in-flight tools | Stream invariants (every start has an end) preserved even on cancel. |
| 2026-05-08 | Adapter registry separate from routing.yaml | Adapter config is per-installation; routing rules are per-user-policy. Different lifecycles. |
| 2026-05-08 | OpenAI thinking-block translation deferred; Anthropic→OpenAI loses thinking | Formats are too different for clean v1 mapping; documented asymmetry. |
| 2026-05-08 | Closed ErrorClass enum drives consistent classification | Routing and analytics depend on uniform error semantics across providers. |
- canonical-message-format.md: Message, ContentBlock, ToolDefinition, Usage, AdapterCapabilities. The provider-id ↔ canonical-id mapping convention is in §6.2.
- event-bus-and-trace-catalog.md: llm.call_started, llm.call_completed, llm.call_failed payloads; error_class enum; provider availability events.
- streaming-protocol.md: canonical streaming events (text.delta, tool.use_start, etc.); cancellation contract.
- routing-engine.md: capability validation (§4.4); availability state machine (§4.5); retry vs. routing escalation boundary.
- tool-dispatcher.md (planned): how ToolUseBlock outputs are dispatched after the adapter returns them.
- server-api.md (planned): request lifecycle from API entry through adapter call.