Changelog
All notable changes to this project are documented here. The format is based on Keep a Changelog, and this project adheres to Semantic Versioning.
Unreleased
Fixed
- `harness sync` KEEPS OpenClaw's dropdown allowlist. `agents.defaults.models["anvil/*"]` is OpenClaw's DROPDOWN ALLOWLIST: a preset appears only if listed there. The sync's "drop stale `anvil/*` overrides" step deleted the ENTRIES (not just the stale `enable_thinking` params), which removed the anvil presets from OpenClaw's picker entirely (hit live re-syncing Mini for the reasoning rollout). The render/merge now KEEP every preset's allowlist entry (empty params) and strip only the stale params; recipe + CLAUDE.md golden rule corrected to match.
- `anvil-serving router up` now passes `--no-deps` so it manages ONLY the router. Without it, `docker compose up router` re-runs `depends_on` and RECREATES the model serves whenever their resolved config drifts (e.g. a changed `--env-file`); a gpt-oss-120b reload is minutes of 503s. (Hit live redeploying to 0.9.0.) The serves are `serves`' responsibility, not the router verb's.
- `harness sync` preserves the gateway's LIVE credentials, and `--restart` uses a login shell. The gateway-merge now KEEPS an existing anvil-provider `baseUrl`/`apiKey` (the rendered ones are just a default host + a `${ENV}` placeholder), so re-syncing a gateway that pins a LITERAL token no longer clobbers it into a 401 (hit live re-syncing Mini). And `harness … --restart` runs `openclaw gateway restart` via `$SHELL -lc` so the remote PATH resolves `openclaw`; a bare non-login ssh shell couldn't find it (installed under `~/.local/bin`, a brew prefix, etc.).
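The allowlist fix above can be illustrated with a minimal sketch. The entry shape (`{"params": {...}}`), the function name `merge_allowlist`, and the `STALE_PARAMS` set are assumptions made for illustration; only the behavior (keep every preset's entry, strip only the stale params, leave other entries alone) comes from the changelog entry.

```python
from copy import deepcopy

# Assumption: enable_thinking is the only stale param named in the entry above.
STALE_PARAMS = {"enable_thinking"}


def merge_allowlist(existing_models, preset_names, prefix="anvil/"):
    """Return a new models mapping that keeps one allowlist entry per preset.

    Hypothetical helper: stale params are removed from anvil entries, but the
    entries themselves are never deleted, so every preset stays selectable.
    """
    merged = deepcopy(existing_models)  # never mutate the caller's config
    for name in preset_names:
        entry = merged.setdefault(prefix + name, {})
        params = entry.setdefault("params", {})  # empty params still lists the preset
        for stale in STALE_PARAMS:
            params.pop(stale, None)
    return merged


existing = {
    "anvil/planning": {"params": {"enable_thinking": False, "temperature": 0.2}},
    "other/model": {"params": {"x": 1}},
}
merged = merge_allowlist(existing, ["planning", "chat"])
print(sorted(merged))
```

The design point is that "strip stale params" and "delete the entry" are different operations; the sketch only ever performs the first.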
[0.9.0] - 2026-07-04
Added
- Per-request reasoning selection (gpt-oss `reasoning_effort`). New tier field `extra_body_defaults`: like `extra_body` but applied via `setdefault` (the request WINS), so a tier's `reasoning_effort` becomes a DEFAULT a caller can override instead of a hard pin. The router now also forwards a request's `reasoning_effort` to the upstream (OpenAI dialect), and the harness renders the OpenClaw models with `reasoning: true`, so OpenClaw's per-message reasoning selector actually takes effect. The flexibility heavy tier now defaults to `high` via `extra_body_defaults` (was a hard `extra_body`), so planning/etc. can be dialed low/medium per message; a hard `extra_body` key still always wins (contract preserved). Requires a router redeploy + a harness re-sync to pick up.
- `anvil-serving router up --env-file`: persist the deploy secrets so a redeploy is reproducible. The router fail-closes without `ANVIL_ROUTER_TOKEN` and reverts to loopback without `ROUTER_PUBLISH`; those lived only in the deploy shell env, so a bare `router up`/`docker compose up` would break the running router. `router up` now passes `--env-file` to compose (auto-detecting `~/.anvil_env` then `~/.env`, override with `--env-file`, disable with `--env-file ''`), so the token + tailnet publish come from a persisted file (which also carries `HF_TOKEN` for the serves).
- `anvil-serving harness restart openclaw` + `sync --restart`: reload the gateway so settings apply. OpenClaw reads its config at gateway STARTUP, so a synced config change is inert until a restart. `harness restart openclaw [--gateway-host <mini>]` runs `openclaw gateway restart` (locally or over ssh); `harness sync openclaw … --restart` restarts right after a successful push. It's a single command invocation (not a shell script), so it stays portable against a Windows/macOS/Linux gateway. `--config` is now optional (required only for `sync`).
- `anvil-serving router logs` + `serves logs`: `docker logs` through the management verbs. Diagnosing a router crash-loop or a serve no longer means reaching for raw `docker` (the same gap ADR-0012 closed for lifecycle). `router logs` and `serves logs <name>` take `--tail`/`--since`/`--follow`, check the container exists first (a clean message beats docker's raw error), and surface BOTH stdout and stderr (a router's fail-closed startup errors, e.g. a missing auth token, go to stderr). `serves logs` requires exactly one serve. Docker is dependency-injected, so tests run with no docker.
- flexibility:T016. Qwen3.5-122B-A10B (MXFP4) serves on sm_120 via a patched vLLM Marlin W4A16 path, proving the any-engine seam on the hardest case. Standard vLLM routes this W4A4 MXFP4 checkpoint to FlashInfer's cute-dsl `mm_fp4`, which dies on sm_120 (`does not support backend 'cute-dsl' with capability 120`); removing the (sm_120-broken) `flashinfer.cute_dsl` module at startup forces vLLM's designed Marlin W4A16 fallback. New reusable recipe `examples/fakoli-dark/docker-compose.flexibility.yml` + a `docs/findings/blackwell-sm120-lab-notebook.md` writeup. Correctness preflight = ALL PASS (smoke, structured JSON, 14k needle, 20/20 tool batch) with `--no-thinking`.
- `anvil-serving harness` verb: own the harness-side config, not just the router. `harness sync openclaw --config <router.toml>` RENDERS the OpenClaw provider config from the live router config: one selectable model per preset, each `contextWindow` = the LARGEST tier that preset can route to (the clamp gotcha), and NO per-preset thinking overrides (the router owns `reasoning_effort`/`enable_thinking` per tier now). Emits to stdout/`--out`, or PUSHES to the remote gateway with `--gateway-host`; transport is `scp` (portable: runs on a Windows OR Linux host, against a Windows/macOS/Linux gateway, with no remote shell), MERGING the anvil provider into the remote `~/.openclaw/openclaw.json` (preserving other providers/agents, dropping stale `anvil/*` overrides, backing it up first); `--overwrite` for a full write. Closes the "hand-edit the gateway out-of-band" gap named by the new CLAUDE.md golden rule (anvil-serving owns the harness-side config too; keep it in lockstep with the router's intent/tier config). Also ships the reconciled `examples/openclaw/openclaw-flexibility.json5` recipe. Skills/agent-config sync is the next scope. (The OpenClaw gateway runs on Fakoli Mini.)
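The precedence between `extra_body_defaults`, the caller's request, and a hard `extra_body` can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `apply_tier_overrides` is hypothetical, and a shallow top-level merge is assumed (the changelog does not say how nested keys are handled).

```python
def apply_tier_overrides(request_body, extra_body_defaults=None, extra_body=None):
    """Hypothetical sketch of the documented precedence.

    Order, weakest to strongest:
      1. extra_body_defaults  (applied with setdefault, so the request wins)
      2. the caller's request body
      3. extra_body           (a hard pin that always wins)
    """
    body = dict(request_body)  # shallow copy; nested merging is not modeled here
    for key, value in (extra_body_defaults or {}).items():
        body.setdefault(key, value)
    body.update(extra_body or {})
    return body


# Tier default applies when the caller says nothing.
print(apply_tier_overrides({"model": "m"}, {"reasoning_effort": "high"}))
# The caller's per-message choice overrides the tier default.
print(apply_tier_overrides({"reasoning_effort": "low"}, {"reasoning_effort": "high"}))
```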
Changed
- fakoli-dark router redeployed to the v0.8.0 release image (from the transitional `0.7.1` pin in #125): the `router` compose service and `router_manage.DEFAULT_IMAGE` now pin `anvil-serving:0.8.0` (rebuilt from main), so the deployed router has flexibility mode + the v2 profile loader (backward-compatible with the live v1 profile), and `router promote --image` validates against 0.8.0. Live routing verified after the swap (planning/chat/quick-edit → 200).
Fixed
- harness `--restart` guards (Greptile #130): reject `--restart` on a stdout-only sync (the config isn't applied, so restarting would reload the OLD config and falsely report success) and require `--gateway-host` or `--out`; also reject sync-only flags (`--config`/`--out`/…) on the `restart` action instead of silently discarding them.
[0.8.0] - 2026-07-04
Fixed
- Conservative per-request context gate: an over-context request is refused, not forwarded to a too-small tier. A live incident routed a ~94k-token request to a 65k/32k-context local tier (heavy tier was down, so the preset fell back to fast), which 400'd at the model with "Input length exceeds maximum context length" plus an ASGI traceback. `policy.route()` has always had the hard-constraint filter (`needs.min_context > tier.context_limit` -> drop tier), but `serve.RoutingBackend` left `Needs.min_context` at 0, so it never fired. `serve._needs_for` now wires `min_context` from `internal.estimate_tokens` (a whitespace WORD count, which is a strict lower bound on real tokens: >= 1 token per word, English ~1.3x, dense code/JSON 2-4x). The raw word count is used with no extra discount, so the filter drops a tier only when even this underestimate exceeds the tier's real-token `context_limit` (effectively real tokens > ~1.3x limit): a built-in cushion that catches the 1.4x incident while never false-rejecting a request merely near a tier's limit. When the gate drops EVERY candidate tier, `NoAvailableTierError(kind="over_context")` is raised and the front door renders a clean 413 Payload Too Large (distinct from the availability 503/`exhaustion_status`), instead of forwarding a doomed request or emitting a bare 500. `policy.route` records the specific tiers in a new additive `dropped_by_context` note bucket. stdlib-only, additive; normal-size requests route exactly as before.
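A minimal sketch of the gate described above, assuming a plain `{tier_name: context_limit}` mapping. The names `estimate_tokens`, `NoAvailableTierError`, `kind="over_context"`, and `dropped_by_context` come from the entry; the bodies below, and the helper `filter_by_context`, are illustrative assumptions rather than the shipped implementation.

```python
class NoAvailableTierError(Exception):
    """Stand-in for the router's exception; only the kind field is modeled."""

    def __init__(self, message, kind="unbound"):
        super().__init__(message)
        self.kind = kind


def estimate_tokens(text):
    # Whitespace word count: a lower bound on real tokens (>= 1 token per word).
    return len(text.split())


def filter_by_context(prompt, tier_limits):
    """Drop tiers whose context_limit is below even the word-count underestimate."""
    min_context = estimate_tokens(prompt)
    kept = [name for name, limit in tier_limits.items() if min_context <= limit]
    dropped_by_context = [name for name in tier_limits if name not in kept]
    if not kept:
        # The front door would render this as a 413, not a 503 or a bare 500.
        raise NoAvailableTierError("request exceeds every tier's context", kind="over_context")
    return kept, dropped_by_context


kept, dropped = filter_by_context("word " * 70_000, {"fast": 65_000, "heavy": 131_072})
print(kept, dropped)
```

Because the estimate undercounts real tokens, a request slightly over a tier's limit in real tokens can still pass the gate; the filter is meant to catch clear overshoots, not to be exact.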
Added
- External benchmark priors: new `anvil-serving external-bench` CLI and `anvil_serving.external_benchmarks` package for ingesting raw external benchmark snapshots, normalizing Millstone AI rows, storing them in SQLite, exporting JSON, producing Markdown reports, and comparing local Anvil benchmark JSON against advisory external rows. These rows are performance priors only and do not change routing quality gates.
- `rtx6kpro` external benchmark source: added a JSON-only adapter for `local-inference-lab/rtx6kpro` RTX PRO 6000 Blackwell inference-throughput artifacts, including conservative Qwen/GLM metadata normalization, DCP and speculative-decoding methodology notes, and non-destructive failures for prose, CSV, or HTML imports.
- Serve & router management verbs (ADR-0012): every serve/router lifecycle op now flows through an `anvil-serving` verb instead of raw docker. `anvil-serving router {up|down|restart|reload|status|token}` manages the deployed (ADR-0004) containerized router; `anvil-serving router promote --profile [--config]` is the containerized profile write-back (the ADR-0009 moat) done safely: validate against the deployed image's OWN loader, back up, ATOMICALLY write into the read-only-mounted config volume via a root side-container, reload, and ROLL BACK on a crash-loop (settle + consecutive-running + RestartCount). New `serves rm` (retire any container incl. a non-manifest port squatter), `serves adopt` (recreate an externally-started serve under compose management), and `serves up --compose <file>` (bring up an experiment serve not in the manifest); `serves down` now honors `--dry-run` (was silently stopping serves). The fakoli-dark `docker-compose.yml`/`serves.toml` are reconciled to the live flexibility topology (heavy=gpt-oss-120b :30002, fast=Qwen3.6-27B-NVFP4 :30003, `vllm-hfcache` + HF repo-ids) so `anvil-serving serves` manages the real serves again.
[0.7.3] - 2026-07-02
Changed
- fakoli-dark heavy tier enables NEXTN speculative decoding (ADR-0008). Self-speculation via the model's own built-in MTP head (no separate draft model, no additional steady-state VRAM cost), validated live with a two-step A/B on production hardware before merging: +30-43% decode throughput depending on concurrency, ~82% draft-token acceptance rate, and confirmation that SGLang issue #19796 (an SM120-specific NaN-on-prefix-cache-hit crash) does not reproduce on this stack at cache-hit rates up to 96.2% under concurrent multi-turn traffic. Known tradeoff: TTFT regresses under concurrency (+37% at concurrency=4); net end-to-end latency still improved in every trial. No wire-level change: `served-model-name` and the router config are unaffected.
0.7.2 - 2026-07-02
Weights on a volume + docs truth-up. Two fixes from live operation, and a documentation pass that brings every stated claim back in line with the shipped code.
Fixed
- Model weights mount from a named Docker volume, never a host bind mount (#107). On Docker Desktop/WSL2, 9P/virtiofs bind mounts turned cold model loads into 20–90 minute stalls. All serve definitions (the fakoli-dark compose files, the legacy serve scripts, and the multiplexer's default registry, with a new `volume` registry key) now read weights from an external named volume, with container paths unchanged so serve fingerprints are unaffected. This also removed the last machine-specific host paths from the shipped package.
- Eval data default resolves to `tests/fixtures/eval-data` (#106): the previous default pointed at a directory relocated to the companion notes repo; the vLLM experiment entrypoint is pinned alongside it.
Documentation
- ADR-0007 (#105): a Claude-subscription cloud tier is feasible and permitted for self-hosted single-operator use: opt-in, subprocess-to-CLI, text-only classes, no tool broker, documented ToS-gray. Design-only; no implementation scheduled. Companion pi harness recipe added to the README.
- Docs truth-up (positioning refresh): README Known limitations rewritten to include the live-confirmed ADR-0005 keyless-failover caveat, the promotion-table evidence-erosion note (the reference heavy serve moved off the model the seeds were measured against; shadow-eval re-run recommended), and the Anthropic-dialect `NotTruncated` pass-through behavior introduced by the v0.7.1 caller-cap fix. AGENTS.md updated off v0.4.1/707-tests to v0.7.x/993; README/CLAUDE.md test counts corrected to 993 collected; mkdocs nav now publishes ADR-0002–0007 and the 2026-07-02 architecture review; docs version badge bumped; stale `relay.py` (non-streaming upstream) and `serves.py` (manifest default) docstrings corrected.
0.7.1 - 2026-07-02
Live-incident hardening — a LIVE end-to-end run (2026-07-02) found a harness that
computes max_completion_tokens = declared contextWindow − prompt tokens, floored at 1
(never rejects an oversized prompt). A misdeclared contextWindow made every real turn
arrive with max_completion_tokens: 1; the local model correctly honored the cap and
returned its one token with finish_reason: "length" — but anvil's NotTruncated
verifier had no way to tell a caller-requested cap from an unexpected truncation, so it
hard-failed every such response on every tier: 503 exhaustion on every turn, and the
repeated verify-failures tripped the circuit breaker, blacking out an otherwise-healthy
work-class for the cooldown window. The exhaustion 503 also printed a misleading message
("configure that tier's credentials/endpoint") for a case where the tiers were bound and
reachable the whole time.
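The harness arithmetic behind this incident can be worked through in a few lines. The function name `harness_completion_cap` and the 40,000-token prompt are hypothetical; the formula (declared `contextWindow` minus prompt tokens, floored at 1) and the two window sizes come from this release's notes.

```python
def harness_completion_cap(declared_context_window, prompt_tokens):
    """Completion budget as the harness computed it: floored at 1, never rejected."""
    return max(1, declared_context_window - prompt_tokens)


# Under-declared window: a prompt larger than the declared window collapses the cap to 1.
print(harness_completion_cap(32_000, 40_000))
# Window declared at the largest routed tier: the same prompt leaves real headroom.
print(harness_completion_cap(131_072, 40_000))
```

With the cap at 1, a well-behaved model stops after one token with a `length` finish reason, which is the response shape the verifier then had to interpret.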
Fixed
- Caller-capped `length`/`max_tokens` is compliance, not truncation (the headline fix). `verify.ResponseView` gained a `caller_max_tokens` field, populated from the request's own `max_tokens` (parsed from `max_tokens`/`max_completion_tokens` by the dialects) at both response-view construction sites (`serve.py`'s `_structured_view_factory` and `commit_window.build_response_view`, the fallback used when a caller injects no factory). `NotTruncated` now passes a `length`-like stop when the caller set an explicit cap, because it is exactly what was asked for. When the caller set no cap at all, a `length`-like stop is still treated as genuine unexpected truncation (unchanged). The critical interaction is preserved: an EMPTY, caller-capped `length` response (thinking-budget starvation, CLAUDE.md gotcha #9) still fails via `NonEmptyContent`; only a non-empty caller-capped response passes the full chain. With verify passing, no failure is recorded, so the breaker-poisoning stops too. Regression-pinned end to end: a real `max_tokens: 1` request through the front door + a local `allow` tier now returns 200 with the 1-token body, not a 503, and does not increment the circuit breaker across repeated 1-token-capped requests.
- Exhaustion 503 message no longer blames credentials when the tiers were bound and reachable. `internal.NoAvailableTierError` gained a `kind` parameter (`"unbound"` default / `"exhausted"`) distinguishing the two raise sites in `serve.py`'s `RoutingBackend.generate()`: `bound_tiers` empty (genuinely unbound; the "configure credentials/endpoint" message is correct and unchanged) vs. every bound candidate attempted and failed verify/relay (now says so: "all N bound candidate tiers were attempted and failed (verification or relay error); see the decision log", instead of pointing at credentials/reachability). Same exception type throughout, so the front door's `except NoAvailableTierError` contract is unchanged.
- Docs: `docs/OPENCLAW-INTEGRATION-SPEC.md` §2's provider-config recipe now declares `contextWindow: 131072` (the largest routed tier's window, `heavy-local`) for every preset instead of the previous `32000`-class values for `chat`/`quick-edit` that under-declared their real routed ceiling; the live-confirmed failure mode above is documented in full alongside the corrected recipe.
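The caller-cap rule and its interaction with the empty-content check can be sketched like this. `ResponseView`, `caller_max_tokens`, `NotTruncated`, and `NonEmptyContent` are named in the entries above; the dataclass fields, the `LENGTH_STOPS` set, and the function bodies are assumptions for illustration, and tool-call handling is deliberately left out.

```python
from dataclasses import dataclass
from typing import Optional

# Assumption: the length-like stop values for the two dialects.
LENGTH_STOPS = {"length", "max_tokens"}


@dataclass
class ResponseView:
    """Simplified stand-in for verify.ResponseView (tool calls omitted)."""
    text: str
    finish_reason: Optional[str] = None
    caller_max_tokens: Optional[int] = None


def not_truncated(view):
    # A length-like stop is compliance when the caller asked for a cap,
    # and unexpected truncation when no cap was set.
    if view.finish_reason not in LENGTH_STOPS:
        return True
    return view.caller_max_tokens is not None


def non_empty_content(view):
    return bool(view.text.strip())


def passes_chain(view):
    return non_empty_content(view) and not_truncated(view)


print(passes_chain(ResponseView("Hi", "length", caller_max_tokens=1)))  # capped, non-empty
print(passes_chain(ResponseView("", "length", caller_max_tokens=1)))    # capped but empty
```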
0.7.0 - 2026-07-01
Wire fidelity + production hardening — the relay now forwards what the harness actually sent (tools, tool history, sampling parameters) and streams what the model actually produced (real SSE deltas, real token counts), with a full-codebase hardening pass behind it.
Fixed
- Tools and tool history were silently dropped on relay (#96), the headline fix. The relay backends rebuilt the upstream body from the flattened `InternalRequest`, which dropped the request's `tools`/`tool_choice` and the `tool_use`/`tool_result` conversation history, so a routed tier could never call a tool and lost its own tool history between turns. New `dialects/translate.py` (pure stdlib) translates tool definitions, `tool_choice`, and tool-carrying message history between the Anthropic and OpenAI wire shapes; `CloudBackend._build_body` forwards same-dialect requests verbatim and translates cross-dialect ones (e.g. Claude Code → local vLLM). Tool-free requests build a byte-identical body to before (regression-pinned). Verified live: a real 104-tool OpenClaw agent turn now reaches the local model and returns a real `tool_calls` response.
- `relay()` now actually streams (#98). `resp.read(65536)` on an `http.client` response blocks until 64 KB accumulate or EOF, so SSE token deltas were delivered all at once at end-of-stream and TTFT equaled full completion time. `read1()` returns per-chunk. The most user-visible fix in the hardening pass.
- Classifier keyword haystack (#97): only a short (≤150-word) system prompt joins the keyword scan. A harness's standing multi-thousand-word system prompt permanently contains "plan"/"review"/"edit"/"fix", which multi-matched every request into an ambiguous verdict and drowned the actual intent of the last user turn.
- Public-bind warning is auth-aware (#97): with `[server].auth_env` configured it notes the token gate instead of falsely claiming the endpoint has no authentication.
- Production hardening bug bash (#98). Router core: `DecisionLog` is a bounded ring buffer (default 10k records; was an unbounded per-request append, a slow leak on a long-running router); `RouterConfig.tier()` is O(1); an abandoned circuit-breaker half-open probe no longer wedges a tier OPEN forever (probes expire after one cooldown); the fence-scan verifier is linear (was O(spans × delimiters); adversarial many-fence responses cost ~10⁹ comparisons in the hot path); front-door keep-alive desync and trailing-slash fixes. Support modules: multiplexer swap-path hardening (dead-child detection, checked `docker rm -f`, zombie reaping, OOM-guard eviction credit, clean 4xx/5xx) and loopback bind by default (was `0.0.0.0`, an unauthenticated model-swap endpoint on the LAN); calibrate bounded backpressure (`max_pending=64`, drops counted); secrets redaction is component-boundaried (`context_limit` no longer destroyed by a substring match on `text`); prices parse-before-cache, atomic writes, stale-cache fallback, per-process memo; case-insensitive inferred-preset resolution; `PYTHONHASHSEED`-independent fingerprints (set values canonicalized; set-valued serve flags re-fingerprint once on upgrade).
- `policy.Needs.needs_tools` was never populated on the serve path. `policy.route()` has always honored `needs.needs_tools` (excludes `tool_support=false` tiers), but `serve.RoutingBackend` never constructed a `Needs`: `route()` was always called with `needs=None`, so a tools-bearing request could route to a tier with no tool support (the model would then be unable to call any tool it needed). Wired via `dialects.translate.has_tool_artifacts` (#96): both `RoutingBackend.generate` and `RoutingBackend.decide` now build a `Needs(needs_tools=...)` from the raw wire body before calling `route()`. (`Needs.min_context` was wired conservatively later; see the 0.8.0 "Conservative per-request context gate" entry above.)
- Verify: empty-content false-negative on tool-call-only local replies (regression coverage). Live end-to-end testing with a real OpenClaw agent turn reported a local model reply with empty text `content` but a populated `tool_calls` being wrongly treated as thinking-budget starvation by `NonEmptyContent` and escalated/exhausted to a `503`. Investigation found the router logic was already correct on `main`: `NonEmptyContent` (`anvil_serving/router/verify.py`) already passes on a non-empty `tool_calls` list even with empty text, and `RoutingBackend._route_with_verify` (`anvil_serving/router/serve.py`) already threads a backend's `tool_calls`/`finish_reason` into the `ResponseView` via `get_last_structured()`. Both landed with the structured-field-passthrough work (#42/#52), which predates and is included in v0.6.0. A genuinely empty reply (no text AND no `tool_calls`) still correctly fails and escalates/defers, per the T004 safety net. Added end-to-end front-door regression tests (`tests/router/test_serve_fallback.py`, `tests/router/test_serve_verify_fallback.py`) and unit-level edge-case pins (`tests/router/test_verify.py`) locking in the tool-call-only-pass / truly-empty-fails contract at both the T004 minimal-verify local-"allow" path and the full allow-with-verify chain, since no end-to-end coverage previously existed for this shape. If this was observed against a deployed container, rebuild/redeploy from a commit that includes #42/#52 (any v0.6.0+ build already does).
Added
- Measured-profile loading (#97): `[router].profile_path` loads a measured `profile.json` (written by `profile_bootstrap` / eval bootstrap) instead of always routing on the hand-authored seed profile. Configured-but-unloadable is a startup `ConfigError`: fail fast, never silently fall back to seeds the operator asked to replace.
- Real usage passthrough (#97): the relay backends extract the upstream's real `usage` block and both dialects render the real token counts when present (word-count estimate remains the fallback). Harnesses use these numbers for context management, so the estimated fiction was actively misleading.
- Sampling-field wire fidelity (`top_p` / stop sequences). `InternalRequest` now carries `top_p` and a normalized `stop` (list of strings; OpenAI's string-or-array `stop` form is collapsed to a list, and Anthropic's `stop_sequences` is native). Both dialects parse them (`dialects/openai.py`: `top_p`/`stop`; `dialects/anthropic.py`: `top_p`/`stop_sequences`), and `CloudBackend._build_body` (`anvil_serving/router/backends/cloud.py`) forwards them with dialect-correct wire names, only when present, so an absent field builds the exact same body as before (extends the #96 byte-identical regression pin). Also forwards same-dialect-only `top_k` (Anthropic) and `presence_penalty`/`frequency_penalty` (OpenAI), which are never invented for a translated cross-dialect request. Deliberately NOT forwarded: `logit_bias`, `seed`, `user`, `metadata`. These are provider-account/session-scoped fields (billing attribution, abuse tracking, deterministic-replay opt-in), not generation-quality knobs, so passthrough would leak caller-side state for little harness value. A tier's `extra_body` (applied last, #97) still overrides any of these (documented precedence, now test-pinned). Previously a harness sending `top_p` or a stop sequence had it silently dropped: the local/cloud model sampled with different parameters than requested.
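The stop-sequence normalization and dialect-correct forwarding described above can be sketched as below. The helper names `normalize_stop` and `sampling_fields` are hypothetical; the wire names (`stop` for OpenAI, `stop_sequences` for Anthropic, `top_p` for both) and the "only when present" rule come from the entry.

```python
def normalize_stop(value):
    """Collapse OpenAI's string-or-array stop form into a list of strings."""
    if value is None:
        return None
    if isinstance(value, str):
        return [value]
    return list(value)


def sampling_fields(top_p, stop, dialect):
    """Build the sampling part of an upstream body with dialect-correct names.

    Absent fields are omitted entirely, so a request that sets neither
    produces an empty dict and leaves the rest of the body untouched.
    """
    fields = {}
    if top_p is not None:
        fields["top_p"] = top_p
    if stop is not None:
        wire_name = "stop_sequences" if dialect == "anthropic" else "stop"
        fields[wire_name] = stop
    return fields


# An OpenAI-dialect request with a single string stop, relayed to an Anthropic upstream.
print(sampling_fields(0.9, normalize_stop("END"), "anthropic"))
```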
0.6.0 - 2026-07-01
Router as a service — the front door is now a containerized, network-facing, token-authed endpoint (ADR-0004), so the serves stay loopback-only behind one authenticated boundary and keep-alive comes from Docker.
Added
- Built-in front-door token auth (opt-in). `[server].auth_env` names the env var (e.g. `ANVIL_ROUTER_TOKEN`) holding a shared token; the front door accepts `Authorization: Bearer <t>` or `x-api-key: <t>`, compares constant-time (`hmac`), and returns `401` on mismatch. Off when unset (loopback default unchanged); configured-but-env-unset fails fast. Unauthenticated `GET /healthz`.
- Repo-root `Dockerfile` (stdlib-only image, non-root, `HEALTHCHECK` on `/healthz`) and a router+serves compose topology: the `router` is the only published, authed service; the serves stay loopback-only and are reached by service name. Ships `configs/example-docker.toml`.
Changed
- `SECURITY.md` documents the built-in bearer/`x-api-key` auth (supersedes the old "no built-in authentication" note); the raw serves stay loopback/internal behind the router.
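A minimal sketch of the token check described above, using the standard library's `hmac.compare_digest` for the constant-time comparison. The function names and the plain-dict header lookup are assumptions; a real `http.server` handler exposes case-insensitive headers, which this sketch does not model.

```python
import hmac


def extract_token(headers):
    """Pull a presented token from either accepted header form."""
    auth = headers.get("Authorization", "")
    if auth.startswith("Bearer "):
        return auth[len("Bearer "):]
    return headers.get("x-api-key")


def is_authorized(headers, expected_token):
    """Return True only when the presented token matches the shared token.

    compare_digest avoids leaking how many leading characters matched
    through response timing.
    """
    presented = extract_token(headers)
    if presented is None:
        return False  # the front door would answer 401
    return hmac.compare_digest(presented.encode("utf-8"), expected_token.encode("utf-8"))


print(is_authorized({"Authorization": "Bearer s3cret"}, "s3cret"))
print(is_authorized({"x-api-key": "wrong"}, "s3cret"))
```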
0.5.0 - 2026-07-01
Portable-by-default — out-of-box router correctness and a generated bring-up (ADR-0003), so anvil-serving works generically, not just on the authors' setup.
Added
- `anvil-serving init`/`onboard`: one command detects GPUs and emits a mutually-consistent compose + `serves.toml` + router config. `anvil-serving doctor` environment preflight. Shared `gpus.py` GPU-UUID pinning; `deploy` gains a vLLM engine, loopback-default publish, and serves.toml + router-tier emission. Per-tier `extra_body` (inject `chat_template_kwargs.enable_thinking=false` for thinking-by-default models); configurable `[router].relay_timeout`; `/v1/models` served-name auto-derive.
Fixed
- Shipped example configs 404'd out of the box (a local tier without `model=` forwarded the preset token upstream); `model=` is now required and warned.
- verify-on-local-`allow` catches an empty/truncated local `200` instead of delivering it. README states Python ≥3.11 + a pipx recipe; the OpenClaw plugin install uses `--link`.
0.4.1 - 2026-06-30
Serving-substrate hardening: model serves are now Docker-Compose-defined and serves up
is drift-safe, plus Blackwell sm_120 serving guidance. No router changes; no breaking
changes.
Changed
- Model serves are Docker-Compose-defined (ADR-0002). `anvil-serving serves up` delegates to `docker compose up -d <service>`, which recreates a container when its compose config has drifted and fast-restarts it when unchanged, replacing a blind `docker start` that could silently serve a stale model. Added a parametrized experiment-harness compose (`examples/fakoli-dark/docker-compose.experiment.yml`). Docker Compose v2 is now a serving-substrate prerequisite (the router itself stays stdlib-only).
- `serves up` gained a `--recreate` flag (force `docker rm -f` + up) and a served-vs-declared model drift warning for script-based serves.
- Serve ports bind `127.0.0.1` only; GPU pinning uses `CUDA_VISIBLE_DEVICES` (reliable on Docker-Desktop/WSL2) alongside Compose `device_ids`.
Docs
- Blackwell sm_120 serving gotchas (dense NVFP4 vs the MoE-NVFP4/block-FP8 kernel gaps, NVFP4≈1.8×FP8, the `VLLM_USE_V2_MODEL_RUNNER=0` UVA fix, the docker-volume vs 9P load path) in `CLAUDE.md`; ADR-0002.
0.4.0 - 2026-06-30
Advise-and-defer — the subscription-first routing pivot — plus the launch-hardening pass. anvil is now local-serve + routing brain: the harness owns cloud on its subscription and no cloud API key sits in the default path ($0 metered API by default). This release also closes the six post-launch hardening issues (#42, #45, #46, #47, #52, #53).
Changed
- Cloud tier is now opt-in, OFF by default. `configs/example.toml` ships as local-only; anvil holds no cloud API key and incurs $0 metered API billing in the default configuration. A cloud tier must be explicitly declared in `configs/example-with-cloud.toml` to unlock it.
- Keyless exhaustion handoff replaces mid-request cloud escalation (default path). When all local candidates are exhausted (verify-failure on an `allow-with-verify` class with no cloud tier configured), anvil returns an `exhaustion_status` (503 by default, configurable) with nothing streamed. A gateway like OpenClaw treats this as a transport failure and re-routes the request on its native subscription provider, which is flat-rate and not metered by anvil. The opt-in keyed `CloudBackend` path still works for single-endpoint harnesses that cannot route cloud themselves.
- Contract C4 reshaped into two explicit modes: keyless (exhaustion-503 → gateway transport failover) and opt-in keyed (router-internal escalation → 200). Documented in `docs/QUALITY-GATED-ROUTER.md` and `docs/PLAN-advise-and-defer.md`.
- Docs and visual assets refreshed to reflect advise-and-defer terminology (local-only default, opt-in metered cloud, keyless handoff, $0-metered framing). Internal design/planning/findings documents relocated to the private companion repo `fakoli/anvil-serving-notes`; public docs retain the product-facing surface.
- Internal maintainability (#46). `RelayBackend` decoupled into the backends package; dialect/privacy magic strings replaced with named constants; a dialect parity test pins both dialects' surface. Behavior-preserving, with no wire change.
Added
- Per-intent `metered_cloud` gate. When a cloud tier is configured, no work-class is eligible for it unless explicitly listed in `[router].metered_cloud`. No implicit global "use cloud" switch exists.
- Cost dimension. A configured cloud tier carries `cost_input_per_mtok`/`cost_output_per_mtok` fields (USD per million tokens). Estimated cost is surfaced in the decision log and a `cost_usd` metric on every metered cloud route; local tiers report `0`.
- Optional off-by-default cost-sync. A `[router] cost_sync = true` toggle fetches prices from the free, MIT-licensed LiteLLM pricing JSON (cached at `~/.cache/anvil-serving/prices.json`, 24 h TTL, stdlib `urllib` only). Static config is the default; sync is opt-in. Falls back to static config on any fetch failure.
- Configurable `exhaustion_status`. The HTTP status anvil returns when all local tiers are exhausted is configurable (default 503) so operators can tune the gateway-failover trigger to their gateway's classification.
- `POST /v1/route`: the routing-brain endpoint. Exposes the intent-resolve + routing decision without serving the request. Request: a `completions`-shaped body plus optional `signals` (`work_class`, `token_estimate`, `urgency`). Response: `{ tier, model, provider, work_class, reason, confidence, session_id }`. Status 200 (decision, even if `cloud`), 400 (malformed), 503 (no suitable tier). Used by the OpenClaw plugin for upfront routing splits.
- OpenClaw plugin upfront routing split. The `before_model_resolve` hook in `plugins/openclaw-anvil-intent-router/` now routes `deny`-class and cloud-destined work directly to the gateway's native provider (bypassing anvil entirely), and routes `allow`/`allow-with-verify` classes through anvil. Uses the shared `tier0_keywords.json` classifier vocabulary; optionally calls `/v1/route` for the authoritative decision.
- Tool-call passthrough + live structured verifiers (#42, #52). `tool_calls`/`tool_use` and the real `finish_reason`/`stop_reason` now flow through the backends, dialects, and verifiers (streaming and non-streaming), so a coding harness's tool-calling turn is preserved end-to-end, and the `NotTruncated`/`ToolCallJSONValid` verifiers run live on the serve path (previously inert). The text path is byte-identical.
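The cost dimension and the per-intent gate reduce to simple arithmetic and a membership test. In this sketch the function names are hypothetical and the prices (3.00 and 15.00 USD per million tokens) are made-up example values, not figures from the project; only the field names and the "local tiers report 0" rule come from the entries above.

```python
def estimate_cost_usd(input_tokens, output_tokens, cost_input_per_mtok=0.0, cost_output_per_mtok=0.0):
    """Estimated cost of one route; prices are USD per million tokens.

    A local tier carries no prices, so the defaults make it report 0.
    """
    return (input_tokens * cost_input_per_mtok + output_tokens * cost_output_per_mtok) / 1_000_000


def cloud_eligible(work_class, metered_cloud):
    """A work-class may use the cloud tier only if it is explicitly listed."""
    return work_class in metered_cloud


# Example with assumed prices: 12,000 input and 800 output tokens.
print(estimate_cost_usd(12_000, 800, cost_input_per_mtok=3.0, cost_output_per_mtok=15.0))
print(cloud_eligible("chat", ["planning"]))
```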
Fixed
- Fallback-path hardening (#45, #52). Seam isolation (a hung verifier is bounded by a latency budget; a raising observer/log or response-view factory can no longer crash a served request), 32 MiB drain byte-caps (local + cloud) against runaway responses, and a session-scoped, thread-safe circuit breaker with cooldown + half-open decay so a transient blip can't permanently disable a tier.
- Front-door HTTP polish (#53). A `GET` to a POST-only route returns `405` + `Allow: POST` (not `404`); a bounded non-blocking drain after a `413` avoids a connection-reset race; `do_GET` body-handling keeps the socket in sync.
- Concurrency + correctness hygiene (#47). `DecisionLog` is guarded by a lock (it is written from `ThreadingHTTPServer` request threads); a structurally-malformed cloud response now surfaces a sanitized error instead of being masked as an empty completion.
- `benchmark` context-clamp + `--no-thinking` (#78). Right-sizes the replayed request distribution and avoids thinking-budget starvation during benchmarks.
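The lock-guarded log from the concurrency fix can be sketched with the standard library. The class name `BoundedDecisionLog` and its methods are hypothetical; the lock reflects this release's fix, while the `maxlen` bound mirrors the ring-buffer change described in the 0.7.0 notes (default 10k records) and is included here only to show how the two fit together.

```python
import threading
from collections import deque


class BoundedDecisionLog:
    """Illustrative stand-in: a thread-safe, bounded append-only record buffer."""

    def __init__(self, max_records=10_000):
        # deque(maxlen=...) silently discards the oldest record when full.
        self._records = deque(maxlen=max_records)
        self._lock = threading.Lock()

    def append(self, record):
        # Request threads call this concurrently, so writes are serialized.
        with self._lock:
            self._records.append(record)

    def snapshot(self):
        # Copy under the lock so readers never iterate a buffer being mutated.
        with self._lock:
            return list(self._records)


log = BoundedDecisionLog(max_records=3)
for i in range(5):
    log.append({"request": i})
print(log.snapshot())
```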
0.3.0 - 2026-06-30
First public release. anvil-serving is now a quality-gated local-model router for coding
harnesses: point a harness (Claude Code via ANTHROPIC_BASE_URL, or any OpenAI/Anthropic
client) at one endpoint; per request it resolves an intent to a tier (fast-local /
heavy-local / cloud), cheaply verifies the output, and falls back up the tier chain on
failure — never silently shipping a local-quality miss. stdlib-only, Python >= 3.11.
The harness-router PRD (all 18 tasks, milestones M0–M3) landed in this release.
Added
- Protocol-standard front door: accepts both the Anthropic Messages and OpenAI Chat Completions dialects on one endpoint, including SSE streaming, and normalizes them onto a single internal request shape.
- Intent routing: named-preset intents (`planning`, `quick-edit`, `review`, `chat`, `long-context`) carried in the `model` field, accepted bare or `anvil/`-namespaced, resolving to `(model, tier, params)`; a `model:`-pin escape hatch for repro/debugging.
- Tier-0 work-class classifier: the universal floor. Infers a work-class from the raw payload (token count, `thinking` flag, tool types, image content, system-prompt fingerprint) for requests that arrive with no declared intent. Vocabulary ships as the `tier0_keywords.json` package-data.
- `/v1/models` discovery: advertises the preset vocabulary so intents surface in harness model pickers.
- Tier-topology config schema: TOML config declaring tiers, per-tier backends, presets, and a `mapping_version`; loaded with stdlib `tomllib`.
- Quality profile + residency-aware routing policy: a `(model, work-class) -> {quality_score, sample_n, last_measured, decision}` table (`allow`/`allow-with-verify`/`deny`) keyed on a serve fingerprint (model + quant + engine + serve flags); policy filters by hard constraints (including privacy / local-only residency) then ranks the survivors.
- Cloud-tier credentials on the Backend seam: Anthropic and OpenAI cloud backends with credentials referenced by env-var name, plus secrets redaction so keys never reach logs or the decision record.
- Cheap structural verify: near-zero-cost inline checks (empty/truncated content, tool-call JSON that does not validate, code that does not parse, a diff that does not apply).
- Streaming commit-window + verify-gated fallback + decision log: for fail-prone classes on the streaming path, a non-streamed commit window buffers and verifies before the first byte reaches the harness; on verify-fail / error / timeout / low-confidence the router retries up the tier chain (fast → heavy → cloud) with retry caps and a per-session cost budget; every decision is logged transparently (the response reports the real tier that served).
- Typed extension seams: Backend / verifier / policy extension points for adding tiers, engines, and checks without forking the core.
- `anvil-serving serve --config ...` CLI: starts the front door bound to the tiers declared in a router config; binds `127.0.0.1` by default.
- Profile bootstrap + async calibration + traffic metrics + per-work-class promotion: bootstrap the quality table from the generalized shadow-eval, opt-in async calibration with serve-fingerprint staleness, real-traffic metrics, and a per-work-class promotion decision (planning/critic stay cloud-default, failover-only).
- OpenClaw tooling + reference adapter: validate-first tooling (wire-form + firing-cadence validator, logging hook, fixture) and a thin, swappable `before_model_resolve` reference adapter plugin. The core stays zero-OpenClaw-coupling.
Known limitations
- OpenClaw live validation is manual. Validating the integration against a real OpenClaw install (firing cadence and outbound wire `model` form) requires a human on the gateway box; see `examples/openclaw/README.md`. The committed `hook-fire-log.jsonl` is a representative fixture, not a live capture.
- Most promotion verdicts are seed/expected. Per-work-class promotion decisions in the shipped profile are hand-seeded and pending real-traffic calibration; only `planning` rests on hard eval data (in the companion notes repo `fakoli/anvil-serving-notes`).
- The T017 traffic fixture is synthetic. Traffic-metrics behavior is exercised against a synthetic fixture, not yet against real routed production traffic.