ADR-0003 — Portable-by-default: out-of-box router correctness and a generated bring-up¶
- Status: Accepted (2026-07-01)
- Date: 2026-07-01
- Relates to: the
genericityPRD (v0.5.0, in anvil-state) · ADR-0001 (the advise-and-defer failover contract the relay-timeout decision protects) · ADR-0002 (serves are compose-defined — this ADR closes the "example topology, not a portable default" trade-off it accepted) · a 42-finding genericity audit · CLAUDE.md gotchas #1 (loopback bind) / #6, #9 (thinking-by-default empty content) / #13 (GPU-UUID pinning) / #14 (WSL2 UVA) / #15 (9P bind-mount) ·anvil_serving/router/{config.py, serve.py, backends/cloud.py, relay.py, commit_window.py},anvil_serving/{deploy.py, serves.py, multiplexer.py},configs/example.toml,templates/docker-compose.yml.tmpl.
Context¶
anvil-serving v0.4.x works end-to-end — but only on the authors' two-box topology (a macOS OpenClaw gateway routing to GPU serves on a Windows/WSL2 box over Tailscale). A 42-finding audit across five dimensions, grounded in hands-on onboarding onto a second machine, found the product is pinned to that one setup in two independent ways:
- The shipped router config 404s out of the box.
configs/example.toml— the file the quickstart tells users to run — omits the per-tiermodel=served-name, so the router forwards the harness's preset token (quick-edit) as the upstream OpenAImodelid, and every local request 404s (backends/cloud.py:upstream_model = self._tier.model or request.model). This is a product default defect, not an environment quirk. It compounds with two more: thinking-by-default local models (the strong current ones — gpt-oss, Qwen3.x) return empty final content on real agent turns, and theallowstreaming fast-path (serve.py) runs zero verifiers, so the veryNonEmptyContentcheck that exists to catch this is bypassed for exactly the common local classes; and a hardcoded 120s relay timeout (serve.pybuild_backend_for_tier) means a slow/hung local tier blows the caller's budget, defeating the fast native failover the advise-and-defer design (ADR-0001) depends on. - The environment bring-up is machine-pinned with no generator. GPU UUIDs, absolute
C:/Users/…model paths, ports, and a Tailscale IP are hand-authored;serves' default manifest is literally the authors' privateexamples/fakoli-dark/serves.toml; anddeployemits an incomplete, SGLang-only,0.0.0.0-exposed compose that neither pins the GPU correctly on WSL2 (integerdevice_idsare silently ignored — gotcha #13) nor wires the three artifacts that must agree on one string (compose--served-model-name↔serves.toml↔ router tiermodel=).
ADR-0002 explicitly accepted the compose files as "an example
topology, not a portable default — an operator on other hardware edits them (or renders a fresh one
with deploy)." In practice that escape hatch does not hold: deploy renders an artifact that is
incomplete, unsafe, and SGLang-only, so "render a fresh one" does not produce a working serve.
Constraints. The package stays stdlib-only on the hot path (no new runtime dependency); the advise-and-defer posture (ADR-0001) and compose-defined serves (ADR-0002) are unchanged; the work is additive and the existing suite stays green.
Considered options¶
- Document the gaps; leave the defaults as-is (treat
examples/fakoli-darkas reference only). Rejected: the shipped default config literally 404s on the first request. "Read the docs more carefully" does not fix a broken default, and the empty-content/timeout failures are silent. - Ship curated per-hardware preset configs. Rejected: hardware is too varied (GPU count, UUIDs, model paths, ports); static files just re-encode the machine-pinning problem N times and drift.
- Make the defaults correct out-of-box AND generate the bring-up from detected hardware, safe by
default. Chosen. It fixes the product defect (broken default) and the portability defect (no
generator) at their roots, and finally makes ADR-0002's "render a fresh one with
deploy" true.
Decision¶
anvil-serving is portable-by-default. Three commitments:
- A. Out-of-box correctness. The shipped defaults must serve a real request. A local tier's
model=(the upstream served-model-name) is required — the config loader warns when a local tier omits it (the routing token would 404 upstream) — and may be auto-derived once at startup fromGET {base_url}/v1/models(fail-fast on 0 or >1 candidates; an explicitmodel=always wins). The router is correct for arbitrary local models: a per-tierextra_bodypassthrough lets an operator injectchat_template_kwargs={enable_thinking:false}(taming thinking-by-default models); a minimal verify-on-local-allowgate (NonEmptyContent/NotTruncatedvia the commit window) ensures an empty/truncated local200is caught and escalated/deflected, never delivered to the harness; and the relay timeout is configurable and defaults short (≈15–30s) so a hung local tier fails fast into the ADR-0001 native-failover handoff instead of stalling ~120s. - B. Generated, consistent bring-up.
anvil-serving init(aliasonboard) detects GPUs (index + UUID) and reads themodels synccatalog, then emits compose +serves.toml+ router config that agree on served-name and port.deploybecomes the generation engine: UUID-based GPU pinning (CUDA_VISIBLE_DEVICES=<uuid>+CUDA_DEVICE_ORDER=PCI_BUS_ID, belt-and-suspenders withdevice_ids, because integerdevice_idsare ignored on Docker-Desktop/WSL2 — gotcha #13); it emits theserves.tomlentry + a router-tier stub; and it supports vLLM as well as SGLang — reusingmultiplexer.py's GPU-UUID resolution and per-engine argv (factored into a sharedanvil_serving/gpus.py) so there is no second, drifting implementation.servesstops defaulting to the authors' private manifest (defaults to./serves.toml, points atinitwhen absent). - C. Safe by default. Generated serves bind
127.0.0.1; exposing an (unauthenticated) serve beyond loopback is an explicit--expose-lanopt-in that prints a security warning — matchingSECURITY.mdand the front door's--hostbehavior. This closes the ADR-0002 trade-off: the portable default is now both real and safe.
Consequences¶
- What improves. A stranger's first two commands (install,
serve) succeed on a current, common environment; one command generates a working, correctly-pinned, loopback-bound bring-up with no hand-editing of UUIDs, paths, or served-names; the advise-and-defer failover degrades gracefully under a down/slow local tier; thinking-by-default models are usable through the router; and the example's machine-specific values are parameterized (${…}) or# REPLACE:-annotated. - What we keep. The stdlib-only hot path —
init/deploy/gpusare operator-side tooling on the same footing as the Docker/Compose prerequisite ADR-0002 already established, not a runtime import. The advise-and-defer cost posture (ADR-0001) and compose-defined serves (ADR-0002) are unchanged; this is additive and config-gated. - What it costs (new surface). A generator that must keep
deploy,serves,models sync, and the router config schema consistent — mitigated by reusing existing code (multiplexerGPU/argv,deployrender) and keeping emitted artifacts minimal rather than feature-rich. An optional/v1/modelsstartup probe (kept fail-fast; explicitmodel=always wins). A bounded verify/buffer cost on the local streaming happy path (scoped to empty/truncated checks, local tiers only). - Relationship to prior ADRs. Supersedes nothing. It partially resolves the ADR-0002 trade-off
("machine-specific example, not a portable default") by making
deploy/initrender the portable, safe default that ADR-0002 assumed a user could produce. It protects the ADR-0001 failover contract by making a local-tier stall fail fast enough for the gateway's native failover to succeed. - Open tactical choices (deliberately deferred; tracked in the
genericityPRD's Open Questions): interactive vs declarativeinit;/v1/modelsprobe at startup vs lazily on first request; named-volume vs bind-mount as the generated model-mount default (the 9P-slowness gotcha #15); and whether amodels pullweight-fetch helper is in scope for v0.5 (currently a Non-Goal). - Tracking. Implementation is the
genericityPRD (v0.5.0) in anvil-state — 19 requirements, 3 features (Router correctness / Portable generation / First-run truth), 15 tasks.