ADR-0010 — Specialized-Engine Tier: run any model on any engine (config-first, RelayBackend-served)¶
- Status: Accepted
- Date: 2026-07-03
- Relates to: ADR-0003 (genericity / out-of-box router correctness), ADR-0004 (containerized token-authed service), ADR-0008 (heavy-tier speculative decoding), ADR-0009 (measured write-back loop — the fingerprint axis this adds is measured by that loop);
anvil_serving/router/{backends/relay.py, config.py, fingerprint.py, serve.py}; CLAUDE.md gotchas #10/#16/#17/#18 (sm_120 NVFP4), #6/#9 (thinking-budget starvation)
Context¶
Motivation (maximum model-access flexibility). SGLang is excellent for our agent workload (the heavy tier: hybrid GDN+MoE, spec-decode, radix cache) but it — like vLLM — gates which models we can run: it needs safetensors and day-one architecture support, so brand-new models and exotic quant formats are often unservable on it. The goal is that a new or better model is never blocked because our agent engine doesn't support it — run it on whatever engine does (llama.cpp for fresh GGUF architectures, ktransformers for MoE-NVFP4, DS4/DwarfStar for DeepSeek V4 Flash, ExLlamaV3 for EXL3), fronted by the same router, while SGLang stays the agent workhorse. This ADR is that flexibility feature. ktransformers/MoE-NVFP4 on sm_120 is the concrete first case below; the design is engine-general.
The hardware is Blackwell sm_120 (RTX PRO 6000 96 GB + RTX 5090). On sm_120 the batch engines choke on some model/quant combos we want to serve: MoE-NVFP4 grouped-GEMM produces garbage / crashes on vLLM and SGLang (CUTLASS #3096, vLLM #31085/#33416), and block-scaled FP8 is a dead-end (gotcha #16). ktransformers is flagged as the one system with a working MoE-NVFP4 path on sm_120 (kt-kernel MXFP4 operator, sidestepping the grouped-GEMM crash cluster) — a scoped trial candidate, not yet verified on this box. We therefore want a first-class way to declare a router tier backed by a specialized inference engine for model/quant combos the batch engines cannot serve well, toggleable per tier (opt-in).
Hard realities that bound the option space:
- RelayBackend already exists (
relay.py:21-78): aCloudBackendsubclass that relays to ANY OpenAI-compatible local endpoint overurllib, auth-optional (_require_key=False), inheriting the/v1/chat/completionspath, realstream:trueSSE relay, tool forwarding/translation, and usage/finish_reason/tool_calls extraction.build_backend_for_tier(serve.py:194-198) selects it for anyprivacy != "cloud"tier. So an OpenAI-compatible ktransformers serve is routable today with zero new backend code — a tier block suffices. - The Backend Protocol (
internal.py:170-180) is a singlegenerate()method; supported dialects are exactly{openai, anthropic}(cloud.py:93). - Three genuine gaps, all config/identity plumbing (not a new backend class):
Tier(config.py:61-86) is a frozen dataclass exposing noengine/quantization/paramsfield.fingerprint.IDENTITY_FIELDS(fingerprint.py:46-55) ALREADY hasquantizationandparamsslots, but they resolve toNoneon a realTier, so a kt serve is fingerprint-indistinguishable from a vLLM/SGLang serve at the same base_url/model — it could inherit another engine's measured verdict.relay_timeoutis a single globalRouterConfigfield (default 20 s,config.py:123) applied to every local tier; a slow NVFP4 large-prefill could be spuriously fail-fasted.- There is no per-tier concurrency cap — only a process-global
BoundedSemaphore(front_door.py) — so a low-throughput specialized engine cannot be given a smaller bound. - Golden rules in force: stdlib-only router hot path (
http.server+urllib); no new anvil runtime dependency without sign-off; no self-verification for any quality gate; preflight before trusting any new engine on sm_120; data-driven re-measure (no tier gates traffic without a measured-profile A/B); ADR discipline for any contract/architecture change; and the seams-taste rule (engines/plugins are thin config/adapters, not new core classes).
Considered options¶
- Config-first minimal (CHOSEN). Declare the specialized tier as an ordinary
privacy="local"/dialect="openai"tier pointed at the external kt serve, served by the existing RelayBackend. Make it first-class with additive optionalTierfields (engine,quantization,params,timeout, latermax_concurrency), addengineto the fingerprint identity, and gate the tier behind preflight + a measured-profile A/B. No new backend class; substrate untouched in the shipping path. - EngineDescriptor registry. A first-party in-process registry (launch + request-shaping +
capabilities + concurrency) consumed by BOTH router and substrate, extensible via a typed
engineseam, plus an optionalbackend-seam hook for a future non-HTTP engine. Rejected: its own Phase 1 is byte-identical to option 1's zero-code path, so the registry buys nothing for the only engine that exists — a framework for N=1. Thebackend-seam hook designs for a case the design itself concedes "would be a new ADR." Speculative generality; more surface to build, review, and get wrong for the same shipped capability. - New
SpecializedBackendclass (± registry-driven backend selection). Rejected: justified only by a non-HTTP transport or a dialect outside{openai, anthropic}— ktransformers is neither. This is exactly the "config would have sufficed" over-build; it violates the seams-taste rule and the no-over-build-past-RelayBackend constraint.
Decision¶
Adopt option 1. A specialized-engine tier is an ordinary privacy="local" / dialect="openai"
tier whose base_url points at an external, OpenAI-compatible specialized-engine serve
(ktransformers first), routed by the existing RelayBackend with no new backend code.
Engine request-body knobs ride the existing extra_body; launch-time engine flags are the
serve's concern.
"First-class" is delivered by additive, default-unset changes, phased by proven need:
- Phase 1 (zero anvil code): hand-author
docker-compose.ktransformers.yml+ a[[serve]]entry (serves.pyruns an arbitraryupargv and probes a configurablehealthpath); declare the tier; runpreflight(which also confirms kt speaks exactly/v1/chat/completions). - Phase 2 (additive contract): add optional
Tier.engine,Tier.quantization,Tier.params,Tier.timeout; add("engine", ("engine",))tofingerprint.IDENTITY_FIELDS; thread per-tiertimeoutthroughbuild_backendsas an override of the globalrelay_timeout. - Phase 3 (additive capacity): optional
Tier.max_concurrency, enforced by a per-tier stdlibBoundedSemaphore, sized frombenchmark. - Phase 4 (deferred): substrate first-class (multiplexer/deploy engine branches) and registry-driven backend selection — only if a genuine second engine earns it; own ADR.
Explicitly rejected: a SpecializedBackend class and registry-driven backend selection for
ktransformers. Gating: no specialized tier joins a live preset pool until preflight passes
and its quality enters the measured profile (keyed by its engine-distinct fingerprint) via an
independent A/B — never by self-verification, and never as the sole tier of a preset before A/B.
Consequences¶
Keep: the stdlib-only, urllib-only router hot path (kt is an external serve with its own
container/deps — no dependency crosses into the router); the RelayBackend serving path unchanged;
build_backend_for_tier's privacy-based selection unchanged. No new anvil runtime dependency.
Change (additive only): five optional Tier fields and one IDENTITY_FIELDS entry. Because
identity() omits None fields, every existing engine-less tier's fingerprint stays
byte-identical — no profile churn, no FINGERPRINT_SCHEMA bump. A specialized tier now keys a
DISTINCT measured-profile fingerprint (cannot inherit a vLLM/SGLang verdict), and an in-place
vLLM->kt swap at the same base_url marks old rows stale. A slow NVFP4 prefill gets a per-tier
timeout; a low-throughput engine gets a per-tier concurrency bound.
Give up / follow-up: no swap-managed or deploy-rendered kt until Phase 4 (the substrate
multiplexer/deploy surfaces stay hardcoded to sglang+vllm); a kt tier is brought up via a
hand-authored compose + [[serve]] entry meanwhile. The backend-seam / registry path is left
unbuilt until a second, non-OpenAI-compatible engine justifies it.
Risks / open items requiring a local trial (fail-safe by design — a broken/absent kt serve
raises CloudBackendError and fallback escalates): the relay hardcodes /v1/chat/completions
with no per-tier path/header override (cloud.py:622-633), so a non-standard kt path or header is
a real gap that must surface in preflight; kt's tool-call JSON, SSE chunk shape, usage block, and
GET /v1/models support must be verified before the tier is trusted; and the core premise — kt
serving MoE-NVFP4 on sm_120 — is unverified on this box. Preflight is the first proof; no
production preset lists the tier until preflight + A/B pass.