Architecture review + model landscape — 2026-07-02¶
Session record of the full architecture review, the fixes it produced (PRs #96–#102), the mid-2026 local-model landscape research for the fakoli-dark rig, and the recommended next actions. Kept in-repo so it survives the ephemeral review environment and is retrievable with a
git pull. (Dated findings normally live in the companion notes repofakoli/anvil-serving-notes; this one is here by request — move it there if the product-focus convention should win.)
What shipped (the PR series)¶
| PR | Landed | What it fixed |
|---|---|---|
| #96 | 78cb278 |
Tool-traffic wire fidelity. The relay backends rebuilt the upstream body from the flattened InternalRequest, silently dropping tools/tool_choice and the tool_use/tool_result history — a routed tier could never call a tool. Same-dialect requests now forward raw messages verbatim; cross-dialect (Claude Code → local vLLM) get full translation (router/dialects/translate.py). Also: Anthropic streaming output_tokens no longer reports 1 for a buffered commit; localhost doc stragglers fixed. |
| #97 | ea370bb |
[router].profile_path — the live server can load a MEASURED profile.json instead of always routing on hand-authored seeds (fail-fast if unloadable). Real usage passthrough — upstream usage blocks reach the response instead of word-count estimates. Classifier haystack — only short (≤150-word) system prompts join the keyword scan, so a harness's standing prompt can't multi-match every request into ambiguity. Auth-aware public-bind warning. |
| #98 | 58977a8 |
Production bug bash. Multiplexer relay read1 (real SSE TTFT through the proxy); DecisionLog bounded (10k ring buffer — was an unbounded per-request leak); calibrator backpressure; the quadratic unterminated-fence scan made linear (adversarial many-fence input was ~10⁹ comparisons); RouterConfig.tier() O(1); circuit-breaker probe expiry (wedge-proofing); loopback-default multiplexer bind; swap-path hardening (dead-child detect, docker rm verify, zombie reap, OOM-guard evictee credit); secrets-redaction + intent case fixes; GET Transfer-Encoding framing; POST trailing-slash. |
| #101 | c0c9286 |
In-flight draining across multiplexer swaps (ADR-0006). Relays hold an atomically-acquired lease; swaps wait for the old resident's leases (bounded by --drain-timeout, default 30 s) before docker rm — no more mid-stream connection resets on a routine swap. New arrivals queue behind an in-progress swap. |
| #102 | 48c8522 |
Residency-aware routing — RoutingBackend tracks the last-served local tier and feeds policy.route(residency=…), activating the AC3 anti-thrash reorder (was implemented+tested but never wired). True streaming relay — streaming requests issue stream: true upstream and forward the model's own deltas as they arrive (router/backends/sse.py); tool calls + usage reassembled identically to the buffered path. |
Suite grew 881 → 977 tests over the series; everything green on Linux/Windows × py3.11–3.13.
New operator knobs introduced¶
[router].profile_path— path to a measuredprofile.json(frompython -m anvil_serving.router.profile_bootstrap); absent → built-in seed profile.python -m anvil_serving.multiplexer --drain-timeout N— seconds a swap waits for in-flight requests on the old model (default 30;0= old swap-immediately behaviour).[router].verify_local_min = false— opt out of the minimal local-allow commit window to get true streaming TTFT on a localallowtier you trust (the safety-vs-latency dial; cloudallowtiers stream for real out of the box).
Architecture assessment (what's strong, what's still open)¶
Strong: the front door's HTTP framing discipline (smuggling rejection, keep-alive
drain accounting, constant-time auth); the never-raise posture of classify/resolve/route
with auditable notes; verify.py's "only fail what you can prove" principle; the
fail-closed deny default for unmeasured local tiers on high-risk classes; the
availability-never-bypasses-the-gate invariant; the ADR discipline.
Still open (in priority order):
- Profile persistence loop —
profile_pathloads at startup, butrecord_gradeupdates and the async calibration loop still aren't persisted/wired into the live server. The moat needs write-back: persist on grade (atomicos.replace), start the calibrate sampler as a daemon thread fromserve, wireapply_fingerprintto serve changes. - Swap debounce/hysteresis — residency reorder + drain remove most thrash; a queued old-model request still triggers a swap back. If decision logs show alternating-model ping-pong, add batching (swap only after N queued / T seconds).
- Multiplexer registry genericity — the default
REGISTRYinmultiplexer.pystill ships fakoli-dark personal paths (C:/Users/sdoum/...) inside the pip package; should move to an example registry file (ADR-0003 spirit). [Status update, later on 2026-07-02: substantially closed by #107 — the registry now uses container paths on thefakoli-modelsnamed volume and no personal host paths ship in the package. Residue: the default still presumes that pre-populated volume, andinit/doctorstill have no multiplexer coverage.] /healthzdrain visibility — surface in-flight counts + drain waits for operators (noted in ADR-0006 consequences).
Model landscape for fakoli-dark (mid-2026, sm_120: RTX 5090 32 GB + RTX PRO 6000 96 GB)¶
MoE-NVFP4 on sm_120 — gotcha #16 still stands, barely. CUTLASS #3096 documents a
working fix (FlashInfer SM120 patches + CUDA 13.0 compute_120f → correct output at
~39 tok/s) but it needs extensive patches to FlashInfer 0.6.5 + vLLM 0.17 that haven't
fully landed mainline, and it's still ~20% behind Marlin W4A16 (46–49 tok/s).
Keep AWQ/Marlin for MoE, NVFP4 for dense; watch for compute_120f in stable vLLM.
Refs: NVIDIA/cutlass#3096 · vllm-project/vllm#33416 · vllm-project/vllm#31085 ·
flashinfer-ai/flashinfer#2723.
The headline: Qwen3.6-27B (dense). Beats the previous open flagship
Qwen3.5-397B-A17B on every major coding benchmark — SWE-bench Verified 77.2 vs 76.2,
SWE-bench Pro 53.5, Terminal-Bench 2.0 59.3, GPQA 87.8 — and NVIDIA ships an official
Model-Optimizer checkpoint: nvidia/Qwen3.6-27B-NVFP4. Dense + NVFP4 is exactly this
rig's native fast path (gotchas #10/#16/#17). ~13–14 GB FP4 weights → fits the 5090 with
long-context KV headroom. Unified thinking/non-thinking checkpoint (gotcha #6 applies).
Recommended tier shuffle:
| Slot | Today | Recommended | Why |
|---|---|---|---|
| fast (5090 32 GB) | gpt-oss-20b (vLLM) | Qwen3.6-27B-NVFP4 — --quantization modelopt_fp4 --kv-cache-dtype fp8, qwen3 reasoning/tool parsers |
Dense NVFP4 native path; 77.2% SWE-bench dwarfs the current tier |
| heavy (Pro 6000 96 GB) | Qwen3.5-35B-A3B AWQ (SGLang) | gpt-oss-120b (MXFP4, ~60–65 GB, proven single-card: ~134 tok/s @12k ctx) — or consolidate on Qwen3.6-27B at huge batch/context | 122B-A10B NVFP4 wants 2 cards per the rtx6kpro wiki; MoE-NVFP4 still gated on sm_120 |
Strategic implication: if a 27B dense model beats the heavy tier on coding, the
fast/heavy split becomes about context length + concurrency, not quality. The
planning → deny verdict was measured against much weaker locals — re-run the
shadow-eval against Qwen3.6-27B before assuming it still holds.
Sources: qwen.ai/blog?id=qwen3.6-27b · the-decoder.com (Qwen3.6-27B coverage) · huggingface.co/nvidia/Qwen3.6-27B-NVFP4 · github.com/local-inference-lab/rtx6kpro · hardware-corner.net (gpt-oss-120b on RTX Pro 6000) · insiderllm.com VRAM-tier guides.
Next actions (ordered)¶
hf download nvidia/Qwen3.6-27B-NVFP4(to D:, per gotcha #13) → serve on the 5090 →anvil-serving preflight --needle-ctx 60000(large-prefill path especially, gotcha #10).- Re-run the shadow-eval /
eval bootstrapagainst it → writeprofile.json→ set[router].profile_path— let the measured profile decide the new tier ladder. - Consider gpt-oss-120b MXFP4 as the 96 GB heavy tier; benchmark against the AWQ 35B.
- Wire profile write-back + live calibration (open item 1 above).