ADR-0008 — Heavy tier enables NEXTN speculative decoding (self-speculation, no draft model)¶
- Status: Accepted
- Date: 2026-07-02
- Relates to: the fakoli-dark heavy tier (
examples/fakoli-dark/docker-compose.yml, servicesglang, modelqwen35-awq= Qwen3.5-35B-A3B AWQ), ADR-0002 (serves are compose-defined)
Context¶
The heavy tier serves a hybrid architecture — Qwen3.5-35B-A3B is MoE (256 experts, 8 active) with
an attention pattern that alternates 3 linear-attention (Gated-DeltaNet-style) layers per 1
full-attention layer (10 full-attention / 30 linear-attention layers total, confirmed via the
model's config.json). This was previously undocumented: prior investigation of hybrid-model
serving problems focused on the parked Qwen3.6-27B-NVFP4 trial; the production tier turned out to
share the same architecture family.
SGLang ships --speculative-algorithm NEXTN, which self-speculates from the model's own built-in
multi-token-prediction (MTP) head — no separate draft model or additional VRAM budget for one is
needed (confirmed live: the MTP head loaded as an extra ~2.3 GB, type=Qwen3_5ForCausalLMMTP).
Community/upstream research (a PyTorch blog post on hybrid-model speculative decoding in SGLang)
showed this mechanism is architecturally sound for GDN-style hybrid models — SGLang allocates an
independent Mamba cache slot per draft token and promotes the accepted one, avoiding the state-
rollback problem that breaks naive speculative decoding on recurrent architectures. That research
also surfaced one specific, concrete risk for our exact hardware: SGLang issue #19796 — an
SM120-specific NaN-in-logits crash when Eagle-style speculative decoding hits a radix-cache prefix
match. Our production config runs 131072-context agent traffic specifically to make prefix caching
valuable, so this risk was not a footnote to wave past.
Considered options¶
- Do not adopt speculative decoding. Zero risk, zero gain. Rejected once step 1 showed a real, repeatable decode-throughput gain with no observed downside on this checkpoint.
- Adopt spec-decode, disable radix cache (prefix caching) to sidestep #19796 entirely. Avoids the known risk but forfeits prefix caching for 131072-context agent traffic — likely a worse trade than the one it avoids, given how much agent turns benefit from not re-prefilling a shared system-prompt/tool-schema prefix on every request.
- Adopt spec-decode with radix cache left on, after directly testing whether #19796 reproduces on our exact stack (chosen). Requires live verification before rollout, not just trusting the upstream report — this repo's rule is never to self-verify a correctness claim without an independent check, and "the bug report is for a different config than ours" is exactly the kind of assumption that needs testing, not trusting.
Decision¶
Enable --speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1
--speculative-num-draft-tokens 4 on the heavy tier's SGLang command, with radix cache left at
its default (enabled) — i.e. do not add --disable-radix-cache.
This was validated with a live A/B on fakoli-dark before merging, not just adopted on the strength of upstream reports:
- Step 1 (throughput/correctness on the actual AWQ checkpoint, radix cache off to isolate the spec-decode variable): +42.8% output tok/s at concurrency=1 (187.81→268.15), +29.7% at concurrency=4 (486.16→630.64), stable ~82% draft-token acceptance rate, zero crashes/NaN/garbage output across all runs.
- Step 2 (does #19796 reproduce on our stack, radix cache back on): two escalating tests —
single-turn shared-prefix at 78.6% cache-hit rate, then concurrent (concurrency=4) 3-turn
conversations at 96.2% cache-hit rate (the closest analog to real agent traffic) — both clean,
zero errors, confirmed via the server's own
#cached-tokenlog lines, not just the client-side report.
The key numbers above are the complete record of what was measured; the raw benchmark logs from
this validation session are not committed to this repo (per the existing convention — dated
trial/bake-off findings live in the private companion repo fakoli/anvil-serving-notes, not
inline in the product repo).
Consequences¶
- Gain: genuine decode-throughput improvement (30-43% depending on concurrency) at zero additional steady-state VRAM cost (self-speculation reuses the target model's own weights).
- Known, accepted tradeoff — TTFT regresses under concurrency. Mean TTFT was 12.4% better at concurrency=1 but 37.0% worse at concurrency=4 (146.66→200.89 ms) in the step-1 trial. Not yet explained (spec-decode targets decode speed, not prefill/first-token latency); flagged for monitoring post-rollout rather than blocking on it, since net end-to-end latency still improved in every trial run (E2E latency dropped 20-30% despite the TTFT regression, because decode dominates total latency for multi-hundred-token completions).
- Cold-start cost, first time only. The first container to run with these flags anywhere pays
a one-time ~9.5-minute CUDA graph JIT tax for the new "target verify" code path (vs. the usual
~80-114s warm restart). This amortizes: once cached in the
sglang-cachevolume, subsequent restarts with the same flags were observed to return to normal restart times (~110-120s) in this session's own testing. - #19796 does not reproduce on this stack, but this is not a universal guarantee. Tested
configuration: single-GPU (
tp_size=1), this SGLang build (lmsysorg/sglang:latestas of 2026-07-02), Qwen3.5-35B-A3B AWQ specifically. The upstream issue may still apply under different conditions (multi-GPU tensor parallelism, other quantizations, other model families) not tested here — re-verify before assuming this finding transfers to a different serving configuration. - No wire-level or contract change.
--served-model-name qwen35-awq-localis unchanged; the router,configs/example-docker.toml, and any harness pointed at the router (including OpenClaw on Fakoli Mini) require no changes — this is purely a backend performance tuning decision. examples/fakoli-dark/docker-compose.heavy.yml(the superseded single-tier reference file) is not kept flag-for-flag in sync with this change — its header now says so explicitly, so a future reader does not assume it reflects current production flags.