Model Settings — Qwen3.5-35B-A3B (AWQ 4-bit)¶
Source: the model card (cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit, base Qwen/Qwen3.5-35B-A3B) + the
model's generation_config.json. This document covers serving configuration for this model on the
fast/heavy local tier (example: served-model-name: qwen35-awq-local).
What this model is¶
35B total / 3B active, 40 layers, hybrid Gated DeltaNet (linear attn) + Gated Attention + MoE (256 experts, 8+1 active), multimodal, 262K native context (→1M via YaRN), Apache-2.0. Benchmarks: SWE-bench Verified 69.2, Terminal-Bench 2 40.5, BFCL-V4 67.3, TAU2-Bench 81.2, LiveCodeBench 74.6. Genuinely strong for the specialist tier.
The gotcha: thinking is ON by default¶
The model emits <think>…</think> before the answer. With a small max_tokens budget it spends
the entire budget reasoning and returns empty content — a valid-looking JSON response with an
empty content array. Two correct ways to use the model:
- Non-thinking (recommended for bulk execution packets): send
chat_template_kwargs: {"enable_thinking": false}→ direct answers, no thinking overhead. - Thinking (for hard review/planning): leave thinking on but give adequate
max_tokens(≥4096) so the model finishes reasoning and still answers. (Does NOT support/think/nothinksoft switches.) - Multi-turn: conversation history should contain only final answers, not prior
<think>content (the chat template handles this; any harness that builds prompts itself must do the same).
Recommended sampling (verbatim from the model card)¶
| Mode | temperature | top_p | top_k | min_p | presence_penalty | repetition_penalty |
|---|---|---|---|---|---|---|
| Thinking — general | 1.0 | 0.95 | 20 | 0.0 | 1.5 | 1.0 |
| Thinking — precise coding (WebDev) | 0.6 | 0.95 | 20 | 0.0 | 0.0 | 1.0 |
| Instruct / non-thinking — general | 0.7 | 0.8 | 20 | 0.0 | 1.5 | 1.0 |
| Instruct / non-thinking — reasoning | 1.0 | 1.0 | 40 | 0.0 | 2.0 | 1.0 |
Output length: the card recommends ~32,768 tokens for most queries (give it room). Context: keep ≥128K to preserve thinking quality.
Request snippet (non-thinking coding tier)¶
{
"model": "qwen35-awq-local",
"messages": [ /* lean scoped prompt */ ],
"max_tokens": 4096,
"temperature": 0.7, "top_p": 0.8, "presence_penalty": 1.5,
"extra_body": {
"top_k": 20, "min_p": 0.0,
"chat_template_kwargs": { "enable_thinking": false }
}
}
For OpenClaw or any harness that supports per-model default params, set these as the per-agent
generate_cfg/extra_body for the local specialist slot.
SGLang server flags¶
--reasoning-parser qwen3+--tool-call-parser qwen3_coder— parse thinking tokens + Qwen tool calls (required for agentic use).--language-only— skip the vision encoder (text/code only) → frees VRAM for KV, faster load.--context-length 131072(128K),--kv-cache-dtype fp8_e5m2,--mem-fraction-static 0.88,--max-running-requests 16,--cuda-graph-max-bs-decode 8. (--weight-loader-disable-mmapwas dropped in #108 — it was a 9P-bind-mount-era workaround, obsolete now that weights load from a named Docker volume; see CLAUDE.md gotcha #16.)
Speculative decoding via the model's native MTP (self-speculative) — production default¶
The model ships a Multi-Token-Prediction head; SGLang self-speculates from it directly (no separate draft model or added VRAM cost):
--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4
This is enabled by default in docker-compose.yml as of 2026-07-02 — validated live on this
exact checkpoint before being turned on: +30-43% decode throughput depending on concurrency, ~82%
draft-token acceptance rate, and confirmed SGLang issue #19796 (an SM120 NaN-on-prefix-cache-hit
crash) does not reproduce on this stack. Known tradeoff: TTFT regresses under concurrency (+37% at
concurrency=4); net end-to-end latency still improved in every trial. Full methodology and numbers:
ADR-0008.
Process to change model settings (repeatable)¶
- Edit the
docker-compose.ymlserver flags and/or the sampling params above. - Recreate the container with the updated compose file (
docker compose up -d --force-recreate). - Watch:
docker logs -f sglang→ "ready to roll";curl http://127.0.0.1:30000/health→ 200. - Validate:
anvil-serving preflight --base-url http://127.0.0.1:30000/v1 --model qwen35-awq-local— pass--no-thinking(injectschat_template_kwargs:{enable_thinking:false}) so the model returns actual content rather than timing out on the thinking budget.