Model Settings — Qwen3.5-35B-A3B (AWQ 4-bit)

Source: the model card (cyankiwi/Qwen3.5-35B-A3B-AWQ-4bit, base Qwen/Qwen3.5-35B-A3B) + the model's generation_config.json. This document covers serving configuration for this model on the fast/heavy local tier (example: served-model-name: qwen35-awq-local).

What this model is

35B total / 3B active, 40 layers, hybrid Gated DeltaNet (linear attn) + Gated Attention + MoE (256 experts, 8+1 active), multimodal, 262K native context (→1M via YaRN), Apache-2.0. Benchmarks: SWE-bench Verified 69.2, Terminal-Bench 2 40.5, BFCL-V4 67.3, TAU2-Bench 81.2, LiveCodeBench 74.6. Genuinely strong for the specialist tier.

The gotcha: thinking is ON by default

The model emits <think>…</think> before the answer. With a small max_tokens budget it spends the entire budget reasoning and returns empty content — a valid-looking JSON response with an empty content array. Two correct ways to use the model:

  • Non-thinking (recommended for bulk execution packets): send chat_template_kwargs: {"enable_thinking": false} → direct answers, no thinking overhead.
  • Thinking (for hard review/planning): leave thinking on but give adequate max_tokens (≥4096) so the model finishes reasoning and still answers. (Does NOT support /think /nothink soft switches.)
  • Multi-turn: conversation history should contain only final answers, not prior <think> content (the chat template handles this; any harness that builds prompts itself must do the same).
Mode temperature top_p top_k min_p presence_penalty repetition_penalty
Thinking — general 1.0 0.95 20 0.0 1.5 1.0
Thinking — precise coding (WebDev) 0.6 0.95 20 0.0 0.0 1.0
Instruct / non-thinking — general 0.7 0.8 20 0.0 1.5 1.0
Instruct / non-thinking — reasoning 1.0 1.0 40 0.0 2.0 1.0

Output length: the card recommends ~32,768 tokens for most queries (give it room). Context: keep ≥128K to preserve thinking quality.

Request snippet (non-thinking coding tier)

{
  "model": "qwen35-awq-local",
  "messages": [ /* lean scoped prompt */ ],
  "max_tokens": 4096,
  "temperature": 0.7, "top_p": 0.8, "presence_penalty": 1.5,
  "extra_body": {
    "top_k": 20, "min_p": 0.0,
    "chat_template_kwargs": { "enable_thinking": false }
  }
}

For OpenClaw or any harness that supports per-model default params, set these as the per-agent generate_cfg/extra_body for the local specialist slot.

SGLang server flags

  • --reasoning-parser qwen3 + --tool-call-parser qwen3_coder — parse thinking tokens + Qwen tool calls (required for agentic use).
  • --language-only — skip the vision encoder (text/code only) → frees VRAM for KV, faster load.
  • --context-length 131072 (128K), --kv-cache-dtype fp8_e5m2, --mem-fraction-static 0.88, --max-running-requests 16, --cuda-graph-max-bs-decode 8. (--weight-loader-disable-mmap was dropped in #108 — it was a 9P-bind-mount-era workaround, obsolete now that weights load from a named Docker volume; see CLAUDE.md gotcha #16.)

Speculative decoding via the model's native MTP (self-speculative) — production default

The model ships a Multi-Token-Prediction head; SGLang self-speculates from it directly (no separate draft model or added VRAM cost):

--speculative-algorithm NEXTN --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4

This is enabled by default in docker-compose.yml as of 2026-07-02 — validated live on this exact checkpoint before being turned on: +30-43% decode throughput depending on concurrency, ~82% draft-token acceptance rate, and confirmed SGLang issue #19796 (an SM120 NaN-on-prefix-cache-hit crash) does not reproduce on this stack. Known tradeoff: TTFT regresses under concurrency (+37% at concurrency=4); net end-to-end latency still improved in every trial. Full methodology and numbers: ADR-0008.

Process to change model settings (repeatable)

  1. Edit the docker-compose.yml server flags and/or the sampling params above.
  2. Recreate the container with the updated compose file (docker compose up -d --force-recreate).
  3. Watch: docker logs -f sglang → "ready to roll"; curl http://127.0.0.1:30000/health → 200.
  4. Validate: anvil-serving preflight --base-url http://127.0.0.1:30000/v1 --model qwen35-awq-local — pass --no-thinking (injects chat_template_kwargs:{enable_thinking:false}) so the model returns actual content rather than timing out on the thinking budget.