ADR-0002: Model serves are Docker-Compose-defined
- Status: Accepted (2026-06-30)
- Date: 2026-06-30
- Relates to: `anvil_serving/serves.py` (`serves up|down|status`), `anvil_serving/deploy.py` (renders a compose file), `examples/fakoli-dark/serves.toml`, `docs/SERVES-AND-EVAL.md`, and `CLAUDE.md` gotchas #11 (MSYS path-mangling), #13 (GPU pinning), #14 (UVA), and #15 (weights cache).
Context
The serving substrate stood up model containers in two different ways. The heavy tier was already
Docker-Compose-defined (`examples/fakoli-dark/docker-compose.yml`); the fast tier was a hand-rolled
`docker run` one-liner in a bash script (`serve-fast-gptoss-vllm.sh`). Ad-hoc `docker run` proved
fragile in exactly the ways this repo keeps rediscovering:
- MSYS path-mangling. Under Git Bash, `docker run … serve /models/x` rewrites the leading-slash path to `C:/Program Files/Git/models/x`, and vLLM errors with `Repo id must be in the form …`. The scripts have to defend with `MSYS_NO_PATHCONV=1 MSYS2_ARG_CONV_EXCL='*'` (gotcha #11), and a first-time `serves up fast` needs `bash` on PATH just to run the script.
- Quoting / flag drift. GPU-UUID env vars, `--ipc=host`, long vLLM flag lists, and volume specs are easy to mistype in a shell line and hard to diff.
- Port conflicts between overlapping hand-started containers.
- The stale-container drift bug (the decisive one). `serves up` restarts a stopped-but-existing container with `docker start`, which replays its original create-time config. Edit the model, flags, or mount, and `docker start` silently serves the stale container: the config on disk and the config actually running diverge with no signal.
Meanwhile, `docker compose up -d` already solves the drift case: it diffs the desired spec against the
running container and recreates natively when the config has changed. Having one tier declarative
and the other imperative meant the drift-safe path covered only half the fleet.
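The difference between the two restart paths can be illustrated with a toy model. This is only a sketch of the idea (compare a fingerprint of the desired spec with the one the container was created from), not Compose's actual implementation. `ToyContainer`, `start_existing`, `reconcile`, and the image and model names are all hypothetical.

```python
import hashlib
import json


def config_hash(spec: dict) -> str:
    """Stable fingerprint of a service spec (key order does not matter)."""
    canonical = json.dumps(spec, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class ToyContainer:
    """Stands in for a container: it remembers the spec it was created with."""

    def __init__(self, spec: dict):
        self.created_with = dict(spec)
        self.running = True


def start_existing(container: ToyContainer, desired: dict) -> ToyContainer:
    """Models the `docker start` path: replays create-time config, ignores `desired`."""
    container.running = True
    return container


def reconcile(container, desired: dict) -> ToyContainer:
    """Models the reconcile path: recreate only when the spec has changed."""
    if container is not None and config_hash(container.created_with) == config_hash(desired):
        container.running = True
        return container  # unchanged: cheap no-op
    return ToyContainer(desired)  # changed or missing: recreate from the desired spec


# Hypothetical specs: the operator edits the served model between runs.
old_spec = {"image": "example/vllm:tag", "command": ["serve", "/models/a"]}
new_spec = {"image": "example/vllm:tag", "command": ["serve", "/models/b"]}

stopped = ToyContainer(old_spec)
stopped.running = False

stale = start_existing(stopped, new_spec)  # still serves /models/a
fresh = reconcile(stopped, new_spec)       # recreated, serves /models/b
```

In the toy model, `stale.created_with` still equals `old_spec` while `fresh.created_with` equals `new_spec`, which is the divergence the drift bug describes.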
Considered options
- Keep the split (compose for heavy, `docker run` script for fast). Rejected: leaves the fast tier exposed to every failure above, and keeps two mental models for one operation.
- Harden the bash scripts (force-recreate, more MSYS guards, config-hash checks). Rejected: it reimplements, worse, what `docker compose up -d` already does natively, and still carries the MSYS/quoting surface.
- All serves Docker-Compose-defined; `serves up` delegates to `docker compose up -d` (a service per tier in one compose file). Chosen.
Decision
Model serves are Docker-Compose-defined. `anvil-serving serves up` delegates to
`docker compose up -d <service>` for compose serves, which is drift-safe (native recreate on config
change). Concretely:
- `examples/fakoli-dark/docker-compose.yml` holds one service per tier (`sglang` = heavy, `fast` = gpt-oss-20b on vLLM). Each `serves.toml` `up` targets its service by name (`docker compose … up -d sglang` / `… up -d fast`) so the tiers stay independent.
- The hard-won config is captured declaratively in the compose file: GPU pinned by UUID via `deploy.…devices.device_ids` + `CUDA_DEVICE_ORDER=PCI_BUS_ID`, `VLLM_USE_V2_MODEL_RUNNER=0` (WSL2/UVA), `ipc: host`, the model mount, and the full vLLM command, with no shell quoting and no MSYS path rewriting.
- A parametrized experiment harness (`docker-compose.experiment.yml`) covers one-off model trials: a single vLLM service driven by `MODEL` / `SERVED_NAME` / `PORT` / `GPU_UUID` / `EXTRA_ARGS` env vars, with the sm_120/WSL2 defaults (stable image, `VLLM_USE_V2_MODEL_RUNNER=0`, the D:-backed `vllm-hfcache` volume for ~15s native loads, `CUDA_DEVICE_ORDER=PCI_BUS_ID`) baked in. A future experiment is `MODEL=… GPU_UUID=… PORT=… docker compose -f … up -d`, never a hand-built `docker run`.
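As a rough illustration of the experiment harness, the invocation can be built as an argument list plus an environment mapping, so nothing passes through shell quoting. This is a minimal sketch: `experiment_invocation` is a hypothetical helper, the compose file location is left to the caller because the ADR elides it, and every value in the example call is a placeholder.

```python
import os


def experiment_invocation(compose_file, model, served_name, port, gpu_uuid,
                          extra_args="", base_env=None):
    """Return (argv, env) for a one-off trial; hypothetical helper, not repo code."""
    # Start from the caller's environment unless an explicit base is given.
    env = dict(os.environ if base_env is None else base_env)
    # These are the five variables the ADR says drive the experiment service.
    env.update({
        "MODEL": model,
        "SERVED_NAME": served_name,
        "PORT": str(port),
        "GPU_UUID": gpu_uuid,
        "EXTRA_ARGS": extra_args,
    })
    # argv is a list, so paths and flags are never re-parsed by a shell.
    argv = ["docker", "compose", "-f", compose_file, "up", "-d"]
    return argv, env


# Placeholder values throughout; the real path, model, and UUID are assumptions.
argv, env = experiment_invocation(
    compose_file="docker-compose.experiment.yml",
    model="/models/example",
    served_name="example-model",
    port=8000,
    gpu_uuid="GPU-00000000-0000-0000-0000-000000000000",
    base_env={},
)
```

An operator would then hand the pair to `subprocess.run(argv, env=env, check=True)`. Passing `base_env={}` here only keeps the example deterministic; a real run would inherit the process environment.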
This supersedes the ad-hoc `docker run` serve scripts (`serve-fast-gptoss-vllm.sh`,
`serve-fast-glm-vllm.sh`) as the way serves are launched; they remain in-tree only as reference.
Consequences
- Docker Compose v2 is now a serving-substrate operational prerequisite, NOT a Python runtime dependency. It is required only to operate the GPU serves (`serves` / `deploy`), which already require Docker + NVIDIA GPUs. The router (`anvil-serving serve`) and the rest of the Python package remain stdlib-only: no new import, no PyPI dependency, nothing added to the hot path. This is an ops tool the operator installs, on the same footing as Docker itself, and it is documented as such in `README.md` and `docs/SERVES-AND-EVAL.md`.
- The stale-container drift bug is closed for compose serves: `serves up` issues `docker compose up -d <service>` unconditionally, even when the container is already running, so editing the compose file and re-running `serves up` recreates the container to match (and is a cheap no-op when unchanged), instead of short-circuiting on "already running" or `docker start`-ing a stale one.
- One mental model. Every serve is `docker compose up -d <service>` / `down`; no per-tier bash entrypoint, no `bash`-on-PATH requirement for a first-time `serves up fast`, no MSYS guards.
- Trade-off: the compose files carry machine-specific facts (the fakoli-dark GPU UUIDs, Windows model paths, the `vllm-hfcache` external volume). They are an example topology, not a portable default; an operator on other hardware edits them (or renders a fresh one with `deploy`). This is the same locality the bash scripts already had, now declarative and diff-able.
- Follow-up (out of scope here): `serves.py` makes `serves up` delegate to `docker compose up -d` for compose serves (in a sibling change); this ADR records the decision that all serves are compose-defined, which that delegation depends on.
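The unconditional delegation described above might look roughly like the following. It is a sketch under stated assumptions, not the code in `serves.py`: `compose_up_argv` and `serves_up` are hypothetical names, and passing the compose file with `-f` is an assumption about how the service is addressed.

```python
from __future__ import annotations

import subprocess


def compose_up_argv(compose_file: str, service: str) -> list[str]:
    """Build the compose command for a single named service."""
    return ["docker", "compose", "-f", compose_file, "up", "-d", service]


def serves_up(compose_file: str, service: str, runner=subprocess.run) -> list[str]:
    """Always delegate to compose (hypothetical sketch).

    There is deliberately no "already running" check and no `docker start`
    fallback: compose decides whether to recreate or leave the container alone.
    """
    argv = compose_up_argv(compose_file, service)
    runner(argv, check=True)  # raises CalledProcessError on a non-zero exit
    return argv
```

The `runner` parameter exists only so the command can be inspected without Docker installed, for example by passing a function that records its arguments instead of executing them.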