serves + eval: managing the model serves and running the evals
Two CLI verbs that close long-standing gaps: the router only ever connected to the model containers (never controlled them), and the evals were three different invocation styles with no single entry point.
anvil-serving serves: model-serve lifecycle
The router (anvil-serving serve) talks to the GPU model serves as backends but
never starts or stops them. serves does, driven by a declarative manifest
(default examples/fakoli-dark/serves.toml)
that is the single source of truth for which container runs on which port as
which model.
Operational prerequisite:
serves (and deploy) drive Docker and Docker Compose v2 (docker compose …) to run the GPU model containers. These are ops requirements for standing up the serving substrate, not Python runtime dependencies. The router and the whole anvil-serving package stay stdlib-only (nothing added to pip install, nothing in the hot path); Docker Compose is a tool the operator installs alongside Docker + the NVIDIA runtime. Every serve is Docker-Compose-defined, so serves up is a drift-safe docker compose up -d; see ADR-0002.
anvil-serving serves status # docker state + health + GPU memory per serve
anvil-serving serves down # docker stop every serve (free the GPUs)
anvil-serving serves down fast # stop one (by manifest name or container name)
anvil-serving serves up # start them (see below)
anvil-serving serves up --dry-run # print what would run, start nothing
anvil-serving serves --manifest X.toml status # use a different topology
up is mechanism-aware by container state: running → left alone; stopped
(exited/created) → restarted with docker start (fast, no reload); paused →
docker unpause; missing → created fresh from the manifest's up command (a
docker compose up -d <service> per tier — both tiers are now Docker-Compose-
defined; see ADR-0002). A container in an
exotic state (dead/restarting) is left for you to resolve rather than blindly
re-created. down likewise stops any state that holds the GPU (running/paused/
restarting), not just running.
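The state rules above can be sketched as a small decision function. This is a minimal illustration, not the package's implementation: plan_up and holds_gpu are hypothetical names, and the state strings are the Docker container states named in the text.

```python
from typing import Optional

# States in which a container still holds the GPU, per the text above.
GPU_HOLDING_STATES = {"running", "paused", "restarting"}


def plan_up(state: Optional[str], container: str, up_argv: list[str]) -> Optional[list[str]]:
    """Return the argv `up` would run for one serve, or None to run nothing.

    `state` is the Docker container state, or None when the container is missing.
    (Hypothetical helper; the real CLI may structure this differently.)
    """
    if state is None:
        return list(up_argv)                    # missing: create from the manifest's up command
    if state == "running":
        return None                             # already serving: leave alone
    if state in ("exited", "created"):
        return ["docker", "start", container]   # stopped: fast restart, no reload
    if state == "paused":
        return ["docker", "unpause", container]
    return None                                 # dead/restarting: left for the operator


def holds_gpu(state: Optional[str]) -> bool:
    """True when `down` should stop the container to free the GPU."""
    return state in GPU_HOLDING_STATES
```

Returning None for both "running" and the exotic states keeps the function side-effect free; a caller that wants to warn about dead/restarting containers would check the state separately.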
Two notes on up: (1) The manifest up is executed: it's parsed with shlex and run as an argv list (no shell, so paths with spaces are safe and there's no injection sink), but treat the manifest as trusted like a Makefile. (2) Every serve's up is docker compose -f {dir}/docker-compose.yml up -d <service> (sglang for heavy, fast for the gpt-oss vLLM tier). docker compose up -d is drift-safe: it natively recreates a service whose config has changed, closing the old bug where a stopped docker run container kept serving a stale model. This supersedes the ad-hoc serve-fast-*.sh scripts (kept only as reference); a first-time serves up fast no longer needs bash on PATH.
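As a rough sketch of the shlex-to-argv step in note (1): build_up_argv is a hypothetical helper, and substituting {dir} after splitting is an assumption about how a directory containing spaces can stay a single argument.

```python
import shlex
from pathlib import Path


def build_up_argv(up: str, manifest_path: str) -> list[str]:
    """Split a manifest `up` string into argv, then fill in {dir} per token.

    Hypothetical helper. Splitting first and substituting second means a
    manifest directory containing spaces is never re-split into several args.
    """
    manifest_dir = str(Path(manifest_path).resolve().parent)
    return [token.replace("{dir}", manifest_dir) for token in shlex.split(up)]


argv = build_up_argv(
    "docker compose -f {dir}/docker-compose.yml up -d fast",
    "examples/fakoli-dark/serves.toml",
)
```

The resulting list can be handed to subprocess.run(argv, check=True) without shell=True, so no shell ever interprets the manifest string.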
Manifest entry:
[[serve]]
name = "fast" # logical name (also accepted by down/up)
container = "vllm-gptoss" # docker container name (== the compose service's container_name)
port = 30001
model = "gpt-oss-20b" # served-model-name (used by `eval`)
health = "/health"
up = "docker compose -f {dir}/docker-compose.yml up -d fast" # {dir} = the manifest's dir
Standing up a one-off experiment serve
Trying a new model (e.g. for the Blackwell lab notebook) does not need a hand-built
docker run. The parametrized
examples/fakoli-dark/docker-compose.experiment.yml
is one vLLM service driven by env vars, with the hard-won sm_120/WSL2 defaults baked in
(stable image, VLLM_USE_V2_MODEL_RUNNER=0, the D:-backed vllm-hfcache volume for ~15s
native loads, CUDA_DEVICE_ORDER=PCI_BUS_ID):
MODEL=RedHatAI/Qwen3-32B-NVFP4 \
GPU_UUID=GPU-04d3b6e7-5691-3e86-1d34-c37999440cf1 \
PORT=30002 SERVED_NAME=qwen3-32b-nvfp4 \
docker compose -f examples/fakoli-dark/docker-compose.experiment.yml up -d
# extra vLLM flags (parsers, trust-remote-code, …) ride in EXTRA_ARGS:
# EXTRA_ARGS="--reasoning-parser qwen3 --tool-call-parser qwen3_coder --trust-remote-code"
MODEL and GPU_UUID are required; SERVED_NAME, PORT, and EXTRA_ARGS have defaults. Once it
answers on :{PORT}, point anvil-serving eval preflight --base-url http://127.0.0.1:{PORT}/v1
--model {SERVED_NAME} at it.
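Waiting for the experiment serve to answer can be scripted with a small poll loop. This is a sketch under assumptions: wait_until_healthy is a hypothetical helper, and it assumes the experiment serve exposes the same /health path the manifest entry uses for the vLLM tier.

```python
import time
import urllib.error
import urllib.request


def wait_until_healthy(url: str, timeout_s: float = 300.0, interval_s: float = 2.0) -> bool:
    """Poll `url` until it returns HTTP 200 or `timeout_s` elapses (hypothetical helper)."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        try:
            with urllib.request.urlopen(url, timeout=5) as resp:
                if resp.status == 200:
                    return True
        except (urllib.error.URLError, OSError):
            pass  # not listening yet, or the model is still loading
        time.sleep(interval_s)
    return False
```

With PORT=30002 as in the example above, wait_until_healthy("http://127.0.0.1:30002/health") would return True once the serve is ready, at which point the preflight command can be run.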
anvil-serving eval: one entry point for the evals
anvil-serving eval preflight --tier fast # correctness gate vs the fast serve
anvil-serving eval benchmark --tier heavy # throughput / request-replay
anvil-serving eval planning # planning bake-off (offline re-grade)
anvil-serving eval planning --live # also re-generate against live serves
anvil-serving eval bootstrap # replay eval fixtures -> quality profile
preflight/benchmark resolve --base-url and --model from the serves manifest, so --tier fast is enough. If that serve is down, you get an actionable hint (start it: anvil-serving serves up fast) instead of a connection error. Pass extra script flags after the options, or use --base-url/--model to target any endpoint.
planning drives the planning-capability bake-off. The default --offline re-runs the deterministic structural grade + aggregate over the committed eval-data (no serves needed, byte-reproducible). --live first runs eval_gen.py against the heavy+fast serves (the frontier baseline and blind judge panel remain human-agent steps; see the eval README).
bootstrap replays the committed eval fixtures into a quality-profile table (anvil_serving.router.profile_bootstrap --replay), the eval-grounded seed for the router's routing policy (planning → cloud allow; locals deny).
Typical flow
anvil-serving serves up # bring the models up
anvil-serving eval preflight --tier fast # is it correct?
anvil-serving eval benchmark --tier fast # is it fast enough?
anvil-serving serves down # free the GPUs when done
anvil-serving eval planning # re-grade the bake-off offline anytime