ADR-0006 — Multiplexer swaps drain in-flight requests before evicting the resident model¶

Status: Accepted
Date: 2026-07-02
Relates to: ADR-0002 (serves are compose-defined), the multiplexer single-resident swap model (anvil_serving/multiplexer.py), PR #98's parked follow-up list

Context¶

The multiplexer serves ONE resident model per GPU and swaps on demand: a request for a non-resident model stops the old backend container and starts the new one. Multiplexer.ensure_loaded serialized the load/swap under a lock, but the relay (the long-lived streaming copy of the backend's response to the client) ran deliberately outside the lock so concurrent same-model requests don't serialize.

The gap: nothing connected the swap to those in-flight relays. A request for model B while a request for model A was still streaming would docker rm -f A's container mid-stream — the A client saw a connection reset partway through its completion, with no error status (the 200 + headers were already sent). On the fakoli-dark fast tier — two models sharing one GPU/port as a swap pair, driven by alternating harness traffic — this is not a corner case; it is the steady state.

Constraints: stdlib-only (threading primitives, no async framework); the fix must not serialize same-model requests (AC3: concurrent requests for the resident model share it freely); and a swap must not be blockable forever by a hung client (bounded availability for the requested model).

Considered options¶

Reader/writer lock over the whole request — relays hold a shared lock, swaps take it exclusively. Rejected: an unbounded hold by one slow client blocks the swap forever, and stdlib has no fair RW lock; hand-rolling one adds more state than the problem needs.
Reject non-resident requests while anything is in flight (503, client retries) — simplest, but it pushes the drain problem onto every caller and makes swap latency visible as errors instead of waiting.
Lease counting + condition variable with a bounded drain (chosen) — each relay holds a lease on its model for exactly the duration of the upstream copy; a swap waits on a condition until the old resident's lease count reaches zero, up to a drain_timeout, then proceeds regardless (severing and logging any laggards).

Decision¶

Multiplexer gains lease-based in-flight tracking, all under the existing lock (now a threading.Condition):

mux.lease(name) is the new serve-path API: a context manager that performs ensure_loaded and registers the in-flight lease atomically (both under the condition's lock), yields the backend base_url, and releases + notifies on exit. Atomicity closes the race where a swap lands between "ensure_loaded returned" and "the relay registered itself". The HTTP handler holds the lease for the entire upstream open + relay.
Swaps drain before stopping. Inside the swap, before backend.stop(), the swapping thread waits on the condition until the old resident's lease count is zero or drain_timeout (default 30 s, CLI --drain-timeout) elapses. On timeout it logs the severed count and proceeds — a hung client bounds, never blocks, model availability. --drain-timeout 0 restores the old swap-immediately behaviour.
New arrivals queue behind an in-progress swap. Condition.wait releases the lock during the drain, so without a gate a stream of requests for the OLD model could keep taking fresh leases on the dying resident and starve the swap. A _swapping flag makes every ensure_loaded/lease acquisition wait until the swap settles, then re-evaluate residency (a queued old-model request then triggers its own swap back — thrash policy stays the router's residency concern, ADR-untouched).

Consequences¶

An active completion is no longer severed by a routine swap; swap latency for the new model is now bounded by min(longest in-flight request, drain_timeout).
Same-model concurrency is unchanged (leases are counted, not serialized), and ensure_loaded keeps its exact exception contract (UnknownModel / LoadError / BackendError) — the OOM guard still runs before the drain so a doomed swap never waits.
The severed-laggard path still exists, by design, at the drain_timeout boundary; the handler's existing "backend unreachable → 503" mapping covers a request that loses the race, and the timeout event is logged with the count.
Follow-up (out of scope here): swap debounce/hysteresis for alternating-model traffic, and surfacing drain waits in /healthz for operator visibility.