ADR-0006: Multiplexer swaps drain in-flight requests before evicting the resident model
- Status: Accepted
- Date: 2026-07-02
- Relates to: ADR-0002 (serves are compose-defined), the multiplexer single-resident swap model (anvil_serving/multiplexer.py), PR #98's parked follow-up list
Context
The multiplexer serves ONE resident model per GPU and swaps on demand: a request
for a non-resident model stops the old backend container and starts the new one.
Multiplexer.ensure_loaded serialized the load/swap under a lock, but the relay
(the long-lived streaming copy of the backend's response to the client) ran
deliberately outside the lock so concurrent same-model requests don't serialize.
The gap: nothing connected the swap to those in-flight relays. A request for
model B while a request for model A was still streaming would docker rm -f A's
container mid-stream — the A client saw a connection reset partway through its
completion, with no error status (the 200 + headers were already sent). On the
fakoli-dark fast tier — two models sharing one GPU/port as a swap pair, driven by
alternating harness traffic — this is not a corner case; it is the steady state.
Constraints: stdlib-only (threading primitives, no async framework); the fix
must not serialize same-model requests (AC3: concurrent requests for the resident
model share it freely); and a swap must not be blockable forever by a hung client
(bounded availability for the requested model).
Considered options
- Reader/writer lock over the whole request — relays hold a shared lock, swaps take it exclusively. Rejected: an unbounded hold by one slow client blocks the swap forever, and stdlib has no fair RW lock; hand-rolling one adds more state than the problem needs.
- Reject non-resident requests while anything is in flight (503, client retries) — simplest, but it pushes the drain problem onto every caller and makes swap latency visible as errors instead of waiting.
- Lease counting + condition variable with a bounded drain (chosen) — each
relay holds a lease on its model for exactly the duration of the upstream
copy; a swap waits on a condition until the old resident's lease count reaches
zero, up to a drain_timeout, then proceeds regardless (severing and logging any laggards).
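The chosen option can be sketched with stdlib primitives alone. The following is a minimal illustration, not the project's code: LeaseCounter, acquire, release, and drain are hypothetical names, and the only library behaviour relied on is threading.Condition.wait_for with a timeout.

```python
import threading


class LeaseCounter:
    """Hypothetical helper: counts in-flight relays for one resident model."""

    def __init__(self):
        self._cond = threading.Condition()
        self._count = 0

    def acquire(self):
        with self._cond:
            self._count += 1

    def release(self):
        with self._cond:
            self._count -= 1
            self._cond.notify_all()  # wake a swap that is waiting for zero

    def drain(self, timeout):
        """Wait until no leases remain or `timeout` elapses.

        Returns the number of laggards still holding a lease (0 if drained).
        """
        with self._cond:
            self._cond.wait_for(lambda: self._count == 0, timeout=timeout)
            return self._count


def demo():
    leases = LeaseCounter()

    # Case 1: a relay that finishes shortly after the swap starts draining.
    leases.acquire()
    finisher = threading.Timer(0.05, leases.release)
    finisher.start()
    drained = leases.drain(timeout=2.0)
    finisher.join()

    # Case 2: a hung client that never releases; the drain gives up at the timeout.
    leases.acquire()
    laggards = leases.drain(timeout=0.05)
    return drained, laggards
```

In this sketch the swap would proceed in both cases; the second simply reports one laggard to be severed and logged.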
Decision
Multiplexer gains lease-based in-flight tracking, all under the existing lock
(now a threading.Condition):
- mux.lease(name) is the new serve-path API: a context manager that performs ensure_loaded and registers the in-flight lease atomically (both under the condition's lock), yields the backend base_url, and releases + notifies on exit. Atomicity closes the race where a swap lands between "ensure_loaded returned" and "the relay registered itself". The HTTP handler holds the lease for the entire upstream open + relay.
- Swaps drain before stopping. Inside the swap, before backend.stop(), the swapping thread waits on the condition until the old resident's lease count is zero or drain_timeout (default 30 s, CLI --drain-timeout) elapses. On timeout it logs the severed count and proceeds: a hung client bounds, never blocks, model availability. --drain-timeout 0 restores the old swap-immediately behaviour.
- New arrivals queue behind an in-progress swap. Condition.wait releases the lock during the drain, so without a gate a stream of requests for the OLD model could keep taking fresh leases on the dying resident and starve the swap. A _swapping flag makes every ensure_loaded/lease acquisition wait until the swap settles, then re-evaluate residency (a queued old-model request then triggers its own swap back; thrash policy stays the router's residency concern, ADR-untouched).
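The three pieces above (atomic lease, drain before stop, swap gate) can be pictured together in one small class. This is a sketch under stated assumptions rather than the contents of anvil_serving/multiplexer.py: SketchMux, its start/stop callbacks, and the severed counter are invented for illustration, and error handling (UnknownModel, LoadError, the OOM guard) is omitted.

```python
import threading
from contextlib import contextmanager


class SketchMux:
    """Illustrative only: one resident model, lease-counted and swap-gated."""

    def __init__(self, start, stop, drain_timeout=30.0):
        self._cond = threading.Condition()
        self._start, self._stop = start, stop  # hypothetical backend callbacks
        self._drain_timeout = drain_timeout
        self._resident = None
        self._base_url = None
        self._leases = {}       # model name -> in-flight relay count
        self._swapping = False  # gate: arrivals wait while a swap is in progress
        self.severed = 0        # laggards cut off at the drain timeout

    def _ensure_loaded_locked(self, name):
        # Caller must hold self._cond.
        while self._swapping:          # queue behind an in-progress swap
            self._cond.wait()
        if self._resident == name:     # re-evaluate residency after the gate
            return
        self._swapping = True
        try:
            old = self._resident
            if old is not None:
                # wait_for releases the lock while waiting; the gate above
                # stops new leases on the dying resident in the meantime.
                self._cond.wait_for(
                    lambda: self._leases.get(old, 0) == 0,
                    timeout=self._drain_timeout,
                )
                self.severed += self._leases.get(old, 0)
                self._stop(old)
            self._resident = None
            self._base_url = self._start(name)
            self._resident = name
        finally:
            self._swapping = False
            self._cond.notify_all()

    @contextmanager
    def lease(self, name):
        with self._cond:  # load and lease registration happen atomically
            self._ensure_loaded_locked(name)
            self._leases[name] = self._leases.get(name, 0) + 1
            base_url = self._base_url
        try:
            yield base_url  # the relay runs here, outside the lock
        finally:
            with self._cond:
                self._leases[name] -= 1
                self._cond.notify_all()


def demo():
    events = []

    def start(name):
        events.append(("start", name))
        return f"http://backend/{name}"  # placeholder URL

    def stop(name):
        events.append(("stop", name))

    mux = SketchMux(start, stop, drain_timeout=2.0)
    a_streaming, a_may_finish = threading.Event(), threading.Event()

    def stream_a():
        with mux.lease("A"):
            a_streaming.set()
            a_may_finish.wait()
            events.append(("relay-done", "A"))

    def request_b():
        with mux.lease("B"):
            events.append(("relay", "B"))

    ta = threading.Thread(target=stream_a)
    ta.start()
    a_streaming.wait()
    tb = threading.Thread(target=request_b)
    tb.start()
    a_may_finish.set()  # let A's relay complete while B waits to swap
    ta.join()
    tb.join()
    return events, mux.severed
```

In the demo, the request for B cannot stop A's backend until A's relay has released its lease, so the "relay-done" event for A is always recorded before the "stop" event.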
Consequences
- An active completion is no longer severed by a routine swap; the latency the drain adds to a swap for the new model is bounded by min(longest in-flight request, drain_timeout).
- Same-model concurrency is unchanged (leases are counted, not serialized), and ensure_loaded keeps its exact exception contract (UnknownModel / LoadError / BackendError). The OOM guard still runs before the drain, so a doomed swap never waits.
- The severed-laggard path still exists, by design, at the drain_timeout boundary; the handler's existing "backend unreachable → 503" mapping covers a request that loses the race, and the timeout event is logged with the count.
- Follow-up (out of scope here): swap debounce/hysteresis for alternating-model traffic, and surfacing drain waits in /healthz for operator visibility.
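The drain bound in the first consequence can be made concrete with a small calculation. The helper below is hypothetical (drain_wait is not part of the multiplexer); it only restates min(longest in-flight request, drain_timeout) for a list of remaining relay durations in seconds.

```python
def drain_wait(in_flight_remaining, drain_timeout):
    """Seconds a swap spends draining, given the remaining duration of each
    in-flight relay on the old resident. Hypothetical helper for illustration."""
    if not in_flight_remaining:
        return 0.0  # nothing in flight: the swap proceeds immediately
    return min(max(in_flight_remaining), drain_timeout)


# Two relays with 1.5 s and 4 s left: the swap waits for the slower one.
short_case = drain_wait([1.5, 4.0], drain_timeout=30.0)
# A hung client (never finishes): the wait is capped at the timeout.
hung_case = drain_wait([float("inf")], drain_timeout=30.0)
# drain_timeout 0 corresponds to the old swap-immediately behaviour.
immediate_case = drain_wait([4.0], drain_timeout=0.0)
```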