feature: continuous batching for concurrent request serving #6
The ds4 server currently serializes requests through an instance lock: one decode at a time, with no batched (M>1) decode. This caps aggregate throughput at the single-stream tokens/s rate regardless of load.
Continuous batching (à la vLLM) would decode one token for each of several concurrent requests in a single kernel launch (batch M = number of active requests). This is the path that makes the NVFP4 hardware advantage real: at M≥4-8, IQ2_XXS hits its 58.6 GB/s dequant-compute ceiling while NVFP4 stays bandwidth-bound at 140 GB/s → NVFP4 wins 1.1× net (more with mma.mxf4).
At M=1 (current), NVFP4 is ~2× slower than IQ2_XXS on the expert path because it moves 2.18× more bytes at similar bandwidth. That gap is occupancy-limited, not format-limited; the format choice only matters at scale.
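For reference, the 1.1× and ~2× figures can be reproduced from the numbers above with a simple bytes-over-throughput model. This is a rough sketch, not a measurement: it assumes both GB/s figures are effective weight-streaming rates over each format's own bytes, and the function name and the normalization of IQ2_XXS bytes to 1.0 are illustrative choices, not anything from the codebase.

```python
# Numbers quoted in this issue; everything else is an illustrative assumption.
IQ2_XXS_CEILING_GBPS = 58.6   # dequant-compute ceiling at M>=4-8
NVFP4_BANDWIDTH_GBPS = 140.0  # bandwidth-bound rate at M>=4-8
NVFP4_BYTES_RATIO = 2.18      # NVFP4 expert bytes relative to IQ2_XXS


def nvfp4_speedup(iq2_gbps: float, nvfp4_gbps: float,
                  bytes_ratio: float = NVFP4_BYTES_RATIO) -> float:
    """Return t_IQ2_XXS / t_NVFP4 for one pass over the expert weights.

    Time per pass is modeled as bytes / effective throughput, with IQ2_XXS
    bytes normalized to 1.0. Values above 1 mean NVFP4 is faster.
    """
    t_iq2 = 1.0 / iq2_gbps
    t_nvfp4 = bytes_ratio / nvfp4_gbps
    return t_iq2 / t_nvfp4


# M = 1: both formats stream at a similar rate (assumed equal here),
# so NVFP4 pays the full cost of its extra bytes.
m1 = nvfp4_speedup(100.0, 100.0)  # 100.0 is a placeholder; it cancels out

# M >= 4-8: IQ2_XXS is pinned at its dequant ceiling, NVFP4 keeps streaming.
batched = nvfp4_speedup(IQ2_XXS_CEILING_GBPS, NVFP4_BANDWIDTH_GBPS)

print(f"M=1:    NVFP4 is {1 / m1:.2f}x slower")
print(f"M>=4-8: NVFP4 is {batched:.2f}x faster")
```

Under these assumptions the model gives 2.18× slower at M=1 and 140 / (58.6 × 2.18) ≈ 1.10× faster once IQ2_XXS saturates, consistent with the figures quoted above.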
Scope
This is a large architectural change (session scheduler, paged KV, batched decode kernels). The batched tile kernels exist for prefill; a batched decode path needs wiring. Out of scope for the decode-opt branch but worth tracking as the serving-scale lever.
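To make the session-scheduler piece concrete, here is a toy sketch of the step loop continuous batching implies: admit waiting requests at every step boundary, run one batched decode for all active requests, and retire finished ones immediately. Every name here (`Request`, `serve`, `fake_batched_decode`, `max_batch`) is hypothetical, and the decode function is a stand-in, not the ds4 kernel path; paged KV is left out entirely.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """One in-flight generation request (hypothetical shape)."""
    rid: int
    max_new_tokens: int
    tokens: list = field(default_factory=list)


def fake_batched_decode(batch):
    """Stand-in for one batched decode launch: one new token per request."""
    return [len(r.tokens) for r in batch]


def serve(pending, max_batch=8, decode=fake_batched_decode):
    """Run decode steps until all requests finish.

    Returns (finished requests in completion order, batch size per step).
    """
    waiting = deque(pending)
    active, finished, batch_sizes = [], [], []
    while waiting or active:
        # Admit waiting requests into free slots at every step boundary.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One launch yields one token for each active request (M = len(active)).
        new_tokens = decode(active)
        batch_sizes.append(len(active))
        for req, tok in zip(active, new_tokens):
            req.tokens.append(tok)
        # Retire finished requests right away so their slots free up.
        finished += [r for r in active if len(r.tokens) >= r.max_new_tokens]
        active = [r for r in active if len(r.tokens) < r.max_new_tokens]
    return finished, batch_sizes


requests = [Request(rid=i, max_new_tokens=n) for i, n in enumerate([2, 5, 3])]
done, sizes = serve(requests, max_batch=2)
print([r.rid for r in done], sizes)
```

In this toy run the three requests need 10 tokens in total and finish in 5 launches at M=2, where the current instance lock would take 10 single-token decodes. The third request joins as soon as the first one retires, rather than waiting for the whole batch to drain.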
Acceptance