feature: continuous batching for concurrent request serving #6

Open
opened 2026-06-23 09:55:16 +02:00 by sleepy · 0 comments
Owner

The ds4 server currently serializes requests through an instance lock: it runs one decode at a time, with no batched (M>1) decode path. Aggregate throughput is therefore capped at the single-request decode rate (t/s), regardless of load.

Continuous batching (à la vLLM) would decode tokens for multiple concurrent requests in a single kernel launch (batch size M = number of active requests). This is the path that makes the NVFP4 hardware advantage real: at M≥4-8, IQ2_XXS hits its 58.6 GB/s dequant-compute ceiling while NVFP4 stays bandwidth-bound at 140 GB/s, so NVFP4 wins by about 1.1× net (more with mma.mxf4).

At M=1 (the current state), NVFP4 is ~2× slower than IQ2_XXS on the expert path because it moves 2.18× more bytes at similar bandwidth. That slowdown is occupancy-limited, not format-limited. The format choice only matters at scale.
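
For reference, the 1.1× and ~2× figures can be reproduced from the numbers quoted above with a simple bytes-over-rate model. This is only a back-of-the-envelope sketch: the function name is made up for illustration, and treating "similar bandwidth" at M=1 as exactly equal rates is an assumption, not a measurement.

```python
# Figures quoted in this issue.
IQ2_XXS_CEILING_GBPS = 58.6   # IQ2_XXS dequant-compute ceiling at M>=4-8
NVFP4_BANDWIDTH_GBPS = 140.0  # NVFP4 stays bandwidth-bound at this rate
NVFP4_BYTES_RATIO = 2.18      # NVFP4 moves 2.18x the bytes of IQ2_XXS


def nvfp4_speedup(iq2_rate_gbps: float, nvfp4_rate_gbps: float,
                  bytes_ratio: float = NVFP4_BYTES_RATIO) -> float:
    """Return time(IQ2_XXS) / time(NVFP4) for one pass over the expert weights.

    Time is modelled as bytes / effective rate, with IQ2_XXS bytes normalised to 1.
    """
    iq2_time = 1.0 / iq2_rate_gbps
    nvfp4_time = bytes_ratio / nvfp4_rate_gbps
    return iq2_time / nvfp4_time


# Batched case: IQ2_XXS is pinned at its dequant ceiling, NVFP4 at bandwidth.
batched = nvfp4_speedup(IQ2_XXS_CEILING_GBPS, NVFP4_BANDWIDTH_GBPS)

# M=1 case. Assumption: "similar bandwidth" is modelled as equal rates,
# so only the byte ratio matters and the absolute rate cancels out.
single = nvfp4_speedup(NVFP4_BANDWIDTH_GBPS, NVFP4_BANDWIDTH_GBPS)

print(f"M>=4-8: NVFP4 / IQ2_XXS speed ratio ~{batched:.2f}x")
print(f"M=1:    NVFP4 / IQ2_XXS speed ratio ~{single:.2f}x")
```

Under this model the batched ratio is 140 / (2.18 × 58.6), which rounds to 1.1×, and the M=1 ratio is 1 / 2.18, i.e. NVFP4 roughly 2× slower.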

Scope

This is a large architectural change (session scheduler, paged KV, batched decode kernels). The batched tile kernels exist for prefill; a batched decode path needs wiring. Out of scope for the decode-opt branch but worth tracking as the serving-scale lever.
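
To make the scheduler piece concrete, here is a minimal sketch of a continuous-batching loop. It is not the ds4 design: `Request`, `run_continuous_batching`, and `fake_decode_batch` are hypothetical names, the decode call is a stand-in for the batched decode kernel, and paged KV is left out entirely. It only shows the core idea that requests join and leave the batch between decode steps instead of waiting for the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable, Deque, List


@dataclass
class Request:
    """One in-flight completion request (hypothetical shape)."""
    req_id: int
    max_new_tokens: int
    tokens: List[int] = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.tokens) >= self.max_new_tokens


def run_continuous_batching(
    waiting: Deque[Request],
    decode_batch: Callable[[List[Request]], List[int]],
    max_batch: int,
) -> List[int]:
    """Run decode steps until every request finishes; return M for each step."""
    if max_batch < 1:
        raise ValueError("max_batch must be at least 1")
    active: List[Request] = []
    batch_sizes: List[int] = []
    while waiting or active:
        # Admit waiting requests into free slots before every step,
        # not only once the whole batch has drained.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One batched pass yields one new token per active request.
        new_tokens = decode_batch(active)
        for req, tok in zip(active, new_tokens):
            req.tokens.append(tok)
        batch_sizes.append(len(active))
        # Retire finished requests so their slots free up immediately.
        active = [r for r in active if not r.done]
    return batch_sizes


def fake_decode_batch(batch: List[Request]) -> List[int]:
    """Stand-in for the batched decode kernel: one dummy token per request."""
    return [len(r.tokens) for r in batch]


reqs = [Request(0, 3), Request(1, 1), Request(2, 2), Request(3, 2)]
sizes = run_continuous_batching(deque(reqs), fake_decode_batch, max_batch=2)
print(sizes)  # batch size M used at each decode step
```

In this toy run, request 2 takes over the slot that request 1 frees after the first step, which is the behaviour the instance lock currently rules out.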

Acceptance

  • Multiple concurrent /v1/chat/completions requests decode in a single batched pass
  • Per-request latency stays bounded; aggregate throughput scales with M
  • NVFP4 throughput at M≥4 exceeds IQ2_XXS throughput at M≥4
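
One possible way to turn these criteria into a mechanical check over benchmark results is sketched below. Everything here is an assumption for illustration: the function name, the input shapes, reading "scales with M" as strictly increasing throughput, and the explicit latency bound (the issue does not define one). The sample values are placeholders, not measurements.

```python
from typing import Dict, List


def check_acceptance(
    batch_sizes: List[int],        # M observed at each decode step
    latency_ms: Dict[int, float],  # per-request latency, keyed by M
    nvfp4_tps: Dict[int, float],   # aggregate NVFP4 tokens/s, keyed by M
    iq2_tps: Dict[int, float],     # aggregate IQ2_XXS tokens/s, keyed by M
    latency_bound_ms: float,       # assumed bound; not specified in the issue
) -> Dict[str, bool]:
    """Evaluate the acceptance bullets against measured values."""
    ms = sorted(nvfp4_tps)
    large = [m for m in ms if m >= 4]
    return {
        # Some decode step served more than one request.
        "batched_pass": max(batch_sizes, default=0) > 1,
        # Every measured latency stays under the chosen bound.
        "latency_bounded": all(v <= latency_bound_ms for v in latency_ms.values()),
        # Assumption: "scales with M" means strictly increasing throughput.
        "throughput_scales": all(nvfp4_tps[a] < nvfp4_tps[b]
                                 for a, b in zip(ms, ms[1:])),
        # NVFP4 beats IQ2_XXS at every measured M >= 4 (and one exists).
        "nvfp4_beats_iq2": bool(large) and all(nvfp4_tps[m] > iq2_tps[m]
                                               for m in large),
    }


# Placeholder inputs only, chosen to exercise the checks.
result = check_acceptance(
    batch_sizes=[1, 4, 8],
    latency_ms={1: 100.0, 4: 110.0, 8: 125.0},
    nvfp4_tps={1: 10.0, 4: 36.0, 8: 64.0},
    iq2_tps={1: 20.0, 4: 33.0, 8: 58.0},
    latency_bound_ms=200.0,
)
print(result)
```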
Reference
sleepy/ds4-nvfp4-spark#6