feature: continuous batching for concurrent request serving #6


Open

opened 2026-06-23 09:55:16 +02:00 by sleepy · 0 comments

sleepy commented

2026-06-23 09:55:16 +02:00

Owner

The ds4 server currently serializes requests through an instance lock: one decode at a time, with no batched (M>1) decode. This caps aggregate throughput at the single-request decode rate (t/s) regardless of load.

Continuous batching (à la vLLM) would decode one token for each of several concurrent requests in a single kernel launch (batch size M = number of active requests). This is the path that makes the NVFP4 hardware advantage real: once M reaches roughly 4 to 8, IQ2_XXS hits its 58.6 GB/s dequant-compute ceiling while NVFP4 stays bandwidth-bound at 140 GB/s. NVFP4 moves 2.18× more bytes, so the net gain is (140 / 58.6) / 2.18 ≈ 1.1× in NVFP4's favor (more with mma.mxf4).
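A minimal sketch of the scheduling loop this implies, in Python for brevity. Every name here (`Request`, `serve`, `fake_batched_decode`, `max_batch`) is hypothetical and not part of ds4; the decode step is a stand-in for the real batched kernel launch.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class Request:
    """Hypothetical per-request decode state (not a ds4 type)."""
    rid: int
    max_new_tokens: int
    tokens: list = field(default_factory=list)

    @property
    def done(self) -> bool:
        return len(self.tokens) >= self.max_new_tokens


def fake_batched_decode(batch):
    """Stand-in for one batched kernel launch: one token per active request."""
    return [r.rid * 1000 + len(r.tokens) for r in batch]


def serve(requests, max_batch=8, decode_step=fake_batched_decode):
    """Run continuous batching; return (finished requests, batch size per step)."""
    waiting = deque(requests)
    active, finished, batch_sizes = [], [], []
    while waiting or active:
        # Admit new requests at every step, not only when the batch drains.
        while waiting and len(active) < max_batch:
            active.append(waiting.popleft())
        # One launch decodes one token for every active request (M = len(active)).
        for req, tok in zip(active, decode_step(active)):
            req.tokens.append(tok)
        batch_sizes.append(len(active))
        # Retire finished requests immediately so their slots free up.
        finished.extend(r for r in active if r.done)
        active = [r for r in active if not r.done]
    return finished, batch_sizes
```

The point of the sketch is the contrast with the instance lock: admission and retirement happen per decode step, so a short request leaves the batch as soon as it finishes and a waiting one takes its slot without stalling the others.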

At M=1 (current), NVFP4 is ~2× slower than IQ2_XXS on the expert path because it moves 2.18× more bytes at similar bandwidth. That gap is occupancy-limited, not format-limited, so the format choice only matters at scale.
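The two regimes can be checked with a few lines of arithmetic. This is a sketch that assumes the quoted GB/s figures are effective weight-read rates and that time per token is simply bytes moved divided by that rate; `net_speedup` is a hypothetical helper, not something in the codebase.

```python
def net_speedup(bw_a_gbps, bw_b_gbps, bytes_ratio_a_over_b):
    """Throughput of format A relative to format B.

    Assumes each format streams weights at its own effective rate and that
    A moves `bytes_ratio_a_over_b` times as many bytes as B per token.
    """
    time_a = bytes_ratio_a_over_b / bw_a_gbps  # relative time per token, A
    time_b = 1.0 / bw_b_gbps                   # relative time per token, B
    return time_b / time_a


# Batched regime: NVFP4 at 140 GB/s vs IQ2_XXS capped at 58.6 GB/s.
batched = net_speedup(140.0, 58.6, 2.18)
# M=1 regime, assuming both formats reach the same effective bandwidth.
single = net_speedup(1.0, 1.0, 2.18)
print(f"batched: {batched:.2f}x, M=1: {single:.2f}x")
# prints "batched: 1.10x, M=1: 0.46x"
```

Under these assumptions the M=1 ratio of 0.46× corresponds to the "~2× slower" figure, and the batched ratio reproduces the 1.1× net win.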

## Scope

This is a large architectural change (session scheduler, paged KV, batched decode kernels). The batched tile kernels exist for prefill; a batched decode path needs wiring. Out of scope for the decode-opt branch but worth tracking as the serving-scale lever.
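To make the "paged KV" piece concrete, here is a rough sketch of a block-table allocator of the kind vLLM popularized. It is an illustration only: `PagedKVAllocator`, its methods, and the block size are assumptions, not an existing ds4 design.

```python
class PagedKVAllocator:
    """Hypothetical block-table allocator; names and sizes are illustrative."""

    def __init__(self, num_blocks, block_tokens=16):
        self.block_tokens = block_tokens
        self.free = list(range(num_blocks - 1, -1, -1))  # stack of free block ids
        self.tables = {}   # request id -> list of physical block ids
        self.lengths = {}  # request id -> tokens stored so far

    def append_token(self, rid):
        """Reserve a slot for one more token; return (block id, offset)."""
        table = self.tables.setdefault(rid, [])
        n = self.lengths.get(rid, 0)
        if n == len(table) * self.block_tokens:  # all owned blocks are full
            if not self.free:
                raise MemoryError("KV cache exhausted; scheduler must wait or preempt")
            table.append(self.free.pop())
        self.lengths[rid] = n + 1
        return table[n // self.block_tokens], n % self.block_tokens

    def release(self, rid):
        """Return all blocks of a finished request to the free pool."""
        self.free.extend(self.tables.pop(rid, []))
        self.lengths.pop(rid, None)
```

Fixed-size blocks let requests of different lengths share one KV pool without per-request contiguous reservations, which is what allows the scheduler to admit and retire requests at every decode step.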

## Acceptance

- Multiple concurrent /v1/chat/completions requests decode in a single batched pass
- Per-request latency stays bounded; aggregate throughput scales with M
- NVFP4 throughput at M≥4 exceeds IQ2_XXS throughput at M≥4


sleepy referenced this issue from a commit

2026-06-23 09:59:26 +02:00

decode-opt plan: record measured profile, revise priorities

sleepy referenced this issue from a commit

2026-06-23 09:59:26 +02:00

decode-opt #5: profile proves decode is CPU-dispatch-bound (98ms) not GPU (4ms)

sleepy referenced this issue from a commit

2026-06-23 09:59:26 +02:00

decode-opt #6: cooperative dequant-to-shared in turbo4 attention kernel