[feature] MTP speculative decoding with concurrent streams #9
Problem:
MTP speculative decoding only works on the single-request fast path. When concurrent streams are active, requests fall back to the scheduler (the non-MTP batched path) and lose the ~1.5x speedup.
Current behavior:
Goal:
Investigate and implement MTP support for concurrent streams so multiple users can benefit from MTP speedup simultaneously.
Acceptance criteria:
Investigation Complete
Root cause: MTP and continuous batching have incompatible execution patterns. MTP needs 2-3 sequential forward passes with variable-length inputs, per-request rollback state, and a separate MTP KV cache. Batched generation uses 1 uniform forward pass per step with BatchKVCache. These cannot be unified without rewriting the entire pipeline.
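To make the mismatch concrete, here is a toy sketch of one draft-then-verify step. It is not the code in this repo: `speculative_step`, `draft_next`, and `target_next` are hypothetical names, and the verify phase calls the target once per position for readability, where a real implementation would typically verify all drafted positions in one forward pass. The point is that the number of tokens accepted per step varies per request, which is what a uniform one-token-per-step batch does not accommodate.

```python
from typing import Callable, List


def speculative_step(
    context: List[int],
    draft_next: Callable[[List[int]], int],   # hypothetical cheap MTP/draft head
    target_next: Callable[[List[int]], int],  # hypothetical main model (greedy)
    num_draft: int = 2,
) -> List[int]:
    """Return the tokens accepted in one draft-then-verify step (toy, greedy)."""
    # Draft phase: propose num_draft tokens one after another.
    drafted: List[int] = []
    for _ in range(num_draft):
        drafted.append(draft_next(context + drafted))

    # Verify phase: compare each drafted token with what the target would emit.
    accepted: List[int] = []
    for tok in drafted:
        expected = target_next(context + accepted)
        if tok != expected:
            # Rollback point: drop the remaining drafts, keep the correction.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # Every draft matched, so the verify pass also yields one extra token.
    accepted.append(target_next(context + accepted))
    return accepted
```

With `num_draft=2` a step returns between 1 and 3 tokens depending on where the first mismatch occurs, so two requests in the same batch would advance by different amounts and need separate rollback bookkeeping.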
Additional bug found: The current MTP fast path blocks the asyncio event loop (a synchronous for loop inside an async def), preventing all other request processing during generation.
Recommended approach: Hybrid switching
This matches real-world usage: MTP matters most for single-request interactive chat; concurrent scenarios prioritize throughput over per-request latency.
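A minimal sketch of the idea, assuming a policy of "MTP only when exactly one request is active" and using a thread pool to keep the synchronous generation loop off the event loop. The names `choose_mode`, `blocking_generate`, and `generate_off_loop` are illustrative, not identifiers from this codebase, and `time.sleep` stands in for a forward pass.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor


def choose_mode(num_active_requests: int) -> str:
    """Hypothetical policy: use MTP only when exactly one request is active."""
    return "mtp" if num_active_requests == 1 else "batched"


def blocking_generate(num_tokens: int) -> list:
    """Stand-in for a synchronous generation loop."""
    out = []
    for i in range(num_tokens):
        time.sleep(0.01)  # simulates one forward pass
        out.append(i)
    return out


async def generate_off_loop(executor: ThreadPoolExecutor, num_tokens: int) -> list:
    """Run the blocking loop on a worker thread so the event loop stays free."""
    loop = asyncio.get_running_loop()
    return await loop.run_in_executor(executor, blocking_generate, num_tokens)


async def demo():
    ticks = 0

    async def heartbeat():
        # Counts how often the event loop gets to run during generation.
        nonlocal ticks
        while True:
            await asyncio.sleep(0.005)
            ticks += 1

    with ThreadPoolExecutor(max_workers=1) as executor:
        hb = asyncio.create_task(heartbeat())
        tokens = await generate_off_loop(executor, 5)
        hb.cancel()
    return tokens, ticks
```

Running `asyncio.run(demo())` returns the generated tokens together with a non-zero tick count, showing that other coroutines kept running while generation was in progress. Calling `blocking_generate` directly inside an `async def` would instead leave the heartbeat starved until the loop finished.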
Key implementation steps:
batched.py, route all requests through scheduler
Scheduler._mtp_step(): single-request MTP loop on the executor thread
Request
Approaches considered and rejected:
Merged via squash (PR #11). Hybrid MTP switching implemented: single-request → MTP (~16.8 tok/s), concurrent → batched (~8.9 tok/s), auto-transitions between modes. Fix for Approach B tracked in #10.