[perf] Prefill speed optimization for long contexts #34

Open
opened 2026-05-08 21:40:24 +02:00 by sleepy · 1 comment
Owner

Current State

With QuantizedKVCache (q4_0) enabled, chunked prefill achieves ~200 tok/s.

Observed Behavior

  • 32K tokens: ~50s total (10s per 2K chunk)
  • Cache hits work correctly and reconstruct fast (~200ms)
  • Remaining token prefill after cache hit is still chunk-by-chunk

Potential Optimizations

  1. Larger chunk sizes: Currently capped at 2048 for boundary snapshots. Could increase for models with uniform attention layers.
  2. Overlap prefill with storage: Currently waits for cache storage between requests.
  3. Reduce boundary snapshot overhead: 5-10 snapshots per request add eval/sync cost.
  4. SpecPrefill: Sparse attention prefill for MoE models (already partially implemented but not enabled for Qwen3.6).
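For item 2, one possible shape is to hand cache storage to a background worker so the next request's prefill does not block on it. The sketch below is only an illustration under assumptions: `BackgroundCacheWriter` and `store_fn` are hypothetical names, not existing code in this repo, and it assumes the state passed to `submit` is not mutated while the write is in flight.

```python
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Any, Callable, List


class BackgroundCacheWriter:
    """Runs cache writes on one worker thread so callers do not block.

    Hypothetical helper: store_fn(key, state) stands in for whatever
    persists a finished request's KV cache.
    """

    def __init__(self, store_fn: Callable[[str, Any], None]) -> None:
        # A single worker keeps writes in submission order.
        self._pool = ThreadPoolExecutor(max_workers=1)
        self._store_fn = store_fn
        self._pending: List[Future] = []

    def submit(self, key: str, state: Any) -> None:
        """Queue a write and return immediately."""
        self._pending.append(self._pool.submit(self._store_fn, key, state))

    def flush(self) -> None:
        """Wait for all queued writes; re-raises the first storage error."""
        pending, self._pending = self._pending, []
        for future in pending:
            future.result()

    def close(self) -> None:
        """Flush outstanding writes and stop the worker thread."""
        self.flush()
        self._pool.shutdown(wait=True)
```

A request loop would call `submit` after each request finishes and `flush` only where a later step actually depends on the stored entry (for example, before a cache lookup for the same prefix).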

Benchmarks Needed

  • Profile chunk size vs memory tradeoff
  • Measure boundary snapshot overhead
  • Compare with llama.cpp prefill speed on same hardware

Acceptance Criteria

  • 256K context prefill under 5 minutes (currently ~15+ min projected)
  • No memory regression at max context
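To make the target concrete, the arithmetic behind the first criterion can be sketched as below. It assumes "256K" means 262,144 tokens and takes the ~200 tok/s figure from Current State at face value. Under those assumptions 256K takes roughly 22 minutes today, and the 5-minute budget needs about 874 tok/s, around 4.4x the current rate.

```python
import math

# Assumption: "256K" is read as 256 * 1024 tokens.
CONTEXT_256K = 256 * 1024


def projected_seconds(context_tokens: int, tokens_per_second: float) -> float:
    """Prefill time at a constant throughput."""
    return context_tokens / tokens_per_second


def required_throughput(context_tokens: int, budget_seconds: float) -> float:
    """Minimum sustained tok/s needed to finish within the budget."""
    return context_tokens / budget_seconds


def chunk_count(context_tokens: int, chunk_size: int) -> int:
    """Number of prefill chunks, counting a final partial chunk."""
    return math.ceil(context_tokens / chunk_size)


current_minutes = projected_seconds(CONTEXT_256K, 200) / 60
target_tok_per_s = required_throughput(CONTEXT_256K, 5 * 60)
chunks_at_2k = chunk_count(CONTEXT_256K, 2048)
```

The model assumes throughput stays flat as the context grows. In practice attention cost rises with context length, so measured numbers at 128K and 256K may be worse than this projection.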
Author
Owner

Benchmark data is needed before implementation. Before tackling this, we need to:

  1. Profile current prefill at 32K and 128K with perf or timing instrumentation
  2. Measure boundary snapshot overhead (how much time per snapshot)
  3. Benchmark llama.cpp prefill speed on same hardware for comparison
  4. Profile chunk size vs memory tradeoff (128, 256, 512, 1024, 2048, 4096)

Without this data, any implementation would be shooting in the dark.
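A minimal timing harness for items 1, 2 and 4 could look like the sketch below. `prefill_chunk` and `take_snapshot` are hypothetical callables standing in for the real engine hooks. If the engine evaluates lazily, each callable would need to force evaluation before returning, otherwise the timings are meaningless. Memory is not measured here and would need a separate probe.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Callable, Dict, Iterable


class StageTimer:
    """Accumulates wall-clock time and call counts per named stage."""

    def __init__(self) -> None:
        self.totals = defaultdict(float)
        self.counts = defaultdict(int)

    @contextmanager
    def measure(self, stage: str):
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[stage] += time.perf_counter() - start
            self.counts[stage] += 1

    def mean(self, stage: str) -> float:
        count = self.counts[stage]
        return self.totals[stage] / count if count else 0.0


def sweep_chunk_sizes(
    num_tokens: int,
    chunk_sizes: Iterable[int],
    prefill_chunk: Callable[[int, int], None],  # (start, length), hypothetical
    take_snapshot: Callable[[int], None],       # (boundary), hypothetical
) -> Dict[int, dict]:
    """Time prefill and snapshot stages separately for each chunk size."""
    results = {}
    for chunk_size in chunk_sizes:
        timer = StageTimer()
        for start in range(0, num_tokens, chunk_size):
            length = min(chunk_size, num_tokens - start)
            with timer.measure("prefill"):
                prefill_chunk(start, length)
            with timer.measure("snapshot"):
                take_snapshot(start + length)
        total = timer.totals["prefill"] + timer.totals["snapshot"]
        results[chunk_size] = {
            "chunks": timer.counts["prefill"],
            "prefill_s": timer.totals["prefill"],
            "snapshot_s_mean": timer.mean("snapshot"),
            "tok_per_s": num_tokens / total if total > 0 else float("inf"),
        }
    return results
```

Running this with `chunk_sizes=(128, 256, 512, 1024, 2048, 4096)` at 32K and 128K would give per-snapshot cost and throughput per chunk size in one table, which covers most of the list above apart from the llama.cpp comparison.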

Reference
sleepy/omlx#34