[perf] Prefill speed optimization for long contexts #34

Open

opened 2026-05-08 21:40:24 +02:00 by sleepy · 1 comment

sleepy commented

2026-05-08 21:40:24 +02:00

Owner

## Current State

With QuantizedKVCache (q4_0) enabled, chunked prefill achieves ~200 tok/s.

## Observed Behavior

- 32K tokens: ~50s total (10s per 2K chunk)
- Cache hits work correctly and reconstruct quickly (~200ms)
- After a cache hit, the remaining tokens are still prefilled chunk by chunk

## Potential Optimizations

1. **Larger chunk sizes**: Chunks are currently capped at 2048 tokens for boundary snapshots. The cap could be raised for models with uniform attention layers.
2. **Overlap prefill with storage**: Prefill currently waits for cache storage to finish between requests.
3. **Reduce boundary snapshot overhead**: The 5-10 snapshots per request add eval/sync cost.
4. **SpecPrefill**: Sparse attention prefill for MoE models (already partially implemented, but not enabled for Qwen3.6).
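A minimal sketch of the overlap idea in item 2, assuming prefill and cache storage can be separated into two independent steps. `prefill_chunk` and `store_snapshot` are hypothetical stand-ins (simulated with `time.sleep`), not functions from this project. The sketch also overlaps storage with the next chunk inside a single request, whereas the issue describes waiting between requests; the same hand-off pattern would apply there.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def prefill_chunk(chunk):
    """Hypothetical stand-in for running the model over one chunk."""
    time.sleep(0.01)  # simulated compute
    return {"num_tokens": len(chunk)}


def store_snapshot(snapshot, store):
    """Hypothetical stand-in for writing a cache snapshot to storage."""
    time.sleep(0.01)  # simulated I/O
    store.append(snapshot)


def split_into_chunks(tokens, chunk_size):
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def prefill_with_overlapped_storage(tokens, chunk_size, store):
    """Compute chunk i+1 while the snapshot of chunk i is being stored."""
    pending = None
    # One worker keeps writes in submission order.
    with ThreadPoolExecutor(max_workers=1) as pool:
        for chunk in split_into_chunks(tokens, chunk_size):
            snapshot = prefill_chunk(chunk)
            if pending is not None:
                # Allow at most one write in flight and surface any storage error.
                pending.result()
            pending = pool.submit(store_snapshot, snapshot, store)
        if pending is not None:
            pending.result()
    return store


stored = prefill_with_overlapped_storage(list(range(5000)), 2048, [])
print([s["num_tokens"] for s in stored])
```

In a real implementation the snapshot would have to be fully materialized (or copied) before it is handed to the background writer, so that later chunks cannot mutate it. Whether that copy costs more than the wait it removes is one of the things the benchmarks should answer.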

## Benchmarks Needed

- Profile the chunk size vs memory tradeoff
- Measure boundary snapshot overhead
- Compare with llama.cpp prefill speed on the same hardware

## Acceptance Criteria

- 256K context prefill in under 5 minutes (currently projected at ~15+ min)
- No memory regression at max context
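To make the target concrete, the arithmetic behind the first criterion can be sketched as follows. This assumes 256K means 262,144 tokens and that throughput stays constant over the whole context, which is an idealization (it ignores snapshot overhead and any slowdown as the cache grows).

```python
def prefill_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to prefill num_tokens at a constant rate."""
    return num_tokens / tokens_per_second


def required_throughput(num_tokens: int, budget_seconds: float) -> float:
    """Constant rate needed to prefill num_tokens within the budget."""
    return num_tokens / budget_seconds


CONTEXT_256K = 256 * 1024       # assumption: 256K = 262,144 tokens
CURRENT_TOK_PER_S = 200.0       # figure quoted under Current State
BUDGET_SECONDS = 5 * 60         # acceptance criterion: under 5 minutes

current = prefill_seconds(CONTEXT_256K, CURRENT_TOK_PER_S)
needed = required_throughput(CONTEXT_256K, BUDGET_SECONDS)

print(f"at {CURRENT_TOK_PER_S:.0f} tok/s: {current / 60:.1f} min")
print(f"needed for 5 min: {needed:.0f} tok/s ({needed / CURRENT_TOK_PER_S:.1f}x)")
```

Under these assumptions the target corresponds to roughly 874 tok/s sustained, a bit more than 4x the ~200 tok/s quoted above.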

sleepy commented

2026-05-15 18:33:28 +02:00

Author

Owner

Needs benchmark data before implementation. Before tackling this, we need to:

1. Profile current prefill at 32K and 128K with `perf` or timing instrumentation
2. Measure boundary snapshot overhead (time per snapshot)
3. Benchmark llama.cpp prefill speed on the same hardware for comparison
4. Profile the chunk size vs memory tradeoff (128, 256, 512, 1024, 2048, 4096)

Without this data, any implementation would be shooting in the dark.
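One possible shape for the timing side of items 1 and 4 is sketched below. `dummy_prefill` is a placeholder to be replaced by the real per-chunk prefill call, and the harness measures wall-clock time only; memory would have to come from the framework's own counters, which are not shown here.

```python
import time

CHUNK_SIZES = [128, 256, 512, 1024, 2048, 4096]  # sizes listed in item 4


def dummy_prefill(chunk):
    """Placeholder for the real per-chunk prefill call (hypothetical)."""
    return sum(chunk)


def time_prefill(prefill_fn, tokens, chunk_size):
    """Run prefill_fn over tokens in chunks and report wall-clock timing."""
    start = time.perf_counter()
    num_chunks = 0
    for i in range(0, len(tokens), chunk_size):
        prefill_fn(tokens[i:i + chunk_size])
        num_chunks += 1
    elapsed = time.perf_counter() - start
    return {
        "chunk_size": chunk_size,
        "num_chunks": num_chunks,
        "seconds": elapsed,
        "tok_per_s": len(tokens) / elapsed if elapsed > 0 else float("inf"),
    }


def sweep(prefill_fn, tokens, chunk_sizes=CHUNK_SIZES):
    """Time the same token sequence once per chunk size."""
    return [time_prefill(prefill_fn, tokens, size) for size in chunk_sizes]


results = sweep(dummy_prefill, list(range(32 * 1024)))
for row in results:
    print(f"{row['chunk_size']:>5} tokens/chunk  "
          f"{row['num_chunks']:>4} chunks  {row['seconds']:.4f}s")
```

Running the same sweep once with boundary snapshots enabled and once with them disabled would give a rough per-snapshot cost for item 2, assuming snapshots can be toggled independently.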

Reference

sleepy/omlx#34
