[perf] Pre-allocated paged q4 KV cache to eliminate prefill memory spikes #51
Branch: feature/1-paged-q4-kv-cache (2 commits ahead of main). Also includes fix/rotation-q4-merge.
Pre-allocated paged KV cache that quantizes keys/values to q4_0 immediately on write. Keys use Hadamard rotation before quantization. Eliminates fp16 KV peak memory.
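A minimal sketch of the write path described above, in plain Python rather than MLX: rotate a key block with a normalized Hadamard transform, then quantize it to 4-bit codes with one scale per 32-value block. The function names are hypothetical, and the rounding rule (scale = signed max / -8, codes offset by 8) is an assumption about the q4_0 scheme. This is not the PagedQ4KVCache code from the branch.

```python
import math

BLOCK = 32  # values per q4_0 block


def hadamard(v):
    """Normalized fast Walsh-Hadamard transform; len(v) must be a power of two."""
    out = list(v)
    n = len(out)
    assert n > 0 and n & (n - 1) == 0
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = out[j], out[j + h]
                out[j], out[j + h] = a + b, a - b
        h *= 2
    s = 1.0 / math.sqrt(n)  # makes the transform orthonormal and self-inverse
    return [x * s for x in out]


def quantize_q4_0(block):
    """Quantize one block to (scale, list of 4-bit codes in 0..15)."""
    m = max(block, key=abs)  # signed value with the largest magnitude
    d = m / -8.0
    if d == 0.0:
        return 0.0, [8] * len(block)
    codes = [min(15, max(0, int(x / d + 8.5))) for x in block]
    return d, codes


def dequantize_q4_0(d, codes):
    """Invert quantize_q4_0 up to rounding error."""
    return [(q - 8) * d for q in codes]


# Keys are rotated before quantization; values would skip the rotation.
key = [0.1 * ((i * 5) % 11 - 5) for i in range(BLOCK)]
key[3] = 4.0  # a single outlier channel
rotated = hadamard(key)
scale, codes = quantize_q4_0(rotated)
restored = hadamard(dequantize_q4_0(scale, codes))  # undo the rotation
```

Because the normalized transform is its own inverse, applying `hadamard` again after dequantization recovers the key in its original basis, up to quantization error.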
Commits:
c92124f feat(cache): add PagedQ4KVCache with incremental q4_0 quantization
a200dd0 fix(mtp): auto-enable MTP speculative decoding for models with mtp_forward
Key files:
Acceptance criteria:
Related: #48 (Cache misses with q4 KV), #39 (Rewrite KV cache for memory efficiency)
Superseded by current main. Main uses mlx-lm QuantizedKVCache directly (via _apply_quantized_kv) with BatchQuantizedKVCache for batching, Hadamard rotation via omlx/patches/quantized_attention.py, and merge support in _patched_merge_caches. The PagedQ4KVCache approach in this branch was an alternative that is now unnecessary. Related issues #48 (cache misses) and #39 (memory efficiency) are both closed.
Rescoped: Pre-allocated paged q4 KV cache to eliminate prefill memory spikes
Problem
The current QuantizedKVCache (mlx-lm) grows dynamically during prefill: allocate fp16 buffer → write tokens → quantize in-place. This creates transient fp16 peaks, so at long contexts Q4 KV saves no peak memory compared with fp16.
The q4 active memory after cleanup is lower (~19.7 GB flat), but the fp16 intermediate held during the forward pass erases the savings.
Root Cause
KVCache.update_and_fetch() in mlx-lm stores fp16, then _apply_quantized_kv() converts after the fact. The fp16 KV exists simultaneously with the quantized copy during conversion.
Solution: Pre-allocated paged q4 KV
Pre-allocate the full KV cache in q4 format at model load (or at prefill start). Write tokens directly to q4 pages during prefill — never hold fp16 KV at all.
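To make the allocation pattern concrete, here is a rough sketch in plain Python of a fixed-capacity store that is sized once and quantizes each token as it is written. Everything in it is hypothetical: the class name PagedQ4Store, the page size of 16 tokens, the float32 scales, and the byte layout (adjacent values share a byte) are illustration choices, not the layout used by the branch, mlx-lm, or ggml.

```python
from array import array

BLOCK = 32        # values per quantization block
PAGE_TOKENS = 16  # tokens per page (hypothetical choice)


class PagedQ4Store:
    """Hypothetical pre-allocated 4-bit store for one head's keys or values."""

    def __init__(self, max_tokens, head_dim):
        assert head_dim % BLOCK == 0
        self.head_dim = head_dim
        self.blocks_per_token = head_dim // BLOCK
        num_pages = -(-max_tokens // PAGE_TOKENS)  # ceiling division
        self.capacity = num_pages * PAGE_TOKENS
        # All storage is allocated here, once; append() never grows it.
        self.codes = bytearray(self.capacity * head_dim // 2)  # two codes per byte
        self.scales = array("f", [0.0]) * (self.capacity * self.blocks_per_token)
        self.length = 0

    @staticmethod
    def _code(x, d):
        return 8 if d == 0.0 else min(15, max(0, int(x / d + 8.5)))

    def append(self, vec):
        """Quantize one token vector straight into its pre-allocated slot."""
        if self.length >= self.capacity:
            raise IndexError("cache is full")
        t = self.length
        for b in range(self.blocks_per_token):
            block = vec[b * BLOCK:(b + 1) * BLOCK]
            d = max(block, key=abs) / -8.0
            self.scales[t * self.blocks_per_token + b] = d
            base = (t * self.head_dim + b * BLOCK) // 2
            for i in range(0, BLOCK, 2):
                lo = self._code(block[i], d)
                hi = self._code(block[i + 1], d)
                self.codes[base + i // 2] = lo | (hi << 4)
        self.length += 1

    def read(self, t):
        """Dequantize token t back to a list of floats."""
        if not 0 <= t < self.length:
            raise IndexError("token index out of range")
        out = []
        for b in range(self.blocks_per_token):
            d = self.scales[t * self.blocks_per_token + b]
            base = (t * self.head_dim + b * BLOCK) // 2
            for i in range(BLOCK // 2):
                byte = self.codes[base + i]
                out.append(((byte & 0x0F) - 8) * d)
                out.append(((byte >> 4) - 8) * d)
        return out


store = PagedQ4Store(max_tokens=20, head_dim=32)  # rounds up to 2 pages
store.append([((i * 7) % 13 - 6) / 3.0 for i in range(32)])
```

The only point of the sketch is the allocation pattern: the buffers are sized up front, each token is quantized as it is written, and no full-precision copy of the cache is ever held.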
This is what llama.cpp does (attn-rot + paged q4_0):
Acceptance Criteria
Related
QuantizedKVCache (current approach, grows dynamically)
Title changed from "[feature] Paged q4 KV cache with Hadamard rotation" to "[perf] Pre-allocated paged q4 KV cache to eliminate prefill memory spikes"