[HIGH] Prefix cache reconstruction always dequantizes to fp16, wasting 4x memory on hits #70

Closed
opened 2026-05-20 01:45:57 +02:00 by sleepy · 0 comments
Owner

Severity: HIGH - Exact prefix cache hits consume 4x more memory than necessary.

Location: omlx/cache/type_handlers.py, QuantizedKVCacheHandler.reconstruct_cache(), lines 450-510

Problem: reconstruct_cache() always creates a KVCache with fp16 keys/values by dequantizing stored Q4 data. For exact prefix hits, _apply_quantized_kv is never called, so the cache stays fp16 for the entire generation.

Impact: A 32k-token prefix hit that should consume ~4 GB instead consumes ~16 GB. Double quantization error: quantize (prefill #1) -> dequantize (SSD load) -> no re-quantization on exact hits.
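
A rough check of these figures, as a sketch only: the model shape below (32 layers, 32 KV heads, head dimension 128) is an assumption chosen because it reproduces the ~16 GB number, not something stated in this issue. The group size of 64 with one fp16 scale and one fp16 bias per group is likewise assumed.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_element):
    """Bytes needed to hold keys and values for `tokens` positions."""
    # Factor 2 covers keys plus values.
    elements = 2 * layers * kv_heads * head_dim * tokens
    return elements * bits_per_element / 8


GIB = 1024 ** 3

# Hypothetical model shape (assumption, not taken from the issue).
shape = dict(tokens=32 * 1024, layers=32, kv_heads=32, head_dim=128)

fp16_gib = kv_cache_bytes(**shape, bits_per_element=16) / GIB
q4_gib = kv_cache_bytes(**shape, bits_per_element=4) / GIB

# Assumed overhead: one fp16 scale and one fp16 bias per group of 64 elements,
# i.e. 32 extra bits per 64 elements = 0.5 bits per element.
q4_with_scales_gib = kv_cache_bytes(**shape, bits_per_element=4 + 32 / 64) / GIB

print(f"fp16: {fp16_gib} GiB, Q4: {q4_gib} GiB, Q4 + scales: {q4_with_scales_gib} GiB")
```

Under these assumptions the payload ratio is exactly 4x (16 bits vs 4 bits per element); with per-group scales and biases included, the real saving would be somewhat below 4x.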

Fix: reconstruct_cache() should return StreamQuantKVCache when metadata indicates quantized storage.
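
A minimal sketch of what the proposed dispatch could look like. The two classes here are stand-ins: the real `KVCache` and `StreamQuantKVCache` constructors are not shown in this issue, and the metadata keys `quantized`, `bits`, and `group_size` are hypothetical names used only for illustration.

```python
from dataclasses import dataclass
from typing import Any


# Stand-in for the real fp16 cache class (constructor is an assumption).
@dataclass
class KVCache:
    keys: Any
    values: Any


# Stand-in for the real quantized cache class (constructor is an assumption).
@dataclass
class StreamQuantKVCache:
    keys: Any        # packed quantized keys, kept exactly as stored
    values: Any      # packed quantized values, kept exactly as stored
    bits: int
    group_size: int


def reconstruct_cache(state, meta):
    """Rebuild a cache from stored state, preserving quantization if present.

    `meta` is a dict with hypothetical keys "quantized", "bits", "group_size".
    """
    keys, values = state
    bits = meta.get("bits")
    if meta.get("quantized") and bits is not None:
        # Stored data is already quantized: wrap it without dequantizing,
        # so an exact prefix hit keeps the quantized memory footprint.
        return StreamQuantKVCache(keys, values, bits, meta.get("group_size", 64))
    # No quantization metadata: fall back to a plain cache.
    return KVCache(keys, values)


hit = reconstruct_cache((b"\x12", b"\x34"), {"quantized": True, "bits": 4, "group_size": 64})
plain = reconstruct_cache(("k", "v"), {})
```

The point of the sketch is only the branch on metadata: the quantized path never materializes fp16 tensors, so it does not depend on `_apply_quantized_kv` being called later.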

Reference
sleepy/omlx#70