[HIGH] Prefix cache reconstruction always dequantizes to fp16, wasting 4x memory on hits #70

Closed

opened 2026-05-20 01:45:57 +02:00 by sleepy · 0 comments

sleepy commented

2026-05-20 01:45:57 +02:00

Owner

Severity: HIGH - Exact prefix cache hits consume 4x the memory they need.

Location: `omlx/cache/type_handlers.py`, `QuantizedKVCacheHandler.reconstruct_cache()`, lines 450-510

Problem: `reconstruct_cache()` always dequantizes the stored Q4 data and builds a `KVCache` with fp16 keys/values. On an exact prefix hit, `_apply_quantized_kv` is never called, so the cache stays fp16 for the entire generation.

Impact: A 32k prefix hit that should consume ~4GB instead consumes ~16GB. Double quantization error: quantize (prefill #1) -> dequantize (SSD load) -> no re-quantization on exact hits.
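
The 4x factor comes from storage width alone: 16 bits per element for fp16 versus 4 bits for Q4. Below is a rough sketch of that arithmetic. The model dimensions are hypothetical (the issue does not state them) and were picked only so the totals land near the figures quoted above. The per-group scales and biases that a real quantized cache also stores are ignored, so the real ratio would be somewhat below 4.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits):
    """Approximate bytes for keys plus values across all layers."""
    elements = 2 * layers * kv_heads * head_dim * tokens  # 2 = keys + values
    return elements * bits // 8


# Hypothetical dimensions, chosen only for illustration.
dims = dict(tokens=32 * 1024, layers=64, kv_heads=16, head_dim=128)

fp16_bytes = kv_cache_bytes(bits=16, **dims)  # cache left dequantized
q4_bytes = kv_cache_bytes(bits=4, **dims)     # cache kept in Q4
ratio = fp16_bytes / q4_bytes

print(f"fp16: {fp16_bytes / 1024**3:.0f} GiB, "
      f"Q4: {q4_bytes / 1024**3:.0f} GiB, ratio: {ratio:.1f}x")
```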

Fix: `reconstruct_cache()` should return `StreamQuantKVCache` when metadata indicates quantized storage.
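
A minimal sketch of the proposed branching, not the actual patch. The two cache classes here are stand-in dataclasses, `reconstruct_cache` is written as a free function rather than a handler method, and the metadata keys (`quantized`, `bits`, `group_size`) and the shape of `stored` are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Any


@dataclass
class KVCache:
    """Stand-in for the fp16 cache type (not the real class)."""
    keys: Any
    values: Any


@dataclass
class StreamQuantKVCache:
    """Stand-in for the quantized cache type (not the real class)."""
    keys: Any
    values: Any
    bits: int
    group_size: int


def reconstruct_cache(stored: dict, metadata: dict):
    """Rebuild a cache from stored tensors, keeping quantized data packed."""
    if metadata.get("quantized", False):
        # Pass the packed data through unchanged: no dequantize step,
        # so an exact prefix hit never inflates to fp16.
        return StreamQuantKVCache(
            keys=stored["keys"],
            values=stored["values"],
            bits=metadata["bits"],
            group_size=metadata["group_size"],
        )
    # Entries stored unquantized keep the existing fp16 path.
    return KVCache(keys=stored["keys"], values=stored["values"])


# Hypothetical usage with placeholder payloads instead of real tensors.
hit = reconstruct_cache(
    {"keys": b"packed-k", "values": b"packed-v"},
    {"quantized": True, "bits": 4, "group_size": 64},
)
plain = reconstruct_cache({"keys": [0.0], "values": [0.0]}, {})
```

Branching on stored metadata rather than on the current runtime setting keeps the reconstructed cache in the same representation it was saved in, which is what the exact-hit path needs since nothing re-quantizes it later.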

sleepy referenced this issue from a commit

2026-05-20 02:02:47 +02:00

fix: return StreamQuantKVCache directly from reconstruct_cache() for quantized data

sleepy referenced this issue

2026-05-20 02:03:11 +02:00

[cache] Preserve Q4 quantization in prefix cache reconstruction (#70) #80

sleepy referenced this issue from a commit

2026-05-20 02:24:53 +02:00

fix: return StreamQuantKVCache directly from reconstruct_cache() for quantized data

sleepy referenced this issue from a commit

2026-05-20 02:25:07 +02:00

[cache] Preserve Q4 quantization in prefix cache reconstruction (#70) (#80)