[HIGH] Prefix cache reconstruction always dequantizes to fp16, wasting 4x memory on hits #70
Severity: HIGH - Exact prefix cache hits consume 4x more memory than necessary.
Location:
`omlx/cache/type_handlers.py`, `QuantizedKVCacheHandler.reconstruct_cache()`, lines 450-510
Problem:
`reconstruct_cache()` always creates a `KVCache` with fp16 keys/values by dequantizing stored Q4 data. For exact prefix hits, `_apply_quantized_kv` is never called, so the cache stays fp16 for the entire generation.
Impact: A 32k prefix hit that should consume ~4GB now consumes ~16GB. Double quantization error: quantize (prefill #1) -> dequantize (SSD load) -> no re-quantization on exact hits.
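The ~4GB vs ~16GB figures can be reproduced with a back-of-the-envelope calculation. The sketch below is illustrative only: the model dimensions (32 layers, 32 KV heads, head dimension 128) are assumptions chosen because they match the numbers quoted above, not values taken from the affected deployment, and the helper `kv_cache_bytes` is hypothetical. It also ignores the per-group scales and biases that a real Q4 layout stores, so the true quantized footprint would be somewhat above 4 GiB.

```python
GIB = 1024 ** 3


def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_element):
    """Approximate KV cache size in bytes (hypothetical helper).

    Counts one key and one value vector per token, per layer, per KV head.
    Quantization metadata (scales, biases) is deliberately not included.
    """
    elements = 2 * layers * kv_heads * head_dim * tokens  # keys + values
    return elements * bits_per_element // 8


# Assumed model shape, NOT taken from the issue.
shape = dict(tokens=32 * 1024, layers=32, kv_heads=32, head_dim=128)

fp16_bytes = kv_cache_bytes(**shape, bits_per_element=16)
q4_bytes = kv_cache_bytes(**shape, bits_per_element=4)

print(f"fp16: {fp16_bytes / GIB:.1f} GiB")
print(f"Q4:   {q4_bytes / GIB:.1f} GiB")
print(f"ratio: {fp16_bytes / q4_bytes:.0f}x")
```

Under these assumptions the ratio is exactly the 16-bit to 4-bit storage ratio, which is where the 4x in the title comes from.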
Fix:
`reconstruct_cache()` should return `StreamQuantKVCache` when metadata indicates quantized storage.
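A minimal sketch of what that dispatch could look like, with loud caveats: the two classes below are simplified stand-ins, not the real `KVCache` and `StreamQuantKVCache` from the codebase, and the function signature plus the metadata keys `quantized`, `bits`, and `group_size` are assumptions made for illustration. The only point is the branch on metadata, so that quantized data is passed through without being dequantized.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass
class KVCache:
    """Stand-in for the fp16 cache type (not the real class)."""
    keys: Any
    values: Any


@dataclass
class StreamQuantKVCache:
    """Stand-in for the quantized cache type (not the real class)."""
    keys: Any
    values: Any
    bits: int
    group_size: int


def reconstruct_cache(stored_keys: Any, stored_values: Any,
                      metadata: Mapping[str, Any]):
    """Rebuild a cache from stored tensors without changing their precision.

    The metadata keys used here are hypothetical.
    """
    if metadata.get("quantized", False):
        # Keep the stored Q4 payload as-is instead of dequantizing to fp16.
        return StreamQuantKVCache(
            keys=stored_keys,
            values=stored_values,
            bits=metadata.get("bits", 4),
            group_size=metadata.get("group_size", 64),
        )
    # Unquantized storage: the existing fp16 path is unchanged.
    return KVCache(keys=stored_keys, values=stored_values)


# Usage with placeholder payloads in place of real tensors.
hit = reconstruct_cache("q4-keys", "q4-values", {"quantized": True, "bits": 4})
miss = reconstruct_cache("fp16-keys", "fp16-values", {})
```

Returning the quantized type directly would also avoid the dequantize step on exact hits, so the stored values are not converted to fp16 and left there for the rest of generation.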