[cache] Preserve Q4 quantization in prefix cache reconstruction (#70) #80

Merged
sleepy merged 1 commit from fix/70-reconstruct-quantized-cache into main 2026-05-20 02:25:07 +02:00

Fixes: #70

Problem: `reconstruct_cache()` always dequantized to fp16, so an exact prefix cache hit held roughly 4x the memory of the stored Q4 data.

Fix: When `state` contains quantized tuples and `meta_state` has valid `bits`/`group_size`, return a `StreamQuantKVCache` directly. Fall back to fp16 only when the data is already fp16 or the metadata is missing.
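
The decision described in the fix can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the two cache classes are stand-in dataclasses, the dict layout of `meta_state`, the helper names `_is_quantized` and `_parse_quant_meta`, and the caller-supplied `dequantize` callable are all assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class KVCache:
    """Stand-in for the fp16 cache type (hypothetical fields)."""
    keys: Any
    values: Any


@dataclass
class StreamQuantKVCache:
    """Stand-in for the quantized cache type (hypothetical fields)."""
    keys: tuple
    values: tuple
    bits: int
    group_size: int


def _is_quantized(x: Any) -> bool:
    # Quantized data is stored as a 3-tuple: (weights, scales, biases).
    return isinstance(x, tuple) and len(x) == 3


def _parse_quant_meta(meta_state: Any) -> Optional[Tuple[int, int]]:
    # Assumed layout: a dict with "bits" and "group_size" entries.
    if not isinstance(meta_state, dict):
        return None
    try:
        bits = int(meta_state["bits"])
        group_size = int(meta_state["group_size"])
    except (KeyError, TypeError, ValueError):
        return None
    if bits <= 0 or group_size <= 0:
        return None
    return bits, group_size


def reconstruct_cache(state, meta_state, dequantize: Callable[[tuple], Any]):
    keys, values = state
    quantized = _is_quantized(keys) and _is_quantized(values)
    params = _parse_quant_meta(meta_state)

    if quantized and params is not None:
        # Keep the data quantized: no fp16 copy is materialized.
        bits, group_size = params
        return StreamQuantKVCache(keys, values, bits, group_size)

    if quantized:
        # Metadata missing or invalid: fall back to fp16.
        return KVCache(dequantize(keys), dequantize(values))

    # Data is already fp16.
    return KVCache(keys, values)
```

Only the first branch is new behavior in this sketch; the other two correspond to the fp16 fallback cases named in the fix.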

Results: 80 passed, 3 deselected

When the stored data is quantized (3-tuple of weights, scales, biases)
and meta_state contains quantization params, return StreamQuantKVCache
directly instead of dequantizing to fp16 KVCache. This avoids 4x memory
waste on prefix cache hits where _apply_quantized_kv() is never called.

Fixes #70
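
The 4x figure is the ratio of payload widths (16 bits versus 4 bits per element). As a rough check, the sketch below also counts the per-group scales and biases from the 3-tuple. It assumes one fp16 scale and one fp16 bias per group and a group size of 64; neither value is stated in this PR, so treat the result as an estimate only.

```python
def bits_per_element(bits: int, group_size: int,
                     scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective storage per element, including per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


FP16_BITS = 16

# Assumed parameters: 4-bit weights, group size 64 (hypothetical).
q4_bits = bits_per_element(bits=4, group_size=64)
ratio = FP16_BITS / q4_bits

print(f"Q4 effective bits per element: {q4_bits}")
print(f"fp16 / Q4 memory ratio: {ratio:.2f}x")
```

Under these assumptions the effective saving is somewhat below the nominal 4x, because the scales and biases add overhead to every group.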
sleepy force-pushed fix/70-reconstruct-quantized-cache from 934e2ed65f to 58fc048fdb 2026-05-20 02:24:53 +02:00 Compare
sleepy merged commit b2142e501c into main 2026-05-20 02:25:07 +02:00
Reference
sleepy/omlx!80