[cache] Preserve Q4 quantization in prefix cache reconstruction (#70) #80

Merged

sleepy merged 1 commit from fix/70-reconstruct-quantized-cache into main

2026-05-20 02:25:07 +02:00

sleepy commented

2026-05-20 02:03:11 +02:00

Owner

Fixes: #70

Problem: reconstruct_cache() always dequantized to fp16, so exact prefix cache hits used roughly 4x the memory of the stored Q4 data.
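For reference, the 4x figure follows from the raw bit widths (16-bit fp16 vs 4-bit Q4). The sketch below is a back-of-the-envelope check, not code from this PR. It assumes group-wise affine quantization with one fp16 scale and one fp16 bias per group, and `group_size=64` is an illustrative value that the PR does not state.

```python
def bits_per_element(bits: int, group_size: int,
                     scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Effective storage cost per element, in bits.

    Assumption: each group of `group_size` elements shares one scale
    and one bias, stored at `scale_bits` and `bias_bits` respectively.
    """
    return bits + (scale_bits + bias_bits) / group_size


FP16_BITS = 16

# Nominal ratio, ignoring scales and biases: 16 / 4.
nominal_ratio = FP16_BITS / 4

# Ratio once per-group metadata is counted (group_size=64 is assumed).
q4_bits = bits_per_element(bits=4, group_size=64)
effective_ratio = FP16_BITS / q4_bits

print(f"nominal: {nominal_ratio:.2f}x, with metadata: {effective_ratio:.2f}x")
```

Under these assumptions the saving is somewhat below the nominal 4x once scales and biases are included, but it stays in the same range.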

Fix: When state contains quantized tuples and meta_state has valid bits/group_size, return StreamQuantKVCache directly. Only fall back to fp16 when data is already fp16 or metadata is missing.
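The decision described above can be sketched as follows. This is a minimal illustration, not the code in this PR: `QuantCache` and `FP16Cache` are hypothetical stand-ins for `StreamQuantKVCache` and the fp16 `KVCache`, `meta_state` is assumed to be dict-like with `bits` and `group_size` keys, and `to_fp16` is a hypothetical caller-supplied converter for the fallback path.

```python
from dataclasses import dataclass
from typing import Any, Callable, Optional, Tuple


@dataclass
class FP16Cache:
    """Hypothetical stand-in for the fp16 KVCache."""
    keys: Any
    values: Any


@dataclass
class QuantCache:
    """Hypothetical stand-in for StreamQuantKVCache."""
    keys: Tuple[Any, Any, Any]
    values: Tuple[Any, Any, Any]
    bits: int
    group_size: int


def is_quantized(entry: Any) -> bool:
    # Quantized data is stored as a 3-tuple of (weights, scales, biases).
    return isinstance(entry, tuple) and len(entry) == 3


def parse_quant_meta(meta_state: Any) -> Optional[Tuple[int, int]]:
    # Return (bits, group_size) if both are present and positive, else None.
    # The dict layout of meta_state is an assumption made for this sketch.
    try:
        bits = int(meta_state["bits"])
        group_size = int(meta_state["group_size"])
    except (KeyError, TypeError, ValueError):
        return None
    if bits <= 0 or group_size <= 0:
        return None
    return bits, group_size


def reconstruct_cache_sketch(state: Tuple[Any, Any], meta_state: Any,
                             to_fp16: Callable[[Any], Any]):
    keys, values = state
    params = parse_quant_meta(meta_state)
    if params is not None and is_quantized(keys) and is_quantized(values):
        # Quantized data with valid metadata: keep it quantized.
        bits, group_size = params
        return QuantCache(keys, values, bits, group_size)
    # Data is already fp16, or metadata is missing: fall back to fp16.
    return FP16Cache(to_fp16(keys), to_fp16(values))


# Usage with placeholder payloads instead of real arrays.
q = ("weights", "scales", "biases")
hit = reconstruct_cache_sketch((q, q), {"bits": 4, "group_size": 64},
                               to_fp16=lambda x: "fp16")
fallback = reconstruct_cache_sketch((q, q), None, to_fp16=lambda x: "fp16")
```

Here `hit` stays quantized, while `fallback` takes the fp16 path because no metadata is available.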

Results: 80 passed, 3 deselected


sleepy added 1 commit

2026-05-20 02:03:11 +02:00

fix: return StreamQuantKVCache directly from reconstruct_cache() for quantized data 934e2ed65f

When the stored data is quantized (3-tuple of weights, scales, biases)
and meta_state contains quantization params, return StreamQuantKVCache
directly instead of dequantizing to fp16 KVCache. This avoids 4x memory
waste on prefix cache hits where _apply_quantized_kv() is never called.

Fixes #70

sleepy force-pushed fix/70-reconstruct-quantized-cache from 934e2ed65f to 58fc048fdb

2026-05-20 02:24:53 +02:00


sleepy merged commit b2142e501c into main

2026-05-20 02:25:07 +02:00

sleepy referenced this pull request from a commit

2026-05-20 02:25:07 +02:00

[cache] Preserve Q4 quantization in prefix cache reconstruction (#70) (#80)