[fix] QuantizedKVCache prefix cache and batching fixes (#39 follow-up) #41

Merged

sleepy merged 5 commits from feature/39-llamacpp-kv-cache-rewrite into main

2026-05-09 15:11:03 +02:00

sleepy commented

2026-05-09 15:09:24 +02:00

Owner

Fixes for issues discovered after PR #40

1. Empty caches in _patched_merge_caches

Added a guard for an empty caches list, which previously raised IndexError
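A minimal sketch of such a guard, for illustration only. `merge_caches` here is a hypothetical stand-in for the wrapped merge function, not the real mlx-lm implementation. Returning an empty list for empty input is an assumption; the PR does not say what the patched function returns.

```python
from typing import Any, List, Sequence


def merge_caches(caches: Sequence[Any]) -> List[Any]:
    # Hypothetical stand-in: like many merge helpers, it inspects the
    # first element, so an empty input raises IndexError.
    first = caches[0]
    return [first] + list(caches[1:])


def patched_merge_caches(caches: Sequence[Any]) -> List[Any]:
    # Guard: return early instead of indexing into an empty list.
    # (The empty-list return value is an assumption of this sketch.)
    if not caches:
        return []
    return merge_caches(caches)
```

The guard sits in the wrapper, so the wrapped function itself is left unchanged.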

2. Layer count mismatch in prefix cache

_extract_block_tensor_slice now handles QuantizedKVCache tuples
_get_cache_seq_len detects quantized tuple format
5 new tests added
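The tuple detection could look roughly like the sketch below. It assumes a quantized entry is a 3-tuple of (packed data, scales, biases) and that the sequence axis is the second-to-last one; neither is stated in this PR. `FakeArray`, `is_quantized_tuple` and `cache_seq_len` are hypothetical names, not the project's helpers.

```python
from dataclasses import dataclass
from typing import Any, Tuple


@dataclass
class FakeArray:
    # Stand-in for a tensor; only the shape matters for this sketch.
    shape: Tuple[int, ...]


def is_quantized_tuple(x: Any) -> bool:
    # Assumed format: (packed_data, scales, biases), each with a .shape.
    return (
        isinstance(x, (tuple, list))
        and len(x) == 3
        and all(hasattr(part, "shape") for part in x)
    )


def cache_seq_len(keys: Any, seq_axis: int = -2) -> int:
    # For a quantized tuple, read the length from the packed data tensor;
    # all three components are assumed to share the sequence axis.
    tensor = keys[0] if is_quantized_tuple(keys) else keys
    return tensor.shape[seq_axis]
```

Reading the length from one component keeps the plain and quantized paths on the same code path after the initial check.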

3. Quantization parameters lost during SSD storage

meta_state now preserves bits and group_size
QuantizedKVCacheHandler reads params from meta_state
Fallback to tensor-shape inference for backward compatibility
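As a rough sketch, the lookup with fallback might work as follows. The dict-shaped `meta_state`, the function name and the packing layout are assumptions of this example: values packed into 32-bit words, and one scale per group along the last axis.

```python
from typing import Mapping, Tuple


def resolve_quant_params(
    meta_state: Mapping[str, str],
    packed_dim: int,
    scales_dim: int,
    head_dim: int,
) -> Tuple[int, int]:
    """Return (bits, group_size), preferring the stored metadata."""
    # Preferred path: parameters were saved alongside the cache.
    if "bits" in meta_state and "group_size" in meta_state:
        return int(meta_state["bits"]), int(meta_state["group_size"])

    # Fallback for entries written before the fix: infer from shapes.
    # Assumed layout: scales_dim = head_dim / group_size and
    # packed_dim = head_dim * bits / 32.
    group_size = head_dim // scales_dim
    bits = packed_dim * 32 // head_dim
    return bits, group_size
```

For example, with `head_dim=128`, a packed last dimension of 16 and 2 scales per row, the fallback yields 4 bits and a group size of 64. Shape inference needs the unquantized head dimension, which is one reason storing the parameters explicitly is more robust.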

Test Results

658 cache tests pass
Live test: 10,240/11,422 tokens cache hit (90% hit rate)
Round 2 time: 9.6s (vs 63.2s for Round 1)
Memory stable at ~17GB peak (vs 31GB+ before fix)

Live Test Log

Round 1: 11,422 prompt tokens, 63.2s
Round 2: 11,493 prompt tokens, 9.6s, 10,240 cached tokens (90% hit)
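The reported figures can be cross-checked with a few lines of arithmetic. The numbers are copied from the log above; the variable names are illustrative.

```python
# Figures from the live test log.
round1_tokens, round1_seconds = 11_422, 63.2
round2_tokens, round2_seconds = 11_493, 9.6
cached_tokens = 10_240

# Hit rate against the Round 1 prompt size (the ratio quoted in Test Results).
hit_rate_vs_round1 = cached_tokens / round1_tokens

# Hit rate against the Round 2 prompt size (the request that used the cache).
hit_rate_vs_round2 = cached_tokens / round2_tokens

# Wall-clock speedup of Round 2 over Round 1.
speedup = round1_seconds / round2_seconds

print(f"{hit_rate_vs_round1:.1%} {hit_rate_vs_round2:.1%} {speedup:.1f}x")
```

Either denominator gives a hit rate close to 90%, and Round 2 is roughly 6.6 times faster than Round 1.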


sleepy added 3 commits

2026-05-09 15:09:24 +02:00

feat: full QuantizedKVCache support for batching and prefix cache (fixes #39) 65baff89d4

- Monkey-patch QuantizedKVCache.__len__ to return offset instead of 0
- Create BatchQuantizedKVCache with full batching interface
- Monkey-patch mlx-lm _make_cache and _merge_caches for QuantizedKVCache
- Monkey-patch BatchKVCache.extract to return QuantizedKVCache when appropriate
- Add QuantizedKVCacheHandler for prefix cache SSD reconstruction
- Fix _apply_quantized_kv to use update_and_fetch instead of non-existent step()
- Update prefix cache offset validation to include QuantizedKVCache
- Add comprehensive tests for batch quantized cache operations
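The `__len__` patch from the first bullet can be illustrated with a stand-in class. `QuantizedCacheStandIn` is hypothetical and only mimics the two details the commit message mentions: an `offset` attribute and a `__len__` that returns 0.

```python
class QuantizedCacheStandIn:
    """Hypothetical stand-in for a cache whose __len__ always reports 0."""

    def __init__(self) -> None:
        self.offset = 0  # number of tokens currently stored

    def __len__(self) -> int:
        return 0

    def advance(self, n_tokens: int) -> None:
        self.offset += n_tokens


def _len_from_offset(self) -> int:
    # Report the number of cached tokens instead of a constant 0.
    return int(self.offset)


# Patch the class, not an instance: Python looks up special methods
# such as __len__ on the type, so an instance-level patch is ignored
# by len().
QuantizedCacheStandIn.__len__ = _len_from_offset
```

After the patch, `len(cache)` tracks `cache.offset`, so code that uses `len()` to check for cached content sees the real token count.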

Fix PR #40 review feedback: merge bug, dead code removal, bounds check, validation, tests 2dca322fcf

- Fix critical bug in BatchQuantizedKVCache.merge(): vp[0] -> vp[1]
- Remove dead BatchKVCache.extract monkey-patch from scheduler.py
- Remove obsolete test_batchkv_extract_with_quantized_data test
- Clean up unused imports (_is_quantized_tuple, Any, Optional)
- Add meta_offset bounds check in QuantizedKVCacheHandler.reconstruct_cache()
- Add group_size/bits validation assertions in merge()
- Strengthen tests: distinct keys/values, filter shift, extend unequal, merge lengths
- Make merge() handle mx.array offset values gracefully
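A sketch of the kind of parameter validation described above, under the assumption that each cache exposes `group_size` and `bits` attributes. The function name is hypothetical, and this sketch raises `ValueError` where the commit message mentions assertions.

```python
from types import SimpleNamespace
from typing import Any, Sequence


def check_merge_compatible(caches: Sequence[Any]) -> None:
    """Raise ValueError if caches disagree on quantization parameters."""
    if not caches:
        return  # nothing to compare
    reference = (caches[0].group_size, caches[0].bits)
    for index, cache in enumerate(caches[1:], start=1):
        params = (cache.group_size, cache.bits)
        if params != reference:
            raise ValueError(
                f"cache {index} has (group_size, bits)={params}, "
                f"expected {reference}"
            )


# Usage with simple stand-in objects.
same = [SimpleNamespace(group_size=64, bits=4) for _ in range(3)]
check_merge_compatible(same)  # passes silently
```

Checking up front matters because tensors quantized with different parameters can still be concatenated when their shapes happen to match, and the result would dequantize incorrectly.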

fix(scheduler): handle empty caches list in _patched_merge_caches 17073278a4

sleepy added 2 commits

2026-05-09 15:10:57 +02:00

fix(cache): prefix cache QuantizedKVCache support and param preservation 16c657bf9b

Merge main and resolve conflicts (keep QuantizedKVCache fixes) 1b3c4eb54f

sleepy merged commit 43e5073f6f into main

2026-05-09 15:11:03 +02:00

sleepy referenced this pull request from a commit

2026-05-09 15:11:03 +02:00

[fix] QuantizedKVCache prefix cache and batching fixes (#39 follow-up) (#41)

sleepy referenced this pull request

2026-05-09 15:11:21 +02:00