[fix] QuantizedKVCache prefix cache and batching fixes (#39 follow-up) #41

Merged
sleepy merged 5 commits from feature/39-llamacpp-kv-cache-rewrite into main 2026-05-09 15:11:03 +02:00
Owner

Fixes for issues discovered after PR #40

1. Empty caches in _patched_merge_caches

  • Added a guard for an empty caches list, which previously raised IndexError
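A minimal sketch of this kind of guard, not the actual patch. `naive_merge` and `guard_empty` are hypothetical stand-ins, and returning an empty list for empty input is an assumption, since the PR does not say what `_patched_merge_caches` returns in that case.

```python
from typing import Callable, List, Sequence


def naive_merge(caches: Sequence[Sequence[str]]) -> List[List[str]]:
    """Hypothetical merge: regroups per-request layer caches by layer.

    It reads caches[0] to find the layer count, so an empty input
    raises IndexError, which is the failure mode described above.
    """
    num_layers = len(caches[0])
    return [[request[i] for request in caches] for i in range(num_layers)]


def guard_empty(merge_fn: Callable) -> Callable:
    """Wrap a merge function so an empty input short-circuits."""

    def wrapper(caches):
        if not caches:
            # Assumption: nothing to merge means an empty result.
            return []
        return merge_fn(caches)

    return wrapper


safe_merge = guard_empty(naive_merge)
```

Wrapping rather than editing the merge function keeps the original behaviour intact for non-empty input, which fits the monkey-patching approach used elsewhere in this PR.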

2. Layer count mismatch in prefix cache

  • _extract_block_tensor_slice now handles QuantizedKVCache tuples
  • _get_cache_seq_len detects quantized tuple format
  • 5 new tests added
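The sequence-length detection can be sketched roughly as below. This is an illustration under assumptions, not the code in this PR: `FakeArray` stands in for an MLX array, `looks_quantized` and `cache_seq_len` are hypothetical names, and the sketch assumes a quantized entry is a 3-tuple (packed data, scales, biases) whose parts all keep the sequence dimension on axis 2.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class FakeArray:
    """Stand-in for an array type; only the shape matters here."""
    shape: Tuple[int, ...]


def looks_quantized(entry) -> bool:
    """True if entry looks like a (data, scales, biases) tuple."""
    return (
        isinstance(entry, (tuple, list))
        and len(entry) == 3
        and all(hasattr(part, "shape") for part in entry)
    )


def cache_seq_len(keys) -> int:
    """Return the sequence length for plain or quantized keys.

    Assumes layout (batch, kv_heads, seq_len, last_dim) in both cases.
    """
    first = keys[0] if looks_quantized(keys) else keys
    return first.shape[2]


plain = FakeArray((1, 8, 256, 128))
quantized = (
    FakeArray((1, 8, 256, 16)),  # packed data
    FakeArray((1, 8, 256, 2)),   # scales
    FakeArray((1, 8, 256, 2)),   # biases
)
```

Both `cache_seq_len(plain)` and `cache_seq_len(quantized)` read the same axis, so the prefix cache can treat the two formats uniformly once the tuple is unwrapped.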

3. Quantization parameters lost during SSD storage

  • meta_state now preserves bits and group_size
  • QuantizedKVCacheHandler reads params from meta_state
  • Fallback to tensor-shape inference for backward compatibility
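One way the read-then-fallback logic could look, as a sketch only. The `meta_state` layout used here (a tuple of strings ordered offset, group_size, bits) and the default group size of 64 are assumptions made for the example. The PR does not describe how the real fallback infers parameters. The shape relations assume 32-bit packing, where the packed last dimension is `head_dim * bits / 32` and the scales last dimension is `head_dim / group_size`.

```python
from typing import Optional, Sequence, Tuple

DEFAULT_GROUP_SIZE = 64  # assumption: used only when meta_state lacks it


def infer_bits(packed_dim: int, scales_dim: int, group_size: int) -> int:
    """Infer bits from tensor shapes, given an assumed group_size.

    Shapes alone only fix the product bits * group_size, so one of
    the two has to be assumed to recover the other.
    """
    head_dim = scales_dim * group_size
    bits, remainder = divmod(packed_dim * 32, head_dim)
    if remainder or bits <= 0:
        raise ValueError("shapes are inconsistent with the assumed group_size")
    return bits


def resolve_quant_params(
    meta_state: Optional[Sequence[str]],
    packed_dim: int,
    scales_dim: int,
) -> Tuple[int, int]:
    """Return (group_size, bits), preferring stored metadata."""
    if meta_state is not None and len(meta_state) >= 3:
        # Hypothetical layout: (offset, group_size, bits) as strings.
        _, group_size, bits = meta_state[:3]
        return int(group_size), int(bits)
    # Older entries without the parameters: fall back to shape inference.
    return DEFAULT_GROUP_SIZE, infer_bits(packed_dim, scales_dim, DEFAULT_GROUP_SIZE)
```

For example, a legacy entry with a packed dimension of 16 and a scales dimension of 2 resolves to `(64, 4)` under the default, while an entry stored with `("1024", "32", "8")` returns `(32, 8)` without looking at shapes at all.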

Test Results

  • 658 cache tests pass
  • Live test: 10,240/11,422 tokens cache hit (90% hit rate)
  • Round 2 time: 9.6s (vs 63.2s for Round 1)
  • Memory stable at ~17GB peak (vs 31GB+ before fix)

Live Test Log

Round 1: 11,422 prompt tokens, 63.2s
Round 2: 11,493 prompt tokens, 9.6s, 10,240 cached tokens (90% hit)
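For reference, the figures in the log can be checked with a few lines of arithmetic. The numbers are taken from the log above. Which prompt length the reported 90% uses as its denominator is not stated, so both are computed.

```python
# Values copied from the live test log.
round1_tokens = 11_422
round2_tokens = 11_493
cached_tokens = 10_240
round1_seconds = 63.2
round2_seconds = 9.6

# Hit rate measured against each round's prompt length.
hit_vs_round1 = 100 * cached_tokens / round1_tokens
hit_vs_round2 = 100 * cached_tokens / round2_tokens

# Wall-clock speedup of the cached round over the cold round.
speedup = round1_seconds / round2_seconds

print(f"hit rate vs Round 1 prompt: {hit_vs_round1:.1f}%")  # 89.7%
print(f"hit rate vs Round 2 prompt: {hit_vs_round2:.1f}%")  # 89.1%
print(f"speedup: {speedup:.1f}x")                           # 6.6x
```

Either denominator gives a hit rate just under 90%, and Round 2 runs about 6.6 times faster than Round 1.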
- Monkey-patch QuantizedKVCache.__len__ to return offset instead of 0
- Create BatchQuantizedKVCache with full batching interface
- Monkey-patch mlx-lm _make_cache and _merge_caches for QuantizedKVCache
- Monkey-patch BatchKVCache.extract to return QuantizedKVCache when appropriate
- Add QuantizedKVCacheHandler for prefix cache SSD reconstruction
- Fix _apply_quantized_kv to use update_and_fetch instead of non-existent step()
- Update prefix cache offset validation to include QuantizedKVCache
- Add comprehensive tests for batch quantized cache operations
- Fix critical bug in BatchQuantizedKVCache.merge(): vp[0] -> vp[1]
- Remove dead BatchKVCache.extract monkey-patch from scheduler.py
- Remove obsolete test_batchkv_extract_with_quantized_data test
- Clean up unused imports (_is_quantized_tuple, Any, Optional)
- Add meta_offset bounds check in QuantizedKVCacheHandler.reconstruct_cache()
- Add group_size/bits validation assertions in merge()
- Strengthen tests: distinct keys/values, filter shift, extend unequal, merge lengths
- Make merge() handle mx.array offset values gracefully
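The last few commits (parameter validation in `merge()` and tolerating array-valued offsets) can be illustrated with a sketch like the one below. It is not the code from `BatchQuantizedKVCache.merge()`: `StubQuantCache`, `ScalarLike`, `as_int`, and `check_mergeable` are hypothetical names, and the sketch raises `ValueError` where the commit message mentions assertions. It relies only on array scalars exposing an `.item()` method, as `mx.array` does.

```python
from dataclasses import dataclass
from typing import List, Sequence


class ScalarLike:
    """Stand-in for a 0-d array that exposes .item()."""

    def __init__(self, value: int):
        self._value = value

    def item(self) -> int:
        return self._value


@dataclass
class StubQuantCache:
    """Minimal stand-in holding only the fields merge validation needs."""
    offset: object  # plain int or an array-like scalar
    group_size: int = 64
    bits: int = 4


def as_int(offset) -> int:
    """Normalize an int or array-like scalar offset to a Python int."""
    return int(offset.item()) if hasattr(offset, "item") else int(offset)


def check_mergeable(caches: Sequence[StubQuantCache]) -> List[int]:
    """Validate quantization parameters and return per-cache lengths."""
    if not caches:
        raise ValueError("cannot merge an empty list of caches")
    expected = (caches[0].group_size, caches[0].bits)
    for index, cache in enumerate(caches[1:], start=1):
        if (cache.group_size, cache.bits) != expected:
            raise ValueError(
                f"cache {index} uses (group_size, bits)="
                f"{(cache.group_size, cache.bits)}, expected {expected}"
            )
    return [as_int(cache.offset) for cache in caches]
```

Checking parameters before concatenating matters because tensors packed with different `bits` or `group_size` values cannot share one batch tensor; failing early gives a clearer error than a later shape mismatch.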
sleepy merged commit 43e5073f6f into main 2026-05-09 15:11:03 +02:00
Reference
sleepy/omlx!41