[perf] Full QuantizedKVCache batching support (#39) #40

Merged
sleepy merged 2 commits from feature/39-llamacpp-kv-cache-rewrite into main 2026-05-09 13:36:25 +02:00
Owner

Summary

Fixes #39 by implementing full QuantizedKVCache support in batching and prefix cache paths.

Changes

  • Monkey-patch QuantizedKVCache.__len__ to return the actual token count (its offset); it previously returned 0
  • Implement BatchQuantizedKVCache with full batching interface (update, extract, merge, filter, trim)
  • Monkey-patch _make_cache and _merge_caches to support QuantizedKVCache
  • Monkey-patch BatchKVCache.extract() to preserve quantization
  • Add QuantizedKVCacheHandler for prefix cache reconstruction
  • Fix _apply_quantized_kv() to call update_and_fetch instead of the non-existent step()
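
As a rough illustration of the first item, the sketch below patches `__len__` on a stand-in class. The class body, `_quantized_len`, `patch_len`, and the `_len_patched` flag are hypothetical names for this example and are not taken from the PR or from mlx-lm; only the `offset` attribute and the "returned 0" behavior come from the description above.

```python
# Stand-in for the real QuantizedKVCache; only `offset` matters for this sketch.
class QuantizedKVCache:
    def __init__(self):
        self.offset = 0  # number of tokens written so far

    def __len__(self):
        return 0  # the behavior described above as the bug


def _quantized_len(self):
    # Report the number of cached tokens instead of a constant 0.
    return int(self.offset)


def patch_len(cache_cls):
    """Replace __len__ on the class once; repeated calls are no-ops."""
    if getattr(cache_cls, "_len_patched", False):
        return
    cache_cls.__len__ = _quantized_len
    cache_cls._len_patched = True


cache = QuantizedKVCache()
cache.offset = 128
before = len(cache)   # 0 with the unpatched class
patch_len(QuantizedKVCache)
after = len(cache)    # 128 once the class is patched
```

Because `len()` looks up `__len__` on the type, patching the class also affects instances created before the patch. In Python, an object whose `__len__` returns 0 is falsy, so any `if cache:` style check would have treated a populated cache as empty.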

Test Results

  • 17 new tests: all pass
  • 650 cache-related tests: all pass
  • 6 pre-existing failures (unrelated: integration test setup + async pytest plugin)

Memory Impact

With Qwen3.6-27B at 50K context:

  • Before: KV cache ~3.0GB (fp16), total ~31GB → OOM
  • After: KV cache ~0.8GB (4-bit), total ~28.8GB → within limit
  • Blocks that were previously evicted under memory pressure should now be retained
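
The figures above can be sanity-checked with a small calculation. This is a sketch under assumptions that the PR does not state: a group size of 64 and one fp16 scale plus one fp16 bias stored per group (32 bits of overhead per group). The function name `quantized_kv_gb` is hypothetical.

```python
def quantized_kv_gb(fp16_gb, bits=4, group_size=64, overhead_bits_per_group=32):
    """Estimate quantized KV cache size from its fp16 size.

    Assumption (not from the PR): every group of `group_size` elements
    carries one fp16 scale and one fp16 bias, i.e. 32 extra bits per group.
    """
    bits_per_element = bits + overhead_bits_per_group / group_size
    return fp16_gb * bits_per_element / 16


kv_before = 3.0      # GB, fp16 KV cache from the PR description
total_before = 31.0  # GB, total from the PR description

kv_after = quantized_kv_gb(kv_before)
total_after = total_before - kv_before + kv_after

print(f"KV after: {kv_after:.2f} GB, total after: {total_after:.2f} GB")
# prints "KV after: 0.84 GB, total after: 28.84 GB"
```

Under these assumptions the estimate lands close to the reported ~0.8GB and ~28.8GB, which suggests the saving comes almost entirely from the KV cache itself.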
- Monkey-patch QuantizedKVCache.__len__ to return offset instead of 0
- Create BatchQuantizedKVCache with full batching interface
- Monkey-patch mlx-lm _make_cache and _merge_caches for QuantizedKVCache
- Monkey-patch BatchKVCache.extract to return QuantizedKVCache when appropriate
- Add QuantizedKVCacheHandler for prefix cache SSD reconstruction
- Fix _apply_quantized_kv to use update_and_fetch instead of non-existent step()
- Update prefix cache offset validation to include QuantizedKVCache
- Add comprehensive tests for batch quantized cache operations
- Fix critical bug in BatchQuantizedKVCache.merge(): vp[0] -> vp[1]
- Remove dead BatchKVCache.extract monkey-patch from scheduler.py
- Remove obsolete test_batchkv_extract_with_quantized_data test
- Clean up unused imports (_is_quantized_tuple, Any, Optional)
- Add meta_offset bounds check in QuantizedKVCacheHandler.reconstruct_cache()
- Add group_size/bits validation assertions in merge()
- Strengthen tests: distinct keys/values, filter shift, extend unequal, merge lengths
- Make merge() handle mx.array offset values gracefully
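
The merge-related commits above (component indexing, group_size/bits assertions, offset handling) can be pictured with a toy version. This is a minimal sketch, not the PR's code: `ToyQuantCache`, its fields, and this `merge` are hypothetical, plain Python lists stand in for arrays, and padding to a common length is skipped. It assumes a quantized tensor is stored as a 3-tuple of (packed data, scales, biases).

```python
from dataclasses import dataclass


@dataclass
class ToyQuantCache:
    keys: tuple    # (packed, scales, biases) for one sequence
    values: tuple  # same layout as keys
    offset: int    # tokens stored; may be any type that supports int()
    group_size: int = 64
    bits: int = 4


def merge(caches):
    """Combine per-sequence caches into one batched structure."""
    first = caches[0]
    for c in caches[1:]:
        # Mixing quantization settings would make the packed data unreadable.
        assert c.group_size == first.group_size, "group_size mismatch"
        assert c.bits == first.bits, "bits mismatch"

    # Normalize offsets to plain ints (assumes scalar array offsets support int()).
    lengths = [int(c.offset) for c in caches]

    def stack(parts_per_cache):
        # zip pairs component i with component i across caches, so the
        # scales slot can never be filled from the packed-data slot.
        return tuple(list(component) for component in zip(*parts_per_cache))

    return {
        "keys": stack([c.keys for c in caches]),
        "values": stack([c.values for c in caches]),
        "lengths": lengths,
    }


a = ToyQuantCache(("kd_a", "ks_a", "kb_a"), ("vd_a", "vs_a", "vb_a"), offset=3)
b = ToyQuantCache(("kd_b", "ks_b", "kb_b"), ("vd_b", "vs_b", "vb_b"), offset=5)
merged = merge([a, b])
```

Using distinct placeholder values for every component, as the strengthened tests do for keys and values, makes an index mix-up visible immediately: `merged["values"][1]` must contain only the scales entries.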
sleepy merged commit 00d215f5a5 into main 2026-05-09 13:36:25 +02:00
Reference
sleepy/omlx!40