[perf] Full QuantizedKVCache batching support (#39) #40

Merged

sleepy merged 2 commits from feature/39-llamacpp-kv-cache-rewrite into main

2026-05-09 13:36:25 +02:00

sleepy commented

2026-05-09 13:29:30 +02:00

Owner

Summary

Fixes #39 by implementing full QuantizedKVCache support in batching and prefix cache paths.

Changes

- Monkey-patch `QuantizedKVCache.__len__` to return actual token count (was returning 0)
- Implement `BatchQuantizedKVCache` with full batching interface (update, extract, merge, filter, trim)
- Monkey-patch `_make_cache` and `_merge_caches` to support QuantizedKVCache
- Monkey-patch `BatchKVCache.extract()` to preserve quantization
- Add `QuantizedKVCacheHandler` for prefix cache reconstruction
- Fix `_apply_quantized_kv()` to use correct `update_and_fetch` API
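
For readers unfamiliar with the `__len__` change, here is a minimal sketch of the patching pattern. It is not the code from this PR: `QuantizedKVCacheStandIn`, `_patched_len`, and `apply_len_patch` are hypothetical names, and the stand-in class only assumes that the real cache tracks its token count in an `offset` attribute (as the commit message suggests).

```python
class QuantizedKVCacheStandIn:
    """Hypothetical stand-in for the real cache class; only `offset` matters here."""

    def __init__(self):
        self.offset = 0  # number of tokens written so far

    def __len__(self):
        return 0  # mimics the behaviour reported before the patch


def _patched_len(self):
    # Report the number of cached tokens instead of a constant 0.
    return int(self.offset)


def apply_len_patch(cache_cls):
    """Replace `__len__` on the class; calling it twice is a no-op."""
    if getattr(cache_cls, "_len_patched", False):
        return
    cache_cls.__len__ = _patched_len
    cache_cls._len_patched = True


cache = QuantizedKVCacheStandIn()
cache.offset = 128
before = len(cache)                      # unpatched: 0
apply_len_patch(QuantizedKVCacheStandIn)
after = len(cache)                       # patched: follows `offset`
```

Because `len()` resolves `__len__` on the type, patching the class also affects instances that already exist. A `__len__` that returns 0 also makes the object falsy in checks such as `if cache:`, which is one way a populated cache can be mistaken for an empty one.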

Test Results

- 17 new tests: all pass
- 650 cache-related tests: all pass
- 6 pre-existing failures (unrelated: integration test setup + async pytest plugin)

Memory Impact

With Qwen3.6-27B at 50K context:

- Before: KV cache ~3.0GB (fp16), total ~31GB → OOM
- After: KV cache ~0.8GB (4-bit), total ~28.8GB → within limit
- Cache blocks that were previously evicted under memory pressure should now be retained
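
As a rough sanity check on these figures, the sketch below estimates the 4-bit cache size from the fp16 size. It assumes group-wise affine quantization with a group size of 64 and one fp16 scale plus one fp16 bias per group; neither setting is stated in this PR, and the function names are illustrative only.

```python
def bits_per_element(bits, group_size, scale_bits=16, bias_bits=16):
    """Effective storage per cached element, including per-group scale and bias."""
    return bits + (scale_bits + bias_bits) / group_size


def quantized_size_gb(fp16_size_gb, bits=4, group_size=64):
    """Scale an fp16 cache size by the ratio of effective bits to 16 bits."""
    return fp16_size_gb * bits_per_element(bits, group_size) / 16


# Assumed settings: bits=4, group_size=64 (not confirmed by the PR).
effective_bits = bits_per_element(4, 64)   # 4 + 32/64 = 4.5 bits per element
estimate_gb = quantized_size_gb(3.0)       # 3.0 * 4.5 / 16
saved_gb = 3.0 - estimate_gb
```

Under these assumptions the estimate comes to about 0.84GB, a saving of roughly 2.2GB, which is in line with the reported drop from ~31GB to ~28.8GB.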


sleepy added 1 commit

2026-05-09 13:29:30 +02:00

feat: full QuantizedKVCache support for batching and prefix cache (fixes #39) 65baff89d4

- Monkey-patch QuantizedKVCache.__len__ to return offset instead of 0
- Create BatchQuantizedKVCache with full batching interface
- Monkey-patch mlx-lm _make_cache and _merge_caches for QuantizedKVCache
- Monkey-patch BatchKVCache.extract to return QuantizedKVCache when appropriate
- Add QuantizedKVCacheHandler for prefix cache SSD reconstruction
- Fix _apply_quantized_kv to use update_and_fetch instead of non-existent step()
- Update prefix cache offset validation to include QuantizedKVCache
- Add comprehensive tests for batch quantized cache operations

sleepy referenced this pull request

2026-05-09 13:32:05 +02:00

Rewrite KV cache system to match llama.cpp memory efficiency #39

sleepy referenced this pull request from a commit

2026-05-09 13:35:42 +02:00

Fix PR #40 review feedback: merge bug, dead code removal, bounds check, validation, tests

sleepy added 1 commit

2026-05-09 13:35:42 +02:00

Fix PR #40 review feedback: merge bug, dead code removal, bounds check, validation, tests 2dca322fcf

- Fix critical bug in BatchQuantizedKVCache.merge(): vp[0] -> vp[1]
- Remove dead BatchKVCache.extract monkey-patch from scheduler.py
- Remove obsolete test_batchkv_extract_with_quantized_data test
- Clean up unused imports (_is_quantized_tuple, Any, Optional)
- Add meta_offset bounds check in QuantizedKVCacheHandler.reconstruct_cache()
- Add group_size/bits validation assertions in merge()
- Strengthen tests: distinct keys/values, filter shift, extend unequal, merge lengths
- Make merge() handle mx.array offset values gracefully
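
The `group_size`/`bits` validation can be pictured with the sketch below. It is an illustration under assumptions, not the code from this commit: `QuantizedEntry` and `check_mergeable` are hypothetical names, and the sketch raises `ValueError` where the commit describes assertions.

```python
from dataclasses import dataclass


@dataclass
class QuantizedEntry:
    """Hypothetical per-sequence record; the real cache class differs."""
    offset: int          # number of cached tokens
    group_size: int = 64
    bits: int = 4


def check_mergeable(entries):
    """Return the shared (group_size, bits), or raise if the entries disagree."""
    if not entries:
        raise ValueError("nothing to merge")
    expected = (entries[0].group_size, entries[0].bits)
    for index, entry in enumerate(entries[1:], start=1):
        actual = (entry.group_size, entry.bits)
        if actual != expected:
            raise ValueError(
                f"entry {index} uses (group_size, bits)={actual}, expected {expected}"
            )
    return expected


shared = check_mergeable([QuantizedEntry(offset=10), QuantizedEntry(offset=7)])
```

Checking before merging matters because packed data quantized with different settings cannot be concatenated into a single batch without producing wrong values on dequantization.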

sleepy merged commit 00d215f5a5 into main

2026-05-09 13:36:25 +02:00

sleepy referenced this pull request from a commit

2026-05-09 13:36:25 +02:00

[perf] Full QuantizedKVCache batching support (#39) (#40)

sleepy referenced this pull request

2026-05-09 13:36:47 +02:00

Rewrite KV cache system to match llama.cpp memory efficiency #39

sleepy referenced this pull request

2026-05-09 15:09:24 +02:00

[fix] QuantizedKVCache prefix cache and batching fixes (#39 follow-up) #41