Rewrite KV cache system to match llama.cpp memory efficiency #39
Problem
At a 50K-token context with Qwen3.6-27B-mxfp4, the system exceeds 31GB of memory against a process limit of 30.2GB, causing OOM crashes and aggressive cache eviction. Only about 25% of cached tokens survive between requests (12K of 47K recovered).
Current mlx-lm QuantizedKVCache:
Requirements
Reference
llama.cpp uses:
PR #40 review: CHANGES_REQUESTED. Issues found: a critical bug in the values assignment in merge(), dead code that needs removing, and test gaps. Re-dispatching subagent to fix.
Merged via PR #40 (squash: 00d215f). Fixes:
Memory impact: KV cache drops from ~3.0GB (fp16) to ~0.8GB (4-bit) at 50K context.
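As a rough sanity check on the figures above, the arithmetic can be sketched as follows. This is a back-of-the-envelope estimate, not the project's measurement code. It assumes group-wise affine quantization with a group size of 64 and one fp16 scale plus one fp16 bias stored per group; neither value is stated in this issue. The function names are hypothetical.

```python
def effective_bits_per_element(bits: int, group_size: int,
                               scale_bits: int = 16, bias_bits: int = 16) -> float:
    """Storage cost per cached element, including per-group metadata.

    Assumption: each group of `group_size` elements carries one scale and
    one bias, each stored in fp16 (16 bits).
    """
    return bits + (scale_bits + bias_bits) / group_size


def quantized_cache_gb(fp16_gb: float, bits: int, group_size: int = 64) -> float:
    """Scale an fp16 cache size down to its quantized equivalent."""
    return fp16_gb * effective_bits_per_element(bits, group_size) / 16


# ~3.0GB is the fp16 figure reported above for a 50K-token context.
estimate = quantized_cache_gb(3.0, bits=4, group_size=64)
print(f"estimated 4-bit KV cache: {estimate:.2f} GB")
```

Under these assumptions each element costs 4.5 bits rather than 4, so the estimate lands slightly above a plain 4x reduction, which is consistent with the reported ~0.8GB.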
Next: Live testing with real prompts to verify cache hits and token/s.
Merged via PR #41 (squash).
Results
Before fix:
After fix:
Technical Summary
Live Test
QuantizedKVCache now works end-to-end with batching and SSD persistence.