[perf] Pre-allocate KV cache to max_kv_size upfront #2

Closed
opened 2026-05-15 19:51:58 +02:00 by sleepy · 1 comment
Owner

Problem

When max_kv_size is set, KV caches still grow in 256-token steps via trim-then-concat (cache.py:344-348). Each growth step creates 4 temporary copies of the cache per layer (old + trimmed + new + concatenated), causing multi-GB memory spikes on models with many layers.

Example: Qwen3-8B at 4K context, 36 layers — growth spike is ~4.5GB transient on top of ~1GB steady state.

Solution

When max_kv_size is known, allocate the full buffer on first update_and_fetch call. No trim, no concat, no growth spikes.

Affected classes

  • RotatingKVCache (cache.py:410) — already partially does this (bounded at max_size)
  • BatchRotatingKVCache (cache.py:1135) — same
  • KVCache (cache.py:325) — only used when max_kv_size=None, so not in scope
  • BatchKVCache (cache.py:914) — only used when max_kv_size=None, so not in scope

Required changes

  1. RotatingKVCache: On first update_and_fetch, instead of allocating min(256, max_size) tokens, allocate max_size tokens upfront. Subsequent calls write in-place at the correct _idx position. No concat ever needed.

  2. BatchRotatingKVCache: Same — allocate max_size columns on first call.

  3. Ensure _update_in_place path is always taken after initial allocation (no more _update_concat fallback during generation).

Acceptance criteria

  • RotatingKVCache allocates max_size columns on first update_and_fetch
  • BatchRotatingKVCache allocates max_size columns on first update_and_fetch
  • No concatenate calls during generation (only during prompt prefill which is expected)
  • Memory usage is flat after initial allocation (no spikes)
  • Existing tests pass
  • New test: verify cache.nbytes matches max_size allocation after first update

Testing

Run existing test suite. Add test that verifies:

  1. After first update_and_fetch, cache.keys.shape[2] == max_size
  2. After 1000 tokens, cache.keys.shape[2] is unchanged
  3. Output tokens match the old grow-by-256 behavior
## Problem When `max_kv_size` is set, KV caches still grow in 256-token steps via trim-then-concat (cache.py:344-348). Each growth step creates 4 temporary copies of the cache per layer (old + trimmed + new + concatenated), causing multi-GB memory spikes on models with many layers. Example: Qwen3-8B at 4K context, 36 layers — growth spike is ~4.5GB transient on top of ~1GB steady state. ## Solution When `max_kv_size` is known, allocate the full buffer on first `update_and_fetch` call. No trim, no concat, no growth spikes. ### Affected classes - `RotatingKVCache` (cache.py:410) — already partially does this (bounded at max_size) - `BatchRotatingKVCache` (cache.py:1135) — same - `KVCache` (cache.py:325) — only used when max_kv_size=None, so **not in scope** - `BatchKVCache` (cache.py:914) — only used when max_kv_size=None, so **not in scope** ### Required changes 1. **RotatingKVCache**: On first `update_and_fetch`, instead of allocating `min(256, max_size)` tokens, allocate `max_size` tokens upfront. Subsequent calls write in-place at the correct `_idx` position. No concat ever needed. 2. **BatchRotatingKVCache**: Same — allocate `max_size` columns on first call. 3. Ensure `_update_in_place` path is always taken after initial allocation (no more `_update_concat` fallback during generation). ### Acceptance criteria - [ ] RotatingKVCache allocates max_size columns on first update_and_fetch - [ ] BatchRotatingKVCache allocates max_size columns on first update_and_fetch - [ ] No concatenate calls during generation (only during prompt prefill which is expected) - [ ] Memory usage is flat after initial allocation (no spikes) - [ ] Existing tests pass - [ ] New test: verify cache.nbytes matches max_size allocation after first update ### Testing Run existing test suite. Add test that verifies: 1. After first `update_and_fetch`, `cache.keys.shape[2] == max_size` 2. After 1000 tokens, `cache.keys.shape[2]` is unchanged 3. Output tokens match the old grow-by-256 behavior
Author
Owner

Merged via PR #5 (squash, 07eaf36). RotatingKVCache and BatchRotatingKVCache now pre-allocate full max_size on first update, eliminating growth spikes during generation. 49/49 tests passed.

Merged via PR #5 (squash, 07eaf36). RotatingKVCache and BatchRotatingKVCache now pre-allocate full max_size on first update, eliminating growth spikes during generation. 49/49 tests passed.
Sign in to join this conversation.
No labels
feature
perf
refactor
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
sleepy/mlx-lm#2
No description provided.