sleepy/mlx-lm

[perf] Pre-allocate KV cache to max_kv_size upfront #2

Closed

opened 2026-05-15 19:51:58 +02:00 by sleepy · 1 comment

sleepy commented

2026-05-15 19:51:58 +02:00

Owner

Problem

When max_kv_size is set, KV caches still grow in 256-token steps via trim-then-concat (cache.py:344-348). Each growth step keeps up to four cache-sized arrays alive per layer at the same time (old, trimmed, new, and concatenated), which causes multi-GB memory spikes on models with many layers.

Example: Qwen3-8B at 4K context, 36 layers — growth spike is ~4.5GB transient on top of ~1GB steady state.
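
As a rough back-of-the-envelope sketch of where cache sizes of this order come from, the helper below estimates the bytes held by the key and value arrays alone. The 36 layers and 4K context come from the example above; `n_kv_heads=8`, `head_dim=128`, and 2-byte elements are assumptions for illustration, and `kv_cache_bytes` is a hypothetical helper, not part of mlx-lm. The result need not reproduce the figures quoted above, which may reflect a different configuration or additional buffers.

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, n_tokens,
                   bytes_per_elem=2, batch=1):
    """Estimate bytes held by the key and value arrays across all layers."""
    # Factor 2 accounts for one keys array plus one values array per layer.
    per_layer = 2 * batch * n_kv_heads * n_tokens * head_dim * bytes_per_elem
    return n_layers * per_layer


# Assumed model shape (not taken from the issue): 8 KV heads, head_dim 128, fp16.
steady = kv_cache_bytes(n_layers=36, n_kv_heads=8, head_dim=128, n_tokens=4096)
print(f"steady-state K/V arrays: {steady / 2**30:.2f} GiB")
```

A growth step that keeps several cache-sized arrays alive at once multiplies this estimate for the duration of the step, which is the spike described under Problem.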

Solution

When max_kv_size is known, allocate the full buffer on the first update_and_fetch call. This removes the trim, the concat, and the growth spikes.

Affected classes

- RotatingKVCache (cache.py:410): already partially does this (bounded at max_size)
- BatchRotatingKVCache (cache.py:1135): same
- KVCache (cache.py:325): only used when max_kv_size=None, so not in scope
- BatchKVCache (cache.py:914): only used when max_kv_size=None, so not in scope

Required changes

1. RotatingKVCache: On the first update_and_fetch, allocate max_size tokens upfront instead of min(256, max_size) tokens. Subsequent calls write in-place at the correct _idx position, so no concat is ever needed.
2. BatchRotatingKVCache: Same change. Allocate max_size columns on the first call.
3. Ensure the _update_in_place path is always taken after the initial allocation (no more _update_concat fallback during generation).
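
The required changes can be sketched as a fixed-size ring buffer. This is a minimal illustration, assuming NumPy in place of `mlx.core` and a hypothetical `PreallocatedRingCache` class. It is not the code in cache.py, and it omits details the real rotating caches may handle (for example, preserving a prefix of the prompt).

```python
import numpy as np


class PreallocatedRingCache:
    """Toy stand-in for a rotating KV cache; NOT the mlx-lm implementation."""

    def __init__(self, max_size: int):
        self.max_size = max_size
        self.keys = None    # (batch, kv_heads, max_size, head_dim) once allocated
        self.values = None
        self.offset = 0     # total tokens written so far
        self._idx = 0       # next write position inside the ring

    def update_and_fetch(self, keys: np.ndarray, values: np.ndarray):
        B, H, S, D = keys.shape
        if S > self.max_size:
            raise ValueError("update is longer than max_size")
        if self.keys is None:
            # Single allocation for the lifetime of the cache.
            self.keys = np.zeros((B, H, self.max_size, D), dtype=keys.dtype)
            self.values = np.zeros(
                (B, H, self.max_size, values.shape[-1]), dtype=values.dtype
            )
        # In-place write with wrap-around; no concatenate, no reallocation.
        pos = (self._idx + np.arange(S)) % self.max_size
        self.keys[:, :, pos, :] = keys
        self.values[:, :, pos, :] = values
        self._idx = (self._idx + S) % self.max_size
        self.offset += S
        # Only the filled part is returned (in ring order, not temporal order).
        filled = min(self.offset, self.max_size)
        return self.keys[:, :, :filled, :], self.values[:, :, :filled, :]


cache = PreallocatedRingCache(max_size=8)
step = np.ones((1, 2, 1, 4), dtype=np.float16)
for _ in range(20):
    k, v = cache.update_and_fetch(step, step)
print(cache.keys.shape)  # (1, 2, 8, 4)
```

The buffer shape is fixed by the first call, so later updates only overwrite slots and the allocation stays flat.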

Acceptance criteria

- [ ] RotatingKVCache allocates max_size columns on first update_and_fetch
- [ ] BatchRotatingKVCache allocates max_size columns on first update_and_fetch
- [ ] No concatenate calls during generation (only during prompt prefill, which is expected)
- [ ] Memory usage is flat after initial allocation (no spikes)
- [ ] Existing tests pass
- [ ] New test: verify cache.nbytes matches max_size allocation after first update

Testing

Run the existing test suite. Add a test that verifies:

1. After the first update_and_fetch, cache.keys.shape[2] == max_size
2. After 1000 tokens, cache.keys.shape[2] is unchanged
3. Output tokens match the old grow-by-256 behavior
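
The first two checks could look roughly like the sketch below. It is written against any object that exposes `update_and_fetch` and a `keys` array, and it uses NumPy plus a hypothetical `ToyPreallocCache` stand-in so that it runs without mlx-lm. `check_preallocation` is an illustrative name, not an existing test in the repository. The third check needs a real model with deterministic decoding and is not shown.

```python
import numpy as np


class ToyPreallocCache:
    """Hypothetical stand-in so the check runs without mlx-lm."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.keys = None
        self.values = None
        self._idx = 0

    def update_and_fetch(self, k, v):
        if self.keys is None:
            shape = (*k.shape[:2], self.max_size, k.shape[3])
            self.keys = np.zeros(shape, k.dtype)
            self.values = np.zeros(shape, v.dtype)
        for t in range(k.shape[2]):
            # Overwrite one slot per token, wrapping at max_size.
            self.keys[:, :, self._idx] = k[:, :, t]
            self.values[:, :, self._idx] = v[:, :, t]
            self._idx = (self._idx + 1) % self.max_size
        return self.keys, self.values


def check_preallocation(make_cache, max_size, n_tokens=1000):
    """Assert the buffer is max_size wide after one update and stays that way."""
    cache = make_cache(max_size)
    tok = np.zeros((1, 2, 1, 4), dtype=np.float16)  # (batch, kv_heads, seq, head_dim)

    cache.update_and_fetch(tok, tok)
    assert cache.keys.shape[2] == max_size, "buffer not allocated to max_size"
    first_nbytes = cache.keys.nbytes

    for _ in range(n_tokens - 1):
        cache.update_and_fetch(tok, tok)
    assert cache.keys.shape[2] == max_size, "buffer width changed"
    assert cache.keys.nbytes == first_nbytes, "buffer size changed"
    return True


check_preallocation(ToyPreallocCache, max_size=64)
```

Passing a factory rather than a concrete class keeps the same check usable for both the single-sequence and the batched cache, assuming both expose the same two attributes.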

sleepy added the perf label

2026-05-15 19:51:58 +02:00

sleepy referenced this issue

2026-05-15 20:03:39 +02:00

[cache] Pre-allocate RotatingKVCache to max_kv_size upfront (#2) #5

sleepy referenced this issue from a commit

2026-05-15 20:05:28 +02:00

[cache] Pre-allocate RotatingKVCache to max_kv_size upfront (#2) (#5)

sleepy closed this issue

2026-05-15 20:06:22 +02:00

sleepy commented

2026-05-15 20:06:22 +02:00

Author

Owner

Merged via PR #5 (squash, 07eaf36). RotatingKVCache and BatchRotatingKVCache now pre-allocate full max_size on first update, eliminating growth spikes during generation. 49/49 tests passed.