[perf] Pre-allocate KV cache to max_kv_size upfront #2
Problem
When max_kv_size is set, KV caches still grow in 256-token steps via trim-then-concat (cache.py:344-348). Each growth step creates 4 temporary copies of the cache per layer (old + trimmed + new + concatenated), causing multi-GB memory spikes on models with many layers.

Example: Qwen3-8B at 4K context, 36 layers: the growth spike is ~4.5GB transient on top of ~1GB steady state.
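The growth pattern described above can be illustrated with a toy sketch. It uses NumPy as a stand-in for the real array library, and the function name, shapes, and dtype are assumptions for illustration only; it is not the code at cache.py:344-348 and it omits the trim step.

```python
import numpy as np

STEP = 256  # growth step named in the issue


def grow_by_concat(total_tokens, heads=2, head_dim=4):
    """Toy stand-in for step-wise cache growth.

    Returns the final buffer and the number of times a new buffer
    had to be allocated.
    """
    keys = None
    reallocations = 0
    for idx in range(total_tokens):
        if keys is None or idx >= keys.shape[2]:
            block = np.zeros((1, heads, STEP, head_dim), dtype=np.float16)
            # Concatenation builds a brand-new array, so the old buffer,
            # the new block, and the result all coexist for a moment.
            keys = block if keys is None else np.concatenate([keys, block], axis=2)
            reallocations += 1
        keys[..., idx, :] = 1.0  # write one token in place
    return keys, reallocations


keys, reallocations = grow_by_concat(1024)
print(keys.shape, reallocations)
```

Every reallocation in this sketch is a point where several buffers are live at once, which is the transient spike the issue describes.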
Solution
When max_kv_size is known, allocate the full buffer on the first update_and_fetch call. No trim, no concat, no growth spikes.
Affected classes
- RotatingKVCache (cache.py:410): already partially does this (bounded at max_size)
- BatchRotatingKVCache (cache.py:1135): same
- KVCache (cache.py:325): only used when max_kv_size=None, so not in scope
- BatchKVCache (cache.py:914): only used when max_kv_size=None, so not in scope
Required changes
- RotatingKVCache: On the first update_and_fetch, instead of allocating min(256, max_size) tokens, allocate max_size tokens upfront. Subsequent calls write in-place at the correct _idx position. No concat is ever needed.
- BatchRotatingKVCache: Same. Allocate max_size columns on the first call.
- Ensure the _update_in_place path is always taken after the initial allocation (no more _update_concat fallback during generation).
Acceptance criteria
Testing
Run existing test suite. Add test that verifies:
- After the first update_and_fetch, cache.keys.shape[2] == max_size
- On later calls, cache.keys.shape[2] is unchanged

Merged via PR #5 (squash, 07eaf36). RotatingKVCache and BatchRotatingKVCache now pre-allocate the full max_size on the first update, eliminating growth spikes during generation. 49/49 tests passed.
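The pre-allocation behaviour and the two shape checks from the Testing section could be sketched roughly as follows. This is a minimal NumPy stand-in, not the merged implementation: the class name PreallocatedRotatingCache is hypothetical, it handles one token per call, and it ignores details of the real RotatingKVCache such as kept prefix tokens and temporal ordering of the returned window.

```python
import numpy as np


class PreallocatedRotatingCache:
    """Hypothetical stand-in that allocates max_size slots on first use."""

    def __init__(self, max_size):
        self.max_size = max_size
        self.keys = None
        self.values = None
        self._idx = 0    # next write position in the buffer
        self.offset = 0  # total number of tokens seen so far

    def update_and_fetch(self, keys, values):
        # keys / values: (batch, heads, 1, head_dim), one token per call
        if self.keys is None:
            b, h, _, d = keys.shape
            # Allocate the full buffer once; it is never resized afterwards.
            self.keys = np.zeros((b, h, self.max_size, d), dtype=keys.dtype)
            self.values = np.zeros(
                (b, h, self.max_size, values.shape[3]), dtype=values.dtype
            )
        if self._idx == self.max_size:
            self._idx = 0  # wrap around and overwrite the oldest slot
        self.keys[:, :, self._idx : self._idx + 1, :] = keys
        self.values[:, :, self._idx : self._idx + 1, :] = values
        self._idx += 1
        self.offset += 1
        valid = min(self.offset, self.max_size)
        return self.keys[:, :, :valid, :], self.values[:, :, :valid, :]


cache = PreallocatedRotatingCache(max_size=8)
token = np.ones((1, 2, 1, 4), dtype=np.float16)

cache.update_and_fetch(token, token)
assert cache.keys.shape[2] == 8  # full size after the first call

first_buffer = cache.keys
for _ in range(20):
    cache.update_and_fetch(token, token)
assert cache.keys.shape[2] == 8    # unchanged after more tokens
assert cache.keys is first_buffer  # same buffer, no reallocation
```

Checking buffer identity in addition to the shape is a slightly stronger test, since a concat that happened to produce the same shape would still replace the array object.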