[perf] Default max_kv_size to 32768 #4

Closed
opened 2026-05-15 19:51:58 +02:00 by sleepy · 1 comment
Owner

Problem

max_kv_size defaults to None everywhere (generate.py:323, generate.py:1940), meaning KV caches grow unbounded. Long conversations OOM instead of gracefully degrading with sliding window attention.

Solution

Default max_kv_size=32768 in both generate_step() and BatchGenerator.__init__().

Required changes

  1. generate.py:323generate_step(): change default from None to 32768
  2. generate.py:1940BatchGenerator.__init__(): change default from None to 32768
  3. generate.py:488stream_generate(): propagate the default
  4. generate.py:670generate() / other entry points: propagate
  5. Check all other callers that pass max_kv_size=None

Acceptance criteria

  • All public generation entry points default to max_kv_size=32768
  • Users can still pass max_kv_size=None for unbounded (old behavior)
  • Existing tests pass (may need to update tests that relied on None default)
  • No behavioral change when max_kv_size is explicitly passed

Notes

  • This is a breaking default change. Consider if a deprecation period is needed.
  • 32K is ~2.25GB for Qwen3-8B (36 layers, 4 KV heads, 128 dim, fp16) — reasonable for 16GB+ machines.
  • Smaller models use proportionally less.
## Problem `max_kv_size` defaults to `None` everywhere (generate.py:323, generate.py:1940), meaning KV caches grow unbounded. Long conversations OOM instead of gracefully degrading with sliding window attention. ## Solution Default `max_kv_size=32768` in both `generate_step()` and `BatchGenerator.__init__()`. ### Required changes 1. **generate.py:323** — `generate_step()`: change default from `None` to `32768` 2. **generate.py:1940** — `BatchGenerator.__init__()`: change default from `None` to `32768` 3. **generate.py:488** — `stream_generate()`: propagate the default 4. **generate.py:670** — `generate()` / other entry points: propagate 5. Check all other callers that pass `max_kv_size=None` ### Acceptance criteria - [ ] All public generation entry points default to max_kv_size=32768 - [ ] Users can still pass max_kv_size=None for unbounded (old behavior) - [ ] Existing tests pass (may need to update tests that relied on None default) - [ ] No behavioral change when max_kv_size is explicitly passed ### Notes - This is a breaking default change. Consider if a deprecation period is needed. - 32K is ~2.25GB for Qwen3-8B (36 layers, 4 KV heads, 128 dim, fp16) — reasonable for 16GB+ machines. - Smaller models use proportionally less.
Author
Owner

Merged via PR #6 (squash). Default max_kv_size=32768 across all generation entry points. Combined with PR #5 pre-allocation, KV cache memory is now flat and bounded by default. 49/49 tests passed.

Merged via PR #6 (squash). Default max_kv_size=32768 across all generation entry points. Combined with PR #5 pre-allocation, KV cache memory is now flat and bounded by default. 49/49 tests passed.
Sign in to join this conversation.
No labels
feature
perf
refactor
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
sleepy/mlx-lm#4
No description provided.