CRITICAL: BatchQuantizedKVCache.finalize() corrupts _idx when batch items have unequal right padding #47
Loading…
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Summary
When
BatchQuantizedKVCache.finalize()is called with unequalright_paddingacross batch items,_idxis set to(self._idx - padding).max(), which represents the maximum remaining length across all items — not the correct length for each individual item.This causes
extract()to return fewer tokens than expected for items with smaller right padding, corrupting the cache and causing the model to stop generating (the "gives up after thinking block" bug).What Works
(padding).max() == paddingfor all items.What Doesn't Work
prepare(right_padding=...)is affected.Symptoms
finish_reason=stopprematurely (no actual stop token hit)quantized_kv_enabled=True(q4kv mode)Root Cause
In
omlx/cache/batch_quantized_cache.py,finalize()(line 125):Why this is wrong:
paddingis an array:[2, 5](different per batch item)self._idxbefore finalize:10(buffer size / valid tokens)self._idx = max(10-2, 10-5) = 810-5 = 5valid tokens, not 8!Then in
extract()(line 214-227):For item 1 (padding=5, left_padding=5 after finalize):
cache.offset = 8 - 5 = 3The causal mask is now based on 3 tokens instead of 5. The attention pattern is wrong, and the model stops generating.
Comparison with Reference Implementation
BatchKVCache(from mlx-lm) does not modify_idxinfinalize():In
BatchKVCache.extract():It extracts from
paddingto_idx(buffer size) and sets offset to the extracted shape. This works becausedynamic_rollrotates data within the buffer without changing its size.Historical Context
This bug was introduced in fix #45 (
491a2cd), which added the_idxmodification tofinalize()to address a different issue (state setter using buffer size instead of actual tokens). The test for #45 only tests with equal padding[2, 2], so the unequal-padding case was not covered.Suggested Fix
Remove the
_idxmodification fromfinalize()(line 125):Review the
statesetter to ensure_idxis set correctly when loading from saved state. Currently:This sets
_idxtomax(offset), which is the max valid token count across all items. If items have different lengths, this could cause the same issue as #45 was trying to fix (extracting too much data for shorter items). However, since the state getter trims keys/values to_idxbefore saving, all items should have the same length in the saved state. This needs verification.Add a test for unequal right padding in
finalize():Impact
Related
Fixed in commit
74c5b2d.Summary:
_idxmodification fromfinalize()—dynamic_rollrotates within the buffer, so_idxshould remain as the buffer boundary.statesetter to usekeys[0].shape[2](buffer size) instead ofoffset.max().test_finalize_with_unequal_right_padding.All 29 tests pass.
Re-opening: the fix for finalize() _idx corruption was applied, but the model still stops before tool calls when q4 KV quant is active. Cache also appears to be missing (not hitting). Tool calling works fine with fp16 KV cache.
Model: Qwen3.6-27B-mxfp4 with q4 KV quant.
Fixed in
74c5b2dand merged to main. Closing.