#48 - CRITICAL: Cache misses with q4 KV quant — model stops before tool calls - sleepy/omlx

sleepy commented

2026-05-09 23:28:35 +02:00

Owner

Problem

After applying the fix for BatchQuantizedKVCache.finalize() _idx corruption (#47), the model still stops before tool calls when q4 KV quant is active.

Additionally, the prefix cache does not appear to be hit when q4 KV quant is enabled. Tool calling works fine with fp16 KV cache.

Model

Qwen3.6-27B-mxfp4 with q4 KV quant

Observed Behavior

Model stops before tool calls (generates thinking tokens then halts)
Cache does not appear to be hit — no prefix cache recovery
Same model/config works correctly with fp16 KV cache

Hypotheses

The finalize() fix may have exposed or caused a different bug in the q4 KV quant path
Prefix cache reconstruction for QuantizedKVCache may be failing silently
BatchQuantizedKVCache state may be inconsistent after finalize(), causing cache storage to skip

Related Issues

#47 — BatchQuantizedKVCache.finalize() _idx corruption (re-opened)
#1263 — QuantizedKVCacheHandler reconstruct_cache overrides correct offset with meta_state
#1266 — BatchQuantizedKVCache _idx corruption during finalize and state operations

Acceptance Criteria

Cache hits are observed with q4 KV quant enabled
Model completes tool calls correctly with q4 KV quant
No regression for fp16 KV cache behavior


sleepy referenced this issue

2026-05-14 23:29:17 +02:00

[perf] Pre-allocated paged q4 KV cache to eliminate prefill memory spikes #51

sleepy referenced this issue from a commit

2026-05-15 01:32:08 +02:00

fix: skip KV quantization for restored prefix caches (#48)

sleepy referenced this issue from a pull request that will close it

2026-05-15 01:33:32 +02:00

[cache] Skip q4 KV quantization on restored prefix caches (#48) #62

sleepy referenced this issue from a commit

2026-05-15 01:37:41 +02:00

[cache] Skip q4 KV quantization on restored prefix caches (#48) (#62)

sleepy closed this issue

2026-05-15 01:37:41 +02:00

sleepy commented

2026-05-15 01:37:51 +02:00

Author

Owner

Merged via PR #62 (squash). Root cause: _apply_quantized_kv() was called on restored prefix caches, destroying their fp16 KV data. Fix: skip quantization when existing_cache is provided, both in _do_external_prefill and _transition_to_mtp.
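
For readers who want to see the shape of the guard described above, here is a minimal, self-contained sketch. It is not the code from PR #62. `ToyKVCache`, `quantize_q4`, and `prepare_cache` are hypothetical stand-ins invented for illustration, and the toy quantizer only mimics the lossy nature of 4-bit quantization. The only name taken from this thread is the `existing_cache` parameter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ToyKVCache:
    """Hypothetical stand-in for one layer's KV cache (not an omlx class)."""
    values: List[float]
    quantized: bool = False


def quantize_q4(cache: ToyKVCache) -> ToyKVCache:
    """Toy 4-bit quantizer: snap each value to one of 16 levels in [-1, 1].

    This is lossy, which is why running it over restored data changes that data.
    """
    step = 2.0 / 15  # 16 levels span 15 intervals
    snapped = []
    for v in cache.values:
        clamped = max(-1.0, min(1.0, v))
        snapped.append(round((clamped + 1.0) / step) * step - 1.0)
    return ToyKVCache(values=snapped, quantized=True)


def prepare_cache(
    num_tokens: int,
    kv_bits: Optional[int],
    existing_cache: Optional[ToyKVCache] = None,
) -> ToyKVCache:
    """Return the cache to use for prefill (assumed control flow).

    A restored prefix cache is passed through untouched. Only a freshly
    created cache is quantized when q4 KV quant is requested.
    """
    if existing_cache is not None:
        # Restored prefix cache: skip quantization so its data survives.
        return existing_cache
    fresh = ToyKVCache(values=[0.0] * num_tokens)
    if kv_bits == 4:
        return quantize_q4(fresh)
    return fresh


restored = ToyKVCache(values=[0.123, -0.456, 0.789])
reused = prepare_cache(num_tokens=3, kv_bits=4, existing_cache=restored)
print(reused is restored, reused.quantized)
```

In this sketch the guard lives in one place. According to the comment above, the actual fix applies the same check at two call sites, `_do_external_prefill` and `_transition_to_mtp`.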