CRITICAL: Cache misses with q4 KV quant — model stops before tool calls #48

Closed
opened 2026-05-09 23:28:35 +02:00 by sleepy · 1 comment
Owner

Problem

After applying the fix for BatchQuantizedKVCache.finalize() _idx corruption (#47), the model still stops before tool calls when q4 KV quant is active.

Additionally, prefix cache lookups appear to miss (no cache hits) when q4 KV quant is enabled. Tool calling works correctly with the fp16 KV cache.

Model

  • Qwen3.6-27B-mxfp4 with q4 KV quant

Observed Behavior

  1. Model stops before tool calls (generates thinking tokens then halts)
  2. Cache does not appear to be hit — no prefix cache recovery
  3. Same model/config works correctly with fp16 KV cache

Hypotheses

  • The finalize() fix may have exposed or caused a different bug in the q4 KV quant path
  • Prefix cache reconstruction for QuantizedKVCache may be failing silently
  • BatchQuantizedKVCache state may be inconsistent after finalize(), causing cache storage to skip

Related Issues

  • #47 — BatchQuantizedKVCache.finalize() _idx corruption (re-opened)
  • #1263 — QuantizedKVCacheHandler reconstruct_cache overrides correct offset with meta_state
  • #1266 — BatchQuantizedKVCache _idx corruption during finalize and state operations

Acceptance Criteria

  • Cache hits are observed with q4 KV quant enabled
  • Model completes tool calls correctly with q4 KV quant
  • No regression for fp16 KV cache behavior
Author
Owner

Merged via PR #62 (squash). Root cause: _apply_quantized_kv() was called on restored prefix caches, destroying their fp16 KV data. Fix: skip quantization when existing_cache is provided, both in _do_external_prefill and _transition_to_mtp.
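
The guard described above can be illustrated with a minimal sketch. This is not the code from PR #62: `KVCache`, `apply_quantized_kv`, and `prepare_cache` are hypothetical stand-ins, and the stand-in quantization step is assumed to discard whatever data the cache already held, which is how the root cause is described here.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class KVCache:
    """Hypothetical stand-in for a per-request KV cache."""
    data: List[int] = field(default_factory=list)
    quantized: bool = False


def apply_quantized_kv(cache: KVCache) -> KVCache:
    """Stand-in for _apply_quantized_kv().

    Assumption: the real call swaps in a fresh quantized cache, so any
    KV data already stored in `cache` is lost.
    """
    return KVCache(data=[], quantized=True)


def prepare_cache(existing_cache: Optional[KVCache], kv_quant: bool) -> KVCache:
    """Pick the cache to use for prefill.

    A restored prefix cache is returned untouched; only a freshly
    created cache goes through the quantization step.
    """
    if existing_cache is not None:
        # Restored from the prefix cache: skip quantization entirely.
        return existing_cache
    fresh = KVCache()
    return apply_quantized_kv(fresh) if kv_quant else fresh
```

With this guard, `prepare_cache(restored, kv_quant=True)` hands back the restored cache with its data intact, while `prepare_cache(None, kv_quant=True)` still produces a quantized cache for requests with no prefix hit. The comment states the same check is needed at both call sites, _do_external_prefill and _transition_to_mtp.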

Reference
sleepy/omlx#48