Uncached Q8_0 attention path produces garbage (DS4_CUDA_NO_Q8_F16_CACHE is load-bearing for correctness) #1
Summary

`DS4_CUDA_NO_Q8_F16_CACHE` is documented and intended as a performance toggle: it disables the Q8→F16 dequant cache for attention projections and falls back to the uncached Q8_0 direct path (slower, same numerics). In practice it is load-bearing for correctness: with the cache disabled, the attention projection path emits garbage (mixed-script / unrelated tokens, or clean English that is incoherent, i.e. "schizo" output) instead of merely running slower.

This was uncovered while fixing the `decode-optimizations` branch (commit `336eb38`). The env-cache conversion flipped the polarity of the `NO_Q8_F16_CACHE` guard, disabling the cache by default. Symptom:

With the cache restored (default), the same prompt → `Paris.`

Repro
On current `main` (`336eb38`):

Drop `DS4_CUDA_NO_Q8_F16_CACHE=1` → `Paris.`

Why it matters
The cache (`cuda_q8_f16_ptr` in `src/gpu/ds4_cuda.cu`) dequantizes Q8_0 attention weights to F16 into a device-memory cache, avoiding managed-memory TLB overhead on the ~136 MB/layer attention weight traffic. When disabled, dispatch falls back to the Q8_0 direct matmul path. That fallback is expected to be correct and slow, not wrong. A perf toggle that silently breaks correctness is a footgun, and it also blocks using the env flag for A/B profiling.

Likely area
`src/gpu/ds4_cuda.cu`: `cuda_q8_f16_cache_allowed` / `cuda_q8_f16_ptr`, and the Q8_0 matmul fallback in `src/gpu/ds4_cuda_dispatch1.cuh` (`matmul_q8_0_*` kernels used for `attn_output_a/b`, `attn_q_b`, shared gate/up). Either:

… `__dp4a` path that the F16 cache bypasses).

Scope
Out of scope for the decode-opt fix (`336eb38`): the default path is correct. This issue tracks making `DS4_CUDA_NO_Q8_F16_CACHE` a safe perf toggle again (uncached path must be bit-identical to cached, just slower).

Acceptance
`DS4_CUDA_NO_Q8_F16_CACHE=1` produces greedy output identical to the default (cached) path on `verify.sh` (math=Four, coherence=story, factual=Paris), just slower.
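
A host-side reference check may help narrow down where the two paths diverge before running the full acceptance test. The sketch below is not taken from the repository. It assumes a ggml-style Q8_0 layout (32 int8 quants sharing one scale, with the scale held as `float` here rather than FP16 for portability) and assumes activations are already quantized the same way. `Q8Block`, `dot_dequant`, `dot_direct`, `paths_agree`, and `make_blocks` are all hypothetical names. It compares a dequantize-then-multiply dot product (analogous to the cached F16 path) against an integer dot product scaled once per block (analogous to what a `__dp4a`-style kernel computes).

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

// ASSUMPTION: ggml-style Q8_0 block, 32 int8 quants sharing one scale.
// The on-device format would store the scale as FP16; float keeps this portable.
constexpr int kQ8BlockSize = 32;

struct Q8Block {
    float  scale;
    int8_t q[kQ8BlockSize];
};

// Reference path: dequantize every element first, then accumulate in float.
float dot_dequant(const std::vector<Q8Block>& w, const std::vector<Q8Block>& x) {
    assert(w.size() == x.size());
    float acc = 0.0f;
    for (std::size_t b = 0; b < w.size(); ++b) {
        for (int i = 0; i < kQ8BlockSize; ++i) {
            acc += (w[b].scale * w[b].q[i]) * (x[b].scale * x[b].q[i]);
        }
    }
    return acc;
}

// Direct path: integer dot product per block, scaled once per block.
// A __dp4a-style kernel performs the inner sum four int8 lanes at a time.
float dot_direct(const std::vector<Q8Block>& w, const std::vector<Q8Block>& x) {
    assert(w.size() == x.size());
    float acc = 0.0f;
    for (std::size_t b = 0; b < w.size(); ++b) {
        int32_t isum = 0;
        for (int i = 0; i < kQ8BlockSize; ++i) {
            isum += static_cast<int32_t>(w[b].q[i]) * static_cast<int32_t>(x[b].q[i]);
        }
        acc += w[b].scale * x[b].scale * static_cast<float>(isum);
    }
    return acc;
}

// True when both paths agree within a relative tolerance.
bool paths_agree(const std::vector<Q8Block>& w, const std::vector<Q8Block>& x,
                 float rel_tol) {
    const float a = dot_dequant(w, x);
    const float b = dot_direct(w, x);
    const float mag = std::max(1.0f, std::max(std::fabs(a), std::fabs(b)));
    return std::fabs(a - b) <= rel_tol * mag;
}

// Deterministic test data: quants cycle through [-127, 127].
std::vector<Q8Block> make_blocks(int n_blocks, float scale, int seed) {
    std::vector<Q8Block> out(static_cast<std::size_t>(n_blocks));
    for (int b = 0; b < n_blocks; ++b) {
        out[b].scale = scale;
        for (int i = 0; i < kQ8BlockSize; ++i) {
            out[b].q[i] = static_cast<int8_t>(((i * 7 + seed + b * 13) % 255) - 127);
        }
    }
    return out;
}
```

Calling `paths_agree(make_blocks(4, 0.5f, 3), make_blocks(4, 0.25f, 11), 1e-5f)` exercises both paths on the same data. If a similar comparison against the real kernel output fails for one projection (for example `attn_q_b`) but not another, that would point at the fallback kernel rather than the cache guard. A tolerance is used because accumulation order may differ between the two paths; whether the real kernels can be made bit-identical, as the Scope section requires, is something this sketch does not establish.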