Uncached Q8_0 attention path produces garbage (DS4_CUDA_NO_Q8_F16_CACHE is load-bearing for correctness) #1

Open
opened 2026-06-22 23:14:56 +02:00 by sleepy · 0 comments
Owner

Summary

DS4_CUDA_NO_Q8_F16_CACHE is documented and intended as a performance toggle: it disables the Q8→F16 dequant cache for attention projections and falls back to the uncached Q8_0 direct path, which should be slower with the same numerics. In practice it is load-bearing for correctness: with the cache disabled, the attention projection path emits garbage instead of merely running slower. The output is either mixed-script, unrelated tokens or clean but incoherent English ("schizo" output).

This was uncovered while fixing the decode-optimizations branch (commit 336eb38). On that branch, the env-cache conversion had flipped the polarity of the NO_Q8_F16_CACHE guard, which disabled the cache by default. Symptom, with no flag set:

$ DS4_CUDA_MANAGED_MODEL=1 DS4_KV_TURBO=1 ./ds4 -m K180 -p "What is the capital of France? One word." -n 32 --nothink --ctx 1048576
 ba十年前向中国建立 最新技术;
Model II 暴雨东时BIT
...

With the cache restored (default), the same prompt → Paris.

Repro

On current main (336eb38):

DS4_CUDA_MANAGED_MODEL=1 DS4_KV_TURBO=1 DS4_CUDA_NO_Q8_F16_CACHE=1 \
  ./ds4 -m ../DeepSeek-V4-Flash-REAP-K180-hybrid.gguf \
  -p "What is the capital of France? One word." -n 32 --nothink --ctx 1048576
# → garbage

Drop DS4_CUDA_NO_Q8_F16_CACHE=1 → Paris.

Why it matters

The cache (cuda_q8_f16_ptr in src/gpu/ds4_cuda.cu) dequants Q8_0 attention weights to F16 into a device-memory cache, avoiding managed-memory TLB overhead on the ~136 MB/layer attention weight traffic. When disabled, dispatch falls back to the Q8_0 direct matmul path. That fallback is expected to be correct and slow, not wrong. A perf toggle that silently breaks correctness is a footgun, and it also blocks using the env flag for A/B profiling.
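As a point of reference, here is a minimal host-side sketch of the arithmetic the dequant step is expected to perform. It assumes the standard GGML Q8_0 layout (34-byte blocks: one little-endian F16 scale followed by 32 signed 8-bit quants). The function names are hypothetical and this is not the code in cuda_q8_f16_ptr; it only shows the values that both the cached and uncached paths should agree on.

```python
import struct

QK8_0 = 32                # quants per Q8_0 block (GGML convention, assumed here)
BLOCK_BYTES = 2 + QK8_0   # one F16 scale + 32 int8 quants


def dequant_q8_0(raw: bytes) -> list[float]:
    """Dequantize a buffer of Q8_0 blocks to values rounded to F16 precision."""
    if len(raw) % BLOCK_BYTES != 0:
        raise ValueError("buffer is not a whole number of Q8_0 blocks")
    out = []
    for off in range(0, len(raw), BLOCK_BYTES):
        # "<e32b": little-endian half-precision scale, then 32 signed bytes.
        d, *qs = struct.unpack_from("<e32b", raw, off)
        for q in qs:
            # Round-trip through F16 to mimic a store into an F16 cache.
            # struct raises OverflowError if d * q exceeds the F16 range.
            out.append(struct.unpack("<e", struct.pack("<e", d * q))[0])
    return out


def pack_q8_0_block(d: float, qs: list[int]) -> bytes:
    """Build one Q8_0 block from a scale and 32 quants (hypothetical test helper)."""
    return struct.pack("<e32b", d, *qs)
```

Running a weight tensor dumped from the GGUF through a reference like this gives a ground truth against which either GPU path can be diffed.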

Likely area

cuda_q8_f16_cache_allowed / cuda_q8_f16_ptr in src/gpu/ds4_cuda.cu, and the Q8_0 matmul fallback in src/gpu/ds4_cuda_dispatch1.cuh (the matmul_q8_0_* kernels used for attn_output_a/b, attn_q_b, and the shared gate/up projections). Either:

  • a Q8_0 fallback kernel has a bug that only manifests for the attention-projection shapes (in_dim=1024/out_dim=32768 etc.), masked by the cache on the default path, or
  • the cache path and the direct path disagree on dequant/scaling (e.g. a Q8_0 block-scale or __dp4a path that the F16 cache bypasses).
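The second hypothesis can be checked off-device with a small numerical model of the two paths. The sketch below is an assumption-heavy illustration, not the project's kernels: it models the cached path as "dequantize weights, then do a float dot product" and the direct path as "quantize activations to int8 per 32-element block, do an integer dot, scale once by both block scales", which is the usual shape of a __dp4a-based Q8_0 kernel. Scales are kept in float rather than F16 for brevity, and all function names are hypothetical.

```python
import math
import random

QK = 32  # elements per quantization block (GGML Q8_0 convention, assumed)


def quantize_block(x):
    """Symmetric int8 quantization of one block: returns (scale, quants)."""
    amax = max(abs(v) for v in x)
    d = amax / 127.0
    if d == 0.0:
        return 0.0, [0] * len(x)
    return d, [max(-127, min(127, round(v / d))) for v in x]


def dot_dequant_first(w_blocks, x):
    """Cached-style path: expand weights to floats, then a plain dot product."""
    acc = 0.0
    for b, (d, qs) in enumerate(w_blocks):
        for i, q in enumerate(qs):
            acc += (d * q) * x[b * QK + i]
    return acc


def dot_int8_direct(w_blocks, x):
    """Direct-style path: integer dot per block, scaled by both block scales."""
    acc = 0.0
    for b, (dw, qw) in enumerate(w_blocks):
        dx, qx = quantize_block(x[b * QK:(b + 1) * QK])
        isum = sum(a * c for a, c in zip(qw, qx))  # the part __dp4a accumulates
        acc += dw * dx * isum
    return acc


def normalized_gap(n_blocks=32, seed=0):
    """|cached - direct| divided by ||w|| * ||x|| for one random row."""
    rng = random.Random(seed)
    n = n_blocks * QK
    w = [rng.gauss(0.0, 1.0) for _ in range(n)]
    x = [rng.gauss(0.0, 1.0) for _ in range(n)]
    w_blocks = [quantize_block(w[i:i + QK]) for i in range(0, n, QK)]
    a = dot_dequant_first(w_blocks, x)
    b = dot_int8_direct(w_blocks, x)
    scale = math.sqrt(sum(v * v for v in w) * sum(v * v for v in x))
    return abs(a - b) / scale
```

In this model the two paths differ only by activation-quantization noise, so the normalized gap stays small. A gap of order one on real tensors would point at a scale or indexing bug in the direct kernel rather than ordinary rounding.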

Scope

Out of scope for the decode-opt fix (336eb38) — the default path is correct. This issue tracks making DS4_CUDA_NO_Q8_F16_CACHE a safe perf toggle again (uncached path must be bit-identical to cached, just slower).

Acceptance

DS4_CUDA_NO_Q8_F16_CACHE=1 produces greedy output identical to the default (cached) path on verify.sh (math=Four, coherence=story, factual=Paris), just slower.
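The acceptance criterion can be approximated with a small A/B harness that runs the repro command twice, once per cache setting, and compares the generated text. This is a sketch under assumptions: it reuses only the environment variables and flags shown in the Repro section, it assumes ./ds4 writes just the generated text to stdout (any timing lines would need filtering first), and it assumes decoding is deterministic for this invocation. It does not model verify.sh, whose interface is not described here; the helper names are hypothetical.

```python
import os
import subprocess

FLAG = "DS4_CUDA_NO_Q8_F16_CACHE"

# Command and settings copied from the Repro section of this issue.
REPRO_CMD = [
    "./ds4", "-m", "../DeepSeek-V4-Flash-REAP-K180-hybrid.gguf",
    "-p", "What is the capital of France? One word.",
    "-n", "32", "--nothink", "--ctx", "1048576",
]
REPRO_ENV = {"DS4_CUDA_MANAGED_MODEL": "1", "DS4_KV_TURBO": "1"}


def build_env(base, uncached):
    """Copy `base`, apply the repro settings, and set or clear the cache flag."""
    env = dict(base)
    env.update(REPRO_ENV)
    env.pop(FLAG, None)  # make sure an inherited value cannot leak in
    if uncached:
        env[FLAG] = "1"
    return env


def run_ds4(cmd, uncached):
    """Run the command once and return its stdout."""
    result = subprocess.run(
        cmd, env=build_env(os.environ, uncached),
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def outputs_match(cmd, run=run_ds4):
    """True if cached and uncached runs produce the same text."""
    return run(cmd, False).strip() == run(cmd, True).strip()
```

Calling outputs_match(REPRO_CMD) from the directory that contains ./ds4 should return True once the uncached path is fixed. The run parameter exists so the comparison logic can be exercised without a GPU.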

Reference
sleepy/ds4-nvfp4-spark#1