Decode produces repetitive output after first token #32

Closed

opened 2026-05-15 00:50:47 +02:00 by sleepy · 1 comment

sleepy commented

2026-05-15 00:50:47 +02:00

Owner

Problem

The GPU-only decode path (branch `perf/gpu-only-rewrite`) produces a correct first token but then degenerates into repetitive output (`. . ....`).

Debug output shows tokens oscillating between IDs 13 (",") and 220 ("."), then getting stuck on 13.

Prefill produces the correct first token (4858 = " everyone"). The second decode token (13 = ",") is also correct per the reference. The third decode token (220 = ".") is wrong; it should be " I" or similar.
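To pin down where decode first goes wrong, the generated token IDs can be compared position by position against the reference. The following is a minimal sketch in Python, not part of the project: `first_divergence` is a hypothetical helper, and the reference ID `40` for the third token is a made-up placeholder, since the issue does not state the expected ID.

```python
from typing import Optional, Sequence


def first_divergence(generated: Sequence[int], reference: Sequence[int]) -> Optional[int]:
    """Return the index of the first position where the sequences differ.

    Returns None when both sequences are identical. If one sequence is a
    strict prefix of the other, the index just past the shorter one is returned.
    """
    for i, (got, want) in enumerate(zip(generated, reference)):
        if got != want:
            return i
    if len(generated) != len(reference):
        return min(len(generated), len(reference))
    return None


# Generated IDs are taken from this report. The third reference ID (40) is a
# placeholder assumption; the issue only says it should be " I" or similar.
generated = [4858, 13, 220]
reference = [4858, 13, 40]
print(first_divergence(generated, reference))  # prints 2
```

An index of 2 here matches the observation above: the first two tokens agree with the reference and the third does not.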

Likely cause

K/V written by the first decode step corrupts the cache for subsequent steps. Possible causes:

- Wrong byte offset in the cache during copy_with_offset
- RoPE applied to the gate portion of q_gate_buf
- K-norm applied with the wrong stride

Acceptance

- `./zig-out/bin/sleepy-llm generate --model ~/.sleepy-llm/models/Qwen3.5-4B --prompt "Hello" --max-tokens 16` produces coherent English matching the reference
- No repetitive token loops

Max 2 attempts.
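The "no repetitive token loops" criterion could be checked mechanically on the generated IDs. Below is a rough Python sketch under stated assumptions: `has_repetitive_loop` is a hypothetical helper, and the thresholds (`max_period=4`, `min_repeats=3`) are arbitrary choices for illustration, not values taken from the project.

```python
from typing import Sequence


def has_repetitive_loop(tokens: Sequence[int], max_period: int = 4, min_repeats: int = 3) -> bool:
    """Return True if the sequence ends in a short cycle repeated back to back.

    A cycle of length `period` (1..max_period) counts as a loop when the last
    `period * min_repeats` tokens consist of that cycle repeated.
    """
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(tokens) < span:
            continue  # not enough tokens to see this many repeats
        tail = tokens[-span:]
        cycle = tail[:period]
        if all(tail[i] == cycle[i % period] for i in range(span)):
            return True
    return False


# Oscillation between 13 and 220, then stuck on 13, as described in this issue.
print(has_repetitive_loop([4858, 13, 220, 13, 220, 13, 13, 13]))  # prints True
print(has_repetitive_loop([1, 2, 3, 4, 5, 6, 7, 8]))              # prints False
```

This only inspects the tail of the output, so legitimately repeated text can trigger it; it is meant as a smoke test alongside the comparison against the reference, not a replacement for it.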

sleepy referenced this issue

2026-05-15 00:50:48 +02:00

Reach 37+ tok/s decode target (match llama.cpp) #34

sleepy referenced this issue from a commit

2026-05-15 03:12:01 +02:00

fix: decode KV cache corruption causing repetitive output (#32)

sleepy closed this issue

2026-05-15 03:16:50 +02:00

sleepy commented

2026-05-15 03:16:50 +02:00

Author

Owner

Fixed by 72d908e. Root cause: residual_buf was shared across layers, and the MLP output was silently dropped because the residual was updated to `input + attention` but never included `+ mlp`. Added a copy after the final residual_add in each layer. Decode now produces coherent multi-token output.
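The effect of the missing copy can be illustrated with a toy model. This is a simplified Python sketch, not the project's actual Zig code: `run_layers`, `toy_attn`, `toy_mlp`, and the `copy_back` flag are hypothetical names, and the sublayers return constants so the arithmetic is easy to follow.

```python
from typing import Callable, List, Tuple

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def add(a: Vector, b: Vector) -> Vector:
    """Element-wise sum, standing in for residual_add."""
    return [x + y for x, y in zip(a, b)]


def run_layers(x: Vector, layers: List[Tuple[Sublayer, Sublayer]], copy_back: bool) -> Vector:
    """Run (attention, mlp) layers over one residual buffer shared by all layers."""
    residual_buf = list(x)
    for attn, mlp in layers:
        # First residual add: residual_buf = input + attention.
        residual_buf[:] = add(residual_buf, attn(residual_buf))
        # Final residual add lands in a separate buffer.
        out = add(residual_buf, mlp(residual_buf))
        if copy_back:
            # The fix: copy the final sum back so the next layer sees "+ mlp".
            residual_buf[:] = out
        # Without the copy, `out` is discarded and the MLP contribution is lost.
    return residual_buf


def toy_attn(v: Vector) -> Vector:
    return [1.0 for _ in v]   # attention contributes +1 per layer


def toy_mlp(v: Vector) -> Vector:
    return [10.0 for _ in v]  # MLP contributes +10 per layer


layers = [(toy_attn, toy_mlp)] * 2
print(run_layers([0.0], layers, copy_back=False))  # prints [2.0]
print(run_layers([0.0], layers, copy_back=True))   # prints [22.0]
```

Without the copy, only the two attention contributions survive (2.0); with it, both MLP contributions are carried forward as well (22.0).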
