Decode produces repetitive output after first token #32

Closed
opened 2026-05-15 00:50:47 +02:00 by sleepy · 1 comment
Owner

Problem

The GPU-only decode path (branch perf/gpu-only-rewrite) produces a correct first token but then degenerates into repetitive output (. . ....).

Debug output shows the tokens oscillating between IDs 13 (",") and 220 ("."), then getting stuck on 13.

Prefill produces the correct first token (4858 = " everyone"). The second decode token (13 = ",") is also correct per the reference. The third decode token (220 = ".") is wrong; it should be " I" or similar.

Likely cause

The K/V written by the first decode step may corrupt the cache for subsequent steps. Possible causes:

  • Wrong byte offset in cache during copy_with_offset
  • RoPE applied to gate portion of q_gate_buf
  • K-norm applied to wrong stride
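To make the first hypothesis easier to check, here is a minimal sketch of how a per-token byte offset into a K/V cache could be computed and sanity-checked. It is not taken from sleepy-llm: the contiguous [layer][pos][kv_head][head_dim] layout, the function names, and all parameters are assumptions for illustration only.

```python
def kv_byte_offset(layer, pos, max_seq_len, n_kv_heads, head_dim, elem_size):
    """Byte offset of the cache row for token `pos` in layer `layer`.

    Assumes a hypothetical contiguous layout [layer][pos][kv_head][head_dim].
    """
    row_elems = n_kv_heads * head_dim        # elements written per token
    layer_elems = max_seq_len * row_elems    # elements reserved per layer
    # Multiply by elem_size last; forgetting it yields an element offset,
    # which makes consecutive rows overlap when used as a byte offset.
    return (layer * layer_elems + pos * row_elems) * elem_size


def rows_are_disjoint(n_steps, **layout):
    """True if consecutive positions in layer 0 are exactly one row apart."""
    row_bytes = layout["n_kv_heads"] * layout["head_dim"] * layout["elem_size"]
    offsets = [kv_byte_offset(0, p, **layout) for p in range(n_steps)]
    return all(b - a == row_bytes for a, b in zip(offsets, offsets[1:]))
```

Comparing offsets like these against the values actually passed to copy_with_offset would confirm or rule out this hypothesis, provided the real cache layout is substituted for the assumed one.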

Acceptance

  • ./zig-out/bin/sleepy-llm generate --model ~/.sleepy-llm/models/Qwen3.5-4B --prompt "Hello" --max-tokens 16 produces coherent English matching reference
  • No repetitive token loops
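The "no repetitive token loops" criterion could be checked mechanically on the generated token IDs. The helper below is a hypothetical sketch, not part of sleepy-llm; the name and the thresholds (max_period, min_repeats) are assumptions, and it only inspects the tail of the sequence.

```python
def has_repetitive_loop(token_ids, max_period=4, min_repeats=3):
    """True if the sequence ends in a cycle of length <= max_period
    that repeats at least min_repeats times back to back."""
    n = len(token_ids)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if n < span:
            continue  # not enough tokens to show this many repeats
        tail = token_ids[n - span:]
        # Every element must equal the element at the same phase of the cycle.
        if all(tail[i] == tail[i % period] for i in range(span)):
            return True
    return False
```

With these defaults, a run that ends stuck on a single ID, or alternating between two IDs as described in the debug output, would be flagged. Legitimate short repetitions in real text can also trigger it, so the thresholds may need tuning.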

Max 2 attempts.

Author
Owner

Fixed by 72d908e. Root cause: residual_buf was shared across layers, and the MLP output was silently dropped: the residual was updated to input + attention but never included + mlp. Added a copy after the final residual_add in each layer. Decode now produces coherent multi-token output.
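The failure mode described here can be reproduced in a toy model. The sketch below is not the code from 72d908e; it assumes the final residual_add wrote into a separate scratch buffer that was never copied back into the shared residual buffer, and attention and mlp are placeholder functions with made-up arithmetic.

```python
def add(a, b):
    return [x + y for x, y in zip(a, b)]


def attention(x):
    # Toy stand-in for the attention block (assumption, not the real op).
    return [0.5 * v for v in x]


def mlp(x):
    # Toy stand-in for the MLP block (assumption, not the real op).
    return [v + 1.0 for v in x]


def run_layers(x, n_layers, copy_back):
    residual_buf = list(x)  # one buffer shared by every layer
    for _ in range(n_layers):
        # First residual add: input + attention, stored in the shared buffer.
        residual_buf = add(residual_buf, attention(residual_buf))
        # Final residual add: (input + attention) + mlp, stored in scratch.
        scratch = add(residual_buf, mlp(residual_buf))
        if copy_back:
            # The fix: carry the final sum into the buffer the next layer reads.
            residual_buf = list(scratch)
        # Without the copy, the next layer sees input + attention only.
    return residual_buf
```

Running one layer on [1.0, 2.0] gives [1.5, 3.0] without the copy and [4.0, 7.0] with it, which shows the MLP contribution disappearing from the residual stream when the copy is missing.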

Reference
sleepy/sleepy-llm#32