bug: Model generates garbage output (repetitive tokens) #62

Closed
opened 2026-05-21 17:17:55 +02:00 by sleepy · 1 comment
Owner

The model produces garbage/repetitive output instead of coherent text.

Reproduction

./zig-out/bin/sleepy-llm generate --model ~/.sleepy-llm/models/Qwen3.5-0.8B-real --prompt "Hello" --max-tokens 20

Output: 和!!!!!!!!!!!!!!!!!!!

Expected: Coherent English text

Observations

  • Tokenizer encode/decode works correctly (verified separately)
  • Bench command runs: decode ~37-72 tok/s
  • First token after prefill is garbage, then repeats same token
  • CPU forward pass not yet verified

Possible Causes

  1. Weight loading: BF16 conversion incorrect or weights not mapped properly
  2. Metal kernels: Computation produces incorrect results
  3. KV cache: Not properly initialized or corrupted
  4. Attention computation: Wrong mask or scale factor
  5. Config mismatch: Model config doesn't match actual weights
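For cause 1, a small reference for BF16 conversion can help rule the loader in or out. The sketch below is not the project's Zig code. It assumes only the standard BF16 layout (the upper 16 bits of an IEEE-754 float32) and round-to-nearest-even when narrowing. The function names are hypothetical.

```python
import struct


def bf16_to_f32(bits: int) -> float:
    """Widen a raw 16-bit BF16 pattern to a float by placing it in the high half of a float32."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def f32_to_bf16(value: float) -> int:
    """Narrow a float32 value to BF16 bits using round-to-nearest-even."""
    bits = struct.unpack("<I", struct.pack("<f", value))[0]
    is_nan = (bits & 0x7F800000) == 0x7F800000 and (bits & 0x007FFFFF) != 0
    if is_nan:
        # Keep NaN a NaN: set a mantissa bit so truncation cannot turn it into Inf.
        return ((bits >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit, so exact ties round to the even pattern.
    rounding_bias = 0x7FFF + ((bits >> 16) & 1)
    return ((bits + rounding_bias) >> 16) & 0xFFFF
```

Feeding a few raw 16-bit values from the weight file through `bf16_to_f32` and comparing them with what the loader produces would show a wrong shift or byte order quickly.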

Verification Needed

  • Compare first-layer output against reference (mlx or Python)
  • Check for NaN/Inf in logits
  • Verify weight shapes and dtypes match config
  • Test with deterministic weights (identity matrix)
  • Compare GPU vs CPU forward pass output
Related

  • Issue #53 (performance target)
  • Correctness tests in src/tests/correctness.zig (hardcoded to 4B model)
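
Two of the checks under Verification Needed (NaN/Inf in logits, and comparing an output against a reference) can be sketched in plain Python. This assumes the logits or layer outputs have been dumped as flat lists of floats. The helper names are hypothetical and not part of sleepy-llm.

```python
import math


def find_non_finite(values):
    """Return the indices of NaN or +/-Inf entries; an empty list means all values are finite."""
    return [i for i, v in enumerate(values) if not math.isfinite(v)]


def max_abs_diff(actual, reference):
    """Largest element-wise absolute difference between two equal-length vectors.

    Run find_non_finite first: NaN entries make this comparison meaningless.
    """
    if len(actual) != len(reference):
        raise ValueError("length mismatch: %d vs %d" % (len(actual), len(reference)))
    return max((abs(a - b) for a, b in zip(actual, reference)), default=0.0)


def argmax(values):
    """Index of the largest value, i.e. the token a greedy sampler would pick."""
    return max(range(len(values)), key=values.__getitem__)
```

If the GPU and CPU (or reference) logits are finite but `argmax` disagrees on the first token after prefill, the layer where `max_abs_diff` first grows large is a reasonable place to start looking.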
Author
Owner

Merged. Three bugs were fixed: the linear_norm F32→BF16 conversion, undersized q_gate buffers, and the RMS norm formula (the scale factor is now 1 + weight instead of weight).

Benchmark on M4 Pro Qwen3.5-0.8B:

  • Decode: 79.87 tok/s (61% of MLX 130.3 tok/s)
  • Remaining gap: GEMV kernel optimization, kernel fusion, single command buffer
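
The RMS norm change can be illustrated with a minimal Python sketch. It is not the project's Metal or Zig kernel. The `eps` default and the `offset` parameter are assumptions made for illustration; only the `1 + weight` scaling comes from the fix described above.

```python
import math


def rms_norm(x, weight, eps=1e-6, offset=1.0):
    """RMS-normalize x, then scale each element by (offset + weight).

    offset=1.0 gives the corrected (1 + weight) form.
    offset=0.0 reproduces the earlier plain-weight form.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * (offset + w) for v, w in zip(x, weight)]
```

If the checkpoint stores norm weights as offsets around zero, which the fix suggests, then scaling by `weight` alone pushes activations toward zero, while `1 + weight` keeps them near unit scale.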

Reference
sleepy/sleepy-llm#62