bug: Model generates garbage output (repetitive tokens) #62


Closed

opened 2026-05-21 17:17:55 +02:00 by sleepy · 1 comment

sleepy commented

2026-05-21 17:17:55 +02:00

Owner

The model produces garbage/repetitive output instead of coherent text.

Reproduction

./zig-out/bin/sleepy-llm generate --model ~/.sleepy-llm/models/Qwen3.5-0.8B-real --prompt "Hello" --max-tokens 20

Output: 和!!!!!!!!!!!!!!!!!!!

Expected: Coherent English text

Observations

Tokenizer encode/decode works correctly (verified separately)
Bench command runs: decode ~37-72 tok/s
First token after prefill is garbage (和); every later step then repeats a single token (!)
CPU forward pass not yet verified

Possible Causes

Weight loading: BF16 conversion incorrect or weights not mapped properly
Metal kernels: Computation produces incorrect results
KV cache: Not properly initialized or corrupted
Attention computation: Wrong mask or scale factor
Config mismatch: Model config doesn't match actual weights
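For the BF16 cause above, a small reference for the bit-level conversion may help when checking the loader. This is a minimal sketch in plain Python, not the project's Zig code; the function names are hypothetical. It relies only on the BF16 layout, which is the upper 16 bits of an IEEE-754 float32.

```python
import struct


def bf16_bits_to_f32(bits: int) -> float:
    """Widen a raw 16-bit bfloat16 pattern to float32.

    The 16 BF16 bits become the high half of the float32 word;
    the low 16 mantissa bits are zero.
    """
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def f32_to_bf16_bits(value: float) -> int:
    """Narrow a float32 value to bfloat16 bits using round-to-nearest-even."""
    (u,) = struct.unpack("<I", struct.pack("<f", value))
    # NaN: exponent all ones and non-zero mantissa. Truncate and force a
    # mantissa bit so the result cannot collapse into infinity.
    if (u & 0x7F800000) == 0x7F800000 and (u & 0x007FFFFF):
        return ((u >> 16) | 0x0040) & 0xFFFF
    # Add 0x7FFF plus the lowest kept bit, so exact ties round to even.
    rounding_bias = 0x7FFF + ((u >> 16) & 1)
    return ((u + rounding_bias) >> 16) & 0xFFFF
```

A conversion that shifts the wrong way, or treats the 16 bits as IEEE half precision (F16), still yields finite numbers, so a wrong conversion would not necessarily show up in a NaN/Inf check.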

Verification Needed

Compare first-layer output against reference (mlx or Python)
Check for NaN/Inf in logits
Verify weight shapes and dtypes match config
Test with deterministic weights (identity matrix)
Compare GPU vs CPU forward pass output
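The NaN/Inf and GPU-vs-CPU checks above could be scripted roughly as follows once logits (or any intermediate tensor) from both paths are available as flat lists of floats. How sleepy-llm would dump those values is not covered in this issue, so the input format and the helper names here are assumptions.

```python
import math


def count_non_finite(values):
    """Return how many entries are NaN or +/-Inf."""
    return sum(1 for v in values if math.isnan(v) or math.isinf(v))


def max_abs_diff(a, b):
    """Largest element-wise absolute difference between two equal-length tensors."""
    if len(a) != len(b):
        raise ValueError(f"length mismatch: {len(a)} vs {len(b)}")
    return max((abs(x - y) for x, y in zip(a, b)), default=0.0)


def argmax(values):
    """Index of the largest value, i.e. the token greedy decoding would pick.

    Only meaningful when count_non_finite(values) == 0.
    """
    return max(range(len(values)), key=values.__getitem__)
```

Running `count_non_finite` first matters because comparisons involving NaN are always false, which makes both `max_abs_diff` and `argmax` unreliable on corrupted logits.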

Related

Issue #53 (performance target)
Correctness tests in src/tests/correctness.zig (hardcoded to 4B model)


sleepy referenced this issue from a commit

2026-05-22 12:08:49 +02:00

debug(#62): fix linear_norm f32→bf16 dtype conversion, add debug infrastructure

sleepy referenced this issue from a commit

2026-05-22 12:58:37 +02:00

fix(#62): correct RMS norm formula for Qwen3.5 — use (1+w) not w

sleepy referenced this issue from a commit

2026-05-22 13:00:43 +02:00

fix(#62): correct model inference — RMS norm formula, BF16 conversion, buffer sizes

sleepy closed this issue

2026-05-22 13:00:43 +02:00

sleepy commented

2026-05-22 13:02:14 +02:00

Author

Owner

Merged. Three bugs fixed: linear_norm F32→BF16 conversion, undersized q_gate buffers, and RMS norm formula (weight → 1+weight).
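To illustrate the RMS norm fix (weight → 1+weight), here is a minimal reference sketch in plain Python. It is not the Metal kernel or the Zig code from the commits; the function name, the `add_unit_offset` flag, and the `eps` default are assumptions chosen for illustration.

```python
import math


def rms_norm(x, weight, eps=1e-6, add_unit_offset=True):
    """RMS-normalize x, then scale per element.

    add_unit_offset=True scales by (1 + weight), the corrected form.
    add_unit_offset=False scales by weight alone, the form used before the fix.
    eps is an assumed default; the real value should come from the model config.
    """
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    scale = [(1.0 + w) if add_unit_offset else w for w in weight]
    return [v * inv_rms * s for v, s in zip(x, scale)]


# Hypothetical values: if stored weights sit near zero, the plain-weight form
# scales every activation toward zero, while (1 + weight) keeps them near unit scale.
x = [3.0, 4.0]
w = [0.0, 0.0]
with_offset = rms_norm(x, w)                            # roughly [0.85, 1.13]
without_offset = rms_norm(x, w, add_unit_offset=False)  # [0.0, 0.0]
```

The two forms differ only in the scale vector, so comparing a single norm layer's output against a reference implementation should be enough to tell them apart.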

Benchmark on M4 Pro Qwen3.5-0.8B:

Decode: 79.87 tok/s (61% of MLX 130.3 tok/s)
Remaining gap: GEMV kernel optimization, kernel fusion, single command buffer
