Fix 6 training bugs causing impossible loss values (~22.3 vs max ~10.4) #1

Open

opened 2026-05-01 14:01:26 +02:00 by sleepy · 0 comments

(Migrated from localhost:18431)

Problem

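The sweep produced a flat loss of ~22.3 across all 12 Stage 1 trials, well above ln(32768) ≈ 10.4, the cross-entropy of a uniform guess over the 32768-token vocabulary. Cross-entropy has no hard upper bound, but a correctly wired model should start near this value and improve from there, so a loss stuck at roughly twice the uniform baseline points to bugs in the training code.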
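
As a quick check of the numbers above, the uniform-guess baseline can be recomputed directly. This is a minimal sketch; the names `VOCAB_SIZE`, `OBSERVED_LOSS`, and `uniform_cross_entropy` are introduced here for illustration and do not come from the codebase.

```python
import math

VOCAB_SIZE = 32768     # vocabulary size implied by log(32768) in this issue
OBSERVED_LOSS = 22.3   # flat loss reported by the sweep

def uniform_cross_entropy(vocab_size: int) -> float:
    """Cross-entropy in nats of a model that assigns 1/vocab_size to every token."""
    return math.log(vocab_size)

baseline = uniform_cross_entropy(VOCAB_SIZE)   # ln(2**15) = 15 * ln(2), about 10.397
excess = OBSERVED_LOSS - baseline              # nats above the uniform baseline

# Average probability the model would have to put on the correct token
# to produce the observed loss: exp(-loss).
implied_prob = math.exp(-OBSERVED_LOSS)

print(f"uniform baseline: {baseline:.3f} nats")
print(f"observed excess:  {excess:.3f} nats")
print(f"implied p(correct): {implied_prob:.3e} vs uniform {1 / VOCAB_SIZE:.3e}")
```

A check of this kind (loss at initialization close to `baseline`) is a cheap assertion to include in a smoke test.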

Root Causes (6 bugs identified)

Bug 1: Missing residual connections in TransformerBlock

The attention output is not added back to the block input, so each layer overwrites the signal instead of refining it.

Bug 2: Missing PreNorm in TransformerBlock

No normalization is applied before the attention and FFN sublayers, so activation variance grows unchecked with depth.
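
The intended structure for Bugs 1 and 2 can be sketched in a few lines of plain Python. This is an illustration under assumptions, not the code in fixes/attention.py: it uses RMS normalization without a learned gain, treats `attention` and `ffn` as arbitrary callables, and wraps the FFN in a residual as well, which is the usual pre-norm layout but is not stated in this issue.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain in this sketch)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def pre_norm_residual(x, sublayer):
    """Return x + sublayer(norm(x)): normalize first, then add the update back."""
    update = sublayer(rms_norm(x))
    return [xi + ui for xi, ui in zip(x, update)]

def transformer_block(x, attention, ffn):
    """Pre-norm block: each sublayer sees a normalized input and feeds a residual."""
    x = pre_norm_residual(x, attention)  # residual around attention (Bug 1)
    x = pre_norm_residual(x, ffn)        # normalization before each sublayer (Bug 2)
    return x

# With sublayers that output zeros, the block must be the identity.
# That is exactly the property a block without residuals cannot have.
zero = lambda v: [0.0] * len(v)
identity_out = transformer_block([1.0, 2.0, 3.0], zero, zero)
```

Because the input is carried through unchanged and only the sublayer update is added, a layer that has learned nothing leaves the signal intact instead of destroying it.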

Bug 3: RoPE frequency interleaving

RoPE applies the same frequencies to all dimensions instead of alternating them across dimension pairs, so the positional encoding is broken.
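
For reference, here is a minimal sketch of rotary embeddings in which every dimension pair gets its own frequency. It assumes the interleaved pairing (`x[2i]`, `x[2i+1]`) and the commonly used base of 10000; the project may pair dimensions differently (for example, first half with second half), and the function names are hypothetical.

```python
import math

def rope_frequencies(head_dim, base=10000.0):
    """One frequency per dimension pair: theta_i = base ** (-2i / head_dim)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]

def apply_rope(x, position, base=10000.0):
    """Rotate each interleaved pair (x[2i], x[2i+1]) by the angle position * theta_i."""
    freqs = rope_frequencies(len(x), base)
    out = []
    for i, theta in enumerate(freqs):
        angle = position * theta
        c, s = math.cos(angle), math.sin(angle)
        a, b = x[2 * i], x[2 * i + 1]
        out.extend([a * c - b * s, a * s + b * c])  # 2D rotation of the pair
    return out

x = [1.0, 2.0, 3.0, 4.0]
freqs = rope_frequencies(len(x))   # frequencies decrease from pair to pair
rotated = apply_rope(x, position=5)
```

Two properties are easy to assert in a smoke test: the rotation preserves the vector norm, and position 0 leaves the vector unchanged. If all pairs share one frequency, the first property still holds, so a test should also check that `freqs` contains distinct values.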

Bug 4: BitLinear variance preservation scaling

The variance-preservation scale is computed from the latent (continuous) weights instead of the ternary (quantized) weights used in the forward pass, so the variance estimate is wrong.
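
The size of this mismatch can be illustrated with a small sketch. Everything here is an assumption for illustration: the absmean ternarization follows the scheme commonly used for ternary BitLinear layers, and `variance_scale` is a hypothetical formula, 1/sqrt(fan_in * E[w^2]), standing in for whatever scaling fixes/bitlinear.py actually applies.

```python
import math
import random

def ternarize(weights, eps=1e-8):
    """Absmean ternarization: divide by mean |w|, round, clip to {-1, 0, 1}."""
    gamma = sum(abs(w) for w in weights) / len(weights)
    return [max(-1, min(1, round(w / (gamma + eps)))) for w in weights]

def variance_scale(weights, eps=1e-8):
    """Hypothetical scale 1/sqrt(fan_in * E[w^2]) giving unit output variance
    for unit-variance inputs multiplied by `weights`."""
    fan_in = len(weights)
    second_moment = sum(w * w for w in weights) / fan_in
    return 1.0 / math.sqrt(fan_in * max(second_moment, eps))

rng = random.Random(0)
latent = [rng.gauss(0.0, 0.02) for _ in range(256)]  # small continuous weights
ternary = ternarize(latent)                          # weights used in the forward pass

buggy_scale = variance_scale(latent)    # computed from the wrong tensor
fixed_scale = variance_scale(ternary)   # computed from the tensor actually multiplied

print(f"scale from latent weights:  {buggy_scale:.3f}")
print(f"scale from ternary weights: {fixed_scale:.3f}")
```

Since the latent weights are much smaller in magnitude than values in {-1, 0, 1}, a scale derived from them overestimates the correction by a large factor under these assumptions.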

Bug 5: Data input/label misalignment

A BOS token is prepended to the inputs and a pad token is appended to the labels, which introduces systematic target noise.
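
One conventional way to keep inputs and labels aligned for next-token prediction is to cut both from the same window, shifted by one position, and to mask padding out of the loss. The sketch below assumes that approach; `make_example` and `IGNORE_INDEX` are hypothetical names, and -100 is chosen only because it is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
IGNORE_INDEX = -100  # assumed label value for positions excluded from the loss

def make_example(token_ids, seq_len, pad_id):
    """Build (inputs, labels) where labels[i] is the token that follows inputs[i]."""
    window = token_ids[: seq_len + 1]   # one extra token supplies the final label
    inputs = window[:-1]
    labels = window[1:]
    pad = seq_len - len(inputs)         # only non-zero for short sequences
    # Pad inputs with pad_id, but mark padded label positions as ignored
    # so they never contribute to the loss.
    return inputs + [pad_id] * pad, labels + [IGNORE_INDEX] * pad

inputs, labels = make_example([5, 6, 7, 8], seq_len=3, pad_id=0)
short_inputs, short_labels = make_example([5, 6, 7], seq_len=4, pad_id=0)
```

With this layout no pad token ever appears as a training target, and a unit test can assert that each label equals the token following its input.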

Bug 6: Engram compress_token_ids per-token loop

compress_token_ids is called once per token (O(n) individual calls) instead of once per batch, which is a severe performance issue.
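
The difference between the two call patterns can be shown with a stand-in. The `Compressor` class and its modulo lookup table below are hypothetical and only mimic the shape of the problem; the real `compress_token_ids` in the Engram module is not shown in this issue.

```python
class Compressor:
    """Hypothetical stand-in that maps token ids into a smaller id space."""

    def __init__(self, vocab_size, compressed_size):
        self.table = [t % compressed_size for t in range(vocab_size)]
        self.calls = 0  # counts invocations to make per-call overhead visible

    def compress_token_ids(self, token_ids):
        """Compress a whole sequence of token ids in a single call."""
        self.calls += 1
        return [self.table[t] for t in token_ids]

token_ids = list(range(0, 1000, 7))

# Buggy pattern: one call per token, so the call count grows with sequence length.
slow = Compressor(vocab_size=1000, compressed_size=64)
per_token = [slow.compress_token_ids([t])[0] for t in token_ids]

# Fixed pattern: one call for the whole sequence, same result.
fast = Compressor(vocab_size=1000, compressed_size=64)
batched = fast.compress_token_ids(token_ids)
```

In pure Python the two variants cost about the same, but when each call launches tensor operations or crosses into native code, the fixed per-call overhead multiplied by the number of tokens is what dominates.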

Fix Status

- [x] fixes/bitlinear.py: Bug 4 fixed
- [x] fixes/attention.py: Bugs 1, 2, 3 fixed
- [x] fixes/data.py: Bug 5 fixed
- [x] fixes/engram.py: Bug 6 fixed
- [x] fixes/smoke_test.py: validation tests written
- [ ] Smoke test passed on GPU
- [ ] Stage 1 retrain completed

Files

See bug-fixes.md for full analysis.

Reference

sleepy/ternary#1
