Fix 6 training bugs causing impossible loss values (~22.3 vs max ~10.4) #1
Problem
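Sweep produced a flat loss of ~22.3 across all 12 Stage 1 trials. That exceeds ln(32768) ≈ 10.4, the cross-entropy of a uniform guess over the 32768-token vocabulary, which is the practical ceiling for a correctly wired model at initialization.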
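The reference value can be checked with a few lines of arithmetic. This is a minimal sketch; the names `VOCAB_SIZE`, `uniform_ce`, and `observed_loss` are illustrative and not taken from the training code. Cross-entropy is unbounded in principle, so ~10.4 is the loss of a model that predicts every token with equal probability, not a hard mathematical limit.

```python
import math

VOCAB_SIZE = 32768  # vocabulary size implied by the log(32768) figure

# Cross-entropy (in nats) of a uniform prediction over the vocabulary.
uniform_ce = math.log(VOCAB_SIZE)

observed_loss = 22.3  # flat loss reported by the sweep

print(f"uniform-guess CE: {uniform_ce:.2f} nats")
print(f"observed / uniform: {observed_loss / uniform_ce:.2f}x")
```

A loss about twice the uniform-guess value means the model is confidently wrong, which points at wiring or data bugs more than at hyperparameters.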
Root Causes (6 bugs identified)
Bug 1: Missing residual connections in TransformerBlock
Attention output not added to input — signal destroyed per layer.
Bug 2: Missing PreNorm in TransformerBlock
No normalization before attention/FFN — activation variance explodes.
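Bugs 1 and 2 together describe the usual pre-norm residual pattern, `x + sublayer(norm(x))`. The sketch below shows that pattern on plain Python lists. It is not the project's `TransformerBlock`: the helper names are hypothetical, RMSNorm without a learned gain is an assumed choice of normalization, and the sublayers are stand-ins.

```python
import math

def rms_norm(x, eps=1e-6):
    """Scale a vector to unit root-mean-square (no learned gain, for brevity)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def prenorm_residual(x, sublayer):
    """Pre-norm residual step: x + sublayer(norm(x))."""
    y = sublayer(rms_norm(x))
    return [a + b for a, b in zip(x, y)]

def transformer_block(x, attention, ffn):
    """One block: attention, then FFN, each wrapped in a pre-norm residual."""
    x = prenorm_residual(x, attention)
    x = prenorm_residual(x, ffn)
    return x

# Stand-in sublayer (hypothetical): any function mapping a vector to a vector.
def zero_sublayer(v):
    return [0.0 for _ in v]

x = [1.0, -2.0, 3.0, 0.5]
out = transformer_block(x, zero_sublayer, zero_sublayer)
# With sublayers that output zeros, the residual path carries the input through unchanged.
```

The residual path keeps the input available to every layer, and normalizing only the sublayer input keeps the sublayer's operating range stable without rescaling the residual stream itself.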
Bug 3: RoPE frequency interleaving
RoPE applies the same frequencies to all dimensions instead of interleaving them across dimension pairs, so the positional encoding is broken.
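For reference, a minimal sketch of standard RoPE under the interleaved convention, where dimensions `(2i, 2i+1)` form pair `i` and each pair gets its own frequency `base**(-2i/d)`. The function name `rope_rotate` and the default `base=10000.0` are assumptions; the PR does not state which pairing convention or base the fixed code uses.

```python
import math

def rope_rotate(vec, position, base=10000.0):
    """Apply rotary position embedding to one vector of even length.

    Assumes the interleaved convention: dims (2i, 2i+1) form pair i,
    rotated by the angle position * base**(-2i/d).
    """
    d = len(vec)
    if d % 2 != 0:
        raise ValueError("RoPE needs an even head dimension")
    out = [0.0] * d
    for i in range(d // 2):
        theta = position * base ** (-2.0 * i / d)  # distinct frequency per pair
        c, s = math.cos(theta), math.sin(theta)
        x0, x1 = vec[2 * i], vec[2 * i + 1]
        out[2 * i] = x0 * c - x1 * s
        out[2 * i + 1] = x0 * s + x1 * c
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [0.3, -1.2, 0.8, 0.5]
k = [1.1, 0.4, -0.7, 0.9]
# The query-key score depends only on the position offset (2 in both cases).
score_a = dot(rope_rotate(q, 3), rope_rotate(k, 1))
score_b = dot(rope_rotate(q, 5), rope_rotate(k, 3))
```

Two quick sanity checks follow from the construction: rotation preserves the vector norm, and the score between a rotated query and key depends only on their relative offset.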
Bug 4: BitLinear variance preservation scaling
The scale is computed from the latent (continuous) weights instead of the ternary (quantized) weights used in the forward pass, which gives a wrong variance estimate.
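The PR does not give the scaling formula, so the following is only a sketch under stated assumptions: absmean ternary quantization in the style of BitNet b1.58, and a scale of `1/sqrt(sum w^2)`, which keeps the output variance near 1 for independent unit-variance inputs. The function names and the example weights are made up. It shows why the two weight sets give different scales.

```python
import math

def ternarize(weights, eps=1e-8):
    """Absmean ternary quantization (assumed): round(w / mean|w|), clipped to {-1, 0, 1}."""
    gamma = sum(abs(w) for w in weights) / len(weights) + eps
    return [max(-1, min(1, round(w / gamma))) for w in weights]

def variance_preserving_scale(effective_weights, eps=1e-8):
    """1/sqrt(sum w^2): keeps Var(sum_j w_j * x_j) near 1 for independent unit-variance x_j."""
    return 1.0 / math.sqrt(sum(w * w for w in effective_weights) + eps)

latent = [0.9, -0.05, 0.4, -1.3, 0.02, 0.7]  # made-up latent weights for one output row
ternary = ternarize(latent)

scale_from_latent = variance_preserving_scale(latent)    # estimate from weights the forward pass never uses
scale_from_ternary = variance_preserving_scale(ternary)  # estimate from the weights actually multiplied

print(ternary, scale_from_latent, scale_from_ternary)
```

Because the forward pass multiplies by the ternary weights, only the second scale matches the variance the layer actually produces.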
Bug 5: Data input/label misalignment
A BOS token is prepended to the inputs and a pad token is appended to the labels, which misaligns inputs and labels and introduces systematic target noise.
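A common way to build aligned next-token pairs is to slice one token sequence twice, so that `labels[i]` is always the token following `inputs[i]`. This is a sketch of that convention, not necessarily the fix applied in this PR; `make_lm_example` and the `BOS` id are hypothetical.

```python
def make_lm_example(token_ids):
    """Next-token prediction pair: labels[i] is the token that follows inputs[i]."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form an input/label pair")
    inputs = token_ids[:-1]   # everything except the last token
    labels = token_ids[1:]    # everything except the first token
    return inputs, labels

BOS = 1  # hypothetical BOS token id
tokens = [BOS, 17, 42, 99, 7]
inputs, labels = make_lm_example(tokens)
```

Both slices come from the same sequence, so no extra BOS or pad token can shift one side relative to the other.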
Bug 6: Engram compress_token_ids per-token loop
compress_token_ids is invoked once per token (O(n) individual calls) instead of once per batch, which is a severe performance issue.
Fix Status
Files
See bug-fixes.md for full analysis.