Fix 6 training bugs causing impossible loss values (~22.3 vs max ~10.4) #1

Open
opened 2026-05-01 14:01:26 +02:00 by sleepy · 0 comments
sleepy commented 2026-05-01 14:01:26 +02:00 (Migrated from localhost:18431)

Problem

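The sweep produced a flat loss of ~22.3 across all 12 Stage 1 trials. That exceeds log(32768) ≈ 10.4, the cross-entropy of a uniform prediction over 32768 classes. Cross-entropy has no strict upper bound, but a model that has learned nothing should sit near that value, so a flat loss at roughly twice the uniform baseline points to bugs rather than slow training.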
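As a sanity check on the numbers, the sketch below computes the uniform-prediction baseline and shows how a confidently wrong prediction can land far above it. It assumes 32768 is the vocabulary size (inferred from log(32768)); the `cross_entropy` helper and the logit values are illustrative, not taken from the training code.

```python
import math

# Assumption: 32768 is the vocabulary size, inferred from log(32768) in the issue.
VOCAB_SIZE = 32768


def cross_entropy(logits, target):
    """Cross-entropy (in nats) of a single prediction, via a stable log-sum-exp."""
    peak = max(logits)
    log_sum_exp = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_sum_exp - logits[target]


# All-equal logits are a uniform prediction, which costs ln(V).
uniform_loss = cross_entropy([0.0] * VOCAB_SIZE, target=0)

# A confidently wrong prediction lands far above that baseline.
wrong_logits = [0.0] * VOCAB_SIZE
wrong_logits[1] = 35.0  # nearly all probability mass on a wrong token
confident_wrong_loss = cross_entropy(wrong_logits, target=0)

print(f"uniform baseline:  {uniform_loss:.3f}")
print(f"confidently wrong: {confident_wrong_loss:.3f}")
```

A loss near the baseline suggests the model has learned nothing yet. A loss well above it suggests outputs that are large in magnitude and wrong, which is consistent with the variance and alignment bugs listed under Root Causes.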

Root Causes (6 bugs identified)

Bug 1: Missing residual connections in TransformerBlock

The attention output is not added back to the block input, so the signal is destroyed at every layer.

Bug 2: Missing PreNorm in TransformerBlock

No normalization is applied before the attention and FFN sublayers, so activation variance explodes with depth.
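The wiring that Bugs 1 and 2 describe as missing can be sketched without any framework. This is a minimal illustration, not the code in fixes/attention.py. The choice of RMSNorm (without a learned gain) is an assumption, and `attn` and `ffn` are hypothetical stand-ins for the real sublayers.

```python
import math
from typing import Callable, List

Vector = List[float]
Sublayer = Callable[[Vector], Vector]


def rms_norm(x: Vector, eps: float = 1e-6) -> Vector:
    """Rescale x to unit root-mean-square (RMSNorm without a learned gain)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]


def pre_norm_block(x: Vector, attn: Sublayer, ffn: Sublayer) -> Vector:
    """Compute x + attn(norm(x)), then x + ffn(norm(x))."""
    attn_out = attn(rms_norm(x))                    # normalize before the sublayer (Bug 2)
    x = [xi + ai for xi, ai in zip(x, attn_out)]    # add back to the input (Bug 1)
    ffn_out = ffn(rms_norm(x))
    return [xi + fi for xi, fi in zip(x, ffn_out)]


# Toy sublayers: with zero-output sublayers the block is an exact identity,
# which is the property the residual path is meant to guarantee.
zeros: Sublayer = lambda v: [0.0] * len(v)
print(pre_norm_block([1.0, -2.0, 3.0], zeros, zeros))
```

Because each sublayer sees a normalized input and only contributes an additive update, the input passes through unchanged when a sublayer outputs zeros.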

Bug 3: RoPE frequency interleaving

RoPE applies the same frequencies to all dimensions instead of interleaving them across dimension pairs, which breaks the positional encoding.
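For reference, here is a minimal sketch of rotary embeddings in which each adjacent pair of dimensions gets its own frequency. It assumes the interleaved pair layout (dimensions 2i and 2i+1 rotate together) and the commonly used base of 10000. The layout and base actually used in fixes/attention.py are not stated in this issue.

```python
import math
from typing import List


def rope_frequencies(head_dim: int, base: float = 10000.0) -> List[float]:
    """One frequency per dimension pair: base^(-2i/d) for i in [0, d/2)."""
    return [base ** (-2.0 * i / head_dim) for i in range(head_dim // 2)]


def apply_rope(x: List[float], position: int, base: float = 10000.0) -> List[float]:
    """Rotate each (x[2i], x[2i+1]) pair by the angle position * freq_i."""
    out = [0.0] * len(x)
    for i, freq in enumerate(rope_frequencies(len(x), base)):
        angle = position * freq
        cos_a, sin_a = math.cos(angle), math.sin(angle)
        x_even, x_odd = x[2 * i], x[2 * i + 1]
        out[2 * i] = x_even * cos_a - x_odd * sin_a
        out[2 * i + 1] = x_even * sin_a + x_odd * cos_a
    return out


def dot(a: List[float], b: List[float]) -> float:
    return sum(u * v for u, v in zip(a, b))


q = [1.0, 2.0, 3.0, 4.0]
k = [0.5, -1.0, 2.0, 0.25]
# Attention scores should depend only on the relative offset (here 2).
print(dot(apply_rope(q, 3), apply_rope(k, 1)))
print(dot(apply_rope(q, 5), apply_rope(k, 3)))
```

Two properties make useful smoke-test checks: the rotation preserves the vector norm, and the query-key dot product depends only on the difference between the two positions.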

Bug 4: BitLinear variance preservation scaling

The variance-preserving scale is computed from the latent (continuous) weights instead of the ternary (quantized) weights, which gives the wrong variance estimate.
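The toy example below shows why the choice of tensor matters. It assumes absmean ternary quantization and a scale of the form 1/sqrt(fan_in * E[w^2]). Both are one plausible reading of "variance preservation scaling", not the confirmed formula in fixes/bitlinear.py, and the weight values are made up.

```python
import math
from typing import List


def ternarize(weights: List[float], eps: float = 1e-8) -> List[int]:
    """Absmean quantization to {-1, 0, +1}: round(w / mean|w|), then clip."""
    scale = sum(abs(w) for w in weights) / len(weights)
    return [max(-1, min(1, round(w / (scale + eps)))) for w in weights]


def mean_square(values) -> float:
    return sum(v * v for v in values) / len(values)


def variance_preserving_scale(weights, fan_in: int, eps: float = 1e-8) -> float:
    """Hypothetical output scale: 1 / sqrt(fan_in * E[w^2])."""
    return 1.0 / math.sqrt(fan_in * mean_square(weights) + eps)


latent = [0.02, -0.5, 0.9, -0.03, 0.4, -1.1, 0.05, 0.6]  # made-up latent weights
ternary = ternarize(latent)

# The two tensors have different second moments, so the scales disagree.
scale_from_latent = variance_preserving_scale(latent, fan_in=len(latent))    # the bug
scale_from_ternary = variance_preserving_scale(ternary, fan_in=len(latent))  # the fix
print(ternary, scale_from_latent, scale_from_ternary)
```

Since the forward pass multiplies by the ternary weights, a scale derived from the latent weights does not match the variance the layer actually produces.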

Bug 5: Data input/label misalignment

A BOS token is prepended to the inputs and a pad token is appended to the labels, which introduces systematic target noise.
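A conventional way to avoid this class of bug is to derive inputs and labels from the same token sequence by shifting it one position, and to mask padded label positions rather than scoring them. The sketch below assumes that convention. The helper names and token IDs are hypothetical, and fixes/data.py may differ.

```python
from typing import List, Tuple

# -100 is the default ignore_index of PyTorch's CrossEntropyLoss.
IGNORE_INDEX = -100


def make_lm_pair(token_ids: List[int]) -> Tuple[List[int], List[int]]:
    """Next-token pair: labels[i] is the token that follows inputs[i]."""
    if len(token_ids) < 2:
        raise ValueError("need at least two tokens to form an input/label pair")
    return token_ids[:-1], token_ids[1:]


def pad_pair(inputs: List[int], labels: List[int], length: int,
             pad_id: int) -> Tuple[List[int], List[int]]:
    """Pad both sides to `length`; padded label slots are masked out of the loss."""
    missing = length - len(inputs)
    if missing < 0:
        raise ValueError("sequence is longer than the target length")
    return inputs + [pad_id] * missing, labels + [IGNORE_INDEX] * missing


# Hypothetical IDs: 1 = BOS, 2 = EOS, 0 = PAD.
inputs, labels = make_lm_pair([1, 10, 11, 12, 2])
padded_inputs, padded_labels = pad_pair(inputs, labels, length=6, pad_id=0)
print(padded_inputs)
print(padded_labels)
```

Special tokens are added once to the full sequence before the shift, so inputs and labels can never drift apart by construction.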

Bug 6: Engram compress_token_ids per-token loop

Makes one call per token (O(n) individual calls) instead of a single batched call, which is a severe performance issue.
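One common way to remove a per-token call loop is to precompute the mapping once for the whole vocabulary and then index into it. The sketch below assumes the compression is a pure function of the token ID. The issue does not say what compress_token_ids actually computes, so `compress_one` here is a hypothetical stand-in.

```python
from typing import Callable, List


def compress_per_token(token_ids: List[int],
                       compress_one: Callable[[int], int]) -> List[int]:
    """The slow pattern: one call per token occurrence."""
    return [compress_one(t) for t in token_ids]


def build_table(vocab_size: int, compress_one: Callable[[int], int]) -> List[int]:
    """Pay the per-token cost once, up front, for every ID in the vocabulary."""
    return [compress_one(t) for t in range(vocab_size)]


def compress_batched(token_ids: List[int], table: List[int]) -> List[int]:
    """Pure lookups at batch time, with no calls into compress_one."""
    return [table[t] for t in token_ids]


calls = 0


def compress_one(token_id: int) -> int:
    """Hypothetical stand-in for the real per-token compression."""
    global calls
    calls += 1
    return token_id % 8


table = build_table(vocab_size=64, compress_one=compress_one)
calls_after_build = calls
batch = [5, 13, 13, 40, 5, 63]
batched_result = compress_batched(batch, table)
calls_after_batch = calls
print(batched_result, calls_after_build, calls_after_batch)
```

With a tensor library the lookup typically becomes a single indexing operation on a precomputed table tensor, so the per-batch cost no longer scales with the number of Python-level calls.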

Fix Status

  • [x] fixes/bitlinear.py: Bug 4 fixed
  • [x] fixes/attention.py: Bugs 1, 2, 3 fixed
  • [x] fixes/data.py: Bug 5 fixed
  • [x] fixes/engram.py: Bug 6 fixed
  • [x] fixes/smoke_test.py: validation tests written
  • [ ] Smoke test passed on GPU
  • [ ] Stage 1 retrain completed

Files

See bug-fixes.md for full analysis.

Reference
sleepy/ternary#1