Everything passes. Here's a summary of what was built:
## Files
| File | Purpose |
|---|---|
| `layer_norm_backward.py` | Core implementation + gradient check + complexity analysis + GPU fusion discussion |
| `test_layer_norm.py` | Edge cases (zero input, D=1, large D, large mean, scale invariance) |
| `benchmark_layer_norm.py` | Variance stability demo, performance benchmarks, alternative derivation cross-check |
## Key results
**Gradient accuracy** — all three gradients pass with relative error ~10⁻¹¹:
```
dx relative error: 5.04e-11 ✓
dgamma relative error: 1.75e-11 ✓
dbeta relative error: 1.46e-11 ✓
```
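For context, the check compares the analytic gradients against centered finite differences using a max relative error. A minimal sketch of that recipe follows; `rel_error` and `numerical_grad` are illustrative names, not necessarily the ones used in `layer_norm_backward.py`:

```python
import numpy as np

def rel_error(a, b, eps=1e-12):
    # Max relative error, the metric quoted above.
    return np.max(np.abs(a - b) / np.maximum(np.abs(a) + np.abs(b), eps))

def numerical_grad(f, x, h=1e-5):
    # Centered finite differences of a scalar-valued f with respect to array x.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    while not it.finished:
        idx = it.multi_index
        old = x[idx]
        x[idx] = old + h
        f_plus = f(x)
        x[idx] = old - h
        f_minus = f(x)
        x[idx] = old                      # restore the perturbed entry
        grad[idx] = (f_plus - f_minus) / (2.0 * h)
        it.iternext()
    return grad

# Usage: rel_error(numerical_grad(loss_fn, x), dx_analytic)
```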
**Numerical stability** — the two-pass variance completely avoids catastrophic cancellation:
```
offset=1e10: naive=0.000000 stable=2.000000 true=2.000000
```
The naive `E[x²] - E[x]²` formula loses all precision at offset 1e10, while the two-pass algorithm stays exact.
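A minimal NumPy sketch of that comparison (the data and function names here are illustrative; the actual setup in `benchmark_layer_norm.py` may differ):

```python
import numpy as np

def var_naive(x):
    # One pass: E[x^2] - E[x]^2 subtracts two huge, nearly equal numbers.
    return np.mean(x * x) - np.mean(x) ** 2

def var_two_pass(x):
    # Two passes: subtract the mean first, then average the squared deviations.
    return np.mean((x - np.mean(x)) ** 2)

x = np.arange(5, dtype=np.float64) + 1e10   # true variance is exactly 2.0
print(var_naive(x))     # typically off by thousands: cancellation wipes out the answer
print(var_two_pass(x))  # 2.0
```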
**Backward formula** — the compact projection form:
```
dx = σ⁻¹ · [ g - mean(g) - x_hat · mean(g · x_hat) ]
```
where `g = γ · dy`. This requires only **two reductions** over D and **zero recomputation** of forward intermediates.
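A NumPy sketch of that backward formula (the function name and cache layout are assumptions for illustration, not necessarily the file's exact API):

```python
import numpy as np

def layer_norm_backward(dy, cache):
    x_hat, std_inv, gamma = cache                        # (B,T,D), (B,T), (D)
    g = dy * gamma                                       # upstream grad scaled by gamma
    mean_g = g.mean(axis=-1, keepdims=True)              # reduction 1 over D
    mean_gx = (g * x_hat).mean(axis=-1, keepdims=True)   # reduction 2 over D
    dx = std_inv[..., None] * (g - mean_g - x_hat * mean_gx)
    dgamma = (dy * x_hat).sum(axis=(0, 1))               # reduce over batch/time
    dbeta = dy.sum(axis=(0, 1))
    return dx, dgamma, dbeta
```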
**Memory** — the cache stores exactly 3 tensors: `x_hat` (B·T·D), `std_inv` (B·T), `gamma` (D). No `x_centered`, no `var`, no `x` — the backward formula needs nothing beyond these three.
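A matching forward-pass sketch that caches exactly those three tensors (again an illustrative signature):

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    var = ((x - mu) ** 2).mean(axis=-1, keepdims=True)   # stable two-pass variance
    std_inv = 1.0 / np.sqrt(var + eps)                   # (B, T, 1)
    x_hat = (x - mu) * std_inv
    y = gamma * x_hat + beta
    cache = (x_hat, std_inv.squeeze(-1), gamma)          # exactly three tensors
    return y, cache
```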
**GPU fusion** — the fused kernel reduces global memory traffic from ~12 accesses per element (naive separate kernels) to 4 (forward) and 5 (backward), roughly a 2-3× speedup since layer norm is memory-bandwidth bound.
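Under that bandwidth-bound assumption the speedup is just the traffic ratio; a one-line sanity check:

```python
# Back-of-envelope estimate, assuming runtime is proportional to
# global-memory accesses per element (i.e. fully bandwidth-bound).
naive_accesses = 12   # separate kernels, figure quoted above
fused_forward = 4
fused_backward = 5

print(f"forward speedup  ~ {naive_accesses / fused_forward:.1f}x")   # ~3.0x
print(f"backward speedup ~ {naive_accesses / fused_backward:.1f}x")  # ~2.4x
```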