All gradients pass the finite-difference check with relative errors ~1e-10. Here's a summary of what's in layer_norm.py:


## Key Design Decisions

### 1. Simplified dx formula (avoids redundant recomputation)

Instead of computing dvar and dmean separately and then combining them, I substitute them into a single closed-form expression:

dx = rstd · [ dxhat  -  x̂ · mean(dxhat · x̂)  -  mean(dxhat) ]

This requires only two reductions over the D dimension (the proj and sum terms); everything else is elementwise. The cached x̂ and rstd from the forward pass are reused directly.
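
A minimal NumPy sketch of this backward step (the function name and cache layout here are illustrative, not necessarily layer_norm.py's exact API):

```python
import numpy as np

def layer_norm_backward(dy, cache):
    """Backward pass via the closed-form dx above.

    cache = (xhat, rstd, gamma) from the forward pass:
    xhat is (B, T, D); rstd is (B, T, 1) = 1/sqrt(var + eps).
    """
    xhat, rstd, gamma = cache
    dxhat = dy * gamma                                  # elementwise
    # The only two reductions over D: the proj and sum terms.
    proj = (dxhat * xhat).mean(axis=-1, keepdims=True)  # mean(dxhat * xhat)
    mean = dxhat.mean(axis=-1, keepdims=True)           # mean(dxhat)
    dx = rstd * (dxhat - xhat * proj - mean)
    # Parameter grads reduce over all (B, T) rows.
    dgamma = (dy * xhat).sum(axis=(0, 1))
    dbeta = dy.sum(axis=(0, 1))
    return dx, dgamma, dbeta
```

A central finite difference on each input element, (f(x+h) − f(x−h)) / 2h, compared against dx is the check behind the ~1e-10 relative errors quoted above.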

### 2. Numerical stability: where things go wrong

| Danger zone | Why | Mitigation |
|---|---|---|
| var ≈ 0 (1/σ̂ → ∞) | All elements identical | eps = 1e-5 caps rstd ≤ 1/√ε ≈ 316 |
| x - mean cancellation | Large x, small σ | Two-pass variance (already used); Welford's for extreme cases |
| xc² overflow | float16 with large x | Upcast to float32 for computation |
| Gradient explosion | dx ∝ 1/σ̂ | Gradient clipping upstream; eps bounds the scale |
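
For reference, a forward-pass sketch with these mitigations applied; the exact signature is an assumption, but the eps value comes from the table and the upcast and two-pass variance are exactly the techniques listed:

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    x32 = x.astype(np.float32)                   # upcast: keeps xc**2 from overflowing in fp16
    mean = x32.mean(axis=-1, keepdims=True)
    xc = x32 - mean                              # two-pass: center first, then square
    var = (xc * xc).mean(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(var + eps)              # eps caps rstd at 1/sqrt(eps) ≈ 316
    xhat = xc * rstd
    y = gamma * xhat + beta
    return y.astype(x.dtype), (xhat, rstd, gamma)  # cache for backward
```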

### 3. Complexity

- Time: O(B·T·D); each element is touched a constant number of times in both forward and backward.
- Memory: Forward stores xhat (B,T,D) + rstd (B,T,1). Backward allocates one temporary (B,T,D) for dxhat, which can be fused in-place on GPU.

### 4. GPU fusion into one kernel

The strategy is one thread-block per (B,T) row of length D:

  1. Load dy and xhat → compute dxhat = dy * gamma into shared memory
  2. Reduce across the block to get Σ(dxhat · x̂) and Σ(dxhat) via warp shuffles
  3. Compute dx_d = rstd * (dxhat_d - x̂_d * proj/D - sum/D) per thread
  4. Atomic add contributions to global dgamma and dbeta

This reads each element once from global memory (~3D traffic per row: load dy and xhat, store dx) versus ~10D+ for an unfused chain of separate kernels. For D ≤ 1024, shared memory usage (~12 KB per block in float32) is well within GPU limits.
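
A sketch of that fused kernel, written in Triton rather than raw CUDA (Triton's tl.sum performs the block-wide reduction that warp shuffles implement by hand; the kernel name and launch details are illustrative):

```python
import triton
import triton.language as tl

@triton.jit
def layer_norm_bwd_kernel(DY, XHAT, GAMMA, RSTD, DX, DGAMMA, DBETA,
                          D, BLOCK_D: tl.constexpr):
    row = tl.program_id(0)                    # one program per (B*T) row
    cols = tl.arange(0, BLOCK_D)
    mask = cols < D
    # Step 1: load dy and xhat once; dxhat = dy * gamma stays on-chip.
    dy = tl.load(DY + row * D + cols, mask=mask, other=0.0)
    xhat = tl.load(XHAT + row * D + cols, mask=mask, other=0.0)
    gamma = tl.load(GAMMA + cols, mask=mask, other=0.0)
    rstd = tl.load(RSTD + row)
    dxhat = dy * gamma
    # Step 2: the two block-wide reductions (proj and sum terms).
    proj = tl.sum(dxhat * xhat, axis=0) / D
    mean = tl.sum(dxhat, axis=0) / D
    # Step 3: elementwise dx, written back once.
    dx = rstd * (dxhat - xhat * proj - mean)
    tl.store(DX + row * D + cols, dx, mask=mask)
    # Step 4: atomic adds into global dgamma / dbeta.
    tl.atomic_add(DGAMMA + cols, dy * xhat, mask=mask)
    tl.atomic_add(DBETA + cols, dy, mask=mask)
```

Launched with one program per row, e.g. `layer_norm_bwd_kernel[(B * T,)](..., D, BLOCK_D=triton.next_power_of_2(D))`, which matches the one-block-per-row strategy above.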