All gradient checks pass. The implementation is complete. Here's a summary:
Summary
File: `./minimax-m2.7/backwards/layer_norm_numpy.py`
Key Components
- Forward Pass (`layer_norm_forward`):
  - Computes mean, variance, and standard deviation over the feature dimension (D)
  - Normalizes: `x_norm = (x - mean) / sqrt(var + eps)`
  - Scales and shifts: `y = gamma * x_norm + beta`
  - Caches intermediates: `x_centered`, `x_norm`, `mean`, `std`, `gamma`
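A minimal numpy sketch of this forward pass, assuming inputs of shape (B, T, D) and per-feature `gamma`/`beta` of shape (D,) (the exact cache layout is illustrative):

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (B, T, D); gamma, beta: (D,)
    mean = x.mean(axis=-1, keepdims=True)           # per-(b, t) mean over D
    var = x.var(axis=-1, keepdims=True)             # per-(b, t) variance over D
    std = np.sqrt(var + eps)                        # eps guards against division by zero
    x_centered = x - mean
    x_norm = x_centered / std                       # normalize
    y = gamma * x_norm + beta                       # scale and shift
    cache = (x_centered, x_norm, mean, std, gamma)  # intermediates for the backward pass
    return y, cache
```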
- Backward Pass (`layer_norm_backward`):
  - d_gamma: `sum(dy * x_norm)` over (B, T)
  - d_beta: `sum(dy)` over (B, T)
  - dx: `(dz - mean(dz) - x_norm * mean(dz * x_norm)) / std`, where `dz = dy * gamma` and the means run over D. The consolidated formula avoids computing the full Jacobian.
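A matching numpy sketch of the backward pass using the consolidated formula (shapes and cache order as in the forward sketch above):

```python
def layer_norm_backward(dy, cache):
    # dy: (B, T, D) upstream gradient
    x_centered, x_norm, mean, std, gamma = cache
    d_gamma = (dy * x_norm).sum(axis=(0, 1))  # reduce over (B, T)
    d_beta = dy.sum(axis=(0, 1))              # reduce over (B, T)
    dz = dy * gamma
    # Consolidated formula: avoids materializing a D x D Jacobian per position
    dx = (dz
          - dz.mean(axis=-1, keepdims=True)
          - x_norm * (dz * x_norm).mean(axis=-1, keepdims=True)) / std
    return dx, d_gamma, d_beta
```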
- Gradient Check:
  - Uses central finite differences: `(f(x+h) - f(x-h)) / (2h)`
  - Fixed array flattening to use views instead of copies, so perturbations reach the original arrays
  - Spot-checks for large tensors (>100k elements)
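A hedged sketch of such a check for `d_gamma` (the tolerance, step size `h`, and spot-check count are illustrative; it assumes `gamma` is contiguous so `ravel` returns a view rather than a copy):

```python
def grad_check_gamma(x, gamma, beta, h=1e-5, num_checks=10, seed=0):
    rng = np.random.default_rng(seed)
    dy = rng.standard_normal(x.shape)            # random upstream gradient
    _, cache = layer_norm_forward(x, gamma, beta)
    _, d_gamma, _ = layer_norm_backward(dy, cache)
    flat = gamma.ravel()                         # view, not copy: edits reach `gamma`
    for i in rng.choice(flat.size, size=min(num_checks, flat.size), replace=False):
        old = flat[i]
        flat[i] = old + h
        y_plus, _ = layer_norm_forward(x, gamma, beta)
        flat[i] = old - h
        y_minus, _ = layer_norm_forward(x, gamma, beta)
        flat[i] = old                            # restore the original value
        # Central difference of the scalar loss L = sum(y * dy)
        numeric = ((y_plus - y_minus) * dy).sum() / (2 * h)
        assert np.isclose(numeric, d_gamma.ravel()[i], atol=1e-6), (numeric, d_gamma.ravel()[i])
```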
- Numerical Stability:
  - `eps` prevents division by zero: `std = sqrt(var + eps)`
  - Catastrophic cancellation analysis documented
  - Pairwise summation suggested for fp16
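To illustrate why pairwise summation matters at low precision, a small self-contained comparison (values chosen to make a naive fp16 running sum stall once the accumulator's spacing exceeds the addend):

```python
def pairwise_sum_fp16(a):
    """Recursive pairwise summation in fp16: error grows ~O(log n) vs O(n) for a running sum."""
    if a.size == 1:
        return a[0]
    mid = a.size // 2
    return np.float16(pairwise_sum_fp16(a[:mid]) + pairwise_sum_fp16(a[mid:]))

x = np.full(4096, np.float16(0.1))

naive = np.float16(0.0)
for v in x:                          # sequential fp16 accumulation
    naive = np.float16(naive + v)

print(naive)                         # stalls far below the true value
print(pairwise_sum_fp16(x))          # close to the true value
print(x.astype(np.float64).sum())    # fp64 reference, ~409.6
```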
- Complexity:
  - Time: O(BTD) for both forward and backward
  - Space: O(BTD) for cached activations (training), O(1) for inference
- GPU Fusion Design:
  - Grid of (B × T) blocks, each handling one (b, t) position
  - Three phases: mean reduction, variance reduction, normalize + output
  - Warp-level shuffle reductions for efficiency
  - A single fused kernel replaces 4-5 separate kernels (see the sketch below)
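As one way to realize this design while staying in Python, a hedged Triton sketch of the fused forward kernel (the name, signature, and launch details are illustrative; Triton's compiler lowers the `tl.sum` block reductions to warp-level shuffles, and the three phases mirror the list above):

```python
import triton
import triton.language as tl

@triton.jit
def fused_layer_norm_fwd(X, Y, GAMMA, BETA, stride, D, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)               # one program per (b, t) position: grid = (B * T,)
    cols = tl.arange(0, BLOCK)
    mask = cols < D
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    # Phase 1: mean reduction
    mean = tl.sum(x, axis=0) / D
    # Phase 2: variance reduction
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / D
    rstd = 1.0 / tl.sqrt(var + eps)
    # Phase 3: normalize, scale, shift, and write the output
    gamma = tl.load(GAMMA + cols, mask=mask, other=1.0).to(tl.float32)
    beta = tl.load(BETA + cols, mask=mask, other=0.0).to(tl.float32)
    y = (x - mean) * rstd * gamma + beta
    tl.store(Y + row * stride + cols, y, mask=mask)
```

Launched with `grid = (B * T,)` and `BLOCK` set to the next power of two at or above D, the row stays in registers across all three phases, so one pass over global memory replaces the separate mean, variance, normalize, and scale/shift kernels.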