All gradient checks pass. The implementation is complete. Here's a summary:

## Summary

**File:** `./minimax-m2.7/backwards/layer_norm_numpy.py`

### Key Components
1. **Forward Pass** (`layer_norm_forward`, sketched after this list):
   - Computes mean, variance, and standard deviation over the feature dimension (D)
   - Normalizes: `x_norm = (x - mean) / sqrt(var + eps)`
   - Scales and shifts: `y = gamma * x_norm + beta`
   - Caches intermediates: `x_centered`, `x_norm`, `mean`, `std`, `gamma`
2. **Backward Pass** (`layer_norm_backward`, sketched after this list):
   - **d_gamma**: `sum(dy * x_norm)` over (B, T)
   - **d_beta**: `sum(dy)` over (B, T)
   - **dx**: `(dz - mean(dz) - x_norm * mean(dz * x_norm)) / std`

   Where `dz = dy * gamma` and the means are taken over the feature dimension D. The consolidated formula avoids computing the full Jacobian.
3. **Gradient Check** (sketched after this list):
   - Uses central finite differences: `(f(x+h) - f(x-h)) / (2h)`
   - Fixed array flattening to use views instead of copies
   - Spot-check for large tensors (>100k elements)
4. **Numerical Stability**:
   - `eps` prevents division by zero: `std = sqrt(var + eps)`
   - Catastrophic cancellation analysis documented
   - Pairwise summation suggestion for fp16
5. **Complexity**:
   - **Time**: O(BTD) for both forward and backward
   - **Space**: O(BTD) for cached activations during training, O(1) extra during inference (no cache is kept)
6. **GPU Fusion Design**:
   - Grid of `(B × T)` blocks, each handling one `(b,t)` position
   - Three phases: mean reduction, variance reduction, normalize+output
   - Warp-level shuffle reductions for efficiency
   - Single kernel replaces 4-5 separate kernels
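
### Illustrative NumPy Sketches

The blocks below are minimal sketches of the components summarized above, not the contents of `layer_norm_numpy.py` itself. They assume a `(B, T, D)` input, `(D,)`-shaped `gamma` and `beta`, and a default `eps` of `1e-5`. Forward pass:

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last (feature) dimension of x, shape (B, T, D)."""
    mean = x.mean(axis=-1, keepdims=True)           # (B, T, 1)
    var = x.var(axis=-1, keepdims=True)             # (B, T, 1)
    std = np.sqrt(var + eps)                        # eps keeps the division well-defined
    x_centered = x - mean
    x_norm = x_centered / std                       # x_norm = (x - mean) / sqrt(var + eps)
    y = gamma * x_norm + beta                       # scale and shift
    cache = (x_centered, x_norm, mean, std, gamma)  # intermediates reused by the backward pass
    return y, cache
```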
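
The backward pass continues from the forward sketch. It applies the consolidated formula from item 2, so the per-position Jacobian is never materialized:

```python
def layer_norm_backward(dy, cache):
    """Gradients of a LayerNorm whose forward pass produced `cache`; dy has shape (B, T, D)."""
    x_centered, x_norm, mean, std, gamma = cache

    # Parameter gradients: reduce over the batch and time dimensions (B, T).
    d_gamma = np.sum(dy * x_norm, axis=(0, 1))
    d_beta = np.sum(dy, axis=(0, 1))

    # Input gradient: dz = dy * gamma, with means taken over the feature dimension D.
    dz = dy * gamma
    dz_mean = dz.mean(axis=-1, keepdims=True)
    dzx_mean = (dz * x_norm).mean(axis=-1, keepdims=True)
    dx = (dz - dz_mean - x_norm * dzx_mean) / std
    return dx, d_gamma, d_beta
```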
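
A gradient check in the same spirit as item 3; the helper name `numerical_grad`, the tensor sizes, and the tolerance are illustrative, not taken from the file. It perturbs one element at a time with central finite differences and flattens through views so the writes land in the original array:

```python
def numerical_grad(f, x, h=1e-5):
    """Central finite differences, (f(x + h) - f(x - h)) / (2h), one element at a time."""
    grad = np.zeros_like(x)
    flat_x = x.reshape(-1)   # a view, not a copy, so perturbations hit the original array
    flat_g = grad.reshape(-1)
    for i in range(flat_x.size):
        old = flat_x[i]
        flat_x[i] = old + h
        f_plus = f()
        flat_x[i] = old - h
        f_minus = f()
        flat_x[i] = old
        flat_g[i] = (f_plus - f_minus) / (2 * h)
    return grad

rng = np.random.default_rng(0)
B, T, D = 2, 3, 4
x = rng.standard_normal((B, T, D))
gamma = rng.standard_normal(D)
beta = rng.standard_normal(D)
dy = rng.standard_normal((B, T, D))     # upstream gradient; the scalar loss is sum(y * dy)

loss = lambda: float(np.sum(layer_norm_forward(x, gamma, beta)[0] * dy))
dx_num = numerical_grad(loss, x)
_, cache = layer_norm_forward(x, gamma, beta)
dx, d_gamma, d_beta = layer_norm_backward(dy, cache)
print("max |dx - dx_num| =", np.abs(dx - dx_num).max())   # typically ~1e-9 or below at fp64
```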