All gradient checks pass. The implementation is complete. Here's a summary:
## Summary
**File:** `./minimax-m2.7/backwards/layer_norm_numpy.py`
### Key Components
1. **Forward Pass** (`layer_norm_forward`; see the sketch below):
- Computes mean, variance, and standard deviation over the feature dimension (D)
- Normalizes: `x_norm = (x - mean) / sqrt(var + eps)`
- Scales and shifts: `y = gamma * x_norm + beta`
- Caches intermediates: `x_centered`, `x_norm`, `mean`, `std`, `gamma`
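A minimal NumPy sketch of this pass, assuming a `(B, T, D)` input layout; the names mirror this summary, not necessarily the exact signatures in the file:

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (B, T, D); gamma, beta: (D,). Returns y and a cache for backward."""
    mean = x.mean(axis=-1, keepdims=True)        # (B, T, 1)
    x_centered = x - mean
    var = (x_centered ** 2).mean(axis=-1, keepdims=True)
    std = np.sqrt(var + eps)                     # eps keeps the sqrt away from zero
    x_norm = x_centered / std
    y = gamma * x_norm + beta                    # gamma/beta broadcast over (B, T)
    return y, (x_centered, x_norm, mean, std, gamma)
```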
2. **Backward Pass** (`layer_norm_backward`):
- **d_gamma**: `sum(dy * x_norm)` over (B, T)
- **d_beta**: `sum(dy)` over (B, T)
- **dx**: `(dz - mean(dz) - x_norm * mean(dz * x_norm)) / std`
Where `dz = dy * gamma` and both means are taken over the feature dimension D. The consolidated formula avoids materializing the full D×D Jacobian; see the sketch below.
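A sketch of that backward pass against the cache layout above (same caveat: names are taken from this summary):

```python
def layer_norm_backward(dy, cache):
    x_centered, x_norm, mean, std, gamma = cache
    d_gamma = (dy * x_norm).sum(axis=(0, 1))     # reduce over (B, T) -> (D,)
    d_beta = dy.sum(axis=(0, 1))                 # reduce over (B, T) -> (D,)
    dz = dy * gamma
    # both means below are over the feature dimension D
    dx = (dz
          - dz.mean(axis=-1, keepdims=True)
          - x_norm * (dz * x_norm).mean(axis=-1, keepdims=True)) / std
    return dx, d_gamma, d_beta
```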
3. **Gradient Check** (sketched below):
- Uses central finite differences: `(f(x+h) - f(x-h)) / (2h)`
- Fixed array flattening to use views instead of copies, so perturbations made through the flat view actually reach the tensor being differentiated
- Spot-check for large tensors (>100k elements)
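A sketch of such a check; `numerical_grad` is a hypothetical helper, and `f` is assumed to map the full array to a scalar loss:

```python
def numerical_grad(f, x, h=1e-5):
    """Central differences: (f(x+h) - f(x-h)) / (2h), one coordinate at a time."""
    grad = np.zeros_like(x)
    flat_x = x.reshape(-1)      # a view for contiguous x, so writes below perturb x itself
    flat_g = grad.reshape(-1)
    for i in range(flat_x.size):
        old = flat_x[i]
        flat_x[i] = old + h
        f_plus = f(x)
        flat_x[i] = old - h
        f_minus = f(x)
        flat_x[i] = old         # restore the original value
        flat_g[i] = (f_plus - f_minus) / (2 * h)
    return grad
```

Comparing this against the analytic gradients with something like `np.allclose(..., atol=1e-5)` is the usual acceptance test.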
4. **Numerical Stability**:
- `eps` prevents division by zero: `std = sqrt(var + eps)`
- Catastrophic cancellation analysis documented (see the demonstration below)
- Pairwise summation suggestion for fp16
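To see the cancellation concretely, here is a small float32 demonstration (values chosen for illustration, not taken from the file); the two-pass form (`two_pass` below) is what the forward pass above effectively uses via `x_centered`:

```python
import numpy as np

x = np.array([10000.0, 10000.01, 9999.99], dtype=np.float32)

naive    = np.mean(x * x) - np.mean(x) ** 2    # E[x^2] - E[x]^2: two large terms cancel
two_pass = np.mean((x - np.mean(x)) ** 2)      # E[(x - mean)^2]: subtract the mean first

# the true variance is about 6.7e-5; in float32 the naive form loses essentially
# all significant digits, while the two-pass form stays close
print(naive, two_pass)
```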
5. **Complexity**:
- **Time**: O(BTD) for both forward and backward
- **Space**: O(BTD) for cached activations during training; O(1) extra during inference (no intermediates kept)
6. **GPU Fusion Design**:
- Grid of `(B × T)` blocks, each handling one `(b,t)` position
- Three phases: mean reduction, variance reduction, normalize+output
- Warp-level shuffle reductions for efficiency
- Single kernel replaces 4-5 separate kernels