feat: add model comparisons and sanitize session files

- Rename gamma to glm5 and model to minimax-m2.7
- Add model_comparison/ directory with head-to-head analyses
- Sanitize all session.jsonl files: remove absolute paths and usernames
- Remove __pycache__ artifacts
- Add .gitignore
2026-04-23 11:16:01 +02:00
commit 8e72eef09c
62 changed files with 18469 additions and 0 deletions
All gradient checks pass. The implementation is complete. Here's a summary:
## Summary
**File:** `./minimax-m2.7/backwards/layer_norm_numpy.py`
### Key Components
1. **Forward Pass** (`layer_norm_forward`):
- Computes mean, variance, and standard deviation over the feature dimension (D)
- Normalizes: `x_norm = (x - mean) / sqrt(var + eps)`
- Scales and shifts: `y = gamma * x_norm + beta`
- Caches intermediates: `x_centered`, `x_norm`, `mean`, `std`, `gamma`
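These bullets translate almost line-for-line into NumPy. The sketch below is a reconstruction from this summary, not the file's actual code, assuming `x` has shape `(B, T, D)`:

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """Sketch of the forward pass described above; x is assumed (B, T, D)."""
    mean = x.mean(axis=-1, keepdims=True)           # (B, T, 1)
    var = x.var(axis=-1, keepdims=True)             # (B, T, 1)
    std = np.sqrt(var + eps)                        # eps keeps this nonzero
    x_centered = x - mean
    x_norm = x_centered / std
    y = gamma * x_norm + beta                       # scale and shift
    cache = (x_centered, x_norm, mean, std, gamma)  # intermediates for backward
    return y, cache
```

With `gamma = 1` and `beta = 0`, each `(b, t)` slice of the output has mean ≈ 0 and standard deviation ≈ 1 (up to the `eps` perturbation).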
2. **Backward Pass** (`layer_norm_backward`):
- **d_gamma**: `sum(dy * x_norm)` over (B, T)
- **d_beta**: `sum(dy)` over (B, T)
- **dx**: `(dz - mean(dz) - x_norm * mean(dz * x_norm)) / std`, where `dz = dy * gamma` and each `mean` is taken over the feature dimension D
The consolidated formula avoids materializing the full D × D Jacobian of the normalization.
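The three gradients can be sketched directly from the cache layout above (again a reconstruction from this summary, not the file's exact code):

```python
import numpy as np

def layer_norm_backward(dy, cache):
    """dx, d_gamma, d_beta via the consolidated formulas above."""
    x_centered, x_norm, mean, std, gamma = cache
    d_gamma = np.sum(dy * x_norm, axis=(0, 1))   # reduce over (B, T)
    d_beta = np.sum(dy, axis=(0, 1))
    dz = dy * gamma
    # means over the feature dimension D; no explicit Jacobian is formed
    dx = (dz
          - dz.mean(axis=-1, keepdims=True)
          - x_norm * (dz * x_norm).mean(axis=-1, keepdims=True)) / std
    return dx, d_gamma, d_beta
```

The two `mean` terms are exactly the projections the full Jacobian would apply: one removes the component of `dz` along the shift direction, the other along the scale direction.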
3. **Gradient Check**:
- Uses central finite differences: `(f(x+h) - f(x-h)) / (2h)`
- Fixed array flattening to use views instead of copies
- Spot-check for large tensors (>100k elements)
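A generic central-difference checker along these lines can be written as follows (the helper name is hypothetical; it assumes the parameter array is C-contiguous, so `reshape(-1)` returns a writable view rather than a copy, which is the view-vs-copy pitfall the fix above refers to):

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    """Central finite differences: (f(x+h) - f(x-h)) / (2h), per element.
    Perturbs x in place through a flat view, then restores each entry."""
    grad = np.zeros_like(x)
    flat, gflat = x.reshape(-1), grad.reshape(-1)  # views, not copies
    for i in range(flat.size):
        old = flat[i]
        flat[i] = old + h
        fp = f()
        flat[i] = old - h
        fm = f()
        flat[i] = old
        gflat[i] = (fp - fm) / (2 * h)
    return grad
```

`f` is a zero-argument closure over `x`, so each perturbation is visible through the view. For tensors above ~100k elements the summary says the file spot-checks a random subset of indices instead of looping over all of them.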
4. **Numerical Stability**:
- `eps` prevents division by zero: `std = sqrt(var + eps)`
- Catastrophic cancellation analysis documented
- Pairwise summation suggestion for fp16
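The fp16 concern can be demonstrated directly: naive left-to-right accumulation in half precision stalls once the running sum outgrows the addend's resolution, while pairwise summation keeps the error near O(log n) roundings. The array size and constant here are illustrative only, and `pairwise_sum` is a hand-rolled sketch, not the file's code:

```python
import numpy as np

a = np.full(4096, 0.1, dtype=np.float16)

# Naive sequential accumulation in fp16: once the accumulator reaches 256,
# the spacing between representable values (0.25) exceeds the 0.1 addend,
# so every further add rounds away and the sum stops growing.
acc = np.float16(0.0)
for v in a:
    acc = np.float16(acc + v)

def pairwise_sum(v, block=8):
    """Recursive pairwise summation with fp16 partial sums."""
    if v.size <= block:
        s = np.float16(0.0)
        for t in v:
            s = np.float16(s + t)
        return s
    mid = v.size // 2
    return np.float16(pairwise_sum(v[:mid]) + pairwise_sum(v[mid:]))

ref = float(np.sum(a.astype(np.float64)))  # ~409.5, the true value
pw = float(pairwise_sum(a))
```

The naive sum plateaus near 256 while the pairwise sum lands close to the true 409.5, which is the motivation for the pairwise suggestion above.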
5. **Complexity**:
- **Time**: O(BTD) for both forward and backward
- **Space**: O(BTD) for activations (training), O(1) for inference
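For concreteness, the O(BTD) activation cost can be made tangible with an illustrative (assumed, not from the file) shape:

```python
# Illustrative transformer-ish shape: B=32, T=1024, D=768, fp32 (4 bytes each).
B, T, D, nbytes = 32, 1024, 768, 4
# Cached by the forward pass: x_centered and x_norm are (B, T, D);
# mean and std are (B, T, 1); gamma is (D,).
cache_bytes = 2 * B * T * D * nbytes + 2 * B * T * nbytes + D * nbytes
print(cache_bytes / 2**20, "MiB")  # ~192 MiB per layer-norm call at this shape
```

The two full-size (B, T, D) tensors dominate; the per-position statistics are a rounding error by comparison, which is why inference (no cache) drops to O(1) extra space.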
6. **GPU Fusion Design**:
- Grid of `(B × T)` blocks, each handling one `(b,t)` position
- Three phases: mean reduction, variance reduction, normalize+output
- Warp-level shuffle reductions for efficiency
- Single kernel replaces 4-5 separate kernels
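The three-phase block structure can be emulated in NumPy to sanity-check the reduction logic before writing an actual kernel. `tree_reduce` mimics a shuffle-style butterfly reduction (halving the active lanes each step), and `fused_block` plays the role of one `(b, t)` block; both names are hypothetical, and a real implementation would be a single CUDA kernel:

```python
import numpy as np

def tree_reduce(vals):
    """Butterfly-style sum over the last axis, as a warp shuffle-down
    reduction would do. Assumes the lane count is a power of two."""
    vals = vals.copy()
    offset = vals.shape[-1] // 2
    while offset:
        vals[..., :offset] += vals[..., offset:2 * offset]
        offset //= 2
    return vals[..., 0]

def fused_block(x_bt, gamma, beta, eps=1e-5):
    """One (b, t) position: all three phases in a single 'kernel'."""
    D = x_bt.shape[-1]
    mean = tree_reduce(x_bt) / D                              # phase 1
    var = tree_reduce((x_bt - mean) ** 2) / D                 # phase 2
    return gamma * (x_bt - mean) / np.sqrt(var + eps) + beta  # phase 3
```

Launching a `(B × T)` grid of such blocks fuses the whole operation; the unfused alternative pays a kernel launch plus a global-memory round trip for each of the mean, variance, normalize, and scale/shift steps.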