feat: add model comparisons and sanitize session files

- Rename gamma to glm5 and model to minimax-m2.7
- Add model_comparison/ directory with head-to-head analyses
- Sanitize all session.jsonl files: remove absolute paths and usernames
- Remove __pycache__ artifacts
- Add .gitignore
2026-04-23 11:16:01 +02:00
commit 8e72eef09c
62 changed files with 18469 additions and 0 deletions
All gradient checks pass. The implementation is complete. Here's a summary:
## Summary
**File:** `./minimax-m2.7/backwards/layer_norm_numpy.py`
### Key Components
1. **Forward Pass** (`layer_norm_forward`):
- Computes mean, variance, and standard deviation over the feature dimension (D)
- Normalizes: `x_norm = (x - mean) / sqrt(var + eps)`
- Scales and shifts: `y = gamma * x_norm + beta`
- Caches intermediates: `x_centered`, `x_norm`, `mean`, `std`, `gamma`
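These bullets translate almost line-for-line into NumPy. The sketch below is a reconstruction from this summary, not the file's actual code, assuming `x` has shape `(B, T, D)`:

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """Sketch of the forward pass described above; x is assumed (B, T, D)."""
    mean = x.mean(axis=-1, keepdims=True)           # (B, T, 1)
    var = x.var(axis=-1, keepdims=True)             # (B, T, 1)
    std = np.sqrt(var + eps)                        # eps keeps this nonzero
    x_centered = x - mean
    x_norm = x_centered / std
    y = gamma * x_norm + beta                       # scale and shift
    cache = (x_centered, x_norm, mean, std, gamma)  # intermediates for backward
    return y, cache
```

With `gamma = 1` and `beta = 0`, each `(b, t)` slice of the output has mean ≈ 0 and standard deviation ≈ 1 (up to the `eps` perturbation).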
2. **Backward Pass** (`layer_norm_backward`):
- **d_gamma**: `sum(dy * x_norm)` over (B, T)
- **d_beta**: `sum(dy)` over (B, T)
- **dx**: `(dz - mean(dz) - x_norm * mean(dz * x_norm)) / std`, where `dz = dy * gamma` and each `mean` is taken over the feature dimension D
The consolidated formula avoids materializing the full D × D Jacobian of the normalization.
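The three gradients can be sketched directly from the cache layout above (again a reconstruction from this summary, not the file's exact code):

```python
import numpy as np

def layer_norm_backward(dy, cache):
    """dx, d_gamma, d_beta via the consolidated formulas above."""
    x_centered, x_norm, mean, std, gamma = cache
    d_gamma = np.sum(dy * x_norm, axis=(0, 1))   # reduce over (B, T)
    d_beta = np.sum(dy, axis=(0, 1))
    dz = dy * gamma
    # means over the feature dimension D; no explicit Jacobian is formed
    dx = (dz
          - dz.mean(axis=-1, keepdims=True)
          - x_norm * (dz * x_norm).mean(axis=-1, keepdims=True)) / std
    return dx, d_gamma, d_beta
```

The two `mean` terms are exactly the projections the full Jacobian would apply: one removes the component of `dz` along the shift direction, the other along the scale direction.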
3. **Gradient Check**:
- Uses central finite differences: `(f(x+h) - f(x-h)) / (2h)`
- Fixed array flattening to use views instead of copies
- Spot-check for large tensors (>100k elements)
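A generic central-difference checker along these lines can be written as follows (the helper name is hypothetical; it assumes the parameter array is C-contiguous, so `reshape(-1)` returns a writable view rather than a copy, which is the view-vs-copy pitfall the fix above refers to):

```python
import numpy as np

def numerical_grad(f, x, h=1e-5):
    """Central finite differences: (f(x+h) - f(x-h)) / (2h), per element.
    Perturbs x in place through a flat view, then restores each entry."""
    grad = np.zeros_like(x)
    flat, gflat = x.reshape(-1), grad.reshape(-1)  # views, not copies
    for i in range(flat.size):
        old = flat[i]
        flat[i] = old + h
        fp = f()
        flat[i] = old - h
        fm = f()
        flat[i] = old
        gflat[i] = (fp - fm) / (2 * h)
    return grad
```

`f` is a zero-argument closure over `x`, so each perturbation is visible through the view. For tensors above ~100k elements the summary says the file spot-checks a random subset of indices instead of looping over all of them.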
4. **Numerical Stability**:
- `eps` prevents division by zero: `std = sqrt(var + eps)`
- Catastrophic cancellation analysis documented
- Pairwise summation suggestion for fp16
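The fp16 concern can be demonstrated directly: naive left-to-right accumulation in half precision stalls once the running sum outgrows the addend's resolution, while pairwise summation keeps the error near O(log n) roundings. The array size and constant here are illustrative only, and `pairwise_sum` is a hand-rolled sketch, not the file's code:

```python
import numpy as np

a = np.full(4096, 0.1, dtype=np.float16)

# Naive sequential accumulation in fp16: once the accumulator reaches 256,
# the spacing between representable values (0.25) exceeds the 0.1 addend,
# so every further add rounds away and the sum stops growing.
acc = np.float16(0.0)
for v in a:
    acc = np.float16(acc + v)

def pairwise_sum(v, block=8):
    """Recursive pairwise summation with fp16 partial sums."""
    if v.size <= block:
        s = np.float16(0.0)
        for t in v:
            s = np.float16(s + t)
        return s
    mid = v.size // 2
    return np.float16(pairwise_sum(v[:mid]) + pairwise_sum(v[mid:]))

ref = float(np.sum(a.astype(np.float64)))  # ~409.5, the true value
pw = float(pairwise_sum(a))
```

The naive sum plateaus near 256 while the pairwise sum lands close to the true 409.5, which is the motivation for the pairwise suggestion above.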
5. **Complexity**:
- **Time**: O(BTD) for both forward and backward
- **Space**: O(BTD) for activations (training), O(1) for inference
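For concreteness, the O(BTD) activation cost can be made tangible with an illustrative (assumed, not from the file) shape:

```python
# Illustrative transformer-ish shape: B=32, T=1024, D=768, fp32 (4 bytes each).
B, T, D, nbytes = 32, 1024, 768, 4
# Cached by the forward pass: x_centered and x_norm are (B, T, D);
# mean and std are (B, T, 1); gamma is (D,).
cache_bytes = 2 * B * T * D * nbytes + 2 * B * T * nbytes + D * nbytes
print(cache_bytes / 2**20, "MiB")  # ~192 MiB per layer-norm call at this shape
```

The two full-size (B, T, D) tensors dominate; the per-position statistics are a rounding error by comparison, which is why inference (no cache) drops to O(1) extra space.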
6. **GPU Fusion Design**:
- Grid of `(B × T)` blocks, each handling one `(b,t)` position
- Three phases: mean reduction, variance reduction, normalize+output
- Warp-level shuffle reductions for efficiency
- Single kernel replaces 4-5 separate kernels
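The three-phase block structure can be emulated in NumPy to sanity-check the reduction logic before writing an actual kernel. `tree_reduce` mimics a shuffle-style butterfly reduction (halving the active lanes each step), and `fused_block` plays the role of one `(b, t)` block; both names are hypothetical, and a real implementation would be a single CUDA kernel:

```python
import numpy as np

def tree_reduce(vals):
    """Butterfly-style sum over the last axis, as a warp shuffle-down
    reduction would do. Assumes the lane count is a power of two."""
    vals = vals.copy()
    offset = vals.shape[-1] // 2
    while offset:
        vals[..., :offset] += vals[..., offset:2 * offset]
        offset //= 2
    return vals[..., 0]

def fused_block(x_bt, gamma, beta, eps=1e-5):
    """One (b, t) position: all three phases in a single 'kernel'."""
    D = x_bt.shape[-1]
    mean = tree_reduce(x_bt) / D                              # phase 1
    var = tree_reduce((x_bt - mean) ** 2) / D                 # phase 2
    return gamma * (x_bt - mean) / np.sqrt(var + eps) + beta  # phase 3
```

Launching a `(B × T)` grid of such blocks fuses the whole operation; the unfused alternative pays a kernel launch plus a global-memory round trip for each of the mean, variance, normalize, and scale/shift steps.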