All gradient checks pass. The implementation is complete. Here's a summary:
Summary
File: `./minimax-m2.7/backwards/layer_norm_numpy.py`
Key Components
- Forward Pass (`layer_norm_forward`):
  - Computes mean, variance, and standard deviation over the feature dimension (D)
  - Normalizes: `x_norm = (x - mean) / sqrt(var + eps)`
  - Scales and shifts: `y = gamma * x_norm + beta`
  - Caches intermediates: `x_centered`, `x_norm`, `mean`, `std`, `gamma`
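A minimal numpy sketch of this forward pass, assuming inputs of shape (B, T, D) and per-feature `gamma`/`beta` of shape (D,) (the exact cache layout is illustrative):

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    # x: (B, T, D); gamma, beta: (D,)
    mean = x.mean(axis=-1, keepdims=True)           # per-(b, t) mean over D
    var = x.var(axis=-1, keepdims=True)             # per-(b, t) variance over D
    std = np.sqrt(var + eps)                        # eps guards against division by zero
    x_centered = x - mean
    x_norm = x_centered / std                       # normalize
    y = gamma * x_norm + beta                       # scale and shift
    cache = (x_centered, x_norm, mean, std, gamma)  # intermediates for the backward pass
    return y, cache
```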
- Backward Pass (`layer_norm_backward`):
  - d_gamma: `sum(dy * x_norm)` over (B, T)
  - d_beta: `sum(dy)` over (B, T)
  - dx: `(dz - mean(dz) - x_norm * mean(dz * x_norm)) / std`, where `dz = dy * gamma` and the means run over D. The consolidated formula avoids computing the full Jacobian.
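A matching numpy sketch of the backward pass using the consolidated formula (shapes and cache order as in the forward sketch above):

```python
def layer_norm_backward(dy, cache):
    # dy: (B, T, D) upstream gradient
    x_centered, x_norm, mean, std, gamma = cache
    d_gamma = (dy * x_norm).sum(axis=(0, 1))  # reduce over (B, T)
    d_beta = dy.sum(axis=(0, 1))              # reduce over (B, T)
    dz = dy * gamma
    # Consolidated formula: avoids materializing a D x D Jacobian per position
    dx = (dz
          - dz.mean(axis=-1, keepdims=True)
          - x_norm * (dz * x_norm).mean(axis=-1, keepdims=True)) / std
    return dx, d_gamma, d_beta
```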
- Gradient Check:
  - Uses central finite differences: `(f(x+h) - f(x-h)) / (2h)`
  - Fixed array flattening to use views instead of copies, so perturbations reach the original arrays
  - Spot-checks for large tensors (>100k elements)
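A hedged sketch of such a check for `d_gamma` (the tolerance, step size `h`, and spot-check count are illustrative; it assumes `gamma` is contiguous so `ravel` returns a view rather than a copy):

```python
def grad_check_gamma(x, gamma, beta, h=1e-5, num_checks=10, seed=0):
    rng = np.random.default_rng(seed)
    dy = rng.standard_normal(x.shape)            # random upstream gradient
    _, cache = layer_norm_forward(x, gamma, beta)
    _, d_gamma, _ = layer_norm_backward(dy, cache)
    flat = gamma.ravel()                         # view, not copy: edits reach `gamma`
    for i in rng.choice(flat.size, size=min(num_checks, flat.size), replace=False):
        old = flat[i]
        flat[i] = old + h
        y_plus, _ = layer_norm_forward(x, gamma, beta)
        flat[i] = old - h
        y_minus, _ = layer_norm_forward(x, gamma, beta)
        flat[i] = old                            # restore the original value
        # Central difference of the scalar loss L = sum(y * dy)
        numeric = ((y_plus - y_minus) * dy).sum() / (2 * h)
        assert np.isclose(numeric, d_gamma.ravel()[i], atol=1e-6), (numeric, d_gamma.ravel()[i])
```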
- Numerical Stability:
  - `eps` prevents division by zero: `std = sqrt(var + eps)`
  - Catastrophic cancellation analysis documented
  - Pairwise summation suggested for fp16
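To illustrate why pairwise summation matters at low precision, a small self-contained comparison (values chosen to make a naive fp16 running sum stall once the accumulator's spacing exceeds the addend):

```python
def pairwise_sum_fp16(a):
    """Recursive pairwise summation in fp16: error grows ~O(log n) vs O(n) for a running sum."""
    if a.size == 1:
        return a[0]
    mid = a.size // 2
    return np.float16(pairwise_sum_fp16(a[:mid]) + pairwise_sum_fp16(a[mid:]))

x = np.full(4096, np.float16(0.1))

naive = np.float16(0.0)
for v in x:                          # sequential fp16 accumulation
    naive = np.float16(naive + v)

print(naive)                         # stalls far below the true value
print(pairwise_sum_fp16(x))          # close to the true value
print(x.astype(np.float64).sum())    # fp64 reference, ~409.6
```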
- Complexity:
  - Time: O(BTD) for both forward and backward
  - Space: O(BTD) for cached activations (training), O(1) for inference
- GPU Fusion Design:
  - Grid of (B × T) blocks, each handling one (b, t) position
  - Three phases: mean reduction, variance reduction, normalize + output
  - Warp-level shuffle reductions for efficiency
  - A single fused kernel replaces 4-5 separate kernels (see the sketch below)
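As one way to realize this design while staying in Python, a hedged Triton sketch of the fused forward kernel (the name, signature, and launch details are illustrative; Triton's compiler lowers the `tl.sum` block reductions to warp-level shuffles, and the three phases mirror the list above):

```python
import triton
import triton.language as tl

@triton.jit
def fused_layer_norm_fwd(X, Y, GAMMA, BETA, stride, D, eps, BLOCK: tl.constexpr):
    row = tl.program_id(0)               # one program per (b, t) position: grid = (B * T,)
    cols = tl.arange(0, BLOCK)
    mask = cols < D
    x = tl.load(X + row * stride + cols, mask=mask, other=0.0).to(tl.float32)
    # Phase 1: mean reduction
    mean = tl.sum(x, axis=0) / D
    # Phase 2: variance reduction
    diff = tl.where(mask, x - mean, 0.0)
    var = tl.sum(diff * diff, axis=0) / D
    rstd = 1.0 / tl.sqrt(var + eps)
    # Phase 3: normalize, scale, shift, and write the output
    gamma = tl.load(GAMMA + cols, mask=mask, other=1.0).to(tl.float32)
    beta = tl.load(BETA + cols, mask=mask, other=0.0).to(tl.float32)
    y = (x - mean) * rstd * gamma + beta
    tl.store(Y + row * stride + cols, y, mask=mask)
```

Launched with `grid = (B * T,)` and `BLOCK` set to the next power of two at or above D, the row stays in registers across all three phases, so one pass over global memory replaces the separate mean, variance, normalize, and scale/shift kernels.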