All gradient checks pass. The implementation is complete. Here's a summary:

Summary

File: ./minimax-m2.7/backwards/layer_norm_numpy.py

Key Components

  1. Forward Pass (layer_norm_forward):

    • Computes mean, variance, and standard deviation over the feature dimension (D)
    • Normalizes: x_norm = (x - mean) / sqrt(var + eps)
    • Scales and shifts: y = gamma * x_norm + beta
    • Caches intermediates for the backward pass: x_centered, x_norm, mean, std, gamma (forward-pass sketch after this list)
  2. Backward Pass (layer_norm_backward):

    • d_gamma: sum(dy * x_norm) over (B, T)
    • d_beta: sum(dy) over (B, T)
    • dx: (dz - mean(dz) - x_norm * mean(dz * x_norm)) / std

    Here dz = dy * gamma; the consolidated formula avoids materializing the full D × D Jacobian at each (b, t) position (backward-pass sketch after this list).

  3. Gradient Check:

    • Uses central finite differences: (f(x+h) - f(x-h)) / (2h)
    • Fixed array flattening to use views instead of copies, so in-place perturbations reach the original tensor
    • Spot-checks a random subset of elements for large tensors (>100k elements); see the gradient-check sketch after this list
  4. Numerical Stability:

    • eps prevents division by zero: std = sqrt(var + eps)
    • Documents a catastrophic cancellation analysis (a short demo follows the list)
    • Suggests pairwise summation for fp16 accumulations
  5. Complexity:

    • Time: O(BTD) for both forward and backward
    • Space: O(BTD) cached activations for training; O(1) auxiliary space for inference
  6. GPU Fusion Design:

    • Grid of (B × T) blocks, each handling one (b,t) position
    • Three phases: mean reduction, variance reduction, normalize+output
    • Warp-level shuffle reductions for efficiency
    • A single fused kernel replaces 4-5 separate kernels (a Python mock of the three phases follows the list)
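
A minimal NumPy sketch of the forward pass in item 1, assuming the (B, T, D) shapes above; the exact cache layout in layer_norm_numpy.py may differ:

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    """x: (B, T, D); gamma, beta: (D,). Normalizes over the feature dimension D."""
    mean = x.mean(axis=-1, keepdims=True)   # (B, T, 1)
    var = x.var(axis=-1, keepdims=True)     # (B, T, 1)
    std = np.sqrt(var + eps)                # eps guards the division below
    x_centered = x - mean
    x_norm = x_centered / std               # normalize
    y = gamma * x_norm + beta               # scale and shift
    cache = (x_centered, x_norm, mean, std, gamma)
    return y, cache
```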
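
The backward pass of item 2, written directly from the three formulas; again a sketch, not necessarily the file's exact code:

```python
def layer_norm_backward(dy, cache):
    """dy: (B, T, D) upstream gradient. Returns (dx, d_gamma, d_beta)."""
    x_centered, x_norm, mean, std, gamma = cache
    d_gamma = (dy * x_norm).sum(axis=(0, 1))  # reduce over (B, T)
    d_beta = dy.sum(axis=(0, 1))              # reduce over (B, T)
    dz = dy * gamma                           # gradient w.r.t. x_norm
    # Consolidated formula: never materializes the D x D Jacobian per position
    dx = (dz
          - dz.mean(axis=-1, keepdims=True)
          - x_norm * (dz * x_norm).mean(axis=-1, keepdims=True)) / std
    return dx, d_gamma, d_beta
```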
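
A gradient-check sketch matching item 3: central differences with view-based flattening and random spot-checks. The signature is hypothetical; `f` is assumed to map the full tensor to a scalar loss:

```python
def grad_check(f, x, dx_analytic, h=1e-5, num_checks=10, seed=0):
    """Compare the analytic gradient to (f(x+h) - f(x-h)) / (2h) at random coordinates."""
    rng = np.random.default_rng(seed)
    flat = x.reshape(-1)                    # view, not copy: writes land in x itself
    flat_grad = dx_analytic.reshape(-1)
    for _ in range(num_checks):             # spot-check: cheap even for >100k elements
        i = rng.integers(flat.size)
        old = flat[i]
        flat[i] = old + h; f_plus = f(x)
        flat[i] = old - h; f_minus = f(x)
        flat[i] = old                       # restore the original value
        numeric = (f_plus - f_minus) / (2 * h)
        rel_err = abs(numeric - flat_grad[i]) / max(abs(numeric) + abs(flat_grad[i]), 1e-12)
        assert rel_err < 1e-5, f"mismatch at {i}: rel_err={rel_err:.2e}"
```

The view-based flattening matters here: perturbing an element of a flattened copy would never reach the tensor that `f` actually reads.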
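
For item 4's stability notes, a plausible float32 demonstration of catastrophic cancellation (the file's actual analysis may differ; the constants here are assumptions): the one-pass variance E[x²] − E[x]² subtracts two nearly equal ~1e8 values, while the centered two-pass form cancels before squaring:

```python
rng = np.random.default_rng(0)
x = (1e4 + rng.standard_normal(10_000)).astype(np.float32)  # large mean, unit variance

naive = np.mean(x * x) - np.mean(x) ** 2      # one-pass: two ~1e8 terms cancel catastrophically
centered = np.mean((x - np.mean(x)) ** 2)     # two-pass: center first, then square
reference = np.var(x.astype(np.float64))      # float64 ground truth, ~1.0
print(naive, centered, reference)             # naive is badly off and can even go negative
```

NumPy's own sum/mean already use pairwise summation internally; the summary's fp16 suggestion applies the same idea to low-precision accumulations.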
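
Item 6's fused-kernel design, mocked in Python to show the block decomposition and the three phases; the real kernel would perform the two reductions with warp-level shuffles rather than NumPy sums:

```python
def fused_layer_norm(x, gamma, beta, eps=1e-5):
    """Grid of B*T 'blocks', one per (b, t) position; the loop stands in for the launch."""
    B, T, D = x.shape
    y = np.empty_like(x)
    for b in range(B):
        for t in range(T):
            row = x[b, t]
            mean = row.sum() / D                  # phase 1: mean reduction
            var = ((row - mean) ** 2).sum() / D   # phase 2: variance reduction
            y[b, t] = gamma * (row - mean) / np.sqrt(var + eps) + beta  # phase 3: normalize + output
    return y
```

Fusing keeps mean, var, and the normalized row on-chip, which is why one kernel can replace the 4-5 separate kernels, and their round trips to global memory, of an unfused version.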