Head-to-Head: Layer Normalization Backward Pass

MiniMax-M2.7 vs Qwen3.6-27B on the backwards task


Executive Summary

Dimension                MiniMax-M2.7   Qwen3.6-27B
Correctness                        85            95
Completeness                       80            95
Code Quality                       70            90
Numerical Stability                75            95
Gradient Check                     80            90
Complexity Analysis                80            90
GPU Fusion Explanation             85            85
Tests / Benchmarks                 60            95
Overall                            76            92

Winner: Qwen3.6-27B by 16 points.


1. Correctness

MiniMax-M2.7 (85/100)

  • Implements the correct consolidated backward formula: dx = (dz - mean(dz) - x_norm * mean(dz * x_norm)) / std (a minimal sketch follows this list)
  • d_gamma and d_beta are correctly computed via reductions over (B, T)
  • The forward pass correctly computes mean, variance, and normalization
  • Minor issue: The cache stores x with the comment "needed for gradient check," but the backward function never actually uses x — it uses x_centered and x_norm instead. This is technically harmless but shows imprecise reasoning about what's actually required.
  • Potential issue: The gradient check's compute_numerical_gradient_x function modifies x in-place via x_flat = x.reshape(-1), which creates a view. While it restores values, this is fragile — if an exception occurs mid-check, x is left in a corrupted state. Qwen3.6-27B avoids this by operating on copies.
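
A minimal NumPy sketch of that consolidated formula (illustrative, not MiniMax-M2.7's actual code; note that the upstream gradient is folded through gamma first, which the compact formula above leaves implicit):

```python
import numpy as np

def layer_norm_backward(dz, x_norm, std, gamma):
    """Backward for y = gamma * x_norm + beta; x_norm is (B, T, D), std is (B, T, 1)."""
    d_beta = dz.sum(axis=(0, 1))                  # reduction over (B, T)
    d_gamma = (dz * x_norm).sum(axis=(0, 1))
    g = dz * gamma                                # upstream gradient w.r.t. x_norm
    # Consolidated formula: dx = (g - mean(g) - x_norm * mean(g * x_norm)) / std
    dx = (g - g.mean(-1, keepdims=True)
            - x_norm * (g * x_norm).mean(-1, keepdims=True)) / std
    return dx, d_gamma, d_beta
```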

Qwen3.6-27B (95/100)

  • Implements the mathematically equivalent formula expressed as: dx = std_inv * (g - g_mean - x_hat * gx_mean)
  • The derivation is clearly documented in comments, showing the projection-formula origin
  • Cross-check included: benchmark_layer_norm.py contains an alternative step-by-step chain-rule derivation that independently computes dx and verifies it matches the compact formula (relative error < 1e-10); a comparable reconstruction follows this list
  • The forward pass explicitly uses a two-pass variance computation
  • No correctness bugs detected
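
A minimal reconstruction of that cross-check (the function bodies are our own sketch, using the shapes reported above): the step-by-step version pushes the chain rule through the mean and variance separately, and should agree with the compact formula to near machine precision.

```python
import numpy as np

def forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(-1, keepdims=True)
    var = ((x - mean) ** 2).mean(-1, keepdims=True)   # two-pass variance
    std_inv = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mean) * std_inv
    return gamma * x_hat + beta, (x_hat, std_inv)

def backward_compact(dz, x_hat, std_inv, gamma):
    # dx = std_inv * (g - g_mean - x_hat * gx_mean)
    g = dz * gamma
    return std_inv * (g - g.mean(-1, keepdims=True)
                        - x_hat * (g * x_hat).mean(-1, keepdims=True))

def backward_stepwise(dz, x, gamma, eps=1e-5):
    # Independent derivation: chain rule through mean and variance explicitly.
    D = x.shape[-1]
    xc = x - x.mean(-1, keepdims=True)
    std_inv = 1.0 / np.sqrt((xc ** 2).mean(-1, keepdims=True) + eps)
    g = dz * gamma
    dvar = (g * xc).sum(-1, keepdims=True) * (-0.5) * std_inv ** 3
    dmean = (-g * std_inv).sum(-1, keepdims=True) + dvar * (-2.0) * xc.mean(-1, keepdims=True)
    return g * std_inv + dvar * 2.0 * xc / D + dmean / D

rng = np.random.default_rng(0)
x, dz = rng.normal(size=(4, 8, 16)), rng.normal(size=(4, 8, 16))
gamma, beta = rng.normal(size=16), rng.normal(size=16)
_, (x_hat, std_inv) = forward(x, gamma, beta)
diff = backward_compact(dz, x_hat, std_inv, gamma) - backward_stepwise(dz, x, gamma)
print(np.max(np.abs(diff)))   # expect ~1e-15, well under the 1e-10 reported
```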

Verdict: Both are correct, but Qwen3.6-27B's independent cross-check gives higher confidence.


2. Completeness

MiniMax-M2.7 (80/100)

  • Meets all 6 requirements from the prompt
  • Provides forward pass, backward pass, gradient check, complexity analysis, GPU fusion discussion
  • Includes a benchmark function
  • Missing: dedicated edge-case tests, a concrete numerical-stability demonstration, and separate test/benchmark files

Qwen3.6-27B (95/100)

  • Meets all 6 requirements comprehensively
  • Bonus: Three separate files with distinct responsibilities:
    • layer_norm_backward.py — core implementation
    • test_layer_norm.py — edge-case validation (zero input, D=1, large mean, large D, gradient norm sanity)
    • benchmark_layer_norm.py — performance benchmarks + variance stability demo + alternative derivation cross-check
  • Memory efficiency check: Explicitly verifies that backward succeeds without x or x_centered in cache

Verdict: Qwen3.6-27B exceeds requirements with a full testing and benchmarking suite.


3. Code Quality

MiniMax-M2.7 (70/100)

  • Single monolithic file (~750 lines) mixing implementation, tests, benchmarks, analysis, and GPU discussion
  • Excessive caching: stores 10 items in cache (x, x_centered, x_norm, mean, var, std, gamma, beta, eps, plus B, T, D)
    • Only x_norm, std, and gamma are actually needed for backward
    • Storing x, x_centered, mean, var, beta is redundant
  • Lots of decorative ASCII art and verbose docstrings that add bulk without adding clarity
  • The LayerNorm class wrapper is nice but unnecessary for the task

Qwen3.6-27B (90/100)

  • Clean, focused implementation: Core algorithm is ~70 lines of actual code
  • Minimal cache: only the 4 items backward actually needs (x_hat, std_inv, gamma, D)
    • No x, no x_centered, no var, no mean — the backward formula is self-contained
  • Separation of concerns across 3 files
  • Docstrings are concise and precise
  • No unnecessary class wrappers

Verdict: Qwen3.6-27B is significantly cleaner with better separation of concerns and a minimal, precise cache.


4. Numerical Stability

MiniMax-M2.7 (75/100)

  • Uses two-pass variance: x_centered = x - mean, then var = mean(x_centered**2)
  • Discusses numerical stability in inline comments (8 numbered points)
  • Mentions catastrophic cancellation in (dz - mean(dz))
  • Weakness: No concrete demonstration of the catastrophic cancellation problem. The discussion is entirely theoretical.
  • eps = 1e-8 (reasonable)

Qwen3.6-27B (95/100)

  • Explicitly uses two-pass variance and labels it as "numerically stable"
  • Concrete demonstration: benchmark_layer_norm.py includes a demo_variance_stability() function that:
    • Shows naive_variance producing 0.0 for offset=1e8 (true variance = 2.0)
    • Shows two_pass_variance staying exact at 2.0
    • Demonstrates degradation across offsets from 1e4 to 1e14 (reconstructed in the sketch after this list)
  • Edge-case tests: test_layer_norm.py tests zero input, D=1 (degenerate), large D (1024), large-magnitude inputs (1e8 offset)
  • eps = 1e-5 (slightly more conservative)
  • Explicit stability discussion in the main file covering 5 scenarios with solutions
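
For reference, a minimal reconstruction of that demo (function names follow the report; the data and offsets are illustrative):

```python
import numpy as np

def naive_variance(x):
    # One-pass E[x^2] - E[x]^2: two huge terms cancel catastrophically
    # when the mean dwarfs the spread.
    return np.mean(x * x) - np.mean(x) ** 2

def two_pass_variance(x):
    # Center first, then average squared deviations: no large-term cancellation.
    return np.mean((x - np.mean(x)) ** 2)

base = np.random.default_rng(0).normal(0.0, np.sqrt(2.0), size=100_000)  # variance ~2.0
for offset in (1e4, 1e8, 1e12, 1e14):
    x = base + offset
    print(f"offset={offset:.0e}  naive={naive_variance(x):<14.6g}  two-pass={two_pass_variance(x):.6g}")
```

Exact outputs depend on the data, but in float64 the naive estimate degrades visibly as the offset grows and collapses entirely at the top of the range, while the two-pass result stays at ~2.0.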

Verdict: Qwen3.6-27B wins decisively by demonstrating the problem rather than just describing it.


5. Gradient Check

MiniMax-M2.7 (80/100)

  • Central finite differences for all three gradients (x, gamma, beta)
  • Spot-check for large tensors: When BTD > 100,000, checks 100,000 random elements instead of all
  • Uses rtol=1e-4, atol=1e-5 tolerances
  • Tests on 3 shapes: (2,4,8), (4,8,16), (8,16,32)
  • Weakness: No explicit assertion that gradient checks pass — just prints results

Qwen3.6-27B (90/100)

  • Central finite differences with delta=1e-5
  • Reports relative error (not just absolute), which is more informative; see the sketch after this list
  • Tests on the main shape (4,8,16) with all three gradients
  • Relative errors reported: dx ~5e-11, dgamma ~1.75e-11, dbeta ~1.46e-11 — extremely tight
  • Edge-case tests in test_layer_norm.py run gradient checks on large-magnitude and large-D inputs
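
A generic sketch of that methodology (our own illustration, not Qwen3.6-27B's actual code): central differences computed on a copy of the input, with a symmetric relative-error metric.

```python
import numpy as np

def numerical_grad(f, x, delta=1e-5):
    """Central-difference gradient of scalar-valued f at x, computed on a copy of x."""
    grad = np.zeros_like(x)
    xc = x.copy()                                 # never mutate the caller's tensor
    flat, gflat = xc.reshape(-1), grad.reshape(-1)
    for i in range(flat.size):
        orig = flat[i]
        flat[i] = orig + delta
        f_plus = f(xc)
        flat[i] = orig - delta
        f_minus = f(xc)
        flat[i] = orig                            # restore before moving on
        gflat[i] = (f_plus - f_minus) / (2 * delta)
    return grad

def relative_error(a, b):
    return np.max(np.abs(a - b) / np.maximum(np.abs(a) + np.abs(b), 1e-12))

x = np.random.default_rng(0).normal(size=(3, 4))
analytic = 1.0 - np.tanh(x) ** 2                  # d/dx sum(tanh(x))
numeric = numerical_grad(lambda t: np.sum(np.tanh(t)), x)
print(relative_error(analytic, numeric))          # typically ~1e-10 or tighter
```

For very large tensors, the same loop can run over a random subset of indices, which is exactly the spot-check MiniMax-M2.7 applies when BTD > 100,000.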

Verdict: Qwen3.6-27B's relative error reporting and tighter numerical agreement give it the edge.


6. Complexity Analysis

MiniMax-M2.7 (80/100)

  • ASCII-art table showing FLOPs and memory for forward and backward
  • Correctly identifies O(BTD) time and space complexity
  • Counts ~5 O(BTD) operations each for forward and backward
  • Includes cache efficiency discussion

Qwen3.6-27B (90/100)

  • More granular FLOP counts: forward ~6N, backward ~9N, total ~15N (made concrete in the sketch after this section's verdict)
  • Explicitly notes backward is ~1.5x forward in FLOPs
  • Includes memory footprint in MB for concrete shapes
  • Discusses why two-pass variance is worth the extra O(N) FLOPs
  • Computes TFLOPS throughput in benchmarks

Verdict: Qwen3.6-27B provides more quantitative detail.
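
To make the ~15N figure concrete, a back-of-envelope tally for an illustrative shape (the shape and dtype are our assumptions, not taken from either submission):

```python
B, T, D = 8, 512, 1024      # illustrative Transformer-scale shape
N = B * T * D               # elements per activation tensor
flops = 15 * N              # ~6N forward + ~9N backward, per the counts above
mib = 4 * N / 2**20         # one float32 activation tensor
print(f"N = {N:,} elements -> ~{flops / 1e6:.0f} MFLOPs total, {mib:.0f} MiB per tensor")
```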


7. GPU Fusion Explanation

MiniMax-M2.7 (85/100)

  • Very detailed ASCII-art explanation of fused forward and backward kernels
  • Includes actual CUDA pseudocode with __global__, __shared__, warpReduceSum
  • Discusses memory access patterns, coalescing, and shared memory layout
  • Explains 3-phase design: load+mean, variance, normalize+output
  • Mentions warp-level shuffle reductions

Qwen3.6-27B (85/100)

  • Detailed GPU fusion discussion in a string constant
  • Includes CUDA pseudocode for both forward and backward kernels
  • Quantifies memory traffic: naive = ~12 accesses/element, fused = 4 (forward) and 5 (backward); a rough tally follows this section's verdict
  • Discusses atomicAdd for dgamma/dbeta reduction
  • Mentions shared memory optimization for small D (<= 1024)
  • Notes that warp-level primitives can replace shared memory when D <= 32

Verdict: Both are excellent. MiniMax-M2.7 has nicer formatting; Qwen3.6-27B has better quantitative comparison.
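
Taking the quoted access counts at face value (reading the ~12 accesses/element as the unfused forward pass is our assumption), the traffic reduction works out roughly as follows:

```python
N = 8 * 512 * 1024                  # elements, same illustrative shape as above
bytes_el = 4                        # float32
naive_fwd = 12 * N * bytes_el       # ~12 global-memory accesses/element, unfused
fused_fwd = 4 * N * bytes_el        # 4 accesses/element once fused
fused_bwd = 5 * N * bytes_el        # 5 accesses/element for the fused backward
print(f"forward: {naive_fwd / 2**20:.0f} MiB -> {fused_fwd / 2**20:.0f} MiB "
      f"({naive_fwd / fused_fwd:.0f}x less traffic); fused backward moves {fused_bwd / 2**20:.0f} MiB")
```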


8. Tests and Benchmarks

MiniMax-M2.7 (60/100)

  • benchmark() function tests 4 shapes with timing
  • run_gradient_checks() tests 3 shapes
  • No edge-case tests, no assertions, no separate test file
  • Benchmark only runs 100 iterations — sufficient but minimal

Qwen3.6-27B (95/100)

  • test_layer_norm.py with 5 edge-case test categories (two are reconstructed in the sketch after this section's verdict):
    1. Large mean, tiny variance (cancellation-prone)
    2. Zero input (variance = 0)
    3. Large D (Transformer-scale: D=1024)
    4. D=1 (degenerate case)
    5. Gradient norm sanity across scales (1e-3 to 1e6)
  • benchmark_layer_norm.py with:
    • Variance stability demo (naive vs two-pass)
    • Performance benchmarks across 8 configurations
    • Alternative derivation cross-check
  • test_memory_efficiency() explicitly verifies minimal cache
  • Uses assert statements for validation

Verdict: Qwen3.6-27B is far superior in testing coverage and rigor.
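
Two of those edge cases are easy to reconstruct (a minimal sketch with a generic forward pass, not Qwen3.6-27B's actual file):

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    mean = x.mean(-1, keepdims=True)
    var = ((x - mean) ** 2).mean(-1, keepdims=True)   # two-pass variance
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# Zero input: variance is exactly 0, so only eps keeps the division finite.
y = layer_norm(np.zeros((2, 3, 8)), np.ones(8), np.zeros(8))
assert np.all(np.isfinite(y)) and np.allclose(y, 0.0)

# D=1 degenerate case: each row equals its own mean, so the output is exactly beta.
x = np.random.default_rng(0).normal(size=(2, 3, 1))
assert np.allclose(layer_norm(x, np.ones(1), np.zeros(1)), 0.0)
```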


9. What Each Did Best

MiniMax-M2.7

  • Beautiful ASCII-art complexity tables
  • Detailed CUDA pseudocode in formatted boxes
  • LayerNorm class wrapper
  • Spot-check gradient for large tensors
  • Inline stability analysis (8 points)
  • Good pedagogical structure

Qwen3.6-27B

  • Minimal, precise cache (only what's needed)
  • Concrete numerical stability demonstration
  • Independent backward formula cross-check
  • Comprehensive edge-case test suite
  • Memory-efficiency verification
  • Clean separation across 3 focused files

10. Weaknesses

MiniMax-M2.7

  1. Over-caching: Stores 10 cache items when only 3 tensors + 1 scalar are needed for backward
  2. No edge-case testing: No tests for zero input, D=1, large offsets, etc.
  3. Monolithic structure: Everything crammed into one 750-line file
  4. No concrete stability demo: Discusses catastrophic cancellation but never shows it
  5. Fragile gradient check: Modifies input in-place without a copy
  6. Missing assertions: Tests print results but don't assert correctness

Qwen3.6-27B

  1. GPU fusion discussion is a string constant: Less readable than MiniMax-M2.7's formatted output
  2. No spot-check for very large tensors: Gradient check always runs full finite differences, which could be slow for BTD > 100K
  3. Different eps: 1e-5 vs MiniMax-M2.7's 1e-8; both are fine, and 1e-5 matches PyTorch's LayerNorm default, so this is a minor point at most
  4. No LayerNorm class: Minor — not really needed for the task

Final Verdict

Qwen3.6-27B wins by 16 points (92 vs 76).

The gap is driven by three factors:

  1. Testing: Qwen3.6-27B has a full test suite with edge cases, assertions, and memory verification; MiniMax-M2.7 has none.
  2. Numerical stability: Qwen3.6-27B demonstrates the catastrophic cancellation problem; MiniMax-M2.7 only describes it.
  3. Code cleanliness: Qwen3.6-27B's minimal cache and focused files are significantly better engineered than MiniMax-M2.7's monolithic, over-cached implementation.

MiniMax-M2.7 is not bad — it correctly implements the backward pass, has good gradient checks, and provides a solid GPU fusion discussion. But Qwen3.6-27B takes the same foundation and elevates it with rigorous testing, concrete demonstrations, and cleaner engineering.