Head-to-Head: Layer Normalization Backward Pass
GLM-5 vs Qwen3.6-27B: backward-pass implementations
Executive Summary
| Dimension | GLM-5 | Qwen3.6-27B |
|---|---|---|
| Correctness | 92 | 95 |
| Completeness | 80 | 95 |
| Code Quality | 88 | 90 |
| Numerical Stability | 80 | 95 |
| Gradient Check | 85 | 92 |
| Complexity Analysis | 82 | 90 |
| GPU Fusion Explanation | 85 | 88 |
| Tests / Benchmarks | 60 | 95 |
| Overall | 82 | 93 |
Winner: Qwen3.6-27B by 11 points.
1. Correctness
GLM-5 (92/100)
- Implements the correct consolidated backward formula (sketched after this list): `dx = rstd * (dxhat - xhat * proj/D - dxhat_sum/D)`
- `d_gamma` and `d_beta` correctly computed via reductions over (B, T)
- Forward pass correctly uses two-pass variance (center first, then compute variance)
- Uses `rstd = 1.0 / np.sqrt(var + eps)` directly, which is numerically preferable to `1/std`
- Minor note: the docstring derivation is elegant but slightly condensed; it states that the second term of dμ cancels to zero without showing the algebra, which could confuse readers trying to follow along
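For concreteness, here is a minimal NumPy sketch of a forward pass with this cache layout plus the consolidated backward formula. The function names and the (B, T, D) shape convention are assumptions for illustration, not GLM-5's actual code:

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    # Two-pass variance: center first, then average the squares.
    mean = x.mean(axis=-1, keepdims=True)
    xc = x - mean
    var = (xc ** 2).mean(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(var + eps)  # direct 1/sqrt, not sqrt-then-divide
    xhat = xc * rstd
    out = gamma * xhat + beta
    cache = (xhat, rstd, gamma)      # the minimal 3-item cache
    return out, cache

def layer_norm_backward(dout, cache):
    xhat, rstd, gamma = cache
    D = xhat.shape[-1]
    dxhat = dout * gamma
    proj = (dxhat * xhat).sum(axis=-1, keepdims=True)
    dxhat_sum = dxhat.sum(axis=-1, keepdims=True)
    # The consolidated formula quoted above.
    dx = rstd * (dxhat - xhat * proj / D - dxhat_sum / D)
    dgamma = (dout * xhat).sum(axis=(0, 1))  # reduce over (B, T)
    dbeta = dout.sum(axis=(0, 1))
    return dx, dgamma, dbeta
```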
Qwen3.6-27B (95/100)
- Implements the equivalent formula (checked numerically in the snippet after this list): `dx = std_inv * (g - g_mean - x_hat * gx_mean)`
- Full step-by-step derivation documented in code comments, including the Jacobian projection form
- Independent cross-check: `benchmark_layer_norm.py` contains an alternative step-by-step chain-rule derivation that independently computes dx and verifies it matches the compact formula (relative error < 1e-10)
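The two forms are algebraically identical: `g_mean` and `gx_mean` are just `dxhat_sum/D` and `proj/D`. A quick numerical check of the equivalence, reusing the sketch above (the inputs here are illustrative):

```python
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 16))
gamma = rng.standard_normal(16)
beta = rng.standard_normal(16)
dout = rng.standard_normal(x.shape)

_, (x_hat, std_inv, gamma_c) = layer_norm_forward(x, gamma, beta)
g = dout * gamma_c
g_mean = g.mean(axis=-1, keepdims=True)
gx_mean = (g * x_hat).mean(axis=-1, keepdims=True)
dx_qwen_form = std_inv * (g - g_mean - x_hat * gx_mean)

dx_glm_form, _, _ = layer_norm_backward(dout, (x_hat, std_inv, gamma_c))
assert np.allclose(dx_qwen_form, dx_glm_form, rtol=1e-10)
```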
Verdict: Both correct. Qwen3.6-27B's independent cross-check gives slightly higher confidence.
2. Completeness
GLM-5 (80/100)
- Meets all 6 prompt requirements
- Single file containing: forward, backward, gradient check, complexity analysis, GPU fusion, numerical stability discussion
- Missing: dedicated edge-case tests, numerical stability demonstration, performance benchmarks, separate test files
Qwen3.6-27B (95/100)
- Meets all 6 requirements comprehensively
- Three separate files with distinct responsibilities:
  - `layer_norm_backward.py`: core implementation + gradient check + complexity + GPU fusion
  - `test_layer_norm.py`: edge-case validation (zero input, D=1, large D, large mean, scale invariance)
  - `benchmark_layer_norm.py`: performance benchmarks + variance stability demo + alternative derivation cross-check
Verdict: Qwen3.6-27B exceeds requirements with a full testing and benchmarking suite.
3. Code Quality
GLM-5 (88/100)
- Single file (~280 lines) — remarkably concise for what it covers
- Minimal cache: `(xhat, rstd, gamma)`; only 3 items, exactly what's needed
- Clean function signatures with type hints
- Uses `np.random.default_rng()` (modern NumPy API)
- No unnecessary class wrappers or decorative ASCII art
- Gradient check operates on copies (not in-place), which is safer than MiniMax-M2.7's approach
Qwen3.6-27B (90/100)
- Focused implementation: Core algorithm is ~70 lines
- Minimal cache: `{x_hat, std_inv, gamma, D}`; 4 items, essentially equivalent to GLM-5
- Separation of concerns across 3 files
- Docstrings are concise and precise
- No unnecessary class wrappers
Verdict: Both are very well-written. GLM-5 is more concise; Qwen3.6-27B has better separation. Nearly a tie.
4. Numerical Stability
GLM-5 (80/100)
- Uses two-pass variance: `xc = x - mean`, then `var = mean(xc**2)`
- Discusses 5 stability scenarios in the `print_complexity_and_fusion()` function:
  - Division by near-zero σ̂ (`eps` guards against it)
  - Catastrophic cancellation in `xc = x - mean`
  - Overflow in `xc**2` or `var`
  - Gradient explosion when σ̂ is very small
  - rstd computation (direct `1/sqrt` preferred over sqrt→divide)
- Weakness: No concrete demonstration. The discussion is theoretical.
- `eps = 1e-5`
Qwen3.6-27B (95/100)
- Explicitly uses two-pass variance and labels it as "numerically stable"
- Concrete demonstration: `benchmark_layer_norm.py` includes `demo_variance_stability()` (see the sketch after this list), which:
  - Shows `naive_variance` producing `0.0` for offset=1e8 (true variance = 2.0)
  - Shows `two_pass_variance` staying exact at `2.0`
  - Demonstrates degradation across offsets from 1e4 to 1e14
- Edge-case tests: `test_layer_norm.py` tests zero input, D=1 (degenerate), large D (1024), large-magnitude inputs (1e8 offset)
- `eps = 1e-5`
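The cancellation being demonstrated is easy to reproduce. A minimal sketch in the same spirit (the helper names mirror the ones quoted above; the data and offsets are illustrative):

```python
import numpy as np

def naive_variance(x):
    # One-pass form E[x^2] - E[x]^2; for large offsets both terms are
    # ~offset^2, so their tiny difference is lost to cancellation.
    return (x ** 2).mean() - x.mean() ** 2

def two_pass_variance(x):
    # Center first, then average the squared deviations.
    xc = x - x.mean()
    return (xc ** 2).mean()

for offset in [0.0, 1e4, 1e8, 1e12]:
    x = offset + np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # true variance = 2.0
    print(f"offset={offset:.0e}  naive={naive_variance(x):<12.6g}"
          f"  two_pass={two_pass_variance(x):.6g}")
```

At offset 1e8 in float64, offset² ≈ 1e16 has a unit in the last place of about 2, so a true variance of 2.0 is entirely below the rounding noise of the one-pass form.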
Verdict: Qwen3.6-27B wins decisively by demonstrating the problem rather than just describing it.
5. Gradient Check
GLM-5 (85/100)
- Central finite differences for all three parameters (x, gamma, beta)
- Reports both max absolute error and relative error
- Uses `tol=1e-4` for pass/fail determination
- Tests on a single shape (B=2, T=3, D=8) in the default call, and (B=3, T=5, D=32) in the `gradient_check` function
- Strength: operates on copies (`x_plus = x.copy()`), avoiding the in-place corruption risk seen in MiniMax-M2.7
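A central-difference check of this style is short enough to sketch. Here is one hedged version that operates on copies and reuses the forward/backward sketch from section 1 (names, shapes, and the projection loss are illustrative assumptions):

```python
import numpy as np

def central_diff_grad(f, x, delta=1e-5):
    # Numerical gradient of scalar-valued f at x via central differences.
    # Perturbs copies, never the caller's array.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[idx] += delta
        x_minus[idx] -= delta
        grad[idx] = (f(x_plus) - f(x_minus)) / (2 * delta)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 8))
gamma, beta = rng.standard_normal(8), rng.standard_normal(8)
dout = rng.standard_normal(x.shape)

def loss(x_in):
    # Project the output onto dout so that dL/d(out) == dout.
    out, _ = layer_norm_forward(x_in, gamma, beta)
    return float((out * dout).sum())

_, cache = layer_norm_forward(x, gamma, beta)
dx, _, _ = layer_norm_backward(dout, cache)
num_dx = central_diff_grad(loss, x)
rel_err = np.abs(num_dx - dx).max() / (np.abs(num_dx).max() + 1e-12)
print(f"max relative error on dx: {rel_err:.2e}")
```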
Qwen3.6-27B (92/100)
- Central finite differences with `delta=1e-5`
- Reports relative error, which is more informative than absolute error alone
- Tests on shape (4, 8, 16) with all three gradients
- Relative errors reported: dx ~5e-11, dgamma ~1.75e-11, dbeta ~1.46e-11 — extremely tight
- Edge-case tests in `test_layer_norm.py` run gradient checks on large-magnitude and large-D inputs
Verdict: Qwen3.6-27B has tighter numerical agreement and broader test coverage.
6. Complexity Analysis
GLM-5 (82/100)
- Correctly identifies O(BTD) time and space complexity
- Breaks down forward and backward into component operations
- Discusses extra memory for the cache: O(B·T·D) for xhat plus O(B·T) for rstd
- No quantitative FLOP counts or memory footprint in bytes
Qwen3.6-27B (90/100)
- More granular FLOP counts: forward ~6N, backward ~9N, total ~15N
- Explicitly notes backward is ~1.5x forward in FLOPs
- Includes memory footprint in MB for concrete shapes
- Discusses why two-pass variance is worth the extra O(N) FLOPs
- Computes TFLOPS throughput in benchmarks
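As a worked example of this kind of accounting (the FLOP coefficients follow the ~6N/~9N figures above; the concrete shape is an assumption, not one of Qwen3.6-27B's benchmark configurations):

```python
B, T, D = 32, 512, 1024
N = B * T * D                   # 16,777,216 elements
flops = (6 + 9) * N             # forward ~6N + backward ~9N ≈ 2.5e8 FLOPs
mib_for_x = N * 4 / 2**20       # float32: 64 MiB for x alone
print(f"N={N:,}  total FLOPs≈{flops:.2e}  x footprint={mib_for_x:.0f} MiB")
```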
Verdict: Qwen3.6-27B provides more quantitative detail.
7. GPU Fusion Explanation
GLM-5 (85/100)
- Describes a single-kernel backward fusion design
- Specifies shared memory layout: `smem_xhat[D]`, `smem_dxhat[D]`, `smem_proj[1]`, `smem_sum[1]`
- 4-step algorithm: load and compute dxhat, cooperative reduction, compute dx, atomic adds for dgamma/dbeta
- Quantifies memory traffic: ≈3D elements vs ≈10D+ for unfused
- Mentions warp-level shuffles and vectorized loads as additional optimizations
- Clean, practical description
Qwen3.6-27B (88/100)
- Detailed GPU fusion discussion with CUDA pseudocode for both forward and backward
- Quantifies memory traffic: naive = ~12 accesses/element, fused = 4 (forward) and 5 (backward)
- Discusses atomicAdd for dgamma/dbeta reduction
- Mentions shared memory optimization for small D (<= 1024)
- Notes that warp-level primitives can replace shared memory when D <= 32
Verdict: Both are strong. Qwen3.6-27B has slightly better quantitative comparison.
8. Tests and Benchmarks
GLM-5 (60/100)
- `gradient_check()` function tests one shape with all three parameters
- No edge-case tests, no assertions, no separate test file
- No performance benchmarks
- No numerical stability demonstration
Qwen3.6-27B (95/100)
- `test_layer_norm.py` with 5 edge-case test categories:
  - Large mean, tiny variance (cancellation-prone)
  - Zero input (variance = 0)
  - Large D (Transformer-scale: D=1024)
  - D=1 (degenerate case)
  - Gradient norm sanity across scales (1e-3 to 1e6)
- `benchmark_layer_norm.py` with:
  - Variance stability demo (naive vs two-pass)
  - Performance benchmarks across 8 configurations
  - Alternative derivation cross-check
- `test_memory_efficiency()` explicitly verifies the minimal cache
- Uses `assert` statements for validation
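To give a flavor of those edge cases, here is a sketch of two of the listed categories in pytest style, reusing the section-1 functions (test names and assertions are hypothetical, not Qwen3.6-27B's actual tests):

```python
import numpy as np

def test_zero_input():
    # Variance is exactly 0; eps must keep rstd finite and gradients sane.
    x = np.zeros((2, 4, 8))
    gamma, beta = np.ones(8), np.zeros(8)
    out, cache = layer_norm_forward(x, gamma, beta)
    dx, dgamma, dbeta = layer_norm_backward(np.ones_like(x), cache)
    assert np.all(np.isfinite(out))
    assert np.all(np.isfinite(dx))

def test_d_equals_one():
    # With D=1 every value normalizes to 0, so dx must vanish identically.
    x = np.random.default_rng(0).standard_normal((2, 4, 1))
    gamma, beta = np.ones(1), np.zeros(1)
    _, cache = layer_norm_forward(x, gamma, beta)
    dx, dgamma, dbeta = layer_norm_backward(np.ones_like(x), cache)
    assert np.allclose(dx, 0.0)
    assert np.allclose(dgamma, 0.0)  # xhat == 0 everywhere when D=1
```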
Verdict: Qwen3.6-27B is far superior in testing coverage and rigor.
9. What Each Did Best
| GLM-5 | Qwen3.6-27B |
|---|---|
| Exceptional conciseness — 280 lines covers everything | Minimal, precise cache + 3-file separation |
| Modern NumPy API (`default_rng`, type hints) | Concrete catastrophic cancellation demo |
| Safe gradient check (copies, not in-place) | Independent backward formula cross-check |
| Clean GPU fusion description with memory quantification | Comprehensive edge-case test suite |
| rstd computation (avoids sqrt→divide) | Memory-efficiency verification + benchmarks |
10. Weaknesses
GLM-5
- No edge-case testing: No tests for zero input, D=1, large offsets, etc.
- No concrete stability demo: Discusses catastrophic cancellation but never shows it
- No performance benchmarks: No timing or throughput measurements
- Single file: While concise, separation into test/benchmark files would be better
- Gradient check only on small shapes: No spot-check for large tensors
Qwen3.6-27B
- GPU fusion discussion is a string constant: Less readable than GLM-5's formatted output
- No spot-check for very large tensors: Gradient check always runs full finite differences
- Slightly more verbose: The core implementation is clean but surrounded by extensive analysis text
Final Verdict
Qwen3.6-27B wins by 11 points (93 vs 82).
The gap is driven by two factors:
- Testing: Qwen3.6-27B has a full test suite with edge cases, assertions, and memory verification; GLM-5 has only a basic gradient check.
- Numerical stability: Qwen3.6-27B demonstrates the catastrophic cancellation problem with concrete examples; GLM-5 only describes it.
GLM-5 is genuinely good — it correctly implements the backward pass with a minimal cache, clean code, and a solid GPU fusion discussion. It would score much higher than MiniMax-M2.7's implementation. But Qwen3.6-27B takes the same foundation and elevates it with rigorous testing, concrete demonstrations, and cleaner engineering separation.