Head-to-Head: Layer Normalization Backward Pass
GLM-5 vs Qwen3.6-27B: backward-pass implementations
Executive Summary
| Dimension | GLM-5 | Qwen3.6-27B |
|---|---|---|
| Correctness | 92 | 95 |
| Completeness | 80 | 95 |
| Code Quality | 88 | 90 |
| Numerical Stability | 80 | 95 |
| Gradient Check | 85 | 92 |
| Complexity Analysis | 82 | 90 |
| GPU Fusion Explanation | 85 | 88 |
| Tests / Benchmarks | 60 | 95 |
| Overall | 82 | 93 |
Winner: Qwen3.6-27B by 11 points.
1. Correctness
GLM-5 (92/100)
- Implements the correct consolidated backward formula (sketched after this list): `dx = rstd * (dxhat - xhat * proj/D - dxhat_sum/D)`
- `d_gamma` and `d_beta` correctly computed via reductions over (B, T)
- Forward pass correctly uses two-pass variance (center first, then compute variance)
- Uses `rstd = 1.0 / np.sqrt(var + eps)` directly, which is numerically preferable to `1/std`
- Minor note: the docstring derivation is elegant but slightly condensed; it states that the second term of dμ cancels to zero without showing the algebra, which could confuse readers trying to follow along
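For concreteness, here is a minimal NumPy sketch of a forward pass with this cache layout plus the consolidated backward formula. The function names and the (B, T, D) shape convention are assumptions for illustration, not GLM-5's actual code:

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    # Two-pass variance: center first, then average the squares.
    mean = x.mean(axis=-1, keepdims=True)
    xc = x - mean
    var = (xc ** 2).mean(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(var + eps)  # direct 1/sqrt, not sqrt-then-divide
    xhat = xc * rstd
    out = gamma * xhat + beta
    cache = (xhat, rstd, gamma)      # the minimal 3-item cache
    return out, cache

def layer_norm_backward(dout, cache):
    xhat, rstd, gamma = cache
    D = xhat.shape[-1]
    dxhat = dout * gamma
    proj = (dxhat * xhat).sum(axis=-1, keepdims=True)
    dxhat_sum = dxhat.sum(axis=-1, keepdims=True)
    # The consolidated formula quoted above.
    dx = rstd * (dxhat - xhat * proj / D - dxhat_sum / D)
    dgamma = (dout * xhat).sum(axis=(0, 1))  # reduce over (B, T)
    dbeta = dout.sum(axis=(0, 1))
    return dx, dgamma, dbeta
```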
Qwen3.6-27B (95/100)
- Implements the equivalent formula (checked numerically in the snippet after this list): `dx = std_inv * (g - g_mean - x_hat * gx_mean)`
- Full step-by-step derivation documented in code comments, including the Jacobian projection form
- Independent cross-check: `benchmark_layer_norm.py` contains an alternative step-by-step chain-rule derivation that independently computes dx and verifies it matches the compact formula (relative error < 1e-10)
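The two forms are algebraically identical: `g_mean` and `gx_mean` are just `dxhat_sum/D` and `proj/D`. A quick numerical check of the equivalence, reusing the sketch above (the inputs here are illustrative):

```python
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 8, 16))
gamma = rng.standard_normal(16)
beta = rng.standard_normal(16)
dout = rng.standard_normal(x.shape)

_, (x_hat, std_inv, gamma_c) = layer_norm_forward(x, gamma, beta)
g = dout * gamma_c
g_mean = g.mean(axis=-1, keepdims=True)
gx_mean = (g * x_hat).mean(axis=-1, keepdims=True)
dx_qwen_form = std_inv * (g - g_mean - x_hat * gx_mean)

dx_glm_form, _, _ = layer_norm_backward(dout, (x_hat, std_inv, gamma_c))
assert np.allclose(dx_qwen_form, dx_glm_form, rtol=1e-10)
```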
Verdict: Both correct. Qwen3.6-27B's independent cross-check gives slightly higher confidence.
2. Completeness
GLM-5 (80/100)
- Meets all 6 prompt requirements
- Single file containing: forward, backward, gradient check, complexity analysis, GPU fusion, numerical stability discussion
- Missing: dedicated edge-case tests, numerical stability demonstration, performance benchmarks, separate test files
Qwen3.6-27B (95/100)
- Meets all 6 requirements comprehensively
- Three separate files with distinct responsibilities:
  - `layer_norm_backward.py`: core implementation + gradient check + complexity + GPU fusion
  - `test_layer_norm.py`: edge-case validation (zero input, D=1, large D, large mean, scale invariance)
  - `benchmark_layer_norm.py`: performance benchmarks + variance stability demo + alternative derivation cross-check
Verdict: Qwen3.6-27B exceeds requirements with a full testing and benchmarking suite.
3. Code Quality
GLM-5 (88/100)
- Single file (~280 lines) — remarkably concise for what it covers
- Minimal cache: `(xhat, rstd, gamma)`; only 3 items, exactly what's needed
- Clean function signatures with type hints
- Uses `np.random.default_rng()` (modern NumPy API)
- No unnecessary class wrappers or decorative ASCII art
- Gradient check operates on copies (not in-place), which is safer than MiniMax-M2.7's approach
Qwen3.6-27B (90/100)
- Focused implementation: Core algorithm is ~70 lines
- Minimal cache: `{x_hat, std_inv, gamma, D}`; 4 items, essentially equivalent to GLM-5
- Separation of concerns across 3 files
- Docstrings are concise and precise
- No unnecessary class wrappers
Verdict: Both are very well-written. GLM-5 is more concise; Qwen3.6-27B has better separation. Nearly a tie.
4. Numerical Stability
GLM-5 (80/100)
- Uses two-pass variance: `xc = x - mean`, then `var = mean(xc**2)`
- Discusses 5 stability scenarios in the `print_complexity_and_fusion()` function:
  - Division by near-zero σ̂ (`eps` guards against it)
  - Catastrophic cancellation in `xc = x - mean`
  - Overflow in `xc**2` or `var`
  - Gradient explosion when σ̂ is very small
  - rstd computation (direct `1/sqrt` preferred over sqrt→divide)
- Weakness: No concrete demonstration. The discussion is theoretical.
- `eps = 1e-5`
Qwen3.6-27B (95/100)
- Explicitly uses two-pass variance and labels it as "numerically stable"
- Concrete demonstration: `benchmark_layer_norm.py` includes `demo_variance_stability()` (see the sketch after this list), which:
  - Shows `naive_variance` producing `0.0` for offset=1e8 (true variance = 2.0)
  - Shows `two_pass_variance` staying exact at `2.0`
  - Demonstrates degradation across offsets from 1e4 to 1e14
- Edge-case tests: `test_layer_norm.py` tests zero input, D=1 (degenerate), large D (1024), large-magnitude inputs (1e8 offset)
- `eps = 1e-5`
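The cancellation being demonstrated is easy to reproduce. A minimal sketch in the same spirit (the helper names mirror the ones quoted above; the data and offsets are illustrative):

```python
import numpy as np

def naive_variance(x):
    # One-pass form E[x^2] - E[x]^2; for large offsets both terms are
    # ~offset^2, so their tiny difference is lost to cancellation.
    return (x ** 2).mean() - x.mean() ** 2

def two_pass_variance(x):
    # Center first, then average the squared deviations.
    xc = x - x.mean()
    return (xc ** 2).mean()

for offset in [0.0, 1e4, 1e8, 1e12]:
    x = offset + np.array([-2.0, -1.0, 0.0, 1.0, 2.0])  # true variance = 2.0
    print(f"offset={offset:.0e}  naive={naive_variance(x):<12.6g}"
          f"  two_pass={two_pass_variance(x):.6g}")
```

At offset 1e8 in float64, offset² ≈ 1e16 has a unit in the last place of about 2, so a true variance of 2.0 is entirely below the rounding noise of the one-pass form.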
Verdict: Qwen3.6-27B wins decisively by demonstrating the problem rather than just describing it.
5. Gradient Check
GLM-5 (85/100)
- Central finite differences for all three parameters (x, gamma, beta)
- Reports both max absolute error and relative error
- Uses `tol=1e-4` for pass/fail determination
- Tests on a single shape (B=2, T=3, D=8) in the default call, and (B=3, T=5, D=32) in the `gradient_check` function
- Strength: operates on copies (`x_plus = x.copy()`), avoiding the in-place corruption risk seen in MiniMax-M2.7
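A central-difference check of this style is short enough to sketch. Here is one hedged version that operates on copies and reuses the forward/backward sketch from section 1 (names, shapes, and the projection loss are illustrative assumptions):

```python
import numpy as np

def central_diff_grad(f, x, delta=1e-5):
    # Numerical gradient of scalar-valued f at x via central differences.
    # Perturbs copies, never the caller's array.
    grad = np.zeros_like(x)
    it = np.nditer(x, flags=["multi_index"])
    for _ in it:
        idx = it.multi_index
        x_plus, x_minus = x.copy(), x.copy()
        x_plus[idx] += delta
        x_minus[idx] -= delta
        grad[idx] = (f(x_plus) - f(x_minus)) / (2 * delta)
    return grad

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 3, 8))
gamma, beta = rng.standard_normal(8), rng.standard_normal(8)
dout = rng.standard_normal(x.shape)

def loss(x_in):
    # Project the output onto dout so that dL/d(out) == dout.
    out, _ = layer_norm_forward(x_in, gamma, beta)
    return float((out * dout).sum())

_, cache = layer_norm_forward(x, gamma, beta)
dx, _, _ = layer_norm_backward(dout, cache)
num_dx = central_diff_grad(loss, x)
rel_err = np.abs(num_dx - dx).max() / (np.abs(num_dx).max() + 1e-12)
print(f"max relative error on dx: {rel_err:.2e}")
```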
Qwen3.6-27B (92/100)
- Central finite differences with `delta=1e-5`
- Reports relative error, which is more informative than absolute error alone
- Tests on shape (4, 8, 16) with all three gradients
- Relative errors reported: dx ~5e-11, dgamma ~1.75e-11, dbeta ~1.46e-11 — extremely tight
- Edge-case tests in `test_layer_norm.py` run gradient checks on large-magnitude and large-D inputs
Verdict: Qwen3.6-27B has tighter numerical agreement and broader test coverage.
6. Complexity Analysis
GLM-5 (82/100)
- Correctly identifies O(BTD) time and space complexity
- Breaks down forward and backward into component operations
- Discusses extra memory for the cache: O(B·T·D) for xhat plus O(B·T) for rstd
- No quantitative FLOP counts or memory footprint in bytes
Qwen3.6-27B (90/100)
- More granular FLOP counts: forward ~6N, backward ~9N, total ~15N
- Explicitly notes backward is ~1.5x forward in FLOPs
- Includes memory footprint in MB for concrete shapes
- Discusses why two-pass variance is worth the extra O(N) FLOPs
- Computes TFLOPS throughput in benchmarks
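As a worked example of this kind of accounting (the FLOP coefficients follow the ~6N/~9N figures above; the concrete shape is an assumption, not one of Qwen3.6-27B's benchmark configurations):

```python
B, T, D = 32, 512, 1024
N = B * T * D                   # 16,777,216 elements
flops = (6 + 9) * N             # forward ~6N + backward ~9N ≈ 2.5e8 FLOPs
mib_for_x = N * 4 / 2**20       # float32: 64 MiB for x alone
print(f"N={N:,}  total FLOPs≈{flops:.2e}  x footprint={mib_for_x:.0f} MiB")
```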
Verdict: Qwen3.6-27B provides more quantitative detail.
7. GPU Fusion Explanation
GLM-5 (85/100)
- Describes a single-kernel backward fusion design
- Specifies shared memory layout: `smem_xhat[D]`, `smem_dxhat[D]`, `smem_proj[1]`, `smem_sum[1]`
- 4-step algorithm: load and compute dxhat, cooperative reduction, compute dx, atomic adds for dgamma/dbeta
- Quantifies memory traffic: ≈3D elements vs ≈10D+ for unfused
- Mentions warp-level shuffles and vectorized loads as additional optimizations
- Clean, practical description
Qwen3.6-27B (88/100)
- Detailed GPU fusion discussion with CUDA pseudocode for both forward and backward
- Quantifies memory traffic: naive = ~12 accesses/element, fused = 4 (forward) and 5 (backward)
- Discusses atomicAdd for dgamma/dbeta reduction
- Mentions shared memory optimization for small D (<= 1024)
- Notes that warp-level primitives can replace shared memory when D <= 32
Verdict: Both are strong. Qwen3.6-27B has slightly better quantitative comparison.
8. Tests and Benchmarks
GLM-5 (60/100)
- `gradient_check()` function tests one shape with all three parameters
- No edge-case tests, no assertions, no separate test file
- No performance benchmarks
- No numerical stability demonstration
Qwen3.6-27B (95/100)
- `test_layer_norm.py` with 5 edge-case test categories:
  - Large mean, tiny variance (cancellation-prone)
  - Zero input (variance = 0)
  - Large D (Transformer-scale: D=1024)
  - D=1 (degenerate case)
  - Gradient norm sanity across scales (1e-3 to 1e6)
- `benchmark_layer_norm.py` with:
  - Variance stability demo (naive vs two-pass)
  - Performance benchmarks across 8 configurations
  - Alternative derivation cross-check
- `test_memory_efficiency()` explicitly verifies the minimal cache
- Uses `assert` statements for validation
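To give a flavor of those edge cases, here is a sketch of two of the listed categories in pytest style, reusing the section-1 functions (test names and assertions are hypothetical, not Qwen3.6-27B's actual tests):

```python
import numpy as np

def test_zero_input():
    # Variance is exactly 0; eps must keep rstd finite and gradients sane.
    x = np.zeros((2, 4, 8))
    gamma, beta = np.ones(8), np.zeros(8)
    out, cache = layer_norm_forward(x, gamma, beta)
    dx, dgamma, dbeta = layer_norm_backward(np.ones_like(x), cache)
    assert np.all(np.isfinite(out))
    assert np.all(np.isfinite(dx))

def test_d_equals_one():
    # With D=1 every value normalizes to 0, so dx must vanish identically.
    x = np.random.default_rng(0).standard_normal((2, 4, 1))
    gamma, beta = np.ones(1), np.zeros(1)
    _, cache = layer_norm_forward(x, gamma, beta)
    dx, dgamma, dbeta = layer_norm_backward(np.ones_like(x), cache)
    assert np.allclose(dx, 0.0)
    assert np.allclose(dgamma, 0.0)  # xhat == 0 everywhere when D=1
```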
Verdict: Qwen3.6-27B is far superior in testing coverage and rigor.
9. What Each Did Best
| GLM-5 | Qwen3.6-27B |
|---|---|
| Exceptional conciseness — 280 lines covers everything | Minimal, precise cache + 3-file separation |
| Modern NumPy API (`default_rng`, type hints) | Concrete catastrophic cancellation demo |
| Safe gradient check (copies, not in-place) | Independent backward formula cross-check |
| Clean GPU fusion description with memory quantification | Comprehensive edge-case test suite |
| rstd computation (avoids sqrt→divide) | Memory-efficiency verification + benchmarks |
10. Weaknesses
GLM-5
- No edge-case testing: No tests for zero input, D=1, large offsets, etc.
- No concrete stability demo: Discusses catastrophic cancellation but never shows it
- No performance benchmarks: No timing or throughput measurements
- Single file: While concise, separation into test/benchmark files would be better
- Gradient check only on small shapes: No spot-check for large tensors
Qwen3.6-27B
- GPU fusion discussion is a string constant: Less readable than GLM-5's formatted output
- No spot-check for very large tensors: Gradient check always runs full finite differences
- Slightly more verbose: The core implementation is clean but surrounded by extensive analysis text
Final Verdict
Qwen3.6-27B wins by 11 points (93 vs 82).
The gap is driven by two factors:
- Testing: Qwen3.6-27B has a full test suite with edge cases, assertions, and memory verification; GLM-5 has only a basic gradient check.
- Numerical stability: Qwen3.6-27B demonstrates the catastrophic cancellation problem with concrete examples; GLM-5 only describes it.
GLM-5 is genuinely good — it correctly implements the backward pass with a minimal cache, clean code, and a solid GPU fusion discussion. It would score much higher than MiniMax-M2.7's implementation. But Qwen3.6-27B takes the same foundation and elevates it with rigorous testing, concrete demonstrations, and cleaner engineering separation.