Head-to-Head: Layer Normalization Backward Pass

GLM-5 backwards vs Qwen3.6-27B backwards


Executive Summary

| Dimension | GLM-5 | Qwen3.6-27B |
|---|---:|---:|
| Correctness | 92 | 95 |
| Completeness | 80 | 95 |
| Code Quality | 88 | 90 |
| Numerical Stability | 80 | 95 |
| Gradient Check | 85 | 92 |
| Complexity Analysis | 82 | 90 |
| GPU Fusion Explanation | 85 | 88 |
| Tests / Benchmarks | 60 | 95 |
| Overall | 82 | 93 |

Winner: Qwen3.6-27B by 11 points.


1. Correctness

GLM-5 (92/100)

  • Implements the correct consolidated backward formula: dx = rstd * (dxhat - xhat * proj/D - dxhat_sum/D)
  • d_gamma and d_beta correctly computed via reductions over (B, T)
  • Forward pass correctly uses two-pass variance (center first, then compute variance)
  • Uses rstd = 1.0 / np.sqrt(var + eps) directly, which is numerically preferable to 1/std
  • Minor note: The docstring derivation is elegant but slightly condensed: it states that the second term of dμ cancels to zero without showing the algebra, which could confuse readers trying to follow along (the missing step is reconstructed below)
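
For readers who want the missing step: it follows from the centering in the forward pass. Since xc = x - μ, each row of xc sums to zero, so the σ² contribution to ∂L/∂μ vanishes. A reconstruction of the one-line algebra (my notation, inferred from the surrounding definitions, not quoted from GLM-5's docstring):

$$
\frac{\partial L}{\partial \mu} = -\,\mathrm{rstd}\sum_{i=1}^{D}\frac{\partial L}{\partial \hat{x}_i} \;+\; \frac{\partial L}{\partial \sigma^2}\cdot\left(-\frac{2}{D}\sum_{i=1}^{D} x_{c,i}\right), \qquad \sum_{i=1}^{D} x_{c,i} = 0,
$$

so only the first term survives.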

Qwen3.6-27B (95/100)

  • Implements the equivalent formula: dx = std_inv * (g - g_mean - x_hat * gx_mean)
  • Full step-by-step derivation documented in code comments, including the Jacobian projection form
  • Independent cross-check: benchmark_layer_norm.py contains an alternative step-by-step chain-rule derivation that independently computes dx and verifies it matches the compact formula (relative error < 1e-10)
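
The two formulas are the same expression with terms reordered: GLM-5's proj/D and dxhat_sum/D are Qwen3.6-27B's gx_mean and g_mean. A minimal NumPy sketch of the shared compact form (illustrative names, not code from either submission):

```python
import numpy as np

def layer_norm_backward(dout, x_hat, rstd, gamma):
    """Compact LayerNorm backward over the last axis (sketch of the shared formula)."""
    D = x_hat.shape[-1]
    dxhat = dout * gamma                                      # grad w.r.t. normalized x
    dxhat_mean = dxhat.mean(axis=-1, keepdims=True)           # Qwen's g_mean == GLM-5's dxhat_sum / D
    proj_mean = (dxhat * x_hat).mean(axis=-1, keepdims=True)  # Qwen's gx_mean == GLM-5's proj / D
    dx = rstd * (dxhat - dxhat_mean - x_hat * proj_mean)
    dgamma = (dout * x_hat).sum(axis=(0, 1))                  # reductions over (B, T)
    dbeta = dout.sum(axis=(0, 1))
    return dx, dgamma, dbeta
```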

Verdict: Both correct. Qwen3.6-27B's independent cross-check gives slightly higher confidence.


2. Completeness

GLM-5 (80/100)

  • Meets all 6 prompt requirements
  • Single file containing: forward, backward, gradient check, complexity analysis, GPU fusion, numerical stability discussion
  • Missing: dedicated edge-case tests, numerical stability demonstration, performance benchmarks, separate test files

Qwen3.6-27B (95/100)

  • Meets all 6 requirements comprehensively
  • Three separate files with distinct responsibilities:
    • layer_norm_backward.py — core implementation + gradient check + complexity + GPU fusion
    • test_layer_norm.py — edge-case validation (zero input, D=1, large D, large mean, scale invariance)
    • benchmark_layer_norm.py — performance benchmarks + variance stability demo + alternative derivation cross-check

Verdict: Qwen3.6-27B exceeds requirements with a full testing and benchmarking suite.


3. Code Quality

GLM-5 (88/100)

  • Single file (~280 lines) — remarkably concise for what it covers
  • Minimal cache: (xhat, rstd, gamma) — only 3 items, exactly what's needed
  • Clean function signatures with type hints
  • Uses np.random.default_rng() (modern NumPy API)
  • No unnecessary class wrappers or decorative ASCII art
  • Gradient check operates on copies (not in-place), which is safer than MiniMax-M2.7's approach

Qwen3.6-27B (90/100)

  • Focused implementation: Core algorithm is ~70 lines
  • Minimal cache: {x_hat, std_inv, gamma, D} — 4 items, essentially equivalent to GLM-5
  • Separation of concerns across 3 files
  • Docstrings are concise and precise
  • No unnecessary class wrappers

Verdict: Both are very well-written. GLM-5 is more concise; Qwen3.6-27B has better separation. Nearly a tie.


4. Numerical Stability

GLM-5 (80/100)

  • Uses two-pass variance: xc = x - mean, then var = mean(xc**2)
  • Discusses 5 stability scenarios in the print_complexity_and_fusion() function:
    1. Division by near-zero σ̂ (eps guards against it)
    2. Catastrophic cancellation in xc = x - mean
    3. Overflow in xc² or var
    4. Gradient explosion when σ̂ is very small
    5. rstd computation (direct 1/sqrt preferred over sqrt→divide)
  • Weakness: No concrete demonstration. The discussion is theoretical.
  • eps = 1e-5

Qwen3.6-27B (95/100)

  • Explicitly uses two-pass variance and labels it as "numerically stable"
  • Concrete demonstration: benchmark_layer_norm.py includes demo_variance_stability():
    • Shows naive_variance producing 0.0 for offset=1e8 (true variance = 2.0)
    • Shows two_pass_variance staying exact at 2.0
    • Demonstrates degradation across offsets from 1e4 to 1e14
  • Edge-case tests: test_layer_norm.py tests zero input, D=1 (degenerate), large D (1024), large-magnitude inputs (1e8 offset)
  • eps = 1e-5
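
For context, the failure mode being demonstrated is easy to reproduce. A hedged sketch in the spirit of demo_variance_stability() (the function bodies below are illustrative, not Qwen3.6-27B's code):

```python
import numpy as np

def naive_variance(x):
    # One-pass E[x^2] - E[x]^2: subtracts two huge, nearly equal numbers
    return np.mean(x**2) - np.mean(x)**2

def two_pass_variance(x):
    # Center first, then average squared deviations (the form both models use)
    xc = x - np.mean(x)
    return np.mean(xc**2)

x = np.arange(1.0, 6.0) + 1e8   # true variance is exactly 2.0
print(naive_variance(x))        # catastrophic cancellation: 0.0 or similar garbage in float64
print(two_pass_variance(x))     # 2.0
```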

Verdict: Qwen3.6-27B wins decisively by demonstrating the problem rather than just describing it.


5. Gradient Check

GLM-5 (85/100)

  • Central finite differences for all three parameters (x, gamma, beta)
  • Reports both max absolute error and relative error
  • Uses tol=1e-4 for pass/fail determination
  • Covers two shapes: (B=2, T=3, D=8) in the default call and (B=3, T=5, D=32) in the gradient_check function
  • Strength: Operates on copies (x_plus = x.copy()), avoiding the in-place corruption risk seen in MiniMax-M2.7
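
To make the "copies, not in-place" point concrete, here is a minimal central-difference checker in that style. It samples entries rather than sweeping all of them, which is also the large-tensor spot-check both weakness lists ask for (function and argument names are illustrative, not GLM-5's):

```python
import numpy as np

def grad_check(loss_fn, x, dx_analytic, delta=1e-5, n_samples=25, seed=0):
    """Spot-check dx_analytic against central finite differences of a scalar loss_fn."""
    rng = np.random.default_rng(seed)
    max_rel = 0.0
    for _ in range(n_samples):
        idx = tuple(rng.integers(0, s) for s in x.shape)
        x_plus, x_minus = x.copy(), x.copy()   # perturb copies; x is never mutated
        x_plus[idx] += delta
        x_minus[idx] -= delta
        num = (loss_fn(x_plus) - loss_fn(x_minus)) / (2 * delta)
        denom = max(abs(num) + abs(dx_analytic[idx]), 1e-12)
        max_rel = max(max_rel, abs(num - dx_analytic[idx]) / denom)
    return max_rel   # compare against a tolerance such as 1e-4
```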

Qwen3.6-27B (92/100)

  • Central finite differences with delta=1e-5
  • Reports relative error — more informative than absolute alone
  • Tests on shape (4, 8, 16) with all three gradients
  • Relative errors reported: dx ~5e-11, dgamma ~1.75e-11, dbeta ~1.46e-11 — extremely tight
  • Edge-case tests in test_layer_norm.py run gradient checks on large-magnitude and large-D inputs

Verdict: Qwen3.6-27B has tighter numerical agreement and broader test coverage.


6. Complexity Analysis

GLM-5 (82/100)

  • Correctly identifies O(BTD) time and space complexity
  • Breaks down forward and backward into component operations
  • Discusses extra memory: O(M) for xhat plus O(N) for rstd (the full cached tensor versus one scalar per normalized row)
  • No quantitative FLOP counts or memory footprint in bytes

Qwen3.6-27B (90/100)

  • More granular FLOP counts: forward ~6N, backward ~9N, total ~15N
  • Explicitly notes backward is ~1.5x forward in FLOPs
  • Includes memory footprint in MB for concrete shapes
  • Discusses why two-pass variance is worth the extra O(N) FLOPs
  • Computes TFLOPS throughput in benchmarks
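
As a sanity check on those figures, the arithmetic for one concrete shape (the shape is my example; the ~6N/~9N counts are Qwen3.6-27B's):

```python
B, T, D = 32, 512, 1024
N = B * T * D                        # 16,777,216 elements
flops_fwd, flops_bwd = 6 * N, 9 * N  # ≈ 100.7 MFLOP forward, ≈ 151.0 MFLOP backward
mib_x = N * 4 / 2**20                # one float32 tensor of this shape: 64 MiB
print(f"fwd≈{flops_fwd/1e6:.1f} MFLOP, bwd≈{flops_bwd/1e6:.1f} MFLOP, x≈{mib_x:.0f} MiB")
```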

Verdict: Qwen3.6-27B provides more quantitative detail.


7. GPU Fusion Explanation

GLM-5 (85/100)

  • Describes a single-kernel backward fusion design
  • Specifies shared memory layout: smem_xhat[D], smem_dxhat[D], smem_proj[1], smem_sum[1]
  • 4-step algorithm: load+compute dxhat, cooperative reduction, compute dx, atomic adds for dgamma/dbeta
  • Quantifies memory traffic: ≈3D elements vs ≈10D+ for unfused
  • Mentions warp-level shuffles and vectorized loads as additional optimizations
  • Clean, practical description
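
The property that makes this single-kernel design work is that dx for a row needs only two row-wide reductions. A Python analogue of the per-row computation (one CUDA block would handle one row, with the two sums done cooperatively in shared memory; this sketch is mine, not GLM-5's kernel):

```python
import numpy as np

def fused_row_backward(dout_row, xhat_row, rstd_row, gamma):
    """One row of the fused backward: two reductions, then an elementwise pass."""
    D = xhat_row.shape[0]
    dxhat = dout_row * gamma            # step 1: load and compute dxhat
    s_sum = dxhat.sum()                 # step 2a: cooperative reduction (smem_sum)
    s_proj = (dxhat * xhat_row).sum()   # step 2b: cooperative reduction (smem_proj)
    dx = rstd_row * (dxhat - s_sum / D - xhat_row * s_proj / D)   # step 3
    # step 4 (not shown): atomically accumulate dout_row * xhat_row into dgamma
    # and dout_row into dbeta across rows
    return dx
```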

Qwen3.6-27B (88/100)

  • Detailed GPU fusion discussion with CUDA pseudocode for both forward and backward
  • Quantifies memory traffic: naive = ~12 accesses/element, fused = 4 (forward) and 5 (backward)
  • Discusses atomicAdd for dgamma/dbeta reduction
  • Mentions shared memory optimization for small D (<= 1024)
  • Notes that warp-level primitives can replace shared memory when D <= 32

Verdict: Both are strong. Qwen3.6-27B has slightly better quantitative comparison.


8. Tests and Benchmarks

GLM-5 (60/100)

  • gradient_check() function tests one shape with all three parameters
  • No edge-case tests, no assertions, no separate test file
  • No performance benchmarks
  • No numerical stability demonstration

Qwen3.6-27B (95/100)

  • test_layer_norm.py with 5 edge-case test categories:
    1. Large mean, tiny variance (cancellation-prone)
    2. Zero input (variance = 0)
    3. Large D (Transformer-scale: D=1024)
    4. D=1 (degenerate case)
    5. Gradient norm sanity across scales (1e-3 to 1e6)
  • benchmark_layer_norm.py with:
    • Variance stability demo (naive vs two-pass)
    • Performance benchmarks across 8 configurations
    • Alternative derivation cross-check
  • test_memory_efficiency() explicitly verifies minimal cache
  • Uses assert statements for validation
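
To illustrate the style of those tests, a self-contained sketch (the forward below is a minimal stand-in so the asserts run; it is not Qwen3.6-27B's code, and the shapes are examples):

```python
import numpy as np

def layer_norm_forward(x, gamma, beta, eps=1e-5):
    # Minimal two-pass forward, only to make the edge-case asserts runnable
    mu = x.mean(axis=-1, keepdims=True)
    xc = x - mu
    var = (xc**2).mean(axis=-1, keepdims=True)
    return xc / np.sqrt(var + eps) * gamma + beta

rng = np.random.default_rng(0)
cases = {
    "zero input (var = 0)": np.zeros((2, 3, 8)),
    "D = 1 degenerate":     np.ones((2, 3, 1)),
    "large D":              rng.normal(size=(2, 3, 1024)),
    "large mean (1e8)":     rng.normal(loc=1e8, size=(2, 3, 64)),
}
for name, x in cases.items():
    D = x.shape[-1]
    y = layer_norm_forward(x, np.ones(D), np.zeros(D))
    assert np.all(np.isfinite(y)), name   # eps keeps every case finite
```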

Verdict: Qwen3.6-27B is far superior in testing coverage and rigor.


9. What Each Did Best

| GLM-5 | Qwen3.6-27B |
|---|---|
| Exceptional conciseness: ~280 lines covers everything | Minimal, precise cache + 3-file separation |
| Modern NumPy API (default_rng, type hints) | Concrete catastrophic cancellation demo |
| Safe gradient check (copies, not in-place) | Independent backward formula cross-check |
| Clean GPU fusion description with memory quantification | Comprehensive edge-case test suite |
| Direct rstd computation (avoids sqrt→divide) | Memory-efficiency verification + benchmarks |

10. Weaknesses

GLM-5

  1. No edge-case testing: No tests for zero input, D=1, large offsets, etc.
  2. No concrete stability demo: Discusses catastrophic cancellation but never shows it
  3. No performance benchmarks: No timing or throughput measurements
  4. Single file: While concise, separation into test/benchmark files would be better
  5. Gradient check only on small shapes: No spot-check for large tensors

Qwen3.6-27B

  1. GPU fusion discussion is a string constant: Less readable than GLM-5's formatted output
  2. No spot-check for very large tensors: Gradient check always runs full finite differences
  3. Slightly more verbose: The core implementation is clean but surrounded by extensive analysis text

Final Verdict

Qwen3.6-27B wins by 11 points (93 vs 82).

The gap is driven by two factors:

  1. Testing: Qwen3.6-27B has a full test suite with edge cases, assertions, and memory verification; GLM-5 has only a basic gradient check.
  2. Numerical stability: Qwen3.6-27B demonstrates the catastrophic cancellation problem with concrete examples; GLM-5 only describes it.

GLM-5 is genuinely good — it correctly implements the backward pass with a minimal cache, clean code, and a solid GPU fusion discussion. It would score much higher than MiniMax-M2.7's implementation. But Qwen3.6-27B takes the same foundation and elevates it with rigorous testing, concrete demonstrations, and cleaner engineering separation.