8e72eef09c
- Rename gamma to glm5 and model to minimax-m2.7 - Add model_comparison/ directory with head-to-head analyses - Sanitize all session.jsonl files: remove absolute paths and usernames - Remove __pycache__ artifacts - Add .gitignore
231 lines
9.8 KiB
Markdown
231 lines
9.8 KiB
Markdown
# Head-to-Head: Layer Normalization Backward Pass
|
|
## GLM-5 backwards vs Qwen3.6-27B backwards
|
|
|
|
---
|
|
|
|
## Executive Summary
|
|
|
|
| Dimension | GLM-5 | Qwen3.6-27B |
|
|
|-----------|----------------|------------------|
|
|
| **Correctness** | 92 | 95 |
|
|
| **Completeness** | 80 | 95 |
|
|
| **Code Quality** | 88 | 90 |
|
|
| **Numerical Stability** | 80 | 95 |
|
|
| **Gradient Check** | 85 | 92 |
|
|
| **Complexity Analysis** | 82 | 90 |
|
|
| **GPU Fusion Explanation** | 85 | 88 |
|
|
| **Tests / Benchmarks** | 60 | 95 |
|
|
| **Overall** | **82** | **93** |
|
|
|
|
**Winner: Qwen3.6-27B by 11 points.**
|
|
|
|
---
|
|
|
|
## 1. Correctness
|
|
|
|
### GLM-5 (92/100)
|
|
- Implements the correct consolidated backward formula:
|
|
`dx = rstd * (dxhat - xhat * proj/D - dxhat_sum/D)`
|
|
- d_gamma and d_beta correctly computed via reductions over (B, T)
|
|
- Forward pass correctly uses two-pass variance (center first, then compute variance)
|
|
- Uses `rstd = 1.0 / np.sqrt(var + eps)` directly, which is numerically preferable to `1/std`
|
|
- **Minor note**: The docstring derivation is elegant but slightly condensed — it states the second term of dμ cancels to zero without showing the algebra, which could confuse readers trying to follow along
|
|
|
|
### Qwen3.6-27B (95/100)
|
|
- Implements the equivalent formula: `dx = std_inv * (g - g_mean - x_hat * gx_mean)`
|
|
- Full step-by-step derivation documented in code comments, including the Jacobian projection form
|
|
- **Independent cross-check**: `benchmark_layer_norm.py` contains an alternative step-by-step chain-rule derivation that independently computes dx and verifies it matches the compact formula (relative error < 1e-10)
|
|
|
|
**Verdict**: Both correct. Qwen3.6-27B's independent cross-check gives slightly higher confidence.
|
|
|
|
---
|
|
|
|
## 2. Completeness
|
|
|
|
### GLM-5 (80/100)
|
|
- Meets all 6 prompt requirements
|
|
- Single file containing: forward, backward, gradient check, complexity analysis, GPU fusion, numerical stability discussion
|
|
- Missing: dedicated edge-case tests, numerical stability demonstration, performance benchmarks, separate test files
|
|
|
|
### Qwen3.6-27B (95/100)
|
|
- Meets all 6 requirements comprehensively
|
|
- **Three separate files** with distinct responsibilities:
|
|
- `layer_norm_backward.py` — core implementation + gradient check + complexity + GPU fusion
|
|
- `test_layer_norm.py` — edge-case validation (zero input, D=1, large D, large mean, scale invariance)
|
|
- `benchmark_layer_norm.py` — performance benchmarks + variance stability demo + alternative derivation cross-check
|
|
|
|
**Verdict**: Qwen3.6-27B exceeds requirements with a full testing and benchmarking suite.
|
|
|
|
---
|
|
|
|
## 3. Code Quality
|
|
|
|
### GLM-5 (88/100)
|
|
- **Single file** (~280 lines) — remarkably concise for what it covers
|
|
- **Minimal cache**: `(xhat, rstd, glm5)` — only 3 items, exactly what's needed
|
|
- Clean function signatures with type hints
|
|
- Uses `np.random.default_rng()` (modern NumPy API)
|
|
- No unnecessary class wrappers or decorative ASCII art
|
|
- Gradient check operates on copies (not in-place), which is safer than MiniMax-M2.7's approach
|
|
|
|
### Qwen3.6-27B (90/100)
|
|
- **Focused implementation**: Core algorithm is ~70 lines
|
|
- **Minimal cache**: `{x_hat, std_inv, glm5, D}` — 4 items, essentially equivalent to GLM-5
|
|
- Separation of concerns across 3 files
|
|
- Docstrings are concise and precise
|
|
- No unnecessary class wrappers
|
|
|
|
**Verdict**: Both are very well-written. GLM-5 is more concise; Qwen3.6-27B has better separation. Nearly a tie.
|
|
|
|
---
|
|
|
|
## 4. Numerical Stability
|
|
|
|
### GLM-5 (80/100)
|
|
- Uses two-pass variance: `xc = x - mean`, then `var = mean(xc**2)`
|
|
- Discusses 5 stability scenarios in the `print_complexity_and_fusion()` function:
|
|
1. Division by near-zero σ̂ (eps guards against it)
|
|
2. Catastrophic cancellation in `xc = x - mean`
|
|
3. Overflow in `xc²` or `var`
|
|
4. Gradient explosion when σ̂ is very small
|
|
5. rstd computation (direct 1/sqrt preferred over sqrt→divide)
|
|
- **Weakness**: No concrete demonstration. The discussion is theoretical.
|
|
- eps = 1e-5
|
|
|
|
### Qwen3.6-27B (95/100)
|
|
- Explicitly uses two-pass variance and labels it as "numerically stable"
|
|
- **Concrete demonstration**: `benchmark_layer_norm.py` includes `demo_variance_stability()`:
|
|
- Shows `naive_variance` producing `0.0` for offset=1e8 (true variance = 2.0)
|
|
- Shows `two_pass_variance` staying exact at `2.0`
|
|
- Demonstrates degradation across offsets from 1e4 to 1e14
|
|
- **Edge-case tests**: `test_layer_norm.py` tests zero input, D=1 (degenerate), large D (1024), large-magnitude inputs (1e8 offset)
|
|
- eps = 1e-5
|
|
|
|
**Verdict**: Qwen3.6-27B wins decisively by demonstrating the problem rather than just describing it.
|
|
|
|
---
|
|
|
|
## 5. Gradient Check
|
|
|
|
### GLM-5 (85/100)
|
|
- Central finite differences for all three parameters (x, glm5, beta)
|
|
- Reports both max absolute error and relative error
|
|
- Uses `tol=1e-4` for pass/fail determination
|
|
- Tests on a single shape (B=2, T=3, D=8) in the default call, and (B=3, T=5, D=32) in the gradient_check function
|
|
- **Strength**: Operates on copies (`x_plus = x.copy()`), avoiding the in-place corruption risk seen in MiniMax-M2.7
|
|
|
|
### Qwen3.6-27B (92/100)
|
|
- Central finite differences with `delta=1e-5`
|
|
- Reports relative error — more informative than absolute alone
|
|
- Tests on shape (4, 8, 16) with all three gradients
|
|
- **Relative errors reported**: dx ~5e-11, dgamma ~1.75e-11, dbeta ~1.46e-11 — extremely tight
|
|
- Edge-case tests in `test_layer_norm.py` run gradient checks on large-magnitude and large-D inputs
|
|
|
|
**Verdict**: Qwen3.6-27B has tighter numerical agreement and broader test coverage.
|
|
|
|
---
|
|
|
|
## 6. Complexity Analysis
|
|
|
|
### GLM-5 (82/100)
|
|
- Correctly identifies O(BTD) time and space complexity
|
|
- Breaks down forward and backward into component operations
|
|
- Discusses extra memory: O(M) for xhat + O(N) for rstd
|
|
- No quantitative FLOP counts or memory footprint in bytes
|
|
|
|
### Qwen3.6-27B (90/100)
|
|
- More granular FLOP counts: forward ~6N, backward ~9N, total ~15N
|
|
- Explicitly notes backward is ~1.5x forward in FLOPs
|
|
- Includes memory footprint in MB for concrete shapes
|
|
- Discusses why two-pass variance is worth the extra O(N) FLOPs
|
|
- Computes TFLOPS throughput in benchmarks
|
|
|
|
**Verdict**: Qwen3.6-27B provides more quantitative detail.
|
|
|
|
---
|
|
|
|
## 7. GPU Fusion Explanation
|
|
|
|
### GLM-5 (85/100)
|
|
- Describes a single-kernel backward fusion design
|
|
- Specifies shared memory layout: `smem_xhat[D]`, `smem_dxhat[D]`, `smem_proj[1]`, `smem_sum[1]`
|
|
- 4-step algorithm: load+compute dxhat, cooperative reduction, compute dx, atomic adds for dgamma/dbeta
|
|
- Quantifies memory traffic: ≈3D elements vs ≈10D+ for unfused
|
|
- Mentions warp-level shuffles and vectorized loads as additional optimizations
|
|
- Clean, practical description
|
|
|
|
### Qwen3.6-27B (88/100)
|
|
- Detailed GPU fusion discussion with CUDA pseudocode for both forward and backward
|
|
- **Quantifies memory traffic**: naive = ~12 accesses/element, fused = 4 (forward) and 5 (backward)
|
|
- Discusses atomicAdd for dgamma/dbeta reduction
|
|
- Mentions shared memory optimization for small D (<= 1024)
|
|
- Notes that warp-level primitives can replace shared memory when D <= 32
|
|
|
|
**Verdict**: Both are strong. Qwen3.6-27B has slightly better quantitative comparison.
|
|
|
|
---
|
|
|
|
## 8. Tests and Benchmarks
|
|
|
|
### GLM-5 (60/100)
|
|
- `gradient_check()` function tests one shape with all three parameters
|
|
- No edge-case tests, no assertions, no separate test file
|
|
- No performance benchmarks
|
|
- No numerical stability demonstration
|
|
|
|
### Qwen3.6-27B (95/100)
|
|
- `test_layer_norm.py` with 5 edge-case test categories:
|
|
1. Large mean, tiny variance (cancellation-prone)
|
|
2. Zero input (variance = 0)
|
|
3. Large D (Transformer-scale: D=1024)
|
|
4. D=1 (degenerate case)
|
|
5. Gradient norm sanity across scales (1e-3 to 1e6)
|
|
- `benchmark_layer_norm.py` with:
|
|
- Variance stability demo (naive vs two-pass)
|
|
- Performance benchmarks across 8 configurations
|
|
- Alternative derivation cross-check
|
|
- `test_memory_efficiency()` explicitly verifies minimal cache
|
|
- Uses `assert` statements for validation
|
|
|
|
**Verdict**: Qwen3.6-27B is far superior in testing coverage and rigor.
|
|
|
|
---
|
|
|
|
## 9. What Each Did Best
|
|
|
|
| GLM-5 | Qwen3.6-27B |
|
|
|-----------------|------------------|
|
|
| Exceptional conciseness — 280 lines covers everything | Minimal, precise cache + 3-file separation |
|
|
| Modern NumPy API (`default_rng`, type hints) | Concrete catastrophic cancellation demo |
|
|
| Safe gradient check (copies, not in-place) | Independent backward formula cross-check |
|
|
| Clean GPU fusion description with memory quantification | Comprehensive edge-case test suite |
|
|
| rstd computation (avoids sqrt→divide) | Memory-efficiency verification + benchmarks |
|
|
|
|
---
|
|
|
|
## 10. Weaknesses
|
|
|
|
### GLM-5
|
|
1. **No edge-case testing**: No tests for zero input, D=1, large offsets, etc.
|
|
2. **No concrete stability demo**: Discusses catastrophic cancellation but never shows it
|
|
3. **No performance benchmarks**: No timing or throughput measurements
|
|
4. **Single file**: While concise, separation into test/benchmark files would be better
|
|
5. **Gradient check only on small shapes**: No spot-check for large tensors
|
|
|
|
### Qwen3.6-27B
|
|
1. **GPU fusion discussion is a string constant**: Less readable than GLM-5's formatted output
|
|
2. **No spot-check for very large tensors**: Gradient check always runs full finite differences
|
|
3. **Slightly more verbose**: The core implementation is clean but surrounded by extensive analysis text
|
|
|
|
---
|
|
|
|
## Final Verdict
|
|
|
|
**Qwen3.6-27B wins by 11 points (93 vs 82).**
|
|
|
|
The gap is driven by two factors:
|
|
1. **Testing**: Qwen3.6-27B has a full test suite with edge cases, assertions, and memory verification; GLM-5 has only a basic gradient check.
|
|
2. **Numerical stability**: Qwen3.6-27B *demonstrates* the catastrophic cancellation problem with concrete examples; GLM-5 only describes it.
|
|
|
|
GLM-5 is genuinely good — it correctly implements the backward pass with a minimal cache, clean code, and a solid GPU fusion discussion. It would score much higher than MiniMax-M2.7's implementation. But Qwen3.6-27B takes the same foundation and elevates it with rigorous testing, concrete demonstrations, and cleaner engineering separation.
|