# Round 2 Summary: GLM-5 vs Qwen3.6-27B

## Overall Scoreboard

| Task | GLM-5 | Qwen3.6-27B | Winner | Margin |
|---|---|---|---|---|
| KV Cache | 82/100 | 94/100 | qwen36 | +12 |
| Backwards Pass | 82/100 | 93/100 | qwen36 | +11 |
| Fused Softmax+TopK | 80/100 | 78/100 | glm5 | +2 |
| **Average** | **81** | **88** | **qwen36** | **+7** |

**Winner: Qwen3.6-27B** — won 2 of 3 tasks, but GLM-5 made it competitive (especially on the Fused Softmax+TopK task).


## Task 1: KV Cache System

| Dimension | GLM-5 | Qwen3.6-27B |
|---|---|---|
| Correctness | 95 | 95 |
| Completeness | 78 | 95 |
| Code Quality | 80 | 92 |
| Depth of Analysis | 82 | 96 |
| Optimizations | 85 | 93 |
| GPU Mapping | 80 | 95 |
| Tests/Demos | 82 | 90 |
| **Overall** | **82** | **94** |
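
For orientation, the task boils down to: append this step's key/value to a growing cache, then attend over everything cached so far. A minimal single-head NumPy sketch (illustrative only, not either model's actual code):

```python
import numpy as np

def decode_step(q, k_new, v_new, K_cache, V_cache):
    """One autoregressive step: extend the KV cache, attend over all of it."""
    K_cache = np.concatenate([K_cache, k_new[None]], axis=0)  # (t, d)
    V_cache = np.concatenate([V_cache, v_new[None]], axis=0)  # (t, d)
    scores = K_cache @ q / np.sqrt(q.size)   # (t,) scores for one query token
    w = np.exp(scores - scores.max())
    w /= w.sum()                             # softmax over cached positions
    return w @ V_cache, K_cache, V_cache     # (d,) attention output
```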

### GLM-5 Strengths

- **Excellent documentation** — best-in-class README with ASCII diagrams and pedagogical explanations
- **INT4 quantization** — only implementation with true 2-values-per-byte packing (see the packing sketch after this list)
- **Rigorous correctness testing** — cached vs. non-cached attention matches to 1e-5; the quantized cache has bounded-error assertions
- **Clean, readable code** — very approachable for learning
- **No correctness bugs** — correct attention, proper cache updates, working batched inference
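
The 2-values-per-byte idea is easy to show in NumPy. A minimal sketch (the helper names and the per-vector scale scheme are illustrative assumptions, not GLM-5's actual code):

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack an even-length array of 4-bit codes (0..15) two per byte."""
    q = q.astype(np.uint8)
    return (q[0::2] | (q[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover the two 4-bit codes from each byte."""
    lo, hi = packed & 0x0F, packed >> 4
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2], out[1::2] = lo, hi
    return out

# Quantize a KV vector to INT4 with a per-vector scale, then round-trip.
v = np.random.randn(128).astype(np.float32)
scale = np.abs(v).max() / 7.0                  # map into signed range [-7, 7]
q = np.clip(np.round(v / scale) + 8, 0, 15)    # shift to unsigned codes 0..15
v_hat = (unpack_int4(pack_int4(q)).astype(np.float32) - 8) * scale
assert np.abs(v - v_hat).max() <= scale        # bounded quantization error
```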

### GLM-5 Weaknesses

- **Incomplete transformer** — no MLP, no causal mask, no positional encoding
- **Limited batched masking** — variable-length batching lacks full per-sequence masking
- **Less systems analysis** — no arithmetic-intensity calculations, no real-GPU context limits

### Qwen3.6-27B Strengths (same as Round 1)

- **Full transformer decoder** with LayerNorm, MLP, GELU, residuals, positional encoding
- **GQA support** — modern architecture awareness (Llama-2/3, Mistral)
- **Outstanding systems analysis** — memory growth with real model names, max context per GPU, arithmetic intensity proving memory-bound generation (a back-of-envelope version follows this list)
- **10 comprehensive demos**, including full generation with temperature/top-k sampling
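
To make that last point concrete, here is a back-of-envelope version of the memory-bound argument (the hardware figures are approximate A100 specs, not numbers taken from qwen36's write-up):

```python
# Arithmetic intensity of one decode-step GEMV, y = W @ x, with W in fp16.
d = 4096
flops = 2 * d * d                # one multiply-add per weight element
bytes_moved = 2 * d * d          # each fp16 weight (2 bytes) is read once
intensity = flops / bytes_moved  # = 1.0 FLOP/byte

# Approximate A100 figures: ~312 TFLOP/s fp16 vs ~2 TB/s HBM bandwidth,
# i.e. ~156 FLOP/byte needed to be compute-bound. At 1 FLOP/byte,
# single-token generation is firmly memory-bound.
print(f"arithmetic intensity = {intensity:.1f} FLOP/byte")
```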

## Task 2: Layer Norm Backward Pass

| Dimension | GLM-5 | Qwen3.6-27B |
|---|---|---|
| Correctness | 92 | 95 |
| Completeness | 80 | 95 |
| Code Quality | 88 | 90 |
| Numerical Stability | 80 | 95 |
| Gradient Check | 85 | 92 |
| Complexity Analysis | 82 | 90 |
| GPU Fusion | 85 | 88 |
| Tests/Benchmarks | 60 | 95 |
| **Overall** | **82** | **93** |

### GLM-5 Strengths

- **Exceptional conciseness** — ~280 lines cover everything (forward, backward, gradient check, complexity, GPU fusion, stability discussion)
- **Minimal cache** — only 3 items (`xhat`, `rstd`, `gamma`), exactly what the backward pass needs (a NumPy sketch follows this list)
- **Modern NumPy API** — `default_rng`, type hints
- **Safe gradient check** — operates on copies, not in-place
- **Clean GPU-fusion description** with memory-traffic quantification (≈3D fused vs. ≈10D+ unfused)
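
For reference, a compact sketch of what that 3-item cache buys you: the whole backward pass in a few lines of NumPy. This is my reconstruction of the standard closed form, not glm5's file verbatim:

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D). Returns the output plus the minimal 3-item cache."""
    mu = x.mean(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    xhat = (x - mu) * rstd
    return gamma * xhat + beta, (xhat, rstd, gamma)

def layernorm_backward(dy, cache):
    """Standard closed form: subtract two per-row means, rescale by rstd."""
    xhat, rstd, gamma = cache
    dgamma = (dy * xhat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dxhat = dy * gamma
    dx = rstd * (dxhat
                 - dxhat.mean(axis=-1, keepdims=True)
                 - xhat * (dxhat * xhat).mean(axis=-1, keepdims=True))
    return dx, dgamma, dbeta
```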

### GLM-5 Weaknesses

- **No edge-case tests** — no zero input, D=1, large offsets, etc.
- **No concrete stability demo** — discusses catastrophic cancellation but never shows it
- **No performance benchmarks** — no timing or throughput measurements
- **Single file** — while concise, separating tests and benchmarks into their own files would be better

### Qwen3.6-27B Strengths (same as Round 1)

- **3-file separation:** core + tests + benchmarks
- **Concrete catastrophic-cancellation demo** (naive variance = 0 at offset = 1e8; two-pass = exact); the sketch after this list shows the effect
- **5 edge-case test categories** with assertions
- **Independent backward-formula cross-check** (<1e-10 error)
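
In the same spirit (not qwen36's literal script), the cancellation effect fits in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000) + 1e8     # unit variance, huge mean offset
naive = np.mean(x * x) - np.mean(x) ** 2   # E[x^2] - E[x]^2 cancels catastrophically
two_pass = np.mean((x - x.mean()) ** 2)    # subtract the mean first: stable
print(naive, two_pass)                     # naive is ~0 or negative; two-pass is ~1.0
```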

## Task 3: Fused Softmax + TopK CUDA

| Dimension | GLM-5 | Qwen3.6-27B |
|---|---|---|
| Correctness | 65 | 95 |
| Completeness | 90 | 85 |
| Code Quality | 88 | 82 |
| CUDA Depth | 92 | 82 |
| Memory Design | 90 | 70 |
| Complexity Analysis | 88 | 72 |
| Naive Comparison | 85 | 78 |
| **Overall** | **80** | **78** |

### GLM-5 Strengths

- **Single-pass online softmax** (Milakov & Gimelshein, 2018) — reads V only once, which is optimal (a scalar rendition follows this list)
- **Research-level CUDA knowledge** — register-resident sorted arrays, warp-shuffle reductions, occupancy analysis
- **Excellent documentation** — 9-section DESIGN.md with quantitative analysis and an ASCII architecture diagram
- **Accurate complexity analysis** — correctly identifies the bandwidth-bound nature
- **One-warp-per-row design** — elegant mapping with strided coalesced access
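
The online-normalizer trick keeps a running max `m` and a running sum `s` that is rescaled whenever the max improves, so each logit is read exactly once. A scalar Python rendition (GLM-5's kernel does this per warp, in registers, with shuffle reductions):

```python
import numpy as np

def online_normalizer(logits):
    """Single pass: running max m, running sum s of exp(x - m),
    rescaled whenever the max improves (Milakov & Gimelshein, 2018)."""
    m, s = -np.inf, 0.0
    for x in logits:                 # each logit is read exactly once
        m_new = max(m, x)
        s = s * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    return m, s                      # softmax(x_i) = exp(x_i - m) / s

v = np.random.randn(1000)
m, s = online_normalizer(v)
ref = np.exp(v - v.max()) / np.exp(v - v.max()).sum()
assert np.allclose(np.exp(v - m) / s, ref)
```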

### GLM-5 Critical Weakness

- 🐛 **Cross-warp merge bug** — when `WARPS_PER_BLOCK > 1`, the merge conflates heaps from different rows; the kernel is only correct with `WARPS_PER_BLOCK = 1`. The design claims "one warp per row" but then treats all warps in a block as cooperating on the same row — a fundamental contradiction.

### Qwen3.6-27B Strengths

- **No critical correctness bugs** — the simpler one-block-per-row design avoids ambiguity
- **Two kernel versions** (v1 + v2) showing iterative improvement
- **Vectorized `float4` loads** in v2 for wider memory transactions
- **Better test coverage** — tests LLM-scale vocabularies (V=50257, K=256)

### Qwen3.6-27B Weaknesses

- **Suboptimal 3-pass algorithm** — 3× more global reads than necessary (3 passes × 4V bytes = 12V bytes vs. glm5's single-pass 4V); see the sketch after this list
- **Flawed complexity analysis** — incorrectly claims compute-bound; at 12V bytes of reads it is actually bandwidth-bound
- **Dead code in v2** — the `warp_topk_merge` and `process_float4` functions are never called
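
Reduced to NumPy, the 3-pass structure looks roughly like this (illustrative shape of the approach, not qwen36's kernel):

```python
import numpy as np

def three_pass_softmax_topk(row, k):
    m = row.max()                            # pass 1: global max
    denom = np.exp(row - m).sum()            # pass 2: softmax denominator
    idx = np.argpartition(row, -k)[-k:]      # pass 3: select the top-k logits
    idx = idx[np.argsort(row[idx])[::-1]]    # sort the k winners descending
    return idx, np.exp(row[idx] - m) / denom
```

Each pass re-reads all V logits, which is where the 12V bytes of read traffic come from; glm5's online formulation folds the max and the sum into a single pass and selects the top-k in registers along the way.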

### The Ideal Hybrid

A production implementation would combine glm5's online-softmax algorithm and register-resident heap with qwen36's vectorized loads and comprehensive testing — scoring ~95/100.


## What Made GLM-5 Competitive

| Factor | GLM-5 | Qwen3.6-27B |
|---|---|---|
| Correctness | Correct on 2 of 3 (cross-warp merge bug in the fused kernel) | Correct in all 3 |
| Testing | Basic (good assertions, limited coverage) | Comprehensive |
| Analysis depth | Good | Excellent (quantitative + real models) |
| Code organization | Clean, focused | Modular and production-grade |
| Algorithmic sophistication | Excellent (online softmax, INT4) | Good (solid but conventional) |

**Key insight:** GLM-5 was much closer to Qwen3.6-27B (a +7 average margin) than MiniMax-M2.7 was (+24). GLM-5's code was correct, concise, and well-engineered; it lost mainly on completeness (fewer tests, less analysis depth) rather than on fundamental correctness issues.