# Round 2 Summary: GLM-5 vs Qwen3.6-27B

## Overall Scoreboard

| Task | GLM-5 | Qwen3.6-27B | Winner | Margin |
|---|---|---|---|---|
| KV Cache | 82/100 | 94/100 | qwen36 | +12 |
| Backwards Pass | 82/100 | 93/100 | qwen36 | +11 |
| Fused Softmax+TopK | 80/100 | 78/100 | glm5 | +2 |
| **Average** | **81** | **88** | **qwen36** | **+7** |

**Winner: Qwen3.6-27B** — won 2 of 3 tasks, but GLM-5 made it competitive (especially on the Fused Softmax+TopK task).


## Task 1: KV Cache System

| Dimension | GLM-5 | Qwen3.6-27B |
|---|---|---|
| Correctness | 95 | 95 |
| Completeness | 78 | 95 |
| Code Quality | 80 | 92 |
| Depth of Analysis | 82 | 96 |
| Optimizations | 85 | 93 |
| GPU Mapping | 80 | 95 |
| Tests/Demos | 82 | 90 |
| **Overall** | **82** | **94** |
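
For orientation, the task boils down to: append this step's key/value to a growing cache, then attend over everything cached so far. A minimal single-head NumPy sketch (illustrative only, not either model's actual code):

```python
import numpy as np

def decode_step(q, k_new, v_new, K_cache, V_cache):
    """One autoregressive step: extend the KV cache, attend over all of it."""
    K_cache = np.concatenate([K_cache, k_new[None]], axis=0)  # (t, d)
    V_cache = np.concatenate([V_cache, v_new[None]], axis=0)  # (t, d)
    scores = K_cache @ q / np.sqrt(q.size)   # (t,) scores for one query token
    w = np.exp(scores - scores.max())
    w /= w.sum()                             # softmax over cached positions
    return w @ V_cache, K_cache, V_cache     # (d,) attention output
```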

### GLM-5 Strengths

- **Excellent documentation** — best-in-class README with ASCII diagrams and pedagogical explanations
- **INT4 quantization** — only implementation with true 2-values-per-byte packing (see the packing sketch after this list)
- **Rigorous correctness testing** — cached vs. non-cached attention matches to 1e-5; the quantized cache has bounded-error assertions
- **Clean, readable code** — very approachable for learning
- **No correctness bugs** — correct attention, proper cache updates, working batched inference
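
The 2-values-per-byte idea is easy to show in NumPy. A minimal sketch (the helper names and the per-vector scale scheme are illustrative assumptions, not GLM-5's actual code):

```python
import numpy as np

def pack_int4(q: np.ndarray) -> np.ndarray:
    """Pack an even-length array of 4-bit codes (0..15) two per byte."""
    q = q.astype(np.uint8)
    return (q[0::2] | (q[1::2] << 4)).astype(np.uint8)

def unpack_int4(packed: np.ndarray) -> np.ndarray:
    """Inverse of pack_int4: recover the two 4-bit codes from each byte."""
    lo, hi = packed & 0x0F, packed >> 4
    out = np.empty(packed.size * 2, dtype=np.uint8)
    out[0::2], out[1::2] = lo, hi
    return out

# Quantize a KV vector to INT4 with a per-vector scale, then round-trip.
v = np.random.randn(128).astype(np.float32)
scale = np.abs(v).max() / 7.0                  # map into signed range [-7, 7]
q = np.clip(np.round(v / scale) + 8, 0, 15)    # shift to unsigned codes 0..15
v_hat = (unpack_int4(pack_int4(q)).astype(np.float32) - 8) * scale
assert np.abs(v - v_hat).max() <= scale        # bounded quantization error
```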

### GLM-5 Weaknesses

- **Incomplete transformer** — no MLP, no causal mask, no positional encoding
- **Limited batched masking** — variable-length batching lacks full per-sequence masking
- **Less systems analysis** — no arithmetic-intensity calculations, no real-GPU context limits

### Qwen3.6-27B Strengths (same as Round 1)

- **Full transformer decoder** with LayerNorm, MLP, GELU, residuals, positional encoding
- **GQA support** — modern architecture awareness (Llama-2/3, Mistral)
- **Outstanding systems analysis** — memory growth with real model names, max context per GPU, arithmetic intensity proving memory-bound generation (a back-of-envelope version follows this list)
- **10 comprehensive demos**, including full generation with temperature/top-k sampling
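
To make that last point concrete, here is a back-of-envelope version of the memory-bound argument (the hardware figures are approximate A100 specs, not numbers taken from qwen36's write-up):

```python
# Arithmetic intensity of one decode-step GEMV, y = W @ x, with W in fp16.
d = 4096
flops = 2 * d * d                # one multiply-add per weight element
bytes_moved = 2 * d * d          # each fp16 weight (2 bytes) is read once
intensity = flops / bytes_moved  # = 1.0 FLOP/byte

# Approximate A100 figures: ~312 TFLOP/s fp16 vs ~2 TB/s HBM bandwidth,
# i.e. ~156 FLOP/byte needed to be compute-bound. At 1 FLOP/byte,
# single-token generation is firmly memory-bound.
print(f"arithmetic intensity = {intensity:.1f} FLOP/byte")
```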

## Task 2: Layer Norm Backward Pass

| Dimension | GLM-5 | Qwen3.6-27B |
|---|---|---|
| Correctness | 92 | 95 |
| Completeness | 80 | 95 |
| Code Quality | 88 | 90 |
| Numerical Stability | 80 | 95 |
| Gradient Check | 85 | 92 |
| Complexity Analysis | 82 | 90 |
| GPU Fusion | 85 | 88 |
| Tests/Benchmarks | 60 | 95 |
| **Overall** | **82** | **93** |

### GLM-5 Strengths

- **Exceptional conciseness** — ~280 lines cover everything (forward, backward, gradient check, complexity, GPU fusion, stability discussion)
- **Minimal cache** — only 3 items (`xhat`, `rstd`, `gamma`), exactly what the backward pass needs (a NumPy sketch follows this list)
- **Modern NumPy API** — `default_rng`, type hints
- **Safe gradient check** — operates on copies, not in-place
- **Clean GPU-fusion description** with memory-traffic quantification (≈3D fused vs. ≈10D+ unfused)
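
For reference, a compact sketch of what that 3-item cache buys you: the whole backward pass in a few lines of NumPy. This is my reconstruction of the standard closed form, not glm5's file verbatim:

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """x: (N, D). Returns the output plus the minimal 3-item cache."""
    mu = x.mean(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    xhat = (x - mu) * rstd
    return gamma * xhat + beta, (xhat, rstd, gamma)

def layernorm_backward(dy, cache):
    """Standard closed form: subtract two per-row means, rescale by rstd."""
    xhat, rstd, gamma = cache
    dgamma = (dy * xhat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dxhat = dy * gamma
    dx = rstd * (dxhat
                 - dxhat.mean(axis=-1, keepdims=True)
                 - xhat * (dxhat * xhat).mean(axis=-1, keepdims=True))
    return dx, dgamma, dbeta
```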

### GLM-5 Weaknesses

- **No edge-case tests** — no zero input, D=1, large offsets, etc.
- **No concrete stability demo** — discusses catastrophic cancellation but never shows it
- **No performance benchmarks** — no timing or throughput measurements
- **Single file** — while concise, separating tests and benchmarks into their own files would be better

### Qwen3.6-27B Strengths (same as Round 1)

- **3-file separation:** core + tests + benchmarks
- **Concrete catastrophic-cancellation demo** (naive variance = 0 at offset = 1e8; two-pass = exact); the sketch after this list shows the effect
- **5 edge-case test categories** with assertions
- **Independent backward-formula cross-check** (<1e-10 error)
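
In the same spirit (not qwen36's literal script), the cancellation effect fits in a few lines of NumPy:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(100_000) + 1e8     # unit variance, huge mean offset
naive = np.mean(x * x) - np.mean(x) ** 2   # E[x^2] - E[x]^2 cancels catastrophically
two_pass = np.mean((x - x.mean()) ** 2)    # subtract the mean first: stable
print(naive, two_pass)                     # naive is ~0 or negative; two-pass is ~1.0
```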

## Task 3: Fused Softmax + TopK CUDA

| Dimension | GLM-5 | Qwen3.6-27B |
|---|---|---|
| Correctness | 65 | 95 |
| Completeness | 90 | 85 |
| Code Quality | 88 | 82 |
| CUDA Depth | 92 | 82 |
| Memory Design | 90 | 70 |
| Complexity Analysis | 88 | 72 |
| Naive Comparison | 85 | 78 |
| **Overall** | **80** | **78** |

### GLM-5 Strengths

- **Single-pass online softmax** (Milakov & Gimelshein, 2018) — reads V only once, which is optimal (a scalar rendition follows this list)
- **Research-level CUDA knowledge** — register-resident sorted arrays, warp-shuffle reductions, occupancy analysis
- **Excellent documentation** — 9-section DESIGN.md with quantitative analysis and an ASCII architecture diagram
- **Accurate complexity analysis** — correctly identifies the bandwidth-bound nature
- **One-warp-per-row design** — elegant mapping with strided coalesced access
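
The online-normalizer trick keeps a running max `m` and a running sum `s` that is rescaled whenever the max improves, so each logit is read exactly once. A scalar Python rendition (GLM-5's kernel does this per warp, in registers, with shuffle reductions):

```python
import numpy as np

def online_normalizer(logits):
    """Single pass: running max m, running sum s of exp(x - m),
    rescaled whenever the max improves (Milakov & Gimelshein, 2018)."""
    m, s = -np.inf, 0.0
    for x in logits:                 # each logit is read exactly once
        m_new = max(m, x)
        s = s * np.exp(m - m_new) + np.exp(x - m_new)
        m = m_new
    return m, s                      # softmax(x_i) = exp(x_i - m) / s

v = np.random.randn(1000)
m, s = online_normalizer(v)
ref = np.exp(v - v.max()) / np.exp(v - v.max()).sum()
assert np.allclose(np.exp(v - m) / s, ref)
```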

### GLM-5 Critical Weakness

- 🐛 **Cross-warp merge bug** — when `WARPS_PER_BLOCK > 1`, the merge conflates heaps from different rows; the kernel is only correct with `WARPS_PER_BLOCK = 1`. The design claims "one warp per row" but then treats all warps in a block as cooperating on the same row — a fundamental contradiction.

### Qwen3.6-27B Strengths

- **No critical correctness bugs** — the simpler one-block-per-row design avoids ambiguity
- **Two kernel versions** (v1 + v2) showing iterative improvement
- **Vectorized `float4` loads** in v2 for wider memory transactions
- **Better test coverage** — tests LLM-scale vocabularies (V=50257, K=256)

### Qwen3.6-27B Weaknesses

- **Suboptimal 3-pass algorithm** — 3× more global reads than necessary (3 passes × 4V bytes = 12V bytes vs. glm5's single-pass 4V); see the sketch after this list
- **Flawed complexity analysis** — incorrectly claims compute-bound; at 12V bytes of reads it is actually bandwidth-bound
- **Dead code in v2** — the `warp_topk_merge` and `process_float4` functions are never called
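
Reduced to NumPy, the 3-pass structure looks roughly like this (illustrative shape of the approach, not qwen36's kernel):

```python
import numpy as np

def three_pass_softmax_topk(row, k):
    m = row.max()                            # pass 1: global max
    denom = np.exp(row - m).sum()            # pass 2: softmax denominator
    idx = np.argpartition(row, -k)[-k:]      # pass 3: select the top-k logits
    idx = idx[np.argsort(row[idx])[::-1]]    # sort the k winners descending
    return idx, np.exp(row[idx] - m) / denom
```

Each pass re-reads all V logits, which is where the 12V bytes of read traffic come from; glm5's online formulation folds the max and the sum into a single pass and selects the top-k in registers along the way.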

### The Ideal Hybrid

A production implementation would combine glm5's online-softmax algorithm and register-resident heap with qwen36's vectorized loads and comprehensive testing — scoring ~95/100.


## What Made GLM-5 Competitive

| Factor | GLM-5 | Qwen3.6-27B |
|---|---|---|
| Correctness | Correct on 2 of 3 (cross-warp merge bug in the fused kernel) | Correct in all 3 |
| Testing | Basic (good assertions, limited coverage) | Comprehensive |
| Analysis depth | Good | Excellent (quantitative + real models) |
| Code organization | Clean, focused | Modular and production-grade |
| Algorithmic sophistication | Excellent (online softmax, INT4) | Good (solid but conventional) |

**Key insight:** GLM-5 was much closer to Qwen3.6-27B (a +7 average margin) than MiniMax-M2.7 was (+24). GLM-5's code was correct, concise, and well-engineered; it lost mainly on completeness (fewer tests, less analysis depth) rather than on fundamental correctness issues.