sleepy 45c3aad453 feat: expand to 6 models, 8 challenges; rewrite README with DeepSeek V4 Pro analysis
- Add Claude Opus 4.7, Kimi K2.6, GLM-5.1 to existing GLM-5, Qwen3-6, MiniMax-M2.7
- Add 5 new challenges: flash attention fwd/bwd, beam search, DFlash, ternary training
- Rewrite README with TL;DR rankings, grade matrix, and DeepSeek V4 Pro attribution
- Add analysis/ folder with cross-model comparisons and per-challenge deep dives
- Add deploy_challenges.sh script
- Expand .gitignore to exclude Python envs, ML weights, and build artifacts
2026-04-27 18:49:22 +02:00


2-Way Head-to-Head Comparisons

GLM-5 vs MiniMax-M2.7

Task 1: Backward Layer Norm

| Criteria | GLM-5 | MiniMax-M2.7 | Edge |
|---|---|---|---|
| Lines of code | 275 | 1148 | GLM (more concise) |
| Gradient correctness | PASS (~1e-10 rel) | PASS (~1e-10 rel) | Tie |
| Cache efficiency | 3 items | 12 items (9 redundant) | GLM |
| Numerical stability discussion | 5 failure modes | Buried in code comments | GLM |
| GPU fusion detail | Backward only, 4 steps | Forward + backward, full CUDA pseudocode | MiniMax |
| Edge case testing | None | None (spot-check only) | Tie |
| Benchmark | None | 4 shape configs | MiniMax |
| Spot-check for large tensors | No | Yes (>100k elements) | MiniMax |

Winner: GLM-5 (cleaner, more correct cache design; MiniMax's GPU pseudocode is better but the cache bloat is a fundamental flaw)
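Neither model's code is reproduced in this analysis; as a reference point for what the minimal 3-item cache looks like, here is a hypothetical NumPy sketch of backward layer norm caching only `xhat`, `rstd`, and `gamma` (the design GLM-5 is credited with above — this is an illustration, not GLM-5's actual code):

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    # Normalize over the last axis; cache only what backward needs.
    mean = x.mean(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    xhat = (x - mean) * rstd
    y = gamma * xhat + beta
    cache = (xhat, rstd, gamma)  # the 3-item cache
    return y, cache

def layernorm_backward(dy, cache):
    xhat, rstd, gamma = cache
    dxhat = dy * gamma
    # Fused form: dx = rstd * (dxhat - mean(dxhat) - xhat * mean(dxhat * xhat))
    dx = rstd * (dxhat
                 - dxhat.mean(axis=-1, keepdims=True)
                 - xhat * (dxhat * xhat).mean(axis=-1, keepdims=True))
    dgamma = (dy * xhat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    return dx, dgamma, dbeta
```

The mean and variance need not be cached: `xhat` and `rstd` together carry all the statistics the backward formula uses.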

Task 2: Fused Softmax+Top-K

| Criteria | GLM-5 | MiniMax-M2.7 | Edge |
|---|---|---|---|
| Algorithm | Online softmax (single pass) | 2-pass (max → sum → topk) | GLM |
| CUDA correctness | Compilable, correct | Has bugs (launch bounds, shared mem layout, stack overflow) | GLM |
| K limit | ≤32 | ≤100 | MiniMax |
| Warp-level | Butterfly shuffle reductions | Butterfly shuffle reductions | Tie |
| Top-K data structure | Register sorted array | Register sorted array | Tie |
| Cross-warp merge | Shared memory, serial | Shared memory, thread 0 only | Tie |
| Documentation | DESIGN.md (9 sections) | Inline ASCII diagrams (comprehensive) | GLM |
| Bandwidth analysis | AI=1.5, 3× speedup | AI=0.8, 4× speedup | Tie (both correct) |
| Production readiness | Medium | Low (bugs) | GLM |

Winner: GLM-5 (MiniMax's CUDA has real bugs that prevent compilation/correctness; GLM's online algorithm is genuinely superior)
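The online-softmax idea that makes GLM-5's kernel single-pass is easiest to see host-side. This is a hypothetical Python reference (not GLM-5's CUDA): a running max and a rescaled running sum yield the softmax normalizer in one sweep, while a small sorted buffer plays the role of the register-sorted top-K array:

```python
import math

def online_softmax_topk(logits, k):
    """Single pass over V logits: streaming max/sum plus a size-k sorted buffer."""
    m = -math.inf  # running max
    s = 0.0        # running sum of exp(x - m)
    topk = []      # (logit, index), sorted descending, length <= k
    for i, v in enumerate(logits):
        if v > m:
            s = s * math.exp(m - v) + 1.0  # rescale old sum to the new max
            m = v
        else:
            s += math.exp(v - m)
        # Insertion into the small top-k buffer (register-array analogue).
        if len(topk) < k or v > topk[-1][0]:
            topk.append((v, i))
            topk.sort(key=lambda t: -t[0])
            topk = topk[:k]
    # Softmax probabilities of the k largest logits, never materializing all V probs.
    return [(i, math.exp(v - m) / s) for v, i in topk]
```

The key invariant is that `s` always equals the sum of `exp(x - m)` over everything seen so far, so one rescale per max update keeps the normalizer exact without a second pass.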

Task 3: KV-Cache

| Criteria | GLM-5 | MiniMax-M2.7 | Edge |
|---|---|---|---|
| Core cache design | Clean, correct | Over-complicated, format mismatch | GLM |
| Memory layout | BHSD (good) | Multiple formats (good concept, messy impl) | Tie |
| Variable-length batching | Working | Attempted but flawed | GLM |
| Paged attention | Working, free-list | Working, block allocator | Tie |
| Quantization | INT8/INT4 working | Not implemented separately | GLM |
| Chunked prefill | Implemented (partial) | Mentioned but not implemented | GLM |
| Tests | 8 tests, ALL PASS | 0 tests | GLM |
| Memory analysis | Tables + FLOPs comparison | MemoryAnalyzer class (estimated latency) | Tie |
| Code organization | 3 files (core + opt + test) | 1 monolithic 1720-line file | GLM |
| Architecture issues | None significant | Format mismatch between stack and attention | GLM |

Winner: GLM-5 (MiniMax's implementation has a critical format mismatch bug and no tests; GLM's is correct and well-tested)
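Both paged designs (GLM's free-list, MiniMax's block allocator) rest on the same idea: fixed-size pages plus a per-sequence page table. A toy sketch, assuming a hypothetical `PagedKVCache` class and a deliberately small block size of 4 tokens (neither model's actual code):

```python
import numpy as np

BLOCK = 4  # tokens per page (tiny, for illustration only)

class PagedKVCache:
    """Toy paged KV cache: fixed-size pages, per-sequence page tables, free list."""
    def __init__(self, num_pages, num_heads, head_dim):
        self.k = np.zeros((num_pages, BLOCK, num_heads, head_dim), dtype=np.float32)
        self.v = np.zeros_like(self.k)
        self.free = list(range(num_pages))  # free-list allocator
        self.page_table = {}                # seq_id -> list of page ids
        self.length = {}                    # seq_id -> tokens stored

    def append(self, seq_id, k_tok, v_tok):
        pages = self.page_table.setdefault(seq_id, [])
        n = self.length.get(seq_id, 0)
        if n % BLOCK == 0:                  # current page full -> allocate a new one
            pages.append(self.free.pop())
        page, slot = pages[-1], n % BLOCK
        self.k[page, slot] = k_tok
        self.v[page, slot] = v_tok
        self.length[seq_id] = n + 1

    def gather(self, seq_id):
        """Materialize contiguous K/V for one sequence (what attention consumes)."""
        n, pages = self.length[seq_id], self.page_table[seq_id]
        k = self.k[pages].reshape(-1, *self.k.shape[2:])[:n]
        v = self.v[pages].reshape(-1, *self.v.shape[2:])[:n]
        return k, v

    def release(self, seq_id):
        self.free.extend(self.page_table.pop(seq_id))
        self.length.pop(seq_id)
```

Variable-length batching falls out naturally: each sequence grows its own page list independently, so no padding is wasted on shorter sequences.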

GLM-5 vs MiniMax-M2.7 Overall: GLM-5 wins 3-0


MiniMax-M2.7 vs Qwen3-6

Task 1: Backward Layer Norm

| Criteria | MiniMax-M2.7 | Qwen3-6 | Edge |
|---|---|---|---|
| Lines of code | 1148 (monolithic) | 294 + 113 + 150 = 557 (3 files) | Qwen |
| Gradient check | PASS | PASS (5× lower rel error) | Qwen |
| Cache minimality | 12 items (bloated) | 4 items (optimal) | Qwen |
| Edge cases | None | 5 distinct edge cases | Qwen |
| Cross-verification | None | Alternative derivation check | Qwen |
| Stability demo | None | Two-pass vs naive variance demo | Qwen |
| GPU fusion | Full CUDA pseudocode | Both forward/backward, memory traffic table | Qwen |
| Benchmark | 4 configs | 8 configs + stability demo | Qwen |
| Memory analysis | Per-operation FLOPs table | N-based FLOPs estimate | Tie |

Winner: Qwen3-6 (decisive — better in every dimension)

Task 2: Fused Softmax+Top-K

| Criteria | MiniMax-M2.7 | Qwen3-6 | Edge |
|---|---|---|---|
| CUDA correctness | Has bugs | Both v1 and v2 compilable | Qwen |
| Algorithm | 2-pass | 2-pass (v1), semi-online (v2) | Tie |
| K support | ≤100 (if/else chain) | ≤256 (template, 5 instantiations) | Qwen |
| Vectorized loads | No | float4 in v2 | Qwen |
| Top-K structure | Register array | Shared heap (O(log K) insert) | Qwen |
| Warp merge | Thread 0 serial | Warp-leader serial + barriers | Qwen |
| Cross-warp merge | Shared mem, thread 0 | Warp-level staging → shared heap | Qwen |
| Documentation quality | Excellent ASCII diagrams | ANALYSIS.md + inline comments | Tie |
| Benchmark harness | None | benchmark.cu | Qwen |
| Multiple versions | No | v1 + v2 optimized | Qwen |

Winner: Qwen3-6 (MiniMax has bugs; Qwen has two correct kernels with optimization)

Task 3: KV-Cache

| Criteria | MiniMax-M2.7 | Qwen3-6 | Edge |
|---|---|---|---|
| File count | 1 | 8 | Qwen |
| Lines of code | 1720 (monolithic) | 205 + 234 + 390 + ... = ~1200 (modular) | Qwen |
| Architecture bugs | Format mismatch in attn/cache stack | None significant | Qwen |
| Tests/Demos | 0 | 10 demos, ALL PASS | Qwen |
| Variable-length batching | Broken (engine logic error) | Working, 4 different lengths | Qwen |
| Paged attention | Working but fragmented | Working with page tables | Tie |
| Quantization | Not implemented | Implemented, notes overhead honestly | Qwen |
| Memory analysis | MemoryAnalyzer class | ModelSpec + find_max_context + 6 real models | Qwen |
| Attention variants | Standard only | Standard + GQA + MQA | Qwen |
| GPU mapping | Basic | Dedicated gpu_mapping.py with Tensor Cores | Qwen |
| Chunked prefill | Mentioned | Full implementation, matches full attn to 4.5e-10 | Qwen |
| Model specs | None | Llama-2-7B/13B/70B, Llama-3-8B, Mistral-7B, GPT-4-class | Qwen |
| Max context calculator | Estimated latency only | Per-GPU max context (RTX 4090→H100) | Qwen |

Winner: Qwen3-6 (decisive — functionally correct where MiniMax has bugs, 10× more thorough)

MiniMax-M2.7 vs Qwen3-6 Overall: Qwen3-6 wins 3-0


Qwen3-6 vs GLM-5

This is the closest matchup. Both are correct and well-engineered.

Task 1: Backward Layer Norm

| Criteria | Qwen3-6 | GLM-5 | Edge |
|---|---|---|---|
| Code size | 557 lines, 3 files | 275 lines, 1 file | GLM (more concise) |
| Gradient precision | 5.04e-11 (dx) | 9.74e-11 (dx) | Qwen (2× better) |
| Cache items | 4 (x_hat, std_inv, gamma, D) | 3 (xhat, rstd, gamma) | GLM (one fewer) |
| Edge cases | 5 tested (zero, large mean, D=1, D=1024, norm sanity) | 0 tested | Qwen |
| Formula cross-verify | Alternative derivation: matches to 1e-10 | Not done | Qwen |
| Stability demo | 2-pass vs naive variance (offset 1e10) | Prose discussion only | Qwen |
| GPU fusion scope | Forward + backward kernels, memory traffic | Backward kernel only, shared mem layout | Qwen |
| Complexity format | Concise formula (N-based) | Prose-based | Tie |
| Derivations | Shown in docstring | Shown in docstring | Tie |
| Speed (full grad check) | Very slow (element-wise, no spot-check) | Very slow (element-wise, no spot-check) | Tie |

Winner: Qwen3-6 (slightly better precision, edge cases, cross-verification, broader GPU fusion scope)

Task 2: Fused Softmax+Top-K

| Criteria | Qwen3-6 | GLM-5 | Edge |
|---|---|---|---|
| Algorithm elegance | 2-pass (practical) | Online single-pass (elegant) | GLM |
| Memory reads | 3 × V (max+sum+softmax) | 1 × V (online pass) | GLM |
| K support | Up to 256 | Up to 32 | Qwen |
| Top-K structure | Shared heap (O(log K)) | Register array (O(K)) | Qwen (for K>32) |
| Vectorization | float4 in v2 | None | Qwen |
| Multiple versions | v1 + v2 | Single version | Qwen |
| Benchmark harness | benchmark.cu | test_fused.cu | Tie |
| Design doc | ANALYSIS.md | DESIGN.md (9 sections) | Tie |
| Numerical stability | Log-sum-exp (2-pass) | Online max tracking | Tie (both correct) |
| I/O efficiency | 3 reads, 1 write (v1) | 1 read, 1 write | GLM |
| Production readiness | Higher (v2, float4, K=256) | Medium (K=32 limit) | Qwen |

This one is genuinely a split decision:

  • For algorithmic elegance and the specific constraint ("do NOT materialize"), GLM-5 wins.
  • For production readiness, vectorization, and K scalability, Qwen3-6 wins.

Winner: Split — GLM-5 on algorithm, Qwen3-6 on production readiness
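The heap-vs-register-array tradeoff behind the "Top-K structure" row is easiest to see host-side. A minimal min-heap top-K (a hypothetical helper, not Qwen3-6's kernel code): insertion is O(log K) per candidate, versus O(K) for shifting a sorted register array, which is why a heap scales to K=256 where a register array is practical only for small K:

```python
import heapq

def topk_heap(values, k):
    """Min-heap of size k: heap[0] is the smallest value currently kept."""
    heap = []  # (value, index) pairs
    for i, v in enumerate(values):
        if len(heap) < k:
            heapq.heappush(heap, (v, i))
        elif v > heap[0][0]:
            heapq.heapreplace(heap, (v, i))  # evict the current minimum, O(log k)
    return sorted(heap, reverse=True)        # descending (value, index) pairs
```

On a GPU the same structure lives in shared memory rather than registers, which is the layout difference the table is pointing at.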

Task 3: KV-Cache

| Criteria | Qwen3-6 | GLM-5 | Edge |
|---|---|---|---|
| Files | 8 modular files | 3 files (core + opt + test) | Qwen |
| Core cache design | Clean, minimal | Clean, minimal | Tie |
| Memory layout | BHSD | BHSD | Tie |
| Abstractions | KVCache + BatchedKVCache | KVCache only | Qwen |
| Attention variants | Standard + GQA + MQA | Standard only | Qwen |
| Tests/Demos | 10 demos (comprehensive) | 8 tests (comprehensive) | Qwen (2 more) |
| Variable-length batching | Working, 4 lengths demo | Working, 3 lengths test | Tie |
| Paged attention | Page tables + free list | Block pool + free list | Tie |
| Quantization | INT8 with honest overhead notes | INT8/INT4 with reliable error measurement | GLM (INT4 support) |
| Chunked prefill | Full impl, verified to 4.5e-10 | Partial impl (uses random Q) | Qwen |
| Memory analysis | 6 real models, max context per GPU | 2 model configs, growth tables | Qwen |
| GPU mapping | Dedicated file, Tensor Cores | README-level discussion | Qwen |
| Model integration | Full transformer with RoPE | IncrementalDecoder (simplified) | Qwen |
| Code quality | Dataclasses, type hints | Clean but simpler | Tie |
| Optimizations | Paged + Quant + Chunked + Hybrid | Paged + Quant + Chunked | Qwen (Hybrid) |

Winner: Qwen3-6 (modular architecture, broader scope including attention variants, GPU mapping, and hybrid optimizations, more demos)
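The attention-variant coverage that separates the two here rests on one broadcasting trick: H_q query heads share H_kv KV heads. A NumPy sketch (a hypothetical `gqa_attention`, not Qwen3-6's code): standard MHA is H_kv = H_q, MQA is H_kv = 1, GQA is anything in between, and the KV-cache shrinks by a factor of H_q / H_kv:

```python
import numpy as np

def gqa_attention(q, k, v):
    """Grouped-query attention over shapes (H, S, D).
    q has H_q heads; k/v have H_kv heads, where H_kv divides H_q."""
    Hq, S, D = q.shape
    Hkv = k.shape[0]
    group = Hq // Hkv
    k = np.repeat(k, group, axis=0)  # broadcast each KV head to its query group
    v = np.repeat(v, group, axis=0)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(D)
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))  # stable softmax
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v
```

Note the numerics are identical to standard attention with the KV heads replicated; the win is purely cache size, which is why GQA/MQA support matters specifically in a KV-cache challenge.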

Qwen3-6 vs GLM-5 Overall: Qwen3-6 wins 2.5-0.5

Qwen3-6 takes the backward-pass and KV-cache tasks clearly. The fused softmax+top-k task is split — GLM-5's online softmax is algorithmically superior, but Qwen3-6's implementation is more production-ready, with float4 vectorization and support for K up to 256.


Summary Matrix

| Matchup | Backwards | Fuse | KV-Cache | Overall |
|---|---|---|---|---|
| GLM-5 vs MiniMax | GLM | GLM | GLM | GLM 3-0 |
| MiniMax vs Qwen3-6 | Qwen | Qwen | Qwen | Qwen 3-0 |
| Qwen3-6 vs GLM-5 | Qwen | Split | Qwen | Qwen 2.5-0.5 |

Final Rankings (from 2-way analysis)

  1. Qwen3-6 — Best breadth, correctness, and production readiness
  2. GLM-5 — Best algorithm design, clean code; limited scope
  3. MiniMax-M2.7 — Ambitious but buggy; over-engineered yet under-delivered

Key Takeaways

  1. Qwen3-6 is the most "engineering-mature" model — it writes modular code with separate test files, handles edge cases, cross-verifies formulas, and thinks about production deployment (GPU limits, real model specs).

  2. GLM-5 is the most "algorithmically clever" model — its online softmax kernel is the only genuinely single-pass implementation, and its backward pass caches the fewest intermediates. It values elegance over exhaustiveness.

  3. MiniMax-M2.7 is the most "verbose but inconsistent" model — it writes the most code but has the most bugs. The ambition is there (multiple memory formats, full transformer implementation) but execution falls short (format mismatches, incorrect CUDA syntax, no tests).

  4. Common failure mode: All three models struggle with efficient numerical gradient checking — they all use Python element-by-element loops instead of batched finite differences, making gradient checks impractical for realistic tensor sizes. MiniMax has the best mitigation (spot-check for >100k elements) but doesn't apply it uniformly.
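MiniMax's mitigation — checking a random subset of elements rather than sweeping all of them — is the right direction. A generic sketch of that idea (a hypothetical `spot_check_grad` for a scalar-valued `f`; none of the three models' harnesses look exactly like this):

```python
import numpy as np

def spot_check_grad(f, x, analytic_dx, n_samples=50, eps=1e-5, seed=0):
    """Central-difference check on a random subset of elements of x,
    instead of an element-by-element sweep over the whole tensor."""
    rng = np.random.default_rng(seed)
    flat = x.ravel()  # view into x: perturbing flat perturbs x
    idx = rng.choice(flat.size, size=min(n_samples, flat.size), replace=False)
    max_rel = 0.0
    for i in idx:
        orig = flat[i]
        flat[i] = orig + eps; fp = f(x)
        flat[i] = orig - eps; fm = f(x)
        flat[i] = orig  # restore
        num = (fp - fm) / (2 * eps)
        ana = analytic_dx.ravel()[i]
        rel = abs(num - ana) / max(abs(num), abs(ana), 1e-12)
        max_rel = max(max_rel, rel)
    return max_rel
```

The cost is `2 * n_samples` forward passes regardless of tensor size, which is what makes gradient checks tractable at realistic shapes.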

  5. KV-cache is the most differentiating task: The complexity of designing a correct, efficient KV-cache system with variable-length batching, paged attention, and quantization reveals the largest quality gap between models. Qwen3-6's 8-file architecture vs MiniMax's monolithic buggy implementation is the clearest illustration.