# 2-Way Head-to-Head: Harder Challenges (Round 2)

These pairwise comparisons cover the two harder challenges: Flash Attention and Beam Search. All implementations were run under opencode (full LSP tooling harness) rather than the minimal pi-mono harness from Round 1.


## GLM-5 vs MiniMax-M2.7

### Flash Attention

| Criteria | GLM-5 | MiniMax-M2.7 | Edge |
| --- | --- | --- | --- |
| Lines of code | ~215 | ~370 | GLM (more concise) |
| Correct rescaling | ✓ exp(m_old - m_new) | ✓ exp(m_old - m_new) | Tie |
| NaN handling | np.isfinite guard | -inf boolean mask | Tie (both correct) |
| Causal skip | ✓ per-tile | ✓ per-tile | Tie |
| Rel error (N=256) | 1.85e-16 (float64) | 1.25e-10 (float64) | GLM (~10⁶× smaller error) |
| Peak memory (N=4096) | 34.2 MB | 32.6 MB | Tie |
| Tests | 2 (specified) | 3 (extra N=512 check) | MiniMax |
| Vectorization | Per-(b,h) loops | Per-(b,h) + per-row Python exp loop | GLM (no per-row loops) |
| Doc quality | Clear, correct derivation | Confused 80-line self-dialogue that initially claims the wrong answer | GLM |
| Extra modes | Causal only | Causal only | Tie |

Winner: GLM-5 — The critical differentiator is MiniMax's docstring, which argues with itself about the rescaling direction, initially claiming the wrong answer before correcting itself. The code is right, but the reasoning is shaky. GLM-5's code is also faster (no per-row Python exp loop inside the tile loop).
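
Both columns implement the same recurrence; a minimal sketch of it for a single (b, h) pair follows. The function name, state layout, and guard placement are illustrative, not lifted from either submission.

```python
import numpy as np

def tile_update(acc, m, l, s_tile, v_tile):
    """One key-tile step of the online-softmax recurrence for one (b, h).

    acc (N, d): running sum of exp(s - m) @ V    m (N,): running row max
    l   (N,):   running sum of exp(s - m)
    s_tile (N, T): scores for this tile          v_tile (T, d): tile values
    """
    m_new = np.maximum(m, s_tile.max(axis=-1))
    finite = np.isfinite(m_new)               # False only for fully masked rows
    safe_m = np.where(finite, m_new, 0.0)     # avoid -inf - (-inf) = nan
    # The trap: old state is rescaled by exp(m_old - m_new), never the reverse.
    scale = np.where(finite, np.exp(m - safe_m), 0.0)
    p = np.exp(s_tile - safe_m[:, None])      # vectorized; no per-row Python loop
    acc = acc * scale[:, None] + p @ v_tile
    l = l * scale + p.sum(axis=-1)
    return acc, m_new, l
# Final output per row is acc / l (fully masked rows need a separate zero-fill).
```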

### Beam Search

| Criteria | GLM-5 | MiniMax-M2.7 | Edge |
| --- | --- | --- | --- |
| Lines of code | ~190 | ~360 | GLM |
| Beam representation | 3-tuple (tokens, lprob, finished) | dict with 7 string keys | GLM |
| EOS retention | ✓ pool = candidates + finished | ✓ merged at final ranking | Tie |
| Mocking precision | Exact logprobs via _make_logits | Pre-softmax logits → scores vary | GLM |
| EOS test score | -3.0 (exact) | 0.0 (transformed by softmax) | GLM |
| Length penalty test | alpha=0.0 only | alpha=0.6 basic | MiniMax |
| Batch independence | Solo comparison | Token-overlap + solo | Tie |
| Tests | 3 | 3 | Tie |
| Code clarity | Clean, easy to follow | Verbose, hard-to-trace control flow | GLM |
| Model integration | MockModel class + MinimalTransformer | Full transformer inlined in test | GLM |

Winner: GLM-5 — Both are correct, but GLM-5's implementation is dramatically cleaner. MiniMax uses string-keyed dicts for beams and has complex control flow between active_beams, finished_results, and all_candidates. GLM-5's mocking precision is also better (exact logprob control vs logit-based).
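
For concreteness, a minimal sketch of the 3-tuple representation and the pool = candidates + finished retention step described above; `beam_step` and its argument names are illustrative, not GLM-5's actual code.

```python
import heapq

def beam_step(beams, log_prob_rows, k, eos_id):
    """One beam-search expansion step with EOS retention.

    Each beam is a 3-tuple (tokens, lprob, finished). log_prob_rows[i]
    is the log-prob row for the i-th *active* beam.
    """
    # Finished beams stay in the candidate pool: this is the EOS-retention
    # trap. Dropping them here silently loses high-scoring short hypotheses.
    pool = [b for b in beams if b[2]]
    active = [b for b in beams if not b[2]]
    for (tokens, lprob, _), row in zip(active, log_prob_rows):
        for tok, lp in enumerate(row):   # a real impl would top-k prune here
            pool.append((tokens + [tok], lprob + lp, tok == eos_id))
    # Finished beams can only be displaced by better-scoring candidates.
    return heapq.nlargest(k, pool, key=lambda b: b[1])
```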

GLM-5 vs MiniMax-M2.7 Overall: GLM-5 wins 2-0


## MiniMax-M2.7 vs Qwen3-6

### Flash Attention

| Criteria | MiniMax-M2.7 | Qwen3-6 | Edge |
| --- | --- | --- | --- |
| Correct rescaling | ✓ | ✓ | Tie |
| NaN handling | -inf boolean mask | row_valid = row_max > -inf | Qwen (cleaner) |
| Tests | 3 (causal ×2, N=512 extra) | 5 (causal, non-causal, multi-head, uneven tiles, memory) | Qwen |
| Non-causal support | Not tested | ✓ tested | Qwen |
| Uneven tiles | Not tested | ✓ N=300, T=97 | Qwen |
| Rel error | 1.25e-10 (float64) | 5.93e-08 (float32) | MiniMax (dtype artifact) |
| Peak memory | 32.6 MB (float64) | 27.2 MB (float32) | Tie (difference reflects dtype width) |
| Vectorization | Per-(b,h) + per-row Python exp loop | Batched einsum, no per-row loops | Qwen |
| Doc quality | Confused self-dialogue | Clear, concise | Qwen |
| Multi-head | H=8 in large test only | H=8 with separate correctness check | Qwen |

Winner: Qwen3-6 — Decisive. More tests, cleaner code, better vectorization, correct doc. MiniMax's confused docstring is a major red flag even though the code works.
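
What "more tests" buys is concrete: a dense-attention oracle plus edge-case configurations. Below is a sketch of the uneven-tile check in the style of Qwen3-6's suite; `flash_attention` is a hypothetical entry point, not Qwen3-6's actual signature.

```python
import numpy as np
from flash_attention import flash_attention  # hypothetical module under test

def reference_attention(q, k, v, causal=False):
    """Dense softmax attention used as the test oracle."""
    s = q @ k.transpose(0, 1, 3, 2) / np.sqrt(q.shape[-1])
    if causal:
        n = s.shape[-1]
        s = np.where(np.tril(np.ones((n, n), dtype=bool)), s, -np.inf)
    p = np.exp(s - s.max(axis=-1, keepdims=True))
    return (p / p.sum(axis=-1, keepdims=True)) @ v

def test_uneven_tiles():
    # N=300 with T=97 leaves a ragged final tile, the case GLM-5 never exercised.
    rng = np.random.default_rng(0)
    q, k, v = (rng.standard_normal((2, 8, 300, 64)) for _ in range(3))
    out = flash_attention(q, k, v, tile_size=97, causal=False)
    np.testing.assert_allclose(out, reference_attention(q, k, v), atol=1e-4)
```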

### Beam Search

| Criteria | MiniMax-M2.7 | Qwen3-6 | Edge |
| --- | --- | --- | --- |
| Files | 1 (monolithic) | 3 (model.py + beam_search.py + test) | Qwen |
| Beam representation | dict with 7 string keys | Beam class with slots | Qwen |
| EOS retention | ✓ (indirect, via final merge) | ✓ (direct, finished_beams list in pool) | Qwen |
| Mocking precision | Logit-based (can't control exact scores) | Exact logprobs via MockModel.set_log_probs | Qwen |
| Tests | 3 | 4 (includes length penalty + two EOS beams) | Qwen |
| Length penalty interaction | Only basic alpha=0.6 | Test 3b: alpha=0.6, two EOS beams at different lengths, verifies the longer beam wins correctly | Qwen |
| Batch independence verification | Token-overlap check (weak) | Solo comparison + exact score match | Qwen |
| Code clarity | Hard-to-follow control flow | Clean, well-separated concerns | Qwen |

Winner: Qwen3-6 — Decisive on every dimension. The length penalty interaction test (3b) is the strongest differentiator — only Qwen3-6 tested that two EOS beams at different lengths interact correctly with the penalty.
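
The interaction Test 3b targets is easy to state numerically. A sketch, assuming a GNMT-style normalization (the challenge's exact penalty formula is not shown in this document, and the numbers are illustrative): at alpha=0 the shorter, higher raw log-prob beam ranks first; at alpha=0.6 the longer beam overtakes it.

```python
def length_penalized(lprob, length, alpha):
    # GNMT-style penalty; assumed form, the challenge spec may differ.
    return lprob / (((5 + length) / 6) ** alpha)

# Two finished beams: the short one has the higher raw log-prob, the long
# one the higher per-token confidence. alpha=0.6 should flip the ranking.
beams = {"short": (-3.0, 4), "long": (-3.6, 8)}
for alpha in (0.0, 0.6):
    ranked = sorted(beams, key=lambda b: length_penalized(*beams[b], alpha),
                    reverse=True)
    print(alpha, ranked)  # 0.0 -> ['short', 'long'];  0.6 -> ['long', 'short']
```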

MiniMax-M2.7 vs Qwen3-6 Overall: Qwen3-6 wins 2-0


## Qwen3-6 vs GLM-5

### Flash Attention

| Criteria | Qwen3-6 | GLM-5 | Edge |
| --- | --- | --- | --- |
| Correct rescaling | ✓ | ✓ | Tie |
| NaN handling | row_valid = row_max > -inf (cleaner) | np.isfinite(m_new) (also correct) | Tie |
| Tests | 5 | 2 | Qwen |
| Non-causal | ✓ tested | Not tested | Qwen |
| Uneven tiles | ✓ N=300, T=97 | Not tested | Qwen |
| Multi-head B,H>1 | ✓ separate test | H=8 in large test only | Qwen |
| Vectorization | Batched einsum (fast) | Per-(b,h) loops (correct but slower) | Qwen |
| Precision | 5.93e-08 (float32) | 1.85e-16 (float64) | GLM (dtype choice) |
| Peak memory | 27.2 MB (float32) | 34.2 MB (float64) | Tie |
| Code clarity | Clear, concise | Clear, concise | Tie |
| Doc quality | Concise and correct | Concise and correct, with derivation | GLM (slightly better derivation) |

Winner: Qwen3-6 — GLM-5 has slightly better precision by using float64 (but both are well within 1e-4). Qwen3-6 wins on breadth: 5 tests covering modes GLM-5 didn't attempt (non-causal, uneven tiles), and batched einsum is more efficient. This is a close call — GLM-5's core algorithm is equally correct.
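
The vectorization difference in the table comes down to where the batch and head dimensions live. A sketch of both shapes for computing one tile's scores; neither function is lifted from the actual submissions.

```python
import numpy as np

def scores_looped(q_tile, k_tile):
    """Per-(b, h) Python loops, GLM-5's shape per the table: correct but slower."""
    B, H, Nq, d = q_tile.shape
    s = np.empty((B, H, Nq, k_tile.shape[2]))
    for b in range(B):
        for h in range(H):
            s[b, h] = q_tile[b, h] @ k_tile[b, h].T / np.sqrt(d)
    return s

def scores_einsum(q_tile, k_tile):
    """One batched einsum covers every (b, h) pair, Qwen3-6's shape."""
    d = q_tile.shape[-1]
    return np.einsum("bhqd,bhkd->bhqk", q_tile, k_tile) / np.sqrt(d)
```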

### Beam Search

| Criteria | Qwen3-6 | GLM-5 | Edge |
| --- | --- | --- | --- |
| Files | 3 (model, beam_search, test) | 1 (monolithic) | Qwen |
| Beam representation | Beam class with slots | 3-tuple (tokens, lprob, finished) | Qwen (more readable) |
| EOS retention | ✓ finished_beams list | ✓ pool = candidates + finished | Tie |
| Mocking precision | Exact logprobs via MockModel | Exact logprobs via _make_logits | Tie |
| Tests | 4 | 3 | Qwen |
| Length penalty + two EOS beams | ✓ Test 3b (alpha=0.6) | Not tested | Qwen |
| Greedy equivalence | ✓ K=1, alpha=0 | ✓ K=1, alpha=0 | Tie |
| Batch independence | ✓ Solo comparison + score match | ✓ Solo comparison | Tie |
| Code organization | Model separated from algorithm | Model inlined (MinimalTransformer) | Qwen |
| Score verification precision | Exact (-3.0, -6.0) | Exact (-3.0) | Qwen |

Winner: Qwen3-6 — This is close. Both correctly retain EOS beams and use exact logprob control for testing. Qwen3-6 edges ahead with: (a) the length penalty + two EOS beams test (Test 3b), which verifies that the penalty correctly flips the ranking when a longer sequence has higher confidence; (b) better code organization with separate model file; (c) Beam class vs raw tuple for readability.
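
The "exact logprobs" mocking both tables credit is worth making concrete. A sketch of the pattern: `set_log_probs` is named in the table above, but the body and call signature here are assumptions.

```python
import numpy as np

class MockModel:
    """Test double that returns pre-scripted log-prob rows per decode step.

    Because the rows are already log-normalized, expected beam scores such
    as -3.0 or -6.0 can be asserted exactly, rather than approximately as
    with pre-softmax logits pushed through a real softmax.
    """
    def __init__(self):
        self._steps = []

    def set_log_probs(self, rows):
        # rows: one (vocab_size,) array of log-probs per decode step
        self._steps = [np.asarray(r, dtype=np.float64) for r in rows]

    def __call__(self, tokens, step):
        # Ignore the token history: the output is fully scripted.
        return self._steps[step]
```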

Qwen3-6 vs GLM-5 Overall: Qwen3-6 wins 2-0


## Summary Matrix (Round 2)

| Matchup | Flash Attention | Beam Search | Overall |
| --- | --- | --- | --- |
| GLM-5 vs MiniMax | GLM | GLM | GLM 2-0 |
| MiniMax vs Qwen3-6 | Qwen | Qwen | Qwen 2-0 |
| Qwen3-6 vs GLM-5 | Qwen | Qwen | Qwen 2-0 |

## Combined Rankings (Both Rounds)

| Matchup | Round 1 | Round 2 | Trend |
| --- | --- | --- | --- |
| GLM-5 vs MiniMax | GLM 3-0 | GLM 2-0 | GLM dominance holds |
| MiniMax vs Qwen3-6 | Qwen 3-0 | Qwen 2-0 | Qwen dominance holds |
| Qwen3-6 vs GLM-5 | Qwen 2.5-0.5 | Qwen 2-0 | Qwen gap WIDENS |

## Overall Model Rankings

| Rank | Model | Round 1 Grade | Round 2 Grade | Notes |
| --- | --- | --- | --- | --- |
| 1 | Qwen3-6 | A- | A | Gap over #2 widened. Only model that maintained a multi-file architecture in opencode |
| 2 | GLM-5 | B+ | B+ | Consistent. Strong algorithms, clean code, limited test coverage vs Qwen |
| 3 | MiniMax-M2.7 | B | B- | Regressed slightly. Confused docstring in flash attn, architectural mess in beam search |

## Key Observations from Round 2

  1. The EOS retention trap didn't catch anyone. All three models correctly kept finished beams in the pool. This suggests either the bug is well-represented in training data (many blog posts about "the beam search EOS bug") or the prompt's explicit warning ("Do NOT remove finished beams from the pool") was too heavy a hint.

  2. The rescaling direction trap also didn't catch anyone in code, but MiniMax's docstring reveals shaky understanding. The model wrote correct code but couldn't explain why — classic pattern-matching behavior. If you modify the recurrence slightly (e.g., use a different normalization), MiniMax would likely produce buggy code because it's reciting rather than reasoning. (See the one-line derivation after this list.)

  3. The strongest differentiating test was length penalty interaction (Qwen3-6 Test 3b). Neither GLM-5 nor MiniMax tested that two EOS beams at different lengths interact correctly with the penalty. This is a subtle bug that would pass basic "does EOS stay?" tests.

  4. opencode vs pi-mono impact: The richer harness seemed to encourage verbosity (MiniMax's 80-line self-dialogue) and slightly less modular code (GLM-5 went from multi-file to single-file). Qwen3-6 was unaffected, maintaining its 3-file architecture.

  5. Qwen3-6's local 27B advantage over frontier models is real. The gap widened in round 2, suggesting the model's strength is genuine reasoning/engineering discipline rather than luck on the specific tasks. The pattern of doing MORE than asked (extra tests, extra modes, extra files) while maintaining correctness is consistent.
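
As a footnote to observation 2 above, the derivation MiniMax's docstring fumbled is a single line: the accumulator stores sums relative to the current running max, so a max update re-bases every stored term.

```latex
% Why the old accumulator is rescaled by exp(m_old - m_new):
\sum_j e^{s_j - m_{\mathrm{new}}} v_j
  = \sum_j e^{s_j - m_{\mathrm{old}}}\, e^{m_{\mathrm{old}} - m_{\mathrm{new}}}\, v_j
  = e^{m_{\mathrm{old}} - m_{\mathrm{new}}}
    \underbrace{\sum_j e^{s_j - m_{\mathrm{old}}} v_j}_{\text{old } acc}
```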