# 2-Way Head-to-Head: Harder Challenges (Round 2)

These pairwise comparisons cover the two harder challenges: Flash Attention and Beam Search. All implementations were run under opencode (full LSP tooling harness) rather than the minimal pi-mono harness from Round 1.
## GLM-5 vs MiniMax-M2.7

### Flash Attention
| Criteria | GLM-5 | MiniMax-M2.7 | Edge |
|---|---|---|---|
| Lines of code | ~215 | ~370 | GLM (more concise) |
| Correct rescaling | ✓ exp(m_old - m_new) | ✓ exp(m_old - m_new) | Tie |
| NaN handling | np.isfinite guard | -inf boolean mask | Tie (both correct) |
| Causal skip | ✓ per-tile | ✓ per-tile | Tie |
| Rel error (N=256) | 1.85e-16 (float64) | 1.25e-10 (float64) | GLM (6 orders of magnitude better) |
| Peak memory (N=4096) | 34.2 MB | 32.6 MB | Tie |
| Tests | 2 (specified) | 3 (extra N=512 check) | MiniMax |
| Vectorization | Per-(b,h) loops | Per-(b,h) + per-row Python for-loop for exp | GLM (no per-row loops) |
| Doc quality | Clear, correct derivation | Confused 80-line self-dialogue that initially claims wrong answer | GLM |
| Extra modes | Causal only | Causal only | Tie |
**Winner: GLM-5.** The critical differentiator is MiniMax's docstring, which argues with itself about the rescaling direction, initially claiming the wrong answer before correcting. The code is right, but the reasoning is shaky. GLM-5's code is also faster (no per-row Python exp loop inside the tile loop).
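Both models implement the same recurrence. As a reference, here is a minimal NumPy sketch (function and variable names are hypothetical, not taken from either submission) of the online-softmax tile update, showing why the correction factor must be exp(m_old - m_new):

```python
import numpy as np

def online_softmax_update(m_old, l_old, acc, scores, v_tile):
    """One tile update of the flash-attention recurrence.

    m_old: (rows,) running max; l_old: (rows,) running sum of exp;
    acc: (rows, d) running weighted sum; scores: (rows, tile) raw QK^T scores;
    v_tile: (tile, d) value rows for this tile.
    """
    m_new = np.maximum(m_old, scores.max(axis=-1))
    # Rescale the previous state by exp(m_old - m_new), never exp(m_new - m_old):
    # when a larger max is found, old terms must shrink, so the exponent is <= 0.
    scale = np.exp(m_old - m_new)
    p = np.exp(scores - m_new[:, None])
    l_new = l_old * scale + p.sum(axis=-1)
    acc_new = acc * scale[:, None] + p @ v_tile
    return m_new, l_new, acc_new
```

Initializing m to -inf makes the first tile's scale exp(-inf) = 0, so the recurrence needs no special first-step case; the final output is acc / l per row.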
### Beam Search
| Criteria | GLM-5 | MiniMax-M2.7 | Edge |
|---|---|---|---|
| Lines of code | ~190 | ~360 | GLM |
| Beam representation | 3-tuple (tokens, lprob, finished) | dict with 7 string keys | GLM |
| EOS retention | ✓ pool = candidates + finished | ✓ merged at final ranking | Tie |
| Mocking precision | Exact logprobs via _make_logits | Pre-softmax logits → scores vary | GLM |
| EOS test score | -3.0 (exact) | 0.0 (transformed by softmax) | GLM |
| Length penalty test | alpha=0.0 only | alpha=0.6 basic | MiniMax |
| Batch independence | Solo comparison | Token-overlap + solo | Tie |
| Tests | 3 | 3 | Tie |
| Code clarity | Clean, easy to follow | Verbose, hard to trace control flow | GLM |
| Model integration | MockModel class + MinimalTransformer | Full transformer inlined in test | GLM |
**Winner: GLM-5.** Both are correct, but GLM-5's implementation is dramatically cleaner. MiniMax uses dictionaries with string keys for beams and has tangled control flow between active_beams, finished_results, and all_candidates. GLM-5's mocking precision is also better (exact logprob control vs logit-based).
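For reference, the EOS-retention pattern both implementations got right can be sketched in a few lines, assuming GLM-5's 3-tuple beam representation (all names here are hypothetical):

```python
def beam_step(beams, topk_fn, k, eos_id):
    """One beam-search step.

    beams: list of (tokens, logprob, finished) tuples.
    topk_fn(tokens) -> list of (token_id, token_logprob) continuations.
    """
    candidates = []
    for tokens, lp, finished in beams:
        if finished:
            # The trap: do NOT drop finished beams. They stay in the pool
            # and keep competing on score against active extensions.
            candidates.append((tokens, lp, True))
            continue
        for tok, tok_lp in topk_fn(tokens):
            candidates.append((tokens + [tok], lp + tok_lp, tok == eos_id))
    # Rank the merged pool (finished + extended) and keep the top k.
    candidates.sort(key=lambda b: b[1], reverse=True)
    return candidates[:k]
```

A finished beam with logprob -0.05 correctly outranks an active extension at -0.1, which is exactly the behavior the "does EOS stay?" tests check.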
**GLM-5 vs MiniMax-M2.7 overall: GLM-5 wins 2-0.**

## MiniMax-M2.7 vs Qwen3-6

### Flash Attention
| Criteria | MiniMax-M2.7 | Qwen3-6 | Edge |
|---|---|---|---|
| Correct rescaling | ✓ | ✓ | Tie |
| NaN handling | -inf boolean mask | row_valid = row_max > -inf | Qwen (cleaner) |
| Tests | 3 (causal ×2, N=512 extra) | 5 (causal, non-causal, multi-head, uneven tiles, memory) | Qwen |
| Non-causal support | Not tested | ✓ tested | Qwen |
| Uneven tiles | Not tested | ✓ N=300, T=97 | Qwen |
| Rel error | 1.25e-10 (float64) | 5.93e-08 (float32) | MiniMax (dtype artifact) |
| Peak memory | 32.6 MB (float64) | 27.2 MB (float32) | Tie (both proportional to dtype) |
| Vectorization | Per-(b,h) + per-row Python exp loop | Batched einsum, no per-row loops | Qwen |
| Doc quality | Confused self-dialogue | Clear, concise | Qwen |
| Multi-head | H=8 in large test only | H=8 with separate correctness check | Qwen |
**Winner: Qwen3-6.** Decisive: more tests, cleaner code, better vectorization, and a correct docstring. MiniMax's confused docstring is a major red flag even though the code works.
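The NaN hazard both guards address arises when every score in a row's tile is masked to -inf: the running max is then -inf, and exp(-inf - (-inf)) is NaN. A minimal sketch of the Qwen-style row_valid guard (names hypothetical):

```python
import numpy as np

def safe_tile_probs(scores, m_new):
    """Exponentiate tile scores, skipping rows whose max is still -inf.

    Without the guard, scores - m_new computes -inf - (-inf) = NaN for
    fully masked rows (e.g. early rows under a causal mask).
    """
    row_valid = m_new > -np.inf
    p = np.zeros_like(scores)
    p[row_valid] = np.exp(scores[row_valid] - m_new[row_valid][:, None])
    return p
```

Fully masked rows contribute zero probability mass, which is the correct behavior: their output is deferred until a tile with unmasked entries arrives.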
### Beam Search
| Criteria | MiniMax-M2.7 | Qwen3-6 | Edge |
|---|---|---|---|
| Files | 1 (monolithic) | 3 (model.py + beam_search.py + test) | Qwen |
| Beam representation | dict with 7 string keys | Beam class with slots | Qwen |
| EOS retention | ✓ (indirect, via final merge) | ✓ (direct, finished_beams list in pool) | Qwen |
| Mocking precision | Logit-based (can't control exact scores) | Exact logprobs via MockModel.set_log_probs | Qwen |
| Tests | 3 | 4 (includes length penalty + two EOS beams) | Qwen |
| Length penalty interaction | Only basic alpha=0.6 | Test 3b: alpha=0.6, two EOS beams at different lengths, verifies longer beam wins correctly | Qwen |
| Batch independence verification | Token-overlap check (weak) | Solo comparison + exact score match | Qwen |
| Code clarity | Hard to follow control flow | Clean, well-separated concerns | Qwen |
**Winner: Qwen3-6.** Decisive on every dimension. The length-penalty interaction test (3b) is the strongest differentiator: only Qwen3-6 verified that two EOS beams at different lengths interact correctly with the penalty.
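To illustrate why a test like 3b is discriminating, here is a sketch assuming a GNMT-style penalty lp(n) = ((5 + n) / 6)^alpha (which may differ from the exact formula the submissions used) of two finished beams whose ranking flips after normalization:

```python
def penalized(logprob, length, alpha):
    # GNMT-style length normalization: score = logprob / lp(length).
    return logprob / (((5 + length) / 6) ** alpha)

# Shorter beam has the better raw logprob; longer beam has the better
# per-token score. With alpha=0.6 the longer beam should win.
short = penalized(-3.0, 4, alpha=0.6)
long_ = penalized(-3.5, 10, alpha=0.6)
```

An implementation that ranks finished beams by raw logprob (or applies the penalty to only one of the two) passes a basic "does EOS stay?" test but fails this one.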
**MiniMax-M2.7 vs Qwen3-6 overall: Qwen3-6 wins 2-0.**

## Qwen3-6 vs GLM-5

### Flash Attention
| Criteria | Qwen3-6 | GLM-5 | Edge |
|---|---|---|---|
| Correct rescaling | ✓ | ✓ | Tie |
| NaN handling | row_valid = row_max > -inf (cleaner) | np.isfinite(m_new) (also correct) | Tie |
| Tests | 5 tests | 2 tests | Qwen |
| Non-causal | ✓ tested | Not tested | Qwen |
| Uneven tiles | ✓ N=300, T=97 | Not tested | Qwen |
| Multi-head B,H>1 | ✓ separate test | H=8 in large test only | Qwen |
| Vectorization | Batched einsum (fast) | Per-(b,h) loops (correct but slower) | Qwen |
| Precision | 5.93e-08 (float32) | 1.85e-16 (float64) | GLM (dtype choice) |
| Peak memory | 27.2 MB (float32) | 34.2 MB (float64) | Tie |
| Code clarity | Clear, concise | Clear, concise | Tie |
| Doc quality | Concise and correct | Concise and correct with derivation | GLM (slightly better derivation) |
**Winner: Qwen3-6.** GLM-5 has slightly better precision by using float64 (both are well within 1e-4), but Qwen3-6 wins on breadth: 5 tests covering modes GLM-5 didn't attempt (non-causal, uneven tiles), and batched einsum is more efficient. This is a close call; GLM-5's core algorithm is equally correct.
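The vectorization difference can be made concrete with a small sketch (shapes and names assumed, not taken from either submission): a single einsum computes tile scores for every batch and head at once, replacing nested per-(b,h) Python loops:

```python
import numpy as np

def tile_scores(q_tile, k_tile, scale):
    """Scaled QK^T for one (query-tile, key-tile) pair, all heads at once.

    q_tile: (B, H, Tq, D), k_tile: (B, H, Tk, D) -> (B, H, Tq, Tk).
    The einsum contracts D and batches over B and H, so the Python-level
    loop structure is only over tiles, not over (b, h) pairs.
    """
    return np.einsum('bhqd,bhkd->bhqk', q_tile, k_tile) * scale
```

The per-(b,h) loop version computes the same values; the einsum form simply moves the batching into one C-level contraction, which is where GLM-5 (per-(b,h) loops) and MiniMax (per-row exp loops) left performance on the table.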
### Beam Search
| Criteria | Qwen3-6 | GLM-5 | Edge |
|---|---|---|---|
| Files | 3 (model, beam_search, test) | 1 (monolithic) | Qwen |
| Beam representation | Beam class with slots | 3-tuple (tokens, lprob, finished) | Qwen (more readable) |
| EOS retention | ✓ finished_beams list | ✓ pool = candidates + finished | Tie |
| Mocking precision | Exact logprobs via MockModel | Exact logprobs via _make_logits | Tie |
| Tests | 4 tests | 3 tests | Qwen |
| Length penalty + two EOS beams | ✓ Test 3b (alpha=0.6) | Not tested | Qwen |
| Greedy equivalence | ✓ K=1, alpha=0 | ✓ K=1, alpha=0 | Tie |
| Batch independence | ✓ Solo comparison + score match | ✓ Solo comparison | Tie |
| Code organization | Model separated from algorithm | Model inlined (MinimalTransformer) | Qwen |
| Score verification precision | Exact (-3.0, -6.0) | Exact (-3.0) | Qwen |
**Winner: Qwen3-6.** This is close. Both correctly retain EOS beams and use exact logprob control for testing. Qwen3-6 edges ahead with (a) the length penalty + two EOS beams test (Test 3b), which verifies that the penalty correctly flips the ranking when a longer sequence has higher confidence; (b) better code organization with a separate model file; and (c) a Beam class instead of a raw tuple for readability.
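For illustration, the two beam representations compared above can be sketched side by side (field names hypothetical):

```python
class Beam:
    """Qwen-style beam state: named fields, __slots__ keeps it lightweight."""
    __slots__ = ('tokens', 'logprob', 'finished')

    def __init__(self, tokens, logprob, finished=False):
        self.tokens = tokens        # generated token ids so far
        self.logprob = logprob      # cumulative log-probability
        self.finished = finished    # True once EOS has been emitted

# GLM-style equivalent: b = (tokens, logprob, finished),
# accessed positionally as b[0], b[1], b[2] at every call site.
```

The tuple is terser; the class makes call sites self-documenting (beam.finished vs b[2]) at essentially no memory cost, which is the readability edge noted in the table.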
**Qwen3-6 vs GLM-5 overall: Qwen3-6 wins 2-0.**

## Summary Matrix (Round 2)
| Matchup | Flash Attention | Beam Search | Overall |
|---|---|---|---|
| GLM-5 vs MiniMax | GLM | GLM | GLM 2-0 |
| MiniMax vs Qwen3-6 | Qwen | Qwen | Qwen 2-0 |
| Qwen3-6 vs GLM-5 | Qwen | Qwen | Qwen 2-0 |
## Combined Rankings (Both Rounds)
| Matchup | Round 1 | Round 2 | Trend |
|---|---|---|---|
| GLM-5 vs MiniMax | GLM 3-0 | GLM 2-0 | GLM dominance holds |
| MiniMax vs Qwen3-6 | Qwen 3-0 | Qwen 2-0 | Qwen dominance holds |
| Qwen3-6 vs GLM-5 | Qwen 2.5-0.5 | Qwen 2-0 | Qwen gap WIDENS |
## Overall Model Rankings
| Rank | Model | Round 1 Grade | Round 2 Grade | Notes |
|---|---|---|---|---|
| 1 | Qwen3-6 | A- | A | Gap over #2 widened. Only model that maintained multi-file architecture in opencode |
| 2 | GLM-5 | B+ | B+ | Consistent. Strong algorithms, clean code, limited test coverage vs Qwen |
| 3 | MiniMax-M2.7 | B | B- | Regressed slightly. Confused docstring in flash attn, architectural mess in beam search |
## Key Observations from Round 2
- **The EOS retention trap didn't catch anyone.** All three models correctly kept finished beams in the pool. This suggests either that the bug is well represented in training data (many blog posts about "the beam search EOS bug") or that the prompt's explicit warning ("Do NOT remove finished beams from the pool") was too heavy a hint.
- **The rescaling direction trap also didn't catch anyone in code,** but MiniMax's docstring reveals shaky understanding. The model wrote correct code but couldn't explain why: classic pattern-matching behavior. If the recurrence were modified slightly (e.g., with a different normalization), MiniMax would likely produce buggy code, because it is reciting rather than reasoning.
- **The strongest differentiating test was length penalty interaction** (Qwen3-6 Test 3b). Neither GLM-5 nor MiniMax tested that two EOS beams at different lengths interact correctly with the penalty. This is a subtle bug that would pass basic "does EOS stay?" tests.
- **opencode vs pi-mono impact:** the richer harness seemed to encourage verbosity (MiniMax's 80-line self-dialogue) and slightly less modular code (GLM-5 went from multi-file to single-file). Qwen3-6 was unaffected, maintaining its 3-file architecture.
- **Qwen3-6's local 27B advantage over frontier models is real.** The gap widened in Round 2, suggesting the model's strength is genuine reasoning/engineering discipline rather than luck on the specific tasks. The pattern of doing more than asked (extra tests, extra modes, extra files) while maintaining correctness is consistent.