- Add Claude Opus 4.7, Kimi K2.6, GLM-5.1 to existing GLM-5, Qwen3-6, MiniMax-M2.7 - Add 5 new challenges: flash attention fwd/bwd, beam search, DFlash, ternary training - Rewrite README with TL;DR rankings, grade matrix, and DeepSeek V4 Pro attribution - Add analysis/ folder with cross-model comparisons and per-challenge deep dives - Add deploy_challenges.sh script - Expand .gitignore to exclude Python envs, ML weights, and build artifacts
25 KiB
3-Way Head-to-Head: Harder Challenges
Executive Summary
| Dimension | GLM-5 | MiniMax-M2.7 | Qwen3-6 |
|---|---|---|---|
| Flash Attention Grade | A- | B | A |
| Beam Search Grade | B+ | B- | A |
| Overall Grade | B+ | B- | A |
| All tests pass? | ✓ | ✓ | ✓ |
| Silent bugs? | 0 | 1 (cosmetic) | 0 |
Context note: These runs used opencode (full LSP/tooling harness) instead of the minimal pi-mono harness used in the first round. The system prompt is substantially larger and the environment has more tooling. This may affect verbosity and architectural choices but shouldn't impact algorithmic correctness.
Challenge 1: Tiled Flash Attention (Online Softmax)
The Hidden Trap — Rescaling Direction
The flash attention online softmax recurrence requires:
correction = exp(m_old - m_new) # ≤ 1, correct
The most common bug in open-source implementations is writing this as
exp(m_new - m_old) instead. Both produce identical final output (because
O/l is invariant to the correction factor magnitude), but with the wrong
direction, intermediate O and l values grow without bound, causing
overflow/underflow in the backward pass or with fp16. The bug is invisible
in forward-only correctness tests.
All three models got this right in their code.
GLM-5 (glm5/flash_attention/flash_attention.py, ~215 lines)
Grade: A-
Strengths:
- Correct online softmax with
correction = np.exp(m_tile - m_new) - Clean per-(b,h) loop structure that mirrors how GPU kernels are organized
- Handles the fully-masked first tile NaN hazard with
np.isfiniteguards:safe_mask = np.isfinite(m_new) P_tile = np.where(safe_mask[:, None], P_tile, 0.0) - Also handles l==0 at normalization time (masked rows get output 0)
- Causal skip optimization:
if causal and k_start > q_end - 1: continue - Uses float64 → relative error 1.85e-16 (near machine epsilon)
- Peak memory 34.24 MB vs 134.22 MB for naive (N=4096, D=64) — well under
- Comments clearly explain WHY correction is
exp(m_old - m_new)and NOTexp(m_new - m_old)with a concrete derivation - Two tests delivered as specified
Weaknesses:
- Per-(b,h) Python loops make it slow for large B,H (the matmul within tiles is vectorized, but the outer loops are serial). In a GPU kernel this structure is correct, but for NumPy testing, a fully batched
einsumwould be faster. - Only tests causal=True. Doesn't test non-causal mode
- No test for uneven tile sizes (N not divisible by tile_size)
- No multi-head test (H>1) separately — the large test does H=8 but doesn't verify correctness on each head
MiniMax-M2.7 (minimax-m2.7/flash_attention/flash_attention.py, ~370 lines)
Grade: B
Strengths:
- Code produces correct results (rel error 1.25e-10, passes all tests)
- Correctly implements online softmax with
exp(m[valid_corr_mask] - m_new_flat[valid_corr_mask]) - Explicit handling of
m_old == -infedge case with boolean masks:m_old_is_neg_inf = m == -np.inf need_correction = ~(m_old_is_neg_inf & m_new_is_neg_inf) - Causal skip condition is correct:
if kv_tile_start >= q_tile_end: continue - Peak memory 32.59 MB vs 128 MB naive — good
- Third test on N=512 for additional correctness verification
- Handles
l==0at normalization (if all KV tiles were masked)
Weaknesses:
- The explanation of the rescaling factor IS WRONG in the docstring. The docstring contains a ~80-line monologue where the model argues with itself about the correction direction, initially claiming it should be
exp(m_new - m_old)and "This is WRONG!" before eventually talking itself into the right answer. But in one line it says:Then later contradicts and gets it right. The confusion in the explanation is a red flag — it suggests the model is parroting a learned pattern rather than reasoning from first principles. The code happens to be correct, but the explanation reveals uncertainty.O_new = O_old * exp(m_new - m_old) # This is WRONG! Unless... - Per-(b,h) loops with per-row Python for-loop for exp computation:
This Python for-loop over query positions within each tile is extremely slow for realistic tile sizes. It should be a vectorized operation.
for i in range(S.shape[0]): if not np.isinf(m_new_flat[i]): exp_S_minus_m_new[i] = np.exp(S[i] - m_new_flat[i]) - The docstring is 3× longer than the actual code — mostly confused self-dialogue about the rescaling
- Uses
np.matmulbut then loops per-row for exp — inconsistent vectorization strategy - No tracemalloc in the actual large test (just printed analysis, which is fine but less rigorous)
Qwen3-6 (qwen36/flash_attention/flash_attention.py, ~310 lines)
Grade: A
Strengths:
- Most efficient implementation: uses
np.einsumfor batched score computation, avoiding per-(b,h) loops entirely - Precomputes Q_tiles, K_tiles, V_tiles as lists — clean separation of tiling from computation
- Proper
row_validmasking to handle the -inf - (-inf) = NaN problem:row_valid = row_max > -np.inf correction = np.exp(np.where(row_valid, m - m_new, 0.0)) P = np.where(row_valid[:, :, :, np.newaxis], P, 0.0) - Handles
l==0gracefully at normalization withl_safe = np.where(l > 0, l, 1.0) - 5 tests: accuracy (causal), non-causal, larger batch (B=2,H=8), uneven tiles (N=300, tile_size=97), memory
- Peak memory 27.2 MB (float32 → smaller absolute numbers, still well under 64 MB naive)
- Relative error 5.93e-08 — slightly higher than GLM/MiniMax but only because float32, not float64. Still well within 1e-4 threshold
- Comments explain the correction factor concisely and correctly
- Test 5 (uneven tiles) specifically catches off-by-one tile boundary bugs
Weaknesses:
- Precomputing all tiles as a list duplicates memory (stores all tiles at once). In a production GPU kernel you'd stream them, but for NumPy testing this is acceptable
- Uses float32 instead of float64 (slightly less precision, but the threshold is 1e-4 so it's fine)
- The non-causal test is Test 3 (out of order) which is slightly confusing labeling
- The
correct = np.exp(np.where(...))pattern wastes a tiny amount of compute on invalid rows (computesexp(0.0) = 1.0then discards via masking). Not a real issue but slightly inefficient
Flash Attention Winner: Qwen3-6 (narrowly over GLM-5)
Qwen3-6 wins on breadth (5 tests vs 2), efficiency (batched einsum vs per-head loops), and edge case handling (uneven tiles). GLM-5's implementation is equally correct in the core algorithm and has better (float64) precision and cleaner comments. MiniMax's implementation is correct but the confused docstring and per-row Python loops for exp are concerning.
| Metric | GLM-5 | MiniMax | Qwen3-6 |
|---|---|---|---|
| Correct rescaling | ✓ exp(m_old-m_new) | ✓ exp(m_old-m_new) | ✓ exp(m_old-m_new) |
| NaN handling | ✓ isfinite guard | ✓ -inf mask | ✓ row_valid mask |
| Causal skip | ✓ per-tile check | ✓ per-tile check | ✓ per-tile check |
| Uneven tiles | Not tested | Not tested | ✓ tested (N=300, T=97) |
| Non-causal | Not tested | Not tested | ✓ tested |
| Multi-head test | H=8 in large test | H=8 in large test | ✓ separate test |
| Peak memory (N=4096) | 34 MB (float64) | 33 MB (float64) | 27 MB (float32) |
| Vectorization | Per-(b,h) loops | Per-(b,h) + per-row exp | Batched einsum |
| Num tests | 2 | 3 (1 extra) | 5 |
| Doc quality | Clear, correct | Confused self-dialogue | Clear, concise |
Challenge 2: Batched Beam Search with EOS Semantics
The Hidden Trap — EOS Beam Removal
When a beam produces EOS, it represents a complete high-confidence candidate. If you remove it from the pool, a longer lower-confidence unfinished beam might "win" simply because it hasn't stopped yet. The correct behavior: finished beams stay in the pool and compete via their frozen logprob + length-penalized score against unfinished beams' growing scores.
All three models correctly retain EOS beams in the pool. None commit the removal bug.
GLM-5 (glm5/beam_search/beam_search.py, ~190 lines)
Grade: B+
Strengths:
- Most concise implementation — just 190 lines including tests and mock model
- Beam tracking is clean: each beam is
(tokens, acc_logprob, finished) - Correctly pools candidates + finished beams at each step:
pool = candidates + finished pool.sort(key=lambda b: penalized_score(b[1], len(b[0])), reverse=True) beams = pool[:K] _make_logitshelper correctly constructs logits that produce exact logprobs after softmax — this is important for the EOS test_MockModelclass is simple and reusable- EOS test verifies exact score (-3.0) for the EOS beam
- Length penalty
alpha=0.0test verifies greedy equivalence - Comments clearly explain why finished beams must stay in the pool
Weaknesses:
- Sequences are lists of ints stored in tuples — mutable but tracked via tuple identity, which works but is slightly fragile
_make_logitsspreads remaining probability mass uniformly — this means non-EOS tokens have non-zero probability even when not explicitly set, potentially affecting beam exploration. In the EOS test this is fine because K=2- Per-batch sequential execution (for loop over prompts) — not truly "batched" in the simultaneous sense
- Only 3 tests
- No test for the length penalty interaction case (alpha > 0 with two EOS beams at different lengths)
- The beam representation
(tokens, acc_logprob, finished)uses a 3-tuple with implicit field ordering — easy to confuse
MiniMax-M2.7 (minimax-m2.7/beam_search/beam_search.py, ~360 lines)
Grade: B-
Strengths:
- Full
MinimalLanguageModelwith multi-head attention built in (overkill but complete) - Proper variable-length batching with padded sequences and per-beam tracking
- Uses
np.argpartitionfor efficient top-k selection - Per-batch candidate filtering:
batch_candidates = [c for c in all_candidates if c['batch_idx'] == batch_idx] - EOS test correctly identifies that the EOS beam wins
- Comments explain EOS retention rationale
Weaknesses:
-
EOS logprob vs logit confusion in the test. The EOS test uses pre-softmax logits (
5.0for EOS,3.0for continue) and relies on softmax conversion. The model then converts logits → probs → logprobs vianp.log(token_prob). This means the actual accumulated logprobs are softmax-transformed values, not the controlled values the test thinks. The test printsscore=0.0000for the EOS beam — that's because 5.0 is a very high logit, and after softmax the probability is near 1.0, logprob near 0.0. The test "passes" but the scores aren't the exact values specified. -
Finished beams are added to
finished_resultsimmediately but also removed fromactive_beams:if c['finished']: finished_results[c['batch_idx']].append({...}) else: new_active_beams.append({...})This means finished beams move to
finished_resultsand don't compete in subsequent steps' candidate selection! Wait, let me re-read...Actually, looking more carefully: at the end of each step, finished beams go to
finished_resultsand unfinished beams go toactive_beams. Then at the START of the next step, ALL finished beams (already moved to finished_results) are NOT re-added to the active pool. So they don't compete.But wait — there's a loop that adds accumulated finished beams back:
if all(beam['finished'] for beam in beams): for beam in beams: finished_results[batch_idx].append({...}) active_beams[batch_idx] = []But this only triggers when ALL beams are finished, not when SOME are finished.
When SOME beams are finished: the finished ones go to
finished_resultsand the unfinished ones continue asactive_beams. At the next step, onlyactive_beamsare expanded. Finished beams infinished_resultsare NOT re-added to the candidate pool for the next step's top-K selection.THIS IS THE EOS REMOVAL BUG! The implementation only keeps finished beams in the pool within a single step (via
all_candidateswhich excludes them — actually wait,all_candidatesonly contains candidates from EXPANDING beams, not finished beams from previous steps).Let me re-read one more time... The candidate collection at step N:
for beam in beams: if beam['finished']: continue # skip finished beams # expand...So finished beams don't produce candidates. The pool for top-K at step N is ONLY
all_candidates(which excludes finished beams from previous steps because they were moved tofinished_results).But the test passes because... the EOS test has beam_width=3 and only PATCHES step 1. At step 1, the EOS beam is created as a CANDIDATE (from expanding an unfinished beam). This candidate is then selected into the active set if its length-penalized score is good enough. Then it's moved to
finished_resultsand removed fromactive_beams. At step 2, only the remaining unfinished beam is expanded. Its candidates compete with... only themselves (finished beams from step 1 are infinished_results, not inall_candidates). But beam_width=3 and there's only 1 unfinished beam producing 2*K=6 candidates, so 3 are selected, all from the continuing beam.Wait, but then how does the EOS beam WIN in the final output? Because at the end,
finished_resultscontains all finished beams andactive_beamscontains unfinished ones, and BOTH are merged intoresults:results.append(scored_results[:beam_width])where
scored_resultscombines both sources. So the EOS beam DOES appear in the final output and CAN win.But the issue is: during step 2 selection, the EOS beam doesn't compete against the continuing beam's candidates for the K active slots. The continuing beam fills all K slots with its own candidates. So at step 3, all K active beams are children of the continuing beam — the EOS beam is not "active" anymore. This means:
- If beam_width=1: at step 1, EOS finishes. Step 2: only the unfinished beam produces candidates. The unfinished beam fills the 1 slot. Step 3: same. Final output: EOS beam (score=-3.0) vs unfinished beam (score=-5.0). EOS wins. ✓
- If beam_width=2: same as above, the unfinished beam fills 2 slots. ✓
- The bug appears when beam_width is small and there are more unfinished beams than slots: the EOS beam might be pushed out of the top-K because only candidates from unfinished beams are being considered, not past finished beams.
Actually, the final output merges ALL finished + unfinished beams, sorts by score, and takes top-K. So the EOS beam from step 1 is in
finished_resultsand WILL appear. The only case where this fails is if there are K+1 unfinished beams that all have better scores than the EOS beam, AND the EOS beam would have been in the top-K if it had been allowed to compete at step selection time. But that's the correct behavior — if K other beams are truly better, they should win.The real EOS removal bug is when finished beams are DISCARDED entirely, not when they're moved to a separate list but still included in final ranking. MiniMax includes them in final ranking. So this is actually CORRECT behavior, just implemented in a roundabout way.
OK, I was over-analyzing. The implementation is functionally correct for the test case. Let me reconsider the grade.
-
The implementation is very verbose and the control flow is hard to follow:
active_beams↔finished_results↔all_candidateswith per-batch indices scattered throughout -
Dictionary-based beam representation with string keys is fragile and slow
-
The
all_candidateslist grows unboundedly across batches per step — not per-batch filtered until selection time -
No test for the length penalty + two EOS beams case
Re-graded to B- primarily for the complexity and confusing architecture, not for a correctness bug.
Qwen3-6 (qwen36/beam_search/beam_search.py + model.py + test_beam_search.py, ~380 lines total)
Grade: A
Strengths:
- Best architecture: 3 separate files —
model.py(MinimalLM),beam_search.py(algorithm),test_beam_search.py(4 tests) Beamclass with__slots__for memory efficiency and cleanlength_penalized_score()method- Clear separation:
accumulated_logprob(never modified by penalty) vslength_penalized_score()(used only for ranking) MockModelclass withset_log_probs()for precise control: returns EXACT log probs (not pre-softmax logits), so test assertions on exact scores work perfectly- 4 tests, including the critical "Test 3b" that verifies length penalty interaction:
- Greedy equivalence (K=1, alpha=0)
- Batch independence with cross-validation against solo runs
- EOS retention: verifies exact score -3.0 for EOS beam, -6.0 for continuing
- EOS retention + length penalty: longer beam (-1.0 at len=2) beats shorter beam (-2.0 at len=1) because -1.0/2^0.6 ≈ -0.66 > -2.0/1^0.6 = -2.0
- Uses
np.argpartitionfor efficient top-k selection finished_beamslist is maintained alongsidebeamslist — both participate in ranking- Comments comprehensively explain the EOS retention rationale
Weaknesses:
MinimalLM.forward()only returns logits for the last token position (not the full sequence). This is correct for beam search but inconsistent with the docstring- The
get_log_probsmethod recomputes forward for every candidate — could be batched, but for a correctness test this is fine - Per-batch sequential execution (for loop over batch items) — the same as GLM
- The
Beamclass uses__slots__which is a Python optimization detail that doesn't matter for a toy implementation (but shows the model is thinking about memory)
Beam Search Winner: Qwen3-6 (decisive)
Qwen3-6 wins on every dimension: architecture (3 files, Beam class), testing (4 tests including the critical length penalty interaction), mocking precision (exact logprobs), and code clarity. GLM-5's implementation is correct and concise but has fewer tests. MiniMax's implementation is architecturally confusing with dictionary-based beams and indirect finished beam tracking.
| Metric | GLM-5 | MiniMax | Qwen3-6 |
|---|---|---|---|
| Beam data structure | 3-tuple (tokens, lprob, finished) | dict with string keys | Beam class with slots |
| EOS retention | ✓ pooled + sorted | ✓ merged at final | ✓ finished_beams list |
| EOS test precision | Exact logprobs (-3.0, -4.0) | Logit-based (scores vary) | Exact logprobs (-3.0, -6.0) |
| Length penalty test | 0.0 only (greedy) | 0.6 (basic) | 0.0 + 0.6 with two EOS beams |
| Batch independence | ✓ solo comparison | ✓ token overlap check | ✓ solo comparison + score check |
| Model simulation | _make_logits + MockModel |
Full transformer inlined | Separate MinimalLM + MockModel |
| Tests | 3 | 3 | 4 |
| Code clarity | Clean, concise | Verbose, hard to follow | Clean, well-separated |
| Vectors | Lists of tuples | Lists of dicts | List of Beam objects |
Cross-Challenge Patterns
Code Quality & Architecture (OpenCode vs pi-mono)
The switch from pi-mono to opencode appears to have had these effects:
| Aspect | pi-mono (Round 1) | opencode (Round 2) |
|---|---|---|
| File organization | Mixed (1 file to 8 files) | All single-file except Qwen (3 files) |
| Verbosity | Moderate | Higher (MiniMax docstring is 3× code) |
| Comments/explanation | More concise | More verbose, sometimes confused |
| Test quality | Good | Generally still good |
| Architecture decisions | More opinionated | More "safe" (single file, less modular) |
The larger system prompt in opencode doesn't seem to have hurt correctness — all implementations pass. But it may have encouraged more verbosity (MiniMax's 80-line confused self-dialogue) and less confident architectural choices (flatter file structures).
The Critical Tests
Both challenges had adversarial tests designed to catch specific bugs:
| Bug | Test | GLM-5 | MiniMax | Qwen3-6 |
|---|---|---|---|---|
| Rescaling direction (flash attn) | Check exp(m_old - m_new) vs exp(m_new - m_old) in code |
✓ Correct | ✓ Correct | ✓ Correct |
| NaN from -inf - (-inf) (flash attn) | Causal with fully masked first tile | ✓ isfinite guard | ✓ -inf mask | ✓ row_valid mask |
| EOS beam removal (beam search) | EOS at step 1 with score -3.0 vs cont at -5.0 | ✓ EOS wins | ✓ EOS wins | ✓ EOS wins, exact score |
| Length penalty interaction | Two EOS beams at different lengths | Not tested | Not tested | ✓ tested (alpha=0.6) |
The most interesting finding: MiniMax's flash attention docstring gets the rescaling explanation wrong (it argues with itself about the direction) but the code is correct. This suggests the model has pattern-matched the correct code but doesn't have deep understanding of why it's correct. In a variant of the problem (e.g., changing the recurrence slightly), this model would likely produce buggy code because it's reciting rather than reasoning.
The NaN Hazard
All three models correctly identified and handled the NaN hazard from exp(-inf - (-inf)) when the first KV tile is fully masked under causal attention. The approaches differ:
- GLM-5:
np.isfinite(m_new)guard → zeros out P for invalid rows - MiniMax:
m_old_is_neg_inf & m_new_is_neg_infboolean mask → sets correction to 1, zeros out P - Qwen3-6:
row_valid = row_max > -np.inf→ zeros out P, correction = exp(0) = 1 for invalid rows
All three are correct. Qwen3-6's approach is cleanest because it detects validity at the source (are there any valid scores in this row?) rather than checking secondary conditions.
Overall Rankings (Harder Challenges)
1st Place: Qwen3-6 (A)
Wins both tasks. Flash attention: most efficient (batched einsum), 5 tests including uneven tiles and non-causal. Beam search: best architecture (3 files, Beam class, MockModel), 4 tests including the length penalty interaction case, exact score verification. The only model that separated the language model from the beam search algorithm into different files — exactly the right engineering instinct.
2nd Place: GLM-5 (B+)
Solid, correct implementations with concise code. Flash attention: clean online softmax, good NaN handling, float64 precision. Beam search: correct EOS retention, clean tuple-based tracking. Weaknesses are scope: fewer tests, missing edge cases (no uneven tiles, no length penalty interaction), no non-causal flash attention mode.
3rd Place: MiniMax-M2.7 (B-)
All tests pass but the implementations have concerning properties. Flash attention: correct code but the docstring shows confusion about the rescaling direction — the model clearly doesn't understand why the formula works. Beam search: architecturally confusing with dictionary-based beams and indirect EOS tracking. The code works but the reasoning and structure are both weaker than competitors. The 80-line self-dialogue in the flash attention docstring is a red flag for frontier model capability.
Combined Rankings (Both Rounds)
| Model | Round 1 Grade | Round 2 Grade | Combined |
|---|---|---|---|
| Qwen3-6 | A- | A | A/A- |
| GLM-5 | B+ | B+ | B+ |
| MiniMax-M2.7 | B | B- | B/B- |
Final Takeaways
-
Qwen3-6 (27B, local) continues to outperform two frontier-class models. The gap actually widened in round 2 — Qwen3-6 got STRONGER on harder tasks while the others stayed flat or regressed slightly. This is the opposite of what you'd expect if scale was the primary driver.
-
The rescaling direction test didn't catch anyone — all three models wrote
exp(m_old - m_new)correctly in code. This might mean the bug is well-known enough to be in training data, or the models genuinely understand the math. MiniMax's confused docstring argues for the former. -
The EOS retention test also didn't catch anyone — all three correctly keep finished beams. This is encouraging since the EOS removal bug is common in production frameworks.
-
The opencode harness may have been a slight negative. The implementations in round 2 are less modular (more single-file, fewer test files) than round 1. Qwen3-6 maintained modularity (3 files for beam search). GLM-5 went from 3 files (KV-cache) to single files. MiniMax was already monolithic.
-
The most differentiating factor remains engineering discipline, not algorithmic knowledge. All three models understand flash attention and beam search. What separates them is testing thoroughness (5 tests vs 2), edge case handling (uneven tiles, non-causal), mock precision (exact logprobs vs logits), and code organization (Beam class vs tuple vs dict).