Files
deep_pro_judge/analysis/3-way-head-to-head-round2.md
T
sleepy 45c3aad453 feat: expand to 6 models, 8 challenges; rewrite README with DeepSeek V4 Pro analysis
- Add Claude Opus 4.7, Kimi K2.6, GLM-5.1 to existing GLM-5, Qwen3-6, MiniMax-M2.7
- Add 5 new challenges: flash attention fwd/bwd, beam search, DFlash, ternary training
- Rewrite README with TL;DR rankings, grade matrix, and DeepSeek V4 Pro attribution
- Add analysis/ folder with cross-model comparisons and per-challenge deep dives
- Add deploy_challenges.sh script
- Expand .gitignore to exclude Python envs, ML weights, and build artifacts
2026-04-27 18:49:22 +02:00

25 KiB
Raw Blame History

3-Way Head-to-Head: Harder Challenges

Executive Summary

Dimension GLM-5 MiniMax-M2.7 Qwen3-6
Flash Attention Grade A- B A
Beam Search Grade B+ B- A
Overall Grade B+ B- A
All tests pass?
Silent bugs? 0 1 (cosmetic) 0

Context note: These runs used opencode (full LSP/tooling harness) instead of the minimal pi-mono harness used in the first round. The system prompt is substantially larger and the environment has more tooling. This may affect verbosity and architectural choices but shouldn't impact algorithmic correctness.


Challenge 1: Tiled Flash Attention (Online Softmax)

The Hidden Trap — Rescaling Direction

The flash attention online softmax recurrence requires:

correction = exp(m_old - m_new)    # ≤ 1, correct

The most common bug in open-source implementations is writing this as exp(m_new - m_old) instead. Both produce identical final output (because O/l is invariant to the correction factor magnitude), but with the wrong direction, intermediate O and l values grow without bound, causing overflow/underflow in the backward pass or with fp16. The bug is invisible in forward-only correctness tests.

All three models got this right in their code.

GLM-5 (glm5/flash_attention/flash_attention.py, ~215 lines)

Grade: A-

Strengths:

  • Correct online softmax with correction = np.exp(m_tile - m_new)
  • Clean per-(b,h) loop structure that mirrors how GPU kernels are organized
  • Handles the fully-masked first tile NaN hazard with np.isfinite guards:
    safe_mask = np.isfinite(m_new)
    P_tile = np.where(safe_mask[:, None], P_tile, 0.0)
    
  • Also handles l==0 at normalization time (masked rows get output 0)
  • Causal skip optimization: if causal and k_start > q_end - 1: continue
  • Uses float64 → relative error 1.85e-16 (near machine epsilon)
  • Peak memory 34.24 MB vs 134.22 MB for naive (N=4096, D=64) — well under
  • Comments clearly explain WHY correction is exp(m_old - m_new) and NOT exp(m_new - m_old) with a concrete derivation
  • Two tests delivered as specified

Weaknesses:

  • Per-(b,h) Python loops make it slow for large B,H (the matmul within tiles is vectorized, but the outer loops are serial). In a GPU kernel this structure is correct, but for NumPy testing, a fully batched einsum would be faster.
  • Only tests causal=True. Doesn't test non-causal mode
  • No test for uneven tile sizes (N not divisible by tile_size)
  • No multi-head test (H>1) separately — the large test does H=8 but doesn't verify correctness on each head

MiniMax-M2.7 (minimax-m2.7/flash_attention/flash_attention.py, ~370 lines)

Grade: B

Strengths:

  • Code produces correct results (rel error 1.25e-10, passes all tests)
  • Correctly implements online softmax with exp(m[valid_corr_mask] - m_new_flat[valid_corr_mask])
  • Explicit handling of m_old == -inf edge case with boolean masks:
    m_old_is_neg_inf = m == -np.inf
    need_correction = ~(m_old_is_neg_inf & m_new_is_neg_inf)
    
  • Causal skip condition is correct: if kv_tile_start >= q_tile_end: continue
  • Peak memory 32.59 MB vs 128 MB naive — good
  • Third test on N=512 for additional correctness verification
  • Handles l==0 at normalization (if all KV tiles were masked)

Weaknesses:

  • The explanation of the rescaling factor IS WRONG in the docstring. The docstring contains a ~80-line monologue where the model argues with itself about the correction direction, initially claiming it should be exp(m_new - m_old) and "This is WRONG!" before eventually talking itself into the right answer. But in one line it says:
    O_new = O_old * exp(m_new - m_old)  # This is WRONG! Unless...
    
    Then later contradicts and gets it right. The confusion in the explanation is a red flag — it suggests the model is parroting a learned pattern rather than reasoning from first principles. The code happens to be correct, but the explanation reveals uncertainty.
  • Per-(b,h) loops with per-row Python for-loop for exp computation:
    for i in range(S.shape[0]):
        if not np.isinf(m_new_flat[i]):
            exp_S_minus_m_new[i] = np.exp(S[i] - m_new_flat[i])
    
    This Python for-loop over query positions within each tile is extremely slow for realistic tile sizes. It should be a vectorized operation.
  • The docstring is 3× longer than the actual code — mostly confused self-dialogue about the rescaling
  • Uses np.matmul but then loops per-row for exp — inconsistent vectorization strategy
  • No tracemalloc in the actual large test (just printed analysis, which is fine but less rigorous)

Qwen3-6 (qwen36/flash_attention/flash_attention.py, ~310 lines)

Grade: A

Strengths:

  • Most efficient implementation: uses np.einsum for batched score computation, avoiding per-(b,h) loops entirely
  • Precomputes Q_tiles, K_tiles, V_tiles as lists — clean separation of tiling from computation
  • Proper row_valid masking to handle the -inf - (-inf) = NaN problem:
    row_valid = row_max > -np.inf
    correction = np.exp(np.where(row_valid, m - m_new, 0.0))
    P = np.where(row_valid[:, :, :, np.newaxis], P, 0.0)
    
  • Handles l==0 gracefully at normalization with l_safe = np.where(l > 0, l, 1.0)
  • 5 tests: accuracy (causal), non-causal, larger batch (B=2,H=8), uneven tiles (N=300, tile_size=97), memory
  • Peak memory 27.2 MB (float32 → smaller absolute numbers, still well under 64 MB naive)
  • Relative error 5.93e-08 — slightly higher than GLM/MiniMax but only because float32, not float64. Still well within 1e-4 threshold
  • Comments explain the correction factor concisely and correctly
  • Test 5 (uneven tiles) specifically catches off-by-one tile boundary bugs

Weaknesses:

  • Precomputing all tiles as a list duplicates memory (stores all tiles at once). In a production GPU kernel you'd stream them, but for NumPy testing this is acceptable
  • Uses float32 instead of float64 (slightly less precision, but the threshold is 1e-4 so it's fine)
  • The non-causal test is Test 3 (out of order) which is slightly confusing labeling
  • The correct = np.exp(np.where(...)) pattern wastes a tiny amount of compute on invalid rows (computes exp(0.0) = 1.0 then discards via masking). Not a real issue but slightly inefficient

Flash Attention Winner: Qwen3-6 (narrowly over GLM-5)

Qwen3-6 wins on breadth (5 tests vs 2), efficiency (batched einsum vs per-head loops), and edge case handling (uneven tiles). GLM-5's implementation is equally correct in the core algorithm and has better (float64) precision and cleaner comments. MiniMax's implementation is correct but the confused docstring and per-row Python loops for exp are concerning.

Metric GLM-5 MiniMax Qwen3-6
Correct rescaling ✓ exp(m_old-m_new) ✓ exp(m_old-m_new) ✓ exp(m_old-m_new)
NaN handling ✓ isfinite guard ✓ -inf mask ✓ row_valid mask
Causal skip ✓ per-tile check ✓ per-tile check ✓ per-tile check
Uneven tiles Not tested Not tested ✓ tested (N=300, T=97)
Non-causal Not tested Not tested ✓ tested
Multi-head test H=8 in large test H=8 in large test ✓ separate test
Peak memory (N=4096) 34 MB (float64) 33 MB (float64) 27 MB (float32)
Vectorization Per-(b,h) loops Per-(b,h) + per-row exp Batched einsum
Num tests 2 3 (1 extra) 5
Doc quality Clear, correct Confused self-dialogue Clear, concise

Challenge 2: Batched Beam Search with EOS Semantics

The Hidden Trap — EOS Beam Removal

When a beam produces EOS, it represents a complete high-confidence candidate. If you remove it from the pool, a longer lower-confidence unfinished beam might "win" simply because it hasn't stopped yet. The correct behavior: finished beams stay in the pool and compete via their frozen logprob + length-penalized score against unfinished beams' growing scores.

All three models correctly retain EOS beams in the pool. None commit the removal bug.

GLM-5 (glm5/beam_search/beam_search.py, ~190 lines)

Grade: B+

Strengths:

  • Most concise implementation — just 190 lines including tests and mock model
  • Beam tracking is clean: each beam is (tokens, acc_logprob, finished)
  • Correctly pools candidates + finished beams at each step:
    pool = candidates + finished
    pool.sort(key=lambda b: penalized_score(b[1], len(b[0])), reverse=True)
    beams = pool[:K]
    
  • _make_logits helper correctly constructs logits that produce exact logprobs after softmax — this is important for the EOS test
  • _MockModel class is simple and reusable
  • EOS test verifies exact score (-3.0) for the EOS beam
  • Length penalty alpha=0.0 test verifies greedy equivalence
  • Comments clearly explain why finished beams must stay in the pool

Weaknesses:

  • Sequences are lists of ints stored in tuples — mutable but tracked via tuple identity, which works but is slightly fragile
  • _make_logits spreads remaining probability mass uniformly — this means non-EOS tokens have non-zero probability even when not explicitly set, potentially affecting beam exploration. In the EOS test this is fine because K=2
  • Per-batch sequential execution (for loop over prompts) — not truly "batched" in the simultaneous sense
  • Only 3 tests
  • No test for the length penalty interaction case (alpha > 0 with two EOS beams at different lengths)
  • The beam representation (tokens, acc_logprob, finished) uses a 3-tuple with implicit field ordering — easy to confuse

MiniMax-M2.7 (minimax-m2.7/beam_search/beam_search.py, ~360 lines)

Grade: B-

Strengths:

  • Full MinimalLanguageModel with multi-head attention built in (overkill but complete)
  • Proper variable-length batching with padded sequences and per-beam tracking
  • Uses np.argpartition for efficient top-k selection
  • Per-batch candidate filtering: batch_candidates = [c for c in all_candidates if c['batch_idx'] == batch_idx]
  • EOS test correctly identifies that the EOS beam wins
  • Comments explain EOS retention rationale

Weaknesses:

  • EOS logprob vs logit confusion in the test. The EOS test uses pre-softmax logits (5.0 for EOS, 3.0 for continue) and relies on softmax conversion. The model then converts logits → probs → logprobs via np.log(token_prob). This means the actual accumulated logprobs are softmax-transformed values, not the controlled values the test thinks. The test prints score=0.0000 for the EOS beam — that's because 5.0 is a very high logit, and after softmax the probability is near 1.0, logprob near 0.0. The test "passes" but the scores aren't the exact values specified.

  • Finished beams are added to finished_results immediately but also removed from active_beams:

    if c['finished']:
        finished_results[c['batch_idx']].append({...})
    else:
        new_active_beams.append({...})
    

    This means finished beams move to finished_results and don't compete in subsequent steps' candidate selection! Wait, let me re-read...

    Actually, looking more carefully: at the end of each step, finished beams go to finished_results and unfinished beams go to active_beams. Then at the START of the next step, ALL finished beams (already moved to finished_results) are NOT re-added to the active pool. So they don't compete.

    But wait — there's a loop that adds accumulated finished beams back:

    if all(beam['finished'] for beam in beams):
        for beam in beams:
            finished_results[batch_idx].append({...})
        active_beams[batch_idx] = []
    

    But this only triggers when ALL beams are finished, not when SOME are finished.

    When SOME beams are finished: the finished ones go to finished_results and the unfinished ones continue as active_beams. At the next step, only active_beams are expanded. Finished beams in finished_results are NOT re-added to the candidate pool for the next step's top-K selection.

    THIS IS THE EOS REMOVAL BUG! The implementation only keeps finished beams in the pool within a single step (via all_candidates which excludes them — actually wait, all_candidates only contains candidates from EXPANDING beams, not finished beams from previous steps).

    Let me re-read one more time... The candidate collection at step N:

    for beam in beams:
        if beam['finished']: continue  # skip finished beams
        # expand...
    

    So finished beams don't produce candidates. The pool for top-K at step N is ONLY all_candidates (which excludes finished beams from previous steps because they were moved to finished_results).

    But the test passes because... the EOS test has beam_width=3 and only PATCHES step 1. At step 1, the EOS beam is created as a CANDIDATE (from expanding an unfinished beam). This candidate is then selected into the active set if its length-penalized score is good enough. Then it's moved to finished_results and removed from active_beams. At step 2, only the remaining unfinished beam is expanded. Its candidates compete with... only themselves (finished beams from step 1 are in finished_results, not in all_candidates). But beam_width=3 and there's only 1 unfinished beam producing 2*K=6 candidates, so 3 are selected, all from the continuing beam.

    Wait, but then how does the EOS beam WIN in the final output? Because at the end, finished_results contains all finished beams and active_beams contains unfinished ones, and BOTH are merged into results:

    results.append(scored_results[:beam_width])
    

    where scored_results combines both sources. So the EOS beam DOES appear in the final output and CAN win.

    But the issue is: during step 2 selection, the EOS beam doesn't compete against the continuing beam's candidates for the K active slots. The continuing beam fills all K slots with its own candidates. So at step 3, all K active beams are children of the continuing beam — the EOS beam is not "active" anymore. This means:

    • If beam_width=1: at step 1, EOS finishes. Step 2: only the unfinished beam produces candidates. The unfinished beam fills the 1 slot. Step 3: same. Final output: EOS beam (score=-3.0) vs unfinished beam (score=-5.0). EOS wins. ✓
    • If beam_width=2: same as above, the unfinished beam fills 2 slots. ✓
    • The bug appears when beam_width is small and there are more unfinished beams than slots: the EOS beam might be pushed out of the top-K because only candidates from unfinished beams are being considered, not past finished beams.

    Actually, the final output merges ALL finished + unfinished beams, sorts by score, and takes top-K. So the EOS beam from step 1 is in finished_results and WILL appear. The only case where this fails is if there are K+1 unfinished beams that all have better scores than the EOS beam, AND the EOS beam would have been in the top-K if it had been allowed to compete at step selection time. But that's the correct behavior — if K other beams are truly better, they should win.

    The real EOS removal bug is when finished beams are DISCARDED entirely, not when they're moved to a separate list but still included in final ranking. MiniMax includes them in final ranking. So this is actually CORRECT behavior, just implemented in a roundabout way.

    OK, I was over-analyzing. The implementation is functionally correct for the test case. Let me reconsider the grade.

  • The implementation is very verbose and the control flow is hard to follow: active_beamsfinished_resultsall_candidates with per-batch indices scattered throughout

  • Dictionary-based beam representation with string keys is fragile and slow

  • The all_candidates list grows unboundedly across batches per step — not per-batch filtered until selection time

  • No test for the length penalty + two EOS beams case

Re-graded to B- primarily for the complexity and confusing architecture, not for a correctness bug.

Qwen3-6 (qwen36/beam_search/beam_search.py + model.py + test_beam_search.py, ~380 lines total)

Grade: A

Strengths:

  • Best architecture: 3 separate files — model.py (MinimalLM), beam_search.py (algorithm), test_beam_search.py (4 tests)
  • Beam class with __slots__ for memory efficiency and clean length_penalized_score() method
  • Clear separation: accumulated_logprob (never modified by penalty) vs length_penalized_score() (used only for ranking)
  • MockModel class with set_log_probs() for precise control: returns EXACT log probs (not pre-softmax logits), so test assertions on exact scores work perfectly
  • 4 tests, including the critical "Test 3b" that verifies length penalty interaction:
    • Greedy equivalence (K=1, alpha=0)
    • Batch independence with cross-validation against solo runs
    • EOS retention: verifies exact score -3.0 for EOS beam, -6.0 for continuing
    • EOS retention + length penalty: longer beam (-1.0 at len=2) beats shorter beam (-2.0 at len=1) because -1.0/2^0.6 ≈ -0.66 > -2.0/1^0.6 = -2.0
  • Uses np.argpartition for efficient top-k selection
  • finished_beams list is maintained alongside beams list — both participate in ranking
  • Comments comprehensively explain the EOS retention rationale

Weaknesses:

  • MinimalLM.forward() only returns logits for the last token position (not the full sequence). This is correct for beam search but inconsistent with the docstring
  • The get_log_probs method recomputes forward for every candidate — could be batched, but for a correctness test this is fine
  • Per-batch sequential execution (for loop over batch items) — the same as GLM
  • The Beam class uses __slots__ which is a Python optimization detail that doesn't matter for a toy implementation (but shows the model is thinking about memory)

Beam Search Winner: Qwen3-6 (decisive)

Qwen3-6 wins on every dimension: architecture (3 files, Beam class), testing (4 tests including the critical length penalty interaction), mocking precision (exact logprobs), and code clarity. GLM-5's implementation is correct and concise but has fewer tests. MiniMax's implementation is architecturally confusing with dictionary-based beams and indirect finished beam tracking.

Metric GLM-5 MiniMax Qwen3-6
Beam data structure 3-tuple (tokens, lprob, finished) dict with string keys Beam class with slots
EOS retention ✓ pooled + sorted ✓ merged at final ✓ finished_beams list
EOS test precision Exact logprobs (-3.0, -4.0) Logit-based (scores vary) Exact logprobs (-3.0, -6.0)
Length penalty test 0.0 only (greedy) 0.6 (basic) 0.0 + 0.6 with two EOS beams
Batch independence ✓ solo comparison ✓ token overlap check ✓ solo comparison + score check
Model simulation _make_logits + MockModel Full transformer inlined Separate MinimalLM + MockModel
Tests 3 3 4
Code clarity Clean, concise Verbose, hard to follow Clean, well-separated
Vectors Lists of tuples Lists of dicts List of Beam objects

Cross-Challenge Patterns

Code Quality & Architecture (OpenCode vs pi-mono)

The switch from pi-mono to opencode appears to have had these effects:

Aspect pi-mono (Round 1) opencode (Round 2)
File organization Mixed (1 file to 8 files) All single-file except Qwen (3 files)
Verbosity Moderate Higher (MiniMax docstring is 3× code)
Comments/explanation More concise More verbose, sometimes confused
Test quality Good Generally still good
Architecture decisions More opinionated More "safe" (single file, less modular)

The larger system prompt in opencode doesn't seem to have hurt correctness — all implementations pass. But it may have encouraged more verbosity (MiniMax's 80-line confused self-dialogue) and less confident architectural choices (flatter file structures).

The Critical Tests

Both challenges had adversarial tests designed to catch specific bugs:

Bug Test GLM-5 MiniMax Qwen3-6
Rescaling direction (flash attn) Check exp(m_old - m_new) vs exp(m_new - m_old) in code ✓ Correct ✓ Correct ✓ Correct
NaN from -inf - (-inf) (flash attn) Causal with fully masked first tile ✓ isfinite guard ✓ -inf mask ✓ row_valid mask
EOS beam removal (beam search) EOS at step 1 with score -3.0 vs cont at -5.0 ✓ EOS wins ✓ EOS wins ✓ EOS wins, exact score
Length penalty interaction Two EOS beams at different lengths Not tested Not tested ✓ tested (alpha=0.6)

The most interesting finding: MiniMax's flash attention docstring gets the rescaling explanation wrong (it argues with itself about the direction) but the code is correct. This suggests the model has pattern-matched the correct code but doesn't have deep understanding of why it's correct. In a variant of the problem (e.g., changing the recurrence slightly), this model would likely produce buggy code because it's reciting rather than reasoning.

The NaN Hazard

All three models correctly identified and handled the NaN hazard from exp(-inf - (-inf)) when the first KV tile is fully masked under causal attention. The approaches differ:

  • GLM-5: np.isfinite(m_new) guard → zeros out P for invalid rows
  • MiniMax: m_old_is_neg_inf & m_new_is_neg_inf boolean mask → sets correction to 1, zeros out P
  • Qwen3-6: row_valid = row_max > -np.inf → zeros out P, correction = exp(0) = 1 for invalid rows

All three are correct. Qwen3-6's approach is cleanest because it detects validity at the source (are there any valid scores in this row?) rather than checking secondary conditions.


Overall Rankings (Harder Challenges)

1st Place: Qwen3-6 (A)

Wins both tasks. Flash attention: most efficient (batched einsum), 5 tests including uneven tiles and non-causal. Beam search: best architecture (3 files, Beam class, MockModel), 4 tests including the length penalty interaction case, exact score verification. The only model that separated the language model from the beam search algorithm into different files — exactly the right engineering instinct.

2nd Place: GLM-5 (B+)

Solid, correct implementations with concise code. Flash attention: clean online softmax, good NaN handling, float64 precision. Beam search: correct EOS retention, clean tuple-based tracking. Weaknesses are scope: fewer tests, missing edge cases (no uneven tiles, no length penalty interaction), no non-causal flash attention mode.

3rd Place: MiniMax-M2.7 (B-)

All tests pass but the implementations have concerning properties. Flash attention: correct code but the docstring shows confusion about the rescaling direction — the model clearly doesn't understand why the formula works. Beam search: architecturally confusing with dictionary-based beams and indirect EOS tracking. The code works but the reasoning and structure are both weaker than competitors. The 80-line self-dialogue in the flash attention docstring is a red flag for frontier model capability.


Combined Rankings (Both Rounds)

Model Round 1 Grade Round 2 Grade Combined
Qwen3-6 A- A A/A-
GLM-5 B+ B+ B+
MiniMax-M2.7 B B- B/B-

Final Takeaways

  1. Qwen3-6 (27B, local) continues to outperform two frontier-class models. The gap actually widened in round 2 — Qwen3-6 got STRONGER on harder tasks while the others stayed flat or regressed slightly. This is the opposite of what you'd expect if scale was the primary driver.

  2. The rescaling direction test didn't catch anyone — all three models wrote exp(m_old - m_new) correctly in code. This might mean the bug is well-known enough to be in training data, or the models genuinely understand the math. MiniMax's confused docstring argues for the former.

  3. The EOS retention test also didn't catch anyone — all three correctly keep finished beams. This is encouraging since the EOS removal bug is common in production frameworks.

  4. The opencode harness may have been a slight negative. The implementations in round 2 are less modular (more single-file, fewer test files) than round 1. Qwen3-6 maintained modularity (3 files for beam search). GLM-5 went from 3 files (KV-cache) to single files. MiniMax was already monolithic.

  5. The most differentiating factor remains engineering discipline, not algorithmic knowledge. All three models understand flash attention and beam search. What separates them is testing thoroughness (5 tests vs 2), edge case handling (uneven tiles, non-causal), mock precision (exact logprobs vs logits), and code organization (Beam class vs tuple vs dict).