# 2-Way Head-to-Head: Harder Challenges (Round 2)

These pairwise comparisons cover the two harder challenges: Flash Attention and Beam Search. All implementations were run under opencode (full LSP tooling harness) rather than the minimal pi-mono harness from Round 1.

---

## GLM-5 vs MiniMax-M2.7

### Flash Attention

| Criteria | GLM-5 | MiniMax-M2.7 | Edge |
|----------|-------|--------------|------|
| Lines of code | ~215 | ~370 | **GLM** (more concise) |
| Correct rescaling | ✓ exp(m_old - m_new) | ✓ exp(m_old - m_new) | Tie |
| NaN handling | np.isfinite guard | -inf boolean mask | Tie (both correct) |
| Causal skip | ✓ per-tile | ✓ per-tile | Tie |
| Rel error (N=256) | 1.85e-16 (float64) | 1.25e-10 (float64) | **GLM** (~10⁶× smaller error) |
| Peak memory (N=4096) | 34.2 MB | 32.6 MB | Tie |
| Tests | 2 (specified) | 3 (extra N=512 check) | **MiniMax** |
| Vectorization | Per-(b,h) loops | Per-(b,h) + per-row Python for-loop for exp | **GLM** (no per-row loops) |
| Doc quality | Clear, correct derivation | **Confused 80-line self-dialogue** that initially claims the wrong answer | **GLM** |
| Extra modes | Causal only | Causal only | Tie |

**Winner: GLM-5** — The critical differentiator is MiniMax's docstring, which **argues with itself about the rescaling direction**, initially claiming the wrong answer before correcting itself. The code is right but the reasoning is shaky. GLM-5's code is also faster (no per-row Python exp loop inside the tile loop).
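For reference, the recurrence both implementations get right in code (and that MiniMax's docstring second-guesses) is the standard online-softmax update: when a new tile raises the running row maximum, the accumulated numerator and denominator are rescaled by exp(m_old - m_new) before the new tile's contribution is added. A minimal NumPy sketch for a single query row, with hypothetical names that are not taken from either submission:

```python
import numpy as np

def online_softmax_row(scores_tiles):
    """Accumulate softmax(s) @ v over tiles of scores for one query row.

    scores_tiles: iterable of (s_tile, v_tile) pairs, where s_tile has shape
    (T,) and v_tile has shape (T, d). Matches the result computed over the
    full concatenated score vector.
    """
    m = -np.inf   # running row max
    denom = 0.0   # running softmax denominator
    acc = None    # running (unnormalized) weighted sum of values
    for s_tile, v_tile in scores_tiles:
        m_new = max(m, float(s_tile.max()))
        # Rescale the old state by exp(m_old - m_new); nothing to rescale yet
        # on the first tile, when m is still -inf.
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0
        p = np.exp(s_tile - m_new)               # unnormalized tile weights
        denom = denom * scale + p.sum()
        acc = (acc * scale if acc is not None else 0.0) + p @ v_tile
        m = m_new
    return acc / denom

# Tiny check against the reference softmax(s) @ v over the full row.
rng = np.random.default_rng(0)
s = rng.normal(size=12)
v = rng.normal(size=(12, 4))
weights = np.exp(s - s.max()) / np.exp(s - s.max()).sum()
tiled = online_softmax_row([(s[i:i + 4], v[i:i + 4]) for i in range(0, 12, 4)])
assert np.allclose(weights @ v, tiled)
```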
### Beam Search

| Criteria | GLM-5 | MiniMax-M2.7 | Edge |
|----------|-------|--------------|------|
| Lines of code | ~190 | ~360 | **GLM** |
| Beam representation | 3-tuple (tokens, lprob, finished) | dict with 7 string keys | **GLM** |
| EOS retention | ✓ pool = candidates + finished | ✓ merged at final ranking | Tie |
| Mocking precision | Exact logprobs via _make_logits | Pre-softmax logits → scores vary | **GLM** |
| EOS test score | -3.0 (exact) | 0.0 (transformed by softmax) | **GLM** |
| Length penalty test | alpha=0.0 only | alpha=0.6 basic | **MiniMax** |
| Batch independence | Solo comparison | Token-overlap + solo | Tie |
| Tests | 3 | 3 | Tie |
| Code clarity | Clean, easy to follow | Verbose, hard to trace control flow | **GLM** |
| Model integration | MockModel class + MinimalTransformer | Full transformer inlined in test | **GLM** |

**Winner: GLM-5** — Both are correct, but GLM-5's implementation is dramatically cleaner. MiniMax uses dictionaries-with-string-keys for beams and has complex control flow between `active_beams`, `finished_results`, and `all_candidates`. The mocking precision is also better in GLM-5 (exact logprob control vs logit-based).

### GLM-5 vs MiniMax-M2.7 Overall: **GLM-5 wins 2-0**

---

## MiniMax-M2.7 vs Qwen3-6

### Flash Attention

| Criteria | MiniMax-M2.7 | Qwen3-6 | Edge |
|----------|--------------|---------|------|
| Correct rescaling | ✓ | ✓ | Tie |
| NaN handling | -inf boolean mask | row_valid = row_max > -inf | **Qwen** (cleaner) |
| Tests | 3 (causal ×2, N=512 extra) | 5 (causal, non-causal, multi-head, uneven tiles, memory) | **Qwen** |
| Non-causal support | Not tested | ✓ tested | **Qwen** |
| Uneven tiles | Not tested | ✓ N=300, T=97 | **Qwen** |
| Rel error | 1.25e-10 (float64) | 5.93e-08 (float32) | **MiniMax** (dtype artifact) |
| Peak memory | 32.6 MB (float64) | 27.2 MB (float32) | Tie (both proportional to dtype) |
| Vectorization | Per-(b,h) + per-row Python exp loop | Batched einsum, no per-row loops | **Qwen** |
| Doc quality | Confused self-dialogue | Clear, concise | **Qwen** |
| Multi-head | H=8 in large test only | H=8 with separate correctness check | **Qwen** |

**Winner: Qwen3-6** — Decisive. More tests, cleaner code, better vectorization, a correct docstring. MiniMax's confused docstring is a major red flag even though the code works.

### Beam Search

| Criteria | MiniMax-M2.7 | Qwen3-6 | Edge |
|----------|--------------|---------|------|
| Files | 1 (monolithic) | 3 (model.py + beam_search.py + test) | **Qwen** |
| Beam representation | dict with 7 string keys | Beam class with __slots__ | **Qwen** |
| EOS retention | ✓ (indirect, via final merge) | ✓ (direct, finished_beams list in pool) | **Qwen** |
| Mocking precision | Logit-based (can't control exact scores) | Exact logprobs via MockModel.set_log_probs | **Qwen** |
| Tests | 3 | 4 (includes length penalty + two EOS beams) | **Qwen** |
| Length penalty interaction | Only basic alpha=0.6 | Test 3b: alpha=0.6, two EOS beams at different lengths, verifies the longer beam wins correctly | **Qwen** |
| Batch independence verification | Token-overlap check (weak) | Solo comparison + exact score match | **Qwen** |
| Code clarity | Hard to follow control flow | Clean, well-separated concerns | **Qwen** |

**Winner: Qwen3-6** — Decisive on every dimension. The length penalty interaction test (3b) is the strongest differentiator — only Qwen3-6 tested that two EOS beams at different lengths interact correctly with the penalty.

### MiniMax-M2.7 vs Qwen3-6 Overall: **Qwen3-6 wins 2-0**

---

## Qwen3-6 vs GLM-5

### Flash Attention

| Criteria | Qwen3-6 | GLM-5 | Edge |
|----------|---------|-------|------|
| Correct rescaling | ✓ | ✓ | Tie |
| NaN handling | row_valid = row_max > -inf (cleaner) | np.isfinite(m_new) (also correct) | Tie |
| Tests | 5 tests | 2 tests | **Qwen** |
| Non-causal | ✓ tested | Not tested | **Qwen** |
| Uneven tiles | ✓ N=300, T=97 | Not tested | **Qwen** |
| Multi-head B,H>1 | ✓ separate test | H=8 in large test only | **Qwen** |
| Vectorization | Batched einsum (fast) | Per-(b,h) loops (correct but slower) | **Qwen** |
| Precision | 5.93e-08 (float32) | 1.85e-16 (float64) | **GLM** (dtype choice) |
| Peak memory | 27.2 MB (float32) | 34.2 MB (float64) | Tie |
| Code clarity | Clear, concise | Clear, concise | Tie |
| Doc quality | Concise and correct | Concise and correct, with derivation | **GLM** (slightly better derivation) |

**Winner: Qwen3-6** — GLM-5 has slightly better precision by using float64 (but both are well within 1e-4). Qwen3-6 wins on breadth: 5 tests covering modes GLM-5 didn't attempt (non-causal, uneven tiles), and batched einsum is more efficient. This is a close call — GLM-5's core algorithm is equally correct.
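To make the vectorization rows concrete: the gap is between computing tile scores one (batch, head) pair at a time in Python loops and computing them for all batches and heads at once with a batched einsum. A minimal sketch of the two forms, with hypothetical shapes and names rather than either submission's actual code:

```python
import numpy as np

# Hypothetical tile shapes: B batches, H heads, Tq query rows, Tk key rows, d head dim.
B, H, Tq, Tk, d = 2, 8, 64, 64, 32
rng = np.random.default_rng(0)
q_tile = rng.standard_normal((B, H, Tq, d), dtype=np.float32)
k_tile = rng.standard_normal((B, H, Tk, d), dtype=np.float32)

# Looped form: one (b, h) pair per Python iteration.
scores_loop = np.empty((B, H, Tq, Tk), dtype=np.float32)
for b in range(B):
    for h in range(H):
        scores_loop[b, h] = q_tile[b, h] @ k_tile[b, h].T / np.sqrt(d)

# Batched form: all (b, h) pairs in a single einsum call.
scores_einsum = np.einsum("bhqd,bhkd->bhqk", q_tile, k_tile) / np.sqrt(d)

assert np.allclose(scores_loop, scores_einsum, atol=1e-5)
```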
### Beam Search

| Criteria | Qwen3-6 | GLM-5 | Edge |
|----------|---------|-------|------|
| Files | 3 (model, beam_search, test) | 1 (monolithic) | **Qwen** |
| Beam representation | Beam class with __slots__ | 3-tuple (tokens, lprob, finished) | **Qwen** (more readable) |
| EOS retention | ✓ finished_beams list | ✓ pool = candidates + finished | Tie |
| Mocking precision | Exact logprobs via MockModel | Exact logprobs via _make_logits | Tie |
| Tests | 4 tests | 3 tests | **Qwen** |
| Length penalty + two EOS beams | ✓ Test 3b (alpha=0.6) | Not tested | **Qwen** |
| Greedy equivalence | ✓ K=1, alpha=0 | ✓ K=1, alpha=0 | Tie |
| Batch independence | ✓ Solo comparison + score match | ✓ Solo comparison | Tie |
| Code organization | Model separated from algorithm | Model inlined (MinimalTransformer) | **Qwen** |
| Score verification precision | Exact (-3.0, -6.0) | Exact (-3.0) | **Qwen** |

**Winner: Qwen3-6** — This is close. Both correctly retain EOS beams and use exact logprob control for testing. Qwen3-6 edges ahead with: (a) the length penalty + two EOS beams test (Test 3b), which verifies that the penalty correctly flips the ranking when a longer sequence has higher confidence; (b) better code organization with a separate model file; (c) a Beam class vs a raw tuple for readability.
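The ranking flip that Test 3b checks is easy to state concretely: with length-normalized scores, a longer finished beam with a lower total log-probability can still outrank a shorter one once the penalty divides by a power of the length. A toy illustration, assuming a simple `score = logprob / length**alpha` normalization (the exact penalty formula used in the submissions may differ):

```python
# Two finished (EOS-terminated) beams: (total log-prob, length in tokens).
short_beam = (-3.0, 3)
long_beam = (-4.0, 6)

def normalized(logprob, length, alpha):
    # Assumed penalty: divide by length**alpha; alpha=0 disables normalization.
    return logprob / (length ** alpha)

# Without a penalty the shorter beam wins on raw log-probability...
assert normalized(*short_beam, alpha=0.0) > normalized(*long_beam, alpha=0.0)

# ...but with alpha=0.6 the longer beam is penalized less per token
# and overtakes the shorter one.
assert normalized(*short_beam, alpha=0.6) < normalized(*long_beam, alpha=0.6)
```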
### Qwen3-6 vs GLM-5 Overall: **Qwen3-6 wins 2-0**

---

## Summary Matrix (Round 2)

| Matchup | Flash Attention | Beam Search | Overall |
|---------|-----------------|-------------|---------|
| **GLM-5 vs MiniMax** | GLM | GLM | **GLM 2-0** |
| **MiniMax vs Qwen3-6** | Qwen | Qwen | **Qwen 2-0** |
| **Qwen3-6 vs GLM-5** | Qwen | Qwen | **Qwen 2-0** |

---

## Combined Rankings (Both Rounds)

| Matchup | Round 1 | Round 2 | Trend |
|---------|---------|---------|-------|
| **GLM-5 vs MiniMax** | GLM 3-0 | GLM 2-0 | GLM dominance holds |
| **MiniMax vs Qwen3-6** | Qwen 3-0 | Qwen 2-0 | Qwen dominance holds |
| **Qwen3-6 vs GLM-5** | Qwen 2.5-0.5 | Qwen 2-0 | Qwen gap WIDENS |

### Overall Model Rankings

| Rank | Model | Round 1 Grade | Round 2 Grade | Notes |
|------|-------|---------------|---------------|-------|
| 1 | **Qwen3-6** | A- | A | Gap over #2 widened. Only model that maintained a multi-file architecture in opencode |
| 2 | **GLM-5** | B+ | B+ | Consistent. Strong algorithms, clean code, limited test coverage vs Qwen |
| 3 | **MiniMax-M2.7** | B | B- | Regressed slightly. Confused docstring in flash attn, architectural mess in beam search |

### Key Observations from Round 2

1. **The EOS retention trap didn't catch anyone.** All three models correctly kept finished beams in the pool (a minimal sketch of the pattern follows this list). This suggests either the bug is well represented in training data (many blog posts about "the beam search EOS bug") or the prompt's explicit warning ("Do NOT remove finished beams from the pool") was too heavy a hint.

2. **The rescaling direction trap also didn't catch anyone** in code, but MiniMax's docstring reveals shaky understanding. The model wrote correct code but couldn't explain why — classic pattern-matching behavior. If you modify the recurrence slightly (e.g., use a different normalization), MiniMax would likely produce buggy code because it is reciting rather than reasoning.

3. **The strongest differentiating test was length penalty interaction** (Qwen3-6 Test 3b). Neither GLM-5 nor MiniMax tested that two EOS beams at different lengths interact correctly with the penalty. This is a subtle bug that would pass basic "does EOS stay?" tests.

4. **OpenCode vs pi-mono impact:** The richer harness seemed to encourage verbosity (MiniMax's 80-line self-dialogue) and slightly less modular code (GLM-5 went from multi-file to single-file). Qwen3-6 was unaffected — it maintained its 3-file architecture.

5. **Qwen3-6's local 27B advantage over frontier models is real.** The gap widened in Round 2, suggesting the model's strength is genuine reasoning/engineering discipline rather than luck on the specific tasks. The pattern of doing MORE than asked (extra tests, extra modes, extra files) while maintaining correctness is consistent.
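As a concrete picture of the pattern in observation 1: the correct selection step keeps already-finished beams in the candidate pool at every ranking step instead of discarding them once they emit EOS. A minimal sketch of one step, with hypothetical names not taken from any of the three submissions:

```python
# One beam-search selection step. Each beam is (tokens, logprob, finished_flag).
def select_top_k(finished, expansions, k):
    # expansions: newly extended candidates from the active beams this step.
    # The trap: ranking only `expansions` silently drops beams that already
    # ended in EOS. The fix: pool finished beams with the new candidates.
    pool = expansions + finished
    pool.sort(key=lambda beam: beam[1], reverse=True)  # higher log-prob first
    top = pool[:k]
    active = [b for b in top if not b[2]]
    finished = [b for b in top if b[2]]
    return active, finished

# Toy step: a beam that already emitted EOS must survive the cut.
finished = [((1, 5, 2), -2.5, True)]
expansions = [((1, 5, 7), -2.0, False), ((1, 5, 9), -3.1, False)]
active, finished = select_top_k(finished, expansions, k=2)
assert finished and finished[0][1] == -2.5  # the EOS beam is still in the running
```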