Files
deep_pro_judge/analysis/cross-model-comparison.md
T
sleepy 45c3aad453 feat: expand to 6 models, 8 challenges; rewrite README with DeepSeek V4 Pro analysis
- Add Claude Opus 4.7, Kimi K2.6, GLM-5.1 to existing GLM-5, Qwen3-6, MiniMax-M2.7
- Add 5 new challenges: flash attention fwd/bwd, beam search, DFlash, ternary training
- Rewrite README with TL;DR rankings, grade matrix, and DeepSeek V4 Pro attribution
- Add analysis/ folder with cross-model comparisons and per-challenge deep dives
- Add deploy_challenges.sh script
- Expand .gitignore to exclude Python envs, ML weights, and build artifacts
2026-04-27 18:49:22 +02:00

21 KiB
Raw Blame History

Comprehensive Cross-Model Comparison: All Challenges

This synthesizes results across all 8 challenges: Layer Norm Backward, Fused Softmax+Top-K, KV-Cache, Flash Attention Forward, Beam Search, Flash Attention Backward, DFlash, and Ternary Training.

Models that participated in ≥2 rounds: GLM-5 (8 challenges, all), Qwen3-6 (7), Claude Opus 4.7 (7, absent ternary), Kimi K2.6 (3), GLM-5.1 (3), MiniMax-M2.7 (4, absent from last 4 rounds).


Per-Challenge Grade Matrix

Challenge Difficulty GLM-5 Qwen3-6 Opus 4.7 Kimi K2.6 GLM-5.1 MiniMax
Layer Norm Backward Medium B+ (2nd) A- (1st) A (1st) B (3rd)
Fused Softmax+TopK Medium A- (tie) A (tie) A (tie) B (4th)
KV-Cache Medium A- (2nd) A (1st) B+ (3rd) B- (4th)
Flash Attn Forward Hard A- (2nd) A (1st) A- (tie 2nd) B (4th)
Beam Search Hard B+ (3rd) A (1st) A (1st) B- (4th)
Flash Attn Backward Extra Hard A- (2nd) A- (3rd) A- (tie 2nd) A (1st) B+ (4th)
DFlash Extra Hard A (1st) B- (4th=) A (1st) B- (4th=) B+ (3rd)
Ternary Training SOTA Research A- (1st) B+ (2nd) B+ (3rd) C (5th) C+ (4th)

Note: The real Ternary Bonsai 8B HF model card confirms "ternary coverage: Embeddings, attention projections, MLP projections, LM head" — embeddings ARE ternary in the shipping model. Opus 4.7's decision to leave them non-ternary, while well-argued, is a deviation from both the spec and the actual shipped model. GLM-5 alone matches the real Bonsai's ternary coverage.


Head-to-Head Breakdown

GLM-5 vs Qwen3-6 (7 overlapping challenges)

Challenge Winner Margin
Layer Norm Backward Qwen Decisive (edge cases, cross-verification, broader GPU fusion scope)
Fused Softmax+TopK Split Qwen wins on production (float4, K≤256), GLM wins on algorithm (true single-pass online softmax)
KV-Cache Qwen Decisive (8 modular files, 10 demos, GQA/MQA, GPU mapping, 6 real model specs)
Flash Attn Forward Qwen Narrow (5 tests vs 2, batched einsum, uneven tiles tested)
Beam Search Qwen Decisive (Beam class, 4 tests, length penalty×EOS interaction tested)
Flash Attn Backward GLM Narrow (D-optimization single-pass vs Qwen's two-pass, higher precision)
DFlash GLM Decisive (only model with correct branching logits + proper subtree invalidation test)
Ternary Training GLM Moderate (clean rerun PPL=594 in 250 steps; Qwen's honest PPL=319 is better generalization but original was inflated)

Record: Qwen3-6 leads 4-3-1

The arc is consistent: Qwen3-6 dominates the early-round "engineering breadth" challenges (backwards, KV-cache, beam search), but GLM-5 wins all three late-round challenges (flash attn bwd, DFlash, ternary). This pattern is too strong to be coincidence:

  • GLM-5 is stronger on deeper algorithmic reasoning (backward passes, tree attention, unconventional training)
  • Qwen3-6 is stronger on engineering breadth and production-minded implementation
  • As challenges get harder and more open-ended, GLM-5's correctness-first approach wins

GLM-5 vs Kimi K2.6 (3 overlapping challenges)

Challenge Winner Margin
Flash Attn Backward Kimi Narrow (cleaner code, 3.4e-09 precision vs 1.2e-07, but GLM's D-optimization is more efficient)
DFlash GLM Decisive (correct branching logits vs Kimi's chain-only positional extraction)
Ternary Training GLM Decisive (honest PPL=594 vs 5,501; Kimi's embeddings not ternary, catastrophic overfit)

Record: GLM-5 leads 2-1

Kimi K2.6 has beaten GLM-5 on exactly one challenge: flash attn bwd (best precision, cleanest code). But in both DFlash and Ternary, fundamental correctness issues (broken logits for branching, non-ternary embeddings, overfitting) separate them substantially.

GLM-5 vs GLM-5.1 (3 overlapping challenges)

Challenge Winner Margin
Flash Attn Backward GLM-5 Narrow (lower errors: 1.2e-07 vs 8.5e-06, GLM-5.1's break optimization is fragile)
DFlash GLM-5 Decisive (proper branching subtree invalidation test; GLM-5.1 only tests chain)
Ternary Training GLM-5 Decisive (honest PPL=594 vs 30,731; GLM-5.1 catastrophically overfit at 1500 steps)

Record: GLM-5 leads 3-0

GLM-5.1 is consistently a regression from GLM-5 — same architectural instincts (parent-indexed DFlash logits, ternary embeddings) but worse execution: higher numerical errors, insufficient testing, catastrophic overfitting due to poor hyperparameter choices. GLM-5.1 writes code that looks right but falls apart under stress (small data, many steps).

Qwen3-6 vs Kimi K2.6 (3 overlapping challenges)

Challenge Winner Margin
Flash Attn Backward Kimi Narrow (better precision, cleaner code; Qwen has lower memory and more tests)
DFlash Qwen Narrow (both B-, but Qwen has 7 tests vs 3, golden test, logits consistency bonus)
Ternary Training Qwen Decisive (honest PPL=319 vs 5,501; Kimi's embeddings not ternary, catastrophic overfit)

Record: Qwen3-6 leads 2-1

Qwen3-6 and Kimi share the same DFlash logits bug (depth-based vs positional), keeping them in B- range. Kimi's flash attn bwd precision is genuinely better. But Qwen3-6's ternary implementation is fundamentally more correct — all layers ternary, PPL=319 vs 5,501.

Qwen3-6 vs GLM-5.1 (2 overlapping challenges: DFlash, Ternary)

Challenge Winner Margin
DFlash GLM-5.1 Algorithmic correctness (parent-indexed logits) vs Qwen's broken depth-based approach
Ternary Training Qwen Decisive (honest PPL=319 vs 30,731; GLM-5.1 catastrophically overfit)

Record: Tied 1-1

A fascinating split. GLM-5.1 gets DFlash right where Qwen3-6 gets it wrong — the parent-indexed logits insight that only GLM lineage models caught. But in ternary training, the roles reverse: Qwen3-6 generalizes reasonably (PPL=319) while GLM-5.1 completely collapses. Qwen3-6's engineering discipline (train/val separation, moderate step counts) is the difference.

Kimi K2.6 vs GLM-5.1 (3 overlapping challenges)

Challenge Winner Margin
Flash Attn Backward Kimi Decisive (3.4e-09 vs 8.5e-06, cleaner code)
DFlash GLM-5.1 Decisive (correct parent-indexed logits vs Kimi's broken positional extraction)
Ternary Training Kimi Narrow (both overfit catastrophically: PPL=5,501 vs 30,731; Kimi's is less bad)

Record: Kimi leads 2-1


Opus 4.7 Head-to-Head (7 overlapping challenges)

Opus 4.7 vs GLM-5 (8 challenges — all)

Challenge Winner Margin
Layer Norm Backward Opus Narrow (A vs B+; better docs, GPU fusion sketch, 5 edge cases, optimal 3-cache)
Fused Softmax+TopK Tie Both A: single-pass online softmax, template-based K, excellent design docs
KV-Cache GLM-5 Moderate (both correct; GLM-5 has actual tests, Opus uses Python lists)
Flash Attn Forward Tie Both A-; Opus has memory test and causal row-0 check; GLM-5 uses MLX
Beam Search Opus Moderate (A vs B+; both correct, Opus has mock model EOS test, 2K expansion)
Flash Attn Backward Tie Both A-; Opus has better tests, GLM-5 has batched D-optimization
DFlash Tie Both A; only two models with correct parent-indexed logits
Ternary Training GLM-5 Narrow (PPL=594 vs 643 at 250 steps; GLM has ternary embeddings matching real Bonsai's shipped coverage; Opus left embeddings non-ternary with a well-argued but factually-incorrect justification)

Record: Opus 4.7 leads 2-2-4

GLM-5's embedding decision was correct per both spec and the real model. This is the one place where Opus 4.7's "read the literature" approach backfired — BitNet b1.58 the paper may keep embeddings higher precision, but PrismML's shipped Ternary Bonsai doesn't. Following the spec was right.

Opus 4.7 vs Qwen3-6

Challenge Winner Margin
Layer Norm Backward Opus Narrow (A vs A-; more edge cases, GPU fusion, optimal cache)
Fused Softmax+TopK Opus Narrow (single-pass online vs Qwen's 2-pass; both have template K)
KV-Cache Qwen Decisive (8 modular files + 10 demos vs Python lists + good analysis)
Flash Attn Forward Qwen Narrow (batched einsum vs unbatched per-(b,h) loops)
Beam Search Tie Both A; correct EOS retention, mock model tests
Flash Attn Backward Opus Narrow (better tests: dV ALL elements; Qwen has lower memory)
DFlash Opus Decisive (correct parent-indexed logits vs Qwen's broken depth-based approach)

Record: Opus 4.7 leads 4-1-2

Opus beats Qwen3-6 on algorithmic depth (DFlash, fuse, backwards) while Qwen wins on engineering breadth (KV-cache modularity, batched flash attention). This mirrors the GLM-5 vs Qwen3-6 dynamic: algorithmic correctness wins the hard challenges; production engineering wins the broad ones.

Opus 4.7 vs Kimi K2.6 (2 overlapping challenges)

Challenge Winner Margin
Flash Attn Backward Kimi Narrow (3.4e-09 precision vs Opus's unbatched but well-tested approach)
DFlash Opus Decisive (correct parent-indexed logits vs Kimi's broken positional extraction)

Record: Tied 1-1

Opus 4.7 vs GLM-5.1 (2 overlapping challenges)

Challenge Winner Margin
Flash Attn Backward Opus Moderate (better dV test coverage, no fragile break optimization)
DFlash Tie Both A/B+: parent-indexed logits correct; GLM-5.1 has insufficient tests

Record: Opus 4.7 leads 1-0-1


The Shape of Each Model

GLM-5 — The Algorithmic Thinker

Strengths: Algorithmic elegance (single-pass online softmax, D-optimization in backward, parent-indexed DFlash logits, @mx.custom_function STE with explicit VJP), code clarity, correctness-first, concise implementations. The only model that participated in all 8 challenges and the only model that never declined in grade.

Weakness: Limited scope (single-file, fewer tests, no GQA/MQA, K≤32 in fuse kernel).

Signature pattern: True single-pass fused softmax. Parent-indexed DFlash logits. Gradient clipping at norm=1.0 in ternary. Values "why" over "how much."

Best showing: DFlash (only model with fully correct branching tree verification). Ternary (only model robust across two completely different datasets).

Claude Opus 4.7 — The Deep Generalist

Strengths: Algorithmic depth across every domain: correct parent-indexed DFlash logits, single-pass online softmax in fuse, optimal 3-item cache in backwards, correct EOS retention in beam search, proper causal skip in flash attention. The only model besides GLM-5 to catch the DFlash logits trap. In ternary: PPL=643 at 250 steps (tied with GLM-5), best FINDINGS.md of any model (weight_decay=0.0 rationale, paragraph-based val split, BitNet embedding convention, warmup analysis). Remarkably consistent — scores range from A to B+ across all challenges. Participated in all 8 challenges.

Weakness: Uses raw Python lists for KV-cache (toy-grade implementation despite excellent ANALYSIS.md). Unbatched per-(b,h) loops in flash attention. Leaving embeddings non-ternary is well-justified (BitNet b1.58 convention, gathers not matmuls, confirmed against real GGUFs) but deviates from the prompt's spec.

Signature pattern: Desktop-quality CUDA kernel design with single-pass streaming. Parent-indexed DFlash logits. Paragraph-based train/val split that no other model thought to do. Weight_decay=0.0 with a paragraph explaining why (ternary threshold crossing). Reads the research literature, not just the prompt.

Best showing: DFlash and Fuse (A grades). Ternary FINDINGS.md (best write-up). Backwards (A grade, best numerical stability discussion).

Qwen3-6 — The Full-Stack Engineer

Strengths: Modular code (3-8 files per task), comprehensive tests (always exceeds minimum), edge cases handled, real model specs, GPU mapping, attention variants (GQA/MQA), cross-verification of formulas.

Weakness: Breadth over depth. Falls behind on deep algorithmic reasoning (DFlash depth-based logits bug, flash attn bwd two-pass). Data leakage in original ternary run inflated results.

Signature pattern: Writes Beam class with __slots__, separates model.py from algorithm.py, tests N=300, tile_size=97 to catch uneven tile bugs.

Best showing: KV-Cache (8 files, 10 demos, 6 real model specs, GQA/MQA).

Kimi K2.6 — The Precision Instrument

Strengths: Best numerical precision (3.4e-09 dV error), cleanest code.

Weakness: Only 3 challenges. Pattern-matches prompts rather than understanding (DFlash positional logits, ternary embeddings not ternary).

Best showing: Flash Attention Backward (A grade, best precision).

GLM-5.1 — The Weaker Sibling

Strengths: Correct parent-indexed DFlash logits. Best debugging docs (MLX __dict__ trap).

Weakness: Regression from GLM-5 in every dimension. Catastrophic ternary overfitting (PPL=30K).

Best showing: DFlash (correct algorithm, insufficient tests). The break vs continue in flash attn's causal loop, the assert n % group_size == 0 in ternary, the 1500 steps on 48K tokens — all examples of getting the big picture but missing the detail that makes it work.


The Trajectory: Who Got Better (and Where) Over Time?

Over 5 rounds of increasingly difficult challenges:

        R1     R2     R3    DFlash  Ternary
GLM-5:  B+  →  B+  →  A-  →  A   →  A-    (rising, then sustained)
Opus:   A   →  A-  →  A-  →  A   →  B+    (strong early, slight late dip)
Qwen:   A-  →  A   →  A-  →  B-  →  B+    (early peak, late dip, partial recovery)
Kimi:   —   →  —   →  A   →  B-  →  C     (flash of brilliance, steep decline)
GLM5.1: —   →  —   →  B+  →  B+  →  C+    (flat then fell off)
MiniM:  B   →  B-  →  —   →  —   →  —     (exited)

Opus 4.7 and GLM-5 are mirror images. GLM-5 rises as challenges get harder; Opus maintains a flat A-tier from the start. Both end at A- in ternary. Both are the only models with A grades in DFlash. The difference is trajectory shape: GLM-5 grows into hard problems, Opus comes pre-loaded to handle them.

GLM-5 is the only model that doesn't decline. It rises through R3/DFlash and holds at A- into ternary. This is the trajectory you want: consistent competence that improves or maintains as challenges get harder and more open-ended.

Qwen3-6 has the steepest decline. From A/A- in early rounds to B- in DFlash, then partial recovery to B+ in ternary after the data leakage was corrected. The pattern suggests ceiling effects — Qwen3-6's breadth-first approach excels when the challenge rewards systematic testing (early rounds), but struggles when the challenge requires deep algorithmic insight (DFlash) or hyperparameter discipline (ternary overfitting).

Kimi and GLM-5.1 have a "one-hit wonder" pattern. Kimi's flash attn bwd (A) is genuinely excellent but surrounded by B- and C performances. GLM-5.1's parent-indexed DFlash logits are correct but everything else is regression from GLM-5. Neither is reliable across diverse challenges.


Aggregate Ranking

Rank Model Avg Grade R1 (3) R2 (2) R3 (1) DFlash Ternary Strength Weakness
1 GLM-5 A-/B+ B+ (2nd=) B+ (2nd) A- (2nd) A (1st=) A- (1st) Algorithmic reasoning, never declines, ternary coverage matches real Bonsai, gradient clipping Scope (fewer tests, single-file)
2 Opus 4.7 A-/B+ A (1st) A- (1st=) A- (1st) A (1st=) B+ (3rd) Algorithmic depth, best docs (FINDINGS.md, ANALYSIS.md), highest floor, paragraph-based val split Non-ternary embeddings (deviates from spec + real Bonsai), Python lists in KV
3 Qwen3-6 B+ A- (2nd=) A (1st) A- (3rd) B- (4th=) B+ (2nd) Engineering breadth, modularity, generalization DFlash logits trap, data leakage, algorithmic depth ceiling
4 Kimi K2.6 B A (1st) B- (4th=) C (4th) Numeric precision Only 3 challenges; DFlash/ternary bugs
5 GLM-5.1 B-/C+ B+ (4th) B+ (3rd) C+ (3rd) Correct DFlash algorithm Regression from GLM-5; catastrophic ternary overfitting
6 MiniMax-M2.7 B/B- B (3rd) B- (3rd) Ambition Bugs, no tests, exited after 4 rounds

GLM-5 takes sole #1. The real Ternary Bonsai model card confirms that embeddings ARE ternary in the shipped model — matching GLM-5's implementation and the prompt spec. Opus 4.7's non-ternary embedding decision, while the most carefully argued deviation in the field, is both a spec deviation and factually mismatched with what PrismML actually ships. The PPL difference (GLM-5: 594 vs Opus: 643 at 250 steps) is small, but the correctness of the architectural choice is now unambiguous.


The Two Most Differentiating Moments

1. DFlash logits extraction (Algorithmic Insight Test)

The prompt's pseudo-code says tree_logits = logits[len(generated_tokens):] which is wrong for branching trees. Only three models corrected this to parent-indexed logits: GLM-5, GLM-5.1, and Opus 4.7. Qwen3-6 and Kimi K2.6 — widely considered top-tier coding models — both took the pseudo-code at face value.

In this test:

  • GLM-5: Understands ✓ (proper branching subtree invalidation test)
  • Opus 4.7: Understands ✓ (chain-following acceptance, proper subtree invalidation)
  • GLM-5.1: Understands ✓ (but chain-only tests)
  • Qwen3-6: Pattern-matches ✗ (depth-based logits)
  • Kimi K2.6: Pattern-matches ✗ (positional logits, chain-only)

2. Ternary clean-data rerun (Engineering Discipline Test)

When models were given identical train_data.txt (48K tokens):

  • GLM-5: PPL=594 in 250 steps — moderate overfitting, still learning
  • Qwen3-6: PPL=319 in 300 steps — best generalization, disclosed data leakage
  • Kimi K2.6: PPL=5,501 with train loss 0.016 — memorized completely
  • GLM-5.1: PPL=30,731 with train loss 0.18 — catastrophic overfitting

3. Single-pass vs. two-pass fuse (Algorithmic Taste Test)

Only GLM-5 and Opus 4.7 implemented true single-pass online softmax for fuse. Qwen3-6, Kimi, and MiniMax all use two-pass. This distinction doesn't affect correctness (all pass), but it reveals models that think about "can I collapse this into one stream?" vs "what's the standard recipe?" — a proxy for algorithmic taste.


If You Could Only Pick One Model

Criterion Pick Why
Build a production system Qwen3-6 Modular, tested, handles edge cases, thinks about GPU limits and real model specs
Solve a hard algorithmic problem GLM-5 / Opus 4.7 Both caught DFlash logits trap, both do single-pass online softmax
Write numerically perfect code Kimi K2.6 Best precision, cleanest code structure (for challenges that play to its strengths)
Get a correct answer quickly GLM-5 Concisest implementations, fewest lines, correct-first philosophy
Most reliable across diverse tasks Opus 4.7 Narrowest grade range (B+ to A across 7 challenges), most consistent performer
Most improved / highest ceiling GLM-5 Only model that IMPROVES as challenges get harder
Best documentation & transparency Opus 4.7 / GLM-5 Both write excellent design docs with bandwidth analysis, GPU mapping, failure modes
Most complete participation GLM-5 Only model in all 8 challenges (including ternary)

The Definitive Answer

GLM-5 is the clear #1 across all 8 challenges. It participated in every challenge, never declined in grade, won the hardest challenges (DFlash, ternary), and — uniquely — is the only model whose ternary implementation matches what PrismML actually ships (embeddings included). It also has the key hyperparameter insight (gradient clipping at norm=1.0) that no other model documented.

Opus 4.7 is a strong #2 — highest floor of any model (B+ to A range), best documentation, and the only other model to catch the DFlash logits trap and implement single-pass fuse. Its one meaningful miss was leaving embeddings non-ternary with a justification that turned out to be factually incorrect.

Qwen3-6 is the best production engineer but has a clear algorithmic ceiling. Great for well-specified problems; verify for correctness on deep reasoning tasks.

Kimi K2.6 is the best numerical programmer but only for the narrow class of problems it's good at. Beautiful code, wrong algorithms on harder challenges.