deep_pro_judge/analysis/minimax-m2.7_vs_qwen36_summary.md
# Round 1 Summary: MiniMax-M2.7 vs Qwen3.6-27B

## Overall Scoreboard

| Task | MiniMax-M2.7 | Qwen3.6-27B | Winner | Margin |
|---|---|---|---|---|
| KV Cache | 64/100 | 91/100 | qwen36 | +27 |
| Backwards Pass | 76/100 | 92/100 | qwen36 | +16 |
| Fused Softmax+TopK | 58/100 | 88/100 | qwen36 | +30 |
| **Average** | **66** | **90** | qwen36 | **+24** |

**Clear winner: Qwen3.6-27B**, dominant across all three tasks.


## Task 1: KV Cache System

| Dimension | MiniMax-M2.7 | Qwen3.6-27B |
|---|---|---|
| Correctness | 55 | 92 |
| Completeness | 75 | 95 |
| Code Quality | 60 | 88 |
| Depth of Analysis | 78 | 90 |
| Optimizations | 72 | 90 |
| GPU Mapping | 75 | 88 |
| Tests/Demos | 30 | 95 |
| **Overall** | **64** | **91** |

### MiniMax-M2.7 Critical Issues

- **Inverted causal mask:** masks the wrong triangle, allowing attention to future tokens
- **Broken batched caching:** all batch elements share one `kv_cache` dict keyed only by layer, not by batch item
- **Prefill doesn't store KV:** the KV tensors computed during prefill are never written to the persistent cache
- **No tests:** only a 3-step hardcoded demo with zero assertions
- **1,720-line monolith:** everything crammed into one file
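
To make the first bug concrete, a correct causal mask keeps the lower triangle: query position i may attend only to key positions j <= i. A minimal NumPy sketch (illustrative, not code from either submission):

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Boolean mask, True where attention is allowed.

    Row i is the query position; it may attend to key positions
    j <= i (the lower triangle, diagonal included). Masking the
    upper triangle instead -- the inverted-mask bug -- would let
    every token attend to future tokens.
    """
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def apply_causal_mask(scores: np.ndarray) -> np.ndarray:
    """Set disallowed positions to -inf before softmax."""
    mask = causal_mask(scores.shape[-1])
    return np.where(mask, scores, -np.inf)

m = causal_mask(4)
assert m[0].tolist() == [True, False, False, False]   # token 0 sees only itself
assert m[3].tolist() == [True, True, True, True]      # last token sees everything
```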

### Qwen3.6-27B Strengths

- **10 passing demos** with numerical validation (cached-attention diff < 1e-5, chunked-prefill diff = 4.56e-10)
- **Modular 7-file architecture:** clean separation of concerns
- **Correct variable-length batching:** proper causal and length masks
- **3 working optimizations:** paged attention, int8 quantization, chunked prefill (all tested)
- **Quantitative analysis:** arithmetic-intensity calculations, per-GPU context limits, comparisons against real models (Llama, Mistral, GPT-4)
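
The style of numerical validation credited here (comparing incrementally cached attention against full recomputation at every step) can be sketched in a few lines of single-head NumPy. This is an illustrative sketch, not Qwen3.6-27B's actual demo code:

```python
import numpy as np

rng = np.random.default_rng(0)
T, D = 8, 16                              # decode steps, head dimension
q_all = rng.standard_normal((T, D))
k_all = rng.standard_normal((T, D))
v_all = rng.standard_normal((T, D))

def attend(q, K, V):
    """One query attending over all keys/values seen so far."""
    s = (K @ q) / np.sqrt(D)
    w = np.exp(s - s.max())               # numerically stable softmax
    w /= w.sum()
    return w @ V

# Incremental decode: append this step's K/V to a growing cache.
K_cache = np.empty((0, D)); V_cache = np.empty((0, D))
cached = []
for t in range(T):
    K_cache = np.vstack([K_cache, k_all[t:t+1]])
    V_cache = np.vstack([V_cache, v_all[t:t+1]])
    cached.append(attend(q_all[t], K_cache, V_cache))

# Reference: recompute attention from scratch at every step.
full = [attend(q_all[t], k_all[:t+1], v_all[:t+1]) for t in range(T)]

diff = max(np.abs(c - f).max() for c, f in zip(cached, full))
assert diff < 1e-5                        # caching must not change the math
```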

## Task 2: Layer Norm Backward Pass

| Dimension | MiniMax-M2.7 | Qwen3.6-27B |
|---|---|---|
| Correctness | 85 | 95 |
| Completeness | 80 | 95 |
| Code Quality | 70 | 90 |
| Numerical Stability | 75 | 95 |
| Gradient Check | 80 | 90 |
| Complexity Analysis | 80 | 90 |
| GPU Fusion | 85 | 85 |
| Tests/Benchmarks | 60 | 95 |
| **Overall** | **76** | **92** |

### MiniMax-M2.7 Weaknesses

- **Over-caching:** stores 10 cache items when only 3 tensors are needed
- **No edge-case tests:** nothing for zero input, D=1, or large offsets
- **No concrete stability demo:** discusses catastrophic cancellation but never demonstrates it
- **Monolithic 750-line file:** everything mixed together
- **Fragile gradient check:** modifies the input in place without making a copy
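
A robust finite-difference gradient check perturbs a fresh copy of the input on every probe instead of mutating it in place. A NumPy sketch that verifies the standard compact layer-norm backward formula (no affine parameters; illustrative, not MiniMax-M2.7's code):

```python
import numpy as np

EPS = 1e-5

def layer_norm(x):
    mu = x.mean(-1, keepdims=True)
    var = x.var(-1, keepdims=True)        # biased variance, as in LayerNorm
    return (x - mu) / np.sqrt(var + EPS)

def num_grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar function f at x.

    Each probe perturbs a *copy* of x, so the caller's array is never
    mutated -- avoiding the fragile in-place pattern noted above.
    """
    g = np.zeros_like(x)
    for i in np.ndindex(x.shape):
        xp = x.copy(); xp[i] += h
        xm = x.copy(); xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2 * h)
    return g

rng = np.random.default_rng(1)
x = rng.standard_normal((2, 5))
dy = rng.standard_normal((2, 5))          # a fixed upstream gradient
loss = lambda t: (layer_norm(t) * dy).sum()

# Compact analytic backward: dx = (dy - mean(dy) - x_hat*mean(dy*x_hat)) / std
mu = x.mean(-1, keepdims=True)
std = np.sqrt(x.var(-1, keepdims=True) + EPS)
x_hat = (x - mu) / std
dx = (dy - dy.mean(-1, keepdims=True)
         - x_hat * (dy * x_hat).mean(-1, keepdims=True)) / std

assert np.abs(dx - num_grad(loss, x)).max() < 1e-6
```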

### Qwen3.6-27B Strengths

- **Minimal cache:** only 4 items (`x_hat`, `std_inv`, `gamma`, `D`), exactly what's needed
- **Concrete stability demo:** shows the naive variance formula failing at offset = 1e8 while the two-pass version stays exact
- **3-file separation:** core + tests + benchmarks
- **Edge-case tests:** zero input, D=1, large D (1024), large mean, scale invariance
- **Alternative-derivation cross-check:** an independent step-by-step chain-rule derivation verifies the compact formula (< 1e-10 error)
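
A stability demo of this kind fits in a few lines. The sketch below (an illustration of the same failure mode, not Qwen3.6-27B's code) uses the same offset = 1e8 setup: the one-pass formula E[x^2] - E[x]^2 cancels catastrophically, while the two-pass formula subtracts the mean first:

```python
import numpy as np

def var_naive(x):
    """One-pass textbook formula E[x^2] - E[x]^2 (cancellation-prone)."""
    return np.mean(x * x) - np.mean(x) ** 2

def var_two_pass(x):
    """Subtract the mean first, then average the squared deviations."""
    d = x - np.mean(x)
    return np.mean(d * d)

# A tiny spread riding on a huge offset: the squares are ~1e16, so
# float64 cannot resolve a true variance of ~6.7e-5 by subtraction.
x = 1e8 + np.array([0.0, 0.01, 0.02])
true_var = 2e-4 / 3

assert abs(var_two_pass(x) - true_var) / true_var < 1e-4   # still accurate
assert abs(var_naive(x) - true_var) / true_var > 0.5       # precision destroyed
```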

## Task 3: Fused Softmax + TopK CUDA

| Dimension | MiniMax-M2.7 | Qwen3.6-27B |
|---|---|---|
| Correctness | 40 | 95 |
| Completeness | 65 | 90 |
| Code Quality | 60 | 85 |
| CUDA Depth | 65 | 92 |
| Memory Design | 55 | 90 |
| Complexity Analysis | 60 | 88 |
| Naive Comparison | 55 | 88 |
| **Overall** | **58** | **88** |

### MiniMax-M2.7 Critical Issues

- **Broken inter-warp top-k merge:** only ~100 of 256 threads contribute to the final merge; the other 156 threads' results are silently discarded, producing an incorrect top-k
- **Compilation-stopping typo:** `topp_prob` instead of `topk_prob`
- **Misleading bandwidth claims:** a claimed "4x reduction" counts only one of the kernel's three passes
- **Zero testing infrastructure:** no benchmark harness, no CPU reference, no correctness verification

### Qwen3.6-27B Strengths

- **Two kernel versions:** v1, plus an optimized v2 with vectorized `float4` loads
- **Correct warp-by-warp merge:** properly collects all 4096 candidates
- **Shared-memory min-heap:** O(log K) insertions
- **Complete benchmark harness:** CPU reference plus correctness tests
- **Honest 3-pass bandwidth analysis:** correctly identifies the kernel as compute-bound (`expf` throughput)
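
A CPU reference of the kind such a harness compares against pairs a numerically stable softmax with an exact top-k; the `heapq` min-heap below mirrors the O(log K)-insertion structure mentioned above. A Python sketch (illustrative, not the submission's code):

```python
import heapq
import numpy as np

def softmax_topk_ref(row: np.ndarray, k: int):
    """CPU reference: stable softmax over one row, then exact top-k.

    Subtracting the row max before exp mirrors what a correct kernel
    must do before expf; the size-k min-heap gives O(log k) insertions,
    the same structure as a shared-memory heap on the GPU.
    """
    e = np.exp(row - row.max())
    probs = e / e.sum()

    heap = []                             # min-heap of (prob, index) pairs
    for idx, p in enumerate(probs):
        if len(heap) < k:
            heapq.heappush(heap, (p, idx))
        elif p > heap[0][0]:              # beats the current k-th best
            heapq.heapreplace(heap, (p, idx))
    return probs, sorted(heap, reverse=True)   # descending by probability

rng = np.random.default_rng(0)
probs, top = softmax_topk_ref(rng.standard_normal(4096), k=8)
assert abs(probs.sum() - 1.0) < 1e-9
assert [i for _, i in top] == list(np.argsort(probs)[::-1][:8])
```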

## What Separated These Two

| Factor | MiniMax-M2.7 | Qwen3.6-27B |
|---|---|---|
| Correctness | Buggy in all 3 tasks | Correct in all 3 |
| Testing | None / minimal | Comprehensive, with assertions |
| Analysis depth | High-level / conceptual | Quantitative, with real numbers |
| Code organization | Monolithic | Modular and focused |
| Engineering rigor | Claims untested | Every claim validated |

**The decisive pattern:** MiniMax-M2.7 was conceptually broad but weak in execution: it mentioned many optimizations and ideas, yet delivered buggy, untested code. Qwen3.6-27B was narrower in scope but executed flawlessly, with every claim backed by working, validated code.