Round 1 Summary: MiniMax-M2.7 vs Qwen3.6-27B
Overall Scoreboard
| Task |
MiniMax-M2.7 |
Qwen3.6-27B |
Winner |
Margin |
| KV Cache |
64/100 |
91/100 |
qwen36 |
+27 |
| Backwards Pass |
76/100 |
92/100 |
qwen36 |
+16 |
| Fused Softmax+TopK |
58/100 |
88/100 |
qwen36 |
+30 |
| Average |
66 |
90 |
qwen36 |
+24 |
Clear winner: Qwen3.6-27B — dominant across all 3 tasks.
Task 1: KV Cache System
| Dimension |
MiniMax-M2.7 |
Qwen3.6-27B |
| Correctness |
55 |
92 |
| Completeness |
75 |
95 |
| Code Quality |
60 |
88 |
| Depth of Analysis |
78 |
90 |
| Optimizations |
72 |
90 |
| GPU Mapping |
75 |
88 |
| Tests/Demos |
30 |
95 |
| Overall |
64 |
91 |
MiniMax-M2.7 Critical Issues
- Inverted causal mask — masks the wrong triangle, allowing attention to future tokens
- Broken batched caching — all batch elements share the same
kv_cache dict keyed only by layer, not by batch item
- Prefill doesn't store KV — prefill KV tensors never stored in persistent cache
- No tests — only a 3-step hardcoded demo with zero assertions
- 1,720-line monolith — everything crammed into one file
Qwen3.6-27B Strengths
- 10 passing demos with numerical validation (cached attention diff < 1e-5, chunked prefill diff = 4.56e-10)
- Modular 7-file architecture — clean separation of concerns
- Correct variable-length batching — proper causal + length masks
- 3 working optimizations — paged attention, int8 quantization, chunked prefill (all tested)
- Quantitative analysis — arithmetic intensity calculations, per-GPU context limits, real model comparisons (Llama, Mistral, GPT-4)
Task 2: Layer Norm Backward Pass
| Dimension |
MiniMax-M2.7 |
Qwen3.6-27B |
| Correctness |
85 |
95 |
| Completeness |
80 |
95 |
| Code Quality |
70 |
90 |
| Numerical Stability |
75 |
95 |
| Gradient Check |
80 |
90 |
| Complexity Analysis |
80 |
90 |
| GPU Fusion |
85 |
85 |
| Tests/Benchmarks |
60 |
95 |
| Overall |
76 |
92 |
MiniMax-M2.7 Weaknesses
- Over-caching: Stores 10 cache items when only 3 tensors are needed
- No edge-case tests: No tests for zero input, D=1, large offsets
- No concrete stability demo: Discusses catastrophic cancellation but never demonstrates it
- Monolithic 750-line file: Everything mixed together
- Fragile gradient check: Modifies input in-place without a copy
Qwen3.6-27B Strengths
- Minimal cache: Only 4 items (x_hat, std_inv, glm5, D) — exactly what's needed
- Concrete stability demo: Shows naive variance fails at offset=1e8 while two-pass stays exact
- 3-file separation: Core + tests + benchmarks
- Edge-case tests: Zero input, D=1, large D (1024), large mean, scale invariance
- Alternative derivation cross-check: Independent step-by-step chain rule verifies compact formula (<1e-10 error)
Task 3: Fused Softmax + TopK CUDA
| Dimension |
MiniMax-M2.7 |
Qwen3.6-27B |
| Correctness |
40 |
95 |
| Completeness |
65 |
90 |
| Code Quality |
60 |
85 |
| CUDA Depth |
65 |
92 |
| Memory Design |
55 |
90 |
| Complexity Analysis |
60 |
88 |
| Naive Comparison |
55 |
88 |
| Overall |
58 |
88 |
MiniMax-M2.7 Critical Issues
- Broken inter-warp top-k merge: Only ~100 of 256 threads contribute to final merge; 156 threads' results silently discarded → produces incorrect top-k
- Compilation-stopping typo:
topp_prob instead of topk_prob
- Misleading bandwidth claims: Claims "4× reduction" but only counts one of three passes
- Zero testing infrastructure: No benchmark harness, no CPU reference, no correctness verification
Qwen3.6-27B Strengths
- Two kernel versions (v1 + optimized v2 with vectorized float4 loads)
- Correct warp-by-warp merge — properly collects all 4096 candidates
- Shared-memory min-heap for O(log K) insertions
- Complete benchmark harness with CPU reference and correctness tests
- Honest 3-pass bandwidth analysis — correctly identifies kernel as compute-bound (expf throughput)
What Separated These Two
| Factor |
MiniMax-M2.7 |
Qwen3.6-27B |
| Correctness |
Buggy in all 3 tasks |
Correct in all 3 |
| Testing |
None / minimal |
Comprehensive with assertions |
| Analysis depth |
High-level / conceptual |
Quantitative with real numbers |
| Code organization |
Monolithic |
Modular and focused |
| Engineering rigor |
Claims untested |
Every claim validated |
The decisive pattern: MiniMax-M2.7 was conceptually broad but executionally weak — it mentioned many optimizations and ideas but delivered buggy, untested code. Qwen3.6-27B was narrower in scope but flawlessly executed — every claim backed by working, validated code.