# Round 1 Summary: MiniMax-M2.7 vs Qwen3.6-27B

## Overall Scoreboard

| Task | MiniMax-M2.7 | Qwen3.6-27B | Winner | Margin |
|------|--------------|-------------|--------|--------|
| **KV Cache** | **64/100** | **91/100** | qwen36 | +27 |
| **Backwards Pass** | **76/100** | **92/100** | qwen36 | +16 |
| **Fused Softmax+TopK** | **58/100** | **88/100** | qwen36 | +30 |
| **Average** | **66** | **90** | **qwen36** | **+24** |

**Clear winner: Qwen3.6-27B — dominant across all 3 tasks.**

---

## Task 1: KV Cache System

| Dimension | MiniMax-M2.7 | Qwen3.6-27B |
|-----------|--------------|-------------|
| Correctness | 55 | 92 |
| Completeness | 75 | 95 |
| Code Quality | 60 | 88 |
| Depth of Analysis | 78 | 90 |
| Optimizations | 72 | 90 |
| GPU Mapping | 75 | 88 |
| Tests/Demos | 30 | 95 |
| **Overall** | **64** | **91** |

### MiniMax-M2.7 Critical Issues

- **Inverted causal mask** — masks the wrong triangle, allowing attention to future tokens
- **Broken batched caching** — all batch elements share the same `kv_cache` dict, keyed only by layer, not by batch item
- **Prefill doesn't store KV** — prefill KV tensors are never written to the persistent cache
- **No tests** — only a 3-step hardcoded demo with zero assertions
- **1,720-line monolith** — everything crammed into one file

### Qwen3.6-27B Strengths

- **10 passing demos** with numerical validation (cached-attention diff < 1e-5, chunked-prefill diff = 4.56e-10)
- **Modular 7-file architecture** — clean separation of concerns
- **Correct variable-length batching** — proper causal + length masks
- **3 working optimizations** — paged attention, int8 quantization, chunked prefill (all tested)
- **Quantitative analysis** — arithmetic-intensity calculations, per-GPU context limits, real-model comparisons (Llama, Mistral, GPT-4)

---

## Task 2: Layer Norm Backward Pass

| Dimension | MiniMax-M2.7 | Qwen3.6-27B |
|-----------|--------------|-------------|
| Correctness | 85 | 95 |
| Completeness | 80 | 95 |
| Code Quality | 70 | 90 |
| Numerical Stability | 75 | 95 |
| Gradient Check | 80 | 90 |
| Complexity Analysis | 80 | 90 |
| GPU Fusion | 85 | 85 |
| Tests/Benchmarks | 60 | 95 |
| **Overall** | **76** | **92** |

### MiniMax-M2.7 Weaknesses

- **Over-caching**: stores 10 cache items when only 3 tensors are needed
- **No edge-case tests**: nothing for zero input, D=1, or large offsets
- **No concrete stability demo**: discusses catastrophic cancellation but never demonstrates it
- **Monolithic 750-line file**: everything mixed together
- **Fragile gradient check**: modifies the input in-place without a copy

### Qwen3.6-27B Strengths

- **Minimal cache**: only 4 items (x_hat, std_inv, gamma, D) — exactly what's needed
- **Concrete stability demo**: shows naive variance fails at offset = 1e8 while two-pass stays exact
- **3-file separation**: core + tests + benchmarks
- **Edge-case tests**: zero input, D=1, large D (1024), large mean, scale invariance
- **Alternative derivation cross-check**: an independent step-by-step chain-rule derivation verifies the compact formula (< 1e-10 error)

---

## Task 3: Fused Softmax + TopK CUDA

| Dimension | MiniMax-M2.7 | Qwen3.6-27B |
|-----------|--------------|-------------|
| Correctness | 40 | 95 |
| Completeness | 65 | 90 |
| Code Quality | 60 | 85 |
| CUDA Depth | 65 | 92 |
| Memory Design | 55 | 90 |
| Complexity Analysis | 60 | 88 |
| Naive Comparison | 55 | 88 |
| **Overall** | **58** | **88** |

### MiniMax-M2.7 Critical Issues

- **Broken inter-warp top-k merge**: only ~100 of 256 threads contribute to the final merge; the other 156 threads' results are silently discarded → **produces incorrect top-k**
- **Compilation-stopping typo**: `topp_prob` instead of `topk_prob`
- **Misleading bandwidth claims**: claims a "4× reduction" but only counts one of three passes
- **Zero testing infrastructure**: no benchmark harness, no CPU reference, no correctness verification

### Qwen3.6-27B Strengths

- **Two kernel versions** (v1 + optimized v2 with vectorized float4 loads)
- **Correct warp-by-warp merge** — properly collects all 4096 candidates
- **Shared-memory min-heap** for O(log K) insertions
- **Complete benchmark harness** with CPU reference and correctness tests
- **Honest 3-pass bandwidth analysis** — correctly identifies the kernel as compute-bound (expf throughput)

---

## What Separated These Two

| Factor | MiniMax-M2.7 | Qwen3.6-27B |
|--------|--------------|-------------|
| **Correctness** | Buggy in all 3 tasks | Correct in all 3 |
| **Testing** | None / minimal | Comprehensive with assertions |
| **Analysis depth** | High-level / conceptual | Quantitative with real numbers |
| **Code organization** | Monolithic | Modular and focused |
| **Engineering rigor** | Claims untested | Every claim validated |

**The decisive pattern**: MiniMax-M2.7 was conceptually broad but weak in execution — it mentioned many optimizations and ideas but delivered buggy, untested code. Qwen3.6-27B was narrower in scope but flawless in execution — every claim backed by working, validated code.
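To make the Task 1 causal-mask bug concrete: a correct causal mask lets position `i` attend only to positions `j <= i`, i.e. it masks the strictly upper triangle; the inverted mask blocks the lower triangle instead, exposing future tokens. A minimal pure-Python sketch (shapes and the `causal_mask` helper are illustrative, not taken from either submission):

```python
def causal_mask(n):
    """Return an n x n boolean mask where True means 'allowed to attend'.

    Correct causal masking: position i may attend to position j only
    when j <= i, so everything strictly above the diagonal is masked.
    The inverted-mask bug does the opposite and leaks future tokens.
    """
    return [[j <= i for j in range(n)] for i in range(n)]

mask = causal_mask(4)
assert mask[0] == [True, False, False, False]  # first token sees only itself
assert mask[3] == [True, True, True, True]     # last token sees the full prefix
# No True entry above the diagonal -> no attention to future tokens.
assert all(not mask[i][j] for i in range(4) for j in range(4) if j > i)
```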
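The Task 2 stability point — naive variance failing at a 1e8 offset while two-pass stays exact — can be reproduced in a few lines. This sketch uses assumed toy data, not the submissions' actual code; the naive one-pass formula `E[x^2] - E[x]^2` cancels catastrophically once the data sits on a large offset:

```python
def naive_var(xs):
    """One-pass formula E[x^2] - E[x]^2: cancels badly on offset data."""
    n = len(xs)
    mean = sum(xs) / n
    mean_sq = sum(x * x for x in xs) / n
    return mean_sq - mean * mean

def two_pass_var(xs):
    """Two-pass formula E[(x - mean)^2]: numerically stable."""
    n = len(xs)
    mean = sum(xs) / n
    return sum((x - mean) ** 2 for x in xs) / n

base = [0.1, 0.2, 0.3, 0.4]        # true variance = 0.0125
shifted = [1e8 + v for v in base]  # same spread, huge offset

assert abs(two_pass_var(shifted) - 0.0125) < 1e-6  # two-pass: still accurate
assert abs(naive_var(shifted) - 0.0125) > 0.01     # naive: cancellation destroys it
```

In float64, `x*x` near 1e16 has a spacing of about 2 between representable values, so the sub-unit variance is entirely lost in the subtraction, while the two-pass subtraction of nearby values is exact.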
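The Task 3 merge bug — a final top-k selection that only sees some threads' candidates — is easy to illustrate outside CUDA. In this hedged pure-Python sketch (warp counts and values are hypothetical; `merge_topk` is not from either kernel), a correct merge flattens the candidates from every warp before selecting, and dropping any warp's list silently loses true top-k elements:

```python
import heapq

def merge_topk(per_warp, k):
    """Correct inter-warp merge: every warp's candidates reach the final
    selection, so no true top-k element can be dropped."""
    return heapq.nlargest(k, [v for warp in per_warp for v in warp])

# 4 "warps", each holding its local top-4 candidates.
per_warp = [[99, 10, 9, 8], [98, 7, 6, 5], [97, 4, 3, 2], [96, 1, 0, -1]]

assert merge_topk(per_warp, 4) == [99, 98, 97, 96]

# The buggy pattern: only some warps' results reach the final merge,
# so globally large candidates held by the other warps are lost.
dropped = heapq.nlargest(4, [v for warp in per_warp[:2] for v in warp])
assert dropped == [99, 98, 10, 9]  # 97 and 96 silently discarded
```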