feat: add model comparisons and sanitize session files
- Rename gamma to glm5 and model to minimax-m2.7
- Add model_comparison/ directory with head-to-head analyses
- Sanitize all session.jsonl files: remove absolute paths and usernames
- Remove __pycache__ artifacts
- Add .gitignore
# Round 1 Summary: MiniMax-M2.7 vs Qwen3.6-27B
## Overall Scoreboard
| Task | MiniMax-M2.7 | Qwen3.6-27B | Winner | Margin |
|------|--------------|-------------|--------|--------|
| **KV Cache** | **64/100** | **91/100** | qwen36 | +27 |
| **Backwards Pass** | **76/100** | **92/100** | qwen36 | +16 |
| **Fused Softmax+TopK** | **58/100** | **88/100** | qwen36 | +30 |
| **Average** | **66** | **90** | **qwen36** | **+24** |
**Clear winner: Qwen3.6-27B — dominant across all 3 tasks.**
---
## Task 1: KV Cache System
| Dimension | MiniMax-M2.7 | Qwen3.6-27B |
|-----------|--------------|-------------|
| Correctness | 55 | 92 |
| Completeness | 75 | 95 |
| Code Quality | 60 | 88 |
| Depth of Analysis | 78 | 90 |
| Optimizations | 72 | 90 |
| GPU Mapping | 75 | 88 |
| Tests/Demos | 30 | 95 |
| **Overall** | **64** | **91** |
### MiniMax-M2.7 Critical Issues
- **Inverted causal mask** — masks the wrong triangle, allowing attention to future tokens
- **Broken batched caching** — all batch elements share the same `kv_cache` dict keyed only by layer, not by batch item
- **Prefill doesn't store KV** — prefill KV tensors never stored in persistent cache
- **No tests** — only a 3-step hardcoded demo with zero assertions
- **1,720-line monolith** — everything crammed into one file
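For reference, a correct causal mask keeps the lower triangle of the score matrix, so each query position attends only to itself and earlier keys; the inverted bug keeps the upper triangle instead. A minimal NumPy sketch of the invariant (illustrative, not MiniMax-M2.7's actual code):

```python
import numpy as np

T = 4  # sequence length
scores = np.zeros((T, T))  # query x key attention scores

# Correct: keep positions j <= i (lower triangle), mask out the future.
causal = np.tril(np.ones((T, T), dtype=bool))
masked = np.where(causal, scores, -np.inf)

# Row i should have exactly i + 1 finite entries (tokens 0..i).
for i in range(T):
    assert np.isfinite(masked[i]).sum() == i + 1

# The inverted bug: np.triu keeps j >= i, so token 0 attends to every
# future token and softmax leaks information backwards in time.
inverted = np.where(np.triu(np.ones((T, T), dtype=bool)), scores, -np.inf)
assert np.isfinite(inverted[0]).sum() == T
```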
### Qwen3.6-27B Strengths
- **10 passing demos** with numerical validation (cached attention diff < 1e-5, chunked prefill diff = 4.56e-10)
- **Modular 7-file architecture** — clean separation of concerns
- **Correct variable-length batching** — proper causal + length masks
- **3 working optimizations** — paged attention, int8 quantization, chunked prefill (all tested)
- **Quantitative analysis** — arithmetic intensity calculations, per-GPU context limits, real model comparisons (Llama, Mistral, GPT-4)
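The kind of check behind a "cached attention diff" claim is easy to reproduce: run full causal attention over a sequence, then decode token by token against a growing KV cache and compare the outputs. A single-head NumPy sketch (hypothetical shapes, not Qwen3.6-27B's code):

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)  # stable softmax
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
T, d = 8, 16
q = rng.standard_normal((T, d))
k = rng.standard_normal((T, d))
v = rng.standard_normal((T, d))

# Reference: full causal attention over the whole sequence at once.
scores = q @ k.T / np.sqrt(d)
scores[~np.tril(np.ones((T, T), dtype=bool))] = -np.inf
ref = softmax(scores) @ v

# Incremental decode: append each step's K/V to the cache, attend over it.
k_cache = np.empty((0, d))
v_cache = np.empty((0, d))
outs = []
for t in range(T):
    k_cache = np.vstack([k_cache, k[t:t + 1]])
    v_cache = np.vstack([v_cache, v[t:t + 1]])
    outs.append(softmax(q[t:t + 1] @ k_cache.T / np.sqrt(d)) @ v_cache)
inc = np.vstack(outs)

assert np.abs(ref - inc).max() < 1e-10  # cached path matches full recompute
```

Per-batch-item caching falls out of the same structure: key the cache by (batch item, layer) instead of sharing one dict across the batch.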
---
## Task 2: Layer Norm Backward Pass
| Dimension | MiniMax-M2.7 | Qwen3.6-27B |
|-----------|--------------|-------------|
| Correctness | 85 | 95 |
| Completeness | 80 | 95 |
| Code Quality | 70 | 90 |
| Numerical Stability | 75 | 95 |
| Gradient Check | 80 | 90 |
| Complexity Analysis | 80 | 90 |
| GPU Fusion | 85 | 85 |
| Tests/Benchmarks | 60 | 95 |
| **Overall** | **76** | **92** |
### MiniMax-M2.7 Weaknesses
- **Over-caching**: Stores 10 cache items when only 3 tensors are needed
- **No edge-case tests**: No tests for zero input, D=1, large offsets
- **No concrete stability demo**: Discusses catastrophic cancellation but never demonstrates it
- **Monolithic 750-line file**: Everything mixed together
- **Fragile gradient check**: Modifies input in-place without a copy
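The missing stability demonstration is cheap to supply: with a large mean offset, the naive one-pass formula E[x²] − E[x]² cancels catastrophically, while the two-pass formula subtracts the mean before squaring and stays exact. A minimal two-element float64 illustration (my construction, not either model's code):

```python
import numpy as np

x = np.array([1e8, 1e8 + 1.0])  # true variance is 0.25
true_var = 0.25

# Naive one-pass formula: both terms are ~1e16, so the 0.25 difference
# sits below float64 resolution at that magnitude and cancels away.
naive = np.mean(x ** 2) - np.mean(x) ** 2

# Two-pass formula: subtract the mean first, then square small residuals.
two_pass = np.mean((x - np.mean(x)) ** 2)

assert abs(two_pass - true_var) < 1e-12  # two-pass stays exact
assert abs(naive - true_var) > 0.2       # naive loses the answer entirely
```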
### Qwen3.6-27B Strengths
- **Minimal cache**: Only 4 items (x_hat, std_inv, gamma, D) — exactly what's needed
- **Concrete stability demo**: Shows naive variance fails at offset=1e8 while two-pass stays exact
- **3-file separation**: Core + tests + benchmarks
- **Edge-case tests**: Zero input, D=1, large D (1024), large mean, scale invariance
- **Alternative derivation cross-check**: Independent step-by-step chain rule verifies compact formula (<1e-10 error)
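The minimal-cache design follows from the compact backward formula, which needs only x_hat, std_inv, the scale parameter gamma, and the feature width D. A sketch of the standard formula with a central-difference gradient check on a copy of the input (my reconstruction, not Qwen3.6-27B's file):

```python
import numpy as np

def layernorm_fwd(x, gamma, beta, eps=1e-5):
    mu = x.mean(-1, keepdims=True)
    std_inv = 1.0 / np.sqrt(x.var(-1, keepdims=True) + eps)
    x_hat = (x - mu) * std_inv
    # Cache exactly four items: x_hat, std_inv, gamma, D.
    return gamma * x_hat + beta, (x_hat, std_inv, gamma, x.shape[-1])

def layernorm_bwd_x(dy, cache):
    x_hat, std_inv, gamma, D = cache
    dxh = dy * gamma
    # Compact formula: folds the mean and variance paths into one expression.
    return (std_inv / D) * (D * dxh
                            - dxh.sum(-1, keepdims=True)
                            - x_hat * (dxh * x_hat).sum(-1, keepdims=True))

rng = np.random.default_rng(0)
x = rng.standard_normal((2, 5))
gamma = rng.standard_normal(5)
beta = rng.standard_normal(5)
dy = rng.standard_normal((2, 5))

_, cache = layernorm_fwd(x, gamma, beta)
dx = layernorm_bwd_x(dy, cache)

# Central-difference check of d(sum(y * dy))/dx, perturbing a COPY of x
# each time rather than mutating it in place.
h = 1e-6
num = np.zeros_like(x)
for idx in np.ndindex(*x.shape):
    xp = x.copy(); xp[idx] += h
    xm = x.copy(); xm[idx] -= h
    num[idx] = ((layernorm_fwd(xp, gamma, beta)[0] * dy).sum()
                - (layernorm_fwd(xm, gamma, beta)[0] * dy).sum()) / (2 * h)

assert np.abs(dx - num).max() < 1e-4
```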
---
## Task 3: Fused Softmax + TopK CUDA
| Dimension | MiniMax-M2.7 | Qwen3.6-27B |
|-----------|--------------|-------------|
| Correctness | 40 | 95 |
| Completeness | 65 | 90 |
| Code Quality | 60 | 85 |
| CUDA Depth | 65 | 92 |
| Memory Design | 55 | 90 |
| Complexity Analysis | 60 | 88 |
| Naive Comparison | 55 | 88 |
| **Overall** | **58** | **88** |
### MiniMax-M2.7 Critical Issues
- **Broken inter-warp top-k merge**: Only ~100 of 256 threads contribute to final merge; 156 threads' results silently discarded → **produces incorrect top-k**
- **Compilation-stopping typo**: `topp_prob` instead of `topk_prob`
- **Misleading bandwidth claims**: Claims "4× reduction" but only counts one of three passes
- **Zero testing infrastructure**: No benchmark harness, no CPU reference, no correctness verification
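The merge bug is an instance of a general trap: per-partition top-k results are only correct if every partition's candidates reach the final merge. A Python model of that invariant, with partitions standing in for warps (ordered values chosen so the outcome is deterministic; not the actual CUDA code):

```python
import numpy as np

row = np.arange(4096, dtype=np.float64)  # scores; global top-32 = 4064..4095
K = 32
parts = row.reshape(8, 512)  # 8 "warps", 512 elements each

# Each partition proposes its local top-K; a correct merge considers
# all 8 * K candidates before selecting the global top-K.
candidates = np.concatenate([np.sort(p)[-K:] for p in parts])
merged = np.sort(candidates)[-K:]
expected = np.sort(row)[-K:]
assert np.array_equal(merged, expected)  # every partition contributes

# The bug pattern: some threads' candidates never reach the merge. Here
# the global top-32 all live in the last partition, so discarding its
# candidates (analogous to dropping 156 of 256 threads) loses every one.
dropped = np.sort(candidates[:-K])[-K:]
assert not np.array_equal(dropped, expected)
```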
### Qwen3.6-27B Strengths
- **Two kernel versions** (v1 + optimized v2 with vectorized float4 loads)
- **Correct warp-by-warp merge** — properly collects all 4096 candidates
- **Shared-memory min-heap** for O(log K) insertions
- **Complete benchmark harness** with CPU reference and correctness tests
- **Honest 3-pass bandwidth analysis** — correctly identifies kernel as compute-bound (expf throughput)
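A CPU reference is what makes kernel claims checkable in the first place. A NumPy sketch of the contract such a harness verifies for softmax + top-k, with assumed shapes (not the actual harness):

```python
import numpy as np

def softmax_topk_ref(logits, k):
    """CPU reference: row-wise softmax, then the k largest probabilities."""
    z = logits - logits.max(axis=-1, keepdims=True)  # subtract max for stability
    p = np.exp(z)
    p /= p.sum(axis=-1, keepdims=True)
    idx = np.argsort(p, axis=-1)[:, ::-1][:, :k]     # indices of top-k probs
    return np.take_along_axis(p, idx, axis=-1), idx

rng = np.random.default_rng(0)
logits = rng.standard_normal((4, 4096))
probs, idx = softmax_topk_ref(logits, 32)

assert probs.shape == (4, 32) and idx.shape == (4, 32)
assert np.all(np.diff(probs, axis=-1) <= 0)  # each row sorted descending

# The returned values really are the k largest softmax probabilities.
full = np.exp(logits - logits.max(axis=-1, keepdims=True))
full /= full.sum(axis=-1, keepdims=True)
assert np.allclose(probs, np.sort(full, axis=-1)[:, ::-1][:, :32])
```

A GPU kernel's output would then be compared element-wise against `probs` and `idx` under a small tolerance.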
---
## What Separated These Two
| Factor | MiniMax-M2.7 | Qwen3.6-27B |
|--------|--------------|-------------|
| **Correctness** | Buggy in all 3 tasks | Correct in all 3 |
| **Testing** | None / minimal | Comprehensive with assertions |
| **Analysis depth** | High-level / conceptual | Quantitative with real numbers |
| **Code organization** | Monolithic | Modular and focused |
| **Engineering rigor** | Claims untested | Every claim validated |
**The decisive pattern**: MiniMax-M2.7 was conceptually broad but weak in execution — it mentioned many optimizations and ideas but delivered buggy, untested code. Qwen3.6-27B was narrower in scope but executed cleanly — every claim backed by working, validated code.