docs: add README with summary and Kimi K2.6 attribution

2026-04-23 11:24:24 +02:00
parent 8e72eef09c
commit ed77298a53
1 changed files with 127 additions and 0 deletions
@@ -0,0 +1,127 @@
 # LLM Programming Benchmarks
 A head-to-head evaluation of three coding LLMs on three low-level ML kernel tasks. Each model was given identical prompts to implement:
 1. **KV Cache** — incremental KV-cache for autoregressive transformer inference
 2. **Backwards Pass** — numerically stable layer norm backward pass from scratch in NumPy
 3. **Fused Softmax + Top-K** — high-performance CUDA kernel (no full softmax materialization)
 No frameworks, no autodiff, no copy-paste. Just raw NumPy / CUDA from scratch.
 > All judging, scoring, and write-ups were performed by **Kimi K2.6** analyzing the generated code for correctness, completeness, testing, and analysis depth.
 ---
 ## Models Tested
 | Folder | Model |
 |--------|-------|
 | `qwen36` | **Qwen3.6-27B** (via OpenRouter) |
 | `glm5` | **GLM-5** (via Z.ai) |
 | `minimax-m2.7` | **MiniMax-M2.7** (via OpenRouter) |
 ---
 ## Final Rankings
 | Rank | Model | Average Score | Best Task | Worst Task |
 |------|-------|---------------|-----------|------------|
 | 🥇 | **Qwen3.6-27B** | **89** | KV / Backwards (92.5 avg) | Fuse (83 avg, low of 78) |
 | 🥈 | **GLM-5** | **81** | KV / Backwards (82) | Fuse (80) |
 | 🥉 | **MiniMax-M2.7** | **66** | Backwards (76) | Fuse (58) |
 ### Complete Scoreboard
 **Round 1: MiniMax-M2.7 vs Qwen3.6-27B**
 | Task | MiniMax-M2.7 | Qwen3.6-27B | Winner | Margin |
 |------|--------------|-------------|--------|--------|
 | KV Cache | **64** | **91** | qwen36 | +27 |
 | Backwards Pass | **76** | **92** | qwen36 | +16 |
 | Fused Softmax+TopK | **58** | **88** | qwen36 | +30 |
 | **Average** | **66** | **90** | **qwen36** | **+24** |
 **Round 2: GLM-5 vs Qwen3.6-27B**
 | Task | GLM-5 | Qwen3.6-27B | Winner | Margin |
 |------|-------|-------------|--------|--------|
 | KV Cache | **82** | **94** | qwen36 | +12 |
 | Backwards Pass | **82** | **93** | qwen36 | +11 |
 | Fused Softmax+TopK | **80** | **78** | **glm5** | **+2** |
 | **Average** | **81** | **88** | **qwen36** | **+7** |
 ---
 ## Task-by-Task Breakdown
 ### KV Cache
 - **Qwen3.6-27B (91, 94)** — Consistently dominant. 10 demos, modular architecture, real model comparisons, GQA, arithmetic intensity analysis.
 - **GLM-5 (82)** — Correct, good tests, excellent docs, INT4 quantization. Lost points for missing MLP and causal masking and for less systems depth.
 - **MiniMax-M2.7 (64)** — Inverted causal mask, broken batched caching, no tests, 1,720-line monolith.
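
The mask and caching bugs listed above are the kind a numerical comparison against full attention would expose. Below is a minimal single-head NumPy sketch of that check. The names `KVCache`, `attend`, and `full_causal_attention` are hypothetical and the cache layout is an assumption; none of this is taken from any model's submission.

```python
import numpy as np

def softmax(x):
    # Subtract the row max before exponentiating for numerical stability.
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def full_causal_attention(q, k, v):
    """Reference: attention over the whole sequence with a causal mask."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # True above the diagonal marks future positions, which must be hidden.
    future = np.triu(np.ones((t, t), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v

class KVCache:
    """Preallocated single-head cache (assumed layout: [max_len, d])."""

    def __init__(self, max_len, d):
        self.k = np.zeros((max_len, d))
        self.v = np.zeros((max_len, d))
        self.length = 0

    def append(self, k_t, v_t):
        self.k[self.length] = k_t
        self.v[self.length] = v_t
        self.length += 1

    def attend(self, q_t):
        # The cache holds only past and current positions, so no mask is needed.
        k = self.k[:self.length]
        v = self.v[:self.length]
        scores = k @ q_t / np.sqrt(q_t.shape[-1])
        return softmax(scores) @ v

rng = np.random.default_rng(0)
T, D = 6, 8
q, k, v = rng.normal(size=(3, T, D))

cache = KVCache(max_len=T, d=D)
incremental = []
for t in range(T):
    cache.append(k[t], v[t])
    incremental.append(cache.attend(q[t]))
incremental = np.stack(incremental)

reference = full_causal_attention(q, k, v)
max_err = np.abs(incremental - reference).max()
print(f"max abs difference: {max_err:.2e}")
```

If the mask were inverted (hiding the past instead of the future), the reference and the incremental outputs would disagree on every row except the last-position edge cases, so even this small test separates a correct cache from a broken one.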
 ### Backwards Pass
 - **Qwen3.6-27B (92, 93)** — Minimal cache, concrete stability demo, 3-file separation, 5 edge-case tests, cross-check derivation.
 - **GLM-5 (82)** — Excellent conciseness (~280 lines), minimal cache, safe gradient check. Lost points for lacking edge-case tests and a stability demo.
 - **MiniMax-M2.7 (76)** — Over-cached (10 items), no edge-case tests, fragile in-place gradient check, monolithic.
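
To make "minimal cache" and "safe gradient check" concrete, here is a hedged NumPy sketch of a layer norm backward pass that caches only three items (`xhat`, `rstd`, `gamma`) and validates `dx` with central differences on copies of the input. Function names and the cache contents are illustrative assumptions, not a reproduction of any submission.

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    mu = x.mean(axis=-1, keepdims=True)
    xc = x - mu
    var = (xc * xc).mean(axis=-1, keepdims=True)  # two-pass variance
    rstd = 1.0 / np.sqrt(var + eps)
    xhat = xc * rstd
    y = gamma * xhat + beta
    # Only what the backward pass needs: normalized input, 1/std, and scale.
    cache = (xhat, rstd, gamma)
    return y, cache

def layernorm_backward(dy, cache):
    xhat, rstd, gamma = cache
    dgamma = (dy * xhat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    g = dy * gamma
    # Projection form: remove the components of g along 1 and along xhat.
    dx = rstd * (g
                 - g.mean(axis=-1, keepdims=True)
                 - xhat * (g * xhat).mean(axis=-1, keepdims=True))
    return dx, dgamma, dbeta

def numerical_grad(f, x, h=1e-6):
    """Central differences; perturbs copies so the caller's array is never modified."""
    grad = np.zeros_like(x)
    for idx in np.ndindex(*x.shape):
        xp = x.copy()
        xm = x.copy()
        xp[idx] += h
        xm[idx] -= h
        grad[idx] = (f(xp) - f(xm)) / (2 * h)
    return grad

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
gamma = rng.normal(size=8)
beta = rng.normal(size=8)
w = rng.normal(size=(4, 8))  # fixed upstream gradient, so loss = sum(y * w)

y, cache = layernorm_forward(x, gamma, beta)
dx, dgamma, dbeta = layernorm_backward(w, cache)

dx_num = numerical_grad(
    lambda xv: (layernorm_forward(xv, gamma, beta)[0] * w).sum(), x)
rel_err = np.abs(dx - dx_num).max() / np.abs(dx_num).max()
print(f"relative error in dx: {rel_err:.2e}")
```

The mean, the centered input, and the variance are not cached because the projection form of `dx` never reads them; that is the sense in which a 3-item cache suffices here.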
 ### Fused Softmax+TopK
 - **GLM-5 (80)** — Single-pass online softmax (research-level), 1× global reads, register heaps. Won narrowly (+2) but has a cross-warp merge bug when WARPS_PER_BLOCK > 1.
 - **Qwen3.6-27B (88, 78)** — Two kernel versions, correct merge, vectorized loads, benchmark harness. Lost this task in Round 2 because of a suboptimal 3-pass algorithm (12V reads vs 4V).
 - **MiniMax-M2.7 (58)** — Broken inter-warp merge (156 threads ignored), compilation typo, zero tests.
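
The single-pass idea and the merge step that two of the submissions got wrong can be illustrated on the CPU. The sketch below is plain Python rather than CUDA: each chunk stands in for one warp's partial result (running max, rescaled running sum, and its k largest logits), and `merge` combines two partials. All function names are hypothetical, and the code assumes finite logits and non-empty chunks.

```python
import heapq
import math
import random

def partial_state(logits, offset, k):
    """One pass over a chunk: running max, rescaled sum, k largest (logit, index)."""
    m, s, heap = -math.inf, 0.0, []
    for j, x in enumerate(logits):
        if x > m:
            # Rescale the partial sum to the new max before adding the new term.
            s = s * math.exp(m - x) + 1.0
            m = x
        else:
            s += math.exp(x - m)
        item = (x, offset + j)
        if len(heap) < k:
            heapq.heappush(heap, item)
        elif item > heap[0]:
            heapq.heapreplace(heap, item)
    return m, s, heap

def merge(a, b, k):
    """Combine two partial states; both sums are rescaled to the shared max."""
    (ma, sa, ha), (mb, sb, hb) = a, b
    m = max(ma, mb)
    s = sa * math.exp(ma - m) + sb * math.exp(mb - m)
    return m, s, heapq.nlargest(k, ha + hb)

def finalize(state):
    m, s, items = state
    return [(i, math.exp(x - m) / s) for x, i in sorted(items, reverse=True)]

def reference_topk(logits, k):
    """Materializes the full softmax, which the fused version avoids."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    order = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    return [(i, exps[i] / total) for i in order]

rng = random.Random(0)
logits = [rng.gauss(0.0, 3.0) for _ in range(1000)]
K, CHUNK = 5, 128

states = [partial_state(logits[o:o + CHUNK], o, K)
          for o in range(0, len(logits), CHUNK)]
state = states[0]
for other in states[1:]:
    state = merge(state, other, K)

fused = finalize(state)
expected = reference_topk(logits, K)
print(fused)
```

Running with more than one chunk is the point of the demo: a merge that drops or mis-scales a partial only shows up once several partials have to be combined.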
 ---
 ## Key Patterns
 ### What Separates the Tiers
 | Dimension | MiniMax-M2.7 | GLM-5 | Qwen3.6-27B |
 |-----------|--------------|-------|-------------|
 | **Correctness** | ❌ Buggy in all 3 | ✅ Correct (1 minor bug) | ✅ Correct in all 3 |
 | **Testing** | ❌ None | ⚠️ Basic assertions | ✅ Comprehensive suites |
 | **Analysis depth** | ⚠️ High-level / conceptual | ✅ Good | ✅ Quantitative + real models |
 | **Code quality** | ❌ Bloated monoliths | ✅ Concise & focused | ✅ Modular & production-grade |
 | **Algorithmic sophistication** | ⚠️ Claims much, delivers little | ✅ Online softmax, INT4 | ✅ Solid, well-validated |
 | **Engineering rigor** | ❌ Untested claims | ✅ Clean & minimal | ✅ Every claim validated |
 ### The Decisive Factors
 1. **Testing is everything**: Qwen3.6-27B's comprehensive test suites caught issues that GLM-5 and MiniMax-M2.7 missed. GLM-5's fuse bug (cross-warp merge) would have been caught by a multi-row test. MiniMax-M2.7's causal mask bug would have been caught by any numerical validation.
 2. **Concrete > theoretical**: Qwen3.6-27B demonstrated numerical stability problems with actual numbers; MiniMax-M2.7 and GLM-5 only described them. This pattern repeated across all tasks.
 3. **Minimal cache wins**: Both Qwen3.6-27B and GLM-5 used minimal caches (3–4 items), while MiniMax-M2.7 over-cached (10 items). The backward pass is particularly sensitive to this — the compact projection formula eliminates most intermediates.
 4. **Algorithmic sophistication has tradeoffs**: GLM-5's online softmax was theoretically optimal but harder to get right (the cross-warp bug). Qwen3.6-27B's 3-pass approach was simpler and correct but suboptimal in memory traffic.
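
As an illustration of showing a stability problem "with actual numbers", here is a small NumPy sketch. It uses one common instability in layer norm, computing the variance as E[x²] − E[x]² in float32 on inputs with a large mean. This example is an assumption chosen for illustration; it is not necessarily the case any of the models demonstrated.

```python
import numpy as np

# Four values with a large common offset; the true variance of the offsets
# [0.1, 0.2, 0.3, 0.4] is 0.0125.
x = (1.0e4 + np.array([0.1, 0.2, 0.3, 0.4])).astype(np.float32)

# One-pass formula: subtracts two numbers near 1e8, where float32 cannot
# resolve differences of order 1e-2.
naive_var = (x * x).mean() - x.mean() ** 2

# Two-pass formula: center first, then square. The centered values are small,
# so float32 keeps enough precision.
two_pass_var = ((x - x.mean()) ** 2).mean()

print(f"true variance     : {0.0125}")
print(f"naive (one-pass)  : {float(naive_var)}")
print(f"two-pass          : {float(two_pass_var)}")
```

The one-pass result can be off by orders of magnitude and can even come out negative on other inputs, which would turn `sqrt(var + eps)` into NaN. The two-pass form stays close to the true value.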
 ---
 ## Repo Layout
 ```
 ├── glm5/                    # GLM-5 implementations
 │   ├── backwards/
 │   ├── fuse/
 │   └── kv/
 ├── minimax-m2.7/            # MiniMax-M2.7 implementations
 │   ├── backwards/
 │   ├── fuse/
 │   └── kv/
 ├── qwen36/                  # Qwen3.6-27B implementations
 │   ├── backwards/
 │   ├── fuse/
 │   └── kv/
 └── model_comparison/        # Full head-to-head analyses
    ├── overall_summary.md
    ├── glm5_vs_qwen36_summary.md
    ├── minimax-m2.7_vs_qwen36_summary.md
    └── ... (per-task deep dives)
 ```
 ---
 ## Notes
 - Scores are out of 100 per task, judged on correctness, completeness, code quality, analysis depth, testing, and GPU mapping.
 - All raw session logs are preserved (sanitized) in each folder's `session.jsonl`.
 - Source code is untouched — exactly as each model generated it.