# LLM Programming Benchmarks
A head-to-head evaluation of three coding LLMs on three low-level ML kernel tasks. Each model was given identical prompts to implement:
1. **KV Cache** — incremental KV-cache for autoregressive transformer inference
2. **Backwards Pass** — numerically stable layer norm backward pass from scratch in NumPy
3. **Fused Softmax + Top-K** — high-performance CUDA kernel (no full softmax materialization)
No frameworks, no autodiff, no copy-paste. Just raw NumPy / CUDA from scratch.
> All judging, scoring, and write-ups were performed by **Kimi K2.6** analyzing the generated code for correctness, completeness, testing, and analysis depth.
---
## Models Tested
| Folder | Model |
|--------|-------|
| `qwen36` | **Qwen3.6-27B** (via OpenRouter) |
| `glm5` | **GLM-5** (via Z.ai) |
| `minimax-m2.7` | **MiniMax-M2.7** (via OpenRouter) |
---
## Final Rankings
| Rank | Model | Average Score | Best Task | Worst Task |
|------|-------|---------------|-----------|------------|
| 🥇 | **Qwen3.6-27B** | **89** | KV (92 avg) | Fuse (78) |
| 🥈 | **GLM-5** | **81** | KV / Backwards (82) | Fuse (80) |
| 🥉 | **MiniMax-M2.7** | **66** | Backwards (76) | Fuse (58) |
### Complete Scoreboard
**Round 1: MiniMax-M2.7 vs Qwen3.6-27B**
| Task | MiniMax-M2.7 | Qwen3.6-27B | Winner | Margin |
|------|--------------|-------------|--------|--------|
| KV Cache | **64** | **91** | qwen36 | +27 |
| Backwards Pass | **76** | **92** | qwen36 | +16 |
| Fused Softmax+TopK | **58** | **88** | qwen36 | +30 |
| **Average** | **66** | **90** | **qwen36** | **+24** |
**Round 2: GLM-5 vs Qwen3.6-27B**
| Task | GLM-5 | Qwen3.6-27B | Winner | Margin |
|------|-------|-------------|--------|--------|
| KV Cache | **82** | **94** | qwen36 | +12 |
| Backwards Pass | **82** | **93** | qwen36 | +11 |
| Fused Softmax+TopK | **80** | **78** | **glm5** | **+2** |
| **Average** | **81** | **88** | **qwen36** | **+7** |
---
## Task-by-Task Breakdown
### KV Cache
- **Qwen3.6-27B (91, 94)** — Consistently dominant. 10 demos, modular architecture, real model comparisons, GQA, arithmetic intensity analysis.
- **GLM-5 (82)** — Correct, good tests, excellent docs, INT4 quantization. Lost on missing MLP/causal masking and less systems depth.
- **MiniMax-M2.7 (64)** — Inverted causal mask, broken batched caching, no tests, 1,720-line monolith.
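For reference, the incremental-append pattern the KV Cache task asks for can be sketched in a few lines of NumPy. This is an illustrative toy, not taken from any model's submission; all names are made up:

```python
import numpy as np

class KVCache:
    """Minimal incremental KV cache for one attention layer (illustrative)."""
    def __init__(self, max_len, n_heads, head_dim, dtype=np.float32):
        # Preallocate once; avoids reallocation/copies during decoding
        self.k = np.zeros((max_len, n_heads, head_dim), dtype=dtype)
        self.v = np.zeros((max_len, n_heads, head_dim), dtype=dtype)
        self.len = 0

    def append(self, k_new, v_new):
        # k_new, v_new: (t, n_heads, head_dim) for t new tokens
        t = k_new.shape[0]
        self.k[self.len:self.len + t] = k_new
        self.v[self.len:self.len + t] = v_new
        self.len += t
        # Return views over the valid prefix only (no copy)
        return self.k[:self.len], self.v[:self.len]
```

The core property every submission was judged on is that `append` is O(t) per step rather than recomputing all past keys/values each decode step.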
### Backwards Pass
- **Qwen3.6-27B (92, 93)** — Minimal cache, concrete stability demo, 3-file separation, 5 edge-case tests, cross-check derivation.
- **GLM-5 (82)** — Excellent conciseness (~280 lines), minimal cache, safe gradient check. Lost on no edge-case tests and no stability demo.
- **MiniMax-M2.7 (76)** — Over-cached (10 items), no edge-case tests, fragile in-place gradient check, monolithic.
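The "minimal cache" the top two models converged on corresponds to the compact layer-norm backward, which needs only the normalized activations and the reciprocal std. A NumPy sketch of that formulation (illustrative, not any model's code):

```python
import numpy as np

def layernorm_fwd(x, gamma, beta, eps=1e-5):
    # x: (N, D); normalize over the last axis
    mu = x.mean(axis=-1, keepdims=True)
    rstd = 1.0 / np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    xhat = (x - mu) * rstd
    return xhat * gamma + beta, (xhat, rstd)   # minimal cache: 2 items

def layernorm_bwd(dy, cache, gamma):
    xhat, rstd = cache
    dxhat = dy * gamma
    # Compact projection: remove dxhat's components along 1 and along xhat
    dx = rstd * (dxhat
                 - dxhat.mean(axis=-1, keepdims=True)
                 - xhat * (dxhat * xhat).mean(axis=-1, keepdims=True))
    dgamma = (dy * xhat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    return dx, dgamma, dbeta
```

The projection form is why over-caching loses points: everything the backward needs beyond `xhat` and `rstd` is recomputed cheaply from `dy` itself.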
### Fused Softmax+TopK
- **GLM-5 (80)** — Single-pass online softmax (research-level), 1× global reads, register heaps. Won narrowly (+2) but has a cross-warp merge bug when WARPS_PER_BLOCK > 1.
- **Qwen3.6-27B (88, 78)** — Two kernel versions, correct merge, vectorized loads, benchmark harness. Lost on fuse due to a suboptimal 3-pass algorithm (12V reads vs. 4V).
- **MiniMax-M2.7 (58)** — Broken inter-warp merge (156 threads ignored), compilation typo, zero tests.
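The online-softmax running update at the heart of GLM-5's single-pass design, and the partial-state merge step where both the GLM-5 and MiniMax-M2.7 cross-warp bugs live, can be modeled in a few lines of NumPy (an illustrative sketch of the algorithm, not any submission's code; in a real kernel each "chunk" would be a warp's slice):

```python
import numpy as np

def online_softmax_stats(x, chunk=4):
    # Stream over x in chunks, keeping a running max m and a running
    # sum d of exp(x - m); softmax(x) == exp(x - m) / d at the end.
    m, d = -np.inf, 0.0
    for i in range(0, len(x), chunk):
        c = x[i:i + chunk]
        m_new = max(m, c.max())
        # Rescale the old sum into the new max's frame, then add the chunk
        d = d * np.exp(m - m_new) + np.exp(c - m_new).sum()
        m = m_new
    return m, d

def merge(m1, d1, m2, d2):
    # Combine two partial (max, sum) states — the cross-warp merge step.
    # Forgetting the exp rescaling here is exactly the bug class cited above.
    m = max(m1, m2)
    return m, d1 * np.exp(m1 - m) + d2 * np.exp(m2 - m)
```

A multi-chunk test of `merge` against the single-stream result is the kind of multi-row/multi-warp validation that would have caught both broken kernels.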
---
## Key Patterns
### What Separates the Tiers
| Dimension | MiniMax-M2.7 | GLM-5 | Qwen3.6-27B |
|-----------|--------------|-------|-------------|
| **Correctness** | ❌ Buggy in all 3 | ✅ Correct (1 minor bug) | ✅ Correct in all 3 |
| **Testing** | ❌ None | ⚠️ Basic assertions | ✅ Comprehensive suites |
| **Analysis depth** | ⚠️ High-level / conceptual | ✅ Good | ✅ Quantitative + real models |
| **Code quality** | ❌ Bloated monoliths | ✅ Concise & focused | ✅ Modular & production-grade |
| **Algorithmic sophistication** | ⚠️ Claims many, delivers few | ✅ Online softmax, INT4 | ✅ Solid, well-validated |
| **Engineering rigor** | ❌ Untested claims | ✅ Clean & minimal | ✅ Every claim validated |
### The Decisive Factors
1. **Testing is everything**: Qwen3.6-27B's comprehensive test suites caught issues that GLM-5 and MiniMax-M2.7 missed. GLM-5's fuse bug (cross-warp merge) would have been caught by a multi-row test. MiniMax-M2.7's causal mask bug would have been caught by any numerical validation.
2. **Concrete > theoretical**: Qwen3.6-27B demonstrated numerical stability problems with actual numbers; MiniMax-M2.7 and GLM-5 only described them. This pattern repeated across all tasks.
3. **Minimal cache wins**: Both Qwen3.6-27B and GLM-5 used minimal caches (3–4 items), while MiniMax-M2.7 over-cached (10 items). The backward pass is particularly sensitive to this — the compact projection formula eliminates most intermediates.
4. **Algorithmic sophistication has tradeoffs**: GLM-5's online softmax was theoretically optimal but harder to get right (the cross-warp bug). Qwen3.6-27B's 3-pass approach was simpler and correct but suboptimal in memory traffic.
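A concrete stability demo of the kind credited to Qwen3.6-27B takes only a few lines. For example, naive vs. max-subtracted softmax (illustrative numbers, not taken from any submission):

```python
import numpy as np

x = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax overflows: exp(x) overflows float64 once x exceeds ~709,
# so every entry becomes inf and inf/inf yields NaN across the board.
with np.errstate(over='ignore', invalid='ignore'):
    naive = np.exp(x) / np.exp(x).sum()

# Max-subtracted softmax keeps every exponent <= 0 and sums to 1 exactly
# up to rounding, with no change to the mathematical result.
stable = np.exp(x - x.max()) / np.exp(x - x.max()).sum()
```

Showing the NaNs directly, rather than asserting that overflow "could" happen, is the concrete-over-theoretical pattern the judging rewarded.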
---
## Repo Layout
```
├── glm5/                  # GLM-5 implementations
│   ├── backwards/
│   ├── fuse/
│   └── kv/
├── minimax-m2.7/          # MiniMax-M2.7 implementations
│   ├── backwards/
│   ├── fuse/
│   └── kv/
├── qwen36/                # Qwen3.6-27B implementations
│   ├── backwards/
│   ├── fuse/
│   └── kv/
└── model_comparison/      # Full head-to-head analyses
    ├── overall_summary.md
    ├── glm5_vs_qwen36_summary.md
    ├── minimax-m2.7_vs_qwen36_summary.md
    └── ... (per-task deep dives)
```
---
## Notes
- Scores are out of 100 per task, judged on correctness, completeness, code quality, analysis depth, testing, and GPU mapping.
- All raw session logs are preserved (sanitized) in each folder's `session.jsonl`.
- Source code is untouched — exactly as each model generated it.