# LLM Programming Benchmarks
Head-to-head evaluation of six coding LLMs across eight low-level ML kernel tasks. Each model was given identical prompts to implement solutions from scratch — no frameworks, no autodiff, no copy-paste. Just raw NumPy / CUDA / MLX.
## Methodology

All analysis, scoring, and write-ups were generated by DeepSeek V4 Pro running in max-effort mode. This means:
- Every grade, comparison, and ranking was determined by an LLM judge, not a human reviewer
- The judge evaluated code for correctness, completeness, testing coverage, analysis depth, algorithmic sophistication, and engineering rigor
- Raw model outputs are preserved untouched in each model folder, alongside sanitized session logs
- Scores should be treated as directional indicators, not absolute measurements
Take every score with a grain of salt. LLM judges can be consistent but are not infallible. The relative rankings are more useful than the exact numbers.
**Tooling:** The first 3 challenges (KV-Cache, Fused Softmax+TopK, Layer Norm Backward) were run with pi-mono as the harness; the remaining 5 were run with opencode.
## TL;DR — Final Rankings
| Rank | Model | Avg Grade | Challenges | Best Showing | Weakness |
|---|---|---|---|---|---|
| 1 | GLM-5 | A-/B+ | 8/8 | DFlash (A), Ternary (A-) | Limited scope (fewer tests, single-file) |
| 2 | Claude Opus 4.7 | A-/B+ | 7/8 | DFlash (A), Backwards (A) | Non-ternary embeddings deviates from spec |
| 3 | Qwen3-6 | B+ | 7/8 | KV-Cache (A), Beam Search (A) | DFlash logits trap, algorithmic ceiling |
| 4 | Kimi K2.6 | B | 3/8 | Flash Attn Bwd (A) | DFlash/ternary bugs; narrow strengths |
| 5 | GLM-5.1 | B-/C+ | 3/8 | DFlash (B+) | Regression from GLM-5; ternary overfit |
| 6 | MiniMax-M2.7 | B/B- | 4/8 | — | Bugs, no tests; exited early |
## Per-Challenge Grade Matrix
| Challenge | Difficulty | GLM-5 | Qwen3-6 | Opus 4.7 | Kimi K2.6 | GLM-5.1 | MiniMax |
|---|---|---|---|---|---|---|---|
| Layer Norm Backward | Medium | B+ | A- | A | — | — | B |
| Fused Softmax+TopK | Medium | A- | A | A | — | — | B |
| KV-Cache | Medium | A- | A | B+ | — | — | B- |
| Flash Attn Forward | Hard | A- | A | A- | — | — | B |
| Beam Search | Hard | B+ | A | A | — | — | B- |
| Flash Attn Backward | Extra Hard | A- | A- | A- | A | B+ | — |
| DFlash | Extra Hard | A | B- | A | B- | B+ | — |
| Ternary Training | SOTA Research | A- | B+ | B+ | C | C+ | — |
## Key Takeaways
- GLM-5 is the most consistent — only model to participate in all 8 challenges, never scored below B+, and won the two hardest challenges (DFlash, Ternary)
- Opus 4.7 has the highest floor — strongest consistency across 7 challenges (A to B+), best documentation, caught the same algorithmic traps as GLM-5
- Qwen3-6 excels at engineering breadth — modular code, comprehensive tests, real model specs; falls behind on deep algorithmic reasoning
- The DFlash logits trap separated the tiers — only GLM-5, GLM-5.1, and Opus 4.7 corrected the broken pseudo-code to parent-indexed logits (see the sketch after this list)
- Ternary training exposed hyperparameter discipline — GLM-5 (PPL=594), Opus (PPL=643), Qwen (PPL=319 after correcting data leakage), Kimi (PPL=5,501), GLM-5.1 (PPL=30,731)
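
To make the logits trap concrete, here is a minimal greedy tree-verification sketch with parent-indexed logits. Everything here is illustrative: the function name, shapes, and BFS-ordering assumption are ours, not the harness's or any model's.

```python
import numpy as np

def verify_tree(logits, tree_tokens, parent):
    """logits: (num_nodes, vocab) target-model logits at each tree position.
    tree_tokens: (num_nodes,) draft token placed at each node.
    parent: (num_nodes,) parent index per node; parent[0] == -1 for the root.
    Assumes BFS order, so every parent precedes its children."""
    accepted = np.zeros(len(tree_tokens), dtype=bool)
    accepted[0] = True  # the root token is already committed
    for i in range(1, len(tree_tokens)):
        p = parent[i]
        # The trap: logits at position p predict the token *after* p, so
        # node i must be checked against logits[p], not logits[i]. Indexing
        # with logits[i] still runs, but silently verifies the wrong position.
        ok = tree_tokens[i] == np.argmax(logits[p])
        accepted[i] = bool(ok) and accepted[p]  # a rejected parent kills the subtree
    return accepted
```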
## Models Tested
| Folder | Model | Provider |
|---|---|---|
| `glm5/` | GLM-5 | Z.ai |
| `glm5.1/` | GLM-5.1 | Z.ai |
| `qwen36/` | Qwen3-6 | OpenRouter |
| `opus47_1m/` | Claude Opus 4.7 | Anthropic |
| `kimi-k2.6/` | Kimi K2.6 | Moonshot AI |
| `minimax-m2.7/` | MiniMax-M2.7 | OpenRouter |
## Challenges
| Task | Description |
|---|---|
| `backwards/` | Numerically stable Layer Norm backward pass from scratch in NumPy (see the sketch after this table) |
| `fuse/` | High-performance fused Softmax + Top-K CUDA kernel (no full softmax materialization) |
| `kv/` | Incremental KV-cache for autoregressive transformer inference |
| `flash_attention/` | Flash Attention forward pass with tiling and causal masking |
| `beam_search/` | Beam search decoder with length penalty and EOS handling |
| `flash_attention_bwd/` | Flash Attention backward pass with D-optimization |
| `dflash_verify/` | Tree attention (DFlash) with branching logits and subtree invalidation |
| `ternary_training/` | Ternary (1.58-bit) weight training matching PrismML's Ternary Bonsai spec (see the sketch after this table) |
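
As a point of reference for the `backwards/` task, here is a minimal NumPy sketch of the standard three-term Layer Norm backward formula. It is our own reference formulation, not any model's submission:

```python
import numpy as np

def layer_norm_backward(dy, x, gamma, eps=1e-5):
    """Backward of y = gamma * (x - mu) / sqrt(var + eps) + beta for a
    2-D input of shape (N, D), normalized over the last axis.
    Returns (dx, dgamma, dbeta)."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std

    dbeta = dy.sum(axis=0)
    dgamma = (dy * x_hat).sum(axis=0)

    # Three-term formula: subtracting the per-row means folds the mu and
    # var gradients back in without materializing a (D, D) Jacobian.
    dx_hat = dy * gamma
    dx = inv_std * (
        dx_hat
        - dx_hat.mean(axis=-1, keepdims=True)
        - x_hat * (dx_hat * x_hat).mean(axis=-1, keepdims=True)
    )
    return dx, dgamma, dbeta
```

The fused three-term form keeps the pass O(N·D) and is the usual route to numerical stability relative to naive per-element Jacobian expansions.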
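For the `ternary_training/` task, the core mechanic is training through a quantizer. Below is a minimal sketch of absmean ternary quantization with a straight-through estimator, in the style of BitNet b1.58; the names (`ternarize`, `w_latent`) are ours, and the actual Ternary Bonsai spec may prescribe different scaling:

```python
import numpy as np

def ternarize(w, eps=1e-8):
    # Absmean quantization (assumed scheme, per BitNet b1.58): scale by
    # the mean |w|, snap each weight to {-1, 0, +1}, then rescale.
    scale = np.abs(w).mean() + eps
    return scale * np.clip(np.round(w / scale), -1, 1)

def linear_forward(x, w_latent):
    # The forward pass sees only the ternarized weights...
    return x @ ternarize(w_latent)

def linear_backward(x, grad_out):
    # ...while the straight-through estimator treats ternarize() as the
    # identity on the way back, so the latent fp32 weights receive dense
    # gradients and can drift across quantization thresholds over training.
    return x.T @ grad_out  # gradient w.r.t. w_latent
```

Setups like this are notoriously sensitive to learning rate and weight-decay choices, which is one plausible reason the PPL spread reported above is so wide.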
## Repo Layout

```
├── analysis/                        # Cross-model comparison analyses
│   ├── cross-model-comparison.md    # Full 8-challenge synthesis
│   ├── 2-way-head-to-head*.md       # Pairwise round breakdowns
│   ├── 3-way-head-to-head*.md       # Multi-model round breakdowns
│   ├── dflash-analysis.md           # DFlash deep dive
│   ├── ternary-training-analysis.md # Ternary deep dive
│   └── ...                          # Per-challenge analyses
├── glm5/                            # GLM-5 implementations (8 challenges)
├── glm5.1/                          # GLM-5.1 implementations (3 challenges)
├── qwen36/                          # Qwen3-6 implementations (7 challenges)
├── opus47_1m/                       # Claude Opus 4.7 implementations (7 challenges)
├── kimi-k2.6/                       # Kimi K2.6 implementations (3 challenges)
├── minimax-m2.7/                    # MiniMax-M2.7 implementations (4 challenges)
├── deploy_challenges.sh             # Challenge deployment script
└── README.md
```
Each model folder contains one subdirectory per challenge holding the model's raw generated code, session logs (`session.jsonl`), and any model-generated analysis files.
## Notes

- Scores are letter grades (A through F), assigned by DeepSeek V4 Pro in max-effort mode
- Grading dimensions: correctness, completeness, code quality, analysis depth, testing, algorithmic sophistication, and GPU mapping
- All raw session logs are preserved (sanitized) in each folder's `session.jsonl`
- Source code is untouched — exactly as each model generated it
- Detailed per-challenge and per-pair analyses live in `analysis/`