LLM Programming Benchmarks

Head-to-head evaluation of six coding LLMs across eight low-level ML kernel tasks. Each model was given identical prompts to implement solutions from scratch — no frameworks, no autodiff, no copy-paste. Just raw NumPy / CUDA / MLX.


Methodology

All analysis, scoring, and write-ups were generated by DeepSeek V4 Pro running in max-effort mode. This means:

  • Every grade, comparison, and ranking was determined by an LLM judge, not a human reviewer
  • The judge evaluated code for correctness, completeness, testing coverage, analysis depth, algorithmic sophistication, and engineering rigor
  • Raw model outputs are preserved untouched in each model folder; session logs are preserved in sanitized form
  • Scores should be treated as directional indicators, not absolute measurements

Take every score with a grain of salt. LLM judges can be consistent but are not infallible. The relative rankings are more useful than the exact numbers.

Tooling: The first three challenges (KV-Cache, Fused Softmax+TopK, Layer Norm Backward) were run with pi-mono as the harness; the remaining five used opencode.


TL;DR — Final Rankings

| Rank | Model | Avg Grade | Challenges | Best Showing | Weakness |
|------|-------|-----------|------------|--------------|----------|
| 1 | GLM-5 | A-/B+ | 8/8 | DFlash (A), Ternary (A-) | Limited scope (fewer tests, single-file) |
| 2 | Claude Opus 4.7 | A-/B+ | 7/8 | DFlash (A), Backwards (A) | Non-ternary embeddings deviate from spec |
| 3 | Qwen3-6 | B+ | 7/8 | KV-Cache (A), Beam Search (A) | DFlash logits trap; algorithmic ceiling |
| 4 | Kimi K2.6 | B | 3/8 | Flash Attn Bwd (A) | DFlash/ternary bugs; narrow strengths |
| 5 | GLM-5.1 | B-/C+ | 3/8 | DFlash (B+) | Regression from GLM-5; ternary overfit |
| 6 | MiniMax-M2.7 | B/B- | 4/8 | | Bugs, no tests; exited early |

Per-Challenge Grade Matrix

| Challenge | Difficulty | GLM-5 | Qwen3-6 | Opus 4.7 | Kimi K2.6 | GLM-5.1 | MiniMax |
|-----------|------------|-------|---------|----------|-----------|---------|---------|
| Layer Norm Backward | Medium | B+ | A- | A | | | B |
| Fused Softmax+TopK | Medium | A- | A | A | | | B |
| KV-Cache | Medium | A- | A | B+ | | | B- |
| Flash Attn Forward | Hard | A- | A | A- | | | B |
| Beam Search | Hard | B+ | A | A | | | B- |
| Flash Attn Backward | Extra Hard | A- | A- | A- | A | B+ | |
| DFlash | Extra Hard | A | B- | A | B- | B+ | |
| Ternary Training | SOTA Research | A- | B+ | B+ | C | C+ | |

Empty cells indicate no recorded grade for that model on that challenge.

Key Takeaways

  1. GLM-5 is the most consistent — the only model to attempt all 8 challenges, it never scored below B+ and won the two hardest (DFlash, Ternary)
  2. Opus 4.7 has the highest floor — strongest consistency across 7 challenges (A to B+), best documentation, and it caught the same algorithmic traps as GLM-5
  3. Qwen3-6 excels at engineering breadth — modular code, comprehensive tests, real model specs; it falls behind on deep algorithmic reasoning
  4. The DFlash logits trap separated the tiers — only GLM-5, GLM-5.1, and Opus 4.7 corrected the broken pseudo-code to parent-indexed logits (illustrated in the first sketch after this list)
  5. Ternary training exposed hyperparameter discipline — final perplexity, lower is better: GLM-5 (PPL=594), Opus (PPL=643), Qwen (PPL=319 after correcting data leakage), Kimi (PPL=5,501), GLM-5.1 (PPL=30,731); see the second sketch after this list
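
To make the logits trap concrete, here is a minimal NumPy sketch of the parent-indexed gather, assuming a flattened speculation tree with per-position target-model logits of shape (T, V); the function and array names are illustrative, not taken from any model's submission:

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def score_tree_tokens(logits, tokens, parent):
    """Score drafted tokens in a speculation tree.

    logits: (T, V) target-model logits at each tree position
    tokens: (T,)  drafted token id at each tree position
    parent: (T,)  index of each node's parent position
    """
    logp = log_softmax(logits, axis=-1)
    # The trap: logits at position p predict the *next* token, so a
    # node's token must be scored under its parent's distribution,
    # not under the logits at the node's own position.
    return logp[parent, tokens]
```

Scoring each node from its own position looks plausible and often still produces finite log-probabilities, which is why the bug survives casual testing.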
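And to make the ternary results concrete, a toy straight-through-estimator (STE) training step with absmean ternarization; this is a common generic scheme sketched for illustration and may differ from the Ternary Bonsai spec the challenge actually targets:

```python
import numpy as np

rng = np.random.default_rng(0)

def ternarize(w, thresh=0.75):
    # Absmean ternarization: snap weights to {-1, 0, +1} and rescale by
    # the mean magnitude of the surviving weights (generic scheme).
    delta = thresh * np.abs(w).mean()
    q = np.sign(w) * (np.abs(w) > delta)
    nz = q != 0
    alpha = np.abs(w[nz]).mean() if nz.any() else 1.0
    return alpha * q

# Toy linear layer trained with an STE: the forward pass uses the
# ternarized weights, while the gradient is applied to the latent
# full-precision weights as if the quantizer were the identity.
d_in = 8
x = rng.normal(size=(256, d_in))
y = x @ rng.normal(size=(d_in, 1))   # synthetic regression target
w = rng.normal(size=(d_in, 1))       # latent full-precision weights
for step in range(200):
    wq = ternarize(w)                # quantized weights in the forward pass
    err = x @ wq - y
    grad = x.T @ err / len(x)        # grad w.r.t. wq, passed straight through
    w -= 0.05 * grad
```

The perplexity spread in takeaway 5 largely comes down to choices like the threshold, the scale estimate, and the learning rate applied to the latent weights.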

Models Tested

| Folder | Model | Provider |
|--------|-------|----------|
| glm5/ | GLM-5 | Z.ai |
| glm5.1/ | GLM-5.1 | Z.ai |
| qwen36/ | Qwen3-6 | OpenRouter |
| opus47_1m/ | Claude Opus 4.7 | Anthropic |
| kimi-k2.6/ | Kimi K2.6 | Moonshot AI |
| minimax-m2.7/ | MiniMax-M2.7 | OpenRouter |

Challenges

| Task | Description |
|------|-------------|
| backwards/ | Numerically stable Layer Norm backward pass from scratch in NumPy (reference sketch below) |
| fuse/ | High-performance fused Softmax + Top-K CUDA kernel (no full softmax materialization) |
| kv/ | Incremental KV-cache for autoregressive transformer inference |
| flash_attention/ | Flash Attention forward pass with tiling and causal masking |
| beam_search/ | Beam search decoder with length penalty and EOS handling |
| flash_attention_bwd/ | Flash Attention backward pass with D-optimization |
| dflash_verify/ | Tree attention (DFlash) with branching logits and subtree invalidation |
| ternary_training/ | Ternary ({-1, 0, +1}) weight training matching PrismML's Ternary Bonsai spec |
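
For orientation, a minimal NumPy sketch of the standard closed-form layer norm backward that backwards/ asks for, assuming the usual (N, D) layout with normalization over the feature axis; this is a generic reference, not any model's graded submission:

```python
import numpy as np

def layer_norm_fwd(x, gamma, beta, eps=1e-5):
    # x: (N, D); normalize over the last axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std
    return x_hat * gamma + beta, (x_hat, inv_std)

def layer_norm_bwd(dy, cache, gamma):
    # Closed-form gradients; means are taken over the feature axis
    x_hat, inv_std = cache
    dgamma = (dy * x_hat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dxhat = dy * gamma
    dx = inv_std * (dxhat
                    - dxhat.mean(axis=-1, keepdims=True)
                    - x_hat * (dxhat * x_hat).mean(axis=-1, keepdims=True))
    return dx, dgamma, dbeta
```

A finite-difference check against this closed form is the quickest way to validate a from-scratch implementation.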

Repo Layout

├── analysis/                  # Cross-model comparison analyses
│   ├── cross-model-comparison.md    # Full 8-challenge synthesis
│   ├── 2-way-head-to-head*.md       # Pairwise round breakdowns
│   ├── 3-way-head-to-head*.md       # Multi-model round breakdowns
│   ├── dflash-analysis.md           # DFlash deep dive
│   ├── ternary-training-analysis.md # Ternary deep dive
│   └── ...                          # Per-challenge analyses
├── glm5/                      # GLM-5 implementations (8 challenges)
├── glm5.1/                    # GLM-5.1 implementations (3 challenges)
├── qwen36/                    # Qwen3-6 implementations (7 challenges)
├── opus47_1m/                 # Claude Opus 4.7 implementations (7 challenges)
├── kimi-k2.6/                 # Kimi K2.6 implementations (3 challenges)
├── minimax-m2.7/              # MiniMax-M2.7 implementations (4 challenges)
├── deploy_challenges.sh       # Challenge deployment script
└── README.md

Each model folder contains subdirectories per challenge with the model's raw generated code, session logs (session.jsonl), and any model-generated analysis files.

Notes

  • Scores are letter grades (A through F), assigned by DeepSeek V4 Pro in max-effort mode
  • Grading dimensions: correctness, completeness, code quality, analysis depth, testing, algorithmic sophistication, and GPU mapping
  • All raw session logs are preserved (sanitized) in each folder's session.jsonl
  • Source code is untouched — exactly as each model generated it
  • Detailed per-challenge and per-pair analyses live in analysis/