LLM Programming Benchmarks

Head-to-head evaluation of six coding LLMs across eight low-level ML kernel tasks. Each model was given identical prompts to implement solutions from scratch — no frameworks, no autodiff, no copy-paste. Just raw NumPy / CUDA / MLX.


Methodology

All analysis, scoring, and write-ups were generated by DeepSeek V4 Pro running in max-effort mode. This means:

  • Every grade, comparison, and ranking was determined by an LLM judge, not a human reviewer
  • The judge evaluated code for correctness, completeness, testing coverage, analysis depth, algorithmic sophistication, and engineering rigor
  • Raw model outputs are preserved untouched in each model folder; session logs are preserved in sanitized form
  • Scores should be treated as directional indicators, not absolute measurements

Take every score with a grain of salt. LLM judges can be consistent but are not infallible. The relative rankings are more useful than the exact numbers.

Tooling: The first three challenges (KV-Cache, Fused Softmax+TopK, Layer Norm Backward) were run with pi-mono as the harness; the remaining five used opencode.


TL;DR — Final Rankings

| Rank | Model | Avg Grade | Challenges | Best Showing | Weakness |
|------|-------|-----------|------------|--------------|----------|
| 1 | GLM-5 | A-/B+ | 8/8 | DFlash (A), Ternary (A-) | Limited scope (fewer tests, single-file) |
| 2 | Claude Opus 4.7 | A-/B+ | 7/8 | DFlash (A), Backwards (A) | Non-ternary embeddings deviate from spec |
| 3 | Qwen3-6 | B+ | 7/8 | KV-Cache (A), Beam Search (A) | DFlash logits trap; algorithmic ceiling |
| 4 | Kimi K2.6 | B | 3/8 | Flash Attn Bwd (A) | DFlash/ternary bugs; narrow strengths |
| 5 | GLM-5.1 | B-/C+ | 3/8 | DFlash (B+) | Regression from GLM-5; ternary overfit |
| 6 | MiniMax-M2.7 | B/B- | 4/8 | | Bugs, no tests; exited early |

Per-Challenge Grade Matrix

| Challenge | Difficulty | GLM-5 | Qwen3-6 | Opus 4.7 | Kimi K2.6 | GLM-5.1 | MiniMax |
|-----------|------------|-------|---------|----------|-----------|---------|---------|
| Layer Norm Backward | Medium | B+ | A- | A | | | B |
| Fused Softmax+TopK | Medium | A- | A | A | | | B |
| KV-Cache | Medium | A- | A | B+ | | | B- |
| Flash Attn Forward | Hard | A- | A | A- | | | B |
| Beam Search | Hard | B+ | A | A | | | B- |
| Flash Attn Backward | Extra Hard | A- | A- | A- | A | B+ | |
| DFlash | Extra Hard | A | B- | A | B- | B+ | |
| Ternary Training | SOTA Research | A- | B+ | B+ | C | C+ | |

Empty cells indicate no recorded grade for that model on that challenge.

Key Takeaways

  1. GLM-5 is the most consistent — the only model to attempt all 8 challenges, it never scored below B+ and won the two hardest (DFlash, Ternary)
  2. Opus 4.7 has the highest floor — strongest consistency across 7 challenges (A to B+), best documentation, and it caught the same algorithmic traps as GLM-5
  3. Qwen3-6 excels at engineering breadth — modular code, comprehensive tests, real model specs; it falls behind on deep algorithmic reasoning
  4. The DFlash logits trap separated the tiers — only GLM-5, GLM-5.1, and Opus 4.7 corrected the broken pseudo-code to parent-indexed logits (illustrated in the first sketch after this list)
  5. Ternary training exposed hyperparameter discipline — final perplexity, lower is better: GLM-5 (PPL=594), Opus (PPL=643), Qwen (PPL=319 after correcting data leakage), Kimi (PPL=5,501), GLM-5.1 (PPL=30,731); see the second sketch after this list
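
To make the logits trap concrete, here is a minimal NumPy sketch of the parent-indexed gather, assuming a flattened speculation tree with per-position target-model logits of shape (T, V); the function and array names are illustrative, not taken from any model's submission:

```python
import numpy as np

def log_softmax(x, axis=-1):
    # Numerically stable log-softmax
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))

def score_tree_tokens(logits, tokens, parent):
    """Score drafted tokens in a speculation tree.

    logits: (T, V) target-model logits at each tree position
    tokens: (T,)  drafted token id at each tree position
    parent: (T,)  index of each node's parent position
    """
    logp = log_softmax(logits, axis=-1)
    # The trap: logits at position p predict the *next* token, so a
    # node's token must be scored under its parent's distribution,
    # not under the logits at the node's own position.
    return logp[parent, tokens]
```

Scoring each node from its own position looks plausible and often still produces finite log-probabilities, which is why the bug survives casual testing.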
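And to make the ternary results concrete, a toy straight-through-estimator (STE) training step with absmean ternarization; this is a common generic scheme sketched for illustration and may differ from the Ternary Bonsai spec the challenge actually targets:

```python
import numpy as np

rng = np.random.default_rng(0)

def ternarize(w, thresh=0.75):
    # Absmean ternarization: snap weights to {-1, 0, +1} and rescale by
    # the mean magnitude of the surviving weights (generic scheme).
    delta = thresh * np.abs(w).mean()
    q = np.sign(w) * (np.abs(w) > delta)
    nz = q != 0
    alpha = np.abs(w[nz]).mean() if nz.any() else 1.0
    return alpha * q

# Toy linear layer trained with an STE: the forward pass uses the
# ternarized weights, while the gradient is applied to the latent
# full-precision weights as if the quantizer were the identity.
d_in = 8
x = rng.normal(size=(256, d_in))
y = x @ rng.normal(size=(d_in, 1))   # synthetic regression target
w = rng.normal(size=(d_in, 1))       # latent full-precision weights
for step in range(200):
    wq = ternarize(w)                # quantized weights in the forward pass
    err = x @ wq - y
    grad = x.T @ err / len(x)        # grad w.r.t. wq, passed straight through
    w -= 0.05 * grad
```

The perplexity spread in takeaway 5 largely comes down to choices like the threshold, the scale estimate, and the learning rate applied to the latent weights.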

Models Tested

| Folder | Model | Provider |
|--------|-------|----------|
| glm5/ | GLM-5 | Z.ai |
| glm5.1/ | GLM-5.1 | Z.ai |
| qwen36/ | Qwen3-6 | OpenRouter |
| opus47_1m/ | Claude Opus 4.7 | Anthropic |
| kimi-k2.6/ | Kimi K2.6 | Moonshot AI |
| minimax-m2.7/ | MiniMax-M2.7 | OpenRouter |

Challenges

| Task | Description |
|------|-------------|
| backwards/ | Numerically stable Layer Norm backward pass from scratch in NumPy (reference sketch below) |
| fuse/ | High-performance fused Softmax + Top-K CUDA kernel (no full softmax materialization) |
| kv/ | Incremental KV-cache for autoregressive transformer inference |
| flash_attention/ | Flash Attention forward pass with tiling and causal masking |
| beam_search/ | Beam search decoder with length penalty and EOS handling |
| flash_attention_bwd/ | Flash Attention backward pass with D-optimization |
| dflash_verify/ | Tree attention (DFlash) with branching logits and subtree invalidation |
| ternary_training/ | Ternary ({-1, 0, +1}) weight training matching PrismML's Ternary Bonsai spec |
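
For orientation, a minimal NumPy sketch of the standard closed-form layer norm backward that backwards/ asks for, assuming the usual (N, D) layout with normalization over the feature axis; this is a generic reference, not any model's graded submission:

```python
import numpy as np

def layer_norm_fwd(x, gamma, beta, eps=1e-5):
    # x: (N, D); normalize over the last axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    x_hat = (x - mu) * inv_std
    return x_hat * gamma + beta, (x_hat, inv_std)

def layer_norm_bwd(dy, cache, gamma):
    # Closed-form gradients; means are taken over the feature axis
    x_hat, inv_std = cache
    dgamma = (dy * x_hat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dxhat = dy * gamma
    dx = inv_std * (dxhat
                    - dxhat.mean(axis=-1, keepdims=True)
                    - x_hat * (dxhat * x_hat).mean(axis=-1, keepdims=True))
    return dx, dgamma, dbeta
```

A finite-difference check against this closed form is the quickest way to validate a from-scratch implementation.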

Repo Layout

├── analysis/                  # Cross-model comparison analyses
│   ├── cross-model-comparison.md    # Full 8-challenge synthesis
│   ├── 2-way-head-to-head*.md       # Pairwise round breakdowns
│   ├── 3-way-head-to-head*.md       # Multi-model round breakdowns
│   ├── dflash-analysis.md           # DFlash deep dive
│   ├── ternary-training-analysis.md # Ternary deep dive
│   └── ...                          # Per-challenge analyses
├── glm5/                      # GLM-5 implementations (8 challenges)
├── glm5.1/                    # GLM-5.1 implementations (3 challenges)
├── qwen36/                    # Qwen3-6 implementations (7 challenges)
├── opus47_1m/                 # Claude Opus 4.7 implementations (7 challenges)
├── kimi-k2.6/                 # Kimi K2.6 implementations (3 challenges)
├── minimax-m2.7/              # MiniMax-M2.7 implementations (4 challenges)
├── deploy_challenges.sh       # Challenge deployment script
└── README.md

Each model folder contains subdirectories per challenge with the model's raw generated code, session logs (session.jsonl), and any model-generated analysis files.

Notes

  • Scores are letter grades (A through F), assigned by DeepSeek V4 Pro in max-effort mode
  • Grading dimensions: correctness, completeness, code quality, analysis depth, testing, algorithmic sophistication, and GPU mapping
  • All raw session logs are preserved (sanitized) in each folder's session.jsonl
  • Source code is untouched — exactly as each model generated it
  • Detailed per-challenge and per-pair analyses live in analysis/