# LLM Programming Benchmarks

Head-to-head evaluation of six coding LLMs across eight low-level ML kernel tasks. Each model was given identical prompts to implement solutions from scratch — no frameworks, no autodiff, no copy-paste. Just raw NumPy / CUDA / MLX.

---

## Methodology

**All analysis, scoring, and write-ups were generated by DeepSeek V4 Pro running on max effort mode.** This means:

- Every grade, comparison, and ranking was determined by an LLM judge, not a human reviewer
- The judge evaluated code for correctness, completeness, testing coverage, analysis depth, algorithmic sophistication, and engineering rigor
- Raw model outputs and session logs are preserved untouched in each model folder
- Scores should be treated as directional indicators, not absolute measurements

**Take every score with a grain of salt.** LLM judges can be consistent but are not infallible; the relative rankings are more useful than the exact numbers.

**Tooling:** The first three challenges (KV-Cache, Fused Softmax+TopK, Layer Norm Backward) were run with **pi-mono** as the harness; the remaining five used **opencode**.

---

## TL;DR — Final Rankings

| Rank | Model | Avg Grade | Challenges | Best Showing | Weakness |
|------|-------|-----------|------------|--------------|----------|
| **1** | **GLM-5** | **A-/B+** | 8/8 | DFlash (A), Ternary (A-) | Limited scope (fewer tests, single-file) |
| **2** | **Claude Opus 4.7** | **A-/B+** | 7/8 | DFlash (A), Backwards (A) | Non-ternary embeddings deviate from spec |
| **3** | **Qwen3-6** | **B+** | 7/8 | KV-Cache (A), Beam Search (A) | DFlash logits trap, algorithmic ceiling |
| **4** | **Kimi K2.6** | **B** | 3/8 | Flash Attn Bwd (A) | DFlash/ternary bugs; narrow strengths |
| **5** | **GLM-5.1** | **B-/C+** | 3/8 | DFlash (B+) | Regression from GLM-5; ternary overfit |
| **6** | **MiniMax-M2.7** | **B/B-** | 4/8 | — | Bugs, no tests; exited early |

### Per-Challenge Grade Matrix

| Challenge | Difficulty | GLM-5 | Qwen3-6 | Opus 4.7 | Kimi K2.6 | GLM-5.1 | MiniMax |
|-----------|------------|-------|---------|----------|-----------|---------|---------|
| Layer Norm Backward | Medium | B+ | **A-** | **A** | — | — | B |
| Fused Softmax+TopK | Medium | **A-** | **A** | **A** | — | — | B |
| KV-Cache | Medium | A- | **A** | B+ | — | — | B- |
| Flash Attn Forward | Hard | A- | **A** | A- | — | — | B |
| Beam Search | Hard | B+ | **A** | **A** | — | — | B- |
| Flash Attn Backward | Extra Hard | A- | A- | A- | **A** | B+ | — |
| DFlash | Extra Hard | **A** | B- | **A** | B- | B+ | — |
| Ternary Training | SOTA Research | **A-** | B+ | B+ | C | C+ | — |

---

## Key Takeaways

1. **GLM-5 is the most consistent** — the only model to participate in all 8 challenges, never scored below B+, and won the two hardest challenges (DFlash, Ternary)
2. **Opus 4.7 has the highest floor** — strongest consistency across 7 challenges (A to B+), best documentation, caught the same algorithmic traps as GLM-5
3. **Qwen3-6 excels at engineering breadth** — modular code, comprehensive tests, real model specs; falls behind on deep algorithmic reasoning
4. **The DFlash logits trap separated the tiers** — only GLM-5, GLM-5.1, and Opus 4.7 corrected the broken pseudo-code to parent-indexed logits (illustrated in the first sketch after this list)
5. **Ternary training exposed hyperparameter discipline** — GLM-5 (PPL=594), Opus (PPL=643), Qwen (PPL=319 after correcting data leakage), Kimi (PPL=5,501), GLM-5.1 (PPL=30,731); a generic ternarization sketch follows below
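To make takeaway 4 concrete, here is a minimal NumPy sketch of greedy verification over a speculative token tree with parent-indexed logits. It is an illustration under our own assumptions, not code from the challenge spec in `dflash_verify/` or from any model's submission; the function and array names (`verify_tree`, `parents`, and so on) are hypothetical.

```python
import numpy as np

def verify_tree(tokens, parents, logits):
    """Greedy verification of a speculative token tree.

    tokens  : (N,)   candidate token id at each tree node (node 0 = root)
    parents : (N,)   parent index of each node; parents[0] = -1
    logits  : (N, V) logits produced at each node position by tree attention

    Assumes nodes are topologically ordered (every parent precedes its
    children). The trap: node i is checked against logits[parents[i]],
    because the distribution emitted *at the parent's position* is what
    predicts the child's token; self-indexed logits[i] verify nothing.
    """
    accepted = np.zeros(len(tokens), dtype=bool)
    accepted[0] = True  # root is the last committed token
    for i in range(1, len(tokens)):
        if not accepted[parents[i]]:
            continue  # subtree invalidation: a rejected parent kills all descendants
        accepted[i] = tokens[i] == np.argmax(logits[parents[i]])
    return accepted
```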
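Takeaway 5 is easier to read with a concrete quantizer in mind. Below is a generic absmean ternarization sketch in the style popularized by BitNet b1.58. The challenge itself graded models against PrismML's Ternary Bonsai spec, which is not reproduced here, so treat this as background only.

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Absmean ternarization: w -> scale * {-1, 0, +1}.

    Ternary training keeps latent full-precision weights; the forward pass
    uses the ternarized copy, and gradients flow back to the latent weights
    through a straight-through estimator (round/clip treated as identity).
    """
    scale = np.abs(w).mean() + eps             # per-tensor absmean scale
    w_t = np.clip(np.round(w / scale), -1, 1)  # codes in {-1, 0, +1}
    return w_t * scale                         # dequantized ternary weights
```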
---

## Models Tested

| Folder | Model | Provider |
|--------|-------|----------|
| `glm5/` | GLM-5 | Z.ai |
| `glm5.1/` | GLM-5.1 | Z.ai |
| `qwen36/` | Qwen3-6 | OpenRouter |
| `opus47_1m/` | Claude Opus 4.7 | Anthropic |
| `kimi-k2.6/` | Kimi K2.6 | Moonshot AI |
| `minimax-m2.7/` | MiniMax-M2.7 | OpenRouter |

## Challenges

| Task | Description |
|------|-------------|
| `backwards/` | Numerically stable Layer Norm backward pass from scratch in NumPy |
| `fuse/` | High-performance fused Softmax + Top-K CUDA kernel (no full softmax materialization) |
| `kv/` | Incremental KV-cache for autoregressive transformer inference |
| `flash_attention/` | Flash Attention forward pass with tiling and causal masking |
| `beam_search/` | Beam search decoder with length penalty and EOS handling |
| `flash_attention_bwd/` | Flash Attention backward pass with D-optimization |
| `dflash_verify/` | Tree attention (DFlash) with branching logits and subtree invalidation |
| `ternary_training/` | Ternary (1.58-bit) weight training matching PrismML's Ternary Bonsai spec |

Minimal hand-written sketches of the `backwards/` and `kv/` tasks appear in the appendix at the end of this README.

## Repo Layout

```
├── analysis/                        # Cross-model comparison analyses
│   ├── cross-model-comparison.md    # Full 8-challenge synthesis
│   ├── 2-way-head-to-head*.md       # Pairwise round breakdowns
│   ├── 3-way-head-to-head*.md       # Multi-model round breakdowns
│   ├── dflash-analysis.md           # DFlash deep dive
│   ├── ternary-training-analysis.md # Ternary deep dive
│   └── ...                          # Per-challenge analyses
├── glm5/                            # GLM-5 implementations (8 challenges)
├── glm5.1/                          # GLM-5.1 implementations (3 challenges)
├── qwen36/                          # Qwen3-6 implementations (7 challenges)
├── opus47_1m/                       # Claude Opus 4.7 implementations (7 challenges)
├── kimi-k2.6/                       # Kimi K2.6 implementations (3 challenges)
├── minimax-m2.7/                    # MiniMax-M2.7 implementations (4 challenges)
├── deploy_challenges.sh             # Challenge deployment script
└── README.md
```

Each model folder contains subdirectories per challenge with the model's raw generated code, session logs (`session.jsonl`), and any model-generated analysis files.

## Notes

- Scores are letter grades (A through F), assigned by DeepSeek V4 Pro on max effort
- Grading dimensions: correctness, completeness, code quality, analysis depth, testing, algorithmic sophistication, and GPU mapping
- All raw session logs are preserved (sanitized) in each folder's `session.jsonl`
- Source code is untouched — exactly as each model generated it
- Detailed per-challenge and per-pair analyses live in `analysis/`
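---

## Appendix: Reference Sketches

These sketches are hand-written for this README, not taken from any model's submission, and are meant only to calibrate what the simpler challenges ask for. First, a standard Layer Norm backward pass over the last axis (the `backwards/` task); the formula is the textbook one, and the helper name `layer_norm_bwd` is ours.

```python
import numpy as np

def layer_norm_bwd(dy, x, gamma, eps=1e-5):
    """Gradients for y = gamma * (x - mu) / sqrt(var + eps) + beta."""
    N = x.shape[-1]
    mu = x.mean(-1, keepdims=True)
    inv = 1.0 / np.sqrt(x.var(-1, keepdims=True) + eps)  # 1 / sigma
    xhat = (x - mu) * inv
    dgamma = (dy * xhat).sum(axis=tuple(range(x.ndim - 1)))
    dbeta = dy.sum(axis=tuple(range(x.ndim - 1)))
    dxhat = dy * gamma
    # mean and variance paths folded into one expression
    dx = inv / N * (N * dxhat
                    - dxhat.sum(-1, keepdims=True)
                    - xhat * (dxhat * xhat).sum(-1, keepdims=True))
    return dx, dgamma, dbeta
```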
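And a bare-bones incremental KV cache in the spirit of the `kv/` task (single layer, no batching; the class and attribute names are ours). Preallocating to `max_len` and writing in place is what makes it incremental: no per-step concatenation or reallocation.

```python
import numpy as np

class KVCache:
    """Preallocated K/V storage for one attention layer."""

    def __init__(self, max_len, n_heads, head_dim, dtype=np.float32):
        self.k = np.zeros((max_len, n_heads, head_dim), dtype=dtype)
        self.v = np.zeros((max_len, n_heads, head_dim), dtype=dtype)
        self.pos = 0  # number of tokens cached so far

    def append(self, k_new, v_new):
        """Write the new token's K/V in place; return views of the valid prefix."""
        self.k[self.pos] = k_new
        self.v[self.pos] = v_new
        self.pos += 1
        return self.k[:self.pos], self.v[:self.pos]
```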