# LLM Programming Benchmarks

Head-to-head evaluation of six coding LLMs across eight low-level ML kernel tasks. Each model was given identical prompts to implement solutions from scratch — no frameworks, no autodiff, no copy-paste. Just raw NumPy / CUDA / MLX.

---

## Methodology

**All analysis, scoring, and write-ups were generated by DeepSeek V4 Pro running on max effort mode.** This means:

- Every grade, comparison, and ranking was determined by an LLM judge, not a human reviewer
- The judge evaluated code for correctness, completeness, testing coverage, analysis depth, algorithmic sophistication, and engineering rigor
- Raw model outputs and session logs are preserved untouched in each model folder
- Scores should be treated as directional indicators, not absolute measurements

**Take every score with a grain of salt.** LLM judges can be consistent but are not infallible; the relative rankings are more useful than the exact numbers.

**Tooling:** The first three challenges (KV-Cache, Fused Softmax+TopK, Layer Norm Backward) were run with **pi-mono** as the harness; the remaining five used **opencode**.

---

## TL;DR — Final Rankings

| Rank | Model | Avg Grade | Challenges | Best Showing | Weakness |
|------|-------|-----------|------------|--------------|----------|
| **1** | **GLM-5** | **A-/B+** | 8/8 | DFlash (A), Ternary (A-) | Limited scope (fewer tests, single-file) |
| **2** | **Claude Opus 4.7** | **A-/B+** | 7/8 | DFlash (A), Backwards (A) | Non-ternary embeddings deviate from spec |
| **3** | **Qwen3-6** | **B+** | 7/8 | KV-Cache (A), Beam Search (A) | DFlash logits trap, algorithmic ceiling |
| **4** | **Kimi K2.6** | **B** | 3/8 | Flash Attn Bwd (A) | DFlash/ternary bugs; narrow strengths |
| **5** | **GLM-5.1** | **B-/C+** | 3/8 | DFlash (B+) | Regression from GLM-5; ternary overfit |
| **6** | **MiniMax-M2.7** | **B/B-** | 4/8 | — | Bugs, no tests; exited early |

### Per-Challenge Grade Matrix

| Challenge | Difficulty | GLM-5 | Qwen3-6 | Opus 4.7 | Kimi K2.6 | GLM-5.1 | MiniMax |
|-----------|------------|-------|---------|----------|-----------|---------|---------|
| Layer Norm Backward | Medium | B+ | **A-** | **A** | — | — | B |
| Fused Softmax+TopK | Medium | **A-** | **A** | **A** | — | — | B |
| KV-Cache | Medium | A- | **A** | B+ | — | — | B- |
| Flash Attn Forward | Hard | A- | **A** | A- | — | — | B |
| Beam Search | Hard | B+ | **A** | **A** | — | — | B- |
| Flash Attn Backward | Extra Hard | A- | A- | A- | **A** | B+ | — |
| DFlash | Extra Hard | **A** | B- | **A** | B- | B+ | — |
| Ternary Training | SOTA Research | **A-** | B+ | B+ | C | C+ | — |

---

## Key Takeaways

1. **GLM-5 is the most consistent** — the only model to participate in all 8 challenges, never scored below B+, and won the two hardest challenges (DFlash, Ternary)
2. **Opus 4.7 has the highest floor** — strongest consistency across 7 challenges (A to B+), best documentation, caught the same algorithmic traps as GLM-5
3. **Qwen3-6 excels at engineering breadth** — modular code, comprehensive tests, real model specs; falls behind on deep algorithmic reasoning
4. **The DFlash logits trap separated the tiers** — only GLM-5, GLM-5.1, and Opus 4.7 corrected the broken pseudo-code to parent-indexed logits (illustrated in the first sketch after this list)
5. **Ternary training exposed hyperparameter discipline** — GLM-5 (PPL=594), Opus (PPL=643), Qwen (PPL=319 after correcting data leakage), Kimi (PPL=5,501), GLM-5.1 (PPL=30,731); a generic ternarization sketch follows below
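To make takeaway 4 concrete, here is a minimal NumPy sketch of greedy verification over a speculative token tree with parent-indexed logits. It is an illustration under our own assumptions, not code from the challenge spec in `dflash_verify/` or from any model's submission; the function and array names (`verify_tree`, `parents`, and so on) are hypothetical.

```python
import numpy as np

def verify_tree(tokens, parents, logits):
    """Greedy verification of a speculative token tree.

    tokens  : (N,)   candidate token id at each tree node (node 0 = root)
    parents : (N,)   parent index of each node; parents[0] = -1
    logits  : (N, V) logits produced at each node position by tree attention

    Assumes nodes are topologically ordered (every parent precedes its
    children). The trap: node i is checked against logits[parents[i]],
    because the distribution emitted *at the parent's position* is what
    predicts the child's token; self-indexed logits[i] verify nothing.
    """
    accepted = np.zeros(len(tokens), dtype=bool)
    accepted[0] = True  # root is the last committed token
    for i in range(1, len(tokens)):
        if not accepted[parents[i]]:
            continue  # subtree invalidation: a rejected parent kills all descendants
        accepted[i] = tokens[i] == np.argmax(logits[parents[i]])
    return accepted
```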
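Takeaway 5 is easier to read with a concrete quantizer in mind. Below is a generic absmean ternarization sketch in the style popularized by BitNet b1.58. The challenge itself graded models against PrismML's Ternary Bonsai spec, which is not reproduced here, so treat this as background only.

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Absmean ternarization: w -> scale * {-1, 0, +1}.

    Ternary training keeps latent full-precision weights; the forward pass
    uses the ternarized copy, and gradients flow back to the latent weights
    through a straight-through estimator (round/clip treated as identity).
    """
    scale = np.abs(w).mean() + eps             # per-tensor absmean scale
    w_t = np.clip(np.round(w / scale), -1, 1)  # codes in {-1, 0, +1}
    return w_t * scale                         # dequantized ternary weights
```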
---

## Models Tested

| Folder | Model | Provider |
|--------|-------|----------|
| `glm5/` | GLM-5 | Z.ai |
| `glm5.1/` | GLM-5.1 | Z.ai |
| `qwen36/` | Qwen3-6 | OpenRouter |
| `opus47_1m/` | Claude Opus 4.7 | Anthropic |
| `kimi-k2.6/` | Kimi K2.6 | Moonshot AI |
| `minimax-m2.7/` | MiniMax-M2.7 | OpenRouter |

## Challenges

| Task | Description |
|------|-------------|
| `backwards/` | Numerically stable Layer Norm backward pass from scratch in NumPy |
| `fuse/` | High-performance fused Softmax + Top-K CUDA kernel (no full softmax materialization) |
| `kv/` | Incremental KV-cache for autoregressive transformer inference |
| `flash_attention/` | Flash Attention forward pass with tiling and causal masking |
| `beam_search/` | Beam search decoder with length penalty and EOS handling |
| `flash_attention_bwd/` | Flash Attention backward pass with D-optimization |
| `dflash_verify/` | Tree attention (DFlash) with branching logits and subtree invalidation |
| `ternary_training/` | Ternary (1.58-bit) weight training matching PrismML's Ternary Bonsai spec |

Minimal hand-written sketches of the `backwards/` and `kv/` tasks appear in the appendix at the end of this README.

## Repo Layout

```
├── analysis/                        # Cross-model comparison analyses
│   ├── cross-model-comparison.md    # Full 8-challenge synthesis
│   ├── 2-way-head-to-head*.md       # Pairwise round breakdowns
│   ├── 3-way-head-to-head*.md       # Multi-model round breakdowns
│   ├── dflash-analysis.md           # DFlash deep dive
│   ├── ternary-training-analysis.md # Ternary deep dive
│   └── ...                          # Per-challenge analyses
├── glm5/                            # GLM-5 implementations (8 challenges)
├── glm5.1/                          # GLM-5.1 implementations (3 challenges)
├── qwen36/                          # Qwen3-6 implementations (7 challenges)
├── opus47_1m/                       # Claude Opus 4.7 implementations (7 challenges)
├── kimi-k2.6/                       # Kimi K2.6 implementations (3 challenges)
├── minimax-m2.7/                    # MiniMax-M2.7 implementations (4 challenges)
├── deploy_challenges.sh             # Challenge deployment script
└── README.md
```

Each model folder contains subdirectories per challenge with the model's raw generated code, session logs (`session.jsonl`), and any model-generated analysis files.

## Notes

- Scores are letter grades (A through F), assigned by DeepSeek V4 Pro on max effort
- Grading dimensions: correctness, completeness, code quality, analysis depth, testing, algorithmic sophistication, and GPU mapping
- All raw session logs are preserved (sanitized) in each folder's `session.jsonl`
- Source code is untouched — exactly as each model generated it
- Detailed per-challenge and per-pair analyses live in `analysis/`
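---

## Appendix: Reference Sketches

These sketches are hand-written for this README, not taken from any model's submission, and are meant only to calibrate what the simpler challenges ask for. First, a standard Layer Norm backward pass over the last axis (the `backwards/` task); the formula is the textbook one, and the helper name `layer_norm_bwd` is ours.

```python
import numpy as np

def layer_norm_bwd(dy, x, gamma, eps=1e-5):
    """Gradients for y = gamma * (x - mu) / sqrt(var + eps) + beta."""
    N = x.shape[-1]
    mu = x.mean(-1, keepdims=True)
    inv = 1.0 / np.sqrt(x.var(-1, keepdims=True) + eps)  # 1 / sigma
    xhat = (x - mu) * inv
    dgamma = (dy * xhat).sum(axis=tuple(range(x.ndim - 1)))
    dbeta = dy.sum(axis=tuple(range(x.ndim - 1)))
    dxhat = dy * gamma
    # mean and variance paths folded into one expression
    dx = inv / N * (N * dxhat
                    - dxhat.sum(-1, keepdims=True)
                    - xhat * (dxhat * xhat).sum(-1, keepdims=True))
    return dx, dgamma, dbeta
```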
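And a bare-bones incremental KV cache in the spirit of the `kv/` task (single layer, no batching; the class and attribute names are ours). Preallocating to `max_len` and writing in place is what makes it incremental: no per-step concatenation or reallocation.

```python
import numpy as np

class KVCache:
    """Preallocated K/V storage for one attention layer."""

    def __init__(self, max_len, n_heads, head_dim, dtype=np.float32):
        self.k = np.zeros((max_len, n_heads, head_dim), dtype=dtype)
        self.v = np.zeros((max_len, n_heads, head_dim), dtype=dtype)
        self.pos = 0  # number of tokens cached so far

    def append(self, k_new, v_new):
        """Write the new token's K/V in place; return views of the valid prefix."""
        self.k[self.pos] = k_new
        self.v[self.pos] = v_new
        self.pos += 1
        return self.k[:self.pos], self.v[:self.pos]
```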