# LLM Programming Benchmarks
Head-to-head evaluation of six coding LLMs across eight low-level ML kernel tasks. Each model was given identical prompts to implement solutions from scratch — no frameworks, no autodiff, no copy-paste. Just raw NumPy / CUDA / MLX.

---
## Methodology
**All analysis, scoring, and write-ups were generated by DeepSeek V4 Pro running on max effort mode.** This means:
- Every grade, comparison, and ranking was determined by an LLM judge, not a human reviewer
- The judge evaluated code for correctness, completeness, testing coverage, analysis depth, algorithmic sophistication, and engineering rigor
- Raw model outputs and session logs are preserved untouched in each model folder
- Scores should be treated as directional indicators, not absolute measurements
**Take every score with a grain of salt.** LLM judges can be consistent but are not infallible. The relative rankings are more useful than the exact numbers.
**Tooling:** The first three challenges (KV-Cache, Fused Softmax+TopK, Layer Norm Backward) were run with **pi-mono** as the harness; the remaining five used **opencode**.

---
## TL;DR — Final Rankings
| Rank | Model | Avg Grade | Challenges | Best Showing | Weakness |
|------|-------|-----------|------------|--------------|----------|
| **1** | **GLM-5** | **A-/B+** | 8/8 | DFlash (A), Ternary (A-) | Limited scope (fewer tests, single-file) |
| **2** | **Claude Opus 4.7** | **A-/B+** | 7/8 | DFlash (A), Backwards (A) | Non-ternary embeddings deviates from spec |
| **3** | **Qwen3-6** | **B+** | 7/8 | KV-Cache (A), Beam Search (A) | DFlash logits trap, algorithmic ceiling |
| **4** | **Kimi K2.6** | **B** | 3/8 | Flash Attn Bwd (A) | DFlash/ternary bugs; narrow strengths |
| **5** | **GLM-5.1** | **B-/C+** | 3/8 | DFlash (B+) | Regression from GLM-5; ternary overfit |
| **6** | **MiniMax-M2.7** | **B/B-** | 4/8 | — | Bugs, no tests; exited early |
### Per-Challenge Grade Matrix
| Challenge | Difficulty | GLM-5 | Qwen3-6 | Opus 4.7 | Kimi K2.6 | GLM-5.1 | MiniMax |
|-----------|------------|-------|---------|----------|-----------|---------|---------|
| Layer Norm Backward | Medium | B+ | **A-** | **A** | — | — | B |
| Fused Softmax+TopK | Medium | **A-** | **A** | **A** | — | — | B |
| KV-Cache | Medium | A- | **A** | B+ | — | — | B- |
| Flash Attn Forward | Hard | A- | **A** | A- | — | — | B |
| Beam Search | Hard | B+ | **A** | **A** | — | — | B- |
| Flash Attn Backward | Extra Hard | A- | A- | A- | **A** | B+ | — |
| DFlash | Extra Hard | **A** | B- | **A** | B- | B+ | — |
| Ternary Training | SOTA Research | **A-** | B+ | B+ | C | C+ | — |

---
## Key Takeaways
1. **GLM-5 is the most consistent** — the only model to complete all 8 challenges, never graded below B+, and winner of the two hardest (DFlash, Ternary)
2. **Opus 4.7 has the highest floor** — strongest consistency across 7 challenges (A to B+), best documentation, caught the same algorithmic traps as GLM-5
3. **Qwen3-6 excels at engineering breadth** — modular code, comprehensive tests, real model specs; falls behind on deep algorithmic reasoning
4. **The DFlash logits trap separated the tiers** — only GLM-5, GLM-5.1, and Opus 4.7 corrected the broken pseudo-code to parent-indexed logits
5. **Ternary training exposed hyperparameter discipline** — GLM-5 (PPL=594), Opus (PPL=643), Qwen (PPL=319 after correcting data leakage), Kimi (PPL=5,501), GLM-5.1 (PPL=30,731)
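For context on what the ternary task demands: the Ternary Bonsai spec itself isn't reproduced in this README, but a common recipe it resembles (BitNet b1.58-style absmean quantization, used here purely as an illustrative assumption) maps each weight to {-1, 0, +1} times a per-tensor scale:

```python
import numpy as np

def ternarize(w, eps=1e-8):
    """Absmean ternary quantization (BitNet b1.58-style sketch):
    scale by the mean absolute weight, then round into {-1, 0, +1}."""
    scale = np.abs(w).mean() + eps           # per-tensor scale
    q = np.clip(np.round(w / scale), -1, 1)  # ternary codes
    return q * scale, q                      # dequantized weights, raw codes

np.random.seed(0)
w = np.random.randn(256, 256)
wq, q = ternarize(w)
# q contains only the codes {-1, 0, +1}; wq is q times the shared scale.
```

Getting a low perplexity with weights this coarse is mostly a matter of training discipline (straight-through gradients, learning-rate schedule), which is exactly what the spread of PPL scores above reflects.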

---
## Models Tested
| Folder | Model | Provider |
|--------|-------|----------|
| `glm5/` | GLM-5 | Z.ai |
| `glm5.1/` | GLM-5.1 | Z.ai |
| `qwen36/` | Qwen3-6 | OpenRouter |
| `opus47_1m/` | Claude Opus 4.7 | Anthropic |
| `kimi-k2.6/` | Kimi K2.6 | Moonshot AI |
| `minimax-m2.7/` | MiniMax-M2.7 | OpenRouter |
## Challenges
| Task | Description |
|------|-------------|
| `backwards/` | Numerically stable Layer Norm backward pass from scratch in NumPy |
| `fuse/` | High-performance fused Softmax + Top-K CUDA kernel (no full softmax materialization) |
| `kv/` | Incremental KV-cache for autoregressive transformer inference |
| `flash_attention/` | Flash Attention forward pass with tiling and causal masking |
| `beam_search/` | Beam search decoder with length penalty and EOS handling |
| `flash_attention_bwd/` | Flash Attention backward pass with D-optimization |
| `dflash_verify/` | Tree attention (DFlash) with branching logits and subtree invalidation |
| `ternary_training/` | Ternary (~1.58-bit) weight training matching PrismML's Ternary Bonsai spec |
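To give a flavor of the `backwards/` task (this is not any model's submission, just a minimal sketch of the standard LayerNorm gradient in plain NumPy):

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """LayerNorm over the last axis; returns output plus a cache for backward."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(var + eps)
    xhat = (x - mu) * inv_std
    return gamma * xhat + beta, (xhat, inv_std)

def layernorm_backward(dy, cache, gamma):
    """Closed-form LayerNorm gradients, no autodiff."""
    xhat, inv_std = cache
    dgamma = (dy * xhat).sum(axis=0)
    dbeta = dy.sum(axis=0)
    dxhat = dy * gamma
    # Subtract the two projection terms arising from mu and var depending on x.
    dx = (dxhat
          - dxhat.mean(axis=-1, keepdims=True)
          - xhat * (dxhat * xhat).mean(axis=-1, keepdims=True)) * inv_std
    return dx, dgamma, dbeta
```

A central finite-difference check (perturb one input element and compare `(f(x+h) - f(x-h)) / 2h` against `dx`) is the usual way to validate an implementation like this, and is the kind of test the judge rewarded.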
## Repo Layout
```
├── analysis/                       # Cross-model comparison analyses
│   ├── cross-model-comparison.md       # Full 8-challenge synthesis
│   ├── 2-way-head-to-head*.md          # Pairwise round breakdowns
│   ├── 3-way-head-to-head*.md          # Multi-model round breakdowns
│   ├── dflash-analysis.md              # DFlash deep dive
│   ├── ternary-training-analysis.md    # Ternary deep dive
│   └── ...                             # Per-challenge analyses
├── glm5/                           # GLM-5 implementations (8 challenges)
├── glm5.1/                         # GLM-5.1 implementations (3 challenges)
├── qwen36/                         # Qwen3-6 implementations (7 challenges)
├── opus47_1m/                      # Claude Opus 4.7 implementations (7 challenges)
├── kimi-k2.6/                      # Kimi K2.6 implementations (3 challenges)
├── minimax-m2.7/                   # MiniMax-M2.7 implementations (4 challenges)
├── deploy_challenges.sh            # Challenge deployment script
└── README.md
```
Each model folder contains subdirectories per challenge with the model's raw generated code, session logs (`session.jsonl`), and any model-generated analysis files.
## Notes
- Scores are letter grades (A through F), assigned by DeepSeek V4 Pro on max effort
- Grading dimensions: correctness, completeness, code quality, analysis depth, testing, algorithmic sophistication, and GPU mapping
- All raw session logs are preserved (sanitized) in each folder's `session.jsonl`
- Source code is untouched — exactly as each model generated it
- Detailed per-challenge and per-pair analyses live in `analysis/`