# LLM Programming Benchmarks

A head-to-head evaluation of three coding LLMs on three low-level ML kernel tasks. Each model was given identical prompts to implement:

  1. KV Cache — an incremental KV cache for autoregressive transformer inference
  2. Backwards Pass — a numerically stable layer-norm backward pass, from scratch in NumPy
  3. Fused Softmax + Top-K — a high-performance CUDA kernel (no full softmax materialization)

No frameworks, no autodiff, no copy-paste. Just raw NumPy / CUDA from scratch.
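
To give a flavor of task 1, here is a minimal sketch of the kind of incremental cache involved. The `KVCache` class, its shapes, and its method names are illustrative assumptions, not taken from any model's submission:

```python
import numpy as np

class KVCache:
    """Illustrative per-layer cache: preallocate once, append one decode step at a time."""
    def __init__(self, batch, n_heads, max_seq, head_dim, dtype=np.float32):
        self.k = np.zeros((batch, n_heads, max_seq, head_dim), dtype=dtype)
        self.v = np.zeros((batch, n_heads, max_seq, head_dim), dtype=dtype)
        self.len = 0  # number of positions filled so far

    def append(self, k_new, v_new):
        # k_new, v_new: (batch, n_heads, head_dim) for the single new token
        self.k[:, :, self.len] = k_new
        self.v[:, :, self.len] = v_new
        self.len += 1
        # attention for the new token only needs the filled prefix
        return self.k[:, :, :self.len], self.v[:, :, :self.len]
```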

All judging, scoring, and write-ups were performed by Kimi K2.6, which analyzed the generated code for correctness, completeness, testing, and analysis depth.


## Models Tested

| Folder | Model |
|--------|-------|
| `qwen36` | Qwen3.6-27B (via OpenRouter) |
| `glm5` | GLM-5 (via Z.ai) |
| `minimax-m2.7` | MiniMax-M2.7 (via OpenRouter) |

## Final Rankings

| Rank | Model | Average Score | Best Task | Worst Task |
|------|-------|---------------|-----------|------------|
| 🥇 | Qwen3.6-27B | 89 | KV (92 avg) | Fuse (78) |
| 🥈 | GLM-5 | 81 | KV / Backwards (82) | Fuse (80) |
| 🥉 | MiniMax-M2.7 | 66 | Backwards (76) | Fuse (58) |

## Complete Scoreboard

### Round 1: MiniMax-M2.7 vs Qwen3.6-27B

| Task | MiniMax-M2.7 | Qwen3.6-27B | Winner | Margin |
|------|--------------|-------------|--------|--------|
| KV Cache | 64 | 91 | `qwen36` | +27 |
| Backwards Pass | 76 | 92 | `qwen36` | +16 |
| Fused Softmax+TopK | 58 | 88 | `qwen36` | +30 |
| Average | 66 | 90 | `qwen36` | +24 |

### Round 2: GLM-5 vs Qwen3.6-27B

| Task | GLM-5 | Qwen3.6-27B | Winner | Margin |
|------|-------|-------------|--------|--------|
| KV Cache | 82 | 94 | `qwen36` | +12 |
| Backwards Pass | 82 | 93 | `qwen36` | +11 |
| Fused Softmax+TopK | 80 | 78 | `glm5` | +2 |
| Average | 81 | 88 | `qwen36` | +7 |

## Task-by-Task Breakdown

### KV Cache

- Qwen3.6-27B (91, 94) — Consistently dominant: 10 demos, modular architecture, comparisons against real models, GQA support, and arithmetic-intensity analysis.
- GLM-5 (82) — Correct, with good tests, excellent docs, and INT4 quantization. Lost points for missing MLP/causal masking and less systems depth.
- MiniMax-M2.7 (64) — Inverted causal mask, broken batched caching, no tests, and a 1,720-line monolith.

### Backwards Pass

- Qwen3.6-27B (92, 93) — Minimal cache, a concrete stability demo, 3-file separation, 5 edge-case tests, and a cross-checked derivation.
- GLM-5 (82) — Excellent conciseness (~280 lines), minimal cache, safe gradient check. Lost points for having no edge-case tests and no stability demo.
- MiniMax-M2.7 (76) — Over-cached (10 items), no edge-case tests, a fragile in-place gradient check, and a monolithic layout.
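
To ground the "minimal cache" point, here is a generic NumPy sketch of a layer-norm forward/backward pair that caches only the normalized input and the inverse standard deviation. It illustrates the approach being graded, not code from any of the submissions:

```python
import numpy as np

def layer_norm_fwd(x, gamma, beta, eps=1e-5):
    # x: (N, D), gamma/beta: (D,)
    mu = x.mean(axis=-1, keepdims=True)
    inv_std = 1.0 / np.sqrt(x.var(axis=-1, keepdims=True) + eps)
    x_hat = (x - mu) * inv_std
    cache = (x_hat, inv_std)  # the minimal cache: two tensors
    return x_hat * gamma + beta, cache

def layer_norm_bwd(dy, gamma, cache):
    x_hat, inv_std = cache
    dbeta = dy.sum(axis=0)
    dgamma = (dy * x_hat).sum(axis=0)
    g_hat = dy * gamma
    # project out the mean and the x_hat-aligned component, then rescale
    dx = inv_std * (g_hat
                    - g_hat.mean(axis=-1, keepdims=True)
                    - x_hat * (g_hat * x_hat).mean(axis=-1, keepdims=True))
    return dx, dgamma, dbeta
```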

### Fused Softmax+TopK

- GLM-5 (80) — Single-pass online softmax (research-level), 1× global reads, register heaps. Won narrowly (+2) but has a cross-warp merge bug when `WARPS_PER_BLOCK` > 1.
- Qwen3.6-27B (88, 78) — Two kernel versions, a correct merge, vectorized loads, and a benchmark harness. Lost this task in Round 2 due to its suboptimal 3-pass algorithm (12V global reads vs. 4V).
- MiniMax-M2.7 (58) — Broken inter-warp merge (156 threads ignored), a compilation typo, and zero tests.
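
For context, the single-pass online softmax GLM-5 implemented keeps a running max and a rescaled running sum, so the normalizer comes out of one sweep over the scores. A minimal scalar Python sketch of the recurrence (the real kernels parallelize this across threads and merge per-warp partials, which is exactly where the merge bugs above crept in):

```python
import math

def online_softmax_normalizer(xs):
    """One pass over the scores: running max m, running rescaled sum d."""
    m, d = float("-inf"), 0.0
    for x in xs:
        m_new = max(m, x)
        # bring the old partial sum into the new max's frame, then add this term
        d = d * math.exp(m - m_new) + math.exp(x - m_new)
        m = m_new
    return m, d  # softmax(x_i) = exp(x_i - m) / d
```

Merging two partial states `(m1, d1)` and `(m2, d2)` uses the same rescaling step, which is the part both buggy kernels got wrong.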

## Key Patterns

### What Separates the Tiers

| Dimension | MiniMax-M2.7 | GLM-5 | Qwen3.6-27B |
|-----------|--------------|-------|-------------|
| Correctness | Buggy in all 3 | Correct (1 minor bug) | Correct in all 3 |
| Testing | None | ⚠️ Basic assertions | Comprehensive suites |
| Analysis depth | ⚠️ High-level / conceptual | Good | Quantitative + real models |
| Code quality | Bloated monoliths | Concise & focused | Modular & production-grade |
| Algorithmic sophistication | ⚠️ Claims many, delivers few | Online softmax, INT4 | Solid, well-validated |
| Engineering rigor | Untested claims | Clean & minimal | Every claim validated |

### The Decisive Factors

  1. Testing is everything: Qwen3.6-27B's comprehensive test suites caught issues that GLM-5 and MiniMax-M2.7 missed. GLM-5's fuse bug (cross-warp merge) would have been caught by a multi-row test. MiniMax-M2.7's causal mask bug would have been caught by any numerical validation.

  2. Concrete > theoretical: Qwen3.6-27B demonstrated numerical stability problems with actual numbers; MiniMax-M2.7 and GLM-5 only described them. This pattern repeated across all tasks.

  3. Minimal cache wins: Both Qwen3.6-27B and GLM-5 used minimal caches (3–4 items), while MiniMax-M2.7 over-cached (10 items). The backward pass is particularly sensitive to this — the compact projection formula (one standard form is shown after this list) eliminates most intermediates.

  4. Algorithmic sophistication has tradeoffs: GLM-5's online softmax was theoretically optimal but harder to get right (the cross-warp bug). Qwen3.6-27B's 3-pass approach was simpler and correct but suboptimal in memory traffic.
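
For reference, one standard form of the compact projection mentioned in point 3, with $\hat{x}$ the normalized input, $\sigma$ the per-row standard deviation, and an overbar denoting the per-row mean:

```math
\frac{\partial L}{\partial x}
  = \frac{1}{\sigma}\left(\hat{g} \;-\; \overline{\hat{g}} \;-\; \hat{x}\,\overline{\hat{g}\odot\hat{x}}\right),
\qquad \hat{g} = \gamma \odot \frac{\partial L}{\partial y}
```

Only $\hat{x}$ and $1/\sigma$ appear, which is why a 3–4-item cache suffices.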


## Repo Layout

```
├── glm5/                    # GLM-5 implementations
│   ├── backwards/
│   ├── fuse/
│   └── kv/
├── minimax-m2.7/            # MiniMax-M2.7 implementations
│   ├── backwards/
│   ├── fuse/
│   └── kv/
├── qwen36/                  # Qwen3.6-27B implementations
│   ├── backwards/
│   ├── fuse/
│   └── kv/
└── model_comparison/        # Full head-to-head analyses
    ├── overall_summary.md
    ├── glm5_vs_qwen36_summary.md
    ├── minimax-m2.7_vs_qwen36_summary.md
    └── ... (per-task deep dives)
```

## Notes

- Scores are out of 100 per task, judged on correctness, completeness, code quality, analysis depth, testing, and GPU mapping.
- All raw session logs are preserved (sanitized) in each folder's `session.jsonl`.
- Source code is untouched — exactly as each model generated it.