- Add Claude Opus 4.7, Kimi K2.6, GLM-5.1 to existing GLM-5, Qwen3-6, MiniMax-M2.7 - Add 5 new challenges: flash attention fwd/bwd, beam search, DFlash, ternary training - Rewrite README with TL;DR rankings, grade matrix, and DeepSeek V4 Pro attribution - Add analysis/ folder with cross-model comparisons and per-challenge deep dives - Add deploy_challenges.sh script - Expand .gitignore to exclude Python envs, ML weights, and build artifacts
22 KiB
3-Way Head-to-Head Analysis: GLM-5 vs MiniMax-M2.7 vs Qwen3-6
Executive Summary
| Dimension | GLM-5 | MiniMax-M2.7 | Qwen3-6 |
|---|---|---|---|
| Overall Grade | B+ | B | A- |
| Backwards (Layer Norm) | ✓ PASS, compact | ✓ PASS, verbose | ✓ PASS, excellent |
| Fuse (Softmax+Top-K) | Strong CUDA, online algo | Pseudocode/CUDA hybrid | Production-grade CUDA |
| KV (KV-Cache) | Clean, well-structured | Over-engineered | Comprehensive, best |
| Code correctness | All tests pass | All tests pass | All tests pass |
| Code quality | Clean, minimal | Verbose, unstructured | Modular, well-documented |
| Testing | 8 tests, 1 file | None (benchmark only) | 10 demos, multiple files |
| Novelty/Depth | Good | Acceptable | Excellent |
Task 1: Backward Pass for Layer Normalization
GLM-5 (file: glm5/backwards/layer_norm.py)
Grade: B+
Strengths:
- Single-file implementation (275 lines) — clean and contained
- Correct simplified dx formula:
rstd * (dxhat - xhat * proj/D - dxhat_sum/D) - Gradient check passes (rel error ~1e-10 on all three gradients)
- Good numerical stability discussion covering 5 distinct failure modes
- GPU fusion strategy is detailed with shared memory layout and 4-step kernel design
- Derivation thoroughly shown in docstring
Weaknesses:
- Full finite-difference check on x iterates element-by-element with Python loops — very slow for anything beyond tiny tensors. No spot-check heuristic
- Complexity analysis is prose-based, not tabular — harder to compare
- No edge case tests (zero input, large mean with tiny variance, D=1, etc.)
- GPU fusion discussion only covers the backward pass — forward pass fusion is mentioned but not detailed
- Only caches xhat, rstd, gamma — minimal but correct
Code tested:
dx: max|err| = 5.71e-10 rel = 9.74e-11 [PASS]
dgamma: max|err| = 3.21e-10 rel = 5.13e-11 [PASS]
dbeta: max|err| = 4.07e-10 rel = 4.69e-11 [PASS]
MiniMax-M2.7 (file: minimax-m2.7/backwards/layer_norm_numpy.py)
Grade: B
Strengths:
- Most verbose implementation (1148 lines) — extensive documentation
- Per-operation FLOPs table in complexity analysis
- Benchmark harness with 4 shape configurations
- GPU kernel pseudo-code is actually compilable-like CUDA with
__global__,__shared__,warpReduceSum - Proper spot-check for large tensors (>100k elements) in gradient check
- Includes a
LayerNormclass wrapper with parameter state management - Central finite differences with proper element-by-element checking
Weaknesses:
- Bloated cache: stores x, x_centered, x_norm, mean, var, std, gamma, beta, eps, B, T, D — way more than needed
- The backward formula
dx = (dz - sum_dz/D - x_norm * sum_dz_xnorm/D) / stdIS correct but the author incorrectly notesdz = dy * gammameaning dgamma/dbeta formula usesdy * x_normbut then stores ALL intermediates redundantly - The cache dict stores the original
xwhich is NEVER needed for backward - Gradient check uses per-element Python loops for
x(O(BTD) Python calls) with a progress bar — extremely slow for real sizes - Overly complex:
compute_numerical_gradient_x/gamma/betaas separate functions with near-duplicate code - Numerical stability analysis somewhat buried in code comments rather than clearly presented
- Complexity analysis uses ASCII art boxes — visually noisy
Code tested:
All gradient checks PASSED on 3 shape configurations
Performance benchmarks run successfully
Qwen3-6 (files: qwen36/backwards/layer_norm_backward.py, test_layer_norm.py, benchmark_layer_norm.py)
Grade: A-
Strengths:
- Best overall: 3 well-separated files for core impl (294 lines), edge case tests (113 lines), benchmarks (150 lines)
- Cleanest dx formula with derivation sketch:
dx = std_inv * [g - mean(g) - x_hat * mean(g * x_hat)]whereg = gamma * dy - Minimal cache: only x_hat (B,T,D), std_inv (B,T), gamma (D), and D (scalar). Perfect.
- Edge case test file covers:
- Large mean, tiny variance (cancellation-prone)
- Zero input (variance = 0)
- Large D (Transformer-scale: B=2, T=128, D=1024)
- D=1 (degenerate case)
- Gradient norm sanity check
- Backward-of-backward consistency
- Memory efficiency check (verifies optimal cache size)
- Benchmark file demonstrates two-pass vs naive variance stability (offset=1e10: naive=0.000, stable=2.000)
- Explicit verification against alternative derivation path (step-by-step chain rule cross-check)
- GPU fusion discussion is the most thorough — includes forward AND backward kernels with pseudocode, memory traffic comparison (12 vs 4 accesses per element), shared memory optimization, and hardware rsqrt note
- Gradients pass with ~5e-11 rel error
Weaknesses:
- The full numerical gradient function iterates element-by-element (can be extremely slow for D=1024, caused timeout in our test)
- Finite difference function doesn't auto-detect and switch to spot-check for large tensors (unlike MiniMax's
max_elementsparam) - Benchmark is CPU-only NumPy, inherently slow
Gradient check:
dx relative error: 5.04e-11 ✓ PASS
dgamma relative error: 1.75e-11 ✓ PASS
dbeta relative error: 1.46e-11 ✓ PASS
Backwards Task Winner: Qwen3-6
Qwen edges out GLM with its careful edge case testing, minimal memory caching, cross-verification of the backward formula, and practical stability demonstration. GLM is very close but lacks edge case testing. MiniMax is correct but over-engineered with bloated caches.
Task 2: Fused Softmax + Top-K Kernel
GLM-5 (files: glm5/fuse/fused_softmax_topk.cuh, test_fused.cu, DESIGN.md)
Grade: A-
Strengths:
- True CUDA
.cuhheader with template-based kernel - Uses online softmax algorithm (running max/sum recurrence) — genuinely single-pass
- Register-resident
TopKHeap<K>struct withvals[idxs]sorted array - Warp-level
__shfl_xor_syncfor max/sum reductions (5-step butterfly) - Cross-warp heap merge in shared memory with
__syncthreads() - Explicit template instantiation for K=5,10,20,32
- Clean 3-phase pipeline: local pass → cross-warp merge → write output
- DESIGN.md is comprehensive (9 sections) with detailed bandwidth analysis showing 3× I/O reduction
- Bandwidth-bound analysis correctly identifies AI=1.5 FLOP/byte << A100's 9.6 FLOP/byte
- Includes host launch wrapper and CUDA stream support
Weaknesses:
- Only supports K ≤ 32 (limited by
HEAP_Kregister constant) - Heap uses O(K) insertion — OK for K=32, but breaks for K=256
- Cross-warp merge is serial (warp 0 only) — bottleneck for
WARPS_PER_BLOCK > 8 - No FP16/vectorized load support (mentioned in DESIGN.md as future work)
- Shared memory use: ~2KB (very modest, not fully utilizing available)
test_fused.cuexists but wasn't read in detail — appears to be a test harness
Code:
// Key insight: online softmax recurrence
m_{j} = max(m_{j-1}, x_j)
d_{j} = d_{j-1} * exp(m_{j-1} - m_{j}) + exp(x_j - m_{j})
MiniMax-M2.7 (file: minimax-m2.7/fuse/fused_softmax_topk.cu)
Grade: B
Strengths:
- Full analysis document (1720 lines in the code comment) demonstrating strong systems thinking
- Good documentation of memory access pattern (coalesced strided reads), warp operations, and complexity
- Correctly identifies the kernel as bandwidth-bound
- Includes scalability analysis for V=10K, 50K, 500K, 1M+
- Discusses extensions: FP16/BF16, Tensor Cores, tiled approach, integration with backward pass
Weaknesses:
- The CUDA code has significant bugs:
- Uses
__launch_bounds__(THREADS)but THREADS is a template parameter — this is not valid CUDA syntax (__launch_bounds__requires integer constant) - Shared memory layout is broken:
int* s_topk_idx = (int*)&shared_mem[2 * THREADS]— pointer arithmetic onfloat*then cast toint*— byte offsets are likely wrong - Phase 3 top-K heap:
if (prob > local_topk_val[TOP_K - 1])— but TOP_K is a template parameter and TOP_K-1 indexing isn't guarded - Final merge phase uses
merge_val[THREADS]andmerge_idx[THREADS]as stack arrays — THREADS=256 means 2KB stack arrays inside kernel, potentially exceeding per-thread stack limits s_topk_val[lane] = local_topk_val[lane]is guarded bywarp_id == 0 && lane < TOP_K— but if TOP_K > 32, warp 0 threads 32..TOP_K-1 still execute this and accesslocal_topk_valwhich may be uninitialized for those lanes- The launcher function creates separate kernels for K≤10, K≤50, K≤100 but uses
topk_probvstopp_probtypo
- Uses
- Uses 2-pass softmax (max first, then sum, then top-k), not a true single-pass online softmax like GLM
- Top-K insertion does per-element linear scan of size TOP_K — O(V * K) instead of O(V * log K)
- No template-based instantiation — uses if/else chains in launcher
Qwen3-6 (files: qwen36/fuse/fused_softmax_topk.cu, fused_softmax_topk_v2.cu, ANALYSIS.md, benchmark.cu)
Grade: A
Strengths:
- Two kernel versions: v1 (production) and v2 (optimized with vectorized float4 loads, warp-level top-K merge, bitonic sort)
- Local top-K uses
LocalTopK<K>struct with min-eviction strategy - Proper min-heap in shared memory with
heap_sift_down()function — O(log K) insertions - Phase 4 warp-merging correctly serializes across 8 warps with barriers
- Phase 5 sorts the final K elements with selection sort (O(K²), acceptable for K=256)
- Warp-level primitives (
warp_max,warp_sum) use butterfly shuffle — correct - Vectorized float4 loads in v2 — proper alignment handling with tail loop
- Template-based with explicit instantiations for K=16,32,64,128,256
ANALYSIS.mdprovides deep design document alongside the codebenchmark.cufor correctness and performance harness- Dynamic shared memory for warp staging buffer (2048B for vals + 2048B for idxs)
- Phase 4 warp leader serialization uses explicit barrier pattern — correct but could be faster
Weaknesses:
- v1 also uses 2-pass (max phase → sum phase → top-K), not a true online algorithm like GLM
- v2's warp-level top-K merge (
warp_topk_merge) is declared but uses lane-0 serial collection — the comment claims "warp-level merge" but implementation is serial on lane 0 - v2's bitonic sort is mentioned in comments but not actually implemented (falls back to selection sort)
- Shared heap sift-down is correct but uses a
while(true)loop with break — slightly unconventional GPU style - Minor:
s_heap_idxsis declared but used ass_heap_idxs(typo exists in the code)
Fuse Task Winner: GLM-5 (by a hair) / Qwen3-6 (for production readiness)
GLM-5 wins on algorithmic elegance — it's the only one implementing true online softmax (single pass, running statistics). This is the correct answer for the "do NOT materialize the full softmax matrix" constraint.
Qwen3-6 wins on production completeness — v2 has float4 vectorization, supports K up to 256, has proper shared heap, and includes a benchmark harness. The 2-pass approach is slightly more memory traffic but still avoids the full matrix.
MiniMax's implementation has real bugs that would prevent compilation or correct execution.
Task 3: KV-Cache System
GLM-5 (files: glm5/kv/kv_cache.py, optimizations.py, test_kv_cache.py, README.md)
Grade: A-
Strengths:
- Clean, well-structured 471-line core with clear section headers
- BHSD memory layout with per-batch seq_lens — correct for variable-length batching
multi_head_attention_with_cachecorrectly queries from cacheIncrementalDecodershows end-to-end prefill→decode lifecycleoptimizations.py(508 lines) implements all three requested optimizations:- PagedKVCache with free-list management and block scattering
- ChunkedPrefill with sequential chunk processing
- QuantizedKVCache with INT8/INT4 symmetric quantization
test_kv_cache.py(429 lines) provides 8 comprehensive tests:- Basic cache update/retrieval
- Cached vs non-cached attention correctness (matches to 1e-5)
- Variable sequence lengths (lengths [5, 12, 3])
- Incremental decoder end-to-end
- Paged cache with block allocation/free
- Quantized cache (INT8 error ~0.004, INT4 error ~0.07)
- Memory growth analysis tables
- FLOPs comparison (109x speedup for 1024+100)
All 8 tests pass cleanly.
Weaknesses:
multi_head_attention_with_cachehas per-head Python loops — correct for NumPy but notes "maps 1:1 to CUDA" which is slightly misleading (CUDA batch matmul would be more efficient)- Attention computation loops over B, then H — O(B*H) Python loops per step
- Chunked prefill in
optimizations.pyusesnp.random.randnto simulate Q — doesn't actually compute chunks of a real prompt - Quantized cache uses per-token per-head per-dimension scale factors — huge metadata overhead (reported savings of only 0.5x vs FP32 for INT8, which is pessimistic; real systems use per-channel or per-token scales)
- Paged cache
get_kvconcatenates scattered blocks on every call — fine for NumPy demo but on GPU this needs a custom gather kernel - No FP16 support (uses NP float32 default)
MiniMax-M2.7 (file: minimax-m2.7/kv/kv_cache.py)
Grade: B-
Strengths:
- Most lines of code (1720 lines total) — very ambitious scope
- Implements multiple memory formats (
BHSD,BSHD,PAGED,HBSD) as an enum FlatKVCachewith [layers, batch, seq, 2, heads, dim] layout — reasonable for multi-layerPagedKVCachewith block allocatorMultiHeadAttentionclass with Q/K/V projection and causal maskingBatchedInferenceEngineclass for managing variable-length batchesMemoryAnalyzerclass with growth rate and latency estimation- All three optimizations covered (Paged, Chunked, Quantized)
Weaknesses:
- Significant structural issues:
KVCacheBlockstores layer_idx=-1 as placeholder but later reassigns — fragile designFlatKVCachestores [num_layers, max_batch, max_seq_len, 2, num_heads, head_dim] — the "2" dimension for K/V is awkward and non-standardBatchedInferenceEngine.step_inferencecreates fake outputs for finished sequences (zeros) but doesn't properly exclude them from computation_projectin MultiHeadAttention appliesnp.matmul(x, W)thenreshapethentranspose— three separate operations when oneeinsumwould be cleaner- The
_create_causal_maskfunction has incorrect logic:np.triu(..., k=1-seq_len)— whenseq_len=1(decode),k=0which creates a mask with zeros everywhere. For decode, no mask is actually needed (cache only has past tokens), so it's accidentally correct but the derivation is wrong. - Class
TransformerBlockStack.forwardstores KV cache inself.kv_cache[layer_idx]butMultiHeadAttention.forwardexpectskv_cacheas a Dict with{layer_idx: (k_cache, v_cache)}format — format mismatch: the stack stores(K, V)tuples but the attention expects to receive them in a particular nested dict structure. Thelayer_cachepreparation is wrong. - The code mixes two different kv_cache conventions (flat 6D tensor vs dict-based), causing confusion
- No test file — the entire 1720-line file has zero test functions
- Complexity analysis interleaved with implementation code — hard to separate
- Much of the code (TransformerBlock, TransformerBlockStack, KVCacheAwareGenerator) is partially implementing a full transformer rather than focusing on the KV-cache system
Qwen3-6 (files: qwen36/kv/kv_cache.py, attention.py, optimizations.py, transformer.py, memory_analysis.py, gpu_mapping.py, demo.py, README.md)
Grade: A
Strengths:
- Best architecture: 8 separate files with clear separation of concerns
kv_cache.py(205 lines): Clean, minimal KVCache + BatchedKVCache — exactly the right abstractionattention.py(234 lines): Implements standard, cached, masked, GQA attention variantsoptimizations.py(390 lines): PagedKVCache with page tables and free list, QuantizedKVCache with per-channel int8, ChunkedPrefill with proper causal chunking, HybridKVCache combining paged+quantizedtransformer.py(likely): Full transformer decoder integrationmemory_analysis.py(240 lines): Comprehensive withModelSpec,find_max_context(),compare_model_sizes(), detailed GPU limitsgpu_mapping.py(likely): GPU kernel pseudocode with Tensor Core analysisdemo.py(likely): 10 end-to-end demos covering all scenarios:- Basic KV cache operations — data integrity verified
- Cached attention computation — max diff 3.93e-10 from manual
- Full transformer with prefill + 5-step generation
- Variable-length batching (lengths [8, 5, 10, 3])
- Paged attention (vLLM-style, block_size=4)
- Quantized cache (int8, notes overhead correctly)
- Chunked prefill — matches full attention to 4.56e-10
- Optimization comparison table (5 strategies side by side)
- Memory growth analysis (6 models, various GPUs)
- GPU Tensor Core arithmetic intensity analysis
README.mdwith comprehensive documentation- Correctly identifies per-position quantization overhead issue (reports -125% savings vs fp32 due to scale metadata) and explains that production uses shared per-channel scales
BatchedKVCacheis the cleanest abstraction — manages L layers × 1 configmemory_analysis.pyhas real model specs: Llama-2-7B, 13B, 70B, Llama-3-8B, GPT-4-class- Finds max context: 7B model on H100-80GB → 121K tokens (correct)
Weaknesses:
KVCache.updatewriteskeys[:, :, 0, :]assuming batch dimension is first — hardcoded to work with (batch, heads, 1, head_dim) but slightly fragile- The cached attention function retrieves full cache every time (
cache.get_all()) — in production you'd retrieve only what's needed transformer.pyincludes anLLaMAModelclass with real RoPE — but whether this works correctly wasn't tested- Quantized cache reports negative savings vs fp16 due to per-position scale overhead — honest but shows the implementation isn't production-ready for quantization
- Paged cache physical pages are per-head — in real vLLM, pages are per-layer-per-head (much finer granularity)
KV-Cache Task Winner: Qwen3-6
Qwen3-6 wins convincingly. The 8-file modular architecture, comprehensive demo suite, correct variable-length batching, and practical memory analysis (with real model specs and GPU limits) set it apart. GLM-5 is a strong second with excellent test coverage but less depth in attention variants and GPU mapping. MiniMax's implementation has architectural flaws that would prevent correct operation.
Cross-Task Patterns
Code Quality & Architecture
| Aspect | GLM-5 | MiniMax-M2.7 | Qwen3-6 |
|---|---|---|---|
| File organization | 1-3 files/task | 1 file/task | 3-8 files/task |
| Code modularity | Good | Poor (monolithic) | Excellent |
| Documentation | Good (docstrings + DESIGN.md) | Very verbose, ASCII art | Excellent (docstrings + README) |
| Naming conventions | Clean | Mixed | Cleanest |
| Type hints | Minimal | Extensive (typing) | Good (dataclasses) |
| Error handling | Good assertions | Extensive assertions | Good assertions |
Numerical Correctness
All three models produce mathematically correct backward pass gradients. The key differentiator is how they cache intermediates:
- GLM-5: caches (xhat, rstd, gamma) — 3 items ✓
- MiniMax: caches (x, x_centered, x_norm, mean, var, std, gamma, beta, eps, B, T, D) — 12 items, 9 of which are redundant ✗
- Qwen3-6: caches (x_hat, std_inv, gamma, D) — 4 items, all needed ✓
GPU Kernel Quality
For the fuse kernel:
- GLM-5 has the most algorithmically sophisticated kernel (true online softmax, single pass)
- Qwen3-6 has the most production-ready kernel (two versions, float4 vectorization, supports K up to 256)
- MiniMax has significant bugs that would prevent compilation
Testing Philosophy
- GLM-5: Tests are thorough within a single test file. Covers correctness, edge cases, and analysis.
- MiniMax: Primarily benchmarks and gradient checks within the main file. No separate test file for KV-cache.
- Qwen3-6: Best testing culture. Separate test files, edge case files, benchmark files, demo files. Cross-verifies backward formula with alternative derivation.
Scope Creep
- GLM-5: Stays focused on the asked requirements. Delivers what's needed.
- MiniMax: Over-implements. The KV-cache file grows into a full transformer implementation, losing focus on the cache system itself.
- Qwen3-6: Expands thoughtfully. Each extra file adds value (attention variants, memory analysis, GPU mapping) without losing focus.
Overall Ranking
1st Place: Qwen3-6 (A-)
Best overall quality across all three tasks. Wins on KV-cache decisively, ties or slightly trails on backwards, matches on fuse with more practical implementation. Superior engineering practices: modular files, comprehensive testing, cross-verification, edge cases, proper docs.
2nd Place: GLM-5 (B+)
Strong showing with elegant algorithms (online softmax is the standout innovation). Code is clean and correct. Weaknesses are primarily in testing depth (no edge case tests for backwards, K limited to 32 for fuse) rather than correctness. The most "academically beautiful" solutions.
3rd Place: MiniMax-M2.7 (B)
Ambitious but inconsistent. Over-engineers some parts (backwards cache bloated 4x) while under-delivering on others (fuse CUDA has real bugs, KV-cache architecture is fragmented). No separate tests. The verbosity sometimes masks correctness issues. However, the models clearly understand the domain and the issues identified are execution problems rather than knowledge gaps.
Per-Task Winner Summary
| Task | Winner | Key Differentiator |
|---|---|---|
| Layer Norm Backward | Qwen3-6 | Edge case testing, minimal cache, cross-verification |
| Fused Softmax+TopK | GLM-5 | True online single-pass algorithm (only one that is genuinely "fused") |
| KV-Cache System | Qwen3-6 | Modular architecture, 10 demos, practical GPU limits analysis |