# 3-Way Head-to-Head Analysis: GLM-5 vs MiniMax-M2.7 vs Qwen3-6

## Executive Summary

| Dimension | GLM-5 | MiniMax-M2.7 | Qwen3-6 |
|---|---|---|---|
| Overall Grade | B+ | B | A- |
| Backwards (Layer Norm) | ✓ PASS, compact | ✓ PASS, verbose | ✓ PASS, excellent |
| Fuse (Softmax+Top-K) | Strong CUDA, online algo | Pseudocode/CUDA hybrid | Production-grade CUDA |
| KV (KV-Cache) | Clean, well-structured | Over-engineered | Comprehensive, best |
| Code correctness | All tests pass | All tests pass | All tests pass |
| Code quality | Clean, minimal | Verbose, unstructured | Modular, well-documented |
| Testing | 8 tests, 1 file | None (benchmark only) | 10 demos, multiple files |
| Novelty/Depth | Good | Acceptable | Excellent |

## Task 1: Backward Pass for Layer Normalization

### GLM-5 (file: `glm5/backwards/layer_norm.py`)

Grade: B+

Strengths:

  • Single-file implementation (275 lines) — clean and contained
  • Correct simplified dx formula: `rstd * (dxhat - xhat * proj/D - dxhat_sum/D)` (see the sketch after this list)
  • Gradient check passes (rel error ~1e-10 on all three gradients)
  • Good numerical stability discussion covering 5 distinct failure modes
  • GPU fusion strategy is detailed with shared memory layout and 4-step kernel design
  • Derivation thoroughly shown in docstring
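
The dx formula above maps to a few NumPy lines. This is a minimal sketch under assumed shapes and names (`xhat` and `rstd` cached from the forward pass), not GLM-5's actual code:

```python
import numpy as np

def layer_norm_backward_dx(dy, xhat, rstd, gamma):
    """dL/dx from cached forward values (hypothetical names).

    dy:    (B, T, D) upstream gradient
    xhat:  (B, T, D) normalized input, cached from forward
    rstd:  (B, T, 1) reciprocal std, cached from forward
    gamma: (D,)      scale parameter
    """
    D = xhat.shape[-1]
    dxhat = dy * gamma                                 # gradient w.r.t. xhat
    proj = (dxhat * xhat).sum(axis=-1, keepdims=True)  # row-wise sum(dxhat * xhat)
    dxhat_sum = dxhat.sum(axis=-1, keepdims=True)      # row-wise sum(dxhat)
    return rstd * (dxhat - xhat * proj / D - dxhat_sum / D)
```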

Weaknesses:

  • Full finite-difference check on x iterates element-by-element with Python loops — very slow for anything beyond tiny tensors. No spot-check heuristic
  • Complexity analysis is prose-based, not tabular — harder to compare
  • No edge case tests (zero input, large mean with tiny variance, D=1, etc.)
  • GPU fusion discussion only covers the backward pass — forward pass fusion is mentioned but not detailed
  • Only caches xhat, rstd, gamma — minimal but correct

Code tested:

```
dx:      max|err| = 5.71e-10   rel = 9.74e-11   [PASS]
dgamma:  max|err| = 3.21e-10   rel = 5.13e-11   [PASS]
dbeta:   max|err| = 4.07e-10   rel = 4.69e-11   [PASS]
```

### MiniMax-M2.7 (file: `minimax-m2.7/backwards/layer_norm_numpy.py`)

Grade: B

Strengths:

  • Most verbose implementation (1148 lines) — extensive documentation
  • Per-operation FLOPs table in complexity analysis
  • Benchmark harness with 4 shape configurations
  • GPU kernel pseudo-code is close to compilable CUDA, with `__global__`, `__shared__`, and `warpReduceSum`
  • Proper spot-check for large tensors (>100k elements) in gradient check
  • Includes a LayerNorm class wrapper with parameter state management
  • Central finite differences with proper element-by-element checking

Weaknesses:

  • Bloated cache: stores x, x_centered, x_norm, mean, var, std, gamma, beta, eps, B, T, D — way more than needed
  • The backward formula `dx = (dz - sum_dz/D - x_norm * sum_dz_xnorm/D) / std` (with `dz = dy * gamma`, and dgamma/dbeta reducing over `dy * x_norm`) is correct, but the implementation then stores ALL intermediates redundantly anyway
  • The cache dict stores the original x which is NEVER needed for backward
  • Gradient check uses per-element Python loops for x (O(BTD) Python calls) with a progress bar — extremely slow for real sizes
  • Overly complex: compute_numerical_gradient_x/gamma/beta as separate functions with near-duplicate code
  • Numerical stability analysis somewhat buried in code comments rather than clearly presented
  • Complexity analysis uses ASCII art boxes — visually noisy

Code tested:

```
All gradient checks PASSED on 3 shape configurations
Performance benchmarks run successfully
```

### Qwen3-6 (files: `qwen36/backwards/layer_norm_backward.py`, `test_layer_norm.py`, `benchmark_layer_norm.py`)

Grade: A-

Strengths:

  • Best overall: 3 well-separated files for core impl (294 lines), edge case tests (113 lines), benchmarks (150 lines)
  • Cleanest dx formula with derivation sketch: `dx = std_inv * [g - mean(g) - x_hat * mean(g * x_hat)]` where `g = gamma * dy`
  • Minimal cache: only x_hat (B,T,D), std_inv (B,T), gamma (D), and D (scalar). Perfect.
  • Edge case test file covers:
    • Large mean, tiny variance (cancellation-prone)
    • Zero input (variance = 0)
    • Large D (Transformer-scale: B=2, T=128, D=1024)
    • D=1 (degenerate case)
    • Gradient norm sanity check
    • Backward-of-backward consistency
    • Memory efficiency check (verifies optimal cache size)
  • Benchmark file demonstrates two-pass vs naive variance stability (offset=1e10: naive=0.000, stable=2.000; see the sketch after this list)
  • Explicit verification against alternative derivation path (step-by-step chain rule cross-check)
  • GPU fusion discussion is the most thorough — includes forward AND backward kernels with pseudocode, memory traffic comparison (12 vs 4 accesses per element), shared memory optimization, and hardware rsqrt note
  • Gradients pass with ~5e-11 rel error
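
The stability demonstration is easy to reproduce. A minimal sketch of the naive vs two-pass variance comparison, with illustrative values rather than the benchmark's exact configuration:

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0]) + 1e10      # large mean, tiny variance

naive = np.mean(x * x) - np.mean(x) ** 2  # one-pass: E[x^2] - E[x]^2
stable = np.mean((x - np.mean(x)) ** 2)   # two-pass: subtract the mean first

print(naive, stable)
# naive is destroyed by catastrophic cancellation (often 0.0 or even negative);
# stable returns the true variance, ~0.667
```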

Weaknesses:

  • The full numerical gradient function iterates element-by-element (can be extremely slow for D=1024, caused timeout in our test)
  • Finite difference function doesn't auto-detect large tensors and switch to a spot-check (unlike MiniMax's `max_elements` param); a sketch of the heuristic follows this list
  • Benchmark is CPU-only NumPy, inherently slow
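
A spot-check heuristic in this spirit is simple to add. A hypothetical sketch (assumes a scalar-loss function `f` and an analytic gradient `dx_analytic`; not taken from any of the three submissions):

```python
import numpy as np

def spot_check_grad(f, x, dx_analytic, n_samples=64, eps=1e-5, seed=0):
    """Central finite differences on a random subset of elements instead of
    all of them; returns the worst relative error over the sample."""
    rng = np.random.default_rng(seed)
    sample = rng.choice(x.size, size=min(n_samples, x.size), replace=False)
    worst = 0.0
    for flat in sample:
        idx = np.unravel_index(flat, x.shape)
        orig = x[idx]
        x[idx] = orig + eps
        f_plus = f(x)
        x[idx] = orig - eps
        f_minus = f(x)
        x[idx] = orig                       # restore before the next probe
        numeric = (f_plus - f_minus) / (2 * eps)
        denom = max(abs(numeric), abs(dx_analytic[idx]), 1e-12)
        worst = max(worst, abs(numeric - dx_analytic[idx]) / denom)
    return worst
```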

Gradient check:

```
dx        relative error: 5.04e-11  ✓ PASS
dgamma    relative error: 1.75e-11  ✓ PASS
dbeta     relative error: 1.46e-11  ✓ PASS
```

### Backwards Task Winner: Qwen3-6

Qwen edges out GLM with its careful edge case testing, minimal memory caching, cross-verification of the backward formula, and practical stability demonstration. GLM is very close but lacks edge case testing. MiniMax is correct but over-engineered with bloated caches.


## Task 2: Fused Softmax + Top-K Kernel

### GLM-5 (files: `glm5/fuse/fused_softmax_topk.cuh`, `test_fused.cu`, `DESIGN.md`)

Grade: A-

Strengths:

  • True CUDA .cuh header with template-based kernel
  • Uses online softmax algorithm (running max/sum recurrence) — genuinely single-pass
  • Register-resident `TopKHeap<K>` struct with parallel sorted `vals`/`idxs` arrays
  • Warp-level `__shfl_xor_sync` for max/sum reductions (5-step butterfly; see the simulation after this list)
  • Cross-warp heap merge in shared memory with __syncthreads()
  • Explicit template instantiation for K=5,10,20,32
  • Clean 3-phase pipeline: local pass → cross-warp merge → write output
  • DESIGN.md is comprehensive (9 sections) with detailed bandwidth analysis showing 3× I/O reduction
  • Bandwidth-bound analysis correctly identifies AI=1.5 FLOP/byte << A100's 9.6 FLOP/byte
  • Includes host launch wrapper and CUDA stream support
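
To illustrate the butterfly pattern, here is a pure-Python simulation of the 5-step XOR-shuffle max-reduction over one 32-lane warp (an explanatory model, not the kernel code):

```python
def warp_max_butterfly(lane_vals):
    """After 5 XOR-shuffle steps (offsets 16, 8, 4, 2, 1), every lane holds
    the warp-wide maximum, with no shared memory traffic."""
    lanes = list(lane_vals)
    assert len(lanes) == 32
    for offset in (16, 8, 4, 2, 1):
        # each lane reads its partner at lane_id XOR offset, then combines
        lanes = [max(lanes[i], lanes[i ^ offset]) for i in range(32)]
    return lanes

assert warp_max_butterfly(list(range(32))) == [31] * 32
```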

Weaknesses:

  • Only supports K ≤ 32 (limited by HEAP_K register constant)
  • Heap uses O(K) insertion — OK for K=32, but breaks for K=256
  • Cross-warp merge is serial (warp 0 only) — bottleneck for WARPS_PER_BLOCK > 8
  • No FP16/vectorized load support (mentioned in DESIGN.md as future work)
  • Shared memory use: ~2KB (very modest, not fully utilizing available)
  • test_fused.cu exists but wasn't read in detail — appears to be a test harness

Code:

```
// Key insight: online softmax recurrence
m_j = max(m_{j-1}, x_j)
d_j = d_{j-1} * exp(m_{j-1} - m_j) + exp(x_j - m_j)
```
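
In scalar Python terms, the recurrence amounts to the loop below. This is a sketch of the algorithm only; the real kernel keeps `m` and `d` in registers and merges per-warp partials:

```python
import math

def online_softmax_stats(xs):
    """One pass over the logits: returns (m, d) such that
    softmax(x_j) = exp(x_j - m) / d, with no second read of the data."""
    m, d = -math.inf, 0.0
    for x in xs:
        m_new = max(m, x)
        d = d * math.exp(m - m_new) + math.exp(x - m_new)  # rescale old sum
        m = m_new
    return m, d
```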

### MiniMax-M2.7 (file: `minimax-m2.7/fuse/fused_softmax_topk.cu`)

Grade: B

Strengths:

  • Full analysis document (1720 lines, embedded as code comments) demonstrating strong systems thinking
  • Good documentation of memory access pattern (coalesced strided reads), warp operations, and complexity
  • Correctly identifies the kernel as bandwidth-bound
  • Includes scalability analysis for V=10K, 50K, 500K, 1M+
  • Discusses extensions: FP16/BF16, Tensor Cores, tiled approach, integration with backward pass

Weaknesses:

  • The CUDA code has significant bugs:
    1. Uses `__launch_bounds__(THREADS)`; `__launch_bounds__` requires an integral constant expression, which is only valid here if THREADS is a non-type template parameter rather than a runtime value
    2. Shared memory layout is broken: int* s_topk_idx = (int*)&shared_mem[2 * THREADS] — pointer arithmetic on float* then cast to int* — byte offsets are likely wrong
    3. Phase 3 top-K heap: if (prob > local_topk_val[TOP_K - 1]) — but TOP_K is a template parameter and TOP_K-1 indexing isn't guarded
    4. Final merge phase uses merge_val[THREADS] and merge_idx[THREADS] as stack arrays — THREADS=256 means 2KB stack arrays inside kernel, potentially exceeding per-thread stack limits
    5. s_topk_val[lane] = local_topk_val[lane] is guarded by warp_id == 0 && lane < TOP_K — but if TOP_K > 32, warp 0 threads 32..TOP_K-1 still execute this and access local_topk_val which may be uninitialized for those lanes
    6. The launcher function creates separate kernels for K≤10, K≤50, and K≤100 but contains a `topk_prob` vs `topp_prob` variable-name typo
  • Uses 2-pass softmax (max first, then sum, then top-k), not a true single-pass online softmax like GLM
  • Top-K insertion does per-element linear scan of size TOP_K — O(V * K) instead of O(V * log K)
  • No template-based instantiation — uses if/else chains in launcher

### Qwen3-6 (files: `qwen36/fuse/fused_softmax_topk.cu`, `fused_softmax_topk_v2.cu`, `ANALYSIS.md`, `benchmark.cu`)

Grade: A

Strengths:

  • Two kernel versions: v1 (production) and v2 (optimized with vectorized float4 loads, warp-level top-K merge, bitonic sort)
  • Local top-K uses LocalTopK<K> struct with min-eviction strategy
  • Proper min-heap in shared memory with `heap_sift_down()` function, giving O(log K) insertions (see the host-side sketch after this list)
  • Phase 4 warp-merging correctly serializes across 8 warps with barriers
  • Phase 5 sorts the final K elements with selection sort (O(K²), acceptable for K=256)
  • Warp-level primitives (warp_max, warp_sum) use butterfly shuffle — correct
  • Vectorized float4 loads in v2 — proper alignment handling with tail loop
  • Template-based with explicit instantiations for K=16,32,64,128,256
  • ANALYSIS.md provides deep design document alongside the code
  • benchmark.cu for correctness and performance harness
  • Dynamic shared memory for warp staging buffer (2048B for vals + 2048B for idxs)
  • Phase 4 warp leader serialization uses explicit barrier pattern — correct but could be faster
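
The min-eviction strategy maps directly onto Python's `heapq`. A host-side sketch of the same O(V log K) idea the shared-memory heap implements (illustrative, not Qwen's code):

```python
import heapq

def streaming_topk(probs, k):
    """Min-heap of size k over a stream: the heap root is always the current
    k-th largest value, so most elements are rejected with one comparison."""
    heap = []                                  # (prob, index) pairs
    for idx, p in enumerate(probs):
        if len(heap) < k:
            heapq.heappush(heap, (p, idx))
        elif p > heap[0][0]:
            heapq.heapreplace(heap, (p, idx))  # evict the smallest kept value
    return sorted(heap, reverse=True)          # descending by probability
```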

Weaknesses:

  • v1 also uses 2-pass (max phase → sum phase → top-K), not a true online algorithm like GLM
  • v2's warp-level top-K merge (warp_topk_merge) is declared but uses lane-0 serial collection — the comment claims "warp-level merge" but implementation is serial on lane 0
  • v2's bitonic sort is mentioned in comments but not actually implemented (falls back to selection sort)
  • Shared heap sift-down is correct but uses a while(true) loop with break — slightly unconventional GPU style
  • Minor: the shared heap index array is declared under one spelling and referenced under another (a naming typo in the code)

### Fuse Task Winner: GLM-5 (by a hair) / Qwen3-6 (for production readiness)

GLM-5 wins on algorithmic elegance — it's the only one implementing true online softmax (single pass, running statistics). This is the correct answer for the "do NOT materialize the full softmax matrix" constraint.

Qwen3-6 wins on production completeness: v2 has float4 vectorization, supports K up to 256, uses a proper shared-memory heap, and includes a benchmark harness. The 2-pass approach costs slightly more memory traffic but still avoids materializing the full softmax matrix.

MiniMax's implementation has real bugs that would prevent compilation or correct execution.


## Task 3: KV-Cache System

### GLM-5 (files: `glm5/kv/kv_cache.py`, `optimizations.py`, `test_kv_cache.py`, `README.md`)

Grade: A-

Strengths:

  • Clean, well-structured 471-line core with clear section headers
  • BHSD memory layout with per-batch seq_lens — correct for variable-length batching
  • multi_head_attention_with_cache correctly queries from cache
  • IncrementalDecoder shows end-to-end prefill→decode lifecycle
  • optimizations.py (508 lines) implements all three requested optimizations:
    • PagedKVCache with free-list management and block scattering
    • ChunkedPrefill with sequential chunk processing
    • QuantizedKVCache with INT8/INT4 symmetric quantization
  • test_kv_cache.py (429 lines) provides 8 comprehensive tests:
    • Basic cache update/retrieval
    • Cached vs non-cached attention correctness (matches to 1e-5)
    • Variable sequence lengths (lengths [5, 12, 3])
    • Incremental decoder end-to-end
    • Paged cache with block allocation/free
    • Quantized cache (INT8 error ~0.004, INT4 error ~0.07)
    • Memory growth analysis tables
    • FLOPs comparison (109× speedup for a 1024-token prefill plus 100 decode steps)

All 8 tests pass cleanly.
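
As a reference point for the design, a minimal BHSD cache with per-batch lengths looks roughly like this (a sketch with hypothetical names, not GLM-5's implementation):

```python
import numpy as np

class SimpleKVCache:
    """BHSD layout: (batch, heads, max_seq, head_dim), with one valid length
    per sequence so batch entries can grow independently."""

    def __init__(self, batch, heads, max_seq, head_dim):
        self.k = np.zeros((batch, heads, max_seq, head_dim), dtype=np.float32)
        self.v = np.zeros((batch, heads, max_seq, head_dim), dtype=np.float32)
        self.seq_lens = np.zeros(batch, dtype=np.int64)

    def append(self, b, k_new, v_new):
        """Write one decode step's (heads, head_dim) K/V for sequence b."""
        t = self.seq_lens[b]
        self.k[b, :, t, :] = k_new
        self.v[b, :, t, :] = v_new
        self.seq_lens[b] = t + 1

    def get(self, b):
        """Valid prefix (heads, seq_len, head_dim) for attention over the past."""
        t = self.seq_lens[b]
        return self.k[b, :, :t, :], self.v[b, :, :t, :]
```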

Weaknesses:

  • multi_head_attention_with_cache has per-head Python loops — correct for NumPy but notes "maps 1:1 to CUDA" which is slightly misleading (CUDA batch matmul would be more efficient)
  • Attention computation loops over B, then H — O(B*H) Python loops per step
  • Chunked prefill in optimizations.py uses np.random.randn to simulate Q — doesn't actually compute chunks of a real prompt
  • Quantized cache uses per-token per-head per-dimension scale factors — huge metadata overhead (reported savings of only 0.5x vs FP32 for INT8, which is pessimistic; real systems use per-channel or per-token scales)
  • Paged cache get_kv concatenates scattered blocks on every call. Fine for a NumPy demo, but on GPU this needs a custom gather kernel (see the paged-cache sketch after this list)
  • No FP16 support (uses the NumPy float32 default)
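
For contrast with the contiguous layout, a toy page-table design (hypothetical names; a sketch of the free-list idea, including the gather that on GPU would need a custom kernel):

```python
import numpy as np

class PagedKV:
    """Logical blocks per sequence map to scattered physical blocks."""

    def __init__(self, num_blocks, block_size, heads, head_dim):
        self.store = np.zeros((num_blocks, block_size, heads, head_dim),
                              dtype=np.float32)
        self.free = list(range(num_blocks))  # free-list of physical blocks
        self.tables = {}                     # seq_id -> [physical block ids]
        self.block_size = block_size

    def append(self, seq_id, pos, kv):
        """Write one position's (heads, head_dim) K/V for a sequence."""
        table = self.tables.setdefault(seq_id, [])
        if pos // self.block_size == len(table):   # crossed into a new block
            table.append(self.free.pop())
        phys = table[pos // self.block_size]
        self.store[phys, pos % self.block_size] = kv

    def gather(self, seq_id, length):
        """Concatenate the scattered blocks back into contiguous K/V."""
        table = self.tables[seq_id]
        flat = self.store[table].reshape(-1, *self.store.shape[2:])
        return flat[:length]
```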

### MiniMax-M2.7 (file: `minimax-m2.7/kv/kv_cache.py`)

Grade: B-

Strengths:

  • Most lines of code (1720 lines total) — very ambitious scope
  • Implements multiple memory formats (BHSD, BSHD, PAGED, HBSD) as an enum
  • FlatKVCache with [layers, batch, seq, 2, heads, dim] layout — reasonable for multi-layer
  • PagedKVCache with block allocator
  • MultiHeadAttention class with Q/K/V projection and causal masking
  • BatchedInferenceEngine class for managing variable-length batches
  • MemoryAnalyzer class with growth rate and latency estimation
  • All three optimizations covered (Paged, Chunked, Quantized)

Weaknesses:

  • Significant structural issues:
    1. KVCacheBlock stores layer_idx=-1 as placeholder but later reassigns — fragile design
    2. FlatKVCache stores [num_layers, max_batch, max_seq_len, 2, num_heads, head_dim] — the "2" dimension for K/V is awkward and non-standard
    3. BatchedInferenceEngine.step_inference creates fake outputs for finished sequences (zeros) but doesn't properly exclude them from computation
    4. _project in MultiHeadAttention applies np.matmul(x, W) then reshape then transpose — three separate operations when one einsum would be cleaner
    5. The _create_causal_mask function has incorrect logic: np.triu(..., k=1-seq_len) — when seq_len=1 (decode), k=0 which creates a mask with zeros everywhere. For decode, no mask is actually needed (cache only has past tokens), so it's accidentally correct but the derivation is wrong.
    6. Format mismatch between TransformerBlockStack.forward and MultiHeadAttention.forward: the stack stores plain (K, V) tuples in self.kv_cache[layer_idx], while the attention expects a nested dict of the form {layer_idx: (k_cache, v_cache)}; the layer_cache preparation is wrong
    7. The code mixes two different kv_cache conventions (flat 6D tensor vs dict-based), causing confusion
  • No test file — the entire 1720-line file has zero test functions
  • Complexity analysis interleaved with implementation code — hard to separate
  • Much of the code (TransformerBlock, TransformerBlockStack, KVCacheAwareGenerator) is partially implementing a full transformer rather than focusing on the KV-cache system

### Qwen3-6 (files: `qwen36/kv/kv_cache.py`, `attention.py`, `optimizations.py`, `transformer.py`, `memory_analysis.py`, `gpu_mapping.py`, `demo.py`, `README.md`)

Grade: A

Strengths:

  • Best architecture: 8 separate files with clear separation of concerns
  • kv_cache.py (205 lines): Clean, minimal KVCache + BatchedKVCache — exactly the right abstraction
  • attention.py (234 lines): Implements standard, cached, masked, GQA attention variants
  • optimizations.py (390 lines): PagedKVCache with page tables and free list, QuantizedKVCache with per-channel int8, ChunkedPrefill with proper causal chunking, HybridKVCache combining paged+quantized
  • transformer.py: full transformer decoder integration (not inspected in detail)
  • memory_analysis.py (240 lines): Comprehensive with ModelSpec, find_max_context(), compare_model_sizes(), detailed GPU limits
  • gpu_mapping.py: GPU kernel pseudocode with Tensor Core analysis (not inspected in detail)
  • demo.py: 10 end-to-end demos covering all scenarios:
    1. Basic KV cache operations — data integrity verified
    2. Cached attention computation — max diff 3.93e-10 from manual
    3. Full transformer with prefill + 5-step generation
    4. Variable-length batching (lengths [8, 5, 10, 3])
    5. Paged attention (vLLM-style, block_size=4)
    6. Quantized cache (int8, notes overhead correctly)
    7. Chunked prefill — matches full attention to 4.56e-10
    8. Optimization comparison table (5 strategies side by side)
    9. Memory growth analysis (6 models, various GPUs)
    10. GPU Tensor Core arithmetic intensity analysis
  • README.md with comprehensive documentation
  • Correctly identifies per-position quantization overhead issue (reports -125% savings vs fp32 due to scale metadata) and explains that production uses shared per-channel scales
  • BatchedKVCache is the cleanest abstraction — manages L layers × 1 config
  • memory_analysis.py has real model specs: Llama-2-7B, 13B, 70B, Llama-3-8B, GPT-4-class
  • Finds max context: 7B model on H100-80GB → 121K tokens (correct; see the arithmetic sketch below)
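
The max-context figure follows from simple arithmetic: KV bytes per token = 2 (K and V) × layers × KV heads × head_dim × bytes per element. A hypothetical helper with approximate Llama-2-7B numbers:

```python
def max_kv_context(gpu_bytes, weight_bytes, n_layers, n_kv_heads, head_dim,
                   bytes_per_el=2):
    """Remaining GPU memory divided by KV-cache bytes per token."""
    kv_bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_el
    return int((gpu_bytes - weight_bytes) // kv_bytes_per_token)

# Llama-2-7B-like config: 32 layers, 32 KV heads, head_dim 128, fp16 cache,
# ~14 GB of fp16 weights on an 80 GB GPU:
print(max_kv_context(80e9, 14e9, 32, 32, 128))
# ~126K tokens, the same ballpark as the ~121K figure reported above
# (the exact number depends on what else is budgeted)
```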

Weaknesses:

  • KVCache.update writes keys[:, :, 0, :] assuming batch dimension is first — hardcoded to work with (batch, heads, 1, head_dim) but slightly fragile
  • The cached attention function retrieves full cache every time (cache.get_all()) — in production you'd retrieve only what's needed
  • transformer.py includes an LLaMAModel class with real RoPE — but whether this works correctly wasn't tested
  • Quantized cache reports negative savings vs fp16 due to per-position scale overhead; honest, but it shows the implementation isn't production-ready for quantization (a per-channel sketch follows this list)
  • Paged cache physical pages are per-head — in real vLLM, pages are per-layer-per-head (much finer granularity)
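
To make the scale-metadata point concrete: per-position scales add fp32 metadata comparable in size to the quantized data itself, while per-channel scales are shared across all positions and amortize to a few percent. A sketch of symmetric per-channel INT8 quantization (hypothetical helper, not Qwen's code):

```python
import numpy as np

def quantize_k_per_channel(k):
    """Symmetric INT8 quantization of a (heads, seq, head_dim) tensor with
    one scale per (head, channel), shared across all positions."""
    scale = np.abs(k).max(axis=1, keepdims=True) / 127.0  # (heads, 1, head_dim)
    scale = np.maximum(scale, 1e-12)                      # guard all-zero channels
    q = np.clip(np.round(k / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_k(q, scale):
    return q.astype(np.float32) * scale

k = np.random.randn(8, 64, 128).astype(np.float32)
q, scale = quantize_k_per_channel(k)
rel_err = np.abs(dequantize_k(q, scale) - k).max() / np.abs(k).max()
print(rel_err)
# scale metadata here: 8*128 fp32 values vs 8*64*128 int8 values, ~6% overhead
```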

### KV-Cache Task Winner: Qwen3-6

Qwen3-6 wins convincingly. The 8-file modular architecture, comprehensive demo suite, correct variable-length batching, and practical memory analysis (with real model specs and GPU limits) set it apart. GLM-5 is a strong second with excellent test coverage but less depth in attention variants and GPU mapping. MiniMax's implementation has architectural flaws that would prevent correct operation.


## Cross-Task Patterns

### Code Quality & Architecture

| Aspect | GLM-5 | MiniMax-M2.7 | Qwen3-6 |
|---|---|---|---|
| File organization | 1-3 files/task | 1 file/task | 3-8 files/task |
| Code modularity | Good | Poor (monolithic) | Excellent |
| Documentation | Good (docstrings + DESIGN.md) | Very verbose, ASCII art | Excellent (docstrings + README) |
| Naming conventions | Clean | Mixed | Cleanest |
| Type hints | Minimal | Extensive (typing) | Good (dataclasses) |
| Error handling | Good assertions | Extensive assertions | Good assertions |

### Numerical Correctness

All three models produce mathematically correct backward pass gradients. The key differentiator is how they cache intermediates:

  • GLM-5: caches (xhat, rstd, gamma) — 3 items ✓
  • MiniMax: caches (x, x_centered, x_norm, mean, var, std, gamma, beta, eps, B, T, D) — 12 items, 9 of which are redundant ✗
  • Qwen3-6: caches (x_hat, std_inv, gamma, D) — 4 items, all needed ✓

### GPU Kernel Quality

For the fuse kernel:

  • GLM-5 has the most algorithmically sophisticated kernel (true online softmax, single pass)
  • Qwen3-6 has the most production-ready kernel (two versions, float4 vectorization, supports K up to 256)
  • MiniMax has significant bugs that would prevent compilation

### Testing Philosophy

  • GLM-5: Tests are thorough within a single test file. Covers correctness, edge cases, and analysis.
  • MiniMax: Primarily benchmarks and gradient checks within the main file. No separate test file for KV-cache.
  • Qwen3-6: Best testing culture. Separate test files, edge case files, benchmark files, demo files. Cross-verifies backward formula with alternative derivation.

### Scope Creep

  • GLM-5: Stays focused on the asked requirements. Delivers what's needed.
  • MiniMax: Over-implements. The KV-cache file grows into a full transformer implementation, losing focus on the cache system itself.
  • Qwen3-6: Expands thoughtfully. Each extra file adds value (attention variants, memory analysis, GPU mapping) without losing focus.

## Overall Ranking

### 1st Place: Qwen3-6 (A-)

Best overall quality across all three tasks. Wins on KV-cache decisively, ties or slightly trails on backwards, matches on fuse with more practical implementation. Superior engineering practices: modular files, comprehensive testing, cross-verification, edge cases, proper docs.

### 2nd Place: GLM-5 (B+)

Strong showing with elegant algorithms (online softmax is the standout innovation). Code is clean and correct. Weaknesses are primarily in testing depth (no edge case tests for backwards, K limited to 32 for fuse) rather than correctness. The most "academically beautiful" solutions.

### 3rd Place: MiniMax-M2.7 (B)

Ambitious but inconsistent. Over-engineers some parts (backwards cache bloated 4×) while under-delivering on others (fuse CUDA has real bugs, KV-cache architecture is fragmented). No separate tests. The verbosity sometimes masks correctness issues. However, the model clearly understands the domain; the issues identified are execution problems rather than knowledge gaps.


## Per-Task Winner Summary

| Task | Winner | Key Differentiator |
|---|---|---|
| Layer Norm Backward | Qwen3-6 | Edge case testing, minimal cache, cross-verification |
| Fused Softmax+TopK | GLM-5 | True online single-pass algorithm (the only one that is genuinely "fused") |
| KV-Cache System | Qwen3-6 | Modular architecture, 10 demos, practical GPU limits analysis |