Head-to-Head Analysis: KV-Cache System for Autoregressive Transformer Inference

Task: Implement an efficient KV-cache system for autoregressive transformer inference from scratch.
Date: 2026-04-23
Analyst: pi coding agent


Table of Contents

  1. Executive Summary
  2. MiniMax-M2.7
  3. Qwen3.6-27B
  4. Detailed Scoring
  5. Head-to-Head Comparison
  6. Final Verdict

1. Executive Summary

Both implementations satisfy the core requirements of the prompt: incremental decoding, KV-cache reuse, multi-head attention, batching support, memory analysis, optimization proposals, and GPU execution mapping. However, Qwen3.6-27B is the clear winner by a decisive margin. It delivers a modular, well-tested, and rigorously validated codebase with 10 passing end-to-end demos, precise numerical correctness checks, and production-grade analysis. MiniMax-M2.7 is a single-file monolith with broader conceptual scope but weaker execution, no automated tests, and several correctness issues in its attention masking and batching logic.

| Dimension | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Correctness | 55 | 92 |
| Completeness | 75 | 95 |
| Code Quality | 60 | 88 |
| Depth of Analysis | 78 | 90 |
| Optimizations Proposed | 72 | 90 |
| GPU Mapping Explanation | 75 | 88 |
| Tests / Demos | 30 | 95 |
| Overall | 64 | 91 |

2. MiniMax-M2.7

2.1 Files

  • kv_cache.py — Single 1,720-line monolithic file containing everything
  • FINAL.md — Summary document
  • PROMPT.md — Identical prompt

2.2 What It Does Well

  1. Conceptual Breadth: MiniMax-M2.7 covers an impressive range of topics in one file:

    • Multiple memory formats (BHSD, BSHD, PAGED, HBSD)
    • Both paged (PagedKVCache) and flat (FlatKVCache) cache implementations
    • Full transformer block with pre-norm, FFN, and residual connections
    • Batched inference engine with BatchElement tracking
    • Memory analyzer with formulas and latency estimates
    • GPU execution mapper with CUDA kernel pseudocode
    • Five optimization strategies (paged attention, chunked attention, quantization, sparse KV, speculative decoding)
  2. Data Structure Variety: It implements two distinct cache backends (paged and flat), which shows understanding of trade-offs.

  3. Extensive ASCII Diagrams: The code is heavily annotated with visual diagrams explaining memory layouts, execution pipelines, and GPU hierarchies.

  4. GPU Kernel Pseudocode: Includes actual CUDA-style pseudocode for kvcache_update and attention_with_cache kernels.

2.3 Weaknesses

  1. No Automated Tests: The only "test" is a 3-step hardcoded decode in run_demo() with no assertions, no numerical validation, and no edge-case coverage. There is no way to verify correctness systematically.

  2. Attention Masking Bug: The causal mask construction is incorrect:

    mask = np.triu(np.ones((seq_len, total_len), dtype=np.float32), k=1 - seq_len)
    

    np.triu keeps entries on and above the k-th diagonal as 1s, and with k = 1 - seq_len that region spans the entire (seq_len, total_len) matrix, so the mask never distinguishes past from future positions and causality is not enforced. A correct causal mask must mask only the upper-right triangle of future positions (offset k = total_len - seq_len + 1, where total_len includes the cached tokens). This is a critical correctness bug.

  3. KV Cache Update Bug in Batched Setting: In BatchedInferenceEngine.step_inference(), the engine iterates over batch elements one at a time and calls self.model.forward() with batch_idx=elem.batch_idx, but TransformerBlockStack.forward() ignores batch_idx entirely — it always uses the same shared self.kv_cache dictionary keyed by layer_idx, not by batch element. This means all batch elements share the same KV cache, which is fundamentally broken for batched inference with different sequences.

  4. No Variable-Length Masking: While the prompt requires "batching with variable sequence lengths," MiniMax-M2.7 does not implement per-sequence length masking in its attention computation. The BatchElement class tracks lengths but they are never used to mask padded positions.

  5. Monolithic Architecture: Everything is crammed into a single 1,720-line file. This hurts readability, maintainability, and makes it impossible to import components independently.

  6. Prefill Does Not Store KV Cache Correctly: In KVCacheAwareGenerator.prefill(), the model forward is called but the returned KV tensors are never stored into the FlatKVCache or PagedKVCache data structures. The prefill only populates the in-memory self.kv_cache dict inside TransformerBlockStack, not the persistent cache.

  7. Weak Quantization Analysis: The quantization demo only shows format comparisons (FP32→FP16→INT8) without any actual quantization/dequantization implementation or error analysis.

  8. Chunked Attention Is Only Described, Not Implemented: The "chunked attention" optimization is documented in comments with no runnable code.

  9. Memory Analysis Is High-Level: The memory analyzer provides formulas and tables but lacks concrete model comparisons (e.g., Llama-7B vs GPT-4) and GPU-specific context limits.

  10. GPU Mapping Is Mostly Descriptive: While it includes CUDA pseudocode, the analysis lacks quantitative metrics like arithmetic intensity, memory-bound vs compute-bound classification, or concrete kernel tiling parameters.
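For contrast with the inverted mask in weakness 2, a correct incremental causal mask takes only a few lines. This is a minimal illustration, not code from either submission; the function name is invented:

```python
import numpy as np

def causal_mask(seq_len: int, total_len: int) -> np.ndarray:
    """Additive causal mask for incremental decoding.

    Rows are the seq_len new query positions; columns are all total_len
    cached + new key positions. Query i sits at absolute position
    past + i, where past = total_len - seq_len, and may only attend to
    keys at positions <= past + i.
    """
    past = total_len - seq_len
    # 1s strictly beyond diagonal offset past mark *future* key positions.
    future = np.triu(np.ones((seq_len, total_len), dtype=bool), k=past + 1)
    # Additive convention: 0.0 where attention is allowed, -inf where masked.
    return np.where(future, -np.inf, 0.0).astype(np.float32)

# Prefill (past = 0) yields the familiar upper-triangular mask; a
# single-token decode step (seq_len = 1) masks nothing, because the
# cache only contains past tokens.
m = causal_mask(3, 5)  # 3 new tokens on top of 2 cached ones
```

During prefill this reduces to the standard square causal mask, and during decode it degenerates to all-zeros, which is why correct cached attention needs no explicit mask at generation time.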


3. Qwen3.6-27B

3.1 Files

  • kv_cache.py — Core data structures (KVCache, BatchedKVCache)
  • attention.py — Attention computation (standard, cached, masked, GQA)
  • transformer.py — Full transformer decoder with prefill + generation
  • optimizations.py — Paged attention, quantization, chunked prefill
  • memory_analysis.py — Memory growth formulas, model comparisons, GPU limits
  • gpu_mapping.py — GPU kernel design, Tensor Core analysis, multi-GPU strategies
  • demo.py — 10 end-to-end demos with assertions
  • README.md — Comprehensive documentation
  • FINAL.md — Summary of passing demos

3.2 What It Does Well

  1. Modular Architecture: Seven focused files, each with a single responsibility. Clean imports, clear separation of concerns. This is production-quality structure.

  2. 10 Passing End-to-End Demos: Every component is exercised and validated:

    • Demo 1: Basic cache ops with shape assertions
    • Demo 2: Cached attention numerically verified against manual computation (diff < 1e-5)
    • Demo 3: Full transformer prefill + generation with variable-length batching
    • Demo 4: Variable-length batching with per-sequence attention
    • Demo 5: Paged attention with block allocation and page table verification
    • Demo 6: Quantized cache with error measurement
    • Demo 7: Chunked prefill numerically verified against full attention (diff = 4.56e-10)
    • Demo 8: Side-by-side optimization comparison
    • Demo 9: Memory analysis with real model specs (Llama-2/3, Mistral, GPT-4)
    • Demo 10: GPU Tensor Core analysis with arithmetic intensity and bound classification
  3. Correct Attention Implementation:

    • build_causal_mask() correctly masks the upper triangle with -inf
    • build_variable_length_mask() handles per-batch-item lengths with both causal and length masking
    • cached_attention() correctly notes that causality is implicit during generation (cache only contains past tokens)
    • prompt_attention() correctly applies causal masking during prefill
  4. Proper Prefill/Decode Separation:

    • TransformerDecoderLayer.forward_prefill() processes full prompts, stores K/V in cache, and applies causal masking
    • TransformerDecoderLayer.forward_generate() processes single tokens, appends K/V to cache, and uses cached attention
    • TransformerDecoder.prefill() and .generate_step() orchestrate the phases cleanly
  5. Variable-Length Batching Is Real: The lengths parameter is threaded through prefill and generation, and build_variable_length_mask() creates proper combined causal + length masks.

  6. Working Quantization Implementation: QuantizedKVCache implements actual per-channel int8 quantization with affine transform (x ≈ scale * q + zero). It honestly reports that per-position scales have high overhead and suggests shared per-channel scales for production.

  7. Working Chunked Prefill: ChunkedPrefill.compute_attention_chunked() is a real implementation that processes prompts in chunks, applies causal masks per chunk, and accumulates results. It is numerically verified to match full attention.

  8. Working Paged Attention: PagedKVCache implements page tables, free lists, physical page pools, and on-demand allocation. Demo 5 verifies block allocation and memory utilization.

  9. Rich Memory Analysis:

    • Compares 6 real model architectures (Llama-2 7B/13B/70B, Llama-3 8B, Mistral-7B, GPT-4-class)
    • Computes max context lengths per GPU (RTX 4090, A100-40/80GB, H100-80/96GB)
    • Shows KV cache fraction of total memory at different sequence lengths
    • Analyzes batch size impact with concrete numbers
  10. Quantitative GPU Mapping:

    • Computes arithmetic intensity (FLOPs/byte) for different configs
    • Classifies all configs as memory-bound (critical insight)
    • Describes kernel tiling with concrete sizes (BLOCK=32, shared memory = ~16-20 KB)
    • Includes FlashAttention-style online softmax algorithm
    • Covers multi-GPU strategies (tensor, pipeline, sequence, expert parallelism)
    • Provides practical tuning guide (CUDA graphs, continuous batching, INT8 Tensor Cores)
  11. Group Query Attention (GQA): Implements cached_attention_gqa() showing awareness of modern optimizations beyond standard MHA.

  12. Honest Self-Critique: The quantization demo explicitly notes that its per-position scale approach has high overhead and suggests the production approach (shared per-channel scales). This shows intellectual honesty.

3.3 Weaknesses

  1. Quantized Cache Has Negative Memory Savings in Demo: Due to per-position scales stored in fp16, the QuantizedKVCache actually uses more memory than fp16 in the demo. The code acknowledges this and explains the production fix, but the implementation itself is not optimized.

  2. Paged Attention Gather Is Inefficient: PagedKVCache.get_sequence() iterates over blocks and copies them one at a time. In a real GPU kernel, this would be a gather operation, but the NumPy implementation is O(num_blocks) with Python-level looping.

  3. No Speculative Decoding: While MiniMax-M2.7 at least mentions speculative decoding in its optimization list, Qwen3.6-27B does not cover it at all.

  4. No Sliding Window Attention: Qwen3.6-27B implements GQA but does not implement sliding window attention (a key optimization for very long contexts in models like Mistral).

  5. GQA Is Not Integrated into Transformer: The cached_attention_gqa() function exists in attention.py but is not used in TransformerDecoderLayer or TransformerDecoder.
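The per-channel int8 scheme credited to Qwen3.6-27B above, in the shared-scale variant it recommends for production, looks roughly like this. An illustrative sketch, not the actual QuantizedKVCache code:

```python
import numpy as np

def quantize_per_channel(x: np.ndarray):
    """Symmetric per-channel int8 quantization of a (positions, channels) slab.

    One fp32 scale per channel, shared across all positions -- the
    low-overhead layout, as opposed to per-position scales.
    """
    scale = np.abs(x).max(axis=0) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((128, 64)).astype(np.float32)
q8, scale = quantize_per_channel(kv)
err = np.abs(dequantize(q8, scale) - kv).max()
# Memory: 1 byte per value plus one fp32 scale per channel,
# versus 2 bytes per value for an fp16 cache.
```

With scales amortized over all positions, the overhead is negligible, which is exactly why the per-position variant in the demo ends up with negative savings.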


4. Detailed Scoring

4.1 Correctness (0-100)

| Aspect | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Attention masking | Buggy — causal mask is inverted | Correct — proper causal + length masks |
| KV cache update | Buggy — batched cache is shared across all elements | Correct — per-layer, per-batch caches |
| Prefill cache storage | Buggy — prefill KV not stored in persistent cache | Correct — prompt_attention() stores all tokens |
| Numerical validation | None | 10 demos with assertions |
| Variable-length batching | Described but not correctly implemented | Fully working with masks |
| Score | 55 | 92 |

MiniMax-M2.7 loses 45 points due to the inverted causal mask (critical), shared batched cache (critical), and missing prefill cache storage (major). These are not edge cases — they are fundamental to the task.

Qwen3.6-27B loses 8 points for the quantized cache overhead issue (minor, acknowledged) and the lack of GQA integration (minor).

4.2 Completeness (0-100)

| Requirement | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Incremental decoding | ✓ | ✓ |
| Avoid recomputing attention | ✓ (conceptually) | ✓ (working) |
| Multi-head attention | ✓ | ✓ |
| Batching with variable lengths | Partial (broken) | ✓ |
| Data structure layout | ✓ (4 formats) | ✓ (clearly documented) |
| Update logic per step | ✓ | ✓ |
| Attention computation with cache | ✓ (buggy mask) | ✓ |
| Memory growth analysis | ✓ (formulas) | ✓ (formulas + models + GPUs) |
| ≥2 optimizations proposed | ✓ (5 listed, 2 implemented) | ✓ (3 implemented + comparisons) |
| GPU execution mapping | ✓ (descriptive) | ✓ (quantitative + kernel design) |
| Score | 75 | 95 |

MiniMax-M2.7 is incomplete on variable-length batching (the requirement is not met due to the shared cache bug) and its optimizations are partially documented rather than implemented.

4.3 Code Quality (0-100)

| Aspect | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Modularity | Single 1,720-line file | 7 focused files |
| Readability | Dense, diagram-heavy | Clean, well-commented |
| Type hints | Present but inconsistent | Consistent and thorough |
| Naming | Generally good | Excellent |
| Docstrings | Extensive | Concise and precise |
| Reusability | Poor (monolith) | Good (modular imports) |
| Score | 60 | 88 |

MiniMax-M2.7's single-file approach makes it difficult to navigate and impossible to import components selectively. Qwen3.6-27B's modular structure is a clear best practice.

4.4 Depth of Analysis (0-100)

| Aspect | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Memory formulas | ✓ | ✓ (more detailed) |
| Model-specific analysis | None | 6 real models |
| GPU-specific limits | Generic | Per-GPU context limits |
| Arithmetic intensity | Not computed | Computed and classified |
| Multi-GPU strategies | Listed | Detailed with KV cache impact |
| Practical tuning | Limited | Comprehensive guide |
| Score | 78 | 90 |

Both provide good analysis, but Qwen3.6-27B grounds everything in concrete numbers (real models, real GPUs, real FLOPs/byte ratios).
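The memory-growth arithmetic is easy to make concrete. Using Llama-2-7B's public configuration (32 layers, 32 KV heads, head dimension 128, fp16 cache); the function name is illustrative:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_val: int = 2) -> int:
    # Factor of 2 is for K and V; one entry per layer, per KV head,
    # per position, per batch element.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val

# Llama-2-7B in fp16:
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch=1)     # 512 KiB/token
at_4k = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1)      # 2 GiB/sequence
print(per_token / 1024, at_4k / 2**30)  # 512.0 KiB, 2.0 GiB
```

At batch 16 the same sequence length already consumes 32 GiB of cache alone, which is why the per-GPU context limits in Qwen3.6-27B's analysis matter.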

4.5 Optimizations Proposed (0-100)

| Optimization | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Paged attention | Described + partial implementation | Fully implemented + tested |
| Quantization | Described only | Implemented + error measured |
| Chunked attention | Described only | Implemented + numerically verified |
| Sparse KV / token selection | Described | Not covered |
| Speculative decoding | Described | Not covered |
| GQA | Not covered | Implemented (not integrated) |
| Side-by-side comparison | No | Yes (5 strategies) |
| Score | 72 | 90 |

MiniMax-M2.7 covers more optimization ideas (5 vs 3) but only implements 1 (paged) partially. Qwen3.6-27B implements 3 fully with tests and comparisons. Quality over quantity.
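The paged-attention idea both submissions attempt reduces to a small allocator: fixed-size blocks drawn from a shared pool, with a per-sequence page table. A toy sketch, not either submission's actual class:

```python
class PageTable:
    """Toy paged-KV allocator: fixed-size blocks from a shared physical pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free list of physical block ids
        self.tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id: int, length: int) -> int:
        """Ensure capacity for one more token; return its physical block id.

        `length` is the sequence length *before* the append.
        """
        table = self.tables.setdefault(seq_id, [])
        if length % self.block_size == 0:     # current block full (or none yet)
            if not self.free:
                raise MemoryError("out of KV blocks")
            table.append(self.free.pop())     # allocate a block on demand
        return table[length // self.block_size]
```

The payoff is that per-sequence waste is bounded by one partial block instead of a full max-length slab, at the cost of an indirection (the gather Qwen3.6-27B's get_sequence() performs block by block).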

4.6 GPU Mapping Explanation (0-100)

| Aspect | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Memory hierarchy | ✓ (ASCII diagram) | ✓ (table) |
| CUDA kernel pseudocode | ✓ (more detailed) | ✓ |
| Thread block design | Brief | Detailed with sizes |
| Tensor Core analysis | Mentioned | Quantified (FLOPs, intensity, bounds) |
| FlashAttention adaptation | Mentioned | Algorithm described |
| Multi-GPU strategies | Listed | Detailed per-strategy |
| Practical tuning | Limited | 5 concrete recommendations |
| Score | 75 | 88 |

MiniMax-M2.7 has CUDA pseudocode; Qwen3.6-27B has quantitative analysis. Both are good, but Qwen3.6-27B's arithmetic intensity analysis and bound classification are more insightful.
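The memory-bound classification follows from a back-of-envelope roofline estimate that is easy to reproduce. A simplified single-head fp16 decode step; the GPU ridge-point figure in the comment is an approximate public number:

```python
def decode_attention_intensity(seq_len: int, head_dim: int,
                               bytes_per_val: int = 2) -> float:
    """Arithmetic intensity (FLOPs/byte) of one single-head decode step."""
    # Two matvecs over the cache (q @ K^T, then weights @ V), 2 FLOPs per MAC.
    flops = 2 * 2 * seq_len * head_dim
    # Dominant traffic: streaming K and V once from HBM.
    bytes_moved = 2 * seq_len * head_dim * bytes_per_val
    return flops / bytes_moved

ai = decode_attention_intensity(seq_len=4096, head_dim=128)
# ai == 1.0 FLOP/byte for fp16: each cached value is read once and used
# in one MAC. An H100 needs on the order of ~300 FLOPs/byte to be
# compute-bound, so decode attention sits deep in memory-bound territory.
```

Note the ratio is independent of seq_len and head_dim, which is why every decode config lands in the memory-bound regime and why KV-cache bandwidth (and quantization) dominates decode latency.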

4.7 Tests / Demos (0-100)

| Aspect | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Number of demos | 1 (hardcoded) | 10 (comprehensive) |
| Assertions / validation | None | Numerical diff checks |
| Edge cases covered | None | Variable lengths, padding, quantization error |
| Integration test | Partial | Full prefill → generate pipeline |
| Score | 30 | 95 |

This is the biggest gap. Qwen3.6-27B's 10 passing demos with numerical validation provide confidence that the system works. MiniMax-M2.7 has no systematic validation.
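The pattern behind those demos — run the cached incremental path and assert it matches full recomputation — takes only a few lines to reproduce. An illustrative sketch, not Qwen3.6-27B's demo code:

```python
import numpy as np

def full_causal_attention(q, k, v):
    """Reference: full T x T causal attention, recomputed from scratch."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu_indices(t, k=1)] = -np.inf   # mask future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
t, d = 16, 8
q, k, v = (rng.standard_normal((t, d)) for _ in range(3))

# Incremental path: grow the cache one token at a time, no mask needed.
ks, vs, outs = [], [], []
for i in range(t):
    ks.append(k[i]); vs.append(v[i])
    s = np.array(ks) @ q[i] / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    outs.append(w @ np.array(vs))

diff = np.abs(np.stack(outs) - full_causal_attention(q, k, v)).max()
assert diff < 1e-6  # cached decode must match full attention exactly
```

A handful of checks like this (plus padded variable-length batches and quantization round-trips) is what separates a validated system from one whose correctness is merely assumed.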


5. Head-to-Head Comparison

What Each Did Well

MiniMax-M2.7 Strengths:

  • Broader conceptual coverage (5 optimization ideas vs 3)
  • Multiple memory format enums (BHSD, BSHD, PAGED, HBSD)
  • Both paged and flat cache implementations in one file
  • Includes speculative decoding in optimization list
  • CUDA kernel pseudocode is more extensive
  • MemoryFormat enum shows awareness of layout trade-offs

Qwen3.6-27B Strengths:

  • Everything is tested and numerically validated
  • Modular, maintainable codebase
  • Correct attention masking (causal + variable length)
  • Proper prefill/decode phase separation
  • Working implementations of 3 optimizations (paged, quantized, chunked)
  • Concrete model and GPU analysis with real numbers
  • Quantitative GPU performance characterization (memory-bound classification)
  • GQA implementation (modern architecture awareness)
  • Honest self-critique of quantization overhead
  • Excellent documentation (README.md is comprehensive)

Weaknesses Comparison

MiniMax-M2.7 Critical Issues:

  1. Inverted causal mask — attention attends to future tokens instead of past
  2. Shared batched KV cache — all batch elements overwrite each other's cache
  3. No systematic testing — correctness is assumed, not verified
  4. Monolithic file — unmaintainable at scale

Qwen3.6-27B Minor Issues:

  1. Quantized cache has overhead in current implementation (acknowledged)
  2. GQA is not wired into the transformer
  3. No speculative decoding coverage
  4. No sliding window attention

Who Won and By How Much

Qwen3.6-27B wins decisively.

| Metric | MiniMax-M2.7 | Qwen3.6-27B | Delta |
| --- | --- | --- | --- |
| Overall Score | 64 | 91 | +27 |

The margin is large and justified:

  • Qwen3.6-27B is correct where MiniMax-M2.7 has fundamental bugs
  • Qwen3.6-27B is tested where MiniMax-M2.7 has no validation
  • Qwen3.6-27B is modular where MiniMax-M2.7 is a monolith
  • Qwen3.6-27B's analysis is quantitative where MiniMax-M2.7's is descriptive

MiniMax-M2.7 shows broader familiarity with concepts (more optimization ideas, more memory formats) but Qwen3.6-27B demonstrates deeper understanding and execution (working code, passing tests, numerical validation). In engineering, correctness and validation trump conceptual breadth.


6. Final Verdict

MiniMax-M2.7: 64/100 — "Conceptually Broad, Executionally Weak"

MiniMax-M2.7 demonstrates familiarity with a wide range of KV-cache concepts and writes extensive documentation. However, it suffers from critical correctness bugs (inverted causal mask, broken batched caching), lacks any systematic testing, and crams everything into an unmaintainable monolith. The implementation does not reliably meet the prompt's requirements for correct incremental decoding or variable-length batching. It reads like a knowledgeable engineer's first draft — full of good ideas but not yet debugged or validated.

Qwen3.6-27B: 91/100 — "Production-Grade, Rigorously Validated"

Qwen3.6-27B delivers a modular, correct, and thoroughly tested KV-cache system. Every component has a dedicated file, every demo passes with numerical validation, and the analysis is grounded in real models and GPUs. The attention masking is correct, the prefill/decode separation is clean, and the optimizations are actually implemented and verified. The README alone is a better technical document than MiniMax-M2.7's entire output. This is the work of an engineer who understands that correctness and testing are not optional.

Recommendation

If you need a KV-cache system to study, extend, or adapt: use Qwen3.6-27B. It is correct, tested, modular, and well-documented. MiniMax-M2.7 may be useful as a supplementary reference for additional optimization ideas (speculative decoding, sliding window, sparse KV), but its code should not be used without significant bug fixes.


Analysis completed by pi coding agent. Both implementations were read in full, executed, and evaluated against the original prompt requirements.