Head-to-Head Analysis: KV-Cache System for Autoregressive Transformer Inference

Task: Implement an efficient KV-cache system for autoregressive transformer inference from scratch.
Date: 2026-04-23
Analyst: pi coding agent


Table of Contents

  1. Executive Summary
  2. MiniMax-M2.7
  3. Qwen3.6-27B
  4. Detailed Scoring
  5. Head-to-Head Comparison
  6. Final Verdict

1. Executive Summary

Both implementations satisfy the core requirements of the prompt: incremental decoding, KV-cache reuse, multi-head attention, batching support, memory analysis, optimization proposals, and GPU execution mapping. However, Qwen3.6-27B is the clear winner by a decisive margin. It delivers a modular, well-tested, and rigorously validated codebase with 10 passing end-to-end demos, precise numerical correctness checks, and production-grade analysis. MiniMax-M2.7 is a single-file monolith with broader conceptual scope but weaker execution, no automated tests, and several correctness issues in its attention masking and batching logic.

| Dimension | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Correctness | 55 | 92 |
| Completeness | 75 | 95 |
| Code Quality | 60 | 88 |
| Depth of Analysis | 78 | 90 |
| Optimizations Proposed | 72 | 90 |
| GPU Mapping Explanation | 75 | 88 |
| Tests / Demos | 30 | 95 |
| Overall | 64 | 91 |

2. MiniMax-M2.7

2.1 Files

  • kv_cache.py — Single 1,720-line monolithic file containing everything
  • FINAL.md — Summary document
  • PROMPT.md — Identical prompt

2.2 What It Does Well

  1. Conceptual Breadth: MiniMax-M2.7 covers an impressive range of topics in one file:

    • Multiple memory formats (BHSD, BSHD, PAGED, HBSD)
    • Both paged (PagedKVCache) and flat (FlatKVCache) cache implementations
    • Full transformer block with pre-norm, FFN, and residual connections
    • Batched inference engine with BatchElement tracking
    • Memory analyzer with formulas and latency estimates
    • GPU execution mapper with CUDA kernel pseudocode
    • Five optimization strategies (paged attention, chunked attention, quantization, sparse KV, speculative decoding)
  2. Data Structure Variety: It implements two distinct cache backends (paged and flat), which shows understanding of trade-offs.

  3. Extensive ASCII Diagrams: The code is heavily annotated with visual diagrams explaining memory layouts, execution pipelines, and GPU hierarchies.

  4. GPU Kernel Pseudocode: Includes actual CUDA-style pseudocode for kvcache_update and attention_with_cache kernels.

2.3 Weaknesses

  1. No Automated Tests: The only "test" is a 3-step hardcoded decode in run_demo() with no assertions, no numerical validation, and no edge-case coverage. There is no way to verify correctness systematically.

  2. Attention Masking Bug: The causal mask construction is incorrect:

    mask = np.triu(np.ones((seq_len, total_len), dtype=np.float32), k=1 - seq_len)
    

    np.triu keeps entries on and above the k-th diagonal as 1s, and with k = 1 - seq_len that region spans the entire (seq_len, total_len) matrix, so the mask never distinguishes past from future positions and causality is not enforced. A correct causal mask must mask only the upper-right triangle of future positions (offset k = total_len - seq_len + 1, where total_len includes the cached tokens). This is a critical correctness bug.

  3. KV Cache Update Bug in Batched Setting: In BatchedInferenceEngine.step_inference(), the engine iterates over batch elements one at a time and calls self.model.forward() with batch_idx=elem.batch_idx, but TransformerBlockStack.forward() ignores batch_idx entirely — it always uses the same shared self.kv_cache dictionary keyed by layer_idx, not by batch element. This means all batch elements share the same KV cache, which is fundamentally broken for batched inference with different sequences.

  4. No Variable-Length Masking: While the prompt requires "batching with variable sequence lengths," MiniMax-M2.7 does not implement per-sequence length masking in its attention computation. The BatchElement class tracks lengths but they are never used to mask padded positions.

  5. Monolithic Architecture: Everything is crammed into a single 1,720-line file. This hurts readability, maintainability, and makes it impossible to import components independently.

  6. Prefill Does Not Store KV Cache Correctly: In KVCacheAwareGenerator.prefill(), the model forward is called but the returned KV tensors are never stored into the FlatKVCache or PagedKVCache data structures. The prefill only populates the in-memory self.kv_cache dict inside TransformerBlockStack, not the persistent cache.

  7. Weak Quantization Analysis: The quantization demo only shows format comparisons (FP32→FP16→INT8) without any actual quantization/dequantization implementation or error analysis.

  8. Chunked Attention Is Only Described, Not Implemented: The "chunked attention" optimization is documented in comments with no runnable code.

  9. Memory Analysis Is High-Level: The memory analyzer provides formulas and tables but lacks concrete model comparisons (e.g., Llama-7B vs GPT-4) and GPU-specific context limits.

  10. GPU Mapping Is Mostly Descriptive: While it includes CUDA pseudocode, the analysis lacks quantitative metrics like arithmetic intensity, memory-bound vs compute-bound classification, or concrete kernel tiling parameters.
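For contrast with the inverted mask in weakness 2, a correct incremental causal mask takes only a few lines. This is a minimal illustration, not code from either submission; the function name is invented:

```python
import numpy as np

def causal_mask(seq_len: int, total_len: int) -> np.ndarray:
    """Additive causal mask for incremental decoding.

    Rows are the seq_len new query positions; columns are all total_len
    cached + new key positions. Query i sits at absolute position
    past + i, where past = total_len - seq_len, and may only attend to
    keys at positions <= past + i.
    """
    past = total_len - seq_len
    # 1s strictly beyond diagonal offset past mark *future* key positions.
    future = np.triu(np.ones((seq_len, total_len), dtype=bool), k=past + 1)
    # Additive convention: 0.0 where attention is allowed, -inf where masked.
    return np.where(future, -np.inf, 0.0).astype(np.float32)

# Prefill (past = 0) yields the familiar upper-triangular mask; a
# single-token decode step (seq_len = 1) masks nothing, because the
# cache only contains past tokens.
m = causal_mask(3, 5)  # 3 new tokens on top of 2 cached ones
```

During prefill this reduces to the standard square causal mask, and during decode it degenerates to all-zeros, which is why correct cached attention needs no explicit mask at generation time.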


3. Qwen3.6-27B

3.1 Files

  • kv_cache.py — Core data structures (KVCache, BatchedKVCache)
  • attention.py — Attention computation (standard, cached, masked, GQA)
  • transformer.py — Full transformer decoder with prefill + generation
  • optimizations.py — Paged attention, quantization, chunked prefill
  • memory_analysis.py — Memory growth formulas, model comparisons, GPU limits
  • gpu_mapping.py — GPU kernel design, Tensor Core analysis, multi-GPU strategies
  • demo.py — 10 end-to-end demos with assertions
  • README.md — Comprehensive documentation
  • FINAL.md — Summary of passing demos

3.2 What It Does Well

  1. Modular Architecture: Seven focused files, each with a single responsibility. Clean imports, clear separation of concerns. This is production-quality structure.

  2. 10 Passing End-to-End Demos: Every component is exercised and validated:

    • Demo 1: Basic cache ops with shape assertions
    • Demo 2: Cached attention numerically verified against manual computation (diff < 1e-5)
    • Demo 3: Full transformer prefill + generation with variable-length batching
    • Demo 4: Variable-length batching with per-sequence attention
    • Demo 5: Paged attention with block allocation and page table verification
    • Demo 6: Quantized cache with error measurement
    • Demo 7: Chunked prefill numerically verified against full attention (diff = 4.56e-10)
    • Demo 8: Side-by-side optimization comparison
    • Demo 9: Memory analysis with real model specs (Llama-2/3, Mistral, GPT-4)
    • Demo 10: GPU Tensor Core analysis with arithmetic intensity and bound classification
  3. Correct Attention Implementation:

    • build_causal_mask() correctly masks the upper triangle with -inf
    • build_variable_length_mask() handles per-batch-item lengths with both causal and length masking
    • cached_attention() correctly notes that causality is implicit during generation (cache only contains past tokens)
    • prompt_attention() correctly applies causal masking during prefill
  4. Proper Prefill/Decode Separation:

    • TransformerDecoderLayer.forward_prefill() processes full prompts, stores K/V in cache, and applies causal masking
    • TransformerDecoderLayer.forward_generate() processes single tokens, appends K/V to cache, and uses cached attention
    • TransformerDecoder.prefill() and .generate_step() orchestrate the phases cleanly
  5. Variable-Length Batching Is Real: The lengths parameter is threaded through prefill and generation, and build_variable_length_mask() creates proper combined causal + length masks.

  6. Working Quantization Implementation: QuantizedKVCache implements actual per-channel int8 quantization with affine transform (x ≈ scale * q + zero). It honestly reports that per-position scales have high overhead and suggests shared per-channel scales for production.

  7. Working Chunked Prefill: ChunkedPrefill.compute_attention_chunked() is a real implementation that processes prompts in chunks, applies causal masks per chunk, and accumulates results. It is numerically verified to match full attention.

  8. Working Paged Attention: PagedKVCache implements page tables, free lists, physical page pools, and on-demand allocation. Demo 5 verifies block allocation and memory utilization.

  9. Rich Memory Analysis:

    • Compares 6 real model architectures (Llama-2 7B/13B/70B, Llama-3 8B, Mistral-7B, GPT-4-class)
    • Computes max context lengths per GPU (RTX 4090, A100-40/80GB, H100-80/96GB)
    • Shows KV cache fraction of total memory at different sequence lengths
    • Analyzes batch size impact with concrete numbers
  10. Quantitative GPU Mapping:

    • Computes arithmetic intensity (FLOPs/byte) for different configs
    • Classifies all configs as memory-bound (critical insight)
    • Describes kernel tiling with concrete sizes (BLOCK=32, shared memory = ~16-20 KB)
    • Includes FlashAttention-style online softmax algorithm
    • Covers multi-GPU strategies (tensor, pipeline, sequence, expert parallelism)
    • Provides practical tuning guide (CUDA graphs, continuous batching, INT8 Tensor Cores)
  11. Group Query Attention (GQA): Implements cached_attention_gqa() showing awareness of modern optimizations beyond standard MHA.

  12. Honest Self-Critique: The quantization demo explicitly notes that its per-position scale approach has high overhead and suggests the production approach (shared per-channel scales). This shows intellectual honesty.

3.3 Weaknesses

  1. Quantized Cache Has Negative Memory Savings in Demo: Due to per-position scales stored in fp16, the QuantizedKVCache actually uses more memory than fp16 in the demo. The code acknowledges this and explains the production fix, but the implementation itself is not optimized.

  2. Paged Attention Gather Is Inefficient: PagedKVCache.get_sequence() iterates over blocks and copies them one at a time. In a real GPU kernel, this would be a gather operation, but the NumPy implementation is O(num_blocks) with Python-level looping.

  3. No Speculative Decoding: While MiniMax-M2.7 at least mentions speculative decoding in its optimization list, Qwen3.6-27B does not cover it at all.

  4. No Sliding Window Attention: Qwen3.6-27B implements GQA but does not implement sliding window attention (a key optimization for very long contexts in models like Mistral).

  5. GQA Is Not Integrated into Transformer: The cached_attention_gqa() function exists in attention.py but is not used in TransformerDecoderLayer or TransformerDecoder.
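The per-channel int8 scheme credited to Qwen3.6-27B above, in the shared-scale variant it recommends for production, looks roughly like this. An illustrative sketch, not the actual QuantizedKVCache code:

```python
import numpy as np

def quantize_per_channel(x: np.ndarray):
    """Symmetric per-channel int8 quantization of a (positions, channels) slab.

    One fp32 scale per channel, shared across all positions -- the
    low-overhead layout, as opposed to per-position scales.
    """
    scale = np.abs(x).max(axis=0) / 127.0
    scale = np.where(scale == 0, 1.0, scale)  # avoid divide-by-zero
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale.astype(np.float32)

def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
kv = rng.standard_normal((128, 64)).astype(np.float32)
q8, scale = quantize_per_channel(kv)
err = np.abs(dequantize(q8, scale) - kv).max()
# Memory: 1 byte per value plus one fp32 scale per channel,
# versus 2 bytes per value for an fp16 cache.
```

With scales amortized over all positions, the overhead is negligible, which is exactly why the per-position variant in the demo ends up with negative savings.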


4. Detailed Scoring

4.1 Correctness (0-100)

| Aspect | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Attention masking | Buggy — causal mask is inverted | Correct — proper causal + length masks |
| KV cache update | Buggy — batched cache is shared across all elements | Correct — per-layer, per-batch caches |
| Prefill cache storage | Buggy — prefill KV not stored in persistent cache | Correct — prompt_attention() stores all tokens |
| Numerical validation | None | 10 demos with assertions |
| Variable-length batching | Described but not correctly implemented | Fully working with masks |
| Score | 55 | 92 |

MiniMax-M2.7 loses 45 points due to the inverted causal mask (critical), shared batched cache (critical), and missing prefill cache storage (major). These are not edge cases — they are fundamental to the task.

Qwen3.6-27B loses 8 points for the quantized cache overhead issue (minor, acknowledged) and the lack of GQA integration (minor).

4.2 Completeness (0-100)

| Requirement | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Incremental decoding | ✓ | ✓ |
| Avoid recomputing attention | ✓ (conceptually) | ✓ (working) |
| Multi-head attention | ✓ | ✓ |
| Batching with variable lengths | Partial (broken) | ✓ |
| Data structure layout | ✓ (4 formats) | ✓ (clearly documented) |
| Update logic per step | ✓ | ✓ |
| Attention computation with cache | ✓ (buggy mask) | ✓ |
| Memory growth analysis | ✓ (formulas) | ✓ (formulas + models + GPUs) |
| ≥2 optimizations proposed | ✓ (5 listed, 2 implemented) | ✓ (3 implemented + comparisons) |
| GPU execution mapping | ✓ (descriptive) | ✓ (quantitative + kernel design) |
| Score | 75 | 95 |

MiniMax-M2.7 is incomplete on variable-length batching (the requirement is not met due to the shared cache bug) and its optimizations are partially documented rather than implemented.

4.3 Code Quality (0-100)

| Aspect | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Modularity | Single 1,720-line file | 7 focused files |
| Readability | Dense, diagram-heavy | Clean, well-commented |
| Type hints | Present but inconsistent | Consistent and thorough |
| Naming | Generally good | Excellent |
| Docstrings | Extensive | Concise and precise |
| Reusability | Poor (monolith) | Good (modular imports) |
| Score | 60 | 88 |

MiniMax-M2.7's single-file approach makes it difficult to navigate and impossible to import components selectively. Qwen3.6-27B's modular structure is a clear best practice.

4.4 Depth of Analysis (0-100)

| Aspect | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Memory formulas | ✓ | ✓ (more detailed) |
| Model-specific analysis | None | 6 real models |
| GPU-specific limits | Generic | Per-GPU context limits |
| Arithmetic intensity | Not computed | Computed and classified |
| Multi-GPU strategies | Listed | Detailed with KV cache impact |
| Practical tuning | Limited | Comprehensive guide |
| Score | 78 | 90 |

Both provide good analysis, but Qwen3.6-27B grounds everything in concrete numbers (real models, real GPUs, real FLOPs/byte ratios).
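The memory-growth arithmetic is easy to make concrete. Using Llama-2-7B's public configuration (32 layers, 32 KV heads, head dimension 128, fp16 cache); the function name is illustrative:

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_val: int = 2) -> int:
    # Factor of 2 is for K and V; one entry per layer, per KV head,
    # per position, per batch element.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_val

# Llama-2-7B in fp16:
per_token = kv_cache_bytes(32, 32, 128, seq_len=1, batch=1)     # 512 KiB/token
at_4k = kv_cache_bytes(32, 32, 128, seq_len=4096, batch=1)      # 2 GiB/sequence
print(per_token / 1024, at_4k / 2**30)  # 512.0 KiB, 2.0 GiB
```

At batch 16 the same sequence length already consumes 32 GiB of cache alone, which is why the per-GPU context limits in Qwen3.6-27B's analysis matter.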

4.5 Optimizations Proposed (0-100)

| Optimization | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Paged attention | Described + partial implementation | Fully implemented + tested |
| Quantization | Described only | Implemented + error measured |
| Chunked attention | Described only | Implemented + numerically verified |
| Sparse KV / token selection | Described | Not covered |
| Speculative decoding | Described | Not covered |
| GQA | Not covered | Implemented (not integrated) |
| Side-by-side comparison | No | Yes (5 strategies) |
| Score | 72 | 90 |

MiniMax-M2.7 covers more optimization ideas (5 vs 3) but only implements 1 (paged) partially. Qwen3.6-27B implements 3 fully with tests and comparisons. Quality over quantity.
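The paged-attention idea both submissions attempt reduces to a small allocator: fixed-size blocks drawn from a shared pool, with a per-sequence page table. A toy sketch, not either submission's actual class:

```python
class PageTable:
    """Toy paged-KV allocator: fixed-size blocks from a shared physical pool."""

    def __init__(self, num_blocks: int, block_size: int):
        self.block_size = block_size
        self.free = list(range(num_blocks))   # free list of physical block ids
        self.tables = {}                      # seq_id -> list of block ids

    def append_token(self, seq_id: int, length: int) -> int:
        """Ensure capacity for one more token; return its physical block id.

        `length` is the sequence length *before* the append.
        """
        table = self.tables.setdefault(seq_id, [])
        if length % self.block_size == 0:     # current block full (or none yet)
            if not self.free:
                raise MemoryError("out of KV blocks")
            table.append(self.free.pop())     # allocate a block on demand
        return table[length // self.block_size]
```

The payoff is that per-sequence waste is bounded by one partial block instead of a full max-length slab, at the cost of an indirection (the gather Qwen3.6-27B's get_sequence() performs block by block).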

4.6 GPU Mapping Explanation (0-100)

| Aspect | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Memory hierarchy | ✓ (ASCII diagram) | ✓ (table) |
| CUDA kernel pseudocode | ✓ (more detailed) | ✓ |
| Thread block design | Brief | Detailed with sizes |
| Tensor Core analysis | Mentioned | Quantified (FLOPs, intensity, bounds) |
| FlashAttention adaptation | Mentioned | Algorithm described |
| Multi-GPU strategies | Listed | Detailed per-strategy |
| Practical tuning | Limited | 5 concrete recommendations |
| Score | 75 | 88 |

MiniMax-M2.7 has CUDA pseudocode; Qwen3.6-27B has quantitative analysis. Both are good, but Qwen3.6-27B's arithmetic intensity analysis and bound classification are more insightful.
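The memory-bound classification follows from a back-of-envelope roofline estimate that is easy to reproduce. A simplified single-head fp16 decode step; the GPU ridge-point figure in the comment is an approximate public number:

```python
def decode_attention_intensity(seq_len: int, head_dim: int,
                               bytes_per_val: int = 2) -> float:
    """Arithmetic intensity (FLOPs/byte) of one single-head decode step."""
    # Two matvecs over the cache (q @ K^T, then weights @ V), 2 FLOPs per MAC.
    flops = 2 * 2 * seq_len * head_dim
    # Dominant traffic: streaming K and V once from HBM.
    bytes_moved = 2 * seq_len * head_dim * bytes_per_val
    return flops / bytes_moved

ai = decode_attention_intensity(seq_len=4096, head_dim=128)
# ai == 1.0 FLOP/byte for fp16: each cached value is read once and used
# in one MAC. An H100 needs on the order of ~300 FLOPs/byte to be
# compute-bound, so decode attention sits deep in memory-bound territory.
```

Note the ratio is independent of seq_len and head_dim, which is why every decode config lands in the memory-bound regime and why KV-cache bandwidth (and quantization) dominates decode latency.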

4.7 Tests / Demos (0-100)

| Aspect | MiniMax-M2.7 | Qwen3.6-27B |
| --- | --- | --- |
| Number of demos | 1 (hardcoded) | 10 (comprehensive) |
| Assertions / validation | None | Numerical diff checks |
| Edge cases covered | None | Variable lengths, padding, quantization error |
| Integration test | Partial | Full prefill → generate pipeline |
| Score | 30 | 95 |

This is the biggest gap. Qwen3.6-27B's 10 passing demos with numerical validation provide confidence that the system works. MiniMax-M2.7 has no systematic validation.
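The pattern behind those demos — run the cached incremental path and assert it matches full recomputation — takes only a few lines to reproduce. An illustrative sketch, not Qwen3.6-27B's demo code:

```python
import numpy as np

def full_causal_attention(q, k, v):
    """Reference: full T x T causal attention, recomputed from scratch."""
    t, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    scores[np.triu_indices(t, k=1)] = -np.inf   # mask future positions
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)
    return w @ v

rng = np.random.default_rng(0)
t, d = 16, 8
q, k, v = (rng.standard_normal((t, d)) for _ in range(3))

# Incremental path: grow the cache one token at a time, no mask needed.
ks, vs, outs = [], [], []
for i in range(t):
    ks.append(k[i]); vs.append(v[i])
    s = np.array(ks) @ q[i] / np.sqrt(d)
    w = np.exp(s - s.max()); w /= w.sum()
    outs.append(w @ np.array(vs))

diff = np.abs(np.stack(outs) - full_causal_attention(q, k, v)).max()
assert diff < 1e-6  # cached decode must match full attention exactly
```

A handful of checks like this (plus padded variable-length batches and quantization round-trips) is what separates a validated system from one whose correctness is merely assumed.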


5. Head-to-Head Comparison

What Each Did Well

MiniMax-M2.7 Strengths:

  • Broader conceptual coverage (5 optimization ideas vs 3)
  • Multiple memory format enums (BHSD, BSHD, PAGED, HBSD)
  • Both paged and flat cache implementations in one file
  • Includes speculative decoding in optimization list
  • CUDA kernel pseudocode is more extensive
  • MemoryFormat enum shows awareness of layout trade-offs

Qwen3.6-27B Strengths:

  • Everything is tested and numerically validated
  • Modular, maintainable codebase
  • Correct attention masking (causal + variable length)
  • Proper prefill/decode phase separation
  • Working implementations of 3 optimizations (paged, quantized, chunked)
  • Concrete model and GPU analysis with real numbers
  • Quantitative GPU performance characterization (memory-bound classification)
  • GQA implementation (modern architecture awareness)
  • Honest self-critique of quantization overhead
  • Excellent documentation (README.md is comprehensive)

Weaknesses Comparison

MiniMax-M2.7 Critical Issues:

  1. Inverted causal mask — attention attends to future tokens instead of past
  2. Shared batched KV cache — all batch elements overwrite each other's cache
  3. No systematic testing — correctness is assumed, not verified
  4. Monolithic file — unmaintainable at scale

Qwen3.6-27B Minor Issues:

  1. Quantized cache has overhead in current implementation (acknowledged)
  2. GQA is not wired into the transformer
  3. No speculative decoding coverage
  4. No sliding window attention

Who Won and By How Much

Qwen3.6-27B wins decisively.

| Metric | MiniMax-M2.7 | Qwen3.6-27B | Delta |
| --- | --- | --- | --- |
| Overall Score | 64 | 91 | +27 |

The margin is large and justified:

  • Qwen3.6-27B is correct where MiniMax-M2.7 has fundamental bugs
  • Qwen3.6-27B is tested where MiniMax-M2.7 has no validation
  • Qwen3.6-27B is modular where MiniMax-M2.7 is a monolith
  • Qwen3.6-27B's analysis is quantitative where MiniMax-M2.7's is descriptive

MiniMax-M2.7 shows broader familiarity with concepts (more optimization ideas, more memory formats) but Qwen3.6-27B demonstrates deeper understanding and execution (working code, passing tests, numerical validation). In engineering, correctness and validation trump conceptual breadth.


6. Final Verdict

MiniMax-M2.7: 64/100 — "Conceptually Broad, Executionally Weak"

MiniMax-M2.7 demonstrates familiarity with a wide range of KV-cache concepts and writes extensive documentation. However, it suffers from critical correctness bugs (inverted causal mask, broken batched caching), lacks any systematic testing, and crams everything into an unmaintainable monolith. The implementation does not reliably meet the prompt's requirements for correct incremental decoding or variable-length batching. It reads like a knowledgeable engineer's first draft — full of good ideas but not yet debugged or validated.

Qwen3.6-27B: 91/100 — "Production-Grade, Rigorously Validated"

Qwen3.6-27B delivers a modular, correct, and thoroughly tested KV-cache system. Every component has a dedicated file, every demo passes with numerical validation, and the analysis is grounded in real models and GPUs. The attention masking is correct, the prefill/decode separation is clean, and the optimizations are actually implemented and verified. The README alone is a better technical document than MiniMax-M2.7's entire output. This is the work of an engineer who understands that correctness and testing are not optional.

Recommendation

If you need a KV-cache system to study, extend, or adapt: use Qwen3.6-27B. It is correct, tested, modular, and well-documented. MiniMax-M2.7 may be useful as a supplementary reference for additional optimization ideas (speculative decoding, sliding window, sparse KV), but its code should not be used without significant bug fixes.


Analysis completed by pi coding agent. Both implementations were read in full, executed, and evaluated against the original prompt requirements.