All 10 demos pass cleanly. Here's a summary of what was built:

## Project Structure

```
kv/
├── kv_cache.py          # Core KV-cache data structures
├── attention.py         # Attention computation (standard, cached, GQA, masked)
├── transformer.py       # Full transformer decoder with KV-cache integration
├── optimizations.py     # Paged attention, quantization, chunked prefill
├── memory_analysis.py   # Memory growth formulas, GPU limits
├── gpu_mapping.py       # GPU kernel design, Tensor Core analysis
├── demo.py              # 10 end-to-end demos
└── README.md            # Comprehensive documentation
```

## What Each Demo Validates

| # | Demo | Validates |
|---|------|-----------|
| 1 | Basic KV Cache | Data structure layout, O(1) append, memory tracking |
| 2 | Cached Attention | Q @ K^T @ V using cache, matches manual computation (sketched below) |
| 3 | Full Transformer | Prefill + generation loop, variable-length batching |
| 4 | Variable-Length Batching | Per-sequence attention with different lengths |
| 5 | Paged Attention | Block allocation, page tables, non-contiguous memory |
| 6 | Quantized Cache | int8 quantization/dequantization with affine transform (sketched below) |
| 7 | Chunked Prefill | Matches full attention (4.56e-10 diff), 8× memory savings |
| 8 | Optimization Comparison | Side-by-side memory costs of all strategies |
| 9 | Memory Analysis | Model sizes, growth curves, max context per GPU |
| 10 | GPU Tensor Cores | Arithmetic intensity → all configs are memory-bound |
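
As a rough illustration of what demos 1 and 2 exercise, the sketch below pre-allocates K/V buffers, appends one token at a time, and attends a single query against the cached keys and values. The class and function names are hypothetical and NumPy-only; the project's kv_cache.py and attention.py are assumed to be more general (batching, GQA, masking).

```python
import numpy as np

class SimpleKVCache:
    """Pre-allocated K/V buffers for one layer; appending a token is O(1)."""
    def __init__(self, max_seq, n_heads, head_dim, dtype=np.float16):
        self.k = np.zeros((max_seq, n_heads, head_dim), dtype=dtype)
        self.v = np.zeros((max_seq, n_heads, head_dim), dtype=dtype)
        self.len = 0  # number of valid cached positions

    def append(self, k_new, v_new):
        # k_new, v_new: (n_heads, head_dim) projections of the newest token
        self.k[self.len] = k_new
        self.v[self.len] = v_new
        self.len += 1

    def nbytes(self):
        # Memory actually occupied by the cached prefix
        return self.k[: self.len].nbytes + self.v[: self.len].nbytes

def cached_attention(q, cache):
    """Attention for one new query against every cached position.

    q: (n_heads, head_dim). Returns the (n_heads, head_dim) context vector.
    """
    k = cache.k[: cache.len].astype(np.float32)  # (seq, heads, dim)
    v = cache.v[: cache.len].astype(np.float32)
    scores = np.einsum("hd,shd->hs", q.astype(np.float32), k) / np.sqrt(q.shape[-1])
    w = np.exp(scores - scores.max(axis=-1, keepdims=True))
    w /= w.sum(axis=-1, keepdims=True)           # softmax over cached positions
    return np.einsum("hs,shd->hd", w, v)

# Usage: cache five tokens, then attend with a fresh query.
rng = np.random.default_rng(0)
cache = SimpleKVCache(max_seq=16, n_heads=4, head_dim=8)
for _ in range(5):
    cache.append(rng.standard_normal((4, 8)), rng.standard_normal((4, 8)))
out = cached_attention(rng.standard_normal((4, 8)), cache)
print(out.shape, cache.nbytes(), "bytes cached")
```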

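Demo 6's int8 scheme can be pictured with the round-trip below: an affine (scale + zero-point) mapping to int8 and back, which halves cache memory relative to fp16. Per-tensor scaling here is an assumption made for brevity; the actual optimizations.py may scale per head or per channel.

```python
import numpy as np

def quantize_int8(x):
    """Affine-quantize a float tensor so that x ≈ scale * (q - zero_point)."""
    x_min, x_max = float(x.min()), float(x.max())
    scale = (x_max - x_min) / 255.0 or 1.0          # guard against a constant tensor
    zero_point = np.round(-128.0 - x_min / scale)   # maps x_min -> -128, x_max -> 127
    q = np.clip(np.round(x / scale + zero_point), -128, 127).astype(np.int8)
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return scale * (q.astype(np.float32) - zero_point)

k = np.random.default_rng(0).standard_normal((4, 16, 64)).astype(np.float32)
q8, scale, zp = quantize_int8(k)
k_hat = dequantize_int8(q8, scale, zp)
print("max abs error:", float(np.abs(k - k_hat).max()))  # on the order of one scale step
```
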
## Key Findings

- KV-cache memory grows linearly with sequence length: bytes = 2 × batch × layers × heads × seq × head_dim × elem_bytes, where the factor 2 covers K and V (worked example after this list)
- 7B model on an H100-80GB: max context of ~121K tokens at batch=1, ~30K at batch=4
- Generation is memory-bound: arithmetic intensity of ~1.0 FLOPs/byte, so the bottleneck is HBM bandwidth, not compute
- Paged + quantized caches give a 2-4× memory reduction, enabling proportionally longer contexts
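
As a sanity check on the first two findings, the sketch below plugs an assumed 7B-class shape (32 layers, 32 heads, head_dim 128, fp16 cache and weights) into the growth formula. The resulting limits land in the same ballpark as the ~121K / ~30K figures above; the exact numbers depend on how much headroom is reserved for weights, activations, and fragmentation.

```python
def kv_cache_bytes(batch, layers, heads, seq, head_dim, elem_bytes=2):
    # Factor 2 = one K tensor and one V tensor per layer.
    return 2 * batch * layers * heads * seq * head_dim * elem_bytes

# Assumed 7B-class shape: 32 layers, 32 heads, head_dim 128, fp16 elements.
per_token = kv_cache_bytes(batch=1, layers=32, heads=32, seq=1, head_dim=128)
print(f"KV bytes per token: {per_token / 1e6:.2f} MB")  # ~0.5 MB/token

gpu_bytes = 80e9            # H100-80GB
weight_bytes = 7e9 * 2      # 7B parameters in fp16
budget = gpu_bytes - weight_bytes
for batch in (1, 4):
    max_tokens = budget / (per_token * batch)
    print(f"batch={batch}: roughly {max_tokens / 1e3:.0f}K tokens of KV budget")
# Switching the cache to int8 (elem_bytes=1) roughly doubles these limits;
# paged allocation then lets that budget be used without fragmentation.
```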