feat: add model comparisons and sanitize session files

- Rename gamma to glm5 and model to minimax-m2.7 - Add model_comparison/ directory with head-to-head analyses - Sanitize all session.jsonl files: remove absolute paths and usernames - Remove __pycache__ artifacts - Add .gitignore
2026-04-23 11:16:01 +02:00
commit 8e72eef09c
62 changed files with 18469 additions and 0 deletions
@@ -0,0 +1,37 @@
+All 10 demos pass cleanly. Here's a summary of what was built:
+
+## Project Structure
+
+```
+kv/
+├── kv_cache.py          # Core KV-cache data structures
+├── attention.py         # Attention computation (standard, cached, GQA, masked)
+├── transformer.py       # Full transformer decoder with KV-cache integration
+├── optimizations.py     # Paged attention, quantization, chunked prefill
+├── memory_analysis.py   # Memory growth formulas, GPU limits
+├── gpu_mapping.py       # GPU kernel design, Tensor Core analysis
+├── demo.py              # 10 end-to-end demos
+└── README.md            # Comprehensive documentation
+```
+
+## What Each Demo Validates
+
+| # | Demo | Validates |
+|---|------|-----------|
+| 1 | Basic KV Cache | Data structure layout, O(1) append, memory tracking |
+| 2 | Cached Attention | Q @ K^T @ V using cache, matches manual computation |
+| 3 | Full Transformer | Prefill + generation loop, variable-length batching |
+| 4 | Variable-Length Batching | Per-sequence attention with different lengths |
+| 5 | Paged Attention | Block allocation, page tables, non-contiguous memory |
+| 6 | Quantized Cache | int8 quantization/dequantization with affine transform |
+| 7 | Chunked Prefill | Matches full attention (4.56e-10 diff), 8× memory savings |
+| 8 | Optimization Comparison | Side-by-side memory costs of all strategies |
+| 9 | Memory Analysis | Model sizes, growth curves, max context per GPU |
+| 10 | GPU Tensor Cores | Arithmetic intensity → all configs are memory-bound |
+
+## Key Findings
+
+- **KV cache is linear**: `2 × batch × layers × heads × seq × head_dim × elem_bytes`
+- **7B model on H100-80GB**: max ~121K tokens (batch=1), ~30K (batch=4)
+- **Generation is memory-bound**: 1.0 FLOPs/byte intensity — bottleneck is HBM bandwidth, not compute
+- **Paged + quantized**: 2-4× memory reduction, enabling proportionally longer contexts