feat: add model comparisons and sanitize session files

- Rename gamma to glm5 and model to minimax-m2.7
- Add model_comparison/ directory with head-to-head analyses
- Sanitize all session.jsonl files: remove absolute paths and usernames
- Remove __pycache__ artifacts
- Add .gitignore
commit 8e72eef09c
2026-04-23 11:16:01 +02:00
62 changed files with 18469 additions and 0 deletions
@@ -0,0 +1,56 @@
The KV-cache system is fully implemented and working. Here's a summary:
## Implementation Summary
### 1. Data Structures (`kv_cache.py`)
| Class | Purpose |
|-------|---------|
| `CacheConfig` | Configuration for cache dimensions and memory format |
| `KVCacheBlock` | Single block for paged attention (fixed-size memory unit) |
| `PagedKVCache` | Block-wise KV storage with on-demand allocation |
| `FlatKVCache` | Pre-allocated flat storage for max throughput |
| `MultiHeadAttention` | Attention with integrated cache read/write |
| `TransformerBlock` | Single layer with pre-norm and FFN |
| `BatchedInferenceEngine` | Manages variable-length batch sequences |
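The table above lists the main classes; as a minimal sketch (not the actual definitions in `kv_cache.py`, which may differ), `CacheConfig` and `FlatKVCache` could look roughly like this, assuming a NumPy backing array in the flat layout shown in the next section:
```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CacheConfig:
    num_layers: int
    num_heads: int
    head_dim: int
    max_batch: int
    max_seq_len: int
    dtype: type = np.float16  # memory format of cached K/V

class FlatKVCache:
    """Pre-allocated [layers, batch, seq, 2, heads, dim] buffer (flat format)."""
    def __init__(self, cfg: CacheConfig):
        self.cfg = cfg
        self.buf = np.zeros(
            (cfg.num_layers, cfg.max_batch, cfg.max_seq_len, 2,
             cfg.num_heads, cfg.head_dim),
            dtype=cfg.dtype,
        )
        self.lengths = np.zeros(cfg.max_batch, dtype=np.int64)  # cached tokens per sequence

    def append(self, layer: int, seq_idx: int, k: np.ndarray, v: np.ndarray) -> None:
        """Write the newest token's K/V ([heads, dim] each) at the next free slot."""
        pos = self.lengths[seq_idx]
        self.buf[layer, seq_idx, pos, 0] = k
        self.buf[layer, seq_idx, pos, 1] = v
        if layer == self.cfg.num_layers - 1:  # count the step once all layers have written
            self.lengths[seq_idx] += 1
```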
### 2. Memory Layout
```
[layers, batch, seq, 2, heads, dim] # Flat format
[block_size, heads, dim] × N blocks # Paged format
```
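For the paged format, a rough sketch of how fixed-size blocks could be allocated on demand and addressed through a per-sequence block table (block size, pool size, and head dimensions here are illustrative, not the repo's values):
```python
import numpy as np

BLOCK_SIZE, NUM_HEADS, HEAD_DIM, NUM_BLOCKS = 16, 8, 64, 1024  # illustrative sizes

# Pools of fixed-size blocks, one for keys and one for values: [block_size, heads, dim] × N.
k_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, NUM_HEADS, HEAD_DIM), dtype=np.float16)
v_pool = np.zeros_like(k_pool)
free_blocks = list(range(NUM_BLOCKS))    # blocks handed out on demand
block_tables: dict[int, list[int]] = {}  # sequence id -> ordered block indices

def slot_for(seq_id: int, position: int) -> tuple[int, int]:
    """Map a token position to (block index, offset), allocating blocks lazily."""
    table = block_tables.setdefault(seq_id, [])
    while position // BLOCK_SIZE >= len(table):
        table.append(free_blocks.pop())  # on-demand allocation
    return table[position // BLOCK_SIZE], position % BLOCK_SIZE

# Usage: write K/V for token 17 of sequence 0.
blk, off = slot_for(0, 17)
k_pool[blk, off] = np.random.randn(NUM_HEADS, HEAD_DIM).astype(np.float16)
```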
### 3. Key Algorithms
- **Incremental decoding**: computes projections only for the new token and reuses cached K, V for all earlier positions (sketched below)
- **Causal masking**: Prevents attending to future tokens
- **Attention**: `softmax(Q × K^T / √d) × V`
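A minimal single-head sketch of the cached decode step, assuming the new token's K and V were already appended to the cache so the causal constraint reduces to attending only over the first `cache_len` cached positions:
```python
import numpy as np

def decode_step_attention(q_new: np.ndarray, k_cache: np.ndarray,
                          v_cache: np.ndarray, cache_len: int) -> np.ndarray:
    """Attention for one new token against cached K/V (single head).

    q_new:   [dim]           query of the newest token
    k_cache: [max_seq, dim]  cached keys; first cache_len rows are valid
    v_cache: [max_seq, dim]  cached values
    """
    d = q_new.shape[-1]
    k = k_cache[:cache_len]              # reuse cached keys, no recomputation
    v = v_cache[:cache_len]
    scores = k @ q_new / np.sqrt(d)      # one row of softmax(Q × K^T / √d)
    scores -= scores.max()               # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ v                   # [dim]
```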
### 4. Memory Analysis Results
| Sequence Length | Prefill (attention scores) | Cached Decode (scores/step) | Speedup |
|----------------|----------------------------|-----------------------------|---------|
| 4096 | 16,777,216 | 4,096 | **4,096x** |
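The table counts attention-score computations per head per layer; the arithmetic behind it, plus the linear cache-size growth that motivates the optimizations below (model dimensions in the size estimate are illustrative assumptions):
```python
# Score counts behind the table (per head, per layer).
S = 4096
prefill_scores = S * S        # full self-attention over the prompt: 16,777,216
decode_scores = S             # one new query against all cached keys: 4,096
print(prefill_scores // decode_scores)  # 4096x fewer scores per decode step

# KV-cache memory grows linearly with sequence length:
#   bytes = 2 (K and V) × layers × heads × head_dim × seq_len × bytes/element
layers, heads, head_dim, bytes_per_elem = 32, 32, 128, 2  # assumed FP16 config
for seq_len in (1024, 4096, 16384):
    gb = 2 * layers * heads * head_dim * seq_len * bytes_per_elem / 1e9
    print(f"seq {seq_len:>6}: {gb:.2f} GB per sequence")
```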
### 5. Optimizations Proposed
1. **Paged Attention**: Block-based allocation (16 tokens/block) reduces memory waste from pre-allocation
2. **Quantization**: FP16→INT8 = 2x compression, INT4 = 4x
3. **Chunked Attention** (Flash Attention style): 256x memory reduction for attention scores
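As an illustration of the quantization option (item 2 above), a sketch of symmetric per-tensor INT8 quantization for cached K/V, which halves storage relative to FP16; the actual code may instead use per-channel scales or INT4 packing:
```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: 2x smaller than FP16 storage."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP16 values before the attention matmul."""
    return (q.astype(np.float32) * scale).astype(np.float16)

# Round-trip example on a fake K tensor.
k = np.random.randn(16, 8, 64).astype(np.float16)
k_q, s = quantize_int8(k)
print(k.nbytes, k_q.nbytes)   # 16384 vs 8192 bytes: 2x compression
```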
### 6. GPU Execution Mapping
- HBM → Shared Memory → Registers hierarchy
- KV-cache loaded from HBM per decode step
- Tensor cores for GEMM operations
- Critical bottlenecks: memory bandwidth, O(S²) attention
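A back-of-envelope for the memory-bandwidth bottleneck, reusing the assumed 32-layer FP16 configuration from the growth estimate above (the ~1 TB/s HBM figure is likewise illustrative, not a measured result):
```python
# Each decode step streams the whole KV-cache from HBM, so bytes/step caps tokens/s.
layers, heads, head_dim, seq_len, bytes_per_elem = 32, 32, 128, 4096, 2  # assumed
kv_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per_elem      # K and V
hbm_bytes_per_s = 1.0e12                                                 # ~1 TB/s, assumed
print(f"{kv_bytes / 1e9:.2f} GB per step -> at most {hbm_bytes_per_s / kv_bytes:.0f} tokens/s")
```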
Run with:
```bash
python kv_cache.py
```
@@ -0,0 +1,19 @@
Implement an efficient KV-cache system for autoregressive transformer inference from scratch.
Requirements:
1. Support incremental decoding (one token at a time).
2. Avoid recomputing attention for past tokens.
3. Handle:
- multi-head attention
- batching with variable sequence lengths
4. Provide:
- data structure layout (memory format)
- update logic per step
- attention computation using cached keys/values
Additionally:
- Analyze memory growth over long sequences.
- Propose at least two optimizations (e.g., paged attention, chunking, compression).
- Explain how this would map to GPU execution.
Do not use any frameworks.
File diff suppressed because it is too large
File diff suppressed because one or more lines are too long