feat: add model comparisons and sanitize session files

- Rename gamma to glm5 and model to minimax-m2.7
- Add model_comparison/ directory with head-to-head analyses
- Sanitize all session.jsonl files: remove absolute paths and usernames
- Remove __pycache__ artifacts
- Add .gitignore
2026-04-23 11:16:01 +02:00
commit 8e72eef09c
62 changed files with 18469 additions and 0 deletions
@@ -0,0 +1,56 @@
The KV-cache system is fully implemented and working. Here's a summary:
## Implementation Summary
### 1. Data Structures (`kv_cache.py`)
| Class | Purpose |
|-------|---------|
| `CacheConfig` | Configuration for cache dimensions and memory format |
| `KVCacheBlock` | Single block for paged attention (fixed-size memory unit) |
| `PagedKVCache` | Block-wise KV storage with on-demand allocation |
| `FlatKVCache` | Pre-allocated flat storage for max throughput |
| `MultiHeadAttention` | Attention with integrated cache read/write |
| `TransformerBlock` | Single layer with pre-norm and FFN |
| `BatchedInferenceEngine` | Manages variable-length batch sequences |
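The actual constructors and methods live in `kv_cache.py`; as a rough orientation, the two storage variants in the table could be sketched as below. The field and method names are illustrative assumptions, not the file's API.
```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CacheConfig:
    num_layers: int = 4
    num_heads: int = 8
    head_dim: int = 64
    max_seq_len: int = 1024
    block_size: int = 16        # tokens per block in the paged variant
    dtype: type = np.float16

class FlatKVCache:
    """Pre-allocates the full [layers, batch, seq, 2, heads, dim] tensor."""
    def __init__(self, cfg: CacheConfig, batch_size: int):
        self.data = np.zeros((cfg.num_layers, batch_size, cfg.max_seq_len, 2,
                              cfg.num_heads, cfg.head_dim), dtype=cfg.dtype)

    def append(self, layer, batch, pos, k, v):
        self.data[layer, batch, pos, 0] = k   # k, v: [heads, dim]
        self.data[layer, batch, pos, 1] = v

class PagedKVCache:
    """Allocates [block_size, heads, dim] K and V blocks only when needed."""
    def __init__(self, cfg: CacheConfig):
        self.cfg = cfg
        self.k_blocks, self.v_blocks = [], []   # blocks for one sequence

    def append(self, pos, k, v):
        if pos % self.cfg.block_size == 0:      # grow by one block on demand
            shape = (self.cfg.block_size, self.cfg.num_heads, self.cfg.head_dim)
            self.k_blocks.append(np.zeros(shape, dtype=self.cfg.dtype))
            self.v_blocks.append(np.zeros(shape, dtype=self.cfg.dtype))
        blk, off = divmod(pos, self.cfg.block_size)
        self.k_blocks[blk][off] = k
        self.v_blocks[blk][off] = v
```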
### 2. Memory Layout
```
[layers, batch, seq, 2, heads, dim] # Flat format
[block_size, heads, dim] × N blocks # Paged format
```
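To put numbers on the flat layout, its size is just the product of those dimensions times the element size. The dimensions below are illustrative (roughly a 7B-class model), not values taken from `kv_cache.py`:
```python
# Flat-layout size = layers * batch * seq * 2 * heads * dim * bytes per element.
layers, batch, seq, heads, dim, fp16_bytes = 32, 1, 4096, 32, 128, 2
kv_bytes = layers * batch * seq * 2 * heads * dim * fp16_bytes
print(f"{kv_bytes / 2**30:.1f} GiB")   # 2.0 GiB pre-allocated per sequence
```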
### 3. Key Algorithms
- **Incremental decoding**: Only computes Q for new token, reuses cached K,V
- **Causal masking**: Prevents attending to future tokens
- **Attention**: `softmax(Q × K^T / √d) × V`
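Putting the three points together, a single cached decode step might look like the following sketch. The shapes and the helper name are assumptions; `kv_cache.py`'s `MultiHeadAttention` integrates this with projections and batching.
```python
import numpy as np

def cached_decode_step(q_new, k_cache, v_cache, k_new, v_new):
    """q_new, k_new, v_new: [heads, dim]; k_cache, v_cache: [seq, heads, dim]."""
    k_cache = np.concatenate([k_cache, k_new[None]], axis=0)  # append new K
    v_cache = np.concatenate([v_cache, v_new[None]], axis=0)  # append new V
    d = q_new.shape[-1]
    # The causal mask is implicit in this step: the new token sits at the end
    # of the sequence, so every cached position is a legal (past) position.
    scores = np.einsum('hd,shd->hs', q_new, k_cache) / np.sqrt(d)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)             # softmax
    out = np.einsum('hs,shd->hd', weights, v_cache)            # weighted sum of V
    return out, k_cache, v_cache

heads, dim, cached = 8, 64, 10
out, k, v = cached_decode_step(np.random.randn(heads, dim),
                               np.random.randn(cached, heads, dim),
                               np.random.randn(cached, heads, dim),
                               np.random.randn(heads, dim),
                               np.random.randn(heads, dim))
print(out.shape)   # (8, 64): one output vector per head for the new token
```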
### 4. Memory Analysis Results
| Sequence Length | Prefill Cost (attention scores) | Cached Decode (scores/step) | Speedup |
|----------------|---------------------------------|-----------------------------|---------|
| 4096 | 16,777,216 | 4,096 | **4,096x** |
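Reading the row as attention-score counts (a full prefill at length S computes S² scores, while a cached decode step computes only S for the new token), the speedup column follows directly:
```python
S = 4096
prefill_scores = S * S          # 16,777,216 attention scores for a full prefill
cached_step_scores = S          # 4,096 scores for the single new token
print(prefill_scores // cached_step_scores)   # 4096x
```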
### 5. Optimizations Proposed
1. **Paged Attention**: Block-based allocation (16 tokens/block) reduces memory waste from pre-allocation
2. **Quantization**: FP16→INT8 = 2x compression, INT4 = 4x (see the sketch after this list)
3. **Chunked Attention** (Flash Attention style): 256x memory reduction for attention scores
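As an illustration of item 2, a symmetric per-tensor FP16→INT8 round trip for one cached block could look like this; it is a sketch of the idea, not how `kv_cache.py` actually stores its data.
```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: 2 bytes/value -> 1 byte + one scale."""
    scale = max(float(np.abs(x).max()), 1e-8) / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return (q.astype(np.float32) * scale).astype(np.float16)

kv_block = np.random.randn(16, 8, 64).astype(np.float16)     # one 16-token block
q, scale = quantize_int8(kv_block)
print(kv_block.nbytes, "->", q.nbytes)                        # 2x smaller payload
print(float(np.abs(dequantize(q, scale) - kv_block).max()))   # small round-off error
```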
### 6. GPU Execution Mapping
- HBM → Shared Memory → Registers hierarchy
- KV-cache loaded from HBM per decode step
- Tensor cores for GEMM operations
- Critical bottlenecks: memory bandwidth, O(S²) attention
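To see why memory bandwidth tops the bottleneck list: each decode step streams the entire KV-cache out of HBM, so cache bytes divided by bandwidth gives a latency floor per token. The numbers below are illustrative assumptions (the ~2 GiB cache from the earlier estimate and ~2 TB/s of HBM bandwidth), not measurements from this repo.
```python
kv_cache_bytes = 2.0 * 2**30        # ~2 GiB flat cache (earlier estimate)
hbm_bytes_per_second = 2.0e12       # ~2 TB/s HBM, an assumed figure
floor_ms = kv_cache_bytes / hbm_bytes_per_second * 1e3
print(f"~{floor_ms:.1f} ms/token just to read the cache")   # ~1.1 ms
```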
Run with:
```bash
python kv_cache.py
```