feat: add model comparisons and sanitize session files
- Rename gamma to glm5 and model to minimax-m2.7
- Add model_comparison/ directory with head-to-head analyses
- Sanitize all session.jsonl files: remove absolute paths and usernames
- Remove __pycache__ artifacts
- Add .gitignore
The KV-cache system is fully implemented and working. Here's a summary:
## Implementation Summary
### 1. Data Structures (`kv_cache.py`)
| Class | Purpose |
|-------|---------|
| `CacheConfig` | Configuration for cache dimensions and memory format |
| `KVCacheBlock` | Single block for paged attention (fixed-size memory unit) |
| `PagedKVCache` | Block-wise KV storage with on-demand allocation |
| `FlatKVCache` | Pre-allocated flat storage for maximum throughput |
| `MultiHeadAttention` | Attention with integrated cache read/write |
| `TransformerBlock` | Single layer with pre-norm attention and FFN |
| `BatchedInferenceEngine` | Manages variable-length sequences in a batch |
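
To make the table concrete, here is a minimal sketch of how `PagedKVCache` allocation could work. The class names mirror the table, but the fields and method signatures are assumptions for illustration, not the actual `kv_cache.py` API.

```python
# Hypothetical sketch of block-wise, on-demand allocation. Class names mirror
# the table above; every signature here is an assumption, not kv_cache.py.
from dataclasses import dataclass

import numpy as np


@dataclass
class CacheConfig:
    num_heads: int = 8
    head_dim: int = 64
    block_size: int = 16  # tokens per block


@dataclass
class KVCacheBlock:
    """One fixed-size memory unit holding K and V for block_size tokens."""
    keys: np.ndarray    # [block_size, heads, dim]
    values: np.ndarray  # [block_size, heads, dim]
    used: int = 0       # tokens written so far


class PagedKVCache:
    """Grows block by block as the sequence lengthens, instead of
    reserving the maximum sequence length up front."""

    def __init__(self, config: CacheConfig):
        self.config = config
        self.blocks: list[KVCacheBlock] = []

    def append(self, k: np.ndarray, v: np.ndarray) -> None:
        """Write one token's K/V ([heads, dim] each), allocating a new
        block only when the last one is full."""
        c = self.config
        if not self.blocks or self.blocks[-1].used == c.block_size:
            shape = (c.block_size, c.num_heads, c.head_dim)
            self.blocks.append(KVCacheBlock(np.zeros(shape, np.float16),
                                            np.zeros(shape, np.float16)))
        block = self.blocks[-1]
        block.keys[block.used] = k
        block.values[block.used] = v
        block.used += 1
```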
### 2. Memory Layout
```
[layers, batch, seq, 2, heads, dim]   # Flat format
[block_size, heads, dim] × N blocks   # Paged format
```
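
Plugging concrete numbers into the flat layout shows what pre-allocation costs. The model dimensions below are illustrative assumptions, not values from `kv_cache.py`:

```python
# Bytes pre-allocated by the flat layout [layers, batch, seq, 2, heads, dim].
# Model dimensions here are illustrative assumptions.
layers, batch, seq, heads, dim = 32, 8, 4096, 32, 128
bytes_fp16 = 2
total = layers * batch * seq * 2 * heads * dim * bytes_fp16
print(f"{total / 2**30:.1f} GiB")  # 16.0 GiB, reserved up front for max seq
```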
### 3. Key Algorithms
- **Incremental decoding**: computes Q only for the new token, reusing cached K and V
- **Causal masking**: prevents attending to future tokens
- **Attention**: `softmax(Q × K^T / √d) × V`
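
A compact NumPy sketch ties the three bullets together for one cached decode step: only the new token's query is computed, the cache supplies K and V, and causality holds because the cache contains no future tokens. The single-head simplification and shapes are assumptions for illustration:

```python
import numpy as np


def decode_step(q_new, k_cache, v_cache):
    """One incremental decode step for a single attention head.

    q_new:   [dim]       query for the newest token only
    k_cache: [seq, dim]  cached keys, including the new token
    v_cache: [seq, dim]  cached values
    """
    d = q_new.shape[-1]
    scores = k_cache @ q_new / np.sqrt(d)   # [seq] -- Q x K^T / sqrt(d)
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()                # softmax over past positions only
    return weights @ v_cache                # [dim] -- weighted sum of V
```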
### 4. Memory Analysis Results
| Sequence Length | Prefill (Q·K ops) | Cached Decode (Q·K ops/token) | Speedup |
|-----------------|-------------------|-------------------------------|---------|
| 4096 | 16,777,216 | 4,096 | **4,096x** |
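
The numbers follow directly from the attention shapes: prefill scores every query against every key (S²), while a cached decode step scores only the one new query against S cached keys:

```python
S = 4096
prefill = S * S       # 16,777,216 query-key scores to fill the cache
cached_step = S       # one new query scored against S cached keys
print(f"{prefill // cached_step}x")  # 4096x
```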
### 5. Optimizations Proposed
1. **Paged Attention**: Block-based allocation (16 tokens/block) reduces memory waste from pre-allocation
2. **Quantization**: FP16→INT8 gives 2x compression, INT4 gives 4x (see the sketch after this list)
3. **Chunked Attention** (Flash Attention style): 256x memory reduction for attention scores
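
As referenced in item 2 above, a minimal quantization sketch. The symmetric per-tensor scheme is an assumption chosen for brevity and may differ from what the implementation proposes:

```python
import numpy as np


def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor quantization: FP16 -> INT8 halves cache bytes."""
    scale = float(np.abs(x).max()) / 127.0
    scale = scale if scale > 0 else 1.0  # guard against an all-zero tensor
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale


def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP16 values before the attention matmul."""
    return q.astype(np.float16) * scale
```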
### 6. GPU Execution Mapping
- HBM → Shared Memory → Registers hierarchy
- KV-cache loaded from HBM per decode step
- Tensor cores for GEMM operations
- Critical bottlenecks: memory bandwidth and O(S²) attention
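
To see why bandwidth dominates, a back-of-envelope latency floor for one decode step: the whole KV-cache must stream from HBM each step. The dimensions and the ~3.35 TB/s figure (roughly an H100's HBM3) are assumptions:

```python
# Per-step decode latency floor from KV-cache reads alone (batch of 1).
# Dimensions and the ~3.35 TB/s HBM bandwidth figure are assumptions.
layers, seq, heads, dim, bytes_fp16 = 32, 4096, 32, 128, 2
kv_bytes = layers * seq * 2 * heads * dim * bytes_fp16   # K and V
hbm_bw = 3.35e12                                          # bytes/second
print(f"{kv_bytes / 2**30:.1f} GiB -> {kv_bytes / hbm_bw * 1e6:.0f} us/step")
```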
Run with:
```bash
python kv_cache.py
```