8e72eef09c
- Rename gamma to glm5 and model to minimax-m2.7
- Add model_comparison/ directory with head-to-head analyses
- Sanitize all session.jsonl files: remove absolute paths and usernames
- Remove __pycache__ artifacts
- Add .gitignore
56 lines
1.7 KiB
Markdown
The KV-cache system is fully implemented and working. Here's a summary:
## Implementation Summary
### 1. Data Structures (`kv_cache.py`)
| Class | Purpose |
|-------|---------|
| `CacheConfig` | Configuration for cache dimensions and memory format |
| `KVCacheBlock` | Single block for paged attention (fixed-size memory unit) |
| `PagedKVCache` | Block-wise KV storage with on-demand allocation |
| `FlatKVCache` | Pre-allocated flat storage for max throughput |
| `MultiHeadAttention` | Attention with integrated cache read/write |
| `TransformerBlock` | Single layer with pre-norm and FFN |
| `BatchedInferenceEngine` | Manages variable-length batch sequences |
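A minimal sketch of the paged-allocation idea behind `PagedKVCache` (field and method names here are illustrative, not the actual `kv_cache.py` API; `KVCacheBlock` is collapsed into raw arrays for brevity):

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CacheConfig:
    # Hypothetical fields inferred from the class table; the real file may differ.
    num_heads: int = 4
    head_dim: int = 8
    block_size: int = 16  # tokens per block, as in the paged-attention proposal

class PagedKVCache:
    """Allocates fixed-size K/V blocks on demand, one block list per sequence."""

    def __init__(self, config: CacheConfig):
        self.config = config
        # seq_id -> list of [2 (K/V), block_size, heads, dim] arrays
        self.blocks: dict[int, list[np.ndarray]] = {}

    def append(self, seq_id: int, k: np.ndarray, v: np.ndarray, pos: int) -> None:
        cfg = self.config
        block_idx, offset = divmod(pos, cfg.block_size)
        seq_blocks = self.blocks.setdefault(seq_id, [])
        while len(seq_blocks) <= block_idx:  # allocate a new block only when needed
            seq_blocks.append(
                np.zeros((2, cfg.block_size, cfg.num_heads, cfg.head_dim), dtype=np.float32)
            )
        seq_blocks[block_idx][0, offset] = k
        seq_blocks[block_idx][1, offset] = v

cfg = CacheConfig()
cache = PagedKVCache(cfg)
k = v = np.ones((cfg.num_heads, cfg.head_dim), dtype=np.float32)
for pos in range(20):          # 20 tokens -> spills into a second 16-token block
    cache.append(seq_id=0, k=k, v=v, pos=pos)
print(len(cache.blocks[0]))    # 2
```

The on-demand `while` loop is the point: memory grows one block at a time with the sequence, instead of reserving the full maximum length up front.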
### 2. Memory Layout
```
[layers, batch, seq, 2, heads, dim]   # Flat format
[block_size, heads, dim] × N blocks   # Paged format
```
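The flat format amounts to one pre-allocated tensor; a sketch with illustrative dimensions (not the file's defaults):

```python
import numpy as np

# Hypothetical sizes, chosen only to make the arithmetic easy to check.
layers, batch, max_seq, heads, dim = 4, 2, 128, 8, 64

# Flat format: [layers, batch, seq, 2 (K/V), heads, dim], allocated once up front.
flat_cache = np.zeros((layers, batch, max_seq, 2, heads, dim), dtype=np.float16)
print(flat_cache.shape)   # (4, 2, 128, 2, 8, 64)
print(flat_cache.nbytes)  # 4*2*128*2*8*64 elements * 2 bytes = 2097152
```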
### 3. Key Algorithms
- **Incremental decoding**: computes Q only for the new token, reusing cached K,V
- **Causal masking**: prevents attending to future tokens
- **Attention**: `softmax(Q × K^T / √d) × V`
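A single-head sketch of one cached decode step combining these three pieces (helper names are hypothetical; the real `MultiHeadAttention` integrates this with the cache classes above). Causal masking is implicit here because the cache holds only past tokens:

```python
import numpy as np

def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def decode_step(q_new, k_cache, v_cache, k_new, v_new):
    """One cached decode step: append the new K,V, attend with the new token's Q only.

    q_new, k_new, v_new: [dim]; k_cache, v_cache: [seq, dim] (single head for brevity).
    """
    k_cache = np.vstack([k_cache, k_new[None, :]])
    v_cache = np.vstack([v_cache, v_new[None, :]])
    scores = k_cache @ q_new / np.sqrt(q_new.shape[-1])  # [seq+1], not [seq+1, seq+1]
    weights = softmax(scores)
    return weights @ v_cache, k_cache, v_cache           # output: [dim]

rng = np.random.default_rng(0)
dim, seq = 8, 5
k_cache = rng.standard_normal((seq, dim))
v_cache = rng.standard_normal((seq, dim))
out, k_cache, v_cache = decode_step(rng.standard_normal(dim), k_cache, v_cache,
                                    rng.standard_normal(dim), rng.standard_normal(dim))
print(out.shape, k_cache.shape)  # (8,) (6, 8)
```

The score vector is O(S) per step instead of the O(S²) matrix a full recompute would build, which is where the speedup in the next section comes from.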
### 4. Memory Analysis Results
| Sequence Length | Prefill Cost (ops) | Cached Decode (ops/token) | Speedup |
|-----------------|--------------------|---------------------------|---------|
| 4096 | 16,777,216 | 4,096 | **4,096x** |
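The table's numbers follow directly from the O(S²)-vs-O(S) argument:

```python
S = 4096
prefill_scores = S * S  # full attention recompute: O(S^2) score entries
cached_decode = S       # one new token attends to S cached entries: O(S)
print(prefill_scores)                   # 16777216
print(prefill_scores // cached_decode)  # 4096, i.e. a 4096x per-token speedup
```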
### 5. Optimizations Proposed
1. **Paged Attention**: block-based allocation (16 tokens/block) avoids the memory waste of pre-allocating full-length sequences
2. **Quantization**: FP16→INT8 gives 2x compression; INT4 gives 4x
3. **Chunked Attention** (Flash Attention style): 256x memory reduction for attention scores
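A sketch of the INT8 half of the quantization proposal. Symmetric per-tensor scaling is assumed here; the proposal itself does not specify a scheme:

```python
import numpy as np

def quantize_int8(x: np.ndarray):
    """Symmetric per-tensor INT8 quantization: half the bytes of FP16."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale) -> np.ndarray:
    return q.astype(np.float16) * np.float16(scale)

# A fake KV tensor standing in for one cache block.
kv = np.random.default_rng(0).standard_normal((16, 8, 64)).astype(np.float16)
q, scale = quantize_int8(kv)
print(kv.nbytes // q.nbytes)  # 2  (FP16 -> INT8 compression)
print(float(np.abs(dequantize(q, scale) - kv).max()))  # small round-trip error
```

INT4 would halve the bytes again but needs packing (two values per byte) and typically per-group scales to keep the error acceptable.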
### 6. GPU Execution Mapping
- HBM → Shared Memory → Registers hierarchy
- KV-cache loaded from HBM per decode step
- Tensor cores for GEMM operations
- Critical bottlenecks: memory bandwidth, O(S²) attention
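A back-of-envelope estimate of why bandwidth dominates, using hypothetical 7B-class dimensions and an assumed ~1 TB/s of HBM bandwidth:

```python
# Hypothetical model dimensions, chosen for illustration only.
layers, heads, dim, seq = 32, 32, 128, 4096
dtype_bytes = 2  # FP16

# Every decode step streams the whole KV-cache from HBM once.
kv_bytes = layers * 2 * seq * heads * dim * dtype_bytes
print(kv_bytes / 2**30)  # 2.0 GiB read per decode step

hbm_bw = 1e12  # ~1 TB/s
print(round(hbm_bw / kv_bytes))  # 466: bandwidth-only ceiling on tokens/s
```

The arithmetic per step is tiny by comparison, so the GPU spends its time moving KV bytes, not multiplying: exactly the memory-bandwidth bottleneck listed above.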
Run with:
```bash
python kv_cache.py
```