The KV-cache system is fully implemented and working. Here's a summary:
Implementation Summary
1. Data Structures (kv_cache.py)
| Class | Purpose |
|---|---|
| CacheConfig | Configuration for cache dimensions and memory format |
| KVCacheBlock | Single block for paged attention (fixed-size memory unit) |
| PagedKVCache | Block-wise KV storage with on-demand allocation |
| FlatKVCache | Pre-allocated flat storage for max throughput |
| MultiHeadAttention | Attention with integrated cache read/write |
| TransformerBlock | Single layer with pre-norm and FFN |
| BatchedInferenceEngine | Manages variable-length batch sequences |
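As an illustration only, here is a minimal sketch of how the configuration and paged-cache classes might fit together; the field names and block-allocation logic are assumptions, not the actual contents of kv_cache.py.

```python
# Hypothetical sketch of the cache data structures (not the actual kv_cache.py).
from dataclasses import dataclass
import torch

@dataclass
class CacheConfig:
    num_layers: int
    num_heads: int
    head_dim: int
    block_size: int = 16              # tokens per paged-attention block
    dtype: torch.dtype = torch.float16

class KVCacheBlock:
    """Fixed-size memory unit holding K and V for block_size tokens of one layer."""
    def __init__(self, cfg: CacheConfig):
        shape = (cfg.block_size, cfg.num_heads, cfg.head_dim)
        self.k = torch.zeros(shape, dtype=cfg.dtype)
        self.v = torch.zeros(shape, dtype=cfg.dtype)
        self.used = 0                 # tokens written so far

class PagedKVCache:
    """Allocates blocks on demand as a sequence grows."""
    def __init__(self, cfg: CacheConfig):
        self.cfg = cfg
        self.blocks: list[KVCacheBlock] = []

    def append(self, k: torch.Tensor, v: torch.Tensor) -> None:
        # k, v: [num_heads, head_dim] for a single new token
        if not self.blocks or self.blocks[-1].used == self.cfg.block_size:
            self.blocks.append(KVCacheBlock(self.cfg))   # allocate only when needed
        blk = self.blocks[-1]
        blk.k[blk.used] = k
        blk.v[blk.used] = v
        blk.used += 1
```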
2. Memory Layout
[layers, batch, seq, 2, heads, dim] # Flat format
[block_size, heads, dim] × N blocks # Paged format
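For concreteness, this is how the flat layout above could be pre-allocated; the model dimensions below are illustrative assumptions, not values from kv_cache.py.

```python
# Hypothetical allocation of the flat [layers, batch, seq, 2, heads, dim] layout.
import torch

layers, batch, max_seq, heads, dim = 32, 8, 4096, 32, 128
flat_cache = torch.zeros(layers, batch, max_seq, 2, heads, dim, dtype=torch.float16)

# Writing K and V for token position t in layer l:
#   flat_cache[l, :, t, 0] = k   # [batch, heads, dim]
#   flat_cache[l, :, t, 1] = v
print(flat_cache.element_size() * flat_cache.nelement() / 2**30, "GiB pre-allocated")
```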
3. Key Algorithms
- Incremental decoding: Only computes Q for new token, reuses cached K,V
- Causal masking: Prevents attending to future tokens
- Attention: softmax(Q × K^T / √d) × V (see the sketch below)
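A minimal sketch of the cached decode step, assuming a flat [max_seq, heads, dim] per-layer cache; the function and variable names are illustrative, not taken from kv_cache.py.

```python
# Hypothetical single-token decode step against an existing KV cache.
import torch
import torch.nn.functional as F

def decode_step(q, k_new, v_new, k_cache, v_cache, pos):
    """q, k_new, v_new: [heads, dim] for the new token; caches: [max_seq, heads, dim]."""
    # Write the new token's K, V into the cache (no recomputation of past tokens).
    k_cache[pos] = k_new
    v_cache[pos] = v_new
    k = k_cache[: pos + 1]                       # [pos+1, heads, dim]
    v = v_cache[: pos + 1]
    d = q.shape[-1]
    # Scores for the new query against all cached keys: [heads, pos+1].
    scores = torch.einsum("hd,shd->hs", q, k) / d**0.5
    # Causal masking is implicit: only positions <= pos exist in the cache.
    attn = F.softmax(scores, dim=-1)
    return torch.einsum("hs,shd->hd", attn, v)   # [heads, dim]
```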
4. Memory Analysis Results
| Sequence Length | Prefill Cost (attention ops) | Cached Decode (ops per token) | Speedup |
|---|---|---|---|
| 4096 | 16,777,216 | 4,096 | 4,096x |
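The numbers above follow directly from the O(S²) prefill cost versus the O(S) per-token cached decode cost; a quick check:

```python
# Quick check of the table: quadratic prefill vs linear cached decode.
S = 4096
prefill = S * S          # every query attends to every key during prefill
cached_decode = S        # one new query attends to S cached keys
print(prefill, cached_decode, prefill // cached_decode)  # 16777216 4096 4096
```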
5. Optimizations Proposed
- Paged Attention: Block-based allocation (16 tokens/block) reduces memory waste from pre-allocation
- Quantization: FP16→INT8 = 2x compression, INT4 = 4x (see the sketch after this list)
- Chunked Attention (Flash Attention style): 256x memory reduction for attention scores
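As a hedged illustration of the quantization item, here is one simple way INT8 compression of a cached K or V tensor could look; the scheme (symmetric, per-tensor scale) is an assumption, not necessarily what the proposal intends.

```python
# Hypothetical symmetric per-tensor INT8 quantization of a KV-cache tensor.
import torch

def quantize_int8(x: torch.Tensor):
    scale = x.abs().max() / 127.0                     # one scale for the whole tensor
    q = torch.clamp((x / scale).round(), -127, 127).to(torch.int8)
    return q, scale                                   # int8 payload + scale

def dequantize_int8(q: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return (q.float() * scale).to(torch.float16)

k_fp16 = torch.randn(4096, 32, 128, dtype=torch.float16)
k_int8, scale = quantize_int8(k_fp16.float())
print(k_fp16.nelement() * 2, "->", k_int8.nelement() * 1, "bytes (2x smaller)")
```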
6. GPU Execution Mapping
- HBM → Shared Memory → Registers hierarchy
- KV-cache loaded from HBM per decode step
- Tensor cores for GEMM operations
- Critical bottlenecks: memory bandwidth, O(S²) attention
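To make the memory-bandwidth point concrete, a rough estimate of the bytes streamed from HBM per decode step (the model dimensions here are illustrative assumptions):

```python
# Rough per-step HBM traffic for reading the full KV cache (illustrative dimensions).
layers, heads, dim, seq, bytes_per_elem = 32, 32, 128, 4096, 2   # FP16
kv_bytes = layers * 2 * seq * heads * dim * bytes_per_elem       # K and V across all layers
print(kv_bytes / 2**30, "GiB read from HBM per generated token") # 2.0 GiB at these sizes
```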
Run with:
python kv_cache.py