The KV-cache system is fully implemented and working. Here's a summary:
## Implementation Summary
### 1. Data Structures (`kv_cache.py`)
| Class | Purpose |
|-------|---------|
| `CacheConfig` | Configuration for cache dimensions and memory format |
| `KVCacheBlock` | Single block for paged attention (fixed-size memory unit) |
| `PagedKVCache` | Block-wise KV storage with on-demand allocation |
| `FlatKVCache` | Pre-allocated flat storage for max throughput |
| `MultiHeadAttention` | Attention with integrated cache read/write |
| `TransformerBlock` | Single layer with pre-norm and FFN |
| `BatchedInferenceEngine` | Manages variable-length batch sequences |
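A minimal sketch of how the first two pieces fit together, assuming numpy and illustrative field names (the actual `CacheConfig`/`FlatKVCache` signatures in `kv_cache.py` may differ):
```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CacheConfig:
    num_layers: int = 4    # illustrative defaults, not the real config
    max_batch: int = 2
    max_seq: int = 128
    num_heads: int = 8
    head_dim: int = 64

class FlatKVCache:
    """Pre-allocated [layers, batch, seq, 2, heads, dim] buffer."""
    def __init__(self, cfg: CacheConfig):
        self.buf = np.zeros(
            (cfg.num_layers, cfg.max_batch, cfg.max_seq, 2,
             cfg.num_heads, cfg.head_dim),
            dtype=np.float16,
        )
        # One write position per sequence, advanced once per generated token.
        self.lengths = np.zeros(cfg.max_batch, dtype=np.int64)

    def write(self, layer: int, seq_idx: int, k: np.ndarray, v: np.ndarray):
        pos = self.lengths[seq_idx]
        self.buf[layer, seq_idx, pos, 0] = k   # K slot
        self.buf[layer, seq_idx, pos, 1] = v   # V slot

cfg = CacheConfig()
cache = FlatKVCache(cfg)
k = v = np.ones((cfg.num_heads, cfg.head_dim), dtype=np.float16)
cache.write(layer=0, seq_idx=0, k=k, v=v)
cache.lengths[0] += 1   # advance after every layer has written this token
```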
### 2. Memory Layout
```
[layers, batch, seq, 2, heads, dim] # Flat format
[block_size, heads, dim] × N blocks # Paged format
```
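The paged format trades the flat buffer's up-front `max_seq` allocation for on-demand blocks. A sketch under the same assumptions (class name mirrors the table above; the real constructor may differ):
```python
import numpy as np

BLOCK_SIZE, HEADS, DIM = 16, 8, 64   # illustrative dimensions

class KVCacheBlock:
    """One fixed-size block: [block_size, heads, dim] each for K and V."""
    def __init__(self):
        self.k = np.zeros((BLOCK_SIZE, HEADS, DIM), dtype=np.float16)
        self.v = np.zeros((BLOCK_SIZE, HEADS, DIM), dtype=np.float16)

def blocks_needed(seq_len: int) -> int:
    # Ceil-divide: a 100-token sequence occupies 7 blocks (112 slots),
    # wasting 12 slots instead of a full max_seq pre-allocation.
    return -(-seq_len // BLOCK_SIZE)

blocks = [KVCacheBlock() for _ in range(blocks_needed(100))]
print(len(blocks))   # 7
```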
### 3. Key Algorithms
- **Incremental decoding**: Computes Q only for the new token, reuses cached K,V (sketched after this list)
- **Causal masking**: Prevents attending to future tokens
- **Attention**: `softmax(Q × K^T / √d) × V`
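A single-head sketch of one cached decode step, combining all three points (the real `MultiHeadAttention` applies this per head with the cache integrated):
```python
import numpy as np

def decode_step(q_new, k_cache, v_cache, k_new, v_new):
    """One cached decode step for a single head.

    q_new, k_new, v_new: [dim] vectors for the newly generated token.
    k_cache, v_cache:    [S, dim] arrays of past keys/values.
    """
    k = np.vstack([k_cache, k_new])        # append to cache: [S+1, dim]
    v = np.vstack([v_cache, v_new])
    d = q_new.shape[-1]
    scores = (k @ q_new) / np.sqrt(d)      # [S+1]; O(S) work, not O(S^2)
    # Causal masking is implicit here: the cache holds only past tokens.
    w = np.exp(scores - scores.max())
    w /= w.sum()
    return w @ v, k, v                     # context vector + updated cache

k_cache = np.random.randn(5, 64)
v_cache = np.random.randn(5, 64)
q = k_new = v_new = np.random.randn(64)
out, k_cache, v_cache = decode_step(q, k_cache, v_cache, k_new, v_new)
```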
### 4. Memory Analysis Results
| Sequence Length | Prefill Cost (attention ops, S²) | Cached Decode (ops/token, S) | Speedup |
|----------------|----------------------------------|------------------------------|---------|
| 4096 | 16,777,216 | 4,096 | **4,096x** |
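The table's numbers follow directly from counting attention-score operations:
```python
S = 4096
prefill_ops = S * S   # recompute all pairwise attention scores
cached_ops = S        # one new query against S cached keys
print(prefill_ops, cached_ops, prefill_ops // cached_ops)
# 16777216 4096 4096  -> the 4,096x speedup in the table
```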
### 5. Optimizations Proposed
1. **Paged Attention**: Block-based allocation (16 tokens/block) reduces memory waste from pre-allocation
2. **Quantization**: FP16→INT8 gives 2x compression, INT4 gives 4x (see the sketch after this list)
3. **Chunked Attention** (Flash Attention style): 256x memory reduction for attention scores
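One way to realize the INT8 option is per-tensor symmetric quantization; this is a common scheme chosen for illustration, not necessarily the one proposed in the repo:
```python
import numpy as np

def quantize_int8(x):
    """Per-tensor symmetric INT8 quantization of a KV tensor."""
    scale = np.abs(x).max() / 127.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q, scale):
    return q.astype(np.float16) * np.float16(scale)

kv = np.random.randn(16, 8, 64).astype(np.float16)
q, scale = quantize_int8(kv)
print(kv.nbytes, q.nbytes)   # 16384 vs 8192 bytes: the 2x compression
```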
### 6. GPU Execution Mapping
- HBM → Shared Memory → Registers hierarchy
- KV-cache loaded from HBM per decode step
- Tensor cores for GEMM operations
- Critical bottlenecks: memory bandwidth and O(S²) attention (rough arithmetic below)
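Back-of-envelope arithmetic for the bandwidth bottleneck, using illustrative dimensions for a roughly 7B-class model (assumptions, not the `kv_cache.py` configuration):
```python
layers, heads, dim, seq = 32, 32, 128, 4096
fp16_bytes = 2
kv_bytes = layers * seq * 2 * heads * dim * fp16_bytes   # K and V
print(f"{kv_bytes / 2**30:.1f} GiB streamed from HBM per decode step")
# ~2.0 GiB/token: at ~2 TB/s of HBM bandwidth, cache traffic alone caps
# decode below ~1,000 tokens/s -- the memory-bandwidth bound named above.
```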
Run with:
```bash
python kv_cache.py
```