feat: add model comparisons and sanitize session files

- Rename gamma to glm5 and model to minimax-m2.7
- Add model_comparison/ directory with head-to-head analyses
- Sanitize all session.jsonl files: remove absolute paths and usernames
- Remove __pycache__ artifacts
- Add .gitignore
commit 8e72eef09c
2026-04-23 11:16:01 +02:00
62 changed files with 18469 additions and 0 deletions
@@ -0,0 +1,56 @@
The KV-cache system is fully implemented and working. Here's a summary:
## Implementation Summary
### 1. Data Structures (`kv_cache.py`)
| Class | Purpose |
|-------|---------|
| `CacheConfig` | Configuration for cache dimensions and memory format |
| `KVCacheBlock` | Single block for paged attention (fixed-size memory unit) |
| `PagedKVCache` | Block-wise KV storage with on-demand allocation |
| `FlatKVCache` | Pre-allocated flat storage for max throughput |
| `MultiHeadAttention` | Attention with integrated cache read/write |
| `TransformerBlock` | Single layer with pre-norm and FFN |
| `BatchedInferenceEngine` | Manages variable-length batch sequences |
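The table above lists the main classes; as a minimal sketch (not the actual definitions in `kv_cache.py`, which may differ), `CacheConfig` and `FlatKVCache` could look roughly like this, assuming a NumPy backing array in the flat layout shown in the next section:
```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CacheConfig:
    num_layers: int
    num_heads: int
    head_dim: int
    max_batch: int
    max_seq_len: int
    dtype: type = np.float16  # memory format of cached K/V

class FlatKVCache:
    """Pre-allocated [layers, batch, seq, 2, heads, dim] buffer (flat format)."""
    def __init__(self, cfg: CacheConfig):
        self.cfg = cfg
        self.buf = np.zeros(
            (cfg.num_layers, cfg.max_batch, cfg.max_seq_len, 2,
             cfg.num_heads, cfg.head_dim),
            dtype=cfg.dtype,
        )
        self.lengths = np.zeros(cfg.max_batch, dtype=np.int64)  # cached tokens per sequence

    def append(self, layer: int, seq_idx: int, k: np.ndarray, v: np.ndarray) -> None:
        """Write the newest token's K/V ([heads, dim] each) at the next free slot."""
        pos = self.lengths[seq_idx]
        self.buf[layer, seq_idx, pos, 0] = k
        self.buf[layer, seq_idx, pos, 1] = v
        if layer == self.cfg.num_layers - 1:  # count the step once all layers have written
            self.lengths[seq_idx] += 1
```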
### 2. Memory Layout
```
[layers, batch, seq, 2, heads, dim] # Flat format
[block_size, heads, dim] × N blocks # Paged format
```
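For the paged format, a rough sketch of how fixed-size blocks could be allocated on demand and addressed through a per-sequence block table (block size, pool size, and head dimensions here are illustrative, not the repo's values):
```python
import numpy as np

BLOCK_SIZE, NUM_HEADS, HEAD_DIM, NUM_BLOCKS = 16, 8, 64, 1024  # illustrative sizes

# Pools of fixed-size blocks, one for keys and one for values: [block_size, heads, dim] × N.
k_pool = np.zeros((NUM_BLOCKS, BLOCK_SIZE, NUM_HEADS, HEAD_DIM), dtype=np.float16)
v_pool = np.zeros_like(k_pool)
free_blocks = list(range(NUM_BLOCKS))    # blocks handed out on demand
block_tables: dict[int, list[int]] = {}  # sequence id -> ordered block indices

def slot_for(seq_id: int, position: int) -> tuple[int, int]:
    """Map a token position to (block index, offset), allocating blocks lazily."""
    table = block_tables.setdefault(seq_id, [])
    while position // BLOCK_SIZE >= len(table):
        table.append(free_blocks.pop())  # on-demand allocation
    return table[position // BLOCK_SIZE], position % BLOCK_SIZE

# Usage: write K/V for token 17 of sequence 0.
blk, off = slot_for(0, 17)
k_pool[blk, off] = np.random.randn(NUM_HEADS, HEAD_DIM).astype(np.float16)
```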
### 3. Key Algorithms
- **Incremental decoding**: computes projections only for the new token and reuses cached K, V for all earlier positions (sketched below)
- **Causal masking**: Prevents attending to future tokens
- **Attention**: `softmax(Q × K^T / √d) × V`
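A minimal single-head sketch of the cached decode step, assuming the new token's K and V were already appended to the cache so the causal constraint reduces to attending only over the first `cache_len` cached positions:
```python
import numpy as np

def decode_step_attention(q_new: np.ndarray, k_cache: np.ndarray,
                          v_cache: np.ndarray, cache_len: int) -> np.ndarray:
    """Attention for one new token against cached K/V (single head).

    q_new:   [dim]           query of the newest token
    k_cache: [max_seq, dim]  cached keys; first cache_len rows are valid
    v_cache: [max_seq, dim]  cached values
    """
    d = q_new.shape[-1]
    k = k_cache[:cache_len]              # reuse cached keys, no recomputation
    v = v_cache[:cache_len]
    scores = k @ q_new / np.sqrt(d)      # one row of softmax(Q × K^T / √d)
    scores -= scores.max()               # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum()
    return weights @ v                   # [dim]
```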
### 4. Memory Analysis Results
| Sequence Length | Prefill (attention scores) | Cached Decode (scores/step) | Speedup |
|----------------|----------------------------|-----------------------------|---------|
| 4096 | 16,777,216 | 4,096 | **4,096x** |
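The table counts attention-score computations per head per layer; the arithmetic behind it, plus the linear cache-size growth that motivates the optimizations below (model dimensions in the size estimate are illustrative assumptions):
```python
# Score counts behind the table (per head, per layer).
S = 4096
prefill_scores = S * S        # full self-attention over the prompt: 16,777,216
decode_scores = S             # one new query against all cached keys: 4,096
print(prefill_scores // decode_scores)  # 4096x fewer scores per decode step

# KV-cache memory grows linearly with sequence length:
#   bytes = 2 (K and V) × layers × heads × head_dim × seq_len × bytes/element
layers, heads, head_dim, bytes_per_elem = 32, 32, 128, 2  # assumed FP16 config
for seq_len in (1024, 4096, 16384):
    gb = 2 * layers * heads * head_dim * seq_len * bytes_per_elem / 1e9
    print(f"seq {seq_len:>6}: {gb:.2f} GB per sequence")
```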
### 5. Optimizations Proposed
1. **Paged Attention**: Block-based allocation (16 tokens/block) reduces memory waste from pre-allocation
2. **Quantization**: FP16→INT8 = 2x compression, INT4 = 4x
3. **Chunked Attention** (Flash Attention style): 256x memory reduction for attention scores
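As an illustration of the quantization option (item 2 above), a sketch of symmetric per-tensor INT8 quantization for cached K/V, which halves storage relative to FP16; the actual code may instead use per-channel scales or INT4 packing:
```python
import numpy as np

def quantize_int8(x: np.ndarray) -> tuple[np.ndarray, float]:
    """Symmetric per-tensor INT8 quantization: 2x smaller than FP16 storage."""
    scale = max(float(np.abs(x).max()) / 127.0, 1e-8)
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize_int8(q: np.ndarray, scale: float) -> np.ndarray:
    """Recover approximate FP16 values before the attention matmul."""
    return (q.astype(np.float32) * scale).astype(np.float16)

# Round-trip example on a fake K tensor.
k = np.random.randn(16, 8, 64).astype(np.float16)
k_q, s = quantize_int8(k)
print(k.nbytes, k_q.nbytes)   # 16384 vs 8192 bytes: 2x compression
```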
### 6. GPU Execution Mapping
- HBM → Shared Memory → Registers hierarchy
- KV-cache loaded from HBM per decode step
- Tensor cores for GEMM operations
- Critical bottlenecks: memory bandwidth, O(S²) attention
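A back-of-envelope for the memory-bandwidth bottleneck, reusing the assumed 32-layer FP16 configuration from the growth estimate above (the ~1 TB/s HBM figure is likewise illustrative, not a measured result):
```python
# Each decode step streams the whole KV-cache from HBM, so bytes/step caps tokens/s.
layers, heads, head_dim, seq_len, bytes_per_elem = 32, 32, 128, 4096, 2  # assumed
kv_bytes = 2 * layers * heads * head_dim * seq_len * bytes_per_elem      # K and V
hbm_bytes_per_s = 1.0e12                                                 # ~1 TB/s, assumed
print(f"{kv_bytes / 1e9:.2f} GB per step -> at most {hbm_bytes_per_s / kv_bytes:.0f} tokens/s")
```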
Run with:
```bash
python kv_cache.py
```
@@ -0,0 +1,19 @@
Implement an efficient KV-cache system for autoregressive transformer inference from scratch.
Requirements:
1. Support incremental decoding (one token at a time).
2. Avoid recomputing attention for past tokens.
3. Handle:
- multi-head attention
- batching with variable sequence lengths
4. Provide:
- data structure layout (memory format)
- update logic per step
- attention computation using cached keys/values
Additionally:
- Analyze memory growth over long sequences.
- Propose at least two optimizations (e.g., paged attention, chunking, compression).
- Explain how this would map to GPU execution.
Do not use any frameworks.
File diff suppressed because it is too large
File diff suppressed because one or more lines are too long