The KV-cache system is fully implemented and working. Here's a summary:

Implementation Summary

1. Data Structures (kv_cache.py)

| Class | Purpose |
|---|---|
| CacheConfig | Configuration for cache dimensions and memory format |
| KVCacheBlock | Single block for paged attention (fixed-size memory unit) |
| PagedKVCache | Block-wise KV storage with on-demand allocation |
| FlatKVCache | Pre-allocated flat storage for maximum throughput |
| MultiHeadAttention | Attention with integrated cache read/write |
| TransformerBlock | Single transformer layer with pre-norm and FFN |
| BatchedInferenceEngine | Manages variable-length batch sequences |
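
To make the table concrete, here is a minimal sketch of what the first two classes could look like. The field names and defaults are illustrative assumptions, not the actual kv_cache.py definitions.

```python
# Illustrative sketch of the config and block classes; field names and
# defaults are assumptions, not the actual kv_cache.py definitions.
from dataclasses import dataclass

import numpy as np


@dataclass
class CacheConfig:
    num_layers: int = 4
    num_heads: int = 8
    head_dim: int = 64
    block_size: int = 16          # tokens per block (paged format)
    dtype: type = np.float16


class KVCacheBlock:
    """One fixed-size unit of paged KV storage: block_size tokens of K and V."""

    def __init__(self, config: CacheConfig):
        shape = (config.block_size, config.num_heads, config.head_dim)
        self.keys = np.zeros(shape, dtype=config.dtype)
        self.values = np.zeros(shape, dtype=config.dtype)
        self.num_filled = 0       # tokens written into this block so far


block = KVCacheBlock(CacheConfig())
print(block.keys.shape)           # (16, 8, 64)
```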

2. Memory Layout

```
[layers, batch, seq, 2, heads, dim]    # Flat format
[block_size, heads, dim] × N blocks    # Paged format
```
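
As a sketch of the flat format: the whole tensor is allocated up front at the maximum sequence length, and each new token writes its K and V into one slot. The dimensions and the K/V index convention below are illustrative.

```python
# Allocating the flat layout [layers, batch, seq, 2, heads, dim] and
# writing one token's K/V into it; dimensions are illustrative.
import numpy as np

layers, batch, max_seq, heads, dim = 4, 2, 128, 8, 64
flat_cache = np.zeros((layers, batch, max_seq, 2, heads, dim), dtype=np.float16)

# Index 0 on the "2" axis holds K, index 1 holds V (one convention; the
# actual kv_cache.py ordering may differ).
layer, seq_id, pos = 0, 0, 5
flat_cache[layer, seq_id, pos, 0] = np.random.randn(heads, dim).astype(np.float16)
flat_cache[layer, seq_id, pos, 1] = np.random.randn(heads, dim).astype(np.float16)

print(f"{flat_cache.nbytes / 2**20:.1f} MiB pre-allocated")  # 2.0 MiB
```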

3. Key Algorithms

  • Incremental decoding: computes Q only for the new token and reuses cached K, V
  • Causal masking: prevents attending to future tokens
  • Attention: softmax(Q × K^T / √d) × V (a decode-step sketch follows this list)
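
Here is a minimal decode-step sketch of the attention formula above, standalone NumPy rather than the kv_cache.py classes: only the new token's Q is computed, and K, V come from the cache. The causal mask is implicit in decode, since the cache holds only past tokens.

```python
# One cached decode step: softmax(Q x K^T / sqrt(d)) x V over cached K, V.
# A standalone sketch, not the kv_cache.py implementation.
import numpy as np

def decode_step(q_new, k_cache, v_cache):
    """q_new: [heads, dim]; k_cache, v_cache: [seq, heads, dim]."""
    d = q_new.shape[-1]
    # scores[h, s] = q_new[h] . k_cache[s, h] / sqrt(d)
    scores = np.einsum("hd,shd->hs", q_new, k_cache) / np.sqrt(d)
    # No explicit causal mask needed here: the cache contains only past tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hs,shd->hd", weights, v_cache)

heads, dim, seq = 8, 64, 32
out = decode_step(np.random.randn(heads, dim),
                  np.random.randn(seq, heads, dim),
                  np.random.randn(seq, heads, dim))
print(out.shape)  # (8, 64)
```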

4. Memory Analysis Results

| Sequence Length | Prefill Cost | Cached Decode | Speedup |
|---|---|---|---|
| 4096 | 16,777,216 | 4,096 | 4,096x |
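
These numbers follow from a simple quadratic-vs-linear cost model (my reading of the table): prefill computes all S² attention scores, while cached decode computes only S scores for the new token.

```python
# Reproducing the table under the assumed cost model: prefill computes
# S^2 attention scores, cached decode only S per new token.
S = 4096
prefill_cost = S * S                      # 16,777,216
decode_cost = S                           # 4,096
print(f"{prefill_cost // decode_cost}x")  # 4096x
```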

5. Optimizations Proposed

  1. Paged Attention: Block-based allocation (16 tokens/block) reduces memory waste from pre-allocation

  2. Quantization: FP16→INT8 gives 2x compression, INT4 gives 4x (an INT8 sketch follows this list)

  3. Chunked Attention (Flash Attention style): 256x memory reduction for attention scores
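
As a sketch of the quantization idea, here is per-tensor symmetric INT8 quantization of a K tensor. The scheme (symmetric, per-tensor) is an illustrative choice; the proposal above does not pin one down.

```python
# Per-tensor symmetric INT8 quantization of a K cache tensor: 2x smaller
# than FP16. The scheme is an illustrative assumption.
import numpy as np

k_fp16 = np.random.randn(128, 8, 64).astype(np.float16)
scale = float(np.abs(k_fp16).max()) / 127.0
k_int8 = np.clip(np.round(k_fp16 / scale), -127, 127).astype(np.int8)
k_dequant = k_int8.astype(np.float16) * scale

print(f"{k_fp16.nbytes / k_int8.nbytes:.0f}x compression")  # 2x
print(f"max abs error: {float(np.abs(k_fp16 - k_dequant).max()):.4f}")
```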

6. GPU Execution Mapping

  • HBM → Shared Memory → Registers hierarchy
  • KV-cache loaded from HBM per decode step
  • Tensor cores for GEMM operations
  • Critical bottlenecks: memory bandwidth and the O(S²) attention cost (see the estimate below)
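
A back-of-envelope estimate of why bandwidth dominates: every decode step re-reads the entire KV cache from HBM. The model dimensions below are assumed for illustration, not taken from kv_cache.py.

```python
# Bytes of KV cache read from HBM per decode step; dimensions assumed.
layers, heads, dim, seq_len, fp16_bytes = 32, 32, 128, 4096, 2
kv_bytes = layers * seq_len * 2 * heads * dim * fp16_bytes  # K and V
print(f"{kv_bytes / 2**30:.1f} GiB per decode step")        # 2.0 GiB
```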

Run with:

```
python kv_cache.py
```