The KV-cache system is fully implemented and working. Here's a summary:

Implementation Summary

1. Data Structures (kv_cache.py)

| Class | Purpose |
|---|---|
| CacheConfig | Configuration for cache dimensions and memory format |
| KVCacheBlock | Single block for paged attention (fixed-size memory unit) |
| PagedKVCache | Block-wise KV storage with on-demand allocation |
| FlatKVCache | Pre-allocated flat storage for maximum throughput |
| MultiHeadAttention | Attention with integrated cache read/write |
| TransformerBlock | Single transformer layer with pre-norm and FFN |
| BatchedInferenceEngine | Manages variable-length batch sequences |
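
To make the table concrete, here is a minimal sketch of what the first two classes could look like. The field names and defaults are illustrative assumptions, not the actual kv_cache.py definitions.

```python
# Illustrative sketch of the config and block classes; field names and
# defaults are assumptions, not the actual kv_cache.py definitions.
from dataclasses import dataclass

import numpy as np


@dataclass
class CacheConfig:
    num_layers: int = 4
    num_heads: int = 8
    head_dim: int = 64
    block_size: int = 16          # tokens per block (paged format)
    dtype: type = np.float16


class KVCacheBlock:
    """One fixed-size unit of paged KV storage: block_size tokens of K and V."""

    def __init__(self, config: CacheConfig):
        shape = (config.block_size, config.num_heads, config.head_dim)
        self.keys = np.zeros(shape, dtype=config.dtype)
        self.values = np.zeros(shape, dtype=config.dtype)
        self.num_filled = 0       # tokens written into this block so far


block = KVCacheBlock(CacheConfig())
print(block.keys.shape)           # (16, 8, 64)
```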

2. Memory Layout

```
[layers, batch, seq, 2, heads, dim]    # Flat format
[block_size, heads, dim] × N blocks    # Paged format
```
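
As a sketch of the flat format: the whole tensor is allocated up front at the maximum sequence length, and each new token writes its K and V into one slot. The dimensions and the K/V index convention below are illustrative.

```python
# Allocating the flat layout [layers, batch, seq, 2, heads, dim] and
# writing one token's K/V into it; dimensions are illustrative.
import numpy as np

layers, batch, max_seq, heads, dim = 4, 2, 128, 8, 64
flat_cache = np.zeros((layers, batch, max_seq, 2, heads, dim), dtype=np.float16)

# Index 0 on the "2" axis holds K, index 1 holds V (one convention; the
# actual kv_cache.py ordering may differ).
layer, seq_id, pos = 0, 0, 5
flat_cache[layer, seq_id, pos, 0] = np.random.randn(heads, dim).astype(np.float16)
flat_cache[layer, seq_id, pos, 1] = np.random.randn(heads, dim).astype(np.float16)

print(f"{flat_cache.nbytes / 2**20:.1f} MiB pre-allocated")  # 2.0 MiB
```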

3. Key Algorithms

  • Incremental decoding: computes Q only for the new token and reuses cached K, V
  • Causal masking: prevents attending to future tokens
  • Attention: softmax(Q × K^T / √d) × V (a decode-step sketch follows this list)
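
Here is a minimal decode-step sketch of the attention formula above, standalone NumPy rather than the kv_cache.py classes: only the new token's Q is computed, and K, V come from the cache. The causal mask is implicit in decode, since the cache holds only past tokens.

```python
# One cached decode step: softmax(Q x K^T / sqrt(d)) x V over cached K, V.
# A standalone sketch, not the kv_cache.py implementation.
import numpy as np

def decode_step(q_new, k_cache, v_cache):
    """q_new: [heads, dim]; k_cache, v_cache: [seq, heads, dim]."""
    d = q_new.shape[-1]
    # scores[h, s] = q_new[h] . k_cache[s, h] / sqrt(d)
    scores = np.einsum("hd,shd->hs", q_new, k_cache) / np.sqrt(d)
    # No explicit causal mask needed here: the cache contains only past tokens.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return np.einsum("hs,shd->hd", weights, v_cache)

heads, dim, seq = 8, 64, 32
out = decode_step(np.random.randn(heads, dim),
                  np.random.randn(seq, heads, dim),
                  np.random.randn(seq, heads, dim))
print(out.shape)  # (8, 64)
```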

4. Memory Analysis Results

| Sequence Length | Prefill Cost | Cached Decode | Speedup |
|---|---|---|---|
| 4096 | 16,777,216 | 4,096 | 4,096x |
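
These numbers follow from a simple quadratic-vs-linear cost model (my reading of the table): prefill computes all S² attention scores, while cached decode computes only S scores for the new token.

```python
# Reproducing the table under the assumed cost model: prefill computes
# S^2 attention scores, cached decode only S per new token.
S = 4096
prefill_cost = S * S                      # 16,777,216
decode_cost = S                           # 4,096
print(f"{prefill_cost // decode_cost}x")  # 4096x
```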

5. Optimizations Proposed

  1. Paged Attention: Block-based allocation (16 tokens/block) reduces memory waste from pre-allocation

  2. Quantization: FP16→INT8 gives 2x compression, INT4 gives 4x (an INT8 sketch follows this list)

  3. Chunked Attention (Flash Attention style): 256x memory reduction for attention scores
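
As a sketch of the quantization idea, here is per-tensor symmetric INT8 quantization of a K tensor. The scheme (symmetric, per-tensor) is an illustrative choice; the proposal above does not pin one down.

```python
# Per-tensor symmetric INT8 quantization of a K cache tensor: 2x smaller
# than FP16. The scheme is an illustrative assumption.
import numpy as np

k_fp16 = np.random.randn(128, 8, 64).astype(np.float16)
scale = float(np.abs(k_fp16).max()) / 127.0
k_int8 = np.clip(np.round(k_fp16 / scale), -127, 127).astype(np.int8)
k_dequant = k_int8.astype(np.float16) * scale

print(f"{k_fp16.nbytes / k_int8.nbytes:.0f}x compression")  # 2x
print(f"max abs error: {float(np.abs(k_fp16 - k_dequant).max()):.4f}")
```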

6. GPU Execution Mapping

  • HBM → Shared Memory → Registers hierarchy
  • KV-cache loaded from HBM per decode step
  • Tensor cores for GEMM operations
  • Critical bottlenecks: memory bandwidth and the O(S²) attention cost (see the estimate below)
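
A back-of-envelope estimate of why bandwidth dominates: every decode step re-reads the entire KV cache from HBM. The model dimensions below are assumed for illustration, not taken from kv_cache.py.

```python
# Bytes of KV cache read from HBM per decode step; dimensions assumed.
layers, heads, dim, seq_len, fp16_bytes = 32, 32, 128, 4096, 2
kv_bytes = layers * seq_len * 2 * heads * dim * fp16_bytes  # K and V
print(f"{kv_bytes / 2**30:.1f} GiB per decode step")        # 2.0 GiB
```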

Run with:

```
python kv_cache.py
```