Implement an efficient KV-cache system for autoregressive transformer inference from scratch.
Requirements:
- Support incremental decoding (one token at a time).
- Avoid recomputing attention for past tokens.
- Handle:
  - multi-head attention
  - batching with variable sequence lengths
- Provide (a minimal sketch follows this list):
  - data structure layout (memory format)
  - update logic per step
  - attention computation using cached keys/values
Additionally:
- Analyze memory growth over long sequences (see the worked example after this list).
- Propose at least two optimizations (e.g., paged attention, chunking, compression); a paged-attention sketch appears at the end.
- Explain how this would map to GPU execution.
Do not use any frameworks.