8e72eef09c
- Rename gamma to glm5 and model to minimax-m2.7 - Add model_comparison/ directory with head-to-head analyses - Sanitize all session.jsonl files: remove absolute paths and usernames - Remove __pycache__ artifacts - Add .gitignore
676 B
676 B
Design and implement a high-performance fused softmax + top-k kernel in CUDA (or CUDA-like pseudocode).
Requirements:
- Input: logits [B, T, V]
- Output:
- top-k indices per (B, T)
- top-k probabilities (after softmax)
Constraints:
- Do NOT materialize the full softmax matrix in global memory.
- Must be numerically stable (log-sum-exp).
- Minimize global memory reads/writes.
- Use shared memory where appropriate.
- Handle large V (e.g., 50k+) efficiently.
Deliver:
- Kernel pseudocode or CUDA code
- Memory access pattern explanation
- Warp-level optimization strategy
- Complexity analysis (bandwidth vs compute bound)
- Comparison to naive implementation