Design and implement a high-performance fused softmax + top-k kernel in CUDA (or CUDA-like pseudocode). Requirements: - Input: logits [B, T, V] - Output: - top-k indices per (B, T) - top-k probabilities (after softmax) Constraints: 1. Do NOT materialize the full softmax matrix in global memory. 2. Must be numerically stable (log-sum-exp). 3. Minimize global memory reads/writes. 4. Use shared memory where appropriate. 5. Handle large V (e.g., 50k+) efficiently. Deliver: - Kernel pseudocode or CUDA code - Memory access pattern explanation - Warp-level optimization strategy - Complexity analysis (bandwidth vs compute bound) - Comparison to naive implementation