decode-opt: fuse 6-expert gateup into one high-occupancy kernel #4
At M=1 decode, both NVFP4 and IQ2_XXS achieve only ~112-124 GB/s (measured via DS4_CUDA_MOE_PROFILE), well below the ~210 GB/s hardware ceiling. The bottleneck is occupancy: 6 experts × 1 token generates too little parallel work for 48 SMs, so most SMs sit idle while a few stream weights.
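To make the gap concrete, the measured figures above can be expressed as a fraction of the ~210 GB/s ceiling. This is only a small arithmetic sketch; the helper names `ceiling_fraction` and `headroom_factor` are hypothetical and not part of the codebase.

```cpp
#include <cassert>

// Hypothetical helpers, for illustration only.

// Fraction of the bandwidth ceiling reached by a measured throughput.
double ceiling_fraction(double measured_gbps, double ceiling_gbps) {
    assert(measured_gbps > 0.0 && ceiling_gbps > 0.0);
    return measured_gbps / ceiling_gbps;
}

// Speedup that would be available if the kernel reached the ceiling.
double headroom_factor(double measured_gbps, double ceiling_gbps) {
    assert(measured_gbps > 0.0 && ceiling_gbps > 0.0);
    return ceiling_gbps / measured_gbps;
}
```

With the numbers quoted above, `ceiling_fraction(112.0, 210.0)` is about 0.53 and `ceiling_fraction(124.0, 210.0)` is about 0.59, so roughly 40-47% of the available bandwidth is unused at M=1. For NVFP4, `headroom_factor(124.0, 210.0)` is about 1.69.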
The current decode kernel (moe_gate_up_mid_decode_lut_nvfp4_kernel) processes 6 experts sequentially within a single block, with 8 threads (quarter-warp) per expert. A fused kernel that launches 6× more blocks (one per expert, or distributing rows across more blocks) would fill more SMs and raise effective bandwidth toward the ceiling.
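One way to picture the "distribute rows across more blocks" option is the host-side index arithmetic below. It is a sketch under assumptions, not the actual kernel: `fused_grid_blocks`, `block_work`, and `BlockWork` are hypothetical names, and the row counts in the usage note are illustrative rather than taken from the model.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical description of the work assigned to one block.
struct BlockWork {
    int expert;     // which of the active experts this block handles
    int row_begin;  // first output row (inclusive)
    int row_end;    // last output row (exclusive)
};

// Number of row tiles per expert (ceiling division).
int tiles_per_expert(int rows_per_expert, int rows_per_block) {
    assert(rows_per_expert > 0 && rows_per_block > 0);
    return (rows_per_expert + rows_per_block - 1) / rows_per_block;
}

// Total blocks in a fused launch: one block per (expert, row tile).
int fused_grid_blocks(int num_experts, int rows_per_expert, int rows_per_block) {
    assert(num_experts > 0);
    return num_experts * tiles_per_expert(rows_per_expert, rows_per_block);
}

// Decode a linear block index (as blockIdx.x would be inside a kernel)
// into the expert and row range that block should process.
BlockWork block_work(int block_id, int rows_per_expert, int rows_per_block) {
    const int tiles = tiles_per_expert(rows_per_expert, rows_per_block);
    assert(block_id >= 0);
    const int expert = block_id / tiles;
    const int tile = block_id % tiles;
    const int begin = tile * rows_per_block;
    const int end = std::min(begin + rows_per_block, rows_per_expert);
    return BlockWork{expert, begin, end};
}
```

For example, assuming 2048 rows per expert and 256 rows per block, `fused_grid_blocks(6, 2048, 256)` gives 48 blocks, which is one per SM on the 48-SM part described above. The tile size would need tuning against the real row count and per-block register and shared-memory use.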
This benefits both NVFP4 and IQ2_XXS, since the occupancy limit is format-agnostic. It especially helps NVFP4, which has more headroom (124 → 210 GB/s, versus IQ2_XXS at 112 against its 58.6 compute ceiling).
Acceptance