decode-opt: fuse 6-expert gateup into one high-occupancy kernel #4

Open
opened 2026-06-23 09:55:16 +02:00 by sleepy · 0 comments
Owner

At M=1 decode, both NVFP4 and IQ2_XXS achieve only ~112-124 GB/s (measured via DS4_CUDA_MOE_PROFILE), well below the ~210 GB/s hardware ceiling. The bottleneck is occupancy: 6 experts × 1 token generates too little parallel work for 48 SMs, so most SMs sit idle while a few stream weights.

The current decode kernel (moe_gate_up_mid_decode_lut_nvfp4_kernel) processes 6 experts sequentially within a single block, with 8 threads (quarter-warp) per expert. A fused kernel that launches 6× more blocks (one per expert, or distributing rows across more blocks) would fill more SMs and raise effective bandwidth toward the ceiling.
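As a rough way to reason about the "6× more blocks" idea, here is a minimal host-side sketch in Python. It models only grid sizing and block-to-work mapping, not the kernel itself. The SM count and expert count come from this issue. `rows_per_expert`, `rows_per_block`, and all function names are hypothetical and not taken from the repository. The coverage figure only says how many SMs can receive at least one block, which is a weaker statement than true occupancy.

```python
from math import ceil

NUM_SMS = 48      # from the issue
NUM_EXPERTS = 6   # experts per token at M=1, from the issue


def current_grid(rows_per_expert: int, rows_per_block: int) -> int:
    """Blocks launched when each block loops over all experts itself
    (assumed model of the current kernel: one block per row tile)."""
    return ceil(rows_per_expert / rows_per_block)


def fused_grid(rows_per_expert: int, rows_per_block: int) -> int:
    """Blocks launched when every (expert, row tile) pair gets its own block."""
    return NUM_EXPERTS * current_grid(rows_per_expert, rows_per_block)


def block_to_work(block_id: int, rows_per_expert: int, rows_per_block: int):
    """Map a flat block index to (expert, first_row, end_row_exclusive)."""
    tiles = current_grid(rows_per_expert, rows_per_block)
    expert, tile = divmod(block_id, tiles)
    first = tile * rows_per_block
    return expert, first, min(first + rows_per_block, rows_per_expert)


def sm_coverage(num_blocks: int) -> float:
    """Fraction of SMs that can be handed at least one block."""
    return min(num_blocks, NUM_SMS) / NUM_SMS


if __name__ == "__main__":
    rows, tile_rows = 2048, 512  # hypothetical shapes, for illustration only
    for name, grid in (("current", current_grid(rows, tile_rows)),
                       ("fused", fused_grid(rows, tile_rows))):
        print(f"{name}: {grid} blocks, SM coverage {sm_coverage(grid):.0%}")
```

With these made-up shapes the per-expert split takes the launch from 4 to 24 blocks, which still covers only half of the 48 SMs. Under this model the split would need to be combined with smaller row tiles to reach every SM.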

This benefits both NVFP4 and IQ2_XXS, because the occupancy limit is format-agnostic. It helps NVFP4 most, since NVFP4 has more headroom (124 → 210 GB/s, vs IQ2_XXS 112 → 58.6 compute ceiling).

Acceptance

  • verify.sh passes (math=Four, coherence=story, factual=Paris)
  • MoE profiler shows gateup time drops (target: <0.35 ms/layer from 0.457 ms for NVFP4)
  • Decode t/s improves at M=1 (target: >12 t/s for K180 at medium prompt)
  • RAM neutral
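To put the gateup target in bandwidth terms, a small back-of-the-envelope calculation can help. It assumes that the 0.457 ms/layer NVFP4 figure corresponds to the ~124 GB/s measurement and that the bytes streamed per layer stay constant; the issue does not state either explicitly. The function names are hypothetical.

```python
def required_bandwidth(bw_now_gbs: float, t_now_ms: float, t_target_ms: float) -> float:
    """Bandwidth needed to hit t_target_ms if the bytes moved are unchanged."""
    return bw_now_gbs * t_now_ms / t_target_ms


def time_at_bandwidth(bw_now_gbs: float, t_now_ms: float, bw_new_gbs: float) -> float:
    """Time per layer at bw_new_gbs if the bytes moved are unchanged."""
    return t_now_ms * bw_now_gbs / bw_new_gbs


if __name__ == "__main__":
    # Figures from the issue: 124 GB/s, 0.457 ms now, 0.35 ms target, 210 GB/s ceiling.
    needed = required_bandwidth(124.0, 0.457, 0.35)
    floor = time_at_bandwidth(124.0, 0.457, 210.0)
    print(f"bandwidth needed for 0.35 ms: {needed:.1f} GB/s")
    print(f"time per layer at 210 GB/s:   {floor:.3f} ms")
```

Under those assumptions the 0.35 ms target needs roughly 162 GB/s, about 77% of the 210 GB/s ceiling, and the best case at the ceiling would be about 0.27 ms/layer.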
Reference
sleepy/ds4-nvfp4-spark#4