decode-opt: fuse 6-expert gateup into one high-occupancy kernel #4


Open

opened 2026-06-23 09:55:16 +02:00 by sleepy · 0 comments

sleepy (Owner) commented 2026-06-23 09:55:16 +02:00


At M=1 decode, both NVFP4 and IQ2_XXS achieve only ~112-124 GB/s (measured via DS4_CUDA_MOE_PROFILE), well below the ~210 GB/s hardware ceiling. The bottleneck is occupancy: 6 experts × 1 token generates too little parallel work for 48 SMs, so most SMs sit idle while a few stream weights.

The current decode kernel (moe_gate_up_mid_decode_lut_nvfp4_kernel) processes 6 experts sequentially within a single block, with 8 threads (quarter-warp) per expert. A fused kernel that launches 6× more blocks (one per expert, or distributing rows across more blocks) would fill more SMs and raise effective bandwidth toward the ceiling.

This benefits BOTH NVFP4 and IQ2_XXS: the occupancy limit is format-agnostic. But it especially helps NVFP4, which has more headroom (124 → 210 vs IQ2 112 → 58.6 compute ceiling).

## Acceptance

- verify.sh passes (math=Four, coherence=story, factual=Paris)
- MoE profiler shows gateup time drops (target: <0.35 ms/layer from 0.457 ms for NVFP4)
- Decode t/s improves at M=1 (target: >12 t/s for K180 at medium prompt)
- RAM neutral
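
As a rough sanity check on the NVFP4 gateup target, the figures above can be related by simple arithmetic. This is only a sketch. It assumes the 0.457 ms/layer measurement corresponds to the ~124 GB/s NVFP4 figure and that the fused kernel streams the same number of bytes per layer, neither of which is stated explicitly above. The variable and function names are illustrative.

```python
# Figures quoted in this issue for NVFP4 at M=1 decode.
MEASURED_GBPS = 124.0   # effective bandwidth today (assumed to pair with MEASURED_MS)
CEILING_GBPS = 210.0    # quoted hardware ceiling
MEASURED_MS = 0.457     # gateup time per layer today
TARGET_MS = 0.35        # acceptance target


def bandwidth_at(time_ms: float) -> float:
    """Effective GB/s if the same bytes per layer are streamed in time_ms."""
    return MEASURED_GBPS * MEASURED_MS / time_ms


def time_at(gbps: float) -> float:
    """Gateup ms/layer if the same bytes per layer are streamed at gbps."""
    return MEASURED_GBPS * MEASURED_MS / gbps


# GB/s * ms = 1e9 B/s * 1e-3 s = 1e6 B, so the product is in MB.
implied_mb_per_layer = MEASURED_GBPS * MEASURED_MS
target_gbps = bandwidth_at(TARGET_MS)
floor_ms = time_at(CEILING_GBPS)

print(f"implied bytes per layer:  {implied_mb_per_layer:.1f} MB")
print(f"bandwidth at target time: {target_gbps:.1f} GB/s "
      f"({target_gbps / CEILING_GBPS:.0%} of ceiling)")
print(f"time at ceiling:          {floor_ms:.3f} ms/layer")
```

Under those assumptions, hitting 0.35 ms/layer means sustaining about 162 GB/s, or roughly 77% of the quoted ceiling. Running at the full ceiling would bring gateup down to about 0.27 ms/layer.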



Reference

sleepy/ds4-nvfp4-spark#4
