decode-opt: native mma.mxf4 (sm_121f) kernel for NVFP4 decode #5

Open
opened 2026-06-23 09:55:16 +02:00 by sleepy · 0 comments
Owner

The current NVFP4 __dp4a kernel achieves ~140 GB/s (measured at N=32768, ~124 at M=1 decode). The analysis (cuda_debug/ANALYSIS.md §5) identifies native mma.sync.kind::mxf4.block_scale on sm_121f as the path to ~186-210 GB/s — a 1.5× improvement over __dp4a.

The sm_121 target supports mma.mxf4 (emits OMMA.SF.16864.F32.E2M1.E2M1.E8 per Sggin1 SASS analysis). This is NOT the datacenter tcgen05 path — it is the consumer-Blackwell extended mma.sync that reads FP4 from registers.

Prerequisites

  • Decode occupancy must be solved first (separate issue) — mma.mxf4 with low occupancy gains nothing
  • The MXFP4→NVFP4 lossless conversion means the e2m1 nibbles are already in the right format for mma.mxf4

Acceptance

  • verify.sh passes
  • NVFP4 decode bandwidth approaches 186-210 GB/s at high occupancy
  • NVFP4 decode t/s exceeds IQ2_XXS at M=1 (the format win that justifies NVFP4)
  • Compile target sm_121f (or sm_121a with appropriate -gencode)
The current NVFP4 __dp4a kernel achieves ~140 GB/s (measured at N=32768, ~124 at M=1 decode). The analysis (cuda_debug/ANALYSIS.md §5) identifies native mma.sync.kind::mxf4.block_scale on sm_121f as the path to ~186-210 GB/s — a 1.5× improvement over __dp4a. The sm_121 target supports mma.mxf4 (emits OMMA.SF.16864.F32.E2M1.E2M1.E8 per Sggin1 SASS analysis). This is NOT the datacenter tcgen05 path — it is the consumer-Blackwell extended mma.sync that reads FP4 from registers. ## Prerequisites - Decode occupancy must be solved first (separate issue) — mma.mxf4 with low occupancy gains nothing - The MXFP4→NVFP4 lossless conversion means the e2m1 nibbles are already in the right format for mma.mxf4 ## Acceptance - verify.sh passes - NVFP4 decode bandwidth approaches 186-210 GB/s at high occupancy - NVFP4 decode t/s exceeds IQ2_XXS at M=1 (the format win that justifies NVFP4) - Compile target sm_121f (or sm_121a with appropriate -gencode)
Sign in to join this conversation.
No labels
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
sleepy/ds4-nvfp4-spark#5
No description provided.