decode-opt: native mma.mxf4 (sm_121f) kernel for NVFP4 decode #5
The current NVFP4 __dp4a kernel achieves ~140 GB/s measured at N=32768, and ~124 GB/s at M=1 decode. The analysis (cuda_debug/ANALYSIS.md §5) identifies native mma.sync.kind::mxf4.block_scale on sm_121f as the path to ~186-210 GB/s, which is 1.5× the ~124-140 GB/s that __dp4a reaches.
The sm_121 target supports mma.mxf4: per the Sggin1 SASS analysis, it emits OMMA.SF.16864.F32.E2M1.E2M1.E8. This is NOT the datacenter tcgen05 path; it is the consumer-Blackwell extended mma.sync, which reads its FP4 operands from registers.
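For checking a new kernel's output, a host-side scalar reference can help. The sketch below is not the kernel and not taken from ANALYSIS.md. It assumes the element and scale encodings that the SASS mnemonic suggests (E2M1 elements, an E8 exponent-only scale decoded as 2^(e-127)), and it assumes two FP4 values per byte with the low nibble first. The function names, the packing order, and the one-scale-per-operand-block layout are illustrative assumptions; the real block size and scale layout should be taken from the PTX documentation for the instruction.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Decode one FP4 E2M1 code: 1 sign bit, 2 exponent bits, 1 mantissa bit.
// The eight representable magnitudes are 0, 0.5, 1, 1.5, 2, 3, 4, 6.
inline float decode_e2m1(std::uint8_t nibble) {
    assert(nibble < 16);
    static const float kMagnitude[8] = {0.0f, 0.5f, 1.0f, 1.5f,
                                        2.0f, 3.0f, 4.0f, 6.0f};
    const float mag = kMagnitude[nibble & 0x7];
    return (nibble & 0x8) ? -mag : mag;
}

// Decode an exponent-only 8-bit scale as 2^(e - 127); 0xFF encodes NaN.
inline float decode_e8m0(std::uint8_t e) {
    if (e == 0xFF) return std::nanf("");
    return std::ldexp(1.0f, static_cast<int>(e) - 127);
}

// Scalar reference for one scaled block dot product.
// ASSUMPTIONS (hypothetical layout): n_values is even, values are packed two
// per byte with the low nibble first, and each operand block has one scale.
inline float dot_block_ref(const std::uint8_t* a, const std::uint8_t* b,
                           int n_values,
                           std::uint8_t scale_a, std::uint8_t scale_b) {
    float acc = 0.0f;
    for (int i = 0; i < n_values / 2; ++i) {
        acc += decode_e2m1(a[i] & 0xF) * decode_e2m1(b[i] & 0xF);  // low nibbles
        acc += decode_e2m1(a[i] >> 4) * decode_e2m1(b[i] >> 4);    // high nibbles
    }
    // Both scales are applied once per block, after the FP32 accumulation.
    return acc * decode_e8m0(scale_a) * decode_e8m0(scale_b);
}
```

Comparing kernel output against a reference like this on small random blocks would catch nibble-order and scale-indexing mistakes before any bandwidth measurement.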
Prerequisites
Acceptance