perf: Fuse RMS norm into matmul for decode #43

Closed

opened 2026-05-15 19:02:26 +02:00 by sleepy · 1 comment

sleepy commented

2026-05-15 19:02:26 +02:00

Owner

Every layer starts with an RMS norm of the hidden state followed by matmul(hidden, weight). The two are always paired.

For decode (M=1), the RMS norm reads 2560 bf16 (5KB) and writes 2560 bf16. The matmul then reads the same 2560 bf16. Fusing them eliminates one 5KB read+write per norm-matmul pair.
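As a minimal sketch of what the fusion means numerically (not the kernel itself), the following compares the two-step path with a fused path that never materializes the normalized vector. The helper names, the `gamma` scale vector, the `eps` value, and the row-major weight layout are assumptions for illustration; the issue does not specify them.

```python
import math
import random

def rms_norm(x, gamma, eps=1e-6):
    """Unfused step 1: scale x by 1/rms(x) and by gamma, writing a new vector."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * g for v, g in zip(x, gamma)]

def gemv(weight, x):
    """Unfused step 2: one dot product per output row (M=1 decode)."""
    return [sum(w * v for w, v in zip(row, x)) for row in weight]

def fused_norm_gemv(weight, x, gamma, eps=1e-6):
    """Fused variant: apply gamma inside the dot product, 1/rms once per output."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [inv_rms * sum(w * g * v for w, g, v in zip(row, gamma, x))
            for row in weight]

random.seed(0)
K, N = 64, 8  # toy sizes (assumption); the issue's hidden size is 2560
x = [random.uniform(-1, 1) for _ in range(K)]
gamma = [random.uniform(0.5, 1.5) for _ in range(K)]
weight = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(N)]

unfused = gemv(weight, rms_norm(x, gamma))
fused = fused_norm_gemv(weight, x, gamma)
max_diff = max(abs(a - b) for a, b in zip(unfused, fused))
print(f"max |unfused - fused| = {max_diff:.3e}")
```

The two paths agree up to floating-point rounding; the only difference is whether the intermediate normalized vector is written out and read back.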

This covers ~4 norm-matmul pairs per layer (input_norm→qkv, post_norm→gate, etc.) × 32 layers = 128 fewer kernel dispatches and ~1.25MB less memory traffic per decode step (128 pairs × 10KB for the eliminated write and re-read).
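A quick back-of-envelope check of the traffic estimate, using only the figures stated above (hidden size 2560, bf16, 4 pairs per layer, 32 layers). The assumption here is that fusion removes exactly one write and one re-read of the normalized vector per pair; the constant names are illustrative.

```python
HIDDEN = 2560          # elements in the hidden vector
BYTES_PER_BF16 = 2     # bf16 is 16 bits
PAIRS_PER_LAYER = 4    # norm-matmul pairs per layer, as stated in the issue
LAYERS = 32

vector_bytes = HIDDEN * BYTES_PER_BF16      # bytes in one normalized vector
pairs = PAIRS_PER_LAYER * LAYERS            # fused pairs per decode step
saved_bytes = pairs * 2 * vector_bytes      # one write + one read per pair
saved_mib = saved_bytes / (1024 * 1024)

print(f"{vector_bytes} bytes per vector, {pairs} pairs")
print(f"{saved_bytes} bytes saved per decode step ({saved_mib:.2f} MiB)")
```

Under these assumptions the activation traffic saved is on the order of a megabyte per step, so the dispatch count is likely the more significant part of the saving.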

sleepy referenced this issue

2026-05-15 20:22:04 +02:00

[perf] Vectorized bf16x4 GEMV kernel for M=1 decode (#40) #51

sleepy referenced this issue from a commit

2026-05-15 20:52:11 +02:00

perf(#43): fuse RMS norm into matmul for decode

sleepy commented

2026-05-15 20:54:10 +02:00

Author

Owner

Approach rejected. Fusing the norm into the row-per-TG GEMV is counterproductive for M=1 decode. Each threadgroup computes one output row, so the RMS norm is recomputed once per row, i.e. N times (152960x for lm_head). Result: 162ms vs. the 34ms baseline (4.7x slower). Fusion only pays off when multiple output rows share a threadgroup and can therefore share the norm computation, which our 1-row-per-TG M=1 GEMV rules out. Closing as wontfix for the current architecture.
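The redundancy argument can be sketched as a simple count of how many times the norm reduction runs for a given threadgroup tiling. This is a counting model only, not a timing model; the function name and the 8-rows-per-TG tiling are hypothetical, while the 152960 row count comes from the comment above.

```python
def norm_reductions(n_rows: int, rows_per_tg: int) -> int:
    """Number of threadgroups, each of which recomputes the RMS norm once."""
    return -(-n_rows // rows_per_tg)  # ceiling division

LM_HEAD_ROWS = 152960  # output rows of lm_head, from the issue

unfused = 1                                          # separate norm kernel runs once
fused_one_row = norm_reductions(LM_HEAD_ROWS, 1)     # current 1-row-per-TG GEMV
fused_eight_rows = norm_reductions(LM_HEAD_ROWS, 8)  # hypothetical 8-rows-per-TG tiling

print(f"unfused:            {unfused} norm reduction")
print(f"fused, 1 row/TG:    {fused_one_row} norm reductions")
print(f"fused, 8 rows/TG:   {fused_eight_rows} norm reductions")
```

Sharing rows within a threadgroup reduces the redundancy proportionally, but any fused row-tiled design still repeats the reduction once per threadgroup, whereas the unfused path does it once.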

sleepy closed this issue

2026-05-15 20:54:10 +02:00