perf: Batched q_norm/k_norm into single kernel dispatch

sleepy commented

2026-05-15 19:02:26 +02:00

Owner

Full attention layers (8 of 32) run per-head RMS norms: 16 q_norm + 4 k_norm = 20 separate dispatches per layer. Each dispatch launches only 256 threads, one per element of a head_dim=256 head, so every layer issues 20 tiny kernels with terrible GPU occupancy.

Solution: write a batched_rms_norm_bf16 kernel that normalizes all heads in a single dispatch. Inputs: a contiguous buffer with all heads laid out back to back, a num_heads parameter, and the per-head head_dim. One dispatch then handles all 20 norms of a layer.
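
For reference, the batched layout described above can be sketched on the CPU in plain Python. This is not the kernel itself: the function name `batched_rms_norm_reference`, the `eps` value, the per-head weight list, and the use of ordinary floats instead of bf16 are all assumptions made for illustration. Each iteration of the outer loop corresponds to the work one thread group would do for one head inside the single dispatch.

```python
import math


def batched_rms_norm_reference(x, weights, num_heads, head_dim, eps=1e-6):
    """CPU reference for a batched per-head RMS norm (hypothetical helper).

    x        flat list of num_heads * head_dim values, heads stored back to back
    weights  list of num_heads weight vectors, each head_dim long
             (heads may share the same vector, e.g. all q heads)
    eps      assumed stabilizer; the real kernel's value is not stated
    """
    if len(x) != num_heads * head_dim:
        raise ValueError("x must hold num_heads * head_dim elements")
    if len(weights) != num_heads:
        raise ValueError("need one weight vector per head")

    out = [0.0] * len(x)
    for h in range(num_heads):  # on the GPU: one thread group per head
        base = h * head_dim
        head = x[base:base + head_dim]
        mean_sq = sum(v * v for v in head) / head_dim
        inv_rms = 1.0 / math.sqrt(mean_sq + eps)
        w = weights[h]
        for i in range(head_dim):  # on the GPU: one thread per element
            out[base + i] = head[i] * inv_rms * w[i]
    return out


# Shapes from the issue: 16 q heads + 4 k heads, head_dim = 256.
NUM_Q, NUM_K, HEAD_DIM = 16, 4, 256
q_weight = [1.0] * HEAD_DIM  # placeholder weights, not real model data
k_weight = [1.0] * HEAD_DIM
weights = [q_weight] * NUM_Q + [k_weight] * NUM_K
x = [float((i % 7) + 1) for i in range((NUM_Q + NUM_K) * HEAD_DIM)]
y = batched_rms_norm_reference(x, weights, NUM_Q + NUM_K, HEAD_DIM)
```

Passing one weight vector per head lets the q and k heads share a buffer while keeping separate norm weights, which is one way a single dispatch could cover both q_norm and k_norm.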

This replaces 20×8 = 160 kernel dispatches per decode step with 8, one per full attention layer.
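
The dispatch arithmetic can be checked with a few lines of Python. The sketch assumes exactly one batched dispatch remains per full attention layer, as the description above implies; the constant names are made up for this example.

```python
FULL_ATTENTION_LAYERS = 8  # 8 of the 32 layers use full attention
Q_HEADS = 16
K_HEADS = 4
HEAD_DIM = 256

# Before: one dispatch per head norm, 256 threads each.
dispatches_before = (Q_HEADS + K_HEADS) * FULL_ATTENTION_LAYERS

# After (assumption): one batched dispatch per full attention layer.
dispatches_after = 1 * FULL_ATTENTION_LAYERS

saved = dispatches_before - dispatches_after
threads_per_batched_dispatch = (Q_HEADS + K_HEADS) * HEAD_DIM

print(dispatches_before, dispatches_after, saved, threads_per_batched_dispatch)
```

Under that assumption the 160 per-head dispatches collapse to 8, a net reduction of 152 per decode step, and each batched dispatch covers 5120 elements instead of 256.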

sleepy referenced this issue from a commit

2026-05-15 19:17:51 +02:00

perf(#38): batch q_norm/k_norm into single kernel dispatches

sleepy referenced this issue from a pull request that will close it

2026-05-15 19:20:35 +02:00

[perf] Batched q_norm/k_norm into single kernel dispatches (#38) #49

sleepy referenced this issue from a commit

2026-05-15 19:23:41 +02:00

[perf] Batched q_norm/k_norm into single kernel dispatches (#38) (#49)

sleepy closed this issue

2026-05-15 19:23:41 +02:00

sleepy commented