[perf] Batched q_norm/k_norm into single kernel dispatches (#38) #49

Merged
sleepy merged 1 commit from perf/38-batched-qnorm-knorm into main 2026-05-15 19:23:41 +02:00

Summary

Replaces 20 per-head RMS norm dispatches per full attention layer with 2 batched dispatches (one for all Q heads, one for all K heads).

Savings: 18 fewer dispatches × 8 full attention layers = 144 fewer kernel dispatches per decode step.
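The dispatch count can be checked with a few lines of arithmetic. This is a minimal sketch, assuming the 16 Q heads and 4 K heads per full attention layer given in the commit message; the variable names are illustrative only.

```python
# Head counts per full attention layer (from the commit message: 16 Q + 4 K).
Q_HEADS = 16
K_HEADS = 4
FULL_ATTENTION_LAYERS = 8

before_per_layer = Q_HEADS + K_HEADS   # one RMS norm dispatch per head
after_per_layer = 2                    # one batched dispatch for Q, one for K

saved_per_layer = before_per_layer - after_per_layer
saved_per_step = saved_per_layer * FULL_ATTENTION_LAYERS

print(saved_per_layer, saved_per_step)  # prints "18 144"
```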

Changes

  • New Metal kernel batched_rms_norm_bf16 in rms_norm.metal — one threadgroup per head, handles strided layout via byte_stride parameter
  • New dispatch function set_batched_rms_norm_bf16 in dispatch.zig
  • model.zig: replaced per-head norm loops with 2 batched dispatches
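To illustrate what the batched kernel computes, here is a CPU-side reference sketch in Python. It is not the Metal kernel from this PR. It assumes a norm weight shared across heads, an epsilon of 1e-6, float arithmetic in place of bf16, and a hypothetical interleaved layout in which each Q head is followed by its gate values. The function name and the toy buffer are illustrative only.

```python
import math

BF16_BYTES = 2  # size of one bf16 element, used to convert byte_stride

def batched_rms_norm(buf, weight, num_heads, head_dim, byte_stride, eps=1e-6):
    """Normalize num_heads heads in place; head h starts h * byte_stride bytes in."""
    stride = byte_stride // BF16_BYTES      # element stride between heads
    for h in range(num_heads):              # on the GPU: one threadgroup per head
        base = h * stride
        head = buf[base:base + head_dim]
        mean_sq = sum(x * x for x in head) / head_dim
        scale = 1.0 / math.sqrt(mean_sq + eps)
        for i in range(head_dim):
            buf[base + i] = head[i] * scale * weight[i]
    return buf

# Assumed toy layout: [q0 | gate0 | q1 | gate1] with head_dim = 4.
head_dim = 4
q_gate = [1.0, 2.0, 3.0, 4.0,  9.0, 9.0, 9.0, 9.0,
          2.0, 2.0, 2.0, 2.0,  7.0, 7.0, 7.0, 7.0]
weight = [1.0] * head_dim

out = batched_rms_norm(list(q_gate), weight, num_heads=2, head_dim=head_dim,
                       byte_stride=2 * head_dim * BF16_BYTES)
```

Because the stride skips over the gate values, a single call covers every Q head without touching the gate data; a contiguous K buffer would use a byte stride of `head_dim * BF16_BYTES` instead.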

Benchmark

  • Baseline (main): 43 ms/step, 17.1 tok/s
  • This PR: 40 ms/step, 18.5 tok/s (7% lower step time)
  • Output: token-identical to baseline (verified)

Closes #38

Replace 20 per-head RMS norm dispatches (16 Q + 4 K) with 2 batched
dispatches per full attention layer. New batched_rms_norm_bf16 kernel
handles strided head layout (Q heads interleaved with gate in q_gate_buf).
Saves 144 kernel dispatches per decode step across 8 full attention layers
(160 per-head dispatches become 16 batched dispatches).
sleepy merged commit 290123e65d into main 2026-05-15 19:23:41 +02:00
Reference
sleepy/sleepy-llm!49