Reduce GPU dispatch count (1151 per tick) #28

sleepy commented

2026-04-30 18:11:34 +02:00

Problem

Token generation (tokgen) with the 9B Q4_0 model at ctx=256 issues 1151 actual GPU dispatches per decode tick. At 52.9 tok/s (18.9 ms/tick), that averages 16.4 us per dispatch. The Metal dispatch floor is roughly 3-5 us, so dispatch overhead alone may account for 3.5-5.8 ms (18-30%) of tick time.
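The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `dispatch_overhead` is made up for illustration, and the 3-5 us floor is the rough figure quoted above, not a measured value.

```python
def dispatch_overhead(tok_per_s, dispatches_per_tick, floor_us=(3.0, 5.0)):
    """Estimate how much of one decode tick could be pure dispatch overhead."""
    tick_ms = 1000.0 / tok_per_s                     # wall time per decode tick
    avg_us = tick_ms * 1000.0 / dispatches_per_tick  # mean wall time per dispatch
    # Overhead range if every dispatch pays the assumed floor cost (low, high).
    overhead_ms = tuple(f * dispatches_per_tick / 1000.0 for f in floor_us)
    share = tuple(o / tick_ms for o in overhead_ms)  # fraction of the tick
    return tick_ms, avg_us, overhead_ms, share


tick_ms, avg_us, overhead_ms, share = dispatch_overhead(52.9, 1151)
print(f"tick: {tick_ms:.2f} ms, average per dispatch: {avg_us:.2f} us")
print(f"overhead: {overhead_ms[0]:.2f}-{overhead_ms[1]:.2f} ms "
      f"({share[0]:.1%}-{share[1]:.1%} of tick)")
```

Because the overhead scales linearly with the dispatch count, the same function gives a quick upper bound on what a given reduction in dispatches could save.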

Data

| Op | Dispatches/tick |
|----|----------------|
| MUL_MAT | 249 |
| MUL | 161 |
| RMS_NORM | 105 |
| UNARY/SILU | 104 |
| GET_ROWS | 99 |
| CPY | 97 |
| ADD | 88 |
| Others | 248 |
| **Total** | **1151** |
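To see where fusion might pay off, the counts can be turned into shares of the tick's dispatches. This is a rough sketch: treating MUL, ADD, and UNARY/SILU as the element-wise group is an assumption, and "Others" may hide further element-wise ops that are not broken out here.

```python
# Dispatch counts per decode tick, copied from the table.
dispatches = {
    "MUL_MAT": 249, "MUL": 161, "RMS_NORM": 105, "UNARY/SILU": 104,
    "GET_ROWS": 99, "CPY": 97, "ADD": 88, "Others": 248,
}
total = sum(dispatches.values())

# Assumption: these three rows are the element-wise fusion candidates.
ELEMENTWISE = ("MUL", "ADD", "UNARY/SILU")
elementwise = sum(dispatches[op] for op in ELEMENTWISE)

for op, n in sorted(dispatches.items(), key=lambda kv: -kv[1]):
    print(f"{op:<12}{n:>5}  {n / total:6.1%}")
print(f"element-wise candidates: {elementwise} of {total} "
      f"({elementwise / total:.1%})")
```

Under that grouping the element-wise rows add up to 353 dispatches, so they outnumber MUL_MAT (249) as a target for reducing the count.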

Approach

- Capture Xcode GPUtrace with GGML_METAL_CAPTURE_COMPUTE=2 to see per-kernel GPU time
- Investigate op batching: many element-wise ops (MUL, ADD, UNARY) could be fused
- Check if multiple GET_ROWS can be batched into one dispatch
- Compare with MLX dispatch count for equivalent model
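The fusion idea can be illustrated with a toy graph pass. This is a minimal sketch and not the ggml Metal implementation: the `(name, op, inputs)` node format, the `FUSABLE` set, and `fuse_chains` are all hypothetical. It merges a binary element-wise op into the previous dispatch only when it consumes the previous node's output and nothing else reads that output.

```python
from collections import Counter

# Assumption: binary element-wise ops that one kernel could evaluate in sequence.
FUSABLE = {"ADD", "MUL", "SUB", "DIV"}


def fuse_chains(nodes):
    """Group chains of fusable ops so that each group is one dispatch.

    nodes: list of (name, op, inputs) tuples in execution order.
    """
    # How many times each tensor is read anywhere in the graph.
    uses = Counter(src for _, _, inputs in nodes for src in inputs)
    groups = []
    for name, op, inputs in nodes:
        if groups:
            prev_name, prev_op, _ = groups[-1][-1]
            # Fuse only if the intermediate result never has to be
            # materialized for another consumer.
            if (op in FUSABLE and prev_op in FUSABLE
                    and prev_name in inputs and uses[prev_name] == 1):
                groups[-1].append((name, op, inputs))
                continue
        groups.append([(name, op, inputs)])
    return groups


graph = [
    ("t0", "MUL_MAT", ["w", "x"]),
    ("t1", "ADD", ["t0", "b"]),
    ("t2", "MUL", ["t1", "s"]),
    ("t3", "SUB", ["t2", "m"]),
    ("t4", "RMS_NORM", ["t3"]),
]
groups = fuse_chains(graph)
print(len(graph), "nodes ->", len(groups), "dispatches")  # 5 nodes -> 3 dispatches
```

The single-consumer check is deliberately conservative: an intermediate that is read twice stays as its own dispatch, which keeps the pass simple at the cost of missing some fusions.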

sleepy added the perf label 2026-04-30 18:11:34 +02:00

sleepy changed title from ~~Reduce GPU dispatch count - 1151 per tick~~ to Reduce GPU dispatch count (1151 per tick)

2026-04-30 18:16:37 +02:00

sleepy referenced this issue from a commit

2026-04-30 20:14:19 +02:00

[metal] extend bin op fusion to MUL/SUB/DIV chains (#28)

sleepy referenced this issue

2026-04-30 20:16:59 +02:00

[metal] extend bin op fusion to MUL/SUB/DIV chains (#28) #38

sleepy closed this issue

2026-04-30 20:17:15 +02:00

sleepy referenced this issue from a commit