Reduce GPU dispatch count (1151 per tick) #28

Closed
opened 2026-04-30 18:11:34 +02:00 by sleepy · 0 comments
Owner

Problem

9B Q4_0 tokgen at ctx=256 issues 1151 actual GPU dispatches per decode tick. At 52.9 tok/s (18.9 ms/tick), the average time per dispatch is 16.4 us. The Metal dispatch floor is roughly 3-5 us, so dispatch overhead alone (1151 dispatches x 3-5 us) may account for 3.5-5.8 ms, or 18-30% of tick time.
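The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch using the figures quoted in this issue; the 3-5 us floor is the rough assumption stated above, not a measured value.

```python
# Back-of-the-envelope dispatch overhead estimate from the numbers in this issue.
TOKENS_PER_SEC = 52.9
DISPATCHES_PER_TICK = 1151
DISPATCH_FLOOR_US = (3.0, 5.0)  # assumed rough Metal per-dispatch floor (low, high)


def tick_ms(tokens_per_sec):
    """Wall time of one decode tick in milliseconds."""
    return 1000.0 / tokens_per_sec


def avg_dispatch_us(tokens_per_sec, dispatches):
    """Average wall time per dispatch in microseconds."""
    return tick_ms(tokens_per_sec) * 1000.0 / dispatches


def overhead_ms(dispatches, floor_us):
    """Total dispatch overhead per tick if every dispatch costs floor_us."""
    return dispatches * floor_us / 1000.0


tick = tick_ms(TOKENS_PER_SEC)
avg = avg_dispatch_us(TOKENS_PER_SEC, DISPATCHES_PER_TICK)
low, high = (overhead_ms(DISPATCHES_PER_TICK, f) for f in DISPATCH_FLOOR_US)

print(f"tick: {tick:.1f} ms, avg per dispatch: {avg:.1f} us")
print(f"overhead: {low:.2f}-{high:.2f} ms ({low / tick:.0%}-{high / tick:.0%} of tick)")
```

The floor is a lower bound per dispatch, so the resulting range is the share of the tick that cannot be recovered without issuing fewer dispatches.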

Data

| Op | Dispatches/tick |
|----|----------------|
| MUL_MAT | 249 |
| MUL | 161 |
| RMS_NORM | 105 |
| UNARY/SILU | 104 |
| GET_ROWS | 99 |
| CPY | 97 |
| ADD | 88 |
| Others | 248 |
| **Total** | **1151** |
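To see which ops dominate, the counts above can be turned into shares of the total. This is a small sketch over the table's own numbers; grouping MUL, ADD and UNARY/SILU as "element-wise" is an assumption made here for illustration.

```python
# Per-op share of dispatches, using the counts from the table in this issue.
DISPATCHES = {
    "MUL_MAT": 249,
    "MUL": 161,
    "RMS_NORM": 105,
    "UNARY/SILU": 104,
    "GET_ROWS": 99,
    "CPY": 97,
    "ADD": 88,
    "Others": 248,
}

# Assumed grouping: ops treated as simple element-wise work for this tally.
ELEMENTWISE = ("MUL", "ADD", "UNARY/SILU")

total = sum(DISPATCHES.values())
elementwise = sum(DISPATCHES[op] for op in ELEMENTWISE)

for op, count in sorted(DISPATCHES.items(), key=lambda kv: -kv[1]):
    print(f"{op:<12} {count:>5} {count / total:6.1%}")

print(f"element-wise {elementwise:>5} {elementwise / total:6.1%}")
```

Element-wise ops together account for 353 of the 1151 dispatches, which makes them the largest group after MUL_MAT plus "Others".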

Approach

  • Capture Xcode GPUtrace with GGML_METAL_CAPTURE_COMPUTE=2 to see per-kernel GPU time
  • Investigate op batching: many element-wise ops (MUL, ADD, UNARY) could be fused
  • Check if multiple GET_ROWS can be batched into one dispatch
  • Compare with MLX dispatch count for equivalent model
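As a rough illustration of the op-batching idea, the toy counter below collapses each consecutive run of element-wise ops into one dispatch. It is a counting sketch only, not ggml's graph API: the op sequence is hypothetical, and real fusion would also need dependency and shape checks plus an actual fused kernel.

```python
# Toy model: count dispatches with and without fusing runs of element-wise ops.
ELEMENTWISE = {"MUL", "ADD", "UNARY"}  # assumed fusable set for this sketch


def count_dispatches(ops, fuse=False):
    """Count dispatches for ops in execution order.

    With fuse=True, each maximal run of consecutive element-wise ops
    is counted as a single dispatch.
    """
    count = 0
    in_run = False
    for op in ops:
        if fuse and op in ELEMENTWISE:
            if not in_run:  # first op of a new fused run
                count += 1
                in_run = True
        else:
            count += 1
            in_run = False
    return count


# Hypothetical op order for part of one layer (not taken from a real trace).
block = ["RMS_NORM", "MUL", "MUL_MAT", "UNARY", "MUL", "MUL_MAT", "ADD"]

unfused = count_dispatches(block)
fused = count_dispatches(block, fuse=True)
print(f"unfused: {unfused}, fused: {fused}")
```

Only adjacent element-wise ops merge in this model, so the saving depends heavily on the actual op order, which the GPU trace would show.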
sleepy added the perf label 2026-04-30 18:11:34 +02:00
sleepy changed title from Reduce GPU dispatch count - 1151 per tick to Reduce GPU dispatch count (1151 per tick) 2026-04-30 18:16:37 +02:00
Reference: sleepy/llama.cpp#28