Reduce GPU dispatch count (1151 per tick) #28
Problem
Token generation (tokgen) for the 9B Q4_0 model at ctx=256 issues 1151 actual GPU dispatches per decode tick. At 52.9 tok/s (18.9 ms/tick), the average time per dispatch is 16.4 us. The Metal dispatch floor is roughly 3-5 us per dispatch, so dispatch overhead alone may account for 3.5-5.8 ms (18-30%) of tick time.
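The estimate above can be reproduced with a few lines of arithmetic. This is only a sketch: the function name `dispatch_overhead` and its parameter names are made up for illustration, and the 3-5 us floor is the rough figure quoted above, not a measured value.

```python
def dispatch_overhead(dispatches, tok_per_s, floor_us_lo, floor_us_hi):
    """Estimate how much of one decode tick is pure dispatch overhead.

    All inputs come from the figures in this issue; the floor values
    are a rough assumption, not a measurement.
    """
    tick_ms = 1000.0 / tok_per_s                 # time per decode tick
    avg_us = tick_ms * 1000.0 / dispatches       # average time per dispatch
    lo_ms = dispatches * floor_us_lo / 1000.0    # overhead at the low floor
    hi_ms = dispatches * floor_us_hi / 1000.0    # overhead at the high floor
    return {
        "tick_ms": tick_ms,
        "avg_us_per_dispatch": avg_us,
        "overhead_ms": (lo_ms, hi_ms),
        "overhead_fraction": (lo_ms / tick_ms, hi_ms / tick_ms),
    }


est = dispatch_overhead(dispatches=1151, tok_per_s=52.9,
                        floor_us_lo=3.0, floor_us_hi=5.0)
print(est)
```

Re-running this with a lower dispatch count gives a quick upper bound on what a given reduction could save, assuming the per-dispatch floor stays the same.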
Data
Approach