perf: Tune threadgroup sizes per matmul shape #37

Closed
opened 2026-05-15 19:02:26 +02:00 by sleepy · 1 comment
Owner

The current dispatch uses a fixed ROWS_PER_TG=4 for all matmul shapes. Different shapes need different tuning:

  • lm_head (N=152960, K=2560): Needs many threadgroups; ROWS_PER_TG=1 with a larger SG stride may be better.
  • gate/up proj (N=9216, K=2560): Medium-sized; the current config may be fine.
  • small projections (N=32, K=2560): Only 32 output rows, so ROWS_PER_TG=4 gives just 8 threadgroups. Use ROWS_PER_TG=1 with a larger reduction group.
  • MLP down (N=2560, K=9216): K is large, so more lanes in the reduction should help.

Autotune, or use a lookup table keyed on (N, K), to select the optimal (ROWS_PER_TG, threads_per_row).
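
A minimal sketch of the lookup-table idea, written in Python for illustration only (the real dispatch code is not shown in this issue). The names `DispatchConfig`, `LOOKUP`, `select_config`, and `threadgroup_count` are hypothetical, and every `threads_per_row` value, including the one in the default, is an unmeasured placeholder that an autotuner or benchmark would have to fill in.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DispatchConfig:
    rows_per_tg: int      # output rows handled by one threadgroup
    threads_per_row: int  # lanes cooperating on one row's reduction


# Fallback mirrors the fixed ROWS_PER_TG=4 from the issue; threads_per_row=32
# is an assumption, the issue does not state the current value.
DEFAULT = DispatchConfig(rows_per_tg=4, threads_per_row=32)

# Keys are the (N, K) shapes listed in the issue. The values are placeholders
# that follow the issue's suggestions; they are NOT measured optima.
LOOKUP = {
    (152960, 2560): DispatchConfig(rows_per_tg=1, threads_per_row=32),   # lm_head
    (9216, 2560): DispatchConfig(rows_per_tg=4, threads_per_row=32),     # gate/up proj
    (32, 2560): DispatchConfig(rows_per_tg=1, threads_per_row=128),      # small projections
    (2560, 9216): DispatchConfig(rows_per_tg=4, threads_per_row=64),     # MLP down
}


def select_config(n: int, k: int) -> DispatchConfig:
    """Return the tuned config for (N, K), or the default for unknown shapes."""
    return LOOKUP.get((n, k), DEFAULT)


def threadgroup_count(n: int, rows_per_tg: int) -> int:
    """Number of threadgroups needed to cover N output rows (ceiling division)."""
    return (n + rows_per_tg - 1) // rows_per_tg


if __name__ == "__main__":
    # The small-projection case from the issue: 32 rows at ROWS_PER_TG=4.
    print(threadgroup_count(32, 4))  # 8
    cfg = select_config(32, 2560)
    print(threadgroup_count(32, cfg.rows_per_tg))  # 32
```

An exact-match table keeps selection cheap on the dispatch path; an autotuner could populate the same table at startup instead of hard-coding it.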

Author
Owner

Merged via PR #50 (squash). Coherence: verified. Decode: 40 ms → 41 ms (within noise, no regression). ROWS_PER_TG is now a runtime parameter, which enables future autotuning. Threadgroup tuning alone cannot break the 45% bandwidth ceiling; that requires #40/#42.
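
For context on the bandwidth figure, a rough sketch of how a utilization fraction like this is typically computed: bytes read per decode step, divided by step time, divided by the device's peak memory bandwidth. The function name and all inputs in the example are hypothetical placeholders; the issue does not state the model size or the hardware's peak bandwidth, so the example does not reproduce the 45% figure.

```python
def bandwidth_utilization(bytes_read: float, seconds: float,
                          peak_bytes_per_s: float) -> float:
    """Fraction of peak memory bandwidth achieved by one decode step."""
    if seconds <= 0 or peak_bytes_per_s <= 0:
        raise ValueError("seconds and peak_bytes_per_s must be positive")
    achieved = bytes_read / seconds  # bytes per second actually streamed
    return achieved / peak_bytes_per_s


if __name__ == "__main__":
    # Placeholder inputs, NOT measurements from this project:
    # 2 GB of weights read per token, 40 ms per token, 200 GB/s peak.
    print(f"{bandwidth_utilization(2e9, 0.040, 200e9):.2f}")  # 0.25
```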

Reference
sleepy/sleepy-llm#37