[perf] Make ROWS_PER_TG runtime parameter, tune per matmul shape (#37) #50

Merged
sleepy merged 1 commit from perf/37-tune-threadgroup-sizes into main 2026-05-15 19:34:50 +02:00
Owner

Summary

Makes ROWS_PER_TG a runtime parameter (via buffer(5)) instead of compile-time constant. Adds matmul_rows_per_tg(N) lookup: returns 1 for N<=4096 (k_proj, v_proj, o_proj, down_proj), 4 for larger N.

Results

  • Decode time: 40ms→41ms (within noise, no measurable regression or improvement)
  • Output: coherent, token-identical to baseline

Analysis

Threadgroup tuning alone does not improve decode time. The dominant matmuls (gate/up_proj N=9216, lm_head N=152960) already saturate memory bandwidth with ROWS_PER_TG=4. Small-N matmuls (k_proj/v_proj N=1024) are only ~5% of weight traffic.

The ~45% bandwidth ceiling requires a fundamentally different kernel approach (see #40, #42).

This PR is still valuable as infrastructure — runtime ROWS_PER_TG enables future autotuning and MPP matmul2d integration.

Closes #37

## Summary Makes `ROWS_PER_TG` a runtime parameter (via buffer(5)) instead of compile-time constant. Adds `matmul_rows_per_tg(N)` lookup: returns 1 for N<=4096 (k_proj, v_proj, o_proj, down_proj), 4 for larger N. ## Results - Decode time: 40ms→41ms (within noise, no measurable regression or improvement) - Output: coherent, token-identical to baseline ## Analysis Threadgroup tuning alone does not improve decode time. The dominant matmuls (gate/up_proj N=9216, lm_head N=152960) already saturate memory bandwidth with ROWS_PER_TG=4. Small-N matmuls (k_proj/v_proj N=1024) are only ~5% of weight traffic. The ~45% bandwidth ceiling requires a fundamentally different kernel approach (see #40, #42). This PR is still valuable as infrastructure — runtime ROWS_PER_TG enables future autotuning and MPP matmul2d integration. Closes #37
The GEMV kernel now accepts ROWS_PER_TG via buffer(5) instead of a
compile-time constexpr. Dispatch uses N <= 4096 -> 1 row/TG (maximize
parallelism for small outputs like k_proj/v_proj), N > 4096 -> 4 rows/TG.
Benchmarks show ~39.4ms decode (marginal improvement from ~40ms baseline)
as the dominant matmuls already saturate GPU bandwidth.
sleepy merged commit d0a3724b06 into main 2026-05-15 19:34:50 +02:00
Sign in to join this conversation.
No reviewers
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
sleepy/sleepy-llm!50
No description provided.