[perf] Make ROWS_PER_TG runtime parameter, tune per matmul shape (#37) #50

Merged

sleepy merged 1 commit from perf/37-tune-threadgroup-sizes into main

2026-05-15 19:34:50 +02:00

sleepy commented

2026-05-15 19:34:02 +02:00

Owner

Summary

Makes ROWS_PER_TG a runtime parameter (passed via buffer(5)) instead of a compile-time constant. Adds a matmul_rows_per_tg(N) lookup that returns 1 for N <= 4096 (k_proj, v_proj, o_proj, down_proj) and 4 for larger N.
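The lookup rule above can be sketched on the host side roughly as follows. This is a minimal illustration, not the code from this PR: the exact signature of `matmul_rows_per_tg` is assumed, and `matmul_threadgroup_count` is a hypothetical helper that assumes each threadgroup produces `ROWS_PER_TG` consecutive output rows.

```cpp
#include <cstdint>

// Sketch of the rule described in this PR (signature is an assumption):
// N <= 4096 -> 1 row per threadgroup, larger N -> 4 rows per threadgroup.
constexpr std::uint32_t matmul_rows_per_tg(std::uint32_t n) {
    return n <= 4096 ? 1u : 4u;
}

// Hypothetical helper: number of threadgroups needed to cover all N output
// rows, assuming each threadgroup computes rows_per_tg consecutive rows.
// Uses ceiling division so a partial final group is still dispatched.
constexpr std::uint32_t matmul_threadgroup_count(std::uint32_t n) {
    const std::uint32_t rows = matmul_rows_per_tg(n);
    return (n + rows - 1) / rows;
}
```

Under these assumptions, a small projection such as k_proj (N=1024) is spread over one threadgroup per row, while gate/up_proj (N=9216) is covered by a quarter as many threadgroups as it has rows.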

Results

Decode time: 40ms→41ms (within noise, no measurable regression or improvement)
Output: coherent, token-identical to baseline

Analysis

Threadgroup tuning alone does not improve decode time. The dominant matmuls (gate/up_proj N=9216, lm_head N=152960) already saturate memory bandwidth with ROWS_PER_TG=4. Small-N matmuls (k_proj/v_proj N=1024) are only ~5% of weight traffic.

Getting past the ~45% bandwidth ceiling requires a fundamentally different kernel approach (see #40, #42).
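For reference, a bandwidth figure like the one above is usually the achieved read rate divided by the device's peak rate. The sketch below shows that arithmetic only; `bandwidth_utilization` is a hypothetical helper, and all three inputs (bytes read per decode step, step time, peak bandwidth) must be supplied by the caller, since none of them are stated in this PR.

```cpp
// Hypothetical helper: fraction of peak memory bandwidth achieved.
// bytes_read            - total bytes streamed from memory in the measured step
// seconds               - wall-clock time of that step
// peak_bytes_per_second - theoretical peak bandwidth of the device
// Returns 0.0 for non-positive time or peak to avoid dividing by zero.
inline double bandwidth_utilization(double bytes_read,
                                    double seconds,
                                    double peak_bytes_per_second) {
    if (seconds <= 0.0 || peak_bytes_per_second <= 0.0) {
        return 0.0;
    }
    const double achieved_bytes_per_second = bytes_read / seconds;
    return achieved_bytes_per_second / peak_bytes_per_second;
}
```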

This PR is still valuable as infrastructure: a runtime ROWS_PER_TG enables future autotuning and MPP matmul2d integration.

Closes #37


sleepy added 1 commit

2026-05-15 19:34:02 +02:00

perf(#37): make ROWS_PER_TG a runtime parameter, tune per matmul shape 325c189675

The GEMV kernel now accepts ROWS_PER_TG via buffer(5) instead of a
compile-time constexpr. Dispatch uses N <= 4096 -> 1 row/TG (maximize
parallelism for small outputs like k_proj/v_proj), N > 4096 -> 4 rows/TG.
Benchmarks show ~39.4ms decode (marginal improvement from ~40ms baseline)
as the dominant matmuls already saturate GPU bandwidth.

sleepy merged commit d0a3724b06 into main

2026-05-15 19:34:50 +02:00

sleepy referenced this pull request from a commit

2026-05-15 19:34:50 +02:00

[perf] Make ROWS_PER_TG runtime parameter, tune per matmul shape (#37) (#50)

sleepy referenced this pull request

2026-05-15 19:35:01 +02:00

perf: Tune threadgroup sizes per matmul shape #37