[perf] Make ROWS_PER_TG runtime parameter, tune per matmul shape (#37) #50
Summary
Makes ROWS_PER_TG a runtime parameter (via buffer(5)) instead of compile-time constant. Adds matmul_rows_per_tg(N) lookup: returns 1 for N<=4096 (k_proj, v_proj, o_proj, down_proj), 4 for larger N.
Results
Analysis
Threadgroup tuning alone does not improve decode time. The dominant matmuls (gate/up_proj N=9216, lm_head N=152960) already saturate memory bandwidth with ROWS_PER_TG=4. Small-N matmuls (k_proj/v_proj N=1024) are only ~5% of weight traffic.
Getting past the ~45% bandwidth ceiling requires a fundamentally different kernel approach (see #40, #42).
This PR is still valuable as infrastructure — runtime ROWS_PER_TG enables future autotuning and MPP matmul2d integration.
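For reviewers who want to reason about the shape-based selection, the lookup described in the Summary can be sketched roughly as below. This is an illustrative Python model, not the code in this PR: the threshold (4096) and the return values (1 and 4) come from the Summary, while the constant name SMALL_N_THRESHOLD, the helper threadgroup_count, and the assumption that each threadgroup produces exactly ROWS_PER_TG output rows are hypothetical.

```python
# Illustrative model of the per-shape ROWS_PER_TG selection.
# Threshold and return values are taken from the PR summary;
# the names below (other than matmul_rows_per_tg) are hypothetical.

SMALL_N_THRESHOLD = 4096  # N <= 4096 covers k_proj, v_proj, o_proj, down_proj


def matmul_rows_per_tg(n: int) -> int:
    """Return the rows-per-threadgroup value for a matmul with N = n."""
    return 1 if n <= SMALL_N_THRESHOLD else 4


def threadgroup_count(n: int) -> int:
    """Number of threadgroups to dispatch.

    Assumption: each threadgroup computes exactly rows_per_tg output rows,
    so the grid size is ceil(n / rows_per_tg).
    """
    rows = matmul_rows_per_tg(n)
    return (n + rows - 1) // rows


if __name__ == "__main__":
    # Shapes mentioned in the Analysis section.
    for name, n in [("k_proj/v_proj", 1024), ("gate/up_proj", 9216), ("lm_head", 152960)]:
        print(name, n, matmul_rows_per_tg(n), threadgroup_count(n))
```

Because the value is chosen on the host and passed at dispatch time, an autotuner could later replace this fixed threshold with a measured per-shape table without recompiling the kernel.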
Closes #37