perf: Tune threadgroup sizes per matmul shape #37
The current dispatch uses a fixed ROWS_PER_TG=4 for all matmul shapes. Different shapes need different tuning.
Autotune or use a lookup table based on (N, K) to select optimal (ROWS_PER_TG, threads_per_row).
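A minimal sketch of both options, written as host-side Python for illustration only. Everything here is an assumption rather than the project's actual code: the names `TgConfig`, `TUNING_TABLE`, `select_config`, `threadgroup_count`, and `autotune` are hypothetical, the table entries are placeholders that would need to come from real benchmarks, `threads_per_row=32` in the fallback is a guess, and N is assumed to be the number of output rows. Only the fallback `ROWS_PER_TG=4` comes from the issue.

```python
from typing import Callable, Dict, Iterable, NamedTuple, Tuple


class TgConfig(NamedTuple):
    """Hypothetical container for the two tunables named in the issue."""
    rows_per_tg: int
    threads_per_row: int


# Fallback: rows_per_tg=4 matches the current fixed value;
# threads_per_row=32 is an assumed placeholder.
DEFAULT_CONFIG = TgConfig(rows_per_tg=4, threads_per_row=32)

# Placeholder entries keyed by (N, K). These are NOT measured values.
TUNING_TABLE: Dict[Tuple[int, int], TgConfig] = {
    (4096, 4096): TgConfig(rows_per_tg=8, threads_per_row=32),
    (11008, 4096): TgConfig(rows_per_tg=4, threads_per_row=64),
}


def select_config(n: int, k: int,
                  table: Dict[Tuple[int, int], TgConfig] = TUNING_TABLE) -> TgConfig:
    """Look up a config for the shape, falling back to the default."""
    return table.get((n, k), DEFAULT_CONFIG)


def threadgroup_count(n: int, cfg: TgConfig) -> int:
    """Number of threadgroups needed to cover n rows (ceiling division)."""
    return (n + cfg.rows_per_tg - 1) // cfg.rows_per_tg


def autotune(n: int, k: int, candidates: Iterable[TgConfig],
             measure: Callable[[int, int, TgConfig], float]) -> TgConfig:
    """Return the candidate with the lowest time reported by `measure`.

    `measure(n, k, cfg)` is a caller-supplied benchmark that returns a
    duration; this sketch does not run any kernel itself.
    """
    return min(candidates, key=lambda cfg: measure(n, k, cfg))
```

The lookup table is cheap at dispatch time but only covers shapes someone has benchmarked; `autotune` could populate the table once per device and shape, with unknown shapes falling back to the current default.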
Merged via PR #50 (squash). Coherence: verified. Decode: 40ms→41ms (within noise, no regression). ROWS_PER_TG is now a runtime parameter, which enables future autotuning. Threadgroup tuning alone cannot break the 45% bandwidth ceiling; that requires #40/#42.