perf: MPP matmul2d for Apple Silicon M3+ #42
Overview
MPP matmul2d with 16x16 fragments for hardware-batched matmul.
NOT ACTIONABLE FOR M4 MAX. llama.cpp source confirms matmul2d is no faster than simdgroup 8x8 on M4. Only M5+ benefits. This issue is deprioritized until we run on M5+ hardware.
Issue #40 (simdgroup 8x8 MMA) is the correct optimization path for current hardware.
Not actionable for M4 Max. llama.cpp explicitly disables matmul2d for pre-M5 chips: M4 Max shows no significant difference vs simdgroup 8x8. Only M5+ benefits. Deprioritizing: #40 (simdgroup 8x8 MMA) is the correct path for our hardware.
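For reference, the gating described above could be sketched roughly as follows. This is a minimal illustration, not llama.cpp's actual check and not code from this repository: the function names, the path labels, and the idea of parsing the generation out of the device name string are all assumptions made for the example. The only thing taken from this issue is the threshold (matmul2d for M5+, simdgroup 8x8 otherwise).

```python
import re

# Threshold taken from this issue: only M5 and later are expected to benefit
# from matmul2d. Everything else here is hypothetical.
MATMUL2D_MIN_GENERATION = 5


def apple_generation(device_name):
    """Return the M-series generation from a name like 'Apple M4 Max', or None.

    Parsing the device name is an assumption for illustration; a real
    implementation would more likely query a GPU family or feature flag.
    """
    match = re.search(r"\bM(\d+)\b", device_name)
    return int(match.group(1)) if match else None


def select_matmul_path(device_name):
    """Pick a kernel path label (hypothetical names) for the given device."""
    generation = apple_generation(device_name)
    if generation is not None and generation >= MATMUL2D_MIN_GENERATION:
        return "matmul2d_16x16"
    # Pre-M5 chips and unrecognized devices fall back to the path from #40.
    return "simdgroup_8x8"


if __name__ == "__main__":
    for name in ("Apple M3 Pro", "Apple M4 Max", "Apple M5"):
        print(name, "->", select_matmul_path(name))
```

Falling back to simdgroup 8x8 for unrecognized devices keeps the default on the path that is known to work on current hardware.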