perf: MPP matmul2d for Apple Silicon M3+ #42


Open

opened 2026-05-15 19:02:26 +02:00 by sleepy · 1 comment

sleepy commented

2026-05-15 19:02:26 +02:00

Owner

Overview

Use MPP (Metal Performance Primitives) matmul2d with 16x16 fragments for hardware-batched matmul.

NOT ACTIONABLE FOR M4 MAX. The llama.cpp source confirms matmul2d is no faster than simdgroup 8x8 on M4; only M5+ benefits. This issue is deprioritized until we can run on M5+ hardware.

Issue #40 (simdgroup 8x8 MMA) is the correct optimization path for current hardware.


sleepy referenced this issue

2026-05-15 19:34:02 +02:00

[perf] Make ROWS_PER_TG runtime parameter, tune per matmul shape (#37) #50

sleepy commented

2026-05-15 19:47:01 +02:00

Author

Owner

Not actionable for M4 Max. llama.cpp explicitly disables matmul2d on pre-M5 chips because M4 Max shows no significant difference versus simdgroup 8x8; only M5+ benefits. Deprioritizing this issue: #40 (simdgroup 8x8 MMA) is the correct path for our hardware.
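For reference, a minimal sketch of what such a generation gate could look like on our side. Everything here is hypothetical: the function names, the path labels, and especially the detection method. Parsing the device name is only a stand-in for illustration; it is not how llama.cpp detects support, and a real gate would more likely query a device capability.

```python
import re
from typing import Optional


def chip_generation(device_name: str) -> Optional[int]:
    """Extract the M-series generation from a name such as 'Apple M4 Max'.

    Hypothetical helper: name parsing is an assumption made for this sketch.
    Returns None when no M-series generation is found.
    """
    match = re.search(r"\bM(\d+)\b", device_name)
    return int(match.group(1)) if match else None


def select_matmul_path(device_name: str, min_matmul2d_gen: int = 5) -> str:
    """Pick a matmul kernel path (labels are illustrative, not real kernel names).

    matmul2d is chosen only for generation >= min_matmul2d_gen (M5+ by default);
    everything else, including unknown devices, falls back to simdgroup 8x8.
    """
    gen = chip_generation(device_name)
    if gen is not None and gen >= min_matmul2d_gen:
        return "mpp_matmul2d_16x16"
    return "simdgroup_8x8"
```

With this gate, `select_matmul_path("Apple M4 Max")` falls back to the simdgroup 8x8 path, and the matmul2d path only becomes reachable once we run on M5+ hardware. Keeping the threshold as a parameter makes it easy to revisit if benchmarks on other generations say otherwise.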
