perf: MPP matmul2d for Apple Silicon M3+ #42

Open
opened 2026-05-15 19:02:26 +02:00 by sleepy · 1 comment
Owner

Overview

MPP matmul2d with 16x16 fragments for hardware-batched matmul.

NOT ACTIONABLE FOR M4 MAX. llama.cpp source confirms matmul2d is no faster than simdgroup 8x8 on M4. Only M5+ benefits. This issue is deprioritized until we run on M5+ hardware.

Issue #40 (simdgroup 8x8 MMA) is the correct optimization path for current hardware.
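
The gating described above (matmul2d only on M5+, simdgroup 8x8 everywhere else) could be sketched on the host side roughly as follows. This is a minimal illustration, not code from this repository or from llama.cpp: `MatmulPath`, `parse_m_series_generation`, and `select_matmul_path` are hypothetical names, and parsing the generation out of a device-name string such as "Apple M4 Max" is an assumption made for the example. A real implementation would more likely query the GPU family through Metal.

```cpp
#include <cctype>
#include <cstddef>
#include <optional>
#include <string>

// Hypothetical enum naming the two kernel paths discussed in this issue.
enum class MatmulPath { Simdgroup8x8, Matmul2d16x16 };

// Extract N from a device name containing a token like "M4" (e.g. "Apple M4 Max").
// Returns std::nullopt when no "M<digits>" token is present.
// Assumption: the device name follows the "Apple M<N> ..." convention.
std::optional<int> parse_m_series_generation(const std::string& device_name) {
    for (std::size_t i = 0; i + 1 < device_name.size(); ++i) {
        const bool at_token_start = (i == 0) || device_name[i - 1] == ' ';
        if (!at_token_start || device_name[i] != 'M') continue;

        std::size_t j = i + 1;
        int gen = 0;
        while (j < device_name.size() &&
               std::isdigit(static_cast<unsigned char>(device_name[j]))) {
            gen = gen * 10 + (device_name[j] - '0');
            ++j;
        }
        if (j > i + 1) return gen;  // at least one digit followed the 'M'
    }
    return std::nullopt;
}

// Pick matmul2d only for M5 or newer; fall back to simdgroup 8x8 otherwise,
// including when the generation cannot be determined.
MatmulPath select_matmul_path(const std::string& device_name) {
    const std::optional<int> gen = parse_m_series_generation(device_name);
    if (gen && *gen >= 5) return MatmulPath::Matmul2d16x16;
    return MatmulPath::Simdgroup8x8;
}
```

Falling back to simdgroup 8x8 for unrecognized devices keeps the default on the path that #40 targets, so the matmul2d path would only be taken on hardware where it is expected to help.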

Author
Owner

Not actionable for M4 Max. llama.cpp explicitly disables matmul2d on pre-M5 chips because M4 Max shows no significant difference versus simdgroup 8x8; only M5+ benefits. Deprioritizing this issue: #40 (simdgroup 8x8 MMA) is the correct path for our hardware.

Reference
sleepy/sleepy-llm#42