perf: Increase GEMV bandwidth utilization from 45% to 70%+ #40
Overview
The current simd-reduction GEMV kernel achieves ~183 GB/s of the 410 GB/s peak (45%). Target: 281+ GB/s (69%+).
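For reference, the percentages above follow directly from the quoted bandwidth figures. A minimal sketch of that arithmetic (the function and constant names are illustrative, not taken from the codebase):

```python
def bandwidth_utilization(achieved_gbps: float, peak_gbps: float) -> float:
    """Return achieved memory bandwidth as a fraction of peak."""
    if peak_gbps <= 0:
        raise ValueError("peak_gbps must be positive")
    return achieved_gbps / peak_gbps


# Figures quoted in this issue; names are hypothetical.
PEAK_GBPS = 410.0
CURRENT_GBPS = 183.0
TARGET_GBPS = 281.0

current = bandwidth_utilization(CURRENT_GBPS, PEAK_GBPS)
target = bandwidth_utilization(TARGET_GBPS, PEAK_GBPS)

print(f"current: {current:.1%}, target: {target:.1%}")
# prints "current: 44.6%, target: 68.5%"
```

Rounded to whole percentages, these are the 45% and 69% quoted above.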
Key Finding
matmul2d (MPP 16x16) is NOT faster on M4; llama.cpp confirms this. The correct path is simdgroup_matrix 8x8 MMA, which MLX uses for ALL matmul sizes, including M=1 decode.
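To illustrate how an M=1 product can be mapped onto 8x8 tiles at all, here is a plain-Python sketch of the idea. It is not the MLX or llama.cpp kernel and not Metal code. It assumes the single input row is zero-padded to an 8-row tile and that K and N are multiples of 8. `gemv_via_tiles` and `TILE` are hypothetical names.

```python
TILE = 8  # tile edge, mirroring the 8x8 simdgroup_matrix shape


def gemv_via_tiles(x, w):
    """Compute y = x @ w (x: length K, w: K x N) using 8x8 tile products.

    Assumption for this sketch: K and N are multiples of TILE.
    """
    k, n = len(x), len(w[0])
    assert k % TILE == 0 and n % TILE == 0
    y = [0.0] * n
    for n0 in range(0, n, TILE):
        # 8x8 accumulator tile; only row 0 ends up holding real results.
        acc = [[0.0] * TILE for _ in range(TILE)]
        for k0 in range(0, k, TILE):
            # A tile: row 0 is a slice of x, rows 1..7 are zero padding.
            a = [[0.0] * TILE for _ in range(TILE)]
            a[0] = [float(v) for v in x[k0:k0 + TILE]]
            # B tile: the matching 8x8 block of the weight matrix.
            b = [row[n0:n0 + TILE] for row in w[k0:k0 + TILE]]
            # One 8x8 multiply-accumulate: acc += a @ b.
            for i in range(TILE):
                for j in range(TILE):
                    acc[i][j] += sum(a[i][t] * b[t][j] for t in range(TILE))
        y[n0:n0 + TILE] = acc[0]
    return y
```

In this naive mapping, seven of the eight rows in each A tile are padding, so the sketch shows only the tiling structure and says nothing about performance.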
Implementation Plan
Reference
Merged via PR #51 (squash). Coherence: token-identical to baseline. Benchmark: decode 41 ms -> 35 ms (-14.6%), 21.5 -> 24.6 tok/s (+14.4%), 45% -> 53% BW utilization. The 37 tok/s target has not yet been reached; further optimization is needed.