perf: Increase GEMV bandwidth utilization from 45% to 70%+

sleepy commented

2026-05-15 19:02:26 +02:00

Owner

Overview

Current simd-reduction GEMV kernel achieves ~183 GB/s out of 410 GB/s peak (45%). Target: 281+ GB/s (69%+).
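As a quick check of the figures above, here is a minimal sketch of the utilization arithmetic. The helper name `utilization` is hypothetical; the three bandwidth numbers are the ones quoted in this issue.

```python
def utilization(achieved_gbps: float, peak_gbps: float) -> float:
    """Fraction of peak memory bandwidth that a kernel actually achieves."""
    return achieved_gbps / peak_gbps


PEAK_GBPS = 410.0     # peak bandwidth quoted in the overview
CURRENT_GBPS = 183.0  # current simd-reduction GEMV kernel
TARGET_GBPS = 281.0   # stated target

print(f"current: {utilization(CURRENT_GBPS, PEAK_GBPS):.1%}")  # current: 44.6%
print(f"target:  {utilization(TARGET_GBPS, PEAK_GBPS):.1%}")   # target:  68.5%
print(f"70% of peak: {0.70 * PEAK_GBPS:.0f} GB/s")             # 70% of peak: 287 GB/s
```

Note that 281 GB/s works out to about 68.5% of peak, so a strict 70% as in the title would need roughly 287 GB/s.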

Key Finding

matmul2d (MPP 16x16) is NOT faster than the 8x8 simdgroup_matrix path on M4; llama.cpp confirms this. The correct path is simdgroup_matrix 8x8 MMA, which is what MLX uses for ALL matmul sizes, including M=1 decode.

Implementation Plan

1. Port the MLX-style steel_gemm approach: simdgroup_matrix<float, 8, 8> + simdgroup_multiply_accumulate.
2. Each threadgroup processes a BM x BN tile of the output (e.g., 1x64 or 2x32).
3. Tile the K dimension in BK steps (e.g., 32 or 64).
4. Load the A vector (the M=1 row) into threadgroup memory; stream the B matrix (weight columns) from device memory.
5. Accumulate 8x8 outer products using simdgroup_multiply_accumulate.
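The plan above can be sketched on the CPU to check the tiling logic before writing any Metal. This is only an emulation under stated assumptions: `mma_8x8` is a hypothetical stand-in for `simdgroup_multiply_accumulate` (computing `a @ b + c` on one 8x8 tile), the M=1 row is zero-padded to an 8-row tile (an assumption of this sketch, not a confirmed detail of MLX or llama.cpp), K and N are assumed to be multiples of 8, and threadgroup memory staging and the BM x BN x BK threadgroup tiling are not modeled.

```python
TILE = 8  # edge length of one simdgroup_matrix<float, 8, 8> tile


def mma_8x8(a, b, c):
    """Return a @ b + c for 8x8 tiles (CPU stand-in for the hardware MMA)."""
    return [[c[i][j] + sum(a[i][k] * b[k][j] for k in range(TILE))
             for j in range(TILE)] for i in range(TILE)]


def gemv_naive(x, w):
    """Reference y = x @ w, with x of length K and w of shape K x N."""
    k_dim, n_dim = len(w), len(w[0])
    return [sum(x[k] * w[k][n] for k in range(k_dim)) for n in range(n_dim)]


def gemv_tiled(x, w):
    """Same result, computed as a sequence of 8x8 multiply-accumulates."""
    k_dim, n_dim = len(w), len(w[0])
    assert k_dim % TILE == 0 and n_dim % TILE == 0, "sketch assumes multiples of 8"
    y = [0] * n_dim
    for n0 in range(0, n_dim, TILE):          # one 8-column output tile at a time
        acc = [[0] * TILE for _ in range(TILE)]
        for k0 in range(0, k_dim, TILE):      # walk the K dimension in 8-wide steps
            # Row 0 holds the slice of x; rows 1..7 are zero padding (assumption).
            a_tile = [x[k0:k0 + TILE]] + [[0] * TILE for _ in range(TILE - 1)]
            b_tile = [row[n0:n0 + TILE] for row in w[k0:k0 + TILE]]
            acc = mma_8x8(a_tile, b_tile, acc)
        y[n0:n0 + TILE] = acc[0]              # only row 0 carries the M=1 result
    return y


# Integer inputs keep the comparison exact.
x = list(range(16))
w = [[(3 * k + n) % 7 for n in range(16)] for k in range(16)]
assert gemv_tiled(x, w) == gemv_naive(x, w)
```

In this emulation only row 0 of each accumulator tile holds useful data, which is worth keeping in mind when choosing BM for the real kernel.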

Reference

- MLX steel_gemm: simdgroup_matrix 8x8 for all M including M=1
- llama.cpp kernel_mul_mm: same approach, 8x8 fallback (faster than matmul2d on M4)
- Our current kernel uses scalar simd_sum reduction; switching to MMA gives hardware-batched multiply+accumulate

sleepy referenced this issue

2026-05-15 19:34:02 +02:00

[perf] Make ROWS_PER_TG runtime parameter, tune per matmul shape (#37) #50

sleepy referenced this issue

2026-05-15 19:47:01 +02:00

perf: MPP matmul2d for Apple Silicon M3+ #42

sleepy referenced this issue from a commit

2026-05-15 20:19:48 +02:00

[perf] Vectorized bf16x4 GEMV kernel for M=1 decode (#40)

sleepy referenced this issue from a pull request that will close it

2026-05-15 20:22:04 +02:00

[perf] Vectorized bf16x4 GEMV kernel for M=1 decode (#40) #51

sleepy referenced this issue from a commit

2026-05-15 20:23:47 +02:00

[perf] Vectorized bf16x4 GEMV kernel for M=1 decode (#40) (#51)

sleepy closed this issue

2026-05-15 20:23:47 +02:00

sleepy commented

2026-05-15 20:23:57 +02:00

Author

Owner

Merged via PR #51 (squash). Coherence: token-identical to baseline. Benchmark: 41ms->35ms decode (-14.6%), 21.5->24.6 tok/s (+14.4%), 45%->53% BW utilization. Target 37 tok/s not yet reached — further optimization needed.
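The percentages in the benchmark summary can be reproduced with a short sketch. The helper `pct_change` is hypothetical; the inputs are the figures quoted in the comment above.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


print(f"decode latency: {pct_change(41.0, 35.0):+.1f}%")  # decode latency: -14.6%
print(f"throughput:     {pct_change(21.5, 24.6):+.1f}%")  # throughput:     +14.4%
print(f"gap to target:  {pct_change(24.6, 37.0):+.1f}%")  # gap to target:  +50.4%
```

The last line shows that reaching 37 tok/s from 24.6 tok/s would take roughly a further 50% throughput gain.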

sleepy referenced this issue

2026-05-18 01:02:55 +02:00

perf: Reach llama.cpp/MLX decode parity (37 tok/s target) #53
