perf: Increase GEMV bandwidth utilization from 45% to 70%+ #40

Closed
opened 2026-05-15 19:02:26 +02:00 by sleepy · 1 comment
Owner

Overview

The current simd-reduction GEMV kernel achieves ~183 GB/s of the 410 GB/s peak (45%). Target: 281+ GB/s (69%+).
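As a quick check of the figures above, here is a minimal sketch of the utilization arithmetic. The helper names and the way bytes are counted (weights only, read once per GEMV) are assumptions for illustration, not taken from the project's benchmark code.

```python
# Figures quoted in this issue.
PEAK_GBPS = 410.0

def utilization(achieved_gbps, peak_gbps=PEAK_GBPS):
    """Fraction of peak memory bandwidth actually achieved."""
    return achieved_gbps / peak_gbps

def gemv_bandwidth_gbps(rows, cols, bytes_per_weight, seconds):
    """Effective bandwidth of one GEMV, assuming (hypothetically) that
    traffic is dominated by reading each weight exactly once."""
    return rows * cols * bytes_per_weight / seconds / 1e9

current = utilization(183.0)        # about 0.446
target = utilization(281.0)         # about 0.685
speedup_needed = 281.0 / 183.0      # about 1.54x more effective bandwidth
```

Under these assumptions, reaching the target means moving the same weights roughly 1.54x faster than the current kernel does.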

Key Finding

matmul2d (MPP 16x16) is NOT faster on M4; llama.cpp confirms this. The correct path is simdgroup_matrix 8x8 MMA, which is what MLX uses for ALL matmul sizes, including M=1 decode.

Implementation Plan

  1. Port MLX-style steel_gemm approach: simdgroup_matrix<float, 8, 8> + simdgroup_multiply_accumulate
  2. Each threadgroup processes a tile of the output: BM x BN (e.g., 1x64 or 2x32)
  3. K dimension tiled in BK steps (e.g., 32 or 64)
  4. The A vector (the M=1 row) is loaded into threadgroup memory; the B matrix (weight columns) is streamed from device memory
  5. 8x8 outer product accumulation using simdgroup_multiply_accumulate
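The plan above can be sketched as a CPU reference in plain Python. This is not the MLX or llama.cpp kernel: `mma_8x8` is a hypothetical stand-in for `simdgroup_multiply_accumulate`, the K x N row-major weight layout is an assumption, and the M=1 row is zero-padded to fill an 8x8 A tile. It is only meant to show how the tiling indices fit together and to serve as a correctness oracle.

```python
TILE = 8  # edge length of a simdgroup_matrix tile

def mma_8x8(acc, a, b):
    """acc += a @ b on 8x8 tiles stored as lists of rows.
    Hypothetical stand-in for simdgroup_multiply_accumulate."""
    for i in range(TILE):
        for j in range(TILE):
            s = acc[i][j]
            for k in range(TILE):
                s += a[i][k] * b[k][j]
            acc[i][j] = s

def gemv_tiled(x, w, bn=64, bk=32):
    """y = x @ w with x of length K and w stored as K rows of N columns
    (layout is an assumption). K and N must be multiples of bk and bn."""
    k_dim, n_dim = len(w), len(w[0])
    assert len(x) == k_dim
    assert k_dim % bk == 0 and n_dim % bn == 0
    assert bk % TILE == 0 and bn % TILE == 0
    tiles_per_group = bn // TILE
    y = [0.0] * n_dim
    for n0 in range(0, n_dim, bn):            # one "threadgroup" per 1 x BN output tile
        accs = [[[0.0] * TILE for _ in range(TILE)]
                for _ in range(tiles_per_group)]
        for k0 in range(0, k_dim, bk):        # K dimension tiled in BK steps
            x_tg = x[k0:k0 + bk]              # models the threadgroup-memory copy of A
            for kk in range(0, bk, TILE):
                # A tile: row 0 holds 8 vector elements, rows 1..7 are zero padding.
                a = [x_tg[kk:kk + TILE]] + [[0.0] * TILE for _ in range(TILE - 1)]
                for t in range(tiles_per_group):
                    c0 = n0 + t * TILE
                    # B tile: 8x8 block of weights, models streaming from device memory.
                    b = [w[k0 + kk + r][c0:c0 + TILE] for r in range(TILE)]
                    mma_8x8(accs[t], a, b)
        for t in range(tiles_per_group):
            # Only row 0 of each accumulator carries the M=1 result.
            y[n0 + t * TILE:n0 + (t + 1) * TILE] = accs[t][0]
    return y
```

With M=1, seven of the eight A-tile rows are padding, so most of the multiply work in each tile is discarded; the sketch makes that cost visible but says nothing about how it trades off against memory traffic on real hardware.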

Reference

  • MLX steel_gemm: simdgroup_matrix 8x8 for all M including M=1
  • llama.cpp kernel_mul_mm: same approach, 8x8 fallback (faster than matmul2d on M4)
  • Our current kernel uses scalar simd_sum reduction — switching to MMA gives hardware-batched multiply+accumulate
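For contrast with the MMA path, the reduction style mentioned in the last bullet can be modeled roughly as below. The actual kernel is not shown in this issue, so the lane striding and the K x N weight layout here are assumptions; `simd_sum` is emulated with Python's `sum`, and 32 is the simdgroup width on Apple GPUs.

```python
SIMD_WIDTH = 32  # lanes per simdgroup on Apple GPUs

def simd_sum(lane_values):
    """Emulates Metal's simd_sum: the sum of one value per lane."""
    return sum(lane_values)

def gemv_simd_reduction(x, w):
    """y = x @ w where every output element is a dot product whose terms are
    split across lanes and then reduced (hypothetical model of the current kernel)."""
    k_dim, n_dim = len(w), len(w[0])
    assert len(x) == k_dim
    y = []
    for c in range(n_dim):
        partial = [0.0] * SIMD_WIDTH
        for lane in range(SIMD_WIDTH):
            # Each lane walks K with a stride of SIMD_WIDTH, one scalar FMA per step.
            for r in range(lane, k_dim, SIMD_WIDTH):
                partial[lane] += x[r] * w[r][c]
        y.append(simd_sum(partial))
    return y
```

In this model every multiply-add is issued per lane and one cross-lane reduction is needed per output element, which is the part an 8x8 MMA instruction would batch.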
Author
Owner

Merged via PR #51 (squash). Coherence: token-identical to baseline. Benchmark: 41 ms -> 35 ms decode (-14.6%), 21.5 -> 24.6 tok/s (+14.4%), 45% -> 53% BW utilization. The 37 tok/s target is not yet reached; further optimization is needed.
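The percentages in this comment can be reproduced with a small helper. The function name is hypothetical, and the last line only restates the gap between the measured and target throughput quoted above.

```python
def pct_change(before, after):
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0

decode_change = pct_change(41.0, 35.0)   # about -14.6 (decode latency, ms)
tps_change = pct_change(21.5, 24.6)      # about +14.4 (throughput, tok/s)
remaining_speedup = 37.0 / 24.6          # about 1.50x still needed to reach 37 tok/s
```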

Reference
sleepy/sleepy-llm#40