[perf] Vectorized bf16x4 GEMV kernel for M=1 decode (#40) #51

Merged
sleepy merged 1 commit from perf/40-gemv-bandwidth-optimization into main 2026-05-15 20:23:47 +02:00

Summary

Adds a new `matmul_bf16_m1_vec4` kernel that uses vectorized bf16x4 loads (`packed_bfloat4`) with a 2-simdgroup (SG) reduction. It replaces scalar bf16 loads with 4-wide vector loads for better memory coalescing.
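The access pattern described above can be illustrated on the CPU. The following is a minimal Python sketch, not the Metal kernel from this PR: the function names, the 32-lane simdgroup width (inferred from 64 threads over 2 SGs), and the assumption that each output element reads one weight vector stored contiguously along K are all hypothetical. It shows each emulated thread reading 4 adjacent bf16 values per step, followed by a per-SG reduction and a final cross-SG sum.

```python
import struct

SIMD_WIDTH = 32  # lanes per simdgroup (assumed: 64 threads / 2 SGs)
NUM_SG = 2       # simdgroups per threadgroup, as stated in the PR
VEC = 4          # bf16 elements fetched per vector load


def to_bf16_bits(x: float) -> int:
    """Truncate a float32 to its upper 16 bits (the bf16 bit pattern)."""
    return struct.unpack("<I", struct.pack("<f", x))[0] >> 16


def from_bf16_bits(bits: int) -> float:
    """Widen a bf16 bit pattern to float32 by zero-filling the low 16 bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def gemv_row_vec4(a_bits, w_bits):
    """Emulate one output element: dot(a, w) over K with 4-wide loads.

    a_bits and w_bits are equal-length lists of bf16 bit patterns.
    Accumulation uses Python floats (float64), unlike a real GPU kernel.
    """
    k = len(a_bits)
    assert k % VEC == 0, "sketch assumes K is a multiple of 4"
    threads = SIMD_WIDTH * NUM_SG
    partial = [0.0] * threads

    # Adjacent threads read adjacent 4-element groups, so one pass of the
    # threadgroup touches threads * VEC contiguous elements.
    for tid in range(threads):
        for base in range(tid * VEC, k, threads * VEC):
            for j in range(VEC):
                partial[tid] += (from_bf16_bits(a_bits[base + j])
                                 * from_bf16_bits(w_bits[base + j]))

    # Stage 1: reduce the lanes of each simdgroup to one value.
    sg_sums = [sum(partial[sg * SIMD_WIDTH:(sg + 1) * SIMD_WIDTH])
               for sg in range(NUM_SG)]
    # Stage 2: combine the per-SG sums (the kernel uses threadgroup memory).
    return sum(sg_sums)


a = [to_bf16_bits(1.0)] * 512
w = [to_bf16_bits(float(i % 8)) for i in range(512)]
print(gemv_row_vec4(a, w))  # 1792.0
```

The inputs are small integers, which bf16 represents exactly, so the result matches a plain scalar dot product bit for bit. With arbitrary values the summation order could change the low bits of the result.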

Benchmark

| Metric | Baseline (main) | This PR | Change |
|--------|-----------------|---------|--------|
| Decode GPU time | 41ms | 35ms | -14.6% |
| Throughput | 21.5 tok/s | 24.6 tok/s | +14.4% |
| Eff. bandwidth | ~185 GB/s | ~216 GB/s | +17% |
| BW utilization | 45% | 53% | +8pp |

Output: token-identical to baseline (verified A/B).
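The Change column can be reproduced from the two measurement columns. The short Python check below is only a sketch of that arithmetic; the helper name `pct_change` is illustrative. It also derives the peak bandwidth implied by dividing effective bandwidth by utilization, a figure the PR does not state directly.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


decode_change = pct_change(41.0, 35.0)        # decode GPU time, ms
throughput_change = pct_change(21.5, 24.6)    # tok/s
bandwidth_change = pct_change(185.0, 216.0)   # GB/s
utilization_change_pp = 53 - 45               # percentage points

# Peak bandwidth implied by "effective / utilization" (derived, not measured).
implied_peak_baseline = 185.0 / 0.45
implied_peak_pr = 216.0 / 0.53

print(f"{decode_change:.1f}%")        # -14.6%
print(f"{throughput_change:.1f}%")    # 14.4%
print(f"{bandwidth_change:.0f}%")     # 17%
print(f"{implied_peak_baseline:.0f} GB/s, {implied_peak_pr:.0f} GB/s")
```

Both rows imply a peak of roughly 410 GB/s, so the bandwidth and utilization figures are consistent with each other to within rounding.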

Changes

  • matmul.metal: New matmul_bf16_m1_vec4 kernel with bf16x4 loads, 2 simdgroups (64 threads per threadgroup), and threadgroup shared memory for the cross-SG reduction
  • dispatch.zig: New set_matmul_bf16_m1_vec4 dispatch function
  • model.zig: forward_decode() switched to vec4 kernel

Remaining gap

The target is 37 tok/s (27ms); this PR reaches 24.6 tok/s (35ms). Further optimization is needed:

  • Better K tiling (BK=32 or 64 with threadgroup cache for A row)
  • Fuse operations (norm+matmul #43)
  • Eliminate dispatch overhead (#38 already saved 144 dispatches)
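The size of the remaining gap follows from the figures in this PR. The Python sketch below computes it. Reading the tok/s figures as end-to-end rates, so that 1000 / tok/s is the wall-clock time per token, is an assumption rather than something the PR states; under that reading the difference from decode GPU time is per-token time spent outside the GPU.

```python
current_tok_s, target_tok_s = 24.6, 37.0
baseline_tok_s = 21.5
current_gpu_ms, baseline_gpu_ms, target_ms = 35.0, 41.0, 27.0

# Throughput multiple still required to hit the target.
speedup_needed = target_tok_s / current_tok_s

# Reduction needed if 35ms must drop to 27ms, in percent.
time_cut_pct = (current_gpu_ms - target_ms) / current_gpu_ms * 100.0

# Assumed reading: tok/s is end-to-end, so 1000 / tok/s is wall ms per token.
wall_ms_current = 1000.0 / current_tok_s
wall_ms_baseline = 1000.0 / baseline_tok_s
wall_ms_target = 1000.0 / target_tok_s

# Per-token time not covered by decode GPU time under that assumption.
non_gpu_ms_current = wall_ms_current - current_gpu_ms
non_gpu_ms_baseline = wall_ms_baseline - baseline_gpu_ms

print(f"speedup needed: {speedup_needed:.2f}x")
print(f"time cut needed: {time_cut_pct:.1f}%")
print(f"wall ms/token: now {wall_ms_current:.1f}, target {wall_ms_target:.1f}")
print(f"non-GPU ms/token: now {non_gpu_ms_current:.1f}, "
      f"baseline {non_gpu_ms_baseline:.1f}")
```

Under this reading about 5.5ms per token falls outside decode GPU time in both runs, and the 27ms target corresponds to wall time at 37 tok/s. If that holds, kernel speedups alone would need to cut GPU time by more than the 8ms difference between 35ms and 27ms, which is where dispatch and fusion work would matter.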

Closes #40

New matmul_bf16_m1_vec4 kernel uses bfloat4 vectorized loads with
2-simdgroup reduction, replacing scalar loads in decode path.
Reduces decode from ~40ms to ~34ms (26 tok/s, ~220 GB/s).
sleepy merged commit 80532b0b26 into main 2026-05-15 20:23:47 +02:00
Reference
sleepy/sleepy-llm!51