[perf] Vectorized bf16x4 GEMV kernel for M=1 decode (#40) #51

Merged

sleepy merged 1 commit from perf/40-gemv-bandwidth-optimization into main

2026-05-15 20:23:47 +02:00

sleepy commented

2026-05-15 20:22:04 +02:00

Owner

Summary

New matmul_bf16_m1_vec4 kernel using vectorized bf16x4 loads (packed_bfloat4) with 2-SG reduction. Replaces scalar bf16 loads with 4-wide vector loads for better memory coalescing.

Benchmark

Metric	Baseline (main)	This PR	Change
Decode GPU time	41ms	35ms	-14.6%
Throughput	21.5 tok/s	24.6 tok/s	+14.4%
Eff. bandwidth	~185 GB/s	~216 GB/s	+17%
BW utilization	45%	53%	+8pp

Output: token-identical to baseline (verified A/B).

Changes

matmul.metal: New matmul_bf16_m1_vec4 kernel with bf16x4 loads, 2-SG (64 threads/TG), threadgroup shared memory for cross-SG reduction
dispatch.zig: New set_matmul_bf16_m1_vec4 dispatch function
model.zig: forward_decode() switched to vec4 kernel

Remaining gap

Target is 37 tok/s (27ms). Currently at 24.6 tok/s (35ms). Need further optimization:

Better K tiling (BK=32 or 64 with threadgroup cache for A row)
Fuse operations (norm+matmul #43)
Eliminate dispatch overhead (#38 already saved 144 dispatches)

Closes #40

## Summary New `matmul_bf16_m1_vec4` kernel using vectorized bf16x4 loads (packed_bfloat4) with 2-SG reduction. Replaces scalar bf16 loads with 4-wide vector loads for better memory coalescing. ## Benchmark | Metric | Baseline (main) | This PR | Change | |--------|----------------|---------|--------| | Decode GPU time | 41ms | 35ms | -14.6% | | Throughput | 21.5 tok/s | 24.6 tok/s | +14.4% | | Eff. bandwidth | ~185 GB/s | ~216 GB/s | +17% | | BW utilization | 45% | 53% | +8pp | Output: token-identical to baseline (verified A/B). ## Changes - `matmul.metal`: New `matmul_bf16_m1_vec4` kernel with bf16x4 loads, 2-SG (64 threads/TG), threadgroup shared memory for cross-SG reduction - `dispatch.zig`: New `set_matmul_bf16_m1_vec4` dispatch function - `model.zig`: `forward_decode()` switched to vec4 kernel ## Remaining gap Target is 37 tok/s (27ms). Currently at 24.6 tok/s (35ms). Need further optimization: - Better K tiling (BK=32 or 64 with threadgroup cache for A row) - Fuse operations (norm+matmul #43) - Eliminate dispatch overhead (#38 already saved 144 dispatches) Closes #40

sleepy added 1 commit

2026-05-15 20:22:05 +02:00

[perf] Vectorized bf16x4 GEMV kernel for M=1 decode (#40 ) d55075b38a

New matmul_bf16_m1_vec4 kernel uses bfloat4 vectorized loads with
2-simdgroup reduction, replacing scalar loads in decode path.
Reduces decode from ~40ms to ~34ms (26 tok/s, ~220 GB/s).

sleepy merged commit 80532b0b26 into main

2026-05-15 20:23:47 +02:00

sleepy referenced this pull request from a commit

2026-05-15 20:23:47 +02:00

[perf] Vectorized bf16x4 GEMV kernel for M=1 decode (#40) (#51)

sleepy referenced this pull request

2026-05-15 20:23:57 +02:00

perf: Increase GEMV bandwidth utilization from 45% to 70%+ #40