Reach 37+ tok/s decode target (match llama.cpp) #34

Open

opened 2026-05-15 00:50:48 +02:00 by sleepy · 0 comments

sleepy (Owner) commented 2026-05-15 00:50:48 +02:00

## Target

Match or beat the llama.cpp baseline of 37.3 tok/s decode throughput on Qwen3.5-4B (BF16), on an M4 Max with 36 GB.
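To make the target concrete, the arithmetic below turns 37.3 tok/s into a per-token and per-layer time budget, plus the weight traffic it implies if every decode step reads all weights once. This is a rough sketch: the 32-layer count comes from this issue, while the 4.0e9 parameter count is only the nominal size suggested by the model name and is an assumption, as are the helper names.

```python
# Back-of-the-envelope budget for the decode target.
# ASSUMED_PARAMS is a nominal figure inferred from the "4B" name, not a measured count.

def token_budget_ms(tok_per_s: float) -> float:
    """Wall-clock time available per decoded token, in milliseconds."""
    return 1000.0 / tok_per_s

def implied_weight_bandwidth_gb_s(tok_per_s: float, n_params: float, bytes_per_param: int) -> float:
    """Weight bytes read per second if each token touches every weight exactly once."""
    return tok_per_s * n_params * bytes_per_param / 1e9

TARGET_TOK_S = 37.3     # llama.cpp baseline quoted in this issue
N_LAYERS = 32           # layer count quoted in this issue
ASSUMED_PARAMS = 4.0e9  # assumption: nominal parameter count
BF16_BYTES = 2          # bytes per BF16 weight

budget_ms = token_budget_ms(TARGET_TOK_S)
per_layer_ms = budget_ms / N_LAYERS
bandwidth = implied_weight_bandwidth_gb_s(TARGET_TOK_S, ASSUMED_PARAMS, BF16_BYTES)

print(f"per-token budget: {budget_ms:.2f} ms")
print(f"per-layer budget: {per_layer_ms:.3f} ms")
print(f"implied weight traffic: {bandwidth:.1f} GB/s")
```

Under these assumptions the budget is roughly 26.8 ms per token, or about 0.84 ms per layer, which suggests that any fixed per-layer submission cost has to stay well below a millisecond.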

## Current blockers

- #32: Decode correctness (must be fixed first)
- #33: Linear attention on CPU (24/32 layers)
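One possible way to track progress on #32 is to compare greedy-decoded token IDs against a reference run (for example, one produced by llama.cpp) and report the first position where they differ. The sketch below assumes both runs use the same prompt, tokenizer, and greedy sampling; the function name and the token IDs are hypothetical.

```python
from typing import Optional, Sequence

def first_divergence(ours: Sequence[int], reference: Sequence[int]) -> Optional[int]:
    """Return the index of the first differing token, or None if the sequences match."""
    for i, (a, b) in enumerate(zip(ours, reference)):
        if a != b:
            return i
    # One sequence is a strict prefix of the other: they diverge where the shorter ends.
    if len(ours) != len(reference):
        return min(len(ours), len(reference))
    return None

# Hypothetical token IDs, for illustration only.
reference_ids = [101, 7, 42, 9, 15]
our_ids = [101, 7, 42, 3, 15]
print(first_divergence(our_ids, reference_ids))
```

A late divergence can still be benign, since BF16 rounding may flip a near-tie between two tokens. A divergence in the first few tokens more likely points at a real bug.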

## Likely optimizations once the blockers are fixed

- Fused kernels (RMSNorm + residual; gate + up + SwiGLU in one dispatch)
- A single command buffer for all 32 layers (currently one per layer)
- Persistent buffer reuse (already partially done)
- Batched attention kernel for prefill
- Larger matmul tile sizes tuned for the M4 GPU
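Because coherent output is a hard requirement, each fused kernel can be checked against an unfused CPU reference before it is timed. The sketch below is not the project's kernel code. It is a plain-Python illustration of the two fusions named in the first item, with hypothetical function names. It assumes RMSNorm scales by `weight` directly (some model families use `1 + weight` instead) and `eps = 1e-6`.

```python
import math

def silu(z):
    return z / (1.0 + math.exp(-z))

def matvec(W, x):
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def add_then_rmsnorm_unfused(x, residual, weight, eps=1e-6):
    """Two passes: residual add, then RMSNorm over the result."""
    h = [a + b for a, b in zip(x, residual)]
    inv = 1.0 / math.sqrt(sum(v * v for v in h) / len(h) + eps)
    return h, [v * inv * w for v, w in zip(h, weight)]

def add_rmsnorm_fused(x, residual, weight, eps=1e-6):
    """Residual add and sum of squares accumulated in the same loop."""
    n = len(x)
    h = [0.0] * n
    sq = 0.0
    for i in range(n):
        h[i] = x[i] + residual[i]
        sq += h[i] * h[i]
    inv = 1.0 / math.sqrt(sq / n + eps)
    return h, [h[i] * inv * weight[i] for i in range(n)]

def swiglu_unfused(x, w_gate, w_up):
    """Three steps: gate projection, up projection, elementwise SwiGLU."""
    gate = matvec(w_gate, x)
    up = matvec(w_up, x)
    return [silu(g) * u for g, u in zip(gate, up)]

def swiglu_fused(x, w_gate, w_up):
    """Each output element computes both dot products and the activation at once."""
    out = []
    for g_row, u_row in zip(w_gate, w_up):
        g = 0.0
        u = 0.0
        for gw, uw, v in zip(g_row, u_row, x):
            g += gw * v
            u += uw * v
        out.append(silu(g) * u)
    return out

# Tiny made-up inputs, for illustration only.
x = [0.5, -1.0, 2.0, 0.25]
residual = [0.1, 0.2, -0.3, 0.4]
weight = [1.0, 0.5, 2.0, 1.5]
w_gate = [[0.1, -0.2, 0.3, 0.4], [0.0, 0.5, -0.5, 1.0]]
w_up = [[1.0, 0.0, -1.0, 0.5], [0.2, 0.2, 0.2, 0.2]]

_, y_ref = add_then_rmsnorm_unfused(x, residual, weight)
_, y_fused = add_rmsnorm_fused(x, residual, weight)
print("rmsnorm max abs diff:", max(abs(a - b) for a, b in zip(y_ref, y_fused)))

s_ref = swiglu_unfused(x, w_gate, w_up)
s_fused = swiglu_fused(x, w_gate, w_up)
print("swiglu max abs diff:", max(abs(a - b) for a, b in zip(s_ref, s_fused)))
```

On a GPU the fused versions save intermediate reads and writes as well as dispatches. The reference here only establishes that the fused arithmetic matches the unfused arithmetic within floating-point tolerance.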

## Prerequisites

- #32 and #33 must be closed first
- Coherent English output is mandatory

Max 2 attempts per optimization.
