Reach 37+ tok/s decode target (match llama.cpp) #34

Open
opened 2026-05-15 00:50:48 +02:00 by sleepy · 0 comments

Target

Match or beat the llama.cpp baseline of 37.3 tok/s decode throughput on Qwen3.5-4B (BF16) on an M4 Max (36 GB).
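
For orientation, the throughput target can be turned into a wall-clock budget. The sketch below is plain arithmetic on the 37.3 tok/s figure and the 32-layer count mentioned in this issue. The helper name `decode_budget_ms` is hypothetical. The per-layer number is an upper bound, since it assumes the whole budget goes to the transformer layers and ignores embedding lookup, the final norm, the LM head, and sampling.

```python
def decode_budget_ms(tokens_per_second: float, num_layers: int) -> tuple[float, float]:
    """Return the (per-token, per-layer) wall-clock budget in milliseconds.

    Hypothetical helper for illustration; the per-layer value assumes the
    entire per-token budget is spent inside the transformer layers.
    """
    if tokens_per_second <= 0 or num_layers <= 0:
        raise ValueError("tokens_per_second and num_layers must be positive")
    per_token = 1000.0 / tokens_per_second
    return per_token, per_token / num_layers


# Figures taken from this issue: 37.3 tok/s target, 32 layers.
per_token_ms, per_layer_ms = decode_budget_ms(37.3, 32)
print(f"{per_token_ms:.2f} ms per token, {per_layer_ms:.3f} ms per layer")
# prints: 26.81 ms per token, 0.838 ms per layer
```

At under a millisecond per layer, any fixed per-layer cost (such as submitting one command buffer per layer) takes a visible share of the budget.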

Current blockers

  • #32: Decode correctness (must be fixed first)
  • #33: Linear attention still runs on CPU (24 of 32 layers)

Once those are fixed, the following optimizations will likely be needed:

  • Fused kernels (RMSNorm+residual, gate+up+swiglu in one dispatch)
  • Single command buffer for ALL 32 layers (currently one per layer)
  • Persistent buffer reuse (already partially done)
  • Batched attention kernel for prefill
  • Larger matmul tile sizes tuned for M4 GPU
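
Since coherent output is a hard requirement, each fused kernel probably wants a CPU reference to compare against. Below is a minimal pure-Python sketch for the RMSNorm+residual case. It assumes the fusion means "add the residual, then RMS-normalize the sum" and that `eps` is `1e-6`. Both are assumptions, not taken from the project, and the function names are hypothetical. The fused version does the add and the sum-of-squares accumulation in a single pass, which is the part a GPU kernel would fold into one dispatch.

```python
import math


def add_residual(x, residual):
    """Unfused step 1: elementwise residual add."""
    return [a + b for a, b in zip(x, residual)]


def rmsnorm(x, weight, eps=1e-6):
    """Unfused step 2: RMS-normalize x, then scale by the learned weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    scale = 1.0 / math.sqrt(mean_sq + eps)
    return [v * scale * w for v, w in zip(x, weight)]


def fused_add_rmsnorm(x, residual, weight, eps=1e-6):
    """Fused reference: add and accumulate sum of squares in one pass.

    Returns (hidden, normed), where hidden is the updated residual stream
    and normed is the input to the next projection. The fusion order and
    eps are assumptions for illustration.
    """
    hidden = []
    sum_sq = 0.0
    for a, b in zip(x, residual):
        h = a + b
        hidden.append(h)
        sum_sq += h * h
    scale = 1.0 / math.sqrt(sum_sq / len(hidden) + eps)
    normed = [h * scale * w for h, w in zip(hidden, weight)]
    return hidden, normed


# Tiny made-up vectors, only to compare the two code paths.
x = [0.5, -1.0, 2.0, 0.25]
residual = [1.0, 0.5, -0.5, 0.75]
weight = [1.0, 1.0, 0.5, 2.0]

hidden_ref = add_residual(x, residual)
normed_ref = rmsnorm(hidden_ref, weight)
hidden, normed = fused_add_rmsnorm(x, residual, weight)

max_err = max(abs(a - b) for a, b in zip(normed, normed_ref))
print("max abs error vs unfused reference:", max_err)
```

The same pattern (unfused reference, fused candidate, max-error check) would apply to the gate+up+SwiGLU fusion, with a looser tolerance when the GPU side computes in BF16.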

Prerequisites

  • #32 and #33 must be closed first
  • Coherent English output is mandatory

Max 2 attempts per optimization.

Reference
sleepy/sleepy-llm#34