perf: Reach llama.cpp/MLX decode parity (37 tok/s target) #53

Open
opened 2026-05-15 21:47:43 +02:00 by sleepy · 2 comments
Owner

Goal

Reach and surpass llama.cpp (37.3 tok/s) and MLX (35.5 tok/s) on M4 Max 36GB (410 GB/s). Neither is the ceiling — a closed-source Zig implementation allegedly beats both. We match them first, then exceed.

Current: 24.6 tok/s (35ms, 53% BW)

Parity: 37 tok/s (27ms, 69% BW)

Stretch: 40+ tok/s

Key Differences (llama.cpp vs MLX vs Ours)

| Parameter | Ours | llama.cpp | MLX |
|---|---|---|---|
| Threads/TG | 64 (2 SG) | 128 (4 SG) | 256 (8 SG) |
| Rows/TG | 1 | 2 | 1 |
| Dot product | manual bf16x4 | native dot(float4,float4) | scalar FMA |
| Activation cache | None | Register yl4[4] | Register v_coeff[TN] |
| Kernel fusion | None | norm+mul, GLU, ADD chains | SDPA fused |

Optimization Plan (Priority Order)

  1. TG=128 threads, ROWS_PER_TG=2, native dot(), activation register cache (biggest single win)
  2. Kernel fusion: GLU (SiLU+gate), norm+mul, ADD chains
  3. Wider TG experiments (256 threads like MLX) — may beat llama.cpp
  4. Dispatch overhead reduction

Do not assume llama.cpp/MLX strategies are optimal. Experiment beyond their configs.
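To make item 1 of the plan concrete, here is a minimal CPU-side sketch of how a row-partitioned GEMV might split work across a threadgroup. It is a model of the indexing only, not the actual Metal kernel: the function name `gemv_threadgroup_model` and its parameters are hypothetical, and the final `sum` merely stands in for whatever simdgroup or threadgroup reduction the real kernel uses.

```python
def gemv_threadgroup_model(weights, x, threads_per_tg=128, rows_per_tg=2):
    """CPU model of a row-partitioned GEMV, y = W @ x (illustrative only)."""
    n_rows, n_cols = len(weights), len(x)
    y = [0.0] * n_rows
    # One "threadgroup" per block of rows_per_tg consecutive output rows.
    for row0 in range(0, n_rows, rows_per_tg):
        rows = range(row0, min(row0 + rows_per_tg, n_rows))
        partial = {r: [0.0] * threads_per_tg for r in rows}
        for tid in range(threads_per_tg):
            # Each thread walks the columns in 4-wide chunks, strided by TG size.
            for c0 in range(4 * tid, n_cols, 4 * threads_per_tg):
                x4 = x[c0:c0 + 4]  # activation chunk loaded once ("register cache")
                for r in rows:     # and reused for every row in the group
                    w4 = weights[r][c0:c0 + 4]
                    partial[r][tid] += sum(w * a for w, a in zip(w4, x4))
        for r in rows:
            y[r] = sum(partial[r])  # stands in for the cross-thread reduction
    return y
```

With `rows_per_tg=2`, each activation chunk is read once and used for two weight rows, which is the reuse the "activation register cache" row of the table refers to. Whether that reuse matters on real hardware is an empirical question this model cannot answer.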

Author
Owner

Finding: We are purely memory-bandwidth bound

Tested two new approaches; neither moved the needle. The earlier bf16x4 change (#40) is listed for reference as the only win so far:

| Optimization | Result |
|---|---|
| llama-style GEMV (128 threads/TG, NR0=2, register cache, native dot) | 34 ms, no change |
| Kernel fusion (517→453 dispatches) | 35 ms, no change |
| Earlier: bf16x4 loads (#40) | 41→35 ms, only win so far |

The activation vector (5KB) stays in L2 cache regardless of kernel layout. Dispatch overhead (~500 × 10μs = 5ms) is only 14% of decode time. The remaining 30ms is pure weight-read bandwidth.

Conclusion: At 7.6GB weight reads and 410 GB/s peak, theoretical floor is 18.5ms. We hit 35ms = 53% utilization. To reach 37 tok/s (27ms), we need 69% utilization — that gap is in memory access efficiency, not kernel structure.
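The arithmetic behind these figures can be reproduced with a short roofline-style calculation. This is a sketch using only the numbers quoted in this comment (7.6 GB of weight reads, 410 GB/s peak, ~500 dispatches at ~10 μs each); the constant and function names are illustrative.

```python
PEAK_BW_GBPS = 410.0   # peak memory bandwidth quoted for the M4 Max
WEIGHT_READ_GB = 7.6   # weight bytes read per decoded token

def floor_ms(weight_gb, bw_gbps=PEAK_BW_GBPS):
    """Lower bound on time per token if weight reads ran at peak bandwidth."""
    return weight_gb / bw_gbps * 1e3

def utilization(measured_ms, weight_gb=WEIGHT_READ_GB):
    """Fraction of peak bandwidth implied by a measured time per token."""
    return floor_ms(weight_gb) / measured_ms

floor = floor_ms(WEIGHT_READ_GB)       # about 18.5 ms
current_util = utilization(35.0)       # about 0.53
parity_util = utilization(27.0)        # about 0.69

dispatch_ms = 500 * 10e-3              # 500 dispatches x 10 us = 5 ms
dispatch_share = dispatch_ms / 35.0    # about 0.14

print(f"floor={floor:.1f} ms, now={current_util:.0%}, "
      f"parity={parity_util:.0%}, dispatch={dispatch_share:.0%}")
```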

Next step: Quantization (#39) is the only path to a step-change. Q4 would cut weight reads to ~1.9GB → theoretical floor of 4.6ms = 217 tok/s ceiling. Even with dequant overhead, 40+ tok/s is realistic.
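As a rough check on the quantization estimate, the same bandwidth model can be applied with fewer bits per weight. This sketch assumes the 7.6 GB figure corresponds to 16-bit weights and treats Q4 as exactly 4 bits per weight; real block-quantized formats also store per-block scales, so the effective bits per weight are somewhat higher. All names are illustrative.

```python
PEAK_BW_GBPS = 410.0
BF16_WEIGHT_GB = 7.6   # assumption: current weight reads are 16 bits per weight

def quantized_floor_ms(bits_per_weight, base_gb=BF16_WEIGHT_GB, base_bits=16.0):
    """Bandwidth-bound floor per token after shrinking the weights."""
    weight_gb = base_gb * bits_per_weight / base_bits
    return weight_gb / PEAK_BW_GBPS * 1e3

q4_floor = quantized_floor_ms(4.0)   # about 4.63 ms (1.9 GB of reads)
q4_ceiling = 1e3 / q4_floor          # about 216 tok/s

# Bandwidth utilization needed to hit 40 tok/s (25 ms per token) at Q4.
needed_util = q4_floor / (1e3 / 40.0)

print(f"Q4 floor={q4_floor:.2f} ms, ceiling={q4_ceiling:.0f} tok/s, "
      f"utilization needed for 40 tok/s={needed_util:.0%}")
```

The small difference from the 217 tok/s quoted above comes from rounding the floor to 4.6 ms before inverting it.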

Author
Owner

Per-Layer Profiling Results

embed=0.27ms
L00=1.11(la) L01=1.13(la) L02=1.23(la) L03=1.20(fa) L04=1.16(la) ...
L24=1.18(la) L25=1.16(la) L26=1.11(la) L27=1.13(fa) L28=1.18(la) L29=1.14(la) L30=1.18(la) L31=1.19(fa)
lm_head=3.83ms total=41.12ms (with profiling overhead)

Corrected (removing 0.18ms commit overhead per section):

  • 32 layers: ~31ms (~0.97ms avg per layer)
  • lm_head: ~3.65ms
  • embed: ~0.27ms
  • Real total: ~35ms ✓
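The correction can be reproduced from the raw numbers. This sketch assumes one 0.18 ms commit per profiled section and 34 sections in total (embed, 32 layers, lm_head); the section count is an assumption, and the per-layer list contains only the timings visible in the output above, not the elided ones.

```python
COMMIT_OVERHEAD_MS = 0.18
RAW_TOTAL_MS = 41.12
RAW_LM_HEAD_MS = 3.83
N_SECTIONS = 32 + 2   # assumption: embed + 32 layers + lm_head, one commit each

# Only the layer timings shown in the profile (L00-L04 and L24-L31).
VISIBLE_LAYERS_MS = [1.11, 1.13, 1.23, 1.20, 1.16,
                     1.18, 1.16, 1.11, 1.13, 1.18, 1.14, 1.18, 1.19]

corrected_total = RAW_TOTAL_MS - N_SECTIONS * COMMIT_OVERHEAD_MS   # about 35.0
corrected_lm_head = RAW_LM_HEAD_MS - COMMIT_OVERHEAD_MS            # about 3.65
visible_avg = sum(VISIBLE_LAYERS_MS) / len(VISIBLE_LAYERS_MS)
corrected_layer_avg = visible_avg - COMMIT_OVERHEAD_MS             # about 0.98

print(f"total={corrected_total:.2f} ms, lm_head={corrected_lm_head:.2f} ms, "
      f"layer avg (visible subset)={corrected_layer_avg:.2f} ms")
```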

llama.cpp comparison:

  • Total: 27ms → per-layer: ~0.75ms
  • Our per-layer: ~0.97ms → ~29% slower per layer (equivalently, llama.cpp takes ~23% less time)
  • lm_head: ours ~3.65ms, llama.cpp estimated ~2.8ms

Key finding: No outlier layers. All are uniformly ~29% slower (0.97 / 0.75 ≈ 1.29), and lm_head shows the same ratio. The gap is in kernel memory access efficiency, not architecture.
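The size of the per-layer gap depends on which side is taken as the baseline, so it helps to compute both directions from the figures above. A small sketch, with illustrative names:

```python
OURS_LAYER_MS = 0.97
LLAMA_LAYER_MS = 0.75   # derived above from llama.cpp's 27 ms total

# How much longer our layer takes, relative to llama.cpp.
extra_time = OURS_LAYER_MS / LLAMA_LAYER_MS - 1.0    # about 0.29

# How much time llama.cpp saves, relative to ours.
time_saved = 1.0 - LLAMA_LAYER_MS / OURS_LAYER_MS    # about 0.23

# The end-to-end totals give the same ratio, consistent with a uniform gap.
total_extra = 35.0 / 27.0 - 1.0                      # about 0.30

print(f"ours is {extra_time:.0%} slower; llama.cpp takes {time_saved:.0%} less time")
```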

Remaining optimization paths:

  1. Quantization (#39): 4x less weight traffic, which lowers the bandwidth-bound floor instead of requiring higher utilization
  2. Deeper GEMV kernel tuning (memory coalescing, L2 prefetch patterns)
  3. Double-buffered command encoding to overlap CPU dispatch with GPU execution
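For path 3, a simple timing model shows what double buffering could buy if CPU encode time is currently exposed. This is a sketch under stated assumptions: one token's work is split into equal chunks, each chunk has a fixed CPU encode time and GPU execute time, and at most `buffers` command buffers are in flight. The function names and the example numbers are hypothetical, not measurements.

```python
def pipelined_time(n_chunks, encode, execute, buffers=2):
    """Finish time when at most `buffers` command buffers are in flight."""
    enc_done = []  # time the CPU finishes encoding chunk i
    gpu_done = []  # time the GPU finishes executing chunk i
    for i in range(n_chunks):
        cpu_free = enc_done[i - 1] if i >= 1 else 0
        # A buffer slot frees up once chunk i - buffers has finished on the GPU.
        slot_free = gpu_done[i - buffers] if i >= buffers else 0
        enc_done.append(max(cpu_free, slot_free) + encode)
        gpu_free = gpu_done[i - 1] if i >= 1 else 0
        gpu_done.append(max(gpu_free, enc_done[i]) + execute)
    return gpu_done[-1] if gpu_done else 0

def serial_time(n_chunks, encode, execute):
    """Encode and execute strictly alternate: no overlap at all."""
    return n_chunks * (encode + execute)

# Example with made-up units: 4 chunks, encode=1, execute=3.
print(serial_time(4, 1, 3), pipelined_time(4, 1, 3))
```

In this model, when execution dominates, only the first chunk's encode time remains exposed, so the upper bound on the gain is the total encode time that is serialized today. If encoding already overlaps with GPU work, the gain would be smaller.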
Reference
sleepy/sleepy-llm#53