perf: Reach llama.cpp/MLX decode parity (37 tok/s target) #53

New issue

Open

opened 2026-05-15 21:47:43 +02:00 by sleepy · 2 comments

sleepy commented

2026-05-15 21:47:43 +02:00

Owner

Goal

Reach and surpass llama.cpp (37.3 tok/s) and MLX (35.5 tok/s) on M4 Max 36GB (410 GB/s). Neither is the ceiling — a closed-source Zig implementation allegedly beats both. We match them first, then exceed.

Current: 24.6 tok/s (35ms, 53% BW)

Parity: 37 tok/s (27ms, 69% BW)

Stretch: 40+ tok/s

Key Differences (llama.cpp vs MLX vs Ours)

Parameter	Ours	llama.cpp	MLX
Threads/TG	64 (2 SG)	128 (4 SG)	256 (8 SG)
Rows/TG	1	2	1
Dot product	manual bf16x4	native dot(float4,float4)	scalar FMA
Activation cache	None	Register yl4[4]	Register v_coeff[TN]
Kernel fusion	None	norm+mul, GLU, ADD chains	SDPA fused

Optimization Plan (Priority Order)

TG=128 threads, ROWS_PER_TG=2, native dot(), activation register cache (biggest single win)
Kernel fusion: GLU (SiLU+gate), norm+mul, ADD chains
Wider TG experiments (256 threads like MLX) — may beat llama.cpp
Dispatch overhead reduction

Do not assume llama.cpp/MLX strategies are optimal. Experiment beyond their configs.

## Goal Reach and surpass llama.cpp (37.3 tok/s) and MLX (35.5 tok/s) on M4 Max 36GB (410 GB/s). Neither is the ceiling — a closed-source Zig implementation allegedly beats both. We match them first, then exceed. ## Current: 24.6 tok/s (35ms, 53% BW) ## Parity: 37 tok/s (27ms, 69% BW) ## Stretch: 40+ tok/s ## Key Differences (llama.cpp vs MLX vs Ours) | Parameter | Ours | llama.cpp | MLX | |---|---|---|---| | Threads/TG | 64 (2 SG) | 128 (4 SG) | 256 (8 SG) | | Rows/TG | 1 | 2 | 1 | | Dot product | manual bf16x4 | native dot(float4,float4) | scalar FMA | | Activation cache | None | Register yl4[4] | Register v_coeff[TN] | | Kernel fusion | None | norm+mul, GLU, ADD chains | SDPA fused | ## Optimization Plan (Priority Order) 1. TG=128 threads, ROWS_PER_TG=2, native dot(), activation register cache (biggest single win) 2. Kernel fusion: GLU (SiLU+gate), norm+mul, ADD chains 3. Wider TG experiments (256 threads like MLX) — may beat llama.cpp 4. Dispatch overhead reduction Do not assume llama.cpp/MLX strategies are optimal. Experiment beyond their configs.

sleepy referenced this issue from a commit

2026-05-18 00:43:33 +02:00

perf(#53): llama-style GEMV kernel (128th/TG, NR0=2, reg cache) — no speedup

sleepy referenced this issue from a commit

2026-05-18 01:01:07 +02:00

Revert "perf(#53): llama-style GEMV kernel (128th/TG, NR0=2, reg cache) — no speedup"

sleepy referenced this issue from a commit

2026-05-18 01:01:07 +02:00

perf(#53): fuse residual_add + rms_norm into single kernel — 517→453 dispatches

sleepy commented

2026-05-18 01:02:55 +02:00

Author

Owner

Finding: We are purely memory-bandwidth bound

Tested 3 approaches — none moved the needle:

Optimization	Result
llama-style GEMV (128th/TG, NR0=2, reg cache, native dot)	34ms — no change
Kernel fusion (517→453 dispatches)	35ms — no change
Earlier: bf16x4 loads (#40)	41→35ms — only win so far

The activation vector (5KB) stays in L2 cache regardless of kernel layout. Dispatch overhead (~500 × 10μs = 5ms) is only 14% of decode time. The remaining 30ms is pure weight-read bandwidth.

Conclusion: At 7.6GB weight reads and 410 GB/s peak, theoretical floor is 18.5ms. We hit 35ms = 53% utilization. To reach 37 tok/s (27ms), we need 69% utilization — that gap is in memory access efficiency, not kernel structure.

Next step: Quantization (#39) is the only path to a step-change. Q4 would cut weight reads to ~1.9GB → theoretical floor of 4.6ms = 217 tok/s ceiling. Even with dequant overhead, 40+ tok/s is realistic.

## Finding: We are purely memory-bandwidth bound Tested 3 approaches — none moved the needle: | Optimization | Result | |---|---| | llama-style GEMV (128th/TG, NR0=2, reg cache, native dot) | 34ms — no change | | Kernel fusion (517→453 dispatches) | 35ms — no change | | Earlier: bf16x4 loads (#40) | 41→35ms — only win so far | The activation vector (5KB) stays in L2 cache regardless of kernel layout. Dispatch overhead (~500 × 10μs = 5ms) is only 14% of decode time. The remaining 30ms is pure weight-read bandwidth. **Conclusion**: At 7.6GB weight reads and 410 GB/s peak, theoretical floor is 18.5ms. We hit 35ms = 53% utilization. To reach 37 tok/s (27ms), we need 69% utilization — that gap is in memory access efficiency, not kernel structure. **Next step**: Quantization (#39) is the only path to a step-change. Q4 would cut weight reads to ~1.9GB → theoretical floor of 4.6ms = 217 tok/s ceiling. Even with dequant overhead, 40+ tok/s is realistic.

sleepy commented

2026-05-18 01:55:21 +02:00

Author

Owner

Per-Layer Profiling Results

embed=0.27ms
L00=1.11(la) L01=1.13(la) L02=1.23(la) L03=1.20(fa) L04=1.16(la) ...
L24=1.18(la) L25=1.16(la) L26=1.11(la) L27=1.13(fa) L28=1.18(la) L29=1.14(la) L30=1.18(la) L31=1.19(fa)
lm_head=3.83ms total=41.12ms (with profiling overhead)

Corrected (removing 0.18ms commit overhead per section):

32 layers: ~31ms (~0.97ms avg per layer)
lm_head: ~3.65ms
embed: ~0.27ms
Real total: ~35ms ✓

llama.cpp comparison:

Total: 27ms → per-layer: ~0.75ms
Our per-layer: ~0.97ms → 24% slower per layer
lm_head: ours ~3.65ms, llama.cpp estimated ~2.8ms

Key finding: No outlier layers. All uniformly ~24% slower. The gap is in kernel memory access efficiency, not architecture.

Remaining optimization paths:

Quantization (#39) — 4x less weight traffic, bypasses bandwidth ceiling
Deeper GEMV kernel tuning (memory coalescing, L2 prefetch patterns)
Double-buffered command encoding to overlap CPU dispatch with GPU execution

## Per-Layer Profiling Results ``` embed=0.27ms L00=1.11(la) L01=1.13(la) L02=1.23(la) L03=1.20(fa) L04=1.16(la) ... L24=1.18(la) L25=1.16(la) L26=1.11(la) L27=1.13(fa) L28=1.18(la) L29=1.14(la) L30=1.18(la) L31=1.19(fa) lm_head=3.83ms total=41.12ms (with profiling overhead) ``` **Corrected (removing 0.18ms commit overhead per section):** - 32 layers: ~31ms (~0.97ms avg per layer) - lm_head: ~3.65ms - embed: ~0.27ms - Real total: ~35ms ✓ **llama.cpp comparison:** - Total: 27ms → per-layer: ~0.75ms - Our per-layer: ~0.97ms → **24% slower per layer** - lm_head: ours ~3.65ms, llama.cpp estimated ~2.8ms **Key finding:** No outlier layers. All uniformly ~24% slower. The gap is in kernel memory access efficiency, not architecture. **Remaining optimization paths:** 1. Quantization (#39) — 4x less weight traffic, bypasses bandwidth ceiling 2. Deeper GEMV kernel tuning (memory coalescing, L2 prefetch patterns) 3. Double-buffered command encoding to overlap CPU dispatch with GPU execution

sleepy referenced this issue

2026-05-21 17:17:55 +02:00

bug: Model generates garbage output (repetitive tokens) #62