perf: Reach llama.cpp/MLX decode parity (37 tok/s target) #53
Goal
Reach and surpass llama.cpp (37.3 tok/s) and MLX (35.5 tok/s) on M4 Max 36GB (410 GB/s). Neither is the ceiling — a closed-source Zig implementation allegedly beats both. We match them first, then exceed.
Current: 24.6 tok/s (35ms, 53% BW)
Parity: 37 tok/s (27ms, 69% BW)
Stretch: 40+ tok/s
Key Differences (llama.cpp vs MLX vs Ours)
Optimization Plan (Priority Order)
Do not assume llama.cpp/MLX strategies are optimal. Experiment beyond their configs.
Finding: We are purely memory-bandwidth bound
Tested 3 approaches — none moved the needle:
The activation vector (5KB) stays in L2 cache regardless of kernel layout. Dispatch overhead (~500 × 10μs = 5ms) is only 14% of decode time. The remaining 30ms is pure weight-read bandwidth.
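The dispatch-overhead estimate above can be reproduced with a few lines of arithmetic. This is a minimal sketch using only the figures quoted in this comment (about 500 dispatches, about 10 μs each, 35 ms decode time); the function name is illustrative, not part of the codebase.

```python
def dispatch_overhead(num_dispatches: int, cost_us: float, decode_ms: float):
    """Return (overhead in ms, overhead as a fraction of decode time)."""
    overhead_ms = num_dispatches * cost_us / 1000.0  # microseconds -> milliseconds
    return overhead_ms, overhead_ms / decode_ms


# Figures quoted above: ~500 dispatches at ~10 us each, 35 ms per token.
overhead_ms, share = dispatch_overhead(500, 10.0, 35.0)
remaining_ms = 35.0 - overhead_ms

print(f"dispatch overhead: {overhead_ms:.1f} ms ({share:.0%} of decode)")
print(f"remaining (weight reads): {remaining_ms:.1f} ms")
```

Even eliminating dispatch cost entirely would leave roughly 30 ms per token, which is still above the 27 ms parity target.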
Conclusion: At 7.6GB weight reads and 410 GB/s peak, theoretical floor is 18.5ms. We hit 35ms = 53% utilization. To reach 37 tok/s (27ms), we need 69% utilization — that gap is in memory access efficiency, not kernel structure.
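The floor and utilization figures follow directly from weight bytes divided by peak bandwidth. The sketch below reproduces them from the numbers in this comment (7.6 GB of weights, 410 GB/s peak); the helper names are illustrative, and it assumes every weight is read exactly once per decoded token.

```python
def floor_ms(weight_gb: float, peak_gb_per_s: float) -> float:
    """Minimum time per token if every weight is read once at peak bandwidth."""
    return weight_gb / peak_gb_per_s * 1000.0


def utilization(weight_gb: float, peak_gb_per_s: float, measured_ms: float) -> float:
    """Fraction of peak bandwidth implied by a measured per-token time."""
    return floor_ms(weight_gb, peak_gb_per_s) / measured_ms


WEIGHT_GB = 7.6   # weight bytes read per token (from the text)
PEAK_GBPS = 410.0  # M4 Max peak memory bandwidth (from the text)

print(f"theoretical floor: {floor_ms(WEIGHT_GB, PEAK_GBPS):.1f} ms")
print(f"current (35 ms):   {utilization(WEIGHT_GB, PEAK_GBPS, 35.0):.0%}")
print(f"parity (27 ms):    {utilization(WEIGHT_GB, PEAK_GBPS, 27.0):.0%}")
```

Holding the weight size fixed, the only lever in this model is the achieved fraction of peak bandwidth, which is why the gap is attributed to memory access efficiency.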
Next step: Quantization (#39) is the only path to a step-change. Q4 would cut weight reads to ~1.9GB → theoretical floor of 4.6ms = ~216 tok/s ceiling (410 / 1.9). Even with dequant overhead, 40+ tok/s is realistic.
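The same bandwidth model gives the Q4 estimate. This sketch assumes the current 7.6 GB corresponds to 16-bit weights (consistent with the 4x reduction to ~1.9 GB, but not stated explicitly) and treats Q4 as exactly 4 bits per weight; real Q4 formats also store per-block scales, so effective bits per weight are somewhat higher. Function names are illustrative.

```python
def quantized_gb(weight_gb: float, from_bits: float, to_bits: float) -> float:
    """Scale weight size by the ratio of bits per weight."""
    return weight_gb * to_bits / from_bits


def ceiling_tok_s(weight_gb: float, peak_gb_per_s: float) -> float:
    """Upper bound on decode rate if weight reads are the only cost."""
    return peak_gb_per_s / weight_gb


PEAK_GBPS = 410.0

# Assumption: 7.6 GB is a 16-bit model; 4.0 bits/weight ignores scale overhead.
q4_gb = quantized_gb(7.6, from_bits=16, to_bits=4)
floor = q4_gb / PEAK_GBPS * 1000.0

print(f"Q4 weights: {q4_gb:.2f} GB")
print(f"floor:      {floor:.2f} ms")
print(f"ceiling:    {ceiling_tok_s(q4_gb, PEAK_GBPS):.0f} tok/s")
```

Computing from the unrounded floor gives about 216 tok/s; rounding the floor to 4.6 ms first gives 217. Either way, the 40+ tok/s stretch target would need under 20% of peak bandwidth at this weight size, which leaves ample room for dequantization overhead.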
Per-Layer Profiling Results
Corrected (removing 0.18ms commit overhead per section):
llama.cpp comparison:
Key finding: No outlier layers. All uniformly ~24% slower. The gap is in kernel memory access efficiency, not architecture.
Remaining optimization paths: