Compare llama.cpp and MLX dispatch structure #36

Open
opened 2026-04-30 18:11:37 +02:00 by sleepy · 0 comments
Owner

Problem

MLX achieves roughly 24% higher effective bandwidth than llama.cpp (355 vs 289 GB/s) on Qwen3.6-27B at 14+ GiB. The accumulation type is the same (F32) and thread organization is similar, so the gap likely comes from dispatch structure and memory access patterns.
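Quick sanity check on the gap figure quoted above (pure arithmetic on the two bandwidth numbers from this issue; the ratio works out to just under 24%):

```python
# Ratio of the two effective-bandwidth measurements quoted above.
mlx_bw = 355.0    # GB/s, MLX
llama_bw = 289.0  # GB/s, llama.cpp
gap = (mlx_bw / llama_bw - 1.0) * 100.0
print(f"MLX is {gap:.1f}% faster by effective bandwidth")
```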

llama.cpp profile (9B Q4_0, ctx=256)

  • Total ops per tick: 1833 (1151 real GPU dispatches, 682 zero-ops)
  • MUL_MAT reads: 4.8 GB/tick
  • Non-MUL_MAT reads: roughly 1.3 GB/tick
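Derived figures from the profile numbers above (a back-of-envelope sketch, assuming read traffic dominates and the 289 GB/s effective bandwidth applies to the whole tick):

```python
# Back-of-envelope numbers from the llama.cpp profile above.
mul_mat_gb = 4.8        # GB read per tick by MUL_MAT ops
other_gb = 1.3          # GB read per tick by everything else
real_dispatches = 1151  # real GPU dispatches per tick (zero-ops excluded)
bw = 289.0              # GB/s, measured effective bandwidth

total_gb = mul_mat_gb + other_gb                   # ~6.1 GB read per tick
tick_ms = total_gb / bw * 1e3                      # implied ms per tick if reads dominate
per_dispatch_mb = total_gb * 1e3 / real_dispatches # average MB read per real dispatch
print(f"{total_gb:.1f} GB/tick -> ~{tick_ms:.1f} ms/tick, ~{per_dispatch_mb:.1f} MB/dispatch")
```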

MLX investigation

  • How many Metal dispatches does MLX issue per decode tick?
  • Does MLX fuse ops differently (fewer view/reshape ops)?
  • What is MLX's effective dispatch count for an equivalent model?
  • Compare kernel-level dispatch patterns using Xcode GPUtrace
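For the kernel-level comparison, a minimal sketch of tallying dispatches per decode tick from a text export of a GPU trace. The `dispatch <kernel_name> ...` line format here is a hypothetical placeholder, not the actual Xcode GPUtrace export format:

```python
from collections import Counter

def count_dispatches(lines):
    """Tally kernel names from lines of the form 'dispatch <kernel_name> ...'.

    The line format is an assumed placeholder; adapt the parsing to
    whatever the real trace export looks like.
    """
    tally = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "dispatch":
            tally[parts[1]] += 1
    return tally

# Toy trace: two mul_mat dispatches and one norm dispatch.
trace = [
    "dispatch kernel_mul_mat_q4_0_f32 grid=(96,1,1)",
    "dispatch kernel_rms_norm grid=(1,1,1)",
    "dispatch kernel_mul_mat_q4_0_f32 grid=(96,1,1)",
]
tally = count_dispatches(trace)
print(tally.most_common())
```

Running the same tally over llama.cpp and MLX traces of one decode tick would make the dispatch-count and fusion differences directly comparable.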

Reference

MLX source: MLX application bundle at mlx/include/mlx/backend/metal/kernels/

sleepy added the profiling label 2026-04-30 18:11:37 +02:00

Reference: sleepy/llama.cpp#36