Compare llama.cpp and MLX dispatch structure #36

Open
opened 2026-04-30 18:11:37 +02:00 by sleepy · 0 comments
Owner

Problem

MLX achieves roughly 24% higher effective bandwidth than llama.cpp (355 vs 289 GB/s) on Qwen3.6-27B at 14+ GiB. The accumulation type is the same (F32) and thread organization is similar, so the gap likely comes from dispatch structure and memory access patterns.
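Quick sanity check on the gap figure quoted above (pure arithmetic on the two bandwidth numbers from this issue; the ratio works out to just under 24%):

```python
# Ratio of the two effective-bandwidth measurements quoted above.
mlx_bw = 355.0    # GB/s, MLX
llama_bw = 289.0  # GB/s, llama.cpp
gap = (mlx_bw / llama_bw - 1.0) * 100.0
print(f"MLX is {gap:.1f}% faster by effective bandwidth")
```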

llama.cpp profile (9B Q4_0, ctx=256)

  • Total ops per tick: 1833 (1151 real GPU dispatches, 682 zero-ops)
  • MUL_MAT reads: 4.8 GB/tick
  • Non-MUL_MAT reads: roughly 1.3 GB/tick
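Derived figures from the profile numbers above (a back-of-envelope sketch, assuming read traffic dominates and the 289 GB/s effective bandwidth applies to the whole tick):

```python
# Back-of-envelope numbers from the llama.cpp profile above.
mul_mat_gb = 4.8        # GB read per tick by MUL_MAT ops
other_gb = 1.3          # GB read per tick by everything else
real_dispatches = 1151  # real GPU dispatches per tick (zero-ops excluded)
bw = 289.0              # GB/s, measured effective bandwidth

total_gb = mul_mat_gb + other_gb                   # ~6.1 GB read per tick
tick_ms = total_gb / bw * 1e3                      # implied ms per tick if reads dominate
per_dispatch_mb = total_gb * 1e3 / real_dispatches # average MB read per real dispatch
print(f"{total_gb:.1f} GB/tick -> ~{tick_ms:.1f} ms/tick, ~{per_dispatch_mb:.1f} MB/dispatch")
```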

MLX investigation

  • How many Metal dispatches does MLX issue per decode tick?
  • Does MLX fuse ops differently (fewer view/reshape ops)?
  • What is MLX's effective dispatch count for an equivalent model?
  • Compare kernel-level dispatch patterns using Xcode GPUtrace
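For the kernel-level comparison, a minimal sketch of tallying dispatches per decode tick from a text export of a GPU trace. The `dispatch <kernel_name> ...` line format here is a hypothetical placeholder, not the actual Xcode GPUtrace export format:

```python
from collections import Counter

def count_dispatches(lines):
    """Tally kernel names from lines of the form 'dispatch <kernel_name> ...'.

    The line format is an assumed placeholder; adapt the parsing to
    whatever the real trace export looks like.
    """
    tally = Counter()
    for line in lines:
        parts = line.split()
        if len(parts) >= 2 and parts[0] == "dispatch":
            tally[parts[1]] += 1
    return tally

# Toy trace: two mul_mat dispatches and one norm dispatch.
trace = [
    "dispatch kernel_mul_mat_q4_0_f32 grid=(96,1,1)",
    "dispatch kernel_rms_norm grid=(1,1,1)",
    "dispatch kernel_mul_mat_q4_0_f32 grid=(96,1,1)",
]
tally = count_dispatches(trace)
print(tally.most_common())
```

Running the same tally over llama.cpp and MLX traces of one decode tick would make the dispatch-count and fusion differences directly comparable.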

Reference

MLX source: MLX application bundle at mlx/include/mlx/backend/metal/kernels/

sleepy added the profiling label 2026-04-30 18:11:37 +02:00

Reference: sleepy/llama.cpp#36