achieve MLX generation t/s — 22 t/s on 27B Q4_0 (M4 Max) #40

Open
opened 2026-05-01 00:24:09 +02:00 by sleepy · 0 comments
Owner

Goal

Match MLX generation throughput on 27B models. Target: 22 t/s on Qwen3.6-27B-Q4_0 at tg128 on M4 Max.
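The tg128 number above can be measured with `llama-bench`. A minimal sketch, assuming a local build and a placeholder model path (the actual GGUF filename is not specified in this issue):

```shell
# Hedged sketch: measure text-generation throughput (tg128) with llama-bench.
# -p 0 skips the prompt-processing test; -n 128 generates 128 tokens; -r 5
# repeats the run 5 times and reports mean t/s. Model path is a placeholder.
./build/bin/llama-bench \
  -m models/qwen3.6-27b-q4_0.gguf \
  -p 0 -n 128 -r 5
```

The reported `tg128` row is the figure to compare against the 22 t/s target.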

MLX Baseline

MLX-lm achieves ~22 t/s on 27B Q4_0. llama.cpp currently lags behind. This is the north star metric.

MLX Testing

MLX models are stored under ~/.omlx/models/.
Run MLX-lm benchmarks on the same model for comparison.
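One way to get the MLX-side number is the `mlx_lm.generate` CLI, which prints generation tokens-per-second after each run. A sketch, assuming a converted model directory under ~/.omlx/models/ (the directory name below is a placeholder):

```shell
# Hedged sketch: run an MLX-lm generation pass and read the reported
# generation t/s from its output. Model directory name is a placeholder.
mlx_lm.generate \
  --model ~/.omlx/models/qwen3.6-27b-q4_0 \
  --prompt "Explain KV caching in one paragraph." \
  --max-tokens 128
```

Use the same token count (128) as the llama.cpp tg128 runs so the two numbers are comparable.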

Onboarding — What to Read

  1. GIT.md — build, test, coherence, timeout, interactive mode instructions
  2. BENCHMARKS.md — all benchmark results, track progress toward 22 t/s
  3. ANALYSIS_QWEN3_5_MXFP4.md — MXFP4 format analysis
  4. Key source files (GIT.md):
    • ggml-metal.metal — Metal shader kernels
    • ggml-metal-device.cpp — Pipeline dispatch
    • ggml-metal-ops.cpp — Op encoding
    • ggml-metal-impl.h — Tuning params
  5. MLX reference: ../mlx-lm/mlx/include/mlx/backend/metal/kernels/quantized.h — qmv_fast_impl

When No More Issues Remain

If all tracked issues are resolved and we still have not hit 22 t/s:

  1. Profile llama.cpp vs MLX dispatch structure (op count, memory access, fusion)
  2. Compare kernel-level patterns with an Xcode GPU trace
  3. Investigate MLX optimizations not yet in llama.cpp
  4. The secondary orchestrator (M4 Pro) may have independent insights — use it for parallel investigation
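For step 2, a GPU trace of a short generation run can be captured from the command line with `xctrace` and opened in Instruments for side-by-side comparison with an MLX capture. A sketch, with placeholder binary and model paths:

```shell
# Hedged sketch: record a Metal System Trace of a short llama.cpp generation
# run. Open the resulting .trace bundle in Instruments to inspect kernel
# dispatches, occupancy, and memory traffic. Paths below are placeholders.
xctrace record --template 'Metal System Trace' \
  --output llama-tg128.trace \
  --launch -- ./build/bin/llama-cli \
    -m models/qwen3.6-27b-q4_0.gguf -n 128 -p "test"
```

Capturing the equivalent MLX run the same way makes the dispatch-structure comparison (op count, fusion) concrete rather than inferred.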

Progress Tracking

Record all benchmark results in BENCHMARKS.md with date and commit hash.
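A minimal sketch of the recording convention, assuming BENCHMARKS.md uses a markdown table (the 22.0 t/s figure below is a placeholder for the measured value, not a real result):

```shell
# Hedged sketch: append one benchmark row (date, commit hash, metric, result)
# to BENCHMARKS.md. The throughput value is a placeholder.
commit=$(git rev-parse --short HEAD 2>/dev/null || echo unknown)
today=$(date +%Y-%m-%d)
echo "| $today | $commit | tg128 | 22.0 t/s |" >> BENCHMARKS.md
tail -n 1 BENCHMARKS.md
```

Tying every result to a commit hash is what makes regressions bisectable later.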

sleepy added the perf label 2026-05-01 00:25:13 +02:00
sleepy changed title from [perf] achieve MLX generation t/s — 22 t/s on 27B Q4_0 (M4 Max) to achieve MLX generation t/s — 22 t/s on 27B Q4_0 (M4 Max) 2026-05-01 00:25:19 +02:00

Reference: sleepy/llama.cpp#40