achieve MLX generation t/s — 22 t/s on 27B Q4_0 (M4 Max) #40

Open
opened 2026-05-01 00:24:09 +02:00 by sleepy · 0 comments
Owner

Goal

Match MLX generation throughput on 27B models. Target: 22 t/s on Qwen3.6-27B-Q4_0 at tg128 on M4 Max.
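The tg128 number above can be measured with `llama-bench`. A minimal sketch, assuming a local build and a placeholder model path (the actual GGUF filename is not specified in this issue):

```shell
# Hedged sketch: measure text-generation throughput (tg128) with llama-bench.
# -p 0 skips the prompt-processing test; -n 128 generates 128 tokens; -r 5
# repeats the run 5 times and reports mean t/s. Model path is a placeholder.
./build/bin/llama-bench \
  -m models/qwen3.6-27b-q4_0.gguf \
  -p 0 -n 128 -r 5
```

The reported `tg128` row is the figure to compare against the 22 t/s target.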

MLX Baseline

MLX-lm achieves ~22 t/s on 27B Q4_0. llama.cpp currently lags behind. This is the north star metric.

MLX Testing

MLX models are stored under ~/.omlx/models/.
Run MLX-lm benchmarks on the same model for comparison.
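One way to get the MLX-side number is the `mlx_lm.generate` CLI, which prints generation tokens-per-second after each run. A sketch, assuming a converted model directory under ~/.omlx/models/ (the directory name below is a placeholder):

```shell
# Hedged sketch: run an MLX-lm generation pass and read the reported
# generation t/s from its output. Model directory name is a placeholder.
mlx_lm.generate \
  --model ~/.omlx/models/qwen3.6-27b-q4_0 \
  --prompt "Explain KV caching in one paragraph." \
  --max-tokens 128
```

Use the same token count (128) as the llama.cpp tg128 runs so the two numbers are comparable.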

Onboarding — What to Read

  1. GIT.md — build, test, coherence, timeout, interactive mode instructions
  2. BENCHMARKS.md — all benchmark results, track progress toward 22 t/s
  3. ANALYSIS_QWEN3_5_MXFP4.md — MXFP4 format analysis
  4. Key source files (GIT.md):
    • ggml-metal.metal — Metal shader kernels
    • ggml-metal-device.cpp — Pipeline dispatch
    • ggml-metal-ops.cpp — Op encoding
    • ggml-metal-impl.h — Tuning params
  5. MLX reference: ../mlx-lm/mlx/include/mlx/backend/metal/kernels/quantized.h — qmv_fast_impl

When No More Issues Remain

If all tracked issues are resolved and we still have not hit 22 t/s:

  1. Profile llama.cpp vs MLX dispatch structure (op count, memory access, fusion)
  2. Compare kernel-level patterns with an Xcode GPU trace
  3. Investigate MLX optimizations not yet in llama.cpp
  4. The secondary orchestrator (M4 Pro) may have independent insights — use it for parallel investigation
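For step 2, a GPU trace of a short generation run can be captured from the command line with `xctrace` and opened in Instruments for side-by-side comparison with an MLX capture. A sketch, with placeholder binary and model paths:

```shell
# Hedged sketch: record a Metal System Trace of a short llama.cpp generation
# run. Open the resulting .trace bundle in Instruments to inspect kernel
# dispatches, occupancy, and memory traffic. Paths below are placeholders.
xctrace record --template 'Metal System Trace' \
  --output llama-tg128.trace \
  --launch -- ./build/bin/llama-cli \
    -m models/qwen3.6-27b-q4_0.gguf -n 128 -p "test"
```

Capturing the equivalent MLX run the same way makes the dispatch-structure comparison (op count, fusion) concrete rather than inferred.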

Progress Tracking

Record all benchmark results in BENCHMARKS.md with date and commit hash.
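A minimal sketch of the recording convention, assuming BENCHMARKS.md uses a markdown table (the 22.0 t/s figure below is a placeholder for the measured value, not a real result):

```shell
# Hedged sketch: append one benchmark row (date, commit hash, metric, result)
# to BENCHMARKS.md. The throughput value is a placeholder.
commit=$(git rev-parse --short HEAD 2>/dev/null || echo unknown)
today=$(date +%Y-%m-%d)
echo "| $today | $commit | tg128 | 22.0 t/s |" >> BENCHMARKS.md
tail -n 1 BENCHMARKS.md
```

Tying every result to a commit hash is what makes regressions bisectable later.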

sleepy added the perf label 2026-05-01 00:25:13 +02:00
sleepy changed title from [perf] achieve MLX generation t/s — 22 t/s on 27B Q4_0 (M4 Max) to achieve MLX generation t/s — 22 t/s on 27B Q4_0 (M4 Max) 2026-05-01 00:25:19 +02:00

Reference: sleepy/llama.cpp#40