Port contiguous weight reads to Q4_0 MUL_MAT kernel #29

Open
opened 2026-04-30 18:11:34 +02:00 by sleepy · 0 comments

Problem

MLX qmv_fast_impl achieves roughly 355 GB/s effective bandwidth on M4 Max for large models. llama.cpp Q4_0 achieves roughly 289 GB/s. Both use F32 accumulation and similar thread organization (2 SIMD groups x 4 rows). The key structural difference is weight access pattern.

MLX vs llama.cpp weight access

MLX (quantized.h, qmv_fast_impl line 750):

  • Each thread reads packs_per_thread contiguous uint32_t from weights
  • Iterates over K in blocks of block_size=256
  • Contiguous access = better memory coalescing on Apple GPU

llama.cpp (ggml-metal.metal, mul_vec_q_n_f32_impl):

  • Each thread reads interleaved uint16_t with stride QK4_0*16 (288 bytes)
  • Iterates over 16 blocks per SIMD group with strided access
  • Strided access = worse memory coalescing

Approach

  1. Implement a new Q4_0 kernel variant with contiguous weight reads
  2. Use packs_per_thread uint32_t reads (like MLX)
  3. Benchmark tg4096 on 9B before/after
  4. Verify correctness with perplexity check

Reference

MLX kernel source: oMLX app bundle at mlx/include/mlx/backend/metal/kernels/quantized.h

sleepy added the kernel label 2026-04-30 18:11:34 +02:00
Reference: sleepy/llama.cpp#29