Port contiguous weight reads to Q4_0 MUL_MAT kernel #29

Open
opened 2026-04-30 18:11:34 +02:00 by sleepy · 0 comments

Problem

MLX qmv_fast_impl achieves roughly 355 GB/s effective bandwidth on M4 Max for large models. llama.cpp Q4_0 achieves roughly 289 GB/s. Both use F32 accumulation and similar thread organization (2 SIMD groups x 4 rows). The key structural difference is weight access pattern.

MLX vs llama.cpp weight access

MLX (quantized.h, qmv_fast_impl line 750):

  • Each thread reads packs_per_thread contiguous uint32_t from weights
  • Iterates over K in blocks of block_size=256
  • Contiguous access = better memory coalescing on Apple GPU

llama.cpp (ggml-metal.metal, mul_vec_q_n_f32_impl):

  • Each thread reads interleaved uint16_t with stride QK4_0*16 (288 bytes)
  • Iterates over 16 blocks per SIMD group with strided access
  • Strided access = worse memory coalescing

Approach

  1. Implement a new Q4_0 kernel variant with contiguous weight reads
  2. Use packs_per_thread uint32_t reads (like MLX)
  3. Benchmark tg4096 on 9B before/after
  4. Verify correctness with perplexity check

Reference

MLX kernel source: oMLX app bundle at mlx/include/mlx/backend/metal/kernels/quantized.h

sleepy added the kernel label 2026-04-30 18:11:34 +02:00
Reference: sleepy/llama.cpp#29