Port contiguous weight reads to Q4_0 MUL_MAT kernel #29
Problem
MLX qmv_fast_impl achieves roughly 355 GB/s effective bandwidth on M4 Max for large models. The llama.cpp Q4_0 kernel achieves roughly 289 GB/s. Both use F32 accumulation and similar thread organization (2 SIMD groups × 4 rows). The key structural difference is the weight access pattern.
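For context, ggml stores Q4_0 weights as 18-byte blocks: one fp16 scale followed by 16 bytes holding 32 packed 4-bit values. Consecutive weight bytes of a row are therefore interrupted by a scale every 16 bytes. The CPU-side sketch below is not either project's kernel; it only contrasts that interleaved layout with a hypothetical "planar" layout in which scales and packed nibbles live in separate arrays, so the weight bytes are contiguous. The names `split_planes`, `dot_interleaved`, and `dot_planar` are assumptions made for illustration, and the planar layout is not claimed to match MLX's actual format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define QK4_0 32  /* weights per Q4_0 block */

/* ggml Q4_0 block: fp16 scale, then 16 bytes of packed nibbles (18 bytes). */
typedef struct {
    uint16_t d;              /* scale, stored as an IEEE fp16 bit pattern */
    uint8_t  qs[QK4_0 / 2];  /* low nibble: element j, high nibble: element j + 16 */
} block_q4_0;

/* Convert an fp16 bit pattern to float without compiler half support. */
static float half_to_float(uint16_t h) {
    uint32_t sign = (uint32_t)(h & 0x8000u) << 16;
    uint32_t exp  = (h >> 10) & 0x1Fu;
    uint32_t man  = h & 0x3FFu;
    uint32_t bits;
    if (exp == 0 && man == 0) {
        bits = sign;                                  /* signed zero */
    } else if (exp == 0) {
        exp = 113;                                    /* subnormal: renormalize */
        while ((man & 0x400u) == 0) { man <<= 1; exp--; }
        bits = sign | (exp << 23) | ((man & 0x3FFu) << 13);
    } else if (exp == 0x1Fu) {
        bits = sign | 0x7F800000u | (man << 13);      /* inf or NaN */
    } else {
        bits = sign | ((exp + 112u) << 23) | (man << 13);
    }
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}

/* Interleaved layout: each 16-byte run of weights is followed by the next
   block's 2-byte scale, so weight reads are strided by 18 bytes. */
static float dot_interleaved(const block_q4_0 *w, const float *x, size_t n) {
    assert(n % QK4_0 == 0);
    float sum = 0.0f;
    for (size_t b = 0; b < n / QK4_0; ++b) {
        float acc = 0.0f;
        for (int j = 0; j < QK4_0 / 2; ++j) {
            acc += (float)((w[b].qs[j] & 0x0F) - 8) * x[b * QK4_0 + j];
            acc += (float)((w[b].qs[j] >> 4) - 8) * x[b * QK4_0 + j + QK4_0 / 2];
        }
        sum += half_to_float(w[b].d) * acc;
    }
    return sum;
}

/* Hypothetical repack: copy scales and nibbles into two separate arrays. */
static void split_planes(const block_q4_0 *w, size_t nblocks,
                         uint16_t *scales, uint8_t *qs) {
    for (size_t b = 0; b < nblocks; ++b) {
        scales[b] = w[b].d;
        memcpy(qs + b * (QK4_0 / 2), w[b].qs, QK4_0 / 2);
    }
}

/* Planar layout: qs is one contiguous byte stream; scales are read separately. */
static float dot_planar(const uint16_t *scales, const uint8_t *qs,
                        const float *x, size_t n) {
    assert(n % QK4_0 == 0);
    float sum = 0.0f;
    for (size_t b = 0; b < n / QK4_0; ++b) {
        const uint8_t *q = qs + b * (QK4_0 / 2);
        float acc = 0.0f;
        for (int j = 0; j < QK4_0 / 2; ++j) {
            acc += (float)((q[j] & 0x0F) - 8) * x[b * QK4_0 + j];
            acc += (float)((q[j] >> 4) - 8) * x[b * QK4_0 + j + QK4_0 / 2];
        }
        sum += half_to_float(scales[b]) * acc;
    }
    return sum;
}
```

Both functions perform the same arithmetic in the same order, so they return identical results for the same weights; only the memory addresses touched differ. Whether the contiguous form helps on a GPU depends on how threads are mapped to bytes, which this sketch does not model.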
MLX vs llama.cpp weight access
MLX (quantized.h, qmv_fast_impl line 750):
llama.cpp (ggml-metal.metal, mul_vec_q_n_f32_impl):
Approach
Reference
MLX kernel source: oMLX app bundle at mlx/include/mlx/backend/metal/kernels/quantized.h