llama.cpp/ANALYSIS_QWEN3_5_MXFP4.md
Kaloyan Nikolov 222626cfdc
Qwen3.6-27B MXFP4 → GGUF & Metal Performance Analysis

Date: 2026-04-30

Bottom line: The ~30% tg TPS gap (llama.cpp ~18 vs MLX ~24) is NOT from quant format or F16 accumulation. Root causes identified: (1) 1151 GPU dispatches per tick with high per-dispatch overhead, (2) 682 zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE) that still require encoding, (3) MUL_MAT kernel memory access patterns, (4) non-MUL_MAT ops (GET_ROWS, CPY, SET_ROWS) that read/write ~400 MB/tick on top of the 4.8 GB weight reads.


1. Model Architecture

  • HF class: Qwen3_5ForConditionalGeneration, GGUF arch: qwen35
  • 64 layers: 3 linear_attention + 1 full_attention per 4 (GatedDeltaNet + GQA)
  • Linear attn: k_heads=16, v_heads=48, k_dim=128, v_dim=128
  • Full attn: heads=24, kv_heads=4, head_dim=256, partial RoPE factor=0.25
  • V-head reordering required (grouped→tiled)
  • Config: /Volumes/FastStore/hugging/Qwen3.6-27B/config.json

2. MXFP4 Tensor Format

Sources: /Users/sleepy/.omlx/models/Qwen3.6-27B-mxfp4/ (3 shards, ~14.9GB)

  • Weights: *.weight dtype=U32, shape [out, in/8] — 8 nibbles per uint32
  • Scales: *.scales dtype=U8, shape [out, in/32] — E4M3 unsigned (bias=7)
  • Non-quantized (BF16): layernorm, conv1d, A_log, dt_bias, norm, vision tower
  • Prefix: language_model.model.layers.N (differs from BF16 model.language_model.layers.N)
  • Full attention has separate Q/K/V projections (not fused in_proj_qkv)
  • conv1d.weight shape differs: BF16 [10240, 1, 4] vs MXFP4 [10240, 4, 1]
  • GGML expects E8M0 scales; MLX stores E4M3. Convert via ue4m3_to_fp32() + fp32_to_ue4m3()
  • Nibble packing differs from GptOss MoE format; must verify against BF16 ground truth
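The FP4 payload itself is the standard OCP E2M1 value set {0, 0.5, 1, 1.5, 2, 3, 4, 6} with a sign bit, so the unpacking can be sketched independently of the scale question. A minimal host-side Python sketch, assuming little-endian nibble order within each uint32 (nibble i = bits [4i, 4i+4)); as noted above, the actual order must still be verified against the BF16 ground truth:

```python
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 magnitudes; bit 3 is sign

def fp4_to_float(nib: int) -> float:
    """Decode one FP4 (E2M1) nibble."""
    mag = E2M1[nib & 7]
    return -mag if nib & 8 else mag

def unpack_u32(word: int) -> list[float]:
    """Decode the 8 nibbles of one packed uint32 (assumed little-endian nibble order)."""
    return [fp4_to_float((word >> (4 * i)) & 0xF) for i in range(8)]
```

Multiplying each decoded value by the block's decoded scale gives the dequantized weight.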

3. Converter Status

  • GptOssModel (line 12143): Only existing MXFP4 handler, MoE-specific, not reusable
  • Qwen3_5TextModel (line 5435): No MXFP4 support, will crash on MXFP4 weights
  • _LinearAttentionVReorderBase (line 5267): Has NVFP4 V-head reordering template
  • Detection works: quantization.mode == "mxfp4" from config.json

4. Benchmarks

MLX (M4 Max)

Model            Quant      Size      tg 1K   tg 64K
Text-mxfp4-mlx   MXFP4      14.0 GB    23.8     12.0
mxfp4            MXFP4      14.9 GB    23.7     12.4
4bit             affine-4b  15.7 GB    22.6     12.0
oQ4              oQ4        16.3 GB    21.7     11.7

llama.cpp (M4 Max)

Format   Size      tg 1K (q8)   tg 4K (q8)
IQ4_XS   15.4 GB      18.3         18.4
Q4_0     15.8 GB      18.0         18.0
IQ4_NL   16.1 GB      17.8         17.9

All three GGUF formats within 3% of each other. Bandwidth dominates. KV cache type (q4 vs q8) has zero effect.

MLX effective bandwidth at 15.7 GB: 22.6 × 15.7 = 355 GB/s. llama.cpp at 16.1 GB: 17.8 × 16.1 = 287 GB/s. That's 19% less bandwidth utilization — the kernel gap.
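The back-of-envelope above can be reproduced directly (it assumes one full pass over the weights per generated token, which is the decode-phase lower bound):

```python
def eff_bw_gbps(tg_tps: float, size_gb: float) -> float:
    """Effective decode bandwidth = tokens/s x model bytes read per token."""
    return tg_tps * size_gb

mlx  = eff_bw_gbps(22.6, 15.7)   # MLX affine-4b
lcpp = eff_bw_gbps(17.8, 16.1)   # llama.cpp IQ4_NL
gap  = 1.0 - lcpp / mlx          # fraction of bandwidth left on the table
```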

5. Root Cause: F32 Accumulation

Every llama.cpp Metal kernel accumulates in F32. MLX accumulates in F16. Apple GPU does F16 FMA at 2× F32 rate.

Q4_0 decode kernel (block_q_n_dot_y, line 3228):

float d = qb_curr->d;                   // F32 scale
float4 acc1 = {0.f};                    // F32 accumulator
acc1 += yl[i] * (qs[i/2] & mask);       // F32×F32 FMA (excerpted from the loop)
return d * (sumy * -8.f + acc1[0] + acc1[1] + acc1[2] + acc1[3]);  // F32 final (abridged)

IQ4_NL decode kernel (line 8884):

shmem_f32[tiisg] = kvalues_iq4nl_f[tiisg%16];  // F32 lookup
sumf[row] += (float)xb.d * (acc1[0] + ...);       // F32 accumulate

MLX decode kernel (fp_qmv_fast):

half converted = as_type<half>(ushort((bits & 7) << 9)); // 2 ops to half
converted *= 16384.0;                                    // half multiply
return bits & 8 ? -converted : converted;               // apply sign
// Then: simdgroup_matrix<half,8,8> accumulation = F16×F16 throughout
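The MLX bit trick can be verified off-device: shifting the 3 magnitude bits to bit 9 lands the E2M1 exponent in the half exponent field (and its mantissa bit at the top of the half mantissa), and the ×16384 (= 2^14) lifts the resulting tiny/subnormal half back to the real E2M1 value. A quick Python check using struct's binary16 codec:

```python
import struct

E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]  # FP4 magnitude table

def mlx_decode(bits: int) -> float:
    # Host-side model of: as_type<half>(ushort((bits & 7) << 9)) * 16384.0
    h = struct.unpack("<e", struct.pack("<H", (bits & 7) << 9))[0]
    h *= 16384.0  # 2**14
    return -h if bits & 8 else h  # apply sign bit

assert [mlx_decode(b) for b in range(8)] == E2M1
```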

Impact: ~15-20% of the gap. The rest is threadgroup occupancy (N_R0=2 vs 4) and kernel dispatch overhead (~3-5%).

6. Format Alignment Analysis

Format   bpw     Block bytes   4B-align   N_R0   Lookup   Dequant
Q4_0     4.50         18          No        4      No     Fused dot
IQ4_NL   4.50         18          No        2      Yes    Table+F32
IQ4_XS   ~4.25       136          Yes       2      Yes    Sub-block scale
MXFP4    4.25         17          No        2      Yes    Shift+table+F32

No format is "aligned" to Apple SIMD. Q4_0 is closest (simplest kernel, highest occupancy) but still accumulates F32. MXFP4 has the best compression (4.25 bpw) but needs E8M0 conversion. IQ4_XS is smallest GGUF but has the most complex kernel.

No new format needed. F16 accumulation on existing formats is the path.

7. Key File Paths

Converter: convert_hf_to_gguf.py lines 5435 (Qwen3_5TextModel), 5267 (_LinearAttentionVReorderBase), 734 (quant detection), 12143 (GptOss MXFP4)

Metal kernels: ggml/src/ggml-metal/ggml-metal.metal lines 3228 (Q4_0 dot), 8850 (IQ4_NL mul_mv), 8960 (IQ4_XS mul_mv), 9069 (MXFP4 mul_mv), 597-625 (MXFP4 dequant)

Tuning: ggml/src/ggml-metal/ggml-metal-impl.h — N_R0/N_SG constants

GGML format: ggml/src/ggml-common.h line 204 (block_mxfp4), gguf-py/gguf/quants.py line 656 (MXFP4 quant)

Model architecture: src/models/qwen35.cpp, src/models/delta-net-base.cpp

oMLX: /Applications/oMLX.app/Contents/Resources/omlx/patches/qwen3_5_attention.py (RoPE fix), gated_delta_advance.py (cache fix), turboquant_kv.py (codebook KV), specprefill.py (sparse prefill)

MLX: .../mlx/include/mlx/backend/metal/kernels/fp4.h (F16 E2M1), fp_quantized.h (MXFP4 GEMM)


8. IMPLEMENTATION PLAN: F16 ACCUMULATION KERNELS

⚠️ This is a non-tested proposal. Treat as pseudocode. Actual implementation may need adjustments for register pressure, threadgroup memory limits, and hardware-specific tuning. Always benchmark before/after on target hardware.

8.1 Create dequantize_*_half variants

File: ggml/src/ggml-metal/ggml-metal.metal

For each quant format that will get F16 decode kernels, add a half-precision dequant function. These output half4x4 instead of float4x4.

Q4_0 (currently at line 172):

// EXISTING:
void dequantize_q4_0(device const block_q4_0 * xb, short il, thread type4x4 & reg) {
    // outputs float4x4
}

// PROPOSED: add dequantize_q4_0_half outputting half4x4
void dequantize_q4_0_half(device const block_q4_0 * xb, short il, thread half4x4 & reg) {
    device const uint16_t * qs = ((device const uint16_t *)xb + 1);
    // Mirror the F32 version: il selects low/high nibbles, mask1 the high byte
    const ushort mask0 = il ? 0x00F0 : 0x000F;
    const ushort mask1 = mask0 << 8;
    const half d1 = il ? ((half)xb->d / 16.h) : (half)xb->d;
    const half d2 = d1 / 256.h;
    const half md = -8.h * (half)xb->d;
    // Same nibble extraction, but multiply in half:
    for (int i = 0; i < 8; ++i) {
        reg[i/2][2*(i%2)+0] = d1 * (half)(qs[i] & mask0) + md;
        reg[i/2][2*(i%2)+1] = d2 * (half)(qs[i] & mask1) + md;
    }
}

IQ4_NL (line 921):

// PROPOSED: add dequantize_iq4_nl_half; the threadgroup half lookup table is
// passed in explicitly, since dequant helpers have no shmem argument of their
// own (table loaded per §8.3)
void dequantize_iq4_nl_half(device const block_iq4_nl * xb, short il,
                            threadgroup const half * shmem_h, thread half4x4 & reg) {
    device const uint16_t * q4 = (device const uint16_t *)xb->qs;
    const half d = (half)xb->d;
    uint32_t aux32;
    thread const uint8_t * q8 = (thread const uint8_t *)&aux32;
    for (int i = 0; i < 4; ++i) {
        aux32 = ((q4[2*i] | (q4[2*i+1] << 16)) >> 4*il) & 0x0f0f0f0f;
        reg[i][0] = d * shmem_h[q8[0]];
        reg[i][1] = d * shmem_h[q8[1]];
        reg[i][2] = d * shmem_h[q8[2]];
        reg[i][3] = d * shmem_h[q8[3]];
    }
}

MXFP4 (line 597): Similar pattern, but E8M0→half conversion is ushort(bits << 7) → bfloat16, or (uint32_t)bits << 23 → float32 then cast to half.
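The float32 variant of the scale conversion is easy to sanity-check on the host: shifting the E8M0 byte into the float32 exponent field yields 2^(bits − 127). The bfloat16 route is the same idea shifted by 7 into 16 bits. A Python check of the 32-bit shift:

```python
import struct

def e8m0_to_f32(bits: int) -> float:
    # (uint32_t)bits << 23 reinterpreted as float32; note 0xFF lands in the
    # Inf/NaN exponent (E8M0 reserves it for NaN), so only 0..254 are finite
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFF) << 23))[0]

assert e8m0_to_f32(127) == 1.0   # bias
assert e8m0_to_f32(130) == 8.0   # 2**(130-127)
```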

8.2 Create F16 mul_mv kernel variants

File: ggml/src/ggml-metal/ggml-metal.metal

For each format, create a dedicated F16 decode kernel that accumulates in half and reduces via simd_sum at the end.

Q4_0 — the most impactful starting point:

template<int NR0, typename args_t>
void kernel_mul_mv_q4_0_f16_impl(
        args_t args,
        device const char * src0,
        device const char * src1,
        device       char * dst,
        threadgroup  char * shmem,
        uint3  tgpig,
        ushort tiisg,
        ushort sgitg) {
    // Same structure as kernel_mul_mv_q4_0_f32_impl
    // BUT: accumulate in half, reduce at end:
    half sumf[NR0] = {0.0h};  // HALF accumulators
    
    // ... same block loading and nibble extraction ...
    
    // In inner loop, use half multiply:
    // half d = (half)xb.d;
    // half m = (half)xb.m;
    // half yl_h = (half)yl[i];
    // sumf[row] += d * (nibble_val + m * sumy);  // HALF FMA
    
    // Final reduction: cast to float for simd_sum
    for (int row = 0; row < NR0 && first_row + row < args.ne0; ++row) {
        float sum_all = simd_sum((float)sumf[row]);
        if (tiisg == 0) {
            dst_f32[first_row + row] = sum_all;
        }
    }
}

Key insight: The output is still F32 (single token decode produces one float per row). Only the intermediate accumulation is F16. This means dst type stays float* — no output format change needed.

IQ4_NL (similar structure to existing kernel_mul_mv_iq4_nl_f32_impl at line 8850):

  • Same pattern: half sumf[NR0], half4 accumulators, lookup table loaded into threadgroup as half values, final simd_sum((float)sumf[row])
  • Must also create dequantize_iq4_nl_t4_half for the ext path (batch sizes 2-8)

IQ4_XS (similar to existing line 8960):

  • Same pattern with half accumulators
  • Must also create dequantize_iq4_xs_half variants

MXFP4 (existing line 9069):

  • dequantize_mxfp4_half: E8M0→half via ushort(bits << 7) → bfloat16 or bit-shift to float16
  • kvalues_mxfp4_h: half-precision lookup table in threadgroup shared memory

8.3 Threadgroup shared memory for lookup tables

Current state: Only kernel_mul_mv_mxfp4_f32_impl loads kvalues_mxfp4_f into threadgroup shared memory. IQ4_NL and IQ4_XS use constexpr constant arrays.

For F16 variants, the lookup values should be loaded as half into threadgroup:

// EXISTING (IQ4_NL, F32):
threadgroup float * shmem_f32 = (threadgroup float *) shmem;
shmem_f32[tiisg] = kvalues_iq4nl_f[tiisg%16];
threadgroup_barrier(mem_flags::mem_threadgroup);

// PROPOSED (IQ4_NL, F16):
threadgroup half * shmem_h = (threadgroup half *) shmem;
shmem_h[tiisg] = (half)kvalues_iq4nl_f[tiisg%16];  // cast constant float to half
threadgroup_barrier(mem_flags::mem_threadgroup);

Memory cost: 16 halves = 32 bytes (trivial). The existing IQ4_NL kernel already allocates shmem with threadgroup(0).

8.4 Dispatch registration

File: ggml/src/ggml-metal/ggml-metal-ops.cpp

Add F16 variants alongside existing F32 variants. Each format needs:

  1. A new mul_mv kernel name registered in ggml_metal_op_mul_mat
  2. A new mul_mv_ext template instantiation for batch sizes 2-8
  3. A new mul_mm template instantiation for batched GEMM

Example for Q4_0:

// EXISTING:
{"kernel_mul_mv_q4_0_f32", ...}
{"kernel_mul_mv_ext_q4_0_f32_r1_2", ...}
{"kernel_mul_mm_q4_0_f32", ...}
{"kernel_mul_mm_q4_0_f16", ...}  // f16 OUTPUT, still f32 dequant

// PROPOSED:
{"kernel_mul_mv_q4_0_f16", ...}     // f16 DEQUANT + accumulation
// ext variants r1_2 through r1_5
{"kernel_mul_mm_q4_0_f16_dequant", ...}  // f16 dequant, f16 accumulation in threadgroup

8.5 Kernel selection logic

File: ggml/src/ggml-metal/ggml-metal-ops.cpp

The dispatch logic (around line 2025) currently selects kernels based on op->src[0]->type. For F16 variants, add a runtime or compile-time flag to choose F16 accumulation kernels when available.

Two approaches:

  1. Always prefer F16: Replace existing kernels with F16 variants. The output is still F32 — no downstream changes needed. Risk: F16 may lose precision for very large models.
  2. Conditional selection: Add a GGML_METAL_F16_DEQUANT flag (env var or build option) that selects F16 kernels when set.

Recommended: Start with approach 1 for Q4_0 only (simplest kernel, well-tested format). If precision is fine, extend to other formats. Q4_0's arithmetic is trivially stable in F16 (scale + offset × nibble, range is well within F16 precision).
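The stability claim can be spot-checked off-device by emulating a half accumulator with struct's binary16 round-trip. This is a simulation under synthetic uniform activations, not the Metal kernel (which is even more benign: each SIMD lane accumulates only its own slice before the F32 reduction), but it bounds the worst case of rounding the running sum after every FMA:

```python
import random
import struct

def half(x: float) -> float:
    # Round to nearest IEEE binary16, emulating what a Metal `half` register holds
    return struct.unpack("<e", struct.pack("<e", x))[0]

random.seed(0)
K = 4096  # a typical hidden-dim dot-product length
xs = [random.uniform(-1.0, 1.0) for _ in range(K)]
ys = [random.uniform(-1.0, 1.0) for _ in range(K)]

ref = sum(x * y for x, y in zip(xs, ys))  # F32-style reference (binary64 here)

acc = 0.0
for x, y in zip(xs, ys):
    # `half sumf += ...`: product and running sum both rounded to binary16
    acc = half(acc + half(x * y))

scale = sum(abs(x * y) for x, y in zip(xs, ys))
rel_err = abs(acc - ref) / scale  # error relative to total accumulated magnitude
```

On this seed the relative error stays well under 1%, consistent with the claim that Q4_0-style sums are safe in F16; real-model activations with outliers would need the same check before flipping the default.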

8.6 Tuning: N_R0/N_SG parameters

File: ggml/src/ggml-metal/ggml-metal-impl.h

Current values and proposed changes to benchmark:

// CURRENT:
#define N_R0_Q4_0     4
#define N_SG_Q4_0     2   // 8 rows/tg — already good
#define N_R0_IQ4_NL   2
#define N_SG_IQ4_NL   2   // 4 rows/tg — try increasing
#define N_R0_MXFP4    2
#define N_SG_MXFP4    2   // 4 rows/tg — try increasing

// PROPOSED BENCHMARK VARIANTS:
#define N_R0_IQ4_NL   4   // try 8 rows/tg like Q4_0
#define N_R0_MXFP4    4   // try 8 rows/tg

These are compile-time constants. Create benchmark builds with each variant and measure tg TPS on M4 Max at 1K, 4K, and 16K contexts.

8.7 mul_mm (prefill) F16 path

The mat-mat path already has _f16 output variants (e.g., kernel_mul_mm_q4_0_f16). These dequantize to F32 in float4x4 then store as half2x4 for the simdgroup multiply. The F16 optimization here is to change the dequant functions to output half4x4 directly, so the threadgroup memory stores half the data.

This is lower priority than the decode (mul_mv) path because prefill is compute-bound and the mat-mat kernels already use simdgroup_half8x8 for the accumulation stage. The main gain would be reduced threadgroup memory pressure.

8.8 Priority order

  1. Q4_0 F16 mul_mv kernel — highest impact, simplest kernel (no lookup table), highest N_R0. File: ggml-metal.metal new kernel_mul_mv_q4_0_f16_impl
  2. IQ4_NL F16 mul_mv kernel — second format, lookup table needs half shmem. File: ggml-metal.metal new kernel_mul_mv_iq4_nl_f16_impl
  3. IQ4_XS F16 mul_mv kernel — third format. File: ggml-metal.metal new kernel_mul_mv_iq4_xs_f16_impl
  4. MXFP4 F16 mul_mv kernel — fourth format, after converter works. File: ggml-metal.metal modify kernel_mul_mv_mxfp4_f32_impl
  5. N_R0 benchmarking — ggml-metal-impl.h, try N_R0=4 for IQ4_NL and MXFP4
  6. mul_mm F16 dequant — lower priority, mat-mat path already uses half simdgroup
  7. MXFP4 converter — extend Qwen3_5TextModel per §3