14 KiB
Qwen3.6-27B MXFP4 → GGUF & Metal Performance Analysis
Date: 2026-04-30
Bottom line: The ~30% tg TPS gap (llama.cpp ~18 vs MLX ~24) is NOT from quant format or F16 accumulation. Root causes identified: (1) 1151 GPU dispatches per tick with high per-dispatch overhead, (2) 682 zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE) that still require encoding, (3) MUL_MAT kernel memory access patterns, (4) non-MUL_MAT ops (GET_ROWS, CPY, SET_ROWS) that read/write ~400 MB/tick on top of the 4.8 GB weight reads.
1. Model Architecture
- HF class:
Qwen3_5ForConditionalGeneration, GGUF arch:qwen35 - 64 layers: 3 linear_attention + 1 full_attention per 4 (GatedDeltaNet + GQA)
- Linear attn: k_heads=16, v_heads=48, k_dim=128, v_dim=128
- Full attn: heads=24, kv_heads=4, head_dim=256, partial RoPE factor=0.25
- V-head reordering required (grouped→tiled)
- Config:
/Volumes/FastStore/hugging/Qwen3.6-27B/config.json
2. MXFP4 Tensor Format
Sources: /Users/sleepy/.omlx/models/Qwen3.6-27B-mxfp4/ (3 shards, ~14.9GB)
- Weights:
*.weightdtype=U32, shape[out, in/8]— 8 nibbles per uint32 - Scales:
*.scalesdtype=U8, shape[out, in/32]— E4M3 unsigned (bias=7) - Non-quantized (BF16): layernorm, conv1d, A_log, dt_bias, norm, vision tower
- Prefix:
language_model.model.layers.N(differs from BF16model.language_model.layers.N) - Full attention has separate Q/K/V projections (not fused
in_proj_qkv) conv1d.weightshape differs: BF16[10240, 1, 4]vs MXFP4[10240, 4, 1]- GGML expects E8M0 scales; MLX stores E4M3. Convert via
ue4m3_to_fp32()+fp32_to_ue4m3() - Nibble packing differs from GptOss MoE format; must verify against BF16 ground truth
3. Converter Status
GptOssModel(line 12143): Only existing MXFP4 handler, MoE-specific, not reusableQwen3_5TextModel(line 5435): No MXFP4 support, will crash on MXFP4 weights_LinearAttentionVReorderBase(line 5267): Has NVFP4 V-head reordering template- Detection works:
quantization.mode == "mxfp4"from config.json
4. Benchmarks
MLX (M4 Max)
| Model | Quant | Size | tg 1K | tg 64K |
|---|---|---|---|---|
| Text-mxfp4-mlx | MXFP4 | 14.0 GB | 23.8 | 12.0 |
| mxfp4 | MXFP4 | 14.9 GB | 23.7 | 12.4 |
| 4bit | affine-4b | 15.7 GB | 22.6 | 12.0 |
| oQ4 | oQ4 | 16.3 GB | 21.7 | 11.7 |
llama.cpp (M4 Max)
| Format | Size | tg 1K (q8) | tg 4K (q8) |
|---|---|---|---|
| IQ4_XS | 15.4 GB | 18.3 | 18.4 |
| Q4_0 | 15.8 GB | 18.0 | 18.0 |
| IQ4_NL | 16.1 GB | 17.8 | 17.9 |
All three GGUF formats within 3% of each other. Bandwidth dominates. KV cache type (q4 vs q8) has zero effect.
MLX effective bandwidth at 15.7 GB: 22.6 × 15.7 = 355 GB/s. llama.cpp at 16.1 GB: 17.8 × 16.1 = 287 GB/s. That's 19% less bandwidth utilization — the kernel gap.
5. Root Cause: F32 Accumulation
Every llama.cpp Metal kernel accumulates in F32. MLX accumulates in F16. Apple GPU does F16 FMA at 2× F32 rate.
Q4_0 decode kernel (block_q_n_dot_y, line 3228):
float d = qb_curr->d;
float4 acc1 = {0.f};
acc1 += yl[i] * (qs[i/2] & mask); // F32×F32
return d * (sumy * -8.f + acc); // F32 final
IQ4_NL decode kernel (line 8884):
shmem_f32[tiisg] = kvalues_iq4nl_f[tiisg%16]; // F32 lookup
sumf[row] += (float)xb.d * (acc1[0] + ...); // F32 accumulate
MLX decode kernel (fp_qmv_fast):
half converted = as_type<half>(ushort((bits & 7) << 9)); // 2 ops to half
converted *= 16384.0; // half multiply
return bits & 8 ? -converted : converted; // apply sign
// Then: simdgroup_matrix<half,8,8> accumulation = F16×F16 throughout
Impact: ~15-20% of the gap. The rest is threadgroup occupancy (N_R0=2 vs 4) and kernel dispatch overhead (~3-5%).
6. Format Alignment Analysis
| Format | bpw | Block bytes | 4B-align | N_R0 | Lookup | Dequant |
|---|---|---|---|---|---|---|
| Q4_0 | 4.50 | 18 | No | 4 | No | Fused dot |
| IQ4_NL | 4.50 | 18 | No | 2 | Yes | Table+F32 |
| IQ4_XS | ~4.25 | 136 | Yes | 2 | Yes | Sub-block scale |
| MXFP4 | 4.25 | 17 | No | 2 | Yes | Shift+table+F32 |
No format is "aligned" to Apple SIMD. Q4_0 is closest (simplest kernel, highest occupancy) but still accumulates F32. MXFP4 has the best compression (4.25 bpw) but needs E8M0 conversion. IQ4_XS is smallest GGUF but has the most complex kernel.
No new format needed. F16 accumulation on existing formats is the path.
7. Key File Paths
Converter: convert_hf_to_gguf.py lines 5435 (Qwen3_5TextModel), 5267 (_LinearAttentionVReorderBase), 734 (quant detection), 12143 (GptOss MXFP4)
Metal kernels: ggml/src/ggml-metal/ggml-metal.metal lines 3228 (Q4_0 dot), 8850 (IQ4_NL mul_mv), 8960 (IQ4_XS mul_mv), 9069 (MXFP4 mul_mv), 597-625 (MXFP4 dequant)
Tuning: ggml/src/ggml-metal/ggml-metal-impl.h — N_R0/N_SG constants
GGML format: ggml/src/ggml-common.h line 204 (block_mxfp4), gguf-py/gguf/quants.py line 656 (MXFP4 quant)
Model architecture: src/models/qwen35.cpp, src/models/delta-net-base.cpp
oMLX: /Applications/oMLX.app/Contents/Resources/omlx/patches/qwen3_5_attention.py (RoPE fix), gated_delta_advance.py (cache fix), turboquant_kv.py (codebook KV), specprefill.py (sparse prefill)
MLX: .../mlx/include/mlx/backend/metal/kernels/fp4.h (F16 E2M1), fp_quantized.h (MXFP4 GEMM)
8. IMPLEMENTATION PLAN: F16 ACCUMULATION KERNELS
⚠️ This is a non-tested proposal. Treat as pseudocode. Actual implementation may need adjustments for register pressure, threadgroup memory limits, and hardware-specific tuning. Always benchmark before/after on target hardware.
8.1 Create dequantize_*_half variants
File: ggml/src/ggml-metal/ggml-metal.metal
For each quant format that will get F16 decode kernels, add a half-precision dequant function. These output half4x4 instead of float4x4.
Q4_0 (currently at line 172):
// EXISTING:
void dequantize_q4_0(device const block_q4_0 * xb, short il, thread type4x4 & reg) {
// outputs float4x4
}
// PROPOSED: add dequantize_q4_0_half outputting half4x4
void dequantize_q4_0_half(device const block_q4_0 * xb, short il, thread half4x4 & reg) {
device const uint16_t * qs = ((device const uint16_t *)xb + 1);
const half d1 = (half)xb->d;
const half d2 = d1 / (half)16.0h;
const half md = (half)(-8.0h) * (half)xb->d;
// Same nibble extraction, but multiply in half:
for (int i = 0; i < 8; ++i) {
reg[i/2][2*(i%2)+0] = d1 * (half)(qs[i] & mask0) + md;
reg[i/2][2*(i%2)+1] = d2 * (half)(qs[i] & mask1) + md;
}
}
IQ4_NL (line 921):
// PROPOSED: add dequantize_iq4_nl_half
void dequantize_iq4_nl_half(device const block_iq4_nl * xb, short il, thread half4x4 & reg) {
device const uint16_t * q4 = (device const uint16_t *)xb->qs;
const half d = (half)xb->d;
uint32_t aux32;
thread const uint8_t * q8 = (thread const uint8_t *)&aux32;
// Use half-precision lookup table:
threadgroup half * shmem_h = (threadgroup half *)shmem;
// Need to load kvalues as half in shmem first (see §8.3)
for (int i = 0; i < 4; ++i) {
aux32 = ((q4[2*i] | (q4[2*i+1] << 16)) >> 4*il) & 0x0f0f0f0f;
reg[i][0] = d * shmem_h[q8[0]];
reg[i][1] = d * shmem_h[q8[1]];
reg[i][2] = d * shmem_h[q8[2]];
reg[i][3] = d * shmem_h[q8[3]];
}
}
MXFP4 (line 597): Similar pattern, but E8M0→half conversion is ushort(bits << 7) → bfloat16, or (uint32_t)bits << 23 → float32 then cast to half.
8.2 Create F16 mul_mv kernel variants
File: ggml/src/ggml-metal/ggml-metal.metal
For each format, create a dedicated F16 decode kernel that accumulates in half and reduces via simd_sum at the end.
Q4_0 — the most impactful starting point:
template<int NR0, typename args_t>
void kernel_mul_mv_q4_0_f16_impl(
args_t args,
device const char * src0,
device const char * src1,
device char * dst,
threadgroup char * shmem,
uint3 tgpig,
ushort tiisg,
ushort sgitg) {
// Same structure as kernel_mul_mv_q4_0_f32_impl
// BUT: accumulate in half, reduce at end:
half sumf[NR0] = {0.0h}; // HALF accumulators
// ... same block loading and nibble extraction ...
// In inner loop, use half multiply:
// half d = (half)xb.d;
// half m = (half)xb.m;
// half yl_h = (half)yl[i];
// sumf[row] += d * (nibble_val + m * sumy); // HALF FMA
// Final reduction: cast to float for simd_sum
for (int row = 0; row < NR0 && first_row + row < args.ne0; ++row) {
float sum_all = simd_sum((float)sumf[row]);
if (tiisg == 0) {
dst_f32[first_row + row] = sum_all;
}
}
}
Key insight: The output is still F32 (single token decode produces one float per row). Only the intermediate accumulation is F16. This means dst type stays float* — no output format change needed.
IQ4_NL (similar structure to existing kernel_mul_mv_iq4_nl_f32_impl at line 8850):
- Same pattern:
half sumf[NR0],half4accumulators, lookup table loaded into threadgroup ashalfvalues, finalsimd_sum((float)sumf[row]) - Must also create
dequantize_iq4_nl_t4_halffor the ext path (batch sizes 2-8)
IQ4_XS (similar to existing line 8960):
- Same pattern with
halfaccumulators - Must also create
dequantize_iq4_xs_halfvariants
MXFP4 (existing line 9069):
dequantize_mxfp4_half: E8M0→half viaushort(bits << 7) → bfloat16or bit-shift to float16kvalues_mxfp4_h: half-precision lookup table in threadgroup shared memory
8.3 Threadgroup shared memory for lookup tables
Current state: Only kernel_mul_mv_mxfp4_f32_impl loads kvalues_mxfp4_f into threadgroup shared memory. IQ4_NL and IQ4_XS use constexpr constant arrays.
For F16 variants, the lookup values should be loaded as half into threadgroup:
// EXISTING (IQ4_NL, F32):
threadgroup float * shmem_f32 = (threadgroup float *) shmem;
shmem_f32[tiisg] = kvalues_iq4nl_f[tiisg%16];
threadgroup_barrier(mem_flags::mem_threadgroup);
// PROPOSED (IQ4_NL, F16):
threadgroup half * shmem_h = (threadgroup half *) shmem;
shmem_h[tiisg] = (half)kvalues_iq4nl_f[tiisg%16]; // cast constant float to half
threadgroup_barrier(mem_flags::mem_threadgroup);
Memory cost: 16 halves = 32 bytes (trivial). The existing IQ4_NL kernel already allocates shmem with threadgroup(0).
8.4 Dispatch registration
File: ggml/src/ggml-metal/ggml-metal-ops.cpp
Add F16 variants alongside existing F32 variants. Each format needs:
- A new
mul_mvkernel name registered inggml_metal_op_mul_mat - A new
mul_mv_exttemplate instantiation for batch sizes 2-8 - A new
mul_mmtemplate instantiation for batched GEMM
Example for Q4_0:
// EXISTING:
{"kernel_mul_mv_q4_0_f32", ...}
{"kernel_mul_mv_ext_q4_0_f32_r1_2", ...}
{"kernel_mul_mm_q4_0_f32", ...}
{"kernel_mul_mm_q4_0_f16", ...} // f16 OUTPUT, still f32 dequant
// PROPOSED:
{"kernel_mul_mv_q4_0_f16", ...} // f16 DEQUANT + accumulation
// ext variants r1_2 through r1_5
{"kernel_mul_mm_q4_0_f16_dequant", ...} // f16 dequant, f16 accumulation in threadgroup
8.5 Kernel selection logic
File: ggml/src/ggml-metal/ggml-metal-ops.cpp
The dispatch logic (around line 2025) currently selects kernels based on op->src[0]->type. For F16 variants, add a runtime or compile-time flag to choose F16 accumulation kernels when available.
Two approaches:
- Always prefer F16: Replace existing kernels with F16 variants. The output is still F32 — no downstream changes needed. Risk: F16 may lose precision for very large models.
- Conditional selection: Add a
GGML_METAL_F16_DEQUANTflag (env var or build option) that selects F16 kernels when set.
Recommended: Start with approach 1 for Q4_0 only (simplest kernel, well-tested format). If precision is fine, extend to other formats. Q4_0's arithmetic is trivially stable in F16 (scale + offset × nibble, range is well within F16 precision).
8.6 Tuning: N_R0/N_SG parameters
File: ggml/src/ggml-metal/ggml-metal-impl.h
Current values and proposed changes to benchmark:
// CURRENT:
#define N_R0_Q4_0 4
#define N_SG_Q4_0 2 // 8 rows/tg — already good
#define N_R0_IQ4_NL 2
#define N_SG_IQ4_NL 2 // 4 rows/tg — try increasing
#define N_R0_MXFP4 2
#define N_SG_MXFP4 2 // 4 rows/tg — try increasing
// PROPOSED BENCHMARK VARIANTS:
#define N_R0_IQ4_NL 4 // try 8 rows/tg like Q4_0
#define N_R0_MXFP4 4 // try 8 rows/tg
These are compile-time constants. Create benchmark builds with each variant and measure tg TPS on M4 Max at 1K, 4K, and 16K contexts.
8.7 mul_mm (prefill) F16 path
The mat-mat path already has _f16 output variants (e.g., kernel_mul_mm_q4_0_f16). These dequantize to F32 in float4x4 then store as half2x4 for the simdgroup multiply. The F16 optimization here is to change the dequant functions to output half4x4 directly, so the threadgroup memory stores half the data.
This is lower priority than the decode (mul_mv) path because prefill is compute-bound and the mat-mat kernels already use simdgroup_half8x8 for the accumulation stage. The main gain would be reduced threadgroup memory pressure.
8.8 Priority order
- Q4_0 F16 mul_mv kernel — highest impact, simplest kernel (no lookup table), highest N_R0. File:
ggml-metal.metalnewkernel_mul_mv_q4_0_f16_impl - IQ4_NL F16 mul_mv kernel — second format, lookup table needs half shmem. File:
ggml-metal.metalnewkernel_mul_mv_iq4_nl_f16_impl - IQ4_XS F16 mul_mv kernel — third format. File:
ggml-metal.metalnewkernel_mul_mv_iq4_xs_f16_impl - MXFP4 F16 mul_mv kernel — fourth format, after converter works. File:
ggml-metal.metalmodifykernel_mul_mv_mxfp4_f32_impl - N_R0 benchmarking —
ggml-metal-impl.h, try N_R0=4 for IQ4_NL and MXFP4 - mul_mm F16 dequant — lower priority, mat-mat path already uses half simdgroup
- MXFP4 converter — extend
Qwen3_5TextModelper §3