Baseline Benchmarks
Date: 2026-04-30
Hardware: Apple M4 Max
Build: 683c5acb9 (upstream main)
Commands:
- llama-bench -m MODEL -p 512 -t 1 -n 128 -o md -r 3 (pp512/tg128)
- llama-bench -m MODEL -p 1 -t 1 -n 4096 -o md -r 2 (tg4096)
pp512 (tokens/s)
| Model | Q4_0 | IQ4_NL | IQ4_XS |
|---|---|---|---|
| 4B | 1262.78 | 1252.70 | 1238.49 |
| 9B | 712.91 | 707.50 | 697.51 |
tg128 (tokens/s)
| Model | Q4_0 | IQ4_NL | IQ4_XS |
|---|---|---|---|
| 4B | 80.00 | 79.24 | 80.04 |
| 9B | 53.83 | 53.93 | 54.95 |
tg4096 (tokens/s)
| Model | Q4_0 | IQ4_NL | IQ4_XS |
|---|---|---|---|
| 4B | 76.09 | 75.24 | 45.23 |
| 9B | 52.06 | 51.95 | 38.51 |
Perplexity (Q4_0 4B, ctx=128)
PPL = 2.2641 +/- 0.47327
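For reference, this is the usual exponentiated mean negative log-likelihood over the evaluated tokens (the +/- term is the tool's reported uncertainty estimate):

$$\mathrm{PPL} = \exp\!\left(-\frac{1}{N}\sum_{i=1}^{N}\log p(x_i \mid x_{<i})\right)$$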
Effective bandwidth (9B models, tg128)
| Format | Size (GiB) | tg128 (tok/s) | Effective BW (GB/s) |
|---|---|---|---|
| Q4_0 | 5.00 | 53.83 | 289 |
| IQ4_NL | 4.99 | 53.93 | 289 |
| IQ4_XS | 4.80 | 54.95 | 283 |
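The effective-bandwidth column is just model size times decode rate, on the assumption that decode streams every weight byte once per generated token. A minimal sketch of the arithmetic, using the measured values from the table:

```cpp
#include <cstdio>

int main() {
    // Effective BW ~= bytes streamed per token * tokens/s. Sizes are the
    // file sizes (GiB) and rates the tg128 numbers from the table above.
    struct row { const char * fmt; double gib; double tps; };
    const row rows[] = {
        { "Q4_0",   5.00, 53.83 },
        { "IQ4_NL", 4.99, 53.93 },
        { "IQ4_XS", 4.80, 54.95 },
    };
    for (const row & r : rows) {
        const double gb = r.gib * 1.073741824; // GiB -> GB
        std::printf("%-7s %.0f GB/s\n", r.fmt, gb * r.tps); // 289 / 289 / 283
    }
    return 0;
}
```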
F16 Accumulation Results
Date: 2026-04-30
Build: 683c5acb9 + F16 Q4_0 kernel (GGML_METAL_F16_ACCUM=1)
Q4_0 with F16 accumulation (tg4096)
| Model | tg4096 F32 | tg4096 F16 | Delta |
|---|---|---|---|
| 4B | 76.09 | 76.15 | +0.08% |
| 9B | 52.06 | 51.94 | -0.23% |
Perplexity with F16 accumulation (Q4_0 4B, ctx=128)
PPL = 2.2641 +/- 0.47327 (identical to baseline)
Conclusion: F16 accumulation = zero perf improvement, zero quality impact. Reverted.
Graph Profile (tokgen decode)
Date: 2026-04-30
Build: 683c5acb9 (upstream main, clean)
Tool: llama-eval-callback-profile (custom, non-syncing cb_eval)
Test: p="The", n=32, ctx=256, t=1
Key finding: llama.cpp dispatches 1833 ops per decode tick (9B model). 682 are zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE — no GPU kernel). 1151 are actual GPU dispatches. This is a significant structural source of overhead.
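A sketch of how such a non-syncing profiler can be built, assuming llama.cpp's ggml_backend_sched_eval_callback interface (hooked up through llama_context_params.cb_eval / .cb_eval_user_data); this illustrates the approach, not the actual tool:

```cpp
#include "ggml.h"
#include <cstdint>
#include <map>

struct op_stats {
    int64_t n_gpu = 0;                          // real GPU dispatches
    std::map<enum ggml_op, int64_t> n_ops;      // ops per op type
    std::map<enum ggml_op, int64_t> bytes_in;   // src bytes per op type
    std::map<enum ggml_op, int64_t> bytes_out;  // dst bytes per op type
};

// View-family ops only rewrite tensor metadata; no GPU kernel is encoded.
static bool is_zero_op(enum ggml_op op) {
    return op == GGML_OP_VIEW    || op == GGML_OP_RESHAPE ||
           op == GGML_OP_PERMUTE || op == GGML_OP_TRANSPOSE;
}

static bool profile_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    auto * s = (op_stats *) user_data;
    if (ask) {
        // Tally in the "ask" phase, then decline the tensor data:
        // returning true here would make the scheduler sync so the data
        // could be read back, which is exactly the stall we avoid.
        s->n_ops[t->op]++;
        s->bytes_out[t->op] += (int64_t) ggml_nbytes(t);
        for (int i = 0; i < GGML_MAX_SRC && t->src[i]; i++) {
            s->bytes_in[t->op] += (int64_t) ggml_nbytes(t->src[i]);
        }
        if (!is_zero_op(t->op)) s->n_gpu++;
        return false;
    }
    return true; // unreached while the ask phase always returns false
}
```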
9B Q4_0 (52.9 tok/s, 1833 ops/tick, 1151 GPU dispatches/tick)
| Op | Ops/tick | Bytes in/tick | Bytes out/tick | GPU? | Notes |
|---|---|---|---|---|---|
| VIEW | 346 | 274 MB | 116 MB | NO | metadata only |
| RESHAPE | 288 | 108 MB | 108 MB | NO | metadata only |
| GET_ROWS | 99 | 678 MB | 53 MB | YES | token embed + DeltaNet state |
| CPY | 97 | 106 MB | 53 MB | YES | type conversion/layout |
| MUL_MAT | 249 | 4797 MB | 7 MB | YES | weight matmuls (dominant) |
| GATED_DELTA_NET | 24 | 51 MB | 51 MB | YES | linear attention update |
| PERMUTE | 24 | 9 MB | 9 MB | NO | metadata only |
| SET_ROWS | 16 | 8 MB | 8 MB | YES | KV cache write |
| GLU | 32 | 3 MB | 2 MB | YES | FFN activation |
| MUL | 161 | 4 MB | 2 MB | YES | element-wise multiply |
| UNARY/SILU | 104 | 1 MB | 1 MB | YES | activation functions |
| RMS_NORM | 105 | 2 MB | 2 MB | YES | layer norms |
| ADD | 88 | 2 MB | 1 MB | YES | residual connections |
| SSM_CONV | 24 | 6 MB | 1 MB | YES | DeltaNet conv1d |
| L2_NORM | 48 | 0.4 MB | 0.4 MB | YES | q/k norm |
| ROPE | 16 | 0.2 MB | 0.2 MB | YES | rotary embeddings |
| FLASH_ATTN_EXT | 8 | 9 MB | 0.1 MB | YES | full attention (8 layers) |
| CONCAT | 24 | 3 MB | 3 MB | YES | tensor concatenation |
| SCALE | 48 | 0 | 0 | YES | scaling |
| CONT | 8 | 0.3 MB | 0.1 MB | YES | contiguous copy |
| TRANSPOSE | 24 | 1 MB | 1 MB | NO | metadata only |
Total data read per tick: ~6.1 GB (MUL_MAT = 4.8 GB, GET_ROWS = 0.7 GB, CPY = 0.1 GB, rest ≈ 0.5 GB)
Context length impact (9B Q4_0)
| Context | SET_ROWS | TPS | Notes |
|---|---|---|---|
| 256 | 8 MB | 52.9 | KV cache negligible |
| 2048 | 67 MB | 52.8 | Still negligible |
| 8192 | 268 MB | 52.5 | Still negligible |
KV cache for 8 full-attention layers is tiny compared to MUL_MAT weight reads. The GatedDeltaNet state (51 MB) is larger but constant with context.
Architecture-specific notes
Qwen3.5 has a hybrid architecture: 3 GatedDeltaNet + 1 full-attention per group of 4 layers.
Per GatedDeltaNet layer:
- 3 input matmuls (qkv_a, alpha, beta) — Q8_0
- 1 z-gate matmul — Q4_0
- 1 output projection matmul — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- SSM_CONV, L2_NORM, SCALE, MUL for state update
- Total: ~7-8 MUL_MAT + SSM_CONV + misc
Per full-attention layer:
- 3 input projections (Q, K, V) — Q4_0
- 1 output projection — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- ROPE, FLASH_ATTN_EXT
- Total: 7-8 MUL_MAT
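The profile pins down the stack shape: 24 GATED_DELTA_NET plus 8 FLASH_ATTN_EXT dispatches per tick implies 32 layers in 8 groups of 4. A sketch of the layout logic; which slot in each group holds full attention is an assumption chosen for illustration:

```cpp
#include <cstdio>

// Hybrid stack: 3 GatedDeltaNet + 1 full-attention per group of 4 layers.
// 24 GATED_DELTA_NET + 8 FLASH_ATTN_EXT dispatches/tick => 32 layers.
// Placing full attention in the last slot of each group is an assumption.
static bool is_full_attention(int layer) {
    return layer % 4 == 3;
}

int main() {
    const int n_layers = 32;
    int n_full = 0, n_delta = 0;
    for (int l = 0; l < n_layers; l++) {
        if (is_full_attention(l)) n_full++; else n_delta++;
    }
    std::printf("full-attention: %d, GatedDeltaNet: %d\n", n_full, n_delta); // 8, 24
    return 0;
}
```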
Dispatch overhead analysis
- 1833 ops/tick, 682 zero-ops (metadata), 1151 GPU dispatches
- At 52.9 tok/s → 18.9 ms/tick → 16.4 us per GPU dispatch average
- M4 Max Metal dispatch floor: ~3-5 us (from profiling)
- Dispatch overhead: 3.5-5.8 ms/tick (18-30% of total)
- MUL_MAT weight reads: 4.8 GB at observed 289 GB/s ≈ 16.6 ms (but pipelined with other ops)
- Other data: ~1.3 GB reads + ~0.4 GB writes ≈ 5-6 ms at 289 GB/s
- Neither compute, bandwidth, nor dispatch is saturated on its own; the tick time is a pipelined mix of all three (arithmetic sketched below)
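The arithmetic behind these bullets, collected into a runnable sketch (all inputs are the measured values above):

```cpp
#include <cstdio>
#include <initializer_list>

int main() {
    // Measured inputs from the profile above (9B Q4_0 decode).
    const double tok_per_s  = 52.9;
    const int    n_dispatch = 1151;       // GPU dispatches per tick
    const double bw_gb_s    = 289.0;      // observed effective bandwidth
    const double mulmat_gb  = 4.8;        // MUL_MAT weight reads per tick
    const double other_gb   = 1.3 + 0.4;  // remaining reads + writes

    const double tick_ms = 1e3 / tok_per_s; // ~18.9 ms per token
    std::printf("tick %.1f ms, avg %.1f us/dispatch\n",
                tick_ms, 1e3 * tick_ms / n_dispatch); // ~16.4 us

    // Dispatch floor of ~3-5 us per encoded kernel (from profiling).
    for (const double floor_us : { 3.0, 5.0 }) {
        const double oh_ms = n_dispatch * floor_us / 1e3;
        std::printf("floor %.0f us -> %.1f ms overhead (%.0f%% of tick)\n",
                    floor_us, oh_ms, 100.0 * oh_ms / tick_ms); // 18-30%
    }

    // Pure memory cost if traffic ran back-to-back at the observed rate.
    std::printf("MUL_MAT %.1f ms, other traffic %.1f ms\n",
                1e3 * mulmat_gb / bw_gb_s, 1e3 * other_gb / bw_gb_s);
    return 0;
}
```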
Comparison with MLX
MLX achieves ~355 GB/s effective bandwidth vs llama.cpp's ~289 GB/s on similar models (a ~23% gap).
Potential sources of gap:
- Kernel memory access patterns: MLX uses contiguous weight reads, llama.cpp uses interleaved (Q4_0 block layout sketched after this list)
- Dispatch efficiency: 1151 GPU dispatches vs likely fewer in MLX (fewer view/reshape ops?)
- Non-MUL_MAT ops: Nearly 600 MB/tick of reads for GET_ROWS/CPY/SET_ROWS — are these as efficient in llama.cpp?
- Graph optimization: llama.cpp has many zero-ops (682 VIEW/RESHAPE/TRANSPOSE/PERMUTE) that still need encoding — can these be eliminated?
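For context on the access-pattern bullet: ggml lays out a Q4_0 row as a flat stream of 18-byte blocks, so the f16 scales sit interleaved with the packed nibbles rather than in a separate contiguous plane. A paraphrase of the block type:

```cpp
#include <cstdint>

// Mirror of ggml's block_q4_0 (ggml-common.h): 32 weights per 18-byte
// block. A quantized row is a flat array of these, so every 16 bytes of
// nibbles is broken up by a 2-byte scale: interleaved, not contiguous.
struct block_q4_0 {
    uint16_t d;       // f16 scale for the block
    uint8_t  qs[16];  // 32 x 4-bit quants, two per byte
};
static_assert(sizeof(block_q4_0) == 18, "18-byte Q4_0 block");
```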
Profiling methodology
- llama-eval-callback-profile: custom tool using cb_eval to observe ops without forcing a sync
- GGML_METAL_GRAPH_DEBUG=1 with the -v flag: shows per-op graph structure (requires DEBUG log level)
- GGML_METAL_CAPTURE_COMPUTE=2: captures an Xcode Instruments GPU trace of the 2nd compute call (first tokgen)
- Concurrency disabled: GGML_METAL_CONCURRENCY_DISABLE=1 → ~53 → 52 tok/s (slightly worse)
- Fusion disabled: GGML_METAL_FUSION_DISABLE=1 → negligible impact