# Baseline Benchmarks

**Date**: 2026-04-30
**Hardware**: Apple M4 Max
**Build**: 683c5acb9 (upstream main)
**Command**:
- `llama-bench -m MODEL -p 512 -t 1 -n 128 -o md -r 3` (pp512/tg128)
- `llama-bench -m MODEL -p 1 -t 1 -n 4096 -o md -r 2` (tg4096)

## pp512 (tokens/s)

| Model | Q4_0 | IQ4_NL | IQ4_XS |
|-------|------|--------|--------|
| 4B | 1262.78 | 1252.70 | 1238.49 |
| 9B | 712.91 | 707.50 | 697.51 |

## tg128 (tokens/s)

| Model | Q4_0 | IQ4_NL | IQ4_XS |
|-------|------|--------|--------|
| 4B | 80.00 | 79.24 | 80.04 |
| 9B | 53.83 | 53.93 | 54.95 |

## tg4096 (tokens/s)

| Model | Q4_0 | IQ4_NL | IQ4_XS |
|-------|------|--------|--------|
| 4B | 76.09 | 75.24 | 45.23 |
| 9B | 52.06 | 51.95 | 38.51 |

## Perplexity (Q4_0 4B, ctx=128)

PPL = 2.2641 +/- 0.47327

## Effective bandwidth (9B models, tg128)

Effective bandwidth here is model size × tg TPS (every weight is read once per token), converted from GiB/s to GB/s.

| Format | Size (GiB) | tg TPS | Eff BW (GB/s) |
|--------|------------|--------|---------------|
| Q4_0 | 5.00 | 53.83 | 289 |
| IQ4_NL | 4.99 | 53.93 | 289 |
| IQ4_XS | 4.80 | 54.95 | 283 |

---

# F16 Accumulation Results

**Date**: 2026-04-30
**Build**: 683c5acb9 + F16 Q4_0 kernel (GGML_METAL_F16_ACCUM=1)

## Q4_0 with F16 accumulation (tg4096)

| Model | tg4096 F32 | tg4096 F16 | Delta |
|-------|------------|------------|-------|
| 4B | 76.09 | 76.15 | +0.08% |
| 9B | 52.06 | 51.94 | -0.23% |

## Perplexity with F16 accumulation (Q4_0 4B, ctx=128)

PPL = 2.2641 +/- 0.47327 (identical to baseline)

**Conclusion**: F16 accumulation gives zero perf improvement and zero quality impact. Reverted.

---

# Graph Profile (tokgen decode)

**Date**: 2026-04-30
**Build**: 683c5acb9 (upstream main, clean)
**Tool**: `llama-eval-callback-profile` (custom, non-syncing cb_eval)
**Test**: p="The", n=32, ctx=256, t=1

**Key finding**: llama.cpp dispatches 1833 ops per decode tick (9B model). 682 of those are zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE — no GPU kernel); the remaining 1151 are actual GPU dispatches. This is a significant structural source of overhead.
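For reference, a minimal sketch of the non-syncing counting approach, assuming the stock `cb_eval` hook on `llama_context_params` (that hook, the callback signature, and the `GGML_OP_*` names are the real llama.cpp/ggml API; the `op_profile` struct and wiring are illustrative, not the actual tool):

```cpp
// Minimal sketch: tally ops per eval without forcing a per-op sync.
// cb_eval is invoked with ask=true for every node *before* it is computed;
// answering false ("no data needed") lets the scheduler keep executing the
// graph in large ranges, so counting here never reads tensor data.
#include "llama.h"
#include "ggml.h"
#include <map>
#include <string>

struct op_profile {
    std::map<std::string, int> per_op; // op name -> dispatches this eval
    int zero_ops = 0;                  // VIEW/RESHAPE/TRANSPOSE/PERMUTE
    int gpu_ops  = 0;                  // everything else
};

static bool profile_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    if (!ask) {
        return true; // never reached: we always answer "not needed" below
    }
    auto * prof = (op_profile *) user_data;
    prof->per_op[ggml_op_name(t->op)]++;
    switch (t->op) {
        case GGML_OP_VIEW:
        case GGML_OP_RESHAPE:
        case GGML_OP_TRANSPOSE:
        case GGML_OP_PERMUTE: prof->zero_ops++; break; // metadata only
        default:              prof->gpu_ops++;  break; // real GPU dispatch
    }
    return false; // don't split/sync the graph for this node
}

// wiring (hypothetical driver code):
//   op_profile prof;
//   llama_context_params cparams = llama_context_default_params();
//   cparams.cb_eval           = profile_cb;
//   cparams.cb_eval_user_data = &prof;
```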
## 9B Q4_0 (52.9 tok/s, 1833 ops/tick, 1151 GPU dispatches/tick)

| Op | PerTick | BytesIn/tk | BytesOut/tk | GPU? | Notes |
|----|---------|------------|-------------|------|-------|
| VIEW | 346 | 274 MB | 116 MB | NO | metadata only |
| RESHAPE | 288 | 108 MB | 108 MB | NO | metadata only |
| GET_ROWS | 99 | 678 MB | 53 MB | YES | token embed + DeltaNet state |
| CPY | 97 | 106 MB | 53 MB | YES | type conversion/layout |
| MUL_MAT | 249 | **4797 MB** | 7 MB | YES | weight matmuls (dominant) |
| GATED_DELTA_NET | 24 | 51 MB | 51 MB | YES | linear attention update |
| PERMUTE | 24 | 9 MB | 9 MB | NO | metadata only |
| SET_ROWS | 16 | 8 MB | 8 MB | YES | KV cache write |
| GLU | 32 | 3 MB | 2 MB | YES | FFN activation |
| MUL | 161 | 4 MB | 2 MB | YES | element-wise multiply |
| UNARY/SILU | 104 | 1 MB | 1 MB | YES | activation functions |
| RMS_NORM | 105 | 2 MB | 2 MB | YES | layer norms |
| ADD | 88 | 2 MB | 1 MB | YES | residual connections |
| SSM_CONV | 24 | 6 MB | 1 MB | YES | DeltaNet conv1d |
| L2_NORM | 48 | 0.4 MB | 0.4 MB | YES | q/k norm |
| ROPE | 16 | 0.2 MB | 0.2 MB | YES | rotary embeddings |
| FLASH_ATTN_EXT | 8 | 9 MB | 0.1 MB | YES | full attention (8 layers) |
| CONCAT | 24 | 3 MB | 3 MB | YES | tensor concatenation |
| SCALE | 48 | 0 | 0 | YES | scaling |
| CONT | 8 | 0.3 MB | 0.1 MB | YES | contiguous copy |
| TRANSPOSE | 24 | 1 MB | 1 MB | NO | metadata only |

**Total data read per tick**: ~6.1 GB (MUL_MAT = 4.8 GB, GET_ROWS = 0.7 GB, CPY = 0.1 GB, rest ≈ 0.5 GB)

## Context length impact (9B Q4_0)

| Context | SET_ROWS | TPS | Notes |
|---------|----------|-----|-------|
| 256 | 8 MB | 52.9 | KV cache negligible |
| 2048 | 67 MB | 52.8 | still negligible |
| 8192 | 268 MB | 52.5 | still negligible |

KV cache traffic for the 8 full-attention layers is tiny compared to MUL_MAT weight reads. The GatedDeltaNet state (51 MB) is larger, but constant with context length.

## Architecture-specific notes

Qwen3.5 is a hybrid architecture: 3 GatedDeltaNet layers + 1 full-attention layer per group of 4.

Per GatedDeltaNet layer:
- 3 input matmuls (qkv_a, alpha, beta) — Q8_0 ranked
- 1 z-gate matmul — Q4_0
- 1 output projection matmul — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- SSM_CONV, L2_NORM, SCALE, MUL for the state update
- Total: ~7-8 MUL_MAT + SSM_CONV + misc

Per full-attention layer:
- 3 input projections (Q, K, V) — Q4_0
- 1 output projection — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- ROPE, FLASH_ATTN_EXT
- Total: 7-8 MUL_MAT

## Dispatch overhead analysis

- 1833 ops/tick, 682 zero-ops (metadata), 1151 GPU dispatches
- At 52.9 tok/s → 18.9 ms/tick → 16.4 us per GPU dispatch on average
- M4 Max Metal dispatch floor: ~3-5 us (from profiling)
- Dispatch overhead: 3.5-5.8 ms/tick, i.e. 18-30% of the tick (derivation in the sketch after this list)
- MUL_MAT weight reads: 4.8 GB at the observed 289 GB/s ≈ 16.6 ms (but pipelined with other ops)
- Other data: ~1.3 GB reads + ~0.4 GB writes ≈ 5-6 ms at 289 GB/s
- **Neither compute, bandwidth, nor dispatch is fully utilized**
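A throwaway back-of-envelope for the bullets above; every input is a number quoted on this page, nothing here is newly measured:

```cpp
// Back-of-envelope for the dispatch-overhead bullets. All inputs are the
// measurements above; this just makes the derivation reproducible.
#include <cstdio>

int main() {
    const double tps            = 52.9;   // measured tg tokens/s
    const double gpu_dispatches = 1151;   // GPU dispatches per tick (op table)
    const double mulmat_gb      = 4.8;    // MUL_MAT weight reads per tick
    const double eff_bw_gbs     = 289;    // observed effective bandwidth

    const double tick_ms = 1000.0 / tps;  // 18.9 ms budget per token
    printf("tick budget      : %.1f ms\n", tick_ms);
    printf("per-dispatch avg : %.1f us\n",
           1000.0 * tick_ms / gpu_dispatches);              // 16.4 us

    // Metal dispatch floor on M4 Max: ~3-5 us from profiling
    const double floors_us[] = {3.0, 5.0};
    for (double floor_us : floors_us) {
        const double ovh_ms = floor_us * gpu_dispatches / 1000.0;
        printf("floor %.0f us -> %.1f ms overhead (%.0f%% of tick)\n",
               floor_us, ovh_ms, 100.0 * ovh_ms / tick_ms); // 3.5-5.8 ms, 18-30%
    }

    // MUL_MAT reads alone would take ~16.6 ms if fully serialized
    printf("MUL_MAT reads    : %.1f ms serialized\n",
           1000.0 * mulmat_gb / eff_bw_gbs);
    return 0;
}
```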
## Comparison with MLX

MLX achieves ~355 GB/s effective bandwidth vs llama.cpp's ~289 GB/s on similar models, a ~24% gap. Potential sources of the gap:

1. **Kernel memory access patterns**: MLX reads weights contiguously; llama.cpp's layout interleaves them
2. **Dispatch efficiency**: 1151 GPU dispatches vs likely fewer in MLX (fewer view/reshape ops?)
3. **Non-MUL_MAT ops**: nearly 600 MB/tick of reads for GET_ROWS/CPY/SET_ROWS — are these as efficient in llama.cpp?
4. **Graph optimization**: llama.cpp has many zero-ops (682 VIEW/RESHAPE/TRANSPOSE/PERMUTE) that still need encoding — can these be eliminated?

## Profiling methodology

- `llama-eval-callback-profile`: custom tool using `cb_eval` to observe ops without forcing a per-op sync (byte accounting sketched below)
- `GGML_METAL_GRAPH_DEBUG=1` with `-v`: prints the per-op graph structure (requires DEBUG log level)
- `GGML_METAL_CAPTURE_COMPUTE=2`: captures an Xcode Instruments GPUtrace of the 2nd compute call (the first tokgen step)
- Concurrency disabled (`GGML_METAL_CONCURRENCY_DISABLE=1`): ~53 → 52 tok/s (slightly worse)
- Fusion disabled (`GGML_METAL_FUSION_DISABLE=1`): negligible impact
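For reference, the BytesIn/tk and BytesOut/tk columns come from summing tensor sizes at `cb_eval` ask time. A minimal sketch of that accounting; `ggml_nbytes`, `ggml_op_name`, `GGML_MAX_SRC`, and the `src[]` array are the real ggml API, while the map and helper are illustrative and may not match the actual tool:

```cpp
// Sketch of the per-op byte accounting behind BytesIn/tk and BytesOut/tk.
// Only tensor metadata (type/shape/strides) is touched, never tensor data,
// so this stays compatible with the non-syncing callback sketched earlier.
#include "ggml.h"
#include <cstdint>
#include <map>
#include <string>

struct op_bytes {
    uint64_t in  = 0; // sum of source tensor sizes (bytes read)
    uint64_t out = 0; // result tensor size (bytes written)
};

// op name -> bytes moved; divide by token count for per-tick averages
static std::map<std::string, op_bytes> g_bytes;

static void account(const struct ggml_tensor * t) {
    op_bytes & b = g_bytes[ggml_op_name(t->op)];
    for (int i = 0; i < GGML_MAX_SRC && t->src[i] != NULL; ++i) {
        b.in += ggml_nbytes(t->src[i]);
    }
    b.out += ggml_nbytes(t);
    // note: zero-ops (VIEW etc.) still accumulate bytes here even though no
    // kernel runs, which is why the op table reports BytesIn for NO-GPU rows
}
```

`account()` would be called from the `ask == true` branch of the counting callback above, alongside the per-op tallies.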