llama.cpp/BENCHMARKS.md
# Baseline Benchmarks
**Date**: 2026-04-30
**Hardware**: Apple M4 Max
**Build**: 683c5acb9 (upstream main)
**Commands**:
- `llama-bench -m MODEL -p 512 -t 1 -n 128 -o md -r 3` (pp512/tg128)
- `llama-bench -m MODEL -p 1 -t 1 -n 4096 -o md -r 2` (tg4096)
## pp512 (tokens/s)
| Model | Q4_0 | IQ4_NL | IQ4_XS |
|-------|------|--------|--------|
| 4B | 1262.78 | 1252.70 | 1238.49 |
| 9B | 712.91 | 707.50 | 697.51 |
## tg128 (tokens/s)
| Model | Q4_0 | IQ4_NL | IQ4_XS |
|-------|------|--------|--------|
| 4B | 80.00 | 79.24 | 80.04 |
| 9B | 53.83 | 53.93 | 54.95 |
## tg4096 (tokens/s)
| Model | Q4_0 | IQ4_NL | IQ4_XS |
|-------|------|--------|--------|
| 4B | 76.09 | 75.24 | 45.23 |
| 9B | 52.06 | 51.95 | 38.51 |
## Perplexity (Q4_0 4B, ctx=128)
PPL = 2.2641 +/- 0.47327
## Effective bandwidth (9B models, tg128)
| Format | Size (GiB) | tg TPS | Effective BW (GB/s) |
|--------|------------|--------|---------------------|
| Q4_0 | 5.00 | 53.83 | 289 |
| IQ4_NL | 4.99 | 53.93 | 289 |
| IQ4_XS | 4.80 | 54.95 | 283 |
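The effective-bandwidth column is derived from model size and decode rate: each generated token reads the full weight set once. A minimal sketch of the arithmetic (assuming sizes are reported in GiB, 2^30 bytes, and bandwidth in GB/s, 10^9 bytes/s, matching llama-bench conventions):

```python
# Effective bandwidth: every decoded token streams the whole model once,
# so BW ≈ model size (bytes) × tokens/s.
GIB = 2**30

def effective_bw_gb_s(size_gib: float, tps: float) -> float:
    """Decode-time effective bandwidth in GB/s (10^9 bytes/s)."""
    return size_gib * GIB * tps / 1e9

rows = {
    "Q4_0":   (5.00, 53.83),
    "IQ4_NL": (4.99, 53.93),
    "IQ4_XS": (4.80, 54.95),
}
for fmt, (size_gib, tps) in rows.items():
    print(f"{fmt}: {effective_bw_gb_s(size_gib, tps):.0f} GB/s")
# Q4_0: 289, IQ4_NL: 289, IQ4_XS: 283 — matching the table
```

Note the smaller IQ4_XS file yields higher TPS but slightly lower effective bandwidth, consistent with some fixed per-token cost that doesn't shrink with model size.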
---
# F16 Accumulation Results
**Date**: 2026-04-30
**Build**: 683c5acb9 + F16 Q4_0 kernel (GGML_METAL_F16_ACCUM=1)
## Q4_0 with F16 accumulation (tg4096)
| Model | tg4096 F32 | tg4096 F16 | Delta |
|-------|-----------|-----------|-------|
| 4B | 76.09 | 76.15 | +0.08% |
| 9B | 52.06 | 51.94 | -0.23% |
## Perplexity with F16 accumulation (Q4_0 4B, ctx=128)
PPL = 2.2641 +/- 0.47327 (identical to baseline)
**Conclusion**: F16 accumulation gives no measurable performance improvement and no quality impact. Reverted.
---
# Graph Profile (tokgen decode)
**Date**: 2026-04-30
**Build**: 683c5acb9 (upstream main, clean)
**Tool**: `llama-eval-callback-profile` (custom, non-syncing cb_eval)
**Test**: p="The", n=32, ctx=256, t=1
**Key finding**: llama.cpp dispatches 1833 ops per decode tick (9B model). 682 are zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE — no GPU kernel). 1151 are actual GPU dispatches. This is a significant structural source of overhead.
## 9B Q4_0 (52.9 tok/s, 1833 ops/tick, 1151 GPU dispatches/tick)
| Op | PerTick | BytesIn/tk | BytesOut/tk | GPU? | Notes |
|----|--------|------------|-------------|------|-------|
| VIEW | 346 | 274 MB | 116 MB | NO | metadata only |
| RESHAPE | 288 | 108 MB | 108 MB | NO | metadata only |
| GET_ROWS | 99 | 678 MB | 53 MB | YES | token embed + DeltaNet state |
| CPY | 97 | 106 MB | 53 MB | YES | type conversion/layout |
| MUL_MAT | 249 | **4797 MB** | 7 MB | YES | weight matmuls (dominant) |
| GATED_DELTA_NET | 24 | 51 MB | 51 MB | YES | linear attention update |
| PERMUTE | 24 | 9 MB | 9 MB | NO | metadata only |
| SET_ROWS | 16 | 8 MB | 8 MB | YES | KV cache write |
| GLU | 32 | 3 MB | 2 MB | YES | FFN activation |
| MUL | 161 | 4 MB | 2 MB | YES | element-wise multiply |
| UNARY/SILU | 104 | 1 MB | 1 MB | YES | activation functions |
| RMS_NORM | 105 | 2 MB | 2 MB | YES | layer norms |
| ADD | 88 | 2 MB | 1 MB | YES | residual connections |
| SSM_CONV | 24 | 6 MB | 1 MB | YES | DeltaNet conv1d |
| L2_NORM | 48 | 0.4 MB | 0.4 MB | YES | q/k norm |
| ROPE | 16 | 0.2 MB | 0.2 MB | YES | rotary embeddings |
| FLASH_ATTN_EXT | 8 | 9 MB | 0.1 MB | YES | full attention (8 layers) |
| CONCAT | 24 | 3 MB | 3 MB | YES | tensor concatenation |
| SCALE | 48 | 0 | 0 | YES | scaling |
| CONT | 8 | 0.3 MB | 0.1 MB | YES | contiguous copy |
| TRANSPOSE | 24 | 1 MB | 1 MB | NO | metadata only |
**Total data read per tick**: ~6.1 GB (MUL_MAT = 4.8 GB, GET_ROWS = 0.7 GB, CPY = 0.1 GB, rest ≈ 0.5 GB)
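The ~6.1 GB/tick total can be sanity-checked by summing the BytesIn column for the dominant GPU ops (table values are rounded to MB, and "rest" lumps the remaining small ops at the ~0.5 GB quoted above, so the total is approximate):

```python
# Per-tick read volume (MB) for the dominant GPU ops from the table above.
reads_mb = {
    "MUL_MAT": 4797,   # weight matmuls (dominant)
    "GET_ROWS": 678,   # token embed + DeltaNet state
    "CPY": 106,        # type conversion/layout
    "rest": 500,       # all remaining small GPU ops (approximate)
}
total_gb = sum(reads_mb.values()) / 1000
print(f"~{total_gb:.1f} GB read per tick")
```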
## Context length impact (9B Q4_0)
| Context | SET_ROWS | TPS | Notes |
|---------|----------|-----|-------|
| 256 | 8 MB | 52.9 | KV cache negligible |
| 2048 | 67 MB | 52.8 | Still negligible |
| 8192 | 268 MB | 52.5 | Still negligible |
KV cache for 8 full-attention layers is tiny compared to MUL_MAT weight reads. The GatedDeltaNet state (51 MB) is larger but constant with context.
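The SET_ROWS volume in the table grows roughly linearly with context (≈0.033 MB per context token across the 8 full-attention layers). A rough extrapolation, assuming this linearity holds, shows how far out the context would have to grow before KV traffic rivaled weight reads:

```python
# SET_ROWS bytes/tick at three context lengths (from the table above, MB).
ctx_mb = {256: 8, 2048: 67, 8192: 268}

# Linear fit through the endpoints: MB of KV traffic per context token.
slope = (ctx_mb[8192] - ctx_mb[256]) / (8192 - 256)
print(f"{slope:.3f} MB per context token")

# Context length at which KV volume would match the 4797 MB of MUL_MAT
# weight reads — a rough extrapolation, not a measured point.
break_even_ctx = 4797 / slope
print(f"KV traffic rivals weight reads only near ctx ≈ {break_even_ctx/1000:.0f}k tokens")
```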
## Architecture-specific notes
Qwen3.5 has a hybrid architecture: 3 GatedDeltaNet + 1 full-attention per group of 4 layers.
Per GatedDeltaNet layer:
- 3 input matmuls (qkv_a, alpha, beta) — Q8_0
- 1 z-gate matmul — Q4_0
- 1 output projection matmul — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- SSM_CONV, L2_NORM, SCALE, MUL for state update
- Total: ~7-8 MUL_MAT + SSM_CONV + misc
Per full-attention layer:
- 3 input projections (Q, K, V) — Q4_0
- 1 output projection — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- ROPE, FLASH_ATTN_EXT
- Total: 7-8 MUL_MAT
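These per-layer counts are consistent with the 249 MUL_MAT/tick measured above, assuming (from the FLASH_ATTN_EXT count of 8 and the 3:1 hybrid ratio) 24 GatedDeltaNet + 8 full-attention layers — the layer split is inferred, not read from the model config:

```python
# Cross-check: per-layer matmul counts vs the measured 249 MUL_MAT/tick.
# 8 FLASH_ATTN_EXT ops/tick => 8 full-attention layers; the 3:1 hybrid
# ratio then implies 24 GatedDeltaNet layers (32 layers total, assumed).
deltanet_layers, attn_layers = 24, 8
matmuls_deltanet = 3 + 1 + 1 + 3  # qkv_a/alpha/beta + z-gate + out proj + 3 FFN
matmuls_attn     = 3 + 1 + 3      # Q/K/V + out proj + 3 FFN
total = deltanet_layers * matmuls_deltanet + attn_layers * matmuls_attn
print(total)  # 248; the one extra measured op is plausibly the output head
```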
## Dispatch overhead analysis
- 1833 ops/tick, 682 zero-ops (metadata), 1151 GPU dispatches
- At 52.9 tok/s → 18.9 ms/tick → 16.4 us per GPU dispatch average
- M4 Max Metal dispatch floor: ~3-5 us (from profiling)
- Dispatch overhead: 3.5-5.8 ms/tick (18-30% of total)
- MUL_MAT weight reads: 4.8 GB at observed 289 GB/s ≈ 16.6 ms (but pipelined with other ops)
- Other data: ~1.3 GB reads + ~0.4 GB writes ≈ 5-6 ms at 289 GB/s
- **Neither compute, bandwidth, nor dispatch is fully utilized**
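The time-budget arithmetic behind those bullets, spelled out (the 3-5 µs Metal dispatch floor is an assumption from profiling, not a published figure):

```python
# Decode-tick time budget for the 9B Q4_0 run (M4 Max).
tps = 52.9
gpu_dispatches = 1151

tick_ms = 1000.0 / tps                              # ms per token
avg_dispatch_us = tick_ms * 1000 / gpu_dispatches   # average per GPU dispatch
# Assumed 3-5 us per-dispatch floor:
floor_ms = (0.003 * gpu_dispatches, 0.005 * gpu_dispatches)
mulmat_ms = 4.8 / 289 * 1000                        # 4.8 GB at 289 GB/s

print(f"tick {tick_ms:.1f} ms, avg dispatch {avg_dispatch_us:.1f} us")
print(f"dispatch floor {floor_ms[0]:.1f}-{floor_ms[1]:.1f} ms "
      f"({100*floor_ms[0]/tick_ms:.0f}-{100*floor_ms[1]/tick_ms:.0f}% of tick)")
print(f"MUL_MAT reads {mulmat_ms:.1f} ms (pipelined with other ops)")
```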
## Comparison with MLX
MLX achieves ~355 GB/s effective bandwidth vs llama.cpp's ~289 GB/s on similar models (~23% gap).
Potential sources of gap:
1. **Kernel memory access patterns**: MLX uses contiguous weight reads, llama.cpp uses interleaved
2. **Dispatch efficiency**: 1151 GPU dispatches vs likely fewer in MLX (fewer view/reshape ops?)
3. **Non-MUL_MAT ops**: ~800 MB/tick of reads for GET_ROWS/CPY/SET_ROWS — are these as efficient in llama.cpp?
4. **Graph optimization**: llama.cpp has many zero-ops (682 VIEW/RESHAPE/TRANSPOSE/PERMUTE) that still need encoding — can these be eliminated?
## Profiling methodology
- `llama-eval-callback-profile`: custom tool using `cb_eval` to observe ops without forcing sync
- `GGML_METAL_GRAPH_DEBUG=1` with `-v` flag: shows per-op graph structure (requires DEBUG log level)
- `GGML_METAL_CAPTURE_COMPUTE=2`: captures Xcode Instruments GPUtrace of 2nd compute call (first tokgen)
- Concurrency disabled: `GGML_METAL_CONCURRENCY_DISABLE=1` → ~53 → 52 tok/s (slightly worse)
- Fusion disabled: `GGML_METAL_FUSION_DISABLE=1` → negligible impact