# Baseline Benchmarks

- Date: 2026-04-30
- Hardware: Apple M4 Max
- Build: 683c5acb9 (upstream main)
- Commands:
  - `llama-bench -m MODEL -p 512 -t 1 -n 128 -o md -r 3` (pp512/tg128)
  - `llama-bench -m MODEL -p 1 -t 1 -n 4096 -o md -r 2` (tg4096)

## pp512 (tokens/s)

| Model | Q4_0    | IQ4_NL  | IQ4_XS  |
| ----- | ------- | ------- | ------- |
| 4B    | 1262.78 | 1252.70 | 1238.49 |
| 9B    | 712.91  | 707.50  | 697.51  |

## tg128 (tokens/s)

| Model | Q4_0  | IQ4_NL | IQ4_XS |
| ----- | ----- | ------ | ------ |
| 4B    | 80.00 | 79.24  | 80.04  |
| 9B    | 53.83 | 53.93  | 54.95  |

## tg4096 (tokens/s)

| Model | Q4_0  | IQ4_NL | IQ4_XS |
| ----- | ----- | ------ | ------ |
| 4B    | 76.09 | 75.24  | 45.23  |
| 9B    | 52.06 | 51.95  | 38.51  |

## Perplexity (Q4_0 4B, ctx=128)

PPL = 2.2641 +/- 0.47327

## Effective bandwidth (9B models, tg128)

| Format | Size (GiB) | tg128 (tok/s) | Effective BW (GB/s) |
| ------ | ---------- | ------------- | ------------------- |
| Q4_0   | 5.00       | 53.83         | 289                 |
| IQ4_NL | 4.99       | 53.93         | 289                 |
| IQ4_XS | 4.80       | 54.95         | 283                 |
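
A minimal sketch (illustrative, not part of the benchmark tooling) of how the effective-bandwidth column is derived, assuming essentially the whole model is read once per generated token:

```cpp
// Effective bandwidth ≈ model size (read once per token) × decode speed,
// with GiB converted to GB to match the column above.
#include <cstdio>

int main() {
    struct Row { const char * fmt; double size_gib; double tg_tps; };
    const Row rows[] = {
        { "Q4_0",   5.00, 53.83 },
        { "IQ4_NL", 4.99, 53.93 },
        { "IQ4_XS", 4.80, 54.95 },
    };
    for (const Row & r : rows) {
        const double gb_per_s = r.size_gib * 1.073741824 * r.tg_tps; // GiB -> GB
        printf("%-7s %.2f GiB * %.2f tok/s = %.0f GB/s\n", r.fmt, r.size_gib, r.tg_tps, gb_per_s);
    }
    return 0;
}
```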

# F16 Accumulation Results

- Date: 2026-04-30
- Build: 683c5acb9 + F16 Q4_0 kernel (`GGML_METAL_F16_ACCUM=1`)

## Q4_0 with F16 accumulation (tg4096)

| Model | tg4096 F32 (tok/s) | tg4096 F16 (tok/s) | Delta  |
| ----- | ------------------ | ------------------ | ------ |
| 4B    | 76.09              | 76.15              | +0.08% |
| 9B    | 52.06              | 51.94              | -0.23% |

## Perplexity with F16 accumulation (Q4_0 4B, ctx=128)

PPL = 2.2641 +/- 0.47327 (identical to baseline)

Conclusion: F16 accumulation yields no measurable performance improvement and no quality impact. Reverted.


# Graph Profile (tokgen decode)

- Date: 2026-04-30
- Build: 683c5acb9 (upstream main, clean)
- Tool: `llama-eval-callback-profile` (custom, non-syncing `cb_eval`)
- Test: p="The", n=32, ctx=256, t=1

Key finding: llama.cpp dispatches 1833 ops per decode tick (9B model). 682 are zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE — no GPU kernel). 1151 are actual GPU dispatches. This is a significant structural source of overhead.
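
For reference, the zero-op vs GPU-dispatch split can be made with a predicate like the one below. This is a sketch against the public `ggml.h` op enum; `is_zero_op` is a name introduced here, not a ggml function:

```cpp
// VIEW/RESHAPE/TRANSPOSE/PERMUTE only rewrite shape/stride metadata, so the
// Metal backend never encodes a kernel for them ("zero-ops" above).
#include "ggml.h"

static bool is_zero_op(const struct ggml_tensor * t) {
    switch (t->op) {
        case GGML_OP_VIEW:
        case GGML_OP_RESHAPE:
        case GGML_OP_TRANSPOSE:
        case GGML_OP_PERMUTE:
            return true;
        default:
            return false;
    }
}
```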

## 9B Q4_0 (52.9 tok/s, 1833 ops/tick, 1151 GPU dispatches/tick)

| Op              | Per tick | Bytes in/tick | Bytes out/tick | GPU? | Notes                        |
| --------------- | -------- | ------------- | -------------- | ---- | ---------------------------- |
| VIEW            | 346      | 274 MB        | 116 MB         | no   | metadata only                |
| RESHAPE         | 288      | 108 MB        | 108 MB         | no   | metadata only                |
| GET_ROWS        | 99       | 678 MB        | 53 MB          | yes  | token embed + DeltaNet state |
| CPY             | 97       | 106 MB        | 53 MB          | yes  | type conversion/layout       |
| MUL_MAT         | 249      | 4797 MB       | 7 MB           | yes  | weight matmuls (dominant)    |
| GATED_DELTA_NET | 24       | 51 MB         | 51 MB          | yes  | linear attention update      |
| PERMUTE         | 24       | 9 MB          | 9 MB           | no   | metadata only                |
| SET_ROWS        | 16       | 8 MB          | 8 MB           | yes  | KV cache write               |
| GLU             | 32       | 3 MB          | 2 MB           | yes  | FFN activation               |
| MUL             | 161      | 4 MB          | 2 MB           | yes  | element-wise multiply        |
| UNARY/SILU      | 104      | 1 MB          | 1 MB           | yes  | activation functions         |
| RMS_NORM        | 105      | 2 MB          | 2 MB           | yes  | layer norms                  |
| ADD             | 88       | 2 MB          | 1 MB           | yes  | residual connections         |
| SSM_CONV        | 24       | 6 MB          | 1 MB           | yes  | DeltaNet conv1d              |
| L2_NORM         | 48       | 0.4 MB        | 0.4 MB         | yes  | q/k norm                     |
| ROPE            | 16       | 0.2 MB        | 0.2 MB         | yes  | rotary embeddings            |
| FLASH_ATTN_EXT  | 8        | 9 MB          | 0.1 MB         | yes  | full attention (8 layers)    |
| CONCAT          | 24       | 3 MB          | 3 MB           | yes  | tensor concatenation         |
| SCALE           | 48       | 0             | 0              | yes  | scaling                      |
| CONT            | 8        | 0.3 MB        | 0.1 MB         | yes  | contiguous copy              |
| TRANSPOSE       | 24       | 1 MB          | 1 MB           | no   | metadata only                |

Total data read per tick: ~6.1 GB (MUL_MAT = 4.8 GB, GET_ROWS = 0.7 GB, CPY = 0.1 GB, rest ≈ 0.5 GB)

## Context length impact (9B Q4_0)

| Context | SET_ROWS bytes/tick | tok/s | Notes               |
| ------- | ------------------- | ----- | ------------------- |
| 256     | 8 MB                | 52.9  | KV cache negligible |
| 2048    | 67 MB               | 52.8  | still negligible    |
| 8192    | 268 MB              | 52.5  | still negligible    |

KV cache for 8 full-attention layers is tiny compared to MUL_MAT weight reads. The GatedDeltaNet state (51 MB) is larger but constant with context.
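
A quick back-of-envelope with the numbers from the tables above, showing why the KV-cache writes stay negligible relative to weight reads:

```cpp
// SET_ROWS traffic grows roughly linearly with context, but MUL_MAT weight
// reads stay fixed at ~4797 MB/tick, so the KV-cache share of per-tick
// traffic remains in the low single-digit percent range.
#include <cstdio>

int main() {
    const double mul_mat_mb    = 4797.0;               // from the op profile above
    const int    contexts[]    = { 256, 2048, 8192 };
    const double set_rows_mb[] = { 8.0, 67.0, 268.0 }; // measured
    for (int i = 0; i < 3; i++) {
        const double share = 100.0 * set_rows_mb[i] / (set_rows_mb[i] + mul_mat_mb);
        printf("ctx %5d: SET_ROWS %6.0f MB/tick = %4.1f%% of SET_ROWS + MUL_MAT traffic\n",
               contexts[i], set_rows_mb[i], share);
    }
    return 0;
}
```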

## Architecture-specific notes

Qwen3.5 has a hybrid architecture: 3 GatedDeltaNet + 1 full-attention per group of 4 layers.

Per GatedDeltaNet layer:

- 3 input matmuls (qkv_a, alpha, beta) — Q8_0 ranked
- 1 z-gate matmul — Q4_0
- 1 output projection matmul — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- SSM_CONV, L2_NORM, SCALE, MUL for the state update
- Total: ~7-8 MUL_MAT + SSM_CONV + misc

Per full-attention layer:

- 3 input projections (Q, K, V) — Q4_0
- 1 output projection — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- ROPE, FLASH_ATTN_EXT
- Total: 7-8 MUL_MAT (reconciled with the profiled totals in the sketch below)
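
One consistent accounting reproduces the profiled 249 MUL_MAT/tick. The layer counts come from the profile (24 GATED_DELTA_NET and 8 FLASH_ATTN_EXT ops per tick); the exact split of 8 matmuls per GatedDeltaNet layer, 7 per full-attention layer, plus one LM-head matmul is an assumption consistent with the lists above, not something the profiler reports directly:

```cpp
// Reconcile the per-layer matmul counts with the profiled 249 MUL_MAT/tick.
#include <cstdio>

int main() {
    const int gdn_layers  = 24; // GATED_DELTA_NET ops per tick (one per layer)
    const int attn_layers = 8;  // FLASH_ATTN_EXT ops per tick (one per layer)
    const int mm_per_gdn  = 8;  // qkv_a, alpha, beta, z-gate, out-proj, gate, up, out
    const int mm_per_attn = 7;  // Q, K, V, out-proj, gate, up, out
    const int lm_head     = 1;  // final output projection (assumed)

    const int total = gdn_layers * mm_per_gdn + attn_layers * mm_per_attn + lm_head;
    printf("expected MUL_MAT/tick: %d (profiled: 249)\n", total); // 24*8 + 8*7 + 1 = 249
    return 0;
}
```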

## Dispatch overhead analysis

- 1833 ops/tick: 682 zero-ops (metadata only) and 1151 GPU dispatches
- At 52.9 tok/s → 18.9 ms/tick → 16.4 µs per GPU dispatch on average
- M4 Max Metal dispatch floor: ~3-5 µs (from profiling)
- Implied dispatch overhead: 3.5-5.8 ms/tick (18-30% of the tick)
- MUL_MAT weight reads: 4.8 GB at the observed 289 GB/s ≈ 16.6 ms (but pipelined with other ops)
- Other data: ~1.3 GB reads + ~0.4 GB writes ≈ 5-6 ms at 289 GB/s
- No single resource (compute, bandwidth, or dispatch) is fully utilized; the arithmetic is reproduced in the sketch below
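
The same arithmetic as a small self-contained calculation (measured inputs only; the 3-5 µs dispatch floor is the profiled figure quoted above):

```cpp
// Dispatch-overhead back-of-envelope for the 9B Q4_0 decode tick.
#include <cstdio>

int main() {
    const double tok_per_s      = 52.9;
    const double gpu_dispatches = 1151.0;             // per decode tick
    const double tick_ms        = 1000.0 / tok_per_s; // ≈ 18.9 ms

    printf("tick: %.1f ms, %.1f us per GPU dispatch\n",
           tick_ms, tick_ms * 1000.0 / gpu_dispatches);

    // Metal dispatch floor on M4 Max: ~3-5 us per dispatch (from profiling).
    const double overhead_lo_ms = gpu_dispatches * 3e-3;
    const double overhead_hi_ms = gpu_dispatches * 5e-3;
    printf("dispatch overhead: %.1f-%.1f ms (%.0f-%.0f%% of the tick)\n",
           overhead_lo_ms, overhead_hi_ms,
           100.0 * overhead_lo_ms / tick_ms, 100.0 * overhead_hi_ms / tick_ms);

    // MUL_MAT weight reads at the observed effective bandwidth.
    const double mul_mat_gb = 4.8, bw_gb_s = 289.0;
    printf("MUL_MAT reads alone: %.1f ms at %.0f GB/s\n",
           1000.0 * mul_mat_gb / bw_gb_s, bw_gb_s);
    return 0;
}
```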

## Comparison with MLX

MLX achieves ~355 GB/s effective bandwidth vs llama.cpp's ~289 GB/s on similar models, a gap of roughly 23%.

Potential sources of the gap:

1. Kernel memory access patterns: MLX uses contiguous weight reads, llama.cpp uses interleaved reads
2. Dispatch efficiency: 1151 GPU dispatches per tick vs likely fewer in MLX (fewer view/reshape ops?)
3. Non-MUL_MAT ops: nearly 600 MB/tick of reads for GET_ROWS/CPY/SET_ROWS — are these as efficient in llama.cpp?
4. Graph optimization: llama.cpp has many zero-ops (682 VIEW/RESHAPE/TRANSPOSE/PERMUTE per tick) that still need to be encoded — can these be eliminated?
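
As a rough upper bound on what closing the gap is worth, and assuming tg decode is purely bandwidth-bound, matching MLX's effective bandwidth would scale throughput by the same factor. This is a projection, not a measurement:

```cpp
// Bandwidth-bound projection: if decode is limited purely by effective
// bandwidth, throughput scales with it.
#include <cstdio>

int main() {
    const double bw_llamacpp = 289.0; // GB/s, measured above (9B, tg128)
    const double bw_mlx      = 355.0; // GB/s, reported for MLX on similar models
    const double tg_now      = 53.83; // tok/s, 9B Q4_0 tg128

    const double scale = bw_mlx / bw_llamacpp;
    printf("bandwidth gap: %.0f%%\n", 100.0 * (scale - 1.0));           // ~23%
    printf("bandwidth-bound projection: %.1f tok/s\n", tg_now * scale); // ~66 tok/s
    return 0;
}
```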

## Profiling methodology

- `llama-eval-callback-profile`: custom tool using `cb_eval` to observe ops without forcing a sync (see the sketch after this list)
- `GGML_METAL_GRAPH_DEBUG=1` with `-v`: shows the per-op graph structure (requires DEBUG log level)
- `GGML_METAL_CAPTURE_COMPUTE=2`: captures an Xcode Instruments GPUtrace of the 2nd compute call (the first tokgen)
- Concurrency disabled: `GGML_METAL_CONCURRENCY_DISABLE=1` drops ~53 → 52 tok/s (slightly worse)
- Fusion disabled: `GGML_METAL_FUSION_DISABLE=1` has negligible impact
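
The profiling tool itself is not in the tree; the sketch below shows the non-syncing `cb_eval` idea it is based on, using llama.cpp's public `cb_eval` / `cb_eval_user_data` context parameters. The `op_profile` struct and `profile_cb` function are illustrative names, not the actual tool:

```cpp
// Tally op counts and bytes from tensor metadata during the "ask" phase of the
// scheduler eval callback and return false, so the scheduler never has to
// synchronize to hand us tensor data.
#include "llama.h"
#include "ggml.h"

#include <cstdint>
#include <map>
#include <string>

struct op_profile {
    std::map<std::string, int64_t> count, bytes_in, bytes_out;
};

static bool profile_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    if (!ask) {
        return true; // only reached for tensors we asked for; keep computing
    }
    auto * prof = static_cast<op_profile *>(user_data);
    const char * name = ggml_op_name(t->op);

    prof->count[name]     += 1;
    prof->bytes_out[name] += ggml_nbytes(t);
    for (int i = 0; i < GGML_MAX_SRC && t->src[i]; i++) {
        prof->bytes_in[name] += ggml_nbytes(t->src[i]);
    }
    return false; // do not request tensor data -> no forced sync
}

// Hook it up when creating the context:
//   op_profile prof;
//   llama_context_params cparams = llama_context_default_params();
//   cparams.cb_eval           = profile_cb;
//   cparams.cb_eval_user_data = &prof;
```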