llama.cpp/BENCHMARKS.md
# Baseline Benchmarks
**Date**: 2026-04-30
**Hardware**: Apple M4 Max
**Build**: 683c5acb9 (upstream main)
**Commands**:
- `llama-bench -m MODEL -p 512 -t 1 -n 128 -o md -r 3` (pp512/tg128)
- `llama-bench -m MODEL -p 1 -t 1 -n 4096 -o md -r 2` (tg4096)
## pp512 (tokens/s)
| Model | Q4_0 | IQ4_NL | IQ4_XS |
|-------|------|--------|--------|
| 4B | 1262.78 | 1252.70 | 1238.49 |
| 9B | 712.91 | 707.50 | 697.51 |
## tg128 (tokens/s)
| Model | Q4_0 | IQ4_NL | IQ4_XS |
|-------|------|--------|--------|
| 4B | 80.00 | 79.24 | 80.04 |
| 9B | 53.83 | 53.93 | 54.95 |
## tg4096 (tokens/s)
| Model | Q4_0 | IQ4_NL | IQ4_XS |
|-------|------|--------|--------|
| 4B | 76.09 | 75.24 | 45.23 |
| 9B | 52.06 | 51.95 | 38.51 |
## Perplexity (Q4_0 4B, ctx=128)
PPL = 2.2641 +/- 0.47327
## Effective bandwidth (9B models, tg128)
| Format | Size (GiB) | tg TPS | Effective BW (GB/s) |
|--------|------------|--------|---------------------|
| Q4_0 | 5.00 | 53.83 | 289 |
| IQ4_NL | 4.99 | 53.93 | 289 |
| IQ4_XS | 4.80 | 54.95 | 283 |
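The effective-bandwidth column is derived from model size and decode rate: each generated token reads the full weight set once. A minimal sketch of the arithmetic (assuming sizes are reported in GiB, 2^30 bytes, and bandwidth in GB/s, 10^9 bytes/s, matching llama-bench conventions):

```python
# Effective bandwidth: every decoded token streams the whole model once,
# so BW ≈ model size (bytes) × tokens/s.
GIB = 2**30

def effective_bw_gb_s(size_gib: float, tps: float) -> float:
    """Decode-time effective bandwidth in GB/s (10^9 bytes/s)."""
    return size_gib * GIB * tps / 1e9

rows = {
    "Q4_0":   (5.00, 53.83),
    "IQ4_NL": (4.99, 53.93),
    "IQ4_XS": (4.80, 54.95),
}
for fmt, (size_gib, tps) in rows.items():
    print(f"{fmt}: {effective_bw_gb_s(size_gib, tps):.0f} GB/s")
# Q4_0: 289, IQ4_NL: 289, IQ4_XS: 283 — matching the table
```

Note the smaller IQ4_XS file yields higher TPS but slightly lower effective bandwidth, consistent with some fixed per-token cost that doesn't shrink with model size.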
---
# F16 Accumulation Results
**Date**: 2026-04-30
**Build**: 683c5acb9 + F16 Q4_0 kernel (GGML_METAL_F16_ACCUM=1)
## Q4_0 with F16 accumulation (tg4096)
| Model | tg4096 F32 | tg4096 F16 | Delta |
|-------|-----------|-----------|-------|
| 4B | 76.09 | 76.15 | +0.08% |
| 9B | 52.06 | 51.94 | -0.23% |
## Perplexity with F16 accumulation (Q4_0 4B, ctx=128)
PPL = 2.2641 +/- 0.47327 (identical to baseline)
**Conclusion**: F16 accumulation gives no measurable performance improvement and no quality impact. Reverted.
---
# Graph Profile (tokgen decode)
**Date**: 2026-04-30
**Build**: 683c5acb9 (upstream main, clean)
**Tool**: `llama-eval-callback-profile` (custom, non-syncing cb_eval)
**Test**: p="The", n=32, ctx=256, t=1
**Key finding**: llama.cpp dispatches 1833 ops per decode tick (9B model). 682 are zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE — no GPU kernel). 1151 are actual GPU dispatches. This is a significant structural source of overhead.
## 9B Q4_0 (52.9 tok/s, 1833 ops/tick, 1151 GPU dispatches/tick)
| Op | PerTick | BytesIn/tk | BytesOut/tk | GPU? | Notes |
|----|--------|------------|-------------|------|-------|
| VIEW | 346 | 274 MB | 116 MB | NO | metadata only |
| RESHAPE | 288 | 108 MB | 108 MB | NO | metadata only |
| GET_ROWS | 99 | 678 MB | 53 MB | YES | token embed + DeltaNet state |
| CPY | 97 | 106 MB | 53 MB | YES | type conversion/layout |
| MUL_MAT | 249 | **4797 MB** | 7 MB | YES | weight matmuls (dominant) |
| GATED_DELTA_NET | 24 | 51 MB | 51 MB | YES | linear attention update |
| PERMUTE | 24 | 9 MB | 9 MB | NO | metadata only |
| SET_ROWS | 16 | 8 MB | 8 MB | YES | KV cache write |
| GLU | 32 | 3 MB | 2 MB | YES | FFN activation |
| MUL | 161 | 4 MB | 2 MB | YES | element-wise multiply |
| UNARY/SILU | 104 | 1 MB | 1 MB | YES | activation functions |
| RMS_NORM | 105 | 2 MB | 2 MB | YES | layer norms |
| ADD | 88 | 2 MB | 1 MB | YES | residual connections |
| SSM_CONV | 24 | 6 MB | 1 MB | YES | DeltaNet conv1d |
| L2_NORM | 48 | 0.4 MB | 0.4 MB | YES | q/k norm |
| ROPE | 16 | 0.2 MB | 0.2 MB | YES | rotary embeddings |
| FLASH_ATTN_EXT | 8 | 9 MB | 0.1 MB | YES | full attention (8 layers) |
| CONCAT | 24 | 3 MB | 3 MB | YES | tensor concatenation |
| SCALE | 48 | 0 | 0 | YES | scaling |
| CONT | 8 | 0.3 MB | 0.1 MB | YES | contiguous copy |
| TRANSPOSE | 24 | 1 MB | 1 MB | NO | metadata only |
**Total data read per tick**: ~6.1 GB (MUL_MAT = 4.8 GB, GET_ROWS = 0.7 GB, CPY = 0.1 GB, rest ≈ 0.5 GB)
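The ~6.1 GB/tick total can be sanity-checked by summing the BytesIn column for the dominant GPU ops (table values are rounded to MB, and "rest" lumps the remaining small ops at the ~0.5 GB quoted above, so the total is approximate):

```python
# Per-tick read volume (MB) for the dominant GPU ops from the table above.
reads_mb = {
    "MUL_MAT": 4797,   # weight matmuls (dominant)
    "GET_ROWS": 678,   # token embed + DeltaNet state
    "CPY": 106,        # type conversion/layout
    "rest": 500,       # all remaining small GPU ops (approximate)
}
total_gb = sum(reads_mb.values()) / 1000
print(f"~{total_gb:.1f} GB read per tick")
```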
## Context length impact (9B Q4_0)
| Context | SET_ROWS | TPS | Notes |
|---------|----------|-----|-------|
| 256 | 8 MB | 52.9 | KV cache negligible |
| 2048 | 67 MB | 52.8 | Still negligible |
| 8192 | 268 MB | 52.5 | Still negligible |
KV cache for 8 full-attention layers is tiny compared to MUL_MAT weight reads. The GatedDeltaNet state (51 MB) is larger but constant with context.
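The SET_ROWS volume in the table grows roughly linearly with context (≈0.033 MB per context token across the 8 full-attention layers). A rough extrapolation, assuming this linearity holds, shows how far out the context would have to grow before KV traffic rivaled weight reads:

```python
# SET_ROWS bytes/tick at three context lengths (from the table above, MB).
ctx_mb = {256: 8, 2048: 67, 8192: 268}

# Linear fit through the endpoints: MB of KV traffic per context token.
slope = (ctx_mb[8192] - ctx_mb[256]) / (8192 - 256)
print(f"{slope:.3f} MB per context token")

# Context length at which KV volume would match the 4797 MB of MUL_MAT
# weight reads — a rough extrapolation, not a measured point.
break_even_ctx = 4797 / slope
print(f"KV traffic rivals weight reads only near ctx ≈ {break_even_ctx/1000:.0f}k tokens")
```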
## Architecture-specific notes
Qwen3.5 has a hybrid architecture: 3 GatedDeltaNet + 1 full-attention per group of 4 layers.
Per GatedDeltaNet layer:
- 3 input matmuls (qkv_a, alpha, beta) — Q8_0
- 1 z-gate matmul — Q4_0
- 1 output projection matmul — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- SSM_CONV, L2_NORM, SCALE, MUL for state update
- Total: ~7-8 MUL_MAT + SSM_CONV + misc
Per full-attention layer:
- 3 input projections (Q, K, V) — Q4_0
- 1 output projection — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- ROPE, FLASH_ATTN_EXT
- Total: 7-8 MUL_MAT
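These per-layer counts are consistent with the 249 MUL_MAT/tick measured above, assuming (from the FLASH_ATTN_EXT count of 8 and the 3:1 hybrid ratio) 24 GatedDeltaNet + 8 full-attention layers — the layer split is inferred, not read from the model config:

```python
# Cross-check: per-layer matmul counts vs the measured 249 MUL_MAT/tick.
# 8 FLASH_ATTN_EXT ops/tick => 8 full-attention layers; the 3:1 hybrid
# ratio then implies 24 GatedDeltaNet layers (32 layers total, assumed).
deltanet_layers, attn_layers = 24, 8
matmuls_deltanet = 3 + 1 + 1 + 3  # qkv_a/alpha/beta + z-gate + out proj + 3 FFN
matmuls_attn     = 3 + 1 + 3      # Q/K/V + out proj + 3 FFN
total = deltanet_layers * matmuls_deltanet + attn_layers * matmuls_attn
print(total)  # 248; the one extra measured op is plausibly the output head
```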
## Dispatch overhead analysis
- 1833 ops/tick, 682 zero-ops (metadata), 1151 GPU dispatches
- At 52.9 tok/s → 18.9 ms/tick → 16.4 us per GPU dispatch average
- M4 Max Metal dispatch floor: ~3-5 us (from profiling)
- Dispatch overhead: 3.5-5.8 ms/tick (18-30% of total)
- MUL_MAT weight reads: 4.8 GB at observed 289 GB/s ≈ 16.6 ms (but pipelined with other ops)
- Other data: ~1.3 GB reads + ~0.4 GB writes ≈ 5-6 ms at 289 GB/s
- **Neither compute, bandwidth, nor dispatch is fully utilized**
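The time-budget arithmetic behind those bullets, spelled out (the 3-5 µs Metal dispatch floor is an assumption from profiling, not a published figure):

```python
# Decode-tick time budget for the 9B Q4_0 run (M4 Max).
tps = 52.9
gpu_dispatches = 1151

tick_ms = 1000.0 / tps                              # ms per token
avg_dispatch_us = tick_ms * 1000 / gpu_dispatches   # average per GPU dispatch
# Assumed 3-5 us per-dispatch floor:
floor_ms = (0.003 * gpu_dispatches, 0.005 * gpu_dispatches)
mulmat_ms = 4.8 / 289 * 1000                        # 4.8 GB at 289 GB/s

print(f"tick {tick_ms:.1f} ms, avg dispatch {avg_dispatch_us:.1f} us")
print(f"dispatch floor {floor_ms[0]:.1f}-{floor_ms[1]:.1f} ms "
      f"({100*floor_ms[0]/tick_ms:.0f}-{100*floor_ms[1]/tick_ms:.0f}% of tick)")
print(f"MUL_MAT reads {mulmat_ms:.1f} ms (pipelined with other ops)")
```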
## Comparison with MLX
MLX achieves ~355 GB/s effective bandwidth vs llama.cpp's ~289 GB/s on similar models (~23% gap).
Potential sources of gap:
1. **Kernel memory access patterns**: MLX uses contiguous weight reads, llama.cpp uses interleaved
2. **Dispatch efficiency**: 1151 GPU dispatches vs likely fewer in MLX (fewer view/reshape ops?)
3. **Non-MUL_MAT ops**: ~800 MB/tick of reads for GET_ROWS/CPY/SET_ROWS — are these as efficient in llama.cpp?
4. **Graph optimization**: llama.cpp has many zero-ops (682 VIEW/RESHAPE/TRANSPOSE/PERMUTE) that still need encoding — can these be eliminated?
## Profiling methodology
- `llama-eval-callback-profile`: custom tool using `cb_eval` to observe ops without forcing sync
- `GGML_METAL_GRAPH_DEBUG=1` with `-v` flag: shows per-op graph structure (requires DEBUG log level)
- `GGML_METAL_CAPTURE_COMPUTE=2`: captures Xcode Instruments GPUtrace of 2nd compute call (first tokgen)
- Concurrency disabled: `GGML_METAL_CONCURRENCY_DISABLE=1` → ~53 → 52 tok/s (slightly worse)
- Fusion disabled: `GGML_METAL_FUSION_DISABLE=1` → negligible impact