# Baseline Benchmarks

- Date: 2026-04-30
- Hardware: Apple M4 Max
- Build: 683c5acb9 (upstream main)
- Commands:
  - `llama-bench -m MODEL -p 512 -t 1 -n 128 -o md -r 3` (pp512/tg128)
  - `llama-bench -m MODEL -p 1 -t 1 -n 4096 -o md -r 2` (tg4096)

## pp512 (tokens/s)

| Model | Q4_0    | IQ4_NL  | IQ4_XS  |
| ----- | ------- | ------- | ------- |
| 4B    | 1262.78 | 1252.70 | 1238.49 |
| 9B    | 712.91  | 707.50  | 697.51  |

## tg128 (tokens/s)

| Model | Q4_0  | IQ4_NL | IQ4_XS |
| ----- | ----- | ------ | ------ |
| 4B    | 80.00 | 79.24  | 80.04  |
| 9B    | 53.83 | 53.93  | 54.95  |

## tg4096 (tokens/s)

| Model | Q4_0  | IQ4_NL | IQ4_XS |
| ----- | ----- | ------ | ------ |
| 4B    | 76.09 | 75.24  | 45.23  |
| 9B    | 52.06 | 51.95  | 38.51  |

## Perplexity (Q4_0 4B, ctx=128)

PPL = 2.2641 +/- 0.47327

## Effective bandwidth (9B models, tg128)

| Format | Size (GiB) | tg128 (tok/s) | Effective BW (GB/s) |
| ------ | ---------- | ------------- | ------------------- |
| Q4_0   | 5.00       | 53.83         | 289                 |
| IQ4_NL | 4.99       | 53.93         | 289                 |
| IQ4_XS | 4.80       | 54.95         | 283                 |
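
A minimal sketch (illustrative, not part of the benchmark tooling) of how the effective-bandwidth column is derived, assuming essentially the whole model is read once per generated token:

```cpp
// Effective bandwidth ≈ model size (read once per token) × decode speed,
// with GiB converted to GB to match the column above.
#include <cstdio>

int main() {
    struct Row { const char * fmt; double size_gib; double tg_tps; };
    const Row rows[] = {
        { "Q4_0",   5.00, 53.83 },
        { "IQ4_NL", 4.99, 53.93 },
        { "IQ4_XS", 4.80, 54.95 },
    };
    for (const Row & r : rows) {
        const double gb_per_s = r.size_gib * 1.073741824 * r.tg_tps; // GiB -> GB
        printf("%-7s %.2f GiB * %.2f tok/s = %.0f GB/s\n", r.fmt, r.size_gib, r.tg_tps, gb_per_s);
    }
    return 0;
}
```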

# F16 Accumulation Results

- Date: 2026-04-30
- Build: 683c5acb9 + F16 Q4_0 kernel (`GGML_METAL_F16_ACCUM=1`)

## Q4_0 with F16 accumulation (tg4096)

| Model | tg4096 F32 (tok/s) | tg4096 F16 (tok/s) | Delta  |
| ----- | ------------------ | ------------------ | ------ |
| 4B    | 76.09              | 76.15              | +0.08% |
| 9B    | 52.06              | 51.94              | -0.23% |

## Perplexity with F16 accumulation (Q4_0 4B, ctx=128)

PPL = 2.2641 +/- 0.47327 (identical to baseline)

Conclusion: F16 accumulation yields no measurable performance improvement and no quality impact. Reverted.


# Graph Profile (tokgen decode)

- Date: 2026-04-30
- Build: 683c5acb9 (upstream main, clean)
- Tool: `llama-eval-callback-profile` (custom, non-syncing `cb_eval`)
- Test: p="The", n=32, ctx=256, t=1

Key finding: llama.cpp dispatches 1833 ops per decode tick (9B model). 682 are zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE — no GPU kernel). 1151 are actual GPU dispatches. This is a significant structural source of overhead.
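
For reference, the zero-op vs GPU-dispatch split can be made with a predicate like the one below. This is a sketch against the public `ggml.h` op enum; `is_zero_op` is a name introduced here, not a ggml function:

```cpp
// VIEW/RESHAPE/TRANSPOSE/PERMUTE only rewrite shape/stride metadata, so the
// Metal backend never encodes a kernel for them ("zero-ops" above).
#include "ggml.h"

static bool is_zero_op(const struct ggml_tensor * t) {
    switch (t->op) {
        case GGML_OP_VIEW:
        case GGML_OP_RESHAPE:
        case GGML_OP_TRANSPOSE:
        case GGML_OP_PERMUTE:
            return true;
        default:
            return false;
    }
}
```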

## 9B Q4_0 (52.9 tok/s, 1833 ops/tick, 1151 GPU dispatches/tick)

| Op              | Per tick | Bytes in/tick | Bytes out/tick | GPU? | Notes                        |
| --------------- | -------- | ------------- | -------------- | ---- | ---------------------------- |
| VIEW            | 346      | 274 MB        | 116 MB         | no   | metadata only                |
| RESHAPE         | 288      | 108 MB        | 108 MB         | no   | metadata only                |
| GET_ROWS        | 99       | 678 MB        | 53 MB          | yes  | token embed + DeltaNet state |
| CPY             | 97       | 106 MB        | 53 MB          | yes  | type conversion/layout       |
| MUL_MAT         | 249      | 4797 MB       | 7 MB           | yes  | weight matmuls (dominant)    |
| GATED_DELTA_NET | 24       | 51 MB         | 51 MB          | yes  | linear attention update      |
| PERMUTE         | 24       | 9 MB          | 9 MB           | no   | metadata only                |
| SET_ROWS        | 16       | 8 MB          | 8 MB           | yes  | KV cache write               |
| GLU             | 32       | 3 MB          | 2 MB           | yes  | FFN activation               |
| MUL             | 161      | 4 MB          | 2 MB           | yes  | element-wise multiply        |
| UNARY/SILU      | 104      | 1 MB          | 1 MB           | yes  | activation functions         |
| RMS_NORM        | 105      | 2 MB          | 2 MB           | yes  | layer norms                  |
| ADD             | 88       | 2 MB          | 1 MB           | yes  | residual connections         |
| SSM_CONV        | 24       | 6 MB          | 1 MB           | yes  | DeltaNet conv1d              |
| L2_NORM         | 48       | 0.4 MB        | 0.4 MB         | yes  | q/k norm                     |
| ROPE            | 16       | 0.2 MB        | 0.2 MB         | yes  | rotary embeddings            |
| FLASH_ATTN_EXT  | 8        | 9 MB          | 0.1 MB         | yes  | full attention (8 layers)    |
| CONCAT          | 24       | 3 MB          | 3 MB           | yes  | tensor concatenation         |
| SCALE           | 48       | 0             | 0              | yes  | scaling                      |
| CONT            | 8        | 0.3 MB        | 0.1 MB         | yes  | contiguous copy              |
| TRANSPOSE       | 24       | 1 MB          | 1 MB           | no   | metadata only                |

Total data read per tick: ~6.1 GB (MUL_MAT = 4.8 GB, GET_ROWS = 0.7 GB, CPY = 0.1 GB, rest ≈ 0.5 GB)

## Context length impact (9B Q4_0)

| Context | SET_ROWS bytes/tick | tok/s | Notes               |
| ------- | ------------------- | ----- | ------------------- |
| 256     | 8 MB                | 52.9  | KV cache negligible |
| 2048    | 67 MB               | 52.8  | still negligible    |
| 8192    | 268 MB              | 52.5  | still negligible    |

KV cache for 8 full-attention layers is tiny compared to MUL_MAT weight reads. The GatedDeltaNet state (51 MB) is larger but constant with context.
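
A quick back-of-envelope with the numbers from the tables above, showing why the KV-cache writes stay negligible relative to weight reads:

```cpp
// SET_ROWS traffic grows roughly linearly with context, but MUL_MAT weight
// reads stay fixed at ~4797 MB/tick, so the KV-cache share of per-tick
// traffic remains in the low single-digit percent range.
#include <cstdio>

int main() {
    const double mul_mat_mb    = 4797.0;               // from the op profile above
    const int    contexts[]    = { 256, 2048, 8192 };
    const double set_rows_mb[] = { 8.0, 67.0, 268.0 }; // measured
    for (int i = 0; i < 3; i++) {
        const double share = 100.0 * set_rows_mb[i] / (set_rows_mb[i] + mul_mat_mb);
        printf("ctx %5d: SET_ROWS %6.0f MB/tick = %4.1f%% of SET_ROWS + MUL_MAT traffic\n",
               contexts[i], set_rows_mb[i], share);
    }
    return 0;
}
```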

## Architecture-specific notes

Qwen3.5 has a hybrid architecture: 3 GatedDeltaNet + 1 full-attention per group of 4 layers.

Per GatedDeltaNet layer:

- 3 input matmuls (qkv_a, alpha, beta) — Q8_0 ranked
- 1 z-gate matmul — Q4_0
- 1 output projection matmul — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- SSM_CONV, L2_NORM, SCALE, MUL for the state update
- Total: ~7-8 MUL_MAT + SSM_CONV + misc

Per full-attention layer:

- 3 input projections (Q, K, V) — Q4_0
- 1 output projection — Q4_0
- 3 FFN matmuls (gate, up, out) — Q4_0
- ROPE, FLASH_ATTN_EXT
- Total: 7-8 MUL_MAT (reconciled with the profiled totals in the sketch below)
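
One consistent accounting reproduces the profiled 249 MUL_MAT/tick. The layer counts come from the profile (24 GATED_DELTA_NET and 8 FLASH_ATTN_EXT ops per tick); the exact split of 8 matmuls per GatedDeltaNet layer, 7 per full-attention layer, plus one LM-head matmul is an assumption consistent with the lists above, not something the profiler reports directly:

```cpp
// Reconcile the per-layer matmul counts with the profiled 249 MUL_MAT/tick.
#include <cstdio>

int main() {
    const int gdn_layers  = 24; // GATED_DELTA_NET ops per tick (one per layer)
    const int attn_layers = 8;  // FLASH_ATTN_EXT ops per tick (one per layer)
    const int mm_per_gdn  = 8;  // qkv_a, alpha, beta, z-gate, out-proj, gate, up, out
    const int mm_per_attn = 7;  // Q, K, V, out-proj, gate, up, out
    const int lm_head     = 1;  // final output projection (assumed)

    const int total = gdn_layers * mm_per_gdn + attn_layers * mm_per_attn + lm_head;
    printf("expected MUL_MAT/tick: %d (profiled: 249)\n", total); // 24*8 + 8*7 + 1 = 249
    return 0;
}
```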

## Dispatch overhead analysis

- 1833 ops/tick: 682 zero-ops (metadata only) and 1151 GPU dispatches
- At 52.9 tok/s → 18.9 ms/tick → 16.4 µs per GPU dispatch on average
- M4 Max Metal dispatch floor: ~3-5 µs (from profiling)
- Implied dispatch overhead: 3.5-5.8 ms/tick (18-30% of the tick)
- MUL_MAT weight reads: 4.8 GB at the observed 289 GB/s ≈ 16.6 ms (but pipelined with other ops)
- Other data: ~1.3 GB reads + ~0.4 GB writes ≈ 5-6 ms at 289 GB/s
- No single resource (compute, bandwidth, or dispatch) is fully utilized; the arithmetic is reproduced in the sketch below
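
The same arithmetic as a small self-contained calculation (measured inputs only; the 3-5 µs dispatch floor is the profiled figure quoted above):

```cpp
// Dispatch-overhead back-of-envelope for the 9B Q4_0 decode tick.
#include <cstdio>

int main() {
    const double tok_per_s      = 52.9;
    const double gpu_dispatches = 1151.0;             // per decode tick
    const double tick_ms        = 1000.0 / tok_per_s; // ≈ 18.9 ms

    printf("tick: %.1f ms, %.1f us per GPU dispatch\n",
           tick_ms, tick_ms * 1000.0 / gpu_dispatches);

    // Metal dispatch floor on M4 Max: ~3-5 us per dispatch (from profiling).
    const double overhead_lo_ms = gpu_dispatches * 3e-3;
    const double overhead_hi_ms = gpu_dispatches * 5e-3;
    printf("dispatch overhead: %.1f-%.1f ms (%.0f-%.0f%% of the tick)\n",
           overhead_lo_ms, overhead_hi_ms,
           100.0 * overhead_lo_ms / tick_ms, 100.0 * overhead_hi_ms / tick_ms);

    // MUL_MAT weight reads at the observed effective bandwidth.
    const double mul_mat_gb = 4.8, bw_gb_s = 289.0;
    printf("MUL_MAT reads alone: %.1f ms at %.0f GB/s\n",
           1000.0 * mul_mat_gb / bw_gb_s, bw_gb_s);
    return 0;
}
```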

## Comparison with MLX

MLX achieves ~355 GB/s effective bandwidth vs llama.cpp's ~289 GB/s on similar models, a gap of roughly 23%.

Potential sources of the gap:

1. Kernel memory access patterns: MLX uses contiguous weight reads, llama.cpp uses interleaved reads
2. Dispatch efficiency: 1151 GPU dispatches per tick vs likely fewer in MLX (fewer view/reshape ops?)
3. Non-MUL_MAT ops: nearly 600 MB/tick of reads for GET_ROWS/CPY/SET_ROWS — are these as efficient in llama.cpp?
4. Graph optimization: llama.cpp has many zero-ops (682 VIEW/RESHAPE/TRANSPOSE/PERMUTE per tick) that still need to be encoded — can these be eliminated?
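
As a rough upper bound on what closing the gap is worth, and assuming tg decode is purely bandwidth-bound, matching MLX's effective bandwidth would scale throughput by the same factor. This is a projection, not a measurement:

```cpp
// Bandwidth-bound projection: if decode is limited purely by effective
// bandwidth, throughput scales with it.
#include <cstdio>

int main() {
    const double bw_llamacpp = 289.0; // GB/s, measured above (9B, tg128)
    const double bw_mlx      = 355.0; // GB/s, reported for MLX on similar models
    const double tg_now      = 53.83; // tok/s, 9B Q4_0 tg128

    const double scale = bw_mlx / bw_llamacpp;
    printf("bandwidth gap: %.0f%%\n", 100.0 * (scale - 1.0));           // ~23%
    printf("bandwidth-bound projection: %.1f tok/s\n", tg_now * scale); // ~66 tok/s
    return 0;
}
```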

## Profiling methodology

- `llama-eval-callback-profile`: custom tool using `cb_eval` to observe ops without forcing a sync (see the sketch after this list)
- `GGML_METAL_GRAPH_DEBUG=1` with `-v`: shows the per-op graph structure (requires DEBUG log level)
- `GGML_METAL_CAPTURE_COMPUTE=2`: captures an Xcode Instruments GPUtrace of the 2nd compute call (the first tokgen)
- Concurrency disabled: `GGML_METAL_CONCURRENCY_DISABLE=1` drops ~53 → 52 tok/s (slightly worse)
- Fusion disabled: `GGML_METAL_FUSION_DISABLE=1` has negligible impact
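
The profiling tool itself is not in the tree; the sketch below shows the non-syncing `cb_eval` idea it is based on, using llama.cpp's public `cb_eval` / `cb_eval_user_data` context parameters. The `op_profile` struct and `profile_cb` function are illustrative names, not the actual tool:

```cpp
// Tally op counts and bytes from tensor metadata during the "ask" phase of the
// scheduler eval callback and return false, so the scheduler never has to
// synchronize to hand us tensor data.
#include "llama.h"
#include "ggml.h"

#include <cstdint>
#include <map>
#include <string>

struct op_profile {
    std::map<std::string, int64_t> count, bytes_in, bytes_out;
};

static bool profile_cb(struct ggml_tensor * t, bool ask, void * user_data) {
    if (!ask) {
        return true; // only reached for tensors we asked for; keep computing
    }
    auto * prof = static_cast<op_profile *>(user_data);
    const char * name = ggml_op_name(t->op);

    prof->count[name]     += 1;
    prof->bytes_out[name] += ggml_nbytes(t);
    for (int i = 0; i < GGML_MAX_SRC && t->src[i]; i++) {
        prof->bytes_in[name] += ggml_nbytes(t->src[i]);
    }
    return false; // do not request tensor data -> no forced sync
}

// Hook it up when creating the context:
//   op_profile prof;
//   llama_context_params cparams = llama_context_default_params();
//   cparams.cb_eval           = profile_cb;
//   cparams.cb_eval_user_data = &prof;
```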