Git Workflow — llama.cpp M4 Max Performance Fork
This is a private fork of ggerganov/llama.cpp focused on Apple M4 Max Metal performance. All development happens on our Gitea instance. No changes ever touch upstream GitHub.
Remotes
origin → https://github.com/ggerganov/llama.cpp.git (read-only: git pull only)
gitea → ssh://sleepy@git.kokoham.com:2222/sleepy/llama.cpp.git (read/write)
- origin has no credentials: it can pull but cannot push, so it is safe for agents.
- gitea is the working fork on our Gitea instance (SSH port 2222, user sleepy).
Syncing Upstream
git fetch origin
git merge origin/master # fast-forward if clean
git push gitea master
Do this periodically. Conflicts should be rare because we only add tools and docs and do not modify core code.
Branch Structure
master — always tracks upstream master (clean merge)
feature/<short-desc> — active development branches (e.g., feature/mul-mat-contig-reads)
profile/<desc> — profiling/measurement branches
fix/<desc> — bug fixes found during profiling
exp/<desc> — experimental, may be discarded
Branches are short-lived. Merge to master via PR, then delete.
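The naming scheme above can be checked mechanically before a branch is pushed. The following is a minimal sketch, not part of the repository: the function name `valid_branch_name` is hypothetical, and it assumes that the prefixes listed above (plus `master`) are the only allowed names.

```shell
# Hypothetical helper: return 0 only for branch names that follow the
# scheme above (master, or one of the four prefixes plus a non-empty suffix).
valid_branch_name() {
    case "$1" in
        master) return 0 ;;
        feature/?*|profile/?*|fix/?*|exp/?*) return 0 ;;
        *) return 1 ;;
    esac
}
```

A possible use is `valid_branch_name "$(git rev-parse --abbrev-ref HEAD)" || echo "unexpected branch name"`.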
Issue Tracking
All work items are tracked as issues on https://git.kokoham.com/sleepy/llama.cpp/issues.
Issue labels:
- perf: performance investigation
- kernel: Metal kernel changes
- profiling: measurement/tooling
- doc: documentation only
- bug: correctness issues
- infra: CI, build, repo setup
Pull Request Workflow
- Create branch from master: git checkout -b feature/<name>
- Make changes, commit with [area] description conventions (see below)
- Push branch: git push gitea feature/<name>
- Create PR on Gitea targeting master
- Before merge: build, benchmark (record in BENCHMARKS.md), perplexity check if kernel changed
- Squash-merge to master
Commit Messages
Format: [area] short description (max 72 chars)
Areas: metal, profile, docs, build, tool
Examples:
[metal] add contiguous weight read path to Q4_0 mul_mat kernel
[profile] add per-op timing to metal encode loop
[docs] graph profile results for 9B Q4_0 at ctx=256
[tool] llama-eval-callback-profile: non-syncing cb_eval profiler
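The format above lends itself to an automated check, for example from a commit-msg hook. This is a rough sketch under two assumptions: the 72-character limit applies to the whole subject line including the `[area]` tag, and the five areas listed are the only valid ones. The function name `valid_commit_subject` is hypothetical.

```shell
# Hypothetical helper: return 0 if the subject line matches
# "[area] short description" and is at most 72 characters long.
valid_commit_subject() {
    subject="$1"
    # Assumption: the limit covers the full line, tag included.
    [ "${#subject}" -le 72 ] || return 1
    case "$subject" in
        "[metal] "?*|"[profile] "?*|"[docs] "?*|"[build] "?*|"[tool] "?*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}
```

In a commit-msg hook this could be called as `valid_commit_subject "$(head -n 1 "$1")" || exit 1`, since git passes the path of the message file as the first argument.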
Agent Instructions
When working autonomously, agents MUST:
- Never push to origin. origin has no credentials; this is a safety measure.
- Create a branch for any code change: feature/<issue-number>-<short-desc>
- Reference the issue in commits: [area] description (#123)
- Run benchmarks before/after kernel changes and record in BENCHMARKS.md
- Run perplexity to verify correctness after any kernel change: ./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
- Confirm the build succeeds before pushing: cmake --build build-build -j$(sysctl -n hw.ncpu)
- Push branch to gitea, then create PR via Gitea API (not via git push)
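The "never push to origin" rule can be backed by a second guard in addition to the missing credentials. The sketch below is one possible pre-push hook body, not something the repository is known to ship. Git invokes `.git/hooks/pre-push` with the remote name as `$1` and the remote URL as `$2`. The function name `push_allowed` is hypothetical, and treating any `github.com` URL as upstream is an assumption.

```shell
# Hypothetical pre-push guard: refuse pushes aimed at the upstream remote.
# $1 = remote name, $2 = remote URL (as passed by git to the pre-push hook).
push_allowed() {
    remote_name="$1"
    remote_url="$2"
    if [ "$remote_name" = "origin" ]; then
        echo "refusing to push to origin (upstream is read-only)" >&2
        return 1
    fi
    # Assumption: anything hosted on github.com counts as upstream here.
    case "$remote_url" in
        *github.com*)
            echo "refusing to push to a GitHub URL: $remote_url" >&2
            return 1 ;;
    esac
    return 0
}
```

Inside the hook file, the function would be followed by `push_allowed "$1" "$2" || exit 1`; a non-zero exit status makes git abort the push.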
Build
# Initial cmake (one time)
cmake -B build-build -DGGML_METAL=ON -DGGML_BLAS=ON -DGGML_ACCELERATE=ON
# Incremental build
cmake --build build-build -j$(sysctl -n hw.ncpu)
# Build specific target
cmake --build build-build --target llama-eval-callback-profile -j$(sysctl -n hw.ncpu)
Benchmark Commands
# Quick bench (pp + tg)
./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3
# Long tg bench (bandwidth-sensitive)
./build-build/bin/llama-bench -m MODEL.gguf -p 1 -t 1 -n 4096 -o md -r 2
# Perplexity
./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
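When recording before/after numbers from the commands above, a relative change is often easier to read than two raw throughput figures. The helper below is only an illustrative sketch: `pct_change` is a hypothetical name, it expects two tokens-per-second values read off by hand, and it does not parse llama-bench output.

```shell
# Hypothetical helper: print the relative change from $1 (before) to
# $2 (after) as a signed percentage with one decimal place.
pct_change() {
    awk -v before="$1" -v after="$2" 'BEGIN {
        if (before == 0) { print "n/a"; exit }
        printf "%+.1f%%\n", (after - before) / before * 100
    }'
}
```

For example, `pct_change 100 110` prints `+10.0%`, and `pct_change 50 45` prints `-10.0%`.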
Profiling Tools
| Tool | What it does |
|---|---|
| llama-eval-callback-profile | Counts ops + bytes per decode tick (non-syncing cb_eval) |
| GGML_METAL_GRAPH_DEBUG=1 | Prints per-op graph during compute (needs -v flag) |
| GGML_METAL_GRAPH_DEBUG=2 | Also prints tensor shapes |
| GGML_METAL_CAPTURE_COMPUTE=N | Captures Nth compute call to Xcode Instruments GPUtrace |
| GGML_METAL_CONCURRENCY_DISABLE=1 | Disable concurrent encoding (benchmark impact) |
| GGML_METAL_FUSION_DISABLE=1 | Disable op fusion (benchmark impact) |
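The two "benchmark impact" switches are most useful as an A/B comparison: the same command run once normally and once with the variable set. A minimal sketch of such a wrapper follows. The name `ab_run` is hypothetical, and it assumes the variable is not already set in the calling environment.

```shell
# Hypothetical wrapper: run a command twice, first as-is and then with
# the environment variable named in $1 set to 1.
ab_run() {
    var="$1"
    shift
    echo "== baseline =="
    "$@"
    echo "== ${var}=1 =="
    env "${var}=1" "$@"
}
```

A possible invocation is `ab_run GGML_METAL_FUSION_DISABLE ./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3`, which prints both result tables one after the other.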
Model Files
Located at /Users/sleepy/.llama/models/:
Qwen3.5-4B-Q4_0.gguf (2.40 GiB)
Qwen3.5-9B-Q4_0.gguf (5.00 GiB)
Qwen3.5-9B-IQ4_NL.gguf (4.99 GiB)
Qwen3.5-9B-IQ4_XS.gguf (4.80 GiB)
Qwen3.6-27B-Q4_0.gguf (14.70 GiB)
Key Source Files
ggml/src/ggml-metal/ggml-metal.metal — Metal shader kernels (Q4_0 dot: line 3228)
ggml/src/ggml-metal/ggml-metal-device.cpp — Pipeline dispatch (get_pipeline_mul_mv: line 741)
ggml/src/ggml-metal/ggml-metal-ops.cpp — Op encoding (MUL_MAT: line 2257)
ggml/src/ggml-metal/ggml-metal-context.m — Graph compute (line 438)
ggml/src/ggml-metal/ggml-metal-impl.h — Tuning params (N_R0, N_SG)
examples/eval-callback/eval-callback-profile.cpp — Custom profiler tool
BENCHMARKS.md — All benchmark results
ANALYSIS_QWEN3_5_MXFP4.md — MXFP4 format analysis
Gitea API
Base: https://git.kokoham.com/api/v1
Token in ~/.gitea_token (not committed).
Local API from server: http://127.0.0.1:18431/api/v1
# Create issue ("labels" takes numeric label IDs, not names; list them with GET /repos/sleepy/llama.cpp/labels)
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/issues" \
-H "Authorization: token $(cat ~/.gitea_token)" \
-H "Content-Type: application/json" \
-d '{"title":"...","body":"...","labels":[LABEL_ID]}'
# Create PR
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/pulls" \
-H "Authorization: token $(cat ~/.gitea_token)" \
-H "Content-Type: application/json" \
-d '{"title":"...","body":"...","head":"feature/xyz","base":"master"}'
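Hand-written JSON in `-d` breaks as soon as a title contains a double quote or backslash. The sketch below builds the PR payload with basic escaping. Both function names (`json_escape`, `pr_payload`) are hypothetical, the base branch is assumed to be `master`, and only single-line strings are handled (newlines and control characters are not escaped).

```shell
# Hypothetical helper: escape backslashes and double quotes for use
# inside a JSON string. Single-line input only.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Hypothetical helper: print a PR payload for title $1, body $2, head branch $3.
pr_payload() {
    printf '{"title":"%s","body":"%s","head":"%s","base":"master"}\n' \
        "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}
```

The result can be passed to the PR request above as `-d "$(pr_payload "[metal] my change (#123)" "details" feature/xyz)"`.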