Git Workflow — llama.cpp M4 Max Performance Fork

This is a private fork of ggerganov/llama.cpp focused on Apple M4 Max Metal performance. All development happens on our Gitea instance. No changes ever touch upstream GitHub.

Remotes

origin  → https://github.com/ggerganov/llama.cpp.git   (read-only: git pull only)
gitea   → ssh://sleepy@git.kokoham.com:2222/sleepy/llama.cpp.git  (read/write)
  • origin has no credentials — can pull but cannot push. Safe for agents.
  • gitea is the working fork on our Gitea instance (SSH port 2222, user sleepy).
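
Optionally, the read-only intent can be enforced at the git level as well. This is a small hardening sketch, not something the setup above requires; setting the push URL to a non-repository string makes any accidental push fail immediately:

# Make pushes to origin fail fast even if credentials ever appear
git remote set-url --push origin DISABLED
git remote -v        # the push URL for origin should now read DISABLED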

Syncing Upstream

git fetch origin
git merge origin/master          # fast-forward if clean
git push gitea master

Do this periodically. Conflicts should be rare, since we only add tools and docs rather than modifying core code.
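
To check how far the fork has drifted before merging, plain git suffices:

git fetch origin
git log --oneline master..origin/master   # upstream commits we have not merged yet
git log --oneline origin/master..master   # our commits on top of upstream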

Branch Structure

master                    — always tracks upstream master (clean merge)
feature/<short-desc>      — active development branches (e.g., feature/mul-mat-contig-reads)
profile/<desc>            — profiling/measurement branches
fix/<desc>                — bug fixes found during profiling
exp/<desc>                — experimental, may be discarded

Branches are short-lived. Merge to master via PR, then delete.

Issue Tracking

All work items are tracked as issues on https://git.kokoham.com/sleepy/llama.cpp/issues.

Issue labels:

  • perf — performance investigation
  • kernel — Metal kernel changes
  • profiling — measurement/tooling
  • doc — documentation only
  • bug — correctness issues
  • infra — CI, build, repo setup

Pull Request Workflow

  1. Create a branch from master: git checkout -b feature/<name>
  2. Make changes; commit using the [area] description convention (see Commit Messages below)
  3. Push the branch: git push gitea feature/<name> (steps 1-3 are sketched after this list)
  4. Create a PR on Gitea targeting master
  5. Before merging: build, benchmark (record results in BENCHMARKS.md), and run a perplexity check if a kernel changed
  6. Squash-merge to master
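
Put together, steps 1-3 look like this; the branch name feature/q4-dot-unroll, the file touched, and the commit are illustrative placeholders:

git checkout master
git fetch origin && git merge origin/master               # start from current upstream
git checkout -b feature/q4-dot-unroll                     # hypothetical branch name
# ... edit ggml/src/ggml-metal/ggml-metal.metal ...
git commit -am "[metal] unroll Q4_0 dot inner loop (#123)"   # hypothetical; #123 = tracking issue
git push gitea feature/q4-dot-unroll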

Commit Messages

Format: [area] short description (max 72 chars)

Areas: metal, profile, docs, build, tool

Examples:

[metal] add contiguous weight read path to Q4_0 mul_mat kernel
[profile] add per-op timing to metal encode loop
[docs] graph profile results for 9B Q4_0 at ctx=256
[tool] llama-eval-callback-profile: non-syncing cb_eval profiler

Agent Instructions

When working autonomously, agents MUST:

  1. Never push to origin. origin has no credentials, so pushes fail; this is a safety measure
  2. Create a branch for any code change: feature/<issue-number>-<short-desc>
  3. Reference the issue in commits: [area] description (#123)
  4. Run benchmarks before/after kernel changes and record in BENCHMARKS.md
  5. Run perplexity to verify correctness after any kernel change:
    ./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
    
  6. Verify the build succeeds before pushing (items 4-6 are combined into a pre-push sketch after this list):
    cmake --build build-build -j$(sysctl -n hw.ncpu)
    
  7. Push the branch to gitea, then create the PR via the Gitea API (git push alone does not open a PR)
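
Items 4-6 combine naturally into a single pre-push gate. A minimal sketch, assuming MODEL.gguf is a placeholder (substitute a model from Model Files below):

set -e                                            # abort on the first failure
cmake --build build-build -j$(sysctl -n hw.ncpu)
./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3    # record in BENCHMARKS.md
git push gitea feature/<issue-number>-<short-desc>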

Build

# Initial cmake (one time)
cmake -B build-build -DGGML_METAL=ON -DGGML_BLAS=ON -DGGML_ACCELERATE=ON

# Incremental build
cmake --build build-build -j$(sysctl -n hw.ncpu)

# Build specific target
cmake --build build-build --target llama-eval-callback-profile -j$(sysctl -n hw.ncpu)

Benchmark Commands

# Quick bench (pp + tg)
./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3

# Long tg bench (bandwidth-sensitive)
./build-build/bin/llama-bench -m MODEL.gguf -p 1 -t 1 -n 4096 -o md -r 2

# Perplexity
./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128

Profiling Tools

Tool                                  What it does
llama-eval-callback-profile           Counts ops + bytes per decode tick (non-syncing cb_eval)
GGML_METAL_GRAPH_DEBUG=1              Prints per-op graph during compute (needs -v flag)
GGML_METAL_GRAPH_DEBUG=2              Also prints tensor shapes
GGML_METAL_CAPTURE_COMPUTE=N          Captures Nth compute call to Xcode Instruments GPUtrace
GGML_METAL_CONCURRENCY_DISABLE=1      Disable concurrent encoding (benchmark impact)
GGML_METAL_FUSION_DISABLE=1           Disable op fusion (benchmark impact)
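
The environment variables compose with any of the binaries above; two illustrative runs (model path and capture index are placeholders):

# Print the per-op graph during a short generation (the debug output needs -v)
GGML_METAL_GRAPH_DEBUG=1 ./build-build/bin/llama-cli -m MODEL.gguf -p "hi" -n 8 -v

# Capture the 2nd compute call into a GPU trace for Xcode
GGML_METAL_CAPTURE_COMPUTE=2 ./build-build/bin/llama-bench -m MODEL.gguf -p 1 -t 1 -n 32 -r 1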

Model Files

Located at /Users/sleepy/.llama/models/:

Qwen3.5-4B-Q4_0.gguf      (2.40 GiB)
Qwen3.5-9B-Q4_0.gguf      (5.00 GiB)
Qwen3.5-9B-IQ4_NL.gguf    (4.99 GiB)
Qwen3.5-9B-IQ4_XS.gguf    (4.80 GiB)
Qwen3.6-27B-Q4_0.gguf     (14.70 GiB)

Key Source Files

ggml/src/ggml-metal/ggml-metal.metal          — Metal shader kernels (Q4_0 dot: line 3228)
ggml/src/ggml-metal/ggml-metal-device.cpp     — Pipeline dispatch (get_pipeline_mul_mv: line 741)
ggml/src/ggml-metal/ggml-metal-ops.cpp        — Op encoding (MUL_MAT: line 2257)
ggml/src/ggml-metal/ggml-metal-context.m      — Graph compute (line 438)
ggml/src/ggml-metal/ggml-metal-impl.h         — Tuning params (N_R0, N_SG)
examples/eval-callback/eval-callback-profile.cpp — Custom profiler tool
BENCHMARKS.md                                 — All benchmark results
ANALYSIS_QWEN3_5_MXFP4.md                     — MXFP4 format analysis

Gitea API

Base URL: https://git.kokoham.com/api/v1. The token lives in ~/.gitea_token (not committed). From the server itself, use the local endpoint: http://127.0.0.1:18431/api/v1.

# Create issue
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/issues" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","labels":["perf"]}'

# Create PR
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/pulls" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","head":"feature/xyz","base":"master"}'