# Git Workflow — llama.cpp M4 Max Performance Fork
This is a private fork of [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) focused on Apple M4 Max Metal performance. All development happens on our Gitea instance. No changes ever touch upstream GitHub.

## Remotes

```
origin → https://github.com/ggerganov/llama.cpp.git (read-only: git pull only)
gitea  → ssh://sleepy@git.kokoham.com:2222/sleepy/llama.cpp.git (read/write)
```

- `origin` has no credentials — can pull but cannot push. Safe for agents.
- `gitea` is the working fork on our Gitea instance (SSH port 2222, user `sleepy`).

## Syncing Upstream

```bash
git fetch origin
git merge origin/master   # fast-forward if clean
git push gitea master
```

Do this periodically. Conflicts should be rare, since we only add tools and docs and do not modify core code.

## Branch Structure

```
master               — always tracks upstream master (clean merge)
feature/<short-desc> — active development branches (e.g., feature/mul-mat-contig-reads)
profile/<desc>       — profiling/measurement branches
fix/<desc>           — bug fixes found during profiling
exp/<desc>           — experimental, may be discarded
```

Branches are short-lived. Merge to master via PR, then delete.
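
The lifecycle above can be walked through end to end. This is a minimal sketch, not repo tooling: a throwaway bare repo in `/tmp` stands in for the gitea remote, and the branch name, commit messages, and paths are illustrative only.

```shell
set -e
# throwaway bare repo standing in for the gitea remote
rm -rf /tmp/gitea-demo.git /tmp/work-demo
git init -q --bare /tmp/gitea-demo.git
git init -q /tmp/work-demo
cd /tmp/work-demo
git config user.email "agent@example.com"
git config user.name  "agent"
git remote add gitea /tmp/gitea-demo.git
git commit -q --allow-empty -m "[docs] seed repository"
git branch -M master
git push -q gitea master

# create a feature branch, do work, publish it
git checkout -q -b feature/mul-mat-contig-reads
git commit -q --allow-empty -m "[metal] placeholder change (#1)"
git push -q gitea feature/mul-mat-contig-reads

# after the PR is squash-merged on Gitea, delete the branch everywhere
git checkout -q master
git branch -D feature/mul-mat-contig-reads
git push -q gitea --delete feature/mul-mat-contig-reads
git ls-remote --heads gitea
```

After the final push, `git ls-remote --heads gitea` lists only `refs/heads/master`, confirming the short-lived branch is gone on both ends.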
## Issue Tracking

All work items are tracked as issues on https://git.kokoham.com/sleepy/llama.cpp/issues.

Issue labels:

- `perf` — performance investigation
- `kernel` — Metal kernel changes
- `profiling` — measurement/tooling
- `doc` — documentation only
- `bug` — correctness issues
- `infra` — CI, build, repo setup

## Pull Request Workflow

1. Create a branch from master: `git checkout -b feature/<name>`
2. Make changes; commit using the `[area] description` convention (see below)
3. Push the branch: `git push gitea feature/<name>`
4. Create a PR on Gitea targeting `master`
5. Before merging: build, benchmark (record results in BENCHMARKS.md), and run a perplexity check if a kernel changed
6. Squash-merge to master

## Commit Messages

Format: `[area] short description (max 72 chars)`

Areas: `metal`, `profile`, `docs`, `build`, `tool`

Examples:
```
[metal] add contiguous weight read path to Q4_0 mul_mat kernel
[profile] add per-op timing to metal encode loop
[docs] graph profile results for 9B Q4_0 at ctx=256
[tool] llama-eval-callback-profile: non-syncing cb_eval profiler
```
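
The convention is simple enough to check mechanically before committing. A minimal sketch (the `check_msg` helper is hypothetical, not part of the repo): accept a subject line only if it starts with a known area tag and stays within 72 characters.

```shell
# Hypothetical pre-commit check for the `[area] description` convention.
check_msg() {
  local msg="$1"
  # enforce the 72-character subject limit
  [ "${#msg}" -le 72 ] || return 1
  # subject must start with one of the known area tags
  case "$msg" in
    "[metal] "*|"[profile] "*|"[docs] "*|"[build] "*|"[tool] "*) return 0 ;;
    *) return 1 ;;
  esac
}

check_msg "[metal] add contiguous weight read path" && echo "ok"
check_msg "no area tag here" || echo "rejected"
```

A wrapper like this could sit in a `commit-msg` hook so malformed subjects are rejected before they reach a PR.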
## Agent Instructions

When working autonomously, agents MUST:

1. **Never push to `origin`** — `origin` has no credentials; this is a safety measure
2. **Create a branch** for any code change: `feature/<issue-number>-<short-desc>`
3. **Reference the issue** in commits: `[area] description (#123)`
4. **Run benchmarks** before and after kernel changes and record them in BENCHMARKS.md
5. **Run perplexity** to verify correctness after any kernel change:
   ```bash
   ./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
   ```
6. **Verify the build succeeds** before pushing:
   ```bash
   cmake --build build-build -j$(sysctl -n hw.ncpu)
   ```
7. **Push the branch** to gitea, then **create the PR via the Gitea API** (not via git push)

## Build

```bash
# Initial cmake configure (one time)
cmake -B build-build -DGGML_METAL=ON -DGGML_BLAS=ON -DGGML_ACCELERATE=ON

# Incremental build
cmake --build build-build -j$(sysctl -n hw.ncpu)

# Build a specific target
cmake --build build-build --target llama-eval-callback-profile -j$(sysctl -n hw.ncpu)
```

## Benchmark Commands

```bash
# Quick bench (pp + tg)
./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3

# Long tg bench (bandwidth-sensitive)
./build-build/bin/llama-bench -m MODEL.gguf -p 1 -t 1 -n 4096 -o md -r 2

# Perplexity
./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
```

## Profiling Tools

| Tool | What it does |
|------|--------------|
| `llama-eval-callback-profile` | Counts ops + bytes per decode tick (non-syncing cb_eval) |
| `GGML_METAL_GRAPH_DEBUG=1` | Prints per-op graph during compute (needs `-v` flag) |
| `GGML_METAL_GRAPH_DEBUG=2` | Also prints tensor shapes |
| `GGML_METAL_CAPTURE_COMPUTE=N` | Captures Nth compute call to Xcode Instruments GPUtrace |
| `GGML_METAL_CONCURRENCY_DISABLE=1` | Disable concurrent encoding (benchmark impact) |
| `GGML_METAL_FUSION_DISABLE=1` | Disable op fusion (benchmark impact) |

## Model Files

Located at `/Users/sleepy/.llama/models/`:

```
Qwen3.5-4B-Q4_0.gguf     (2.40 GiB)
Qwen3.5-9B-Q4_0.gguf     (5.00 GiB)
Qwen3.5-9B-IQ4_NL.gguf   (4.99 GiB)
Qwen3.5-9B-IQ4_XS.gguf   (4.80 GiB)
Qwen3.6-27B-Q4_0.gguf    (14.70 GiB)
```

## Key Source Files

```
ggml/src/ggml-metal/ggml-metal.metal             — Metal shader kernels (Q4_0 dot: line 3228)
ggml/src/ggml-metal/ggml-metal-device.cpp        — Pipeline dispatch (get_pipeline_mul_mv: line 741)
ggml/src/ggml-metal/ggml-metal-ops.cpp           — Op encoding (MUL_MAT: line 2257)
ggml/src/ggml-metal/ggml-metal-context.m         — Graph compute (line 438)
ggml/src/ggml-metal/ggml-metal-impl.h            — Tuning params (N_R0, N_SG)
examples/eval-callback/eval-callback-profile.cpp — Custom profiler tool
BENCHMARKS.md                                    — All benchmark results
ANALYSIS_QWEN3_5_MXFP4.md                        — MXFP4 format analysis
```

## Gitea API

Base: `https://git.kokoham.com/api/v1`

Token in `~/.gitea_token` (not committed).

Local API from the server itself: `http://127.0.0.1:18431/api/v1`

```bash
# Create issue
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/issues" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","labels":["perf"]}'

# Create PR
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/pulls" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","head":"feature/xyz","base":"master"}'
```
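
Since agents are supposed to create PRs through this API, a dry-run wrapper can help sanity-check the request before sending it. This is a hypothetical sketch: `gitea_api`, its argument layout, and the `GITEA_API_BASE` variable are illustrative, not part of the repo. It only composes and prints the curl command; nothing is sent.

```shell
# Hypothetical dry-run helper: composes the curl invocation for an API call
# and prints it instead of executing it, so the payload can be reviewed first.
gitea_api() {
  local method="$1" path="$2" payload="$3"
  local base="${GITEA_API_BASE:-http://127.0.0.1:18431/api/v1}"
  printf '%s\n' "curl -X $method \"$base$path\" -H \"Authorization: token \$(cat ~/.gitea_token)\" -H \"Content-Type: application/json\" -d '$payload'"
}

gitea_api POST /repos/sleepy/llama.cpp/pulls \
  '{"title":"...","head":"feature/xyz","base":"master"}'
```

Once the printed command looks right, dropping the `printf` wrapper (or piping the output to `sh`) performs the real call.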