Git Workflow — llama.cpp M4 Max Performance Fork
This is a private fork of ggerganov/llama.cpp focused on Apple M4 Max Metal performance. All development happens on our Gitea instance. No changes ever touch upstream GitHub.
Remotes
origin → https://github.com/ggerganov/llama.cpp.git (read-only: git pull only)
gitea → ssh://sleepy@git.kokoham.com:2222/sleepy/llama.cpp.git (read/write)
origin has no credentials — it can pull but cannot push. Safe for agents.
gitea is the working fork on our Gitea instance (SSH port 2222, user sleepy).
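To confirm the remotes match this layout before pulling or pushing, a quick check (expected output shown as comments):
git remote -v
# origin  https://github.com/ggerganov/llama.cpp.git (fetch)
# origin  https://github.com/ggerganov/llama.cpp.git (push)
# gitea   ssh://sleepy@git.kokoham.com:2222/sleepy/llama.cpp.git (fetch)
# gitea   ssh://sleepy@git.kokoham.com:2222/sleepy/llama.cpp.git (push)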
Syncing Upstream
git fetch origin
git merge origin/master # fast-forward if clean
git push gitea master
Do this periodically. Conflicts should be rare since we only add tools/docs, not modify core code.
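If you would rather have the sync fail loudly than create a merge commit when histories have diverged, a fast-forward-only variant (an optional sketch, not a change to the required workflow):
git fetch origin
git merge --ff-only origin/master   # aborts instead of merging if master has diverged
git push gitea master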
Branch Structure
master — always tracks upstream master (clean merge)
feature/<short-desc> — active development branches (e.g., feature/mul-mat-contig-reads)
profile/<desc> — profiling/measurement branches
fix/<desc> — bug fixes found during profiling
exp/<desc> — experimental, may be discarded
Branches are short-lived. Merge to master via PR, then delete.
Issue Tracking
All work items are tracked as issues on https://git.kokoham.com/sleepy/llama.cpp/issues.
Issue labels:
perf — performance investigation
kernel — Metal kernel changes
profiling — measurement/tooling
doc — documentation only
bug — correctness issues
infra — CI, build, repo setup
Pull Request Workflow
- Create a branch from master: git checkout -b feature/<name>
- Make changes; commit using the [area] description convention (see below)
- Push the branch: git push gitea feature/<name>
- Create a PR on Gitea targeting master
- Before merge: build, benchmark (record in BENCHMARKS.md), run a perplexity check if a kernel changed, and run the coherence test (see below)
- Squash-merge to master
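Putting the steps together, a minimal end-to-end sketch (the branch name and issue number are placeholders):
git checkout master && git pull origin master
git checkout -b feature/123-example-change      # hypothetical branch name
# ... edit files, then:
git commit -am "[metal] example change (#123)"  # hypothetical issue number
git push gitea feature/123-example-change
# PR creation and merge go through the Gitea API (see the Gitea API section below)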
Pre-Merge Coherence Tests
Mandatory before any merge to master. Run on both master and the PR branch to detect silent correctness regressions.
IMPORTANT: macOS has no timeout command. Use gtimeout (from brew install coreutils).
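A quick guard that fails fast when gtimeout is missing (a sketch; the install step assumes Homebrew):
command -v gtimeout >/dev/null || { echo "gtimeout not found: brew install coreutils" >&2; exit 1; }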
Quick Coherence Test (4B model, ~30s)
gtimeout 60 ./build-build/bin/llama-cli \
-m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
-n 64 -p "Once upon a time" \
--temp 0 -s 42 -st
Perplexity Check (kernel changes only)
gtimeout 120 ./build-build/bin/llama-perplexity \
-m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
-f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
Verification
- llama-cli: PR output must be coherent speech (not gibberish). Does not need to be bit-perfect vs master.
- perplexity: PR perplexity must match master within floating-point tolerance (<0.1% delta)
- Gibberish output = block merge. Re-dispatch with specific feedback.
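As a sketch of the perplexity tolerance check, assuming you have captured the final PPL value from a master run and a PR-branch run (the variable names and values are hypothetical):
ppl_master=8.4321   # hypothetical value parsed from llama-perplexity on master
ppl_pr=8.4327       # hypothetical value parsed from llama-perplexity on the PR branch
awk -v a="$ppl_master" -v b="$ppl_pr" 'BEGIN {
  d = (b - a) / a; if (d < 0) d = -d;
  if (d < 0.001) print "PASS: delta " d; else { print "FAIL: delta " d; exit 1 }
}'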
Timeout Policy
All test commands MUST use gtimeout to prevent hangs:
- Inference/CLI: gtimeout 60 (60 s)
- Perplexity: gtimeout 120 (120 s)
- Benchmark: gtimeout 300 (5 min)
- Build: gtimeout 300 (5 min)
A hung test is a test failure. Do not retry without investigating the hang.
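coreutils gtimeout exits with status 124 when it kills a command, which makes the hung-test rule easy to encode (a sketch; run_limited is a hypothetical helper name):
run_limited() {   # usage: run_limited <seconds> <command...>
  gtimeout "$@"
  rc=$?
  [ "$rc" -eq 124 ] && echo "TIMEOUT after ${1}s: investigate before retrying" >&2
  return "$rc"
}
run_limited 60 ./build-build/bin/llama-cli -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf -n 64 -p "Once upon a time" --temp 0 -s 42 -st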
IMPORTANT: llama-cli interactive mode
llama-cli enters an interactive REPL after generating, flooding the output with > prompts. This is NOT a correctness failure — it's the CLI waiting for input.
Always use --single-turn (-st) flag to prevent this:
gtimeout 60 ./build-build/bin/llama-cli -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf -n 64 -p "Once upon a time" --temp 0 -s 42 -st
Without -st, you will see > garbage and the process will hang. DO NOT attempt to "fix" the kernel because of this.
Commit Messages
Format: [area] short description (max 72 chars)
Areas: metal, profile, docs, build, tool
Examples:
[metal] add contiguous weight read path to Q4_0 mul_mat kernel
[profile] add per-op timing to metal encode loop
[docs] graph profile results for 9B Q4_0 at ctx=256
[tool] llama-eval-callback-profile: non-syncing cb_eval profiler
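A commit-msg hook can enforce this format mechanically (a sketch assuming the five areas above; install as .git/hooks/commit-msg and mark it executable):
#!/bin/sh
# Reject subjects that don't match "[area] description" or exceed 72 chars
subject=$(head -n1 "$1")
echo "$subject" | grep -qE '^\[(metal|profile|docs|build|tool)\] .+' || {
  echo "commit-msg: subject must start with [metal|profile|docs|build|tool]" >&2; exit 1; }
[ "${#subject}" -le 72 ] || { echo "commit-msg: subject exceeds 72 chars" >&2; exit 1; }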
Agent Instructions
When working autonomously, agents MUST:
- Never push to origin — origin has no credentials; this is a safety measure
- Create a branch for any code change: feature/<issue-number>-<short-desc>
- Reference the issue in commits: [area] description (#123)
- Run benchmarks before/after kernel changes and record them in BENCHMARKS.md
- Run perplexity to verify correctness after any kernel change (with timeout):
  gtimeout 120 ./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
- Run the coherence test before any merge (with timeout); output must be coherent speech (not gibberish):
  gtimeout 60 ./build-build/bin/llama-cli -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf -n 64 -p "Once upon a time" --temp 0 -s 42 -st
- Ensure the build succeeds before pushing (with timeout):
  gtimeout 300 cmake --build build-build -j$(sysctl -n hw.ncpu)
- Push the branch to gitea, then create the PR via the Gitea API (not via git push)
NOTE: macOS has no timeout command. Always use gtimeout (from brew install coreutils).
NOTE: Always use the -st flag with llama-cli to prevent interactive-mode > prompts.
Build
# Initial cmake (one time)
cmake -B build-build -DGGML_METAL=ON -DGGML_BLAS=ON -DGGML_ACCELERATE=ON
# Incremental build
cmake --build build-build -j$(sysctl -n hw.ncpu)
# Build specific target
cmake --build build-build --target llama-eval-callback-profile -j$(sysctl -n hw.ncpu)
Benchmark Commands
# Quick bench (pp + tg)
./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3
# Long tg bench (bandwidth-sensitive)
./build-build/bin/llama-bench -m MODEL.gguf -p 1 -t 1 -n 4096 -o md -r 2
# Perplexity
./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
MLX Benchmarking
MLX-lm is the performance target. Models at ~/.omlx/models/.
# Quick generation test
mlx_lm.generate --model ~/.omlx/models/Qwen3.6-27B-Q4_0 --prompt "Once upon a time" --max-tokens 128
# Benchmark with timing
time mlx_lm.generate --model ~/.omlx/models/Qwen3.6-27B-Q4_0 --prompt "Once upon a time" --max-tokens 4096
Compare llama.cpp results against MLX baselines. Record in BENCHMARKS.md.
Profiling Tools
| Tool | What it does |
|---|---|
| llama-eval-callback-profile | Counts ops + bytes per decode tick (non-syncing cb_eval) |
| GGML_METAL_GRAPH_DEBUG=1 | Prints per-op graph during compute (needs -v flag) |
| GGML_METAL_GRAPH_DEBUG=2 | Also prints tensor shapes |
| GGML_METAL_CAPTURE_COMPUTE=N | Captures Nth compute call to an Xcode Instruments GPUtrace |
| GGML_METAL_CONCURRENCY_DISABLE=1 | Disables concurrent encoding (benchmark impact) |
| GGML_METAL_FUSION_DISABLE=1 | Disables op fusion (benchmark impact) |
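The environment variables are set per invocation; for example (a sketch combining the table entries with the quick coherence command):
# Per-op graph dump during compute (the table notes this needs -v)
GGML_METAL_GRAPH_DEBUG=1 gtimeout 60 ./build-build/bin/llama-cli \
  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf -n 8 -p "test" --temp 0 -st -v
# Capture the 3rd compute call for Xcode Instruments
GGML_METAL_CAPTURE_COMPUTE=3 gtimeout 60 ./build-build/bin/llama-cli \
  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf -n 8 -p "test" --temp 0 -st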
Model Files
Located at /Users/sleepy/.llama/models/:
Qwen3.5-4B-Q4_0.gguf (2.40 GiB)
Qwen3.5-9B-Q4_0.gguf (5.00 GiB)
Qwen3.5-9B-IQ4_NL.gguf (4.99 GiB)
Qwen3.5-9B-IQ4_XS.gguf (4.80 GiB)
Qwen3.6-27B-Q4_0.gguf (14.70 GiB)
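To sweep the quick bench across every local model in one pass (a sketch reusing the quick-bench flags from the Benchmark Commands section):
for m in /Users/sleepy/.llama/models/*.gguf; do
  echo "=== $m ==="
  gtimeout 300 ./build-build/bin/llama-bench -m "$m" -p 512 -t 1 -n 128 -o md -r 3
done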
Key Source Files
ggml/src/ggml-metal/ggml-metal.metal — Metal shader kernels (Q4_0 dot: line 3228)
ggml/src/ggml-metal/ggml-metal-device.cpp — Pipeline dispatch (get_pipeline_mul_mv: line 741)
ggml/src/ggml-metal/ggml-metal-ops.cpp — Op encoding (MUL_MAT: line 2257)
ggml/src/ggml-metal/ggml-metal-context.m — Graph compute (line 438)
ggml/src/ggml-metal/ggml-metal-impl.h — Tuning params (N_R0, N_SG)
examples/eval-callback/eval-callback-profile.cpp — Custom profiler tool
BENCHMARKS.md — All benchmark results
ANALYSIS_QWEN3_5_MXFP4.md — MXFP4 format analysis
Gitea API
Base: https://git.kokoham.com/api/v1
Token in ~/Documents/personal/projects/.env as GITEA_TOKEN.
export $(grep -v '^#' ~/Documents/personal/projects/.env | xargs)
# Create issue
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues" \
-H "Authorization: token $GITEA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"title":"...","body":"...","labels":["perf"]}'
# Create PR
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/pulls" \
-H "Authorization: token $GITEA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"title":"...","body":"...","head":"feature/xyz","base":"master"}'
# Comment on issue/PR
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues/{number}/comments" \
-H "Authorization: token $GITEA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"body":"..."}'
# Close issue
curl -X PATCH "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues/{number}" \
-H "Authorization: token $GITEA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"state":"closed"}'
# Merge PR
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/pulls/{number}/merge" \
-H "Authorization: token $GITEA_TOKEN" \
-H "Content-Type: application/json" \
-d '{"do_force_merge":false,"merge_title":"..."}'
Onboarding — What to Read
For new agents or sessions, read in this order:
- GIT.md — this file (workflow, tests, commands, file locations)
- BENCHMARKS.md — all benchmark results, track progress toward 22 t/s
- ANALYSIS_QWEN3_5_MXFP4.md — MXFP4 format analysis (if relevant)
- Issue #40 — target t/s goal and MLX comparison guidelines
- MLX reference: ../mlx-lm/mlx/include/mlx/backend/metal/kernels/quantized.h — qmv_fast_impl