# Git Workflow — llama.cpp M4 Max Performance Fork

This is a private fork of [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) focused on Apple M4 Max Metal performance. All development happens on our Gitea instance. No changes ever touch upstream GitHub.

## Remotes

```
origin → https://github.com/ggerganov/llama.cpp.git (read-only: git pull only)
gitea  → ssh://sleepy@git.kokoham.com:2222/sleepy/llama.cpp.git (read/write)
```

- `origin` has no credentials — can pull but cannot push. Safe for agents.
- `gitea` is the working fork on our Gitea instance (SSH port 2222, user `sleepy`).

## Syncing Upstream

```bash
git fetch origin
git merge origin/master   # fast-forward if clean
git push gitea master
```

Do this periodically. Conflicts should be rare since we only add tools/docs, not modify core code.

## Branch Structure

```
master    — always tracks upstream master (clean merge)
feature/  — active development branches (e.g., feature/mul-mat-contig-reads)
profile/  — profiling/measurement branches
fix/      — bug fixes found during profiling
exp/      — experimental, may be discarded
```

Branches are short-lived. Merge to master via PR, then delete.

## Issue Tracking

All work items are tracked as issues on https://git.kokoham.com/sleepy/llama.cpp/issues.

Issue labels:

- `perf` — performance investigation
- `kernel` — Metal kernel changes
- `profiling` — measurement/tooling
- `doc` — documentation only
- `bug` — correctness issues
- `infra` — CI, build, repo setup

## Pull Request Workflow

1. Create branch from master: `git checkout -b feature/<name>`
2. Make changes, commit with the `[area] description` convention (see below)
3. Push branch: `git push gitea feature/<name>`
4. Create PR on Gitea targeting `master`
5. Before merge: build, benchmark (record in BENCHMARKS.md), perplexity check if a kernel changed
6. Squash-merge to master

## Commit Messages

Format: `[area] short description (max 72 chars)`

Areas: `metal`, `profile`, `docs`, `build`, `tool`

Examples:

```
[metal] add contiguous weight read path to Q4_0 mul_mat kernel
[profile] add per-op timing to metal encode loop
[docs] graph profile results for 9B Q4_0 at ctx=256
[tool] llama-eval-callback-profile: non-syncing cb_eval profiler
```

## Agent Instructions

When working autonomously, agents MUST (one full iteration is sketched after this list):

1. **Never push to `origin`** — `origin` has no credentials; this is a safety measure
2. **Create a branch** for any code change: `feature/<issue>-<short-desc>`
3. **Reference the issue** in commits: `[area] description (#123)`
4. **Run benchmarks** before/after kernel changes and record in BENCHMARKS.md
5. **Run perplexity** to verify correctness after any kernel change:
   ```bash
   ./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
   ```
6. **Build succeeds** before pushing:
   ```bash
   cmake --build build-build -j$(sysctl -n hw.ncpu)
   ```
7. **Push branch** to gitea, then **create PR via Gitea API** (not via git push)
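Putting the rules above together, here is a minimal sketch of one agent iteration. The issue number (123), branch slug, commit message, and model path are hypothetical, and the PR call assumes the Gitea API setup described at the end of this document:

```bash
#!/usr/bin/env bash
set -euo pipefail

ISSUE=123                                 # hypothetical issue number
BRANCH="feature/${ISSUE}-contig-reads"    # rule 2: feature/<issue>-<short-desc>
MODEL="$HOME/.llama/models/Qwen3.5-9B-Q4_0.gguf"

git checkout master && git pull origin master
git checkout -b "$BRANCH"

# ... make the kernel change, then commit per rule 3 ...
git commit -am "[metal] example change (#${ISSUE})"

# Rule 6: build must succeed before pushing
cmake --build build-build -j"$(sysctl -n hw.ncpu)"

# Rules 4-5: benchmark and perplexity; append the results to BENCHMARKS.md
./build-build/bin/llama-bench -m "$MODEL" -p 512 -t 1 -n 128 -o md -r 3
./build-build/bin/llama-perplexity -m "$MODEL" -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128

# Rule 7: push to gitea only, then open the PR through the API
git push gitea "$BRANCH"
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/pulls" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d "{\"title\":\"[metal] example change\",\"body\":\"Closes #${ISSUE}\",\"head\":\"${BRANCH}\",\"base\":\"master\"}"
```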
## Build

```bash
# Initial cmake (one time)
cmake -B build-build -DGGML_METAL=ON -DGGML_BLAS=ON -DGGML_ACCELERATE=ON

# Incremental build
cmake --build build-build -j$(sysctl -n hw.ncpu)

# Build specific target
cmake --build build-build --target llama-eval-callback-profile -j$(sysctl -n hw.ncpu)
```

## Benchmark Commands

```bash
# Quick bench (pp + tg)
./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3

# Long tg bench (bandwidth-sensitive)
./build-build/bin/llama-bench -m MODEL.gguf -p 1 -t 1 -n 4096 -o md -r 2

# Perplexity
./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
```

## Profiling Tools

| Tool | What it does |
|------|--------------|
| `llama-eval-callback-profile` | Counts ops + bytes per decode tick (non-syncing cb_eval) |
| `GGML_METAL_GRAPH_DEBUG=1` | Prints per-op graph during compute (needs `-v` flag) |
| `GGML_METAL_GRAPH_DEBUG=2` | Also prints tensor shapes |
| `GGML_METAL_CAPTURE_COMPUTE=N` | Captures Nth compute call to Xcode Instruments GPUtrace |
| `GGML_METAL_CONCURRENCY_DISABLE=1` | Disable concurrent encoding (benchmark impact) |
| `GGML_METAL_FUSION_DISABLE=1` | Disable op fusion (benchmark impact) |

## Model Files

Located at `/Users/sleepy/.llama/models/`:

```
Qwen3.5-4B-Q4_0.gguf     (2.40 GiB)
Qwen3.5-9B-Q4_0.gguf     (5.00 GiB)
Qwen3.5-9B-IQ4_NL.gguf   (4.99 GiB)
Qwen3.5-9B-IQ4_XS.gguf   (4.80 GiB)
Qwen3.6-27B-Q4_0.gguf    (14.70 GiB)
```

## Key Source Files

```
ggml/src/ggml-metal/ggml-metal.metal        — Metal shader kernels (Q4_0 dot: line 3228)
ggml/src/ggml-metal/ggml-metal-device.cpp   — Pipeline dispatch (get_pipeline_mul_mv: line 741)
ggml/src/ggml-metal/ggml-metal-ops.cpp      — Op encoding (MUL_MAT: line 2257)
ggml/src/ggml-metal/ggml-metal-context.m    — Graph compute (line 438)
ggml/src/ggml-metal/ggml-metal-impl.h       — Tuning params (N_R0, N_SG)
examples/eval-callback/eval-callback-profile.cpp — Custom profiler tool
BENCHMARKS.md                               — All benchmark results
ANALYSIS_QWEN3_5_MXFP4.md                   — MXFP4 format analysis
```

## Gitea API

Base: `https://git.kokoham.com/api/v1`. Token in `~/.gitea_token` (not committed). Local API from the server: `http://127.0.0.1:18431/api/v1`.

```bash
# Create issue
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/issues" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","labels":["perf"]}'

# Create PR
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/pulls" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","head":"feature/xyz","base":"master"}'
```
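To pick up work items, an agent can list open issues filtered by label. This sketch assumes Gitea's standard `GET /repos/{owner}/{repo}/issues` endpoint with `state` and `labels` query parameters; verify the exact parameters against the instance's Swagger UI at `/api/swagger`:

```bash
# List open issues carrying the perf label (labels filter assumed)
curl -s "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/issues?state=open&labels=perf" \
  -H "Authorization: token $(cat ~/.gitea_token)"
```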
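The squash-merge in step 6 of the PR workflow can also be done through the API. This assumes Gitea's standard merge endpoint, where the `Do` field selects the merge style; the PR index (42) is hypothetical:

```bash
# Squash-merge PR #42 (index is hypothetical), then delete the short-lived branch
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/pulls/42/merge" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d '{"Do":"squash"}'
git push gitea --delete feature/xyz
```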