# Git Workflow — llama.cpp M4 Max Performance Fork

This is a private fork of [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) focused on Apple M4 Max Metal performance. All development happens on our Gitea instance. No changes ever touch upstream GitHub.

## Remotes

```
origin → https://github.com/ggerganov/llama.cpp.git (read-only: git pull only)
gitea  → ssh://sleepy@git.kokoham.com:2222/sleepy/llama.cpp.git (read/write)
```

- `origin` has no credentials — can pull but cannot push. Safe for agents.
- `gitea` is the working fork on our Gitea instance (SSH port 2222, user `sleepy`).

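The read-only `origin` relies on missing credentials; an additional local guard is to break its push URL outright, so `git push origin` fails fast even if credentials ever leak into the environment. A sketch (the scratch repo and the `DISABLED` placeholder URL are illustrative; in the real checkout only the two `git remote` commands against the existing `origin` would apply):

```bash
# Illustrative scratch clone; not part of the actual setup.
repo=$(mktemp -d)
git -C "$repo" init -q
git -C "$repo" remote add origin https://github.com/ggerganov/llama.cpp.git

# Point the push URL at a non-existent remote so any push attempt fails immediately.
git -C "$repo" remote set-url --push origin DISABLED

git -C "$repo" remote get-url --push origin   # DISABLED
```

Fetches still use the real URL; only the push direction is disabled.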
## Syncing Upstream

```bash
git fetch origin
git merge origin/master   # fast-forward if clean
git push gitea master
```

Do this periodically. Conflicts should be rare, since we only add tools and docs rather than modifying core code.

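The merge above only fast-forwards when local `master` has no commits of its own; `git merge --ff-only` makes that assumption explicit and aborts instead of silently creating a merge commit. A throwaway-repo sketch (all paths and the `demo` identity are illustrative):

```bash
set -e
work=$(mktemp -d)
# Wrapper supplying a throwaway identity so commits work in any environment.
g() { git -c user.name=demo -c user.email=demo@example.com -c init.defaultBranch=master "$@"; }

g init -q "$work/upstream"                      # stand-in for upstream origin
( cd "$work/upstream" && echo a > f && g add f && g commit -qm one )
g clone -q "$work/upstream" "$work/fork"        # stand-in for our clone
( cd "$work/upstream" && echo b >> f && g add f && g commit -qm two )

cd "$work/fork"
g fetch -q origin
g merge --ff-only origin/master >/dev/null      # succeeds: no local commits on master
g rev-list --count HEAD                         # both upstream commits are now here
```

If `--ff-only` refuses, something landed on local `master` that shouldn't be there — investigate before syncing.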
## Branch Structure

```
master               — always tracks upstream master (clean merge)
feature/<short-desc> — active development branches (e.g., feature/mul-mat-contig-reads)
profile/<desc>       — profiling/measurement branches
fix/<desc>           — bug fixes found during profiling
exp/<desc>           — experimental, may be discarded
```

Branches are short-lived. Merge to master via PR, then delete.

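A quick guard that a new branch name matches one of these prefixes can be a one-liner; `check_branch` here is a hypothetical helper name, not an existing script:

```bash
# Accept only the four branch prefixes used in this repo.
check_branch() {
    case "$1" in
        feature/*|profile/*|fix/*|exp/*) echo "ok" ;;
        *) echo "bad prefix: $1" ;;
    esac
}

check_branch feature/mul-mat-contig-reads   # ok
check_branch quickhack                      # bad prefix: quickhack
```

Run it before `git checkout -b` to keep branch names consistent.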
## Issue Tracking

All work items are tracked as issues on https://git.kokoham.com/sleepy/llama.cpp/issues.

Issue labels:

- `perf` — performance investigation
- `kernel` — Metal kernel changes
- `profiling` — measurement/tooling
- `doc` — documentation only
- `bug` — correctness issues
- `infra` — CI, build, repo setup

## Pull Request Workflow

1. Create a branch from master: `git checkout -b feature/<name>`
2. Make changes; commit using the `[area] description` convention (see below)
3. Push the branch: `git push gitea feature/<name>`
4. Create a PR on Gitea targeting `master`
5. Before merging: build, benchmark (record in BENCHMARKS.md), run a perplexity check if a kernel changed, and run the **coherence test** (see below)
6. Squash-merge to master

## Pre-Merge Coherence Tests

**Mandatory before any merge to master.** Run on both `master` and the PR branch to detect silent correctness regressions.

> **IMPORTANT:** macOS has no `timeout` command. Use `gtimeout` (from `brew install coreutils`).

### Quick Coherence Test (4B model, ~30s)

```bash
gtimeout 60 ./build-build/bin/llama-cli \
  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
  -n 64 -p "Once upon a time" \
  --temp 0 -s 42 -st
```

### Perplexity Check (kernel changes only)

```bash
gtimeout 120 ./build-build/bin/llama-perplexity \
  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
  -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
```

### Verification

- **llama-cli**: PR output must be coherent speech (not gibberish). It does not need to be bit-identical to master.
- **perplexity**: PR perplexity must match master within floating-point tolerance (<0.1% delta).
- **Gibberish output = block the merge.** Re-dispatch with specific feedback.

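The <0.1% tolerance can be checked mechanically once both perplexity numbers are in hand. A sketch with made-up placeholder values (the real values come from the llama-perplexity runs above):

```bash
ppl_master=9.8321   # placeholder: perplexity measured on master
ppl_pr=9.8327       # placeholder: perplexity measured on the PR branch

# Relative delta |pr - master| / master must stay below 0.001 (0.1%).
verdict=$(awk -v a="$ppl_master" -v b="$ppl_pr" 'BEGIN {
    d = (b - a) / a
    if (d < 0) d = -d
    if (d < 0.001) print "PASS"; else print "FAIL"
}')
echo "$verdict"
```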
### Timeout Policy

**All test commands MUST use `gtimeout`** to prevent hangs:

- Inference/CLI: `gtimeout 60` (60 s)
- Perplexity: `gtimeout 120` (120 s)
- Benchmark: `gtimeout 300` (5 min)
- Build: `gtimeout 300` (5 min)

A hung test is a test failure. Do not retry without investigating the hang.

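One way to apply these limits uniformly is a small wrapper that resolves `gtimeout` (or plain `timeout` on Linux) once; `run_to` is a hypothetical helper name, not an existing tool:

```bash
# Resolve the timeout binary once: gtimeout on macOS (coreutils), timeout elsewhere.
TIMEOUT_BIN=$(command -v gtimeout || command -v timeout)

run_to() {   # usage: run_to <seconds> <command...>
    secs=$1; shift
    "$TIMEOUT_BIN" "$secs" "$@"
}

run_to 60 true && echo "ok"   # ok
```

The wrapper preserves the timeout binary's exit code, so a killed command still reads as a failure.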
### IMPORTANT: llama-cli interactive mode

llama-cli drops into an interactive REPL after generating, flooding the output with `>` prompts. This is NOT a correctness failure — it is the CLI waiting for input.

**Always pass the `--single-turn` (`-st`) flag to prevent this:**

```bash
gtimeout 60 ./build-build/bin/llama-cli -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf -n 64 -p "Once upon a time" --temp 0 -s 42 -st
```

Without `-st`, you will see `>` garbage and the process will hang. Do NOT attempt to "fix" the kernel because of this.

## Commit Messages

Format: `[area] short description (max 72 chars)`

Areas: `metal`, `profile`, `docs`, `build`, `tool`

Examples:

```
[metal] add contiguous weight read path to Q4_0 mul_mat kernel
[profile] add per-op timing to metal encode loop
[docs] graph profile results for 9B Q4_0 at ctx=256
[tool] llama-eval-callback-profile: non-syncing cb_eval profiler
```

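The convention above is easy to lint before committing; `lint_subject` is a hypothetical helper, with the area list hard-coded from this section:

```bash
lint_subject() {
    subj=$1
    # Subject must start with one of the five bracketed areas.
    case "$subj" in
        "[metal] "*|"[profile] "*|"[docs] "*|"[build] "*|"[tool] "*) ;;
        *) echo "bad or missing area"; return 1 ;;
    esac
    # Enforce the 72-character limit.
    if [ "${#subj}" -gt 72 ]; then
        echo "too long: ${#subj} chars"
        return 1
    fi
    echo "ok"
}

lint_subject "[metal] add contiguous weight read path to Q4_0 mul_mat kernel"   # ok
```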
## Agent Instructions

When working autonomously, agents MUST:

1. **Never push to `origin`** — `origin` has no credentials; this is a safety measure
2. **Create a branch** for any code change: `feature/<issue-number>-<short-desc>`
3. **Reference the issue** in commits: `[area] description (#123)`
4. **Run benchmarks** before/after kernel changes and record them in BENCHMARKS.md
5. **Run perplexity** to verify correctness after any kernel change (with timeout):
   ```bash
   gtimeout 120 ./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
   ```
6. **Run the coherence test** before any merge (with timeout):
   ```bash
   gtimeout 60 ./build-build/bin/llama-cli -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf -n 64 -p "Once upon a time" --temp 0 -s 42 -st
   ```
   Output must be coherent speech (not gibberish).
7. **Build successfully** before pushing (with timeout):
   ```bash
   gtimeout 300 cmake --build build-build -j$(sysctl -n hw.ncpu)
   ```
8. **Push the branch** to gitea, then **create the PR via the Gitea API** (not via git push)

> **NOTE:** macOS has no `timeout` command. Always use `gtimeout` (from `brew install coreutils`).

> **NOTE:** Always use the `-st` flag with llama-cli to prevent interactive-mode `>` prompts.

## Build

```bash
# Initial cmake configure (one time)
cmake -B build-build -DGGML_METAL=ON -DGGML_BLAS=ON -DGGML_ACCELERATE=ON

# Incremental build
cmake --build build-build -j$(sysctl -n hw.ncpu)

# Build a specific target
cmake --build build-build --target llama-eval-callback-profile -j$(sysctl -n hw.ncpu)
```

## Benchmark Commands

```bash
# Quick bench (pp + tg)
./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3

# Long tg bench (bandwidth-sensitive)
./build-build/bin/llama-bench -m MODEL.gguf -p 1 -t 1 -n 4096 -o md -r 2

# Perplexity
./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
```

## MLX Benchmarking

MLX-lm is the performance target. Models live at `~/.omlx/models/`.

```bash
# Quick generation test
mlx_lm.generate --model ~/.omlx/models/Qwen3.6-27B-Q4_0 --prompt "Once upon a time" --max-tokens 128

# Benchmark with timing
time mlx_lm.generate --model ~/.omlx/models/Qwen3.6-27B-Q4_0 --prompt "Once upon a time" --max-tokens 4096
```

Compare llama.cpp results against the MLX baselines. Record them in BENCHMARKS.md.

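When only the `time` output is available, tokens/sec has to be derived from the token count and wall time. A sketch with placeholder numbers (real values come from the runs above; note that wall time also includes model load, so this underestimates pure generation throughput):

```bash
tokens=4096     # placeholder: --max-tokens from the run above
seconds=186.4   # placeholder: wall time reported by `time`

tps=$(awk -v n="$tokens" -v s="$seconds" 'BEGIN { printf "%.1f", n / s }')
echo "$tps t/s"
```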
## Profiling Tools

| Tool | What it does |
|------|--------------|
| `llama-eval-callback-profile` | Counts ops + bytes per decode tick (non-syncing cb_eval) |
| `GGML_METAL_GRAPH_DEBUG=1` | Prints per-op graph during compute (needs `-v` flag) |
| `GGML_METAL_GRAPH_DEBUG=2` | Also prints tensor shapes |
| `GGML_METAL_CAPTURE_COMPUTE=N` | Captures Nth compute call to Xcode Instruments GPUtrace |
| `GGML_METAL_CONCURRENCY_DISABLE=1` | Disable concurrent encoding (benchmark impact) |
| `GGML_METAL_FUSION_DISABLE=1` | Disable op fusion (benchmark impact) |

## Model Files

Located at `/Users/sleepy/.llama/models/`:

```
Qwen3.5-4B-Q4_0.gguf     (2.40 GiB)
Qwen3.5-9B-Q4_0.gguf     (5.00 GiB)
Qwen3.5-9B-IQ4_NL.gguf   (4.99 GiB)
Qwen3.5-9B-IQ4_XS.gguf   (4.80 GiB)
Qwen3.6-27B-Q4_0.gguf    (14.70 GiB)
```

## Key Source Files

```
ggml/src/ggml-metal/ggml-metal.metal             — Metal shader kernels (Q4_0 dot: line 3228)
ggml/src/ggml-metal/ggml-metal-device.cpp        — Pipeline dispatch (get_pipeline_mul_mv: line 741)
ggml/src/ggml-metal/ggml-metal-ops.cpp           — Op encoding (MUL_MAT: line 2257)
ggml/src/ggml-metal/ggml-metal-context.m         — Graph compute (line 438)
ggml/src/ggml-metal/ggml-metal-impl.h            — Tuning params (N_R0, N_SG)
examples/eval-callback/eval-callback-profile.cpp — Custom profiler tool
BENCHMARKS.md                                    — All benchmark results
ANALYSIS_QWEN3_5_MXFP4.md                        — MXFP4 format analysis
```

## Gitea API

Base: `https://git.kokoham.com/api/v1`
Token in `~/Documents/personal/projects/.env` as `GITEA_TOKEN`.

```bash
export $(grep -v '^#' ~/Documents/personal/projects/.env | xargs)

# Create an issue
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","labels":["perf"]}'

# Create a PR
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/pulls" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","head":"feature/xyz","base":"master"}'

# Comment on an issue/PR
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues/{number}/comments" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"body":"..."}'

# Close an issue
curl -X PATCH "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues/{number}" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"state":"closed"}'

# Merge a PR
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/pulls/{number}/merge" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"do_force_merge":false,"merge_title":"..."}'
```

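Hand-writing the `-d` JSON is error-prone. A small printf-based payload builder helps for simple values (`pr_payload` is a hypothetical helper; it does not escape quotes or backslashes inside its arguments, so keep titles plain):

```bash
pr_payload() {   # usage: pr_payload <title> <head-branch> <base-branch>
    printf '{"title":"%s","head":"%s","base":"%s"}' "$1" "$2" "$3"
}

pr_payload "metal: contiguous reads" "feature/xyz" "master"
# {"title":"metal: contiguous reads","head":"feature/xyz","base":"master"}
```

Then pass it to the PR-creation call: `curl ... -d "$(pr_payload "..." "feature/xyz" "master")"`.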
## Onboarding — What to Read

For new agents or sessions, read in this order:

1. **GIT.md** — this file (workflow, tests, commands, file locations)
2. **BENCHMARKS.md** — all benchmark results; track progress toward 22 t/s
3. **ANALYSIS_QWEN3_5_MXFP4.md** — MXFP4 format analysis (if relevant)
4. **Issue #40** — target t/s goal and MLX comparison guidelines
5. **MLX reference**: `../mlx-lm/mlx/include/mlx/backend/metal/kernels/quantized.h` — qmv_fast_impl