# Git Workflow — llama.cpp M4 Max Performance Fork
This is a private fork of [ggerganov/llama.cpp](https://github.com/ggerganov/llama.cpp) focused on Apple M4 Max Metal performance. All development happens on our Gitea instance. No changes ever touch upstream GitHub.

## Remotes

```
origin → https://github.com/ggerganov/llama.cpp.git (read-only: git pull only)
gitea  → ssh://sleepy@git.kokoham.com:2222/sleepy/llama.cpp.git (read/write)
```

- `origin` has no credentials — can pull but cannot push. Safe for agents.
- `gitea` is the working fork on our Gitea instance (SSH port 2222, user `sleepy`).

## Syncing Upstream

```bash
git fetch origin
git merge origin/master   # fast-forward if clean
git push gitea master
```

Do this periodically. Conflicts should be rare, since we only add tools and docs and do not modify core code.

## Branch Structure

```
master               — always tracks upstream master (clean merge)
feature/<short-desc> — active development branches (e.g., feature/mul-mat-contig-reads)
profile/<desc>       — profiling/measurement branches
fix/<desc>           — bug fixes found during profiling
exp/<desc>           — experimental, may be discarded
```

Branches are short-lived. Merge to master via PR, then delete.
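
The lifecycle above can be walked through end to end. This is a minimal sketch, not repo tooling: a throwaway bare repo in `/tmp` stands in for the gitea remote, and the branch name, commit messages, and paths are illustrative only.

```shell
set -e
# throwaway bare repo standing in for the gitea remote
rm -rf /tmp/gitea-demo.git /tmp/work-demo
git init -q --bare /tmp/gitea-demo.git
git init -q /tmp/work-demo
cd /tmp/work-demo
git config user.email "agent@example.com"
git config user.name  "agent"
git remote add gitea /tmp/gitea-demo.git
git commit -q --allow-empty -m "[docs] seed repository"
git branch -M master
git push -q gitea master

# create a feature branch, do work, publish it
git checkout -q -b feature/mul-mat-contig-reads
git commit -q --allow-empty -m "[metal] placeholder change (#1)"
git push -q gitea feature/mul-mat-contig-reads

# after the PR is squash-merged on Gitea, delete the branch everywhere
git checkout -q master
git branch -D feature/mul-mat-contig-reads
git push -q gitea --delete feature/mul-mat-contig-reads
git ls-remote --heads gitea
```

After the final push, `git ls-remote --heads gitea` lists only `refs/heads/master`, confirming the short-lived branch is gone on both ends.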
## Issue Tracking

All work items are tracked as issues on https://git.kokoham.com/sleepy/llama.cpp/issues.

Issue labels:

- `perf` — performance investigation
- `kernel` — Metal kernel changes
- `profiling` — measurement/tooling
- `doc` — documentation only
- `bug` — correctness issues
- `infra` — CI, build, repo setup

## Pull Request Workflow

1. Create a branch from master: `git checkout -b feature/<name>`
2. Make changes; commit using the `[area] description` convention (see below)
3. Push the branch: `git push gitea feature/<name>`
4. Create a PR on Gitea targeting `master`
5. Before merging: build, benchmark (record results in BENCHMARKS.md), and run a perplexity check if a kernel changed
6. Squash-merge to master

## Commit Messages

Format: `[area] short description (max 72 chars)`

Areas: `metal`, `profile`, `docs`, `build`, `tool`

Examples:
```
[metal] add contiguous weight read path to Q4_0 mul_mat kernel
[profile] add per-op timing to metal encode loop
[docs] graph profile results for 9B Q4_0 at ctx=256
[tool] llama-eval-callback-profile: non-syncing cb_eval profiler
```
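
The convention is simple enough to check mechanically before committing. A minimal sketch (the `check_msg` helper is hypothetical, not part of the repo): accept a subject line only if it starts with a known area tag and stays within 72 characters.

```shell
# Hypothetical pre-commit check for the `[area] description` convention.
check_msg() {
  local msg="$1"
  # enforce the 72-character subject limit
  [ "${#msg}" -le 72 ] || return 1
  # subject must start with one of the known area tags
  case "$msg" in
    "[metal] "*|"[profile] "*|"[docs] "*|"[build] "*|"[tool] "*) return 0 ;;
    *) return 1 ;;
  esac
}

check_msg "[metal] add contiguous weight read path" && echo "ok"
check_msg "no area tag here" || echo "rejected"
```

A wrapper like this could sit in a `commit-msg` hook so malformed subjects are rejected before they reach a PR.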
## Agent Instructions

When working autonomously, agents MUST:

1. **Never push to `origin`** — `origin` has no credentials; this is a safety measure
2. **Create a branch** for any code change: `feature/<issue-number>-<short-desc>`
3. **Reference the issue** in commits: `[area] description (#123)`
4. **Run benchmarks** before and after kernel changes and record them in BENCHMARKS.md
5. **Run perplexity** to verify correctness after any kernel change:
   ```bash
   ./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
   ```
6. **Verify the build succeeds** before pushing:
   ```bash
   cmake --build build-build -j$(sysctl -n hw.ncpu)
   ```
7. **Push the branch** to gitea, then **create the PR via the Gitea API** (not via git push)

## Build

```bash
# Initial cmake configure (one time)
cmake -B build-build -DGGML_METAL=ON -DGGML_BLAS=ON -DGGML_ACCELERATE=ON

# Incremental build
cmake --build build-build -j$(sysctl -n hw.ncpu)

# Build a specific target
cmake --build build-build --target llama-eval-callback-profile -j$(sysctl -n hw.ncpu)
```

## Benchmark Commands

```bash
# Quick bench (pp + tg)
./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3

# Long tg bench (bandwidth-sensitive)
./build-build/bin/llama-bench -m MODEL.gguf -p 1 -t 1 -n 4096 -o md -r 2

# Perplexity
./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
```

## Profiling Tools

| Tool | What it does |
|------|--------------|
| `llama-eval-callback-profile` | Counts ops + bytes per decode tick (non-syncing cb_eval) |
| `GGML_METAL_GRAPH_DEBUG=1` | Prints per-op graph during compute (needs `-v` flag) |
| `GGML_METAL_GRAPH_DEBUG=2` | Also prints tensor shapes |
| `GGML_METAL_CAPTURE_COMPUTE=N` | Captures Nth compute call to Xcode Instruments GPUtrace |
| `GGML_METAL_CONCURRENCY_DISABLE=1` | Disable concurrent encoding (benchmark impact) |
| `GGML_METAL_FUSION_DISABLE=1` | Disable op fusion (benchmark impact) |

## Model Files

Located at `/Users/sleepy/.llama/models/`:

```
Qwen3.5-4B-Q4_0.gguf     (2.40 GiB)
Qwen3.5-9B-Q4_0.gguf     (5.00 GiB)
Qwen3.5-9B-IQ4_NL.gguf   (4.99 GiB)
Qwen3.5-9B-IQ4_XS.gguf   (4.80 GiB)
Qwen3.6-27B-Q4_0.gguf    (14.70 GiB)
```

## Key Source Files

```
ggml/src/ggml-metal/ggml-metal.metal             — Metal shader kernels (Q4_0 dot: line 3228)
ggml/src/ggml-metal/ggml-metal-device.cpp        — Pipeline dispatch (get_pipeline_mul_mv: line 741)
ggml/src/ggml-metal/ggml-metal-ops.cpp           — Op encoding (MUL_MAT: line 2257)
ggml/src/ggml-metal/ggml-metal-context.m         — Graph compute (line 438)
ggml/src/ggml-metal/ggml-metal-impl.h            — Tuning params (N_R0, N_SG)
examples/eval-callback/eval-callback-profile.cpp — Custom profiler tool
BENCHMARKS.md                                    — All benchmark results
ANALYSIS_QWEN3_5_MXFP4.md                        — MXFP4 format analysis
```

## Gitea API

Base: `https://git.kokoham.com/api/v1`

Token in `~/.gitea_token` (not committed).

Local API from the server itself: `http://127.0.0.1:18431/api/v1`

```bash
# Create issue
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/issues" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","labels":["perf"]}'

# Create PR
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/pulls" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","head":"feature/xyz","base":"master"}'
```
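
Since agents are supposed to create PRs through this API, a dry-run wrapper can help sanity-check the request before sending it. This is a hypothetical sketch: `gitea_api`, its argument layout, and the `GITEA_API_BASE` variable are illustrative, not part of the repo. It only composes and prints the curl command; nothing is sent.

```shell
# Hypothetical dry-run helper: composes the curl invocation for an API call
# and prints it instead of executing it, so the payload can be reviewed first.
gitea_api() {
  local method="$1" path="$2" payload="$3"
  local base="${GITEA_API_BASE:-http://127.0.0.1:18431/api/v1}"
  printf '%s\n' "curl -X $method \"$base$path\" -H \"Authorization: token \$(cat ~/.gitea_token)\" -H \"Content-Type: application/json\" -d '$payload'"
}

gitea_api POST /repos/sleepy/llama.cpp/pulls \
  '{"title":"...","head":"feature/xyz","base":"master"}'
```

Once the printed command looks right, dropping the `printf` wrapper (or piping the output to `sh`) performs the real call.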