Git Workflow — llama.cpp M4 Max Performance Fork

This is a private fork of ggerganov/llama.cpp focused on Apple M4 Max Metal performance. All development happens on our Gitea instance. No changes ever touch upstream GitHub.

Remotes

origin  → https://github.com/ggerganov/llama.cpp.git   (read-only: git pull only)
gitea   → ssh://sleepy@git.kokoham.com:2222/sleepy/llama.cpp.git  (read/write)
  • origin has no credentials — can pull but cannot push. Safe for agents.
  • gitea is the working fork on our Gitea instance (SSH port 2222, user sleepy).

Syncing Upstream

git fetch origin
git merge origin/master          # fast-forward if clean
git push gitea master

Do this periodically. Conflicts should be rare since we only add tools/docs, not modify core code.

Branch Structure

master                    — always tracks upstream master (clean merge)
feature/<short-desc>      — active development branches (e.g., feature/mul-mat-contig-reads)
profile/<desc>            — profiling/measurement branches
fix/<desc>                — bug fixes found during profiling
exp/<desc>                — experimental, may be discarded

Branches are short-lived. Merge to master via PR, then delete.
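
A typical lifecycle, shown here as a sketch using the example branch name from above (substitute the real branch):

# Start from an up-to-date master
git checkout master
git pull gitea master

# Do the work on a short-lived branch
git checkout -b feature/mul-mat-contig-reads
# ...commit changes...
git push gitea feature/mul-mat-contig-reads

# After the PR is squash-merged on Gitea
git checkout master
git pull gitea master
git branch -D feature/mul-mat-contig-reads            # -D: squash-merge leaves no merge ancestry
git push gitea --delete feature/mul-mat-contig-reads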

Issue Tracking

All work items are tracked as issues on https://git.kokoham.com/sleepy/llama.cpp/issues.

Issue labels:

  • perf — performance investigation
  • kernel — Metal kernel changes
  • profiling — measurement/tooling
  • doc — documentation only
  • bug — correctness issues
  • infra — CI, build, repo setup

Pull Request Workflow

  1. Create branch from master: git checkout -b feature/<name>
  2. Make changes, commit with [area] description conventions (see below)
  3. Push branch: git push gitea feature/<name>
  4. Create PR on Gitea targeting master
  5. Before merge: build, benchmark (record in BENCHMARKS.md), perplexity check if the kernel changed, coherence test (see below; a combined check sequence is sketched after this list)
  6. Squash-merge to master
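
A minimal sketch of the step-5 checks chained together (bash; the script name is illustrative, the commands and paths are the ones used elsewhere in this file, and benchmarking is left as a separate manual step):

#!/usr/bin/env bash
# pre-merge-checks.sh (illustrative): run from the repo root on the PR branch
set -euo pipefail

# Build must succeed first
gtimeout 300 cmake --build build-build -j"$(sysctl -n hw.ncpu)"

# Quick coherence test (output must be coherent speech, not gibberish)
gtimeout 60 ./build-build/bin/llama-cli \
  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
  -n 64 -p "Once upon a time" --temp 0 -s 42 -st

# Perplexity check, only needed when a kernel changed
gtimeout 120 ./build-build/bin/llama-perplexity \
  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
  -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128

# Benchmark separately with llama-bench and record the results in BENCHMARKS.md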

Pre-Merge Coherence Tests

Mandatory before any merge to master. Run on both master and the PR branch to detect silent correctness regressions.

IMPORTANT: macOS has no timeout command. Use gtimeout (from brew install coreutils).

Quick Coherence Test (4B model, ~30s)

gtimeout 60 ./build-build/bin/llama-cli \
  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
  -n 64 -p "Once upon a time" \
  --temp 0 -s 42 -st

Perplexity Check (kernel changes only)

gtimeout 120 ./build-build/bin/llama-perplexity \
  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
  -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128

Verification

  • llama-cli: PR output must be coherent speech (not gibberish). Does not need to be bit-perfect vs master.
  • perplexity: PR perplexity must match master within floating-point tolerance (<0.1% delta)
  • Gibberish output = block merge. Re-dispatch with specific feedback.
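
A sketch of the <0.1% comparison, assuming the two final perplexity values have been copied in by hand (the numbers below are placeholders):

# Placeholder values taken from the master and PR perplexity runs
master_ppl=9.4321
pr_ppl=9.4330

# Absolute percent delta; anything at or above 0.1 blocks the merge
awk -v a="$master_ppl" -v b="$pr_ppl" \
  'BEGIN { d = (b - a) / a * 100; if (d < 0) d = -d; printf "delta = %.4f%%\n", d }'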

Timeout Policy

All test commands MUST use gtimeout to prevent hangs:

  • Inference/cli: gtimeout 60 (60s)
  • Perplexity: gtimeout 120 (120s)
  • Benchmark: gtimeout 300 (5min)
  • Build: gtimeout 300 (5min)

A hung test is a test failure. Do not retry without investigating the hang.
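
gtimeout reports a hit time limit through its exit status (124 in GNU coreutils), which makes a hang distinguishable from an ordinary failure. A sketch using the quick coherence command:

gtimeout 60 ./build-build/bin/llama-cli \
  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
  -n 64 -p "Once upon a time" --temp 0 -s 42 -st
status=$?
if [ "$status" -eq 124 ]; then
  echo "HANG: command exceeded its time limit; investigate before retrying"
elif [ "$status" -ne 0 ]; then
  echo "FAIL: llama-cli exited with status $status"
fi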

IMPORTANT: llama-cli interactive mode

llama-cli enters an interactive REPL after generating, flooding the output with > prompts. This is NOT a correctness failure — it's the CLI waiting for input. Always pass the --single-turn (-st) flag to prevent this:

gtimeout 60 ./build-build/bin/llama-cli -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf -n 64 -p "Once upon a time" --temp 0 -s 42 -st

Without -st, you will see > garbage and the process will hang. DO NOT attempt to "fix" the kernel because of this.

Commit Messages

Format: [area] short description (max 72 chars)

Areas: metal, profile, docs, build, tool

Examples:

[metal] add contiguous weight read path to Q4_0 mul_mat kernel
[profile] add per-op timing to metal encode loop
[docs] graph profile results for 9B Q4_0 at ctx=256
[tool] llama-eval-callback-profile: non-syncing cb_eval profiler
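
To enforce this locally, a minimal commit-msg hook sketch (optional, not part of the repo; the area list mirrors the one above):

#!/bin/sh
# .git/hooks/commit-msg (make it executable); $1 is the commit message file
first_line=$(head -n 1 "$1")
if ! printf '%s\n' "$first_line" | grep -Eq '^\[(metal|profile|docs|build|tool)\] .+'; then
  echo "commit-msg: first line must look like '[area] short description'" >&2
  exit 1
fi
if [ "${#first_line}" -gt 72 ]; then
  echo "commit-msg: first line exceeds 72 characters" >&2
  exit 1
fi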

Agent Instructions

When working autonomously, agents MUST:

  1. Never push to origin: origin has no credentials; this is a safety measure
  2. Create a branch for any code change: feature/<issue-number>-<short-desc>
  3. Reference the issue in commits: [area] description (#123)
  4. Run benchmarks before/after kernel changes and record in BENCHMARKS.md
  5. Run perplexity to verify correctness after any kernel change (with timeout):
    gtimeout 120 ./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
    
  6. Run coherence test before any merge (with timeout):
    gtimeout 60 ./build-build/bin/llama-cli -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf -n 64 -p "Once upon a time" --temp 0 -s 42 -st
    
    Output must be coherent speech (not gibberish).
  7. Build succeeds before pushing (with timeout):
    gtimeout 300 cmake --build build-build -j$(sysctl -n hw.ncpu)
    
  8. Push the branch to gitea, then create the PR via the Gitea API (the PR itself is not created by git push)

NOTE: macOS has no timeout command. Always use gtimeout (from brew install coreutils).
NOTE: Always use the -st flag with llama-cli to prevent interactive-mode > prompts.

Build

# Initial cmake (one time)
cmake -B build-build -DGGML_METAL=ON -DGGML_BLAS=ON -DGGML_ACCELERATE=ON

# Incremental build
cmake --build build-build -j$(sysctl -n hw.ncpu)

# Build specific target
cmake --build build-build --target llama-eval-callback-profile -j$(sysctl -n hw.ncpu)

Benchmark Commands

# Quick bench (pp + tg)
./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3

# Long tg bench (bandwidth-sensitive)
./build-build/bin/llama-bench -m MODEL.gguf -p 1 -t 1 -n 4096 -o md -r 2

# Perplexity
./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128

MLX Benchmarking

MLX-lm is the performance target. Models at ~/.omlx/models/.

# Quick generation test
mlx_lm.generate --model ~/.omlx/models/Qwen3.6-27B-Q4_0 --prompt "Once upon a time" --max-tokens 128

# Benchmark with timing
time mlx_lm.generate --model ~/.omlx/models/Qwen3.6-27B-Q4_0 --prompt "Once upon a time" --max-tokens 4096

Compare llama.cpp results against MLX baselines. Record in BENCHMARKS.md.
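
For a like-for-like number on the llama.cpp side, a sketch pairing the quick bench with the same 27B model (GGUF path from the Model Files section below; extend -n toward 4096 to mirror the long MLX run if time allows):

gtimeout 300 ./build-build/bin/llama-bench \
  -m ~/.llama/models/Qwen3.6-27B-Q4_0.gguf -p 512 -t 1 -n 128 -o md -r 3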

Profiling Tools

Tool                                 What it does
llama-eval-callback-profile          Counts ops + bytes per decode tick (non-syncing cb_eval)
GGML_METAL_GRAPH_DEBUG=1             Prints per-op graph during compute (needs -v flag)
GGML_METAL_GRAPH_DEBUG=2             Also prints tensor shapes
GGML_METAL_CAPTURE_COMPUTE=N         Captures Nth compute call to Xcode Instruments GPUtrace
GGML_METAL_CONCURRENCY_DISABLE=1     Disable concurrent encoding (benchmark impact)
GGML_METAL_FUSION_DISABLE=1          Disable op fusion (benchmark impact)
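
For example, to dump the per-op graph during a short coherence run (a sketch; note the -v flag the debug output requires, and -n kept small to keep the log readable):

GGML_METAL_GRAPH_DEBUG=1 gtimeout 60 ./build-build/bin/llama-cli \
  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
  -n 8 -p "Once upon a time" --temp 0 -s 42 -st -v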

Model Files

Located at /Users/sleepy/.llama/models/:

Qwen3.5-4B-Q4_0.gguf      (2.40 GiB)
Qwen3.5-9B-Q4_0.gguf      (5.00 GiB)
Qwen3.5-9B-IQ4_NL.gguf    (4.99 GiB)
Qwen3.5-9B-IQ4_XS.gguf    (4.80 GiB)
Qwen3.6-27B-Q4_0.gguf     (14.70 GiB)

Key Source Files

ggml/src/ggml-metal/ggml-metal.metal          — Metal shader kernels (Q4_0 dot: line 3228)
ggml/src/ggml-metal/ggml-metal-device.cpp     — Pipeline dispatch (get_pipeline_mul_mv: line 741)
ggml/src/ggml-metal/ggml-metal-ops.cpp        — Op encoding (MUL_MAT: line 2257)
ggml/src/ggml-metal/ggml-metal-context.m      — Graph compute (line 438)
ggml/src/ggml-metal/ggml-metal-impl.h         — Tuning params (N_R0, N_SG)
examples/eval-callback/eval-callback-profile.cpp — Custom profiler tool
BENCHMARKS.md                                 — All benchmark results
ANALYSIS_QWEN3_5_MXFP4.md                     — MXFP4 format analysis

Gitea API

Base: https://git.kokoham.com/api/v1
Token in ~/Documents/personal/projects/.env as GITEA_TOKEN.

export $(grep -v '^#' ~/Documents/personal/projects/.env | xargs)

# Create issue
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","labels":["perf"]}'

# Create PR
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/pulls" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","head":"feature/xyz","base":"master"}'

# Comment on issue/PR
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues/{number}/comments" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"body":"..."}'

# Close issue
curl -X PATCH "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues/{number}" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"state":"closed"}'

# Merge PR
curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/pulls/{number}/merge" \
  -H "Authorization: token $GITEA_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"Do":"squash","MergeTitleField":"..."}'
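
Listing is also useful before filing duplicates; a sketch, assuming the state/labels/type query parameters available in recent Gitea versions:

# List open issues labeled perf
curl -s "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues?state=open&labels=perf&type=issues" \
  -H "Authorization: token $GITEA_TOKEN"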

Onboarding — What to Read

For new agents or sessions, read in this order:

  1. GIT.md — this file (workflow, tests, commands, file locations)
  2. BENCHMARKS.md — all benchmark results, track progress toward 22 t/s
  3. ANALYSIS_QWEN3_5_MXFP4.md — MXFP4 format analysis (if relevant)
  4. Issue #40 — target t/s goal and MLX comparison guidelines
  5. MLX reference: ../mlx-lm/mlx/include/mlx/backend/metal/kernels/quantized.h — qmv_fast_impl