sleepy/llama.cpp

Kaloyan Nikolov 222626cfdc

[docs] add GIT.md with workflow and agent instructions

2026-04-30 18:11:44 +02:00

Git Workflow — llama.cpp M4 Max Performance Fork

This is a private fork of ggerganov/llama.cpp focused on Apple M4 Max Metal performance. All development happens on our Gitea instance. No changes ever touch upstream GitHub.

Remotes

origin  → https://github.com/ggerganov/llama.cpp.git   (read-only: git pull only)
gitea   → ssh://sleepy@git.kokoham.com:2222/sleepy/llama.cpp.git  (read/write)

origin has no credentials configured: it can be pulled from but not pushed to, which makes it safe for agents.
gitea is the working fork on our Gitea instance (SSH port 2222, user sleepy).
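
As an extra safeguard on top of the missing credentials, the push URL of origin can be pointed at a bogus location so that an accidental push fails locally. This is a sketch and not part of the setup described here; the helper name disable_push and the placeholder value DISABLED are assumptions.

```shell
# Sketch: make pushes to a remote fail locally by giving it a bogus push URL.
# The helper name and the placeholder "DISABLED" are illustrative choices.
disable_push() {
    remote="${1:-origin}"
    # The fetch URL is untouched; only the push URL is overridden.
    git remote set-url --push "$remote" DISABLED
}

# Usage, run inside the repository:
#   disable_push origin
#   git remote -v    # fetch keeps the GitHub URL, push shows DISABLED
```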

Syncing Upstream

git fetch origin
git checkout master              # make sure the merge lands on master
git merge origin/master          # fast-forward if clean
git push gitea master

Do this periodically. Conflicts should be rare because we only add tools and docs and do not modify core code.
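
If master should only ever move by fast-forward, the sync can be wrapped so that it stops instead of creating a merge commit. A minimal sketch, assuming the remote names above and a local branch called master; sync_upstream is a hypothetical helper, and --ff-only is a stricter choice than the plain merge shown above.

```shell
# Sketch: sync local master from the upstream remote, refusing anything
# that is not a fast-forward. Helper name and argument order are assumptions.
sync_upstream() {
    upstream="${1:-origin}"
    fork="${2:-gitea}"
    git fetch "$upstream" || return 1
    git checkout master || return 1
    # --ff-only exits non-zero if local master has diverged from upstream.
    git merge --ff-only "$upstream/master" || return 1
    git push "$fork" master
}

# Usage:
#   sync_upstream origin gitea
```

A non-zero exit from the merge step is the signal to resolve the divergence by hand rather than push.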

Branch Structure

master                    — always tracks upstream master (clean merge)
feature/<short-desc>      — active development branches (e.g., feature/mul-mat-contig-reads)
profile/<desc>            — profiling/measurement branches
fix/<desc>                — bug fixes found during profiling
exp/<desc>                — experimental, may be discarded

Branches are short-lived. Merge to master via PR, then delete.
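
The naming scheme above can be checked mechanically, for example before pushing. A minimal sketch; valid_branch is a hypothetical helper, and restricting the description part to lowercase letters, digits, and hyphens is an assumption, not a rule stated here.

```shell
# Sketch: check a branch name against the prefixes listed above.
# Returns 0 for "master" or "<prefix>/<desc>", 1 otherwise.
# Assumption: <desc> may contain only lowercase letters, digits, and hyphens.
valid_branch() {
    case "$1" in
        master) return 0 ;;
        feature/*|profile/*|fix/*|exp/*) ;;
        *) return 1 ;;
    esac
    desc="${1#*/}"
    [ -n "$desc" ] || return 1
    # Delete every allowed character; anything left over is not allowed.
    rest=$(printf '%s' "$desc" | LC_ALL=C tr -d 'a-z0-9-')
    [ -z "$rest" ]
}

# Usage:
#   valid_branch "$(git rev-parse --abbrev-ref HEAD)" || echo "bad branch name"
```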

Issue Tracking

All work items are tracked as issues on https://git.kokoham.com/sleepy/llama.cpp/issues.

Issue labels:

perf — performance investigation
kernel — Metal kernel changes
profiling — measurement/tooling
doc — documentation only
bug — correctness issues
infra — CI, build, repo setup

Pull Request Workflow

1. Create a branch from master: git checkout -b feature/<name>
2. Make changes and commit using the [area] description convention (see Commit Messages below)
3. Push the branch: git push gitea feature/<name>
4. Create a PR on Gitea targeting master
5. Before merging: build, benchmark (record results in BENCHMARKS.md), and run a perplexity check if a kernel changed
6. Squash-merge to master

Commit Messages

Format: [area] short description (max 72 chars)

Areas: metal, profile, docs, build, tool

Examples:

[metal] add contiguous weight read path to Q4_0 mul_mat kernel
[profile] add per-op timing to metal encode loop
[docs] graph profile results for 9B Q4_0 at ctx=256
[tool] llama-eval-callback-profile: non-syncing cb_eval profiler
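
The format can be enforced with a small check, for instance from a commit-msg hook. A minimal sketch; valid_subject is a hypothetical helper, and applying the 72-character limit to the whole subject line (prefix included) is an assumption.

```shell
# Sketch: validate a commit subject line against "[area] short description".
# The area list is the one given above. Assumption: the 72-character limit
# covers the whole subject line, including the "[area] " prefix.
valid_subject() {
    subject="$1"
    [ "${#subject}" -le 72 ] || return 1
    case "$subject" in
        "[metal] "?*|"[profile] "?*|"[docs] "?*|"[build] "?*|"[tool] "?*)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# Usage inside .git/hooks/commit-msg, where $1 is the message file:
#   valid_subject "$(head -n 1 "$1")" || { echo "bad subject line" >&2; exit 1; }
```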

Agent Instructions

When working autonomously, agents MUST:

Never push to origin: it has no credentials, which is a deliberate safety measure
Create a branch for any code change: feature/<issue-number>-<short-desc>
Reference the issue in commits: [area] description (#123)
Run benchmarks before/after kernel changes and record in BENCHMARKS.md

Run perplexity to verify correctness after any kernel change:

./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128

Verify that the build succeeds before pushing:

cmake --build build-build -j$(sysctl -n hw.ncpu)

Push branch to gitea, then create PR via Gitea API (not via git push)
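
An agent can derive the branch name from the issue number and title instead of composing it by hand. A minimal sketch; agent_branch is a hypothetical helper, and the slug rules (lowercase, everything else collapsed to single hyphens) are an assumption about what a short description should look like.

```shell
# Sketch: build "feature/<issue-number>-<short-desc>" from an issue number
# and free-form text. The slug rules are an assumption, not a stated policy.
agent_branch() {
    issue="$1"
    shift
    slug=$(printf '%s' "$*" \
        | LC_ALL=C tr 'A-Z' 'a-z' \
        | LC_ALL=C tr -cs 'a-z0-9' '-' \
        | sed 's/^-//; s/-$//')
    printf 'feature/%s-%s\n' "$issue" "$slug"
}

# Usage:
#   git checkout -b "$(agent_branch 123 "Mul mat: contiguous reads")"
```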

Build

# Initial cmake (one time)
cmake -B build-build -DGGML_METAL=ON -DGGML_BLAS=ON -DGGML_ACCELERATE=ON

# Incremental build
cmake --build build-build -j$(sysctl -n hw.ncpu)

# Build specific target
cmake --build build-build --target llama-eval-callback-profile -j$(sysctl -n hw.ncpu)

Benchmark Commands

# Quick bench (pp + tg)
./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3

# Long tg bench (bandwidth-sensitive)
./build-build/bin/llama-bench -m MODEL.gguf -p 1 -t 1 -n 4096 -o md -r 2

# Perplexity
./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
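
When recording before/after numbers in BENCHMARKS.md, the relative change is often easier to read than two raw t/s values. A minimal sketch; pct_change is a hypothetical helper that takes the two readings as arguments and does not parse llama-bench output itself.

```shell
# Sketch: percentage change between two tokens/s readings.
# Usage: pct_change <before> <after>; a positive result means faster.
pct_change() {
    awk -v b="$1" -v a="$2" 'BEGIN {
        if (b == 0) { print "n/a"; exit }
        printf "%+.2f%%\n", (a - b) / b * 100
    }'
}

# Usage, with the two t/s values copied from the llama-bench tables:
#   pct_change <before> <after>
```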

Profiling Tools

| Tool | What it does |
|------|--------------|
| `llama-eval-callback-profile` | Counts ops + bytes per decode tick (non-syncing cb_eval) |
| `GGML_METAL_GRAPH_DEBUG=1` | Prints per-op graph during compute (needs `-v` flag) |
| `GGML_METAL_GRAPH_DEBUG=2` | Also prints tensor shapes |
| `GGML_METAL_CAPTURE_COMPUTE=N` | Captures Nth compute call to Xcode Instruments GPUtrace |
| `GGML_METAL_CONCURRENCY_DISABLE=1` | Disable concurrent encoding (benchmark impact) |
| `GGML_METAL_FUSION_DISABLE=1` | Disable op fusion (benchmark impact) |
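
To measure the impact of one of the disable switches, the same command can be run once with the variable unset and once with it set. A minimal sketch; run_ab is a hypothetical helper that works with any command and any of the environment variables above.

```shell
# Sketch: run the same command twice, first with the variable unset and
# then with it set to 1. Helper name and argument order are assumptions.
run_ab() {
    var="$1"
    shift
    echo "== $var unset =="
    # Subshell so that the unset does not leak into the calling shell.
    ( unset "$var"; "$@" ) || return 1
    echo "== $var=1 =="
    env "$var=1" "$@"
}

# Usage:
#   run_ab GGML_METAL_FUSION_DISABLE \
#       ./build-build/bin/llama-bench -m MODEL.gguf -p 512 -t 1 -n 128 -o md -r 3
```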

Model Files

Located at /Users/sleepy/.llama/models/:

Qwen3.5-4B-Q4_0.gguf      (2.40 GiB)
Qwen3.5-9B-Q4_0.gguf      (5.00 GiB)
Qwen3.5-9B-IQ4_NL.gguf    (4.99 GiB)
Qwen3.5-9B-IQ4_XS.gguf    (4.80 GiB)
Qwen3.6-27B-Q4_0.gguf     (14.70 GiB)

Key Source Files

ggml/src/ggml-metal/ggml-metal.metal          — Metal shader kernels (Q4_0 dot: line 3228)
ggml/src/ggml-metal/ggml-metal-device.cpp     — Pipeline dispatch (get_pipeline_mul_mv: line 741)
ggml/src/ggml-metal/ggml-metal-ops.cpp        — Op encoding (MUL_MAT: line 2257)
ggml/src/ggml-metal/ggml-metal-context.m      — Graph compute (line 438)
ggml/src/ggml-metal/ggml-metal-impl.h         — Tuning params (N_R0, N_SG)
examples/eval-callback/eval-callback-profile.cpp — Custom profiler tool
BENCHMARKS.md                                 — All benchmark results
ANALYSIS_QWEN3_5_MXFP4.md                     — MXFP4 format analysis

Gitea API

Base: https://git.kokoham.com/api/v1
Token: in ~/.gitea_token (not committed)
Local API from server: http://127.0.0.1:18431/api/v1

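# Create issue
# Note: the Gitea API expects numeric label IDs in "labels", not label names.
# Look them up with GET /repos/sleepy/llama.cpp/labels; the 1 below is a placeholder.
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/issues" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","labels":[1]}'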

# Create PR
curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/pulls" \
  -H "Authorization: token $(cat ~/.gitea_token)" \
  -H "Content-Type: application/json" \
  -d '{"title":"...","body":"...","head":"feature/xyz","base":"master"}'
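
When the title or body comes from a variable, hand-written JSON breaks as soon as the text contains a double quote. A minimal sketch of building the PR payload; json_escape and pr_payload are hypothetical helpers, and the escaping covers only backslashes and double quotes, not newlines or other control characters.

```shell
# Sketch: build the JSON body for the "Create PR" call above.
# Limitation: only backslashes and double quotes are escaped.
json_escape() {
    printf '%s' "$1" | sed 's/\\/\\\\/g; s/"/\\"/g'
}

# Usage: pr_payload <title> <body> <head-branch>
pr_payload() {
    printf '{"title":"%s","body":"%s","head":"%s","base":"master"}\n' \
        "$(json_escape "$1")" "$(json_escape "$2")" "$(json_escape "$3")"
}

# Usage with the curl call above:
#   curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/pulls" \
#     -H "Authorization: token $(cat ~/.gitea_token)" \
#     -H "Content-Type: application/json" \
#     -d "$(pr_payload "[metal] title" "Closes #123" "feature/xyz")"
```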
