2 Commits

Author SHA1 Message Date
Kaloyan Nikolov 757ef4de97 [docs] add coherence tests, MLX benchmarking, onboarding, Gitea API
2026-05-01 00:44:37 +02:00
sleepy 8c532835be [metal] extend bin op fusion to MUL/SUB/DIV chains (#28) (#38)
2026-04-30 21:03:14 +02:00
3 changed files with 141 additions and 35 deletions
GIT.md +116 -16
@@ -52,9 +52,59 @@ Issue labels:
 2. Make changes, commit with `[area] description` conventions (see below)
 3. Push branch: `git push gitea feature/<name>`
 4. Create PR on Gitea targeting `master`
-5. Before merge: build, benchmark (record in BENCHMARKS.md), perplexity check if kernel changed
+5. Before merge: build, benchmark (record in BENCHMARKS.md), perplexity check if kernel changed, **coherence test** (see below)
 6. Squash-merge to master
+
+## Pre-Merge Coherence Tests
+
+**Mandatory before any merge to master.** Run on both `master` and the PR branch to detect silent correctness regressions.
+
+> **IMPORTANT:** macOS has no `timeout` command. Use `gtimeout` (from `brew install coreutils`).
+
+### Quick Coherence Test (4B model, ~30s)
+
+```bash
+gtimeout 60 ./build-build/bin/llama-cli \
+  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
+  -n 64 -p "Once upon a time" \
+  --temp 0 -s 42 -st
+```
+
+### Perplexity Check (kernel changes only)
+
+```bash
+gtimeout 120 ./build-build/bin/llama-perplexity \
+  -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
+  -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
+```
+
+### Verification
+
+- **llama-cli**: PR output must be coherent speech (not gibberish). It does not need to be bit-perfect vs master.
+- **perplexity**: PR perplexity must match master within floating-point tolerance (<0.1% delta).
+- **Gibberish output = block merge.** Re-dispatch with specific feedback.
+
+### Timeout Policy
+
+**All test commands MUST use `gtimeout`** to prevent hangs:
+
+- Inference/cli: `gtimeout 60` (60s)
+- Perplexity: `gtimeout 120` (120s)
+- Benchmark: `gtimeout 300` (5min)
+- Build: `gtimeout 300` (5min)
+
+A hung test is a test failure. Do not retry without investigating the hang.
+
+### IMPORTANT: llama-cli interactive mode
+
+llama-cli enters an interactive REPL after generating, flooding output with `>` prompts. This is NOT a correctness failure — it's the CLI waiting for input.
+
+**Always use the `--single-turn` (`-st`) flag to prevent this:**
+
+```bash
+gtimeout 60 ./build-build/bin/llama-cli -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf -n 64 -p "Once upon a time" --temp 0 -s 42 -st
+```
+
+Without `-st`, you will see `>` garbage and the process will hang. DO NOT attempt to "fix" the kernel because of this.
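The <0.1% perplexity tolerance above can be checked mechanically rather than by eye. A minimal sketch, assuming llama-perplexity prints its usual `Final estimate: PPL = <value>` summary line, and using `build-master` as an illustrative name for a separate build of `master` (not a repo convention):

```bash
# Hypothetical helper: fail when master and PR perplexity differ by >= 0.1%.
ppl() {
    # $1 = build directory; print only the PPL value
    gtimeout 120 "$1/bin/llama-perplexity" \
        -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf \
        -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128 2>&1 |
        sed -n 's/.*Final estimate: PPL = \([0-9.]*\).*/\1/p'
}

master=$(ppl build-master)   # illustrative: a build of master
pr=$(ppl build-build)        # the PR branch build

# Relative delta must stay below 0.001 (0.1%); nonzero exit blocks the merge
awk -v a="$master" -v b="$pr" 'BEGIN {
    d = (a > b ? a - b : b - a) / a;
    printf "master=%s pr=%s delta=%.5f\n", a, b, d;
    exit (d < 0.001 ? 0 : 1);
}'
```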
 ## Commit Messages

 Format: `[area] short description (max 72 chars)`
@@ -77,15 +127,23 @@ When working autonomously, agents MUST:
 2. **Create a branch** for any code change: `feature/<issue-number>-<short-desc>`
 3. **Reference the issue** in commits: `[area] description (#123)`
 4. **Run benchmarks** before/after kernel changes and record in BENCHMARKS.md
-5. **Run perplexity** to verify correctness after any kernel change:
-   ```bash
-   ./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
-   ```
-6. **Build succeeds** before pushing:
-   ```bash
-   cmake --build build-build -j$(sysctl -n hw.ncpu)
-   ```
-7. **Push branch** to gitea, then **create PR via Gitea API** (not via git push)
+5. **Run perplexity** to verify correctness after any kernel change (with timeout):
+   ```bash
+   gtimeout 120 ./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
+   ```
+6. **Run coherence test** before any merge (with timeout):
+   ```bash
+   gtimeout 60 ./build-build/bin/llama-cli -m ~/.llama/models/Qwen3.5-4B-Q4_0.gguf -n 64 -p "Once upon a time" --temp 0 -s 42 -st
+   ```
+   Output must be coherent speech (not gibberish).
+7. **Build succeeds** before pushing (with timeout):
+   ```bash
+   gtimeout 300 cmake --build build-build -j$(sysctl -n hw.ncpu)
+   ```
+8. **Push branch** to gitea, then **create PR via Gitea API** (not via git push)
+
+> **NOTE:** macOS has no `timeout` command. Always use `gtimeout` (from `brew install coreutils`).
+> **NOTE:** Always use the `-st` flag with llama-cli to prevent interactive-mode `>` prompts.
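The `gtimeout` requirement is the one portability trap in this checklist. A sketch of a hypothetical wrapper (not part of the repo) that prefers coreutils' `gtimeout` on macOS and falls back to GNU `timeout` on Linux:

```bash
#!/usr/bin/env bash
# with_timeout.sh: hypothetical helper, not part of the repo.
# Usage: ./with_timeout.sh <seconds> <command> [args...]
set -euo pipefail

secs="$1"; shift

if command -v gtimeout >/dev/null 2>&1; then
    t=gtimeout    # macOS: brew install coreutils
elif command -v timeout >/dev/null 2>&1; then
    t=timeout     # Linux: GNU coreutils
else
    echo "error: need gtimeout or timeout (brew install coreutils)" >&2
    exit 1
fi

# Exit status passes through; 124 means the timeout fired, which the
# policy above treats as a test failure.
exec "$t" "$secs" "$@"
```

Usage mirrors the checklist, e.g. `./with_timeout.sh 300 cmake --build build-build -j$(sysctl -n hw.ncpu)`.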
 ## Build
@@ -113,6 +171,20 @@ cmake --build build-build --target llama-eval-callback-profile -j$(sysctl -n hw.ncpu)
 ./build-build/bin/llama-perplexity -m MODEL.gguf -f /tmp/coherence_test.txt -t 1 --chunks 1 -c 128
 ```
+
+## MLX Benchmarking
+
+MLX-lm is the performance target. Models at `~/.omlx/models/`.
+
+```bash
+# Quick generation test
+mlx_lm.generate --model ~/.omlx/models/Qwen3.6-27B-Q4_0 --prompt "Once upon a time" --max-tokens 128
+
+# Benchmark with timing
+time mlx_lm.generate --model ~/.omlx/models/Qwen3.6-27B-Q4_0 --prompt "Once upon a time" --max-tokens 4096
+```
+
+Compare llama.cpp results against MLX baselines. Record in BENCHMARKS.md.
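To turn those runs into numbers worth recording, one option is a side-by-side script. A sketch under two stated assumptions: `llama-bench` (which ships with llama.cpp) reports tokens/sec directly, while the grep pattern for mlx-lm's summary line is a guess that may need adjusting to the installed mlx-lm version; model paths are illustrative, and a fair comparison needs the same model on both sides:

```bash
# Hypothetical side-by-side benchmark; record both results in BENCHMARKS.md.
MODEL_GGUF=~/.llama/models/Qwen3.5-4B-Q4_0.gguf   # illustrative
MODEL_MLX=~/.omlx/models/Qwen3.6-27B-Q4_0         # illustrative

# llama.cpp side: llama-bench prints prompt and generation tokens/sec
gtimeout 300 ./build-build/bin/llama-bench -m "$MODEL_GGUF"

# MLX side: keep only the timing summary (pattern is an assumption)
gtimeout 300 mlx_lm.generate --model "$MODEL_MLX" \
    --prompt "Once upon a time" --max-tokens 4096 2>&1 |
    grep -i "tokens-per-sec"
```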
 ## Profiling Tools

 | Tool | What it does |
@@ -152,19 +224,47 @@ ANALYSIS_QWEN3_5_MXFP4.md — MXFP4 format analysis
 ## Gitea API

 Base: `https://git.kokoham.com/api/v1`
-Token in `~/.gitea_token` (not committed).
+Local API from server: `http://127.0.0.1:18431/api/v1`
+Token in `~/Documents/personal/projects/.env` as `GITEA_TOKEN`.

 ```bash
+export $(grep -v '^#' ~/Documents/personal/projects/.env | xargs)
+
 # Create issue
-curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/issues" \
-  -H "Authorization: token $(cat ~/.gitea_token)" \
+curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues" \
+  -H "Authorization: token $GITEA_TOKEN" \
   -H "Content-Type: application/json" \
   -d '{"title":"...","body":"...","labels":["perf"]}'

 # Create PR
-curl -X POST "http://127.0.0.1:18431/api/v1/repos/sleepy/llama.cpp/pulls" \
-  -H "Authorization: token $(cat ~/.gitea_token)" \
+curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/pulls" \
+  -H "Authorization: token $GITEA_TOKEN" \
   -H "Content-Type: application/json" \
   -d '{"title":"...","body":"...","head":"feature/xyz","base":"master"}'
+
+# Comment on issue/PR
+curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues/{number}/comments" \
+  -H "Authorization: token $GITEA_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"body":"..."}'
+
+# Close issue
+curl -X PATCH "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues/{number}" \
+  -H "Authorization: token $GITEA_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"state":"closed"}'
+
+# Merge PR
+curl -X POST "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/pulls/{number}/merge" \
+  -H "Authorization: token $GITEA_TOKEN" \
+  -H "Content-Type: application/json" \
+  -d '{"do_force_merge":false,"merge_title":"..."}'
 ```
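The write endpoints above pair naturally with Gitea's standard v1 read endpoints for checking state before acting. A sketch, assuming `jq` is installed; the `{number}` placeholder and the filters are illustrative:

```bash
# List open perf-labeled issues (state/labels are standard query parameters)
curl -s "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/issues?state=open&labels=perf" \
  -H "Authorization: token $GITEA_TOKEN" | jq -r '.[] | "#\(.number) \(.title)"'

# Inspect a PR before calling the merge endpoint
curl -s "https://git.kokoham.com/api/v1/repos/sleepy/llama.cpp/pulls/{number}" \
  -H "Authorization: token $GITEA_TOKEN" | jq '{state, mergeable, merged}'
```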
+
+## Onboarding — What to Read
+
+For new agents or sessions, read in this order:
+
+1. **GIT.md** — this file (workflow, tests, commands, file locations)
+2. **BENCHMARKS.md** — all benchmark results, track progress toward 22 t/s
+3. **ANALYSIS_QWEN3_5_MXFP4.md** — MXFP4 format analysis (if relevant)
+4. **Issue #40** — target t/s goal and MLX comparison guidelines
+5. **MLX reference**: `../mlx-lm/mlx/include/mlx/backend/metal/kernels/quantized.h` — qmv_fast_impl
+17 -7
@@ -394,19 +394,29 @@ void ggml_graph_optimize(ggml_cgraph * gf) {
         // fuse only ops that start with these operations
         // can be expanded when needed
         if (node.op() == GGML_OP_ADD ||
+            node.op() == GGML_OP_SUB ||
             node.op() == GGML_OP_MUL ||
+            node.op() == GGML_OP_DIV ||
             node.op() == GGML_OP_NORM ||
             node.op() == GGML_OP_RMS_NORM) {
             ops[0] = node.op();

             int f = i + 1;
             while (f < n && f < i + MAX_FUSE) {
-                // conservatively allow fusing only these ops
-                // can be expanded when needed
-                if (gf->nodes[f]->op != GGML_OP_ADD &&
-                    gf->nodes[f]->op != GGML_OP_MUL &&
-                    gf->nodes[f]->op != GGML_OP_NORM &&
-                    gf->nodes[f]->op != GGML_OP_RMS_NORM) {
-                    break;
+                // bin ops (ADD/SUB/MUL/DIV) must be same type to fuse
+                // NORM/RMS_NORM can chain with MUL/ADD
+                if (node.op() == GGML_OP_ADD ||
+                    node.op() == GGML_OP_SUB ||
+                    node.op() == GGML_OP_MUL ||
+                    node.op() == GGML_OP_DIV) {
+                    if (gf->nodes[f]->op != node.op()) break;
+                } else {
+                    if (gf->nodes[f]->op != GGML_OP_ADD &&
+                        gf->nodes[f]->op != GGML_OP_MUL &&
+                        gf->nodes[f]->op != GGML_OP_NORM &&
+                        gf->nodes[f]->op != GGML_OP_RMS_NORM) {
+                        break;
+                    }
                 }
                 ops[f - i] = gf->nodes[f]->op;
                 f++;
+8 -12
@@ -3118,19 +3118,15 @@ int ggml_metal_op_bin(ggml_metal_op_t ctx, int idx) {
     int n_fuse = 1;

-    // c[0] = add(a, b[0])
-    // c[1] = add(c[0], b[1])
-    // c[2] = add(c[1], b[2])
+    // c[0] = op(a, b[0])
+    // c[1] = op(c[0], b[1])
+    // c[2] = op(c[1], b[2])
     // ...

     if (use_fusion) {
-        fops[0] = GGML_OP_ADD;
-        fops[1] = GGML_OP_ADD;
-        fops[2] = GGML_OP_ADD;
-        fops[3] = GGML_OP_ADD;
-        fops[4] = GGML_OP_ADD;
-        fops[5] = GGML_OP_ADD;
-        fops[6] = GGML_OP_ADD;
-        fops[7] = GGML_OP_ADD;
+        ggml_op cur_op = op->op;
+        for (int i = 0; i < 8; ++i) {
+            fops[i] = cur_op;
+        }

         // note: in metal, we sometimes encode the graph in parallel so we have to avoid fusing ops
         // across splits. idx_end indicates the last node in the current split
@@ -3165,7 +3161,7 @@ int ggml_metal_op_bin(ggml_metal_op_t ctx, int idx) {
             ++n_fuse;

             if (debug_fusion > 1 && n_fuse > 1) {
-                GGML_LOG_DEBUG("%s: fuse: ADD x %d\n", __func__, n_fuse);
+                GGML_LOG_DEBUG("%s: fuse: %s x %d\n", __func__, ggml_op_name(cur_op), n_fuse);
             }
         }