# Intel GPU Diagnosis Patches

Patches generated by a 3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzing Intel Arc GPU performance issues in llama.cpp. The models' non-overlapping training data helps ensure blind spots are caught through cross-review.

## Quick Start

```bash
# Apply one phase at a time (recommended):
cd repos
./patch/apply-phase.sh 1   # Phase 1: SYCL graph default
./patch/apply-phase.sh 2   # Phase 2: Kernel tuning (VER_GEN + DMMV)
./patch/apply-phase.sh 3   # Phase 3: Vulkan Arc 140T fix (independent)
./patch/apply-phase.sh 4   # Phase 4: Host-buffer copy opt-in

# Reverse a single phase, or all of them:
./patch/apply-phase.sh 1 --reverse
./patch/apply-phase.sh all --reverse
```

## Test Environment

- **GPU:** Intel Arc A770 16GB (card0, xe driver, PCI 8086:56a0)
- **Secondary:** AMD RX 580 8GB (card1, amdgpu — not used for SYCL)
- **CPU:** AMD Ryzen 7 5800X, 32GB DDR4
- **OS:** CachyOS, kernel 6.19.10
- **oneAPI:** 2025.3.2, DPC++ compiler 2025.3.2
- **Build:** `-DGGML_SYCL=ON -DGGML_VULKAN=OFF -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release`

## Benchmark Results (updated 2026-04-15, llama-bench)

**Method:** `llama-bench -ngl 99 -p 512 -n 128 -r 3` (3 repeats).

### Qwen3.5-9B (dense, fits entirely in VRAM)

| Config | Q4_0 tg128 (t/s) | Q8_0 tg128 (t/s) | Q4_K_M tg128 (t/s) |
|--------|------------------|------------------|--------------------|
| Baseline (HEAD) | 29.4 | 28.6 | — |
| +Phase 2 (VER+tuning) | 30.16 | 29.96 | 24.65 |
| **+Phase 2+5 (vdr_mmvq)** | **35.96** | **30.82** | **25.32** |

**Phase 5 is the first significant improvement: +19% on Q4_0 token generation.**

### Bandwidth Utilization

| Config | Q4_0 BW | Q4_0 BW% | Q8_0 BW | Q8_0 BW% |
|--------|---------|----------|---------|----------|
| Baseline | 150 GiB/s | 29% | 265 GiB/s | 52% |
| +Phase 5 | 180 GiB/s | 35% | 273 GiB/s | 53% |

### Root Cause (corrected)

Previous analysis attributed the low bandwidth utilization to "SYCL submission model overhead". **This was wrong.** Per-op profiling shows:

1. The SYCL queue naturally batches async kernel submissions: the CPU submits all 1077 ops in 7.5 ms against 32 ms of GPU execution.
2. Q4_0 is **dp4a-compute-bound**, not memory-bandwidth-bound.
3. Nibble packing means each Q4_0 byte feeds 2 dp4a operations (low + high nibbles), while a Q8_0 byte feeds only 1.
4. Both formats therefore hit the same dp4a throughput ceiling → the same ~30 t/s despite the different data sizes.
5. Phase 5's vdr_mmvq increase processes more blocks per subgroup, better amortizing the dp4a overhead.

See `../logs/root-cause-analysis-20260415.md` for the full profiling data.

### Qwen3.5-35B-A3B (MoE, `--cpu-moe`)

| Config | Status |
|--------|--------|
| Baseline (unpatched) | ✅ Works (15.2 t/s gen from an earlier run) |
| +Phase 1 (graph) | ❌ **CRASH**: `async_malloc` in `opt_for_reorder` |

**Critical finding:** Phase 1 (enabling SYCL graph by default) crashes on the 35B MoE model. The crash occurs in `opt_for_reorder` → `async_malloc` during graph computation: the graph path triggers an async memory allocation that fails on the Arc A770. This confirms the original developer's rationale for disabling graphs by default — the async malloc path is broken for this hardware/workload combination.

## Phases

### Phase 1 — SYCL Graph Default ⚠️ CRASHES ON MoE

| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | ggml-sycl.cpp:217 | Graph default 1→0 | ⚠️ Crashes 35B MoE |

Enables SYCL graph execution by default. Works on small dense models but crashes on the 35B MoE model with an `async_malloc` failure in `opt_for_reorder`. The original default of disabled (1) was correct for the Arc A770.
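As a side note on how the toggle works: assuming the default lives in the backend's `GGML_SYCL_DISABLE_GRAPH` environment-variable fallback (as it does in upstream llama.cpp's `ggml-sycl.cpp` at the time of writing), Phase 1 only flips that fallback, so the same behavior can be selected at run time without rebuilding. The sketch below is illustrative, not the backend's code; `env_or_default` is a hypothetical helper standing in for the backend's own env lookup.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper standing in for the backend's integer env lookup.
static int env_or_default(const char * name, int fallback) {
    const char * val = std::getenv(name);
    return val != nullptr ? std::atoi(val) : fallback;
}

int main() {
    // Baseline: SYCL graphs stay off unless the user sets GGML_SYCL_DISABLE_GRAPH=0.
    const int disable_graph_baseline = env_or_default("GGML_SYCL_DISABLE_GRAPH", /*fallback=*/1);

    // Phase 1: only the fallback flips, so graphs are on unless the user sets
    // GGML_SYCL_DISABLE_GRAPH=1. Given the MoE crash above, the baseline
    // default is the safer choice on the Arc A770.
    const int disable_graph_phase1 = env_or_default("GGML_SYCL_DISABLE_GRAPH", /*fallback=*/0);

    std::printf("disable_graph: baseline=%d, phase1=%d\n",
                disable_graph_baseline, disable_graph_phase1);
    return 0;
}
```

If Phase 1 is applied anyway, exporting `GGML_SYCL_DISABLE_GRAPH=1` should restore the baseline graphs-off behavior without reversing the patch.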
### Phase 2 — SYCL Kernel Tuning ✅ Neutral

| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | common.hpp:90-91 | VER_GEN12 1M→1200, VER_GEN13→1300 | ✅ Neutral |
| 0002 | presets.hpp:57,60 | DMMV_X 32→64, MMV_Y 1→2 | ✅ Neutral |
| 0003 | common.hpp:106,109 | DMMV_X 32→64, MMV_Y 1→2 | ✅ Neutral |

Fixes the VER_GEN12 placeholder but shows no measurable impact on the 9B dense model. Needs testing on MoE or larger models to evaluate properly. DMMV_X=64 vs 32 is within noise on this workload.

### Phase 3 — Vulkan Intel ⏳ Not Tested (independent)

| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | ggml-vulkan.cpp | Arc 140T Xe2 override | ⏳ Not tested (no spirv-headers) |

### Phase 4 — Host-Buffer Copy ✅ Neutral

| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | ggml-sycl.cpp | mmap workaround opt-in | ✅ Neutral on 9B |

Removes the blanket Linux host-buffer double-copy. Neutral on the 9B dense model, which fits in VRAM; should help on models that stress the host→device copy path.

### Phase 5 — Q4_0 MMVQ vdr_mmvq Tuning ✅ +19% Q4_0

| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | quants.hpp:47 | Q4_0 reorder vdr_mmvq 2→4 | ✅ +5.8 t/s Q4_0 |

Increases the "vector dot product rows per MMVQ" parameter for Q4_0's reorder path. This processes more blocks per subgroup iteration (8→16 blocks), better amortizing the dp4a compute overhead. Q4_0 is dp4a-bound rather than bandwidth-bound because each packed byte feeds 2 dp4a operations (low + high nibbles) versus 1 for a Q8_0 byte. Result: Q4_0 tg128 goes from 30.16 to 35.96 t/s (+19%), with no impact on Q8_0 or Q4_K_M.

## Council Deliberations

Stored in `../logs/` (gitignored). Key files:

- `logs/decisions.md` — All council decisions with rationale
- `logs/test-machine-megumin.md` — Test machine environment and benchmarks
- `logs/M-sync-overhead-*.md` — Agent-M sync analysis
- `logs/K-kernel-tuning-*.md` — Agent-K kernel tuning analysis
- `logs/M-review-K-*.md`, `logs/K-review-M-*.md` — Cross-reviews

## Key Lesson

The Arc A770 bottleneck for Q4_0 token generation is **dp4a compute throughput**, not memory bandwidth or submission overhead. The 4-bit nibble packing means each packed byte feeds two dp4a operations (low + high nibbles), leaving the kernel compute-bound at ~30 t/s; a short sketch of the nibble-split dot product follows the list below. Further improvements require:

1. DPAS/XMX integration for quantized dot products
2. Algorithmic changes to the nibble unpacking (e.g., lookup tables)
3. Larger vdr_mmvq (requires larger qi/QI4_0 — would need data format changes)
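To make the nibble-split cost concrete, here is a minimal, self-contained C++ sketch of the Q4_0 dot-product shape discussed above. It is not the SYCL kernel: `dp4a_ref` is a scalar stand-in for the hardware dp4a instruction, and `q4_0_dot_word` is an illustrative name. The point it shows is that every packed 32-bit Q4_0 word (8 weights) splits into a low-nibble and a high-nibble operand, so each packed byte ends up feeding two dp4a operations, versus one for a Q8_0 byte.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Scalar stand-in for the hardware dp4a instruction: a 4-lane int8 x int8
// multiply-accumulate into a 32-bit accumulator.
static int32_t dp4a_ref(int32_t a, int32_t b, int32_t acc) {
    int8_t pa[4], pb[4];
    std::memcpy(pa, &a, sizeof(pa));
    std::memcpy(pb, &b, sizeof(pb));
    for (int i = 0; i < 4; ++i) {
        acc += int32_t(pa[i]) * int32_t(pb[i]);
    }
    return acc;
}

// One packed Q4_0 word holds 8 nibbles (8 weights). Splitting it into low and
// high nibbles produces two 4 x int8 operands, so the word costs two dp4a
// against the int8 activations (act_lo / act_hi). A Q8_0 word holds only
// 4 weights and costs a single dp4a.
static int32_t q4_0_dot_word(uint32_t q4, int32_t act_lo, int32_t act_hi, int32_t acc) {
    const int32_t lo = int32_t( q4       & 0x0F0F0F0Fu); // low nibbles  -> 4 x int8
    const int32_t hi = int32_t((q4 >> 4) & 0x0F0F0F0Fu); // high nibbles -> 4 x int8
    acc = dp4a_ref(lo, act_lo, acc);
    acc = dp4a_ref(hi, act_hi, acc);
    return acc; // caller still applies the block scale and the Q4_0 zero point of 8
}

int main() {
    // Smoke test: 8 weights with nibble value 9, activations all 1 -> raw dot 72.
    const uint32_t q4   = 0x99999999u;
    const int32_t  ones = 0x01010101;
    std::printf("raw dot = %d\n", q4_0_dot_word(q4, ones, ones, 0));
    return 0;
}
```

Because a Q4_0 tensor also has half the bytes of its Q8_0 counterpart, the total dp4a count per weight works out the same for both formats, which is why they plateau at the same ~30 t/s ceiling noted in the root-cause analysis.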