sleepy 6ad84d543c feat: phased patch system for Intel Arc GPU performance fixes

3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzed Intel Arc GPU
performance issues and produced patches for llama.cpp:

Phase 1 - SYCL Sync: Enable graph execution by default (GGML_SYCL_DISABLE_GRAPH)
Phase 2 - SYCL Kernel: Fix VER_GEN12/13 thresholds, tune DMMV_X/MMV_Y
Phase 3 - Vulkan Intel: Arc 140T device-ID Xe2 override

Includes:
- Phased apply script (apply-phase.sh [1|2|3|all])
- Master apply script with --status/--reverse/--dry-run
- Per-phase READMEs with testing checklists
- Council deliberation logs (gitignored in logs/)

Verified: all patches apply/reverse cleanly via git apply.
Static verification: VER_GEN arithmetic and DMMV_X divisibility pass.

2026-04-15 14:53:40 +02:00

phase1-sycl-sync
phase2-sycl-kernel
phase3-vulkan-intel
apply-phase.sh
apply.sh
README.md

README.md

Intel GPU Diagnosis Patches

These patches were generated by a 3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) that analyzed Intel Arc GPU performance issues in llama.cpp. The models have non-overlapping training data, so cross-review is intended to catch blind spots that any single model would miss.

Quick Start

# Apply all phases at once:
cd repos
./patch/apply.sh

# Apply one phase at a time (recommended for testing):
./patch/apply-phase.sh 1              # Apply phase 1
./patch/apply-phase.sh 2              # Apply phase 2 (after testing phase 1)
./patch/apply-phase.sh 3              # Apply phase 3 (after testing phase 2)

# Dry-run / reverse:
./patch/apply-phase.sh 1 --dry-run
./patch/apply-phase.sh 2 --reverse

# Check what's applied:
./patch/apply.sh --status
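A status check like the one above can be built on `git apply` alone: a patch that reverse-applies cleanly is already in the tree, and one that forward-applies cleanly is not. The sketch below shows that idea. The function name `patch_state` and its three output strings are hypothetical and are not taken from `apply.sh`.

```shell
# Hypothetical helper, not the actual apply.sh logic.
# Classifies a patch using only `git apply --check` (no files are modified).
patch_state() {
    patch="$1"
    if git apply --reverse --check "$patch" 2>/dev/null; then
        # The reverse patch applies cleanly, so the change is already present.
        echo "applied"
    elif git apply --check "$patch" 2>/dev/null; then
        # The forward patch applies cleanly, so the change is still missing.
        echo "not-applied"
    else
        # Neither direction applies: the target file has diverged.
        echo "conflict"
    fi
}
```

Calling `patch_state path/to/some.patch` from the repository root prints one of the three states without touching the working tree.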

Phases

Phase 1 — SYCL Sync (safest, highest impact)

Patch	File	Change	Decision
0001	ggml-sycl.cpp:217	GGML_SYCL_DISABLE_GRAPH default 1→0	Approved 3/3

Enables SYCL graph execution by default by flipping the GGML_SYCL_DISABLE_GRAPH default from 1 to 0. This eliminates 8 blocking .wait() calls. Expected 10-30% token generation speedup for single-GPU dense LLMs.
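Because the patch only changes a default, the two code paths can probably be compared within a single build by setting the variable explicitly. The sketch below assumes `GGML_SYCL_DISABLE_GRAPH` is read from the environment at runtime. `BENCH` and the function name `bench_graph_ab` are placeholders introduced here for illustration.

```shell
# Sketch: A/B the SYCL graph path by forcing the variable both ways.
# Assumption: the backend reads GGML_SYCL_DISABLE_GRAPH from the environment.
BENCH="${BENCH:-./build/bin/llama-bench}"   # placeholder path to the benchmark binary

bench_graph_ab() {
    model="$1"
    for disable in 1 0; do
        # 1 = graphs disabled (old default), 0 = graphs enabled (patched default)
        echo "== GGML_SYCL_DISABLE_GRAPH=$disable =="
        GGML_SYCL_DISABLE_GRAPH="$disable" "$BENCH" -m "$model" -ngl 99 -p 512 -n 128
    done
}
```

Running `bench_graph_ab <model>` prints both runs back to back, which keeps the comparison free of build-to-build differences.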

Phase 2 — SYCL Kernel Tuning (depends on Phase 1)

Patch	File	Change	Decision
0001	common.hpp:90-91	VER_GEN12 1M→1200, VER_GEN13→1300	Approved 3/3
0002	presets.hpp:57,60	DMMV_X 32→64, MMV_Y 1→2	Approved 3/3, needs bench
0003	common.hpp:106,109	DMMV_X 32→64, MMV_Y 1→2	Approved 3/3, needs bench

Fixes the VER_GEN12 placeholder (1M) that routed all Intel GPUs to NVIDIA Ampere paths. Tunes DMMV thread parameters for Arc hardware. Expected 5-15% additional improvement.
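The static checks on these constants can be reproduced with plain shell arithmetic. The sketch below is one plausible form of such checks, not the council's actual verification. The constant values come from the table above, while `NCOLS` is an illustrative row length and the helper names `check` and `verify_phase2` are hypothetical.

```shell
# Values from the Phase 2 table.
VER_GEN12=1200
VER_GEN13=1300
DMMV_X=64
NCOLS=4096   # illustrative row length, an assumption for this sketch

# check <description> <arithmetic expression>
# Prints ok/FAIL and returns non-zero when the expression evaluates to false.
check() {
    if [ "$(( $2 ))" -ne 0 ]; then
        echo "ok: $1"
    else
        echo "FAIL: $1"
        return 1
    fi
}

verify_phase2() {
    check "VER_GEN12 sorts below VER_GEN13" "VER_GEN12 < VER_GEN13" &&
    check "row length is divisible by DMMV_X" "NCOLS % DMMV_X == 0"
}
```

`verify_phase2` stops at the first failing check, so its exit status can gate a later build step.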

Phase 3 — Vulkan Intel (independent of Phase 1/2)

Patch	File	Change	Decision
0001	ggml-vulkan.cpp:302,349	Arc 140T device-ID Xe2 override	Approved 1/3

Fixes the misdetection of Arrow Lake H (Arc 140T) as non-Xe2 through a device-ID override, which enables the cooperative matrix path. Only Arc 140T systems are affected.

Testing Protocol

On an Intel Arc GPU test machine:

cd repos

# Apply one phase
./patch/apply-phase.sh 1

# Build
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
  -DCMAKE_BUILD_TYPE=Release -DGGML_AVX2=ON -DGGML_AVX=ON -DGGML_FMA=ON -DGGML_F16C=ON
cmake --build build -j$(nproc)

# Test
ctest --test-dir build --output-on-failure

# Benchmark before/after each phase
./build/bin/llama-bench -m <model> -ngl 99 -p 512 -n 128
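To compare a phase against the expected ranges, it helps to turn the before and after tokens-per-second figures into a percent change. The helper below is a minimal sketch. The name `pct_change` is hypothetical, and the two numbers are read off the llama-bench output by hand.

```shell
# pct_change <before_tps> <after_tps>
# Prints the relative change in percent with one decimal place.
pct_change() {
    awk -v old="$1" -v new="$2" 'BEGIN { printf "%.1f\n", (new - old) / old * 100 }'
}
```

For example, `pct_change 20 25` reports the gain from 20 t/s to 25 t/s. A negative result indicates a regression for that phase.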

Council Deliberations

Stored in ../logs/ (gitignored). Key files:

logs/decisions.md — All 6 council decisions with rationale
logs/M-sync-overhead-20260415.md — Agent-M sync analysis
logs/K-kernel-tuning-20260415.md — Agent-K kernel tuning analysis
logs/M-review-K-20260415.md — Cross-review
logs/K-review-M-20260415.md — Cross-review (caught memset_tensor error)

Deferred to Future Phases

Q4_K DMMV reorder (medium complexity)
Q6_K DMMV reorder (medium complexity)
Q5_K reorder for both DMMV and MMVQ (hard)
Host-buffer double-copy elimination
Async memory ops decoupled from graph