6ad84d543c
3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzed Intel Arc GPU performance issues and produced patches for llama.cpp:

Phase 1 — SYCL Sync: enable graph execution by default (GGML_SYCL_DISABLE_GRAPH)
Phase 2 — SYCL Kernel: fix VER_GEN12/13 thresholds, tune DMMV_X/MMV_Y
Phase 3 — Vulkan Intel: Arc 140T device-ID Xe2 override

Includes:
- Phased apply script (apply-phase.sh [1|2|3|all])
- Master apply script with --status/--reverse/--dry-run
- Per-phase READMEs with testing checklists
- Council deliberation logs (gitignored in logs/)

Verified: all patches apply/reverse cleanly via git apply. Static verification: VER_GEN arithmetic and DMMV_X divisibility pass.
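The apply/reverse verification mentioned above boils down to `git apply`'s `--check` and `-R` flags. The sketch below reproduces that workflow in a throwaway repository; the repo contents and the patch name `0001.patch` are illustrative stand-ins for llama.cpp and the Phase 1 patch, but the `git apply` invocations are the real ones a `--dry-run`/`--reverse` script would use.

```shell
# Illustration of the apply/reverse check (paths and file contents are
# hypothetical; a scratch repo stands in for llama.cpp).
set -e
work=$(mktemp -d)
cd "$work"
git init -q repo
cd repo
printf '#define GGML_SYCL_DISABLE_GRAPH 1\n' > config.h
git add config.h
git -c user.email=ci@example.com -c user.name=ci commit -qm "add config"

# Produce a patch by editing the tracked file, then restore it.
printf '#define GGML_SYCL_DISABLE_GRAPH 0\n' > config.h
git diff > ../0001.patch
git checkout -q -- config.h

git apply --check ../0001.patch    && echo "applies cleanly"   # dry run
git apply ../0001.patch                                        # apply
git apply --check -R ../0001.patch && echo "reverses cleanly"  # dry run, reversed
git apply -R ../0001.patch                                     # reverse
```

`git apply --check` validates without touching the working tree, which is exactly what a `--dry-run`/`--status` mode needs; `-R` (`--reverse`) covers clean removal.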
Phase 1 — SYCL Synchronization
0001-enable-sycl-graph-by-default.patch
Changes GGML_SYCL_DISABLE_GRAPH default from 1 (disabled) to 0 (enabled).
What it does
- Enables SYCL graph execution for single-GPU dense LLM inference
- Enables async memory operations (tied to graph support in upstream code)
- Eliminates 8 blocking `.wait()` calls in reorder functions (Q4_0, Q8_0, Q4_K, Q6_K)
What it does NOT affect
- MoE models (MUL_MAT_ID) — `check_graph_compatibility()` auto-disables graphs
- CONCAT operations — auto-disabled
- Multi-GPU setups — always disabled
- Users can override with `GGML_SYCL_DISABLE_GRAPH=1`
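The toggle's semantics after the patch can be sketched in plain shell (this models the values only, not llama.cpp code; `state` is a hypothetical helper):

```shell
# Models the patched default: unset or 0 -> graphs enabled; 1 -> disabled.
state() {
    # ${1:-0} mirrors the new fallback of 0 (the old default was 1).
    if [ "${1:-0}" = "1" ]; then
        echo "SYCL graphs: disabled"
    else
        echo "SYCL graphs: enabled"
    fi
}

state "${GGML_SYCL_DISABLE_GRAPH:-}"   # unset -> "SYCL graphs: enabled"
state 1                                # opt-out -> "SYCL graphs: disabled"
```

In other words, users who want the pre-patch behavior export `GGML_SYCL_DISABLE_GRAPH=1`; everyone else gets graphs by default.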
Expected impact
10-30% token generation speedup on single-GPU dense LLM inference.
Testing checklist
- Build succeeds with `-DGGML_SYCL=ON`
- `GGML_SYCL_DEBUG=1` shows "SYCL-GRAPH" messages for dense models
- Dense model inference produces correct output
- MoE model falls back gracefully (logs "disabling SYCL graphs")
- `GGML_SYCL_DISABLE_GRAPH=1` restores old behavior