Commit Graph

11 Commits

sleepy f3caef26f7 Add cross-platform bandwidth utilization comparison and research data
Key findings:
- SYCL reaches only 29% BW utilization on Q4_0 and 54% on Q8_0 (arithmetic sketched after this entry)
- Vulkan on same hardware gets 47-75% utilization
- CUDA on RTX 3060 gets ~72% utilization
- The bottleneck is SYCL's 1-op-at-a-time submission model, not kernel params
- Our patches are neutral because they don't address the submission bottleneck
- Phase 1 (graph) crashes MoE but could be key to fixing submission if the async_malloc bug is resolved
2026-04-15 17:54:03 +02:00
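
A minimal sketch of the arithmetic behind utilization figures like these: decode is weight-streaming bound, so achieved bandwidth is roughly weight bytes read per token times tokens per second. The model size and token rate below are illustrative assumptions, not measurements from this repo; 560 GB/s is the Arc A770's published peak memory bandwidth.

```cpp
// Hedged sketch: estimating bandwidth utilization from decode throughput.
// model_bytes and tokens_per_s are illustrative assumptions.
#include <cstdio>

int main() {
    const double model_bytes  = 5.0e9;   // assume ~5 GB of Q4_0 weights
    const double tokens_per_s = 35.0;    // assume measured decode rate
    const double peak_bw      = 560.0e9; // Arc A770 peak GDDR6 bandwidth

    // Every decoded token streams ~all weights once.
    const double achieved = model_bytes * tokens_per_s;
    std::printf("achieved ~%.0f GB/s -> %.0f%% of peak\n",
                achieved / 1e9, 100.0 * achieved / peak_bw);
    // 5 GB x 35 t/s = 175 GB/s, ~31% of 560 GB/s: the shape of the Q4_0 row.
}
```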
sleepy 105f1348dc Update patch README with clean benchmark results and Phase 1 crash finding
- Full A/B test on Qwen3.5-9B Q4_0 and Q8_0 across all phases
- All patches neutral on 9B dense (within ~1 t/s noise)
- Phase 1 (SYCL graph) crashes 35B MoE with async_malloc in opt_for_reorder
- Decision 1 (graph default) overturned by empirical evidence
- Baseline DISABLE_GRAPH=1 was correct for Arc A770
2026-04-15 17:40:58 +02:00
sleepy 1c374e1262 feat: phase 4 host-copy fix + docker build script + test machine docs
Phase 4: Remove blanket Linux host-buffer double-copy in set_tensor.
The #ifndef _WIN32 guard penalized all Linux Intel GPUs with an extra
malloc/memcpy/free per tensor load for a PVC-only bug. Now opt-in via
GGML_SYCL_MMAP_WORKAROUND=1 (the staging pattern is sketched after this entry).

Also adds:
- docker-build-test.sh for local amd64 SYCL build verification
- test-machine-megumin.md with hardware/software env and test procedures
- Updated apply-phase.sh to support phase 4
- Updated workplan with corrected council composition (GLM/Minimax/Kimi)
2026-04-15 15:35:29 +02:00
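
A minimal C++/SYCL sketch of the staging pattern Phase 4 makes opt-in. The env var name comes from the commit message; the body of set_tensor here is illustrative, not the actual llama.cpp code.

```cpp
// Hedged sketch: host staging copy in set_tensor, opt-in via env var.
#include <cstdlib>
#include <cstring>
#include <sycl/sycl.hpp>

static bool use_mmap_workaround() {
    const char *v = std::getenv("GGML_SYCL_MMAP_WORKAROUND");
    return v != nullptr && std::strcmp(v, "1") == 0; // opt-in, was Linux default
}

static void set_tensor(sycl::queue &q, void *dst, const void *src, size_t n) {
    if (use_mmap_workaround()) {
        // PVC-era workaround: stage through a pageable host buffer so the
        // runtime never reads directly from an mmap'd file mapping.
        void *staging = std::malloc(n);
        std::memcpy(staging, src, n);
        q.memcpy(dst, staging, n).wait();
        std::free(staging);
    } else {
        q.memcpy(dst, src, n).wait(); // direct copy: one transfer, no staging
    }
}
```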
sleepy 6ad84d543c feat: phased patch system for Intel Arc GPU performance fixes
3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzed Intel Arc GPU
performance issues and produced patches for llama.cpp:

Phase 1 - SYCL Sync: Enable graph execution by default (GGML_SYCL_DISABLE_GRAPH; toggle semantics sketched after this entry)
Phase 2 - SYCL Kernel: Fix VER_GEN12/13 thresholds, tune DMMV_X/MMV_Y
Phase 3 - Vulkan Intel: Arc 140T device-ID Xe2 override

Includes:
- Phased apply script (apply-phase.sh [1|2|3|all])
- Master apply script with --status/--reverse/--dry-run
- Per-phase READMEs with testing checklists
- Council deliberation logs (gitignored in logs/)

Verified: all patches apply/reverse cleanly via git apply.
Static verification: VER_GEN arithmetic and DMMV_X divisibility pass.
2026-04-15 14:53:40 +02:00
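
A minimal sketch of the Phase 1 default flip, assuming conventional env-var gating: GGML_SYCL_DISABLE_GRAPH is the variable named in the commit, but the logic below is illustrative, not the actual llama.cpp code.

```cpp
// Hedged sketch: opt-out semantics for graph execution after Phase 1.
#include <cstdio>
#include <cstdlib>
#include <cstring>

static bool sycl_graphs_enabled() {
    const char *v = std::getenv("GGML_SYCL_DISABLE_GRAPH");
    // Graphs run unless the user explicitly sets the var to "1";
    // the pre-patch behavior was effectively the opposite default.
    return v == nullptr || std::strcmp(v, "1") != 0;
}

int main() {
    std::printf("graph execution: %s\n", sycl_graphs_enabled() ? "on" : "off");
}
```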
sleepy ee85cce1b8 Merge pull request #2 from alex4o/sycl-optimization-analysis
Add SYCL optimization analysis: why it's slow and how to fix it
2026-04-15 13:49:02 +02:00
sleepy 42537b9ee6 Add cross-verified synthesis overview and link from README 2026-04-15 13:48:04 +02:00
Alexandar Bonin d94619b8bf Add SYCL optimization analysis from hands-on debugging sessions
Root cause analysis of why the SYCL backend underperforms on Arc GPUs,
derived from actual debugging sessions comparing Arc A770 SYCL vs
RX 580 Vulkan on llama.cpp.

Key findings:
- SYCL submits one op at a time with an OS-level .wait() vs Vulkan's
  batched 100-op submission with spin-wait (~30-50% of the gap; the
  contrast is sketched after this entry)
- Memory transfers double-buffered through host as PVC/Arc workaround
- SYCL graph execution disabled by default even when compiled in
- Code is DPCT-translated CUDA with hardware-tuning stubs that were never filled in
- Arc A770 classified as OTHER in Vulkan (coopmat disabled)
- Kernel dispatch defaults not tuned for Arc architecture

Includes prioritized improvement roadmap.
2026-04-15 14:45:31 +03:00
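
A minimal SYCL sketch of the two submission models described above; this is illustrative, not llama.cpp code.

```cpp
// Hedged sketch: per-op blocking submission vs batched submission.
#include <sycl/sycl.hpp>
#include <functional>
#include <vector>

using Op = std::function<void(sycl::handler &)>;

// Today's pattern: one kernel per submit, then a blocking host wait that
// round-trips through the OS scheduler for every graph node.
void run_one_at_a_time(sycl::queue &q, const std::vector<Op> &ops) {
    for (const auto &op : ops) {
        q.submit(op).wait();
    }
}

// Vulkan-style batching: enqueue the whole graph, synchronize once.
// Assumes q was built with sycl::property::queue::in_order{} so ordering
// between dependent ops is preserved without per-op waits.
void run_batched(sycl::queue &q, const std::vector<Op> &ops) {
    for (const auto &op : ops) {
        q.submit(op); // no host sync; the driver can pipeline submissions
    }
    q.wait(); // single host synchronization for the batch
}
```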
sleepy eb44831e4c Merge pull request #1 from alex4o/add-empirical-findings
Add empirical findings from Arc A770 + RX 580 testing
2026-04-15 13:25:47 +02:00
Alexandar Bonin f179611a6f Add empirical findings from Arc A770 + RX 580 testing
Real-world benchmarks, driver configurations, and working/broken
matrix from hands-on llama.cpp testing with Qwen3.5-35B-A3B MoE
on an Arc A770 (SYCL) + RX 580 (Vulkan) dual-GPU setup.

Key findings: the xe driver is mandatory (i915 hangs), Vulkan compute is
broken on Arc, RX 580 Vulkan beats Arc SYCL with --cpu-moe, and
generation is DDR4 bandwidth-bound at ~20 t/s (a ceiling check is
sketched after this entry).
2026-04-15 14:22:54 +03:00
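
A rough ceiling check for the bandwidth-bound claim, with every input an assumption rather than a number from this repo: ~3B active parameters (the "A3B" suffix), Q4_0 at ~0.56 bytes per parameter, and dual-channel DDR4-3200 at ~51 GB/s.

```cpp
// Hedged ceiling check: with --cpu-moe the active expert weights stream
// from system RAM once per token, so RAM bandwidth caps decode speed.
#include <cstdio>

int main() {
    const double active_bytes = 3.0e9 * 0.5625; // assumed ~3B params at Q4_0
    const double ddr4_bw      = 51.2e9;         // assumed dual-channel DDR4-3200
    std::printf("decode ceiling: ~%.0f t/s\n", ddr4_bw / active_bytes);
    // ~30 t/s ceiling; the observed ~20 t/s is consistent with a RAM-BW bound.
}
```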
sleepy ef614682bc Add upstream repos as git submodules (shallow clones)
llama.cpp, ipex-llm, intel-extension-for-pytorch, compute-runtime,
intel-graphics-compiler, oneDNN, vllm, vllm-xpu-kernels, level-zero,
llvm (sycl branch), openvino, sycl-tla
2026-04-15 13:19:15 +02:00
sleepy 8c6d377f74 Initial commit: Intel Arc GPU LLM inference diagnosis research 2026-04-15 13:07:03 +02:00