6 Commits

Author SHA1 Message Date
Pi Agent 48bbb14a64 Phase 5: Q4_0 vdr_mmvq 2→4 (+19% tg128, 30→36 t/s)
Root cause analysis corrected: Q4_0's low BW utilization is NOT due to
SYCL submission-model overhead. Per-op profiling shows the bottleneck
is dp4a compute throughput: nibble extraction requires 2 dp4a per byte
vs Q8_0's 1, making Q4_0 compute-bound at ~30 t/s.
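
For intuition, a scalar model of the dp4a accounting (a minimal C++
sketch, not the SYCL kernel; the real kernel also folds Q4_0's -8
offset in via the Q8_1 block sums, omitted here):

    #include <cstdint>

    // Scalar stand-in for the hardware dp4a: dot 4 packed int8 lanes.
    static int32_t dp4a(uint32_t a, uint32_t b, int32_t acc) {
        for (int i = 0; i < 4; ++i) {
            acc += int8_t(a >> (8 * i)) * int8_t(b >> (8 * i));
        }
        return acc;
    }

    // Q8_0: one dp4a covers one 32-bit word of quants (4 byte weights).
    int32_t dot_q8_word(uint32_t q8, uint32_t act) {
        return dp4a(q8, act, 0);
    }

    // Q4_0: one 32-bit word holds 8 nibble weights, so it must be
    // unpacked into low- and high-nibble words and dotted twice.
    int32_t dot_q4_word(uint32_t q4, uint32_t act_lo, uint32_t act_hi) {
        const uint32_t lo = q4 & 0x0F0F0F0F;        // low-nibble weights
        const uint32_t hi = (q4 >> 4) & 0x0F0F0F0F; // high-nibble weights
        return dp4a(hi, act_hi, dp4a(lo, act_lo, 0));
    }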

Fix: Increase vdr_mmvq from 2 to 4 on the Q4_0 reorder path, processing
16 blocks per subgroup instead of 8. This better amortizes the per-block
dp4a overhead.
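
Where the 8 -> 16 figure comes from (illustrative arithmetic, assuming
the 16-wide subgroup those numbers imply; a q4_0 block packs 32 weights
into 4 quant-ints):

    // Not the ggml-sycl source: each work-item consumes `vdr` quant-ints
    // per iteration, and a q4_0 block holds 4 of them.
    constexpr int blocks_per_subgroup(int sg_size, int vdr) {
        return sg_size * vdr / 4;
    }
    static_assert(blocks_per_subgroup(16, 2) == 8,  "baseline vdr_mmvq = 2");
    static_assert(blocks_per_subgroup(16, 4) == 16, "patched vdr_mmvq = 4");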

Benchmark (Qwen3.5-9B, Arc A770, llama-bench -p 512 -n 128 -r 3):
  Q4_0:  30.16 → 35.96 t/s (+19%)
  Q8_0:  29.96 → 30.82 t/s (within noise)
  Q4_K_M: 24.65 → 25.32 t/s (within noise)

Also includes:
- Timing instrumentation patch (for debugging, not applied to source)
- Updated decisions log (Decisions 8-9)
- Updated workplan with revised benchmark data
- Root cause analysis document
2026-04-15 19:55:44 +03:00
sleepy f3caef26f7 Add cross-platform bandwidth utilization comparison and research data
Key findings:
- SYCL achieves only 29% BW utilization on Q4_0, 54% on Q8_0 (back-of-envelope after this list)
- Vulkan on the same hardware gets 47-75% utilization
- CUDA on RTX 3060 gets ~72% utilization
- The bottleneck is SYCL's 1-op-at-a-time submission model, not kernel params
- Our patches are neutral because they don't address the submission bottleneck
- Phase 1 (graph) crashes MoE but could be key to fixing the submission bottleneck if the async_malloc bug is resolved
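
Back-of-envelope behind the utilization figures (assumptions: ~5.1 GB
of Q4_0 weights for a 9B model, 560 GB/s peak A770 bandwidth, and that
decode streams the full weights once per token):

    #include <cstdio>

    int main() {
        const double weight_gb = 5.1;   // assumed Q4_0 weight size, 9B model
        const double peak_gbps = 560.0; // Arc A770 theoretical peak bandwidth
        const double tok_s     = 30.16; // Q4_0 tg128 from the benchmark logs
        const double eff = weight_gb * tok_s; // GB streamed per second
        std::printf("%.0f GB/s effective, %.0f%% of peak\n",
                    eff, 100.0 * eff / peak_gbps);
        return 0; // ~154 GB/s, ~27% of peak: in line with the 29% above
    }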
2026-04-15 17:54:03 +02:00
sleepy 105f1348dc Update patch README with clean benchmark results and Phase 1 crash finding
- Full A/B test on Qwen3.5-9B Q4_0 and Q8_0 across all phases
- All patches neutral on 9B dense (within ~1 t/s noise)
- Phase 1 (SYCL graph) crashes 35B MoE with async_malloc in opt_for_reorder
- Decision 1 (graph default) overturned by empirical evidence
- Baseline DISABLE_GRAPH=1 was correct for Arc A770
2026-04-15 17:40:58 +02:00
sleepy 1c374e1262 feat: phase 4 host-copy fix + docker build script + test machine docs
Phase 4: Remove the blanket Linux host-buffer double-copy in set_tensor.
The #ifndef _WIN32 guard penalized all Linux Intel GPUs with an extra
malloc/memcpy/free per tensor load to work around a PVC-only bug. The
workaround is now opt-in via GGML_SYCL_MMAP_WORKAROUND=1.
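
The shape of the fix, as a hedged sketch (hypothetical code; the actual
ggml-sycl set_tensor uploads through a SYCL queue rather than this
stand-in):

    #include <cstdlib>
    #include <cstring>

    // Stand-in for the queue.memcpy(...).wait() upload to device memory.
    static void device_copy(void * dst, const void * src, size_t n) {
        std::memcpy(dst, src, n);
    }

    // Opt-in gate replacing the old blanket `#ifndef _WIN32` guard.
    static bool mmap_workaround() {
        const char * v = std::getenv("GGML_SYCL_MMAP_WORKAROUND");
        return v && v[0] == '1';
    }

    void set_tensor(void * dst, const void * src, size_t n) {
        if (mmap_workaround()) {
            // PVC workaround: bounce through a heap buffer first (the
            // extra malloc/memcpy/free per tensor load described above).
            void * bounce = std::malloc(n);
            std::memcpy(bounce, src, n);
            device_copy(dst, bounce, n);
            std::free(bounce);
        } else {
            device_copy(dst, src, n); // upload straight from the mmap'd file
        }
    }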

Also adds:
- docker-build-test.sh for local amd64 SYCL build verification
- test-machine-megumin.md with hardware/software env and test procedures
- Updated apply-phase.sh to support phase 4
- Updated workplan with corrected council composition (GLM/Minimax/Kimi)
2026-04-15 15:35:29 +02:00
sleepy 6ad84d543c feat: phased patch system for Intel Arc GPU performance fixes
A 3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzed Intel Arc
GPU performance issues and produced patches for llama.cpp:

Phase 1 - SYCL Sync: Enable graph execution by default (GGML_SYCL_DISABLE_GRAPH; sketch after this list)
Phase 2 - SYCL Kernel: Fix VER_GEN12/13 thresholds, tune DMMV_X/MMV_Y
Phase 3 - Vulkan Intel: Arc 140T device-ID Xe2 override
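
Phase 1's default flip, as a minimal sketch (the env var is real; the
helper name and shape here are assumed):

    #include <cstdlib>

    // Patched default: graph execution runs unless the user explicitly
    // sets GGML_SYCL_DISABLE_GRAPH=1.
    static bool sycl_graph_enabled() {
        const char * v = std::getenv("GGML_SYCL_DISABLE_GRAPH");
        return !(v && v[0] == '1');
    }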

Includes:
- Phased apply script (apply-phase.sh [1|2|3|all])
- Master apply script with --status/--reverse/--dry-run
- Per-phase READMEs with testing checklists
- Council deliberation logs (gitignored in logs/)

Verified: all patches apply/reverse cleanly via git apply.
Static verification: VER_GEN arithmetic and DMMV_X divisibility pass.
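
The divisibility check amounts to compile-time asserts of this shape
(illustrative: upstream defaults are GGML_SYCL_DMMV_X = 32 and
GGML_SYCL_MMV_Y = 1; example tensor widths, not the patched values):

    constexpr int DMMV_X = 32;
    static_assert(4096 % DMMV_X == 0,  "hidden size must divide by DMMV_X");
    static_assert(11008 % DMMV_X == 0, "FFN width must divide by DMMV_X");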
2026-04-15 14:53:40 +02:00
sleepy ef614682bc Add upstream repos as git submodules (shallow clones; example .gitmodules entry below)
llama.cpp, ipex-llm, intel-extension-for-pytorch, compute-runtime,
intel-graphics-compiler, oneDNN, vllm, vllm-xpu-kernels, level-zero,
llvm (sycl branch), openvino, sycl-tla
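
A representative .gitmodules entry for the shallow setup (path and URL
shape assumed, not copied from the repo); git submodule update --init
honors shallow = true by default and clones at depth 1:

    [submodule "upstream/llama.cpp"]
        path = upstream/llama.cpp
        url = https://github.com/ggml-org/llama.cpp
        shallow = true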
2026-04-15 13:19:15 +02:00