Commit Graph

11 Commits

sleepy f3caef26f7 Add cross-platform bandwidth utilization comparison and research data
Key findings:
- SYCL reaches only 29% BW utilization on Q4_0 and 54% on Q8_0 (arithmetic sketched after this entry)
- Vulkan on same hardware gets 47-75% utilization
- CUDA on RTX 3060 gets ~72% utilization
- The bottleneck is SYCL's 1-op-at-a-time submission model, not kernel params
- Our patches are neutral because they don't address the submission bottleneck
- Phase 1 (graph) crashes MoE but could be key to fixing submission if the async_malloc bug is resolved
2026-04-15 17:54:03 +02:00
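
A minimal sketch of the arithmetic behind utilization figures like these: decode is weight-streaming bound, so achieved bandwidth is roughly weight bytes read per token times tokens per second. The model size and token rate below are illustrative assumptions, not measurements from this repo; 560 GB/s is the Arc A770's published peak memory bandwidth.

```cpp
// Hedged sketch: estimating bandwidth utilization from decode throughput.
// model_bytes and tokens_per_s are illustrative assumptions.
#include <cstdio>

int main() {
    const double model_bytes  = 5.0e9;   // assume ~5 GB of Q4_0 weights
    const double tokens_per_s = 35.0;    // assume measured decode rate
    const double peak_bw      = 560.0e9; // Arc A770 peak GDDR6 bandwidth

    // Every decoded token streams ~all weights once.
    const double achieved = model_bytes * tokens_per_s;
    std::printf("achieved ~%.0f GB/s -> %.0f%% of peak\n",
                achieved / 1e9, 100.0 * achieved / peak_bw);
    // 5 GB x 35 t/s = 175 GB/s, ~31% of 560 GB/s: the shape of the Q4_0 row.
}
```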
sleepy 105f1348dc Update patch README with clean benchmark results and Phase 1 crash finding
- Full A/B test on Qwen3.5-9B Q4_0 and Q8_0 across all phases
- All patches neutral on 9B dense (within ~1 t/s noise)
- Phase 1 (SYCL graph) crashes 35B MoE with async_malloc in opt_for_reorder
- Decision 1 (graph default) overturned by empirical evidence
- Baseline DISABLE_GRAPH=1 was correct for Arc A770
2026-04-15 17:40:58 +02:00
sleepy 1c374e1262 feat: phase 4 host-copy fix + docker build script + test machine docs
Phase 4: Remove blanket Linux host-buffer double-copy in set_tensor.
The #ifndef _WIN32 guard penalized all Linux Intel GPUs with an extra
malloc/memcpy/free per tensor load for a PVC-only bug. Now opt-in via
GGML_SYCL_MMAP_WORKAROUND=1 (the staging pattern is sketched after this entry).

Also adds:
- docker-build-test.sh for local amd64 SYCL build verification
- test-machine-megumin.md with hardware/software env and test procedures
- Updated apply-phase.sh to support phase 4
- Updated workplan with corrected council composition (GLM/Minimax/Kimi)
2026-04-15 15:35:29 +02:00
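
A minimal C++/SYCL sketch of the staging pattern Phase 4 makes opt-in. The env var name comes from the commit message; the body of set_tensor here is illustrative, not the actual llama.cpp code.

```cpp
// Hedged sketch: host staging copy in set_tensor, opt-in via env var.
#include <cstdlib>
#include <cstring>
#include <sycl/sycl.hpp>

static bool use_mmap_workaround() {
    const char *v = std::getenv("GGML_SYCL_MMAP_WORKAROUND");
    return v != nullptr && std::strcmp(v, "1") == 0; // opt-in, was Linux default
}

static void set_tensor(sycl::queue &q, void *dst, const void *src, size_t n) {
    if (use_mmap_workaround()) {
        // PVC-era workaround: stage through a pageable host buffer so the
        // runtime never reads directly from an mmap'd file mapping.
        void *staging = std::malloc(n);
        std::memcpy(staging, src, n);
        q.memcpy(dst, staging, n).wait();
        std::free(staging);
    } else {
        q.memcpy(dst, src, n).wait(); // direct copy: one transfer, no staging
    }
}
```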
sleepy 6ad84d543c feat: phased patch system for Intel Arc GPU performance fixes
3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzed Intel Arc GPU
performance issues and produced patches for llama.cpp:

Phase 1 - SYCL Sync: Enable graph execution by default (GGML_SYCL_DISABLE_GRAPH; toggle semantics sketched after this entry)
Phase 2 - SYCL Kernel: Fix VER_GEN12/13 thresholds, tune DMMV_X/MMV_Y
Phase 3 - Vulkan Intel: Arc 140T device-ID Xe2 override

Includes:
- Phased apply script (apply-phase.sh [1|2|3|all])
- Master apply script with --status/--reverse/--dry-run
- Per-phase READMEs with testing checklists
- Council deliberation logs (gitignored in logs/)

Verified: all patches apply/reverse cleanly via git apply.
Static verification: VER_GEN arithmetic and DMMV_X divisibility pass.
2026-04-15 14:53:40 +02:00
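
A minimal sketch of the Phase 1 default flip, assuming conventional env-var gating: GGML_SYCL_DISABLE_GRAPH is the variable named in the commit, but the logic below is illustrative, not the actual llama.cpp code.

```cpp
// Hedged sketch: opt-out semantics for graph execution after Phase 1.
#include <cstdio>
#include <cstdlib>
#include <cstring>

static bool sycl_graphs_enabled() {
    const char *v = std::getenv("GGML_SYCL_DISABLE_GRAPH");
    // Graphs run unless the user explicitly sets the var to "1";
    // the pre-patch behavior was effectively the opposite default.
    return v == nullptr || std::strcmp(v, "1") != 0;
}

int main() {
    std::printf("graph execution: %s\n", sycl_graphs_enabled() ? "on" : "off");
}
```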
sleepy ee85cce1b8 Merge pull request #2 from alex4o/sycl-optimization-analysis
Add SYCL optimization analysis: why it's slow and how to fix it
2026-04-15 13:49:02 +02:00
sleepy 42537b9ee6 Add cross-verified synthesis overview and link from README 2026-04-15 13:48:04 +02:00
Alexandar Bonin d94619b8bf Add SYCL optimization analysis from hands-on debugging sessions
Root cause analysis of why the SYCL backend underperforms on Arc GPUs,
derived from actual debugging sessions comparing Arc A770 SYCL vs
RX 580 Vulkan on llama.cpp.

Key findings:
- SYCL submits one op at a time with an OS-level .wait() vs Vulkan's
  batched 100-op submission with spin-wait (~30-50% of the gap; the
  contrast is sketched after this entry)
- Memory transfers double-buffered through host as PVC/Arc workaround
- SYCL graph execution disabled by default even when compiled in
- Code is DPCT-translated CUDA with hardware-tuning stubs that were never filled in
- Arc A770 classified as OTHER in Vulkan (coopmat disabled)
- Kernel dispatch defaults not tuned for Arc architecture

Includes prioritized improvement roadmap.
2026-04-15 14:45:31 +03:00
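
A minimal SYCL sketch of the two submission models described above; this is illustrative, not llama.cpp code.

```cpp
// Hedged sketch: per-op blocking submission vs batched submission.
#include <sycl/sycl.hpp>
#include <functional>
#include <vector>

using Op = std::function<void(sycl::handler &)>;

// Today's pattern: one kernel per submit, then a blocking host wait that
// round-trips through the OS scheduler for every graph node.
void run_one_at_a_time(sycl::queue &q, const std::vector<Op> &ops) {
    for (const auto &op : ops) {
        q.submit(op).wait();
    }
}

// Vulkan-style batching: enqueue the whole graph, synchronize once.
// Assumes q was built with sycl::property::queue::in_order{} so ordering
// between dependent ops is preserved without per-op waits.
void run_batched(sycl::queue &q, const std::vector<Op> &ops) {
    for (const auto &op : ops) {
        q.submit(op); // no host sync; the driver can pipeline submissions
    }
    q.wait(); // single host synchronization for the batch
}
```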
sleepy eb44831e4c Merge pull request #1 from alex4o/add-empirical-findings
Add empirical findings from Arc A770 + RX 580 testing
2026-04-15 13:25:47 +02:00
Alexandar Bonin f179611a6f Add empirical findings from Arc A770 + RX 580 testing
Real-world benchmarks, driver configurations, and working/broken
matrix from hands-on llama.cpp testing with Qwen3.5-35B-A3B MoE
on an Arc A770 (SYCL) + RX 580 (Vulkan) dual-GPU setup.

Key findings: the xe driver is mandatory (i915 hangs), Vulkan compute is
broken on Arc, RX 580 Vulkan beats Arc SYCL with --cpu-moe, and
generation is DDR4 bandwidth-bound at ~20 t/s (a ceiling check is
sketched after this entry).
2026-04-15 14:22:54 +03:00
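
A rough ceiling check for the bandwidth-bound claim, with every input an assumption rather than a number from this repo: ~3B active parameters (the "A3B" suffix), Q4_0 at ~0.56 bytes per parameter, and dual-channel DDR4-3200 at ~51 GB/s.

```cpp
// Hedged ceiling check: with --cpu-moe the active expert weights stream
// from system RAM once per token, so RAM bandwidth caps decode speed.
#include <cstdio>

int main() {
    const double active_bytes = 3.0e9 * 0.5625; // assumed ~3B params at Q4_0
    const double ddr4_bw      = 51.2e9;         // assumed dual-channel DDR4-3200
    std::printf("decode ceiling: ~%.0f t/s\n", ddr4_bw / active_bytes);
    // ~30 t/s ceiling; the observed ~20 t/s is consistent with a RAM-BW bound.
}
```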
sleepy ef614682bc Add upstream repos as git submodules (shallow clones)
llama.cpp, ipex-llm, intel-extension-for-pytorch, compute-runtime,
intel-graphics-compiler, oneDNN, vllm, vllm-xpu-kernels, level-zero,
llvm (sycl branch), openvino, sycl-tla
2026-04-15 13:19:15 +02:00
sleepy 8c6d377f74 Initial commit: Intel Arc GPU LLM inference diagnosis research 2026-04-15 13:07:03 +02:00