Root-cause analysis of why the llama.cpp SYCL backend underperforms on
Arc GPUs, based on debugging sessions comparing an Arc A770 (SYCL)
against an RX 580 (Vulkan).
Key findings:
- SYCL submits one op at a time and blocks on an OS-level .wait();
Vulkan batches 100 ops per submission and spin-waits (accounts for
~30-50% of the gap)
- Memory transfers double-buffered through host as PVC/Arc workaround
- SYCL graph execution disabled by default even when compiled in
- Code is DPCT-translated CUDA whose hardware-tuning stubs were never
filled in
- Vulkan backend classifies the Arc A770 as OTHER (coopmat disabled)
- Kernel dispatch defaults not tuned for Arc architecture
Includes a prioritized improvement roadmap.
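
The first finding can be illustrated with a toy cost model. This is
not llama.cpp code: it only counts how many times the host has to
synchronize with the device when it waits after every op versus once
per batch. The function name sync_points is hypothetical; the batch
size of 100 is the only figure taken from the finding above.

```cpp
#include <cassert>
#include <cstddef>

// Toy model, not llama.cpp code: number of host-side waits needed to
// run n_ops when the host synchronizes once per batch of batch_size ops.
// batch_size == 1 models "submit one op, then wait".
std::size_t sync_points(std::size_t n_ops, std::size_t batch_size) {
    assert(batch_size > 0);
    // Integer form of ceil(n_ops / batch_size).
    return (n_ops + batch_size - 1) / batch_size;
}
```

For a hypothetical 1000-op graph, sync_points(1000, 1) gives 1000
waits, while sync_points(1000, 100) gives 10. Each wait avoided also
avoids one round trip through the OS when the wait is a blocking call
rather than a spin-wait, which is where the per-op scheme loses time.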