1 Commit

Author SHA1 Message Date
Alexandar Bonin d94619b8bf Add SYCL optimization analysis from hands-on debugging sessions
Root cause analysis of why the SYCL backend underperforms on Arc GPUs,
derived from actual debugging sessions comparing Arc A770 SYCL vs
RX 580 Vulkan on llama.cpp.

Key findings:
- SYCL submits 1 op at a time with an OS-level .wait(), while Vulkan
  submits batches of 100 ops with a spin-wait (~30-50% of the
  performance gap)
- Memory transfers are double-buffered through the host as a PVC/Arc workaround
- SYCL graph execution disabled by default even when compiled in
- Code is DPCT-translated CUDA with hardware tuning stubs never filled in
- Arc A770 classified as OTHER in Vulkan (coopmat disabled)
- Kernel dispatch defaults not tuned for Arc architecture
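
The first finding can be illustrated with a back-of-the-envelope cost model.
This is only a sketch in plain C++: it does not call SYCL or Vulkan, and the
names CostModel, per_op_sync_us and batched_sync_us, as well as any numbers
fed into them, are hypothetical and not taken from the llama.cpp source or
from measurements. It assumes every host-side wait costs a fixed overhead on
top of kernel time.

```cpp
#include <cstddef>

// Hypothetical cost model: all values are illustrative, not measured.
struct CostModel {
    double kernel_us;  // assumed execution time of one op, in microseconds
    double sync_us;    // assumed overhead of one host-side wait, in microseconds
};

// One submission and one wait per op (the pattern described for SYCL).
double per_op_sync_us(const CostModel& m, std::size_t n_ops) {
    return static_cast<double>(n_ops) * (m.kernel_us + m.sync_us);
}

// Ops submitted in batches, one wait per batch (the pattern described for Vulkan).
double batched_sync_us(const CostModel& m, std::size_t n_ops, std::size_t batch) {
    // Round up so a final partial batch still pays for one wait.
    const std::size_t n_syncs = (n_ops + batch - 1) / batch;
    return static_cast<double>(n_ops) * m.kernel_us
         + static_cast<double>(n_syncs) * m.sync_us;
}
```

Under this model the per-op path pays the wait overhead n_ops times, while
the batched path pays it roughly n_ops / batch times, so the gap grows with
the cost of each wait relative to kernel time. A spin-wait would show up here
as a smaller sync_us than an OS-level wait.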

Includes prioritized improvement roadmap.
2026-04-15 14:45:31 +03:00