sleepy/intel-gpu-llm-diagnosis

Commit Graph

Author	SHA1	Message	Date
Alexandar Bonin

d94619b8bf

Add SYCL optimization analysis from hands-on debugging sessions

Root cause analysis of why the SYCL backend underperforms on Arc GPUs,
derived from actual debugging sessions comparing Arc A770 SYCL vs
RX 580 Vulkan on llama.cpp.

Key findings:
- SYCL submits 1 op at a time and blocks on an OS-level .wait(); Vulkan
  batches 100 ops per submission and spin-waits (an estimated ~30-50%
  of the performance gap)
- Memory transfers double-buffered through host as PVC/Arc workaround
- SYCL graph execution disabled by default even when compiled in
- Code is DPCT-translated CUDA with hardware tuning stubs never filled in
- Arc A770 classified as OTHER in Vulkan (coopmat disabled)
- Kernel dispatch defaults not tuned for Arc architecture
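
To make the first finding concrete, here is a toy model of the two
submission patterns. It is a sketch only, not llama.cpp or SYCL code:
the function names are hypothetical, and it just counts host-side
synchronization points, assuming one wait per op in the per-op pattern
and one wait per batch in the batched pattern.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical toy model (not llama.cpp code): counts how many times the
// host must synchronize with the device to run `n_ops` graph nodes.

// Per-op pattern: each op is submitted and then waited on individually,
// so the host synchronizes once per op.
std::size_t sync_points_per_op(std::size_t n_ops) {
    std::size_t syncs = 0;
    for (std::size_t i = 0; i < n_ops; ++i) {
        // submit op i, then block until it completes
        ++syncs;
    }
    return syncs;
}

// Batched pattern: ops are recorded in groups of `batch` and each group
// is submitted with a single wait, so the host synchronizes once per group.
std::size_t sync_points_batched(std::size_t n_ops, std::size_t batch) {
    assert(batch > 0 && "batch size must be positive");
    return (n_ops + batch - 1) / batch;  // ceil(n_ops / batch)
}
```

Under these assumptions, a 1000-op graph needs 1000 synchronization
points in the per-op pattern and 10 with batches of 100. The model says
nothing about the cost of each wait (OS-level vs spin-wait), which is a
separate factor in the finding above.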

Includes prioritized improvement roadmap.

2026-04-15 14:45:31 +03:00

1 Commit