Files

T

Alexandar Bonin f179611a6f Add empirical findings from Arc A770 + RX 580 testing

Real-world benchmarks, driver configurations, and working/broken
matrix from hands-on llama.cpp testing with Qwen3.5-35B-A3B MoE
on an Arc A770 (SYCL) + RX 580 (Vulkan) dual-GPU setup.

Key findings: xe driver mandatory (i915 hangs), Vulkan compute
broken on Arc, RX 580 Vulkan beats Arc SYCL with --cpu-moe,
generation is DDR4 bandwidth-bound at ~20 t/s.

2026-04-15 14:22:54 +03:00

6.3 KiB

Raw Permalink Blame History

Empirical Findings: Arc A770 + RX 580 on llama.cpp

Date: April 2026 Hardware: AMD Ryzen 7 5800X, 32GB DDR4, Intel Arc A770 16GB, AMD RX 580 4GB OS: CachyOS (Arch-based), kernel 6.19.10-1-cachyos Model: Qwen3.5-35B-A3B (MoE, 35B total / 3B active) — UD-IQ4_XS (17.5 GB)

1. Hardware & Driver Setup

Intel Arc A770

PCIe: 0b:00.0 (DG2, device 56a0)
Driver: xe kernel module — not i915
- i915 causes GPU hangs on compute workloads (ecode 12:1:85def5fb)
- Switched via modprobe: options i915 force_probe=!56a0 + options xe force_probe=56a0
Compute path: SYCL/Level Zero is the only reliable compute path
- Vulkan compute: BROKEN — DeviceLostError on all compute workloads, even on xe driver
- Vulkan display: works fine (desktop environment runs on this GPU)
oneAPI: basekit 2025.3.1-6, DPC++ 2025.3.2
- 2025.0.4 is too old (missing sycl/ext/oneapi/work_group_static.hpp)
- source /opt/intel/oneapi/setvars.sh required before cmake AND at runtime (SYCL backend crashes without it)

AMD RX 580

PCIe: 04:00.0 (Polaris10)
Driver: amdgpu + RADV (Mesa Vulkan)
Vulkan: works perfectly out of the box (needs vulkan-radeon package)
OpenCL: via rusticl (ROCm broken on kernel 6.19.x)

Multi-GPU OpenCL

Platform #0: Intel(R) OpenCL Graphics → Arc A770
Platform #1: rusticl → Mesa Arc A770 (DG2) + AMD RX 580 (polaris10)

DRI devices: card0 = RX 580 (04:00.0), card1 = Arc A770 (0b:00.0) Render nodes: renderD128 = Arc A770, renderD129 = RX 580

2. llama.cpp Build

Built with oneAPI SYCL + Vulkan support:

source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DGGML_VULKAN=ON \
  -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
  -DCMAKE_BUILD_TYPE=Release -DVulkan_INCLUDE_DIR=/usr/include \
  -DGGML_AVX2=ON -DGGML_AVX=ON -DGGML_FMA=ON -DGGML_F16C=ON -DGGML_SSE42=ON
cmake --build build --config Release -j$(nproc)

Critical: icx/icpx does NOT auto-detect SIMD with GGML_NATIVE — you must explicitly set AVX2/FMA/F16C/SSE42 flags.

3. Performance Benchmarks

Token generation (t/s) — Qwen3.5-35B-A3B UD-IQ4_XS

Config	Device	Prompt t/s	Gen t/s	Notes
`-ngl 38 --flash-attn on`	SYCL0 (Arc A770)	18.8	21.3	Best speed, fragile — crashes if ngl too high
`--cpu-moe -ngl 99`	Vulkan1 (RX 580)	42.6	20.3	Best reliable config
`--cpu-moe -ngl 99`	SYCL0 (Arc A770)	36.0	18.0	Intel SYCL slower than AMD Vulkan
`--cpu-moe -ngl 99` (no AVX2)	SYCL0 (Arc A770)	19.5	17.5	AVX2 only helps prompt, not gen
`-ngl 0 -ot ".*_exps=SYCL0"`	SYCL0	15.5	10.3	CPU-driven graph = slow
`--cpu-dense -ngl 99`	SYCL0 (Arc A770)	27.9	9.0	Inverse of cpu-moe, bad
Cross-GPU SYCL+Vulkan	Mixed	—	5.5	Garbage output, data corruption
Cross-GPU Vulkan+Vulkan	Mixed	—	—	DeviceLostError crash

Key observations

Generation is memory bandwidth bound — DDR4 ~50 GB/s is the wall for MoE experts on CPU, not GPU compute.
AVX2 gives 3x prompt speedup (19 → 57 t/s) but zero gen improvement — confirms bandwidth bottleneck.
Q4_K_XL vs IQ4_XS = identical gen speed — same bandwidth, different compute, same result.
Q8 KV cache (-ctk q8_0 -ctv q8_0) = no speed difference on RX 580.
RX 580 ($40 used) beats Arc A770 ($350) on SYCL with --cpu-moe — Intel's software stack is the bottleneck, not the hardware.
Cross-GPU tensor splitting between different backends is broken — corruption or crashes.

4. What Works, What Doesn't

Working configurations

Feature	Status
Arc A770 SYCL/Level Zero compute	Works
Arc A770 partial layer offload (`-ngl 38`)	Works
RX 580 Vulkan compute	Works perfectly
`--cpu-moe` with GPU for dense layers	Works, best approach for MoE
Flash attention on SYCL (no --cpu-moe)	Works
Multi-GPU OpenCL enumeration	Works

Broken

Feature	Failure mode
Arc A770 Vulkan compute	`DeviceLostError` GPU hangs (both i915 and xe)
i915 driver on Arc A770	GPU hangs on compute (ecode `12:1:85def5fb`)
Flash attention on SYCL + `--cpu-moe`	Crashes on 2nd prompt (`fattn-tile.hpp:1255 fatal error`)
Cross-GPU SYCL+Vulkan tensor split	Data corruption, garbage output
Cross-GPU Vulkan+Vulkan	`DeviceLostError` crash
`-ngl 99` on SYCL without --cpu-moe	OOM (model 16.3GB > 15.5GB usable VRAM)
SYCL without setvars.sh	Crashes at startup
ROCm on kernel 6.19.x	Broken, use rusticl instead

5. Qwen3.5 Model Quirks

--reasoning off required — otherwise generates infinite empty thinking tokens (500MB+ of newlines)
--flash-attn off needed when using --cpu-moe on SYCL (crashes on multi-turn)
--flash-attn on works on SYCL with partial layer offload (-ngl 38)

6. Recommended Configs

Reliable daily driver (RX 580 Vulkan)

ZES_ENABLE_SYSMAN=1 ./llama.cpp/build/bin/llama-cli \
  -m Qwen3.5-35B-A3B-UD-IQ4_XS.gguf \
  -ngl 99 --device Vulkan1 --cpu-moe \
  -c 2048 --reasoning off

20.3 t/s generation. Just works.

Maximum speed (Arc A770 SYCL, fragile)

source /opt/intel/oneapi/setvars.sh && ZES_ENABLE_SYSMAN=1 \
  ./llama.cpp/build/bin/llama-cli \
  -m Qwen3.5-35B-A3B-UD-IQ4_XS.gguf \
  -ngl 38 --device SYCL0 \
  -c 4000 --reasoning off --flash-attn on

21.3 t/s generation. Crashes if you push ngl too high.

7. Conclusions

The Arc A770 is not fundamentally slow hardware — 512 GB/s memory bandwidth should destroy a $40 RX 580 on bandwidth-bound workloads. The fact that it doesn't tells you everything about Intel's software stack maturity:

SYCL/Level Zero is the only working compute path on Linux. Vulkan compute is broken.
The xe driver is mandatory — i915 GPU hangs on compute workloads.
oneAPI version matters enormously — too old and the SYCL backend won't compile.
For MoE models, --cpu-moe is the only sane strategy — it keeps the massive expert tensors on CPU (bandwidth-bound anyway) and uses GPU for the smaller dense layers.
Intel needs to fix Vulkan compute on Arc — this is the biggest single blocker for mainstream llama.cpp users who don't want to deal with oneAPI.

6.3 KiB Raw Permalink Blame History