sleepy/intel-gpu-llm-diagnosis

sleepy 8c6d377f74 Initial commit: Intel Arc GPU LLM inference diagnosis research

2026-04-15 13:07:03 +02:00


Community Issues & Discourse Summary

Source: GitHub Issues, Discussions, Reddit (March-April 2026)

Critical Issues Filed

1. #21517 - Q8_0 4x Slower on Arc Pro B70

Reporter: PMZFX (April 6, 2026)
Status: Closed - PR #21527 submitted

Benchmark Data (Arc Pro B70, Qwen3.5-27B):

Quant	Token Gen (t/s)	BW Utilization
Q4_K_M	20.56	53%
Q8_0	4.88	21%

Key Findings:

Q8_0 falls back to the generic DMMV kernel (iter_stride=64)
Q4_0 reorder kernel uses iter_stride=512 (8x more work per loop iteration)
Driver updates don't help (Q8_0 performance unchanged from IGC 2.28.4 to 2.30.1)
Both SYCL and Vulkan are affected equally
Dual GPU doesn't help, which confirms a kernel-level issue
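
The stride gap in these findings can be made concrete with a small sketch. It assumes the kernel loop advances `iter_stride` columns per trip, which is one reading of the finding rather than something the issue spells out. The helper `loop_trips` and the 5120-column row width are hypothetical and chosen only for illustration.

```python
import math


def loop_trips(ncols: int, iter_stride: int) -> int:
    """Number of loop trips needed to cover one row of `ncols` values."""
    return math.ceil(ncols / iter_stride)


NCOLS = 5120  # hypothetical row width, not taken from the issue

generic_trips = loop_trips(NCOLS, 64)    # generic DMMV path that Q8_0 used
reorder_trips = loop_trips(NCOLS, 512)   # Q4_0 reorder path

print(generic_trips, reorder_trips, generic_trips / reorder_trips)
```

Under these assumptions the generic path needs eight times as many trips per row, which matches the 8x figure quoted above.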

Fix: PR #21527 adds Q8_0 to reorder framework. Validation showed 3.1x speedup (4.88 → 15.24 t/s).
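
As a quick check on the figures in this issue, the ratios can be recomputed from the reported numbers. The last value is an extrapolation, not a reported measurement: it assumes that, for a fixed model and quantization, bandwidth utilization scales linearly with tokens per second.

```python
# Figures reported in issue #21517 (Arc Pro B70, Qwen3.5-27B).
Q4_K_M_TPS = 20.56
Q8_0_TPS_BEFORE = 4.88
Q8_0_TPS_AFTER = 15.24
Q8_0_UTIL_BEFORE = 0.21

# How far Q8_0 trailed Q4_K_M before the fix.
gap = Q4_K_M_TPS / Q8_0_TPS_BEFORE

# Speedup delivered by the reorder kernel.
speedup = Q8_0_TPS_AFTER / Q8_0_TPS_BEFORE

# Assumption: same bytes read per token, so utilization scales with t/s.
implied_util_after = Q8_0_UTIL_BEFORE * speedup

print(f"gap before fix: {gap:.2f}x")
print(f"speedup from fix: {speedup:.2f}x")
print(f"implied Q8_0 utilization after fix: {implied_util_after:.0%}")
```

The gap of roughly 4.2x matches the "4x slower" title, and the speedup rounds to the 3.1x quoted for the PR.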

2. #12318 - K-Quant Crash on Xe2 iGPU

Reporter: lhl (November 3, 2024)
Status: Closed
Hardware: Lunar Lake Arc 140V

Sub-group size 8 is not supported on the device
Exception at ggml-sycl.cpp:3164

Reproduction: Q4_K_M crashes, Q4_0 works fine.

Workaround: Use upstream llama.cpp SYCL backend (slower but stable).
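
The error message suggests a kernel requested a sub-group size the device does not offer. A minimal sketch of a defensive selection step follows. The function `pick_subgroup_size` and the supported-size list for the Arc 140V are hypothetical and do not come from llama.cpp or the issue.

```python
def pick_subgroup_size(preferred: int, supported: list[int]) -> int:
    """Return `preferred` if the device supports it, else the smallest supported size."""
    if not supported:
        raise ValueError("device reports no supported sub-group sizes")
    return preferred if preferred in supported else min(supported)


# Assumption: sizes an Xe2 iGPU might report; not taken from the issue.
SUPPORTED_140V = [16, 32]

chosen = pick_subgroup_size(8, SUPPORTED_140V)
print(chosen)
```

Here a request for size 8 falls back to 16 instead of raising at kernel launch. Whether a given kernel is still correct at the fallback size is a separate question this sketch does not address.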

3. #20776 - Arc 140T Misdetection

Reporter: diegokolling (March 19, 2026)
Status: Open
Hardware: Arrow Lake H, Arc 140T (48GB shared)

Root Cause:

Driver reports minSubgroupSize = 8
Code requires minSubgroupSize == 16 for INTEL_XE2 classification
Same driver on Arc 140V reports minSubgroupSize = 16

Impact: Cooperative matrix completely disabled despite hardware support.
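
The root cause can be illustrated with a sketch of the described rule. This is not the actual detection code: `DeviceReport`, `classify_strict`, and the coupling of cooperative matrix support to the `INTEL_XE2` label are assumptions drawn from the summary above.

```python
from dataclasses import dataclass


@dataclass
class DeviceReport:
    name: str
    min_subgroup_size: int  # value the driver reports as minSubgroupSize


def classify_strict(dev: DeviceReport) -> str:
    """Mirror of the described rule: only minSubgroupSize == 16 counts as Xe2."""
    return "INTEL_XE2" if dev.min_subgroup_size == 16 else "OTHER"


arc_140v = DeviceReport("Arc 140V", 16)
arc_140t = DeviceReport("Arc 140T", 8)

for dev in (arc_140v, arc_140t):
    arch = classify_strict(dev)
    # Assumption: cooperative matrix is only enabled for INTEL_XE2.
    coopmat_enabled = arch == "INTEL_XE2"
    print(dev.name, arch, coopmat_enabled)
```

With the strict equality check, the Arc 140T falls through to the generic class even though the hardware is reported to support cooperative matrix.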

Key Discussions

#12570 - Arc Status for llama.cpp

Date: March 25-28, 2025
Participants: ky438, Rbiessy (Codeplay), NeoZhangJianyu

Key Quotes:

"tg should already be decent" - 0cc4m (llama.cpp collaborator)

"There are huge performance gaps between k-quant and legacy quant. Some quantizations like IQ4_NL reach only 14% of memory bandwidth utilization." - Community report

"For BMG, we don't promise to optimize it in time of the marketing." - NeoZhangJianyu

"If you want to see the best performance on Intel GPU, please try OpenVINO." - NeoZhangJianyu

Outcomes:

Acknowledged poor performance on k-quants
Planned work on mul_mat_vec_q kernel optimization
Discussion of DPAS instruction utilization
Note that community contributors work on this in spare time

#12805 - A750 User Experience

Date: April 7-9, 2025
User: codayon (Arch Linux, 8GB VRAM)

Findings:

Ubuntu Vulkan binary worked on Arch Linux
Q4_K_M slower than expected on 8GB card
Q4_0 recommended for better performance
IPEX-LLM provides better VRAM utilization
Setup complexity is a barrier to entry

Recommendations from community:

Use Qwen2.5-Coder-0.5B-Q8_0 for autocomplete (150+ t/s)
Use Qwen2.5-Coder-7B-Q4_0 for chat
Vulkan more stable than SYCL on Arch

Reddit Discourse

r/LocalLLaMA - "Intel Arc for LLMs?"

Key Comments:

"Not a lot of kernels for arc so many of the quantized models will be out of reach" (u/shakhal1)
An Arc A770 with 16GB runs models up to 24B at 4-6 bit quantization
oneAPI is less mature than CUDA, so expect compatibility issues

r/LocalLLaMA - "llama.cpp 3.1x Q8_0 speedup on Intel Arc GPUs"

Key Details:

PR submitted through collaboration between an AI agent and a user
Intel's closed-source IPEX-LLM was binary-patched to validate the solution
IPEX-LLM achieved 61% bandwidth utilization, confirming the problem is solvable in software

r/IntelArc - "Intel ARC for local LLMs"

User reports:

B580 setup issues ("unsupported" message)
Even dual A770 (32GB total) is not enough for a 30B model at FP16
No consumer Intel GPU has sufficient VRAM for large models
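
The FP16 claim follows from simple arithmetic on weight storage. The sketch below counts weights only and ignores KV cache, activations, and runtime overhead, so real requirements are higher. `weight_gb` and `fits` are hypothetical helper names, and 1 GB is taken as 10^9 bytes.

```python
def weight_gb(params_billions: float, bits_per_weight: float) -> float:
    """Approximate weight storage in GB (10^9 bytes), weights only."""
    return params_billions * bits_per_weight / 8


def fits(params_billions: float, bits_per_weight: float, vram_gb: float) -> bool:
    """True if the weights alone fit in the given VRAM budget."""
    return weight_gb(params_billions, bits_per_weight) <= vram_gb


need = weight_gb(30, 16)          # 30B parameters at FP16
print(need, fits(30, 16, 32))     # dual A770 = 32 GB total
```

A 30B model at FP16 needs about 60 GB for weights alone, nearly double the 32 GB a dual A770 setup provides.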

GitHub Issue #19887 - A770 Inverse Quantization Anomaly

On A770: Q8_0 is faster than Q4/Q6
On B70: Q8_0 is 4x slower than Q4

This points to a Xe2/Battlemage regression, which indicates:

Xe1 optimizations work
Xe2 memory architecture is different
Kernel tuning needed for new architecture

Performance Summary Table

Compiled from community benchmarks:

GPU	Backend	Q4_0 tg	Q4_K_M tg	Q8_0 tg	Notes
A770 (Xe1)	SYCL	~40 t/s	~25 t/s	~30 t/s	Q8_0 works well
A770 (Xe1)	Vulkan	~30 t/s	~20 t/s	~35 t/s	Good prompt processing
B580 (Xe2)	SYCL	~45 t/s	~20 t/s	~8 t/s	Q8_0 broken
B580 (Xe2)	Vulkan	~35 t/s	~18 t/s	~10 t/s	Better prompt perf
B70 (Xe2)	SYCL	~35 t/s	~20 t/s	~5 t/s	Q8_0 very slow
140V iGPU (Xe2)	SYCL	~23 t/s	N/A (crash)	N/A	K-quants broken
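
One way to read the table is to compare Q8_0 against Q4_0 on each row. The sketch below uses the approximate figures from the table, skips the 140V row because it has no Q8_0 value, and flags rows where Q8_0 runs at less than half the Q4_0 rate. The 0.5 threshold is an arbitrary choice for illustration.

```python
# (GPU, backend, Q4_0 t/s, Q8_0 t/s), approximate values from the table above.
ROWS = [
    ("A770 (Xe1)", "SYCL", 40, 30),
    ("A770 (Xe1)", "Vulkan", 30, 35),
    ("B580 (Xe2)", "SYCL", 45, 8),
    ("B580 (Xe2)", "Vulkan", 35, 10),
    ("B70 (Xe2)", "SYCL", 35, 5),
]

THRESHOLD = 0.5  # hypothetical cutoff for "Q8_0 is anomalously slow"


def q8_ratio(q4_0_tps: float, q8_0_tps: float) -> float:
    """Q8_0 throughput as a fraction of Q4_0 throughput."""
    return q8_0_tps / q4_0_tps


flagged = [
    (gpu, backend)
    for gpu, backend, q4, q8 in ROWS
    if q8_ratio(q4, q8) < THRESHOLD
]

for gpu, backend, q4, q8 in ROWS:
    print(f"{gpu:12s} {backend:7s} Q8_0/Q4_0 = {q8_ratio(q4, q8):.2f}")
print("flagged:", flagged)
```

With this cutoff every flagged row is an Xe2 part, while both A770 rows stay above the threshold.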

Community Complaints Summary

"30% of peak performance" - Users see far below hardware potential
"Instability with k-quants" - Some formats crash, others work
"Documentation chaos" - Multiple docs, Ubuntu-focused, Arch struggles
"IPEX-LLM is too slow but stable, llama.cpp is fast but broken" - No perfect option
"Driver updates don't fix issues" - Confirms software stack problem
"No Intel official contribution" - Community maintains in spare time

Last Updated: April 2026
