sleepy/intel-gpu-llm-diagnosis

sleepy 8c6d377f74 Initial commit: Intel Arc GPU LLM inference diagnosis research

2026-04-15 13:07:03 +02:00

Driver/Stack Misalignment Analysis

Overview

This document catalogs the specific code locations, design decisions, and architectural mismatches that cause Intel Arc GPUs to underperform on LLM inference.

1. llama.cpp SYCL Backend Misalignments

1.1 Kernel Dispatch Logic

File: repos/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp
Lines: ~3258-3660

Current Dispatch Algorithm:

mul_mat dispatch prefers:
1. MMVQ (reorder path) if src0 type in ggml_sycl_supports_reorder_mmvq()
2. DMMV (reorder path) if src0 type in ggml_sycl_supports_reorder_dmmv()
3. SYCL native matmul as fallback

Support Lists (lines ~3269-3300):

// Supports MMVQ reorder
ggml_sycl_supports_reorder_mmvq(): Q4_0, Q8_0, Q4_K, Q6_K

// Supports DMMV reorder  
ggml_sycl_supports_reorder_dmmv(): Q4_0, Q8_0 ONLY

// Supports SYCL matmul reorder
ggml_sycl_supports_reorder_mul_mat_sycl(): Q4_0, Q8_0, Q4_K*, Q6_K*
(* = only when g_ggml_sycl_prioritize_dmmv is false)

Problem: Q4_K and Q6_K support MMVQ reorder but not DMMV reorder. When conditions favor DMMV (for example GGML_SYCL_PRIORITIZE_DMMV=1), these quants fall through to the slow generic path.
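The fall-through can be modeled in a few lines. This is a simplified sketch, not the actual llama.cpp dispatcher: the enum names, `pick_path`, and the `dmmv_favored` flag are hypothetical stand-ins, and only the two support lists are taken from the summary above.

```cpp
// Simplified model of the dispatch described above; NOT the real llama.cpp code.
enum class QuantType { Q4_0, Q8_0, Q4_K, Q5_K, Q6_K };
enum class MulMatPath { MMVQ_REORDER, DMMV_REORDER, GENERIC };

// Support lists copied from the summary above.
inline bool supports_reorder_mmvq(QuantType t) {
    return t == QuantType::Q4_0 || t == QuantType::Q8_0 ||
           t == QuantType::Q4_K || t == QuantType::Q6_K;
}

inline bool supports_reorder_dmmv(QuantType t) {
    return t == QuantType::Q4_0 || t == QuantType::Q8_0;
}

// `dmmv_favored` stands in for whatever runtime conditions make the real
// dispatcher choose DMMV (for example GGML_SYCL_PRIORITIZE_DMMV=1).
inline MulMatPath pick_path(QuantType t, bool dmmv_favored) {
    if (!dmmv_favored && supports_reorder_mmvq(t)) {
        return MulMatPath::MMVQ_REORDER;
    }
    if (supports_reorder_dmmv(t)) {
        return MulMatPath::DMMV_REORDER;
    }
    return MulMatPath::GENERIC;  // non-reordered fallback
}
```

Under this model, `pick_path(QuantType::Q4_K, true)` returns `GENERIC` while `pick_path(QuantType::Q4_0, true)` returns `DMMV_REORDER`, which is the asymmetry the problem statement describes.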

1.2 DMMV Kernel iter_stride Problem

File: repos/llama.cpp/ggml/src/ggml-sycl/dmmv.cpp
Lines: ~975-1100 (dequantize_mul_mat_vec_q8_0_sycl)

Generic DMMV (used by Q8_0):

iter_stride = 2 * GGML_SYCL_DMMV_X = 64  // processes 2 values per iteration

Reorder DMMV (Q4_0 path):

iter_stride = 8 * 2 * GGML_SYCL_DMMV_X = 512  // processes 16 values per iteration

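Root Cause: Q8_0 runs on the generic path with its interleaved 34-byte blocks (2-byte scale plus 32 one-byte quants), while Q4_0's 18-byte blocks go through the reorder path with the 8x larger stride. Neither block size is a power of 2, so the gap likely comes from the missing Q8_0 reorder layout rather than from block size alone.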
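To make the stride difference concrete, the sketch below counts loop iterations per row for both strides. It assumes GGML_SYCL_DMMV_X = 32 (the value implied by 2 * GGML_SYCL_DMMV_X = 64 above); the helper name and the 4096-column row are illustrative only.

```cpp
// Assumed value, implied by "2 * GGML_SYCL_DMMV_X = 64" above.
constexpr int GGML_SYCL_DMMV_X = 32;

constexpr int generic_iter_stride = 2 * GGML_SYCL_DMMV_X;      // 64
constexpr int reorder_iter_stride = 8 * 2 * GGML_SYCL_DMMV_X;  // 512

// Loop iterations needed to cover one row of `ncols` values (rounded up).
constexpr int iterations_per_row(int ncols, int iter_stride) {
    return (ncols + iter_stride - 1) / iter_stride;
}

static_assert(generic_iter_stride == 64, "generic stride");
static_assert(reorder_iter_stride == 512, "reorder stride");
static_assert(iterations_per_row(4096, generic_iter_stride) == 64, "generic path");
static_assert(iterations_per_row(4096, reorder_iter_stride) == 8, "reorder path");
```

For a hypothetical 4096-column row, the generic path needs 64 iterations where the reorder path needs 8, so each generic iteration does one eighth of the work.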

1.3 Missing Q8_0 Reorder Implementation

File: repos/llama.cpp/ggml/src/ggml-sycl/mmvq.cpp
Lines: ~682-730

Q4_0 Reorder Kernel:

mul_mat_vec_q_reorder<reorder_vec_dot_q_sycl<GGML_TYPE_Q4_0>>

Q8_0 Reorder Kernel:

mul_mat_vec_q_reorder<reorder_vec_dot_q_sycl<GGML_TYPE_Q8_0>>

Note: PR #21527 adds Q8_0 to the reorder framework. Without it, Q8_0 falls back to the slow generic DMMV path.

1.4 Q4_K DMMV Reorder Gap

Problem: Q4_K has a reorder layout (reorder_qw_q4_k()), but the DMMV path doesn't use it.

Current State:

Q4_K MMVQ reorder: ✅ Working
Q4_K DMMV reorder: ❌ Not implemented

Impact: When DMMV is prioritized (GGML_SYCL_PRIORITIZE_DMMV=1), Q4_K runs without the reorder optimization.
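A minimal sketch of how such an environment switch feeds into the Q4_K outcome follows. The helper names are hypothetical and the integer-flag parsing is an assumption about how the variable is read, not a copy of the llama.cpp code.

```cpp
#include <cstdlib>

// Interpret an environment-variable value as an integer flag:
// unset or 0 means off, any other integer means on. (Assumed convention.)
inline bool parse_int_flag(const char * value) {
    return value != nullptr && std::atoi(value) != 0;
}

// Hypothetical wrapper around the variable named in this section.
inline bool prioritize_dmmv_from_env() {
    return parse_int_flag(std::getenv("GGML_SYCL_PRIORITIZE_DMMV"));
}

// Q4_K has an MMVQ reorder kernel but no DMMV reorder kernel, so under the
// behavior described above it only gets a reorder path when DMMV is not
// prioritized.
inline bool q4_k_uses_reorder(bool prioritize_dmmv) {
    return !prioritize_dmmv;
}
```

Keeping the parsing separate from `std::getenv` makes the flag logic testable without touching the process environment.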

2. llama.cpp Vulkan Backend Misalignments

2.1 Cooperative Matrix Detection

File: repos/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp
Lines: ~343, ~15972

Detection Logic:

// Step 1: Architecture classification
if (subgroup_size_control_props.minSubgroupSize == 16) {
    return vk_device_architecture::INTEL_XE2;
}
// Falls through to OTHER for 140T (minSubgroupSize=8)

// Step 2: Coopmat support check
case VK_VENDOR_ID_INTEL:
    return arch == vk_device_architecture::INTEL_XE2;
// Returns false for OTHER

Problem: Arc 140T (Arrow Lake H, Xe-LPG+ with XMX units) reports minSubgroupSize=8, so it is classified as OTHER and the coopmat check returns false, even though the hardware has full coopmat support.
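The two-step check can be reproduced without any Vulkan headers. The sketch below is a stand-alone model with hypothetical names (`DeviceArch`, `classify_intel`, `intel_coopmat_enabled`); it mirrors only the logic quoted above, not the surrounding ggml-vulkan code.

```cpp
#include <cstdint>

enum class DeviceArch { OTHER, INTEL_XE2 };

// Step 1: architecture classification keyed on minSubgroupSize alone.
inline DeviceArch classify_intel(uint32_t min_subgroup_size) {
    return min_subgroup_size == 16 ? DeviceArch::INTEL_XE2 : DeviceArch::OTHER;
}

// Step 2: cooperative matrix is only enabled for devices classified as Xe2.
inline bool intel_coopmat_enabled(DeviceArch arch) {
    return arch == DeviceArch::INTEL_XE2;
}
```

With these definitions, `intel_coopmat_enabled(classify_intel(8))` is false, which is the path the 140T takes.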

2.2 DP4A/DPAS Utilization Gap

Current State:

The Vulkan backend has DP4A instruction support
The matrix multiplication (matmul) path doesn't use DPAS
Only the Flash Attention path partially uses coopmat

Missing:

Q4_K, Q8_0 quantized matmul via DPAS
Subgroup-level parallelism for token generation
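For readers unfamiliar with the instruction names: DP4A is a four-lane int8 dot product accumulated into an int32, and DPAS performs many such dot products at once on the matrix units. The scalar reference below illustrates only the DP4A arithmetic; the function names are hypothetical and it makes no claim about how the Vulkan shaders pack their operands.

```cpp
#include <cstdint>

// Scalar reference for a DP4A-style operation: treat each 32-bit word as four
// signed 8-bit lanes, multiply lane-wise, and add the four products to `acc`.
inline int32_t dp4a_ref(uint32_t a, uint32_t b, int32_t acc) {
    for (int lane = 0; lane < 4; ++lane) {
        const int8_t av = static_cast<int8_t>((a >> (8 * lane)) & 0xFF);
        const int8_t bv = static_cast<int8_t>((b >> (8 * lane)) & 0xFF);
        acc += static_cast<int32_t>(av) * static_cast<int32_t>(bv);
    }
    return acc;
}

// Pack four signed bytes into one word, lane 0 in the low byte.
inline uint32_t pack4(int8_t x0, int8_t x1, int8_t x2, int8_t x3) {
    return  static_cast<uint32_t>(static_cast<uint8_t>(x0))        |
           (static_cast<uint32_t>(static_cast<uint8_t>(x1)) << 8)  |
           (static_cast<uint32_t>(static_cast<uint8_t>(x2)) << 16) |
           (static_cast<uint32_t>(static_cast<uint8_t>(x3)) << 24);
}
```

A quantized matmul built on this primitive consumes four int8 weights per instruction, which is why leaving the matmul path off DPAS forfeits throughput on hardware that has matrix units.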

3. IPEX-LLM vs llama.cpp Gap

3.1 Performance Comparison

Aspect	IPEX-LLM	llama.cpp SYCL	llama.cpp Vulkan
Q4_0	Fast	Fast	Medium
Q4_K	Fast	Medium	Medium
Q8_0	Fast	Was Broken	Was Broken
K-quants on Xe2	Crashes	Works	Works
FlashAttention	Full	Partial	Partial
vRAM usage	Lower	Higher	Higher

3.2 Source of Optimization Gap

IPEX-LLM advantages:

Closed-source optimized kernels (not in llama.cpp)
oneDNN GEMM integration
Lower-level hardware access
syclcompat library for platform-specific tuning

llama.cpp limitations:

Open-source kernels visible to competitors
Generic SYCL must work across all Intel GPUs
Can't leverage IPEX's proprietary optimizations

4. Architecture Detection Mismatches

4.1 Xe1 vs Xe2 Detection

Current Detection: Uses compute capability (device version)

Problem:

Arc A770 reports compute version 1.3 (Xe1)
Arc B580 reports compute version 1.6 (Xe2)
However, the same driver branch reports different minimum subgroup sizes (8 vs 16) across devices

4.2 Missing Architecture-Specific Tuning

Current kernels: Single implementation for all Intel GPUs

Needed:

Feature	Xe1 (Alchemist)	Xe2 (Battlemage)
L2 cache	16 MB (A770)	18 MB (B580)
Optimal block size	64	128
Prefetch depth	2	4
Vector width	8	16
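One way to express per-architecture tuning is a small lookup keyed on the detected architecture. The sketch below uses the block size, prefetch depth, and vector width from the table; the type and function names are hypothetical and nothing like this exists in the current kernels.

```cpp
enum class IntelArch { XE1_ALCHEMIST, XE2_BATTLEMAGE };

// Tuning knobs taken from the table above (proposed values, not measured).
struct KernelTuning {
    int block_size;
    int prefetch_depth;
    int vector_width;
};

constexpr KernelTuning tuning_for(IntelArch arch) {
    return arch == IntelArch::XE2_BATTLEMAGE ? KernelTuning{128, 4, 16}
                                             : KernelTuning{64, 2, 8};
}
```

A kernel launcher could then call `tuning_for(arch)` once per device and pass the result as template or specialization constants instead of hard-coding one configuration.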

5. Quantization Format Support Matrix

5.1 Current Support State

Format	DMMV Reorder	MMVQ Reorder	SYCL Matmul	Vulkan	Notes
Q4_0	✅	✅	✅	✅	Fully optimized
Q4_1	❌	❌	✅	✅	Legacy, slow
Q5_0	❌	❌	✅	✅	Legacy, slow
Q5_1	❌	❌	✅	✅	Legacy, slow
Q8_0	✅*	✅*	✅	✅	*Fixed by PR #21527
Q4_K	❌	✅	✅*	✅	*Prioritize DMMV breaks
Q5_K	❌	❌	❌	❌	No reorder support
Q6_K	❌	✅	✅*	✅	*Prioritize DMMV breaks
IQ4_NL	❌	❌	✅	❌	14% bandwidth, crashes
IQ4_XS	❌	❌	✅	✅	Not optimized

5.2 Block Size Analysis

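Format	Weights per Block	Block Size	Power of 2?	Cache Line Aligned?
Q4_0	32	18 bytes	No	Partial
Q4_K	256	144 bytes	No	No
Q5_K	256	176 bytes	No	No
Q6_K	256	210 bytes	No	No
Q8_0	32	34 bytes	No	No
IQ4_NL	32	18 bytes	No	Partial

Hypothesis: None of these block sizes is a power of 2, so tightly packed blocks drift across cache-line boundaries and resist aligned vector loads. Formats with a reorder path (Q4_0, and Q8_0 after PR #21527) are fast despite this; formats without one suffer.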
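The alignment concern behind this hypothesis can be checked with simple arithmetic. The sketch below defines stand-in structs for the two 32-weight formats (a `uint16_t` replaces the fp16 scale) and a hypothetical helper that reports whether a block in a tightly packed, line-aligned array spans two 64-byte cache lines.

```cpp
#include <cstddef>
#include <cstdint>

// Stand-in layouts; uint16_t takes the place of the fp16 scale.
struct BlockQ4_0 { uint16_t d; uint8_t qs[16]; };  // 32 weights, 4 bits each
struct BlockQ8_0 { uint16_t d; int8_t  qs[32]; };  // 32 weights, 8 bits each

static_assert(sizeof(BlockQ4_0) == 18, "Q4_0 block is 18 bytes");
static_assert(sizeof(BlockQ8_0) == 34, "Q8_0 block is 34 bytes");

// True if block number `index` spans two cache lines, assuming the array
// starts on a cache-line boundary and blocks are tightly packed.
constexpr bool straddles_line(std::size_t index, std::size_t block_bytes,
                              std::size_t line_bytes = 64) {
    return (index * block_bytes) / line_bytes !=
           (index * block_bytes + block_bytes - 1) / line_bytes;
}
```

With 34-byte blocks the second block (index 1) already straddles a line, with 18-byte blocks the fourth (index 3) does, and a hypothetical 16-byte block never would.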

6. Key File Locations Summary

Core Problem Areas:

repos/llama.cpp/ggml/src/ggml-sycl/
├── ggml-sycl.cpp
│   ├── Line 219: GGML_SYCL_PRIORITIZE_DMMV env var
│   ├── Line 3258-3260: mul_mat_algo enum (DMMV, MMVQ, SYCL)
│   ├── Line 3269-3292: ggml_sycl_supports_reorder_*() functions
│   ├── Line 3549-3650: dispatch logic with fallback chains
│   └── Problem: Routing logic doesn't handle Q4_K/Q6_K correctly
│
├── dmmv.cpp
│   ├── ~975-1100: dequantize_mul_mat_vec_q8_0_sycl()
│   ├── iter_stride = 64 (generic path)
│   └── Problem: 8x less work per iteration than the reorder path
│
├── mmvq.cpp
│   ├── ~550-570: Q4_0 reorder kernel
│   ├── ~695-720: Q8_0 reorder kernel (after PR #21527)
│   ├── ~1100-1200: Q4_K kernel (no DMMV support)
│   └── Problem: Missing Q5_K reorder
│
└── vecdotq.hpp
    ├── ~844: vec_dot_q8_0_q8_1 implementation
    └── Problem: Memory coalescing suboptimal for Xe2

repos/llama.cpp/ggml/src/ggml-vulkan/
└── ggml-vulkan.cpp
    ├── ~343: get_device_architecture() classification
    ├── ~15972: coopmat support check
    └── Problem: minSubgroupSize = 8 causes 140T misdetection

repos/ipex-llm/ (archived Jan 2026)
├── Closed-source optimized kernels (not in upstream)
├── syclcompat library
├── oneDNN integration
└── Problem: Archive status, no community maintenance

7. Misalignment Summary Table

Component	Expected	Actual	Impact
Q8_0 DMMV	16 values/iter (reorder stride 512)	2 values/iter (generic stride 64)	4x slower
Q4_K DMMV	Reorder enabled	Not implemented	40% slower
Q5_K MMVQ	Reorder support	Missing	3x slower
Arc 140T detection	INTEL_XE2	OTHER	Coopmat disabled
Q8_0 on B70	60% BW	21% BW	3x slower

Last Updated: April 2026
