# Driver/Stack Misalignment Analysis

## Overview

This document catalogs the specific code locations, design decisions, and architectural mismatches that cause Intel Arc GPUs to underperform on LLM inference.

---

## 1. llama.cpp SYCL Backend Misalignments

### 1.1 Kernel Dispatch Logic

**File:** `repos/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp`
**Lines:** ~3258-3660

**Current Dispatch Algorithm:**
```
mul_mat dispatch prefers:
1. MMVQ (reorder path) if src0 type in ggml_sycl_supports_reorder_mmvq()
2. DMMV (reorder path) if src0 type in ggml_sycl_supports_reorder_dmmv()
3. SYCL native matmul as fallback
```

**Support Lists (lines ~3269-3300):**

```cpp
// Summary of the types each helper accepts (not literal source).

// Supports MMVQ reorder
// ggml_sycl_supports_reorder_mmvq(): Q4_0, Q8_0, Q4_K, Q6_K

// Supports DMMV reorder
// ggml_sycl_supports_reorder_dmmv(): Q4_0, Q8_0 ONLY

// Supports SYCL matmul reorder
// ggml_sycl_supports_reorder_mul_mat_sycl(): Q4_0, Q8_0, Q4_K*, Q6_K*
// (* = only when !g_ggml_sycl_prioritize_dmmv)
```

**Problem:** Q4_K and Q6_K support MMVQ reorder but not DMMV reorder. When conditions favor DMMV, these quants fall through to the slow generic path.

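The fall-through described above can be illustrated with a small model of the dispatch. This is a sketch under stated assumptions, not the llama.cpp source: `QType`, `Path`, and `pick_path` are hypothetical names, the support lists are copied from the summary above, and the model assumes that prioritizing DMMV simply skips the MMVQ check.

```cpp
#include <cassert>
#include <initializer_list>

// Hypothetical stand-ins for the ggml type and algorithm enums.
enum class QType { Q4_0, Q8_0, Q4_K, Q5_K, Q6_K };
enum class Path { MMVQ_REORDER, DMMV_REORDER, GENERIC };

static bool in_list(QType t, std::initializer_list<QType> list) {
    for (QType x : list) {
        if (x == t) return true;
    }
    return false;
}

// Support lists as summarized in this section.
static bool supports_reorder_mmvq(QType t) {
    return in_list(t, {QType::Q4_0, QType::Q8_0, QType::Q4_K, QType::Q6_K});
}
static bool supports_reorder_dmmv(QType t) {
    return in_list(t, {QType::Q4_0, QType::Q8_0});
}

// Assumption: when DMMV is prioritized, the MMVQ reorder path is not tried.
static Path pick_path(QType t, bool prioritize_dmmv) {
    if (!prioritize_dmmv && supports_reorder_mmvq(t)) return Path::MMVQ_REORDER;
    if (supports_reorder_dmmv(t)) return Path::DMMV_REORDER;
    return Path::GENERIC;
}
```

Under this model, `pick_path(QType::Q4_K, true)` lands on `Path::GENERIC`, while `pick_path(QType::Q4_0, true)` still gets `Path::DMMV_REORDER`, which matches the gap described above.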
### 1.2 DMMV Kernel iter_stride Problem

**File:** `repos/llama.cpp/ggml/src/ggml-sycl/dmmv.cpp`
**Lines:** ~975-1100 (dequantize_mul_mat_vec_q8_0_sycl)

**Generic DMMV (used by Q8_0):**
```cpp
const int iter_stride = 2 * GGML_SYCL_DMMV_X;  // = 64; each thread processes 2 values per iteration
```

**Reorder DMMV (Q4_0 path):**
```cpp
const int iter_stride = 8 * 2 * GGML_SYCL_DMMV_X;  // = 512; each thread processes 16 values per iteration
```

**Root Cause:** Q8_0's 34-byte block (2-byte scale + 32 quant bytes) does not drop into the unrolled reorder loop tuned for Q4_0's 18-byte block (2-byte scale + 16 quant bytes), so Q8_0 stays on the generic path and does 8x less work per iteration (stride 64 vs 512).

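The stride arithmetic can be checked with a few constants. This is a sketch, assuming `GGML_SYCL_DMMV_X = 32` (implied by `2 * GGML_SYCL_DMMV_X = 64`) and 32 threads cooperating on each row (implied by the "2 values" and "16 values" comments); the helper names are hypothetical.

```cpp
#include <cassert>

constexpr int kDmmvX = 32;          // implied by 2 * GGML_SYCL_DMMV_X = 64
constexpr int kThreadsPerRow = 32;  // assumption derived from the per-iteration counts

constexpr int kGenericStride = 2 * kDmmvX;      // 64
constexpr int kReorderStride = 8 * 2 * kDmmvX;  // 512

// Values each thread handles in one loop iteration.
constexpr int values_per_thread(int iter_stride) {
    return iter_stride / kThreadsPerRow;
}

// Loop iterations needed to cover one row of ncols weights (rounded up).
constexpr int iterations_per_row(int ncols, int iter_stride) {
    return (ncols + iter_stride - 1) / iter_stride;
}

static_assert(values_per_thread(kGenericStride) == 2, "generic path: 2 values");
static_assert(values_per_thread(kReorderStride) == 16, "reorder path: 16 values");
```

For a 4096-column row, `iterations_per_row` gives 64 iterations on the generic path and 8 on the reorder path, which is the 8x difference in loop trips.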
### 1.3 Missing Q8_0 Reorder Implementation

**File:** `repos/llama.cpp/ggml/src/ggml-sycl/mmvq.cpp`
**Lines:** ~682-730

**Q4_0 Reorder Kernel:**
```cpp
mul_mat_vec_q_reorder<reorder_vec_dot_q_sycl<GGML_TYPE_Q4_0>>
```

**Q8_0 Reorder Kernel (present only after PR #21527):**
```cpp
mul_mat_vec_q_reorder<reorder_vec_dot_q_sycl<GGML_TYPE_Q8_0>>
```

**Note:** PR #21527 adds Q8_0 to the reorder framework. Without this fix, Q8_0 defaults to the slow DMMV path.

### 1.4 Q4_K DMMV Reorder Gap

**Problem:** Q4_K has a reorder structure (`reorder_qw_q4_k()`), but the DMMV path doesn't use it.

**Current State:**
- Q4_K MMVQ reorder: ✅ Working
- Q4_K DMMV reorder: ❌ Not implemented

**Impact:** When DMMV is prioritized (`GGML_SYCL_PRIORITIZE_DMMV=1`), Q4_K gets no reorder optimization.

---

## 2. llama.cpp Vulkan Backend Misalignments

### 2.1 Cooperative Matrix Detection

**File:** `repos/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp`
**Lines:** ~343, ~15972

**Detection Logic:**
```cpp
// Step 1: Architecture classification
if (subgroup_size_control_props.minSubgroupSize == 16) {
    return vk_device_architecture::INTEL_XE2;
}
// Falls through to OTHER for 140T (minSubgroupSize=8)

// Step 2: Coopmat support check
case VK_VENDOR_ID_INTEL:
    return arch == vk_device_architecture::INTEL_XE2;
    // Returns false for OTHER
```

**Problem:** Arc 140T (Arrow Lake H) reports `minSubgroupSize = 8`, so it is classified as `OTHER` and coopmat is disabled despite the hardware having full coopmat support. Note that Arc 140T is usually described as Xe-LPG+ (with XMX units) rather than Xe2 proper, so `INTEL_XE2` acts here as a proxy for coopmat capability.

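The two-step check can be reduced to a pair of functions to show why a device reporting a minimum subgroup size of 8 never reaches the coopmat path. This is a minimal sketch: `Arch`, `classify_intel`, and `intel_coopmat_enabled` are hypothetical names that mirror the excerpt above, not the actual ggml-vulkan code.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical, trimmed-down mirror of vk_device_architecture.
enum class Arch { OTHER, INTEL_XE2 };

// Step 1: only a minimum subgroup size of 16 maps to INTEL_XE2.
static Arch classify_intel(uint32_t min_subgroup_size) {
    return min_subgroup_size == 16 ? Arch::INTEL_XE2 : Arch::OTHER;
}

// Step 2: for Intel, coopmat is enabled only on INTEL_XE2.
static bool intel_coopmat_enabled(Arch arch) {
    return arch == Arch::INTEL_XE2;
}
```

With these definitions, `intel_coopmat_enabled(classify_intel(8))` is false regardless of what the hardware can do, because the decision depends only on the reported subgroup size.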
### 2.2 DP4A/DPAS Utilization Gap

**Current State:**
- The Vulkan backend has DP4A instruction support
- The matrix multiplication (matmul) path doesn't use DPAS
- Only the Flash Attention path uses coopmat, and only partially

**Missing:**
- Q4_K, Q8_0 quantized matmul via DPAS
- Subgroup-level parallelism for token generation

---

## 3. IPEX-LLM vs llama.cpp Gap

### 3.1 Performance Comparison

| Aspect | IPEX-LLM | llama.cpp SYCL | llama.cpp Vulkan |
|--------|----------|----------------|------------------|
| Q4_0 | Fast | Fast | Medium |
| Q4_K | Fast | Medium | Medium |
| Q8_0 | Fast | Was Broken | Was Broken |
| K-quants on Xe2 | Crashes | Works | Works |
| FlashAttention | Full | Partial | Partial |
| vRAM usage | Lower | Higher | Higher |

### 3.2 Source of Optimization Gap

**IPEX-LLM advantages:**
1. Closed-source optimized kernels (not in llama.cpp)
2. oneDNN GEMM integration
3. Lower-level hardware access
4. syclcompat library for platform-specific tuning

**llama.cpp limitations:**
1. Open-source kernels visible to competitors
2. Generic SYCL must work across all Intel GPUs
3. Can't leverage IPEX's proprietary optimizations

---

## 4. Architecture Detection Mismatches

### 4.1 Xe1 vs Xe2 Detection

**Current Detection:** Uses compute capability (device version)

**Problem:**
- Arc A770 reports compute version 1.3 (Xe1)
- Arc B580 reports compute version 1.6 (Xe2)
- However, the same driver branch reports different subgroup sizes (8 vs 16), so subgroup size alone is not a reliable architecture signal

### 4.2 Missing Architecture-Specific Tuning

**Current kernels:** Single implementation for all Intel GPUs

**Needed:**
| Feature | Xe1 (Alchemist) | Xe2 (Battlemage) |
|---------|-----------------|------------------|
| L2 cache | 16 MB (A770) | 18 MB (B580) |
| Optimal block size | 64 | 128 |
| Prefetch depth | 2 | 4 |
| Vector width | 8 | 16 |

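One way to carry the per-architecture values from the table above is a small compile-time lookup that kernels can query instead of hard-coding a single configuration. This is only a sketch: `IntelArch`, `TuningParams`, and `tuning_for` are hypothetical names that do not exist in llama.cpp, and the numbers are the proposed targets from the table, not measured optima.

```cpp
#include <cassert>

// Hypothetical architecture tag; not a llama.cpp type.
enum class IntelArch { XE1_ALCHEMIST, XE2_BATTLEMAGE };

// Tuning knobs taken from the "Needed" table.
struct TuningParams {
    int block_size;
    int prefetch_depth;
    int vector_width;
};

// Returns the proposed parameters for the given architecture.
constexpr TuningParams tuning_for(IntelArch arch) {
    return arch == IntelArch::XE2_BATTLEMAGE ? TuningParams{128, 4, 16}
                                             : TuningParams{64, 2, 8};
}
```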

---

## 5. Quantization Format Support Matrix

### 5.1 Current Support State

| Format | DMMV Reorder | MMVQ Reorder | SYCL Matmul | Vulkan | Notes |
|--------|--------------|--------------|-------------|--------|-------|
| Q4_0 | ✅ | ✅ | ✅ | ✅ | Fully optimized |
| Q4_1 | ❌ | ❌ | ✅ | ✅ | Legacy, slow |
| Q5_0 | ❌ | ❌ | ✅ | ✅ | Legacy, slow |
| Q5_1 | ❌ | ❌ | ✅ | ✅ | Legacy, slow |
| Q8_0 | ✅* | ✅* | ✅ | ✅ | *Fixed by PR #21527 |
| Q4_K | ❌ | ✅ | ✅* | ✅ | *Prioritize DMMV breaks |
| Q5_K | ❌ | ❌ | ❌ | ❌ | No reorder support |
| Q6_K | ❌ | ✅ | ✅* | ✅ | *Prioritize DMMV breaks |
| IQ4_NL | ❌ | ❌ | ✅ | ❌ | 14% bandwidth, crashes |
| IQ4_XS | ❌ | ❌ | ✅ | ✅ | Not optimized |

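### 5.2 Block Size Analysis

Block sizes below are the ggml struct sizes; K-quants use 256-weight super-blocks, the others use 32-weight blocks.

| Format | Block Size | Power of 2? | Cache Line Aligned? |
|--------|-----------|-------------|---------------------|
| Q4_0 | 18 bytes (32 weights) | No | Partial |
| Q4_K | 144 bytes (256 weights) | No | No |
| Q5_K | 176 bytes (256 weights) | No | No |
| Q6_K | 210 bytes (256 weights) | No | No |
| Q8_0 | 34 bytes (32 weights) | No | No |
| IQ4_NL | 18 bytes (32 weights) | No | Partial |

**Hypothesis:** Block sizes that are not a power of 2 force blocks to straddle cache lines and prevent simple aligned loads. No format here has a power-of-2 block size in its native layout, Q4_0 included, so Q4_0's advantage most likely comes from its reorder layout (quant values and scales stored in separate contiguous regions) rather than from block size alone. Formats without a reorder path suffer the most.
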
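The alignment effect of a block size can be estimated by counting how many back-to-back blocks cross a cache-line boundary. The sketch below assumes a 64-byte cache line and an array that starts on a boundary; `is_pow2` and `straddling_blocks` are illustrative helpers, not llama.cpp functions. With these assumptions, a 16-byte block never straddles a line, while 8 of the first 32 blocks do at 18 bytes (Q4_0) and 16 of the first 32 do at 34 bytes (Q8_0).

```cpp
#include <cassert>
#include <cstddef>

constexpr std::size_t kCacheLine = 64;  // assumed cache-line size in bytes

// True when n is a nonzero power of two.
constexpr bool is_pow2(std::size_t n) {
    return n != 0 && (n & (n - 1)) == 0;
}

// Counts how many of the first nblocks contiguous blocks span two cache
// lines, assuming the first block starts on a cache-line boundary.
static std::size_t straddling_blocks(std::size_t block_bytes, std::size_t nblocks) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < nblocks; ++i) {
        const std::size_t first_line = (i * block_bytes) / kCacheLine;
        const std::size_t last_line = (i * block_bytes + block_bytes - 1) / kCacheLine;
        if (first_line != last_line) ++count;
    }
    return count;
}
```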

---

## 6. Key File Locations Summary

### Core Problem Areas:

```
repos/llama.cpp/ggml/src/ggml-sycl/
├── ggml-sycl.cpp
│   ├── Line 219: GGML_SYCL_PRIORITIZE_DMMV env var
│   ├── Line 3258-3260: mul_mat_algo enum (DMMV, MMVQ, SYCL)
│   ├── Line 3269-3292: ggml_sycl_supports_reorder_*() functions
│   ├── Line 3549-3650: dispatch logic with fallback chains
│   └── Problem: Q4_K/Q6_K fall to the generic path when DMMV is prioritized
│
├── dmmv.cpp
│   ├── ~975-1100: dequantize_mul_mat_vec_q8_0_sycl()
│   ├── iter_stride = 64 (generic path)
│   └── Problem: 8x less work per iteration than reorder path
│
├── mmvq.cpp
│   ├── ~550-570: Q4_0 reorder kernel
│   ├── ~695-720: Q8_0 reorder kernel (after PR #21527)
│   ├── ~1100-1200: Q4_K kernel (no DMMV support)
│   └── Problem: Missing Q5_K reorder
│
└── vecdotq.hpp
    ├── ~844: vec_dot_q8_0_q8_1 implementation
    └── Problem: Memory coalescing suboptimal for Xe2

repos/llama.cpp/ggml/src/ggml-vulkan/
└── ggml-vulkan.cpp
    ├── ~343: get_device_architecture() classification
    ├── ~15972: coopmat support check
    └── Problem: minSubgroupSize = 8 causes 140T misdetection

repos/ipex-llm/ (archived Jan 2026)
├── Closed-source optimized kernels (not in upstream)
├── syclcompat library
├── oneDNN integration
└── Problem: Archive status, no community maintenance
```

---

## 7. Misalignment Summary Table

| Component | Expected | Actual | Impact |
|-----------|----------|--------|--------|
| Q8_0 DMMV | 16 values/iter (stride 512) | 2 values/iter (stride 64) | 4x slower |
| Q4_K DMMV | Reorder enabled | Not implemented | 40% slower |
| Q5_K MMVQ | Reorder support | Missing | 3x slower |
| Arc 140T detection | INTEL_XE2 | OTHER | Coopmat disabled |
| Q8_0 on B70 | 60% BW | 21% BW | 3x slower |

---

*Last Updated: April 2026*