Initial commit: Intel Arc GPU LLM inference diagnosis research
@@ -0,0 +1,2 @@
repos/
.DS_Store

@@ -0,0 +1,52 @@
# Intel Arc GPU — LLM Inference Diagnosis

Research into why Intel Arc GPUs (Alchemist / Xe1 and Battlemage / Xe2) severely underperform on quantized LLM inference, often achieving only **21–40% of theoretical memory bandwidth** during token generation — compared to 80–95% on equivalent NVIDIA and AMD hardware.

## The Problem

Intel Arc GPUs look great on paper for LLM inference: ample VRAM, wide memory buses, dedicated XMX matrix engines. In practice, community benchmarks consistently show:

- **Q8_0 quantized models running 4–5× slower** than Q4_K_M despite only moving 1.7× more data
- Token generation achieving only **21% of peak bandwidth** on some quantization types
- Wildly inconsistent performance across SYCL, Vulkan, OpenVINO, and IPEX-LLM backends
- Architecture-specific regressions on Xe2 (Battlemage) that don't exist on Xe1 (Alchemist)

The root causes are multi-layered: missing kernel optimizations in `llama.cpp`, a fragmented Intel software stack (five semi-independent efforts that don't interoperate), quantization-specific dispatch path bugs, and an overall underinvestment in open-source kernel development for Intel GPU architectures.

## Overviews

Each overview was independently produced by a different LLM, analyzing community issues, kernel source code, driver stacks, and benchmark data:

- **[Kimi's Overview](overview_kimi.md)** — Focuses on driver/runtime stack mapping, quantization kernel inefficiencies (DMMV vs. MMVQ paths), and the missing reorder optimization for Q8_0.
- **[GLM's Overview](overview_glm.md)** — Broadest scope: full stack architecture diagram, version compatibility matrix, fragmentation analysis across five Intel inference stacks, and the Battlemage regression class.
- **[MiniMax's Overview](overview_minimax.md)** — Hardware landscape, per-GPU status table, critical issue triage (Q8_0 catastrophe, iGPU misdetection), and kernel-level root cause analysis.

## Research

Supporting deep-dives in [`research/`](research/):

- [`research/kernels/kernel_analysis_minimax.md`](research/kernels/kernel_analysis_minimax.md) — Detailed kernel dispatch path analysis
- [`research/community_issues/issues_and_discourse_minimax.md`](research/community_issues/issues_and_discourse_minimax.md) — Curated community issue reports and discourse

## Repo Map

The `repos/` directory contains source clones of the relevant Intel GPU and LLM inference projects for offline analysis (not tracked in this repository):

| Repository | Purpose |
|---|---|
| `llama.cpp` | SYCL & Vulkan backends, GGUF quantization kernels |
| `ipex-llm` | Intel's former PyTorch integration layer (archived Jan 2026) |
| `intel-extension-for-pytorch` | PyTorch XPU extension (deprecated) |
| `compute-runtime` | Intel Level Zero / OpenCL driver (NEO) |
| `intel-graphics-compiler` | JIT compiler (SYCL → Xe ISA) |
| `oneDNN` | Deep-learning primitive library |
| `vllm` | vLLM mainline (XPU backend in flux) |
| `vllm-xpu-kernels` | Dedicated Intel kernel repo for vLLM |
| `level-zero` | Level Zero loader and headers |
| `llvm` | DPC++ / SYCL compiler toolchain |
| `openvino` | Intel's inference optimizer/runtime |
| `sycl-tla` | SYCL Templates for Linear Algebra |

## License

This research documentation is released under [CC0](https://creativecommons.org/publicdomain/zero/1.0/). Referenced repositories carry their own licenses.

@@ -0,0 +1,542 @@
# Intel Arc GPU LLM Inference: Comprehensive Research Overview

**Author:** GLM Agent
**Date:** April 15, 2026
**Phase:** Research & Data Preparation (No driver/framework modifications)

---

## Table of Contents

1. [Executive Summary](#executive-summary)
2. [The Intel GPU Software Stack](#the-intel-gpu-software-stack)
3. [Identified Problems](#identified-problems)
4. [Performance Landscape](#performance-landscape)
5. [Root Cause Analysis](#root-cause-analysis)
6. [Speculations & Hypotheses](#speculations--hypotheses)
7. [Suggested Solutions](#suggested-solutions)
8. [Repo Map & Key Files](#repo-map--key-files)
9. [References](#references)

---

## Executive Summary

Intel Arc GPUs (both discrete Alchemist/Xe1 and Battlemage/Xe2, and integrated Lunar Lake/Arrow Lake) suffer from **severely degraded LLM inference performance** compared to their theoretical hardware capabilities. Users consistently report achieving only **21-40% of theoretical memory bandwidth** on quantized models, versus **80-95% on equivalent NVIDIA/AMD hardware**. The issues are multifaceted, spanning kernel optimization gaps, software stack fragmentation, driver/kernel version incompatibilities, and a fundamental underinvestment in open-source kernel development for Intel GPU architectures.

The Battlemage (B580/B570/B70) generation introduced a **new class of regression**: quantization types that worked well on Xe1 (Alchemist) perform catastrophically on Xe2, with Q8_0 being **4-5x slower** than Q4_K_M despite only having 1.7x more data. This was traced to a kernel dispatch path issue (now partially fixed by PR #21527).

---

## The Intel GPU Software Stack

### Layer Architecture

```
┌────────────────────────────────────────────────────┐
│ User-facing: Ollama, LM Studio, llama.cpp          │
├────────────────────────────────────────────────────┤
│ Framework:  IPEX-LLM (archived Jan 2026)           │
│             vLLM (intel/vllm docker)               │
│             OpenVINO                               │
│             PyTorch + IPEX                         │
├────────────────────────────────────────────────────┤
│ Backend:    SYCL (llama.cpp)                       │
│             Vulkan (llama.cpp, cross-vendor)       │
│             Level Zero (low-level compute)         │
├────────────────────────────────────────────────────┤
│ Compiler:   DPC++ (intel/llvm sycl branch)         │
│             IGC (Intel Graphics Compiler)          │
├────────────────────────────────────────────────────┤
│ Runtime:    compute-runtime (NEO)                  │
│             Level Zero Loader                      │
│             oneDNN (BLAS/GEMM)                     │
├────────────────────────────────────────────────────┤
│ Kernel Driver: i915 / xe (Linux)                   │
│ Firmware:      linux-firmware                      │
├────────────────────────────────────────────────────┤
│ Hardware:   Xe1 (Alchemist: A380/A750/A770)        │
│             Xe2 (Battlemage: B570/B580/B70)        │
│             Xe2 iGPU (Lunar Lake 140V, Arrow Lake) │
└────────────────────────────────────────────────────┘
```

### The Fragmentation Problem

Intel's software stack for GPU inference has **at least five semi-independent efforts** that don't fully interoperate:

| Stack | Maintainer | Status | Backend | Optimized? |
|-------|-----------|--------|---------|-----------|
| llama.cpp SYCL | Community (NeoZhangJianyu, Rbiessy/Codeplay) | Active | SYCL/Level Zero | Only Q4_0 fully |
| llama.cpp Vulkan | 0cc4m (community) | Active | Vulkan | Improving, behind CUDA |
| IPEX-LLM | Intel (analytics team) | **Archived Jan 2026** | SYCL + proprietary | Best perf, dying |
| OpenVINO | Intel (openvino team) | Active | Own runtime | Different model format |
| vLLM XPU | Intel (vllm fork) | Active | PyTorch/IPEX | Server-focused |

**Key complaint from Hacker News user lhl:**

> "PyTorch requires its own support kit separate from the oneAPI Toolkit (and runs slightly different versions of everything), the vLLM xpu support doesn't work — both source and the docker failed to build/run for me. The IPEX-LLM whisper support is completely borked."

### Version Compatibility Nightmare

| Component | IPEX-LLM requires | Latest available | Conflict? |
|-----------|-------------------|-----------------|-----------|
| oneAPI Base Toolkit | 2024.2.1 | 2025.3+ | **Yes** |
| PyTorch | 2.5.x | 2.8+ | **Yes** |
| compute-runtime | 25.x | 26.x | **Yes** |
| Linux kernel | 6.5-6.17 | 6.18+ | **Yes (6.18 breaks)** |
| IGC | 2.27+ | 2.30+ | Minor |
| DPC++ Compiler | 2024.2 | 2025.3 | **ABI changes** |

**Critical finding**: Linux kernel 6.18 completely breaks compute-runtime's memory management (`bindless_heaps_helper.cpp` abort). Users must stay on 6.17 or older. This is tracked in compute-runtime#875.

**Another critical finding**: IPEX-LLM shifted its oneAPI dependency from 2024.2.1 to 2025.0.1 starting with ipex-llm[cpp]==2.2.0b20250207, but the archived repo won't get further updates. Users on older IPEX-LLM versions are stuck on the old oneAPI.

---

## Identified Problems

### Problem 1: SYCL Kernel Dispatch — Quantization Type Inequality (CRITICAL)

**Status**: Partially fixed (Q8_0), open for most other types
**Impact**: 2-5x performance degradation on non-Q4_0 quantizations

The llama.cpp SYCL backend has three kernel paths for quantized matrix-vector multiplication:

1. **DMMV** (Dequantize-Mul-Mat-Vec) — generic, slow
2. **MMVQ** (Mul-Mat-Vec-Q) — optimized, uses reorder
3. **SYCL native matmul** — fallback

Only Q4_0 and Q8_0 (after PR #21527) have full reorder support. All other quantization types fall through to slower paths:

| Format | DMMV Reorder | MMVQ Reorder | SYCL Matmul Reorder | Effective BW |
|--------|:------------:|:------------:|:-------------------:|:------------:|
| Q4_0 | ✅ | ✅ | ✅ | 57% |
| Q8_0 | ✅ (PR #21527) | ✅ (PR #21527) | ✅ | 66% |
| Q4_K | ❌ | ✅ | ✅* | 53% |
| Q5_K | ❌ | ❌ | ❌ | ~39% |
| Q6_K | ❌ | ✅ | ✅* | ~48% |
| IQ4_NL | ❌ | ❌ | ✅ | 14% |
| Q4_1/Q5_0/Q5_1 | ❌ | ❌ | ✅ | ~30-44% |

*\* Only when `g_ggml_sycl_prioritize_dmmv` is NOT set*

The root cause is in the dispatch logic (`ggml-sycl.cpp` lines 3269-3340):

```cpp
// Only Q4_0 and Q8_0 supported in DMMV reorder
inline bool ggml_sycl_supports_reorder_dmmv(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q8_0:
            return true;
        default:
            return false; // Q4_K, Q5_K, Q6_K all fall through
    }
}
```

### Problem 2: Xe2/Battlemage Regression — Q8_0 Catastrophically Slow (CRITICAL, mostly fixed)

**Status**: PR #21527 submitted, 3.1x speedup validated
**Impact**: Q8_0 ran at 21% bandwidth on Xe2 vs 53-64% for Q4_K_M

On Arc A770 (Xe1), Q8_0 was **faster** than Q4/Q6. On Arc B70/B580 (Xe2), Q8_0 was **4-5x slower**. The root cause:

- **Generic DMMV path**: `iter_stride = 2 * GGML_SYCL_DMMV_X = 64` → processes 2 values per thread per iteration
- **Reorder DMMV path** (Q4_0): `iter_stride = 8 * 2 * GGML_SYCL_DMMV_X = 512` → processes 16 values per thread per iteration

Q8_0 was stuck on the generic path because it wasn't in `ggml_sycl_supports_reorder_dmmv()`. This was confirmed not to be a driver issue (IGC 2.28.4 → 2.30.1 showed no change) and not a backend issue (both SYCL and Vulkan were equally affected).
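
To make the stride difference concrete, here is a host-side sketch of the loop structure. It is illustrative only: `dot_row`, the stub `dequantize`, and the thread mapping are stand-ins for the upstream kernel, not copies of it.

```cpp
#include <cstdio>
#include <vector>

// Hypothetical stand-in for the format-specific dequantization step.
static float dequantize(const float *w, int i) { return w[i]; }

// Each of `nthreads` work-items covers `vals_per_iter` consecutive values per
// iteration, and the group advances iter_stride = nthreads * vals_per_iter
// columns at a time:
//   generic path:  vals_per_iter = 2  -> iter_stride = 64  -> strided loads
//   reorder path:  vals_per_iter = 16 -> iter_stride = 512 -> coalesced loads
static float dot_row(const float *w, const float *x, int ncols,
                     int tid, int nthreads, int vals_per_iter) {
    const int iter_stride = nthreads * vals_per_iter;
    float acc = 0.0f;
    for (int col0 = tid * vals_per_iter; col0 < ncols; col0 += iter_stride) {
        for (int j = 0; j < vals_per_iter; ++j) {
            acc += dequantize(w, col0 + j) * x[col0 + j];
        }
    }
    return acc; // the real kernel ends with a subgroup reduction instead
}

int main() {
    const int ncols = 4096;
    std::vector<float> w(ncols, 1.0f), x(ncols, 1.0f);
    float sum = 0.0f;
    for (int tid = 0; tid < 32; ++tid) {
        sum += dot_row(w.data(), x.data(), ncols, tid, 32, 16);
    }
    std::printf("dot = %.0f\n", sum); // 4096
}
```

With 16 values per thread, neighboring work-items touch one long contiguous run of memory per iteration; with 2, the same traffic is scattered into many short, poorly coalesced bursts.
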

PR #21527 added Q8_0 to the reorder framework, achieving 66% bandwidth utilization (up from 21%).

### Problem 3: K-Quantization Crashes on Xe2 iGPU (CRITICAL)

**Status**: Workaround exists (use upstream llama.cpp SYCL instead of IPEX-LLM)
**Impact**: Q4_K_M, Q5_K, Q6_K crash on Arc 140V/140T

```
Sub-group size 8 is not supported on the device
Exception at ggml-sycl.cpp:3164
```

This error occurs in IPEX-LLM's bundled llama.cpp but not in upstream. IPEX-LLM's llama.cpp is based on an August 2024 snapshot and hasn't received the fixes. Since the project was archived in January 2026, this will likely never be fixed in IPEX-LLM.

### Problem 4: Arc 140T Misdetection — Coopmat Disabled (HIGH)

**Status**: Open issue
**Impact**: Cooperative matrix operations completely disabled on valid Xe2 hardware

The Vulkan backend classifies GPUs by `minSubgroupSize`:

- `minSubgroupSize == 16` → classified as `INTEL_XE2` → coopmat enabled
- Everything else → classified as `OTHER` → no coopmat

Arrow Lake H (Arc 140T) reports `minSubgroupSize = 8` despite having Xe2 architecture and full cooperative matrix support. This appears to be a driver-level reporting bug; a minimal reconstruction of the check is sketched below.
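
The following is a simplified reconstruction of that classification, not the upstream code: the real logic in `ggml-vulkan.cpp` also consults vendor IDs and driver properties.

```cpp
#include <cstdint>
#include <cstdio>

// Simplified model of the Vulkan backend's Intel architecture classification.
enum class vk_device_architecture { OTHER, INTEL_XE2 };

static vk_device_architecture classify_intel(uint32_t min_subgroup_size) {
    // Arc 140V reports 16 -> INTEL_XE2, coopmat enabled.
    // Arc 140T reports 8 despite being Xe2 -> OTHER, coopmat disabled.
    return min_subgroup_size == 16 ? vk_device_architecture::INTEL_XE2
                                   : vk_device_architecture::OTHER;
}

int main() {
    auto name = [](vk_device_architecture a) {
        return a == vk_device_architecture::INTEL_XE2 ? "INTEL_XE2" : "OTHER";
    };
    std::printf("Arc 140V (minSubgroupSize=16) -> %s\n", name(classify_intel(16)));
    std::printf("Arc 140T (minSubgroupSize=8)  -> %s\n", name(classify_intel(8)));
}
```

A more robust check would key off device IDs or an explicit architecture query rather than a single subgroup property.
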

### Problem 5: Linux Kernel Version Fragility (HIGH)

**Status**: Active issue (compute-runtime#875)
**Impact**: Complete failure of all GPU compute on kernel 6.18

| Kernel Version | Works? | Notes |
|:--------------:|:------:|-------|
| ≤ 6.6.25 | ✅ | Stable baseline |
| 6.6.26 - 6.8 | ⚠️ | Some CCS fence timeout issues |
| 6.9 - 6.17 | ✅ | Working range |
| 6.18+ | ❌ | `bindless_heaps_helper.cpp` abort, all Level Zero fails |

Additionally, Linux firmware updates have broken Intel GPU compute in the past. The `linux-firmware` package version `20240409.1addd7dc` introduced fence timeouts that hung llama.cpp, requiring downgrades.

### Problem 6: IPEX-LLM Abandonment (HIGH)

**Status**: Archived January 2026
**Impact**: Best-performing Intel GPU inference stack no longer maintained

IPEX-LLM consistently outperformed upstream llama.cpp SYCL by **50-80%** on token generation (e.g., 24.35 t/s vs 13.51 t/s on Arc 140V with Llama-2-7B Q4_0). This performance gap came from:

1. **Closed-source optimized kernels** not shared with upstream
2. **oneDNN GEMM integration** for prompt processing
3. **syclcompat library** for platform-specific tuning
4. **Proprietary quantization optimizations**

With IPEX-LLM archived, these optimizations are frozen. The llama.cpp community must independently re-derive all of this work.

### Problem 7: Vulkan vs SYCL Performance Inconsistency (MEDIUM)

**Status**: Improving, but still inconsistent
**Impact**: Users must test both backends for each model/generation

Historically:

- SYCL: Better prompt processing (5-6x faster than Vulkan)
- Vulkan: Better token generation (up to 50% faster than SYCL in late 2024)
- Recent SYCL improvements have narrowed the gap

The two backends have completely separate kernel implementations with different optimization strategies, and neither consistently wins across all quantization types and hardware generations.

### Problem 8: vLLM XPU Quantization Limitations (MEDIUM)

**Status**: Active development
**Impact**: Only FP16, Dynamic FP8, MXFP4 validated on XPU

vLLM's Intel XPU support does not cover many quantization formats that work on CUDA:

- **GPTQ**: Limited; torchao AWQ models crash (hard-requires CUDA)
- **AWQ**: Occupies more memory than the model size on XPU
- **Marlin/Machete kernels**: CUDA-only
- **GGUF**: Not supported in vLLM at all

The intel/llm-scaler-vllm docker images lag behind mainline vLLM (currently at 0.14.0-b8.1 vs 0.16+ mainline).

### Problem 9: Missing DPAS/XMX Utilization for Quantized Inference (HIGH)

**Status**: In early investigation
**Impact**: Intel's key hardware advantage (XMX tensor cores) goes unused for quantized matmul

Intel Xe/Xe2 GPUs have XMX (Xe Matrix eXtensions) units capable of DPAS (Dot Product and Accumulate Systolic) operations. The Arc A770 has 4096-bit XMX units; the B580 is similar. However:

- **SYCL backend**: Uses the `joint_matrix` extension only for FP16/BF16 GEMM, not for quantized formats
- **Vulkan backend**: DP4A instruction support added, but not yet wired to the matmul path
- **K-quants**: No DPAS path at all — they rely entirely on scalar DP4A or software emulation
- Community contributors (Rbiessy, NeoZhangJianyu) have noted that the kernels are memory-bound *before* they can benefit from DPAS, so the memory access patterns must be fixed first

Quote from Rbiessy (Codeplay):

> "I say potentially [using the matrix engine] because currently the kernel is memory bound in the configurations we have tried. If we're still not able to improve that for some reason using HMX won't help."

### Problem 10: Understaffed Open-Source Development (SYSTEMIC)

**Status**: Ongoing
**Impact**: All other problems stem from this root cause

The SYCL backend in llama.cpp is primarily maintained by:

- **NeoZhangJianyu**: Independent contributor, spare time
- **Rbiessy (Codeplay/Samsung)**: Part of the Codeplay team (acquired by Samsung), contributing to SYCL optimization
- **0cc4m**: Vulkan backend development
- **qnixsynapse**: Testing and CI

Quote from NeoZhangJianyu:

> "We are private contributors to maintain the SYCL backend on Intel GPU. You shouldn't complain so much, since we spend our spare time in past year to maintain it and make it work. Yes, it works, instead of work perfect. For BMG, we don't promise to optimize it in time of the marketing."

Intel itself does not officially contribute to the llama.cpp SYCL backend. Its focus is on IPEX-LLM (now archived), OpenVINO, and vLLM XPU.

---

## Performance Landscape

### Benchmarks: llama.cpp SYCL vs Vulkan vs IPEX-LLM

Compiled from community reports (llm-tracker.info, GitHub issues, Reddit):

#### Arc 140V iGPU (Lunar Lake, Xe2) — Llama-2-7B Q4_0

| Backend | pp512 (t/s) | tg128 (t/s) | MBW Efficiency |
|---------|:-----------:|:-----------:|:--------------:|
| CPU (4 P-cores) | 25.05 | 11.59 | 30% |
| Vulkan | 44.65 | 5.54 | 14% |
| SYCL FP32 | 180.77 | 14.39 | 38% |
| SYCL FP16 | 526.38 | 13.51 | 35% |
| **IPEX-LLM** | **708.15** | **24.35** | **64%** |

Theoretical max tg: 136.5 GB/s ÷ 3.56 GB (model size read per token) = ~38.3 t/s

#### Arc A770 (Alchemist, Xe1) — Various Models

| Model | Quant | SYCL pp512 | SYCL tg128 | Vulkan tg128 |
|-------|-------|:----------:|:----------:|:------------:|
| Llama 7B | Q4_0 | ~500 | ~40 | ~30 |
| Llama 7B | Q6_K | ~700 | ~22 | ~21 |
| Llama 13B | Q5_K | ~400 | ~16 | ~8 |
| Llama 30B | Q2_K | ~160 | ~8.5 | ~5 |

#### Arc B580 (Battlemage, Xe2) — Llama-3.1-8B

| Quant | SYCL tg (t/s) | Expected (456 GB/s BW) | Efficiency |
|-------|:------------:|:---------------------:|:----------:|
| Q4_K_M | 25-30 | ~38 | 66-79% |
| Q8_0 (pre-fix) | ~8 | ~22 | 36% |
| Q8_0 (post-fix) | ~18 | ~22 | 82% |

#### Arc Pro B70 (Battlemage, Xe2) — Qwen3.5-27B (comprehensive sweep)

| Quant | Size (GiB) | tg128 (t/s) | Effective BW | % of 608 GB/s |
|-------|:----------:|:-----------:|:------------:|:-------------:|
| Q4_0 | 14.63 | 23.67 | 346 GB/s | **57%** |
| Q4_K_M | 15.58 | 20.56 | 321 GB/s | **53%** |
| Q6_K | 20.90 | 13.83 | 289 GB/s | **48%** |
| Q5_K_M | 18.25 | 13.78 | 252 GB/s | **41%** |
| IQ4_NL | 14.60 | 5.85 | 85 GB/s | **14%** |
| Q8_0 (pre-fix) | 26.62 | 4.88 | 130 GB/s | **21%** |
| Q8_0 (post-fix) | — | 15.24 | 402 GB/s | **66%** |
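
The "Effective BW" and percentage columns follow a simple model: during token generation every weight byte is read once per token, so effective bandwidth ≈ model size × tokens/s. The tables multiply the GiB figure directly, effectively treating GiB as GB; the check below reproduces two rows the same way:

```cpp
#include <cstdio>

// Back-of-envelope model behind the tables above: effective bandwidth is
// model size times token rate, and efficiency is that over peak bandwidth.
int main() {
    const double peak_gbs = 608.0; // Arc Pro B70
    struct { const char *quant; double gib, tg; } rows[] = {
        {"Q4_0",           14.63, 23.67},
        {"Q8_0 (pre-fix)", 26.62,  4.88},
    };
    for (const auto &r : rows) {
        const double bw = r.gib * r.tg; // GiB treated as GB, as in the tables
        std::printf("%-16s %6.0f GB/s  %4.0f%% of peak\n",
                    r.quant, bw, 100.0 * bw / peak_gbs);
    }
    // Q4_0: ~346 GB/s (57%); Q8_0 pre-fix: ~130 GB/s (21%)
}
```
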

### Comparison vs NVIDIA/AMD

| GPU | Price | VRAM | BW (GB/s) | 8B Q4_K_M tg | BW Efficiency |
|-----|:-----:|:----:|:---------:|:------------:|:------------:|
| Arc B580 | $249 | 12GB | 456 | ~25-30 t/s | ~66-79% |
| RTX 4060 | $299 | 8GB | 272 | ~35 t/s | ~92% |
| RTX 4060 Ti 16GB | $499 | 16GB | 288 | ~40 t/s | ~90% |
| RTX 3060 12GB (used) | $170 | 12GB | 360 | ~35 t/s | ~88% |

Intel Arc delivers **20-30% less inference speed** than NVIDIA at similar price points, despite having competitive raw bandwidth. The gap is entirely in software efficiency.

---

## Root Cause Analysis

### Layer 1: Kernel Code — Missing Reorder Implementations

The **reorder optimization** (originally added for Q4_0 in PR #12035 by NeoZhangJianyu) separates quantized data from metadata (scales/zero-points) in GPU memory, enabling coalesced memory access patterns. This is the single biggest performance differentiator:

- **Without reorder**: iter_stride=64, 2 values per thread per iteration → poor memory coalescing
- **With reorder**: iter_stride=512, 16 values per thread per iteration → near-optimal memory access

Only Q4_0 had this optimization until PR #21527 added Q8_0. All K-quant formats (Q4_K, Q5_K, Q6_K) are missing DMMV reorder implementations.
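
For intuition, here is a host-side sketch of what the repack does for Q4_0. The struct mirrors ggml's `block_q4_0`; the repack function itself is illustrative, not the upstream implementation:

```cpp
#include <cstdint>
#include <cstring>
#include <vector>

// ggml's Q4_0 block: one fp16 scale + 32 x 4-bit quants, interleaved
// block-by-block in memory (array-of-structs).
struct block_q4_0 {
    uint16_t d;       // fp16 scale (stored as raw bits here)
    uint8_t  qs[16];  // 32 x 4-bit quants packed two per byte
};

// Reorder = repack to struct-of-arrays: all quant bytes contiguous, then all
// scales. Neighboring work-items then load neighboring bytes (coalesced).
static void reorder_q4_0(const block_q4_0 *src, size_t nblocks,
                         std::vector<uint8_t> &qs_out,
                         std::vector<uint16_t> &d_out) {
    qs_out.resize(nblocks * sizeof(src[0].qs));
    d_out.resize(nblocks);
    for (size_t i = 0; i < nblocks; ++i) {
        std::memcpy(&qs_out[i * sizeof(src[0].qs)], src[i].qs, sizeof(src[0].qs));
        d_out[i] = src[i].d;
    }
}

int main() {
    std::vector<block_q4_0> blocks(8, block_q4_0{});
    std::vector<uint8_t> qs;
    std::vector<uint16_t> d;
    reorder_q4_0(blocks.data(), blocks.size(), qs, d);
    return qs.size() == 128 && d.size() == 8 ? 0 : 1;
}
```
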

**Files involved:**

- `llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp` (dispatch logic)
- `llama.cpp/ggml/src/ggml-sycl/dmmv.cpp` (DMMV kernels)
- `llama.cpp/ggml/src/ggml-sycl/mmvq.cpp` (MMVQ kernels)

### Layer 2: Architecture Blindness — No Xe1 vs Xe2 Differentiation

The SYCL backend treats all Intel GPUs identically. There's no runtime adaptation for:

- Different L2 cache sizes (Xe1: 16MB, Xe2: larger)
- Different optimal block sizes (Xe1: 64, Xe2: 128)
- Different prefetch depths
- Different vector widths

This explains why Xe2-specific regressions occur: optimizations tuned for Xe1 can be counterproductive on Xe2's different memory hierarchy.
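
A sketch of the kind of runtime differentiation that is missing. This is entirely hypothetical: the helper, the tunables, and the prefetch values are illustrative, with only the block sizes taken from the list above:

```cpp
// Hypothetical per-architecture tuning table; nothing like this exists in
// the SYCL backend today.
enum class xe_arch { XE1, XE2, UNKNOWN };

struct xe_tuning {
    int block_size;      // workgroup tile width
    int prefetch_depth;  // software prefetch distance (values invented here)
};

static xe_tuning pick_tuning(xe_arch arch) {
    switch (arch) {
        case xe_arch::XE1: return {64, 2};   // Alchemist
        case xe_arch::XE2: return {128, 4};  // Battlemage / Lunar Lake
        default:           return {64, 1};   // conservative fallback
    }
}

int main() { return pick_tuning(xe_arch::XE2).block_size == 128 ? 0 : 1; }
```
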

### Layer 3: Driver/Runtime — Kernel Version Coupling

Intel's compute-runtime has tight coupling with the Linux kernel's i915/xe driver. Changes in kernel memory management (e.g., SVM, CCS enablement, bindless heaps) break the userspace stack. The kernel 6.18 regression is the latest in a series:

- Kernel 6.6.26+: Fence timeouts (CCS changes)
- Kernel 6.8+: Various hangs
- Kernel 6.18: Complete memory allocation failure

### Layer 4: Organizational — Intel's Strategic Confusion

Intel has multiple teams working on overlapping GPU inference stacks without coordination:

1. **Intel Analytics** (China): IPEX-LLM → archived
2. **Intel OpenVINO team**: OpenVINO inference runtime
3. **Intel vLLM team**: intel/vllm docker fork
4. **Intel PyTorch team**: intel-extension-for-pytorch
5. **Intel compiler team**: DPC++/SYCL (llvm sycl branch)
6. **Intel compute-runtime team**: Level Zero driver
7. **Codeplay** (Samsung subsidiary): SYCL backend contributions to llama.cpp
8. **Community volunteers**: Most of the actual llama.cpp SYCL work

These teams use different oneAPI versions, different PyTorch versions, and target different benchmarks. There's no unified "make Intel GPUs fast for LLM inference" strategy.

---

## Speculations & Hypotheses

### H1: IPEX-LLM's Performance Secret Was Kernel Specialization, Not Magic

IPEX-LLM's 50-80% advantage over upstream llama.cpp SYCL likely comes from:

1. **Hand-tuned DPAS kernels** for specific quantization formats
2. **oneDNN integration** for prompt processing GEMM (not available in llama.cpp)
3. **Memory layout optimizations** that upstream hasn't replicated

The fact that PR #21527 achieved 66% bandwidth for Q8_0 (close to IPEX-LLM's ~61% on Arc 140V) suggests that the reorder approach can close much of the gap, but the closed-source DPAS kernels remain unrecoverable.

### H2: The Xe2 Memory Subsystem Has Different Optimal Access Patterns

The fact that Q8_0 was *faster* on Xe1 but *catastrophically slower* on Xe2 suggests fundamental architectural differences in:

- L2 cache behavior (Xe2 may have different cache-line policies)
- Memory controller scheduling (Xe2 GDDR6 vs Xe1 GDDR6 may have different timing)
- EU thread scheduling (Xe2 may have different SIMT behavior)

Without Intel publishing detailed Xe2 microarchitecture documentation, kernel developers are flying blind.

### H3: The Driver Stack Will Remain Fragile

Intel's ongoing transition from the i915 to the xe kernel driver, combined with compute-runtime's tight kernel coupling, suggests that:

- New kernel versions will periodically break GPU compute
- Docker images will be pinned to specific kernel versions
- Users on rolling-release distros (Arch, Fedora) will be most affected

### H4: OpenVINO Is Intel's Real Strategy, But It Doesn't Help llama.cpp

Intel seems to be positioning OpenVINO as its primary inference runtime. OpenVINO has:

- Its own model format (not GGUF)
- Its own quantization pipeline (not GPTQ/AWQ/GGUF)
- Better performance out of the box

But the local LLM community overwhelmingly uses GGUF/llama.cpp, not OpenVINO. Intel's strategy doesn't align with user demand.

### H5: The Community Could Fix Most Issues With Proper Resources

The reorder optimization framework is extensible by design. With focused effort, the remaining quantization types could be optimized:

- Q4_K DMMV reorder: ~2-3 weeks of focused work
- Q5_K, Q6_K reorder: ~4-6 weeks each
- Xe2-specific tuning: ~2-4 weeks with profiling tools
- DPAS integration: ~2-3 months for quantized formats

The bottleneck is not technical complexity but developer time.

---

## Suggested Solutions

### Priority 1: Critical Fixes (1-2 weeks)

| Task | Effort | Impact | Status |
|------|--------|--------|--------|
| Merge PR #21527 (Q8_0 reorder) | Done | 3.1x Q8_0 speedup | Pending review |
| Implement Q4_K DMMV reorder | Medium | 40% speedup for Q4_K | Not started |
| Fix Arc 140T minSubgroupSize detection | Small | Enables coopmat | Not started |
| Document kernel version compatibility | Small | Prevents user frustration | Not started |

### Priority 2: Quantization Coverage (1-3 months)

| Task | Effort | Impact |
|------|--------|--------|
| Add Q5_K to reorder framework | Medium | ~2x speedup |
| Add Q6_K to reorder framework | Medium | ~1.5x speedup |
| Fix IQ4_NL kernel (14% → 50%+ BW) | Hard | ~3.5x speedup |
| Increase DMMV iter_stride for non-reorder types | Small | 20-30% improvement |

### Priority 3: Architecture Optimization (3-6 months)

| Task | Effort | Impact |
|------|--------|--------|
| Implement Xe2-specific kernel variants | Hard | Architecture-appropriate tuning |
| Enable DPAS for quantized matmul via prefetch optimization | Hard | Could reach 80%+ BW |
| Complete FlashAttention for SYCL | Medium | 2-3x prompt processing improvement |
| Add runtime GPU architecture detection (Xe1 vs Xe2) | Medium | Auto-tuning |

### Priority 4: Ecosystem Fixes (ongoing)

| Task | Effort | Impact |
|------|--------|--------|
| Create unified benchmark suite for Intel GPUs | Medium | Reproducible perf tracking |
| Test matrix: kernel × compute-runtime × oneAPI | Large | Prevent version combo failures |
| Coordinate with Intel for official contributions | Political | Sustainable maintenance |
| Investigate OpenVINO's optimized kernels for porting | Medium | Leverage existing Intel work |
| Add vLLM XPU support for GPTQ/AWQ | Large | Production quantized serving |

### Priority 5: Strategic Recommendations

1. **Intel should officially contribute to the llama.cpp SYCL backend** — this is where the users are
2. **Open-source IPEX-LLM's optimized kernels** before they're permanently lost
3. **Decouple compute-runtime from the kernel version** — validate on LTS kernels, add version negotiation
4. **Create a "one command" setup** for Intel GPU LLM inference (like `pip install torch` for CUDA)
5. **Publish Xe2 microarchitecture details** to enable community kernel optimization

---

## Repo Map & Key Files

```
repos/
├── llama.cpp/                       # Main inference engine
│   ├── ggml/src/ggml-sycl/
│   │   ├── ggml-sycl.cpp            # Dispatch logic (lines 3269-3660)
│   │   ├── dmmv.cpp                 # DMMV kernels (iter_stride issue)
│   │   ├── mmvq.cpp                 # MMVQ reorder kernels
│   │   ├── dequantize.hpp           # Dequantization functions
│   │   └── vecdotq.hpp              # Vector dot product implementations
│   ├── ggml/src/ggml-vulkan/
│   │   └── ggml-vulkan.cpp          # Vulkan backend (140T misdetection)
│   └── docs/backend/SYCL.md         # SYCL backend documentation
│
├── compute-runtime/                 # Level Zero + OpenCL driver
│   ├── shared/source/helpers/
│   │   └── bindless_heaps_helper.cpp  # Kernel 6.18 crash point
│   └── level_zero/                  # Level Zero implementation
│
├── intel-graphics-compiler/         # IGC - GPU shader/compute compiler
│   └── documentation/visa/instructions/
│       └── DPAS.md                  # DPAS instruction documentation
│
├── intel-extension-for-pytorch/     # IPEX - PyTorch GPU extension
│
├── ipex-llm/                        # IPEX-LLM (archived Jan 2026)
│   ├── docs/mddocs/Quickstart/      # Installation guides
│   └── [closed-source optimized kernels]
│
├── vllm/                            # vLLM mainline
├── vllm-xpu-kernels/                # Intel XPU-specific vLLM kernels
│
├── oneDNN/                          # Intel BLAS/GEMM library
├── openvino/                        # Intel's inference runtime
├── llvm/                            # DPC++ compiler (SYCL branch)
│   └── sycl/                        # SYCL runtime implementation
│
├── level-zero/                      # Level Zero loader + headers
└── sycl-tla/                        # SYCL Templates for Linear Algebra
```

---

## References

### Critical GitHub Issues

| # | Repo | Title | Severity |
|---|------|-------|----------|
| [#21517](https://github.com/ggml-org/llama.cpp/issues/21517) | llama.cpp | Q8_0 4x slower on Arc Pro B70 | Critical |
| [#21527](https://github.com/ggml-org/llama.cpp/pull/21527) | llama.cpp | Q8_0 reorder fix (3.1x speedup) | Critical (fix) |
| [#12570](https://github.com/ggml-org/llama.cpp/discussions/12570) | llama.cpp | Current status of Intel Arc GPUs | High (discussion) |
| [#5277](https://github.com/ggml-org/llama.cpp/discussions/5277) | llama.cpp | SYCL Long Term Features & Issues | High (tracking) |
| [#12318](https://github.com/intel/ipex-llm/issues/12318) | ipex-llm | K-quant crashes on Xe2 iGPU | Critical |
| [#12991](https://github.com/intel/ipex-llm/issues/12991) | ipex-llm | Vulkan faster than SYCL on B580 | Medium |
| [#875](https://github.com/intel/compute-runtime/issues/875) | compute-runtime | Kernel 6.18 breaks Level Zero | Critical |
| [#788](https://github.com/intel/compute-runtime/issues/788) | compute-runtime | sycl-ls fails on B580 | High |

### Community Resources

- [llm-tracker.info Intel GPU guide](https://llm-tracker.info/howto/Intel-GPUs) — Comprehensive setup/performance data
- [CraftRigs B580 LLM review](https://craftrigs.com/reviews/intel-arc-b580-local-llm-performance/) — Honest benchmark assessment
- [Hacker News discussion](https://news.ycombinator.com/item?id=42500245) — Software stack critiques
- [Phoronix Vulkan benchmarks](https://www.phoronix.com/review/llama-cpp-vulkan-eoy2025) — Cross-vendor Vulkan comparison
- [vLLM XPU docs](https://docs.vllm.ai/en/stable/models/hardware_supported_models/xpu/) — Supported models matrix

### Key People

- **NeoZhangJianyu**: Primary SYCL backend maintainer (spare-time contributor)
- **Rbiessy (Codeplay)**: Working on mul_mat_vec_q kernel optimization, prefetch, DPAS
- **0cc4m**: Vulkan backend developer for Intel GPUs
- **PMZFX**: Filed #21517, submitted PR #21527 (Q8_0 reorder fix)
- **lhl (llm-tracker.info)**: Comprehensive benchmarking and documentation

---

*This document is part of Phase 1: Research & Data Preparation.*
*No driver/framework modifications were made during this phase.*
*Companion documents: `research/community_issues/issues_and_discourse_minimax.md`, `research/kernels/kernel_analysis_minimax.md`*

@@ -0,0 +1,206 @@
# Intel Arc GPU LLM Inference: Driver & Software Stack Research Overview

**Date:** 2026-04-15
**Scope:** Research-only phase. No code or driver modifications were made. This document collects online community discourse, bug reports, and architectural analysis to identify why Intel Arc GPUs underperform on quantized LLM inference and where the relevant software stacks are misaligned.

---

## 1. Executive Summary

Intel Arc GPUs (Alchemist / Battlemage) are mechanically capable LLM inference cards on paper—large VRAM pools (up to 24 GB), wide memory buses (456–608 GB/s), and dedicated XMX matrix engines. In practice, community reports consistently describe **severe performance cliffs on quantized models**, **backend-specific kernel inefficiencies**, and **driver/runtime instability** that prevent Arc from reaching the bandwidth utilization seen on NVIDIA/AMD hardware.

The core observation across Reddit, GitHub Issues, and Intel documentation is that **token generation (TG) on quantized GGUF models often achieves only 20–40% of theoretical memory bandwidth**, while prompt processing (PP) can swing wildly depending on which backend (SYCL, Vulkan, OpenVINO, IPEX-LLM) is used. The problem is **not a single driver bug**; it is a misalignment between:

- **Kernel implementations** in `llama.cpp`'s SYCL/Vulkan backends that were optimized for NVIDIA/AMD data layouts and not fully ported for the Intel Xe architecture.
- **The Intel graphics runtime stack** (Compute Runtime / Level Zero / IGC), which exposes the hardware but relies on user-space kernels to extract performance.
- **Rapid ecosystem churn**: IPEX-LLM was archived in Jan 2026, PyTorch XPU support moved upstream, and vLLM is mid-migration from IPEX to a new `vllm-xpu-kernels` backend, leaving users with fragmented, often conflicting setup instructions.

---

## 2. The Driver & Runtime Stack

For LLM inference on Intel Arc, the following layers are involved:

| Layer | Project / Driver | Role |
|-------|------------------|------|
| **Kernel driver** | `i915` / `xe` (Linux), Intel Graphics Driver (Windows) | Base GPU scheduling, memory management |
| **Compute Runtime** | `intel/compute-runtime` (NEO) | OpenCL + Level Zero driver; exposes SYCL devices |
| **Graphics Compiler** | Intel Graphics Compiler (IGC) | JIT-compiles SYCL kernels to Xe ISA |
| **Math libs** | oneMKL + oneDNN | GEMM/SDPA backends for PyTorch/SYCL |
| **Mesa Vulkan** | `anv` (Intel Vulkan driver) | Backend for `llama.cpp` Vulkan path |
| **Framework integrations** | `ipex-llm` (archived), `intel-extension-for-pytorch`, `vllm-xpu-kernels`, upstream `vllm` | User-facing inference stacks |

### Key Repositories Pulled

```
repos/
├── llama.cpp                     # SYCL & Vulkan backends, GGUF quantization kernels
├── ipex-llm                      # Intel's former integration layer (archived Jan 2026)
├── intel-extension-for-pytorch   # PyTorch XPU extension (also archived / deprecated)
├── compute-runtime               # Intel Level Zero / OpenCL driver (NEO)
├── oneDNN                        # Intel deep-learning primitive library
├── vllm                          # vLLM mainline (XPU backend in flux)
└── vllm-xpu-kernels              # New dedicated Intel kernel repo for vLLM
```

---

## 3. Problems Identified in Community Discourse

### 3.1 Quantized Model Performance Is Disproportionately Bad

The most repeated complaint is that **quantized models run far slower than they should** given their reduced size.

**Example: `llama.cpp` SYCL backend on Arc Pro B70 (Xe2 / Battlemage)**
*(GitHub Issue #21517, Reddit r/LocalLLaMA)*

| Quant | Model Size | TG (t/s) | Effective BW | % of Peak |
|-------|------------|----------|--------------|-----------|
| Q4_K_M | 15.6 GiB | 20.56 | 321 GB/s | 53% |
| Q8_0 | 26.6 GiB | 4.88 | 130 GB/s | **21%** |

**Critical finding:** Q8_0 is **4× slower** than Q4_K_M despite moving only **1.7× more bytes**. This rules out pure memory-bandwidth limits and points to **kernel-level inefficiency**.

The same issue affects both SYCL and Vulkan backends equally, and it persists when splitting across two GPUs with abundant free VRAM. Updating the Compute Runtime / IGC had **no effect** on Q8_0 token-generation speed, confirming the bottleneck is in the inference-framework kernels, not the compiler or driver.

#### Inverse anomaly on older Arc A770 (Xe1 / Alchemist)

On Vulkan, some users report the *opposite*: Q8_0 **outperforms** Q4/Q6 in prompt processing (Issue #19887: ~600 t/s for Q8_0 vs ~200 t/s for Q6_K). This suggests the quantization-kernel imbalance is **architecture-specific**, not universal.

### 3.2 Missing or Partial "Reorder" Optimizations

`llama.cpp`'s SYCL backend introduced a **reorder optimization** (PR #12035, Feb 2025) that separates quantized weights from their scale factors so the GPU can load them with coalesced memory access. This optimization was implemented **only for Q4_0** and later extended to Q4_K and Q6_K. **Q8_0 was never added** to the reorder/MMVQ fast path.

Consequences:

- Q8_0 falls back to the generic **DMMV** kernel with `iter_stride = 64` (2 values per thread per iteration).
- The reorder path for Q4_0 uses `iter_stride = 512` (16 values per thread per iteration).
- A forced DMMV path for Q4_K_M drops its TG speed by ~40%, but forcing MMVQ on Q8_0 does **not** recover performance—both paths are slow for Q8_0 on Xe2.

Community speculation: the 34-byte `block_q8_0` layout is not a power of two, making it harder to vectorize efficiently on Intel's EU/SIMD width without explicit data-layout rewriting.
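
The layout in question, as defined in ggml; the 2-byte scale plus 32 one-byte quants gives the 34-byte size:

```cpp
#include <cstdint>

// block_q8_0 as defined in ggml: a 2-byte fp16 scale followed by 32 int8
// quants = 34 bytes. Because 34 is not a power of two, consecutive blocks
// drift across cache-line and vector-load boundaries — the misalignment the
// reorder rewrite avoids for Q4_0 by storing scales separately.
typedef uint16_t ggml_half; // fp16 storage (raw bits)

#define QK8_0 32
typedef struct {
    ggml_half d;         // delta (scale)
    int8_t    qs[QK8_0]; // quants
} block_q8_0;

static_assert(sizeof(block_q8_0) == sizeof(ggml_half) + QK8_0,
              "wrong q8_0 block size/padding");
```
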

### 3.3 XMX Matrix Engines Are Underutilized for Token Generation

Intel Arc has **Xe Matrix Extensions (XMX)**—analogous to NVIDIA Tensor Cores. They are used automatically by oneMKL/oneDNN for **FP16/BF16 GEMM**, which is why prompt processing (compute-bound) sees large speedups when `-DGGML_SYCL_F16=ON` is enabled (reported ~2.4× improvement: 302 → 725 t/s).

However, **token generation uses DMMV/MMVQ**, which are memory-bandwidth-bound dequantize-and-dot kernels. The SYCL backend does **not** currently route these small quantized matrix-vector operations through XMX/DPAS instructions. Developers (Codeplay/Intel contributors in Discussion #12570) note they are investigating DPAS directly, but consider the kernel "memory bound" in current configurations, so adding matrix engines may not help until memory-access patterns are fixed first.

### 3.4 Vulkan Backend Regressions and Driver Sensitivity

The Vulkan path in `llama.cpp` is maintained independently and sees recurring Intel-specific regressions:

- **Mesa version sensitivity**: Newer Mesa versions have reportedly **slowed TG on Intel** (Discussion #10879).
- **Performance degradation between builds**: A drop from 53 → 42 t/s on A770 was bisected to changes between `b7189` and `b7209` (Issue #17628).
- **Cooperative matrix TDRs**: Windows drivers (101.8509/101.8531) cause GPU timeouts when `VK_KHR_cooperative_matrix` is enabled with `llama.cpp` Vulkan (Issue #20554).
- **BMG (Battlemage) support lag**: B580 GPUs require very recent Mesa + kernel combinations; users often see fallback-to-CPU behavior or "unsupported device" warnings (Discussion #12570).

### 3.5 IPEX-LLM Deprecation and Ecosystem Fragmentation

`intel/ipex-llm` was **archived on January 28, 2026**. It was the primary documented way to run Ollama/llama.cpp on Intel Arc via Docker. Since archiving:

- Open issues (prompt-processing slowdowns, container GPU visibility, B580 model-load failures) are frozen.
- Users are left choosing between:
  1. **Upstream `llama.cpp` SYCL** (manual oneAPI setup, variable quant performance).
  2. **OpenVINO** (Intel-recommended, but not GGUF-native).
  3. **vLLM + IPEX** (deprecated; vLLM is migrating to `vllm-xpu-kernels`).
  4. **Vulkan `llama.cpp`** (easier setup, but lower peak performance and regressions).

### 3.6 Compute Runtime / Level Zero Issues

The `intel/compute-runtime` repository contains long-standing issues that affect LLM workloads:

- **4 GB allocation limit** on Arc A770 16 GB (`CL_DEVICE_MAX_MEM_ALLOC_SIZE`) — Issue #627. Large single buffers must be split or use host paging.
- **Incorrect free-memory reporting** — Issue #750. The runtime reports `free_memory == global_mem_size` even when VRAM is in use, confusing memory managers in `llama.cpp` and vLLM.
- **BAR / SVM allocation failures on Arrow Lake** — Issue #890. OEM laptops with a fixed 256 MB BAR hang or fail to allocate >4 GB, verified on both Linux and Windows.
- **iGPU + dGPU conflicts**: SYCL initialization can fail or select the wrong device when both an Intel iGPU and an Arc dGPU are present (Issues #13775, #9106).

### 3.7 vLLM on Intel XPU Is Still Immature

vLLM upstream added Intel XPU support, but user reports highlight:

- **B-series crashes during model inspection** (Issue #27408): `SIGABRT` inside `drm_neo.cpp` on Battlemage.
- **Dual-GPU scaling breaks TG**: Two A770s in tensor/pipeline parallel double throughput but **halve text-generation speed** to 3–4 t/s (Issue #12190).
- **AWQ/INT4 pre-quantized models fail** because the torchao codepath is CUDA-only (Issue #269 in `intel/llm-scaler`).
- **IPEX deprecation**: vLLM release notes explicitly "deprecated IPEX for XPU, switched to vllm-xpu-kernels" (v0.11.x+), but the new kernel repo is still catching up on feature parity.

---

## 4. Speculations on Root Causes

Based on the collected discourse, the following hypotheses best explain the observed gaps:

### A. Kernel Data Layouts Are CUDA-Centric

`llama.cpp`'s quantized kernels (DMMV, MMVQ) were originally written and tuned for NVIDIA warp sizes, shared-memory banking, and coalesced-load patterns. The Intel Xe architecture has **different SIMD widths, cache-line behavior, and scatter-gather characteristics**. The "reorder" fix for Q4_0 proves that **rewriting the data layout specifically for Intel** yields large gains, but this work was not systematically extended to all quant types.

### B. SYCL Is a Thick Abstraction with Opaque Performance Characteristics

Developers (both community and Intel-affiliated) note that the SYCL stack (DPC++ compiler → IGC → Level Zero) works, but makes it **difficult to reason about whether the generated ISA actually uses XMX/DPAS or falls back to scalar ALU paths**. The IGC JIT compiler does auto-vectorization, but for non-power-of-two block sizes (e.g., Q8_0's 34 bytes), the generated code may serialize loads and leave EUs idle.

### C. The Driver Stack Correctly Exposes Hardware, But Does Not Hide Its Quirks

Intel Compute Runtime is functionally correct—models load, kernels execute, and results are valid—but it does **not** provide the same transparent memory-management or profiling feedback that CUDA/ROCm drivers offer. Issues like the 4 GB `MAX_MEM_ALLOC_SIZE` cap, wrong free-memory reporting, and iGPU+dGPU device-selection bugs force **application-layer workarounds** that are inconsistently implemented across frameworks.

### D. Rapid Corporate Strategy Shifts Create Maintenance Debt

The shift from **BigDL-LLM → IPEX-LLM → archived IPEX-LLM**, and the parallel shift from **IPEX → native PyTorch XPU → vllm-xpu-kernels**, means optimizations and bug fixes were repeatedly abandoned mid-stream. Community contributors (e.g., Codeplay, private maintainers in Discussion #12570) are left carrying the SYCL backend in `llama.cpp` with limited resources.

---

## 5. Potential Solutions & Lines of Investigation

**No code changes were made in this phase.** The following are research-backed proposals for a subsequent implementation phase.

### 5.1 Complete the Reorder Optimization for Q8_0 and K-Quants in `ggml-sycl`

- Implement `dequantize_block_q8_0_reorder` and add Q8_0 to `ggml_sycl_supports_reorder_mmvq()`, as sketched below.
- Increase `iter_stride` for Q8_0 DMMV to match the 8× factor used in the Q4_0 reorder kernels.
- **Expected impact:** Close the 4× TG gap between Q8_0 and Q4_K_M on Xe2 GPUs.
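
A sketch of the dispatch side of that change. The function name exists upstream; the case list here is reconstructed from the support tables in this research, so treat it as illustrative rather than a patch:

```cpp
// Minimal stand-ins so the sketch is self-contained; in-tree these come from
// ggml.h / ggml-sycl.cpp.
enum ggml_type { GGML_TYPE_Q4_0, GGML_TYPE_Q4_K, GGML_TYPE_Q6_K, GGML_TYPE_Q8_0 };

inline bool ggml_sycl_supports_reorder_mmvq(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q4_K:
        case GGML_TYPE_Q6_K:
        case GGML_TYPE_Q8_0: // proposed: route Q8_0 onto the MMVQ fast path
            return true;
        default:
            return false;
    }
}
```

The dispatch flag alone is not enough: a matching `dequantize_block_q8_0_reorder` kernel and a repack step at model-load time must exist for the fast path to be valid.
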

### 5.2 Profile SYCL Kernels with Intel-Specific Tools

- Use **Intel VTune** or **ze_tracer** on the slow Q8_0 DMMV kernel to measure:
  - EU utilization
  - L3 cache miss rates
  - Memory-latency hiding efficiency
- **Goal:** Determine whether the bottleneck is load coalescing, register pressure, or insufficient occupancy.

### 5.3 Implement DPAS/XMX Kernels for Quantized Matrix-Vector Multiplication

- Target **DPAS/DPASW** instructions directly (or via lightweight SYCL extensions) for INT8/INT4 dot products in MMVQ.
- oneDNN/oneMKL already use XMX for FP16 GEMM; the gap is in the **custom quantized DMMV/MMVQ kernels** inside `llama.cpp`.

### 5.4 Audit and Fix Intel Compute Runtime Memory Reporting

- Patch `compute-runtime` to return accurate `free_memory` via Level Zero (`zesMemoryGetState`) so that `llama.cpp` and vLLM can make correct offloading decisions without the `ZES_ENABLE_SYSMAN=1` workaround. A minimal query sketch follows this list.
- Investigate the 4 GB `CL_DEVICE_MAX_MEM_ALLOC_SIZE` cap on 16 GB cards.
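
A minimal sketch of the Sysman query involved, with error handling omitted; it assumes a Sysman-enabled device handle, which on older runtimes requires `ZES_ENABLE_SYSMAN=1`:

```cpp
#include <cstdio>
#include <vector>
#include <level_zero/zes_api.h>

// Query per-module free/total VRAM via Level Zero Sysman. This is the data
// source 5.4 proposes frameworks rely on instead of assuming free == total.
void print_free_memory(zes_device_handle_t dev) {
    uint32_t count = 0;
    zesDeviceEnumMemoryModules(dev, &count, nullptr);
    std::vector<zes_mem_handle_t> mods(count);
    zesDeviceEnumMemoryModules(dev, &count, mods.data());
    for (zes_mem_handle_t m : mods) {
        zes_mem_state_t state = {};
        state.stype = ZES_STRUCTURE_TYPE_MEM_STATE;
        zesMemoryGetState(m, &state);
        std::printf("free %llu / %llu bytes\n",
                    (unsigned long long)state.free,
                    (unsigned long long)state.size);
    }
}
```
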

### 5.5 Stabilize the Vulkan Backend for Xe2

- Add BMG/B580 device IDs and tune pipeline layouts in `ggml-vulkan`.
- Bisect and revert or adapt the Mesa change that caused the TG regression on Intel Arc.

### 5.6 Align the vLLM XPU Quantization Roadmap

- Ensure `vllm-xpu-kernels` supports the same GGUF/AWQ/GPTQ paths that CUDA does, or at minimum documents unsupported formats clearly.
- Fix the B-series model-inspection crash (`drm_neo.cpp` abort) before broader Battlemage deployment.

---

## 6. Key Data Sources & References

| Source | Significance |
|--------|--------------|
| `ggml-org/llama.cpp#21517` | Rigorous Q8_0 bandwidth analysis on Arc Pro B70; proves kernel inefficiency |
| `ggml-org/llama.cpp#12035` | Original Q4_0 reorder optimization; template for fixing Q8_0 |
| `ggml-org/llama.cpp#12570` | Maintainer discussion on SYCL vs Vulkan, XMX, DPAS, and the Codeplay roadmap |
| `ggml-org/llama.cpp#19887` | Inverse quant anomaly on A770 Vulkan (Q8_0 faster than Q4/Q6) |
| `intel/ipex-llm#12632` | Prompt-processing slowdown on Arc A380 in Docker (frozen, repo archived) |
| `intel/ipex-llm#12994` | B580 model load failures with Level Zero |
| `intel/compute-runtime#627` | 4 GB allocation limit on Arc A770 16 GB |
| `intel/compute-runtime#890` | Arrow Lake >4 GB allocation failures / system hangs |
| `vllm-project/vllm#27408` | B-series SIGABRT during model inspection |
| `vllm-project/vllm#33214` | Migration from IPEX to `vllm-xpu-kernels` |
| Reddit r/LocalLLaMA (2025) | User-reported bandwidth math, "3.1× Q8_0 speedup" thread |
| Reddit r/IntelArc | Benchmark comparisons: Vulkan vs SYCL, Ollama setup guides |

---

## 7. Conclusion

Intel Arc GPUs are **not fundamentally broken** for LLM inference, but they are **victims of incomplete kernel optimization and ecosystem churn**. The hardware has competitive bandwidth and matrix engines, yet the dominant open-source inference path (`llama.cpp` SYCL) only optimized Q4_0's data layout, leaving Q8_0 and several K-quants on a slow generic fallback. Simultaneously, Intel's own integration layer (`ipex-llm`) was archived while downstream projects (vLLM, PyTorch) are mid-transition to new backends.

The highest-leverage fixes, in order, appear to be:

1. **Kernel-side:** Extend the reorder/DMMV/MMVQ optimizations to Q8_0 and the remaining quants in `ggml-sycl`.
2. **Driver-side:** Fix memory reporting and large-buffer allocation limits in the Compute Runtime.
3. **Ecosystem-side:** Stabilize `vllm-xpu-kernels` and document a single, maintained path for Arc users.

All relevant repositories have been pulled locally for the next phase of analysis.

@@ -0,0 +1,376 @@
|
|||||||
|
# Intel Arc GPU LLM Inference Research Overview
|
||||||
|
|
||||||
|
**Date:** April 2026
|
||||||
|
**Scope:** Intel Arc (Alchemist Xe1, Battlemage Xe2) GPUs for LLM inference, with focus on quantized model performance issues
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 1. Executive Summary
|
||||||
|
|
||||||
|
Intel Arc GPUs suffer from **severe performance underperformance** on quantized LLM inference compared to theoretical hardware capabilities. Community benchmarks reveal that token generation often achieves only **21-40% of theoretical memory bandwidth utilization**, while NVIDIA RTX and AMD GPUs typically achieve 80-95%. The root causes are multi-layered: missing kernel optimizations, quantization-specific bottlenecks, architecture detection bugs, and an immature software stack.
|
||||||
|
|
||||||
|
---
|
||||||
|
|
||||||
|
## 2. Hardware & Software Landscape
|
||||||
|
|
||||||
|
### 2.1 Supported Intel GPUs
|
||||||
|
|
||||||
|
| GPU | Architecture | Memory Bandwidth | Status |
|
||||||
|
|-----|-------------|------------------|--------|
|
||||||
|
| Arc A770 (16GB) | Xe1/Alchemist | 512 GB/s | Active support |
|
||||||
|
| Arc A750 (8GB) | Xe1/Alchemist | 448 GB/s | Active support |
|
||||||
|
| Arc B580 | Xe2/Battlemage | 456 GB/s | Partial support (driver issues) |
|
||||||
|
| Arc B70 Pro | Xe2/Battlemage | 608 GB/s | Active, but regressed |
|
||||||
|
| Arc 140T (iGPU) | Xe2/Arrow Lake H | Unified | **Broken** - misdetected |
|
||||||
|
| Arc 140V (iGPU) | Xe2/Lunar Lake | Unified | Working |
|
||||||
|
|
||||||
|
### 2.2 Software Stack
|
||||||
|
|
||||||
|
```
|
||||||
|
┌─────────────────────────────────────────────────────────┐
|
||||||
|
│ User Applications │
|
||||||
|
│ (Ollama, vLLM, llama.cpp CLI, etc.) │
|
||||||
|
├─────────────────────────────────────────────────────────┤
|
||||||
|
│ Inference Frameworks │
|
||||||
|
│ IPEX-LLM (PyTorch) │ vLLM (Intel Port) │ llama.cpp │
|
||||||
|
├─────────────────────────────────────────────────────────┤
|
||||||
|
│ Backend Implementations │
|
||||||
|
│ ┌──────────┐ ┌──────────┐ ┌──────────────────────┐ │
|
||||||
|
│ │ SYCL │ │ Vulkan │ │ OpenVINO (IPEX-LLM) │ │
|
||||||
|
│ │ backend │ │ backend │ │ │ │
|
||||||
|
│ └──────────┘ └──────────┘ └──────────────────────┘ │
|
||||||
|
├─────────────────────────────────────────────────────────┤
|
||||||
|
│ Intel Software Components │
|
||||||
|
│ oneAPI DPC++ │ IGC Compiler │ Level Zero │ oneDNN │
|
||||||
|
├─────────────────────────────────────────────────────────┤
|
||||||
|
│ GPU Hardware │
|
||||||
|
└─────────────────────────────────────────────────────────┘
|
||||||
|
```
|
||||||
|
|
||||||
|
### 2.3 Key Repositories
|
||||||
|
|
||||||
|
- **llama.cpp**: Main inference engine with SYCL and Vulkan backends for Intel GPUs
|
||||||
|
- Location: `repos/llama.cpp/`
|
||||||
|
- Key paths: `ggml/src/ggml-sycl/`, `ggml/src/ggml-vulkan/`
|
||||||
|
|
||||||
|
- **IPEX-LLM**: Intel's optimized PyTorch extension (archived Jan 2026)
|
||||||
|
- Location: `repos/ipex-llm/`
|
||||||
|
- Note: Archive status raises concerns about future maintenance
|
||||||
|
|
||||||
|
- **vLLM Intel**: Production-grade serving with Arc Pro B-series support
|
||||||
|
- Reference: https://blog.vllm.ai/2025/11/11/intel-arc-pro-b.html

---

## 3. Critical Problems Identified

### 3.1 Q8_0 Quantization Catastrophic Performance (Issue #21517)

**Severity:** Critical - **4-5x slower than expected**

The Arc Pro B70 (Xe2) achieves only **21-24% of theoretical memory bandwidth** on Q8_0 quantized models, compared to **53-64%** for Q4_K_M:

| Quantization | Size (GiB) | Token Gen (t/s) | BW Utilization |
|--------------|------------|-----------------|----------------|
| Q4_K_M | 15.58 | 20.56 | 53% |
| Q8_0 | 26.62 | 4.88 | 21% |

**Root Cause:**

- Q8_0 is stuck on the **DMMV kernel path** (generic dequantize-mul-mat-vec)
- iter_stride = 64 → only 2 values processed per thread per iteration
- The Q4_0 reorder kernel uses iter_stride = 512 → 16 values per iteration (8x more)
- Q8_0's 34-byte block structure is non-power-of-2, causing cache line misalignment
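
To make the stride gap concrete, here is a deliberately simplified sketch of how `iter_stride` translates into per-thread work. This is not the llama.cpp kernel: the loop shape, the float types, and the 32-thread figure are illustrative assumptions only.

```cpp
// Simplified model of the two DMMV loop shapes (not llama.cpp source).
// Each of `nthreads` work-items starts at its own offset and strides across
// the row; iter_stride == nthreads * vals_per_iter, so more values per
// iteration means fewer loop trips and wider, more coalescable loads.
float dot_row(const float *row, const float *vec, int ncols,
              int vals_per_iter, int nthreads, int tid) {
    const int iter_stride = nthreads * vals_per_iter;
    float sum = 0.0f;
    for (int i = tid * vals_per_iter; i + vals_per_iter <= ncols; i += iter_stride) {
        for (int j = 0; j < vals_per_iter; ++j) {
            sum += row[i + j] * vec[i + j]; // dequantize + multiply in the real kernel
        }
    }
    return sum;
}

// Generic DMMV (Q8_0 today): vals_per_iter = 2  -> iter_stride = 64 with 32 threads.
// Reorder path (Q4_0):       vals_per_iter = 16 -> iter_stride = 512 with 32 threads.
```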

**Fix Status:** PR #21527 submitted (Apr 2026) - adds Q8_0 to the reorder framework, achieving a **3.1x speedup**

### 3.2 K-Quantization Crashes on Xe2 iGPU (Issue #12318)

**Severity:** Critical - crashes on Lunar Lake Arc 140V

```
Sub-group size 8 is not supported on the device
Exception at ggml-sycl.cpp:3164
```

**Root Cause:** IPEX-LLM's SYCL backend hardcodes sub-group size assumptions that don't hold on the Xe2 architecture.
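
A defensive pattern, sketched below under the assumption of a standard SYCL 2020 runtime (this is not IPEX-LLM code), is to query the device's supported sub-group sizes instead of hardcoding one:

```cpp
#include <sycl/sycl.hpp>
#include <algorithm>
#include <vector>

// Pick a sub-group size the device actually supports, preferring `wanted`.
// SYCL 2020 exposes the valid sizes via info::device::sub_group_sizes.
size_t pick_sub_group_size(const sycl::device &dev, size_t wanted) {
    const std::vector<size_t> sizes =
        dev.get_info<sycl::info::device::sub_group_sizes>();
    if (std::find(sizes.begin(), sizes.end(), wanted) != sizes.end()) {
        return wanted;                 // e.g. 8 is accepted on Xe1 parts
    }
    return *std::max_element(sizes.begin(), sizes.end()); // safe fallback
}
```

Since `[[sycl::reqd_sub_group_size(N)]]` takes a compile-time constant, the practical consequence is compiling kernel variants for each plausible size and dispatching on the queried value.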

**Note:** Upstream llama.cpp works, but with ~2x lower performance than IPEX-LLM.

### 3.3 Architecture Misdetection on Arc 140T (Issue #20776)

**Severity:** High - Cooperative Matrix completely disabled

The Arc 140T (Arrow Lake H, Xe2) reports `matrix cores: none` because:

- The driver reports `minSubgroupSize = 8`
- The code requires `minSubgroupSize == 16` to classify the device as `INTEL_XE2`
- The same driver branch on the Arc 140V reports `minSubgroupSize = 16`

**Impact:** All DPAS/matrix unit optimizations are skipped despite hardware support.

### 3.4 Missing/Incomplete Kernel Support Matrix

| Quantization | Reorder DMMV | Reorder MMVQ | SYCL | Vulkan |
|--------------|--------------|--------------|------|--------|
| Q4_0 | ✅ | ✅ | Fast | Fast |
| Q4_K_M | ❌ | ✅ | Medium | Medium |
| Q5_K_M | ❌ | ❌ | **Slow** | Medium |
| Q6_K | ❌ | ❌ | **Slow** | Medium |
| Q8_0 | ✅ (fixed) | ✅ (fixed) | Was broken | Was broken |
| IQ4_NL | ❌ | ❌ | **14% BW** | **Broken** |
| IQ4_XS | ❌ | ❌ | Slow | Medium |

### 3.5 Xe2/Battlemage Regression

On the Arc A770 (Xe1), Q8_0 is actually **faster** than Q4. On the Arc B70 (Xe2), Q8_0 is **4-5x slower**. This regression indicates that the existing kernel optimizations suit Xe1 but do not carry over to Xe2's different memory architecture.

---

## 4. Code Analysis: Misaligned Components

### 4.1 llama.cpp SYCL Backend

**Location:** `repos/llama.cpp/ggml/src/ggml-sycl/`

**Key Files:**

| File | Purpose | Issue |
|------|---------|-------|
| `ggml-sycl.cpp` | Main dispatch logic | Routes to the wrong kernels on Xe2 |
| `dmmv.cpp` | Generic dequantize mat-vec | Inefficient iter_stride for Q8_0 |
| `mmvq.cpp` | Optimized mat-vec quants | Missing Q4_K DMMV reorder support |
| `vecdotq.hpp` | Vector dot products | Suboptimal memory coalescing |

**Dispatch Logic Problem:**

```
Lines ~3269-3292: ggml_sycl_supports_reorder_* functions
- Q4_K supports reorder for MMVQ but NOT DMMV
- This forces Q4_K through the slower generic DMMV path
- Q5_K, Q6_K have NO reorder support at all
```
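
The shape of the fall-through can be sketched as follows. This is a C++ paraphrase of the dispatch behavior described above, with invented names, not an excerpt from `ggml-sycl.cpp`:

```cpp
// Paraphrased dispatch fall-through (invented names, not ggml-sycl.cpp code).
enum class quant { Q4_0, Q8_0, Q4_K, Q5_K, Q6_K };
enum class mul_mat_algo { DMMV_REORDER, MMVQ_REORDER, GENERIC };

bool supports_reorder_mmvq(quant t) {            // Q4_K has MMVQ reorder...
    return t == quant::Q4_0 || t == quant::Q8_0 || t == quant::Q4_K;
}
bool supports_reorder_dmmv(quant t) {            // ...but no DMMV reorder
    return t == quant::Q4_0 || t == quant::Q8_0;
}

mul_mat_algo choose_algo(quant t, bool prioritize_dmmv) {
    if (prioritize_dmmv) {
        if (supports_reorder_dmmv(t)) return mul_mat_algo::DMMV_REORDER;
        return mul_mat_algo::GENERIC;  // Q4_K lands here: slow generic path
    }
    if (supports_reorder_mmvq(t)) return mul_mat_algo::MMVQ_REORDER;
    return mul_mat_algo::GENERIC;      // Q5_K/Q6_K always land here
}
```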

### 4.2 llama.cpp Vulkan Backend

**Location:** `repos/llama.cpp/ggml/src/ggml-vulkan/`

**Issues:**

- DP4A/DPAS support is incomplete
- Subgroup size detection relies on driver-reported values (unreliable on Arrow Lake H)
- Memory access patterns are not optimized for the Xe2 cache hierarchy

### 4.3 IPEX-LLM (Archived)

**Location:** `repos/ipex-llm/`

**Status:** Archived as of January 28, 2026, raising maintainability concerns

**Key Issues:**

- Closed-source optimized kernels were never merged upstream
- Lags behind llama.cpp main development
- Version fragmentation (Ollama integration, vLLM integration, standalone C++)

---

## 5. Root Cause Hypotheses

### 5.1 Primary Hypothesis: Memory Access Pattern Mismatch

The Intel GPU memory hierarchy (L2 cache, SLM) has different optimal access patterns than NVIDIA/AMD hardware. Current kernels:

- Use generic dequantization that doesn't account for Xe2's larger L2 cache
- Process too few elements per thread, leaving EU utilization low
- Don't leverage prefetch mechanisms that work on CUDA/ROCm

**Evidence:** Q4_0 reorder achieves 56% bandwidth, but Q8_0 DMMV achieves only 21%. The difference lies purely in kernel design, not hardware capability.

### 5.2 Secondary Hypothesis: Compiler/Driver Inefficiencies

Intel's IGC (Intel Graphics Compiler):

- May not be generating optimal SIMD instructions for quantization kernels
- Register allocation for mixed precision (fp16 scales + int8 data) may be suboptimal
- Loop unrolling and vectorization may not be aggressive enough

**Evidence:** Driver updates (IGC 2.28.4 → 2.30.1) showed no improvement for Q8_0, suggesting the issue lies in the llama.cpp kernels rather than the compiler.

### 5.3 Tertiary Hypothesis: Architecture-Specific Tuning Missing

Xe1 (Alchemist) and Xe2 (Battlemage) have:

- Different L2 cache sizes (16MB on the A770 vs. larger on the B70)
- Different memory latency characteristics
- Different SIMD width preferences

Current code has **no architecture-aware tuning** - the same kernels run on all Intel GPUs.
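
A minimal sketch of what such tuning could look like; the enum and the numbers below are illustrative placeholders (loosely echoing the Xe1/Xe2 table in the companion misalignment analysis), not measured optima:

```cpp
// Hypothetical per-architecture tuning table; values are placeholders.
struct kernel_tuning {
    int block_size;     // work-group size
    int vals_per_iter;  // elements each work-item handles per loop trip
    int prefetch_depth; // software prefetch distance
};

enum class intel_arch { XE1, XE2, UNKNOWN };

kernel_tuning select_tuning(intel_arch arch) {
    switch (arch) {
        case intel_arch::XE1: return {64, 8, 2};
        case intel_arch::XE2: return {128, 16, 4};
        default:              return {64, 2, 0}; // conservative generic path
    }
}
```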

### 5.4 Ecosystem Fragmentation

```
IPEX-LLM:          Closed-source optimizations, lagging updates
llama.cpp SYCL:    Community maintained, gaps in coverage
llama.cpp Vulkan:  Good prompt processing, poor token gen
vLLM Intel:        Production-grade but limited model support
```

No unified optimization effort covers all quantization formats.

---

## 6. Proposed Solutions

### 6.1 Immediate Actions (Existing PRs)

1. **PR #21527** - Q8_0 Reorder Support (3.1x speedup for Q8_0)
   - Status: Submitted, needs testing on A770/A750
   - Impact: Major for users who need FP16-equivalent quality

2. **Issue #20776 Fix** - Add a device ID fallback for the Arc 140T

```cpp
// In get_device_architecture(), add a device-ID fallback for parts whose
// driver reports minSubgroupSize = 8 despite being Xe2 (Arc 140T variants,
// per the issue report):
if (props.deviceID == 0x7D51 || props.deviceID == 0x7D45) {
    return vk_device_architecture::INTEL_XE2;
}
```

### 6.2 Short-term Optimizations (1-3 months)

1. **Extend the Reorder Framework to K-Quants**
   - Implement a `reorder_qw_q4_k()` equivalent for the DMMV path
   - Add Q5_K and Q6_K to the reorder support list
   - Target: 2-3x speedup for these formats

2. **Increase DMMV iter_stride**
   - Current: 64 elements per iteration
   - Target: 512 (matching the Q4_0 reorder path)
   - Estimate: 50-100% speedup for formats stuck on DMMV

3. **Architecture-Aware Kernel Selection**
   - Detect Xe1 vs. Xe2 at runtime
   - Select optimal kernel variants per architecture
   - Enable Xe2-specific optimizations (larger prefetch, different block sizes)

### 6.3 Medium-term Improvements (3-6 months)

1. **DPAS/Matrix Engine Utilization**
   - Intel Xe2 has matrix units (DPAS instructions)
   - Current utilization: 0% for quantized formats
   - Target: use DPAS for Q4_K and Q8_0 via direct instruction insertion
   - Reference: Intel IGC DPAS documentation

2. **FlashAttention Implementation**
   - Currently disabled or only partially working on Intel
   - Critical for long-context models
   - Enable proper cooperative matrix usage

3. **Host-Side Kernel Submission Optimization**
   - Reduce submission overhead
   - Use SYCL Graphs to batch operations (see the sketch after this list)
   - Especially important for iGPU scenarios
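
As a sketch of the batching idea, assuming the experimental `sycl_ext_oneapi_graph` extension shipped with recent oneAPI DPC++ compilers (API details may differ by version), the per-token kernel sequence could be recorded once and replayed with a single submission per token:

```cpp
#include <sycl/sycl.hpp>

namespace sycl_exp = sycl::ext::oneapi::experimental;

// Record the decode-step kernel sequence once, then replay it per token,
// amortizing host-side submission overhead (which dominates on iGPUs).
void run_decode_loop(sycl::queue &q, int n_tokens) {
    sycl_exp::command_graph graph(q.get_context(), q.get_device());

    graph.begin_recording(q);
    // ... enqueue the per-token kernels (matmuls, attention, ...) on q ...
    graph.end_recording();

    auto exec = graph.finalize();
    for (int t = 0; t < n_tokens; ++t) {
        q.ext_oneapi_graph(exec); // one submission replays the whole sequence
    }
    q.wait();
}
```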

### 6.4 Long-term Recommendations

1. **Consolidate IPEX-LLM Optimizations**
   - Negotiate with Intel to open-source its closed-source kernels
   - Merge proven optimizations into llama.cpp mainline
   - Establish a maintenance commitment (the archive status is concerning)

2. **Comprehensive Quantization Coverage**
   - All K-quantizations need reorder support
   - IQ (i-quant) formats need optimization
   - AWQ format support for production deployments

3. **Unified Benchmark Suite**
   - Create an Intel-specific benchmark covering all quantization formats
   - Track regressions across releases
   - Profile with Intel VTune for systematic optimization

---

## 7. Testing & Validation Plan

### 7.1 Benchmark Scenarios

```bash
# Token generation bandwidth test
./llama-bench -m <model>.Q8_0.gguf -ngl 99 -pp 512 -tg 128

# Expected on B70 (608 GB/s):
# Q4_K_M: ~18-22 t/s (baseline)
# Q8_0:   ~15 t/s (after PR #21527) vs. 4.88 t/s (before)
# Target: 35-40 t/s (60-65% bandwidth utilization)
```
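
To interpret these numbers, bandwidth utilization can be estimated as tokens/s × bytes-streamed-per-token ÷ peak bandwidth, since each generated token must read essentially the whole weight file once. A small checker (C++ for consistency with the other sketches; sizes and speeds are the ones from the issue #21517 table):

```cpp
#include <cstdio>

// Token generation is memory-bound: each token streams roughly the full
// model weights once, so utilization ~= t/s * model_bytes / peak_bytes_per_s.
double bw_utilization(double tok_per_s, double model_gib, double peak_gb_s) {
    const double bytes_per_token = model_gib * 1024.0 * 1024.0 * 1024.0;
    return tok_per_s * bytes_per_token / (peak_gb_s * 1e9);
}

int main() {
    // Arc Pro B70 (608 GB/s peak), Qwen3.5-27B sizes from issue #21517:
    std::printf("Q4_K_M: %.0f%%\n", 100 * bw_utilization(20.56, 15.58, 608)); // ~57%
    std::printf("Q8_0:   %.0f%%\n", 100 * bw_utilization(4.88, 26.62, 608));  // ~23%
}
```

The results land in the same ballpark as the reported 53% and 21-24% figures; small differences come from GiB/GB rounding and whatever KV-cache traffic the reports included.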

### 7.2 Hardware Test Matrix

| GPU | Architecture | Priority | Test Focus |
|-----|--------------|----------|------------|
| Arc A770 | Xe1 | Medium | Regression check |
| Arc B580 | Xe2 | High | Primary Xe2 target |
| Arc Pro B70 | Xe2 | High | High-end Xe2 |
| Arc 140T iGPU | Xe2 | High | Detection fix validation |
| Arc 140V iGPU | Xe2 | Medium | Baseline comparison |

### 7.3 Quantization Test Matrix

| Quant | Priority | Current State | Target |
|-------|----------|---------------|--------|
| Q4_0 | Low | Optimized | Maintain |
| Q4_K_M | High | Medium | 2x speedup |
| Q5_K_M | High | Slow | 3x speedup |
| Q6_K | High | Slow | 3x speedup |
| Q8_0 | Critical | Fixed (PR pending) | Validate fix |
| IQ4_NL | Medium | Broken | Investigate root cause |

---

## 8. References & Links

### GitHub Issues

- [#21517](https://github.com/ggml-org/llama.cpp/issues/21517) - Q8_0 4x slower on B70
- [#12318](https://github.com/intel/ipex-llm/issues/12318) - K-quants crash Xe2 iGPU
- [#20776](https://github.com/ggml-org/llama.cpp/issues/20776) - Arc 140T misdetection
- [#12570](https://github.com/ggml-org/llama.cpp/discussions/12570) - Arc status discussion
- [#12805](https://github.com/ggml-org/llama.cpp/discussions/12805) - A750 user experiences
- [#19887](https://github.com/ggml-org/llama.cpp/issues/19887) - A770 inverse quant anomaly

### Documentation

- [llama.cpp SYCL Backend](https://github.com/ggml-org/llama.cpp/blob/master/docs/backend/SYCL.md)
- [IPEX-LLM Quickstart](https://github.com/intel/ipex-llm/blob/main/docs/mddocs/Quickstart/llama_cpp_quickstart.md)
- [vLLM Intel Arc Pro](https://blog.vllm.ai/2025/11/11/intel-arc-pro-b.html)
- [Intel DPAS Instructions](https://github.com/intel/intel-graphics-compiler/blob/master/documentation/visa/instructions/DPAS.md)

### Key Files in repos/

```
repos/llama.cpp/
├── ggml/src/ggml-sycl/
│   ├── ggml-sycl.cpp       # Dispatch logic (~line 3258-3648)
│   ├── dmmv.cpp            # Generic DMMV kernels
│   ├── mmvq.cpp            # Optimized MMVQ kernels
│   ├── vecdotq.hpp         # Vector dot products
│   └── quantize.hpp        # Quantization routines
├── ggml/src/ggml-vulkan/
│   └── ggml-vulkan.cpp     # Vulkan backend (~line 343 detection)
└── docs/backend/SYCL.md    # SYCL setup guide

repos/ipex-llm/
├── docs/mddocs/                # IPEX-LLM documentation
├── docker/llm/inference-cpp/   # C++ inference container
└── README.md                   # Project overview (archived notice)
```

---

## 9. Open Questions

1. **IPEX-LLM Future:** Intel archived the main ipex-llm repo in Jan 2026. Is there a replacement or maintained fork?

2. **Driver Release Cadence:** How frequently does Intel update its GPU drivers? Does this affect reproducibility?

3. **Architecture-Specific Guidance:** Are there Intel-published optimization guides for Xe2?

4. **vLLM vs. llama.cpp:** For production serving, which stack should be prioritized?

5. **Quantization Format Priority:** Given user needs, should we prioritize K-quants (quality) or legacy quants (speed)?

---

*Document Version: 1.0*
*Research Sources: GitHub issues, llama.cpp discussions, Reddit, Intel developer articles*

@@ -0,0 +1,171 @@

# Community Issues & Discourse Summary

## Source: GitHub Issues, Discussions, Reddit (March-April 2026)

---

## Critical Issues Filed

### 1. [#21517](https://github.com/ggml-org/llama.cpp/issues/21517) - Q8_0 4x Slower on Arc Pro B70

**Reporter:** PMZFX (April 6, 2026)
**Status:** Closed - PR #21527 submitted

**Benchmark Data (Arc Pro B70, Qwen3.5-27B):**

| Quant | Token Gen (t/s) | BW Utilization |
|-------|-----------------|----------------|
| Q4_K_M | 20.56 | 53% |
| Q8_0 | 4.88 | 21% |

**Key Findings:**

- Q8_0 is stuck on the generic DMMV kernel (iter_stride=64)
- The Q4_0 reorder kernel uses iter_stride=512 (8x more work per iteration)
- Driver updates don't help (IGC 2.28.4 → 2.30.1 left Q8_0 performance unchanged)
- Both SYCL and Vulkan backends are affected equally
- A dual-GPU setup doesn't help, confirming a kernel-level issue

**Fix:** PR #21527 adds Q8_0 to the reorder framework. Validation showed a 3.1x speedup (4.88 → 15.24 t/s).

---

### 2. [#12318](https://github.com/intel/ipex-llm/issues/12318) - K-Quant Crash on Xe2 iGPU

**Reporter:** lhl (November 3, 2024)
**Status:** Closed
**Hardware:** Lunar Lake Arc 140V

```
Sub-group size 8 is not supported on the device
Exception at ggml-sycl.cpp:3164
```

**Reproduction:** Q4_K_M crashes; Q4_0 works fine.

**Workaround:** Use the upstream llama.cpp SYCL backend (slower but stable).

---

### 3. [#20776](https://github.com/ggml-org/llama.cpp/issues/20776) - Arc 140T Misdetection

**Reporter:** diegokolling (March 19, 2026)
**Status:** Open
**Hardware:** Arrow Lake H, Arc 140T (48GB shared)

**Root Cause:**

- The driver reports `minSubgroupSize = 8`
- The code requires `minSubgroupSize == 16` for INTEL_XE2 classification
- The same driver on the Arc 140V reports `minSubgroupSize = 16`

**Impact:** Cooperative matrix support is completely disabled despite hardware support.

---

## Key Discussions

### [#12570](https://github.com/ggml-org/llama.cpp/discussions/12570) - Arc Status for llama.cpp

**Date:** March 25-28, 2025
**Participants:** ky438, Rbiessy (Codeplay), NeoZhangJianyu

**Key Quotes:**

> "tg should already be decent" - 0cc4m (llama.cpp collaborator)

> "There are huge performance gaps between k-quant and legacy quant. Some quantizations like IQ4_NL reach only 14% of memory bandwidth utilization." - Community report

> "For BMG, we don't promise to optimize it in time of the marketing." - NeoZhangJianyu

> "If you want to see the best performance on Intel GPU, please try OpenVINO." - NeoZhangJianyu

**Outcomes:**

- Acknowledged poor performance on k-quants
- Planned work on mul_mat_vec_q kernel optimization
- Discussed DPAS instruction utilization
- Noted that community contributors work on this in their spare time

---

### [#12805](https://github.com/ggml-org/llama.cpp/discussions/12805) - A750 User Experience

**Date:** April 7-9, 2025
**User:** codayon (Arch Linux, 8GB VRAM)

**Findings:**

- The Ubuntu Vulkan binary worked on Arch Linux
- Q4_K_M was slower than expected on the 8GB card
- Q4_0 recommended for better performance
- IPEX-LLM provides better VRAM utilization
- Setup complexity is a barrier to entry

**Recommendations from the community:**

- Use Qwen2.5-Coder-0.5B-Q8_0 for autocomplete (150+ t/s)
- Qwen2.5-Coder-7B-Q4_0 for chat
- Vulkan is more stable than SYCL on Arch

---

## Reddit Discourse

### r/LocalLLaMA - "Intel Arc for LLMs?"

**Key Comments:**

- "Not a lot of kernels for arc so many of the quantized models will be out of reach" (u/shakhal1)
- An Arc A770 with 16GB runs models up to 24B with 4-6 bit quantization
- oneAPI is less mature than CUDA - expect compatibility issues

### r/LocalLLaMA - "llama.cpp 3.1x Q8_0 speedup on Intel Arc GPUs"

**Key Details:**

- PR submitted through an AI agent + user collaboration
- Intel's closed-source IPEX-LLM was binary-patched to validate the solution
- IPEX-LLM achieved 61% bandwidth, confirming the problem is solvable in software

### r/IntelArc - "Intel ARC for local LLMs"

**User reports:**

- B580 setup issues (unsupported message)
- Even dual A770s (32GB) are not enough for 30B at FP16
- No consumer Intel GPU has sufficient VRAM for large models

---

## GitHub Issue #19887 - A770 Inverse Quantization Anomaly

**On the A770:** Q8_0 is faster than Q4/Q6
**On the B70:** Q8_0 is 4x slower than Q4

**This is a Xe2/Battlemage regression**, indicating that:

- the Xe1 optimizations work
- Xe2's memory architecture is different
- kernel tuning is needed for the new architecture

---

## Performance Summary Table

Compiled from community benchmarks:

| GPU | Backend | Q4_0 tg | Q4_K_M tg | Q8_0 tg | Notes |
|-----|---------|---------|-----------|---------|-------|
| A770 (Xe1) | SYCL | ~40 t/s | ~25 t/s | ~30 t/s | Q8_0 works well |
| A770 (Xe1) | Vulkan | ~30 t/s | ~20 t/s | ~35 t/s | Good prompt processing |
| B580 (Xe2) | SYCL | ~45 t/s | ~20 t/s | ~8 t/s | Q8_0 broken |
| B580 (Xe2) | Vulkan | ~35 t/s | ~18 t/s | ~10 t/s | Better prompt perf |
| B70 (Xe2) | SYCL | ~35 t/s | ~20 t/s | ~5 t/s | Q8_0 very slow |
| 140V iGPU (Xe2) | SYCL | ~23 t/s | N/A (crash) | N/A | K-quants broken |

---

## Community Complaints Summary

1. **"30% of peak performance"** - users see far below hardware potential
2. **"Instability with k-quants"** - some formats crash, others work
3. **"Documentation chaos"** - multiple docs, Ubuntu-focused, Arch users struggle
4. **"IPEX-LLM is too slow but stable, llama.cpp is fast but broken"** - no perfect option
5. **"Driver updates don't fix issues"** - confirms a software-stack problem
6. **"No official Intel contribution"** - the community maintains support in its spare time

---

*Last Updated: April 2026*

@@ -0,0 +1,262 @@

# Driver/Stack Misalignment Analysis

## Overview

This document catalogs the specific code locations, design decisions, and architectural mismatches that cause Intel Arc GPUs to underperform on LLM inference.

---

## 1. llama.cpp SYCL Backend Misalignments

### 1.1 Kernel Dispatch Logic

**File:** `repos/llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp`
**Lines:** ~3258-3660

**Current Dispatch Algorithm:**

```
mul_mat dispatch prefers:
1. MMVQ (reorder path) if src0 type in ggml_sycl_supports_reorder_mmvq()
2. DMMV (reorder path) if src0 type in ggml_sycl_supports_reorder_dmmv()
3. SYCL native matmul as fallback
```

**Support Lists (lines ~3269-3300):**

```cpp
// Supports MMVQ reorder
ggml_sycl_supports_reorder_mmvq(): Q4_0, Q8_0, Q4_K, Q6_K

// Supports DMMV reorder
ggml_sycl_supports_reorder_dmmv(): Q4_0, Q8_0 ONLY

// Supports SYCL matmul reorder
ggml_sycl_supports_reorder_mul_mat_sycl(): Q4_0, Q8_0, Q4_K*, Q6_K*
(* = !g_ggml_sycl_prioritize_dmmv)
```

**Problem:** Q4_K and Q6_K support MMVQ reorder but NOT DMMV reorder. When conditions favor DMMV, these quants fall through to the slow generic path.

### 1.2 DMMV Kernel iter_stride Problem

**File:** `repos/llama.cpp/ggml/src/ggml-sycl/dmmv.cpp`
**Lines:** ~975-1100 (dequantize_mul_mat_vec_q8_0_sycl)

**Generic DMMV (used by Q8_0):**

```cpp
iter_stride = 2 * GGML_SYCL_DMMV_X = 64   // processes 2 values per iteration
```

**Reorder DMMV (Q4_0 path):**

```cpp
iter_stride = 8 * 2 * GGML_SYCL_DMMV_X = 512   // processes 16 values per iteration
```

**Root Cause:** Q8_0's 34-byte block structure prevents the simple power-of-2 optimization that works for Q4_0's 18-byte blocks.
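
The 34-byte figure follows directly from the block layout. A minimal sketch of the two structs (field layout as in ggml's `ggml-common.h`, with `ggml_half` simplified to `uint16_t`) shows how consecutive Q8_0 blocks drift across 64-byte cache lines:

```cpp
#include <cstdint>
#include <cstdio>

// Sketch of the GGUF block layouts; sizes match ggml (18 and 34 bytes).
// ggml_half is simplified to uint16_t here.
struct block_q4_0 { uint16_t d; uint8_t qs[16]; }; // 32 weights @ 4 bits + scale
struct block_q8_0 { uint16_t d; int8_t  qs[32]; }; // 32 weights @ 8 bits + scale

static_assert(sizeof(block_q4_0) == 18, "Q4_0 block is 18 bytes");
static_assert(sizeof(block_q8_0) == 34, "Q8_0 block is 34 bytes");

int main() {
    // Each block starts at a different offset within a 64-byte cache line,
    // which is what the reorder layout (scales and quants split into two
    // contiguous arrays) is designed to avoid.
    for (int i = 0; i < 4; ++i) {
        std::printf("q8_0 block %d -> cache-line offset %zu\n",
                    i, (i * sizeof(block_q8_0)) % 64);
    }
}
```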

### 1.3 Missing Q8_0 Reorder Implementation

**File:** `repos/llama.cpp/ggml/src/ggml-sycl/mmvq.cpp`
**Lines:** ~682-730

**Q4_0 reorder kernel (pre-existing):**

```cpp
mul_mat_vec_q_reorder<reorder_vec_dot_q_sycl<GGML_TYPE_Q4_0>>
```

**Q8_0 reorder kernel (added by PR #21527):**

```cpp
mul_mat_vec_q_reorder<reorder_vec_dot_q_sycl<GGML_TYPE_Q8_0>>
```

**Note:** PR #21527 adds Q8_0 to the reorder framework. Without this fix, Q8_0 defaults to the slow DMMV path.

### 1.4 Q4_K DMMV Reorder Gap

**Problem:** Q4_K has a reorder structure (`reorder_qw_q4_k()`), but the DMMV path doesn't use it.

**Current State:**

- Q4_K MMVQ reorder: ✅ Working
- Q4_K DMMV reorder: ❌ Not implemented

**Impact:** When DMMV is prioritized (`GGML_SYCL_PRIORITIZE_DMMV=1`), Q4_K gets no optimization.

---

## 2. llama.cpp Vulkan Backend Misalignments

### 2.1 Cooperative Matrix Detection

**File:** `repos/llama.cpp/ggml/src/ggml-vulkan/ggml-vulkan.cpp`
**Lines:** ~343, ~15972

**Detection Logic:**

```cpp
// Step 1: Architecture classification
if (subgroup_size_control_props.minSubgroupSize == 16) {
    return vk_device_architecture::INTEL_XE2;
}
// Falls through to OTHER for the 140T (minSubgroupSize = 8)

// Step 2: Coopmat support check
case VK_VENDOR_ID_INTEL:
    return arch == vk_device_architecture::INTEL_XE2;
// Returns false for OTHER
```

**Problem:** The Arc 140T (Arrow Lake H) reports minSubgroupSize = 8 despite having the Xe2 architecture and full coopmat support.
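
For reference, the property in question comes from `VK_EXT_subgroup_size_control` (core in Vulkan 1.3). The following sketch shows the query plus the kind of device-ID fallback proposed for #20776; the two IDs are the ones quoted in that proposal and are not independently verified here:

```cpp
#include <vulkan/vulkan.h>

// Classify Intel Xe2 more robustly than by minSubgroupSize alone: query the
// subgroup-size-control properties, then fall back to known device IDs.
bool is_intel_xe2(VkPhysicalDevice phys) {
    VkPhysicalDeviceSubgroupSizeControlProperties sg{};
    sg.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_SUBGROUP_SIZE_CONTROL_PROPERTIES;

    VkPhysicalDeviceProperties2 props2{};
    props2.sType = VK_STRUCTURE_TYPE_PHYSICAL_DEVICE_PROPERTIES_2;
    props2.pNext = &sg;
    vkGetPhysicalDeviceProperties2(phys, &props2);

    if (sg.minSubgroupSize == 16) {
        return true;                       // current heuristic, works for the 140V
    }
    // Arc 140T variants report minSubgroupSize = 8; fall back to the device
    // IDs from the proposed #20776 fix.
    const uint32_t id = props2.properties.deviceID;
    return id == 0x7D51 || id == 0x7D45;
}
```

A real implementation would also gate this on `vendorID == 0x8086` (Intel) before checking device IDs.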

### 2.2 DP4A/DPAS Utilization Gap

**Current State:**

- The Vulkan backend has DP4A instruction support
- The matrix multiplication (matmul) path doesn't use DPAS
- Only the Flash Attention path partially uses coopmat

**Missing:**

- Q4_K and Q8_0 quantized matmul via DPAS
- Subgroup-level parallelism for token generation

---

## 3. IPEX-LLM vs llama.cpp Gap

### 3.1 Performance Comparison

| Aspect | IPEX-LLM | llama.cpp SYCL | llama.cpp Vulkan |
|--------|----------|----------------|------------------|
| Q4_0 | Fast | Fast | Medium |
| Q4_K | Fast | Medium | Medium |
| Q8_0 | Fast | Was broken | Was broken |
| K-quants on Xe2 | Crashes | Works | Works |
| FlashAttention | Full | Partial | Partial |
| VRAM usage | Lower | Higher | Higher |

### 3.2 Source of the Optimization Gap

**IPEX-LLM advantages:**

1. Closed-source optimized kernels (not in llama.cpp)
2. oneDNN GEMM integration
3. Lower-level hardware access
4. The syclcompat library for platform-specific tuning

**llama.cpp limitations:**

1. Open-source kernels visible to competitors
2. Generic SYCL code must work across all Intel GPUs
3. Can't leverage IPEX's proprietary optimizations

---

## 4. Architecture Detection Mismatches

### 4.1 Xe1 vs Xe2 Detection

**Current Detection:** Uses compute capability (device version)

**Problem:**

- The Arc A770 reports compute version 1.3 (Xe1)
- The Arc B580 reports compute version 1.6 (Xe2)
- BUT: the same driver branch reports different minimum subgroup sizes (8 on the 140T vs. 16 on the 140V)

### 4.2 Missing Architecture-Specific Tuning

**Current kernels:** A single implementation for all Intel GPUs

**Needed:**

| Feature | Xe1 (Alchemist) | Xe2 (Battlemage) |
|---------|-----------------|------------------|
| L2 cache | 16 MB | Larger |
| Optimal block size | 64 | 128 |
| Prefetch depth | 2 | 4 |
| Vector width | 8 | 16 |

---

## 5. Quantization Format Support Matrix

### 5.1 Current Support State

| Format | DMMV Reorder | MMVQ Reorder | SYCL Matmul | Vulkan | Notes |
|--------|--------------|--------------|-------------|--------|-------|
| Q4_0 | ✅ | ✅ | ✅ | ✅ | Fully optimized |
| Q4_1 | ❌ | ❌ | ✅ | ✅ | Legacy, slow |
| Q5_0 | ❌ | ❌ | ✅ | ✅ | Legacy, slow |
| Q5_1 | ❌ | ❌ | ✅ | ✅ | Legacy, slow |
| Q8_0 | ✅* | ✅* | ✅ | ✅ | *Fixed by PR #21527 |
| Q4_K | ❌ | ✅ | ✅* | ✅ | *Prioritize-DMMV breaks it |
| Q5_K | ❌ | ❌ | ❌ | ❌ | No reorder support |
| Q6_K | ❌ | ✅ | ✅* | ✅ | *Prioritize-DMMV breaks it |
| IQ4_NL | ❌ | ❌ | ✅ | ❌ | 14% bandwidth, crashes |
| IQ4_XS | ❌ | ❌ | ✅ | ✅ | Not optimized |

### 5.2 Block Size Analysis

| Format | Block Size | Power of 2? | Cache Line Aligned? |
|--------|-----------|-------------|---------------------|
| Q4_0 | 18 bytes | No | Partial |
| Q4_K | 54 bytes | No | No |
| Q5_K | 62 bytes | No | No |
| Q6_K | 66 bytes | No | No |
| Q8_0 | 34 bytes | No | No |
| IQ4_NL | 16 bytes | Yes | Yes |

**Hypothesis:** Power-of-2 block sizes (IQ4_NL) enable efficient memory access patterns, while non-power-of-2 formats suffer - though IQ4_NL's 14% bandwidth figure elsewhere in this document shows that block alignment cannot be the whole story.

---

## 6. Key File Locations Summary

### Core Problem Areas:

```
repos/llama.cpp/ggml/src/ggml-sycl/
├── ggml-sycl.cpp
│   ├── Line 219: GGML_SYCL_PRIORITIZE_DMMV env var
│   ├── Line 3258-3260: mul_mat_algo enum (DMMV, MMVQ, SYCL)
│   ├── Line 3269-3292: ggml_sycl_supports_reorder_*() functions
│   ├── Line 3549-3650: dispatch logic with fallback chains
│   └── Problem: routing logic doesn't handle Q4_K/Q6_K correctly
│
├── dmmv.cpp
│   ├── ~975-1100: dequantize_mul_mat_vec_q8_0_sycl()
│   ├── iter_stride = 64 (generic path)
│   └── Problem: 8x less work per iteration than the reorder path
│
├── mmvq.cpp
│   ├── ~550-570: Q4_0 reorder kernel
│   ├── ~695-720: Q8_0 reorder kernel (after PR #21527)
│   ├── ~1100-1200: Q4_K kernel (no DMMV support)
│   └── Problem: missing Q5_K, Q6_K reorder
│
└── vecdotq.hpp
    ├── ~844: vec_dot_q8_0_q8_1 implementation
    └── Problem: memory coalescing suboptimal for Xe2

repos/llama.cpp/ggml/src/ggml-vulkan/
└── ggml-vulkan.cpp
    ├── ~343: get_device_architecture() classification
    ├── ~15972: coopmat support check
    └── Problem: minSubgroupSize = 8 causes 140T misdetection

repos/ipex-llm/ (archived Jan 2026)
├── Closed-source optimized kernels (not upstream)
├── syclcompat library
├── oneDNN integration
└── Problem: archive status, no community maintenance
```

---

## 7. Misalignment Summary Table

| Component | Expected | Actual | Impact |
|-----------|----------|--------|--------|
| Q8_0 DMMV | 16 values/iter (reorder) | 2 values/iter (generic) | 4x slower |
| Q4_K DMMV | Reorder enabled | Not implemented | 40% slower |
| Q5_K MMVQ | Reorder support | Missing | 3x slower |
| Arc 140T detection | INTEL_XE2 | OTHER | Coopmat disabled |
| Q8_0 on B70 | 60% BW | 21% BW | 3x slower |

---

*Last Updated: April 2026*