Intel Arc GPU LLM Inference: Comprehensive Research Overview

Author: GLM Agent
Date: April 15, 2026
Phase: Research & Data Preparation (No driver/framework modifications)


Table of Contents

  1. Executive Summary
  2. The Intel GPU Software Stack
  3. Identified Problems
  4. Performance Landscape
  5. Root Cause Analysis
  6. Speculations & Hypotheses
  7. Suggested Solutions
  8. Repo Map & Key Files
  9. References

Executive Summary

Intel Arc GPUs, both discrete (Alchemist/Xe1, Battlemage/Xe2) and integrated (Lunar Lake, Arrow Lake), suffer from severely degraded LLM inference performance compared to their theoretical hardware capabilities. Users consistently report achieving only 21-40% of theoretical memory bandwidth on quantized models, versus 80-95% on equivalent NVIDIA/AMD hardware. The issues are multifaceted, spanning kernel optimization gaps, software stack fragmentation, driver/kernel version incompatibilities, and a fundamental underinvestment in open-source kernel development for Intel GPU architectures.

The Battlemage (B580/B570/B70) generation introduced a new class of regression: quantization types that worked well on Xe1 (Alchemist) perform catastrophically on Xe2, with Q8_0 running 4-5x slower than Q4_K_M despite having only 1.7x more data. This was traced to a kernel dispatch path issue (now partially fixed by PR #21527).


The Intel GPU Software Stack

Layer Architecture

┌─────────────────────────────────────────────────┐
│  User-facing: Ollama, LM Studio, llama.cpp      │
├─────────────────────────────────────────────────┤
│  Framework: IPEX-LLM (archived Jan 2026)         │
│             vLLM (intel/vllm docker)              │
│             OpenVINO                              │
│             PyTorch + IPEX                        │
├─────────────────────────────────────────────────┤
│  Backend: SYCL (llama.cpp)                       │
│           Vulkan (llama.cpp, cross-vendor)        │
│           Level Zero (low-level compute)          │
├─────────────────────────────────────────────────┤
│  Compiler: DPC++ (intel/llvm sycl branch)         │
│            IGC (Intel Graphics Compiler)           │
├─────────────────────────────────────────────────┤
│  Runtime: compute-runtime (NEO)                   │
│           Level Zero Loader                       │
│           oneDNN (BLAS/GEMM)                      │
├─────────────────────────────────────────────────┤
│  Kernel Driver: i915 / xe (Linux)                 │
│  Firmware: linux-firmware                         │
├─────────────────────────────────────────────────┤
│  Hardware: Xe1 (Alchemist: A380/A750/A770)        │
│            Xe2 (Battlemage: B570/B580/B70)        │
│            Xe2 iGPU (Lunar Lake 140V, Arrow Lake) │
└─────────────────────────────────────────────────┘

The Fragmentation Problem

Intel's software stack for GPU inference has at least five semi-independent efforts that don't fully interoperate:

| Stack | Maintainer | Status | Backend | Optimized? |
|---|---|---|---|---|
| llama.cpp SYCL | Community (NeoZhangJianyu, Rbiessy/Codeplay) | Active | SYCL/Level Zero | Only Q4_0 fully |
| llama.cpp Vulkan | 0cc4m (community) | Active | Vulkan | Improving, behind CUDA |
| IPEX-LLM | Intel (analytics team) | Archived Jan 2026 | SYCL + proprietary | Best perf, dying |
| OpenVINO | Intel (openvino team) | Active | Own runtime | Different model format |
| vLLM XPU | Intel (vllm fork) | Active | PyTorch/IPEX | Server-focused |

Key complaint from Hacker News user lhl:

"PyTorch requires its own support kit separate from the oneAPI Toolkit (and runs slightly different versions of everything), the vLLM xpu support doesn't work — both source and the docker failed to build/run for me. The IPEX-LLM whisper support is completely borked."

Version Compatibility Nightmare

| Component | IPEX-LLM requires | Latest available | Conflict? |
|---|---|---|---|
| oneAPI Base Toolkit | 2024.2.1 | 2025.3+ | Yes |
| PyTorch | 2.5.x | 2.8+ | Yes |
| compute-runtime | 25.x | 26.x | Yes |
| Linux kernel | 6.5-6.17 | 6.18+ | Yes (6.18 breaks) |
| IGC | 2.27+ | 2.30+ | Minor |
| DPC++ Compiler | 2024.2 | 2025.3 | ABI changes |

Critical finding: Linux kernel 6.18 completely breaks compute-runtime's memory management (bindless_heaps_helper.cpp abort). Users must stay on 6.17 or older. This is tracked in compute-runtime#875.

Another critical finding: IPEX-LLM shifted oneAPI dependency from 2024.2.1 to 2025.0.1 starting from ipex-llm[cpp]==2.2.0b20250207, but the archived repo won't get further updates. Users with older IPEX-LLM versions are stuck on old oneAPI.


Identified Problems

Problem 1: SYCL Kernel Dispatch — Quantization Type Inequality (CRITICAL)

Status: Partially fixed (Q8_0), open for most other types
Impact: 2-5x performance degradation on non-Q4_0 quantizations

The llama.cpp SYCL backend has three kernel paths for quantized matrix-vector multiplication:

  1. DMMV (Dequantize-Mul-Mat-Vec) — generic, slow
  2. MMVQ (Mul-Mat-Vec-Q) — optimized, uses reorder
  3. SYCL native matmul — fallback

Only Q4_0 and Q8_0 (after PR #21527) have full reorder support. All other quantization types fall through to slower paths:

| Format | DMMV Reorder | MMVQ Reorder | SYCL Matmul Reorder | Effective BW |
|---|---|---|---|---|
| Q4_0 | ✓ | ✓ | ✓ | 57% |
| Q8_0 | ✓ (PR #21527) | ✓ (PR #21527) | ✗ | 66% |
| Q4_K | ✗ | ✓* | ✗ | 53% |
| Q5_K | ✗ | ✗ | ✗ | ~39% |
| Q6_K | ✗ | ✓* | ✗ | ~48% |
| IQ4_NL | ✗ | ✗ | ✗ | 14% |
| Q4_1/Q5_0/Q5_1 | ✗ | ✗ | ✗ | ~30-44% |

* Only when g_ggml_sycl_prioritize_dmmv is NOT set

The root cause is in the dispatch logic (ggml-sycl.cpp lines 3269-3340):

// Only Q4_0 and Q8_0 supported in DMMV reorder
inline bool ggml_sycl_supports_reorder_dmmv(enum ggml_type type) {
    switch (type) {
        case GGML_TYPE_Q4_0:
        case GGML_TYPE_Q8_0:
            return true;
        default:
            return false;  // Q4_K, Q5_K, Q6_K all fall through
    }
}

Problem 2: Xe2/Battlemage Regression — Q8_0 Catastrophically Slow (CRITICAL, mostly fixed)

Status: PR #21527 submitted, 3.1x speedup validated
Impact: Q8_0 ran at 21% bandwidth on Xe2 vs 53-64% for Q4_K_M

On Arc A770 (Xe1), Q8_0 was faster than Q4/Q6. On Arc B70/B580 (Xe2), Q8_0 was 4-5x slower. The root cause:

  • Generic DMMV path: iter_stride = 2 * GGML_SYCL_DMMV_X = 64 → processes 2 values per thread per iteration
  • Reorder DMMV path (Q4_0): iter_stride = 8 * 2 * GGML_SYCL_DMMV_X = 512 → processes 16 values per thread per iteration

Q8_0 was stuck on the generic path because it wasn't in ggml_sycl_supports_reorder_dmmv(). This was confirmed not to be a driver issue (upgrading IGC 2.28.4 → 2.30.1 showed no change) and not specific to one backend (SYCL and Vulkan were equally affected).

PR #21527 added Q8_0 to the reorder framework, achieving 66% bandwidth utilization (up from 21%).
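
The stride arithmetic can be shown directly. A minimal sketch using the constants named above (GGML_SYCL_DMMV_X is 32 in the SYCL backend); the loop shapes are simplified illustrations, not the actual kernels from dmmv.cpp:

// Simplified view of the two DMMV loop strides (see dmmv.cpp).
constexpr int GGML_SYCL_DMMV_X = 32;

// Generic path: each work-item dequantizes 2 values per iteration,
// producing many small, scattered loads per sub-group.
constexpr int generic_iter_stride = 2 * GGML_SYCL_DMMV_X;      // 64

// Q4_0/Q8_0 reorder path: 16 values per work-item per iteration,
// producing wide contiguous loads that coalesce well on Xe2.
constexpr int reorder_iter_stride = 8 * 2 * GGML_SYCL_DMMV_X;  // 512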

Problem 3: K-Quantization Crashes on Xe2 iGPU (CRITICAL)

Status: Workaround exists (use upstream llama.cpp SYCL instead of IPEX-LLM)
Impact: Q4_K_M, Q5_K, Q6_K crash on Arc 140V/140T

Sub-group size 8 is not supported on the device
Exception at ggml-sycl.cpp:3164

This error occurs in IPEX-LLM's bundled llama.cpp but not in upstream. IPEX-LLM's llama.cpp is based on an August 2024 snapshot and hasn't received the fixes. Since the project was archived in January 2026, this will likely never be fixed in IPEX-LLM.
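
The device-side constraint is easy to verify with standard SYCL 2020 device queries. A minimal sketch, assuming a working DPC++ toolchain:

#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    sycl::queue q{sycl::gpu_selector_v};
    // Xe2 iGPUs do not support sub-group size 8; a kernel compiled
    // for SIMD8 then fails with the exception above.
    for (auto s : q.get_device()
                      .get_info<sycl::info::device::sub_group_sizes>())
        std::cout << s << ' ';
    std::cout << '\n';
    return 0;
}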

Problem 4: Arc 140T Misdetection — Coopmat Disabled (HIGH)

Status: Open issue
Impact: Cooperative matrix operations completely disabled on valid Xe2 hardware

The Vulkan backend classifies GPUs by minSubgroupSize:

  • minSubgroupSize == 16 → classified as INTEL_XE2 → coopmat enabled
  • Everything else → classified as OTHER → no coopmat

Arrow Lake H (Arc 140T) reports minSubgroupSize = 8 despite having Xe2 architecture and full cooperative matrix support. This appears to be a driver-level reporting bug.
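
A paraphrased sketch of the classification (names are illustrative; the actual logic lives in ggml-vulkan.cpp):

#include <cstdint>

enum vk_intel_arch { INTEL_XE2, OTHER };

vk_intel_arch classify_intel(uint32_t min_subgroup_size) {
    // Arc 140T reports minSubgroupSize = 8 despite being Xe2, so it
    // falls into OTHER and cooperative matrix stays disabled.
    return (min_subgroup_size == 16) ? INTEL_XE2 : OTHER;
}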

Problem 5: Linux Kernel Version Fragility (HIGH)

Status: Active issue (compute-runtime#875)
Impact: Complete failure of all GPU compute on kernel 6.18

| Kernel Version | Works? | Notes |
|---|---|---|
| ≤ 6.6.25 | ✅ | Stable baseline |
| 6.6.26 - 6.8 | ⚠️ | Some CCS fence timeout issues |
| 6.9 - 6.17 | ✅ | Working range |
| 6.18+ | ❌ | bindless_heaps_helper.cpp abort, all Level Zero fails |

Additionally, Linux firmware updates have broken Intel GPU compute in the past. The linux-firmware package version 20240409.1addd7dc introduced fence timeouts that hung llama.cpp, requiring downgrades.

Problem 6: IPEX-LLM Abandonment (HIGH)

Status: Archived January 2026
Impact: Best-performing Intel GPU inference stack no longer maintained

IPEX-LLM consistently outperformed upstream llama.cpp SYCL by 50-80% on token generation (e.g., 24.35 t/s vs 13.51 t/s on Arc 140V with Llama-2-7B Q4_0). This performance gap came from:

  1. Closed-source optimized kernels not shared with upstream
  2. oneDNN GEMM integration for prompt processing (not available in llama.cpp; sketched below)
  3. syclcompat library for platform-specific tuning
  4. Proprietary quantization optimizations

With IPEX-LLM archived, these optimizations are frozen. The llama.cpp community must independently re-derive all of this work.
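
To illustrate item 2: routing a prompt-processing GEMM through oneDNN looks roughly like the sketch below (oneDNN v3.x C++ API; shapes and memory handling are placeholders, not IPEX-LLM's actual code):

#include <oneapi/dnnl/dnnl.hpp>
#include <cstdint>

using namespace dnnl;

void gemm_f16_on_gpu(int64_t M, int64_t N, int64_t K) {
    engine eng(engine::kind::gpu, 0);
    stream strm(eng);

    // Row-major fp16 matrices: C[MxN] = A[MxK] * B[KxN].
    memory::desc a_md({M, K}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc b_md({K, N}, memory::data_type::f16, memory::format_tag::ab);
    memory::desc c_md({M, N}, memory::data_type::f16, memory::format_tag::ab);
    memory a(a_md, eng), b(b_md, eng), c(c_md, eng);

    // oneDNN selects a device-tuned (potentially XMX-backed) kernel.
    matmul::primitive_desc pd(eng, a_md, b_md, c_md);
    matmul(pd).execute(strm, {{DNNL_ARG_SRC, a},
                              {DNNL_ARG_WEIGHTS, b},
                              {DNNL_ARG_DST, c}});
    strm.wait();
}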

Problem 7: Vulkan vs SYCL Performance Inconsistency (MEDIUM)

Status: Improving, but still inconsistent
Impact: Users must test both backends for each model/generation

Historically:

  • SYCL: Better prompt processing (5-6x faster than Vulkan)
  • Vulkan: Better token generation (up to 50% faster than SYCL in late 2024)
  • Recent SYCL improvements have narrowed the gap

The two backends have completely separate kernel implementations with different optimization strategies, and neither consistently wins across all quantization types and hardware generations.

Problem 8: vLLM XPU Quantization Limitations (MEDIUM)

Status: Active development
Impact: Only FP16, Dynamic FP8, MXFP4 validated on XPU

vLLM's Intel XPU support does not support many quantization formats that work on CUDA:

  • GPTQ: limited support; torchao AWQ models crash (they hard-require CUDA)
  • AWQ: Occupies more memory than model size on XPU
  • Marlin/Machete kernels: CUDA-only
  • GGUF: Not supported in vLLM at all

The intel/llm-scaler-vllm docker images lag behind mainline vLLM (currently at 0.14.0-b8.1 vs 0.16+ mainline).

Problem 9: Missing DPAS/XMX Utilization for Quantized Inference (HIGH)

Status: In early investigation
Impact: Intel's key hardware advantage (XMX tensor cores) goes unused for quantized matmul

Intel Xe/Xe2 GPUs have XMX (Xe Matrix eXtensions) units capable of DPAS (Dot Product and Accumulate Systolic) operations. The Arc A770 has 4096-bit XMX units; B580 has similar. However:

  • SYCL backend: Uses joint_matrix extension only for FP16/BF16 GEMM, not for quantized formats
  • Vulkan backend: DP4A instruction support added, but not yet wired to matmul path
  • K-quants: No DPAS path at all — rely entirely on scalar DP4A or software emulation
  • Community contributors (Rbiessy, NeoZhangJianyu) have noted that the kernels are memory-bound, so memory access patterns must be fixed before DPAS can help

Quote from Rbiessy (Codeplay):

"I say potentially [using the matrix engine] because currently the kernel is memory bound in the configurations we have tried. If we're still not able to improve that for some reason using HMX won't help."

Problem 10: Understaffed Open-Source Development (SYSTEMIC)

Status: Ongoing
Impact: All other problems stem from this root cause

The SYCL backend in llama.cpp is primarily maintained by:

  • NeoZhangJianyu: Independent contributor, spare time
  • Rbiessy (Codeplay/Samsung): Part of Codeplay team (acquired by Samsung), contributing to SYCL optimization
  • 0cc4m: Vulkan backend development
  • qnixsynapse: Testing and CI

Quote from NeoZhangJianyu:

"We are private contributors to maintain the SYCL backend on Intel GPU. You shouldn't complain so much, since we spend our spare time in past year to maintain it and make it work. Yes, it works, instead of work perfect. For BMG, we don't promise to optimize it in time of the marketing."

Intel itself does not officially contribute to the llama.cpp SYCL backend. Their focus is on IPEX-LLM (now archived), OpenVINO, and vLLM XPU.


Performance Landscape

Benchmarks: llama.cpp SYCL vs Vulkan vs IPEX-LLM

Compiled from community reports (llm-tracker.info, GitHub issues, Reddit):

Arc 140V iGPU (Lunar Lake, Xe2) — Llama-2-7B Q4_0

| Backend | pp512 (t/s) | tg128 (t/s) | MBW Efficiency |
|---|---|---|---|
| CPU (4 P-cores) | 25.05 | 11.59 | 30% |
| Vulkan | 44.65 | 5.54 | 14% |
| SYCL FP32 | 180.77 | 14.39 | 38% |
| SYCL FP16 | 526.38 | 13.51 | 35% |
| IPEX-LLM | 708.15 | 24.35 | 64% |

Theoretical max tg: 136.5 GB/s ÷ 3.56 GB = ~38.3 t/s
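
The efficiency figures in these tables all follow the same arithmetic: each generated token must stream the entire model once, so effective bandwidth is tokens/s times model size. A self-contained check of the numbers above:

#include <cstdio>

int main() {
    const double model_gb  = 3.56;   // Llama-2-7B Q4_0 weights
    const double peak_gbps = 136.5;  // Lunar Lake memory bandwidth
    const double tg        = 24.35;  // IPEX-LLM tg128 on Arc 140V

    double eff = tg * model_gb;      // GB/s actually streamed
    std::printf("effective BW: %.1f GB/s (%.0f%% of peak)\n",
                eff, 100.0 * eff / peak_gbps);                  // ~64%
    std::printf("ceiling: %.1f t/s\n", peak_gbps / model_gb);   // ~38.3
    return 0;
}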

Arc A770 (Alchemist, Xe1) — Various Models

| Model | Quant | SYCL pp512 | SYCL tg128 | Vulkan tg128 |
|---|---|---|---|---|
| Llama 7B | Q4_0 | ~500 | ~40 | ~30 |
| Llama 7B | Q6_K | ~700 | ~22 | ~21 |
| Llama 13B | Q5_K | ~400 | ~16 | ~8 |
| Llama 30B | Q2_K | ~160 | ~8.5 | ~5 |

Arc B580 (Battlemage, Xe2) — Llama-3.1-8B

| Quant | SYCL tg (t/s) | Expected (456 GB/s BW) | Efficiency |
|---|---|---|---|
| Q4_K_M | 25-30 | ~38 | 66-79% |
| Q8_0 (pre-fix) | ~8 | ~22 | 36% |
| Q8_0 (post-fix) | ~18 | ~22 | 82% |

Arc Pro B70 (Battlemage, Xe2) — Qwen3.5-27B (comprehensive sweep)

| Quant | Size (GiB) | tg128 (t/s) | Effective BW | % of 608 GB/s |
|---|---|---|---|---|
| Q4_0 | 14.63 | 23.67 | 346 GB/s | 57% |
| Q4_K_M | 15.58 | 20.56 | 321 GB/s | 53% |
| Q6_K | 20.90 | 13.83 | 289 GB/s | 48% |
| Q5_K_M | 18.25 | 13.78 | 252 GB/s | 41% |
| IQ4_NL | 14.60 | 5.85 | 85 GB/s | 14% |
| Q8_0 (pre-fix) | 26.62 | 4.88 | 130 GB/s | 21% |
| Q8_0 (post-fix) | 26.62 | 15.24 | 402 GB/s | 66% |

Comparison vs NVIDIA/AMD

| GPU | Price | VRAM | BW (GB/s) | 8B Q4_K_M tg | BW Efficiency |
|---|---|---|---|---|---|
| Arc B580 | $249 | 12GB | 456 | ~25-30 t/s | ~66-79% |
| RTX 4060 | $299 | 8GB | 272 | ~35 t/s | ~92% |
| RTX 4060 Ti 16GB | $499 | 16GB | 288 | ~40 t/s | ~90% |
| RTX 3060 12GB (used) | $170 | 12GB | 360 | ~35 t/s | ~88% |

Intel Arc delivers 20-30% less inference speed than NVIDIA at similar price points, despite having competitive raw bandwidth. The gap is entirely in software efficiency.


Root Cause Analysis

Layer 1: Kernel Code — Missing Reorder Implementations

The reorder optimization (originally added for Q4_0 in PR #12035 by NeoZhangJianyu) separates quantized data from metadata (scales/zero-points) in GPU memory, enabling coalesced memory access patterns. This is the single biggest performance differentiator:

  • Without reorder: iter_stride=64, 2 values per thread per iteration → poor memory coalescing
  • With reorder: iter_stride=512, 16 values per thread per iteration → near-optimal memory access

Only Q4_0 had this optimization until PR #21527 added Q8_0. All K-quant formats (Q4_K, Q5_K, Q6_K) are missing DMMV reorder implementations.
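
The layout change is easy to picture. block_q4_0 below matches ggml's definition; the reordered view is a simplified illustration of the separation, not the backend's exact representation:

#include <cstdint>

#define QK4_0 32

// Stock GGUF layout: scale and quants interleaved block by block,
// so work-items streaming quants must stride over the scales.
struct block_q4_0 {
    uint16_t d;              // fp16 scale (ggml_half)
    uint8_t  qs[QK4_0 / 2];  // 32 x 4-bit quantized values
};

// Reordered device layout (illustrative): all quants contiguous,
// all scales contiguous, so neighboring work-items load
// neighboring bytes and accesses coalesce.
struct reordered_q4_0_view {
    uint8_t  *qs;  // n_blocks * QK4_0/2 bytes
    uint16_t *d;   // n_blocks fp16 scales
};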

Files involved:

  • llama.cpp/ggml/src/ggml-sycl/ggml-sycl.cpp (dispatch logic)
  • llama.cpp/ggml/src/ggml-sycl/dmmv.cpp (DMMV kernels)
  • llama.cpp/ggml/src/ggml-sycl/mmvq.cpp (MMVQ kernels)

Layer 2: Architecture Blindness — No Xe1 vs Xe2 Differentiation

The SYCL backend treats all Intel GPUs identically. There's no runtime adaptation for:

  • Different L2 cache sizes (Xe1: 16MB, Xe2: larger)
  • Different optimal block sizes (Xe1: 64, Xe2: 128)
  • Different prefetch depths
  • Different vector widths

This explains why Xe2-specific regressions occur: optimizations tuned for Xe1 can be counterproductive on Xe2's different memory hierarchy.
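
A hypothetical sketch of what architecture-aware tuning could look like; nothing like this exists in the backend today, and the prefetch values are illustrative only (the block sizes are the ones listed above):

// Hypothetical per-architecture tunables; not present in ggml-sycl.
struct xe_tunables {
    int block_size;      // Xe1: 64, Xe2: 128
    int prefetch_depth;  // illustrative values
};

xe_tunables pick_tunables(bool is_xe2 /* detected at init */) {
    if (is_xe2) return { 128, 4 };  // Battlemage / Lunar Lake 140V
    return { 64, 2 };               // Alchemist defaults
}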

Layer 3: Driver/Runtime — Kernel Version Coupling

Intel's compute-runtime has tight coupling with the Linux kernel's i915/xe driver. Changes in kernel memory management (e.g., SVM, CCS enablement, bindless heaps) break the userspace stack. The kernel 6.18 regression is the latest in a series:

  • Kernel 6.6.26+: Fence timeouts (CCS changes)
  • Kernel 6.8+: Various hangs
  • Kernel 6.18: Complete memory allocation failure

Layer 4: Organizational — Intel's Strategic Confusion

Intel has multiple teams working on overlapping GPU inference stacks without coordination:

  1. Intel Analytics (China): IPEX-LLM → archived
  2. Intel OpenVINO team: OpenVINO inference runtime
  3. Intel vLLM team: intel/vllm docker fork
  4. Intel PyTorch team: intel-extension-for-pytorch
  5. Intel compiler team: DPC++/SYCL (llvm sycl branch)
  6. Intel compute-runtime team: Level Zero driver
  7. Codeplay (Samsung subsidiary): SYCL backend contributions to llama.cpp
  8. Community volunteers: Most of the actual llama.cpp SYCL work

These teams use different oneAPI versions, different PyTorch versions, and target different benchmarks. There's no unified "make Intel GPUs fast for LLM inference" strategy.


Speculations & Hypotheses

H1: IPEX-LLM's Performance Secret Was Kernel Specialization, Not Magic

IPEX-LLM's 50-80% advantage over upstream llama.cpp SYCL likely comes from:

  1. Hand-tuned DPAS kernels for specific quantization formats
  2. oneDNN integration for prompt processing GEMM (not available in llama.cpp)
  3. Memory layout optimizations that upstream hasn't replicated

The fact that PR #21527 achieved 66% bandwidth for Q8_0 (close to IPEX-LLM's ~61% on Arc 140V) suggests that the reorder approach can close much of the gap, but the closed-source DPAS kernels remain unrecoverable.

H2: The Xe2 Memory Subsystem Has Different Optimal Access Patterns

The fact that Q8_0 was faster on Xe1 but catastrophically slower on Xe2 suggests fundamental architectural differences in:

  • L2 cache behavior (Xe2 may have different cache line policies)
  • Memory controller scheduling (Xe2 GDDR6 vs Xe1 GDDR6 may have different timing)
  • EU thread scheduling (Xe2 may have different SIMT behavior)

Without Intel publishing detailed Xe2 microarchitecture documentation, kernel developers are flying blind.

H3: The Driver Stack Will Remain Fragile

Intel's ongoing transition from i915 to xe kernel driver, combined with the compute-runtime's tight kernel coupling, suggests that:

  • New kernel versions will periodically break GPU compute
  • Docker images will be pinned to specific kernel versions
  • Users on rolling-release distros (Arch, Fedora) will be most affected

H4: OpenVINO Is Intel's Real Strategy, But It Doesn't Help llama.cpp

Intel seems to be positioning OpenVINO as their primary inference runtime. OpenVINO has:

  • Its own model format (not GGUF)
  • Its own quantization pipeline (not GPTQ/AWQ/GGUF)
  • Better performance out of the box

But the local LLM community overwhelmingly uses GGUF/llama.cpp, not OpenVINO. Intel's strategy doesn't align with user demand.

H5: The Community Could Fix Most Issues With Proper Resources

The reorder optimization framework is extensible by design. With focused effort, the remaining quantization types could be optimized:

  • Q4_K DMMV reorder: ~2-3 weeks of focused work
  • Q5_K, Q6_K reorder: ~4-6 weeks each
  • Xe2-specific tuning: ~2-4 weeks with profiling tools
  • DPAS integration: ~2-3 months for quantized formats

The bottleneck is not technical complexity but developer time.


Suggested Solutions

Priority 1: Critical Fixes (1-2 weeks)

| Task | Effort | Impact | Status |
|---|---|---|---|
| Merge PR #21527 (Q8_0 reorder) | Done | 3.1x Q8_0 speedup | Pending review |
| Implement Q4_K DMMV reorder | Medium | 40% speedup for Q4_K | Not started |
| Fix Arc 140T minSubgroupSize detection | Small | Enables coopmat | Not started |
| Document kernel version compatibility | Small | Prevents user frustration | Not started |

Priority 2: Quantization Coverage (1-3 months)

| Task | Effort | Impact |
|---|---|---|
| Add Q5_K to reorder framework | Medium | ~2x speedup |
| Add Q6_K to reorder framework | Medium | ~1.5x speedup |
| Fix IQ4_NL kernel (14% → 50%+ BW) | Hard | ~3.5x speedup |
| Increase DMMV iter_stride for non-reorder types | Small | 20-30% improvement |

Priority 3: Architecture Optimization (3-6 months)

| Task | Effort | Impact |
|---|---|---|
| Implement Xe2-specific kernel variants | Hard | Architecture-appropriate tuning |
| Enable DPAS for quantized matmul via prefetch optimization | Hard | Could reach 80%+ BW |
| Complete FlashAttention for SYCL | Medium | 2-3x prompt processing improvement |
| Add runtime GPU architecture detection (Xe1 vs Xe2) | Medium | Auto-tuning |

Priority 4: Ecosystem Fixes (ongoing)

| Task | Effort | Impact |
|---|---|---|
| Create unified benchmark suite for Intel GPUs | Medium | Reproducible perf tracking |
| Test matrix: kernel × compute-runtime × oneAPI | Large | Prevent version combo failures |
| Coordinate with Intel for official contributions | Political | Sustainable maintenance |
| Investigate OpenVINO's optimized kernels for porting | Medium | Leverage existing Intel work |
| Add vLLM XPU support for GPTQ/AWQ | Large | Production quantized serving |

Priority 5: Strategic Recommendations

  1. Intel should officially contribute to llama.cpp SYCL backend — this is where the users are
  2. Open-source IPEX-LLM's optimized kernels before they're permanently lost
  3. Decouple compute-runtime from kernel version — validate on LTS kernels, add version negotiation
  4. Create a "one command" setup for Intel GPU LLM inference (like pip install torch for CUDA)
  5. Publish Xe2 microarchitecture details to enable community kernel optimization

Repo Map & Key Files

repos/
├── llama.cpp/                          # Main inference engine
│   ├── ggml/src/ggml-sycl/
│   │   ├── ggml-sycl.cpp               # Dispatch logic (lines 3269-3660)
│   │   ├── dmmv.cpp                    # DMMV kernels (iter_stride issue)
│   │   ├── mmvq.cpp                    # MMVQ reorder kernels
│   │   ├── dequantize.hpp              # Dequantization functions
│   │   └── vecdotq.hpp                 # Vector dot product implementations
│   ├── ggml/src/ggml-vulkan/
│   │   └── ggml-vulkan.cpp             # Vulkan backend (140T misdetection)
│   └── docs/backend/SYCL.md            # SYCL backend documentation
│
├── compute-runtime/                    # Level Zero + OpenCL driver
│   ├── shared/source/helpers/
│   │   └── bindless_heaps_helper.cpp   # Kernel 6.18 crash point
│   └── level_zero/                     # Level Zero implementation
│
├── intel-graphics-compiler/            # IGC - GPU shader/compute compiler
│   └── documentation/visa/instructions/
│       └── DPAS.md                     # DPAS instruction documentation
│
├── intel-extension-for-pytorch/        # IPEX - PyTorch GPU extension
│
├── ipex-llm/                           # IPEX-LLM (archived Jan 2026)
│   ├── docs/mddocs/Quickstart/         # Installation guides
│   └── [closed-source optimized kernels]
│
├── vllm/                               # vLLM mainline
├── vllm-xpu-kernels/                   # Intel XPU-specific vLLM kernels
│
├── oneDNN/                             # Intel BLAS/GEMM library
├── openvino/                           # Intel's inference runtime
├── llvm/                               # DPC++ compiler (SYCL branch)
│   └── sycl/                           # SYCL runtime implementation
│
├── level-zero/                         # Level Zero loader + headers
└── sycl-tla/                           # SYCL Templates for Linear Algebra

References

Critical GitHub Issues

| # | Repo | Title | Severity |
|---|---|---|---|
| #21517 | llama.cpp | Q8_0 4x slower on Arc Pro B70 | Critical |
| #21527 | llama.cpp | Q8_0 reorder fix (3.1x speedup) | Critical (fix) |
| #12570 | llama.cpp | Current status of Intel Arc GPUs | High (discussion) |
| #5277 | llama.cpp | SYCL Long Term Features & Issues | High (tracking) |
| #12318 | ipex-llm | K-quant crashes on Xe2 iGPU | Critical |
| #12991 | ipex-llm | Vulkan faster than SYCL on B580 | Medium |
| #875 | compute-runtime | Kernel 6.18 breaks Level Zero | Critical |
| #788 | compute-runtime | sycl-ls fails on B580 | High |

Community Resources

Key People

  • NeoZhangJianyu: Primary SYCL backend maintainer (spare-time contributor)
  • Rbiessy (Codeplay): Working on mul_mat_vec_q kernel optimization, prefetch, DPAS
  • 0cc4m: Vulkan backend developer for Intel GPUs
  • PMZFX: Filed #21517, submitted PR #21527 (Q8_0 reorder fix)
  • lhl (llm-tracker.info): Comprehensive benchmarking and documentation

This document is part of Phase 1: Research & Data Preparation
No driver/framework modifications were made during this phase
Companion documents: research/community_issues/issues_and_discourse_minimax.md, research/kernels/kernel_analysis_minimax.md