Add cross-verified synthesis overview and link from README

2026-04-15 13:48:04 +02:00
parent eb44831e4c
commit 42537b9ee6
2 changed files with 168 additions and 0 deletions
+4
@@ -17,6 +17,10 @@ The root causes are multi-layered: missing kernel optimizations in `llama.cpp`,
- **[Empirical Findings](empirical_findings.md)** — Real-world benchmarks and configurations from an Arc A770 + RX 580 system running llama.cpp with Qwen3.5-35B-A3B MoE. Includes driver setup (xe vs i915), SYCL/Vulkan status, performance tables, and working/broken configuration matrix.
## Summary of Findings
**[`overview.md`](overview.md)** — Cross-verified synthesis of all three agent overviews. Every major claim was checked against live GitHub issues/PRs and the actual source code in `repos/`. Includes confirmed findings, one correction to a research document (K-quant block sizes), and a clear breakdown of what is solid vs. uncertain.
## Overviews
Each overview was independently produced by a different LLM, analyzing community issues, kernel source code, driver stacks, and benchmark data:
+164
@@ -0,0 +1,164 @@
# Intel Arc GPU LLM Inference: Verified Findings
**Date:** April 15, 2026
**Method:** Cross-checked all three agent overviews and the `research/` documents against live GitHub issues/PRs and the actual source code in `repos/`. Claims marked ✅ were verified; ⚠️ were partially correct or nuanced; ❌ were wrong.
---
## 1. The Core Problem Is Confirmed
All three overviews agree and the benchmark data holds up: **Intel Arc GPUs (Xe1/Alchemist and Xe2/Battlemage) achieve only 21–53% of theoretical memory bandwidth during quantized token generation**, versus 80–95% on NVIDIA/AMD hardware at equivalent price points.
This is not a driver bug. Updating IGC from 2.28.4 → 2.30.1 showed no change in Q8_0 performance (confirmed in issue #21517). The bottleneck is in the inference-framework kernel layer, not the hardware or compiler.
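The percentages quoted throughout this document come from a simple ratio: measured generation rate times bytes read per token, divided by theoretical bandwidth. A minimal sketch of that arithmetic follows; every numeric value is an illustrative placeholder, not a measurement from the benchmarks cited here:
```cpp
// Sketch of how a bandwidth-utilization figure is derived.
// All numeric values are illustrative placeholders, not measurements.
#include <cstdio>

int main() {
    const double bytes_per_token = 20e9;   // quantized weight bytes read per generated token (placeholder)
    const double tokens_per_sec  = 5.0;    // measured token-generation rate (placeholder)
    const double peak_bandwidth  = 456e9;  // theoretical memory bandwidth in bytes/s (placeholder)

    const double effective_bw = tokens_per_sec * bytes_per_token;                  // 100 GB/s here
    std::printf("utilization = %.1f%%\n", 100.0 * effective_bw / peak_bandwidth);  // ~21.9% here
}
```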
---
## 2. The Q8_0 Catastrophe (Mostly Fixed)
**Status:** ✅ Confirmed and code-verified. PR #21527 is merged; the fix is present in the repo.
Issue #21517 (`ggml-org/llama.cpp`) documented Q8_0 achieving only **21% memory bandwidth** (4.88 t/s) vs Q4_K_M at 53% (20.56 t/s) on Arc Pro B70 — a 4× gap despite only 1.7× more data. Both SYCL and Vulkan backends were equally affected. The root cause was confirmed in the dispatch logic:
**Verified in `ggml/src/ggml-sycl/ggml-sycl.cpp` (lines 3282–3302):**
```
ggml_sycl_supports_reorder_dmmv(): Q4_0, Q8_0 ← Q8_0 added by PR #21527
ggml_sycl_supports_reorder_mmvq(): Q4_0, Q8_0, Q4_K, Q6_K
ggml_sycl_supports_reorder_mul_mat_sycl(): Q4_0, Q8_0; Q4_K/Q6_K only when !prioritize_dmmv
```
Before PR #21527, Q8_0 was missing from both lists, forcing it through the slow generic DMMV path. After the fix, Q8_0 achieves ~66% bandwidth (15.24 t/s, 3.1× speedup). The `dequantize_q8_0_reorder` function exists in `dequantize.hpp:147`, and `reorder_mul_mat_vec_q8_0_q8_1_sycl` exists in `mmvq.cpp:682`.
**The iter_stride gap is real (verified):** The generic DMMV uses `iter_stride = 2 * GGML_SYCL_DMMV_X` (=64); the reorder path uses `iter_stride = 8 * 2 * GGML_SYCL_DMMV_X` (=512). That is an 8× difference. `GGML_SYCL_DMMV_X` is defined as 32; for Intel targets, `GGML_SYCL_WARP_SIZE` is compiled to 16 (not 32 as some overview text implies), so `vals_per_iter` = 4 (generic) and 32 (reorder) — the 8× ratio holds regardless.
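Spelled out as a sketch (constant names mirror the ggml-sycl macros quoted above; deriving `vals_per_iter` as `iter_stride` divided by the sub-group size is a simplification of the real kernel loop, used here only to make the 8× ratio explicit):
```cpp
// Sketch of the iter_stride arithmetic described above; not the actual kernel code.
#include <cstdio>

constexpr int GGML_SYCL_DMMV_X    = 32;
constexpr int GGML_SYCL_WARP_SIZE = 16;   // Intel targets compile with sub-group size 16

constexpr int iter_stride_generic = 2 * GGML_SYCL_DMMV_X;       // 64
constexpr int iter_stride_reorder = 8 * 2 * GGML_SYCL_DMMV_X;   // 512

constexpr int vals_per_iter_generic = iter_stride_generic / GGML_SYCL_WARP_SIZE;  // 4
constexpr int vals_per_iter_reorder = iter_stride_reorder / GGML_SYCL_WARP_SIZE;  // 32

int main() {
    std::printf("generic: stride=%d, vals/iter/lane=%d\n", iter_stride_generic, vals_per_iter_generic);
    std::printf("reorder: stride=%d, vals/iter/lane=%d (%dx)\n", iter_stride_reorder,
                vals_per_iter_reorder, vals_per_iter_reorder / vals_per_iter_generic);
}
```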
---
## 3. Remaining Quantization Gaps
**Status:** ✅ Confirmed and code-verified.
Verified in `dmmv.cpp` and `mmvq.cpp`:
| Format | DMMV Reorder | MMVQ Reorder | Notes |
|--------|:---:|:---:|-------|
| Q4_0 | ✅ | ✅ | Fully optimized (PR #12035, Feb 2025) |
| Q8_0 | ✅ | ✅ | Fixed by PR #21527 |
| Q4_K | ❌ (ABORT) | ✅ | DMMV reorder hits `GGML_ABORT` — unimplemented; `ggml-sycl.cpp:1238–1239` |
| Q6_K | ❌ | ✅ | MMVQ reorder exists (`mmvq.cpp:842`); DMMV has no reorder |
| Q5_K | ❌ | ❌ | Generic MMVQ only (`mmvq.cpp:835`); no reorder path of any kind |
| Q4_1, Q5_0, Q5_1 | ❌ | ❌ | Generic paths only |
| IQ4_NL | ❌ | ❌ | Only appears in SYCL native matmul path; 14% bandwidth figure is plausible |
Because the K-quants still lack a DMMV reorder path, any configuration that enables `GGML_SYCL_PRIORITIZE_DMMV` (or falls through to DMMV because of the matrix dimensions) will use the 8× slower generic path for Q4_K and Q6_K.
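The consequence can be summarized with a simplified decision sketch (not the literal `ggml-sycl.cpp` control flow, just the dispatch behavior the table implies):
```cpp
#include <cstdio>

enum class path { mmvq_reorder, mmvq_generic, dmmv_reorder, dmmv_generic };

// use_dmmv: GGML_SYCL_PRIORITIZE_DMMV is set, or the matrix shape selects DMMV
path pick_path(bool use_dmmv, bool has_dmmv_reorder, bool has_mmvq_reorder) {
    if (use_dmmv) {
        return has_dmmv_reorder ? path::dmmv_reorder    // Q4_0, Q8_0
                                : path::dmmv_generic;   // Q4_K/Q6_K land here: the 8x slower path
    }
    return has_mmvq_reorder ? path::mmvq_reorder        // Q4_0, Q8_0, Q4_K, Q6_K
                            : path::mmvq_generic;       // Q5_K, Q4_1, Q5_0, Q5_1
}

int main() {
    // Q4_K with DMMV prioritized: no DMMV reorder kernel exists, so the generic path runs.
    const path p = pick_path(/*use_dmmv=*/true, /*has_dmmv_reorder=*/false, /*has_mmvq_reorder=*/true);
    std::printf("Q4_K + PRIORITIZE_DMMV -> %s\n", p == path::dmmv_generic ? "generic DMMV (slow)" : "reorder");
}
```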
**Correction to `research/kernels/kernel_analysis_minimax.md`:** The Block Size table in that document contains **wrong values** for K-quants. Actual values from `ggml-common.h` (with `QK_K=256`, `K_SCALE_SIZE=12`):
| Format | Claimed | Actual |
|--------|---------|--------|
| Q4_0 | 18 bytes | **18 bytes** ✓ (`sizeof(ggml_half) + QK4_0/2`) |
| Q8_0 | 34 bytes | **34 bytes** ✓ (`sizeof(ggml_half) + QK8_0`) |
| Q4_K | 54 bytes | **144 bytes** (`2*2 + 12 + 128`) |
| Q5_K | 62 bytes | **176 bytes** (`2*2 + 12 + 128 + 32`) |
| Q6_K | 66 bytes | **210 bytes** (`2 + 16 + 192`) |
The broader hypothesis — that non-power-of-2 block sizes impede vectorization — remains valid. All sizes are non-power-of-2. But the specific numbers for K-quants were hallucinated.
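As a sanity check, the corrected sizes follow directly from the field layouts in `ggml-common.h`. A minimal sketch with simplified field declarations (`ggml_half` modeled as a 2-byte integer, `QK_K = 256`, `K_SCALE_SIZE = 12`):
```cpp
#include <cstdint>
#include <cstdio>

using ggml_half = std::uint16_t;   // stand-in for ggml's half type (2 bytes)

// Simplified layouts following ggml-common.h
struct block_q4_K { ggml_half d, dmin; std::uint8_t scales[12]; std::uint8_t qs[256 / 2]; };
struct block_q5_K { ggml_half d, dmin; std::uint8_t scales[12]; std::uint8_t qh[256 / 8]; std::uint8_t qs[256 / 2]; };
struct block_q6_K { std::uint8_t ql[256 / 2]; std::uint8_t qh[256 / 4]; std::int8_t scales[256 / 16]; ggml_half d; };

static_assert(sizeof(block_q4_K) == 144, "2*2 + 12 + 128");
static_assert(sizeof(block_q5_K) == 176, "2*2 + 12 + 32 + 128");
static_assert(sizeof(block_q6_K) == 210, "128 + 64 + 16 + 2");

int main() {
    std::printf("Q4_K=%zu Q5_K=%zu Q6_K=%zu bytes\n",
                sizeof(block_q4_K), sizeof(block_q5_K), sizeof(block_q6_K));
}
```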
---
## 4. Architecture Detection Bug (Vulkan / Arc 140T)
**Status:** ✅ Confirmed, both code-verified and issue-confirmed (#20776, open).
The Vulkan backend classifies Intel GPUs in `ggml-vulkan.cpp:329–356`. Detection of `INTEL_XE2` requires `minSubgroupSize == 16`. Intel itself documents that Xe2 uses SIMD16 while older architectures use SIMD8 (the comment in the code links to Intel's 2024 Tech Tour). Everything else falls through to `OTHER`, and cooperative matrix support is disabled for non-XE2 Intel (`ggml-vulkan.cpp:5119–5120`, `5543`).
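A simplified sketch of that classification behavior (not the literal `ggml-vulkan.cpp` code) shows how a mis-reported subgroup size silently downgrades the device:
```cpp
#include <cstdint>

enum class vk_device_architecture { OTHER, INTEL_XE2 };

// Only a minSubgroupSize of 16 is treated as Xe2; anything else falls through to
// OTHER, which in turn disables cooperative matrix support for the device.
vk_device_architecture classify_intel(std::uint32_t min_subgroup_size) {
    if (min_subgroup_size == 16) {
        return vk_device_architecture::INTEL_XE2;   // Xe2 (SIMD16)
    }
    return vk_device_architecture::OTHER;           // pre-Xe2 (SIMD8) and mis-reporting devices
}

int main() {
    // Arc 140T reports 8 -> OTHER; Arc 140V reports 16 -> INTEL_XE2
    return classify_intel(8) == vk_device_architecture::OTHER ? 0 : 1;
}
```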
The Arc 140T (Arrow Lake H) reports `minSubgroupSize = 8` despite being Xe2 hardware. This appears to be a driver-level reporting inconsistency: the 140V on Lunar Lake correctly reports 16. The result is that coopmat and all dependent optimizations are silently disabled on 140T systems.
MiniMax's proposed device-ID override fix is reasonable:
```cpp
if (props.deviceID == 0x7D51 || props.deviceID == 0x7D45) {
return vk_device_architecture::INTEL_XE2;
}
```
This is a targeted workaround; a systemic fix would require Intel to correct the driver-reported value.
---
## 5. K-Quant Crash on Xe2 iGPU (IPEX-LLM)
**Status:** ✅ Confirmed (issue #12318, `intel/ipex-llm`).
IPEX-LLM's bundled llama.cpp is based on an August 2024 snapshot. Q4_K_M and other K-quants crash on Arc 140V with `Sub-group size 8 is not supported on the device`. Upstream llama.cpp does not have this crash. Since ipex-llm is archived (see §7), this will not be fixed.
---
## 6. XMX/DPAS Underutilization
**Status:** ✅ Confirmed as a real gap; specific DPAS investigation is ongoing with no timeline.
XMX (Xe Matrix eXtensions) are used for FP16/BF16 GEMM via `joint_matrix` in the SYCL backend — confirmed by the 2.4× prompt-processing speedup with `-DGGML_SYCL_F16=ON`. They are **not** used for quantized token-generation kernels (DMMV/MMVQ), which are the actual bottleneck.
Rbiessy (Codeplay) confirmed in discussion #12570 that using XMX/DPAS for quantized kernels requires fixing memory-access patterns first: "currently the kernel is memory bound in the configurations we have tried." The reorder fix for Q8_0 was the necessary prerequisite; DPAS integration may follow but has no committed timeline.
---
## 7. IPEX-LLM Archival and Ecosystem Fragmentation
**Status:** ✅ Confirmed. ipex-llm `README.md` begins: "THIS PROJECT IS ARCHIVED."
Intel archived `intel/ipex-llm` on January 28, 2026. IPEX-LLM had measurable advantages over upstream llama.cpp SYCL — the GLM overview's data for Arc 140V (Llama-2-7B Q4_0: IPEX-LLM 24.35 t/s vs SYCL FP16 13.51 t/s, an ~80% advantage) is consistent with independent reports and a plausible consequence of IPEX-LLM's closed-source oneDNN integration and lower-level kernel access.
The optimizations that produced this gap were never upstreamed. The community must re-derive them independently.
The five partially-overlapping Intel inference stacks (llama.cpp SYCL, llama.cpp Vulkan, IPEX-LLM, OpenVINO, vLLM XPU) remain. They use different oneAPI and PyTorch versions and target different benchmarks. There is no unified Intel strategy for community-facing LLM inference.
---
## 8. Linux Kernel 6.18 Break
**Status:** ✅ Confirmed. `intel/compute-runtime` issue #875 documents this.
After upgrading to Linux kernel 6.18, device memory allocation via Level Zero fails with `UR_RESULT_ERROR_OUT_OF_DEVICE_MEMORY`. Kernel 6.17 works. The failure is in `bindless_heaps_helper.cpp` (the file exists in `repos/compute-runtime/shared/source/helpers/`). Users on rolling-release distributions (Arch, Fedora) are most exposed. Kernel versions up through 6.17 work, so the usable range is wide, but 6.18 is a sharp regression.
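For anyone checking whether their system is affected, a small device-USM allocation is enough to exercise the failing Level Zero path. This is a hedged repro sketch, not code taken from the issue:
```cpp
#include <sycl/sycl.hpp>
#include <iostream>

int main() {
    try {
        sycl::queue q{sycl::gpu_selector_v};
        // Device USM is allocated through Level Zero; on kernel 6.18 with the affected
        // compute-runtime this is where the out-of-device-memory error surfaces.
        void *p = sycl::malloc_device(512ull << 20, q);  // 512 MiB; the size is arbitrary
        if (p == nullptr) {
            std::cerr << "device allocation returned nullptr\n";
            return 1;
        }
        sycl::free(p, q);
        std::cout << "device allocation ok\n";
    } catch (const sycl::exception &e) {
        std::cerr << "SYCL error: " << e.what() << "\n";
        return 1;
    }
    return 0;
}
```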
---
## 9. vLLM Intel XPU
**Status:** ✅ Issue #27408 confirmed (B-series SIGABRT during model inspection).
vLLM's Intel XPU backend is in active transition: it has officially deprecated IPEX in favor of `vllm-xpu-kernels` (v0.11.x+). The new kernel repo is not yet at feature parity. AWQ/GPTQ formats remain CUDA-only (torchao codepath); GGUF is unsupported in vLLM entirely. The Arc Pro B-series crash (#27408) occurs during model inspection on Battlemage hardware.
---
## 10. Summary: What Is Solid vs. Uncertain
### Solid (code-verified or multi-source confirmed)
- Q8_0 performance disaster on Xe2 and its root cause (iter_stride / missing reorder path)
- PR #21527 fix: 3.1× speedup for Q8_0, merged
- Q4_K DMMV reorder is unimplemented (GGML_ABORT in the code)
- Q5_K has no reorder support at all
- Arc 140T Vulkan misdetection mechanism (minSubgroupSize=8 vs 16 check)
- ipex-llm archived Jan 28, 2026
- compute-runtime kernel 6.18 break confirmed
- 34-byte Q8_0 block structure; 18-byte Q4_0 block structure
### Reasonable / Well-Supported But Not Independently Re-Benchmarked
- Specific t/s numbers from community benchmarks (20.56 Q4_K_M, 4.88 Q8_0 pre-fix) — these appear in a real GitHub issue
- ~80% IPEX-LLM advantage over upstream SYCL on Xe2 iGPU
- IQ4_NL at 14% bandwidth — stated in discussion #12570 by a community contributor
- Xe2-specific regression for Q8_0 (inverse: Q8_0 faster than Q4 on Xe1 per issue #19887)
### Flagged as Incorrect (Research Document Only)
- `research/kernels/kernel_analysis_minimax.md` Block Size table for K-quants (Q4_K/Q5_K/Q6_K sizes are wrong — see §3)
### Not Independently Verified
- compute-runtime issues #627 (4 GB allocation cap), #750 (free memory reporting), #890 (Arrow Lake OOM hangs) — referenced only in Kimi's overview
- Mesa version sensitivity in Vulkan TG on Intel (referenced as Discussion #10879)
- Xe2-specific microarchitecture differences (L2 size, prefetch depth) — Intel has not published a detailed Xe2 optimization guide; the figures cited by GLM are extrapolated
---
*No code or driver modifications were made in producing this document.*