sleepy 1c374e1262 feat: phase 4 host-copy fix + docker build script + test machine docs

Phase 4: Remove blanket Linux host-buffer double-copy in set_tensor.
The #ifndef _WIN32 guard penalized all Linux Intel GPUs with an extra
malloc/memcpy/free per tensor load for a PVC-only bug. Now opt-in via
GGML_SYCL_MMAP_WORKAROUND=1.

Also adds:
- docker-build-test.sh for local amd64 SYCL build verification
- test-machine-megumin.md with hardware/software env and test procedures
- Updated apply-phase.sh to support phase 4
- Updated workplan with corrected council composition (GLM/Minimax/Kimi)

2026-04-15 15:35:29 +02:00

File	Last commit	Date
repos	feat: phase 4 host-copy fix + docker build script + test machine docs	2026-04-15 15:35:29 +02:00
research	Initial commit: Intel Arc GPU LLM inference diagnosis research	2026-04-15 13:07:03 +02:00
.gitignore	feat: phased patch system for Intel Arc GPU performance fixes	2026-04-15 14:53:40 +02:00
.gitmodules	Add upstream repos as git submodules (shallow clones)	2026-04-15 13:19:15 +02:00
empirical_findings.md	Add empirical findings from Arc A770 + RX 580 testing	2026-04-15 14:22:54 +03:00
overview_glm.md	Initial commit: Intel Arc GPU LLM inference diagnosis research	2026-04-15 13:07:03 +02:00
overview_kimi.md	Initial commit: Intel Arc GPU LLM inference diagnosis research	2026-04-15 13:07:03 +02:00
overview_minimax.md	Initial commit: Intel Arc GPU LLM inference diagnosis research	2026-04-15 13:07:03 +02:00
overview.md	Add cross-verified synthesis overview and link from README	2026-04-15 13:48:04 +02:00
README.md	Merge pull request #2 from alex4o/sycl-optimization-analysis	2026-04-15 13:49:02 +02:00
sycl_optimization_analysis.md	Add SYCL optimization analysis from hands-on debugging sessions	2026-04-15 14:45:31 +03:00

README.md

Intel Arc GPU — LLM Inference Diagnosis

Research into why Intel Arc GPUs (Alchemist / Xe1 and Battlemage / Xe2) severely underperform on quantized LLM inference, often achieving only 21–40% of theoretical memory bandwidth during token generation — compared to 80–95% on equivalent NVIDIA and AMD hardware.

The Problem

Intel Arc GPUs look great on paper for LLM inference: ample VRAM, wide memory buses, dedicated XMX matrix engines. In practice, community benchmarks consistently show:

- Q8_0 quantized models running 4–5× slower than Q4_K_M despite only moving 1.7× more data
- Token generation achieving only 21% of peak bandwidth on some quantization types
- Wildly inconsistent performance across SYCL, Vulkan, OpenVINO, and IPEX-LLM backends
- Architecture-specific regressions on Xe2 (Battlemage) that don't exist on Xe1 (Alchemist)
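The bandwidth figures above can be made concrete with a little arithmetic. The sketch below is illustrative only: it assumes token generation is purely memory-bandwidth-bound, so that a format moving 1.7× more data should be about 1.7× slower. The function names are hypothetical, the 560 GB/s peak is an assumed figure for a 16 GB Arc A770, and the model size and throughput in the last example are placeholders, not measurements from this repository.

```python
def bandwidth_utilization(tokens_per_s, bytes_per_token, peak_bytes_per_s):
    """Fraction of peak memory bandwidth used during token generation.

    Assumes each generated token streams `bytes_per_token` bytes of weights
    from VRAM exactly once (a simplification that ignores caches and KV reads).
    """
    return tokens_per_s * bytes_per_token / peak_bytes_per_s


def relative_efficiency(observed_slowdown, data_ratio):
    """Effective bandwidth of the slow format relative to the fast one.

    If both formats were equally bandwidth-bound, the slowdown would equal
    the data ratio and this function would return 1.0.
    """
    return data_ratio / observed_slowdown


# Figures quoted in this README: Q8_0 moves 1.7x the data of Q4_K_M
# but runs 4-5x slower.
DATA_RATIO = 1.7
for slowdown in (4.0, 5.0):
    eff = relative_efficiency(slowdown, DATA_RATIO)
    print(f"{slowdown:.0f}x slower: Q8_0 at {eff:.3f} of Q4_K_M's effective bandwidth")

# Placeholder inputs (NOT measurements): assumed peak bandwidth, a
# hypothetical 7 GB of weights read per token, hypothetical 20 tokens/s.
ASSUMED_PEAK_BYTES_PER_S = 560e9
HYPOTHETICAL_BYTES_PER_TOKEN = 7e9
util = bandwidth_utilization(20.0, HYPOTHETICAL_BYTES_PER_TOKEN, ASSUMED_PEAK_BYTES_PER_S)
print(f"utilization: {util:.2f}")
```

Under these assumptions, a 4–5× slowdown against a 1.7× data ratio means Q8_0 achieves well under half of Q4_K_M's effective bandwidth, which points at the kernel path rather than the memory system.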

The root causes are multi-layered: missing kernel optimizations in llama.cpp, a fragmented Intel software stack (five semi-independent efforts that don't interoperate), quantization-specific dispatch path bugs, and an overall underinvestment in open-source kernel development for Intel GPU architectures.

Empirical Findings

- Empirical Findings: Real-world benchmarks and configurations from an Arc A770 + RX 580 system running llama.cpp with Qwen3.5-35B-A3B MoE. Includes driver setup (xe vs i915), SYCL/Vulkan status, performance tables, and a matrix of working and broken configurations.
- SYCL Optimization Analysis: Deep-dive into why the SYCL backend is slow. Covers 6 root causes (including double-buffered memory, disabled graph execution, blocking .wait() calls, and DPCT translation artifacts), a comparison of the Vulkan and SYCL submission architectures, kernel dispatch issues, and a prioritized improvement roadmap.

Summary of Findings

overview.md — Cross-verified synthesis of all three agent overviews. Every major claim was checked against live GitHub issues/PRs and the actual source code in repos/. Includes confirmed findings, one correction to a research document (K-quant block sizes), and a clear breakdown of what is solid vs. uncertain.

Overviews

Each overview was independently produced by a different LLM, analyzing community issues, kernel source code, driver stacks, and benchmark data:

- Kimi's Overview: Focuses on driver/runtime stack mapping, quantization kernel inefficiencies (DMMV vs. MMVQ paths), and the missing reorder optimization for Q8_0.
- GLM's Overview: Broadest scope, with a full stack architecture diagram, version compatibility matrix, fragmentation analysis across five Intel inference stacks, and the Battlemage regression class.
- MiniMax's Overview: Hardware landscape, per-GPU status table, critical issue triage (Q8_0 catastrophe, iGPU misdetection), and kernel-level root cause analysis.

Research

Supporting deep-dives in research/:

- research/kernels/kernel_analysis_minimax.md: Detailed kernel dispatch path analysis
- research/community_issues/issues_and_discourse_minimax.md: Curated community issue reports and discourse

Repo Map

The repos/ directory contains source clones of the relevant Intel GPU and LLM inference projects for offline analysis (not tracked in this repository):

Repository	Purpose
`llama.cpp`	SYCL & Vulkan backends, GGUF quantization kernels
`ipex-llm`	Intel's former PyTorch integration layer (archived Jan 2026)
`intel-extension-for-pytorch`	PyTorch XPU extension (deprecated)
`compute-runtime`	Intel Level Zero / OpenCL driver (NEO)
`intel-graphics-compiler`	JIT compiler (SYCL → Xe ISA)
`oneDNN`	Deep-learning primitive library
`vllm`	vLLM mainline (XPU backend in flux)
`vllm-xpu-kernels`	Dedicated Intel kernel repo for vLLM
`level-zero`	Level Zero loader and headers
`llvm`	DPC++ / SYCL compiler toolchain
`openvino`	Intel's inference optimizer/runtime
`sycl-tla`	SYCL Templates for Linear Algebra (CUTLASS-style kernel templates for Intel GPUs)

License

This research documentation is released under CC0. Referenced repositories carry their own licenses.