# Intel Arc GPU — LLM Inference Diagnosis
Research into why Intel Arc GPUs (Alchemist / Xe1 and Battlemage / Xe2) severely underperform on quantized LLM inference, often achieving only 21–40% of theoretical memory bandwidth during token generation — compared to 80–95% on equivalent NVIDIA and AMD hardware.
## Key Result: +19% Q4_0 Token Generation
Through a 3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzing llama.cpp SYCL kernel performance on an Intel Arc A770 16GB, we identified that Q4_0 token generation was dp4a-compute-bound (not memory-bandwidth-bound as previously assumed) and achieved a +19% improvement (29.4 → 35.96 t/s) by tuning the `vdr_mmvq` parameter from 2 to 4.
| Config | Q4_0 tg128 (t/s) | Q4_0 BW% | Q8_0 tg128 (t/s) | Q8_0 BW% | Q4_K_M tg128 (t/s) |
|---|---|---|---|---|---|
| Baseline (HEAD) | 29.4 | 29% | 28.6 | 29% | — |
| +Phase 5 (`vdr_mmvq`) | 35.96 | 35% | 30.82 | 32% | 25.32 |
See `repos/patch/README.md` for full benchmark methodology and results.
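The BW% columns report effective read bandwidth as a share of theoretical peak. A minimal sketch of that arithmetic, assuming the A770's ~560 GB/s spec peak and an illustrative ~5.5 GB of Q4_0 weights streamed per token (both values are assumptions for illustration, not taken from the benchmark logs):

```cpp
#include <cstdio>

// Token generation re-reads (roughly) every weight once per token, so
// effective bandwidth ~= tokens/s * bytes of weights.
double bandwidth_pct(double tok_per_s, double weight_bytes, double peak_bytes_per_s) {
    return 100.0 * tok_per_s * weight_bytes / peak_bytes_per_s;
}

int main() {
    const double peak    = 560e9; // Arc A770 spec peak, ~560 GB/s -- assumption
    const double weights = 5.5e9; // illustrative Q4_0 model size  -- assumption
    std::printf("baseline: %.0f%%\n", bandwidth_pct(29.40, weights, peak)); // ~29%
    std::printf("phase 5 : %.0f%%\n", bandwidth_pct(35.96, weights, peak)); // ~35%
    return 0;
}
```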
## The Problem
Intel Arc GPUs look great on paper for LLM inference: ample VRAM, wide memory buses, dedicated XMX matrix engines. In practice, community benchmarks consistently show:
- Q8_0 quantized models running 4–5× slower than Q4_K_M despite only moving 1.7× more data
- Token generation achieving only 21% of peak bandwidth on some quantization types
- Wildly inconsistent performance across SYCL, Vulkan, OpenVINO, and IPEX-LLM backends
- Architecture-specific regressions on Xe2 (Battlemage) that don't exist on Xe1 (Alchemist)
The root causes are multi-layered: missing kernel optimizations in llama.cpp, a fragmented Intel software stack (five semi-independent efforts that don't interoperate), quantization-specific dispatch path bugs, and an overall underinvestment in open-source kernel development for Intel GPU architectures.
## Empirical Findings
- Empirical Findings — Real-world benchmarks and configurations from an Arc A770 + RX 580 system running llama.cpp with Qwen3.5-35B-A3B MoE. Includes driver setup (xe vs i915), SYCL/Vulkan status, performance tables, and working/broken configuration matrix.
- SYCL Optimization Analysis — Deep-dive into why the SYCL backend is slow: six root causes (including double-buffered memory, disabled graph execution, blocking `.wait()` calls, and DPCT translation artifacts), Vulkan vs SYCL submission architecture comparison, kernel dispatch issues, and a prioritized improvement roadmap.
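A minimal SYCL sketch (hypothetical kernel, not llama.cpp code) of the blocking-`.wait()` anti-pattern versus letting an in-order queue batch asynchronous submissions:

```cpp
#include <sycl/sycl.hpp>

int main() {
    // In-order queue: kernels run in submission order with no host round trips.
    sycl::queue q{sycl::gpu_selector_v, sycl::property::queue::in_order{}};
    float *x = sycl::malloc_device<float>(1024, q);
    q.fill(x, 1.0f, 1024);

    // Anti-pattern: blocking the host after every kernel serializes CPU
    // submission against GPU execution, one round trip per op.
    for (int i = 0; i < 100; ++i) {
        q.parallel_for(sycl::range<1>{1024},
                       [=](sycl::id<1> j) { x[j] = x[j] * 2.0f + 1.0f; });
        q.wait(); // host stalls here each iteration
    }

    // Batched: submit the whole chain asynchronously and sync once. This is
    // the behavior the profiling described below observed (submission far
    // cheaper than execution, so the GPU is never starved).
    for (int i = 0; i < 100; ++i) {
        q.parallel_for(sycl::range<1>{1024},
                       [=](sycl::id<1> j) { x[j] = x[j] * 2.0f + 1.0f; });
    }
    q.wait(); // single sync at the end

    sycl::free(x, q);
    return 0;
}
```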
## Root Cause Analysis
`logs/root-cause-analysis-20260415.md` — Corrected root cause for Q4_0 underperformance. Previous analysis blamed SYCL submission-model overhead; empirical profiling proved this wrong. The real bottleneck:
- Q4_0 nibble packing requires 2 dp4a operations per 32-bit word of packed weights (one for the low nibbles, one for the high), while Q8_0 needs only 1 per word; since a Q4_0 word packs twice as many weights, both formats issue the same number of dp4a ops per weight
- Both formats hit the same dp4a throughput ceiling → same ~30 t/s despite Q8_0 reading 1.76× more data
- The SYCL queue naturally batches async submissions (CPU submits 1077 ops in 7.5ms vs 32ms GPU execution) — the GPU is never starved
- XMX/DPAS matrix units are not used for quantized kernels — only integer dp4a through the EU datapath
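To make the op-count argument concrete, here is a scalar C++ emulation of dp4a (a stand-in for the hardware instruction) showing why a 32-bit word of Q4_0 weights costs two dp4a where a Q8_0 word costs one; the Q4_0 zero-point handling via the block scale is omitted:

```cpp
#include <cstdint>

// Scalar emulation of dp4a: dot of the four signed bytes packed into each
// 32-bit word, accumulated into c. Hardware does this in one instruction.
static int32_t dp4a(int32_t a, int32_t b, int32_t c) {
    for (int i = 0; i < 4; ++i) {
        c += int8_t(a >> 8 * i) * int8_t(b >> 8 * i);
    }
    return c;
}

// Q8_0: 4 quantized weights per 32-bit word -> 1 dp4a per word.
int32_t dot_q8_word(int32_t w, int32_t act, int32_t acc) {
    return dp4a(w, act, acc);
}

// Q4_0: 8 quantized weights (nibbles) per 32-bit word -> 2 dp4a per word,
// one for the low nibbles and one for the high nibbles.
// (The Q4_0 -8 zero-point is folded in via the block scale elsewhere.)
int32_t dot_q4_word(int32_t w, int32_t act_lo, int32_t act_hi, int32_t acc) {
    int32_t lo = w & 0x0F0F0F0F;        // low nibble of each byte
    int32_t hi = (w >> 4) & 0x0F0F0F0F; // high nibble of each byte
    acc = dp4a(lo, act_lo, acc);
    acc = dp4a(hi, act_hi, acc);
    return acc;
}
// Per weight the counts are equal (2 dp4a / 8 weights == 1 dp4a / 4 weights),
// so both formats saturate the same dp4a throughput ceiling.
```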
## Patch Development (Council)
A 3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) developed and cross-reviewed patches through a phased system:
- `repos/patch/README.md` — Patch phases, benchmark results, and status
- `logs/workplan.md` — Council structure, testing protocol, and guidelines
- `logs/decisions.md` — All council decisions with rationale
### Patch Phases Summary
| Phase | Change | Result |
|---|---|---|
| 1 — SYCL Graph Default | Enable graph by default | ⚠️ Crashes on MoE (`async_malloc` failure). Original disabled default was correct. |
| 2 — Kernel Tuning | Fix `VER_GEN` thresholds, DMMV tuning | ✅ Neutral on 9B dense |
| 3 — Vulkan Arc 140T | Xe2 device-ID override | ⏳ Not tested (missing `spirv-headers`) |
| 4 — Host-Buffer Copy | Remove blanket Linux double-copy | ✅ Neutral on 9B dense |
| 5 — Q4_0 vdr_mmvq | `vdr_mmvq` 2→4 for Q4_0 reorder | ✅ +19% Q4_0 tg128 |
Further improvements require DPAS/XMX integration or algorithmic changes to the nibble dot-product. See the Next Steps section below.
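The Phase 5 change itself is tiny. Assuming the SYCL backend mirrors the CUDA backend's `VDR_Q4_0_Q8_1_MMVQ` constant (the number of packed quant words each work-item dots per MMVQ iteration), the tuning is roughly the following; see `repos/patch/README.md` for the actual diff:

```cpp
// ggml/src/ggml-sycl/vecdotq.hpp (path and constant name assumed to mirror
// the CUDA backend; verify against the actual patch).
// Was: #define VDR_Q4_0_Q8_1_MMVQ 2
// Phase 5 widens it to 4, so each work-item issues more dp4a back-to-back,
// raising instruction-level parallelism on the dp4a-bound EU datapath.
#define VDR_Q4_0_Q8_1_MMVQ 4
```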
## Summary of Findings
`overview.md` — Cross-verified synthesis of all three agent overviews. Every major claim was checked against live GitHub issues/PRs and the actual source code in `repos/`. Includes confirmed findings, one correction to a research document (K-quant block sizes), and a clear breakdown of what is solid vs. uncertain.
## Overviews
Each overview was independently produced by a different LLM, analyzing community issues, kernel source code, driver stacks, and benchmark data:
- Kimi's Overview — Focuses on driver/runtime stack mapping, quantization kernel inefficiencies (DMMV vs. MMVQ paths), and the missing reorder optimization for Q8_0.
- GLM's Overview — Broadest scope: full stack architecture diagram, version compatibility matrix, fragmentation analysis across five Intel inference stacks, and the Battlemage regression class.
- MiniMax's Overview — Hardware landscape, per-GPU status table, critical issue triage (Q8_0 catastrophe, iGPU misdetection), and kernel-level root cause analysis.
## Research
Supporting deep-dives in `research/`:
- `research/kernels/kernel_analysis_minimax.md` — Detailed kernel dispatch path analysis
- `research/community_issues/issues_and_discourse_minimax.md` — Curated community issue reports and discourse
## Council Deliberations
Internal council analysis files (in `logs/`, gitignored):
- `logs/M-sync-overhead-*.md` — Agent-M SYCL submission analysis
- `logs/K-kernel-tuning-*.md` — Agent-K kernel tuning analysis
- `logs/M-review-K-*.md`, `logs/K-review-M-*.md` — Cross-reviews
- `logs/benchmark-research.md` — Benchmark methodology research
## Repo Map
The `repos/` directory contains source clones of the relevant Intel GPU and LLM inference projects for offline analysis (not tracked in this repository):
| Repository | Purpose |
|---|---|
| `llama.cpp` | SYCL & Vulkan backends, GGUF quantization kernels |
| `ipex-llm` | Intel's former PyTorch integration layer (archived Jan 2026) |
| `intel-extension-for-pytorch` | PyTorch XPU extension (deprecated) |
| `compute-runtime` | Intel Level Zero / OpenCL driver (NEO) |
| `intel-graphics-compiler` | JIT compiler (SYCL → Xe ISA) |
| `oneDNN` | Deep-learning primitive library |
| `vllm` | vLLM mainline (XPU backend in flux) |
| `vllm-xpu-kernels` | Dedicated Intel kernel repo for vLLM |
| `level-zero` | Level Zero loader and headers |
| `llvm` | DPC++ / SYCL compiler toolchain |
| `openvino` | Intel's inference optimizer/runtime |
| `sycl-tla` | SYCL abstraction layer |
## Next Steps
- Phase 6+ (Deferred): Q4_K / Q6_K DMMV reorder, Q5_K reorder, DPAS/XMX integration for quantized kernels
- Upstream contributions: Patches prepared against llama.cpp HEAD, ready for submission
- Key blocker for further gains: DPAS/XMX integration requires substantial kernel rewrites
## License
This research documentation is released under CC0. Referenced repositories carry their own licenses.