Files
Pi Agent 9dbf9d428b Update README with council results, root cause analysis, and patch phases
- Add Key Result section: +19% Q4_0 token generation (Phase 5)
- Add Root Cause Analysis section linking to corrected dp4a-bound findings
- Add Patch Development section with council structure and phases summary table
- Add Next Steps section for Phase 6+ and upstream contributions
- Add Council Deliberations section listing cross-review analysis files
- Reorganize sections for better flow: problem → results → analysis → findings
2026-04-15 20:08:16 +03:00

117 lines
7.6 KiB
Markdown
Raw Permalink Blame History

This file contains ambiguous Unicode characters
This file contains Unicode characters that might be confused with other characters. If you think that this is intentional, you can safely ignore this warning. Use the Escape button to reveal them.
# Intel Arc GPU — LLM Inference Diagnosis
Research into why Intel Arc GPUs (Alchemist / Xe1 and Battlemage / Xe2) severely underperform on quantized LLM inference, often achieving only **2140% of theoretical memory bandwidth** during token generation — compared to 8095% on equivalent NVIDIA and AMD hardware.
## Key Result: +19% Q4_0 Token Generation
Through a 3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzing llama.cpp SYCL kernel performance on an **Intel Arc A770 16GB**, we identified that Q4_0 token generation was **dp4a-compute-bound** (not memory-bandwidth-bound as previously assumed) and achieved a **+19% improvement** (29.4 → 35.96 t/s) by tuning the `vdr_mmvq` parameter from 2 → 4.
| Config | Q4_0 tg128 | Q4_0 BW% | Q8_0 tg128 | Q8_0 BW% | Q4_K_M tg128 |
|--------|-----------|----------|-----------|----------|-------------|
| Baseline (HEAD) | 29.4 | 29% | 28.6 | 29% | — |
| **+Phase 5 (vdr_mmvq)** | **35.96** | **35%** | **30.82** | **32%** | 25.32 |
See [`repos/patch/README.md`](repos/patch/README.md) for full benchmark methodology and results.
## The Problem
Intel Arc GPUs look great on paper for LLM inference: ample VRAM, wide memory buses, dedicated XMX matrix engines. In practice, community benchmarks consistently show:
- **Q8_0 quantized models running 45× slower** than Q4_K_M despite only moving 1.7× more data
- Token generation achieving only **21% of peak bandwidth** on some quantization types
- Wildly inconsistent performance across SYCL, Vulkan, OpenVINO, and IPEX-LLM backends
- Architecture-specific regressions on Xe2 (Battlemage) that don't exist on Xe1 (Alchemist)
The root causes are multi-layered: missing kernel optimizations in `llama.cpp`, a fragmented Intel software stack (five semi-independent efforts that don't interoperate), quantization-specific dispatch path bugs, and an overall underinvestment in open-source kernel development for Intel GPU architectures.
## Empirical Findings
- **[Empirical Findings](empirical_findings.md)** — Real-world benchmarks and configurations from an Arc A770 + RX 580 system running llama.cpp with Qwen3.5-35B-A3B MoE. Includes driver setup (xe vs i915), SYCL/Vulkan status, performance tables, and working/broken configuration matrix.
- **[SYCL Optimization Analysis](sycl_optimization_analysis.md)** — Deep-dive into why the SYCL backend is slow: 6 root causes (double-buffered memory, disabled graph execution, blocking `.wait()` calls, DPCT translation artifacts), Vulkan vs SYCL submission architecture comparison, kernel dispatch issues, and a prioritized improvement roadmap.
## Root Cause Analysis
**[`logs/root-cause-analysis-20260415.md`](logs/root-cause-analysis-20260415.md)** — Corrected root cause for Q4_0 underperformance. Previous analysis blamed SYCL submission model overhead; empirical profiling proved this **wrong**. The real bottleneck:
1. Q4_0 nibble packing requires **2 dp4a operations per byte** (low + high nibbles), while Q8_0 needs only 1 dp4a per byte
2. Both formats hit the same dp4a throughput ceiling → same ~30 t/s despite Q8_0 reading 1.76× more data
3. The SYCL queue naturally batches async submissions (CPU submits 1077 ops in 7.5ms vs 32ms GPU execution) — the GPU is never starved
4. XMX/DPAS matrix units are **not used** for quantized kernels — only integer dp4a through the EU datapath
## Patch Development (Council)
A 3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) developed and cross-reviewed patches through a phased system:
- **[`repos/patch/README.md`](repos/patch/README.md)** — Patch phases, benchmark results, and status
- **[`logs/workplan.md`](logs/workplan.md)** — Council structure, testing protocol, and guidelines
- **[`logs/decisions.md`](logs/decisions.md)** — All council decisions with rationale
### Patch Phases Summary
| Phase | Change | Result |
|-------|--------|--------|
| 1 — SYCL Graph Default | Enable graph by default | ⚠️ **Crashes on MoE** (`async_malloc` failure). Original disabled default was correct. |
| 2 — Kernel Tuning | Fix VER_GEN thresholds, DMMV tuning | ✅ Neutral on 9B dense |
| 3 — Vulkan Arc 140T | Xe2 device-ID override | ⏳ Not tested (missing spirv-headers) |
| 4 — Host-Buffer Copy | Remove blanket Linux double-copy | ✅ Neutral on 9B dense |
| **5 — Q4_0 vdr_mmvq** | **vdr_mmvq 2→4 for Q4_0 reorder** | **✅ +19% Q4_0 tg128** |
Further improvements require DPAS/XMX integration or algorithmic changes to the nibble dot-product. See the [Next Steps](#next-steps) section below.
## Summary of Findings
**[`overview.md`](overview.md)** — Cross-verified synthesis of all three agent overviews. Every major claim was checked against live GitHub issues/PRs and the actual source code in `repos/`. Includes confirmed findings, one correction to a research document (K-quant block sizes), and a clear breakdown of what is solid vs. uncertain.
## Overviews
Each overview was independently produced by a different LLM, analyzing community issues, kernel source code, driver stacks, and benchmark data:
- **[Kimi's Overview](overview_kimi.md)** — Focuses on driver/runtime stack mapping, quantization kernel inefficiencies (DMMV vs. MMVQ paths), and the missing reorder optimization for Q8_0.
- **[GLM's Overview](overview_glm.md)** — Broadest scope: full stack architecture diagram, version compatibility matrix, fragmentation analysis across five Intel inference stacks, and the Battlemage regression class.
- **[MiniMax's Overview](overview_minimax.md)** — Hardware landscape, per-GPU status table, critical issue triage (Q8_0 catastrophe, iGPU misdetection), and kernel-level root cause analysis.
## Research
Supporting deep-dives in [`research/`](research/):
- [`research/kernels/kernel_analysis_minimax.md`](research/kernels/kernel_analysis_minimax.md) — Detailed kernel dispatch path analysis
- [`research/community_issues/issues_and_discourse_minimax.md`](research/community_issues/issues_and_discourse_minimax.md) — Curated community issue reports and discourse
## Council Deliberations
Internal council analysis files (in `logs/`, gitignored):
- `logs/M-sync-overhead-*.md` — Agent-M SYCL submission analysis
- `logs/K-kernel-tuning-*.md` — Agent-K kernel tuning analysis
- `logs/M-review-K-*.md`, `logs/K-review-M-*.md` — Cross-reviews
- `logs/benchmark-research.md` — Benchmark methodology research
## Repo Map
The `repos/` directory contains source clones of the relevant Intel GPU and LLM inference projects for offline analysis (not tracked in this repository):
| Repository | Purpose |
|---|---|
| `llama.cpp` | SYCL & Vulkan backends, GGUF quantization kernels |
| `ipex-llm` | Intel's former PyTorch integration layer (archived Jan 2026) |
| `intel-extension-for-pytorch` | PyTorch XPU extension (deprecated) |
| `compute-runtime` | Intel Level Zero / OpenCL driver (NEO) |
| `intel-graphics-compiler` | JIT compiler (SYCL → Xe ISA) |
| `oneDNN` | Deep-learning primitive library |
| `vllm` | vLLM mainline (XPU backend in flux) |
| `vllm-xpu-kernels` | Dedicated Intel kernel repo for vLLM |
| `level-zero` | Level Zero loader and headers |
| `llvm` | DPC++ / SYCL compiler toolchain |
| `openvino` | Intel's inference optimizer/runtime |
| `sycl-tla` | SYCL abstraction layer |
## Next Steps
- **Phase 6+ (Deferred):** Q4_K / Q6_K DMMV reorder, Q5_K reorder, DPAS/XMX integration for quantized kernels
- **Upstream contributions:** Patches prepared against llama.cpp HEAD, ready for submission
- **Key blocker for further gains:** DPAS/XMX integration requires substantial kernel rewrites
## License
This research documentation is released under [CC0](https://creativecommons.org/publicdomain/zero/1.0/). Referenced repositories carry their own licenses.