# Intel GPU Diagnosis Patches
Patches generated by a three-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzing Intel Arc GPU performance issues in llama.cpp. The models' non-overlapping training data helps each one's blind spots get caught in cross-review.
## Quick Start

```bash
# Apply one phase at a time (recommended):
cd repos
./patch/apply-phase.sh 1   # Phase 1: SYCL graph default
./patch/apply-phase.sh 2   # Phase 2: Kernel tuning (VER_GEN + DMMV)
./patch/apply-phase.sh 3   # Phase 3: Vulkan Arc 140T fix (independent)
./patch/apply-phase.sh 4   # Phase 4: Host-buffer copy opt-in

# Reverse:
./patch/apply-phase.sh 1 --reverse
./patch/apply-phase.sh all --reverse   # Reverse all phases
```
## Test Environment

- GPU: Intel Arc A770 16GB (card0, xe driver, PCI 8086:56a0)
- Secondary: AMD RX 580 8GB (card1, amdgpu — not used for SYCL)
- CPU: AMD Ryzen 7 5800X, 32GB DDR4
- OS: CachyOS, kernel 6.19.10
- oneAPI: 2025.3.2, DPC++ compiler 2025.3.2
- Build:

```
-DGGML_SYCL=ON -DGGML_VULKAN=OFF -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release
```
## Benchmark Results (Updated 2026-04-15, llama-bench)

Method: `llama-bench -ngl 99 -p 512 -n 128 -r 3` (3 repeats).
### Qwen3.5-9B (dense, fits entirely in VRAM)
| Config | Q4_0 tg128 (t/s) | Q8_0 tg128 (t/s) | Q4_K_M tg128 (t/s) |
|---|---|---|---|
| Baseline (HEAD) | 29.4 | 28.6 | — |
| +Phase 2 (VER+tuning) | 30.16 | 29.96 | 24.65 |
| +Phase 2+5 (vdr_mmvq) | 35.96 | 30.82 | 25.32 |
Phase 5 is the first significant improvement: +19% on Q4_0 token generation.
### Bandwidth Utilization
| Config | Q4_0 BW | Q4_0 BW% | Q8_0 BW | Q8_0 BW% |
|---|---|---|---|---|
| Baseline | 150 GiB/s | 29% | 265 GiB/s | 52% |
| +Phase 5 | 180 GiB/s | 35% | 273 GiB/s | 53% |

BW% is the share of the A770's theoretical peak memory bandwidth.
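These figures follow from effective BW ≈ weight bytes streamed per token × tokens/s, since token generation re-reads the full weight set for every token. A quick sanity check, using assumed (not measured) GGUF sizes for a 9B model:

```cpp
// Rough cross-check of the bandwidth table (illustrative only).
// Assumed weight sizes for a 9B model: ~5.0 GiB at Q4_0, ~8.8 GiB at Q8_0;
// check the actual GGUF file sizes before relying on these numbers.
#include <cstdio>

int main() {
    const double q4_0_gib = 5.0;   // assumed Q4_0 weight size
    const double q8_0_gib = 8.8;   // assumed Q8_0 weight size
    // Effective bandwidth ~= weights read per token * tokens per second.
    std::printf("Q4_0: ~%.0f GiB/s\n", q4_0_gib * 30.16);  // ~150 GiB/s
    std::printf("Q8_0: ~%.0f GiB/s\n", q8_0_gib * 29.96);  // ~264 GiB/s
    return 0;
}
```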
### Root Cause (corrected)
Previous analysis attributed the low BW utilization to "SYCL submission model overhead". This was WRONG. Per-op profiling proves:
- SYCL queue naturally batches async kernel submissions (CPU submits all 1077 ops in 7.5ms vs 32ms GPU execution)
- Q4_0 is dp4a-compute-bound, not memory-BW-bound
- Nibble packing requires 2 dp4a per byte (low + high nibbles), while Q8_0 needs only 1 dp4a per byte
- Both formats hit the same dp4a throughput ceiling → same ~30 t/s despite different data sizes
- Phase 5's vdr_mmvq increase processes more blocks per subgroup, better amortizing dp4a overhead
See `../logs/root-cause-analysis-20260415.md` for full profiling data.
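To make the 2:1 dp4a ratio concrete, here is a minimal scalar sketch (not the actual llama.cpp SYCL kernel; per-block scales and Q4_0's -8 zero-point are omitted, and `dp4a` is emulated in plain C++):

```cpp
#include <cstdint>
#include <cstdio>

// Emulated dp4a: treats a and b as 4 packed int8 lanes and returns
// acc + sum(a[i] * b[i]). On Arc this maps to a single hardware instruction.
static int dp4a(uint32_t a, uint32_t b, int acc) {
    for (int i = 0; i < 4; ++i) {
        acc += static_cast<int8_t>(a >> (8 * i)) *
               static_cast<int8_t>(b >> (8 * i));
    }
    return acc;
}

// Q8_0: one 32-bit word = 4 int8 weights -> one dp4a per word.
int q8_0_dot(uint32_t q8, uint32_t act) {
    return dp4a(q8, act, 0);
}

// Q4_0: one 32-bit word = 8 nibble weights. Low and high nibbles must be
// masked out and dotted separately, so the same word costs two dp4a calls.
int q4_0_dot(uint32_t q4, uint32_t act_lo, uint32_t act_hi) {
    const uint32_t lo = q4 & 0x0F0F0F0Fu;         // weights in low nibbles
    const uint32_t hi = (q4 >> 4) & 0x0F0F0F0Fu;  // weights in high nibbles
    int acc = dp4a(lo, act_lo, 0);
    return dp4a(hi, act_hi, acc);  // the second dp4a is the extra ALU cost
}

int main() {
    // 4 int8 weights of 1 against activations of 1 -> 4.
    std::printf("q8 dot: %d\n", q8_0_dot(0x01010101u, 0x01010101u));
    // 8 nibble weights of 1 against activations of 1 -> 8, but two dp4a.
    std::printf("q4 dot: %d\n",
                q4_0_dot(0x11111111u, 0x01010101u, 0x01010101u));
    return 0;
}
```

Per 32-bit word of quant data, Q4_0 covers twice as many weights but also issues twice the dp4a instructions, which is why both formats hit the same dp4a throughput ceiling despite their different data sizes.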
### Qwen3.5-35B-A3B (MoE, `--cpu-moe`)
| Config | Status |
|---|---|
| Baseline (unpatched) | ✅ Works (15.2 t/s gen from earlier run) |
| +Phase 1 (graph) | ❌ CRASH: async_malloc in opt_for_reorder |
**Critical finding:** Phase 1 (enabling SYCL graph by default) crashes on the 35B MoE model.
The crash occurs in `opt_for_reorder` → `async_malloc` during graph computation: the graph
path triggers an async memory allocation that fails on the Arc A770. This confirms the
original developer's rationale for disabling graphs by default — the async malloc path is
broken for this hardware/workload combination.
## Phases
### Phase 1 — SYCL Graph Default ⚠️ CRASHES ON MoE
| Patch | File | Change | Status |
|---|---|---|---|
| 0001 | ggml-sycl.cpp:217 | Graph-disable default 1→0 (enables graphs) | ⚠️ Crashes 35B MoE |
Enables SYCL graph execution by default. Works on small dense models but crashes on the
35B MoE with an `async_malloc` failure in `opt_for_reorder`. The original default of
disabled (1) was correct for the Arc A770.
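For per-run testing without rebuilding, the same toggle should be reachable through the environment. A minimal sketch of the default being flipped, assuming the upstream `GGML_SYCL_DISABLE_GRAPH` variable (verify the exact name and parsing in ggml-sycl.cpp):

```cpp
#include <cstdlib>
#include <cstring>

// Sketch only, not the actual llama.cpp code: a disable-style flag that
// defaults to 1 (graphs off). Phase 1 flips that default to 0 (graphs on);
// given the MoE crash, the disabled default looks correct for the A770.
static bool sycl_graph_enabled() {
    const char * v = std::getenv("GGML_SYCL_DISABLE_GRAPH");
    const bool disabled = (v == nullptr) ? true : (std::strcmp(v, "0") != 0);
    return !disabled;
}
```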
### Phase 2 — SYCL Kernel Tuning ✅ Neutral
| Patch | File | Change | Status |
|---|---|---|---|
| 0001 | common.hpp:90-91 | VER_GEN12 1M→1200, VER_GEN13→1300 | ✅ Neutral |
| 0002 | presets.hpp:57,60 | DMMV_X 32→64, MMV_Y 1→2 | ✅ Neutral |
| 0003 | common.hpp:106,109 | DMMV_X 32→64, MMV_Y 1→2 | ✅ Neutral |
Fixes the VER_GEN12 placeholder but shows no measurable impact on the 9B dense model; it needs testing on MoE or larger models to evaluate properly. DMMV_X=64 vs. 32 is within noise on this workload.
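For reference, these are compile-time launch-shape knobs. A hedged sketch of the defines Phase 2 adjusts, using the names from llama.cpp's SYCL headers (`GGML_SYCL_DMMV_X`, `GGML_SYCL_MMV_Y`); the exact guards in presets.hpp/common.hpp may differ:

```cpp
// Illustrative defaults mirroring the Phase 2 values (not the exact file).
// DMMV_X: values along the row each work-item consumes per iteration;
// MMV_Y:  matrix rows handled per work-group. Larger values trade
// occupancy for fewer loop iterations, which was neutral on this 9B model.
#ifndef GGML_SYCL_DMMV_X
#define GGML_SYCL_DMMV_X 64  // Phase 2: was 32
#endif
#ifndef GGML_SYCL_MMV_Y
#define GGML_SYCL_MMV_Y 2    // Phase 2: was 1
#endif
```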
### Phase 3 — Vulkan Intel ⏳ Independent, Not Tested
| Patch | File | Change | Status |
|---|---|---|---|
| 0001 | ggml-vulkan.cpp | Arc 140T Xe2 override | ⏳ Not tested (no spirv-headers) |
### Phase 4 — Host-Buffer Copy ✅ Neutral
| Patch | File | Change | Status |
|---|---|---|---|
| 0001 | ggml-sycl.cpp | mmap workaround opt-in | ✅ Neutral on 9B |
Makes the blanket Linux host-buffer double-copy (the mmap workaround) opt-in instead of always-on. Neutral on the 9B dense model, which fits in VRAM; it should help on models that stress the host→device copy path.
## Council Deliberations

Stored in `../logs/` (gitignored). Key files:

- `logs/decisions.md` — All council decisions with rationale
- `logs/test-machine-megumin.md` — Test machine environment and benchmarks
- `logs/M-sync-overhead-*.md` — Agent-M sync analysis
- `logs/K-kernel-tuning-*.md` — Agent-K kernel tuning analysis
- `logs/M-review-K-*.md`, `logs/K-review-M-*.md` — Cross-reviews
### Phase 5 — Q4_0 MMVQ vdr_mmvq Tuning ✅ +19% Q4_0
| Patch | File | Change | Status |
|---|---|---|---|
| 0001 | quants.hpp:47 | Q4_0 reorder vdr_mmvq 2→4 | ✅ +5.8 t/s Q4_0 |
Increases the "vector dot product rows per MMVQ" parameter for Q4_0's reorder path. This processes more blocks per subgroup iteration (8→16 blocks), better amortizing the dp4a compute overhead. Q4_0 is dp4a-bound (not BW-bound) because nibble extraction requires 2 dp4a per byte vs Q8_0's 1 dp4a per byte.
Result: Q4_0 tg128 goes from 30.16 to 35.96 t/s (+19%); Q8_0 and Q4_K_M are unchanged within noise.
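The arithmetic behind the 8→16 claim, as a standalone sketch (hypothetical constants; the real loop lives in the SYCL MMVQ kernel):

```cpp
#include <cstdio>

// Back-of-the-envelope sketch of blocks per subgroup iteration (not the
// real kernel). QI4_0 = 4 is the number of 32-bit quant words per Q4_0
// block; a subgroup width of 16 is assumed for Arc.
int main() {
    const int subgroup_size = 16;   // assumed Arc subgroup width
    const int qi4_0 = 4;            // 32-bit words of quant data per block
    const int vdrs[2] = {2, 4};     // vdr_mmvq before and after Phase 5
    for (int vdr : vdrs) {
        const int blocks = vdr * subgroup_size / qi4_0;
        std::printf("vdr_mmvq=%d -> %d blocks per subgroup iteration\n",
                    vdr, blocks);   // prints 8, then 16
    }
    return 0;
}
```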
## Key Lesson
The Arc A770 bottleneck for Q4_0 token generation is dp4a compute throughput, not memory bandwidth or submission overhead. The 4-bit nibble packing requires 2 dp4a operations per byte (low + high nibbles), making the kernel compute-bound at ~30 t/s. Further improvements require:
- DPAS/XMX integration for quantized dot products
- Algorithmic changes to the nibble unpacking (e.g., lookup tables; see the sketch after this list)
- Larger vdr_mmvq (requires larger qi/QI4_0 — would need data format changes)
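As an illustration of the lookup-table idea (a sketch, not a drop-in kernel change): precompute the two signed weights packed in each possible byte once, so the inner loop trades the shift/mask/dp4a sequence for table loads. Whether this wins on Arc would need profiling, since table traffic can offset the saved ALU work:

```cpp
#include <array>
#include <cstdint>

// Illustrative Q4_0 nibble unpacking via a 256-entry table (not llama.cpp
// code). Each Q4_0 byte packs two 4-bit weights with an implicit -8 offset.
static const std::array<std::array<int8_t, 2>, 256> kNibbleLut = [] {
    std::array<std::array<int8_t, 2>, 256> lut{};
    for (int b = 0; b < 256; ++b) {
        lut[b][0] = static_cast<int8_t>((b & 0x0F) - 8);  // low nibble
        lut[b][1] = static_cast<int8_t>((b >> 4) - 8);    // high nibble
    }
    return lut;
}();

// Integer dot product of Q4_0 bytes against int8 activations. Note: real
// Q4_0 pairs weight i with weight i+16 inside a block; the activations are
// assumed pre-interleaved here to keep the sketch short.
int q4_0_dot_lut(const uint8_t * q, const int8_t * act, int n_bytes) {
    int acc = 0;
    for (int i = 0; i < n_bytes; ++i) {
        acc += kNibbleLut[q[i]][0] * act[2 * i + 0];
        acc += kNibbleLut[q[i]][1] * act[2 * i + 1];
    }
    return acc;
}
```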