
Intel GPU Diagnosis Patches

Patches generated by a 3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzing Intel Arc GPU performance issues in llama.cpp. Because the models have different training data, one model's blind spots are more likely to be caught during cross-review.

Quick Start

# Apply one phase at a time (recommended):
cd repos
./patch/apply-phase.sh 1              # Phase 1: SYCL graph default
./patch/apply-phase.sh 2              # Phase 2: Kernel tuning (VER_GEN + DMMV)
./patch/apply-phase.sh 3              # Phase 3: Vulkan Arc 140T fix (independent)
./patch/apply-phase.sh 4              # Phase 4: Host-buffer copy opt-in

# Reverse:
./patch/apply-phase.sh 1 --reverse
./patch/apply-phase.sh all --reverse  # Reverse all

Test Environment

  • GPU: Intel Arc A770 16GB (card0, xe driver, PCI 8086:56a0)
  • Secondary: AMD RX 580 8GB (card1, amdgpu — not used for SYCL)
  • CPU: AMD Ryzen 7 5800X, 32GB DDR4
  • OS: CachyOS, kernel 6.19.10
  • oneAPI: 2025.3.2, DPC++ compiler 2025.3.2
  • Build: -DGGML_SYCL=ON -DGGML_VULKAN=OFF -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release

Benchmark Results (Updated 2026-04-15, llama-bench)

Method: llama-bench -ngl 99 -p 512 -n 128 -r 3 (3 repeats).

Qwen3.5-9B (dense, fits entirely in VRAM)

Config                   Q4_0 tg128   Q8_0 tg128   Q4_K_M tg128
Baseline (HEAD)          29.4         28.6         n/a
+Phase 2 (VER+tuning)    30.16        29.96        24.65
+Phase 2+5 (vdr_mmvq)    35.96        30.82        25.32

Phase 5 is the first significant improvement: +19% on Q4_0 token generation.

Bandwidth Utilization

Config     Q4_0 BW     Q4_0 BW%   Q8_0 BW     Q8_0 BW%
Baseline   150 GiB/s   29%        265 GiB/s   52%
+Phase 5   180 GiB/s   35%        273 GiB/s   53%

Root Cause (corrected)

Previous analysis attributed low BW utilization to "SYCL submission model overhead". This was WRONG. Per-op profiling proves:

  1. SYCL queue naturally batches async kernel submissions (CPU submits all 1077 ops in 7.5ms vs 32ms GPU execution)
  2. Q4_0 is dp4a-compute-bound, not memory-BW-bound
  3. Nibble packing requires 2 dp4a per byte (low + high nibbles), while Q8_0 needs only 1 dp4a per byte
  4. Both formats hit the same dp4a throughput ceiling → same ~30 t/s despite different data sizes
  5. Phase 5's vdr_mmvq increase processes more blocks per subgroup, better amortizing dp4a overhead

See ../logs/root-cause-analysis-20260415.md for full profiling data.

Qwen3.5-35B-A3B (MoE, --cpu-moe)

Config                 Status
Baseline (unpatched)   Works (15.2 t/s gen from earlier run)
+Phase 1 (graph)       CRASH: async_malloc in opt_for_reorder

Critical Finding: Phase 1 (enabling SYCL graph by default) crashes on the 35B MoE model. The crash occurs in an async_malloc call inside opt_for_reorder during graph computation: the graph path triggers an async memory allocation that fails on Arc A770. This confirms the original developer's rationale for disabling graphs by default, since the async malloc path is broken for this hardware/workload combination.

Phases

Phase 1 — SYCL Graph Default (⚠️ crashes on 35B MoE)

Patch   File                Change               Status
0001    ggml-sycl.cpp:217   Graph default 1→0    ⚠️ Crashes 35B MoE

Enables SYCL graph execution by default. Works on small dense models but crashes on 35B MoE with async_malloc failure in opt_for_reorder. The original default of disabled (1) was correct for Arc A770.

Phase 2 — SYCL Kernel Tuning (neutral)

Patch   File                 Change                               Status
0001    common.hpp:90-91     VER_GEN12 1M→1200, VER_GEN13→1300    Neutral
0002    presets.hpp:57,60    DMMV_X 32→64, MMV_Y 1→2              Neutral
0003    common.hpp:106,109   DMMV_X 32→64, MMV_Y 1→2              Neutral

Fixes VER_GEN12 placeholder but shows no measurable impact on 9B dense model. Needs testing on MoE or larger models to evaluate properly. DMMV_X=64 vs 32 is within noise on this workload.

Phase 3 — Vulkan Intel (independent)

Patch   File              Change                  Status
0001    ggml-vulkan.cpp   Arc 140T Xe2 override   Not tested (no spirv-headers)

Phase 4 — Host-Buffer Copy (neutral)

Patch   File            Change                   Status
0001    ggml-sycl.cpp   mmap workaround opt-in   Neutral on 9B

Removes blanket Linux host-buffer double-copy. Neutral on 9B dense model that fits in VRAM. Should help on models that stress the host→device copy path.

Council Deliberations

Stored in ../logs/ (gitignored). Key files:

  • logs/decisions.md — All council decisions with rationale
  • logs/test-machine-megumin.md — Test machine environment and benchmarks
  • logs/M-sync-overhead-*.md — Agent-M sync analysis
  • logs/K-kernel-tuning-*.md — Agent-K kernel tuning analysis
  • logs/M-review-K-*.md, logs/K-review-M-*.md — Cross-reviews

Phase 5 — Q4_0 MMVQ vdr_mmvq Tuning (+19% Q4_0)

Patch   File            Change                      Status
0001    quants.hpp:47   Q4_0 reorder vdr_mmvq 2→4   +5.8 t/s Q4_0

Increases the "vector dot product rows per MMVQ" parameter for Q4_0's reorder path. This processes more blocks per subgroup iteration (8→16 blocks), better amortizing the dp4a compute overhead. Q4_0 is dp4a-bound (not BW-bound) because nibble extraction requires 2 dp4a per byte vs Q8_0's 1 dp4a per byte.

Result: Q4_0 tg128 goes from 30.16 → 35.96 t/s (+19%). No impact on Q8_0 or Q4_K_M.

Key Lesson

The Arc A770 bottleneck for Q4_0 token generation is dp4a compute throughput, not memory bandwidth or submission overhead. The 4-bit nibble packing requires 2 dp4a operations per byte (low + high nibbles), making the kernel compute-bound at ~30 t/s. Further improvements require:

  1. DPAS/XMX integration for quantized dot products
  2. Algorithmic changes to the nibble unpacking (e.g., lookup tables)
  3. Larger vdr_mmvq (requires larger qi/QI4_0 — would need data format changes)