# Intel GPU Diagnosis Patches
Patches generated by a three-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzing Intel Arc GPU performance issues in llama.cpp. The models' non-overlapping training data helps each one's blind spots get caught in cross-review.
## Quick Start

```bash
# Apply one phase at a time (recommended):
cd repos
./patch/apply-phase.sh 1   # Phase 1: SYCL graph default
./patch/apply-phase.sh 2   # Phase 2: Kernel tuning (VER_GEN + DMMV)
./patch/apply-phase.sh 3   # Phase 3: Vulkan Arc 140T fix (independent)
./patch/apply-phase.sh 4   # Phase 4: Host-buffer copy opt-in

# Reverse:
./patch/apply-phase.sh 1 --reverse
./patch/apply-phase.sh all --reverse   # Reverse all phases
```
## Test Environment

- GPU: Intel Arc A770 16GB (card0, xe driver, PCI 8086:56a0)
- Secondary: AMD RX 580 8GB (card1, amdgpu — not used for SYCL)
- CPU: AMD Ryzen 7 5800X, 32GB DDR4
- OS: CachyOS, kernel 6.19.10
- oneAPI: 2025.3.2, DPC++ compiler 2025.3.2
- Build:

```
-DGGML_SYCL=ON -DGGML_VULKAN=OFF -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release
```
## Benchmark Results (Updated 2026-04-15, llama-bench)

Method: `llama-bench -ngl 99 -p 512 -n 128 -r 3` (3 repeats).
### Qwen3.5-9B (dense, fits entirely in VRAM)
| Config | Q4_0 tg128 (t/s) | Q8_0 tg128 (t/s) | Q4_K_M tg128 (t/s) |
|---|---|---|---|
| Baseline (HEAD) | 29.4 | 28.6 | — |
| +Phase 2 (VER+tuning) | 30.16 | 29.96 | 24.65 |
| +Phase 2+5 (vdr_mmvq) | 35.96 | 30.82 | 25.32 |
Phase 5 is the first significant improvement: +19% on Q4_0 token generation.
### Bandwidth Utilization
| Config | Q4_0 BW | Q4_0 BW% | Q8_0 BW | Q8_0 BW% |
|---|---|---|---|---|
| Baseline | 150 GiB/s | 29% | 265 GiB/s | 52% |
| +Phase 5 | 180 GiB/s | 35% | 273 GiB/s | 53% |

BW% is the share of the A770's theoretical peak memory bandwidth.
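These figures follow from effective BW ≈ weight bytes streamed per token × tokens/s, since token generation re-reads the full weight set for every token. A quick sanity check, using assumed (not measured) GGUF sizes for a 9B model:

```cpp
// Rough cross-check of the bandwidth table (illustrative only).
// Assumed weight sizes for a 9B model: ~5.0 GiB at Q4_0, ~8.8 GiB at Q8_0;
// check the actual GGUF file sizes before relying on these numbers.
#include <cstdio>

int main() {
    const double q4_0_gib = 5.0;   // assumed Q4_0 weight size
    const double q8_0_gib = 8.8;   // assumed Q8_0 weight size
    // Effective bandwidth ~= weights read per token * tokens per second.
    std::printf("Q4_0: ~%.0f GiB/s\n", q4_0_gib * 30.16);  // ~150 GiB/s
    std::printf("Q8_0: ~%.0f GiB/s\n", q8_0_gib * 29.96);  // ~264 GiB/s
    return 0;
}
```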
### Root Cause (corrected)
Previous analysis attributed the low BW utilization to "SYCL submission model overhead". This was WRONG. Per-op profiling proves:
- SYCL queue naturally batches async kernel submissions (CPU submits all 1077 ops in 7.5ms vs 32ms GPU execution)
- Q4_0 is dp4a-compute-bound, not memory-BW-bound
- Nibble packing requires 2 dp4a per byte (low + high nibbles), while Q8_0 needs only 1 dp4a per byte
- Both formats hit the same dp4a throughput ceiling → same ~30 t/s despite different data sizes
- Phase 5's vdr_mmvq increase processes more blocks per subgroup, better amortizing dp4a overhead
See `../logs/root-cause-analysis-20260415.md` for full profiling data.
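To make the 2:1 dp4a ratio concrete, here is a minimal scalar sketch (not the actual llama.cpp SYCL kernel; per-block scales and Q4_0's -8 zero-point are omitted, and `dp4a` is emulated in plain C++):

```cpp
#include <cstdint>
#include <cstdio>

// Emulated dp4a: treats a and b as 4 packed int8 lanes and returns
// acc + sum(a[i] * b[i]). On Arc this maps to a single hardware instruction.
static int dp4a(uint32_t a, uint32_t b, int acc) {
    for (int i = 0; i < 4; ++i) {
        acc += static_cast<int8_t>(a >> (8 * i)) *
               static_cast<int8_t>(b >> (8 * i));
    }
    return acc;
}

// Q8_0: one 32-bit word = 4 int8 weights -> one dp4a per word.
int q8_0_dot(uint32_t q8, uint32_t act) {
    return dp4a(q8, act, 0);
}

// Q4_0: one 32-bit word = 8 nibble weights. Low and high nibbles must be
// masked out and dotted separately, so the same word costs two dp4a calls.
int q4_0_dot(uint32_t q4, uint32_t act_lo, uint32_t act_hi) {
    const uint32_t lo = q4 & 0x0F0F0F0Fu;         // weights in low nibbles
    const uint32_t hi = (q4 >> 4) & 0x0F0F0F0Fu;  // weights in high nibbles
    int acc = dp4a(lo, act_lo, 0);
    return dp4a(hi, act_hi, acc);  // the second dp4a is the extra ALU cost
}

int main() {
    // 4 int8 weights of 1 against activations of 1 -> 4.
    std::printf("q8 dot: %d\n", q8_0_dot(0x01010101u, 0x01010101u));
    // 8 nibble weights of 1 against activations of 1 -> 8, but two dp4a.
    std::printf("q4 dot: %d\n",
                q4_0_dot(0x11111111u, 0x01010101u, 0x01010101u));
    return 0;
}
```

Per 32-bit word of quant data, Q4_0 covers twice as many weights but also issues twice the dp4a instructions, which is why both formats hit the same dp4a throughput ceiling despite their different data sizes.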
### Qwen3.5-35B-A3B (MoE, `--cpu-moe`)
| Config | Status |
|---|---|
| Baseline (unpatched) | ✅ Works (15.2 t/s gen from earlier run) |
| +Phase 1 (graph) | ❌ CRASH: async_malloc in opt_for_reorder |
**Critical finding:** Phase 1 (enabling SYCL graph by default) crashes on the 35B MoE model.
The crash occurs in `opt_for_reorder` → `async_malloc` during graph computation: the graph
path triggers an async memory allocation that fails on the Arc A770. This confirms the
original developer's rationale for disabling graphs by default — the async malloc path is
broken for this hardware/workload combination.
## Phases
### Phase 1 — SYCL Graph Default ⚠️ CRASHES ON MoE
| Patch | File | Change | Status |
|---|---|---|---|
| 0001 | ggml-sycl.cpp:217 | Graph-disable default 1→0 (enables graphs) | ⚠️ Crashes 35B MoE |
Enables SYCL graph execution by default. Works on small dense models but crashes on the
35B MoE with an `async_malloc` failure in `opt_for_reorder`. The original default of
disabled (1) was correct for the Arc A770.
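For per-run testing without rebuilding, the same toggle should be reachable through the environment. A minimal sketch of the default being flipped, assuming the upstream `GGML_SYCL_DISABLE_GRAPH` variable (verify the exact name and parsing in ggml-sycl.cpp):

```cpp
#include <cstdlib>
#include <cstring>

// Sketch only, not the actual llama.cpp code: a disable-style flag that
// defaults to 1 (graphs off). Phase 1 flips that default to 0 (graphs on);
// given the MoE crash, the disabled default looks correct for the A770.
static bool sycl_graph_enabled() {
    const char * v = std::getenv("GGML_SYCL_DISABLE_GRAPH");
    const bool disabled = (v == nullptr) ? true : (std::strcmp(v, "0") != 0);
    return !disabled;
}
```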
### Phase 2 — SYCL Kernel Tuning ✅ Neutral
| Patch | File | Change | Status |
|---|---|---|---|
| 0001 | common.hpp:90-91 | VER_GEN12 1M→1200, VER_GEN13→1300 | ✅ Neutral |
| 0002 | presets.hpp:57,60 | DMMV_X 32→64, MMV_Y 1→2 | ✅ Neutral |
| 0003 | common.hpp:106,109 | DMMV_X 32→64, MMV_Y 1→2 | ✅ Neutral |
Fixes the VER_GEN12 placeholder but shows no measurable impact on the 9B dense model; it needs testing on MoE or larger models to evaluate properly. DMMV_X=64 vs. 32 is within noise on this workload.
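For reference, these are compile-time launch-shape knobs. A hedged sketch of the defines Phase 2 adjusts, using the names from llama.cpp's SYCL headers (`GGML_SYCL_DMMV_X`, `GGML_SYCL_MMV_Y`); the exact guards in presets.hpp/common.hpp may differ:

```cpp
// Illustrative defaults mirroring the Phase 2 values (not the exact file).
// DMMV_X: values along the row each work-item consumes per iteration;
// MMV_Y:  matrix rows handled per work-group. Larger values trade
// occupancy for fewer loop iterations, which was neutral on this 9B model.
#ifndef GGML_SYCL_DMMV_X
#define GGML_SYCL_DMMV_X 64  // Phase 2: was 32
#endif
#ifndef GGML_SYCL_MMV_Y
#define GGML_SYCL_MMV_Y 2    // Phase 2: was 1
#endif
```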
### Phase 3 — Vulkan Intel ⏳ Independent, Not Tested
| Patch | File | Change | Status |
|---|---|---|---|
| 0001 | ggml-vulkan.cpp | Arc 140T Xe2 override | ⏳ Not tested (no spirv-headers) |
### Phase 4 — Host-Buffer Copy ✅ Neutral
| Patch | File | Change | Status |
|---|---|---|---|
| 0001 | ggml-sycl.cpp | mmap workaround opt-in | ✅ Neutral on 9B |
Makes the blanket Linux host-buffer double-copy (the mmap workaround) opt-in instead of always-on. Neutral on the 9B dense model, which fits in VRAM; it should help on models that stress the host→device copy path.
## Council Deliberations

Stored in `../logs/` (gitignored). Key files:

- `logs/decisions.md` — All council decisions with rationale
- `logs/test-machine-megumin.md` — Test machine environment and benchmarks
- `logs/M-sync-overhead-*.md` — Agent-M sync analysis
- `logs/K-kernel-tuning-*.md` — Agent-K kernel tuning analysis
- `logs/M-review-K-*.md`, `logs/K-review-M-*.md` — Cross-reviews
### Phase 5 — Q4_0 MMVQ vdr_mmvq Tuning ✅ +19% Q4_0
| Patch | File | Change | Status |
|---|---|---|---|
| 0001 | quants.hpp:47 | Q4_0 reorder vdr_mmvq 2→4 | ✅ +5.8 t/s Q4_0 |
Increases the "vector dot product rows per MMVQ" parameter for Q4_0's reorder path. This processes more blocks per subgroup iteration (8→16 blocks), better amortizing the dp4a compute overhead. Q4_0 is dp4a-bound (not BW-bound) because nibble extraction requires 2 dp4a per byte vs Q8_0's 1 dp4a per byte.
Result: Q4_0 tg128 goes from 30.16 to 35.96 t/s (+19%); Q8_0 and Q4_K_M are unchanged within noise.
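The arithmetic behind the 8→16 claim, as a standalone sketch (hypothetical constants; the real loop lives in the SYCL MMVQ kernel):

```cpp
#include <cstdio>

// Back-of-the-envelope sketch of blocks per subgroup iteration (not the
// real kernel). QI4_0 = 4 is the number of 32-bit quant words per Q4_0
// block; a subgroup width of 16 is assumed for Arc.
int main() {
    const int subgroup_size = 16;   // assumed Arc subgroup width
    const int qi4_0 = 4;            // 32-bit words of quant data per block
    const int vdrs[2] = {2, 4};     // vdr_mmvq before and after Phase 5
    for (int vdr : vdrs) {
        const int blocks = vdr * subgroup_size / qi4_0;
        std::printf("vdr_mmvq=%d -> %d blocks per subgroup iteration\n",
                    vdr, blocks);   // prints 8, then 16
    }
    return 0;
}
```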
## Key Lesson
The Arc A770 bottleneck for Q4_0 token generation is dp4a compute throughput, not memory bandwidth or submission overhead. The 4-bit nibble packing requires 2 dp4a operations per byte (low + high nibbles), making the kernel compute-bound at ~30 t/s. Further improvements require:
- DPAS/XMX integration for quantized dot products
- Algorithmic changes to the nibble unpacking (e.g., lookup tables; see the sketch after this list)
- Larger vdr_mmvq (requires larger qi/QI4_0 — would need data format changes)
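As an illustration of the lookup-table idea (a sketch, not a drop-in kernel change): precompute the two signed weights packed in each possible byte once, so the inner loop trades the shift/mask/dp4a sequence for table loads. Whether this wins on Arc would need profiling, since table traffic can offset the saved ALU work:

```cpp
#include <array>
#include <cstdint>

// Illustrative Q4_0 nibble unpacking via a 256-entry table (not llama.cpp
// code). Each Q4_0 byte packs two 4-bit weights with an implicit -8 offset.
static const std::array<std::array<int8_t, 2>, 256> kNibbleLut = [] {
    std::array<std::array<int8_t, 2>, 256> lut{};
    for (int b = 0; b < 256; ++b) {
        lut[b][0] = static_cast<int8_t>((b & 0x0F) - 8);  // low nibble
        lut[b][1] = static_cast<int8_t>((b >> 4) - 8);    // high nibble
    }
    return lut;
}();

// Integer dot product of Q4_0 bytes against int8 activations. Note: real
// Q4_0 pairs weight i with weight i+16 inside a block; the activations are
// assumed pre-interleaved here to keep the sketch short.
int q4_0_dot_lut(const uint8_t * q, const int8_t * act, int n_bytes) {
    int acc = 0;
    for (int i = 0; i < n_bytes; ++i) {
        acc += kNibbleLut[q[i]][0] * act[2 * i + 0];
        acc += kNibbleLut[q[i]][1] * act[2 * i + 1];
    }
    return acc;
}
```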