# Intel GPU Diagnosis Patches

Patches generated by a 3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzing Intel Arc GPU performance issues in llama.cpp. The models' non-overlapping training data helps ensure blind spots are caught through cross-review.

## Quick Start

```bash
# Apply one phase at a time (recommended):
cd repos
./patch/apply-phase.sh 1   # Phase 1: SYCL graph default
./patch/apply-phase.sh 2   # Phase 2: Kernel tuning (VER_GEN + DMMV)
./patch/apply-phase.sh 3   # Phase 3: Vulkan Arc 140T fix (independent)
./patch/apply-phase.sh 4   # Phase 4: Host-buffer copy opt-in

# Reverse a single phase, or all of them:
./patch/apply-phase.sh 1 --reverse
./patch/apply-phase.sh all --reverse
```

## Test Environment

- **GPU:** Intel Arc A770 16GB (card0, xe driver, PCI 8086:56a0)
- **Secondary:** AMD RX 580 8GB (card1, amdgpu — not used for SYCL)
- **CPU:** AMD Ryzen 7 5800X, 32GB DDR4
- **OS:** CachyOS, kernel 6.19.10
- **oneAPI:** 2025.3.2, DPC++ compiler 2025.3.2
- **Build:** `-DGGML_SYCL=ON -DGGML_VULKAN=OFF -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release`

## Benchmark Results (updated 2026-04-15, llama-bench)

**Method:** `llama-bench -ngl 99 -p 512 -n 128 -r 3` (3 repeats).

### Qwen3.5-9B (dense, fits entirely in VRAM)

| Config | Q4_0 tg128 (t/s) | Q8_0 tg128 (t/s) | Q4_K_M tg128 (t/s) |
|--------|------------------|------------------|--------------------|
| Baseline (HEAD) | 29.4 | 28.6 | — |
| +Phase 2 (VER+tuning) | 30.16 | 29.96 | 24.65 |
| **+Phase 2+5 (vdr_mmvq)** | **35.96** | **30.82** | **25.32** |

**Phase 5 is the first significant improvement: +19% on Q4_0 token generation.**

### Bandwidth Utilization

| Config | Q4_0 BW | Q4_0 BW% | Q8_0 BW | Q8_0 BW% |
|--------|---------|----------|---------|----------|
| Baseline | 150 GiB/s | 29% | 265 GiB/s | 52% |
| +Phase 5 | 180 GiB/s | 35% | 273 GiB/s | 53% |

### Root Cause (corrected)

Previous analysis attributed the low bandwidth utilization to "SYCL submission model overhead". **This was wrong.** Per-op profiling shows:

1. The SYCL queue naturally batches async kernel submissions: the CPU submits all 1077 ops in 7.5 ms against 32 ms of GPU execution.
2. Q4_0 is **dp4a-compute-bound**, not memory-bandwidth-bound.
3. Nibble packing means each Q4_0 byte feeds 2 dp4a operations (low + high nibbles), while a Q8_0 byte feeds only 1.
4. Both formats therefore hit the same dp4a throughput ceiling → the same ~30 t/s despite the different data sizes.
5. Phase 5's vdr_mmvq increase processes more blocks per subgroup, better amortizing the dp4a overhead.

See `../logs/root-cause-analysis-20260415.md` for the full profiling data.

### Qwen3.5-35B-A3B (MoE, `--cpu-moe`)

| Config | Status |
|--------|--------|
| Baseline (unpatched) | ✅ Works (15.2 t/s gen from an earlier run) |
| +Phase 1 (graph) | ❌ **CRASH**: `async_malloc` in `opt_for_reorder` |

**Critical finding:** Phase 1 (enabling SYCL graph by default) crashes on the 35B MoE model. The crash occurs in `opt_for_reorder` → `async_malloc` during graph computation: the graph path triggers an async memory allocation that fails on the Arc A770. This confirms the original developer's rationale for disabling graphs by default — the async malloc path is broken for this hardware/workload combination.

## Phases

### Phase 1 — SYCL Graph Default ⚠️ CRASHES ON MoE

| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | ggml-sycl.cpp:217 | Graph default 1→0 | ⚠️ Crashes 35B MoE |

Enables SYCL graph execution by default. Works on small dense models but crashes on the 35B MoE model with an `async_malloc` failure in `opt_for_reorder`. The original default of disabled (1) was correct for the Arc A770.
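As a side note on how the toggle works: assuming the default lives in the backend's `GGML_SYCL_DISABLE_GRAPH` environment-variable fallback (as it does in upstream llama.cpp's `ggml-sycl.cpp` at the time of writing), Phase 1 only flips that fallback, so the same behavior can be selected at run time without rebuilding. The sketch below is illustrative, not the backend's code; `env_or_default` is a hypothetical helper standing in for the backend's own env lookup.

```cpp
#include <cstdio>
#include <cstdlib>

// Hypothetical helper standing in for the backend's integer env lookup.
static int env_or_default(const char * name, int fallback) {
    const char * val = std::getenv(name);
    return val != nullptr ? std::atoi(val) : fallback;
}

int main() {
    // Baseline: SYCL graphs stay off unless the user sets GGML_SYCL_DISABLE_GRAPH=0.
    const int disable_graph_baseline = env_or_default("GGML_SYCL_DISABLE_GRAPH", /*fallback=*/1);

    // Phase 1: only the fallback flips, so graphs are on unless the user sets
    // GGML_SYCL_DISABLE_GRAPH=1. Given the MoE crash above, the baseline
    // default is the safer choice on the Arc A770.
    const int disable_graph_phase1 = env_or_default("GGML_SYCL_DISABLE_GRAPH", /*fallback=*/0);

    std::printf("disable_graph: baseline=%d, phase1=%d\n",
                disable_graph_baseline, disable_graph_phase1);
    return 0;
}
```

If Phase 1 is applied anyway, exporting `GGML_SYCL_DISABLE_GRAPH=1` should restore the baseline graphs-off behavior without reversing the patch.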
### Phase 2 — SYCL Kernel Tuning ✅ Neutral

| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | common.hpp:90-91 | VER_GEN12 1M→1200, VER_GEN13→1300 | ✅ Neutral |
| 0002 | presets.hpp:57,60 | DMMV_X 32→64, MMV_Y 1→2 | ✅ Neutral |
| 0003 | common.hpp:106,109 | DMMV_X 32→64, MMV_Y 1→2 | ✅ Neutral |

Fixes the VER_GEN12 placeholder but shows no measurable impact on the 9B dense model. Needs testing on MoE or larger models to evaluate properly. DMMV_X=64 vs 32 is within noise on this workload.

### Phase 3 — Vulkan Intel ⏳ Not Tested (independent)

| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | ggml-vulkan.cpp | Arc 140T Xe2 override | ⏳ Not tested (no spirv-headers) |

### Phase 4 — Host-Buffer Copy ✅ Neutral

| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | ggml-sycl.cpp | mmap workaround opt-in | ✅ Neutral on 9B |

Removes the blanket Linux host-buffer double-copy. Neutral on the 9B dense model, which fits in VRAM; should help on models that stress the host→device copy path.

### Phase 5 — Q4_0 MMVQ vdr_mmvq Tuning ✅ +19% Q4_0

| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | quants.hpp:47 | Q4_0 reorder vdr_mmvq 2→4 | ✅ +5.8 t/s Q4_0 |

Increases the "vector dot product rows per MMVQ" parameter for Q4_0's reorder path. This processes more blocks per subgroup iteration (8→16 blocks), better amortizing the dp4a compute overhead. Q4_0 is dp4a-bound rather than bandwidth-bound because each packed byte feeds 2 dp4a operations (low + high nibbles) versus 1 for a Q8_0 byte. Result: Q4_0 tg128 goes from 30.16 to 35.96 t/s (+19%), with no impact on Q8_0 or Q4_K_M.

## Council Deliberations

Stored in `../logs/` (gitignored). Key files:

- `logs/decisions.md` — All council decisions with rationale
- `logs/test-machine-megumin.md` — Test machine environment and benchmarks
- `logs/M-sync-overhead-*.md` — Agent-M sync analysis
- `logs/K-kernel-tuning-*.md` — Agent-K kernel tuning analysis
- `logs/M-review-K-*.md`, `logs/K-review-M-*.md` — Cross-reviews

## Key Lesson

The Arc A770 bottleneck for Q4_0 token generation is **dp4a compute throughput**, not memory bandwidth or submission overhead. The 4-bit nibble packing means each packed byte feeds two dp4a operations (low + high nibbles), leaving the kernel compute-bound at ~30 t/s; a short sketch of the nibble-split dot product follows the list below. Further improvements require:

1. DPAS/XMX integration for quantized dot products
2. Algorithmic changes to the nibble unpacking (e.g., lookup tables)
3. Larger vdr_mmvq (requires larger qi/QI4_0 — would need data format changes)
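To make the nibble-split cost concrete, here is a minimal, self-contained C++ sketch of the Q4_0 dot-product shape discussed above. It is not the SYCL kernel: `dp4a_ref` is a scalar stand-in for the hardware dp4a instruction, and `q4_0_dot_word` is an illustrative name. The point it shows is that every packed 32-bit Q4_0 word (8 weights) splits into a low-nibble and a high-nibble operand, so each packed byte ends up feeding two dp4a operations, versus one for a Q8_0 byte.

```cpp
#include <cstdint>
#include <cstdio>
#include <cstring>

// Scalar stand-in for the hardware dp4a instruction: a 4-lane int8 x int8
// multiply-accumulate into a 32-bit accumulator.
static int32_t dp4a_ref(int32_t a, int32_t b, int32_t acc) {
    int8_t pa[4], pb[4];
    std::memcpy(pa, &a, sizeof(pa));
    std::memcpy(pb, &b, sizeof(pb));
    for (int i = 0; i < 4; ++i) {
        acc += int32_t(pa[i]) * int32_t(pb[i]);
    }
    return acc;
}

// One packed Q4_0 word holds 8 nibbles (8 weights). Splitting it into low and
// high nibbles produces two 4 x int8 operands, so the word costs two dp4a
// against the int8 activations (act_lo / act_hi). A Q8_0 word holds only
// 4 weights and costs a single dp4a.
static int32_t q4_0_dot_word(uint32_t q4, int32_t act_lo, int32_t act_hi, int32_t acc) {
    const int32_t lo = int32_t( q4       & 0x0F0F0F0Fu); // low nibbles  -> 4 x int8
    const int32_t hi = int32_t((q4 >> 4) & 0x0F0F0F0Fu); // high nibbles -> 4 x int8
    acc = dp4a_ref(lo, act_lo, acc);
    acc = dp4a_ref(hi, act_hi, acc);
    return acc; // caller still applies the block scale and the Q4_0 zero point of 8
}

int main() {
    // Smoke test: 8 weights with nibble value 9, activations all 1 -> raw dot 72.
    const uint32_t q4   = 0x99999999u;
    const int32_t  ones = 0x01010101;
    std::printf("raw dot = %d\n", q4_0_dot_word(q4, ones, ones, 0));
    return 0;
}
```

Because a Q4_0 tensor also has half the bytes of its Q8_0 counterpart, the total dp4a count per weight works out the same for both formats, which is why they plateau at the same ~30 t/s ceiling noted in the root-cause analysis.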