Update patch README with clean benchmark results and Phase 1 crash finding

- Full A/B test on Qwen3.5-9B Q4_0 and Q8_0 across all phases
- All patches neutral on 9B dense (within ~1 t/s noise)
- Phase 1 (SYCL graph) crashes 35B MoE with async_malloc in opt_for_reorder
- Decision 1 (graph default) overturned by empirical evidence
- Baseline DISABLE_GRAPH=1 was correct for Arc A770
2026-04-15 17:40:58 +02:00
parent 1c374e1262
commit 105f1348dc
+88 -66
@@ -7,87 +7,109 @@ blind spots are caught through cross-review.
## Quick Start
```bash
# Apply one phase at a time (recommended):
cd repos
./patch/apply-phase.sh 1 # Phase 1: SYCL graph default
./patch/apply-phase.sh 2 # Phase 2: Kernel tuning (VER_GEN + DMMV)
./patch/apply-phase.sh 3 # Phase 3: Vulkan Arc 140T fix (independent)
./patch/apply-phase.sh 4 # Phase 4: Host-buffer copy opt-in
# Dry-run / reverse:
./patch/apply-phase.sh 1 --dry-run
./patch/apply-phase.sh 2 --reverse
# Check what's applied:
./patch/apply.sh --status
# Reverse all phases:
./patch/apply-phase.sh all --reverse
```
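The `apply-phase.sh` wrapper itself is not included in this README. For orientation, here is a minimal sketch of what such a wrapper could look like, assuming a hypothetical `patch/phase<N>/*.patch` layout — the real script may differ:

```bash
# Hypothetical sketch only -- NOT the actual apply-phase.sh.
# Assumes patches for phase N live under patch/phaseN/*.patch.
apply_phase() {
    phase="$1"; mode="${2:-}"   # mode: empty (apply), --dry-run, or --reverse
    for p in "patch/phase${phase}"/*.patch; do
        [ -e "$p" ] || { echo "no patches for phase ${phase}" >&2; return 1; }
        case "$mode" in
            --dry-run) git apply --check "$p" ;;
            --reverse) git apply --reverse "$p" ;;
            "")        git apply "$p" ;;
        esac || return 1
        echo "${mode:-applied}: $p"
    done
}
```

For example, `apply_phase 1 --dry-run` checks whether the phase-1 patches still apply cleanly without touching the tree.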
## Test Environment
- **GPU:** Intel Arc A770 16GB (card0, xe driver, PCI 8086:56a0)
- **Secondary:** AMD RX 580 8GB (card1, amdgpu — not used for SYCL)
- **CPU:** AMD Ryzen 7 5800X, 32GB DDR4
- **OS:** CachyOS, kernel 6.19.10
- **oneAPI:** 2025.3.2, DPC++ compiler 2025.3.2
- **Build:** `-DGGML_SYCL=ON -DGGML_VULKAN=OFF -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx -DCMAKE_BUILD_TYPE=Release`
## Benchmark Results (Clean Run, 2026-04-15)
**Method:** Same prompt ("Write a short poem about a cat."), `-ngl 99 --device SYCL0 -c 2048 -n 128 --reasoning off`, fresh build per phase.
### Qwen3.5-9B (dense, fits entirely in VRAM)
| Config | Q4_0 Gen t/s | Q4_0 Prompt t/s | Q8_0 Gen t/s | Q8_0 Prompt t/s |
|--------|-------------|-----------------|-------------|-----------------|
| Baseline | 29.4 | 17.6 | 28.6 | 20.7 |
| +Phase 1 (graph) | 29.7 | 20.0 | 29.0 | 20.7 |
| +Phase 2 (kernel) | 29.8 | 19.8 | 29.0 | 20.6 |
| +Phase 4 (host-copy) | 29.6 | 19.9 | 29.5 | 20.4 |
**Analysis:**
- All results within ~1 t/s noise floor — no significant regression or improvement
- Phase 1 gives a modest prompt processing improvement (+2.4 t/s on Q4_0)
- The 9B dense model fits entirely in A770 VRAM, so sync/graph/reorder optimizations
don't exercise the bottleneck path these patches target
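The "within ~1 t/s noise" judgement above can be made mechanical. Below is a small helper sketch (`within_noise` is a hypothetical name, not part of the patch set); the default floor of 1.0 t/s matches the noise floor quoted above:

```bash
# Classify an A/B delta against a noise floor (default 1.0 t/s).
# usage: within_noise <baseline_tps> <patched_tps> [floor_tps]
within_noise() {
    awk -v a="$1" -v b="$2" -v f="${3:-1.0}" 'BEGIN {
        d = b - a; if (d < 0) d = -d      # absolute delta in t/s
        if (d <= f) print "noise"; else print "significant"
    }'
}
```

Applied to the table: `within_noise 29.4 29.7` (Q4_0 gen) prints `noise`, while `within_noise 17.6 20.0` flags the +2.4 t/s prompt-processing gain as `significant`.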
### Qwen3.5-35B-A3B (MoE, `--cpu-moe`)
| Config | Status |
|--------|--------|
| Baseline (unpatched) | ✅ Works (15.2 t/s gen from earlier run) |
| +Phase 1 (graph) | ❌ **CRASH**: `async_malloc` in `opt_for_reorder` |
**Critical Finding:** Phase 1 (enabling SYCL graph by default) crashes on the 35B MoE model.
The crash occurs in `async_malloc`, called from `opt_for_reorder` during graph computation:
the graph path triggers an async memory allocation that fails on Arc A770. This confirms the
original developer's rationale for disabling graphs by default: the async malloc path is
broken for this hardware/workload combination.
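If Phase 1 is applied anyway, the crash can likely be sidestepped at runtime rather than by rebuilding. This assumes the patch only flips the default of the upstream `GGML_SYCL_DISABLE_GRAPH` environment knob (the `DISABLE_GRAPH=1` setting referenced in the baseline) and leaves the override itself intact:

```bash
# Force SYCL graphs back off for the MoE run without reverting the patch.
# Assumes GGML_SYCL_DISABLE_GRAPH is still honoured after the patch.
GGML_SYCL_DISABLE_GRAPH=1 ./build/bin/llama-cli -m <model> \
    -ngl 99 --device SYCL0 --cpu-moe -c 2048 -n 128
```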
## Phases
### Phase 1 — SYCL Graph Default ⚠️ CRASHES ON MoE
| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | ggml-sycl.cpp:217 | Graph default 1→0 | ⚠️ Crashes 35B MoE |
Enables SYCL graph execution by default. Works on small dense models but crashes
on the 35B MoE with an `async_malloc` failure in `opt_for_reorder`. The original
default of disabled (1) was correct for Arc A770.
### Phase 2 — SYCL Kernel Tuning ✅ Neutral
| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | common.hpp:90-91 | VER_GEN12 1M→1200, VER_GEN13→1300 | ✅ Neutral |
| 0002 | presets.hpp:57,60 | DMMV_X 32→64, MMV_Y 1→2 | ✅ Neutral |
| 0003 | common.hpp:106,109 | DMMV_X 32→64, MMV_Y 1→2 | ✅ Neutral |
Fixes the VER_GEN12 placeholder (1M) that routed all Intel GPUs to NVIDIA Ampere
paths, but shows no measurable impact on the 9B dense model. DMMV_X=64 vs. 32 is
within noise on this workload; needs testing on MoE or larger models to evaluate
properly.
### Phase 3 — Vulkan Intel ⏳ Not Tested (independent of Phase 1/2)
| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | ggml-vulkan.cpp:302,349 | Arc 140T device-ID Xe2 override | ⏳ Not tested (no spirv-headers) |
Fixes Arrow Lake H misdetection as non-Xe2. Enables cooperative matrix.
Only affects Arc 140T systems.
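Phase 3 went untested because the Vulkan backend would not build without spirv-headers. A sketch of getting a Vulkan build up on this machine; the package names are assumptions for CachyOS/Arch repositories:

```bash
# Install the SPIR-V / shader toolchain the Vulkan backend needs,
# then configure a separate Vulkan-only build tree.
sudo pacman -S --needed spirv-headers shaderc vulkan-headers  # assumed package names
cmake -B build-vk -DGGML_SYCL=OFF -DGGML_VULKAN=ON -DCMAKE_BUILD_TYPE=Release
cmake --build build-vk -j"$(nproc)"
```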
### Phase 4 — Host-Buffer Copy ✅ Neutral
| Patch | File | Change | Status |
|-------|------|--------|--------|
| 0001 | ggml-sycl.cpp | mmap workaround opt-in | ✅ Neutral on 9B |
Removes the blanket Linux host-buffer double-copy. Neutral on the 9B dense model
that fits in VRAM; should help on models that stress the host→device copy path.
## Testing Protocol
On the Intel Arc GPU test machine:
```bash
cd repos
# Apply one phase
./patch/apply-phase.sh 1
# Build
source /opt/intel/oneapi/setvars.sh
cmake -B build -DGGML_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx \
  -DCMAKE_BUILD_TYPE=Release -DGGML_AVX2=ON -DGGML_AVX=ON -DGGML_FMA=ON -DGGML_F16C=ON
cmake --build build -j$(nproc)
# Test
ctest --test-dir build --output-on-failure
# Benchmark before/after each phase
./build/bin/llama-bench -m <model> -ngl 99 -p 512 -n 128
```
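`llama-bench` prints its results as a markdown table whose last column is t/s as `mean ± stddev`. A hypothetical helper to pull the mean figures out of a saved log for before/after comparison, assuming that default table format (`extract_tps` is not part of the repo):

```bash
# Print the mean t/s value from each result row of a llama-bench log.
# usage: extract_tps <logfile>
extract_tps() {
    awk -F'|' '/pp[0-9]+|tg[0-9]+/ {
        split($(NF - 1), col, "±")        # last real column: "mean ± stddev"
        gsub(/ /, "", col[1])
        if (col[1] ~ /^[0-9.]+$/) print col[1]
    }' "$1"
}
```

Typical use: `./build/bin/llama-bench ... | tee bench-phase1.log`, then `extract_tps bench-phase1.log` before diffing against the baseline log.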
## Council Deliberations
Stored in `../logs/` (gitignored). Key files:
- `logs/decisions.md` — All council decisions with rationale
- `logs/test-machine-megumin.md` — Test machine environment and benchmarks
- `logs/M-sync-overhead-*.md` — Agent-M sync analysis
- `logs/K-kernel-tuning-*.md` — Agent-K kernel tuning analysis
- `logs/M-review-K-*.md`, `logs/K-review-M-*.md` — Cross-reviews (one caught a `memset_tensor` error)
## Key Lesson
The Arc A770 bottleneck for token generation is NOT primarily in the areas we patched.
The 9B dense model achieves ~29 t/s generation, within the ~28-30 t/s range expected
for a Q4_0 model of this size given the card's memory bandwidth. The real performance
gap vs NVIDIA/AMD may instead lie in:
1. MoE expert routing overhead (not exercised by dense models)
2. Memory bandwidth utilization during attention (not just matmul)
3. Driver/runtime overhead in the xe/Level Zero stack