6 Commits

Author SHA1 Message Date
Pi Agent 48bbb14a64 Phase 5: Q4_0 vdr_mmvq 2→4 (+19% tg128, 30→36 t/s)
Root cause analysis corrected: Q4_0's low BW utilization is NOT due to
SYCL submission-model overhead. Per-op profiling shows the bottleneck
is dp4a compute throughput: nibble extraction requires 2 dp4a per byte
vs Q8_0's 1, making Q4_0 compute-bound at ~30 t/s.
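
For intuition, a scalar model of the dp4a accounting (a minimal C++
sketch, not the SYCL kernel; the real kernel also folds Q4_0's -8
offset in via the Q8_1 block sums, omitted here):

    #include <cstdint>

    // Scalar stand-in for the hardware dp4a: dot 4 packed int8 lanes.
    static int32_t dp4a(uint32_t a, uint32_t b, int32_t acc) {
        for (int i = 0; i < 4; ++i) {
            acc += int8_t(a >> (8 * i)) * int8_t(b >> (8 * i));
        }
        return acc;
    }

    // Q8_0: one dp4a covers one 32-bit word of quants (4 byte weights).
    int32_t dot_q8_word(uint32_t q8, uint32_t act) {
        return dp4a(q8, act, 0);
    }

    // Q4_0: one 32-bit word holds 8 nibble weights, so it must be
    // unpacked into low- and high-nibble words and dotted twice.
    int32_t dot_q4_word(uint32_t q4, uint32_t act_lo, uint32_t act_hi) {
        const uint32_t lo = q4 & 0x0F0F0F0F;        // low-nibble weights
        const uint32_t hi = (q4 >> 4) & 0x0F0F0F0F; // high-nibble weights
        return dp4a(hi, act_hi, dp4a(lo, act_lo, 0));
    }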

Fix: Increase vdr_mmvq from 2 to 4 on the Q4_0 reorder path, processing
16 blocks per subgroup instead of 8. This better amortizes the per-block
dp4a overhead.
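
Where the 8 -> 16 figure comes from (illustrative arithmetic, assuming
the 16-wide subgroup those numbers imply; a q4_0 block packs 32 weights
into 4 quant-ints):

    // Not the ggml-sycl source: each work-item consumes `vdr` quant-ints
    // per iteration, and a q4_0 block holds 4 of them.
    constexpr int blocks_per_subgroup(int sg_size, int vdr) {
        return sg_size * vdr / 4;
    }
    static_assert(blocks_per_subgroup(16, 2) == 8,  "baseline vdr_mmvq = 2");
    static_assert(blocks_per_subgroup(16, 4) == 16, "patched vdr_mmvq = 4");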

Benchmark (Qwen3.5-9B, Arc A770, llama-bench -p 512 -n 128 -r 3):
  Q4_0:  30.16 → 35.96 t/s (+19%)
  Q8_0:  29.96 → 30.82 t/s (within noise)
  Q4_K_M: 24.65 → 25.32 t/s (within noise)

Also includes:
- Timing instrumentation patch (for debugging, not applied to source)
- Updated decisions log (Decisions 8-9)
- Updated workplan with revised benchmark data
- Root cause analysis document
2026-04-15 19:55:44 +03:00
sleepy f3caef26f7 Add cross-platform bandwidth utilization comparison and research data
Key findings:
- SYCL achieves only 29% BW utilization on Q4_0, 54% on Q8_0 (back-of-envelope after this list)
- Vulkan on the same hardware gets 47-75% utilization
- CUDA on RTX 3060 gets ~72% utilization
- The bottleneck is SYCL's 1-op-at-a-time submission model, not kernel params
- Our patches are neutral because they don't address the submission bottleneck
- Phase 1 (graph) crashes MoE but could be key to fixing the submission bottleneck if the async_malloc bug is resolved
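
Back-of-envelope behind the utilization figures (assumptions: ~5.1 GB
of Q4_0 weights for a 9B model, 560 GB/s peak A770 bandwidth, and that
decode streams the full weights once per token):

    #include <cstdio>

    int main() {
        const double weight_gb = 5.1;   // assumed Q4_0 weight size, 9B model
        const double peak_gbps = 560.0; // Arc A770 theoretical peak bandwidth
        const double tok_s     = 30.16; // Q4_0 tg128 from the benchmark logs
        const double eff = weight_gb * tok_s; // GB streamed per second
        std::printf("%.0f GB/s effective, %.0f%% of peak\n",
                    eff, 100.0 * eff / peak_gbps);
        return 0; // ~154 GB/s, ~27% of peak: in line with the 29% above
    }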
2026-04-15 17:54:03 +02:00
sleepy 105f1348dc Update patch README with clean benchmark results and Phase 1 crash finding
- Full A/B test on Qwen3.5-9B Q4_0 and Q8_0 across all phases
- All patches neutral on 9B dense (within ~1 t/s noise)
- Phase 1 (SYCL graph) crashes 35B MoE with async_malloc in opt_for_reorder
- Decision 1 (graph default) overturned by empirical evidence
- Baseline DISABLE_GRAPH=1 was correct for Arc A770
2026-04-15 17:40:58 +02:00
sleepy 1c374e1262 feat: phase 4 host-copy fix + docker build script + test machine docs
Phase 4: Remove the blanket Linux host-buffer double-copy in set_tensor.
The #ifndef _WIN32 guard penalized all Linux Intel GPUs with an extra
malloc/memcpy/free per tensor load to work around a PVC-only bug. The
workaround is now opt-in via GGML_SYCL_MMAP_WORKAROUND=1.
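
The shape of the fix, as a hedged sketch (hypothetical code; the actual
ggml-sycl set_tensor uploads through a SYCL queue rather than this
stand-in):

    #include <cstdlib>
    #include <cstring>

    // Stand-in for the queue.memcpy(...).wait() upload to device memory.
    static void device_copy(void * dst, const void * src, size_t n) {
        std::memcpy(dst, src, n);
    }

    // Opt-in gate replacing the old blanket `#ifndef _WIN32` guard.
    static bool mmap_workaround() {
        const char * v = std::getenv("GGML_SYCL_MMAP_WORKAROUND");
        return v && v[0] == '1';
    }

    void set_tensor(void * dst, const void * src, size_t n) {
        if (mmap_workaround()) {
            // PVC workaround: bounce through a heap buffer first (the
            // extra malloc/memcpy/free per tensor load described above).
            void * bounce = std::malloc(n);
            std::memcpy(bounce, src, n);
            device_copy(dst, bounce, n);
            std::free(bounce);
        } else {
            device_copy(dst, src, n); // upload straight from the mmap'd file
        }
    }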

Also adds:
- docker-build-test.sh for local amd64 SYCL build verification
- test-machine-megumin.md with hardware/software env and test procedures
- Updated apply-phase.sh to support phase 4
- Updated workplan with corrected council composition (GLM/Minimax/Kimi)
2026-04-15 15:35:29 +02:00
sleepy 6ad84d543c feat: phased patch system for Intel Arc GPU performance fixes
A 3-model council (GLM-5.1, Minimax-M2.7, Kimi k2p5) analyzed Intel Arc
GPU performance issues and produced patches for llama.cpp:

Phase 1 - SYCL Sync: Enable graph execution by default (GGML_SYCL_DISABLE_GRAPH; sketch after this list)
Phase 2 - SYCL Kernel: Fix VER_GEN12/13 thresholds, tune DMMV_X/MMV_Y
Phase 3 - Vulkan Intel: Arc 140T device-ID Xe2 override
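
Phase 1's default flip, as a minimal sketch (the env var is real; the
helper name and shape here are assumed):

    #include <cstdlib>

    // Patched default: graph execution runs unless the user explicitly
    // sets GGML_SYCL_DISABLE_GRAPH=1.
    static bool sycl_graph_enabled() {
        const char * v = std::getenv("GGML_SYCL_DISABLE_GRAPH");
        return !(v && v[0] == '1');
    }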

Includes:
- Phased apply script (apply-phase.sh [1|2|3|all])
- Master apply script with --status/--reverse/--dry-run
- Per-phase READMEs with testing checklists
- Council deliberation logs (gitignored in logs/)

Verified: all patches apply/reverse cleanly via git apply.
Static verification: VER_GEN arithmetic and DMMV_X divisibility pass.
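
The divisibility check amounts to compile-time asserts of this shape
(illustrative: upstream defaults are GGML_SYCL_DMMV_X = 32 and
GGML_SYCL_MMV_Y = 1; example tensor widths, not the patched values):

    constexpr int DMMV_X = 32;
    static_assert(4096 % DMMV_X == 0,  "hidden size must divide by DMMV_X");
    static_assert(11008 % DMMV_X == 0, "FFN width must divide by DMMV_X");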
2026-04-15 14:53:40 +02:00
sleepy ef614682bc Add upstream repos as git submodules (shallow clones; example .gitmodules entry below)
llama.cpp, ipex-llm, intel-extension-for-pytorch, compute-runtime,
intel-graphics-compiler, oneDNN, vllm, vllm-xpu-kernels, level-zero,
llvm (sycl branch), openvino, sycl-tla
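
A representative .gitmodules entry for the shallow setup (path and URL
shape assumed, not copied from the repo); git submodule update --init
honors shallow = true by default and clones at depth 1:

    [submodule "upstream/llama.cpp"]
        path = upstream/llama.cpp
        url = https://github.com/ggml-org/llama.cpp
        shallow = true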
2026-04-15 13:19:15 +02:00