Key findings:
- SYCL achieves only 29% (Q4_0) to 54% (Q8_0) memory bandwidth utilization
- Vulkan on same hardware gets 47-75% utilization
- CUDA on RTX 3060 gets ~72% utilization
- The bottleneck is SYCL's one-op-at-a-time submission model, not kernel parameters
- Our patches are neutral because they don't address the submission bottleneck
- Phase 1 (SYCL graph) crashes on MoE models, but could be key to fixing submission once the async_malloc bug is resolved
- Full A/B test on Qwen3.5-9B Q4_0 and Q8_0 across all phases
- All patches neutral on 9B dense (within ~1 t/s noise)
- Phase 1 (SYCL graph) crashes 35B MoE with async_malloc in opt_for_reorder
- Decision 1 (graph default) overturned by empirical evidence
- Baseline DISABLE_GRAPH=1 was correct for Arc A770
Phase 4: Remove the blanket Linux host-buffer double-copy in set_tensor.
The #ifndef _WIN32 guard penalized all Linux Intel GPUs with an extra
malloc/memcpy/free per tensor load to work around a PVC-only bug. The
workaround is now opt-in via GGML_SYCL_MMAP_WORKAROUND=1.
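A rough sketch of the opt-in behaviour described above, not the actual patch. The names mmap_workaround_enabled, upload and set_tensor_sketch are hypothetical, and upload is a plain memcpy standing in for the device copy so the sketch runs without a SYCL runtime.

```cpp
#include <cstddef>
#include <cstdlib>
#include <cstring>
#include <vector>

// Hypothetical helper: true only when GGML_SYCL_MMAP_WORKAROUND is set to "1".
static bool mmap_workaround_enabled() {
    const char * v = std::getenv("GGML_SYCL_MMAP_WORKAROUND");
    return v != nullptr && std::strcmp(v, "1") == 0;
}

// Stand-in for the host-to-device copy (assumption: a memcpy, for illustration only).
static void upload(void * dst, const void * src, std::size_t size) {
    std::memcpy(dst, src, size);
}

// Copies tensor data to dst. Returns true if the staged (double-copy) path was taken.
bool set_tensor_sketch(void * dst, const void * src, std::size_t size) {
    if (size == 0) {
        return false;  // nothing to copy
    }
    if (mmap_workaround_enabled()) {
        // Opt-in path: stage through a temporary host buffer first.
        std::vector<char> staging(size);
        std::memcpy(staging.data(), src, size);
        upload(dst, staging.data(), size);
        return true;
    }
    // Default path: single direct copy, no extra malloc/memcpy/free.
    upload(dst, src, size);
    return false;
}
```

The environment variable is read on every call here to keep the sketch short. A real implementation would more likely cache the result once at startup.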
Also adds:
- docker-build-test.sh for local amd64 SYCL build verification
- test-machine-megumin.md with hardware/software env and test procedures
- Updated apply-phase.sh to support phase 4
- Updated workplan with corrected council composition (GLM/Minimax/Kimi)