1c374e1262
Phase 4: Remove blanket Linux host-buffer double-copy in set_tensor.

The #ifndef _WIN32 guard penalized all Linux Intel GPUs with an extra
malloc/memcpy/free per tensor load for a PVC-only bug. Now opt-in via
GGML_SYCL_MMAP_WORKAROUND=1.

Also adds:
- docker-build-test.sh for local amd64 SYCL build verification
- test-machine-megumin.md with hardware/software env and test procedures
- Updated apply-phase.sh to support phase 4
- Updated workplan with corrected council composition (GLM/Minimax/Kimi)
Phase 4 — Host-Buffer Double-Copy Fix
Depends on: Phases 1 and 2 (apply and test those first)
0001-remove-blanket-host-buffer-copy.patch
Removes the blanket Linux host-buffer double-copy workaround in set_tensor.
Problem
ggml_backend_sycl_buffer_set_tensor on Linux does:
malloc(host_buf) → memcpy(host_buf, data) → memcpy(device, host_buf) → free(host_buf)
This was a workaround for a PVC (Ponte Vecchio) bug where mmap()-backed host
pointers caused issues with direct device copies. The #ifndef _WIN32 guard
penalized ALL Linux Intel GPUs — including Arc A770, A750, Meteor Lake iGPUs —
with an unnecessary extra malloc/memcpy/free on every set_tensor call.
Fix
- Replaces the `#ifndef _WIN32` compile-time guard with a runtime check
- New env var `GGML_SYCL_MMAP_WORKAROUND` defaults to 0 (disabled)
- PVC users who need the workaround: set `GGML_SYCL_MMAP_WORKAROUND=1`
- The `else` branch now does the direct device copy for all platforms
Impact
- Eliminates one `malloc + memcpy + free` per tensor during model loading
- On Arc A770 with a 17GB model (~1M tensors): saves ~17GB of host-side copying
- No effect on Windows (already used the direct path)
Testing checklist
- Build succeeds
- Model loads correctly
- Inference produces correct output
- `GGML_SYCL_MMAP_WORKAROUND=1` restores old behavior