Files
sleepy 1c374e1262 feat: phase 4 host-copy fix + docker build script + test machine docs
Phase 4: Remove blanket Linux host-buffer double-copy in set_tensor.
The #ifndef _WIN32 guard penalized all Linux Intel GPUs with an extra
malloc/memcpy/free per tensor load for a PVC-only bug. Now opt-in via
GGML_SYCL_MMAP_WORKAROUND=1.

Also adds:
- docker-build-test.sh for local amd64 SYCL build verification
- test-machine-megumin.md with hardware/software env and test procedures
- Updated apply-phase.sh to support phase 4
- Updated workplan with corrected council composition (GLM/Minimax/Kimi)
2026-04-15 15:35:29 +02:00
..

Phase 4 — Host-Buffer Double-Copy Fix

Depends on: Phase 1 and 2 (should be applied and tested first)

0001-remove-blanket-host-buffer-copy.patch

Removes the blanket Linux host-buffer double-copy workaround in set_tensor.

Problem

ggml_backend_sycl_buffer_set_tensor on Linux does:

malloc(host_buf) → memcpy(host_buf, data) → memcpy(device, host_buf) → free(host_buf)

This was a workaround for a PVC (Ponte Vecchio) bug where mmap()-backed host pointers caused issues with direct device copies. The #ifndef _WIN32 guard penalized ALL Linux Intel GPUs — including Arc A770, A750, Meteor Lake iGPUs — with an unnecessary extra malloc/memcpy/free on every set_tensor call.

Fix

  • Replaces the #ifndef _WIN32 compile-time guard with a runtime check
  • New env var GGML_SYCL_MMAP_WORKAROUND defaults to 0 (disabled)
  • PVC users who need the workaround: GGML_SYCL_MMAP_WORKAROUND=1
  • The else branch now does the direct device copy for all platforms

Impact

  • Eliminates one malloc + memcpy + free per tensor during model loading
  • On Arc A770 with a 17GB model (~1M tensors): saves ~17GB of host-side copying
  • No effect on Windows (already used the direct path)

Testing checklist

  • Build succeeds
  • Model loads correctly
  • Inference produces correct output
  • GGML_SYCL_MMAP_WORKAROUND=1 restores old behavior