intel-gpu-llm-diagnosis

Files

T

History

Pi Agent 48bbb14a64 Phase 5: Q4_0 vdr_mmvq 2→4 (+19% tg128, 30→36 t/s)

Root cause analysis corrected: Q4_0 low BW utilization is NOT due to
SYCL submission model overhead. Per-op profiling proves the bottleneck
is dp4a compute throughput — nibble extraction requires 2 dp4a per byte
vs Q8_0's 1 dp4a per byte, making Q4_0 compute-bound at ~30 t/s.

Fix: Increase vdr_mmvq from 2 to 4 for Q4_0 reorder path, processing
16 blocks per subgroup instead of 8. Better amortizes dp4a overhead.

Benchmark (Qwen3.5-9B, Arc A770, llama-bench -p 512 -n 128 -r 3):
  Q4_0:  30.16 → 35.96 t/s (+19%)
  Q8_0:  29.96 → 30.82 t/s (within noise)
  Q4_K_M: 24.65 → 25.32 t/s (within noise)

Also includes:
- Timing instrumentation patch (for debugging, not applied to source)
- Updated decisions log (Decisions 8-9)
- Updated workplan with revised benchmark data
- Root cause analysis document

2026-04-15 19:55:44 +03:00

patch.diff

Phase 5: Q4_0 vdr_mmvq 2→4 (+19% tg128, 30→36 t/s)

2026-04-15 19:55:44 +03:00