intel-gpu-llm-diagnosis

sleepy/intel-gpu-llm-diagnosis

Fork 0

Commit Graph

Author	SHA1	Message	Date
Pi Agent	48bbb14a64	Phase 5: Q4_0 vdr_mmvq 2→4 (+19% tg128, 30→36 t/s) Root cause analysis corrected: Q4_0 low BW utilization is NOT due to SYCL submission model overhead. Per-op profiling proves the bottleneck is dp4a compute throughput — nibble extraction requires 2 dp4a per byte vs Q8_0's 1 dp4a per byte, making Q4_0 compute-bound at ~30 t/s. Fix: Increase vdr_mmvq from 2 to 4 for Q4_0 reorder path, processing 16 blocks per subgroup instead of 8. Better amortizes dp4a overhead. Benchmark (Qwen3.5-9B, Arc A770, llama-bench -p 512 -n 128 -r 3): Q4_0: 30.16 → 35.96 t/s (+19%) Q8_0: 29.96 → 30.82 t/s (within noise) Q4_K_M: 24.65 → 25.32 t/s (within noise) Also includes: - Timing instrumentation patch (for debugging, not applied to source) - Updated decisions log (Decisions 8-9) - Updated workplan with revised benchmark data - Root cause analysis document	2026-04-15 19:55:44 +03:00

Author

SHA1

Message

Date

Pi Agent

48bbb14a64

Phase 5: Q4_0 vdr_mmvq 2→4 (+19% tg128, 30→36 t/s)

Root cause analysis corrected: Q4_0 low BW utilization is NOT due to
SYCL submission model overhead. Per-op profiling proves the bottleneck
is dp4a compute throughput — nibble extraction requires 2 dp4a per byte
vs Q8_0's 1 dp4a per byte, making Q4_0 compute-bound at ~30 t/s.

Fix: Increase vdr_mmvq from 2 to 4 for Q4_0 reorder path, processing
16 blocks per subgroup instead of 8. Better amortizes dp4a overhead.

Benchmark (Qwen3.5-9B, Arc A770, llama-bench -p 512 -n 128 -r 3):
  Q4_0:  30.16 → 35.96 t/s (+19%)
  Q8_0:  29.96 → 30.82 t/s (within noise)
  Q4_K_M: 24.65 → 25.32 t/s (within noise)

Also includes:
- Timing instrumentation patch (for debugging, not applied to source)
- Updated decisions log (Decisions 8-9)
- Updated workplan with revised benchmark data
- Root cause analysis document

2026-04-15 19:55:44 +03:00

1 Commits