Pi Agent
|
48bbb14a64
|
Phase 5: Q4_0 vdr_mmvq 2→4 (+19% tg128, 30→36 t/s)
Root cause analysis corrected: Q4_0's low memory-bandwidth utilization is
NOT due to SYCL submission model overhead. Per-op profiling shows the
bottleneck is dp4a compute throughput: nibble extraction requires 2 dp4a
per byte vs Q8_0's 1 dp4a per byte, making Q4_0 compute-bound at ~30 t/s.
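The 2:1 dp4a ratio can be illustrated with a small emulation. This is a minimal sketch, not the SYCL kernel: `signed_lanes`, `dp4a`, `q8_word_dot` and `q4_word_dot` are hypothetical helper names, and the Q4_0 offset of 8 and the per-block scales are deliberately left out so that only the instruction count is visible.

```python
def signed_lanes(word: int) -> list[int]:
    """Split a 32-bit word into four signed 8-bit lanes (lane 0 = lowest byte)."""
    lanes = []
    for i in range(4):
        byte = (word >> (8 * i)) & 0xFF
        lanes.append(byte - 256 if byte >= 128 else byte)
    return lanes


def dp4a(a: int, b: int, acc: int) -> int:
    """Emulate one dp4a: 4-lane int8 dot product added to an accumulator."""
    return acc + sum(x * y for x, y in zip(signed_lanes(a), signed_lanes(b)))


def q8_word_dot(weights: int, acts: int) -> tuple[int, int]:
    """Q8_0-style: a 32-bit word already holds four int8 weights."""
    return dp4a(weights, acts, 0), 1  # (partial sum, dp4a count)


def q4_word_dot(packed: int, acts_lo: int, acts_hi: int) -> tuple[int, int]:
    """Q4_0-style: a 32-bit word holds eight 4-bit weights (offset/scale omitted)."""
    lo = packed & 0x0F0F0F0F         # low nibble of each byte
    hi = (packed >> 4) & 0x0F0F0F0F  # high nibble of each byte
    acc = dp4a(lo, acts_lo, 0)       # first dp4a: low nibbles
    acc = dp4a(hi, acts_hi, acc)     # second dp4a: high nibbles
    return acc, 2                    # (partial sum, dp4a count)
```

For the same 4 bytes of weight data loaded, the Q4_0-style path issues two dp4a operations plus the mask and shift, while the Q8_0-style path issues one.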
Fix: Increase vdr_mmvq from 2 to 4 for the Q4_0 reorder path, so each
subgroup processes 16 blocks instead of 8. This better amortizes dp4a overhead.
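The 8 to 16 figure is consistent with the following arithmetic. This is a sketch under stated assumptions: the formula, the subgroup size of 16, and the value qi = 4 for Q4_0 are assumptions chosen because they reproduce the numbers in this commit, and `blocks_per_subgroup` is a hypothetical helper rather than a function from the source tree.

```python
def blocks_per_subgroup(vdr: int, subgroup_size: int = 16, qi: int = 4) -> int:
    """Blocks covered by one subgroup per iteration (assumed formula).

    vdr:           quantized ints each work-item handles per step (vdr_mmvq)
    subgroup_size: work-items per subgroup (assumed 16)
    qi:            32-bit ints of quantized data per block (assumed 4 for Q4_0)
    """
    # Ceiling division of (vdr * subgroup_size) by qi.
    return -(-(vdr * subgroup_size) // qi)


before = blocks_per_subgroup(vdr=2)  # old setting
after = blocks_per_subgroup(vdr=4)   # new setting
```

Under these assumptions, doubling vdr_mmvq doubles the work each subgroup performs per iteration, which is where the better amortization would come from.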
Benchmark (Qwen3.5-9B, Arc A770, llama-bench -p 512 -n 128 -r 3):
Q4_0: 30.16 → 35.96 t/s (+19%)
Q8_0: 29.96 → 30.82 t/s (within noise)
Q4_K_M: 24.65 → 25.32 t/s (within noise)
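The relative changes can be recomputed from the figures above. A small sketch; `pct_change` is a hypothetical helper and the inputs are the t/s values listed in this commit.

```python
def pct_change(before: float, after: float) -> float:
    """Relative change from before to after, in percent."""
    return (after - before) / before * 100.0


# t/s values copied from the benchmark lines in this commit message.
results = {
    "Q4_0": (30.16, 35.96),
    "Q8_0": (29.96, 30.82),
    "Q4_K_M": (24.65, 25.32),
}

changes = {name: round(pct_change(b, a), 1) for name, (b, a) in results.items()}
```

This gives roughly +19.2% for Q4_0 and under +3% for Q8_0 and Q4_K_M, matching the "+19%" and "within noise" labels.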
Also includes:
- Timing instrumentation patch (for debugging, not applied to source)
- Updated decisions log (Decisions 8-9)
- Updated workplan with revised benchmark data
- Root cause analysis document
|
2026-04-15 19:55:44 +03:00 |
|