Profile concurrent encoding effectiveness #35

Open
opened 2026-04-30 18:11:37 +02:00 by sleepy · 0 comments
Owner

Problem

GGML_METAL_CONCURRENCY_DISABLE=1 drops tg128 from 53.83 to 51.80 tok/s, i.e. concurrent command buffer encoding yields only a ~4% improvement. Many ops are marked concurrent but may not actually benefit from pipelining.

Data (9B Q4_0)

| Config | tg128 (tok/s) |
|--------------------------|-------|
| Concurrency ON (default) | 53.83 |
| Concurrency OFF          | 51.80 |
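For reference, the ~4% figure is the relative speedup of ON over OFF (a quick sanity check on the table, not new data):

```python
# Relative speedup of concurrent encoding, from the table above.
on, off = 53.83, 51.80
speedup_pct = (on - off) / off * 100
print(round(speedup_pct, 1))  # -> 3.9
```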

Graph topology

Default n_cb=1 creates 2 command buffers: the main thread encodes the first 183 nodes and an async thread encodes the remaining 1650. With concurrency disabled, all nodes are encoded sequentially.
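As a toy model (not ggml-metal's actual scheduling logic), raising n_cb would shrink each command buffer's share of the ~1833 graph nodes, at the cost of more per-buffer encode/commit overhead:

```python
# Toy even-split model of assigning graph nodes to command buffers.
# Note: the real n_cb=1 split observed above (183 / 1650) is NOT even,
# so this only illustrates the trend, not ggml-metal's actual logic.
def chunk_sizes(n_nodes: int, n_buffers: int) -> list[int]:
    base, rem = divmod(n_nodes, n_buffers)
    return [base + (1 if i < rem else 0) for i in range(n_buffers)]

for n_cb in (2, 4):
    print(n_cb, chunk_sizes(1833, n_cb))
```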

Questions

  • Are concurrent ops actually overlapping in GPU execution, or only in encoding?
  • An Xcode GPU trace can show which dispatches overlap in time.
  • Would n_cb=2 or n_cb=4 help?
  • Is the 4% gain from overlapped encoding or overlapped execution?
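One way to frame the last question: if GPU execution of one buffer can overlap with CPU encoding of the next, per-token latency tends toward max(encode, execute) instead of their sum. The millisecond figures below are hypothetical, chosen only to be consistent with the measured tok/s (1000/53.83 ≈ 18.6 ms ON, 1000/51.80 ≈ 19.3 ms OFF):

```python
# Toy latency model: with overlap, token time ~ max(enc, exec);
# without overlap, enc + exec. The enc/exec split is hypothetical.
def token_time_ms(enc_ms: float, exec_ms: float, overlap: bool) -> float:
    return max(enc_ms, exec_ms) if overlap else enc_ms + exec_ms

enc, exe = 0.7, 18.6  # hypothetical split of the 19.3 ms OFF time
print(round(token_time_ms(enc, exe, overlap=False), 1))  # -> 19.3 (matches OFF)
print(round(token_time_ms(enc, exe, overlap=True), 1))   # -> 18.6 (close to ON)
```

If the hidden portion is mostly encoding time, larger n_cb would mainly add overhead; if it is execution overlap, a GPU trace should show dispatches from different buffers running concurrently.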
sleepy added the profiling label 2026-04-30 18:11:37 +02:00

Reference: sleepy/llama.cpp#35