Profile concurrent encoding effectiveness #35

Open
opened 2026-04-30 18:11:37 +02:00 by sleepy · 0 comments
Owner

Problem

GGML_METAL_CONCURRENCY_DISABLE=1 drops tg128 from 53.83 to 51.80 tok/s, i.e. concurrent command buffer encoding yields only a ~4% improvement. Many ops are marked concurrent but may not actually benefit from pipelining.

Data (9B Q4_0)

| Config | tg128 (tok/s) |
|--------------------------|-------|
| Concurrency ON (default) | 53.83 |
| Concurrency OFF          | 51.80 |
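For reference, the ~4% figure is the relative speedup of ON over OFF (a quick sanity check on the table, not new data):

```python
# Relative speedup of concurrent encoding, from the table above.
on, off = 53.83, 51.80
speedup_pct = (on - off) / off * 100
print(round(speedup_pct, 1))  # -> 3.9
```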

Graph topology

Default n_cb=1 creates 2 command buffers: the main thread encodes the first 183 nodes and an async thread encodes the remaining 1650. With concurrency disabled, all nodes are encoded sequentially.
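As a toy model (not ggml-metal's actual scheduling logic), raising n_cb would shrink each command buffer's share of the ~1833 graph nodes, at the cost of more per-buffer encode/commit overhead:

```python
# Toy even-split model of assigning graph nodes to command buffers.
# Note: the real n_cb=1 split observed above (183 / 1650) is NOT even,
# so this only illustrates the trend, not ggml-metal's actual logic.
def chunk_sizes(n_nodes: int, n_buffers: int) -> list[int]:
    base, rem = divmod(n_nodes, n_buffers)
    return [base + (1 if i < rem else 0) for i in range(n_buffers)]

for n_cb in (2, 4):
    print(n_cb, chunk_sizes(1833, n_cb))
```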

Questions

  • Are concurrent ops actually overlapping in GPU execution, or only in encoding?
  • An Xcode GPU trace can show which dispatches overlap in time.
  • Would n_cb=2 or n_cb=4 help?
  • Is the 4% gain from overlapped encoding or overlapped execution?
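One way to frame the last question: if GPU execution of one buffer can overlap with CPU encoding of the next, per-token latency tends toward max(encode, execute) instead of their sum. The millisecond figures below are hypothetical, chosen only to be consistent with the measured tok/s (1000/53.83 ≈ 18.6 ms ON, 1000/51.80 ≈ 19.3 ms OFF):

```python
# Toy latency model: with overlap, token time ~ max(enc, exec);
# without overlap, enc + exec. The enc/exec split is hypothetical.
def token_time_ms(enc_ms: float, exec_ms: float, overlap: bool) -> float:
    return max(enc_ms, exec_ms) if overlap else enc_ms + exec_ms

enc, exe = 0.7, 18.6  # hypothetical split of the 19.3 ms OFF time
print(round(token_time_ms(enc, exe, overlap=False), 1))  # -> 19.3 (matches OFF)
print(round(token_time_ms(enc, exe, overlap=True), 1))   # -> 18.6 (close to ON)
```

If the hidden portion is mostly encoding time, larger n_cb would mainly add overhead; if it is execution overlap, a GPU trace should show dispatches from different buffers running concurrently.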
sleepy added the profiling label 2026-04-30 18:11:37 +02:00

Reference: sleepy/llama.cpp#35