Investigate CPY overhead (159 MB/tick at 9B) #31

sleepy commented

2026-04-30 18:11:35 +02:00

## Problem

CPY reads 106 MB and writes 53 MB per tick, 159 MB of memory traffic in total. These are real GPU memory copies, performed for type conversion and layout transformation, so they are not free.

## Data (9B Q4_0, ctx=256)

| Op | PerTick | BytesIn/tk | BytesOut/tk |
|----|---------|------------|-------------|
| CPY | 97 | 106 MB | 53 MB |
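
As a quick sanity check on these figures, the sketch below derives the total traffic, the read-to-write ratio, and the average size per copy. It assumes that `PerTick` is the number of CPY ops per tick and that MB means 10^6 bytes; neither is stated in the table.

```python
# Figures from the table above (9B Q4_0, ctx=256).
# Assumptions: PerTick is the CPY op count per tick; 1 MB = 1e6 bytes.
CPY_PER_TICK = 97
BYTES_IN_MB = 106
BYTES_OUT_MB = 53

total_mb = BYTES_IN_MB + BYTES_OUT_MB        # combined read + write traffic
in_out_ratio = BYTES_IN_MB / BYTES_OUT_MB    # a ratio of 2 would fit 4-byte -> 2-byte elements
avg_in_mb = BYTES_IN_MB / CPY_PER_TICK       # mean MB read per CPY
avg_out_mb = BYTES_OUT_MB / CPY_PER_TICK     # mean MB written per CPY

print(f"total traffic: {total_mb} MB/tick")
print(f"in/out ratio:  {in_out_ratio:.2f}")
print(f"avg per CPY:   {avg_in_mb:.2f} MB in, {avg_out_mb:.2f} MB out")
```

A read-to-write ratio of 2 would be consistent with f32 -> f16 conversion (4 bytes read, 2 bytes written per element), but it does not rule out a mix of different copies that happens to average out the same way.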

## Questions

- Which tensor types are being converted? Likely f16 <-> f32 transitions.
- Can the layout be restructured to avoid the copies?
- Are these contiguous or strided copies?
- How long does each CPY dispatch take? Profile individual dispatch times.
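
For the contiguous-versus-strided question, a minimal sketch of the check is shown below. It is modeled loosely on the ggml convention of per-dimension element counts (`ne`) and byte strides (`nb`), covers non-quantized types only, and is not the library's own implementation. The shapes in the example are hypothetical.

```python
from typing import Sequence


def is_contiguous(ne: Sequence[int], nb: Sequence[int], elem_size: int) -> bool:
    """Return True if the tensor is densely packed, innermost dimension first.

    ne[i] is the element count and nb[i] the byte stride of dimension i.
    Dimensions of size 1 are skipped because their stride is never used.
    """
    expected = elem_size
    for n, stride in zip(ne, nb):
        if n != 1 and stride != expected:
            return False
        expected *= n
    return True


# Hypothetical f16 tensor (2 bytes per element), densely packed.
dense = is_contiguous([128, 32, 256, 1], [2, 256, 8192, 2097152], 2)

# The same data viewed with dimensions 1 and 2 swapped: a strided view.
permuted = is_contiguous([128, 256, 32, 1], [2, 8192, 256, 2097152], 2)

print(dense, permuted)
```

A copy whose source is a strided view like the second case is a candidate for a layout change upstream, whereas a contiguous copy that only changes the element type is a candidate for folding the conversion into a neighboring kernel.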

## Approach

- Use `GGML_METAL_GRAPH_DEBUG=2` to identify the specific CPY tensors.
- Check whether the input and output layouts can be made consistent.
- Consider whether the type conversion can be folded into adjacent kernels instead of running as a separate copy.
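
Once the CPY tensors have been identified, grouping them by source and destination type shows which conversions dominate the traffic. The sketch below does that tally. The record fields and the example entries are hypothetical and filled in by hand; it does not parse or assume any particular debug output format.

```python
from collections import defaultdict
from typing import Dict, Iterable, NamedTuple, Tuple


class CpyRecord(NamedTuple):
    """One CPY op, entered by hand (hypothetical fields)."""
    name: str
    src_type: str
    dst_type: str
    n_elements: int


# Bytes per element for the non-quantized types of interest.
TYPE_SIZE = {"f32": 4, "f16": 2}


def tally(records: Iterable[CpyRecord]) -> Dict[Tuple[str, str], Tuple[int, int, int]]:
    """Map (src_type, dst_type) -> (op count, bytes read, bytes written)."""
    totals = defaultdict(lambda: [0, 0, 0])
    for r in records:
        entry = totals[(r.src_type, r.dst_type)]
        entry[0] += 1
        entry[1] += r.n_elements * TYPE_SIZE[r.src_type]
        entry[2] += r.n_elements * TYPE_SIZE[r.dst_type]
    return {pair: tuple(values) for pair, values in totals.items()}


# Placeholder entries, not measured data.
records = [
    CpyRecord("example_a", "f32", "f16", 1024),
    CpyRecord("example_b", "f32", "f16", 2048),
    CpyRecord("example_c", "f32", "f32", 512),
]

result = tally(records)
for (src, dst), (count, bytes_in, bytes_out) in sorted(result.items()):
    print(f"{src} -> {dst}: {count} ops, {bytes_in} B in, {bytes_out} B out")
```

Same-type pairs in the result point at pure layout copies, while mixed-type pairs point at conversions, which maps directly onto the last two items of the approach.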

sleepy added the perf label 2026-04-30 18:11:35 +02:00

sleepy changed title from ~~Investigate CPY overhead 159 MB per tick at 9B~~ to Investigate CPY overhead (159 MB/tick at 9B)

2026-04-30 18:16:38 +02:00


Reference: sleepy/llama.cpp#31