Investigate CPY overhead (159 MB/tick at 9B) #31

Open
opened 2026-04-30 18:11:35 +02:00 by sleepy · 0 comments

Problem

CPY reads 106 MB and writes 53 MB per tick, 159 MB of memory traffic in total. These are real GPU memory copies performed for type conversion and layout transformation, so they are not zero-cost.

Data (9B Q4_0, ctx=256)

| Op  | PerTick | BytesIn/tk | BytesOut/tk |
|-----|---------|------------|-------------|
| CPY | 97      | 106 MB     | 53 MB       |
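
As a quick sanity check on these figures, the sketch below derives the total traffic, the read:write ratio, and the average traffic per CPY. It assumes decimal megabytes, and the 100 GB/s bandwidth is only a placeholder, not a measurement of any device. The 2:1 read:write ratio would be consistent with 4-byte elements being written as 2-byte elements (f32 to f16), though a mix of other copies could produce the same ratio.

```python
# Back-of-the-envelope numbers derived from the table above.
MB = 1_000_000  # assumption: the table reports decimal megabytes

cpy_per_tick = 97
bytes_in = 106 * MB
bytes_out = 53 * MB

total = bytes_in + bytes_out          # memory traffic per tick
ratio = bytes_in / bytes_out          # read:write ratio
avg_per_cpy = total / cpy_per_tick    # average traffic per CPY


def transfer_ms(n_bytes, bandwidth_gb_s):
    """Lower bound on copy time if the copies were purely bandwidth-bound."""
    return n_bytes / (bandwidth_gb_s * 1e9) * 1e3


print(f"total traffic : {total / MB:.0f} MB/tick")
print(f"read:write    : {ratio:.1f}")
print(f"avg per CPY   : {avg_per_cpy / MB:.2f} MB")
# 100 GB/s is a placeholder bandwidth, not a measured figure.
print(f"min time      : {transfer_ms(total, 100):.2f} ms at 100 GB/s")
```

Substituting the real memory bandwidth of the target GPU gives a floor for the per-tick cost; measured dispatch times well above that floor would point at dispatch overhead or strided access rather than raw bandwidth.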

Questions

  • What tensor types are being converted? Likely f16 <-> f32 transitions
  • Can layout be restructured to avoid copies?
  • Are these contiguous or strided copies?
  • How long does each individual CPY dispatch take?
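
For the contiguous-versus-strided question, the check can be made from a tensor's dimensions and byte strides. The sketch below is a simplified model, not ggml's own implementation: it assumes non-quantized element types, treats dimension 0 as the fastest-varying one, and applies the stride test to every dimension, including those of size 1. The function name and example shapes are illustrative.

```python
def is_contiguous(ne, nb, elem_size):
    """Return True if a tensor with dims `ne` and byte strides `nb` is densely packed.

    Simplified model: non-quantized element types only, dimension 0 varies fastest.
    """
    expected = elem_size
    for n, stride in zip(ne, nb):
        if stride != expected:
            return False
        expected *= n
    return True


# A dense 4 x 3 f32 tensor: stride 4 bytes, then 4 * 4 = 16 bytes per row.
dense = is_contiguous(ne=[4, 3, 1, 1], nb=[4, 16, 48, 48], elem_size=4)

# The same buffer viewed with its first two dims swapped (a transposed view).
permuted = is_contiguous(ne=[3, 4, 1, 1], nb=[16, 4, 48, 48], elem_size=4)

print(dense, permuted)
```

A CPY whose source fails this test is doing a layout transformation, not just a type conversion, and is a candidate for the "restructure the layout" question above.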

Approach

  • Use GGML_METAL_GRAPH_DEBUG=2 to identify the specific CPY tensors
  • Check whether input and output layouts can be made consistent
  • Consider whether the type conversion can happen inside other kernels instead of as a separate copy
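
Once the individual CPY tensors are identified, grouping them by source and destination type shows which conversions dominate the traffic. The sketch below assumes each copy has already been reduced to a `(src_type, dst_type, n_elements)` record; that record format is hypothetical, nothing here assumes what the debug output looks like, and the example records are placeholders rather than data from a real run.

```python
from collections import defaultdict

# Bytes per element for the two types in question.
ELEM_SIZE = {"f32": 4, "f16": 2}


def summarize(copies):
    """Group CPY records by (src_type, dst_type) and total their traffic."""
    totals = defaultdict(lambda: {"count": 0, "bytes_in": 0, "bytes_out": 0})
    for src_type, dst_type, n_elements in copies:
        entry = totals[(src_type, dst_type)]
        entry["count"] += 1
        entry["bytes_in"] += n_elements * ELEM_SIZE[src_type]
        entry["bytes_out"] += n_elements * ELEM_SIZE[dst_type]
    return dict(totals)


# Placeholder records, NOT taken from a real run.
example = [("f32", "f16", 1024), ("f32", "f16", 1024), ("f32", "f32", 2048)]

summary = summarize(example)
for (src, dst), t in sorted(summary.items(), key=lambda kv: -kv[1]["bytes_in"]):
    print(f"{src} -> {dst}: {t['count']} copies, "
          f"{t['bytes_in']} B in, {t['bytes_out']} B out")
```

Same-type groups (for example f32 to f32) are pure layout copies, while mixed-type groups are conversions that might be absorbed into a neighbouring kernel.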
sleepy added the perf label 2026-04-30 18:11:35 +02:00
sleepy changed title from Investigate CPY overhead 159 MB per tick at 9B to Investigate CPY overhead (159 MB/tick at 9B) 2026-04-30 18:16:38 +02:00
Reference: sleepy/llama.cpp#31