Investigate GET_ROWS overhead (678 MB/tick at 9B) #30

Open
opened 2026-04-30 18:11:35 +02:00 by sleepy · 0 comments
Owner

Problem

GET_ROWS reads 678 MB per tick at 9B -- second only to MUL_MAT (4.8 GB). This includes:

  • Token embedding lookup (once per tick, tiny)
  • GatedDeltaNet recurrent state access (24 layers, dominates)

Data (9B Q4_0, ctx=256)

| Op | PerTick | BytesIn/tk | BytesOut/tk |
|----|---------|------------|-------------|
| GET_ROWS | 99 | 678 MB | 53 MB |
| MUL_MAT | 249 | 4797 MB | 7 MB |
| CPY | 97 | 106 MB | 53 MB |
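
To put the table in per-dispatch terms, here is a rough sketch that divides bytes read by dispatch count for each op. It assumes PerTick is the number of dispatches per tick, and the "share" it prints covers only the three ops listed here, not the whole graph. The dictionary layout and function names are made up for illustration.

```python
# Figures copied from the Data table (9B Q4_0, ctx=256).
# Assumption: "per_tick" is the number of dispatches of that op per tick.
ops = {
    "GET_ROWS": {"per_tick": 99, "bytes_in_mb": 678},
    "MUL_MAT": {"per_tick": 249, "bytes_in_mb": 4797},
    "CPY": {"per_tick": 97, "bytes_in_mb": 106},
}


def avg_in_mb(op):
    """Average MB read per dispatch of the given op."""
    return ops[op]["bytes_in_mb"] / ops[op]["per_tick"]


def share_of_listed(op):
    """Fraction of bytes read by this op among the three listed ops only."""
    total = sum(v["bytes_in_mb"] for v in ops.values())
    return ops[op]["bytes_in_mb"] / total


for name in ops:
    print(f"{name:9s} {avg_in_mb(name):6.2f} MB/dispatch "
          f"{share_of_listed(name):6.1%} of listed bytes in")
```

On these numbers an average GET_ROWS dispatch reads about 6.8 MB, which is far more than a single token embedding row would need and fits the claim that the recurrent state reads dominate.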

Questions

  • Why 99 GET_ROWS per tick? Token embedding plus 3 per GatedDeltaNet layer
  • Can recurrent state reads be fused into the GATED_DELTA_NET op?
  • Are state reads contiguous or strided?
  • Profile individual GET_ROWS dispatch times in an Xcode GPU trace (.gputrace capture)
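
As a quick sanity check on the first question, the sketch below counts the GET_ROWS dispatches that the stated hypothesis would predict and compares that with the observed 99. It assumes the 24 GatedDeltaNet layers named in the Problem section are the only ones involved; the constant and function names are illustrative only.

```python
# Inputs taken from the issue text; names are illustrative.
GDN_LAYERS = 24          # GatedDeltaNet layers (Problem section)
GET_ROWS_PER_LAYER = 3   # hypothesis from the first question
EMBEDDING_LOOKUPS = 1    # token embedding lookup, once per tick
OBSERVED = 99            # GET_ROWS PerTick from the Data table


def expected_get_rows(layers=GDN_LAYERS,
                      per_layer=GET_ROWS_PER_LAYER,
                      embedding=EMBEDDING_LOOKUPS):
    """GET_ROWS count predicted by 'embedding plus N per layer'."""
    return embedding + per_layer * layers


expected = expected_get_rows()
unexplained = OBSERVED - expected
print(f"expected={expected} observed={OBSERVED} unexplained={unexplained}")
```

Under these assumptions the hypothesis accounts for 73 of the 99 dispatches and leaves 26 unexplained. Since 26 is not a multiple of 24, the gap is not simply one extra GET_ROWS per GatedDeltaNet layer, so either the per-layer count or the layer count would need checking against the actual graph.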

Context scaling

GET_ROWS bytes_in stays constant at 678 MB regardless of context length (only depends on model size), so it does not become more dominant at long contexts.

sleepy added the perf label 2026-04-30 18:11:35 +02:00
sleepy changed title from Investigate GET_ROWS overhead 678 MB per tick at 9B to Investigate GET_ROWS overhead (678 MB/tick at 9B) 2026-04-30 18:16:38 +02:00
Reference: sleepy/llama.cpp#30