KV cache IO scaling with context length #32

Open
opened 2026-04-30 18:11:35 +02:00 by sleepy · 0 comments

Problem

As context length grows, KV cache reads and writes increase. This investigation tracks when KV cache IO becomes significant relative to weight reads.

Data (9B Q4_0, Qwen3.5: 8 full-attention layers, 24 GatedDeltaNet)

| Context | SET_ROWS bytes/tick | TPS  |
|---------|---------------------|------|
| 256     | 8 MB                | 52.9 |
| 2048    | 67 MB               | 52.8 |
| 8192    | 268 MB              | 52.5 |
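The linear growth claimed below can be checked directly from these three rows: if SET_ROWS traffic is linear in context, the bytes written per context position should be roughly constant across rows. A minimal sketch (values copied from the table; "bytes/tick" is assumed to mean total KV-cache write traffic per decode step):

```python
# SET_ROWS write volume per decode tick, keyed by context length (from the table)
data = {256: 8e6, 2048: 67e6, 8192: 268e6}

# If growth is linear, bytes-per-position should be ~constant (~32 KiB here)
for ctx, b in data.items():
    print(f"ctx={ctx:5d}: {b / ctx / 1024:.1f} KiB per context position")
```

The 2048 and 8192 rows give an identical ~32 KiB/position slope; the 256 row is slightly lower, consistent with fixed overheads mattering more at short context.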

Observations

  • SET_ROWS (KV cache writes) grows linearly with context
  • Even at 8K context, 268 MB is only 5.6% of MUL_MAT reads (4.8 GB)
  • TPS barely drops -- KV cache is not the bottleneck for this architecture
  • FLASH_ATTN_EXT reads are small (8.5 MB at ctx=256)
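The 5.6% figure in the second bullet follows directly from the measured volumes (both numbers taken from the observations above):

```python
set_rows = 268e6   # SET_ROWS bytes/tick at ctx=8192 (from the table)
mul_mat = 4.8e9    # MUL_MAT read bytes/tick (weight reads, from the bullet above)

ratio = set_rows / mul_mat
print(f"KV writes are {ratio:.1%} of weight reads")
```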

Architecture note

Only 8 of 32 layers use full attention (KV cache). The other 24 are GatedDeltaNet (recurrent state, not KV cache).
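This hybrid layout is why KV IO stays small: only the 8 full-attention layers contribute per-token KV growth. A sketch of the standard per-token KV footprint, counting only those layers; the head count and head dimension below are placeholders, not the real Qwen3.5 config:

```python
def kv_bytes_per_token(n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV-cache size: K and V each store n_kv_heads * head_dim
    values per full-attention layer (bytes_per_elem=2 assumes f16)."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical dims for illustration only: 8 attention layers, 4 KV heads, head_dim 128
print(kv_bytes_per_token(n_attn_layers=8, n_kv_heads=4, head_dim=128))
```

The 24 GatedDeltaNet layers hold a fixed-size recurrent state instead, so their IO does not grow with context at all.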

Next steps

  • Profile at context=32K to find crossover point
  • Profile 27B model at longer contexts (64 layers, 16 full-attention)
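Before the 32K run, a rough prediction: extrapolating the ~32 KiB/position slope measured at 2K to 8K contexts (and assuming MUL_MAT reads stay roughly constant, which holds for weight reads during decode):

```python
# Slope from the ctx=8192 row; assumes linearity continues past 8K
slope = 268e6 / 8192          # SET_ROWS bytes per context position
pred_32k = slope * 32768      # predicted SET_ROWS bytes/tick at ctx=32K

mul_mat = 4.8e9               # MUL_MAT read bytes/tick (assumed unchanged)
print(f"predicted SET_ROWS at 32K: {pred_32k / 1e9:.2f} GB "
      f"({pred_32k / mul_mat:.0%} of MUL_MAT reads)")
```

This predicts roughly 1.07 GB per tick at 32K, still well under weight-read volume, so the crossover point, if any, is likely beyond 32K for this architecture; the 32K profile should confirm or refute the linearity assumption.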
sleepy added the perf label 2026-04-30 18:11:35 +02:00

Reference: sleepy/llama.cpp#32