KV cache IO scaling with context length #32

Open
opened 2026-04-30 18:11:35 +02:00 by sleepy · 0 comments

Problem

As context length grows, KV cache reads and writes increase. This investigation tracks when KV cache IO becomes significant relative to weight reads.

Data (9B Q4_0, Qwen3.5: 8 full-attention layers, 24 GatedDeltaNet)

| Context | SET_ROWS bytes/tick | TPS  |
|---------|---------------------|------|
| 256     | 8 MB                | 52.9 |
| 2048    | 67 MB               | 52.8 |
| 8192    | 268 MB              | 52.5 |
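The linear growth claimed below can be checked directly from these three rows: if SET_ROWS traffic is linear in context, the bytes written per context position should be roughly constant across rows. A minimal sketch (values copied from the table; "bytes/tick" is assumed to mean total KV-cache write traffic per decode step):

```python
# SET_ROWS write volume per decode tick, keyed by context length (from the table)
data = {256: 8e6, 2048: 67e6, 8192: 268e6}

# If growth is linear, bytes-per-position should be ~constant (~32 KiB here)
for ctx, b in data.items():
    print(f"ctx={ctx:5d}: {b / ctx / 1024:.1f} KiB per context position")
```

The 2048 and 8192 rows give an identical ~32 KiB/position slope; the 256 row is slightly lower, consistent with fixed overheads mattering more at short context.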

Observations

  • SET_ROWS (KV cache writes) grows linearly with context
  • Even at 8K context, 268 MB is only 5.6% of MUL_MAT reads (4.8 GB)
  • TPS barely drops -- KV cache is not the bottleneck for this architecture
  • FLASH_ATTN_EXT reads are small (8.5 MB at ctx=256)
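The 5.6% figure in the second bullet follows directly from the measured volumes (both numbers taken from the observations above):

```python
set_rows = 268e6   # SET_ROWS bytes/tick at ctx=8192 (from the table)
mul_mat = 4.8e9    # MUL_MAT read bytes/tick (weight reads, from the bullet above)

ratio = set_rows / mul_mat
print(f"KV writes are {ratio:.1%} of weight reads")
```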

Architecture note

Only 8 of 32 layers use full attention (KV cache). The other 24 are GatedDeltaNet (recurrent state, not KV cache).
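This hybrid layout is why KV IO stays small: only the 8 full-attention layers contribute per-token KV growth. A sketch of the standard per-token KV footprint, counting only those layers; the head count and head dimension below are placeholders, not the real Qwen3.5 config:

```python
def kv_bytes_per_token(n_attn_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Per-token KV-cache size: K and V each store n_kv_heads * head_dim
    values per full-attention layer (bytes_per_elem=2 assumes f16)."""
    return 2 * n_attn_layers * n_kv_heads * head_dim * bytes_per_elem

# Hypothetical dims for illustration only: 8 attention layers, 4 KV heads, head_dim 128
print(kv_bytes_per_token(n_attn_layers=8, n_kv_heads=4, head_dim=128))
```

The 24 GatedDeltaNet layers hold a fixed-size recurrent state instead, so their IO does not grow with context at all.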

Next steps

  • Profile at context=32K to find crossover point
  • Profile 27B model at longer contexts (64 layers, 16 full-attention)
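Before the 32K run, a rough prediction: extrapolating the ~32 KiB/position slope measured at 2K to 8K contexts (and assuming MUL_MAT reads stay roughly constant, which holds for weight reads during decode):

```python
# Slope from the ctx=8192 row; assumes linearity continues past 8K
slope = 268e6 / 8192          # SET_ROWS bytes per context position
pred_32k = slope * 32768      # predicted SET_ROWS bytes/tick at ctx=32K

mul_mat = 4.8e9               # MUL_MAT read bytes/tick (assumed unchanged)
print(f"predicted SET_ROWS at 32K: {pred_32k / 1e9:.2f} GB "
      f"({pred_32k / mul_mat:.0%} of MUL_MAT reads)")
```

This predicts roughly 1.07 GB per tick at 32K, still well under weight-read volume, so the crossover point, if any, is likely beyond 32K for this architecture; the 32K profile should confirm or refute the linearity assumption.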
sleepy added the perf label 2026-04-30 18:11:35 +02:00

Reference: sleepy/llama.cpp#32