Agentic workloads re-prefill full context every turn (live session reuse + disk KV cache not hitting) #7

Open
opened 2026-06-23 11:27:59 +02:00 by sleepy · 1 comment
Owner

Summary

When using the server with an agentic client (Hermes), every tool-call round-trip re-prefills the full conversation from scratch. The server log shows repeated ctx=0..13745:13745 TOOLS prompt start followed by a full 43-second prefill, even for the same 13745-token conversation prefilled seconds earlier. The 0.. prefix means cached=0 — neither the live session reuse nor the disk KV cache is hitting.

What the server already has

Three layers of context caching in src/server/ds4_server.c generate_job():

  1. Live session token-prefix reuse (memory-token): ds4_session_common_prefix compares the incoming prompt tokens against the live GPU session checkpoint. If the new prompt starts with the checkpoint's tokens, only the suffix is prefilled. Condition: common == old_pos && prompt.len >= old_pos.

  2. Live session text-prefix reuse (memory-text): live_text_prefix_prompt renders the live tokens to text and byte-prefix matches the result against the request's prompt_text. If the live text is a byte-prefix of the new request, only the text suffix is re-tokenized and prefilled.

  3. Disk KV cache (disk-text): --kv-disk-dir DIR enables persistent disk checkpoints. kv_cache_try_load_text finds a cached checkpoint whose text is a byte-prefix of the new request and loads it. NOT enabled by default; requires --kv-disk-dir.
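
To make the memory-token condition concrete, here is a minimal sketch of a token-prefix check. The names common_prefix_len and reusable_live_tokens are hypothetical and are not the functions in ds4_server.c; only the condition common == old_pos && prompt.len >= old_pos comes from the description above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of leading token IDs shared by two sequences. */
size_t common_prefix_len(const int32_t *a, size_t a_len,
                         const int32_t *b, size_t b_len)
{
    assert((a != NULL || a_len == 0) && (b != NULL || b_len == 0));
    size_t n = a_len < b_len ? a_len : b_len;
    size_t i = 0;
    while (i < n && a[i] == b[i]) {
        i++;
    }
    return i;
}

/* Returns how many live-session tokens can be reused (old_pos) when the
 * whole live checkpoint is a prefix of the new prompt, otherwise 0.
 * Mirrors: common == old_pos && prompt.len >= old_pos. */
size_t reusable_live_tokens(const int32_t *live, size_t old_pos,
                            const int32_t *prompt, size_t prompt_len)
{
    size_t common = common_prefix_len(live, old_pos, prompt, prompt_len);
    if (common == old_pos && prompt_len >= old_pos) {
        return old_pos;
    }
    return 0;
}
```

Under this all-or-nothing condition, a single differing token anywhere inside the live checkpoint yields 0 reusable tokens, which would show up in the log as ctx=0.. .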

Why agentic requests don't hit the cache

The log shows every TOOLS request starting at ctx=0..13745:13745 (cached=0). Possible causes:

  1. Live session evicted or reset between requests. The server has a single live GPU session. If Hermes sends a request that doesn't share a token or text prefix with the current live state (different tool-call result formatting, a re-rendered chat template, a request arriving after eviction), both prefix checks fail → full prefill.

  2. Disk KV cache not enabled. The server was started without --kv-disk-dir, so the disk-text path is disabled. Once the live session is evicted, there is no disk checkpoint to fall back on → full prefill.

  3. Chat template re-rendering may break text-prefix matching. Agentic clients re-render the full conversation with tool results inserted. If the rendered text differs even slightly (whitespace, tool-result formatting, system prompt placement), byte_prefix_match fails.
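
The fragility in cause 3 can be illustrated with a small stand-in for a byte-prefix check. The function is_byte_prefix is a hypothetical illustration, not the server's byte_prefix_match, and the template strings used with it are made up.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for a byte-prefix check: true only when `cached`
 * matches the first strlen(cached) bytes of `request` exactly. */
bool is_byte_prefix(const char *cached, const char *request)
{
    assert(cached != NULL && request != NULL);
    size_t n = strlen(cached);
    return strlen(request) >= n && memcmp(cached, request, n) == 0;
}
```

With a cached text of "<user>hi</user>\n", a request that begins "<user>hi</user>\n<tool>ok</tool>" matches, while one that begins "<user>hi</user> \n<tool>ok</tool>" (one extra space before the newline) does not, even though the conversation is logically the same.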

What might need adding/changing

  • Enable --kv-disk-dir by default, or document it prominently for agentic use. After the live session is evicted, the disk cache can restore a prior checkpoint.
  • Investigate why live session reuse fails between agentic turns. The log shows ctx=0.. even for back-to-back requests with the same token count. A --trace run is needed to see the common_prefix value and why it's 0.
  • Consider session pinning or a conversation ID so the server keeps the live session across tool-call round-trips.

Existing flags that may help (no code change)

--kv-disk-dir DIR: enables disk KV checkpoints (biggest lever, not currently enabled)
--kv-cache-cold-max-tokens N: saves cold first prompts of up to N tokens (default 30000)
--kv-cache-continued-interval-tokens N: saves aligned frontiers every N tokens (default 10000)
--trace FILE: detailed trace logging, including cache hit/miss diagnostics

Repro

DS4_KV_TURBO=1 DS4_CUDA_DECODE_GRAPH=1 ./ds4-server -m K128.gguf --ctx 1048576 --host 0.0.0.0 --port 17777
Use agentic client (Hermes) with tool calls. Log shows every TOOLS request ctx=0..N:N (cached=0) and full prefill.

Acceptance

  • Agentic tool-call round-trips reuse prior turn KV state (log shows ctx=N..M:suffix with cached>0)
  • Re-prefill time for continued conversations drops from ~43s to <5s
  • Works with and without --kv-disk-dir
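
To check the first criterion mechanically, a log line can be parsed for its ctx= field. The sketch below assumes the field has the shape ctx=<cached>..<total>:<suffix>, which is inferred from the examples in this issue rather than taken from the server source. The ctx_span type and parse_ctx_span function are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical view of one "ctx=<cached>..<total>:<suffix>" log field. */
typedef struct {
    long cached;  /* tokens reused from a cache (0 means a full prefill) */
    long total;   /* total prompt tokens */
    long suffix;  /* tokens that still had to be prefilled */
} ctx_span;

/* Finds the first "ctx=" field in a log line and parses it.
 * Returns false when the line has no such field or it is malformed. */
bool parse_ctx_span(const char *line, ctx_span *out)
{
    assert(line != NULL && out != NULL);
    const char *p = strstr(line, "ctx=");
    if (p == NULL) {
        return false;
    }
    return sscanf(p, "ctx=%ld..%ld:%ld",
                  &out->cached, &out->total, &out->suffix) == 3;
}
```

A round-trip would count as a cache hit when parse_ctx_span succeeds and cached > 0.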
Author
Owner

Trace analysis with --kv-disk-dir enabled (2026-06-23)

The server was restarted with all caching levers enabled. Here is what the trace reveals.

What is working

Memory-token reuse works perfectly between back-to-back tool calls. Most requests show cache_source: memory-token with cached_tokens > 0, small suffix prefills (<2k tokens, sub-10s).

Disk KV cache also works: a 12288-token checkpoint loaded from /tmp/ds4-kv/ in 62ms with cache_source: disk-text.

The killer: eviction → text-prefix mismatch → 136s full re-prefill

11:38:27  kv cache stored tokens=36729 reason=evict        ← live session evicted
11:38:27  chat ctx=0..37846:37846 TOOLS prompt start       ← cached=0, full 38K prefill
11:40:43  chat ctx=0..37846:37846 TOOLS prompt done 135.908s
11:40:43  gen=3 finish=error: client stream write failed    ← client timed out

The trace confirms: cached_tokens: 0, cache_source: none, disk_cached_tokens: 0. Nothing matched after eviction.
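
For a rough sense of scale, the trace figures give an observed prefill rate, which in turn bounds how large a suffix can be while staying inside a time budget. This is a back-of-the-envelope sketch that assumes prefill time scales linearly with token count, which is only an approximation. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Observed prefill throughput in tokens per second. */
double prefill_rate(size_t tokens, double seconds)
{
    assert(seconds > 0.0);
    return (double)tokens / seconds;
}

/* Largest suffix (in tokens) that fits a time budget at a given rate,
 * under the linear-scaling assumption stated above. */
size_t max_suffix_tokens(double rate_tok_per_s, double budget_s)
{
    assert(rate_tok_per_s >= 0.0 && budget_s >= 0.0);
    return (size_t)(rate_tok_per_s * budget_s);
}
```

With 37846 tokens in 135.908 s, the rate works out to roughly 278 tokens/s, so a 5-second budget corresponds to a suffix of about 1,390 tokens. That is consistent with the sub-10s prefills seen for suffixes under 2k tokens.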

Root cause: the disk cache checkpoints were stored at 12288, 20480, and 30720 tokens, earlier in the conversation and before tool results were injected. kv_cache_try_load_text() does a byte-prefix match on the rendered text. After tool calls mutate the history, the rendered text of the 37846-token prompt no longer begins with the text stored for any of those checkpoints, so none of them load. Byte-prefix matching breaks whenever re-rendering changes text that an earlier checkpoint already covers.

The model confusion

The model produced an unterminated tool call (gen=142 TOOLS DSML_START finish=error: unterminated tool call) during a suffix prefill after memory-text reuse.

Suggested fix

The live session already uses token-level prefix matching (ds4_session_common_prefix() compares token IDs, immune to chat template re-rendering). The disk cache should do the same: store the token checkpoint alongside the text checkpoint, and match on token prefix. Concretely, in kv_cache_try_load_text(): compare the first N tokens of the cached checkpoint against the incoming prompt tokens. If they match, use the checkpoint as the prefix and re-tokenize only the text suffix.
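
A minimal sketch of the proposed lookup, assuming each disk checkpoint also stores the token IDs it covers. The kv_checkpoint struct and best_token_prefix_checkpoint function are hypothetical and do not reflect the existing on-disk format.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical record: the token IDs covered by one stored KV checkpoint. */
typedef struct {
    const int32_t *tokens;
    size_t n_tokens;
} kv_checkpoint;

/* Returns the index of the longest checkpoint whose tokens are an exact
 * prefix of the incoming prompt, or -1 when none qualifies. */
int best_token_prefix_checkpoint(const kv_checkpoint *cps, size_t n_cps,
                                 const int32_t *prompt, size_t prompt_len)
{
    assert(cps != NULL || n_cps == 0);
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < n_cps; i++) {
        size_t n = cps[i].n_tokens;
        /* Skip empty checkpoints, ones longer than the prompt, and ones
         * that cannot beat the current best. */
        if (n == 0 || n > prompt_len || n <= best_len) {
            continue;
        }
        if (memcmp(cps[i].tokens, prompt, n * sizeof(int32_t)) == 0) {
            best = (int)i;
            best_len = n;
        }
    }
    return best;
}
```

Storing full token arrays is the simplest option. A hash of the token prefix would be smaller on disk, at the cost of needing a collision check before the KV state is trusted.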

Reference
sleepy/ds4-nvfp4-spark#7