sleepy/ds4-nvfp4-spark

Agentic workloads re-prefill full context every turn (live session reuse + disk KV cache not hitting) #7

Open

opened 2026-06-23 11:27:59 +02:00 by sleepy · 1 comment

sleepy commented

2026-06-23 11:27:59 +02:00

Owner

Summary

When using the server with an agentic client (Hermes), every tool-call round-trip re-prefills the full conversation from scratch. The server log shows repeated ctx=0..13745:13745 TOOLS prompt start followed by a full 43-second prefill, even for the same 13745-token conversation prefilled seconds earlier. The 0.. prefix means cached=0 — neither the live session reuse nor the disk KV cache is hitting.

What the server already has

Three layers of context caching in src/server/ds4_server.c generate_job():

1. Live session token-prefix reuse (memory-token): ds4_session_common_prefix compares the incoming prompt tokens against the live GPU session checkpoint. If the new prompt starts with the checkpoint, only the suffix is prefilled. Condition: common == old_pos && prompt.len >= old_pos.
2. Live session text-prefix reuse (memory-text): live_text_prefix_prompt renders the live tokens to text and byte-prefix matches the result against the request's prompt_text. If the live text is a byte-prefix of the new request, only the text suffix is re-tokenized and prefilled.
3. Disk KV cache (disk-text): --kv-disk-dir DIR enables persistent disk checkpoints. kv_cache_try_load_text finds a cached checkpoint whose text is a byte-prefix of the new request and loads it. Not enabled by default; requires --kv-disk-dir.
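
As a minimal sketch of the memory-token condition quoted in item 1, the check can be written as below. The function names and the plain `int` token type are assumptions for illustration; this is not the code in ds4_server.c, only the condition `common == old_pos && prompt.len >= old_pos` spelled out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Length of the shared leading run of two token arrays. */
size_t common_prefix_len(const int *a, size_t a_len,
                         const int *b, size_t b_len) {
    size_t n = a_len < b_len ? a_len : b_len;
    size_t i = 0;
    while (i < n && a[i] == b[i]) i++;
    return i;
}

/* Hypothetical helper: reuse is possible only when the whole live
 * checkpoint (old_pos tokens) is a prefix of the incoming prompt. */
bool can_reuse_live_session(const int *live, size_t old_pos,
                            const int *prompt, size_t prompt_len) {
    size_t common = common_prefix_len(live, old_pos, prompt, prompt_len);
    return common == old_pos && prompt_len >= old_pos;
}
```

A single differing token anywhere inside the first old_pos positions makes the check fail, which under this reading means a full prefill rather than a partial one.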

Why agentic requests don't hit the cache

Log shows every TOOLS request starts at ctx=0..13745:13745 (cached=0). Possible causes:

1. Live session evicted or reset between requests. The server has a single live GPU session. If Hermes sends a request that does not share a token or text prefix with the current live state (different tool-call result formatting, a re-rendered chat template, a request arriving after eviction), both prefix checks fail and the result is a full prefill.
2. Disk KV cache not enabled. The server was started without --kv-disk-dir, so the disk-text path is disabled. Once the live session is evicted, there is no disk checkpoint to fall back on, so the result is again a full prefill.
3. Chat template re-rendering may break text-prefix matching. Agentic clients re-render the full conversation with tool results inserted. If the rendered text differs even slightly (whitespace, tool-result formatting, system prompt placement), byte_prefix_match fails.
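
To illustrate cause 3, here is a small sketch of an exact byte-prefix test plus a helper that reports where two renderings first diverge. Both function names and the tag syntax in the usage note are hypothetical; the server's own byte_prefix_match may be implemented differently.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True when `cached` is an exact byte-prefix of `request`. */
bool is_byte_prefix(const char *cached, const char *request) {
    size_t n = strlen(cached);
    return strlen(request) >= n && memcmp(cached, request, n) == 0;
}

/* Offset of the first differing byte, or the length of the shorter
 * string if one is a prefix of the other. Handy for diagnosing a miss. */
size_t first_mismatch(const char *a, const char *b) {
    size_t i = 0;
    while (a[i] != '\0' && b[i] != '\0' && a[i] == b[i]) i++;
    return i;
}
```

For example, if an earlier turn was rendered with a trailing newline inside a tool result and the next request renders the same result without it, is_byte_prefix returns false even though the conversation is logically unchanged, and first_mismatch points at the offending byte.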

What might need adding/changing

Enable --kv-disk-dir by default or document prominently for agentic use. Even if live session evicted, disk cache restores prior checkpoint.
Investigate why live session reuse fails between agentic turns. Log shows ctx=0.. even for back-to-back same-token-count requests. Need --trace to see common_prefix value and why it's 0.
Consider session pinning or conversation ID so server keeps live session across tool-call round-trips.

Existing flags that may help (no code change)

--kv-disk-dir DIR: enables disk KV checkpoints (biggest lever, not currently enabled)
--kv-cache-cold-max-tokens N: saves cold first prompts up to N (default 30000)
--kv-cache-continued-interval-tokens N: saves aligned frontiers every N (default 10000)
--trace FILE: detailed trace logging including cache hit/miss diagnostics

Repro

DS4_KV_TURBO=1 DS4_CUDA_DECODE_GRAPH=1 ./ds4-server -m K128.gguf --ctx 1048576 --host 0.0.0.0 --port 17777
Use agentic client (Hermes) with tool calls. Log shows every TOOLS request ctx=0..N:N (cached=0) and full prefill.

Acceptance

Agentic tool-call round-trips reuse prior turn KV state (log shows ctx=N..M:suffix with cached>0)
Re-prefill time for continued conversations drops from ~43s to <5s
Works with and without --kv-disk-dir
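
A possible way to check the first acceptance criterion mechanically is to parse the ctx= field out of each log line. The sketch below assumes the field reads as cached..total:prefill, which is inferred from the log lines quoted in this issue and not from a documented format; the struct and function names are hypothetical.

```c
#include <stdbool.h>
#include <stdio.h>
#include <string.h>

/* Assumed meaning of "ctx=A..B:C": A cached tokens, B total, C to prefill. */
typedef struct {
    long cached;
    long total;
    long prefill;
} ctx_span;

/* Returns true and fills `out` if the line contains a ctx= field. */
bool parse_ctx_span(const char *line, ctx_span *out) {
    const char *p = strstr(line, "ctx=");
    if (p == NULL) return false;
    return sscanf(p, "ctx=%ld..%ld:%ld",
                  &out->cached, &out->total, &out->prefill) == 3;
}
```

A run would pass the criterion when TOOLS requests after the first turn parse with cached > 0.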


sleepy commented

2026-06-23 11:46:52 +02:00

Author

Owner

Trace analysis with --kv-disk-dir enabled (2026-06-23)

The server was restarted with all caching levers. Here is what the trace reveals.

What is working

Memory-token reuse works perfectly between back-to-back tool calls. Most requests show cache_source: memory-token with cached_tokens > 0, small suffix prefills (<2k tokens, sub-10s).

Disk KV cache also works: a 12288-token checkpoint loaded from /tmp/ds4-kv/ in 62ms with cache_source: disk-text.

The killer: eviction → text-prefix mismatch → 136s full re-prefill

11:38:27  kv cache stored tokens=36729 reason=evict        ← live session evicted
11:38:27  chat ctx=0..37846:37846 TOOLS prompt start       ← cached=0, full 38K prefill
11:40:43  chat ctx=0..37846:37846 TOOLS prompt done 135.908s
11:40:43  gen=3 finish=error: client stream write failed    ← client timed out

The trace confirms: cached_tokens: 0, cache_source: none, disk_cached_tokens: 0. Nothing matched after eviction.

Root cause: the disk cache checkpoints were stored at 12288, 20480, and 30720 tokens earlier in the conversation, before the later tool results were injected. kv_cache_try_load_text() does a byte-prefix match on the rendered text. After tool calls mutate the history, the rendered 37846-token prompt no longer begins with the text stored for any of those checkpoints, so none of them match. Byte-prefix matching breaks whenever re-rendering changes bytes inside an already-checkpointed span, which is what tool-call history growth appears to cause here.
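
For a rough sense of what a checkpoint hit would have saved, the arithmetic can be sketched as follows. It assumes prefill throughput is constant, which is a simplification since cost per token likely grows with context length; the function names are hypothetical.

```c
/* Observed throughput in tokens per second. */
double prefill_rate(double tokens, double seconds) {
    return tokens / seconds;
}

/* Estimated prefill time if `cached` of `total` tokens are restored. */
double est_prefill_seconds(long total, long cached, double rate) {
    return (double)(total - cached) / rate;
}
```

With the figures from the trace, 37846 tokens in 135.908 s is about 278 tokens/s. Had the 30720-token checkpoint matched, the remaining 7126 tokens would have taken roughly 26 s under this assumption, and a sub-5 s turn would need a suffix of about 1400 tokens or fewer.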

The model confusion

The model produced an unterminated tool call (gen=142 TOOLS DSML_START finish=error: unterminated tool call) during a suffix prefill after memory-text reuse.

Suggested fix

The live session already uses token-level prefix matching (ds4_session_common_prefix() compares token IDs, immune to chat template re-rendering). The disk cache should do the same: store the token checkpoint alongside the text checkpoint, and match on token prefix. Concretely, in kv_cache_try_load_text(): compare the first N tokens of the cached checkpoint against the incoming prompt tokens. If they match, use the checkpoint as the prefix and re-tokenize only the text suffix.
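
The lookup proposed above could be sketched as picking the longest stored checkpoint whose token IDs are a full prefix of the incoming prompt. The kv_checkpoint struct, its fields, and the function name are assumptions made for illustration; the real on-disk format and the signature of kv_cache_try_load_text() are not shown in this issue.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical view of a disk checkpoint: the token IDs it covers. */
typedef struct {
    const int *tokens;
    size_t len;
} kv_checkpoint;

/* Index of the longest checkpoint that is a token-prefix of the prompt,
 * or -1 if none matches. */
int find_token_prefix_checkpoint(const kv_checkpoint *cps, size_t n,
                                 const int *prompt, size_t prompt_len) {
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < n; i++) {
        if (cps[i].len == 0 || cps[i].len > prompt_len) continue;
        if (cps[i].len <= best_len) continue; /* not an improvement */
        if (memcmp(cps[i].tokens, prompt, cps[i].len * sizeof(int)) != 0)
            continue;                         /* tokens diverge */
        best = (int)i;
        best_len = cps[i].len;
    }
    return best;
}
```

This only helps when the incoming token sequence really does begin with the checkpointed tokens. Storing a hash of the token prefix instead of the full array might be cheaper, at the cost of needing a collision check.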


sleepy referenced this issue from a commit

2026-06-23 12:47:16 +02:00

issue #7: evicted-session token-prefix recovery for KV disk cache
