Agentic workloads re-prefill full context every turn (live session reuse + disk KV cache not hitting) #7
Summary
When using the server with an agentic client (Hermes), every tool-call round-trip re-prefills the full conversation from scratch. The server log shows repeated ctx=0..13745:13745 TOOLS prompt start followed by a full 43-second prefill, even for the same 13745-token conversation prefilled seconds earlier. The 0.. prefix means cached=0 — neither the live session reuse nor the disk KV cache is hitting.
What the server already has
Three layers of context caching in src/server/ds4_server.c generate_job():
Live session token-prefix reuse (memory-token): ds4_session_common_prefix compares the incoming prompt tokens against the live GPU session checkpoint. If the new prompt starts with the checkpoint, only the suffix is prefilled. Condition: common == old_pos && prompt.len >= old_pos.
Live session text-prefix reuse (memory-text): live_text_prefix_prompt renders the live tokens to text and byte-prefix matches the result against the request's prompt_text. If the live text is a byte-prefix of the new request, only the text suffix is re-tokenized and prefilled.
Disk KV cache (disk-text): --kv-disk-dir DIR enables persistent disk checkpoints. kv_cache_try_load_text finds a cached checkpoint whose text is a byte-prefix of the new request and loads it. This layer is NOT enabled by default; it requires --kv-disk-dir.
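The memory-token condition above can be illustrated with a minimal sketch. This is not the code from ds4_server.c: the helper names common_prefix_len and can_reuse_live_session are hypothetical, and token IDs are assumed to be int32_t. Only the condition common == old_pos && prompt.len >= old_pos is taken from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: number of leading token IDs shared by a and b. */
static size_t common_prefix_len(const int32_t *a, size_t a_len,
                                const int32_t *b, size_t b_len)
{
    assert(a != NULL || a_len == 0);
    assert(b != NULL || b_len == 0);
    size_t n = a_len < b_len ? a_len : b_len;
    size_t i = 0;
    while (i < n && a[i] == b[i]) {
        i++;
    }
    return i;
}

/* Hypothetical wrapper around the reuse condition quoted in this issue.
 * live/old_pos: tokens already held by the live session checkpoint.
 * On success, *suffix_start is the first prompt token that still needs
 * prefilling; on failure it is 0, meaning a full prefill. */
static bool can_reuse_live_session(const int32_t *live, size_t old_pos,
                                   const int32_t *prompt, size_t prompt_len,
                                   size_t *suffix_start)
{
    size_t common = common_prefix_len(live, old_pos, prompt, prompt_len);
    if (common == old_pos && prompt_len >= old_pos) {
        *suffix_start = old_pos;
        return true;
    }
    *suffix_start = 0;
    return false;
}
```

With this reading, a single differing token anywhere inside the checkpoint makes the check fail, which is consistent with the ctx=0.. lines in the log.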
Why agentic requests don't hit the cache
The log shows every TOOLS request starting at ctx=0..13745:13745 (cached=0). Possible causes:
The live session is evicted or reset between requests. The server has a single live GPU session. If Hermes sends a request that does not share a token or text prefix with the current live state (different tool-call result formatting, a re-rendered chat template, or a request arriving after eviction), both prefix checks fail and the server falls back to a full prefill.
The disk KV cache is not enabled. The server was started without --kv-disk-dir, so the disk-text path is disabled. Once the live session is evicted, there is no disk checkpoint to fall back on, which again means a full prefill.
Chat template re-rendering may break text-prefix matching. Agentic clients re-render the full conversation with tool results inserted. If the rendered text differs even slightly (whitespace, tool-result formatting, system prompt placement), byte_prefix_match fails.
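To show how fragile byte-prefix matching is, here is a small sketch. The helpers first_mismatch and is_byte_prefix are hypothetical stand-ins, not the server's byte_prefix_match, and they assume NUL-terminated strings. The first one can also help when debugging, since it reports the byte offset at which two renderings diverge.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: offset of the first byte where request stops
 * matching cached. Equals strlen(cached) when cached is a full prefix. */
static size_t first_mismatch(const char *cached, const char *request)
{
    size_t i = 0;
    /* If cached[i] is non-NUL and equal to request[i], request[i] is
     * non-NUL as well, so neither string is read past its terminator. */
    while (cached[i] != '\0' && cached[i] == request[i]) {
        i++;
    }
    return i;
}

/* True when every byte of cached appears at the start of request. */
static bool is_byte_prefix(const char *cached, const char *request)
{
    return cached[first_mismatch(cached, request)] == '\0';
}
```

For example, a cached rendering containing {"ok":true} is not a byte-prefix of a request containing {"ok": true}: the extra space after the colon is enough to make the match fail at byte 6 of that fragment.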
What might need adding/changing
Existing flags that may help (no code change)
--kv-disk-dir DIR: enables disk KV checkpoints (biggest lever, not currently enabled)
--kv-cache-cold-max-tokens N: saves cold first prompts up to N (default 30000)
--kv-cache-continued-interval-tokens N: saves aligned frontiers every N (default 10000)
--trace FILE: detailed trace logging including cache hit/miss diagnostics
Repro
DS4_KV_TURBO=1 DS4_CUDA_DECODE_GRAPH=1 ./ds4-server -m K128.gguf --ctx 1048576 --host 0.0.0.0 --port 17777
Connect an agentic client (Hermes) and run a session that makes tool calls. The log shows every TOOLS request as ctx=0..N:N (cached=0), each followed by a full prefill.
Acceptance
Trace analysis with --kv-disk-dir enabled (2026-06-23)
The server was restarted with all caching levers enabled. Here is what the trace reveals.
What is working
Memory-token reuse works perfectly between back-to-back tool calls. Most requests show cache_source: memory-token with cached_tokens > 0, small suffix prefills (<2k tokens, sub-10s).
Disk KV cache also works: a 12288-token checkpoint loaded from /tmp/ds4-kv/ in 62ms with cache_source: disk-text.
The killer: eviction → text-prefix mismatch → 136s full re-prefill
The trace confirms: cached_tokens: 0, cache_source: none, disk_cached_tokens: 0. Nothing matched after eviction.
Root cause: the disk cache checkpoints were stored at 12288, 20480, and 30720 tokens, EARLIER in the conversation and before tool results were injected. kv_cache_try_load_text() does a byte-prefix match on the rendered text. After tool calls mutate the history, the rendered text of the 37846-token prompt no longer starts with the text stored in those checkpoints, so none of them match. Byte-prefix matching inherently breaks as tool calls grow the history.
The model confusion
The model produced an unterminated tool call (gen=142 TOOLS DSML_START finish=error: unterminated tool call) during a suffix prefill after memory-text reuse.
Suggested fix
The live session already uses token-level prefix matching: ds4_session_common_prefix() compares token IDs and is therefore immune to chat template re-rendering. The disk cache should do the same: store the token checkpoint alongside the text checkpoint and match on the token prefix. Concretely, in kv_cache_try_load_text(): compare the first N tokens of the cached checkpoint against the live prompt tokens. If they match, use the checkpoint as the prefix and re-tokenize only the text suffix.
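A rough sketch of what that lookup could look like follows. Everything in it is an assumption made for illustration: the kv_checkpoint_meta struct, the pick_checkpoint_by_tokens name, and the idea that each disk checkpoint stores its token IDs as an int32_t array. It picks the longest checkpoint whose tokens are a full prefix of the prompt tokens, and it only helps if the prompt's token prefix really does stay stable across tool calls.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical per-checkpoint metadata kept next to the KV data on disk. */
typedef struct {
    const int32_t *tokens; /* token IDs covered by the checkpoint */
    size_t n_tokens;       /* e.g. 12288, 20480, 30720 */
} kv_checkpoint_meta;

/* Returns the index of the longest checkpoint whose tokens are a full
 * prefix of prompt, or -1 when none matches (full prefill needed). */
static int pick_checkpoint_by_tokens(const kv_checkpoint_meta *cps,
                                     size_t n_cps,
                                     const int32_t *prompt,
                                     size_t prompt_len)
{
    assert(cps != NULL || n_cps == 0);
    int best = -1;
    size_t best_len = 0;
    for (size_t i = 0; i < n_cps; i++) {
        size_t n = cps[i].n_tokens;
        /* Skip empty checkpoints, ones longer than the prompt, and
         * ones that cannot beat the current best match. */
        if (n == 0 || n > prompt_len || n <= best_len) {
            continue;
        }
        if (memcmp(cps[i].tokens, prompt, n * sizeof(int32_t)) == 0) {
            best = (int)i;
            best_len = n;
        }
    }
    return best;
}
```

After a hit, prefill would resume at token index n_tokens of the chosen checkpoint, in the same way the memory-token path prefills only the suffix.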