refactor: centralize CoT parsing in backend for streaming mode (#16394)
* refactor: unify reasoning handling via backend reasoning_content, drop frontend tag parsing - Updated the chat message component to surface backend-supplied reasoning via message.thinking while showing the raw assistant content without inline tag scrubbing - Simplified chat streaming to append content chunks directly, stream reasoning into the message model, and persist any partial reasoning when generation stops - Refactored the chat service SSE handler to rely on server-provided reasoning_content, removing legacy <think> parsing logic - Refreshed Storybook data and streaming flows to populate the thinking field explicitly for static and streaming assistant messages * refactor: implement streaming-aware universal reasoning parser Remove the streaming mode limitation from --reasoning-format by refactoring try_parse_reasoning() to handle incremental parsing of <think> tags across all formats. - Rework try_parse_reasoning() to track whitespace, partial tags, and multiple reasoning segments, allowing proper separation of reasoning_content and content in streaming mode - Parse reasoning tags before tool call handling in content-only and Llama 3.x formats to ensure inline <think> blocks are captured correctly - Change default reasoning_format from 'auto' to 'deepseek' for consistent behavior - Add 'deepseek-legacy' option to preserve old inline behavior when needed - Update CLI help and documentation to reflect streaming support - Add parser tests for inline <think>...</think> segments The parser now continues processing content after </think> closes instead of stopping, enabling proper message.reasoning_content and message.content separation in both streaming and non-streaming modes. Fixes the issue where streaming responses would dump everything (including post-thinking content) into reasoning_content while leaving content empty. * refactor: address review feedback from allozaur - Passed the assistant message content directly to ChatMessageAssistant to drop the redundant derived state in the chat message component - Simplified chat streaming updates by removing unused partial-thinking handling and persisting partial responses straight from currentResponse - Refreshed the ChatMessage stories to cover standard and reasoning scenarios without the old THINK-tag parsing examples Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * refactor: restore forced reasoning prefix to pass test-chat ([chat] All tests passed) - store the exact sequence seen on input when 'thinking_forced_open' enforces a reasoning block - inject this prefix before the first accumulated segment in 'reasoning_content', then clear it to avoid duplication - repeat the capture on every new 'start_think' detection to properly handle partial/streaming flows * refactor: address review feedback from ngxson * debug: say goodbye to curl -N, hello one-click raw stream - adds a new checkbox in the WebUI to display raw LLM output without backend parsing or frontend Markdown rendering * Update tools/server/webui/src/lib/components/app/chat/ChatMessages/ChatMessage.svelte Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> * webui: add Storybook example for raw LLM output and scope reasoning format toggle per story - Added a Storybook example that showcases the chat message component in raw LLM output mode with the provided trace sample - Updated every ChatMessage story to toggle the disableReasoningFormat setting so the raw-output rendering remains scoped to its own example * npm run format * chat-parser: address review feedback from ngxson Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com> --------- Co-authored-by: Aleksander Grygier <aleksander.grygier@gmail.com> Co-authored-by: Xuan Son Nguyen <thichthat@gmail.com>
This commit is contained in:
@@ -1,7 +1,6 @@
|
||||
<script lang="ts">
|
||||
import { getDeletionInfo } from '$lib/stores/chat.svelte';
|
||||
import { copyToClipboard } from '$lib/utils/copy';
|
||||
import { parseThinkingContent } from '$lib/utils/thinking';
|
||||
import ChatMessageAssistant from './ChatMessageAssistant.svelte';
|
||||
import ChatMessageUser from './ChatMessageUser.svelte';
|
||||
|
||||
@@ -47,26 +46,13 @@
|
||||
|
||||
let thinkingContent = $derived.by(() => {
|
||||
if (message.role === 'assistant') {
|
||||
if (message.thinking) {
|
||||
return message.thinking;
|
||||
}
|
||||
const trimmedThinking = message.thinking?.trim();
|
||||
|
||||
const parsed = parseThinkingContent(message.content);
|
||||
|
||||
return parsed.thinking;
|
||||
return trimmedThinking ? trimmedThinking : null;
|
||||
}
|
||||
return null;
|
||||
});
|
||||
|
||||
let messageContent = $derived.by(() => {
|
||||
if (message.role === 'assistant') {
|
||||
const parsed = parseThinkingContent(message.content);
|
||||
return parsed.cleanContent?.replace('<|channel|>analysis', '');
|
||||
}
|
||||
|
||||
return message.content?.replace('<|channel|>analysis', '');
|
||||
});
|
||||
|
||||
function handleCancelEdit() {
|
||||
isEditing = false;
|
||||
editedContent = message.content;
|
||||
@@ -165,7 +151,7 @@
|
||||
{editedContent}
|
||||
{isEditing}
|
||||
{message}
|
||||
{messageContent}
|
||||
messageContent={message.content}
|
||||
onCancelEdit={handleCancelEdit}
|
||||
onConfirmDelete={handleConfirmDelete}
|
||||
onCopy={handleCopy}
|
||||
|
||||
+22
-1
@@ -131,7 +131,11 @@
|
||||
</div>
|
||||
</div>
|
||||
{:else if message.role === 'assistant'}
|
||||
<MarkdownContent content={messageContent || ''} />
|
||||
{#if config().disableReasoningFormat}
|
||||
<pre class="raw-output">{messageContent || ''}</pre>
|
||||
{:else}
|
||||
<MarkdownContent content={messageContent || ''} />
|
||||
{/if}
|
||||
{:else}
|
||||
<div class="text-sm whitespace-pre-wrap">
|
||||
{messageContent}
|
||||
@@ -203,4 +207,21 @@
|
||||
background-position: -200% 0;
|
||||
}
|
||||
}
|
||||
|
||||
.raw-output {
|
||||
width: 100%;
|
||||
max-width: 48rem;
|
||||
margin-top: 1.5rem;
|
||||
padding: 1rem 1.25rem;
|
||||
border-radius: 1rem;
|
||||
background: hsl(var(--muted) / 0.3);
|
||||
color: var(--foreground);
|
||||
font-family:
|
||||
ui-monospace, SFMono-Regular, 'SF Mono', Monaco, 'Cascadia Code', 'Roboto Mono', Consolas,
|
||||
'Liberation Mono', Menlo, monospace;
|
||||
font-size: 0.875rem;
|
||||
line-height: 1.6;
|
||||
white-space: pre-wrap;
|
||||
word-break: break-word;
|
||||
}
|
||||
</style>
|
||||
|
||||
@@ -148,6 +148,12 @@
|
||||
key: 'showThoughtInProgress',
|
||||
label: 'Show thought in progress',
|
||||
type: 'checkbox'
|
||||
},
|
||||
{
|
||||
key: 'disableReasoningFormat',
|
||||
label:
|
||||
'Show raw LLM output without backend parsing and frontend Markdown rendering to inspect streaming across different models.',
|
||||
type: 'checkbox'
|
||||
}
|
||||
]
|
||||
},
|
||||
|
||||
Reference in New Issue
Block a user