[bug] Thinking tokens leak into user-visible output via MTP path #1

Closed

opened 2026-05-03 17:57:53 +02:00 by sleepy · 1 comment

sleepy commented

2026-05-03 17:57:53 +02:00

(Migrated from localhost:18431)

When using the MTP fast path in BatchedEngine.stream_generate(), Qwen3.6 thinking blocks are streamed verbatim to the user instead of being stripped.

The MTP fast path calls mlx-lm stream_generate() directly, bypassing oMLX thinking token handling.

Acceptance criteria:

- <think> blocks stripped from output
- Thinking tokens counted separately
- Output matches standard batching path behavior

sleepy commented

2026-05-03 18:49:03 +02:00

(Migrated from localhost:18431)

Root cause identified:

The standard batching path sends RAW text (with <think> tags intact) to the client. The web UI then uses extractThinking() to separate thinking blocks from content and renders them separately.
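For reference, a minimal Python sketch of the kind of separation the client performs. The real extractThinking() lives in the web UI and is not shown in this thread, so the function name `extract_thinking`, the regex, and the handling of an unterminated block are all assumptions for illustration only.

```python
import re
from typing import Tuple

# Matches one complete thinking block; DOTALL lets it span newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_thinking(raw: str) -> Tuple[str, str]:
    """Split raw model text into (thinking, content).

    Hypothetical analogue of the web UI's extractThinking().
    """
    thinking = "\n".join(m.strip() for m in THINK_RE.findall(raw))
    content = THINK_RE.sub("", raw)

    # An opening tag without a closing tag means the block is still streaming:
    # treat everything after it as thinking, not as user-visible content.
    open_idx = content.find("<think>")
    if open_idx != -1:
        tail = content[open_idx + len("<think>"):].strip()
        thinking = f"{thinking}\n{tail}" if thinking else tail
        content = content[:open_idx]

    return thinking, content.strip()


finished = extract_thinking("<think>plan</think>Hello world")
streaming = extract_thinking("<think>still go")
```

Because this runs on the accumulated raw text, the server only has to deliver the tags intact; it does not need to know where a thinking block starts or ends.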

The MTP fast path incorrectly tries to strip thinking on the SERVER with ThinkingParser, which:

1. Breaks client-side thinking block rendering
2. Causes word spacing issues because ThinkingParser.feed() doesn't handle incremental detokenized text properly

Fix needed:

1. Remove ThinkingParser from MTP fast path
2. Send raw text (with thinking tags) to client, matching standard path behavior
3. Fix word spacing by properly accumulating response.text (which is already detokenized incremental text from mlx-lm)
4. Let client-side extractThinking() handle thinking separation, as it did before MTP
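The steps above could look roughly like the sketch below. It is not the actual BatchedEngine code: `Chunk` is a hypothetical stand-in for the response objects yielded by mlx-lm's stream_generate() (only the `text` field mentioned above is modeled), and `forward_raw` is an illustrative name. The `mangled` line shows one possible way spacing can be lost (trimming each chunk before joining); whether ThinkingParser.feed() fails in exactly this way is an assumption.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator, List


@dataclass
class Chunk:
    """Hypothetical stand-in for a response yielded by mlx-lm's stream_generate()."""
    text: str  # incremental, already detokenized text


def forward_raw(responses: Iterable[Chunk], parts: List[str]) -> Iterator[str]:
    """Yield each text chunk unchanged and record it for the final full text."""
    for response in responses:
        if not response.text:
            continue  # nothing new to send for this step
        parts.append(response.text)  # keep leading spaces exactly as received
        yield response.text  # raw text, <think> tags included


chunks = [
    Chunk("<think>"), Chunk("check"), Chunk(" spacing"), Chunk("</think>"),
    Chunk("Hello"), Chunk(" world"),
]

parts: List[str] = []
streamed = list(forward_raw(chunks, parts))
full_text = "".join(parts)  # plain concatenation preserves word spacing

# Assumed failure mode, for contrast: trimming each chunk drops the spaces
# that the detokenizer attached to the start of a word.
mangled = "".join(c.text.strip() for c in chunks)
```

Here `full_text` keeps the space in "Hello world" and the thinking tags, while `mangled` runs the words together. Concatenating response.text without any per-chunk processing is what keeps the MTP output identical to what the client already expects from the standard path.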
