[bug] Thinking tokens leak into user-visible output via MTP path #1

Closed
opened 2026-05-03 17:57:53 +02:00 by sleepy · 1 comment
sleepy commented 2026-05-03 17:57:53 +02:00 (Migrated from localhost:18431)

When using the MTP fast path in BatchedEngine.stream_generate(), Qwen3.6 thinking blocks are streamed verbatim to the user instead of being stripped.

The MTP fast path calls mlx-lm's stream_generate() directly, bypassing oMLX's thinking-token handling.

Acceptance criteria:

  • <think> blocks stripped from output
  • Thinking tokens counted separately
  • Output matches standard batching path behavior
sleepy commented 2026-05-03 18:49:03 +02:00 (Migrated from localhost:18431)

Root cause identified:

The standard batching path sends RAW text (with <think> tags intact) to the client. The web UI then uses extractThinking() to separate thinking blocks from content and renders them separately.

The MTP fast path incorrectly tries to strip thinking on the SERVER with ThinkingParser, which:

  1. Breaks client-side thinking block rendering
  2. Causes word spacing issues because ThinkingParser.feed() doesn't handle incremental detokenized text properly
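To illustrate how spacing can get lost on incremental text, here is a minimal sketch. It assumes the streamed segments carry their own leading spaces, and it uses a hypothetical `naive_feed()` that trims each chunk in isolation. This is not the actual `ThinkingParser.feed()` code, only one plausible way this kind of bug can arise.

```python
def naive_feed(chunk: str) -> str:
    """Hypothetical per-chunk cleanup; NOT the real ThinkingParser.feed()."""
    # Trimming each chunk on its own discards the leading space that
    # separates it from the previous word.
    return chunk.strip()


# Assumed shape of incremental detokenized segments: spaces travel with the
# segment that follows them.
chunks = ["Hello", " world", ",", " how", " are", " you", "?"]

broken = "".join(naive_feed(c) for c in chunks)   # spaces lost
correct = "".join(chunks)                          # plain concatenation

print(broken)   # Helloworld,howareyou?
print(correct)  # Hello world, how are you?
```

Plain concatenation of the segments preserves spacing, which is why any per-chunk processing has to leave whitespace untouched.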

Fix needed:

  1. Remove ThinkingParser from MTP fast path
  2. Send raw text (with thinking tags) to client, matching standard path behavior
  3. Fix word spacing by properly accumulating response.text (which is already detokenized incremental text from mlx-lm)
  4. Let client-side extractThinking() handle thinking separation, as it did before MTP
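The fix steps above can be sketched roughly as follows. This is an illustration under stated assumptions, not the oMLX implementation: `FakeResponse` stands in for the objects yielded by mlx-lm's stream_generate() (only the `.text` field is modelled), `forward_raw()` is a hypothetical name for the server-side pass-through, and `extract_thinking()` is a Python approximation of what the client-side `extractThinking()` is described as doing.

```python
import re
from dataclasses import dataclass
from typing import Iterable, Iterator, Tuple


@dataclass
class FakeResponse:
    """Stand-in for a streamed response object; only .text is modelled."""
    text: str


def forward_raw(responses: Iterable[FakeResponse]) -> Iterator[str]:
    """Hypothetical server side: pass each text segment through unchanged."""
    for response in responses:
        if response.text:
            # No stripping and no tag parsing; whitespace and <think> tags survive.
            yield response.text


THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def extract_thinking(full_text: str) -> Tuple[str, str]:
    """Approximation of the client side: split thinking from visible content."""
    thinking = "".join(THINK_RE.findall(full_text))
    content = THINK_RE.sub("", full_text)
    return thinking, content


# Tags may be split across segments, so the client works on accumulated text.
stream = [FakeResponse(t) for t in
          ["<thi", "nk>plan", " steps</th", "ink>", "Hello", " world"]]

full_text = "".join(forward_raw(stream))
thinking, content = extract_thinking(full_text)

print(repr(thinking))  # 'plan steps'
print(repr(content))   # 'Hello world'
```

Because the tags can straddle segment boundaries, separating thinking from content on the accumulated text is simpler than doing it per segment on the server, which is consistent with leaving that job to the client.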
Reference
sleepy/omlx#1