perf: Pipeline decode — overlap GPU commit with next token embedding #47

Open
opened 2026-05-15 19:02:26 +02:00 by sleepy · 0 comments
Owner

Current decode is strictly serial: GPU encode → GPU commit+wait → CPU readback → CPU argmax → next token id → GPU encode (embed lookup).

With GPU argmax (issue #6), the flow becomes: GPU encode → GPU commit → read 4-byte argmax result → GPU encode for next token.

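We can go further: with double-buffered command buffers, the GPU execution of token N can overlap with the CPU-side encoding of token N+1 (embedding lookup + first-layer compute). The schedule: submit CB1; while the GPU executes CB1, encode and submit CB2; wait on CB1 and read its result; encode and submit CB3; wait on CB2; and so on. Note that this presumably requires the embedding lookup in CB2 to take its token id from the GPU buffer written by the argmax in CB1, since the CPU has not read that result yet when it encodes CB2.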
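The submit/wait ordering can be sketched as a small CPU-only simulation. This is only an illustration under stated assumptions, not the project's code: a single-worker thread pool stands in for an in-order GPU queue, each `submit` stands in for "encode + commit" of one command buffer, and `decode_step`, `decode_serial`, `decode_pipelined` and the toy token formula are all hypothetical names invented for the example. The token id is kept in a device-side slot (`state`) so that a step can be submitted before the previous result has been read back.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def decode_step(state):
    """Stand-in for one decode step on the GPU (toy model, not a real LLM).

    Reads the previous token id from the device-side slot and writes the
    next one back, so the CPU never has to supply the id when encoding.
    """
    state["token"] = (state["token"] * 31 + 7) % 1000
    return state["token"]


def decode_serial(first_token, n_tokens):
    """Baseline: run each step and wait for it before starting the next."""
    state = {"token": first_token}
    return [decode_step(state) for _ in range(n_tokens)]


def decode_pipelined(first_token, n_tokens, depth=2):
    """Keep up to `depth` steps in flight: submit CB(k+1), then wait on CB(k)."""
    state = {"token": first_token}
    tokens = []
    in_flight = deque()
    # One worker thread executes submissions in FIFO order, like a single
    # in-order command queue.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        for _ in range(n_tokens):
            in_flight.append(gpu.submit(decode_step, state))  # encode + commit
            if len(in_flight) == depth:
                tokens.append(in_flight.popleft().result())  # wait on oldest
        while in_flight:  # drain what is still in flight
            tokens.append(in_flight.popleft().result())
    return tokens
```

With `depth=2` the call order is submit CB1, submit CB2, wait CB1, submit CB3, wait CB2, and so on, and the output matches the serial loop. One consequence of this ordering is that a stop token is only seen by the CPU after the following step has already been submitted, so up to `depth - 1` extra steps may need to be discarded.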

This hides command-buffer encoding overhead behind GPU execution and could save an estimated 2-5 ms per token.
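A rough way to reason about the possible saving is an idealized steady-state model. This is a sketch under assumptions that are not from the issue: encoding and GPU execution overlap perfectly, readback cost is ignored, and the function names and the numbers in the usage note are hypothetical, not measurements.

```python
def per_token_ms(encode_ms, gpu_ms, pipelined):
    """Idealized steady-state cost of one decode step, in milliseconds."""
    if pipelined:
        # Encoding of step N+1 runs while the GPU executes step N, so the
        # slower of the two stages sets the pace.
        return max(encode_ms, gpu_ms)
    # Serial decode pays for both stages back to back.
    return encode_ms + gpu_ms


def saving_ms(encode_ms, gpu_ms):
    """Per-token time saved by pipelining under this model."""
    serial = per_token_ms(encode_ms, gpu_ms, pipelined=False)
    overlapped = per_token_ms(encode_ms, gpu_ms, pipelined=True)
    return serial - overlapped
```

Under this model the saving equals the shorter of the two stages, so a saving in the 2-5 ms range would correspond to roughly that much encoding time per token being hidden behind a longer GPU step; for example, `saving_ms(3.0, 20.0)` gives `3.0`.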

Reference
sleepy/sleepy-llm#47