perf: Pipeline decode — overlap GPU commit with next token embedding #47
Current decode is strictly serial: GPU encode → GPU commit+wait → CPU readback → CPU argmax → next token id → GPU encode (embed lookup).
With GPU argmax (issue #6), the flow becomes: GPU encode → GPU commit → read 4-byte argmax result → GPU encode for next token.
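We can go further: overlap the GPU commit of token N with the embedding lookup + first-layer compute of token N+1 by using double-buffered command buffers. The order would be: submit CB1; while the GPU executes CB1, encode CB2; submit CB2; wait on CB1 (get result); encode CB3; submit CB3; wait on CB2; and so on. Since token N+1's embedding lookup depends on token N's argmax, this presumably requires the lookup to read the token id from the GPU-side argmax output rather than from a value supplied by the CPU.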
This hides command buffer encoding overhead and could save 2-5ms per token.
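The submit/wait ordering above can be sketched as a small host-side simulation. This is only an illustration under stated assumptions, not the project's decode loop: a single-worker thread pool stands in for a serial GPU queue, a dict stands in for a GPU-resident token-id buffer, and `decode_pipelined`, `next_token`, and `max_in_flight` are hypothetical names. Steps are numbered from 0, so step 0 corresponds to CB1.

```python
from collections import deque
from concurrent.futures import ThreadPoolExecutor


def decode_pipelined(first_token, num_tokens, next_token, max_in_flight=2):
    """Simulate double-buffered decode: keep up to max_in_flight steps submitted."""
    # Assumption: the token id lives in a device-side buffer, so step N+1 can be
    # encoded and submitted before the host has read the result of step N.
    device = {"token": first_token}
    log = []      # host-side event order, for inspection
    tokens = []   # results read back by the host, in order
    in_flight = deque()

    def execute(step):
        # Runs on the single "GPU" worker: read previous id, write the new one.
        device["token"] = next_token(device["token"])
        return device["token"]

    def wait_oldest():
        # Block on the oldest submitted step and record its result.
        step, future = in_flight.popleft()
        tokens.append(future.result())
        log.append(f"wait {step}")

    # One worker means submitted steps execute strictly in submission order.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        for step in range(num_tokens):
            log.append(f"encode {step}")   # host encodes while earlier steps run
            in_flight.append((step, gpu.submit(execute, step)))
            log.append(f"submit {step}")
            if len(in_flight) == max_in_flight:
                wait_oldest()
        while in_flight:                   # drain the tail of the pipeline
            wait_oldest()
    return tokens, log
```

Calling `decode_pipelined(1, 4, lambda t: 2 * t + 1)` with a toy `next_token` yields a log that starts `encode 0, submit 0, encode 1, submit 1, wait 0, encode 2, ...`, which is the same interleaving as the CB1/CB2/CB3 sequence described above. With `max_in_flight=1` the loop degenerates to the serial flow.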