perf: Pipeline decode — overlap GPU commit with next token embedding #47

Open

opened 2026-05-15 19:02:26 +02:00 by sleepy · 0 comments

sleepy commented

2026-05-15 19:02:26 +02:00

Owner

Current decode is strictly serial: GPU encode → GPU commit+wait → CPU readback → CPU argmax → next token id → GPU encode (embed lookup).

With GPU argmax (issue #6), the flow becomes: GPU encode → GPU commit → read 4-byte argmax result → GPU encode for next token.

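We can go further: overlap GPU execution of token N with the encoding of token N+1 (embedding lookup + first-layer compute) by using double-buffered command buffers. The schedule: submit CB1; while the GPU executes CB1, encode and submit CB2; wait on CB1 (get result); encode and submit CB3; wait on CB2; and so on. Because CB2 is encoded before CB1's result is known, the embed lookup for token N+1 would have to read the token id from the GPU argmax output buffer rather than take it from the CPU.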
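The submit/wait ordering above can be sketched with a small host-side simulation. This is only an illustration, not the project's decode loop: the "GPU" is a one-worker in-order queue, and `run_pipelined`, `encode`, and `token_buf` are hypothetical names. The toy arithmetic stands in for the forward pass plus argmax.

```python
from concurrent.futures import ThreadPoolExecutor


def run_pipelined(num_tokens, first_token=1):
    """Simulate double-buffered decode with at most two command buffers in flight."""
    events = []                # host-side log of submit/wait order
    token_buf = [first_token]  # stand-in for the GPU-resident argmax result
    outputs = []

    def encode(step):
        # Hypothetical command buffer. It reads the token id from token_buf
        # when it executes (not when it is encoded), so it can be encoded
        # before the previous buffer has finished.
        def command_buffer():
            tok = token_buf[0]
            nxt = (tok * 31 + 7) % 1000  # toy stand-in for forward pass + argmax
            token_buf[0] = nxt
            return nxt
        return command_buffer

    # One worker models a single in-order GPU queue.
    with ThreadPoolExecutor(max_workers=1) as gpu:
        in_flight = None
        for step in range(num_tokens):
            fut = gpu.submit(encode(step))
            events.append(f"submit CB{step + 1}")
            if in_flight is not None:
                # Only now block on the previous buffer and read its result.
                prev_step, prev_fut = in_flight
                outputs.append(prev_fut.result())
                events.append(f"wait CB{prev_step + 1}")
            in_flight = (step, fut)
        if in_flight is not None:
            outputs.append(in_flight[1].result())
            events.append(f"wait CB{in_flight[0] + 1}")
    return outputs, events
```

For three tokens the event log is `submit CB1, submit CB2, wait CB1, submit CB3, wait CB2, wait CB3`. One consequence of this ordering: by the time the host sees the result of CB N, CB N+1 is already submitted, so a stop condition such as end-of-sequence is noticed one buffer late and that extra buffer's output has to be discarded.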

This hides the command buffer encoding overhead behind GPU execution and could save 2-5 ms per token.
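A rough way to reason about the saving, assuming a steady-state two-deep pipeline and ignoring synchronization cost: serial decode pays encode time plus GPU time per token, while pipelined decode pays only the larger of the two. The function names and the example numbers below are hypothetical, not measurements.

```python
def per_token_ms(encode_ms, gpu_ms, pipelined):
    """Idealized per-token latency; ignores readback and sync overhead."""
    if pipelined:
        # Encoding of token N+1 runs while the GPU executes token N.
        return max(encode_ms, gpu_ms)
    return encode_ms + gpu_ms


def saving_ms(encode_ms, gpu_ms):
    """Saving from pipelining; equals min(encode_ms, gpu_ms) in this model."""
    return per_token_ms(encode_ms, gpu_ms, False) - per_token_ms(encode_ms, gpu_ms, True)
```

Under this model the 2-5 ms estimate holds if encoding takes 2-5 ms per token and GPU execution takes at least that long; the saving can never exceed the encode time itself.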
