Single command buffer decode — full GPU pipeline for all 32 layers #36
Loading…
Reference in a new issue
No description provided.
Delete branch "%!s()"
Deleting a branch is permanent. Although the deleted branch may continue to exist for a short time before it actually gets removed, it CANNOT be undone in most cases. Continue?
Objective
Rewrite forward_decode() to encode ALL 32 layers (24 linear + 8 full attention) into a single MTLCommandBuffer. Zero CPU interaction during decode.
Current problem
Target architecture
Per linear attention layer (seq_len=1):
Per full attention layer (existing GPU path, no changes needed):
GPU buffer requirements
Files to modify
Depends on
Acceptance