Single command buffer decode — full GPU pipeline for all 32 layers #36

New issue

Open

opened 2026-05-15 12:21:00 +02:00 by sleepy · 0 comments

sleepy commented

2026-05-15 12:21:00 +02:00

Owner

Objective

Rewrite forward_decode() to encode ALL 32 layers (24 linear + 8 full attention) into a single MTLCommandBuffer. Zero CPU interaction during decode.

Current problem

24 linear attention layers cause CPU↔GPU sync (commitAndWait per layer)
Each sync adds ~30-50ms of latency
Total: ~1.4 tok/s

Target architecture

Single command buffer per decode step
All layers encoded sequentially with no commits between them
Linear attention state lives in GPU MTLBuffers (not CPU f32 arrays)
Use kernels from #35 (conv1d, delta rule, norm/gate)
Use existing kernels (matmul_bf16_m1_t, rms_norm_bf16, residual_add_bf16, etc.)

Per linear attention layer (seq_len=1):

RMSNorm (input_layernorm)
matmul in_proj_qkv (hidden→conv_dim)
matmul in_proj_z (hidden→value_dim)
matmul in_proj_a (hidden→num_v_heads) — tiny, can use existing matmul
matmul in_proj_b (hidden→num_v_heads) — tiny, can use existing matmul
conv1d_state_bf16 kernel
linear_attn_delta_rule_bf16 kernel
matmul out_proj (value_dim→hidden)
residual_add
RMSNorm (post_attention_layernorm)
MLP: matmul gate + matmul up + swiglu + matmul down
residual_add

Per full attention layer (existing GPU path, no changes needed):

RMSNorm → matmul qkv → RoPE → K/V cache store → decode_attention → matmul o → residual
RMSNorm → MLP (gate+up+swiglu+down) → residual

GPU buffer requirements

recurrent_state: 24 layers × 524288 f32 = ~50MB (move from CPU to GPU)
conv_state: 24 layers × 3*6656 bf16 = ~0.96MB (move from CPU to GPU)
conv1d weight: 24 layers × 6656*4 bf16 = ~1.28MB (already in zero-copy mmap)
a_proj, b_proj: small, use existing temp buffers

Files to modify

src/models/qwen3_5/model.zig — rewrite forward_decode(), add GPU state buffer init
src/metal/dispatch.zig — may need new helper functions

Depends on

Issue #35 (Metal kernels)

Acceptance

Single commitAndWait per decode step (not per layer)
Coherent English output
tok/s improved significantly over 1.4 tok/s baseline
zig build test passes (91/91)
Push branch perf/36-single-cb-decode to Forgejo

## Objective Rewrite forward_decode() to encode ALL 32 layers (24 linear + 8 full attention) into a single MTLCommandBuffer. Zero CPU interaction during decode. ## Current problem - 24 linear attention layers cause CPU↔GPU sync (commitAndWait per layer) - Each sync adds ~30-50ms of latency - Total: ~1.4 tok/s ## Target architecture - Single command buffer per decode step - All layers encoded sequentially with no commits between them - Linear attention state lives in GPU MTLBuffers (not CPU f32 arrays) - Use kernels from #35 (conv1d, delta rule, norm/gate) - Use existing kernels (matmul_bf16_m1_t, rms_norm_bf16, residual_add_bf16, etc.) ## Per linear attention layer (seq_len=1): 1. RMSNorm (input_layernorm) 2. matmul in_proj_qkv (hidden→conv_dim) 3. matmul in_proj_z (hidden→value_dim) 4. matmul in_proj_a (hidden→num_v_heads) — tiny, can use existing matmul 5. matmul in_proj_b (hidden→num_v_heads) — tiny, can use existing matmul 6. conv1d_state_bf16 kernel 7. linear_attn_delta_rule_bf16 kernel 8. matmul out_proj (value_dim→hidden) 9. residual_add 10. RMSNorm (post_attention_layernorm) 11. MLP: matmul gate + matmul up + swiglu + matmul down 12. residual_add ## Per full attention layer (existing GPU path, no changes needed): 1. RMSNorm → matmul qkv → RoPE → K/V cache store → decode_attention → matmul o → residual 2. RMSNorm → MLP (gate+up+swiglu+down) → residual ## GPU buffer requirements - recurrent_state: 24 layers × 524288 f32 = ~50MB (move from CPU to GPU) - conv_state: 24 layers × 3*6656 bf16 = ~0.96MB (move from CPU to GPU) - conv1d weight: 24 layers × 6656*4 bf16 = ~1.28MB (already in zero-copy mmap) - a_proj, b_proj: small, use existing temp buffers ## Files to modify - src/models/qwen3_5/model.zig — rewrite forward_decode(), add GPU state buffer init - src/metal/dispatch.zig — may need new helper functions ## Depends on - Issue #35 (Metal kernels) ## Acceptance - Single commitAndWait per decode step (not per layer) - Coherent English output - tok/s improved significantly over 1.4 tok/s baseline - zig build test passes (91/91) - Push branch perf/36-single-cb-decode to Forgejo

sleepy referenced this issue from a commit

2026-05-15 13:11:45 +02:00

perf: single command buffer decode with full GPU linear attention (#36)