Single command buffer decode — full GPU pipeline for all 32 layers #36

Open
opened 2026-05-15 12:21:00 +02:00 by sleepy · 0 comments
Owner

Objective

Rewrite forward_decode() to encode all 32 layers (24 linear attention + 8 full attention) into a single MTLCommandBuffer, with no CPU↔GPU synchronization between layers within a decode step.

Current problem

  • Each of the 24 linear attention layers forces a CPU↔GPU sync (one commitAndWait per layer)
  • Each sync adds ~30-50 ms of latency
  • Resulting decode throughput: ~1.4 tok/s
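
The throughput figure follows almost directly from the sync cost. A back-of-the-envelope check, assuming the 24 per-layer syncs dominate step time and ignoring kernel execution time (both simplifications, not measurements):

```python
# Rough estimate of decode throughput when per-layer syncs dominate.
# Assumption: kernel execution time and the 8 full attention layers are ignored.
LINEAR_LAYERS = 24
SYNC_MS_LOW, SYNC_MS_HIGH = 30.0, 50.0  # per-sync latency range from this issue


def tokens_per_second(syncs: int, sync_ms: float) -> float:
    """Tokens/s if one decode step costs `syncs` round trips of `sync_ms` each."""
    return 1000.0 / (syncs * sync_ms)


best = tokens_per_second(LINEAR_LAYERS, SYNC_MS_LOW)    # 1000 / 720
worst = tokens_per_second(LINEAR_LAYERS, SYNC_MS_HIGH)  # 1000 / 1200
print(f"{worst:.2f} to {best:.2f} tok/s")  # 0.83 to 1.39 tok/s
```

The upper end of that range, about 1.39 tok/s, lines up with the observed ~1.4 tok/s, which is consistent with sync overhead being the bottleneck.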

Target architecture

  • Single command buffer per decode step
  • All layers encoded sequentially with no commits between them
  • Linear attention state lives in GPU MTLBuffers (not CPU f32 arrays)
  • Use kernels from #35 (conv1d, delta rule, norm/gate)
  • Use existing kernels (matmul_bf16_m1_t, rms_norm_bf16, residual_add_bf16, etc.)
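
The intended control flow can be illustrated with a toy model. This is a sketch only: it makes no Metal calls, and `CommandBuffer`, `encode`, and `commit_and_wait` are hypothetical stand-ins for whatever the dispatch layer in src/metal/dispatch.zig exposes.

```python
# Toy model of command-buffer usage; no Metal calls are made.
# CommandBuffer, encode and commit_and_wait are hypothetical stand-ins.
from dataclasses import dataclass, field


@dataclass
class CommandBuffer:
    ops: list = field(default_factory=list)
    commits: int = 0

    def encode(self, op: str) -> None:
        self.ops.append(op)  # recording only; nothing executes yet

    def commit_and_wait(self) -> None:
        self.commits += 1  # the only CPU<->GPU sync point


def decode_step(layer_kinds: list) -> CommandBuffer:
    cb = CommandBuffer()
    for i, kind in enumerate(layer_kinds):
        cb.encode(f"layer{i}:{kind}")  # no commit between layers
    cb.commit_and_wait()  # single sync per decode step
    return cb


# 24 linear + 8 full attention layers; the real interleaving order is not modelled.
cb = decode_step(["linear"] * 24 + ["full"] * 8)
print(len(cb.ops), cb.commits)  # 32 1
```

A counter like `commits` could also back an automated check that one decode step triggers exactly one sync.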

Per linear attention layer (seq_len=1):

  1. RMSNorm (input_layernorm)
  2. matmul in_proj_qkv (hidden→conv_dim)
  3. matmul in_proj_z (hidden→value_dim)
  4. matmul in_proj_a (hidden→num_v_heads) — tiny, can use existing matmul
  5. matmul in_proj_b (hidden→num_v_heads) — tiny, can use existing matmul
  6. conv1d_state_bf16 kernel
  7. linear_attn_delta_rule_bf16 kernel
  8. matmul out_proj (value_dim→hidden)
  9. residual_add
  10. RMSNorm (post_attention_layernorm)
  11. MLP: matmul gate + matmul up + swiglu + matmul down
  12. residual_add
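
Step 6 at seq_len=1 can be checked against a small CPU reference. The sketch below assumes conv1d_state_bf16 is a depthwise causal convolution with kernel width 4 that keeps the previous 3 inputs per channel as state (inferred from the 3*6656 and 6656*4 buffer shapes in this issue). The weight layout, the absence of bias and activation, and the function name `conv1d_step` are assumptions, not the kernel's confirmed behaviour.

```python
# CPU reference for a single-token depthwise causal conv1d with rolling state.
# Assumptions: kernel width 4, weights laid out as [channel][tap] with the
# newest input on the last tap, no bias, no activation.
def conv1d_step(state, weight, x):
    """state: 3 rows of C floats (oldest first); weight: C rows of 4 taps;
    x: C floats. Returns (y, new_state)."""
    window = state + [x]  # 4 time steps, oldest first
    y = [
        sum(window[t][c] * weight[c][t] for t in range(4))
        for c in range(len(x))
    ]
    return y, window[1:]  # drop the oldest row, keep the newest 3


# Tiny example with C = 2 channels instead of 6656.
state = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
weight = [
    [0.0, 0.0, 0.0, 1.0],  # channel 0: pass through the newest input
    [1.0, 1.0, 1.0, 1.0],  # channel 1: sum over the window
]
y, state = conv1d_step(state, weight, [4.0, 5.0])
print(y)      # [4.0, 5.0]
print(state)  # [[2.0, 0.0], [3.0, 0.0], [4.0, 5.0]]
```

Because the state update happens inside the kernel, the rolling window never has to return to the CPU between decode steps.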

Per full attention layer (existing GPU path, no changes needed):

  1. RMSNorm → matmul qkv → RoPE → K/V cache store → decode_attention → matmul o → residual
  2. RMSNorm → MLP (gate+up+swiglu+down) → residual
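
To get a feel for how much work lands in the single command buffer, the two step lists can be tallied. This assumes every named operation is exactly one kernel dispatch, which may not hold (for example, "matmul qkv" could be fused or split); the op labels are shorthand, not kernel names.

```python
# Count kernel dispatches per decode step from the two step lists above.
# Assumption: every named operation is exactly one dispatch.
LINEAR_LAYER_OPS = [
    "rms_norm", "matmul in_proj_qkv", "matmul in_proj_z", "matmul in_proj_a",
    "matmul in_proj_b", "conv1d_state", "delta_rule", "matmul out_proj",
    "residual_add", "rms_norm", "matmul gate", "matmul up", "swiglu",
    "matmul down", "residual_add",
]
FULL_LAYER_OPS = [
    "rms_norm", "matmul qkv", "rope", "kv_cache_store", "decode_attention",
    "matmul o", "residual_add", "rms_norm", "matmul gate", "matmul up",
    "swiglu", "matmul down", "residual_add",
]

total = 24 * len(LINEAR_LAYER_OPS) + 8 * len(FULL_LAYER_OPS)
print(len(LINEAR_LAYER_OPS), len(FULL_LAYER_OPS), total)  # 15 13 464
```

Under that assumption one decode step encodes 464 dispatches across the 32 layers, not counting anything outside the layer stack, which this issue does not list.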

GPU buffer requirements

  • recurrent_state: 24 layers × 524288 f32 = ~50 MB (move from CPU to GPU)
  • conv_state: 24 layers × 3*6656 bf16 = ~0.96 MB (move from CPU to GPU)
  • conv1d weight: 24 layers × 6656*4 bf16 = ~1.28 MB (already in zero-copy mmap)
  • a_proj, b_proj: small, use existing temp buffers
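
The sizes above can be reproduced from the element counts, taking 4 bytes per f32 and 2 bytes per bf16 and reporting decimal megabytes:

```python
# Recompute the buffer sizes listed above from element counts.
BYTES_F32, BYTES_BF16 = 4, 2
LAYERS = 24

buffers = {
    "recurrent_state": LAYERS * 524288 * BYTES_F32,
    "conv_state": LAYERS * 3 * 6656 * BYTES_BF16,
    "conv1d_weight": LAYERS * 6656 * 4 * BYTES_BF16,
}

for name, size in buffers.items():
    print(f"{name}: {size} bytes = {size / 1e6:.2f} MB")
# recurrent_state: 50331648 bytes = 50.33 MB
# conv_state: 958464 bytes = 0.96 MB
# conv1d_weight: 1277952 bytes = 1.28 MB
```

The recurrent state dominates: it is exactly 48 MiB, while the two convolution buffers together stay under 2.3 MB.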

Files to modify

  • src/models/qwen3_5/model.zig — rewrite forward_decode(), add GPU state buffer init
  • src/metal/dispatch.zig — may need new helper functions

Depends on

  • Issue #35 (Metal kernels)

Acceptance

  • Single commitAndWait per decode step (not per layer)
  • Coherent English output
  • tok/s improved significantly over 1.4 tok/s baseline
  • zig build test passes (91/91)
  • Push branch perf/36-single-cb-decode to Forgejo
Reference
sleepy/sleepy-llm#36