Move linear attention layers to GPU (24/32 layers on CPU) #33

Closed
opened 2026-05-15 00:50:48 +02:00 by sleepy · 1 comment
Owner

Problem

24 of 32 Qwen3.5 layers are linear attention, currently running on CPU during decode. Each decode step does GPU→CPU→GPU roundtrip for these layers. This is the main speed bottleneck.

Current: ~0.6 tok/s decode (linear attention on CPU)
Target: 37+ tok/s (need GPU linear attention)

LinearAttentionLayer.forward_gpu() already exists and handles seq_len=1 with persistent GPU state buffers. Needs wiring into the decode loop.

Acceptance

  • Linear attention layers run entirely on GPU during decode
  • No bf16→f32→bf16 conversion for linear layers
  • Persistent GPU state buffers (conv_state, recurrent state) between decode steps
  • Decode speed > 5 tok/s

Max 2 attempts.

## Problem 24 of 32 Qwen3.5 layers are linear attention, currently running on CPU during decode. Each decode step does GPU→CPU→GPU roundtrip for these layers. This is the main speed bottleneck. Current: ~0.6 tok/s decode (linear attention on CPU) Target: 37+ tok/s (need GPU linear attention) LinearAttentionLayer.forward_gpu() already exists and handles seq_len=1 with persistent GPU state buffers. Needs wiring into the decode loop. ## Acceptance - Linear attention layers run entirely on GPU during decode - No bf16→f32→bf16 conversion for linear layers - Persistent GPU state buffers (conv_state, recurrent state) between decode steps - Decode speed > 5 tok/s Max 2 attempts.
Author
Owner

Merged hybrid GPU+CPU linear attention (85b7a77). GPU matmuls for the 3 big projections (37M ops/layer), CPU for state management. ~2x speedup overall (0.67→1.4 tok/s). Full GPU linear attention deferred — the remaining CPU state management is now the bottleneck.

Merged hybrid GPU+CPU linear attention (85b7a77). GPU matmuls for the 3 big projections (37M ops/layer), CPU for state management. ~2x speedup overall (0.67→1.4 tok/s). Full GPU linear attention deferred — the remaining CPU state management is now the bottleneck.
Sign in to join this conversation.
No milestone
No project
No assignees
1 participant
Notifications
Due date
The due date is invalid or out of range. Please use the format "yyyy-mm-dd".

No due date set.

Dependencies

No dependencies set.

Reference
sleepy/sleepy-llm#33
No description provided.