Move linear attention layers to GPU (24/32 layers on CPU) #33

New issue

Closed

opened 2026-05-15 00:50:48 +02:00 by sleepy · 1 comment

sleepy commented

2026-05-15 00:50:48 +02:00

Owner

Problem

24 of 32 Qwen3.5 layers are linear attention, currently running on CPU during decode. Each decode step does GPU→CPU→GPU roundtrip for these layers. This is the main speed bottleneck.

Current: ~0.6 tok/s decode (linear attention on CPU)
Target: 37+ tok/s (need GPU linear attention)

LinearAttentionLayer.forward_gpu() already exists and handles seq_len=1 with persistent GPU state buffers. Needs wiring into the decode loop.

Acceptance

Linear attention layers run entirely on GPU during decode
No bf16→f32→bf16 conversion for linear layers
Persistent GPU state buffers (conv_state, recurrent state) between decode steps
Decode speed > 5 tok/s

Max 2 attempts.

## Problem 24 of 32 Qwen3.5 layers are linear attention, currently running on CPU during decode. Each decode step does GPU→CPU→GPU roundtrip for these layers. This is the main speed bottleneck. Current: ~0.6 tok/s decode (linear attention on CPU) Target: 37+ tok/s (need GPU linear attention) LinearAttentionLayer.forward_gpu() already exists and handles seq_len=1 with persistent GPU state buffers. Needs wiring into the decode loop. ## Acceptance - Linear attention layers run entirely on GPU during decode - No bf16→f32→bf16 conversion for linear layers - Persistent GPU state buffers (conv_state, recurrent state) between decode steps - Decode speed > 5 tok/s Max 2 attempts.

sleepy referenced this issue

2026-05-15 00:50:48 +02:00

Reach 37+ tok/s decode target (match llama.cpp) #34

sleepy referenced this issue from a commit

2026-05-15 04:00:31 +02:00

perf: GPU matmuls for linear attention decode (#33)

sleepy commented

2026-05-15 04:03:29 +02:00

Author

Owner

Merged hybrid GPU+CPU linear attention (85b7a77). GPU matmuls for the 3 big projections (37M ops/layer), CPU for state management. ~2x speedup overall (0.67→1.4 tok/s). Full GPU linear attention deferred — the remaining CPU state management is now the bottleneck.

Merged hybrid GPU+CPU linear attention (85b7a77). GPU matmuls for the 3 big projections (37M ops/layer), CPU for state management. ~2x speedup overall (0.67→1.4 tok/s). Full GPU linear attention deferred — the remaining CPU state management is now the bottleneck.

sleepy closed this issue

2026-05-15 04:03:29 +02:00