feat: GPU-accelerated prefill path #45

Open
opened 2026-05-15 19:02:26 +02:00 by sleepy · 0 comments
Owner

Prefill currently runs on the CPU (forward_cpu), which is slow for long prompts.

Plan:

  1. Move prefill to the GPU using the batched matmul kernels (matmul_bf16_batch, 2D tiled)
  2. Add batched attention for the full-attention layers (SDPA with seq_len > 1)
  3. Keep linear-attention prefill on the CPU (forward_with_state) until conv1d + delta rule can be ported to batched GPU kernels
  4. Expected speedup: roughly 10-50x for long prompts
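
For step 2, a CPU reference might be handy for validating the batched kernel. The sketch below is plain Python and not taken from the repo: the names causal_sdpa and decode_step are hypothetical, it handles a single head, and it assumes the full-attention layers use a causal mask (the issue does not say so explicitly). The idea is that batched SDPA over the whole prompt should give the same rows as running the existing seq_len == 1 path once per token against a growing KV cache.

```python
import math


def _softmax_weighted_sum(scores, values):
    """Numerically stable softmax over `scores`, applied to rows of `values`."""
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    head_dim = len(values[0])
    return [
        sum(w * values[j][d] for j, w in enumerate(weights)) / total
        for d in range(head_dim)
    ]


def causal_sdpa(q, k, v):
    """Hypothetical reference for prefill attention (seq_len > 1), one head.

    q, k, v are lists of seq_len vectors, each a list of head_dim floats.
    Assumes a causal mask: position i attends only to positions 0..i.
    """
    scale = 1.0 / math.sqrt(len(q[0]))
    out = []
    for i in range(len(q)):
        scores = [
            scale * sum(a * b for a, b in zip(q[i], k[j]))
            for j in range(i + 1)  # causal mask: skip positions after i
        ]
        out.append(_softmax_weighted_sum(scores, v[: i + 1]))
    return out


def decode_step(q_t, k_cache, v_cache):
    """Hypothetical single-token path: one query against the whole KV cache."""
    scale = 1.0 / math.sqrt(len(q_t))
    scores = [scale * sum(a * b for a, b in zip(q_t, k_j)) for k_j in k_cache]
    return _softmax_weighted_sum(scores, v_cache)


def prefill_matches_decode(q, k, v, tol=1e-9):
    """True if batched prefill agrees with token-by-token decode on every row."""
    batched = causal_sdpa(q, k, v)
    for t in range(len(q)):
        step = decode_step(q[t], k[: t + 1], v[: t + 1])
        if any(abs(a - b) > tol for a, b in zip(batched[t], step)):
            return False
    return True
```

The same comparison could be run against the GPU output once it exists, with a looser tolerance to allow for bf16 rounding.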
Reference
sleepy/sleepy-llm#45