feat: GPU-accelerated prefill path #45


Open

opened 2026-05-15 19:02:26 +02:00 by sleepy · 0 comments

sleepy commented

2026-05-15 19:02:26 +02:00

Owner

Prefill currently runs on the CPU (forward_cpu), which is slow for long prompts.

Plan:

1. Move prefill to the GPU using the batched matmul kernels (matmul_bf16_batch with the 2D tiled approach)
2. Batched attention for full attention layers (SDPA with seq_len > 1)
3. Linear attention prefill stays on the CPU (forward_with_state) until we can port conv1d + delta rule to batched GPU
4. Expected speedup: 10-50x for long prompts
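
For the batched attention step, a small CPU reference may help when validating the GPU kernel's output. The sketch below is only an illustration: it assumes a causal decoder, a single head, f32 data, and row-major [seq_len, head_dim] layout. The name causal_sdpa_ref is hypothetical and not part of the existing code.

```rust
/// Reference causal scaled dot-product attention for a single head.
/// Hypothetical helper for checking a batched GPU kernel; the layout
/// (row-major [seq_len, head_dim]) and f32 precision are assumptions.
fn causal_sdpa_ref(
    q: &[f32],
    k: &[f32],
    v: &[f32],
    seq_len: usize,
    head_dim: usize,
) -> Vec<f32> {
    assert_eq!(q.len(), seq_len * head_dim);
    assert_eq!(k.len(), seq_len * head_dim);
    assert_eq!(v.len(), seq_len * head_dim);

    let scale = 1.0 / (head_dim as f32).sqrt();
    let mut out = vec![0.0f32; seq_len * head_dim];
    let mut scores = vec![0.0f32; seq_len];

    for i in 0..seq_len {
        let qi = &q[i * head_dim..(i + 1) * head_dim];

        // Causal mask: query i attends only to keys 0..=i.
        let mut max_s = f32::NEG_INFINITY;
        for j in 0..=i {
            let kj = &k[j * head_dim..(j + 1) * head_dim];
            let s = qi.iter().zip(kj).map(|(a, b)| a * b).sum::<f32>() * scale;
            scores[j] = s;
            max_s = max_s.max(s);
        }

        // Numerically stable softmax over the visible keys.
        let mut denom = 0.0f32;
        for s in scores[..=i].iter_mut() {
            *s = (*s - max_s).exp();
            denom += *s;
        }

        // Weighted sum of the value rows.
        let oi = &mut out[i * head_dim..(i + 1) * head_dim];
        for j in 0..=i {
            let w = scores[j] / denom;
            let vj = &v[j * head_dim..(j + 1) * head_dim];
            for (o, x) in oi.iter_mut().zip(vj) {
                *o += w * x;
            }
        }
    }
    out
}
```

With seq_len = 1 this reduces to copying the single value row, which matches the decode case. The GPU path would process all query rows in one launch instead of looping, and a bf16 kernel should be compared against this reference with a tolerance rather than exact equality.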
