perf: Quantized weight loading (4-bit) for 4x bandwidth reduction #39

Open
opened 2026-05-15 19:02:26 +02:00 by sleepy · 0 comments
Owner

Each decode step reads ~7.6 GB of BF16 weights. At 410 GB/s, the theoretical floor is ~18.5 ms per token (~54 tok/s).

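With 4-bit quantization (Q4_K or similar), the weights shrink to ~1.9 GB and the theoretical floor drops to ~4.6 ms per token (~217 tok/s). Formats such as Q4_K also store per-block scales and minimums, so the real footprint will be somewhat above the ideal 4 bits per weight.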
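
For reference, the floor estimates above can be reproduced with a few lines of Python. This is only a back-of-the-envelope sketch: it assumes decode is purely weight-bandwidth bound, derives the parameter count from the ~7.6 GB BF16 figure, and the 4.5 bits-per-weight row is an assumed allowance for per-block scale overhead, not a measured number. `decode_floor` is a hypothetical helper name.

```python
GB = 1e9

def decode_floor(n_params, bits_per_weight, bandwidth_bytes_per_s):
    """Return (ms per token, tokens per second) if decode is purely weight-bandwidth bound."""
    weight_bytes = n_params * bits_per_weight / 8
    seconds = weight_bytes / bandwidth_bytes_per_s
    return seconds * 1e3, 1.0 / seconds

N_PARAMS = 7.6 * GB / 2   # ~7.6 GB of BF16 at 2 bytes per weight
BANDWIDTH = 410 * GB      # 410 GB/s, as quoted in this issue

# The 4.5-bit row is an assumption standing in for scale/zero-point overhead.
for label, bits in [("BF16", 16), ("ideal 4-bit", 4), ("4.5 bits (assumed overhead)", 4.5)]:
    ms, tps = decode_floor(N_PARAMS, bits, BANDWIDTH)
    print(f"{label}: {ms:.1f} ms/token, {tps:.0f} tok/s")
```

The BF16 and ideal 4-bit rows come out at roughly 18.5 ms and 4.6 ms per token; with the assumed 4.5 bits per weight the floor rises to roughly 5.2 ms.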

Implementation plan:

  1. Support loading GGUF quantized weights.
  2. Add a GPU dequantization kernel inside the matmul: read 4-bit weights, dequantize to BF16 in registers/threadgroup memory, then multiply.
  3. Lay out weights for tile-friendly access (interleave scales/zeros with weight blocks).

This is the nuclear option if MPP matmul2d doesn't reach 37 tok/s with BF16.
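
A CPU reference for the dequantize-then-multiply step could look roughly like the sketch below; something like it might be useful for checking a GPU kernel's output. It is not the Q4_K layout: the block size of 32, the per-block scale plus minimum, the low-nibble-first packing, and all function names are assumptions made for illustration. Keeping the scale and minimum next to each block's packed bytes mirrors the interleaved layout idea from the plan.

```python
BLOCK = 32  # weights per quantization block (assumed; not the Q4_K layout)

def quantize_block(weights):
    """Quantize BLOCK floats to 4-bit codes with one scale and one minimum."""
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / 15 or 1.0  # fall back to 1.0 for constant blocks
    codes = [min(15, max(0, round((w - lo) / scale))) for w in weights]
    # Pack two 4-bit codes per byte: even index in the low nibble.
    packed = bytes(codes[i] | (codes[i + 1] << 4) for i in range(0, BLOCK, 2))
    return scale, lo, packed

def dequantize_block(scale, lo, packed):
    """Expand BLOCK // 2 packed bytes back to BLOCK floats."""
    out = []
    for byte in packed:
        out.append(lo + scale * (byte & 0x0F))  # low nibble
        out.append(lo + scale * (byte >> 4))    # high nibble
    return out

def quantized_dot(blocks, x):
    """Dot product of one quantized weight row with x, dequantizing block by block."""
    acc = 0.0
    for b, (scale, lo, packed) in enumerate(blocks):
        w = dequantize_block(scale, lo, packed)
        xs = x[b * BLOCK:(b + 1) * BLOCK]
        acc += sum(wi * xi for wi, xi in zip(w, xs))
    return acc

# Usage: quantize a 64-element row and compare against the exact dot product.
row = [((i * 37) % 101) / 50 - 1 for i in range(2 * BLOCK)]
x = [((i * 53) % 97) / 48 - 1 for i in range(2 * BLOCK)]
blocks = [quantize_block(row[i:i + BLOCK]) for i in range(0, len(row), BLOCK)]
exact = sum(w * xi for w, xi in zip(row, x))
print(f"exact={exact:.4f} quantized={quantized_dot(blocks, x):.4f}")
```

Each block here costs 16 bytes of packed codes plus its scale and minimum, which is the overhead that pushes real formats above 4 bits per weight.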

Priority: Low until we've exhausted BF16 bandwidth optimizations.

Reference
sleepy/sleepy-llm#39