perf: Quantized weight loading (4-bit) for 4x bandwidth reduction #39

Open

opened 2026-05-15 19:02:26 +02:00 by sleepy · 0 comments

sleepy commented

2026-05-15 19:02:26 +02:00

Owner

With BF16 weights, each decode step reads ~7.6 GB. At 410 GB/s of memory bandwidth, the theoretical floor is ~18.5 ms per token (~54 tok/s).

With 4-bit quantization (Q4_K or similar), the weights shrink to ~1.9 GB and the theoretical floor drops to ~4.6 ms per token (~216 tok/s), before counting per-block scale overhead.
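For reference, the arithmetic behind these floors, as a rough sketch. The 7.6 GB and 410 GB/s figures are the ones quoted above; `decode_floor` is just a helper name for this note. The last case assumes an effective ~4.5 bits per weight for Q4_K (256-weight super-blocks stored in 144 bytes), which is my reading of the format rather than something measured here.

```python
def decode_floor(weight_gb: float, bandwidth_gb_s: float) -> tuple[float, float]:
    """Return (ms per token, tokens per second) if decode were purely bandwidth-bound."""
    seconds = weight_gb / bandwidth_gb_s
    return seconds * 1e3, 1.0 / seconds


BF16_GB = 7.6      # bytes read per decode step with BF16 weights (from this issue)
BANDWIDTH = 410.0  # GB/s (from this issue)

bf16_ms, bf16_tps = decode_floor(BF16_GB, BANDWIDTH)
# Pure 4-bit payload: 4/16 of the BF16 size, no scale/min overhead.
q4_ms, q4_tps = decode_floor(BF16_GB * 4 / 16, BANDWIDTH)
# Assumption: ~4.5 bits per weight once per-block scales/mins are included.
q4k_ms, q4k_tps = decode_floor(BF16_GB * 4.5 / 16, BANDWIDTH)

for name, ms, tps in [("bf16", bf16_ms, bf16_tps),
                      ("4-bit", q4_ms, q4_tps),
                      ("4.5-bit", q4k_ms, q4k_tps)]:
    print(f"{name:8s} {ms:6.2f} ms/token  {tps:6.1f} tok/s")
```

Under that 4.5-bit assumption the floor lands nearer 190 tok/s than 216 tok/s, so the real reduction would be closer to 3.6x than 4x. Either way it is well above the 37 tok/s target.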

Implementation plan:

1. Support loading GGUF quantized weights.
2. Add GPU dequantization inside the matmul kernel: read 4-bit weights, dequantize to BF16 in registers/threadgroup memory, multiply.
3. Lay out weights for tile-friendly access (interleave scales/zeros with weight blocks).
4. This is the nuclear option if MPP matmul2d doesn't reach 37 tok/s with BF16.
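A CPU-side sketch of the "dequantize, then multiply" step with scales and zeros interleaved into the weight blocks, mainly useful as a reference to check a GPU kernel against. The block format here is hypothetical and deliberately simpler than Q4_K: 32 weights per block, stored as an fp16 scale, an fp16 minimum, and 16 bytes holding two 4-bit codes each. `quantize_block` and `dequant_dot` are illustrative names, not existing functions in this repo.

```python
import struct

BLOCK = 32                    # weights per block (hypothetical, not the Q4_K layout)
BLOCK_BYTES = 4 + BLOCK // 2  # fp16 scale + fp16 min + 16 bytes of packed nibbles


def quantize_block(w):
    """Pack 32 floats as [scale:f16][min:f16][16 bytes, two 4-bit codes per byte]."""
    lo, hi = min(w), max(w)
    # Round-trip through fp16 so the encoder sees the same values the decoder will.
    scale, lo = struct.unpack("<2e", struct.pack("<2e", (hi - lo) / 15, lo))
    if scale == 0.0:
        scale = 1.0  # constant block: every code is 0 and the min carries the value
    q = [max(0, min(15, round((x - lo) / scale))) for x in w]
    nibbles = bytes(q[i] | (q[i + 1] << 4) for i in range(0, BLOCK, 2))
    return struct.pack("<2e", scale, lo) + nibbles


def dequant_dot(packed, x):
    """Dot product of one quantized row with x, dequantizing block by block."""
    acc = 0.0
    for b in range(len(x) // BLOCK):
        off = b * BLOCK_BYTES
        # Scale and min sit right in front of their nibbles (interleaved layout).
        scale, lo = struct.unpack_from("<2e", packed, off)
        for j, byte in enumerate(packed[off + 4: off + BLOCK_BYTES]):
            i = b * BLOCK + 2 * j
            acc += (scale * (byte & 0xF) + lo) * x[i]      # low nibble: even index
            acc += (scale * (byte >> 4) + lo) * x[i + 1]   # high nibble: odd index
    return acc


# Two blocks of synthetic weights and activations, all within [-1, 1].
row = [((i * 37) % 101) / 50.0 - 1.0 for i in range(64)]
x = [((i * 13) % 17) / 8.0 - 1.0 for i in range(64)]
packed = b"".join(quantize_block(row[i:i + BLOCK]) for i in range(0, 64, BLOCK))
exact = sum(a * b for a, b in zip(row, x))
approx = dequant_dot(packed, x)
print(len(packed), "bytes for", len(row), "weights; abs error", abs(exact - approx))
```

This toy layout costs 5 bits per weight (20 bytes per 32 weights). A real kernel would decode a whole tile per threadgroup rather than one row at a time, and the nibble order would have to match whatever the GGUF loader produces.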

Priority: Low until we've exhausted BF16 bandwidth optimizations.

sleepy referenced this issue

2026-05-18 01:02:55 +02:00

perf: Reach llama.cpp/MLX decode parity (37 tok/s target) #53

sleepy referenced this issue

2026-05-18 01:55:21 +02:00

perf: Reach llama.cpp/MLX decode parity (37 tok/s target) #53