perf: Quantized weight loading (4-bit) for 4x bandwidth reduction #39
With BF16 weights, each decode step reads ~7.6 GB from memory. At 410 GB/s, the theoretical floor is ~18.5 ms per token (~54 tok/s).
With 4-bit quantization (Q4_K or similar), the weights shrink to ~1.9 GB and the theoretical floor drops to ~4.6 ms per token (~216 tok/s).
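The floor estimates above are just weight bytes divided by memory bandwidth. Here is a minimal sketch of that arithmetic, assuming decode is purely bandwidth-bound and every weight is read exactly once per step. The function name `decode_floor` and its `bits_per_weight` parameter are hypothetical and used only for illustration. Formats that store per-block scales come out slightly above 4 bits per weight, so the parameter is left adjustable.

```python
# Sketch only: assumes a purely bandwidth-bound decode step in which
# every weight is read exactly once. Names here are illustrative.

BF16_WEIGHTS_GB = 7.6    # weight bytes read per decode step in BF16 (from the issue)
BANDWIDTH_GB_S = 410.0   # memory bandwidth (from the issue)


def decode_floor(bits_per_weight: float,
                 bf16_weights_gb: float = BF16_WEIGHTS_GB,
                 bandwidth_gb_s: float = BANDWIDTH_GB_S) -> tuple[float, float]:
    """Return (milliseconds per step, tokens per second) at the bandwidth floor."""
    # Scale the BF16 footprint by the ratio of bits per weight (BF16 is 16 bits).
    weights_gb = bf16_weights_gb * bits_per_weight / 16.0
    seconds = weights_gb / bandwidth_gb_s
    return seconds * 1e3, 1.0 / seconds


if __name__ == "__main__":
    for label, bits in [("BF16", 16.0), ("4-bit", 4.0)]:
        ms, tok_s = decode_floor(bits)
        print(f"{label}: {ms:.1f} ms/step, {tok_s:.0f} tok/s")
```

Passing a value such as `4.5` for `bits_per_weight` gives a rough idea of how much scale and metadata overhead eats into the ideal 4x reduction.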
Implementation plan:
Priority: Low until we've exhausted BF16 bandwidth optimizations.