fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)
Closing this PR. The nibble extraction is fundamentally broken: only 4 of the 8 nibbles in each uint32_t are extracted, with duplication. Will rewrite using MLX qmv_fast_impl as the reference. Re-opening #29 for a proper implementation.
[perf] achieve MLX generation t/s — 22 t/s on 27B Q4_0 (M4 Max)
fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)
REJECTED: Nibble extraction is incorrect. The masks (0x0F, 0xF00, 0xF000, 0xF0000) only extract 4 of 8 nibbles per uint32_t and do not shift to LSB. Each uint32_t holds 8 nibbles. Fix: extract all…
fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)
REJECTED: Contiguous kernel has incorrect nibble extraction. Only 4 nibbles per uint32_t are extracted (masks 0x0F, 0xF00, 0xF000, 0xF0000) instead of all 8. Nibbles are not shifted to LSB before…
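To make the review comments above concrete, here is a minimal host-side C++ sketch, not the Metal kernel itself. The helper names `unpack_nibbles` and `masked_without_shift` are hypothetical, and the assumption that nibble i occupies bits 4*i to 4*i+3 of the word is for illustration only; the actual Q4_0 contiguous layout may order nibbles differently.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: unpack all eight 4-bit values from one 32-bit word.
// Assumption: nibble i lives in bits [4*i, 4*i+3]. Each value is shifted
// down to the least significant bits before masking, so it lands in 0..15.
inline std::array<std::uint8_t, 8> unpack_nibbles(std::uint32_t word) {
    std::array<std::uint8_t, 8> out{};
    for (int i = 0; i < 8; ++i) {
        out[i] = static_cast<std::uint8_t>((word >> (4 * i)) & 0xFu);
    }
    return out;
}

// The pattern the review rejects: the four masks quoted in the comments,
// applied without any shift. Only four nibbles are touched, and three of
// the results stay in their original bit positions instead of 0..15.
inline std::array<std::uint32_t, 4> masked_without_shift(std::uint32_t word) {
    return {word & 0x0Fu, word & 0xF00u, word & 0xF000u, word & 0xF0000u};
}
```

With the word 0x87654321, `unpack_nibbles` yields the eight values 1 through 8, while `masked_without_shift` yields 0x1, 0x300, 0x4000 and 0x50000, which covers nibbles 0, 2, 3 and 4 only and leaves all but the first unshifted.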
[metal] wire contiguous Q4_0 kernel into dispatch (#29)
[metal] extend bin op fusion to MUL/SUB/DIV chains (#28)
Merged via squash. Coherence test passed (token output byte-identical to master).
IQ4_XS tg4096 anomaly (45 vs 76 tok/s on 4B)
Eliminate zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE)
Reduce GPU dispatch count (1151 per tick)
[metal] extend bin op fusion to MUL/SUB/DIV chains (#28)