fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29) #39
Summary
Fix correctness bug in the kernel_mul_mv_q4_0_f32_c contiguous Q4_0 kernel.

Fix
- Extract all 8 nibbles per uint32_t as (qs >> (4*j)) & 0xF for j=0..7
- Use the il/8 offset to select the correct uint32_t pair (qs[il/8] and qs[il/8+2])

Verification
Benchmark (Qwen3.5-4B-Q4_0, M4 Max, 1 thread)
REJECTED: Contiguous kernel has incorrect nibble extraction. Only 4 nibbles per uint32_t are extracted (masks 0x0F, 0xF00, 0xF000, 0xF0000) instead of all 8. Nibbles are not shifted to LSB before multiplication. This produces garbage output. Rewrite the unpack logic to extract all 8 nibbles per uint32_t with proper shifts (>>0, >>4, >>8, >>12, >>16, >>20, >>24, >>28) and AND 0xF.
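For illustration, the unpack the review asks for can be sketched on the host side in plain C++. This is a minimal sketch, not the kernel code: the function names are hypothetical, and it assumes nibble j occupies bits 4*j through 4*j+3 of the word, which may not match the element ordering the real kernel needs.

```cpp
#include <array>
#include <cstdint>

// Hypothetical helper: unpack all eight 4-bit values from one 32-bit word,
// lowest nibble first. Assumes nibble j sits in bits [4*j, 4*j+3].
inline std::array<uint8_t, 8> unpack_nibbles(uint32_t qs) {
    std::array<uint8_t, 8> out{};
    for (int j = 0; j < 8; ++j) {
        // Shift the nibble down to the least significant bits, then mask.
        out[j] = static_cast<uint8_t>((qs >> (4 * j)) & 0xFu);
    }
    return out;
}

// The pattern the review objects to: masking without shifting. The result
// keeps the nibble's positional weight (a multiple of 256 for this mask)
// instead of a value in the range 0..15.
inline uint32_t masked_without_shift(uint32_t qs) {
    return qs & 0xF00u;
}
```

With the word 0x76543210, the shifted form yields the values 0 through 7 in order, while the mask-only form of nibble 2 yields 0x200 rather than 2, which is the scaling error described above.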
REJECTED: Nibble extraction is incorrect. The masks (0x0F, 0xF00, 0xF000, 0xF0000) only extract 4 of 8 nibbles per uint32_t and do not shift to LSB. Each uint32_t holds 8 nibbles. Fix: extract all 8 nibbles with (qs[i] >> (4*j)) & 0xF for j=0..7, or compare against MLX qmv_fast_impl unpack logic. Also verify scale/delta application matches the strided kernel.
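One way to check scale application is against a scalar reference. The sketch below assumes the standard ggml Q4_0 block layout (32 values per block, one scale d, each 4-bit value offset by 8, low nibbles holding the first 16 elements and high nibbles the last 16). The struct and function names are hypothetical, and the scale is stored as float here for portability where the real format uses a half.

```cpp
#include <cstdint>

constexpr int QK4_0 = 32;  // values per Q4_0 block

// Hypothetical host-side mirror of a Q4_0 block (scale widened to float).
struct BlockQ4_0Ref {
    float d;                 // per-block scale
    uint8_t qs[QK4_0 / 2];   // 32 packed 4-bit values
};

// Dequantize one block: value = d * (q - 8).
inline void dequantize_block_ref(const BlockQ4_0Ref& b, float* y) {
    for (int j = 0; j < QK4_0 / 2; ++j) {
        const int lo = (b.qs[j] & 0x0F) - 8;  // element j
        const int hi = (b.qs[j] >> 4) - 8;    // element j + 16
        y[j] = lo * b.d;
        y[j + QK4_0 / 2] = hi * b.d;
    }
}

// Reference dot product of one dequantized block with 32 floats.
inline float dot_block_ref(const BlockQ4_0Ref& b, const float* x) {
    float y[QK4_0];
    dequantize_block_ref(b, y);
    float sum = 0.0f;
    for (int i = 0; i < QK4_0; ++i) {
        sum += y[i] * x[i];
    }
    return sum;
}
```

Comparing a kernel's per-block output against dot_block_ref on a few hand-built blocks would expose both a missing shift and a misapplied scale, since either changes the sum.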
Title changed from "[metal] wire contiguous Q4_0 kernel into dispatch (#29)" to "fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)"