2025-10-30 - 2026-04-30
Overview
1 Pull request merged by 1 user
Merged
#38 [metal] extend bin op fusion to MUL/SUB/DIV chains (#28)
1 Pull request proposed by 1 user
Proposed
#39 [metal] wire contiguous Q4_0 kernel into dispatch (#29)
3 Issues closed by 1 user
Closed
#33 IQ4_XS tg4096 anomaly (45 vs 76 tok/s on 4B)
#27 Eliminate zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE)
#28 Reduce GPU dispatch count (1151 per tick)
11 Issues created by 1 user
Opened
#27 Eliminate zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE)
#28 Reduce GPU dispatch count (1151 per tick)
#29 Port contiguous weight reads to Q4_0 MUL_MAT kernel
#30 Investigate GET_ROWS overhead (678 MB/tick at 9B)
#31 Investigate CPY overhead (159 MB/tick at 9B)
#32 KV cache IO scaling with context length
#33 IQ4_XS tg4096 anomaly (45 vs 76 tok/s on 4B)
#34 Profile graph fusion effectiveness
#35 Profile concurrent encoding effectiveness
#36 Compare llama.cpp and MLX dispatch structure
#37 Implement MXFP4 GGUF converter