• Joined on 2026-04-30
sleepy closed pull request sleepy/llama.cpp#39 2026-05-01 00:55:59 +02:00
fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)
sleepy commented on pull request sleepy/llama.cpp#39 2026-05-01 00:55:59 +02:00
fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)

Closing PR: nibble extraction is fundamentally broken (only 4 of 8 nibbles per uint32_t are extracted, with duplication). Will rewrite using MLX's qmv_fast_impl as the reference. Re-opening #29 for a proper implementation.

sleepy pushed to master at sleepy/llama.cpp 2026-05-01 00:44:45 +02:00
757ef4de97 [docs] add coherence tests, MLX benchmarking, onboarding, Gitea API
sleepy opened issue sleepy/llama.cpp#40 2026-05-01 00:24:09 +02:00
[perf] achieve MLX generation t/s — 22 t/s on 27B Q4_0 (M4 Max)
sleepy pushed to fix/29-q40-contig-reads at sleepy/llama.cpp 2026-05-01 00:14:12 +02:00
31ce8b1ae5 fix(metal): correct Q4_0 contiguous kernel nibble extraction
sleepy commented on pull request sleepy/llama.cpp#39 2026-04-30 22:42:17 +02:00
fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)

REJECTED: Nibble extraction is incorrect. The masks (0x0F, 0xF00, 0xF000, 0xF0000) extract only 4 of the 8 nibbles per uint32_t and do not shift them to the LSB. Each uint32_t holds 8 nibbles. Fix: extract all…

sleepy commented on pull request sleepy/llama.cpp#39 2026-04-30 22:41:51 +02:00
fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)

REJECTED: Contiguous kernel has incorrect nibble extraction. Only 4 nibbles per uint32_t are extracted (masks 0x0F, 0xF00, 0xF000, 0xF0000) instead of all 8. Nibbles are not shifted to LSB before…
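The rejection comments describe the bug in terms of masks and shifts. Below is a minimal C sketch of what "extract all 8 nibbles, shifted to the LSB" means when Q4_0 data is read one uint32_t at a time. It assumes ggml's reference Q4_0 layout: 32 weights per block, the low nibble of qs[i] is weight i, the high nibble is weight i + 16, and the dequantized value is (q - 8) * d. It also assumes a little-endian word load. The struct and function names are hypothetical, d is a float rather than a half for simplicity, and this is plain C, not the Metal kernel or MLX's qmv_fast_impl.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical reference block, assumed to mirror ggml's block_q4_0 layout
 * (scale stored as float here instead of half, for simplicity). */
#define QK4_0 32

typedef struct {
    float   d;              /* per-block scale */
    uint8_t qs[QK4_0 / 2];  /* 16 bytes = 32 packed 4-bit values */
} block_q4_0_ref;

/* Extract all 8 nibbles of a 32-bit word. Each nibble is shifted down to
 * the LSB before masking, so every result lies in 0..15. */
static void unpack8(uint32_t w, uint8_t nib[8]) {
    for (int i = 0; i < 8; ++i) {
        nib[i] = (uint8_t)((w >> (4 * i)) & 0xFu);
    }
}

/* Dequantize one block by reading qs as four 32-bit words.
 * The word is assembled little-endian, so byte k sits in bits 8k..8k+7:
 * nibble 2k is the low nibble of qs[j + k] (weight j + k) and
 * nibble 2k + 1 is its high nibble (weight j + k + 16). */
static void dequant_block_q4_0_u32(const block_q4_0_ref *b, float out[QK4_0]) {
    for (int j = 0; j < QK4_0 / 2; j += 4) {
        uint32_t w = (uint32_t)b->qs[j]
                   | ((uint32_t)b->qs[j + 1] << 8)
                   | ((uint32_t)b->qs[j + 2] << 16)
                   | ((uint32_t)b->qs[j + 3] << 24);
        uint8_t nib[8];
        unpack8(w, nib);
        for (int k = 0; k < 4; ++k) {
            out[j + k]      = ((int)nib[2 * k]     - 8) * b->d;
            out[j + k + 16] = ((int)nib[2 * k + 1] - 8) * b->d;
        }
    }
}

/* Byte-at-a-time reference, used to cross-check the word path. */
static void dequant_block_q4_0_bytes(const block_q4_0_ref *b, float out[QK4_0]) {
    for (int i = 0; i < QK4_0 / 2; ++i) {
        out[i]      = ((int)(b->qs[i] & 0x0F) - 8) * b->d;
        out[i + 16] = ((int)(b->qs[i] >> 4)   - 8) * b->d;
    }
}
```

Comparing a word-based path against a byte-at-a-time reference in this way catches both failure modes named in the review: nibbles that are skipped and nibbles that are left unshifted or duplicated.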

sleepy created pull request sleepy/llama.cpp#39 2026-04-30 22:39:55 +02:00
[metal] wire contiguous Q4_0 kernel into dispatch (#29)
sleepy pushed to fix/29-q40-contig-reads at sleepy/llama.cpp 2026-04-30 22:38:52 +02:00
06f05e71c1 [metal] wire contiguous Q4_0 kernel into dispatch (#29)
eeb79b026b [metal] extend bin op fusion to MUL/SUB/DIV chains (#28)
sleepy created branch fix/29-q40-contig-reads in sleepy/llama.cpp 2026-04-30 22:38:52 +02:00
sleepy commented on pull request sleepy/llama.cpp#38 2026-04-30 21:04:03 +02:00
[metal] extend bin op fusion to MUL/SUB/DIV chains (#28)

Merged via squash. Coherence test passed (token output byte-identical to master).

sleepy deleted branch fix/28-bin-op-fusion from sleepy/llama.cpp 2026-04-30 21:03:37 +02:00
sleepy pushed to master at sleepy/llama.cpp 2026-04-30 21:03:16 +02:00
8c532835be [metal] extend bin op fusion to MUL/SUB/DIV chains (#28) (#38)
sleepy merged pull request sleepy/llama.cpp#38 2026-04-30 21:03:15 +02:00
[metal] extend bin op fusion to MUL/SUB/DIV chains (#28)
sleepy closed issue sleepy/llama.cpp#33 2026-04-30 20:17:37 +02:00
IQ4_XS tg4096 anomaly (45 vs 76 tok/s on 4B)
sleepy closed issue sleepy/llama.cpp#27 2026-04-30 20:17:28 +02:00
Eliminate zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE)
sleepy closed issue sleepy/llama.cpp#28 2026-04-30 20:17:15 +02:00
Reduce GPU dispatch count (1151 per tick)
sleepy created pull request sleepy/llama.cpp#38 2026-04-30 20:17:00 +02:00
[metal] extend bin op fusion to MUL/SUB/DIV chains (#28)
sleepy pushed to fix/28-bin-op-fusion at sleepy/llama.cpp 2026-04-30 20:14:19 +02:00
eeb79b026b [metal] extend bin op fusion to MUL/SUB/DIV chains (#28)
sleepy created branch fix/28-bin-op-fusion in sleepy/llama.cpp 2026-04-30 20:14:19 +02:00