sleepy

sleepy closed pull request sleepy/llama.cpp#39

2026-05-01 00:55:59 +02:00

fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)

sleepy commented on pull request sleepy/llama.cpp#39

2026-05-01 00:55:59 +02:00

fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)

Closing PR. Nibble extraction is fundamentally broken (only 4 of the 8 nibbles in each uint32_t are extracted, with duplication). Will rewrite with MLX qmv_fast_impl as reference. Re-opening #29 for proper implementation.

sleepy pushed to master at sleepy/llama.cpp

2026-05-01 00:44:45 +02:00

757ef4de97 [docs] add coherence tests, MLX benchmarking, onboarding, Gitea API

sleepy opened issue sleepy/llama.cpp#40

2026-05-01 00:24:09 +02:00

[perf] achieve MLX generation t/s — 22 t/s on 27B Q4_0 (M4 Max)

sleepy pushed to fix/29-q40-contig-reads at sleepy/llama.cpp

2026-05-01 00:14:12 +02:00

31ce8b1ae5 fix(metal): correct Q4_0 contiguous kernel nibble extraction

sleepy commented on pull request sleepy/llama.cpp#39

2026-04-30 22:42:17 +02:00

fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)

REJECTED: Nibble extraction is incorrect. The masks (0x0F, 0xF00, 0xF000, 0xF0000) only extract 4 of 8 nibbles per uint32_t and do not shift to LSB. Each uint32_t holds 8 nibbles. Fix: extract all…
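
The comment above is truncated, so the fix it actually proposes is not visible here. As a generic illustration of the problem it describes, the sketch below contrasts the four quoted masks with a shift-and-mask loop that returns all eight 4-bit fields of a 32-bit word. The function names and the test word are hypothetical, and which weight index each nibble maps to depends on the kernel's data layout, which this sketch does not model.

```python
# Masks quoted in the review comment; applied as-is they leave the
# selected bits at their original position instead of moving them to the LSB.
MASKS_FROM_REVIEW = (0x0F, 0xF00, 0xF000, 0xF0000)


def extract_masked(word):
    """Apply the four quoted masks without shifting (the behaviour under review)."""
    return [word & mask for mask in MASKS_FROM_REVIEW]


def extract_all_nibbles(word):
    """Return the eight 4-bit fields of a 32-bit word, lowest nibble first."""
    return [(word >> (4 * i)) & 0xF for i in range(8)]


# Hypothetical test word whose nibbles are 1..8 from low to high.
word = 0x87654321

# Only four fields come back, and three of them are not in the 0..15 range.
print([hex(v) for v in extract_masked(word)])

# All eight fields, each already shifted down to the 0..15 range.
print(extract_all_nibbles(word))
```

With the test word above, the masked version yields four values, three of them still sitting at their original bit positions, while the shift-and-mask loop yields eight values in the 0..15 range.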

sleepy commented on pull request sleepy/llama.cpp#39

2026-04-30 22:41:51 +02:00

fix(metal): correct Q4_0 contiguous kernel nibble extraction (#29)

REJECTED: Contiguous kernel has incorrect nibble extraction. Only 4 nibbles per uint32_t are extracted (masks 0x0F, 0xF00, 0xF000, 0xF0000) instead of all 8. Nibbles are not shifted to LSB before…

sleepy created pull request sleepy/llama.cpp#39

2026-04-30 22:39:55 +02:00

[metal] wire contiguous Q4_0 kernel into dispatch (#29)

sleepy pushed to fix/29-q40-contig-reads at sleepy/llama.cpp

2026-04-30 22:38:52 +02:00

06f05e71c1 [metal] wire contiguous Q4_0 kernel into dispatch (#29)

eeb79b026b [metal] extend bin op fusion to MUL/SUB/DIV chains (#28)

sleepy created branch fix/29-q40-contig-reads in sleepy/llama.cpp

2026-04-30 22:38:52 +02:00

sleepy commented on pull request sleepy/llama.cpp#38

2026-04-30 21:04:03 +02:00

[metal] extend bin op fusion to MUL/SUB/DIV chains (#28)

Merged via squash. Coherence test passed (token output byte-identical to master).

sleepy deleted branch fix/28-bin-op-fusion from sleepy/llama.cpp

2026-04-30 21:03:37 +02:00

sleepy pushed to master at sleepy/llama.cpp

2026-04-30 21:03:16 +02:00

8c532835be [metal] extend bin op fusion to MUL/SUB/DIV chains (#28) (#38)

sleepy merged pull request sleepy/llama.cpp#38

2026-04-30 21:03:15 +02:00

[metal] extend bin op fusion to MUL/SUB/DIV chains (#28)

sleepy closed issue sleepy/llama.cpp#33

2026-04-30 20:17:37 +02:00

IQ4_XS tg4096 anomaly (45 vs 76 tok/s on 4B)

sleepy closed issue sleepy/llama.cpp#27

2026-04-30 20:17:28 +02:00

Eliminate zero-ops (VIEW/RESHAPE/TRANSPOSE/PERMUTE)

sleepy closed issue sleepy/llama.cpp#28

2026-04-30 20:17:15 +02:00

Reduce GPU dispatch count (1151 per tick)

sleepy created pull request sleepy/llama.cpp#38

2026-04-30 20:17:00 +02:00

[metal] extend bin op fusion to MUL/SUB/DIV chains (#28)

sleepy pushed to fix/28-bin-op-fusion at sleepy/llama.cpp

2026-04-30 20:14:19 +02:00

eeb79b026b [metal] extend bin op fusion to MUL/SUB/DIV chains (#28)

sleepy created branch fix/28-bin-op-fusion in sleepy/llama.cpp

2026-04-30 20:14:19 +02:00