Repository files (latest commit first)
Filename	Latest commit message	Latest commit date
Kaloyan Nikolov 914a4f9c94 perf: two-stage top-K lm_head — 87% bandwidth reduction Stage 1: approximate scores using first 128/1024 dims (39MB vs 312MB) Stage 2: exact re-ranking for top-512 candidates (1MB) Result: ~100 tok/s decode (+2 tok/s). Greedy output verified matching.		2026-05-22 19:04:13 +02:00
docs	[infra] Project scaffold: build.zig, directory structure (#1) (#9)	2026-05-10 12:21:47 +02:00
src	perf: two-stage top-K lm_head — 87% bandwidth reduction	2026-05-22 19:04:13 +02:00
.gitignore	[metal] Metal shim and context module (#2) (#10)	2026-05-10 13:19:53 +02:00
AGENTS.md	initial: mlx-zig project setup	2026-05-10 11:22:42 +02:00
build.zig	perf: simd GEMV kernel + norm fix + fused residual (13.7 → 21 tok/s)	2026-05-15 18:57:49 +02:00
build.zig.zon	[infra] Project scaffold: build.zig, directory structure (#1) (#9)	2026-05-10 12:21:47 +02:00
PROJECT.md	docs: add correctness prerequisites to PROJECT.md	2026-05-10 12:38:50 +02:00
README.md	[infra] Project scaffold: build.zig, directory structure (#1) (#9)	2026-05-10 12:21:47 +02:00
WIKI.md	initial: mlx-zig project setup	2026-05-10 11:22:42 +02:00

README.md

sleepy-llm

A ground-up Zig-native inference engine for Apple Silicon. Target: beat MLX performance.

Core idea: Skip the Python/MLX overhead and the underperforming multi-platform engines. Write a focused inference engine in Zig with hand-tuned Metal Shading Language kernels, mmap model weights into unified memory, and dispatch directly to the GPU via MTLCommandBuffer.

Why not ZINC or llama.cpp: ZINC's Apple Silicon path achieves 25-38 tok/s, 3-5x slower than MLX. This project starts from scratch with an architecture built for the target hardware: comptime-shaped tensors, zero-copy unified memory, and Metal kernels tuned specifically for the Apple Silicon GPU.
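
To illustrate what "comptime-shaped tensors" can mean in Zig, here is a minimal sketch, not the code in src/tensor/. The `Tensor` type, its `filled`, `at`, and `matmul` functions, and the CPU-side loop are all hypothetical names and choices made for this example. The point is only that when the shape is part of the type, a mismatched matrix product fails at compile time instead of at runtime.

```zig
const std = @import("std");

/// Hypothetical 2-D tensor whose shape is part of its type.
/// (Illustrative only; not the project's actual tensor API.)
pub fn Tensor(comptime T: type, comptime rows: usize, comptime cols: usize) type {
    return struct {
        const Self = @This();

        data: [rows * cols]T,

        /// Shape is a compile-time constant attached to the type.
        pub const shape = [2]usize{ rows, cols };

        /// Build a tensor with every element set to `value`.
        pub fn filled(value: T) Self {
            var data: [rows * cols]T = undefined;
            @memset(&data, value);
            return .{ .data = data };
        }

        /// Row-major element access.
        pub fn at(self: *const Self, r: usize, c: usize) T {
            return self.data[r * cols + c];
        }

        /// Naive matrix product. `other` must have exactly `cols` rows;
        /// any other tensor type is rejected by the compiler.
        pub fn matmul(
            self: *const Self,
            comptime n: usize,
            other: *const Tensor(T, cols, n),
        ) Tensor(T, rows, n) {
            var out = Tensor(T, rows, n).filled(0);
            for (0..rows) |i| {
                for (0..n) |j| {
                    var acc: T = 0;
                    for (0..cols) |k| {
                        acc += self.at(i, k) * other.at(k, j);
                    }
                    out.data[i * n + j] = acc;
                }
            }
            return out;
        }
    };
}
```

For example, `Tensor(f32, 2, 3).filled(1).matmul(4, &b)` compiles only when `b` is a `Tensor(f32, 3, 4)`, and the result type is `Tensor(f32, 2, 4)`. A real engine would run the product in a Metal kernel; the loop here only keeps the sketch self-contained.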

Status: Early architecture. Building toward Qwen3.5-4B support with Multi-Token Prediction (MTP).

Stack: Zig 0.16+, Metal 3, MSL. No Python. No Vulkan. No MLX dependency.

Model format: Safetensors (initially). We use MLX-optimized safetensors for fair baseline comparison against MLX. GGUF may be added later for broader compatibility.
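
For context, a safetensors file begins with an 8-byte little-endian unsigned integer giving the length of a JSON header; the header follows, and the raw tensor bytes come after it. The sketch below shows how a zero-copy loader could split an already-mapped byte slice along those boundaries. `splitSafetensors`, `Split`, and `HeaderError` are hypothetical names for this example and are not taken from src/safetensors/. JSON parsing and the mmap call itself are left out.

```zig
const std = @import("std");

pub const HeaderError = error{ TooShort, HeaderOutOfBounds };

/// Views into the original buffer; nothing is copied.
pub const Split = struct {
    header_json: []const u8,
    tensor_data: []const u8,
};

/// Hypothetical helper: split a safetensors image into its JSON header
/// and the tensor byte region that follows it.
pub fn splitSafetensors(bytes: []const u8) HeaderError!Split {
    // The first 8 bytes hold the header length as a little-endian u64.
    if (bytes.len < 8) return error.TooShort;
    const header_len = std.mem.readInt(u64, bytes[0..8], .little);

    // Reject headers that claim to extend past the end of the buffer.
    if (header_len > bytes.len - 8) return error.HeaderOutOfBounds;
    const end: usize = 8 + @as(usize, @intCast(header_len));

    return .{
        .header_json = bytes[8..end],
        .tensor_data = bytes[end..],
    };
}
```

Because both fields are slices of the input, tensor offsets read from the JSON header can index straight into `tensor_data`. When the input is an mmap'd file, the weights are never copied on the CPU side.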

Test model: Qwen3.5-4B with verified MTP layers (15 MTP tensors confirmed, mtp_num_hidden_layers: 1 in config).

Build

zig build

Test

zig build test

Lint

zig build lint

Architecture

src/metal/ — Metal GPU backend (shim, context, buffers, pipelines, kernels)
src/tensor/ — Comptime-shaped tensor system
src/safetensors/ — Safetensors parser and zero-copy loader
src/models/ — Model implementations (qwen3_5 reference)
src/inference/ — Inference engine, sampling, scheduling, MTP
src/platform/ — Apple Silicon feature detection
src/tests/ — End-to-end tests