sleepy-llm
A ground-up Zig-native inference engine for Apple Silicon. Target: beat MLX performance.
Core idea: Skip the Python/MLX overhead and the underperforming multi-platform engines. Write a focused inference engine in Zig with hand-tuned Metal Shading Language kernels, mmap model weights into unified memory, and dispatch directly to the GPU via MTLCommandBuffer.
Why not ZINC or llama.cpp: ZINC's Apple Silicon path achieves 25-38 tok/s — 3-5x slower than MLX. This project starts from scratch with the right architecture: comptime-shaped tensors, zero-copy unified memory, and kernels tuned specifically for Apple Silicon's Neural Engine + GPU fusion.
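To illustrate what "comptime-shaped tensors" can mean in practice, here is a minimal sketch, not the project's actual tensor API. The names `Matrix`, `filled`, `at`, and `matmul` are hypothetical. The idea shown is that dimensions live in the type, so a shape mismatch between the two operands of `matmul` is a compile error rather than a runtime check.

```zig
const std = @import("std");

/// Hypothetical comptime-shaped matrix: rows and cols are part of the type.
pub fn Matrix(comptime T: type, comptime rows: usize, comptime cols: usize) type {
    return struct {
        const Self = @This();
        pub const n_rows = rows;
        pub const n_cols = cols;

        // Row-major storage; the length is known at compile time.
        data: [rows * cols]T,

        /// Returns a matrix with every element set to `value`.
        pub fn filled(value: T) Self {
            var self: Self = undefined;
            @memset(&self.data, value);
            return self;
        }

        /// Reads the element at row `r`, column `c`.
        pub fn at(self: *const Self, r: usize, c: usize) T {
            return self.data[r * cols + c];
        }
    };
}

/// Naive CPU matmul. The inner dimension `k` appears in both operand types,
/// so passing a `b` with the wrong row count fails to compile.
pub fn matmul(
    comptime T: type,
    comptime m: usize,
    comptime k: usize,
    comptime n: usize,
    a: *const Matrix(T, m, k),
    b: *const Matrix(T, k, n),
) Matrix(T, m, n) {
    var out = Matrix(T, m, n).filled(0);
    for (0..m) |i| {
        for (0..n) |j| {
            var acc: T = 0;
            for (0..k) |p| acc += a.at(i, p) * b.at(p, j);
            out.data[i * n + j] = acc;
        }
    }
    return out;
}

test "usage: 2x3 times 3x4 yields 2x4" {
    const a = Matrix(f32, 2, 3).filled(1.0);
    const b = Matrix(f32, 3, 4).filled(2.0);
    const product = matmul(f32, 2, 3, 4, &a, &b);
    try std.testing.expectEqual(@as(f32, 6.0), product.at(0, 0));
}
```

A real engine would presumably keep the data in a GPU-visible buffer instead of an inline array; the inline array only keeps the sketch self-contained.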
Status: Early architecture. Building toward Qwen3.5-4B support with Multi-Token Prediction (MTP).
Stack: Zig 0.16+, Metal 3, MSL. No Python. No Vulkan. No MLX dependency.
Model format: Safetensors (initially). We use MLX-optimized safetensors for fair baseline comparison against MLX. GGUF may be added later for broader compatibility.
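As a rough sketch of why Safetensors suits zero-copy loading: a file is an 8-byte little-endian header length, a JSON header of that length, then the raw tensor bytes. The snippet below splits an already-mapped byte slice into those two regions without copying. `HeaderView` and `splitSafetensors` are hypothetical names, not the project's loader, and the JSON is left unparsed.

```zig
const std = @import("std");

/// Hypothetical view over a safetensors file that is already in memory
/// (for example, via mmap). Both slices borrow from the input; nothing is copied.
pub const HeaderView = struct {
    /// UTF-8 JSON describing each tensor's dtype, shape, and data_offsets.
    json: []const u8,
    /// Raw tensor bytes; data_offsets in the JSON are relative to this slice.
    data: []const u8,
};

/// Splits `bytes` into the JSON header and the tensor data region.
pub fn splitSafetensors(bytes: []const u8) error{Truncated}!HeaderView {
    if (bytes.len < 8) return error.Truncated;
    // The first 8 bytes hold the header length as an unsigned little-endian u64.
    const header_len = std.mem.readInt(u64, bytes[0..8], .little);
    if (header_len > bytes.len - 8) return error.Truncated;
    const end: usize = 8 + @as(usize, @intCast(header_len));
    return .{ .json = bytes[8..end], .data = bytes[end..] };
}
```

From here, a loader would parse `json` to find each tensor's byte range inside `data` and hand those ranges to the GPU backend.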
Test model: Qwen3.5-4B with verified MTP layers (15 MTP tensors confirmed, mtp_num_hidden_layers: 1 in config).
Build
zig build
Test
zig build test
Lint
zig build lint
Architecture
- src/metal/: Metal GPU backend (shim, context, buffers, pipelines, kernels)
- src/tensor/: Comptime-shaped tensor system
- src/safetensors/: Safetensors parser and zero-copy loader
- src/models/: Model implementations (qwen3_5 reference)
- src/inference/: Inference engine, sampling, scheduling, MTP
- src/platform/: Apple Silicon feature detection
- src/tests/: End-to-end tests