  • Zig 82.3%
  • Metal 16.6%
  • Objective-C 1.1%
Repository files (latest commit first)

Latest commit: 914a4f9c94 by Kaloyan Nikolov, 2026-05-22 19:04:13 +02:00

perf: two-stage top-K lm_head — 87% bandwidth reduction
Stage 1: approximate scores using first 128/1024 dims (39MB vs 312MB)
Stage 2: exact re-ranking for top-512 candidates (1MB)

Result: ~100 tok/s decode (+2 tok/s). Greedy output verified matching.

Filename | Latest commit message | Latest commit date
docs | [infra] Project scaffold: build.zig, directory structure (#1) (#9) | 2026-05-10 12:21:47 +02:00
src | perf: two-stage top-K lm_head — 87% bandwidth reduction | 2026-05-22 19:04:13 +02:00
.gitignore | [metal] Metal shim and context module (#2) (#10) | 2026-05-10 13:19:53 +02:00
AGENTS.md | initial: mlx-zig project setup | 2026-05-10 11:22:42 +02:00
build.zig | perf: simd GEMV kernel + norm fix + fused residual (13.7 → 21 tok/s) | 2026-05-15 18:57:49 +02:00
build.zig.zon | [infra] Project scaffold: build.zig, directory structure (#1) (#9) | 2026-05-10 12:21:47 +02:00
PROJECT.md | docs: add correctness prerequisites to PROJECT.md | 2026-05-10 12:38:50 +02:00
README.md | [infra] Project scaffold: build.zig, directory structure (#1) (#9) | 2026-05-10 12:21:47 +02:00
WIKI.md | initial: mlx-zig project setup | 2026-05-10 11:22:42 +02:00
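
The two-stage top-K lm_head named in the latest commit message can be illustrated with a small CPU-side sketch. This is not the repository's Metal kernel: the function names (twoStageArgmax, dotPrefix), the row-major [vocab][dim] layout, and the linear-scan candidate buffer are all assumptions made for illustration. Stage 1 scores every vocabulary row using only the first `prefix` dimensions, and stage 2 recomputes exact scores for the `k` surviving candidates.

```zig
const std = @import("std");

/// Dot product over the first `n` elements of two slices.
fn dotPrefix(a: []const f32, b: []const f32, n: usize) f32 {
    var acc: f32 = 0;
    for (a[0..n], b[0..n]) |x, y| acc += x * y;
    return acc;
}

/// Hypothetical greedy two-stage argmax over a row-major [vocab][dim] matrix.
/// The result is exact only if the true best row survives stage 1.
pub fn twoStageArgmax(
    comptime vocab: usize,
    comptime dim: usize,
    comptime prefix: usize,
    comptime k: usize,
    weights: *const [vocab][dim]f32,
    hidden: *const [dim]f32,
) usize {
    comptime {
        if (prefix > dim or k == 0 or k > vocab)
            @compileError("invalid two-stage parameters");
    }

    // Stage 1: approximate scores from the first `prefix` dims only.
    // Keep the k best rows in a small unsorted buffer.
    var cand_ids: [k]usize = undefined;
    var cand_scores: [k]f32 = undefined;
    var filled: usize = 0;
    var id: usize = 0;
    while (id < vocab) : (id += 1) {
        const s = dotPrefix(&weights[id], hidden, prefix);
        if (filled < k) {
            cand_ids[filled] = id;
            cand_scores[filled] = s;
            filled += 1;
        } else {
            // Replace the weakest candidate if this row beats it.
            var worst: usize = 0;
            for (cand_scores, 0..) |cs, i| {
                if (cs < cand_scores[worst]) worst = i;
            }
            if (s > cand_scores[worst]) {
                cand_ids[worst] = id;
                cand_scores[worst] = s;
            }
        }
    }

    // Stage 2: exact re-ranking of the k candidates over all `dim` dims.
    var best_id: usize = cand_ids[0];
    var best_score: f32 = dotPrefix(&weights[best_id], hidden, dim);
    for (cand_ids[1..]) |cid| {
        const s = dotPrefix(&weights[cid], hidden, dim);
        if (s > best_score) {
            best_score = s;
            best_id = cid;
        }
    }
    return best_id;
}
```

With the commit's numbers, stage 1 reads 128/1024 = 12.5% of each row, which matches the quoted 39MB versus 312MB. The sketch makes no claim about how the real weights are arranged so that a prefix of the dimensions is a good predictor; the commit only states that greedy output was verified to match.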

sleepy-llm

A ground-up Zig-native inference engine for Apple Silicon. Target: beat MLX performance.

Core idea: Skip the Python/MLX overhead and the underperforming multi-platform engines. Write a focused inference engine in Zig with hand-tuned Metal Shading Language kernels, mmap model weights into unified memory, and dispatch directly to the GPU via MTLCommandBuffer.

Why not ZINC or llama.cpp: ZINC's Apple Silicon path achieves 25-38 tok/s, which is 3-5x slower than MLX. This project starts from scratch with the right architecture: comptime-shaped tensors, zero-copy unified memory, and kernels tuned specifically for Apple Silicon's Neural Engine + GPU fusion.

Status: Early architecture. Building toward Qwen3.5-4B support with Multi-Token Prediction (MTP).

Stack: Zig 0.16+, Metal 3, MSL. No Python. No Vulkan. No MLX dependency.

Model format: Safetensors (initially). We use MLX-optimized safetensors for fair baseline comparison against MLX. GGUF may be added later for broader compatibility.
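
As a rough illustration of why safetensors suits zero-copy loading: a safetensors file begins with an 8-byte little-endian header length, followed by a JSON header of that length, followed by the raw tensor bytes. The sketch below splits an in-memory (for example, mmapped) buffer into those two regions without copying. The names SafetensorsView and splitSafetensors are hypothetical and are not taken from this repository's src/safetensors/ module.

```zig
const std = @import("std");

/// Borrowed views into a safetensors buffer; nothing is copied.
pub const SafetensorsView = struct {
    header_json: []const u8, // UTF-8 JSON describing dtype, shape, data_offsets
    data: []const u8, // raw tensor bytes; data_offsets are relative to this slice
};

pub const SplitError = error{ TooShort, HeaderOutOfBounds };

/// Hypothetical helper: split a safetensors buffer into header and data.
pub fn splitSafetensors(bytes: []const u8) SplitError!SafetensorsView {
    if (bytes.len < 8) return error.TooShort;
    // The first 8 bytes hold the JSON header length as a little-endian u64.
    const header_len = std.mem.readInt(u64, bytes[0..8], .little);
    if (header_len > bytes.len - 8) return error.HeaderOutOfBounds;
    const header_end = 8 + @as(usize, @intCast(header_len));
    return SafetensorsView{
        .header_json = bytes[8..header_end],
        .data = bytes[header_end..],
    };
}
```

Because both slices alias the input buffer, a loader built this way could hand tensor byte ranges to the GPU without an intermediate copy, assuming the buffer stays mapped for the lifetime of the view.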

Test model: Qwen3.5-4B with verified MTP layers (15 MTP tensors confirmed, mtp_num_hidden_layers: 1 in config).

Build

zig build

Test

zig build test

Lint

zig build lint

Architecture

  • src/metal/ — Metal GPU backend (shim, context, buffers, pipelines, kernels)
  • src/tensor/ — Comptime-shaped tensor system
  • src/safetensors/ — Safetensors parser and zero-copy loader
  • src/models/ — Model implementations (qwen3_5 reference)
  • src/inference/ — Inference engine, sampling, scheduling, MTP
  • src/platform/ — Apple Silicon feature detection
  • src/tests/ — End-to-end tests
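
To give a feel for what a comptime-shaped tensor system can mean in Zig, here is a minimal sketch of a matrix type whose dimensions are compile-time parameters. It is an assumption about the general approach, not the actual contents of src/tensor/; the names Matrix, at, and mulVec are hypothetical. Because the shape is part of the type, passing a vector of the wrong length to mulVec fails at compile time rather than at run time.

```zig
const std = @import("std");

/// Hypothetical row-major matrix whose shape is fixed at compile time.
pub fn Matrix(comptime T: type, comptime rows: usize, comptime cols: usize) type {
    return struct {
        data: [rows * cols]T,

        const Self = @This();
        pub const shape = [2]usize{ rows, cols };

        /// Element at row `r`, column `c`.
        pub fn at(self: *const Self, r: usize, c: usize) T {
            return self.data[r * cols + c];
        }

        /// Matrix-vector product. The input must have exactly `cols`
        /// elements and the output has exactly `rows`; both are checked
        /// by the type system.
        pub fn mulVec(self: *const Self, x: *const [cols]T) [rows]T {
            var out: [rows]T = undefined;
            for (&out, 0..) |*o, r| {
                var acc: T = 0;
                for (0..cols) |c| acc += self.at(r, c) * x[c];
                o.* = acc;
            }
            return out;
        }
    };
}
```

For example, with `const W = Matrix(f32, 2, 3);`, calling `w.mulVec(&x)` compiles only when `x` is a `[3]f32`, and the result is a `[2]f32`.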