D 55.5%
Rust 40.6%
Metal 2.8%
Makefile 1%

Latest commit: Kaloyan Nikolov, f71e20b7b5, 2026-05-13 08:06:46 +02:00
feat: E2E test infra, Metal dispatch wiring, config fixes
- e2e.rs: comprehensive E2E tests with real model loading
- dispatch.rs: Metal kernel dispatch primitives (matmul, rms_norm)
- config.rs: proper serde defaults for rope_theta, sliding_window
- model.rs: Metal buffer management, forward pass with dispatch
- weight_loader.rs: improved weight loading with dtype conversion
- parser.rs: improved safetensors shard resolution
- 156 tests pass, clippy clean

Repository files (latest commit first)
Filename	Latest commit message	Latest commit date
src	feat: E2E test infra, Metal dispatch wiring, config fixes	2026-05-13 08:06:46 +02:00
target	feat: initial project structure with stubs	2026-05-12 15:04:07 +02:00
.gitignore	feat: implement tensor, metal, and safetensors modules	2026-05-12 16:11:58 +02:00
AGENTS.md	initial: project docs	2026-05-12 14:36:33 +02:00
build.rs	feat: MSL kernels and build system	2026-05-12 20:41:04 +02:00
Cargo.lock	feat: inference engine and CLI	2026-05-12 23:33:59 +02:00
Cargo.toml	feat: inference engine and CLI	2026-05-12 23:33:59 +02:00
PROJECT.md	initial: project docs	2026-05-12 14:36:33 +02:00
README.md	initial: project docs	2026-05-12 14:36:33 +02:00
WIKI.md	initial: project docs	2026-05-12 14:36:33 +02:00

README.md

rust-llm

A Rust continuation of sleepy-llm — a ground-up inference engine for Apple Silicon. Target: beat MLX performance.

Core idea: Same architecture as the Zig project, rebuilt in Rust. Skip the Python/MLX overhead and the underperforming multi-platform engines. Write a focused inference engine with hand-tuned Metal Shading Language kernels, mmap model weights into unified memory, and dispatch directly to the GPU via MTLCommandBuffer.

Why Rust: The Zig project proved the kernel fusion and memory architecture. Rust gives us serde, metal-rs, and a mature ecosystem for safetensors parsing and tokenizer handling, without sacrificing the zero-copy, direct-to-Metal design. Where Zig's comptime handled shape checking, Rust's const generics and type system play the same role here.
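To make the const-generics point concrete, here is a minimal sketch of compile-time shape checking. The `Matrix` type and its methods are hypothetical names chosen for illustration, not the project's actual `src/tensor/` API, and the CPU loop stands in for what would be a Metal kernel dispatch.

```rust
/// Row-major matrix whose dimensions are part of its type.
/// Hypothetical illustration type, not the project's tensor API.
#[derive(Debug, Clone, PartialEq)]
pub struct Matrix<const R: usize, const C: usize> {
    data: Vec<f32>, // invariant: data.len() == R * C
}

impl<const R: usize, const C: usize> Matrix<R, C> {
    /// Returns None when the buffer length does not match R * C.
    pub fn from_vec(data: Vec<f32>) -> Option<Self> {
        if data.len() == R * C {
            Some(Self { data })
        } else {
            None
        }
    }

    /// Element at row `r`, column `c` (panics if out of bounds).
    pub fn get(&self, r: usize, c: usize) -> f32 {
        self.data[r * C + c]
    }

    /// (R x C) * (C x K) -> (R x K).
    /// The inner dimension C is shared through the type, so a mismatched
    /// right-hand side is rejected by the compiler rather than at runtime.
    pub fn matmul<const K: usize>(&self, rhs: &Matrix<C, K>) -> Matrix<R, K> {
        let mut out = vec![0.0f32; R * K];
        for i in 0..R {
            for j in 0..K {
                let mut acc = 0.0f32;
                for k in 0..C {
                    acc += self.data[i * C + k] * rhs.data[k * K + j];
                }
                out[i * K + j] = acc;
            }
        }
        Matrix { data: out }
    }
}
```

With this shape encoding, `Matrix::<2, 3>::matmul` accepts only a `Matrix<3, K>`; passing a `Matrix<2, 2>` is a type error. The trade-off is that dimensions must be known at compile time, so dynamic axes such as sequence length would need a different treatment.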

Status: Early architecture. Ported from Zig design docs. Building toward Qwen3.5-4B support with Multi-Token Prediction (MTP).

Stack: Rust (latest stable), Metal 3, MSL, metal-rs. No Python. No Vulkan. No MLX dependency.

Model format: Safetensors (initially). We use MLX-optimized safetensors for fair baseline comparison against MLX. GGUF may be added later for broader compatibility.
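For orientation, a safetensors file starts with an 8-byte little-endian header length, followed by that many bytes of JSON metadata, followed by the raw tensor data. The sketch below splits a byte buffer along those boundaries using only the standard library. The function name `split_safetensors` is hypothetical; in the real loader the input slice would presumably come from the mmap'd file and the header text would be handed to a JSON parser such as serde_json.

```rust
/// Splits a safetensors buffer into (JSON header text, raw tensor bytes).
/// Hypothetical helper for illustration; it borrows from `bytes`, so no
/// tensor data is copied.
pub fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), String> {
    if bytes.len() < 8 {
        return Err("buffer too short for the 8-byte length prefix".to_string());
    }

    // First 8 bytes: header length as an unsigned little-endian integer.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[..8]);
    let header_len = usize::try_from(u64::from_le_bytes(len_bytes))
        .map_err(|_| "header length does not fit in usize".to_string())?;

    // Guard against overflow and truncated files before slicing.
    let end = 8usize
        .checked_add(header_len)
        .ok_or_else(|| "header length overflows".to_string())?;
    if end > bytes.len() {
        return Err("header length exceeds buffer size".to_string());
    }

    // The header is UTF-8 JSON; everything after it is tensor data.
    let header = std::str::from_utf8(&bytes[8..end]).map_err(|e| e.to_string())?;
    Ok((header, &bytes[end..]))
}
```

Each tensor entry in the header carries `dtype`, `shape`, and `data_offsets`, where the offsets are relative to the start of the returned data slice. That is what makes a zero-copy view into the mapped weights possible.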

Test model: Qwen3.5-4B with verified MTP layers (15 MTP tensors confirmed, mtp_num_hidden_layers: 1 in config).

Build

cargo build --release

Test

cargo test

Lint

cargo clippy --all-targets --all-features && cargo fmt --check

Architecture

src/metal/ — Metal GPU backend (context, buffers, pipelines, kernels)
src/tensor/ — Generic tensor system with static shape checking
src/safetensors/ — Safetensors parser and zero-copy loader
src/models/ — Model implementations (qwen3_5 reference)
src/inference/ — Inference engine, sampling, scheduling, MTP
src/platform/ — Apple Silicon feature detection
src/tests/ — End-to-end tests
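
As an illustration of the sampling step listed under src/inference/, the following is a minimal CPU sketch of greedy and temperature sampling over a logits vector. All function names are hypothetical and not taken from the repository. The uniform random draw `u` is passed in as a parameter, which avoids assuming any particular RNG crate and keeps the functions deterministic to test.

```rust
/// Greedy decoding: index of the largest logit (first one wins on ties).
/// Assumes the logits contain no NaN values.
pub fn argmax(logits: &[f32]) -> Option<usize> {
    if logits.is_empty() {
        return None;
    }
    let mut best = 0;
    for i in 1..logits.len() {
        if logits[i] > logits[best] {
            best = i;
        }
    }
    Some(best)
}

/// Numerically stable softmax with temperature scaling.
/// Assumes `temperature > 0`; lower values sharpen the distribution.
pub fn softmax(logits: &[f32], temperature: f32) -> Vec<f32> {
    // Subtracting the max keeps exp() from overflowing.
    let max = logits.iter().copied().fold(f32::NEG_INFINITY, f32::max);
    let exps: Vec<f32> = logits
        .iter()
        .map(|&x| ((x - max) / temperature).exp())
        .collect();
    let sum: f32 = exps.iter().sum();
    exps.into_iter().map(|e| e / sum).collect()
}

/// Inverse-CDF sampling: returns the first index whose cumulative
/// probability exceeds `u`, where `u` is a uniform draw in [0, 1).
/// Falls back to the last index if rounding leaves the total below `u`.
pub fn sample_index(probs: &[f32], u: f32) -> Option<usize> {
    let mut cumulative = 0.0f32;
    for (i, &p) in probs.iter().enumerate() {
        cumulative += p;
        if u < cumulative {
            return Some(i);
        }
    }
    probs.len().checked_sub(1)
}
```

A decode step would then call either `argmax(&logits)` for greedy decoding or `sample_index(&softmax(&logits, temperature), u)` for stochastic decoding. Top-k or top-p filtering, if used, would slot in between the softmax and the draw.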