Repository files (latest commit first)
Filename Latest commit message Latest commit date
Kaloyan Nikolov f71e20b7b5 feat: E2E test infra, Metal dispatch wiring, config fixes
- e2e.rs: comprehensive E2E tests with real model loading
- dispatch.rs: Metal kernel dispatch primitives (matmul, rms_norm)
- config.rs: proper serde defaults for rope_theta, sliding_window
- model.rs: Metal buffer management, forward pass with dispatch
- weight_loader.rs: improved weight loading with dtype conversion
- parser.rs: improved safetensors shard resolution
- 156 tests pass, clippy clean
2026-05-13 08:06:46 +02:00
src feat: E2E test infra, Metal dispatch wiring, config fixes 2026-05-13 08:06:46 +02:00
target feat: initial project structure with stubs 2026-05-12 15:04:07 +02:00
.gitignore feat: implement tensor, metal, and safetensors modules 2026-05-12 16:11:58 +02:00
AGENTS.md initial: project docs 2026-05-12 14:36:33 +02:00
build.rs feat: MSL kernels and build system 2026-05-12 20:41:04 +02:00
Cargo.lock feat: inference engine and CLI 2026-05-12 23:33:59 +02:00
Cargo.toml feat: inference engine and CLI 2026-05-12 23:33:59 +02:00
PROJECT.md initial: project docs 2026-05-12 14:36:33 +02:00
README.md initial: project docs 2026-05-12 14:36:33 +02:00
WIKI.md initial: project docs 2026-05-12 14:36:33 +02:00

rust-llm

A Rust continuation of sleepy-llm — a ground-up inference engine for Apple Silicon. Target: beat MLX performance.

Core idea: Same architecture as the Zig project, rebuilt in Rust. Skip the Python/MLX overhead and the underperforming multi-platform engines. Write a focused inference engine with hand-tuned Metal Shading Language kernels, mmap model weights into unified memory, and dispatch directly to the GPU via MTLCommandBuffer.

Why Rust: The Zig project proved the kernel fusion and memory architecture. Rust gives us serde, metal-rs, and a mature ecosystem for safetensors parsing and tokenizer handling, without sacrificing the zero-copy, direct-to-Metal design. Where Zig's comptime was the right tool for shape checking, Rust's const generics and type system play the equivalent role here.
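To illustrate the shape-checking point, here is a minimal sketch of how const generics can encode matrix dimensions in the type. The `Matrix` type and its methods are hypothetical names chosen for this example, not the project's actual tensor API, and the loop is a plain CPU matmul rather than a Metal kernel.

```rust
/// Row-major matrix whose dimensions are part of its type.
/// Hypothetical illustration type, not the project's tensor API.
#[derive(Debug, Clone, PartialEq)]
pub struct Matrix<const R: usize, const C: usize> {
    data: Vec<f32>, // invariant: data.len() == R * C
}

impl<const R: usize, const C: usize> Matrix<R, C> {
    /// Returns None when the buffer length does not match R * C.
    pub fn from_vec(data: Vec<f32>) -> Option<Self> {
        (data.len() == R * C).then_some(Self { data })
    }

    pub fn get(&self, row: usize, col: usize) -> f32 {
        self.data[row * C + col]
    }

    /// (R x C) * (C x K) -> (R x K).
    /// The inner dimension C is shared by both operand types, so a
    /// mismatched multiply is rejected by the compiler, not at runtime.
    pub fn matmul<const K: usize>(&self, rhs: &Matrix<C, K>) -> Matrix<R, K> {
        let mut out = vec![0.0f32; R * K];
        for i in 0..R {
            for j in 0..K {
                let mut acc = 0.0f32;
                for k in 0..C {
                    acc += self.data[i * C + k] * rhs.data[k * K + j];
                }
                out[i * K + j] = acc;
            }
        }
        Matrix { data: out }
    }
}
```

With this shape, multiplying a `Matrix<2, 3>` by a `Matrix<3, 2>` type-checks and yields a `Matrix<2, 2>`, while multiplying a `Matrix<2, 3>` by a `Matrix<2, 3>` fails to compile.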

Status: Early architecture. Ported from Zig design docs. Building toward Qwen3.5-4B support with Multi-Token Prediction (MTP).

Stack: Rust (latest stable), Metal 3, MSL, metal-rs. No Python. No Vulkan. No MLX dependency.

Model format: Safetensors (initially). We use MLX-optimized safetensors for fair baseline comparison against MLX. GGUF may be added later for broader compatibility.
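For orientation, a safetensors file starts with an 8-byte little-endian header length, followed by a JSON header of that length, followed by the raw tensor bytes. The sketch below splits a byte buffer along those boundaries using only the standard library. The function name `split_safetensors` is hypothetical, the input is assumed to be a slice that could come from an mmap, and JSON parsing (for example via serde) is left out.

```rust
/// Splits a safetensors byte buffer into (JSON header text, tensor data).
/// Hypothetical helper for illustration; it borrows from `bytes` and
/// copies nothing except the 8-byte length prefix.
pub fn split_safetensors(bytes: &[u8]) -> Result<(&str, &[u8]), String> {
    if bytes.len() < 8 {
        return Err("buffer shorter than the 8-byte length prefix".to_string());
    }

    // First 8 bytes: header length as a little-endian u64.
    let mut len_bytes = [0u8; 8];
    len_bytes.copy_from_slice(&bytes[..8]);
    let header_len = usize::try_from(u64::from_le_bytes(len_bytes))
        .map_err(|_| "header length does not fit in usize".to_string())?;

    // Guard against overflow and truncated files before slicing.
    let header_end = 8usize
        .checked_add(header_len)
        .ok_or_else(|| "header length overflows".to_string())?;
    let header_bytes = bytes
        .get(8..header_end)
        .ok_or_else(|| "header extends past end of buffer".to_string())?;

    // The header is JSON, so it must be valid UTF-8.
    let header = std::str::from_utf8(header_bytes).map_err(|e| e.to_string())?;

    // Everything after the header is tensor data; the `data_offsets`
    // entries in the header are relative to the start of this slice.
    Ok((header, &bytes[header_end..]))
}
```

Because both returned slices borrow from the input, a tensor can later be addressed as a sub-slice of the data region without copying.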

Test model: Qwen3.5-4B with verified MTP layers (15 MTP tensors confirmed, mtp_num_hidden_layers: 1 in config).

Build

cargo build --release

Test

cargo test

Lint

cargo clippy --all-targets --all-features && cargo fmt --check

Architecture

  • src/metal/ — Metal GPU backend (context, buffers, pipelines, kernels)
  • src/tensor/ — Generic tensor system with static shape checking
  • src/safetensors/ — Safetensors parser and zero-copy loader
  • src/models/ — Model implementations (qwen3_5 reference)
  • src/inference/ — Inference engine, sampling, scheduling, MTP
  • src/platform/ — Apple Silicon feature detection
  • src/tests/ — End-to-end tests
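One common way to test GPU kernels such as the rms_norm dispatch named in the commit log is to compare their output against a slow CPU reference. The sketch below is an assumption about how such a check could look, not the project's test code: `rms_norm_ref` and `max_abs_diff` are hypothetical names, and the formula is the plain RMSNorm, y_i = x_i / sqrt(mean(x^2) + eps) * w_i. The actual kernel may use a different weight convention.

```rust
/// CPU reference for plain RMSNorm over one vector.
/// Hypothetical helper; assumes y_i = x_i / sqrt(mean(x^2) + eps) * w_i.
pub fn rms_norm_ref(x: &[f32], weight: &[f32], eps: f32) -> Vec<f32> {
    assert_eq!(x.len(), weight.len(), "x and weight must have equal length");
    assert!(!x.is_empty(), "input must not be empty");

    // Mean of squares across the vector.
    let mean_sq = x.iter().map(|v| v * v).sum::<f32>() / x.len() as f32;
    let inv_rms = 1.0 / (mean_sq + eps).sqrt();

    // Normalize, then apply the per-element weight.
    x.iter().zip(weight).map(|(v, w)| v * inv_rms * w).collect()
}

/// Largest element-wise absolute difference, for tolerance checks
/// between a GPU result and the CPU reference.
pub fn max_abs_diff(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len(), "slices must have equal length");
    a.iter().zip(b).map(|(x, y)| (x - y).abs()).fold(0.0, f32::max)
}
```

A test would read the kernel output back from its buffer and assert that `max_abs_diff` against `rms_norm_ref` stays below a tolerance chosen for the dtype in use.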