Local Swarm
Automatically configure and run a swarm of small coding LLMs optimized for your hardware. Provides an OpenAI-compatible API for seamless integration with opencode and other tools.
Features
- Interactive Menu System: Easy-to-use menu for selecting model configurations, browsing options, or creating custom setups
- Hardware Auto-Detection: Automatically detects your GPU (NVIDIA, AMD, Intel), Apple Silicon, Qualcomm (Android), or CPU and selects optimal settings
- Smart Model Selection: Chooses the best model, quantization, and instance count based on available VRAM/RAM
- Startup Summary: Clear display of detected hardware, selected model, resource usage, and worker status
- Swarm Consensus: Multiple LLM instances vote on the best response for higher quality outputs
- Network Federation: Multiple machines on the same network can join into a "federated swarm" for distributed consensus
- OpenAI-Compatible API: Drop-in replacement for OpenAI API at http://localhost:8000/v1
- MCP Server: Model Context Protocol support for tight AI assistant integration
- Cross-Platform: Works on Windows, macOS, Linux, and Android (via Termux) with automatic backend selection
Documentation
- Quick Start - Get up and running in minutes
- Complete Guide - Comprehensive documentation
- Opencode configuration examples
- API reference
- Troubleshooting guide
- Performance tuning
- Advanced configuration
- Configuration - Customize your setup
- Interactive Mode - Using the menu system
- Tips & Help - Learn about models, quantization, and optimization
Quick Start
Installation
Windows (PowerShell)
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run installer
.\scripts\install.bat
macOS/Linux
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run installer
chmod +x scripts/install.sh
./scripts/install.sh
Android (Termux)
# In Termux app
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run Termux installer
chmod +x scripts/install-termux.sh
./scripts/install-termux.sh
Note: Android support is limited to small models (1-3B) due to memory constraints. Requires 8GB+ RAM.
Usage
Start the Swarm
# Auto-detect hardware and start
python -m local_swarm
# Or use the CLI
python main.py
On first run, the tool will:
- Scan your hardware (GPU, RAM, CPU)
- Select the optimal model and quantization
- Download the model (one-time)
- Start multiple instances based on available memory
- Expose the API at http://localhost:8000
Example startup output:
🔍 Detecting hardware...
OS: Windows 11
GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
CPU: 16 cores
RAM: 32 GB
📊 Optimal configuration:
Model: Qwen 2.5 Coder 3B
Quantization: Q4_K_M (1.8 GB per instance)
Instances: 8 (using 14.4 GB VRAM)
⬇️ Downloading model...
Progress: 100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 1.8/1.8 GB
🚀 Starting swarm...
Worker 1: Ready (GPU:0)
Worker 2: Ready (GPU:0)
...
Worker 8: Ready (GPU:0)
✅ Local Swarm is running!
API: http://localhost:8000/v1
Models: http://localhost:8000/v1/models
Health: http://localhost:8000/health
💡 Configure opencode to use:
base_url: http://localhost:8000/v1
api_key: any (not used)
Configure opencode
Add to your opencode configuration:
{
"model": {
"provider": "openai",
"base_url": "http://localhost:8000/v1",
"api_key": "not-needed",
"model": "local-swarm"
}
}
MCP Server (Optional)
For tighter integration with AI assistants, enable the MCP server:
python main.py --mcp
This runs alongside the HTTP API and exposes tools AI assistants can use:
- get_hardware_info - Query CPU, GPU, and RAM
- get_swarm_status - Check worker health
- generate_code - Generate code with consensus
- list_available_models - See what models can run
- get_worker_details - Get detailed worker statistics
MCP allows AI assistants to automatically query your hardware capabilities and select appropriate models.
Configuration
Create a config.yaml file for customization:
server:
host: "127.0.0.1"
port: 8000
swarm:
consensus_strategy: "similarity" # similarity, quality, fastest
min_instances: 2
max_instances: 8
hardware:
gpu_memory_fraction: 1.0 # Use 100% of GPU VRAM
ram_fraction: 0.5 # Use 50% of system RAM for CPU/Apple Silicon
federation:
enabled: true
discovery_port: 8765
federation_port: 8766
max_peers: 10
models:
cache_dir: "~/.local_swarm/models"
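Values like these are worth sanity-checking before any workers start. The following is a minimal sketch, not the project's actual loader: it assumes the YAML has already been parsed into a plain dict (for example with PyYAML), and the `validate_config` name, the defaults, and the specific rules are illustrative assumptions based on the example above.

```python
# Hypothetical validator for the config.yaml fields shown above.
# Assumes the YAML was already parsed into a plain dict.
ALLOWED_STRATEGIES = {"similarity", "quality", "fastest"}

# Defaults mirror the example file; the real defaults may differ.
DEFAULT_FRACTIONS = {"gpu_memory_fraction": 1.0, "ram_fraction": 0.5}


def validate_config(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    server = cfg.get("server", {})
    swarm = cfg.get("swarm", {})
    hardware = cfg.get("hardware", {})

    strategy = swarm.get("consensus_strategy", "similarity")
    if strategy not in ALLOWED_STRATEGIES:
        problems.append(f"unknown consensus_strategy: {strategy!r}")

    lowest = swarm.get("min_instances", 2)
    highest = swarm.get("max_instances", 8)
    if not 1 <= lowest <= highest:
        problems.append(
            f"min_instances ({lowest}) must be between 1 and max_instances ({highest})"
        )

    for key, default in DEFAULT_FRACTIONS.items():
        value = hardware.get(key, default)
        if not 0.0 < value <= 1.0:
            problems.append(f"{key} must be in (0, 1], got {value}")

    port = server.get("port", 8000)
    if not 1 <= port <= 65535:
        problems.append(f"port out of range: {port}")

    return problems
```

Returning a list of problems rather than raising on the first one lets a startup routine report every misconfigured field at once.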
CLI Options
# Show hardware detection without starting
python -m local_swarm --detect
# Use specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Use specific port
python -m local_swarm --port 8080
# Force number of instances
python -m local_swarm --instances 4
# Download models only (no server)
python -m local_swarm --download-only
# Enable MCP server alongside HTTP API
python -m local_swarm --mcp
# Show help
python -m local_swarm --help
# Auto-detect without interactive menu
python -m local_swarm --auto
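The --model value above appears to follow a family:size:quant pattern (qwen2.5-coder:3b:q4). As a hedged illustration of how such a string could be split, here is a small sketch; `ModelSpec` and `parse_model_spec` are hypothetical names, and the real CLI may accept other forms.

```python
from typing import NamedTuple


class ModelSpec(NamedTuple):
    family: str  # e.g. "qwen2.5-coder"
    size: str    # e.g. "3b"
    quant: str   # e.g. "q4"


def parse_model_spec(spec: str) -> ModelSpec:
    """Split a 'family:size:quant' string into its three parts."""
    parts = spec.strip().lower().split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected family:size:quant, got {spec!r}")
    return ModelSpec(*parts)
```

For example, parse_model_spec("qwen2.5-coder:3b:q4") yields family "qwen2.5-coder", size "3b", and quant "q4"; a string with a missing part raises ValueError instead of silently guessing.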
Interactive Mode
By default, Local Swarm starts in interactive mode with a menu system:
======================================================================
Local Swarm - Model Selection
======================================================================
----------------------------------------------------------------------
Hardware Detection
----------------------------------------------------------------------
Operating System: Darwin
CPU: 12 cores
System RAM: 24.0 GB
Available RAM: 6.2 GB
GPU Detected:
Name: Apple Silicon GPU
Type: Apple Silicon (Unified Memory)
Total Memory: 24.0 GB
Available for LLMs: 12.0 GB
(Using 50% of system RAM)
----------------------------------------------------------------------
Configuration Options
----------------------------------------------------------------------
💡 Recommended: Qwen 2.5 Coder 7b (q6_k)
Instances: 2
Memory: 12.0 GB
[1] Recommended Configuration - Qwen 2.5 Coder 7b (q6_k) with 2 instances
[2] Browse All Configurations - See all models that fit your hardware
[3] Custom Configuration - Specify exact model and number of instances
Enter your choice:
Menu Options
- Recommended Configuration - Automatically selects the best model and instance count for your hardware
- Browse All Configurations - Shows all feasible models that fit in your available memory
- Custom Configuration - Step-by-step wizard to select:
- Model family (Qwen, DeepSeek, CodeLlama)
- Model size (3B, 7B, 14B)
- Quantization level (Q4, Q5, Q6)
- Number of instances (1 to max supported)
To skip the menu and use auto-detection, pass the --auto flag.
Startup Summary
When starting, Local Swarm displays a comprehensive summary:
======================================================================
Local Swarm - Startup Summary
======================================================================
----------------------------------------------------------------------
Hardware Detection
----------------------------------------------------------------------
Operating System: Darwin
CPU: 12 cores
System RAM: 24.0 GB
Available RAM: 6.2 GB
GPU Detected:
Name: Apple Silicon GPU
Type: Apple Silicon (Unified Memory)
Total Memory: 24.0 GB
Available for LLMs: 12.0 GB
----------------------------------------------------------------------
Model Configuration
----------------------------------------------------------------------
Model: Qwen 2.5 Coder 7b (q6_k)
Description: Alibaba's code-focused model
Instances: 2
Memory per Instance: 6.0 GB
Total Memory: 12.0 GB
Utilization: 100.0% of available
======================================================================
How It Works
Hardware Detection
The tool automatically detects your system:
- Windows: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI)
- macOS: Apple Silicon via Metal, unified memory model
- Linux: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI/OpenCL)
- Android: Qualcomm Adreno GPUs (detected via Termux; inference currently runs on the CPU)
Supported Backends:
- NVIDIA: CUDA via llama.cpp
- AMD: ROCm via llama.cpp (Linux, Windows experimental)
- Intel: OneAPI/SYCL via llama.cpp
- Apple Silicon: Metal via MLX
- Qualcomm: CPU fallback on llama.cpp (Android/Termux)
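The platform-to-backend mapping above can be sketched roughly as follows. This is an assumption-laden illustration, not the project's detector: it checks for vendor command-line tools (nvidia-smi, rocm-smi, and oneAPI's sycl-ls) as a stand-in for the NVML/ROCm/OneAPI queries described above, and `pick_backend` is a hypothetical name.

```python
import platform
import shutil


def pick_backend(system=None, machine=None, has_tool=shutil.which):
    """Return a (backend, accelerator) pair for the current machine.

    `has_tool` is injectable so the logic can be exercised without the
    vendor tools installed; by default it looks them up on PATH.
    """
    system = system or platform.system()
    machine = machine or platform.machine()

    # Apple Silicon reports Darwin/arm64 and uses MLX on Metal.
    if system == "Darwin" and machine == "arm64":
        return ("mlx", "metal")
    # Everything else goes through llama.cpp with the first accelerator found.
    if has_tool("nvidia-smi"):
        return ("llama.cpp", "cuda")
    if has_tool("rocm-smi"):
        return ("llama.cpp", "rocm")
    if has_tool("sycl-ls"):
        return ("llama.cpp", "sycl")
    # No supported GPU tooling found (includes Android/Termux): CPU fallback.
    return ("llama.cpp", "cpu")
```

Checking for a tool on PATH only shows that the driver stack is installed, not that a usable GPU is present, so a real detector would still need to query the device itself.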
Model Selection
Based on available memory:
- Discrete GPU: Use 100% of VRAM minus OS overhead
- Apple Silicon: Use 50% of unified RAM
- CPU-only: Use 50% of system RAM
The algorithm selects:
- Largest model size that fits
- Highest quantization quality possible
- Maximum instances (2-8) based on memory
Example configurations:
| Hardware | Model | Quant | Instances | Memory Used |
|---|---|---|---|---|
| RTX 4090 24GB | Qwen 2.5 14B | Q4_K_M | 2 | ~17.6 GB |
| RTX 4060 Ti 16GB | Qwen 2.5 7B | Q4_K_M | 3 | ~13.5 GB |
| RTX 4060 Ti 8GB | Qwen 2.5 3B | Q6_K | 4 | ~10.4 GB |
| RX 7900 XTX 24GB | Qwen 2.5 14B | Q4_K_M | 2 | ~17.6 GB |
| Arc A770 16GB | Qwen 2.5 7B | Q5_K_M | 2 | ~10.4 GB |
| M4 Max 64GB | Qwen 2.5 14B | Q4_K_M | 4 | ~35.2 GB |
| M3 Pro 36GB | Qwen 2.5 7B | Q4_K_M | 4 | ~18 GB |
| M1 8GB | Qwen 2.5 3B | Q4_K_M | 2 | ~3.6 GB |
| Snapdragon 8 Gen 3 | Qwen 2.5 3B | Q4_K_M | 1 | ~1.8 GB |
| CPU 32GB | Qwen 2.5 3B | Q4_K_M | 8 | ~14.4 GB |
| Federated (3 machines) | Qwen 2.5 7B | Q4_K_M | 9 | ~40.5 GB |
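One possible reading of the selection rules above is sketched below. It is not the project's algorithm: the per-instance sizes are taken or derived from figures in this README (for example 13.5 GB / 3 instances = 4.5 GB for 7B Q4_K_M), the OS overhead value is not specified here and defaults to 0, and the priority order (model size first, then quantization, then instance count) is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

# Per-instance sizes in GB, derived from examples in this README. Illustrative only.
SIZES_GB = {
    ("14B", "Q4_K_M"): 8.8,
    ("7B", "Q6_K"): 6.0,
    ("7B", "Q5_K_M"): 5.2,
    ("7B", "Q4_K_M"): 4.5,
    ("3B", "Q6_K"): 2.6,
    ("3B", "Q4_K_M"): 1.8,
}
MODEL_ORDER = ("14B", "7B", "3B")            # largest first
QUANT_ORDER = ("Q6_K", "Q5_K_M", "Q4_K_M")   # highest quality first
MEMORY_FRACTION = {"gpu": 1.0, "apple": 0.5, "cpu": 0.5}


@dataclass(frozen=True)
class Choice:
    model: str
    quant: str
    instances: int
    total_gb: float


def memory_budget_gb(kind: str, total_gb: float, os_overhead_gb: float = 0.0) -> float:
    """Memory available for models: a fraction of the total, minus overhead."""
    return max(0.0, total_gb * MEMORY_FRACTION[kind] - os_overhead_gb)


def select_config(budget_gb: float, min_instances: int = 2,
                  max_instances: int = 8) -> Optional[Choice]:
    """Pick the largest model, then the best quant, then as many instances as fit."""
    for model in MODEL_ORDER:
        for quant in QUANT_ORDER:
            size = SIZES_GB.get((model, quant))
            if size is None:
                continue  # no size known for this combination
            instances = min(max_instances, int(budget_gb // size))
            if instances >= min_instances:
                return Choice(model, quant, instances, round(instances * size, 1))
    return None  # nothing fits with the required minimum instance count
```

With a 12 GB budget this returns the 7B model at Q6_K with 2 instances, matching the interactive menu example in this README, and an 8 GB Apple Silicon machine (4 GB budget) gets 3B at Q4_K_M with 2 instances, matching the M1 row. Other rows in the table suggest the real implementation sometimes prefers more instances over a higher quantization, so treat the ordering here as a guess.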
Swarm Consensus
For each request, the swarm:
- Sends the prompt to all running instances
- Collects responses in parallel
- Runs consensus algorithm:
- Similarity: Groups responses by semantic similarity, returns largest group
- Quality: Scores responses on completeness and code quality
- Fastest: Returns the quickest response
- Returns the winning response via OpenAI-compatible API
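The fan-out and similarity-voting steps above can be sketched as follows. This is a simplified illustration under stated assumptions: workers are modelled as plain callables, and difflib's character-level ratio stands in for semantic similarity, which the real consensus engine presumably measures differently. The function names and the 0.8 threshold are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def collect_responses(workers, prompt):
    """Send the prompt to every worker in parallel and gather the replies."""
    with ThreadPoolExecutor(max_workers=max(1, len(workers))) as pool:
        # pool.map preserves worker order in the returned list.
        return list(pool.map(lambda worker: worker(prompt), workers))


def similarity_consensus(responses, threshold=0.8):
    """Group similar responses and return one from the largest group."""
    if not responses:
        raise ValueError("no responses to vote on")
    groups = []  # each group is a list; its first item is the representative
    for text in responses:
        for group in groups:
            if SequenceMatcher(None, group[0], text).ratio() >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])  # no similar group found, start a new one
    # On ties, max() keeps the group that was created first.
    return max(groups, key=len)[0]
```

Comparing each response only against a group's representative keeps the grouping cheap, at the cost of depending on arrival order; a pairwise clustering would be more robust but slower.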
Network Federation
Run Local Swarm on multiple machines in the same network to create a "federated swarm":
Example Setup:
- Windows PC (RTX 4060 Ti): 4 instances
- Mac Mini (M1): 2 instances
- MacBook (M4): 3 instances
- Total: 9 instances voting on every request
How it works:
- Each machine auto-discovers others via mDNS/Bonjour
- Each swarm generates responses independently
- Local consensus picks best response per machine
- Cross-swarm consensus votes across all machines
- Best response returned to client
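The two-stage vote described above (local consensus per machine, then cross-swarm consensus) can be illustrated with a deliberately simple sketch. It uses exact-match majority voting as a stand-in for whichever consensus strategy is configured, and the function names and machine labels are hypothetical.

```python
from collections import Counter


def majority(responses):
    """Return the most common response (ties go to the earliest seen)."""
    counts = Counter(r.strip() for r in responses)
    return counts.most_common(1)[0][0]


def federated_consensus(responses_by_machine):
    """Pick a winner on each machine, then vote across those winners."""
    local_winners = [
        majority(responses)
        for responses in responses_by_machine.values()
        if responses  # skip machines that returned nothing
    ]
    if not local_winners:
        raise ValueError("no responses from any machine")
    return majority(local_winners)
```

In this sketch every machine gets one vote in the second stage regardless of how many instances it runs; weighting by instance count is an alternative design that this README does not specify either way.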
To enable federation:
federation:
enabled: true
discovery_port: 8765 # mDNS/Bonjour discovery
federation_port: 8766 # Inter-swarm communication
Machines will automatically discover each other within 10 seconds.
API Endpoints
GET /v1/models
List available models
POST /v1/chat/completions
Chat completion with consensus
Request:
{
"model": "local-swarm",
"messages": [
{"role": "user", "content": "Write a Python function to sort a list"}
]
}
Response:
{
"id": "chatcmpl-abc123",
"object": "chat.completion",
"created": 1234567890,
"model": "local-swarm",
"choices": [{
"index": 0,
"message": {
"role": "assistant",
"content": "def sort_list(lst):\n return sorted(lst)"
},
"finish_reason": "stop"
}]
}
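The request and response shapes above can be exercised from Python with only the standard library. This is a minimal sketch that assumes the server is reachable at the default http://localhost:8000/v1 address; the helper names are illustrative and not part of Local Swarm.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # default address from this README


def build_chat_request(prompt, base_url=BASE_URL, model="local-swarm"):
    """Build a POST request for /chat/completions with a single user message."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_content(response_body):
    """Pull the assistant's text out of a chat.completion response body."""
    return json.loads(response_body)["choices"][0]["message"]["content"]


def ask(prompt, timeout=120):
    """Send the prompt to a running swarm and return the winning response."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_content(resp.read())
```

With the swarm running, ask("Write a Python function to sort a list") returns the text of the consensus answer. The generous timeout is a guess, since every instance has to finish generating before a vote can take place.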
GET /health
Health check
GET /metrics
Prometheus metrics (optional)
Supported Models
Currently supported models (auto-selected based on hardware):
- Qwen 2.5 Coder (3B, 7B, 14B) - Recommended for coding tasks
- DeepSeek Coder (1.3B, 6.7B, 33B) - Good alternative
- CodeLlama (7B, 13B, 34B) - Meta's code model
All models support GGUF quantization:
- Q4_K_M - Good quality, smallest size (recommended)
- Q5_K_M - Better quality
- Q6_K - Best quality
Troubleshooting
Out of Memory
If you get OOM errors:
# Reduce instances
python -m local_swarm --instances 2
# Or use smaller model
python -m local_swarm --model qwen2.5-coder:3b:q4
Slow Performance
- Check GPU utilization with nvidia-smi (NVIDIA) or Activity Monitor (macOS)
- Ensure model is cached (first run downloads to ~/.local_swarm/models)
- Try reducing instances to avoid contention
Windows: CUDA not detected
Make sure NVIDIA drivers are installed:
nvidia-smi
If this fails, reinstall the drivers from nvidia.com.
macOS: MLX not found
pip install mlx-lm
Linux: AMD GPU not detected
Ensure ROCm is installed:
rocm-smi
If not found, install from https://www.amd.com/en/developer/rocm-hub.html
Linux: Intel GPU not detected
Install Intel oneAPI:
# Ubuntu/Debian
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | sudo gpg --dearmor -o /usr/share/keyrings/intel-oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-basekit
Android: Termux issues
- Ensure Termux is installed from F-Droid (not Play Store)
- Run pkg update before installation
- Limited to small models (1-3B) due to RAM constraints
- Use CPU backend only (no GPU acceleration on Android yet)
Requirements
- Python 3.9+
- 4GB+ RAM (8GB+ recommended)
- Optional: NVIDIA/AMD/Intel GPU with 4GB+ VRAM
- Optional: Apple Silicon Mac
- Optional: Android device with 8GB+ RAM (via Termux)
Development
# Install dev dependencies
pip install -r requirements-dev.txt
# Run tests
pytest
# Run specific platform tests
pytest tests/test_hardware.py -v
# Format code
black src/
ruff check src/
Architecture
Single Machine
┌─────────────────────────────────────┐
│ OpenAI API Client │
│ (opencode, etc.) │
└─────────────┬───────────────────────┘
│ HTTP
▼
┌─────────────────────────────────────┐
│ Local Swarm API Server │
│ (FastAPI / localhost:8000) │
└─────────────┬───────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Swarm Manager │
│ ┌─────────┐ ┌─────────┐ │
│ │ Worker 1│ │ Worker 2│ ... │
│ │(LLM #1) │ │(LLM #2) │ │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│ └─────┬─────┘ │
│ ▼ │
│ Consensus Engine │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Backend (llama.cpp / MLX) │
│ ┌─────────────────────┐ │
│ │ GGUF/MLX Model │ │
│ │ (Qwen/Codellama) │ │
│ └─────────────────────┘ │
└─────────────────────────────────────┘
│
▼
┌─────────────────────────────────────┐
│ Hardware (GPU/CPU/Apple Silicon) │
└─────────────────────────────────────┘
Federated Swarm (Multiple Machines)
┌─────────────────────────────────────────────────────────────┐
│ Local Network │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Windows PC │ │ Mac Mini │ │ MacBook │ │
│ │ (RTX 4060) │ │ (M1) │ │ (M4) │ │
│ │ 4 instances │ │ 2 instances │ │ 3 instances │ │
│ └──────┬───────┘ └──────┬───────┘ └──────┬───────┘ │
│ │ │ │ │
│ │ │ │ │
│ └───────────────────┼───────────────────┘ │
│ │ │
│ ┌────────┴────────┐ │
│ │ Cross-Swarm │ │
│ │ Consensus │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ opencode │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
License
MIT License - See LICENSE file
Contributing
Contributions welcome! Please read CONTRIBUTING.md first.