Local Swarm
Automatically configure and run a swarm of small coding LLMs optimized for your hardware. Provides an OpenAI-compatible API for seamless integration with opencode and other tools.
Features
- Interactive Menu System: Easy-to-use menu for selecting model configurations, browsing options, or creating custom setups
- Hardware Auto-Detection: Automatically detects your GPU (NVIDIA, AMD, Intel), Apple Silicon, Qualcomm (Android), or CPU and selects optimal settings
- Smart Model Selection: Chooses the best model, quantization, and instance count based on available VRAM/RAM
- Startup Summary: Clear display of detected hardware, selected model, resource usage, and worker status
- Swarm Consensus: Multiple LLM instances vote on the best response for higher quality outputs
- Network Federation: Multiple machines on the same network can join into a "federated swarm" for distributed consensus
- OpenAI-Compatible API: Drop-in replacement for the OpenAI API at http://localhost:8000/v1
- MCP Server: Model Context Protocol support for tight AI assistant integration
- Cross-Platform: Works on Windows, macOS, Linux, and Android (via Termux) with automatic backend selection
Documentation
- Quick Start - Get up and running in minutes
- Complete Guide - Comprehensive documentation
  - Opencode configuration examples
  - API reference
  - Troubleshooting guide
  - Performance tuning
  - Advanced configuration
- Configuration - Customize your setup
- Interactive Mode - Using the menu system
- Tips & Help - Learn about models, quantization, and optimization
Quick Start
Installation
Windows (PowerShell)
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run installer
.\scripts\install.bat
macOS/Linux
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run installer
chmod +x scripts/install.sh
./scripts/install.sh
Android (Termux)
# In Termux app
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run Termux installer
chmod +x scripts/install-termux.sh
./scripts/install-termux.sh
Note: Android support is limited to small models (1-3B) due to memory constraints. Requires 8GB+ RAM.
Usage
Start the Swarm
# Auto-detect hardware and start
python -m local_swarm
# Or use the CLI
python main.py
On first run, the tool will:
- Scan your hardware (GPU, RAM, CPU)
- Select the optimal model and quantization
- Download the model (one-time)
- Start multiple instances based on available memory
- Expose the API at http://localhost:8000
Example startup output:
🔍 Detecting hardware...
OS: Windows 11
GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
CPU: 16 cores
RAM: 32 GB
📊 Optimal configuration:
Model: Qwen 2.5 Coder 3B
Quantization: Q4_K_M (1.8 GB per instance)
Instances: 8 (using 14.4 GB VRAM)
⬇️ Downloading model...
Progress: 100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 1.8/1.8 GB
🚀 Starting swarm...
Worker 1: Ready (GPU:0)
Worker 2: Ready (GPU:0)
...
Worker 8: Ready (GPU:0)
✅ Local Swarm is running!
API: http://localhost:8000/v1
Models: http://localhost:8000/v1/models
Health: http://localhost:8000/health
💡 Configure opencode to use:
base_url: http://localhost:8000/v1
api_key: any (not used)
Configure opencode
Add to your opencode configuration:
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
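Any OpenAI-compatible client can be pointed at the same endpoint. A minimal connectivity check, assuming the openai Python package is installed (it is not bundled with Local Swarm):

```python
# Minimal sketch: verify the local endpoint with the openai client
# (pip install openai). The api_key value is arbitrary; the server ignores it.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function to reverse a string"}],
)
print(response.choices[0].message.content)
```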
MCP Server (Optional)
For tighter integration with AI assistants, enable the MCP server:
python main.py --mcp
This runs alongside the HTTP API and exposes tools AI assistants can use:
- get_hardware_info - Query CPU, GPU, and RAM
- get_swarm_status - Check worker health
- generate_code - Generate code with consensus
- list_available_models - See what models can run
- get_worker_details - Get detailed worker statistics
MCP allows AI assistants to automatically query your hardware capabilities and select appropriate models.
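As an illustration only, a tool such as get_hardware_info could be exposed roughly as follows with the MCP Python SDK. This is a hedged sketch, not Local Swarm's actual server code, and the tool body is a placeholder:

```python
# Hypothetical sketch using the MCP Python SDK (pip install mcp) and psutil.
# Local Swarm's real get_hardware_info implementation may differ.
from mcp.server.fastmcp import FastMCP
import psutil

mcp = FastMCP("local-swarm")

@mcp.tool()
def get_hardware_info() -> dict:
    """Report CPU core count and total system RAM."""
    return {
        "cpu_cores": psutil.cpu_count(logical=True),
        "ram_gb": round(psutil.virtual_memory().total / 1e9, 1),
    }

if __name__ == "__main__":
    mcp.run()
```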
Configuration
Create a config.yaml file for customization:
server:
  host: "127.0.0.1"
  port: 8000
swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8
hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5         # Use 50% of system RAM for CPU/Apple Silicon
federation:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  max_peers: 10
models:
  cache_dir: "~/.local_swarm/models"
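A minimal sketch of how such a file could be loaded and merged with defaults, assuming PyYAML; the key names mirror the example above, but the loader itself is illustrative:

```python
# Illustrative config loader using PyYAML (pip install pyyaml).
# Only a subset of keys is shown; Local Swarm's real loader may differ.
import copy
from pathlib import Path
import yaml

DEFAULTS = {
    "server": {"host": "127.0.0.1", "port": 8000},
    "swarm": {"consensus_strategy": "similarity", "min_instances": 2, "max_instances": 8},
}

def load_config(path="config.yaml"):
    config = copy.deepcopy(DEFAULTS)
    file = Path(path)
    if file.exists():
        overrides = yaml.safe_load(file.read_text()) or {}
        for section, values in overrides.items():
            config.setdefault(section, {}).update(values or {})
    return config

print(load_config()["swarm"]["consensus_strategy"])
```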
CLI Options
# Show hardware detection without starting
python -m local_swarm --detect
# Use specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Use specific port
python -m local_swarm --port 8080
# Force number of instances
python -m local_swarm --instances 4
# Download models only (no server)
python -m local_swarm --download-only
# Enable MCP server alongside HTTP API
python -m local_swarm --mcp
# Show help
python -m local_swarm --help
# Auto-detect without interactive menu
python -m local_swarm --auto
Interactive Mode
By default, Local Swarm starts in interactive mode with a menu system:
======================================================================
Local Swarm - Model Selection
======================================================================
----------------------------------------------------------------------
Hardware Detection
----------------------------------------------------------------------
Operating System: Darwin
CPU: 12 cores
System RAM: 24.0 GB
Available RAM: 6.2 GB
GPU Detected:
Name: Apple Silicon GPU
Type: Apple Silicon (Unified Memory)
Total Memory: 24.0 GB
Available for LLMs: 12.0 GB
(Using 50% of system RAM)
----------------------------------------------------------------------
Configuration Options
----------------------------------------------------------------------
💡 Recommended: Qwen 2.5 Coder 7b (q6_k)
Instances: 2
Memory: 12.0 GB
[1] Recommended Configuration - Qwen 2.5 Coder 7b (q6_k) with 2 instances
[2] Browse All Configurations - See all models that fit your hardware
[3] Custom Configuration - Specify exact model and number of instances
Enter your choice:
Menu Options
- Recommended Configuration - Automatically selects the best model and instance count for your hardware
- Browse All Configurations - Shows all feasible models that fit in your available memory
- Custom Configuration - Step-by-step wizard to select:
  - Model family (Qwen, DeepSeek, CodeLlama)
  - Model size (3B, 7B, 14B)
  - Quantization level (Q4, Q5, Q6)
  - Number of instances (1 to max supported)
To skip the menu and use auto-detection, use the --auto flag.
Startup Summary
When starting, Local Swarm displays a comprehensive summary:
======================================================================
Local Swarm - Startup Summary
======================================================================
----------------------------------------------------------------------
Hardware Detection
----------------------------------------------------------------------
Operating System: Darwin
CPU: 12 cores
System RAM: 24.0 GB
Available RAM: 6.2 GB
GPU Detected:
Name: Apple Silicon GPU
Type: Apple Silicon (Unified Memory)
Total Memory: 24.0 GB
Available for LLMs: 12.0 GB
----------------------------------------------------------------------
Model Configuration
----------------------------------------------------------------------
Model: Qwen 2.5 Coder 7b (q6_k)
Description: Alibaba's code-focused model
Instances: 2
Memory per Instance: 6.0 GB
Total Memory: 12.0 GB
Utilization: 100.0% of available
======================================================================
How It Works
Hardware Detection
The tool automatically detects your system:
- Windows: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI)
- macOS: Apple Silicon via Metal, unified memory model
- Linux: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI/OpenCL)
- Android: Qualcomm Adreno GPUs (via Termux)
Supported Backends:
- NVIDIA: CUDA via llama.cpp
- AMD: ROCm via llama.cpp (Linux, Windows experimental)
- Intel: OneAPI/SYCL via llama.cpp
- Apple Silicon: Metal via MLX
- Qualcomm: CPU fallback on llama.cpp (Android/Termux)
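For the NVIDIA and CPU paths, detection can be sketched as below; this assumes the pynvml and psutil packages and is illustrative rather than the project's actual detection code:

```python
# Illustrative detection sketch for the NVIDIA/CPU paths (pip install pynvml psutil).
# ROCm, OneAPI, and Metal detection follow the same pattern with their own probes.
import psutil

def detect_memory_budget():
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        name = pynvml.nvmlDeviceGetName(handle)
        vram_gb = pynvml.nvmlDeviceGetMemoryInfo(handle).total / 1e9
        return {"backend": "cuda", "device": name, "memory_gb": vram_gb}
    except Exception:
        # No NVIDIA GPU found: fall back to CPU and budget 50% of system RAM.
        ram_gb = psutil.virtual_memory().total / 1e9
        return {"backend": "cpu", "device": "CPU", "memory_gb": ram_gb * 0.5}

print(detect_memory_budget())
```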
Model Selection
Based on available memory:
- External GPU: Use 100% of VRAM minus OS overhead
- Apple Silicon: Use 50% of unified RAM
- CPU-only: Use 50% of system RAM
The algorithm selects:
- Largest model size that fits
- Highest quantization quality possible
- Maximum instances (2-8) based on memory
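A minimal sketch of this sizing rule; the model sizes below are illustrative GGUF footprints, not the exact figures Local Swarm uses:

```python
# Illustrative sizing logic: largest model and highest quant that still allows
# at least min_instances copies within the available memory budget.
MODELS = [  # (name, quant, approx size in GB), best first
    ("qwen2.5-coder-14b", "q4_k_m", 8.8),
    ("qwen2.5-coder-7b", "q6_k", 6.0),
    ("qwen2.5-coder-7b", "q4_k_m", 4.5),
    ("qwen2.5-coder-3b", "q6_k", 2.6),
    ("qwen2.5-coder-3b", "q4_k_m", 1.8),
]

def pick_configuration(available_gb, min_instances=2, max_instances=8):
    for name, quant, size_gb in MODELS:
        instances = min(int(available_gb // size_gb), max_instances)
        if instances >= min_instances:
            return {"model": name, "quant": quant, "instances": instances}
    return None  # nothing fits with the requested instance count

print(pick_configuration(12.0))  # e.g. Apple Silicon with a 12 GB budget
```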
Example configurations:
| Hardware | Model | Quant | Instances | Memory Used |
|---|---|---|---|---|
| RTX 4090 24GB | Qwen 2.5 14B | Q4_K_M | 2 | ~17.6 GB |
| RTX 4060 Ti 16GB | Qwen 2.5 7B | Q4_K_M | 3 | ~13.5 GB |
| RTX 4060 Ti 8GB | Qwen 2.5 3B | Q6_K | 4 | ~10.4 GB |
| RX 7900 XTX 24GB | Qwen 2.5 14B | Q4_K_M | 2 | ~17.6 GB |
| Arc A770 16GB | Qwen 2.5 7B | Q5_K_M | 2 | ~10.4 GB |
| M4 Max 64GB | Qwen 2.5 14B | Q4_K_M | 4 | ~35.2 GB |
| M3 Pro 36GB | Qwen 2.5 7B | Q4_K_M | 4 | ~18 GB |
| M1 8GB | Qwen 2.5 3B | Q4_K_M | 2 | ~3.6 GB |
| Snapdragon 8 Gen 3 | Qwen 2.5 3B | Q4_K_M | 1 | ~1.8 GB |
| CPU 32GB | Qwen 2.5 3B | Q4_K_M | 8 | ~14.4 GB |
| Federated (3 machines) | Qwen 2.5 7B | Q4_K_M | 9 | ~40.5 GB |
Swarm Consensus
For each request, the swarm:
- Sends the prompt to all running instances
- Collects responses in parallel
- Runs a consensus algorithm:
  - Similarity: Groups responses by semantic similarity and returns a response from the largest group
  - Quality: Scores responses on completeness and code quality
  - Fastest: Returns the quickest response
- Returns the winning response via OpenAI-compatible API
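An illustrative sketch of the similarity strategy using plain string similarity; the real consensus engine may use embeddings or another metric:

```python
# Sketch of similarity-based consensus: cluster near-identical responses and
# return one answer from the largest cluster. difflib stands in for whatever
# similarity measure the real engine uses.
from difflib import SequenceMatcher

def similarity_consensus(responses, threshold=0.8):
    groups = []  # each group holds mutually similar responses
    for resp in responses:
        for group in groups:
            if SequenceMatcher(None, resp, group[0]).ratio() >= threshold:
                group.append(resp)
                break
        else:
            groups.append([resp])
    return max(groups, key=len)[0]

answers = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def sort_list(lst):\n    return sorted(lst)",
    "lst.sort()\nreturn lst",
]
print(similarity_consensus(answers))
```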
Network Federation
Run Local Swarm on multiple machines in the same network to create a "federated swarm":
Example Setup:
- Windows PC (RTX 4060 Ti): 4 instances
- Mac Mini (M1): 2 instances
- MacBook (M4): 3 instances
- Total: 9 instances voting on every request
How it works:
- Each machine auto-discovers others via mDNS/Bonjour
- Each swarm generates responses independently
- Local consensus picks best response per machine
- Cross-swarm consensus votes across all machines
- Best response returned to client
To enable federation:
federation:
  enabled: true
  discovery_port: 8765   # mDNS/Bonjour discovery
  federation_port: 8766  # Inter-swarm communication
Machines will automatically discover each other within 10 seconds.
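Peer discovery along these lines could be implemented with the python-zeroconf package; the sketch below is illustrative, and the service type name and IP address are placeholders rather than values the project actually uses:

```python
# Illustrative mDNS discovery sketch (pip install zeroconf). The service type
# "_localswarm._tcp.local." and the IP address are placeholders.
import socket
import time
from zeroconf import ServiceBrowser, ServiceInfo, Zeroconf

SERVICE_TYPE = "_localswarm._tcp.local."

class PeerListener:
    def add_service(self, zc, type_, name):
        info = zc.get_service_info(type_, name)
        if info and info.addresses:
            addr = socket.inet_ntoa(info.addresses[0])
            print(f"Discovered swarm peer {name} at {addr}:{info.port}")

    def remove_service(self, zc, type_, name):
        print(f"Peer left: {name}")

    def update_service(self, zc, type_, name):
        pass

zc = Zeroconf()
# Advertise this machine's federation port, then browse for other swarms.
info = ServiceInfo(
    SERVICE_TYPE,
    f"{socket.gethostname()}.{SERVICE_TYPE}",
    addresses=[socket.inet_aton("192.168.1.10")],  # placeholder local IP
    port=8766,
)
zc.register_service(info)
ServiceBrowser(zc, SERVICE_TYPE, PeerListener())
time.sleep(10)  # roughly the discovery window mentioned above
zc.close()
```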
API Endpoints
GET /v1/models
List available models
POST /v1/chat/completions
Chat completion with consensus
Request:
{
  "model": "local-swarm",
  "messages": [
    {"role": "user", "content": "Write a Python function to sort a list"}
  ]
}
Response:
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-swarm",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "def sort_list(lst):\n    return sorted(lst)"
    },
    "finish_reason": "stop"
  }]
}
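The same call can be made directly over HTTP; a minimal sketch using the requests package (an assumption, not a bundled dependency):

```python
# Minimal sketch of calling the consensus endpoint with requests
# (pip install requests).
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "local-swarm",
        "messages": [{"role": "user", "content": "Write a Python function to sort a list"}],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```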
GET /health
Health check
GET /metrics
Prometheus metrics (optional)
Supported Models
Currently supported models (auto-selected based on hardware):
- Qwen 2.5 Coder (3B, 7B, 14B) - Recommended for coding tasks
- DeepSeek Coder (1.3B, 6.7B, 33B) - Good alternative
- CodeLlama (7B, 13B, 34B) - Meta's code model
All models support GGUF quantization:
- Q4_K_M - Good quality, smallest size (recommended)
- Q5_K_M - Better quality
- Q6_K - Best quality
Troubleshooting
Out of Memory
If you get OOM errors:
# Reduce instances
python -m local_swarm --instances 2
# Or use smaller model
python -m local_swarm --model qwen2.5-coder:3b:q4
Slow Performance
- Check GPU utilization with nvidia-smi (NVIDIA) or Activity Monitor (macOS)
- Ensure the model is cached (first run downloads to ~/.local_swarm/models)
- Try reducing instances to avoid contention
Windows: CUDA not detected
Make sure NVIDIA drivers are installed:
nvidia-smi
If this fails, reinstall drivers from nvidia.com
macOS: MLX not found
pip install mlx-lm
Linux: AMD GPU not detected
Ensure ROCm is installed:
rocm-smi
If not found, install from https://www.amd.com/en/developer/rocm-hub.html
Linux: Intel GPU not detected
Install Intel oneAPI:
# Ubuntu/Debian
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | sudo gpg --dearmor -o /usr/share/keyrings/intel-oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-basekit
Android: Termux issues
- Ensure Termux is installed from F-Droid (not Play Store)
- Run pkg update before installation
- Limited to small models (1-3B) due to RAM constraints
- Use CPU backend only (no GPU acceleration on Android yet)
Requirements
- Python 3.9+
- 4GB+ RAM (8GB+ recommended)
- Optional: NVIDIA/AMD/Intel GPU with 4GB+ VRAM
- Optional: Apple Silicon Mac
- Optional: Android device with 8GB+ RAM (via Termux)
Development
# Install dev dependencies
pip install -r requirements-dev.txt
# Run tests
pytest
# Run specific platform tests
pytest tests/test_hardware.py -v
# Format code
black src/
ruff check src/
Architecture
Single Machine
┌─────────────────────────────────────┐
│          OpenAI API Client          │
│          (opencode, etc.)           │
└─────────────┬───────────────────────┘
              │ HTTP
              ▼
┌─────────────────────────────────────┐
│       Local Swarm API Server        │
│     (FastAPI / localhost:8000)      │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│            Swarm Manager            │
│   ┌─────────┐   ┌─────────┐         │
│   │ Worker 1│   │ Worker 2│  ...    │
│   │(LLM #1) │   │(LLM #2) │         │
│   └────┬────┘   └────┬────┘         │
│        │             │              │
│        └──────┬──────┘              │
│               ▼                     │
│        Consensus Engine             │
└─────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│      Backend (llama.cpp / MLX)      │
│       ┌─────────────────────┐       │
│       │   GGUF/MLX Model    │       │
│       │  (Qwen/Codellama)   │       │
│       └─────────────────────┘       │
└─────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│  Hardware (GPU/CPU/Apple Silicon)   │
└─────────────────────────────────────┘
Federated Swarm (Multiple Machines)
┌─────────────────────────────────────────────────────────────┐
│                        Local Network                         │
│                                                              │
│   ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│   │  Windows PC  │    │   Mac Mini   │    │   MacBook    │  │
│   │  (RTX 4060)  │    │     (M1)     │    │     (M4)     │  │
│   │ 4 instances  │    │ 2 instances  │    │ 3 instances  │  │
│   └──────┬───────┘    └──────┬───────┘    └──────┬───────┘  │
│          │                   │                   │          │
│          │                   │                   │          │
│          └───────────────────┼───────────────────┘          │
│                              │                              │
│                     ┌────────┴────────┐                     │
│                     │   Cross-Swarm   │                     │
│                     │    Consensus    │                     │
│                     └────────┬────────┘                     │
│                              │                              │
│                     ┌────────▼────────┐                     │
│                     │    opencode     │                     │
│                     └─────────────────┘                     │
└─────────────────────────────────────────────────────────────┘
License
MIT License - See LICENSE file
Contributing
Contributions welcome! Please read CONTRIBUTING.md first.