# Local Swarm
Automatically configure and run a swarm of small coding LLMs optimized for your hardware. Provides an OpenAI-compatible API for seamless integration with opencode and other tools.
## Features
- Hardware Auto-Detection: Automatically detects your GPU (NVIDIA), Apple Silicon, or CPU and selects optimal settings
- Smart Model Selection: Chooses the best model, quantization, and instance count based on available VRAM/RAM
- Swarm Consensus: Multiple LLM instances vote on the best response for higher quality outputs
- OpenAI-Compatible API: Drop-in replacement for the OpenAI API at `http://localhost:8000/v1`
- Cross-Platform: Works on Windows, macOS, and Linux with automatic backend selection
## Quick Start

### Installation

#### Windows (PowerShell)
```powershell
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run installer
.\scripts\install.bat
```
#### macOS/Linux
```bash
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run installer
chmod +x scripts/install.sh
./scripts/install.sh
```
### Usage

#### Start the Swarm
```bash
# Auto-detect hardware and start
python -m local_swarm

# Or use the CLI
python main.py
```
On first run, the tool will:

1. Scan your hardware (GPU, RAM, CPU)
2. Select the optimal model and quantization
3. Download the model (one-time)
4. Start multiple instances based on available memory
5. Expose the API at `http://localhost:8000`
Example startup output:
```
🔍 Detecting hardware...
   OS: Windows 11
   GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
   CPU: 16 cores
   RAM: 32 GB

📊 Optimal configuration:
   Model: Qwen 2.5 Coder 3B
   Quantization: Q4_K_M (1.8 GB per instance)
   Instances: 8 (using 14.4 GB VRAM)

⬇️ Downloading model...
   Progress: 100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 1.8/1.8 GB

🚀 Starting swarm...
   Worker 1: Ready (GPU:0)
   Worker 2: Ready (GPU:0)
   ...
   Worker 8: Ready (GPU:0)

✅ Local Swarm is running!
   API: http://localhost:8000/v1
   Models: http://localhost:8000/v1/models
   Health: http://localhost:8000/health

💡 Configure opencode to use:
   base_url: http://localhost:8000/v1
   api_key: any (not used)
```
### Configure opencode
Add to your opencode configuration:
```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```
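Before pointing opencode at the server, you can sanity-check the endpoints from Python (a quick sketch assuming the `requests` package is installed; exact response bodies may differ):

```python
import requests

# Health endpoint: should respond once all workers are ready
print(requests.get("http://localhost:8000/health").status_code)  # expect 200

# Models endpoint: the list should include "local-swarm"
print(requests.get("http://localhost:8000/v1/models").json())
```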
## Configuration
Create a config.yaml file for customization:
```yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5         # Use 50% of system RAM for CPU/Apple Silicon

models:
  cache_dir: "~/.local_swarm/models"
```
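If you are extending the tool, a loader along these lines keeps the defaults and `config.yaml` in sync (a sketch assuming PyYAML and the sections above; the project's actual loader may differ):

```python
from pathlib import Path

import yaml  # PyYAML

DEFAULTS = {
    "server": {"host": "127.0.0.1", "port": 8000},
    "swarm": {"consensus_strategy": "similarity", "min_instances": 2, "max_instances": 8},
    "hardware": {"gpu_memory_fraction": 1.0, "ram_fraction": 0.5},
    "models": {"cache_dir": "~/.local_swarm/models"},
}

def load_config(path: str = "config.yaml") -> dict:
    """Overlay config.yaml (if present) onto the defaults, section by section."""
    config = {section: dict(values) for section, values in DEFAULTS.items()}
    file = Path(path)
    if file.exists():
        user = yaml.safe_load(file.read_text()) or {}
        for section, values in user.items():
            config.setdefault(section, {}).update(values or {})
    return config
```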
## CLI Options
```bash
# Show hardware detection without starting
python -m local_swarm --detect

# Use specific model
python -m local_swarm --model qwen2.5-coder:3b:q4

# Use specific port
python -m local_swarm --port 8080

# Force number of instances
python -m local_swarm --instances 4

# Download models only (no server)
python -m local_swarm --download-only

# Show help
python -m local_swarm --help
```
## How It Works

### Hardware Detection
The tool automatically detects your system:
- Windows: NVIDIA GPUs via NVML, DirectX fallback
- macOS: Apple Silicon via Metal, unified memory model
- Linux: NVIDIA (NVML), AMD (ROCm)
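For the NVIDIA path, NVML can be queried from Python via the `pynvml` package; a minimal sketch (not the project's actual detection code, with the macOS/CPU fallbacks omitted):

```python
import pynvml

def detect_nvidia_gpu():
    """Return (name, total VRAM in GB) for GPU 0, or None if NVML is unavailable."""
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # No NVIDIA driver: fall back to Apple Silicon/CPU detection
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        name = pynvml.nvmlDeviceGetName(handle)
        vram_gb = pynvml.nvmlDeviceGetMemoryInfo(handle).total / 1024**3
        return name, vram_gb
    finally:
        pynvml.nvmlShutdown()
```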
### Model Selection

The memory budget depends on the hardware class:

- Dedicated GPU: 100% of VRAM minus OS overhead
- Apple Silicon: 50% of unified RAM
- CPU-only: 50% of system RAM
The algorithm then selects, as sketched below:

- The largest model size that fits
- The highest quantization quality possible
- The maximum number of instances (2-8) that memory allows
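A greedy version of that selection could look like the following (illustrative only: the per-instance sizes are rough estimates, and the real heuristics may trade model size against instance count differently):

```python
# (model spec, approximate GGUF size in GB per instance), largest first
CANDIDATES = [
    ("qwen2.5-coder:14b:q4", 9.0),
    ("qwen2.5-coder:7b:q4", 4.5),
    ("qwen2.5-coder:3b:q6", 2.6),
    ("qwen2.5-coder:3b:q4", 1.8),
]

def pick_model(budget_gb: float, min_instances: int = 2, max_instances: int = 8):
    """Pick the largest model that still fits at least min_instances in the budget."""
    for spec, size_gb in CANDIDATES:
        instances = min(max_instances, int(budget_gb // size_gb))
        if instances >= min_instances:
            return spec, instances
    raise RuntimeError("Not enough memory for any supported model")

# A 16 GB card minus OS overhead, ~14.4 GB usable:
print(pick_model(14.4))  # ('qwen2.5-coder:7b:q4', 3)
```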
Example configurations:
| Hardware | Model | Quant | Instances | Memory Used |
|---|---|---|---|---|
| RTX 4060 Ti 16GB | Qwen 2.5 Coder 7B | Q4_K_M | 3 | ~13.5 GB |
| RTX 4060 Ti 8GB | Qwen 2.5 Coder 3B | Q6_K | 2 | ~5.2 GB |
| M3 Pro 36GB | Qwen 2.5 Coder 7B | Q4_K_M | 4 | ~18 GB |
| M1 8GB | Qwen 2.5 Coder 3B | Q4_K_M | 2 | ~3.6 GB |
| CPU 32GB | Qwen 2.5 Coder 3B | Q4_K_M | 8 | ~14.4 GB |
### Swarm Consensus

For each request, the swarm:

1. Sends the prompt to all running instances
2. Collects the responses in parallel
3. Runs the consensus algorithm (sketched below):
   - Similarity: Groups responses by semantic similarity and returns the largest group's answer
   - Quality: Scores responses on completeness and code quality
   - Fastest: Returns the quickest response
4. Returns the winning response via the OpenAI-compatible API
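As a rough illustration of the similarity strategy, the sketch below clusters responses using plain string similarity from `difflib`; the real engine presumably uses semantic similarity, and the 0.8 threshold is an invented parameter:

```python
from difflib import SequenceMatcher

def similarity_consensus(responses: list[str], threshold: float = 0.8) -> str:
    """Group near-duplicate responses and return an answer from the largest group."""
    clusters: list[list[str]] = []
    for response in responses:
        for cluster in clusters:
            if SequenceMatcher(None, response, cluster[0]).ratio() >= threshold:
                cluster.append(response)
                break
        else:
            clusters.append([response])  # No close match: start a new cluster
    return max(clusters, key=len)[0]

answers = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def sort_list(lst):\n    return sorted(lst)",
    "lst.sort()",
]
print(similarity_consensus(answers))  # majority answer wins
```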
## API Endpoints

### `GET /v1/models`

List available models.

### `POST /v1/chat/completions`

Chat completion with consensus.

Request:
```json
{
  "model": "local-swarm",
  "messages": [
    {"role": "user", "content": "Write a Python function to sort a list"}
  ]
}
```
Response:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-swarm",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "def sort_list(lst):\n return sorted(lst)"
    },
    "finish_reason": "stop"
  }]
}
```
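Because the endpoint is OpenAI-compatible, the official `openai` Python client (v1.x) can be pointed at it directly:

```python
from openai import OpenAI

# Any api_key value works; the server ignores it
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}],
)
print(response.choices[0].message.content)
```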
### `GET /health`

Health check.

### `GET /metrics`

Prometheus metrics (optional).
## Supported Models
Currently supported models (auto-selected based on hardware):
- Qwen 2.5 Coder (3B, 7B, 14B) - Recommended for coding tasks
- DeepSeek Coder (1.3B, 6.7B, 33B) - Good alternative
- CodeLlama (7B, 13B, 34B) - Meta's code model
All models support GGUF quantization:
- Q4_K_M - Good quality, smallest size (recommended)
- Q5_K_M - Better quality
- Q6_K - Best quality
## Troubleshooting

### Out of Memory

If you get OOM errors:
```bash
# Reduce instances
python -m local_swarm --instances 2

# Or use a smaller model
python -m local_swarm --model qwen2.5-coder:3b:q4
```
### Slow Performance

- Check GPU utilization with `nvidia-smi` (NVIDIA) or Activity Monitor (macOS)
- Ensure the model is cached (the first run downloads to `~/.local_swarm/models`)
- Try reducing instances to avoid contention
### Windows: CUDA not detected

Make sure NVIDIA drivers are installed:

```bash
nvidia-smi
```

If this fails, reinstall the drivers from nvidia.com.
### macOS: MLX not found

```bash
pip install mlx-lm
```
## Requirements
- Python 3.9+
- 4GB+ RAM (8GB+ recommended)
- Optional: NVIDIA GPU with 4GB+ VRAM
- Optional: Apple Silicon Mac
## Development
```bash
# Install dev dependencies
pip install -r requirements-dev.txt

# Run tests
pytest

# Run specific platform tests
pytest tests/test_hardware.py -v

# Format and lint code
black src/
ruff check src/
```
## Architecture
```
┌─────────────────────────────────────┐
│         OpenAI API Client           │
│          (opencode, etc.)           │
└─────────────┬───────────────────────┘
              │ HTTP
              ▼
┌─────────────────────────────────────┐
│        Local Swarm API Server       │
│      (FastAPI / localhost:8000)     │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│            Swarm Manager            │
│   ┌─────────┐  ┌─────────┐          │
│   │ Worker 1│  │ Worker 2│  ...     │
│   │(LLM #1) │  │(LLM #2) │          │
│   └────┬────┘  └────┬────┘          │
│        │            │               │
│        └─────┬──────┘               │
│              ▼                      │
│       Consensus Engine              │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│     Backend (llama.cpp / MLX)       │
│     ┌─────────────────────┐         │
│     │   GGUF/MLX Model    │         │
│     │  (Qwen/CodeLlama)   │         │
│     └─────────────────────┘         │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│  Hardware (GPU/CPU/Apple Silicon)   │
└─────────────────────────────────────┘
```
## License
MIT License - See LICENSE file
## Contributing
Contributions welcome! Please read CONTRIBUTING.md first.
## Acknowledgments
- llama.cpp - Inference engine
- MLX - Apple Silicon backend
- Qwen - Model family
- HuggingFace - Model hosting