# Local Swarm
Automatically configure and run a swarm of small coding LLMs optimized for your hardware. Provides an OpenAI-compatible API for seamless integration with opencode and other tools.
## Features
- **Hardware Auto-Detection**: Automatically detects your GPU (NVIDIA), Apple Silicon, or CPU and selects optimal settings
- **Smart Model Selection**: Chooses the best model, quantization, and instance count based on available VRAM/RAM
- **Swarm Consensus**: Multiple LLM instances vote on the best response for higher quality outputs
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI API at `http://localhost:8000/v1`
- **Cross-Platform**: Works on Windows, macOS, and Linux with automatic backend selection
## Quick Start
### Installation
#### Windows (PowerShell)
```powershell
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run installer
.\scripts\install.bat
```
#### macOS/Linux
```bash
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run installer
chmod +x scripts/install.sh
./scripts/install.sh
```
### Usage
#### Start the Swarm
```bash
# Auto-detect hardware and start
python -m local_swarm
# Or use the CLI
python main.py
```
On first run, the tool will:
1. Scan your hardware (GPU, RAM, CPU)
2. Select the optimal model and quantization
3. Download the model (one-time)
4. Start multiple instances based on available memory
5. Expose the API at `http://localhost:8000`
Example startup output:
```
🔍 Detecting hardware...
   OS: Windows 11
   GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
   CPU: 16 cores
   RAM: 32 GB

📊 Optimal configuration:
   Model: Qwen 2.5 Coder 7B
   Quantization: Q4_K_M (4.5 GB per instance)
   Instances: 3 (using 13.5 GB VRAM)

⬇️ Downloading model...
   Progress: 100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 4.5/4.5 GB

🚀 Starting swarm...
   Worker 1: Ready (GPU:0)
   Worker 2: Ready (GPU:0)
   Worker 3: Ready (GPU:0)

✅ Local Swarm is running!
   API:    http://localhost:8000/v1
   Models: http://localhost:8000/v1/models
   Health: http://localhost:8000/health

💡 Configure opencode to use:
   base_url: http://localhost:8000/v1
   api_key:  any (not used)
```
#### Configure opencode
Add to your opencode configuration:
```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```
## Configuration
Create a `config.yaml` file for customization:
```yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5         # Use 50% of system RAM for CPU/Apple Silicon

models:
  cache_dir: "~/.local_swarm/models"
```
## CLI Options
```bash
# Show hardware detection without starting
python -m local_swarm --detect
# Use specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Use specific port
python -m local_swarm --port 8080
# Force number of instances
python -m local_swarm --instances 4
# Download models only (no server)
python -m local_swarm --download-only
# Show help
python -m local_swarm --help
```
## How It Works
### Hardware Detection
The tool automatically detects your system:
- **Windows**: NVIDIA GPUs via NVML, DirectX fallback
- **macOS**: Apple Silicon via Metal, unified memory model
- **Linux**: NVIDIA (NVML), AMD (ROCm)
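A minimal sketch of how such a probe might look. The `detect_hardware` helper and its return shape are illustrative, not the project's actual API; a real implementation would query NVML (e.g. via `pynvml`) for NVIDIA VRAM:

```python
import os
import platform

def detect_hardware():
    """Classify the host into one of the supported backends (illustrative sketch)."""
    system = platform.system()    # "Windows", "Darwin", or "Linux"
    machine = platform.machine()  # "arm64" on Apple Silicon
    info = {"os": system, "cpu_cores": os.cpu_count() or 1}
    if system == "Darwin" and machine == "arm64":
        info["backend"] = "mlx"       # unified memory, Metal via MLX
    else:
        # A real probe would check NVML for NVIDIA VRAM here and fall
        # back to CPU-only llama.cpp when no GPU is found.
        info["backend"] = "llama.cpp"
    return info

print(detect_hardware())
```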
### Model Selection
Based on available memory:
1. **Discrete GPU**: Use 100% of VRAM minus OS overhead
2. **Apple Silicon**: Use 50% of unified RAM
3. **CPU-only**: Use 50% of system RAM
The algorithm selects:
- Largest model size that fits
- Highest quantization quality possible
- Maximum instances (2-8) based on memory
Example configurations:
| Hardware | Model | Quant | Instances | Memory Used |
|----------|-------|-------|-----------|-------------|
| RTX 4060 Ti 16GB | Qwen 2.5 7B | Q4_K_M | 3 | ~13.5 GB |
| RTX 4060 Ti 8GB | Qwen 2.5 3B | Q6_K | 3 | ~7.8 GB |
| M3 Pro 36GB | Qwen 2.5 7B | Q4_K_M | 4 | ~18 GB |
| M1 8GB | Qwen 2.5 3B | Q4_K_M | 2 | ~3.6 GB |
| CPU 32GB | Qwen 2.5 3B | Q4_K_M | 8 | ~14.4 GB |
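The GPU case of this heuristic can be sketched as a pure function. The model names, per-instance sizes, and candidate list below are rough illustrative approximations of the table, not the project's real model catalogue:

```python
# Candidates ordered best-first: largest model, then highest quantization
# quality. Per-instance sizes (GB) are rough illustrative figures.
CANDIDATES = [
    ("qwen2.5-coder-7b", "Q4_K_M", 4.5),
    ("qwen2.5-coder-3b", "Q6_K",   2.6),
    ("qwen2.5-coder-3b", "Q4_K_M", 1.8),
]

def choose_config(budget_gb, min_instances=2, max_instances=8):
    """Pick the largest model that still allows min_instances copies."""
    for model, quant, size in CANDIDATES:
        instances = int(budget_gb // size)
        if instances >= min_instances:
            return model, quant, min(instances, max_instances)
    return None  # not enough memory for any swarm

print(choose_config(16))  # ('qwen2.5-coder-7b', 'Q4_K_M', 3)
```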
### Swarm Consensus
For each request, the swarm:
1. Sends the prompt to all running instances
2. Collects responses in parallel
3. Runs consensus algorithm:
- **Similarity**: Groups responses by semantic similarity, returns largest group
- **Quality**: Scores responses on completeness and code quality
- **Fastest**: Returns the quickest response
4. Returns the winning response via OpenAI-compatible API
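The "similarity" strategy can be sketched as a greedy clustering pass. Here `difflib.SequenceMatcher` stands in for whatever semantic-similarity measure the project actually uses; the function and threshold are illustrative:

```python
import difflib

def similarity_consensus(responses, threshold=0.9):
    """Group near-identical responses; return one from the largest group."""
    clusters = []  # each cluster is a list of mutually similar responses
    for resp in responses:
        for cluster in clusters:
            # difflib stands in for a real semantic-similarity measure
            if difflib.SequenceMatcher(None, resp, cluster[0]).ratio() >= threshold:
                cluster.append(resp)
                break
        else:
            clusters.append([resp])  # no close match: start a new cluster
    return max(clusters, key=len)[0]  # representative of the majority group

answers = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def sort_list(lst):\n    return sorted(lst)",
    "lst.sort()",
]
print(similarity_consensus(answers))  # the majority answer
```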
## API Endpoints
### GET /v1/models
List available models
### POST /v1/chat/completions
Chat completion with consensus
**Request**:
```json
{
  "model": "local-swarm",
  "messages": [
    {"role": "user", "content": "Write a Python function to sort a list"}
  ]
}
```
**Response**:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-swarm",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "def sort_list(lst):\n    return sorted(lst)"
    },
    "finish_reason": "stop"
  }]
}
```
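Because the endpoint speaks the OpenAI wire format, it can be exercised with nothing but the standard library. `build_chat_request` and `ask_swarm` below are illustrative helpers, and the call assumes a swarm is already running on port 8000:

```python
import json
import urllib.request

def build_chat_request(prompt):
    """Assemble an OpenAI-style chat completion payload."""
    return {
        "model": "local-swarm",
        "messages": [{"role": "user", "content": prompt}],
    }

def ask_swarm(prompt, base_url="http://localhost:8000/v1"):
    """POST the payload and return the winning response's text."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running:
#   print(ask_swarm("Write a Python function to sort a list"))
```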
### GET /health
Health check
### GET /metrics
Prometheus metrics (optional)
## Supported Models
Currently supported models (auto-selected based on hardware):
- **Qwen 2.5 Coder** (3B, 7B, 14B) - Recommended for coding tasks
- **DeepSeek Coder** (1.3B, 6.7B, 33B) - Good alternative
- **CodeLlama** (7B, 13B, 34B) - Meta's code model
All models support GGUF quantization:
- Q4_K_M - Good quality, smallest size (recommended)
- Q5_K_M - Better quality
- Q6_K - Best quality
## Troubleshooting
### Out of Memory
If you get OOM errors:
```bash
# Reduce instances
python -m local_swarm --instances 2
# Or use smaller model
python -m local_swarm --model qwen2.5-coder:3b:q4
```
### Slow Performance
- Check GPU utilization with `nvidia-smi` (NVIDIA) or Activity Monitor (macOS)
- Ensure model is cached (first run downloads to `~/.local_swarm/models`)
- Try reducing instances to avoid contention
### Windows: CUDA not detected
Make sure NVIDIA drivers are installed:
```powershell
nvidia-smi
```
If this fails, reinstall the drivers from nvidia.com.
### macOS: MLX not found
```bash
pip install mlx-lm
```
## Requirements
- Python 3.9+
- 4GB+ RAM (8GB+ recommended)
- Optional: NVIDIA GPU with 4GB+ VRAM
- Optional: Apple Silicon Mac
## Development
```bash
# Install dev dependencies
pip install -r requirements-dev.txt
# Run tests
pytest
# Run specific platform tests
pytest tests/test_hardware.py -v
# Format code
black src/
ruff check src/
```
## Architecture
```
┌─────────────────────────────────────┐
│ OpenAI API Client │
│ (opencode, etc.) │
└─────────────┬───────────────────────┘
│ HTTP
┌─────────────────────────────────────┐
│ Local Swarm API Server │
│ (FastAPI / localhost:8000) │
└─────────────┬───────────────────────┘
┌─────────────────────────────────────┐
│ Swarm Manager │
│ ┌─────────┐ ┌─────────┐ │
│ │ Worker 1│ │ Worker 2│ ... │
│ │(LLM #1) │ │(LLM #2) │ │
│ └────┬────┘ └────┬────┘ │
│ │ │ │
│ └─────┬─────┘ │
│ ▼ │
│ Consensus Engine │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Backend (llama.cpp / MLX) │
│ ┌─────────────────────┐ │
│ │ GGUF/MLX Model │ │
│ │ (Qwen/Codellama) │ │
│ └─────────────────────┘ │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│ Hardware (GPU/CPU/Apple Silicon) │
└─────────────────────────────────────┘
```
## License
MIT License - See LICENSE file
## Contributing
Contributions welcome! Please read CONTRIBUTING.md first.
## Acknowledgments
- [llama.cpp](https://github.com/ggerganov/llama.cpp) - Inference engine
- [MLX](https://github.com/ml-explore/mlx) - Apple Silicon backend
- [Qwen](https://github.com/QwenLM/Qwen) - Model family
- [HuggingFace](https://huggingface.co) - Model hosting