# Local Swarm

Automatically configure and run a swarm of small coding LLMs optimized for your hardware. Provides an OpenAI-compatible API for seamless integration with opencode and other tools.
## Features

- **Interactive Menu System**: Easy-to-use menu for selecting model configurations, browsing options, or creating custom setups
- **Hardware Auto-Detection**: Automatically detects your GPU (NVIDIA, AMD, Intel), Apple Silicon, Qualcomm (Android), or CPU and selects optimal settings
- **Smart Model Selection**: Chooses the best model, quantization, and instance count based on available VRAM/RAM
- **Startup Summary**: Clear display of detected hardware, selected model, resource usage, and worker status
- **Swarm Consensus**: Multiple LLM instances vote on the best response for higher quality outputs
- **Network Federation**: Multiple machines on the same network can join into a "federated swarm" for distributed consensus
- **OpenAI-Compatible API**: Drop-in replacement for the OpenAI API at `http://localhost:8000/v1`
- **MCP Server**: Model Context Protocol support for tight AI assistant integration
- **Cross-Platform**: Works on Windows, macOS, Linux, and Android (via Termux) with automatic backend selection
## Documentation

- **[Quick Start](#quick-start)** - Get up and running in minutes
- **[Complete Guide](docs/GUIDE.md)** - Comprehensive documentation
  - Opencode configuration examples
  - API reference
  - Troubleshooting guide
  - Performance tuning
  - Advanced configuration
- **[Configuration](#configuration)** - Customize your setup
- **[Interactive Mode](#interactive-mode)** - Using the menu system
- **[Tips & Help](#tips--help)** - Learn about models, quantization, and optimization
## Quick Start

### Installation
#### Windows (PowerShell)

```powershell
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run installer
.\scripts\install.bat
```
#### macOS/Linux

```bash
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run installer
chmod +x scripts/install.sh
./scripts/install.sh
```
#### Android (Termux)

```bash
# In Termux app
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run Termux installer
chmod +x scripts/install-termux.sh
./scripts/install-termux.sh
```

**Note**: Android support is limited to small models (1-3B) due to memory constraints. Requires 8GB+ RAM.
### Usage

#### Start the Swarm

```bash
# Auto-detect hardware and start
python -m local_swarm

# Or use the CLI
python main.py
```

On first run, the tool will:
1. Scan your hardware (GPU, RAM, CPU)
2. Select the optimal model and quantization
3. Download the model (one-time)
4. Start multiple instances based on available memory
5. Expose the API at `http://localhost:8000`
Example startup output:

```
🔍 Detecting hardware...
   OS: Windows 11
   GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
   CPU: 16 cores
   RAM: 32 GB

📊 Optimal configuration:
   Model: Qwen 2.5 Coder 3B
   Quantization: Q4_K_M (1.8 GB per instance)
   Instances: 8 (using 14.4 GB VRAM)

⬇️ Downloading model...
   Progress: 100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 1.8/1.8 GB

🚀 Starting swarm...
   Worker 1: Ready (GPU:0)
   Worker 2: Ready (GPU:0)
   ...
   Worker 8: Ready (GPU:0)

✅ Local Swarm is running!
   API: http://localhost:8000/v1
   Models: http://localhost:8000/v1/models
   Health: http://localhost:8000/health

💡 Configure opencode to use:
   base_url: http://localhost:8000/v1
   api_key: any (not used)
```
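Once the swarm reports ready, you can verify the API from Python. This is a minimal sketch, assuming the default port and the `requests` package (which is not necessarily installed by this project):

```python
import requests

BASE = "http://localhost:8000"

# Health check should return a 2xx status once all workers are ready
health = requests.get(f"{BASE}/health", timeout=5)
print("health:", health.status_code)

# The models endpoint lists what the swarm exposes (e.g. "local-swarm")
models = requests.get(f"{BASE}/v1/models", timeout=5)
print("models:", models.json())
```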
#### Configure opencode

Add to your opencode configuration:

```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```
#### MCP Server (Optional)

For tighter integration with AI assistants, enable the MCP server:

```bash
python main.py --mcp
```

This runs alongside the HTTP API and exposes tools AI assistants can use:
- `get_hardware_info` - Query CPU, GPU, and RAM
- `get_swarm_status` - Check worker health
- `generate_code` - Generate code with consensus
- `list_available_models` - See what models can run
- `get_worker_details` - Get detailed worker statistics

MCP allows AI assistants to automatically query your hardware capabilities and select appropriate models.
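To give a feel for what one of these tools looks like, here is a minimal sketch of `get_hardware_info` written against the `FastMCP` helper from the official MCP Python SDK. This is an illustration only: the returned fields and the way Local Swarm actually registers its tools are assumptions, and `psutil` is used here just to fill in example values.

```python
from mcp.server.fastmcp import FastMCP
import psutil  # assumption: only used to produce example values

mcp = FastMCP("local-swarm")

@mcp.tool()
def get_hardware_info() -> dict:
    """Report CPU core count and total system RAM (illustrative fields only)."""
    return {
        "cpu_cores": psutil.cpu_count(logical=True),
        "ram_gb": round(psutil.virtual_memory().total / 1024**3, 1),
    }

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio for an MCP-capable assistant
```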
## Configuration

Create a `config.yaml` file for customization:

```yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5         # Use 50% of system RAM for CPU/Apple Silicon

federation:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  max_peers: 10

models:
  cache_dir: "~/.local_swarm/models"
```
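If you want to sanity-check a config file before starting the swarm, one quick way is to load it with PyYAML (an assumption; the project itself may parse the file differently):

```python
import yaml

with open("config.yaml") as f:
    cfg = yaml.safe_load(f)

# Spot-check a few values before launching
print(cfg["server"]["port"])               # 8000
print(cfg["swarm"]["consensus_strategy"])  # "similarity"
print(cfg["federation"]["enabled"])        # True
```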
## CLI Options

```bash
# Show hardware detection without starting
python -m local_swarm --detect

# Use specific model
python -m local_swarm --model qwen2.5-coder:3b:q4

# Use specific port
python -m local_swarm --port 8080

# Force number of instances
python -m local_swarm --instances 4

# Download models only (no server)
python -m local_swarm --download-only

# Enable MCP server alongside HTTP API
python -m local_swarm --mcp

# Show help
python -m local_swarm --help

# Auto-detect without interactive menu
python -m local_swarm --auto
```
## Interactive Mode

By default, Local Swarm starts in **interactive mode** with a menu system:

```
======================================================================
  Local Swarm - Model Selection
======================================================================

----------------------------------------------------------------------
  Hardware Detection
----------------------------------------------------------------------
  Operating System: Darwin
  CPU: 12 cores
  System RAM: 24.0 GB
  Available RAM: 6.2 GB

  GPU Detected:
    Name: Apple Silicon GPU
    Type: Apple Silicon (Unified Memory)
    Total Memory: 24.0 GB

  Available for LLMs: 12.0 GB
  (Using 50% of system RAM)

----------------------------------------------------------------------
  Configuration Options
----------------------------------------------------------------------

  💡 Recommended: Qwen 2.5 Coder 7b (q6_k)
     Instances: 2
     Memory: 12.0 GB

  [1] Recommended Configuration - Qwen 2.5 Coder 7b (q6_k) with 2 instances
  [2] Browse All Configurations - See all models that fit your hardware
  [3] Custom Configuration - Specify exact model and number of instances

Enter your choice:
```
### Menu Options

1. **Recommended Configuration** - Automatically selects the best model and instance count for your hardware
2. **Browse All Configurations** - Shows all feasible models that fit in your available memory
3. **Custom Configuration** - Step-by-step wizard to select:
   - Model family (Qwen, DeepSeek, CodeLlama)
   - Model size (3B, 7B, 14B)
   - Quantization level (Q4, Q5, Q6)
   - Number of instances (1 to max supported)

To skip the menu and use auto-detection, pass the `--auto` flag.
## Startup Summary

When starting, Local Swarm displays a comprehensive summary:

```
======================================================================
  Local Swarm - Startup Summary
======================================================================

----------------------------------------------------------------------
  Hardware Detection
----------------------------------------------------------------------
  Operating System: Darwin
  CPU: 12 cores
  System RAM: 24.0 GB
  Available RAM: 6.2 GB

  GPU Detected:
    Name: Apple Silicon GPU
    Type: Apple Silicon (Unified Memory)
    Total Memory: 24.0 GB

  Available for LLMs: 12.0 GB

----------------------------------------------------------------------
  Model Configuration
----------------------------------------------------------------------
  Model: Qwen 2.5 Coder 7b (q6_k)
  Description: Alibaba's code-focused model
  Instances: 2
  Memory per Instance: 6.0 GB
  Total Memory: 12.0 GB
  Utilization: 100.0% of available

======================================================================
```
## How It Works

### Hardware Detection

The tool automatically detects your system:
- **Windows**: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI)
- **macOS**: Apple Silicon via Metal, unified memory model
- **Linux**: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI/OpenCL)
- **Android**: Qualcomm Adreno GPUs (via Termux)

**Supported Backends**:
- **NVIDIA**: CUDA via llama.cpp
- **AMD**: ROCm via llama.cpp (Linux, Windows experimental)
- **Intel**: OneAPI/SYCL via llama.cpp
- **Apple Silicon**: Metal via MLX
- **Qualcomm**: CPU fallback on llama.cpp (Android/Termux)
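As a rough illustration of what NVML-based probing looks like (not the project's actual detection code; `psutil` and `pynvml` are assumed to be available):

```python
import psutil

def probe_hardware() -> dict:
    """Best-effort probe: CPU/RAM via psutil, NVIDIA VRAM via NVML when present."""
    info = {
        "cpu_cores": psutil.cpu_count(logical=True),
        "ram_gb": round(psutil.virtual_memory().total / 1024**3, 1),
        "gpus": [],
    }
    try:
        import pynvml
        pynvml.nvmlInit()
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            info["gpus"].append({
                "name": pynvml.nvmlDeviceGetName(handle),
                "vram_gb": round(mem.total / 1024**3, 1),
            })
        pynvml.nvmlShutdown()
    except Exception:
        pass  # no NVIDIA GPU or NVML not installed; keep the CPU-only info
    return info

if __name__ == "__main__":
    print(probe_hardware())
```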
### Model Selection

Based on available memory:
1. **External GPU**: Use 100% of VRAM minus OS overhead
2. **Apple Silicon**: Use 50% of unified RAM
3. **CPU-only**: Use 50% of system RAM

The algorithm selects:
- Largest model size that fits
- Highest quantization quality possible
- Maximum instances (2-8) based on memory
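A toy sketch of that selection step, with made-up candidate sizes rather than the project's real model table:

```python
# Hypothetical (model, quant, GiB-per-instance) candidates, best quality first.
CANDIDATES = [
    ("qwen2.5-coder-14b", "q4_k_m", 8.8),
    ("qwen2.5-coder-7b", "q6_k", 6.0),
    ("qwen2.5-coder-7b", "q4_k_m", 4.5),
    ("qwen2.5-coder-3b", "q6_k", 2.6),
    ("qwen2.5-coder-3b", "q4_k_m", 1.8),
]

def pick_config(available_gb: float, min_instances: int = 2, max_instances: int = 8):
    """Pick the largest model/quant that still fits at least `min_instances` copies."""
    for model, quant, per_instance in CANDIDATES:
        instances = min(max_instances, int(available_gb // per_instance))
        if instances >= min_instances:
            return model, quant, instances, instances * per_instance
    # Nothing fits twice: fall back to a single copy of the smallest candidate
    model, quant, per_instance = CANDIDATES[-1]
    return model, quant, 1, per_instance

# With ~14.4 GB usable this returns ('qwen2.5-coder-7b', 'q6_k', 2, 12.0)
# given these made-up numbers.
print(pick_config(14.4))
```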
Example configurations:

| Hardware | Model | Quant | Instances | Memory Used |
|----------|-------|-------|-----------|-------------|
| RTX 4090 24GB | Qwen 2.5 14B | Q4_K_M | 2 | ~17.6 GB |
| RTX 4060 Ti 16GB | Qwen 2.5 7B | Q4_K_M | 3 | ~13.5 GB |
| RTX 4060 Ti 8GB | Qwen 2.5 3B | Q6_K | 4 | ~10.4 GB |
| RX 7900 XTX 24GB | Qwen 2.5 14B | Q4_K_M | 2 | ~17.6 GB |
| Arc A770 16GB | Qwen 2.5 7B | Q5_K_M | 2 | ~10.4 GB |
| M4 Max 64GB | Qwen 2.5 14B | Q4_K_M | 4 | ~35.2 GB |
| M3 Pro 36GB | Qwen 2.5 7B | Q4_K_M | 4 | ~18 GB |
| M1 8GB | Qwen 2.5 3B | Q4_K_M | 2 | ~3.6 GB |
| Snapdragon 8 Gen 3 | Qwen 2.5 3B | Q4_K_M | 1 | ~1.8 GB |
| CPU 32GB | Qwen 2.5 3B | Q4_K_M | 8 | ~14.4 GB |
| **Federated (3 machines)** | **Qwen 2.5 7B** | **Q4_K_M** | **9** | **~40.5 GB** |
### Swarm Consensus

For each request, the swarm:
1. Sends the prompt to all running instances
2. Collects responses in parallel
3. Runs a consensus algorithm:
   - **Similarity**: Groups responses by semantic similarity, returns the largest group
   - **Quality**: Scores responses on completeness and code quality
   - **Fastest**: Returns the quickest response
4. Returns the winning response via the OpenAI-compatible API
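To make the similarity strategy concrete, here is a minimal, self-contained sketch of majority voting by pairwise similarity. It uses the standard-library `difflib` as a stand-in for whatever similarity measure the project actually uses, so treat it as an illustration rather than the shipped algorithm.

```python
from difflib import SequenceMatcher

def similarity_consensus(responses: list[str], threshold: float = 0.8) -> str:
    """Group near-identical responses and return one from the largest group."""
    groups: list[list[str]] = []
    for resp in responses:
        for group in groups:
            # Compare against the group's representative (its first member)
            if SequenceMatcher(None, resp, group[0]).ratio() >= threshold:
                group.append(resp)
                break
        else:
            groups.append([resp])
    largest = max(groups, key=len)
    return largest[0]

answers = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def sort_list(lst):\n    return sorted(lst)",
    "def sort_list(items):\n    items.sort()\n    return items",
]
print(similarity_consensus(answers))  # prints the answer from the largest group
```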
### Network Federation

Run Local Swarm on multiple machines in the same network to create a "federated swarm":

**Example Setup**:
- Windows PC (RTX 4060 Ti): 4 instances
- Mac Mini (M1): 2 instances
- MacBook (M4): 3 instances
- Total: 9 instances voting on every request

**How it works**:
1. Each machine auto-discovers others via mDNS/Bonjour
2. Each swarm generates responses independently
3. Local consensus picks the best response per machine
4. Cross-swarm consensus votes across all machines
5. The best response is returned to the client

**To enable federation**:
```yaml
federation:
  enabled: true
  discovery_port: 8765   # mDNS/Bonjour discovery
  federation_port: 8766  # Inter-swarm communication
```

Machines will automatically discover each other within 10 seconds.
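The discovery step is ordinary mDNS service discovery. A rough sketch of what advertising and browsing for peers could look like with the `zeroconf` package; the service type `_localswarm._tcp.local.`, the instance name, the IP, and the port are illustrative assumptions, not the project's actual wire protocol:

```python
import socket
import time
from zeroconf import Zeroconf, ServiceInfo, ServiceBrowser

SERVICE_TYPE = "_localswarm._tcp.local."  # hypothetical service type

class PeerListener:
    def add_service(self, zc, type_, name):
        print("peer joined:", name)
    def remove_service(self, zc, type_, name):
        print("peer left:", name)
    def update_service(self, zc, type_, name):
        pass

zc = Zeroconf()
# Advertise this machine's federation endpoint
info = ServiceInfo(
    SERVICE_TYPE,
    "my-machine." + SERVICE_TYPE,                   # hypothetical instance name
    addresses=[socket.inet_aton("192.168.1.20")],   # this machine's LAN IP (example)
    port=8766,
)
zc.register_service(info)

# Browse for other swarms on the LAN
browser = ServiceBrowser(zc, SERVICE_TYPE, PeerListener())
try:
    time.sleep(30)  # keep discovering for a while
finally:
    zc.unregister_service(info)
    zc.close()
```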
## API Endpoints

### GET /v1/models
List available models.

### POST /v1/chat/completions
Chat completion with consensus.

**Request**:
```json
{
  "model": "local-swarm",
  "messages": [
    {"role": "user", "content": "Write a Python function to sort a list"}
  ]
}
```

**Response**:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-swarm",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "def sort_list(lst):\n    return sorted(lst)"
    },
    "finish_reason": "stop"
  }]
}
```
### GET /health
Health check.

### GET /metrics
Prometheus metrics (optional).
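Because the API is OpenAI-compatible, any OpenAI client can talk to it. A minimal sketch with the official `openai` Python package, assuming the swarm is running locally on the default port:

```python
from openai import OpenAI

# Point the client at the local swarm; the API key is accepted but ignored
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}],
)
print(response.choices[0].message.content)
```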
## Supported Models

Currently supported models (auto-selected based on hardware):

- **Qwen 2.5 Coder** (3B, 7B, 14B) - Recommended for coding tasks
- **DeepSeek Coder** (1.3B, 6.7B, 33B) - Good alternative
- **CodeLlama** (7B, 13B, 34B) - Meta's code model

All models support GGUF quantization:
- Q4_K_M - Good quality, smallest size (recommended)
- Q5_K_M - Better quality
- Q6_K - Best quality
## Troubleshooting

### Out of Memory
If you get OOM errors:
```bash
# Reduce instances
python -m local_swarm --instances 2

# Or use a smaller model
python -m local_swarm --model qwen2.5-coder:3b:q4
```
### Slow Performance
- Check GPU utilization with `nvidia-smi` (NVIDIA) or Activity Monitor (macOS)
- Ensure the model is cached (first run downloads to `~/.local_swarm/models`)
- Try reducing instances to avoid contention
### Windows: CUDA not detected
Make sure NVIDIA drivers are installed:
```powershell
nvidia-smi
```
If this fails, reinstall drivers from nvidia.com.
### macOS: MLX not found
```bash
pip install mlx-lm
```
### Linux: AMD GPU not detected
Ensure ROCm is installed:
```bash
rocm-smi
```
If not found, install from https://www.amd.com/en/developer/rocm-hub.html
### Linux: Intel GPU not detected
Install Intel oneAPI:
```bash
# Ubuntu/Debian
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | sudo gpg --dearmor -o /usr/share/keyrings/intel-oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-basekit
```
### Android: Termux issues
- Ensure Termux is installed from F-Droid (not Play Store)
- Run `pkg update` before installation
- Limited to small models (1-3B) due to RAM constraints
- Use CPU backend only (no GPU acceleration on Android yet)
## Requirements

- Python 3.9+
- 4GB+ RAM (8GB+ recommended)
- Optional: NVIDIA/AMD/Intel GPU with 4GB+ VRAM
- Optional: Apple Silicon Mac
- Optional: Android device with 8GB+ RAM (via Termux)
## Development

```bash
# Install dev dependencies
pip install -r requirements-dev.txt

# Run tests
pytest

# Run specific platform tests
pytest tests/test_hardware.py -v

# Format and lint code
black src/
ruff check src/
```
## Architecture

### Single Machine

```
┌─────────────────────────────────────┐
│          OpenAI API Client          │
│           (opencode, etc.)          │
└─────────────┬───────────────────────┘
              │ HTTP
              ▼
┌─────────────────────────────────────┐
│       Local Swarm API Server        │
│     (FastAPI / localhost:8000)      │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│            Swarm Manager            │
│  ┌─────────┐  ┌─────────┐           │
│  │ Worker 1│  │ Worker 2│   ...     │
│  │(LLM #1) │  │(LLM #2) │           │
│  └────┬────┘  └────┬────┘           │
│       │            │                │
│       └─────┬──────┘                │
│             ▼                       │
│        Consensus Engine             │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│     Backend (llama.cpp / MLX)       │
│   ┌─────────────────────┐           │
│   │   GGUF/MLX Model    │           │
│   │  (Qwen/Codellama)   │           │
│   └─────────────────────┘           │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│  Hardware (GPU/CPU/Apple Silicon)   │
└─────────────────────────────────────┘
```
### Federated Swarm (Multiple Machines)

```
┌─────────────────────────────────────────────────────────────┐
│                        Local Network                        │
│                                                             │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
│  │  Windows PC  │   │   Mac Mini   │   │   MacBook    │     │
│  │  (RTX 4060)  │   │     (M1)     │   │     (M4)     │     │
│  │ 4 instances  │   │ 2 instances  │   │ 3 instances  │     │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘     │
│         │                  │                  │             │
│         │                  │                  │             │
│         └──────────────────┼──────────────────┘             │
│                            │                                │
│                   ┌────────┴────────┐                       │
│                   │   Cross-Swarm   │                       │
│                   │    Consensus    │                       │
│                   └────────┬────────┘                       │
│                            │                                │
│                   ┌────────▼────────┐                       │
│                   │    opencode     │                       │
│                   └─────────────────┘                       │
└─────────────────────────────────────────────────────────────┘
```
## License

MIT License - See LICENSE file

## Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

## Acknowledgments

- [llama.cpp](https://github.com/ggerganov/llama.cpp) - Inference engine (CUDA/ROCm/SYCL)
- [MLX](https://github.com/ml-explore/mlx) - Apple Silicon backend
- [Qwen](https://github.com/QwenLM/Qwen) - Model family
- [DeepSeek](https://github.com/deepseek-ai/deepseek-coder) - Model family
- [HuggingFace](https://huggingface.co) - Model hosting
- [ROCm](https://github.com/RadeonOpenCompute/ROCm) - AMD GPU support
- [oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html) - Intel GPU support
- [Termux](https://termux.dev) - Android terminal emulator