# Local Swarm
Run a swarm of local LLMs on your hardware. Multiple model instances work together to give you the best answer through consensus voting.
## What It Does
- **Auto-detects your hardware** (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
- **Downloads and runs multiple LLM instances** optimized for your VRAM/RAM
- **Uses consensus voting** - all instances answer, best response wins
- **Connects multiple machines** on your network for a "hive mind" effect
- **Provides an OpenAI-compatible API** at `http://localhost:17615/v1`
## Quick Start
```bash
# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt
# Run it
python main.py
```
On first run, it will:
1. Detect your hardware
2. Pick the best model and quantization
3. Download the model (one-time)
4. Start multiple LLM workers
5. Expose the API at `http://localhost:17615`
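To confirm everything came up, you can poke the two read-only endpoints from Python (a minimal sketch using the `requests` library; it just prints whatever the server returns):
```python
import requests

BASE = "http://localhost:17615"

# Health check and OpenAI-style model list, as documented under "API Endpoints".
print(requests.get(f"{BASE}/health").json())
print(requests.get(f"{BASE}/v1/models").json())
```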
## Usage
### Interactive Mode (default)
```bash
python main.py
```
Shows a menu with:
- Recommended configuration (auto-selected)
- Browse all compatible models
- Custom configuration wizard
### Auto Mode (no menu)
```bash
python main.py --auto
```
### With Other Options
```bash
python main.py --model qwen:3b:q4     # Use specific model
python main.py --instances 4          # Force 4 workers
python main.py --port 8080            # Custom port
python main.py --detect               # Show hardware info only
python main.py --federation           # Enable network federation
python main.py --mcp                  # Enable MCP server
python main.py --use-opencode-tools   # Use opencode tools (adds ~27k tokens)
```
**Tool Mode Options:**
- Default: Local tool server (~125 tokens, saves context window space)
- `--use-opencode-tools`: Full opencode tool definitions (~27k tokens, more capabilities)
## Connect to Opencode
Add to your opencode config:
```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```
## Network Federation (Hive Mind)
Run on multiple machines to combine their power:
```bash
# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation
# Machine 2 (Mac Mini M1)
python main.py --auto --federation
# Machine 3 (Old laptop)
python main.py --auto --federation
```
Machines auto-discover each other and vote together on every request.
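To check that discovery worked, each node exposes its peer list (a quick sketch with `requests`; the exact fields in the response depend on the server, so this just prints the raw JSON):
```python
import requests

# Ask a local node which peers it has discovered on the network.
peers = requests.get("http://localhost:17615/v1/federation/peers").json()
print(peers)
```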
## How Consensus Works
1. Your prompt goes to all LLM instances
2. Each instance generates a response independently
3. The consensus algorithm picks the best answer:
- **Similarity** (default): Groups responses by meaning, picks the largest group
- **Quality**: Scores on completeness, code blocks, structure
- **Fastest**: Returns the quickest response
- **Majority**: Simple text match voting
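The similarity strategy is effectively majority voting over clusters of near-identical answers. A minimal sketch of the idea (not the project's actual implementation; real semantic grouping would use embeddings, while this stand-in uses `difflib` string similarity):
```python
from difflib import SequenceMatcher

def similarity_consensus(responses: list[str], threshold: float = 0.8) -> str:
    """Cluster near-duplicate responses; return one from the largest cluster."""
    groups: list[list[str]] = []
    for response in responses:
        for group in groups:
            if SequenceMatcher(None, response, group[0]).ratio() >= threshold:
                group.append(response)
                break
        else:
            groups.append([response])
    # The biggest cluster is the consensus; return its first member.
    return max(groups, key=len)[0]

print(similarity_consensus(["4", "4", "It is four.", "5"]))  # -> "4"
```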
## Configuration
Create `config.yaml`:
```yaml
server:
  host: "127.0.0.1"
  port: 17615

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8

federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10
```
## Supported Hardware
| Hardware | Backend | Notes |
|----------|---------|-------|
| NVIDIA GPU | llama.cpp (CUDA) | Best performance |
| AMD GPU | llama.cpp (ROCm) | Linux/Windows |
| Intel GPU | llama.cpp (SYCL) | Linux/Windows |
| Apple Silicon | MLX | Native Metal |
| Qualcomm | llama.cpp (CPU) | Android/Termux |
| CPU-only | llama.cpp | Slower but works |
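The `--detect` flag prints what detection found for your machine. The kind of checks involved look roughly like this (a simplified sketch, not the project's detection code):
```python
import platform
import shutil

def guess_backend() -> str:
    """Rough backend guess from hints visible to a plain Python process."""
    if shutil.which("nvidia-smi"):
        return "llama.cpp (CUDA)"
    if shutil.which("rocm-smi"):
        return "llama.cpp (ROCm)"
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "MLX"
    return "llama.cpp (CPU)"

print(guess_backend())
```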
## Supported Models
- **Qwen 2.5 Coder** (3B, 7B, 14B) - Recommended
- **DeepSeek Coder** (1.3B, 6.7B, 33B)
- **CodeLlama** (7B, 13B, 34B)
All support GGUF quantization (Q4_K_M recommended).
## API Endpoints
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion with consensus
- `GET /health` - Health check
- `GET /v1/federation/peers` - List discovered peers (when federation enabled)
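A consensus-backed completion is requested like any OpenAI-style chat completion (a sketch with `requests`; field names follow the standard OpenAI schema, and `local-swarm` is the model name used in the Opencode config above):
```python
import requests

resp = requests.post(
    "http://localhost:17615/v1/chat/completions",
    json={
        "model": "local-swarm",
        "messages": [{"role": "user", "content": "Write a one-line Python hello."}],
    },
)
resp.raise_for_status()
# Standard OpenAI response shape: the consensus answer is the first choice.
print(resp.json()["choices"][0]["message"]["content"])
```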
## Troubleshooting
### Out of Memory
```bash
python main.py --instances 2        # Reduce workers
python main.py --model qwen:3b:q4   # Use smaller model
```
### Slow Performance
- Check GPU utilization with `nvidia-smi`
- Reduce instances to avoid contention
- Use Q4 quantization instead of Q6
### CUDA Not Detected (Windows)
```powershell
nvidia-smi # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```
### macOS: MLX Not Found
```bash
pip install mlx-lm
```
## Project Structure
```
local_swarm/
├── main.py          # CLI entry point
├── src/
│   ├── hardware/    # GPU detection (NVIDIA, AMD, Intel, Apple, Qualcomm)
│   ├── models/      # Model registry, selection, downloading
│   ├── backends/    # llama.cpp and MLX backends
│   ├── swarm/       # Worker management and consensus
│   ├── network/     # Federation and peer discovery
│   ├── api/         # OpenAI-compatible API server
│   └── tools/       # Tool execution (read, write, bash)
└── docs/            # Documentation
```
## License
MIT License