# Local Swarm

Run a swarm of local LLMs on your hardware. Multiple models work together to give you the best answer through consensus voting.

## What It Does

- **Auto-detects your hardware** (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
- **Downloads and runs multiple LLM instances** optimized for your VRAM/RAM
- **Uses consensus voting** - all instances answer, best response wins
- **Connects multiple machines** on your network for a "hive mind" effect
- **Provides an OpenAI-compatible API** at `http://localhost:17615/v1`

## Quick Start

```bash
# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt

# Run it
python main.py
```

On first run, it will:

1. Detect your hardware
2. Pick the best model and quantization
3. Download the model (one-time)
4. Start multiple LLM workers
5. Expose the API at `http://localhost:17615`
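
Once the workers are up, you can sanity-check the server from Python (a minimal check against the default port, assuming the `requests` package is installed):

```python
import requests

# Hit the swarm's health endpoint on the default port (17615)
resp = requests.get("http://localhost:17615/health", timeout=5)
print(resp.status_code, resp.text)
```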

## Usage

### Interactive Mode (default)

```bash
python main.py
```

Shows a menu with:

- Recommended configuration (auto-selected)
- Browse all compatible models
- Custom configuration wizard

### Auto Mode (no menu)

```bash
python main.py --auto
```

### With Other Options

```bash
python main.py --model qwen:3b:q4 # Use specific model
python main.py --instances 4 # Force 4 workers
python main.py --port 8080 # Custom port
python main.py --detect # Show hardware info only
python main.py --federation # Enable network federation
python main.py --mcp # Enable MCP server
python main.py --use-opencode-tools # Use opencode tools (adds ~27k tokens)
```

**Tool Mode Options:**

- Default: Local tool server (~125 tokens, saves context window space)
- `--use-opencode-tools`: Full opencode tool definitions (~27k tokens, more capabilities)

## Connect to Opencode

Add to your opencode config:

```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```

## Network Federation (Hive Mind)

Run on multiple machines to combine their power:

```bash
# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation

# Machine 2 (Mac Mini M1)
python main.py --auto --federation

# Machine 3 (Old laptop)
python main.py --auto --federation
```

Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses **objective quality scoring** to pick the best answer, rather than self-reported confidence. This prevents smaller models from overruling better models.

**Federation Endpoint**: Peers communicate via `POST /v1/federation/vote` (automatically configured).

## How Consensus Works

1. Your prompt goes to all LLM instances
2. Each instance generates a response independently
3. The consensus algorithm picks the best answer:
   - **Similarity** (default): Groups responses by meaning, picks the largest group (see the sketch after this list)
   - **Quality**: Scores on completeness, code blocks, structure
   - **Fastest**: Returns the quickest response
   - **Majority**: Simple text match voting
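
The sketch below illustrates the similarity idea only; it is not the project's actual implementation in `src/swarm/consensus.py`. It groups near-duplicate answers with Python's standard `difflib` and returns a member of the largest group:

```python
from difflib import SequenceMatcher

def pick_by_similarity(responses: list[str], threshold: float = 0.8) -> str:
    """Group near-identical responses and return one from the largest group."""
    groups: list[list[str]] = []
    for text in responses:
        for group in groups:
            # Compare against the group's first member as its representative
            if SequenceMatcher(None, text, group[0]).ratio() >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])  # no similar group found, start a new one
    largest = max(groups, key=len)
    return largest[0]

# Example: three workers agree, one diverges
answers = ["Use a dict.", "Use a dict!", "Use a dict.", "Sort the list first."]
print(pick_by_similarity(answers))  # -> "Use a dict."
```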

## Configuration

Create `config.yaml`:

```yaml
server:
  host: "127.0.0.1"
  port: 17615

swarm:
  consensus_strategy: "similarity" # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8

federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10
```

## Supported Hardware

| Hardware | Backend | Notes |
|----------|---------|-------|
| NVIDIA GPU | llama.cpp (CUDA) | Best performance |
| AMD GPU | llama.cpp (ROCm) | Linux/Windows |
| Intel GPU | llama.cpp (SYCL) | Linux/Windows |
| Apple Silicon | MLX | Native Metal |
| Qualcomm | llama.cpp (CPU) | Android/Termux |
| CPU-only | llama.cpp | Slower but works |
## Supported Models
|
|
|
|
- **Qwen 2.5 Coder** (3B, 7B, 14B) - Recommended
|
|
- **DeepSeek Coder** (1.3B, 6.7B, 33B)
|
|
- **CodeLlama** (7B, 13B, 34B)
|
|
|
|
All support GGUF quantization (Q4_K_M recommended).
|
|
|
|
## API Endpoints
|
|
|
|
- `GET /v1/models` - List available models
|
|
- `POST /v1/chat/completions` - Chat completion with consensus
|
|
- `GET /health` - Health check
|
|
- `POST /v1/federation/vote` - Federation voting (used internally between peers)
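
Because the server is OpenAI-compatible, any standard OpenAI client can talk to it. Here is a minimal example with the official `openai` Python package, reusing the base URL, placeholder key, and `local-swarm` model name from the Opencode config above:

```python
from openai import OpenAI

# Point a standard OpenAI client at the local swarm
client = OpenAI(base_url="http://localhost:17615/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(response.choices[0].message.content)
```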

## Troubleshooting

### Out of Memory

```bash
python main.py --instances 2 # Reduce workers
python main.py --model qwen:3b:q4 # Use smaller model
```

### Slow Performance

- Check GPU utilization with `nvidia-smi`
- Reduce instances to avoid contention
- Use Q4 quantization instead of Q6

### CUDA Not Detected (Windows)

```powershell
nvidia-smi # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```

### macOS: MLX Not Found

```bash
pip install mlx-lm
```

## Project Structure

```
local_swarm/
├── main.py                         # CLI entry point (99 lines)
├── src/
│   ├── api/                        # OpenAI-compatible API
│   │   ├── routes.py               # HTTP routing (252 lines)
│   │   ├── formatting.py           # Message formatting
│   │   ├── tool_parser.py          # Tool call parsing
│   │   ├── chat_handlers.py        # Chat completion logic
│   │   └── models.py               # API data models
│   ├── cli/                        # Command-line interface
│   │   ├── parser.py               # CLI argument parsing
│   │   ├── main_runner.py          # Main application logic
│   │   ├── server_runner.py        # Server management
│   │   └── test_runner.py          # Test mode execution
│   ├── swarm/                      # Swarm orchestration
│   │   ├── manager.py              # Swarm manager
│   │   ├── worker.py               # LLM worker implementation
│   │   ├── consensus.py            # Consensus algorithms
│   │   └── orchestrator.py         # Generation orchestration
│   ├── models/                     # Model management
│   │   ├── registry.py             # Model registry (194 lines)
│   │   ├── selector.py             # Model selection (329 lines)
│   │   ├── memory_calculator.py    # Memory calculations
│   │   └── downloader.py           # Model downloading
│   ├── hardware/                   # Hardware detection
│   │   ├── detector.py             # Hardware detection
│   │   ├── nvidia.py               # NVIDIA GPU detection
│   │   ├── intel.py                # Intel GPU detection
│   │   └── qualcomm.py             # Qualcomm detection
│   ├── network/                    # Network federation
│   │   ├── federation.py           # Cross-swarm consensus
│   │   └── discovery.py            # Peer discovery
│   ├── backends/                   # LLM backends
│   │   ├── llama_cpp.py            # llama.cpp backend
│   │   ├── mlx.py                  # Apple Silicon MLX backend
│   │   └── base.py                 # Base backend interface
│   ├── interactive/                # Interactive CLI
│   │   ├── ui.py                   # UI utilities
│   │   ├── display.py              # Hardware display
│   │   └── tips.py                 # Help content
│   ├── tools/                      # Tool execution
│   │   └── executor.py             # Tool execution engine
│   └── utils/                      # Shared utilities
│       ├── token_counter.py        # Token counting
│       ├── project_discovery.py    # Project root discovery
│       └── network.py              # Network utilities
├── config/                         # Configuration files
│   └── models/                     # Model configurations
│       ├── model_metadata.json     # Model metadata
│       ├── mlx_quant_sizes.json    # MLX quantization sizes
│       ├── gguf_quant_sizes.json   # GGUF quantization sizes
│       └── selector_config.json    # Selection constants
└── docs/                           # Documentation
```

### Architecture Principles

- **Modular Design**: Each module has a single, focused responsibility
- **Configuration Over Code**: Static data extracted to JSON config files
- **Separation of Concerns**: API, CLI, and business logic are cleanly separated
- **No Files > 300 Lines**: Most modules kept under 300 lines for maintainability

## Development

### Code Quality Standards

This project follows strict code quality standards:

- **File Size**: No files > 300 lines (with few exceptions)
- **Function Size**: No functions > 50 lines
- **Nesting Depth**: No indentation > 3 levels
- **DRY Principle**: No duplicate code (>3 lines)
- **Single Responsibility**: Each module does one thing
- **Configuration Over Code**: Static data in JSON configs

### Running Tests

```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_tool_parsing.py -v

# Run with coverage
python -m pytest tests/ --cov=src
```

### Recent Refactoring

Major refactoring completed to improve modularity:

**Before**: Monolithic files (main.py: 556 lines, routes.py: 1,183 lines)
**After**: Modular architecture (main.py: 99 lines, routes.py: 252 lines)

**Changes**:

- Extracted API logic into focused modules (formatting, parsing, handlers)
- Created CLI package with separated concerns (parser, runner, server)
- Moved hardcoded model data to JSON configuration files
- Created shared utility modules (token_counter, project_discovery, network)
- Reduced code duplication across the codebase

See `docs/ARCHITECTURE.md` for detailed architecture documentation.

## Recent Improvements

### ✅ Universal Tool Support (2025-02-25)

- Tool instructions automatically injected for **all** clients (Continue, hollama, curl, etc.)
- No client-side configuration needed - just use the API
- Enhanced file operation guidance: model uses ls/grep to verify files exist before reading
- Working directory auto-extraction from prompts (`in /path/to/dir` patterns)
- Proper OpenAI tool format with unique IDs and tool_call_id linking
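
For reference, the last point refers to the standard OpenAI tool-calling message shape: the assistant message carries tool calls with unique IDs, and each tool result links back via `tool_call_id`. The tool name and arguments below are purely illustrative:

```python
# Assistant turn: the model requests a tool call with a unique id
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",  # unique per call
        "type": "function",
        "function": {"name": "ls", "arguments": '{"path": "."}'},
    }],
}

# Tool turn: the result is linked back to the call via tool_call_id
tool_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",
    "content": "main.py\nsrc/\nconfig/\ndocs/",
}
```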

### ✅ OpenCode-Compatible Streaming (2025-02-25)

- Proper `reasoning_content` field for "Thinking..." collapsible blocks
- Multi-chunk `tool_calls` streaming matching Vercel AI SDK format
- Final answer delivered in `content` field after tool execution
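
Roughly, the streamed chunks look like the following (simplified and illustrative; the exact fields the server emits may differ):

```python
# Illustrative sequence of streaming delta payloads (real chunks carry more fields)
example_chunks = [
    # 1. Reasoning streamed via reasoning_content -> rendered as a "Thinking..." block
    {"choices": [{"delta": {"reasoning_content": "Need to list the files first."}}]},
    # 2. Tool calls streamed incrementally via tool_calls
    {"choices": [{"delta": {"tool_calls": [{
        "index": 0,
        "id": "call_abc123",
        "function": {"name": "ls", "arguments": '{"path": "."}'},
    }]}}]},
    # 3. Final answer delivered in content after the tool has run
    {"choices": [{"delta": {"content": "The project contains main.py, src/, config/ and docs/."}}]},
]
```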

### ✅ Federation Quality Voting (2025-02-25)

- Head node now **objectively judges** all peer responses using quality metrics
- No more reliance on self-reported confidence (which was biased toward the local node)
- All responses scored on length, structure, and completeness (sketched below)
- Fair competition: 14B models properly beat 3B on quality tasks
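
As with the consensus sketch earlier, this is only an illustration of the scoring idea, not the code in `src/network/federation.py`: each response gets a crude score for length, structure, and completeness, and the head node keeps the highest-scoring peer:

```python
def quality_score(text: str) -> float:
    """Crude objective score based on length, structure, and completeness."""
    lines = text.splitlines()
    score = min(len(text) / 1000, 1.0)  # length, capped at 1.0
    score += 0.3 * sum(l.lstrip().startswith(("-", "*", "1.")) for l in lines)  # structured lists
    if text.rstrip().endswith((".", "!", "?")):  # looks complete, not truncated
        score += 0.5
    return score

def pick_best(peer_responses: dict[str, str]) -> str:
    """Return the peer whose response scores highest."""
    return max(peer_responses, key=lambda peer: quality_score(peer_responses[peer]))

# Example: a structured, complete 14B answer beats a terse 3B one
responses = {
    "local-3b": "Use a loop",
    "peer-14b": "Use a loop:\n- iterate over items\n- accumulate the result\nThis runs in O(n).",
}
print(pick_best(responses))  # -> "peer-14b"
```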

### 🚧 Planned Features

- **Plan Mode**: Disable tool execution for planning-only conversations (`--plan-mode`)
- **Tool Consensus**: Verify tool calls across multiple workers before execution (for critical operations)

## Contributing

Contributions are welcome! Please ensure:

1. Code follows the quality standards above
2. All tests pass
3. New features include tests
4. Documentation is updated

## License

MIT License