Local Swarm
Run a swarm of local LLMs on your hardware. Multiple models work together to give you the best answer through consensus voting.
What It Does
- Auto-detects your hardware (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
- Downloads and runs multiple LLM instances optimized for your VRAM/RAM
- Uses consensus voting - all instances answer, best response wins
- Connects multiple machines on your network for a "hive mind" effect
- Provides an OpenAI-compatible API at http://localhost:17615/v1
Quick Start
# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt
# Run it
python main.py
On first run, it will:
- Detect your hardware
- Pick the best model and quantization
- Download the model (one-time)
- Start multiple LLM workers
- Expose the API at http://localhost:17615
Usage
Interactive Mode (default)
python main.py
Shows a menu with:
- Recommended configuration (auto-selected)
- Browse all compatible models
- Custom configuration wizard
Auto Mode (no menu)
python main.py --auto
With Other Options
python main.py --model qwen:3b:q4 # Use specific model
python main.py --instances 4 # Force 4 workers
python main.py --port 8080 # Custom port
python main.py --detect # Show hardware info only
python main.py --federation # Enable network federation
python main.py --mcp # Enable MCP server
python main.py --use-opencode-tools # Use opencode tools (adds ~27k tokens)
Tool Mode Options:
- Default: Local tool server (~125 tokens, saves context window space)
- --use-opencode-tools: Full opencode tool definitions (~27k tokens, more capabilities)
Connect to Opencode
Add to your opencode config:
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
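Any OpenAI-compatible client can talk to the same endpoint. Here is a minimal sketch using the official openai Python package; the model name "local-swarm" and the dummy API key mirror the config above:

```python
# Minimal sketch: querying the local swarm with the OpenAI Python client.
# Assumes the server from `python main.py` is listening on localhost:17615.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:17615/v1",
    api_key="not-needed",  # the local server does not check API keys
)

response = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)

print(response.choices[0].message.content)
```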
Network Federation (Hive Mind)
Run on multiple machines to combine their power:
# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation
# Machine 2 (Mac Mini M1)
python main.py --auto --federation
# Machine 3 (Old laptop)
python main.py --auto --federation
Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses objective quality scoring, not self-reported confidence, to pick the best answer. This prevents smaller models from overruling better ones.
Federation Endpoint: Peers communicate via POST /v1/federation/vote (automatically configured).
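To illustrate the mDNS discovery mentioned above, here is a rough sketch using the python-zeroconf library. The service type name and addresses are assumptions for illustration; the actual logic lives in src/network/discovery.py.

```python
# Illustrative sketch of mDNS peer discovery with python-zeroconf.
# The service type and the advertised address are hypothetical.
import socket
from zeroconf import Zeroconf, ServiceInfo, ServiceBrowser

SERVICE_TYPE = "_localswarm._tcp.local."  # assumed service type

class PeerListener:
    def add_service(self, zc, type_, name):
        info = zc.get_service_info(type_, name)
        if info:
            addr = socket.inet_ntoa(info.addresses[0])
            print(f"Discovered peer {name} at {addr}:{info.port}")

    def remove_service(self, zc, type_, name):
        print(f"Peer left: {name}")

    def update_service(self, zc, type_, name):
        pass

zc = Zeroconf()
# Advertise this node so other swarms on the LAN can find it.
info = ServiceInfo(
    SERVICE_TYPE,
    f"my-node.{SERVICE_TYPE}",
    addresses=[socket.inet_aton("192.168.1.10")],  # replace with your LAN address
    port=8765,
)
zc.register_service(info)
# Watch for other nodes announcing the same service type.
browser = ServiceBrowser(zc, SERVICE_TYPE, PeerListener())
```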
How Consensus Works
- Your prompt goes to all LLM instances
- Each instance generates a response independently
- The consensus algorithm picks the best answer:
- Similarity (default): Groups responses by meaning, picks the largest group
- Quality: Scores on completeness, code blocks, structure
- Fastest: Returns the quickest response
- Majority: Simple text match voting
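As a rough illustration of the similarity strategy listed above, the sketch below clusters responses by text similarity and returns a member of the largest cluster. It uses the standard library's difflib; the real algorithm in src/swarm/consensus.py may use different measures and thresholds.

```python
# Minimal sketch of similarity-based consensus (illustrative only).
from difflib import SequenceMatcher

def pick_by_similarity(responses: list[str], threshold: float = 0.7) -> str:
    clusters: list[list[str]] = []
    for resp in responses:
        for cluster in clusters:
            # Compare against the cluster's first member as its representative.
            if SequenceMatcher(None, resp, cluster[0]).ratio() >= threshold:
                cluster.append(resp)
                break
        else:
            clusters.append([resp])
    # The largest cluster is treated as the consensus answer.
    largest = max(clusters, key=len)
    return largest[0]

answers = [
    "def add(a, b): return a + b",
    "def add(x, y): return x + y",
    "print('hello')",
]
print(pick_by_similarity(answers))
```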
Configuration
Create config.yaml:
server:
  host: "127.0.0.1"
  port: 17615
swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8
federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10
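A sketch of how such a file could be loaded with PyYAML and merged over defaults is shown below. The key names mirror the example above; the project's actual loader may differ.

```python
# Sketch: reading config.yaml and falling back to defaults (illustrative only).
import yaml

DEFAULTS = {
    "server": {"host": "127.0.0.1", "port": 17615},
    "swarm": {"consensus_strategy": "similarity", "min_instances": 2, "max_instances": 8},
    "federation": {"enabled": False, "discovery_port": 8765, "max_peers": 10},
}

def load_config(path: str = "config.yaml") -> dict:
    try:
        with open(path) as f:
            user_cfg = yaml.safe_load(f) or {}
    except FileNotFoundError:
        user_cfg = {}
    # Per-section merge: user values override defaults.
    return {section: {**values, **user_cfg.get(section, {})}
            for section, values in DEFAULTS.items()}

print(load_config()["swarm"]["consensus_strategy"])
```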
Supported Hardware
| Hardware | Backend | Notes |
|---|---|---|
| NVIDIA GPU | llama.cpp (CUDA) | Best performance |
| AMD GPU | llama.cpp (ROCm) | Linux/Windows |
| Intel GPU | llama.cpp (SYCL) | Linux/Windows |
| Apple Silicon | MLX | Native Metal |
| Qualcomm | llama.cpp (CPU) | Android/Termux |
| CPU-only | llama.cpp | Slower but works |
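A rough sketch of how a backend could be chosen from the table above is shown below. The real detection in src/hardware/detector.py is more thorough; this only checks the platform and a couple of well-known vendor tools.

```python
# Rough sketch of backend selection based on the hardware table above.
import platform
import shutil

def pick_backend() -> str:
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"                    # Apple Silicon -> MLX (native Metal)
    if shutil.which("nvidia-smi"):
        return "llama.cpp (CUDA)"       # NVIDIA GPU present
    if shutil.which("rocm-smi"):
        return "llama.cpp (ROCm)"       # AMD GPU present
    return "llama.cpp (CPU)"            # fallback: CPU-only

print(pick_backend())
```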
Supported Models
- Qwen 2.5 Coder (3B, 7B, 14B) - Recommended
- DeepSeek Coder (1.3B, 6.7B, 33B)
- CodeLlama (7B, 13B, 34B)
All support GGUF quantization (Q4_K_M recommended).
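For reference, the --model flag shown earlier uses a family:size:quant spec (e.g. qwen:3b:q4). A hypothetical parser for that spec might look like this; the actual selection logic lives in src/models/selector.py.

```python
# Hypothetical parser for the --model spec (family:size:quant).
def parse_model_spec(spec: str) -> dict:
    family, size, quant = spec.split(":")
    return {"family": family, "size": size.upper(), "quant": quant.upper()}

print(parse_model_spec("qwen:3b:q4"))
# {'family': 'qwen', 'size': '3B', 'quant': 'Q4'}
```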
API Endpoints
- GET /v1/models - List available models
- POST /v1/chat/completions - Chat completion with consensus
- GET /health - Health check
- POST /v1/federation/vote - Federation voting (used internally between peers)
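A quick sketch of exercising the public endpoints with the requests library (the response shape follows the standard OpenAI chat-completions format):

```python
# Sketch: calling the public endpoints with requests.
import requests

BASE = "http://localhost:17615"

print(requests.get(f"{BASE}/health").json())      # health check
print(requests.get(f"{BASE}/v1/models").json())   # list available models

completion = requests.post(
    f"{BASE}/v1/chat/completions",
    json={
        "model": "local-swarm",
        "messages": [{"role": "user", "content": "Explain consensus voting in one sentence."}],
    },
)
print(completion.json()["choices"][0]["message"]["content"])
```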
Troubleshooting
Out of Memory
python main.py --instances 2 # Reduce workers
python main.py --model qwen:3b:q4 # Use smaller model
Slow Performance
- Check GPU utilization with nvidia-smi
- Reduce instances to avoid contention
- Use Q4 quantization instead of Q6
CUDA Not Detected (Windows)
nvidia-smi # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
macOS: MLX Not Found
pip install mlx-lm
Project Structure
local_swarm/
├── main.py # CLI entry point (99 lines)
├── src/
│ ├── api/ # OpenAI-compatible API
│ │ ├── routes.py # HTTP routing (252 lines)
│ │ ├── formatting.py # Message formatting
│ │ ├── tool_parser.py # Tool call parsing
│ │ ├── chat_handlers.py # Chat completion logic
│ │ └── models.py # API data models
│ ├── cli/ # Command-line interface
│ │ ├── parser.py # CLI argument parsing
│ │ ├── main_runner.py # Main application logic
│ │ ├── server_runner.py # Server management
│ │ └── test_runner.py # Test mode execution
│ ├── swarm/ # Swarm orchestration
│ │ ├── manager.py # Swarm manager
│ │ ├── worker.py # LLM worker implementation
│ │ ├── consensus.py # Consensus algorithms
│ │ └── orchestrator.py # Generation orchestration
│ ├── models/ # Model management
│ │ ├── registry.py # Model registry (194 lines)
│ │ ├── selector.py # Model selection (329 lines)
│ │ ├── memory_calculator.py # Memory calculations
│ │ └── downloader.py # Model downloading
│ ├── hardware/ # Hardware detection
│ │ ├── detector.py # Hardware detection
│ │ ├── nvidia.py # NVIDIA GPU detection
│ │ ├── intel.py # Intel GPU detection
│ │ └── qualcomm.py # Qualcomm detection
│ ├── network/ # Network federation
│ │ ├── federation.py # Cross-swarm consensus
│ │ └── discovery.py # Peer discovery
│ ├── backends/ # LLM backends
│ │ ├── llama_cpp.py # llama.cpp backend
│ │ ├── mlx.py # Apple Silicon MLX backend
│ │ └── base.py # Base backend interface
│ ├── interactive/ # Interactive CLI
│ │ ├── ui.py # UI utilities
│ │ ├── display.py # Hardware display
│ │ └── tips.py # Help content
│ ├── tools/ # Tool execution
│ │ └── executor.py # Tool execution engine
│ └── utils/ # Shared utilities
│ ├── token_counter.py # Token counting
│ ├── project_discovery.py # Project root discovery
│ └── network.py # Network utilities
├── config/ # Configuration files
│ └── models/ # Model configurations
│ ├── model_metadata.json # Model metadata
│ ├── mlx_quant_sizes.json # MLX quantization sizes
│ ├── gguf_quant_sizes.json # GGUF quantization sizes
│ └── selector_config.json # Selection constants
└── docs/ # Documentation
Architecture Principles
- Modular Design: Each module has a single, focused responsibility
- Configuration Over Code: Static data extracted to JSON config files
- Separation of Concerns: API, CLI, and business logic are cleanly separated
- No Files > 300 Lines: Modules are kept under 300 lines for maintainability (with rare exceptions)
Development
Code Quality Standards
This project follows strict code quality standards:
- File Size: No files > 300 lines (with few exceptions)
- Function Size: No functions > 50 lines
- Nesting Depth: No indentation > 3 levels
- DRY Principle: No duplicate code (>3 lines)
- Single Responsibility: Each module does one thing
- Configuration Over Code: Static data in JSON configs
Running Tests
# Run all tests
python -m pytest tests/ -v
# Run specific test file
python -m pytest tests/test_tool_parsing.py -v
# Run with coverage
python -m pytest tests/ --cov=src
Recent Refactoring
Major refactoring completed to improve modularity:
Before: Monolithic files (main.py: 556 lines, routes.py: 1,183 lines)
After: Modular architecture (main.py: 99 lines, routes.py: 252 lines)
Changes:
- Extracted API logic into focused modules (formatting, parsing, handlers)
- Created CLI package with separated concerns (parser, runner, server)
- Moved hardcoded model data to JSON configuration files
- Created shared utility modules (token_counter, project_discovery, network)
- Reduced code duplication across the codebase
See docs/ARCHITECTURE.md for detailed architecture documentation.
Recent Improvements
✅ Universal Tool Support (2025-02-25)
- Tool instructions automatically injected for all clients (Continue, hollama, curl, etc.)
- No client-side configuration needed - just use the API
- Enhanced file operation guidance: model uses ls/grep to verify files exist before reading
- Working directory auto-extraction from prompts (
in /path/to/dirpatterns) - Proper OpenAI tool format with unique IDs and tool_call_id linking
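As a hypothetical illustration of the working-directory extraction described above, a simple regex over the "in /path/to/dir" pattern could look like this (the actual implementation may differ):

```python
# Hypothetical sketch of extracting a working directory from a prompt.
import re

def extract_working_dir(prompt: str) -> str | None:
    match = re.search(r"\bin\s+(/[\w./-]+)", prompt)
    return match.group(1) if match else None

print(extract_working_dir("List the tests in /home/user/local_swarm/tests please"))
# /home/user/local_swarm/tests
```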
✅ OpenCode-Compatible Streaming (2025-02-25)
- Proper reasoning_content field for "Thinking..." collapsible blocks
- Multi-chunk tool_calls streaming matching the Vercel AI SDK format
- Final answer delivered in the content field after tool execution
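As a rough illustration, the streamed deltas described above look approximately like the dictionaries below. The layout follows the OpenAI streaming format; exact field values are assumptions.

```python
# Approximate shapes of the streamed deltas (illustrative only).
reasoning_chunk = {
    "choices": [{"index": 0, "delta": {"reasoning_content": "Checking which files exist..."}}]
}
tool_call_chunk = {
    "choices": [{"index": 0, "delta": {"tool_calls": [{
        "index": 0,
        "id": "call_abc123",               # unique ID, later referenced by tool_call_id
        "type": "function",
        "function": {"name": "ls", "arguments": '{"path": "."}'},
    }]}}]
}
final_chunk = {
    "choices": [{"index": 0, "delta": {"content": "The directory contains 3 Python files."}}]
}
```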
✅ Federation Quality Voting (2025-02-25)
- Head node now objectively judges all peer responses using quality metrics
- No more reliance on self-reported confidence (which was biased toward the local node)
- All responses scored on length, structure, completeness
- Fair competition: 14B models properly beat 3B on quality tasks
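To make the scoring idea concrete, here is a minimal sketch of scoring responses on length, structure, and completeness and letting the head node keep the best one. The real metric in src/network/federation.py may weight things differently.

```python
# Illustrative sketch of objective quality scoring across peer responses.
def quality_score(text: str) -> float:
    score = 0.0
    score += min(len(text) / 1000, 1.0)            # reward length, capped at 1
    if "def " in text or "class " in text:
        score += 1.0                               # contains code
    score += 0.25 * text.count("\n- ")             # bullet/list structure
    if text.rstrip().endswith((".", ")", ":")):
        score += 0.5                               # looks complete, not cut off
    return score

def pick_best(peer_responses: dict[str, str]) -> str:
    # The head node scores every peer's response and keeps the highest scorer.
    return max(peer_responses, key=lambda peer: quality_score(peer_responses[peer]))

print(pick_best({
    "local-3b": "Short answer.",
    "peer-14b": "Detailed answer.\n- step 1\n- step 2\ndef demo():\n    return 42",
}))
```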
🚧 Planned Features
- Plan Mode: Disable tool execution for planning-only conversations (--plan-mode)
- Tool Consensus: Verify tool calls across multiple workers before execution (for critical operations)
Contributing
Contributions are welcome! Please ensure:
- Code follows the quality standards above
- All tests pass
- New features include tests
- Documentation is updated
License
MIT License