Local Swarm
Run a swarm of local LLMs on your hardware. Multiple models work together to give you the best answer through consensus voting.
What It Does
- Auto-detects your hardware (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
- Downloads and runs multiple LLM instances optimized for your VRAM/RAM
- Uses consensus voting - all instances answer, best response wins
- Connects multiple machines on your network for a "hive mind" effect
- Provides an OpenAI-compatible API at http://localhost:17615/v1
Quick Start
```bash
# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt

# Run it
python main.py
```
On first run, it will:
- Detect your hardware
- Pick the best model and quantization
- Download the model (one-time)
- Start multiple LLM workers
- Expose the API at http://localhost:17615
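Once it is up, you can sanity-check the server from Python. This is a minimal sketch: it assumes the `requests` package is installed and uses the endpoints documented under API Endpoints below.

```python
# Quick sanity check against a running swarm (assumes `requests` is installed).
import requests

BASE = "http://localhost:17615"

print(requests.get(f"{BASE}/health").status_code)  # expect 200 when healthy
print(requests.get(f"{BASE}/v1/models").json())    # models the swarm exposes
```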
Usage
Interactive Mode (default)
```bash
python main.py
```
Shows a menu with:
- Recommended configuration (auto-selected)
- Browse all compatible models
- Custom configuration wizard
Auto Mode (no menu)
```bash
python main.py --auto
```
With Other Options
```bash
python main.py --model qwen:3b:q4      # Use specific model
python main.py --instances 4           # Force 4 workers
python main.py --port 8080             # Custom port
python main.py --detect                # Show hardware info only
python main.py --federation            # Enable network federation
python main.py --mcp                   # Enable MCP server
python main.py --use-opencode-tools    # Use opencode tools (adds ~27k tokens)
```
Tool Mode Options:
- Default: Local tool server (~125 tokens, saves context window space)
- `--use-opencode-tools`: Full opencode tool definitions (~27k tokens, more capabilities)
Connect to Opencode
Add to your opencode config:
```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```
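The same endpoint works with any OpenAI-compatible client. Here is a minimal sketch using the `openai` Python package (v1+), reusing the base URL, placeholder API key, and model name from the config above:

```python
# Minimal OpenAI-compatible client call against the local swarm.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:17615/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a haiku about consensus."}],
)
print(reply.choices[0].message.content)
```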
Network Federation (Hive Mind)
Run on multiple machines to combine their power:
```bash
# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation

# Machine 2 (Mac Mini M1)
python main.py --auto --federation

# Machine 3 (Old laptop)
python main.py --auto --federation
```
Machines auto-discover each other and vote together on every request.
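The discovery protocol itself lives in `src/network/discovery.py`. The sketch below only illustrates the general idea of UDP broadcast discovery on the configured `discovery_port` (8765); the message format and timing are assumptions for illustration, not the project's actual wire format.

```python
# Illustrative UDP broadcast discovery loop (not the project's real protocol).
import json
import socket
import time

DISCOVERY_PORT = 8765  # matches federation.discovery_port in config.yaml

def announce_and_listen(node_name: str, timeout: float = 3.0) -> list[dict]:
    """Broadcast a hello message and collect replies from peers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.settimeout(0.5)

    hello = json.dumps({"type": "hello", "node": node_name}).encode()
    sock.sendto(hello, ("255.255.255.255", DISCOVERY_PORT))

    peers, deadline = [], time.time() + timeout
    while time.time() < deadline:
        try:
            data, addr = sock.recvfrom(4096)
            peers.append({"addr": addr[0], "info": json.loads(data)})
        except socket.timeout:
            continue
    sock.close()
    return peers

if __name__ == "__main__":
    print(announce_and_listen("my-node"))
```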
How Consensus Works
1. Your prompt goes to all LLM instances
2. Each instance generates a response independently
3. The consensus algorithm picks the best answer:
   - Similarity (default): Groups responses by meaning, picks the largest group (see the sketch below)
   - Quality: Scores on completeness, code blocks, structure
   - Fastest: Returns the quickest response
   - Majority: Simple text match voting
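For intuition, here is a minimal sketch of a similarity-style vote: group responses that are mutually similar, then return a member of the largest group. The real algorithm lives in `src/swarm/consensus.py`; the character-level similarity measure and the 0.8 threshold below are illustrative assumptions.

```python
# Illustrative similarity vote: largest group of near-identical answers wins.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def pick_by_similarity(responses: list[str], threshold: float = 0.8) -> str:
    groups: list[list[str]] = []
    for resp in responses:
        for group in groups:
            if similarity(resp, group[0]) >= threshold:
                group.append(resp)
                break
        else:
            groups.append([resp])
    return max(groups, key=len)[0]  # first member of the largest group

answers = ["The answer is 42.", "The answer is 42!", "It is 41.", "The answer is 42."]
print(pick_by_similarity(answers))  # -> "The answer is 42."
```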
Configuration
Create config.yaml:
```yaml
server:
  host: "127.0.0.1"
  port: 17615

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8

federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10
```
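If you want to inspect these settings from your own scripts, the file is plain YAML. A small sketch, assuming PyYAML is installed (the project reads its own configuration internally; this is just for poking at the file):

```python
# Read config.yaml and print a couple of swarm settings (assumes PyYAML).
import yaml

with open("config.yaml") as fh:
    cfg = yaml.safe_load(fh)

print(cfg["swarm"]["consensus_strategy"])   # e.g. "similarity"
print(cfg["federation"]["discovery_port"])  # 8765 in the example above
```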
Supported Hardware
| Hardware | Backend | Notes |
|---|---|---|
| NVIDIA GPU | llama.cpp (CUDA) | Best performance |
| AMD GPU | llama.cpp (ROCm) | Linux/Windows |
| Intel GPU | llama.cpp (SYCL) | Linux/Windows |
| Apple Silicon | MLX | Native Metal |
| Qualcomm | llama.cpp (CPU) | Android/Termux |
| CPU-only | llama.cpp | Slower but works |
Supported Models
- Qwen 2.5 Coder (3B, 7B, 14B) - Recommended
- DeepSeek Coder (1.3B, 6.7B, 33B)
- CodeLlama (7B, 13B, 34B)
All support GGUF quantization (Q4_K_M recommended).
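As a rough rule of thumb (back-of-the-envelope arithmetic, not the project's `memory_calculator`), a GGUF file takes roughly parameter count times bits-per-weight divided by 8 bytes, and Q4_K_M averages somewhere around 4.5 to 5 bits per weight; actual sizes vary by model family and quant.

```python
# Back-of-the-envelope GGUF size estimate; real files differ by model and quant.
def approx_gguf_size_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for size_b in (3, 7, 14):
    print(f"{size_b}B @ ~Q4_K_M: about {approx_gguf_size_gb(size_b):.1f} GB")
```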
API Endpoints
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion with consensus
- `GET /health` - Health check
- `GET /v1/federation/peers` - List discovered peers (when federation enabled)
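A raw HTTP example of the consensus chat endpoint, using the standard OpenAI request shape (a sketch assuming the `requests` package; the payload fields follow the OpenAI API, nothing swarm-specific):

```python
# Raw POST to /v1/chat/completions (assumes `requests` is installed).
import requests

payload = {
    "model": "local-swarm",
    "messages": [{"role": "user", "content": "Explain consensus voting in one sentence."}],
}
resp = requests.post("http://localhost:17615/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```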
Troubleshooting
Out of Memory
```bash
python main.py --instances 2       # Reduce workers
python main.py --model qwen:3b:q4  # Use smaller model
```
Slow Performance
- Check GPU utilization with `nvidia-smi`
- Reduce instances to avoid contention
- Use Q4 quantization instead of Q6
CUDA Not Detected (Windows)
```bash
nvidia-smi  # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```
macOS: MLX Not Found
```bash
pip install mlx-lm
```
Project Structure
```
local_swarm/
├── main.py                        # CLI entry point (99 lines)
├── src/
│   ├── api/                       # OpenAI-compatible API
│   │   ├── routes.py              # HTTP routing (252 lines)
│   │   ├── formatting.py          # Message formatting
│   │   ├── tool_parser.py         # Tool call parsing
│   │   ├── chat_handlers.py       # Chat completion logic
│   │   └── models.py              # API data models
│   ├── cli/                       # Command-line interface
│   │   ├── parser.py              # CLI argument parsing
│   │   ├── main_runner.py         # Main application logic
│   │   ├── server_runner.py       # Server management
│   │   └── test_runner.py         # Test mode execution
│   ├── swarm/                     # Swarm orchestration
│   │   ├── manager.py             # Swarm manager
│   │   ├── worker.py              # LLM worker implementation
│   │   ├── consensus.py           # Consensus algorithms
│   │   └── orchestrator.py        # Generation orchestration
│   ├── models/                    # Model management
│   │   ├── registry.py            # Model registry (194 lines)
│   │   ├── selector.py            # Model selection (329 lines)
│   │   ├── memory_calculator.py   # Memory calculations
│   │   └── downloader.py          # Model downloading
│   ├── hardware/                  # Hardware detection
│   │   ├── detector.py            # Hardware detection
│   │   ├── nvidia.py              # NVIDIA GPU detection
│   │   ├── intel.py               # Intel GPU detection
│   │   └── qualcomm.py            # Qualcomm detection
│   ├── network/                   # Network federation
│   │   ├── federation.py          # Cross-swarm consensus
│   │   └── discovery.py           # Peer discovery
│   ├── backends/                  # LLM backends
│   │   ├── llama_cpp.py           # llama.cpp backend
│   │   ├── mlx.py                 # Apple Silicon MLX backend
│   │   └── base.py                # Base backend interface
│   ├── interactive/               # Interactive CLI
│   │   ├── ui.py                  # UI utilities
│   │   ├── display.py             # Hardware display
│   │   └── tips.py                # Help content
│   ├── tools/                     # Tool execution
│   │   └── executor.py            # Tool execution engine
│   └── utils/                     # Shared utilities
│       ├── token_counter.py       # Token counting
│       ├── project_discovery.py   # Project root discovery
│       └── network.py             # Network utilities
├── config/                        # Configuration files
│   └── models/                    # Model configurations
│       ├── model_metadata.json    # Model metadata
│       ├── mlx_quant_sizes.json   # MLX quantization sizes
│       ├── gguf_quant_sizes.json  # GGUF quantization sizes
│       └── selector_config.json   # Selection constants
└── docs/                          # Documentation
```
Architecture Principles
- Modular Design: Each module has a single, focused responsibility
- Configuration Over Code: Static data extracted to JSON config files
- Separation of Concerns: API, CLI, and business logic are cleanly separated
- No Files > 300 Lines: Most modules kept under 300 lines for maintainability
Development
Code Quality Standards
This project follows strict code quality standards:
- File Size: No files > 300 lines (with few exceptions)
- Function Size: No functions > 50 lines
- Nesting Depth: No indentation > 3 levels
- DRY Principle: No duplicate code (>3 lines)
- Single Responsibility: Each module does one thing
- Configuration Over Code: Static data in JSON configs
Running Tests
```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_tool_parsing.py -v

# Run with coverage
python -m pytest tests/ --cov=src
```
Recent Refactoring
Major refactoring completed to improve modularity:
Before: Monolithic files (main.py: 556 lines, routes.py: 1,183 lines)
After: Modular architecture (main.py: 99 lines, routes.py: 252 lines)
Changes:
- Extracted API logic into focused modules (formatting, parsing, handlers)
- Created CLI package with separated concerns (parser, runner, server)
- Moved hardcoded model data to JSON configuration files
- Created shared utility modules (token_counter, project_discovery, network)
- Reduced code duplication across the codebase
See docs/ARCHITECTURE.md for detailed architecture documentation.
Contributing
Contributions are welcome! Please ensure:
- Code follows the quality standards above
- All tests pass
- New features include tests
- Documentation is updated
License
MIT License