# Local Swarm

Run a swarm of local LLMs on your hardware. Multiple models work together to give you the best answer through consensus voting.

## What It Does

- **Auto-detects your hardware** (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
- **Downloads and runs multiple LLM instances** optimized for your VRAM/RAM
- **Uses consensus voting** - all instances answer, best response wins
- **Connects multiple machines** on your network for a "hive mind" effect
- **Provides an OpenAI-compatible API** at `http://localhost:17615/v1`

## Quick Start

```bash
# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt

# Run it
python main.py
```

On first run, it will:

1. Detect your hardware
2. Pick the best model and quantization
3. Download the model (one-time)
4. Start multiple LLM workers
5. Expose the API at `http://localhost:17615`

## Usage

### Interactive Mode (default)

```bash
python main.py
```

Shows a menu with:

- Recommended configuration (auto-selected)
- Browse all compatible models
- Custom configuration wizard

### Auto Mode (no menu)

```bash
python main.py --auto
```

### With Other Options

```bash
python main.py --model qwen:3b:q4     # Use specific model
python main.py --instances 4          # Force 4 workers
python main.py --port 8080            # Custom port
python main.py --detect               # Show hardware info only
python main.py --federation           # Enable network federation
python main.py --mcp                  # Enable MCP server
python main.py --use-opencode-tools   # Use opencode tools (adds ~27k tokens)
```

**Tool Mode Options:**

- Default: Local tool server (~125 tokens, saves context window space)
- `--use-opencode-tools`: Full opencode tool definitions (~27k tokens, more capabilities)

## Connect to Opencode

Add to your opencode config:

```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```
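The endpoint speaks the standard OpenAI chat protocol, so any OpenAI-compatible client works, not just opencode. As a minimal sketch using the official `openai` Python package (the prompt is just an example; the model name and API key mirror the config above):

```python
from openai import OpenAI

# Point the standard OpenAI client at the local swarm instead of api.openai.com.
client = OpenAI(
    base_url="http://localhost:17615/v1",
    api_key="not-needed",  # the swarm does not check API keys
)

response = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)

# The returned content is the consensus winner across all workers.
print(response.choices[0].message.content)
```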
## Network Federation (Hive Mind)

Run on multiple machines to combine their power:

```bash
# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation

# Machine 2 (Mac Mini M1)
python main.py --auto --federation

# Machine 3 (Old laptop)
python main.py --auto --federation
```

Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses **objective quality scoring** to pick the best answer, not self-reported confidence. This prevents smaller models from overruling better models.

**Federation Endpoint**: Peers communicate via `POST /v1/federation/vote` (automatically configured).

## How Consensus Works

1. Your prompt goes to all LLM instances
2. Each instance generates a response independently
3. The consensus algorithm picks the best answer:
   - **Similarity** (default): Groups responses by meaning, picks the largest group
   - **Quality**: Scores on completeness, code blocks, structure
   - **Fastest**: Returns the quickest response
   - **Majority**: Simple text match voting
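To make the default strategy concrete, here is an illustrative sketch of similarity voting. It is not the actual code in `src/swarm/consensus.py`; the `difflib` ratio and the 0.8 threshold are stand-ins for whatever similarity measure and cutoff the real implementation uses:

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Rough text similarity in [0, 1]; a stand-in for a real semantic measure."""
    return SequenceMatcher(None, a, b).ratio()

def similarity_consensus(responses: list[str], threshold: float = 0.8) -> str:
    """Group responses that are close to each other, then return a
    representative of the largest group (the majority "meaning")."""
    groups: list[list[str]] = []
    for resp in responses:
        for group in groups:
            if similarity(resp, group[0]) >= threshold:
                group.append(resp)
                break
        else:
            groups.append([resp])  # no close group found; start a new one
    largest = max(groups, key=len)
    return largest[0]
```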
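The **quality** strategy (also how federation voting judges peer responses) can be sketched the same way. Again hypothetical: the scoring signals below (length, code blocks, structure) come from this README, but the exact weights are invented for illustration:

```python
def quality_score(response: str) -> float:
    """Toy quality heuristic: rewards length, fenced code blocks, and structure."""
    score = min(len(response) / 1000, 1.0)       # longer is better, up to a cap
    score += 0.5 * (response.count("```") // 2)  # each complete code block
    score += 0.2 * sum(line.startswith(("-", "#", "1."))
                       for line in response.splitlines())  # lists / headings
    return score

def quality_consensus(responses: list[str]) -> str:
    """Return the response with the highest heuristic score."""
    return max(responses, key=quality_score)
```

In the federation case, the head node applies this kind of scoring to every peer's answer rather than trusting self-reported confidence.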
## Configuration

Create `config.yaml`:

```yaml
server:
  host: "127.0.0.1"
  port: 17615

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8

federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10
```

## Supported Hardware

| Hardware | Backend | Notes |
|----------|---------|-------|
| NVIDIA GPU | llama.cpp (CUDA) | Best performance |
| AMD GPU | llama.cpp (ROCm) | Linux/Windows |
| Intel GPU | llama.cpp (SYCL) | Linux/Windows |
| Apple Silicon | MLX | Native Metal |
| Qualcomm | llama.cpp (CPU) | Android/Termux |
| CPU-only | llama.cpp | Slower but works |

## Supported Models

- **Qwen 2.5 Coder** (3B, 7B, 14B) - Recommended
- **DeepSeek Coder** (1.3B, 6.7B, 33B)
- **CodeLlama** (7B, 13B, 34B)

All support GGUF quantization (Q4_K_M recommended).

## API Endpoints

- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion with consensus
- `GET /health` - Health check
- `POST /v1/federation/vote` - Federation voting (used internally between peers)

## Troubleshooting

### Out of Memory

```bash
python main.py --instances 2        # Reduce workers
python main.py --model qwen:3b:q4   # Use smaller model
```

### Slow Performance

- Check GPU utilization with `nvidia-smi`
- Reduce instances to avoid contention
- Use Q4 quantization instead of Q6

### CUDA Not Detected (Windows)

```powershell
nvidia-smi   # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```

### macOS: MLX Not Found

```bash
pip install mlx-lm
```

## Project Structure

```
local_swarm/
├── main.py                          # CLI entry point (99 lines)
├── src/
│   ├── api/                         # OpenAI-compatible API
│   │   ├── routes.py                # HTTP routing (252 lines)
│   │   ├── formatting.py            # Message formatting
│   │   ├── tool_parser.py           # Tool call parsing
│   │   ├── chat_handlers.py         # Chat completion logic
│   │   └── models.py                # API data models
│   ├── cli/                         # Command-line interface
│   │   ├── parser.py                # CLI argument parsing
│   │   ├── main_runner.py           # Main application logic
│   │   ├── server_runner.py         # Server management
│   │   └── test_runner.py           # Test mode execution
│   ├── swarm/                       # Swarm orchestration
│   │   ├── manager.py               # Swarm manager
│   │   ├── worker.py                # LLM worker implementation
│   │   ├── consensus.py             # Consensus algorithms
│   │   └── orchestrator.py          # Generation orchestration
│   ├── models/                      # Model management
│   │   ├── registry.py              # Model registry (194 lines)
│   │   ├── selector.py              # Model selection (329 lines)
│   │   ├── memory_calculator.py     # Memory calculations
│   │   └── downloader.py            # Model downloading
│   ├── hardware/                    # Hardware detection
│   │   ├── detector.py              # Hardware detection
│   │   ├── nvidia.py                # NVIDIA GPU detection
│   │   ├── intel.py                 # Intel GPU detection
│   │   └── qualcomm.py              # Qualcomm detection
│   ├── network/                     # Network federation
│   │   ├── federation.py            # Cross-swarm consensus
│   │   └── discovery.py             # Peer discovery
│   ├── backends/                    # LLM backends
│   │   ├── llama_cpp.py             # llama.cpp backend
│   │   ├── mlx.py                   # Apple Silicon MLX backend
│   │   └── base.py                  # Base backend interface
│   ├── interactive/                 # Interactive CLI
│   │   ├── ui.py                    # UI utilities
│   │   ├── display.py               # Hardware display
│   │   └── tips.py                  # Help content
│   ├── tools/                       # Tool execution
│   │   └── executor.py              # Tool execution engine
│   └── utils/                       # Shared utilities
│       ├── token_counter.py         # Token counting
│       ├── project_discovery.py     # Project root discovery
│       └── network.py               # Network utilities
├── config/                          # Configuration files
│   └── models/                      # Model configurations
│       ├── model_metadata.json      # Model metadata
│       ├── mlx_quant_sizes.json     # MLX quantization sizes
│       ├── gguf_quant_sizes.json    # GGUF quantization sizes
│       └── selector_config.json     # Selection constants
└── docs/                            # Documentation
```

### Architecture Principles

- **Modular Design**: Each module has a single, focused responsibility
- **Configuration Over Code**: Static data extracted to JSON config files
- **Separation of Concerns**: API, CLI, and business logic are cleanly separated
- **No Files > 300 Lines**: Modules are kept under 300 lines for maintainability

## Development

### Code Quality Standards

This project follows strict code quality standards:

- **File Size**: No files > 300 lines (with few exceptions)
- **Function Size**: No functions > 50 lines
- **Nesting Depth**: No indentation > 3 levels
- **DRY Principle**: No duplicate code (> 3 lines)
- **Single Responsibility**: Each module does one thing
- **Configuration Over Code**: Static data in JSON configs

### Running Tests

```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_tool_parsing.py -v

# Run with coverage
python -m pytest tests/ --cov=src
```

### Recent Refactoring

Major refactoring completed to improve modularity:

**Before**: Monolithic files (main.py: 556 lines, routes.py: 1,183 lines)
**After**: Modular architecture (main.py: 99 lines, routes.py: 252 lines)

**Changes**:

- Extracted API logic into focused modules (formatting, parsing, handlers)
- Created CLI package with separated concerns (parser, runner, server)
- Moved hardcoded model data to JSON configuration files
- Created shared utility modules (token_counter, project_discovery, network)
- Reduced code duplication across the codebase

See `docs/ARCHITECTURE.md` for detailed architecture documentation.

## Recent Improvements

### ✅ Universal Tool Support (2025-02-25)

- Tool instructions automatically injected for **all** clients (Continue, hollama, curl, etc.)
- No client-side configuration needed - just use the API
- Enhanced file operation guidance: the model uses `ls`/`grep` to verify files exist before reading
- Working directory auto-extracted from prompts (`in /path/to/dir` patterns)
- Proper OpenAI tool format with unique IDs and `tool_call_id` linking

### ✅ OpenCode-Compatible Streaming (2025-02-25)

- Proper `reasoning_content` field for "Thinking..." collapsible blocks
- Multi-chunk `tool_calls` streaming matching the Vercel AI SDK format
- Final answer delivered in the `content` field after tool execution

### ✅ Federation Quality Voting (2025-02-25)

- The head node now **objectively judges** all peer responses using quality metrics
- No more reliance on self-reported confidence (which was biased toward local responses)
- All responses scored on length, structure, and completeness
- Fair competition: 14B models properly beat 3B models on quality tasks

### 🚧 Planned Features

- **Plan Mode**: Disable tool execution for planning-only conversations (`--plan-mode`)
- **Tool Consensus**: Verify tool calls across multiple workers before execution (for critical operations)

## Contributing

Contributions are welcome! Please ensure:

1. Code follows the quality standards above
2. All tests pass
3. New features include tests
4. Documentation is updated

## License

MIT License