# Local Swarm Architecture

## Core Concept

Deploy multiple LLM instances on your hardware. Each instance processes the same input independently, then the instances vote on the best answer. Connect several machines running this to create a "hive mind" that puts all your old hardware to use.

## How It Works

```
┌─────────────────┐     ┌─────────────────────────────────────┐
│   Your Prompt   │────▶│            Swarm Manager            │
└─────────────────┘     │  ┌─────────┐ ┌─────────┐ ┌─────────┐│
                        │  │Worker 1 │ │Worker 2 │ │Worker 3 ││
                        │  │  (LLM)  │ │  (LLM)  │ │  (LLM)  ││
                        │  └────┬────┘ └────┬────┘ └────┬────┘│
                        │       └───────────┼───────────┘     │
                        │                   ▼                 │
                        │          Consensus Engine           │
                        │        (Picks best answer)          │
                        └───────────────────┬─────────────────┘
                                            ▼
                                  ┌───────────────┐
                                  │ Best Response │
                                  └───────────────┘
```

## Components

### 1. Hardware Detection (`src/hardware/`)

Detects your GPU and available memory to optimize model selection (see the detection sketch under "Illustrative Sketches" below):

- **NVIDIA** - pynvml
- **AMD** - rocm-smi
- **Intel** - sycl-ls
- **Apple Silicon** - sysctl/unified memory
- **Qualcomm** - Android/Termux detection
- **CPU** - psutil

### 2. Model Selection (`src/models/`)

Automatically picks the best model for the available memory (selection sketch below):

```
Available Memory → Model Size → Quantization → Instance Count
24 GB            → 14B        → Q4_K_M       → 2-3 instances
16 GB            → 7B         → Q4_K_M       → 3-4 instances
8 GB             → 3B         → Q6_K         → 2-3 instances
```

### 3. Backends (`src/backends/`)

Run the actual LLM inference:

- **llama.cpp** - CUDA, ROCm, SYCL, CPU (cross-platform)
- **MLX** - Apple Silicon optimized

### 4. Swarm Management (`src/swarm/`)

Manages multiple LLM workers and consensus voting.

**Workers**: Each runs an independent LLM instance.

**Consensus**: Picks the best response using one of four strategies (consensus and fan-out sketches below):

- Similarity (semantic grouping)
- Quality (code blocks, structure)
- Fastest (latency)
- Majority (exact match)

### 5. Network Federation (`src/network/`)

Connect multiple machines into a distributed swarm:

```
Machine 1 (4 workers) ──┐
Machine 2 (2 workers) ──┼──▶ Cross-Swarm Consensus ──▶ Best Answer
Machine 3 (3 workers) ──┘
```

**Discovery**: mDNS/Bonjour auto-discovery
**Protocol**: HTTP between peers
**Voting**: Two-phase (local consensus → global consensus); see the federation sketch below

### 6. API (`src/api/`)

OpenAI-compatible REST API:

- `POST /v1/chat/completions` - Main endpoint
- `GET /v1/models` - List models
- `GET /health` - Health check
- Federation endpoints when enabled

### 7. Tools (`src/tools/`)

Optional tool execution for enhanced capabilities (dispatch sketch below):

- `read_file` - Read files
- `write_file` - Write files
- `execute_bash` - Run shell commands

## Data Flow

1. **Request** comes in via the API
2. **Swarm Manager** sends it to all workers
3. **Workers** generate responses in parallel
4. **Consensus** picks the best answer
5. **Response** is returned to the client

## Memory Model

- **Discrete GPU**: use 90% of VRAM
- **Apple Silicon**: use RAM minus a 4 GB buffer
- **CPU-only**: use RAM minus a 4 GB buffer

Each worker loads the full model independently (no sharing).

## Future Ideas

- Context compression for long inputs
- CPU offloading for memory-constrained systems
- RAG integration for knowledge bases
- Speculative decoding for speed
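
## Illustrative Sketches

The snippets below are minimal Python sketches of the ideas described above, not the project's actual code. Names, argument shapes, and thresholds are assumptions made for illustration.

**Hardware detection.** A sketch of the probe-and-fall-back pattern from component 1, covering only the NVIDIA (pynvml) and CPU (psutil) cases:

```python
import psutil

def detect_memory_bytes() -> tuple[str, int]:
    """Return (device_kind, total_memory_bytes). Hypothetical helper."""
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)  # .total/.free/.used in bytes
        pynvml.nvmlShutdown()
        return "nvidia", info.total
    except Exception:
        # No NVIDIA GPU (or pynvml not installed): fall back to system RAM.
        return "cpu", psutil.virtual_memory().total
```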
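
**Model selection and memory budget.** A sketch combining the Memory Model rules with the selection table from component 2. Tier boundaries mirror the table; the returned instance count takes the upper end of each range:

```python
GIB = 1024 ** 3

def usable_budget(device_kind: str, total_bytes: int) -> int:
    """Memory Model rules: 90% of VRAM, else RAM minus a 4 GB buffer."""
    if device_kind == "nvidia":
        return int(total_bytes * 0.90)
    return max(0, total_bytes - 4 * GIB)

def pick_model(budget_bytes: int) -> tuple[str, str, int]:
    """Return (model_size, quantization, instance_count) per the table above."""
    gb = budget_bytes / GIB
    if gb >= 24:
        return "14B", "Q4_K_M", 3   # table says 2-3 instances
    if gb >= 16:
        return "7B", "Q4_K_M", 4    # table says 3-4 instances
    if gb >= 8:
        return "3B", "Q6_K", 3      # table says 2-3 instances
    raise RuntimeError("not enough memory for any supported model tier")
```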
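
**Consensus.** A sketch of three of the four strategies from component 4: similarity grouping (approximated here with `difflib` string ratios rather than real semantic embeddings), majority voting over the groups, and latency as the tie-breaker. The quality strategy is omitted:

```python
from difflib import SequenceMatcher

def consensus(responses: list[tuple[str, float]]) -> str:
    """responses: (text, latency_seconds) pairs, one per worker."""
    groups: list[list[tuple[str, float]]] = []
    for text, latency in responses:
        for group in groups:
            # Similarity: put near-identical answers in the same group.
            if SequenceMatcher(None, text, group[0][0]).ratio() > 0.9:
                group.append((text, latency))
                break
        else:
            groups.append([(text, latency)])
    # Majority: the largest group wins; Fastest: lowest latency breaks ties.
    best_group = max(groups, key=len)
    return min(best_group, key=lambda r: r[1])[0]
```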
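
**Swarm fan-out.** A sketch of Data Flow steps 2-4: the manager queries all workers in parallel, times each response, and hands the results to the consensus sketch above. The `worker.generate(prompt)` method is an assumed interface:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def ask_swarm(prompt: str, workers: list) -> str:
    def timed(worker):
        start = time.monotonic()
        text = worker.generate(prompt)  # assumed blocking worker interface
        return text, time.monotonic() - start

    # Steps 2-3: fan the prompt out to every worker in parallel.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        responses = list(pool.map(timed, workers))
    # Step 4: pick the best (text, latency) pair.
    return consensus(responses)
```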
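
**Two-phase federation voting.** A sketch of the global phase from component 5: each peer's OpenAI-compatible endpoint is assumed to return its local-consensus winner, and the caller runs cross-swarm consensus over those winners. Peer URLs, the placeholder model name, and the payload shape are assumptions:

```python
import requests  # third-party: pip install requests

def federated_answer(prompt: str, peers: list[str]) -> str:
    candidates = []
    for base_url in peers:
        # Phase 1 happens inside each peer: its swarm runs local consensus
        # and returns one answer through the OpenAI-compatible API.
        resp = requests.post(
            f"{base_url}/v1/chat/completions",
            json={"model": "swarm",  # placeholder model name
                  "messages": [{"role": "user", "content": prompt}]},
            timeout=120,
        )
        text = resp.json()["choices"][0]["message"]["content"]
        candidates.append((text, resp.elapsed.total_seconds()))
    # Phase 2: global consensus over the per-machine winners.
    return consensus(candidates)
```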
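
**Tool dispatch.** A sketch mapping the three tool names from component 7 to handlers. The argument keys are assumptions, and a real server should sandbox `execute_bash`:

```python
import subprocess
from pathlib import Path

def run_tool(name: str, args: dict) -> str:
    if name == "read_file":
        return Path(args["path"]).read_text()
    if name == "write_file":
        Path(args["path"]).write_text(args["content"])
        return "ok"
    if name == "execute_bash":
        # Runs in the caller's shell; unsafe without sandboxing.
        result = subprocess.run(args["command"], shell=True,
                                capture_output=True, text=True, timeout=60)
        return result.stdout + result.stderr
    raise ValueError(f"unknown tool: {name}")
```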