# Local Swarm Architecture

## Core Concept

Deploy multiple LLM instances on your hardware. Each instance processes the same input independently, then the instances vote on the best answer. Connect multiple machines running the swarm to form a "hive mind" that puts all of your old hardware to work.

## How It Works

```
┌─────────────────┐     ┌─────────────────────────────────────┐
│  Your Prompt    │────▶│            Swarm Manager            │
└─────────────────┘     │  ┌─────────┐ ┌─────────┐ ┌─────────┐│
                        │  │Worker 1 │ │Worker 2 │ │Worker 3 ││
                        │  │ (LLM)   │ │ (LLM)   │ │ (LLM)   ││
                        │  └────┬────┘ └────┬────┘ └────┬────┘│
                        │       └───────────┼───────────┘     │
                        │                   ▼                 │
                        │          Consensus Engine           │
                        │         (Picks best answer)         │
                        └───────────────────┬─────────────────┘
                                            ▼
                                    ┌───────────────┐
                                    │ Best Response │
                                    └───────────────┘
```
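In code, the loop above is just "fan out, collect, vote." The sketch below is a conceptual illustration only, not the project's actual `SwarmManager` API: `query_worker`, the worker URLs, and the exact-match majority vote are assumptions standing in for the real backends and the richer consensus strategies described later.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

import requests


def query_worker(worker_url: str, prompt: str) -> str:
    """Ask one worker for a completion (assumes an OpenAI-style endpoint)."""
    resp = requests.post(
        f"{worker_url}/v1/chat/completions",
        json={"messages": [{"role": "user", "content": prompt}]},
        timeout=120,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]


def swarm_answer(worker_urls: list[str], prompt: str) -> str:
    # Fan the same prompt out to every worker in parallel.
    with ThreadPoolExecutor(max_workers=len(worker_urls)) as pool:
        answers = list(pool.map(lambda url: query_worker(url, prompt), worker_urls))
    # Majority consensus: the most common answer wins. The real consensus
    # engine also weighs similarity, quality, and latency (see below).
    return Counter(answers).most_common(1)[0][0]
```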
## Project Structure

```
local_swarm/
├── main.py                        # Entry point (99 lines)
├── src/
│   ├── api/                       # HTTP API layer
│   │   ├── routes.py              # FastAPI routes (252 lines)
│   │   ├── formatting.py          # Message formatting (265 lines)
│   │   ├── tool_parser.py         # Tool parsing (250 lines)
│   │   ├── chat_handlers.py       # Chat completion logic (287 lines)
│   │   ├── server.py              # Server setup
│   │   └── models.py              # API data models
│   ├── cli/                       # Command-line interface
│   │   ├── parser.py              # CLI argument parsing
│   │   ├── main_runner.py         # Main application logic
│   │   ├── server_runner.py       # Server management
│   │   ├── test_runner.py         # Test mode execution
│   │   └── tool_server.py         # Tool server runner
│   ├── swarm/                     # Swarm orchestration
│   │   ├── manager.py             # Swarm manager
│   │   ├── worker.py              # LLM worker implementation
│   │   ├── consensus.py           # Consensus algorithms
│   │   └── orchestrator.py        # Generation orchestration
│   ├── models/                    # Model management
│   │   ├── registry.py            # Model registry (194 lines)
│   │   ├── selector.py            # Model selection (329 lines)
│   │   ├── memory_calculator.py   # Memory calculation utilities
│   │   └── downloader.py          # Model downloading
│   ├── backends/                  # LLM backends
│   │   ├── llama_cpp.py           # llama.cpp backend
│   │   ├── mlx.py                 # Apple Silicon MLX backend
│   │   └── base.py                # Base backend interface
│   ├── hardware/                  # Hardware detection
│   │   ├── detector.py            # Hardware detection
│   │   ├── nvidia.py              # NVIDIA GPU detection
│   │   ├── intel.py               # Intel GPU detection
│   │   ├── qualcomm.py            # Qualcomm detection
│   │   └── ...
│   ├── network/                   # Network federation
│   │   ├── federation.py          # Cross-swarm consensus
│   │   ├── discovery.py           # Peer discovery (mDNS)
│   │   └── discovery_core.py      # Discovery utilities
│   ├── tools/                     # Tool execution
│   │   └── executor.py            # Tool execution engine
│   ├── interactive/               # Interactive CLI
│   │   ├── ui.py                  # UI utilities
│   │   ├── display.py             # Hardware/resource display
│   │   ├── tips.py                # Help content
│   │   └── config_utils.py        # Configuration selection
│   └── utils/                     # Utilities
│       ├── token_counter.py       # Token counting
│       ├── project_discovery.py   # Project root discovery
│       ├── network.py             # Network utilities
│       └── logging_config.py      # Logging configuration
├── config/
│   └── models/                    # Model configuration files
│       ├── model_metadata.json    # Model metadata
│       ├── mlx_quant_sizes.json   # MLX quantization sizes
│       ├── gguf_quant_sizes.json  # GGUF quantization sizes
│       └── selector_config.json   # Selection constants
└── tests/                         # Test suite
```

## Architecture Principles

### 1. Separation of Concerns

Each module has a single responsibility:

- **API layer** (`src/api/`) - HTTP routing only
- **CLI layer** (`src/cli/`) - User interface and orchestration
- **Swarm layer** (`src/swarm/`) - LLM worker management
- **Models layer** (`src/models/`) - Model selection and downloading

### 2. Configuration Over Code

Static data is extracted to JSON configs:

- Model metadata in `config/models/model_metadata.json`
- Quantization sizes in `mlx_quant_sizes.json` and `gguf_quant_sizes.json`
- Selection constants in `selector_config.json`

### 3. Modular Utilities

Shared functionality lives in reusable modules:

- `utils/token_counter.py` - Centralized token counting
- `utils/project_discovery.py` - Project root detection
- `utils/network.py` - IP detection and network utilities

## Components

### 1. Hardware Detection (`src/hardware/`)

Detects your GPU and available memory to optimize model selection.

- **NVIDIA** - pynvml
- **AMD** - rocm-smi
- **Intel** - sycl-ls
- **Apple Silicon** - sysctl/unified memory
- **Qualcomm** - Android/Termux detection
- **CPU** - psutil

### 2. Model Selection (`src/models/`)

Automatically picks the best model based on available memory:

```
Available Memory → Model Size → Quantization → Instance Count
24 GB            → 14B        → Q4_K_M       → 2-3 instances
16 GB            → 7B         → Q4_K_M       → 3-4 instances
8 GB             → 3B         → Q6_K         → 2-3 instances
```

**Key modules:**

- `registry.py` - Loads model data from JSON configs
- `selector.py` - Selects optimal model for hardware
- `memory_calculator.py` - Calculates memory requirements

### 3. Backends (`src/backends/`)

Run the actual LLM inference:

- **llama.cpp** - CUDA, ROCm, SYCL, CPU (cross-platform)
- **MLX** - Apple Silicon optimized

### 4. Swarm Management (`src/swarm/`)

Manages multiple LLM workers and consensus voting.

**Workers**: Each runs an independent LLM instance.

**Consensus**: Picks the best response using:

- **Similarity** - semantic grouping
- **Quality** - code blocks, structure
- **Fastest** - latency
- **Majority** - exact match

**Key modules:**

- `manager.py` - Swarm lifecycle and coordination
- `worker.py` - Individual worker implementation
- `consensus.py` - Consensus algorithms
- `orchestrator.py` - Generation orchestration

### 5. Network Federation (`src/network/`)

Connect multiple machines into a distributed swarm:

```
Machine 1 (4 workers) ──┐
Machine 2 (2 workers) ──┼──▶ Cross-Swarm Consensus ──▶ Best Answer
Machine 3 (3 workers) ──┘
```

**Discovery**: mDNS/Bonjour auto-discovery
**Protocol**: HTTP between peers
**Voting**: Two-phase (local consensus → global consensus)

### 6. API (`src/api/`)

OpenAI-compatible REST API:

- `POST /v1/chat/completions` - Main endpoint
- `GET /v1/models` - List models
- `GET /health` - Health check
- `POST /v1/tools/execute` - Tool execution (when enabled)

**Modular design:**

- `routes.py` - HTTP routing only (thin controllers)
- `formatting.py` - Message formatting logic
- `tool_parser.py` - Tool call parsing
- `chat_handlers.py` - Chat completion business logic
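Since the API follows the OpenAI chat completions schema, any OpenAI-style client can talk to it. A minimal sketch with `requests`; the host, port, and model name below are illustrative assumptions, not documented defaults:

```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed host/port
    json={
        "model": "local-swarm",  # illustrative model name
        "messages": [
            {"role": "user", "content": "Summarize consensus voting in one sentence."},
        ],
    },
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```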
### 7. CLI (`src/cli/`)

Command-line interface modules:

- `parser.py` - Argument parsing
- `main_runner.py` - Main application orchestration
- `server_runner.py` - Server lifecycle management
- `test_runner.py` - Test mode execution
- `tool_server.py` - Tool server management

### 8. Tools (`src/tools/`)

Optional tool execution for enhanced capabilities:

- `read_file` - Read files
- `write_file` - Write files
- `execute_bash` - Run shell commands
- `webfetch` - Fetch web content

### 9. Interactive Mode (`src/interactive/`)

Interactive CLI components:

- `ui.py` - Menu display and input handling
- `display.py` - Hardware and resource display
- `tips.py` - Educational content and help
- `config_utils.py` - Configuration selection utilities

### 10. Utilities (`src/utils/`)

Shared utility functions:

- `token_counter.py` - Token counting with tiktoken
- `project_discovery.py` - Project root detection
- `network.py` - Network utilities (IP detection)
- `logging_config.py` - Logging configuration

## Data Flow

1. **Request** comes in via the API
2. **Routes** (thin layer) forward it to handlers
3. **Chat handlers** process the request
4. **Swarm manager** sends it to all workers
5. **Workers** generate responses in parallel
6. **Consensus** picks the best answer
7. **Response** is returned to the client

## Memory Model

- **External GPU**: Use 90% of VRAM
- **Apple Silicon**: Use RAM minus a 4 GB buffer
- **CPU-only**: Use RAM minus a 4 GB buffer

Each worker loads the full model independently (no sharing).

## Configuration Files

Static data is extracted to JSON for easy maintenance:

```
config/models/
├── model_metadata.json    # Model names, descriptions, priorities
├── mlx_quant_sizes.json   # MLX quantization VRAM requirements
├── gguf_quant_sizes.json  # GGUF quantization VRAM requirements
└── selector_config.json   # Selection constraints and defaults
```

## Code Quality Standards

- **No files > 300 lines** (with a few exceptions)
- **No functions > 50 lines**
- **No indentation > 3 levels**
- **No duplicate code** (> 3 lines)
- **Single responsibility** per module
- **Configuration over code** for static data

## Testing

```
tests/
├── test_hardware_detector.py    # Hardware detection tests
├── test_tool_parsing.py         # Tool parsing tests
└── test_federation_metrics.py   # Federation tests
```

Run tests: `python -m pytest tests/ -v`

## Future Ideas

- Context compression for long inputs
- CPU offloading for memory-constrained systems
- RAG integration for knowledge bases
- Speculative decoding for speed
- More sophisticated consensus algorithms