Local Swarm Architecture
Core Concept
Deploy multiple LLM instances on your hardware. Each instance processes the same input independently, and the swarm votes on the best answer. Multiple machines running the swarm can be federated into a single "hive mind", putting all your old hardware to work.
How It Works
┌─────────────────┐ ┌─────────────────────────────────────┐
│ Your Prompt │────▶│ Swarm Manager │
└─────────────────┘ │ ┌─────────┐ ┌─────────┐ ┌─────────┐│
│ │Worker 1 │ │Worker 2 │ │Worker 3 ││
│ │ (LLM) │ │ (LLM) │ │ (LLM) ││
│ └────┬────┘ └────┬────┘ └────┬────┘│
│ └───────────┼───────────┘ │
│ ▼ │
│ Consensus Engine │
│ (Picks best answer) │
└───────────────────┬─────────────────┘
▼
┌───────────────┐
│ Best Response │
└───────────────┘
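The flow above can be sketched in a few lines: fan the prompt out to every worker in parallel, then vote. This is a minimal illustration using stdlib tools only; the function names and toy workers are hypothetical, not the project's actual API.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def swarm_generate(workers, prompt):
    """Send the same prompt to every worker, then majority-vote the answers."""
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        answers = list(pool.map(lambda w: w(prompt), workers))
    # Counter.most_common(1) yields the answer with the most votes
    best, _count = Counter(answers).most_common(1)[0]
    return best

# Toy callables standing in for real LLM instances:
workers = [lambda p: "4", lambda p: "4", lambda p: "5"]
print(swarm_generate(workers, "What is 2 + 2?"))  # -> 4
```

In the real system each worker wraps a llama.cpp or MLX process rather than a lambda, but the fan-out-then-vote shape is the same.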
Project Structure
local_swarm/
├── main.py # Entry point (99 lines)
├── src/
│ ├── api/ # HTTP API layer
│ │ ├── routes.py # FastAPI routes (252 lines)
│ │ ├── formatting.py # Message formatting (265 lines)
│ │ ├── tool_parser.py # Tool parsing (250 lines)
│ │ ├── chat_handlers.py # Chat completion logic (287 lines)
│ │ ├── server.py # Server setup
│ │ └── models.py # API data models
│ ├── cli/ # Command-line interface
│ │ ├── parser.py # CLI argument parsing
│ │ ├── main_runner.py # Main application logic
│ │ ├── server_runner.py # Server management
│ │ ├── test_runner.py # Test mode execution
│ │ └── tool_server.py # Tool server runner
│ ├── swarm/ # Swarm orchestration
│ │ ├── manager.py # Swarm manager
│ │ ├── worker.py # LLM worker implementation
│ │ ├── consensus.py # Consensus algorithms
│ │ └── orchestrator.py # Generation orchestration
│ ├── models/ # Model management
│ │ ├── registry.py # Model registry (194 lines)
│ │ ├── selector.py # Model selection (329 lines)
│ │ ├── memory_calculator.py # Memory calculation utilities
│ │ └── downloader.py # Model downloading
│ ├── backends/ # LLM backends
│ │ ├── llama_cpp.py # llama.cpp backend
│ │ ├── mlx.py # Apple Silicon MLX backend
│ │ └── base.py # Base backend interface
│ ├── hardware/ # Hardware detection
│ │ ├── detector.py # Hardware detection
│ │ ├── nvidia.py # NVIDIA GPU detection
│ │ ├── intel.py # Intel GPU detection
│ │ ├── qualcomm.py # Qualcomm detection
│ │ └── ...
│ ├── network/ # Network federation
│ │ ├── federation.py # Cross-swarm consensus
│ │ ├── discovery.py # Peer discovery (mDNS)
│ │ └── discovery_core.py # Discovery utilities
│ ├── tools/ # Tool execution
│ │ └── executor.py # Tool execution engine
│ ├── interactive/ # Interactive CLI
│ │ ├── ui.py # UI utilities
│ │ ├── display.py # Hardware/resource display
│ │ ├── tips.py # Help content
│ │ └── config_utils.py # Configuration selection
│ └── utils/ # Utilities
│ ├── token_counter.py # Token counting
│ ├── project_discovery.py # Project root discovery
│ ├── network.py # Network utilities
│ └── logging_config.py # Logging configuration
├── config/
│ └── models/ # Model configuration files
│ ├── model_metadata.json # Model metadata
│ ├── mlx_quant_sizes.json # MLX quantization sizes
│ ├── gguf_quant_sizes.json # GGUF quantization sizes
│ └── selector_config.json # Selection constants
└── tests/ # Test suite
Architecture Principles
1. Separation of Concerns
Each module has a single responsibility:
- API layer (src/api/) - HTTP routing only
- CLI layer (src/cli/) - User interface and orchestration
- Swarm layer (src/swarm/) - LLM worker management
- Models layer (src/models/) - Model selection and downloading
2. Configuration Over Code
Static data extracted to JSON configs:
- Model metadata in config/models/model_metadata.json
- Quantization sizes in mlx_quant_sizes.json and gguf_quant_sizes.json
- Selection constants in selector_config.json
3. Modular Utilities
Shared functionality in reusable modules:
- utils/token_counter.py - Centralized token counting
- utils/project_discovery.py - Project root detection
- utils/network.py - IP detection and network utilities
Components
1. Hardware Detection (src/hardware/)
Detects your GPU and available memory to optimize model selection.
- NVIDIA - pynvml
- AMD - rocm-smi
- Intel - sycl-ls
- Apple Silicon - sysctl/unified memory
- Qualcomm - Android/Termux detection
- CPU - psutil
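A detection cascade like this is commonly written as an ordered list of probes, tried until one succeeds. The sketch below is illustrative only; the probe functions and result shape are assumptions, not the project's actual detector.py API.

```python
def detect_hardware(probes):
    """Run probes in priority order; return the first successful result.

    Each probe is a zero-argument callable returning a dict such as
    {"backend": "nvidia", "memory_gb": 24}, or raising if unavailable.
    """
    for probe in probes:
        try:
            return probe()
        except Exception:
            continue  # this hardware isn't present; try the next probe
    return {"backend": "cpu", "memory_gb": 8}  # conservative fallback

def nvidia_probe():
    raise RuntimeError("no NVIDIA GPU")  # e.g. pynvml not importable

def apple_probe():
    return {"backend": "apple", "memory_gb": 16}  # e.g. read via sysctl

print(detect_hardware([nvidia_probe, apple_probe]))
```

Ordering the probes from most to least specific means a machine with both a discrete GPU and a CPU fallback always gets the stronger backend.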
2. Model Selection (src/models/)
Automatically picks the best model based on available memory:
Available Memory → Model Size → Quantization → Instance Count
24 GB → 14B → Q4_K_M → 2-3 instances
16 GB → 7B → Q4_K_M → 3-4 instances
8 GB → 3B → Q6_K → 2-3 instances
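The table above maps directly to a threshold function. This is a simplified sketch mirroring the table, not the real selector.py logic (which also weighs quantization sizes and per-model metadata); the sub-8 GB fallback row is a hypothetical addition.

```python
def pick_model(memory_gb):
    """Map available memory to (model size, quantization, instance count)."""
    if memory_gb >= 24:
        return ("14B", "Q4_K_M", 2)
    if memory_gb >= 16:
        return ("7B", "Q4_K_M", 3)
    if memory_gb >= 8:
        return ("3B", "Q6_K", 2)
    return ("1B", "Q4_K_M", 1)  # hypothetical floor for tiny systems

print(pick_model(16))  # -> ('7B', 'Q4_K_M', 3)
```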
Key modules:
- registry.py - Loads model data from JSON configs
- selector.py - Selects optimal model for hardware
- memory_calculator.py - Calculates memory requirements
3. Backends (src/backends/)
Run the actual LLM inference:
- llama.cpp - CUDA, ROCm, SYCL, CPU (cross-platform)
- MLX - Apple Silicon optimized
4. Swarm Management (src/swarm/)
Manages multiple LLM workers and consensus voting.
Workers: Each worker runs an independent LLM instance.
Consensus: Picks the best response using one of:
- Similarity (semantic grouping)
- Quality (code blocks, structure)
- Fastest (latency)
- Majority (exact match)
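Three of the four strategies can be sketched compactly; the function and the "quality" heuristic below are illustrative assumptions, not the project's actual consensus.py implementation (semantic similarity grouping is omitted here since it needs an embedding model).

```python
from collections import Counter

def consensus(responses, strategy="majority"):
    """Pick one answer from a list of (text, latency_seconds) tuples."""
    if strategy == "majority":
        texts = [text for text, _ in responses]
        return Counter(texts).most_common(1)[0][0]   # most frequent exact match
    if strategy == "fastest":
        return min(responses, key=lambda r: r[1])[0]  # lowest latency wins
    if strategy == "quality":
        # crude proxy: prefer answers with code blocks, then longer answers
        return max(responses, key=lambda r: ("```" in r[0], len(r[0])))[0]
    raise ValueError(f"unknown strategy: {strategy}")

responses = [("42", 1.2), ("42", 0.9), ("41", 0.3)]
print(consensus(responses))             # -> 42
print(consensus(responses, "fastest"))  # -> 41
```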
Key modules:
- manager.py - Swarm lifecycle and coordination
- worker.py - Individual worker implementation
- consensus.py - Consensus algorithms
- orchestrator.py - Generation orchestration
5. Network Federation (src/network/)
Connect multiple machines into a distributed swarm:
Machine 1 (4 workers) ──┐
Machine 2 (2 workers) ──┼──▶ Cross-Swarm Consensus ──▶ Best Answer
Machine 3 (3 workers) ──┘
Discovery: mDNS/Bonjour auto-discovery
Protocol: HTTP between peers
Voting: Two-phase (local consensus → global consensus)
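The two-phase vote can be illustrated with plain majority counting: each machine first reduces its own workers' answers to one candidate, then the candidates are voted on globally. A sketch under that assumption (function names are illustrative, not the federation.py API):

```python
from collections import Counter

def local_consensus(worker_answers):
    """Phase 1: one machine reduces its workers' answers to a candidate."""
    return Counter(worker_answers).most_common(1)[0][0]

def global_consensus(machine_candidates):
    """Phase 2: majority vote across the per-machine candidates."""
    return Counter(machine_candidates).most_common(1)[0][0]

machines = [
    ["A", "A", "B", "A"],  # machine 1 (4 workers)
    ["B", "B"],            # machine 2 (2 workers)
    ["A", "A", "B"],       # machine 3 (3 workers)
]
candidates = [local_consensus(m) for m in machines]  # ['A', 'B', 'A']
print(global_consensus(candidates))  # -> A
```

Note the trade-off this shape implies: each machine gets one vote regardless of how many workers it runs, so a 2-worker machine counts as much as a 4-worker one in phase 2.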
6. API (src/api/)
OpenAI-compatible REST API:
- POST /v1/chat/completions - Main endpoint
- GET /v1/models - List models
- GET /health - Health check
- POST /v1/tools/execute - Tool execution (when enabled)
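Because the API is OpenAI-compatible, a request is just a standard chat-completions payload. A minimal stdlib sketch (the model name and port are placeholders; check your server configuration):

```python
import json
from urllib.request import Request

def chat_request(prompt, base_url="http://localhost:8000"):
    """Build an OpenAI-style chat completion request for the swarm API."""
    payload = {
        "model": "local-swarm",  # placeholder model name
        "messages": [{"role": "user", "content": prompt}],
    }
    return Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json"},
    )

req = chat_request("Hello, swarm!")
# urllib.request.urlopen(req) would perform the call once the server is up.
print(req.full_url)
```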
Modular design:
- routes.py - HTTP routing only (thin controllers)
- formatting.py - Message formatting logic
- tool_parser.py - Tool call parsing
- chat_handlers.py - Chat completion business logic
7. CLI (src/cli/)
Command-line interface modules:
- parser.py - Argument parsing
- main_runner.py - Main application orchestration
- server_runner.py - Server lifecycle management
- test_runner.py - Test mode execution
- tool_server.py - Tool server management
8. Tools (src/tools/)
Optional tool execution for enhanced capabilities:
- read_file - Read files
- write_file - Write files
- execute_bash - Run shell commands
- webfetch - Fetch web content
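A tool executor of this kind is typically a name-to-handler dispatch table. The sketch below shows the shape for the two file tools; the registry layout and keyword-argument convention are assumptions, not the actual executor.py interface.

```python
import os
import tempfile

def read_file(path):
    with open(path) as f:
        return f.read()

def write_file(path, content):
    with open(path, "w") as f:
        f.write(content)
    return f"wrote {len(content)} bytes"

# Dispatch table: tool name -> handler (execute_bash/webfetch omitted here)
TOOLS = {"read_file": read_file, "write_file": write_file}

def execute_tool(name, **kwargs):
    if name not in TOOLS:
        raise ValueError(f"unknown tool: {name}")
    return TOOLS[name](**kwargs)

path = os.path.join(tempfile.mkdtemp(), "note.txt")
execute_tool("write_file", path=path, content="hello")
print(execute_tool("read_file", path=path))  # -> hello
```

Keeping the table explicit makes it easy to disable dangerous tools (e.g. shell execution) by simply omitting them from the registry.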
9. Interactive Mode (src/interactive/)
Interactive CLI components:
- ui.py - Menu display and input handling
- display.py - Hardware and resource display
- tips.py - Educational content and help
- config_utils.py - Configuration selection utilities
10. Utilities (src/utils/)
Shared utility functions:
- token_counter.py - Token counting with tiktoken
- project_discovery.py - Project root detection
- network.py - Network utilities (IP detection)
- logging_config.py - Logging configuration
Data Flow
1. Request comes in via the API
2. Routes (thin layer) forward to handlers
3. Chat Handlers process the request
4. Swarm Manager sends the prompt to all workers
5. Workers generate responses in parallel
6. Consensus picks the best answer
7. Response returned to the client
Memory Model
- External GPU: Use 90% of VRAM
- Apple Silicon: Use RAM - 4GB buffer
- CPU-only: Use RAM - 4GB buffer
Each worker loads the full model independently (no sharing).
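The budget rules above reduce to a small function. This sketch uses exactly the buffer sizes stated here; the function name and hardware labels are illustrative, not the memory_calculator.py API.

```python
def usable_memory_gb(total_gb, hardware):
    """Memory budget per the rules above."""
    if hardware == "gpu":             # external/discrete GPU: 90% of VRAM
        return total_gb * 0.9
    return max(total_gb - 4, 0)       # Apple Silicon / CPU-only: RAM - 4 GB buffer

print(usable_memory_gb(24, "gpu"))    # -> 21.6
print(usable_memory_gb(16, "apple"))  # -> 12
```

Since each worker loads the full model with no sharing, the instance count is roughly this budget divided by one model's footprint at the chosen quantization.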
Configuration Files
Static data extracted to JSON for easy maintenance:
config/models/
├── model_metadata.json # Model names, descriptions, priorities
├── mlx_quant_sizes.json # MLX quantization VRAM requirements
├── gguf_quant_sizes.json # GGUF quantization VRAM requirements
└── selector_config.json # Selection constraints and defaults
Code Quality Standards
- No files > 300 lines (with a few exceptions)
- No functions > 50 lines
- No indentation > 3 levels
- No duplicate code (>3 lines)
- Single responsibility per module
- Configuration over code for static data
Testing
tests/
├── test_hardware_detector.py # Hardware detection tests
├── test_tool_parsing.py # Tool parsing tests
└── test_federation_metrics.py # Federation tests
Run tests: python -m pytest tests/ -v
Future Ideas
- Context compression for long inputs
- CPU offloading for memory-constrained systems
- RAG integration for knowledge bases
- Speculative decoding for speed
- More sophisticated consensus algorithms