# Local Swarm Architecture
## Core Concept
Deploy multiple LLM instances on your hardware. Each instance processes the same input independently, then the instances vote on the best answer. Connect multiple machines running the swarm to form a "hive mind" that puts all your old hardware to use.
## How It Works
```
┌─────────────────┐     ┌─────────────────────────────────────┐
│   Your Prompt   │────▶│            Swarm Manager            │
└─────────────────┘     │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
                        │ │Worker 1 │ │Worker 2 │ │Worker 3 │ │
                        │ │  (LLM)  │ │  (LLM)  │ │  (LLM)  │ │
                        │ └────┬────┘ └────┬────┘ └────┬────┘ │
                        │      └───────────┼───────────┘      │
                        │                  ▼                  │
                        │          Consensus Engine           │
                        │        (Picks best answer)          │
                        └──────────────────┬──────────────────┘
                                           ▼
                                   ┌───────────────┐
                                   │ Best Response │
                                   └───────────────┘
```
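A minimal sketch of this fan-out-and-vote loop (the worker `generate()` method and the majority vote here are illustrative assumptions; the real logic lives in `src/swarm/`):
```python
import asyncio
from collections import Counter

async def ask_swarm(prompt: str, workers: list) -> str:
    """Send the same prompt to every worker and return the consensus pick."""
    # Fan out: every worker generates a response independently, in parallel.
    responses = await asyncio.gather(*(w.generate(prompt) for w in workers))
    # Vote: exact-match majority shown here; the real consensus engine
    # also scores similarity, quality, and latency.
    return Counter(responses).most_common(1)[0][0]
```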
## Project Structure
```
local_swarm/
├── main.py                            # Entry point (99 lines)
├── src/
│   ├── api/                           # HTTP API layer
│   │   ├── routes.py                  # FastAPI routes (252 lines)
│   │   ├── formatting.py              # Message formatting (265 lines)
│   │   ├── tool_parser.py             # Tool parsing (250 lines)
│   │   ├── chat_handlers.py           # Chat completion logic (287 lines)
│   │   ├── server.py                  # Server setup
│   │   └── models.py                  # API data models
│   ├── cli/                           # Command-line interface
│   │   ├── parser.py                  # CLI argument parsing
│   │   ├── main_runner.py             # Main application logic
│   │   ├── server_runner.py           # Server management
│   │   ├── test_runner.py             # Test mode execution
│   │   └── tool_server.py             # Tool server runner
│   ├── swarm/                         # Swarm orchestration
│   │   ├── manager.py                 # Swarm manager
│   │   ├── worker.py                  # LLM worker implementation
│   │   ├── consensus.py               # Consensus algorithms
│   │   └── orchestrator.py            # Generation orchestration
│   ├── models/                        # Model management
│   │   ├── registry.py                # Model registry (194 lines)
│   │   ├── selector.py                # Model selection (329 lines)
│   │   ├── memory_calculator.py       # Memory calculation utilities
│   │   └── downloader.py              # Model downloading
│   ├── backends/                      # LLM backends
│   │   ├── llama_cpp.py               # llama.cpp backend
│   │   ├── mlx.py                     # Apple Silicon MLX backend
│   │   └── base.py                    # Base backend interface
│   ├── hardware/                      # Hardware detection
│   │   ├── detector.py                # Hardware detection
│   │   ├── nvidia.py                  # NVIDIA GPU detection
│   │   ├── intel.py                   # Intel GPU detection
│   │   ├── qualcomm.py                # Qualcomm detection
│   │   └── ...
│   ├── network/                       # Network federation
│   │   ├── federation.py              # Cross-swarm consensus
│   │   ├── discovery.py               # Peer discovery (mDNS)
│   │   └── discovery_core.py          # Discovery utilities
│   ├── tools/                         # Tool execution
│   │   └── executor.py                # Tool execution engine
│   ├── interactive/                   # Interactive CLI
│   │   ├── ui.py                      # UI utilities
│   │   ├── display.py                 # Hardware/resource display
│   │   ├── tips.py                    # Help content
│   │   └── config_utils.py            # Configuration selection
│   └── utils/                         # Utilities
│       ├── token_counter.py           # Token counting
│       ├── project_discovery.py       # Project root discovery
│       ├── network.py                 # Network utilities
│       └── logging_config.py          # Logging configuration
├── config/
│   └── models/                        # Model configuration files
│       ├── model_metadata.json        # Model metadata
│       ├── mlx_quant_sizes.json       # MLX quantization sizes
│       ├── gguf_quant_sizes.json      # GGUF quantization sizes
│       └── selector_config.json       # Selection constants
└── tests/                             # Test suite
```
## Architecture Principles
### 1. Separation of Concerns
Each module has a single responsibility:
- **API layer** (`src/api/`) - HTTP routing only
- **CLI layer** (`src/cli/`) - User interface and orchestration
- **Swarm layer** (`src/swarm/`) - LLM worker management
- **Models layer** (`src/models/`) - Model selection and downloading
### 2. Configuration Over Code
Static data extracted to JSON configs:
- Model metadata in `config/models/model_metadata.json`
- Quantization sizes in `mlx_quant_sizes.json` and `gguf_quant_sizes.json`
- Selection constants in `selector_config.json`
### 3. Modular Utilities
Shared functionality in reusable modules:
- `utils/token_counter.py` - Centralized token counting
- `utils/project_discovery.py` - Project root detection
- `utils/network.py` - IP detection and network utilities
## Components
### 1. Hardware Detection (`src/hardware/`)
Detects your GPU and available memory to optimize model selection.
- **NVIDIA** - pynvml
- **AMD** - rocm-smi
- **Intel** - sycl-ls
- **Apple Silicon** - sysctl/unified memory
- **Qualcomm** - Android/Termux detection
- **CPU** - psutil
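As a rough illustration of the NVIDIA and CPU paths above (a sketch, not the project's actual detector interface):
```python
import psutil

def detect_memory_bytes() -> int:
    """Best-effort accelerator memory, falling back to system RAM."""
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU only
        total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
        pynvml.nvmlShutdown()
        return total
    except Exception:
        # No NVIDIA GPU or driver available: fall back to the CPU path.
        return psutil.virtual_memory().total
```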
### 2. Model Selection (`src/models/`)
Automatically picks the best model based on available memory:
```
Available Memory → Model Size → Quantization → Instance Count
24 GB            → 14B        → Q4_K_M       → 2-3 instances
16 GB            → 7B         → Q4_K_M       → 3-4 instances
 8 GB            → 3B         → Q6_K         → 2-3 instances
```
**Key modules:**
- `registry.py` - Loads model data from JSON configs
- `selector.py` - Selects optimal model for hardware
- `memory_calculator.py` - Calculates memory requirements
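A simplified version of that mapping (the tier boundaries here mirror the table above; the real `selector.py` reads them from `selector_config.json`):
```python
def select_model(available_gb: float) -> tuple[str, str, int]:
    """Map available memory to (model size, quantization, instance count)."""
    if available_gb >= 24:
        return ("14B", "Q4_K_M", 2)
    if available_gb >= 16:
        return ("7B", "Q4_K_M", 3)
    if available_gb >= 8:
        return ("3B", "Q6_K", 2)
    return ("1B", "Q4_K_M", 1)  # hypothetical floor tier for small machines
```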
### 3. Backends (`src/backends/`)
Run the actual LLM inference:
- **llama.cpp** - CUDA, ROCm, SYCL, CPU (cross-platform)
- **MLX** - Apple Silicon optimized
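The shape of `base.py` is roughly an abstract interface that both backends implement (a sketch; the method names are assumptions):
```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """Common interface for the llama.cpp and MLX backends."""

    @abstractmethod
    def load(self, model_path: str) -> None:
        """Load model weights onto the device."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        """Run inference and return the completion text."""
```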
### 4. Swarm Management (`src/swarm/`)
Manages multiple LLM workers and consensus voting.
**Workers**: Each runs an independent LLM instance
**Consensus**: Picks the best response using:
- Similarity (semantic grouping)
- Quality (code blocks, structure)
- Fastest (latency)
- Majority (exact match)
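The similarity strategy, for instance, can be sketched with stdlib tooling (the real `consensus.py` may score differently; this is an assumption):
```python
from difflib import SequenceMatcher

def similarity_winner(responses: list[str]) -> str:
    """Pick the response most similar, on average, to all the others."""
    if len(responses) < 2:
        return responses[0]

    def avg_similarity(i: int) -> float:
        return sum(
            SequenceMatcher(None, responses[i], responses[j]).ratio()
            for j in range(len(responses))
            if j != i
        ) / (len(responses) - 1)

    return responses[max(range(len(responses)), key=avg_similarity)]
```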
**Key modules:**
- `manager.py` - Swarm lifecycle and coordination
- `worker.py` - Individual worker implementation
- `consensus.py` - Consensus algorithms
- `orchestrator.py` - Generation orchestration
### 5. Network Federation (`src/network/`)
Connect multiple machines into a distributed swarm:
```
Machine 1 (4 workers) ──┐
Machine 2 (2 workers) ──┼──▶ Cross-Swarm Consensus ──▶ Best Answer
Machine 3 (3 workers) ──┘
```
**Discovery**: mDNS/Bonjour auto-discovery
**Protocol**: HTTP between peers
**Voting**: Two-phase (local consensus → global consensus)
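A sketch of the two-phase vote, assuming a hypothetical `/consensus` endpoint on each peer that returns its local winner (the real protocol lives in `federation.py`):
```python
import asyncio
from collections import Counter

import aiohttp

async def federated_answer(prompt: str, peers: list[str]) -> str:
    """Phase 1: each swarm votes locally. Phase 2: pick the global winner."""
    async with aiohttp.ClientSession() as session:

        async def local_winner(peer: str) -> str:
            # Hypothetical endpoint: the peer runs its own local consensus.
            url = f"{peer}/consensus"
            async with session.post(url, json={"prompt": prompt}) as resp:
                return (await resp.json())["answer"]

        winners = await asyncio.gather(*(local_winner(p) for p in peers))
    # Majority shown here; any local consensus strategy could be reused globally.
    return Counter(winners).most_common(1)[0][0]
```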
### 6. API (`src/api/`)
OpenAI-compatible REST API:
- `POST /v1/chat/completions` - Main endpoint
- `GET /v1/models` - List models
- `GET /health` - Health check
- `POST /v1/tools/execute` - Tool execution (when enabled)
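Since the API is OpenAI-compatible, any standard client works. For example, with `requests` (port and model name are assumptions; check the server's startup output and `GET /v1/models`):
```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed default port
    json={
        "model": "local-swarm",  # placeholder; list real names via /v1/models
        "messages": [{"role": "user", "content": "Explain consensus voting."}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```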
**Modular design:**
- `routes.py` - HTTP routing only (thin controllers)
- `formatting.py` - Message formatting logic
- `tool_parser.py` - Tool call parsing
- `chat_handlers.py` - Chat completion business logic
### 7. CLI (`src/cli/`)
Command-line interface modules:
- `parser.py` - Argument parsing
- `main_runner.py` - Main application orchestration
- `server_runner.py` - Server lifecycle management
- `test_runner.py` - Test mode execution
- `tool_server.py` - Tool server management
### 8. Tools (`src/tools/`)
Optional tool execution for enhanced capabilities:
- `read_file` - Read files
- `write_file` - Write files
- `execute_bash` - Run shell commands
- `webfetch` - Fetch web content
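Dispatch in `executor.py` can be pictured as a name-to-function registry (a sketch; the real executor adds sandboxing and argument validation):
```python
import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    return Path(path).read_text()

def execute_bash(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "execute_bash": execute_bash}

def execute_tool(name: str, args: dict) -> str:
    """Look up a tool by name and call it with the parsed arguments."""
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**args)
```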
### 9. Interactive Mode (`src/interactive/`)
Interactive CLI components:
- `ui.py` - Menu display and input handling
- `display.py` - Hardware and resource display
- `tips.py` - Educational content and help
- `config_utils.py` - Configuration selection utilities
### 10. Utilities (`src/utils/`)
Shared utility functions:
- `token_counter.py` - Token counting with tiktoken
- `project_discovery.py` - Project root detection
- `network.py` - Network utilities (IP detection)
- `logging_config.py` - Logging configuration
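Centralized counting with tiktoken looks roughly like this (the encoding choice is an assumption; local models often ship their own tokenizers):
```python
import tiktoken

_ENCODING = tiktoken.get_encoding("cl100k_base")  # assumed default encoding

def count_tokens(text: str) -> int:
    """Approximate token count used for context-window budgeting."""
    return len(_ENCODING.encode(text))
```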
## Data Flow
1. **Request** comes in via API
2. **Routes** (thin layer) forward to handlers
3. **Chat Handlers** process the request
4. **Swarm Manager** sends to all workers
5. **Workers** generate responses in parallel
6. **Consensus** picks the best answer
7. **Response** returned to client
## Memory Model
- **External GPU**: Use 90% of VRAM
- **Apple Silicon**: Use RAM - 4GB buffer
- **CPU-only**: Use RAM - 4GB buffer
Each worker loads the full model independently (no sharing).
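Those rules reduce to a small budget function (a sketch of the policy above):
```python
def memory_budget_gb(total_gb: float, device: str) -> float:
    """Memory the swarm plans around, per the rules above."""
    if device == "gpu":  # external GPU: use 90% of VRAM
        return total_gb * 0.9
    return max(total_gb - 4.0, 0.0)  # Apple Silicon / CPU: keep a 4 GB buffer
```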
## Configuration Files
Static data extracted to JSON for easy maintenance:
```
config/models/
├── model_metadata.json   # Model names, descriptions, priorities
├── mlx_quant_sizes.json  # MLX quantization VRAM requirements
├── gguf_quant_sizes.json # GGUF quantization VRAM requirements
└── selector_config.json  # Selection constraints and defaults
```
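Loading then stays trivial in `registry.py` (a sketch; the path layout matches the tree above, the function name is an assumption):
```python
import json
from pathlib import Path

CONFIG_DIR = Path("config/models")

def load_model_metadata() -> dict:
    """Read static model data from JSON instead of hardcoding it."""
    with open(CONFIG_DIR / "model_metadata.json") as f:
        return json.load(f)
```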
## Code Quality Standards
- **No files > 300 lines** (with a few exceptions)
- **No functions > 50 lines**
- **No indentation > 3 levels**
- **No duplicate code** (no repeated blocks > 3 lines)
- **Single responsibility** per module
- **Configuration over code** for static data
## Testing
```
tests/
├── test_hardware_detector.py  # Hardware detection tests
├── test_tool_parsing.py       # Tool parsing tests
└── test_federation_metrics.py # Federation tests
```
Run tests: `python -m pytest tests/ -v`
## Future Ideas
- Context compression for long inputs
- CPU offloading for memory-constrained systems
- RAG integration for knowledge bases
- Speculative decoding for speed
- More sophisticated consensus algorithms