# Local Swarm Architecture

## Core Concept

Deploy multiple LLM instances on your hardware. Each instance processes the same input independently, then the instances vote on the best answer. Connect multiple machines running this to create a "hive mind" that puts all of your old hardware to work.

## How It Works

```
┌─────────────────┐     ┌─────────────────────────────────────┐
│   Your Prompt   │────▶│            Swarm Manager            │
└─────────────────┘     │  ┌─────────┐ ┌─────────┐ ┌─────────┐│
                        │  │Worker 1 │ │Worker 2 │ │Worker 3 ││
                        │  │  (LLM)  │ │  (LLM)  │ │  (LLM)  ││
                        │  └────┬────┘ └────┬────┘ └────┬────┘│
                        │       └───────────┼───────────┘     │
                        │                   ▼                 │
                        │          Consensus Engine           │
                        │        (Picks best answer)          │
                        └───────────────────┬─────────────────┘
                                            ▼
                                    ┌───────────────┐
                                    │ Best Response │
                                    └───────────────┘
```

## Project Structure

```
local_swarm/
├── main.py                          # Entry point (99 lines)
├── src/
│   ├── api/                         # HTTP API layer
│   │   ├── routes.py                # FastAPI routes (252 lines)
│   │   ├── formatting.py            # Message formatting (265 lines)
│   │   ├── tool_parser.py           # Tool parsing (250 lines)
│   │   ├── chat_handlers.py         # Chat completion logic (287 lines)
│   │   ├── server.py                # Server setup
│   │   └── models.py                # API data models
│   ├── cli/                         # Command-line interface
│   │   ├── parser.py                # CLI argument parsing
│   │   ├── main_runner.py           # Main application logic
│   │   ├── server_runner.py         # Server management
│   │   ├── test_runner.py           # Test mode execution
│   │   └── tool_server.py           # Tool server runner
│   ├── swarm/                       # Swarm orchestration
│   │   ├── manager.py               # Swarm manager
│   │   ├── worker.py                # LLM worker implementation
│   │   ├── consensus.py             # Consensus algorithms
│   │   └── orchestrator.py          # Generation orchestration
│   ├── models/                      # Model management
│   │   ├── registry.py              # Model registry (194 lines)
│   │   ├── selector.py              # Model selection (329 lines)
│   │   ├── memory_calculator.py     # Memory calculation utilities
│   │   └── downloader.py            # Model downloading
│   ├── backends/                    # LLM backends
│   │   ├── llama_cpp.py             # llama.cpp backend
│   │   ├── mlx.py                   # Apple Silicon MLX backend
│   │   └── base.py                  # Base backend interface
│   ├── hardware/                    # Hardware detection
│   │   ├── detector.py              # Hardware detection
│   │   ├── nvidia.py                # NVIDIA GPU detection
│   │   ├── intel.py                 # Intel GPU detection
│   │   ├── qualcomm.py              # Qualcomm detection
│   │   └── ...
│   ├── network/                     # Network federation
│   │   ├── federation.py            # Cross-swarm consensus
│   │   ├── discovery.py             # Peer discovery (mDNS)
│   │   └── discovery_core.py        # Discovery utilities
│   ├── tools/                       # Tool execution
│   │   └── executor.py              # Tool execution engine
│   ├── interactive/                 # Interactive CLI
│   │   ├── ui.py                    # UI utilities
│   │   ├── display.py               # Hardware/resource display
│   │   ├── tips.py                  # Help content
│   │   └── config_utils.py          # Configuration selection
│   └── utils/                       # Utilities
│       ├── token_counter.py         # Token counting
│       ├── project_discovery.py     # Project root discovery
│       ├── network.py               # Network utilities
│       └── logging_config.py        # Logging configuration
├── config/
│   └── models/                      # Model configuration files
│       ├── model_metadata.json      # Model metadata
│       ├── mlx_quant_sizes.json     # MLX quantization sizes
│       ├── gguf_quant_sizes.json    # GGUF quantization sizes
│       └── selector_config.json     # Selection constants
└── tests/                           # Test suite
```

## Architecture Principles

### 1. Separation of Concerns

Each module has a single responsibility:

- **API layer** (`src/api/`) - HTTP routing only
- **CLI layer** (`src/cli/`) - User interface and orchestration
- **Swarm layer** (`src/swarm/`) - LLM worker management
- **Models layer** (`src/models/`) - Model selection and downloading

### 2. Configuration Over Code

Static data extracted to JSON configs:

- Model metadata in `config/models/model_metadata.json`
- Quantization sizes in `mlx_quant_sizes.json` and `gguf_quant_sizes.json`
- Selection constants in `selector_config.json`

### 3. Modular Utilities

Shared functionality in reusable modules:

- `utils/token_counter.py` - Centralized token counting
- `utils/project_discovery.py` - Project root detection
- `utils/network.py` - IP detection and network utilities

## Components

### 1. Hardware Detection (`src/hardware/`)

Detects your GPU and available memory to optimize model selection.

- **NVIDIA** - pynvml
- **AMD** - rocm-smi
- **Intel** - sycl-ls
- **Apple Silicon** - sysctl/unified memory
- **Qualcomm** - Android/Termux detection
- **CPU** - psutil
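
To make the detection flow concrete, here is a minimal sketch (not the project's `detector.py`) that queries NVIDIA VRAM via pynvml and falls back to system RAM via psutil; the `HardwareInfo` shape and its field names are assumptions for this example:

```python
# Hypothetical detection sketch; the real detector.py covers more vendors.
from dataclasses import dataclass

import psutil  # system RAM fallback


@dataclass
class HardwareInfo:  # field names are illustrative assumptions
    backend: str            # "cuda" or "cpu"
    total_memory_bytes: int


def detect_hardware() -> HardwareInfo:
    """Try NVIDIA first via pynvml, then fall back to CPU/system RAM."""
    try:
        import pynvml

        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        vram = pynvml.nvmlDeviceGetMemoryInfo(handle).total
        pynvml.nvmlShutdown()
        return HardwareInfo(backend="cuda", total_memory_bytes=vram)
    except Exception:
        # No NVIDIA GPU (or pynvml missing): report system RAM instead.
        return HardwareInfo(backend="cpu",
                            total_memory_bytes=psutil.virtual_memory().total)
```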

### 2. Model Selection (`src/models/`)

Automatically picks the best model based on available memory:

```
Available Memory → Model Size → Quantization → Instance Count
     24 GB       →    14B     →    Q4_K_M    →  2-3 instances
     16 GB       →     7B     →    Q4_K_M    →  3-4 instances
      8 GB       →     3B     →     Q6_K     →  2-3 instances
```

**Key modules:**

- `registry.py` - Loads model data from JSON configs
- `selector.py` - Selects optimal model for hardware
- `memory_calculator.py` - Calculates memory requirements
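
As an illustration only, the tier selection implied by the table above could be sketched like this; the tier structure and field names are assumptions, not `selector.py`'s actual API:

```python
# Illustrative selection logic; thresholds mirror the table above.
def pick_model_tier(available_gb: float) -> dict:
    tiers = [
        {"min_gb": 24, "params": "14B", "quant": "Q4_K_M", "instances": (2, 3)},
        {"min_gb": 16, "params": "7B",  "quant": "Q4_K_M", "instances": (3, 4)},
        {"min_gb": 8,  "params": "3B",  "quant": "Q6_K",   "instances": (2, 3)},
    ]
    for tier in tiers:  # largest model that fits wins
        if available_gb >= tier["min_gb"]:
            return tier
    raise RuntimeError("Not enough memory for any configured model")
```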

### 3. Backends (`src/backends/`)

Run the actual LLM inference:

- **llama.cpp** - CUDA, ROCm, SYCL, CPU (cross-platform)
- **MLX** - Apple Silicon optimized
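
The project tree lists `base.py` as the base backend interface. Its exact API isn't documented here, so the following is only a plausible sketch of a minimal backend contract (method names are assumptions):

```python
# Hypothetical backend contract; the real base.py interface may differ.
from abc import ABC, abstractmethod


class LLMBackend(ABC):
    """Minimal contract every inference backend would satisfy."""

    @abstractmethod
    def load(self, model_path: str) -> None:
        """Load model weights into memory."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        """Run inference and return the generated text."""

    @abstractmethod
    def unload(self) -> None:
        """Free model memory."""
```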

### 4. Swarm Management (`src/swarm/`)

Manages multiple LLM workers and consensus voting.

**Workers**: Each runs an independent LLM instance

**Consensus**: Picks the best response using:

- Similarity (semantic grouping)
- Quality (code blocks, structure)
- Fastest (latency)
- Majority (exact match)

**Key modules:**

- `manager.py` - Swarm lifecycle and coordination
- `worker.py` - Individual worker implementation
- `consensus.py` - Consensus algorithms
- `orchestrator.py` - Generation orchestration
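
As a toy illustration of the "Majority (exact match)" strategy listed above (not the actual `consensus.py` implementation):

```python
# Toy majority-vote consensus over worker outputs; the real strategies
# (similarity, quality, latency) are more involved than exact matching.
from collections import Counter


def majority_consensus(responses: list[str]) -> str:
    """Return the most common exact response; fall back to the first one."""
    if not responses:
        raise ValueError("no worker responses to vote on")
    normalized = [r.strip() for r in responses]
    winner, count = Counter(normalized).most_common(1)[0]
    # Without an exact-match majority, another strategy would decide.
    return winner if count > 1 else normalized[0]
```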

### 5. Network Federation (`src/network/`)

Connect multiple machines into a distributed swarm:

```
Machine 1 (4 workers) ──┐
Machine 2 (2 workers) ──┼──▶ Cross-Swarm Consensus ──▶ Best Answer
Machine 3 (3 workers) ──┘
```

**Discovery**: mDNS/Bonjour auto-discovery

**Protocol**: HTTP between peers

**Voting**: Two-phase (local consensus → global consensus)
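
A hedged sketch of the two-phase idea: each peer reports its local-consensus answer, then a global vote runs across those answers. The `/consensus` endpoint path and the `"answer"` response key are assumptions, not `federation.py`'s real protocol:

```python
# Illustrative two-phase federation vote; endpoint and payload are assumed.
from collections import Counter

import requests


def federated_answer(prompt: str, peers: list[str]) -> str:
    """Phase 1: each peer returns its local-consensus answer.
    Phase 2: pick the most common answer across peers (global vote)."""
    local_answers = []
    for peer in peers:  # e.g. "http://192.168.1.20:8000"
        resp = requests.post(f"{peer}/consensus", json={"prompt": prompt}, timeout=120)
        resp.raise_for_status()
        local_answers.append(resp.json()["answer"].strip())
    return Counter(local_answers).most_common(1)[0][0]
```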

### 6. API (`src/api/`)

OpenAI-compatible REST API:

- `POST /v1/chat/completions` - Main endpoint
- `GET /v1/models` - List models
- `GET /health` - Health check
- `POST /v1/tools/execute` - Tool execution (when enabled)
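
Because the API follows the OpenAI chat-completions format, any OpenAI-style client can talk to it. The host, port, and model name below are placeholders, not values defined by this document:

```python
# Example client call against the OpenAI-compatible endpoint.
# Point the URL at wherever the swarm server is actually running.
import requests

payload = {
    "model": "local-swarm",  # placeholder model name
    "messages": [{"role": "user", "content": "Explain consensus voting in one sentence."}],
}
resp = requests.post("http://localhost:8000/v1/chat/completions", json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```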

**Modular design:**

- `routes.py` - HTTP routing only (thin controllers)
- `formatting.py` - Message formatting logic
- `tool_parser.py` - Tool call parsing
- `chat_handlers.py` - Chat completion business logic

### 7. CLI (`src/cli/`)

Command-line interface modules:

- `parser.py` - Argument parsing
- `main_runner.py` - Main application orchestration
- `server_runner.py` - Server lifecycle management
- `test_runner.py` - Test mode execution
- `tool_server.py` - Tool server management

### 8. Tools (`src/tools/`)

Optional tool execution for enhanced capabilities:

- `read_file` - Read files
- `write_file` - Write files
- `execute_bash` - Run shell commands
- `webfetch` - Fetch web content
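
A minimal sketch of a dispatch-style executor for these tools; the function signature and argument names are assumptions, and the real `executor.py` presumably adds sandboxing, validation, and `webfetch` support:

```python
# Hypothetical tool dispatch; not the project's executor.py.
import subprocess
from pathlib import Path


def execute_tool(name: str, args: dict) -> str:
    if name == "read_file":
        return Path(args["path"]).read_text()
    if name == "write_file":
        Path(args["path"]).write_text(args["content"])
        return "ok"
    if name == "execute_bash":
        result = subprocess.run(args["command"], shell=True, capture_output=True, text=True)
        return result.stdout + result.stderr
    raise ValueError(f"unknown tool: {name}")
```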

### 9. Interactive Mode (`src/interactive/`)

Interactive CLI components:

- `ui.py` - Menu display and input handling
- `display.py` - Hardware and resource display
- `tips.py` - Educational content and help
- `config_utils.py` - Configuration selection utilities

### 10. Utilities (`src/utils/`)

Shared utility functions:

- `token_counter.py` - Token counting with tiktoken
- `project_discovery.py` - Project root detection
- `network.py` - Network utilities (IP detection)
- `logging_config.py` - Logging configuration
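
Since `token_counter.py` is described as counting tokens with tiktoken, a minimal version might look like this (the default encoding and the lack of per-model handling are assumptions):

```python
# Minimal tiktoken-based counter; the real token_counter.py may choose
# encodings per model rather than hard-coding cl100k_base.
import tiktoken


def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    encoding = tiktoken.get_encoding(encoding_name)
    return len(encoding.encode(text))
```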

## Data Flow

1. **Request** comes in via the API
2. **Routes** (thin layer) forward it to handlers
3. **Chat Handlers** process the request
4. **Swarm Manager** sends it to all workers
5. **Workers** generate responses in parallel
6. **Consensus** picks the best answer
7. **Response** is returned to the client

## Memory Model

- **External GPU**: Use 90% of VRAM
- **Apple Silicon**: Use total RAM minus a 4 GB buffer
- **CPU-only**: Use total RAM minus a 4 GB buffer

Each worker loads the full model independently (no sharing).
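
In code, the budget rule above reduces to something like the following sketch; `memory_calculator.py` presumably also accounts for context size and per-instance overhead:

```python
# Sketch of the per-machine memory budget described above.
GIB = 1024 ** 3


def memory_budget_bytes(total_bytes: int, is_discrete_gpu: bool) -> int:
    if is_discrete_gpu:
        return int(total_bytes * 0.90)    # external GPU: 90% of VRAM
    return max(total_bytes - 4 * GIB, 0)  # unified memory / CPU: RAM minus 4 GB
```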

## Configuration Files

Static data extracted to JSON for easy maintenance:

```
config/models/
├── model_metadata.json   # Model names, descriptions, priorities
├── mlx_quant_sizes.json  # MLX quantization VRAM requirements
├── gguf_quant_sizes.json # GGUF quantization VRAM requirements
└── selector_config.json  # Selection constraints and defaults
```
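
A loader for these files can stay very small, as in the sketch below; it assumes only the paths shown in the tree, not any particular key layout inside the JSON files:

```python
# Sketch of loading the config/models/*.json files into dictionaries.
import json
from pathlib import Path

CONFIG_DIR = Path("config/models")


def load_model_configs() -> dict:
    names = ["model_metadata", "mlx_quant_sizes", "gguf_quant_sizes", "selector_config"]
    return {name: json.loads((CONFIG_DIR / f"{name}.json").read_text()) for name in names}
```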

## Code Quality Standards

- **No files > 300 lines** (with few exceptions)
- **No functions > 50 lines**
- **No indentation > 3 levels**
- **No duplicate code** (> 3 lines)
- **Single responsibility** per module
- **Configuration over code** for static data

## Testing

```
tests/
├── test_hardware_detector.py   # Hardware detection tests
├── test_tool_parsing.py        # Tool parsing tests
└── test_federation_metrics.py  # Federation tests
```

Run tests: `python -m pytest tests/ -v`

## Future Ideas

- Context compression for long inputs
- CPU offloading for memory-constrained systems
- RAG integration for knowledge bases
- Speculative decoding for speed
- More sophisticated consensus algorithms