# Local Swarm Architecture
## Core Concept
Deploy multiple LLM instances on your hardware. Each instance processes the same input independently, then the instances vote on the best answer. Connect multiple machines running the swarm to form a "hive mind" that puts all your old hardware to use.
## How It Works
```
┌─────────────────┐     ┌─────────────────────────────────────┐
│   Your Prompt   │────▶│            Swarm Manager            │
└─────────────────┘     │ ┌─────────┐ ┌─────────┐ ┌─────────┐ │
                        │ │Worker 1 │ │Worker 2 │ │Worker 3 │ │
                        │ │  (LLM)  │ │  (LLM)  │ │  (LLM)  │ │
                        │ └────┬────┘ └────┬────┘ └────┬────┘ │
                        │      └───────────┼───────────┘      │
                        │                  ▼                  │
                        │          Consensus Engine           │
                        │        (Picks best answer)          │
                        └──────────────────┬──────────────────┘
                                           ▼
                                   ┌───────────────┐
                                   │ Best Response │
                                   └───────────────┘
```
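A minimal sketch of this fan-out-and-vote loop (the worker `generate()` method and the majority vote here are illustrative assumptions; the real logic lives in `src/swarm/`):
```python
import asyncio
from collections import Counter

async def ask_swarm(prompt: str, workers: list) -> str:
    """Send the same prompt to every worker and return the consensus pick."""
    # Fan out: every worker generates a response independently, in parallel.
    responses = await asyncio.gather(*(w.generate(prompt) for w in workers))
    # Vote: exact-match majority shown here; the real consensus engine
    # also scores similarity, quality, and latency.
    return Counter(responses).most_common(1)[0][0]
```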
## Project Structure
```
local_swarm/
├── main.py                            # Entry point (99 lines)
├── src/
│   ├── api/                           # HTTP API layer
│   │   ├── routes.py                  # FastAPI routes (252 lines)
│   │   ├── formatting.py              # Message formatting (265 lines)
│   │   ├── tool_parser.py             # Tool parsing (250 lines)
│   │   ├── chat_handlers.py           # Chat completion logic (287 lines)
│   │   ├── server.py                  # Server setup
│   │   └── models.py                  # API data models
│   ├── cli/                           # Command-line interface
│   │   ├── parser.py                  # CLI argument parsing
│   │   ├── main_runner.py             # Main application logic
│   │   ├── server_runner.py           # Server management
│   │   ├── test_runner.py             # Test mode execution
│   │   └── tool_server.py             # Tool server runner
│   ├── swarm/                         # Swarm orchestration
│   │   ├── manager.py                 # Swarm manager
│   │   ├── worker.py                  # LLM worker implementation
│   │   ├── consensus.py               # Consensus algorithms
│   │   └── orchestrator.py            # Generation orchestration
│   ├── models/                        # Model management
│   │   ├── registry.py                # Model registry (194 lines)
│   │   ├── selector.py                # Model selection (329 lines)
│   │   ├── memory_calculator.py       # Memory calculation utilities
│   │   └── downloader.py              # Model downloading
│   ├── backends/                      # LLM backends
│   │   ├── llama_cpp.py               # llama.cpp backend
│   │   ├── mlx.py                     # Apple Silicon MLX backend
│   │   └── base.py                    # Base backend interface
│   ├── hardware/                      # Hardware detection
│   │   ├── detector.py                # Hardware detection
│   │   ├── nvidia.py                  # NVIDIA GPU detection
│   │   ├── intel.py                   # Intel GPU detection
│   │   ├── qualcomm.py                # Qualcomm detection
│   │   └── ...
│   ├── network/                       # Network federation
│   │   ├── federation.py              # Cross-swarm consensus
│   │   ├── discovery.py               # Peer discovery (mDNS)
│   │   └── discovery_core.py          # Discovery utilities
│   ├── tools/                         # Tool execution
│   │   └── executor.py                # Tool execution engine
│   ├── interactive/                   # Interactive CLI
│   │   ├── ui.py                      # UI utilities
│   │   ├── display.py                 # Hardware/resource display
│   │   ├── tips.py                    # Help content
│   │   └── config_utils.py            # Configuration selection
│   └── utils/                         # Utilities
│       ├── token_counter.py           # Token counting
│       ├── project_discovery.py       # Project root discovery
│       ├── network.py                 # Network utilities
│       └── logging_config.py          # Logging configuration
├── config/
│   └── models/                        # Model configuration files
│       ├── model_metadata.json        # Model metadata
│       ├── mlx_quant_sizes.json       # MLX quantization sizes
│       ├── gguf_quant_sizes.json      # GGUF quantization sizes
│       └── selector_config.json       # Selection constants
└── tests/                             # Test suite
```
## Architecture Principles
### 1. Separation of Concerns
Each module has a single responsibility:
- **API layer** (`src/api/`) - HTTP routing only
- **CLI layer** (`src/cli/`) - User interface and orchestration
- **Swarm layer** (`src/swarm/`) - LLM worker management
- **Models layer** (`src/models/`) - Model selection and downloading
### 2. Configuration Over Code
Static data extracted to JSON configs:
- Model metadata in `config/models/model_metadata.json`
- Quantization sizes in `mlx_quant_sizes.json` and `gguf_quant_sizes.json`
- Selection constants in `selector_config.json`
### 3. Modular Utilities
Shared functionality in reusable modules:
- `utils/token_counter.py` - Centralized token counting
- `utils/project_discovery.py` - Project root detection
- `utils/network.py` - IP detection and network utilities
## Components
### 1. Hardware Detection (`src/hardware/`)
Detects your GPU and available memory to optimize model selection.
- **NVIDIA** - pynvml
- **AMD** - rocm-smi
- **Intel** - sycl-ls
- **Apple Silicon** - sysctl/unified memory
- **Qualcomm** - Android/Termux detection
- **CPU** - psutil
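As a rough illustration of the NVIDIA and CPU paths above (a sketch, not the project's actual detector interface):
```python
import psutil

def detect_memory_bytes() -> int:
    """Best-effort accelerator memory, falling back to system RAM."""
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)  # first GPU only
        total = pynvml.nvmlDeviceGetMemoryInfo(handle).total
        pynvml.nvmlShutdown()
        return total
    except Exception:
        # No NVIDIA GPU or driver available: fall back to the CPU path.
        return psutil.virtual_memory().total
```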
### 2. Model Selection (`src/models/`)
Automatically picks the best model based on available memory:
```
Available Memory → Model Size → Quantization → Instance Count
24 GB            → 14B        → Q4_K_M       → 2-3 instances
16 GB            → 7B         → Q4_K_M       → 3-4 instances
 8 GB            → 3B         → Q6_K         → 2-3 instances
```
**Key modules:**
- `registry.py` - Loads model data from JSON configs
- `selector.py` - Selects optimal model for hardware
- `memory_calculator.py` - Calculates memory requirements
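A simplified version of that mapping (the tier boundaries here mirror the table above; the real `selector.py` reads them from `selector_config.json`):
```python
def select_model(available_gb: float) -> tuple[str, str, int]:
    """Map available memory to (model size, quantization, instance count)."""
    if available_gb >= 24:
        return ("14B", "Q4_K_M", 2)
    if available_gb >= 16:
        return ("7B", "Q4_K_M", 3)
    if available_gb >= 8:
        return ("3B", "Q6_K", 2)
    return ("1B", "Q4_K_M", 1)  # hypothetical floor tier for small machines
```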
### 3. Backends (`src/backends/`)
Run the actual LLM inference:
- **llama.cpp** - CUDA, ROCm, SYCL, CPU (cross-platform)
- **MLX** - Apple Silicon optimized
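The shape of `base.py` is roughly an abstract interface that both backends implement (a sketch; the method names are assumptions):
```python
from abc import ABC, abstractmethod

class Backend(ABC):
    """Common interface for the llama.cpp and MLX backends."""

    @abstractmethod
    def load(self, model_path: str) -> None:
        """Load model weights onto the device."""

    @abstractmethod
    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        """Run inference and return the completion text."""
```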
### 4. Swarm Management (`src/swarm/`)
Manages multiple LLM workers and consensus voting.
**Workers**: Each runs an independent LLM instance
**Consensus**: Picks the best response using:
- Similarity (semantic grouping)
- Quality (code blocks, structure)
- Fastest (latency)
- Majority (exact match)
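The similarity strategy, for instance, can be sketched with stdlib tooling (the real `consensus.py` may score differently; this is an assumption):
```python
from difflib import SequenceMatcher

def similarity_winner(responses: list[str]) -> str:
    """Pick the response most similar, on average, to all the others."""
    if len(responses) < 2:
        return responses[0]

    def avg_similarity(i: int) -> float:
        return sum(
            SequenceMatcher(None, responses[i], responses[j]).ratio()
            for j in range(len(responses))
            if j != i
        ) / (len(responses) - 1)

    return responses[max(range(len(responses)), key=avg_similarity)]
```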
**Key modules:**
- `manager.py` - Swarm lifecycle and coordination
- `worker.py` - Individual worker implementation
- `consensus.py` - Consensus algorithms
- `orchestrator.py` - Generation orchestration
### 5. Network Federation (`src/network/`)
Connect multiple machines into a distributed swarm:
```
Machine 1 (4 workers) ──┐
Machine 2 (2 workers) ──┼──▶ Cross-Swarm Consensus ──▶ Best Answer
Machine 3 (3 workers) ──┘
```
**Discovery**: mDNS/Bonjour auto-discovery
**Protocol**: HTTP between peers
**Voting**: Two-phase (local consensus → global consensus)
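A sketch of the two-phase vote, assuming a hypothetical `/consensus` endpoint on each peer that returns its local winner (the real protocol lives in `federation.py`):
```python
import asyncio
from collections import Counter

import aiohttp

async def federated_answer(prompt: str, peers: list[str]) -> str:
    """Phase 1: each swarm votes locally. Phase 2: pick the global winner."""
    async with aiohttp.ClientSession() as session:

        async def local_winner(peer: str) -> str:
            # Hypothetical endpoint: the peer runs its own local consensus.
            url = f"{peer}/consensus"
            async with session.post(url, json={"prompt": prompt}) as resp:
                return (await resp.json())["answer"]

        winners = await asyncio.gather(*(local_winner(p) for p in peers))
    # Majority shown here; any local consensus strategy could be reused globally.
    return Counter(winners).most_common(1)[0][0]
```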
### 6. API (`src/api/`)
OpenAI-compatible REST API:
- `POST /v1/chat/completions` - Main endpoint
- `GET /v1/models` - List models
- `GET /health` - Health check
- `POST /v1/tools/execute` - Tool execution (when enabled)
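Since the API is OpenAI-compatible, any standard client works. For example, with `requests` (port and model name are assumptions; check the server's startup output and `GET /v1/models`):
```python
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",  # assumed default port
    json={
        "model": "local-swarm",  # placeholder; list real names via /v1/models
        "messages": [{"role": "user", "content": "Explain consensus voting."}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])
```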
**Modular design:**
- `routes.py` - HTTP routing only (thin controllers)
- `formatting.py` - Message formatting logic
- `tool_parser.py` - Tool call parsing
- `chat_handlers.py` - Chat completion business logic
### 7. CLI (`src/cli/`)
Command-line interface modules:
- `parser.py` - Argument parsing
- `main_runner.py` - Main application orchestration
- `server_runner.py` - Server lifecycle management
- `test_runner.py` - Test mode execution
- `tool_server.py` - Tool server management
### 8. Tools (`src/tools/`)
Optional tool execution for enhanced capabilities:
- `read_file` - Read files
- `write_file` - Write files
- `execute_bash` - Run shell commands
- `webfetch` - Fetch web content
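Dispatch in `executor.py` can be pictured as a name-to-function registry (a sketch; the real executor adds sandboxing and argument validation):
```python
import subprocess
from pathlib import Path

def read_file(path: str) -> str:
    return Path(path).read_text()

def execute_bash(command: str) -> str:
    result = subprocess.run(command, shell=True, capture_output=True, text=True)
    return result.stdout + result.stderr

TOOLS = {"read_file": read_file, "execute_bash": execute_bash}

def execute_tool(name: str, args: dict) -> str:
    """Look up a tool by name and call it with the parsed arguments."""
    if name not in TOOLS:
        raise ValueError(f"Unknown tool: {name}")
    return TOOLS[name](**args)
```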
### 9. Interactive Mode (`src/interactive/`)
Interactive CLI components:
- `ui.py` - Menu display and input handling
- `display.py` - Hardware and resource display
- `tips.py` - Educational content and help
- `config_utils.py` - Configuration selection utilities
### 10. Utilities (`src/utils/`)
Shared utility functions:
- `token_counter.py` - Token counting with tiktoken
- `project_discovery.py` - Project root detection
- `network.py` - Network utilities (IP detection)
- `logging_config.py` - Logging configuration
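Centralized counting with tiktoken looks roughly like this (the encoding choice is an assumption; local models often ship their own tokenizers):
```python
import tiktoken

_ENCODING = tiktoken.get_encoding("cl100k_base")  # assumed default encoding

def count_tokens(text: str) -> int:
    """Approximate token count used for context-window budgeting."""
    return len(_ENCODING.encode(text))
```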
## Data Flow
1. **Request** comes in via API
2. **Routes** (thin layer) forward to handlers
3. **Chat Handlers** process the request
4. **Swarm Manager** sends to all workers
5. **Workers** generate responses in parallel
6. **Consensus** picks the best answer
7. **Response** returned to client
## Memory Model
- **External GPU**: Use 90% of VRAM
- **Apple Silicon**: Use RAM - 4GB buffer
- **CPU-only**: Use RAM - 4GB buffer
Each worker loads the full model independently (no sharing).
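Those rules reduce to a small budget function (a sketch of the policy above):
```python
def memory_budget_gb(total_gb: float, device: str) -> float:
    """Memory the swarm plans around, per the rules above."""
    if device == "gpu":  # external GPU: use 90% of VRAM
        return total_gb * 0.9
    return max(total_gb - 4.0, 0.0)  # Apple Silicon / CPU: keep a 4 GB buffer
```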
## Configuration Files
Static data extracted to JSON for easy maintenance:
```
config/models/
├── model_metadata.json   # Model names, descriptions, priorities
├── mlx_quant_sizes.json  # MLX quantization VRAM requirements
├── gguf_quant_sizes.json # GGUF quantization VRAM requirements
└── selector_config.json  # Selection constraints and defaults
```
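Loading then stays trivial in `registry.py` (a sketch; the path layout matches the tree above, the function name is an assumption):
```python
import json
from pathlib import Path

CONFIG_DIR = Path("config/models")

def load_model_metadata() -> dict:
    """Read static model data from JSON instead of hardcoding it."""
    with open(CONFIG_DIR / "model_metadata.json") as f:
        return json.load(f)
```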
## Code Quality Standards
- **No files > 300 lines** (with a few exceptions)
- **No functions > 50 lines**
- **No indentation > 3 levels**
- **No duplicate code** (no repeated blocks > 3 lines)
- **Single responsibility** per module
- **Configuration over code** for static data
## Testing
```
tests/
├── test_hardware_detector.py  # Hardware detection tests
├── test_tool_parsing.py       # Tool parsing tests
└── test_federation_metrics.py # Federation tests
```
Run tests: `python -m pytest tests/ -v`
## Future Ideas
- Context compression for long inputs
- CPU offloading for memory-constrained systems
- RAG integration for knowledge bases
- Speculative decoding for speed
- More sophisticated consensus algorithms