# Local Swarm Architecture

## Core Concept

Deploy multiple LLM instances on your hardware. Each instance processes the same input independently, then the instances vote on the best answer. Connect several machines running this to create a "hive mind" that puts all your old hardware to use.

## How It Works

```
┌─────────────────┐     ┌─────────────────────────────────────┐
│   Your Prompt   │────▶│            Swarm Manager            │
└─────────────────┘     │  ┌─────────┐ ┌─────────┐ ┌─────────┐│
                        │  │Worker 1 │ │Worker 2 │ │Worker 3 ││
                        │  │  (LLM)  │ │  (LLM)  │ │  (LLM)  ││
                        │  └────┬────┘ └────┬────┘ └────┬────┘│
                        │       └───────────┼───────────┘     │
                        │                   ▼                 │
                        │          Consensus Engine           │
                        │        (Picks best answer)          │
                        └───────────────────┬─────────────────┘
                                            ▼
                                  ┌───────────────┐
                                  │ Best Response │
                                  └───────────────┘
```

## Components

### 1. Hardware Detection (`src/hardware/`)

Detects your GPU and available memory to optimize model selection (see the detection sketch under "Illustrative Sketches" below):

- **NVIDIA** - pynvml
- **AMD** - rocm-smi
- **Intel** - sycl-ls
- **Apple Silicon** - sysctl/unified memory
- **Qualcomm** - Android/Termux detection
- **CPU** - psutil

### 2. Model Selection (`src/models/`)

Automatically picks the best model for the available memory (selection sketch below):

```
Available Memory → Model Size → Quantization → Instance Count
24 GB            → 14B        → Q4_K_M       → 2-3 instances
16 GB            → 7B         → Q4_K_M       → 3-4 instances
8 GB             → 3B         → Q6_K         → 2-3 instances
```

### 3. Backends (`src/backends/`)

Run the actual LLM inference:

- **llama.cpp** - CUDA, ROCm, SYCL, CPU (cross-platform)
- **MLX** - Apple Silicon optimized

### 4. Swarm Management (`src/swarm/`)

Manages multiple LLM workers and consensus voting.

**Workers**: Each runs an independent LLM instance.

**Consensus**: Picks the best response using one of four strategies (consensus and fan-out sketches below):

- Similarity (semantic grouping)
- Quality (code blocks, structure)
- Fastest (latency)
- Majority (exact match)

### 5. Network Federation (`src/network/`)

Connect multiple machines into a distributed swarm:

```
Machine 1 (4 workers) ──┐
Machine 2 (2 workers) ──┼──▶ Cross-Swarm Consensus ──▶ Best Answer
Machine 3 (3 workers) ──┘
```

**Discovery**: mDNS/Bonjour auto-discovery
**Protocol**: HTTP between peers
**Voting**: Two-phase (local consensus → global consensus); see the federation sketch below

### 6. API (`src/api/`)

OpenAI-compatible REST API:

- `POST /v1/chat/completions` - Main endpoint
- `GET /v1/models` - List models
- `GET /health` - Health check
- Federation endpoints when enabled

### 7. Tools (`src/tools/`)

Optional tool execution for enhanced capabilities (dispatch sketch below):

- `read_file` - Read files
- `write_file` - Write files
- `execute_bash` - Run shell commands

## Data Flow

1. **Request** comes in via the API
2. **Swarm Manager** sends it to all workers
3. **Workers** generate responses in parallel
4. **Consensus** picks the best answer
5. **Response** is returned to the client

## Memory Model

- **Discrete GPU**: use 90% of VRAM
- **Apple Silicon**: use RAM minus a 4 GB buffer
- **CPU-only**: use RAM minus a 4 GB buffer

Each worker loads the full model independently (no sharing).

## Future Ideas

- Context compression for long inputs
- CPU offloading for memory-constrained systems
- RAG integration for knowledge bases
- Speculative decoding for speed
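
## Illustrative Sketches

The snippets below are minimal Python sketches of the ideas described above, not the project's actual code. Names, argument shapes, and thresholds are assumptions made for illustration.

**Hardware detection.** A sketch of the probe-and-fall-back pattern from component 1, covering only the NVIDIA (pynvml) and CPU (psutil) cases:

```python
import psutil

def detect_memory_bytes() -> tuple[str, int]:
    """Return (device_kind, total_memory_bytes). Hypothetical helper."""
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)  # .total/.free/.used in bytes
        pynvml.nvmlShutdown()
        return "nvidia", info.total
    except Exception:
        # No NVIDIA GPU (or pynvml not installed): fall back to system RAM.
        return "cpu", psutil.virtual_memory().total
```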
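
**Model selection and memory budget.** A sketch combining the Memory Model rules with the selection table from component 2. Tier boundaries mirror the table; the returned instance count takes the upper end of each range:

```python
GIB = 1024 ** 3

def usable_budget(device_kind: str, total_bytes: int) -> int:
    """Memory Model rules: 90% of VRAM, else RAM minus a 4 GB buffer."""
    if device_kind == "nvidia":
        return int(total_bytes * 0.90)
    return max(0, total_bytes - 4 * GIB)

def pick_model(budget_bytes: int) -> tuple[str, str, int]:
    """Return (model_size, quantization, instance_count) per the table above."""
    gb = budget_bytes / GIB
    if gb >= 24:
        return "14B", "Q4_K_M", 3   # table says 2-3 instances
    if gb >= 16:
        return "7B", "Q4_K_M", 4    # table says 3-4 instances
    if gb >= 8:
        return "3B", "Q6_K", 3      # table says 2-3 instances
    raise RuntimeError("not enough memory for any supported model tier")
```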
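
**Consensus.** A sketch of three of the four strategies from component 4: similarity grouping (approximated here with `difflib` string ratios rather than real semantic embeddings), majority voting over the groups, and latency as the tie-breaker. The quality strategy is omitted:

```python
from difflib import SequenceMatcher

def consensus(responses: list[tuple[str, float]]) -> str:
    """responses: (text, latency_seconds) pairs, one per worker."""
    groups: list[list[tuple[str, float]]] = []
    for text, latency in responses:
        for group in groups:
            # Similarity: put near-identical answers in the same group.
            if SequenceMatcher(None, text, group[0][0]).ratio() > 0.9:
                group.append((text, latency))
                break
        else:
            groups.append([(text, latency)])
    # Majority: the largest group wins; Fastest: lowest latency breaks ties.
    best_group = max(groups, key=len)
    return min(best_group, key=lambda r: r[1])[0]
```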
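
**Swarm fan-out.** A sketch of Data Flow steps 2-4: the manager queries all workers in parallel, times each response, and hands the results to the consensus sketch above. The `worker.generate(prompt)` method is an assumed interface:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def ask_swarm(prompt: str, workers: list) -> str:
    def timed(worker):
        start = time.monotonic()
        text = worker.generate(prompt)  # assumed blocking worker interface
        return text, time.monotonic() - start

    # Steps 2-3: fan the prompt out to every worker in parallel.
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        responses = list(pool.map(timed, workers))
    # Step 4: pick the best (text, latency) pair.
    return consensus(responses)
```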
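
**Two-phase federation voting.** A sketch of the global phase from component 5: each peer's OpenAI-compatible endpoint is assumed to return its local-consensus winner, and the caller runs cross-swarm consensus over those winners. Peer URLs, the placeholder model name, and the payload shape are assumptions:

```python
import requests  # third-party: pip install requests

def federated_answer(prompt: str, peers: list[str]) -> str:
    candidates = []
    for base_url in peers:
        # Phase 1 happens inside each peer: its swarm runs local consensus
        # and returns one answer through the OpenAI-compatible API.
        resp = requests.post(
            f"{base_url}/v1/chat/completions",
            json={"model": "swarm",  # placeholder model name
                  "messages": [{"role": "user", "content": prompt}]},
            timeout=120,
        )
        text = resp.json()["choices"][0]["message"]["content"]
        candidates.append((text, resp.elapsed.total_seconds()))
    # Phase 2: global consensus over the per-machine winners.
    return consensus(candidates)
```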
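
**Tool dispatch.** A sketch mapping the three tool names from component 7 to handlers. The argument keys are assumptions, and a real server should sandbox `execute_bash`:

```python
import subprocess
from pathlib import Path

def run_tool(name: str, args: dict) -> str:
    if name == "read_file":
        return Path(args["path"]).read_text()
    if name == "write_file":
        Path(args["path"]).write_text(args["content"])
        return "ok"
    if name == "execute_bash":
        # Runs in the caller's shell; unsafe without sandboxing.
        result = subprocess.run(args["command"], shell=True,
                                capture_output=True, text=True, timeout=60)
        return result.stdout + result.stderr
    raise ValueError(f"unknown tool: {name}")
```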