# Local Swarm - Detailed Implementation Plan
## Overview
A terminal-based tool that automatically configures and runs a swarm of small coding LLMs optimized for your hardware, exposing an OpenAI-compatible API for integration with opencode and other tools.
## Architecture
```
local_swarm/
├── src/
│   ├── __init__.py
│   ├── hardware/
│   │   ├── __init__.py
│   │   ├── detector.py        # Platform-agnostic hardware detection
│   │   ├── nvidia.py          # NVIDIA GPU detection (Windows/Linux)
│   │   ├── amd.py             # AMD GPU detection (ROCm)
│   │   ├── intel.py           # Intel GPU detection (OneAPI/OpenCL)
│   │   ├── qualcomm.py        # Qualcomm/Adreno detection (Android)
│   │   ├── apple_silicon.py   # Apple Silicon detection (macOS)
│   │   └── memory.py          # RAM detection
│   ├── models/
│   │   ├── __init__.py
│   │   ├── registry.py        # Model database with specs
│   │   ├── selector.py        # Optimal model/quant selection logic
│   │   └── downloader.py      # Download manager (HuggingFace)
│   ├── backends/
│   │   ├── __init__.py
│   │   ├── base.py            # Backend interface
│   │   ├── llamacpp.py        # llama.cpp backend (CUDA/ROCm/SYCL)
│   │   └── mlx.py             # MLX backend (macOS)
│   ├── swarm/
│   │   ├── __init__.py
│   │   ├── manager.py         # Instance lifecycle management
│   │   ├── worker.py          # Individual LLM instance wrapper
│   │   ├── consensus.py       # Local voting/consensus algorithm
│   │   └── cross_consensus.py # Cross-swarm consensus
│   ├── network/
│   │   ├── __init__.py
│   │   ├── discovery.py       # mDNS/Bonjour peer discovery
│   │   ├── federation.py      # Inter-swarm communication
│   │   └── protocol.py        # Network protocol definitions
│   └── api/
│       ├── __init__.py
│       ├── server.py          # FastAPI/uvicorn server
│       ├── routes.py          # OpenAI-compatible endpoints
│       ├── federation.py      # Federation endpoints
│       └── middleware.py      # Request handling
├── tests/
├── config/
│   └── models.yaml            # Model configurations
├── scripts/
│   ├── install.bat            # Windows installer
│   ├── install.sh             # Unix installer
│   └── install-termux.sh      # Android/Termux installer
├── main.py                    # CLI entry point
├── requirements.txt
├── requirements-macos.txt     # MLX-specific deps
├── requirements-termux.txt    # Android/Termux deps
├── setup.py
└── .gitignore
```
## Network Federation Architecture
When multiple machines run Local Swarm on the same network, they can form a "federated swarm":
```
                    Local Network

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  Windows PC  │   │   Mac Mini   │   │   MacBook    │
│  (RTX 4060)  │   │     (M1)     │   │     (M4)     │
│              │   │              │   │              │
│ ┌──────────┐ │   │ ┌──────────┐ │   │ ┌──────────┐ │
│ │ Swarm 1  │ │   │ │ Swarm 2  │ │   │ │ Swarm 3  │ │
│ │ 4 inst.  │ │   │ │ 2 inst.  │ │   │ │ 3 inst.  │ │
│ │ Qwen 7B  │ │   │ │ Qwen 7B  │ │   │ │ Qwen 7B  │ │
│ └────┬─────┘ │   │ └────┬─────┘ │   │ └────┬─────┘ │
│      │       │   │      │       │   │      │       │
│    mDNS      │   │    mDNS      │   │    mDNS      │
│      │       │   │      │       │   │      │       │
└──────┼───────┘   └──────┼───────┘   └──────┼───────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
                 ┌────────┴────────┐
                 │   Federation    │
                 │   Coordinator   │
                 │                 │
                 │  ┌───────────┐  │
                 │  │ Consensus │  │
                 │  │  Engine   │  │
                 │  └───────────┘  │
                 └────────┬────────┘
                          │
                 ┌────────▼────────┐
                 │    opencode     │
                 │    (Client)     │
                 └─────────────────┘

Federation Flow:
1. Each swarm independently detects hardware and starts instances
2. Swarms advertise themselves via mDNS/Bonjour
3. When request comes in, each swarm generates local responses
4. Local consensus picks best response per swarm
5. Cross-swarm consensus votes across all best responses
6. Final answer returned to opencode
```
**Benefits**:
- Utilize all hardware in your home/office
- Each machine optimizes for its own specs
- No single point of failure
- Automatic load distribution
- Works even if one machine goes offline
## Implementation Phases
### Phase 1: Foundation (Week 1)
#### 1.1 Hardware Detection Module
**File**: `src/hardware/detector.py`
**Requirements**:
- Cross-platform OS detection (Windows, macOS, Linux)
- CPU info (cores, architecture)
- RAM detection (total, available)
- GPU detection with VRAM
**Platform-specific implementations**:
- **Windows**: Use `pynvml` for NVIDIA, fall back to DirectX detection for others
- **macOS**: Use `psutil` for RAM, `sysctl` for CPU, Metal API for GPU
- **Linux**: Use `pynvml` for NVIDIA, `rocm-smi` for AMD
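As an illustration of the NVIDIA path, a minimal detection sketch using `pynvml` might look like the following (multi-GPU handling omitted; the returned dictionary stands in for the `GPUInfo` structure below):
```python
import pynvml  # NVIDIA Management Library bindings

def detect_nvidia_gpu() -> dict | None:
    """Sketch: report the first NVIDIA GPU's name and total VRAM, or None if unavailable."""
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # no NVIDIA driver/GPU present
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        name = pynvml.nvmlDeviceGetName(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return {"name": name, "vram_gb": mem.total / (1024 ** 3)}
    finally:
        pynvml.nvmlShutdown()
```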
**Output structure**:
```python
class HardwareProfile:
    os: str                 # 'windows', 'darwin', 'linux'
    cpu_cores: int
    ram_gb: float
    gpu: Optional[GPUInfo]
    is_apple_silicon: bool
```
**Model selection rules**:
- External GPU (NVIDIA/AMD): Use 100% of VRAM
- Apple Silicon: Use 50% of unified RAM
- CPU-only: Use 50% of system RAM
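A sketch of how these rules could be encoded (the ~10% driver buffer for external GPUs mirrors the Memory Model section later in this plan; exact fractions are meant to be configurable):
```python
def get_available_memory(hw: HardwareProfile) -> float:
    """Memory budget (GB) for the swarm, per the selection rules above."""
    if hw.gpu and not hw.is_apple_silicon:
        return hw.gpu.vram_gb * 0.9  # external GPU: all VRAM minus ~10% for OS/drivers
    # Apple Silicon unified memory and CPU-only hosts both use half of system RAM
    return hw.ram_gb * 0.5
```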
#### 1.2 Model Registry
**File**: `src/models/registry.py`
**Model database** (YAML format):
```yaml
models:
  qwen2.5-coder:
    name: "Qwen 2.5 Coder"
    description: "Alibaba's code-focused model"
    variants:
      - size: 3b
        base_vram_gb: 2.0  # Approximate VRAM for fp16
        quantizations:
          q4_k_m:
            vram_gb: 1.8
            quality: "good"
          q5_k_m:
            vram_gb: 2.2
            quality: "better"
          q6_k:
            vram_gb: 2.6
            quality: "best"
      - size: 7b
        base_vram_gb: 14.0
        quantizations:
          q4_k_m:
            vram_gb: 4.5
          q5_k_m:
            vram_gb: 5.2
          q6_k:
            vram_gb: 6.0
  codellama:
    name: "CodeLlama"
    # Similar structure...
  deepseek-coder:
    name: "DeepSeek Coder"
    # Similar structure...
```
**Selection priority**:
1. Qwen 2.5 Coder (best for small sizes)
2. DeepSeek Coder (good alternative)
3. CodeLlama (fallback)
#### 1.3 Model Selector Logic
**File**: `src/models/selector.py`
**Algorithm**:
```python
def select_optimal_model(hardware: HardwareProfile) -> ModelConfig:
    available_memory = get_available_memory(hardware)

    # Try models in priority order
    for model in PRIORITY_MODELS:
        # Find largest size that fits
        for variant in reversed(model.variants):
            # Try highest quantization that fits
            for quant in reversed(variant.quantizations):
                total_vram_needed = quant.vram_gb * MIN_INSTANCES
                if total_vram_needed <= available_memory:
                    # Calculate max instances
                    max_instances = int(available_memory // quant.vram_gb)
                    # Cap at reasonable limit (e.g., 8)
                    instances = min(max_instances, 8)
                    return ModelConfig(model, variant, quant, instances)

    # Fallback to smallest model
    return FALLBACK_CONFIG
```
**Minimum instances**: 2 (for consensus voting)
**Maximum instances**: 8 (to avoid overhead)
### Phase 2: Backend Integration (Week 2)
#### 2.1 Base Backend Interface
**File**: `src/backends/base.py`
```python
from abc import ABC, abstractmethod
from typing import AsyncIterator


class LLMBackend(ABC):
    @abstractmethod
    async def load_model(self, model_path: str, config: dict) -> bool:
        pass

    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        pass

    @abstractmethod
    def get_memory_usage(self) -> float:
        """Return current VRAM/RAM usage in GB"""
        pass

    @abstractmethod
    def shutdown(self):
        pass
```
#### 2.2 llama.cpp Backend
**File**: `src/backends/llamacpp.py`
**Implementation**:
- Use `llama-cpp-python` library
- Support GGUF model format
- GPU acceleration via CUDA/Metal
- Server mode with HTTP API
**Key features**:
- Model caching to avoid reload
- Context window management
- Batch processing support
**Memory calculation**:
```python
def calculate_memory_usage(model_path: str) -> float:
    # Parse GGUF metadata
    # Return estimated VRAM usage
```
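A rough sketch of how this backend could wrap `llama-cpp-python` (parameter names follow that library; the config keys shown are assumptions, and the remaining abstract methods are omitted for brevity):
```python
from llama_cpp import Llama

class LlamaCppBackend(LLMBackend):
    """Minimal llama.cpp-backed sketch of the backend interface."""

    async def load_model(self, model_path: str, config: dict) -> bool:
        self.llm = Llama(
            model_path=model_path,
            n_gpu_layers=config.get("n_gpu_layers", -1),  # -1 offloads all layers to the GPU
            n_ctx=config.get("n_ctx", 4096),              # context window size
        )
        return True

    async def generate(self, prompt: str, **kwargs) -> str:
        out = self.llm(prompt, max_tokens=kwargs.get("max_tokens", 512))
        return out["choices"][0]["text"]
```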
#### 2.3 MLX Backend (macOS)
**File**: `src/backends/mlx.py`
**Implementation**:
- Use `mlx-lm` library
- Support MLX format models
- Optimized for Apple Silicon
**Key differences from llama.cpp**:
- Native Metal performance
- Simpler API
- Unified memory model
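For reference, the `mlx-lm` API this backend would wrap is roughly the following (the model path and parameters are illustrative):
```python
from mlx_lm import load, generate

# Load an MLX-format model from the local cache or Hugging Face (path is illustrative)
model, tokenizer = load("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit")
text = generate(model, tokenizer, prompt="Write a Python function to reverse a string.", max_tokens=256)
print(text)
```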
### Phase 3: Swarm Management (Week 3)
#### 3.1 Worker Instance
**File**: `src/swarm/worker.py`
Each worker manages:
- One LLM instance
- Request queue
- Health monitoring
- Metrics collection
```python
class SwarmWorker:
    def __init__(self, worker_id: int, backend: LLMBackend, config: dict):
        self.worker_id = worker_id
        self.backend = backend
        self.is_healthy = True
        self.request_count = 0
        self.avg_latency = 0.0

    async def process(self, request: GenerationRequest) -> GenerationResponse:
        start = time.time()
        response = await self.backend.generate(**request.params)
        latency = time.time() - start
        self._update_metrics(latency)
        return GenerationResponse(response, latency, self.worker_id)
```
#### 3.2 Swarm Manager
**File**: `src/swarm/manager.py`
Responsibilities:
- Spawn N workers based on hardware
- Distribute requests to all workers
- Collect responses
- Handle worker failures
```python
class SwarmManager:
    def __init__(self, config: ModelConfig):
        self.workers: List[SwarmWorker] = []
        self.config = config

    async def initialize(self):
        # Download model if needed
        model_path = await self._ensure_model()
        # Spawn workers
        for i in range(self.config.instances):
            backend = self._create_backend()
            await backend.load_model(model_path, self.config.backend_params)
            worker = SwarmWorker(i, backend, self.config)
            self.workers.append(worker)

    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        # Send the same request to all workers in parallel
        request = GenerationRequest(prompt=prompt, **kwargs)
        tasks = [w.process(request) for w in self.workers]
        return await asyncio.gather(*tasks)
```
#### 3.3 Consensus Algorithm
**File**: `src/swarm/consensus.py`
**Voting strategies**:
1. **Similarity voting** (default):
   - Embed all responses
   - Group by semantic similarity
   - Return largest group
2. **Quality scoring**:
   - Score each response on:
     - Completeness (does it answer the question?)
     - Code quality (syntax, structure)
     - Length appropriateness
   - Return highest score
3. **Latency-weighted**:
   - Prefer faster responses (lower memory pressure)
**Implementation**:
```python
class ConsensusEngine:
    def __init__(self, strategy: str = "similarity"):
        self.strategy = strategy
        self.embedding_model = None  # Lazy load

    async def select_best(self, responses: List[GenerationResponse]) -> str:
        if len(responses) == 1:
            return responses[0].text
        if self.strategy == "similarity":
            return await self._similarity_vote(responses)
        elif self.strategy == "quality":
            return await self._quality_score(responses)
        else:
            return self._fastest_response(responses)

    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        # Use sentence-transformers for embeddings
        # Group by cosine similarity > 0.85
        # Return median response from largest group
```
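A sketch of the similarity-vote grouping step, assuming `sentence-transformers` for embeddings (the 0.85 threshold matches the comment above; the embedding model name is an assumption):
```python
import numpy as np
from sentence_transformers import SentenceTransformer

def largest_similarity_group(texts: list[str], threshold: float = 0.85) -> list[int]:
    """Return indices of the largest cluster of mutually similar responses."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # any small embedding model works
    emb = model.encode(texts, normalize_embeddings=True)
    sims = emb @ emb.T  # cosine similarity, since embeddings are unit-normalized
    groups, assigned = [], set()
    for i in range(len(texts)):
        if i in assigned:
            continue
        group = [j for j in range(len(texts)) if j not in assigned and sims[i, j] >= threshold]
        assigned.update(group)
        groups.append(group)
    return max(groups, key=len)  # the consensus answer is drawn from this group
```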
### Phase 4: API Server (Week 4)
#### 4.1 OpenAI-Compatible Endpoints
**File**: `src/api/routes.py`
Required endpoints:
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion
- `POST /v1/completions` - Text completion (optional)
- `GET /health` - Health check
- `GET /metrics` - Prometheus metrics (optional)
**Chat completions endpoint**:
```python
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
# Extract messages
messages = request.messages
prompt = format_messages(messages)
# Get all responses from swarm
responses = await swarm_manager.generate_all(prompt, **request.params)
# Run consensus
best_response = await consensus_engine.select_best(responses)
# Format as OpenAI response
return {
"id": f"chatcmpl-{uuid4()}",
"object": "chat.completion",
"created": int(time.time()),
"model": request.model,
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": best_response},
"finish_reason": "stop"
}],
"usage": calculate_usage(prompt, best_response)
}
```
#### 4.2 Streaming Support
**File**: `src/api/routes.py`
For streaming, use the fastest worker instead of consensus:
```python
if request.stream:
    # Pick worker with lowest latency
    worker = swarm_manager.get_fastest_worker()
    return StreamingResponse(
        worker.stream_generate(prompt),
        media_type="text/event-stream"
    )
```
### Phase 5: CLI & Interactive Interface (Week 5)
#### 5.1 Interactive Menu System
**File**: `src/interactive.py`
**Features**:
- **Hardware Display**: Detailed hardware info with formatting
  - OS, CPU cores, RAM (total/available)
  - GPU details (name, VRAM, driver, type)
  - Memory allocation rules
- **Model Selection Menu**: Three configuration options
  1. **Recommended Configuration**: Auto-detects optimal model + instances
  2. **Browse All Configurations**: Lists all feasible models for hardware
  3. **Custom Configuration**: Step-by-step wizard
     - Select model family (Qwen, DeepSeek, CodeLlama)
     - Choose model size (3B, 7B, 14B)
     - Pick quantization (Q4_K_M, Q5_K_M, Q6_K)
     - Specify instance count (1 to max supported)
- **Resource Usage Monitor**: Real-time swarm status
  - Swarm status (running/stopped)
  - Current model name
  - Healthy workers count
  - Memory usage (total and per-worker)
  - Worker statistics:
    - Total requests served
    - Average latency
    - Tokens per second
- **Startup Summary**: Comprehensive display showing:
  - Hardware detection section
  - Model configuration section
  - Resource usage section
  - Memory utilization percentage
**Implementation**:
```python
def interactive_model_selection(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Show hardware info
    # Display menu with 3 options
    # Return selected configuration

def custom_configuration(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Step-by-step wizard
    # Select model -> size -> quantization -> instances
    # Validate memory constraints

def show_startup_summary(hardware, config, swarm_manager=None):
    # Clear screen
    # Print formatted hardware, config, and usage info
```
#### 5.2 CLI Interface
**File**: `main.py`
Commands:
```bash
# Start the swarm (auto-detect hardware)
python -m local_swarm
# Start with specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Start with specific port
python -m local_swarm --port 8080
# Override instance count
python -m local_swarm --instances 4
# Show hardware detection
python -m local_swarm --detect
# Download models only
python -m local_swarm --download-only
```
#### 5.3 Configuration File
**File**: `config.yaml`
```yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8
  timeout: 60

models:
  cache_dir: "~/.local_swarm/models"
  preferred_models:
    - qwen2.5-coder
    - deepseek-coder
    - codellama

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5         # Use 50% of system RAM for CPU/Apple Silicon

network:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  advertise_interval: 30
  max_peers: 10
  auth_token: null  # Optional auth for security
```
### Phase 5.5: MCP Server (Week 5)
#### 5.5.1 MCP Protocol Implementation
**File**: `src/mcp_server.py`
**Features**:
- **MCP Server Class**: `LocalSwarmMCPServer` implementing MCP protocol
- **Stdio Transport**: Communication via standard input/output
- **Tool Registration**: 5 MCP tools for AI assistants:
  1. **`get_hardware_info`** - Query system capabilities
     - OS, CPU cores, RAM
     - GPU name, VRAM, type
     - Available memory for LLMs
  2. **`get_swarm_status`** - Check swarm health
     - Running/stopped status
     - Model name
     - Healthy workers count
     - Total memory usage
  3. **`generate_code`** - Generate with consensus
     - Input: prompt, max_tokens, temperature
     - Returns: generated code with metadata
     - Shows: strategy used, confidence, latency
  4. **`list_available_models`** - Browse models
     - All available models
     - Variants per model
     - Quantization options
  5. **`get_worker_details`** - Worker statistics
     - Per-worker backend info
     - Health status
     - Request count
     - Latency and throughput
**Implementation**:
```python
class LocalSwarmMCPServer:
    def __init__(self, swarm_manager):
        self.server = Server("local-swarm")
        self.register_tools()

    def register_tools(self):
        @self.server.list_tools()
        async def list_tools() -> List[Tool]:
            # Define tool schemas

        @self.server.call_tool()
        async def call_tool(name, arguments):
            # Handle tool calls
```
#### 5.5.2 Dual Server Mode
- Run HTTP API and MCP server simultaneously
- Shared SwarmManager instance
- HTTP API for external clients
- MCP for AI assistant integration
**Usage**:
```bash
# HTTP API only
python main.py
# HTTP API + MCP server
python main.py --mcp
```
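One way the dual mode could be wired up, assuming the MCP server exposes an async `run()` entry point over stdio (that method name is an assumption, not a confirmed API):
```python
import asyncio
import uvicorn

async def run_dual(app, mcp_server) -> None:
    """Serve the OpenAI-compatible HTTP API and the MCP stdio server in one event loop."""
    http = uvicorn.Server(uvicorn.Config(app, host="127.0.0.1", port=8000))
    await asyncio.gather(
        http.serve(),      # FastAPI app for opencode and other HTTP clients
        mcp_server.run(),  # LocalSwarmMCPServer over stdio (assumed entry point)
    )
```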
#### 5.5.3 Installation Scripts
**Windows** (`scripts/install.bat`):
```batch
@echo off
echo Installing Local Swarm...
python -m pip install --upgrade pip
pip install -r requirements.txt

:: Check for CUDA
nvidia-smi >nul 2>&1
if %errorlevel% == 0 (
    echo CUDA detected, installing GPU support...
    pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
) else (
    echo No CUDA detected, using CPU backend...
    pip install llama-cpp-python
)

echo Installation complete!
echo Run: python -m local_swarm
```
**macOS/Linux** (`scripts/install.sh`):
```bash
#!/bin/bash
set -e

echo "Installing Local Swarm..."
pip install --upgrade pip

# Detect platform
if [[ "$OSTYPE" == "darwin"* ]]; then
    echo "macOS detected..."
    pip install -r requirements.txt
    pip install -r requirements-macos.txt
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
    echo "Linux detected..."
    pip install -r requirements.txt
    if command -v nvidia-smi &> /dev/null; then
        echo "CUDA detected, installing GPU support..."
        pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
    else
        pip install llama-cpp-python
    fi
fi

echo "Installation complete!"
echo "Run: python -m local_swarm"
```
### Phase 6: Local Network Federation (Week 6)
#### 6.1 Overview
Enable multiple machines on the same local network to form a "federated swarm". Each machine runs its own optimized swarm, but they coordinate to vote on the best responses across the entire network.
**Example scenario**:
- Windows PC (RTX 4060 Ti): 4 instances of Qwen 7B Q4
- Mac Mini (M1): 2 instances of Qwen 7B Q4
- MacBook (M4): 3 instances of Qwen 7B Q4
- Total: 9 instances voting on every request
#### 6.2 Architecture
**Federation modes**:
1. **Independent Swarms with Cross-Voting** (Recommended):
   - Each machine runs its own swarm locally
   - When a request comes in, each swarm generates responses internally
   - All swarms exchange their "best local" responses
   - Final vote across all best responses
   - Pros: Simple, resilient, uses each machine's optimal config
   - Cons: Slightly higher latency
2. **Distributed Workers** (Advanced):
   - Single coordinator manages workers across all machines
   - Workers distributed based on capability
   - Pros: Optimal load balancing
   - Cons: Complex failure handling, network overhead
**Implementation**: Start with Mode 1 (Independent Swarms with Cross-Voting)
#### 6.3 Discovery Protocol
**File**: `src/network/discovery.py`
**mDNS/Bonjour-based discovery**:
```python
import socket

from zeroconf import ServiceBrowser, ServiceInfo, Zeroconf


class SwarmDiscovery:
    """Discovers other Local Swarm instances on the local network."""

    def __init__(self, port: int = 8765):
        self.port = port
        self.peers: Dict[str, PeerInfo] = {}
        self.zeroconf = Zeroconf()

    def start_advertising(self, swarm_info: SwarmInfo):
        """Advertise this swarm on the network."""
        service_info = ServiceInfo(
            "_local-swarm._tcp.local.",
            f"{socket.gethostname()}._local-swarm._tcp.local.",
            addresses=[self.get_ip()],
            port=self.port,
            properties={
                b"version": b"1.0",
                b"instances": str(swarm_info.instances).encode(),
                b"model": swarm_info.model_id.encode(),
                b"hardware": swarm_info.hardware_summary.encode(),
            }
        )
        self.zeroconf.register_service(service_info)

    def discover_peers(self) -> List[PeerInfo]:
        """Discover other swarms on the network."""
        browser = ServiceBrowser(
            self.zeroconf,
            "_local-swarm._tcp.local.",
            handlers=[self._on_service_state_change]
        )
```
**Alternative**: Simple UDP broadcast for environments without mDNS
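A minimal sketch of that fallback, broadcasting a JSON announcement on the discovery port (the payload fields are illustrative):
```python
import json
import socket

def broadcast_presence(swarm_info: dict, port: int = 8765) -> None:
    """Announce this swarm to the local subnet when mDNS is unavailable."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    try:
        sock.sendto(json.dumps(swarm_info).encode(), ("255.255.255.255", port))
    finally:
        sock.close()
```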
#### 6.4 Federation Protocol
**File**: `src/network/federation.py`
**HTTP-based communication**:
```python
import aiohttp


class FederationClient:
    """Client for communicating with peer swarms."""

    async def request_vote(self, peer: PeerInfo, request: GenerationRequest) -> Vote:
        """Request a vote from a peer swarm."""
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"http://{peer.host}:{peer.port}/v1/federation/vote",
                json={
                    "prompt": request.prompt,
                    "context": request.context,
                    "request_id": request.id
                },
                timeout=aiohttp.ClientTimeout(total=30)
            ) as resp:
                return Vote(await resp.json())

    async def cast_vote(self, responses: List[GenerationResponse]) -> Vote:
        """Cast this swarm's vote on a set of responses."""
        # Use consensus engine to pick best local response
        best = await self.consensus.select_best(responses)
        return Vote(best, confidence=self._calculate_confidence(responses))
```
**Endpoints**:
- `POST /v1/federation/vote` - Request a vote from this swarm
- `GET /v1/federation/health` - Check peer health
- `GET /v1/federation/info` - Get swarm capabilities
#### 6.5 Cross-Swarm Consensus
**File**: `src/swarm/cross_consensus.py`
**Two-phase voting**:
1. **Local Consensus** (Phase 1):
   - Each swarm generates responses from all local instances
   - Runs local consensus to pick "best local" response
   - Returns to coordinator
2. **Global Consensus** (Phase 2):
   - Coordinator collects all "best local" responses
   - Weights by swarm confidence (based on local agreement)
   - Returns highest-weighted response
```python
class CrossSwarmConsensus:
    """Consensus across multiple networked swarms."""

    async def generate_with_federation(
        self,
        request: GenerationRequest,
        timeout: float = 30.0
    ) -> GenerationResponse:
        # Phase 1: Local generation
        local_responses = await self.local_swarm.generate_all(request)
        local_best = await self.local_consensus.select_best(local_responses)
        local_confidence = self._calculate_local_confidence(local_responses)

        # Phase 2: Request peer votes
        peer_votes = []
        for peer in self.discovery.get_peers():
            try:
                vote = await asyncio.wait_for(
                    self.federation.request_vote(peer, request),
                    timeout=5.0
                )
                peer_votes.append(vote)
            except asyncio.TimeoutError:
                logger.warning(f"Peer {peer.host} timed out")

        # Phase 3: Global consensus
        all_votes = [Vote(local_best, local_confidence)] + peer_votes
        best_vote = self._weighted_vote(all_votes)
        return best_vote.response

    def _calculate_local_confidence(self, responses: List[GenerationResponse]) -> float:
        """Calculate confidence based on local agreement."""
        # High agreement = high confidence
        # Use embedding similarity
        similarities = self._compute_pairwise_similarity(responses)
        return np.mean(similarities)

    def _weighted_vote(self, votes: List[Vote]) -> Vote:
        """Select best vote weighted by confidence."""
        # Could use weighted random selection or just pick highest
        return max(votes, key=lambda v: v.confidence)
```
#### 6.6 Configuration
```yaml
federation:
  enabled: true
  mode: "independent"  # independent, distributed
  discovery:
    method: "mdns"     # mdns, broadcast, static
    port: 8765
    advertise_interval: 30
  communication:
    port: 8766
    timeout: 30
  auth:
    enabled: false
    token: null
  consensus:
    strategy: "weighted"  # weighted, best_of_n, latency
    min_peers: 1          # Minimum peers for federation (0 = solo mode OK)
    max_peers: 10
```
### Phase 7: Extended GPU Support (Week 7)
#### 7.1 AMD GPU Support (ROCm)
**File**: `src/hardware/amd.py`
**Detection**:
```python
def detect_amd_gpu() -> Optional[GPUInfo]:
    """Detect AMD GPU using ROCm."""
    try:
        # Try rocm-smi
        import subprocess
        result = subprocess.run(
            ["rocm-smi", "--showmeminfo", "vram"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            # Parse VRAM info
            vram_gb = parse_rocm_output(result.stdout)
            return GPUInfo(
                name="AMD GPU",
                vram_gb=vram_gb,
                is_amd=True
            )
    except FileNotFoundError:
        pass

    # Fallback to checking for AMD in PCI
    return detect_amd_via_pci()
```
**Backend**: llama.cpp with ROCm support
```bash
# Build llama.cpp with ROCm
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
```
**Platforms**: Linux (primary), Windows (experimental)
#### 7.2 Intel GPU Support (OpenCL/OneAPI)
**File**: `src/hardware/intel.py`
**Detection**:
```python
def detect_intel_gpu() -> Optional[GPUInfo]:
    """Detect Intel GPU using OneAPI or OpenCL."""
    # Try Intel GPU driver
    try:
        import subprocess
        result = subprocess.run(
            ["sycl-ls"],
            capture_output=True, text=True
        )
        if "Intel" in result.stdout:
            vram_gb = parse_sycl_output(result.stdout)
            return GPUInfo(
                name="Intel GPU",
                vram_gb=vram_gb,
                driver_version=get_intel_driver_version()
            )
    except FileNotFoundError:
        pass

    # Fallback to OpenCL
    return detect_intel_via_opencl()
```
**Backend**: llama.cpp with SYCL support
```bash
# Build llama.cpp with SYCL
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" \
pip install llama-cpp-python
```
**Platforms**: Linux, Windows
#### 7.3 Qualcomm GPU Support (Android/Termux)
**File**: `src/hardware/qualcomm.py`
**Detection**:
```python
def detect_qualcomm_gpu() -> Optional[GPUInfo]:
    """Detect Qualcomm Adreno GPU on Android/Termux."""
    if not is_termux():
        return None
    try:
        # Parse /proc/cpuinfo and /proc/meminfo
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
        # Check for Qualcomm SoC
        if "Qualcomm" in cpuinfo or "Snapdragon" in cpuinfo:
            # Get RAM info (no separate VRAM on mobile)
            mem = psutil.virtual_memory()
            total_gb = mem.total / (1024**3)
            # Use 25% of RAM for LLM (very conservative on mobile)
            return GPUInfo(
                name="Qualcomm Adreno",
                vram_gb=total_gb * 0.25,
                is_mobile=True
            )
    except Exception:
        pass
    return None


def is_termux() -> bool:
    """Check if running in Termux environment."""
    return (
        os.environ.get("TERMUX_VERSION") is not None or
        os.path.exists("/data/data/com.termux/files/usr")
    )
```
**Backend Options**:
1. **llama.cpp on Termux**:
   ```bash
   # In Termux
   pkg install cmake clang
   pip install llama-cpp-python
   ```
2. **QNN (Qualcomm Neural Network)** - Advanced:
   - Use Qualcomm's SDK for optimized inference
   - Better performance but complex setup
**Limitations**:
- Models must be small (1-3B parameters)
- Quantization essential (Q4 or lower)
- Limited context window (2048 tokens)
- Slower than desktop GPUs
**Platforms**: Android (via Termux)
#### 7.4 Hardware Detection Updates
Update `src/hardware/detector.py` to support new GPUs:
```python
def detect_gpu() -> Optional[GPUInfo]:
    """Detect GPU based on platform."""
    os_name = detect_os()
    if os_name == "darwin":
        return detect_apple_gpu()

    # Priority: NVIDIA > AMD > Intel > Qualcomm
    gpu = detect_nvidia_gpu()
    if gpu:
        return gpu
    gpu = detect_amd_gpu()
    if gpu:
        return gpu
    gpu = detect_intel_gpu()
    if gpu:
        return gpu
    gpu = detect_qualcomm_gpu()
    if gpu:
        return gpu
    return None
```
### Phase 8: Testing & Polish (Week 8)
#### 8.1 Test Coverage
**Unit tests**:
- Hardware detection mocking (all GPU vendors)
- Model selection logic
- Consensus algorithm (local + cross-swarm)
- API endpoint validation
- Network discovery protocol
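For example, the GPU-vendor mocking could follow this pattern (module and class locations are assumptions based on the architecture section above):
```python
from unittest.mock import patch

from src.hardware import detector  # assumed module layout


def test_detect_gpu_prefers_nvidia():
    fake = detector.GPUInfo(name="GeForce RTX 4060", vram_gb=8.0)  # assumed constructor
    with patch.object(detector, "detect_os", return_value="linux"), \
         patch.object(detector, "detect_nvidia_gpu", return_value=fake):
        gpu = detector.detect_gpu()
    assert gpu is not None and gpu.vram_gb == 8.0
```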
**Integration tests**:
- End-to-end inference
- Multi-worker coordination
- Cross-swarm voting
- Error handling
- Network partition scenarios
**Platform tests**:
- Windows with NVIDIA/AMD/Intel
- macOS with M1/M2/M3/M4
- Linux with NVIDIA/AMD/Intel
- CPU-only fallback
- Android (Termux) with Qualcomm
#### 8.2 Performance Optimization
- **Model warmup**: Pre-load models on startup
- **Request batching**: Group similar requests
- **Worker pooling**: Reuse workers instead of respawning
- **Memory monitoring**: Auto-shutdown if OOM
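As one possible shape for the memory guard, a periodic check with `psutil` could stop accepting work before the OS OOM-killer intervenes (the threshold is illustrative):
```python
import psutil

def memory_pressure_ok(threshold: float = 0.95) -> bool:
    """Return False when system memory usage crosses the threshold and workers should wind down."""
    return psutil.virtual_memory().percent / 100.0 < threshold
```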
#### 8.3 Documentation ✅ COMPLETED
**Created docs/GUIDE.md with:**
- Quick Start Guide (all platforms)
- Opencode Configuration Examples:
- Basic setup
- Remote machine configuration
- Multiple model options
- Environment-specific configs
- API Reference (OpenAI-compatible endpoints)
- Troubleshooting Guide (common issues, platform-specific)
- Performance Tuning (speed vs quality, memory usage)
- Advanced Configuration (config.yaml, env vars)
- MCP Server setup
- Network Federation guide
**Updated README.md:**
- Added Documentation section with links
- Referenced complete guide
## Technical Decisions
### Why llama.cpp?
- Best cross-platform support
- Mature quantization formats (GGUF)
- Active community
- Good performance/quality tradeoff
### Why MLX for macOS?
- Native Apple Silicon optimization
- Simpler than llama.cpp on macOS
- Better unified memory handling
### Why consensus voting?
- Improves response quality vs single model
- Uses available hardware efficiently
- Can detect model hallucinations
### Memory Model
**External GPU (NVIDIA/AMD)**:
- Use up to 100% of VRAM
- Keep a ~10% buffer for OS/drivers
- Each instance gets equal share
**Apple Silicon**:
- Use 50% of unified RAM
- Avoid system swap
- Monitor memory pressure
**CPU-only**:
- Use 50% of system RAM
- Dependent on available memory
- Slower but functional
## Future Enhancements
1. **Dynamic scaling**: Add/remove workers based on load
2. **Model mixing**: Different models in same swarm
3. **Fine-tuning**: Local fine-tuning on user data
4. **Web UI**: Browser-based configuration
5. **Docker support**: Containerized deployment
6. **Cloud inference**: Fallback to cloud APIs
7. **WebGPU support**: Browser-based inference
8. **Persistent knowledge**: RAG with local vector DB
## Success Metrics
- **Startup time**: < 30 seconds from cold start
- **First inference**: < 10 seconds after startup
- **Concurrent requests**: Support 2-8 parallel inferences per machine
- **Consensus accuracy**: > 80% agreement on code tasks
- **Memory efficiency**: Use > 80% of available memory
- **Cross-platform**: Works on Windows/macOS/Linux without code changes
- **GPU support**: NVIDIA, AMD, Intel, Apple Silicon, Qualcomm
- **Network federation**: Auto-discovery within 10 seconds
- **Federated consensus**: Scale to 5+ machines (25+ instances)
- **Mobile support**: Functional on Android/Termux (3B models)