Local Swarm - Detailed Implementation Plan

Overview

A terminal-based tool that automatically configures and runs a swarm of small coding LLMs optimized for your hardware, exposing an OpenAI-compatible API for integration with opencode and other tools.

Architecture

local_swarm/
├── src/
│   ├── __init__.py
│   ├── hardware/
│   │   ├── __init__.py
│   │   ├── detector.py       # Platform-agnostic hardware detection
│   │   ├── nvidia.py         # NVIDIA GPU detection (Windows/Linux)
│   │   ├── amd.py            # AMD GPU detection (ROCm)
│   │   ├── intel.py          # Intel GPU detection (OneAPI/OpenCL)
│   │   ├── qualcomm.py       # Qualcomm/Adreno detection (Android)
│   │   ├── apple_silicon.py  # Apple Silicon detection (macOS)
│   │   └── memory.py         # RAM detection
│   ├── models/
│   │   ├── __init__.py
│   │   ├── registry.py       # Model database with specs
│   │   ├── selector.py       # Optimal model/quant selection logic
│   │   └── downloader.py     # Download manager (HuggingFace)
│   ├── backends/
│   │   ├── __init__.py
│   │   ├── base.py           # Backend interface
│   │   ├── llamacpp.py       # llama.cpp backend (CUDA/ROCm/SYCL)
│   │   └── mlx.py            # MLX backend (macOS)
│   ├── swarm/
│   │   ├── __init__.py
│   │   ├── manager.py        # Instance lifecycle management
│   │   ├── worker.py         # Individual LLM instance wrapper
│   │   ├── consensus.py      # Local voting/consensus algorithm
│   │   └── cross_consensus.py # Cross-swarm consensus
│   ├── network/
│   │   ├── __init__.py
│   │   ├── discovery.py      # mDNS/Bonjour peer discovery
│   │   ├── federation.py     # Inter-swarm communication
│   │   └── protocol.py       # Network protocol definitions
│   └── api/
│       ├── __init__.py
│       ├── server.py         # FastAPI/uvicorn server
│       ├── routes.py         # OpenAI-compatible endpoints
│       ├── federation.py     # Federation endpoints
│       └── middleware.py     # Request handling
├── tests/
├── config/
│   └── models.yaml           # Model configurations
├── scripts/
│   ├── install.bat           # Windows installer
│   ├── install.sh            # Unix installer
│   └── install-termux.sh     # Android/Termux installer
├── main.py                   # CLI entry point
├── requirements.txt
├── requirements-macos.txt    # MLX-specific deps
├── requirements-termux.txt   # Android/Termux deps
├── setup.py
└── .gitignore

Network Federation Architecture

When multiple machines run Local Swarm on the same network, they can form a "federated swarm":

┌─────────────────────────────────────────────────────────────┐
│                    Local Network                             │
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Windows PC  │    │   Mac Mini   │    │   MacBook    │  │
│  │  (RTX 4060)  │    │    (M1)      │    │    (M4)      │  │
│  │              │    │              │    │              │  │
│  │ ┌──────────┐ │    │ ┌──────────┐ │    │ ┌──────────┐ │  │
│  │ │ Swarm 1  │ │    │ │ Swarm 2  │ │    │ │ Swarm 3  │ │  │
│  │ │ 4 inst.  │ │    │ │ 2 inst.  │ │    │ │ 3 inst.  │ │  │
│  │ │ Qwen 7B  │ │    │ │ Qwen 7B  │ │    │ │ Qwen 7B  │ │  │
│  │ └────┬─────┘ │    │ └────┬─────┘ │    │ └────┬─────┘ │  │
│  │      │       │    │      │       │    │      │       │  │
│  │ mDNS │       │    │ mDNS │       │    │ mDNS │       │  │
│  │      │       │    │      │       │    │      │       │  │
│  └──────┼───────┘    └──────┼───────┘    └──────┼───────┘  │
│         │                   │                   │           │
│         └───────────────────┼───────────────────┘           │
│                             │                               │
│                    ┌────────┴────────┐                      │
│                    │  Federation     │                      │
│                    │  Coordinator    │                      │
│                    │                 │                      │
│                    │  ┌───────────┐  │                      │
│                    │  │ Consensus │  │                      │
│                    │  │  Engine   │  │                      │
│                    │  └───────────┘  │                      │
│                    └────────┬────────┘                      │
│                             │                               │
│                    ┌────────▼────────┐                      │
│                    │   opencode      │                      │
│                    │   (Client)      │                      │
│                    └─────────────────┘                      │
└─────────────────────────────────────────────────────────────┘

Federation Flow:
1. Each swarm independently detects hardware and starts instances
2. Swarms advertise themselves via mDNS/Bonjour
3. When request comes in, each swarm generates local responses
4. Local consensus picks best response per swarm
5. Cross-swarm consensus votes across all best responses
6. Final answer returned to opencode

Benefits:

  • Utilize all hardware in your home/office
  • Each machine optimizes for its own specs
  • No single point of failure
  • Automatic load distribution
  • Works even if one machine goes offline

Implementation Phases

Phase 1: Foundation (Week 1)

1.1 Hardware Detection Module

File: src/hardware/detector.py

Requirements:

  • Cross-platform OS detection (Windows, macOS, Linux)
  • CPU info (cores, architecture)
  • RAM detection (total, available)
  • GPU detection with VRAM

Platform-specific implementations:

  • Windows: Use pynvml for NVIDIA, fallback to DirectX for others
  • macOS: Use psutil for RAM, sysctl for CPU, Metal API for GPU
  • Linux: Use pynvml for NVIDIA, rocm-smi for AMD

Output structure:

from dataclasses import dataclass
from typing import Optional

@dataclass
class HardwareProfile:
    os: str  # 'windows', 'darwin', 'linux'
    cpu_cores: int
    ram_gb: float
    gpu: Optional["GPUInfo"]
    is_apple_silicon: bool

Model selection rules:

  • External GPU (NVIDIA/AMD): Use 100% of VRAM
  • Apple Silicon: Use 50% of unified RAM
  • CPU-only: Use 50% of system RAM
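
A minimal sketch of these rules (field names follow the HardwareProfile structure above; the helper itself is illustrative, not final API):

def get_available_memory(hardware: HardwareProfile) -> float:
    """Return the memory budget (GB) the swarm may use, per the rules above."""
    if hardware.gpu is not None and not hardware.is_apple_silicon:
        # Dedicated GPU: the swarm may use all of its VRAM
        return hardware.gpu.vram_gb
    # Apple Silicon unified memory and CPU-only both use half of system RAM
    return hardware.ram_gb * 0.5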

1.2 Model Registry

File: src/models/registry.py

Model database (YAML format):

models:
  qwen2.5-coder:
    name: "Qwen 2.5 Coder"
    description: "Alibaba's code-focused model"
    variants:
      - size: 3b
        base_vram_gb: 6.0  # Approximate VRAM for fp16
        quantizations:
          q4_k_m:
            vram_gb: 1.8
            quality: "good"
          q5_k_m:
            vram_gb: 2.2
            quality: "better"
          q6_k:
            vram_gb: 2.6
            quality: "best"
      - size: 7b
        base_vram_gb: 14.0
        quantizations:
          q4_k_m:
            vram_gb: 4.5
          q5_k_m:
            vram_gb: 5.2
          q6_k:
            vram_gb: 6.0
    
  codellama:
    name: "CodeLlama"
    # Similar structure...
    
  deepseek-coder:
    name: "DeepSeek Coder"
    # Similar structure...

Selection priority:

  1. Qwen 2.5 Coder (best for small sizes)
  2. DeepSeek Coder (good alternative)
  3. CodeLlama (fallback)
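
A sketch of how registry.py might load this YAML into typed objects (the dataclass names here are illustrative placeholders, not settled API):

from dataclasses import dataclass
from typing import Dict, List
import yaml

@dataclass
class Quantization:
    name: str
    vram_gb: float
    quality: str = ""

@dataclass
class ModelVariant:
    size: str
    base_vram_gb: float
    quantizations: List[Quantization]

@dataclass
class ModelEntry:
    model_id: str
    name: str
    variants: List[ModelVariant]

def load_registry(path: str = "config/models.yaml") -> Dict[str, ModelEntry]:
    """Parse models.yaml into ModelEntry objects keyed by model id."""
    with open(path) as f:
        raw = yaml.safe_load(f)
    registry = {}
    for model_id, spec in raw["models"].items():
        variants = []
        for v in spec.get("variants", []):
            quants = [
                Quantization(name=qname, **qspec)
                for qname, qspec in v.get("quantizations", {}).items()
            ]
            variants.append(ModelVariant(v["size"], v["base_vram_gb"], quants))
        registry[model_id] = ModelEntry(model_id, spec["name"], variants)
    return registry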

1.3 Model Selector Logic

File: src/models/selector.py

Algorithm:

def select_optimal_model(hardware: HardwareProfile) -> ModelConfig:
    available_memory = get_available_memory(hardware)
    
    # Try models in priority order
    for model in PRIORITY_MODELS:
        # Find largest size that fits
        for variant in reversed(model.variants):
            # Try highest quantization that fits
            for quant in reversed(variant.quantizations):
                total_vram_needed = quant.vram_gb * MIN_INSTANCES
                if total_vram_needed <= available_memory:
                    # Calculate max instances
                    max_instances = int(available_memory // quant.vram_gb)
                    # Cap at reasonable limit (e.g., 8)
                    instances = min(max_instances, 8)
                    return ModelConfig(model, variant, quant, instances)
    
    # Fallback to smallest model
    return FALLBACK_CONFIG

Minimum instances: 2 (for consensus voting)
Maximum instances: 8 (to avoid overhead)

Phase 2: Backend Integration (Week 2)

2.1 Base Backend Interface

File: src/backends/base.py

from abc import ABC, abstractmethod
from typing import AsyncIterator

class LLMBackend(ABC):
    @abstractmethod
    async def load_model(self, model_path: str, config: dict) -> bool:
        pass
    
    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        pass
    
    @abstractmethod
    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        pass
    
    @abstractmethod
    def get_memory_usage(self) -> float:
        """Return current VRAM/RAM usage in GB"""
        pass
    
    @abstractmethod
    def shutdown(self):
        pass

2.2 llama.cpp Backend

File: src/backends/llamacpp.py

Implementation:

  • Use llama-cpp-python library
  • Support GGUF model format
  • GPU acceleration via CUDA/Metal
  • Server mode with HTTP API

Key features:

  • Model caching to avoid reload
  • Context window management
  • Batch processing support

Memory calculation:

import os

def calculate_memory_usage(model_path: str) -> float:
    # Rough estimate: GGUF file size plus ~20% overhead for KV cache and buffers
    size_gb = os.path.getsize(model_path) / (1024 ** 3)
    return size_gb * 1.2
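
A minimal LlamaCppBackend sketch against llama-cpp-python; the constructor parameters and the generation-kwarg mapping are assumptions to verify against the installed version:

from typing import AsyncIterator
import asyncio
from llama_cpp import Llama

class LlamaCppBackend(LLMBackend):
    def __init__(self):
        self.llm = None
        self.model_path = None

    async def load_model(self, model_path: str, config: dict) -> bool:
        self.model_path = model_path
        # n_gpu_layers=-1 offloads all layers to the GPU; 0 falls back to CPU
        self.llm = Llama(
            model_path=model_path,
            n_gpu_layers=config.get("n_gpu_layers", -1),
            n_ctx=config.get("n_ctx", 4096),
            verbose=False,
        )
        return True

    async def generate(self, prompt: str, **kwargs) -> str:
        # llama-cpp-python is synchronous; run it off the event loop
        def _run():
            out = self.llm(prompt,
                           max_tokens=kwargs.get("max_tokens", 512),
                           temperature=kwargs.get("temperature", 0.2))
            return out["choices"][0]["text"]
        return await asyncio.to_thread(_run)

    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        # Blocking iteration is acceptable for a sketch; production code would offload it
        for chunk in self.llm(prompt, max_tokens=kwargs.get("max_tokens", 512), stream=True):
            yield chunk["choices"][0]["text"]

    def get_memory_usage(self) -> float:
        return calculate_memory_usage(self.model_path) if self.model_path else 0.0

    def shutdown(self):
        self.llm = None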

2.3 MLX Backend (macOS)

File: src/backends/mlx.py

Implementation:

  • Use mlx-lm library
  • Support MLX format models
  • Optimized for Apple Silicon

Key differences from llama.cpp:

  • Native Metal performance
  • Simpler API
  • Unified memory model
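
A corresponding MLX sketch using mlx-lm; the load/generate signatures are assumptions to verify against the installed mlx-lm release:

import asyncio
from typing import AsyncIterator
from mlx_lm import load, generate

class MLXBackend(LLMBackend):
    def __init__(self):
        self.model = None
        self.tokenizer = None

    async def load_model(self, model_path: str, config: dict) -> bool:
        # model_path may be a local directory or a Hugging Face repo id
        self.model, self.tokenizer = load(model_path)
        return True

    async def generate(self, prompt: str, **kwargs) -> str:
        def _run():
            return generate(self.model, self.tokenizer, prompt=prompt,
                            max_tokens=kwargs.get("max_tokens", 512))
        return await asyncio.to_thread(_run)

    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        # Simplest fallback: emit the whole completion as one chunk;
        # real streaming would use mlx-lm's streaming generation API
        yield await self.generate(prompt, **kwargs)

    def get_memory_usage(self) -> float:
        # Unified memory: approximate with the process resident set size
        import psutil
        return psutil.Process().memory_info().rss / (1024 ** 3)

    def shutdown(self):
        self.model = None
        self.tokenizer = None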

Phase 3: Swarm Management (Week 3)

3.1 Worker Instance

File: src/swarm/worker.py

Each worker manages:

  • One LLM instance
  • Request queue
  • Health monitoring
  • Metrics collection

class SwarmWorker:
    def __init__(self, worker_id: int, backend: LLMBackend, config: dict):
        self.worker_id = worker_id
        self.backend = backend
        self.config = config
        self.is_healthy = True
        self.request_count = 0
        self.avg_latency = 0.0
    
    async def process(self, request: GenerationRequest) -> GenerationResponse:
        start = time.time()
        response = await self.backend.generate(**request.params)
        latency = time.time() - start
        self._update_metrics(latency)
        return GenerationResponse(response, latency, self.worker_id)

3.2 Swarm Manager

File: src/swarm/manager.py

Responsibilities:

  • Spawn N workers based on hardware
  • Distribute requests to all workers
  • Collect responses
  • Handle worker failures

class SwarmManager:
    def __init__(self, config: ModelConfig):
        self.workers: List[SwarmWorker] = []
        self.config = config
    
    async def initialize(self):
        # Download model if needed
        model_path = await self._ensure_model()
        
        # Spawn workers
        for i in range(self.config.instances):
            backend = self._create_backend()
            await backend.load_model(model_path, self.config.backend_params)
            worker = SwarmWorker(i, backend, self.config)
            self.workers.append(worker)
    
    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        # Build one request and send it to all workers in parallel
        request = GenerationRequest(prompt=prompt, params=kwargs)
        tasks = [w.process(request) for w in self.workers]
        return await asyncio.gather(*tasks)

3.3 Consensus Algorithm

File: src/swarm/consensus.py

Voting strategies:

  1. Similarity voting (default):

    • Embed all responses
    • Group by semantic similarity
    • Return largest group
  2. Quality scoring:

    • Score each response on:
      • Completeness (does it answer the question?)
      • Code quality (syntax, structure)
      • Length appropriateness
    • Return highest score
  3. Latency-weighted:

    • Prefer faster responses (lower memory pressure)

Implementation:

class ConsensusEngine:
    def __init__(self, strategy: str = "similarity"):
        self.strategy = strategy
        self.embedding_model = None  # Lazy load
    
    async def select_best(self, responses: List[GenerationResponse]) -> str:
        if len(responses) == 1:
            return responses[0].text
        
        if self.strategy == "similarity":
            return await self._similarity_vote(responses)
        elif self.strategy == "quality":
            return await self._quality_score(responses)
        else:
            return self._fastest_response(responses)
    
    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        # Use sentence-transformers for embeddings
        # Group by cosine similarity > 0.85
        # Return median response from largest group
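
A sketch of the similarity vote, assuming sentence-transformers is available (the model name and the 0.85 threshold come from the notes above; the greedy grouping is illustrative):

import numpy as np
from sentence_transformers import SentenceTransformer

def similarity_vote(texts: list[str], threshold: float = 0.85) -> str:
    """Group near-identical responses and return one from the largest group."""
    model = SentenceTransformer("all-MiniLM-L6-v2")
    emb = model.encode(texts, normalize_embeddings=True)
    sims = emb @ emb.T  # cosine similarity, since embeddings are normalized
    groups: list[list[int]] = []
    for i in range(len(texts)):
        for group in groups:
            if sims[i, group[0]] >= threshold:
                group.append(i)
                break
        else:
            groups.append([i])
    largest = max(groups, key=len)
    # Return the member closest to the group's centroid (the "median" response)
    centroid = emb[largest].mean(axis=0)
    best = max(largest, key=lambda i: float(emb[i] @ centroid))
    return texts[best]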

Phase 4: API Server (Week 4)

4.1 OpenAI-Compatible Endpoints

File: src/api/routes.py

Required endpoints:

  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completion
  • POST /v1/completions - Text completion (optional)
  • GET /health - Health check
  • GET /metrics - Prometheus metrics (optional)

Chat completions endpoint:

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    # Extract messages
    messages = request.messages
    prompt = format_messages(messages)
    
    # Get all responses from swarm
    responses = await swarm_manager.generate_all(prompt, **request.params)
    
    # Run consensus
    best_response = await consensus_engine.select_best(responses)
    
    # Format as OpenAI response
    return {
        "id": f"chatcmpl-{uuid4()}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": best_response},
            "finish_reason": "stop"
        }],
        "usage": calculate_usage(prompt, best_response)
    }

4.2 Streaming Support

File: src/api/routes.py

For streaming, use the fastest worker instead of consensus:

if request.stream:
    # Pick worker with lowest latency
    worker = swarm_manager.get_fastest_worker()
    return StreamingResponse(
        worker.stream_generate(prompt),
        media_type="text/event-stream"
    )
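
opencode and other OpenAI clients expect SSE chunks in the chat.completion.chunk format terminated by a [DONE] sentinel; a sketch of the adapter (the worker's stream_generate API is assumed from the snippet above):

import json
import time
from uuid import uuid4

async def sse_stream(worker, prompt: str, model: str, **kwargs):
    """Wrap a worker's token stream as OpenAI-style server-sent events."""
    chunk_id = f"chatcmpl-{uuid4()}"
    async for token in worker.stream_generate(prompt, **kwargs):
        chunk = {
            "id": chunk_id,
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model,
            "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    # Final chunk carries finish_reason, then the sentinel the OpenAI SDK expects
    final = {
        "id": chunk_id,
        "object": "chat.completion.chunk",
        "created": int(time.time()),
        "model": model,
        "choices": [{"index": 0, "delta": {}, "finish_reason": "stop"}],
    }
    yield f"data: {json.dumps(final)}\n\n"
    yield "data: [DONE]\n\n"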

Phase 5: CLI & Interactive Interface (Week 5)

5.1 Interactive Menu System

File: src/interactive.py

Features:

  • Hardware Display: Detailed hardware info with formatting

    • OS, CPU cores, RAM (total/available)
    • GPU details (name, VRAM, driver, type)
    • Memory allocation rules
  • Model Selection Menu: Three configuration options

    1. Recommended Configuration: Auto-detects optimal model + instances
    2. Browse All Configurations: Lists all feasible models for hardware
    3. Custom Configuration: Step-by-step wizard
      • Select model family (Qwen, DeepSeek, CodeLlama)
      • Choose model size (3B, 7B, 14B)
      • Pick quantization (Q4_K_M, Q5_K_M, Q6_K)
      • Specify instance count (1 to max supported)
  • Resource Usage Monitor: Real-time swarm status

    • Swarm status (running/stopped)
    • Current model name
    • Healthy workers count
    • Memory usage (total and per-worker)
    • Worker statistics:
      • Total requests served
      • Average latency
      • Tokens per second
  • Startup Summary: Comprehensive display showing:

    • Hardware detection section
    • Model configuration section
    • Resource usage section
    • Memory utilization percentage

Implementation:

def interactive_model_selection(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Show hardware info
    # Display menu with 3 options
    # Return selected configuration

def custom_configuration(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Step-by-step wizard
    # Select model -> size -> quantization -> instances
    # Validate memory constraints

def show_startup_summary(hardware, config, swarm_manager=None):
    # Clear screen
    # Print formatted hardware, config, and usage info

5.2 CLI Interface

File: main.py

Commands:

# Start the swarm (auto-detect hardware)
python -m local_swarm

# Start with specific model
python -m local_swarm --model qwen2.5-coder:3b:q4

# Start with specific port
python -m local_swarm --port 8080

# Override instance count
python -m local_swarm --instances 4

# Show hardware detection
python -m local_swarm --detect

# Download models only
python -m local_swarm --download-only
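
A sketch of the corresponding argument parsing in main.py (flag names taken from the commands above; defaults are illustrative):

import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        prog="local_swarm",
        description="Run a local LLM swarm with an OpenAI-compatible API")
    parser.add_argument("--model", help="Model spec, e.g. qwen2.5-coder:3b:q4")
    parser.add_argument("--port", type=int, default=8000, help="API server port")
    parser.add_argument("--instances", type=int, help="Override auto-detected instance count")
    parser.add_argument("--detect", action="store_true", help="Show hardware detection and exit")
    parser.add_argument("--download-only", action="store_true", help="Download models and exit")
    parser.add_argument("--mcp", action="store_true", help="Also start the MCP server (Phase 5.5)")
    return parser.parse_args()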

5.3 Configuration File

File: config.yaml

server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8
  timeout: 60

models:
  cache_dir: "~/.local_swarm/models"
  preferred_models:
    - qwen2.5-coder
    - deepseek-coder
    - codellama

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5  # Use 50% of system RAM for CPU/Apple Silicon

network:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  advertise_interval: 30
  max_peers: 10
  auth_token: null  # Optional auth for security
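
A minimal loader sketch for this file, overlaying user values on defaults (paths and default values mirror the example above; the merge is shallow for brevity):

import os
import yaml

DEFAULTS = {
    "server": {"host": "127.0.0.1", "port": 8000},
    "swarm": {"consensus_strategy": "similarity", "min_instances": 2,
              "max_instances": 8, "timeout": 60},
    "models": {"cache_dir": "~/.local_swarm/models"},
    "hardware": {"gpu_memory_fraction": 1.0, "ram_fraction": 0.5},
}

def load_config(path: str = "config.yaml") -> dict:
    """Read config.yaml if present and overlay it on the defaults."""
    config = {k: dict(v) for k, v in DEFAULTS.items()}
    if os.path.exists(path):
        with open(path) as f:
            user = yaml.safe_load(f) or {}
        for section, values in user.items():
            config.setdefault(section, {}).update(values or {})
    config["models"]["cache_dir"] = os.path.expanduser(config["models"]["cache_dir"])
    return config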

Phase 5.5: MCP Server (Week 5)

5.5.1 MCP Protocol Implementation

File: src/mcp_server.py

Features:

  • MCP Server Class: LocalSwarmMCPServer implementing MCP protocol
  • Stdio Transport: Communication via standard input/output
  • Tool Registration: 5 MCP tools for AI assistants:
  1. get_hardware_info - Query system capabilities

    • OS, CPU cores, RAM
    • GPU name, VRAM, type
    • Available memory for LLMs
  2. get_swarm_status - Check swarm health

    • Running/stopped status
    • Model name
    • Healthy workers count
    • Total memory usage
  3. generate_code - Generate with consensus

    • Input: prompt, max_tokens, temperature
    • Returns: generated code with metadata
    • Shows: strategy used, confidence, latency
  4. list_available_models - Browse models

    • All available models
    • Variants per model
    • Quantization options
  5. get_worker_details - Worker statistics

    • Per-worker backend info
    • Health status
    • Request count
    • Latency and throughput

Implementation:

class LocalSwarmMCPServer:
    def __init__(self, swarm_manager):
        self.server = Server("local-swarm")
        self.swarm_manager = swarm_manager
        self.register_tools()
    
    def register_tools(self):
        @self.server.list_tools()
        async def list_tools() -> List[Tool]:
            # Define tool schemas
            
        @self.server.call_tool()
        async def call_tool(name, arguments):
            # Handle tool calls
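
A fuller sketch of register_tools for one of the five tools, using the Tool and TextContent types from the Python MCP SDK (the inputSchema shown and the get_status helper are illustrative, not settled API):

import json
from mcp.types import Tool, TextContent

# Inside LocalSwarmMCPServer, continuing the sketch above:
    def register_tools(self):
        @self.server.list_tools()
        async def list_tools() -> list[Tool]:
            return [
                Tool(
                    name="get_swarm_status",
                    description="Check swarm health: model, worker count, memory usage",
                    inputSchema={"type": "object", "properties": {}},
                ),
                # ... the remaining four tools are declared the same way
            ]

        @self.server.call_tool()
        async def call_tool(name: str, arguments: dict) -> list[TextContent]:
            if name == "get_swarm_status":
                status = self.swarm_manager.get_status()  # hypothetical helper on SwarmManager
                return [TextContent(type="text", text=json.dumps(status))]
            raise ValueError(f"Unknown tool: {name}")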

5.5.2 Dual Server Mode

  • Run HTTP API and MCP server simultaneously
  • Shared SwarmManager instance
  • HTTP API for external clients
  • MCP for AI assistant integration

Usage:

# HTTP API only
python main.py

# HTTP API + MCP server
python main.py --mcp

5.5.3 Installation Scripts

Windows (scripts/install.bat):

@echo off
echo Installing Local Swarm...
python -m pip install --upgrade pip
pip install -r requirements.txt

:: Check for CUDA
nvidia-smi >nul 2>&1
if %errorlevel% == 0 (
    echo CUDA detected, installing GPU support...
    pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
) else (
    echo No CUDA detected, using CPU backend...
    pip install llama-cpp-python
)

echo Installation complete!
echo Run: python -m local_swarm

macOS/Linux (scripts/install.sh):

#!/bin/bash
set -e

echo "Installing Local Swarm..."
pip install --upgrade pip

# Detect platform
if [[ "$OSTYPE" == "darwin"* ]]; then
    echo "macOS detected..."
    pip install -r requirements.txt
    pip install -r requirements-macos.txt
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
    echo "Linux detected..."
    pip install -r requirements.txt
    if command -v nvidia-smi &> /dev/null; then
        echo "CUDA detected, installing GPU support..."
        pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
    else
        pip install llama-cpp-python
    fi
fi

echo "Installation complete!"
echo "Run: python -m local_swarm"

Phase 6: Local Network Federation (Week 6)

6.1 Overview

Enable multiple machines on the same local network to form a "federated swarm". Each machine runs its own optimized swarm, but they coordinate to vote on the best responses across the entire network.

Example scenario:

  • Windows PC (RTX 4060 Ti): 4 instances of Qwen 7B Q4
  • Mac Mini (M1): 2 instances of Qwen 7B Q4
  • MacBook (M4): 3 instances of Qwen 7B Q4
  • Total: 9 instances voting on every request

6.2 Architecture

Federation modes:

  1. Independent Swarms with Cross-Voting (Recommended):

    • Each machine runs its own swarm locally
    • When a request comes in, each swarm generates responses internally
    • All swarms exchange their "best local" responses
    • Final vote across all best responses
    • Pros: Simple, resilient, uses each machine's optimal config
    • Cons: Slightly higher latency
  2. Distributed Workers (Advanced):

    • Single coordinator manages workers across all machines
    • Workers distributed based on capability
    • Pros: Optimal load balancing
    • Cons: Complex failure handling, network overhead

Implementation: Start with Mode 1 (Independent Swarms with Cross-Voting)

6.3 Discovery Protocol

File: src/network/discovery.py

mDNS/Bonjour-based discovery:

class SwarmDiscovery:
    """Discovers other Local Swarm instances on the local network."""
    
    def __init__(self, port: int = 8765):
        self.port = port
        self.peers: Dict[str, PeerInfo] = {}
        self.zeroconf = Zeroconf()
    
    def start_advertising(self, swarm_info: SwarmInfo):
        """Advertise this swarm on the network."""
        service_info = ServiceInfo(
            "_local-swarm._tcp.local.",
            f"{hostname}._local-swarm._tcp.local.",
            addresses=[socket.inet_aton(self.get_ip())],  # zeroconf expects packed addresses
            port=self.port,
            properties={
                b"version": b"1.0",
                b"instances": str(swarm_info.instances).encode(),
                b"model": swarm_info.model_id.encode(),
                b"hardware": swarm_info.hardware_summary.encode(),
            }
        )
        self.zeroconf.register_service(service_info)
    
    def discover_peers(self) -> List[PeerInfo]:
        """Start browsing for other swarms; peers are added via the state-change callback."""
        self.browser = ServiceBrowser(
            self.zeroconf,
            "_local-swarm._tcp.local.",
            handlers=[self._on_service_state_change]
        )
        return list(self.peers.values())

Alternative: Simple UDP broadcast for environments without mDNS
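
A sketch of that broadcast fallback, assuming a simple JSON announce packet (the packet fields and port constant are illustrative):

import json
import socket

BROADCAST_PORT = 8765

def announce(swarm_info: dict):
    """Broadcast this swarm's presence once; call periodically from a timer."""
    payload = json.dumps({"service": "local-swarm", **swarm_info}).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, ("255.255.255.255", BROADCAST_PORT))

def listen(timeout: float = 5.0) -> list[dict]:
    """Collect announce packets from peers for roughly `timeout` seconds."""
    peers = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        sock.bind(("", BROADCAST_PORT))
        sock.settimeout(timeout)
        try:
            while True:
                data, addr = sock.recvfrom(4096)
                msg = json.loads(data)
                if msg.get("service") == "local-swarm":
                    peers.append({"host": addr[0], **msg})
        except socket.timeout:
            pass
    return peers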

6.4 Federation Protocol

File: src/network/federation.py

HTTP-based communication:

class FederationClient:
    """Client for communicating with peer swarms."""
    
    async def request_vote(self, peer: PeerInfo, request: GenerationRequest) -> Vote:
        """Request a vote from a peer swarm."""
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"http://{peer.host}:{peer.port}/v1/federation/vote",
                json={
                    "prompt": request.prompt,
                    "context": request.context,
                    "request_id": request.id
                },
                timeout=aiohttp.ClientTimeout(total=30)
            ) as resp:
                return Vote(await resp.json())
    
    async def cast_vote(self, responses: List[GenerationResponse]) -> Vote:
        """Cast this swarm's vote on a set of responses."""
        # Use consensus engine to pick best local response
        best = await self.consensus.select_best(responses)
        return Vote(best, confidence=self._calculate_confidence(responses))

Endpoints:

  • POST /v1/federation/vote - Request a vote from this swarm
  • GET /v1/federation/health - Check peer health
  • GET /v1/federation/info - Get swarm capabilities
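
The vote endpoint on the receiving side might look like this (FastAPI; request/response field names follow the client sketch above, and swarm_manager / consensus_engine are the module-level objects from routes.py; agreement_score is a hypothetical helper):

from typing import Optional
from fastapi import APIRouter
from pydantic import BaseModel

router = APIRouter()

class VoteRequest(BaseModel):
    prompt: str
    context: Optional[str] = None
    request_id: str

@router.post("/v1/federation/vote")
async def federation_vote(req: VoteRequest):
    """Generate locally, run local consensus, and return this swarm's best answer."""
    responses = await swarm_manager.generate_all(req.prompt)
    best = await consensus_engine.select_best(responses)
    confidence = consensus_engine.agreement_score(responses)  # hypothetical helper
    return {"request_id": req.request_id, "response": best, "confidence": confidence}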

6.5 Cross-Swarm Consensus

File: src/swarm/cross_consensus.py

Two-phase voting:

  1. Local Consensus (Phase 1):

    • Each swarm generates responses from all local instances
    • Runs local consensus to pick "best local" response
    • Returns to coordinator
  2. Global Consensus (Phase 2):

    • Coordinator collects all "best local" responses
    • Weights by swarm confidence (based on local agreement)
    • Returns highest-weighted response

class CrossSwarmConsensus:
    """Consensus across multiple networked swarms."""
    
    async def generate_with_federation(
        self, 
        request: GenerationRequest,
        timeout: float = 30.0
    ) -> GenerationResponse:
        # Phase 1: Local generation
        local_responses = await self.local_swarm.generate_all(request)
        local_best = await self.local_consensus.select_best(local_responses)
        local_confidence = self._calculate_local_confidence(local_responses)
        
        # Phase 2: Request peer votes
        peer_votes = []
        for peer in self.discovery.get_peers():
            try:
                vote = await asyncio.wait_for(
                    self.federation.request_vote(peer, request),
                    timeout=5.0
                )
                peer_votes.append(vote)
            except asyncio.TimeoutError:
                logger.warning(f"Peer {peer.host} timed out")
        
        # Phase 3: Global consensus
        all_votes = [Vote(local_best, local_confidence)] + peer_votes
        best_vote = self._weighted_vote(all_votes)
        
        return best_vote.response
    
    def _calculate_local_confidence(self, responses: List[GenerationResponse]) -> float:
        """Calculate confidence based on local agreement."""
        # High agreement = high confidence
        # Use embedding similarity
        similarities = self._compute_pairwise_similarity(responses)
        return np.mean(similarities)
    
    def _weighted_vote(self, votes: List[Vote]) -> Vote:
        """Select best vote weighted by confidence."""
        # Could use weighted random selection or just pick highest
        return max(votes, key=lambda v: v.confidence)

6.6 Configuration

federation:
  enabled: true
  mode: "independent"  # independent, distributed
  discovery:
    method: "mdns"  # mdns, broadcast, static
    port: 8765
    advertise_interval: 30
  communication:
    port: 8766
    timeout: 30
    auth:
      enabled: false
      token: null
  consensus:
    strategy: "weighted"  # weighted, best_of_n, latency
    min_peers: 1  # Minimum peers for federation (0 = solo mode OK)
    max_peers: 10

Phase 7: Extended GPU Support (Week 7)

7.1 AMD GPU Support (ROCm)

File: src/hardware/amd.py

Detection:

def detect_amd_gpu() -> Optional[GPUInfo]:
    """Detect AMD GPU using ROCm."""
    try:
        # Try rocm-smi
        import subprocess
        result = subprocess.run(
            ["rocm-smi", "--showmeminfo", "vram"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            # Parse VRAM info
            vram_gb = parse_rocm_output(result.stdout)
            return GPUInfo(
                name="AMD GPU",
                vram_gb=vram_gb,
                is_amd=True
            )
    except FileNotFoundError:
        pass
    
    # Fallback to checking for AMD in PCI
    return detect_amd_via_pci()

Backend: llama.cpp with ROCm support

# Build llama.cpp with ROCm
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python

Platforms: Linux (primary), Windows (experimental)

7.2 Intel GPU Support (OpenCL/OneAPI)

File: src/hardware/intel.py

Detection:

def detect_intel_gpu() -> Optional[GPUInfo]:
    """Detect Intel GPU using OneAPI or OpenCL."""
    # Try Intel GPU driver
    try:
        import subprocess
        result = subprocess.run(
            ["sycl-ls"],
            capture_output=True, text=True
        )
        if "Intel" in result.stdout:
            vram_gb = parse_sycl_output(result.stdout)
            return GPUInfo(
                name="Intel GPU",
                vram_gb=vram_gb,
                driver_version=get_intel_driver_version()
            )
    except FileNotFoundError:
        pass
    
    # Fallback to OpenCL
    return detect_intel_via_opencl()

Backend: llama.cpp with SYCL support

# Build llama.cpp with SYCL
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" \
    pip install llama-cpp-python

Platforms: Linux, Windows

7.3 Qualcomm GPU Support (Android/Termux)

File: src/hardware/qualcomm.py

Detection:

def detect_qualcomm_gpu() -> Optional[GPUInfo]:
    """Detect Qualcomm Adreno GPU on Android/Termux."""
    if not is_termux():
        return None
    
    try:
        # Parse /proc/cpuinfo and /proc/meminfo
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
        
        # Check for Qualcomm SoC
        if "Qualcomm" in cpuinfo or "Snapdragon" in cpuinfo:
            # Get RAM info (no separate VRAM on mobile)
            mem = psutil.virtual_memory()
            total_gb = mem.total / (1024**3)
            
            # Use 25% of RAM for LLM (very conservative on mobile)
            return GPUInfo(
                name="Qualcomm Adreno",
                vram_gb=total_gb * 0.25,
                is_mobile=True
            )
    except Exception:
        pass
    
    return None

def is_termux() -> bool:
    """Check if running in Termux environment."""
    return (
        os.environ.get("TERMUX_VERSION") is not None or
        os.path.exists("/data/data/com.termux/files/usr")
    )

Backend Options:

  1. llama.cpp on Termux:

    # In Termux
    pkg install cmake clang
    pip install llama-cpp-python
    
  2. QNN (Qualcomm Neural Network) - Advanced:

    • Use Qualcomm's SDK for optimized inference
    • Better performance but complex setup

Limitations:

  • Models must be small (1-3B parameters)
  • Quantization essential (Q4 or lower)
  • Limited context window (2048 tokens)
  • Slower than desktop GPUs

Platforms: Android (via Termux)

7.4 Hardware Detection Updates

Update src/hardware/detector.py to support new GPUs:

def detect_gpu() -> Optional[GPUInfo]:
    """Detect GPU based on platform."""
    os_name = detect_os()
    
    if os_name == "darwin":
        return detect_apple_gpu()
    
    # Priority: NVIDIA > AMD > Intel > Qualcomm
    gpu = detect_nvidia_gpu()
    if gpu:
        return gpu
    
    gpu = detect_amd_gpu()
    if gpu:
        return gpu
    
    gpu = detect_intel_gpu()
    if gpu:
        return gpu
    
    gpu = detect_qualcomm_gpu()
    if gpu:
        return gpu
    
    return None

Phase 8: Testing & Polish (Week 8)

8.1 Test Coverage

Unit tests:

  • Hardware detection mocking (all GPU vendors)
  • Model selection logic
  • Consensus algorithm (local + cross-swarm)
  • API endpoint validation
  • Network discovery protocol

Integration tests:

  • End-to-end inference
  • Multi-worker coordination
  • Cross-swarm voting
  • Error handling
  • Network partition scenarios

Platform tests:

  • Windows with NVIDIA/AMD/Intel
  • macOS with M1/M2/M3/M4
  • Linux with NVIDIA/AMD/Intel
  • CPU-only fallback
  • Android (Termux) with Qualcomm

8.2 Performance Optimization

  • Model warmup: Pre-load models on startup
  • Request batching: Group similar requests
  • Worker pooling: Reuse workers instead of respawning
  • Memory monitoring: Auto-shutdown if OOM
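
For the memory watchdog specifically, a psutil-based sketch (thresholds are illustrative, and SwarmManager.shutdown() is a hypothetical method):

import asyncio
import logging
import psutil

logger = logging.getLogger(__name__)

async def memory_watchdog(swarm_manager, limit_fraction: float = 0.95, interval: float = 5.0):
    """Periodically check system memory and stop the swarm before the OS OOM-kills it."""
    while True:
        used_fraction = psutil.virtual_memory().percent / 100.0
        if used_fraction >= limit_fraction:
            logger.error("Memory usage %.0f%% exceeds limit, shutting down swarm",
                         used_fraction * 100)
            await swarm_manager.shutdown()  # hypothetical method on SwarmManager
            break
        await asyncio.sleep(interval)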

8.3 Documentation COMPLETED

Created docs/GUIDE.md with:

  • Quick Start Guide (all platforms)
  • Opencode Configuration Examples:
    • Basic setup
    • Remote machine configuration
    • Multiple model options
    • Environment-specific configs
  • API Reference (OpenAI-compatible endpoints)
  • Troubleshooting Guide (common issues, platform-specific)
  • Performance Tuning (speed vs quality, memory usage)
  • Advanced Configuration (config.yaml, env vars)
  • MCP Server setup
  • Network Federation guide

Updated README.md:

  • Added Documentation section with links
  • Referenced complete guide

Technical Decisions

Why llama.cpp?

  • Best cross-platform support
  • Mature quantization formats (GGUF)
  • Active community
  • Good performance/quality tradeoff

Why MLX for macOS?

  • Native Apple Silicon optimization
  • Simpler than llama.cpp on macOS
  • Better unified memory handling

Why consensus voting?

  • Improves response quality vs single model
  • Uses available hardware efficiently
  • Can detect model hallucinations

Memory Model

External GPU (NVIDIA/AMD):

  • Use all of the VRAM for the swarm (gpu_memory_fraction: 1.0)
  • Keep a ~10% buffer for OS/drivers when sizing instances
  • Split the remainder equally across instances
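
Worked example: a 16 GB card keeps ~1.6 GB as buffer, leaving ~14.4 GB; with Qwen 7B Q4_K_M at ~4.5 GB per instance (registry above), that supports floor(14.4 / 4.5) = 3 instances, each with an equal VRAM share.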

Apple Silicon:

  • Use 50% of unified RAM
  • Avoid system swap
  • Monitor memory pressure

CPU-only:

  • Use 50% of system RAM
  • Dependent on available memory
  • Slower but functional

Future Enhancements

  1. Dynamic scaling: Add/remove workers based on load
  2. Model mixing: Different models in same swarm
  3. Fine-tuning: Local fine-tuning on user data
  4. Web UI: Browser-based configuration
  5. Docker support: Containerized deployment
  6. Cloud inference: Fallback to cloud APIs
  7. WebGPU support: Browser-based inference
  8. Persistent knowledge: RAG with local vector DB

Success Metrics

  • Startup time: < 30 seconds from cold start
  • First inference: < 10 seconds after startup
  • Concurrent requests: Support 2-8 parallel inferences per machine
  • Consensus accuracy: > 80% agreement on code tasks
  • Memory efficiency: Use > 80% of available memory
  • Cross-platform: Works on Windows/macOS/Linux without code changes
  • GPU support: NVIDIA, AMD, Intel, Apple Silicon, Qualcomm
  • Network federation: Auto-discovery within 10 seconds
  • Federated consensus: Scale to 5+ machines (25+ instances)
  • Mobile support: Functional on Android/Termux (3B models)