Local Swarm - Detailed Implementation Plan
Overview
A terminal-based tool that automatically configures and runs a swarm of small coding LLMs optimized for your hardware, exposing an OpenAI-compatible API for integration with opencode and other tools.
Architecture
local_swarm/
├── src/
│ ├── __init__.py
│ ├── hardware/
│ │ ├── __init__.py
│ │ ├── detector.py # Platform-agnostic hardware detection
│ │ ├── nvidia.py # NVIDIA GPU detection (Windows/Linux)
│ │ ├── amd.py # AMD GPU detection (ROCm)
│ │ ├── intel.py # Intel GPU detection (OneAPI/OpenCL)
│ │ ├── qualcomm.py # Qualcomm/Adreno detection (Android)
│ │ ├── apple_silicon.py # Apple Silicon detection (macOS)
│ │ └── memory.py # RAM detection
│ ├── models/
│ │ ├── __init__.py
│ │ ├── registry.py # Model database with specs
│ │ ├── selector.py # Optimal model/quant selection logic
│ │ └── downloader.py # Download manager (HuggingFace)
│ ├── backends/
│ │ ├── __init__.py
│ │ ├── base.py # Backend interface
│ │ ├── llamacpp.py # llama.cpp backend (CUDA/ROCm/SYCL)
│ │ └── mlx.py # MLX backend (macOS)
│ ├── swarm/
│ │ ├── __init__.py
│ │ ├── manager.py # Instance lifecycle management
│ │ ├── worker.py # Individual LLM instance wrapper
│ │ ├── consensus.py # Local voting/consensus algorithm
│ │ └── cross_consensus.py # Cross-swarm consensus
│ ├── network/
│ │ ├── __init__.py
│ │ ├── discovery.py # mDNS/Bonjour peer discovery
│ │ ├── federation.py # Inter-swarm communication
│ │ └── protocol.py # Network protocol definitions
│ └── api/
│ ├── __init__.py
│ ├── server.py # FastAPI/uvicorn server
│ ├── routes.py # OpenAI-compatible endpoints
│ ├── federation.py # Federation endpoints
│ └── middleware.py # Request handling
├── tests/
├── config/
│ └── models.yaml # Model configurations
├── scripts/
│ ├── install.bat # Windows installer
│ ├── install.sh # Unix installer
│ └── install-termux.sh # Android/Termux installer
├── main.py # CLI entry point
├── requirements.txt
├── requirements-macos.txt # MLX-specific deps
├── requirements-termux.txt # Android/Termux deps
├── setup.py
└── .gitignore
Network Federation Architecture
When multiple machines run Local Swarm on the same network, they can form a "federated swarm":
┌─────────────────────────────────────────────────────────────┐
│ Local Network │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Windows PC │ │ Mac Mini │ │ MacBook │ │
│ │ (RTX 4060) │ │ (M1) │ │ (M4) │ │
│ │ │ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │ Swarm 1 │ │ │ │ Swarm 2 │ │ │ │ Swarm 3 │ │ │
│ │ │ 4 inst. │ │ │ │ 2 inst. │ │ │ │ 3 inst. │ │ │
│ │ │ Qwen 7B │ │ │ │ Qwen 7B │ │ │ │ Qwen 7B │ │ │
│ │ └────┬─────┘ │ │ └────┬─────┘ │ │ └────┬─────┘ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ mDNS │ │ │ mDNS │ │ │ mDNS │ │ │
│ │ │ │ │ │ │ │ │ │ │
│ └──────┼───────┘ └──────┼───────┘ └──────┼───────┘ │
│ │ │ │ │
│ └───────────────────┼───────────────────┘ │
│ │ │
│ ┌────────┴────────┐ │
│ │ Federation │ │
│ │ Coordinator │ │
│ │ │ │
│ │ ┌───────────┐ │ │
│ │ │ Consensus │ │ │
│ │ │ Engine │ │ │
│ │ └───────────┘ │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ opencode │ │
│ │ (Client) │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Federation Flow:
1. Each swarm independently detects hardware and starts instances
2. Swarms advertise themselves via mDNS/Bonjour
3. When request comes in, each swarm generates local responses
4. Local consensus picks best response per swarm
5. Cross-swarm consensus votes across all best responses
6. Final answer returned to opencode
Benefits:
- Utilize all hardware in your home/office
- Each machine optimizes for its own specs
- No single point of failure
- Automatic load distribution
- Works even if one machine goes offline
Implementation Phases
Phase 1: Foundation (Week 1)
1.1 Hardware Detection Module
File: src/hardware/detector.py
Requirements:
- Cross-platform OS detection (Windows, macOS, Linux)
- CPU info (cores, architecture)
- RAM detection (total, available)
- GPU detection with VRAM
Platform-specific implementations:
- Windows: Use pynvml for NVIDIA, fallback to DirectX for others
- macOS: Use psutil for RAM, sysctl for CPU, Metal API for GPU
- Linux: Use pynvml for NVIDIA, rocm-smi for AMD
Output structure:
class HardwareProfile:
    os: str                 # 'windows', 'darwin', 'linux'
    cpu_cores: int
    ram_gb: float
    gpu: Optional[GPUInfo]
    is_apple_silicon: bool
Model selection rules:
- External GPU (NVIDIA/AMD): Use 100% of VRAM
- Apple Silicon: Use 50% of unified RAM
- CPU-only: Use 50% of system RAM
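A minimal sketch of these rules as the get_available_memory helper referenced by the selector in 1.3, assuming the HardwareProfile fields above; the fractions mirror the config.yaml defaults (gpu_memory_fraction: 1.0, ram_fraction: 0.5):
def get_available_memory(hw: HardwareProfile) -> float:
    """Return the memory budget (in GB) the swarm may use."""
    if hw.gpu is not None and not hw.is_apple_silicon:
        return hw.gpu.vram_gb      # external GPU: use 100% of VRAM
    return hw.ram_gb * 0.5         # Apple Silicon or CPU-only: use 50% of system RAM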
1.2 Model Registry
File: src/models/registry.py
Model database (YAML format):
models:
  qwen2.5-coder:
    name: "Qwen 2.5 Coder"
    description: "Alibaba's code-focused model"
    variants:
      - size: 3b
        base_vram_gb: 2.0  # Approximate VRAM for fp16
        quantizations:
          q4_k_m:
            vram_gb: 1.8
            quality: "good"
          q5_k_m:
            vram_gb: 2.2
            quality: "better"
          q6_k:
            vram_gb: 2.6
            quality: "best"
      - size: 7b
        base_vram_gb: 14.0
        quantizations:
          q4_k_m:
            vram_gb: 4.5
          q5_k_m:
            vram_gb: 5.2
          q6_k:
            vram_gb: 6.0
  codellama:
    name: "CodeLlama"
    # Similar structure...
  deepseek-coder:
    name: "DeepSeek Coder"
    # Similar structure...
Selection priority:
- Qwen 2.5 Coder (best for small sizes)
- DeepSeek Coder (good alternative)
- CodeLlama (fallback)
1.3 Model Selector Logic
File: src/models/selector.py
Algorithm:
def select_optimal_model(hardware: HardwareProfile) -> ModelConfig:
    available_memory = get_available_memory(hardware)
    # Try models in priority order
    for model in PRIORITY_MODELS:
        # Find largest size that fits
        for variant in reversed(model.variants):
            # Try highest quantization that fits
            for quant in reversed(variant.quantizations):
                total_vram_needed = quant.vram_gb * MIN_INSTANCES
                if total_vram_needed <= available_memory:
                    # Calculate max instances
                    max_instances = int(available_memory // quant.vram_gb)
                    # Cap at reasonable limit (e.g., 8)
                    instances = min(max_instances, 8)
                    return ModelConfig(model, variant, quant, instances)
    # Fallback to smallest model
    return FALLBACK_CONFIG
Minimum instances: 2 (for consensus voting)
Maximum instances: 8 (to avoid overhead)
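For example, a 16 GB GPU running Qwen 7B at Q4_K_M (≈4.5 GB per instance in the registry above) supports floor(16 / 4.5) = 3 instances, which satisfies the minimum of 2 and stays well under the cap of 8.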
Phase 2: Backend Integration (Week 2)
2.1 Base Backend Interface
File: src/backends/base.py
from abc import ABC, abstractmethod
from typing import AsyncIterator

class LLMBackend(ABC):
    @abstractmethod
    async def load_model(self, model_path: str, config: dict) -> bool:
        pass

    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        pass

    @abstractmethod
    def get_memory_usage(self) -> float:
        """Return current VRAM/RAM usage in GB"""
        pass

    @abstractmethod
    def shutdown(self):
        pass
2.2 llama.cpp Backend
File: src/backends/llamacpp.py
Implementation:
- Use llama-cpp-python library
- Support GGUF model format
- GPU acceleration via CUDA/Metal
- Server mode with HTTP API
Key features:
- Model caching to avoid reload
- Context window management
- Batch processing support
Memory calculation:
def calculate_memory_usage(model_path: str) -> float:
    # Parse GGUF metadata
    # Return estimated VRAM usage
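One possible completion, a rough sketch assuming that on-disk GGUF size plus a fixed runtime overhead approximates VRAM use; the 1.2 multiplier is an illustrative assumption, not a measured constant:
import os

def calculate_memory_usage(model_path: str) -> float:
    """Estimate the VRAM needed to load a GGUF model, in GB."""
    file_size_gb = os.path.getsize(model_path) / (1024 ** 3)
    # Assumed overhead for KV cache and scratch buffers
    return file_size_gb * 1.2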
2.3 MLX Backend (macOS)
File: src/backends/mlx.py
Implementation:
- Use mlx-lm library
- Support MLX format models
- Optimized for Apple Silicon
Key differences from llama.cpp:
- Native Metal performance
- Simpler API
- Unified memory model
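A minimal usage sketch, assuming the mlx-lm load/generate helpers; the model repo name and prompt are illustrative:
from mlx_lm import load, generate

# Load an MLX-format model (repo path is illustrative)
model, tokenizer = load("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit")
text = generate(model, tokenizer, prompt="Write a binary search in Python.", max_tokens=256)
print(text)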
Phase 3: Swarm Management (Week 3)
3.1 Worker Instance
File: src/swarm/worker.py
Each worker manages:
- One LLM instance
- Request queue
- Health monitoring
- Metrics collection
class SwarmWorker:
    def __init__(self, worker_id: int, backend: LLMBackend, config: dict):
        self.worker_id = worker_id
        self.backend = backend
        self.is_healthy = True
        self.request_count = 0
        self.avg_latency = 0.0

    async def process(self, request: GenerationRequest) -> GenerationResponse:
        start = time.time()
        response = await self.backend.generate(**request.params)
        latency = time.time() - start
        self._update_metrics(latency)
        return GenerationResponse(response, latency, self.worker_id)
3.2 Swarm Manager
File: src/swarm/manager.py
Responsibilities:
- Spawn N workers based on hardware
- Distribute requests to all workers
- Collect responses
- Handle worker failures
class SwarmManager:
    def __init__(self, config: ModelConfig):
        self.workers: List[SwarmWorker] = []
        self.config = config

    async def initialize(self):
        # Download model if needed
        model_path = await self._ensure_model()
        # Spawn workers
        for i in range(self.config.instances):
            backend = self._create_backend()
            await backend.load_model(model_path, self.config.backend_params)
            worker = SwarmWorker(i, backend, self.config)
            self.workers.append(worker)

    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        # Build one request and send it to all workers in parallel
        request = GenerationRequest(prompt=prompt, params={"prompt": prompt, **kwargs})
        tasks = [w.process(request) for w in self.workers]
        return await asyncio.gather(*tasks)
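A brief usage sketch tying the selector and manager together; the prompt is illustrative:
async def run(config: ModelConfig):
    manager = SwarmManager(config)
    await manager.initialize()
    responses = await manager.generate_all("Write a Python function that reverses a string.")
    print(f"Collected {len(responses)} candidate responses from the swarm")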
3.3 Consensus Algorithm
File: src/swarm/consensus.py
Voting strategies:
- Similarity voting (default):
  - Embed all responses
  - Group by semantic similarity
  - Return largest group
- Quality scoring:
  - Score each response on completeness (does it answer the question?), code quality (syntax, structure), and length appropriateness
  - Return highest score
- Latency-weighted:
  - Prefer faster responses (lower memory pressure)
Implementation:
class ConsensusEngine:
    def __init__(self, strategy: str = "similarity"):
        self.strategy = strategy
        self.embedding_model = None  # Lazy load

    async def select_best(self, responses: List[GenerationResponse]) -> str:
        if len(responses) == 1:
            return responses[0].text
        if self.strategy == "similarity":
            return await self._similarity_vote(responses)
        elif self.strategy == "quality":
            return await self._quality_score(responses)
        else:
            return self._fastest_response(responses)

    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        # Use sentence-transformers for embeddings
        # Group by cosine similarity > 0.85
        # Return median response from largest group
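A hedged sketch of that method body, assuming sentence-transformers is installed; the all-MiniLM-L6-v2 model name and the greedy grouping are illustrative choices:
from sentence_transformers import SentenceTransformer
import numpy as np

async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
    if self.embedding_model is None:
        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [r.text for r in responses]
    emb = self.embedding_model.encode(texts, normalize_embeddings=True)
    # Greedily group responses whose cosine similarity to a group seed exceeds 0.85
    groups: List[List[int]] = []
    for i in range(len(texts)):
        for group in groups:
            if float(np.dot(emb[i], emb[group[0]])) > 0.85:
                group.append(i)
                break
        else:
            groups.append([i])
    largest = max(groups, key=len)
    # Return the member closest to the group centroid (the "median" response)
    centroid = np.mean([emb[i] for i in largest], axis=0)
    best = max(largest, key=lambda i: float(np.dot(emb[i], centroid)))
    return texts[best]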
Phase 4: API Server (Week 4)
4.1 OpenAI-Compatible Endpoints
File: src/api/routes.py
Required endpoints:
- GET /v1/models - List available models
- POST /v1/chat/completions - Chat completion
- POST /v1/completions - Text completion (optional)
- GET /health - Health check
- GET /metrics - Prometheus metrics (optional)
Chat completions endpoint:
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
# Extract messages
messages = request.messages
prompt = format_messages(messages)
# Get all responses from swarm
responses = await swarm_manager.generate_all(prompt, **request.params)
# Run consensus
best_response = await consensus_engine.select_best(responses)
# Format as OpenAI response
return {
"id": f"chatcmpl-{uuid4()}",
"object": "chat.completion",
"created": int(time.time()),
"model": request.model,
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": best_response},
"finish_reason": "stop"
}],
"usage": calculate_usage(prompt, best_response)
}
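Because the endpoint mirrors OpenAI's schema, any OpenAI-compatible client should work. A sketch using the openai Python package, assuming the default host/port from config.yaml (127.0.0.1:8000) and an illustrative model id:
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen2.5-coder",  # illustrative model id
    messages=[{"role": "user", "content": "Write a function that parses a CSV line."}],
)
print(resp.choices[0].message.content)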
4.2 Streaming Support
File: src/api/routes.py
For streaming, use the fastest worker instead of consensus:
if request.stream:
    # Pick worker with lowest latency
    worker = swarm_manager.get_fastest_worker()
    return StreamingResponse(
        worker.stream_generate(prompt),
        media_type="text/event-stream"
    )
Phase 5: CLI & Interactive Interface (Week 5)
5.1 Interactive Menu System
File: src/interactive.py
Features:
- Hardware Display: Detailed hardware info with formatting
  - OS, CPU cores, RAM (total/available)
  - GPU details (name, VRAM, driver, type)
  - Memory allocation rules
- Model Selection Menu: Three configuration options
  - Recommended Configuration: Auto-detects optimal model + instances
  - Browse All Configurations: Lists all feasible models for hardware
  - Custom Configuration: Step-by-step wizard
    - Select model family (Qwen, DeepSeek, CodeLlama)
    - Choose model size (3B, 7B, 14B)
    - Pick quantization (Q4_K_M, Q5_K_M, Q6_K)
    - Specify instance count (1 to max supported)
- Resource Usage Monitor: Real-time swarm status
  - Swarm status (running/stopped)
  - Current model name
  - Healthy workers count
  - Memory usage (total and per-worker)
  - Worker statistics:
    - Total requests served
    - Average latency
    - Tokens per second
- Startup Summary: Comprehensive display showing:
  - Hardware detection section
  - Model configuration section
  - Resource usage section
  - Memory utilization percentage
Implementation:
def interactive_model_selection(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Show hardware info
    # Display menu with 3 options
    # Return selected configuration

def custom_configuration(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Step-by-step wizard
    # Select model -> size -> quantization -> instances
    # Validate memory constraints

def show_startup_summary(hardware, config, swarm_manager=None):
    # Clear screen
    # Print formatted hardware, config, and usage info
5.2 CLI Interface
File: main.py
Commands:
# Start the swarm (auto-detect hardware)
python -m local_swarm
# Start with specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Start with specific port
python -m local_swarm --port 8080
# Override instance count
python -m local_swarm --instances 4
# Show hardware detection
python -m local_swarm --detect
# Download models only
python -m local_swarm --download-only
5.3 Configuration File
File: config.yaml
server:
  host: "127.0.0.1"
  port: 8000
swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8
  timeout: 60
models:
  cache_dir: "~/.local_swarm/models"
  preferred_models:
    - qwen2.5-coder
    - deepseek-coder
    - codellama
hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5         # Use 50% of system RAM for CPU/Apple Silicon
network:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  advertise_interval: 30
  max_peers: 10
  auth_token: null  # Optional auth for security
Phase 5.5: MCP Server (Week 5)
5.5.1 MCP Protocol Implementation
File: src/mcp_server.py
Features:
- MCP Server Class: LocalSwarmMCPServer implementing the MCP protocol
- Stdio Transport: Communication via standard input/output
- Tool Registration: 5 MCP tools for AI assistants:
  - get_hardware_info - Query system capabilities
    - OS, CPU cores, RAM
    - GPU name, VRAM, type
    - Available memory for LLMs
  - get_swarm_status - Check swarm health
    - Running/stopped status
    - Model name
    - Healthy workers count
    - Total memory usage
  - generate_code - Generate with consensus
    - Input: prompt, max_tokens, temperature
    - Returns: generated code with metadata
    - Shows: strategy used, confidence, latency
  - list_available_models - Browse models
    - All available models
    - Variants per model
    - Quantization options
  - get_worker_details - Worker statistics
    - Per-worker backend info
    - Health status
    - Request count
    - Latency and throughput
Implementation:
class LocalSwarmMCPServer:
    def __init__(self, swarm_manager):
        self.swarm_manager = swarm_manager
        self.server = Server("local-swarm")
        self.register_tools()

    def register_tools(self):
        @self.server.list_tools()
        async def list_tools() -> List[Tool]:
            # Define tool schemas

        @self.server.call_tool()
        async def call_tool(name, arguments):
            # Handle tool calls
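A hedged sketch of one tool schema and its handler, assuming the mcp Python SDK's Tool and TextContent types (field names can vary between SDK versions); the handler wiring is illustrative:
from mcp.types import Tool, TextContent

HARDWARE_INFO_TOOL = Tool(
    name="get_hardware_info",
    description="Query system capabilities (OS, CPU cores, RAM, GPU, available memory)",
    inputSchema={"type": "object", "properties": {}, "required": []},
)

async def handle_get_hardware_info(hardware) -> List[TextContent]:
    # 'hardware' is the HardwareProfile detected at startup (assumed to be
    # reachable from the SwarmManager in the real implementation)
    summary = f"{hardware.os}, {hardware.cpu_cores} cores, {hardware.ram_gb:.1f} GB RAM"
    return [TextContent(type="text", text=summary)]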
5.5.2 Dual Server Mode
- Run HTTP API and MCP server simultaneously
- Shared SwarmManager instance
- HTTP API for external clients
- MCP for AI assistant integration
Usage:
# HTTP API only
python main.py
# HTTP API + MCP server
python main.py --mcp
5.5.3 Installation Scripts
Windows (scripts/install.bat):
@echo off
echo Installing Local Swarm...
python -m pip install --upgrade pip
pip install -r requirements.txt
:: Check for CUDA
nvidia-smi >nul 2>&1
if %errorlevel% == 0 (
echo CUDA detected, installing GPU support...
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
) else (
echo No CUDA detected, using CPU backend...
pip install llama-cpp-python
)
echo Installation complete!
echo Run: python -m local_swarm
macOS/Linux (scripts/install.sh):
#!/bin/bash
set -e
echo "Installing Local Swarm..."
pip install --upgrade pip
# Detect platform
if [[ "$OSTYPE" == "darwin"* ]]; then
echo "macOS detected..."
pip install -r requirements.txt
pip install -r requirements-macos.txt
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
echo "Linux detected..."
pip install -r requirements.txt
if command -v nvidia-smi &> /dev/null; then
echo "CUDA detected, installing GPU support..."
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
else
pip install llama-cpp-python
fi
fi
echo "Installation complete!"
echo "Run: python -m local_swarm"
Phase 6: Local Network Federation (Week 6)
6.1 Overview
Enable multiple machines on the same local network to form a "federated swarm". Each machine runs its own optimized swarm, but they coordinate to vote on the best responses across the entire network.
Example scenario:
- Windows PC (RTX 4060 Ti): 4 instances of Qwen 7B Q4
- Mac Mini (M1): 2 instances of Qwen 7B Q4
- MacBook (M4): 3 instances of Qwen 7B Q4
- Total: 9 instances voting on every request
6.2 Architecture
Federation modes:
- Independent Swarms with Cross-Voting (Recommended):
  - Each machine runs its own swarm locally
  - When a request comes in, each swarm generates responses internally
  - All swarms exchange their "best local" responses
  - Final vote across all best responses
  - Pros: Simple, resilient, uses each machine's optimal config
  - Cons: Slightly higher latency
- Distributed Workers (Advanced):
  - Single coordinator manages workers across all machines
  - Workers distributed based on capability
  - Pros: Optimal load balancing
  - Cons: Complex failure handling, network overhead
Implementation: Start with Mode 1 (Independent Swarms with Cross-Voting)
6.3 Discovery Protocol
File: src/network/discovery.py
mDNS/Bonjour-based discovery:
class SwarmDiscovery:
    """Discovers other Local Swarm instances on the local network."""

    def __init__(self, port: int = 8765):
        self.port = port
        self.peers: Dict[str, PeerInfo] = {}
        self.zeroconf = Zeroconf()

    def start_advertising(self, swarm_info: SwarmInfo):
        """Advertise this swarm on the network."""
        service_info = ServiceInfo(
            "_local-swarm._tcp.local.",
            f"{hostname}._local-swarm._tcp.local.",
            addresses=[self.get_ip()],
            port=self.port,
            properties={
                b"version": b"1.0",
                b"instances": str(swarm_info.instances).encode(),
                b"model": swarm_info.model_id.encode(),
                b"hardware": swarm_info.hardware_summary.encode(),
            }
        )
        self.zeroconf.register_service(service_info)

    def discover_peers(self) -> List[PeerInfo]:
        """Discover other swarms on the network."""
        browser = ServiceBrowser(
            self.zeroconf,
            "_local-swarm._tcp.local.",
            handlers=[self._on_service_state_change]
        )
Alternative: Simple UDP broadcast for environments without mDNS
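A minimal sketch of that UDP-broadcast fallback; the JSON message shape is illustrative and the port mirrors the discovery_port default (8765):
import json
import socket

def broadcast_presence(swarm_info: dict, port: int = 8765) -> None:
    """Announce this swarm to the local subnet with a single UDP broadcast datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    payload = json.dumps({"service": "local-swarm", **swarm_info}).encode()
    sock.sendto(payload, ("255.255.255.255", port))
    sock.close()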
6.4 Federation Protocol
File: src/network/federation.py
HTTP-based communication:
class FederationClient:
    """Client for communicating with peer swarms."""

    async def request_vote(self, peer: PeerInfo, request: GenerationRequest) -> Vote:
        """Request a vote from a peer swarm."""
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"http://{peer.host}:{peer.port}/v1/federation/vote",
                json={
                    "prompt": request.prompt,
                    "context": request.context,
                    "request_id": request.id
                },
                timeout=aiohttp.ClientTimeout(total=30)
            ) as resp:
                return Vote(await resp.json())

    async def cast_vote(self, responses: List[GenerationResponse]) -> Vote:
        """Cast this swarm's vote on a set of responses."""
        # Use consensus engine to pick best local response
        best = await self.consensus.select_best(responses)
        return Vote(best, confidence=self._calculate_confidence(responses))
Endpoints:
- POST /v1/federation/vote - Request a vote from this swarm
- GET /v1/federation/health - Check peer health
- GET /v1/federation/info - Get swarm capabilities
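A hedged sketch of the server side of the vote endpoint, reusing the app, swarm_manager, and consensus_engine objects from the API server section; the request/response field names follow the client above and are otherwise illustrative:
@app.post("/v1/federation/vote")
async def federation_vote(body: dict):
    # Generate locally and return this swarm's best response as its vote
    responses = await swarm_manager.generate_all(body["prompt"])
    best = await consensus_engine.select_best(responses)
    return {
        "request_id": body.get("request_id"),
        "response": best,
        "confidence": 1.0,  # placeholder; the real implementation derives this from local agreement
    }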
6.5 Cross-Swarm Consensus
File: src/swarm/cross_consensus.py
Two-phase voting:
- Local Consensus (Phase 1):
  - Each swarm generates responses from all local instances
  - Runs local consensus to pick "best local" response
  - Returns to coordinator
- Global Consensus (Phase 2):
  - Coordinator collects all "best local" responses
  - Weights by swarm confidence (based on local agreement)
  - Returns highest-weighted response
class CrossSwarmConsensus:
    """Consensus across multiple networked swarms."""

    async def generate_with_federation(
        self,
        request: GenerationRequest,
        timeout: float = 30.0
    ) -> GenerationResponse:
        # Phase 1: Local generation
        local_responses = await self.local_swarm.generate_all(request)
        local_best = await self.local_consensus.select_best(local_responses)
        local_confidence = self._calculate_local_confidence(local_responses)

        # Phase 2: Request peer votes
        peer_votes = []
        for peer in self.discovery.get_peers():
            try:
                vote = await asyncio.wait_for(
                    self.federation.request_vote(peer, request),
                    timeout=5.0
                )
                peer_votes.append(vote)
            except asyncio.TimeoutError:
                logger.warning(f"Peer {peer.host} timed out")

        # Phase 3: Global consensus
        all_votes = [Vote(local_best, local_confidence)] + peer_votes
        best_vote = self._weighted_vote(all_votes)
        return best_vote.response

    def _calculate_local_confidence(self, responses: List[GenerationResponse]) -> float:
        """Calculate confidence based on local agreement."""
        # High agreement = high confidence
        # Use embedding similarity
        similarities = self._compute_pairwise_similarity(responses)
        return np.mean(similarities)

    def _weighted_vote(self, votes: List[Vote]) -> Vote:
        """Select best vote weighted by confidence."""
        # Could use weighted random selection or just pick highest
        return max(votes, key=lambda v: v.confidence)
6.6 Configuration
federation:
  enabled: true
  mode: "independent"  # independent, distributed
  discovery:
    method: "mdns"  # mdns, broadcast, static
    port: 8765
    advertise_interval: 30
  communication:
    port: 8766
    timeout: 30
  auth:
    enabled: false
    token: null
  consensus:
    strategy: "weighted"  # weighted, best_of_n, latency
    min_peers: 1  # Minimum peers for federation (0 = solo mode OK)
    max_peers: 10
Phase 7: Extended GPU Support (Week 7)
7.1 AMD GPU Support (ROCm)
File: src/hardware/amd.py
Detection:
def detect_amd_gpu() -> Optional[GPUInfo]:
    """Detect AMD GPU using ROCm."""
    try:
        # Try rocm-smi
        import subprocess
        result = subprocess.run(
            ["rocm-smi", "--showmeminfo", "vram"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            # Parse VRAM info
            vram_gb = parse_rocm_output(result.stdout)
            return GPUInfo(
                name="AMD GPU",
                vram_gb=vram_gb,
                is_amd=True
            )
    except FileNotFoundError:
        pass
    # Fallback to checking for AMD in PCI
    return detect_amd_via_pci()
Backend: llama.cpp with ROCm support
# Build llama.cpp with ROCm
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
Platforms: Linux (primary), Windows (experimental)
7.2 Intel GPU Support (OpenCL/OneAPI)
File: src/hardware/intel.py
Detection:
def detect_intel_gpu() -> Optional[GPUInfo]:
    """Detect Intel GPU using OneAPI or OpenCL."""
    # Try Intel GPU driver
    try:
        import subprocess
        result = subprocess.run(
            ["sycl-ls"],
            capture_output=True, text=True
        )
        if "Intel" in result.stdout:
            vram_gb = parse_sycl_output(result.stdout)
            return GPUInfo(
                name="Intel GPU",
                vram_gb=vram_gb,
                driver_version=get_intel_driver_version()
            )
    except FileNotFoundError:
        pass
    # Fallback to OpenCL
    return detect_intel_via_opencl()
Backend: llama.cpp with SYCL support
# Build llama.cpp with SYCL
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" \
pip install llama-cpp-python
Platforms: Linux, Windows
7.3 Qualcomm GPU Support (Android/Termux)
File: src/hardware/qualcomm.py
Detection:
def detect_qualcomm_gpu() -> Optional[GPUInfo]:
    """Detect Qualcomm Adreno GPU on Android/Termux."""
    if not is_termux():
        return None
    try:
        # Parse /proc/cpuinfo and /proc/meminfo
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
        # Check for Qualcomm SoC
        if "Qualcomm" in cpuinfo or "Snapdragon" in cpuinfo:
            # Get RAM info (no separate VRAM on mobile)
            mem = psutil.virtual_memory()
            total_gb = mem.total / (1024**3)
            # Use 25% of RAM for LLM (very conservative on mobile)
            return GPUInfo(
                name="Qualcomm Adreno",
                vram_gb=total_gb * 0.25,
                is_mobile=True
            )
    except Exception:
        pass
    return None

def is_termux() -> bool:
    """Check if running in Termux environment."""
    return (
        os.environ.get("TERMUX_VERSION") is not None or
        os.path.exists("/data/data/com.termux/files/usr")
    )
Backend Options:
- llama.cpp on Termux:
  # In Termux
  pkg install cmake clang
  pip install llama-cpp-python
- QNN (Qualcomm Neural Network) - Advanced:
  - Use Qualcomm's SDK for optimized inference
  - Better performance but complex setup
Limitations:
- Models must be small (1-3B parameters)
- Quantization essential (Q4 or lower)
- Limited context window (2048 tokens)
- Slower than desktop GPUs
Platforms: Android (via Termux)
7.4 Hardware Detection Updates
Update src/hardware/detector.py to support new GPUs:
def detect_gpu() -> Optional[GPUInfo]:
    """Detect GPU based on platform."""
    os_name = detect_os()
    if os_name == "darwin":
        return detect_apple_gpu()
    # Priority: NVIDIA > AMD > Intel > Qualcomm
    gpu = detect_nvidia_gpu()
    if gpu:
        return gpu
    gpu = detect_amd_gpu()
    if gpu:
        return gpu
    gpu = detect_intel_gpu()
    if gpu:
        return gpu
    gpu = detect_qualcomm_gpu()
    if gpu:
        return gpu
    return None
Phase 8: Testing & Polish (Week 8)
8.1 Test Coverage
Unit tests:
- Hardware detection mocking (all GPU vendors)
- Model selection logic
- Consensus algorithm (local + cross-swarm)
- API endpoint validation
- Network discovery protocol
Integration tests:
- End-to-end inference
- Multi-worker coordination
- Cross-swarm voting
- Error handling
- Network partition scenarios
Platform tests:
- Windows with NVIDIA/AMD/Intel
- macOS with M1/M2/M3/M4
- Linux with NVIDIA/AMD/Intel
- CPU-only fallback
- Android (Termux) with Qualcomm
8.2 Performance Optimization
- Model warmup: Pre-load models on startup
- Request batching: Group similar requests
- Worker pooling: Reuse workers instead of respawning
- Memory monitoring: Auto-shutdown if OOM
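A small sketch of the memory-monitoring idea, assuming psutil and a shutdown() coroutine on SwarmManager; the 95% threshold and 5-second poll interval are illustrative:
import asyncio
import psutil

async def memory_watchdog(swarm_manager, threshold: float = 0.95, interval: float = 5.0):
    """Shut the swarm down before the OS runs out of memory."""
    while True:
        if psutil.virtual_memory().percent / 100.0 >= threshold:
            await swarm_manager.shutdown()  # assumed SwarmManager method
            break
        await asyncio.sleep(interval)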
8.3 Documentation ✅ COMPLETED
Created docs/GUIDE.md with:
- Quick Start Guide (all platforms)
- Opencode Configuration Examples:
- Basic setup
- Remote machine configuration
- Multiple model options
- Environment-specific configs
- API Reference (OpenAI-compatible endpoints)
- Troubleshooting Guide (common issues, platform-specific)
- Performance Tuning (speed vs quality, memory usage)
- Advanced Configuration (config.yaml, env vars)
- MCP Server setup
- Network Federation guide
Updated README.md:
- Added Documentation section with links
- Referenced complete guide
Technical Decisions
Why llama.cpp?
- Best cross-platform support
- Mature quantization formats (GGUF)
- Active community
- Good performance/quality tradeoff
Why MLX for macOS?
- Native Apple Silicon optimization
- Simpler than llama.cpp on macOS
- Better unified memory handling
Why consensus voting?
- Improves response quality vs single model
- Uses available hardware efficiently
- Can detect model hallucinations
Memory Model
External GPU (NVIDIA/AMD):
- Use 100% of VRAM
- Keep 10% buffer for OS/drivers
- Each instance gets equal share
Apple Silicon:
- Use 50% of unified RAM
- Avoid system swap
- Monitor memory pressure
CPU-only:
- Use 50% of system RAM
- Dependent on available memory
- Slower but functional
Future Enhancements
- Dynamic scaling: Add/remove workers based on load
- Model mixing: Different models in same swarm
- Fine-tuning: Local fine-tuning on user data
- Web UI: Browser-based configuration
- Docker support: Containerized deployment
- Cloud inference: Fallback to cloud APIs
- WebGPU support: Browser-based inference
- Persistent knowledge: RAG with local vector DB
Success Metrics
- Startup time: < 30 seconds from cold start
- First inference: < 10 seconds after startup
- Concurrent requests: Support 2-8 parallel inferences per machine
- Consensus accuracy: > 80% agreement on code tasks
- Memory efficiency: Use > 80% of available memory
- Cross-platform: Works on Windows/macOS/Linux without code changes
- GPU support: NVIDIA, AMD, Intel, Apple Silicon, Qualcomm
- Network federation: Auto-discovery within 10 seconds
- Federated consensus: Scale to 5+ machines (25+ instances)
- Mobile support: Functional on Android/Termux (3B models)