# Local Swarm - Detailed Implementation Plan
## Overview
A terminal-based tool that automatically configures and runs a swarm of small coding LLMs optimized for your hardware, exposing an OpenAI-compatible API for integration with opencode and other tools.
## Architecture
```
local_swarm/
├── src/
│   ├── __init__.py
│   ├── hardware/
│   │   ├── __init__.py
│   │   ├── detector.py        # Platform-agnostic hardware detection
│   │   ├── nvidia.py          # NVIDIA GPU detection (Windows/Linux)
│   │   ├── amd.py             # AMD GPU detection (ROCm)
│   │   ├── intel.py           # Intel GPU detection (OneAPI/OpenCL)
│   │   ├── qualcomm.py        # Qualcomm/Adreno detection (Android)
│   │   ├── apple_silicon.py   # Apple Silicon detection (macOS)
│   │   └── memory.py          # RAM detection
│   ├── models/
│   │   ├── __init__.py
│   │   ├── registry.py        # Model database with specs
│   │   ├── selector.py        # Optimal model/quant selection logic
│   │   └── downloader.py      # Download manager (HuggingFace)
│   ├── backends/
│   │   ├── __init__.py
│   │   ├── base.py            # Backend interface
│   │   ├── llamacpp.py        # llama.cpp backend (CUDA/ROCm/SYCL)
│   │   └── mlx.py             # MLX backend (macOS)
│   ├── swarm/
│   │   ├── __init__.py
│   │   ├── manager.py         # Instance lifecycle management
│   │   ├── worker.py          # Individual LLM instance wrapper
│   │   ├── consensus.py       # Local voting/consensus algorithm
│   │   └── cross_consensus.py # Cross-swarm consensus
│   ├── network/
│   │   ├── __init__.py
│   │   ├── discovery.py       # mDNS/Bonjour peer discovery
│   │   ├── federation.py      # Inter-swarm communication
│   │   └── protocol.py        # Network protocol definitions
│   └── api/
│       ├── __init__.py
│       ├── server.py          # FastAPI/uvicorn server
│       ├── routes.py          # OpenAI-compatible endpoints
│       ├── federation.py      # Federation endpoints
│       └── middleware.py      # Request handling
├── tests/
├── config/
│   └── models.yaml            # Model configurations
├── scripts/
│   ├── install.bat            # Windows installer
│   ├── install.sh             # Unix installer
│   └── install-termux.sh      # Android/Termux installer
├── main.py                    # CLI entry point
├── requirements.txt
├── requirements-macos.txt     # MLX-specific deps
├── requirements-termux.txt    # Android/Termux deps
├── setup.py
└── .gitignore
```
## Network Federation Architecture
When multiple machines run Local Swarm on the same network, they can form a "federated swarm":
```
                    Local Network

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  Windows PC  │   │   Mac Mini   │   │   MacBook    │
│  (RTX 4060)  │   │     (M1)     │   │     (M4)     │
│              │   │              │   │              │
│ ┌──────────┐ │   │ ┌──────────┐ │   │ ┌──────────┐ │
│ │ Swarm 1  │ │   │ │ Swarm 2  │ │   │ │ Swarm 3  │ │
│ │ 4 inst.  │ │   │ │ 2 inst.  │ │   │ │ 3 inst.  │ │
│ │ Qwen 7B  │ │   │ │ Qwen 7B  │ │   │ │ Qwen 7B  │ │
│ └────┬─────┘ │   │ └────┬─────┘ │   │ └────┬─────┘ │
│      │       │   │      │       │   │      │       │
│    mDNS      │   │    mDNS      │   │    mDNS      │
│      │       │   │      │       │   │      │       │
└──────┼───────┘   └──────┼───────┘   └──────┼───────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
                 ┌────────┴────────┐
                 │   Federation    │
                 │   Coordinator   │
                 │                 │
                 │  ┌───────────┐  │
                 │  │ Consensus │  │
                 │  │  Engine   │  │
                 │  └───────────┘  │
                 └────────┬────────┘
                          │
                 ┌────────▼────────┐
                 │    opencode     │
                 │    (Client)     │
                 └─────────────────┘

Federation Flow:
1. Each swarm independently detects hardware and starts instances
2. Swarms advertise themselves via mDNS/Bonjour
3. When request comes in, each swarm generates local responses
4. Local consensus picks best response per swarm
5. Cross-swarm consensus votes across all best responses
6. Final answer returned to opencode
```
**Benefits**:
- Utilize all hardware in your home/office
- Each machine optimizes for its own specs
- No single point of failure
- Automatic load distribution
- Works even if one machine goes offline
## Implementation Phases
### Phase 1: Foundation (Week 1)
#### 1.1 Hardware Detection Module
**File**: `src/hardware/detector.py`
**Requirements**:
- Cross-platform OS detection (Windows, macOS, Linux)
- CPU info (cores, architecture)
- RAM detection (total, available)
- GPU detection with VRAM
**Platform-specific implementations**:
- **Windows**: Use `pynvml` for NVIDIA, fall back to DirectX detection for others
- **macOS**: Use `psutil` for RAM, `sysctl` for CPU, Metal API for GPU
- **Linux**: Use `pynvml` for NVIDIA, `rocm-smi` for AMD
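As an illustration of the NVIDIA path, a minimal detection sketch using `pynvml` might look like the following (multi-GPU handling omitted; the returned dictionary stands in for the `GPUInfo` structure below):
```python
import pynvml  # NVIDIA Management Library bindings

def detect_nvidia_gpu() -> dict | None:
    """Sketch: report the first NVIDIA GPU's name and total VRAM, or None if unavailable."""
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # no NVIDIA driver/GPU present
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        name = pynvml.nvmlDeviceGetName(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return {"name": name, "vram_gb": mem.total / (1024 ** 3)}
    finally:
        pynvml.nvmlShutdown()
```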
**Output structure**:
```python
class HardwareProfile:
    os: str                 # 'windows', 'darwin', 'linux'
    cpu_cores: int
    ram_gb: float
    gpu: Optional[GPUInfo]
    is_apple_silicon: bool
```
**Model selection rules**:
- External GPU (NVIDIA/AMD): Use 100% of VRAM
- Apple Silicon: Use 50% of unified RAM
- CPU-only: Use 50% of system RAM
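A sketch of how these rules could be encoded (the ~10% driver buffer for external GPUs mirrors the Memory Model section later in this plan; exact fractions are meant to be configurable):
```python
def get_available_memory(hw: HardwareProfile) -> float:
    """Memory budget (GB) for the swarm, per the selection rules above."""
    if hw.gpu and not hw.is_apple_silicon:
        return hw.gpu.vram_gb * 0.9  # external GPU: all VRAM minus ~10% for OS/drivers
    # Apple Silicon unified memory and CPU-only hosts both use half of system RAM
    return hw.ram_gb * 0.5
```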
#### 1.2 Model Registry
**File**: `src/models/registry.py`
**Model database** (YAML format):
```yaml
models:
  qwen2.5-coder:
    name: "Qwen 2.5 Coder"
    description: "Alibaba's code-focused model"
    variants:
      - size: 3b
        base_vram_gb: 2.0  # Approximate VRAM for fp16
        quantizations:
          q4_k_m:
            vram_gb: 1.8
            quality: "good"
          q5_k_m:
            vram_gb: 2.2
            quality: "better"
          q6_k:
            vram_gb: 2.6
            quality: "best"
      - size: 7b
        base_vram_gb: 14.0
        quantizations:
          q4_k_m:
            vram_gb: 4.5
          q5_k_m:
            vram_gb: 5.2
          q6_k:
            vram_gb: 6.0
  codellama:
    name: "CodeLlama"
    # Similar structure...
  deepseek-coder:
    name: "DeepSeek Coder"
    # Similar structure...
```
**Selection priority**:
1. Qwen 2.5 Coder (best for small sizes)
2. DeepSeek Coder (good alternative)
3. CodeLlama (fallback)
#### 1.3 Model Selector Logic
**File**: `src/models/selector.py`
**Algorithm**:
```python
def select_optimal_model(hardware: HardwareProfile) -> ModelConfig:
    available_memory = get_available_memory(hardware)

    # Try models in priority order
    for model in PRIORITY_MODELS:
        # Find largest size that fits
        for variant in reversed(model.variants):
            # Try highest quantization that fits
            for quant in reversed(variant.quantizations):
                total_vram_needed = quant.vram_gb * MIN_INSTANCES
                if total_vram_needed <= available_memory:
                    # Calculate max instances
                    max_instances = int(available_memory // quant.vram_gb)
                    # Cap at reasonable limit (e.g., 8)
                    instances = min(max_instances, 8)
                    return ModelConfig(model, variant, quant, instances)

    # Fallback to smallest model
    return FALLBACK_CONFIG
```
**Minimum instances**: 2 (for consensus voting)
**Maximum instances**: 8 (to avoid overhead)
### Phase 2: Backend Integration (Week 2)
#### 2.1 Base Backend Interface
**File**: `src/backends/base.py`
```python
from abc import ABC, abstractmethod
from typing import AsyncIterator


class LLMBackend(ABC):
    @abstractmethod
    async def load_model(self, model_path: str, config: dict) -> bool:
        pass

    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        pass

    @abstractmethod
    def get_memory_usage(self) -> float:
        """Return current VRAM/RAM usage in GB"""
        pass

    @abstractmethod
    def shutdown(self):
        pass
```
#### 2.2 llama.cpp Backend
**File**: `src/backends/llamacpp.py`
**Implementation**:
- Use `llama-cpp-python` library
- Support GGUF model format
- GPU acceleration via CUDA/Metal
- Server mode with HTTP API
**Key features**:
- Model caching to avoid reload
- Context window management
- Batch processing support
**Memory calculation**:
```python
def calculate_memory_usage(model_path: str) -> float:
    # Parse GGUF metadata
    # Return estimated VRAM usage
```
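A rough sketch of how this backend could wrap `llama-cpp-python` (parameter names follow that library; the config keys shown are assumptions, and the remaining abstract methods are omitted for brevity):
```python
from llama_cpp import Llama

class LlamaCppBackend(LLMBackend):
    """Minimal llama.cpp-backed sketch of the backend interface."""

    async def load_model(self, model_path: str, config: dict) -> bool:
        self.llm = Llama(
            model_path=model_path,
            n_gpu_layers=config.get("n_gpu_layers", -1),  # -1 offloads all layers to the GPU
            n_ctx=config.get("n_ctx", 4096),              # context window size
        )
        return True

    async def generate(self, prompt: str, **kwargs) -> str:
        out = self.llm(prompt, max_tokens=kwargs.get("max_tokens", 512))
        return out["choices"][0]["text"]
```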
#### 2.3 MLX Backend (macOS)
**File**: `src/backends/mlx.py`
**Implementation**:
- Use `mlx-lm` library
- Support MLX format models
- Optimized for Apple Silicon
**Key differences from llama.cpp**:
- Native Metal performance
- Simpler API
- Unified memory model
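For reference, the `mlx-lm` API this backend would wrap is roughly the following (the model path and parameters are illustrative):
```python
from mlx_lm import load, generate

# Load an MLX-format model from the local cache or Hugging Face (path is illustrative)
model, tokenizer = load("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit")
text = generate(model, tokenizer, prompt="Write a Python function to reverse a string.", max_tokens=256)
print(text)
```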
### Phase 3: Swarm Management (Week 3)
#### 3.1 Worker Instance
**File**: `src/swarm/worker.py`
Each worker manages:
- One LLM instance
- Request queue
- Health monitoring
- Metrics collection
```python
class SwarmWorker:
    def __init__(self, worker_id: int, backend: LLMBackend, config: dict):
        self.worker_id = worker_id
        self.backend = backend
        self.is_healthy = True
        self.request_count = 0
        self.avg_latency = 0.0

    async def process(self, request: GenerationRequest) -> GenerationResponse:
        start = time.time()
        response = await self.backend.generate(**request.params)
        latency = time.time() - start
        self._update_metrics(latency)
        return GenerationResponse(response, latency, self.worker_id)
```
#### 3.2 Swarm Manager
**File**: `src/swarm/manager.py`
Responsibilities:
- Spawn N workers based on hardware
- Distribute requests to all workers
- Collect responses
- Handle worker failures
```python
class SwarmManager:
    def __init__(self, config: ModelConfig):
        self.workers: List[SwarmWorker] = []
        self.config = config

    async def initialize(self):
        # Download model if needed
        model_path = await self._ensure_model()
        # Spawn workers
        for i in range(self.config.instances):
            backend = self._create_backend()
            await backend.load_model(model_path, self.config.backend_params)
            worker = SwarmWorker(i, backend, self.config)
            self.workers.append(worker)

    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        # Send the same request to all workers in parallel
        request = GenerationRequest(prompt=prompt, **kwargs)
        tasks = [w.process(request) for w in self.workers]
        return await asyncio.gather(*tasks)
```
#### 3.3 Consensus Algorithm
**File**: `src/swarm/consensus.py`
**Voting strategies**:
1. **Similarity voting** (default):
   - Embed all responses
   - Group by semantic similarity
   - Return largest group
2. **Quality scoring**:
   - Score each response on:
     - Completeness (does it answer the question?)
     - Code quality (syntax, structure)
     - Length appropriateness
   - Return highest score
3. **Latency-weighted**:
   - Prefer faster responses (lower memory pressure)
**Implementation**:
```python
class ConsensusEngine:
    def __init__(self, strategy: str = "similarity"):
        self.strategy = strategy
        self.embedding_model = None  # Lazy load

    async def select_best(self, responses: List[GenerationResponse]) -> str:
        if len(responses) == 1:
            return responses[0].text
        if self.strategy == "similarity":
            return await self._similarity_vote(responses)
        elif self.strategy == "quality":
            return await self._quality_score(responses)
        else:
            return self._fastest_response(responses)

    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        # Use sentence-transformers for embeddings
        # Group by cosine similarity > 0.85
        # Return median response from largest group
```
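A sketch of the similarity-vote grouping step, assuming `sentence-transformers` for embeddings (the 0.85 threshold matches the comment above; the embedding model name is an assumption):
```python
import numpy as np
from sentence_transformers import SentenceTransformer

def largest_similarity_group(texts: list[str], threshold: float = 0.85) -> list[int]:
    """Return indices of the largest cluster of mutually similar responses."""
    model = SentenceTransformer("all-MiniLM-L6-v2")  # any small embedding model works
    emb = model.encode(texts, normalize_embeddings=True)
    sims = emb @ emb.T  # cosine similarity, since embeddings are unit-normalized
    groups, assigned = [], set()
    for i in range(len(texts)):
        if i in assigned:
            continue
        group = [j for j in range(len(texts)) if j not in assigned and sims[i, j] >= threshold]
        assigned.update(group)
        groups.append(group)
    return max(groups, key=len)  # the consensus answer is drawn from this group
```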
### Phase 4: API Server (Week 4)
#### 4.1 OpenAI-Compatible Endpoints
**File**: `src/api/routes.py`
Required endpoints:
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion
- `POST /v1/completions` - Text completion (optional)
- `GET /health` - Health check
- `GET /metrics` - Prometheus metrics (optional)
**Chat completions endpoint**:
```python
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
# Extract messages
messages = request.messages
prompt = format_messages(messages)
# Get all responses from swarm
responses = await swarm_manager.generate_all(prompt, **request.params)
# Run consensus
best_response = await consensus_engine.select_best(responses)
# Format as OpenAI response
return {
"id": f"chatcmpl-{uuid4()}",
"object": "chat.completion",
"created": int(time.time()),
"model": request.model,
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": best_response},
"finish_reason": "stop"
}],
"usage": calculate_usage(prompt, best_response)
}
```
#### 4.2 Streaming Support
**File**: `src/api/routes.py`
For streaming, use the fastest worker instead of consensus:
```python
if request.stream:
    # Pick worker with lowest latency
    worker = swarm_manager.get_fastest_worker()
    return StreamingResponse(
        worker.stream_generate(prompt),
        media_type="text/event-stream"
    )
```
### Phase 5: CLI & Interactive Interface (Week 5)
#### 5.1 Interactive Menu System
**File**: `src/interactive.py`
**Features**:
- **Hardware Display**: Detailed hardware info with formatting
  - OS, CPU cores, RAM (total/available)
  - GPU details (name, VRAM, driver, type)
  - Memory allocation rules
- **Model Selection Menu**: Three configuration options
  1. **Recommended Configuration**: Auto-detects optimal model + instances
  2. **Browse All Configurations**: Lists all feasible models for hardware
  3. **Custom Configuration**: Step-by-step wizard
     - Select model family (Qwen, DeepSeek, CodeLlama)
     - Choose model size (3B, 7B, 14B)
     - Pick quantization (Q4_K_M, Q5_K_M, Q6_K)
     - Specify instance count (1 to max supported)
- **Resource Usage Monitor**: Real-time swarm status
  - Swarm status (running/stopped)
  - Current model name
  - Healthy workers count
  - Memory usage (total and per-worker)
  - Worker statistics:
    - Total requests served
    - Average latency
    - Tokens per second
- **Startup Summary**: Comprehensive display showing:
  - Hardware detection section
  - Model configuration section
  - Resource usage section
  - Memory utilization percentage
**Implementation**:
```python
def interactive_model_selection(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Show hardware info
    # Display menu with 3 options
    # Return selected configuration

def custom_configuration(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Step-by-step wizard
    # Select model -> size -> quantization -> instances
    # Validate memory constraints

def show_startup_summary(hardware, config, swarm_manager=None):
    # Clear screen
    # Print formatted hardware, config, and usage info
```
#### 5.2 CLI Interface
**File**: `main.py`
Commands:
```bash
# Start the swarm (auto-detect hardware)
python -m local_swarm
# Start with specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Start with specific port
python -m local_swarm --port 8080
# Override instance count
python -m local_swarm --instances 4
# Show hardware detection
python -m local_swarm --detect
# Download models only
python -m local_swarm --download-only
```
#### 5.3 Configuration File
**File**: `config.yaml`
```yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8
  timeout: 60

models:
  cache_dir: "~/.local_swarm/models"
  preferred_models:
    - qwen2.5-coder
    - deepseek-coder
    - codellama

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5         # Use 50% of system RAM for CPU/Apple Silicon

network:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  advertise_interval: 30
  max_peers: 10
  auth_token: null  # Optional auth for security
```
### Phase 5.5: MCP Server (Week 5)
#### 5.5.1 MCP Protocol Implementation
**File**: `src/mcp_server.py`
**Features**:
- **MCP Server Class**: `LocalSwarmMCPServer` implementing MCP protocol
- **Stdio Transport**: Communication via standard input/output
- **Tool Registration**: 5 MCP tools for AI assistants:
  1. **`get_hardware_info`** - Query system capabilities
     - OS, CPU cores, RAM
     - GPU name, VRAM, type
     - Available memory for LLMs
  2. **`get_swarm_status`** - Check swarm health
     - Running/stopped status
     - Model name
     - Healthy workers count
     - Total memory usage
  3. **`generate_code`** - Generate with consensus
     - Input: prompt, max_tokens, temperature
     - Returns: generated code with metadata
     - Shows: strategy used, confidence, latency
  4. **`list_available_models`** - Browse models
     - All available models
     - Variants per model
     - Quantization options
  5. **`get_worker_details`** - Worker statistics
     - Per-worker backend info
     - Health status
     - Request count
     - Latency and throughput
**Implementation**:
```python
class LocalSwarmMCPServer:
    def __init__(self, swarm_manager):
        self.server = Server("local-swarm")
        self.register_tools()

    def register_tools(self):
        @self.server.list_tools()
        async def list_tools() -> List[Tool]:
            # Define tool schemas

        @self.server.call_tool()
        async def call_tool(name, arguments):
            # Handle tool calls
```
#### 5.5.2 Dual Server Mode
- Run HTTP API and MCP server simultaneously
- Shared SwarmManager instance
- HTTP API for external clients
- MCP for AI assistant integration
**Usage**:
```bash
# HTTP API only
python main.py
# HTTP API + MCP server
python main.py --mcp
```
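One way the dual mode could be wired up, assuming the MCP server exposes an async `run()` entry point over stdio (that method name is an assumption, not a confirmed API):
```python
import asyncio
import uvicorn

async def run_dual(app, mcp_server) -> None:
    """Serve the OpenAI-compatible HTTP API and the MCP stdio server in one event loop."""
    http = uvicorn.Server(uvicorn.Config(app, host="127.0.0.1", port=8000))
    await asyncio.gather(
        http.serve(),      # FastAPI app for opencode and other HTTP clients
        mcp_server.run(),  # LocalSwarmMCPServer over stdio (assumed entry point)
    )
```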
#### 5.5.3 Installation Scripts
**Windows** (`scripts/install.bat`):
```batch
@echo off
echo Installing Local Swarm...
python -m pip install --upgrade pip
pip install -r requirements.txt

:: Check for CUDA
nvidia-smi >nul 2>&1
if %errorlevel% == 0 (
    echo CUDA detected, installing GPU support...
    pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
) else (
    echo No CUDA detected, using CPU backend...
    pip install llama-cpp-python
)

echo Installation complete!
echo Run: python -m local_swarm
```
**macOS/Linux** (`scripts/install.sh`):
```bash
#!/bin/bash
set -e

echo "Installing Local Swarm..."
pip install --upgrade pip

# Detect platform
if [[ "$OSTYPE" == "darwin"* ]]; then
    echo "macOS detected..."
    pip install -r requirements.txt
    pip install -r requirements-macos.txt
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
    echo "Linux detected..."
    pip install -r requirements.txt
    if command -v nvidia-smi &> /dev/null; then
        echo "CUDA detected, installing GPU support..."
        pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
    else
        pip install llama-cpp-python
    fi
fi

echo "Installation complete!"
echo "Run: python -m local_swarm"
```
### Phase 6: Local Network Federation (Week 6)
#### 6.1 Overview
Enable multiple machines on the same local network to form a "federated swarm". Each machine runs its own optimized swarm, but they coordinate to vote on the best responses across the entire network.
**Example scenario**:
- Windows PC (RTX 4060 Ti): 4 instances of Qwen 7B Q4
- Mac Mini (M1): 2 instances of Qwen 7B Q4
- MacBook (M4): 3 instances of Qwen 7B Q4
- Total: 9 instances voting on every request
#### 6.2 Architecture
**Federation modes**:
1. **Independent Swarms with Cross-Voting** (Recommended):
   - Each machine runs its own swarm locally
   - When a request comes in, each swarm generates responses internally
   - All swarms exchange their "best local" responses
   - Final vote across all best responses
   - Pros: Simple, resilient, uses each machine's optimal config
   - Cons: Slightly higher latency
2. **Distributed Workers** (Advanced):
   - Single coordinator manages workers across all machines
   - Workers distributed based on capability
   - Pros: Optimal load balancing
   - Cons: Complex failure handling, network overhead
**Implementation**: Start with Mode 1 (Independent Swarms with Cross-Voting)
#### 6.3 Discovery Protocol
**File**: `src/network/discovery.py`
**mDNS/Bonjour-based discovery**:
```python
import socket

from zeroconf import ServiceBrowser, ServiceInfo, Zeroconf


class SwarmDiscovery:
    """Discovers other Local Swarm instances on the local network."""

    def __init__(self, port: int = 8765):
        self.port = port
        self.peers: Dict[str, PeerInfo] = {}
        self.zeroconf = Zeroconf()

    def start_advertising(self, swarm_info: SwarmInfo):
        """Advertise this swarm on the network."""
        service_info = ServiceInfo(
            "_local-swarm._tcp.local.",
            f"{socket.gethostname()}._local-swarm._tcp.local.",
            addresses=[self.get_ip()],
            port=self.port,
            properties={
                b"version": b"1.0",
                b"instances": str(swarm_info.instances).encode(),
                b"model": swarm_info.model_id.encode(),
                b"hardware": swarm_info.hardware_summary.encode(),
            }
        )
        self.zeroconf.register_service(service_info)

    def discover_peers(self) -> List[PeerInfo]:
        """Discover other swarms on the network."""
        browser = ServiceBrowser(
            self.zeroconf,
            "_local-swarm._tcp.local.",
            handlers=[self._on_service_state_change]
        )
```
**Alternative**: Simple UDP broadcast for environments without mDNS
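A minimal sketch of that fallback, broadcasting a JSON announcement on the discovery port (the payload fields are illustrative):
```python
import json
import socket

def broadcast_presence(swarm_info: dict, port: int = 8765) -> None:
    """Announce this swarm to the local subnet when mDNS is unavailable."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    try:
        sock.sendto(json.dumps(swarm_info).encode(), ("255.255.255.255", port))
    finally:
        sock.close()
```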
#### 6.4 Federation Protocol
**File**: `src/network/federation.py`
**HTTP-based communication**:
```python
import aiohttp


class FederationClient:
    """Client for communicating with peer swarms."""

    async def request_vote(self, peer: PeerInfo, request: GenerationRequest) -> Vote:
        """Request a vote from a peer swarm."""
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"http://{peer.host}:{peer.port}/v1/federation/vote",
                json={
                    "prompt": request.prompt,
                    "context": request.context,
                    "request_id": request.id
                },
                timeout=aiohttp.ClientTimeout(total=30)
            ) as resp:
                return Vote(await resp.json())

    async def cast_vote(self, responses: List[GenerationResponse]) -> Vote:
        """Cast this swarm's vote on a set of responses."""
        # Use consensus engine to pick best local response
        best = await self.consensus.select_best(responses)
        return Vote(best, confidence=self._calculate_confidence(responses))
```
**Endpoints**:
- `POST /v1/federation/vote` - Request a vote from this swarm
- `GET /v1/federation/health` - Check peer health
- `GET /v1/federation/info` - Get swarm capabilities
#### 6.5 Cross-Swarm Consensus
**File**: `src/swarm/cross_consensus.py`
**Two-phase voting**:
1. **Local Consensus** (Phase 1):
   - Each swarm generates responses from all local instances
   - Runs local consensus to pick "best local" response
   - Returns to coordinator
2. **Global Consensus** (Phase 2):
   - Coordinator collects all "best local" responses
   - Weights by swarm confidence (based on local agreement)
   - Returns highest-weighted response
```python
class CrossSwarmConsensus:
    """Consensus across multiple networked swarms."""

    async def generate_with_federation(
        self,
        request: GenerationRequest,
        timeout: float = 30.0
    ) -> GenerationResponse:
        # Phase 1: Local generation
        local_responses = await self.local_swarm.generate_all(request)
        local_best = await self.local_consensus.select_best(local_responses)
        local_confidence = self._calculate_local_confidence(local_responses)

        # Phase 2: Request peer votes
        peer_votes = []
        for peer in self.discovery.get_peers():
            try:
                vote = await asyncio.wait_for(
                    self.federation.request_vote(peer, request),
                    timeout=5.0
                )
                peer_votes.append(vote)
            except asyncio.TimeoutError:
                logger.warning(f"Peer {peer.host} timed out")

        # Phase 3: Global consensus
        all_votes = [Vote(local_best, local_confidence)] + peer_votes
        best_vote = self._weighted_vote(all_votes)
        return best_vote.response

    def _calculate_local_confidence(self, responses: List[GenerationResponse]) -> float:
        """Calculate confidence based on local agreement."""
        # High agreement = high confidence
        # Use embedding similarity
        similarities = self._compute_pairwise_similarity(responses)
        return np.mean(similarities)

    def _weighted_vote(self, votes: List[Vote]) -> Vote:
        """Select best vote weighted by confidence."""
        # Could use weighted random selection or just pick highest
        return max(votes, key=lambda v: v.confidence)
```
#### 6.6 Configuration
```yaml
federation:
  enabled: true
  mode: "independent"  # independent, distributed
  discovery:
    method: "mdns"     # mdns, broadcast, static
    port: 8765
    advertise_interval: 30
  communication:
    port: 8766
    timeout: 30
  auth:
    enabled: false
    token: null
  consensus:
    strategy: "weighted"  # weighted, best_of_n, latency
    min_peers: 1          # Minimum peers for federation (0 = solo mode OK)
    max_peers: 10
```
### Phase 7: Extended GPU Support (Week 7)
#### 7.1 AMD GPU Support (ROCm)
**File**: `src/hardware/amd.py`
**Detection**:
```python
def detect_amd_gpu() -> Optional[GPUInfo]:
    """Detect AMD GPU using ROCm."""
    try:
        # Try rocm-smi
        import subprocess
        result = subprocess.run(
            ["rocm-smi", "--showmeminfo", "vram"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            # Parse VRAM info
            vram_gb = parse_rocm_output(result.stdout)
            return GPUInfo(
                name="AMD GPU",
                vram_gb=vram_gb,
                is_amd=True
            )
    except FileNotFoundError:
        pass

    # Fallback to checking for AMD in PCI
    return detect_amd_via_pci()
```
**Backend**: llama.cpp with ROCm support
```bash
# Build llama.cpp with ROCm
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
```
**Platforms**: Linux (primary), Windows (experimental)
#### 7.2 Intel GPU Support (OpenCL/OneAPI)
**File**: `src/hardware/intel.py`
**Detection**:
```python
def detect_intel_gpu() -> Optional[GPUInfo]:
    """Detect Intel GPU using OneAPI or OpenCL."""
    # Try Intel GPU driver
    try:
        import subprocess
        result = subprocess.run(
            ["sycl-ls"],
            capture_output=True, text=True
        )
        if "Intel" in result.stdout:
            vram_gb = parse_sycl_output(result.stdout)
            return GPUInfo(
                name="Intel GPU",
                vram_gb=vram_gb,
                driver_version=get_intel_driver_version()
            )
    except FileNotFoundError:
        pass

    # Fallback to OpenCL
    return detect_intel_via_opencl()
```
**Backend**: llama.cpp with SYCL support
```bash
# Build llama.cpp with SYCL
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" \
pip install llama-cpp-python
```
**Platforms**: Linux, Windows
#### 7.3 Qualcomm GPU Support (Android/Termux)
**File**: `src/hardware/qualcomm.py`
**Detection**:
```python
def detect_qualcomm_gpu() -> Optional[GPUInfo]:
    """Detect Qualcomm Adreno GPU on Android/Termux."""
    if not is_termux():
        return None
    try:
        # Parse /proc/cpuinfo and /proc/meminfo
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
        # Check for Qualcomm SoC
        if "Qualcomm" in cpuinfo or "Snapdragon" in cpuinfo:
            # Get RAM info (no separate VRAM on mobile)
            mem = psutil.virtual_memory()
            total_gb = mem.total / (1024**3)
            # Use 25% of RAM for LLM (very conservative on mobile)
            return GPUInfo(
                name="Qualcomm Adreno",
                vram_gb=total_gb * 0.25,
                is_mobile=True
            )
    except Exception:
        pass
    return None


def is_termux() -> bool:
    """Check if running in Termux environment."""
    return (
        os.environ.get("TERMUX_VERSION") is not None or
        os.path.exists("/data/data/com.termux/files/usr")
    )
```
**Backend Options**:
1. **llama.cpp on Termux**:
   ```bash
   # In Termux
   pkg install cmake clang
   pip install llama-cpp-python
   ```
2. **QNN (Qualcomm Neural Network)** - Advanced:
   - Use Qualcomm's SDK for optimized inference
   - Better performance but complex setup
**Limitations**:
- Models must be small (1-3B parameters)
- Quantization essential (Q4 or lower)
- Limited context window (2048 tokens)
- Slower than desktop GPUs
**Platforms**: Android (via Termux)
#### 7.4 Hardware Detection Updates
Update `src/hardware/detector.py` to support new GPUs:
```python
def detect_gpu() -> Optional[GPUInfo]:
    """Detect GPU based on platform."""
    os_name = detect_os()
    if os_name == "darwin":
        return detect_apple_gpu()

    # Priority: NVIDIA > AMD > Intel > Qualcomm
    gpu = detect_nvidia_gpu()
    if gpu:
        return gpu
    gpu = detect_amd_gpu()
    if gpu:
        return gpu
    gpu = detect_intel_gpu()
    if gpu:
        return gpu
    gpu = detect_qualcomm_gpu()
    if gpu:
        return gpu
    return None
```
### Phase 8: Testing & Polish (Week 8)
#### 8.1 Test Coverage
**Unit tests**:
- Hardware detection mocking (all GPU vendors)
- Model selection logic
- Consensus algorithm (local + cross-swarm)
- API endpoint validation
- Network discovery protocol
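For example, the GPU-vendor mocking could follow this pattern (module and class locations are assumptions based on the architecture section above):
```python
from unittest.mock import patch

from src.hardware import detector  # assumed module layout


def test_detect_gpu_prefers_nvidia():
    fake = detector.GPUInfo(name="GeForce RTX 4060", vram_gb=8.0)  # assumed constructor
    with patch.object(detector, "detect_os", return_value="linux"), \
         patch.object(detector, "detect_nvidia_gpu", return_value=fake):
        gpu = detector.detect_gpu()
    assert gpu is not None and gpu.vram_gb == 8.0
```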
**Integration tests**:
- End-to-end inference
- Multi-worker coordination
- Cross-swarm voting
- Error handling
- Network partition scenarios
**Platform tests**:
- Windows with NVIDIA/AMD/Intel
- macOS with M1/M2/M3/M4
- Linux with NVIDIA/AMD/Intel
- CPU-only fallback
- Android (Termux) with Qualcomm
#### 8.2 Performance Optimization
- **Model warmup**: Pre-load models on startup
- **Request batching**: Group similar requests
- **Worker pooling**: Reuse workers instead of respawning
- **Memory monitoring**: Auto-shutdown if OOM
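As one possible shape for the memory guard, a periodic check with `psutil` could stop accepting work before the OS OOM-killer intervenes (the threshold is illustrative):
```python
import psutil

def memory_pressure_ok(threshold: float = 0.95) -> bool:
    """Return False when system memory usage crosses the threshold and workers should wind down."""
    return psutil.virtual_memory().percent / 100.0 < threshold
```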
#### 8.3 Documentation ✅ COMPLETED
**Created docs/GUIDE.md with:**
- Quick Start Guide (all platforms)
- Opencode Configuration Examples:
- Basic setup
- Remote machine configuration
- Multiple model options
- Environment-specific configs
- API Reference (OpenAI-compatible endpoints)
- Troubleshooting Guide (common issues, platform-specific)
- Performance Tuning (speed vs quality, memory usage)
- Advanced Configuration (config.yaml, env vars)
- MCP Server setup
- Network Federation guide
**Updated README.md:**
- Added Documentation section with links
- Referenced complete guide
## Technical Decisions
### Why llama.cpp?
- Best cross-platform support
- Mature quantization formats (GGUF)
- Active community
- Good performance/quality tradeoff
### Why MLX for macOS?
- Native Apple Silicon optimization
- Simpler than llama.cpp on macOS
- Better unified memory handling
### Why consensus voting?
- Improves response quality vs single model
- Uses available hardware efficiently
- Can detect model hallucinations
### Memory Model
**External GPU (NVIDIA/AMD)**:
- Use up to 100% of VRAM
- Keep a ~10% buffer for OS/drivers
- Each instance gets equal share
**Apple Silicon**:
- Use 50% of unified RAM
- Avoid system swap
- Monitor memory pressure
**CPU-only**:
- Use 50% of system RAM
- Dependent on available memory
- Slower but functional
## Future Enhancements
1. **Dynamic scaling**: Add/remove workers based on load
2. **Model mixing**: Different models in same swarm
3. **Fine-tuning**: Local fine-tuning on user data
4. **Web UI**: Browser-based configuration
5. **Docker support**: Containerized deployment
6. **Cloud inference**: Fallback to cloud APIs
7. **WebGPU support**: Browser-based inference
8. **Persistent knowledge**: RAG with local vector DB
## Success Metrics
- **Startup time**: < 30 seconds from cold start
- **First inference**: < 10 seconds after startup
- **Concurrent requests**: Support 2-8 parallel inferences per machine
- **Consensus accuracy**: > 80% agreement on code tasks
- **Memory efficiency**: Use > 80% of available memory
- **Cross-platform**: Works on Windows/macOS/Linux without code changes
- **GPU support**: NVIDIA, AMD, Intel, Apple Silicon, Qualcomm
- **Network federation**: Auto-discovery within 10 seconds
- **Federated consensus**: Scale to 5+ machines (25+ instances)
- **Mobile support**: Functional on Android/Termux (3B models)