Local Swarm - Detailed Implementation Plan
Overview
A terminal-based tool that automatically configures and runs a swarm of small coding LLMs optimized for your hardware, exposing an OpenAI-compatible API for integration with opencode and other tools.
Architecture
local_swarm/
├── src/
│ ├── __init__.py
│ ├── hardware/
│ │ ├── __init__.py
│ │ ├── detector.py # Platform-agnostic hardware detection
│ │ ├── nvidia.py # NVIDIA GPU detection (Windows/Linux)
│ │ ├── amd.py # AMD GPU detection (ROCm)
│ │ ├── intel.py # Intel GPU detection (OneAPI/OpenCL)
│ │ ├── qualcomm.py # Qualcomm/Adreno detection (Android)
│ │ ├── apple_silicon.py # Apple Silicon detection (macOS)
│ │ └── memory.py # RAM detection
│ ├── models/
│ │ ├── __init__.py
│ │ ├── registry.py # Model database with specs
│ │ ├── selector.py # Optimal model/quant selection logic
│ │ └── downloader.py # Download manager (HuggingFace)
│ ├── backends/
│ │ ├── __init__.py
│ │ ├── base.py # Backend interface
│ │ ├── llamacpp.py # llama.cpp backend (CUDA/ROCm/SYCL)
│ │ └── mlx.py # MLX backend (macOS)
│ ├── swarm/
│ │ ├── __init__.py
│ │ ├── manager.py # Instance lifecycle management
│ │ ├── worker.py # Individual LLM instance wrapper
│ │ ├── consensus.py # Local voting/consensus algorithm
│ │ └── cross_consensus.py # Cross-swarm consensus
│ ├── network/
│ │ ├── __init__.py
│ │ ├── discovery.py # mDNS/Bonjour peer discovery
│ │ ├── federation.py # Inter-swarm communication
│ │ └── protocol.py # Network protocol definitions
│ └── api/
│ ├── __init__.py
│ ├── server.py # FastAPI/uvicorn server
│ ├── routes.py # OpenAI-compatible endpoints
│ ├── federation.py # Federation endpoints
│ └── middleware.py # Request handling
├── tests/
├── config/
│ └── models.yaml # Model configurations
├── scripts/
│ ├── install.bat # Windows installer
│ ├── install.sh # Unix installer
│ └── install-termux.sh # Android/Termux installer
├── main.py # CLI entry point
├── requirements.txt
├── requirements-macos.txt # MLX-specific deps
├── requirements-termux.txt # Android/Termux deps
├── setup.py
└── .gitignore
Network Federation Architecture
When multiple machines run Local Swarm on the same network, they can form a "federated swarm":
┌─────────────────────────────────────────────────────────────┐
│ Local Network │
│ │
│ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐ │
│ │ Windows PC │ │ Mac Mini │ │ MacBook │ │
│ │ (RTX 4060) │ │ (M1) │ │ (M4) │ │
│ │ │ │ │ │ │ │
│ │ ┌──────────┐ │ │ ┌──────────┐ │ │ ┌──────────┐ │ │
│ │ │ Swarm 1 │ │ │ │ Swarm 2 │ │ │ │ Swarm 3 │ │ │
│ │ │ 4 inst. │ │ │ │ 2 inst. │ │ │ │ 3 inst. │ │ │
│ │ │ Qwen 7B │ │ │ │ Qwen 7B │ │ │ │ Qwen 7B │ │ │
│ │ └────┬─────┘ │ │ └────┬─────┘ │ │ └────┬─────┘ │ │
│ │ │ │ │ │ │ │ │ │ │
│ │ mDNS │ │ │ mDNS │ │ │ mDNS │ │ │
│ │ │ │ │ │ │ │ │ │ │
│ └──────┼───────┘ └──────┼───────┘ └──────┼───────┘ │
│ │ │ │ │
│ └───────────────────┼───────────────────┘ │
│ │ │
│ ┌────────┴────────┐ │
│ │ Federation │ │
│ │ Coordinator │ │
│ │ │ │
│ │ ┌───────────┐ │ │
│ │ │ Consensus │ │ │
│ │ │ Engine │ │ │
│ │ └───────────┘ │ │
│ └────────┬────────┘ │
│ │ │
│ ┌────────▼────────┐ │
│ │ opencode │ │
│ │ (Client) │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────────────────┘
Federation Flow:
1. Each swarm independently detects hardware and starts instances
2. Swarms advertise themselves via mDNS/Bonjour
3. When request comes in, each swarm generates local responses
4. Local consensus picks best response per swarm
5. Cross-swarm consensus votes across all best responses
6. Final answer returned to opencode
Benefits:
- Utilize all hardware in your home/office
- Each machine optimizes for its own specs
- No single point of failure
- Automatic load distribution
- Works even if one machine goes offline
Implementation Phases
Phase 1: Foundation (Week 1)
1.1 Hardware Detection Module
File: src/hardware/detector.py
Requirements:
- Cross-platform OS detection (Windows, macOS, Linux)
- CPU info (cores, architecture)
- RAM detection (total, available)
- GPU detection with VRAM
Platform-specific implementations:
- Windows: Use pynvml for NVIDIA, fallback to DirectX for others
- macOS: Use psutil for RAM, sysctl for CPU, Metal API for GPU
- Linux: Use pynvml for NVIDIA, rocm-smi for AMD
Output structure:
class HardwareProfile:
    os: str                 # 'windows', 'darwin', 'linux'
    cpu_cores: int
    ram_gb: float
    gpu: Optional[GPUInfo]
    is_apple_silicon: bool
Model selection rules:
- External GPU (NVIDIA/AMD): Use 100% of VRAM
- Apple Silicon: Use 50% of unified RAM
- CPU-only: Use 50% of system RAM
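A minimal sketch of these rules as the get_available_memory helper referenced by the selector in 1.3, assuming the HardwareProfile fields above; the fractions mirror the config.yaml defaults (gpu_memory_fraction: 1.0, ram_fraction: 0.5):
def get_available_memory(hw: HardwareProfile) -> float:
    """Return the memory budget (in GB) the swarm may use."""
    if hw.gpu is not None and not hw.is_apple_silicon:
        return hw.gpu.vram_gb      # external GPU: use 100% of VRAM
    return hw.ram_gb * 0.5         # Apple Silicon or CPU-only: use 50% of system RAM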
1.2 Model Registry
File: src/models/registry.py
Model database (YAML format):
models:
  qwen2.5-coder:
    name: "Qwen 2.5 Coder"
    description: "Alibaba's code-focused model"
    variants:
      - size: 3b
        base_vram_gb: 2.0  # Approximate VRAM for fp16
        quantizations:
          q4_k_m:
            vram_gb: 1.8
            quality: "good"
          q5_k_m:
            vram_gb: 2.2
            quality: "better"
          q6_k:
            vram_gb: 2.6
            quality: "best"
      - size: 7b
        base_vram_gb: 14.0
        quantizations:
          q4_k_m:
            vram_gb: 4.5
          q5_k_m:
            vram_gb: 5.2
          q6_k:
            vram_gb: 6.0
  codellama:
    name: "CodeLlama"
    # Similar structure...
  deepseek-coder:
    name: "DeepSeek Coder"
    # Similar structure...
Selection priority:
- Qwen 2.5 Coder (best for small sizes)
- DeepSeek Coder (good alternative)
- CodeLlama (fallback)
1.3 Model Selector Logic
File: src/models/selector.py
Algorithm:
def select_optimal_model(hardware: HardwareProfile) -> ModelConfig:
    available_memory = get_available_memory(hardware)
    # Try models in priority order
    for model in PRIORITY_MODELS:
        # Find largest size that fits
        for variant in reversed(model.variants):
            # Try highest quantization that fits
            for quant in reversed(variant.quantizations):
                total_vram_needed = quant.vram_gb * MIN_INSTANCES
                if total_vram_needed <= available_memory:
                    # Calculate max instances
                    max_instances = int(available_memory // quant.vram_gb)
                    # Cap at reasonable limit (e.g., 8)
                    instances = min(max_instances, 8)
                    return ModelConfig(model, variant, quant, instances)
    # Fallback to smallest model
    return FALLBACK_CONFIG
Minimum instances: 2 (for consensus voting)
Maximum instances: 8 (to avoid overhead)
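For example, a 16 GB GPU running Qwen 7B at Q4_K_M (≈4.5 GB per instance in the registry above) supports floor(16 / 4.5) = 3 instances, which satisfies the minimum of 2 and stays well under the cap of 8.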
Phase 2: Backend Integration (Week 2)
2.1 Base Backend Interface
File: src/backends/base.py
from abc import ABC, abstractmethod
from typing import AsyncIterator

class LLMBackend(ABC):
    @abstractmethod
    async def load_model(self, model_path: str, config: dict) -> bool:
        pass

    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        pass

    @abstractmethod
    def get_memory_usage(self) -> float:
        """Return current VRAM/RAM usage in GB"""
        pass

    @abstractmethod
    def shutdown(self):
        pass
2.2 llama.cpp Backend
File: src/backends/llamacpp.py
Implementation:
- Use llama-cpp-python library
- Support GGUF model format
- GPU acceleration via CUDA/Metal
- Server mode with HTTP API
Key features:
- Model caching to avoid reload
- Context window management
- Batch processing support
Memory calculation:
def calculate_memory_usage(model_path: str) -> float:
    # Parse GGUF metadata
    # Return estimated VRAM usage
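One possible completion, a rough sketch assuming that on-disk GGUF size plus a fixed runtime overhead approximates VRAM use; the 1.2 multiplier is an illustrative assumption, not a measured constant:
import os

def calculate_memory_usage(model_path: str) -> float:
    """Estimate the VRAM needed to load a GGUF model, in GB."""
    file_size_gb = os.path.getsize(model_path) / (1024 ** 3)
    # Assumed overhead for KV cache and scratch buffers
    return file_size_gb * 1.2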
2.3 MLX Backend (macOS)
File: src/backends/mlx.py
Implementation:
- Use mlx-lm library
- Support MLX format models
- Optimized for Apple Silicon
Key differences from llama.cpp:
- Native Metal performance
- Simpler API
- Unified memory model
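A minimal usage sketch, assuming the mlx-lm load/generate helpers; the model repo name and prompt are illustrative:
from mlx_lm import load, generate

# Load an MLX-format model (repo path is illustrative)
model, tokenizer = load("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit")
text = generate(model, tokenizer, prompt="Write a binary search in Python.", max_tokens=256)
print(text)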
Phase 3: Swarm Management (Week 3)
3.1 Worker Instance
File: src/swarm/worker.py
Each worker manages:
- One LLM instance
- Request queue
- Health monitoring
- Metrics collection
class SwarmWorker:
    def __init__(self, worker_id: int, backend: LLMBackend, config: dict):
        self.worker_id = worker_id
        self.backend = backend
        self.is_healthy = True
        self.request_count = 0
        self.avg_latency = 0.0

    async def process(self, request: GenerationRequest) -> GenerationResponse:
        start = time.time()
        response = await self.backend.generate(**request.params)
        latency = time.time() - start
        self._update_metrics(latency)
        return GenerationResponse(response, latency, self.worker_id)
3.2 Swarm Manager
File: src/swarm/manager.py
Responsibilities:
- Spawn N workers based on hardware
- Distribute requests to all workers
- Collect responses
- Handle worker failures
class SwarmManager:
    def __init__(self, config: ModelConfig):
        self.workers: List[SwarmWorker] = []
        self.config = config

    async def initialize(self):
        # Download model if needed
        model_path = await self._ensure_model()
        # Spawn workers
        for i in range(self.config.instances):
            backend = self._create_backend()
            await backend.load_model(model_path, self.config.backend_params)
            worker = SwarmWorker(i, backend, self.config)
            self.workers.append(worker)

    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        # Build one request and send it to all workers in parallel
        request = GenerationRequest(prompt=prompt, params={"prompt": prompt, **kwargs})
        tasks = [w.process(request) for w in self.workers]
        return await asyncio.gather(*tasks)
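A brief usage sketch tying the selector and manager together; the prompt is illustrative:
async def run(config: ModelConfig):
    manager = SwarmManager(config)
    await manager.initialize()
    responses = await manager.generate_all("Write a Python function that reverses a string.")
    print(f"Collected {len(responses)} candidate responses from the swarm")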
3.3 Consensus Algorithm
File: src/swarm/consensus.py
Voting strategies:
- Similarity voting (default):
  - Embed all responses
  - Group by semantic similarity
  - Return largest group
- Quality scoring:
  - Score each response on completeness (does it answer the question?), code quality (syntax, structure), and length appropriateness
  - Return highest score
- Latency-weighted:
  - Prefer faster responses (lower memory pressure)
Implementation:
class ConsensusEngine:
    def __init__(self, strategy: str = "similarity"):
        self.strategy = strategy
        self.embedding_model = None  # Lazy load

    async def select_best(self, responses: List[GenerationResponse]) -> str:
        if len(responses) == 1:
            return responses[0].text
        if self.strategy == "similarity":
            return await self._similarity_vote(responses)
        elif self.strategy == "quality":
            return await self._quality_score(responses)
        else:
            return self._fastest_response(responses)

    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        # Use sentence-transformers for embeddings
        # Group by cosine similarity > 0.85
        # Return median response from largest group
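A hedged sketch of that method body, assuming sentence-transformers is installed; the all-MiniLM-L6-v2 model name and the greedy grouping are illustrative choices:
from sentence_transformers import SentenceTransformer
import numpy as np

async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
    if self.embedding_model is None:
        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [r.text for r in responses]
    emb = self.embedding_model.encode(texts, normalize_embeddings=True)
    # Greedily group responses whose cosine similarity to a group seed exceeds 0.85
    groups: List[List[int]] = []
    for i in range(len(texts)):
        for group in groups:
            if float(np.dot(emb[i], emb[group[0]])) > 0.85:
                group.append(i)
                break
        else:
            groups.append([i])
    largest = max(groups, key=len)
    # Return the member closest to the group centroid (the "median" response)
    centroid = np.mean([emb[i] for i in largest], axis=0)
    best = max(largest, key=lambda i: float(np.dot(emb[i], centroid)))
    return texts[best]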
Phase 4: API Server (Week 4)
4.1 OpenAI-Compatible Endpoints
File: src/api/routes.py
Required endpoints:
- GET /v1/models - List available models
- POST /v1/chat/completions - Chat completion
- POST /v1/completions - Text completion (optional)
- GET /health - Health check
- GET /metrics - Prometheus metrics (optional)
Chat completions endpoint:
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
# Extract messages
messages = request.messages
prompt = format_messages(messages)
# Get all responses from swarm
responses = await swarm_manager.generate_all(prompt, **request.params)
# Run consensus
best_response = await consensus_engine.select_best(responses)
# Format as OpenAI response
return {
"id": f"chatcmpl-{uuid4()}",
"object": "chat.completion",
"created": int(time.time()),
"model": request.model,
"choices": [{
"index": 0,
"message": {"role": "assistant", "content": best_response},
"finish_reason": "stop"
}],
"usage": calculate_usage(prompt, best_response)
}
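Because the endpoint mirrors OpenAI's schema, any OpenAI-compatible client should work. A sketch using the openai Python package, assuming the default host/port from config.yaml (127.0.0.1:8000) and an illustrative model id:
from openai import OpenAI

client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="qwen2.5-coder",  # illustrative model id
    messages=[{"role": "user", "content": "Write a function that parses a CSV line."}],
)
print(resp.choices[0].message.content)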
4.2 Streaming Support
File: src/api/routes.py
For streaming, use the fastest worker instead of consensus:
if request.stream:
    # Pick worker with lowest latency
    worker = swarm_manager.get_fastest_worker()
    return StreamingResponse(
        worker.stream_generate(prompt),
        media_type="text/event-stream"
    )
Phase 5: CLI & Interactive Interface (Week 5)
5.1 Interactive Menu System
File: src/interactive.py
Features:
- Hardware Display: Detailed hardware info with formatting
  - OS, CPU cores, RAM (total/available)
  - GPU details (name, VRAM, driver, type)
  - Memory allocation rules
- Model Selection Menu: Three configuration options
  - Recommended Configuration: Auto-detects optimal model + instances
  - Browse All Configurations: Lists all feasible models for hardware
  - Custom Configuration: Step-by-step wizard
    - Select model family (Qwen, DeepSeek, CodeLlama)
    - Choose model size (3B, 7B, 14B)
    - Pick quantization (Q4_K_M, Q5_K_M, Q6_K)
    - Specify instance count (1 to max supported)
- Resource Usage Monitor: Real-time swarm status
  - Swarm status (running/stopped)
  - Current model name
  - Healthy workers count
  - Memory usage (total and per-worker)
  - Worker statistics:
    - Total requests served
    - Average latency
    - Tokens per second
- Startup Summary: Comprehensive display showing:
  - Hardware detection section
  - Model configuration section
  - Resource usage section
  - Memory utilization percentage
Implementation:
def interactive_model_selection(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Show hardware info
    # Display menu with 3 options
    # Return selected configuration

def custom_configuration(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Step-by-step wizard
    # Select model -> size -> quantization -> instances
    # Validate memory constraints

def show_startup_summary(hardware, config, swarm_manager=None):
    # Clear screen
    # Print formatted hardware, config, and usage info
5.2 CLI Interface
File: main.py
Commands:
# Start the swarm (auto-detect hardware)
python -m local_swarm
# Start with specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Start with specific port
python -m local_swarm --port 8080
# Override instance count
python -m local_swarm --instances 4
# Show hardware detection
python -m local_swarm --detect
# Download models only
python -m local_swarm --download-only
5.3 Configuration File
File: config.yaml
server:
  host: "127.0.0.1"
  port: 8000
swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8
  timeout: 60
models:
  cache_dir: "~/.local_swarm/models"
  preferred_models:
    - qwen2.5-coder
    - deepseek-coder
    - codellama
hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5         # Use 50% of system RAM for CPU/Apple Silicon
network:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  advertise_interval: 30
  max_peers: 10
  auth_token: null  # Optional auth for security
Phase 5.5: MCP Server (Week 5)
5.5.1 MCP Protocol Implementation
File: src/mcp_server.py
Features:
- MCP Server Class: LocalSwarmMCPServer implementing the MCP protocol
- Stdio Transport: Communication via standard input/output
- Tool Registration: 5 MCP tools for AI assistants:
  - get_hardware_info - Query system capabilities
    - OS, CPU cores, RAM
    - GPU name, VRAM, type
    - Available memory for LLMs
  - get_swarm_status - Check swarm health
    - Running/stopped status
    - Model name
    - Healthy workers count
    - Total memory usage
  - generate_code - Generate with consensus
    - Input: prompt, max_tokens, temperature
    - Returns: generated code with metadata
    - Shows: strategy used, confidence, latency
  - list_available_models - Browse models
    - All available models
    - Variants per model
    - Quantization options
  - get_worker_details - Worker statistics
    - Per-worker backend info
    - Health status
    - Request count
    - Latency and throughput
Implementation:
class LocalSwarmMCPServer:
    def __init__(self, swarm_manager):
        self.swarm_manager = swarm_manager
        self.server = Server("local-swarm")
        self.register_tools()

    def register_tools(self):
        @self.server.list_tools()
        async def list_tools() -> List[Tool]:
            # Define tool schemas

        @self.server.call_tool()
        async def call_tool(name, arguments):
            # Handle tool calls
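A hedged sketch of one tool schema and its handler, assuming the mcp Python SDK's Tool and TextContent types (field names can vary between SDK versions); the handler wiring is illustrative:
from mcp.types import Tool, TextContent

HARDWARE_INFO_TOOL = Tool(
    name="get_hardware_info",
    description="Query system capabilities (OS, CPU cores, RAM, GPU, available memory)",
    inputSchema={"type": "object", "properties": {}, "required": []},
)

async def handle_get_hardware_info(hardware) -> List[TextContent]:
    # 'hardware' is the HardwareProfile detected at startup (assumed to be
    # reachable from the SwarmManager in the real implementation)
    summary = f"{hardware.os}, {hardware.cpu_cores} cores, {hardware.ram_gb:.1f} GB RAM"
    return [TextContent(type="text", text=summary)]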
5.5.2 Dual Server Mode
- Run HTTP API and MCP server simultaneously
- Shared SwarmManager instance
- HTTP API for external clients
- MCP for AI assistant integration
Usage:
# HTTP API only
python main.py
# HTTP API + MCP server
python main.py --mcp
5.5.3 Installation Scripts
Windows (scripts/install.bat):
@echo off
echo Installing Local Swarm...
python -m pip install --upgrade pip
pip install -r requirements.txt
:: Check for CUDA
nvidia-smi >nul 2>&1
if %errorlevel% == 0 (
echo CUDA detected, installing GPU support...
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
) else (
echo No CUDA detected, using CPU backend...
pip install llama-cpp-python
)
echo Installation complete!
echo Run: python -m local_swarm
macOS/Linux (scripts/install.sh):
#!/bin/bash
set -e
echo "Installing Local Swarm..."
pip install --upgrade pip
# Detect platform
if [[ "$OSTYPE" == "darwin"* ]]; then
echo "macOS detected..."
pip install -r requirements.txt
pip install -r requirements-macos.txt
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
echo "Linux detected..."
pip install -r requirements.txt
if command -v nvidia-smi &> /dev/null; then
echo "CUDA detected, installing GPU support..."
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
else
pip install llama-cpp-python
fi
fi
echo "Installation complete!"
echo "Run: python -m local_swarm"
Phase 6: Local Network Federation (Week 6)
6.1 Overview
Enable multiple machines on the same local network to form a "federated swarm". Each machine runs its own optimized swarm, but they coordinate to vote on the best responses across the entire network.
Example scenario:
- Windows PC (RTX 4060 Ti): 4 instances of Qwen 7B Q4
- Mac Mini (M1): 2 instances of Qwen 7B Q4
- MacBook (M4): 3 instances of Qwen 7B Q4
- Total: 9 instances voting on every request
6.2 Architecture
Federation modes:
- Independent Swarms with Cross-Voting (Recommended):
  - Each machine runs its own swarm locally
  - When a request comes in, each swarm generates responses internally
  - All swarms exchange their "best local" responses
  - Final vote across all best responses
  - Pros: Simple, resilient, uses each machine's optimal config
  - Cons: Slightly higher latency
- Distributed Workers (Advanced):
  - Single coordinator manages workers across all machines
  - Workers distributed based on capability
  - Pros: Optimal load balancing
  - Cons: Complex failure handling, network overhead
Implementation: Start with Mode 1 (Independent Swarms with Cross-Voting)
6.3 Discovery Protocol
File: src/network/discovery.py
mDNS/Bonjour-based discovery:
class SwarmDiscovery:
    """Discovers other Local Swarm instances on the local network."""

    def __init__(self, port: int = 8765):
        self.port = port
        self.peers: Dict[str, PeerInfo] = {}
        self.zeroconf = Zeroconf()

    def start_advertising(self, swarm_info: SwarmInfo):
        """Advertise this swarm on the network."""
        service_info = ServiceInfo(
            "_local-swarm._tcp.local.",
            f"{hostname}._local-swarm._tcp.local.",
            addresses=[self.get_ip()],
            port=self.port,
            properties={
                b"version": b"1.0",
                b"instances": str(swarm_info.instances).encode(),
                b"model": swarm_info.model_id.encode(),
                b"hardware": swarm_info.hardware_summary.encode(),
            }
        )
        self.zeroconf.register_service(service_info)

    def discover_peers(self) -> List[PeerInfo]:
        """Discover other swarms on the network."""
        browser = ServiceBrowser(
            self.zeroconf,
            "_local-swarm._tcp.local.",
            handlers=[self._on_service_state_change]
        )
Alternative: Simple UDP broadcast for environments without mDNS
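A minimal sketch of that UDP-broadcast fallback; the JSON message shape is illustrative and the port mirrors the discovery_port default (8765):
import json
import socket

def broadcast_presence(swarm_info: dict, port: int = 8765) -> None:
    """Announce this swarm to the local subnet with a single UDP broadcast datagram."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    payload = json.dumps({"service": "local-swarm", **swarm_info}).encode()
    sock.sendto(payload, ("255.255.255.255", port))
    sock.close()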
6.4 Federation Protocol
File: src/network/federation.py
HTTP-based communication:
class FederationClient:
    """Client for communicating with peer swarms."""

    async def request_vote(self, peer: PeerInfo, request: GenerationRequest) -> Vote:
        """Request a vote from a peer swarm."""
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"http://{peer.host}:{peer.port}/v1/federation/vote",
                json={
                    "prompt": request.prompt,
                    "context": request.context,
                    "request_id": request.id
                },
                timeout=aiohttp.ClientTimeout(total=30)
            ) as resp:
                return Vote(await resp.json())

    async def cast_vote(self, responses: List[GenerationResponse]) -> Vote:
        """Cast this swarm's vote on a set of responses."""
        # Use consensus engine to pick best local response
        best = await self.consensus.select_best(responses)
        return Vote(best, confidence=self._calculate_confidence(responses))
Endpoints:
- POST /v1/federation/vote - Request a vote from this swarm
- GET /v1/federation/health - Check peer health
- GET /v1/federation/info - Get swarm capabilities
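A hedged sketch of the server side of the vote endpoint, reusing the app, swarm_manager, and consensus_engine objects from the API server section; the request/response field names follow the client above and are otherwise illustrative:
@app.post("/v1/federation/vote")
async def federation_vote(body: dict):
    # Generate locally and return this swarm's best response as its vote
    responses = await swarm_manager.generate_all(body["prompt"])
    best = await consensus_engine.select_best(responses)
    return {
        "request_id": body.get("request_id"),
        "response": best,
        "confidence": 1.0,  # placeholder; the real implementation derives this from local agreement
    }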
6.5 Cross-Swarm Consensus
File: src/swarm/cross_consensus.py
Two-phase voting:
- Local Consensus (Phase 1):
  - Each swarm generates responses from all local instances
  - Runs local consensus to pick "best local" response
  - Returns to coordinator
- Global Consensus (Phase 2):
  - Coordinator collects all "best local" responses
  - Weights by swarm confidence (based on local agreement)
  - Returns highest-weighted response
class CrossSwarmConsensus:
    """Consensus across multiple networked swarms."""

    async def generate_with_federation(
        self,
        request: GenerationRequest,
        timeout: float = 30.0
    ) -> GenerationResponse:
        # Phase 1: Local generation
        local_responses = await self.local_swarm.generate_all(request)
        local_best = await self.local_consensus.select_best(local_responses)
        local_confidence = self._calculate_local_confidence(local_responses)

        # Phase 2: Request peer votes
        peer_votes = []
        for peer in self.discovery.get_peers():
            try:
                vote = await asyncio.wait_for(
                    self.federation.request_vote(peer, request),
                    timeout=5.0
                )
                peer_votes.append(vote)
            except asyncio.TimeoutError:
                logger.warning(f"Peer {peer.host} timed out")

        # Phase 3: Global consensus
        all_votes = [Vote(local_best, local_confidence)] + peer_votes
        best_vote = self._weighted_vote(all_votes)
        return best_vote.response

    def _calculate_local_confidence(self, responses: List[GenerationResponse]) -> float:
        """Calculate confidence based on local agreement."""
        # High agreement = high confidence
        # Use embedding similarity
        similarities = self._compute_pairwise_similarity(responses)
        return np.mean(similarities)

    def _weighted_vote(self, votes: List[Vote]) -> Vote:
        """Select best vote weighted by confidence."""
        # Could use weighted random selection or just pick highest
        return max(votes, key=lambda v: v.confidence)
6.6 Configuration
federation:
  enabled: true
  mode: "independent"  # independent, distributed
  discovery:
    method: "mdns"  # mdns, broadcast, static
    port: 8765
    advertise_interval: 30
  communication:
    port: 8766
    timeout: 30
  auth:
    enabled: false
    token: null
  consensus:
    strategy: "weighted"  # weighted, best_of_n, latency
    min_peers: 1  # Minimum peers for federation (0 = solo mode OK)
    max_peers: 10
Phase 7: Extended GPU Support (Week 7)
7.1 AMD GPU Support (ROCm)
File: src/hardware/amd.py
Detection:
def detect_amd_gpu() -> Optional[GPUInfo]:
    """Detect AMD GPU using ROCm."""
    try:
        # Try rocm-smi
        import subprocess
        result = subprocess.run(
            ["rocm-smi", "--showmeminfo", "vram"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            # Parse VRAM info
            vram_gb = parse_rocm_output(result.stdout)
            return GPUInfo(
                name="AMD GPU",
                vram_gb=vram_gb,
                is_amd=True
            )
    except FileNotFoundError:
        pass
    # Fallback to checking for AMD in PCI
    return detect_amd_via_pci()
Backend: llama.cpp with ROCm support
# Build llama.cpp with ROCm
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
Platforms: Linux (primary), Windows (experimental)
7.2 Intel GPU Support (OpenCL/OneAPI)
File: src/hardware/intel.py
Detection:
def detect_intel_gpu() -> Optional[GPUInfo]:
    """Detect Intel GPU using OneAPI or OpenCL."""
    # Try Intel GPU driver
    try:
        import subprocess
        result = subprocess.run(
            ["sycl-ls"],
            capture_output=True, text=True
        )
        if "Intel" in result.stdout:
            vram_gb = parse_sycl_output(result.stdout)
            return GPUInfo(
                name="Intel GPU",
                vram_gb=vram_gb,
                driver_version=get_intel_driver_version()
            )
    except FileNotFoundError:
        pass
    # Fallback to OpenCL
    return detect_intel_via_opencl()
Backend: llama.cpp with SYCL support
# Build llama.cpp with SYCL
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" \
pip install llama-cpp-python
Platforms: Linux, Windows
7.3 Qualcomm GPU Support (Android/Termux)
File: src/hardware/qualcomm.py
Detection:
def detect_qualcomm_gpu() -> Optional[GPUInfo]:
    """Detect Qualcomm Adreno GPU on Android/Termux."""
    if not is_termux():
        return None
    try:
        # Parse /proc/cpuinfo and /proc/meminfo
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()
        # Check for Qualcomm SoC
        if "Qualcomm" in cpuinfo or "Snapdragon" in cpuinfo:
            # Get RAM info (no separate VRAM on mobile)
            mem = psutil.virtual_memory()
            total_gb = mem.total / (1024**3)
            # Use 25% of RAM for LLM (very conservative on mobile)
            return GPUInfo(
                name="Qualcomm Adreno",
                vram_gb=total_gb * 0.25,
                is_mobile=True
            )
    except Exception:
        pass
    return None

def is_termux() -> bool:
    """Check if running in Termux environment."""
    return (
        os.environ.get("TERMUX_VERSION") is not None or
        os.path.exists("/data/data/com.termux/files/usr")
    )
Backend Options:
- llama.cpp on Termux:
  # In Termux
  pkg install cmake clang
  pip install llama-cpp-python
- QNN (Qualcomm Neural Network) - Advanced:
  - Use Qualcomm's SDK for optimized inference
  - Better performance but complex setup
Limitations:
- Models must be small (1-3B parameters)
- Quantization essential (Q4 or lower)
- Limited context window (2048 tokens)
- Slower than desktop GPUs
Platforms: Android (via Termux)
7.4 Hardware Detection Updates
Update src/hardware/detector.py to support new GPUs:
def detect_gpu() -> Optional[GPUInfo]:
    """Detect GPU based on platform."""
    os_name = detect_os()
    if os_name == "darwin":
        return detect_apple_gpu()
    # Priority: NVIDIA > AMD > Intel > Qualcomm
    gpu = detect_nvidia_gpu()
    if gpu:
        return gpu
    gpu = detect_amd_gpu()
    if gpu:
        return gpu
    gpu = detect_intel_gpu()
    if gpu:
        return gpu
    gpu = detect_qualcomm_gpu()
    if gpu:
        return gpu
    return None
Phase 8: Testing & Polish (Week 8)
8.1 Test Coverage
Unit tests:
- Hardware detection mocking (all GPU vendors)
- Model selection logic
- Consensus algorithm (local + cross-swarm)
- API endpoint validation
- Network discovery protocol
Integration tests:
- End-to-end inference
- Multi-worker coordination
- Cross-swarm voting
- Error handling
- Network partition scenarios
Platform tests:
- Windows with NVIDIA/AMD/Intel
- macOS with M1/M2/M3/M4
- Linux with NVIDIA/AMD/Intel
- CPU-only fallback
- Android (Termux) with Qualcomm
8.2 Performance Optimization
- Model warmup: Pre-load models on startup
- Request batching: Group similar requests
- Worker pooling: Reuse workers instead of respawning
- Memory monitoring: Auto-shutdown if OOM
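A small sketch of the memory-monitoring idea, assuming psutil and a shutdown() coroutine on SwarmManager; the 95% threshold and 5-second poll interval are illustrative:
import asyncio
import psutil

async def memory_watchdog(swarm_manager, threshold: float = 0.95, interval: float = 5.0):
    """Shut the swarm down before the OS runs out of memory."""
    while True:
        if psutil.virtual_memory().percent / 100.0 >= threshold:
            await swarm_manager.shutdown()  # assumed SwarmManager method
            break
        await asyncio.sleep(interval)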
8.3 Documentation ✅ COMPLETED
Created docs/GUIDE.md with:
- Quick Start Guide (all platforms)
- Opencode Configuration Examples:
- Basic setup
- Remote machine configuration
- Multiple model options
- Environment-specific configs
- API Reference (OpenAI-compatible endpoints)
- Troubleshooting Guide (common issues, platform-specific)
- Performance Tuning (speed vs quality, memory usage)
- Advanced Configuration (config.yaml, env vars)
- MCP Server setup
- Network Federation guide
Updated README.md:
- Added Documentation section with links
- Referenced complete guide
Technical Decisions
Why llama.cpp?
- Best cross-platform support
- Mature quantization formats (GGUF)
- Active community
- Good performance/quality tradeoff
Why MLX for macOS?
- Native Apple Silicon optimization
- Simpler than llama.cpp on macOS
- Better unified memory handling
Why consensus voting?
- Improves response quality vs single model
- Uses available hardware efficiently
- Can detect model hallucinations
Memory Model
External GPU (NVIDIA/AMD):
- Use 100% of VRAM
- Keep 10% buffer for OS/drivers
- Each instance gets equal share
Apple Silicon:
- Use 50% of unified RAM
- Avoid system swap
- Monitor memory pressure
CPU-only:
- Use 50% of system RAM
- Dependent on available memory
- Slower but functional
Future Enhancements
- Dynamic scaling: Add/remove workers based on load
- Model mixing: Different models in same swarm
- Fine-tuning: Local fine-tuning on user data
- Web UI: Browser-based configuration
- Docker support: Containerized deployment
- Cloud inference: Fallback to cloud APIs
- WebGPU support: Browser-based inference
- Persistent knowledge: RAG with local vector DB
Success Metrics
- Startup time: < 30 seconds from cold start
- First inference: < 10 seconds after startup
- Concurrent requests: Support 2-8 parallel inferences per machine
- Consensus accuracy: > 80% agreement on code tasks
- Memory efficiency: Use > 80% of available memory
- Cross-platform: Works on Windows/macOS/Linux without code changes
- GPU support: NVIDIA, AMD, Intel, Apple Silicon, Qualcomm
- Network federation: Auto-discovery within 10 seconds
- Federated consensus: Scale to 5+ machines (25+ instances)
- Mobile support: Functional on Android/Termux (3B models)