# Local Swarm - Detailed Implementation Plan

## Overview
A terminal-based tool that automatically configures and runs a swarm of small coding LLMs optimized for your hardware, exposing an OpenAI-compatible API for integration with opencode and other tools.

## Architecture

```
local_swarm/
├── src/
│   ├── __init__.py
│   ├── hardware/
│   │   ├── __init__.py
│   │   ├── detector.py          # Platform-agnostic hardware detection
│   │   ├── nvidia.py            # NVIDIA GPU detection (Windows/Linux)
│   │   ├── amd.py               # AMD GPU detection (ROCm)
│   │   ├── intel.py             # Intel GPU detection (OneAPI/OpenCL)
│   │   ├── qualcomm.py          # Qualcomm/Adreno detection (Android)
│   │   ├── apple_silicon.py     # Apple Silicon detection (macOS)
│   │   └── memory.py            # RAM detection
│   ├── models/
│   │   ├── __init__.py
│   │   ├── registry.py          # Model database with specs
│   │   ├── selector.py          # Optimal model/quant selection logic
│   │   └── downloader.py        # Download manager (HuggingFace)
│   ├── backends/
│   │   ├── __init__.py
│   │   ├── base.py              # Backend interface
│   │   ├── llamacpp.py          # llama.cpp backend (CUDA/ROCm/SYCL)
│   │   └── mlx.py               # MLX backend (macOS)
│   ├── swarm/
│   │   ├── __init__.py
│   │   ├── manager.py           # Instance lifecycle management
│   │   ├── worker.py            # Individual LLM instance wrapper
│   │   ├── consensus.py         # Local voting/consensus algorithm
│   │   └── cross_consensus.py   # Cross-swarm consensus
│   ├── network/
│   │   ├── __init__.py
│   │   ├── discovery.py         # mDNS/Bonjour peer discovery
│   │   ├── federation.py        # Inter-swarm communication
│   │   └── protocol.py          # Network protocol definitions
│   └── api/
│       ├── __init__.py
│       ├── server.py            # FastAPI/uvicorn server
│       ├── routes.py            # OpenAI-compatible endpoints
│       ├── federation.py        # Federation endpoints
│       └── middleware.py        # Request handling
├── tests/
├── config/
│   └── models.yaml              # Model configurations
├── scripts/
│   ├── install.bat              # Windows installer
│   ├── install.sh               # Unix installer
│   └── install-termux.sh        # Android/Termux installer
├── main.py                      # CLI entry point
├── requirements.txt
├── requirements-macos.txt       # MLX-specific deps
├── requirements-termux.txt      # Android/Termux deps
├── setup.py
└── .gitignore
```

## Network Federation Architecture

When multiple machines run Local Swarm on the same network, they can form a "federated swarm":

```
                     Local Network

┌──────────────┐   ┌──────────────┐   ┌──────────────┐
│  Windows PC  │   │   Mac Mini   │   │   MacBook    │
│  (RTX 4060)  │   │     (M1)     │   │     (M4)     │
│              │   │              │   │              │
│ ┌──────────┐ │   │ ┌──────────┐ │   │ ┌──────────┐ │
│ │ Swarm 1  │ │   │ │ Swarm 2  │ │   │ │ Swarm 3  │ │
│ │ 4 inst.  │ │   │ │ 2 inst.  │ │   │ │ 3 inst.  │ │
│ │ Qwen 7B  │ │   │ │ Qwen 7B  │ │   │ │ Qwen 7B  │ │
│ └────┬─────┘ │   │ └────┬─────┘ │   │ └────┬─────┘ │
│      │ mDNS  │   │      │ mDNS  │   │      │ mDNS  │
└──────┼───────┘   └──────┼───────┘   └──────┼───────┘
       │                  │                  │
       └──────────────────┼──────────────────┘
                          │
                 ┌────────┴────────┐
                 │   Federation    │
                 │   Coordinator   │
                 │                 │
                 │  ┌───────────┐  │
                 │  │ Consensus │  │
                 │  │  Engine   │  │
                 │  └───────────┘  │
                 └────────┬────────┘
                          │
                 ┌────────▼────────┐
                 │    opencode     │
                 │    (Client)     │
                 └─────────────────┘

Federation Flow:
1. Each swarm independently detects hardware and starts instances
2. Swarms advertise themselves via mDNS/Bonjour
3. When a request comes in, each swarm generates local responses
4. Local consensus picks the best response per swarm
5. Cross-swarm consensus votes across all best responses
6. Final answer returned to opencode
```

**Benefits**:
- Utilize all hardware in your home/office
- Each machine optimizes for its own specs
- No single point of failure
- Automatic load distribution
- Works even if one machine goes offline

## Implementation Phases

### Phase 1: Foundation (Week 1)

#### 1.1 Hardware Detection Module
**File**: `src/hardware/detector.py`

**Requirements**:
- Cross-platform OS detection (Windows, macOS, Linux)
- CPU info (cores, architecture)
- RAM detection (total, available)
- GPU detection with VRAM

**Platform-specific implementations**:
- **Windows**: Use `pynvml` for NVIDIA (see the sketch below), fall back to DirectX for others
- **macOS**: Use `psutil` for RAM, `sysctl` for CPU, Metal API for GPU
- **Linux**: Use `pynvml` for NVIDIA, `rocm-smi` for AMD
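
As a reference for the NVIDIA path, here is a minimal detection sketch using `pynvml` (NVML). The `GPUInfo` type and its fields are the project's own and are assumed from the profile below; error handling is illustrative, not exhaustive.

```python
from typing import Optional

import pynvml

def detect_nvidia_gpu() -> Optional["GPUInfo"]:
    """Minimal NVIDIA probe via NVML; returns None when no NVIDIA GPU/driver is present."""
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml versions return bytes
            name = name.decode()
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return GPUInfo(name=name, vram_gb=mem.total / (1024 ** 3))
    finally:
        pynvml.nvmlShutdown()
```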

**Output structure**:
```python
class HardwareProfile:
    os: str                  # 'windows', 'darwin', 'linux'
    cpu_cores: int
    ram_gb: float
    gpu: Optional[GPUInfo]
    is_apple_silicon: bool
```

**Model selection rules** (a memory-budget sketch follows the list):
- External GPU (NVIDIA/AMD): Use 100% of VRAM
- Apple Silicon: Use 50% of unified RAM
- CPU-only: Use 50% of system RAM
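
A minimal sketch of how these rules could become the memory budget used by the selector in 1.3 (`get_available_memory` is referenced there; exact field names are assumptions from `HardwareProfile` above):

```python
def get_available_memory(hw: HardwareProfile) -> float:
    """Memory budget in GB for LLM instances, per the rules above."""
    if hw.gpu and not hw.is_apple_silicon:
        return hw.gpu.vram_gb      # external GPU: use 100% of VRAM
    if hw.is_apple_silicon:
        return hw.ram_gb * 0.5     # Apple Silicon: 50% of unified RAM
    return hw.ram_gb * 0.5         # CPU-only: 50% of system RAM
```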

#### 1.2 Model Registry
**File**: `src/models/registry.py`

**Model database** (YAML format):
```yaml
models:
  qwen2.5-coder:
    name: "Qwen 2.5 Coder"
    description: "Alibaba's code-focused model"
    variants:
      - size: 3b
        base_vram_gb: 6.0    # Approximate VRAM for fp16
        quantizations:
          q4_k_m:
            vram_gb: 1.8
            quality: "good"
          q5_k_m:
            vram_gb: 2.2
            quality: "better"
          q6_k:
            vram_gb: 2.6
            quality: "best"
      - size: 7b
        base_vram_gb: 14.0
        quantizations:
          q4_k_m:
            vram_gb: 4.5
          q5_k_m:
            vram_gb: 5.2
          q6_k:
            vram_gb: 6.0

  codellama:
    name: "CodeLlama"
    # Similar structure...

  deepseek-coder:
    name: "DeepSeek Coder"
    # Similar structure...
```

**Selection priority**:
1. Qwen 2.5 Coder (best for small sizes)
2. DeepSeek Coder (good alternative)
3. CodeLlama (fallback)

#### 1.3 Model Selector Logic
**File**: `src/models/selector.py`

**Algorithm**:
```python
def select_optimal_model(hardware: HardwareProfile) -> ModelConfig:
    available_memory = get_available_memory(hardware)

    # Try models in priority order
    for model in PRIORITY_MODELS:
        # Find largest size that fits
        for variant in reversed(model.variants):
            # Try highest quantization that fits
            for quant in reversed(variant.quantizations):
                total_vram_needed = quant.vram_gb * MIN_INSTANCES
                if total_vram_needed <= available_memory:
                    # Calculate max instances
                    max_instances = int(available_memory // quant.vram_gb)
                    # Cap at reasonable limit (e.g., 8)
                    instances = min(max_instances, 8)
                    return ModelConfig(model, variant, quant, instances)

    # Fallback to smallest model
    return FALLBACK_CONFIG
```

**Minimum instances**: 2 (for consensus voting)
**Maximum instances**: 8 (to avoid overhead)

### Phase 2: Backend Integration (Week 2)

#### 2.1 Base Backend Interface
**File**: `src/backends/base.py`

```python
from abc import ABC, abstractmethod
from typing import AsyncIterator

class LLMBackend(ABC):
    @abstractmethod
    async def load_model(self, model_path: str, config: dict) -> bool:
        pass

    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        pass

    @abstractmethod
    def get_memory_usage(self) -> float:
        """Return current VRAM/RAM usage in GB"""
        pass

    @abstractmethod
    def shutdown(self):
        pass
```

#### 2.2 llama.cpp Backend
**File**: `src/backends/llamacpp.py`

**Implementation**:
- Use the `llama-cpp-python` library (loading sketch below)
- Support GGUF model format
- GPU acceleration via CUDA/Metal
- Server mode with HTTP API

**Key features**:
- Model caching to avoid reload
- Context window management
- Batch processing support
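
For reference, loading and querying a GGUF model with `llama-cpp-python` looks roughly like this; the model path and generation parameters are placeholders:

```python
from llama_cpp import Llama

# n_gpu_layers=-1 offloads all layers to the GPU when a CUDA/Metal build is installed
llm = Llama(model_path="models/qwen2.5-coder-7b-q4_k_m.gguf", n_ctx=4096, n_gpu_layers=-1)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
    max_tokens=256,
)
print(out["choices"][0]["message"]["content"])
```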

**Memory calculation**:
```python
import os

def calculate_memory_usage(model_path: str) -> float:
    """Estimate VRAM needed for a GGUF model, in GB."""
    # Ideally parse the GGUF metadata; file size plus ~10% runtime overhead is a reasonable first pass.
    return os.path.getsize(model_path) / (1024 ** 3) * 1.1
```

#### 2.3 MLX Backend (macOS)
**File**: `src/backends/mlx.py`

**Implementation**:
- Use the `mlx-lm` library (usage sketch below)
- Support MLX format models
- Optimized for Apple Silicon

**Key differences from llama.cpp**:
- Native Metal performance
- Simpler API
- Unified memory model
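
A minimal `mlx-lm` usage sketch; the model id is illustrative (any MLX-format model from the Hugging Face hub works the same way):

```python
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Qwen2.5-Coder-7B-Instruct-4bit")
text = generate(model, tokenizer, prompt="Write a Python function that reverses a string.", max_tokens=256)
print(text)
```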

### Phase 3: Swarm Management (Week 3)

#### 3.1 Worker Instance
**File**: `src/swarm/worker.py`

Each worker manages:
- One LLM instance
- Request queue
- Health monitoring
- Metrics collection

```python
import time

class SwarmWorker:
    def __init__(self, worker_id: int, backend: LLMBackend, config: dict):
        self.worker_id = worker_id
        self.backend = backend
        self.is_healthy = True
        self.request_count = 0
        self.avg_latency = 0.0

    async def process(self, request: GenerationRequest) -> GenerationResponse:
        start = time.time()
        response = await self.backend.generate(**request.params)
        latency = time.time() - start
        self._update_metrics(latency)
        return GenerationResponse(response, latency, self.worker_id)
```

#### 3.2 Swarm Manager
**File**: `src/swarm/manager.py`

Responsibilities:
- Spawn N workers based on hardware
- Distribute requests to all workers
- Collect responses
- Handle worker failures

```python
import asyncio
from typing import List

class SwarmManager:
    def __init__(self, config: ModelConfig):
        self.workers: List[SwarmWorker] = []
        self.config = config

    async def initialize(self):
        # Download model if needed
        model_path = await self._ensure_model()

        # Spawn workers
        for i in range(self.config.instances):
            backend = self._create_backend()
            await backend.load_model(model_path, self.config.backend_params)
            worker = SwarmWorker(i, backend, self.config)
            self.workers.append(worker)

    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        # Send the same request to all workers in parallel
        request = GenerationRequest(params={"prompt": prompt, **kwargs})
        tasks = [w.process(request) for w in self.workers]
        return await asyncio.gather(*tasks)
```

#### 3.3 Consensus Algorithm
**File**: `src/swarm/consensus.py`

**Voting strategies**:

1. **Similarity voting** (default):
   - Embed all responses
   - Group by semantic similarity
   - Return largest group

2. **Quality scoring**:
   - Score each response on:
     - Completeness (does it answer the question?)
     - Code quality (syntax, structure)
     - Length appropriateness
   - Return highest score

3. **Latency-weighted**:
   - Prefer faster responses (lower memory pressure)

**Implementation**:
```python
class ConsensusEngine:
    def __init__(self, strategy: str = "similarity"):
        self.strategy = strategy
        self.embedding_model = None  # Lazy load

    async def select_best(self, responses: List[GenerationResponse]) -> str:
        if len(responses) == 1:
            return responses[0].text

        if self.strategy == "similarity":
            return await self._similarity_vote(responses)
        elif self.strategy == "quality":
            return await self._quality_score(responses)
        else:
            return self._fastest_response(responses)

    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        # Use sentence-transformers for embeddings
        # Group by cosine similarity > 0.85
        # Return median response from largest group
```
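
A minimal sketch of the similarity vote, assuming `sentence-transformers` is available; it returns the response that agrees (cosine similarity > 0.85) with the most peers, a practical stand-in for "median response from the largest group":

```python
from sentence_transformers import SentenceTransformer, util

async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
    if self.embedding_model is None:
        self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
    texts = [r.text for r in responses]
    embeddings = self.embedding_model.encode(texts, convert_to_tensor=True)
    sims = util.cos_sim(embeddings, embeddings)
    # For each response, count how many peers it agrees with; return the best-supported one
    support = [int((sims[i] > 0.85).sum().item()) for i in range(len(texts))]
    return texts[support.index(max(support))]
```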

### Phase 4: API Server (Week 4)

#### 4.1 OpenAI-Compatible Endpoints
**File**: `src/api/routes.py`

Required endpoints:
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion
- `POST /v1/completions` - Text completion (optional)
- `GET /health` - Health check
- `GET /metrics` - Prometheus metrics (optional)

**Chat completions endpoint**:
```python
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    # Extract messages
    messages = request.messages
    prompt = format_messages(messages)

    # Get all responses from swarm
    responses = await swarm_manager.generate_all(prompt, **request.params)

    # Run consensus
    best_response = await consensus_engine.select_best(responses)

    # Format as OpenAI response
    return {
        "id": f"chatcmpl-{uuid4()}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": best_response},
            "finish_reason": "stop"
        }],
        "usage": calculate_usage(prompt, best_response)
    }
```
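
`format_messages` is not specified above; a minimal, model-agnostic flattening might look like this (production code should use the model's own chat template):

```python
def format_messages(messages: list) -> str:
    """Flatten OpenAI-style chat messages into a single prompt string."""
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    lines.append("assistant:")
    return "\n".join(lines)
```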

#### 4.2 Streaming Support
**File**: `src/api/routes.py`

For streaming, use the fastest worker instead of consensus:
```python
if request.stream:
    # Pick worker with lowest latency
    worker = swarm_manager.get_fastest_worker()
    return StreamingResponse(
        worker.stream_generate(prompt),
        media_type="text/event-stream"
    )
```
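
OpenAI-compatible clients expect Server-Sent Events carrying `chat.completion.chunk` objects and a final `data: [DONE]` marker. A sketch of wrapping the worker's token stream in that format (the helper name is hypothetical):

```python
import json
import time
from uuid import uuid4

async def as_openai_sse(token_stream, model: str):
    """Wrap raw tokens from worker.stream_generate() in OpenAI-style SSE chunks."""
    stream_id = f"chatcmpl-{uuid4()}"
    async for token in token_stream:
        chunk = {
            "id": stream_id,
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model,
            "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"
```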

### Phase 5: CLI & Interactive Interface (Week 5)

#### 5.1 Interactive Menu System
**File**: `src/interactive.py`

**Features**:
- **Hardware Display**: Detailed hardware info with formatting
  - OS, CPU cores, RAM (total/available)
  - GPU details (name, VRAM, driver, type)
  - Memory allocation rules

- **Model Selection Menu**: Three configuration options
  1. **Recommended Configuration**: Auto-detects optimal model + instances
  2. **Browse All Configurations**: Lists all feasible models for the hardware
  3. **Custom Configuration**: Step-by-step wizard
     - Select model family (Qwen, DeepSeek, CodeLlama)
     - Choose model size (3B, 7B, 14B)
     - Pick quantization (Q4_K_M, Q5_K_M, Q6_K)
     - Specify instance count (1 to max supported)

- **Resource Usage Monitor**: Real-time swarm status
  - Swarm status (running/stopped)
  - Current model name
  - Healthy workers count
  - Memory usage (total and per-worker)
  - Worker statistics:
    - Total requests served
    - Average latency
    - Tokens per second

- **Startup Summary**: Comprehensive display showing:
  - Hardware detection section
  - Model configuration section
  - Resource usage section
  - Memory utilization percentage

**Implementation**:
```python
def interactive_model_selection(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Show hardware info
    # Display menu with 3 options
    # Return selected configuration

def custom_configuration(hardware: HardwareProfile) -> Optional[ModelConfig]:
    # Step-by-step wizard
    # Select model -> size -> quantization -> instances
    # Validate memory constraints

def show_startup_summary(hardware, config, swarm_manager=None):
    # Clear screen
    # Print formatted hardware, config, and usage info
```

#### 5.2 CLI Interface
**File**: `main.py`

Commands:
```bash
# Start the swarm (auto-detect hardware)
python -m local_swarm

# Start with a specific model
python -m local_swarm --model qwen2.5-coder:3b:q4

# Start with a specific port
python -m local_swarm --port 8080

# Override instance count
python -m local_swarm --instances 4

# Show hardware detection
python -m local_swarm --detect

# Download models only
python -m local_swarm --download-only
```
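
A minimal flag-parsing sketch for these commands; the flag names match the examples above, and the defaults are assumptions:

```python
import argparse

def parse_args() -> argparse.Namespace:
    parser = argparse.ArgumentParser(prog="local_swarm")
    parser.add_argument("--model", help="model spec, e.g. qwen2.5-coder:3b:q4")
    parser.add_argument("--port", type=int, default=8000, help="API server port")
    parser.add_argument("--instances", type=int, help="override instance count")
    parser.add_argument("--detect", action="store_true", help="print hardware detection and exit")
    parser.add_argument("--download-only", action="store_true", help="download models and exit")
    parser.add_argument("--mcp", action="store_true", help="also start the MCP server (Phase 5.5)")
    return parser.parse_args()
```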

#### 5.3 Configuration File
**File**: `config.yaml`

```yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8
  timeout: 60

models:
  cache_dir: "~/.local_swarm/models"
  preferred_models:
    - qwen2.5-coder
    - deepseek-coder
    - codellama

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5         # Use 50% of system RAM for CPU/Apple Silicon

network:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  advertise_interval: 30
  max_peers: 10
  auth_token: null  # Optional auth for security
```

### Phase 5.5: MCP Server (Week 5)

#### 5.5.1 MCP Protocol Implementation
**File**: `src/mcp_server.py`

**Features**:
- **MCP Server Class**: `LocalSwarmMCPServer` implementing the MCP protocol
- **Stdio Transport**: Communication via standard input/output
- **Tool Registration**: 5 MCP tools for AI assistants:

  1. **`get_hardware_info`** - Query system capabilities
     - OS, CPU cores, RAM
     - GPU name, VRAM, type
     - Available memory for LLMs

  2. **`get_swarm_status`** - Check swarm health
     - Running/stopped status
     - Model name
     - Healthy workers count
     - Total memory usage

  3. **`generate_code`** - Generate with consensus
     - Input: prompt, max_tokens, temperature
     - Returns: generated code with metadata
     - Shows: strategy used, confidence, latency

  4. **`list_available_models`** - Browse models
     - All available models
     - Variants per model
     - Quantization options

  5. **`get_worker_details`** - Worker statistics
     - Per-worker backend info
     - Health status
     - Request count
     - Latency and throughput

**Implementation**:
```python
class LocalSwarmMCPServer:
    def __init__(self, swarm_manager):
        self.server = Server("local-swarm")
        self.register_tools()

    def register_tools(self):
        @self.server.list_tools()
        async def list_tools() -> List[Tool]:
            # Define tool schemas
            ...

        @self.server.call_tool()
        async def call_tool(name, arguments):
            # Handle tool calls
            ...
```
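
As an illustration of what `list_tools()` might return, here is one tool schema written against the `Tool` type of the official `mcp` Python SDK; the exact SDK field names are an assumption, though the JSON-schema shape follows the MCP spec:

```python
from mcp.types import Tool

GENERATE_CODE_TOOL = Tool(
    name="generate_code",
    description="Generate code using the local swarm with consensus voting",
    inputSchema={
        "type": "object",
        "properties": {
            "prompt": {"type": "string", "description": "What to generate"},
            "max_tokens": {"type": "integer", "default": 512},
            "temperature": {"type": "number", "default": 0.2},
        },
        "required": ["prompt"],
    },
)
```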

#### 5.5.2 Dual Server Mode
- Run HTTP API and MCP server simultaneously
- Shared SwarmManager instance
- HTTP API for external clients
- MCP for AI assistant integration

**Usage**:
```bash
# HTTP API only
python main.py

# HTTP API + MCP server
python main.py --mcp
```
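
A sketch of how the two servers could share one event loop and one `SwarmManager`; it assumes `LocalSwarmMCPServer` exposes an async `run()` over stdio, which is not specified above:

```python
import asyncio
import uvicorn

async def run_dual_mode(app, swarm_manager):
    """Serve the HTTP API and the MCP server concurrently on a shared SwarmManager."""
    api = uvicorn.Server(uvicorn.Config(app, host="127.0.0.1", port=8000))
    mcp = LocalSwarmMCPServer(swarm_manager)
    await asyncio.gather(api.serve(), mcp.run())
```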

#### 5.5.3 Installation Scripts

**Windows** (`scripts/install.bat`):
```batch
@echo off
echo Installing Local Swarm...
python -m pip install --upgrade pip
pip install -r requirements.txt

:: Check for CUDA
nvidia-smi >nul 2>&1
if %errorlevel% == 0 (
    echo CUDA detected, installing GPU support...
    pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
) else (
    echo No CUDA detected, using CPU backend...
    pip install llama-cpp-python
)

echo Installation complete!
echo Run: python -m local_swarm
```

**macOS/Linux** (`scripts/install.sh`):
```bash
#!/bin/bash
set -e

echo "Installing Local Swarm..."
pip install --upgrade pip

# Detect platform
if [[ "$OSTYPE" == "darwin"* ]]; then
    echo "macOS detected..."
    pip install -r requirements.txt
    pip install -r requirements-macos.txt
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
    echo "Linux detected..."
    pip install -r requirements.txt
    if command -v nvidia-smi &> /dev/null; then
        echo "CUDA detected, installing GPU support..."
        pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
    else
        pip install llama-cpp-python
    fi
fi

echo "Installation complete!"
echo "Run: python -m local_swarm"
```

### Phase 6: Local Network Federation (Week 6)

#### 6.1 Overview
Enable multiple machines on the same local network to form a "federated swarm". Each machine runs its own optimized swarm, but they coordinate to vote on the best responses across the entire network.

**Example scenario**:
- Windows PC (RTX 4060 Ti): 4 instances of Qwen 7B Q4
- Mac Mini (M1): 2 instances of Qwen 7B Q4
- MacBook (M4): 3 instances of Qwen 7B Q4
- Total: 9 instances voting on every request

#### 6.2 Architecture

**Federation modes**:

1. **Independent Swarms with Cross-Voting** (Recommended):
   - Each machine runs its own swarm locally
   - When a request comes in, each swarm generates responses internally
   - All swarms exchange their "best local" responses
   - Final vote across all best responses
   - Pros: Simple, resilient, uses each machine's optimal config
   - Cons: Slightly higher latency

2. **Distributed Workers** (Advanced):
   - Single coordinator manages workers across all machines
   - Workers distributed based on capability
   - Pros: Optimal load balancing
   - Cons: Complex failure handling, network overhead

**Implementation**: Start with Mode 1 (Independent Swarms with Cross-Voting)

#### 6.3 Discovery Protocol
**File**: `src/network/discovery.py`

**mDNS/Bonjour-based discovery**:
```python
class SwarmDiscovery:
    """Discovers other Local Swarm instances on the local network."""

    def __init__(self, port: int = 8765):
        self.port = port
        self.peers: Dict[str, PeerInfo] = {}
        self.zeroconf = Zeroconf()

    def start_advertising(self, swarm_info: SwarmInfo):
        """Advertise this swarm on the network."""
        service_info = ServiceInfo(
            "_local-swarm._tcp.local.",
            f"{hostname}._local-swarm._tcp.local.",
            addresses=[self.get_ip()],
            port=self.port,
            properties={
                b"version": b"1.0",
                b"instances": str(swarm_info.instances).encode(),
                b"model": swarm_info.model_id.encode(),
                b"hardware": swarm_info.hardware_summary.encode(),
            }
        )
        self.zeroconf.register_service(service_info)

    def discover_peers(self) -> List[PeerInfo]:
        """Discover other swarms on the network."""
        browser = ServiceBrowser(
            self.zeroconf,
            "_local-swarm._tcp.local.",
            handlers=[self._on_service_state_change]
        )
```

**Alternative**: Simple UDP broadcast for environments without mDNS
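
A minimal sketch of that UDP-broadcast fallback using only the standard library; the JSON payload shape and the reuse of the discovery port are assumptions consistent with the config above:

```python
import json
import socket

BCAST_PORT = 8765  # matches discovery_port in config.yaml

def broadcast_presence(swarm_info: dict) -> None:
    """Announce this swarm to the local subnet."""
    payload = json.dumps({"service": "local-swarm", **swarm_info}).encode()
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
        sock.sendto(payload, ("255.255.255.255", BCAST_PORT))

def listen_for_peers(timeout: float = 5.0) -> list:
    """Collect announcements from other swarms for `timeout` seconds."""
    peers = []
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.bind(("", BCAST_PORT))
        sock.settimeout(timeout)
        try:
            while True:
                data, addr = sock.recvfrom(4096)
                msg = json.loads(data)
                if msg.get("service") == "local-swarm":
                    peers.append({"host": addr[0], **msg})
        except socket.timeout:
            pass
    return peers
```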

#### 6.4 Federation Protocol
**File**: `src/network/federation.py`

**HTTP-based communication**:
```python
class FederationClient:
    """Client for communicating with peer swarms."""

    async def request_vote(self, peer: PeerInfo, request: GenerationRequest) -> Vote:
        """Request a vote from a peer swarm."""
        async with aiohttp.ClientSession() as session:
            async with session.post(
                f"http://{peer.host}:{peer.port}/v1/federation/vote",
                json={
                    "prompt": request.prompt,
                    "context": request.context,
                    "request_id": request.id
                },
                timeout=aiohttp.ClientTimeout(total=30)
            ) as resp:
                return Vote(await resp.json())

    async def cast_vote(self, responses: List[GenerationResponse]) -> Vote:
        """Cast this swarm's vote on a set of responses."""
        # Use consensus engine to pick best local response
        best = await self.consensus.select_best(responses)
        return Vote(best, confidence=self._calculate_confidence(responses))
```

**Endpoints**:
- `POST /v1/federation/vote` - Request a vote from this swarm (handler sketch below)
- `GET /v1/federation/health` - Check peer health
- `GET /v1/federation/info` - Get swarm capabilities
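
A sketch of the peer-facing vote handler; the request fields mirror `FederationClient.request_vote` above, while the response shape is an assumption:

```python
@app.post("/v1/federation/vote")
async def federation_vote(req: dict):
    # Generate locally, run local consensus, and return this swarm's best answer as its vote
    responses = await swarm_manager.generate_all(req["prompt"])
    best = await consensus_engine.select_best(responses)
    return {"request_id": req.get("request_id"), "response": best}
```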

#### 6.5 Cross-Swarm Consensus
**File**: `src/swarm/cross_consensus.py`

**Two-phase voting**:

1. **Local Consensus** (Phase 1):
   - Each swarm generates responses from all local instances
   - Runs local consensus to pick the "best local" response
   - Returns it to the coordinator

2. **Global Consensus** (Phase 2):
   - Coordinator collects all "best local" responses
   - Weights by swarm confidence (based on local agreement)
   - Returns highest-weighted response

```python
import asyncio
import numpy as np

class CrossSwarmConsensus:
    """Consensus across multiple networked swarms."""

    async def generate_with_federation(
        self,
        request: GenerationRequest,
        timeout: float = 30.0
    ) -> GenerationResponse:
        # Phase 1: Local generation
        local_responses = await self.local_swarm.generate_all(request)
        local_best = await self.local_consensus.select_best(local_responses)
        local_confidence = self._calculate_local_confidence(local_responses)

        # Phase 2: Request peer votes
        peer_votes = []
        for peer in self.discovery.get_peers():
            try:
                vote = await asyncio.wait_for(
                    self.federation.request_vote(peer, request),
                    timeout=5.0
                )
                peer_votes.append(vote)
            except asyncio.TimeoutError:
                logger.warning(f"Peer {peer.host} timed out")

        # Phase 3: Global consensus
        all_votes = [Vote(local_best, local_confidence)] + peer_votes
        best_vote = self._weighted_vote(all_votes)

        return best_vote.response

    def _calculate_local_confidence(self, responses: List[GenerationResponse]) -> float:
        """Calculate confidence based on local agreement."""
        # High agreement = high confidence
        # Use embedding similarity
        similarities = self._compute_pairwise_similarity(responses)
        return np.mean(similarities)

    def _weighted_vote(self, votes: List[Vote]) -> Vote:
        """Select the best vote weighted by confidence."""
        # Could use weighted random selection or just pick the highest
        return max(votes, key=lambda v: v.confidence)
```

#### 6.6 Configuration

```yaml
federation:
  enabled: true
  mode: "independent"     # independent, distributed
  discovery:
    method: "mdns"        # mdns, broadcast, static
    port: 8765
    advertise_interval: 30
  communication:
    port: 8766
    timeout: 30
  auth:
    enabled: false
    token: null
  consensus:
    strategy: "weighted"  # weighted, best_of_n, latency
    min_peers: 1          # Minimum peers for federation (0 = solo mode OK)
    max_peers: 10
```

### Phase 7: Extended GPU Support (Week 7)

#### 7.1 AMD GPU Support (ROCm)
**File**: `src/hardware/amd.py`

**Detection**:
```python
def detect_amd_gpu() -> Optional[GPUInfo]:
    """Detect AMD GPU using ROCm."""
    try:
        # Try rocm-smi
        import subprocess
        result = subprocess.run(
            ["rocm-smi", "--showmeminfo", "vram"],
            capture_output=True, text=True
        )
        if result.returncode == 0:
            # Parse VRAM info
            vram_gb = parse_rocm_output(result.stdout)
            return GPUInfo(
                name="AMD GPU",
                vram_gb=vram_gb,
                is_amd=True
            )
    except FileNotFoundError:
        pass

    # Fallback to checking for AMD in PCI
    return detect_amd_via_pci()
```

**Backend**: llama.cpp with ROCm support
```bash
# Build llama.cpp with ROCm
CMAKE_ARGS="-DLLAMA_HIPBLAS=on" pip install llama-cpp-python
```

**Platforms**: Linux (primary), Windows (experimental)

#### 7.2 Intel GPU Support (OpenCL/OneAPI)
**File**: `src/hardware/intel.py`

**Detection**:
```python
def detect_intel_gpu() -> Optional[GPUInfo]:
    """Detect Intel GPU using OneAPI or OpenCL."""
    # Try Intel GPU driver
    try:
        import subprocess
        result = subprocess.run(
            ["sycl-ls"],
            capture_output=True, text=True
        )
        if "Intel" in result.stdout:
            vram_gb = parse_sycl_output(result.stdout)
            return GPUInfo(
                name="Intel GPU",
                vram_gb=vram_gb,
                driver_version=get_intel_driver_version()
            )
    except FileNotFoundError:
        pass

    # Fallback to OpenCL
    return detect_intel_via_opencl()
```

**Backend**: llama.cpp with SYCL support
```bash
# Build llama.cpp with SYCL
CMAKE_ARGS="-DLLAMA_SYCL=on -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx" \
  pip install llama-cpp-python
```

**Platforms**: Linux, Windows

#### 7.3 Qualcomm GPU Support (Android/Termux)
**File**: `src/hardware/qualcomm.py`

**Detection**:
```python
import os
import psutil

def detect_qualcomm_gpu() -> Optional[GPUInfo]:
    """Detect Qualcomm Adreno GPU on Android/Termux."""
    if not is_termux():
        return None

    try:
        # Parse /proc/cpuinfo and /proc/meminfo
        with open("/proc/cpuinfo") as f:
            cpuinfo = f.read()

        # Check for a Qualcomm SoC
        if "Qualcomm" in cpuinfo or "Snapdragon" in cpuinfo:
            # Get RAM info (no separate VRAM on mobile)
            mem = psutil.virtual_memory()
            total_gb = mem.total / (1024**3)

            # Use 25% of RAM for the LLM (very conservative on mobile)
            return GPUInfo(
                name="Qualcomm Adreno",
                vram_gb=total_gb * 0.25,
                is_mobile=True
            )
    except Exception:
        pass

    return None

def is_termux() -> bool:
    """Check if running in the Termux environment."""
    return (
        os.environ.get("TERMUX_VERSION") is not None or
        os.path.exists("/data/data/com.termux/files/usr")
    )
```

**Backend Options**:

1. **llama.cpp on Termux**:
   ```bash
   # In Termux
   pkg install cmake clang
   pip install llama-cpp-python
   ```

2. **QNN (Qualcomm Neural Network)** - Advanced:
   - Use Qualcomm's SDK for optimized inference
   - Better performance but complex setup

**Limitations**:
- Models must be small (1-3B parameters)
- Quantization essential (Q4 or lower)
- Limited context window (2048 tokens)
- Slower than desktop GPUs

**Platforms**: Android (via Termux)

#### 7.4 Hardware Detection Updates

Update `src/hardware/detector.py` to support the new GPUs:

```python
def detect_gpu() -> Optional[GPUInfo]:
    """Detect GPU based on platform."""
    os_name = detect_os()

    if os_name == "darwin":
        return detect_apple_gpu()

    # Priority: NVIDIA > AMD > Intel > Qualcomm
    gpu = detect_nvidia_gpu()
    if gpu:
        return gpu

    gpu = detect_amd_gpu()
    if gpu:
        return gpu

    gpu = detect_intel_gpu()
    if gpu:
        return gpu

    gpu = detect_qualcomm_gpu()
    if gpu:
        return gpu

    return None
```

### Phase 8: Testing & Polish (Week 8)

#### 8.1 Test Coverage

**Unit tests**:
- Hardware detection mocking (all GPU vendors; see the pytest sketch below)
- Model selection logic
- Consensus algorithm (local + cross-swarm)
- API endpoint validation
- Network discovery protocol
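
A minimal example of the hardware-mocking approach with `pytest`; the module path and function names are assumptions based on the layout and `detect_gpu()` shown in 7.4:

```python
from src.hardware import detector

def test_no_gpu_falls_back_to_cpu(monkeypatch):
    # Pretend we are on Linux with no GPU from any vendor
    monkeypatch.setattr(detector, "detect_os", lambda: "linux")
    for probe in ("detect_nvidia_gpu", "detect_amd_gpu", "detect_intel_gpu", "detect_qualcomm_gpu"):
        monkeypatch.setattr(detector, probe, lambda: None)
    assert detector.detect_gpu() is None
```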

**Integration tests**:
- End-to-end inference
- Multi-worker coordination
- Cross-swarm voting
- Error handling
- Network partition scenarios

**Platform tests**:
- Windows with NVIDIA/AMD/Intel
- macOS with M1/M2/M3/M4
- Linux with NVIDIA/AMD/Intel
- CPU-only fallback
- Android (Termux) with Qualcomm

#### 8.2 Performance Optimization

- **Model warmup**: Pre-load models on startup
- **Request batching**: Group similar requests
- **Worker pooling**: Reuse workers instead of respawning
- **Memory monitoring**: Auto-shutdown if OOM

#### 8.3 Documentation ✅ COMPLETED

**Created docs/GUIDE.md with:**
- Quick Start Guide (all platforms)
- Opencode Configuration Examples:
  - Basic setup
  - Remote machine configuration
  - Multiple model options
  - Environment-specific configs
- API Reference (OpenAI-compatible endpoints)
- Troubleshooting Guide (common issues, platform-specific)
- Performance Tuning (speed vs quality, memory usage)
- Advanced Configuration (config.yaml, env vars)
- MCP Server setup
- Network Federation guide

**Updated README.md:**
- Added Documentation section with links
- Referenced complete guide

## Technical Decisions

### Why llama.cpp?
- Best cross-platform support
- Mature quantization formats (GGUF)
- Active community
- Good performance/quality tradeoff

### Why MLX for macOS?
- Native Apple Silicon optimization
- Simpler than llama.cpp on macOS
- Better unified memory handling

### Why consensus voting?
- Improves response quality vs a single model
- Uses available hardware efficiently
- Can detect model hallucinations

### Memory Model

A short worked example follows the three profiles below.

**External GPU (NVIDIA/AMD)**:
- Use 100% of VRAM as the budget, minus a ~10% buffer for OS/drivers
- Each instance gets an equal share

**Apple Silicon**:
- Use 50% of unified RAM
- Avoid system swap
- Monitor memory pressure

**CPU-only**:
- Use 50% of system RAM
- Dependent on available memory
- Slower but functional
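
A worked example with assumed numbers: a 16 GB card running the 7B q4_k_m variant from the registry in 1.2.

```python
usable_vram = 16.0 * 0.9                             # 16 GB card minus the ~10% buffer -> 14.4 GB
per_instance = 4.5                                    # qwen2.5-coder 7B q4_k_m from the registry
instances = min(int(usable_vram // per_instance), 8)  # cap at the 8-instance limit
print(instances)                                      # 3 instances
```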

## Future Enhancements

1. **Dynamic scaling**: Add/remove workers based on load
2. **Model mixing**: Different models in same swarm
3. **Fine-tuning**: Local fine-tuning on user data
4. **Web UI**: Browser-based configuration
5. **Docker support**: Containerized deployment
6. **Cloud inference**: Fallback to cloud APIs
7. **WebGPU support**: Browser-based inference
8. **Persistent knowledge**: RAG with local vector DB

## Success Metrics

- **Startup time**: < 30 seconds from cold start
- **First inference**: < 10 seconds after startup
- **Concurrent requests**: Support 2-8 parallel inferences per machine
- **Consensus accuracy**: > 80% agreement on code tasks
- **Memory efficiency**: Use > 80% of available memory
- **Cross-platform**: Works on Windows/macOS/Linux without code changes
- **GPU support**: NVIDIA, AMD, Intel, Apple Silicon, Qualcomm
- **Network federation**: Auto-discovery within 10 seconds
- **Federated consensus**: Scale to 5+ machines (25+ instances)
- **Mobile support**: Functional on Android/Termux (3B models)
|