Local Swarm - Detailed Implementation Plan
Overview
A terminal-based tool that automatically configures and runs a swarm of small coding LLMs optimized for your hardware, exposing an OpenAI-compatible API for integration with opencode and other tools.
Architecture
local_swarm/
├── src/
│ ├── __init__.py
│ ├── hardware/
│ │ ├── __init__.py
│ │ ├── detector.py # Platform-agnostic hardware detection
│ │ ├── nvidia.py # NVIDIA GPU detection (Windows/Linux)
│ │ ├── apple_silicon.py # Apple Silicon detection (macOS)
│ │ └── memory.py # RAM detection
│ ├── models/
│ │ ├── __init__.py
│ │ ├── registry.py # Model database with specs
│ │ ├── selector.py # Optimal model/quant selection logic
│ │ └── downloader.py # Download manager (HuggingFace)
│ ├── backends/
│ │ ├── __init__.py
│ │ ├── base.py # Backend interface
│ │ ├── llamacpp.py # llama.cpp backend
│ │ └── mlx.py # MLX backend (macOS)
│ ├── swarm/
│ │ ├── __init__.py
│ │ ├── manager.py # Instance lifecycle management
│ │ ├── worker.py # Individual LLM instance wrapper
│ │ └── consensus.py # Voting/consensus algorithm
│ └── api/
│ ├── __init__.py
│ ├── server.py # FastAPI/uvicorn server
│ ├── routes.py # OpenAI-compatible endpoints
│ └── middleware.py # Request handling
├── tests/
├── config/
│ └── models.yaml # Model configurations
├── scripts/
│ ├── install.bat # Windows installer
│ └── install.sh # Unix installer
├── main.py # CLI entry point
├── requirements.txt
├── requirements-macos.txt # MLX-specific deps
├── setup.py
└── .gitignore
Implementation Phases
Phase 1: Foundation (Week 1)
1.1 Hardware Detection Module
File: src/hardware/detector.py
Requirements:
- Cross-platform OS detection (Windows, macOS, Linux)
- CPU info (cores, architecture)
- RAM detection (total, available)
- GPU detection with VRAM
Platform-specific implementations:
- Windows: Use pynvml for NVIDIA, fall back to DirectX for others
- macOS: Use psutil for RAM, sysctl for CPU, Metal API for GPU
- Linux: Use pynvml for NVIDIA, rocm-smi for AMD
Output structure:
class HardwareProfile:
    os: str                    # 'windows', 'darwin', 'linux'
    cpu_cores: int
    ram_gb: float
    gpu: Optional[GPUInfo]
    is_apple_silicon: bool
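A minimal detection sketch, assuming psutil and pynvml are installed and that HardwareProfile and GPUInfo are simple dataclasses with the fields above (GPUInfo's name and vram_gb fields are illustrative):

import platform
import psutil

def detect_hardware() -> HardwareProfile:
    os_name = platform.system().lower()        # 'windows', 'darwin', 'linux'
    ram_gb = psutil.virtual_memory().total / 1024**3
    cpu_cores = psutil.cpu_count(logical=False) or psutil.cpu_count()
    is_apple_silicon = os_name == "darwin" and platform.machine() == "arm64"
    gpu = None
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        gpu = GPUInfo(name=str(pynvml.nvmlDeviceGetName(handle)),
                      vram_gb=mem.total / 1024**3)
    except Exception:
        pass                                   # no NVIDIA GPU or pynvml unavailable
    return HardwareProfile(os=os_name, cpu_cores=cpu_cores, ram_gb=ram_gb,
                           gpu=gpu, is_apple_silicon=is_apple_silicon)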
Model selection rules (see the sketch after this list):
- External GPU (NVIDIA/AMD): Use 100% of VRAM
- Apple Silicon: Use 50% of unified RAM
- CPU-only: Use 50% of system RAM
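A minimal sketch of these rules as a memory-budget helper; the 0.5 fraction matches the ram_fraction default in the configuration file later in this plan:

def get_available_memory(hardware: HardwareProfile) -> float:
    """Return the memory budget in GB for model instances."""
    if hardware.gpu is not None and not hardware.is_apple_silicon:
        return hardware.gpu.vram_gb        # external GPU: use all VRAM
    return hardware.ram_gb * 0.5           # Apple Silicon / CPU-only: half of RAM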
1.2 Model Registry
File: src/models/registry.py
Model database (YAML format):
models:
  qwen2.5-coder:
    name: "Qwen 2.5 Coder"
    description: "Alibaba's code-focused model"
    variants:
      - size: 3b
        base_vram_gb: 6.0        # Approximate VRAM for fp16
        quantizations:
          q4_k_m:
            vram_gb: 1.8
            quality: "good"
          q5_k_m:
            vram_gb: 2.2
            quality: "better"
          q6_k:
            vram_gb: 2.6
            quality: "best"
      - size: 7b
        base_vram_gb: 14.0
        quantizations:
          q4_k_m:
            vram_gb: 4.5
          q5_k_m:
            vram_gb: 5.2
          q6_k:
            vram_gb: 6.0
  codellama:
    name: "CodeLlama"
    # Similar structure...
  deepseek-coder:
    name: "DeepSeek Coder"
    # Similar structure...
Selection priority:
- Qwen 2.5 Coder (best for small sizes)
- DeepSeek Coder (good alternative)
- CodeLlama (fallback)
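A minimal loader sketch for this registry, assuming PyYAML; it returns the raw mapping from the YAML above rather than introducing new classes:

import yaml

def load_registry(path: str = "config/models.yaml") -> dict:
    """Load the model database; keys are model ids, values are their specs."""
    with open(path) as f:
        return yaml.safe_load(f)["models"]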
1.3 Model Selector Logic
File: src/models/selector.py
Algorithm:
def select_optimal_model(hardware: HardwareProfile) -> ModelConfig:
    available_memory = get_available_memory(hardware)

    # Try models in priority order
    for model in PRIORITY_MODELS:
        # Find the largest size that fits
        for variant in reversed(model.variants):
            # Try the highest-quality quantization that fits
            for quant in reversed(variant.quantizations):
                total_vram_needed = quant.vram_gb * MIN_INSTANCES
                if total_vram_needed <= available_memory:
                    # Calculate max instances
                    max_instances = int(available_memory // quant.vram_gb)
                    # Cap at a reasonable limit (e.g., 8)
                    instances = min(max_instances, 8)
                    return ModelConfig(model, variant, quant, instances)

    # Fallback to the smallest model
    return FALLBACK_CONFIG
- Minimum instances: 2 (for consensus voting)
- Maximum instances: 8 (to avoid overhead)
Phase 2: Backend Integration (Week 2)
2.1 Base Backend Interface
File: src/backends/base.py
from abc import ABC, abstractmethod
from typing import AsyncIterator

class LLMBackend(ABC):
    @abstractmethod
    async def load_model(self, model_path: str, config: dict) -> bool:
        pass

    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        pass

    @abstractmethod
    def get_memory_usage(self) -> float:
        """Return current VRAM/RAM usage in GB."""
        pass

    @abstractmethod
    def shutdown(self):
        pass
2.2 llama.cpp Backend
File: src/backends/llamacpp.py
Implementation:
- Use llama-cpp-python library
- Support GGUF model format
- GPU acceleration via CUDA/Metal
- Server mode with HTTP API
Key features:
- Model caching to avoid reload
- Context window management
- Batch processing support
Memory calculation:
def calculate_memory_usage(model_path: str) -> float:
    # Parse GGUF metadata
    # Return estimated VRAM usage
    ...
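A minimal backend sketch against the interface above, assuming the llama-cpp-python package; the constructor options shown are a small subset of what the library accepts, and memory reporting is left as a placeholder:

import asyncio
from llama_cpp import Llama

class LlamaCppBackend(LLMBackend):
    def __init__(self):
        self.llm = None

    async def load_model(self, model_path: str, config: dict) -> bool:
        # n_gpu_layers=-1 offloads all layers when the wheel is built with CUDA/Metal.
        self.llm = Llama(
            model_path=model_path,
            n_ctx=config.get("n_ctx", 4096),
            n_gpu_layers=config.get("n_gpu_layers", -1),
            verbose=False,
        )
        return True

    async def generate(self, prompt: str, **kwargs) -> str:
        # Run the blocking call in a thread so the event loop stays responsive.
        result = await asyncio.to_thread(
            self.llm, prompt, max_tokens=kwargs.get("max_tokens", 512)
        )
        return result["choices"][0]["text"]

    async def generate_stream(self, prompt: str, **kwargs):
        for chunk in self.llm(prompt, max_tokens=kwargs.get("max_tokens", 512), stream=True):
            yield chunk["choices"][0]["text"]

    def get_memory_usage(self) -> float:
        return 0.0  # placeholder; could use the GGUF-based estimate above

    def shutdown(self):
        self.llm = None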
2.3 MLX Backend (macOS)
File: src/backends/mlx.py
Implementation:
- Use mlx-lm library
- Support MLX format models
- Optimized for Apple Silicon
Key differences from llama.cpp:
- Native Metal performance
- Simpler API
- Unified memory model
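A minimal MLX backend sketch, assuming the mlx-lm package; streaming here simply yields the full completion (mlx-lm also provides token-level streaming helpers), and memory reporting is a placeholder:

import asyncio
from mlx_lm import load, generate

class MLXBackend(LLMBackend):
    def __init__(self):
        self.model = None
        self.tokenizer = None

    async def load_model(self, model_path: str, config: dict) -> bool:
        # load() accepts a local path or repo id for MLX-format models.
        self.model, self.tokenizer = load(model_path)
        return True

    async def generate(self, prompt: str, **kwargs) -> str:
        return await asyncio.to_thread(
            generate, self.model, self.tokenizer,
            prompt=prompt, max_tokens=kwargs.get("max_tokens", 512),
        )

    async def generate_stream(self, prompt: str, **kwargs):
        # Simplest possible streaming: yield the full completion at once.
        yield await self.generate(prompt, **kwargs)

    def get_memory_usage(self) -> float:
        return 0.0  # placeholder

    def shutdown(self):
        self.model = self.tokenizer = None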
Phase 3: Swarm Management (Week 3)
3.1 Worker Instance
File: src/swarm/worker.py
Each worker manages:
- One LLM instance
- Request queue
- Health monitoring
- Metrics collection
class SwarmWorker:
    def __init__(self, worker_id: int, backend: LLMBackend, config: dict):
        self.worker_id = worker_id
        self.backend = backend
        self.is_healthy = True
        self.request_count = 0
        self.avg_latency = 0.0

    async def process(self, request: GenerationRequest) -> GenerationResponse:
        start = time.time()
        response = await self.backend.generate(**request.params)
        latency = time.time() - start
        self._update_metrics(latency)
        return GenerationResponse(response, latency, self.worker_id)
3.2 Swarm Manager
File: src/swarm/manager.py
Responsibilities:
- Spawn N workers based on hardware
- Distribute requests to all workers
- Collect responses
- Handle worker failures
class SwarmManager:
    def __init__(self, config: ModelConfig):
        self.workers: List[SwarmWorker] = []
        self.config = config

    async def initialize(self):
        # Download model if needed
        model_path = await self._ensure_model()
        # Spawn workers
        for i in range(self.config.instances):
            backend = self._create_backend()
            await backend.load_model(model_path, self.config.backend_params)
            worker = SwarmWorker(i, backend, self.config)
            self.workers.append(worker)

    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        # Send to all workers in parallel
        tasks = [w.process(request) for w in self.workers]
        return await asyncio.gather(*tasks)
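The responsibilities above include handling worker failures; one possible sketch (an assumption of this plan, not settled behavior) is to gather with return_exceptions and drop failed workers' results:

    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        tasks = [w.process(request) for w in self.workers]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        responses = []
        for worker, result in zip(self.workers, results):
            if isinstance(result, Exception):
                worker.is_healthy = False     # mark for health monitoring
            else:
                responses.append(result)
        return responses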
3.3 Consensus Algorithm
File: src/swarm/consensus.py
Voting strategies:
- Similarity voting (default):
  - Embed all responses
  - Group by semantic similarity
  - Return the largest group
- Quality scoring:
  - Score each response on completeness (does it answer the question?), code quality (syntax, structure), and length appropriateness
  - Return the highest-scoring response
- Latency-weighted:
  - Prefer faster responses (lower memory pressure)
Implementation:
class ConsensusEngine:
    def __init__(self, strategy: str = "similarity"):
        self.strategy = strategy
        self.embedding_model = None  # Lazy load

    async def select_best(self, responses: List[GenerationResponse]) -> str:
        if len(responses) == 1:
            return responses[0].text
        if self.strategy == "similarity":
            return await self._similarity_vote(responses)
        elif self.strategy == "quality":
            return await self._quality_score(responses)
        else:
            return self._fastest_response(responses)

    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        # Use sentence-transformers for embeddings
        # Group by cosine similarity > 0.85
        # Return median response from largest group
        ...
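A sketch of _similarity_vote following the comments above, assuming sentence-transformers and scikit-learn are available; the embedding model name and 0.85 threshold are illustrative, and a representative response from the largest agreement group is returned:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        if self.embedding_model is None:
            self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        texts = [r.text for r in responses]
        embeddings = self.embedding_model.encode(texts)
        sims = cosine_similarity(embeddings)
        # Each response's agreement count = how many responses it is similar to.
        agreement = (sims > 0.85).sum(axis=1)
        # Return a representative response from the largest agreement group.
        return texts[int(agreement.argmax())]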
Phase 4: API Server (Week 4)
4.1 OpenAI-Compatible Endpoints
File: src/api/routes.py
Required endpoints:
- GET /v1/models - List available models
- POST /v1/chat/completions - Chat completion
- POST /v1/completions - Text completion (optional)
- GET /health - Health check
- GET /metrics - Prometheus metrics (optional)
Chat completions endpoint:
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    # Extract messages
    messages = request.messages
    prompt = format_messages(messages)

    # Get all responses from swarm
    responses = await swarm_manager.generate_all(prompt, **request.params)

    # Run consensus
    best_response = await consensus_engine.select_best(responses)

    # Format as OpenAI response
    return {
        "id": f"chatcmpl-{uuid4()}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": best_response},
            "finish_reason": "stop"
        }],
        "usage": calculate_usage(prompt, best_response)
    }
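format_messages is left unspecified in this plan; a minimal placeholder sketch (a real deployment would apply the model's own chat template):

def format_messages(messages) -> str:
    # Flatten OpenAI-style messages into a single prompt string.
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    lines.append("assistant:")
    return "\n".join(lines)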
4.2 Streaming Support
File: src/api/routes.py
For streaming, use the fastest worker instead of consensus:
if request.stream:
    # Pick the worker with the lowest latency
    worker = swarm_manager.get_fastest_worker()
    return StreamingResponse(
        worker.stream_generate(prompt),
        media_type="text/event-stream"
    )
Phase 5: CLI & Distribution (Week 5)
5.1 CLI Interface
File: main.py
Commands:
# Start the swarm (auto-detect hardware)
python -m local_swarm
# Start with specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Start with specific port
python -m local_swarm --port 8080
# Override instance count
python -m local_swarm --instances 4
# Show hardware detection
python -m local_swarm --detect
# Download models only
python -m local_swarm --download-only
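A minimal argparse sketch mirroring the commands above; flag names follow the list, defaults are assumptions:

import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="local_swarm",
        description="Run a local swarm of small coding LLMs")
    parser.add_argument("--model", help="model spec, e.g. qwen2.5-coder:3b:q4")
    parser.add_argument("--port", type=int, default=8000, help="API server port")
    parser.add_argument("--instances", type=int, help="override the detected instance count")
    parser.add_argument("--detect", action="store_true", help="print hardware detection and exit")
    parser.add_argument("--download-only", action="store_true", help="download models and exit")
    return parser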
5.2 Configuration File
File: config.yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"   # similarity, quality, fastest
  min_instances: 2
  max_instances: 8
  timeout: 60

models:
  cache_dir: "~/.local_swarm/models"
  preferred_models:
    - qwen2.5-coder
    - deepseek-coder
    - codellama

hardware:
  gpu_memory_fraction: 1.0   # Use 100% of GPU VRAM
  ram_fraction: 0.5          # Use 50% of system RAM for CPU/Apple Silicon
5.3 Installation Scripts
Windows (scripts/install.bat):
@echo off
echo Installing Local Swarm...
python -m pip install --upgrade pip
pip install -r requirements.txt
:: Check for CUDA
nvidia-smi >nul 2>&1
if %errorlevel% == 0 (
echo CUDA detected, installing GPU support...
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
) else (
echo No CUDA detected, using CPU backend...
pip install llama-cpp-python
)
echo Installation complete!
echo Run: python -m local_swarm
macOS/Linux (scripts/install.sh):
#!/bin/bash
set -e
echo "Installing Local Swarm..."
pip install --upgrade pip
# Detect platform
if [[ "$OSTYPE" == "darwin"* ]]; then
echo "macOS detected..."
pip install -r requirements.txt
pip install -r requirements-macos.txt
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
echo "Linux detected..."
pip install -r requirements.txt
if command -v nvidia-smi &> /dev/null; then
echo "CUDA detected, installing GPU support..."
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
else
pip install llama-cpp-python
fi
fi
echo "Installation complete!"
echo "Run: python -m local_swarm"
Phase 6: Testing & Polish (Week 6)
6.1 Test Coverage
Unit tests:
- Hardware detection mocking
- Model selection logic
- Consensus algorithm
- API endpoint validation
Integration tests:
- End-to-end inference
- Multi-worker coordination
- Error handling
Platform tests:
- Windows with NVIDIA
- macOS with M1/M2/M3
- Linux with CUDA
- CPU-only fallback
6.2 Performance Optimization
- Model warmup: Pre-load models on startup
- Request batching: Group similar requests
- Worker pooling: Reuse workers instead of respawning
- Memory monitoring: Auto-shutdown if OOM
6.3 Documentation
- API documentation (OpenAPI spec)
- Configuration guide
- Troubleshooting
- Performance tuning tips
Technical Decisions
Why llama.cpp?
- Best cross-platform support
- Mature quantization formats (GGUF)
- Active community
- Good performance/quality tradeoff
Why MLX for macOS?
- Native Apple Silicon optimization
- Simpler than llama.cpp on macOS
- Better unified memory handling
Why consensus voting?
- Improves response quality vs single model
- Uses available hardware efficiently
- Can detect model hallucinations
Memory Model
External GPU (NVIDIA/AMD):
- Use all VRAM for model instances, minus a ~10% buffer reserved for OS/drivers
- Each instance gets equal share
Apple Silicon:
- Use 50% of unified RAM
- Avoid system swap
- Monitor memory pressure
CPU-only:
- Use 50% of system RAM
- Instance count depends on available RAM
- Slower but functional
Future Enhancements
- Multi-GPU support: Distribute across multiple GPUs
- Dynamic scaling: Add/remove workers based on load
- Model mixing: Different models in same swarm
- Fine-tuning: Local fine-tuning on user data
- Web UI: Browser-based configuration
- Docker support: Containerized deployment
- Cloud inference: Fallback to cloud APIs
Success Metrics
- Startup time: < 30 seconds from cold start
- First inference: < 10 seconds after startup
- Concurrent requests: Support 2-8 parallel inferences
- Consensus accuracy: > 80% agreement on code tasks
- Memory efficiency: Use > 80% of available memory
- Cross-platform: Works on Windows/macOS/Linux without code changes