Local Swarm - Detailed Implementation Plan

Overview

A terminal-based tool that automatically configures and runs a swarm of small coding LLMs optimized for your hardware, exposing an OpenAI-compatible API for integration with opencode and other tools.

Architecture

local_swarm/
├── src/
│   ├── __init__.py
│   ├── hardware/
│   │   ├── __init__.py
│   │   ├── detector.py       # Platform-agnostic hardware detection
│   │   ├── nvidia.py         # NVIDIA GPU detection (Windows/Linux)
│   │   ├── apple_silicon.py  # Apple Silicon detection (macOS)
│   │   └── memory.py         # RAM detection
│   ├── models/
│   │   ├── __init__.py
│   │   ├── registry.py       # Model database with specs
│   │   ├── selector.py       # Optimal model/quant selection logic
│   │   └── downloader.py     # Download manager (HuggingFace)
│   ├── backends/
│   │   ├── __init__.py
│   │   ├── base.py           # Backend interface
│   │   ├── llamacpp.py       # llama.cpp backend
│   │   └── mlx.py            # MLX backend (macOS)
│   ├── swarm/
│   │   ├── __init__.py
│   │   ├── manager.py        # Instance lifecycle management
│   │   ├── worker.py         # Individual LLM instance wrapper
│   │   └── consensus.py      # Voting/consensus algorithm
│   └── api/
│       ├── __init__.py
│       ├── server.py         # FastAPI/uvicorn server
│       ├── routes.py         # OpenAI-compatible endpoints
│       └── middleware.py     # Request handling
├── tests/
├── config/
│   └── models.yaml           # Model configurations
├── scripts/
│   ├── install.bat           # Windows installer
│   └── install.sh            # Unix installer
├── main.py                   # CLI entry point
├── requirements.txt
├── requirements-macos.txt    # MLX-specific deps
├── setup.py
└── .gitignore

Implementation Phases

Phase 1: Foundation (Week 1)

1.1 Hardware Detection Module

File: src/hardware/detector.py

Requirements:

  • Cross-platform OS detection (Windows, macOS, Linux)
  • CPU info (cores, architecture)
  • RAM detection (total, available)
  • GPU detection with VRAM

Platform-specific implementations:

  • Windows: Use pynvml for NVIDIA; fall back to DirectX (DXGI adapter enumeration) for other GPUs (see the sketch after this list)
  • macOS: Use psutil for RAM, sysctl for CPU, Metal API for GPU
  • Linux: Use pynvml for NVIDIA, rocm-smi for AMD
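A minimal sketch of the NVIDIA path using pynvml; the DirectX, Metal, and rocm-smi fallbacks would follow the same shape. Error handling is trimmed, and the returned dict is illustrative rather than a finalized structure:

# src/hardware/nvidia.py (sketch): query the first NVIDIA GPU via NVML
import pynvml

def detect_nvidia_gpu():
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # No NVIDIA driver/GPU present
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    name = pynvml.nvmlDeviceGetName(handle)  # bytes in older pynvml versions, str in newer ones
    mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
    pynvml.nvmlShutdown()
    return {"name": name, "vram_gb": mem.total / (1024 ** 3)}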

Output structure:

from dataclasses import dataclass
from typing import Optional

@dataclass
class HardwareProfile:
    os: str  # 'windows', 'darwin', 'linux'
    cpu_cores: int
    ram_gb: float
    gpu: Optional[GPUInfo]  # None when no discrete GPU is detected
    is_apple_silicon: bool

Memory budget rules (used by model selection; see the sketch after this list):

  • External GPU (NVIDIA/AMD): Use 100% of VRAM
  • Apple Silicon: Use 50% of unified RAM
  • CPU-only: Use 50% of system RAM
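A minimal sketch of how these budget rules might translate into code; get_available_memory is the helper the selector in 1.3 relies on, and it assumes GPUInfo exposes a vram_gb field:

def get_available_memory(hw: HardwareProfile) -> float:
    """Return the memory budget in GB for model instances, per the rules above."""
    if hw.gpu is not None and not hw.is_apple_silicon:
        return hw.gpu.vram_gb   # external GPU: use all VRAM
    if hw.is_apple_silicon:
        return hw.ram_gb * 0.5  # unified memory: use half of RAM
    return hw.ram_gb * 0.5      # CPU-only: use half of system RAM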

1.2 Model Registry

File: src/models/registry.py

Model database (YAML format):

models:
  qwen2.5-coder:
    name: "Qwen 2.5 Coder"
    description: "Alibaba's code-focused model"
    variants:
      - size: 3b
        base_vram_gb: 6.0  # Approximate VRAM for fp16 (~2 bytes per parameter)
        quantizations:
          q4_k_m:
            vram_gb: 1.8
            quality: "good"
          q5_k_m:
            vram_gb: 2.2
            quality: "better"
          q6_k:
            vram_gb: 2.6
            quality: "best"
      - size: 7b
        base_vram_gb: 14.0
        quantizations:
          q4_k_m:
            vram_gb: 4.5
          q5_k_m:
            vram_gb: 5.2
          q6_k:
            vram_gb: 6.0
    
  codellama:
    name: "CodeLlama"
    # Similar structure...
    
  deepseek-coder:
    name: "DeepSeek Coder"
    # Similar structure...
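A hedged sketch of how registry.py might load this YAML; the raw-dict return type is a placeholder for whatever typed containers the registry ends up using:

import yaml

def load_registry(path: str = "config/models.yaml") -> dict:
    """Load the model database; returns the 'models' mapping keyed by model id."""
    with open(path) as f:
        data = yaml.safe_load(f)
    return data["models"]

# Example: list the variants defined for qwen2.5-coder
# registry = load_registry()
# variants = registry["qwen2.5-coder"]["variants"]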

Selection priority:

  1. Qwen 2.5 Coder (best for small sizes)
  2. DeepSeek Coder (good alternative)
  3. CodeLlama (fallback)

1.3 Model Selector Logic

File: src/models/selector.py

Algorithm:

def select_optimal_model(hardware: HardwareProfile) -> ModelConfig:
    available_memory = get_available_memory(hardware)
    
    # Try models in priority order
    for model in PRIORITY_MODELS:
        # Find largest size that fits
        for variant in reversed(model.variants):
            # Try highest quantization that fits
            for quant in reversed(variant.quantizations):
                total_vram_needed = quant.vram_gb * MIN_INSTANCES
                if total_vram_needed <= available_memory:
                    # Calculate max instances
                    max_instances = int(available_memory // quant.vram_gb)
                    # Cap at reasonable limit (e.g., 8)
                    instances = min(max_instances, 8)
                    return ModelConfig(model, variant, quant, instances)
    
    # Fallback to smallest model
    return FALLBACK_CONFIG

Minimum instances: 2 (required for consensus voting)
Maximum instances: 8 (to avoid scheduling overhead)

Phase 2: Backend Integration (Week 2)

2.1 Base Backend Interface

File: src/backends/base.py

from abc import ABC, abstractmethod
from typing import AsyncIterator

class LLMBackend(ABC):
    @abstractmethod
    async def load_model(self, model_path: str, config: dict) -> bool:
        pass
    
    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        pass
    
    @abstractmethod
    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        pass
    
    @abstractmethod
    def get_memory_usage(self) -> float:
        """Return current VRAM/RAM usage in GB"""
        pass
    
    @abstractmethod
    def shutdown(self):
        pass

2.2 llama.cpp Backend

File: src/backends/llamacpp.py

Implementation:

  • Use llama-cpp-python library
  • Support GGUF model format
  • GPU acceleration via CUDA/Metal
  • Server mode with HTTP API

Key features:

  • Model caching to avoid reload
  • Context window management
  • Batch processing support

Memory calculation:

import os

def calculate_memory_usage(model_path: str) -> float:
    # Rough estimate: GGUF file size plus ~20% overhead for KV cache and buffers.
    # A finer estimate would parse the GGUF metadata (layer count, context length).
    return os.path.getsize(model_path) / (1024 ** 3) * 1.2
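A hedged sketch of this backend on top of llama-cpp-python follows. It implements the LLMBackend interface above; the default context size, the n_gpu_layers=-1 choice, and the asyncio.to_thread wrapping are assumptions rather than settled decisions:

import asyncio
from typing import AsyncIterator

from llama_cpp import Llama

class LlamaCppBackend(LLMBackend):
    def __init__(self):
        self.llm = None
        self.model_path = None

    async def load_model(self, model_path: str, config: dict) -> bool:
        # n_gpu_layers=-1 offloads all layers to the GPU when CUDA/Metal is available
        self.model_path = model_path
        self.llm = Llama(
            model_path=model_path,
            n_ctx=config.get("n_ctx", 4096),
            n_gpu_layers=config.get("n_gpu_layers", -1),
            verbose=False,
        )
        return True

    async def generate(self, prompt: str, **kwargs) -> str:
        # llama-cpp-python is synchronous, so run it off the event loop
        result = await asyncio.to_thread(
            self.llm, prompt, max_tokens=kwargs.get("max_tokens", 512)
        )
        return result["choices"][0]["text"]

    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        # Sketch only: the synchronous iteration would also be moved off the event loop in practice
        for chunk in self.llm(prompt, max_tokens=kwargs.get("max_tokens", 512), stream=True):
            yield chunk["choices"][0]["text"]

    def get_memory_usage(self) -> float:
        # calculate_memory_usage is defined above
        return calculate_memory_usage(self.model_path) if self.model_path else 0.0

    def shutdown(self):
        self.llm = None  # drop the reference; llama.cpp frees the weights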

2.3 MLX Backend (macOS)

File: src/backends/mlx.py

Implementation:

  • Use mlx-lm library
  • Support MLX format models
  • Optimized for Apple Silicon

Key differences from llama.cpp:

  • Native Metal performance
  • Simpler API
  • Unified memory model
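A hedged sketch of the MLX backend on top of mlx-lm; load and generate are the library's public helpers, while the async wrapping mirrors the llama.cpp sketch and is an assumption. The remaining interface methods would follow the same pattern:

import asyncio

from mlx_lm import load, generate

class MLXBackend(LLMBackend):
    def __init__(self):
        self.model = None
        self.tokenizer = None

    async def load_model(self, model_path: str, config: dict) -> bool:
        # mlx-lm loads MLX-format weights directly into unified memory
        self.model, self.tokenizer = load(model_path)
        return True

    async def generate(self, prompt: str, **kwargs) -> str:
        return await asyncio.to_thread(
            generate,
            self.model,
            self.tokenizer,
            prompt=prompt,
            max_tokens=kwargs.get("max_tokens", 512),
        )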

Phase 3: Swarm Management (Week 3)

3.1 Worker Instance

File: src/swarm/worker.py

Each worker manages:

  • One LLM instance
  • Request queue
  • Health monitoring
  • Metrics collection

import time

class SwarmWorker:
    def __init__(self, worker_id: int, backend: LLMBackend, config: dict):
        self.worker_id = worker_id
        self.backend = backend
        self.config = config
        self.is_healthy = True
        self.request_count = 0
        self.avg_latency = 0.0

    async def process(self, request: GenerationRequest) -> GenerationResponse:
        start = time.time()
        text = await self.backend.generate(**request.params)
        latency = time.time() - start
        self._update_metrics(latency)
        return GenerationResponse(text=text, latency=latency, worker_id=self.worker_id)

    def _update_metrics(self, latency: float):
        # Running average of per-request latency
        self.request_count += 1
        self.avg_latency += (latency - self.avg_latency) / self.request_count
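The worker above passes GenerationRequest/GenerationResponse objects around; a minimal sketch of those containers, with field names that are assumptions consistent with how they are used here:

from dataclasses import dataclass, field

@dataclass
class GenerationRequest:
    params: dict = field(default_factory=dict)  # prompt plus sampling kwargs

@dataclass
class GenerationResponse:
    text: str
    latency: float  # seconds spent by this worker on the generation
    worker_id: int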

3.2 Swarm Manager

File: src/swarm/manager.py

Responsibilities:

  • Spawn N workers based on hardware
  • Distribute requests to all workers
  • Collect responses
  • Handle worker failures

import asyncio
from typing import List

class SwarmManager:
    def __init__(self, config: ModelConfig):
        self.workers: List[SwarmWorker] = []
        self.config = config

    async def initialize(self):
        # Download model if needed
        model_path = await self._ensure_model()

        # Spawn one backend + worker per instance
        for i in range(self.config.instances):
            backend = self._create_backend()
            await backend.load_model(model_path, self.config.backend_params)
            self.workers.append(SwarmWorker(i, backend, self.config))

    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        # Broadcast the same request to every worker in parallel
        request = GenerationRequest(params={"prompt": prompt, **kwargs})
        tasks = [w.process(request) for w in self.workers]
        return await asyncio.gather(*tasks)
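The _ensure_model step delegates to the downloader module; a hedged sketch using huggingface_hub, where the repo and file names in the example are placeholders rather than final registry entries:

import os
from huggingface_hub import hf_hub_download

def download_model(repo_id: str, filename: str, cache_dir: str) -> str:
    """Fetch a GGUF file from HuggingFace and return its local path (cached on repeat calls)."""
    return hf_hub_download(repo_id=repo_id, filename=filename,
                           cache_dir=os.path.expanduser(cache_dir))

# Example (placeholder names):
# path = download_model("Qwen/Qwen2.5-Coder-3B-Instruct-GGUF",
#                       "qwen2.5-coder-3b-instruct-q4_k_m.gguf",
#                       "~/.local_swarm/models")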

3.3 Consensus Algorithm

File: src/swarm/consensus.py

Voting strategies:

  1. Similarity voting (default):

    • Embed all responses
    • Group by semantic similarity
    • Return largest group
  2. Quality scoring:

    • Score each response on:
      • Completeness (does it answer the question?)
      • Code quality (syntax, structure)
      • Length appropriateness
    • Return highest score
  3. Latency-weighted:

    • Prefer faster responses (lower memory pressure)

Implementation:

from typing import List

import numpy as np
from sentence_transformers import SentenceTransformer

class ConsensusEngine:
    def __init__(self, strategy: str = "similarity"):
        self.strategy = strategy
        self.embedding_model = None  # Lazy load

    async def select_best(self, responses: List[GenerationResponse]) -> str:
        if len(responses) == 1:
            return responses[0].text

        if self.strategy == "similarity":
            return await self._similarity_vote(responses)
        elif self.strategy == "quality":
            return await self._quality_score(responses)
        else:
            return self._fastest_response(responses)

    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        # Embed responses with sentence-transformers, group by cosine similarity > 0.85,
        # and return a representative of the largest agreement group
        if self.embedding_model is None:
            self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        texts = [r.text for r in responses]
        emb = self.embedding_model.encode(texts, normalize_embeddings=True)
        sims = emb @ emb.T  # cosine similarity (embeddings are L2-normalized)
        group_sizes = (sims > 0.85).sum(axis=1)
        return texts[int(np.argmax(group_sizes))]

Phase 4: API Server (Week 4)

4.1 OpenAI-Compatible Endpoints

File: src/api/routes.py

Required endpoints:

  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completion
  • POST /v1/completions - Text completion (optional)
  • GET /health - Health check
  • GET /metrics - Prometheus metrics (optional)

Chat completions endpoint:

import time
from uuid import uuid4

@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    # Extract messages
    messages = request.messages
    prompt = format_messages(messages)
    
    # Get all responses from swarm
    responses = await swarm_manager.generate_all(prompt, **request.params)
    
    # Run consensus
    best_response = await consensus_engine.select_best(responses)
    
    # Format as OpenAI response
    return {
        "id": f"chatcmpl-{uuid4()}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": best_response},
            "finish_reason": "stop"
        }],
        "usage": calculate_usage(prompt, best_response)
    }
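A minimal sketch of the request model the endpoint above expects, using pydantic; the field names follow the OpenAI chat schema, and the defaults plus the params property are assumptions:

from typing import List, Optional
from pydantic import BaseModel

class ChatMessage(BaseModel):
    role: str  # 'system', 'user', or 'assistant'
    content: str

class ChatCompletionRequest(BaseModel):
    model: str
    messages: List[ChatMessage]
    stream: bool = False
    temperature: float = 0.7
    max_tokens: Optional[int] = None

    @property
    def params(self) -> dict:
        # Sampling kwargs forwarded to the swarm (used as **request.params above)
        return {"temperature": self.temperature, "max_tokens": self.max_tokens}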

4.2 Streaming Support

File: src/api/routes.py

For streaming, use the fastest worker instead of consensus:

if request.stream:
    # Pick worker with lowest latency
    worker = swarm_manager.get_fastest_worker()
    return StreamingResponse(
        worker.stream_generate(prompt),
        media_type="text/event-stream"
    )
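OpenAI-compatible clients expect the stream as server-sent events carrying chat.completion.chunk payloads; a minimal sketch of wrapping the worker's token stream into that format (the sse_stream helper name is illustrative):

import json
import time
from uuid import uuid4

async def sse_stream(token_stream, model: str):
    chunk_id = f"chatcmpl-{uuid4()}"
    async for token in token_stream:
        payload = {
            "id": chunk_id,
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model,
            "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(payload)}\n\n"
    # Signal end of stream the way the OpenAI API does
    yield "data: [DONE]\n\n"

The StreamingResponse above would then wrap sse_stream(worker.stream_generate(prompt), request.model) instead of the raw token stream.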

Phase 5: CLI & Distribution (Week 5)

5.1 CLI Interface

File: main.py

Commands:

# Start the swarm (auto-detect hardware)
python -m local_swarm

# Start with specific model
python -m local_swarm --model qwen2.5-coder:3b:q4

# Start with specific port
python -m local_swarm --port 8080

# Override instance count
python -m local_swarm --instances 4

# Show hardware detection
python -m local_swarm --detect

# Download models only
python -m local_swarm --download-only
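A minimal sketch of the corresponding argument parsing in main.py; the flag names mirror the commands above, and the wiring into the server is omitted:

import argparse

def parse_args():
    parser = argparse.ArgumentParser(
        prog="local_swarm",
        description="Run a local swarm of small coding LLMs",
    )
    parser.add_argument("--model", help="Model spec, e.g. qwen2.5-coder:3b:q4")
    parser.add_argument("--port", type=int, default=8000)
    parser.add_argument("--instances", type=int, help="Override auto-detected instance count")
    parser.add_argument("--detect", action="store_true", help="Print hardware detection and exit")
    parser.add_argument("--download-only", action="store_true", help="Download models and exit")
    return parser.parse_args()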

5.2 Configuration File

File: config.yaml

server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8
  timeout: 60

models:
  cache_dir: "~/.local_swarm/models"
  preferred_models:
    - qwen2.5-coder
    - deepseek-coder
    - codellama

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5  # Use 50% of system RAM for CPU/Apple Silicon

5.3 Installation Scripts

Windows (scripts/install.bat):

@echo off
echo Installing Local Swarm...
python -m pip install --upgrade pip
pip install -r requirements.txt

:: Check for CUDA
nvidia-smi >nul 2>&1
if %errorlevel% == 0 (
    echo CUDA detected, installing GPU support...
    pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
) else (
    echo No CUDA detected, using CPU backend...
    pip install llama-cpp-python
)

echo Installation complete!
echo Run: python -m local_swarm

macOS/Linux (scripts/install.sh):

#!/bin/bash
set -e

echo "Installing Local Swarm..."
pip install --upgrade pip

# Detect platform
if [[ "$OSTYPE" == "darwin"* ]]; then
    echo "macOS detected..."
    pip install -r requirements.txt
    pip install -r requirements-macos.txt
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
    echo "Linux detected..."
    pip install -r requirements.txt
    if command -v nvidia-smi &> /dev/null; then
        echo "CUDA detected, installing GPU support..."
        pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
    else
        pip install llama-cpp-python
    fi
fi

echo "Installation complete!"
echo "Run: python -m local_swarm"

Phase 6: Testing & Polish (Week 6)

6.1 Test Coverage

Unit tests:

  • Hardware detection mocking
  • Model selection logic
  • Consensus algorithm
  • API endpoint validation

Integration tests:

  • End-to-end inference
  • Multi-worker coordination
  • Error handling

Platform tests:

  • Windows with NVIDIA
  • macOS with M1/M2/M3
  • Linux with CUDA
  • CPU-only fallback

6.2 Performance Optimization

  • Model warmup: Pre-load models on startup
  • Request batching: Group similar requests
  • Worker pooling: Reuse workers instead of respawning
  • Memory monitoring: Auto-shutdown if OOM

6.3 Documentation

  • API documentation (OpenAPI spec)
  • Configuration guide
  • Troubleshooting
  • Performance tuning tips

Technical Decisions

Why llama.cpp?

  • Best cross-platform support
  • Mature quantization formats (GGUF)
  • Active community
  • Good performance/quality tradeoff

Why MLX for macOS?

  • Native Apple Silicon optimization
  • Simpler than llama.cpp on macOS
  • Better unified memory handling

Why consensus voting?

  • Improves response quality vs single model
  • Uses available hardware efficiently
  • Can detect model hallucinations

Memory Model

External GPU (NVIDIA/AMD):

  • Target full VRAM utilization for model weights
  • Keep a ~10% buffer for the OS and display drivers
  • Each instance gets an equal share of the remainder

Apple Silicon:

  • Use 50% of unified RAM
  • Avoid system swap
  • Monitor memory pressure

CPU-only:

  • Use 50% of system RAM
  • Instance count limited by available RAM
  • Slower but functional

Future Enhancements

  1. Multi-GPU support: Distribute across multiple GPUs
  2. Dynamic scaling: Add/remove workers based on load
  3. Model mixing: Different models in same swarm
  4. Fine-tuning: Local fine-tuning on user data
  5. Web UI: Browser-based configuration
  6. Docker support: Containerized deployment
  7. Cloud inference: Fallback to cloud APIs

Success Metrics

  • Startup time: < 30 seconds from cold start
  • First inference: < 10 seconds after startup
  • Concurrent requests: Support 2-8 parallel inferences
  • Consensus accuracy: > 80% agreement on code tasks
  • Memory efficiency: Use > 80% of available memory
  • Cross-platform: Works on Windows/macOS/Linux without code changes