Local Swarm - Detailed Implementation Plan
Overview
A terminal-based tool that automatically configures and runs a swarm of small coding LLMs optimized for your hardware, exposing an OpenAI-compatible API for integration with opencode and other tools.
Architecture
local_swarm/
├── src/
│ ├── __init__.py
│ ├── hardware/
│ │ ├── __init__.py
│ │ ├── detector.py # Platform-agnostic hardware detection
│ │ ├── nvidia.py # NVIDIA GPU detection (Windows/Linux)
│ │ ├── apple_silicon.py # Apple Silicon detection (macOS)
│ │ └── memory.py # RAM detection
│ ├── models/
│ │ ├── __init__.py
│ │ ├── registry.py # Model database with specs
│ │ ├── selector.py # Optimal model/quant selection logic
│ │ └── downloader.py # Download manager (HuggingFace)
│ ├── backends/
│ │ ├── __init__.py
│ │ ├── base.py # Backend interface
│ │ ├── llamacpp.py # llama.cpp backend
│ │ └── mlx.py # MLX backend (macOS)
│ ├── swarm/
│ │ ├── __init__.py
│ │ ├── manager.py # Instance lifecycle management
│ │ ├── worker.py # Individual LLM instance wrapper
│ │ └── consensus.py # Voting/consensus algorithm
│ └── api/
│ ├── __init__.py
│ ├── server.py # FastAPI/uvicorn server
│ ├── routes.py # OpenAI-compatible endpoints
│ └── middleware.py # Request handling
├── tests/
├── config/
│ └── models.yaml # Model configurations
├── scripts/
│ ├── install.bat # Windows installer
│ └── install.sh # Unix installer
├── main.py # CLI entry point
├── requirements.txt
├── requirements-macos.txt # MLX-specific deps
├── setup.py
└── .gitignore
Implementation Phases
Phase 1: Foundation (Week 1)
1.1 Hardware Detection Module
File: src/hardware/detector.py
Requirements:
- Cross-platform OS detection (Windows, macOS, Linux)
- CPU info (cores, architecture)
- RAM detection (total, available)
- GPU detection with VRAM
Platform-specific implementations:
- Windows: Use pynvml for NVIDIA, fall back to DirectX for others
- macOS: Use psutil for RAM, sysctl for CPU, Metal API for GPU
- Linux: Use pynvml for NVIDIA, rocm-smi for AMD
Output structure:
class HardwareProfile:
    os: str                    # 'windows', 'darwin', 'linux'
    cpu_cores: int
    ram_gb: float
    gpu: Optional[GPUInfo]
    is_apple_silicon: bool
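A minimal detection sketch, assuming psutil and pynvml are installed and that HardwareProfile and GPUInfo are simple dataclasses with the fields above (GPUInfo's name and vram_gb fields are illustrative):

import platform
import psutil

def detect_hardware() -> HardwareProfile:
    os_name = platform.system().lower()        # 'windows', 'darwin', 'linux'
    ram_gb = psutil.virtual_memory().total / 1024**3
    cpu_cores = psutil.cpu_count(logical=False) or psutil.cpu_count()
    is_apple_silicon = os_name == "darwin" and platform.machine() == "arm64"
    gpu = None
    try:
        import pynvml
        pynvml.nvmlInit()
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        gpu = GPUInfo(name=str(pynvml.nvmlDeviceGetName(handle)),
                      vram_gb=mem.total / 1024**3)
    except Exception:
        pass                                   # no NVIDIA GPU or pynvml unavailable
    return HardwareProfile(os=os_name, cpu_cores=cpu_cores, ram_gb=ram_gb,
                           gpu=gpu, is_apple_silicon=is_apple_silicon)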
Model selection rules (see the sketch after this list):
- External GPU (NVIDIA/AMD): Use 100% of VRAM
- Apple Silicon: Use 50% of unified RAM
- CPU-only: Use 50% of system RAM
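A minimal sketch of these rules as a memory-budget helper; the 0.5 fraction matches the ram_fraction default in the configuration file later in this plan:

def get_available_memory(hardware: HardwareProfile) -> float:
    """Return the memory budget in GB for model instances."""
    if hardware.gpu is not None and not hardware.is_apple_silicon:
        return hardware.gpu.vram_gb        # external GPU: use all VRAM
    return hardware.ram_gb * 0.5           # Apple Silicon / CPU-only: half of RAM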
1.2 Model Registry
File: src/models/registry.py
Model database (YAML format):
models:
  qwen2.5-coder:
    name: "Qwen 2.5 Coder"
    description: "Alibaba's code-focused model"
    variants:
      - size: 3b
        base_vram_gb: 6.0        # Approximate VRAM for fp16
        quantizations:
          q4_k_m:
            vram_gb: 1.8
            quality: "good"
          q5_k_m:
            vram_gb: 2.2
            quality: "better"
          q6_k:
            vram_gb: 2.6
            quality: "best"
      - size: 7b
        base_vram_gb: 14.0
        quantizations:
          q4_k_m:
            vram_gb: 4.5
          q5_k_m:
            vram_gb: 5.2
          q6_k:
            vram_gb: 6.0
  codellama:
    name: "CodeLlama"
    # Similar structure...
  deepseek-coder:
    name: "DeepSeek Coder"
    # Similar structure...
Selection priority:
- Qwen 2.5 Coder (best for small sizes)
- DeepSeek Coder (good alternative)
- CodeLlama (fallback)
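A minimal loader sketch for this registry, assuming PyYAML; it returns the raw mapping from the YAML above rather than introducing new classes:

import yaml

def load_registry(path: str = "config/models.yaml") -> dict:
    """Load the model database; keys are model ids, values are their specs."""
    with open(path) as f:
        return yaml.safe_load(f)["models"]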
1.3 Model Selector Logic
File: src/models/selector.py
Algorithm:
def select_optimal_model(hardware: HardwareProfile) -> ModelConfig:
    available_memory = get_available_memory(hardware)

    # Try models in priority order
    for model in PRIORITY_MODELS:
        # Find the largest size that fits
        for variant in reversed(model.variants):
            # Try the highest-quality quantization that fits
            for quant in reversed(variant.quantizations):
                total_vram_needed = quant.vram_gb * MIN_INSTANCES
                if total_vram_needed <= available_memory:
                    # Calculate max instances
                    max_instances = int(available_memory // quant.vram_gb)
                    # Cap at a reasonable limit (e.g., 8)
                    instances = min(max_instances, 8)
                    return ModelConfig(model, variant, quant, instances)

    # Fallback to the smallest model
    return FALLBACK_CONFIG
- Minimum instances: 2 (for consensus voting)
- Maximum instances: 8 (to avoid overhead)
Phase 2: Backend Integration (Week 2)
2.1 Base Backend Interface
File: src/backends/base.py
from abc import ABC, abstractmethod
from typing import AsyncIterator

class LLMBackend(ABC):
    @abstractmethod
    async def load_model(self, model_path: str, config: dict) -> bool:
        pass

    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        pass

    @abstractmethod
    def get_memory_usage(self) -> float:
        """Return current VRAM/RAM usage in GB."""
        pass

    @abstractmethod
    def shutdown(self):
        pass
2.2 llama.cpp Backend
File: src/backends/llamacpp.py
Implementation:
- Use llama-cpp-python library
- Support GGUF model format
- GPU acceleration via CUDA/Metal
- Server mode with HTTP API
Key features:
- Model caching to avoid reload
- Context window management
- Batch processing support
Memory calculation:
def calculate_memory_usage(model_path: str) -> float:
    # Parse GGUF metadata
    # Return estimated VRAM usage
    ...
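A minimal backend sketch against the interface above, assuming the llama-cpp-python package; the constructor options shown are a small subset of what the library accepts, and memory reporting is left as a placeholder:

import asyncio
from llama_cpp import Llama

class LlamaCppBackend(LLMBackend):
    def __init__(self):
        self.llm = None

    async def load_model(self, model_path: str, config: dict) -> bool:
        # n_gpu_layers=-1 offloads all layers when the wheel is built with CUDA/Metal.
        self.llm = Llama(
            model_path=model_path,
            n_ctx=config.get("n_ctx", 4096),
            n_gpu_layers=config.get("n_gpu_layers", -1),
            verbose=False,
        )
        return True

    async def generate(self, prompt: str, **kwargs) -> str:
        # Run the blocking call in a thread so the event loop stays responsive.
        result = await asyncio.to_thread(
            self.llm, prompt, max_tokens=kwargs.get("max_tokens", 512)
        )
        return result["choices"][0]["text"]

    async def generate_stream(self, prompt: str, **kwargs):
        for chunk in self.llm(prompt, max_tokens=kwargs.get("max_tokens", 512), stream=True):
            yield chunk["choices"][0]["text"]

    def get_memory_usage(self) -> float:
        return 0.0  # placeholder; could use the GGUF-based estimate above

    def shutdown(self):
        self.llm = None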
2.3 MLX Backend (macOS)
File: src/backends/mlx.py
Implementation:
- Use mlx-lm library
- Support MLX format models
- Optimized for Apple Silicon
Key differences from llama.cpp:
- Native Metal performance
- Simpler API
- Unified memory model
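A minimal MLX backend sketch, assuming the mlx-lm package; streaming here simply yields the full completion (mlx-lm also provides token-level streaming helpers), and memory reporting is a placeholder:

import asyncio
from mlx_lm import load, generate

class MLXBackend(LLMBackend):
    def __init__(self):
        self.model = None
        self.tokenizer = None

    async def load_model(self, model_path: str, config: dict) -> bool:
        # load() accepts a local path or repo id for MLX-format models.
        self.model, self.tokenizer = load(model_path)
        return True

    async def generate(self, prompt: str, **kwargs) -> str:
        return await asyncio.to_thread(
            generate, self.model, self.tokenizer,
            prompt=prompt, max_tokens=kwargs.get("max_tokens", 512),
        )

    async def generate_stream(self, prompt: str, **kwargs):
        # Simplest possible streaming: yield the full completion at once.
        yield await self.generate(prompt, **kwargs)

    def get_memory_usage(self) -> float:
        return 0.0  # placeholder

    def shutdown(self):
        self.model = self.tokenizer = None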
Phase 3: Swarm Management (Week 3)
3.1 Worker Instance
File: src/swarm/worker.py
Each worker manages:
- One LLM instance
- Request queue
- Health monitoring
- Metrics collection
class SwarmWorker:
    def __init__(self, worker_id: int, backend: LLMBackend, config: dict):
        self.worker_id = worker_id
        self.backend = backend
        self.is_healthy = True
        self.request_count = 0
        self.avg_latency = 0.0

    async def process(self, request: GenerationRequest) -> GenerationResponse:
        start = time.time()
        response = await self.backend.generate(**request.params)
        latency = time.time() - start
        self._update_metrics(latency)
        return GenerationResponse(response, latency, self.worker_id)
3.2 Swarm Manager
File: src/swarm/manager.py
Responsibilities:
- Spawn N workers based on hardware
- Distribute requests to all workers
- Collect responses
- Handle worker failures
class SwarmManager:
    def __init__(self, config: ModelConfig):
        self.workers: List[SwarmWorker] = []
        self.config = config

    async def initialize(self):
        # Download model if needed
        model_path = await self._ensure_model()
        # Spawn workers
        for i in range(self.config.instances):
            backend = self._create_backend()
            await backend.load_model(model_path, self.config.backend_params)
            worker = SwarmWorker(i, backend, self.config)
            self.workers.append(worker)

    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        # Send to all workers in parallel
        tasks = [w.process(request) for w in self.workers]
        return await asyncio.gather(*tasks)
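The responsibilities above include handling worker failures; one possible sketch (an assumption of this plan, not settled behavior) is to gather with return_exceptions and drop failed workers' results:

    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        tasks = [w.process(request) for w in self.workers]
        results = await asyncio.gather(*tasks, return_exceptions=True)
        responses = []
        for worker, result in zip(self.workers, results):
            if isinstance(result, Exception):
                worker.is_healthy = False     # mark for health monitoring
            else:
                responses.append(result)
        return responses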
3.3 Consensus Algorithm
File: src/swarm/consensus.py
Voting strategies:
- Similarity voting (default):
  - Embed all responses
  - Group by semantic similarity
  - Return the largest group
- Quality scoring:
  - Score each response on completeness (does it answer the question?), code quality (syntax, structure), and length appropriateness
  - Return the highest-scoring response
- Latency-weighted:
  - Prefer faster responses (lower memory pressure)
Implementation:
class ConsensusEngine:
    def __init__(self, strategy: str = "similarity"):
        self.strategy = strategy
        self.embedding_model = None  # Lazy load

    async def select_best(self, responses: List[GenerationResponse]) -> str:
        if len(responses) == 1:
            return responses[0].text
        if self.strategy == "similarity":
            return await self._similarity_vote(responses)
        elif self.strategy == "quality":
            return await self._quality_score(responses)
        else:
            return self._fastest_response(responses)

    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        # Use sentence-transformers for embeddings
        # Group by cosine similarity > 0.85
        # Return median response from largest group
        ...
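A sketch of _similarity_vote following the comments above, assuming sentence-transformers and scikit-learn are available; the embedding model name and 0.85 threshold are illustrative, and a representative response from the largest agreement group is returned:

from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity

    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        if self.embedding_model is None:
            self.embedding_model = SentenceTransformer("all-MiniLM-L6-v2")
        texts = [r.text for r in responses]
        embeddings = self.embedding_model.encode(texts)
        sims = cosine_similarity(embeddings)
        # Each response's agreement count = how many responses it is similar to.
        agreement = (sims > 0.85).sum(axis=1)
        # Return a representative response from the largest agreement group.
        return texts[int(agreement.argmax())]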
Phase 4: API Server (Week 4)
4.1 OpenAI-Compatible Endpoints
File: src/api/routes.py
Required endpoints:
- GET /v1/models - List available models
- POST /v1/chat/completions - Chat completion
- POST /v1/completions - Text completion (optional)
- GET /health - Health check
- GET /metrics - Prometheus metrics (optional)
Chat completions endpoint:
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    # Extract messages
    messages = request.messages
    prompt = format_messages(messages)

    # Get all responses from swarm
    responses = await swarm_manager.generate_all(prompt, **request.params)

    # Run consensus
    best_response = await consensus_engine.select_best(responses)

    # Format as OpenAI response
    return {
        "id": f"chatcmpl-{uuid4()}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": best_response},
            "finish_reason": "stop"
        }],
        "usage": calculate_usage(prompt, best_response)
    }
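format_messages is left unspecified in this plan; a minimal placeholder sketch (a real deployment would apply the model's own chat template):

def format_messages(messages) -> str:
    # Flatten OpenAI-style messages into a single prompt string.
    lines = [f"{m['role']}: {m['content']}" for m in messages]
    lines.append("assistant:")
    return "\n".join(lines)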
4.2 Streaming Support
File: src/api/routes.py
For streaming, use the fastest worker instead of consensus:
if request.stream:
    # Pick the worker with the lowest latency
    worker = swarm_manager.get_fastest_worker()
    return StreamingResponse(
        worker.stream_generate(prompt),
        media_type="text/event-stream"
    )
Phase 5: CLI & Distribution (Week 5)
5.1 CLI Interface
File: main.py
Commands:
# Start the swarm (auto-detect hardware)
python -m local_swarm
# Start with specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Start with specific port
python -m local_swarm --port 8080
# Override instance count
python -m local_swarm --instances 4
# Show hardware detection
python -m local_swarm --detect
# Download models only
python -m local_swarm --download-only
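A minimal argparse sketch mirroring the commands above; flag names follow the list, defaults are assumptions:

import argparse

def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(
        prog="local_swarm",
        description="Run a local swarm of small coding LLMs")
    parser.add_argument("--model", help="model spec, e.g. qwen2.5-coder:3b:q4")
    parser.add_argument("--port", type=int, default=8000, help="API server port")
    parser.add_argument("--instances", type=int, help="override the detected instance count")
    parser.add_argument("--detect", action="store_true", help="print hardware detection and exit")
    parser.add_argument("--download-only", action="store_true", help="download models and exit")
    return parser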
5.2 Configuration File
File: config.yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"   # similarity, quality, fastest
  min_instances: 2
  max_instances: 8
  timeout: 60

models:
  cache_dir: "~/.local_swarm/models"
  preferred_models:
    - qwen2.5-coder
    - deepseek-coder
    - codellama

hardware:
  gpu_memory_fraction: 1.0   # Use 100% of GPU VRAM
  ram_fraction: 0.5          # Use 50% of system RAM for CPU/Apple Silicon
5.3 Installation Scripts
Windows (scripts/install.bat):
@echo off
echo Installing Local Swarm...
python -m pip install --upgrade pip
pip install -r requirements.txt
:: Check for CUDA
nvidia-smi >nul 2>&1
if %errorlevel% == 0 (
echo CUDA detected, installing GPU support...
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
) else (
echo No CUDA detected, using CPU backend...
pip install llama-cpp-python
)
echo Installation complete!
echo Run: python -m local_swarm
macOS/Linux (scripts/install.sh):
#!/bin/bash
set -e
echo "Installing Local Swarm..."
pip install --upgrade pip
# Detect platform
if [[ "$OSTYPE" == "darwin"* ]]; then
echo "macOS detected..."
pip install -r requirements.txt
pip install -r requirements-macos.txt
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
echo "Linux detected..."
pip install -r requirements.txt
if command -v nvidia-smi &> /dev/null; then
echo "CUDA detected, installing GPU support..."
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
else
pip install llama-cpp-python
fi
fi
echo "Installation complete!"
echo "Run: python -m local_swarm"
Phase 6: Testing & Polish (Week 6)
6.1 Test Coverage
Unit tests:
- Hardware detection mocking
- Model selection logic
- Consensus algorithm
- API endpoint validation
Integration tests:
- End-to-end inference
- Multi-worker coordination
- Error handling
Platform tests:
- Windows with NVIDIA
- macOS with M1/M2/M3
- Linux with CUDA
- CPU-only fallback
6.2 Performance Optimization
- Model warmup: Pre-load models on startup
- Request batching: Group similar requests
- Worker pooling: Reuse workers instead of respawning
- Memory monitoring: Auto-shutdown if OOM
6.3 Documentation
- API documentation (OpenAPI spec)
- Configuration guide
- Troubleshooting
- Performance tuning tips
Technical Decisions
Why llama.cpp?
- Best cross-platform support
- Mature quantization formats (GGUF)
- Active community
- Good performance/quality tradeoff
Why MLX for macOS?
- Native Apple Silicon optimization
- Simpler than llama.cpp on macOS
- Better unified memory handling
Why consensus voting?
- Improves response quality vs single model
- Uses available hardware efficiently
- Can detect model hallucinations
Memory Model
External GPU (NVIDIA/AMD):
- Use all VRAM for model instances, minus a ~10% buffer reserved for OS/drivers
- Each instance gets equal share
Apple Silicon:
- Use 50% of unified RAM
- Avoid system swap
- Monitor memory pressure
CPU-only:
- Use 50% of system RAM
- Instance count depends on available RAM
- Slower but functional
Future Enhancements
- Multi-GPU support: Distribute across multiple GPUs
- Dynamic scaling: Add/remove workers based on load
- Model mixing: Different models in same swarm
- Fine-tuning: Local fine-tuning on user data
- Web UI: Browser-based configuration
- Docker support: Containerized deployment
- Cloud inference: Fallback to cloud APIs
Success Metrics
- Startup time: < 30 seconds from cold start
- First inference: < 10 seconds after startup
- Concurrent requests: Support 2-8 parallel inferences
- Consensus accuracy: > 80% agreement on code tasks
- Memory efficiency: Use > 80% of available memory
- Cross-platform: Works on Windows/macOS/Linux without code changes