Initial commit: Local Swarm project structure and documentation

2026-02-23 16:46:31 +01:00
commit 8cf1e16703
15 changed files with 1407 additions and 0 deletions
.gitignore (new file, +153 lines)
# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
*$py.class
# C extensions
*.so
# Distribution / packaging
.Python
build/
develop-eggs/
dist/
downloads/
eggs/
.eggs/
lib/
lib64/
parts/
sdist/
var/
wheels/
share/python-wheels/
*.egg-info/
.installed.cfg
*.egg
MANIFEST
# PyInstaller
*.manifest
*.spec
# Installer logs
pip-log.txt
pip-delete-this-directory.txt
# Unit test / coverage reports
htmlcov/
.tox/
.nox/
.coverage
.coverage.*
.cache
nosetests.xml
coverage.xml
*.cover
*.py,cover
.hypothesis/
.pytest_cache/
cover/
# Translations
*.mo
*.pot
# Django stuff:
*.log
local_settings.py
db.sqlite3
db.sqlite3-journal
# Flask stuff:
instance/
.webassets-cache
# Scrapy stuff:
.scrapy
# Sphinx documentation
docs/_build/
# PyBuilder
.pybuilder/
target/
# Jupyter Notebook
.ipynb_checkpoints
# IPython
profile_default/
ipython_config.py
# pyenv
.python-version
# pipenv
Pipfile.lock
# poetry
poetry.lock
# pdm
.pdm.toml
# PEP 582
__pypackages__/
# Celery stuff
celerybeat-schedule
celerybeat.pid
# SageMath parsed files
*.sage.py
# Environments
.env
.venv
env/
venv/
ENV/
env.bak/
venv.bak/
# Spyder project settings
.spyderproject
.spyproject
# Rope project settings
.ropeproject
# mkdocs documentation
/site
# mypy
.mypy_cache/
.dmypy.json
dmypy.json
# Pyre type checker
.pyre/
# pytype static type analyzer
.pytype/
# Cython debug symbols
cython_debug/
# PyCharm
.idea/
# VS Code
.vscode/
# Model cache
models/
*.gguf
*.mlx
.cache/
# Local swarm specific
config.local.yaml
*.pid
logs/
PLAN.md (new file, +574 lines)
# Local Swarm - Detailed Implementation Plan
## Overview
A terminal-based tool that automatically configures and runs a swarm of small coding LLMs optimized for your hardware, exposing an OpenAI-compatible API for integration with opencode and other tools.
## Architecture
```
local_swarm/
├── src/
│   ├── __init__.py
│   ├── hardware/
│   │   ├── __init__.py
│   │   ├── detector.py        # Platform-agnostic hardware detection
│   │   ├── nvidia.py          # NVIDIA GPU detection (Windows/Linux)
│   │   ├── apple_silicon.py   # Apple Silicon detection (macOS)
│   │   └── memory.py          # RAM detection
│   ├── models/
│   │   ├── __init__.py
│   │   ├── registry.py        # Model database with specs
│   │   ├── selector.py        # Optimal model/quant selection logic
│   │   └── downloader.py      # Download manager (HuggingFace)
│   ├── backends/
│   │   ├── __init__.py
│   │   ├── base.py            # Backend interface
│   │   ├── llamacpp.py        # llama.cpp backend
│   │   └── mlx.py             # MLX backend (macOS)
│   ├── swarm/
│   │   ├── __init__.py
│   │   ├── manager.py         # Instance lifecycle management
│   │   ├── worker.py          # Individual LLM instance wrapper
│   │   └── consensus.py       # Voting/consensus algorithm
│   └── api/
│       ├── __init__.py
│       ├── server.py          # FastAPI/uvicorn server
│       ├── routes.py          # OpenAI-compatible endpoints
│       └── middleware.py      # Request handling
├── tests/
├── config/
│   └── models.yaml            # Model configurations
├── scripts/
│   ├── install.bat            # Windows installer
│   └── install.sh             # Unix installer
├── main.py                    # CLI entry point
├── requirements.txt
├── requirements-macos.txt     # MLX-specific deps
├── setup.py
└── .gitignore
```
## Implementation Phases
### Phase 1: Foundation (Week 1)
#### 1.1 Hardware Detection Module
**File**: `src/hardware/detector.py`
**Requirements**:
- Cross-platform OS detection (Windows, macOS, Linux)
- CPU info (cores, architecture)
- RAM detection (total, available)
- GPU detection with VRAM
**Platform-specific implementations**:
- **Windows**: Use `pynvml` for NVIDIA, fallback to DirectX for others
- **macOS**: Use `psutil` for RAM, `sysctl` for CPU, Metal API for GPU
- **Linux**: Use `pynvml` for NVIDIA, `rocm-smi` for AMD
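As a concrete sketch of the NVIDIA path, assuming `pynvml` is installed (the `detect_nvidia_gpu` helper and the exact `GPUInfo` fields are illustrative, not the final API):

```python
# Sketch: NVIDIA detection via NVML (pynvml). Names are illustrative.
from dataclasses import dataclass
from typing import Optional

import pynvml


@dataclass
class GPUInfo:
    name: str
    vram_gb: float


def detect_nvidia_gpu() -> Optional[GPUInfo]:
    """Return the first NVIDIA GPU found, or None if NVML is unavailable."""
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # No NVIDIA driver / no NVIDIA GPU present
    try:
        if pynvml.nvmlDeviceGetCount() == 0:
            return None
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        name = pynvml.nvmlDeviceGetName(handle)
        if isinstance(name, bytes):  # older pynvml releases return bytes
            name = name.decode()
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return GPUInfo(name=name, vram_gb=mem.total / 1024**3)
    finally:
        pynvml.nvmlShutdown()
```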
**Output structure**:
```python
class HardwareProfile:
    os: str                  # 'windows', 'darwin', 'linux'
    cpu_cores: int
    ram_gb: float
    gpu: Optional[GPUInfo]
    is_apple_silicon: bool
```
**Model selection rules**:
- External GPU (NVIDIA/AMD): Use 100% of VRAM
- Apple Silicon: Use 50% of unified RAM
- CPU-only: Use 50% of system RAM
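In code form, the memory-budget rule might look like this (a sketch against the `HardwareProfile` above; the driver buffer is handled separately, see the Memory Model section):

```python
# Sketch: derive the memory budget (GB) the selector plans against.
def get_available_memory(hw: HardwareProfile) -> float:
    if hw.gpu is not None and not hw.is_apple_silicon:
        # Discrete GPU: plan against the VRAM pool (a small driver buffer
        # is subtracted later; see the Memory Model section).
        return hw.gpu.vram_gb
    # Apple Silicon unified memory or CPU-only: use half of system RAM.
    return hw.ram_gb * 0.5
```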
#### 1.2 Model Registry
**File**: `src/models/registry.py`
**Model database** (YAML format):
```yaml
models:
  qwen2.5-coder:
    name: "Qwen 2.5 Coder"
    description: "Alibaba's code-focused model"
    variants:
      - size: 3b
        base_vram_gb: 6.0        # Approximate VRAM for fp16
        quantizations:
          q4_k_m:
            vram_gb: 1.8
            quality: "good"
          q5_k_m:
            vram_gb: 2.2
            quality: "better"
          q6_k:
            vram_gb: 2.6
            quality: "best"
      - size: 7b
        base_vram_gb: 14.0
        quantizations:
          q4_k_m:
            vram_gb: 4.5
          q5_k_m:
            vram_gb: 5.2
          q6_k:
            vram_gb: 6.0
  codellama:
    name: "CodeLlama"
    # Similar structure...
  deepseek-coder:
    name: "DeepSeek Coder"
    # Similar structure...
```
**Selection priority**:
1. Qwen 2.5 Coder (best for small sizes)
2. DeepSeek Coder (good alternative)
3. CodeLlama (fallback)
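A sketch of loading this registry into Python records (the dataclass names are placeholders, not the final API):

```python
# Sketch: load config/models.yaml into simple records. Names are placeholders.
from dataclasses import dataclass
from pathlib import Path
from typing import Dict, List

import yaml


@dataclass
class Quantization:
    name: str
    vram_gb: float
    quality: str = ""


@dataclass
class Variant:
    size: str
    base_vram_gb: float
    quantizations: List[Quantization]


@dataclass
class ModelEntry:
    key: str
    name: str
    variants: List[Variant]


def load_registry(path: Path = Path("config/models.yaml")) -> Dict[str, ModelEntry]:
    raw = yaml.safe_load(path.read_text())
    registry: Dict[str, ModelEntry] = {}
    for key, spec in raw["models"].items():
        variants = [
            Variant(
                size=v["size"],
                base_vram_gb=v["base_vram_gb"],
                quantizations=[
                    Quantization(name=qname, **qspec)
                    for qname, qspec in v.get("quantizations", {}).items()
                ],
            )
            for v in spec.get("variants", [])
        ]
        registry[key] = ModelEntry(key=key, name=spec["name"], variants=variants)
    return registry
```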
#### 1.3 Model Selector Logic
**File**: `src/models/selector.py`
**Algorithm**:
```python
def select_optimal_model(hardware: HardwareProfile) -> ModelConfig:
    available_memory = get_available_memory(hardware)
    # Try models in priority order
    for model in PRIORITY_MODELS:
        # Find largest size that fits
        for variant in reversed(model.variants):
            # Try highest quantization that fits
            for quant in reversed(variant.quantizations):
                total_vram_needed = quant.vram_gb * MIN_INSTANCES
                if total_vram_needed <= available_memory:
                    # Calculate max instances
                    max_instances = int(available_memory // quant.vram_gb)
                    # Cap at reasonable limit (e.g., 8)
                    instances = min(max_instances, 8)
                    return ModelConfig(model, variant, quant, instances)
    # Fallback to smallest model
    return FALLBACK_CONFIG
```
**Minimum instances**: 2 (for consensus voting)
**Maximum instances**: 8 (to avoid overhead)
### Phase 2: Backend Integration (Week 2)
#### 2.1 Base Backend Interface
**File**: `src/backends/base.py`
```python
from abc import ABC, abstractmethod
from typing import AsyncIterator


class LLMBackend(ABC):
    @abstractmethod
    async def load_model(self, model_path: str, config: dict) -> bool:
        pass

    @abstractmethod
    async def generate(self, prompt: str, **kwargs) -> str:
        pass

    @abstractmethod
    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        pass

    @abstractmethod
    def get_memory_usage(self) -> float:
        """Return current VRAM/RAM usage in GB"""
        pass

    @abstractmethod
    def shutdown(self):
        pass
```
#### 2.2 llama.cpp Backend
**File**: `src/backends/llamacpp.py`
**Implementation**:
- Use `llama-cpp-python` library
- Support GGUF model format
- GPU acceleration via CUDA/Metal
- Server mode with HTTP API
**Key features**:
- Model caching to avoid reload
- Context window management
- Batch processing support
**Memory calculation**:
```python
def calculate_memory_usage(model_path: str) -> float:
    # Parse GGUF metadata
    # Return estimated VRAM usage
    ...
```
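A rough sketch of this backend against the `LLMBackend` interface, assuming `llama-cpp-python` (parameter defaults and the sync-to-async bridging below are illustrative, not the final implementation):

```python
# Sketch: llama.cpp backend via llama-cpp-python. Illustrative, not final.
import asyncio
from typing import AsyncIterator, Optional

from llama_cpp import Llama


class LlamaCppBackend(LLMBackend):
    def __init__(self) -> None:
        self.llm: Optional[Llama] = None

    async def load_model(self, model_path: str, config: dict) -> bool:
        # Constructing Llama() is blocking, so keep it off the event loop.
        self.llm = await asyncio.to_thread(
            Llama,
            model_path=model_path,
            n_ctx=config.get("n_ctx", 4096),
            n_gpu_layers=config.get("n_gpu_layers", -1),  # -1 = offload all layers
            verbose=False,
        )
        return True

    async def generate(self, prompt: str, **kwargs) -> str:
        out = await asyncio.to_thread(
            self.llm.create_completion,
            prompt,
            max_tokens=kwargs.get("max_tokens", 512),
            temperature=kwargs.get("temperature", 0.2),
        )
        return out["choices"][0]["text"]

    async def generate_stream(self, prompt: str, **kwargs) -> AsyncIterator[str]:
        # stream=True yields chunks; a real implementation would push this
        # sync generator onto a thread to avoid blocking the event loop.
        for chunk in self.llm.create_completion(
            prompt, stream=True, max_tokens=kwargs.get("max_tokens", 512)
        ):
            yield chunk["choices"][0]["text"]

    def get_memory_usage(self) -> float:
        return 0.0  # TODO: estimate from GGUF metadata (see sketch above)

    def shutdown(self):
        self.llm = None
```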
#### 2.3 MLX Backend (macOS)
**File**: `src/backends/mlx.py`
**Implementation**:
- Use `mlx-lm` library
- Support MLX format models
- Optimized for Apple Silicon
**Key differences from llama.cpp**:
- Native Metal performance
- Simpler API
- Unified memory model
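For scale, the synchronous core of an MLX worker can be very small; a sketch using `mlx-lm`'s `load`/`generate` helpers (the repo id is only an example, the downloader would pick the actual MLX conversion):

```python
# Sketch: minimal MLX generation with mlx-lm (Apple Silicon only).
from mlx_lm import load, generate

# Example repo id; actual model choice comes from the selector/downloader.
model, tokenizer = load("mlx-community/Qwen2.5-Coder-3B-Instruct-4bit")
text = generate(model, tokenizer, prompt="Write a Python hello world", max_tokens=128)
print(text)
```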
### Phase 3: Swarm Management (Week 3)
#### 3.1 Worker Instance
**File**: `src/swarm/worker.py`
Each worker manages:
- One LLM instance
- Request queue
- Health monitoring
- Metrics collection
```python
import time


class SwarmWorker:
    def __init__(self, worker_id: int, backend: LLMBackend, config: dict):
        self.worker_id = worker_id
        self.backend = backend
        self.is_healthy = True
        self.request_count = 0
        self.avg_latency = 0.0

    async def process(self, request: GenerationRequest) -> GenerationResponse:
        start = time.time()
        response = await self.backend.generate(**request.params)
        latency = time.time() - start
        self._update_metrics(latency)
        return GenerationResponse(response, latency, self.worker_id)
```
#### 3.2 Swarm Manager
**File**: `src/swarm/manager.py`
Responsibilities:
- Spawn N workers based on hardware
- Distribute requests to all workers
- Collect responses
- Handle worker failures
```python
import asyncio
from typing import List


class SwarmManager:
    def __init__(self, config: ModelConfig):
        self.workers: List[SwarmWorker] = []
        self.config = config

    async def initialize(self):
        # Download model if needed
        model_path = await self._ensure_model()
        # Spawn workers
        for i in range(self.config.instances):
            backend = self._create_backend()
            await backend.load_model(model_path, self.config.backend_params)
            worker = SwarmWorker(i, backend, self.config)
            self.workers.append(worker)

    async def generate_all(self, prompt: str, **kwargs) -> List[GenerationResponse]:
        # Wrap the prompt in the request object consumed by SwarmWorker.process
        request = GenerationRequest(params={"prompt": prompt, **kwargs})
        # Send the same request to all workers in parallel
        tasks = [w.process(request) for w in self.workers]
        return await asyncio.gather(*tasks)
```
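Wired together, startup and fan-out could look roughly like this (a sketch; `detect_hardware` stands in for the Phase 1 detector entry point, and the response field names are taken from the worker sketch above):

```python
# Sketch: wiring the manager at startup and fanning a prompt out to the swarm.
import asyncio


async def demo():
    config = select_optimal_model(detect_hardware())  # detect_hardware: assumed Phase 1 entry point
    manager = SwarmManager(config)
    await manager.initialize()
    responses = await manager.generate_all(
        "Write a binary search in Python", max_tokens=256
    )
    for r in responses:
        # worker_id / latency field names assumed from the SwarmWorker sketch
        print(f"worker {r.worker_id}: {r.latency:.2f}s")


if __name__ == "__main__":
    asyncio.run(demo())
```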
#### 3.3 Consensus Algorithm
**File**: `src/swarm/consensus.py`
**Voting strategies**:
1. **Similarity voting** (default):
   - Embed all responses
   - Group by semantic similarity
   - Return largest group
2. **Quality scoring**:
   - Score each response on:
     - Completeness (does it answer the question?)
     - Code quality (syntax, structure)
     - Length appropriateness
   - Return highest score
3. **Latency-weighted**:
   - Prefer faster responses (lower memory pressure)
**Implementation**:
```python
class ConsensusEngine:
    def __init__(self, strategy: str = "similarity"):
        self.strategy = strategy
        self.embedding_model = None  # Lazy load

    async def select_best(self, responses: List[GenerationResponse]) -> str:
        if len(responses) == 1:
            return responses[0].text
        if self.strategy == "similarity":
            return await self._similarity_vote(responses)
        elif self.strategy == "quality":
            return await self._quality_score(responses)
        else:
            return self._fastest_response(responses)

    async def _similarity_vote(self, responses: List[GenerationResponse]) -> str:
        # Use sentence-transformers for embeddings
        # Group by cosine similarity > 0.85
        # Return median response from largest group
        ...
```
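One possible body for `_similarity_vote`, assuming `sentence-transformers`: the greedy grouping below is a simple way to apply the 0.85 threshold, and "most central member of the largest group" stands in for the median-response idea.

```python
# Sketch: greedy similarity grouping over sentence-transformers embeddings.
import numpy as np
from sentence_transformers import SentenceTransformer


def similarity_vote(texts, threshold: float = 0.85) -> str:
    model = SentenceTransformer("all-MiniLM-L6-v2")  # small, fast embedder
    emb = model.encode(texts, convert_to_numpy=True)
    emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)  # unit vectors
    sims = emb @ emb.T                                      # cosine similarity matrix

    # Greedily group responses whose similarity to a group seed exceeds the threshold.
    unassigned = set(range(len(texts)))
    groups = []
    while unassigned:
        seed = unassigned.pop()
        group = [seed] + [j for j in list(unassigned) if sims[seed, j] >= threshold]
        for j in group[1:]:
            unassigned.remove(j)
        groups.append(group)

    largest = max(groups, key=len)
    # Representative answer: the member most similar to the rest of its group.
    best = max(largest, key=lambda i: sims[i, largest].sum())
    return texts[best]
```

In the engine this would be called with `[r.text for r in responses]` and the embedder cached on `self.embedding_model` rather than reloaded per request.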
### Phase 4: API Server (Week 4)
#### 4.1 OpenAI-Compatible Endpoints
**File**: `src/api/routes.py`
Required endpoints:
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion
- `POST /v1/completions` - Text completion (optional)
- `GET /health` - Health check
- `GET /metrics` - Prometheus metrics (optional)
**Chat completions endpoint**:
```python
@app.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest):
    # Extract messages
    messages = request.messages
    prompt = format_messages(messages)

    # Get all responses from swarm
    responses = await swarm_manager.generate_all(prompt, **request.params)

    # Run consensus
    best_response = await consensus_engine.select_best(responses)

    # Format as OpenAI response
    return {
        "id": f"chatcmpl-{uuid4()}",
        "object": "chat.completion",
        "created": int(time.time()),
        "model": request.model,
        "choices": [{
            "index": 0,
            "message": {"role": "assistant", "content": best_response},
            "finish_reason": "stop"
        }],
        "usage": calculate_usage(prompt, best_response)
    }
```
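The `ChatCompletionRequest` model referenced above still has to be defined; a minimal Pydantic sketch (this field set is a starting point, not the full OpenAI schema):

```python
# Sketch: minimal request models for the OpenAI-compatible endpoint.
from typing import List, Optional

from pydantic import BaseModel


class ChatMessage(BaseModel):
    role: str        # "system" | "user" | "assistant"
    content: str


class ChatCompletionRequest(BaseModel):
    model: str = "local-swarm"
    messages: List[ChatMessage]
    temperature: float = 0.2
    max_tokens: Optional[int] = None
    stream: bool = False

    @property
    def params(self) -> dict:
        # Generation kwargs forwarded to the swarm
        return {"temperature": self.temperature, "max_tokens": self.max_tokens}
```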
#### 4.2 Streaming Support
**File**: `src/api/routes.py`
For streaming, use the fastest worker instead of consensus:
```python
if request.stream:
    # Pick worker with lowest latency
    worker = swarm_manager.get_fastest_worker()
    return StreamingResponse(
        worker.stream_generate(prompt),
        media_type="text/event-stream"
    )
```
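Clients expect OpenAI-style SSE framing, so each streamed token has to be wrapped in a `chat.completion.chunk` event before it leaves the server; a sketch of that wrapper around `worker.stream_generate(...)` (the helper name is illustrative):

```python
# Sketch: wrap raw token chunks in OpenAI-style SSE events.
import json
import time
from uuid import uuid4


async def sse_stream(token_iter, model: str = "local-swarm"):
    completion_id = f"chatcmpl-{uuid4()}"
    async for token in token_iter:
        chunk = {
            "id": completion_id,
            "object": "chat.completion.chunk",
            "created": int(time.time()),
            "model": model,
            "choices": [{"index": 0, "delta": {"content": token}, "finish_reason": None}],
        }
        yield f"data: {json.dumps(chunk)}\n\n"
    yield "data: [DONE]\n\n"
```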
### Phase 5: CLI & Distribution (Week 5)
#### 5.1 CLI Interface
**File**: `main.py`
Commands:
```bash
# Start the swarm (auto-detect hardware)
python -m local_swarm
# Start with specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Start with specific port
python -m local_swarm --port 8080
# Override instance count
python -m local_swarm --instances 4
# Show hardware detection
python -m local_swarm --detect
# Download models only
python -m local_swarm --download-only
```
#### 5.2 Configuration File
**File**: `config.yaml`
```yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"   # similarity, quality, fastest
  min_instances: 2
  max_instances: 8
  timeout: 60

models:
  cache_dir: "~/.local_swarm/models"
  preferred_models:
    - qwen2.5-coder
    - deepseek-coder
    - codellama

hardware:
  gpu_memory_fraction: 1.0   # Use 100% of GPU VRAM
  ram_fraction: 0.5          # Use 50% of system RAM for CPU/Apple Silicon
```
#### 5.3 Installation Scripts
**Windows** (`scripts/install.bat`):
```batch
@echo off
echo Installing Local Swarm...
python -m pip install --upgrade pip
pip install -r requirements.txt
:: Check for CUDA
nvidia-smi >nul 2>&1
if %errorlevel% == 0 (
echo CUDA detected, installing GPU support...
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
) else (
echo No CUDA detected, using CPU backend...
pip install llama-cpp-python
)
echo Installation complete!
echo Run: python -m local_swarm
```
**macOS/Linux** (`scripts/install.sh`):
```bash
#!/bin/bash
set -e
echo "Installing Local Swarm..."
pip install --upgrade pip
# Detect platform
if [[ "$OSTYPE" == "darwin"* ]]; then
echo "macOS detected..."
pip install -r requirements.txt
pip install -r requirements-macos.txt
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
echo "Linux detected..."
pip install -r requirements.txt
if command -v nvidia-smi &> /dev/null; then
echo "CUDA detected, installing GPU support..."
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
else
pip install llama-cpp-python
fi
fi
echo "Installation complete!"
echo "Run: python -m local_swarm"
```
### Phase 6: Testing & Polish (Week 6)
#### 6.1 Test Coverage
**Unit tests**:
- Hardware detection mocking
- Model selection logic
- Consensus algorithm
- API endpoint validation
**Integration tests**:
- End-to-end inference
- Multi-worker coordination
- Error handling
**Platform tests**:
- Windows with NVIDIA
- macOS with M1/M2/M3
- Linux with CUDA
- CPU-only fallback
#### 6.2 Performance Optimization
- **Model warmup**: Pre-load models on startup
- **Request batching**: Group similar requests
- **Worker pooling**: Reuse workers instead of respawning
- **Memory monitoring**: Auto-shutdown if OOM
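For the memory-monitoring item, a small `psutil`-based watchdog sketch (the threshold value and the manager's `shutdown()` hook are assumptions):

```python
# Sketch: background memory watchdog using psutil. Threshold is illustrative.
import asyncio

import psutil


async def memory_watchdog(swarm_manager, threshold_pct: float = 92.0,
                          interval_s: float = 5.0):
    """Shut the swarm down gracefully if system memory pressure gets critical."""
    while True:
        if psutil.virtual_memory().percent >= threshold_pct:
            await swarm_manager.shutdown()  # assumed manager method
            break
        await asyncio.sleep(interval_s)
```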
#### 6.3 Documentation
- API documentation (OpenAPI spec)
- Configuration guide
- Troubleshooting
- Performance tuning tips
## Technical Decisions
### Why llama.cpp?
- Best cross-platform support
- Mature quantization formats (GGUF)
- Active community
- Good performance/quality tradeoff
### Why MLX for macOS?
- Native Apple Silicon optimization
- Simpler than llama.cpp on macOS
- Better unified memory handling
### Why consensus voting?
- Improves response quality vs single model
- Uses available hardware efficiently
- Can detect model hallucinations
### Memory Model
**External GPU (NVIDIA/AMD)**:
- Plan against the full VRAM pool
- Subtract a ~10% buffer for OS/drivers before sizing instances
- Each instance gets equal share
**Apple Silicon**:
- Use 50% of unified RAM
- Avoid system swap
- Monitor memory pressure
**CPU-only**:
- Use 50% of system RAM
- Instance count limited by the RAM actually free at startup
- Slower but functional
## Future Enhancements
1. **Multi-GPU support**: Distribute across multiple GPUs
2. **Dynamic scaling**: Add/remove workers based on load
3. **Model mixing**: Different models in same swarm
4. **Fine-tuning**: Local fine-tuning on user data
5. **Web UI**: Browser-based configuration
6. **Docker support**: Containerized deployment
7. **Cloud inference**: Fallback to cloud APIs
## Success Metrics
- **Startup time**: < 30 seconds from cold start
- **First inference**: < 10 seconds after startup
- **Concurrent requests**: Support 2-8 parallel inferences
- **Consensus accuracy**: > 80% agreement on code tasks
- **Memory efficiency**: Use > 80% of available memory
- **Cross-platform**: Works on Windows/macOS/Linux without code changes
README.md (new file, +352 lines)
# Local Swarm
Automatically configure and run a swarm of small coding LLMs optimized for your hardware. Provides an OpenAI-compatible API for seamless integration with opencode and other tools.
## Features
- **Hardware Auto-Detection**: Automatically detects your GPU (NVIDIA), Apple Silicon, or CPU and selects optimal settings
- **Smart Model Selection**: Chooses the best model, quantization, and instance count based on available VRAM/RAM
- **Swarm Consensus**: Multiple LLM instances vote on the best response for higher quality outputs
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI API at `http://localhost:8000/v1`
- **Cross-Platform**: Works on Windows, macOS, and Linux with automatic backend selection
## Quick Start
### Installation
#### Windows (PowerShell)
```powershell
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run installer
.\scripts\install.bat
```
#### macOS/Linux
```bash
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run installer
chmod +x scripts/install.sh
./scripts/install.sh
```
### Usage
#### Start the Swarm
```bash
# Auto-detect hardware and start
python -m local_swarm
# Or use the CLI
python main.py
```
On first run, the tool will:
1. Scan your hardware (GPU, RAM, CPU)
2. Select the optimal model and quantization
3. Download the model (one-time)
4. Start multiple instances based on available memory
5. Expose the API at `http://localhost:8000`
Example startup output:
```
🔍 Detecting hardware...
OS: Windows 11
GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
CPU: 16 cores
RAM: 32 GB
📊 Optimal configuration:
Model: Qwen 2.5 Coder 3B
Quantization: Q4_K_M (1.8 GB per instance)
Instances: 8 (using 14.4 GB VRAM)
⬇️ Downloading model...
Progress: 100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 1.8/1.8 GB
🚀 Starting swarm...
Worker 1: Ready (GPU:0)
Worker 2: Ready (GPU:0)
...
Worker 8: Ready (GPU:0)
✅ Local Swarm is running!
API: http://localhost:8000/v1
Models: http://localhost:8000/v1/models
Health: http://localhost:8000/health
💡 Configure opencode to use:
base_url: http://localhost:8000/v1
api_key: any (not used)
```
#### Configure opencode
Add to your opencode configuration:
```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```
## Configuration
Create a `config.yaml` file for customization:
```yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"   # similarity, quality, fastest
  min_instances: 2
  max_instances: 8

hardware:
  gpu_memory_fraction: 1.0   # Use 100% of GPU VRAM
  ram_fraction: 0.5          # Use 50% of system RAM for CPU/Apple Silicon

models:
  cache_dir: "~/.local_swarm/models"
```
## CLI Options
```bash
# Show hardware detection without starting
python -m local_swarm --detect
# Use specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Use specific port
python -m local_swarm --port 8080
# Force number of instances
python -m local_swarm --instances 4
# Download models only (no server)
python -m local_swarm --download-only
# Show help
python -m local_swarm --help
```
## How It Works
### Hardware Detection
The tool automatically detects your system:
- **Windows**: NVIDIA GPUs via NVML, DirectX fallback
- **macOS**: Apple Silicon via Metal, unified memory model
- **Linux**: NVIDIA (NVML), AMD (ROCm)
### Model Selection
Based on available memory:
1. **External GPU**: Use 100% of VRAM minus OS overhead
2. **Apple Silicon**: Use 50% of unified RAM
3. **CPU-only**: Use 50% of system RAM
The algorithm selects:
- Largest model size that fits
- Highest quantization quality possible
- Maximum instances (2-8) based on memory
Example configurations:
| Hardware | Model | Quant | Instances | Memory Used |
|----------|-------|-------|-----------|-------------|
| RTX 4060 Ti 16GB | Qwen 2.5 7B | Q4_K_M | 3 | ~13.5 GB |
| RTX 4060 Ti 8GB | Qwen 2.5 3B | Q6_K | 3 | ~7.8 GB |
| M3 Pro 36GB | Qwen 2.5 7B | Q4_K_M | 4 | ~18 GB |
| M1 8GB | Qwen 2.5 3B | Q4_K_M | 2 | ~3.6 GB |
| CPU 32GB | Qwen 2.5 3B | Q4_K_M | 8 | ~14.4 GB |
### Swarm Consensus
For each request, the swarm:
1. Sends the prompt to all running instances
2. Collects responses in parallel
3. Runs consensus algorithm:
   - **Similarity**: Groups responses by semantic similarity, returns largest group
   - **Quality**: Scores responses on completeness and code quality
   - **Fastest**: Returns the quickest response
4. Returns the winning response via OpenAI-compatible API
## API Endpoints
### GET /v1/models
List available models
### POST /v1/chat/completions
Chat completion with consensus
**Request**:
```json
{
"model": "local-swarm",
"messages": [
{"role": "user", "content": "Write a Python function to sort a list"}
]
}
```
**Response**:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-swarm",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "def sort_list(lst):\n    return sorted(lst)"
    },
    "finish_reason": "stop"
  }]
}
```
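Any OpenAI-compatible client can call this endpoint; for example, with the official `openai` Python package pointed at the local base URL:

```python
# Example: calling the local swarm with the standard OpenAI Python client.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")
resp = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}],
)
print(resp.choices[0].message.content)
```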
### GET /health
Health check
### GET /metrics
Prometheus metrics (optional)
## Supported Models
Currently supported models (auto-selected based on hardware):
- **Qwen 2.5 Coder** (3B, 7B, 14B) - Recommended for coding tasks
- **DeepSeek Coder** (1.3B, 6.7B, 33B) - Good alternative
- **CodeLlama** (7B, 13B, 34B) - Meta's code model
All models support GGUF quantization:
- Q4_K_M - Good quality, smallest size (recommended)
- Q5_K_M - Better quality
- Q6_K - Best quality
## Troubleshooting
### Out of Memory
If you get OOM errors:
```bash
# Reduce instances
python -m local_swarm --instances 2
# Or use smaller model
python -m local_swarm --model qwen2.5-coder:3b:q4
```
### Slow Performance
- Check GPU utilization with `nvidia-smi` (NVIDIA) or Activity Monitor (macOS)
- Ensure model is cached (first run downloads to `~/.local_swarm/models`)
- Try reducing instances to avoid contention
### Windows: CUDA not detected
Make sure NVIDIA drivers are installed:
```powershell
nvidia-smi
```
If this fails, reinstall the drivers from nvidia.com.
### macOS: MLX not found
```bash
pip install mlx-lm
```
## Requirements
- Python 3.9+
- 4GB+ RAM (8GB+ recommended)
- Optional: NVIDIA GPU with 4GB+ VRAM
- Optional: Apple Silicon Mac
## Development
```bash
# Install dev dependencies
pip install -r requirements-dev.txt
# Run tests
pytest
# Run specific platform tests
pytest tests/test_hardware.py -v
# Format code
black src/
ruff check src/
```
## Architecture
```
┌─────────────────────────────────────┐
│          OpenAI API Client          │
│          (opencode, etc.)           │
└─────────────┬───────────────────────┘
              │ HTTP
┌─────────────────────────────────────┐
│       Local Swarm API Server        │
│     (FastAPI / localhost:8000)      │
└─────────────┬───────────────────────┘
┌─────────────────────────────────────┐
│            Swarm Manager            │
│   ┌─────────┐   ┌─────────┐         │
│   │ Worker 1│   │ Worker 2│  ...    │
│   │(LLM #1) │   │(LLM #2) │         │
│   └────┬────┘   └────┬────┘         │
│        │             │              │
│        └──────┬──────┘              │
│               ▼                     │
│          Consensus Engine           │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│      Backend (llama.cpp / MLX)      │
│     ┌─────────────────────┐         │
│     │   GGUF/MLX Model    │         │
│     │  (Qwen/Codellama)   │         │
│     └─────────────────────┘         │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Hardware (GPU/CPU/Apple Silicon)   │
└─────────────────────────────────────┘
```
## License
MIT License - See LICENSE file
## Contributing
Contributions welcome! Please read CONTRIBUTING.md first.
## Acknowledgments
- [llama.cpp](https://github.com/ggerganov/llama.cpp) - Inference engine
- [MLX](https://github.com/ml-explore/mlx) - Apple Silicon backend
- [Qwen](https://github.com/QwenLM/Qwen) - Model family
- [HuggingFace](https://huggingface.co) - Model hosting
main.py (new file, +96 lines)
#!/usr/bin/env python3
"""
Local Swarm - Automatically configure and run a swarm of small coding LLMs
"""
import argparse
import sys
from pathlib import Path
# Add src to path
sys.path.insert(0, str(Path(__file__).parent / "src"))
from rich.console import Console
from rich.panel import Panel
console = Console()
def main():
    parser = argparse.ArgumentParser(
        description="Local Swarm - AI-powered coding LLM swarm",
        formatter_class=argparse.RawDescriptionHelpFormatter,
        epilog="""
Examples:
  python main.py                        # Auto-detect and start
  python main.py --detect               # Show hardware detection only
  python main.py --model qwen:3b:q4     # Use specific model
  python main.py --port 8080            # Use custom port
  python main.py --instances 4          # Force 4 instances
        """
    )
    parser.add_argument(
        "--detect",
        action="store_true",
        help="Show hardware detection and exit"
    )
    parser.add_argument(
        "--model",
        type=str,
        help="Model to use (format: name:size:quant, e.g., qwen:3b:q4)"
    )
    parser.add_argument(
        "--port",
        type=int,
        default=8000,
        help="Port to run the API server on (default: 8000)"
    )
    parser.add_argument(
        "--instances",
        type=int,
        help="Force number of instances (overrides auto-calculation)"
    )
    parser.add_argument(
        "--download-only",
        action="store_true",
        help="Download models only, don't start server"
    )
    parser.add_argument(
        "--config",
        type=str,
        default="config.yaml",
        help="Path to config file"
    )
    parser.add_argument(
        "--version",
        action="version",
        version="%(prog)s 0.1.0"
    )
    args = parser.parse_args()

    # Show welcome
    console.print(Panel.fit(
        "[bold blue]Local Swarm[/bold blue] - AI-powered coding LLM swarm\n"
        "Automatically configures optimal LLM setup for your hardware",
        title="Welcome",
        border_style="blue"
    ))

    if args.detect:
        console.print("[yellow]Hardware detection mode - not yet implemented[/yellow]")
        console.print("Run without --detect to start the swarm (once implemented)")
        return

    console.print("[green]Starting Local Swarm...[/green]")
    console.print("[dim]Note: This is a placeholder. Implementation in progress.[/dim]")
    console.print()
    console.print("[bold]Next steps:[/bold]")
    console.print("1. Check PLAN.md for implementation details")
    console.print("2. Start implementing src/hardware/detector.py")
    console.print("3. Continue with other modules")


if __name__ == "__main__":
    main()
requirements-macos.txt (new file, +3 lines)
# macOS specific dependencies
mlx>=0.15.0
mlx-lm>=0.8.0
requirements.txt (new file, +28 lines)
# Core dependencies
pydantic>=2.0.0
pyyaml>=6.0
requests>=2.31.0
tqdm>=4.65.0
psutil>=5.9.0
# API server
fastapi>=0.104.0
uvicorn[standard]>=0.24.0
# Hardware detection
pynvml>=11.5.0
# ML/Embeddings (for consensus)
sentence-transformers>=2.2.0
numpy>=1.24.0
# llama.cpp (CPU version, GPU version installed via scripts)
llama-cpp-python>=0.2.0
# Async
aiohttp>=3.9.0
# CLI
click>=8.1.0
rich>=13.0.0
scripts/install.bat (new file, +55 lines)
@echo off
echo ==========================================
echo Local Swarm - Windows Installer
echo ==========================================
echo.
REM Check Python
python --version >nul 2>&1
if errorlevel 1 (
echo [ERROR] Python is not installed or not in PATH
echo Please install Python 3.9+ from https://python.org
exit /b 1
)
echo [1/4] Checking Python version...
for /f "tokens=2" %%a in ('python --version') do set PYTHON_VERSION=%%a
echo Found Python %PYTHON_VERSION%
echo.
echo [2/4] Upgrading pip...
python -m pip install --upgrade pip
echo.
echo [3/4] Installing base dependencies...
pip install -r requirements.txt
REM Check for CUDA
nvidia-smi >nul 2>&1
if %errorlevel% == 0 (
echo.
echo [4/4] CUDA detected! Installing GPU-accelerated llama.cpp...
pip uninstall -y llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
echo GPU support enabled!
) else (
echo.
echo [4/4] No CUDA detected, using CPU backend...
echo CPU-only mode ^(slower but works on any hardware^)
)
echo.
echo ==========================================
echo Installation Complete!
echo ==========================================
echo.
echo To start Local Swarm:
echo python main.py
echo.
echo To check hardware detection:
echo python main.py --detect
echo.
echo For more options:
echo python main.py --help
echo.
pause
scripts/install.sh (new file, +98 lines)
#!/bin/bash
set -e
echo "=========================================="
echo " Local Swarm - Installer"
echo "=========================================="
echo
# Colors
RED='\033[0;31m'
GREEN='\033[0;32m'
YELLOW='\033[1;33m'
NC='\033[0m' # No Color
# Check Python
if ! command -v python3 &> /dev/null; then
echo -e "${RED}[ERROR] Python 3 is not installed${NC}"
echo "Please install Python 3.9+ and try again"
exit 1
fi
echo "[1/4] Checking Python version..."
PYTHON_VERSION=$(python3 --version | cut -d' ' -f2)
echo " Found Python $PYTHON_VERSION"
echo
echo "[2/4] Upgrading pip..."
python3 -m pip install --upgrade pip
echo
echo "[3/4] Installing base dependencies..."
pip3 install -r requirements.txt
# Detect platform and install appropriate backend
echo
echo "[4/4] Detecting hardware and installing backend..."
if [[ "$OSTYPE" == "darwin"* ]]; then
# macOS
echo " Platform: macOS"
# Check for Apple Silicon
if [[ $(uname -m) == "arm64" ]]; then
echo " Hardware: Apple Silicon detected!"
echo " Installing MLX backend..."
pip3 install -r requirements-macos.txt
echo " ${GREEN}MLX backend installed!${NC}"
else
echo " Hardware: Intel Mac"
echo " Installing llama.cpp (CPU)..."
pip3 install llama-cpp-python
echo " ${GREEN}llama.cpp installed (CPU mode)${NC}"
fi
elif [[ "$OSTYPE" == "linux-gnu"* ]]; then
# Linux
echo " Platform: Linux"
# Check for NVIDIA GPU
if command -v nvidia-smi &> /dev/null; then
echo " Hardware: NVIDIA GPU detected!"
echo " Installing CUDA-enabled llama.cpp..."
pip3 uninstall -y llama-cpp-python 2>/dev/null || true
pip3 install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
echo " ${GREEN}GPU support enabled!${NC}"
else
echo " Hardware: No NVIDIA GPU detected"
echo " Installing llama.cpp (CPU)..."
pip3 install llama-cpp-python
echo " ${GREEN}CPU backend installed${NC}"
fi
# Check for AMD GPU (ROCm)
if command -v rocm-smi &> /dev/null; then
echo -e "${YELLOW}[WARNING] AMD GPU detected but ROCm support is experimental${NC}"
echo " Using CPU backend for now"
fi
else
echo -e "${YELLOW}[WARNING] Unknown platform: $OSTYPE${NC}"
echo " Installing generic CPU backend..."
pip3 install llama-cpp-python
fi
echo
echo "=========================================="
echo " Installation Complete!"
echo "=========================================="
echo
echo "To start Local Swarm:"
echo " python3 main.py"
echo
echo "To check hardware detection:"
echo " python3 main.py --detect"
echo
echo "For more options:"
echo " python3 main.py --help"
echo
setup.py (new file, +48 lines)
from setuptools import setup, find_packages

with open("README.md", "r", encoding="utf-8") as fh:
    long_description = fh.read()

with open("requirements.txt", "r", encoding="utf-8") as fh:
    requirements = [line.strip() for line in fh if line.strip() and not line.startswith("#")]

setup(
    name="local-swarm",
    version="0.1.0",
    author="Local Swarm Contributors",
    description="Automatically configure and run a swarm of small coding LLMs",
    long_description=long_description,
    long_description_content_type="text/markdown",
    url="https://github.com/yourusername/local_swarm",
    packages=find_packages(where="src"),
    package_dir={"": "src"},
    classifiers=[
        "Development Status :: 3 - Alpha",
        "Intended Audience :: Developers",
        "Topic :: Scientific/Engineering :: Artificial Intelligence",
        "License :: OSI Approved :: MIT License",
        "Programming Language :: Python :: 3",
        "Programming Language :: Python :: 3.9",
        "Programming Language :: Python :: 3.10",
        "Programming Language :: Python :: 3.11",
        "Programming Language :: Python :: 3.12",
        "Operating System :: OS Independent",
    ],
    python_requires=">=3.9",
    install_requires=requirements,
    extras_require={
        "macos": ["mlx>=0.15.0", "mlx-lm>=0.8.0"],
        "dev": [
            "pytest>=7.4.0",
            "pytest-asyncio>=0.21.0",
            "black>=23.0.0",
            "ruff>=0.1.0",
            "mypy>=1.6.0",
        ],
    },
    entry_points={
        "console_scripts": [
            "local-swarm=main:main",
        ],
    },
)