# Local Swarm
Automatically configure and run a swarm of small coding LLMs optimized for your hardware. Provides an OpenAI-compatible API for seamless integration with opencode and other tools.
## Features
- **Interactive Menu System**: Easy-to-use menu for selecting model configurations, browsing options, or creating custom setups
- **Hardware Auto-Detection**: Automatically detects your GPU (NVIDIA, AMD, Intel), Apple Silicon, Qualcomm (Android), or CPU and selects optimal settings
- **Smart Model Selection**: Chooses the best model, quantization, and instance count based on available VRAM/RAM
- **Startup Summary**: Clear display of detected hardware, selected model, resource usage, and worker status
- **Swarm Consensus**: Multiple LLM instances vote on the best response for higher quality outputs
- **Network Federation**: Multiple machines on the same network can join into a "federated swarm" for distributed consensus
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI API at `http://localhost:8000/v1`
- **MCP Server**: Model Context Protocol support for tight AI assistant integration
- **Cross-Platform**: Works on Windows, macOS, Linux, and Android (via Termux) with automatic backend selection
## Documentation
- **[Quick Start](#quick-start)** - Get up and running in minutes
- **[Complete Guide](docs/GUIDE.md)** - Comprehensive documentation
  - Opencode configuration examples
  - API reference
  - Troubleshooting guide
  - Performance tuning
  - Advanced configuration
- **[Configuration](#configuration)** - Customize your setup
- **[Interactive Mode](#interactive-mode)** - Using the menu system
- **[Tips & Help](#tips--help)** - Learn about models, quantization, and optimization
## Quick Start
### Installation
#### Windows (PowerShell)
```powershell
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run installer
.\scripts\install.bat
```
#### macOS/Linux
```bash
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run installer
chmod +x scripts/install.sh
./scripts/install.sh
```
#### Android (Termux)
```bash
# In Termux app
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
# Run Termux installer
chmod +x scripts/install-termux.sh
./scripts/install-termux.sh
```
**Note**: Android support is limited to small models (1-3B) due to memory constraints. Requires 8GB+ RAM.
### Usage
#### Start the Swarm
```bash
# Auto-detect hardware and start
python -m local_swarm
# Or use the CLI
python main.py
```
On first run, the tool will:
1. Scan your hardware (GPU, RAM, CPU)
2. Select the optimal model and quantization
3. Download the model (one-time)
4. Start multiple instances based on available memory
5. Expose the API at `http://localhost:8000`
Example startup output:
```
🔍 Detecting hardware...
   OS: Windows 11
   GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
   CPU: 16 cores
   RAM: 32 GB

📊 Optimal configuration:
   Model: Qwen 2.5 Coder 3B
   Quantization: Q4_K_M (1.8 GB per instance)
   Instances: 8 (using 14.4 GB VRAM)

⬇️ Downloading model...
   Progress: 100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 1.8/1.8 GB

🚀 Starting swarm...
   Worker 1: Ready (GPU:0)
   Worker 2: Ready (GPU:0)
   ...
   Worker 8: Ready (GPU:0)

✅ Local Swarm is running!
   API: http://localhost:8000/v1
   Models: http://localhost:8000/v1/models
   Health: http://localhost:8000/health

💡 Configure opencode to use:
   base_url: http://localhost:8000/v1
   api_key: any (not used)
```
#### Configure opencode
Add to your opencode configuration:
```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```
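Any OpenAI-compatible client works the same way. As a quick smoke test, here is a minimal sketch using the official `openai` Python package (assumed installed via `pip install openai`):

```python
# Minimal smoke test against the local swarm, using the OpenAI Python SDK.
from openai import OpenAI

# The swarm ignores the API key, but the SDK requires one to be set.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}],
)
print(response.choices[0].message.content)
```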
#### MCP Server (Optional)
For tighter integration with AI assistants, enable the MCP server:
```bash
python main.py --mcp
```
This runs alongside the HTTP API and exposes tools AI assistants can use:
- `get_hardware_info` - Query CPU, GPU, and RAM
- `get_swarm_status` - Check worker health
- `generate_code` - Generate code with consensus
- `list_available_models` - See what models can run
- `get_worker_details` - Get detailed worker statistics
MCP allows AI assistants to automatically query your hardware capabilities and select appropriate models.
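For illustration only (the project's actual server wiring may differ), registering a tool such as `get_hardware_info` with the official `mcp` Python SDK could look roughly like this; `psutil` is an assumed dependency here:

```python
# Hypothetical MCP tool registration, sketched with the `mcp` SDK's FastMCP
# helper (pip install mcp psutil). Not the project's actual implementation.
import os

import psutil
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-swarm")

@mcp.tool()
def get_hardware_info() -> dict:
    """Return basic CPU and RAM facts for model selection."""
    return {
        "cpu_cores": os.cpu_count(),
        "ram_gb": round(psutil.virtual_memory().total / 2**30, 1),
    }

if __name__ == "__main__":
    mcp.run()  # Serves the tool over stdio by default
```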
## Configuration
Create a `config.yaml` file for customization:
```yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5         # Use 50% of system RAM for CPU/Apple Silicon

federation:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  max_peers: 10

models:
  cache_dir: "~/.local_swarm/models"
```
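The file is plain YAML, so reading it is straightforward. A minimal sketch of merging user values over defaults, assuming PyYAML (`pip install pyyaml`); the `load_config` helper is hypothetical, not the project's actual loader:

```python
# Illustrative config loader: shallow-merge config.yaml over built-in defaults.
from pathlib import Path

import yaml  # PyYAML, assumed installed

DEFAULTS = {
    "server": {"host": "127.0.0.1", "port": 8000},
    "swarm": {"consensus_strategy": "similarity", "min_instances": 2, "max_instances": 8},
}

def load_config(path: str = "config.yaml") -> dict:
    config = {section: dict(values) for section, values in DEFAULTS.items()}
    file = Path(path)
    if file.exists():
        user = yaml.safe_load(file.read_text()) or {}
        for section, values in user.items():
            config.setdefault(section, {}).update(values or {})
    return config

print(load_config()["swarm"]["consensus_strategy"])
```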
## CLI Options
```bash
# Show hardware detection without starting
python -m local_swarm --detect
# Use specific model
python -m local_swarm --model qwen2.5-coder:3b:q4
# Use specific port
python -m local_swarm --port 8080
# Force number of instances
python -m local_swarm --instances 4
# Download models only (no server)
python -m local_swarm --download-only
# Enable MCP server alongside HTTP API
python -m local_swarm --mcp
# Show help
python -m local_swarm --help
# Auto-detect without interactive menu
python -m local_swarm --auto
```
## Interactive Mode
By default, Local Swarm starts in **interactive mode** with a menu system:
```
======================================================================
Local Swarm - Model Selection
======================================================================

----------------------------------------------------------------------
Hardware Detection
----------------------------------------------------------------------
Operating System: Darwin
CPU: 12 cores
System RAM: 24.0 GB
Available RAM: 6.2 GB

GPU Detected:
  Name: Apple Silicon GPU
  Type: Apple Silicon (Unified Memory)
  Total Memory: 24.0 GB
  Available for LLMs: 12.0 GB
  (Using 50% of system RAM)

----------------------------------------------------------------------
Configuration Options
----------------------------------------------------------------------
💡 Recommended: Qwen 2.5 Coder 7b (q6_k)
   Instances: 2
   Memory: 12.0 GB

[1] Recommended Configuration - Qwen 2.5 Coder 7b (q6_k) with 2 instances
[2] Browse All Configurations - See all models that fit your hardware
[3] Custom Configuration - Specify exact model and number of instances

Enter your choice:
```
### Menu Options
1. **Recommended Configuration** - Automatically selects the best model and instance count for your hardware
2. **Browse All Configurations** - Shows all feasible models that fit in your available memory
3. **Custom Configuration** - Step-by-step wizard to select:
   - Model family (Qwen, DeepSeek, CodeLlama)
   - Model size (3B, 7B, 14B)
   - Quantization level (Q4, Q5, Q6)
   - Number of instances (1 to max supported)
To skip the menu and use auto-detection, use the `--auto` flag.
## Startup Summary
When starting, Local Swarm displays a comprehensive summary:
```
======================================================================
Local Swarm - Startup Summary
======================================================================

----------------------------------------------------------------------
Hardware Detection
----------------------------------------------------------------------
Operating System: Darwin
CPU: 12 cores
System RAM: 24.0 GB
Available RAM: 6.2 GB

GPU Detected:
  Name: Apple Silicon GPU
  Type: Apple Silicon (Unified Memory)
  Total Memory: 24.0 GB
  Available for LLMs: 12.0 GB

----------------------------------------------------------------------
Model Configuration
----------------------------------------------------------------------
Model: Qwen 2.5 Coder 7b (q6_k)
Description: Alibaba's code-focused model
Instances: 2
Memory per Instance: 6.0 GB
Total Memory: 12.0 GB
Utilization: 100.0% of available

======================================================================
```
## How It Works
### Hardware Detection
The tool automatically detects your system:
- **Windows**: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI)
- **macOS**: Apple Silicon via Metal, unified memory model
- **Linux**: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI/OpenCL)
- **Android**: Qualcomm Adreno GPUs (via Termux)
**Supported Backends**:
- **NVIDIA**: CUDA via llama.cpp
- **AMD**: ROCm via llama.cpp (Linux, Windows experimental)
- **Intel**: OneAPI/SYCL via llama.cpp
- **Apple Silicon**: Metal via MLX
- **Qualcomm**: CPU fallback on llama.cpp (Android/Termux)
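For NVIDIA cards, for example, detection goes through NVML. A minimal probe with the `pynvml` bindings (an assumption about tooling here, not necessarily what local_swarm itself calls) might look like:

```python
# Minimal NVML probe, assuming the pynvml bindings (pip install nvidia-ml-py).
import pynvml

def detect_nvidia():
    """Return (name, total_vram_gb) for GPU 0, or None when NVML is unavailable."""
    try:
        pynvml.nvmlInit()
    except pynvml.NVMLError:
        return None  # No driver/NVML: caller falls through to other backends
    try:
        handle = pynvml.nvmlDeviceGetHandleByIndex(0)
        name = pynvml.nvmlDeviceGetName(handle)
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
        return name, round(mem.total / 2**30, 1)
    finally:
        pynvml.nvmlShutdown()

print(detect_nvidia())
```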
### Model Selection
Based on available memory:
1. **Discrete GPU**: Use 100% of VRAM minus OS overhead
2. **Apple Silicon**: Use 50% of unified RAM
3. **CPU-only**: Use 50% of system RAM
The algorithm selects:
- Largest model size that fits
- Highest quantization quality possible
- Maximum instances (2-8) based on memory
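A simplified sketch of that selection logic is shown below; the model list and GGUF sizes are illustrative placeholders, not the project's actual tables:

```python
# Illustrative selection: largest model/quant that still allows the minimum
# instance count, then as many instances as memory permits (capped at 8).
CANDIDATES = [  # (model, quant, approx GGUF size in GB), best-first
    ("qwen2.5-coder:14b", "q4_k_m", 8.8),
    ("qwen2.5-coder:7b", "q6_k", 6.0),
    ("qwen2.5-coder:7b", "q4_k_m", 4.5),
    ("qwen2.5-coder:3b", "q6_k", 2.6),
    ("qwen2.5-coder:3b", "q4_k_m", 1.8),
]

def pick_configuration(available_gb: float, min_inst: int = 2, max_inst: int = 8):
    """Return (model, quant, instances) or None if nothing fits."""
    for model, quant, size_gb in CANDIDATES:
        instances = int(available_gb // size_gb)
        if instances >= min_inst:
            return model, quant, min(instances, max_inst)
    return None

print(pick_configuration(14.4))  # -> ('qwen2.5-coder:7b', 'q6_k', 2)
```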
Example configurations:
| Hardware | Model | Quant | Instances | Memory Used |
|----------|-------|-------|-----------|-------------|
| RTX 4090 24GB | Qwen 2.5 14B | Q4_K_M | 2 | ~17.6 GB |
| RTX 4060 Ti 16GB | Qwen 2.5 7B | Q4_K_M | 3 | ~13.5 GB |
| RTX 4060 Ti 8GB | Qwen 2.5 3B | Q6_K | 4 | ~10.4 GB |
| RX 7900 XTX 24GB | Qwen 2.5 14B | Q4_K_M | 2 | ~17.6 GB |
| Arc A770 16GB | Qwen 2.5 7B | Q5_K_M | 2 | ~10.4 GB |
| M4 Max 64GB | Qwen 2.5 14B | Q4_K_M | 4 | ~35.2 GB |
| M3 Pro 36GB | Qwen 2.5 7B | Q4_K_M | 4 | ~18 GB |
| M1 8GB | Qwen 2.5 3B | Q4_K_M | 2 | ~3.6 GB |
| Snapdragon 8 Gen 3 | Qwen 2.5 3B | Q4_K_M | 1 | ~1.8 GB |
| CPU 32GB | Qwen 2.5 3B | Q4_K_M | 8 | ~14.4 GB |
| **Federated (3 machines)** | **Qwen 2.5 7B** | **Q4_K_M** | **9** | **~40.5 GB** |
### Swarm Consensus
For each request, the swarm:
1. Sends the prompt to all running instances
2. Collects responses in parallel
3. Runs a consensus algorithm:
   - **Similarity**: Groups responses by semantic similarity and returns an answer from the largest group
   - **Quality**: Scores responses on completeness and code quality
   - **Fastest**: Returns the quickest response
4. Returns the winning response via the OpenAI-compatible API
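A toy version of the similarity strategy, using `difflib` as a dependency-free stand-in for whatever similarity measure the real consensus engine applies:

```python
# Toy similarity consensus: cluster responses by pairwise text similarity
# and return a representative of the largest cluster.
from difflib import SequenceMatcher

def similarity_consensus(responses: list[str], threshold: float = 0.8) -> str:
    clusters: list[list[str]] = []
    for resp in responses:
        for cluster in clusters:
            if SequenceMatcher(None, resp, cluster[0]).ratio() >= threshold:
                cluster.append(resp)
                break
        else:
            clusters.append([resp])
    # Largest cluster wins; its first member is returned as the answer.
    return max(clusters, key=len)[0]

print(similarity_consensus([
    "def f(x): return sorted(x)",
    "def f(x): return sorted(x)",
    "def f(x): x.sort(); return x",
]))
```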
### Network Federation
Run Local Swarm on multiple machines in the same network to create a "federated swarm":
**Example Setup**:
- Windows PC (RTX 4060 Ti): 4 instances
- Mac Mini (M1): 2 instances
- MacBook (M4): 3 instances
- Total: 9 instances voting on every request
**How it works**:
1. Each machine auto-discovers others via mDNS/Bonjour
2. Each swarm generates responses independently
3. Local consensus picks best response per machine
4. Cross-swarm consensus votes across all machines
5. Best response returned to client
**To enable federation**:
```yaml
federation:
  enabled: true
  discovery_port: 8765   # mDNS/Bonjour discovery
  federation_port: 8766  # Inter-swarm communication
```
Machines will automatically discover each other within 10 seconds.
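As an illustration of the mechanism (the service type and properties here are hypothetical, not the project's actual wire protocol), advertising a swarm over mDNS with the python-zeroconf package looks roughly like this:

```python
# Hypothetical mDNS advertisement, assuming python-zeroconf (pip install zeroconf).
import socket

from zeroconf import ServiceInfo, Zeroconf

zc = Zeroconf()
info = ServiceInfo(
    "_localswarm._tcp.local.",                     # Assumed service type
    "my-pc._localswarm._tcp.local.",
    addresses=[socket.inet_aton("192.168.1.10")],  # This machine's LAN address
    port=8766,                                     # federation_port from config.yaml
    properties={"instances": "4"},                 # Advertise local worker count
)
zc.register_service(info)  # Peers browsing the same type discover this machine
```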
## API Endpoints
### GET /v1/models
List available models
### POST /v1/chat/completions
Chat completion with consensus
**Request**:
```json
{
  "model": "local-swarm",
  "messages": [
    {"role": "user", "content": "Write a Python function to sort a list"}
  ]
}
```
**Response**:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-swarm",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "def sort_list(lst):\n    return sorted(lst)"
    },
    "finish_reason": "stop"
  }]
}
```
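The same call over plain HTTP, without an SDK (assuming the `requests` package):

```python
# Raw POST to the chat completions endpoint (pip install requests).
import requests

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "local-swarm",
        "messages": [{"role": "user", "content": "Write a Python function to sort a list"}],
    },
    timeout=120,  # Consensus across several instances can take a while
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```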
### GET /health
Health check
### GET /metrics
Prometheus metrics (optional)
## Supported Models
Currently supported models (auto-selected based on hardware):
- **Qwen 2.5 Coder** (3B, 7B, 14B) - Recommended for coding tasks
- **DeepSeek Coder** (1.3B, 6.7B, 33B) - Good alternative
- **CodeLlama** (7B, 13B, 34B) - Meta's code model
All models support GGUF quantization:
- Q4_K_M - Good quality, smallest size (recommended)
- Q5_K_M - Better quality
- Q6_K - Best quality
## Troubleshooting
### Out of Memory
If you get OOM errors:
```bash
# Reduce instances
python -m local_swarm --instances 2
# Or use smaller model
python -m local_swarm --model qwen2.5-coder:3b:q4
```
### Slow Performance
- Check GPU utilization with `nvidia-smi` (NVIDIA) or Activity Monitor (macOS)
- Ensure model is cached (first run downloads to `~/.local_swarm/models`)
- Try reducing instances to avoid contention
### Windows: CUDA not detected
Make sure NVIDIA drivers are installed:
```powershell
nvidia-smi
```
If this fails, reinstall the drivers from nvidia.com.
### macOS: MLX not found
```bash
pip install mlx-lm
```
### Linux: AMD GPU not detected
Ensure ROCm is installed:
```bash
rocm-smi
```
If not found, install from https://www.amd.com/en/developer/rocm-hub.html
### Linux: Intel GPU not detected
Install Intel oneAPI:
```bash
# Ubuntu/Debian
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | sudo gpg --dearmor -o /usr/share/keyrings/intel-oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-basekit
```
### Android: Termux issues
- Ensure Termux is installed from F-Droid (not Play Store)
- Run `pkg update` before installation
- Limited to small models (1-3B) due to RAM constraints
- Use CPU backend only (no GPU acceleration on Android yet)
## Requirements
- Python 3.9+
- 4GB+ RAM (8GB+ recommended)
- Optional: NVIDIA/AMD/Intel GPU with 4GB+ VRAM
- Optional: Apple Silicon Mac
- Optional: Android device with 8GB+ RAM (via Termux)
## Development
```bash
# Install dev dependencies
pip install -r requirements-dev.txt
# Run tests
pytest
# Run specific platform tests
pytest tests/test_hardware.py -v
# Format code
black src/
ruff check src/
```
## Architecture
### Single Machine
```
┌─────────────────────────────────────┐
│          OpenAI API Client          │
│          (opencode, etc.)           │
└─────────────┬───────────────────────┘
              │ HTTP
┌─────────────────────────────────────┐
│       Local Swarm API Server        │
│     (FastAPI / localhost:8000)      │
└─────────────┬───────────────────────┘
┌─────────────────────────────────────┐
│            Swarm Manager            │
│  ┌─────────┐  ┌─────────┐           │
│  │ Worker 1│  │ Worker 2│  ...      │
│  │(LLM #1) │  │(LLM #2) │           │
│  └────┬────┘  └────┬────┘           │
│       │            │                │
│       └─────┬──────┘                │
│             ▼                       │
│          Consensus Engine           │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│      Backend (llama.cpp / MLX)      │
│       ┌─────────────────────┐       │
│       │   GGUF/MLX Model    │       │
│       │  (Qwen/CodeLlama)   │       │
│       └─────────────────────┘       │
└─────────────────────────────────────┘
┌─────────────────────────────────────┐
│  Hardware (GPU/CPU/Apple Silicon)   │
└─────────────────────────────────────┘
```
### Federated Swarm (Multiple Machines)
```
┌─────────────────────────────────────────────────────────────┐
│                        Local Network                        │
│                                                             │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
│  │  Windows PC  │   │   Mac Mini   │   │   MacBook    │     │
│  │  (RTX 4060)  │   │     (M1)     │   │     (M4)     │     │
│  │ 4 instances  │   │ 2 instances  │   │ 3 instances  │     │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘     │
│         │                  │                  │             │
│         └──────────────────┼──────────────────┘             │
│                            │                                │
│                   ┌────────┴────────┐                       │
│                   │   Cross-Swarm   │                       │
│                   │    Consensus    │                       │
│                   └────────┬────────┘                       │
│                            │                                │
│                   ┌────────▼────────┐                       │
│                   │    opencode     │                       │
│                   └─────────────────┘                       │
└─────────────────────────────────────────────────────────────┘
```
## License
MIT License - See LICENSE file
## Contributing
Contributions welcome! Please read CONTRIBUTING.md first.
## Acknowledgments
- [llama.cpp](https://github.com/ggerganov/llama.cpp) - Inference engine (CUDA/ROCm/SYCL)
- [MLX](https://github.com/ml-explore/mlx) - Apple Silicon backend
- [Qwen](https://github.com/QwenLM/Qwen) - Model family
- [DeepSeek](https://github.com/deepseek-ai/deepseek-coder) - Model family
- [HuggingFace](https://huggingface.co) - Model hosting
- [ROCm](https://github.com/RadeonOpenCompute/ROCm) - AMD GPU support
- [oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html) - Intel GPU support
- [Termux](https://termux.dev) - Android terminal emulator