# Local Swarm

Automatically configure and run a swarm of small coding LLMs optimized for your hardware. Provides an OpenAI-compatible API for seamless integration with opencode and other tools.

## Features

- **Hardware Auto-Detection**: Detects your GPU (NVIDIA), Apple Silicon, or CPU and selects suitable settings
- **Smart Model Selection**: Chooses the model, quantization, and instance count that best fit the available VRAM/RAM
- **Swarm Consensus**: Multiple LLM instances vote on the best response for higher-quality output
- **OpenAI-Compatible API**: Drop-in replacement for OpenAI API at `http://localhost:8000/v1`
- **Cross-Platform**: Works on Windows, macOS, and Linux with automatic backend selection

## Quick Start

### Installation

#### Windows (PowerShell)
```powershell
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run installer
.\scripts\install.bat
```

#### macOS/Linux
```bash
# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run installer
chmod +x scripts/install.sh
./scripts/install.sh
```

### Usage

#### Start the Swarm
```bash
# Auto-detect hardware and start
python -m local_swarm

# Or run the entry-point script directly
python main.py
```

On first run, the tool will:
1. Scan your hardware (GPU, RAM, CPU)
2. Select the optimal model and quantization
3. Download the model (one-time)
4. Start multiple instances based on available memory
5. Expose the API at `http://localhost:8000`

Example startup output:
```
🔍 Detecting hardware...
   OS: Windows 11
   GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
   CPU: 16 cores
   RAM: 32 GB

📊 Optimal configuration:
   Model: Qwen 2.5 Coder 3B
   Quantization: Q4_K_M (1.8 GB per instance)
   Instances: 8 (using 14.4 GB VRAM)

⬇️  Downloading model...
   Progress: 100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 1.8/1.8 GB

🚀 Starting swarm...
   Worker 1: Ready (GPU:0)
   Worker 2: Ready (GPU:0)
   ...
   Worker 8: Ready (GPU:0)

✅ Local Swarm is running!
   API: http://localhost:8000/v1
   Models: http://localhost:8000/v1/models
   Health: http://localhost:8000/health

💡 Configure opencode to use:
   base_url: http://localhost:8000/v1
   api_key: any (not used)
```

#### Configure opencode

Add to your opencode configuration:

```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```

## Configuration

Create a `config.yaml` file for customization:

```yaml
server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5  # Use 50% of system RAM for CPU/Apple Silicon

models:
  cache_dir: "~/.local_swarm/models"
```
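
As a rough illustration of how these settings could be layered over defaults and sanity-checked, here is a minimal Python sketch. It works on a plain dict, such as the one PyYAML's `yaml.safe_load` would return for the file above. `DEFAULTS`, `merge_config`, and `validate_config` are hypothetical names, and the default values simply mirror the sample config; they are not confirmed project defaults.

```python
from copy import deepcopy

# Assumption: these defaults mirror the sample config.yaml, nothing more.
DEFAULTS = {
    "server": {"host": "127.0.0.1", "port": 8000},
    "swarm": {
        "consensus_strategy": "similarity",
        "min_instances": 2,
        "max_instances": 8,
    },
    "hardware": {"gpu_memory_fraction": 1.0, "ram_fraction": 0.5},
    "models": {"cache_dir": "~/.local_swarm/models"},
}
STRATEGIES = {"similarity", "quality", "fastest"}


def merge_config(overrides, defaults=DEFAULTS):
    """Return a copy of the defaults with user overrides applied per section."""
    merged = deepcopy(defaults)
    for section, values in (overrides or {}).items():
        merged.setdefault(section, {}).update(values)
    return merged


def validate_config(cfg):
    """Return a list of problems; an empty list means no issue was found."""
    problems = []
    swarm, hardware = cfg["swarm"], cfg["hardware"]
    if swarm["consensus_strategy"] not in STRATEGIES:
        problems.append("consensus_strategy must be one of " + ", ".join(sorted(STRATEGIES)))
    if not 1 <= swarm["min_instances"] <= swarm["max_instances"]:
        problems.append("need 1 <= min_instances <= max_instances")
    for key in ("gpu_memory_fraction", "ram_fraction"):
        if not 0 < hardware[key] <= 1:
            problems.append(key + " must be in (0, 1]")
    if not 1 <= cfg["server"]["port"] <= 65535:
        problems.append("port must be between 1 and 65535")
    return problems
```

A file that only sets `server.port` would keep every other value from the defaults, which is why the merge happens per section rather than replacing whole sections.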

## CLI Options

```bash
# Show hardware detection without starting
python -m local_swarm --detect

# Use specific model
python -m local_swarm --model qwen2.5-coder:3b:q4

# Use specific port
python -m local_swarm --port 8080

# Force number of instances
python -m local_swarm --instances 4

# Download models only (no server)
python -m local_swarm --download-only

# Show help
python -m local_swarm --help
```
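
For readers curious how flags like these are typically wired up, the following is a minimal `argparse` sketch covering the options listed above. It is not the project's actual parser: `build_parser` and `parse_model_spec` are hypothetical helpers, and reading the model argument as `family:size:quant` is an assumption based on the `qwen2.5-coder:3b:q4` example.

```python
import argparse


def build_parser():
    """Build a parser for the flags documented in this README."""
    parser = argparse.ArgumentParser(prog="local_swarm")
    parser.add_argument("--detect", action="store_true",
                        help="show hardware detection without starting")
    parser.add_argument("--model", help="model spec, e.g. qwen2.5-coder:3b:q4")
    parser.add_argument("--port", type=int, default=8000, help="API port")
    parser.add_argument("--instances", type=int, help="force number of instances")
    parser.add_argument("--download-only", action="store_true",
                        help="download models only, do not start the server")
    return parser


def parse_model_spec(spec):
    """Split an assumed 'family:size:quant' spec into named parts."""
    parts = spec.split(":")
    if len(parts) != 3:
        raise ValueError("expected family:size:quant, got " + repr(spec))
    family, size, quant = parts
    return {"family": family, "size": size, "quant": quant}
```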

## How It Works

### Hardware Detection

The tool automatically detects your system:
- **Windows**: NVIDIA GPUs via NVML, DirectX fallback
- **macOS**: Apple Silicon via Metal, unified memory model
- **Linux**: NVIDIA (NVML), AMD (ROCm)
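
The platform-to-backend mapping above can be sketched in a few lines of Python. This is a simplified illustration, not the project's detection code: it checks for the `nvidia-smi` and `rocm-smi` executables on the PATH instead of querying NVML, ignores the DirectX fallback, and the backend labels it returns are hypothetical.

```python
import platform
import shutil


def choose_backend(system, machine, has_nvidia=False, has_rocm=False):
    """Map OS, CPU architecture, and GPU availability to a backend label."""
    if system == "Darwin" and machine == "arm64":
        return "mlx"  # Apple Silicon with unified memory
    if has_nvidia:
        return "llama.cpp-cuda"
    if system == "Linux" and has_rocm:
        return "llama.cpp-rocm"
    return "llama.cpp-cpu"


def detect_backend():
    """Probe the current machine using only the standard library."""
    return choose_backend(
        platform.system(),   # "Windows", "Darwin", or "Linux"
        platform.machine(),  # "arm64" on Apple Silicon Macs
        has_nvidia=shutil.which("nvidia-smi") is not None,
        has_rocm=shutil.which("rocm-smi") is not None,
    )
```

Keeping `choose_backend` free of system calls makes the decision logic easy to test on any machine.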

### Model Selection

The memory budget depends on the hardware type:
1. **Discrete GPU**: Use 100% of VRAM minus OS overhead
2. **Apple Silicon**: Use 50% of unified RAM
3. **CPU-only**: Use 50% of system RAM

Within that budget, the algorithm selects:
- The largest model size that fits
- The highest-quality quantization that fits
- As many instances as memory allows, between 2 and 8

Example configurations:

| Hardware | Model | Quant | Instances | Memory Used |
|----------|-------|-------|-----------|-------------|
| RTX 4060 Ti 16GB | Qwen 2.5 Coder 7B | Q4_K_M | 3 | ~13.5 GB |
| RTX 4060 Ti 8GB | Qwen 2.5 Coder 3B | Q6_K | 2 | ~5.2 GB |
| M3 Pro 36GB | Qwen 2.5 Coder 7B | Q4_K_M | 4 | ~18 GB |
| M1 8GB | Qwen 2.5 Coder 3B | Q4_K_M | 2 | ~3.6 GB |
| CPU 32GB | Qwen 2.5 Coder 3B | Q4_K_M | 8 | ~14.4 GB |
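
To make the selection rules concrete, here is a minimal Python sketch of one way they could be implemented. It is an assumption-laden illustration: `CATALOG`, `Plan`, `memory_budget`, and `select_plan` are hypothetical names, the per-instance sizes are back-calculated from figures in this README, and the 1 GB OS overhead is a guess. The project's real heuristic may weigh other factors (for example CPU throughput), so its choices need not match this sketch for every machine.

```python
from dataclasses import dataclass
from typing import Optional

# Ordered by preference: larger model first, then higher-quality quantization.
# Sizes (GB per instance) are illustrative values derived from this README.
CATALOG = [
    ("Qwen 2.5 Coder 7B", "Q4_K_M", 4.5),
    ("Qwen 2.5 Coder 3B", "Q6_K", 2.6),
    ("Qwen 2.5 Coder 3B", "Q4_K_M", 1.8),
]


@dataclass
class Plan:
    model: str
    quant: str
    instances: int
    memory_gb: float


def memory_budget(kind, total_gb, os_overhead_gb=1.0):
    """Return usable memory in GB: VRAM minus overhead for a GPU, else 50% of RAM."""
    if kind == "gpu":
        return max(total_gb - os_overhead_gb, 0.0)
    return total_gb * 0.5  # Apple Silicon and CPU-only


def select_plan(budget_gb, min_instances=2, max_instances=8) -> Optional[Plan]:
    """Pick the first catalog entry that fits at least min_instances copies."""
    for model, quant, size_gb in CATALOG:
        fit = int(budget_gb // size_gb)
        if fit >= min_instances:
            n = min(fit, max_instances)
            return Plan(model, quant, n, round(n * size_gb, 1))
    return None  # nothing fits, even at the smallest size
```

For example, `select_plan(memory_budget("apple", 8))` has a 4 GB budget, which rules out the 7B model and the Q6_K 3B variant and leaves two Q4_K_M 3B instances.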

### Swarm Consensus

For each request, the swarm:
1. Sends the prompt to all running instances
2. Collects responses in parallel
3. Runs consensus algorithm:
   - **Similarity**: Groups responses by semantic similarity, returns largest group
   - **Quality**: Scores responses on completeness and code quality
   - **Fastest**: Returns the quickest response
4. Returns the winning response via OpenAI-compatible API
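
The fan-out and the **Similarity** strategy can be sketched as follows. This is a simplified stand-in, not the project's consensus engine: it measures lexical similarity with the standard library's `difflib.SequenceMatcher` rather than semantic similarity, groups responses greedily, and treats each worker as a plain callable that takes a prompt and returns text. The names `ask_all` and `similarity_consensus` and the 0.8 threshold are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor
from difflib import SequenceMatcher


def ask_all(workers, prompt):
    """Send the prompt to every worker in parallel and collect the replies."""
    if not workers:
        return []
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        return list(pool.map(lambda worker: worker(prompt), workers))


def similarity_consensus(responses, threshold=0.8):
    """Group similar responses and return a representative of the largest group."""
    if not responses:
        raise ValueError("no responses to choose from")
    groups = []  # each group is a list; its first item is the representative
    for response in responses:
        for group in groups:
            if SequenceMatcher(None, group[0], response).ratio() >= threshold:
                group.append(response)
                break
        else:
            groups.append([response])
    # On a tie, max() keeps the group that was formed first.
    return max(groups, key=len)[0]
```

A request would then reduce to `similarity_consensus(ask_all(workers, prompt))`. A production version would likely add per-worker timeouts so one slow instance cannot stall the vote.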

## API Endpoints

### GET /v1/models
List available models

### POST /v1/chat/completions
Chat completion with consensus

**Request**:
```json
{
  "model": "local-swarm",
  "messages": [
    {"role": "user", "content": "Write a Python function to sort a list"}
  ]
}
```

**Response**:
```json
{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-swarm",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "def sort_list(lst):\n    return sorted(lst)"
    },
    "finish_reason": "stop"
  }]
}
```
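
As a usage illustration, the request above can be sent from Python with only the standard library. This is a minimal sketch assuming the server is running at the default `http://localhost:8000/v1`; the helper names are hypothetical, and the `Authorization` header carries a dummy key because the API key is not used.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, model="local-swarm", base_url=BASE_URL):
    """Build a POST request for the chat completions endpoint."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": "Bearer not-needed",  # dummy key, ignored by the server
        },
        method="POST",
    )


def extract_content(response_body):
    """Pull the assistant's text out of a chat completion response dict."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, timeout=120):
    """Send the prompt to the running swarm and return the winning response."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_content(json.load(resp))
```

Calling `chat("Write a Python function to sort a list")` requires the swarm to be running; the two helper functions work offline and are convenient for testing.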

### GET /health
Health check

### GET /metrics
Prometheus metrics (optional)

## Supported Models

Currently supported models (auto-selected based on hardware):

- **Qwen 2.5 Coder** (3B, 7B, 14B) - Recommended for coding tasks
- **DeepSeek Coder** (1.3B, 6.7B, 33B) - Good alternative
- **CodeLlama** (7B, 13B, 34B) - Meta's code model

All models support GGUF quantization:
- Q4_K_M - Good quality, smallest size (recommended)
- Q5_K_M - Better quality
- Q6_K - Best quality

## Troubleshooting

### Out of Memory
If you get OOM errors:
```bash
# Reduce instances
python -m local_swarm --instances 2

# Or use a smaller model
python -m local_swarm --model qwen2.5-coder:3b:q4
```

### Slow Performance
- Check GPU utilization with `nvidia-smi` (NVIDIA) or Activity Monitor (macOS)
- Ensure model is cached (first run downloads to `~/.local_swarm/models`)
- Try reducing instances to avoid contention

### Windows: CUDA not detected
Make sure NVIDIA drivers are installed:
```powershell
nvidia-smi
```
If this fails, reinstall the drivers from nvidia.com.

### macOS: MLX not found
Install the MLX language-model package:
```bash
pip install mlx-lm
```

## Requirements

- Python 3.9+
- 4GB+ RAM (8GB+ recommended)
- Optional: NVIDIA GPU with 4GB+ VRAM
- Optional: Apple Silicon Mac

## Development

```bash
# Install dev dependencies
pip install -r requirements-dev.txt

# Run tests
pytest

# Run specific platform tests
pytest tests/test_hardware.py -v

# Format code
black src/
ruff check src/
```

## Architecture

```
┌─────────────────────────────────────┐
│         OpenAI API Client           │
│        (opencode, etc.)             │
└─────────────┬───────────────────────┘
              │ HTTP
              ▼
┌─────────────────────────────────────┐
│     Local Swarm API Server          │
│    (FastAPI / localhost:8000)       │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│       Swarm Manager                 │
│  ┌─────────┐ ┌─────────┐           │
│  │ Worker 1│ │ Worker 2│ ...       │
│  │(LLM #1) │ │(LLM #2) │           │
│  └────┬────┘ └────┬────┘           │
│       │           │                 │
│       └─────┬─────┘                 │
│             ▼                       │
│      Consensus Engine               │
└─────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│     Backend (llama.cpp / MLX)       │
│    ┌─────────────────────┐          │
│    │   GGUF/MLX Model    │          │
│    │   (Qwen/CodeLlama)  │          │
│    └─────────────────────┘          │
└─────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│    Hardware (GPU/CPU/Apple Silicon) │
└─────────────────────────────────────┘
```

## License

MIT License - See LICENSE file

## Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

## Acknowledgments

- [llama.cpp](https://github.com/ggerganov/llama.cpp) - Inference engine
- [MLX](https://github.com/ml-explore/mlx) - Apple Silicon backend
- [Qwen](https://github.com/QwenLM/Qwen) - Model family
- [HuggingFace](https://huggingface.co) - Model hosting