# Local Swarm

Run a swarm of local LLMs on your hardware. Multiple models work together to give you the best answer through consensus voting.

## What It Does

- **Auto-detects your hardware** (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
- **Downloads and runs multiple LLM instances** optimized for your VRAM/RAM
- **Uses consensus voting** - all instances answer, best response wins
- **Connects multiple machines** on your network for a "hive mind" effect
- **Provides an OpenAI-compatible API** at `http://localhost:17615/v1`

## Quick Start

```bash
# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt

# Run it
python main.py
```

On first run, it will:

1. Detect your hardware
2. Pick the best model and quantization
3. Download the model (one-time)
4. Start multiple LLM workers
5. Expose the API at `http://localhost:17615`
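
Once the workers are up, you can sanity-check the server from Python (a minimal check against the default port, assuming the `requests` package is installed):

```python
import requests

# Hit the swarm's health endpoint on the default port (17615)
resp = requests.get("http://localhost:17615/health", timeout=5)
print(resp.status_code, resp.text)
```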

## Usage

### Interactive Mode (default)

```bash
python main.py
```

Shows a menu with:

- Recommended configuration (auto-selected)
- Browse all compatible models
- Custom configuration wizard

### Auto Mode (no menu)

```bash
python main.py --auto
```

### With Other Options

```bash
python main.py --model qwen:3b:q4 # Use specific model
python main.py --instances 4 # Force 4 workers
python main.py --port 8080 # Custom port
python main.py --detect # Show hardware info only
python main.py --federation # Enable network federation
python main.py --mcp # Enable MCP server
python main.py --use-opencode-tools # Use opencode tools (adds ~27k tokens)
```

**Tool Mode Options:**

- Default: Local tool server (~125 tokens, saves context window space)
- `--use-opencode-tools`: Full opencode tool definitions (~27k tokens, more capabilities)

## Connect to Opencode

Add to your opencode config:

```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```

## Network Federation (Hive Mind)

Run on multiple machines to combine their power:

```bash
# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation

# Machine 2 (Mac Mini M1)
python main.py --auto --federation

# Machine 3 (Old laptop)
python main.py --auto --federation
```

Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses **objective quality scoring** to pick the best answer, rather than self-reported confidence. This prevents smaller models from overruling better models.

**Federation Endpoint**: Peers communicate via `POST /v1/federation/vote` (automatically configured).

## How Consensus Works

1. Your prompt goes to all LLM instances
2. Each instance generates a response independently
3. The consensus algorithm picks the best answer:
   - **Similarity** (default): Groups responses by meaning, picks the largest group (see the sketch after this list)
   - **Quality**: Scores on completeness, code blocks, structure
   - **Fastest**: Returns the quickest response
   - **Majority**: Simple text match voting
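
The sketch below illustrates the similarity idea only; it is not the project's actual implementation in `src/swarm/consensus.py`. It groups near-duplicate answers with Python's standard `difflib` and returns a member of the largest group:

```python
from difflib import SequenceMatcher

def pick_by_similarity(responses: list[str], threshold: float = 0.8) -> str:
    """Group near-identical responses and return one from the largest group."""
    groups: list[list[str]] = []
    for text in responses:
        for group in groups:
            # Compare against the group's first member as its representative
            if SequenceMatcher(None, text, group[0]).ratio() >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])  # no similar group found, start a new one
    largest = max(groups, key=len)
    return largest[0]

# Example: three workers agree, one diverges
answers = ["Use a dict.", "Use a dict!", "Use a dict.", "Sort the list first."]
print(pick_by_similarity(answers))  # -> "Use a dict."
```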

## Configuration

Create `config.yaml`:

```yaml
server:
  host: "127.0.0.1"
  port: 17615

swarm:
  consensus_strategy: "similarity" # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8

federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10
```

## Supported Hardware

| Hardware | Backend | Notes |
|----------|---------|-------|
| NVIDIA GPU | llama.cpp (CUDA) | Best performance |
| AMD GPU | llama.cpp (ROCm) | Linux/Windows |
| Intel GPU | llama.cpp (SYCL) | Linux/Windows |
| Apple Silicon | MLX | Native Metal |
| Qualcomm | llama.cpp (CPU) | Android/Termux |
| CPU-only | llama.cpp | Slower but works |
## Supported Models
|
|
|
|
- **Qwen 2.5 Coder** (3B, 7B, 14B) - Recommended
|
|
- **DeepSeek Coder** (1.3B, 6.7B, 33B)
|
|
- **CodeLlama** (7B, 13B, 34B)
|
|
|
|
All support GGUF quantization (Q4_K_M recommended).
|
|
|
|
## API Endpoints
|
|
|
|
- `GET /v1/models` - List available models
|
|
- `POST /v1/chat/completions` - Chat completion with consensus
|
|
- `GET /health` - Health check
|
|
- `POST /v1/federation/vote` - Federation voting (used internally between peers)
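
Because the server is OpenAI-compatible, any standard OpenAI client can talk to it. Here is a minimal example with the official `openai` Python package, reusing the base URL, placeholder key, and `local-swarm` model name from the Opencode config above:

```python
from openai import OpenAI

# Point a standard OpenAI client at the local swarm
client = OpenAI(base_url="http://localhost:17615/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(response.choices[0].message.content)
```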

## Troubleshooting

### Out of Memory

```bash
python main.py --instances 2 # Reduce workers
python main.py --model qwen:3b:q4 # Use smaller model
```

### Slow Performance

- Check GPU utilization with `nvidia-smi`
- Reduce instances to avoid contention
- Use Q4 quantization instead of Q6

### CUDA Not Detected (Windows)

```powershell
nvidia-smi # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```

### macOS: MLX Not Found

```bash
pip install mlx-lm
```

## Project Structure

```
local_swarm/
├── main.py                         # CLI entry point (99 lines)
├── src/
│   ├── api/                        # OpenAI-compatible API
│   │   ├── routes.py               # HTTP routing (252 lines)
│   │   ├── formatting.py           # Message formatting
│   │   ├── tool_parser.py          # Tool call parsing
│   │   ├── chat_handlers.py        # Chat completion logic
│   │   └── models.py               # API data models
│   ├── cli/                        # Command-line interface
│   │   ├── parser.py               # CLI argument parsing
│   │   ├── main_runner.py          # Main application logic
│   │   ├── server_runner.py        # Server management
│   │   └── test_runner.py          # Test mode execution
│   ├── swarm/                      # Swarm orchestration
│   │   ├── manager.py              # Swarm manager
│   │   ├── worker.py               # LLM worker implementation
│   │   ├── consensus.py            # Consensus algorithms
│   │   └── orchestrator.py         # Generation orchestration
│   ├── models/                     # Model management
│   │   ├── registry.py             # Model registry (194 lines)
│   │   ├── selector.py             # Model selection (329 lines)
│   │   ├── memory_calculator.py    # Memory calculations
│   │   └── downloader.py           # Model downloading
│   ├── hardware/                   # Hardware detection
│   │   ├── detector.py             # Hardware detection
│   │   ├── nvidia.py               # NVIDIA GPU detection
│   │   ├── intel.py                # Intel GPU detection
│   │   └── qualcomm.py             # Qualcomm detection
│   ├── network/                    # Network federation
│   │   ├── federation.py           # Cross-swarm consensus
│   │   └── discovery.py            # Peer discovery
│   ├── backends/                   # LLM backends
│   │   ├── llama_cpp.py            # llama.cpp backend
│   │   ├── mlx.py                  # Apple Silicon MLX backend
│   │   └── base.py                 # Base backend interface
│   ├── interactive/                # Interactive CLI
│   │   ├── ui.py                   # UI utilities
│   │   ├── display.py              # Hardware display
│   │   └── tips.py                 # Help content
│   ├── tools/                      # Tool execution
│   │   └── executor.py             # Tool execution engine
│   └── utils/                      # Shared utilities
│       ├── token_counter.py        # Token counting
│       ├── project_discovery.py    # Project root discovery
│       └── network.py              # Network utilities
├── config/                         # Configuration files
│   └── models/                     # Model configurations
│       ├── model_metadata.json     # Model metadata
│       ├── mlx_quant_sizes.json    # MLX quantization sizes
│       ├── gguf_quant_sizes.json   # GGUF quantization sizes
│       └── selector_config.json    # Selection constants
└── docs/                           # Documentation
```

### Architecture Principles

- **Modular Design**: Each module has a single, focused responsibility
- **Configuration Over Code**: Static data extracted to JSON config files
- **Separation of Concerns**: API, CLI, and business logic are cleanly separated
- **No Files > 300 Lines**: Most modules kept under 300 lines for maintainability

## Development

### Code Quality Standards

This project follows strict code quality standards:

- **File Size**: No files > 300 lines (with few exceptions)
- **Function Size**: No functions > 50 lines
- **Nesting Depth**: No indentation > 3 levels
- **DRY Principle**: No duplicate code (>3 lines)
- **Single Responsibility**: Each module does one thing
- **Configuration Over Code**: Static data in JSON configs

### Running Tests

```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_tool_parsing.py -v

# Run with coverage
python -m pytest tests/ --cov=src
```

### Recent Refactoring

Major refactoring completed to improve modularity:

**Before**: Monolithic files (main.py: 556 lines, routes.py: 1,183 lines)
**After**: Modular architecture (main.py: 99 lines, routes.py: 252 lines)

**Changes**:

- Extracted API logic into focused modules (formatting, parsing, handlers)
- Created CLI package with separated concerns (parser, runner, server)
- Moved hardcoded model data to JSON configuration files
- Created shared utility modules (token_counter, project_discovery, network)
- Reduced code duplication across the codebase

See `docs/ARCHITECTURE.md` for detailed architecture documentation.

## Recent Improvements

### ✅ Universal Tool Support (2025-02-25)

- Tool instructions automatically injected for **all** clients (Continue, hollama, curl, etc.)
- No client-side configuration needed - just use the API
- Enhanced file operation guidance: model uses ls/grep to verify files exist before reading
- Working directory auto-extraction from prompts (`in /path/to/dir` patterns)
- Proper OpenAI tool format with unique IDs and tool_call_id linking
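
For reference, the last point refers to the standard OpenAI tool-calling message shape: the assistant message carries tool calls with unique IDs, and each tool result links back via `tool_call_id`. The tool name and arguments below are purely illustrative:

```python
# Assistant turn: the model requests a tool call with a unique id
assistant_message = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_abc123",  # unique per call
        "type": "function",
        "function": {"name": "ls", "arguments": '{"path": "."}'},
    }],
}

# Tool turn: the result is linked back to the call via tool_call_id
tool_message = {
    "role": "tool",
    "tool_call_id": "call_abc123",
    "content": "main.py\nsrc/\nconfig/\ndocs/",
}
```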

### ✅ OpenCode-Compatible Streaming (2025-02-25)

- Proper `reasoning_content` field for "Thinking..." collapsible blocks
- Multi-chunk `tool_calls` streaming matching Vercel AI SDK format
- Final answer delivered in `content` field after tool execution
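
Roughly, the streamed chunks look like the following (simplified and illustrative; the exact fields the server emits may differ):

```python
# Illustrative sequence of streaming delta payloads (real chunks carry more fields)
example_chunks = [
    # 1. Reasoning streamed via reasoning_content -> rendered as a "Thinking..." block
    {"choices": [{"delta": {"reasoning_content": "Need to list the files first."}}]},
    # 2. Tool calls streamed incrementally via tool_calls
    {"choices": [{"delta": {"tool_calls": [{
        "index": 0,
        "id": "call_abc123",
        "function": {"name": "ls", "arguments": '{"path": "."}'},
    }]}}]},
    # 3. Final answer delivered in content after the tool has run
    {"choices": [{"delta": {"content": "The project contains main.py, src/, config/ and docs/."}}]},
]
```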

### ✅ Federation Quality Voting (2025-02-25)

- Head node now **objectively judges** all peer responses using quality metrics
- No more reliance on self-reported confidence (which was biased toward the local node)
- All responses scored on length, structure, and completeness (sketched below)
- Fair competition: 14B models properly beat 3B on quality tasks
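
As with the consensus sketch earlier, this is only an illustration of the scoring idea, not the code in `src/network/federation.py`: each response gets a crude score for length, structure, and completeness, and the head node keeps the highest-scoring peer:

```python
def quality_score(text: str) -> float:
    """Crude objective score based on length, structure, and completeness."""
    lines = text.splitlines()
    score = min(len(text) / 1000, 1.0)  # length, capped at 1.0
    score += 0.3 * sum(l.lstrip().startswith(("-", "*", "1.")) for l in lines)  # structured lists
    if text.rstrip().endswith((".", "!", "?")):  # looks complete, not truncated
        score += 0.5
    return score

def pick_best(peer_responses: dict[str, str]) -> str:
    """Return the peer whose response scores highest."""
    return max(peer_responses, key=lambda peer: quality_score(peer_responses[peer]))

# Example: a structured, complete 14B answer beats a terse 3B one
responses = {
    "local-3b": "Use a loop",
    "peer-14b": "Use a loop:\n- iterate over items\n- accumulate the result\nThis runs in O(n).",
}
print(pick_best(responses))  # -> "peer-14b"
```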

### 🚧 Planned Features

- **Plan Mode**: Disable tool execution for planning-only conversations (`--plan-mode`)
- **Tool Consensus**: Verify tool calls across multiple workers before execution (for critical operations)

## Contributing

Contributions are welcome! Please ensure:

1. Code follows the quality standards above
2. All tests pass
3. New features include tests
4. Documentation is updated

## License

MIT License