Local Swarm

Automatically configures and runs a swarm of small coding LLMs tuned to your hardware, and exposes an OpenAI-compatible API for seamless integration with opencode and other tools.

Features

  • Interactive Menu System: Easy-to-use menu for selecting model configurations, browsing options, or creating custom setups
  • Hardware Auto-Detection: Automatically detects your GPU (NVIDIA, AMD, Intel), Apple Silicon, Qualcomm (Android), or CPU and selects optimal settings
  • Smart Model Selection: Chooses the best model, quantization, and instance count based on available VRAM/RAM
  • Startup Summary: Clear display of detected hardware, selected model, resource usage, and worker status
  • Swarm Consensus: Multiple LLM instances vote on the best response for higher quality outputs
  • Network Federation: Multiple machines on the same network can join into a "federated swarm" for distributed consensus
  • OpenAI-Compatible API: Drop-in replacement for OpenAI API at http://localhost:8000/v1
  • MCP Server: Model Context Protocol support for tight AI assistant integration
  • Cross-Platform: Works on Windows, macOS, Linux, and Android (via Termux) with automatic backend selection

Documentation

  • Quick Start - Get up and running in minutes
  • Complete Guide - Comprehensive documentation
    • Opencode configuration examples
    • API reference
    • Troubleshooting guide
    • Performance tuning
    • Advanced configuration
  • Configuration - Customize your setup
  • Interactive Mode - Using the menu system
  • Tips & Help - Learn about models, quantization, and optimization

Quick Start

Installation

Windows (PowerShell)

# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run installer
.\scripts\install.bat

macOS/Linux

# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run installer
chmod +x scripts/install.sh
./scripts/install.sh

Android (Termux)

# In Termux app
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run Termux installer
chmod +x scripts/install-termux.sh
./scripts/install-termux.sh

Note: Android support is limited to small models (1-3B) due to memory constraints. Requires 8GB+ RAM.

Usage

Start the Swarm

# Auto-detect hardware and start
python -m local_swarm

# Or use the CLI
python main.py

On first run, the tool will:

  1. Scan your hardware (GPU, RAM, CPU)
  2. Select the optimal model and quantization
  3. Download the model (one-time)
  4. Start multiple instances based on available memory
  5. Expose the API at http://localhost:8000

Example startup output:

🔍 Detecting hardware...
   OS: Windows 11
   GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
   CPU: 16 cores
   RAM: 32 GB

📊 Optimal configuration:
   Model: Qwen 2.5 Coder 3B
   Quantization: Q4_K_M (1.8 GB per instance)
   Instances: 8 (using 14.4 GB VRAM)

⬇️  Downloading model...
   Progress: 100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 1.8/1.8 GB

🚀 Starting swarm...
   Worker 1: Ready (GPU:0)
   Worker 2: Ready (GPU:0)
   ...
   Worker 8: Ready (GPU:0)

✅ Local Swarm is running!
   API: http://localhost:8000/v1
   Models: http://localhost:8000/v1/models
   Health: http://localhost:8000/health

💡 Configure opencode to use:
   base_url: http://localhost:8000/v1
   api_key: any (not used)

Configure opencode

Add to your opencode configuration:

{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
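
The same endpoint works from any OpenAI-compatible SDK. For example, with the official openai Python client:

from openai import OpenAI

# Point the client at the local swarm; the API key is ignored.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function to sort a list"}],
)
print(response.choices[0].message.content)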

MCP Server (Optional)

For tighter integration with AI assistants, enable the MCP server:

python main.py --mcp

This runs alongside the HTTP API and exposes tools AI assistants can use:

  • get_hardware_info - Query CPU, GPU, and RAM
  • get_swarm_status - Check worker health
  • generate_code - Generate code with consensus
  • list_available_models - See what models can run
  • get_worker_details - Get detailed worker statistics

MCP allows AI assistants to automatically query your hardware capabilities and select appropriate models.
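
As a rough illustration, here is how one of these tools could be exposed with the official MCP Python SDK (FastMCP). This is a sketch, not the project's actual server code; the tool body is made up:

from mcp.server.fastmcp import FastMCP

mcp = FastMCP("local-swarm")

@mcp.tool()
def get_swarm_status() -> dict:
    """Check worker health (illustrative body)."""
    return {"workers": 8, "healthy": 8}

if __name__ == "__main__":
    mcp.run()  # serves the tool over stdio by default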

Configuration

Create a config.yaml file for customization:

server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5  # Use 50% of system RAM for CPU/Apple Silicon

federation:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  max_peers: 10

models:
  cache_dir: "~/.local_swarm/models"
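
If you want to read these settings programmatically, a minimal sketch using PyYAML might look like this (load_config and the defaults shown are illustrative, not the project's actual loader):

import yaml
from pathlib import Path

DEFAULTS = {
    "server": {"host": "127.0.0.1", "port": 8000},
    "swarm": {"consensus_strategy": "similarity", "min_instances": 2, "max_instances": 8},
}

def load_config(path="config.yaml"):
    # Missing file -> defaults; present top-level sections override them.
    if not Path(path).exists():
        return DEFAULTS
    with open(path) as f:
        user = yaml.safe_load(f) or {}
    return {**DEFAULTS, **user}

print(load_config()["server"]["port"])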

CLI Options

# Show hardware detection without starting
python -m local_swarm --detect

# Use specific model
python -m local_swarm --model qwen2.5-coder:3b:q4

# Use specific port
python -m local_swarm --port 8080

# Force number of instances
python -m local_swarm --instances 4

# Download models only (no server)
python -m local_swarm --download-only

# Enable MCP server alongside HTTP API
python -m local_swarm --mcp

# Show help
python -m local_swarm --help

# Auto-detect without interactive menu
python -m local_swarm --auto

Interactive Mode

By default, Local Swarm starts in interactive mode with a menu system:

======================================================================
 Local Swarm - Model Selection
======================================================================

----------------------------------------------------------------------
 Hardware Detection
----------------------------------------------------------------------
  Operating System: Darwin
  CPU: 12 cores
  System RAM: 24.0 GB
  Available RAM: 6.2 GB

  GPU Detected:
    Name: Apple Silicon GPU
    Type: Apple Silicon (Unified Memory)
    Total Memory: 24.0 GB

  Available for LLMs: 12.0 GB
  (Using 50% of system RAM)

----------------------------------------------------------------------
 Configuration Options
----------------------------------------------------------------------

  💡 Recommended: Qwen 2.5 Coder 7b (q6_k)
     Instances: 2
     Memory: 12.0 GB

  [1] Recommended Configuration - Qwen 2.5 Coder 7b (q6_k) with 2 instances
  [2] Browse All Configurations - See all models that fit your hardware
  [3] Custom Configuration - Specify exact model and number of instances

  Enter your choice: 

Menu Options

  1. Recommended Configuration - Automatically selects the best model and instance count for your hardware
  2. Browse All Configurations - Shows all feasible models that fit in your available memory
  3. Custom Configuration - Step-by-step wizard to select:
    • Model family (Qwen, DeepSeek, CodeLlama)
    • Model size (3B, 7B, 14B)
    • Quantization level (Q4, Q5, Q6)
    • Number of instances (1 to max supported)

To skip the menu and use auto-detection, use the --auto flag.

Startup Summary

When starting, Local Swarm displays a comprehensive summary:

======================================================================
 Local Swarm - Startup Summary
======================================================================

----------------------------------------------------------------------
 Hardware Detection
----------------------------------------------------------------------
  Operating System: Darwin
  CPU: 12 cores
  System RAM: 24.0 GB
  Available RAM: 6.2 GB

  GPU Detected:
    Name: Apple Silicon GPU
    Type: Apple Silicon (Unified Memory)
    Total Memory: 24.0 GB

  Available for LLMs: 12.0 GB

----------------------------------------------------------------------
 Model Configuration
----------------------------------------------------------------------
  Model: Qwen 2.5 Coder 7b (q6_k)
  Description: Alibaba's code-focused model
  Instances: 2
  Memory per Instance: 6.0 GB
  Total Memory: 12.0 GB
  Utilization: 100.0% of available

======================================================================

How It Works

Hardware Detection

The tool automatically detects your system:

  • Windows: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI)
  • macOS: Apple Silicon via Metal, unified memory model
  • Linux: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI/OpenCL)
  • Android: Qualcomm Adreno GPUs (via Termux)

Supported Backends:

  • NVIDIA: CUDA via llama.cpp
  • AMD: ROCm via llama.cpp (Linux, Windows experimental)
  • Intel: OneAPI/SYCL via llama.cpp
  • Apple Silicon: Metal via MLX
  • Qualcomm: CPU fallback on llama.cpp (Android/Termux)
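
As a rough sketch of how backend selection can work (illustrative only; the project's real detector inspects more than this):

import platform
import shutil
import subprocess

def detect_backend():
    # Apple Silicon reports Darwin/arm64 and maps to the MLX (Metal) backend.
    if platform.system() == "Darwin" and platform.machine() == "arm64":
        return "mlx"
    # Treat a working nvidia-smi as evidence of a usable CUDA GPU.
    if shutil.which("nvidia-smi"):
        if subprocess.run(["nvidia-smi"], capture_output=True).returncode == 0:
            return "llama.cpp (CUDA)"
    if shutil.which("rocm-smi"):
        return "llama.cpp (ROCm)"
    return "llama.cpp (CPU)"  # CPU fallback, including Android/Termux

print(detect_backend())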

Model Selection

Based on available memory:

  1. Discrete GPU: Use 100% of VRAM minus OS overhead
  2. Apple Silicon: Use 50% of unified RAM
  3. CPU-only: Use 50% of system RAM

The algorithm selects:

  • Largest model size that fits
  • Highest quantization quality possible
  • Maximum instances (2-8) based on memory
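
A simplified sketch of this selection logic, using illustrative Q4_K_M footprints and a hypothetical pick_config helper:

# Hypothetical (model, GB-per-instance) candidates, largest first.
CANDIDATES = [
    ("qwen2.5-coder-14b-q4", 8.8),
    ("qwen2.5-coder-7b-q4", 4.5),
    ("qwen2.5-coder-3b-q4", 1.8),
]

def pick_config(mem_gb, min_instances=2, max_instances=8):
    # Prefer the largest model that still allows min_instances copies.
    for name, size in CANDIDATES:
        instances = min(max_instances, int(mem_gb // size))
        if instances >= min_instances:
            return name, instances, round(instances * size, 1)
    return None

# RTX 4060 Ti 16 GB -> ('qwen2.5-coder-7b-q4', 3, 13.5), matching the table below.
print(pick_config(16.0))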

Example configurations:

Hardware                 Model          Quant    Instances   Memory Used
RTX 4090 24GB            Qwen 2.5 14B   Q4_K_M   2           ~17.6 GB
RTX 4060 Ti 16GB         Qwen 2.5 7B    Q4_K_M   3           ~13.5 GB
RTX 4060 Ti 8GB          Qwen 2.5 3B    Q6_K     4           ~10.4 GB
RX 7900 XTX 24GB         Qwen 2.5 14B   Q4_K_M   2           ~17.6 GB
Arc A770 16GB            Qwen 2.5 7B    Q5_K_M   2           ~10.4 GB
M4 Max 64GB              Qwen 2.5 14B   Q4_K_M   4           ~35.2 GB
M3 Pro 36GB              Qwen 2.5 7B    Q4_K_M   4           ~18.0 GB
M1 8GB                   Qwen 2.5 3B    Q4_K_M   2           ~3.6 GB
Snapdragon 8 Gen 3       Qwen 2.5 3B    Q4_K_M   1           ~1.8 GB
CPU 32GB                 Qwen 2.5 3B    Q4_K_M   8           ~14.4 GB
Federated (3 machines)   Qwen 2.5 7B    Q4_K_M   9           ~40.5 GB

Swarm Consensus

For each request, the swarm:

  1. Sends the prompt to all running instances
  2. Collects responses in parallel
  3. Runs consensus algorithm:
    • Similarity: Groups responses by semantic similarity, returns largest group
    • Quality: Scores responses on completeness and code quality
    • Fastest: Returns the quickest response
  4. Returns the winning response via OpenAI-compatible API
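
As a toy illustration of the similarity strategy, the sketch below clusters responses using difflib as a stand-in for a real semantic-similarity measure (the actual consensus engine may differ):

from difflib import SequenceMatcher

def similarity_consensus(responses, threshold=0.8):
    # Greedily group responses whose similarity to a group's first member
    # exceeds the threshold, then return one answer from the largest group.
    groups = []
    for r in responses:
        for g in groups:
            if SequenceMatcher(None, r, g[0]).ratio() >= threshold:
                g.append(r)
                break
        else:
            groups.append([r])
    return max(groups, key=len)[0]

answers = ["def f(x): return sorted(x)", "def f(x): return sorted(x)", "x.sort()"]
print(similarity_consensus(answers))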

Network Federation

Run Local Swarm on multiple machines on the same network to create a "federated swarm":

Example Setup:

  • Windows PC (RTX 4060 Ti): 4 instances
  • Mac Mini (M1): 2 instances
  • MacBook (M4): 3 instances
  • Total: 9 instances voting on every request

How it works:

  1. Each machine auto-discovers others via mDNS/Bonjour
  2. Each swarm generates responses independently
  3. Local consensus picks best response per machine
  4. Cross-swarm consensus votes across all machines
  5. Best response returned to client

To enable federation:

federation:
  enabled: true
  discovery_port: 8765  # mDNS/Bonjour discovery
  federation_port: 8766  # Inter-swarm communication

Machines will automatically discover each other within 10 seconds.
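
Under the hood this is ordinary mDNS service registration and browsing. A minimal sketch with the python-zeroconf package (the service type shown is hypothetical; the project's actual discovery code may differ):

import socket
from zeroconf import ServiceInfo, Zeroconf

zc = Zeroconf()
info = ServiceInfo(
    "_local-swarm._tcp.local.",  # hypothetical service type
    f"{socket.gethostname()}._local-swarm._tcp.local.",
    addresses=[socket.inet_aton("192.168.1.10")],  # replace with this host's LAN IP
    port=8766,  # federation_port from config.yaml
)
zc.register_service(info)
# Peers browsing "_local-swarm._tcp.local." can now find this swarm.
zc.unregister_service(info)
zc.close()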

API Endpoints

GET /v1/models

List available models

POST /v1/chat/completions

Chat completion with consensus

Request:

{
  "model": "local-swarm",
  "messages": [
    {"role": "user", "content": "Write a Python function to sort a list"}
  ]
}

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-swarm",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "def sort_list(lst):\n    return sorted(lst)"
    },
    "finish_reason": "stop"
  }]
}

GET /health

Health check

GET /metrics

Prometheus metrics (optional)
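
For a quick end-to-end check of these endpoints, the request above can be reproduced with the requests library:

import requests

# Health check, then one consensus-backed completion.
assert requests.get("http://localhost:8000/health").ok

resp = requests.post(
    "http://localhost:8000/v1/chat/completions",
    json={
        "model": "local-swarm",
        "messages": [{"role": "user", "content": "Write a Python function to sort a list"}],
    },
)
print(resp.json()["choices"][0]["message"]["content"])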

Supported Models

Currently supported models (auto-selected based on hardware):

  • Qwen 2.5 Coder (3B, 7B, 14B) - Recommended for coding tasks
  • DeepSeek Coder (1.3B, 6.7B, 33B) - Good alternative
  • CodeLlama (7B, 13B, 34B) - Meta's code model

All models support GGUF quantization:

  • Q4_K_M - Good quality, smallest size (recommended)
  • Q5_K_M - Better quality
  • Q6_K - Best quality

Troubleshooting

Out of Memory

If you get OOM errors:

# Reduce instances
python -m local_swarm --instances 2

# Or use smaller model
python -m local_swarm --model qwen2.5-coder:3b:q4

Slow Performance

  • Check GPU utilization with nvidia-smi (NVIDIA) or Activity Monitor (macOS)
  • Ensure model is cached (first run downloads to ~/.local_swarm/models)
  • Try reducing instances to avoid contention

Windows: CUDA not detected

Make sure NVIDIA drivers are installed:

nvidia-smi

If this fails, reinstall the drivers from nvidia.com.

macOS: MLX not found

pip install mlx-lm

Linux: AMD GPU not detected

Ensure ROCm is installed:

rocm-smi

If not found, install from https://www.amd.com/en/developer/rocm-hub.html

Linux: Intel GPU not detected

Install Intel oneAPI:

# Ubuntu/Debian
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | sudo gpg --dearmor -o /usr/share/keyrings/intel-oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-basekit

Android: Termux issues

  • Ensure Termux is installed from F-Droid (not Play Store)
  • Run pkg update before installation
  • Limited to small models (1-3B) due to RAM constraints
  • Use CPU backend only (no GPU acceleration on Android yet)

Requirements

  • Python 3.9+
  • 4GB+ RAM (8GB+ recommended)
  • Optional: NVIDIA/AMD/Intel GPU with 4GB+ VRAM
  • Optional: Apple Silicon Mac
  • Optional: Android device with 8GB+ RAM (via Termux)

Development

# Install dev dependencies
pip install -r requirements-dev.txt

# Run tests
pytest

# Run specific platform tests
pytest tests/test_hardware.py -v

# Format code
black src/
ruff check src/

Architecture

Single Machine

┌─────────────────────────────────────┐
│         OpenAI API Client           │
│        (opencode, etc.)             │
└─────────────┬───────────────────────┘
              │ HTTP
              ▼
┌─────────────────────────────────────┐
│     Local Swarm API Server          │
│    (FastAPI / localhost:8000)       │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│       Swarm Manager                 │
│  ┌─────────┐ ┌─────────┐           │
│  │ Worker 1│ │ Worker 2│ ...       │
│  │(LLM #1) │ │(LLM #2) │           │
│  └────┬────┘ └────┬────┘           │
│       │           │                 │
│       └─────┬─────┘                 │
│             ▼                       │
│      Consensus Engine               │
└─────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│     Backend (llama.cpp / MLX)       │
│    ┌─────────────────────┐          │
│    │   GGUF/MLX Model    │          │
│    │   (Qwen/Codellama)  │          │
│    └─────────────────────┘          │
└─────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│    Hardware (GPU/CPU/Apple Silicon) │
└─────────────────────────────────────┘

Federated Swarm (Multiple Machines)

┌─────────────────────────────────────────────────────────────┐
│                    Local Network                             │
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Windows PC  │    │   Mac Mini   │    │   MacBook    │  │
│  │  (RTX 4060)  │    │    (M1)      │    │    (M4)      │  │
│  │  4 instances │    │  2 instances │    │  3 instances │  │
│  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘  │
│         │                   │                   │           │
│         │                   │                   │           │
│         └───────────────────┼───────────────────┘           │
│                             │                               │
│                    ┌────────┴────────┐                      │
│                    │  Cross-Swarm    │                      │
│                    │    Consensus    │                      │
│                    └────────┬────────┘                      │
│                             │                               │
│                    ┌────────▼────────┐                      │
│                    │   opencode      │                      │
│                    └─────────────────┘                      │
└─────────────────────────────────────────────────────────────┘

License

MIT License - See LICENSE file

Contributing

Contributions welcome! Please read CONTRIBUTING.md first.
