Local Swarm

Run a swarm of local LLMs on your hardware. Multiple models work together to give you the best answer through consensus voting.

What It Does

  • Auto-detects your hardware (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
  • Downloads and runs multiple LLM instances optimized for your VRAM/RAM
  • Uses consensus voting - all instances answer, best response wins
  • Connects multiple machines on your network for a "hive mind" effect
  • Provides an OpenAI-compatible API at http://localhost:17615/v1

Quick Start

# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt

# Run it
python main.py

On first run, it will:

  1. Detect your hardware
  2. Pick the best model and quantization
  3. Download the model (one-time)
  4. Start multiple LLM workers
  5. Expose the API at http://localhost:17615
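
Once it is up, you can confirm the server is responding with a quick Python check against the health endpoint (requests assumed installed; the /health route is listed under API Endpoints below — the exact response body isn't specified here, so this just prints whatever comes back):

import requests

# Poll the swarm's health endpoint on the default port.
resp = requests.get("http://localhost:17615/health", timeout=5)
print(resp.status_code, resp.text)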

Usage

Interactive Mode (default)

python main.py

Shows a menu with:

  • Recommended configuration (auto-selected)
  • Browse all compatible models
  • Custom configuration wizard

Auto Mode (no menu)

python main.py --auto

With Other Options

python main.py --model qwen:3b:q4      # Use specific model
python main.py --instances 4           # Force 4 workers
python main.py --port 8080             # Custom port
python main.py --detect                # Show hardware info only
python main.py --federation            # Enable network federation
python main.py --mcp                   # Enable MCP server

Connect to Opencode

Add to your opencode config:

{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}

Network Federation (Hive Mind)

Run on multiple machines to combine their power:

# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation

# Machine 2 (Mac Mini M1)
python main.py --auto --federation

# Machine 3 (Old laptop)
python main.py --auto --federation

Machines auto-discover each other and vote together on every request.
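
The discovery protocol isn't documented here; purely to illustrate the idea, below is a minimal UDP-broadcast sketch in Python. This is not the project's actual wire format — the port matches the federation.discovery_port default from the Configuration section below, and the message fields are made up for the example:

import json
import socket

DISCOVERY_PORT = 8765  # federation.discovery_port in config.yaml

# Listen for peer announcements on the discovery port.
listener = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
listener.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
listener.bind(("", DISCOVERY_PORT))

# Announce this node to the local network via UDP broadcast.
announcer = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
announcer.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
hello = json.dumps({"node": socket.gethostname(), "api_port": 17615}).encode()
announcer.sendto(hello, ("255.255.255.255", DISCOVERY_PORT))

# Blocks until some node on the network (possibly ourselves) broadcasts.
data, addr = listener.recvfrom(4096)
print(f"peer at {addr[0]}: {json.loads(data)}")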

How Consensus Works

  1. Your prompt goes to all LLM instances
  2. Each instance generates a response independently
  3. The consensus algorithm picks the best answer:
    • Similarity (default): Groups responses by meaning, picks the largest group
    • Quality: Scores on completeness, code blocks, structure
    • Fastest: Returns the quickest response
    • Majority: Simple text match voting
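
For intuition, here is a minimal standard-library sketch of the similarity strategy. The real implementation may cluster by embeddings rather than raw text, and the 0.8 threshold here is an arbitrary illustrative choice:

from difflib import SequenceMatcher

def similarity_vote(responses, threshold=0.8):
    # Greedily cluster responses by pairwise text similarity,
    # then return a representative of the largest cluster.
    groups = []
    for resp in responses:
        for group in groups:
            if SequenceMatcher(None, resp, group[0]).ratio() >= threshold:
                group.append(resp)
                break
        else:
            groups.append([resp])
    return max(groups, key=len)[0]

answers = ["def add(a, b): return a + b",
           "def add(a, b):\n    return a + b",
           "use the + operator"]
print(similarity_vote(answers))  # picks the two-member add() group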

Configuration

Create config.yaml:

server:
  host: "127.0.0.1"
  port: 17615

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8

federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10
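
If you script around the swarm, the file is plain YAML; a short sketch of reading it with PyYAML, falling back to the defaults shown above (field names taken from the example; the project's own loader may differ):

import yaml

with open("config.yaml") as f:
    config = yaml.safe_load(f) or {}

# Fall back to the documented defaults when a key is missing.
strategy = config.get("swarm", {}).get("consensus_strategy", "similarity")
port = config.get("server", {}).get("port", 17615)
print(f"strategy={strategy}, port={port}")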

Supported Hardware

Hardware        Backend            Notes
NVIDIA GPU      llama.cpp (CUDA)   Best performance
AMD GPU         llama.cpp (ROCm)   Linux/Windows
Intel GPU       llama.cpp (SYCL)   Linux/Windows
Apple Silicon   MLX                Native Metal
Qualcomm        llama.cpp (CPU)    Android/Termux
CPU-only        llama.cpp          Slower but works

Supported Models

  • Qwen 2.5 Coder (3B, 7B, 14B) - Recommended
  • DeepSeek Coder (1.3B, 6.7B, 33B)
  • CodeLlama (7B, 13B, 34B)

All support GGUF quantization (Q4_K_M recommended).
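
To gauge whether a model fits your VRAM before downloading, a rough back-of-the-envelope estimate helps: Q4_K_M averages roughly 4.85 effective bits per weight, and the KV cache and context add overhead on top, so treat this as a lower bound:

def q4km_size_gb(params_billions, bits_per_weight=4.85):
    # Approximate GGUF file size: parameters x effective bits per weight.
    return params_billions * 1e9 * bits_per_weight / 8 / 1e9

for b in (3, 7, 14):
    print(f"{b}B @ Q4_K_M ~ {q4km_size_gb(b):.1f} GB")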

API Endpoints

  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completion with consensus
  • GET /health - Health check
  • GET /v1/federation/peers - List discovered peers (when federation enabled)
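
Because the server speaks the OpenAI chat format, any OpenAI-compatible client works. A minimal direct call with requests (no API key needed):

import requests

resp = requests.post(
    "http://localhost:17615/v1/chat/completions",
    json={
        "model": "local-swarm",
        "messages": [{"role": "user", "content": "Write a haiku about consensus."}],
    },
    timeout=120,
)
# Standard OpenAI-style response shape.
print(resp.json()["choices"][0]["message"]["content"])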

Troubleshooting

Out of Memory

python main.py --instances 2           # Reduce workers
python main.py --model qwen:3b:q4      # Use smaller model

Slow Performance

  • Check GPU utilization with nvidia-smi
  • Reduce instances to avoid contention
  • Use Q4 quantization instead of Q6

CUDA Not Detected (Windows)

nvidia-smi  # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

macOS: MLX Not Found

pip install mlx-lm

Project Structure

local_swarm/
├── main.py                   # CLI entry point
├── src/
│   ├── hardware/            # GPU detection (NVIDIA, AMD, Intel, Apple, Qualcomm)
│   ├── models/              # Model registry, selection, downloading
│   ├── backends/            # llama.cpp and MLX backends
│   ├── swarm/               # Worker management and consensus
│   ├── network/             # Federation and peer discovery
│   ├── api/                 # OpenAI-compatible API server
│   └── tools/               # Tool execution (read, write, bash)
└── docs/                    # Documentation

License

MIT License
