
Local Swarm

Run a swarm of local LLMs on your hardware. Multiple models work together to give you the best answer through consensus voting.

What It Does

  • Auto-detects your hardware (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
  • Downloads and runs multiple LLM instances optimized for your VRAM/RAM
  • Uses consensus voting - all instances answer, best response wins
  • Connects multiple machines on your network for a "hive mind" effect
  • Provides an OpenAI-compatible API at http://localhost:17615/v1

Quick Start

# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt

# Run it
python main.py

On first run, it will:

  1. Detect your hardware
  2. Pick the best model and quantization
  3. Download the model (one-time)
  4. Start multiple LLM workers
  5. Expose the API at http://localhost:17615
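
Once it's running, a quick sanity check from Python (a minimal sketch; assumes the requests package is installed and that both endpoints return JSON):

import requests  # third-party HTTP client: pip install requests

BASE = "http://localhost:17615"

print(requests.get(f"{BASE}/health").json())      # health check
print(requests.get(f"{BASE}/v1/models").json())   # models the swarm is serving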

Usage

Interactive Mode (default)

python main.py

Shows a menu with:

  • Recommended configuration (auto-selected)
  • Browse all compatible models
  • Custom configuration wizard

Auto Mode (no menu)

python main.py --auto

With Other Options

python main.py --model qwen:3b:q4      # Use specific model
python main.py --instances 4           # Force 4 workers
python main.py --port 8080             # Custom port
python main.py --detect                # Show hardware info only
python main.py --federation            # Enable network federation
python main.py --mcp                   # Enable MCP server
python main.py --use-opencode-tools    # Use opencode tools (adds ~27k tokens)

Tool Mode Options:

  • Default: Local tool server (~125 tokens, saves context window space)
  • --use-opencode-tools: Full opencode tool definitions (~27k tokens, more capabilities)

Connect to Opencode

Add to your opencode config:

{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
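
The same settings work from any OpenAI-compatible client. Here is a minimal sketch using the official openai Python package (assumes pip install openai; the base URL, API key, and model name come from the config above):

from openai import OpenAI

# Point the client at the local swarm instead of api.openai.com
client = OpenAI(base_url="http://localhost:17615/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
)
print(response.choices[0].message.content)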

Network Federation (Hive Mind)

Run on multiple machines to combine their power:

# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation

# Machine 2 (Mac Mini M1)
python main.py --auto --federation

# Machine 3 (Old laptop)
python main.py --auto --federation

Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses objective quality scoring, not self-reported confidence, to pick the best answer. This prevents smaller models from overruling better models.

Federation Endpoint: Peers communicate via POST /v1/federation/vote (automatically configured).
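
Conceptually, the head node fans the request out to every discovered peer and keeps the best-scoring answer. A rough sketch of that loop (illustrative only: the payload shape for /v1/federation/vote and the score_quality helper are hypothetical stand-ins, not the project's actual schema):

import requests

def federated_answer(prompt, peers, score_quality):
    # Ask every peer for a candidate response (hypothetical payload shape)
    candidates = []
    for peer in peers:
        try:
            r = requests.post(f"http://{peer}/v1/federation/vote",
                              json={"prompt": prompt}, timeout=120)
            candidates.append(r.json().get("response", ""))
        except requests.RequestException:
            continue  # a slow or offline peer simply drops out of the vote
    # The head node judges every response itself -- no self-reported confidence
    return max(candidates, key=score_quality) if candidates else None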

How Consensus Works

  1. Your prompt goes to all LLM instances
  2. Each instance generates a response independently
  3. The consensus algorithm picks the best answer:
  • Similarity (default): Groups responses by meaning, picks the largest group (see the sketch after this list)
    • Quality: Scores on completeness, code blocks, structure
    • Fastest: Returns the quickest response
    • Majority: Simple text match voting
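
As a rough illustration of the default similarity strategy (a sketch only, not the implementation in consensus.py; it uses difflib from the standard library as a stand-in similarity measure):

from difflib import SequenceMatcher

def similarity_consensus(responses, threshold=0.8):
    # Group responses whose texts are pairwise similar enough
    groups = []
    for resp in responses:
        for group in groups:
            if SequenceMatcher(None, resp, group[0]).ratio() >= threshold:
                group.append(resp)
                break
        else:
            groups.append([resp])
    # The largest group wins; return its first member as the answer
    return max(groups, key=len)[0]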

Configuration

Create config.yaml:

server:
  host: "127.0.0.1"
  port: 17615

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8

federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10

Supported Hardware

Hardware        Backend             Notes
NVIDIA GPU      llama.cpp (CUDA)    Best performance
AMD GPU         llama.cpp (ROCm)    Linux/Windows
Intel GPU       llama.cpp (SYCL)    Linux/Windows
Apple Silicon   MLX                 Native Metal
Qualcomm        llama.cpp (CPU)     Android/Termux
CPU-only        llama.cpp           Slower but works

Supported Models

  • Qwen 2.5 Coder (3B, 7B, 14B) - Recommended
  • DeepSeek Coder (1.3B, 6.7B, 33B)
  • CodeLlama (7B, 13B, 34B)

All support GGUF quantization (Q4_K_M recommended).
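
As a back-of-the-envelope sizing rule (a rough sketch; the real logic in memory_calculator.py accounts for more than this): Q4_K_M stores roughly 4.5 bits per weight, so a 7B model lands around 4-5 GB once runtime overhead is included.

def rough_gguf_size_gb(params_billion, bits_per_weight=4.5, overhead=1.2):
    # bits -> bytes -> GB, padded ~20% for KV cache and runtime buffers
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9 * overhead

print(f"~{rough_gguf_size_gb(7):.1f} GB for a 7B model at Q4_K_M")  # ~4.7 GB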

API Endpoints

  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completion with consensus (example below)
  • GET /health - Health check
  • POST /v1/federation/vote - Federation voting (used internally between peers)
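
For example, hitting the chat endpoint directly (a minimal sketch with the requests package; the body follows the standard OpenAI chat-completions shape):

import requests

payload = {
    "model": "local-swarm",
    "messages": [{"role": "user", "content": "Explain consensus voting in one sentence."}],
}
r = requests.post("http://localhost:17615/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])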

Troubleshooting

Out of Memory

python main.py --instances 2           # Reduce workers
python main.py --model qwen:3b:q4      # Use smaller model

Slow Performance

  • Check GPU utilization with nvidia-smi
  • Reduce instances to avoid contention
  • Use Q4 quantization instead of Q6

CUDA Not Detected (Windows)

nvidia-smi  # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

macOS: MLX Not Found

pip install mlx-lm

Project Structure

local_swarm/
├── main.py                   # CLI entry point (99 lines)
├── src/
│   ├── api/                 # OpenAI-compatible API
│   │   ├── routes.py        # HTTP routing (252 lines)
│   │   ├── formatting.py    # Message formatting
│   │   ├── tool_parser.py   # Tool call parsing
│   │   ├── chat_handlers.py # Chat completion logic
│   │   └── models.py        # API data models
│   ├── cli/                 # Command-line interface
│   │   ├── parser.py        # CLI argument parsing
│   │   ├── main_runner.py   # Main application logic
│   │   ├── server_runner.py # Server management
│   │   └── test_runner.py   # Test mode execution
│   ├── swarm/               # Swarm orchestration
│   │   ├── manager.py       # Swarm manager
│   │   ├── worker.py        # LLM worker implementation
│   │   ├── consensus.py     # Consensus algorithms
│   │   └── orchestrator.py  # Generation orchestration
│   ├── models/              # Model management
│   │   ├── registry.py      # Model registry (194 lines)
│   │   ├── selector.py      # Model selection (329 lines)
│   │   ├── memory_calculator.py # Memory calculations
│   │   └── downloader.py    # Model downloading
│   ├── hardware/            # Hardware detection
│   │   ├── detector.py      # Hardware detection
│   │   ├── nvidia.py        # NVIDIA GPU detection
│   │   ├── intel.py         # Intel GPU detection
│   │   └── qualcomm.py      # Qualcomm detection
│   ├── network/             # Network federation
│   │   ├── federation.py    # Cross-swarm consensus
│   │   └── discovery.py     # Peer discovery
│   ├── backends/            # LLM backends
│   │   ├── llama_cpp.py     # llama.cpp backend
│   │   ├── mlx.py           # Apple Silicon MLX backend
│   │   └── base.py          # Base backend interface
│   ├── interactive/         # Interactive CLI
│   │   ├── ui.py            # UI utilities
│   │   ├── display.py       # Hardware display
│   │   └── tips.py          # Help content
│   ├── tools/               # Tool execution
│   │   └── executor.py      # Tool execution engine
│   └── utils/               # Shared utilities
│       ├── token_counter.py # Token counting
│       ├── project_discovery.py # Project root discovery
│       └── network.py       # Network utilities
├── config/                  # Configuration files
│   └── models/              # Model configurations
│       ├── model_metadata.json      # Model metadata
│       ├── mlx_quant_sizes.json     # MLX quantization sizes
│       ├── gguf_quant_sizes.json    # GGUF quantization sizes
│       └── selector_config.json     # Selection constants
└── docs/                    # Documentation

Architecture Principles

  • Modular Design: Each module has a single, focused responsibility
  • Configuration Over Code: Static data extracted to JSON config files
  • Separation of Concerns: API, CLI, and business logic are cleanly separated
  • No Files > 300 Lines: Most modules kept under 300 lines for maintainability

Development

Code Quality Standards

This project follows strict code quality standards:

  • File Size: No files > 300 lines (with a few exceptions)
  • Function Size: No functions > 50 lines
  • Nesting Depth: No indentation deeper than 3 levels
  • DRY Principle: No duplicated blocks longer than 3 lines
  • Single Responsibility: Each module does one thing
  • Configuration Over Code: Static data in JSON configs

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_tool_parsing.py -v

# Run with coverage
python -m pytest tests/ --cov=src

Recent Refactoring

Major refactoring completed to improve modularity:

Before: Monolithic files (main.py: 556 lines, routes.py: 1,183 lines)
After: Modular architecture (main.py: 99 lines, routes.py: 252 lines)

Changes:

  • Extracted API logic into focused modules (formatting, parsing, handlers)
  • Created CLI package with separated concerns (parser, runner, server)
  • Moved hardcoded model data to JSON configuration files
  • Created shared utility modules (token_counter, project_discovery, network)
  • Reduced code duplication across the codebase

See docs/ARCHITECTURE.md for detailed architecture documentation.

Recent Improvements

Universal Tool Support (2025-02-25)

  • Tool instructions automatically injected for all clients (Continue, hollama, curl, etc.)
  • No client-side configuration needed - just use the API
  • Enhanced file operation guidance: model uses ls/grep to verify files exist before reading
  • Working directory auto-extraction from prompts ("in /path/to/dir" patterns; see the sketch after this list)
  • Proper OpenAI tool format with unique IDs and tool_call_id linking
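
For instance, the "in /path/to/dir" extraction could look roughly like this (a hypothetical sketch; the actual pattern used by the codebase may differ):

import re

def extract_working_dir(prompt):
    # Look for an "in /absolute/path" phrase anywhere in the prompt
    match = re.search(r"\bin\s+(/[\w./-]+)", prompt)
    return match.group(1) if match else None

print(extract_working_dir("Fix the failing test in /home/me/projects/local_swarm"))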

OpenCode-Compatible Streaming (2025-02-25)

  • Proper reasoning_content field for "Thinking..." collapsible blocks
  • Multi-chunk tool_calls streaming matching Vercel AI SDK format
  • Final answer delivered in content field after tool execution

Federation Quality Voting (2025-02-25)

  • Head node now objectively judges all peer responses using quality metrics
  • No more reliance on self-reported confidence (which was biased toward the local node)
  • All responses scored on length, structure, and completeness (sketched below)
  • Fair competition: 14B models properly beat 3B on quality tasks
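
Illustratively, such an objective scorer might weigh signals like these (a hypothetical sketch, in the spirit of the score_quality placeholder in the federation sketch above; the real metrics are certainly more nuanced):

def score_quality(response):
    score = len(response) * 0.01                 # longer, more complete answers
    score += 10 * response.count("```")          # code blocks are a strong signal
    score += 2 * response.count("\n- ")          # structured lists
    if response.rstrip().endswith((".", "`", ")")):
        score += 5                               # ends cleanly, likely not truncated
    return score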

Planned Features

  • Plan Mode: Disable tool execution for planning-only conversations (--plan-mode)
  • Tool Consensus: Verify tool calls across multiple workers before execution (for critical operations)

Contributing

Contributions are welcome! Please ensure:

  1. Code follows the quality standards above
  2. All tests pass
  3. New features include tests
  4. Documentation is updated

License

MIT License
