Local Swarm

Run a swarm of local LLMs on your hardware. Multiple model instances work together to give you the best answer through consensus voting.

What It Does

  • Auto-detects your hardware (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
  • Downloads and runs multiple LLM instances optimized for your VRAM/RAM
  • Uses consensus voting - all instances answer, best response wins
  • Connects multiple machines on your network for a "hive mind" effect
  • Provides an OpenAI-compatible API at http://localhost:17615/v1

Quick Start

# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt

# Run it
python main.py

On first run, it will:

  1. Detect your hardware
  2. Pick the best model and quantization
  3. Download the model (one-time)
  4. Start multiple LLM workers
  5. Expose the API at http://localhost:17615
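
Once these steps finish, you can sanity-check the server from Python. This is a minimal sketch using only the standard library and the documented /health and /v1/models endpoints; the exact response payloads may differ from what is printed here.

import json
import urllib.request

BASE = "http://localhost:17615"

# Health check - the server should answer once the workers have loaded
with urllib.request.urlopen(f"{BASE}/health") as resp:
    print("health:", resp.status)

# List the models the swarm exposes
with urllib.request.urlopen(f"{BASE}/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))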

Usage

Interactive Mode (default)

python main.py

Shows a menu with:

  • Recommended configuration (auto-selected)
  • Browse all compatible models
  • Custom configuration wizard

Auto Mode (no menu)

python main.py --auto

With Other Options

python main.py --model qwen:3b:q4      # Use specific model
python main.py --instances 4           # Force 4 workers
python main.py --port 8080             # Custom port
python main.py --detect                # Show hardware info only
python main.py --federation            # Enable network federation
python main.py --mcp                   # Enable MCP server
python main.py --use-opencode-tools    # Use opencode tools (adds ~27k tokens)

Tool Mode Options:

  • Default: Local tool server (~125 tokens, saves context window space)
  • --use-opencode-tools: Full opencode tool definitions (~27k tokens, more capabilities)

Connect to Opencode

Add to your opencode config:

{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}

Network Federation (Hive Mind)

Run on multiple machines to combine their power:

# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation

# Machine 2 (Mac Mini M1)
python main.py --auto --federation

# Machine 3 (Old laptop)
python main.py --auto --federation

Machines auto-discover each other and vote together on every request.
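
To confirm the machines actually found each other, ask any node for its peer list through the /v1/federation/peers endpoint listed under API Endpoints below. A minimal check in Python (the shape of the peer entries is not specified here, so the raw JSON is printed as-is):

import json
import urllib.request

# Federation must be enabled (--federation) for this endpoint to return peers
with urllib.request.urlopen("http://localhost:17615/v1/federation/peers") as resp:
    print(json.dumps(json.load(resp), indent=2))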

How Consensus Works

  1. Your prompt goes to all LLM instances
  2. Each instance generates a response independently
  3. The consensus algorithm picks the best answer:
    • Similarity (default): Groups responses by meaning, picks the largest group (see the sketch below)
    • Quality: Scores on completeness, code blocks, structure
    • Fastest: Returns the quickest response
    • Majority: Simple text match voting
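
The real implementation lives in src/swarm/consensus.py. The sketch below only illustrates the idea behind the similarity strategy, using difflib's textual ratio as a crude stand-in for whatever similarity measure the project actually uses: responses are clustered by pairwise similarity and one answer from the largest cluster is returned.

from difflib import SequenceMatcher

def pick_by_similarity(responses, threshold=0.7):
    """Illustrative only: cluster similar responses, return one from the largest cluster."""
    groups = []  # each group holds responses "close enough" to its first member
    for resp in responses:
        for group in groups:
            if SequenceMatcher(None, resp, group[0]).ratio() >= threshold:
                group.append(resp)
                break
        else:
            groups.append([resp])
    biggest = max(groups, key=len)  # the largest cluster wins the vote
    return biggest[0]

answers = [
    "def add(a, b): return a + b",
    "def add(a, b):\n    return a + b",
    "Sorry, I can't help with that.",
]
print(pick_by_similarity(answers))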

Configuration

Create config.yaml:

server:
  host: "127.0.0.1"
  port: 17615

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8

federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10

Supported Hardware

Hardware        Backend             Notes
NVIDIA GPU      llama.cpp (CUDA)    Best performance
AMD GPU         llama.cpp (ROCm)    Linux/Windows
Intel GPU       llama.cpp (SYCL)    Linux/Windows
Apple Silicon   MLX                 Native Metal
Qualcomm        llama.cpp (CPU)     Android/Termux
CPU-only        llama.cpp           Slower but works

Supported Models

  • Qwen 2.5 Coder (3B, 7B, 14B) - Recommended
  • DeepSeek Coder (1.3B, 6.7B, 33B)
  • CodeLlama (7B, 13B, 34B)

All support GGUF quantization (Q4_K_M recommended).

API Endpoints

  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completion with consensus (see the example below)
  • GET /health - Health check
  • GET /v1/federation/peers - List discovered peers (when federation enabled)
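
Because the API is OpenAI-compatible, any OpenAI client can talk to the swarm. A minimal sketch using the official openai Python package (v1 or later); the api_key value is arbitrary since the server does not check it, and the model name matches the one shown in the opencode config above.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:17615/v1", api_key="not-needed")

# One request fans out to every instance; the consensus winner is returned
reply = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(reply.choices[0].message.content)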

Troubleshooting

Out of Memory

python main.py --instances 2           # Reduce workers
python main.py --model qwen:3b:q4      # Use smaller model

Slow Performance

  • Check GPU utilization with nvidia-smi
  • Reduce instances to avoid contention
  • Use Q4 quantization instead of Q6

CUDA Not Detected (Windows)

nvidia-smi  # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

macOS: MLX Not Found

pip install mlx-lm

Project Structure

local_swarm/
├── main.py                   # CLI entry point (99 lines)
├── src/
│   ├── api/                 # OpenAI-compatible API
│   │   ├── routes.py        # HTTP routing (252 lines)
│   │   ├── formatting.py    # Message formatting
│   │   ├── tool_parser.py   # Tool call parsing
│   │   ├── chat_handlers.py # Chat completion logic
│   │   └── models.py        # API data models
│   ├── cli/                 # Command-line interface
│   │   ├── parser.py        # CLI argument parsing
│   │   ├── main_runner.py   # Main application logic
│   │   ├── server_runner.py # Server management
│   │   └── test_runner.py   # Test mode execution
│   ├── swarm/               # Swarm orchestration
│   │   ├── manager.py       # Swarm manager
│   │   ├── worker.py        # LLM worker implementation
│   │   ├── consensus.py     # Consensus algorithms
│   │   └── orchestrator.py  # Generation orchestration
│   ├── models/              # Model management
│   │   ├── registry.py      # Model registry (194 lines)
│   │   ├── selector.py      # Model selection (329 lines)
│   │   ├── memory_calculator.py # Memory calculations
│   │   └── downloader.py    # Model downloading
│   ├── hardware/            # Hardware detection
│   │   ├── detector.py      # Hardware detection
│   │   ├── nvidia.py        # NVIDIA GPU detection
│   │   ├── intel.py         # Intel GPU detection
│   │   └── qualcomm.py      # Qualcomm detection
│   ├── network/             # Network federation
│   │   ├── federation.py    # Cross-swarm consensus
│   │   └── discovery.py     # Peer discovery
│   ├── backends/            # LLM backends
│   │   ├── llama_cpp.py     # llama.cpp backend
│   │   ├── mlx.py           # Apple Silicon MLX backend
│   │   └── base.py          # Base backend interface
│   ├── interactive/         # Interactive CLI
│   │   ├── ui.py            # UI utilities
│   │   ├── display.py       # Hardware display
│   │   └── tips.py          # Help content
│   ├── tools/               # Tool execution
│   │   └── executor.py      # Tool execution engine
│   └── utils/               # Shared utilities
│       ├── token_counter.py # Token counting
│       ├── project_discovery.py # Project root discovery
│       └── network.py       # Network utilities
├── config/                  # Configuration files
│   └── models/              # Model configurations
│       ├── model_metadata.json      # Model metadata
│       ├── mlx_quant_sizes.json     # MLX quantization sizes
│       ├── gguf_quant_sizes.json    # GGUF quantization sizes
│       └── selector_config.json     # Selection constants
└── docs/                    # Documentation

Architecture Principles

  • Modular Design: Each module has a single, focused responsibility
  • Configuration Over Code: Static data extracted to JSON config files
  • Separation of Concerns: API, CLI, and business logic are cleanly separated
  • File Size Limit: Modules kept under 300 lines (with few exceptions) for maintainability

Development

Code Quality Standards

This project follows strict code quality standards:

  • File Size: No files > 300 lines (with few exceptions)
  • Function Size: No functions > 50 lines
  • Nesting Depth: No indentation > 3 levels
  • DRY Principle: No duplicated code blocks longer than 3 lines
  • Single Responsibility: Each module does one thing
  • Configuration Over Code: Static data in JSON configs

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_tool_parsing.py -v

# Run with coverage
python -m pytest tests/ --cov=src

Recent Refactoring

Major refactoring completed to improve modularity:

Before: Monolithic files (main.py: 556 lines, routes.py: 1,183 lines)
After: Modular architecture (main.py: 99 lines, routes.py: 252 lines)

Changes:

  • Extracted API logic into focused modules (formatting, parsing, handlers)
  • Created CLI package with separated concerns (parser, runner, server)
  • Moved hardcoded model data to JSON configuration files
  • Created shared utility modules (token_counter, project_discovery, network)
  • Reduced code duplication across the codebase

See docs/ARCHITECTURE.md for detailed architecture documentation.

TODO / Roadmap

Planned Features

  • Plan Mode: Add a "plan mode" that disables tool execution for planning-only conversations. This would let the model describe proposed file changes without actually applying them until explicitly confirmed.
    • Usage: --plan-mode flag or API parameter
    • When enabled: Model can see what tools would do but doesn't execute them
    • Use case: Review changes before applying them

Current Status

  • ✅ Tool instructions now injected by default for all clients
  • ✅ Improved file operation safety (verify with ls/grep before reading)
  • ✅ Working directory support (extracted from client context)
  • 🔄 Plan mode - coming soon

Contributing

Contributions are welcome! Please ensure:

  1. Code follows the quality standards above
  2. All tests pass
  3. New features include tests
  4. Documentation is updated

License

MIT License