Local Swarm

Run a swarm of local LLMs on your hardware. Multiple model instances work together to give you the best answer through consensus voting.

What It Does

  • Auto-detects your hardware (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
  • Downloads and runs multiple LLM instances optimized for your VRAM/RAM
  • Uses consensus voting - all instances answer, best response wins
  • Connects multiple machines on your network for a "hive mind" effect
  • Provides an OpenAI-compatible API at http://localhost:17615/v1

Quick Start

# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt

# Run it
python main.py

On first run, it will:

  1. Detect your hardware
  2. Pick the best model and quantization
  3. Download the model (one-time)
  4. Start multiple LLM workers
  5. Expose the API at http://localhost:17615
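
Once these steps finish, you can sanity-check the server from Python. This is a minimal sketch using only the standard library and the documented /health and /v1/models endpoints; the exact response payloads may differ from what is printed here.

import json
import urllib.request

BASE = "http://localhost:17615"

# Health check - the server should answer once the workers have loaded
with urllib.request.urlopen(f"{BASE}/health") as resp:
    print("health:", resp.status)

# List the models the swarm exposes
with urllib.request.urlopen(f"{BASE}/v1/models") as resp:
    print(json.dumps(json.load(resp), indent=2))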

Usage

Interactive Mode (default)

python main.py

Shows a menu with:

  • Recommended configuration (auto-selected)
  • Browse all compatible models
  • Custom configuration wizard

Auto Mode (no menu)

python main.py --auto

With Other Options

python main.py --model qwen:3b:q4      # Use specific model
python main.py --instances 4           # Force 4 workers
python main.py --port 8080             # Custom port
python main.py --detect                # Show hardware info only
python main.py --federation            # Enable network federation
python main.py --mcp                   # Enable MCP server
python main.py --use-opencode-tools    # Use opencode tools (adds ~27k tokens)

Tool Mode Options:

  • Default: Local tool server (~125 tokens, saves context window space)
  • --use-opencode-tools: Full opencode tool definitions (~27k tokens, more capabilities)

Connect to Opencode

Add to your opencode config:

{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}

Network Federation (Hive Mind)

Run on multiple machines to combine their power:

# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation

# Machine 2 (Mac Mini M1)
python main.py --auto --federation

# Machine 3 (Old laptop)
python main.py --auto --federation

Machines auto-discover each other and vote together on every request.
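
To confirm the machines actually found each other, ask any node for its peer list through the /v1/federation/peers endpoint listed under API Endpoints below. A minimal check in Python (the shape of the peer entries is not specified here, so the raw JSON is printed as-is):

import json
import urllib.request

# Federation must be enabled (--federation) for this endpoint to return peers
with urllib.request.urlopen("http://localhost:17615/v1/federation/peers") as resp:
    print(json.dumps(json.load(resp), indent=2))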

How Consensus Works

  1. Your prompt goes to all LLM instances
  2. Each instance generates a response independently
  3. The consensus algorithm picks the best answer:
    • Similarity (default): Groups responses by meaning, picks the largest group (see the sketch below)
    • Quality: Scores on completeness, code blocks, structure
    • Fastest: Returns the quickest response
    • Majority: Simple text match voting
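
The real implementation lives in src/swarm/consensus.py. The sketch below only illustrates the idea behind the similarity strategy, using difflib's textual ratio as a crude stand-in for whatever similarity measure the project actually uses: responses are clustered by pairwise similarity and one answer from the largest cluster is returned.

from difflib import SequenceMatcher

def pick_by_similarity(responses, threshold=0.7):
    """Illustrative only: cluster similar responses, return one from the largest cluster."""
    groups = []  # each group holds responses "close enough" to its first member
    for resp in responses:
        for group in groups:
            if SequenceMatcher(None, resp, group[0]).ratio() >= threshold:
                group.append(resp)
                break
        else:
            groups.append([resp])
    biggest = max(groups, key=len)  # the largest cluster wins the vote
    return biggest[0]

answers = [
    "def add(a, b): return a + b",
    "def add(a, b):\n    return a + b",
    "Sorry, I can't help with that.",
]
print(pick_by_similarity(answers))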

Configuration

Create config.yaml:

server:
  host: "127.0.0.1"
  port: 17615

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8

federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10

Supported Hardware

Hardware        Backend             Notes
NVIDIA GPU      llama.cpp (CUDA)    Best performance
AMD GPU         llama.cpp (ROCm)    Linux/Windows
Intel GPU       llama.cpp (SYCL)    Linux/Windows
Apple Silicon   MLX                 Native Metal
Qualcomm        llama.cpp (CPU)     Android/Termux
CPU-only        llama.cpp           Slower but works

Supported Models

  • Qwen 2.5 Coder (3B, 7B, 14B) - Recommended
  • DeepSeek Coder (1.3B, 6.7B, 33B)
  • CodeLlama (7B, 13B, 34B)

All support GGUF quantization (Q4_K_M recommended).

API Endpoints

  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completion with consensus (see the example below)
  • GET /health - Health check
  • GET /v1/federation/peers - List discovered peers (when federation enabled)
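
Because the API is OpenAI-compatible, any OpenAI client can talk to the swarm. A minimal sketch using the official openai Python package (v1 or later); the api_key value is arbitrary since the server does not check it, and the model name matches the one shown in the opencode config above.

from openai import OpenAI

client = OpenAI(base_url="http://localhost:17615/v1", api_key="not-needed")

# One request fans out to every instance; the consensus winner is returned
reply = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a Python function that reverses a string."}],
)
print(reply.choices[0].message.content)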

Troubleshooting

Out of Memory

python main.py --instances 2           # Reduce workers
python main.py --model qwen:3b:q4      # Use smaller model

Slow Performance

  • Check GPU utilization with nvidia-smi
  • Reduce instances to avoid contention
  • Use Q4 quantization instead of Q6

CUDA Not Detected (Windows)

nvidia-smi  # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

macOS: MLX Not Found

pip install mlx-lm

Project Structure

local_swarm/
├── main.py                   # CLI entry point (99 lines)
├── src/
│   ├── api/                 # OpenAI-compatible API
│   │   ├── routes.py        # HTTP routing (252 lines)
│   │   ├── formatting.py    # Message formatting
│   │   ├── tool_parser.py   # Tool call parsing
│   │   ├── chat_handlers.py # Chat completion logic
│   │   └── models.py        # API data models
│   ├── cli/                 # Command-line interface
│   │   ├── parser.py        # CLI argument parsing
│   │   ├── main_runner.py   # Main application logic
│   │   ├── server_runner.py # Server management
│   │   └── test_runner.py   # Test mode execution
│   ├── swarm/               # Swarm orchestration
│   │   ├── manager.py       # Swarm manager
│   │   ├── worker.py        # LLM worker implementation
│   │   ├── consensus.py     # Consensus algorithms
│   │   └── orchestrator.py  # Generation orchestration
│   ├── models/              # Model management
│   │   ├── registry.py      # Model registry (194 lines)
│   │   ├── selector.py      # Model selection (329 lines)
│   │   ├── memory_calculator.py # Memory calculations
│   │   └── downloader.py    # Model downloading
│   ├── hardware/            # Hardware detection
│   │   ├── detector.py      # Hardware detection
│   │   ├── nvidia.py        # NVIDIA GPU detection
│   │   ├── intel.py         # Intel GPU detection
│   │   └── qualcomm.py      # Qualcomm detection
│   ├── network/             # Network federation
│   │   ├── federation.py    # Cross-swarm consensus
│   │   └── discovery.py     # Peer discovery
│   ├── backends/            # LLM backends
│   │   ├── llama_cpp.py     # llama.cpp backend
│   │   ├── mlx.py           # Apple Silicon MLX backend
│   │   └── base.py          # Base backend interface
│   ├── interactive/         # Interactive CLI
│   │   ├── ui.py            # UI utilities
│   │   ├── display.py       # Hardware display
│   │   └── tips.py          # Help content
│   ├── tools/               # Tool execution
│   │   └── executor.py      # Tool execution engine
│   └── utils/               # Shared utilities
│       ├── token_counter.py # Token counting
│       ├── project_discovery.py # Project root discovery
│       └── network.py       # Network utilities
├── config/                  # Configuration files
│   └── models/              # Model configurations
│       ├── model_metadata.json      # Model metadata
│       ├── mlx_quant_sizes.json     # MLX quantization sizes
│       ├── gguf_quant_sizes.json    # GGUF quantization sizes
│       └── selector_config.json     # Selection constants
└── docs/                    # Documentation

Architecture Principles

  • Modular Design: Each module has a single, focused responsibility
  • Configuration Over Code: Static data extracted to JSON config files
  • Separation of Concerns: API, CLI, and business logic are cleanly separated
  • File Size Limit: Modules kept under 300 lines (with few exceptions) for maintainability

Development

Code Quality Standards

This project follows strict code quality standards:

  • File Size: No files > 300 lines (with few exceptions)
  • Function Size: No functions > 50 lines
  • Nesting Depth: No indentation > 3 levels
  • DRY Principle: No duplicated code blocks longer than 3 lines
  • Single Responsibility: Each module does one thing
  • Configuration Over Code: Static data in JSON configs

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_tool_parsing.py -v

# Run with coverage
python -m pytest tests/ --cov=src

Recent Refactoring

Major refactoring completed to improve modularity:

Before: Monolithic files (main.py: 556 lines, routes.py: 1,183 lines)
After: Modular architecture (main.py: 99 lines, routes.py: 252 lines)

Changes:

  • Extracted API logic into focused modules (formatting, parsing, handlers)
  • Created CLI package with separated concerns (parser, runner, server)
  • Moved hardcoded model data to JSON configuration files
  • Created shared utility modules (token_counter, project_discovery, network)
  • Reduced code duplication across the codebase

See docs/ARCHITECTURE.md for detailed architecture documentation.

TODO / Roadmap

Planned Features

  • Plan Mode: Add a "plan mode" that disables tool execution for planning-only conversations. This would let the model describe proposed file changes without actually applying them until explicitly confirmed.
    • Usage: --plan-mode flag or API parameter
    • When enabled: Model can see what tools would do but doesn't execute them
    • Use case: Review changes before applying them

Current Status

  • ✅ Tool instructions now injected by default for all clients
  • ✅ Improved file operation safety (verify with ls/grep before reading)
  • ✅ Working directory support (extracted from client context)
  • 🔄 Plan mode - coming soon

Contributing

Contributions are welcome! Please ensure:

  1. Code follows the quality standards above
  2. All tests pass
  3. New features include tests
  4. Documentation is updated

License

MIT License