
Local Swarm

Run a swarm of local LLMs on your hardware. Multiple models work together to give you the best answer through consensus voting.

What It Does

  • Auto-detects your hardware (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
  • Downloads and runs multiple LLM instances optimized for your VRAM/RAM
  • Uses consensus voting - all instances answer, best response wins
  • Connects multiple machines on your network for a "hive mind" effect
  • Provides an OpenAI-compatible API at http://localhost:17615/v1

Quick Start

# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt

# Run it
python main.py

On first run, it will:

  1. Detect your hardware
  2. Pick the best model and quantization
  3. Download the model (one-time)
  4. Start multiple LLM workers
  5. Expose the API at http://localhost:17615
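
Once it's running, a quick sanity check from Python (a minimal sketch; assumes the requests package is installed and that both endpoints return JSON):

import requests  # third-party HTTP client: pip install requests

BASE = "http://localhost:17615"

print(requests.get(f"{BASE}/health").json())      # health check
print(requests.get(f"{BASE}/v1/models").json())   # models the swarm is serving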

Usage

Interactive Mode (default)

python main.py

Shows a menu with:

  • Recommended configuration (auto-selected)
  • Browse all compatible models
  • Custom configuration wizard

Auto Mode (no menu)

python main.py --auto

With Other Options

python main.py --model qwen:3b:q4      # Use specific model
python main.py --instances 4           # Force 4 workers
python main.py --port 8080             # Custom port
python main.py --detect                # Show hardware info only
python main.py --federation            # Enable network federation
python main.py --mcp                   # Enable MCP server
python main.py --use-opencode-tools    # Use opencode tools (adds ~27k tokens)

Tool Mode Options:

  • Default: Local tool server (~125 tokens, saves context window space)
  • --use-opencode-tools: Full opencode tool definitions (~27k tokens, more capabilities)

Connect to Opencode

Add to your opencode config:

{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
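
The same settings work from any OpenAI-compatible client. Here is a minimal sketch using the official openai Python package (assumes pip install openai; the base URL, API key, and model name come from the config above):

from openai import OpenAI

# Point the client at the local swarm instead of api.openai.com
client = OpenAI(base_url="http://localhost:17615/v1", api_key="not-needed")

response = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a function that reverses a string."}],
)
print(response.choices[0].message.content)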

Network Federation (Hive Mind)

Run on multiple machines to combine their power:

# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation

# Machine 2 (Mac Mini M1)
python main.py --auto --federation

# Machine 3 (Old laptop)
python main.py --auto --federation

Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses objective quality scoring, not self-reported confidence, to pick the best answer. This prevents smaller models from overruling better models.

Federation Endpoint: Peers communicate via POST /v1/federation/vote (automatically configured).
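
Conceptually, the head node fans the request out to every discovered peer and keeps the best-scoring answer. A rough sketch of that loop (illustrative only: the payload shape for /v1/federation/vote and the score_quality helper are hypothetical stand-ins, not the project's actual schema):

import requests

def federated_answer(prompt, peers, score_quality):
    # Ask every peer for a candidate response (hypothetical payload shape)
    candidates = []
    for peer in peers:
        try:
            r = requests.post(f"http://{peer}/v1/federation/vote",
                              json={"prompt": prompt}, timeout=120)
            candidates.append(r.json().get("response", ""))
        except requests.RequestException:
            continue  # a slow or offline peer simply drops out of the vote
    # The head node judges every response itself -- no self-reported confidence
    return max(candidates, key=score_quality) if candidates else None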

How Consensus Works

  1. Your prompt goes to all LLM instances
  2. Each instance generates a response independently
  3. The consensus algorithm picks the best answer:
  • Similarity (default): Groups responses by meaning, picks the largest group (see the sketch after this list)
    • Quality: Scores on completeness, code blocks, structure
    • Fastest: Returns the quickest response
    • Majority: Simple text match voting
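
As a rough illustration of the default similarity strategy (a sketch only, not the implementation in consensus.py; it uses difflib from the standard library as a stand-in similarity measure):

from difflib import SequenceMatcher

def similarity_consensus(responses, threshold=0.8):
    # Group responses whose texts are pairwise similar enough
    groups = []
    for resp in responses:
        for group in groups:
            if SequenceMatcher(None, resp, group[0]).ratio() >= threshold:
                group.append(resp)
                break
        else:
            groups.append([resp])
    # The largest group wins; return its first member as the answer
    return max(groups, key=len)[0]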

Configuration

Create config.yaml:

server:
  host: "127.0.0.1"
  port: 17615

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8

federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10

Supported Hardware

Hardware        Backend             Notes
NVIDIA GPU      llama.cpp (CUDA)    Best performance
AMD GPU         llama.cpp (ROCm)    Linux/Windows
Intel GPU       llama.cpp (SYCL)    Linux/Windows
Apple Silicon   MLX                 Native Metal
Qualcomm        llama.cpp (CPU)     Android/Termux
CPU-only        llama.cpp           Slower but works

Supported Models

  • Qwen 2.5 Coder (3B, 7B, 14B) - Recommended
  • DeepSeek Coder (1.3B, 6.7B, 33B)
  • CodeLlama (7B, 13B, 34B)

All support GGUF quantization (Q4_K_M recommended).
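
As a back-of-the-envelope sizing rule (a rough sketch; the real logic in memory_calculator.py accounts for more than this): Q4_K_M stores roughly 4.5 bits per weight, so a 7B model lands around 4-5 GB once runtime overhead is included.

def rough_gguf_size_gb(params_billion, bits_per_weight=4.5, overhead=1.2):
    # bits -> bytes -> GB, padded ~20% for KV cache and runtime buffers
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9 * overhead

print(f"~{rough_gguf_size_gb(7):.1f} GB for a 7B model at Q4_K_M")  # ~4.7 GB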

API Endpoints

  • GET /v1/models - List available models
  • POST /v1/chat/completions - Chat completion with consensus (example below)
  • GET /health - Health check
  • POST /v1/federation/vote - Federation voting (used internally between peers)
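
For example, hitting the chat endpoint directly (a minimal sketch with the requests package; the body follows the standard OpenAI chat-completions shape):

import requests

payload = {
    "model": "local-swarm",
    "messages": [{"role": "user", "content": "Explain consensus voting in one sentence."}],
}
r = requests.post("http://localhost:17615/v1/chat/completions", json=payload)
print(r.json()["choices"][0]["message"]["content"])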

Troubleshooting

Out of Memory

python main.py --instances 2           # Reduce workers
python main.py --model qwen:3b:q4      # Use smaller model

Slow Performance

  • Check GPU utilization with nvidia-smi
  • Reduce instances to avoid contention
  • Use Q4 quantization instead of Q6

CUDA Not Detected (Windows)

nvidia-smi  # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121

macOS: MLX Not Found

pip install mlx-lm

Project Structure

local_swarm/
├── main.py                   # CLI entry point (99 lines)
├── src/
│   ├── api/                 # OpenAI-compatible API
│   │   ├── routes.py        # HTTP routing (252 lines)
│   │   ├── formatting.py    # Message formatting
│   │   ├── tool_parser.py   # Tool call parsing
│   │   ├── chat_handlers.py # Chat completion logic
│   │   └── models.py        # API data models
│   ├── cli/                 # Command-line interface
│   │   ├── parser.py        # CLI argument parsing
│   │   ├── main_runner.py   # Main application logic
│   │   ├── server_runner.py # Server management
│   │   └── test_runner.py   # Test mode execution
│   ├── swarm/               # Swarm orchestration
│   │   ├── manager.py       # Swarm manager
│   │   ├── worker.py        # LLM worker implementation
│   │   ├── consensus.py     # Consensus algorithms
│   │   └── orchestrator.py  # Generation orchestration
│   ├── models/              # Model management
│   │   ├── registry.py      # Model registry (194 lines)
│   │   ├── selector.py      # Model selection (329 lines)
│   │   ├── memory_calculator.py # Memory calculations
│   │   └── downloader.py    # Model downloading
│   ├── hardware/            # Hardware detection
│   │   ├── detector.py      # Hardware detection
│   │   ├── nvidia.py        # NVIDIA GPU detection
│   │   ├── intel.py         # Intel GPU detection
│   │   └── qualcomm.py      # Qualcomm detection
│   ├── network/             # Network federation
│   │   ├── federation.py    # Cross-swarm consensus
│   │   └── discovery.py     # Peer discovery
│   ├── backends/            # LLM backends
│   │   ├── llama_cpp.py     # llama.cpp backend
│   │   ├── mlx.py           # Apple Silicon MLX backend
│   │   └── base.py          # Base backend interface
│   ├── interactive/         # Interactive CLI
│   │   ├── ui.py            # UI utilities
│   │   ├── display.py       # Hardware display
│   │   └── tips.py          # Help content
│   ├── tools/               # Tool execution
│   │   └── executor.py      # Tool execution engine
│   └── utils/               # Shared utilities
│       ├── token_counter.py # Token counting
│       ├── project_discovery.py # Project root discovery
│       └── network.py       # Network utilities
├── config/                  # Configuration files
│   └── models/              # Model configurations
│       ├── model_metadata.json      # Model metadata
│       ├── mlx_quant_sizes.json     # MLX quantization sizes
│       ├── gguf_quant_sizes.json    # GGUF quantization sizes
│       └── selector_config.json     # Selection constants
└── docs/                    # Documentation

Architecture Principles

  • Modular Design: Each module has a single, focused responsibility
  • Configuration Over Code: Static data extracted to JSON config files
  • Separation of Concerns: API, CLI, and business logic are cleanly separated
  • No Files > 300 Lines: Most modules kept under 300 lines for maintainability

Development

Code Quality Standards

This project follows strict code quality standards:

  • File Size: No files > 300 lines (with a few exceptions)
  • Function Size: No functions > 50 lines
  • Nesting Depth: No indentation deeper than 3 levels
  • DRY Principle: No duplicated blocks longer than 3 lines
  • Single Responsibility: Each module does one thing
  • Configuration Over Code: Static data in JSON configs

Running Tests

# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_tool_parsing.py -v

# Run with coverage
python -m pytest tests/ --cov=src

Recent Refactoring

Major refactoring completed to improve modularity:

Before: Monolithic files (main.py: 556 lines, routes.py: 1,183 lines)
After: Modular architecture (main.py: 99 lines, routes.py: 252 lines)

Changes:

  • Extracted API logic into focused modules (formatting, parsing, handlers)
  • Created CLI package with separated concerns (parser, runner, server)
  • Moved hardcoded model data to JSON configuration files
  • Created shared utility modules (token_counter, project_discovery, network)
  • Reduced code duplication across the codebase

See docs/ARCHITECTURE.md for detailed architecture documentation.

Recent Improvements

Universal Tool Support (2025-02-25)

  • Tool instructions automatically injected for all clients (Continue, hollama, curl, etc.)
  • No client-side configuration needed - just use the API
  • Enhanced file operation guidance: model uses ls/grep to verify files exist before reading
  • Working directory auto-extraction from prompts ("in /path/to/dir" patterns; see the sketch after this list)
  • Proper OpenAI tool format with unique IDs and tool_call_id linking
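
For instance, the "in /path/to/dir" extraction could look roughly like this (a hypothetical sketch; the actual pattern used by the codebase may differ):

import re

def extract_working_dir(prompt):
    # Look for an "in /absolute/path" phrase anywhere in the prompt
    match = re.search(r"\bin\s+(/[\w./-]+)", prompt)
    return match.group(1) if match else None

print(extract_working_dir("Fix the failing test in /home/me/projects/local_swarm"))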

OpenCode-Compatible Streaming (2025-02-25)

  • Proper reasoning_content field for "Thinking..." collapsible blocks
  • Multi-chunk tool_calls streaming matching Vercel AI SDK format
  • Final answer delivered in content field after tool execution

Federation Quality Voting (2025-02-25)

  • Head node now objectively judges all peer responses using quality metrics
  • No more reliance on self-reported confidence (which was biased toward the local node)
  • All responses scored on length, structure, and completeness (sketched below)
  • Fair competition: 14B models properly beat 3B on quality tasks
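
Illustratively, such an objective scorer might weigh signals like these (a hypothetical sketch, in the spirit of the score_quality placeholder in the federation sketch above; the real metrics are certainly more nuanced):

def score_quality(response):
    score = len(response) * 0.01                 # longer, more complete answers
    score += 10 * response.count("```")          # code blocks are a strong signal
    score += 2 * response.count("\n- ")          # structured lists
    if response.rstrip().endswith((".", "`", ")")):
        score += 5                               # ends cleanly, likely not truncated
    return score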

Planned Features

  • Plan Mode: Disable tool execution for planning-only conversations (--plan-mode)
  • Tool Consensus: Verify tool calls across multiple workers before execution (for critical operations)

Contributing

Contributions are welcome! Please ensure:

  1. Code follows the quality standards above
  2. All tests pass
  3. New features include tests
  4. Documentation is updated

License

MIT License
