Local Swarm
Run a swarm of local LLMs on your hardware. Multiple models work together to give you the best answer through consensus voting.
What It Does
- Auto-detects your hardware (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
- Downloads and runs multiple LLM instances optimized for your VRAM/RAM
- Uses consensus voting - all instances answer, best response wins
- Connects multiple machines on your network for a "hive mind" effect
- Provides an OpenAI-compatible API at http://localhost:17615/v1
Quick Start
```bash
# Clone and install
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm
pip install -r requirements.txt

# Run it
python main.py
```
On first run, it will:
- Detect your hardware
- Pick the best model and quantization
- Download the model (one-time)
- Start multiple LLM workers
- Expose the API at http://localhost:17615
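Once it is up, you can sanity-check the server from Python. This is a minimal sketch: it assumes the `requests` package is installed and uses the endpoints documented under API Endpoints below.

```python
# Quick sanity check against a running swarm (assumes `requests` is installed).
import requests

BASE = "http://localhost:17615"

print(requests.get(f"{BASE}/health").status_code)  # expect 200 when healthy
print(requests.get(f"{BASE}/v1/models").json())    # models the swarm exposes
```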
Usage
Interactive Mode (default)
```bash
python main.py
```
Shows a menu with:
- Recommended configuration (auto-selected)
- Browse all compatible models
- Custom configuration wizard
Auto Mode (no menu)
```bash
python main.py --auto
```
With Other Options
```bash
python main.py --model qwen:3b:q4      # Use specific model
python main.py --instances 4           # Force 4 workers
python main.py --port 8080             # Custom port
python main.py --detect                # Show hardware info only
python main.py --federation            # Enable network federation
python main.py --mcp                   # Enable MCP server
python main.py --use-opencode-tools    # Use opencode tools (adds ~27k tokens)
```
Tool Mode Options:
- Default: Local tool server (~125 tokens, saves context window space)
- `--use-opencode-tools`: Full opencode tool definitions (~27k tokens, more capabilities)
Connect to Opencode
Add to your opencode config:
```json
{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}
```
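The same endpoint works with any OpenAI-compatible client. Here is a minimal sketch using the `openai` Python package (v1+), reusing the base URL, placeholder API key, and model name from the config above:

```python
# Minimal OpenAI-compatible client call against the local swarm.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:17615/v1", api_key="not-needed")

reply = client.chat.completions.create(
    model="local-swarm",
    messages=[{"role": "user", "content": "Write a haiku about consensus."}],
)
print(reply.choices[0].message.content)
```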
Network Federation (Hive Mind)
Run on multiple machines to combine their power:
```bash
# Machine 1 (Windows with RTX 4060)
python main.py --auto --federation

# Machine 2 (Mac Mini M1)
python main.py --auto --federation

# Machine 3 (Old laptop)
python main.py --auto --federation
```
Machines auto-discover each other and vote together on every request.
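The discovery protocol itself lives in `src/network/discovery.py`. The sketch below only illustrates the general idea of UDP broadcast discovery on the configured `discovery_port` (8765); the message format and timing are assumptions for illustration, not the project's actual wire format.

```python
# Illustrative UDP broadcast discovery loop (not the project's real protocol).
import json
import socket
import time

DISCOVERY_PORT = 8765  # matches federation.discovery_port in config.yaml

def announce_and_listen(node_name: str, timeout: float = 3.0) -> list[dict]:
    """Broadcast a hello message and collect replies from peers."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.setsockopt(socket.SOL_SOCKET, socket.SO_BROADCAST, 1)
    sock.settimeout(0.5)

    hello = json.dumps({"type": "hello", "node": node_name}).encode()
    sock.sendto(hello, ("255.255.255.255", DISCOVERY_PORT))

    peers, deadline = [], time.time() + timeout
    while time.time() < deadline:
        try:
            data, addr = sock.recvfrom(4096)
            peers.append({"addr": addr[0], "info": json.loads(data)})
        except socket.timeout:
            continue
    sock.close()
    return peers

if __name__ == "__main__":
    print(announce_and_listen("my-node"))
```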
How Consensus Works
1. Your prompt goes to all LLM instances
2. Each instance generates a response independently
3. The consensus algorithm picks the best answer:
   - Similarity (default): Groups responses by meaning, picks the largest group (see the sketch below)
   - Quality: Scores on completeness, code blocks, structure
   - Fastest: Returns the quickest response
   - Majority: Simple text match voting
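For intuition, here is a minimal sketch of a similarity-style vote: group responses that are mutually similar, then return a member of the largest group. The real algorithm lives in `src/swarm/consensus.py`; the character-level similarity measure and the 0.8 threshold below are illustrative assumptions.

```python
# Illustrative similarity vote: largest group of near-identical answers wins.
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    return SequenceMatcher(None, a, b).ratio()

def pick_by_similarity(responses: list[str], threshold: float = 0.8) -> str:
    groups: list[list[str]] = []
    for resp in responses:
        for group in groups:
            if similarity(resp, group[0]) >= threshold:
                group.append(resp)
                break
        else:
            groups.append([resp])
    return max(groups, key=len)[0]  # first member of the largest group

answers = ["The answer is 42.", "The answer is 42!", "It is 41.", "The answer is 42."]
print(pick_by_similarity(answers))  # -> "The answer is 42."
```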
Configuration
Create config.yaml:
```yaml
server:
  host: "127.0.0.1"
  port: 17615

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8

federation:
  enabled: true
  discovery_port: 8765
  max_peers: 10
```
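If you want to inspect these settings from your own scripts, the file is plain YAML. A small sketch, assuming PyYAML is installed (the project reads its own configuration internally; this is just for poking at the file):

```python
# Read config.yaml and print a couple of swarm settings (assumes PyYAML).
import yaml

with open("config.yaml") as fh:
    cfg = yaml.safe_load(fh)

print(cfg["swarm"]["consensus_strategy"])   # e.g. "similarity"
print(cfg["federation"]["discovery_port"])  # 8765 in the example above
```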
Supported Hardware
| Hardware | Backend | Notes |
|---|---|---|
| NVIDIA GPU | llama.cpp (CUDA) | Best performance |
| AMD GPU | llama.cpp (ROCm) | Linux/Windows |
| Intel GPU | llama.cpp (SYCL) | Linux/Windows |
| Apple Silicon | MLX | Native Metal |
| Qualcomm | llama.cpp (CPU) | Android/Termux |
| CPU-only | llama.cpp | Slower but works |
Supported Models
- Qwen 2.5 Coder (3B, 7B, 14B) - Recommended
- DeepSeek Coder (1.3B, 6.7B, 33B)
- CodeLlama (7B, 13B, 34B)
All support GGUF quantization (Q4_K_M recommended).
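As a rough rule of thumb (back-of-the-envelope arithmetic, not the project's `memory_calculator`), a GGUF file takes roughly parameter count times bits-per-weight divided by 8 bytes, and Q4_K_M averages somewhere around 4.5 to 5 bits per weight; actual sizes vary by model family and quant.

```python
# Back-of-the-envelope GGUF size estimate; real files differ by model and quant.
def approx_gguf_size_gb(params_billion: float, bits_per_weight: float = 4.8) -> float:
    return params_billion * 1e9 * bits_per_weight / 8 / 1e9

for size_b in (3, 7, 14):
    print(f"{size_b}B @ ~Q4_K_M: about {approx_gguf_size_gb(size_b):.1f} GB")
```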
API Endpoints
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion with consensus
- `GET /health` - Health check
- `GET /v1/federation/peers` - List discovered peers (when federation enabled)
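A raw HTTP example of the consensus chat endpoint, using the standard OpenAI request shape (a sketch assuming the `requests` package; the payload fields follow the OpenAI API, nothing swarm-specific):

```python
# Raw POST to /v1/chat/completions (assumes `requests` is installed).
import requests

payload = {
    "model": "local-swarm",
    "messages": [{"role": "user", "content": "Explain consensus voting in one sentence."}],
}
resp = requests.post("http://localhost:17615/v1/chat/completions", json=payload)
print(resp.json()["choices"][0]["message"]["content"])
```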
Troubleshooting
Out of Memory
```bash
python main.py --instances 2       # Reduce workers
python main.py --model qwen:3b:q4  # Use smaller model
```
Slow Performance
- Check GPU utilization with `nvidia-smi`
- Reduce instances to avoid contention
- Use Q4 quantization instead of Q6
CUDA Not Detected (Windows)
```bash
nvidia-smi  # Check drivers
pip uninstall llama-cpp-python
pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
```
macOS: MLX Not Found
```bash
pip install mlx-lm
```
Project Structure
```
local_swarm/
├── main.py                        # CLI entry point (99 lines)
├── src/
│   ├── api/                       # OpenAI-compatible API
│   │   ├── routes.py              # HTTP routing (252 lines)
│   │   ├── formatting.py          # Message formatting
│   │   ├── tool_parser.py         # Tool call parsing
│   │   ├── chat_handlers.py       # Chat completion logic
│   │   └── models.py              # API data models
│   ├── cli/                       # Command-line interface
│   │   ├── parser.py              # CLI argument parsing
│   │   ├── main_runner.py         # Main application logic
│   │   ├── server_runner.py       # Server management
│   │   └── test_runner.py         # Test mode execution
│   ├── swarm/                     # Swarm orchestration
│   │   ├── manager.py             # Swarm manager
│   │   ├── worker.py              # LLM worker implementation
│   │   ├── consensus.py           # Consensus algorithms
│   │   └── orchestrator.py        # Generation orchestration
│   ├── models/                    # Model management
│   │   ├── registry.py            # Model registry (194 lines)
│   │   ├── selector.py            # Model selection (329 lines)
│   │   ├── memory_calculator.py   # Memory calculations
│   │   └── downloader.py          # Model downloading
│   ├── hardware/                  # Hardware detection
│   │   ├── detector.py            # Hardware detection
│   │   ├── nvidia.py              # NVIDIA GPU detection
│   │   ├── intel.py               # Intel GPU detection
│   │   └── qualcomm.py            # Qualcomm detection
│   ├── network/                   # Network federation
│   │   ├── federation.py          # Cross-swarm consensus
│   │   └── discovery.py           # Peer discovery
│   ├── backends/                  # LLM backends
│   │   ├── llama_cpp.py           # llama.cpp backend
│   │   ├── mlx.py                 # Apple Silicon MLX backend
│   │   └── base.py                # Base backend interface
│   ├── interactive/               # Interactive CLI
│   │   ├── ui.py                  # UI utilities
│   │   ├── display.py             # Hardware display
│   │   └── tips.py                # Help content
│   ├── tools/                     # Tool execution
│   │   └── executor.py            # Tool execution engine
│   └── utils/                     # Shared utilities
│       ├── token_counter.py       # Token counting
│       ├── project_discovery.py   # Project root discovery
│       └── network.py             # Network utilities
├── config/                        # Configuration files
│   └── models/                    # Model configurations
│       ├── model_metadata.json    # Model metadata
│       ├── mlx_quant_sizes.json   # MLX quantization sizes
│       ├── gguf_quant_sizes.json  # GGUF quantization sizes
│       └── selector_config.json   # Selection constants
└── docs/                          # Documentation
```
Architecture Principles
- Modular Design: Each module has a single, focused responsibility
- Configuration Over Code: Static data extracted to JSON config files
- Separation of Concerns: API, CLI, and business logic are cleanly separated
- No Files > 300 Lines: Most modules kept under 300 lines for maintainability
Development
Code Quality Standards
This project follows strict code quality standards:
- File Size: No files > 300 lines (with few exceptions)
- Function Size: No functions > 50 lines
- Nesting Depth: No indentation > 3 levels
- DRY Principle: No duplicate code (>3 lines)
- Single Responsibility: Each module does one thing
- Configuration Over Code: Static data in JSON configs
Running Tests
```bash
# Run all tests
python -m pytest tests/ -v

# Run specific test file
python -m pytest tests/test_tool_parsing.py -v

# Run with coverage
python -m pytest tests/ --cov=src
```
Recent Refactoring
Major refactoring completed to improve modularity:
Before: Monolithic files (main.py: 556 lines, routes.py: 1,183 lines)
After: Modular architecture (main.py: 99 lines, routes.py: 252 lines)
Changes:
- Extracted API logic into focused modules (formatting, parsing, handlers)
- Created CLI package with separated concerns (parser, runner, server)
- Moved hardcoded model data to JSON configuration files
- Created shared utility modules (token_counter, project_discovery, network)
- Reduced code duplication across the codebase
See docs/ARCHITECTURE.md for detailed architecture documentation.
Contributing
Contributions are welcome! Please ensure:
- Code follows the quality standards above
- All tests pass
- New features include tests
- Documentation is updated
License
MIT License