
sleepy e323d43d2b feat: Validate MLX models exist before download and suggest alternatives

- Add _validate_mlx_model_exists() to check HuggingFace repos
- Show warning when selected quantization doesn't exist
- List available quantizations for the model
- Better error messages with suggestions

This prevents trying to download non-existent quantizations like 5bit
when only 3bit, 4bit, 6bit, 8bit are available.

2026-02-23 23:48:53 +01:00

.fcg

feat: Validate MLX models exist before download and suggest alternatives

2026-02-23 23:48:53 +01:00

docs

Add CONTEXT.md documentation

2026-02-23 20:19:46 +01:00

scripts

Initial commit: Local Swarm project structure and documentation

2026-02-23 16:46:31 +01:00

src

feat: Validate MLX models exist before download and suggest alternatives

2026-02-23 23:48:53 +01:00

tests

Initial commit: Local Swarm project structure and documentation

2026-02-23 16:46:31 +01:00

.gitignore

Fix .gitignore to allow src/models/ directory

2026-02-23 19:51:40 +01:00

AGENTS.md

Phase 4: Implement OpenAI-compatible API server

2026-02-23 17:29:16 +01:00

main.py

feat: Apple Silicon MLX support, sequential workers, live status display, worker names

2026-02-23 22:57:38 +01:00

PLAN.md

Update PLAN.md and README with documentation completion

2026-02-23 18:40:35 +01:00

README.md

Add comprehensive documentation

2026-02-23 18:39:56 +01:00

requirements-macos.txt

Initial commit: Local Swarm project structure and documentation

2026-02-23 16:46:31 +01:00

requirements.txt

Phase 6: Network Federation (#1)

2026-02-23 18:05:27 +01:00

REVIEW.md

feat: Apple Silicon MLX support, sequential workers, live status display, worker names

2026-02-23 22:57:38 +01:00

setup.py

Initial commit: Local Swarm project structure and documentation

2026-02-23 16:46:31 +01:00

README.md

Local Swarm

Automatically configure and run a swarm of small coding LLMs optimized for your hardware. Provides an OpenAI-compatible API for seamless integration with opencode and other tools.

Features

Interactive Menu System: Easy-to-use menu for selecting model configurations, browsing options, or creating custom setups
Hardware Auto-Detection: Automatically detects your GPU (NVIDIA, AMD, Intel), Apple Silicon, Qualcomm (Android), or CPU and selects optimal settings
Smart Model Selection: Chooses the best model, quantization, and instance count based on available VRAM/RAM
Startup Summary: Clear display of detected hardware, selected model, resource usage, and worker status
Swarm Consensus: Multiple LLM instances vote on the best response for higher quality outputs
Network Federation: Multiple machines on the same network can join into a "federated swarm" for distributed consensus
OpenAI-Compatible API: Drop-in replacement for OpenAI API at http://localhost:8000/v1
MCP Server: Model Context Protocol support for tight AI assistant integration
Cross-Platform: Works on Windows, macOS, Linux, and Android (via Termux) with automatic backend selection

Documentation

Quick Start - Get up and running in minutes
Complete Guide - Comprehensive documentation
- Opencode configuration examples
- API reference
- Troubleshooting guide
- Performance tuning
- Advanced configuration
Configuration - Customize your setup
Interactive Mode - Using the menu system
Tips & Help - Learn about models, quantization, and optimization

Quick Start

Installation

Windows (PowerShell)

# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run installer
.\scripts\install.bat

macOS/Linux

# Clone the repository
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run installer
chmod +x scripts/install.sh
./scripts/install.sh

Android (Termux)

# In Termux app
git clone https://github.com/yourusername/local_swarm.git
cd local_swarm

# Run Termux installer
chmod +x scripts/install-termux.sh
./scripts/install-termux.sh

Note: Android support is limited to small models (1-3B) due to memory constraints. Requires 8GB+ RAM.

Usage

Start the Swarm

# Auto-detect hardware and start
python -m local_swarm

# Or use the CLI
python main.py

On first run, the tool will:

Scan your hardware (GPU, RAM, CPU)
Select the optimal model and quantization
Download the model (one-time)
Start multiple instances based on available memory
Expose the API at http://localhost:8000

Example startup output:

🔍 Detecting hardware...
   OS: Windows 11
   GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
   CPU: 16 cores
   RAM: 32 GB

📊 Optimal configuration:
   Model: Qwen 2.5 Coder 3B
   Quantization: Q4_K_M (1.8 GB per instance)
   Instances: 8 (using 14.4 GB VRAM)

⬇️  Downloading model...
   Progress: 100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 1.8/1.8 GB

🚀 Starting swarm...
   Worker 1: Ready (GPU:0)
   Worker 2: Ready (GPU:0)
   ...
   Worker 8: Ready (GPU:0)

✅ Local Swarm is running!
   API: http://localhost:8000/v1
   Models: http://localhost:8000/v1/models
   Health: http://localhost:8000/health

💡 Configure opencode to use:
   base_url: http://localhost:8000/v1
   api_key: any (not used)

Configure opencode

Add to your opencode configuration:

{
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
}

MCP Server (Optional)

For tighter integration with AI assistants, enable the MCP server:

python main.py --mcp

This runs alongside the HTTP API and exposes tools AI assistants can use:

get_hardware_info - Query CPU, GPU, and RAM
get_swarm_status - Check worker health
generate_code - Generate code with consensus
list_available_models - See what models can run
get_worker_details - Get detailed worker statistics

MCP allows AI assistants to automatically query your hardware capabilities and select appropriate models.
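As a rough illustration of what an assistant sends when it invokes one of these tools, the sketch below builds a JSON-RPC 2.0 `tools/call` request, which is the message shape MCP uses for tool invocation. The helper name `build_tool_call` is hypothetical, and the `prompt` argument for `generate_code` is an assumption, since the tool parameters are not documented here.

```python
import json
from itertools import count

# Monotonic request ids, as JSON-RPC requires each request to carry an id.
_request_ids = count(1)


def build_tool_call(tool_name, arguments=None):
    """Return a JSON-RPC 2.0 request dict for an MCP tools/call invocation."""
    return {
        "jsonrpc": "2.0",
        "id": next(_request_ids),
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments or {}},
    }


status_request = build_tool_call("get_swarm_status")

# "prompt" is a hypothetical argument name, not taken from the project.
code_request = build_tool_call(
    "generate_code", {"prompt": "Write a Python function to sort a list"}
)

print(json.dumps(status_request, indent=2))
```

An MCP client library would normally build and send these messages for you; the sketch only shows what travels over the wire.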

Configuration

Create a config.yaml file for customization:

server:
  host: "127.0.0.1"
  port: 8000

swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 8

hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5  # Use 50% of system RAM for CPU/Apple Silicon

federation:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  max_peers: 10

models:
  cache_dir: "~/.local_swarm/models"
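One way to think about how a partial config.yaml could combine with built-in settings is a section-by-section merge. The sketch below assumes the file has already been parsed into a dictionary (for example by a YAML library) and treats the values shown above as defaults, which is an assumption; `merge_config` and `DEFAULTS` are illustrative names, not the project's API.

```python
import copy

# Values copied from the example config above; treating them as defaults is an assumption.
DEFAULTS = {
    "server": {"host": "127.0.0.1", "port": 8000},
    "swarm": {"consensus_strategy": "similarity", "min_instances": 2, "max_instances": 8},
    "hardware": {"gpu_memory_fraction": 1.0, "ram_fraction": 0.5},
}

VALID_STRATEGIES = {"similarity", "quality", "fastest"}


def merge_config(user, defaults=DEFAULTS):
    """Overlay user settings on the defaults, one section at a time."""
    merged = copy.deepcopy(defaults)
    for section, values in user.items():
        if isinstance(values, dict) and isinstance(merged.get(section), dict):
            merged[section].update(values)
        else:
            merged[section] = copy.deepcopy(values)
    strategy = merged["swarm"]["consensus_strategy"]
    if strategy not in VALID_STRATEGIES:
        raise ValueError(f"unknown consensus_strategy: {strategy!r}")
    return merged


config = merge_config({"server": {"port": 8080}, "swarm": {"consensus_strategy": "fastest"}})
print(config["server"], config["swarm"])
```

Merging per section keeps unspecified keys (such as `host` here) at their default values instead of dropping them.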

CLI Options

# Show hardware detection without starting
python -m local_swarm --detect

# Use specific model
python -m local_swarm --model qwen2.5-coder:3b:q4

# Use specific port
python -m local_swarm --port 8080

# Force number of instances
python -m local_swarm --instances 4

# Download models only (no server)
python -m local_swarm --download-only

# Enable MCP server alongside HTTP API
python -m local_swarm --mcp

# Show help
python -m local_swarm --help

# Auto-detect without interactive menu
python -m local_swarm --auto
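For readers who want to script around these flags, here is a minimal argparse sketch that accepts the options listed above. It is not the project's actual parser; the default port and the reading of the `--model` value as `family:size:quant` are assumptions based on the examples.

```python
import argparse


def build_parser():
    """Build a parser that mirrors the flags shown in the CLI examples."""
    parser = argparse.ArgumentParser(prog="local_swarm")
    parser.add_argument("--detect", action="store_true", help="show hardware detection and exit")
    parser.add_argument("--model", help="model spec, e.g. qwen2.5-coder:3b:q4")
    parser.add_argument("--port", type=int, default=8000, help="API port")
    parser.add_argument("--instances", type=int, help="force the number of instances")
    parser.add_argument("--download-only", action="store_true", help="download models, no server")
    parser.add_argument("--mcp", action="store_true", help="enable the MCP server")
    parser.add_argument("--auto", action="store_true", help="skip the interactive menu")
    return parser


def parse_model_spec(spec):
    """Split 'family:size:quant' into its parts (the format is an assumption)."""
    parts = spec.split(":")
    if len(parts) != 3 or not all(parts):
        raise ValueError(f"expected family:size:quant, got {spec!r}")
    family, size, quant = parts
    return {"family": family, "size": size, "quant": quant}


args = build_parser().parse_args(["--model", "qwen2.5-coder:3b:q4", "--instances", "4", "--auto"])
model = parse_model_spec(args.model)
print(args.instances, args.auto, model)
```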

Interactive Mode

By default, Local Swarm starts in interactive mode with a menu system:

======================================================================
 Local Swarm - Model Selection
======================================================================

----------------------------------------------------------------------
 Hardware Detection
----------------------------------------------------------------------
  Operating System: Darwin
  CPU: 12 cores
  System RAM: 24.0 GB
  Available RAM: 6.2 GB

  GPU Detected:
    Name: Apple Silicon GPU
    Type: Apple Silicon (Unified Memory)
    Total Memory: 24.0 GB

  Available for LLMs: 12.0 GB
  (Using 50% of system RAM)

----------------------------------------------------------------------
 Configuration Options
----------------------------------------------------------------------

  💡 Recommended: Qwen 2.5 Coder 7b (q6_k)
     Instances: 2
     Memory: 12.0 GB

  [1] Recommended Configuration - Qwen 2.5 Coder 7b (q6_k) with 2 instances
  [2] Browse All Configurations - See all models that fit your hardware
  [3] Custom Configuration - Specify exact model and number of instances

  Enter your choice:

Menu Options

Recommended Configuration - Automatically selects the best model and instance count for your hardware
Browse All Configurations - Shows all feasible models that fit in your available memory
Custom Configuration - Step-by-step wizard to select:
- Model family (Qwen, DeepSeek, CodeLlama)
- Model size (3B, 7B, 14B)
- Quantization level (Q4, Q5, Q6)
- Number of instances (1 to max supported)

To skip the menu and use auto-detection, pass the --auto flag.

Startup Summary

When starting, Local Swarm displays a comprehensive summary:

======================================================================
 Local Swarm - Startup Summary
======================================================================

----------------------------------------------------------------------
 Hardware Detection
----------------------------------------------------------------------
  Operating System: Darwin
  CPU: 12 cores
  System RAM: 24.0 GB
  Available RAM: 6.2 GB

  GPU Detected:
    Name: Apple Silicon GPU
    Type: Apple Silicon (Unified Memory)
    Total Memory: 24.0 GB

  Available for LLMs: 12.0 GB

----------------------------------------------------------------------
 Model Configuration
----------------------------------------------------------------------
  Model: Qwen 2.5 Coder 7b (q6_k)
  Description: Alibaba's code-focused model
  Instances: 2
  Memory per Instance: 6.0 GB
  Total Memory: 12.0 GB
  Utilization: 100.0% of available

======================================================================

How It Works

Hardware Detection

The tool automatically detects your system:

Windows: NVIDIA (NVML), AMD (ROCm), Intel (oneAPI)
macOS: Apple Silicon via Metal, unified memory model
Linux: NVIDIA (NVML), AMD (ROCm), Intel (oneAPI/OpenCL)
Android: Qualcomm Adreno GPUs (via Termux)

Supported Backends:

NVIDIA: CUDA via llama.cpp
AMD: ROCm via llama.cpp (Linux; experimental on Windows)
Intel: oneAPI/SYCL via llama.cpp
Apple Silicon: Metal via MLX
Qualcomm: CPU fallback on llama.cpp (Android/Termux)

Model Selection

Based on available memory:

Discrete GPU: Use 100% of VRAM minus OS overhead
Apple Silicon: Use 50% of unified RAM
CPU-only: Use 50% of system RAM

The algorithm selects:

Largest model size that fits
Highest quantization quality possible
Maximum instances (2-8) based on memory

Example configurations:

Hardware	Model	Quant	Instances	Memory Used
RTX 4090 24GB	Qwen 2.5 14B	Q4_K_M	2	~17.6 GB
RTX 4060 Ti 16GB	Qwen 2.5 7B	Q4_K_M	3	~13.5 GB
RTX 4060 Ti 8GB	Qwen 2.5 3B	Q6_K	4	~10.4 GB
RX 7900 XTX 24GB	Qwen 2.5 14B	Q4_K_M	2	~17.6 GB
Arc A770 16GB	Qwen 2.5 7B	Q5_K_M	2	~10.4 GB
M4 Max 64GB	Qwen 2.5 14B	Q4_K_M	4	~35.2 GB
M3 Pro 36GB	Qwen 2.5 7B	Q4_K_M	4	~18 GB
M1 8GB	Qwen 2.5 3B	Q4_K_M	2	~3.6 GB
Snapdragon 8 Gen 3	Qwen 2.5 3B	Q4_K_M	1	~1.8 GB
CPU 32GB	Qwen 2.5 3B	Q4_K_M	8	~14.4 GB
Federated (3 machines)	Qwen 2.5 7B	Q4_K_M	9	~40.5 GB
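The selection rules can be sketched in a few lines. This is an illustration under stated assumptions, not the project's algorithm: the per-instance sizes in `CATALOG` are derived by dividing the "Memory Used" figures in this README by their instance counts, the priority order (model size, then quantization, then instance count) follows the list above, and OS overhead on discrete GPUs is ignored. All names are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    model: str
    params_b: float         # parameter count in billions, used to rank model size
    quant: str
    quant_rank: int         # higher means better quantization quality
    gb_per_instance: float  # derived from the README examples (assumption)


CATALOG = [
    Candidate("Qwen 2.5 Coder 14B", 14, "Q4_K_M", 4, 8.8),
    Candidate("Qwen 2.5 Coder 7B", 7, "Q6_K", 6, 6.0),
    Candidate("Qwen 2.5 Coder 7B", 7, "Q5_K_M", 5, 5.2),
    Candidate("Qwen 2.5 Coder 7B", 7, "Q4_K_M", 4, 4.5),
    Candidate("Qwen 2.5 Coder 3B", 3, "Q6_K", 6, 2.6),
    Candidate("Qwen 2.5 Coder 3B", 3, "Q4_K_M", 4, 1.8),
]


def memory_budget_gb(total_gb, discrete_gpu, gpu_fraction=1.0, ram_fraction=0.5):
    """100% of VRAM for a discrete GPU, 50% of RAM for Apple Silicon or CPU-only."""
    return total_gb * (gpu_fraction if discrete_gpu else ram_fraction)


def select(budget_gb, min_instances=2, max_instances=8):
    """Pick the largest model, then the best quantization, that fits min_instances."""
    ranked = sorted(CATALOG, key=lambda c: (c.params_b, c.quant_rank), reverse=True)
    for candidate in ranked:
        if candidate.gb_per_instance * min_instances <= budget_gb:
            instances = min(max_instances, int(budget_gb // candidate.gb_per_instance))
            return candidate, instances
    return None  # nothing fits the budget


budget = memory_budget_gb(24.0, discrete_gpu=False)  # 24 GB Apple Silicon machine
choice, instances = select(budget)
print(choice.model, choice.quant, instances, choice.gb_per_instance * instances)
```

With a 24 GB unified-memory machine the budget is 12.0 GB, and the sketch lands on the 7B model at Q6_K with 2 instances, which matches the recommendation shown in the Interactive Mode example. Other rows in the table may reflect additional factors that this sketch does not model.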

Swarm Consensus

For each request, the swarm:

Sends the prompt to all running instances
Collects responses in parallel
Runs consensus algorithm:
- Similarity: Groups responses by semantic similarity and returns a response from the largest group
- Quality: Scores responses on completeness and code quality
- Fastest: Returns the quickest response
Returns the winning response via OpenAI-compatible API
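The similarity and fastest strategies can be sketched as follows. This is a simplified stand-in, not the project's implementation: it uses character-level similarity from Python's `difflib` in place of semantic similarity, and the 0.7 threshold and function names are assumptions.

```python
from difflib import SequenceMatcher


def similarity_consensus(responses, threshold=0.7):
    """Group similar responses and return the first member of the largest group."""
    if not responses:
        raise ValueError("no responses to vote on")
    groups = []
    for text in responses:
        for group in groups:
            # Compare against the group's first member as its representative.
            if SequenceMatcher(None, group[0], text).ratio() >= threshold:
                group.append(text)
                break
        else:
            groups.append([text])
    return max(groups, key=len)[0]


def fastest_consensus(timed_responses):
    """Return the response with the lowest latency from (seconds, text) pairs."""
    return min(timed_responses, key=lambda pair: pair[0])[1]


responses = [
    "def sort_list(lst):\n    return sorted(lst)",
    "def sort_list(items):\n    return sorted(items)",
    "print('hello')",
]
winner = similarity_consensus(responses)
quickest = fastest_consensus([(1.2, "slow answer"), (0.4, "fast answer")])
print(winner)
```

The two near-identical sorting functions form the largest group, so the outlier is discarded even though every worker received the same prompt.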

Network Federation

Run Local Swarm on multiple machines on the same network to create a "federated swarm":

Example Setup:

Windows PC (RTX 4060 Ti): 4 instances
Mac Mini (M1): 2 instances
MacBook (M4): 3 instances
Total: 9 instances voting on every request

How it works:

Each machine auto-discovers the others via mDNS/Bonjour
Each swarm generates responses independently
Local consensus picks the best response on each machine
Cross-swarm consensus votes across all machines
The best response is returned to the client
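The two voting levels described above can be sketched as a vote within each machine followed by a vote across the per-machine winners. This is an assumption about how the levels combine; the real cross-swarm step may weight machines by instance count or use a different strategy. Exact-match majority voting after whitespace normalization stands in for the actual consensus strategy, and all names are hypothetical.

```python
from collections import Counter


def majority(responses):
    """Return the most common response, ignoring whitespace differences."""
    normalized = [" ".join(r.split()) for r in responses]
    winner, _ = Counter(normalized).most_common(1)[0]
    return responses[normalized.index(winner)]


def federated_consensus(swarms):
    """Vote within each machine first, then across the per-machine winners."""
    local_winners = {machine: majority(resps) for machine, resps in swarms.items()}
    return majority(list(local_winners.values())), local_winners


# Placeholder responses for the 4 + 2 + 3 instance setup described above.
swarms = {
    "windows-pc": ["A", "A", "B", "A"],
    "mac-mini": ["B", "B"],
    "macbook": ["A", "C", "A"],
}
final, local_winners = federated_consensus(swarms)
print(local_winners, final)
```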

To enable federation:

federation:
  enabled: true
  discovery_port: 8765  # mDNS/Bonjour discovery
  federation_port: 8766  # Inter-swarm communication

Machines will automatically discover each other within 10 seconds.

API Endpoints

GET /v1/models

List available models

POST /v1/chat/completions

Chat completion with consensus

Request:

{
  "model": "local-swarm",
  "messages": [
    {"role": "user", "content": "Write a Python function to sort a list"}
  ]
}

Response:

{
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-swarm",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "def sort_list(lst):\n    return sorted(lst)"
    },
    "finish_reason": "stop"
  }]
}
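A minimal client for this endpoint, using only the Python standard library, might look like the sketch below. The base URL and model name come from this README; the function names, the 120-second timeout, and the placeholder bearer token are assumptions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"


def build_chat_request(prompt, model="local-swarm"):
    """Build a POST request for /v1/chat/completions."""
    payload = {"model": model, "messages": [{"role": "user", "content": prompt}]}
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            # The key is not used by the server; any value works per the startup output.
            "Authorization": "Bearer not-needed",
        },
        method="POST",
    )


def extract_content(response_body):
    """Pull the assistant message text out of a chat.completion response."""
    return response_body["choices"][0]["message"]["content"]


def chat(prompt, timeout=120):
    """Send the prompt to a running swarm and return the winning response text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        return extract_content(json.load(resp))
```

With the swarm running, `print(chat("Write a Python function to sort a list"))` would print the consensus answer. Any OpenAI-compatible client pointed at the same base URL should work equally well.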

GET /health

Health check
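Scripts that launch the swarm and then send requests may want to wait until the server is up. The sketch below polls the health endpoint and assumes that an HTTP 200 means the server is ready; the response body format is not documented here, so it is not inspected. Function names and retry settings are hypothetical.

```python
import time
import urllib.error
import urllib.request


def is_healthy(url="http://localhost:8000/health", timeout=2.0):
    """Return True if the health endpoint answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status.
        return False


def wait_until_healthy(probe=is_healthy, attempts=30, delay=1.0, sleep=time.sleep):
    """Call probe() up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The `probe` and `sleep` parameters are injectable so the retry loop can be exercised without a running server.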

GET /metrics

Prometheus metrics (optional)

Supported Models

Currently supported models (auto-selected based on hardware):

Qwen 2.5 Coder (3B, 7B, 14B) - Recommended for coding tasks
DeepSeek Coder (1.3B, 6.7B, 33B) - Good alternative
CodeLlama (7B, 13B, 34B) - Meta's code model

All models support GGUF quantization:

Q4_K_M - Good quality, smallest size (recommended)
Q5_K_M - Better quality
Q6_K - Best quality

Troubleshooting

Out of Memory

If you get OOM errors:

# Reduce instances
python -m local_swarm --instances 2

# Or use smaller model
python -m local_swarm --model qwen2.5-coder:3b:q4

Slow Performance

Check GPU utilization with nvidia-smi (NVIDIA) or Activity Monitor (macOS)
Ensure model is cached (first run downloads to ~/.local_swarm/models)
Try reducing instances to avoid contention

Windows: CUDA not detected

Make sure NVIDIA drivers are installed:

nvidia-smi

If this fails, reinstall the drivers from nvidia.com.

macOS: MLX not found

pip install mlx-lm

Linux: AMD GPU not detected

Ensure ROCm is installed:

rocm-smi

If not found, install from https://www.amd.com/en/developer/rocm-hub.html

Linux: Intel GPU not detected

Install Intel oneAPI:

# Ubuntu/Debian
wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | sudo gpg --dearmor -o /usr/share/keyrings/intel-oneapi-archive-keyring.gpg
echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
sudo apt update
sudo apt install intel-basekit

Android: Termux issues

Ensure Termux is installed from F-Droid (not Play Store)
Run pkg update before installation
Limited to small models (1-3B) due to RAM constraints
Use CPU backend only (no GPU acceleration on Android yet)

Requirements

Python 3.9+
4GB+ RAM (8GB+ recommended)
Optional: NVIDIA/AMD/Intel GPU with 4GB+ VRAM
Optional: Apple Silicon Mac
Optional: Android device with 8GB+ RAM (via Termux)

Development

# Install dev dependencies
pip install -r requirements-dev.txt

# Run tests
pytest

# Run specific platform tests
pytest tests/test_hardware.py -v

# Format code
black src/
ruff check src/

Architecture

Single Machine

┌─────────────────────────────────────┐
│         OpenAI API Client           │
│        (opencode, etc.)             │
└─────────────┬───────────────────────┘
              │ HTTP
              ▼
┌─────────────────────────────────────┐
│     Local Swarm API Server          │
│    (FastAPI / localhost:8000)       │
└─────────────┬───────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│       Swarm Manager                 │
│  ┌─────────┐ ┌─────────┐           │
│  │ Worker 1│ │ Worker 2│ ...       │
│  │(LLM #1) │ │(LLM #2) │           │
│  └────┬────┘ └────┬────┘           │
│       │           │                 │
│       └─────┬─────┘                 │
│             ▼                       │
│      Consensus Engine               │
└─────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│     Backend (llama.cpp / MLX)       │
│    ┌─────────────────────┐          │
│    │   GGUF/MLX Model    │          │
│    │   (Qwen/Codellama)  │          │
│    └─────────────────────┘          │
└─────────────────────────────────────┘
              │
              ▼
┌─────────────────────────────────────┐
│    Hardware (GPU/CPU/Apple Silicon) │
└─────────────────────────────────────┘

Federated Swarm (Multiple Machines)

┌─────────────────────────────────────────────────────────────┐
│                    Local Network                             │
│                                                              │
│  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
│  │  Windows PC  │    │   Mac Mini   │    │   MacBook    │  │
│  │  (RTX 4060)  │    │    (M1)      │    │    (M4)      │  │
│  │  4 instances │    │  2 instances │    │  3 instances │  │
│  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘  │
│         │                   │                   │           │
│         │                   │                   │           │
│         └───────────────────┼───────────────────┘           │
│                             │                               │
│                    ┌────────┴────────┐                      │
│                    │  Cross-Swarm    │                      │
│                    │    Consensus    │                      │
│                    └────────┬────────┘                      │
│                             │                               │
│                    ┌────────▼────────┐                      │
│                    │   opencode      │                      │
│                    └─────────────────┘                      │
└─────────────────────────────────────────────────────────────┘

License

MIT License - See LICENSE file

Contributing

Contributions welcome! Please read CONTRIBUTING.md first.

Acknowledgments

llama.cpp - Inference engine (CUDA/ROCm/SYCL)
MLX - Apple Silicon backend
Qwen - Model family
DeepSeek - Model family
HuggingFace - Model hosting
ROCm - AMD GPU support
oneAPI - Intel GPU support
Termux - Android terminal emulator
