- Fix streaming to work even when tools are present (was forcing JSON mode)
- Fix response format: use empty list [] instead of null for tool_calls
- Add exclude_none config to ChatMessage model to match OpenAI format
- Remove tool instructions from prompt (they were confusing the 3B model)
- Fix tool call parsing to handle markdown code blocks properly
- Change default instances from 3 to 1 for faster debugging
- Allow a minimum of 1 instance in interactive config (was 2 on Mac)
- Add debug logging to track requests and responses
Fixes an infinite-loop issue where opencode would retry requests repeatedly.
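
A minimal sketch of the response-format change, assuming Pydantic v2 and illustrative field names: tool_calls defaults to an empty list instead of null, and None-valued fields are dropped at serialization time so the payload matches what OpenAI clients expect.

    # Sketch only; the real ChatMessage in src/api/models.py may differ.
    from typing import Optional
    from pydantic import BaseModel, Field

    class ToolCall(BaseModel):
        id: str
        type: str = "function"
        function: dict

    class ChatMessage(BaseModel):
        role: str
        content: Optional[str] = None
        tool_calls: list[ToolCall] = Field(default_factory=list)  # [] instead of null

    msg = ChatMessage(role="assistant", content="hello")
    # exclude_none drops None-valued fields, mirroring the OpenAI response shape.
    print(msg.model_dump(exclude_none=True))
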
- select_optimal_model was checking HF API for available quantizations
- This caused the menu to hang or slow down when changing context
- Now only checks availability when browsing or custom config
- Recommended config uses default quantizations (faster)
- Update list_models() and build_models() to accept check_available parameter
- Update interactive.py to pass check_available=True on Mac
- Menu now filters out non-existent quantizations in real-time
- Users can only select quantizations that actually exist on HF
This prevents the issue where a user selects 4bit but the system
tries to download a 5bit variant, because only certain quants exist for each model.
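
A rough sketch of the check_available plumbing; only the control flow is taken from the notes above, the function bodies are assumed:

    # Only the browse/custom paths pay the cost of the HF API check;
    # the recommended path keeps the default quantizations.
    def build_models(check_available: bool = False) -> list:
        variants = build_model_variants()   # e.g. candidate MLX repo ids
        if check_available:
            variants = filter_available_mlx_quants(variants)
        return variants

    def list_models(check_available: bool = False) -> list:
        return build_models(check_available=check_available)
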
- Add filter_available_mlx_quants() to check HuggingFace for existing repos
- Update build_model_variants() to optionally check availability
- Menu will now only show quantizations that actually exist
- Prevents users from selecting non-existent quantizations
Note: This adds a small delay when building the model list because it queries
the HF API, but it prevents download failures later.
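
A hedged sketch of the availability check itself, using huggingface_hub (which the HuggingFace downloader presumably already depends on); the signature and the mlx-community repo ids are examples:

    from huggingface_hub import HfApi
    from huggingface_hub.utils import RepositoryNotFoundError

    def filter_available_mlx_quants(repo_ids: list[str]) -> list[str]:
        # Keep only the repos that actually exist on the Hub.
        api = HfApi()
        available = []
        for repo_id in repo_ids:
            try:
                api.model_info(repo_id)      # raises if the repo does not exist
                available.append(repo_id)
            except RepositoryNotFoundError:
                continue
        return available

    # Example: only some of these quantizations exist for a given model.
    candidates = [f"mlx-community/Qwen2.5-Coder-7B-Instruct-{q}"
                  for q in ("3bit", "4bit", "5bit", "6bit", "8bit")]
    print(filter_available_mlx_quants(candidates))
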
- Add _validate_mlx_model_exists() to check HuggingFace repos
- Show warning when selected quantization doesn't exist
- List available quantizations for the model
- Better error messages with suggestions
This prevents attempts to download non-existent quantizations like 5bit
when only 3bit, 4bit, 6bit, and 8bit are available.
On Mac (Apple Silicon) with seed variation:
- Total memory no longer multiplied by number of responses
- Memory is shared across all responses (same model, different seeds)
- list_available_configurations: Uses 3 responses, single memory calculation
- custom_configuration: Memory doesn't scale with response count
- show_startup_summary: Shows '(shared)' for RAM on Mac
- All memory displays now accurate for seed variation mode
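
The fix boils down to one line of arithmetic; a sketch with an assumed helper name:

    def estimate_total_memory_gb(model_size_gb: float, num_responses: int,
                                 seed_variation: bool) -> float:
        # With seed variation a single loaded model serves every response,
        # so memory does not scale with the response count.
        if seed_variation:
            return model_size_gb
        return model_size_gb * num_responses
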
- Add use_seed_variation mode: Generate multiple responses from one model
with different random seeds (saves memory on Apple Silicon)
- Add enable_reviewer mode: A critic worker validates consensus results
and triggers retries if output looks suspicious
- Add generate_with_seed_variation() method for single-model multi-response
- Add generate_with_reviewer() method with feedback loop
- Auto-enable seed variation on Apple Silicon to save memory
- Configurable max_retries for reviewer mode
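
A sketch of the seed-variation path under assumed names; the GenerationRequest stub only stands in for the backend interface introduced later in this log:

    from dataclasses import dataclass

    @dataclass
    class GenerationRequest:          # stand-in for src/backends/base.py
        prompt: str
        seed: int = 0
        max_tokens: int = 512

    async def generate_with_seed_variation(backend, prompt: str, num_responses: int = 3) -> list:
        # One loaded model, several seeds: memory stays flat while the
        # consensus engine still receives multiple candidate answers.
        return [await backend.generate(GenerationRequest(prompt=prompt, seed=seed))
                for seed in range(num_responses)]
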
- Add Tool, ToolCall, FunctionDefinition models
- Format prompts with tool descriptions for Qwen models
- Parse tool calls from model output (JSON and function call patterns)
- Auto-disable streaming when tools are present
- Return tool_calls in API response with proper finish_reason
- Support both simple function calls and JSON tool_calls format
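
A sketch of the tool-call parsing described above, tolerating markdown code fences around the JSON (exact model output formats vary):

    import json
    import re

    def parse_tool_calls(text: str):
        # Strip ```json ... ``` (or bare ```) fences if present.
        match = re.search(r"```(?:json)?\s*(.*?)\s*```", text, re.DOTALL)
        payload = match.group(1) if match else text.strip()
        try:
            data = json.loads(payload)
        except json.JSONDecodeError:
            return None
        calls = data if isinstance(data, list) else [data]
        # Keep only well-formed calls that name a function.
        return [c for c in calls if isinstance(c, dict) and "name" in c] or None
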
* Update PLAN.md with new phases
- Add Phase 5: CLI & Interactive Interface
- Interactive menu system with 3 options
- Hardware display with detailed specs
- Resource usage monitoring
- Custom configuration wizard
- Add Phase 5.5: MCP Server
- MCP protocol implementation
- 5 MCP tools for AI assistants
- Dual server mode (HTTP + MCP)
- Reorganize phase structure for clarity
* Phase 6: Implement network federation (WIP)
Add src/network/discovery.py:
- SwarmDiscovery class using mDNS/Bonjour
- PeerInfo dataclass for peer metadata
- Automatic peer discovery on local network
- Service advertising for this swarm
- Stale peer detection and cleanup
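
For reference, a minimal zeroconf sketch of the advertise-and-browse pattern SwarmDiscovery wraps; the service type, address, and port are placeholders:

    import socket
    from zeroconf import ServiceBrowser, ServiceInfo, ServiceListener, Zeroconf

    SERVICE_TYPE = "_localswarm._tcp.local."     # assumed service type

    class PeerListener(ServiceListener):
        def add_service(self, zc, type_, name):
            info = zc.get_service_info(type_, name)
            if info:
                print("discovered peer:", name, info.parsed_addresses(), info.port)
        def remove_service(self, zc, type_, name):
            print("peer left:", name)
        def update_service(self, zc, type_, name):
            pass

    zc = Zeroconf()
    # Advertise this swarm so other instances can find it...
    info = ServiceInfo(SERVICE_TYPE, "my-swarm." + SERVICE_TYPE,
                       addresses=[socket.inet_aton("192.168.1.10")], port=8000)
    zc.register_service(info)
    # ...and browse for the others on the LAN.
    ServiceBrowser(zc, SERVICE_TYPE, PeerListener())
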
Add src/network/federation.py:
- FederationClient for HTTP communication with peers
- FederatedSwarm for managing cross-swarm consensus
- Two-phase voting: local consensus then peer voting
- Weighted voting strategy based on confidence
- Federation status monitoring
- Peer health checking
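
The weighted strategy can be illustrated in a few lines; the vote payload shape (answer plus confidence) is an assumption:

    from collections import defaultdict

    def weighted_vote(votes: list[dict]) -> str:
        # votes: [{"answer": "...", "confidence": 0.87}, ...]
        scores = defaultdict(float)
        for vote in votes:
            scores[vote["answer"]] += vote.get("confidence", 0.0)
        # The answer with the highest summed confidence wins.
        return max(scores, key=scores.get)
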
Add src/network/__init__.py:
- Export network classes
Update src/api/routes.py:
- POST /v1/federation/vote - Receive votes from peers
- GET /v1/federation/status - Get federation status
- GET /v1/federation/peers - List discovered peers
Update requirements.txt:
- Add zeroconf for mDNS discovery
Features:
- Auto-discovery of other Local Swarm instances
- Cross-swarm consensus voting
- Configurable minimum peer requirements
- Fallback to local-only if no peers available
- Peer health monitoring
TODO:
- Integrate federation into main.py
- Add --federation flag
- Test multi-machine setup
- Add interactive mode section with screenshots
- Document the 3 menu options (recommended, browse, custom)
- Add startup summary section showing what info is displayed
- Add interactive features and MCP server to features list
- Document --auto flag to skip menu
- Add hardware/resource usage display examples
Add src/interactive.py:
- Interactive model selection menu with 3 options:
1. Recommended Configuration (auto-detect best)
2. Browse All Configurations (see all feasible models)
3. Custom Configuration (user-specified model + instances)
- Hardware info display with detailed specs
- Resource usage monitoring showing:
- Swarm status, model, workers
- Memory usage per worker
- Worker statistics (requests, latency, tokens/sec)
- Custom configuration wizard:
- Select from available models
- Choose model size (3B, 7B, 14B, etc.)
- Pick quantization level (Q4, Q5, Q6)
- Specify number of instances
- Runtime menu for monitoring (refresh/quit)
Update main.py:
- Default mode now shows interactive menu
- Add --auto flag to skip menu and use recommended config
- Show comprehensive startup summary with hardware + config + usage
- Better integration with interactive module
- Remove redundant print functions (now in interactive.py)
Features:
- Clear screen for clean menu display
- Formatted headers and sections
- Menu validation and error handling
- Memory utilization percentage display
- Real-time worker status with health indicators
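
The memory-utilization line can be produced with psutil (assumed to be the underlying library here):

    import psutil

    mem = psutil.virtual_memory()
    print(f"Memory: {mem.used / 2**30:.1f} / {mem.total / 2**30:.1f} GiB "
          f"({mem.percent:.0f}% used)")
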
- Add MCP Server section explaining the --mcp flag
- Document the 5 MCP tools available to AI assistants
- Add --mcp to CLI Options section
- Explain benefits of MCP integration for automatic hardware queries
Add src/mcp_server.py:
- LocalSwarmMCPServer class implementing MCP protocol
- 5 MCP tools exposed:
- get_hardware_info: Check CPU, GPU, RAM
- get_swarm_status: Get worker status and model info
- generate_code: Generate with consensus voting
- list_available_models: Show all runnable models
- get_worker_details: Detailed worker statistics
- Integration with SwarmManager for code generation
- Stdio transport for AI assistant communication
Update requirements.txt:
- Add mcp>=1.0.0 dependency
Update main.py:
- Add --mcp flag to enable MCP server
- Run MCP server alongside HTTP API when enabled
- Both servers share the same SwarmManager instance
- Display MCP status in startup message
Local Swarm now supports both:
- HTTP API (for external clients, curl, opencode)
- MCP server (for tight AI assistant integration)
Usage:
python main.py # HTTP API only
python main.py --mcp # HTTP API + MCP server
MCP tools allow AI assistants to:
- Query hardware capabilities before suggesting models
- Check swarm health and worker status
- Generate code with automatic consensus voting
- List available models for the hardware
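
For orientation, a hedged sketch of exposing one of these tools through the MCP Python SDK's FastMCP helper; the actual mcp_server.py implements its own LocalSwarmMCPServer over stdio, so treat this as illustrative only, with placeholder values:

    from mcp.server.fastmcp import FastMCP

    mcp = FastMCP("local-swarm")

    @mcp.tool()
    def get_hardware_info() -> dict:
        """Report CPU, GPU and RAM so an assistant can pick a suitable model."""
        return {"cpu_cores": 8, "gpu": "Apple M2", "ram_gb": 16}   # placeholder values

    if __name__ == "__main__":
        mcp.run()   # stdio transport by default
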
Add src/api/models.py:
- Pydantic models for OpenAI API compatibility
- ChatCompletionRequest/Response models
- Streaming response models (SSE format)
- Model listing and health check models
Add src/api/routes.py:
- POST /v1/chat/completions endpoint
- GET /v1/models endpoint
- GET /health and /v1/health endpoints
- Support for streaming (text/event-stream) and regular responses
- Message formatting for chat prompts
- Error handling with proper HTTP status codes
Add src/api/server.py:
- FastAPI application with CORS middleware
- Lifespan context for startup/shutdown
- Integration with SwarmManager
- Uvicorn server configuration
Update src/api/__init__.py:
- Export API classes and functions
Update main.py:
- Integrate API server into default workflow
- Start API server on http://127.0.0.1:PORT
- Show API endpoints and opencode configuration
- Graceful shutdown on Ctrl+C
Update AGENTS.md:
- Add note about Python support in MCP server
Phase 4 complete: Local Swarm now exposes an OpenAI-compatible API at:
- POST /v1/chat/completions (with streaming support)
- GET /v1/models
- GET /health
Ready for use with opencode and other OpenAI-compatible clients.
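
A sketch of pointing any OpenAI-compatible client at the local endpoint; the port and model name are placeholders, use the values printed at startup:

    from openai import OpenAI

    client = OpenAI(base_url="http://127.0.0.1:8000/v1", api_key="local-swarm")
    resp = client.chat.completions.create(
        model="local-swarm",
        messages=[{"role": "user", "content": "Write a Python hello world."}],
    )
    print(resp.choices[0].message.content)
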
Add src/swarm/worker.py:
- SwarmWorker class managing single LLM instance
- WorkerStats for tracking performance metrics
- WorkerInfo dataclass for status reporting
- Async generation with streaming support
- Health monitoring and graceful shutdown
Add src/swarm/consensus.py:
- ConsensusEngine with multiple voting strategies
- Similarity voting using sentence-transformers embeddings
- Quality voting based on code structure and completeness
- Fastest voting for low-latency scenarios
- Majority voting as fallback
- Confidence scoring for all strategies
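
The similarity strategy reduced to a sketch: embed every candidate and pick the one that agrees most with the rest (the embedding model name is the usual lightweight default, an assumption here):

    import numpy as np
    from sentence_transformers import SentenceTransformer

    def similarity_vote(candidates: list[str]) -> str:
        model = SentenceTransformer("all-MiniLM-L6-v2")
        emb = model.encode(candidates, normalize_embeddings=True)
        sims = emb @ emb.T                    # cosine similarities
        centrality = sims.sum(axis=1)         # agreement of each answer with all others
        return candidates[int(np.argmax(centrality))]
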
Add src/swarm/manager.py:
- SwarmManager orchestrating multiple workers
- Parallel request distribution to all workers
- Integration with consensus engine
- Streaming support from fastest worker
- Status monitoring and health checks
- Graceful shutdown coordination
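
The fan-out itself is a few lines with asyncio.gather; a sketch with an assumed worker interface:

    import asyncio

    async def generate_all(workers, request):
        results = await asyncio.gather(
            *(worker.generate(request) for worker in workers),
            return_exceptions=True,           # one failed worker should not sink the batch
        )
        return [r for r in results if not isinstance(r, Exception)]
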
Update src/swarm/__init__.py:
- Export main classes for easy importing
Update main.py:
- Add --test mode for sample inference
- Integrate SwarmManager initialization
- Show inference results and consensus details
- Keep swarm running until interrupted
- Better error handling and status display
Phase 3 complete: Swarm can spawn N workers, generate responses,
and run consensus voting to select the best output.
Add src/backends/base.py:
- Abstract base class LLMBackend with async interface
- GenerationRequest/GenerationResponse dataclasses
- BackendError exception hierarchy
Add src/backends/llamacpp.py:
- llama.cpp backend for GGUF models
- Supports GPU offloading (CUDA/ROCm/Metal)
- Streaming and non-streaming generation
- Memory usage tracking
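
A sketch of the llama-cpp-python usage this backend builds on; the model path is a placeholder, and n_gpu_layers=-1 offloads all layers when a GPU build is installed:

    from llama_cpp import Llama

    llm = Llama(model_path="models/qwen2.5-coder-7b-q4_k_m.gguf",
                n_ctx=4096, n_gpu_layers=-1)
    for chunk in llm("Write a haiku about consensus.", max_tokens=64, stream=True):
        print(chunk["choices"][0]["text"], end="", flush=True)
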
Add src/backends/mlx.py:
- MLX backend for Apple Silicon
- Optimized for Metal performance
- Unified memory model support
Add src/backends/__init__.py:
- Backend factory with auto-detection
- Selects MLX for Apple Silicon, llama.cpp for others
- Auto-configures GPU layers
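
The auto-detection amounts to a platform check; the class names below match the modules above, but the constructor details are assumed:

    import platform

    def create_backend():
        if platform.system() == "Darwin" and platform.machine() == "arm64":
            from .mlx import MLXBackend            # assumed class name
            return MLXBackend()
        from .llamacpp import LlamaCppBackend      # assumed class name
        return LlamaCppBackend()
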
Add src/models/downloader.py:
- HuggingFace model downloader
- Progress bar display with tqdm
- Cache management in ~/.local_swarm/models
- Support for all registered models
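
The download path leans on huggingface_hub, which already provides progress bars and caching; the repo id below is an example:

    from pathlib import Path
    from huggingface_hub import snapshot_download

    cache_dir = Path.home() / ".local_swarm" / "models"
    local_path = snapshot_download(
        repo_id="Qwen/Qwen2.5-Coder-7B-Instruct-GGUF",   # example repo
        cache_dir=str(cache_dir),
    )
    print("model files in:", local_path)
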
Update main.py:
- Integrate model downloading (--download-only mode)
- Test backend loading after download
- Async support for backend operations
- Better error handling and reporting
Phase 2 complete: Models can be downloaded and backends can load them.
- Add src/hardware/detector.py with cross-platform GPU/CPU/RAM detection
- Add src/models/registry.py with model database (Qwen, DeepSeek, CodeLlama)
- Add src/models/selector.py with optimal model selection algorithm
- Update main.py to use new modules and display results
Features:
- Detects NVIDIA GPUs on Windows/Linux
- Detects Apple Silicon on macOS
- Calculates available memory based on platform (100% GPU VRAM, 50% unified RAM)
- Selects optimal model, quantization, and instance count
- Supports 2-8 instances with quality-based selection
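
The memory rule reduces to a small function; a sketch of the 100%/50% split described above:

    def available_memory_gb(is_apple_silicon: bool, vram_gb: float, ram_gb: float) -> float:
        if is_apple_silicon:
            return ram_gb * 0.5      # unified memory is shared with the OS and apps
        return vram_gb               # discrete GPU VRAM is used in full
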