- select_optimal_model was checking the HF API for available quantizations
- This caused the menu to hang or respond slowly when changing context
- Now only checks availability when browsing or custom config
- Recommended config uses default quantizations (faster)
- Update list_models() and build_models() to accept check_available parameter
- Update interactive.py to pass check_available=True on Mac
- Menu now filters out non-existent quantizations in real-time
- Users can only select quantizations that actually exist on HF
This prevents the issue where a user selects 4bit but the system
tries to download 5bit, since only certain quants exist for each model.
- Add filter_available_mlx_quants() to check HuggingFace for existing repos
- Update build_model_variants() to optionally check availability
- Menu will now only show quantizations that actually exist
- Prevents users from selecting non-existent quantizations
Note: this adds a small delay when building the model list, since it
queries the HF API, but it prevents download failures later.
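The availability filter described above can be sketched as follows (the `{base_repo}-{quant}` repo naming and the injected `repo_exists` callable are assumptions of this sketch; the real code queries the HuggingFace API directly):

```python
from typing import Callable, List

def filter_available_mlx_quants(
    base_repo: str,
    quants: List[str],
    repo_exists: Callable[[str], bool],
) -> List[str]:
    """Keep only quantizations whose MLX repo actually exists.

    repo_exists is injected so the check can be stubbed offline;
    in production it could wrap huggingface_hub's repo lookup.
    """
    return [q for q in quants if repo_exists(f"{base_repo}-{q}")]
```

Injecting the existence check keeps the filtering logic testable without network access.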
- Add _validate_mlx_model_exists() to check HuggingFace repos
- Show warning when selected quantization doesn't exist
- List available quantizations for the model
- Better error messages with suggestions
This prevents trying to download non-existent quantizations like 5bit
when only 3bit, 4bit, 6bit, 8bit are available.
Updated _try_model_with_context and _try_smallest_variant_with_context:
- On Mac (use_mlx=True): Returns 3 responses by default
- On other platforms: Still calculates based on VRAM
- Memory calculation fixed for Mac (doesn't multiply by response count)
Fixes issue where recommended config showed 'Responses: 1' on Mac
On Mac (Apple Silicon) with seed variation:
- Total memory no longer multiplied by number of responses
- Memory is shared across all responses (same model, different seeds)
- list_available_configurations: Uses 3 responses, single memory calculation
- custom_configuration: Memory doesn't scale with response count
- show_startup_summary: Shows '(shared)' for RAM on Mac
- All memory displays now accurate for seed variation mode
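The fixed memory rule boils down to something like this (function name and signature are illustrative, not from the source):

```python
def estimate_total_memory_gb(model_size_gb: float, n_responses: int,
                             use_seed_variation: bool) -> float:
    """With seed variation (one model, many seeds) the weights are
    shared, so total memory does not scale with the response count;
    otherwise each instance loads its own copy of the model."""
    if use_seed_variation:
        return model_size_gb
    return model_size_gb * n_responses
```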
- On Apple Silicon, UI terminology changed from 'instances' to 'responses'
- Mac default: 3 responses (configurable 2-5)
- Non-Mac: Still uses memory-based calculation
- Added explanation that seed variation keeps memory constant
- Menu and prompts updated to show appropriate terminology
- Add use_seed_variation mode: Generate multiple responses from one model
with different random seeds (saves memory on Apple Silicon)
- Add enable_reviewer mode: A critic worker validates consensus results
and triggers retries if output looks suspicious
- Add generate_with_seed_variation() method for single-model multi-response
- Add generate_with_reviewer() method with feedback loop
- Auto-enable seed variation on Apple Silicon to save memory
- Configurable max_retries for reviewer mode
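generate_with_seed_variation() presumably does something along these lines (a minimal sketch; the backend call's `generate(prompt, seed)` signature is an assumption):

```python
import asyncio
from typing import Awaitable, Callable, List

async def generate_with_seed_variation(
    generate: Callable[[str, int], Awaitable[str]],
    prompt: str,
    n_responses: int = 3,
    base_seed: int = 42,
) -> List[str]:
    """Run one loaded model n_responses times with different seeds.

    Because the same model instance serves every call, memory stays
    constant while still producing diverse responses for consensus."""
    tasks = [generate(prompt, base_seed + i) for i in range(n_responses)]
    return list(await asyncio.gather(*tasks))
```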
- Add Tool, ToolCall, FunctionDefinition models
- Format prompts with tool descriptions for Qwen models
- Parse tool calls from model output (JSON and function call patterns)
- Auto-disable streaming when tools are present
- Return tool_calls in API response with proper finish_reason
- Support both simple function calls and JSON tool_calls format
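Parsing the JSON tool_calls format might look roughly like this (a sketch handling only the JSON-object pattern; the real parser also matches bare function-call syntax):

```python
import json
from typing import List

def parse_tool_calls(text: str) -> List[dict]:
    """Scan model output for JSON objects shaped like
    {"name": ..., "arguments": {...}} and collect them as tool calls."""
    decoder = json.JSONDecoder()
    calls = []
    i = 0
    while (i := text.find("{", i)) != -1:
        try:
            # raw_decode parses one balanced JSON value starting at i
            obj, end = decoder.raw_decode(text, i)
        except json.JSONDecodeError:
            i += 1  # not valid JSON here; try the next brace
            continue
        if isinstance(obj, dict) and "name" in obj:
            calls.append({"name": obj["name"],
                          "arguments": obj.get("arguments", {})})
        i = end
    return calls
```

Using `raw_decode` instead of a regex keeps nested argument objects intact.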
Document the context window discussion and design decisions:
- Industry approaches (MoE, Ensemble, Pipeline, Speculative)
- Memory offloading options and trade-offs
- Why KV cache can't be shared between workers
- Three architectural options for 30K-60K+ context
- Current implementation status
- Hardware-specific recommendations
Provides reference for future enhancements and helps users
understand memory constraints in swarm architectures.
The .gitignore had 'models/', which excluded both:
- the models/ cache directory at root (intended)
- the src/models/ module directory (NOT intended)
Changed it to '/models/' so only the root-level models/ directory is
excluded while src/models/ can be tracked.
This fixes the 'No module named models' error on fresh clones.
Add more robust path resolution for Windows:
- Use Path.resolve() to get absolute path
- Also add parent directory to sys.path
- Fixes 'No module named models' error on Windows
Users can now run:
    python main.py --test
or use the module approach:
    python -m local_swarm --test
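The path fix can be illustrated with a small helper (`add_module_paths` is a hypothetical name; in main.py the same logic would run at import time as `add_module_paths(__file__, sys.path)`):

```python
from pathlib import Path
from typing import List

def add_module_paths(file_path: str, search_path: List[str]) -> List[str]:
    """Prepend the script's directory and its parent (made absolute via
    Path.resolve()) so 'import models' works regardless of the current
    working directory, including on Windows."""
    here = Path(file_path).resolve()
    for p in (here.parent, here.parent.parent):
        if str(p) not in search_path:
            search_path.insert(0, str(p))
    return search_path
```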
Update Phase 8.3 Documentation to mark as COMPLETED:
- Document all sections added to docs/GUIDE.md
- Update README.md with documentation links
Documentation now includes:
- Quick Start Guide for all platforms
- Opencode configuration examples
- API reference with examples
- Comprehensive troubleshooting
- Performance tuning guide
- Advanced configuration options
Add new menu option [t] Tips & Help:
- Model Recommendations: Ranked list of best coding models
- Qwen 2.5 Coder (best overall)
- DeepSeek Coder (great alternative)
- CodeLlama (solid choice)
- Size recommendations (1-3B, 7B, 13B+)
- Quantization Guide: Simple explanation of Q4/Q5/Q6
- What quantization is
- Trade-offs between levels
- File size comparison
- When to use each level
- Quick reference table
- Instance Count Tips: Research-based recommendations
- Minimum 2 instances (required for consensus)
- Sweet spot: 3-5 instances (85-90% of benefit)
- Maximum 8 instances (diminishing returns)
- Memory calculation examples
- Research note on consensus effectiveness
- Hardware Optimization: Tips specific to user's setup
- Apple Silicon (MLX backend tips)
- Discrete GPU (CUDA/ROCm optimization)
- CPU-only (practical limitations)
- General speed vs quality trade-offs
- Memory management best practices
All tips are shown in interactive format with clear sections,
practical advice, and hardware-specific recommendations based on
detected system specs.
* Add exit menu option
Add [q] Quit option to interactive menu:
- Allows user to exit without starting the swarm
- Shows 'Exiting...' message
- Returns None to gracefully exit main.py
* Phase 6: Implement network federation (WIP)
Add src/network/discovery.py:
- SwarmDiscovery class using mDNS/Bonjour
- PeerInfo dataclass for peer metadata
- Automatic peer discovery on local network
- Service advertising for this swarm
- Stale peer detection and cleanup
Add src/network/federation.py:
- FederationClient for HTTP communication with peers
- FederatedSwarm for managing cross-swarm consensus
- Two-phase voting: local consensus then peer voting
- Weighted voting strategy based on confidence
- Federation status monitoring
- Peer health checking
Add src/network/__init__.py:
- Export network classes
Update src/api/routes.py:
- POST /v1/federation/vote - Receive votes from peers
- GET /v1/federation/status - Get federation status
- GET /v1/federation/peers - List discovered peers
Update requirements.txt:
- Add zeroconf for mDNS discovery
Features:
- Auto-discovery of other Local Swarm instances
- Cross-swarm consensus voting
- Configurable minimum peer requirements
- Fallback to local-only if no peers available
- Peer health monitoring
TODO:
- Integrate federation into main.py
- Add --federation flag
- Test multi-machine setup
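The weighted voting strategy can be sketched as follows (illustrative only; the real FederatedSwarm also enforces minimum-peer requirements and falls back to local-only consensus):

```python
from collections import defaultdict
from typing import List, Tuple

def weighted_vote(votes: List[Tuple[str, float]]) -> str:
    """Pick the answer with the highest summed confidence across all
    voters (the local swarm and each peer contribute one vote)."""
    totals = defaultdict(float)
    for answer, confidence in votes:
        totals[answer] += confidence
    return max(totals, key=totals.get)
```

Summing confidences lets two moderately confident peers outvote one highly confident one, which is one reasonable reading of "weighted voting based on confidence".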
* Update PLAN.md with new phases
- Add Phase 5: CLI & Interactive Interface
- Interactive menu system with 3 options
- Hardware display with detailed specs
- Resource usage monitoring
- Custom configuration wizard
- Add Phase 5.5: MCP Server
- MCP protocol implementation
- 5 MCP tools for AI assistants
- Dual server mode (HTTP + MCP)
- Reorganize phase structure for clarity
Fix duplicate instances bug:
- Remove 'instances' from label in list_available_configurations()
- Now shows correctly as 'Model Size (quant)' with 'X instances' in description
Add more models to registry:
- Llama 3.2 (3B, 1B)
- Phi-4 (4B)
- Gemma 2 (2B, 4B, 9B)
- StarCoder2 (3B, 7B, 15B)
- Updated HF repo mappings and filename patterns
Add model update mechanism (src/models/updater.py):
- ModelUpdater class for querying HuggingFace Hub
- Queries trending GGUF models tagged with 'code'
- Filters out already-known models
- Estimates VRAM from model name
- 30-minute rate limiting between checks
- Saves custom models to ~/.local_swarm/custom_models.json
- Manual check only (no auto-update to avoid overloading HF)
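The 30-minute rate limit could be implemented along these lines (the class name is illustrative, not the actual ModelUpdater API):

```python
import time

class RateLimitedCheck:
    """Allow an expensive query (e.g. a HuggingFace Hub search) at most
    once per interval; repeated calls inside the window are refused."""

    def __init__(self, interval_s: float = 30 * 60):
        self.interval_s = interval_s
        self._last = 0.0

    def allowed(self, now: float = None) -> bool:
        # `now` is injectable for testing; defaults to a monotonic clock
        now = time.monotonic() if now is None else now
        if now - self._last >= self.interval_s:
            self._last = now
            return True
        return False
```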
Add menu option '4 - Check for New Models':
- Queries HF for trending models (respects rate limits)
- Displays model info (name, downloads, likes, est. VRAM)
- Allows adding models to custom registry
- Returns to the model selection menu afterwards
About etcd:
- Not needed for home networks
- mDNS (Bonjour) is simpler and requires no central server
- Perfect for 2-5 machine setups
- Zero configuration, auto-discovery
Changes to interactive.py:
- Added option 4 to main menu
- Added check_for_new_models_menu() function
- Displays trending models with metadata
- Allows manual addition to custom registry
- Add interactive mode section with screenshots
- Document the 3 menu options (recommended, browse, custom)
- Add startup summary section showing what info is displayed
- Add interactive features and MCP server to features list
- Document --auto flag to skip menu
- Add hardware/resource usage display examples
Add src/interactive.py:
- Interactive model selection menu with 3 options:
1. Recommended Configuration (auto-detect best)
2. Browse All Configurations (see all feasible models)
3. Custom Configuration (user-specified model + instances)
- Hardware info display with detailed specs
- Resource usage monitoring showing:
- Swarm status, model, workers
- Memory usage per worker
- Worker statistics (requests, latency, tokens/sec)
- Custom configuration wizard:
- Select from available models
- Choose model size (3B, 7B, 14B, etc.)
- Pick quantization level (Q4, Q5, Q6)
- Specify number of instances
- Runtime menu for monitoring (refresh/quit)
Update main.py:
- Default mode now shows interactive menu
- Add --auto flag to skip menu and use recommended config
- Show comprehensive startup summary with hardware + config + usage
- Better integration with interactive module
- Removed redundant print functions (now in interactive.py)
Features:
- Clear screen for clean menu display
- Formatted headers and sections
- Menu validation and error handling
- Memory utilization percentage display
- Real-time worker status with health indicators
- Add MCP Server section explaining the --mcp flag
- Document the 5 MCP tools available to AI assistants
- Add --mcp to CLI Options section
- Explain benefits of MCP integration for automatic hardware queries
Add src/mcp_server.py:
- LocalSwarmMCPServer class implementing MCP protocol
- 5 MCP tools exposed:
- get_hardware_info: Check CPU, GPU, RAM
- get_swarm_status: Get worker status and model info
- generate_code: Generate with consensus voting
- list_available_models: Show all runnable models
- get_worker_details: Detailed worker statistics
- Integration with SwarmManager for code generation
- Stdio transport for AI assistant communication
Update requirements.txt:
- Add mcp>=1.0.0 dependency
Update main.py:
- Add --mcp flag to enable MCP server
- Run MCP server alongside HTTP API when enabled
- Both servers share the same SwarmManager instance
- Display MCP status in startup message
Now Local Swarm supports both:
- HTTP API (for external clients, curl, opencode)
- MCP server (for tight AI assistant integration)
Usage:
python main.py # HTTP API only
python main.py --mcp # HTTP API + MCP server
MCP tools allow AI assistants to:
- Query hardware capabilities before suggesting models
- Check swarm health and worker status
- Generate code with automatic consensus voting
- List available models for the hardware
Add src/api/models.py:
- Pydantic models for OpenAI API compatibility
- ChatCompletionRequest/Response models
- Streaming response models (SSE format)
- Model listing and health check models
Add src/api/routes.py:
- POST /v1/chat/completions endpoint
- GET /v1/models endpoint
- GET /health and /v1/health endpoints
- Support for streaming (text/event-stream) and regular responses
- Message formatting for chat prompts
- Error handling with proper HTTP status codes
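The message formatting for chat prompts might look like this minimal sketch (the role-tag layout is an assumption; real formatting is model-specific, e.g. ChatML for Qwen):

```python
from typing import Dict, List

def format_chat_prompt(messages: List[Dict[str, str]]) -> str:
    """Flatten OpenAI-style chat messages into one prompt string,
    ending with an assistant tag so the model continues from there."""
    parts = [f"{m['role']}: {m['content']}" for m in messages]
    parts.append("assistant:")
    return "\n".join(parts)
```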
Add src/api/server.py:
- FastAPI application with CORS middleware
- Lifespan context for startup/shutdown
- Integration with SwarmManager
- Uvicorn server configuration
Update src/api/__init__.py:
- Export API classes and functions
Update main.py:
- Integrate API server into default workflow
- Start API server on http://127.0.0.1:PORT
- Show API endpoints and opencode configuration
- Graceful shutdown on Ctrl+C
Update AGENTS.md:
- Add note about Python support in MCP server
Phase 4 complete: Local Swarm now exposes an OpenAI-compatible API at:
- POST /v1/chat/completions (with streaming support)
- GET /v1/models
- GET /health
Ready for use with opencode and other OpenAI-compatible clients.
Add src/swarm/worker.py:
- SwarmWorker class managing single LLM instance
- WorkerStats for tracking performance metrics
- WorkerInfo dataclass for status reporting
- Async generation with streaming support
- Health monitoring and graceful shutdown
Add src/swarm/consensus.py:
- ConsensusEngine with multiple voting strategies
- Similarity voting using sentence-transformers embeddings
- Quality voting based on code structure and completeness
- Fastest voting for low-latency scenarios
- Majority voting as fallback
- Confidence scoring for all strategies
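Similarity voting picks the response closest to all others; here is a dependency-free sketch using token Jaccard overlap in place of the sentence-transformers embeddings the engine actually uses:

```python
from typing import List

def _jaccard(a: str, b: str) -> float:
    """Token-set overlap between two responses (0..1)."""
    ta, tb = set(a.split()), set(b.split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0

def similarity_vote(responses: List[str]) -> str:
    """Return the response most similar to all the others — the one
    closest to the 'centroid' of the candidate set."""
    def score(r: str) -> float:
        return sum(_jaccard(r, other) for other in responses if other is not r)
    return max(responses, key=score)
```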
Add src/swarm/manager.py:
- SwarmManager orchestrating multiple workers
- Parallel request distribution to all workers
- Integration with consensus engine
- Streaming support from fastest worker
- Status monitoring and health checks
- Graceful shutdown coordination
Update src/swarm/__init__.py:
- Export main classes for easy importing
Update main.py:
- Add --test mode for sample inference
- Integrate SwarmManager initialization
- Show inference results and consensus details
- Keep swarm running until interrupted
- Better error handling and status display
Phase 3 complete: Swarm can spawn N workers, generate responses,
and run consensus voting to select the best output.
Add src/backends/base.py:
- Abstract base class LLMBackend with async interface
- GenerationRequest/GenerationResponse dataclasses
- BackendError exception hierarchy
Add src/backends/llamacpp.py:
- llama.cpp backend for GGUF models
- Supports GPU offloading (CUDA/ROCm/Metal)
- Streaming and non-streaming generation
- Memory usage tracking
Add src/backends/mlx.py:
- MLX backend for Apple Silicon
- Optimized for Metal performance
- Unified memory model support
Add src/backends/__init__.py:
- Backend factory with auto-detection
- Selects MLX for Apple Silicon, llama.cpp for others
- Auto-configures GPU layers
Add src/models/downloader.py:
- HuggingFace model downloader
- Progress bar display with tqdm
- Cache management in ~/.local_swarm/models
- Support for all registered models
Update main.py:
- Integrate model downloading (--download-only mode)
- Test backend loading after download
- Async support for backend operations
- Better error handling and reporting
Phase 2 complete: Models can be downloaded and backends can load them.
- Add src/hardware/detector.py with cross-platform GPU/CPU/RAM detection
- Add src/models/registry.py with model database (Qwen, DeepSeek, CodeLlama)
- Add src/models/selector.py with optimal model selection algorithm
- Update main.py to use new modules and display results
Features:
- Detects NVIDIA GPUs on Windows/Linux
- Detects Apple Silicon on macOS
- Calculates available memory based on platform (100% GPU VRAM, 50% unified RAM)
- Selects optimal model, quantization, and instance count
- Supports 2-8 instances with quality-based selection
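The platform rule above (100% of discrete GPU VRAM, 50% of Apple Silicon unified RAM) reduces to a small helper (the name is illustrative):

```python
def usable_memory_gb(gpu_vram_gb: float, unified_ram_gb: float,
                     is_apple_silicon: bool) -> float:
    """Memory budget for model instances: all of a discrete GPU's VRAM,
    but only half of unified RAM, which is shared with the OS and apps."""
    if is_apple_silicon:
        return unified_ram_gb * 0.5
    return gpu_vram_gb
```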