Commit Graph

20 Commits

Author SHA1 Message Date
sleepy d30eedaa63 Fix opencode integration: streaming, response format, and tool handling
- Fix streaming to work even when tools are present (was forcing JSON mode)
- Fix response format: use empty list [] instead of null for tool_calls
- Add exclude_none config to ChatMessage model to match OpenAI format
- Remove tool instructions from prompt (they were confusing the 3B model)
- Fix tool call parsing to handle markdown code blocks properly
- Change default instances from 3 to 1 for faster debugging
- Allow 1 instance minimum in interactive config (was 2 on Mac)
- Add debug logging to track requests and responses

Fixes the infinite-loop issue where opencode would retry requests repeatedly. A sketch of the response-model changes follows this entry.
2026-02-24 03:44:46 +01:00
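
A minimal sketch of the response-format fix above, assuming Pydantic v2; the ToolCall shape is simplified and the exact field set is an assumption, not the project's actual model:

    from typing import Optional
    from pydantic import BaseModel

    class ToolCall(BaseModel):
        id: str
        type: str = "function"
        function: dict  # {"name": ..., "arguments": ...} in the OpenAI shape

    class ChatMessage(BaseModel):
        role: str
        content: Optional[str] = None
        # Empty list rather than null: clients such as opencode can choke on
        # "tool_calls": null in the response body.
        tool_calls: list[ToolCall] = []

    # Serializing with exclude_none drops absent optional fields, matching
    # the shape the OpenAI API returns.
    print(ChatMessage(role="assistant", content="hi").model_dump(exclude_none=True))
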
sleepy 2461f45ca8 fix: Remove slow HF API check from recommended config selection
- select_optimal_model was checking the HF API for available quantizations
- This caused the menu to hang or slow down when changing context
- Availability is now checked only when browsing or building a custom config
- The recommended config uses default quantizations (faster)
2026-02-23 23:54:57 +01:00
sleepy f2d0fddfa4 fix: Update selector to check available quantizations on Mac 2026-02-23 23:52:29 +01:00
sleepy cb8e05e627 feat: Check available quantizations on Mac before showing menu
- Update list_models() and build_models() to accept check_available parameter
- Update interactive.py to pass check_available=True on Mac
- Menu now filters out non-existent quantizations in real-time
- Users can only select quantizations that actually exist on HF

This prevents the issue where a user selects 4bit but the system
tries to download 5bit, because only certain quants exist for a given model.
2026-02-23 23:52:06 +01:00
sleepy 8028df7150 feat: Filter MLX quantizations to only show available ones
- Add filter_available_mlx_quants() to check HuggingFace for existing repos
- Update build_model_variants() to optionally check availability
- Menu will now only show quantizations that actually exist
- Prevents users from selecting non-existent quantizations

Note: this adds a small delay when building the model list, as it queries
the HF API, but it prevents download failures later (see the sketch below).
2026-02-23 23:50:12 +01:00
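
A hedged sketch of an availability filter like filter_available_mlx_quants(); the repo-naming convention (base repo plus a "-4bit" style suffix, as used by mlx-community) and the endpoint check are assumptions:

    import requests

    QUANTS = ("3bit", "4bit", "5bit", "6bit", "8bit")

    def filter_available_mlx_quants(base_repo: str, quants=QUANTS) -> list[str]:
        """Return only the quantizations whose HF repo actually exists."""
        available = []
        for q in quants:
            repo_id = f"{base_repo}-{q}"  # e.g. mlx-community/Qwen2.5-Coder-3B-Instruct-4bit
            r = requests.get(f"https://huggingface.co/api/models/{repo_id}", timeout=5)
            if r.status_code == 200:
                available.append(q)
        return available

The same list can double as the "available quantizations" suggestions mentioned in the validation commit below.
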
sleepy e323d43d2b feat: Validate MLX models exist before download and suggest alternatives
- Add _validate_mlx_model_exists() to check HuggingFace repos
- Show warning when selected quantization doesn't exist
- List available quantizations for the model
- Better error messages with suggestions

This prevents trying to download non-existent quantizations like 5bit
when only 3bit, 4bit, 6bit, 8bit are available.
2026-02-23 23:48:53 +01:00
sleepy a4049f1c35 fix: Correct memory calculations for Mac seed variation mode
On Mac (Apple Silicon) with seed variation:
- Total memory no longer multiplied by number of responses
- Memory is shared across all responses (same model, different seeds)
- list_available_configurations: Uses 3 responses, single memory calculation
- custom_configuration: Memory doesn't scale with response count
- show_startup_summary: Shows '(shared)' for RAM on Mac
- All memory displays are now accurate in seed variation mode
2026-02-23 23:42:47 +01:00
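
The arithmetic behind the fix, as an illustrative helper (the function name and the 4.2 GB figure are hypothetical):

    def total_memory_gb(model_gb: float, num_responses: int, seed_variation: bool) -> float:
        if seed_variation:
            # One resident model serves every response; only the seed differs.
            return model_gb
        # Otherwise each response needs its own loaded instance.
        return model_gb * num_responses

    print(total_memory_gb(4.2, 3, seed_variation=True))   # 4.2 (shared)
    print(total_memory_gb(4.2, 3, seed_variation=False))  # ~12.6 (per instance)
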
sleepy 411295acba feat: Add seed variation and reviewer modes for Apple Silicon
- Add use_seed_variation mode: Generate multiple responses from one model
  with different random seeds (saves memory on Apple Silicon)
- Add enable_reviewer mode: A critic worker validates consensus results
  and triggers retries if output looks suspicious
- Add generate_with_seed_variation() method for single-model multi-response
- Add generate_with_reviewer() method with feedback loop
- Auto-enable seed variation on Apple Silicon to save memory
- Configurable max_retries for reviewer mode
2026-02-23 23:33:43 +01:00
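
A minimal sketch of the seed-variation flow, assuming mlx_lm's load()/generate() and mx.random.seed(); the wrapper itself is hypothetical, and it assumes a sampling temperature above zero (greedy decoding would make seeds irrelevant):

    import mlx.core as mx
    from mlx_lm import load, generate

    def generate_with_seed_variation(model_path: str, prompt: str, num_responses: int = 3):
        model, tokenizer = load(model_path)  # load once, reuse for every seed
        responses = []
        for seed in range(num_responses):
            mx.random.seed(seed)             # a different seed gives a different sample
            responses.append(generate(model, tokenizer, prompt=prompt, max_tokens=1024))
        return responses
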
sleepy 93f5788d74 feat: Add tool calling support to API
- Add Tool, ToolCall, FunctionDefinition models
- Format prompts with tool descriptions for Qwen models
- Parse tool calls from model output (JSON and function call patterns)
- Auto-disable streaming when tools are present
- Return tool_calls in API response with proper finish_reason
- Support both simple function calls and JSON tool_calls format
2026-02-23 23:08:47 +01:00
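
A hedged sketch of the parsing step; the regex, the expected {"name", "arguments"} shape, and the id scheme are assumptions rather than the project's exact code:

    import json, re, uuid

    FENCE = re.compile(r"```(?:json)?\s*(\{.*?\})\s*```", re.DOTALL)

    def parse_tool_calls(text: str) -> list[dict]:
        """Pull JSON objects out of fenced blocks and wrap them as tool calls."""
        calls = []
        for block in FENCE.findall(text):
            try:
                obj = json.loads(block)
            except json.JSONDecodeError:
                continue  # not valid JSON, ignore the block
            if "name" in obj:
                calls.append({
                    "id": f"call_{uuid.uuid4().hex[:8]}",
                    "type": "function",
                    "function": {"name": obj["name"],
                                 "arguments": json.dumps(obj.get("arguments", {}))},
                })
        return calls
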
sleepy 472961cc23 feat: Apple Silicon MLX support, sequential workers, live status display, worker names
Major improvements for macOS/Apple Silicon:
- Add spawn-based multiprocessing for Metal GPU compatibility
- Implement sequential generation mode for multiple workers
- Each worker runs one at a time to avoid GPU conflicts
- All workers stay loaded in memory for fast switching

User Experience:
- 100 unique worker names (Alpha, Raven, Zeus, etc.)
- Live terminal status display with progress bars
- Show context usage and last output per worker
- Display IP addresses for network workers

Configuration:
- Default port changed to 17615 (from 8000)
- Context size options: 16K, 32K (default), 64K, 128K
- Offloading options: none, 20%, 50%
- Default max_tokens: 1024

MLX Quantization Support:
- Support 3bit, 4bit, 5bit, 6bit, 8bit MLX models
- Proper memory calculations for each quantization
- Sequential mode automatically enabled on Apple Silicon

Bug Fixes:
- Fix instance calculation (was always returning 1)
- Fix quantization bit detection for MLX models
- Fix config.json generation in model folders
- Preload MiniLM embedding model during init

Files Changed:
- main.py: Spawn method for macOS, port 17615
- src/backends/mlx.py: MLX generation with stop sequences
- src/models/selector.py: Fix instance calculation
- src/swarm/manager.py: Sequential generation mode
- src/swarm/consensus.py: Preload embedding model
- src/swarm/worker.py: Progress tracking per worker
- src/swarm/worker_names.py: 100 unique names (NEW)
- src/swarm/status_monitor.py: Live display (NEW)
- src/interactive.py: Context/offload menus
- src/models/registry.py: MLX quantization sizes
- src/api/server.py: Port 17615, live status
2026-02-23 22:57:38 +01:00
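
A sketch of the sequential mode, with all names assumed from the commit text: every worker stays resident, but generation is serialized behind one lock so concurrent requests cannot contend for Metal:

    import asyncio

    class SequentialSwarm:
        def __init__(self, workers):
            self.workers = workers            # all loaded up front, never unloaded
            self._gpu_lock = asyncio.Lock()   # only one generation at a time

        async def generate_all(self, prompt: str) -> list[str]:
            results = []
            for worker in self.workers:
                async with self._gpu_lock:    # serialize Metal access
                    results.append(await worker.generate(prompt))
            return results
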
sleepy 2f547fe101 Phase 6: Network Federation (#1)
* Update PLAN.md with new phases

- Add Phase 5: CLI & Interactive Interface
  - Interactive menu system with 3 options
  - Hardware display with detailed specs
  - Resource usage monitoring
  - Custom configuration wizard

- Add Phase 5.5: MCP Server
  - MCP protocol implementation
  - 5 MCP tools for AI assistants
  - Dual server mode (HTTP + MCP)

- Reorganize phase structure for clarity

* Phase 6: Implement network federation (WIP)

Add src/network/discovery.py:
- SwarmDiscovery class using mDNS/Bonjour
- PeerInfo dataclass for peer metadata
- Automatic peer discovery on local network
- Service advertising for this swarm
- Stale peer detection and cleanup

Add src/network/federation.py:
- FederationClient for HTTP communication with peers
- FederatedSwarm for managing cross-swarm consensus
- Two-phase voting: local consensus then peer voting
- Weighted voting strategy based on confidence
- Federation status monitoring
- Peer health checking

Add src/network/__init__.py:
- Export network classes

Update src/api/routes.py:
- POST /v1/federation/vote - Receive votes from peers
- GET /v1/federation/status - Get federation status
- GET /v1/federation/peers - List discovered peers

Update requirements.txt:
- Add zeroconf for mDNS discovery

Features:
- Auto-discovery of other Local Swarm instances
- Cross-swarm consensus voting
- Configurable minimum peer requirements
- Fallback to local-only if no peers available
- Peer health monitoring

TODO:
- Integrate federation into main.py
- Add --federation flag
- Test multi-machine setup
2026-02-23 18:05:27 +01:00
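
A minimal sketch of the mDNS side using the zeroconf package named in the commit; the _localswarm._tcp service type and the address are hypothetical:

    import socket
    from zeroconf import ServiceBrowser, ServiceInfo, ServiceListener, Zeroconf

    SERVICE = "_localswarm._tcp.local."  # hypothetical service type

    class SwarmListener(ServiceListener):
        def add_service(self, zc, type_, name):
            info = zc.get_service_info(type_, name)
            if info and info.addresses:
                addr = socket.inet_ntoa(info.addresses[0])
                print(f"peer found: {name} at {addr}:{info.port}")

        def remove_service(self, zc, type_, name):
            print(f"peer gone: {name}")

        def update_service(self, zc, type_, name):
            pass  # required by the listener interface

    zc = Zeroconf()
    # Advertise this swarm on the default port...
    info = ServiceInfo(SERVICE, f"my-swarm.{SERVICE}",
                       addresses=[socket.inet_aton("192.168.1.10")], port=17615)
    zc.register_service(info)
    # ...and browse for peers.
    browser = ServiceBrowser(zc, SERVICE, SwarmListener())
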
sleepy b9669e415d Update README with interactive menu documentation
- Add interactive mode section with screenshots
- Document the 3 menu options (recommended, browse, custom)
- Add startup summary section showing what info is displayed
- Add interactive features and MCP server to features list
- Document --auto flag to skip menu
- Add hardware/resource usage display examples
2026-02-23 17:44:44 +01:00
sleepy 1e183bd4cc Add interactive menu system and startup summary
Add src/interactive.py:
- Interactive model selection menu with 3 options:
  1. Recommended Configuration (auto-detect best)
  2. Browse All Configurations (see all feasible models)
  3. Custom Configuration (user-specified model + instances)
- Hardware info display with detailed specs
- Resource usage monitoring showing:
  - Swarm status, model, workers
  - Memory usage per worker
  - Worker statistics (requests, latency, tokens/sec)
- Custom configuration wizard:
  - Select from available models
  - Choose model size (3B, 7B, 14B, etc.)
  - Pick quantization level (Q4, Q5, Q6)
  - Specify number of instances
- Runtime menu for monitoring (refresh/quit)

Update main.py:
- Default mode now shows interactive menu
- Add --auto flag to skip menu and use recommended config
- Show comprehensive startup summary with hardware + config + usage
- Better integration with interactive module
- Removed redundant print functions (now in interactive.py)

Features:
- Clear screen for clean menu display
- Formatted headers and sections
- Menu validation and error handling
- Memory utilization percentage display
- Real-time worker status with health indicators
2026-02-23 17:43:38 +01:00
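
The menu dispatch, reduced to a sketch; the handlers here are stubs standing in for the real recommended/browse/custom flows:

    MENU = {
        "1": ("Recommended Configuration", lambda: {"mode": "recommended"}),
        "2": ("Browse All Configurations", lambda: {"mode": "browse"}),
        "3": ("Custom Configuration", lambda: {"mode": "custom"}),
    }

    def main_menu() -> dict:
        for key, (label, _) in MENU.items():
            print(f"  {key}. {label}")
        choice = input("Select [1-3]: ").strip()
        if choice not in MENU:
            print("Invalid choice, try again.")
            return main_menu()
        return MENU[choice][1]()
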
sleepy d3d2c50c71 Update README with MCP server documentation
- Add MCP Server section explaining the --mcp flag
- Document the 5 MCP tools available to AI assistants
- Add --mcp to CLI Options section
- Explain benefits of MCP integration for automatic hardware queries
2026-02-23 17:38:48 +01:00
sleepy cc0ee08b6f Phase 5: Add MCP server support alongside HTTP API
Add src/mcp_server.py:
- LocalSwarmMCPServer class implementing MCP protocol
- 5 MCP tools exposed:
  - get_hardware_info: Check CPU, GPU, RAM
  - get_swarm_status: Get worker status and model info
  - generate_code: Generate with consensus voting
  - list_available_models: Show all runnable models
  - get_worker_details: Detailed worker statistics
- Integration with SwarmManager for code generation
- Stdio transport for AI assistant communication

Update requirements.txt:
- Add mcp>=1.0.0 dependency

Update main.py:
- Add --mcp flag to enable MCP server
- Run MCP server alongside HTTP API when enabled
- Both servers share the same SwarmManager instance
- Display MCP status in startup message

Now Local Swarm supports both:
- HTTP API (for external clients, curl, opencode)
- MCP server (for tight AI assistant integration)

Usage:
  python main.py              # HTTP API only
  python main.py --mcp        # HTTP API + MCP server

MCP tools allow AI assistants to:
- Query hardware capabilities before suggesting models
- Check swarm health and worker status
- Generate code with automatic consensus voting
- List available models for the hardware
2026-02-23 17:37:55 +01:00
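
A pared-down sketch of the stdio server using the MCP Python SDK's low-level API; only one of the five tools is shown, and the hardware payload is a placeholder:

    import asyncio
    import mcp.types as types
    from mcp.server import Server
    from mcp.server.stdio import stdio_server

    server = Server("local-swarm")

    @server.list_tools()
    async def list_tools() -> list[types.Tool]:
        return [types.Tool(name="get_hardware_info",
                           description="Check CPU, GPU, RAM",
                           inputSchema={"type": "object", "properties": {}})]

    @server.call_tool()
    async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
        if name == "get_hardware_info":
            return [types.TextContent(type="text", text="cpu=..., gpu=..., ram=...")]
        raise ValueError(f"unknown tool: {name}")

    async def main():
        async with stdio_server() as (read, write):
            await server.run(read, write, server.create_initialization_options())

    asyncio.run(main())
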
sleepy 4367c79d83 Phase 4: Implement OpenAI-compatible API server
Add src/api/models.py:
- Pydantic models for OpenAI API compatibility
- ChatCompletionRequest/Response models
- Streaming response models (SSE format)
- Model listing and health check models

Add src/api/routes.py:
- POST /v1/chat/completions endpoint
- GET /v1/models endpoint
- GET /health and /v1/health endpoints
- Support for streaming (text/event-stream) and regular responses
- Message formatting for chat prompts
- Error handling with proper HTTP status codes

Add src/api/server.py:
- FastAPI application with CORS middleware
- Lifespan context for startup/shutdown
- Integration with SwarmManager
- Uvicorn server configuration

Update src/api/__init__.py:
- Export API classes and functions

Update main.py:
- Integrate API server into default workflow
- Start API server on http://127.0.0.1:PORT
- Show API endpoints and opencode configuration
- Graceful shutdown on Ctrl+C

Update AGENTS.md:
- Add note about Python support in MCP server

Phase 4 complete: Local Swarm now exposes OpenAI-compatible API at:
- POST /v1/chat/completions (with streaming support)
- GET /v1/models
- GET /health

Ready for use with opencode and other OpenAI-compatible clients.
2026-02-23 17:29:16 +01:00
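
The core route, reduced to a sketch; swarm.generate() is a hypothetical stand-in for the SwarmManager consensus pipeline, and streaming is omitted:

    import time, uuid
    from fastapi import FastAPI
    from pydantic import BaseModel

    app = FastAPI()

    class ChatCompletionRequest(BaseModel):
        model: str
        messages: list[dict]
        stream: bool = False

    @app.post("/v1/chat/completions")
    async def chat_completions(req: ChatCompletionRequest):
        prompt = "\n".join(f"{m['role']}: {m['content']}" for m in req.messages)
        answer = await swarm.generate(prompt)  # hypothetical SwarmManager hook
        return {
            "id": f"chatcmpl-{uuid.uuid4().hex[:12]}",
            "object": "chat.completion",
            "created": int(time.time()),
            "model": req.model,
            "choices": [{"index": 0,
                         "message": {"role": "assistant", "content": answer},
                         "finish_reason": "stop"}],
        }
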
sleepy 2ce3e138c1 Phase 3: Implement swarm management and consensus
Add src/swarm/worker.py:
- SwarmWorker class managing single LLM instance
- WorkerStats for tracking performance metrics
- WorkerInfo dataclass for status reporting
- Async generation with streaming support
- Health monitoring and graceful shutdown

Add src/swarm/consensus.py:
- ConsensusEngine with multiple voting strategies
- Similarity voting using sentence-transformers embeddings
- Quality voting based on code structure and completeness
- Fastest voting for low-latency scenarios
- Majority voting as fallback
- Confidence scoring for all strategies

Add src/swarm/manager.py:
- SwarmManager orchestrating multiple workers
- Parallel request distribution to all workers
- Integration with consensus engine
- Streaming support from fastest worker
- Status monitoring and health checks
- Graceful shutdown coordination

Update src/swarm/__init__.py:
- Export main classes for easy importing

Update main.py:
- Add --test mode for sample inference
- Integrate SwarmManager initialization
- Show inference results and consensus details
- Keep swarm running until interrupted
- Better error handling and status display

Phase 3 complete: Swarm can spawn N workers, generate responses,
and run consensus voting to select the best output.
2026-02-23 17:22:54 +01:00
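
A hedged sketch of similarity voting: the winner is the response the others agree with most, scored by mean pairwise cosine similarity. The checkpoint name is inferred from the "Preload MiniLM" commit above, and at least two responses are assumed:

    from sentence_transformers import SentenceTransformer, util

    def similarity_vote(responses: list[str]) -> tuple[str, float]:
        model = SentenceTransformer("all-MiniLM-L6-v2")   # assumed checkpoint
        emb = model.encode(responses, convert_to_tensor=True)
        sims = util.cos_sim(emb, emb)                     # pairwise cosine matrix
        n = len(responses)
        # Mean similarity to the other responses (drop the 1.0 self-similarity).
        scores = (sims.sum(dim=1) - 1.0) / (n - 1)
        best = int(scores.argmax())
        return responses[best], float(scores[best])       # winner + confidence
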
sleepy 6d7f323bd4 Phase 2: Implement backend integration and model downloading
Add src/backends/base.py:
- Abstract base class LLMBackend with async interface
- GenerationRequest/GenerationResponse dataclasses
- BackendError exception hierarchy

Add src/backends/llamacpp.py:
- llama.cpp backend for GGUF models
- Supports GPU offloading (CUDA/ROCm/Metal)
- Streaming and non-streaming generation
- Memory usage tracking

Add src/backends/mlx.py:
- MLX backend for Apple Silicon
- Optimized for Metal performance
- Unified memory model support

Add src/backends/__init__.py:
- Backend factory with auto-detection
- Selects MLX for Apple Silicon, llama.cpp for others
- Auto-configures GPU layers

Add src/models/downloader.py:
- HuggingFace model downloader
- Progress bar display with tqdm
- Cache management in ~/.local_swarm/models
- Support for all registered models

Update main.py:
- Integrate model downloading (--download-only mode)
- Test backend loading after download
- Async support for backend operations
- Better error handling and reporting

Phase 2 complete: Models can be downloaded and backends can load them.
2026-02-23 17:15:37 +01:00
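
The interface, reduced to a sketch; fields beyond prompt/max_tokens/temperature and the exact method set are guesses at the commit's intent:

    from abc import ABC, abstractmethod
    from dataclasses import dataclass

    @dataclass
    class GenerationRequest:
        prompt: str
        max_tokens: int = 1024
        temperature: float = 0.7

    @dataclass
    class GenerationResponse:
        text: str
        tokens_generated: int

    class LLMBackend(ABC):
        """Shared async interface that llamacpp.py and mlx.py both implement."""

        @abstractmethod
        async def load(self, model_path: str) -> None: ...

        @abstractmethod
        async def generate(self, request: GenerationRequest) -> GenerationResponse: ...
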
sleepy 1940b40be5 Update documentation: Add network federation, extended GPU support, and Android
PLAN.md updates:
- Add Phase 6: Local Network Federation with mDNS discovery
- Add Phase 7: Extended GPU support (AMD, Intel, Qualcomm)
- Update architecture diagram with network modules
- Add federation architecture diagram
- Update test coverage for all platforms

README.md updates:
- Add network federation features and configuration
- Add hardware support for AMD, Intel, Qualcomm GPUs
- Add Android/Termux installation instructions
- Update hardware detection section
- Update supported models table with more hardware examples
- Add federated swarm architecture diagram
- Add troubleshooting for AMD, Intel, Android
- Update acknowledgments with new dependencies

New todos added for:
- Network federation implementation
- AMD GPU support (ROCm)
- Intel GPU support (OneAPI)
- Android/Termux support
2026-02-23 17:05:59 +01:00
sleepy 0e08a2d66a Phase 1: Implement hardware detection and model selection
- Add src/hardware/detector.py with cross-platform GPU/CPU/RAM detection
- Add src/models/registry.py with model database (Qwen, DeepSeek, CodeLlama)
- Add src/models/selector.py with optimal model selection algorithm
- Update main.py to use new modules and display results

Features:
- Detects NVIDIA GPUs on Windows/Linux
- Detects Apple Silicon on macOS
- Calculates available memory based on platform (100% GPU VRAM, 50% unified RAM)
- Selects optimal model, quantization, and instance count
- Supports 2-8 instances with quality-based selection
2026-02-23 16:56:07 +01:00
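
The platform memory rule from the commit, as an illustrative helper; the CPU-only fallback is an assumption:

    def usable_memory_gb(platform: str, vram_gb: float = 0.0, ram_gb: float = 0.0) -> float:
        if platform == "apple_silicon":
            return ram_gb * 0.5   # unified memory is shared with the OS
        if platform == "nvidia":
            return vram_gb        # dedicated VRAM is budgeted in full
        return ram_gb * 0.5       # conservative CPU-only fallback (assumption)

    print(usable_memory_gb("apple_silicon", ram_gb=32))  # 16.0
    print(usable_memory_gb("nvidia", vram_gb=24))        # 24.0
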