Commit Graph

23 Commits

Author SHA1 Message Date
sleepy 580d1e5d17 feat: comprehensive tool system improvements and webfetch support (#3)
* feat: enhanced tool instructions for multi-step operations

- Add comprehensive examples for ls, find, grep, mkdir, npm init, etc.
- Explain multi-step workflow (explore → read → write)
- Tool system already supports chaining via conversation history
- Bash tool supports: ls, find, grep, cat, mkdir, cd, npm, etc.
- 30-second timeout on commands
- Output limited to 3000 chars for readability
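
A minimal sketch of a bash executor enforcing these limits (function name and defaults are assumptions, not the project's actual code):

  import subprocess

  MAX_OUTPUT_CHARS = 3000  # keep tool output readable in the chat context
  TIMEOUT_SECONDS = 30

  def run_bash(command: str, cwd: str = ".") -> str:
      """Run a shell command with a timeout and truncated output (illustrative)."""
      try:
          result = subprocess.run(
              command, shell=True, cwd=cwd,
              capture_output=True, text=True,
              timeout=TIMEOUT_SECONDS,
              stdin=subprocess.DEVNULL,  # never block waiting for interactive input
          )
          output = (result.stdout or "") + (result.stderr or "")
      except subprocess.TimeoutExpired:
          output = f"Command timed out after {TIMEOUT_SECONDS}s"
      if len(output) > MAX_OUTPUT_CHARS:
          output = output[:MAX_OUTPUT_CHARS] + "\n... (truncated)"
      return output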

* Cleanup: Consolidate documentation and tidy codebase

Documentation:
- Consolidate 6 markdown files into simplified README.md
- Remove redundant docs: TODO.md, NETWORK.md, REVIEW.md, PLAN.md, CONTEXT.md, GUIDE.md
- Add ARCHITECTURE.md with clean technical overview
- README now focuses on quick start and core concepts

Code verification:
- Verified blocking I/O properly wrapped in asyncio.to_thread()
- Confirmed locks initialized correctly in backends
- AMD VRAM detection uses proper regex (takes max value, not first match)
- All exception handling uses 'except Exception:' (not bare except)
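
For reference, the asyncio.to_thread() pattern being verified looks roughly like this (the file-reading example is illustrative):

  import asyncio

  async def read_file(path: str) -> str:
      def _read() -> str:
          with open(path, encoding="utf-8") as f:  # blocking I/O, kept off the event loop
              return f.read()
      return await asyncio.to_thread(_read)

  # Exception style being verified:
  #   good: except Exception:  (lets KeyboardInterrupt/SystemExit propagate)
  #   bad:  except:            (bare except swallows those too)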

Tool execution improvements (existing changes):
- Better working directory handling with project root detection
- Extended timeouts for package managers (300s)
- Multi-tool call parsing support
- Improved error handling and logging

Note: open concern - the ~30k-token system prompt is too large for 16-32k context windows

* docs: add development patterns analysis

Document circular development issues identified in commit history:
- Tool execution went back-and-forth 3+ times (server-side vs client-side)
- Tool instruction size swung from 40k tokens → 300 tokens → removed → enhanced again
- 8+ parsing fixes for same issues (no tests)
- 6 debug-only commits (production debugging)

Provides recommendations to prevent future cycles:
1. Pick one architecture and stick with it
2. Add unit tests before fixes
3. Token budget (<2000 for instructions)
4. One format only (remove alternative parsers)
5. Integration test script
6. Separate concerns into smaller modules
7. Design doc before code changes
8. CI/CD with automated testing

* docs: add comprehensive agent guidelines

AGENT_WORKER.md (600+ lines):
- Pre-flight checklist: token budget, test plan, design doc
- Coding rules: TDD, no debug code, architecture consistency
- Git workflow: branching strategy, commit rules, release process
- Testing requirements: unit (≥80%), integration structure
- Code quality: PEP 8, type hints, max 50 lines per function
- Architecture: no feature flags, separation of concerns
- Continuous learning: research requirements, documentation
- Forbidden patterns: bare except, production debugging, etc.

AGENT_REVIEW.md (400+ lines):
- Review philosophy: prevent circular development
- 6-phase review checklist: structure, quality, tokens, architecture, research, logic
- Report format with token impact analysis
- Severity levels: blocking vs warnings vs approved
- Common issues with examples (good vs bad)
- Review workflow: 30-35 min per PR
- Reports stored in reports/ folder (gitignored)

Also added:
- tests/test_tool_parsing.py - example test following guidelines
- Updated DEVELOPMENT_PATTERNS.md with recommendations

Reports folder in .gitignore for local review storage

* chore: gitignore review reports folder

* feat: fix tool execution and enhance instructions with accurate token counting

- Enhanced tool instructions (1041 tokens, within 2000 budget)
- Added tiktoken>=0.5.0 for accurate token counting
- Fixed subprocess hang by adding stdin=subprocess.DEVNULL
- Removed 9 DEBUG print statements from routes.py
- Added tests for instruction content and token budget verification
- All tests pass (11/11)

Resolves blockers from previous review:
- Token budget verified ✓
- Token documentation added ✓
- Debug code cleaned ✓
- Missing tests added ✓
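
A token-budget test of this kind can be as small as the following sketch (the cl100k_base encoding is an assumption; the loader name is borrowed from a later commit below):

  import tiktoken

  def count_tokens(text: str) -> int:
      enc = tiktoken.get_encoding("cl100k_base")
      return len(enc.encode(text))

  def test_instructions_within_budget():
      instructions = _load_tool_instructions()  # loader added in a later commit
      assert count_tokens(instructions) <= 2000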

* feat: implement comprehensive tool system with proper logging

Major improvements to tool instructions and execution:
- Enhanced tool instructions with 7-step task completion workflow
- Added markdown code block fallback parser for tool calls
- Fixed subprocess hang with stdin=subprocess.DEVNULL
- Fixed streaming path to return tool_calls (enabling multi-turn conversations)
- Added complete React project creation example with verification steps
- Token count: 1,743 tokens (within 2,000 limit)

Logging infrastructure:
- Created centralized logging configuration (src/utils/logging_config.py)
- Replaced 80+ print statements with logger.debug()
- Set log level to DEBUG for development
- All modules now use proper logging instead of print
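
A centralized configuration of this kind might look roughly like the following sketch (format string and function name are assumptions):

  import logging

  def setup_logging(level: int = logging.DEBUG) -> None:
      """Configure the root logger once at startup; modules then use getLogger."""
      logging.basicConfig(
          level=level,
          format="%(asctime)s %(name)s %(levelname)s: %(message)s",
      )

  # In each module, instead of print():
  #   logger = logging.getLogger(__name__)
  #   logger.debug("worker %s started", worker_id)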

Testing:
- Added 4 new tests for markdown parsing and instruction content
- All 13 tests passing
- Token budget verification test

Documentation:
- Added comprehensive design docs for all major changes
- Added test plans for verification
- Created helper scripts for logging migration

Files changed:
- main.py: Added logging setup
- src/api/routes.py: Tool instructions, streaming fixes, logging
- src/tools/executor.py: subprocess fix, logging
- src/utils/: New logging configuration module
- tests/test_tool_parsing.py: New tests
- docs/: Design decisions and test plans
- scripts/: Helper scripts for development

* refactor: simplify tool instructions to 109 tokens for 7B model

Reduced from 1,743 tokens to 109 tokens (94% reduction) to help
qwen2.5 7B 4bit model follow instructions better.

Changes:
- Removed complex workflow documentation
- Removed multi-turn conversation examples
- Removed lengthy anti-patterns
- Kept only essential format and rules
- Updated tests to match simplified content

Before: 1,743 tokens, 6,004 chars (87% of budget)
After: 109 tokens, 392 chars (5.5% of budget)

This should make it much easier for smaller models to:
1. Understand they must use tools
2. Follow the simple TOOL: format
3. Not get overwhelmed by instructions

* refactor: make tool instructions ultra-direct for 7B models

Further simplify instructions to prevent model from adding explanations.

Before: 109 tokens - model still added explanatory text
After: 86 tokens - ultra-direct commands

Key changes:
- Start with 'You MUST use tools. DO NOT explain.'
- 'OUTPUT THIS EXACT FORMAT - NOTHING ELSE'
- Removed all examples and pleasantries
- Added 'NEVER' rules in all caps
- 'ONLY output TOOL: lines'

The model was outputting:
'1. First, install... TOOL: bash ARGUMENTS: {...}'

Now should output just:
'TOOL: bash
ARGUMENTS: {...}'

This should force the 7B qwen model to stop explaining and just execute.

* refactor: move tool instructions to external config file

Moves hardcoded tool instructions from routes.py to external config file
for better maintainability and easier editing.

Changes:
- Created config/prompts/tool_instructions.txt
- Added _load_tool_instructions() function with caching
- Falls back to default if config file not found
- Updated tests to use the loader function
- Added proper error handling

Benefits:
- Easier to modify instructions without code changes
- Instructions can be edited by non-developers
- Cleaner separation of config vs code
- Supports hot-reloading (cached but easy to invalidate)

Token count: 86 tokens (loaded from file)
Location: config/prompts/tool_instructions.txt
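
A sketch of such a loader with caching and a fallback (the default text is abbreviated; variable names are assumptions):

  from pathlib import Path

  _DEFAULT_INSTRUCTIONS = "Use tools to execute commands. ..."  # abbreviated fallback
  _cache = None

  def _load_tool_instructions() -> str:
      """Load instructions from config, cache them, fall back to a default."""
      global _cache
      if _cache is None:
          try:
              _cache = Path("config/prompts/tool_instructions.txt").read_text(
                  encoding="utf-8").strip()
          except OSError:
              _cache = _DEFAULT_INSTRUCTIONS
      return _cache

  def invalidate_instructions_cache() -> None:
      """Drop the cache so the next call re-reads the file (hot reload)."""
      global _cache
      _cache = None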

* refactor: simplify tool instructions further and add debug logging

- Reduced instructions to bare minimum: 50 tokens
- Added debug logging to verify instructions are sent
- Removed all caps and aggressive language
- Made instructions more straightforward

Instructions now:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'

This should be easier for 7B models to follow while still
conveying the essential requirements.

* feat: improve tool parser to handle 7B model output variations

Enhanced parse_tool_calls() with multiple fallback strategies:

1. Standard TOOL:/ARGUMENTS: format (original)
2. Markdown code blocks (fenced with ```)
3. Numbered list items (1. npm install ...)
4. Standalone bash commands (npm, npx, mkdir, etc.)

Now handles messy output from small models like:
'1. Install: npm install -g create-react-app'
'2. Create: create-react-app hello-world'

Parses these into chained bash commands for execution.

Also simplified instructions to 50 tokens minimum:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'

This combination should make 7B models much more likely to
have their output successfully parsed and executed.
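
A condensed sketch of a multi-strategy parser along these lines (regexes and the command list are illustrative, not the project's exact ones):

  import json
  import re

  COMMANDS = ("npm", "npx", "mkdir", "cd", "ls", "cat", "echo", "git",
              "python", "pip", "node", "yarn", "create-react-app")
  _CMD_RE = re.compile(r"\b(?:%s)\b" % "|".join(COMMANDS))

  def parse_tool_calls(text: str) -> list:
      # Strategy 1: TOOL:/ARGUMENTS: format (naive single-level brace matching)
      calls = []
      for name, args in re.findall(r"TOOL:\s*(\w+)\s+ARGUMENTS:\s*(\{.*?\})",
                                   text, re.DOTALL):
          try:
              calls.append({"name": name, "arguments": json.loads(args)})
          except ValueError:
              pass  # malformed JSON; fall through to the other strategies
      if calls:
          return calls
      # Strategy 2: commands inside markdown code fences
      blocks = re.findall(r"```(?:bash|sh)?\s*\n(.*?)```", text, re.DOTALL)
      if blocks:
          lines = [ln for b in blocks for ln in b.splitlines()]
      else:
          # Strategies 3/4: numbered list items and standalone bash commands
          lines = [ln for ln in text.splitlines() if _CMD_RE.search(ln)]
      commands = []
      for ln in lines:
          ln = re.sub(r"^\s*(?:\d+\.\s*)?(?:[A-Za-z ]+:\s*)?", "", ln).strip()
          if ln and _CMD_RE.search(ln):
              commands.append(ln)
      if commands:
          # chain with && so everything runs sequentially in one bash call
          return [{"name": "bash", "arguments": {"command": " && ".join(commands)}}]
      return []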

* fix: improve command extraction for 7B model output

Parser now extracts bash commands from any line containing:
- npm, npx, mkdir, cd, ls, cat, echo, git, python, pip, node, yarn
- create-react-app (added for React projects)

Example: Extracts 'npm install -g create-react-app' from:
'1. Install: npm install -g create-react-app'

Chains multiple commands with && for sequential execution.

This should now successfully parse the numbered list output
from 7B models and execute the commands.

* feat: add bash tool description validation and improve 7B model parsing

Changes:
- Added _ensure_tool_arguments() function to inject 'description' field
- Updated tool_instructions.txt to require description for bash tool
- Improved 7B model command extraction with better regex patterns
- Added 'create-react-app' to command detection list
- Updated delta field type to Dict[str, Any] for streaming
- Added GGUF to MLX quantization mapping for registry.py
- Clarified agent responsibilities in AGENT_REVIEW.md and AGENT_WORKER.md

Fixes:
- Bash tool now validates required 'description' field
- 7B model output parsed more reliably (numbered lists)
- Multiple commands chained with && for sequential execution

Token count: 69 tokens (down from 86, -19.8%)

All tests pass: 13/13
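
The description-injection helper could be as simple as this sketch (the default value is an assumption):

  def _ensure_tool_arguments(call: dict) -> dict:
      """Inject required fields before validation (illustrative)."""
      if call.get("name") == "bash":
          args = call.setdefault("arguments", {})
          args.setdefault("description", args.get("command", "run shell command"))
      return call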

* feat: add webfetch tool support with URL extraction

Changes:
- Added webfetch to tool instructions config
- Added URL extraction pattern to parse_tool_calls()
- Parser now recognizes URLs and creates webfetch tool calls
- Updated token count: 89 tokens (+29% from 69)

The webfetch tool is available through the opencode environment.
System prompt adjustment enables model to use it for URL fetching.

Token budget: 89 tokens (4.45% of 2000 limit)
Tests pass: 13/13
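
A sketch of the URL-extraction fallback (the regex is illustrative):

  import re

  _URL_RE = re.compile(r"https?://[^\s\"'>)\]]+")

  def extract_webfetch_calls(text: str) -> list:
      """Turn bare URLs in model output into webfetch tool calls."""
      return [{"name": "webfetch", "arguments": {"url": url}}
              for url in _URL_RE.findall(text)]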
2026-02-24 22:35:05 +01:00
sleepy 9932e34385 feat: add tool execution logging and status display
- Tool server now logs when tools are executed:
  '🔧 TOOL SERVER: Executing read({...})'
  '🔧 TOOL SERVER: read completed (500 chars)'

- LLM server logs remote tool calls:
  '  🔧 Remote tool call: read({...})'
  '   Tool result received (500 chars)'

- Startup now shows tool server status:
  🔧 Tool Server: Remote
     URL: http://192.168.1.5:17616 (auto-detected)
     Mode: Tools executed remotely on tool host

  OR:

  🔧 Tool Server: Local
     Mode: Tools executed on this machine
2026-02-24 14:34:58 +01:00
sleepy cc66c550e4 feat: --tool-host auto-detects local IP when used without value
- --tool-host with no value: auto-detects local IP (e.g., http://192.168.1.5:17616)
- --tool-host with explicit URL: uses provided URL
- No --tool-host: local tool execution (default)

Example usage:
  python main.py --auto --tool-host              # Auto-detect local IP
  python main.py --auto --tool-host http://192.168.1.10:17616  # Explicit URL
  python main.py --auto                          # Local execution
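
The with-or-without-a-value behavior maps onto argparse's nargs="?" plus const, roughly as in this sketch (the "auto" sentinel is an assumption; get_local_ip() is the helper sketched in an earlier commit further down this log):

  import argparse

  parser = argparse.ArgumentParser()
  # --tool-host            -> "auto"  (sentinel: detect the local IP)
  # --tool-host http://... -> that URL
  # (flag absent)          -> None    (local tool execution)
  parser.add_argument("--tool-host", nargs="?", const="auto", default=None)

  args = parser.parse_args()
  if args.tool_host == "auto":
      args.tool_host = f"http://{get_local_ip()}:17616"  # helper sketched below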
2026-02-24 14:30:14 +01:00
sleepy b5bd154ba6 feat: add --tool-port argument for tool server (default: 17616)
- Tool server now runs on port 17616 by default (separate from main API on 17615)
- Add --tool-port argument to customize tool server port
- Update help text to reflect default port 17616
- Prevent port conflicts when running both services on same machine
2026-02-24 14:27:40 +01:00
sleepy 12eaac0d27 feat: implement distributed tool execution with tool host
- Add ToolExecutor class supporting both local and remote tool execution
- Add --tool-host argument to use remote tool execution server
- Add --tool-server argument to run dedicated tool execution server
- Add /v1/tools/execute endpoint for remote tool execution
- Workers can execute tools on centralized tool host
- Tools: read, write, bash with security restrictions

Architecture:
- Tool Host (--tool-server): Runs on one machine, executes all tools
- Workers (--tool-host): Send tool requests to tool host, get results
- Local mode (default): Execute tools locally as before
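
A sketch of the local/remote split (the HTTP client choice and the request/response schema are assumptions):

  import requests

  class ToolExecutor:
      """Execute tools locally, or forward them to a remote tool host."""

      def __init__(self, tool_host=None):
          self.tool_host = tool_host  # e.g. "http://192.168.1.5:17616", None = local

      def execute(self, name: str, arguments: dict) -> str:
          if self.tool_host:
              resp = requests.post(
                  f"{self.tool_host}/v1/tools/execute",
                  json={"name": name, "arguments": arguments},
                  timeout=300,  # generous: package managers can be slow
              )
              resp.raise_for_status()
              return resp.json()["result"]
          return self._execute_local(name, arguments)

      def _execute_local(self, name: str, arguments: dict) -> str:
          raise NotImplementedError  # dispatch to read/write/bash handlers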
2026-02-24 14:24:22 +01:00
sleepy 8a93e25b16 feat: add peer health check loop and improve tool instructions
- Add periodic health check every 10s to keep peer connections alive
- Remove stale peers after 30s of unreachability
- Improve tool use instructions with clearer examples
- Add 'CRITICAL: Do not explain what you will do' instruction
- Add concrete example of tool use format
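
The loop described here reduces to something like this sketch (the probe function and peer bookkeeping are assumptions):

  import asyncio
  import time

  HEALTH_INTERVAL = 10  # seconds between checks
  STALE_AFTER = 30      # drop peers unreachable this long

  async def health_check_loop(peers: dict):
      """peers maps address -> last-seen monotonic timestamp."""
      while True:
          now = time.monotonic()
          for addr in list(peers):
              if await ping_peer(addr):  # ping_peer(): hypothetical health probe
                  peers[addr] = now
              elif now - peers[addr] > STALE_AFTER:
                  del peers[addr]        # stale after 30s of unreachability
          await asyncio.sleep(HEALTH_INTERVAL)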
2026-02-24 11:46:23 +01:00
sleepy 6e06304b70 feat: wire up federation to use peer swarms for generation
- Modify /v1/chat/completions endpoint to check for federation
- If federation enabled with peers, use generate_with_federation()
- Otherwise fall back to local generation
- Add --peer example to help text

Now when federation is enabled and peers are discovered/manually added,
generation requests will be distributed across local and peer swarms,
with consensus voting to select the best response.
2026-02-24 05:12:54 +01:00
sleepy 857241135c feat: add --peer arg for manual peer configuration
Add --peer argument to manually specify peers when mDNS discovery
isn't working. Usage: --peer 192.168.178.192:17615
Can be used multiple times for multiple peers.
2026-02-24 05:09:44 +01:00
sleepy f8a146c6f1 fix: pass --host IP to discovery service for mDNS advertising
- Add advertise_ip parameter to SwarmDiscovery and create_discovery_service
- Use specified --host IP for mDNS advertising instead of auto-detect
- Add feedback when using specified vs auto-detected IP

This ensures both the API server and mDNS advertise the same IP.
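
With the zeroconf library this is roughly the following sketch (service type and name are assumptions; the addresses= line is the point, and get_local_ip() is sketched further down):

  import socket
  from zeroconf import ServiceInfo, Zeroconf

  def advertise(advertise_ip=None, port=17615):
      ip = advertise_ip or get_local_ip()  # honor --host, else auto-detect
      info = ServiceInfo(
          "_http._tcp.local.",               # service type assumed for illustration
          "local-swarm._http._tcp.local.",
          addresses=[socket.inet_aton(ip)],  # advertise the same IP the API binds to
          port=port,
      )
      Zeroconf().register_service(info)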
2026-02-24 04:48:46 +01:00
sleepy 05b1c153d4 fix: add --host arg and fix hardware attribute names
- Add --host argument to specify bind IP directly
- Fix HardwareProfile attribute names (cpu_cores, ram_gb)
- Update help text with new --host option

Allows manual override of IP detection for multi-adapter setups.
2026-02-24 04:32:36 +01:00
sleepy 6003a4658c fix: restrict IP detection to 192.168.x.x only
Remove support for 10.x.x.x, 172.x.x.x, and 100.x.x.x private ranges
to force use of the 192.168.x.x network adapter.
2026-02-24 04:27:19 +01:00
sleepy 57132a45b2 fix: improve private IP detection for federation
- Support all RFC 1918 private IP ranges (10.x, 172.16-31.x, 192.168.x)
- Add debug output to show detected IP and why it was rejected
- Fix API URL display to show actual bound host
- Use consistent IP detection between main.py and discovery.py
2026-02-24 04:25:15 +01:00
sleepy 36630bfb44 feat: integrate federation system for distributed consensus
- Add --federation CLI flag to enable network federation
- Integrate SwarmDiscovery service for mDNS peer discovery
- Wire up FederatedSwarm wrapper in main application flow
- Add GET /v1/federation/health endpoint
- Display discovered peers in startup banner
- Proper cleanup of federation resources on shutdown

Enables multiple Local Swarm instances to collaborate on the same
network for distributed consensus and load balancing.
2026-02-24 04:20:21 +01:00
sleepy 47f6c8e7d9 Add local network IP binding for federation support
- Add get_local_ip() function to detect local network IP (192.x.x.x or 100.x.x.x)
- Bind server to specific local IP instead of 0.0.0.0 for security
- Only expose to local network, not internet
- Fall back to localhost if not on private network

This enables federation between multiple Macs on the same local network
while keeping the server secure from external access.
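
The detection described here is commonly done with a connected UDP socket, roughly as in this sketch (fallback policy as described above):

  import socket

  def get_local_ip() -> str:
      """Find the outbound local IP; a UDP connect() sends no packets."""
      s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
      try:
          s.connect(("8.8.8.8", 80))
          ip = s.getsockname()[0]
      finally:
          s.close()
      if ip.startswith("192.") or ip.startswith("100."):
          return ip
      return "127.0.0.1"  # not on the private network: fall back to localhost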
2026-02-24 04:07:27 +01:00
sleepy 472961cc23 feat: Apple Silicon MLX support, sequential workers, live status display, worker names
Major improvements for macOS/Apple Silicon:
- Add spawn-based multiprocessing for Metal GPU compatibility
- Implement sequential generation mode for multiple workers
- Each worker runs one-at-a-time to avoid GPU conflicts
- All workers stay loaded in memory for fast switching

User Experience:
- 100 unique worker names (Alpha, Raven, Zeus, etc.)
- Live terminal status display with progress bars
- Show context usage and last output per worker
- Display IP addresses for network workers

Configuration:
- Default port changed to 17615 (from 8000)
- Context size options: 16K, 32K (default), 64K, 128K
- Offloading options: none, 20%, 50%
- Default max_tokens: 1024

MLX Quantization Support:
- Support 3bit, 4bit, 5bit, 6bit, 8bit MLX models
- Proper memory calculations for each quantization
- Sequential mode automatically enabled on Apple Silicon

Bug Fixes:
- Fix instance calculation (was always returning 1)
- Fix quantization bit detection for MLX models
- Fix config.json generation in model folders
- Preload MiniLM embedding model during init

Files Changed:
- main.py: Spawn method for macOS, port 17615
- src/backends/mlx.py: MLX generation with stop sequences
- src/models/selector.py: Fix instance calculation
- src/swarm/manager.py: Sequential generation mode
- src/swarm/consensus.py: Preload embedding model
- src/swarm/worker.py: Progress tracking per worker
- src/swarm/worker_names.py: 100 unique names (NEW)
- src/swarm/status_monitor.py: Live display (NEW)
- src/interactive.py: Context/offload menus
- src/models/registry.py: MLX quantization sizes
- src/api/server.py: Port 17615, live status
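
The spawn and sequential-generation pieces reduce to roughly this sketch (lock placement and names assumed):

  import asyncio
  import multiprocessing as mp

  def configure_start_method():
      # fork() leaves Metal/MLX GPU state unusable on macOS; spawn starts clean
      if mp.get_start_method(allow_none=True) != "spawn":
          mp.set_start_method("spawn", force=True)

  _gpu_lock = asyncio.Lock()

  async def generate_sequential(workers, request):
      """All workers stay resident in memory; only one touches the GPU at a time."""
      results = []
      for worker in workers:
          async with _gpu_lock:
              results.append(await worker.generate(request))
      return results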
2026-02-23 22:57:38 +01:00
sleepy 6e10438914 Fix Windows import path issue
Add more robust path resolution for Windows:
- Use Path.resolve() to get absolute path
- Also add parent directory to sys.path
- Fixes 'No module named models' error on Windows

Users can now run:
  python main.py --test

Or use the module approach:
  python -m local_swarm --test
2026-02-23 19:24:33 +01:00
sleepy 1e183bd4cc Add interactive menu system and startup summary
Add src/interactive.py:
- Interactive model selection menu with 3 options:
  1. Recommended Configuration (auto-detect best)
  2. Browse All Configurations (see all feasible models)
  3. Custom Configuration (user-specified model + instances)
- Hardware info display with detailed specs
- Resource usage monitoring showing:
  - Swarm status, model, workers
  - Memory usage per worker
  - Worker statistics (requests, latency, tokens/sec)
- Custom configuration wizard:
  - Select from available models
  - Choose model size (3B, 7B, 14B, etc.)
  - Pick quantization level (Q4, Q5, Q6)
  - Specify number of instances
- Runtime menu for monitoring (refresh/quit)

Update main.py:
- Default mode now shows interactive menu
- Add --auto flag to skip menu and use recommended config
- Show comprehensive startup summary with hardware + config + usage
- Better integration with interactive module
- Removed redundant print functions (now in interactive.py)

Features:
- Clear screen for clean menu display
- Formatted headers and sections
- Menu validation and error handling
- Memory utilization percentage display
- Real-time worker status with health indicators
2026-02-23 17:43:38 +01:00
sleepy cc0ee08b6f Phase 5: Add MCP server support alongside HTTP API
Add src/mcp_server.py:
- LocalSwarmMCPServer class implementing MCP protocol
- 5 MCP tools exposed:
  - get_hardware_info: Check CPU, GPU, RAM
  - get_swarm_status: Get worker status and model info
  - generate_code: Generate with consensus voting
  - list_available_models: Show all runnable models
  - get_worker_details: Detailed worker statistics
- Integration with SwarmManager for code generation
- Stdio transport for AI assistant communication

Update requirements.txt:
- Add mcp>=1.0.0 dependency

Update main.py:
- Add --mcp flag to enable MCP server
- Run MCP server alongside HTTP API when enabled
- Both servers share the same SwarmManager instance
- Display MCP status in startup message

Now Local Swarm supports both:
- HTTP API (for external clients, curl, opencode)
- MCP server (for tight AI assistant integration)

Usage:
  python main.py              # HTTP API only
  python main.py --mcp        # HTTP API + MCP server

MCP tools allow AI assistants to:
- Query hardware capabilities before suggesting models
- Check swarm health and worker status
- Generate code with automatic consensus voting
- List available models for the hardware
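
One of the five tools might be registered roughly as in this sketch against the mcp package's stdio server (exact signatures vary by SDK version; the swarm_manager handle is assumed):

  import asyncio
  import mcp.types as types
  from mcp.server import Server
  from mcp.server.stdio import stdio_server

  server = Server("local-swarm")

  @server.list_tools()
  async def list_tools():
      return [types.Tool(
          name="get_swarm_status",
          description="Get worker status and model info",
          inputSchema={"type": "object", "properties": {}},
      )]

  @server.call_tool()
  async def call_tool(name, arguments):
      if name == "get_swarm_status":
          status = await swarm_manager.get_status()  # swarm_manager: assumed shared instance
          return [types.TextContent(type="text", text=str(status))]
      raise ValueError(f"Unknown tool: {name}")

  async def main():
      async with stdio_server() as (read, write):
          await server.run(read, write, server.create_initialization_options())

  asyncio.run(main())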
2026-02-23 17:37:55 +01:00
sleepy 4367c79d83 Phase 4: Implement OpenAI-compatible API server
Add src/api/models.py:
- Pydantic models for OpenAI API compatibility
- ChatCompletionRequest/Response models
- Streaming response models (SSE format)
- Model listing and health check models

Add src/api/routes.py:
- POST /v1/chat/completions endpoint
- GET /v1/models endpoint
- GET /health and /v1/health endpoints
- Support for streaming (text/event-stream) and regular responses
- Message formatting for chat prompts
- Error handling with proper HTTP status codes

Add src/api/server.py:
- FastAPI application with CORS middleware
- Lifespan context for startup/shutdown
- Integration with SwarmManager
- Uvicorn server configuration

Update src/api/__init__.py:
- Export API classes and functions

Update main.py:
- Integrate API server into default workflow
- Start API server on http://127.0.0.1:PORT
- Show API endpoints and opencode configuration
- Graceful shutdown on Ctrl+C

Update AGENTS.md:
- Add note about Python support in MCP server

Phase 4 complete: Local Swarm now exposes OpenAI-compatible API at:
- POST /v1/chat/completions (with streaming support)
- GET /v1/models
- GET /health

Ready for use with opencode and other OpenAI-compatible clients.
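
The non-streaming path boils down to roughly this sketch (prompt formatting and the swarm handle are assumptions):

  from fastapi import FastAPI
  from pydantic import BaseModel

  class ChatMessage(BaseModel):
      role: str
      content: str

  class ChatCompletionRequest(BaseModel):
      model: str
      messages: list[ChatMessage]
      stream: bool = False
      max_tokens: int = 1024

  app = FastAPI()

  @app.post("/v1/chat/completions")
  async def chat_completions(req: ChatCompletionRequest):
      prompt = "\n".join(f"{m.role}: {m.content}" for m in req.messages)
      text = await swarm.generate(prompt, max_tokens=req.max_tokens)  # swarm: assumed
      return {
          "object": "chat.completion",
          "model": req.model,
          "choices": [{"index": 0,
                       "message": {"role": "assistant", "content": text},
                       "finish_reason": "stop"}],
      }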
2026-02-23 17:29:16 +01:00
sleepy 2ce3e138c1 Phase 3: Implement swarm management and consensus
Add src/swarm/worker.py:
- SwarmWorker class managing single LLM instance
- WorkerStats for tracking performance metrics
- WorkerInfo dataclass for status reporting
- Async generation with streaming support
- Health monitoring and graceful shutdown

Add src/swarm/consensus.py:
- ConsensusEngine with multiple voting strategies
- Similarity voting using sentence-transformers embeddings
- Quality voting based on code structure and completeness
- Fastest voting for low-latency scenarios
- Majority voting as fallback
- Confidence scoring for all strategies

Add src/swarm/manager.py:
- SwarmManager orchestrating multiple workers
- Parallel request distribution to all workers
- Integration with consensus engine
- Streaming support from fastest worker
- Status monitoring and health checks
- Graceful shutdown coordination

Update src/swarm/__init__.py:
- Export main classes for easy importing

Update main.py:
- Add --test mode for sample inference
- Integrate SwarmManager initialization
- Show inference results and consensus details
- Keep swarm running until interrupted
- Better error handling and status display

Phase 3 complete: Swarm can spawn N workers, generate responses,
and run consensus voting to select the best output.
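
The similarity strategy reduces to roughly this sketch (the exact MiniLM variant and scoring details are assumptions):

  from sentence_transformers import SentenceTransformer, util

  _model = SentenceTransformer("all-MiniLM-L6-v2")  # preloaded MiniLM embedder

  def similarity_vote(candidates):
      """Pick the response most similar to all the others; that similarity is the confidence."""
      if len(candidates) == 1:
          return candidates[0], 1.0
      emb = _model.encode(candidates, convert_to_tensor=True)
      sim = util.cos_sim(emb, emb)                            # pairwise cosine similarities
      scores = (sim.sum(dim=1) - 1) / (len(candidates) - 1)   # mean similarity to the rest
      best = int(scores.argmax())
      return candidates[best], float(scores[best])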
2026-02-23 17:22:54 +01:00
sleepy 6d7f323bd4 Phase 2: Implement backend integration and model downloading
Add src/backends/base.py:
- Abstract base class LLMBackend with async interface
- GenerationRequest/GenerationResponse dataclasses
- BackendError exception hierarchy

Add src/backends/llamacpp.py:
- llama.cpp backend for GGUF models
- Supports GPU offloading (CUDA/ROCm/Metal)
- Streaming and non-streaming generation
- Memory usage tracking

Add src/backends/mlx.py:
- MLX backend for Apple Silicon
- Optimized for Metal performance
- Unified memory model support

Add src/backends/__init__.py:
- Backend factory with auto-detection
- Selects MLX for Apple Silicon, llama.cpp for others
- Auto-configures GPU layers

Add src/models/downloader.py:
- HuggingFace model downloader
- Progress bar display with tqdm
- Cache management in ~/.local_swarm/models
- Support for all registered models

Update main.py:
- Integrate model downloading (--download-only mode)
- Test backend loading after download
- Async support for backend operations
- Better error handling and reporting

Phase 2 complete: Models can be downloaded and backends can load them.
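
The abstract interface likely resembles this sketch (field names and backend class names are assumptions; only the shape matters):

  import platform
  from abc import ABC, abstractmethod
  from dataclasses import dataclass

  @dataclass
  class GenerationRequest:
      prompt: str
      max_tokens: int = 1024
      temperature: float = 0.7

  @dataclass
  class GenerationResponse:
      text: str
      tokens_generated: int = 0

  class LLMBackend(ABC):
      """Async interface both the llama.cpp and MLX backends implement."""

      @abstractmethod
      async def load(self, model_path: str) -> None: ...

      @abstractmethod
      async def generate(self, request: GenerationRequest) -> GenerationResponse: ...

  def create_backend() -> LLMBackend:
      # factory: MLX on Apple Silicon, llama.cpp everywhere else
      if platform.system() == "Darwin" and platform.machine() == "arm64":
          return MLXBackend()       # hypothetical class names
      return LlamaCppBackend()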
2026-02-23 17:15:37 +01:00
sleepy 0e08a2d66a Phase 1: Implement hardware detection and model selection
- Add src/hardware/detector.py with cross-platform GPU/CPU/RAM detection
- Add src/models/registry.py with model database (Qwen, DeepSeek, CodeLlama)
- Add src/models/selector.py with optimal model selection algorithm
- Update main.py to use new modules and display results

Features:
- Detects NVIDIA GPUs on Windows/Linux
- Detects Apple Silicon on macOS
- Calculates available memory based on platform (100% GPU VRAM, 50% unified RAM)
- Selects optimal model, quantization, and instance count
- Supports 2-8 instances with quality-based selection
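
The memory budget and instance count translate to arithmetic on the order of this sketch (the 2-8 range is per the line above; exact rules assumed):

  def memory_budget_gb(vram_gb: float, unified: bool, ram_gb: float) -> float:
      # discrete GPU: all of its VRAM; unified memory: half of system RAM
      return ram_gb * 0.5 if unified else vram_gb

  def instance_count(model_size_gb: float, budget_gb: float) -> int:
      n = int(budget_gb // model_size_gb)
      return min(n, 8) if n >= 2 else 1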
2026-02-23 16:56:07 +01:00
sleepy 8cf1e16703 Initial commit: Local Swarm project structure and documentation 2026-02-23 16:46:31 +01:00