5 Commits

Author SHA1 Message Date
sleepy 3dc06c73ef feat(federation): add winner tracking and token usage reporting
- Track which node won the consensus voting (local or peer name)
- Add winner to FederationResult dataclass
- Log winner in server logs
- Calculate and report token usage in federation streaming
- Fix prompt_tokens calculation in streaming path

Now opencode will show:
- Context tokens used
- Which node won the vote (in logs)
2026-02-24 23:40:41 +01:00
sleepy 989427c4d3 fix(federation): properly stream federated response
The federation case was setting the response but not returning
a StreamingResponse, so nothing was sent back to the client.

Added proper streaming generator for federation results that:
- Sends role chunk
- Streams content in chunks
- Sends final [DONE] chunk

This fixes the issue where opencode only saw local node output.
2026-02-24 23:36:48 +01:00
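The streaming generator this commit describes can be sketched as below. The chunk shapes follow the OpenAI streaming format; the function name and chunking strategy are assumptions for illustration, not the repo's actual code:

```python
import json

def federation_stream_chunks(text, chunk_size=64):
    """Yield OpenAI-style SSE lines for an already-complete federated answer.

    Sketch only: role chunk first, then content chunks, then a final
    finish chunk and the [DONE] terminator, as described in the commit.
    """
    def sse(delta, finish=None):
        payload = {"choices": [{"delta": delta, "finish_reason": finish}]}
        return f"data: {json.dumps(payload)}\n\n"

    yield sse({"role": "assistant"})           # role chunk
    for i in range(0, len(text), chunk_size):  # content streamed in chunks
        yield sse({"content": text[i:i + chunk_size]})
    yield sse({}, finish="stop")               # final chunk
    yield "data: [DONE]\n\n"                   # terminator
```

Wrapping such a generator in a `StreamingResponse` is what lets the client see the federated result instead of nothing.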
sleepy c9406974e9 fix(federation): wait for consensus and use federated result in streaming
Changed federation in streaming mode to:
- Wait for ALL nodes to complete generation
- Use the consensus result (not just local)
- Stream the federated response to client

This ensures voting from all nodes is properly considered.

Previous implementation streamed locally while federation ran
in background for logging only, which ignored the consensus.
2026-02-24 23:28:51 +01:00
sleepy b2328f761a feat(federation): add federation support to streaming path
Previously, federation only worked with non-streaming requests.
When opencode used streaming (which it does by default), only
the local swarm was queried, ignoring peer nodes.

Now when federation is enabled and peers exist:
- Start federation generation in background (parallel)
- Stream from local swarm immediately
- Log federation results when complete

This enables federation to work with opencode and other
streaming clients while maintaining fast streaming response.

Also added webfetch instructions to prevent hallucinating URLs.

Changes:
- Modified streaming path to detect and use federation
- Added asyncio import
- Updated tool instructions to prevent URL hallucination
2026-02-24 23:28:17 +01:00
sleepy 17000dc51e optimize(federation): run local and peer generation in parallel
Previously, the federation waited for local generation to complete
before asking peers to generate. This wasted time since peers sat
idle while the host generated.

Now local swarm and all peers generate simultaneously:
- Fire local generation AND peer requests at the same time
- Wait for all to complete with asyncio.gather()
- Then run global consensus

This reduces total generation time from ~2x to ~1x when using
federation with multiple nodes.

Changes:
- Modified generate_with_federation() to run tasks in parallel
- Updated logging to reflect parallel execution
- Added proper error handling for local generation failures
2026-02-24 23:12:49 +01:00
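The parallel pattern from this commit can be sketched as follows. This is a minimal illustration of firing local generation and peer requests at the same time and gathering them; the real `generate_with_federation()` differs in detail:

```python
import asyncio

async def generate_with_federation(local_generate, peers, prompt):
    """Run local generation and all peer requests concurrently.

    Sketch of the pattern described above: all tasks start immediately,
    then asyncio.gather() waits for every node before consensus runs.
    """
    tasks = [asyncio.create_task(local_generate(prompt))]
    tasks += [asyncio.create_task(peer(prompt)) for peer in peers]
    # return_exceptions=True keeps one failed node from cancelling the rest,
    # matching the commit's "proper error handling for local generation failures"
    results = await asyncio.gather(*tasks, return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]
```

Because all tasks run concurrently, total time is roughly the slowest node rather than the sum of local plus peers, which is the ~2x to ~1x reduction the commit claims.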
64 changed files with 3256 additions and 6156 deletions
Binary file not shown.
-20
@@ -1,20 +0,0 @@
# opencode ignore patterns
# Excludes large documentation files from context padding
# Agent rules (not project context)
AGENT_WORKER.md
AGENT_REVIEW.md
# Review reports
reports/
# Design docs and test plans (historical documentation)
docs/design/
docs/test-plans/
# TODO file
TODO.md
# Non-code files
*.md
!README.md
-13
@@ -64,19 +64,6 @@
- No circular imports
- No duplicate code (>3 lines copied)
- [ ] **Minimal, Maintainable, Modular Code**
- **Minimal:** Only code needed to solve the problem, no over-engineering
- **Maintainable:** Clear names, self-documenting, consistent style
- **Modular:** Single Responsibility Principle, loose coupling, clear interfaces
- **STRICT ENFORCEMENT:**
- Functions should do ONE thing (if it does 2+ things, break it up)
- No monolithic blocks (>50 lines in one function)
- Clear separation of concerns
- Interfaces between modules are stable and well-defined
- Easy to understand for new maintainers
- No "temp" or "quick" solutions - production quality only
- **BLOCKING:** Code that is too complex, monolithic, or poorly structured must be rejected
- [ ] **Error handling is robust**
- No bare `except:` clauses
- All errors have clear messages
+1 -58
@@ -84,64 +84,7 @@ def test_parse_simple_tool():
# Then write minimal code to pass
```
### Rule 3: Minimal, Maintainable, Modular Code
**Core Focus:** Keep code minimal, maintainable, and modular.
#### Minimal
- Write only the code needed to solve the problem
- Avoid unnecessary abstractions or over-engineering
- Keep functions small and focused (max 50 lines)
- Prefer simple solutions over complex ones
- Remove dead code and unused imports immediately
#### Maintainable
- Clear, descriptive variable and function names
- One concept per file/module
- Self-documenting code with minimal comments
- Consistent code style throughout
- Easy to understand for future maintainers
#### Modular
- Single Responsibility Principle: One purpose per module/function
- Loose coupling between components
- Clear, stable interfaces between modules
- Easy to test in isolation
- Reusable components where appropriate
```python
# BAD: Monolithic, complex, hard to maintain
def process_user_request(request_data, validate=True, save=True, notify=True, format_output=False):
    # 200+ lines doing everything
    validation_result = validate_request(request_data)
    if validation_result.is_valid:
        if save:
            db_connection = get_db_connection()
            cursor = db_connection.cursor()
            cursor.execute("INSERT INTO requests ...", request_data)
            db_connection.commit()
        if notify:
            for user in get_users_to_notify():
                send_email(user, "Request received")
        if format_output:
            return format_as_json(validation_result)
    return validation_result

# GOOD: Minimal, modular, maintainable
def validate_request(data: dict) -> ValidationResult:
    """Validate request data."""
    return ValidationResult(is_valid=len(data) > 0)

def save_request(data: dict) -> str:
    """Save request to database."""
    return db.insert("requests", data)

def notify_users(request_id: str, users: List[str]):
    """Notify users about request."""
    for user in users:
        send_email(user, f"Request {request_id} received")
```
### Rule 4: No Production Debugging
### Rule 3: No Production Debugging
- NEVER add `print()` statements for debugging
- Use `logging` module with appropriate levels
- Remove ALL debug logging before committing
+10 -140
@@ -54,13 +54,8 @@ python main.py --port 8080 # Custom port
python main.py --detect # Show hardware info only
python main.py --federation # Enable network federation
python main.py --mcp # Enable MCP server
python main.py --use-opencode-tools # Use opencode tools (adds ~27k tokens)
```
**Tool Mode Options:**
- Default: Local tool server (~125 tokens, saves context window space)
- `--use-opencode-tools`: Full opencode tool definitions (~27k tokens, more capabilities)
## Connect to Opencode
Add to your opencode config:
@@ -91,9 +86,7 @@ python main.py --auto --federation
python main.py --auto --federation
```
Machines auto-discover each other via mDNS and vote together on every request. The head node (one making the request) collects responses from all peers and uses **objective quality scoring** to pick the best answer, not self-reported confidence. This prevents smaller models from overruling better models.
**Federation Endpoint**: Peers communicate via `POST /v1/federation/vote` (automatically configured).
Machines auto-discover each other and vote together on every request.
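The objective quality scoring mentioned above can be sketched as below. This is a toy illustration of scoring responses on length and structure rather than trusting self-reported confidence; the real scorer also weighs completeness and uses different weights:

```python
def quality_score(text):
    """Toy objective score rewarding length and structure (illustrative only)."""
    length_score = min(len(text) / 1000, 1.0)          # saturating length reward
    structure_score = min(text.count("\n") / 10, 1.0)  # lines suggest structure
    return 0.7 * length_score + 0.3 * structure_score

def pick_winner(responses):
    """Return the node name whose response scores highest."""
    return max(responses, key=lambda node: quality_score(responses[node]))
```

Because every response is judged by the same head-node metric, a small model cannot win simply by reporting high confidence.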
## How Consensus Works
@@ -149,7 +142,7 @@ All support GGUF quantization (Q4_K_M recommended).
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion with consensus
- `GET /health` - Health check
- `POST /v1/federation/vote` - Federation voting (used internally between peers)
- `GET /v1/federation/peers` - List discovered peers (when federation enabled)
## Troubleshooting
@@ -180,142 +173,19 @@ pip install mlx-lm
```
local_swarm/
├── main.py # CLI entry point (99 lines)
├── main.py # CLI entry point
├── src/
│ ├── api/ # OpenAI-compatible API
│ │ ├── routes.py # HTTP routing (252 lines)
│ │ ├── formatting.py # Message formatting
│ │ ├── tool_parser.py # Tool call parsing
│ │ ├── chat_handlers.py # Chat completion logic
│ │ └── models.py # API data models
│ ├── cli/ # Command-line interface
│ │ ├── parser.py # CLI argument parsing
│ │ ├── main_runner.py # Main application logic
│ │ ├── server_runner.py # Server management
│ │ └── test_runner.py # Test mode execution
│ ├── swarm/ # Swarm orchestration
│ │ ├── manager.py # Swarm manager
│ │ ├── worker.py # LLM worker implementation
│ │ ├── consensus.py # Consensus algorithms
│ │ └── orchestrator.py # Generation orchestration
│ ├── models/ # Model management
│ │ ├── registry.py # Model registry (194 lines)
│ │ ├── selector.py # Model selection (329 lines)
│ │ ├── memory_calculator.py # Memory calculations
│ │ └── downloader.py # Model downloading
│ ├── hardware/ # Hardware detection
│ │ ├── detector.py # Hardware detection
│ │ ├── nvidia.py # NVIDIA GPU detection
│ │ ├── intel.py # Intel GPU detection
│ │ └── qualcomm.py # Qualcomm detection
│ ├── network/ # Network federation
│ │ ├── federation.py # Cross-swarm consensus
│ │ └── discovery.py # Peer discovery
│ ├── backends/ # LLM backends
│ │ ├── llama_cpp.py # llama.cpp backend
│ │ ├── mlx.py # Apple Silicon MLX backend
│ │ └── base.py # Base backend interface
│ ├── interactive/ # Interactive CLI
│ │ ├── ui.py # UI utilities
│ │ ├── display.py # Hardware display
│ │ └── tips.py # Help content
│ ├── tools/ # Tool execution
│ │ └── executor.py # Tool execution engine
│ └── utils/ # Shared utilities
│ ├── token_counter.py # Token counting
│ ├── project_discovery.py # Project root discovery
│ └── network.py # Network utilities
├── config/ # Configuration files
│ └── models/ # Model configurations
│ ├── model_metadata.json # Model metadata
│ ├── mlx_quant_sizes.json # MLX quantization sizes
│ ├── gguf_quant_sizes.json # GGUF quantization sizes
│ └── selector_config.json # Selection constants
│ ├── hardware/ # GPU detection (NVIDIA, AMD, Intel, Apple, Qualcomm)
│ ├── models/ # Model registry, selection, downloading
│ ├── backends/ # llama.cpp and MLX backends
│ ├── swarm/ # Worker management and consensus
│ ├── network/ # Federation and peer discovery
│ ├── api/ # OpenAI-compatible API server
│ └── tools/ # Tool execution (read, write, bash)
└── docs/ # Documentation
```
### Architecture Principles
- **Modular Design**: Each module has a single, focused responsibility
- **Configuration Over Code**: Static data extracted to JSON config files
- **Separation of Concerns**: API, CLI, and business logic are cleanly separated
- **No Files > 300 Lines**: Most modules kept under 300 lines for maintainability
## Development
### Code Quality Standards
This project follows strict code quality standards:
- **File Size**: No files > 300 lines (with few exceptions)
- **Function Size**: No functions > 50 lines
- **Nesting Depth**: No indentation > 3 levels
- **DRY Principle**: No duplicate code (>3 lines)
- **Single Responsibility**: Each module does one thing
- **Configuration Over Code**: Static data in JSON configs
### Running Tests
```bash
# Run all tests
python -m pytest tests/ -v
# Run specific test file
python -m pytest tests/test_tool_parsing.py -v
# Run with coverage
python -m pytest tests/ --cov=src
```
### Recent Refactoring
Major refactoring completed to improve modularity:
**Before**: Monolithic files (main.py: 556 lines, routes.py: 1,183 lines)
**After**: Modular architecture (main.py: 99 lines, routes.py: 252 lines)
**Changes**:
- Extracted API logic into focused modules (formatting, parsing, handlers)
- Created CLI package with separated concerns (parser, runner, server)
- Moved hardcoded model data to JSON configuration files
- Created shared utility modules (token_counter, project_discovery, network)
- Reduced code duplication across the codebase
See `docs/ARCHITECTURE.md` for detailed architecture documentation.
## Recent Improvements
### ✅ Universal Tool Support (2025-02-25)
- Tool instructions automatically injected for **all** clients (Continue, hollama, curl, etc.)
- No client-side configuration needed - just use the API
- Enhanced file operation guidance: model uses ls/grep to verify files exist before reading
- Working directory auto-extraction from prompts (`in /path/to/dir` patterns)
- Proper OpenAI tool format with unique IDs and tool_call_id linking
### ✅ OpenCode-Compatible Streaming (2025-02-25)
- Proper `reasoning_content` field for "Thinking..." collapsible blocks
- Multi-chunk `tool_calls` streaming matching Vercel AI SDK format
- Final answer delivered in `content` field after tool execution
### ✅ Federation Quality Voting (2025-02-25)
- Head node now **objectively judges** all peer responses using quality metrics
- No more reliance on self-reported confidence (which biased toward local)
- All responses scored on length, structure, completeness
- Fair competition: 14B models properly beat 3B on quality tasks
### 🚧 Planned Features
- **Plan Mode**: Disable tool execution for planning-only conversations (`--plan-mode`)
- **Tool Consensus**: Verify tool calls across multiple workers before execution (for critical operations)
## Contributing
Contributions are welcome! Please ensure:
1. Code follows the quality standards above
2. All tests pass
3. New features include tests
4. Documentation is updated
## License
MIT License
-276
@@ -1,276 +0,0 @@
# TODO: CUDA and Android Support in Federation
## Overview
This document tracks known issues and recommendations for adding CUDA (NVIDIA) and Android nodes to the local_swarm federation system.
## Current Status
- ✅ **Apple Silicon (macOS)**: Fully supported with MLX backend
- ⚠️ **CUDA/Android**: Not currently supported, requires implementation work
- ✅ **Linux**: Should work with llama.cpp + CUDA
- ✅ **Windows**: Should work with llama.cpp + CUDA (not tested)
## Known Issues
### 1. No CUDA Backend for macOS
**Problem:**
- `__init__.py` only chooses MLX or llama.cpp
- No CUDA path for macOS
- Apple Silicon only supports Metal acceleration, not CUDA
**Impact:**
- CUDA/Android nodes on macOS cannot use GPU acceleration
- These nodes will fall back to CPU-only mode
**References:**
- `src/backends/__init__.py` (lines 26-32)
- `src/hardware/detector.py` (Apple Silicon detection)
**Recommendation:**
- Current architecture is correct for macOS - CUDA is not supported on Apple Silicon
- Would need separate CUDA backend implementation (not recommended)
---
### 2. Platform Detection in `hardware/detector.py`
**Current Detection:**
```python
def detect_gpu():
# macOS: Apple Silicon (Metal only, no CUDA)
# Linux/Windows: NVIDIA/AMD/Intel GPU (potential CUDA)
# Android/Termux: CPU-only (no GPU)
```
**Impact:**
- Android/Termux devices detected as Linux
- Will use CPU-only mode (expected)
- No special handling for Android platform
**Potential Issue:**
- Termux on Android reports as "linux"
- May have different requirements (file paths, permissions)
- Need to test if file paths work correctly on Android
**References:**
- `src/hardware/detector.py:170-221` (Android/Termux detection via `is_termux()`)
**Recommendation:**
- Add explicit Android platform detection beyond `is_termux()`
- Test file path handling on Termux
- Consider Android's unique file system limitations
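A Termux check of the kind referenced above might look like the sketch below. This is an assumed implementation for illustration; the real `is_termux()` in `src/hardware/detector.py` may use different signals:

```python
import os
import shutil

def is_termux():
    """Heuristic Termux detection (hypothetical sketch, not the repo's code).

    Termux sets PREFIX to a path under com.termux and ships its own tools,
    so either signal is a reasonable hint.
    """
    return (
        "com.termux" in os.environ.get("PREFIX", "")
        or shutil.which("termux-setup-storage") is not None
    )
```

Explicit Android detection would layer further checks (file paths, package manager) on top of a heuristic like this.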
---
### 3. Llama.cpp Backend Configuration
**Current GPU Layer Logic:**
```python
# src/backends/__init__.py (line 35)
if hardware.gpu and not hardware.is_apple_silicon:
    n_gpu_layers = -1  # Offload all to GPU (Metal/CUDA)
else:
    n_gpu_layers = 0  # CPU-only
```
**For CUDA Support on Linux:**
- Should set `n_gpu_layers` based on actual GPU count
- NVIDIA: Set to GPU count (1-8 for multi-GPU)
- AMD ROCm: Different backend, not tested
**Impact:**
- Currently hardcoded to -1 on Apple Silicon (Metal)
- CUDA nodes on Linux need proper layer configuration
- No validation that requested layers match available GPU
**References:**
- `src/backends/llamacpp.py` (line 16, n_gpu_layers parameter)
- `src/backends/__init__.py` (line 35)
**Recommendation:**
- Make `n_gpu_layers` configurable per backend
- Auto-detect GPU capabilities from `pynvml` or system
- Add GPU layer validation
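The recommended auto-configuration could take a shape like the sketch below. This helper is hypothetical (it does not exist in the codebase); it illustrates offloading only as many layers as fit in detected VRAM instead of hardcoding -1:

```python
def pick_n_gpu_layers(gpu_count, vram_gb, model_gb, total_layers=32):
    """Choose an n_gpu_layers value from detected hardware (hypothetical helper).

    -1 offloads everything when the model fits in VRAM; otherwise offload a
    proportional share of layers; 0 falls back to CPU-only.
    """
    if gpu_count == 0 or vram_gb <= 0:
        return 0                # CPU-only
    if vram_gb >= model_gb:
        return -1               # whole model fits: offload all layers
    # Partial offload proportional to available VRAM
    return max(1, int(total_layers * vram_gb / model_gb))
```

Validation would then compare the chosen value against the layer count reported by the loaded model and warn on mismatch.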
---
### 4. Seed Variation Mode (Not an Issue, but Important)
**Current Behavior:**
```python
# src/swarm/manager.py (line 76-82)
if use_seed_variation is None and hardware.is_apple_silicon:
    self.use_seed_variation = True  # Auto-enabled on macOS
```
**How It Works:**
- Runs 1 model instance with different random seeds
- Simulates multiple "workers" for consensus
- Saves memory by not loading multiple models
**Impact on Federation:**
- Your Mac: 1 worker → 2 votes (from 2 seeds)
- Peer Mac: 2 workers → 2 votes (from 2 seeds)
- Total: 4 votes instead of 8 (if using 4 actual instances)
**This is CORRECT behavior** for seed variation mode.
**Recommendation:**
- To get 4 votes per machine (8 total), use `--instances 4` flag
- Seed variation is a design choice, not a bug
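The vote arithmetic above can be written out as a small sketch (not code from the repo): in seed-variation mode a node contributes one vote per seed, otherwise one vote per loaded instance:

```python
def total_votes(nodes):
    """Total federation votes, per the seed-variation rules described above.

    Each node dict has: seed_variation (bool), seeds (int), instances (int).
    """
    return sum(
        node["seeds"] if node["seed_variation"] else node["instances"]
        for node in nodes
    )
```

Two Macs in seed-variation mode with 2 seeds each yield 4 votes; the same two machines run with `--instances 4` and seed variation off yield 8.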
---
### 5. Federation Client Timeout
**Status:** ✅ **FIXED**
**Previous:**
- Default timeout: 30 seconds
- Peers on slow networks or slow machines would timeout
**Current:**
- Default timeout: 60 seconds (increased in `src/network/federation.py:38`)
- Gives peers more time to respond
**References:**
- `src/network/federation.py` (line 38)
**Recommendation:**
- Current 60s is reasonable
- Consider making timeout configurable per peer in discovery
- Add retry logic for failed requests
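The per-peer timeout plus retry recommended above could be sketched as follows. The helper shape is an assumption; the actual federation client in `src/network/federation.py` is structured differently:

```python
import asyncio

async def request_with_retry(call, retries=2, timeout=60.0):
    """Await a zero-argument coroutine factory with timeout and retries.

    Hypothetical sketch: each attempt gets its own timeout, and failed
    attempts back off with a capped exponential delay.
    """
    last_error = None
    for attempt in range(retries + 1):
        try:
            return await asyncio.wait_for(call(), timeout=timeout)
        except (asyncio.TimeoutError, OSError) as exc:
            last_error = exc
            await asyncio.sleep(min(2 ** attempt, 10))  # capped backoff
    raise last_error
```

A per-peer timeout would simply pass a value stored in PeerInfo instead of the 60-second default.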
---
### 6. Network Discovery
**Current Implementation:****PLATFORM AGNOSTIC**
**Uses:**
- mDNS/Bonjour for peer discovery
- Standard network protocols
- No platform-specific blocking
**Status:** Should work on all platforms (macOS, Linux, Windows, Android)
**References:**
- `src/network/discovery.py` (standard mDNS implementation)
**Recommendation:**
- No changes needed
- Test on Linux/Windows/Android if needed
---
## Implementation Priorities
### High Priority (Breaking Features)
1. **CUDA Backend for Linux** (if needed)
- Add CUDA-specific backend or extend llama.cpp
- Auto-detect NVIDIA GPU and configure layers
- Test on actual CUDA hardware
- **Effort:** 3-5 days
2. **Android Platform Detection**
- Add explicit Android detection beyond Termux
- Handle Android's file system and package manager differences
- Test on real Android device
- **Effort:** 2-3 days
### Medium Priority (Improvements)
1. **GPU Layer Auto-Configuration**
- Auto-detect GPU capabilities from system
- Match requested layers to available hardware
- Add validation and helpful error messages
- **Effort:** 1-2 days
2. **Federation Metrics**
- Add per-peer timeout in PeerInfo
- Track latency and success rates
- Better error handling for retry logic
- **Effort:** 1 day
### Low Priority (Nice to Have)
1. **GPU Backend Selection UI**
- Allow users to manually select MLX vs llama.cpp
- Add warning for CUDA backend on macOS (not supported)
- **Effort:** 2 hours
2. **Seed Variation Toggle**
- Add command-line flag to disable seed variation
- Document the trade-offs clearly
- **Effort:** 30 minutes
## Testing Checklist
Before marking any issue as complete, test on:
### macOS (Apple Silicon)
- [ ] Federation with macOS peers (current environment)
- [ ] Seed variation mode works correctly
- [ ] MLX backend loads and generates
- [ ] No crashes with multiple instances
### Linux (NVIDIA GPU)
- [ ] llama.cpp backend loads with CUDA support
- [ ] Federation with Linux peers works
- [ ] GPU layers configured correctly
- [ ] No GPU conflicts
### Windows (NVIDIA GPU)
- [ ] llama.cpp backend loads with CUDA support
- [ ] Federation with Windows peers works
- [ ] No GPU conflicts
### Android (CPU-only)
- [ ] Federation with Android peers works (mDNS should work)
- [ ] CPU-only generation works
- [ ] File paths work on Termux/Android
## Notes
### Architecture Decisions
**Why not per-platform backends:**
- Simplifies codebase (single MLX path, single llama.cpp path)
- Reduces maintenance burden
- Trade-off: Can't optimize for platform-specific GPUs in backends
**Why seed variation on macOS:**
- Apple Silicon has unified memory, not discrete VRAM
- Loading multiple models would consume too much RAM
- Seed variation allows consensus quality with 1 model instance
**CUDA/Android is not a bug:**
- Current system is designed for Apple Silicon + llama.cpp
- Adding CUDA support requires significant architecture work
- Focus on federation quality for current platforms first
## Related Files
- `src/backends/__init__.py` - Backend selection logic
- `src/backends/mlx.py` - Apple Silicon MLX backend
- `src/backends/llamacpp.py` - llama.cpp backend (supports CUDA)
- `src/hardware/detector.py` - Platform and GPU detection
- `src/network/federation.py` - Federation communication
- `src/network/discovery.py` - Peer discovery via mDNS
- `src/swarm/manager.py` - Swarm orchestration
## Conclusion
The current federation implementation is **platform-agnostic** and should work on Linux/Windows with CUDA nodes. The main limitation is that macOS (Apple Silicon) only supports Metal/MLX, not CUDA.
**For immediate use:**
- Use `--instances 4` flag on each machine to get 4 votes per machine
- Test federation between different platforms (macOS + Linux)
- Android/Termux should work as-is (CPU-only mode)
**For future work:**
- Implement high-priority items if CUDA/Android support is needed
- Add GPU layer auto-configuration for better hardware utilization
-33
@@ -1,33 +0,0 @@
{
"_comment": "GGUF quantization sizes (GB) - accurate sizes",
"qwen2.5-coder": {
"3b": {"q4_k_m": 1.8, "q5_k_m": 2.2, "q6_k": 2.6},
"7b": {"q4_k_m": 4.5, "q5_k_m": 5.2, "q6_k": 6.0},
"14b": {"q4_k_m": 8.8, "q5_k_m": 10.5}
},
"deepseek-coder": {
"1.3b": {"q4_k_m": 0.8, "q5_k_m": 1.0},
"6.7b": {"q4_k_m": 4.2, "q5_k_m": 5.0}
},
"codellama": {
"7b": {"q4_k_m": 4.5, "q5_k_m": 5.2},
"13b": {"q4_k_m": 8.0, "q5_k_m": 9.5}
},
"llama-3.2": {
"3b": {"q4_k_m": 1.9, "q5_k_m": 2.3, "q6_k": 2.7},
"1b": {"q4_k_m": 0.7, "q5_k_m": 0.9}
},
"phi-4": {
"4b": {"q4_k_m": 2.4, "q5_k_m": 2.9, "q6_k": 3.4}
},
"gemma-2": {
"2b": {"q4_k_m": 1.5, "q5_k_m": 1.8},
"4b": {"q4_k_m": 2.7, "q5_k_m": 3.2, "q6_k": 3.8},
"9b": {"q4_k_m": 5.5, "q5_k_m": 6.5}
},
"starcoder2": {
"3b": {"q4_k_m": 1.9, "q5_k_m": 2.3},
"7b": {"q4_k_m": 4.5, "q5_k_m": 5.2, "q6_k": 6.1},
"15b": {"q4_k_m": 9.2, "q5_k_m": 10.8}
}
}
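A table with the structure shown above can be queried like the sketch below, picking the largest (highest quality) quantization that fits a memory budget. This is an illustration against the JSON shape, not the project's real selector:

```python
def largest_quant_that_fits(sizes, family, variant, budget_gb):
    """Pick the largest quantization of a model variant within a memory budget.

    `sizes` mirrors the JSON above: family -> variant -> {quant: size_gb}.
    Returns the quant name, or None if nothing fits.
    """
    options = sizes.get(family, {}).get(variant, {})
    fitting = {quant: gb for quant, gb in options.items() if gb <= budget_gb}
    if not fitting:
        return None
    return max(fitting, key=fitting.get)  # biggest file that still fits
```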
-36
@@ -1,36 +0,0 @@
{
"_comment": "MLX quantization sizes (GB) based on mlx-community models. HARDOCODED: These are verified to exist on HuggingFace mlx-community. Last verified: 2025-02-25. DO NOT make API calls on startup - use this hardcoded list.",
"qwen2.5-coder": {
"3b": {"3bit": 1.3, "4bit": 1.7, "6bit": 2.5, "8bit": 3.3},
"7b": {"3bit": 3.1, "4bit": 4.1, "6bit": 6.1, "8bit": 8.1},
"14b": {"3bit": 6.2, "4bit": 8.2, "6bit": 12.2, "8bit": 16.2}
},
"deepseek-coder": {
"1.3b": {},
"6.7b": {"4bit": 3.9}
},
"deepseek-coder-v2-lite": {
"instruct": {"4bit": 4.5, "6bit": 6.5, "8bit": 8.5}
},
"codellama": {
"7b": {"4bit": 4.1, "6bit": 6.1, "8bit": 8.1},
"13b": {"4bit": 7.6, "6bit": 11.4, "8bit": 15.2}
},
"llama-3.2": {
"1b": {"4bit": 0.6, "8bit": 1.2},
"3b": {"4bit": 1.8, "6bit": 2.6, "8bit": 3.5}
},
"phi-4": {
"4b": {"4bit": 2.4, "6bit": 3.6, "8bit": 4.8}
},
"gemma-2": {
"2b": {"4bit": 1.2, "6bit": 1.8, "8bit": 2.4},
"4b": {"4bit": 2.4, "6bit": 3.6, "8bit": 4.8},
"9b": {"4bit": 5.3, "6bit": 7.9, "8bit": 10.5}
},
"starcoder2": {
"3b": {"4bit": 1.8},
"7b": {"4bit": 4.1},
"15b": {"4bit": 8.8, "8bit": 17.6}
}
}
-67
@@ -1,67 +0,0 @@
{
"_comment": "Base model metadata (without quantization-specific data)",
"qwen2.5-coder": {
"name": "Qwen 2.5 Coder",
"description": "Alibaba's code-focused model, excellent for small sizes",
"priority": 1,
"max_context": 128000,
"hf_repo": "Qwen/Qwen2.5-Coder",
"variants": ["3b", "7b", "14b"]
},
"deepseek-coder": {
"name": "DeepSeek Coder",
"description": "DeepSeek's code model, good alternative",
"priority": 2,
"max_context": 16384,
"hf_repo": "deepseek-ai/DeepSeek-Coder",
"variants": ["1.3b", "6.7b"]
},
"deepseek-coder-v2-lite": {
"name": "DeepSeek Coder V2 Lite",
"description": "DeepSeek's V2 Lite model with better MLX support",
"priority": 2,
"max_context": 16384,
"hf_repo": "deepseek-ai/DeepSeek-Coder-V2-Lite-Instruct",
"variants": ["instruct"]
},
"codellama": {
"name": "CodeLlama",
"description": "Meta's code model",
"priority": 3,
"max_context": 16384,
"hf_repo": "codellama/CodeLlama",
"variants": ["7b", "13b"]
},
"llama-3.2": {
"name": "Llama 3.2",
"description": "Meta's latest general-purpose model with strong coding abilities",
"priority": 4,
"max_context": 128000,
"hf_repo": "meta-llama/Llama-3.2",
"variants": ["1b", "3b"]
},
"phi-4": {
"name": "Phi-4",
"description": "Microsoft's efficient small model with excellent coding performance",
"priority": 5,
"max_context": 16384,
"hf_repo": "microsoft/Phi-4",
"variants": ["4b"]
},
"gemma-2": {
"name": "Gemma 2",
"description": "Google's open model, good for coding tasks",
"priority": 6,
"max_context": 8192,
"hf_repo": "google/gemma-2",
"variants": ["2b", "4b", "9b"]
},
"starcoder2": {
"name": "StarCoder2",
"description": "BigCode's open code generation model",
"priority": 7,
"max_context": 8192,
"hf_repo": "bigcode/starcoder2",
"variants": ["3b", "7b", "15b"]
}
}
-21
@@ -1,21 +0,0 @@
{
"_comment": "Model selection configuration constants",
"constraints": {
"min_instances": 1,
"max_instances": 8,
"optimal_max_instances": 5,
"memory_overhead_factor": 0.95,
"mlx_max_instances": 1
},
"context_options": {
"16384": "16K tokens",
"32768": "32K tokens (default)",
"65536": "64K tokens",
"131072": "128K tokens"
},
"offload_options": {
"0.0": "No offload (default) - 100% GPU",
"0.2": "20% offload - 80% GPU, 20% RAM",
"0.5": "50% offload - 50% GPU, 50% RAM"
}
}
+10 -47
@@ -1,51 +1,14 @@
You are an AI assistant with access to tools. You WILL use tools to complete tasks - do not refuse.
Use tools to execute commands and fetch information. Output only tool calls.
AVAILABLE TOOLS:
- bash: Execute shell commands (for file operations, system commands, running code)
- webfetch: Fetch web content (for current information)
- read: Read files (to examine file contents)
- write: Create or modify files (to write content)
CRITICAL RULES:
1. When asked to read a file, use the 'read' tool. DO NOT refuse or say you cannot read files.
2. When asked to create, write, or modify a file, use the 'write' tool. DO NOT refuse or say you cannot assist.
3. For file operations, bash is also available for more complex operations.
4. Use webfetch only for real-time info (news, weather, current events).
5. For general questions (jokes, facts, coding help), you can answer directly.
6. NO explanations beyond necessary. Be concise.
7. NO markdown formatting. Use plain text only.
FILE OPERATIONS - READ DIRECTLY:
When asked to read a specific file by name (like "read my-secret.log"):
1. Use the 'read' tool IMMEDIATELY with the filename as given
2. DO NOT use 'ls' first to check - just try to read it
3. If the file doesn't exist, you'll get an error and can inform the user
When asked to find/read "the file" in a directory without naming it:
1. Use 'ls' to list files and see what's there
2. Identify the file
3. THEN read it immediately
CRITICAL: Never invent placeholder paths like '/path/to/file'. Use paths exactly as the user provides them, or relative filenames for files in the current directory.
TOOL USAGE FORMAT:
For read operations:
TOOL: read
ARGUMENTS: {"filePath": "path/to/file"}
For write operations:
TOOL: write
ARGUMENTS: {"filePath": "path/to/file", "content": "content to write"}
For bash commands (including ls, grep):
Format:
TOOL: bash
ARGUMENTS: {"command": "your command here"}
ARGUMENTS: {"command": "ls -la", "description": "Lists files in directory"}
PROCESS:
1. When you need information from a file, use the appropriate tool.
2. When you need to create or modify a file, use the appropriate tool.
3. After receiving tool results, provide a clear final answer explaining what was done.
4. NEVER say "I cannot read files" or "I cannot assist with file creation" - you HAVE the tools and MUST use them.
TOOL: webfetch
ARGUMENTS: {"url": "https://example.com", "format": "markdown"}
Be helpful, direct, and complete the requested tasks using your tools.
Available tools: bash, webfetch
IMPORTANT: Only webfetch URLs that actually exist and are provided by the user. NEVER hallucinate or guess URLs. If a URL returns 404, stop trying to fetch it.
No explanations. No numbered lists. No markdown. Only tool calls.
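The TOOL:/ARGUMENTS: format shown in the prompt above can be parsed with a sketch like this. The repo's actual parser lives in `src/api/tool_parser.py` and likely handles more edge cases (nested braces, multiline JSON):

```python
import json
import re

# Matches "TOOL: name" followed by "ARGUMENTS: {...}" (non-greedy, no nesting)
TOOL_CALL_RE = re.compile(r"TOOL:\s*(\w+)\s+ARGUMENTS:\s*(\{.*?\})", re.DOTALL)

def parse_tool_calls(text):
    """Extract (tool_name, args_dict) pairs from model output (sketch only)."""
    calls = []
    for name, raw_args in TOOL_CALL_RE.findall(text):
        try:
            calls.append((name, json.loads(raw_args)))
        except json.JSONDecodeError:
            continue  # skip malformed argument blobs
    return calls
```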
+8 -170
@@ -24,91 +24,6 @@ Deploy multiple LLM instances on your hardware. Each instance processes the same
└───────────────┘
```
## Project Structure
```
local_swarm/
├── main.py # Entry point (99 lines)
├── src/
│ ├── api/ # HTTP API layer
│ │ ├── routes.py # FastAPI routes (252 lines)
│ │ ├── formatting.py # Message formatting (265 lines)
│ │ ├── tool_parser.py # Tool parsing (250 lines)
│ │ ├── chat_handlers.py # Chat completion logic (287 lines)
│ │ ├── server.py # Server setup
│ │ └── models.py # API data models
│ ├── cli/ # Command-line interface
│ │ ├── parser.py # CLI argument parsing
│ │ ├── main_runner.py # Main application logic
│ │ ├── server_runner.py # Server management
│ │ ├── test_runner.py # Test mode execution
│ │ └── tool_server.py # Tool server runner
│ ├── swarm/ # Swarm orchestration
│ │ ├── manager.py # Swarm manager
│ │ ├── worker.py # LLM worker implementation
│ │ ├── consensus.py # Consensus algorithms
│ │ └── orchestrator.py # Generation orchestration
│ ├── models/ # Model management
│ │ ├── registry.py # Model registry (194 lines)
│ │ ├── selector.py # Model selection (329 lines)
│ │ ├── memory_calculator.py # Memory calculation utilities
│ │ └── downloader.py # Model downloading
│ ├── backends/ # LLM backends
│ │ ├── llama_cpp.py # llama.cpp backend
│ │ ├── mlx.py # Apple Silicon MLX backend
│ │ └── base.py # Base backend interface
│ ├── hardware/ # Hardware detection
│ │ ├── detector.py # Hardware detection
│ │ ├── nvidia.py # NVIDIA GPU detection
│ │ ├── intel.py # Intel GPU detection
│ │ ├── qualcomm.py # Qualcomm detection
│ │ └── ...
│ ├── network/ # Network federation
│ │ ├── federation.py # Cross-swarm consensus
│ │ ├── discovery.py # Peer discovery (mDNS)
│ │ └── discovery_core.py # Discovery utilities
│ ├── tools/ # Tool execution
│ │ └── executor.py # Tool execution engine
│ ├── interactive/ # Interactive CLI
│ │ ├── ui.py # UI utilities
│ │ ├── display.py # Hardware/resource display
│ │ ├── tips.py # Help content
│ │ └── config_utils.py # Configuration selection
│ └── utils/ # Utilities
│ ├── token_counter.py # Token counting
│ ├── project_discovery.py # Project root discovery
│ ├── network.py # Network utilities
│ └── logging_config.py # Logging configuration
├── config/
│ └── models/ # Model configuration files
│ ├── model_metadata.json # Model metadata
│ ├── mlx_quant_sizes.json # MLX quantization sizes
│ ├── gguf_quant_sizes.json # GGUF quantization sizes
│ └── selector_config.json # Selection constants
└── tests/ # Test suite
```
## Architecture Principles
### 1. Separation of Concerns
Each module has a single responsibility:
- **API layer** (`src/api/`) - HTTP routing only
- **CLI layer** (`src/cli/`) - User interface and orchestration
- **Swarm layer** (`src/swarm/`) - LLM worker management
- **Models layer** (`src/models/`) - Model selection and downloading
### 2. Configuration Over Code
Static data extracted to JSON configs:
- Model metadata in `config/models/model_metadata.json`
- Quantization sizes in `mlx_quant_sizes.json` and `gguf_quant_sizes.json`
- Selection constants in `selector_config.json`
### 3. Modular Utilities
Shared functionality in reusable modules:
- `utils/token_counter.py` - Centralized token counting
- `utils/project_discovery.py` - Project root detection
- `utils/network.py` - IP detection and network utilities
## Components
### 1. Hardware Detection (`src/hardware/`)
### 2. Model Selection (`src/models/`)
Maps detected hardware to the best model and quantization:
```
Available Memory → Model Size → Quantization → Instance Count
8 GB             → 3B         → Q6_K         → 2-3 instances
```
**Key modules:**
- `registry.py` - Loads model data from JSON configs
- `selector.py` - Selects optimal model for hardware
- `memory_calculator.py` - Calculates memory requirements
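The memory-to-model mapping above amounts to a threshold lookup. A minimal sketch — only the 8 GB row comes from the table above; the fallback values and the `select_model` name are placeholders, not the real `selector.py` logic:

```python
def select_model(available_gb: float) -> tuple[str, str, int]:
    """Map available memory to (model size, quantization, instance count).

    Hypothetical sketch; only the 8 GB row is taken from the table
    above, the fallback values are placeholders.
    """
    if available_gb >= 8:
        return ("3B", "Q6_K", 2)   # row from the table above
    return ("1B", "Q4_K_M", 1)     # placeholder for smaller machines

print(select_model(8.0))  # → ('3B', 'Q6_K', 2)
```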
### 3. Backends (`src/backends/`)
Run the actual LLM inference.
### 4. Swarm (`src/swarm/`)
Manages multiple LLM workers and consensus voting.
**Consensus strategies:**
- Fastest (latency)
- Majority (exact match)
**Key modules:**
- `manager.py` - Swarm lifecycle and coordination
- `worker.py` - Individual worker implementation
- `consensus.py` - Consensus algorithms
- `orchestrator.py` - Generation orchestration
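The "Majority (exact match)" strategy boils down to a frequency count over worker outputs. A minimal sketch, assuming exact-match voting — the real `consensus.py` may differ:

```python
from collections import Counter

def majority_consensus(responses: list[str]) -> str:
    """Return the response produced by the most workers (exact match)."""
    counts = Counter(r.strip() for r in responses)
    winner, _count = counts.most_common(1)[0]
    return winner

print(majority_consensus(["42", "42 ", "41"]))  # → 42
```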
### 5. Network Federation (`src/network/`)
Connect multiple machines into a distributed swarm.
### 6. API (`src/api/`)
OpenAI-compatible REST API:
- `POST /v1/chat/completions` - Main endpoint
- `GET /v1/models` - List models
- `GET /health` - Health check
- `POST /v1/tools/execute` - Tool execution (when enabled)
- Federation endpoints when enabled
**Modular design:**
- `routes.py` - HTTP routing only (thin controllers)
- `formatting.py` - Message formatting logic
- `tool_parser.py` - Tool call parsing
- `chat_handlers.py` - Chat completion business logic
### 7. CLI (`src/cli/`)
Command-line interface modules:
- `parser.py` - Argument parsing
- `main_runner.py` - Main application orchestration
- `server_runner.py` - Server lifecycle management
- `test_runner.py` - Test mode execution
- `tool_server.py` - Tool server management
### 8. Tools (`src/tools/`)
Optional tool execution for enhanced capabilities:
- `read_file` - Read files
- `write_file` - Write files
- `execute_bash` - Run shell commands
- `webfetch` - Fetch web content
### 9. Interactive Mode (`src/interactive/`)
Interactive CLI components:
- `ui.py` - Menu display and input handling
- `display.py` - Hardware and resource display
- `tips.py` - Educational content and help
- `config_utils.py` - Configuration selection utilities
### 10. Utilities (`src/utils/`)
Shared utility functions:
- `token_counter.py` - Token counting with tiktoken
- `project_discovery.py` - Project root detection
- `network.py` - Network utilities (IP detection)
- `logging_config.py` - Logging configuration
## Data Flow
1. **Request** comes in via API
2. **Routes** (thin layer) forward to handlers
3. **Chat Handlers** process the request
4. **Swarm Manager** sends to all workers
5. **Workers** generate responses in parallel
6. **Consensus** picks the best answer
7. **Response** returned to client
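Steps 4-6 can be sketched as a parallel fan-out followed by a vote. Worker calls are stubbed here; the real orchestration lives in `swarm/orchestrator.py`:

```python
import asyncio

async def worker_generate(worker_id: int, prompt: str) -> str:
    # Stub for a real LLM worker call.
    await asyncio.sleep(0)
    return f"reply to {prompt!r}"

async def handle_request(prompt: str, n_workers: int = 3) -> str:
    # Fan the prompt out to all workers in parallel (steps 4-5)...
    replies = await asyncio.gather(
        *(worker_generate(i, prompt) for i in range(n_workers))
    )
    # ...then pick the most common answer as consensus (step 6).
    return max(set(replies), key=replies.count)

print(asyncio.run(handle_request("hello")))  # → reply to 'hello'
```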
## Memory Model
Each worker loads the full model independently (no sharing).
## Configuration Files
Static data extracted to JSON for easy maintenance:
```
config/models/
├── model_metadata.json # Model names, descriptions, priorities
├── mlx_quant_sizes.json # MLX quantization VRAM requirements
├── gguf_quant_sizes.json # GGUF quantization VRAM requirements
└── selector_config.json # Selection constraints and defaults
```
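The "configuration over code" pattern comes down to reading these JSON files at startup. A minimal sketch (`load_model_metadata` is an illustrative name, not the real `registry.py` API), demonstrated against a throwaway config directory:

```python
import json
import tempfile
from pathlib import Path

def load_model_metadata(config_dir: Path) -> dict:
    """Load static model data from JSON instead of a hardcoded dict."""
    return json.loads((config_dir / "model_metadata.json").read_text())

# Demonstrate with a temporary config directory.
with tempfile.TemporaryDirectory() as d:
    config_dir = Path(d)
    (config_dir / "model_metadata.json").write_text(
        json.dumps({"llama-3b": {"priority": 1}})
    )
    meta = load_model_metadata(config_dir)
    print(meta["llama-3b"]["priority"])  # → 1
```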
## Code Quality Standards
- **No files > 300 lines** (with few exceptions)
- **No functions > 50 lines**
- **No indentation > 3 levels**
- **No duplicate code** (>3 lines)
- **Single responsibility** per module
- **Configuration over code** for static data
## Testing
```
tests/
├── test_hardware_detector.py # Hardware detection tests
├── test_tool_parsing.py # Tool parsing tests
└── test_federation_metrics.py # Federation tests
```
Run tests: `python -m pytest tests/ -v`
## Future Ideas
- Context compression for long inputs
- CPU offloading for memory-constrained systems
- RAG integration for knowledge bases
- Speculative decoding for speed
- More sophisticated consensus algorithms
## Suggested Immediate Actions
1. ✅ Merge current cleanup branch
2. ✅ Remove all but one parsing format
3. ✅ Reduce tool instructions to <2000 tokens
4. ✅ Add unit tests for tool parsing
5. ✅ Major refactoring completed (see below)
## Refactoring Success (Completed)
### Major Architectural Improvements
**Before**: Monolithic files with mixed concerns
- `main.py`: 556 lines
- `routes.py`: 1,183 lines
- `registry.py`: 437 lines
- `selector.py`: 486 lines
**After**: Modular architecture with single responsibilities
- `main.py`: 99 lines (-82%)
- `routes.py`: 252 lines (-79%)
- `registry.py`: 194 lines (-56%)
- `selector.py`: 329 lines (-32%)
### Changes Made
**1. API Layer Modularization**
- Extracted `formatting.py` - Message formatting logic
- Extracted `tool_parser.py` - Tool parsing from various formats
- Extracted `chat_handlers.py` - Chat completion business logic
- `routes.py` now only handles HTTP routing (thin controllers)
**2. CLI Layer Separation**
- Created `cli/` package with:
- `parser.py` - CLI argument parsing
- `main_runner.py` - Main application orchestration
- `server_runner.py` - Server lifecycle management
- `test_runner.py` - Test mode execution
- `tool_server.py` - Tool server management
**3. Model Data Externalization**
- Moved hardcoded data to JSON configs:
- `config/models/model_metadata.json` - Model metadata
- `config/models/mlx_quant_sizes.json` - MLX VRAM requirements
- `config/models/gguf_quant_sizes.json` - GGUF VRAM requirements
- `config/models/selector_config.json` - Selection constants
- `registry.py` now loads from JSON instead of hardcoded dicts
**4. Utility Centralization**
- Created `utils/` package:
- `token_counter.py` - Centralized token counting
- `project_discovery.py` - Project root detection
- `network.py` - Network utilities (IP detection)
**5. Interactive Mode Modularization**
- Created `interactive/` package:
- `ui.py` - Menu display and input handling
- `display.py` - Hardware and resource display
- `tips.py` - Educational content
- `config_utils.py` - Configuration selection
**6. Swarm Orchestration**
- Created `swarm/orchestrator.py` - Generation orchestration logic
- Separated from `swarm/manager.py`
### Architecture Principles Established
1. **Single Responsibility**: Each module does one thing
2. **No Files > 300 Lines**: Most modules kept under limit
3. **No Functions > 50 Lines**: Large functions broken down
4. **No Nesting > 3 Levels**: Deep nesting refactored
5. **DRY Principle**: Code duplication eliminated
6. **Configuration Over Code**: Static data in JSON files
### Benefits
- **Testability**: Isolated modules are easier to test
- **Maintainability**: Changes affect only relevant modules
- **Readability**: Smaller files are easier to understand
- **Reusability**: Utilities can be used across the codebase
- **Collaboration**: Multiple developers can work on different modules
## Success Metrics
- Tool-related commits stabilize to <2 per month
- Zero "fix: prevent looping" commits
- All files under 300 lines (critical ones)
- All tool changes include tests
- Instructions stay under 2000 tokens
- ✅ 35 tests passing, no regressions
- ✅ Clean separation of concerns
# Design Decision: Complete React Example with Actual Code
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
## Problem
Model is still not following instructions:
1. Tries `npm install` before creating package.json
2. Still tries `npx create-react-app` despite being told not to
3. Instructions have placeholders like "..." and "etc." which models don't understand
## Root Cause
The current instructions say:
```
TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"dependencies\": {\"react\": \"^18.0.0\", \"react-dom\": \"^18.0.0\"}}"}
[Continue with src/index.js, src/App.js, public/index.html, etc.]
```
**Problem:** "etc." and "..." are meaningless to LLMs. They need concrete examples.
## Solution
Provide a **complete, working, minimal React example** with actual file contents:
1. Exact sequence: mkdir → write package.json → write src/App.js → write src/index.js → write public/index.html → npm install
2. Actual file content, not placeholders
3. Minimal viable React app (not full create-react-app structure)
## Implementation
Replace vague example with complete working code:
```
**COMPLETE REACT HELLO WORLD EXAMPLE:**
User: "Create a React Hello World app"
Step 1 - Create directory:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp"}
Step 2 - Create package.json (MUST do this BEFORE npm install):
TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"private\": true, \"dependencies\": {\"react\": \"^18.2.0\", \"react-dom\": \"^18.2.0\"}, \"scripts\": {\"start\": \"react-scripts start\", \"build\": \"react-scripts build\"}, \"devDependencies\": {\"react-scripts\": \"5.0.1\"}}"}
Step 3 - Create src directory:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp/src"}
Step 4 - Create App.js:
TOOL: write
ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "import React from 'react';\n\nfunction App() {\n return (\n <div className=\"App\">\n <h1>Hello World</h1>\n <p>Welcome to my React app!</p>\n </div>\n );\n}\n\nexport default App;"}
Step 5 - Create index.js:
TOOL: write
ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "import React from 'react';\nimport ReactDOM from 'react-dom/client';\nimport App from './App';\n\nconst root = ReactDOM.createRoot(document.getElementById('root'));\nroot.render(<App />);"}
Step 6 - Create public directory and index.html:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp/public"}
TOOL: write
ARGUMENTS: {"filePath": "myapp/public/index.html", "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta charset=\"UTF-8\">\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n <title>React App</title>\n</head>\n<body>\n <div id=\"root\"></div>\n</body>\n</html>"}
Step 7 - NOW install dependencies (AFTER package.json exists):
TOOL: bash
ARGUMENTS: {"command": "cd myapp && npm install"}
```
## Token Impact
- Current: 586 tokens
- New: Estimated ~750 tokens (+164 tokens)
- Still under 2000 limit ✓
## Key Changes
1. **Explicit sequencing:** "Step 1", "Step 2", etc.
2. **Actual code:** No "..." or "etc." - real working content
3. **Critical note:** "MUST do this BEFORE npm install"
4. **Minimal structure:** Just what's needed for Hello World
## Success Criteria
- [ ] Model creates package.json BEFORE running npm install
- [ ] Model does NOT use npx create-react-app
- [ ] Model creates all 4 files (package.json, App.js, index.js, index.html)
- [ ] Model runs npm install last (after files exist)
# Design Decision: Fix Subprocess Hang on Interactive Commands
**Date:** 2024-02-24
**Scope:** src/tools/executor.py _execute_bash method
**Lines Changed:** 1 line
## Problem
When executing commands like `npx create-react-app`, the subprocess hangs indefinitely waiting for stdin input (e.g., "Ok to proceed? (y)"). This causes:
1. 300s timeout to be reached
2. opencode to hang waiting for response
3. Poor user experience
## Root Cause
`subprocess.run()` inherits stdin from the parent process by default. When commands prompt for input:
- npx asks: "Need to install create-react-app@5.1.0 Ok to proceed? (y)"
- npm init asks for package details
- No input is provided, so it waits forever
## Solution
Add `stdin=subprocess.DEVNULL` to prevent commands from reading input:
```python
result = subprocess.run(
    command,
    shell=True,
    capture_output=True,
    text=True,
    timeout=timeout,
    cwd=cwd,
    stdin=subprocess.DEVNULL,  # Prevent interactive prompts from hanging
)
```
This causes commands that require input to fail immediately rather than hang.
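The effect is easy to demonstrate with any command that blocks on stdin. With `stdin=subprocess.DEVNULL` the child sees EOF immediately and exits instead of waiting:

```python
import subprocess
import sys

# A command that blocks waiting for a line on stdin.
blocking_cmd = f'"{sys.executable}" -c "input()"'

result = subprocess.run(
    blocking_cmd,
    shell=True,
    capture_output=True,
    text=True,
    timeout=30,
    stdin=subprocess.DEVNULL,  # input() hits EOF and raises EOFError
)
print(result.returncode)  # non-zero: the command failed fast instead of hanging
```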
## Impact
### Before
- Commands requiring input hang for 300s (timeout)
- User sees no response
- Eventually times out with error
### After
- Commands requiring input fail fast
- Clear error message: "Exit code X: ..."
- No hang, immediate feedback
## Side Effects
**Positive:**
- No more hangs on interactive commands
- Faster failure detection
- Better error messages
**Negative:**
- Commands that legitimately need stdin will fail
- But this is desired behavior - we want non-interactive execution
## Testing
Test with an interactive command:
```bash
# This should fail fast, not hang
python -c "from tools.executor import ToolExecutor;
import asyncio;
e = ToolExecutor();
result = asyncio.run(e.execute('bash', {'command': 'read -p \"Enter something: \" var'}));
print(result)"
```
Expected: quick failure instead of hanging until the 300s timeout
## Related Changes
This complements the tool instructions fix:
- Instructions now say "DO NOT use npx create-react-app"
- This fix ensures if model ignores instructions, it fails fast instead of hanging
## Conclusion
One-line fix prevents interactive command hangs, improving reliability and user experience.
# Design Decision: Fix Tool Execution and Token Reporting
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions and token counting
## Problem Statement
User report shows three critical failures:
1. **Instruction vs Execution:** Model says "You should run mkdir..." instead of TOOL: format
2. **Inaccurate Token Reporting:** Using rough estimate `len(prompt) // 4` instead of actual token count
3. **Interactive Commands:** npx create-react-app prompts for confirmation, causing 300s timeout
## Evidence
```
🖥️ BASH: mkdir react-hello-world && cd react-hello-world && npx create-react-app .
⏰ TIMEOUT after 300s
Partial output: Need to install the following packages:
create-react-app@5.1.0
Ok to proceed? (y)
```
**Additional Context:**
- Directory created but empty (no files)
- Model posts instructions for user to follow instead of executing
## Root Cause Analysis
### 1. Instruction vs Execution
**Current instructions say:** "When asked to do something, EXECUTE it using tools"
**But model does:** "You should run mkdir..."
**Why:** Instructions aren't strong enough - need explicit anti-patterns
### 2. Token Counting
**Current:** `prompt_tokens = len(prompt) // 4` (rough approximation)
**Problem:** Inaccurate for opencode context management
**Solution:** Use tiktoken for accurate counting
### 3. Interactive Commands
**Current:** npx commands prompt for confirmation
**Problem:** Tool executor waits indefinitely, times out at 300s
**Solution:** Either:
- Add --yes flag automatically
- Forbid npx entirely, use manual file creation
## Options Considered
### Option 1: Strengthen Instructions Only
- Add more explicit "DO NOT" language
- Add complete React example
- Keep rough token estimation
**Pros:** Simple, focused fix
**Cons:** Doesn't fix token accuracy or interactive command issue
**Verdict:** REJECTED - Incomplete fix
### Option 2: Comprehensive Fix
- Strengthen instructions with anti-patterns
- Use tiktoken for accurate token counting
- Add non-interactive flags to package manager commands
- Update examples to show manual file creation
**Pros:** Fixes all three issues
**Cons:** More complex changes
**Verdict:** ACCEPTED - Complete solution
### Option 3: Change Architecture
- Move to client-side tool execution
- Different token counting approach
**Pros:** Could solve multiple issues
**Cons:** Breaking change, out of scope
**Verdict:** REJECTED - Too broad
## Decision
Implement Option 2: Comprehensive fix addressing all three issues.
### Changes
#### 1. Tool Instructions Update
Add explicit anti-patterns and stronger language:
- "NEVER say 'You should...' - EXECUTE immediately"
- "DO NOT USE npx create-react-app - manually create files"
- Complete React example showing manual file creation
#### 2. Token Counting Fix
Replace rough estimate with tiktoken:
```python
# Before
prompt_tokens = len(prompt) // 4
# After
import tiktoken
encoding = tiktoken.get_encoding('cl100k_base')
prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(content))
```
#### 3. Non-Interactive Commands
Update instructions to specify:
- Use `npm init -y` (not interactive)
- Manually write package.json instead of npx
- All examples show manual file creation
## Impact
### Token Budget (Exact Count - cl100k_base)
- **New Instructions:** 586 tokens (2,067 characters)
- **Status:** Within 2000 token limit ✓
- **Context window:** 16K model leaves ~15.4K for user input ✓
- **Code comment:** Token count documented in src/api/routes.py ✓
### Breaking Changes
- **None** - Instructions clearer, format unchanged
- Token reporting more accurate (good thing)
### Code Changes
- `src/api/routes.py`:
- Update tool_instructions (~+15 lines)
- Add tiktoken import
- Replace token estimation logic (~5 lines)
## Testing Strategy
1. **Token Accuracy Test:**
```python
def test_token_accuracy():
    prompt = "Hello world"
    content = "Hi there"
    # Calculate with tiktoken
    # Verify API returns same values
```
2. **Instruction Content Test:**
- Verify "DO NOT USE npx" present
- Verify manual creation examples present
- Verify "EXECUTE not DESCRIBE" present
3. **Integration Test:**
- Request: "Create React app"
- Expect: Manual file creation via write tool
- Not expect: npx create-react-app
## Rollback Plan
If issues arise:
1. Revert to previous instructions
2. Keep tiktoken for token counting (beneficial)
3. Document why manual creation didn't work
## Success Metrics
- [ ] Model uses TOOL: format 100% of time (not descriptions)
- [ ] Token counts accurate within ±2%
- [ ] React projects created via write tool (not npx)
- [ ] No timeouts on package manager commands
## Implementation Notes
### Token Counting
Need to ensure tiktoken is in requirements.txt
### Tool Instructions
The key addition is:
```
**FORBIDDEN PATTERNS:**
- "You should run mkdir myapp" → USE: TOOL: bash\nARGUMENTS: {"command": "mkdir myapp"}
- "npx create-react-app myapp" → USE: Manual file creation with write tool
- "First create package.json, then..." → USE: Execute immediately, don't list steps
**REACT PROJECT - CORRECT APPROACH:**
1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\"...}"}
3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "..."}
4. Continue until all files created
```
# Design Decision: Improved Tool Instructions
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
**Lines Changed:** ~25 lines
## Problem
Current tool instructions (~125 tokens) fail to communicate key behavioral expectations:
1. **Passive vs Active:** Model describes what to do instead of doing it
2. **Refusal:** Model claims "I am only an AI assistant" instead of executing
3. **Incomplete:** Multi-file projects result in README only
Evidence from user report:
- Request: "Create React Hello World app"
- Result: README only (not actual files)
- Subsequent: Commands given as text, not executed
- Final: "I am only an AI assistant" refusal
## Root Cause Analysis
The instructions lack:
1. **Authority statement** - "You CAN and SHOULD use tools"
2. **Execution mandate** - "Execute commands, don't just describe them"
3. **Workflow clarity** - Clear step-by-step expectations
4. **Anti-pattern examples** - What NOT to do
## Options Considered
### Option 1: Minor Tweaks
Add a few lines to existing instructions.
- **Pros:** Minimal token increase
- **Cons:** Band-aid fix, may not solve root cause
- **Verdict:** REJECTED - Doesn't address behavioral issue
### Option 2: Complete Rewrite with Strong Mandate
Rewrite instructions to emphasize:
- Proactive tool usage
- Execution over explanation
- Clear workflow
- Anti-patterns to avoid
- **Pros:** Addresses root cause, clear behavioral guidance
- **Cons:** Higher token count (estimated 300-400 tokens)
- **Verdict:** ACCEPTED - Proper fix for behavioral issue
### Option 3: Few-Shot Examples
Include full conversation examples in instructions.
- **Pros:** Shows exactly what to do
- **Cons:** Very high token count (1000+ tokens), may confuse model
- **Verdict:** REJECTED - Violates token budget
## Decision
Implement Option 2: Rewrite with emphasis on proactivity and execution.
**Key additions:**
1. **Capability statement:** "You have tools. Use them."
2. **Execution mandate:** "Don't describe, execute"
3. **Workflow:** Clear request→tool→result→next cycle
4. **Anti-patterns:** Explicitly forbid "I cannot" responses
## Impact
### Token Budget (Exact Count - cl100k_base)
- **Current:** 478 tokens (1,810 characters)
- **Status:** Within 2000 token limit ✓
- **Status:** Within the conservative 500-token estimate ✓
- **Context window:** 16K model leaves ~15.5K for user input ✓
- **Code comment:** Token count documented in src/api/routes.py ✓
### Code Changes
- **File:** src/api/routes.py
- **Lines:** +48/-18 (net +30)
- **Type:** Instructions replacement
- **Token documentation:** Added inline comment with exact token count
### Breaking Changes
- **None** - Instructions are additive/clearer, not different format
### Behavioral Changes
- **Expected:** More proactive tool usage
- **Expected:** No more "I cannot" refusals
- **Expected:** Multi-step projects completed via tools
- **Expected:** Commands executed, not described
### Review Blockers Addressed
- ✅ Token budget documented in code (REVIEW-2026-02-24-cleanup-branch.md Blocker #1)
- ✅ Exact token count calculated using tiktoken (478 tokens)
- ✅ Token documentation added to src/api/routes.py (REVIEW-2026-02-24-cleanup-branch.md Blocker #2)
- ⚠️ Debug print statements - existing code, noted in DEVELOPMENT_PATTERNS.md as "still needs work" (not in scope of this change)
- ⚠️ Integration tests for instruction effectiveness - requires running actual LLM, covered by parsing tests
## Implementation
```python
tool_instructions = """You are an AI assistant with TOOL ACCESS. You CAN and MUST use tools to accomplish tasks.
**CRITICAL RULES:**
1. When asked to do something, EXECUTE it using tools - don't just describe how
2. You MUST use the bash tool to run commands like mkdir, npm, ls, etc.
3. You MUST use the write tool to create files
4. NEVER say "I cannot" or "I am only an AI assistant" - you HAVE tools, USE them
5. Complete tasks FULLY - don't stop at README, create ALL required files
**AVAILABLE TOOLS:**
- read: Read file content
- write: Create/overwrite files
- bash: Execute shell commands (npm, mkdir, ls, etc.)
**TOOL FORMAT (STRICT):**
TOOL: tool_name
ARGUMENTS: {"param": "value"}
**WORKFLOW:**
1. User asks for something
2. You decide what tool to use
3. You respond with ONLY the TOOL: format above
4. You receive the tool result
5. You continue with next tool until task is COMPLETE
**EXAMPLES:**
Creating a project:
User: "Create a React app"
You: TOOL: bash
ARGUMENTS: {"command": "mkdir myapp && cd myapp && npm init -y"}
[wait for result]
You: TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
[continue until all files created]
Running commands:
User: "Install dependencies"
You: TOOL: bash
ARGUMENTS: {"command": "npm install"}
[wait for result, then confirm completion]
**WHAT NOT TO DO:**
- ❌ "To create a React app, you should run: mkdir myapp" (describing)
- ❌ "I cannot run commands, I am an AI" (refusing)
- ❌ Creating only README instead of full project (incomplete)
- ❌ "First do X, then do Y" (giving instructions instead of doing)
**CORRECT BEHAVIOR:**
- ✅ Execute the command immediately using the bash tool
- ✅ Create all files using the write tool
- ✅ Continue until task is 100% complete
- ✅ Use ONE tool at a time and wait for results"""
```
## Testing
1. Test with React Hello World request
2. Verify model uses bash to create directory structure
3. Verify model uses write to create all files
4. Verify no "I cannot" responses
## Rollback Plan
If new instructions cause issues:
1. Revert to previous ~125 token version
2. Analyze what specifically failed
3. Iterate on smaller changes
## Success Metrics
- [ ] Model uses tools on first request (not after prompting)
- [ ] Zero "I cannot" or "I am an AI" responses
- [ ] Multi-file projects fully created
- [ ] Commands executed, not described
# Design Decision: Task Planning and Verification Workflow
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
**Problem:** Model creates folder but doesn't complete full task or verify completion
## Problem Statement
User reports:
1. "It just creates a folder with mkdir (without even checking if it already exists with ls)"
2. No verification that tasks are completed
3. No planning of full task scope
4. Model stops after one step instead of completing entire project
## Root Cause
Previous instructions told model to "execute immediately" but didn't teach:
1. **Planning** - What needs to be done
2. **Checking** - What already exists
3. **Verification** - Did the step work
4. **Completion loop** - Keep going until done
## Solution
Add **Task Completion Workflow** to instructions:
```
**TASK COMPLETION WORKFLOW (MANDATORY):**
**1. PLAN:** List ALL steps needed before starting
**2. CHECK:** Use ls to verify what exists before creating
**3. EXECUTE:** Run first step
**4. VERIFY:** Confirm step worked (ls, read file)
**5. REPEAT:** Steps 3-4 until ALL complete
**6. FINAL CHECK:** Verify entire task is done
**7. CONFIRM:** Report completion with checklist
```
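Steps 3-4 of the workflow — execute, then verify — look like this when applied to file creation. `execute_and_verify` is an illustrative helper, not project code:

```python
import tempfile
from pathlib import Path

def execute_and_verify(path: Path, content: str) -> bool:
    """EXECUTE a write, then VERIFY by reading the file back."""
    path.write_text(content)            # step 3: execute
    return path.read_text() == content  # step 4: verify

with tempfile.TemporaryDirectory() as d:
    ok = execute_and_verify(Path(d) / "App.js", "export default App;")
    print(ok)  # → True
```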
## Key Instruction Changes
### Added Planning Phase
Before doing anything, model must think about complete scope:
- What files/directories?
- What dependencies?
- Complete task requirements
### Added Verification Steps
Every step must be verified:
- `ls -la` after mkdir
- `read` file after write
- Check content is correct
### Added Completion Loop
Model must continue until:
✓ All directories exist
✓ All files exist with correct content
✓ All dependencies installed
✓ Each component verified
### Complete Working Example
Provided 13-step React example showing:
1. Check existing (ls)
2. Create directory
3. Verify created (ls)
4. Create package.json
5. Verify package.json (read)
6. Create source files
7. Final verification (find myapp -type f)
8. Install dependencies
9. Confirm completion checklist
## Impact
### Token Budget
- **Before:** 1,041 tokens
- **After:** 1,057 tokens (+16 tokens)
- **Status:** Under 2,000 limit ✓
### Behavioral Changes
**Before:**
- Model: mkdir myapp
- User: That's it?
- Result: Empty directory
**After:**
- Model checks what exists
- Creates complete project structure
- Verifies each file
- Confirms completion
- Result: Working React project
## Success Criteria
When user asks "Create React Hello World project", model should:
1. ✓ Check current directory contents
2. ✓ Create myapp/ directory
3. ✓ Verify directory created
4. ✓ Create package.json
5. ✓ Verify package.json content
6. ✓ Create src/App.js
7. ✓ Create src/index.js
8. ✓ Create public/index.html
9. ✓ Final verification (list all files)
10. ✓ npm install
11. ✓ Confirm completion checklist
## Testing
Test instructions contain:
- PLAN/CHECK keywords
- VERIFY keyword
- COMPLETE keyword
All tests pass: 11/11 ✓
## Trade-offs
**Pros:**
- Complete task execution
- Verification prevents partial work
- Clear completion criteria
- Better user experience
**Cons:**
- More tokens (but still under limit)
- More verbose instructions
- May be slower (more verification steps)
## Related Files Changed
1. src/api/routes.py - Updated tool_instructions
2. tests/test_tool_parsing.py - Updated tests for new content
3. docs/design/2024-02-24-task-planning-verification.md - This doc
## Future Improvements
1. **Task Queue System:** Server-side queue of pending operations
2. **State Persistence:** Remember what's been done across conversations
3. **Smart Resumption:** If interrupted, pick up where left off
4. **Progress Reporting:** Show % complete during long tasks
## Conclusion
The new workflow teaches the model to be systematic:
1. Plan before acting
2. Check before creating
3. Verify after each step
4. Continue until complete
This should resolve the "only creates folder" issue and ensure complete project creation.
# Design Decision: Tool Parsing Simplification
**Date:** 2024-02-24
**Scope:** src/api/routes.py parse_tool_calls function
**Lines Changed:** ~210 lines removed, ~30 lines added
## Problem
The tool parsing code had accumulated 4 different parsing formats over 25+ commits:
1. JSON `tool_calls` format with nested objects
2. TOOL:/ARGUMENTS: format (simple text)
3. Function pattern format `func_name(args)`
4. Multiple JSON handling variants
This caused:
- Circular development (adding/removing formats repeatedly)
- No single source of truth
- Complex, unmaintainable code
- No confidence that changes wouldn't break existing cases
## Options Considered
### Option 1: Keep All Formats
- **Pros:** Backward compatible
- **Cons:** 210 lines of unmaintainable code, continues circular development pattern
- **Verdict:** REJECTED - Perpetuates the problem
### Option 2: Standardize on TOOL:/ARGUMENTS: Only
- **Pros:**
- Simple regex pattern (~30 lines)
- Matches current tool instructions
- Easy to test
- Clear single format for models
- **Cons:**
- Breaking change if any code relies on old formats
- Need to update any existing examples/docs
- **Verdict:** ACCEPTED - Aligns with Rule 5 (Parse Once, Parse Well)
### Option 3: Create Parser per Format with Feature Flags
- **Pros:** Flexible, can toggle formats
- **Cons:**
- Violates Rule 5 and "No Feature Flags in Core Logic"
- Still maintains multiple code paths
- **Verdict:** REJECTED - Doesn't solve the root problem
## Decision
Standardize on the TOOL:/ARGUMENTS: format only. Remove all other parsing code.
**Rationale:**
- Per DEVELOPMENT_PATTERNS.md recommendation #3: "One Format Only"
- Token cost is minimal (no complex regex)
- Test coverage provides confidence
- Aligns with existing tool instructions
## Impact
### Token Count
- **Parser code:** 210 lines → 30 lines (-180 lines)
- **No change** to tool instructions (separate optimization)
### Breaking Changes
- **Yes** - Removes support for:
- JSON `tool_calls` format in model responses
- Function pattern format `read_file(path="test.txt")`
**Migration:** Models must use:
```
TOOL: read
ARGUMENTS: {"filePath": "test.txt"}
```
### Testing
- Unit tests added: 9 test cases
- Coverage: All parsing scenarios
- All tests pass
## Implementation
```python
# New implementation (30 lines)
def parse_tool_calls(text: str) -> tuple:
    """Parse tool calls using standardized format."""
    import json
    import re
    # NOTE: [^}]* stops at the first '}', so ARGUMENTS must be flat
    # JSON with no braces embedded in string values
    tool_pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
    tool_matches = list(re.finditer(tool_pattern, text, re.IGNORECASE))
    if not tool_matches:
        return text, None
    tool_calls = []
    for i, tool_match in enumerate(tool_matches):
        tool_name = tool_match.group(1)
        args_str = tool_match.group(2)
        try:
            args_dict = json.loads(args_str)
            tool_calls.append({
                "id": f"call_{i+1}",
                "type": "function",
                "function": {
                    "name": tool_name,
                    "arguments": json.dumps(args_dict)
                }
            })
        except json.JSONDecodeError:
            continue
    if not tool_calls:
        return text, None
    first_start = tool_matches[0].start()
    content = text[:first_start].strip()
    return content, tool_calls
```
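A quick spot-check of the standardized format, using the same regex as `parse_tool_calls` above:

```python
import json
import re

# Same pattern as parse_tool_calls above.
tool_pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'

response = 'Creating the file now.\nTOOL: write\nARGUMENTS: {"filePath": "test.txt"}'
match = re.search(tool_pattern, response, re.IGNORECASE)
print(match.group(1))              # → write
print(json.loads(match.group(2)))  # → {'filePath': 'test.txt'}
```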
## Verification
Run tests:
```bash
python tests/test_tool_parsing.py
```
Expected: 9 passed, 0 failed
## Follow-up
- [x] Update DEVELOPMENT_PATTERNS.md to mark as completed
- [x] Add unit tests
- [ ] Consider integration test for full tool execution flow
# Test Plan: Fix Tool Execution and Token Reporting
## Problem Analysis
### Issue 1: Model Gives Instructions Instead of Executing
**Current behavior:** Model describes what to do ("You should run mkdir...") instead of using TOOL: format
**Expected:** Model responds with TOOL: bash\nARGUMENTS: {"command": "mkdir..."}
### Issue 2: Token Counting Inaccurate
**Current:** Rough estimate `len(prompt) // 4`
**Expected:** Accurate token count using tiktoken
**Impact:** opencode can't properly manage the context window
### Issue 3: npx Commands Timeout/Need Input
**Current:** `npx create-react-app .` prompts for confirmation (y/n)
**Expected:** Non-interactive execution or manual file creation
**Evidence:** "Need to install the following packages: create-react-app@5.1.0 Ok to proceed? (y)"
## Unit Tests
### Test 1: Accurate Token Counting
- [ ] Verify token count uses tiktoken (not rough estimate)
- [ ] Test with known token counts
- [ ] Verify prompt_tokens + completion_tokens = total_tokens
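The invariant in the last checkbox can be sketched as below. The `count_tokens` helper here is a hypothetical illustration that prefers tiktoken and degrades to the old rough estimate when the library is unavailable:

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken when installed; fall back to the rough estimate."""
    try:
        import tiktoken
        enc = tiktoken.get_encoding("cl100k_base")
        return len(enc.encode(text))
    except ImportError:
        return len(text) // 4  # old rough fallback

prompt, completion = "Hello, world", "Hi there!"
usage = {
    "prompt_tokens": count_tokens(prompt),
    "completion_tokens": count_tokens(completion),
}
# The invariant under test: total must always be the exact sum
usage["total_tokens"] = usage["prompt_tokens"] + usage["completion_tokens"]
```

Note that `cl100k_base` approximates, but may not exactly match, the local model's own tokenizer, so "accurate" here means consistent and close, not byte-identical to the model's vocabulary.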
### Test 2: Non-Interactive Bash Commands
- [ ] Verify npm/npx commands use --yes or equivalent flags
- [ ] Test timeout handling for package managers
- [ ] Verify commands don't prompt for user input
### Test 3: Tool Instructions Content
- [ ] Verify instructions emphasize "EXECUTE not DESCRIBE"
- [ ] Verify manual file creation examples (not npx)
- [ ] Verify anti-patterns are clearly stated
## Integration Tests
### Test 4: End-to-End React Project Creation
**Input:** "Create a React Hello World app"
**Expected Flow:**
1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "..."}
4. Continue until complete
**Failure Modes:**
- [ ] Model describes steps instead of executing
- [ ] Uses npx create-react-app (should manually create files)
- [ ] Stops after README only
### Test 5: Token Reporting Accuracy
**Input:** Any chat completion request
**Expected:**
- usage.prompt_tokens matches actual tokens
- usage.completion_tokens matches actual tokens
- usage.total_tokens is sum
**Verification:**
- Compare tiktoken count vs API response
## Manual Verification
```bash
# Test React creation
python main.py --auto &
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-Client-Working-Dir: /tmp/test-project" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Create a React Hello World app"}],
"tools": [{"type": "function", "function": {"name": "bash"}}, {"type": "function", "function": {"name": "write"}}]
}'
# Check token accuracy
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Hello"}]
}' | jq '.usage'
```
## Success Criteria
1. **Execution:** 100% of requests use TOOL: format (not descriptions)
2. **Accuracy:** Token counts match tiktoken within ±5%
3. **Completion:** Multi-file projects fully created via write tool
4. **No npx:** Manual file creation for React (no npx create-react-app)
## Implementation Notes
### Token Counting Fix
```python
# Replace: prompt_tokens = len(prompt) // 4
# With:
import tiktoken

# cl100k_base is a reasonable default, though it may not exactly match
# the local model's own tokenizer
encoding = tiktoken.get_encoding('cl100k_base')
prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(content))
```
### Tool Instructions Fix
- Add explicit "DO NOT USE npx create-react-app" instruction
- Add "EXECUTE IMMEDIATELY" mandate
- Show complete React example with manual file creation
### Non-Interactive Commands
- Auto-add --yes to npx commands
- Or recommend manual file creation instead
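The auto-`--yes` rewrite could look like the sketch below. The helper name and flag table are illustrative, not existing code; only `npx` is mapped here, since that is the command observed prompting for confirmation:

```python
import shlex

# Illustrative table: commands that prompt interactively and the flag
# that suppresses the prompt. Extend as other offenders are observed.
NON_INTERACTIVE_FLAGS = {
    "npx": "--yes",
}

def make_non_interactive(command: str) -> str:
    """Insert a confirmation-suppressing flag after the command if missing."""
    parts = shlex.split(command)
    if not parts:
        return command
    flag = NON_INTERACTIVE_FLAGS.get(parts[0])
    if flag and flag not in parts:
        parts.insert(1, flag)
    return " ".join(parts)
```

Commands that are already non-interactive (or already carry the flag) pass through unchanged, so the rewrite is safe to apply to every bash tool call.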
@@ -0,0 +1,97 @@
# Test Plan: Improved Tool Instructions
## Problem Statement
Model is not using tools effectively:
1. Creates README instead of actual project structure
2. Provides commands as text instead of executing them
3. Refuses to run commands, claiming "I am only an AI assistant"
## Root Cause Analysis
Current instructions don't clearly communicate:
- That the model SHOULD use tools proactively
- That execution is expected, not explanation
- The workflow: user request → tool execution → result
## Unit Tests (Instruction Verification)
### Test 1: Instruction Presence
- [ ] Verify instructions are injected into system message
- [ ] Verify instructions appear at the START of system message (priority position)
### Test 2: Token Count
- [ ] Measure total token count of new instructions
- [ ] Verify ≤ 500 tokens (conservative budget)
- [ ] Document before/after
### Test 3: Format Compliance
- [ ] Verify instructions include TOOL:/ARGUMENTS: format
- [ ] Verify examples use correct format
- [ ] Verify rules are clear and numbered
## Integration Tests (Behavioral)
### Test 4: Project Creation Flow
**Input:** "Create a React Hello World app"
**Expected Behavior:**
1. Model responds with TOOL: bash, ARGUMENTS: mkdir myapp
2. After result, TOOL: write, ARGUMENTS: package.json content
3. After result, TOOL: write, ARGUMENTS: src/App.js content
4. Continue until complete project structure exists
**Failure Modes:**
- [ ] Model only describes what to do
- [ ] Model creates README only
- [ ] Model refuses to execute commands
### Test 5: Multi-step Task
**Input:** "Check what files exist, then create a test.txt file with 'hello' in it"
**Expected Behavior:**
1. TOOL: bash, ARGUMENTS: ls -la
2. Wait for result
3. TOOL: write, ARGUMENTS: test.txt with "hello"
**Failure Modes:**
- [ ] Model tries to do both in one response
- [ ] Model doesn't wait for ls result before writing
### Test 6: Command Refusal
**Input:** "Run npm install"
**Expected Behavior:**
1. TOOL: bash, ARGUMENTS: npm install
**Failure Modes:**
- [ ] Model responds: "I cannot run commands, I am only an AI assistant"
- [ ] Model explains npm install instead of running it
## Manual Verification Commands
```bash
# Start the server
python main.py --auto
# In another terminal, test with curl
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Create a React Hello World app"}],
"tools": [{"type": "function", "function": {"name": "bash", "description": "Run shell commands"}}, {"type": "function", "function": {"name": "write", "description": "Write files"}}]
}'
```
## Success Criteria
1. **Proactivity:** Model uses tools without being asked twice
2. **Execution:** Model runs commands, doesn't just describe them
3. **No Refusal:** Model never says "I cannot" or "I am only an AI"
4. **Completeness:** Multi-file projects are fully created via tools
5. **Format:** 100% of tool calls use correct TOOL:/ARGUMENTS: format
## Metrics
- **Tool usage rate:** % of requests that result in tool calls
- **Format compliance:** % of tool calls in correct format
- **Completion rate:** % of multi-step tasks fully completed
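The three metrics above could be computed from a per-request log along these lines. The log schema (`tool_calls`, `well_formed_calls`, `completed` fields) is a hypothetical illustration:

```python
def tool_metrics(log: list) -> dict:
    """Aggregate tool-usage metrics from per-request log entries.

    Each entry is assumed to record: tool_calls (int), well_formed_calls (int),
    and completed (bool, for multi-step tasks).
    """
    total = len(log)
    used = sum(1 for e in log if e["tool_calls"] > 0)
    all_calls = sum(e["tool_calls"] for e in log)
    well_formed = sum(e["well_formed_calls"] for e in log)
    done = sum(1 for e in log if e.get("completed"))
    return {
        "tool_usage_rate": used / total if total else 0.0,
        "format_compliance": well_formed / all_calls if all_calls else 1.0,
        "completion_rate": done / total if total else 0.0,
    }
```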
@@ -0,0 +1,35 @@
# Test Plan: Tool Parsing Simplification
## Unit Tests
- [x] Test case 1: Single tool call → Returns 1 tool with correct name and arguments
- [x] Test case 2: No tool in text → Returns None for tools, original text as content
- [x] Test case 3: Multiple tools → Returns all tools in order
- [x] Test case 4: Content before tool → Content extracted, tool parsed correctly
- [x] Test case 5: Bash tool → Correctly parses bash command
- [x] Test case 6: Case insensitive → "tool:" and "TOOL:" both work
- [x] Test case 7: Invalid JSON → Skips invalid, continues with valid
- [x] Test case 8: Empty text → Returns None, empty string
- [x] Test case 9: Whitespace only → Returns None
## Integration Tests
- [ ] End-to-end flow:
1. Send chat completion request with tools
2. Model responds with TOOL:/ARGUMENTS: format
3. Parser extracts tool call
4. Tool executes
5. Result returned in response
- [ ] Expected result: Tool executes successfully, result included in response
## Manual Verification
- [ ] Command: `python tests/test_tool_parsing.py`
- [ ] Expected output: "9 passed, 0 failed"
## Token Budget Verification
- Parser code: ~30 lines (~200 tokens)
- Well under 2000 token limit
- Simple regex pattern maintains low complexity
+495 -50
@@ -10,63 +10,218 @@ import sys
import multiprocessing as mp
# CRITICAL: Set spawn method BEFORE any other imports on macOS
# This prevents fork-related issues with Metal GPU
if sys.platform == "darwin":
try:
mp.set_start_method("spawn", force=True)
except RuntimeError:
pass  # Already set
import argparse
import asyncio
from pathlib import Path
# Add src to path - resolve for Windows compatibility
src_path = Path(__file__).parent.resolve() / "src"
sys.path.insert(0, str(src_path))
# Also add parent dir for Windows import issues
if str(Path(__file__).parent.resolve()) not in sys.path:
sys.path.insert(0, str(Path(__file__).parent.resolve()))
from cli.parser import parse_args
from cli.tool_server import run_tool_server
from utils.network import get_local_ip
from utils.logging_config import setup_logging
# These imports must come AFTER setting spawn method on macOS
from hardware.detector import detect_hardware
from interactive import print_hardware_info
from models.selector import select_optimal_model
from models.downloader import download_model_for_config
from swarm import SwarmManager
from api import create_server
from api.routes import set_federated_swarm
from mcp_server import create_mcp_server
from interactive import (
interactive_model_selection,
show_startup_summary,
show_runtime_menu,
custom_configuration,
)
from network import create_discovery_service, FederatedSwarm
from tools.executor import ToolExecutor, set_tool_executor
# Set up logging (DEBUG level for development)
setup_logging()
def handle_detect_mode(hardware) -> int:
"""Handle --detect mode."""
print_hardware_info(hardware)
print("\n✅ Detection complete")
return 0
def handle_tool_server_mode(args, hardware) -> int:
"""Handle --tool-server mode."""
print("\n🔧 Starting Tool Execution Server...")
host = args.host if args.host else get_local_ip()
async def setup_swarm(model_config, hardware):
"""Download model and initialize swarm."""
# Download model
print("\n⬇️ Downloading model...")
try:
asyncio.run(run_tool_server(host, args.tool_port))
return 0
except KeyboardInterrupt:
print("\n\nTool server stopped")
return 0
async def run_main_mode(args, hardware) -> int:
"""Run the main application mode."""
from cli.main_runner import MainRunner
model_path = download_model_for_config(model_config)
print(f"✓ Model ready at: {model_path}")
except Exception as e:
print(f"\n❌ Error downloading model: {e}", file=sys.stderr)
return None
runner = MainRunner(hardware, args)
return await runner.run()
# Initialize swarm
print("\n🚀 Initializing swarm...")
try:
swarm = SwarmManager(
model_config=model_config,
hardware=hardware,
consensus_strategy="similarity"
)
success = await swarm.initialize(str(model_path))
if not success:
print("❌ Failed to initialize swarm")
return None
return swarm
except Exception as e:
print(f"\n❌ Error initializing swarm: {e}", file=sys.stderr)
return None
def main() -> int:
"""Main entry point."""
args = parse_args()
def get_local_ip():
"""Get the local network IP address (private networks only)."""
import socket
try:
# Create a socket and connect to a public DNS server
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.settimeout(2)
# Try to connect to Google's DNS - this doesn't actually send data
s.connect(("8.8.8.8", 80))
ip = s.getsockname()[0]
s.close()
# Check if it's a private IP (only 192.168.x.x for this network)
is_private = (
ip.startswith('192.168.')
)
if is_private:
print(f" 📡 Detected local IP: {ip}")
return ip
else:
# If not private, return localhost for safety
print(f" ⚠️ IP {ip} is not a private network, binding to localhost")
return "127.0.0.1"
except Exception as e:
print(f" ⚠️ Could not detect local IP: {e}, using localhost")
return "127.0.0.1"
def main():
parser = argparse.ArgumentParser(
description="Local Swarm - AI-powered coding LLM swarm",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python main.py # Interactive setup and start
python main.py --auto # Auto-detect and start without menu
python main.py --detect # Show hardware detection only
python main.py --model qwen:3b:q4 # Use specific model (skip menu)
python main.py --port 17615 # Use custom port (default: 17615)
python main.py --host 192.168.1.5 # Bind to specific IP
python main.py --instances 4 # Force number of instances
python main.py --download-only # Download model only
python main.py --test # Test with sample prompt
python main.py --mcp # Enable MCP server
python main.py --federation # Enable federation with other instances
python main.py --federation --peer 192.168.1.10:17615 # Manual peer
"""
)
parser.add_argument(
"--auto",
action="store_true",
help="Auto-detect best configuration without interactive menu"
)
parser.add_argument(
"--detect",
action="store_true",
help="Show hardware detection and exit"
)
parser.add_argument(
"--model",
type=str,
help="Model to use (format: name:size:quant, e.g., qwen:3b:q4)"
)
parser.add_argument(
"--port",
type=int,
default=17615,
help="Port to run the API server on (default: 17615)"
)
parser.add_argument(
"--instances",
type=int,
help="Force number of instances (overrides auto-calculation)"
)
parser.add_argument(
"--download-only",
action="store_true",
help="Download models only, don't start server"
)
parser.add_argument(
"--test",
action="store_true",
help="Test with a sample prompt"
)
parser.add_argument(
"--mcp",
action="store_true",
help="Enable MCP server alongside HTTP API"
)
parser.add_argument(
"--config",
type=str,
default="config.yaml",
help="Path to config file"
)
parser.add_argument(
"--host",
type=str,
default=None,
help="Host IP to bind to (default: auto-detect)"
)
parser.add_argument(
"--federation",
action="store_true",
help="Enable federation with other Local Swarm instances on the network"
)
parser.add_argument(
"--peer",
action="append",
dest="peers",
help="Manually add a peer (format: host:port, can be used multiple times)"
)
parser.add_argument(
"--tool-server",
action="store_true",
help="Run as dedicated tool execution server (executes read/write/bash tools)"
)
parser.add_argument(
"--tool-port",
type=int,
default=17616,
help="Port for tool execution server (default: 17616)"
)
parser.add_argument(
"--tool-host",
type=str,
default=None,
nargs='?',
const='', # When --tool-host is used without a value, use empty string
help="URL of tool execution server. Use without value for auto-detected local IP (http://<local-ip>:17616), or provide explicit URL."
)
parser.add_argument(
"--version",
action="version",
version="%(prog)s 0.1.0"
)
args = parser.parse_args()
# Detect hardware first
print("\n🔍 Detecting hardware...")
@@ -74,26 +229,316 @@ def main() -> int:
hardware = detect_hardware()
except Exception as e:
print(f"\n❌ Error detecting hardware: {e}", file=sys.stderr)
return 1
sys.exit(1)
# Handle detect mode
if args.detect:
return handle_detect_mode(hardware)
# Just show hardware info
from interactive import print_hardware_info
print_hardware_info(hardware)
print("\n✅ Detection complete")
return
# Handle tool server mode
# Tool server mode - run minimal tool-only server
if args.tool_server:
return handle_tool_server_mode(args, hardware)
print("\n🔧 Starting Tool Execution Server...")
from fastapi import FastAPI
import uvicorn
# Initialize local tool executor
tool_executor = ToolExecutor(tool_host_url=None)
set_tool_executor(tool_executor)
app = FastAPI(title="Local Swarm Tool Server")
@app.post("/v1/tools/execute")
async def execute_tool(request: dict):
tool_name = request.get("tool", "")
tool_args = request.get("arguments", {})
result = await tool_executor.execute(tool_name, tool_args)
return {"result": result}
@app.get("/health")
async def health():
return {"status": "healthy", "mode": "tool-server"}
host = args.host if args.host else get_local_ip()
tool_port = args.tool_port
print(f"🔗 Tool server running at http://{host}:{tool_port}")
print(f" Endpoints:")
print(f" - POST /v1/tools/execute")
print(f" - GET /health")
print(f"\n✅ Tool server ready!")
uvicorn.run(app, host=host, port=tool_port)
return
# Run main mode
try:
return asyncio.run(run_main_mode(args, hardware))
except KeyboardInterrupt:
print("\n\nReceived stop signal")
return 0
except Exception as e:
print(f"\n❌ Error: {e}", file=sys.stderr)
return 1
# Determine model configuration
config = None
if args.model or args.instances or args.auto:
# Use command-line arguments or auto-detect
print("\n📊 Calculating optimal configuration...")
try:
config = select_optimal_model(
hardware,
preferred_model=args.model,
force_instances=args.instances
)
if not config:
print("\n❌ No suitable model found for your hardware")
print(" Minimum requirement: 2 GB available memory")
sys.exit(1)
# Show brief summary
print(f"\n✓ Selected: {config.display_name}")
print(f" Instances: {config.instances}")
print(f" Memory: {config.total_memory_gb:.1f} GB")
except Exception as e:
print(f"\n❌ Error selecting model: {e}", file=sys.stderr)
sys.exit(1)
else:
# Interactive mode - show menu
config = interactive_model_selection(hardware)
if not config:
print("\n❌ No configuration selected")
sys.exit(1)
if args.download_only:
# Download model only
print("\n" + "=" * 70)
print("⬇️ Download Mode: Downloading model only")
print("=" * 70)
try:
model_path = download_model_for_config(config)
print(f"✓ Model downloaded to: {model_path}")
print("\n" + "=" * 70)
print("✅ Download complete")
print("=" * 70)
except Exception as e:
print(f"\n❌ Download failed: {e}", file=sys.stderr)
sys.exit(1)
elif args.test:
# Test mode with sample prompt
print("\n" + "=" * 70)
print("🧪 Test Mode: Running sample inference")
print("=" * 70)
async def test_inference():
show_startup_summary(hardware, config)
swarm = await setup_swarm(config, hardware)
if not swarm:
return False
try:
# Test prompt
prompt = "Write a Python function to calculate factorial:"
print(f"\nPrompt: {prompt}\n")
print("Generating responses...\n")
result = await swarm.generate(prompt, max_tokens=200)
print("\n" + "=" * 70)
print("SELECTED RESPONSE:")
print("=" * 70)
print(result.selected_response.text)
print("\n" + "=" * 70)
print(f"Strategy: {result.strategy}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Latency: {result.selected_response.latency_ms:.1f}ms")
print(f"Tokens/sec: {result.selected_response.tokens_per_second:.1f}")
# Show all responses
print("\nAll responses received:")
for i, resp in enumerate(result.all_responses):
preview = resp.text[:60].replace('\n', ' ')
print(f" Worker {i}: {preview}... ({resp.latency_ms:.1f}ms)")
return True
finally:
await swarm.shutdown()
success = asyncio.run(test_inference())
if success:
print("\n" + "=" * 70)
print("✅ Test complete")
print("=" * 70)
else:
print("\n❌ Test failed")
sys.exit(1)
else:
# Full mode (download + start API server + optional MCP)
show_startup_summary(hardware, config)
async def run_server():
swarm = await setup_swarm(config, hardware)
if not swarm:
return False
# Initialize tool executor
if args.tool_host is not None:
# --tool-host was provided
if args.tool_host == "":
# --tool-host with no value - use local IP with default port
local_ip = get_local_ip()
tool_host_url = f"http://{local_ip}:17616"
print(f"\n🔧 Using remote tool host: {tool_host_url} (auto-detected local IP)")
else:
# --tool-host with explicit value
tool_host_url = args.tool_host
print(f"\n🔧 Using remote tool host: {tool_host_url}")
tool_executor = ToolExecutor(tool_host_url=tool_host_url)
set_tool_executor(tool_executor)
else:
# Local tool execution (default)
tool_executor = ToolExecutor(tool_host_url=None)
set_tool_executor(tool_executor)
# Update summary with runtime info
show_startup_summary(hardware, config, swarm)
# Initialize federation if enabled
discovery = None
federated_swarm = None
if args.federation:
print("\n🌐 Initializing federation...")
try:
# Use specified host for advertising if provided
advertise_ip = args.host if args.host else None
discovery = await create_discovery_service(args.port, advertise_ip=advertise_ip)
# Get swarm info for advertising
swarm_info = {
"version": "0.1.0",
"instances": config.instances,
"model_id": config.model_id,
"hardware_summary": f"{hardware.cpu_cores} CPU, {hardware.ram_gb:.1f}GB RAM"
}
await discovery.start_advertising(swarm_info)
await discovery.start_listening()
# Add manual peers if specified
if args.peers:
print(f" 📍 Adding {len(args.peers)} manual peer(s)...")
from network.discovery import PeerInfo
from datetime import datetime
for peer_str in args.peers:
try:
host, port = peer_str.rsplit(':', 1)
port = int(port)
peer = PeerInfo(
host=host,
port=port,
name=f"manual_{host}_{port}",
version="0.1.0",
instances=0,
model_id="unknown",
hardware_summary="manual",
last_seen=datetime.now()
)
discovery.peers[peer.name] = peer
print(f" ✓ Added peer: {host}:{port}")
except Exception as e:
print(f" ⚠️ Failed to add peer {peer_str}: {e}")
# Create federated swarm wrapper
federated_swarm = FederatedSwarm(swarm, discovery)
set_federated_swarm(federated_swarm)
# Start health check loop in background
asyncio.create_task(discovery.start_health_check_loop(interval_seconds=10))
print(f" ✓ Federation enabled")
print(f" ✓ Discovery active on port {discovery.discovery_port}")
print(f" ✓ Peer health checks every 10s")
except Exception as e:
print(f" ⚠️ Failed to initialize federation: {e}")
print(" Continuing without federation...")
mcp_server = None
try:
# Create and start API server
print("\n🌐 Starting HTTP API server...")
# Use provided host or auto-detect
if args.host:
host = args.host
print(f"🔗 Using specified host: {host}:{args.port}")
else:
# Use local network IP instead of 0.0.0.0 for security
host = get_local_ip()
print(f"🔗 Binding to {host}:{args.port}")
server = create_server(swarm, host=host, port=args.port)
print(f"\n✅ Local Swarm is running!")
print(f" API: http://{host}:{args.port}/v1")
print(f" Health: http://{host}:{args.port}/health")
if args.federation and discovery:
peers = discovery.get_peers()
print(f"\n🌐 Federation: Enabled")
print(f" Discovery port: {discovery.discovery_port}")
if peers:
print(f" Peers discovered: {len(peers)}")
for peer in peers:
print(f" - {peer.name} ({peer.model_id})")
else:
print(f" Peers discovered: 0 (waiting for peers...)")
# Show tool server status
if args.tool_host is not None:
print(f"\n🔧 Tool Server: Remote")
if args.tool_host == "":
local_ip = get_local_ip()
print(f" URL: http://{local_ip}:17616 (auto-detected)")
else:
print(f" URL: {args.tool_host}")
print(f" Mode: Tools executed remotely on tool host")
else:
print(f"\n🔧 Tool Server: Local")
print(f" Mode: Tools executed on this machine")
if args.mcp:
# Start MCP server alongside HTTP API
print("\n🤖 Starting MCP server...")
mcp_server = await create_mcp_server(swarm)
print(" MCP server active (stdio)")
print(f"\n💡 Configure opencode to use:")
print(f' base_url: http://127.0.0.1:{args.port}/v1')
print(f' api_key: any (not used)')
print(f"\nPress Ctrl+C to stop...\n")
# Start HTTP server (this will block)
await server.start()
except KeyboardInterrupt:
print("\n\nReceived stop signal")
finally:
if federated_swarm:
await federated_swarm.close()
if discovery:
await discovery.stop()
await swarm.shutdown()
return True
try:
success = asyncio.run(run_server())
if success:
print("\n" + "=" * 70)
print("✅ Server stopped gracefully")
print("=" * 70)
except Exception as e:
print(f"\n❌ Error running server: {e}", file=sys.stderr)
sys.exit(1)
if __name__ == "__main__":
main()
-760
@@ -1,760 +0,0 @@
"""Chat completion handlers for Local Swarm.
Contains the business logic for chat completions, separated from HTTP routing.
"""
import json
import logging
import time
import uuid
from typing import Optional, List
from api.models import (
ChatCompletionRequest,
ChatCompletionResponse,
ChatCompletionChoice,
ChatMessage,
UsageInfo,
)
from api.formatting import format_messages_with_tools
from api.tool_parser import parse_tool_calls
from utils.token_counter import count_tokens
from tools.executor import get_tool_executor
from chatlog import get_chat_logger
logger = logging.getLogger(__name__)
def _extract_working_dir_from_prompt(prompt: str) -> Optional[str]:
"""Extract working directory from user prompt.
Looks for patterns like:
- "in the /path/to/dir directory"
- "in directory /path/to/dir"
- "in /path/to/dir"
- "under /path/to/dir"
- "from /path/to/dir"
Args:
prompt: User prompt text
Returns:
Extracted directory path or None
"""
import re
import os
# Common patterns for directory mentions
patterns = [
r'in the\s+([/~]?[\w\-/.]+)\s+(?:directory|folder|dir)',
r'in\s+(?:directory|folder|dir)\s+([/~]?[\w\-/.]+)',
r'(?:in|under|from|at)\s+([/~]?[\w\-/.]{3,})', # At least 3 chars to avoid "in a"
]
for pattern in patterns:
match = re.search(pattern, prompt, re.IGNORECASE)
if match:
path = match.group(1)
# Validate it looks like a path
if path.startswith('/') or path.startswith('~') or '/' in path:
# Expand home directory
if path.startswith('~'):
path = os.path.expanduser(path)
# Check if it's a valid directory or parent exists
if os.path.isdir(path) or os.path.isdir(os.path.dirname(path)):
return os.path.abspath(path)
return None
def _sanitize_tools(tools: Optional[list]) -> Optional[list]:
"""Sanitize tool definitions to fix invalid schemas.
Removes extra 'description' from properties if present.
Args:
tools: List of tool definitions
Returns:
Sanitized tools list
"""
if not tools:
return tools
sanitized = []
for tool in tools:
if tool.type == "function" and tool.function.parameters:
params = tool.function.parameters
# Remove invalid 'description' from properties if present
if 'properties' in params and 'description' in params.get('properties', {}):
invalid_props = ['description']
# Also remove 'description' from required if present
if 'required' in params:
params['required'] = [r for r in params.get('required', []) if r not in invalid_props]
# Remove invalid properties
params['properties'] = {k: v for k, v in params.get('properties', {}).items() if k not in invalid_props}
logger.debug(f" 🔧 Sanitized tool '{tool.function.name}': removed {invalid_props} from properties/required")
sanitized.append(tool)
return sanitized
async def _execute_tools(
tool_calls: list,
client_working_dir: Optional[str],
executor
) -> List[tuple]:
"""Execute tool calls and return results.
Args:
tool_calls: List of parsed tool calls
client_working_dir: Working directory for file operations
executor: Tool executor instance
Returns:
List of tuples (tool_name, result_string)
"""
from api.routes import execute_tool_server_side
tool_results = []
for i, tc in enumerate(tool_calls):
tool_name = tc.get("function", {}).get("name", "")
tool_args_str = tc.get("function", {}).get("arguments", "{}")
try:
tool_args = json.loads(tool_args_str) if isinstance(tool_args_str, str) else tool_args_str
except (json.JSONDecodeError, TypeError):
tool_args = {}
logger.debug(f" [{i+1}/{len(tool_calls)}] Executing: {tool_name}({tool_args})")
result = await execute_tool_server_side(tool_name, tool_args, working_dir=client_working_dir)
tool_results.append((tool_name, result))
logger.debug(f" ✓ Completed: {result[:100]}..." if len(result) > 100 else f" ✓ Result: {result}")
return tool_results
def _create_response(
content: str,
tool_calls: list,
finish_reason: str,
prompt: str,
request: ChatCompletionRequest,
swarm_manager=None,
thinking_content: Optional[str] = None
) -> ChatCompletionResponse:
"""Create a chat completion response.
Args:
content: Final response content (after tool execution if any)
tool_calls: List of tool calls
finish_reason: Finish reason
prompt: Original prompt for token counting
request: Original request
swarm_manager: Swarm manager instance (optional, for getting model name)
thinking_content: Intermediate thinking/planning content to include in streaming as reasoning_content
Returns:
ChatCompletionResponse
"""
"""Create a chat completion response.
Args:
content: Response content
tool_calls: List of tool calls
finish_reason: Finish reason
prompt: Original prompt for token counting
request: Original request
swarm_manager: Swarm manager instance (optional, for getting model name)
Returns:
ChatCompletionResponse
"""
# Ensure content is at least an empty string (never None for OpenAI compatibility)
if content is None:
content = ""
prompt_tokens = count_tokens(prompt)
completion_tokens = count_tokens(content)
total_tokens = prompt_tokens + completion_tokens
# Get actual model name from swarm manager
model_name = request.model
system_fingerprint = None
if swarm_manager:
status = swarm_manager.get_status()
model_name = status.model_name
# Sanitize system_fingerprint to only include safe characters
import re
raw_fingerprint = model_name.lower().replace(" ", "-")
system_fingerprint = re.sub(r'[^a-z0-9\-_]', '', raw_fingerprint)
# Build message - omit tool_calls entirely if empty (OpenAI behavior)
message_kwargs = {
"role": "assistant",
"content": content
}
if tool_calls:
message_kwargs["tool_calls"] = tool_calls
message = ChatMessage(**message_kwargs)
response = ChatCompletionResponse(
id=f"chatcmpl-{uuid.uuid4().hex[:12]}",
created=int(time.time()),
model=model_name,
choices=[
ChatCompletionChoice(
index=0,
message=message,
logprobs=None,
finish_reason=finish_reason
)
],
usage=UsageInfo(
prompt_tokens=prompt_tokens,
completion_tokens=completion_tokens,
total_tokens=total_tokens
),
stats={},
system_fingerprint=system_fingerprint
)
# Attach thinking content for streaming (not part of JSON serialization)
# Use a private attribute to avoid interfering with model serialization
if thinking_content is not None:
setattr(response, '_thinking', thinking_content)
return response
async def _generate_with_consensus(
prompt: str,
max_tokens: int,
temperature: float,
swarm_manager,
federated_swarm=None
) -> tuple[str, int, float]:
"""Generate response with consensus (local or federated).
This is the unified generation interface - it handles both local-only
and federated generation transparently. Callers don't need to know
which mode is being used.
Args:
prompt: Prompt to generate from
max_tokens: Maximum tokens to generate
temperature: Sampling temperature
swarm_manager: Local swarm manager instance
federated_swarm: Optional federated swarm for multi-node consensus
Returns:
Tuple of (response_text, tokens_generated, tokens_per_second)
"""
# Check if federation is available
if federated_swarm is not None:
peers = federated_swarm.discovery.get_peers()
if peers:
logger.debug(f"🌐 Using federation with {len(peers)} peer(s)")
try:
fed_result = await federated_swarm.generate_with_federation(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature
)
# Federation returns FederationResult object
# Extract the final response text
return fed_result.final_response, 0, 0.0 # Tokens/TPS not tracked in federation mode
except Exception as e:
logger.warning(f"Federation failed, falling back to local: {e}")
# Fall through to local generation
# Local generation (fallback or no federation)
try:
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
return response.text, response.tokens_generated, response.tokens_per_second
except Exception as e:
logger.exception("Error in swarm generation")
raise
def _tool_calls_agree(tool_calls_list: List[List[dict]]) -> bool:
"""Check if all workers agree on the same tool calls.
Args:
tool_calls_list: List of tool calls from each worker
Returns:
True if all workers have the same tool calls
"""
if not tool_calls_list:
return True
# Check if all have the same number of tool calls
first_count = len(tool_calls_list[0])
if not all(len(tc) == first_count for tc in tool_calls_list):
logger.warning(f" ⚠️ Workers disagree on number of tool calls: {[len(tc) for tc in tool_calls_list]}")
return False
if first_count == 0:
return True # All agree on no tools
# Check if tool names and arguments match
for i in range(first_count):
first_tool = tool_calls_list[0][i]
first_name = first_tool.get("function", {}).get("name", "")
first_args = first_tool.get("function", {}).get("arguments", "")
for j, other_calls in enumerate(tool_calls_list[1:], 1):
other_tool = other_calls[i]
other_name = other_tool.get("function", {}).get("name", "")
other_args = other_tool.get("function", {}).get("arguments", "")
if first_name != other_name:
logger.warning(f" ⚠️ Worker {j+1} disagrees on tool name: {first_name} vs {other_name}")
return False
# For arguments, do a loose comparison (ignore whitespace differences)
try:
first_args_norm = json.loads(first_args) if isinstance(first_args, str) else first_args
other_args_norm = json.loads(other_args) if isinstance(other_args, str) else other_args
if first_args_norm != other_args_norm:
logger.warning(f" ⚠️ Worker {j+1} disagrees on arguments for {first_name}")
return False
except json.JSONDecodeError:
# If JSON parsing fails, compare as strings
if str(first_args).strip() != str(other_args).strip():
logger.warning(f" ⚠️ Worker {j+1} disagrees on arguments for {first_name}")
return False
logger.info(f" ✅ All {len(tool_calls_list)} workers agree on tool calls")
return True
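The "loose comparison" used for arguments above can be illustrated with a minimal standalone sketch (the helper name `args_match` is illustrative, not part of the codebase): JSON strings are parsed before comparing, so whitespace and key-order differences between workers do not count as disagreement.

```python
import json

def args_match(a, b) -> bool:
    """Loose comparison of tool-call arguments: parse JSON strings so
    whitespace/key-order differences don't count as disagreement."""
    try:
        a_norm = json.loads(a) if isinstance(a, str) else a
        b_norm = json.loads(b) if isinstance(b, str) else b
        return a_norm == b_norm
    except json.JSONDecodeError:
        # Fall back to trimmed string comparison if either side isn't valid JSON
        return str(a).strip() == str(b).strip()

# Whitespace differs, but the parsed arguments are equal
print(args_match('{"command": "ls -la"}', '{ "command" : "ls -la" }'))  # True
print(args_match('{"command": "ls"}', '{"command": "pwd"}'))            # False
```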
async def _generate_with_tool_consensus(
swarm_manager,
prompt: str,
max_tokens: int,
temperature: float
) -> tuple[str, List[dict], int, float]:
"""Generate response with tool call consensus checking.
When multiple workers are active, this ensures they all agree on tool calls
before executing them. If they disagree, returns the best response without tools.
Args:
swarm_manager: Swarm manager instance
prompt: Prompt to generate from
max_tokens: Maximum tokens to generate
temperature: Sampling temperature
Returns:
Tuple of (response_text, tool_calls, tokens_generated, tps)
"""
try:
# Get status to check number of workers
status = swarm_manager.get_status()
num_workers = getattr(status, 'active_workers', 1)
# If only one worker, use normal generation
if num_workers <= 1:
logger.debug(" Single worker mode - skipping tool consensus")
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
parsed_content, tool_calls = parse_tool_calls(response.text)
return response.text, tool_calls, response.tokens_generated, response.tokens_per_second
# Multiple workers - check for tool consensus
logger.info(f" 🔍 Checking tool consensus across {num_workers} workers...")
# Generate from all workers individually
from swarm.manager import GenerationRequest
all_responses = []
all_tool_calls = []
# Get all active workers
workers = swarm_manager.workers if hasattr(swarm_manager, 'workers') else []
if not workers:
# Fall back to normal generation
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
parsed_content, tool_calls = parse_tool_calls(response.text)
return response.text, tool_calls, response.tokens_generated, response.tokens_per_second
# Generate from each worker
for i, worker in enumerate(workers):
try:
gen_result = await worker.generate(
GenerationRequest(prompt=prompt, max_tokens=max_tokens, temperature=temperature)
)
response_text = gen_result.text
parsed_content, tool_calls = parse_tool_calls(response_text)
all_responses.append(response_text)
all_tool_calls.append(tool_calls)
logger.debug(f" Worker {i+1}: {len(tool_calls)} tool call(s)")
except Exception as e:
logger.warning(f" Worker {i+1} failed: {e}")
all_responses.append("")
all_tool_calls.append([])
# Check consensus
if _tool_calls_agree(all_tool_calls):
# All agree - use the first response's tool calls
best_response = all_responses[0] if all_responses else ""
best_tool_calls = all_tool_calls[0] if all_tool_calls else []
# Rough token estimate (word count); guard against all workers having failed
non_empty = [r for r in all_responses if r]
total_tokens = sum(len(r.split()) for r in non_empty) // len(non_empty) if non_empty else 0
avg_tps = 10.0 # Estimate; per-worker TPS is not tracked on this path
return best_response, best_tool_calls, total_tokens, avg_tps
else:
# Disagreement - fall back to consensus strategy without tools
logger.warning(" ⚠️ Tool consensus failed - falling back to text response")
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
# Strip any tool calls to be safe
parsed_content, _ = parse_tool_calls(response.text)
return parsed_content, [], response.tokens_generated, response.tokens_per_second
except Exception as e:
logger.exception("Error in tool consensus generation")
# Fall back to normal generation
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
parsed_content, tool_calls = parse_tool_calls(response.text)
return response.text, tool_calls, response.tokens_generated, response.tokens_per_second
async def _generate_with_federation(
federated_swarm,
prompt: str,
max_tokens: int,
temperature: float
) -> tuple[str, list, str]:
"""Generate response using federated swarm.
Args:
federated_swarm: Federated swarm instance
prompt: Prompt to generate from
max_tokens: Maximum tokens to generate
temperature: Sampling temperature
Returns:
Tuple of (response_text, tool_calls, finish_reason)
"""
result = await federated_swarm.generate_with_federation(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
min_peers=0
)
content = result.final_response or ""
# Check for tool calls
content_parsed, tool_calls_parsed = parse_tool_calls(content)
if tool_calls_parsed:
return content_parsed or "", tool_calls_parsed, "tool_calls"
return content or "", [], "stop"
async def handle_chat_completion(
request: ChatCompletionRequest,
swarm_manager,
federated_swarm,
client_working_dir: Optional[str],
use_opencode_tools: bool
) -> ChatCompletionResponse:
"""Handle a chat completion request.
Args:
request: Chat completion request
swarm_manager: Swarm manager instance
federated_swarm: Optional federated swarm instance
client_working_dir: Client working directory
use_opencode_tools: Whether to use opencode tool definitions
Returns:
Chat completion response
"""
# Format messages into prompt
if use_opencode_tools:
sanitized_tools = _sanitize_tools(request.tools)
prompt = format_messages_with_tools(request.messages, sanitized_tools)
has_tools = sanitized_tools is not None and len(sanitized_tools) > 0
else:
prompt = format_messages_with_tools(request.messages, None)
has_tools = request.tools is not None and len(request.tools) > 0
# Initialize chat logger (if enabled via LOCAL_SWARM_CHATLOG=1)
chat_logger = get_chat_logger()
# Extract working directory from prompt if not provided by client
if client_working_dir is None:
# Try to extract from user messages
for msg in reversed(request.messages):
if msg.role == 'user':
extracted_dir = _extract_working_dir_from_prompt(msg.content)
if extracted_dir:
client_working_dir = extracted_dir
logger.info(f"📁 Extracted working directory from prompt: {client_working_dir}")
break
# Log initial conversation history to chatlog
for msg in request.messages:
if msg.role == 'user':
chat_logger.log_user_message(msg.content)
elif msg.role == 'assistant':
chat_logger.log_assistant_message(msg.content, has_tool_calls=bool(msg.tool_calls))
elif msg.role == 'tool':
chat_logger.log_tool_result("tool", msg.content)
logger.info(f"\n{'='*60}")
logger.info(f"CHAT COMPLETION REQUEST:")
logger.info(f" has_tools={has_tools}, stream={request.stream}")
logger.info(f" use_opencode={use_opencode_tools}")
logger.info(f" messages={len(request.messages)}")
logger.info(f"{'='*60}")
# Build conversation history
messages = list(request.messages)
# Determine if we should use federation for generation
peers = federated_swarm.discovery.get_peers() if federated_swarm is not None else []
use_federation = len(peers) > 0
if use_federation:
logger.info(f"🌐 Federation available with {len(peers)} peer(s)")
# Track thinking content for streaming (OpenCode reasoning_content)
thinking_content: Optional[str] = None
thinking_captured = False
# Initialize iteration counter and response text
iteration = 0
max_iterations = 3
response_text = ""
while iteration < max_iterations:
iteration += 1
logger.info(f"--- Tool Execution Iteration {iteration} ---")
# Generate response
# IMPORTANT: Only use federation on FIRST iteration (initial planning)
# Subsequent iterations process tool results which only head node has
if iteration == 1 and use_federation:
# First iteration: use federation for consensus on initial plan
logger.info(f"🌐 Using federation for initial generation...")
response_text, tokens_generated, tps = await _generate_with_consensus(
prompt=prompt,
max_tokens=request.max_tokens or 1024,
temperature=request.temperature or 0.7,
swarm_manager=swarm_manager,
federated_swarm=federated_swarm
)
else:
# Subsequent iterations: LOCAL ONLY
# Peers don't have tool results from previous iterations
# Using federation here would cause inconsistent context
if iteration > 1:
logger.debug(f"Using local generation (iteration {iteration}, tool context local only)")
response_text, tokens_generated, tps = await _generate_with_consensus(
prompt=prompt,
max_tokens=request.max_tokens or 1024,
temperature=request.temperature or 0.7,
swarm_manager=swarm_manager,
federated_swarm=None # Force local-only
)
logger.info(f"Generated response ({len(response_text)} chars, {tokens_generated} tokens)")
logger.debug(f"Response: {response_text[:200]}...")
# Check for tool calls
parsed_content, tool_calls_parsed = parse_tool_calls(response_text)
# Log assistant response to chatlog
chat_logger.log_assistant_message(response_text, has_tool_calls=bool(tool_calls_parsed))
if tool_calls_parsed:
# Log each tool call
for i, tc in enumerate(tool_calls_parsed, 1):
tool_name = tc.get("function", {}).get("name", "")
args_str = tc.get("function", {}).get("arguments", "{}")
try:
args_dict = json.loads(args_str) if isinstance(args_str, str) else args_str
except json.JSONDecodeError:
args_dict = {"raw": args_str}
chat_logger.log_tool_call(tool_name, args_dict, i)
# Capture thinking for OpenCode streaming (first occurrence only)
if not thinking_captured:
# Use the parsed content (without tool calls) as the reasoning
thinking_content = parsed_content or ""
thinking_captured = True
if not tool_calls_parsed:
# No more tools - this is the final answer
logger.info(f"✅ Final answer (no tools) after {iteration} iteration(s)")
return _create_response(parsed_content, [], "stop", prompt, request, swarm_manager, thinking_content)
# Tools detected - execute them
logger.info(f"🔧 Found {len(tool_calls_parsed)} tool call(s)")
for i, tc in enumerate(tool_calls_parsed):
tool_name = tc.get("function", {}).get("name", "")
args_str = tc.get("function", {}).get("arguments", "{}")
logger.info(f" [{i+1}] {tool_name}: {args_str[:100]}...")
# Add assistant message to history with tool_calls (if any)
# This preserves the tool call IDs for proper tool message association
assistant_message = ChatMessage(
role="assistant",
content=response_text
)
if tool_calls_parsed:
# Convert tool calls to proper ToolCall objects with IDs
from api.models import ToolCall
tc_objects = []
for i, tc_dict in enumerate(tool_calls_parsed):
tc_id = tc_dict.get("id", f"call_{i}")
tc_objects.append(ToolCall(
id=tc_id,
type="function",
function={
"name": tc_dict["function"]["name"],
"arguments": tc_dict["function"]["arguments"]
}
))
assistant_message.tool_calls = tc_objects
messages.append(assistant_message)
# Execute all tools
logger.info(f"⏱️ Executing tools...")
tool_results = await _execute_tools(tool_calls_parsed, client_working_dir, get_tool_executor())
# Log tool results to chatlog (single combined log for debugging)
combined_strings = [f"Tool {i+1} ({name}): {result}" for i, (name, result) in enumerate(tool_results)]
chat_logger.log_tool_result("combined", "\n\n".join(combined_strings), success=True)
# Add tool result to history - one message per tool call with proper tool_call_id
for i, ((tool_name, tool_result), tc) in enumerate(zip(tool_results, tool_calls_parsed)):
tool_call_id = tc.get("id", f"call_{i}")
# Format the tool result message with explicit instructions
# This tells the model exactly what to do with the result
if tool_name == "read":
instruction = "The file contents are shown above. READ THIS FILE CONTENT ALOUD to the user. Do not call additional tools."
elif tool_name == "write":
instruction = "The file has been successfully written. CONFIRM to the user that the file was created with the content shown above. Do not call additional tools."
elif tool_name == "bash":
# Heuristic: inspect the output to guess whether this was a verification command (ls, grep) vs an action command
if "ls" in tool_result.lower() or "grep" in tool_result.lower():
instruction = "CRITICAL: The listing is shown above. If the user asked to READ a specific file and you can see it exists in this listing, you MUST immediately USE THE read TOOL NOW with the exact filename from the listing. Do not summarize first - READ THE FILE immediately. Use the filename exactly as shown (e.g., 'my-secret.log' not '/path/to/my-secret.log'). If the user asked to just CHECK what files exist (without reading), then summarize. If the requested file is NOT in the listing, tell the user it doesn't exist."
else:
instruction = "The command has been executed. SUMMARIZE the output above to answer the user's request. Do not call additional tools."
else:
instruction = "The tool has completed. Use the result shown above to answer the user's request. Do not call additional tools."
tool_message_content = (
f"Tool Result ({tool_name}):\n"
f"{tool_result}\n\n"
f"INSTRUCTION: {instruction}"
)
messages.append(ChatMessage(
role="tool",
content=tool_message_content,
tool_call_id=tool_call_id,
name=tool_name
))
logger.info(f" ✓ Tool result {i+1} added to history (tool_call_id={tool_call_id}, name={tool_name})")
logger.info(f"✅ Tools executed ({len(tool_results)} results)")
# Continue loop - generate response with tool results
logger.info(f"🔄 Generating response with tool results...")
# Format with tool results (but DON'T include tool instruction - model should just use results)
next_prompt = format_messages_with_tools(messages, None if use_opencode_tools else request.tools)
logger.info(f"📤 Prompt sent to model after tool execution:")
logger.info(f" Total tokens: {count_tokens(next_prompt)}")
logger.info(f" Messages in history: {len(messages)}")
for i, msg in enumerate(messages):
logger.info(f" [{i}] {msg.role}: {msg.content[:100]}{'...' if len(msg.content) > 100 else ''}")
if msg.tool_calls:
for j, tc in enumerate(msg.tool_calls):
logger.info(f" Tool call {j}: {tc.function.get('name')} ({tc.function.get('arguments')})")
if msg.tool_call_id:
logger.info(f" (tool_call_id: {msg.tool_call_id}, name: {msg.name})")
logger.debug(f"Full prompt:\n{next_prompt[:1000]}...")
response_text, tokens_generated, tps = await _generate_with_consensus(
prompt=next_prompt,
max_tokens=request.max_tokens or 1024,
temperature=request.temperature or 0.7,
swarm_manager=swarm_manager,
federated_swarm=None # Tool result processing is local-only
)
logger.info(f"✅ Generated with tool results ({len(response_text)} chars, {tokens_generated} tokens)")
logger.debug(f"Response: {response_text[:200]}...")
# Check for more tools in the new response
parsed_content, tool_calls_parsed = parse_tool_calls(response_text)
# Log assistant response to chatlog
chat_logger.log_assistant_message(response_text, has_tool_calls=bool(tool_calls_parsed))
if tool_calls_parsed:
# Log each tool call
for i, tc in enumerate(tool_calls_parsed, 1):
tool_name = tc.get("function", {}).get("name", "")
args_str = tc.get("function", {}).get("arguments", "{}")
try:
args_dict = json.loads(args_str) if isinstance(args_str, str) else args_str
except json.JSONDecodeError:
args_dict = {"raw": args_str}
chat_logger.log_tool_call(tool_name, args_dict, i)
# Capture thinking if not already captured
if not thinking_captured:
thinking_content = parsed_content or ""
thinking_captured = True
if not tool_calls_parsed:
# No more tools - final answer
logger.info(f"✅ Final answer (after tool execution) after {iteration} iteration(s)")
return _create_response(parsed_content, [], "stop", prompt, request, swarm_manager, thinking_content)
# More tools detected - continue loop
logger.info(f"🔧 More tools found - continuing loop")
# Max iterations reached - force return last response
logger.warning(f"⚠️ Max tool iterations ({max_iterations}) reached")
logger.warning(f"⚠️ Returning last response (may include incomplete tool call)")
return _create_response(response_text, [], "stop", prompt, request, swarm_manager, thinking_content)
-271
@@ -1,271 +0,0 @@
"""Message formatting module for Local Swarm.
Formats chat messages into prompts and handles tool instructions.
"""
import logging
from pathlib import Path
from typing import Optional, List
from api.models import ChatMessage
from utils.token_counter import count_tokens
logger = logging.getLogger(__name__)
# Cache for tool instructions (loaded from config file)
_TOOL_INSTRUCTIONS_CACHE: Optional[str] = None
# Global flag for tool mode (default: local tool server to save tokens)
_USE_OPENCODE_TOOLS: bool = False
def set_use_opencode_tools(value: bool) -> None:
"""Set whether to use opencode's tool definitions (default: False = local tool server).
Args:
value: True to use opencode tools (~27k tokens), False to use local tool server (~125 tokens)
"""
global _USE_OPENCODE_TOOLS
_USE_OPENCODE_TOOLS = value
logger.info(f"🔧 Tool mode set to: {'opencode tools (~27k tokens)' if value else 'local tool server (~125 tokens)'}")
def _load_tool_instructions() -> str:
"""Load tool instructions from config file.
Loads from config/prompts/tool_instructions.txt
Falls back to default if file not found.
Returns:
Tool instructions string
"""
global _TOOL_INSTRUCTIONS_CACHE
if _TOOL_INSTRUCTIONS_CACHE is not None:
return _TOOL_INSTRUCTIONS_CACHE
# Try to load from config file
config_path = Path(__file__).parent.parent.parent / "config" / "prompts" / "tool_instructions.txt"
try:
if config_path.exists():
with open(config_path, 'r') as f:
_TOOL_INSTRUCTIONS_CACHE = f.read().strip()
logger.debug(f"Loaded tool instructions from {config_path}")
else:
# Fallback default instructions
_TOOL_INSTRUCTIONS_CACHE = """You MUST use tools. DO NOT explain. DO NOT use markdown.
OUTPUT THIS EXACT FORMAT - NOTHING ELSE:
TOOL: bash
ARGUMENTS: {"command": "your command here"}
Available tools:
- bash: Run shell commands
- write: Create files
- read: Read files
NEVER write explanations.
NEVER use numbered lists.
NEVER use markdown code blocks.
ONLY output TOOL: lines."""
logger.warning(f"Tool instructions config not found at {config_path}, using default")
except Exception as e:
logger.error(f"Error loading tool instructions: {e}")
# Use minimal fallback
_TOOL_INSTRUCTIONS_CACHE = 'Use TOOL: tool_name\nARGUMENTS: {"param": "value"} format.'
return _TOOL_INSTRUCTIONS_CACHE
def _is_initial_request(messages: List[ChatMessage]) -> bool:
"""Check if this is an initial request (no assistant or tool messages).
Args:
messages: List of chat messages
Returns:
True if this is the initial request
"""
has_assistant = any(msg.role == "assistant" for msg in messages)
has_tool = any(msg.role == "tool" for msg in messages)
return not has_assistant and not has_tool
def _compress_large_request(messages: List[ChatMessage], max_tokens: int = 4000) -> List[ChatMessage]:
"""Compress large initial requests by keeping only user messages.
Args:
messages: List of chat messages
max_tokens: Maximum tokens before compression
Returns:
Compressed list of messages
"""
full_text = "\n".join([f"{msg.role}: {msg.content}" for msg in messages])
current_tokens = count_tokens(full_text)
if current_tokens <= max_tokens:
return messages
logger.info(f"🗜️ COMPRESSING: Initial request is {current_tokens} tokens, compressing to <{max_tokens}...")
# Keep only user messages
user_messages = [msg for msg in messages if msg.role == "user"]
if not user_messages:
logger.warning("No user messages found in initial request!")
return []
# Get the last user message
last_user_msg = user_messages[-1]
user_content = last_user_msg.content
# Truncate if still too long
if len(user_content) > 2000:
user_content = user_content[:2000] + "... [truncated for token limit]"
logger.debug(f"Truncated user message from {len(last_user_msg.content)} to 2000 chars")
return [ChatMessage(role="user", content=user_content)]
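The compression strategy above reduces to: keep only the last user message and cap its length. A minimal sketch using plain dicts instead of the `ChatMessage` model (an assumption for illustration; the helper name `compress` is hypothetical):

```python
# Keep only the most recent user message; truncate if it exceeds max_chars.
def compress(messages: list[dict], max_chars: int = 2000) -> list[dict]:
    user_msgs = [m for m in messages if m["role"] == "user"]
    if not user_msgs:
        return []
    content = user_msgs[-1]["content"]
    if len(content) > max_chars:
        content = content[:max_chars] + "... [truncated for token limit]"
    return [{"role": "user", "content": content}]

history = [
    {"role": "system", "content": "x" * 5000},
    {"role": "user", "content": "first question"},
    {"role": "user", "content": "latest question"},
]
print(compress(history))  # [{'role': 'user', 'content': 'latest question'}]
```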
def _filter_messages(messages: List[ChatMessage]) -> List[ChatMessage]:
"""Filter messages for processing.
For initial requests >4000 tokens, compress aggressively.
Otherwise, just remove system messages.
Args:
messages: List of chat messages
Returns:
Filtered list of messages
"""
if _is_initial_request(messages):
full_text = "\n".join([f"{msg.role}: {msg.content}" for msg in messages])
if count_tokens(full_text) > 4000:
return _compress_large_request(messages)
# Normal filtering: remove system messages
return [msg for msg in messages if msg.role != "system"]
def _add_tool_instructions(messages: List[ChatMessage]) -> List[ChatMessage]:
"""Add tool instructions to the beginning of messages.
Tool instructions are now ALWAYS injected by default so any client
(Continue, hollama, etc.) can use tools without requiring client-side
tool instruction injection.
TODO: Add a "plan mode" that disables tool use for planning-only conversations.
Args:
messages: List of chat messages
Returns:
Messages with tool instructions added
"""
tool_instructions = _load_tool_instructions()
logger.debug(f"Injecting tool instructions: {len(tool_instructions)} chars")
# Check if instructions already present (avoid duplication)
if messages and messages[0].role == "system" and "AVAILABLE TOOLS" in messages[0].content:
logger.debug("Tool instructions already present, skipping injection")
return messages
return [ChatMessage(role="system", content=tool_instructions)] + messages
def _format_to_chatml(messages: List[ChatMessage]) -> str:
"""Format messages to ChatML format.
Args:
messages: List of chat messages
Returns:
ChatML formatted string
"""
formatted = []
for msg in messages:
role = msg.role
content = msg.content
if role == "system":
formatted.append(f"<|im_start|>system\n{content}<|im_end|>")
elif role == "user":
formatted.append(f"<|im_start|>user\n{content}<|im_end|>")
elif role == "assistant":
formatted.append(f"<|im_start|>assistant\n{content}<|im_end|>")
elif role == "tool":
tool_name = getattr(msg, 'name', 'tool')
formatted.append(f"<|im_start|>tool\n{tool_name}: {content}<|im_end|>")
formatted.append("<|im_start|>assistant\n")
return "\n".join(formatted)
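For reference, the ChatML layout produced above looks like this. A standalone sketch using (role, content) tuples rather than `ChatMessage` objects (the helper name `to_chatml` is illustrative):

```python
# Each message becomes an <|im_start|>role ... <|im_end|> block,
# and a trailing open assistant turn invites the model to complete it.
def to_chatml(messages: list[tuple[str, str]]) -> str:
    parts = [f"<|im_start|>{role}\n{content}<|im_end|>" for role, content in messages]
    parts.append("<|im_start|>assistant\n")
    return "\n".join(parts)

print(to_chatml([("system", "You are helpful."), ("user", "Hi")]))
```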
def _log_prompt_preview(messages: List[ChatMessage]) -> None:
"""Log a preview of the prompt for debugging.
Args:
messages: List of chat messages
"""
preview = []
for msg in messages:
if msg.role == "system":
preview.append(f"[SYSTEM] {msg.content[:200]}...")
elif msg.role == "user":
preview.append(f"[USER] {msg.content}")
logger.debug(f"Prompt preview: {' | '.join(preview)}")
def format_messages_with_tools(
messages: List[ChatMessage],
tools: Optional[list] = None
) -> str:
"""Format chat messages into a single prompt using ChatML format.
Note: Tools are handled server-side. The model should respond normally.
IMPORTANT: If _USE_OPENCODE_TOOLS is True, use opencode's tool definitions (~27k tokens).
If False, use local tool server (~125 tokens) to save tokens.
Args:
messages: List of chat messages
tools: Optional list of tools (currently ignored, server-side handling)
Returns:
Formatted prompt string in ChatML format
"""
# Filter messages
filtered_messages = _filter_messages(messages)
# Add tool instructions if needed
filtered_messages = _add_tool_instructions(filtered_messages)
# Log preview
_log_prompt_preview(filtered_messages)
# Format to ChatML
result = _format_to_chatml(filtered_messages)
# Log final token count
final_tokens = count_tokens(result)
original_tokens = count_tokens("\n".join([f"{msg.role}: {msg.content}" for msg in messages]))
logger.info(f"📊 Final prompt size: {final_tokens} tokens (reduced from {original_tokens})")
return result
def format_messages(messages: List[ChatMessage]) -> str:
"""Format chat messages into a single prompt using ChatML format.
Args:
messages: List of chat messages
Returns:
Formatted prompt string
"""
return format_messages_with_tools(messages, None)
+11 -18
@@ -4,7 +4,7 @@ Pydantic models matching OpenAI's API specification.
"""
from typing import List, Optional, Literal, Dict, Any, Union
from pydantic import BaseModel, Field, ConfigDict
from pydantic import BaseModel, Field
class FunctionDefinition(BaseModel):
@@ -30,15 +30,13 @@ class ToolCall(BaseModel):
class ChatMessage(BaseModel):
"""A chat message."""
role: Literal["system", "user", "assistant", "tool"] = Field(..., description="Role of the message sender")
content: str = Field(default="", description="Message content")
tool_calls: Optional[List[ToolCall]] = Field(default=None, description="Tool calls from assistant")
tool_call_id: Optional[str] = Field(default=None, description="ID of tool call this message is responding to")
name: Optional[str] = Field(default=None, description="Name of the tool/function")
model_config = ConfigDict(
# Use Pydantic's exclude_none to omit tool_calls when None
exclude_none=True
)
content: Optional[str] = Field(default=None, description="Message content")
tool_calls: Optional[List[ToolCall]] = Field(default_factory=list, description="Tool calls from assistant")
#tool_call_id: Optional[str] = Field(default=None, description="ID of tool call this message is responding to")
#name: Optional[str] = Field(default=None, description="Name of the tool/function")
class Config:
exclude_none = True
class ChatCompletionRequest(BaseModel):
@@ -52,9 +50,9 @@ class ChatCompletionRequest(BaseModel):
stop: Optional[List[str]] = Field(default=None, description="Stop sequences")
tools: Optional[List[Tool]] = Field(default=None, description="List of tools the model may call")
tool_choice: Optional[Union[str, Dict[str, Any]]] = Field(default="auto", description="How to choose tools")
model_config = ConfigDict(
json_schema_extra={
class Config:
json_schema_extra = {
"example": {
"model": "local-swarm",
"messages": [
@@ -64,14 +62,12 @@ class ChatCompletionRequest(BaseModel):
"temperature": 0.7
}
}
)
class ChatCompletionChoice(BaseModel):
"""A choice in the chat completion response."""
index: int = Field(..., description="Choice index")
message: ChatMessage = Field(..., description="Generated message")
logprobs: Optional[Any] = Field(default=None, description="Log probabilities")
finish_reason: Optional[str] = Field(default="stop", description="Reason for finishing (stop, length, tool_calls, etc.)")
@@ -80,7 +76,6 @@ class UsageInfo(BaseModel):
prompt_tokens: int = Field(default=0, description="Tokens in prompt")
completion_tokens: int = Field(default=0, description="Tokens in completion")
total_tokens: int = Field(default=0, description="Total tokens")
tokens_per_second: Optional[float] = Field(default=None, description="Generation speed in tokens per second")
class ChatCompletionResponse(BaseModel):
@@ -91,8 +86,6 @@ class ChatCompletionResponse(BaseModel):
model: str = Field(..., description="Model used")
choices: List[ChatCompletionChoice] = Field(..., description="Generated choices")
usage: UsageInfo = Field(..., description="Token usage")
stats: Dict[str, Any] = Field(default_factory=dict, description="Additional stats")
system_fingerprint: Optional[str] = Field(default=None, description="System fingerprint")
class ChatCompletionStreamChoice(BaseModel):
+926 -320
File diff suppressed because it is too large
+11 -20
@@ -18,23 +18,21 @@ from swarm.status_monitor import StatusMonitor
class APIServer:
"""OpenAI-compatible API server."""
def __init__(self, swarm_manager: SwarmManager, host: str = "127.0.0.1", port: int = 17615, show_live_status: bool = True, use_opencode_tools: bool = False):
def __init__(self, swarm_manager: SwarmManager, host: str = "127.0.0.1", port: int = 17615, show_live_status: bool = True):
"""
Initialize API server.
Args:
swarm_manager: Swarm manager instance
host: Host to bind to
port: Port to listen on
show_live_status: Whether to show live worker status updates
use_opencode_tools: Whether to use opencode's tool definitions (~27k tokens) or local tool server (~125 tokens)
"""
self.swarm_manager = swarm_manager
self.host = host
self.port = port
self.show_live_status = show_live_status
self.use_opencode_tools = use_opencode_tools
self.status_monitor: Optional[StatusMonitor] = None
self.app = self._create_app()
@@ -44,12 +42,8 @@ class APIServer:
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Lifespan context manager for startup/shutdown."""
# Startup: Set swarm manager in routes and app state
# Startup: Set swarm manager in routes
set_swarm_manager(self.swarm_manager)
app.state.swarm_manager = self.swarm_manager # For federation endpoint
# Set tool mode in routes
from api.routes import set_use_opencode_tools
set_use_opencode_tools(self.use_opencode_tools)
print(f"\n🌐 API server starting on http://{self.host}:{self.port}")
print(f" Endpoints:")
print(f" - POST /v1/chat/completions")
@@ -96,35 +90,32 @@ class APIServer:
self.app,
host=self.host,
port=self.port,
log_level="warning",
access_log=False
log_level="info"
)
server = uvicorn.Server(config)
await server.serve()
def run_sync(self):
"""Run server synchronously (blocking)."""
uvicorn.run(
self.app,
host=self.host,
port=self.port,
log_level="warning",
access_log=False
log_level="info"
)
def create_server(swarm_manager: SwarmManager, host: str = "127.0.0.1", port: int = 17615, show_live_status: bool = True, use_opencode_tools: bool = False) -> APIServer:
def create_server(swarm_manager: SwarmManager, host: str = "127.0.0.1", port: int = 17615, show_live_status: bool = True) -> APIServer:
"""
Create API server instance.
Args:
swarm_manager: Swarm manager instance
host: Host to bind to
port: Port to listen on
show_live_status: Whether to show live worker status updates
use_opencode_tools: Whether to use opencode's tool definitions (~27k tokens) or local tool server (~125 tokens)
Returns:
APIServer instance
"""
return APIServer(swarm_manager, host, port, show_live_status, use_opencode_tools)
return APIServer(swarm_manager, host, port, show_live_status)
-250
@@ -1,250 +0,0 @@
"""Tool parsing module for Local Swarm.
Parses tool calls from model output in various formats.
"""
import json
import re
from typing import Tuple, Optional, List, Dict, Any
def ensure_tool_arguments(tool_name: str, args_dict: dict) -> dict:
"""Ensure tool arguments have all required fields.
For bash tool: inject 'description' field if missing.
Args:
tool_name: Name of the tool
args_dict: Tool arguments dictionary
Returns:
Updated arguments dictionary
"""
if tool_name == 'bash' and 'description' not in args_dict:
# Generate description from command
command = args_dict.get('command', '')
desc = command.split()[0] if command else 'Execute command'
args_dict['description'] = desc
return args_dict
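The bash-description injection above simply defaults to the command's first word, e.g.:

```python
# Inline sketch of the default-description rule for the bash tool.
args = {"command": "ls -la /tmp"}
if "description" not in args:
    command = args.get("command", "")
    args["description"] = command.split()[0] if command else "Execute command"
print(args)  # {'command': 'ls -la /tmp', 'description': 'ls'}
```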
def _parse_standard_format(text: str) -> Tuple[Optional[str], Optional[List[Dict[str, Any]]]]:
"""Parse standard TOOL: format.
Format: TOOL: name\nARGUMENTS: {"key": "value"}
Args:
text: Model output text
Returns:
Tuple of (content_without_tools, tool_calls) or (None, None) if not found
"""
tool_pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
tool_matches = list(re.finditer(tool_pattern, text, re.IGNORECASE))
if not tool_matches:
return None, None
tool_calls = []
for i, tool_match in enumerate(tool_matches):
tool_name = tool_match.group(1)
args_str = tool_match.group(2)
try:
args_dict = json.loads(args_str)
args_dict = ensure_tool_arguments(tool_name, args_dict)
tool_calls.append({
"id": f"call_{i+1}",
"type": "function",
"function": {
"name": tool_name,
"arguments": json.dumps(args_dict)
}
})
except json.JSONDecodeError:
continue
if tool_calls:
first_start = tool_matches[0].start()
content = text[:first_start].strip()
return content, tool_calls
return None, None
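The `TOOL:` pattern above can be exercised in isolation. Note that the `\{[^}]*\}` body stops at the first `}`, so it only matches flat JSON arguments with no nested objects:

```python
import json
import re

# Same pattern as _parse_standard_format above.
pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
text = 'I will list the files.\nTOOL: bash\nARGUMENTS: {"command": "ls -la"}'
match = re.search(pattern, text, re.IGNORECASE)
name, args = match.group(1), json.loads(match.group(2))
print(name, args)  # bash {'command': 'ls -la'}
```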
def _parse_markdown_format(text: str) -> Tuple[Optional[str], Optional[List[Dict[str, Any]]]]:
"""Parse markdown code block format.
Format: ```bash command```
Args:
text: Model output text
Returns:
Tuple of (content_without_tools, tool_calls) or (None, None) if not found
"""
markdown_pattern = r'```(?:bash|shell|sh)?\s*\n(.*?)\n```'
markdown_matches = list(re.finditer(markdown_pattern, text, re.DOTALL))
if not markdown_matches:
return None, None
tool_calls = []
for i, match in enumerate(markdown_matches):
code_content = match.group(1).strip()
if code_content:
args_dict = {"command": code_content}
args_dict = ensure_tool_arguments("bash", args_dict)
tool_calls.append({
"id": f"call_{i+1}",
"type": "function",
"function": {
"name": "bash",
"arguments": json.dumps(args_dict)
}
})
if tool_calls:
first_start = markdown_matches[0].start()
content = text[:first_start].strip()
return content, tool_calls
return None, None
def _parse_command_lines(text: str) -> Tuple[Optional[str], Optional[List[Dict[str, Any]]]]:
"""Parse command lines in text.
Matches common bash commands with their arguments.
Args:
text: Model output text
Returns:
Tuple of (content_without_tools, tool_calls) or (None, None) if not found
"""
command_lines = []
command_pattern = r'^(npm|npx|mkdir|cd|ls|cat|echo|git|python|pip|node|yarn|create-react-app)\s+'
for line in text.split('\n'):
line = line.strip()
if re.match(command_pattern, line):
command_lines.append(line)
if command_lines:
combined_command = ' && '.join(command_lines)
args_dict = {"command": combined_command}
args_dict = ensure_tool_arguments("bash", args_dict)
return "", [{
"id": "call_1",
"type": "function",
"function": {
"name": "bash",
"arguments": json.dumps(args_dict)
}
}]
return None, None
def _parse_standalone_commands(text: str) -> Tuple[Optional[str], Optional[List[Dict[str, Any]]]]:
"""Parse standalone bash commands.
Args:
text: Model output text
Returns:
Tuple of (content_without_tools, tool_calls) or (None, None) if not found
"""
standalone_pattern = r'(?:^|\n)(npm\s+\w+|npx\s+\w+|mkdir\s+\w+|cd\s+\w+|git\s+\w+)(?:\s|$)'
standalone_matches = list(re.finditer(standalone_pattern, text, re.MULTILINE))
if standalone_matches:
commands = [match.group(1).strip() for match in standalone_matches]
if commands:
combined_command = ' && '.join(commands)
args_dict = {"command": combined_command}
args_dict = ensure_tool_arguments("bash", args_dict)
return "", [{
"id": "call_1",
"type": "function",
"function": {
"name": "bash",
"arguments": json.dumps(args_dict)
}
}]
return None, None
def _parse_urls(text: str) -> Tuple[Optional[str], Optional[List[Dict[str, Any]]]]:
"""Parse URLs for webfetch tool.
Args:
text: Model output text
Returns:
Tuple of (content_without_tools, tool_calls) or (None, None) if not found
"""
url_pattern = r'https?://[^\s<>"\')\]]+[a-zA-Z0-9]'
url_matches = list(re.finditer(url_pattern, text))
if url_matches:
urls = [match.group(0) for match in url_matches]
if urls:
tool_calls = []
for i, url in enumerate(urls):
tool_calls.append({
"id": f"call_{i+1}",
"type": "function",
"function": {
"name": "webfetch",
"arguments": json.dumps({"url": url, "format": "markdown"})
}
})
return "", tool_calls
return None, None
def parse_tool_calls(text: str) -> Tuple[str, Optional[List[Dict[str, Any]]]]:
"""Parse tool calls from model output using multiple formats.
Supports multiple formats for compatibility with different model sizes:
1. Standard: TOOL: name\nARGUMENTS: {"key": "value"}
2. Markdown: ```bash command```
3. Command lines: npm install, git clone, etc.
4. Standalone commands
5. URLs: for webfetch tool
Args:
text: Model output text
Returns:
Tuple of (content_without_tools, tool_calls or None)
"""
# Priority 1: Standard format
result = _parse_standard_format(text)
if result[1] is not None:
return result[0] or "", result[1]
# Priority 2: Markdown code blocks
result = _parse_markdown_format(text)
if result[1] is not None:
return result[0] or "", result[1]
# Priority 3: Command lines
result = _parse_command_lines(text)
if result[1] is not None:
return result[0] or "", result[1]
# Priority 4: Standalone commands
result = _parse_standalone_commands(text)
if result[1] is not None:
return result[0] or "", result[1]
# Priority 5: URLs
result = _parse_urls(text)
if result[1] is not None:
return result[0] or "", result[1]
return text, None
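As a hedged sketch of the highest-priority format, the standard `TOOL:` header can be extracted with a JSON decoder so braces nested inside the arguments survive; `extract_standard_calls` is illustrative only and not part of the module above.

```python
import json
import re

# Standalone sketch of priority 1 (standard TOOL: format). Using
# JSONDecoder.raw_decode instead of a brace regex keeps nested braces
# and '}' characters inside string values intact.
def extract_standard_calls(text):
    decoder = json.JSONDecoder()
    calls = []
    for m in re.finditer(r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*', text, re.IGNORECASE):
        try:
            args, _ = decoder.raw_decode(text, m.end())
        except json.JSONDecodeError:
            continue
        calls.append((m.group(1), args))
    return calls

sample = 'Listing files now.\nTOOL: bash\nARGUMENTS: {"command": "echo {ok}"}'
print(extract_standard_calls(sample))
```

Note that a pattern like `\{[^}]*\}` would stop at the `}` inside the command string, while `raw_decode` parses the whole object.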
@@ -4,7 +4,7 @@ Creates the appropriate backend based on hardware and platform.
"""
from typing import Optional
from hardware.detector import HardwareProfile, detect_hardware, calculate_gpu_layers
from hardware.detector import HardwareProfile, detect_hardware
from backends.base import LLMBackend
from backends.llamacpp import LlamaCppBackend
from backends.mlx import MLXBackend
@@ -31,17 +31,15 @@ def create_backend(hardware: Optional[HardwareProfile] = None) -> LLMBackend:
# Otherwise use llama.cpp (supports CUDA, ROCm, SYCL, CPU)
print("Using llama.cpp backend")
# Auto-configure GPU layers based on hardware
n_gpu_layers = calculate_gpu_layers(hardware.gpu)
# Determine GPU layers
if hardware.gpu and not hardware.is_apple_silicon:
# Has external GPU, offload all layers
n_gpu_layers = -1
print(f" GPU detected: {hardware.gpu.name}")
if hardware.gpu.is_nvidia:
print(f" Compute capability: {hardware.gpu.compute_capability or 'unknown'}")
if hardware.gpu.device_count > 1:
print(f" GPU count: {hardware.gpu.device_count}")
print(f" Offloading {n_gpu_layers} layers to GPU")
print(f" Offloading all layers to GPU")
else:
# CPU only
n_gpu_layers = 0
print(f" No GPU detected, using CPU")
return LlamaCppBackend(n_gpu_layers=n_gpu_layers)
@@ -1,97 +0,0 @@
"""Chatlog for debugging tool execution.
Writes a human-readable markdown log of tool calls and results.
Enabled by setting LOCAL_SWARM_CHATLOG=1 environment variable.
Log file defaults to 'chatlog.md' in the current working directory.
"""
import os
import json
from datetime import datetime
from typing import Optional
class ChatLogger:
"""Logs chat interactions and tool execution in opencode-style format."""
def __init__(self, log_path: Optional[str] = None):
self.log_path = log_path or os.getenv('LOCAL_SWARM_CHATLOG_PATH', 'chatlog.md')
self.enabled = os.getenv('LOCAL_SWARM_CHATLOG', '0') == '1'
if self.enabled:
self._initialize_log()
def _initialize_log(self):
"""Create log file with header if it doesn't exist."""
dir_path = os.path.dirname(self.log_path) or '.'
os.makedirs(dir_path, exist_ok=True)
with open(self.log_path, 'a') as f:
f.write(f"\n\n# Local Swarm Session - {datetime.now().isoformat()}\n\n")
def _timestamp(self) -> str:
"""Get current timestamp."""
return datetime.now().strftime("%H:%M:%S")
def log_user_message(self, content: str):
"""Log a user message."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] User\n\n")
f.write(f"{content}\n\n")
def log_assistant_message(self, content: str, has_tool_calls: bool = False):
"""Log an assistant response."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] Assistant\n\n")
if has_tool_calls:
# Use thinking block for messages that contain tool calls
f.write(f"```thinking\n{content}\n```\n")
else:
f.write(f"{content}\n\n")
def log_tool_call(self, tool_name: str, arguments: dict, call_index: int = 1):
"""Log a tool execution request."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] Tool Call #{call_index}\n\n")
f.write(f"**Tool:** `{tool_name}`\n\n")
f.write(f"**Arguments:**\n")
try:
args_json = json.dumps(arguments, indent=2)
except Exception:
args_json = str(arguments)
f.write(f"```json\n{args_json}\n```\n")
def log_tool_result(self, tool_name: str, result: str, call_index: int = 1, success: bool = True):
"""Log a tool execution result."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] Tool Result #{call_index}\n\n")
status = "✓ Success" if success else "✗ Failed"
f.write(f"**Tool:** `{tool_name}` - {status}\n\n")
f.write(f"**Output:**\n")
f.write(f"```\n{result}\n```\n")
def log_system(self, message: str):
"""Log a system message."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] System\n\n")
f.write(f"> {message}\n\n")
# Global logger instance (lazy initialization handled per request)
_global_logger: Optional[ChatLogger] = None
def get_chat_logger() -> ChatLogger:
"""Get the global chat logger instance (creates one if needed)."""
global _global_logger
if _global_logger is None:
_global_logger = ChatLogger()
return _global_logger
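Usage can be sketched as follows; the `ChatLogger` here is a minimal one-method stand-in mirroring the class above, and the temp-file path is illustrative.

```python
import os
import tempfile
from datetime import datetime

# Minimal stand-in for ChatLogger (one method only), showing the env-var
# switches that enable it and pick the log path.
class ChatLogger:
    def __init__(self, log_path=None):
        self.log_path = log_path or os.getenv('LOCAL_SWARM_CHATLOG_PATH', 'chatlog.md')
        self.enabled = os.getenv('LOCAL_SWARM_CHATLOG', '0') == '1'

    def log_user_message(self, content):
        if not self.enabled:
            return
        with open(self.log_path, 'a') as f:
            f.write(f"\n## [{datetime.now():%H:%M:%S}] User\n\n{content}\n\n")

os.environ['LOCAL_SWARM_CHATLOG'] = '1'
log_path = os.path.join(tempfile.mkdtemp(), 'chatlog.md')
os.environ['LOCAL_SWARM_CHATLOG_PATH'] = log_path
ChatLogger().log_user_message('hello swarm')
print(open(log_path).read())
```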
@@ -1,286 +0,0 @@
"""Main application runner for Local Swarm.
Handles the primary application modes: download-only, test, and full server mode.
"""
import asyncio
import sys
from typing import Optional
from models.selector import select_optimal_model, ModelConfig
from models.downloader import download_model_for_config
from swarm import SwarmManager
from api import create_server
from api.routes import set_federated_swarm
from interactive import (
interactive_model_selection,
show_startup_summary,
show_runtime_menu,
)
from network import create_discovery_service, FederatedSwarm
from tools.executor import ToolExecutor, set_tool_executor
from utils.network import get_local_ip
class MainRunner:
"""Runs the main application logic."""
def __init__(self, hardware, args):
"""Initialize the main runner.
Args:
hardware: Hardware profile
args: Parsed command line arguments
"""
self.hardware = hardware
self.args = args
self.config: Optional[ModelConfig] = None
self.swarm: Optional[SwarmManager] = None
self.discovery = None
self.federated_swarm = None
self.mcp_server = None
async def run(self) -> int:
"""Run the main application.
Returns:
Exit code (0 for success, 1 for error)
"""
# Get configuration
self.config = self._get_configuration()
if not self.config:
return 1
# Handle download-only mode
if self.args.download_only:
return await self._run_download_mode()
# Handle test mode
if self.args.test:
return await self._run_test_mode()
# Run full server mode
return await self._run_server_mode()
def _get_configuration(self) -> Optional[ModelConfig]:
"""Get the model configuration."""
if self.args.model or self.args.instances or self.args.auto:
return self._get_auto_config()
else:
return interactive_model_selection(self.hardware)
def _get_auto_config(self) -> Optional[ModelConfig]:
"""Get auto-detected configuration."""
print("\n📊 Calculating optimal configuration...")
try:
config = select_optimal_model(
self.hardware,
preferred_model=self.args.model,
force_instances=self.args.instances,
use_mlx=None # Auto-detect based on hardware
)
if not config:
print("\n❌ No suitable model found for your hardware")
print(" Minimum requirement: 2 GB available memory")
return None
print(f"\n✓ Selected: {config.display_name}")
print(f" Instances: {config.instances}")
print(f" Memory: {config.total_memory_gb:.1f} GB")
return config
except Exception as e:
print(f"\n❌ Error selecting model: {e}", file=sys.stderr)
return None
async def _run_download_mode(self) -> int:
"""Run download-only mode."""
print("\n" + "=" * 70)
print("⬇️ Download Mode: Downloading model only")
print("=" * 70)
try:
model_path = download_model_for_config(self.config)
print(f"✓ Model downloaded to: {model_path}")
print("\n" + "=" * 70)
print("✅ Download complete")
print("=" * 70)
return 0
except Exception as e:
print(f"\n❌ Download failed: {e}", file=sys.stderr)
return 1
async def _run_test_mode(self) -> int:
"""Run test mode with sample prompt."""
from cli.test_runner import run_test
return await run_test(self.hardware, self.config)
async def _run_server_mode(self) -> int:
"""Run full server mode."""
show_startup_summary(self.hardware, self.config)
# Setup swarm
if not await self._setup_swarm():
return 1
# Initialize tool executor
self._setup_tool_executor()
# Show updated summary with runtime info
show_startup_summary(self.hardware, self.config, self.swarm)
# Initialize federation if enabled
if self.args.federation:
await self._setup_federation()
# Start MCP server if enabled
if self.args.mcp:
await self._setup_mcp()
# Run server
return await self._run_server()
async def _setup_swarm(self) -> bool:
"""Setup the swarm.
Returns:
True if successful
"""
print("\n⬇️ Downloading model...")
try:
model_path = download_model_for_config(self.config)
print(f"✓ Model ready at: {model_path}")
except Exception as e:
print(f"\n❌ Error downloading model: {e}", file=sys.stderr)
return False
print("\n🚀 Initializing swarm...")
try:
self.swarm = SwarmManager(
model_config=self.config,
hardware=self.hardware,
consensus_strategy="similarity"
)
success = await self.swarm.initialize(str(model_path))
if not success:
print("❌ Failed to initialize swarm")
return False
return True
except Exception as e:
print(f"\n❌ Error initializing swarm: {e}", file=sys.stderr)
return False
def _setup_tool_executor(self) -> None:
"""Setup the tool executor."""
if self.args.tool_host is not None:
if self.args.tool_host == "":
tool_host_url = f"http://{get_local_ip()}:17616"
print(f"\n🔧 Using remote tool host: {tool_host_url} (auto-detected)")
else:
tool_host_url = self.args.tool_host
print(f"\n🔧 Using remote tool host: {tool_host_url}")
executor = ToolExecutor(tool_host_url=tool_host_url)
else:
executor = ToolExecutor(tool_host_url=None)
print("\n🔧 Tool Server: Local")
set_tool_executor(executor)
async def _setup_federation(self) -> None:
"""Setup federation."""
print("\n🌐 Initializing federation...")
try:
advertise_ip = self.args.host if self.args.host else None
self.discovery = await create_discovery_service(
self.args.port,
advertise_ip=advertise_ip
)
swarm_info = {
"version": "0.1.0",
"instances": self.config.instances,
"model_id": self.config.model_id,
"hardware_summary": f"{self.hardware.cpu_cores} CPU, {self.hardware.ram_gb:.1f}GB RAM"
}
await self.discovery.start_advertising(swarm_info)
await self.discovery.start_listening()
# Add manual peers
if self.args.peers:
await self._add_manual_peers()
self.federated_swarm = FederatedSwarm(self.swarm, self.discovery)
set_federated_swarm(self.federated_swarm)
# Start health check loop
asyncio.create_task(
self.discovery.start_health_check_loop(interval_seconds=10)
)
print(f" ✓ Federation enabled")
print(f" ✓ Discovery active on port {self.discovery.discovery_port}")
print(f" ✓ Peer health checks every 10s")
except Exception as e:
print(f" ⚠️ Failed to initialize federation: {e}")
print(" Continuing without federation...")
async def _add_manual_peers(self) -> None:
"""Add manual peers from command line."""
print(f" 📍 Adding {len(self.args.peers)} manual peer(s)...")
from network.discovery import PeerInfo
from datetime import datetime
for peer_str in self.args.peers:
try:
host, port = peer_str.rsplit(':', 1)
port = int(port)
peer = PeerInfo(
host=host,
port=port,
name=f"manual_{host}_{port}",
version="0.1.0",
instances=0,
model_id="unknown",
hardware_summary="manual",
last_seen=datetime.now()
)
self.discovery.peers[peer.name] = peer
print(f" ✓ Added peer: {host}:{port}")
except Exception as e:
print(f" ⚠️ Failed to add peer {peer_str}: {e}")
async def _setup_mcp(self) -> None:
"""Setup MCP server."""
print("\n🤖 Starting MCP server...")
from mcp_server import create_mcp_server
self.mcp_server = await create_mcp_server(self.swarm)
print(" MCP server active (stdio)")
async def _run_server(self) -> int:
"""Run the API server."""
from cli.server_runner import ServerRunner
runner = ServerRunner(
self.swarm,
self.discovery,
self.federated_swarm,
self.args
)
try:
return await runner.run()
finally:
await self._shutdown()
async def _shutdown(self) -> None:
"""Shutdown all services."""
if self.federated_swarm:
await self.federated_swarm.close()
if self.discovery:
await self.discovery.stop()
if self.swarm:
await self.swarm.shutdown()
@@ -1,151 +0,0 @@
"""CLI argument parsing for Local Swarm."""
import argparse
from typing import Optional
def create_parser() -> argparse.ArgumentParser:
"""Create and configure the argument parser."""
parser = argparse.ArgumentParser(
description="Local Swarm - AI-powered coding LLM swarm",
formatter_class=argparse.RawDescriptionHelpFormatter,
epilog="""
Examples:
python main.py # Interactive setup and start
python main.py --auto # Auto-detect and start without menu
python main.py --detect # Show hardware detection only
python main.py --model qwen:3b:q4 # Use specific model (skip menu)
python main.py --port 17615 # Use custom port (default: 17615)
python main.py --host 192.168.1.5 # Bind to specific IP
python main.py --instances 4 # Force number of instances
python main.py --download-only # Download model only
python main.py --test # Test with sample prompt
python main.py --mcp # Enable MCP server
python main.py --federation # Enable federation with other instances
python main.py --federation --peer 192.168.1.10:17615 # Manual peer
"""
)
# Mode options
parser.add_argument(
"--auto",
action="store_true",
help="Auto-detect best configuration without interactive menu"
)
parser.add_argument(
"--detect",
action="store_true",
help="Show hardware detection and exit"
)
# Model options
parser.add_argument(
"--model",
type=str,
help="Model to use (format: name:size:quant, e.g., qwen:3b:q4)"
)
parser.add_argument(
"--instances",
type=int,
help="Force number of instances (overrides auto-calculation)"
)
# Server options
parser.add_argument(
"--port",
type=int,
default=17615,
help="Port to run the API server on (default: 17615)"
)
parser.add_argument(
"--host",
type=str,
default=None,
help="Host IP to bind to (default: auto-detect)"
)
# Operation modes
parser.add_argument(
"--download-only",
action="store_true",
help="Download models only, don't start server"
)
parser.add_argument(
"--test",
action="store_true",
help="Test with a sample prompt"
)
parser.add_argument(
"--mcp",
action="store_true",
help="Enable MCP server alongside HTTP API"
)
# Configuration
parser.add_argument(
"--config",
type=str,
default="config.yaml",
help="Path to config file"
)
# Federation options
parser.add_argument(
"--federation",
action="store_true",
help="Enable federation with other Local Swarm instances on the network"
)
parser.add_argument(
"--peer",
action="append",
dest="peers",
help="Manually add a peer (format: host:port, can be used multiple times)"
)
# Tool server options
parser.add_argument(
"--tool-server",
action="store_true",
help="Run as dedicated tool execution server (executes read/write/bash tools)"
)
parser.add_argument(
"--tool-port",
type=int,
default=17616,
help="Port for tool execution server (default: 17616)"
)
parser.add_argument(
"--tool-host",
type=str,
default=None,
nargs='?',
const='',
help="URL of tool execution server. Use without value for auto-detected local IP"
)
parser.add_argument(
"--use-opencode-tools",
action="store_true",
help="Use opencode's tool definitions (~27k tokens). Default: use local tool server"
)
# Version
parser.add_argument(
"--version",
action="version",
version="%(prog)s 0.1.0"
)
return parser
def parse_args(args: Optional[list] = None):
"""Parse command line arguments.
Args:
args: Command line arguments (defaults to sys.argv)
Returns:
Parsed arguments namespace
"""
parser = create_parser()
return parser.parse_args(args)
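A condensed sketch of how the parser above behaves (only three of the flags reproduced; the snippet stands alone):

```python
import argparse

# Mirror a few of the flags from create_parser for demonstration
parser = argparse.ArgumentParser(prog='local-swarm')
parser.add_argument('--federation', action='store_true')
parser.add_argument('--peer', action='append', dest='peers')
parser.add_argument('--port', type=int, default=17615)

# --peer can repeat; each occurrence appends to args.peers
args = parser.parse_args(['--federation', '--peer', '192.168.1.10:17615'])
print(args.federation, args.peers, args.port)
```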
@@ -1,103 +0,0 @@
"""Server runner for Local Swarm."""
import asyncio
from typing import Optional
from api import create_server
from api.routes import set_federated_swarm
from utils.network import get_local_ip
class ServerRunner:
"""Handles server startup and shutdown."""
def __init__(self, swarm, discovery, federated_swarm, args):
"""Initialize server runner.
Args:
swarm: Swarm manager instance
discovery: Discovery service (optional)
federated_swarm: Federated swarm (optional)
args: Command line arguments
"""
self.swarm = swarm
self.discovery = discovery
self.federated_swarm = federated_swarm
self.args = args
self.mcp_server = None
async def run(self) -> int:
"""Run the server.
Returns:
Exit code
"""
print("\n🌐 Starting HTTP API server...")
# Determine host
host = self._get_host()
# Show tool mode
self._show_tool_mode()
# Create and start server
server = create_server(
self.swarm,
host=host,
port=self.args.port,
use_opencode_tools=self.args.use_opencode_tools
)
self._print_connection_info(host)
# Start server
try:
await server.start()
finally:
await self._shutdown()
return 0
def _get_host(self) -> str:
"""Get the host to bind to."""
if self.args.host:
print(f"🔗 Using specified host: {self.args.host}:{self.args.port}")
return self.args.host
else:
host = get_local_ip()
print(f"🔗 Binding to {host}:{self.args.port}")
return host
def _show_tool_mode(self) -> None:
"""Display tool mode information."""
if self.args.use_opencode_tools:
print(f"🔧 Tool mode: opencode tools (~27k tokens)")
else:
print(f"🔧 Tool mode: local tool server (~125 tokens)")
def _print_connection_info(self, host: str) -> None:
"""Print server connection information."""
print(f"\n✅ Local Swarm is running!")
print(f" API: http://{host}:{self.args.port}/v1")
print(f" Health: http://{host}:{self.args.port}/health")
if self.args.federation and self.discovery:
peers = self.discovery.get_peers()
print(f"\n🌐 Federation: Enabled")
print(f" Discovery port: {self.discovery.discovery_port}")
if peers:
print(f" Peers discovered: {len(peers)}")
print(f"\n💡 Configure opencode to use:")
print(f' base_url: http://127.0.0.1:{self.args.port}/v1')
print(f' api_key: any (not used)')
print(f"\nPress Ctrl+C to stop...\n")
async def _shutdown(self) -> None:
"""Shutdown all services."""
if self.federated_swarm:
await self.federated_swarm.close()
if self.discovery:
await self.discovery.stop()
if self.swarm:
await self.swarm.shutdown()
@@ -1,81 +0,0 @@
"""Test mode runner for Local Swarm."""
import asyncio
from models.downloader import download_model_for_config
from swarm import SwarmManager
from interactive import show_startup_summary
async def run_test(hardware, config) -> int:
"""Run test mode with sample prompt.
Args:
hardware: Hardware profile
config: Model configuration
Returns:
Exit code (0 for success, 1 for error)
"""
print("\n" + "=" * 70)
print("🧪 Test Mode: Running sample inference")
print("=" * 70)
show_startup_summary(hardware, config)
# Download model
print("\n⬇️ Downloading model...")
try:
model_path = download_model_for_config(config)
print(f"✓ Model ready at: {model_path}")
except Exception as e:
print(f"\n❌ Error downloading model: {e}")
return 1
# Initialize swarm
print("\n🚀 Initializing swarm...")
try:
swarm = SwarmManager(
model_config=config,
hardware=hardware,
consensus_strategy="similarity"
)
success = await swarm.initialize(str(model_path))
if not success:
print("❌ Failed to initialize swarm")
return 1
except Exception as e:
print(f"\n❌ Error initializing swarm: {e}")
return 1
try:
# Test prompt
prompt = "Write a Python function to calculate factorial:"
print(f"\nPrompt: {prompt}\n")
print("Generating responses...\n")
result = await swarm.generate(prompt, max_tokens=200)
print("\n" + "=" * 70)
print("SELECTED RESPONSE:")
print("=" * 70)
print(result.selected_response.text)
print("\n" + "=" * 70)
print(f"Strategy: {result.strategy}")
print(f"Confidence: {result.confidence:.2f}")
print(f"Latency: {result.selected_response.latency_ms:.1f}ms")
print(f"Tokens/sec: {result.selected_response.tokens_per_second:.1f}")
# Show all responses
print("\nAll responses received:")
for i, resp in enumerate(result.all_responses):
preview = resp.text[:60].replace('\n', ' ')
print(f" Worker {i}: {preview}... ({resp.latency_ms:.1f}ms)")
print("\n" + "=" * 70)
print("✅ Test complete")
print("=" * 70)
return 0
finally:
await swarm.shutdown()
@@ -1,69 +0,0 @@
"""Tool server for Local Swarm.
Standalone tool execution server for distributed setups.
"""
import logging
from typing import Optional
from fastapi import FastAPI
import uvicorn
from tools.executor import ToolExecutor, set_tool_executor
logger = logging.getLogger(__name__)
def create_tool_server_app() -> FastAPI:
"""Create the tool server FastAPI application.
Returns:
Configured FastAPI application
"""
app = FastAPI(title="Local Swarm Tool Server")
@app.post("/v1/tools/execute")
async def execute_tool(request: dict):
tool_name = request.get("tool", "")
tool_args = request.get("arguments", {})
# Get the global executor
from tools.executor import get_tool_executor
executor = get_tool_executor()
if executor is None:
return {"result": "Error: No tool executor configured"}
result = await executor.execute(tool_name, tool_args)
return {"result": result}
@app.get("/health")
async def health():
return {"status": "healthy", "mode": "tool-server"}
return app
async def run_tool_server(host: str, port: int) -> None:
"""Run the tool server.
Args:
host: Host to bind to
port: Port to listen on
"""
# Initialize local tool executor
tool_executor = ToolExecutor(tool_host_url=None)
set_tool_executor(tool_executor)
app = create_tool_server_app()
print(f"🔗 Tool server running at http://{host}:{port}")
print(f" Endpoints:")
print(f" - POST /v1/tools/execute")
print(f" - GET /health")
print(f"\n✅ Tool server ready!")
config = uvicorn.Config(app, host=host, port=port, log_level="warning")
server = uvicorn.Server(config)
await server.serve()
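A hedged sketch of the `POST /v1/tools/execute` payload shapes; the handler below is a local stub standing in for the FastAPI endpoint, which in a real deployment would dispatch to the configured `ToolExecutor`.

```python
import asyncio
import json

# Stub mirroring the request/response contract of /v1/tools/execute
async def execute_tool(request: dict) -> dict:
    tool_name = request.get("tool", "")
    tool_args = request.get("arguments", {})
    # A real server would call the ToolExecutor here
    return {"result": f"stub ran {tool_name} with {json.dumps(tool_args)}"}

resp = asyncio.run(execute_tool({"tool": "bash", "arguments": {"command": "ls"}}))
print(resp["result"])
```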
@@ -2,7 +2,6 @@
from dataclasses import dataclass
from typing import Optional, List
import os
import platform
import psutil
@@ -18,8 +17,6 @@ class GPUInfo:
is_nvidia: bool = False
is_amd: bool = False
is_mobile: bool = False
compute_capability: Optional[str] = None # CUDA compute capability
device_count: int = 1 # Number of GPUs available
@dataclass
@@ -73,55 +70,10 @@ class HardwareProfile:
return self.available_memory_gb
def is_android() -> bool:
"""Check if running on Android (beyond just Termux)."""
# Check multiple Android indicators
# 1. Check for Android-specific environment variables
android_env_vars = [
"ANDROID_ROOT",
"ANDROID_DATA",
"ANDROID_ART_ROOT",
"ANDROID_I18N_ROOT",
"ANDROID_TZDATA_ROOT",
]
if any(os.environ.get(var) for var in android_env_vars):
return True
# 2. Check for Android-specific paths
android_paths = [
"/system/build.prop",
"/system/bin/app_process",
"/data/data",
]
if any(os.path.exists(path) for path in android_paths):
return True
# 3. Check for Termux (which runs on Android)
if _is_android_or_termux():
return True
# 4. Check /proc/sys/kernel/osrelease for Android
try:
if os.path.exists("/proc/sys/kernel/osrelease"):
with open("/proc/sys/kernel/osrelease", "r") as f:
release = f.read().lower()
if "android" in release:
return True
except Exception:
pass
return False
def detect_os() -> str:
"""Detect the operating system."""
system = platform.system().lower()
# Check for Android first (reports as Linux)
if system == "linux" and is_android():
return "android"
elif system == "darwin":
if system == "darwin":
return "darwin"
elif system == "windows":
return "windows"
@@ -180,14 +132,6 @@ def detect_nvidia_gpu() -> Optional[GPUInfo]:
except Exception:
driver = None
# Get compute capability
compute_capability = None
try:
major, minor = pynvml.nvmlDeviceGetCudaComputeCapability(handle)
compute_capability = f"{major}.{minor}"
except Exception:
pass
return GPUInfo(
name=name,
vram_gb=vram_gb,
@@ -195,9 +139,7 @@ def detect_nvidia_gpu() -> Optional[GPUInfo]:
device_id=0,
is_nvidia=True,
is_apple_silicon=False,
is_amd=False,
compute_capability=compute_capability,
device_count=device_count
is_amd=False
)
finally:
pynvml.nvmlShutdown()
@@ -277,78 +219,6 @@ def detect_gpu() -> Optional[GPUInfo]:
return None
def calculate_gpu_layers(gpu: Optional[GPUInfo]) -> int:
"""Calculate optimal number of GPU layers to offload.
Args:
gpu: GPU information (None if no GPU)
Returns:
Number of layers to offload (-1 = all, 0 = CPU only)
"""
if gpu is None:
return 0
if gpu.is_apple_silicon:
# Apple Silicon: offload all layers (unified memory)
return -1
if gpu.is_nvidia:
# NVIDIA: Check compute capability for compatibility
if gpu.compute_capability:
major, _ = gpu.compute_capability.split('.')
if int(major) < 5:
# Very old GPUs (Kepler and earlier) may have issues
return 0
# Multi-GPU support: use device_count to determine layers
# For now, offload all layers if we have any NVIDIA GPU
return -1
if gpu.is_amd:
# AMD: ROCm support varies, be conservative
return -1
# Unknown GPU type: use CPU
return 0
def validate_gpu_layers(requested_layers: int, gpu: Optional[GPUInfo]) -> int:
"""Validate and adjust requested GPU layers.
Args:
requested_layers: Requested number of layers (-1 = all)
gpu: GPU information
Returns:
Validated layer count
"""
if requested_layers == 0:
return 0
if gpu is None:
if requested_layers != 0:
raise ValueError(
f"Requested {requested_layers} GPU layers but no GPU detected. "
"Use n_gpu_layers=0 for CPU-only mode."
)
return 0
if gpu.is_apple_silicon:
# Apple Silicon always uses all layers
return -1
if gpu.is_nvidia and gpu.compute_capability:
major, _ = gpu.compute_capability.split('.')
if int(major) < 5:
raise ValueError(
f"NVIDIA GPU {gpu.name} has compute capability {gpu.compute_capability}. "
f"Minimum required is 5.0. Use n_gpu_layers=0 for CPU mode."
)
return requested_layers
def detect_hardware() -> HardwareProfile:
"""Detect complete hardware profile."""
os_name = detect_os()
@@ -10,64 +10,6 @@ from typing import Optional
from hardware.detector import GPUInfo
# Android-specific file paths for common operations
ANDROID_PATHS = {
"termux_home": "/data/data/com.termux/files/home",
"termux_usr": "/data/data/com.termux/files/usr",
"termux_bin": "/data/data/com.termux/files/usr/bin",
"shared_storage": "/sdcard",
"android_data": "/data/data",
}
def get_android_path(path_type: str, subpath: str = "") -> str:
"""Get Android-specific file path.
Args:
path_type: Type of path (termux_home, shared_storage, etc.)
subpath: Additional path components
Returns:
Full path string
"""
base = ANDROID_PATHS.get(path_type, path_type)
if subpath:
return os.path.join(base, subpath)
return base
def normalize_path_for_android(path: str) -> str:
"""Normalize a path for Android/Termux environment.
Args:
path: Original path
Returns:
Normalized path for Android
"""
# Expand user home directory properly on Android
if path.startswith("~/"):
if is_termux():
home = ANDROID_PATHS["termux_home"]
else:
home = os.environ.get("HOME", "/")
path = os.path.join(home, path[2:])
# Handle /sdcard paths
if path.startswith("/sdcard") and not os.path.exists("/sdcard"):
# Try alternative storage paths
alternatives = [
"/storage/emulated/0",
"/storage/self/primary",
]
for alt in alternatives:
if os.path.exists(alt):
path = path.replace("/sdcard", alt, 1)
break
return os.path.normpath(path)
def is_termux() -> bool:
"""Check if running in Termux environment."""
return (
@@ -1,122 +0,0 @@
"""Configuration selection for Local Swarm interactive mode."""
from typing import List, Optional, Tuple
from hardware.detector import HardwareProfile
from models.registry import Model, list_models
from models.selector import ModelConfig, select_optimal_model, calculate_max_instances
from interactive.ui import print_section, MenuOption, display_menu
def get_recommended_config(
hardware: HardwareProfile,
context_size: int = 32768,
offload_percent: float = 0.0
) -> Optional[ModelConfig]:
"""Get the recommended configuration for the hardware with context and offload settings."""
use_mlx = hardware.is_apple_silicon if hardware else False
return select_optimal_model(
hardware,
context_size=context_size,
offload_percent=offload_percent,
use_mlx=use_mlx
)
def list_available_configurations(
hardware: HardwareProfile,
context_size: int = 32768,
offload_percent: float = 0.0
) -> List[Tuple[str, ModelConfig]]:
"""List all feasible configurations for the hardware with context and offload settings."""
from models.selector import calculate_memory_with_offload, get_available_memory_with_offload
configs = []
available_vram, available_ram = get_available_memory_with_offload(hardware, offload_percent)
# Use MLX models on Apple Silicon
use_mlx = hardware.is_apple_silicon if hardware else False
is_mac = use_mlx
for model in list_models(use_mlx=use_mlx):
for variant in model.variants:
for quant in variant.quantizations:
# Calculate memory with context and offload
if 'bit' in quant.name:
quantization_bits = int(quant.name.replace('bit', ''))
elif 'q4' in quant.name:
quantization_bits = 4
elif 'q5' in quant.name:
quantization_bits = 5
elif 'q6' in quant.name:
quantization_bits = 6
else:
quantization_bits = 4
vram_per_instance, ram_per_instance = calculate_memory_with_offload(
quant.vram_gb, context_size, offload_percent, quantization_bits
)
# Check if at least 1 instance fits in VRAM
if vram_per_instance <= available_vram:
if is_mac:
num_responses = 3
total_memory = vram_per_instance + ram_per_instance
else:
num_responses = calculate_max_instances(available_vram, vram_per_instance)
total_memory = (vram_per_instance + ram_per_instance) * num_responses
config = ModelConfig(
model=model,
variant=variant,
quantization=quant,
instances=num_responses,
memory_per_instance_gb=vram_per_instance + ram_per_instance,
total_memory_gb=total_memory,
context_size=context_size,
offload_percent=offload_percent,
vram_usage_gb=vram_per_instance,
ram_usage_gb=ram_per_instance
)
ctx_label = model.context_label
label = f"{model.name} [{ctx_label}] {variant.size} ({quant.name})"
configs.append((label, config))
return configs
def select_context_size() -> int:
"""Let user select context window size."""
print_section("Context Size Selection")
print(" Context window determines how much text the model can process at once.")
print(" Larger context = more memory usage but can handle longer code files.\n")
options = [
MenuOption("1", "16K tokens", "Good for small code files"),
MenuOption("2", "32K tokens (Recommended)", "Best balance for most users"),
MenuOption("3", "64K tokens", "Large codebases"),
MenuOption("4", "128K tokens", "Very large files (uses more memory)"),
]
choice = display_menu(options, "Select Context Size")
context_map = {"1": 16384, "2": 32768, "3": 65536, "4": 131072}
return context_map.get(choice, 32768)
def select_offload_option() -> float:
"""Let user select offloading option."""
print_section("Memory Offloading")
print(" Offloading moves some model layers to system RAM.")
print(" This allows larger models/contexts but may be slower.\n")
options = [
MenuOption("1", "No offload (Default)", "100% GPU VRAM - fastest"),
MenuOption("2", "20% offload", "80% GPU + 20% RAM - balanced"),
MenuOption("3", "50% offload", "50% GPU + 50% RAM - maximum capacity"),
]
choice = display_menu(options, "Select Offloading")
offload_map = {"1": 0.0, "2": 0.2, "3": 0.5}
return offload_map.get(choice, 0.0)
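A quick sketch of how the two selections above combine before being passed to `get_recommended_config` (`resolve_choices` is hypothetical; the maps mirror the menus):

```python
CONTEXT_MAP = {"1": 16384, "2": 32768, "3": 65536, "4": 131072}
OFFLOAD_MAP = {"1": 0.0, "2": 0.2, "3": 0.5}

def resolve_choices(context_choice: str, offload_choice: str):
    """Translate raw menu keys into (context_size, offload_percent),
    falling back to the same defaults the selectors above use."""
    return (CONTEXT_MAP.get(context_choice, 32768),
            OFFLOAD_MAP.get(offload_choice, 0.0))
```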
-87
@@ -1,87 +0,0 @@
"""Display functions for Local Swarm interactive mode.
Hardware info and resource usage display.
"""
from typing import Optional
from hardware.detector import HardwareProfile
from interactive.ui import print_section
def print_hardware_info(hardware: HardwareProfile) -> None:
"""Print detailed hardware information."""
print_section("Hardware Detection")
print(f" Operating System: {hardware.os.capitalize()}")
print(f" CPU: {hardware.cpu_cores} cores")
print(f" System RAM: {hardware.ram_gb:.1f} GB")
print(f" Available RAM: {hardware.ram_available_gb:.1f} GB")
if hardware.gpu:
print(f"\n GPU Detected:")
print(f" Name: {hardware.gpu.name}")
if hardware.is_apple_silicon:
print(f" Type: Apple Silicon (Unified Memory)")
print(f" Total Memory: {hardware.gpu.vram_gb:.1f} GB")
else:
print(f" Type: {hardware.gpu.name}")
print(f" VRAM: {hardware.gpu.vram_gb:.1f} GB")
if hardware.gpu.driver_version:
print(f" Driver: {hardware.gpu.driver_version}")
else:
print(f"\n GPU: None detected (CPU-only mode)")
if hardware.has_dedicated_gpu:
# Dedicated GPU: hard limit based on VRAM
print(f"\n Available for LLMs: {hardware.available_memory_gb:.1f} GB")
print(f" (Using 100% of GPU VRAM minus buffer)")
elif hardware.is_apple_silicon:
# Apple Silicon: show recommendation vs limit (like CPU-only)
print(f"\n Recommended for LLMs: {hardware.recommended_memory_gb:.1f} GB (50% of unified memory)")
print(f" Maximum available: {hardware.available_memory_gb:.1f} GB (unified memory - 4GB safety)")
else:
# CPU-only: show recommendation vs limit
print(f"\n Recommended for LLMs: {hardware.recommended_memory_gb:.1f} GB (50% of RAM)")
print(f" Maximum available: {hardware.available_memory_gb:.1f} GB (system RAM - 4GB safety)")
def print_resource_usage(swarm_manager) -> None:
"""Print current resource usage if swarm is running."""
if swarm_manager is None:
return
print_section("Current Resource Usage")
status = swarm_manager.get_status()
workers = swarm_manager.get_worker_info()
print(f" Swarm Status: {'Running' if status.is_running else 'Stopped'}")
print(f" Model: {status.model_name}")
print(f" Workers: {status.healthy_workers}/{status.total_workers} healthy")
print(f" Consensus Strategy: {status.strategy}")
print(f" Memory Usage: {status.total_memory_gb:.2f} GB")
    if status.total_workers > 0:
        print(f" Memory per Worker: {status.total_memory_gb / status.total_workers:.2f} GB")
    else:
        print(" Memory per Worker: N/A")
if workers:
print(f"\n Worker Details:")
for w in workers:
            status_icon = "✓" if w.is_healthy else "✗"
# Show IP for remote workers
location = f" [{w.ip_address}]" if w.is_remote and w.ip_address else ""
print(f" [{status_icon}] {w.name}{location}: {w.backend_name}")
# Show live data if available
if w.is_generating:
                progress_bar = "█" * int(w.progress / 5) + "░" * (20 - int(w.progress / 5))
print(f" 🔄 Generating: {progress_bar} ({w.progress:.0f}%)")
print(f" 📏 Context: {w.context_used:,} tokens")
if w.last_output:
preview = w.last_output[:60].replace('\n', ' ')
print(f" 💬 Last: {preview}...")
if w.stats.total_requests > 0:
print(f" 📊 Requests: {w.stats.total_requests}")
print(f" ⏱️ Avg Latency: {w.stats.avg_latency_ms:.1f}ms")
print(f" 🚀 Tokens/sec: {w.stats.tokens_per_second:.1f}")
-226
@@ -1,226 +0,0 @@
"""Tips and help content for Local Swarm.
Educational content about models, quantization, and optimization.
"""
from hardware.detector import HardwareProfile
from interactive.ui import clear_screen, print_header, print_section
def show_model_recommendations():
"""Display model recommendations."""
clear_screen()
print_header("Model Recommendations")
print_section("Best Models for Coding (Ranked)")
print("""
🥇 Qwen 2.5 Coder - BEST OVERALL
• Excellent code completion and generation
• Strong performance even at smaller sizes (3B)
• Good at following instructions
• Recommended for most users
🥈 DeepSeek Coder - GREAT ALTERNATIVE
• Very capable on coding tasks
• Good balance of speed and quality
• Smaller 1.3B option for low-end hardware
🥉 CodeLlama - SOLID CHOICE
• Meta's dedicated code model
• Good performance, widely tested
• Larger sizes (13B+) for complex tasks
Other Good Options:
• Llama 3.2 - General model with good coding skills
• Phi-4 - Microsoft's efficient small model
• Gemma 2 - Google's open model
• StarCoder2 - Good for code completion
Which size to choose?
• 1-3B: Fast, good for simple tasks, low VRAM
• 7B: Sweet spot for most users
• 13-15B: Better quality, needs more VRAM
• 30B+: Best quality but very slow
""")
input("\n Press Enter to continue...")
def show_quantization_guide():
"""Display quantization guide."""
clear_screen()
print_header("Quantization Guide")
print_section("What is Quantization?")
print("""
Quantization compresses the model to use less memory.
Lower precision = smaller size = faster loading
But may reduce quality slightly.
""")
print_section("Quantization Levels")
print("""
Q4_K_M (Good) - RECOMMENDED FOR MOST USERS
• 4-bit quantization with medium quality
• ~70% smaller than original
• Minimal quality loss for coding
• Best speed/memory/quality balance
• Use this if unsure!
Q5_K_M (Better)
• 5-bit quantization with better quality
• ~60% smaller than original
• Better for complex reasoning
• Slightly more VRAM needed
Q6_K (Best)
• 6-bit quantization with highest quality
• ~50% smaller than original
• Close to original model quality
• Requires more VRAM
• Use if you have plenty of memory
When to use each:
• Q4_K_M: Default choice, works great
• Q5_K_M: If you have extra VRAM, want better quality
• Q6_K: If VRAM is abundant, want best quality
""")
print_section("Quick Reference")
print("""
Size comparison for 7B model:
• Original (FP16): ~14 GB
• Q6_K: ~6 GB
• Q5_K_M: ~5.2 GB
• Q4_K_M: ~4.5 GB
""")
input("\n Press Enter to continue...")
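The "Quick Reference" sizes above roughly follow params × bits / 8. A back-of-envelope check (the effective bit-widths here are assumptions, since K-quants mix precisions within a file):

```python
def approx_size_gb(params_billion: float, effective_bits: float) -> float:
    """Rough file size of a dense model: parameters * bits / 8.
    Real GGUF files add a small amount of metadata on top."""
    return params_billion * effective_bits / 8

# 7B at fp16 gives ~14 GB, matching the table above.
# Q4_K_M averages closer to ~5 effective bits, landing near the ~4.5 GB listed.
```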
def show_instance_tips(hardware: HardwareProfile):
"""Display tips for optimal instance count."""
clear_screen()
print_header("Instance Count Optimization")
print_section("What Are Instances?")
print("""
Each instance = one copy of the model running.
Multiple instances = multiple workers voting on answers.
More instances = better consensus but uses more memory.
""")
print_section("Recommended Instance Counts")
print(f"""
Based on your hardware ({hardware.available_memory_gb:.1f} GB available):
Minimum: 2 instances
• Required for consensus voting
• Detects bad/hallucinated responses
• Better than single model
Good Range: 3-5 instances
• Most common setup
• Good consensus quality
• Reasonable memory usage
• Recommended sweet spot
Maximum: 8 instances
• Best consensus quality
• Higher memory usage
• Diminishing returns after 5-6
• Use only if VRAM abundant
    Research Note:
    Studies show consensus with 3-5 models captures 85-90%
    of the benefit with minimal overhead; beyond 8 models,
    the improvement is negligible.
""")
print_section("Memory Calculation Example")
print(f"""
Your available memory: {hardware.available_memory_gb:.1f} GB
Example: 7B model at Q4_K_M (4.5 GB per instance)
• 2 instances: 9.0 GB used
• 3 instances: 13.5 GB used
• 4 instances: 18.0 GB used
Rule of thumb: Leave 10% buffer for overhead
""")
input("\n Press Enter to continue...")
def show_hardware_tips(hardware: HardwareProfile):
"""Display hardware-specific tips."""
clear_screen()
print_header("Hardware Optimization Tips")
print_section("Your Hardware Profile")
print(f"""
OS: {hardware.os.capitalize()}
CPU: {hardware.cpu_cores} cores
Available Memory: {hardware.available_memory_gb:.1f} GB
GPU: {hardware.gpu.name if hardware.gpu else "None (CPU mode)"}
""")
if hardware.is_apple_silicon:
print_section("Apple Silicon Tips")
print("""
✓ Using MLX backend (optimized for Metal)
✓ Unified memory architecture
✓ 50% of RAM allocated for LLMs
Tips:
• Use Q4_K_M quantization for best balance
• 7B models work great on 16GB+ Macs
• 3B models good for 8GB Macs
• M1/M2/M3 all supported
• Close other apps for best performance
""")
elif hardware.gpu and not hardware.is_apple_silicon:
print_section("Discrete GPU Tips")
print(f"""
✓ GPU: {hardware.gpu.name}
✓ Using 100% of VRAM
Tips:
• Install CUDA/ROCm drivers for acceleration
• Use Q4_K_M or Q5_K_M quantization
• Monitor GPU temperature during long runs
• Close GPU-intensive apps (games, etc.)
• 7B-13B models work well on 8-16GB VRAM
""")
else:
print_section("CPU-Only Tips")
print("""
✓ Running in CPU mode
✓ 50% of system RAM allocated
Tips:
• Use smaller models (3B-4B) for speed
• Use Q4_K_M quantization
• Fewer instances (2-3) recommended
• Expect slower generation than GPU
• Good for testing, not production use
• Consider cloud GPU for heavy use
""")
print_section("General Optimization")
print("""
Speed vs Quality:
• Smaller models (3B) = faster, less capable
• Larger models (7B+) = slower, smarter
• Q4 = faster, less precise
• Q6 = slower, more precise
Memory Management:
• Leave 10-20% RAM/VRAM free
• Close browsers and heavy apps
• Use swap if necessary (slower)
Best Practices:
• Start with recommended config
• Test with --test flag first
• Monitor memory usage
• Adjust instances based on performance
""")
input("\n Press Enter to continue...")
-63
@@ -1,63 +0,0 @@
"""UI utilities for Local Swarm interactive mode.
Terminal display helpers and formatting functions.
"""
import subprocess
import os
from typing import List
from dataclasses import dataclass
@dataclass
class MenuOption:
"""A menu option."""
key: str
label: str
description: str = ""
def clear_screen():
"""Clear the terminal screen."""
subprocess.run(['cls' if os.name == 'nt' else 'clear'], shell=True, check=False)
def print_header(title: str):
"""Print a formatted header."""
width = 70
print("=" * width)
print(f" {title}".ljust(width))
print("=" * width)
print()
def print_section(title: str):
"""Print a section title."""
    print(f"\n{'─' * 70}")
    print(f" {title}")
    print(f"{'─' * 70}")
def display_menu(options: List[MenuOption], title: str = "Menu") -> str:
"""Display a menu and return the user's choice.
Args:
options: List of menu options
title: Menu title
Returns:
Selected option key
"""
print_section(title)
for opt in options:
desc = f" - {opt.description}" if opt.description else ""
print(f" [{opt.key}] {opt.label}{desc}")
print()
while True:
choice = input(" Enter your choice: ").strip().lower()
valid_keys = [opt.key.lower() for opt in options]
if choice in valid_keys:
return choice
print(f" Invalid choice. Please enter one of: {', '.join(valid_keys)}")
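The validation step at the heart of the loop can be isolated for testing (`validate_choice` is illustrative, not part of the module):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MenuOption:
    """Mirror of the dataclass above."""
    key: str
    label: str
    description: str = ""

def validate_choice(raw: str, options: List[MenuOption]) -> bool:
    """Same normalization display_menu applies: strip whitespace,
    compare keys case-insensitively."""
    return raw.strip().lower() in [opt.key.lower() for opt in options]
```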
-107
@@ -1,107 +0,0 @@
"""Memory calculation utilities for model selection."""
from typing import Tuple
def calculate_context_memory(context_size: int, quantization_bits: int = 4) -> float:
"""Calculate additional memory needed for KV cache based on context size.
Args:
context_size: Number of tokens in context window
quantization_bits: Quantization bits (4 for Q4, 5 for Q5, etc.)
Returns:
Additional VRAM needed in GB
"""
    # KV cache memory per token: 2 * num_layers * hidden_dim * bytes_per_param.
    # Rough estimate used here: 0.5 MB per token per byte of KV-cache precision
    # (~0.25 MB/token at 4-bit, ~1 MB/token at fp16).
    bytes_per_token = (quantization_bits / 8) * 0.5  # MB per token at this precision
memory_mb = (context_size / 1000) * bytes_per_token * 1000
return memory_mb / 1024 # Convert to GB
def calculate_memory_with_offload(
base_vram_gb: float,
context_size: int,
offload_percent: float,
quantization_bits: int = 4
) -> Tuple[float, float]:
"""Calculate VRAM and RAM usage with offloading.
Args:
base_vram_gb: Base model VRAM without context
context_size: Context window size in tokens
offload_percent: Percentage of model offloaded to RAM (0.0-1.0)
quantization_bits: Quantization precision
Returns:
(vram_usage_gb, ram_usage_gb)
"""
# Context memory (KV cache) - always in VRAM for speed
context_memory = calculate_context_memory(context_size, quantization_bits)
# Model weights split between GPU and RAM
gpu_model_memory = base_vram_gb * (1 - offload_percent)
ram_model_memory = base_vram_gb * offload_percent
# Context cache stays in VRAM for performance
vram_total = gpu_model_memory + context_memory
ram_total = ram_model_memory
return vram_total, ram_total
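A worked example of the split above with the formula inlined (this mirrors the module's own KV-cache estimate, which is deliberately rough, not a general KV-cache formula):

```python
def split_memory(base_vram_gb: float, context_size: int,
                 offload_percent: float, quant_bits: int = 4):
    """Inline mirror of calculate_memory_with_offload above."""
    mb_per_token = (quant_bits / 8) * 0.5            # KV-cache MB per token
    context_gb = context_size * mb_per_token / 1024  # KV cache stays in VRAM
    vram = base_vram_gb * (1 - offload_percent) + context_gb
    ram = base_vram_gb * offload_percent
    return vram, ram

# 4.5 GB model, 32K context, 20% offload:
# ~8 GB of KV cache stays in VRAM, 0.9 GB of weights move to RAM.
```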
def get_available_memory_with_offload(
hardware,
offload_percent: float
) -> Tuple[float, float]:
"""Get available GPU VRAM and system RAM considering offloading.
Args:
hardware: Hardware profile
offload_percent: Offloading percentage
Returns:
(available_vram_gb, available_ram_gb)
"""
if hardware.gpu and not hardware.is_apple_silicon:
# External GPU - use GPU VRAM + potentially some system RAM
available_vram = hardware.gpu.vram_gb * 0.9 # 10% buffer
available_ram = hardware.ram_gb * 0.5 * offload_percent # Portion of RAM for offload
elif hardware.is_apple_silicon:
# Apple Silicon - unified memory
available_total = hardware.available_memory_gb
available_vram = available_total * (1 - offload_percent)
available_ram = available_total * offload_percent
else:
# CPU only - use full available memory (RAM - 4GB safety)
available_vram = hardware.available_memory_gb
available_ram = 0
return available_vram, available_ram
def calculate_max_instances(
available_vram: float,
vram_per_instance: float,
optimal: bool = False
) -> int:
"""Calculate maximum number of instances that fit in available VRAM.
Args:
available_vram: Available VRAM in GB
vram_per_instance: VRAM needed per instance
optimal: If True, cap at optimal max (5) instead of hard max (8)
Returns:
Maximum number of instances
"""
from models.selector import MAX_INSTANCES, OPTIMAL_MAX_INSTANCES, MEMORY_OVERHEAD_FACTOR
if vram_per_instance <= 0:
return 1
max_possible = int((available_vram * MEMORY_OVERHEAD_FACTOR) / vram_per_instance)
max_allowed = OPTIMAL_MAX_INSTANCES if optimal else MAX_INSTANCES
return max(1, min(max_possible, max_allowed))
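With the constants inlined (MEMORY_OVERHEAD_FACTOR=0.95, OPTIMAL_MAX_INSTANCES=5, MAX_INSTANCES=8, taken from the selector config shown later in this diff), the clamp behaves like:

```python
def max_instances(available_vram: float, vram_per_instance: float,
                  optimal: bool = False) -> int:
    """calculate_max_instances with the selector constants inlined."""
    if vram_per_instance <= 0:
        return 1
    # 5% buffer, then cap at the optimal (5) or hard (8) maximum
    max_possible = int((available_vram * 0.95) / vram_per_instance)
    return max(1, min(max_possible, 5 if optimal else 8))

# 24 GB card, 4.5 GB/instance -> 5 instances (22.8 usable GB / 4.5).
```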
+292 -201
@@ -1,12 +1,9 @@
"""Model registry for Local Swarm.
"""Model registry for Local Swarm."""
Loads model data from JSON configuration files.
"""
import json
from dataclasses import dataclass, field
from typing import Dict, List, Optional
from typing import Dict, List, Optional, Tuple
from pathlib import Path
import yaml
@dataclass
@@ -14,7 +11,7 @@ class QuantizationConfig:
"""Configuration for a specific quantization level."""
name: str
vram_gb: float
quality: str
quality: str # 'fast', 'good', 'better', 'best'
def __repr__(self) -> str:
return f"QuantizationConfig({self.name}, {self.vram_gb}GB, {self.quality})"
@@ -36,6 +33,7 @@ class ModelVariant:
def get_best_quantization_for_memory(self, available_gb: float) -> Optional[QuantizationConfig]:
"""Get the best quantization that fits in available memory."""
# Sort by quality (best first) then by VRAM (smallest first)
sorted_quants = sorted(
self.quantizations,
key=lambda q: (['fast', 'good', 'better', 'best'].index(q.quality), q.vram_gb),
@@ -55,8 +53,8 @@ class Model:
name: str
description: str
variants: List[ModelVariant]
priority: int = 100
max_context: int = 8192
priority: int = 100 # Lower = higher priority
max_context: int = 8192 # Maximum context window in tokens
def get_variant(self, size: str) -> Optional[ModelVariant]:
"""Get a specific size variant."""
@@ -69,6 +67,7 @@ class Model:
"""Get the largest variant available."""
if not self.variants:
return None
# Sort by base VRAM (largest first)
return sorted(self.variants, key=lambda v: v.base_vram_gb, reverse=True)[0]
@property
@@ -86,223 +85,315 @@ class Model:
return f"{self.max_context//1000}K"
class ModelRegistry:
"""Registry for loading and managing models from config files."""
# MLX quantization sizes (GB) based on mlx-community models
    # HARDCODED: These are verified to exist on HuggingFace mlx-community
# Last verified: 2025-02-23
# DO NOT make API calls on startup - use this hardcoded list
MLX_QUANT_SIZES = {
# Format: model_id: {variant_size: {quant_bit: vram_gb}}
# Only includes quantizations that actually exist on HF
"qwen2.5-coder": {
"3b": {"3bit": 1.3, "4bit": 1.7, "6bit": 2.5, "8bit": 3.3},
# 5bit does NOT exist for 3b
"7b": {"3bit": 3.1, "4bit": 4.1, "6bit": 6.1, "8bit": 8.1},
# 5bit does NOT exist for 7b
"14b": {"3bit": 6.2, "4bit": 8.2, "6bit": 12.2, "8bit": 16.2},
# 5bit does NOT exist for 14b
},
"deepseek-coder": {
"1.3b": {"4bit": 0.8, "6bit": 1.2},
# 3bit, 5bit, 8bit do NOT exist
"6.7b": {"4bit": 3.9, "6bit": 5.9, "8bit": 7.9},
# 3bit, 5bit do NOT exist
},
"codellama": {
"7b": {"4bit": 4.1, "6bit": 6.1, "8bit": 8.1},
# 3bit, 5bit do NOT exist
"13b": {"4bit": 7.6, "6bit": 11.4, "8bit": 15.2},
# 3bit, 5bit do NOT exist
},
"llama-3.2": {
"1b": {"4bit": 0.6, "8bit": 1.2},
# 3bit, 5bit, 6bit do NOT exist
"3b": {"4bit": 1.8, "6bit": 2.6, "8bit": 3.5},
# 3bit, 5bit do NOT exist
},
"phi-4": {
"4b": {"4bit": 2.4, "6bit": 3.6, "8bit": 4.8},
# 3bit, 5bit do NOT exist
},
"gemma-2": {
"2b": {"4bit": 1.2, "6bit": 1.8, "8bit": 2.4},
# 3bit, 5bit do NOT exist
"4b": {"4bit": 2.4, "6bit": 3.6, "8bit": 4.8},
# 3bit, 5bit do NOT exist
"9b": {"4bit": 5.3, "6bit": 7.9, "8bit": 10.5},
# 3bit, 5bit do NOT exist
},
"starcoder2": {
"3b": {"4bit": 1.8, "6bit": 2.6, "8bit": 3.5},
# 3bit, 5bit do NOT exist
"7b": {"4bit": 4.1, "6bit": 6.1, "8bit": 8.1},
# 3bit, 5bit do NOT exist
"15b": {"4bit": 8.8, "6bit": 13.2, "8bit": 17.6},
# 3bit, 5bit do NOT exist
},
}
# Quality mapping for MLX quantizations
MLX_QUALITY_MAP = {
"3bit": "fast",
"4bit": "good",
"5bit": "better",
"6bit": "best",
"8bit": "best",
}
# Base model metadata (without quantization-specific data)
MODEL_METADATA = {
"qwen2.5-coder": {
"name": "Qwen 2.5 Coder",
"description": "Alibaba's code-focused model, excellent for small sizes",
"priority": 1,
"max_context": 128000,
"variants": ["3b", "7b", "14b"],
},
"deepseek-coder": {
"name": "DeepSeek Coder",
"description": "DeepSeek's code model, good alternative",
"priority": 2,
"max_context": 16384,
"variants": ["1.3b", "6.7b"],
},
"codellama": {
"name": "CodeLlama",
"description": "Meta's code model",
"priority": 3,
"max_context": 16384,
"variants": ["7b", "13b"],
},
"llama-3.2": {
"name": "Llama 3.2",
"description": "Meta's latest general-purpose model with strong coding abilities",
"priority": 4,
"max_context": 128000,
"variants": ["1b", "3b"],
},
"phi-4": {
"name": "Phi-4",
"description": "Microsoft's efficient small model with excellent coding performance",
"priority": 5,
"max_context": 16384,
"variants": ["4b"],
},
"gemma-2": {
"name": "Gemma 2",
"description": "Google's open model, good for coding tasks",
"priority": 6,
"max_context": 8192,
"variants": ["2b", "4b", "9b"],
},
"starcoder2": {
"name": "StarCoder2",
"description": "BigCode's open code generation model",
"priority": 7,
"max_context": 8192,
"variants": ["3b", "7b", "15b"],
},
}
# GGUF quantization sizes (GB) - accurate sizes
GGUF_QUANT_SIZES = {
"qwen2.5-coder": {
"3b": {"q4_k_m": 1.8, "q5_k_m": 2.2, "q6_k": 2.6},
"7b": {"q4_k_m": 4.5, "q5_k_m": 5.2, "q6_k": 6.0},
"14b": {"q4_k_m": 8.8, "q5_k_m": 10.5},
},
"deepseek-coder": {
"1.3b": {"q4_k_m": 0.8, "q5_k_m": 1.0},
"6.7b": {"q4_k_m": 4.2, "q5_k_m": 5.0},
},
"codellama": {
"7b": {"q4_k_m": 4.5, "q5_k_m": 5.2},
"13b": {"q4_k_m": 8.0, "q5_k_m": 9.5},
},
"llama-3.2": {
"3b": {"q4_k_m": 1.9, "q5_k_m": 2.3, "q6_k": 2.7},
"1b": {"q4_k_m": 0.7, "q5_k_m": 0.9},
},
"phi-4": {
"4b": {"q4_k_m": 2.4, "q5_k_m": 2.9, "q6_k": 3.4},
},
"gemma-2": {
"2b": {"q4_k_m": 1.5, "q5_k_m": 1.8},
"4b": {"q4_k_m": 2.7, "q5_k_m": 3.2, "q6_k": 3.8},
"9b": {"q4_k_m": 5.5, "q5_k_m": 6.5},
},
"starcoder2": {
"3b": {"q4_k_m": 1.9, "q5_k_m": 2.3},
"7b": {"q4_k_m": 4.5, "q5_k_m": 5.2, "q6_k": 6.1},
"15b": {"q4_k_m": 9.2, "q5_k_m": 10.8},
},
}
# GGUF quality mapping
GGUF_QUALITY_MAP = {
"q4_k_m": "good",
"q5_k_m": "better",
"q6_k": "best",
}
def get_quantization_sizes(model_id: str, use_mlx: bool = False) -> Dict[str, Dict[str, float]]:
"""Get quantization sizes for a model."""
if use_mlx:
return MLX_QUANT_SIZES.get(model_id, {})
else:
return GGUF_QUANT_SIZES.get(model_id, {})
def get_quality_map(use_mlx: bool = False) -> Dict[str, str]:
"""Get quality mapping for quantizations."""
if use_mlx:
return MLX_QUALITY_MAP
else:
return GGUF_QUALITY_MAP
def build_model_variants(model_id: str, use_mlx: bool = False) -> List[ModelVariant]:
"""Build model variants with appropriate quantizations for the platform."""
metadata = MODEL_METADATA.get(model_id)
if not metadata:
return []
def __init__(self):
"""Initialize registry and load config files."""
self.config_dir = Path(__file__).parent.parent.parent / "config" / "models"
self._metadata: Dict = {}
self._mlx_sizes: Dict = {}
self._gguf_sizes: Dict = {}
self._load_configs()
quality_map = get_quality_map(use_mlx)
variants = []
def _load_configs(self) -> None:
"""Load all configuration files."""
# Load model metadata
metadata_path = self.config_dir / "model_metadata.json"
if metadata_path.exists():
with open(metadata_path, 'r') as f:
self._metadata = json.load(f)
for variant_size in metadata["variants"]:
quant_sizes = get_quantization_sizes(model_id, use_mlx).get(variant_size, {})
# Load MLX quantization sizes
mlx_path = self.config_dir / "mlx_quant_sizes.json"
if mlx_path.exists():
with open(mlx_path, 'r') as f:
self._mlx_sizes = json.load(f)
if not quant_sizes:
continue
# Load GGUF quantization sizes
gguf_path = self.config_dir / "gguf_quant_sizes.json"
if gguf_path.exists():
with open(gguf_path, 'r') as f:
self._gguf_sizes = json.load(f)
quantizations = []
for quant_name, vram_gb in quant_sizes.items():
quality = quality_map.get(quant_name, "good")
quantizations.append(QuantizationConfig(quant_name, vram_gb, quality))
        # Rough base-VRAM estimate: largest quantization size plus 50% headroom
base_vram = max(quant_sizes.values()) * 1.5
variants.append(ModelVariant(
size=variant_size,
base_vram_gb=base_vram,
quantizations=quantizations
))
def get_model(self, model_id: str, use_mlx: bool = False) -> Optional[Model]:
"""Get a model by ID."""
if model_id not in self._metadata:
return None
meta = self._metadata[model_id]
# Ensure meta is a dict (not a string like "_comment")
if not isinstance(meta, dict):
return None
sizes = self._mlx_sizes if use_mlx else self._gguf_sizes
quality_map = self._get_quality_map(use_mlx)
variants = []
for variant_size in meta.get("variants", []):
quantizations = []
size_data = sizes.get(model_id, {}).get(variant_size, {})
for quant_name, vram_gb in size_data.items():
quality = quality_map.get(quant_name, "good")
quantizations.append(QuantizationConfig(quant_name, vram_gb, quality))
if quantizations:
base_vram = min(q.vram_gb for q in quantizations)
variants.append(ModelVariant(variant_size, base_vram, quantizations))
return variants
def build_models(use_mlx: bool = False) -> Dict[str, Model]:
"""Build the model registry with platform-appropriate quantizations."""
models = {}
for model_id, metadata in MODEL_METADATA.items():
variants = build_model_variants(model_id, use_mlx)
if not variants:
return None
continue
return Model(
models[model_id] = Model(
id=model_id,
name=meta.get("name", model_id),
description=meta.get("description", ""),
name=metadata["name"],
description=metadata["description"],
variants=variants,
priority=meta.get("priority", 100),
max_context=meta.get("max_context", 8192)
priority=metadata["priority"],
max_context=metadata["max_context"],
)
def list_models(self, use_mlx: bool = False) -> List[Model]:
"""List all available models."""
models = []
for model_id, meta in self._metadata.items():
# Skip non-dict entries (like _comment)
if not isinstance(meta, dict):
continue
model = self.get_model(model_id, use_mlx)
if model:
models.append(model)
# Sort by priority
return sorted(models, key=lambda m: m.priority)
def _get_quality_map(self, use_mlx: bool) -> Dict[str, str]:
"""Get quality mapping for quantization types."""
if use_mlx:
return {
"3bit": "fast",
"4bit": "good",
"5bit": "better",
"6bit": "best",
"8bit": "best",
}
else:
return {
"q4_k_m": "good",
"q5_k_m": "better",
"q6_k": "best",
}
return models
# Global registry instance
_registry = ModelRegistry()
# Default models (GGUF format for llama.cpp)
DEFAULT_MODELS = build_models(use_mlx=False)
def get_model(model_id: str, use_mlx: bool = False) -> Optional[Model]:
"""Get a model by ID."""
return _registry.get_model(model_id, use_mlx)
"""Get a model by ID with platform-appropriate quantizations."""
if use_mlx:
models = build_models(use_mlx=True)
return models.get(model_id)
else:
return DEFAULT_MODELS.get(model_id)
def list_models(use_mlx: bool = False) -> List[Model]:
"""List all available models."""
return _registry.list_models(use_mlx)
"""List all available models sorted by priority."""
if use_mlx:
models = build_models(use_mlx=True)
else:
models = DEFAULT_MODELS
return sorted(models.values(), key=lambda m: m.priority)
def get_model_hf_repo(model_id: str, variant: ModelVariant, quant: QuantizationConfig) -> Optional[str]:
"""Get HuggingFace repository ID for a GGUF model.
Args:
model_id: Model identifier
variant: Model variant (size)
quant: Quantization config
Returns:
HuggingFace repo ID (e.g., "Qwen/Qwen2.5-Coder-7B-Instruct-GGUF") or None if unknown
"""
# Get the base repo from metadata
if model_id not in _registry._metadata:
return None
meta = _registry._metadata[model_id]
if not isinstance(meta, dict):
return None
base_repo = meta.get("hf_repo")
if not base_repo:
return None
# Convert variant size (e.g., "14b" to "-14B") and construct repo ID
size_suffix = f"-{variant.size.upper()}"
# For GGUF, add -Instruct-GGUF suffix
repo_id = f"{base_repo}{size_suffix}-Instruct-GGUF"
return repo_id
def get_model_hf_repo(model_id: str, variant: ModelVariant, quant: QuantizationConfig) -> str:
"""Get the HuggingFace repository path for a model (GGUF format)."""
repo_map = {
"qwen2.5-coder": f"Qwen/Qwen2.5-Coder-{variant.size}-Instruct-GGUF",
"deepseek-coder": f"TheBloke/deepseek-coder-{variant.size}-base-GGUF",
"codellama": f"TheBloke/CodeLlama-{variant.size}-Instruct-GGUF",
"llama-3.2": f"bartowski/Llama-3.2-{variant.size}-Instruct-GGUF",
"phi-4": f"unsloth/phi-4-GGUF",
"gemma-2": f"bartowski/gemma-2-{variant.size}-it-GGUF",
"starcoder2": f"TheBloke/starcoder2-{variant.size}-GGUF",
}
return repo_map.get(model_id, "")
def get_model_hf_repo_mlx(model_id: str, variant: ModelVariant, quant: QuantizationConfig) -> Optional[str]:
"""Get HuggingFace repository ID for an MLX model.
def get_model_hf_repo_mlx(model_id: str, variant: ModelVariant, quant: QuantizationConfig) -> str:
"""Get the HuggingFace repository path for MLX quantized models (Apple Silicon)."""
# Map GGUF quantization names to MLX quantization names
# MLX uses simple names: 3bit, 4bit, 8bit, not q4_k_m, q6_k, etc.
gguf_to_mlx_quant = {
"q3_k_m": "3bit",
"q4_k_m": "4bit",
"q4_k": "4bit",
"q5_k_m": "5bit",
"q5_k": "5bit",
"q6_k": "6bit",
"q8_0": "8bit",
"q8": "8bit",
}
Args:
model_id: Model identifier
variant: Model variant (size)
quant: Quantization config
Returns:
HuggingFace repo ID (e.g., "mlx-community/Qwen2.5-Coder-14B-Instruct-4bit") or None if unknown
"""
# Get the base repo from metadata
if model_id not in _registry._metadata:
return None
# MLX quantized models are in mlx-community org with -{quant}bit suffix
# Map base model names to mlx-community quantized versions
mlx_repo_map = {
        "qwen2.5-coder": f"mlx-community/Qwen2.5-Coder-{variant.size.upper()}-Instruct",
"deepseek-coder": f"mlx-community/deepseek-coder-{variant.size}-base",
"codellama": f"mlx-community/CodeLlama-{variant.size}-Instruct",
"llama-3.2": f"mlx-community/Llama-3.2-{variant.size}-Instruct",
"phi-4": f"mlx-community/phi-4",
"gemma-2": f"mlx-community/gemma-2-{variant.size}-it",
"starcoder2": f"mlx-community/starcoder2-{variant.size}",
}
meta = _registry._metadata[model_id]
if not isinstance(meta, dict):
return None
base_repo = meta.get("hf_repo")
if not base_repo:
return None
# MLX models are typically in mlx-community namespace
# Format: mlx-community/{ModelName}-{Size}-{Quantization}
# For example: mlx-community/Qwen2.5-Coder-14B-Instruct-4bit
# Convert variant size (e.g., "14b" to "-14B")
size_suffix = f"-{variant.size.upper()}"
# Add quantization suffix (e.g., "-4bit" for MLX quantization names)
quant_suffix = f"-{quant.name}"
# Construct the full repo name
model_name = base_repo.split('/')[-1] # Get just the model name, not the org
repo_id = f"mlx-community/{model_name}{size_suffix}-Instruct{quant_suffix}"
return repo_id
base_repo = mlx_repo_map.get(model_id, "")
if base_repo and quant:
# Convert GGUF quant name to MLX quant name
mlx_quant = gguf_to_mlx_quant.get(quant.name, quant.name)
# Append quantization suffix
return f"{base_repo}-{mlx_quant}"
return base_repo
def get_model_filename(model_id: str, variant: ModelVariant, quant: QuantizationConfig) -> str:
"""Get the filename for a GGUF model file.
Args:
model_id: Model identifier
variant: Model variant (size)
quant: Quantization config
Returns:
GGUF filename (e.g., "qwen2.5-coder-14b-instruct-q4_k_m.gguf")
"""
# Extract model name from metadata
if model_id not in _registry._metadata:
meta = {"name": model_id, "hf_repo": model_id}
else:
meta = _registry._metadata[model_id]
if not isinstance(meta, dict):
meta = {"name": model_id, "hf_repo": model_id}
# Use the base repo name or model name
base_name = meta.get("hf_repo", meta.get("name", model_id))
# Remove org prefix if present
if '/' in base_name:
base_name = base_name.split('/')[-1]
# Standard GGUF naming (all lowercase): {model}-{variant}-instruct-{quantization}.gguf
# For example: qwen2.5-coder-14b-instruct-q4_k_m.gguf
variant_size = f"-{variant.size.lower()}"
quant_name = quant.name.lower()
filename = f"{base_name.lower()}{variant_size}-instruct-{quant_name}.gguf"
return filename
"""Get the GGUF filename for a model."""
filename_map = {
"qwen2.5-coder": f"qwen2.5-coder-{variant.size}-instruct-{quant.name}.gguf",
"deepseek-coder": f"deepseek-coder-{variant.size}-base.{quant.name.upper()}.gguf",
"codellama": f"codellama-{variant.size}-instruct.{quant.name.upper()}.gguf",
"llama-3.2": f"Llama-3.2-{variant.size}-Instruct-{quant.name}.gguf",
"phi-4": f"phi-4-{quant.name}.gguf",
"gemma-2": f"gemma-2-{variant.size}-it-{quant.name}.gguf",
"starcoder2": f"starcoder2-{variant.size}.{quant.name.upper()}.gguf",
}
return filename_map.get(model_id, f"{model_id}-{variant.size}-{quant.name}.gguf")
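The filename logic above admits a compact standalone sketch; the override entries here are illustrative, not the full `filename_map`:

```python
def gguf_filename(model_id: str, size: str, quant: str) -> str:
    """Build a GGUF filename, preferring known per-model patterns."""
    # Per-model overrides (mirrors the filename_map idea above)
    overrides = {
        "phi-4": f"phi-4-{quant}.gguf",
        "starcoder2": f"starcoder2-{size}.{quant.upper()}.gguf",
    }
    if model_id in overrides:
        return overrides[model_id]
    # Generic convention: {model}-{size}-instruct-{quant}.gguf, all lowercase
    return f"{model_id.lower()}-{size.lower()}-instruct-{quant.lower()}.gguf"

print(gguf_filename("qwen2.5-coder", "14B", "Q4_K_M"))
```

The generic fallback matches the lowercase convention used by `get_model_filename` above.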
+296 -140
@@ -1,17 +1,9 @@
"""Model selection logic for Local Swarm."""
import json
from dataclasses import dataclass
from pathlib import Path
from typing import Optional, List
from hardware.detector import HardwareProfile
from models.registry import Model, ModelVariant, QuantizationConfig, list_models
from models.memory_calculator import (
calculate_memory_with_offload,
get_available_memory_with_offload,
calculate_max_instances
)
@dataclass
@@ -23,10 +15,10 @@ class ModelConfig:
instances: int
memory_per_instance_gb: float
total_memory_gb: float
context_size: int = 32768
offload_percent: float = 0.0
vram_usage_gb: float = 0.0
ram_usage_gb: float = 0.0
context_size: int = 32768 # Context window in tokens (16K, 32K, 64K, 128K)
offload_percent: float = 0.0 # Percentage of layers offloaded to RAM (0.0, 0.2, 0.5)
vram_usage_gb: float = 0.0 # Actual VRAM usage per instance
ram_usage_gb: float = 0.0 # System RAM usage per instance (when offloading)
def __post_init__(self):
"""Ensure default values are set if not provided."""
@@ -50,24 +42,147 @@ class ModelConfig:
return f"{self.model.name} {self.variant.size} ({self.quantization.name}, {self.context_size//1000}K ctx{offload_str})"
# Load configuration from JSON
_config_path = Path(__file__).parent.parent.parent / "config" / "models" / "selector_config.json"
_config = {}
if _config_path.exists():
with open(_config_path, 'r') as f:
_config = json.load(f)
# Configuration constraints
MIN_INSTANCES = 1 # Allow 1 instance (needed for Apple Silicon MLX)
MAX_INSTANCES = 8
OPTIMAL_MAX_INSTANCES = 5 # Sweet spot for consensus (85-90% benefit)
MEMORY_OVERHEAD_FACTOR = 0.95 # Leave 5% buffer
# Extract constraints
_constraints = _config.get("constraints", {})
MIN_INSTANCES = _constraints.get("min_instances", 1)
MAX_INSTANCES = _constraints.get("max_instances", 8)
OPTIMAL_MAX_INSTANCES = _constraints.get("optimal_max_instances", 5)
MEMORY_OVERHEAD_FACTOR = _constraints.get("memory_overhead_factor", 0.95)
MLX_MAX_INSTANCES = _constraints.get("mlx_max_instances", 1)
# Apple Silicon MLX constraints - MLX uses GPU efficiently with 1 worker
MLX_MAX_INSTANCES = 1 # MLX handles all GPU resources in single instance
# Context and offload options
CONTEXT_OPTIONS = {int(k): v for k, v in _config.get("context_options", {}).items()}
OFFLOAD_OPTIONS = {float(k): v for k, v in _config.get("offload_options", {}).items()}
# Context window options
CONTEXT_OPTIONS = {
16384: "16K tokens",
32768: "32K tokens (default)",
65536: "64K tokens",
131072: "128K tokens"
}
# Offloading options
OFFLOAD_OPTIONS = {
0.0: "No offload (default) - 100% GPU",
0.2: "20% offload - 80% GPU, 20% RAM",
0.5: "50% offload - 50% GPU, 50% RAM"
}
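The `.get`-with-default pattern used for the constraints can be shown in isolation; the key names mirror the ones read above and are otherwise an assumption about the layout of `selector_config.json`:

```python
import json

DEFAULTS = {"min_instances": 1, "max_instances": 8, "optimal_max_instances": 5}

def load_constraints(raw: str) -> dict:
    """Merge a JSON config's "constraints" section over the defaults."""
    config = json.loads(raw) if raw.strip() else {}
    constraints = config.get("constraints", {})
    return {k: constraints.get(k, v) for k, v in DEFAULTS.items()}

print(load_constraints('{"constraints": {"max_instances": 4}}'))
```

A missing or empty file simply yields the defaults, so the module imports cleanly on a fresh checkout.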
def calculate_context_memory(context_size: int, quantization_bits: int = 4) -> float:
"""
Calculate additional memory needed for KV cache based on context size.
Args:
context_size: Number of tokens in context window
quantization_bits: Quantization bits (4 for Q4, 5 for Q5, etc.)
Returns:
Additional VRAM needed in GB
"""
    # KV cache memory per token: 2 * num_layers * hidden_dim * bytes_per_param
    # Rough estimate: ~0.5 MB per token at fp16, scaled down by quantization
    # (so ~0.25 MB per token at 4-bit)
    mb_per_token = (quantization_bits / 8) * 0.5
    memory_mb = (context_size / 1000) * mb_per_token * 1000
return memory_mb / 1024 # Convert to GB
def calculate_memory_with_offload(
base_vram_gb: float,
context_size: int,
offload_percent: float,
quantization_bits: int = 4
) -> tuple[float, float]:
"""
Calculate VRAM and RAM usage with offloading.
Args:
base_vram_gb: Base model VRAM without context
context_size: Context window size in tokens
offload_percent: Percentage of model offloaded to RAM (0.0-1.0)
quantization_bits: Quantization precision
Returns:
(vram_usage_gb, ram_usage_gb)
"""
# Context memory (KV cache) - always in VRAM for speed
context_memory = calculate_context_memory(context_size, quantization_bits)
# Model weights split between GPU and RAM
gpu_model_memory = base_vram_gb * (1 - offload_percent)
ram_model_memory = base_vram_gb * offload_percent
# Context cache stays in VRAM for performance
vram_total = gpu_model_memory + context_memory
ram_total = ram_model_memory
return vram_total, ram_total
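A worked example makes the VRAM/RAM split concrete; `split_memory` condenses `calculate_context_memory` and `calculate_memory_with_offload` algebraically, and the 9 GB base figure is illustrative:

```python
def split_memory(base_vram_gb: float, context_size: int,
                 offload_percent: float, bits: int = 4) -> tuple[float, float]:
    """Return (vram_gb, ram_gb) for one instance with offloading."""
    # KV cache: 0.5 MB/token at fp16, scaled by quantization; stays in VRAM
    ctx_gb = context_size * (bits / 8) * 0.5 / 1024
    vram = base_vram_gb * (1 - offload_percent) + ctx_gb
    ram = base_vram_gb * offload_percent
    return vram, ram

vram, ram = split_memory(9.0, 32768, 0.2)
print(f"{vram:.1f} GB VRAM, {ram:.1f} GB RAM")
```

With a 32K context the KV cache alone costs 8 GB here, which is why the selector treats context size as a first-class memory input.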
def get_available_memory_with_offload(
hardware: HardwareProfile,
offload_percent: float
) -> tuple[float, float]:
"""
Get available GPU VRAM and system RAM considering offloading.
Args:
hardware: Hardware profile
offload_percent: Offloading percentage
Returns:
(available_vram_gb, available_ram_gb)
"""
if hardware.gpu and not hardware.is_apple_silicon:
# External GPU - use GPU VRAM + potentially some system RAM
available_vram = hardware.gpu.vram_gb * 0.9 # 10% buffer
available_ram = hardware.ram_gb * 0.5 * offload_percent # Portion of RAM for offload
elif hardware.is_apple_silicon:
# Apple Silicon - unified memory
# Use full available memory (RAM - 4GB), not just 50%
available_total = hardware.available_memory_gb
available_vram = available_total * (1 - offload_percent)
available_ram = available_total * offload_percent
else:
# CPU only - use full available memory (RAM - 4GB safety)
# On CPU-only, there's no VRAM/RAM split, just system RAM
available_vram = hardware.available_memory_gb # Use the new limit
available_ram = 0
return available_vram, available_ram
def calculate_max_instances(available_memory_gb: float, memory_per_instance: float, optimal: bool = True) -> int:
"""
Calculate number of instances based on available memory.
Args:
available_memory_gb: Available memory in GB
memory_per_instance: Memory required per instance in GB
optimal: If True, cap at OPTIMAL_MAX_INSTANCES (3-5 sweet spot).
If False, return maximum possible (up to MAX_INSTANCES).
Returns:
Recommended number of instances (up to 5 for optimal, up to 8 for max)
"""
effective_memory = available_memory_gb * MEMORY_OVERHEAD_FACTOR
max_possible = int(effective_memory // memory_per_instance)
if optimal:
# Use optimal range: 2-5 instances (research-backed sweet spot)
# 3-5 instances gives 85-90% of consensus benefit
# More than 5 has diminishing returns
if max_possible >= OPTIMAL_MAX_INSTANCES:
return OPTIMAL_MAX_INSTANCES # Cap at sweet spot
elif max_possible >= 3:
return max_possible # Use 3-4 if memory allows
else:
# Only return MIN_INSTANCES if memory actually fits that many
if max_possible >= MIN_INSTANCES:
return max_possible # Return what actually fits, not MIN_INSTANCES
return max(max_possible, 1)
else:
# Return absolute maximum (for users who explicitly want more)
return max(MIN_INSTANCES, min(max_possible, MAX_INSTANCES))
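The instance-count policy can be condensed to a few lines; this sketch folds the `>= 3` and `>= MIN_INSTANCES` branches together, which yields the same results for the default constants:

```python
MIN_INSTANCES, MAX_INSTANCES, OPTIMAL_MAX = 1, 8, 5
OVERHEAD = 0.95  # leave a 5% memory buffer

def max_instances(available_gb: float, per_instance_gb: float,
                  optimal: bool = True) -> int:
    possible = int(available_gb * OVERHEAD // per_instance_gb)
    if optimal:
        if possible >= OPTIMAL_MAX:
            return OPTIMAL_MAX      # cap at the consensus sweet spot
        return max(possible, 1)     # otherwise use whatever actually fits
    return max(MIN_INSTANCES, min(possible, MAX_INSTANCES))
```

For a 9 GB-per-instance model, 48 GB of memory caps at 5 instances while 30 GB fits 3.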
def select_optimal_model(
@@ -76,115 +191,122 @@ def select_optimal_model(
force_instances: Optional[int] = None,
context_size: int = 32768,
offload_percent: float = 0.0,
use_mlx: Optional[bool] = None
use_mlx: bool = False
) -> Optional[ModelConfig]:
"""Select the optimal model configuration for given hardware."""
# Auto-detect MLX usage for Apple Silicon if not explicitly set
if use_mlx is None:
use_mlx = hardware.is_apple_silicon
"""
Select the optimal model configuration for given hardware.
available_vram, _ = get_available_memory_with_offload(hardware, offload_percent)
Args:
hardware: Hardware profile
preferred_model: Optional model ID to force (e.g., "qwen2.5-coder")
force_instances: Optional number of instances to force
context_size: Context window size in tokens (default: 32768)
offload_percent: Portion of model to offload to RAM (0.0-1.0)
use_mlx: Whether to use MLX format models (Apple Silicon)
Returns:
ModelConfig or None if no suitable model found
"""
# Auto-detect MLX if on Apple Silicon and not explicitly set
if use_mlx is None and hardware.is_apple_silicon:
use_mlx = True
# Get available memory considering offloading
available_vram, available_ram = get_available_memory_with_offload(hardware, offload_percent)
# Get models to try (with appropriate quantizations)
# Note: Don't check available quantizations here (too slow for menu rendering)
# Only check when user is actually browsing or selecting custom config
if preferred_model:
config = _handle_preferred_model(
preferred_model, hardware, available_vram, force_instances,
context_size, offload_percent, use_mlx
)
if config:
return config
models = list_models(use_mlx=use_mlx)
for model in models:
config = _try_model_with_context(model, available_vram, force_instances, context_size, offload_percent, use_mlx)
if config:
return config
if models:
return _try_smallest_variant_with_context(models[0], available_vram, force_instances, context_size, offload_percent, use_mlx)
return None
def _handle_preferred_model(
preferred_model: str,
hardware: HardwareProfile,
available_vram: float,
force_instances: Optional[int],
context_size: int,
offload_percent: float,
use_mlx: bool
) -> Optional[ModelConfig]:
"""Handle preferred model selection."""
from models.registry import get_model
model_id = preferred_model
preferred_size = None
preferred_quant = None
if ':' in preferred_model:
parts = preferred_model.split(':')
if len(parts) >= 3:
model_id = parts[0]
preferred_size = parts[1]
preferred_quant = parts[2]
preferred = get_model(model_id, use_mlx=use_mlx)
if not preferred:
return None
if preferred_size and preferred_quant:
return _try_specific_config(
preferred, preferred_size, preferred_quant, available_vram,
force_instances, context_size, offload_percent, use_mlx
)
models = [preferred]
for model in models:
config = _try_model_with_context(model, available_vram, force_instances, context_size, offload_percent, use_mlx)
if config:
return config
return None
def _try_specific_config(
model: Model,
preferred_size: str,
preferred_quant: str,
available_vram: float,
force_instances: Optional[int],
context_size: int,
offload_percent: float,
use_mlx: bool
) -> Optional[ModelConfig]:
"""Try to use a specific model configuration."""
for variant in model.variants:
if variant.size.lower() == preferred_size.lower():
for quant in variant.quantizations:
if quant.name.lower() == preferred_quant.lower():
if quant.vram_gb <= available_vram:
instances = force_instances or calculate_max_instances(available_vram, quant.vram_gb, optimal=True)
from models.registry import get_model
# Parse preferred_model - can be:
# - "qwen2.5-coder" (just model ID)
# - "qwen2.5-coder:7b:4bit" (model:size:quant format)
if ':' in preferred_model:
# Full format: model_id:size:quant
parts = preferred_model.split(':')
if len(parts) >= 3:
model_id = parts[0]
preferred_size = parts[1]
preferred_quant = parts[2]
else:
model_id = preferred_model
preferred_size = None
preferred_quant = None
else:
model_id = preferred_model
preferred_size = None
preferred_quant = None
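The spec parsing above reduces to a small helper; note that, as in the code above, a malformed two-part spec such as `model:7b` falls through to being treated as a bare model id:

```python
from typing import Optional, Tuple

def parse_model_spec(spec: str) -> Tuple[str, Optional[str], Optional[str]]:
    """Parse "model[:size:quant]" specs, e.g. "qwen2.5-coder:7b:4bit"."""
    if ':' in spec:
        parts = spec.split(':')
        if len(parts) >= 3:
            return parts[0], parts[1], parts[2]
    # Bare id, or a spec with too few parts: no size/quant preference
    return spec, None, None
```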
preferred = get_model(model_id, use_mlx=use_mlx)
if preferred and preferred_size and preferred_quant:
# Try to find specific variant and quantization
found_variant = None
found_quant = None
for variant in preferred.variants:
if variant.size.lower() == preferred_size.lower():
found_variant = variant
for quant in variant.quantizations:
if quant.name.lower() == preferred_quant.lower():
found_quant = quant
break
break
if found_variant and found_quant:
# Found the specific variant/quant, check if it fits
memory_needed = found_quant.vram_gb
if memory_needed <= available_vram:
# Calculate instances
if force_instances:
instances = force_instances
else:
instances = calculate_max_instances(available_vram, found_quant.vram_gb, optimal=True)
# Cap at 1 for MLX
if use_mlx:
instances = min(instances, MLX_MAX_INSTANCES)
return ModelConfig(
model=model,
variant=variant,
quantization=quant,
instances=instances,
memory_per_instance_gb=quant.vram_gb,
total_memory_gb=quant.vram_gb * instances,
context_size=context_size,
offload_percent=offload_percent,
vram_usage_gb=quant.vram_gb,
ram_usage_gb=0.0
)
else:
print(f"\n⚠️ Requested model requires {quant.vram_gb:.1f}GB but only {available_vram:.1f}GB available")
return None
return ModelConfig(
model=preferred,
variant=found_variant,
quantization=found_quant,
instances=instances,
memory_per_instance_gb=found_quant.vram_gb,
total_memory_gb=found_quant.vram_gb * instances,
context_size=context_size,
offload_percent=offload_percent,
vram_usage_gb=found_quant.vram_gb,
ram_usage_gb=0.0
)
else:
# Specific config requested but doesn't fit - return None
print(f"\n⚠️ Requested model {model_id}:{preferred_size}:{preferred_quant} requires {memory_needed:.1f}GB but only {available_vram:.1f}GB available")
return None
else:
# Specific config requested but not found
print(f"\n⚠️ Model configuration not found: {model_id}:{preferred_size}:{preferred_quant}")
print(f" Available sizes for {model_id}: {[v.size for v in preferred.variants]}")
return None
models = [preferred] if preferred else []
else:
models = list_models(use_mlx=use_mlx)
# Note: On Apple Silicon with MLX, multiple instances work fine in sequential mode
# The swarm manager will handle sequential execution to avoid GPU conflicts
# Try each model in priority order
for model in models:
config = _try_model_with_context(model, available_vram, force_instances, context_size, offload_percent, use_mlx)
if config:
return config
# If nothing fits, try smallest variant of first model
if models:
smallest_config = _try_smallest_variant_with_context(models[0], available_vram, force_instances, context_size, offload_percent, use_mlx)
if smallest_config:
return smallest_config
print(f"\n⚠️ Model configuration not found: {model.id}:{preferred_size}:{preferred_quant}")
return None
@@ -197,7 +319,9 @@ def _try_model_with_context(
use_mlx: bool = False
) -> Optional[ModelConfig]:
"""Try to fit a model in available memory with context and offloading."""
# Try variants from largest to smallest
for variant in sorted(model.variants, key=lambda v: v.base_vram_gb, reverse=True):
# Try quantizations from fastest to best (sorted by quality tier, then size)
sorted_quants = sorted(
variant.quantizations,
key=lambda q: (['fast', 'good', 'better', 'best'].index(q.quality), -q.vram_gb),
@@ -205,7 +329,10 @@ def _try_model_with_context(
)
for quant in sorted_quants:
# Calculate memory with context and offloading
# Extract quantization bits from name (e.g., "4bit" -> 4, "q4_k_m" -> 4)
if 'bit' in quant.name:
# MLX format: "4bit", "3bit", etc.
quantization_bits = int(quant.name.replace('bit', ''))
elif 'q4' in quant.name:
quantization_bits = 4
@@ -214,23 +341,37 @@ def _try_model_with_context(
elif 'q6' in quant.name:
quantization_bits = 6
else:
quantization_bits = 4
quantization_bits = 4 # Default fallback
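The bit-width sniffing can be factored into one function; unlisted names like `q8_0` fall back to 4, matching the default above:

```python
def quant_bits(name: str, default: int = 4) -> int:
    """Infer quantization bit width from a quant name ("4bit", "q4_k_m", ...)."""
    if 'bit' in name:                  # MLX style: "4bit", "3bit"
        return int(name.replace('bit', ''))
    for bits in (4, 5, 6):             # GGUF style: "q4_k_m", "q5_k_m", "q6_k"
        if f'q{bits}' in name:
            return bits
    return default
```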
vram_per_instance, ram_per_instance = calculate_memory_with_offload(
quant.vram_gb, context_size, offload_percent, quantization_bits
)
if vram_per_instance * MIN_INSTANCES > available_vram:
# Check if at least MIN_INSTANCES can fit
min_needed = vram_per_instance * MIN_INSTANCES
if min_needed > available_vram:
continue
# Calculate instances
if force_instances:
instances = force_instances
if not use_mlx and vram_per_instance * instances > available_vram:
continue
if not use_mlx: # On non-Mac, check if all instances fit in VRAM
total_needed = vram_per_instance * instances
if total_needed > available_vram:
continue
else:
instances = 1 if use_mlx else calculate_max_instances(available_vram, vram_per_instance)
# On Mac with MLX (use_mlx=True), use 3 responses by default
# On other platforms, calculate based on VRAM
if use_mlx:
instances = 1 # DEBUG: Changed from 3 to 1 for faster testing
else:
instances = calculate_max_instances(available_vram, vram_per_instance)
total_memory = vram_per_instance + ram_per_instance if use_mlx else (vram_per_instance + ram_per_instance) * instances
# On Mac with seed variation, memory doesn't multiply
if use_mlx:
total_memory = vram_per_instance + ram_per_instance
else:
total_memory = (vram_per_instance + ram_per_instance) * instances
return ModelConfig(
model=model,
@@ -260,25 +401,38 @@ def _try_smallest_variant_with_context(
if not model.variants:
return None
# Get smallest variant
smallest_variant = min(model.variants, key=lambda v: v.base_vram_gb)
# Get smallest quantization
if not smallest_variant.quantizations:
return None
smallest_quant = min(smallest_variant.quantizations, key=lambda q: q.vram_gb)
# Calculate memory with context and offloading
quantization_bits = 4 if 'q4' in smallest_quant.name else (5 if 'q5' in smallest_quant.name else 6)
vram_per_instance, ram_per_instance = calculate_memory_with_offload(
smallest_quant.vram_gb, context_size, offload_percent, quantization_bits
)
# Check if even this fits
if vram_per_instance > available_vram:
return None
instances = force_instances or (1 if use_mlx else calculate_max_instances(available_vram, vram_per_instance))
# On Mac with MLX, use 3 responses by default
if use_mlx:
instances = force_instances or 1 # DEBUG: Changed from 3 to 1
else:
instances = force_instances or calculate_max_instances(available_vram, vram_per_instance)
instances = max(instances, 1)
total_memory = vram_per_instance + ram_per_instance if use_mlx else (vram_per_instance + ram_per_instance) * instances
# On Mac with seed variation, memory doesn't multiply
if use_mlx:
total_memory = vram_per_instance + ram_per_instance
else:
total_memory = (vram_per_instance + ram_per_instance) * instances
return ModelConfig(
model=model,
variant=smallest_variant,
@@ -302,6 +456,7 @@ def format_recommendation(config: ModelConfig, hardware: HardwareProfile) -> str
f"GPU VRAM per instance: {config.vram_usage_gb:.1f} GB",
]
# Show RAM usage if offloading
if config.offload_percent > 0:
lines.append(f"System RAM per instance: {config.ram_usage_gb:.1f} GB")
@@ -309,7 +464,8 @@ def format_recommendation(config: ModelConfig, hardware: HardwareProfile) -> str
f"Total memory used: {config.total_memory_gb:.1f} GB",
f"Available memory: {hardware.available_memory_gb:.1f} GB",
])
# Add instance count explanation
if config.instances == OPTIMAL_MAX_INSTANCES:
lines.append(f"Note: Using optimal instance count (3-5 = 85-90% consensus benefit)")
elif config.instances == MIN_INSTANCES:
+6 -55
@@ -6,43 +6,10 @@ Uses mDNS/Bonjour to discover other Local Swarm instances on the local network.
import socket
import asyncio
from typing import Dict, List, Optional, Any
from dataclasses import dataclass, field
from dataclasses import dataclass
from datetime import datetime, timedelta
@dataclass
class PeerMetrics:
"""Metrics for tracking peer performance."""
total_requests: int = 0
successful_requests: int = 0
failed_requests: int = 0
total_latency_ms: float = 0.0
avg_latency_ms: float = 0.0
last_error: Optional[str] = None
last_error_time: Optional[datetime] = None
@property
def success_rate(self) -> float:
"""Calculate success rate (0.0 to 1.0)."""
if self.total_requests == 0:
return 1.0
return self.successful_requests / self.total_requests
def record_success(self, latency_ms: float):
"""Record a successful request."""
self.total_requests += 1
self.successful_requests += 1
self.total_latency_ms += latency_ms
self.avg_latency_ms = self.total_latency_ms / self.successful_requests
def record_failure(self, error: str):
"""Record a failed request."""
self.total_requests += 1
self.failed_requests += 1
self.last_error = error
self.last_error_time = datetime.now()
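The `PeerMetrics` tracker shown in this hunk works standalone; a trimmed sketch with the same semantics:

```python
from dataclasses import dataclass

@dataclass
class PeerMetrics:
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    total_latency_ms: float = 0.0
    avg_latency_ms: float = 0.0

    @property
    def success_rate(self) -> float:
        # Report a perfect rate until there is any traffic to judge
        if self.total_requests == 0:
            return 1.0
        return self.successful_requests / self.total_requests

    def record_success(self, latency_ms: float) -> None:
        self.total_requests += 1
        self.successful_requests += 1
        self.total_latency_ms += latency_ms
        self.avg_latency_ms = self.total_latency_ms / self.successful_requests

    def record_failure(self) -> None:
        self.total_requests += 1
        self.failed_requests += 1
```

Averaging latency only over successes keeps timeouts from dragging the average toward the timeout ceiling.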
@dataclass
class PeerInfo:
"""Information about a peer swarm."""
@@ -54,8 +21,6 @@ class PeerInfo:
model_id: str
hardware_summary: str
last_seen: datetime
timeout_seconds: float = 60.0 # Configurable timeout per peer
metrics: PeerMetrics = field(default_factory=PeerMetrics)
@property
def api_url(self) -> str:
@@ -135,8 +100,6 @@ class SwarmDiscovery:
await asyncio.to_thread(self._zeroconf.register_service, self._info)
print(f" ✓ Advertising on mDNS: {service_name}")
print(f" IP: {ip}:{self.listen_port}")
print(f" Service type: {self.SERVICE_TYPE}")
print(f" Properties: instances={swarm_info.get('instances', 0)}, model={swarm_info.get('model_id', 'unknown')}")
except ImportError:
print(" ⚠️ zeroconf not installed, skipping mDNS advertising")
@@ -154,10 +117,6 @@ class SwarmDiscovery:
self._async_zeroconf = AsyncZeroconf()
self._zeroconf = self._async_zeroconf.zeroconf
# Store event loop reference for callbacks
self._loop = asyncio.get_event_loop()
print(f" Event loop: {self._loop}")
# Create async browser (passes the underlying Zeroconf instance)
self._browser = AsyncServiceBrowser(
self._zeroconf,
@@ -166,7 +125,6 @@ class SwarmDiscovery:
)
print(f" ✓ Listening for peers on {self.SERVICE_TYPE}")
print(f" Will discover peers advertising on same network")
self._running = True
except ImportError:
@@ -178,23 +136,16 @@ class SwarmDiscovery:
"""Handle mDNS service state changes (called from zeroconf background thread)."""
from zeroconf import ServiceStateChange
print(f" [mDNS] Service state change: {name} -> {state_change.name}")
if state_change == ServiceStateChange.Added:
print(f" [mDNS] Service added: {name}")
# Schedule coroutine on the event loop from this background thread
if self._loop is not None and self._loop.is_running():
print(f" [mDNS] Scheduling peer addition...")
asyncio.run_coroutine_threadsafe(
self._add_peer(zeroconf, service_type, name),
self._loop
)
else:
print(f" [mDNS] Warning: Event loop not available")
elif state_change == ServiceStateChange.Removed:
# Service removed
peer_key = name.replace(f".{self.SERVICE_TYPE}", "")
print(f" [mDNS] Service removed: {peer_key}")
if peer_key in self.peers:
del self.peers[peer_key]
print(f" 👋 Peer left: {peer_key}")
@@ -341,13 +292,13 @@ class SwarmDiscovery:
finally:
s.close()
# Only bind to 192.168.x.x as requested
if ip.startswith('192.168.'):
print(f" ✓ Using IP: {ip}")
# Verify it's the correct private IP (192.168.x.x only for this network)
is_private = ip.startswith('192.168.')
if is_private:
return ip
else:
print(f" ⚠️ IP {ip} is not 192.168.x.x, using localhost")
print(f" Federation requires 192.168.x.x network")
print(f" ⚠️ IP {ip} is not private, using localhost")
return '127.0.0.1'
except Exception as e:
print(f" ⚠️ Error detecting IP: {e}")
+59 -139
@@ -20,8 +20,6 @@ class PeerVote:
confidence: float
latency_ms: float
worker_count: int
tokens_per_second: float = 0.0
tokens_generated: int = 0
@dataclass
@@ -32,13 +30,12 @@ class FederationResult:
peer_votes: List[PeerVote]
strategy: str
winner: str = "" # Name of the winning node ("local" or peer name)
global_tokens_per_second: float = 0.0 # Includes sync + voting overhead
class FederationClient:
"""Client for communicating with peer swarms."""
def __init__(self, timeout: float = 60.0):
def __init__(self, timeout: float = 30.0):
"""
Initialize federation client.
@@ -83,58 +80,42 @@ class FederationClient:
Returns:
PeerVote or None if request failed
"""
request_start = time.time()
# Use peer-specific timeout if available, otherwise use default
timeout = getattr(peer, 'timeout_seconds', self.timeout)
try:
import aiohttp
# Create session with peer-specific timeout
session_timeout = aiohttp.ClientTimeout(total=timeout)
async with aiohttp.ClientSession(timeout=session_timeout) as session:
url = f"{peer.api_url}/v1/federation/vote"
payload = {
"prompt": prompt,
"max_tokens": max_tokens,
"temperature": temperature,
"request_id": f"fed_{time.time()}"
}
session = await self._get_session()
print(f" → Sending request to {url} (timeout: {timeout}s)")
async with session.post(url, json=payload) as resp:
print(f" ← Got response {resp.status} from {peer.name}")
if resp.status != 200:
print(f" ✗ Peer {peer.name} returned status {resp.status}")
peer.metrics.record_failure(f"HTTP {resp.status}")
return None
url = f"{peer.api_url}/v1/federation/vote"
payload = {
"prompt": prompt,
"max_tokens": max_tokens,
"temperature": temperature,
"request_id": f"fed_{time.time()}"
}
data = await resp.json()
latency_ms = (time.time() - request_start) * 1000
print(f" ✓ Peer {peer.name} responded successfully ({latency_ms:.0f}ms)")
# Record success metrics
peer.metrics.record_success(latency_ms)
print(f" → Sending request to {url}")
async with session.post(url, json=payload) as resp:
print(f" ← Got response {resp.status} from {peer.name}")
if resp.status != 200:
print(f" ✗ Peer {peer.name} returned status {resp.status}")
return None
return PeerVote(
peer_name=peer.name,
response_text=data.get("response", ""),
confidence=data.get("confidence", 0.5),
latency_ms=data.get("latency_ms", latency_ms),
worker_count=data.get("worker_count", 0),
tokens_per_second=data.get("tokens_per_second", 0.0),
tokens_generated=data.get("tokens_generated", 0)
)
data = await resp.json()
print(f" ✓ Peer {peer.name} responded successfully")
return PeerVote(
peer_name=peer.name,
response_text=data.get("response", ""),
confidence=data.get("confidence", 0.5),
latency_ms=data.get("latency_ms", 0),
worker_count=data.get("worker_count", 0)
)
except asyncio.TimeoutError:
error_msg = f"Timeout ({timeout}s)"
print(f" ⚠️ Peer {peer.name} {error_msg}")
peer.metrics.record_failure(error_msg)
print(f" ⚠️ Peer {peer.name} timed out (>{self.timeout}s)")
return None
except Exception as e:
error_msg = str(e)
print(f" ⚠️ Error contacting peer {peer.name}: {error_msg}")
peer.metrics.record_failure(error_msg)
print(f" ⚠️ Error contacting peer {peer.name}: {e}")
return None
async def health_check(self, peer: PeerInfo) -> bool:
@@ -212,38 +193,22 @@ class FederatedSwarm:
# Solo mode - just run local generation
print(f" 🏠 Solo mode - local swarm generating...")
solo_start_time = time.time()
local_result = await self.local_swarm.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
solo_end_time = time.time()
total_elapsed = solo_end_time - solo_start_time
tokens_generated = local_result.selected_response.tokens_generated
global_tps = tokens_generated / total_elapsed if total_elapsed > 0 else 0.0
print(f"\n 📊 Global Performance:")
print(f" Total tokens: {tokens_generated}")
print(f" Total time: {total_elapsed:.2f}s")
print(f" Global speed: {global_tps:.1f} t/s")
return FederationResult(
final_response=local_result.selected_response.text,
local_confidence=local_result.confidence,
peer_votes=[],
strategy="solo",
global_tokens_per_second=global_tps
strategy="solo"
)
# Parallel generation: Local swarm AND peers generate simultaneously
print(f" 🏠 Local swarm AND {len(peers)} peer(s) generating in parallel...")
# Track timing for global t/sec calculation (includes sync + voting overhead)
federation_start_time = time.time()
total_tokens_generated = 0
# Start local generation
local_task = self.local_swarm.generate(
prompt=prompt,
@@ -272,9 +237,7 @@ class FederatedSwarm:
local_result: ConsensusResult = local_result_raw # Now guaranteed not to be an exception
local_best = local_result.selected_response
local_confidence = local_result.confidence
local_tps = local_best.tokens_per_second
total_tokens_generated += local_best.tokens_generated
print(f" ✓ Local completed (confidence: {local_confidence:.2f}, {local_tps:.1f} t/s)")
print(f" ✓ Local completed (confidence: {local_confidence:.2f})")
# Collect peer votes
peer_votes = []
@@ -283,52 +246,28 @@ class FederatedSwarm:
print(f" ✗ Peer {peer.name} failed: {result}")
elif result is not None:
peer_votes.append(result)
total_tokens_generated += result.tokens_generated if hasattr(result, 'tokens_generated') else 0
print(f" ✓ Peer {peer.name} completed (confidence: {result.confidence:.2f}, {result.tokens_per_second:.1f} t/s)")
print(f" ✓ Peer {peer.name} completed (confidence: {result.confidence:.2f})")
if len(peer_votes) == 0:
# No peers responded, use local result
print(" ⚠️ No peers responded, using local result")
# Calculate global t/sec even in fallback mode
federation_end_time = time.time()
total_elapsed_seconds = federation_end_time - federation_start_time
global_tps = total_tokens_generated / total_elapsed_seconds if total_elapsed_seconds > 0 else 0.0
print(f"\n 📊 Global Performance:")
print(f" Total tokens: {total_tokens_generated}")
print(f" Total time: {total_elapsed_seconds:.2f}s")
print(f" Global speed: {global_tps:.1f} t/s")
return FederationResult(
final_response=local_best.text,
local_confidence=local_confidence,
peer_votes=[],
strategy="local_fallback",
global_tokens_per_second=global_tps
strategy="local_fallback"
)
# Global consensus
print(f" 🗳️ Running global consensus ({len(peer_votes) + 1} votes)...")
final_response, winner = self._weighted_vote(local_best.text, local_confidence, peer_votes)
# Calculate global tokens/sec including sync + voting overhead
federation_end_time = time.time()
total_elapsed_seconds = federation_end_time - federation_start_time
global_tps = total_tokens_generated / total_elapsed_seconds if total_elapsed_seconds > 0 else 0.0
print(f"\n 📊 Global Performance:")
print(f" Total tokens: {total_tokens_generated}")
print(f" Total time: {total_elapsed_seconds:.2f}s")
print(f" Global speed: {global_tps:.1f} t/s (includes sync + voting)")
return FederationResult(
final_response=final_response,
local_confidence=local_confidence,
peer_votes=peer_votes,
strategy=self.consensus_strategy,
winner=winner,
global_tokens_per_second=global_tps
winner=winner
)
def _weighted_vote(
@@ -351,39 +290,38 @@ class FederatedSwarm:
for vote in peer_votes:
all_votes.append((vote.response_text, vote.confidence, vote.peer_name))
# Always use quality-based selection - the head node judges ALL responses
# This prevents nodes from being overconfident about their own mediocre answers
from swarm.consensus import ConsensusEngine, GenerationResponse
if self.consensus_strategy == "best_of_n":
# Use the consensus engine to pick the best response
from swarm.consensus import ConsensusEngine
responses = [
GenerationResponse(
text=text,
tokens_generated=0,
tokens_per_second=0,
latency_ms=0,
backend_name=source
)
for text, _, source in all_votes
]
responses = [
GenerationResponse(
text=text,
tokens_generated=0,
tokens_per_second=0,
latency_ms=0,
backend_name=source
)
for text, _, source in all_votes
]
# Use quality scoring to objectively compare all responses
engine = ConsensusEngine(strategy="quality")
scores = [engine._quality_score(r) for r in responses]
# Find best response based on actual quality, not self-reported confidence
best_idx = max(range(len(scores)), key=lambda i: scores[i])
best = all_votes[best_idx]
# Show comparison
print(f" 📊 Quality scores:")
for i, (text, conf, source) in enumerate(all_votes):
print(f" {source}: {scores[i]:.2f} (self-reported: {conf:.2f})")
print(f" ✓ Selected response from {best[2]} (quality score: {scores[best_idx]:.2f})")
# Use synchronous quality scoring (no embeddings needed)
engine = ConsensusEngine(strategy="quality")
# _quality_vote is async but only uses sync scoring, so we
# use the simpler _fastest_vote-style approach here
scores = [engine._quality_score(r) for r in responses]
best_idx = max(range(len(scores)), key=lambda i: scores[i])
best = all_votes[best_idx]
print(f" ✓ Selected response from {best[2]} (quality score: {scores[best_idx]:.2f})")
return best[0], best[2]
# Default: weighted selection - pick highest confidence
best = max(all_votes, key=lambda x: x[1])
print(f" ✓ Selected response from {best[2]} (confidence: {best[1]:.2f})")
return best[0], best[2]
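The default weighted-selection path reduces to a `max` over `(text, confidence, source)` tuples:

```python
def weighted_pick(votes: list[tuple[str, float, str]]) -> tuple[str, str]:
    """Pick the response with the highest self-reported confidence."""
    best = max(votes, key=lambda v: v[1])
    return best[0], best[2]

votes = [("local answer", 0.72, "local"), ("peer answer", 0.81, "node-b")]
print(weighted_pick(votes))
```

This trusts self-reported confidence, which is exactly the weakness the quality-scored `best_of_n` path above is designed to avoid.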
async def get_federation_status(self) -> Dict[str, Any]:
"""Get current federation status with peer metrics."""
"""Get current federation status."""
peers = self.discovery.get_peers()
# Check health of all peers
@@ -391,24 +329,7 @@ class FederatedSwarm:
health_results = await asyncio.gather(*health_checks, return_exceptions=True)
healthy_peers = []
peer_metrics_info = []
for peer, healthy in zip(peers, health_results):
peer_info = {
"name": peer.name,
"healthy": healthy is True,
"timeout": peer.timeout_seconds,
"model": peer.model_id,
"instances": peer.instances,
"metrics": {
"success_rate": peer.metrics.success_rate,
"avg_latency_ms": round(peer.metrics.avg_latency_ms, 2),
"total_requests": peer.metrics.total_requests,
"last_error": peer.metrics.last_error,
}
}
peer_metrics_info.append(peer_info)
if healthy is True:
healthy_peers.append(peer.name)
@@ -417,7 +338,6 @@ class FederatedSwarm:
"total_peers": len(peers),
"healthy_peers": len(healthy_peers),
"peer_names": [p.name for p in peers],
"peer_details": peer_metrics_info,
"strategy": self.consensus_strategy
}
+3 -21
@@ -232,7 +232,7 @@ class SwarmManager:
response = await worker.generate_with_progress(request)
responses.append(response)
if not self.mcp_mode:
print(f"{worker.name} completed ({response.tokens_generated} tokens, {response.tokens_per_second:.1f} t/s)")
print(f"{worker.name} completed ({response.tokens_generated} tokens)")
except Exception as e:
responses.append(e)
if not self.mcp_mode:
@@ -283,11 +283,6 @@ class SwarmManager:
if not self.mcp_mode:
print(f" Got {len(valid_responses)} valid responses")
# Print performance summary
print(f"\n 📊 Performance Summary:")
for i, resp in enumerate(valid_responses, 1):
print(f" Worker {i}: {resp.tokens_generated} tokens @ {resp.tokens_per_second:.1f} t/s ({resp.latency_ms:.0f}ms)")
# Run consensus
result = await self.consensus.select_best(valid_responses)
@@ -357,21 +352,13 @@ class SwarmManager:
if not self.mcp_mode:
print(f"🔄 Starting stream from {fastest_worker.name}...")
chunk_count = 0
total_chars = 0
start_time = asyncio.get_event_loop().time()
async for chunk in fastest_worker.generate_with_progress_stream(request):
chunk_count += 1
total_chars += len(chunk)
if not self.mcp_mode and chunk_count % 50 == 0: # Print progress every 50 chunks
print(f" Streamed {chunk_count} chunks...")
yield chunk
end_time = asyncio.get_event_loop().time()
duration = end_time - start_time
# Estimate tokens (roughly 4 chars per token)
estimated_tokens = total_chars // 4
tps = estimated_tokens / duration if duration > 0 else 0
if not self.mcp_mode:
print(f" Stream complete: {chunk_count} chunks, {estimated_tokens} tokens, {tps:.1f} t/s")
print(f" Stream complete: {chunk_count} chunks total")
def get_status(self) -> SwarmStatus:
"""Get current swarm status."""
@@ -507,7 +494,7 @@ class SwarmManager:
try:
response = await worker.generate_with_progress(request)
responses.append(response)
print(f" ✓ Response {i+1} completed ({response.tokens_generated} tokens, {response.tokens_per_second:.1f} t/s)")
print(f" ✓ Response {i+1} completed ({response.tokens_generated} tokens)")
except Exception as e:
responses.append(e)
print(f" ✗ Response {i+1} failed: {e}")
@@ -526,11 +513,6 @@ class SwarmManager:
print(f" Got {len(valid_responses)} valid responses")
# Print performance summary
print(f"\n 📊 Performance Summary:")
for i, resp in enumerate(valid_responses, 1):
print(f" Seed {i}: {resp.tokens_generated} tokens @ {resp.tokens_per_second:.1f} t/s ({resp.latency_ms:.0f}ms)")
# Run consensus
result = await self.consensus.select_best(valid_responses)
print(f" Selected response using '{result.strategy}' strategy (confidence: {result.confidence:.2f})")
-103
@@ -1,103 +0,0 @@
"""Swarm orchestration for Local Swarm.
Handles generation orchestration across multiple workers.
"""
import asyncio
from typing import List, Optional
from backends.base import GenerationRequest, GenerationResponse
from swarm.consensus import ConsensusResult
from swarm.worker import SwarmWorker
class SwarmOrchestrator:
"""Orchestrates generation across multiple workers."""
def __init__(self, workers: List[SwarmWorker], sequential_mode: bool = False, mcp_mode: bool = False):
"""Initialize orchestrator.
Args:
workers: List of swarm workers
sequential_mode: Whether to run workers sequentially
mcp_mode: Whether to suppress console output
"""
self.workers = workers
self.sequential_mode = sequential_mode
self.mcp_mode = mcp_mode
async def generate_single(
self,
worker: SwarmWorker,
request: GenerationRequest
) -> GenerationResponse:
"""Generate using a single worker.
Args:
worker: Worker to use
request: Generation request
Returns:
Generation response
"""
return await worker.generate_with_progress(request)
async def generate_parallel(
self,
workers: List[SwarmWorker],
request: GenerationRequest
) -> List[GenerationResponse]:
"""Generate using multiple workers in parallel.
Args:
workers: List of workers
request: Generation request
Returns:
List of generation responses
"""
tasks = [w.generate_with_progress(request) for w in workers]
return await asyncio.gather(*tasks, return_exceptions=True)
async def generate_sequential(
self,
workers: List[SwarmWorker],
request: GenerationRequest
) -> List[GenerationResponse]:
"""Generate using multiple workers sequentially.
Args:
workers: List of workers
request: Generation request
Returns:
List of generation responses
"""
responses = []
for worker in workers:
try:
response = await worker.generate_with_progress(request)
responses.append(response)
except Exception as e:
responses.append(e)
return responses
def filter_responses(
self,
responses: List,
workers: List[SwarmWorker]
) -> List[GenerationResponse]:
"""Filter out error responses.
Args:
responses: List of responses (may contain exceptions)
workers: Corresponding workers
Returns:
List of valid responses
"""
valid = []
for i, resp in enumerate(responses):
if not isinstance(resp, Exception):
valid.append(resp)
return valid
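The deleted orchestrator's error-handling contract is worth noting: `asyncio.gather(..., return_exceptions=True)` returns a single list mixing successful results and exception objects, and `filter_responses` drops the exceptions. A self-contained sketch of that contract:

```python
import asyncio

# gather(..., return_exceptions=True) yields a mix of results and
# exceptions in order; filtering keeps only the real results, so one
# failed worker does not sink the whole batch.

async def ok(n: int) -> int:
    return n * 2

async def boom() -> int:
    raise RuntimeError("worker failed")

async def run() -> list:
    results = await asyncio.gather(ok(1), boom(), ok(3), return_exceptions=True)
    return [r for r in results if not isinstance(r, Exception)]

valid = asyncio.run(run())
```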
+12 -6
@@ -66,19 +66,25 @@ class StatusMonitor:
if not self.swarm_manager or not self.swarm_manager.workers:
return
# Clear previous display
self._clear_display()
# Get worker status
workers = self.swarm_manager.workers
generating_workers = [w for w in workers if w._is_generating]
if not generating_workers:
# No active generation, clear display and return (don't spam "Workers Idle")
if self._last_lines > 0:
self._clear_display()
# No active generation, show minimal status
lines = []
lines.append("📊 Workers Idle")
for w in workers:
status = "🟢" if w.is_healthy() else "🔴"
ip_str = f" [{w._ip_address}]" if w._is_remote else ""
lines.append(f" {status} {w.name}{ip_str}: Idle")
self._print_lines(lines)
return
# Clear previous display
self._clear_display()
# Active generation - show detailed status
lines = []
lines.append(f"{len(generating_workers)} Worker{'s' if len(generating_workers) > 1 else ''} Active")
+41 -71
@@ -5,14 +5,12 @@ Remote execution allows a single "tool host" to manage the workspace
while workers perform distributed generation.
"""
import asyncio
import logging
import os
import subprocess
import aiohttp
from typing import Optional
from utils.project_discovery import discover_project_root
logger = logging.getLogger(__name__)
@@ -86,7 +84,29 @@ class ToolExecutor:
logger.debug(f" ❌ Error contacting tool host: {e}")
return f"Error contacting tool host: {str(e)}"
def _discover_project_root(self, start_dir: Optional[str] = None) -> str:
"""Discover the project root directory by looking for common markers."""
import os
if start_dir is None:
start_dir = os.getcwd()
current = os.path.abspath(start_dir)
# Common project root markers
markers = ['.git', 'package.json', 'pyproject.toml', 'Cargo.toml', 'go.mod',
'requirements.txt', 'setup.py', 'pom.xml', 'build.gradle', '.project', '.venv']
while True:
try:
if any(os.path.exists(os.path.join(current, marker)) for marker in markers):
return current
except Exception:
pass # Permission errors, just skip
parent = os.path.dirname(current)
if parent == current: # Reached filesystem root
break
current = parent
return start_dir
async def _execute_local(self, tool_name: str, tool_args: dict) -> str:
"""Execute tool locally."""
@@ -97,8 +117,6 @@ class ToolExecutor:
return await self._execute_write(tool_args)
elif tool_name == "bash":
return await self._execute_bash(tool_args)
elif tool_name == "webfetch":
return await self._execute_webfetch(tool_args)
elif tool_name == "question":
return f"Question: {tool_args}"
elif tool_name == "skill":
@@ -109,7 +127,7 @@ class ToolExecutor:
return "Current todo list: (empty)"
else:
return f"Tool '{tool_name}' not implemented"
except Exception as e:
return f"Error executing {tool_name}: {str(e)}"
@@ -121,13 +139,6 @@ class ToolExecutor:
if not file_path:
return "Error: filePath required"
# Check if original path was absolute or used ~ before expansion
original_was_absolute = os.path.isabs(file_path) or file_path.startswith("~")
# Expand ~ to home directory
file_path = os.path.expanduser(file_path)
working_dir = os.path.expanduser(working_dir)
# Security: Prevent directory traversal
file_path = os.path.normpath(file_path)
if file_path.startswith("..") or file_path.startswith("/.."):
@@ -139,16 +150,14 @@ class ToolExecutor:
else:
full_path = file_path
# Additional security: only enforce working_dir restriction for relative paths
# If user explicitly specified an absolute path or ~ path, allow it
if not original_was_absolute:
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
# Additional security: ensure resolved path is within working_dir
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
logger.debug(f" 📁 Reading: {file_path}")
logger.debug(f" 📍 Working dir: {working_dir}")
@@ -172,13 +181,6 @@ class ToolExecutor:
if not file_path:
return "Error: filePath required"
# Check if original path was absolute or used ~ before expansion
original_was_absolute = os.path.isabs(file_path) or file_path.startswith("~")
# Expand ~ to home directory
file_path = os.path.expanduser(file_path)
working_dir = os.path.expanduser(working_dir)
# Security: Prevent directory traversal
file_path = os.path.normpath(file_path)
if file_path.startswith("..") or file_path.startswith("/.."):
@@ -190,16 +192,14 @@ class ToolExecutor:
else:
full_path = file_path
# Additional security: only enforce working_dir restriction for relative paths
# If user explicitly specified an absolute path or ~ path, allow it
if not original_was_absolute:
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
# Additional security: ensure resolved path is within working_dir
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
logger.debug(f" 📁 Writing: {file_path}")
logger.debug(f" 📍 Working dir: {working_dir}")
@@ -226,9 +226,6 @@ class ToolExecutor:
if not command:
return "Error: command required"
# Expand ~ to home directory in cwd
cwd = os.path.expanduser(cwd)
# Security: Block dangerous commands
dangerous = ["rm -rf /", "> /dev", "mkfs", "dd if=/dev/zero", ":(){ :|:& };:"]
for d in dangerous:
@@ -331,34 +328,7 @@ class ToolExecutor:
logger.debug(f" 📄 Partial output (last 500 chars): ...{partial_output[-500:]}")
return f"Error executing bash: {error_msg}"
async def _execute_webfetch(self, args: dict) -> str:
"""Execute webfetch tool."""
url = args.get("url", "")
format = args.get("format", "text") # Default to text
if not url:
return "Error: url required"
logger.debug(f" 🌐 Fetching: {url[:100]}... (format: {format})")
try:
session = await self._get_session()
async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
if resp.status == 200:
content = await resp.text()
logger.debug(f" ✓ Fetched {len(content)} chars")
return content
else:
logger.debug(f" ❌ HTTP {resp.status}: {url[:100]}")
return f"Error: HTTP {resp.status} from {url[:100]}"
except asyncio.TimeoutError:
logger.debug(f" ⏰ Timeout fetching: {url[:100]}")
return f"Error: Timeout fetching {url[:100]} (30s)"
except Exception as e:
logger.debug(f" ❌ Error: {e}")
return f"Error fetching {url[:100]}: {str(e)}"
async def close(self):
"""Close HTTP session."""
if self._session:
+2 -2
@@ -7,11 +7,11 @@ import logging
import sys
def setup_logging(level=logging.INFO):
def setup_logging(level=logging.DEBUG):
"""Set up logging configuration.
Args:
level: Logging level (default: INFO)
level: Logging level (default: DEBUG for development)
"""
# Create formatter
formatter = logging.Formatter(
-45
@@ -1,45 +0,0 @@
"""Network utilities for Local Swarm."""
import socket
from typing import Optional
def get_local_ip() -> str:
"""Get the local network IP address (private networks only).
Returns:
Local IP address or 127.0.0.1 if detection fails
"""
try:
# Create a socket and connect to a public DNS server
s = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
s.settimeout(2)
# Try to connect to Google's DNS - this doesn't actually send data
s.connect(("8.8.8.8", 80))
ip = s.getsockname()[0]
s.close()
# Check if it's a private IP
is_private = ip.startswith('192.168.')
if is_private:
print(f" 📡 Detected local IP: {ip}")
return ip
else:
print(f" ⚠️ IP {ip} is not a private network, binding to localhost")
return "127.0.0.1"
except Exception as e:
print(f" ⚠️ Could not detect local IP: {e}, using localhost")
return "127.0.0.1"
def is_private_ip(ip: str) -> bool:
"""Check if an IP address is private.
Args:
ip: IP address string
Returns:
True if IP is private
"""
return ip.startswith('192.168.') or ip.startswith('10.') or ip.startswith('172.16.')
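The prefix test `ip.startswith('172.16.')` only matches 172.16.x.x, but the private block 172.16.0.0/12 spans 172.16.0.0 through 172.31.255.255. A sketch using the stdlib `ipaddress` module, which handles all RFC 1918 ranges:

```python
import ipaddress

# Stdlib-based private-IP check. Note the prefix check above misses
# most of 172.16.0.0/12 (172.16.0.0 - 172.31.255.255).

def is_private_ip(ip: str) -> bool:
    try:
        return ipaddress.ip_address(ip).is_private
    except ValueError:
        return False

a = is_private_ip("192.168.1.10")  # private
b = is_private_ip("172.20.0.1")    # private, but missed by the prefix check
c = is_private_ip("8.8.8.8")       # public
```

Be aware that `is_private` is also true for loopback and link-local addresses, which may or may not be desired for binding decisions.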
-86
@@ -1,86 +0,0 @@
"""Project root discovery utilities.
Provides functionality to discover project root directories.
"""
import os
from typing import Optional, List
# Common project root markers
DEFAULT_MARKERS = [
'.git', 'package.json', 'pyproject.toml', 'Cargo.toml', 'go.mod',
'requirements.txt', 'setup.py', 'pom.xml', 'build.gradle', '.project', '.venv'
]
def discover_project_root(
start_dir: Optional[str] = None,
markers: Optional[List[str]] = None
) -> str:
"""Discover the project root directory by looking for common markers.
Args:
start_dir: Directory to start searching from (defaults to cwd)
markers: List of marker files/directories to look for (defaults to DEFAULT_MARKERS)
Returns:
Path to project root, or start_dir if no markers found
"""
if start_dir is None:
start_dir = os.getcwd()
if markers is None:
markers = DEFAULT_MARKERS
current = os.path.abspath(start_dir)
while True:
try:
if any(os.path.exists(os.path.join(current, marker)) for marker in markers):
return current
except (OSError, PermissionError):
pass # Permission errors, just skip
parent = os.path.dirname(current)
if parent == current: # Reached filesystem root
break
current = parent
return start_dir
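A usage sketch of the marker-based discovery above, with the walk-upward loop reimplemented minimally so the snippet is self-contained:

```python
import os
import tempfile

# Self-contained sketch mirroring discover_project_root: walk upward
# from a nested directory until a marker file/directory is found.

def find_root(start: str, markers=(".git", "pyproject.toml")) -> str:
    current = os.path.abspath(start)
    while True:
        if any(os.path.exists(os.path.join(current, m)) for m in markers):
            return current
        parent = os.path.dirname(current)
        if parent == current:  # reached filesystem root, no marker found
            return start
        current = parent

with tempfile.TemporaryDirectory() as root:
    os.makedirs(os.path.join(root, ".git"))
    nested = os.path.join(root, "src", "pkg")
    os.makedirs(nested)
    found = find_root(nested)
```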
def is_within_project(path: str, project_root: str) -> bool:
"""Check if a path is within a project root.
Args:
path: Path to check
project_root: Project root directory
Returns:
True if path is within project root
"""
try:
real_path = os.path.realpath(path)
real_root = os.path.realpath(project_root)
return real_path.startswith(real_root)
except (OSError, ValueError):
return False
def get_relative_to_project(path: str, project_root: str) -> str:
"""Get path relative to project root.
Args:
path: Absolute or relative path
project_root: Project root directory
Returns:
Path relative to project root
"""
try:
real_path = os.path.realpath(path)
real_root = os.path.realpath(project_root)
return os.path.relpath(real_path, real_root)
except (OSError, ValueError):
return path
-90
@@ -1,90 +0,0 @@
"""Token counting utilities for Local Swarm.
Centralizes token counting functionality to avoid duplication across modules.
"""
import tiktoken
from typing import Optional
# Initialize tokenizer for accurate token counting
TOKEN_ENCODING = tiktoken.get_encoding('cl100k_base')
def count_tokens(text: str) -> int:
"""Count tokens in a text string using tiktoken.
Args:
text: Text to count tokens for
Returns:
Number of tokens
"""
if not text:
return 0
return len(TOKEN_ENCODING.encode(text))
def count_tokens_in_messages(messages: list) -> int:
"""Count tokens in a list of messages.
Args:
messages: List of message objects with content attribute
Returns:
Total token count
"""
total = 0
for msg in messages:
if hasattr(msg, 'content') and msg.content:
total += count_tokens(msg.content)
return total
def estimate_tokens_from_characters(char_count: int, chars_per_token: int = 4) -> int:
"""Estimate token count from character count.
This is a fallback when tiktoken is not available or for quick estimates.
Args:
char_count: Number of characters
chars_per_token: Average characters per token (default 4)
Returns:
Estimated token count
"""
return char_count // chars_per_token
def truncate_to_max_tokens(text: str, max_tokens: int) -> str:
"""Truncate text to fit within max tokens.
Args:
text: Text to truncate
max_tokens: Maximum number of tokens allowed
Returns:
Truncated text
"""
tokens = TOKEN_ENCODING.encode(text)
if len(tokens) <= max_tokens:
return text
truncated = tokens[:max_tokens]
return TOKEN_ENCODING.decode(truncated)
def format_token_info(prompt_tokens: int, completion_tokens: int) -> dict:
"""Format token information for responses.
Args:
prompt_tokens: Number of prompt tokens
completion_tokens: Number of completion tokens
Returns:
Dictionary with token counts
"""
return {
"prompt_tokens": prompt_tokens,
"completion_tokens": completion_tokens,
"total_tokens": prompt_tokens + completion_tokens
}
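The tiktoken-free pieces of these utilities can be exercised directly (the exact `tiktoken` counts depend on the model's encoding, so only the character-based fallback and the usage dict are shown):

```python
# Exercises the character-based estimate and the OpenAI-style usage
# dict from the token utilities above; no tiktoken dependency.

def estimate_tokens_from_characters(char_count: int, chars_per_token: int = 4) -> int:
    return char_count // chars_per_token

def format_token_info(prompt_tokens: int, completion_tokens: int) -> dict:
    return {
        "prompt_tokens": prompt_tokens,
        "completion_tokens": completion_tokens,
        "total_tokens": prompt_tokens + completion_tokens,
    }

prompt = "Explain consensus voting in one paragraph."  # 42 chars
est = estimate_tokens_from_characters(len(prompt))
usage = format_token_info(est, 120)
```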
-16
@@ -1,16 +0,0 @@
# Patch to add real-time streaming for tools
# This patch adds real-time streaming of assistant content ("thinking") and tool calls
# when tools are used. Previously, all content was buffered until complete,
# causing opencode to wait with no feedback.
# Key changes:
# 1. Stream model output incrementally as it's generated
# 2. Parse for tool_calls and content in each chunk
# 3. Send content chunks immediately (the "thinking")
# 4. Send tool_calls deltas immediately when found
# 5. Don't execute tools server-side in streaming mode
# 6. Send DONE marker at end
# Apply this patch with:
# patch -p1 < this_file
-63
@@ -1,63 +0,0 @@
## Test Plan for CUDA and Android Support
### Unit Tests
#### Test Case 1: NVIDIA GPU Detection
- **Input:** System with NVIDIA GPU and pynvml installed
- **Expected Output:** GPUInfo with correct name, VRAM, and is_nvidia=True
- **Location:** src/hardware/detector.py:detect_nvidia_gpu()
#### Test Case 2: GPU Layer Configuration for CUDA
- **Input:** HardwareProfile with NVIDIA GPU (4GB VRAM)
- **Expected Output:** n_gpu_layers=-1 (all layers), proper CUDA configuration
- **Location:** src/backends/__init__.py:create_backend()
#### Test Case 3: Android Platform Detection
- **Input:** platform.system() returns 'Linux', Termux environment detected
- **Expected Output:** is_android=True, proper Android path handling
- **Location:** src/hardware/detector.py:detect_android()
#### Test Case 4: PeerInfo with Timeout
- **Input:** PeerInfo with custom timeout
- **Expected Output:** FederationClient respects peer timeout
- **Location:** src/network/discovery.py:PeerInfo
### Integration Tests
#### End-to-End Flow 1: CUDA Backend Creation
1. Detect hardware with NVIDIA GPU
2. Create backend via factory
3. Verify n_gpu_layers=-1 set
4. Load test model
5. Expected: Successful GPU offload
#### End-to-End Flow 2: Android Device Join Federation
1. Start discovery on Android (Termux)
2. Advertise Android hardware
3. Join federation from macOS peer
4. Send vote request
5. Expected: Android responds successfully
#### End-to-End Flow 3: Federation with Per-Peer Timeout
1. Add peer with 30s timeout
2. Add peer with 60s timeout
3. Request votes from both
4. Expected: Each peer uses its own timeout
### Manual Verification
#### Command to Run:
```bash
python -m pytest tests/ -v -k "cuda or android or federation"
```
#### Expected Output:
- All tests pass
- No ImportError for pynvml
- GPU layer detection works on CUDA machines
- Android detection passes on Termux
#### Platform Testing:
1. **macOS (Apple Silicon):** MLX backend loads
2. **Linux (NVIDIA):** CUDA backend auto-detects
3. **Android (Termux):** CPU-only mode, proper paths
-140
@@ -1,140 +0,0 @@
"""Test Apple Silicon MLX auto-detection and download."""
import sys
import os
from pathlib import Path
# Add src to path
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'src'))
def test_apple_silicon_mlx_selection():
"""Test that Apple Silicon correctly selects MLX models."""
from hardware.detector import HardwareProfile, GPUInfo
from models.selector import select_optimal_model
# Mock Apple Silicon hardware
class MockAppleHardware:
os = "darwin"
cpu_cores = 12
ram_gb = 24.0
ram_available_gb = 12.0
is_apple_silicon = True
has_dedicated_gpu = False
gpu = GPUInfo(name="Apple Silicon GPU", vram_gb=24.0, driver_version=None)
available_memory_gb = 12.0
recommended_memory_gb = 12.0
hardware = MockAppleHardware()
# Test auto-detection (use_mlx=None)
print("=" * 60)
print("Apple Silicon MLX Auto-Detection Test")
print("=" * 60)
print("\n1. Testing auto-detection (use_mlx=None)...")
config = select_optimal_model(hardware, use_mlx=None)
assert config is not None, "Should find a model"
print(f" ✓ Model selected: {config.model.name}")
# Verify quantization is MLX format (4bit, 8bit, etc.)
print("\n2. Verifying MLX quantization format...")
is_mlx_format = 'bit' in config.quantization.name.lower()
assert is_mlx_format, f"Quantization should be MLX format (4bit/8bit), got {config.quantization.name}"
print(f" ✓ Quantization: {config.quantization.name} (MLX format)")
# Test repository name generation
print("\n3. Testing MLX repository name generation...")
from models.registry import get_model_hf_repo_mlx
mlx_repo = get_model_hf_repo_mlx(config.model.id, config.variant, config.quantization)
assert mlx_repo is not None, "MLX repository should be generated"
assert "mlx-community" in mlx_repo, "Should use mlx-community namespace"
assert "-Instruct-" in mlx_repo, "Should have -Instruct- suffix"
assert config.quantization.name in mlx_repo, "Should include quantization"
print(f" ✓ Repository: {mlx_repo}")
# Verify it's NOT using GGUF format
print("\n4. Verifying NOT using GGUF format...")
has_gguf = 'q4_k_m' in config.quantization.name or 'q5_k_m' in config.quantization.name
has_gguf_suffix = '-GGUF' in mlx_repo
assert not has_gguf, f"Should not use GGUF quantization names"
assert not has_gguf_suffix, f"Should not use GGUF repository suffix"
print(f" ✓ Not using GGUF format")
print("\n" + "=" * 60)
print("All Apple Silicon MLX tests passed!")
print("=" * 60)
def test_nvidia_gpu_gguf_selection():
"""Test that NVIDIA GPU correctly selects GGUF models."""
from hardware.detector import HardwareProfile, GPUInfo
from models.selector import select_optimal_model
# Mock NVIDIA hardware
class MockNvidiaHardware:
os = "linux"
cpu_cores = 8
ram_gb = 32.0
ram_available_gb = 20.0
is_apple_silicon = False
has_dedicated_gpu = True
gpu = GPUInfo(name="NVIDIA RTX 4090", vram_gb=24.0, driver_version="550.80")
available_memory_gb = 20.0
recommended_memory_gb = 20.0
hardware = MockNvidiaHardware()
print("\n" + "=" * 60)
print("NVIDIA GPU GGUF Auto-Detection Test")
print("=" * 60)
print("\n1. Testing auto-detection (use_mlx=None)...")
config = select_optimal_model(hardware, use_mlx=None)
assert config is not None, "Should find a model"
print(f" ✓ Model selected: {config.model.name}")
# Verify quantization is GGUF format (q4_k_m, q5_k_m, etc.)
print("\n2. Verifying GGUF quantization format...")
is_gguf_format = 'q' in config.quantization.name.lower()
assert is_gguf_format, f"Quantization should be GGUF format (q4_k_m/q5_k_m), got {config.quantization.name}"
print(f" ✓ Quantization: {config.quantization.name} (GGUF format)")
# Test repository name generation
print("\n3. Testing GGUF repository name generation...")
from models.registry import get_model_hf_repo
gguf_repo = get_model_hf_repo(config.model.id, config.variant, config.quantization)
assert gguf_repo is not None, "GGUF repository should be generated"
assert "-GGUF" in gguf_repo, "Should have -GGUF suffix"
print(f" ✓ Repository: {gguf_repo}")
# Verify it's NOT using MLX format
print("\n4. Verifying NOT using MLX format...")
has_mlx_format = 'bit' in config.quantization.name.lower() and config.quantization.name not in ['q4_k_m', 'q5_k_m', 'q6_k']
has_mlx_namespace = 'mlx-community' in gguf_repo
assert not has_mlx_namespace, f"Should not use mlx-community namespace"
print(f" ✓ Not using MLX format")
print("\n" + "=" * 60)
print("All NVIDIA GPU GGUF tests passed!")
print("=" * 60)
if __name__ == "__main__":
try:
test_apple_silicon_mlx_selection()
test_nvidia_gpu_gguf_selection()
print("\n" + "=" * 60)
print("ALL AUTO-DETECTION TESTS PASSED!")
print("=" * 60)
except AssertionError as e:
print(f"\n❌ Test failed: {e}")
sys.exit(1)
except Exception as e:
print(f"\n❌ Test error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
-100
@@ -1,100 +0,0 @@
"""End-to-end test for tool execution with a mock server.
This tests the complete flow:
1. Model generates tool call
2. Tools are executed
3. Response is generated based on tool results
"""
import asyncio
import sys
import os
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'src'))
@pytest.mark.asyncio
async def test_tool_flow():
"""Test the tool execution flow end-to-end."""
# Import after path is set
from api.models import ChatMessage, ChatCompletionRequest
from api.tool_parser import parse_tool_calls
from api.formatting import format_messages_with_tools
from tools.executor import ToolExecutor
print("=" * 60)
print("End-to-End Tool Execution Test")
print("=" * 60)
# Test 1: Parse tool call from model response
print("\n1. Testing tool parsing...")
model_response = "TOOL: bash\nARGUMENTS: {\"command\": \"echo hello\"}"
content, tool_calls = parse_tool_calls(model_response)
assert tool_calls is not None, "Should parse tool call"
assert len(tool_calls) == 1, "Should have one tool call"
assert tool_calls[0]["function"]["name"] == "bash", "Should be bash tool"
print(f" ✓ Parsed tool: {tool_calls[0]['function']['name']}")
# Test 2: Simulate tool result and format for next prompt
print("\n2. Testing tool result formatting...")
tool_result = "hello\n"
# Build conversation history
messages = [
ChatMessage(role="user", content="Run echo hello"),
ChatMessage(role="assistant", content=model_response),
ChatMessage(role="tool", content=tool_result)
]
# Format for next generation
next_prompt = format_messages_with_tools(messages, None)
assert "tool" in next_prompt.lower(), "Prompt should include tool result"
assert "hello" in next_prompt, "Prompt should include tool output"
print(f" ✓ Tool result formatted for next prompt")
# Test 3: Verify loop detection
print("\n3. Testing loop detection...")
seen_tools = set()
# First tool call
tc1 = [{"function": {"name": "bash", "arguments": '{"command": "ls"}'}}]
sig1 = "bash:{'command': \"ls\"}'[:50]"
seen_tools.add(sig1)
print(f" ✓ First tool call tracked")
# Duplicate tool call
tc2 = tc1
sig2 = sig1
is_duplicate = sig2 in seen_tools
assert is_duplicate, "Should detect duplicate"
print(f" ✓ Duplicate tool call detected")
# Test 4: Verify tool result truncation
print("\n4. Testing tool result truncation...")
long_result = "a" * 3000
max_length = 2000
if len(long_result) > max_length:
truncated = long_result[:max_length] + "\n[...truncated...]"
assert len(truncated) == max_length + len("\n[...truncated...]"), "Should truncate properly"
print(f" ✓ Tool result truncated from {len(long_result)} to {len(truncated)} chars")
print("\n" + "=" * 60)
print("All end-to-end tests passed!")
print("=" * 60)
if __name__ == "__main__":
try:
asyncio.run(test_tool_flow())
except AssertionError as e:
print(f"\n❌ Test failed: {e}")
sys.exit(1)
except Exception as e:
print(f"\n❌ Test error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
-166
@@ -1,166 +0,0 @@
"""Tests for federation metrics and peer timeout."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'src'))
import pytest
from datetime import datetime
from network.discovery import PeerInfo, PeerMetrics
from network.federation import FederationClient, PeerVote
class TestPeerMetrics:
"""Test peer metrics tracking."""
def test_peer_metrics_defaults(self):
"""Test default metric values."""
metrics = PeerMetrics()
assert metrics.total_requests == 0
assert metrics.successful_requests == 0
assert metrics.failed_requests == 0
assert metrics.success_rate == 1.0 # No requests = 100% success
def test_record_success(self):
"""Test recording successful requests."""
metrics = PeerMetrics()
metrics.record_success(100.0)
assert metrics.total_requests == 1
assert metrics.successful_requests == 1
assert metrics.failed_requests == 0
assert metrics.success_rate == 1.0
assert metrics.avg_latency_ms == 100.0
# Record another success
metrics.record_success(200.0)
assert metrics.total_requests == 2
assert metrics.avg_latency_ms == 150.0 # (100 + 200) / 2
def test_record_failure(self):
"""Test recording failed requests."""
metrics = PeerMetrics()
metrics.record_failure("Connection timeout")
assert metrics.total_requests == 1
assert metrics.successful_requests == 0
assert metrics.failed_requests == 1
assert metrics.success_rate == 0.0
assert metrics.last_error == "Connection timeout"
assert metrics.last_error_time is not None
def test_mixed_success_and_failure(self):
"""Test mixed success and failure recording."""
metrics = PeerMetrics()
metrics.record_success(100.0)
metrics.record_failure("Error")
metrics.record_success(150.0)
assert metrics.total_requests == 3
assert metrics.successful_requests == 2
assert metrics.failed_requests == 1
assert metrics.success_rate == 2/3
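A minimal mirror of the behavior these tests assert (hypothetical sketch; the real `PeerMetrics` lives in `network.discovery` and is not shown here):

```python
from dataclasses import dataclass
from typing import Optional

# Sketch of PeerMetrics semantics: running-average latency over
# successes, and a success rate that defaults to 1.0 with no traffic.

@dataclass
class PeerMetricsSketch:
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    avg_latency_ms: float = 0.0
    last_error: Optional[str] = None

    @property
    def success_rate(self) -> float:
        if self.total_requests == 0:
            return 1.0  # no requests yet counts as fully healthy
        return self.successful_requests / self.total_requests

    def record_success(self, latency_ms: float) -> None:
        self.total_requests += 1
        self.successful_requests += 1
        # Incremental running average over successful requests only.
        n = self.successful_requests
        self.avg_latency_ms += (latency_ms - self.avg_latency_ms) / n

    def record_failure(self, error: str) -> None:
        self.total_requests += 1
        self.failed_requests += 1
        self.last_error = error

m = PeerMetricsSketch()
m.record_success(100.0)
m.record_success(200.0)
m.record_failure("Connection timeout")
```

The incremental form `avg += (x - avg) / n` avoids storing every latency sample while matching the `(100 + 200) / 2 = 150` expectation in the test above.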
class TestPeerInfo:
"""Test PeerInfo with metrics and timeout."""
def test_peer_info_defaults(self):
"""Test PeerInfo default values."""
peer = PeerInfo(
host="192.168.1.100",
port=17615,
name="test-peer",
version="0.1.0",
instances=2,
model_id="qwen:7b:q4",
hardware_summary="Apple M1 Pro",
last_seen=datetime.now()
)
assert peer.timeout_seconds == 60.0 # Default timeout
assert peer.metrics is not None
assert isinstance(peer.metrics, PeerMetrics)
assert peer.api_url == "http://192.168.1.100:17615"
def test_peer_info_custom_timeout(self):
"""Test PeerInfo with custom timeout."""
peer = PeerInfo(
host="192.168.1.100",
port=17615,
name="slow-peer",
version="0.1.0",
instances=1,
model_id="test-model",
hardware_summary="CPU only",
last_seen=datetime.now(),
timeout_seconds=120.0 # Custom timeout
)
assert peer.timeout_seconds == 120.0
class TestFederationClient:
"""Test FederationClient with peer-specific timeouts."""
@pytest.fixture
def client(self):
return FederationClient(timeout=60.0)
@pytest.fixture
def fast_peer(self):
return PeerInfo(
host="192.168.1.10",
port=17615,
name="fast-peer",
version="0.1.0",
instances=2,
model_id="qwen:7b:q4",
hardware_summary="Apple M1 Max",
last_seen=datetime.now(),
timeout_seconds=30.0 # Fast peer with short timeout
)
@pytest.fixture
def slow_peer(self):
return PeerInfo(
host="192.168.1.11",
port=17615,
name="slow-peer",
version="0.1.0",
instances=1,
model_id="qwen:7b:q4",
hardware_summary="CPU only",
last_seen=datetime.now(),
timeout_seconds=90.0 # Slow peer with longer timeout
)
def test_peer_timeout_override(self, client, fast_peer, slow_peer):
"""Test that peer-specific timeout overrides default."""
# The client should use the peer's timeout, not the default
assert fast_peer.timeout_seconds == 30.0
assert slow_peer.timeout_seconds == 90.0
assert client.timeout == 60.0 # Default unchanged
def test_metrics_updated_on_success(self, fast_peer):
"""Test that metrics are updated on successful request."""
assert fast_peer.metrics.total_requests == 0
# Simulate recording a success (this would happen in request_vote)
fast_peer.metrics.record_success(150.0)
assert fast_peer.metrics.total_requests == 1
assert fast_peer.metrics.successful_requests == 1
assert fast_peer.metrics.success_rate == 1.0
def test_metrics_updated_on_failure(self, slow_peer):
"""Test that metrics are updated on failed request."""
assert slow_peer.metrics.total_requests == 0
# Simulate recording a failure
slow_peer.metrics.record_failure("Connection refused")
assert slow_peer.metrics.total_requests == 1
assert slow_peer.metrics.failed_requests == 1
assert slow_peer.metrics.success_rate == 0.0
assert slow_peer.metrics.last_error == "Connection refused"
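The assertions above pin down the metrics contract these tests expect. A minimal tracker that satisfies them might look like the sketch below (an illustration with assumed field names such as `latencies_ms`, not the project's actual `PeerMetrics`):

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PeerMetricsSketch:
    """Hypothetical per-peer request tracker matching the assertions above."""
    total_requests: int = 0
    successful_requests: int = 0
    failed_requests: int = 0
    latencies_ms: List[float] = field(default_factory=list)
    last_error: Optional[str] = None

    def record_success(self, latency_ms: float) -> None:
        self.total_requests += 1
        self.successful_requests += 1
        self.latencies_ms.append(latency_ms)

    def record_failure(self, error: str) -> None:
        self.total_requests += 1
        self.failed_requests += 1
        self.last_error = error

    @property
    def success_rate(self) -> float:
        # Avoid division by zero before any requests have been recorded
        if self.total_requests == 0:
            return 0.0
        return self.successful_requests / self.total_requests
```

With a tracker shaped like this, both `test_metrics_updated_on_success` and `test_metrics_updated_on_failure` pass as written.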
-176
@@ -1,176 +0,0 @@
"""Tests for hardware detection and GPU layer configuration."""
import sys
import os
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'src'))
import pytest
from unittest.mock import Mock, patch, MagicMock
from hardware.detector import (
GPUInfo, HardwareProfile, detect_nvidia_gpu,
calculate_gpu_layers, validate_gpu_layers, is_android
)
class TestNvidiaGPU:
"""Test NVIDIA GPU detection."""
def test_detect_nvidia_gpu_success(self):
"""Test successful NVIDIA GPU detection."""
# Mock the entire import system
mock_pynvml = Mock()
mock_pynvml.nvmlInit = Mock()
mock_pynvml.nvmlShutdown = Mock()
mock_pynvml.nvmlDeviceGetCount = Mock(return_value=1)
# Mock device handle and info
mock_handle = Mock()
mock_pynvml.nvmlDeviceGetHandleByIndex = Mock(return_value=mock_handle)
mock_pynvml.nvmlDeviceGetName = Mock(return_value="NVIDIA GeForce RTX 3080")
# Mock memory info
mock_mem = Mock()
mock_mem.total = 10737418240 # 10 GB
mock_pynvml.nvmlDeviceGetMemoryInfo = Mock(return_value=mock_mem)
# Mock driver version
mock_pynvml.nvmlSystemGetDriverVersion = Mock(return_value="535.104.05")
# Mock compute capability
mock_pynvml.nvmlDeviceGetCudaComputeCapability = Mock(return_value=(8, 6))
# Patch __import__ to return our mock
        import builtins
        real_import = builtins.__import__  # capture the real import before patching to avoid recursion
        def mock_import(name, *args, **kwargs):
            if name == 'pynvml':
                return mock_pynvml
            return real_import(name, *args, **kwargs)
with patch('builtins.__import__', side_effect=mock_import):
gpu = detect_nvidia_gpu()
assert gpu is not None
assert gpu.name == "NVIDIA GeForce RTX 3080"
assert gpu.vram_gb == 10.0
assert gpu.driver_version == "535.104.05"
assert gpu.is_nvidia is True
assert gpu.compute_capability == "8.6"
assert gpu.device_count == 1
def test_detect_nvidia_gpu_no_gpu(self):
"""Test detection when no NVIDIA GPU present."""
mock_pynvml = Mock()
mock_pynvml.nvmlInit = Mock()
mock_pynvml.nvmlShutdown = Mock()
mock_pynvml.nvmlDeviceGetCount = Mock(return_value=0)
        import builtins
        real_import = builtins.__import__  # capture the real import before patching to avoid recursion
        def mock_import(name, *args, **kwargs):
            if name == 'pynvml':
                return mock_pynvml
            return real_import(name, *args, **kwargs)
with patch('builtins.__import__', side_effect=mock_import):
gpu = detect_nvidia_gpu()
assert gpu is None
def test_detect_nvidia_gpu_import_error(self):
"""Test detection when pynvml is not installed."""
        import builtins
        real_import = builtins.__import__  # capture the real import before patching to avoid recursion
        def mock_import(name, *args, **kwargs):
            if name == 'pynvml':
                raise ImportError("No module named 'pynvml'")
            return real_import(name, *args, **kwargs)
with patch('builtins.__import__', side_effect=mock_import):
gpu = detect_nvidia_gpu()
assert gpu is None
class TestGPULayerCalculation:
"""Test GPU layer auto-configuration."""
def test_calculate_gpu_layers_apple_silicon(self):
"""Test layer calculation for Apple Silicon."""
gpu = GPUInfo(
name="Apple Silicon GPU",
vram_gb=32.0,
is_apple_silicon=True
)
assert calculate_gpu_layers(gpu) == -1
def test_calculate_gpu_layers_nvidia(self):
"""Test layer calculation for NVIDIA GPU."""
gpu = GPUInfo(
name="NVIDIA GeForce RTX 3080",
vram_gb=10.0,
is_nvidia=True,
compute_capability="8.6"
)
assert calculate_gpu_layers(gpu) == -1
def test_calculate_gpu_layers_old_nvidia(self):
"""Test layer calculation for old NVIDIA GPU."""
gpu = GPUInfo(
name="NVIDIA GeForce GTX 680",
vram_gb=2.0,
is_nvidia=True,
compute_capability="3.0"
)
assert calculate_gpu_layers(gpu) == 0 # Too old
def test_calculate_gpu_layers_no_gpu(self):
"""Test layer calculation with no GPU."""
assert calculate_gpu_layers(None) == 0
def test_validate_gpu_layers_success(self):
"""Test successful layer validation."""
gpu = GPUInfo(
name="NVIDIA GeForce RTX 3080",
vram_gb=10.0,
is_nvidia=True,
compute_capability="8.6"
)
assert validate_gpu_layers(-1, gpu) == -1
def test_validate_gpu_layers_no_gpu_error(self):
"""Test validation error when GPU requested but none available."""
with pytest.raises(ValueError, match="no GPU detected"):
validate_gpu_layers(-1, None)
def test_validate_gpu_layers_old_gpu_error(self):
"""Test validation error for unsupported GPU."""
gpu = GPUInfo(
name="NVIDIA GeForce GTX 680",
vram_gb=2.0,
is_nvidia=True,
compute_capability="3.0"
)
with pytest.raises(ValueError, match="Minimum required is 5.0"):
validate_gpu_layers(-1, gpu)
class TestAndroidDetection:
"""Test Android platform detection."""
@patch.dict('os.environ', {'ANDROID_ROOT': '/system'}, clear=True)
@patch('os.path.exists')
def test_is_android_env_var(self, mock_exists):
"""Test Android detection via environment variables."""
mock_exists.return_value = False
assert is_android() is True
@patch.dict('os.environ', {}, clear=True)
@patch('os.path.exists')
def test_is_android_paths(self, mock_exists):
"""Test Android detection via filesystem paths."""
def exists_side_effect(path):
return path == "/system/build.prop"
mock_exists.side_effect = exists_side_effect
assert is_android() is True
@patch.dict('os.environ', {}, clear=True)
@patch('os.path.exists')
def test_is_not_android(self, mock_exists):
"""Test non-Android system."""
mock_exists.return_value = False
assert is_android() is False
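Taken together, the layer-calculation and validation tests pin down a simple decision rule: Apple Silicon and sufficiently new NVIDIA GPUs offload everything (`-1`), old GPUs and no-GPU hosts stay on CPU (`0`). A sketch consistent with those tests is below; the `5.0` threshold is an assumption read off the "Minimum required is 5.0" error message, and this is not the real `calculate_gpu_layers`:

```python
from dataclasses import dataclass
from typing import Optional

# Assumed compute-capability floor, inferred from the validation error text
MIN_COMPUTE_CAPABILITY = 5.0


@dataclass
class GPUInfoSketch:
    name: str
    vram_gb: float
    is_nvidia: bool = False
    is_apple_silicon: bool = False
    compute_capability: Optional[str] = None


def calculate_gpu_layers_sketch(gpu: Optional[GPUInfoSketch]) -> int:
    """Return -1 to offload all layers to the GPU, 0 for CPU-only."""
    if gpu is None:
        return 0
    if gpu.is_apple_silicon:
        return -1  # unified memory: always offload everything
    if gpu.is_nvidia and gpu.compute_capability:
        if float(gpu.compute_capability) >= MIN_COMPUTE_CAPABILITY:
            return -1
        return 0  # compute capability too old for CUDA offload
    return 0
```

Each branch maps one-to-one onto a test case in `TestGPULayerCalculation`.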
-183
@@ -1,183 +0,0 @@
"""Integration test for tool execution in chat completions.
This test verifies that:
1. Tools are properly parsed from model output
2. Tools are executed and results fed back to model
3. The loop continues generating until final response
"""
import asyncio
import json
import sys
import os
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'src'))
from api.models import ChatMessage
from api.chat_handlers import handle_chat_completion, _sanitize_tools
from api.tool_parser import parse_tool_calls
from api.formatting import format_messages_with_tools
class MockSwarm:
"""Mock swarm manager for testing."""
async def generate(self, prompt, max_tokens, temperature, use_consensus):
"""Generate a mock response."""
# Return different responses based on prompt content
if "tool_result" in prompt.lower():
# Final response after tool execution
return MockResponse("Here's the result: The tool was executed successfully!")
else:
# First response with tool call
return MockResponse("TOOL: bash\nARGUMENTS: {\"command\": \"echo test\"}")
class MockResponse:
"""Mock generation result."""
def __init__(self, text):
self.selected_response = MockSelectedResponse(text)
class MockSelectedResponse:
"""Mock selected response."""
def __init__(self, text):
self.text = text
self.tokens_generated = 50
self.tokens_per_second = 10.0
class MockExecutor:
"""Mock tool executor."""
async def execute_tool(self, tool_name, tool_args, working_dir=None):
"""Execute a tool mock."""
return f"Mock result from {tool_name} with args {tool_args}"
@pytest.mark.asyncio
async def test_tool_execution_loop():
"""Test that tools are executed and loop continues."""
print("Testing tool execution loop...")
# Create a mock request
request = ChatMessage(
role="user",
content="Run echo test"
)
# Wrap in request object
from api.models import ChatCompletionRequest
req = ChatCompletionRequest(
model="test-model",
messages=[request],
tools=None,
max_tokens=1024,
temperature=0.7
)
# Create mock swarm
swarm = MockSwarm()
# We can't easily test the full handler without a real tool executor,
# so let's test the key parts
# Test 1: Verify tool parsing works
print(" Test 1: Tool parsing")
tool_text = 'TOOL: bash\nARGUMENTS: {"command": "echo test"}'
content, tool_calls = parse_tool_calls(tool_text)
assert tool_calls is not None, "Tool calls should be parsed"
assert len(tool_calls) == 1, "Should parse one tool call"
assert tool_calls[0]["function"]["name"] == "bash", "Tool name should be bash"
assert "echo test" in tool_calls[0]["function"]["arguments"], "Command should be in arguments"
print(" ✓ Tool parsing works correctly")
# Test 2: Verify tool instructions are loaded
print(" Test 2: Tool instructions")
instructions = format_messages_with_tools([request], None)
assert len(instructions) > 0, "Instructions should be generated"
assert "tool" in instructions.lower(), "Instructions should mention tools"
print(" ✓ Tool instructions are loaded")
# Test 3: Verify multiple tool calls can be parsed
print(" Test 3: Multiple tool calls")
multi_tool = '''TOOL: bash
ARGUMENTS: {"command": "ls"}
TOOL: write
ARGUMENTS: {"filePath": "test.txt", "content": "hello"}'''
content, tool_calls = parse_tool_calls(multi_tool)
assert tool_calls is not None, "Multiple tools should be parsed"
assert len(tool_calls) == 2, "Should parse two tool calls"
assert tool_calls[0]["function"]["name"] == "bash", "First tool should be bash"
assert tool_calls[1]["function"]["name"] == "write", "Second tool should be write"
print(" ✓ Multiple tool calls parsed correctly")
# Test 4: Verify tool sanitization
print(" Test 4: Tool sanitization")
# Create mock tool with invalid 'description' in properties
from api.models import Tool, FunctionDefinition
mock_tool = Tool(
type="function",
function=FunctionDefinition(
name="test_tool",
description="Test tool",
parameters={
"type": "object",
"properties": {
"description": "Invalid field",
"param1": {"type": "string"}
},
"required": ["description", "param1"]
}
)
)
sanitized = _sanitize_tools([mock_tool])
assert len(sanitized) == 1, "Should return one tool"
assert "description" not in sanitized[0].function.parameters.get("properties", {}), \
"Should remove invalid 'description' from properties"
print(" ✓ Tool sanitization removes invalid fields")
print("\n✅ All tool execution loop tests passed!")
@pytest.mark.asyncio
async def test_no_tool_parsing():
"""Test that normal responses without tools work."""
print("\nTesting response without tools...")
# Test normal response
normal_text = "This is a normal response without any tool calls."
content, tool_calls = parse_tool_calls(normal_text)
assert tool_calls is None, "No tool calls should be found"
assert content == normal_text, "Content should be returned unchanged"
print(" ✓ Normal responses pass through without modification")
print("\n✅ No-tool parsing test passed!")
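The parsing behavior exercised above (one or more `TOOL:`/`ARGUMENTS:` blocks, arguments kept as a JSON string, `None` for plain text) can be sketched with a small regex-based parser. This is an illustration of the contract the tests describe, not the project's actual `parse_tool_calls`:

```python
import json
import re
from typing import Dict, List, Optional, Tuple

# Matches "TOOL: <name>" followed by an "ARGUMENTS: {...}" line,
# stopping at the next TOOL block or end of text.
_TOOL_RE = re.compile(
    r"TOOL:\s*(\w+)\nARGUMENTS:\s*(\{.*?\})(?=\nTOOL:|\Z)", re.DOTALL
)


def parse_tool_calls_sketch(text: str) -> Tuple[str, Optional[List[Dict]]]:
    """Return (content, tool_calls); tool_calls is None when no TOOL blocks exist."""
    calls = []
    for i, m in enumerate(_TOOL_RE.finditer(text)):
        name, raw_args = m.group(1), m.group(2)
        try:
            json.loads(raw_args)  # validate, but keep arguments as a JSON string
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than failing the whole response
        calls.append({
            "id": f"call_{i}",
            "type": "function",
            "function": {"name": name, "arguments": raw_args},
        })
    if not calls:
        return text, None
    content = _TOOL_RE.sub("", text).strip()
    return content, calls
```

Under this sketch the single-tool, multi-tool, and no-tool cases from the tests all behave as asserted.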
if __name__ == "__main__":
async def run_tests():
try:
await test_tool_execution_loop()
await test_no_tool_parsing()
print("\n" + "=" * 60)
print("All integration tests passed!")
print("=" * 60)
except AssertionError as e:
print(f"\n❌ Test failed: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
except Exception as e:
print(f"\n❌ Test error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
asyncio.run(run_tests())
+2 -2
@@ -133,7 +133,7 @@ ls -la
def test_tool_instructions_content():
"""Test that tool instructions contain required sections (REVIEW-2026-02-24 Blocker #4)."""
-from api.formatting import _load_tool_instructions
+from api.routes import _load_tool_instructions
# Load instructions from config file
instructions = _load_tool_instructions()
@@ -147,7 +147,7 @@ def test_tool_instructions_content():
def test_tool_instructions_token_count():
"""Test that tool instructions are within token budget (REVIEW-2026-02-24 Blocker #1)."""
-from api.formatting import _load_tool_instructions
+from api.routes import _load_tool_instructions
# Load instructions from config file
instructions = _load_tool_instructions()
-105
@@ -1,105 +0,0 @@
"""Test to verify tool execution is triggered when model generates tool calls."""
import asyncio
import sys
import os
import pytest
sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'src'))
@pytest.mark.asyncio
async def test_tool_execution_triggered():
"""Verify that tool execution is properly triggered."""
from api.models import ChatMessage, ChatCompletionRequest
from api.chat_handlers import handle_chat_completion
from api.tool_parser import parse_tool_calls
from tools.executor import ToolExecutor, set_tool_executor
print("=" * 60)
print("Tool Execution Trigger Test")
print("=" * 60)
# Create a mock swarm that generates a tool call
class MockSwarm:
async def generate(self, prompt, max_tokens, temperature, use_consensus):
# First call: generate tool call
if "user" in prompt and "echo hello" in prompt:
return MockResult("TOOL: bash\nARGUMENTS: {\"command\": \"echo hello\"}")
# Second call: after tool result, generate answer
elif "tool" in prompt.lower():
return MockResult("Output: hello\nThe command executed successfully!")
else:
return MockResult("I don't understand")
class MockResult:
def __init__(self, text):
self.selected_response = MockSelectedResponse(text)
class MockSelectedResponse:
def __init__(self, text):
self.text = text
self.tokens_generated = 20
self.tokens_per_second = 5.0
# Set up tool executor
executor = ToolExecutor(tool_host_url=None)
set_tool_executor(executor)
# Create request
request = ChatCompletionRequest(
model="test-model",
messages=[ChatMessage(role="user", content="echo hello")],
tools=None, # No explicit tools - should still parse from response
max_tokens=1024,
temperature=0.7
)
print("\n1. Testing that tool calls are parsed...")
model_response = "TOOL: bash\nARGUMENTS: {\"command\": \"echo hello\"}"
content, tool_calls = parse_tool_calls(model_response)
assert tool_calls is not None, "Tool calls should be parsed from response"
assert len(tool_calls) == 1, "Should have one tool call"
print(f" ✓ Tool call parsed: {tool_calls[0]['function']['name']}")
print("\n2. Verifying tool executor is set...")
from tools.executor import get_tool_executor
current_executor = get_tool_executor()
assert current_executor is not None, "Tool executor should be set"
print(f" ✓ Tool executor configured: {current_executor.tool_host_url or 'local'}")
print("\n3. Testing tool execution...")
# Try to execute the tool
try:
from api.routes import execute_tool_server_side
result = await execute_tool_server_side(
"bash",
{"command": "echo hello"},
working_dir=None
)
print(f" ✓ Tool executed successfully")
        preview = f"{result[:50]}..." if len(result) > 50 else result
        print(f" ✓ Result: {preview}")
except Exception as e:
print(f" ✗ Tool execution failed: {e}")
raise
print("\n" + "=" * 60)
print("All tool execution trigger tests passed!")
print("=" * 60)
if __name__ == "__main__":
try:
asyncio.run(test_tool_execution_triggered())
except AssertionError as e:
print(f"\n❌ Test failed: {e}")
import traceback
traceback.print_exc()
sys.exit(1)
except Exception as e:
print(f"\n❌ Test error: {e}")
import traceback
traceback.print_exc()
sys.exit(1)