feat: comprehensive tool system improvements and webfetch support (#3)

* feat: enhanced tool instructions for multi-step operations - Add comprehensive examples for ls, find, grep, mkdir, npm init, etc. - Explain multi-step workflow (explore → read → write) - Tool system already supports chaining via conversation history - Bash tool supports: ls, find, grep, cat, mkdir, cd, npm, etc. - 30 second timeout on commands - Output limited to 3000 chars for readability * Cleanup: Consolidate documentation and tidy codebase Documentation: - Consolidate 6 markdown files into simplified README.md - Remove redundant docs: TODO.md, NETWORK.md, REVIEW.md, PLAN.md, CONTEXT.md, GUIDE.md - Add ARCHITECTURE.md with clean technical overview - README now focuses on quick start and core concepts Code verification: - Verified blocking I/O properly wrapped in asyncio.to_thread() - Confirmed locks initialized correctly in backends - AMD VRAM detection uses proper regex (takes max value, not first match) - All exception handling uses 'except Exception:' (not bare except) Tool execution improvements (existing changes): - Better working directory handling with project root detection - Extended timeouts for package managers (300s) - Multi-tool call parsing support - Improved error handling and logging Note: System prompt concern noted - 30k tokens too large for 16-32k context windows * docs: add development patterns analysis Document circular development issues identified in commit history: - Tool execution went back-and-forth 3+ times (server-side vs client-side) - Tool instructions changed from 40k → 300 → removed → enhanced tokens - 8+ parsing fixes for same issues (no tests) - 6 debug-only commits (production debugging) Provides recommendations to prevent future cycles: 1. Pick one architecture and stick with it 2. Add unit tests before fixes 3. Token budget (<2000 for instructions) 4. One format only (remove alternative parsers) 5. Integration test script 6. Separate concerns into smaller modules 7. Design doc before code changes 8. CI/CD with automated testing * docs: add comprehensive agent guidelines AGENT_WORKER.md (600+ lines): - Pre-flight checklist: token budget, test plan, design doc - Coding rules: TDD, no debug code, architecture consistency - Git workflow: branching strategy, commit rules, release process - Testing requirements: unit (≥80%), integration structure - Code quality: PEP 8, type hints, max 50 lines per function - Architecture: no feature flags, separation of concerns - Continuous learning: research requirements, documentation - Forbidden patterns: bare except, production debugging, etc. AGENT_REVIEW.md (400+ lines): - Review philosophy: prevent circular development - 6-phase review checklist: structure, quality, tokens, architecture, research, logic - Report format with token impact analysis - Severity levels: blocking vs warnings vs approved - Common issues with examples (good vs bad) - Review workflow: 30-35 min per PR - Reports stored in reports/ folder (gitignored) Also added: - tests/test_tool_parsing.py - example test following guidelines - Updated DEVELOPMENT_PATTERNS.md with recommendations Reports folder in .gitignore for local review storage * chore: gitignore review reports folder * feat: fix tool execution and enhance instructions with accurate token counting - Enhanced tool instructions (1041 tokens, within 2000 budget) - Added tiktoken>=0.5.0 for accurate token counting - Fixed subprocess hang by adding stdin=subprocess.DEVNULL - Removed 9 DEBUG print statements from routes.py - Added tests for instruction content and token budget verification - All tests pass (11/11) Resolves blockers from previous review: - Token budget verified ✓ - Token documentation added ✓ - Debug code cleaned ✓ - Missing tests added ✓ * feat: implement comprehensive tool system with proper logging Major improvements to tool instructions and execution: - Enhanced tool instructions with 7-step task completion workflow - Added markdown code block fallback parser for tool calls - Fixed subprocess hang with stdin=subprocess.DEVNULL - Fixed streaming path to return tool_calls (enabling multi-turn conversations) - Added complete React project creation example with verification steps - Token count: 1,743 tokens (within 2,000 limit) Logging infrastructure: - Created centralized logging configuration (src/utils/logging_config.py) - Replaced 80+ print statements with logger.debug() - Set log level to DEBUG for development - All modules now use proper logging instead of print Testing: - Added 4 new tests for markdown parsing and instruction content - All 13 tests passing - Token budget verification test Documentation: - Added comprehensive design docs for all major changes - Added test plans for verification - Created helper scripts for logging migration Files changed: - main.py: Added logging setup - src/api/routes.py: Tool instructions, streaming fixes, logging - src/tools/executor.py: subprocess fix, logging - src/utils/: New logging configuration module - tests/test_tool_parsing.py: New tests - docs/: Design decisions and test plans - scripts/: Helper scripts for development * refactor: simplify tool instructions to 109 tokens for 7B model Reduced from 1,743 tokens to 109 tokens (94% reduction) to help qwen2.5 7B 4bit model follow instructions better. Changes: - Removed complex workflow documentation - Removed multi-turn conversation examples - Removed lengthy anti-patterns - Kept only essential format and rules - Updated tests to match simplified content Before: 1,743 tokens, 6,004 chars (87% of budget) After: 109 tokens, 392 chars (5.5% of budget) This should make it much easier for smaller models to: 1. Understand they must use tools 2. Follow the simple TOOL: format 3. Not get overwhelmed by instructions * refactor: make tool instructions ultra-direct for 7B models Further simplify instructions to prevent model from adding explanations. Before: 109 tokens - model still added explanatory text After: 86 tokens - ultra-direct commands Key changes: - Start with 'You MUST use tools. DO NOT explain.' - 'OUTPUT THIS EXACT FORMAT - NOTHING ELSE' - Removed all examples and pleasantries - Added 'NEVER' rules in all caps - 'ONLY output TOOL: lines' The model was outputting: '1. First, install... TOOL: bash ARGUMENTS: {...}' Now should output just: 'TOOL: bash ARGUMENTS: {...}' This should force the 7B qwen model to stop explaining and just execute. * refactor: move tool instructions to external config file Moves hardcoded tool instructions from routes.py to external config file for better maintainability and easier editing. Changes: - Created config/prompts/tool_instructions.txt - Added _load_tool_instructions() function with caching - Falls back to default if config file not found - Updated tests to use the loader function - Added proper error handling Benefits: - Easier to modify instructions without code changes - Instructions can be edited by non-developers - Cleaner separation of config vs code - Supports hot-reloading (cached but easy to invalidate) Token count: 86 tokens (loaded from file) Location: config/prompts/tool_instructions.txt * refactor: simplify tool instructions further and add debug logging - Reduced instructions to bare minimum: 50 tokens - Added debug logging to verify instructions are sent - Removed all caps and aggressive language - Made instructions more straightforward Instructions now: 'Use tools to execute commands. Output only tool calls. Format: TOOL: bash ARGUMENTS: {...} No explanations. No numbered lists. No markdown. Only tool calls.' This should be easier for 7B models to follow while still conveying the essential requirements. * feat: improve tool parser to handle 7B model output variations Enhanced parse_tool_calls() with multiple fallback strategies: 1. Standard TOOL:/ARGUMENTS: format (original) 2. Markdown code blocks () 3. Numbered list items (1. npm install ...) 4. Standalone bash commands (npm, npx, mkdir, etc.) Now handles messy output from small models like: '1. Install: npm install -g create-react-app' '2. Create: create-react-app hello-world' Parses these into chained bash commands for execution. Also simplified instructions to 50 tokens minimum: 'Use tools to execute commands. Output only tool calls. Format: TOOL: bash ARGUMENTS: {...} No explanations. No numbered lists. No markdown. Only tool calls.' This combination should make 7B models much more likely to have their output successfully parsed and executed. * fix: improve command extraction for 7B model output Parser now extracts bash commands from any line containing: - npm, npx, mkdir, cd, ls, cat, echo, git, python, pip, node, yarn - create-react-app (added for React projects) Example: Extracts 'npm install -g create-react-app' from: '1. Install: npm install -g create-react-app' Chains multiple commands with && for sequential execution. This should now successfully parse the numbered list output from 7B models and execute the commands. * feat: add bash tool description validation and improve 7B model parsing Changes: - Added _ensure_tool_arguments() function to inject 'description' field - Updated tool_instructions.txt to require description for bash tool - Improved 7B model command extraction with better regex patterns - Added 'create-react-app' to command detection list - Updated delta field type to Dict[str, Any] for streaming - Added GGUF to MLX quantization mapping for registry.py - Clarified agent responsibilities in AGENT_REVIEW.md and AGENT_WORKER.md Fixes: - Bash tool now validates required 'description' field - 7B model output parsed more reliably (numbered lists) - Multiple commands chained with && for sequential execution Token count: 69 tokens (down from 86, -19.8%) All tests pass: 13/13 * feat: add webfetch tool support with URL extraction Changes: - Added webfetch to tool instructions config - Added URL extraction pattern to parse_tool_calls() - Parser now recognizes URLs and creates webfetch tool calls - Updated token count: 89 tokens (+29% from 69) The webfetch tool is available through opencode environment. System prompt adjustment enables model to use it for URL fetching. Token budget: 89 tokens (4.45% of 2000 limit) Tests pass: 13/13
2026-02-24 22:35:05 +01:00
parent 40fe75c738
commit 580d1e5d17
34 changed files with 3829 additions and 3152 deletions
@@ -151,3 +151,6 @@ cython_debug/
 config.local.yaml
 *.pid
 logs/
 # Review reports
 reports/
@@ -0,0 +1,427 @@
 # Agent Reviewer Rules
 > **⚠️ IMPORTANT:** This document is for REVIEW AGENTS who handle commits, PRs, and code reviews.
 > Regular agents follow AGENT_WORKER.md for implementation tasks and DO NOT make commits.
 ## Review Philosophy
 **Mission:** Prevent the circular development patterns identified in commit history.
 **Standards:**
 - Reject code that doesn't meet quality bar
 - Ask for tests, don't accept "I'll add them later"
 - Check token counts for prompt changes
 - Verify architectural consistency
 - Demand clear error messages
 **Reviewer Authority:**
 - Can block PR for: missing tests, token bloat, architecture violations
 - Cannot approve own code
 - Must provide constructive feedback with specific fixes
 ## Review Checklist
 ### Phase 1: Structure & Hygiene (Block if failed)
 - [ ] **Branch naming follows convention**
  - Format: `type/description` (e.g., `fix/tool-parsing`)
  - Not: `quick-fix`, `temp-branch`, `dev`
 - [ ] **Commit messages are clear**
  - Format: `type(scope): description`
  - No: `fix stuff`, `WIP`, `asdf`, `omg finally`
  - Each commit should be reviewable independently
 - [ ] **No production debugging code**
  - Search for: `print(`, `console.log`, `debugger`, `TODO`, `FIXME`, `XXX`
  - Check: No commented-out code blocks
  - Check: No temporary files committed
 - [ ] **Git history is clean**
  - No "fix typo" commits after initial commit
  - No "WIP" commits in PR
  - No merge commits (rebase instead)
  - Squash fixup commits
 ### Phase 2: Code Quality (Block if failed)
 - [ ] **Tests exist and pass**
  - Unit tests for new functions
  - Integration tests for API changes
  - Run: `pytest -v` (must pass)
  - Coverage: ≥80% for new code
  - **BLOCKING:** No tests = No merge
 - [ ] **Type hints present**
  - All function parameters typed
  - All return values typed
  - Run: `mypy src/` (must pass with zero errors)
 - [ ] **No code smells**
  - No functions > 50 lines
  - No files > 300 lines
  - No indentation > 3 levels deep
  - No circular imports
  - No duplicate code (>3 lines copied)
 - [ ] **Error handling is robust**
  - No bare `except:` clauses
  - All errors have clear messages
  - No silent failures
  - Edge cases handled
 - [ ] **Documentation is adequate**
  - All public functions have docstrings
  - Complex logic has inline comments
  - README updated if user-facing change
  - Architecture doc updated if pattern changes
 ### Phase 3: Token Budget (Block if failed)
 **For any prompt/instruction changes:**
 - [ ] **Token count documented**
  - Before: X tokens
  - After: Y tokens  
  - Change: +/- Z tokens
 - [ ] **Within budget**
  - System prompt + instructions ≤ 2000 tokens (HARD LIMIT)
  - Leaves ≥ 50% context window for user input
  - **BLOCKING:** Over budget = Request reduction
 - [ ] **Efficient wording**
  - No redundant examples
  - No verbose explanations
  - Prefer code over prose
 **Token Counting Command:**
 ```bash
 # Count tokens in a string
 echo "Your prompt here" | python -c "import sys; import tiktoken; enc = tiktoken.get_encoding('cl100k_base'); print(len(enc.encode(sys.stdin.read())))"
 ```
 ### Phase 4: Architecture (Block if failed)
 - [ ] **Consistent with ARCHITECTURE.md**
  - No new patterns without updating docs
  - No mixing of concerns
  - Follows existing module structure
 - [ ] **No architecture changes in fixes**
  - Bug fixes should not refactor
  - Refactors should be separate PRs
  - **Exception:** If fix requires arch change, document WHY
 - [ ] **Parser rules**
  - Only ONE parser per format
  - No alternative parsing paths
  - Clear regex patterns
  - Handles all documented cases
 - [ ] **No feature flags in core**
  - Code should not have `if config.get("ENABLE_X"):`
  - Pick one approach, remove old one
  - A/B testing only in separate branch
 ### Phase 5: Research & Continuous Learning
 **For significant changes (>100 lines or new algorithms):**
 - [ ] **Research documented**
  - Check `research/` folder for related findings
  - PR description mentions alternatives considered
  - Links to sources (docs, papers, repos)
  - Not: "I thought this would work"
  - Yes: "Based on [source], this approach handles [case] better than [alternative]"
 - [ ] **Best practices followed**
  - Implementation matches current language/framework conventions
  - No deprecated patterns
  - Modern Python features used appropriately (3.9+)
 - [ ] **No reinvention**
  - Check if standard library solves the problem
  - Check if well-maintained package exists
  - If custom implementation needed, document WHY
 **Research Documentation Requirements:**
 ```markdown
 ## Research
 - Alternatives considered: [list]
 - Sources: [links]
 - Decision: [why chosen approach]
 - Benchmarks: [if applicable]
 ```
 ### Phase 6: Logic Correctness
 - [ ] **Logic is sound**
  - Read through the code
  - Check edge cases
  - Verify error conditions
  - Question anything unclear
 - [ ] **No performance regressions**
  - No blocking I/O in async functions (unless wrapped)
  - No memory leaks
  - No N+1 queries
  - Reasonable algorithmic complexity
 - [ ] **Security check**
  - No SQL injection vectors
  - No command injection (bash execution sanitized)
  - Path traversal protection (for file ops)
  - No secrets in code
 ## Review Report Format
 After review, write a report to `reports/PR-{number}-{branch}.md`:
 ```markdown
 # Review Report: PR #{number} - {branch}
 **Reviewer:** {your name}
 **Date:** {YYYY-MM-DD}
 **Status:** [APPROVED / CHANGES_REQUESTED / BLOCKED]
 ## Summary
 Brief description of what this PR does and overall quality assessment.
 ## Detailed Findings
 ### ✅ Passed
 - [List items that passed review]
 - [Be specific: "Tests cover 85% of new code"]
 ### ⚠️ Warnings (Non-blocking)
 - [Minor issues that don't block merge]
 - [Style suggestions]
 - [Future improvements]
 ### ❌ Blockers (Must fix)
 1. **[Category]** [Specific issue]
   - **Location:** `file.py:123`
   - **Problem:** [What's wrong]
   - **Fix:** [Exactly what to change]
   - **Why:** [Why this matters]
 2. **[Category]** [Specific issue]
   - ...
 ## Token Impact Analysis
 - Component: [what changed]
 - Before: [X] tokens
 - After: [Y] tokens
 - Impact: [+/- Z] tokens
 - Within budget: [Yes/No]
 ## Test Coverage
 - New code coverage: [X]%
 - Tests pass: [Yes/No]
 - Integration tests: [Present/Missing]
 ## Architecture Review
 - Follows existing patterns: [Yes/No]
 - Introduces new dependencies: [List if any]
 - Breaking changes: [Yes/No - explain if yes]
 ## Research Review
 - Alternatives considered: [Listed/None]
 - Sources cited: [Yes/No]
 - Best practices followed: [Yes/No]
 - Research documented: [Yes/No - location]
 ## Code Quality Score
 - Structure: [0-10]
 - Testing: [0-10]
 - Documentation: [0-10]
 - Logic: [0-10]
 - **Overall: [0-10]**
 ## Action Items
 - [ ] [Specific fix needed]
 - [ ] [Specific fix needed]
 - [ ] [Test to add]
 ## Verdict
 [APPROVED / CHANGES_REQUESTED / BLOCKED]
 **If CHANGES_REQUESTED:** 
 - Address all blockers
 - Re-request review when ready
 **If BLOCKED:**
 - Major issues require architecture discussion
 - Schedule meeting before continuing
 ```
 ## Severity Levels
 ### 🔴 BLOCKING (Cannot merge)
 - Missing tests for new functionality
 - Token budget exceeded
 - Bare `except:` clauses
 - Production debugging code (`print` statements)
 - Breaking changes without documentation
 - Security vulnerabilities
 - Tests failing
 - Type check errors
 - Architecture violations
 ### 🟡 CHANGES_REQUESTED (Fix before merge)
 - Unclear variable names
 - Missing docstrings
 - Inefficient algorithms
 - Missing error handling
 - Unclear commit messages
 - Minor style issues
 ### 🟢 APPROVED (Optional suggestions)
 - Style preferences
 - Future improvements
 - Optional refactors
 ## Common Issues to Watch For
 ### Issue 1: Tool Parsing Duplication
 ```python
 # ❌ WRONG - Multiple parsers
 def parse_tools_v1(text): ...
 def parse_tools_v2(text): ...
 def parse_tools_legacy(text): ...
 # ✅ CORRECT - Single parser
 TOOL_PATTERN = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
 ```
 **Check:** Search for "def parse" - should be ONE per format.
 ### Issue 2: Token Bloat
 ```python
 # ❌ WRONG - Too verbose
 SYSTEM_PROMPT = """
 You are an AI assistant. Here are detailed instructions...
 [2000 words of explanation]
 [10 examples]
 """
 # ✅ CORRECT - Concise
 SYSTEM_PROMPT = """Use TOOL: name\nARGUMENTS: {...} format. Available: read, write, bash."""
 ```
 **Check:** Count tokens, verify < 2000.
 ### Issue 3: Architecture Drift
 ```python
 # ❌ WRONG - Mixing concerns in one file
 # src/api/routes.py
 def handle_request(): ...
 def parse_tools(): ...
 def execute_tool(): ...
 def format_response(): ...
 # ✅ CORRECT - Separated
 # src/api/routes.py - only HTTP handling
 # src/tools/parser.py - only parsing
 # src/tools/executor.py - only execution
 ```
 **Check:** Each module has ONE responsibility.
 ### Issue 4: Debug Code Left In
 ```python
 # ❌ WRONG
 def process(data):
    print(f"DEBUG: data={data}")  # REMOVE THIS
    result = transform(data)
    print(f"DEBUG: result={result}")  # REMOVE THIS
    return result
 # ✅ CORRECT
 logger = logging.getLogger(__name__)
 def process(data):
    logger.debug("Processing data", extra={"data_size": len(data)})
    return transform(data)
 ```
 **Check:** `grep -r "print(" src/ --include="*.py" | grep -v "^#"`
 ### Issue 5: Missing Error Context
 ```python
 # ❌ WRONG
 raise ValueError("Invalid input")
 # ✅ CORRECT
 raise ValueError(f"Invalid model format: '{model_str}'. Expected: 'name:size:quant' (e.g., 'qwen:7b:q4')")
 ```
 **Check:** All errors explain what was expected vs received.
 ## Review Workflow
 1. **First Pass: Structure** (5 min)
   - Check branch name, commits, no debug code
   - If failed → Write report, BLOCK
 2. **Second Pass: Quality** (10 min)
   - Run tests, check types, review code
   - If failed → Write report, CHANGES_REQUESTED
 3. **Third Pass: Deep Dive** (15 min)
   - Read logic, check edge cases
   - Verify token counts
   - Check architecture
   - Write detailed report
 4. **Final Decision** (5 min)
   - APPROVE / CHANGES_REQUESTED / BLOCK
   - Write report to `reports/` folder
   - Post summary in PR comments
 **Total time per review: 30-35 minutes**
 ## Reviewer Self-Check
 Before submitting review:
 - [ ] I ran all tests locally
 - [ ] I checked type hints
 - [ ] I counted tokens (if applicable)
 - [ ] I read every line of changed code
 - [ ] My feedback is specific and actionable
 - [ ] I explained WHY for each blocker
 - [ ] I wrote a report to `reports/` folder
 ## Escalation
 Escalate to architecture discussion if:
 - PR changes core patterns
 - Token budget cannot be met
 - Two reviewers disagree
 - Breaking changes proposed
 **Don't just approve to be nice.** 
 **Don't let technical debt accumulate.**
 ## Report Storage
 All reports go in `reports/` folder:
 ```
 reports/
 ├── PR-123-fix-tool-parsing.md
 ├── PR-124-add-federation.md
 ├── PR-125-refactor-consensus.md
 └── README.md  # Index of all reviews
 ```
 **This folder is gitignored - reports stay local.**
 Generate index with:
 ```bash
 ls -1 reports/PR-*.md | sort -t'-' -k2 -n > reports/README.md
 ```
 ---
 **Remember: You're the last line of defense against technical debt. Be thorough, be kind, be strict.**
@@ -0,0 +1,790 @@
 # Agent Worker Rules
 > **⚠️ IMPORTANT:** This document is for IMPLEMENTATION AGENTS (coding, testing, documentation).
 > **DO NOT MAKE COMMITS** - that's the AGENT_REVIEW.md agent's job.
 ## Pre-Flight Checklist (MUST complete before coding)
 ### ⚠️ GIT OPERATIONS REMINDER
 **DO NOT make commits.** Commits are ONLY handled by AGENT_REVIEW.md agents.
 You CAN create branches and stage files (git add), but DO NOT commit (git commit).
 ### 1. Token Budget Verification
 - [ ] System prompt + instructions ≤ 2000 tokens (hard limit)
 - [ ] Leave ≥ 50% of context window for user input
 - [ ] If adding documentation/examples, remove old ones to maintain budget
 - [ ] Use `tiktoken` or estimate: ~4 chars = 1 token
 ### 2. Test Plan Required
 Before writing ANY code, write a test plan:
 ```markdown
 ## Test Plan for [Feature]
 ### Unit Tests
 - [ ] Test case 1: [specific input] → [expected output]
 - [ ] Test case 2: [edge case]
 - [ ] Test case 3: [error condition]
 ### Integration Tests  
 - [ ] End-to-end flow: [steps]
 - [ ] Expected result: [what success looks like]
 ### Manual Verification
 - [ ] Command to run: [exact command]
 - [ ] Expected output: [what to see]
 ```
 ### 3. Design Decision Document
 For any change > 50 lines:
 ```markdown
 ## Design Decision
 ### Problem
 [What are we solving?]
 ### Options Considered
 1. [Option A] - Pros: ..., Cons: ...
 2. [Option B] - Pros: ..., Cons: ...
 ### Decision
 [Which option and WHY]
 ### Impact
 - Token count change: [+/- X tokens]
 - Breaking changes: [Yes/No]
 - Migration needed: [Yes/No]
 ```
 ## Coding Rules
 ### Rule 1: One Feature = One Commit
 **NOTE:** Regular agents DO NOT make commits. AGENT_REVIEW.md agents handle commits.
 When AGENT_REVIEW.md agents make commits:
 - Never combine unrelated changes in one commit
 - If you fix a bug AND refactor, make 2 commits
 - Commit message format: `type(scope): description`
  - Types: `feat`, `fix`, `refactor`, `test`, `docs`, `chore`
  - Example: `feat(tools): add working directory support`
 ### Rule 2: Tests First (TDD)
 ```python
 # BAD: Write code, maybe test later
 def parse_tools(text):
    # ... implementation ...
    pass
 # GOOD: Write test first
 def test_parse_simple_tool():
    text = 'TOOL: read\nARGUMENTS: {"filePath": "test.txt"}'
    content, tools = parse_tool_calls(text)
    assert len(tools) == 1
    assert tools[0]["function"]["name"] == "read"
 # Then write minimal code to pass
 ```
 ### Rule 3: No Production Debugging
 - NEVER add `print()` statements for debugging
 - Use `logging` module with appropriate levels
 - Remove ALL debug logging before committing
 - Exception: Structured logging for observability (metrics, errors)
 ```python
 # BAD
 def process_request(request):
    print(f"DEBUG: Got request {request}")  # REMOVE THIS
    result = handle(request)
    print(f"DEBUG: Result {result}")  # REMOVE THIS
    return result
 # GOOD
 def process_request(request):
    logger.debug("Processing request", extra={"request_id": request.id})
    result = handle(request)
    return result
 ```
 ### Rule 4: Architecture Consistency
 - Check ARCHITECTURE.md before changing patterns
 - If unsure, ask in PR description
 - NEVER change architecture in a "fix" commit
 - Architecture changes require design doc + team review
 ### Rule 5: Parse Once, Parse Well
 - ONE parser per format
 - If adding new format, remove old one
 - Parser must handle all documented cases
 - Parser must fail gracefully (return empty, not crash)
 ```python
 # BAD: Multiple parsers for same thing
 def parse_tools_v1(text): ...
 def parse_tools_v2(text): ...
 def parse_tools_legacy(text): ...
 # GOOD: Single parser with clear regex
 TOOL_PATTERN = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
 def parse_tool_calls(text: str) -> Tuple[str, List[dict]]:
    matches = list(re.finditer(TOOL_PATTERN, text, re.IGNORECASE))
    if not matches:
        return text, []
    # ... rest of parsing ...
 ```
 ### Rule 6: Token-Aware Documentation
 - Every docstring/example has a token cost
 - Count tokens before adding
 - If over budget, remove something else
 - Prioritize: Code clarity > Examples > Explanations
 ```python
 # BAD: 150 tokens of fluff
 def calculate(x, y):
    """
    This function calculates the sum of two numbers.
    The sum is calculated by using the built-in Python 
    addition operator which adds the values together.
    Args:
        x (int): The first number to add
        y (int): The second number to add
    Returns:
        int: The sum of x and y
    Example:
        >>> calculate(1, 2)
        3
    """
    return x + y
 # GOOD: 20 tokens, clear enough
 def calculate(x: int, y: int) -> int:
    """Return sum of x and y."""
    return x + y
 ```
 ### Rule 7: Clear Error Messages
 - Every error must tell user EXACTLY what went wrong
 - Include context: what was expected vs what was received
 - Suggest fix if possible
 ```python
 # BAD
 raise ValueError("Invalid input")
 # GOOD
 raise ValueError(f"Invalid model format: '{model_str}'. Expected: 'name:size:quant' (e.g., 'qwen:7b:q4')")
 ```
 ### Rule 8: No Circular Imports
 ```python
 # BAD: src/a.py imports src/b.py, src/b.py imports src/a.py
 # GOOD: Use dependency injection or move shared code to common module
 ```
 ## Git Workflow Rules
 ### CRITICAL: Commit Handling
 **REGULAR AGENTS: DO NOT MAKE COMMITS**
 - Regular agents do NOT create commits, pull requests, or manage git history
 - Commits are ONLY handled by agents following AGENT_REVIEW.md guidelines
 - If you need to commit code, the AGENT_REVIEW.md agent should handle it
 - Exception: You may manually stage files (git add) for the review agent
 - **You CAN create and checkout branches** (that's fine) - just don't commit to them
 ### Branch Strategy
 **Main Branches (Protected):**
 - `main` - Production-ready code only
 - `develop` - Integration branch for features (optional for small projects)
 **Working Branches (Temporary - AGENT_REVIEW.md ONLY):**
 ```
 feature/description           # New features
 fix/description               # Bug fixes  
 refactor/description          # Code refactoring
 hotfix/description            # Critical production fixes
 docs/description              # Documentation only
 experiment/description        # Experimental work (may be deleted)
 ```
 **Note:** Regular agents should NOT create branches or handle git operations
 ### Workflow Steps
 #### 1. Starting New Work
 ```bash
 # ALWAYS start from main
 git checkout main
 git pull origin main
 # Create feature branch
 git checkout -b feature/description
 # Push branch to remote immediately
 git push -u origin feature/description
 ```
 #### 2. During Development
 ```bash
 # Commit often (small, logical commits)
 git add -p  # Stage interactively (review each change)
 git commit -m "feat(scope): description"
 # Push regularly (backup)
 git push origin feature/description
 # Keep up-to-date with main
 git fetch origin
 git rebase origin/main  # Resolve conflicts immediately
 ```
 #### 3. Before PR (Final Cleanup)
 ```bash
 # Interactive rebase to clean history
 git rebase -i main
 # Squash these:
 # - "fix typo"
 # - "WIP"
 # - "asdf"
 # - "omg finally"
 # - Multiple attempts at same fix
 # Keep separate:
 # - Logical feature steps
 # - Refactoring separate from features
 # - Test additions separate from code changes
 ```
 #### 4. Creating PR
 - Push final branch: `git push origin feature/description`
 - Create PR to `main` (not develop unless project uses git-flow)
 - Fill PR template completely
 - Request review from AGENT_REVIEW.md qualified reviewer
 - Link related issues: `Closes #123`, `Fixes #456`
 ### Commit Rules
 **Commit Frequency:**
 - Commit after each logical step (not just at end of day)
 - Each commit should leave codebase in working state
 - "Work in progress" commits OK on feature branches (clean before PR)
 **Commit Size:**
 - Max 200 lines changed per commit
 - Max 5 files changed per commit (unless related)
 - Each commit reviewable in 5 minutes
 - Split large changes:
  ```bash
  # BAD: One giant commit
  git commit -am "Add federation + fix bugs + refactor + docs"
  # GOOD: Separate commits
  git commit -m "refactor(network): extract peer discovery logic"
  git commit -m "feat(federation): implement cross-swarm voting"
  git commit -m "fix(federation): handle peer timeout edge case"
  git commit -m "docs: update federation architecture docs"
  ```
 **Commit Message Format:**
 ```
 type(scope): subject (50 chars or less)
 Body (wrap at 72 chars):
 - Why this change was made
 - What problem it solves  
 - Any breaking changes or migration notes
 Refs: #123, #456
 ```
 **Types:**
 - `feat`: New feature
 - `fix`: Bug fix
 - `refactor`: Code restructuring (no behavior change)
 - `test`: Adding/updating tests
 - `docs`: Documentation only
 - `chore`: Build, dependencies, tooling
 - `perf`: Performance improvement
 - `style`: Formatting (no code change)
 **Subject Rules:**
 - Use imperative mood: "Add feature" not "Added feature"
 - No period at end
 - Lowercase after type
 - Max 50 characters
 ### Branch Hygiene
 **DO:**
 - Create branch from latest main
 - Use descriptive branch names
 - Push branch to remote immediately
 - Rebase onto main regularly
 - Delete merged branches
 - Squash fixup commits before PR
 **DON'T:**
 - Commit directly to main
 - Have long-lived branches (>1 week without rebase)
 - Include unrelated changes in one branch
 - Commit broken code (even temporarily)
 - Force push to shared branches
 - Merge without review
 ### Handling Conflicts
 ```bash
 # While rebasing
 git rebase main
 # Conflicts happen...
 # Resolve conflicts in files
 git add <resolved-files>
 git rebase --continue
 # If messed up, abort
 git rebase --abort
 ```
 **Conflict Resolution Rules:**
 1. Understand both changes before resolving
 2. Don't just pick "ours" or "theirs"
 3. Test after resolving
 4. Commit message should explain resolution
 ### Emergency Procedures
 **Committed to wrong branch:**
 ```bash
 # Undo last commit (keep changes)
 git reset HEAD~1
 # Stash changes
 git stash
 # Switch to correct branch
 git checkout correct-branch
 # Apply changes
 git stash pop
 # Commit properly
 git commit -m "..."
 ```
 **Need to undo pushed commit:**
 ```bash
 # Revert (creates new commit, safe for shared history)
 git revert <commit-hash>
 git push origin branch-name
 # OR if feature branch not shared yet
 # Reset and force push (DANGEROUS)
 git reset --hard HEAD~1
 git push --force-with-lease origin branch-name
 ```
 ### Release Process
 **NOTE:** Release process should be handled by AGENT_REVIEW.md agents.
 ```bash
 # Create release branch
 git checkout -b release/v1.2.0
 # Bump version, update changelog
 git commit -m "chore: bump version to 1.2.0"
 # Tag release
 git tag -a v1.2.0 -m "Release version 1.2.0"
 git push origin v1.2.0
 # Merge to main
 git checkout main
 git merge --no-ff release/v1.2.0
 git push origin main
 # Delete release branch
 git branch -d release/v1.2.0
 ```
 ### What Regular Agents Should NOT Do
 **REGULAR AGENTS DO NOT:**
 - Make commits (git commit)
 - Create pull requests
 - Push to remote repositories
 - Merge branches
 - Manage git history (rebase, reset, etc.)
 - Delete branches
 **REGULAR AGENTS CAN:**
 - Create and checkout branches (git checkout -b)
 - Stage files for review (git add)
 - Switch between branches
 **REGULAR AGENTS SHOULD:**
 - Write code and tests
 - Run tests locally
 - Use logging instead of print()
 - Follow code quality standards
 - Document changes in code comments or design docs
 - Hand off completed work to AGENT_REVIEW.md agent for commit/PR creation
 **Example Workflow:**
 ```
 1. Agent reads task from user
 2. Agent creates feature branch (git checkout -b feature/name)
 3. Agent implements feature (writes code, tests, docs)
 4. Agent stages changes for review (git add)
 5. Agent reports completion with summary of changes
 6. AGENT_REVIEW.md agent:
   - Reviews code quality
   - Makes commits
   - Creates PR
 ```
 ### Pre-Commit Checklist
 - [ ] Code passes `pytest` (if tests exist)
 - [ ] No `print()` statements (use logging)
 - [ ] No bare `except:` clauses
 - [ ] All functions have type hints
 - [ ] All public functions have docstrings
 - [ ] No TODO comments (create issues instead)
 - [ ] Token count checked (if modifying prompts)
 ## Testing Requirements
 ### Unit Test Coverage
 Minimum 80% coverage for:
 - Parsing functions
 - Business logic
 - State machines
 ### Integration Tests Required For:
 - API endpoints
 - Tool execution
 - File operations
 - Network calls (mocked)
 ### Test File Structure
 ```
 tests/
 ├── unit/
 │   ├── test_parser.py
 │   ├── test_executor.py
 │   └── test_consensus.py
 ├── integration/
 │   ├── test_api.py
 │   └── test_tools.py
 └── fixtures/
    └── sample_responses.json
 ```
 ## Code Quality Standards
 ### Python Style
 - Follow PEP 8
 - Use type hints for all function signatures
 - Max line length: 100 characters
 - Max function length: 50 lines
 - Max file length: 300 lines (split if larger)
 ### Imports (Order Matters)
 ```python
 # 1. Standard library
 import os
 import sys
 from typing import List
 # 2. Third party
 import numpy as np
 from fastapi import APIRouter
 # 3. Local (absolute imports only)
 from src.tools.executor import ToolExecutor
 from src.swarm.manager import SwarmManager
 ```
 ### Documentation Standards
 Every module must have:
 ```python
 """Module purpose in one line.
 Longer description if needed (2-3 sentences max).
 """
 ```
 Every public function must have:
 ```python
 def process_data(data: dict, options: Optional[dict] = None) -> Result:
    """Process data with given options.
    Args:
        data: Input data to process
        options: Processing options (default: None)
    Returns:
        Processed result
    Raises:
        ValueError: If data is invalid
    """
 ```
 ## Architecture Rules
 ### No Feature Flags in Core Logic
 ```python
 # BAD
 if config.get("USE_NEW_PARSER", False):
    result = new_parser(text)
 else:
    result = old_parser(text)
 # GOOD: Pick one, remove the other
 def parse_tool_calls(text: str) -> Tuple[str, List[dict]]:
    """Parse tool calls from text."""
    # Single implementation
 ```
 ### No Code Duplication
 - If you copy-paste > 3 lines, extract to function
 - Shared code goes in `src/common/` or `src/utils/`
 ### Separation of Concerns
 ```
 src/
 ├── parser/       # Only parsing logic
 ├── executor/     # Only execution logic
 ├── formatter/    # Only formatting/output
 └── integration/  # Only API glue code
 ```
 ## Forbidden Patterns
 ### Never Do These:
 1. **Bare except clauses** - Always catch specific exceptions
 2. **Production debugging** - No `print()`, use logging
 3. **Multiple return formats** - One function = one return type
 4. **Silent failures** - Always log/report errors
 5. **Magic numbers** - Use named constants
 6. **Global state** - Use dependency injection
 7. **Deep nesting** - Max 3 levels of indentation
 8. **Circular dependencies** - Re-architect if needed
 ## Review Preparation
 Before marking PR ready:
 1. **Self-Review Checklist** (check each item):
   - [ ] Tests pass: `pytest -v`
   - [ ] Type checking: `mypy src/`
   - [ ] Linting: `ruff check src/`
   - [ ] Formatting: `black src/`
   - [ ] Token count verified (if applicable)
   - [ ] No debug code left in
   - [ ] Commit messages follow format
   - [ ] Documentation updated
 2. **PR Description Template**:
   ```markdown
   ## Changes
   - [Brief description]
   ## Testing
   - [How you tested it]
   ## Token Impact (if applicable)
   - Before: X tokens
   - After: Y tokens
   - Change: +/- Z tokens
   ## Checklist
   - [ ] Tests added/updated
   - [ ] Documentation updated
   - [ ] Self-review completed
   ```
 3. **Run Final Verification**:
   ```bash
   # Run all checks
   pytest && mypy src/ && ruff check src/ && black --check src/
   ```
 ## Continuous Learning & Research
 You MUST periodically research best practices and alternative implementations. This prevents stagnation and ensures we're using proven approaches.
 ### When to Research
 **Before Major Features:**
 - Spend 15-30 minutes researching similar implementations
 - Check: GitHub, Stack Overflow, official docs, research papers
 - Document findings in PR description
 **Monthly Reviews:**
 - Review project's core technologies for updates
 - Check if better libraries/algorithms exist
 - Look for deprecated patterns we're using
 **When Stuck:**
 - Don't brute force a solution
 - Research how others solved similar problems
 - Consider if problem indicates architectural issue
 ### What to Research
 **1. Best Practices**
 ```bash
 # Search queries to use:
 "python async best practices 2024"
 "fastapi error handling patterns"
 "LLM consensus voting algorithms"
 "gguf quantization comparison"
 ```
 **2. Similar Implementations**
 - Search GitHub for similar projects
 - Read their architecture decisions
 - Check their issues for pitfalls they hit
 - Note: Don't copy code blindly, understand WHY
 **3. Research Papers & Benchmarks**
 - For consensus algorithms
 - For quantization strategies
 - For context window optimization
 - For distributed systems patterns
 **4. Library Updates**
 - Check CHANGELOG of major dependencies
 - Review migration guides
 - Test new features in separate branch
 ### Documentation of Research
 Create `research/YYYY-MM-DD-topic.md` for significant findings:
 ```markdown
 # Research: [Topic]
 **Date:** YYYY-MM-DD
 **Researcher:** [Name]
 **Trigger:** [Why researched this]
 ## Findings
 ### Option 1: [Name]
 - Source: [Link]
 - Pros: ...
 - Cons: ...
 - Complexity: Low/Medium/High
 ### Option 2: [Name]
 - Source: [Link]
 - Pros: ...
 - Cons: ...
 - Complexity: Low/Medium/High
 ## Recommendation
 [Which option and WHY]
 ## Implementation Notes
 [Specific code changes needed]
 ## Risks
 [What could go wrong]
 ```
 ### Research Checklist
 **Before implementing:**
 - [ ] Searched for similar open-source implementations
 - [ ] Checked recent best practices (2023+)
 - [ ] Looked for benchmarking data if applicable
 - [ ] Reviewed alternative approaches
 - [ ] Considered long-term maintenance implications
 **After implementing:**
 - [ ] Documented why chosen approach was selected
 - [ ] Added comments linking to research sources
 - [ ] Created test comparing against alternatives (if applicable)
 ### Example Research Topics
 **Immediate:**
 - "Python type hints best practices 2024"
 - "FastAPI dependency injection patterns"
 - "LLM tool use format comparison"
 **Short-term:**
 - "Consensus algorithms for distributed LLM systems"
 - "Context window compression techniques"
 - "GGUF quantization vs other formats"
 **Long-term:**
 - "Speculative decoding implementation"
 - "PagedAttention for multiple workers"
 - "RAG integration patterns"
 ### Research Sources
 **Reliable:**
 - Official documentation (Python, FastAPI, etc.)
 - Well-maintained GitHub repos (>1k stars, active)
 - Recent conference talks (PyCon, NeurIPS, etc.)
 - Research papers with code (Papers With Code)
 - Official blogs (Python.org, FastAPI.tiangolo.com)
 **Use with Caution:**
 - Medium articles (variable quality)
 - Old Stack Overflow answers (>2 years)
 - Tutorial sites (often outdated)
 - YouTube videos (hard to verify)
 ### Integration with Development
 **Weekly:**
 - Spend 30 minutes reading about one technology we use
 - Note any improvements we could make
 - Create issues for promising findings
 **Monthly:**
 - Review all open research issues
 - Prioritize based on impact vs effort
 - Schedule implementation of high-value items
 **Quarterly:**
 - Architecture review: Are our patterns still best?
 - Dependency audit: Updates needed?
 - Performance review: Could we be faster?
 ---
 **Remember:**
 - Research prevents reinvention of the wheel
 - But don't research forever - timebox it (30 min max for most decisions)
 - Document findings so others don't repeat the research
 - Apply critical thinking - "best practice" depends on context
 ---
 ## Breaking This Ruleset
 If you MUST break a rule:
 1. Document WHY in code comments
 2. Get explicit approval in PR
 3. Create follow-up issue to fix properly
 4. Never break Rule 3 (No Production Debugging)
 ---
 **Remember: Quality over speed. A fix that takes 2 days with tests is better than a fix that takes 2 hours and breaks 3 other things.**
@@ -1,204 +0,0 @@
 # Network Federation Status
 ## Overview
 Local Swarm has a federation system designed to allow multiple instances to collaborate on the same network, enabling distributed consensus and load balancing across multiple machines.
 ## Current Implementation Status
 ### ✅ What's Working
 #### 1. Network Discovery (`src/network/discovery.py`)
 **Purpose**: Automatic discovery of other Local Swarm instances on the local network using mDNS/Bonjour.
 **Key Components**:
 - `SwarmDiscovery` class - Main discovery service
 - `PeerInfo` dataclass - Stores information about peer swarms
 - `start_advertising()` - Announces this swarm to the network
 - `start_discovery()` - Listens for other swarms on the network
 - `create_discovery_service()` - Factory function to create discovery instance
 **How It Works**:
 - Uses mDNS service type: `_local-swarm._tcp.local.`
 - Advertises on port 63323 (discovery) + API port (17615)
 - Broadcasts: version, instances, model_id, hardware_summary
 - Peers timeout after 60 seconds if not seen
 #### 2. Federation Client (`src/network/federation.py`)
 **Purpose**: Communication protocol between peer swarms.
 **Key Components**:
 - `FederationClient` class - HTTP client for peer communication
 - `FederatedSwarm` class - Wraps local swarm with federation logic
 - `request_vote()` - Gets generation results from peers
 - `generate_with_federation()` - Coordinates distributed generation
 - Federation strategies: `best_of_n`, `weighted_vote`, `first_valid`
 **API Endpoints** (not yet exposed):
 - `POST /v1/federation/vote` - Request generation from peer
 - `GET /v1/federation/health` - Check peer health
 #### 3. Network Binding (`main.py`)
 **Purpose**: Secure local network access without internet exposure.
 **Implementation**:
 - `get_local_ip()` - Detects local network IP (192.x.x.x or 100.x.x.x)
 - Binds to specific local IP instead of 0.0.0.0
 - Falls back to localhost if not on private network
 ## ❌ What's Missing
 ### Critical Gap: No Integration
 **The federation system exists as standalone modules but is NOT connected to the main application flow.**
 **Specific Issues**:
 1. **No CLI Flag**: No `--federation` or `--enable-federation` argument in `main.py`
 2. **Discovery Never Starts**: 
   - `SwarmDiscovery` class is imported in `network/__init__.py`
   - But never instantiated or started in `main.py`
   - `start_advertising()` and `start_discovery()` are never called
 3. **Federation Never Starts**:
   - `FederatedSwarm` class exists but is never instantiated
   - `main.py` calls `swarm.generate()` directly
   - Should call `federated_swarm.generate_with_federation()` when enabled
 4. **API Routes Not Registered**:
   - Federation endpoints exist in `federation.py` but aren't added to FastAPI router
   - Routes in `src/api/routes.py` don't include `/v1/federation/*`
 5. **No Peer Management UI**:
   - No way to see discovered peers
   - No status dashboard for federation
   - No manual peer configuration
 ## File Structure
 ```
 src/network/
 ├── __init__.py           # Exports SwarmDiscovery, FederationClient, etc.
 ├── discovery.py          # mDNS/Bonjour discovery service
 │   ├── SwarmDiscovery    # Main discovery class
 │   ├── PeerInfo          # Peer information dataclass
 │   └── create_discovery_service()  # Factory function
 ├── federation.py         # Inter-swarm communication
 │   ├── FederationClient  # HTTP client for peers
 │   ├── FederatedSwarm    # Wraps swarm with federation
 │   ├── PeerVote          # Vote from peer
 │   └── FederationResult  # Result of federated generation
 └── (routes missing)      # Should add federation routes
 main.py                   # Should integrate federation here
  └── Currently: Just runs local swarm
  └── Should: Optionally run federated swarm with discovery
 ```
 ## Scope
 ### In Scope
 - Automatic discovery of peers on same local network
 - Distributed generation across multiple machines
 - Consensus voting between local and peer responses
 - Health checking and peer timeout handling
 - Secure local network binding (no internet exposure)
 ### Out of Scope (Future)
 - Internet-wide federation (would need authentication/encryption)
 - Cross-platform federation (Mac ↔ Linux ↔ Windows)
 - Peer authentication/authorization
 - Encrypted peer communication
 - WAN federation through NAT traversal
 - Peer reputation/scoring system
 ## TODO
 ### Phase 1: Basic Integration (Minimum Viable)
 1. **Add `--federation` CLI flag** to `main.py`
   - Add argument parser entry
   - Conditionally enable federation
 2. **Integrate discovery in main flow**
   ```python
   # In main.py after swarm initialization:
   if args.federation:
       discovery = await create_discovery_service(args.port)
       await discovery.start_advertising(swarm_info)
       await discovery.start_discovery()
   ```
 3. **Add federation API routes** to `src/api/routes.py`
   - `POST /v1/federation/vote`
   - `GET /v1/federation/health`
   - `GET /v1/federation/peers` (list discovered peers)
 4. **Create FederatedSwarm wrapper**
   ```python
   # Replace: result = await swarm.generate(...)
   # With:
   if args.federation:
       federated = FederatedSwarm(swarm, discovery)
       result = await federated.generate_with_federation(...)
   else:
       result = await swarm.generate(...)
   ```
 ### Phase 2: Polish
 5. **Add peer status display**
   - Show discovered peers in startup banner
   - Display peer count in status
   - Log when peers join/leave
 6. **Handle edge cases**
   - No peers available (fallback to local only)
   - All peers timeout (graceful degradation)
   - Split-brain scenarios
 7. **Configuration**
   - Config file support for federation settings
   - Manual peer list (bypass discovery)
   - Federation strategy selection
 ### Phase 3: Testing
 8. **Integration tests**
   - Two instances on same machine
   - Two instances on same network
   - Peer timeout handling
   - Consensus validation
 ## Usage (When Complete)
 ### Start Federated Mode
 ```bash
 # On Mac 1 (192.168.1.100)
 python main.py --auto --federation
 # On Mac 2 (192.168.1.101)
 python main.py --auto --federation
 # Both will:
 # 1. Start local API on 192.168.x.x:17615
 # 2. Advertise via mDNS
 # 3. Discover each other within 5-10 seconds
 # 4. Distribute generation requests between them
 ```
 ### Expected Behavior
 1. Both Macs advertise themselves via mDNS
 2. Each discovers the other within 10 seconds
 3. When a request comes in, both generate responses
 4. Consensus algorithm picks best response
 5. Result returned to client
 ## Benefits When Complete
 - **More workers**: Combine instances across machines
 - **Better consensus**: More responses = better selection
 - **Load balancing**: Distribute generation across devices
 - **Redundancy**: If one fails, others continue
 - **Heterogeneous hardware**: Mix Macs, PCs, servers
 ## Current Workaround
 Until federation is integrated, you can:
 1. Run instances independently on different machines
 2. Point clients to specific instances manually
 3. No automatic peer discovery or coordination
@@ -1,597 +1,191 @@
 # Local Swarm
-Automatically configure and run a swarm of small coding LLMs optimized for your hardware. Provides an OpenAI-compatible API for seamless integration with opencode and other tools.
+Run a swarm of local LLMs on your hardware. Multiple models work together to give you the best answer through consensus voting.
-## Features
+## What It Does
- **Interactive Menu System**: Easy-to-use menu for selecting model configurations, browsing options, or creating custom setups
+- **Auto-detects your hardware** (NVIDIA, AMD, Intel, Apple Silicon, Qualcomm, or CPU)
- **Hardware Auto-Detection**: Automatically detects your GPU (NVIDIA, AMD, Intel), Apple Silicon, Qualcomm (Android), or CPU and selects optimal settings
+- **Downloads and runs multiple LLM instances** optimized for your VRAM/RAM
- **Smart Model Selection**: Chooses the best model, quantization, and instance count based on available VRAM/RAM
+- **Uses consensus voting** - all instances answer, best response wins
- **Startup Summary**: Clear display of detected hardware, selected model, resource usage, and worker status
+- **Connects multiple machines** on your network for a "hive mind" effect
- **Swarm Consensus**: Multiple LLM instances vote on the best response for higher quality outputs
+- **Provides an OpenAI-compatible API** at `http://localhost:17615/v1`
 - **Network Federation**: Multiple machines on the same network can join into a "federated swarm" for distributed consensus
 - **OpenAI-Compatible API**: Drop-in replacement for OpenAI API at `http://localhost:8000/v1`
 - **MCP Server**: Model Context Protocol support for tight AI assistant integration
 - **Cross-Platform**: Works on Windows, macOS, Linux, and Android (via Termux) with automatic backend selection
 ## Documentation
 - **[Quick Start](#quick-start)** - Get up and running in minutes
 - **[Complete Guide](docs/GUIDE.md)** - Comprehensive documentation
  - Opencode configuration examples
  - API reference
  - Troubleshooting guide
  - Performance tuning
  - Advanced configuration
 - **[Configuration](#configuration)** - Customize your setup
 - **[Interactive Mode](#interactive-mode)** - Using the menu system
 - **[Tips & Help](#tips--help)** - Learn about models, quantization, and optimization
 ## Quick Start
-### Installation
+```bash
-
+# Clone and install
 #### Windows (PowerShell)
 ```powershell
 # Clone the repository
 git clone https://github.com/yourusername/local_swarm.git
 cd local_swarm
 pip install -r requirements.txt
-# Run installer
+# Run it
 .\scripts\install.bat
 ```
 #### macOS/Linux
 ```bash
 # Clone the repository
 git clone https://github.com/yourusername/local_swarm.git
 cd local_swarm
 # Run installer
 chmod +x scripts/install.sh
 ./scripts/install.sh
 ```
 #### Android (Termux)
 ```bash
 # In Termux app
 git clone https://github.com/yourusername/local_swarm.git
 cd local_swarm
 # Run Termux installer
 chmod +x scripts/install-termux.sh
 ./scripts/install-termux.sh
 ```
 **Note**: Android support is limited to small models (1-3B) due to memory constraints. Requires 8GB+ RAM.
 ### Usage
 #### Start the Swarm
 ```bash
 # Auto-detect hardware and start
 python -m local_swarm
 # Or use the CLI
 python main.py
 ```
-On first run, the tool will:
+On first run, it will:
-1. Scan your hardware (GPU, RAM, CPU)
+1. Detect your hardware
-2. Select the optimal model and quantization
+2. Pick the best model and quantization
 3. Download the model (one-time)
-4. Start multiple instances based on available memory
+4. Start multiple LLM workers
-5. Expose the API at `http://localhost:8000`
+5. Expose the API at `http://localhost:17615`
-Example startup output:
+## Usage
 ```
 🔍 Detecting hardware...
   OS: Windows 11
   GPU: NVIDIA GeForce RTX 4060 Ti (16 GB VRAM)
   CPU: 16 cores
   RAM: 32 GB
-📊 Optimal configuration:
+### Interactive Mode (default)
-   Model: Qwen 2.5 Coder 3B
+```bash
-   Quantization: Q4_K_M (1.8 GB per instance)
+python main.py
   Instances: 8 (using 14.4 GB VRAM)
 ⬇️  Downloading model...
   Progress: 100% ▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓▓ 1.8/1.8 GB
 🚀 Starting swarm...
   Worker 1: Ready (GPU:0)
   Worker 2: Ready (GPU:0)
   ...
   Worker 8: Ready (GPU:0)
 ✅ Local Swarm is running!
   API: http://localhost:8000/v1
   Models: http://localhost:8000/v1/models
   Health: http://localhost:8000/health
 💡 Configure opencode to use:
   base_url: http://localhost:8000/v1
   api_key: any (not used)
 ```
-#### Configure opencode
+Shows a menu with:
 - Recommended configuration (auto-selected)
 - Browse all compatible models
 - Custom configuration wizard
-Add to your opencode configuration:
+### Auto Mode (no menu)
 ```bash
 python main.py --auto
 ```
 ### With Other Options
 ```bash
 python main.py --model qwen:3b:q4      # Use specific model
 python main.py --instances 4           # Force 4 workers
 python main.py --port 8080             # Custom port
 python main.py --detect                # Show hardware info only
 python main.py --federation            # Enable network federation
 python main.py --mcp                   # Enable MCP server
 ```
 ## Connect to Opencode
 Add to your opencode config:
 ```json
 {
  "model": {
    "provider": "openai",
-    "base_url": "http://localhost:8000/v1",
+    "base_url": "http://localhost:17615/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
 }
 ```
-#### MCP Server (Optional)
+## Network Federation (Hive Mind)
-For tighter integration with AI assistants, enable the MCP server:
+Run on multiple machines to combine their power:
 ```bash
-python main.py --mcp
+# Machine 1 (Windows with RTX 4060)
 python main.py --auto --federation
 # Machine 2 (Mac Mini M1)
 python main.py --auto --federation
 # Machine 3 (Old laptop)
 python main.py --auto --federation
 ```
-This runs alongside the HTTP API and exposes tools AI assistants can use:
+Machines auto-discover each other and vote together on every request.
 - `get_hardware_info` - Query CPU, GPU, and RAM
 - `get_swarm_status` - Check worker health
 - `generate_code` - Generate code with consensus
 - `list_available_models` - See what models can run
 - `get_worker_details` - Get detailed worker statistics
-MCP allows AI assistants to automatically query your hardware capabilities and select appropriate models.
+## How Consensus Works
 1. Your prompt goes to all LLM instances
 2. Each instance generates a response independently
 3. The consensus algorithm picks the best answer:
   - **Similarity** (default): Groups responses by meaning, picks the largest group
   - **Quality**: Scores on completeness, code blocks, structure
   - **Fastest**: Returns the quickest response
   - **Majority**: Simple text match voting
 ## Configuration
-Create a `config.yaml` file for customization:
+Create `config.yaml`:
 ```yaml
 server:
  host: "127.0.0.1"
-  port: 8000
+  port: 17615
 swarm:
-  consensus_strategy: "similarity"  # similarity, quality, fastest
+  consensus_strategy: "similarity"  # similarity, quality, fastest, majority
  min_instances: 2
  max_instances: 8
 hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5  # Use 50% of system RAM for CPU/Apple Silicon
 federation:
  enabled: true
  discovery_port: 8765
  federation_port: 8766
  max_peers: 10
 models:
  cache_dir: "~/.local_swarm/models"
 ```
-## CLI Options
+## Supported Hardware
-```bash
+| Hardware | Backend | Notes |
-# Show hardware detection without starting
+|----------|---------|-------|
-python -m local_swarm --detect
+| NVIDIA GPU | llama.cpp (CUDA) | Best performance |
-
+| AMD GPU | llama.cpp (ROCm) | Linux/Windows |
-# Use specific model
+| Intel GPU | llama.cpp (SYCL) | Linux/Windows |
-python -m local_swarm --model qwen2.5-coder:3b:q4
+| Apple Silicon | MLX | Native Metal |
-
+| Qualcomm | llama.cpp (CPU) | Android/Termux |
-# Use specific port
+| CPU-only | llama.cpp | Slower but works |
 python -m local_swarm --port 8080
 # Force number of instances
 python -m local_swarm --instances 4
 # Download models only (no server)
 python -m local_swarm --download-only
 # Enable MCP server alongside HTTP API
 python -m local_swarm --mcp
 # Show help
 python -m local_swarm --help
 # Auto-detect without interactive menu
 python -m local_swarm --auto
 ```
 ## Interactive Mode
 By default, Local Swarm starts in **interactive mode** with a menu system:
 ```
 ======================================================================
 Local Swarm - Model Selection
 ======================================================================
 ----------------------------------------------------------------------
 Hardware Detection
 ----------------------------------------------------------------------
  Operating System: Darwin
  CPU: 12 cores
  System RAM: 24.0 GB
  Available RAM: 6.2 GB
  GPU Detected:
    Name: Apple Silicon GPU
    Type: Apple Silicon (Unified Memory)
    Total Memory: 24.0 GB
  Available for LLMs: 12.0 GB
  (Using 50% of system RAM)
 ----------------------------------------------------------------------
 Configuration Options
 ----------------------------------------------------------------------
  💡 Recommended: Qwen 2.5 Coder 7b (q6_k)
     Instances: 2
     Memory: 12.0 GB
  [1] Recommended Configuration - Qwen 2.5 Coder 7b (q6_k) with 2 instances
  [2] Browse All Configurations - See all models that fit your hardware
  [3] Custom Configuration - Specify exact model and number of instances
  Enter your choice: 
 ```
 ### Menu Options
 1. **Recommended Configuration** - Automatically selects the best model and instance count for your hardware
 2. **Browse All Configurations** - Shows all feasible models that fit in your available memory
 3. **Custom Configuration** - Step-by-step wizard to select:
   - Model family (Qwen, DeepSeek, CodeLlama)
   - Model size (3B, 7B, 14B)
   - Quantization level (Q4, Q5, Q6)
   - Number of instances (1 to max supported)
 To skip the menu and use auto-detection, use `--auto` flag.
 ## Startup Summary
 When starting, Local Swarm displays a comprehensive summary:
 ```
 ======================================================================
 Local Swarm - Startup Summary
 ======================================================================
 ----------------------------------------------------------------------
 Hardware Detection
 ----------------------------------------------------------------------
  Operating System: Darwin
  CPU: 12 cores
  System RAM: 24.0 GB
  Available RAM: 6.2 GB
  GPU Detected:
    Name: Apple Silicon GPU
    Type: Apple Silicon (Unified Memory)
    Total Memory: 24.0 GB
  Available for LLMs: 12.0 GB
 ----------------------------------------------------------------------
 Model Configuration
 ----------------------------------------------------------------------
  Model: Qwen 2.5 Coder 7b (q6_k)
  Description: Alibaba's code-focused model
  Instances: 2
  Memory per Instance: 6.0 GB
  Total Memory: 12.0 GB
  Utilization: 100.0% of available
 ======================================================================
 ```
 ## How It Works
 ### Hardware Detection
 The tool automatically detects your system:
 - **Windows**: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI)
 - **macOS**: Apple Silicon via Metal, unified memory model
 - **Linux**: NVIDIA (NVML), AMD (ROCm), Intel (OneAPI/OpenCL)
 - **Android**: Qualcomm Adreno GPUs (via Termux)
 **Supported Backends**:
 - **NVIDIA**: CUDA via llama.cpp
 - **AMD**: ROCm via llama.cpp (Linux, Windows experimental)
 - **Intel**: OneAPI/SYCL via llama.cpp
 - **Apple Silicon**: Metal via MLX
 - **Qualcomm**: CPU fallback on llama.cpp (Android/Termux)
 ### Model Selection
 Based on available memory:
 1. **External GPU**: Use 100% of VRAM minus OS overhead
 2. **Apple Silicon**: Use 50% of unified RAM
 3. **CPU-only**: Use 50% of system RAM
 The algorithm selects:
 - Largest model size that fits
 - Highest quantization quality possible
 - Maximum instances (2-8) based on memory
 Example configurations:
 | Hardware | Model | Quant | Instances | Memory Used |
 |----------|-------|-------|-----------|-------------|
 | RTX 4090 24GB | Qwen 2.5 14B | Q4_K_M | 2 | ~17.6 GB |
 | RTX 4060 Ti 16GB | Qwen 2.5 7B | Q4_K_M | 3 | ~13.5 GB |
 | RTX 4060 Ti 8GB | Qwen 2.5 3B | Q6_K | 4 | ~10.4 GB |
 | RX 7900 XTX 24GB | Qwen 2.5 14B | Q4_K_M | 2 | ~17.6 GB |
 | Arc A770 16GB | Qwen 2.5 7B | Q5_K_M | 2 | ~10.4 GB |
 | M4 Max 64GB | Qwen 2.5 14B | Q4_K_M | 4 | ~35.2 GB |
 | M3 Pro 36GB | Qwen 2.5 7B | Q4_K_M | 4 | ~18 GB |
 | M1 8GB | Qwen 2.5 3B | Q4_K_M | 2 | ~3.6 GB |
 | Snapdragon 8 Gen 3 | Qwen 2.5 3B | Q4_K_M | 1 | ~1.8 GB |
 | CPU 32GB | Qwen 2.5 3B | Q4_K_M | 8 | ~14.4 GB |
 | **Federated (3 machines)** | **Qwen 2.5 7B** | **Q4_K_M** | **9** | **~40.5 GB** |
 ### Swarm Consensus
 For each request, the swarm:
 1. Sends the prompt to all running instances
 2. Collects responses in parallel
 3. Runs consensus algorithm:
   - **Similarity**: Groups responses by semantic similarity, returns largest group
   - **Quality**: Scores responses on completeness and code quality
   - **Fastest**: Returns the quickest response
 4. Returns the winning response via OpenAI-compatible API
 ### Network Federation
 Run Local Swarm on multiple machines in the same network to create a "federated swarm":
 **Example Setup**:
 - Windows PC (RTX 4060 Ti): 4 instances
 - Mac Mini (M1): 2 instances  
 - MacBook (M4): 3 instances
 - Total: 9 instances voting on every request
 **How it works**:
 1. Each machine auto-discovers others via mDNS/Bonjour
 2. Each swarm generates responses independently
 3. Local consensus picks best response per machine
 4. Cross-swarm consensus votes across all machines
 5. Best response returned to client
 **To enable federation**:
 ```yaml
 federation:
  enabled: true
  discovery_port: 8765  # mDNS/Bonjour discovery
  federation_port: 8766  # Inter-swarm communication
 ```
 Machines will automatically discover each other within 10 seconds.
 ## API Endpoints
 ### GET /v1/models
 List available models
 ### POST /v1/chat/completions
 Chat completion with consensus
 **Request**:
 ```json
 {
  "model": "local-swarm",
  "messages": [
    {"role": "user", "content": "Write a Python function to sort a list"}
  ]
 }
 ```
 **Response**:
 ```json
 {
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-swarm",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "def sort_list(lst):\n    return sorted(lst)"
    },
    "finish_reason": "stop"
  }]
 }
 ```
 ### GET /health
 Health check
 ### GET /metrics
 Prometheus metrics (optional)
 ## Supported Models
-Currently supported models (auto-selected based on hardware):
+- **Qwen 2.5 Coder** (3B, 7B, 14B) - Recommended
 - **DeepSeek Coder** (1.3B, 6.7B, 33B)
 - **CodeLlama** (7B, 13B, 34B)
- **Qwen 2.5 Coder** (3B, 7B, 14B) - Recommended for coding tasks
+All support GGUF quantization (Q4_K_M recommended).
 - **DeepSeek Coder** (1.3B, 6.7B, 33B) - Good alternative
 - **CodeLlama** (7B, 13B, 34B) - Meta's code model
-All models support GGUF quantization:
+## API Endpoints
- Q4_K_M - Good quality, smallest size (recommended)
+
- Q5_K_M - Better quality
+- `GET /v1/models` - List available models
- Q6_K - Best quality
+- `POST /v1/chat/completions` - Chat completion with consensus
 - `GET /health` - Health check
 - `GET /v1/federation/peers` - List discovered peers (when federation enabled)
 ## Troubleshooting
 ### Out of Memory
 If you get OOM errors:
 ```bash
-# Reduce instances
+python main.py --instances 2           # Reduce workers
-python -m local_swarm --instances 2
+python main.py --model qwen:3b:q4      # Use smaller model
 # Or use smaller model
 python -m local_swarm --model qwen2.5-coder:3b:q4
 ```
 ### Slow Performance
- Check GPU utilization with `nvidia-smi` (NVIDIA) or Activity Monitor (macOS)
+- Check GPU utilization with `nvidia-smi`
- Ensure model is cached (first run downloads to `~/.local_swarm/models`)
+- Reduce instances to avoid contention
- Try reducing instances to avoid contention
+- Use Q4 quantization instead of Q6
-### Windows: CUDA not detected
+### CUDA Not Detected (Windows)
 Make sure NVIDIA drivers are installed:
 ```powershell
-nvidia-smi
+nvidia-smi  # Check drivers
 pip uninstall llama-cpp-python
 pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
 ```
 If this fails, reinstall drivers from nvidia.com
-### macOS: MLX not found
+### macOS: MLX Not Found
 ```bash
 pip install mlx-lm
 ```
-### Linux: AMD GPU not detected
+## Project Structure
 Ensure ROCm is installed:
 ```bash
 rocm-smi
 ```
 If not found, install from https://www.amd.com/en/developer/rocm-hub.html
 ### Linux: Intel GPU not detected
 Install Intel oneAPI:
 ```bash
 # Ubuntu/Debian
 wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | sudo gpg --dearmor -o /usr/share/keyrings/intel-oneapi-archive-keyring.gpg
 echo "deb [signed-by=/usr/share/keyrings/intel-oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list
 sudo apt update
 sudo apt install intel-basekit
 ```
 ### Android: Termux issues
 - Ensure Termux is installed from F-Droid (not Play Store)
 - Run `pkg update` before installation
 - Limited to small models (1-3B) due to RAM constraints
 - Use CPU backend only (no GPU acceleration on Android yet)
 ## Requirements
 - Python 3.9+
 - 4GB+ RAM (8GB+ recommended)
 - Optional: NVIDIA/AMD/Intel GPU with 4GB+ VRAM
 - Optional: Apple Silicon Mac
 - Optional: Android device with 8GB+ RAM (via Termux)
 ## Development
 ```bash
 # Install dev dependencies
 pip install -r requirements-dev.txt
 # Run tests
 pytest
 # Run specific platform tests
 pytest tests/test_hardware.py -v
 # Format code
 black src/
 ruff check src/
 ```
 ## Architecture
 ### Single Machine
 ```
-┌─────────────────────────────────────┐
+local_swarm/
-│         OpenAI API Client           │
+├── main.py                   # CLI entry point
-│        (opencode, etc.)             │
+├── src/
-└─────────────┬───────────────────────┘
+│   ├── hardware/            # GPU detection (NVIDIA, AMD, Intel, Apple, Qualcomm)
-              │ HTTP
+│   ├── models/              # Model registry, selection, downloading
-              ▼
+│   ├── backends/            # llama.cpp and MLX backends
-┌─────────────────────────────────────┐
+│   ├── swarm/               # Worker management and consensus
-│     Local Swarm API Server          │
+│   ├── network/             # Federation and peer discovery
-│    (FastAPI / localhost:8000)       │
+│   ├── api/                 # OpenAI-compatible API server
-└─────────────┬───────────────────────┘
+│   └── tools/               # Tool execution (read, write, bash)
-              │
+└── docs/                    # Documentation
              ▼
 ┌─────────────────────────────────────┐
 │       Swarm Manager                 │
 │  ┌─────────┐ ┌─────────┐           │
 │  │ Worker 1│ │ Worker 2│ ...       │
 │  │(LLM #1) │ │(LLM #2) │           │
 │  └────┬────┘ └────┬────┘           │
 │       │           │                 │
 │       └─────┬─────┘                 │
 │             ▼                       │
 │      Consensus Engine               │
 └─────────────────────────────────────┘
              │
              ▼
 ┌─────────────────────────────────────┐
 │     Backend (llama.cpp / MLX)       │
 │    ┌─────────────────────┐          │
 │    │   GGUF/MLX Model    │          │
 │    │   (Qwen/Codellama)  │          │
 │    └─────────────────────┘          │
 └─────────────────────────────────────┘
              │
              ▼
 ┌─────────────────────────────────────┐
 │    Hardware (GPU/CPU/Apple Silicon) │
 └─────────────────────────────────────┘
 ```
 ### Federated Swarm (Multiple Machines)
 ```
 ┌─────────────────────────────────────────────────────────────┐
 │                    Local Network                             │
 │                                                              │
 │  ┌──────────────┐    ┌──────────────┐    ┌──────────────┐  │
 │  │  Windows PC  │    │   Mac Mini   │    │   MacBook    │  │
 │  │  (RTX 4060)  │    │    (M1)      │    │    (M4)      │  │
 │  │  4 instances │    │  2 instances │    │  3 instances │  │
 │  └──────┬───────┘    └──────┬───────┘    └──────┬───────┘  │
 │         │                   │                   │           │
 │         │                   │                   │           │
 │         └───────────────────┼───────────────────┘           │
 │                             │                               │
 │                    ┌────────┴────────┐                      │
 │                    │  Cross-Swarm    │                      │
 │                    │    Consensus    │                      │
 │                    └────────┬────────┘                      │
 │                             │                               │
 │                    ┌────────▼────────┐                      │
 │                    │   opencode      │                      │
 │                    └─────────────────┘                      │
 └─────────────────────────────────────────────────────────────┘
 ```
 ## License
-MIT License - See LICENSE file
+MIT License
 ## Contributing
 Contributions welcome! Please read CONTRIBUTING.md first.
 ## Acknowledgments
 - [llama.cpp](https://github.com/ggerganov/llama.cpp) - Inference engine (CUDA/ROCm/SYCL)
 - [MLX](https://github.com/ml-explore/mlx) - Apple Silicon backend
 - [Qwen](https://github.com/QwenLM/Qwen) - Model family
 - [DeepSeek](https://github.com/deepseek-ai/deepseek-coder) - Model family
 - [HuggingFace](https://huggingface.co) - Model hosting
 - [ROCm](https://github.com/RadeonOpenCompute/ROCm) - AMD GPU support
 - [oneAPI](https://www.intel.com/content/www/us/en/developer/tools/oneapi/overview.html) - Intel GPU support
 - [Termux](https://termux.dev) - Android terminal emulator
@@ -1,106 +0,0 @@
 Here's a comprehensive review of your project. It's well-architected overall, but there are several issues worth addressing.
  ---
  Critical Bugs
  1. src/network/discovery.py:128 — asyncio.create_task() called from a non-asyncio thread
  Zeroconf's ServiceBrowser invokes _on_service_state_change from a background thread, but asyncio.create_task() requires the asyncio event loop in
  the current thread. This will crash with RuntimeError: no current event loop. Use asyncio.run_coroutine_threadsafe(coro, loop) instead.
  2. src/network/discovery.py:161 — int() on bytes raises TypeError
  int(properties.get(b"instances", b"0")) — in Python 3, int(b"0") is a TypeError. Need .decode() first.
  3. src/hardware/detector.py:149,174 — Android/Qualcomm detection is unreachable
  platform.system() returns "Linux" on Android, not "android". So the code enters the Linux branch, tries NVIDIA/AMD/Intel, fails, and returns None —
   never reaching Qualcomm detection.
  4. src/api/routes.py:77 — response_model breaks streaming
  The route declares response_model=ChatCompletionResponse, but when request.stream=True, it returns a StreamingResponse. FastAPI will try to
  validate the streaming response against the Pydantic model and fail.
  ---
  High Severity
  5. src/backends/llamacpp.py:85-94 and src/backends/mlx.py:88-96 — Blocking calls in async methods
  Both backends call synchronous inference (self._llm(...), mlx_generate(...)) directly inside async def methods. This blocks the entire event loop,
  freezing the API server during inference. Wrap in await asyncio.to_thread(...).
  6. src/backends/llamacpp.py:29 — Lock declared but never initialized
  self._lock = None is never replaced with an actual asyncio.Lock(), so there's no concurrency protection when multiple requests hit the same backend
   instance.
  7. src/swarm/consensus.py:85,89 — Blocking I/O in async context
  SentenceTransformer('all-MiniLM-L6-v2') downloads/loads a model synchronously, and .encode() is CPU-bound. Both freeze the event loop.
  8. src/hardware/amd.py:80 — VRAM regex matches wrong number
  re.search(r'(\d+)', line) on a line like GPU[0] : VRAM Total Memory (B): 17179869184 matches 0 (from GPU[0]), not the VRAM value.
  9. src/models/downloader.py:79-88 — Partial downloads cached as valid
  If a download is interrupted, the partial file remains. is_model_cached() sees size > 0 and treats it as valid. Should download to a .tmp file and
  rename atomically on completion.
  10. src/network/federation.py:253-277 — best_of_n strategy is non-functional
  The code creates GenerationResponse objects but never uses them, then just returns the local response. This strategy is dead code.
  ---
  Medium Severity
  11. src/models/selector.py:182-184 — Memory calculation uses wrong instance count
  total_memory_gb = smallest_quant.vram_gb * instances uses the pre-clamped value, but instances gets max(instances, 1) on the next line. Data
  inconsistency.
  12. src/models/selector.py:65 — calculate_max_instances returns infeasible count
  Returns MIN_INSTANCES (2) even when only 0-1 instances fit in memory. _try_smallest_variant calls this without the memory guard that _try_model
  has.
  13. src/hardware/detector.py:87-88 — NVML resource leak
  pynvml.nvmlInit() is called but nvmlShutdown() is never called. Need a try/finally.
  14. src/api/server.py:60-66 — Invalid CORS configuration
  allow_origins=["*"] with allow_credentials=True violates the CORS spec. Browsers will reject this.
  15. src/swarm/consensus.py:186-199 — _majority_vote doesn't do majority voting
  It picks the median-length response, not the most common one. Name and docstring are misleading.
  16. src/interactive.py:226,368,458 — Recursive menu navigation risks stack overflow
  Menu functions call each other recursively. Repeated back-and-forth navigation can blow the stack. Use a loop-based state machine instead.
  17. Multiple files — Bare except: clauses
  llamacpp.py:157,187, mlx.py:141, detector.py:108,190, amd.py:214, intel.py:220,248, qualcomm.py:185, discovery.py:236, federation.py:116,
  updater.py:141,218,231 — all catch SystemExit and KeyboardInterrupt. Use except Exception: instead.
  ---
  Low Severity / Code Quality
  18. src/api/routes.py:112,133,147 — .json() deprecated in Pydantic v2. Use .model_dump_json().
  19. src/backends/mlx.py:59-63 — GGUF loading via MLX is suspect. Passing the parent directory of a GGUF file to mlx_lm.load() likely won't work.
  20. src/swarm/consensus.py:233 — False-positive list detection. Checks for -, *, 1., 2. which match hyphens in code, multiplication operators,
  version numbers, etc.
  21. src/network/discovery.py:56 — Dict[str, any] should be Dict[str, Any] (capital A).
  22. src/mcp_server.py:15-18 — Unused imports (ImageContent, Resource, EmbeddedResource, LoggingLevel).
  23. src/models/downloader.py:74,118 — timeout=30 is connect-only, no read timeout. Multi-GB downloads can hang on stalled reads.
  24. src/models/downloader.py — No checksum verification after download. Corrupted files are silently cached.
  25. Tests directory is empty — tests/__init__.py exists but no actual tests.
  ---
  Suggested Improvements
  1. Wrap all blocking inference in asyncio.to_thread() — this is the single most impactful fix. Without it, the API server can only handle one
  request at a time.
  2. Atomic downloads — download to .part file, rename on success, verify checksum against HuggingFace metadata.
  3. Replace recursive menus with a loop-based state machine — e.g. state = "main" in a while True loop with if state == "main": ... branches.
  4. Add proper logging — replace all print() calls with logging.getLogger(__name__). The codebase uses print() everywhere, making it hard to control
   verbosity.
  5. Fix the Android detection path — check is_termux() or /system/build.prop existence early in detect_gpu() before the platform branching.
  6. Add integration tests — even simple smoke tests (hardware detection returns valid data, model selection picks something reasonable, API server
  starts and responds to /health) would catch regressions.
  7. Use aiohttp.ClientSession as async context manager in federation to ensure proper cleanup.
  8. Consider separating streaming and non-streaming API routes — this avoids the response_model conflict and makes the code clearer.
@@ -1,134 +0,0 @@
 # Local Swarm TODO / Future Enhancements
 ## Context Window Optimization (For Long Context 30K+)
 Based on docs/CONTEXT.md, implement context compression for memory-constrained setups:
 ### Option 2: Context Compression (Recommended for 16GB VRAM)
 **Stage 1: Compression Swarm (3-5 workers)**
 - Split 60K input into 6x 10K chunks
 - Each worker summarizes one chunk
 - Aggregate summaries into 8K compressed context
 - Added latency: ~2-3 seconds
 **Stage 2: Solution Swarm (N workers)**
 - Each worker gets 8K compressed + 2K relevant original
 - Generate solutions independently
 - Vote on best response
 **Benefits:**
 - Works with standard 8K models
 - Maintains swarm consensus architecture
 - 2-3x more workers possible
 **Implementation:**
 ```python
 # New: CompressionEngine class
 class CompressionEngine:
    def compress(self, text: str, target_tokens: int) -> str:
        # Split into chunks
        # Parallel summarization
        # Aggregate results
        pass
 ```
 ### Option 3: Hierarchical RAG (For 100K+ contexts)
 **Tier 1: Indexing**
 - Embed context into vector database
 - Build searchable knowledge graph
 **Tier 2: Retrieval + Generation**
 - Query index for relevant context
 - Each worker gets ~6K retrieved + 2K raw
 **Tier 3: Voting**
 - Rerank and consensus
 **Use case:** Codebase-wide analysis, large document processing
 ---
 ## Tool Execution Enhancements
 ### Streaming Tool Results
 - Stream long file reads progressively
 - Show bash command output in real-time
 - Progress indicators for large operations
 ### Tool Permissions
 - Configurable permission levels per tool
 - Approval required for destructive operations (rm, overwrite)
 - Audit log of all tool executions
 ### Tool Result Caching
 - Cache file reads (hash-based)
 - Invalidate on file modification
 - Reduce redundant disk I/O
 ---
 ## Federation Improvements
 ### Automatic Peer Discovery
 - Better mDNS reliability
 - Fallback to broadcast/multicast
 - Manual peer list persistence
 ### Load Balancing
 - Distribute requests across peers based on:
  - Current load (active workers)
  - Latency (response time)
  - Capability (model quality)
 ### Fault Tolerance
 - Automatic peer failover
 - Retry with different peers
 - Degraded mode (fewer voters)
 ---
 ## UI/UX Enhancements
 ### Web Dashboard
 - Real-time worker status visualization
 - Generation progress bars
 - Tool execution log viewer
 - Configuration management UI
 ### Better Error Messages
 - Clear explanations of OOM errors
 - Suggested configurations based on hardware
 - Model compatibility checker
 ---
 ## Performance Optimizations
 ### Speculative Decoding
 - Small draft model generates tokens
 - Large model verifies (2-3x speedup)
 - Requires draft model download
 ### KV Cache Optimization
 - PagedAttention (vLLM-style)
 - Memory-efficient attention states
 - Better long-context performance
 ### Model Quantization
 - Support for GPTQ/AWQ quantization
 - 2-3x smaller models with minimal quality loss
 - Enable larger models on same hardware
 ---
 ## Completed ✓
 - [x] Tool execution architecture (local + remote)
 - [x] Simplified tool instructions (300 tokens vs 40k)
 - [x] Federation with peer discovery
 - [x] Hardware auto-detection
 - [x] MLX backend for Apple Silicon
 - [x] Consensus voting strategies
 - [x] Model auto-selection based on VRAM
@@ -0,0 +1,12 @@
 Use tools to execute commands and fetch information. Output only tool calls.
 Format:
 TOOL: bash
 ARGUMENTS: {"command": "ls -la", "description": "Lists files in directory"}
 TOOL: webfetch
 ARGUMENTS: {"url": "https://example.com", "format": "markdown"}
 Available tools: bash, webfetch
 No explanations. No numbered lists. No markdown. Only tool calls.
@@ -0,0 +1,115 @@
 # Local Swarm Architecture
 ## Core Concept
 Deploy multiple LLM instances on your hardware. Each instance processes the same input independently, then they vote on the best answer. Connect multiple machines running this to create a "hive mind" utilizing all your old hardware.
 ## How It Works
 ```
 ┌─────────────────┐     ┌─────────────────────────────────────┐
 │   Your Prompt   │────▶│         Swarm Manager               │
 └─────────────────┘     │  ┌─────────┐ ┌─────────┐ ┌─────────┐│
                        │  │Worker 1 │ │Worker 2 │ │Worker 3 ││
                        │  │ (LLM)   │ │ (LLM)   │ │ (LLM)   ││
                        │  └────┬────┘ └────┬────┘ └────┬────┘│
                        │       └───────────┼───────────┘     │
                        │                   ▼                 │
                        │         Consensus Engine            │
                        │         (Picks best answer)         │
                        └───────────────────┬─────────────────┘
                                            ▼
                                    ┌───────────────┐
                                    │ Best Response │
                                    └───────────────┘
 ```
 ## Components
 ### 1. Hardware Detection (`src/hardware/`)
 Detects your GPU and available memory to optimize model selection.
 - **NVIDIA** - pynvml
 - **AMD** - rocm-smi
 - **Intel** - sycl-ls
 - **Apple Silicon** - sysctl/unified memory
 - **Qualcomm** - Android/Termux detection
 - **CPU** - psutil
 ### 2. Model Selection (`src/models/`)
 Automatically picks the best model based on available memory:
 ```
 Available Memory → Model Size → Quantization → Instance Count
     24 GB     →   14B      →    Q4_K_M    →   2-3 instances
     16 GB     →    7B      →    Q4_K_M    →   3-4 instances
      8 GB     →    3B      →    Q6_K      →   2-3 instances
 ```
 ### 3. Backends (`src/backends/`)
 Run the actual LLM inference:
 - **llama.cpp** - CUDA, ROCm, SYCL, CPU (cross-platform)
 - **MLX** - Apple Silicon optimized
 ### 4. Swarm Management (`src/swarm/`)
 Manages multiple LLM workers and consensus voting.
 **Workers**: Each runs an independent LLM instance
 **Consensus**: Picks the best response using:
 - Similarity (semantic grouping)
 - Quality (code blocks, structure)
 - Fastest (latency)
 - Majority (exact match)
 ### 5. Network Federation (`src/network/`)
 Connect multiple machines into a distributed swarm:
 ```
 Machine 1 (4 workers) ──┐
 Machine 2 (2 workers) ──┼──▶ Cross-Swarm Consensus ──▶ Best Answer
 Machine 3 (3 workers) ──┘
 ```
 **Discovery**: mDNS/Bonjour auto-discovery
 **Protocol**: HTTP between peers
 **Voting**: Two-phase (local consensus → global consensus)
 ### 6. API (`src/api/`)
 OpenAI-compatible REST API:
 - `POST /v1/chat/completions` - Main endpoint
 - `GET /v1/models` - List models
 - `GET /health` - Health check
 - Federation endpoints when enabled
 ### 7. Tools (`src/tools/`)
 Optional tool execution for enhanced capabilities:
 - `read_file` - Read files
 - `write_file` - Write files  
 - `execute_bash` - Run shell commands
 ## Data Flow
 1. **Request** comes in via API
 2. **Swarm Manager** sends to all workers
 3. **Workers** generate responses in parallel
 4. **Consensus** picks the best answer
 5. **Response** returned to client
 ## Memory Model
 - **External GPU**: Use 90% of VRAM
 - **Apple Silicon**: Use RAM - 4GB buffer
 - **CPU-only**: Use RAM - 4GB buffer
 Each worker loads the full model independently (no sharing).
 ## Future Ideas
 - Context compression for long inputs
 - CPU offloading for memory-constrained systems
 - RAG integration for knowledge bases
 - Speculative decoding for speed
@@ -1,210 +0,0 @@
 # Context Window Handling in Local Swarm
 ## Overview
 This document summarizes how context windows work in swarm architectures and the design decisions made for Local Swarm.
 ## The Core Challenge
 When running multiple LLM workers (instances) for consensus voting, each worker needs to process the input. For long contexts (30K-60K+ tokens), this creates memory pressure:
 - **7B model at 32K context:** ~8GB VRAM per worker
 - **7B model at 64K context:** ~14GB VRAM per worker
 - **Input duplication:** Each worker processes the full input independently
 ## Industry Approaches
 ### 1. Mixture of Experts (MoE)
 **Used by:** GPT-4, Mixtral 8x7B
 - Full input goes to all "expert" sub-models
 - Router network decides which experts to activate
 - Each expert is smaller (e.g., 8x7B vs 1x56B equivalent)
 - **Trade-off:** More parameters total, but only a subset active per token
 ### 2. Ensemble Voting (Local Swarm's Approach)
 **Characteristics:**
 - Full input to all workers
 - Each worker generates independently
 - Vote on final outputs
 - **Pros:** True parallel processing, diverse perspectives
 - **Cons:** 100% input duplication, memory intensive
 ### 3. Pipeline/Multi-Agent
 **Used by:** LangChain, AutoGPT
 - Different workers get different subtasks
 - Sequential processing (not parallel)
 - **Pros:** Efficient memory usage, specialization
 - **Cons:** Loses swarm consensus benefit, higher latency
 ### 4. Speculative Decoding
 **Used by:** vLLM, Text Generation Inference
 - Small "draft" model processes input
 - Large model verifies (doesn't reprocess)
 - **Pros:** 2-3x speedup
 - **Cons:** Complex implementation
 ## Memory Offloading
 ### What It Is
 Moving part of the model's state from GPU VRAM to system RAM:
 - **Hot context** (active tokens) → GPU VRAM (fast)
 - **Cold context** (earlier tokens) → System RAM (slower)
 ### Performance Impact
 | Configuration | Speed | Memory |
 |---------------|-------|--------|
 | 100% GPU | 100% | 20GB VRAM |
 | 50% offload | 75% | 10GB VRAM + 10GB RAM |
 | 80% offload | 60% | 4GB VRAM + 16GB RAM |
 ### When to Use
 - **Recommended:** When you have plenty of RAM (32GB+) but limited VRAM (8-12GB)
 - **Trade-off:** 25-40% slower, but can run 2-3x more workers
 - **Implementation:** vLLM, DeepSpeed ZeRO-Infinity, llama.cpp
 ## Can Workers Share Context?
 ### The Short Answer
 **Raw input tokens:** Yes (negligible memory)
 **KV Cache (attention states):** No (99% of memory, unique per worker)
 ### Why KV Cache Can't Be Shared
 The attention mechanism requires unique Key/Value tensors per token position:
 ```
 Token 1: [K1, V1] ← unique to this position
 Token 2: [K2, V2] ← depends on Token 1
 ...
 Token N: [KN, VN] ← depends on all previous
 ```
 Even with the same input:
 - Different random seeds → different attention patterns
 - Each worker builds its own understanding
 - The "notes and highlights" (KV cache) are unique per worker
 ### Analogy
 Five people reading the same book:
 - ✅ **Can share:** The physical book (input tokens)
 - ❌ **Can't share:** Their notes, highlights, thoughts (KV cache)
 ## Options for Long Context (30K-60K+ tokens)
 ### Option 1: Long-Context Models
 **Models:** Phi-3.5 Mini, Llama 3.1/3.2, Qwen 2.5 (128K context)
 **Pros:**
 - Simplest architecture
 - True parallel swarm voting
 - No preprocessing
 **Cons:**
 - Requires 8-12GB VRAM per worker at 60K context
 - Limited model selection
 **Best for:** Users with high-end GPUs (RTX 4090, 24GB+ VRAM)
 ### Option 2: Context Compression
 **Architecture:** Two-stage processing
 **Stage 1:** Compression swarm (3-5 workers)
 - Split 60K into chunks
 - Summarize each chunk
 - Aggregate to 8K compressed context
 **Stage 2:** Solution swarm (N workers)
 - Each worker gets 8K compressed + 2K relevant original
 - Generate independently
 - Vote on best
 **Pros:**
 - Works with standard 8K models
 - Maintains swarm architecture
 - More workers possible
 **Cons:**
 - Potential information loss
 - Added latency (~2-3s)
 **Best for:** Users with 8-16GB VRAM who need 30K+ context
 ### Option 3: Hierarchical RAG
 **Architecture:** Three-tier system
 **Tier 1:** Indexing swarm
 - Embed context into vector database
 - Create searchable knowledge graph
 **Tier 2:** Retrieval + Generation
 - Query index for relevant context
 - Each worker gets ~6K retrieved + 2K raw
 - Generate solutions
 **Tier 3:** Voting swarm
 - Rerank and consensus
 **Pros:**
 - Scales to 100K+ tokens
 - Most robust to information loss
 - Specialized workers
 **Cons:**
 - Complex implementation
 - 3x higher latency
 - Requires vector DB
 **Best for:** Maximum accuracy, production deployments
 ## Current Local Swarm Implementation
 Local Swarm currently uses **Ensemble Voting (Option 1)** with standard context windows:
 - 2K-8K context (model dependent)
 - Each worker loads full model independently
 - No context sharing between workers
 - No offloading to system RAM (yet)
 ## Recommendations
 ### For 8K-16K Context
 Use current implementation with standard models
 ### For 30K+ Context
 Choose based on your hardware:
 | Setup | Recommended Approach |
 |-------|---------------------|
 | RTX 4090 (24GB) | Option 1: Long-context models |
 | RTX 4060 Ti (16GB) | Option 2: Context compression |
 | Multiple machines (federated) | Option 2 or 3 |
 | CPU-only | Option 2 with aggressive compression |
 ### Memory-Constrained Setups
 Enable CPU offloading to run more workers:
 ```bash
 # llama.cpp example
 ./main --cpu-partial 0.8  # Offload 80% to RAM
 ```
 ## Future Enhancements
 Potential improvements for Local Swarm:
 1. **Context compression layer** (Option 2 implementation)
 2. **CPU offloading support** for memory-constrained systems
 3. **Hierarchical RAG** for enterprise use cases
 4. **Speculative decoding** for 2-3x speedup
 ## References
 - vLLM PagedAttention: Efficient KV cache management
 - DeepSpeed ZeRO-Infinity: Offloading to CPU/NVMe
 - Mixtral 8x7B: Mixture of Experts architecture
 - Phi-3.5 Technical Report: Long-context small models
@@ -0,0 +1,215 @@
 # Development Patterns Analysis
 ## Circular Development Issues Identified
 ### 1. Tool Execution Architecture (15+ commits going in circles)
 **The Cycle:**
 ```
 Add server-side tool execution → Fix looping issues → Remove/simplify instructions 
 → Tools don't work → Add tool host → Return tool_calls to client (reversal) 
 → Execute server-side again (reversal back) → Fix parsing → Simplify format 
 → Enhance instructions → Add streaming support → Fix streaming format...
 ```
 **Commits showing the cycle:**
 - `00cd483` - Add server-side tool execution  
 - `df4587e` - Fix: prevent looping (checking for server-side results)
 - `c70f83a` - Fix: simplify looping prevention  
 - `1b181bf` - Fix: remove tool instructions (40k → 0 tokens)
 - `bad8732` - Fix: simplify to ~300 tokens
 - `12eaac0` - Add distributed tool host
 - `b7fc184` - **REVERSAL:** Return tool_calls to opencode (not server-side)
 - `f83e6fc` - **REVERSAL BACK:** Execute via tool executor
 - `aa137b6` - Fix: handle tool_calls as single object or array
 - `539ca21` - Simplify format to TOOL:/ARGUMENTS: pattern
 - `aabd2b2` - Enhance instructions for multi-step operations
 **Root Cause:** No clear architectural decision on:
 - Who executes tools? (Server vs Client)
 - What format? (JSON vs text patterns vs markdown)
 - When to add instructions? (Always vs first request vs never)
 ### 2. Tool Instruction Token Count (4 changes)
 ```
 40,000 tokens → 300 tokens → removed → enhanced (unknown count)
 ```
 **Problem:** No testing to validate if instructions actually work.
 ### 3. Tool Parsing (8+ fixes)
 Multiple commits fixing the same parsing issues:
 - `c5b8196` - Parse nested JSON in arguments
 - `76b12b3` - Parse JavaScript-style output  
 - `9d838c1` - Handle markdown code blocks
 - `e3701cf` - Extract content before tool_calls block
 - `aa137b6` - Handle single object or array
 - `539ca21` - Simplify to TOOL:/ARGUMENTS: pattern
 **Problem:** No unit tests for parsing. Each fix only handles one case.
 ### 4. Streaming + Tools (4 commits)
 ```
 Disable streaming when tools present → Add to streaming path → Fix SSE format
 ```
 **Problem:** Two completely different code paths that diverge and need separate fixes.
 ### 5. Debugging Commits (6 commits)
 Commits that only add debug logging:
 - `e0c500e` - "very visible request/response logging"
 - `25b675c` - "explicit logging for tool executor configuration"
 - `27e1971` - "response logging to both paths"
 - `e3eb52d` - "log message state"
 - `13e6fb2` - "add logging to tool call parsing"
 - `3039629` - "log request.tools"
 **Problem:** Debugging in production instead of having tests.
 ## Why This Happens
 ### 1. No Tests
 - **Impact:** Every change requires manual testing
 - **Result:** Fixes break other cases, regressions common
 - **Evidence:** 25+ commits fixing tool-related issues
 ### 2. Production Debugging
 - **Pattern:** Add debug logging → Fix → Remove debug logging
 - **Commits:** `e0c500e`, `3728eb7` (add then clean up)
 - **Better:** Unit tests with mocked LLM responses
 ### 3. Architectural Ambiguity
 - **Question:** Who owns tool execution?
 - **Server-side:** Better for simple providers
 - **Client-side:** Better for complex opencode integration
 - **Actual:** Switched back and forth 3+ times
 ### 4. Feature Interaction Complexity
 - Tools + Streaming = Two paths to maintain
 - Tools + Federation = Distributed execution complexity  
 - Tools + Different formats = Parsing nightmare
 ### 5. Unclear Requirements
 - Should instructions be in system prompt or user prompt?
 - How many tokens is acceptable?
 - What format should tools return?
 ## Recommendations to Prevent This
 ### Immediate (Prevents Next Cycle)
 1. **Pick One Architecture**
   - Decision: Server-side execution via tool executor
   - Document why in ARCHITECTURE.md
 2. **Token Budget**
   - Max 2000 tokens for tool instructions
   - Test with actual 16K context models
   - Never exceed 50% of context window
 3. **One Format Only**
   - Standardize on: `TOOL: name\nARGUMENTS: {"key": "value"}`
   - Remove all other parsing code
   - Single regex pattern
 4. **Add Unit Tests**
    ```python
    # test_tool_parsing.py
    def test_parse_simple_tool():
        text = "TOOL: read\nARGUMENTS: {\"filePath\": \"test.txt\"}"
        content, tools = parse_tool_calls(text)
        assert len(tools) == 1
        assert tools[0]["function"]["name"] == "read"
    def test_parse_no_tool():
        text = "Just a regular response"
        content, tools = parse_tool_calls(text)
        assert len(tools) == 0
        assert content == text
    def test_parse_multiple_tools():
        text = "TOOL: read\nARGUMENTS: {...}\n\nTOOL: write\nARGUMENTS: {...}"
        content, tools = parse_tool_calls(text)
        assert len(tools) == 2
    ```
 5. **Integration Test Script**
    ```bash
    # test_tools.sh
    python main.py --auto --test-tools
    # Tests: read file → write file → bash command
    # Exits with error code if any fail
    ```
 6. **Simplify Tool Instructions**
    - Current: ~300 tokens with 5 examples
    - Target: ~100 tokens with 2 examples
    - Include: read, write only (bash is obvious)
 ### Medium-term
 7. **Separate Concerns**
   ```
   src/tools/
   ├── parser.py      # Only parsing logic
   ├── executor.py    # Only execution logic  
   ├── formatter.py   # Only formatting instructions
   └── integration.py # Only API integration
   ```
 8. **Design Doc Before Code**
   - For tool system changes, write 1-page design first
   - Include: format, token count, examples, test plan
   - Get it right on paper before coding
 9. **Feature Flags**
   ```python
   # config.py
   USE_SERVER_SIDE_TOOLS = True  # Can toggle without code changes
   TOOL_INSTRUCTION_VERSION = "v2"  # A/B test formats
   ```
 ### Long-term
 10. **CI/CD Pipeline**
    - Run tests on every PR
    - Block merge if tests fail
    - Include: unit tests, integration tests, token count check
 11. **Observability**
    - Structured logging (not print statements)
    - Metrics: tool success rate, parsing errors, latency
    - Dashboard to see issues before users report them
 ## Current State Assessment
 **Good:**
 - Tool executor abstraction exists
 - Distributed tool execution works
 - Working directory handling improved
 - Timeout handling for package managers
 **Needs Work:**
 - Too many parsing code paths (simplify to one)
 - Instructions too long (reduce to <2000 tokens)
 - No automated testing
 - Debug logging still in production code
 ## Suggested Immediate Actions
 1. Merge current cleanup branch (already done ✓)
 2. Remove all but one parsing format (done ✓)
 3. Reduce tool instructions to <2000 tokens (done ✓)
 4. Add unit tests for tool parsing (done ✓)
 5. Add integration test for tool execution
 ## Success Metrics
 - Tool-related commits stabilize to <2 per month
 - Zero "fix: prevent looping" commits
 - All tool changes include tests
 - Instructions stay under 2000 tokens
@@ -1,524 +0,0 @@
 # Local Swarm - Complete Documentation
 ## Table of Contents
 1. [Quick Start Guide](#quick-start-guide)
 2. [Opencode Configuration](#opencode-configuration)
 3. [API Reference](#api-reference)
 4. [Troubleshooting](#troubleshooting)
 5. [Advanced Configuration](#advanced-configuration)
 6. [Performance Tuning](#performance-tuning)
 ---
 ## Quick Start Guide
 ### Installation
 **Windows:**
 ```powershell
 git clone https://github.com/yourusername/local_swarm.git
 cd local_swarm
 .\scripts\install.bat
 ```
 **macOS/Linux:**
 ```bash
 git clone https://github.com/yourusername/local_swarm.git
 cd local_swarm
 chmod +x scripts/install.sh
 ./scripts/install.sh
 ```
 **Android (Termux):**
 ```bash
 git clone https://github.com/yourusername/local_swarm.git
 cd local_swarm
 chmod +x scripts/install-termux.sh
 ./scripts/install-termux.sh
 ```
 ### First Run
 ```bash
 # Start with interactive menu
 python main.py
 # Or skip menu with auto-detection
 python main.py --auto
 ```
 ---
 ## Opencode Configuration
 ### Basic Configuration
 Add to your opencode configuration file (usually `~/.config/opencode/config.json`):
 ```json
 {
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
 }
 ```
 ### Configuration with Local Swarm on Different Machine
 If Local Swarm is running on another computer in your network:
 ```json
 {
  "model": {
    "provider": "openai",
    "base_url": "http://192.168.1.100:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm"
  }
 }
 ```
 ### Multiple Model Options
 You can configure multiple models and switch between them:
 ```json
 {
  "models": {
    "local-swarm": {
      "provider": "openai",
      "base_url": "http://localhost:8000/v1",
      "api_key": "not-needed",
      "model": "local-swarm"
    },
    "local-swarm-fast": {
      "provider": "openai",
      "base_url": "http://localhost:8000/v1",
      "api_key": "not-needed",
      "model": "local-swarm",
      "temperature": 0.2
    }
  },
  "default_model": "local-swarm"
 }
 ```
 ### With Context Window Configuration
 ```json
 {
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm",
    "max_tokens": 4096,
    "temperature": 0.7
  }
 }
 ```
 ### Environment-Specific Configurations
 **Development (local only):**
 ```json
 {
  "model": {
    "provider": "openai",
    "base_url": "http://localhost:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm",
    "temperature": 0.8
  }
 }
 ```
 **Production (federated swarm):**
 ```json
 {
  "model": {
    "provider": "openai",
    "base_url": "http://swarm-coordinator.local:8000/v1",
    "api_key": "not-needed",
    "model": "local-swarm",
    "temperature": 0.5
  }
 }
 ```
 ### Testing the Configuration
 After configuring opencode, test with:
 ```bash
 # Simple test
 opencode --version
 # Test with a prompt
 echo "Write a Python function to calculate factorial" | opencode
 ```
 ---
 ## API Reference
 ### OpenAI-Compatible Endpoints
 Local Swarm implements the OpenAI API specification.
 #### POST /v1/chat/completions
 Generate a chat completion.
 **Request:**
 ```json
 {
  "model": "local-swarm",
  "messages": [
    {"role": "user", "content": "Write a Python function to calculate factorial"}
  ],
  "max_tokens": 2048,
  "temperature": 0.7,
  "stream": false
 }
 ```
 **Response:**
 ```json
 {
  "id": "chatcmpl-abc123",
  "object": "chat.completion",
  "created": 1234567890,
  "model": "local-swarm",
  "choices": [{
    "index": 0,
    "message": {
      "role": "assistant",
      "content": "def factorial(n):\n    if n <= 1:\n        return 1\n    return n * factorial(n-1)"
    },
    "finish_reason": "stop"
  }],
  "usage": {
    "prompt_tokens": 15,
    "completion_tokens": 25,
    "total_tokens": 40
  }
 }
 ```
 #### GET /v1/models
 List available models.
 **Response:**
 ```json
 {
  "object": "list",
  "data": [
    {
      "id": "local-swarm",
      "object": "model",
      "created": 1234567890,
      "owned_by": "local-swarm"
    }
  ]
 }
 ```
 #### GET /health
 Check health status.
 **Response:**
 ```json
 {
  "status": "healthy",
  "version": "0.1.0",
  "workers": 5,
  "model": "Qwen 2.5 Coder 7b (q4_k_m)"
 }
 ```
 #### Federation Endpoints (when enabled)
 **GET /v1/federation/status**
 ```json
 {
  "enabled": true,
  "total_peers": 3,
  "healthy_peers": 3,
  "strategy": "weighted"
 }
 ```
 **GET /v1/federation/peers**
 ```json
 {
  "peers": [
    {
      "name": "desktop-pc",
      "host": "192.168.1.100",
      "port": 8000,
      "model_id": "qwen2.5-coder:7b:q4_k_m",
      "instances": 3
    }
  ]
 }
 ```
 ---
 ## Troubleshooting
 ### Common Issues
 #### Issue: "No module named 'llama_cpp'"
 **Solution:**
 ```bash
 # Install with pre-built wheel (recommended)
 pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
 # Or CPU-only
 pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cpu
 ```
 #### Issue: "CUDA not detected" on Windows
 **Solution:**
 1. Install NVIDIA drivers: https://www.nvidia.com/drivers
 2. Verify with: `nvidia-smi`
 3. Reinstall with CUDA support:
 ```powershell
 pip uninstall llama-cpp-python
 pip install llama-cpp-python --extra-index-url https://abetlen.github.io/llama-cpp-python/whl/cu121
 ```
 #### Issue: "Out of memory" errors
 **Solution:**
 ```bash
 # Reduce instances
 python main.py --instances 2
 # Or use smaller model
 python main.py --model qwen2.5-coder:3b:q4
 ```
 #### Issue: Slow performance on CPU
 **Solution:**
 - Use smaller models (3B instead of 7B)
 - Use Q4 quantization instead of Q6
 - Reduce number of instances to 2-3
 - Close other applications
 #### Issue: "No suitable model found"
 **Solution:**
 Your system has less than 2GB available memory. Try:
 - Close other applications
 - Use CPU-only mode (automatic if no GPU)
 - Add more RAM or use a machine with GPU
 #### Issue: Models not downloading
 **Solution:**
 ```bash
 # Check internet connection
 ping huggingface.co
 # Try manual download
 python main.py --download-only
 # Check cache directory
 ls ~/.local_swarm/models
 ```
 ### Platform-Specific Issues
 **Windows:**
 - Ensure Python is in PATH
 - Run PowerShell as Administrator if needed
 - Install Visual C++ Redistributable
 **macOS:**
 - Xcode Command Line Tools: `xcode-select --install`
 - May need to allow llama.cpp in Security preferences
 **Linux:**
 - Install build essentials: `sudo apt-get install build-essential`
 - For AMD: Install ROCm drivers
 - For Intel: Install oneAPI toolkit
 ---
 ## Advanced Configuration
 ### Configuration File (config.yaml)
 Create `config.yaml` in the project root:
 ```yaml
 server:
  host: "127.0.0.1"
  port: 8000
 swarm:
  consensus_strategy: "similarity"  # similarity, quality, fastest
  min_instances: 2
  max_instances: 5
 federation:
  enabled: false
  discovery_port: 8765
  federation_port: 8766
  max_peers: 10
 hardware:
  gpu_memory_fraction: 1.0  # Use 100% of GPU VRAM
  ram_fraction: 0.5  # Use 50% of system RAM for CPU
 models:
  cache_dir: "~/.local_swarm/models"
  preferred_models:
    - qwen2.5-coder
    - deepseek-coder
 ```
 ### Environment Variables
 ```bash
 # Custom cache directory
 export LOCAL_SWARM_CACHE_DIR="/path/to/models"
 # Debug mode
 export LOCAL_SWARM_DEBUG=1
 # Custom config file
 export LOCAL_SWARM_CONFIG="/path/to/config.yaml"
 ```
 ---
 ## Performance Tuning
 ### For Maximum Speed
 ```bash
 # Use smaller model
 python main.py --model qwen2.5-coder:3b:q4
 # Reduce instances (less memory contention)
 python main.py --instances 2
 # Skip consensus (single worker)
 # Edit config: consensus_strategy: "fastest"
 ```
 ### For Maximum Quality
 ```bash
 # Use largest model that fits
 python main.py --model qwen2.5-coder:7b:q6
 # More instances for better consensus
 python main.py --instances 5
 # Use quality consensus strategy
 # Edit config: consensus_strategy: "quality"
 ```
 ### For Balanced Performance
 ```bash
 # Recommended defaults (automatic)
 python main.py
 # Or explicitly
 python main.py --model qwen2.5-coder:7b:q4
 ```
 ### Memory Usage by Model
 | Model Size | Q4 VRAM | Q5 VRAM | Q6 VRAM |
 |------------|---------|---------|---------|
 | 1B-3B      | 0.7-2GB | 0.9-2.5GB | 1.1-3GB |
 | 7B         | 4.5GB   | 5.2GB   | 6.0GB   |
 | 13B-15B    | 8-9GB   | 9.5-11GB | 11-13GB |
 **Recommended:** Use Q4_K_M for best speed/quality balance.
 ---
 ## MCP Server Configuration
 ### Enable MCP Server
 ```bash
 python main.py --mcp
 ```
 ### MCP Tools Available
 When MCP is enabled, AI assistants can use:
 - `get_hardware_info` - Query system capabilities
 - `get_swarm_status` - Check swarm health
 - `generate_code` - Generate with consensus
 - `list_available_models` - Browse models
 - `get_worker_details` - Worker statistics
 ### Testing MCP
 ```bash
 # List available tools
 mcp-cli call local-swarm list_tools
 # Call a tool
 mcp-cli call local-swarm call_tool get_swarm_status
 ```
 ---
 ## Network Federation
 ### Setup Federated Swarm
 On each machine in your network:
 ```bash
 # Machine 1 (Windows PC with RTX 4060)
 python main.py --federation --port 8000
 # Machine 2 (Mac Mini M1)
 python main.py --federation --port 8000
 # Machine 3 (Linux with AMD GPU)
 python main.py --federation --port 8000
 ```
 Machines will auto-discover each other via mDNS.
 ### Verify Federation
 ```bash
 curl http://localhost:8000/v1/federation/status
 curl http://localhost:8000/v1/federation/peers
 ```
 ---
 ## Getting Help
 - **GitHub Issues:** https://github.com/sleepyeldrazi/local_swarm/issues
 - **Interactive Help:** Run `python main.py` and select `[t] Tips & Help`
 - **Hardware Detection:** Run `python main.py --detect`
 ## License
 MIT License - See LICENSE file
@@ -0,0 +1,92 @@
 # Design Decision: Complete React Example with Actual Code
 **Date:** 2024-02-24
 **Scope:** src/api/routes.py tool_instructions
 ## Problem
 Model is still not following instructions:
 1. Tries `npm install` before creating package.json
 2. Still tries `npx create-react-app` despite being told not to
 3. Instructions have placeholders like "..." and "etc." which models don't understand
 ## Root Cause
 The current instructions say:
 ```
 TOOL: write
 ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"dependencies\": {\"react\": \"^18.0.0\", \"react-dom\": \"^18.0.0\"}}"}
 [Continue with src/index.js, src/App.js, public/index.html, etc.]
 ```
 **Problem:** "etc." and "..." are meaningless to LLMs. They need concrete examples.
 ## Solution
 Provide a **complete, working, minimal React example** with actual file contents:
 1. Exact sequence: mkdir → write package.json → write src/App.js → write src/index.js → write public/index.html → npm install
 2. Actual file content, not placeholders
 3. Minimal viable React app (not full create-react-app structure)
 ## Implementation
 Replace vague example with complete working code:
 ```
 **COMPLETE REACT HELLO WORLD EXAMPLE:**
 User: "Create a React Hello World app"
 Step 1 - Create directory:
 TOOL: bash
 ARGUMENTS: {"command": "mkdir myapp"}
 Step 2 - Create package.json (MUST do this BEFORE npm install):
 TOOL: write
 ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"private\": true, \"dependencies\": {\"react\": \"^18.2.0\", \"react-dom\": \"^18.2.0\"}, \"scripts\": {\"start\": \"react-scripts start\", \"build\": \"react-scripts build\"}, \"devDependencies\": {\"react-scripts\": \"5.0.1\"}}"}
 Step 3 - Create src directory:
 TOOL: bash
 ARGUMENTS: {"command": "mkdir myapp/src"}
 Step 4 - Create App.js:
 TOOL: write
 ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "import React from 'react';\n\nfunction App() {\n  return (\n    <div className=\"App\">\n      <h1>Hello World</h1>\n      <p>Welcome to my React app!</p>\n    </div>\n  );\n}\n\nexport default App;"}
 Step 5 - Create index.js:
 TOOL: write
 ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "import React from 'react';\nimport ReactDOM from 'react-dom/client';\nimport App from './App';\n\nconst root = ReactDOM.createRoot(document.getElementById('root'));\nroot.render(<App />);"}
 Step 6 - Create public directory and index.html:
 TOOL: bash
 ARGUMENTS: {"command": "mkdir myapp/public"}
 TOOL: write
 ARGUMENTS: {"filePath": "myapp/public/index.html", "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n    <meta charset=\"UTF-8\">\n    <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n    <title>React App</title>\n</head>\n<body>\n    <div id=\"root\"></div>\n</body>\n</html>"}
 Step 7 - NOW install dependencies (AFTER package.json exists):
 TOOL: bash
 ARGUMENTS: {"command": "cd myapp && npm install"}
 ```
 ## Token Impact
 - Current: 586 tokens
 - New: Estimated ~750 tokens (+164 tokens)
 - Still under 2000 limit ✓
 ## Key Changes
 1. **Explicit sequencing:** "Step 1", "Step 2", etc.
 2. **Actual code:** No "..." or "etc." - real working content
 3. **Critical note:** "MUST do this BEFORE npm install"
 4. **Minimal structure:** Just what's needed for Hello World
 ## Success Criteria
 - [ ] Model creates package.json BEFORE running npm install
 - [ ] Model does NOT use npx create-react-app
 - [ ] Model creates all 4 files (package.json, App.js, index.js, index.html)
 - [ ] Model runs npm install last (after files exist)
@@ -0,0 +1,84 @@
 # Design Decision: Fix Subprocess Hang on Interactive Commands
 **Date:** 2024-02-24
 **Scope:** src/tools/executor.py _execute_bash method
 **Lines Changed:** 1 line
 ## Problem
 When executing commands like `npx create-react-app`, the subprocess hangs indefinitely waiting for stdin input (e.g., "Ok to proceed? (y)"). This causes:
 1. 300s timeout to be reached
 2. opencode to hang waiting for response
 3. Poor user experience
 ## Root Cause
 `subprocess.run()` by default inherits stdin from parent process. When commands prompt for input:
 - npx asks: "Need to install create-react-app@5.1.0 Ok to proceed? (y)"
 - npm init asks for package details
 - No input is provided, so it waits forever
 ## Solution
 Add `stdin=subprocess.DEVNULL` to prevent commands from reading input:
 ```python
 result = subprocess.run(
    command,
    shell=True,
    capture_output=True,
    text=True,
    timeout=timeout,
    cwd=cwd,
    stdin=subprocess.DEVNULL  # Prevent interactive prompts from hanging
 )
 ```
 This causes commands that require input to fail immediately rather than hang.
 ## Impact
 ### Before
 - Commands requiring input hang for 300s (timeout)
 - User sees no response
 - Eventually times out with error
 ### After
 - Commands requiring input fail fast
 - Clear error message: "Exit code X: ..." 
 - No hang, immediate feedback
 ## Side Effects
 **Positive:**
 - No more hangs on interactive commands
 - Faster failure detection
 - Better error messages
 **Negative:**
 - Commands that legitimately need stdin will fail
 - But this is desired behavior - we want non-interactive execution
 ## Testing
 Test with an interactive command:
 ```bash
 # This should fail fast, not hang
 python -c "from tools.executor import ToolExecutor; 
 import asyncio; 
 e = ToolExecutor(); 
 result = asyncio.run(e.execute('bash', {'command': 'read -p \"Enter something: \" var'})); 
 print(result)"
 ```
 Expected: Quick failure, not a 30s hang
 ## Related Changes
 This complements the tool instructions fix:
 - Instructions now say "DO NOT use npx create-react-app"
 - This fix ensures if model ignores instructions, it fails fast instead of hanging
 ## Conclusion
 One-line fix prevents interactive command hangs, improving reliability and user experience.
@@ -0,0 +1,178 @@
 # Design Decision: Fix Tool Execution and Token Reporting
 **Date:** 2024-02-24
 **Scope:** src/api/routes.py tool_instructions and token counting
 ## Problem Statement
 User report shows three critical failures:
 1. **Instruction vs Execution:** Model says "You should run mkdir..." instead of TOOL: format
 2. **Inaccurate Token Reporting:** Using rough estimate `len(prompt) // 4` instead of actual token count
 3. **Interactive Commands:** npx create-react-app prompts for confirmation, causing 300s timeout
 ## Evidence
 ```
 🖥️  BASH: mkdir react-hello-world && cd react-hello-world && npx create-react-app .
 ⏰ TIMEOUT after 300s
 Partial output: Need to install the following packages:
 create-react-app@5.1.0
 Ok to proceed? (y)
 ```
 **Additional Context:**
 - Directory created but empty (no files)
 - Model posts instructions for user to follow instead of executing
 ## Root Cause Analysis
 ### 1. Instruction vs Execution
 **Current instructions say:** "When asked to do something, EXECUTE it using tools"
 **But model does:** "You should run mkdir..."
 **Why:** Instructions aren't strong enough - need explicit anti-patterns
 ### 2. Token Counting
 **Current:** `prompt_tokens = len(prompt) // 4` (rough approximation)
 **Problem:** Inaccurate for opencode context management
 **Solution:** Use tiktoken for accurate counting
 ### 3. Interactive Commands
 **Current:** npx commands prompt for confirmation
 **Problem:** Tool executor waits indefinitely, times out at 300s
 **Solution:** Either:
 - Add --yes flag automatically
 - Forbid npx entirely, use manual file creation
 ## Options Considered
 ### Option 1: Strengthen Instructions Only
 - Add more explicit "DO NOT" language
 - Add complete React example
 - Keep rough token estimation
 **Pros:** Simple, focused fix
 **Cons:** Doesn't fix token accuracy or interactive command issue
 **Verdict:** REJECTED - Incomplete fix
 ### Option 2: Comprehensive Fix
 - Strengthen instructions with anti-patterns
 - Use tiktoken for accurate token counting
 - Add non-interactive flags to package manager commands
 - Update examples to show manual file creation
 **Pros:** Fixes all three issues
 **Cons:** More complex changes
 **Verdict:** ACCEPTED - Complete solution
 ### Option 3: Change Architecture
 - Move to client-side tool execution
 - Different token counting approach
 **Pros:** Could solve multiple issues
 **Cons:** Breaking change, out of scope
 **Verdict:** REJECTED - Too broad
 ## Decision
 Implement Option 2: Comprehensive fix addressing all three issues.
 ### Changes
 #### 1. Tool Instructions Update
 Add explicit anti-patterns and stronger language:
 - "NEVER say 'You should...' - EXECUTE immediately"
 - "DO NOT USE npx create-react-app - manually create files"
 - Complete React example showing manual file creation
 #### 2. Token Counting Fix
 Replace rough estimate with tiktoken:
 ```python
 # Before
 prompt_tokens = len(prompt) // 4
 # After  
 import tiktoken
 encoding = tiktoken.get_encoding('cl100k_base')
 prompt_tokens = len(encoding.encode(prompt))
 completion_tokens = len(encoding.encode(content))
 ```
 #### 3. Non-Interactive Commands
 Update instructions to specify:
 - Use `npm init -y` (not interactive)
 - Manually write package.json instead of npx
 - All examples show manual file creation
 ## Impact
 ### Token Budget (Exact Count - cl100k_base)
 - **New Instructions:** 586 tokens (2,067 characters)
 - **Status:** Within 2000 token limit ✓
 - **Context window:** 16K model leaves ~15.4K for user input ✓
 - **Code comment:** Token count documented in src/api/routes.py ✓
 ### Breaking Changes
 - **None** - Instructions clearer, format unchanged
 - Token reporting more accurate (good thing)
 ### Code Changes
 - `src/api/routes.py`:
  - Update tool_instructions (~+15 lines)
  - Add tiktoken import
  - Replace token estimation logic (~5 lines)
 ## Testing Strategy
 1. **Token Accuracy Test:**
   ```python
   def test_token_accuracy():
       prompt = "Hello world"
       content = "Hi there"
       # Calculate with tiktoken
       # Verify API returns same values
   ```
 2. **Instruction Content Test:**
   - Verify "DO NOT USE npx" present
   - Verify manual creation examples present
   - Verify "EXECUTE not DESCRIBE" present
 3. **Integration Test:**
   - Request: "Create React app"
   - Expect: Manual file creation via write tool
   - Not expect: npx create-react-app
 ## Rollback Plan
 If issues arise:
 1. Revert to previous instructions
 2. Keep tiktoken for token counting (beneficial)
 3. Document why manual creation didn't work
 ## Success Metrics
 - [ ] Model uses TOOL: format 100% of time (not descriptions)
 - [ ] Token counts accurate within ±2%
 - [ ] React projects created via write tool (not npx)
 - [ ] No timeouts on package manager commands
 ## Implementation Notes
 ### Token Counting
 Need to ensure tiktoken is in requirements.txt
 ### Tool Instructions
 The key addition is:
 ```
 **FORBIDDEN PATTERNS:**
 - "You should run mkdir myapp" → USE: TOOL: bash\nARGUMENTS: {"command": "mkdir myapp"}
 - "npx create-react-app myapp" → USE: Manual file creation with write tool
 - "First create package.json, then..." → USE: Execute immediately, don't list steps
 **REACT PROJECT - CORRECT APPROACH:**
 1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
 2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\"...}"}
 3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "..."}
 4. Continue until all files created
 ```
@@ -0,0 +1,172 @@
 # Design Decision: Improved Tool Instructions
 **Date:** 2024-02-24
 **Scope:** src/api/routes.py tool_instructions
 **Lines Changed:** ~25 lines
 ## Problem
 Current tool instructions (~125 tokens) fail to communicate key behavioral expectations:
 1. **Passive vs Active:** Model describes what to do instead of doing it
 2. **Refusal:** Model claims "I am only an AI assistant" instead of executing
 3. **Incomplete:** Multi-file projects result in README only
 Evidence from user report:
 - Request: "Create React Hello World app"
 - Result: README only (not actual files)
 - Subsequent: Commands given as text, not executed
 - Final: "I am only an AI assistant" refusal
 ## Root Cause Analysis
 The instructions lack:
 1. **Authority statement** - "You CAN and SHOULD use tools"
 2. **Execution mandate** - "Execute commands, don't just describe them"
 3. **Workflow clarity** - Clear step-by-step expectations
 4. **Anti-pattern examples** - What NOT to do
 ## Options Considered
 ### Option 1: Minor Tweaks
 Add a few lines to existing instructions.
 - **Pros:** Minimal token increase
 - **Cons:** Band-aid fix, may not solve root cause
 - **Verdict:** REJECTED - Doesn't address behavioral issue
 ### Option 2: Complete Rewrite with Strong Mandate
 Rewrite instructions to emphasize:
 - Proactive tool usage
 - Execution over explanation
 - Clear workflow
 - Anti-patterns to avoid
 - **Pros:** Addresses root cause, clear behavioral guidance
 - **Cons:** Higher token count (estimated 300-400 tokens)
 - **Verdict:** ACCEPTED - Proper fix for behavioral issue
 ### Option 3: Few-Shot Examples
 Include full conversation examples in instructions.
 - **Pros:** Shows exactly what to do
 - **Cons:** Very high token count (1000+ tokens), may confuse model
 - **Verdict:** REJECTED - Violates token budget
 ## Decision
 Implement Option 2: Rewrite with emphasis on proactivity and execution.
 **Key additions:**
 1. **Capability statement:** "You have tools. Use them."
 2. **Execution mandate:** "Don't describe, execute"
 3. **Workflow:** Clear request→tool→result→next cycle
 4. **Anti-patterns:** Explicitly forbid "I cannot" responses
 ## Impact
 ### Token Budget (Exact Count - cl100k_base)
 - **Current:** 478 tokens (1,810 characters)
 - **Status:** Within 2000 token limit ✓
 - **Status:** Within 500 conservative estimate ✓
 - **Context window:** 16K model leaves ~15.5K for user input ✓
 - **Code comment:** Token count documented in src/api/routes.py ✓
 ### Code Changes
 - **File:** src/api/routes.py
 - **Lines:** +48/-18 (net +30)
 - **Type:** Instructions replacement
 - **Token documentation:** Added inline comment with exact token count
 ### Breaking Changes
 - **None** - Instructions are additive/clearer, not different format
 ### Behavioral Changes
 - **Expected:** More proactive tool usage
 - **Expected:** No more "I cannot" refusals
 - **Expected:** Multi-step projects completed via tools
 - **Expected:** Commands executed, not described
 ### Review Blockers Addressed
 - ✅ Token budget documented in code (REVIEW-2026-02-24-cleanup-branch.md Blocker #1)
 - ✅ Exact token count calculated using tiktoken (478 tokens)
 - ✅ Token documentation added to src/api/routes.py (REVIEW-2026-02-24-cleanup-branch.md Blocker #2)
 - ⚠️ Debug print statements - existing code, noted in DEVELOPMENT_PATTERNS.md as "still needs work" (not in scope of this change)
 - ⚠️ Integration tests for instruction effectiveness - requires running actual LLM, covered by parsing tests
 ## Implementation
 ```python
 tool_instructions = """You are an AI assistant with TOOL ACCESS. You CAN and MUST use tools to accomplish tasks.
 **CRITICAL RULES:**
 1. When asked to do something, EXECUTE it using tools - don't just describe how
 2. You MUST use the bash tool to run commands like mkdir, npm, ls, etc.
 3. You MUST use the write tool to create files
 4. NEVER say "I cannot" or "I am only an AI assistant" - you HAVE tools, USE them
 5. Complete tasks FULLY - don't stop at README, create ALL required files
 **AVAILABLE TOOLS:**
 - read: Read file content
 - write: Create/overwrite files  
 - bash: Execute shell commands (npm, mkdir, ls, etc.)
 **TOOL FORMAT (STRICT):**
 TOOL: tool_name
 ARGUMENTS: {"param": "value"}
 **WORKFLOW:**
 1. User asks for something
 2. You decide what tool to use
 3. You respond with ONLY the TOOL: format above
 4. You receive the tool result
 5. You continue with next tool until task is COMPLETE
 **EXAMPLES:**
 Creating a project:
 User: "Create a React app"
 You: TOOL: bash
 ARGUMENTS: {"command": "mkdir myapp && cd myapp && npm init -y"}
 [wait for result]
 You: TOOL: write
 ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
 [continue until all files created]
 Running commands:
 User: "Install dependencies"
 You: TOOL: bash
 ARGUMENTS: {"command": "npm install"}
 [wait for result, then confirm completion]
 **WHAT NOT TO DO:**
 - ❌ "To create a React app, you should run: mkdir myapp" (describing)
 - ❌ "I cannot run commands, I am an AI" (refusing)
 - ❌ Creating only README instead of full project (incomplete)
 - ❌ "First do X, then do Y" (giving instructions instead of doing)
 **CORRECT BEHAVIOR:**
 - ✅ Execute the command immediately using the bash tool
 - ✅ Create all files using the write tool
 - ✅ Continue until task is 100% complete
 - ✅ Use ONE tool at a time and wait for results"""
 ```
 ## Testing
 1. Test with React Hello World request
 2. Verify model uses bash to create directory structure
 3. Verify model uses write to create all files
 4. Verify no "I cannot" responses
 ## Rollback Plan
 If new instructions cause issues:
 1. Revert to previous ~125 token version
 2. Analyze what specifically failed
 3. Iterate on smaller changes
 ## Success Metrics
 - [ ] Model uses tools on first request (not after prompting)
 - [ ] Zero "I cannot" or "I am an AI" responses
 - [ ] Multi-file projects fully created
 - [ ] Commands executed, not described
@@ -0,0 +1,151 @@
 # Design Decision: Task Planning and Verification Workflow
 **Date:** 2024-02-24
 **Scope:** src/api/routes.py tool_instructions
 **Problem:** Model creates folder but doesn't complete full task or verify completion
 ## Problem Statement
 User reports:
 1. "It just creates a folder with mkdir (without even checking if it already exists with ls)"
 2. No verification that tasks are completed
 3. No planning of full task scope
 4. Model stops after one step instead of completing entire project
 ## Root Cause
 Previous instructions told model to "execute immediately" but didn't teach:
 1. **Planning** - What needs to be done
 2. **Checking** - What already exists
 3. **Verification** - Did the step work
 4. **Completion loop** - Keep going until done
 ## Solution
 Add **Task Completion Workflow** to instructions:
 ```
 **TASK COMPLETION WORKFLOW (MANDATORY):**
 **1. PLAN:** List ALL steps needed before starting
 **2. CHECK:** Use ls to verify what exists before creating
 **3. EXECUTE:** Run first step
 **4. VERIFY:** Confirm step worked (ls, read file)
 **5. REPEAT:** Steps 3-4 until ALL complete
 **6. FINAL CHECK:** Verify entire task is done
 **7. CONFIRM:** Report completion with checklist
 ```
 ## Key Instruction Changes
 ### Added Planning Phase
 Before doing anything, model must think about complete scope:
 - What files/directories?
 - What dependencies?
 - Complete task requirements
 ### Added Verification Steps
 Every step must be verified:
 - `ls -la` after mkdir
 - `read` file after write
 - Check content is correct
 ### Added Completion Loop
 Model must continue until:
 ✓ All directories exist
 ✓ All files exist with correct content
 ✓ All dependencies installed
 ✓ Each component verified
 ### Complete Working Example
 Provided 13-step React example showing:
 1. Check existing (ls)
 2. Create directory
 3. Verify created (ls)
 4. Create package.json
 5. Verify package.json (read)
 6. Create source files
 7. Final verification (find myapp -type f)
 8. Install dependencies
 9. Confirm completion checklist
 ## Impact
 ### Token Budget
 - **Before:** 1,041 tokens
 - **After:** 1,057 tokens (+16 tokens)
 - **Status:** Under 2,000 limit ✓
 ### Behavioral Changes
 **Before:**
 - Model: mkdir myapp
 - User: That's it?
 - Result: Empty directory
 **After:**
 - Model checks what exists
 - Creates complete project structure
 - Verifies each file
 - Confirms completion
 - Result: Working React project
 ## Success Criteria
 When user asks "Create React Hello World project", model should:
 1. ✓ Check current directory contents
 2. ✓ Create myapp/ directory
 3. ✓ Verify directory created
 4. ✓ Create package.json
 5. ✓ Verify package.json content
 6. ✓ Create src/App.js
 7. ✓ Create src/index.js
 8. ✓ Create public/index.html
 9. ✓ Final verification (list all files)
 10. ✓ npm install
 11. ✓ Confirm completion checklist
 ## Testing
 Test instructions contain:
 - PLAN/CHECK keywords
 - VERIFY keyword
 - COMPLETE keyword
 All tests pass: 11/11 ✓
 ## Trade-offs
 **Pros:**
 - Complete task execution
 - Verification prevents partial work
 - Clear completion criteria
 - Better user experience
 **Cons:**
 - More tokens (but still under limit)
 - More verbose instructions
 - May be slower (more verification steps)
 ## Related Files Changed
 1. src/api/routes.py - Updated tool_instructions
 2. tests/test_tool_parsing.py - Updated tests for new content
 3. docs/design/2024-02-24-task-planning-verification.md - This doc
 ## Future Improvements
 1. **Task Queue System:** Server-side queue of pending operations
 2. **State Persistence:** Remember what's been done across conversations
 3. **Smart Resumption:** If interrupted, pick up where left off
 4. **Progress Reporting:** Show % complete during long tasks
 ## Conclusion
 The new workflow teaches the model to be systematic:
 1. Plan before acting
 2. Check before creating
 3. Verify after each step
 4. Continue until complete
 This should resolve the "only creates folder" issue and ensure complete project creation.
@@ -0,0 +1,132 @@
 # Design Decision: Tool Parsing Simplification
 **Date:** 2024-02-24
 **Scope:** src/api/routes.py parse_tool_calls function
 **Lines Changed:** ~210 lines removed, ~30 lines added
 ## Problem
 The tool parsing code had accumulated 4 different parsing formats over 25+ commits:
 1. JSON `tool_calls` format with nested objects
 2. TOOL:/ARGUMENTS: format (simple text)
 3. Function pattern format `func_name(args)`
 4. Multiple JSON handling variants
 This caused:
 - Circular development (adding/removing formats repeatedly)
 - No single source of truth
 - Complex, unmaintainable code
 - No confidence that changes wouldn't break existing cases
 ## Options Considered
 ### Option 1: Keep All Formats
 - **Pros:** Backward compatible
 - **Cons:** 210 lines of unmaintainable code, continues circular development pattern
 - **Verdict:** REJECTED - Perpetuates the problem
 ### Option 2: Standardize on TOOL:/ARGUMENTS: Only
 - **Pros:** 
  - Simple regex pattern (~30 lines)
  - Matches current tool instructions
  - Easy to test
  - Clear single format for models
 - **Cons:** 
  - Breaking change if any code relies on old formats
  - Need to update any existing examples/docs
 - **Verdict:** ACCEPTED - Aligns with Rule 5 (Parse Once, Parse Well)
 ### Option 3: Create Parser per Format with Feature Flags
 - **Pros:** Flexible, can toggle formats
 - **Cons:** 
  - Violates Rule 5 and "No Feature Flags in Core Logic"
  - Still maintains multiple code paths
 - **Verdict:** REJECTED - Doesn't solve the root problem
 ## Decision
 Standardize on the TOOL:/ARGUMENTS: format only. Remove all other parsing code.
 **Rationale:**
 - Per DEVELOPMENT_PATTERNS.md recommendation #3: "One Format Only"
 - Token cost is minimal (no complex regex)
 - Test coverage provides confidence
 - Aligns with existing tool instructions
 ## Impact
 ### Token Count
 - **Parser code:** 210 lines → 30 lines (-180 lines)
 - **No change** to tool instructions (separate optimization)
 ### Breaking Changes
 - **Yes** - Removes support for:
  - JSON `tool_calls` format in model responses
  - Function pattern format `read_file(path="test.txt")`
 **Migration:** Models must use:
 ```
 TOOL: read
 ARGUMENTS: {"filePath": "test.txt"}
 ```
 ### Testing
 - Unit tests added: 9 test cases
 - Coverage: All parsing scenarios
 - All tests pass
 ## Implementation
 ```python
 # New implementation (30 lines)
 def parse_tool_calls(text: str) -> tuple:
    """Parse tool calls using standardized format."""
    import json
    import re
    tool_pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
    tool_matches = list(re.finditer(tool_pattern, text, re.IGNORECASE))
    if not tool_matches:
        return text, None
    tool_calls = []
    for i, tool_match in enumerate(tool_matches):
        tool_name = tool_match.group(1)
        args_str = tool_match.group(2)
        try:
            args_dict = json.loads(args_str)
            tool_calls.append({
                "id": f"call_{i+1}",
                "type": "function", 
                "function": {
                    "name": tool_name,
                    "arguments": json.dumps(args_dict)
                }
            })
        except json.JSONDecodeError:
            continue
    if not tool_calls:
        return text, None
    first_start = tool_matches[0].start()
    content = text[:first_start].strip()
    return content, tool_calls
 ```
 ## Verification
 Run tests:
 ```bash
 python tests/test_tool_parsing.py
 ```
 Expected: 9 passed, 0 failed
 ## Follow-up
 - [x] Update DEVELOPMENT_PATTERNS.md to mark as completed
 - [x] Add unit tests
 - [ ] Consider integration test for full tool execution flow
@@ -0,0 +1,112 @@
 # Test Plan: Fix Tool Execution and Token Reporting
 ## Problem Analysis
 ### Issue 1: Model Gives Instructions Instead of Executing
 **Current behavior:** Model describes what to do ("You should run mkdir...") instead of using TOOL: format
 **Expected:** Model responds with TOOL: bash\nARGUMENTS: {"command": "mkdir..."}
 ### Issue 2: Token Counting Inaccurate
 **Current:** Rough estimate `len(prompt) // 4` 
 **Expected:** Accurate token count using tiktoken
 **Impact:** opencode can't properly manage context window
 ### Issue 3: npx Commands Timeout/Need Input
 **Current:** `npx create-react-app .` prompts for confirmation (y/n)
 **Expected:** Non-interactive execution or manual file creation
 **Evidence:** "Need to install the following packages: create-react-app@5.1.0 Ok to proceed? (y)"
 ## Unit Tests
 ### Test 1: Accurate Token Counting
 - [ ] Verify token count uses tiktoken (not rough estimate)
 - [ ] Test with known token counts
 - [ ] Verify prompt_tokens + completion_tokens = total_tokens
 ### Test 2: Non-Interactive Bash Commands
 - [ ] Verify npm/npx commands use --yes or equivalent flags
 - [ ] Test timeout handling for package managers
 - [ ] Verify commands don't prompt for user input
 ### Test 3: Tool Instructions Content
 - [ ] Verify instructions emphasize "EXECUTE not DESCRIBE"
 - [ ] Verify manual file creation examples (not npx)
 - [ ] Verify anti-patterns are clearly stated
 ## Integration Tests
 ### Test 4: End-to-End React Project Creation
 **Input:** "Create a React Hello World app"
 **Expected Flow:**
 1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
 2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
 3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "..."}
 4. Continue until complete
 **Failure Modes:**
 - [ ] Model describes steps instead of executing
 - [ ] Uses npx create-react-app (should manually create files)
 - [ ] Stops after README only
 ### Test 5: Token Reporting Accuracy
 **Input:** Any chat completion request
 **Expected:**
 - usage.prompt_tokens matches actual tokens
 - usage.completion_tokens matches actual tokens  
 - usage.total_tokens is sum
 **Verification:**
 - Compare tiktoken count vs API response
 ## Manual Verification
 ```bash
 # Test React creation
 python main.py --auto &
 curl -X POST http://localhost:17615/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Client-Working-Dir: /tmp/test-project" \
  -d '{
    "model": "local-swarm",
    "messages": [{"role": "user", "content": "Create a React Hello World app"}],
    "tools": [{"type": "function", "function": {"name": "bash"}}, {"type": "function", "function": {"name": "write"}}]
  }'
 # Check token accuracy
 curl -X POST http://localhost:17615/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-swarm",
    "messages": [{"role": "user", "content": "Hello"}]
  }' | jq '.usage'
 ```
 ## Success Criteria
 1. **Execution:** 100% of requests use TOOL: format (not descriptions)
 2. **Accuracy:** Token counts match tiktoken within ±5%
 3. **Completion:** Multi-file projects fully created via write tool
 4. **No npx:** Manual file creation for React (no npx create-react-app)
 ## Implementation Notes
 ### Token Counting Fix
 ```python
 # Replace: prompt_tokens = len(prompt) // 4
 # With:
 import tiktoken
 encoding = tiktoken.get_encoding('cl100k_base')
 prompt_tokens = len(encoding.encode(prompt))
 completion_tokens = len(encoding.encode(content))
 ```
 ### Tool Instructions Fix
 - Add explicit "DO NOT USE npx create-react-app" instruction
 - Add "EXECUTE IMMEDIATELY" mandate
 - Show complete React example with manual file creation
 ### Non-Interactive Commands
 - Auto-add --yes to npx commands
 - Or recommend manual file creation instead
@@ -0,0 +1,97 @@
 # Test Plan: Improved Tool Instructions
 ## Problem Statement
 Model is not using tools effectively:
 1. Creates README instead of actual project structure
 2. Provides commands as text instead of executing them
 3. Refuses to run commands claiming "I am only an AI assistant"
 ## Root Cause Analysis
 Current instructions don't clearly communicate:
 - That the model SHOULD use tools proactively
 - That execution is expected, not explanation
 - The workflow: user request → tool execution → result
 ## Unit Tests (Instruction Verification)
 ### Test 1: Instruction Presence
 - [ ] Verify instructions are injected into system message
 - [ ] Verify instructions appear at the START of system message (priority position)
 ### Test 2: Token Count
 - [ ] Measure total token count of new instructions
 - [ ] Verify ≤ 500 tokens (conservative budget)
 - [ ] Document before/after
 ### Test 3: Format Compliance
 - [ ] Verify instructions include TOOL:/ARGUMENTS: format
 - [ ] Verify examples use correct format
 - [ ] Verify rules are clear and numbered
 ## Integration Tests (Behavioral)
 ### Test 4: Project Creation Flow
 **Input:** "Create a React Hello World app"
 **Expected Behavior:**
 1. Model responds with TOOL: bash, ARGUMENTS: mkdir myapp
 2. After result, TOOL: write, ARGUMENTS: package.json content
 3. After result, TOOL: write, ARGUMENTS: src/App.js content
 4. Continue until complete project structure exists
 **Failure Modes:**
 - [ ] Model only describes what to do
 - [ ] Model creates README only
 - [ ] Model refuses to execute commands
 ### Test 5: Multi-step Task
 **Input:** "Check what files exist, then create a test.txt file with 'hello' in it"
 **Expected Behavior:**
 1. TOOL: bash, ARGUMENTS: ls -la
 2. Wait for result
 3. TOOL: write, ARGUMENTS: test.txt with "hello"
 **Failure Modes:**
 - [ ] Model tries to do both in one response
 - [ ] Model doesn't wait for ls result before writing
 ### Test 6: Command Refusal
 **Input:** "Run npm install"
 **Expected Behavior:**
 1. TOOL: bash, ARGUMENTS: npm install
 **Failure Modes:**
 - [ ] Model responds: "I cannot run commands, I am only an AI assistant"
 - [ ] Model explains npm install instead of running it
 ## Manual Verification Commands
 ```bash
 # Start the server
 python main.py --auto
 # In another terminal, test with curl
 curl -X POST http://localhost:17615/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-swarm",
    "messages": [{"role": "user", "content": "Create a React Hello World app"}],
    "tools": [{"type": "function", "function": {"name": "bash", "description": "Run shell commands"}}, {"type": "function", "function": {"name": "write", "description": "Write files"}}]
  }'
 ```
 ## Success Criteria
 1. **Proactivity:** Model uses tools without being asked twice
 2. **Execution:** Model runs commands, doesn't just describe them
 3. **No Refusal:** Model never says "I cannot" or "I am only an AI"
 4. **Completeness:** Multi-file projects are fully created via tools
 5. **Format:** 100% of tool calls use correct TOOL:/ARGUMENTS: format
 ## Metrics
 - **Tool usage rate:** % of requests that result in tool calls
 - **Format compliance:** % of tool calls in correct format
 - **Completion rate:** % of multi-step tasks fully completed
@@ -0,0 +1,35 @@
 # Test Plan: Tool Parsing Simplification
 ## Unit Tests
 - [x] Test case 1: Single tool call → Returns 1 tool with correct name and arguments
 - [x] Test case 2: No tool in text → Returns None for tools, original text as content  
 - [x] Test case 3: Multiple tools → Returns all tools in order
 - [x] Test case 4: Content before tool → Content extracted, tool parsed correctly
 - [x] Test case 5: Bash tool → Correctly parses bash command
 - [x] Test case 6: Case insensitive → "tool:" and "TOOL:" both work
 - [x] Test case 7: Invalid JSON → Skips invalid, continues with valid
 - [x] Test case 8: Empty text → Returns None, empty string
 - [x] Test case 9: Whitespace only → Returns None
 ## Integration Tests
 - [ ] End-to-end flow: 
  1. Send chat completion request with tools
  2. Model responds with TOOL:/ARGUMENTS: format
  3. Parser extracts tool call
  4. Tool executes
  5. Result returned in response
 - [ ] Expected result: Tool executes successfully, result included in response
 ## Manual Verification
 - [ ] Command: `python tests/test_tool_parsing.py`
 - [ ] Expected output: "9 passed, 0 failed"
 ## Token Budget Verification
 - Parser code: ~30 lines (~200 tokens)
 - Well under 2000 token limit
 - Simple regex pattern maintains low complexity
@@ -45,6 +45,10 @@ from interactive import (
 )
 from network import create_discovery_service, FederatedSwarm
 from tools.executor import ToolExecutor, set_tool_executor
 from utils.logging_config import setup_logging
 # Set up logging (DEBUG level for development)
 setup_logging()
 async def setup_swarm(model_config, hardware):
@@ -4,6 +4,7 @@ pyyaml>=6.0
 requests>=2.31.0
 tqdm>=4.65.0
 psutil>=5.9.0
 tiktoken>=0.5.0
 # API server
 fastapi>=0.104.0
@@ -0,0 +1,34 @@
 #!/usr/bin/env python3
 import re
 # Read the file
 with open('src/api/routes.py', 'r') as f:
    lines = f.readlines()
 # Find the line with 'logger = logging.getLogger(__name__)'
 has_logger = any('logger = logging.getLogger(__name__)' in line for line in lines)
 if not has_logger:
    # Find where to insert (after TOKEN_ENCODING line)
    for i, line in enumerate(lines):
        if 'TOKEN_ENCODING = tiktoken.get_encoding' in line:
            lines.insert(i + 1, '\n')
            lines.insert(i + 2, '# Set up logger\n')
            lines.insert(i + 3, 'logger = logging.getLogger(__name__)\n')
            break
 # Replace print statements
 new_lines = []
 for line in lines:
    # Replace print(f"...) with logger.debug(f"...")
    if 'print(f"' in line and not line.strip().startswith('#'):
        line = line.replace('print(f"', 'logger.debug(f"')
    elif 'print(f\'' in line and not line.strip().startswith('#'):
        line = line.replace('print(f\'', 'logger.debug(f\'')
    new_lines.append(line)
 # Write back
 with open('src/api/routes.py', 'w') as f:
    f.writelines(new_lines)
 print('Done! Replaced print statements with logger.debug')
@@ -0,0 +1,44 @@
 #!/usr/bin/env python3
 import re
 import sys
 filepath = sys.argv[1]
 # Read the file
 with open(filepath, 'r') as f:
    lines = f.readlines()
 # Find the line with 'logger = logging.getLogger(__name__)'
 has_logger = any('logger = logging.getLogger(__name__)' in line for line in lines)
 has_logging_import = any('import logging' in line for line in lines)
 if not has_logging_import:
    # Find where to insert import
    for i, line in enumerate(lines):
        if line.startswith('import ') or line.startswith('from '):
            lines.insert(i, 'import logging\n')
            break
 if not has_logger:
    # Find where to insert logger (after imports)
    for i, line in enumerate(lines):
        if line.startswith('class ') or line.startswith('def '):
            lines.insert(i, '\n')
            lines.insert(i + 1, 'logger = logging.getLogger(__name__)\n')
            break
 # Replace print statements
 new_lines = []
 for line in lines:
    # Replace print(f"...) with logger.debug(f"...")
    if 'print(f"' in line and not line.strip().startswith('#'):
        line = line.replace('print(f"', 'logger.debug(f"')
    elif 'print(f\'' in line and not line.strip().startswith('#'):
        line = line.replace('print(f\'', 'logger.debug(f\'')
    new_lines.append(line)
 # Write back
 with open(filepath, 'w') as f:
    f.writelines(new_lines)
 print(f'Done! Fixed logging in {filepath}')
@@ -0,0 +1,87 @@
 #!/usr/bin/env python3
 """Script to replace print statements with logging in Python files."""
 import re
 import sys
 def replace_prints_in_file(filepath):
    """Replace print statements with logger calls in a file."""
    with open(filepath, 'r') as f:
        content = f.read()
    original_content = content
    # Add logger import if not present
    if 'logger = logging.getLogger(__name__)' not in content and 'import logging' in content:
        # Already has logging import but no logger setup
        pass
    elif 'import logging' not in content:
        # Need to add logging import
        lines = content.split('\n')
        import_idx = 0
        for i, line in enumerate(lines):
            if line.startswith('import ') or line.startswith('from '):
                import_idx = i + 1
        lines.insert(import_idx, 'import logging')
        lines.insert(import_idx + 1, '')
        lines.insert(import_idx + 2, 'logger = logging.getLogger(__name__)')
        content = '\n'.join(lines)
    # Replace simple print statements with logger.debug
    # Pattern: print(f"...")
    content = re.sub(
        r'^(\s*)print\(f"([^"]+)"\)',
        r'\1logger.debug(f"\2")',
        content,
        flags=re.MULTILINE
    )
    # Pattern: print(f'...')
    content = re.sub(
        r"^(\s*)print\(f'([^']+)'\)",
        r'\1logger.debug(f"\2")',
        content,
        flags=re.MULTILINE
    )
    # Pattern: print("...")
    content = re.sub(
        r'^(\s*)print\("([^"]+)"\)',
        r'\1logger.debug("\2")',
        content,
        flags=re.MULTILINE
    )
    # Pattern: print(f"...", end="")
    content = re.sub(
        r'^(\s*)print\(f"([^"]+)",\s*end="[^"]*"\)',
        r'\1logger.debug(f"\2")',
        content,
        flags=re.MULTILINE
    )
    # Pattern: print(f"..." \n     f"...") - multiline
    content = re.sub(
        r'print\(f"([^"]+)"\s*\n\s*f"',
        r'logger.debug(f"\1" \n                         f"',
        content
    )
    with open(filepath, 'w') as f:
        f.write(content)
    # Count changes
    changes = content.count('logger.debug') - original_content.count('logger.debug')
    if changes > 0:
        print(f"Replaced ~{changes} print statements in {filepath}")
    return changes
 if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("Usage: python replace_prints.py <filepath>")
        sys.exit(1)
    filepath = sys.argv[1]
    replace_prints_in_file(filepath)
@@ -91,7 +91,7 @@ class ChatCompletionResponse(BaseModel):
 class ChatCompletionStreamChoice(BaseModel):
    """A choice in streaming response."""
    index: int = Field(default=0, description="Choice index")
-    delta: Dict[str, str] = Field(..., description="Content delta")
+    delta: Dict[str, Any] = Field(..., description="Content delta (can include 'content', 'tool_calls', etc.)")
    finish_reason: Optional[str] = Field(default=None, description="Reason for finishing")
@@ -1,13 +1,76 @@
 """OpenAI-compatible API routes for Local Swarm."""
 import json
 import logging
 import os
 import time
 import uuid
 from pathlib import Path
 from typing import AsyncIterator, Optional
-from fastapi import APIRouter, HTTPException
+import tiktoken
 from fastapi import APIRouter, HTTPException, Request
 from fastapi.responses import StreamingResponse
 # Initialize tokenizer for accurate token counting
 TOKEN_ENCODING = tiktoken.get_encoding('cl100k_base')
 # Set up logger
 logger = logging.getLogger(__name__)
 # Cache for tool instructions (loaded from config file)
 _TOOL_INSTRUCTIONS_CACHE: Optional[str] = None
 def _load_tool_instructions() -> str:
    """Load tool instructions from config file.
    Loads from config/prompts/tool_instructions.txt
    Falls back to default if file not found.
    Returns:
        Tool instructions string
    """
    global _TOOL_INSTRUCTIONS_CACHE
    if _TOOL_INSTRUCTIONS_CACHE is not None:
        return _TOOL_INSTRUCTIONS_CACHE
    # Try to load from config file
    config_path = Path(__file__).parent.parent.parent / "config" / "prompts" / "tool_instructions.txt"
    try:
        if config_path.exists():
            with open(config_path, 'r') as f:
                _TOOL_INSTRUCTIONS_CACHE = f.read().strip()
            logger.debug(f"Loaded tool instructions from {config_path}")
        else:
            # Fallback default instructions
            _TOOL_INSTRUCTIONS_CACHE = """You MUST use tools. DO NOT explain. DO NOT use markdown.
 OUTPUT THIS EXACT FORMAT - NOTHING ELSE:
 TOOL: bash
 ARGUMENTS: {"command": "your command here"}
 Available tools:
 - bash: Run shell commands
 - write: Create files
 - read: Read files
 NEVER write explanations.
 NEVER use numbered lists.
 NEVER use markdown code blocks.
 ONLY output TOOL: lines."""
            logger.warning(f"Tool instructions config not found at {config_path}, using default")
    except Exception as e:
        logger.error(f"Error loading tool instructions: {e}")
        # Use minimal fallback
        _TOOL_INSTRUCTIONS_CACHE = 'Use TOOL: tool_name\\nARGUMENTS: {"param": "value"} format.'
    return _TOOL_INSTRUCTIONS_CACHE
 from api.models import (
    ChatCompletionRequest,
    ChatCompletionResponse,
@@ -65,21 +128,8 @@ def format_messages_with_tools(messages: list, tools: Optional[list] = None) ->
    # Add brief tool instructions if tools are present and no assistant has responded yet
    if tools and not has_tool_results and not has_assistant_response:
-        tool_instructions = """You have access to these tools:
+        tool_instructions = _load_tool_instructions()
-
+        logger.debug(f"Loaded tool instructions: {len(tool_instructions)} chars")
 read: Read a file (filePath)
 write: Write to a file (filePath, content)  
 bash: Run a shell command (command)
 When you need to use a tool, respond with ONLY this format:
 TOOL: tool_name
 ARGUMENTS: {"param": "value"}
 Example:
 TOOL: read
 ARGUMENTS: {"filePath": "hello.txt"}
 Do not explain - just output TOOL: and ARGUMENTS: when using tools."""
        # Add to system message or create one
        has_system = False
@@ -87,11 +137,22 @@ Do not explain - just output TOOL: and ARGUMENTS: when using tools."""
            if msg.role == "system":
                msg.content = tool_instructions + "\n\n" + (msg.content or "")
                has_system = True
                logger.debug("Added tool instructions to existing system message")
                break
        if not has_system:
            from api.models import ChatMessage
            messages.insert(0, ChatMessage(role="system", content=tool_instructions))
            logger.debug("Created new system message with tool instructions")
    # Debug: Log the full prompt being sent to model
    full_prompt = []
    for msg in messages:
        if msg.role == "system":
            full_prompt.append(f"[SYSTEM] {msg.content[:200]}...")
        elif msg.role == "user":
            full_prompt.append(f"[USER] {msg.content}")
    logger.debug(f"Prompt preview: {' | '.join(full_prompt)}")
    for msg in messages:
        role = msg.role
@@ -111,229 +172,229 @@ Do not explain - just output TOOL: and ARGUMENTS: when using tools."""
    return "\n".join(formatted)
-async def execute_tool_server_side(tool_name: str, tool_args: dict) -> str:
+async def execute_tool_server_side(tool_name: str, tool_args: dict, working_dir: Optional[str] = None) -> str:
-    """Execute a tool using the configured tool executor (local or remote)."""
+    """Execute a tool using the configured tool executor (local or remote).
    Args:
        tool_name: Name of the tool to execute
        tool_args: Arguments for the tool
        working_dir: The working directory to use for file operations and bash commands.
    """
    import os
    # Determine working directory
    if working_dir is None:
        # Try environment variable first
        env_dir = os.getenv('LOCAL_SWARM_CLIENT_WORKING_DIR')
        if env_dir:
            working_dir = env_dir
            logger.debug(f"  🌍 Using client working dir from LOCAL_SWARM_CLIENT_WORKING_DIR: {working_dir}")
        else:
            # Auto-detect project root from server's cwd (fallback)
            working_dir = _discover_project_root()
            logger.debug(f"  ⚠️  No client working dir provided, auto-detected: {working_dir}")
            logger.debug(f"  💡 For correct file locations, set X-Client-Working-Dir header or LOCAL_SWARM_CLIENT_WORKING_DIR env var")
    # Inject working_dir into tool_args if provided
    if working_dir is not None:
        # Make a copy to avoid mutating original
        tool_args = dict(tool_args)
        # For bash, use 'cwd' parameter; for read/write, use 'working_dir'
        if tool_name == 'bash':
            tool_args['cwd'] = working_dir
        else:
            tool_args['working_dir'] = working_dir
    executor = get_tool_executor()
    if executor is None:
        # Fallback to local execution if no executor configured
-        print(f"    ⚠️  No tool executor configured, creating local fallback")
+        logger.debug(f"    ⚠️  No tool executor configured, creating local fallback")
        executor = ToolExecutor(tool_host_url=None)
        set_tool_executor(executor)
    else:
        # Log which mode we're using
        if executor.tool_host_url:
-            print(f"    🔗 Using remote tool host: {executor.tool_host_url}")
+            logger.debug(f"    🔗 Using remote tool host: {executor.tool_host_url}")
        else:
-            print(f"    🏠 Using local tool execution")
+            logger.debug(f"    🏠 Using local tool execution")
        logger.debug(f"    📍 Using working directory: {working_dir}")
    return await executor.execute(tool_name, tool_args)
-def parse_tool_calls(text: str) -> tuple:
+def _discover_project_root(start_dir: Optional[str] = None) -> str:
-    """Parse tool calls from model output.
+    """Discover the project root directory by looking for common markers."""
    if start_dir is None:
        start_dir = os.getcwd()
    current = os.path.abspath(start_dir)
    # Common project root markers
    markers = ['.git', 'package.json', 'pyproject.toml', 'Cargo.toml', 'go.mod', 
               'requirements.txt', 'setup.py', 'pom.xml', 'build.gradle', '.project', '.venv']
    while True:
        try:
            if any(os.path.exists(os.path.join(current, marker)) for marker in markers):
                return current
        except Exception:
            pass  # Permission errors, just skip
        parent = os.path.dirname(current)
        if parent == current:  # Reached filesystem root
            break
        current = parent
    return start_dir
 def _ensure_tool_arguments(tool_name: str, args_dict: dict) -> dict:
    """Ensure tool arguments have all required fields.
    For bash tool: inject 'description' field if missing.
    """
    if tool_name == 'bash' and 'description' not in args_dict:
        # Generate description from command
        command = args_dict.get('command', '')
        # Extract first word or short description
        desc = command.split()[0] if command else 'Execute command'
        args_dict['description'] = desc
    return args_dict
 def parse_tool_calls(text: str) -> tuple:
    """Parse tool calls from model output using the standardized format.
    Supports multiple formats for compatibility with different model sizes:
    1. Standard: TOOL: name\nARGUMENTS: {"key": "value"}
    2. Markdown: ```bash command```
    3. Numbered lists: 1. command
    4. Inline: npm install ...
    Returns:
        tuple: (content_without_tools, list_of_tool_calls or None)
    """
    import json
    import re
-    
+
-    # Strip markdown code blocks if present
+    # Priority 1: Standard format TOOL: name\nARGUMENTS: {...}
    cleaned_text = text
    # Remove ```json ... ``` or ``` ... ``` blocks
    cleaned_text = re.sub(r'```(?:json)?\s*\n?(.+?)```', r'\1', cleaned_text, flags=re.DOTALL)
    cleaned_text = cleaned_text.strip()
    # Try to find JSON with tool_calls - look for { tool_calls: [...] } or { tool_calls: {...} } pattern
    try:
        # Look for tool_calls inside braces (handle both quoted and unquoted keys)
        # Match either an array \[...\] or a single object {...}
        pattern = r'\{\s*"?tool_calls"?\s*:\s*(\[.*?\]|\{.*?\})\s*\}'
        match = re.search(pattern, cleaned_text, re.DOTALL)
        if match:
            value_str = match.group(1)
            # Try to parse as JSON first
            try:
                parsed = json.loads(value_str)
                # Normalize to list: if it's a dict (single tool), wrap in list
                if isinstance(parsed, dict):
                    tool_calls = [parsed]
                else:
                    tool_calls = parsed
            except json.JSONDecodeError:
                # Fix common JSON issues in model output
                fixed = value_str
                # Step 1: Handle unquoted keys (JavaScript style)
                fixed = re.sub(r'([{,])\s*([a-zA-Z_][a-zA-Z0-9_]*)\s*:', r'\1"\2":', fixed)
                # Step 2: Handle the arguments field - the model often outputs unescaped JSON
                # Find "arguments": "..." and escape inner quotes
                # We need to be careful not to double-escape already escaped quotes
                def fix_arguments_field(match):
                    before = match.group(1)  # "arguments": "
                    args_content = match.group(2)  # The inner content that should be escaped
                    after = match.group(3)  # " followed by , or }
                    # Check if already escaped by looking for \\"
                    if '\\"' in args_content:
                        # Already escaped, return as-is
                        return match.group(0)
                    # Need to escape quotes in the content
                    # But be careful - we need to handle nested JSON
                    # Replace " with \\" but only if not already escaped
                    escaped = args_content.replace('"', '\\"')
                    return before + escaped + after
                # Match "arguments": "content" where content may contain unescaped quotes
                fixed = re.sub(r'("arguments":\s*")((?:(?!"[,}\]]).)*)("\s*[,}])', fix_arguments_field, fixed, flags=re.DOTALL)
                # Step 3: Replace single quotes with double quotes
                fixed = fixed.replace("'", '"')
                try:
                    parsed = json.loads(fixed)
                    # Normalize to list
                    if isinstance(parsed, dict):
                        tool_calls = [parsed]
                    else:
                        tool_calls = parsed
                except json.JSONDecodeError as e2:
                    # If still fails, try one more approach - manual extraction
                    try:
                        # Extract just the essential fields we need
                        tool_calls = []
                        # Find all function blocks - need to handle nested braces
                        # Look for "function": {...} where ... can contain nested braces
                        func_pattern = r'"function":\s*(\{[^{}]*(?:\{[^{}]*\}[^{}]*)*\})'
                        func_matches = list(re.finditer(func_pattern, value_str, re.DOTALL))
                        for i, func_match in enumerate(func_matches):
                            func_content = func_match.group(1)
                            # Remove the outer braces if present
                            func_content = func_content.strip()
                            if func_content.startswith('{') and func_content.endswith('}'):
                                func_content = func_content[1:-1]
                            # Extract name
                            name_match = re.search(r'"name":\s*"([^"]+)"', func_content)
                            name = name_match.group(1) if name_match else "unknown"
                            # Extract arguments - find "arguments": and capture everything until the closing quote
                            # The model outputs: "arguments": "{\"filePath\": \"value\"}"
                            # We need to handle the escaped quotes inside
                            args_match = re.search(r'"arguments":\s*"(.+?)"\s*$', func_content.strip(), re.DOTALL)
                            if args_match:
                                args_str = args_match.group(1)
                                # Unescape the quotes (\" becomes ")
                                args_str = args_str.replace('\\"', '"')
                                # Try to parse as JSON object
                                try:
                                    args_json = json.loads(args_str)
                                    args_final = json.dumps(args_json)
                                except json.JSONDecodeError:
                                    # If it's not valid JSON, wrap it as a string
                                    args_final = json.dumps(args_str)
                            else:
                                args_final = "{}"
                            tool_calls.append({
                                "id": f"call_{i+1}",
                                "type": "function",
                                "function": {
                                    "name": name,
                                    "arguments": args_final
                                }
                            })
                        if not tool_calls:
                            return text, None
                    except Exception:
                        return text, None
            # Find and remove the tool_calls section from text
            full_match = re.search(pattern, cleaned_text, re.DOTALL)
            if full_match:
                # Extract content before the tool_calls block from original text
                content_end = text.find(full_match.group(0))
                if content_end > 0:
                    content = text[:content_end].strip()
                    # Also strip any markdown block start that might be there
                    content = re.sub(r'```\w*\s*$', '', content).strip()
                else:
                    content = ""
            else:
                content = ""
            return content, tool_calls
    except Exception as e:
        pass
    # Try new simple format: TOOL: name\nARGUMENTS: {...}
    tool_pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
-    tool_match = re.search(tool_pattern, text, re.IGNORECASE)
+    tool_matches = list(re.finditer(tool_pattern, text, re.IGNORECASE))
-    if tool_match:
+
-        tool_name = tool_match.group(1)
+    if tool_matches:
-        args_str = tool_match.group(2)
+        tool_calls = []
-        try:
+        for i, tool_match in enumerate(tool_matches):
-            args_dict = json.loads(args_str)
+            tool_name = tool_match.group(1)
            args_str = tool_match.group(2)
            try:
                args_dict = json.loads(args_str)
                # Ensure required fields are present
                args_dict = _ensure_tool_arguments(tool_name, args_dict)
                tool_calls.append({
                    "id": f"call_{i+1}",
                    "type": "function",
                    "function": {
                        "name": tool_name,
                        "arguments": json.dumps(args_dict)
                    }
                })
            except json.JSONDecodeError:
                continue
        if tool_calls:
            first_start = tool_matches[0].start()
            content = text[:first_start].strip()
            return content, tool_calls
    # Priority 2: Markdown code blocks (```bash command```)
    markdown_pattern = r'```(?:bash|shell|sh)?\s*\n(.*?)\n```'
    markdown_matches = list(re.finditer(markdown_pattern, text, re.DOTALL))
    if markdown_matches:
        tool_calls = []
        for i, match in enumerate(markdown_matches):
            code_content = match.group(1).strip()
            if code_content:
                args_dict = {"command": code_content}
                args_dict = _ensure_tool_arguments("bash", args_dict)
                tool_calls.append({
                    "id": f"call_{i+1}",
                    "type": "function",
                    "function": {
                        "name": "bash",
                        "arguments": json.dumps(args_dict)
                    }
                })
        if tool_calls:
            first_start = markdown_matches[0].start()
            content = text[:first_start].strip()
            return content, tool_calls
    # Priority 3: Look for command lines anywhere in text (for 7B models)
    # Match lines containing common bash commands with their arguments
    command_lines = []
    for line in text.split('\n'):
        line = line.strip()
        # Match commands like: npm install, npx create-react-app, mkdir myapp, create-react-app, etc.
        if re.match(r'^(npm|npx|mkdir|cd|ls|cat|echo|git|python|pip|node|yarn|create-react-app)\s+', line):
            command_lines.append(line)
    if command_lines:
        # Create a single tool call with all commands chained
        combined_command = ' && '.join(command_lines)
        args_dict = {"command": combined_command}
        args_dict = _ensure_tool_arguments("bash", args_dict)
        tool_calls = [{
            "id": "call_1",
            "type": "function",
            "function": {
                "name": "bash",
                "arguments": json.dumps(args_dict)
            }
        }]
        return "", tool_calls
    # Priority 4: Look for standalone bash commands (last resort)
    # Match lines that start with common bash commands
    standalone_pattern = r'(?:^|\n)(npm\s+\w+|npx\s+\w+|mkdir\s+\w+|cd\s+\w+|git\s+\w+)(?:\s|$)'
    standalone_matches = list(re.finditer(standalone_pattern, text, re.MULTILINE))
    if standalone_matches:
        commands = [match.group(1).strip() for match in standalone_matches]
        if commands:
            combined_command = ' && '.join(commands)
            args_dict = {"command": combined_command}
            args_dict = _ensure_tool_arguments("bash", args_dict)
            tool_calls = [{
                "id": "call_1",
-                "type": "function", 
+                "type": "function",
                "function": {
-                    "name": tool_name,
+                    "name": "bash",
                    "arguments": json.dumps(args_dict)
                }
            }]
-            # Extract content before the tool call
+            return "", tool_calls
-            content = text[:tool_match.start()].strip()
+
-            return content, tool_calls
+    # Priority 5: Look for URLs mentioned in text (for webfetch)
-        except json.JSONDecodeError:
+    # Match common URL patterns like https://github.com/...
-            pass
+    url_pattern = r'https?://[^\s<>"\')\]]+[a-zA-Z0-9]'
    url_matches = list(re.finditer(url_pattern, text))
-    # Try alternative format: look for function call patterns
+    if url_matches:
-    # Pattern: function_name(arg1=value1, arg2=value2)
+        urls = [match.group(0) for match in url_matches]
-    func_pattern = r'(\w+)\s*\(([^)]*)\)'
+        if urls:
-    matches = list(re.finditer(func_pattern, text))
+            # Create webfetch tool calls for each URL
-    
+            tool_calls = []
-    if matches:
+            for i, url in enumerate(urls):
-        tool_calls = []
+                tool_calls.append({
-        last_end = 0
+                    "id": f"call_{i+1}",
-        content_parts = []
+                    "type": "function",
-        
+                    "function": {
-        for i, match in enumerate(matches):
+                        "name": "webfetch",
-            func_name = match.group(1)
+                        "arguments": json.dumps({"url": url, "format": "markdown"})
-            args_str = match.group(2)
+                    }
-            
+                })
-            # Add text before this function call
+            return "", tool_calls
-            content_parts.append(text[last_end:match.start()].strip())
+
            last_end = match.end()
            # Parse arguments
            args_dict = {}
            if args_str:
                # Simple arg parsing: key=value
                for arg in args_str.split(','):
                    if '=' in arg:
                        key, value = arg.split('=', 1)
                        args_dict[key.strip()] = value.strip().strip('"\'')
            tool_calls.append({
                "id": f"call_{i}",
                "type": "function",
                "function": {
                    "name": func_name,
                    "arguments": json.dumps(args_dict)
                }
            })
        # Add remaining text
        content_parts.append(text[last_end:].strip())
        content = " ".join(p for p in content_parts if p)
        return content, tool_calls
    # No tool calls found
    return text, None
@@ -375,22 +436,66 @@ async def execute_tool(request: dict):
    This endpoint allows other swarm instances to execute tools
    on a centralized tool host.
    """
    import traceback
    tool_name = request.get("tool", "")
    tool_args = request.get("arguments", {})
-    print(f"🔧 TOOL SERVER: Executing {tool_name}({tool_args})")
+    logger.debug(f"\n{'='*60}")
    logger.debug(f"🔧 TOOL SERVER: Received request")
    logger.debug(f"  Tool: {tool_name}")
    logger.debug(f"  Arguments: {tool_args}")
    # Extract working_dir if provided (for file operations)
    working_dir = tool_args.get('working_dir') or tool_args.get('cwd')
    if working_dir:
        logger.debug(f"  Working directory: {working_dir}")
    else:
        logger.debug(f"  Working directory: (using server default)")
    logger.debug(f"{'='*60}")
    # Create a temporary local executor for this request
    executor = ToolExecutor(tool_host_url=None)
    result = await executor.execute(tool_name, tool_args)
-    print(f"🔧 TOOL SERVER: {tool_name} completed ({len(result)} chars)")
+    try:
-    
+        logger.debug(f"🔧 TOOL SERVER: Executing {tool_name}...")
-    return {"result": result}
+        # Merge working_dir into tool_args if needed (executor will handle it)
        # For bash, we need to rename 'working_dir' to 'cwd' if present
        if 'working_dir' in tool_args and tool_name == 'bash':
            # bash uses 'cwd' parameter
            args_to_execute = dict(tool_args)
            args_to_execute['cwd'] = tool_args['working_dir']
            # Remove working_dir to avoid confusion
            args_to_execute.pop('working_dir', None)
            result = await executor.execute(tool_name, args_to_execute)
        else:
            result = await executor.execute(tool_name, tool_args)
        logger.debug(f"🔧 TOOL SERVER: {tool_name} completed")
        logger.debug(f"  Result length: {len(result)} chars")
        # Show tail of result for debugging
        if result:
            tail_length = 500
            if len(result) > tail_length:
                logger.debug(f"  Result tail: ...{result[-tail_length:]}")
            else:
                logger.debug(f"  Full result: {result}")
        else:
            logger.debug(f"  Result: (empty)")
        logger.debug(f"{'='*60}\n")
        return {"result": result}
    except Exception as e:
        logger.debug(f"🔧 TOOL SERVER: Error executing {tool_name}")
        logger.debug(f"  Exception: {type(e).__name__}: {str(e)}")
        logger.debug(f"  Traceback: {traceback.format_exc()}")
        logger.debug(f"{'='*60}\n")
        return {"result": f"Error: {str(e)}"}
@router.post("/v1/chat/completions")
-async def chat_completions(request: ChatCompletionRequest):
+async def chat_completions(request: ChatCompletionRequest, fastapi_request: Request):
    """
    Generate chat completion.
@@ -402,22 +507,48 @@ async def chat_completions(request: ChatCompletionRequest):
    if not swarm_manager.get_status().is_running:
        raise HTTPException(status_code=503, detail="Swarm not running")
    # Get client working directory from header (if provided by client like opencode)
    client_working_dir = fastapi_request.headers.get("X-Client-Working-Dir")
    if client_working_dir:
        logger.debug(f"  📍 Client working directory from header: {client_working_dir}")
    else:
        client_working_dir = None
        logger.debug(f"  📍 No X-Client-Working-Dir header, using auto-detection")
    # Format messages into prompt (with tools if provided)
-    prompt = format_messages_with_tools(request.messages, request.tools)
+    # Sanitize tools to fix invalid schemas (e.g., remove extra 'description' from properties)
-    has_tools = request.tools is not None and len(request.tools) > 0
+    sanitized_tools = request.tools
-    print(f"\n{'='*60}")
+    if sanitized_tools:
-    print(f"REQUEST: has_tools={has_tools}, stream={request.stream}")
+        for tool in sanitized_tools:
-    print(f"{'='*60}")
+            if tool.type == "function" and tool.function.parameters:
                params = tool.function.parameters
                # Remove invalid 'description' from properties if present
                if 'properties' in params and 'description' in params.get('properties', {}):
                    invalid_props = ['description']
                    # Also remove 'description' from required if present
                    if 'required' in params:
                        params['required'] = [r for r in params.get('required', []) if r not in invalid_props]
                    # Remove invalid properties
                    params['properties'] = {k: v for k, v in params.get('properties', {}).items() if k not in invalid_props}
                    logger.debug(f"  🔧 Sanitized tool '{tool.function.name}': removed {invalid_props} from properties/required")
    prompt = format_messages_with_tools(request.messages, sanitized_tools)
    has_tools = sanitized_tools is not None and len(sanitized_tools) > 0
    logger.debug(f"\n{'='*60}")
    logger.debug(f"REQUEST: has_tools={has_tools}, stream={request.stream}")
    if has_tools:
        logger.debug(f"TOOLS: {sanitized_tools}")
    logger.debug(f"{'='*60}")
    # Generate ID
    completion_id = f"chatcmpl-{uuid.uuid4().hex[:12]}"
    created = int(time.time())
    if request.stream:
-        # For streaming with tools, we need to collect the full response first
+        # For streaming with tools, return tool_calls to client (opencode) for execution
-        # then check for tool calls and execute them
+        # This enables multi-turn conversations where client executes tools and sends results back
        if has_tools:
-            print("  🔧 Streaming with tools - collecting full response first...")
+            logger.debug("  🔧 Streaming with tools - returning tool_calls to client for execution...")
            # Collect full response
            full_response = ""
            async for chunk in swarm_manager.generate_stream(
@@ -427,42 +558,108 @@ async def chat_completions(request: ChatCompletionRequest):
            ):
                full_response += chunk
-            # Now check for tool calls
+            # Parse tool calls
            content, tool_calls_parsed = parse_tool_calls(full_response)
            if tool_calls_parsed:
-                print(f"  🔧 Found {len(tool_calls_parsed)} tool call(s) in streaming response")
+                logger.debug(f"  🔧 Found {len(tool_calls_parsed)} tool call(s) in streaming response")
-                executor = get_tool_executor()
+                logger.debug(f"  📤 Returning tool_calls to client for execution (finish_reason=tool_calls)")
                if executor:
                    print(f"  🔗 Tool executor: {executor.tool_host_url or 'local'}")
                else:
                    print(f"  ⚠️  No tool executor configured!")
-                # Execute tools
+                # Convert to ToolCall objects and return to client (opencode)
-                tool_results = []
+                from api.models import ToolCall
-                for i, tc in enumerate(tool_calls_parsed):
+                tool_calls = [
-                    tool_name = tc.get("function", {}).get("name", "")
+                    ToolCall(
-                    tool_args_str = tc.get("function", {}).get("arguments", "{}")
+                        id=tc.get("id", f"call_{i}"),
-                    try:
+                        type=tc.get("type", "function"),
-                        tool_args = json.loads(tool_args_str) if isinstance(tool_args_str, str) else tool_args_str
+                        function=tc.get("function", {})
-                    except:
+                    )
-                        tool_args = {}
+                    for i, tc in enumerate(tool_calls_parsed)
                ]
                # Return tool_calls to client with finish_reason=tool_calls
                # Client (opencode) will execute them and send results back
                async def tool_calls_stream_generator() -> AsyncIterator[str]:
                    """Generate SSE stream with tool_calls for client execution."""
                    # Send role chunk
                    first_chunk = ChatCompletionStreamResponse(
                        id=completion_id,
                        created=created,
                        model=request.model,
                        choices=[
                            ChatCompletionStreamChoice(
                                delta={"role": "assistant"}
                            )
                        ]
                    )
                    yield f"data: {first_chunk.model_dump_json()}\n\n"
-                    print(f"    [{i+1}/{len(tool_calls_parsed)}] Executing: {tool_name}({tool_args})")
+                    # Send content if any
-                    result = await execute_tool_server_side(tool_name, tool_args)
+                    if content:
-                    tool_results.append(f"Tool '{tool_name}' result: {result}")
+                        content_chunk = ChatCompletionStreamResponse(
-                    print(f"    ✓ Completed")
+                            id=completion_id,
                            created=created,
                            model=request.model,
                            choices=[
                                ChatCompletionStreamChoice(
                                    delta={"content": content}
                                )
                            ]
                        )
                        yield f"data: {content_chunk.model_dump_json()}\n\n"
                    # Send final chunk with tool_calls and finish_reason=tool_calls
                    # Format tool_calls as OpenAI streaming format
                    # OpenAI streaming format: tool_calls in delta with index, id, type, function
                    logger.debug(f"  🔧 Raw tool_calls_parsed: {tool_calls_parsed}")
                    tool_calls_delta = []
                    for i, tc in enumerate(tool_calls_parsed):
                        tool_calls_delta.append({
                            "index": i,
                            "id": tc["id"],
                            "type": "function",
                            "function": {
                                "name": tc["function"]["name"],
                                "arguments": tc["function"]["arguments"]
                            }
                        })
                    logger.debug(f"  🔧 Sending tool_calls in delta: {tool_calls_delta}")
                    # Build response in OpenAI streaming format
                    final_delta = {"tool_calls": tool_calls_delta}
                    final_chunk = {
                        "id": completion_id,
                        "object": "chat.completion.chunk",
                        "created": created,
                        "model": request.model,
                        "choices": [
                            {
                                "index": 0,
                                "delta": final_delta,
                                "finish_reason": "tool_calls"
                            }
                        ]
                    }
                    import json
                    chunk_json = json.dumps(final_chunk)
                    logger.debug(f"  📤 Final chunk JSON: {chunk_json[:800]}")
                    yield f"data: {chunk_json}\n\n"
                    yield "data: [DONE]\n\n"
-                # Return tool results
+                return StreamingResponse(
-                content = "\n\n".join(tool_results)
+                    tool_calls_stream_generator(),
-                print(f"  ✅ Tool execution complete")
+                    media_type="text/event-stream"
                )
-            # Return as streaming response with tool results (opencode expects SSE format)
+            # No tool calls found, return content as normal response
-            print(f"\n{'='*60}")
+            logger.debug(f"  ℹ️  No tool calls found, returning content as normal response")
-            print(f"RESPONSE (streaming+tools): content_preview={repr(content[:100])}")
+            logger.debug(f"\n{'='*60}")
-            print(f"{'='*60}\n")
+            logger.debug(f"RESPONSE (streaming+no-tools): content_preview={repr(content[:100])}")
            logger.debug(f"{'='*60}\n")
-            async def tool_stream_generator() -> AsyncIterator[str]:
+            async def content_stream_generator() -> AsyncIterator[str]:
-                """Generate SSE stream with tool results."""
+                """Generate SSE stream with content."""
                # Send role chunk
                first_chunk = ChatCompletionStreamResponse(
                    id=completion_id,
@@ -508,7 +705,7 @@ async def chat_completions(request: ChatCompletionRequest):
                yield "data: [DONE]\n\n"
            return StreamingResponse(
-                tool_stream_generator(),
+                content_stream_generator(),
                media_type="text/event-stream"
            )
        else:
@@ -573,7 +770,7 @@ async def chat_completions(request: ChatCompletionRequest):
            if federated_swarm is not None:
                peers = federated_swarm.discovery.get_peers()
                if peers:
-                    print(f"🌐 Using federation with {len(peers)} peer(s)...")
+                    logger.debug(f"🌐 Using federation with {len(peers)} peer(s)...")
                    result = await federated_swarm.generate_with_federation(
                        prompt=prompt,
                        max_tokens=request.max_tokens or 1024,
@@ -603,8 +800,10 @@ async def chat_completions(request: ChatCompletionRequest):
                                for i, tc in enumerate(tool_calls_parsed)
                            ]
-                    # Estimate prompt tokens (rough approximation)
+                    # Calculate accurate token counts using tiktoken
-                    prompt_tokens = len(prompt) // 4
+                    prompt_tokens = len(TOKEN_ENCODING.encode(prompt))
                    completion_tokens = len(TOKEN_ENCODING.encode(content))
                    total_tokens = prompt_tokens + completion_tokens
                    response_obj = ChatCompletionResponse(
                        id=completion_id,
@@ -623,14 +822,10 @@ async def chat_completions(request: ChatCompletionRequest):
                        ],
                        usage=UsageInfo(
                            prompt_tokens=prompt_tokens,
-                            completion_tokens=tokens_generated,
+                            completion_tokens=completion_tokens,
-                            total_tokens=prompt_tokens + tokens_generated
+                            total_tokens=total_tokens
                        )
                    )
                    print(f"DEBUG FED RESPONSE: finish_reason={finish_reason}, tool_calls_count={len(tool_calls)}, content_preview={repr(content[:100])}")
                    if tool_calls:
                        print(f"DEBUG FED TOOL_CALLS: {tool_calls}")
                    print(f"DEBUG FED FULL RESPONSE: {response_obj.model_dump_json()}")
                    return response_obj
            # Fallback to local generation
@@ -643,8 +838,8 @@ async def chat_completions(request: ChatCompletionRequest):
            response_text = result.selected_response.text
            tokens_generated = result.selected_response.tokens_generated
-            print(f"DEBUG: Generated response (tokens={tokens_generated})")
+            logger.debug(f"DEBUG: Generated response (tokens={tokens_generated})")
-            print(f"DEBUG: Response preview: {response_text[:200]}...")
+            logger.debug(f"DEBUG: Response preview: {response_text[:200]}...")
            # Parse tool calls if tools were provided
            content = response_text
@@ -652,16 +847,16 @@ async def chat_completions(request: ChatCompletionRequest):
            finish_reason = "stop"
            if has_tools:
-                print(f"DEBUG: Parsing tool calls from response...")
+                logger.debug(f"DEBUG: Parsing tool calls from response...")
                content, tool_calls_parsed = parse_tool_calls(response_text)
-                print(f"DEBUG: parse_tool_calls returned: content_len={len(content)}, parsed={tool_calls_parsed is not None}")
+                logger.debug(f"DEBUG: parse_tool_calls returned: content_len={len(content)}, parsed={tool_calls_parsed is not None}")
                if tool_calls_parsed:
-                    print(f"  🔧 Model requesting {len(tool_calls_parsed)} tool(s)...")
+                    logger.debug(f"  🔧 Model requesting {len(tool_calls_parsed)} tool(s)...")
                    executor = get_tool_executor()
                    if executor:
-                        print(f"  🔗 Tool executor: {executor.tool_host_url or 'local'}")
+                        logger.debug(f"  🔗 Tool executor: {executor.tool_host_url or 'local'}")
                    else:
-                        print(f"  ⚠️  No tool executor configured!")
+                        logger.debug(f"  ⚠️  No tool executor configured!")
                    # Execute tools via configured executor (local or remote)
                    tool_results = []
                    for i, tc in enumerate(tool_calls_parsed):
@@ -672,24 +867,26 @@ async def chat_completions(request: ChatCompletionRequest):
                        except:
                            tool_args = {}
-                        print(f"    [{i+1}/{len(tool_calls_parsed)}] Executing: {tool_name}({tool_args})")
+                        logger.debug(f"    [{i+1}/{len(tool_calls_parsed)}] Executing: {tool_name}({tool_args})")
                        # Execute tool via tool executor
-                        result = await execute_tool_server_side(tool_name, tool_args)
+                        result = await execute_tool_server_side(tool_name, tool_args, working_dir=client_working_dir)
                        tool_results.append(f"Tool '{tool_name}' result: {result}")
-                        print(f"    ✓ Completed: {result[:100]}..." if len(result) > 100 else f"    ✓ Result: {result}")
+                        logger.debug(f"    ✓ Completed: {result[:100]}..." if len(result) > 100 else f"    ✓ Result: {result}")
                    # Return ONLY tool results as content
                    content = "\n\n".join(tool_results)
                    finish_reason = "stop"
                    tool_calls = []  # Clear tool_calls since we executed them
-                    print(f"  ✅ All tools executed, returning results")
+                    logger.debug(f"  ✅ All tools executed, returning results")
                else:
-                    print(f"DEBUG: No tool calls parsed from response")
+                    logger.debug(f"DEBUG: No tool calls parsed from response")
            else:
-                print(f"DEBUG: No tools requested, returning normal response")
+                logger.debug(f"DEBUG: No tools requested, returning normal response")
-            # Estimate prompt tokens (rough approximation)
+            # Calculate accurate token counts using tiktoken
-            prompt_tokens = len(prompt) // 4
+            prompt_tokens = len(TOKEN_ENCODING.encode(prompt))
            completion_tokens = len(TOKEN_ENCODING.encode(content))
            total_tokens = prompt_tokens + completion_tokens
            response_obj = ChatCompletionResponse(
                id=completion_id,
@@ -708,15 +905,10 @@ async def chat_completions(request: ChatCompletionRequest):
                ],
                usage=UsageInfo(
                    prompt_tokens=prompt_tokens,
-                    completion_tokens=tokens_generated,
+                    completion_tokens=completion_tokens,
-                    total_tokens=prompt_tokens + tokens_generated
+                    total_tokens=total_tokens
                )
            )
            print(f"\n{'='*60}")
            print(f"RESPONSE: finish_reason={finish_reason}")
            print(f"         content_preview={repr(content[:100])}")
            print(f"         tool_calls_count={len(tool_calls)}")
            print(f"{'='*60}\n")
            return response_obj
        except Exception as e:
@@ -351,6 +351,19 @@ def get_model_hf_repo(model_id: str, variant: ModelVariant, quant: QuantizationC
 def get_model_hf_repo_mlx(model_id: str, variant: ModelVariant, quant: QuantizationConfig) -> str:
    """Get the HuggingFace repository path for MLX quantized models (Apple Silicon)."""
    # Map GGUF quantization names to MLX quantization names
    # MLX uses simple names: 3bit, 4bit, 8bit, not q4_k_m, q6_k, etc.
    gguf_to_mlx_quant = {
        "q3_k_m": "3bit",
        "q4_k_m": "4bit",
        "q4_k": "4bit",
        "q5_k_m": "5bit",
        "q5_k": "5bit",
        "q6_k": "6bit",
        "q8_0": "8bit",
        "q8": "8bit",
    }
    # MLX quantized models are in mlx-community org with -{quant}bit suffix
    # Map base model names to mlx-community quantized versions
    mlx_repo_map = {
@@ -365,8 +378,10 @@ def get_model_hf_repo_mlx(model_id: str, variant: ModelVariant, quant: Quantizat
    base_repo = mlx_repo_map.get(model_id, "")
    if base_repo and quant:
        # Convert GGUF quant name to MLX quant name
        mlx_quant = gguf_to_mlx_quant.get(quant.name, quant.name)
        # Append quantization suffix
-        return f"{base_repo}-{quant.name}"
+        return f"{base_repo}-{mlx_quant}"
    return base_repo
@@ -5,12 +5,15 @@ Remote execution allows a single "tool host" to manage the workspace
 while workers perform distributed generation.
 """
 import logging
 import os
 import subprocess
 import aiohttp
 from typing import Optional
 logger = logging.getLogger(__name__)
 class ToolExecutor:
    """Executes tools either locally or remotely via a tool host."""
@@ -52,7 +55,7 @@ class ToolExecutor:
    async def _execute_remote(self, tool_name: str, tool_args: dict) -> str:
        """Execute tool on remote tool host."""
        try:
-            print(f"  🔧 Remote tool call: {tool_name}({tool_args})")
+            logger.debug(f"  🔧 Remote tool call: {tool_name}({tool_args})")
            session = await self._get_session()
            url = f"{self.tool_host_url}/v1/tools/execute"
@@ -61,21 +64,50 @@ class ToolExecutor:
                "arguments": tool_args
            }
            # If working_dir is specified in tool_args, preserve it for remote execution
            # The remote tool server will extract and use it
            if 'working_dir' in tool_args:
                logger.debug(f"    📍 Remote working_dir: {tool_args['working_dir']}")
            async with session.post(url, json=payload) as resp:
                if resp.status == 200:
                    data = await resp.json()
                    result = data.get("result", "No result from tool host")
-                    print(f"  ✅ Tool result received ({len(result)} chars)")
+                    logger.debug(f"  ✅ Tool result received ({len(result)} chars)")
                    return result
                else:
                    error_text = await resp.text()
-                    print(f"  ❌ Tool host error: {resp.status}")
+                    logger.debug(f"  ❌ Tool host error: {resp.status}")
                    return f"Tool host error ({resp.status}): {error_text}"
        except Exception as e:
-            print(f"  ❌ Error contacting tool host: {e}")
+            logger.debug(f"  ❌ Error contacting tool host: {e}")
            return f"Error contacting tool host: {str(e)}"
-    
+     
    def _discover_project_root(self, start_dir: Optional[str] = None) -> str:
        """Discover the project root directory by looking for common markers."""
        import os
        if start_dir is None:
            start_dir = os.getcwd()
        current = os.path.abspath(start_dir)
        # Common project root markers
        markers = ['.git', 'package.json', 'pyproject.toml', 'Cargo.toml', 'go.mod', 
                   'requirements.txt', 'setup.py', 'pom.xml', 'build.gradle', '.project', '.venv']
        while True:
            try:
                if any(os.path.exists(os.path.join(current, marker)) for marker in markers):
                    return current
            except Exception:
                pass  # Permission errors, just skip
            parent = os.path.dirname(current)
            if parent == current:  # Reached filesystem root
                break
            current = parent
        return start_dir
    async def _execute_local(self, tool_name: str, tool_args: dict) -> str:
        """Execute tool locally."""
        try:
@@ -102,6 +134,8 @@ class ToolExecutor:
    async def _execute_read(self, args: dict) -> str:
        """Execute read tool."""
        file_path = args.get("filePath", "")
        working_dir = args.get("working_dir", os.getcwd())  # Optional: override cwd
        if not file_path:
            return "Error: filePath required"
@@ -110,17 +144,39 @@ class ToolExecutor:
        if file_path.startswith("..") or file_path.startswith("/.."):
            return "Error: Directory traversal not allowed"
-        if os.path.exists(file_path):
+        # Resolve path relative to working_dir if not absolute
-            with open(file_path, 'r') as f:
+        if not os.path.isabs(file_path):
-                content = f.read()
+            full_path = os.path.join(working_dir, file_path)
            return f"File contents ({len(content)} chars):\n{content[:3000]}"  # Limit output
        else:
-            return f"Error: File '{file_path}' not found"
+            full_path = file_path
        # Additional security: ensure resolved path is within working_dir
        try:
            real_working_dir = os.path.realpath(working_dir)
            real_full_path = os.path.realpath(full_path)
            if not real_full_path.startswith(real_working_dir):
                return f"Error: Access denied - path outside working directory"
        except Exception:
            pass  # If realpath fails, continue anyway
        logger.debug(f"    📁 Reading: {file_path}")
        logger.debug(f"    📍 Working dir: {working_dir}")
        logger.debug(f"    🔍 Full path: {full_path}")
        if os.path.exists(full_path):
            with open(full_path, 'r') as f:
                content = f.read()
            result = f"File contents ({len(content)} chars):\n{content[:3000]}"  # Limit output
            logger.debug(f"    ✓ Read {len(content)} chars")
            return result
        else:
            return f"Error: File '{full_path}' not found"
    async def _execute_write(self, args: dict) -> str:
        """Execute write tool."""
        file_path = args.get("filePath", "")
        content = args.get("content", "")
        working_dir = args.get("working_dir", os.getcwd())  # Optional: override cwd
        if not file_path:
            return "Error: filePath required"
@@ -130,19 +186,42 @@ class ToolExecutor:
        if file_path.startswith("..") or file_path.startswith("/.."):
            return "Error: Directory traversal not allowed"
        # Resolve path relative to working_dir if not absolute
        if not os.path.isabs(file_path):
            full_path = os.path.join(working_dir, file_path)
        else:
            full_path = file_path
        # Additional security: ensure resolved path is within working_dir
        try:
            real_working_dir = os.path.realpath(working_dir)
            real_full_path = os.path.realpath(full_path)
            if not real_full_path.startswith(real_working_dir):
                return f"Error: Access denied - path outside working directory"
        except Exception:
            pass  # If realpath fails, continue anyway
        logger.debug(f"    📁 Writing: {file_path}")
        logger.debug(f"    📍 Working dir: {working_dir}")
        logger.debug(f"    🔍 Full path: {full_path}")
        # Create parent directories if needed
-        parent_dir = os.path.dirname(file_path)
+        parent_dir = os.path.dirname(full_path)
        if parent_dir and not os.path.exists(parent_dir):
            os.makedirs(parent_dir, exist_ok=True)
            logger.debug(f"    📁 Created directory: {parent_dir}")
-        with open(file_path, 'w') as f:
+        with open(full_path, 'w') as f:
            f.write(content)
-        return f"Successfully wrote {len(content)} characters to {file_path}"
+        result = f"Successfully wrote {len(content)} characters to {full_path}"
        logger.debug(f"    ✓ Write complete")
        return result
    async def _execute_bash(self, args: dict) -> str:
        """Execute bash tool."""
        command = args.get("command", "")
        cwd = args.get("cwd", os.getcwd())  # Optional: override cwd
        if not command:
            return "Error: command required"
@@ -153,17 +232,102 @@ class ToolExecutor:
            if d in command:
                return f"Error: Dangerous command blocked: {d}"
-        result = subprocess.run(
+        logger.debug(f"    🖥️  BASH: {command[:80]}{'...' if len(command) > 80 else ''}")
-            command, 
+        logger.debug(f"    📍 Working directory: {cwd}")
            shell=True, 
            capture_output=True, 
            text=True, 
            timeout=30,
            cwd=os.getcwd()
        )
-        output = result.stdout if result.returncode == 0 else f"Exit code {result.returncode}: {result.stderr}"
+        # Determine timeout based on command type - more comprehensive detection
-        return output[:3000]  # Limit output
+        timeout = 30
        command_lower = command.lower()
        # Package managers and project setup tools
        if any(pattern in command_lower for pattern in [
            'npm', 'npx', 'yarn', 'pnpm',
            'pip', 'pip install', 'poetry', 'conda',
            'cargo', 'cargo build', 'cargo install',
            'go get', 'go mod',
            'composer', 'bundle',
            ' brew ', 'apt-get', 'yum', 'pacman',
            'choco', 'scoop',
            'gem ', 'npm install', 'yarn add', 'pnpm add',
            'create-react-app', 'vue create', 'ng new', 'vite', 'next',
            'django-admin', 'rails new', 'flutter create',
            'dotnet new', 'mvn', 'gradle',
            'make ', 'cmake', 'meson',
            'python setup.py', 'setup.py install',
            'pip install -r', 'requirements.txt',
            'package.json', 'Gemfile', 'Cargo.toml', 'go.mod'
        ]):
            timeout = 300  # 5 minutes for package managers and project creation
            logger.debug(f"    ⏱️  Using extended timeout: {timeout}s (package manager/project creation detected)")
        elif any(pattern in command_lower for pattern in [
            'git clone', 'git pull', 'git fetch',
            'wget ', 'curl ',
            'tar ', 'zip ', 'unzip ',
            'docker ', 'podman',
            'kubectl', 'helm',
            'terraform', 'ansible',
            'rsync', 'scp'
        ]):
            timeout = 120  # 2 minutes for network/file operations
            logger.debug(f"    ⏱️  Using extended timeout: {timeout}s (network/file operation detected)")
        else:
            logger.debug(f"    ⏱️  Using default timeout: {timeout}s")
        logger.debug(f"    🔍 Command type: {command_lower.split()[0] if command.split() else 'unknown'}")
        try:
            result = subprocess.run(
                command,
                shell=True,
                capture_output=True,
                text=True,
                timeout=timeout,
                cwd=cwd,
                stdin=subprocess.DEVNULL  # Prevent interactive prompts from hanging
            )
            output = result.stdout if result.returncode == 0 else f"Exit code {result.returncode}: {result.stderr}"
            # Show summary with detailed logging
            if result.returncode == 0:
                logger.debug(f"    ✓ Exit code 0 ({len(output)} chars output, {len(result.stderr)} chars stderr)")
                # Show last 300 chars of output if it exists
                if output:
                    last_part = output[-300:]
                    logger.debug(f"    📄 Output tail: ...{last_part}")
                if result.stderr:
                    stderr_last = result.stderr[-200:]
                    logger.debug(f"    ⚠️  stderr (may be normal): ...{stderr_last}")
            else:
                logger.debug(f"    ✗ Exit code {result.returncode}")
                if result.stderr:
                    logger.debug(f"    ⚠️  stderr: {result.stderr[:500]}")
                if result.stdout:
                    logger.debug(f"    📄 stdout: {result.stdout[:500]}")
            return output[:3000]  # Limit output
        except subprocess.TimeoutExpired as e:
            # Try to capture partial output on timeout
            partial_output = ""
            if e.stdout:
                partial_output = e.stdout.decode('utf-8', errors='replace')
            error_msg = f"Command timed out after {timeout}s"
            if partial_output:
                # Show the last 500 chars of what we got before timeout
                last_output = partial_output[-500:]
                error_msg += f"\n\nPartial output (last 500 chars):\n...{last_output}"
            else:
                error_msg += "\n\n(No output captured before timeout)"
            logger.debug(f"    ⏰ TIMEOUT after {timeout}s")
            logger.debug(f"    🔍 Command that timed out: {command[:200]}")
            if partial_output:
                logger.debug(f"    📄 Partial output (first 500 chars): {partial_output[:500]}")
                logger.debug(f"    📄 Partial output (last 500 chars): ...{partial_output[-500:]}")
            return f"Error executing bash: {error_msg}"
    async def close(self):
        """Close HTTP session."""
@@ -0,0 +1,54 @@
 """Logging configuration for Local Swarm.
 Provides centralized logging setup with configurable levels.
 """
 import logging
 import sys
 def setup_logging(level=logging.DEBUG):
    """Set up logging configuration.
    Args:
        level: Logging level (default: DEBUG for development)
    """
    # Create formatter
    formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )
    # Create console handler
    console_handler = logging.StreamHandler(sys.stdout)
    console_handler.setLevel(level)
    console_handler.setFormatter(formatter)
    # Get root logger
    root_logger = logging.getLogger()
    root_logger.setLevel(level)
    # Remove existing handlers to avoid duplicates
    root_logger.handlers.clear()
    # Add console handler
    root_logger.addHandler(console_handler)
    # Set specific module loggers
    logging.getLogger('swarm').setLevel(level)
    logging.getLogger('api').setLevel(level)
    logging.getLogger('tools').setLevel(level)
    return root_logger
 def get_logger(name):
    """Get a logger with the specified name.
    Args:
        name: Logger name (usually __name__)
    Returns:
        logging.Logger: Configured logger
    """
    return logging.getLogger(name)
@@ -0,0 +1,199 @@
 """Unit tests for tool parsing functionality."""
 import sys
 import os
 sys.path.insert(0, os.path.join(os.path.dirname(__file__), '..', 'src'))
 from api.routes import parse_tool_calls
 def test_parse_simple_tool():
    """Test parsing a single tool call."""
    text = 'TOOL: read\nARGUMENTS: {"filePath": "test.txt"}'
    content, tools = parse_tool_calls(text)
    assert tools is not None
    assert len(tools) == 1
    assert tools[0]["function"]["name"] == "read"
    assert tools[0]["function"]["arguments"] == '{"filePath": "test.txt"}'
 def test_parse_no_tool():
    """Test parsing text without tool calls."""
    text = "Just a regular response"
    content, tools = parse_tool_calls(text)
    assert tools is None
    assert content == text
 def test_parse_multiple_tools():
    """Test parsing multiple tool calls."""
    text = '''TOOL: read
 ARGUMENTS: {"filePath": "file1.txt"}
 TOOL: write
 ARGUMENTS: {"filePath": "file2.txt", "content": "hello"}'''
    content, tools = parse_tool_calls(text)
    assert tools is not None
    assert len(tools) == 2
    assert tools[0]["function"]["name"] == "read"
    assert tools[1]["function"]["name"] == "write"
 def test_parse_tool_with_content_before():
    """Test parsing when there's content before the tool call."""
    text = '''I'll read that file for you.
 TOOL: read
 ARGUMENTS: {"filePath": "config.yaml"}'''
    content, tools = parse_tool_calls(text)
    assert tools is not None
    assert len(tools) == 1
    assert tools[0]["function"]["name"] == "read"
    assert "I'll read that file for you." in content
 def test_parse_bash_tool():
    """Test parsing bash tool call."""
    text = 'TOOL: bash\nARGUMENTS: {"command": "ls -la"}'
    content, tools = parse_tool_calls(text)
    assert tools is not None
    assert len(tools) == 1
    assert tools[0]["function"]["name"] == "bash"
 def test_parse_case_insensitive():
    """Test that TOOL:/ARGUMENTS: is case insensitive."""
    text = 'tool: read\narguments: {"filePath": "test.txt"}'
    content, tools = parse_tool_calls(text)
    assert tools is not None
    assert len(tools) == 1
    assert tools[0]["function"]["name"] == "read"
 def test_parse_invalid_json():
    """Test that invalid JSON is skipped gracefully."""
    text = '''TOOL: read
 ARGUMENTS: {invalid json}
 TOOL: write
 ARGUMENTS: {"filePath": "test.txt"}'''
    content, tools = parse_tool_calls(text)
    # Should skip the invalid one and parse the valid one
    assert tools is not None
    assert len(tools) == 1
    assert tools[0]["function"]["name"] == "write"
 def test_parse_empty_text():
    """Test parsing empty text."""
    text = ""
    content, tools = parse_tool_calls(text)
    assert tools is None
    assert content == ""
 def test_parse_whitespace_only():
    """Test parsing whitespace-only text."""
    text = "   \n\t  "
    content, tools = parse_tool_calls(text)
    assert tools is None
 def test_parse_markdown_code_block():
    """Test parsing markdown code blocks as fallback (e.g., ```bash command```)."""
    text = '''I'll help you create a project.
 ```bash
 mkdir myapp
 cd myapp
 ```
 Now let's create a file.'''
    content, tools = parse_tool_calls(text)
    assert tools is not None
    assert len(tools) == 1
    assert tools[0]["function"]["name"] == "bash"
    assert "mkdir myapp" in tools[0]["function"]["arguments"]
    assert "cd myapp" in tools[0]["function"]["arguments"]
 def test_parse_markdown_inline():
    """Test parsing inline bash commands in markdown."""
    text = '''Here's what to do:
 ```bash
 ls -la
 ```'''
    content, tools = parse_tool_calls(text)
    assert tools is not None
    assert len(tools) == 1
    assert tools[0]["function"]["name"] == "bash"
    assert "ls -la" in tools[0]["function"]["arguments"]
 def test_tool_instructions_content():
    """Test that tool instructions contain required sections (REVIEW-2026-02-24 Blocker #4)."""
    from api.routes import _load_tool_instructions
    # Load instructions from config file
    instructions = _load_tool_instructions()
    # Verify key instruction components are present (minimal instructions)
    assert "use tools" in instructions.lower(), "Instructions must mention tool usage"
    assert "Format" in instructions or "format" in instructions.lower(), "Instructions must mention format"
    assert "no explanations" in instructions.lower(), "Instructions must forbid explanations"
    assert "no markdown" in instructions.lower(), "Instructions must forbid markdown"
 def test_tool_instructions_token_count():
    """Test that tool instructions are within token budget (REVIEW-2026-02-24 Blocker #1)."""
    from api.routes import _load_tool_instructions
    # Load instructions from config file
    instructions = _load_tool_instructions()
    # Token budget: 2000 hard limit
    # Rough estimate: 4 chars = 1 token
    char_count = len(instructions)
    estimated_tokens = char_count // 4
    assert estimated_tokens <= 2000, f"Instructions estimated at {estimated_tokens} tokens, must be under 2000"
 if __name__ == "__main__":
    # Run all tests
    test_functions = [
        test_parse_simple_tool,
        test_parse_no_tool,
        test_parse_multiple_tools,
        test_parse_tool_with_content_before,
        test_parse_bash_tool,
        test_parse_case_insensitive,
        test_parse_invalid_json,
        test_parse_empty_text,
        test_parse_whitespace_only,
        test_parse_markdown_code_block,
        test_parse_markdown_inline,
        test_tool_instructions_content,
        test_tool_instructions_token_count,
    ]
    passed = 0
    failed = 0
    for test_func in test_functions:
        try:
            test_func()
            print(f"✓ {test_func.__name__}")
            passed += 1
        except AssertionError as e:
            print(f"✗ {test_func.__name__}: {e}")
            failed += 1
        except Exception as e:
            print(f"✗ {test_func.__name__}: Exception - {e}")
            failed += 1
    print(f"\n{passed} passed, {failed} failed")
    if failed > 0:
        sys.exit(1)