diff --git a/README.md b/README.md
index 3514759..da6dfa3 100644
--- a/README.md
+++ b/README.md
@@ -91,7 +91,9 @@ python main.py --auto --federation
python main.py --auto --federation
```
-Machines auto-discover each other and vote together on every request.
+Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses **objective quality scoring** to pick the best answer rather than trusting self-reported confidence. This prevents smaller models from overruling better ones.
+
+**Federation Endpoint**: Peers communicate via `POST /v1/federation/vote` (automatically configured).
## How Consensus Works
@@ -147,7 +149,7 @@ All support GGUF quantization (Q4_K_M recommended).
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion with consensus
- `GET /health` - Health check
-- `GET /v1/federation/peers` - List discovered peers (when federation enabled)
+- `POST /v1/federation/vote` - Federation voting (used internally between peers)
## Troubleshooting
@@ -282,21 +284,29 @@ Major refactoring completed to improve modularity:
See `docs/ARCHITECTURE.md` for detailed architecture documentation.
-## TODO / Roadmap
+## Recent Improvements
-### Planned Features
+### ✅ Universal Tool Support (2025-02-25)
+- Tool instructions automatically injected for **all** clients (Continue, hollama, curl, etc.)
+- No client-side configuration needed - just use the API
+- Enhanced file operation guidance: model uses ls/grep to verify files exist before reading
+- Working directory auto-extraction from prompts (`in /path/to/dir` patterns)
+- Proper OpenAI tool format with unique IDs and tool_call_id linking
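
The working-directory auto-extraction can be sketched like this (a simplified illustration; the pattern actually used in `src/api/routes.py` may accept more phrasings):

```python
import re
from typing import Optional

def extract_working_dir(prompt: str) -> Optional[str]:
    # Match phrases like "in /path/to/dir" -- a sketch of the production
    # pattern, which may be broader.
    match = re.search(r"\bin\s+(/[\w./-]+)", prompt)
    return match.group(1) if match else None

print(extract_working_dir("list the tests in /home/user/project"))
```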
-- **Plan Mode**: Add a "plan mode" that disables tool execution for planning-only conversations. This would allow the model to discuss file changes without actually modifying them until explicitly confirmed.
- - Usage: `--plan-mode` flag or API parameter
- - When enabled: Model can see what tools would do but doesn't execute them
- - Use case: Review changes before applying them
+### ✅ OpenCode-Compatible Streaming (2025-02-25)
+- Proper `reasoning_content` field for "Thinking..." collapsible blocks
+- Multi-chunk `tool_calls` streaming matching Vercel AI SDK format
+- Final answer delivered in `content` field after tool execution
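
Put together, a tool-using turn streams deltas roughly like the following (field names follow the OpenAI streaming schema; the exact chunk boundaries shown here are an assumption, not the authoritative wire format):

```python
import json

# Illustrative delta sequence for one streamed response.
chunks = [
    # 1. Reasoning surfaced in the collapsible "Thinking..." block.
    {"choices": [{"delta": {"reasoning_content": "Checking the file first..."}}]},
    # 2. A tool call streamed across multiple chunks: header first, arguments after.
    {"choices": [{"delta": {"tool_calls": [{"index": 0, "id": "call_1", "type": "function",
        "function": {"name": "read", "arguments": ""}}]}}]},
    {"choices": [{"delta": {"tool_calls": [{"index": 0,
        "function": {"arguments": "{\"filePath\": \"main.py\"}"}}]}}]},
    # 3. The final answer arrives in the regular content field.
    {"choices": [{"delta": {"content": "The file defines the entrypoint."}}]},
]

for chunk in chunks:
    print("data: " + json.dumps(chunk))
```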
-### Current Status
+### ✅ Federation Quality Voting (2025-02-25)
+- Head node now **objectively judges** all peer responses using quality metrics
+- No more reliance on self-reported confidence (which was biased toward local models)
+- All responses scored on length, structure, completeness
+- Fair competition: 14B models properly beat 3B on quality tasks
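
A minimal sketch of what the objective scoring could look like (a hypothetical heuristic; the real metric names and weights live in the federation module and may differ):

```python
from typing import Dict

def quality_score(answer: str) -> float:
    """Score a response on length, structure, and completeness (illustrative weights)."""
    length_score = min(len(answer) / 2000, 1.0)          # reward substance, capped
    structure_score = min(answer.count("\n") / 20, 1.0)  # paragraphs, lists, code blocks
    complete = 0.0 if answer.rstrip().endswith(("...", ":")) else 1.0
    return 0.4 * length_score + 0.3 * structure_score + 0.3 * complete

def pick_best(responses: Dict[str, str]) -> str:
    """Return the peer whose answer scores highest."""
    return max(responses, key=lambda peer: quality_score(responses[peer]))
```

With a scheme like this, a long, well-structured answer from a 14B peer outranks a terse reply from a 3B peer regardless of what confidence either model reports.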
-- ✅ Tool instructions now injected by default for all clients
-- ✅ Improved file operation safety (verify with ls/grep before reading)
-- ✅ Working directory support (extracted from client context)
-- 🔄 Plan mode - coming soon
+### 🚧 Planned Features
+- **Plan Mode**: Disable tool execution for planning-only conversations (`--plan-mode`)
+- **Tool Consensus**: Verify tool calls across multiple workers before execution (for critical operations)
## Contributing
diff --git a/docs/design/2024-02-24-complete-react-example.md b/docs/design/2024-02-24-complete-react-example.md
deleted file mode 100644
index b004957..0000000
--- a/docs/design/2024-02-24-complete-react-example.md
+++ /dev/null
@@ -1,92 +0,0 @@
-# Design Decision: Complete React Example with Actual Code
-
-**Date:** 2024-02-24
-**Scope:** src/api/routes.py tool_instructions
-
-## Problem
-
-Model is still not following instructions:
-1. Tries `npm install` before creating package.json
-2. Still tries `npx create-react-app` despite being told not to
-3. Instructions have placeholders like "..." and "etc." which models don't understand
-
-## Root Cause
-
-The current instructions say:
-```
-TOOL: write
-ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"dependencies\": {\"react\": \"^18.0.0\", \"react-dom\": \"^18.0.0\"}}"}
-
-[Continue with src/index.js, src/App.js, public/index.html, etc.]
-```
-
-**Problem:** "etc." and "..." are meaningless to LLMs. They need concrete examples.
-
-## Solution
-
-Provide a **complete, working, minimal React example** with actual file contents:
-
-1. Exact sequence: mkdir → write package.json → write src/App.js → write src/index.js → write public/index.html → npm install
-2. Actual file content, not placeholders
-3. Minimal viable React app (not full create-react-app structure)
-
-## Implementation
-
-Replace vague example with complete working code:
-
-```
-**COMPLETE REACT HELLO WORLD EXAMPLE:**
-
-User: "Create a React Hello World app"
-
-Step 1 - Create directory:
-TOOL: bash
-ARGUMENTS: {"command": "mkdir myapp"}
-
-Step 2 - Create package.json (MUST do this BEFORE npm install):
-TOOL: write
-ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"private\": true, \"dependencies\": {\"react\": \"^18.2.0\", \"react-dom\": \"^18.2.0\"}, \"scripts\": {\"start\": \"react-scripts start\", \"build\": \"react-scripts build\"}, \"devDependencies\": {\"react-scripts\": \"5.0.1\"}}"}
-
-Step 3 - Create src directory:
-TOOL: bash
-ARGUMENTS: {"command": "mkdir myapp/src"}
-
-Step 4 - Create App.js:
-TOOL: write
-ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "import React from 'react';\n\nfunction App() {\n  return (\n    <div>\n      <h1>Hello World</h1>\n      <p>Welcome to my React app!</p>\n    </div>\n  );\n}\n\nexport default App;"}
-
-Step 5 - Create index.js:
-TOOL: write
-ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "import React from 'react';\nimport ReactDOM from 'react-dom/client';\nimport App from './App';\n\nconst root = ReactDOM.createRoot(document.getElementById('root'));\nroot.render(<App />);"}
-
-Step 6 - Create public directory and index.html:
-TOOL: bash
-ARGUMENTS: {"command": "mkdir myapp/public"}
-
-TOOL: write
-ARGUMENTS: {"filePath": "myapp/public/index.html", "content": "<!DOCTYPE html>\n<html>\n<head>\n  <meta charset=\"utf-8\">\n  <title>React App</title>\n</head>\n<body>\n  <div id=\"root\"></div>\n</body>\n</html>"}
-
-Step 7 - NOW install dependencies (AFTER package.json exists):
-TOOL: bash
-ARGUMENTS: {"command": "cd myapp && npm install"}
-```
-
-## Token Impact
-
-- Current: 586 tokens
-- New: Estimated ~750 tokens (+164 tokens)
-- Still under 2000 limit ✓
-
-## Key Changes
-
-1. **Explicit sequencing:** "Step 1", "Step 2", etc.
-2. **Actual code:** No "..." or "etc." - real working content
-3. **Critical note:** "MUST do this BEFORE npm install"
-4. **Minimal structure:** Just what's needed for Hello World
-
-## Success Criteria
-
-- [ ] Model creates package.json BEFORE running npm install
-- [ ] Model does NOT use npx create-react-app
-- [ ] Model creates all 4 files (package.json, App.js, index.js, index.html)
-- [ ] Model runs npm install last (after files exist)
diff --git a/docs/design/2024-02-24-fix-subprocess-hang.md b/docs/design/2024-02-24-fix-subprocess-hang.md
deleted file mode 100644
index 0af3fa3..0000000
--- a/docs/design/2024-02-24-fix-subprocess-hang.md
+++ /dev/null
@@ -1,84 +0,0 @@
-# Design Decision: Fix Subprocess Hang on Interactive Commands
-
-**Date:** 2024-02-24
-**Scope:** src/tools/executor.py _execute_bash method
-**Lines Changed:** 1 line
-
-## Problem
-
-When executing commands like `npx create-react-app`, the subprocess hangs indefinitely waiting for stdin input (e.g., "Ok to proceed? (y)"). This causes:
-1. 300s timeout to be reached
-2. opencode to hang waiting for response
-3. Poor user experience
-
-## Root Cause
-
-`subprocess.run()` by default inherits stdin from parent process. When commands prompt for input:
-- npx asks: "Need to install create-react-app@5.1.0 Ok to proceed? (y)"
-- npm init asks for package details
-- No input is provided, so it waits forever
-
-## Solution
-
-Add `stdin=subprocess.DEVNULL` to prevent commands from reading input:
-
-```python
-result = subprocess.run(
- command,
- shell=True,
- capture_output=True,
- text=True,
- timeout=timeout,
- cwd=cwd,
- stdin=subprocess.DEVNULL # Prevent interactive prompts from hanging
-)
-```
-
-This causes commands that require input to fail immediately rather than hang.
-
-## Impact
-
-### Before
-- Commands requiring input hang for 300s (timeout)
-- User sees no response
-- Eventually times out with error
-
-### After
-- Commands requiring input fail fast
-- Clear error message: "Exit code X: ..."
-- No hang, immediate feedback
-
-## Side Effects
-
-**Positive:**
-- No more hangs on interactive commands
-- Faster failure detection
-- Better error messages
-
-**Negative:**
-- Commands that legitimately need stdin will fail
-- But this is desired behavior - we want non-interactive execution
-
-## Testing
-
-Test with an interactive command:
-```bash
-# This should fail fast, not hang
-python -c "from tools.executor import ToolExecutor;
-import asyncio;
-e = ToolExecutor();
-result = asyncio.run(e.execute('bash', {'command': 'read -p \"Enter something: \" var'}));
-print(result)"
-```
-
-Expected: Quick failure, not a 300s hang
-
-## Related Changes
-
-This complements the tool instructions fix:
-- Instructions now say "DO NOT use npx create-react-app"
-- This fix ensures if model ignores instructions, it fails fast instead of hanging
-
-## Conclusion
-
-One-line fix prevents interactive command hangs, improving reliability and user experience.
diff --git a/docs/design/2024-02-24-fix-tool-execution-tokens.md b/docs/design/2024-02-24-fix-tool-execution-tokens.md
deleted file mode 100644
index 05c877b..0000000
--- a/docs/design/2024-02-24-fix-tool-execution-tokens.md
+++ /dev/null
@@ -1,178 +0,0 @@
-# Design Decision: Fix Tool Execution and Token Reporting
-
-**Date:** 2024-02-24
-**Scope:** src/api/routes.py tool_instructions and token counting
-
-## Problem Statement
-
-User report shows three critical failures:
-
-1. **Instruction vs Execution:** Model says "You should run mkdir..." instead of TOOL: format
-2. **Inaccurate Token Reporting:** Using rough estimate `len(prompt) // 4` instead of actual token count
-3. **Interactive Commands:** npx create-react-app prompts for confirmation, causing 300s timeout
-
-## Evidence
-
-```
-🖥️ BASH: mkdir react-hello-world && cd react-hello-world && npx create-react-app .
-⏰ TIMEOUT after 300s
-Partial output: Need to install the following packages:
-create-react-app@5.1.0
-Ok to proceed? (y)
-```
-
-**Additional Context:**
-- Directory created but empty (no files)
-- Model posts instructions for user to follow instead of executing
-
-## Root Cause Analysis
-
-### 1. Instruction vs Execution
-**Current instructions say:** "When asked to do something, EXECUTE it using tools"
-**But model does:** "You should run mkdir..."
-**Why:** Instructions aren't strong enough - need explicit anti-patterns
-
-### 2. Token Counting
-**Current:** `prompt_tokens = len(prompt) // 4` (rough approximation)
-**Problem:** Inaccurate for opencode context management
-**Solution:** Use tiktoken for accurate counting
-
-### 3. Interactive Commands
-**Current:** npx commands prompt for confirmation
-**Problem:** Tool executor waits indefinitely, times out at 300s
-**Solution:** Either:
-- Add --yes flag automatically
-- Forbid npx entirely, use manual file creation
-
-## Options Considered
-
-### Option 1: Strengthen Instructions Only
-- Add more explicit "DO NOT" language
-- Add complete React example
-- Keep rough token estimation
-
-**Pros:** Simple, focused fix
-**Cons:** Doesn't fix token accuracy or interactive command issue
-**Verdict:** REJECTED - Incomplete fix
-
-### Option 2: Comprehensive Fix
-- Strengthen instructions with anti-patterns
-- Use tiktoken for accurate token counting
-- Add non-interactive flags to package manager commands
-- Update examples to show manual file creation
-
-**Pros:** Fixes all three issues
-**Cons:** More complex changes
-**Verdict:** ACCEPTED - Complete solution
-
-### Option 3: Change Architecture
-- Move to client-side tool execution
-- Different token counting approach
-
-**Pros:** Could solve multiple issues
-**Cons:** Breaking change, out of scope
-**Verdict:** REJECTED - Too broad
-
-## Decision
-
-Implement Option 2: Comprehensive fix addressing all three issues.
-
-### Changes
-
-#### 1. Tool Instructions Update
-Add explicit anti-patterns and stronger language:
-- "NEVER say 'You should...' - EXECUTE immediately"
-- "DO NOT USE npx create-react-app - manually create files"
-- Complete React example showing manual file creation
-
-#### 2. Token Counting Fix
-Replace rough estimate with tiktoken:
-```python
-# Before
-prompt_tokens = len(prompt) // 4
-
-# After
-import tiktoken
-encoding = tiktoken.get_encoding('cl100k_base')
-prompt_tokens = len(encoding.encode(prompt))
-completion_tokens = len(encoding.encode(content))
-```
-
-#### 3. Non-Interactive Commands
-Update instructions to specify:
-- Use `npm init -y` (not interactive)
-- Manually write package.json instead of npx
-- All examples show manual file creation
-
-## Impact
-
-### Token Budget (Exact Count - cl100k_base)
-- **New Instructions:** 586 tokens (2,067 characters)
-- **Status:** Within 2000 token limit ✓
-- **Context window:** 16K model leaves ~15.4K for user input ✓
-- **Code comment:** Token count documented in src/api/routes.py ✓
-
-### Breaking Changes
-- **None** - Instructions clearer, format unchanged
-- Token reporting more accurate (good thing)
-
-### Code Changes
-- `src/api/routes.py`:
- - Update tool_instructions (~+15 lines)
- - Add tiktoken import
- - Replace token estimation logic (~5 lines)
-
-## Testing Strategy
-
-1. **Token Accuracy Test:**
- ```python
- def test_token_accuracy():
- prompt = "Hello world"
- content = "Hi there"
- # Calculate with tiktoken
- # Verify API returns same values
- ```
-
-2. **Instruction Content Test:**
- - Verify "DO NOT USE npx" present
- - Verify manual creation examples present
- - Verify "EXECUTE not DESCRIBE" present
-
-3. **Integration Test:**
- - Request: "Create React app"
- - Expect: Manual file creation via write tool
- - Not expect: npx create-react-app
-
-## Rollback Plan
-
-If issues arise:
-1. Revert to previous instructions
-2. Keep tiktoken for token counting (beneficial)
-3. Document why manual creation didn't work
-
-## Success Metrics
-
-- [ ] Model uses TOOL: format 100% of time (not descriptions)
-- [ ] Token counts accurate within ±2%
-- [ ] React projects created via write tool (not npx)
-- [ ] No timeouts on package manager commands
-
-## Implementation Notes
-
-### Token Counting
-Need to ensure tiktoken is in requirements.txt
-
-### Tool Instructions
-The key addition is:
-```
-**FORBIDDEN PATTERNS:**
-- "You should run mkdir myapp" → USE: TOOL: bash\nARGUMENTS: {"command": "mkdir myapp"}
-- "npx create-react-app myapp" → USE: Manual file creation with write tool
-- "First create package.json, then..." → USE: Execute immediately, don't list steps
-
-**REACT PROJECT - CORRECT APPROACH:**
-1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
-2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\"...}"}
-3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "..."}
-4. Continue until all files created
-```
diff --git a/docs/design/2024-02-24-improved-tool-instructions.md b/docs/design/2024-02-24-improved-tool-instructions.md
deleted file mode 100644
index 71b1016..0000000
--- a/docs/design/2024-02-24-improved-tool-instructions.md
+++ /dev/null
@@ -1,172 +0,0 @@
-# Design Decision: Improved Tool Instructions
-
-**Date:** 2024-02-24
-**Scope:** src/api/routes.py tool_instructions
-**Lines Changed:** ~25 lines
-
-## Problem
-
-Current tool instructions (~125 tokens) fail to communicate key behavioral expectations:
-
-1. **Passive vs Active:** Model describes what to do instead of doing it
-2. **Refusal:** Model claims "I am only an AI assistant" instead of executing
-3. **Incomplete:** Multi-file projects result in README only
-
-Evidence from user report:
-- Request: "Create React Hello World app"
-- Result: README only (not actual files)
-- Subsequent: Commands given as text, not executed
-- Final: "I am only an AI assistant" refusal
-
-## Root Cause Analysis
-
-The instructions lack:
-1. **Authority statement** - "You CAN and SHOULD use tools"
-2. **Execution mandate** - "Execute commands, don't just describe them"
-3. **Workflow clarity** - Clear step-by-step expectations
-4. **Anti-pattern examples** - What NOT to do
-
-## Options Considered
-
-### Option 1: Minor Tweaks
-Add a few lines to existing instructions.
-- **Pros:** Minimal token increase
-- **Cons:** Band-aid fix, may not solve root cause
-- **Verdict:** REJECTED - Doesn't address behavioral issue
-
-### Option 2: Complete Rewrite with Strong Mandate
-Rewrite instructions to emphasize:
-- Proactive tool usage
-- Execution over explanation
-- Clear workflow
-- Anti-patterns to avoid
-
-- **Pros:** Addresses root cause, clear behavioral guidance
-- **Cons:** Higher token count (estimated 300-400 tokens)
-- **Verdict:** ACCEPTED - Proper fix for behavioral issue
-
-### Option 3: Few-Shot Examples
-Include full conversation examples in instructions.
-- **Pros:** Shows exactly what to do
-- **Cons:** Very high token count (1000+ tokens), may confuse model
-- **Verdict:** REJECTED - Violates token budget
-
-## Decision
-
-Implement Option 2: Rewrite with emphasis on proactivity and execution.
-
-**Key additions:**
-1. **Capability statement:** "You have tools. Use them."
-2. **Execution mandate:** "Don't describe, execute"
-3. **Workflow:** Clear request→tool→result→next cycle
-4. **Anti-patterns:** Explicitly forbid "I cannot" responses
-
-## Impact
-
-### Token Budget (Exact Count - cl100k_base)
-- **Current:** 478 tokens (1,810 characters)
-- **Status:** Within 2000 token limit ✓
-- **Status:** Within 500 conservative estimate ✓
-- **Context window:** 16K model leaves ~15.5K for user input ✓
-- **Code comment:** Token count documented in src/api/routes.py ✓
-
-### Code Changes
-- **File:** src/api/routes.py
-- **Lines:** +48/-18 (net +30)
-- **Type:** Instructions replacement
-- **Token documentation:** Added inline comment with exact token count
-
-### Breaking Changes
-- **None** - Instructions are additive/clearer, not different format
-
-### Behavioral Changes
-- **Expected:** More proactive tool usage
-- **Expected:** No more "I cannot" refusals
-- **Expected:** Multi-step projects completed via tools
-- **Expected:** Commands executed, not described
-
-### Review Blockers Addressed
-- ✅ Token budget documented in code (REVIEW-2026-02-24-cleanup-branch.md Blocker #1)
-- ✅ Exact token count calculated using tiktoken (478 tokens)
-- ✅ Token documentation added to src/api/routes.py (REVIEW-2026-02-24-cleanup-branch.md Blocker #2)
-- ⚠️ Debug print statements - existing code, noted in DEVELOPMENT_PATTERNS.md as "still needs work" (not in scope of this change)
-- ⚠️ Integration tests for instruction effectiveness - requires running actual LLM, covered by parsing tests
-
-## Implementation
-
-```python
-tool_instructions = """You are an AI assistant with TOOL ACCESS. You CAN and MUST use tools to accomplish tasks.
-
-**CRITICAL RULES:**
-1. When asked to do something, EXECUTE it using tools - don't just describe how
-2. You MUST use the bash tool to run commands like mkdir, npm, ls, etc.
-3. You MUST use the write tool to create files
-4. NEVER say "I cannot" or "I am only an AI assistant" - you HAVE tools, USE them
-5. Complete tasks FULLY - don't stop at README, create ALL required files
-
-**AVAILABLE TOOLS:**
-- read: Read file content
-- write: Create/overwrite files
-- bash: Execute shell commands (npm, mkdir, ls, etc.)
-
-**TOOL FORMAT (STRICT):**
-TOOL: tool_name
-ARGUMENTS: {"param": "value"}
-
-**WORKFLOW:**
-1. User asks for something
-2. You decide what tool to use
-3. You respond with ONLY the TOOL: format above
-4. You receive the tool result
-5. You continue with next tool until task is COMPLETE
-
-**EXAMPLES:**
-
-Creating a project:
-User: "Create a React app"
-You: TOOL: bash
-ARGUMENTS: {"command": "mkdir myapp && cd myapp && npm init -y"}
-[wait for result]
-You: TOOL: write
-ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
-[continue until all files created]
-
-Running commands:
-User: "Install dependencies"
-You: TOOL: bash
-ARGUMENTS: {"command": "npm install"}
-[wait for result, then confirm completion]
-
-**WHAT NOT TO DO:**
-- ❌ "To create a React app, you should run: mkdir myapp" (describing)
-- ❌ "I cannot run commands, I am an AI" (refusing)
-- ❌ Creating only README instead of full project (incomplete)
-- ❌ "First do X, then do Y" (giving instructions instead of doing)
-
-**CORRECT BEHAVIOR:**
-- ✅ Execute the command immediately using the bash tool
-- ✅ Create all files using the write tool
-- ✅ Continue until task is 100% complete
-- ✅ Use ONE tool at a time and wait for results"""
-```
-
-## Testing
-
-1. Test with React Hello World request
-2. Verify model uses bash to create directory structure
-3. Verify model uses write to create all files
-4. Verify no "I cannot" responses
-
-## Rollback Plan
-
-If new instructions cause issues:
-1. Revert to previous ~125 token version
-2. Analyze what specifically failed
-3. Iterate on smaller changes
-
-## Success Metrics
-
-- [ ] Model uses tools on first request (not after prompting)
-- [ ] Zero "I cannot" or "I am an AI" responses
-- [ ] Multi-file projects fully created
-- [ ] Commands executed, not described
diff --git a/docs/design/2024-02-24-task-planning-verification.md b/docs/design/2024-02-24-task-planning-verification.md
deleted file mode 100644
index 559b2bd..0000000
--- a/docs/design/2024-02-24-task-planning-verification.md
+++ /dev/null
@@ -1,151 +0,0 @@
-# Design Decision: Task Planning and Verification Workflow
-
-**Date:** 2024-02-24
-**Scope:** src/api/routes.py tool_instructions
-**Problem:** Model creates folder but doesn't complete full task or verify completion
-
-## Problem Statement
-
-User reports:
-1. "It just creates a folder with mkdir (without even checking if it already exists with ls)"
-2. No verification that tasks are completed
-3. No planning of full task scope
-4. Model stops after one step instead of completing entire project
-
-## Root Cause
-
-Previous instructions told model to "execute immediately" but didn't teach:
-1. **Planning** - What needs to be done
-2. **Checking** - What already exists
-3. **Verification** - Did the step work
-4. **Completion loop** - Keep going until done
-
-## Solution
-
-Add **Task Completion Workflow** to instructions:
-
-```
-**TASK COMPLETION WORKFLOW (MANDATORY):**
-
-**1. PLAN:** List ALL steps needed before starting
-**2. CHECK:** Use ls to verify what exists before creating
-**3. EXECUTE:** Run first step
-**4. VERIFY:** Confirm step worked (ls, read file)
-**5. REPEAT:** Steps 3-4 until ALL complete
-**6. FINAL CHECK:** Verify entire task is done
-**7. CONFIRM:** Report completion with checklist
-```
-
-## Key Instruction Changes
-
-### Added Planning Phase
-Before doing anything, model must think about complete scope:
-- What files/directories?
-- What dependencies?
-- Complete task requirements
-
-### Added Verification Steps
-Every step must be verified:
-- `ls -la` after mkdir
-- `read` file after write
-- Check content is correct
-
-### Added Completion Loop
-Model must continue until:
-✓ All directories exist
-✓ All files exist with correct content
-✓ All dependencies installed
-✓ Each component verified
-
-### Complete Working Example
-Provided 13-step React example showing:
-1. Check existing (ls)
-2. Create directory
-3. Verify created (ls)
-4. Create package.json
-5. Verify package.json (read)
-6. Create source files
-7. Final verification (find myapp -type f)
-8. Install dependencies
-9. Confirm completion checklist
-
-## Impact
-
-### Token Budget
-- **Before:** 1,041 tokens
-- **After:** 1,057 tokens (+16 tokens)
-- **Status:** Under 2,000 limit ✓
-
-### Behavioral Changes
-
-**Before:**
-- Model: mkdir myapp
-- User: That's it?
-- Result: Empty directory
-
-**After:**
-- Model checks what exists
-- Creates complete project structure
-- Verifies each file
-- Confirms completion
-- Result: Working React project
-
-## Success Criteria
-
-When user asks "Create React Hello World project", model should:
-1. ✓ Check current directory contents
-2. ✓ Create myapp/ directory
-3. ✓ Verify directory created
-4. ✓ Create package.json
-5. ✓ Verify package.json content
-6. ✓ Create src/App.js
-7. ✓ Create src/index.js
-8. ✓ Create public/index.html
-9. ✓ Final verification (list all files)
-10. ✓ npm install
-11. ✓ Confirm completion checklist
-
-## Testing
-
-Test instructions contain:
-- PLAN/CHECK keywords
-- VERIFY keyword
-- COMPLETE keyword
-
-All tests pass: 11/11 ✓
-
-## Trade-offs
-
-**Pros:**
-- Complete task execution
-- Verification prevents partial work
-- Clear completion criteria
-- Better user experience
-
-**Cons:**
-- More tokens (but still under limit)
-- More verbose instructions
-- May be slower (more verification steps)
-
-## Related Files Changed
-
-1. src/api/routes.py - Updated tool_instructions
-2. tests/test_tool_parsing.py - Updated tests for new content
-3. docs/design/2024-02-24-task-planning-verification.md - This doc
-
-## Future Improvements
-
-1. **Task Queue System:** Server-side queue of pending operations
-2. **State Persistence:** Remember what's been done across conversations
-3. **Smart Resumption:** If interrupted, pick up where left off
-4. **Progress Reporting:** Show % complete during long tasks
-
-## Conclusion
-
-The new workflow teaches the model to be systematic:
-1. Plan before acting
-2. Check before creating
-3. Verify after each step
-4. Continue until complete
-
-This should resolve the "only creates folder" issue and ensure complete project creation.
diff --git a/docs/design/2024-02-24-tool-parsing-simplification.md b/docs/design/2024-02-24-tool-parsing-simplification.md
deleted file mode 100644
index a31c268..0000000
--- a/docs/design/2024-02-24-tool-parsing-simplification.md
+++ /dev/null
@@ -1,132 +0,0 @@
-# Design Decision: Tool Parsing Simplification
-
-**Date:** 2024-02-24
-**Scope:** src/api/routes.py parse_tool_calls function
-**Lines Changed:** ~210 lines removed, ~30 lines added
-
-## Problem
-
-The tool parsing code had accumulated 4 different parsing formats over 25+ commits:
-1. JSON `tool_calls` format with nested objects
-2. TOOL:/ARGUMENTS: format (simple text)
-3. Function pattern format `func_name(args)`
-4. Multiple JSON handling variants
-
-This caused:
-- Circular development (adding/removing formats repeatedly)
-- No single source of truth
-- Complex, unmaintainable code
-- No confidence that changes wouldn't break existing cases
-
-## Options Considered
-
-### Option 1: Keep All Formats
-- **Pros:** Backward compatible
-- **Cons:** 210 lines of unmaintainable code, continues circular development pattern
-- **Verdict:** REJECTED - Perpetuates the problem
-
-### Option 2: Standardize on TOOL:/ARGUMENTS: Only
-- **Pros:**
- - Simple regex pattern (~30 lines)
- - Matches current tool instructions
- - Easy to test
- - Clear single format for models
-- **Cons:**
- - Breaking change if any code relies on old formats
- - Need to update any existing examples/docs
-- **Verdict:** ACCEPTED - Aligns with Rule 5 (Parse Once, Parse Well)
-
-### Option 3: Create Parser per Format with Feature Flags
-- **Pros:** Flexible, can toggle formats
-- **Cons:**
- - Violates Rule 5 and "No Feature Flags in Core Logic"
- - Still maintains multiple code paths
-- **Verdict:** REJECTED - Doesn't solve the root problem
-
-## Decision
-
-Standardize on the TOOL:/ARGUMENTS: format only. Remove all other parsing code.
-
-**Rationale:**
-- Per DEVELOPMENT_PATTERNS.md recommendation #3: "One Format Only"
-- Token cost is minimal (no complex regex)
-- Test coverage provides confidence
-- Aligns with existing tool instructions
-
-## Impact
-
-### Token Count
-- **Parser code:** 210 lines → 30 lines (-180 lines)
-- **No change** to tool instructions (separate optimization)
-
-### Breaking Changes
-- **Yes** - Removes support for:
- - JSON `tool_calls` format in model responses
- - Function pattern format `read_file(path="test.txt")`
-
-**Migration:** Models must use:
-```
-TOOL: read
-ARGUMENTS: {"filePath": "test.txt"}
-```
-
-### Testing
-- Unit tests added: 9 test cases
-- Coverage: All parsing scenarios
-- All tests pass
-
-## Implementation
-
-```python
-# New implementation (30 lines)
-def parse_tool_calls(text: str) -> tuple:
- """Parse tool calls using standardized format."""
- import json
- import re
-
- tool_pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
- tool_matches = list(re.finditer(tool_pattern, text, re.IGNORECASE))
-
- if not tool_matches:
- return text, None
-
- tool_calls = []
- for i, tool_match in enumerate(tool_matches):
- tool_name = tool_match.group(1)
- args_str = tool_match.group(2)
- try:
- args_dict = json.loads(args_str)
- tool_calls.append({
- "id": f"call_{i+1}",
- "type": "function",
- "function": {
- "name": tool_name,
- "arguments": json.dumps(args_dict)
- }
- })
- except json.JSONDecodeError:
- continue
-
- if not tool_calls:
- return text, None
-
- first_start = tool_matches[0].start()
- content = text[:first_start].strip()
-
- return content, tool_calls
-```
-
-## Verification
-
-Run tests:
-```bash
-python tests/test_tool_parsing.py
-```
-
-Expected: 9 passed, 0 failed
-
-## Follow-up
-
-- [x] Update DEVELOPMENT_PATTERNS.md to mark as completed
-- [x] Add unit tests
-- [ ] Consider integration test for full tool execution flow
diff --git a/docs/design/2024-02-25-reduce-system-prompt-tokens.md b/docs/design/2024-02-25-reduce-system-prompt-tokens.md
deleted file mode 100644
index 5713d9f..0000000
--- a/docs/design/2024-02-25-reduce-system-prompt-tokens.md
+++ /dev/null
@@ -1,98 +0,0 @@
-# Investigation: 31k Token Context Issue
-
-## Problem
-When making requests through opencode to local_swarm, the LLM receives ~31k tokens of context even for simple empty directory queries.
-
-## Root Cause Identified
-
-**NOT an issue with this repo's codebase - this is expected behavior for function calling.**
-
-### How it works:
-
-1. **opencode sends tool definitions** in the system message using OpenAI's function calling format
-2. **Each tool definition is ~450 tokens** (name + description + parameters)
-3. **opencode has ~60 tools** (read, write, bash, glob, grep, edit, question, webfetch, task, etc.)
-4. **Total tool definition tokens:** ~27,000 tokens
-
-### Calculation:
-```
-Single tool definition: ~450 tokens
-Number of tools: ~60
-Tool schemas total: ~27,000 tokens
-System message: ~500 tokens
-User query: ~100 tokens
----
-Total: ~27,600 tokens
-```
-
-**This matches the observed ~31k tokens.**
-
-## Why This Happens
-
-OpenAI's function calling protocol requires sending the **complete function schemas** to the LLM with every request. This is how the model:
-- Knows what tools are available
-- Understands parameter requirements
-- Knows how to format tool calls
-
-All major LLM providers using function calling work this way (OpenAI, Anthropic, local models, etc.).
-
-## Verification
-
-```bash
-python -c "
-import tiktoken
-enc = tiktoken.get_encoding('cl100k_base')
-
-# Example from actual opencode tool definition
-read_tool_schema = '''{\"type\": \"function\", \"function\": {\"name\": \"read\", \"description\": \"Read a file or directory from the local filesystem...[full description]\", \"parameters\": {...}}}'''
-
-print(f'Single tool schema: {len(enc.encode(read_tool_schema))} tokens')
-print(f'Estimated 60 tools: {len(enc.encode(read_tool_schema)) * 60:,} tokens')
-"
-```
-
-Result:
-- Single tool definition: ~451 tokens
-- 60 tools: ~27,060 tokens
-- Plus system + user message: ~27,660 total
-
-## This Is NOT a Bug
-
-The 31k token context is **correct and expected** for function calling with 60+ tools. This is how:
-- OpenAI API works
-- Claude API works
-- Local models with function calling work
-
-## Potential Optimizations (Optional)
-
-If reducing context size is critical, consider:
-
-### Option 1: Dynamic Tool Selection
-- Only send tools relevant to current task
-- Example: For file operations, only send [read, write, glob, edit]
-- Trade-off: Requires opencode to intelligently filter tools
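-
-A minimal sketch of this approach, assuming a hypothetical `select_tools` helper and keyword map (neither exists in opencode today):
-
-```python
-# Hypothetical sketch: filter the tool list by simple keyword matching
-# before sending the request. Names and keywords are illustrative.
-FILE_TOOLS = {"read", "write", "glob", "edit"}
-SHELL_TOOLS = {"bash"}
-
-def select_tools(user_message, all_tools):
-    text = user_message.lower()
-    wanted = set()
-    if any(k in text for k in ("file", "read", "write", "edit")):
-        wanted |= FILE_TOOLS
-    if any(k in text for k in ("run", "install", "command", "shell")):
-        wanted |= SHELL_TOOLS
-    if not wanted:
-        return all_tools  # no match: fall back to the full tool set
-    return [t for t in all_tools if t["function"]["name"] in wanted]
-```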
-
-### Option 2: Compressed Tool Descriptions
-- Shorten tool descriptions to essentials
-- Example: "Read file at path (required: filePath)"
-- Trade-off: Model may make more errors with less guidance
-
-### Option 3: Tool Grouping
-- Group related tools under a single compact parameter, e.g. `tools: [read, write, glob]`
-- Trade-off: Breaks OpenAI compatibility
-
-## Recommendation
-
-**NO ACTION REQUIRED.** The 31k token context is:
-- Standard for function calling with many tools
-- Within capabilities of modern LLMs (32k-128k context windows)
-- Not caused by this repo's code
-
-The `.opencodeignore` created earlier will help with opencode's own system prompt, but doesn't affect the LLM context sent to local_swarm.
-
-## Additional Finding
-
-While investigating, verified:
-- `config/prompts/tool_instructions.txt`: 125 tokens ✅
-- This repo's tool execution code: No token bloat ✅
-- Issue is purely opencode's function calling protocol ✅
diff --git a/docs/test-plans/fix-tool-execution-tokens.md b/docs/test-plans/fix-tool-execution-tokens.md
deleted file mode 100644
index 629ec41..0000000
--- a/docs/test-plans/fix-tool-execution-tokens.md
+++ /dev/null
@@ -1,112 +0,0 @@
-# Test Plan: Fix Tool Execution and Token Reporting
-
-## Problem Analysis
-
-### Issue 1: Model Gives Instructions Instead of Executing
-**Current behavior:** Model describes what to do ("You should run mkdir...") instead of using the TOOL: format
-**Expected:** Model responds with `TOOL: bash` followed by `ARGUMENTS: {"command": "mkdir..."}`
-
-### Issue 2: Token Counting Inaccurate
-**Current:** Rough estimate `len(prompt) // 4`
-**Expected:** Accurate token count using tiktoken
-**Impact:** opencode can't properly manage the context window
-
-### Issue 3: npx Commands Timeout/Need Input
-**Current:** `npx create-react-app .` prompts for confirmation (y/n)
-**Expected:** Non-interactive execution or manual file creation
-**Evidence:** "Need to install the following packages: create-react-app@5.1.0 Ok to proceed? (y)"
-
-## Unit Tests
-
-### Test 1: Accurate Token Counting
-- [ ] Verify token count uses tiktoken (not rough estimate)
-- [ ] Test with known token counts
-- [ ] Verify prompt_tokens + completion_tokens = total_tokens
-
-### Test 2: Non-Interactive Bash Commands
-- [ ] Verify npm/npx commands use --yes or equivalent flags
-- [ ] Test timeout handling for package managers
-- [ ] Verify commands don't prompt for user input
-
-### Test 3: Tool Instructions Content
-- [ ] Verify instructions emphasize "EXECUTE not DESCRIBE"
-- [ ] Verify manual file creation examples (not npx)
-- [ ] Verify anti-patterns are clearly stated
-
-## Integration Tests
-
-### Test 4: End-to-End React Project Creation
-**Input:** "Create a React Hello World app"
-
-**Expected Flow:**
-1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
-2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
-3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "..."}
-4. Continue until complete
-
-**Failure Modes:**
-- [ ] Model describes steps instead of executing
-- [ ] Uses npx create-react-app (should manually create files)
-- [ ] Stops after README only
-
-### Test 5: Token Reporting Accuracy
-**Input:** Any chat completion request
-
-**Expected:**
-- usage.prompt_tokens matches actual tokens
-- usage.completion_tokens matches actual tokens
-- usage.total_tokens is sum
-
-**Verification:**
-- Compare tiktoken count vs API response
-
-## Manual Verification
-
-```bash
-# Test React creation
-python main.py --auto &
-curl -X POST http://localhost:17615/v1/chat/completions \
- -H "Content-Type: application/json" \
- -H "X-Client-Working-Dir: /tmp/test-project" \
- -d '{
- "model": "local-swarm",
- "messages": [{"role": "user", "content": "Create a React Hello World app"}],
- "tools": [{"type": "function", "function": {"name": "bash"}}, {"type": "function", "function": {"name": "write"}}]
- }'
-
-# Check token accuracy
-curl -X POST http://localhost:17615/v1/chat/completions \
- -H "Content-Type: application/json" \
- -d '{
- "model": "local-swarm",
- "messages": [{"role": "user", "content": "Hello"}]
- }' | jq '.usage'
-```
-
-## Success Criteria
-
-1. **Execution:** 100% of requests use TOOL: format (not descriptions)
-2. **Accuracy:** Token counts match tiktoken within ±5%
-3. **Completion:** Multi-file projects fully created via write tool
-4. **No npx:** Manual file creation for React (no npx create-react-app)
-
-## Implementation Notes
-
-### Token Counting Fix
-```python
-# Replace: prompt_tokens = len(prompt) // 4
-# With:
-import tiktoken
-encoding = tiktoken.get_encoding('cl100k_base')
-prompt_tokens = len(encoding.encode(prompt))
-completion_tokens = len(encoding.encode(content))
-```
-
-### Tool Instructions Fix
-- Add explicit "DO NOT USE npx create-react-app" instruction
-- Add "EXECUTE IMMEDIATELY" mandate
-- Show complete React example with manual file creation
-
-### Non-Interactive Commands
-- Auto-add --yes to npx commands
-- Or recommend manual file creation instead
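-
-A minimal sketch of the auto-`--yes` approach (the function name and pattern are illustrative, not existing code):
-
-```python
-# Hypothetical sketch: inject --yes after "npx" so package-install prompts
-# don't block execution. Real code would need proper shell parsing.
-import re
-
-def make_noninteractive(command):
-    if re.search(r"\bnpx\b", command) and "--yes" not in command and " -y " not in command:
-        command = re.sub(r"\bnpx\b", "npx --yes", command, count=1)
-    return command
-
-# make_noninteractive("npx create-react-app .") -> "npx --yes create-react-app ."
-```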
diff --git a/docs/test-plans/improved-tool-instructions.md b/docs/test-plans/improved-tool-instructions.md
deleted file mode 100644
index fafc02d..0000000
--- a/docs/test-plans/improved-tool-instructions.md
+++ /dev/null
@@ -1,97 +0,0 @@
-# Test Plan: Improved Tool Instructions
-
-## Problem Statement
-Model is not using tools effectively:
-1. Creates README instead of actual project structure
-2. Provides commands as text instead of executing them
-3. Refuses to run commands claiming "I am only an AI assistant"
-
-## Root Cause Analysis
-Current instructions don't clearly communicate:
-- That the model SHOULD use tools proactively
-- That execution is expected, not explanation
-- The workflow: user request → tool execution → result
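-
-A minimal sketch of what the intended injection could look like (the helper name is hypothetical; the instructions file is the repo's `config/prompts/tool_instructions.txt`):
-
-```python
-# Hypothetical sketch: prepend tool instructions to the system message
-# so they sit in the priority position at the very start.
-def inject_instructions(messages, instructions):
-    if messages and messages[0]["role"] == "system":
-        messages[0]["content"] = instructions + "\n\n" + messages[0]["content"]
-    else:
-        messages.insert(0, {"role": "system", "content": instructions})
-    return messages
-```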
-
-## Unit Tests (Instruction Verification)
-
-### Test 1: Instruction Presence
-- [ ] Verify instructions are injected into system message
-- [ ] Verify instructions appear at the START of system message (priority position)
-
-### Test 2: Token Count
-- [ ] Measure total token count of new instructions
-- [ ] Verify ≤ 500 tokens (conservative budget)
-- [ ] Document before/after
-
-### Test 3: Format Compliance
-- [ ] Verify instructions include TOOL:/ARGUMENTS: format
-- [ ] Verify examples use correct format
-- [ ] Verify rules are clear and numbered
-
-## Integration Tests (Behavioral)
-
-### Test 4: Project Creation Flow
-**Input:** "Create a React Hello World app"
-
-**Expected Behavior:**
-1. Model responds with TOOL: bash, ARGUMENTS: mkdir myapp
-2. After result, TOOL: write, ARGUMENTS: package.json content
-3. After result, TOOL: write, ARGUMENTS: src/App.js content
-4. Continue until complete project structure exists
-
-**Failure Modes:**
-- [ ] Model only describes what to do
-- [ ] Model creates README only
-- [ ] Model refuses to execute commands
-
-### Test 5: Multi-step Task
-**Input:** "Check what files exist, then create a test.txt file with 'hello' in it"
-
-**Expected Behavior:**
-1. TOOL: bash, ARGUMENTS: ls -la
-2. Wait for result
-3. TOOL: write, ARGUMENTS: test.txt with "hello"
-
-**Failure Modes:**
-- [ ] Model tries to do both in one response
-- [ ] Model doesn't wait for ls result before writing
-
-### Test 6: Command Refusal
-**Input:** "Run npm install"
-
-**Expected Behavior:**
-1. TOOL: bash, ARGUMENTS: npm install
-
-**Failure Modes:**
-- [ ] Model responds: "I cannot run commands, I am only an AI assistant"
-- [ ] Model explains npm install instead of running it
-
-## Manual Verification Commands
-
-```bash
-# Start the server
-python main.py --auto
-
-# In another terminal, test with curl
-curl -X POST http://localhost:17615/v1/chat/completions \
- -H "Content-Type: application/json" \
- -d '{
- "model": "local-swarm",
- "messages": [{"role": "user", "content": "Create a React Hello World app"}],
- "tools": [{"type": "function", "function": {"name": "bash", "description": "Run shell commands"}}, {"type": "function", "function": {"name": "write", "description": "Write files"}}]
- }'
-```
-
-## Success Criteria
-
-1. **Proactivity:** Model uses tools without being asked twice
-2. **Execution:** Model runs commands, doesn't just describe them
-3. **No Refusal:** Model never says "I cannot" or "I am only an AI"
-4. **Completeness:** Multi-file projects are fully created via tools
-5. **Format:** 100% of tool calls use correct TOOL:/ARGUMENTS: format
-
-## Metrics
-
-- **Tool usage rate:** % of requests that result in tool calls
-- **Format compliance:** % of tool calls in correct format
-- **Completion rate:** % of multi-step tasks fully completed
diff --git a/docs/test-plans/tool-parsing-simplification.md b/docs/test-plans/tool-parsing-simplification.md
deleted file mode 100644
index 114b37b..0000000
--- a/docs/test-plans/tool-parsing-simplification.md
+++ /dev/null
@@ -1,35 +0,0 @@
-# Test Plan: Tool Parsing Simplification
-
-## Unit Tests
-
-- [x] Test case 1: Single tool call → Returns 1 tool with correct name and arguments
-- [x] Test case 2: No tool in text → Returns None for tools, original text as content
-- [x] Test case 3: Multiple tools → Returns all tools in order
-- [x] Test case 4: Content before tool → Content extracted, tool parsed correctly
-- [x] Test case 5: Bash tool → Correctly parses bash command
-- [x] Test case 6: Case insensitive → "tool:" and "TOOL:" both work
-- [x] Test case 7: Invalid JSON → Skips invalid, continues with valid
-- [x] Test case 8: Empty text → Returns None, empty string
-- [x] Test case 9: Whitespace only → Returns None
-
-## Integration Tests
-
-- [ ] End-to-end flow:
- 1. Send chat completion request with tools
- 2. Model responds with TOOL:/ARGUMENTS: format
- 3. Parser extracts tool call
- 4. Tool executes
- 5. Result returned in response
-
-- [ ] Expected result: Tool executes successfully, result included in response
-
-## Manual Verification
-
-- [ ] Command: `python tests/test_tool_parsing.py`
-- [ ] Expected output: "9 passed, 0 failed"
-
-## Token Budget Verification
-
-- Parser code: ~30 lines (~200 tokens)
-- Well under the 2,000-token limit
-- Simple regex pattern maintains low complexity
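-
-A minimal sketch of the simple regex parser described above (names and pattern are illustrative, not the repo's exact code):
-
-```python
-# Hypothetical sketch: extract TOOL:/ARGUMENTS: pairs case-insensitively,
-# skipping invalid JSON and returning the remaining text as content.
-import json
-import re
-
-TOOL_RE = re.compile(
-    r"TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{.*?\})\s*(?=\nTOOL:|\Z)",
-    re.IGNORECASE | re.DOTALL,
-)
-
-def parse_tool_calls(text):
-    tools = []
-    for name, args in TOOL_RE.findall(text):
-        try:
-            tools.append({"name": name, "arguments": json.loads(args)})
-        except json.JSONDecodeError:
-            continue  # skip invalid JSON, keep the valid calls
-    content = TOOL_RE.sub("", text).strip()
-    return (tools or None), content
-```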