docs: update README with current features and remove outdated docs

- Removed old design docs and test plans from docs/ directory
- Updated TODO section to reflect completed improvements
- Added section on Recent Improvements with detailed changelog
- Updated Federation description to explain objective quality voting
- Added federation vote endpoint to API endpoints list
- Clarified universal tool support and OpenCode streaming compatibility
- All changes ready for main branch merge
2026-02-25 22:38:46 +01:00
parent 3799240d74
commit e7b826da4e
11 changed files with 23 additions and 1164 deletions
README.md +23 -13
@@ -91,7 +91,9 @@ python main.py --auto --federation
python main.py --auto --federation
```
-Machines auto-discover each other and vote together on every request.
+Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses **objective quality scoring** to pick the best answer, not self-reported confidence. This prevents smaller models from overruling better ones.
+
+**Federation Endpoint**: Peers communicate via `POST /v1/federation/vote` (automatically configured).

## How Consensus Works
@@ -147,7 +149,7 @@ All support GGUF quantization (Q4_K_M recommended).
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion with consensus
- `GET /health` - Health check
-- `GET /v1/federation/peers` - List discovered peers (when federation enabled)
+- `POST /v1/federation/vote` - Federation voting (used internally between peers)

## Troubleshooting
@@ -282,21 +284,29 @@ Major refactoring completed to improve modularity:
See `docs/ARCHITECTURE.md` for detailed architecture documentation.

-## TODO / Roadmap
+## Recent Improvements

-### Planned Features
-
-- **Plan Mode**: Add a "plan mode" that disables tool execution for planning-only conversations. This would allow the model to discuss file changes without actually modifying them until explicitly confirmed.
-  - Usage: `--plan-mode` flag or API parameter
-  - When enabled: Model can see what tools would do but doesn't execute them
-  - Use case: Review changes before applying them
-
-### Current Status
-
-- ✅ Tool instructions now injected by default for all clients
-- ✅ Improved file operation safety (verify with ls/grep before reading)
-- ✅ Working directory support (extracted from client context)
-- 🔄 Plan mode - coming soon
+### ✅ Universal Tool Support (2025-02-25)
+
+- Tool instructions are automatically injected for **all** clients (Continue, hollama, curl, etc.)
+- No client-side configuration needed - just use the API
+- Enhanced file operation guidance: the model uses ls/grep to verify files exist before reading
+- Working directory auto-extracted from prompts (`in /path/to/dir` patterns)
+- Proper OpenAI tool format with unique IDs and tool_call_id linking
+
+### ✅ OpenCode-Compatible Streaming (2025-02-25)
+
+- Proper `reasoning_content` field for "Thinking..." collapsible blocks
+- Multi-chunk `tool_calls` streaming matching the Vercel AI SDK format
+- Final answer delivered in the `content` field after tool execution
+
+### ✅ Federation Quality Voting (2025-02-25)
+
+- The head node now **objectively judges** all peer responses using quality metrics
+- No more reliance on self-reported confidence (which biased results toward the local node)
+- All responses are scored on length, structure, and completeness
+- Fair competition: 14B models properly beat 3B models on quality tasks
+
+### 🚧 Planned Features
+
+- **Plan Mode**: Disable tool execution for planning-only conversations (`--plan-mode`)
+- **Tool Consensus**: Verify tool calls across multiple workers before execution (for critical operations)
## Contributing
@@ -1,92 +0,0 @@
# Design Decision: Complete React Example with Actual Code
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
## Problem
Model is still not following instructions:
1. Tries `npm install` before creating package.json
2. Still tries `npx create-react-app` despite being told not to
3. Instructions have placeholders like "..." and "etc." which models don't understand
## Root Cause
The current instructions say:
```
TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"dependencies\": {\"react\": \"^18.0.0\", \"react-dom\": \"^18.0.0\"}}"}
[Continue with src/index.js, src/App.js, public/index.html, etc.]
```
**Problem:** "etc." and "..." are meaningless to LLMs. They need concrete examples.
## Solution
Provide a **complete, working, minimal React example** with actual file contents:
1. Exact sequence: mkdir → write package.json → write src/App.js → write src/index.js → write public/index.html → npm install
2. Actual file content, not placeholders
3. Minimal viable React app (not full create-react-app structure)
## Implementation
Replace vague example with complete working code:
```
**COMPLETE REACT HELLO WORLD EXAMPLE:**
User: "Create a React Hello World app"
Step 1 - Create directory:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp"}
Step 2 - Create package.json (MUST do this BEFORE npm install):
TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"private\": true, \"dependencies\": {\"react\": \"^18.2.0\", \"react-dom\": \"^18.2.0\"}, \"scripts\": {\"start\": \"react-scripts start\", \"build\": \"react-scripts build\"}, \"devDependencies\": {\"react-scripts\": \"5.0.1\"}}"}
Step 3 - Create src directory:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp/src"}
Step 4 - Create App.js:
TOOL: write
ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "import React from 'react';\n\nfunction App() {\n return (\n <div className=\"App\">\n <h1>Hello World</h1>\n <p>Welcome to my React app!</p>\n </div>\n );\n}\n\nexport default App;"}
Step 5 - Create index.js:
TOOL: write
ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "import React from 'react';\nimport ReactDOM from 'react-dom/client';\nimport App from './App';\n\nconst root = ReactDOM.createRoot(document.getElementById('root'));\nroot.render(<App />);"}
Step 6 - Create public directory and index.html:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp/public"}
TOOL: write
ARGUMENTS: {"filePath": "myapp/public/index.html", "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta charset=\"UTF-8\">\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n <title>React App</title>\n</head>\n<body>\n <div id=\"root\"></div>\n</body>\n</html>"}
Step 7 - NOW install dependencies (AFTER package.json exists):
TOOL: bash
ARGUMENTS: {"command": "cd myapp && npm install"}
```
## Token Impact
- Current: 586 tokens
- New: Estimated ~750 tokens (+164 tokens)
- Still under 2000 limit ✓
## Key Changes
1. **Explicit sequencing:** "Step 1", "Step 2", etc.
2. **Actual code:** No "..." or "etc." - real working content
3. **Critical note:** "MUST do this BEFORE npm install"
4. **Minimal structure:** Just what's needed for Hello World
## Success Criteria
- [ ] Model creates package.json BEFORE running npm install
- [ ] Model does NOT use npx create-react-app
- [ ] Model creates all 4 files (package.json, App.js, index.js, index.html)
- [ ] Model runs npm install last (after files exist)
@@ -1,84 +0,0 @@
# Design Decision: Fix Subprocess Hang on Interactive Commands
**Date:** 2024-02-24
**Scope:** src/tools/executor.py _execute_bash method
**Lines Changed:** 1 line
## Problem
When executing commands like `npx create-react-app`, the subprocess hangs indefinitely waiting for stdin input (e.g., "Ok to proceed? (y)"). This causes:
1. 300s timeout to be reached
2. opencode to hang waiting for response
3. Poor user experience
## Root Cause
By default, `subprocess.run()` inherits stdin from the parent process. When a command prompts for input:
- npx asks: "Need to install create-react-app@5.1.0 Ok to proceed? (y)"
- npm init asks for package details
- No input is provided, so it waits forever
## Solution
Add `stdin=subprocess.DEVNULL` to prevent commands from reading input:
```python
result = subprocess.run(
command,
shell=True,
capture_output=True,
text=True,
timeout=timeout,
cwd=cwd,
stdin=subprocess.DEVNULL # Prevent interactive prompts from hanging
)
```
This causes commands that require input to fail immediately rather than hang.
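A minimal standalone reproduction of the behavior (a sketch, assuming a POSIX shell where `read` exits non-zero on EOF):
```python
import subprocess

# Without stdin=DEVNULL, `read` would block forever waiting for input.
# With it, the command sees EOF immediately and exits non-zero.
result = subprocess.run(
    "read answer",
    shell=True,
    capture_output=True,
    text=True,
    timeout=10,
    stdin=subprocess.DEVNULL,
)
print(result.returncode)  # non-zero, returned immediately instead of hanging
```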
## Impact
### Before
- Commands requiring input hang for 300s (timeout)
- User sees no response
- Eventually times out with error
### After
- Commands requiring input fail fast
- Clear error message: "Exit code X: ..."
- No hang, immediate feedback
## Side Effects
**Positive:**
- No more hangs on interactive commands
- Faster failure detection
- Better error messages
**Negative:**
- Commands that legitimately need stdin will fail
- But this is desired behavior - we want non-interactive execution
## Testing
Test with an interactive command:
```bash
# This should fail fast, not hang
python -c "from tools.executor import ToolExecutor;
import asyncio;
e = ToolExecutor();
result = asyncio.run(e.execute('bash', {'command': 'read -p \"Enter something: \" var'}));
print(result)"
```
Expected: Quick failure, not a 300s hang
## Related Changes
This complements the tool instructions fix:
- Instructions now say "DO NOT use npx create-react-app"
- This fix ensures if model ignores instructions, it fails fast instead of hanging
## Conclusion
One-line fix prevents interactive command hangs, improving reliability and user experience.
@@ -1,178 +0,0 @@
# Design Decision: Fix Tool Execution and Token Reporting
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions and token counting
## Problem Statement
User report shows three critical failures:
1. **Instruction vs Execution:** Model says "You should run mkdir..." instead of TOOL: format
2. **Inaccurate Token Reporting:** Using rough estimate `len(prompt) // 4` instead of actual token count
3. **Interactive Commands:** npx create-react-app prompts for confirmation, causing 300s timeout
## Evidence
```
🖥️ BASH: mkdir react-hello-world && cd react-hello-world && npx create-react-app .
⏰ TIMEOUT after 300s
Partial output: Need to install the following packages:
create-react-app@5.1.0
Ok to proceed? (y)
```
**Additional Context:**
- Directory created but empty (no files)
- Model posts instructions for user to follow instead of executing
## Root Cause Analysis
### 1. Instruction vs Execution
**Current instructions say:** "When asked to do something, EXECUTE it using tools"
**But model does:** "You should run mkdir..."
**Why:** Instructions aren't strong enough - need explicit anti-patterns
### 2. Token Counting
**Current:** `prompt_tokens = len(prompt) // 4` (rough approximation)
**Problem:** Inaccurate for opencode context management
**Solution:** Use tiktoken for accurate counting
### 3. Interactive Commands
**Current:** npx commands prompt for confirmation
**Problem:** Tool executor waits indefinitely, times out at 300s
**Solution:** Either:
- Add --yes flag automatically
- Forbid npx entirely, use manual file creation
## Options Considered
### Option 1: Strengthen Instructions Only
- Add more explicit "DO NOT" language
- Add complete React example
- Keep rough token estimation
**Pros:** Simple, focused fix
**Cons:** Doesn't fix token accuracy or interactive command issue
**Verdict:** REJECTED - Incomplete fix
### Option 2: Comprehensive Fix
- Strengthen instructions with anti-patterns
- Use tiktoken for accurate token counting
- Add non-interactive flags to package manager commands
- Update examples to show manual file creation
**Pros:** Fixes all three issues
**Cons:** More complex changes
**Verdict:** ACCEPTED - Complete solution
### Option 3: Change Architecture
- Move to client-side tool execution
- Different token counting approach
**Pros:** Could solve multiple issues
**Cons:** Breaking change, out of scope
**Verdict:** REJECTED - Too broad
## Decision
Implement Option 2: Comprehensive fix addressing all three issues.
### Changes
#### 1. Tool Instructions Update
Add explicit anti-patterns and stronger language:
- "NEVER say 'You should...' - EXECUTE immediately"
- "DO NOT USE npx create-react-app - manually create files"
- Complete React example showing manual file creation
#### 2. Token Counting Fix
Replace rough estimate with tiktoken:
```python
# Before
prompt_tokens = len(prompt) // 4
# After
import tiktoken
encoding = tiktoken.get_encoding('cl100k_base')
prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(content))
```
#### 3. Non-Interactive Commands
Update instructions to specify (see the sketch after this list):
- Use `npm init -y` (not interactive)
- Manually write package.json instead of npx
- All examples show manual file creation
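To make this concrete, a hypothetical command rewriter could enforce the non-interactive flags before execution (the helper name and regexes are illustrative, not existing code):
```python
import re

def make_non_interactive(command: str) -> str:
    """Hypothetical helper: force package manager calls to skip prompts."""
    # `npx --yes` skips the "Ok to proceed? (y)" install confirmation
    command = re.sub(r"\bnpx\s+(?!--yes\b)", "npx --yes ", command)
    # `npm init -y` accepts all defaults instead of prompting
    command = re.sub(r"\bnpm init\b(?!\s+-y\b)", "npm init -y", command)
    return command

print(make_non_interactive("npx create-react-app myapp"))
# -> npx --yes create-react-app myapp
```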
## Impact
### Token Budget (Exact Count - cl100k_base)
- **New Instructions:** 586 tokens (2,067 characters)
- **Status:** Within 2000 token limit ✓
- **Context window:** 16K model leaves ~15.4K for user input ✓
- **Code comment:** Token count documented in src/api/routes.py ✓
### Breaking Changes
- **None** - Instructions clearer, format unchanged
- Token reporting more accurate (good thing)
### Code Changes
- `src/api/routes.py`:
- Update tool_instructions (~+15 lines)
- Add tiktoken import
- Replace token estimation logic (~5 lines)
## Testing Strategy
1. **Token Accuracy Test:**
```python
def test_token_accuracy():
    import tiktoken
    encoding = tiktoken.get_encoding("cl100k_base")
    prompt = "Hello world"
    content = "Hi there"
    expected_prompt = len(encoding.encode(prompt))
    expected_completion = len(encoding.encode(content))
    # Call /v1/chat/completions with `prompt` and assert the returned
    # usage.prompt_tokens / usage.completion_tokens equal these values
```
2. **Instruction Content Test:**
- Verify "DO NOT USE npx" present
- Verify manual creation examples present
- Verify "EXECUTE not DESCRIBE" present
3. **Integration Test:**
- Request: "Create React app"
- Expect: Manual file creation via write tool
- Not expect: npx create-react-app
## Rollback Plan
If issues arise:
1. Revert to previous instructions
2. Keep tiktoken for token counting (beneficial)
3. Document why manual creation didn't work
## Success Metrics
- [ ] Model uses TOOL: format 100% of time (not descriptions)
- [ ] Token counts accurate within ±2%
- [ ] React projects created via write tool (not npx)
- [ ] No timeouts on package manager commands
## Implementation Notes
### Token Counting
Need to ensure tiktoken is in requirements.txt
### Tool Instructions
The key addition is:
```
**FORBIDDEN PATTERNS:**
- "You should run mkdir myapp" → USE: TOOL: bash\nARGUMENTS: {"command": "mkdir myapp"}
- "npx create-react-app myapp" → USE: Manual file creation with write tool
- "First create package.json, then..." → USE: Execute immediately, don't list steps
**REACT PROJECT - CORRECT APPROACH:**
1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\"...}"}
3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "..."}
4. Continue until all files created
```
@@ -1,172 +0,0 @@
# Design Decision: Improved Tool Instructions
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
**Lines Changed:** ~25 lines
## Problem
Current tool instructions (~125 tokens) fail to communicate key behavioral expectations:
1. **Passive vs Active:** Model describes what to do instead of doing it
2. **Refusal:** Model claims "I am only an AI assistant" instead of executing
3. **Incomplete:** Multi-file projects result in README only
Evidence from user report:
- Request: "Create React Hello World app"
- Result: README only (not actual files)
- Subsequent: Commands given as text, not executed
- Final: "I am only an AI assistant" refusal
## Root Cause Analysis
The instructions lack:
1. **Authority statement** - "You CAN and SHOULD use tools"
2. **Execution mandate** - "Execute commands, don't just describe them"
3. **Workflow clarity** - Clear step-by-step expectations
4. **Anti-pattern examples** - What NOT to do
## Options Considered
### Option 1: Minor Tweaks
Add a few lines to existing instructions.
- **Pros:** Minimal token increase
- **Cons:** Band-aid fix, may not solve root cause
- **Verdict:** REJECTED - Doesn't address behavioral issue
### Option 2: Complete Rewrite with Strong Mandate
Rewrite instructions to emphasize:
- Proactive tool usage
- Execution over explanation
- Clear workflow
- Anti-patterns to avoid
- **Pros:** Addresses root cause, clear behavioral guidance
- **Cons:** Higher token count (estimated 300-400 tokens)
- **Verdict:** ACCEPTED - Proper fix for behavioral issue
### Option 3: Few-Shot Examples
Include full conversation examples in instructions.
- **Pros:** Shows exactly what to do
- **Cons:** Very high token count (1000+ tokens), may confuse model
- **Verdict:** REJECTED - Violates token budget
## Decision
Implement Option 2: Rewrite with emphasis on proactivity and execution.
**Key additions:**
1. **Capability statement:** "You have tools. Use them."
2. **Execution mandate:** "Don't describe, execute"
3. **Workflow:** Clear request→tool→result→next cycle
4. **Anti-patterns:** Explicitly forbid "I cannot" responses
## Impact
### Token Budget (Exact Count - cl100k_base)
- **Current:** 478 tokens (1,810 characters)
- **Status:** Within 2000 token limit ✓
- **Status:** Within 500 conservative estimate ✓
- **Context window:** 16K model leaves ~15.5K for user input ✓
- **Code comment:** Token count documented in src/api/routes.py ✓
### Code Changes
- **File:** src/api/routes.py
- **Lines:** +48/-18 (net +30)
- **Type:** Instructions replacement
- **Token documentation:** Added inline comment with exact token count
### Breaking Changes
- **None** - Instructions are additive/clearer, not different format
### Behavioral Changes
- **Expected:** More proactive tool usage
- **Expected:** No more "I cannot" refusals
- **Expected:** Multi-step projects completed via tools
- **Expected:** Commands executed, not described
### Review Blockers Addressed
- ✅ Token budget documented in code (REVIEW-2026-02-24-cleanup-branch.md Blocker #1)
- ✅ Exact token count calculated using tiktoken (478 tokens)
- ✅ Token documentation added to src/api/routes.py (REVIEW-2026-02-24-cleanup-branch.md Blocker #2)
- ⚠️ Debug print statements - existing code, noted in DEVELOPMENT_PATTERNS.md as "still needs work" (not in scope of this change)
- ⚠️ Integration tests for instruction effectiveness - requires running actual LLM, covered by parsing tests
## Implementation
```python
tool_instructions = """You are an AI assistant with TOOL ACCESS. You CAN and MUST use tools to accomplish tasks.
**CRITICAL RULES:**
1. When asked to do something, EXECUTE it using tools - don't just describe how
2. You MUST use the bash tool to run commands like mkdir, npm, ls, etc.
3. You MUST use the write tool to create files
4. NEVER say "I cannot" or "I am only an AI assistant" - you HAVE tools, USE them
5. Complete tasks FULLY - don't stop at README, create ALL required files
**AVAILABLE TOOLS:**
- read: Read file content
- write: Create/overwrite files
- bash: Execute shell commands (npm, mkdir, ls, etc.)
**TOOL FORMAT (STRICT):**
TOOL: tool_name
ARGUMENTS: {"param": "value"}
**WORKFLOW:**
1. User asks for something
2. You decide what tool to use
3. You respond with ONLY the TOOL: format above
4. You receive the tool result
5. You continue with next tool until task is COMPLETE
**EXAMPLES:**
Creating a project:
User: "Create a React app"
You: TOOL: bash
ARGUMENTS: {"command": "mkdir myapp && cd myapp && npm init -y"}
[wait for result]
You: TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
[continue until all files created]
Running commands:
User: "Install dependencies"
You: TOOL: bash
ARGUMENTS: {"command": "npm install"}
[wait for result, then confirm completion]
**WHAT NOT TO DO:**
- ❌ "To create a React app, you should run: mkdir myapp" (describing)
- ❌ "I cannot run commands, I am an AI" (refusing)
- ❌ Creating only README instead of full project (incomplete)
- ❌ "First do X, then do Y" (giving instructions instead of doing)
**CORRECT BEHAVIOR:**
- ✅ Execute the command immediately using the bash tool
- ✅ Create all files using the write tool
- ✅ Continue until task is 100% complete
- ✅ Use ONE tool at a time and wait for results"""
```
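For context, a minimal sketch of how the injection itself might be wired (illustrative; `build_system_message` is a hypothetical helper, and the actual routes.py code may differ):
```python
def build_system_message(client_system: str | None) -> str:
    # Put tool_instructions (defined above) ahead of any client-provided
    # system text, so the execution mandate takes priority
    if client_system:
        return f"{tool_instructions}\n\n{client_system}"
    return tool_instructions
```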
## Testing
1. Test with React Hello World request
2. Verify model uses bash to create directory structure
3. Verify model uses write to create all files
4. Verify no "I cannot" responses
## Rollback Plan
If new instructions cause issues:
1. Revert to previous ~125 token version
2. Analyze what specifically failed
3. Iterate on smaller changes
## Success Metrics
- [ ] Model uses tools on first request (not after prompting)
- [ ] Zero "I cannot" or "I am an AI" responses
- [ ] Multi-file projects fully created
- [ ] Commands executed, not described
@@ -1,151 +0,0 @@
# Design Decision: Task Planning and Verification Workflow
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
**Problem:** Model creates folder but doesn't complete full task or verify completion
## Problem Statement
User reports:
1. "It just creates a folder with mkdir (without even checking if it already exists with ls)"
2. No verification that tasks are completed
3. No planning of full task scope
4. Model stops after one step instead of completing entire project
## Root Cause
Previous instructions told model to "execute immediately" but didn't teach:
1. **Planning** - What needs to be done
2. **Checking** - What already exists
3. **Verification** - Did the step work
4. **Completion loop** - Keep going until done
## Solution
Add **Task Completion Workflow** to instructions:
```
**TASK COMPLETION WORKFLOW (MANDATORY):**
**1. PLAN:** List ALL steps needed before starting
**2. CHECK:** Use ls to verify what exists before creating
**3. EXECUTE:** Run first step
**4. VERIFY:** Confirm step worked (ls, read file)
**5. REPEAT:** Steps 3-4 until ALL complete
**6. FINAL CHECK:** Verify entire task is done
**7. CONFIRM:** Report completion with checklist
```
## Key Instruction Changes
### Added Planning Phase
Before doing anything, model must think about complete scope:
- What files/directories?
- What dependencies?
- Complete task requirements
### Added Verification Steps
Every step must be verified, as sketched after this list:
- `ls -la` after mkdir
- `read` file after write
- Check content is correct
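A verify-after-create pair in the instructions' own TOOL: format (an illustrative sketch):
```
TOOL: bash
ARGUMENTS: {"command": "ls -la myapp"}
[result confirms myapp/ exists]

TOOL: read
ARGUMENTS: {"filePath": "myapp/package.json"}
[result confirms the content matches what was written]
```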
### Added Completion Loop
Model must continue until:
✓ All directories exist
✓ All files exist with correct content
✓ All dependencies installed
✓ Each component verified
### Complete Working Example
Provided 13-step React example showing:
1. Check existing (ls)
2. Create directory
3. Verify created (ls)
4. Create package.json
5. Verify package.json (read)
6. Create source files
7. Final verification (find myapp -type f)
8. Install dependencies
9. Confirm completion checklist
## Impact
### Token Budget
- **Before:** 1,041 tokens
- **After:** 1,057 tokens (+16 tokens)
- **Status:** Under 2,000 limit ✓
### Behavioral Changes
**Before:**
- Model: mkdir myapp
- User: That's it?
- Result: Empty directory
**After:**
- Model checks what exists
- Creates complete project structure
- Verifies each file
- Confirms completion
- Result: Working React project
## Success Criteria
When user asks "Create React Hello World project", model should:
1. ✓ Check current directory contents
2. ✓ Create myapp/ directory
3. ✓ Verify directory created
4. ✓ Create package.json
5. ✓ Verify package.json content
6. ✓ Create src/App.js
7. ✓ Create src/index.js
8. ✓ Create public/index.html
9. ✓ Final verification (list all files)
10. ✓ npm install
11. ✓ Confirm completion checklist
## Testing
Test instructions contain:
- PLAN/CHECK keywords
- VERIFY keyword
- COMPLETE keyword
All tests pass: 11/11 ✓
## Trade-offs
**Pros:**
- Complete task execution
- Verification prevents partial work
- Clear completion criteria
- Better user experience
**Cons:**
- More tokens (but still under limit)
- More verbose instructions
- May be slower (more verification steps)
## Related Files Changed
1. src/api/routes.py - Updated tool_instructions
2. tests/test_tool_parsing.py - Updated tests for new content
3. docs/design/2024-02-24-task-planning-verification.md - This doc
## Future Improvements
1. **Task Queue System:** Server-side queue of pending operations
2. **State Persistence:** Remember what's been done across conversations
3. **Smart Resumption:** If interrupted, pick up where left off
4. **Progress Reporting:** Show % complete during long tasks
## Conclusion
The new workflow teaches the model to be systematic:
1. Plan before acting
2. Check before creating
3. Verify after each step
4. Continue until complete
This should resolve the "only creates folder" issue and ensure complete project creation.
@@ -1,132 +0,0 @@
# Design Decision: Tool Parsing Simplification
**Date:** 2024-02-24
**Scope:** src/api/routes.py parse_tool_calls function
**Lines Changed:** ~210 lines removed, ~30 lines added
## Problem
The tool parsing code had accumulated 4 different parsing formats over 25+ commits:
1. JSON `tool_calls` format with nested objects
2. TOOL:/ARGUMENTS: format (simple text)
3. Function pattern format `func_name(args)`
4. Multiple JSON handling variants
This caused:
- Circular development (adding/removing formats repeatedly)
- No single source of truth
- Complex, unmaintainable code
- No confidence that changes wouldn't break existing cases
## Options Considered
### Option 1: Keep All Formats
- **Pros:** Backward compatible
- **Cons:** 210 lines of unmaintainable code, continues circular development pattern
- **Verdict:** REJECTED - Perpetuates the problem
### Option 2: Standardize on TOOL:/ARGUMENTS: Only
- **Pros:**
- Simple regex pattern (~30 lines)
- Matches current tool instructions
- Easy to test
- Clear single format for models
- **Cons:**
- Breaking change if any code relies on old formats
- Need to update any existing examples/docs
- **Verdict:** ACCEPTED - Aligns with Rule 5 (Parse Once, Parse Well)
### Option 3: Create Parser per Format with Feature Flags
- **Pros:** Flexible, can toggle formats
- **Cons:**
- Violates Rule 5 and "No Feature Flags in Core Logic"
- Still maintains multiple code paths
- **Verdict:** REJECTED - Doesn't solve the root problem
## Decision
Standardize on the TOOL:/ARGUMENTS: format only. Remove all other parsing code.
**Rationale:**
- Per DEVELOPMENT_PATTERNS.md recommendation #3: "One Format Only"
- Token cost is minimal (no complex regex)
- Test coverage provides confidence
- Aligns with existing tool instructions
## Impact
### Token Count
- **Parser code:** 210 lines → 30 lines (-180 lines)
- **No change** to tool instructions (separate optimization)
### Breaking Changes
- **Yes** - Removes support for:
- JSON `tool_calls` format in model responses
- Function pattern format `read_file(path="test.txt")`
**Migration:** Models must use:
```
TOOL: read
ARGUMENTS: {"filePath": "test.txt"}
```
### Testing
- Unit tests added: 9 test cases
- Coverage: All parsing scenarios
- All tests pass
## Implementation
```python
# New implementation (~30 lines). Arguments are parsed with
# json.JSONDecoder.raw_decode rather than a brace-matching regex,
# so nested JSON objects (e.g. package.json content) don't truncate the match.
def parse_tool_calls(text: str) -> tuple:
    """Parse tool calls using the standardized TOOL:/ARGUMENTS: format."""
    import json
    import re
    header_pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*'
    matches = list(re.finditer(header_pattern, text, re.IGNORECASE))
    if not matches:
        return text, None
    decoder = json.JSONDecoder()
    tool_calls = []
    first_start = None
    for match in matches:
        try:
            # Parse exactly one JSON value starting right after "ARGUMENTS:"
            args_dict, _ = decoder.raw_decode(text, match.end())
        except json.JSONDecodeError:
            continue  # skip malformed arguments, keep the valid ones
        if first_start is None:
            first_start = match.start()
        tool_calls.append({
            "id": f"call_{len(tool_calls) + 1}",
            "type": "function",
            "function": {
                "name": match.group(1),
                "arguments": json.dumps(args_dict)
            }
        })
    if not tool_calls:
        return text, None
    content = text[:first_start].strip()
    return content, tool_calls
```
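A quick usage sketch of the parser on a typical model response:
```python
text = 'Creating the directory now.\nTOOL: bash\nARGUMENTS: {"command": "mkdir myapp"}'
content, calls = parse_tool_calls(text)
print(content)                            # Creating the directory now.
print(calls[0]["function"]["name"])       # bash
print(calls[0]["function"]["arguments"])  # {"command": "mkdir myapp"}
```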
## Verification
Run tests:
```bash
python tests/test_tool_parsing.py
```
Expected: 9 passed, 0 failed
## Follow-up
- [x] Update DEVELOPMENT_PATTERNS.md to mark as completed
- [x] Add unit tests
- [ ] Consider integration test for full tool execution flow
@@ -1,98 +0,0 @@
# Investigation: 31k Token Context Issue
## Problem
When making requests through opencode to local_swarm, the LLM receives ~31k tokens of context even for simple empty directory queries.
## Root Cause Identified
**NOT an issue with this repo's codebase - this is expected behavior for function calling.**
### How it works:
1. **opencode sends tool definitions** in the system message using OpenAI's function calling format
2. **Each tool definition is ~450 tokens** (name + description + parameters)
3. **opencode has ~60 tools** (read, write, bash, glob, grep, edit, question, webfetch, task, etc.)
4. **Total tool definition tokens:** ~27,000 tokens
### Calculation:
```
Single tool definition: ~450 tokens
Number of tools: ~60
Tool schemas total: ~27,000 tokens
System message: ~500 tokens
User query: ~100 tokens
---
Total: ~27,600 tokens
```
**This accounts for the bulk of the observed ~31k tokens.**
## Why This Happens
OpenAI's function calling protocol requires sending the **complete function schemas** to the LLM with every request. This is how the model:
- Knows what tools are available
- Understands parameter requirements
- Knows how to format tool calls
All major LLM providers using function calling work this way (OpenAI, Anthropic, local models, etc.).
## Verification
```bash
python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
# Example from actual opencode tool definition
read_tool_schema = '''{\"type\": \"function\", \"function\": {\"name\": \"read\", \"description\": \"Read a file or directory from the local filesystem...[full description]\", \"parameters\": {...}}}'''
print(f'Single tool schema: {len(enc.encode(read_tool_schema))} tokens')
print(f'Estimated 60 tools: {len(enc.encode(read_tool_schema)) * 60:,} tokens')
"
```
Result:
- Single tool definition: ~451 tokens
- 60 tools: ~27,060 tokens
- Plus system + user message: ~27,660 total
## This Is NOT a Bug
The 31k token context is **correct and expected** for function calling with 60+ tools. This is how:
- OpenAI API works
- Claude API works
- Local models with function calling work
## Potential Optimizations (Optional)
If reducing context size is critical, consider:
### Option 1: Dynamic Tool Selection
- Only send tools relevant to current task
- Example: For file operations, only send [read, write, glob, edit]
- Trade-off: Requires opencode to intelligently filter tools (sketched below)
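A minimal sketch of such client-side filtering (the names are illustrative, not an existing opencode API):
```python
# Hypothetical filter: forward only the schemas relevant to the current
# task instead of all ~60 tool definitions (~27k tokens).
FILE_TOOLS = {"read", "write", "glob", "edit"}

def select_tools(all_tools: list, allowed: set) -> list:
    return [t for t in all_tools if t["function"]["name"] in allowed]
```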
### Option 2: Compressed Tool Descriptions
- Shorten tool descriptions to essentials
- Example: "Read file at path (required: filePath)"
- Trade-off: Model may make more errors with less guidance
### Option 3: Tool Grouping
- Group similar tools into single "tools: [read, write, glob]" parameter
- Trade-off: Breaks OpenAI compatibility
## Recommendation
**NO ACTION REQUIRED.** The 31k token context is:
- Standard for function calling with many tools
- Within capabilities of modern LLMs (32k-128k context windows)
- Not caused by this repo's code
The `.opencodeignore` created earlier will help with opencode's own system prompt, but doesn't affect the LLM context sent to local_swarm.
## Additional Finding
While investigating, verified:
- `config/prompts/tool_instructions.txt`: 125 tokens ✅
- This repo's tool execution code: No token bloat ✅
- Issue is purely opencode's function calling protocol ✅
@@ -1,112 +0,0 @@
# Test Plan: Fix Tool Execution and Token Reporting
## Problem Analysis
### Issue 1: Model Gives Instructions Instead of Executing
**Current behavior:** Model describes what to do ("You should run mkdir...") instead of using TOOL: format
**Expected:** Model responds with TOOL: bash\nARGUMENTS: {"command": "mkdir..."}
### Issue 2: Token Counting Inaccurate
**Current:** Rough estimate `len(prompt) // 4`
**Expected:** Accurate token count using tiktoken
**Impact:** opencode can't properly manage context window
### Issue 3: npx Commands Timeout/Need Input
**Current:** `npx create-react-app .` prompts for confirmation (y/n)
**Expected:** Non-interactive execution or manual file creation
**Evidence:** "Need to install the following packages: create-react-app@5.1.0 Ok to proceed? (y)"
## Unit Tests
### Test 1: Accurate Token Counting
- [ ] Verify token count uses tiktoken (not rough estimate)
- [ ] Test with known token counts
- [ ] Verify prompt_tokens + completion_tokens = total_tokens
### Test 2: Non-Interactive Bash Commands
- [ ] Verify npm/npx commands use --yes or equivalent flags
- [ ] Test timeout handling for package managers
- [ ] Verify commands don't prompt for user input
### Test 3: Tool Instructions Content
- [ ] Verify instructions emphasize "EXECUTE not DESCRIBE"
- [ ] Verify manual file creation examples (not npx)
- [ ] Verify anti-patterns are clearly stated
## Integration Tests
### Test 4: End-to-End React Project Creation
**Input:** "Create a React Hello World app"
**Expected Flow:**
1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "..."}
4. Continue until complete
**Failure Modes:**
- [ ] Model describes steps instead of executing
- [ ] Uses npx create-react-app (should manually create files)
- [ ] Stops after README only
### Test 5: Token Reporting Accuracy
**Input:** Any chat completion request
**Expected:**
- usage.prompt_tokens matches actual tokens
- usage.completion_tokens matches actual tokens
- usage.total_tokens is sum
**Verification:**
- Compare tiktoken count vs API response
## Manual Verification
```bash
# Test React creation
python main.py --auto &
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-Client-Working-Dir: /tmp/test-project" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Create a React Hello World app"}],
"tools": [{"type": "function", "function": {"name": "bash"}}, {"type": "function", "function": {"name": "write"}}]
}'
# Check token accuracy
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Hello"}]
}' | jq '.usage'
```
## Success Criteria
1. **Execution:** 100% of requests use TOOL: format (not descriptions)
2. **Accuracy:** Token counts match tiktoken within ±5%
3. **Completion:** Multi-file projects fully created via write tool
4. **No npx:** Manual file creation for React (no npx create-react-app)
## Implementation Notes
### Token Counting Fix
```python
# Replace: prompt_tokens = len(prompt) // 4
# With:
import tiktoken
encoding = tiktoken.get_encoding('cl100k_base')
prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(content))
```
### Tool Instructions Fix
- Add explicit "DO NOT USE npx create-react-app" instruction
- Add "EXECUTE IMMEDIATELY" mandate
- Show complete React example with manual file creation
### Non-Interactive Commands
- Auto-add --yes to npx commands
- Or recommend manual file creation instead
@@ -1,97 +0,0 @@
# Test Plan: Improved Tool Instructions
## Problem Statement
Model is not using tools effectively:
1. Creates README instead of actual project structure
2. Provides commands as text instead of executing them
3. Refuses to run commands claiming "I am only an AI assistant"
## Root Cause Analysis
Current instructions don't clearly communicate:
- That the model SHOULD use tools proactively
- That execution is expected, not explanation
- The workflow: user request → tool execution → result
## Unit Tests (Instruction Verification)
### Test 1: Instruction Presence
- [ ] Verify instructions are injected into system message
- [ ] Verify instructions appear at the START of system message (priority position)
### Test 2: Token Count
- [ ] Measure total token count of new instructions
- [ ] Verify ≤ 500 tokens (conservative budget)
- [ ] Document before/after
### Test 3: Format Compliance
- [ ] Verify instructions include TOOL:/ARGUMENTS: format
- [ ] Verify examples use correct format
- [ ] Verify rules are clear and numbered
## Integration Tests (Behavioral)
### Test 4: Project Creation Flow
**Input:** "Create a React Hello World app"
**Expected Behavior:**
1. Model responds with TOOL: bash, ARGUMENTS: mkdir myapp
2. After result, TOOL: write, ARGUMENTS: package.json content
3. After result, TOOL: write, ARGUMENTS: src/App.js content
4. Continue until complete project structure exists
**Failure Modes:**
- [ ] Model only describes what to do
- [ ] Model creates README only
- [ ] Model refuses to execute commands
### Test 5: Multi-step Task
**Input:** "Check what files exist, then create a test.txt file with 'hello' in it"
**Expected Behavior:**
1. TOOL: bash, ARGUMENTS: ls -la
2. Wait for result
3. TOOL: write, ARGUMENTS: test.txt with "hello"
**Failure Modes:**
- [ ] Model tries to do both in one response
- [ ] Model doesn't wait for ls result before writing
### Test 6: Command Refusal
**Input:** "Run npm install"
**Expected Behavior:**
1. TOOL: bash, ARGUMENTS: npm install
**Failure Modes:**
- [ ] Model responds: "I cannot run commands, I am only an AI assistant"
- [ ] Model explains npm install instead of running it
## Manual Verification Commands
```bash
# Start the server
python main.py --auto
# In another terminal, test with curl
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Create a React Hello World app"}],
"tools": [{"type": "function", "function": {"name": "bash", "description": "Run shell commands"}}, {"type": "function", "function": {"name": "write", "description": "Write files"}}]
}'
```
## Success Criteria
1. **Proactivity:** Model uses tools without being asked twice
2. **Execution:** Model runs commands, doesn't just describe them
3. **No Refusal:** Model never says "I cannot" or "I am only an AI"
4. **Completeness:** Multi-file projects are fully created via tools
5. **Format:** 100% of tool calls use correct TOOL:/ARGUMENTS: format
## Metrics
- **Tool usage rate:** % of requests that result in tool calls
- **Format compliance:** % of tool calls in correct format
- **Completion rate:** % of multi-step tasks fully completed
@@ -1,35 +0,0 @@
# Test Plan: Tool Parsing Simplification
## Unit Tests
- [x] Test case 1: Single tool call → Returns 1 tool with correct name and arguments
- [x] Test case 2: No tool in text → Returns None for tools, original text as content
- [x] Test case 3: Multiple tools → Returns all tools in order
- [x] Test case 4: Content before tool → Content extracted, tool parsed correctly
- [x] Test case 5: Bash tool → Correctly parses bash command
- [x] Test case 6: Case insensitive → "tool:" and "TOOL:" both work
- [x] Test case 7: Invalid JSON → Skips invalid, continues with valid
- [x] Test case 8: Empty text → Returns None, empty string
- [x] Test case 9: Whitespace only → Returns None
## Integration Tests
- [ ] End-to-end flow:
1. Send chat completion request with tools
2. Model responds with TOOL:/ARGUMENTS: format
3. Parser extracts tool call
4. Tool executes
5. Result returned in response
- [ ] Expected result: Tool executes successfully, result included in response
## Manual Verification
- [ ] Command: `python tests/test_tool_parsing.py`
- [ ] Expected output: "9 passed, 0 failed"
## Token Budget Verification
- Parser code: ~30 lines (~200 tokens)
- Well under 2000 token limit
- Simple regex pattern maintains low complexity