diff --git a/README.md b/README.md index 3514759..da6dfa3 100644 --- a/README.md +++ b/README.md @@ -91,7 +91,9 @@ python main.py --auto --federation python main.py --auto --federation ``` -Machines auto-discover each other and vote together on every request. +Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses **objective quality scoring** to pick the best answer, not self-reported confidence. This prevents smaller models from overruling better models. + +**Federation Endpoint**: Peers communicate via `POST /v1/federation/vote` (automatically configured). ## How Consensus Works @@ -147,7 +149,7 @@ All support GGUF quantization (Q4_K_M recommended). - `GET /v1/models` - List available models - `POST /v1/chat/completions` - Chat completion with consensus - `GET /health` - Health check -- `GET /v1/federation/peers` - List discovered peers (when federation enabled) +- `POST /v1/federation/vote` - Federation voting (used internally between peers) ## Troubleshooting @@ -282,21 +284,29 @@ Major refactoring completed to improve modularity: See `docs/ARCHITECTURE.md` for detailed architecture documentation. -## TODO / Roadmap +## Recent Improvements -### Planned Features +### ✅ Universal Tool Support (2025-02-25) +- Tool instructions automatically injected for **all** clients (Continue, hollama, curl, etc.) +- No client-side configuration needed - just use the API +- Enhanced file operation guidance: model uses ls/grep to verify files exist before reading +- Working directory auto-extraction from prompts (`in /path/to/dir` patterns) +- Proper OpenAI tool format with unique IDs and tool_call_id linking -- **Plan Mode**: Add a "plan mode" that disables tool execution for planning-only conversations. This would allow the model to discuss file changes without actually modifying them until explicitly confirmed. 
- - Usage: `--plan-mode` flag or API parameter - - When enabled: Model can see what tools would do but doesn't execute them - - Use case: Review changes before applying them +### ✅ OpenCode-Compatible Streaming (2025-02-25) +- Proper `reasoning_content` field for "Thinking..." collapsible blocks +- Multi-chunk `tool_calls` streaming matching Vercel AI SDK format +- Final answer delivered in `content` field after tool execution -### Current Status +### ✅ Federation Quality Voting (2025-02-25) +- Head node now **objectively judges** all peer responses using quality metrics +- No more reliance on self-reported confidence (which was biased toward the local model) +- All responses scored on length, structure, completeness +- Fair competition: 14B models properly beat 3B on quality tasks -- ✅ Tool instructions now injected by default for all clients -- ✅ Improved file operation safety (verify with ls/grep before reading) -- ✅ Working directory support (extracted from client context) -- 🔄 Plan mode - coming soon +### 🚧 Planned Features +- **Plan Mode**: Disable tool execution for planning-only conversations (`--plan-mode`) +- **Tool Consensus**: Verify tool calls across multiple workers before execution (for critical operations) ## Contributing diff --git a/docs/design/2024-02-24-complete-react-example.md b/docs/design/2024-02-24-complete-react-example.md deleted file mode 100644 index b004957..0000000 --- a/docs/design/2024-02-24-complete-react-example.md +++ /dev/null @@ -1,92 +0,0 @@ -# Design Decision: Complete React Example with Actual Code - -**Date:** 2024-02-24 -**Scope:** src/api/routes.py tool_instructions - -## Problem - -Model is still not following instructions: -1. Tries `npm install` before creating package.json -2. Still tries `npx create-react-app` despite being told not to -3. Instructions have placeholders like "..." and "etc." 
which models don't understand - -## Root Cause - -The current instructions say: -``` -TOOL: write -ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"dependencies\": {\"react\": \"^18.0.0\", \"react-dom\": \"^18.0.0\"}}"} - -[Continue with src/index.js, src/App.js, public/index.html, etc.] -``` - -**Problem:** "etc." and "..." are meaningless to LLMs. They need concrete examples. - -## Solution - -Provide a **complete, working, minimal React example** with actual file contents: - -1. Exact sequence: mkdir → write package.json → write src/App.js → write src/index.js → write public/index.html → npm install -2. Actual file content, not placeholders -3. Minimal viable React app (not full create-react-app structure) - -## Implementation - -Replace vague example with complete working code: - -``` -**COMPLETE REACT HELLO WORLD EXAMPLE:** - -User: "Create a React Hello World app" - -Step 1 - Create directory: -TOOL: bash -ARGUMENTS: {"command": "mkdir myapp"} - -Step 2 - Create package.json (MUST do this BEFORE npm install): -TOOL: write -ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"private\": true, \"dependencies\": {\"react\": \"^18.2.0\", \"react-dom\": \"^18.2.0\"}, \"scripts\": {\"start\": \"react-scripts start\", \"build\": \"react-scripts build\"}, \"devDependencies\": {\"react-scripts\": \"5.0.1\"}}"} - -Step 3 - Create src directory: -TOOL: bash -ARGUMENTS: {"command": "mkdir myapp/src"} - -Step 4 - Create App.js: -TOOL: write -ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "import React from 'react';\n\nfunction App() {\n return (\n
    <div className=\"App\">\n      <h1>Hello World</h1>\n      <p>Welcome to my React app!</p>\n    </div>\n  );\n}\n\nexport default App;"} - -Step 5 - Create index.js: -TOOL: write -ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "import React from 'react';\nimport ReactDOM from 'react-dom/client';\nimport App from './App';\n\nconst root = ReactDOM.createRoot(document.getElementById('root'));\nroot.render(<App />);"} - -Step 6 - Create public directory and index.html: -TOOL: bash -ARGUMENTS: {"command": "mkdir myapp/public"} - -TOOL: write -ARGUMENTS: {"filePath": "myapp/public/index.html", "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n  <meta charset=\"utf-8\">\n  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n  <title>React App</title>\n</head>\n<body>\n  <div id=\"root\"></div>\n</body>\n</html>
\n\n"} - -Step 7 - NOW install dependencies (AFTER package.json exists): -TOOL: bash -ARGUMENTS: {"command": "cd myapp && npm install"} -``` - -## Token Impact - -- Current: 586 tokens -- New: Estimated ~750 tokens (+164 tokens) -- Still under 2000 limit ✓ - -## Key Changes - -1. **Explicit sequencing:** "Step 1", "Step 2", etc. -2. **Actual code:** No "..." or "etc." - real working content -3. **Critical note:** "MUST do this BEFORE npm install" -4. **Minimal structure:** Just what's needed for Hello World - -## Success Criteria - -- [ ] Model creates package.json BEFORE running npm install -- [ ] Model does NOT use npx create-react-app -- [ ] Model creates all 4 files (package.json, App.js, index.js, index.html) -- [ ] Model runs npm install last (after files exist) diff --git a/docs/design/2024-02-24-fix-subprocess-hang.md b/docs/design/2024-02-24-fix-subprocess-hang.md deleted file mode 100644 index 0af3fa3..0000000 --- a/docs/design/2024-02-24-fix-subprocess-hang.md +++ /dev/null @@ -1,84 +0,0 @@ -# Design Decision: Fix Subprocess Hang on Interactive Commands - -**Date:** 2024-02-24 -**Scope:** src/tools/executor.py _execute_bash method -**Lines Changed:** 1 line - -## Problem - -When executing commands like `npx create-react-app`, the subprocess hangs indefinitely waiting for stdin input (e.g., "Ok to proceed? (y)"). This causes: -1. 300s timeout to be reached -2. opencode to hang waiting for response -3. Poor user experience - -## Root Cause - -`subprocess.run()` by default inherits stdin from parent process. When commands prompt for input: -- npx asks: "Need to install create-react-app@5.1.0 Ok to proceed? 
(y)" -- npm init asks for package details -- No input is provided, so it waits forever - -## Solution - -Add `stdin=subprocess.DEVNULL` to prevent commands from reading input: - -```python -result = subprocess.run( - command, - shell=True, - capture_output=True, - text=True, - timeout=timeout, - cwd=cwd, - stdin=subprocess.DEVNULL # Prevent interactive prompts from hanging -) -``` - -This causes commands that require input to fail immediately rather than hang. - -## Impact - -### Before -- Commands requiring input hang for 300s (timeout) -- User sees no response -- Eventually times out with error - -### After -- Commands requiring input fail fast -- Clear error message: "Exit code X: ..." -- No hang, immediate feedback - -## Side Effects - -**Positive:** -- No more hangs on interactive commands -- Faster failure detection -- Better error messages - -**Negative:** -- Commands that legitimately need stdin will fail -- But this is desired behavior - we want non-interactive execution - -## Testing - -Test with an interactive command: -```bash -# This should fail fast, not hang -python -c "from tools.executor import ToolExecutor; -import asyncio; -e = ToolExecutor(); -result = asyncio.run(e.execute('bash', {'command': 'read -p \"Enter something: \" var'})); -print(result)" -``` - -Expected: Quick failure, not a 30s hang - -## Related Changes - -This complements the tool instructions fix: -- Instructions now say "DO NOT use npx create-react-app" -- This fix ensures if model ignores instructions, it fails fast instead of hanging - -## Conclusion - -One-line fix prevents interactive command hangs, improving reliability and user experience. 
diff --git a/docs/design/2024-02-24-fix-tool-execution-tokens.md b/docs/design/2024-02-24-fix-tool-execution-tokens.md deleted file mode 100644 index 05c877b..0000000 --- a/docs/design/2024-02-24-fix-tool-execution-tokens.md +++ /dev/null @@ -1,178 +0,0 @@ -# Design Decision: Fix Tool Execution and Token Reporting - -**Date:** 2024-02-24 -**Scope:** src/api/routes.py tool_instructions and token counting - -## Problem Statement - -User report shows three critical failures: - -1. **Instruction vs Execution:** Model says "You should run mkdir..." instead of TOOL: format -2. **Inaccurate Token Reporting:** Using rough estimate `len(prompt) // 4` instead of actual token count -3. **Interactive Commands:** npx create-react-app prompts for confirmation, causing 300s timeout - -## Evidence - -``` -🖥️ BASH: mkdir react-hello-world && cd react-hello-world && npx create-react-app . -⏰ TIMEOUT after 300s -Partial output: Need to install the following packages: -create-react-app@5.1.0 -Ok to proceed? (y) -``` - -**Additional Context:** -- Directory created but empty (no files) -- Model posts instructions for user to follow instead of executing - -## Root Cause Analysis - -### 1. Instruction vs Execution -**Current instructions say:** "When asked to do something, EXECUTE it using tools" -**But model does:** "You should run mkdir..." -**Why:** Instructions aren't strong enough - need explicit anti-patterns - -### 2. Token Counting -**Current:** `prompt_tokens = len(prompt) // 4` (rough approximation) -**Problem:** Inaccurate for opencode context management -**Solution:** Use tiktoken for accurate counting - -### 3. 
Interactive Commands -**Current:** npx commands prompt for confirmation -**Problem:** Tool executor waits indefinitely, times out at 300s -**Solution:** Either: -- Add --yes flag automatically -- Forbid npx entirely, use manual file creation - -## Options Considered - -### Option 1: Strengthen Instructions Only -- Add more explicit "DO NOT" language -- Add complete React example -- Keep rough token estimation - -**Pros:** Simple, focused fix -**Cons:** Doesn't fix token accuracy or interactive command issue -**Verdict:** REJECTED - Incomplete fix - -### Option 2: Comprehensive Fix -- Strengthen instructions with anti-patterns -- Use tiktoken for accurate token counting -- Add non-interactive flags to package manager commands -- Update examples to show manual file creation - -**Pros:** Fixes all three issues -**Cons:** More complex changes -**Verdict:** ACCEPTED - Complete solution - -### Option 3: Change Architecture -- Move to client-side tool execution -- Different token counting approach - -**Pros:** Could solve multiple issues -**Cons:** Breaking change, out of scope -**Verdict:** REJECTED - Too broad - -## Decision - -Implement Option 2: Comprehensive fix addressing all three issues. - -### Changes - -#### 1. Tool Instructions Update -Add explicit anti-patterns and stronger language: -- "NEVER say 'You should...' - EXECUTE immediately" -- "DO NOT USE npx create-react-app - manually create files" -- Complete React example showing manual file creation - -#### 2. Token Counting Fix -Replace rough estimate with tiktoken: -```python -# Before -prompt_tokens = len(prompt) // 4 - -# After -import tiktoken -encoding = tiktoken.get_encoding('cl100k_base') -prompt_tokens = len(encoding.encode(prompt)) -completion_tokens = len(encoding.encode(content)) -``` - -#### 3. 
Non-Interactive Commands -Update instructions to specify: -- Use `npm init -y` (not interactive) -- Manually write package.json instead of npx -- All examples show manual file creation - -## Impact - -### Token Budget (Exact Count - cl100k_base) -- **New Instructions:** 586 tokens (2,067 characters) -- **Status:** Within 2000 token limit ✓ -- **Context window:** 16K model leaves ~15.4K for user input ✓ -- **Code comment:** Token count documented in src/api/routes.py ✓ - -### Breaking Changes -- **None** - Instructions clearer, format unchanged -- Token reporting more accurate (good thing) - -### Code Changes -- `src/api/routes.py`: - - Update tool_instructions (~+15 lines) - - Add tiktoken import - - Replace token estimation logic (~5 lines) - -## Testing Strategy - -1. **Token Accuracy Test:** - ```python - def test_token_accuracy(): - prompt = "Hello world" - content = "Hi there" - # Calculate with tiktoken - # Verify API returns same values - ``` - -2. **Instruction Content Test:** - - Verify "DO NOT USE npx" present - - Verify manual creation examples present - - Verify "EXECUTE not DESCRIBE" present - -3. **Integration Test:** - - Request: "Create React app" - - Expect: Manual file creation via write tool - - Not expect: npx create-react-app - -## Rollback Plan - -If issues arise: -1. Revert to previous instructions -2. Keep tiktoken for token counting (beneficial) -3. 
Document why manual creation didn't work - -## Success Metrics - -- [ ] Model uses TOOL: format 100% of time (not descriptions) -- [ ] Token counts accurate within ±2% -- [ ] React projects created via write tool (not npx) -- [ ] No timeouts on package manager commands - -## Implementation Notes - -### Token Counting -Need to ensure tiktoken is in requirements.txt - -### Tool Instructions -The key addition is: -``` -**FORBIDDEN PATTERNS:** -- "You should run mkdir myapp" → USE: TOOL: bash\nARGUMENTS: {"command": "mkdir myapp"} -- "npx create-react-app myapp" → USE: Manual file creation with write tool -- "First create package.json, then..." → USE: Execute immediately, don't list steps - -**REACT PROJECT - CORRECT APPROACH:** -1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"} -2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\"...}"} -3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "..."} -4. Continue until all files created -``` diff --git a/docs/design/2024-02-24-improved-tool-instructions.md b/docs/design/2024-02-24-improved-tool-instructions.md deleted file mode 100644 index 71b1016..0000000 --- a/docs/design/2024-02-24-improved-tool-instructions.md +++ /dev/null @@ -1,172 +0,0 @@ -# Design Decision: Improved Tool Instructions - -**Date:** 2024-02-24 -**Scope:** src/api/routes.py tool_instructions -**Lines Changed:** ~25 lines - -## Problem - -Current tool instructions (~125 tokens) fail to communicate key behavioral expectations: - -1. **Passive vs Active:** Model describes what to do instead of doing it -2. **Refusal:** Model claims "I am only an AI assistant" instead of executing -3. 
**Incomplete:** Multi-file projects result in README only - -Evidence from user report: -- Request: "Create React Hello World app" -- Result: README only (not actual files) -- Subsequent: Commands given as text, not executed -- Final: "I am only an AI assistant" refusal - -## Root Cause Analysis - -The instructions lack: -1. **Authority statement** - "You CAN and SHOULD use tools" -2. **Execution mandate** - "Execute commands, don't just describe them" -3. **Workflow clarity** - Clear step-by-step expectations -4. **Anti-pattern examples** - What NOT to do - -## Options Considered - -### Option 1: Minor Tweaks -Add a few lines to existing instructions. -- **Pros:** Minimal token increase -- **Cons:** Band-aid fix, may not solve root cause -- **Verdict:** REJECTED - Doesn't address behavioral issue - -### Option 2: Complete Rewrite with Strong Mandate -Rewrite instructions to emphasize: -- Proactive tool usage -- Execution over explanation -- Clear workflow -- Anti-patterns to avoid - -- **Pros:** Addresses root cause, clear behavioral guidance -- **Cons:** Higher token count (estimated 300-400 tokens) -- **Verdict:** ACCEPTED - Proper fix for behavioral issue - -### Option 3: Few-Shot Examples -Include full conversation examples in instructions. -- **Pros:** Shows exactly what to do -- **Cons:** Very high token count (1000+ tokens), may confuse model -- **Verdict:** REJECTED - Violates token budget - -## Decision - -Implement Option 2: Rewrite with emphasis on proactivity and execution. - -**Key additions:** -1. **Capability statement:** "You have tools. Use them." -2. **Execution mandate:** "Don't describe, execute" -3. **Workflow:** Clear request→tool→result→next cycle -4. 
**Anti-patterns:** Explicitly forbid "I cannot" responses - -## Impact - -### Token Budget (Exact Count - cl100k_base) -- **Current:** 478 tokens (1,810 characters) -- **Status:** Within 2000 token limit ✓ -- **Status:** Within 500 conservative estimate ✓ -- **Context window:** 16K model leaves ~15.5K for user input ✓ -- **Code comment:** Token count documented in src/api/routes.py ✓ - -### Code Changes -- **File:** src/api/routes.py -- **Lines:** +48/-18 (net +30) -- **Type:** Instructions replacement -- **Token documentation:** Added inline comment with exact token count - -### Breaking Changes -- **None** - Instructions are additive/clearer, not different format - -### Behavioral Changes -- **Expected:** More proactive tool usage -- **Expected:** No more "I cannot" refusals -- **Expected:** Multi-step projects completed via tools -- **Expected:** Commands executed, not described - -### Review Blockers Addressed -- ✅ Token budget documented in code (REVIEW-2026-02-24-cleanup-branch.md Blocker #1) -- ✅ Exact token count calculated using tiktoken (478 tokens) -- ✅ Token documentation added to src/api/routes.py (REVIEW-2026-02-24-cleanup-branch.md Blocker #2) -- ⚠️ Debug print statements - existing code, noted in DEVELOPMENT_PATTERNS.md as "still needs work" (not in scope of this change) -- ⚠️ Integration tests for instruction effectiveness - requires running actual LLM, covered by parsing tests - -## Implementation - -```python -tool_instructions = """You are an AI assistant with TOOL ACCESS. You CAN and MUST use tools to accomplish tasks. - -**CRITICAL RULES:** -1. When asked to do something, EXECUTE it using tools - don't just describe how -2. You MUST use the bash tool to run commands like mkdir, npm, ls, etc. -3. You MUST use the write tool to create files -4. NEVER say "I cannot" or "I am only an AI assistant" - you HAVE tools, USE them -5. 
Complete tasks FULLY - don't stop at README, create ALL required files - -**AVAILABLE TOOLS:** -- read: Read file content -- write: Create/overwrite files -- bash: Execute shell commands (npm, mkdir, ls, etc.) - -**TOOL FORMAT (STRICT):** -TOOL: tool_name -ARGUMENTS: {"param": "value"} - -**WORKFLOW:** -1. User asks for something -2. You decide what tool to use -3. You respond with ONLY the TOOL: format above -4. You receive the tool result -5. You continue with next tool until task is COMPLETE - -**EXAMPLES:** - -Creating a project: -User: "Create a React app" -You: TOOL: bash -ARGUMENTS: {"command": "mkdir myapp && cd myapp && npm init -y"} -[wait for result] -You: TOOL: write -ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."} -[continue until all files created] - -Running commands: -User: "Install dependencies" -You: TOOL: bash -ARGUMENTS: {"command": "npm install"} -[wait for result, then confirm completion] - -**WHAT NOT TO DO:** -- ❌ "To create a React app, you should run: mkdir myapp" (describing) -- ❌ "I cannot run commands, I am an AI" (refusing) -- ❌ Creating only README instead of full project (incomplete) -- ❌ "First do X, then do Y" (giving instructions instead of doing) - -**CORRECT BEHAVIOR:** -- ✅ Execute the command immediately using the bash tool -- ✅ Create all files using the write tool -- ✅ Continue until task is 100% complete -- ✅ Use ONE tool at a time and wait for results""" -``` - -## Testing - -1. Test with React Hello World request -2. Verify model uses bash to create directory structure -3. Verify model uses write to create all files -4. Verify no "I cannot" responses - -## Rollback Plan - -If new instructions cause issues: -1. Revert to previous ~125 token version -2. Analyze what specifically failed -3. 
Iterate on smaller changes - -## Success Metrics - -- [ ] Model uses tools on first request (not after prompting) -- [ ] Zero "I cannot" or "I am an AI" responses -- [ ] Multi-file projects fully created -- [ ] Commands executed, not described diff --git a/docs/design/2024-02-24-task-planning-verification.md b/docs/design/2024-02-24-task-planning-verification.md deleted file mode 100644 index 559b2bd..0000000 --- a/docs/design/2024-02-24-task-planning-verification.md +++ /dev/null @@ -1,151 +0,0 @@ -# Design Decision: Task Planning and Verification Workflow - -**Date:** 2024-02-24 -**Scope:** src/api/routes.py tool_instructions -**Problem:** Model creates folder but doesn't complete full task or verify completion - -## Problem Statement - -User reports: -1. "It just creates a folder with mkdir (without even checking if it already exists with ls)" -2. No verification that tasks are completed -3. No planning of full task scope -4. Model stops after one step instead of completing entire project - -## Root Cause - -Previous instructions told model to "execute immediately" but didn't teach: -1. **Planning** - What needs to be done -2. **Checking** - What already exists -3. **Verification** - Did the step work -4. **Completion loop** - Keep going until done - -## Solution - -Add **Task Completion Workflow** to instructions: - -``` -**TASK COMPLETION WORKFLOW (MANDATORY):** - -**1. PLAN:** List ALL steps needed before starting -**2. CHECK:** Use ls to verify what exists before creating -**3. EXECUTE:** Run first step -**4. VERIFY:** Confirm step worked (ls, read file) -**5. REPEAT:** Steps 3-4 until ALL complete -**6. FINAL CHECK:** Verify entire task is done -**7. CONFIRM:** Report completion with checklist -``` - -## Key Instruction Changes - -### Added Planning Phase -Before doing anything, model must think about complete scope: -- What files/directories? -- What dependencies? 
-- Complete task requirements - -### Added Verification Steps -Every step must be verified: -- `ls -la` after mkdir -- `read` file after write -- Check content is correct - -### Added Completion Loop -Model must continue until: -✓ All directories exist -✓ All files exist with correct content -✓ All dependencies installed -✓ Each component verified - -### Complete Working Example -Provided 13-step React example showing: -1. Check existing (ls) -2. Create directory -3. Verify created (ls) -4. Create package.json -5. Verify package.json (read) -6. Create source files -7. Final verification (find myapp -type f) -8. Install dependencies -9. Confirm completion checklist - -## Impact - -### Token Budget -- **Before:** 1,041 tokens -- **After:** 1,057 tokens (+16 tokens) -- **Status:** Under 2,000 limit ✓ - -### Behavioral Changes - -**Before:** -- Model: mkdir myapp -- User: That's it? -- Result: Empty directory - -**After:** -- Model checks what exists -- Creates complete project structure -- Verifies each file -- Confirms completion -- Result: Working React project - -## Success Criteria - -When user asks "Create React Hello World project", model should: -1. ✓ Check current directory contents -2. ✓ Create myapp/ directory -3. ✓ Verify directory created -4. ✓ Create package.json -5. ✓ Verify package.json content -6. ✓ Create src/App.js -7. ✓ Create src/index.js -8. ✓ Create public/index.html -9. ✓ Final verification (list all files) -10. ✓ npm install -11. ✓ Confirm completion checklist - -## Testing - -Test instructions contain: -- PLAN/CHECK keywords -- VERIFY keyword -- COMPLETE keyword - -All tests pass: 11/11 ✓ - -## Trade-offs - -**Pros:** -- Complete task execution -- Verification prevents partial work -- Clear completion criteria -- Better user experience - -**Cons:** -- More tokens (but still under limit) -- More verbose instructions -- May be slower (more verification steps) - -## Related Files Changed - -1. src/api/routes.py - Updated tool_instructions -2. 
tests/test_tool_parsing.py - Updated tests for new content -3. docs/design/2024-02-24-task-planning-verification.md - This doc - -## Future Improvements - -1. **Task Queue System:** Server-side queue of pending operations -2. **State Persistence:** Remember what's been done across conversations -3. **Smart Resumption:** If interrupted, pick up where left off -4. **Progress Reporting:** Show % complete during long tasks - -## Conclusion - -The new workflow teaches the model to be systematic: -1. Plan before acting -2. Check before creating -3. Verify after each step -4. Continue until complete - -This should resolve the "only creates folder" issue and ensure complete project creation. diff --git a/docs/design/2024-02-24-tool-parsing-simplification.md b/docs/design/2024-02-24-tool-parsing-simplification.md deleted file mode 100644 index a31c268..0000000 --- a/docs/design/2024-02-24-tool-parsing-simplification.md +++ /dev/null @@ -1,132 +0,0 @@ -# Design Decision: Tool Parsing Simplification - -**Date:** 2024-02-24 -**Scope:** src/api/routes.py parse_tool_calls function -**Lines Changed:** ~210 lines removed, ~30 lines added - -## Problem - -The tool parsing code had accumulated 4 different parsing formats over 25+ commits: -1. JSON `tool_calls` format with nested objects -2. TOOL:/ARGUMENTS: format (simple text) -3. Function pattern format `func_name(args)` -4. 
Multiple JSON handling variants - -This caused: -- Circular development (adding/removing formats repeatedly) -- No single source of truth -- Complex, unmaintainable code -- No confidence that changes wouldn't break existing cases - -## Options Considered - -### Option 1: Keep All Formats -- **Pros:** Backward compatible -- **Cons:** 210 lines of unmaintainable code, continues circular development pattern -- **Verdict:** REJECTED - Perpetuates the problem - -### Option 2: Standardize on TOOL:/ARGUMENTS: Only -- **Pros:** - - Simple regex pattern (~30 lines) - - Matches current tool instructions - - Easy to test - - Clear single format for models -- **Cons:** - - Breaking change if any code relies on old formats - - Need to update any existing examples/docs -- **Verdict:** ACCEPTED - Aligns with Rule 5 (Parse Once, Parse Well) - -### Option 3: Create Parser per Format with Feature Flags -- **Pros:** Flexible, can toggle formats -- **Cons:** - - Violates Rule 5 and "No Feature Flags in Core Logic" - - Still maintains multiple code paths -- **Verdict:** REJECTED - Doesn't solve the root problem - -## Decision - -Standardize on the TOOL:/ARGUMENTS: format only. Remove all other parsing code. 
- -**Rationale:** -- Per DEVELOPMENT_PATTERNS.md recommendation #3: "One Format Only" -- Token cost is minimal (no complex regex) -- Test coverage provides confidence -- Aligns with existing tool instructions - -## Impact - -### Token Count -- **Parser code:** 210 lines → 30 lines (-180 lines) -- **No change** to tool instructions (separate optimization) - -### Breaking Changes -- **Yes** - Removes support for: - - JSON `tool_calls` format in model responses - - Function pattern format `read_file(path="test.txt")` - -**Migration:** Models must use: -``` -TOOL: read -ARGUMENTS: {"filePath": "test.txt"} -``` - -### Testing -- Unit tests added: 9 test cases -- Coverage: All parsing scenarios -- All tests pass - -## Implementation - -```python -# New implementation (30 lines) -def parse_tool_calls(text: str) -> tuple: - """Parse tool calls using standardized format.""" - import json - import re - - tool_pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})' - tool_matches = list(re.finditer(tool_pattern, text, re.IGNORECASE)) - - if not tool_matches: - return text, None - - tool_calls = [] - for i, tool_match in enumerate(tool_matches): - tool_name = tool_match.group(1) - args_str = tool_match.group(2) - try: - args_dict = json.loads(args_str) - tool_calls.append({ - "id": f"call_{i+1}", - "type": "function", - "function": { - "name": tool_name, - "arguments": json.dumps(args_dict) - } - }) - except json.JSONDecodeError: - continue - - if not tool_calls: - return text, None - - first_start = tool_matches[0].start() - content = text[:first_start].strip() - - return content, tool_calls -``` - -## Verification - -Run tests: -```bash -python tests/test_tool_parsing.py -``` - -Expected: 9 passed, 0 failed - -## Follow-up - -- [x] Update DEVELOPMENT_PATTERNS.md to mark as completed -- [x] Add unit tests -- [ ] Consider integration test for full tool execution flow diff --git a/docs/design/2024-02-25-reduce-system-prompt-tokens.md 
b/docs/design/2024-02-25-reduce-system-prompt-tokens.md deleted file mode 100644 index 5713d9f..0000000 --- a/docs/design/2024-02-25-reduce-system-prompt-tokens.md +++ /dev/null @@ -1,98 +0,0 @@ -# Investigation: 31k Token Context Issue - -## Problem -When making requests through opencode to local_swarm, the LLM receives ~31k tokens of context even for simple empty directory queries. - -## Root Cause Identified - -**NOT an issue with this repo's codebase - this is expected behavior for function calling.** - -### How it works: - -1. **opencode sends tool definitions** in the system message using OpenAI's function calling format -2. **Each tool definition is ~450 tokens** (name + description + parameters) -3. **opencode has ~60 tools** (read, write, bash, glob, grep, edit, question, webfetch, task, etc.) -4. **Total tool definition tokens:** ~27,000 tokens - -### Calculation: -``` -Single tool definition: ~450 tokens -Number of tools: ~60 -Tool schemas total: ~27,000 tokens -System message: ~500 tokens -User query: ~100 tokens ---- -Total: ~27,600 tokens -``` - -**This matches the observed ~31k tokens.** - -## Why This Happens - -OpenAI's function calling protocol requires sending the **complete function schemas** to the LLM with every request. This is how the model: -- Knows what tools are available -- Understands parameter requirements -- Knows how to format tool calls - -All major LLM providers using function calling work this way (OpenAI, Anthropic, local models, etc.). 
- -## Verification - -```bash -python -c " -import tiktoken -enc = tiktoken.get_encoding('cl100k_base') - -# Example from actual opencode tool definition -read_tool_schema = '''{\"type\": \"function\", \"function\": {\"name\": \"read\", \"description\": \"Read a file or directory from the local filesystem...[full description]\", \"parameters\": {...}}}''' - -print(f'Single tool schema: {len(enc.encode(read_tool_schema))} tokens') -print(f'Estimated 60 tools: {len(enc.encode(read_tool_schema)) * 60:,} tokens') -" -``` - -Result: -- Single tool definition: ~451 tokens -- 60 tools: ~27,060 tokens -- Plus system + user message: ~27,660 total - -## This Is NOT a Bug - -The 31k token context is **correct and expected** for function calling with 60+ tools. This is how: -- OpenAI API works -- Claude API works -- Local models with function calling work - -## Potential Optimizations (Optional) - -If reducing context size is critical, consider: - -### Option 1: Dynamic Tool Selection -- Only send tools relevant to current task -- Example: For file operations, only send [read, write, glob, edit] -- Trade-off: Requires opencode to intelligently filter tools - -### Option 2: Compressed Tool Descriptions -- Shorten tool descriptions to essentials -- Example: "Read file at path (required: filePath)" -- Trade-off: Model may make more errors with less guidance - -### Option 3: Tool Grouping -- Group similar tools into single "tools: [read, write, glob]" parameter -- Trade-off: Breaks OpenAI compatibility - -## Recommendation - -**NO ACTION REQUIRED.** The 31k token context is: -- Standard for function calling with many tools -- Within capabilities of modern LLMs (32k-128k context windows) -- Not caused by this repo's code - -The `.opencodeignore` created earlier will help with opencode's own system prompt, but doesn't affect the LLM context sent to local_swarm. 
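Option 1 (dynamic tool selection) could look roughly like the following sketch; `select_tools`, the keyword heuristic, and the always-on core set are hypothetical, not existing opencode behavior:

```python
from typing import Iterable

def select_tools(
    all_tools: list[dict],
    user_message: str,
    always_keep: Iterable[str] = ("read", "write", "bash"),
) -> list[dict]:
    # Keep a small core set of tools, plus any tool explicitly named in the request.
    text = user_message.lower()
    keep = set(always_keep)
    return [
        tool for tool in all_tools
        if tool["function"]["name"] in keep or tool["function"]["name"] in text
    ]

tools = [
    {"type": "function", "function": {"name": n, "description": f"{n} tool"}}
    for n in ["read", "write", "bash", "glob", "grep", "edit", "webfetch"]
]
subset = select_tools(tools, "Please grep the logs for errors")
print([t["function"]["name"] for t in subset])  # ['read', 'write', 'bash', 'grep']
```

Sending only this subset instead of all ~60 schemas would cut the ~27k tokens of tool definitions substantially, at the cost of the filtering heuristic occasionally dropping a tool the model actually needs.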
-
-## Additional Finding
-
-While investigating, verified:
-- `config/prompts/tool_instructions.txt`: 125 tokens ✅
-- This repo's tool execution code: No token bloat ✅
-- Issue is purely opencode's function calling protocol ✅
diff --git a/docs/test-plans/fix-tool-execution-tokens.md b/docs/test-plans/fix-tool-execution-tokens.md
deleted file mode 100644
index 629ec41..0000000
--- a/docs/test-plans/fix-tool-execution-tokens.md
+++ /dev/null
@@ -1,112 +0,0 @@
-# Test Plan: Fix Tool Execution and Token Reporting
-
-## Problem Analysis
-
-### Issue 1: Model Gives Instructions Instead of Executing
-**Current behavior:** Model describes what to do ("You should run mkdir...") instead of using TOOL: format
-**Expected:** Model responds with TOOL: bash\nARGUMENTS: {"command": "mkdir..."}
-
-### Issue 2: Token Counting Inaccurate
-**Current:** Rough estimate `len(prompt) // 4`
-**Expected:** Accurate token count using tiktoken
-**Impact:** opencode can't properly manage context window
-
-### Issue 3: npx Commands Timeout/Need Input
-**Current:** `npx create-react-app .` prompts for confirmation (y/n)
-**Expected:** Non-interactive execution or manual file creation
-**Evidence:** "Need to install the following packages: create-react-app@5.1.0 Ok to proceed? (y)"
-
-## Unit Tests
-
-### Test 1: Accurate Token Counting
-- [ ] Verify token count uses tiktoken (not rough estimate)
-- [ ] Test with known token counts
-- [ ] Verify prompt_tokens + completion_tokens = total_tokens
-
-### Test 2: Non-Interactive Bash Commands
-- [ ] Verify npm/npx commands use --yes or equivalent flags
-- [ ] Test timeout handling for package managers
-- [ ] Verify commands don't prompt for user input
-
-### Test 3: Tool Instructions Content
-- [ ] Verify instructions emphasize "EXECUTE not DESCRIBE"
-- [ ] Verify manual file creation examples (not npx)
-- [ ] Verify anti-patterns are clearly stated
-
-## Integration Tests
-
-### Test 4: End-to-End React Project Creation
-**Input:** "Create a React Hello World app"
-
-**Expected Flow:**
-1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
-2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
-3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "..."}
-4. Continue until complete
-
-**Failure Modes:**
-- [ ] Model describes steps instead of executing
-- [ ] Uses npx create-react-app (should manually create files)
-- [ ] Stops after README only
-
-### Test 5: Token Reporting Accuracy
-**Input:** Any chat completion request
-
-**Expected:**
-- usage.prompt_tokens matches actual tokens
-- usage.completion_tokens matches actual tokens
-- usage.total_tokens is sum
-
-**Verification:**
-- Compare tiktoken count vs API response
-
-## Manual Verification
-
-```bash
-# Test React creation
-python main.py --auto &
-curl -X POST http://localhost:17615/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -H "X-Client-Working-Dir: /tmp/test-project" \
-  -d '{
-    "model": "local-swarm",
-    "messages": [{"role": "user", "content": "Create a React Hello World app"}],
-    "tools": [{"type": "function", "function": {"name": "bash"}}, {"type": "function", "function": {"name": "write"}}]
-  }'
-
-# Check token accuracy
-curl -X POST http://localhost:17615/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "local-swarm",
-    "messages": [{"role": "user", "content": "Hello"}]
-  }' | jq '.usage'
-```
-
-## Success Criteria
-
-1. **Execution:** 100% of requests use TOOL: format (not descriptions)
-2. **Accuracy:** Token counts match tiktoken within ±5%
-3. **Completion:** Multi-file projects fully created via write tool
-4. **No npx:** Manual file creation for React (no npx create-react-app)
-
-## Implementation Notes
-
-### Token Counting Fix
-```python
-# Replace: prompt_tokens = len(prompt) // 4
-# With:
-import tiktoken
-encoding = tiktoken.get_encoding('cl100k_base')
-prompt_tokens = len(encoding.encode(prompt))
-completion_tokens = len(encoding.encode(content))
-```
-
-### Tool Instructions Fix
-- Add explicit "DO NOT USE npx create-react-app" instruction
-- Add "EXECUTE IMMEDIATELY" mandate
-- Show complete React example with manual file creation
-
-### Non-Interactive Commands
-- Auto-add --yes to npx commands
-- Or recommend manual file creation instead
diff --git a/docs/test-plans/improved-tool-instructions.md b/docs/test-plans/improved-tool-instructions.md
deleted file mode 100644
index fafc02d..0000000
--- a/docs/test-plans/improved-tool-instructions.md
+++ /dev/null
@@ -1,97 +0,0 @@
-# Test Plan: Improved Tool Instructions
-
-## Problem Statement
-Model is not using tools effectively:
-1. Creates README instead of actual project structure
-2. Provides commands as text instead of executing them
-3. Refuses to run commands claiming "I am only an AI assistant"
-
-## Root Cause Analysis
-Current instructions don't clearly communicate:
-- That the model SHOULD use tools proactively
-- That execution is expected, not explanation
-- The workflow: user request → tool execution → result
-
-## Unit Tests (Instruction Verification)
-
-### Test 1: Instruction Presence
-- [ ] Verify instructions are injected into system message
-- [ ] Verify instructions appear at the START of system message (priority position)
-
-### Test 2: Token Count
-- [ ] Measure total token count of new instructions
-- [ ] Verify ≤ 500 tokens (conservative budget)
-- [ ] Document before/after
-
-### Test 3: Format Compliance
-- [ ] Verify instructions include TOOL:/ARGUMENTS: format
-- [ ] Verify examples use correct format
-- [ ] Verify rules are clear and numbered
-
-## Integration Tests (Behavioral)
-
-### Test 4: Project Creation Flow
-**Input:** "Create a React Hello World app"
-
-**Expected Behavior:**
-1. Model responds with TOOL: bash, ARGUMENTS: mkdir myapp
-2. After result, TOOL: write, ARGUMENTS: package.json content
-3. After result, TOOL: write, ARGUMENTS: src/App.js content
-4. Continue until complete project structure exists
-
-**Failure Modes:**
-- [ ] Model only describes what to do
-- [ ] Model creates README only
-- [ ] Model refuses to execute commands
-
-### Test 5: Multi-step Task
-**Input:** "Check what files exist, then create a test.txt file with 'hello' in it"
-
-**Expected Behavior:**
-1. TOOL: bash, ARGUMENTS: ls -la
-2. Wait for result
-3. TOOL: write, ARGUMENTS: test.txt with "hello"
-
-**Failure Modes:**
-- [ ] Model tries to do both in one response
-- [ ] Model doesn't wait for ls result before writing
-
-### Test 6: Command Refusal
-**Input:** "Run npm install"
-
-**Expected Behavior:**
-1. TOOL: bash, ARGUMENTS: npm install
-
-**Failure Modes:**
-- [ ] Model responds: "I cannot run commands, I am only an AI assistant"
-- [ ] Model explains npm install instead of running it
-
-## Manual Verification Commands
-
-```bash
-# Start the server
-python main.py --auto
-
-# In another terminal, test with curl
-curl -X POST http://localhost:17615/v1/chat/completions \
-  -H "Content-Type: application/json" \
-  -d '{
-    "model": "local-swarm",
-    "messages": [{"role": "user", "content": "Create a React Hello World app"}],
-    "tools": [{"type": "function", "function": {"name": "bash", "description": "Run shell commands"}}, {"type": "function", "function": {"name": "write", "description": "Write files"}}]
-  }'
-```
-
-## Success Criteria
-
-1. **Proactivity:** Model uses tools without being asked twice
-2. **Execution:** Model runs commands, doesn't just describe them
-3. **No Refusal:** Model never says "I cannot" or "I am only an AI"
-4. **Completeness:** Multi-file projects are fully created via tools
-5. **Format:** 100% of tool calls use correct TOOL:/ARGUMENTS: format
-
-## Metrics
-
-- **Tool usage rate:** % of requests that result in tool calls
-- **Format compliance:** % of tool calls in correct format
-- **Completion rate:** % of multi-step tasks fully completed
diff --git a/docs/test-plans/tool-parsing-simplification.md b/docs/test-plans/tool-parsing-simplification.md
deleted file mode 100644
index 114b37b..0000000
--- a/docs/test-plans/tool-parsing-simplification.md
+++ /dev/null
@@ -1,35 +0,0 @@
-# Test Plan: Tool Parsing Simplification
-
-## Unit Tests
-
-- [x] Test case 1: Single tool call → Returns 1 tool with correct name and arguments
-- [x] Test case 2: No tool in text → Returns None for tools, original text as content
-- [x] Test case 3: Multiple tools → Returns all tools in order
-- [x] Test case 4: Content before tool → Content extracted, tool parsed correctly
-- [x] Test case 5: Bash tool → Correctly parses bash command
-- [x] Test case 6: Case insensitive → "tool:" and "TOOL:" both work
-- [x] Test case 7: Invalid JSON → Skips invalid, continues with valid
-- [x] Test case 8: Empty text → Returns None, empty string
-- [x] Test case 9: Whitespace only → Returns None
-
-## Integration Tests
-
-- [ ] End-to-end flow:
-  1. Send chat completion request with tools
-  2. Model responds with TOOL:/ARGUMENTS: format
-  3. Parser extracts tool call
-  4. Tool executes
-  5. Result returned in response
-
-- [ ] Expected result: Tool executes successfully, result included in response
-
-## Manual Verification
-
-- [ ] Command: `python tests/test_tool_parsing.py`
-- [ ] Expected output: "9 passed, 0 failed"
-
-## Token Budget Verification
-
-- Parser code: ~30 lines (~200 tokens)
-- Well under 2000 token limit
-- Simple regex pattern maintains low complexity
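For reference, a parser satisfying the nine unit tests above can be sketched in about the stated ~30 lines. This is a minimal reconstruction from the test cases, not the repo's actual code; the function name `parse_tools` and its return shape are assumptions:

```python
import json
import re

# One TOOL:/ARGUMENTS: pair; non-greedy brace match with a lookahead so
# multiple calls split cleanly and stay in order. IGNORECASE covers the
# "tool:" vs "TOOL:" case-insensitivity test.
TOOL_PATTERN = re.compile(
    r"TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{.*?\})\s*(?=\nTOOL:|\Z)",
    re.IGNORECASE | re.DOTALL,
)

def parse_tools(text: str):
    """Return (tool_calls or None, content preceding the first tool call)."""
    if not text or not text.strip():
        return None, ""  # empty / whitespace-only cases
    calls = []
    first = None
    for m in TOOL_PATTERN.finditer(text):
        if first is None:
            first = m.start()
        try:
            args = json.loads(m.group(2))
        except json.JSONDecodeError:
            continue  # skip invalid JSON, keep parsing valid calls
        calls.append({"name": m.group(1), "arguments": args})
    if not calls:
        return None, text  # no tool in text: original text is the content
    return calls, text[:first].strip()
```

The single-regex approach is what keeps the token budget low: there is no stateful parsing, just one pattern plus a JSON decode per match.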