docs: update README with current features and remove outdated docs
- Removed old design docs and test plans from docs/ directory
- Updated TODO section to reflect completed improvements
- Added section on Recent Improvements with detailed changelog
- Updated Federation description to explain objective quality voting
- Added federation vote endpoint to API endpoints list
- Clarified universal tool support and OpenCode streaming compatibility
- All changes ready for main branch merge
@@ -91,7 +91,9 @@ python main.py --auto --federation
 python main.py --auto --federation
 ```
 
-Machines auto-discover each other and vote together on every request.
+Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses **objective quality scoring** to pick the best answer, not self-reported confidence. This prevents smaller models from overruling better models.
+
+**Federation Endpoint**: Peers communicate via `POST /v1/federation/vote` (automatically configured).
 
 ## How Consensus Works
@@ -147,7 +149,7 @@ All support GGUF quantization (Q4_K_M recommended).
 - `GET /v1/models` - List available models
 - `POST /v1/chat/completions` - Chat completion with consensus
 - `GET /health` - Health check
 - `GET /v1/federation/peers` - List discovered peers (when federation enabled)
+- `POST /v1/federation/vote` - Federation voting (used internally between peers)
 
 ## Troubleshooting
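The endpoints above are OpenAI-compatible, so a chat request is just a standard JSON POST. A minimal sketch of building one (the host/port and model name here are assumptions — adjust to wherever the server actually runs and to what `GET /v1/models` reports):

```python
import json
import urllib.request

# Assumed base URL for a locally running server.
BASE_URL = "http://localhost:8000"

def build_chat_request(prompt: str) -> urllib.request.Request:
    """Build an OpenAI-style chat completion request for the consensus server."""
    body = {
        "model": "default",  # hypothetical model id; list real ones via GET /v1/models
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        f"{BASE_URL}/v1/chat/completions",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

req = build_chat_request("Hello")
# urllib.request.urlopen(req) would send it; the response follows the usual
# chat.completion JSON shape with a "choices" list.
```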
@@ -282,21 +284,29 @@ Major refactoring completed to improve modularity:
 
 See `docs/ARCHITECTURE.md` for detailed architecture documentation.
 
-## TODO / Roadmap
+## Recent Improvements
 
-### Planned Features
+### ✅ Universal Tool Support (2025-02-25)
+- Tool instructions automatically injected for **all** clients (Continue, hollama, curl, etc.)
+- No client-side configuration needed - just use the API
+- Enhanced file operation guidance: model uses ls/grep to verify files exist before reading
+- Working directory auto-extraction from prompts (`in /path/to/dir` patterns)
+- Proper OpenAI tool format with unique IDs and tool_call_id linking
 
-- **Plan Mode**: Add a "plan mode" that disables tool execution for planning-only conversations. This would allow the model to discuss file changes without actually modifying them until explicitly confirmed.
-  - Usage: `--plan-mode` flag or API parameter
-  - When enabled: Model can see what tools would do but doesn't execute them
-  - Use case: Review changes before applying them
+### ✅ OpenCode-Compatible Streaming (2025-02-25)
+- Proper `reasoning_content` field for "Thinking..." collapsible blocks
+- Multi-chunk `tool_calls` streaming matching Vercel AI SDK format
+- Final answer delivered in `content` field after tool execution
 
-### Current Status
+### ✅ Federation Quality Voting (2025-02-25)
+- Head node now **objectively judges** all peer responses using quality metrics
+- No more reliance on self-reported confidence (which biased toward local)
+- All responses scored on length, structure, completeness
+- Fair competition: 14B models properly beat 3B on quality tasks
 
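The README does not spell out the quality metrics themselves. A minimal sketch of what scoring responses on length, structure, and completeness could look like — the weights and heuristics below are illustrative assumptions, not the project's actual code:

```python
def quality_score(response: str) -> float:
    """Illustrative objective score for a peer response; higher is better."""
    length = min(len(response) / 2000.0, 1.0)  # reward substance, but cap it
    structure = 0.0
    if "```" in response:
        structure += 0.5  # contains code blocks
    if any(line.startswith(("#", "-", "*")) for line in response.splitlines()):
        structure += 0.5  # has headings or lists
    # Treat responses that trail off as incomplete.
    completeness = 0.0 if response.rstrip().endswith(("...", ":")) else 1.0
    return 0.4 * length + 0.3 * structure + 0.3 * completeness

responses = {
    "peer-a": "Short.",
    "peer-b": "# Plan\n- step one\n```python\nprint(1)\n```\nDone.",
}
# The head node would pick the highest-scoring peer.
best = max(responses, key=lambda peer: quality_score(responses[peer]))
```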
-- ✅ Tool instructions now injected by default for all clients
-- ✅ Improved file operation safety (verify with ls/grep before reading)
-- ✅ Working directory support (extracted from client context)
-- 🔄 Plan mode - coming soon
+### 🚧 Planned Features
+- **Plan Mode**: Disable tool execution for planning-only conversations (`--plan-mode`)
+- **Tool Consensus**: Verify tool calls across multiple workers before execution (for critical operations)
 
 ## Contributing
 
@@ -1,92 +0,0 @@
# Design Decision: Complete React Example with Actual Code

**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions

## Problem

Model is still not following instructions:
1. Tries `npm install` before creating package.json
2. Still tries `npx create-react-app` despite being told not to
3. Instructions have placeholders like "..." and "etc." which models don't understand

## Root Cause

The current instructions say:
```
TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"dependencies\": {\"react\": \"^18.0.0\", \"react-dom\": \"^18.0.0\"}}"}

[Continue with src/index.js, src/App.js, public/index.html, etc.]
```

**Problem:** "etc." and "..." are meaningless to LLMs. They need concrete examples.

## Solution

Provide a **complete, working, minimal React example** with actual file contents:

1. Exact sequence: mkdir → write package.json → write src/App.js → write src/index.js → write public/index.html → npm install
2. Actual file content, not placeholders
3. Minimal viable React app (not full create-react-app structure)

## Implementation

Replace vague example with complete working code:

```
**COMPLETE REACT HELLO WORLD EXAMPLE:**

User: "Create a React Hello World app"

Step 1 - Create directory:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp"}

Step 2 - Create package.json (MUST do this BEFORE npm install):
TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"private\": true, \"dependencies\": {\"react\": \"^18.2.0\", \"react-dom\": \"^18.2.0\"}, \"scripts\": {\"start\": \"react-scripts start\", \"build\": \"react-scripts build\"}, \"devDependencies\": {\"react-scripts\": \"5.0.1\"}}"}

Step 3 - Create src directory:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp/src"}

Step 4 - Create App.js:
TOOL: write
ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "import React from 'react';\n\nfunction App() {\n  return (\n    <div className=\"App\">\n      <h1>Hello World</h1>\n      <p>Welcome to my React app!</p>\n    </div>\n  );\n}\n\nexport default App;"}

Step 5 - Create index.js:
TOOL: write
ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "import React from 'react';\nimport ReactDOM from 'react-dom/client';\nimport App from './App';\n\nconst root = ReactDOM.createRoot(document.getElementById('root'));\nroot.render(<App />);"}

Step 6 - Create public directory and index.html:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp/public"}

TOOL: write
ARGUMENTS: {"filePath": "myapp/public/index.html", "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n  <meta charset=\"UTF-8\">\n  <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n  <title>React App</title>\n</head>\n<body>\n  <div id=\"root\"></div>\n</body>\n</html>"}

Step 7 - NOW install dependencies (AFTER package.json exists):
TOOL: bash
ARGUMENTS: {"command": "cd myapp && npm install"}
```
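The escaped `content` string in Step 2 has to decode to valid JSON once the write tool unescapes it; a quick sanity check of that exact payload:

```python
import json

# The package.json content from Step 2, as the write tool would receive it.
pkg_content = (
    '{"name": "myapp", "version": "1.0.0", "private": true, '
    '"dependencies": {"react": "^18.2.0", "react-dom": "^18.2.0"}, '
    '"scripts": {"start": "react-scripts start", "build": "react-scripts build"}, '
    '"devDependencies": {"react-scripts": "5.0.1"}}'
)
pkg = json.loads(pkg_content)  # raises ValueError if the example were malformed
```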

## Token Impact

- Current: 586 tokens
- New: Estimated ~750 tokens (+164 tokens)
- Still under 2000 limit ✓

## Key Changes

1. **Explicit sequencing:** "Step 1", "Step 2", etc.
2. **Actual code:** No "..." or "etc." - real working content
3. **Critical note:** "MUST do this BEFORE npm install"
4. **Minimal structure:** Just what's needed for Hello World

## Success Criteria

- [ ] Model creates package.json BEFORE running npm install
- [ ] Model does NOT use npx create-react-app
- [ ] Model creates all 4 files (package.json, App.js, index.js, index.html)
- [ ] Model runs npm install last (after files exist)
@@ -1,84 +0,0 @@
# Design Decision: Fix Subprocess Hang on Interactive Commands

**Date:** 2024-02-24
**Scope:** src/tools/executor.py _execute_bash method
**Lines Changed:** 1 line

## Problem

When executing commands like `npx create-react-app`, the subprocess hangs indefinitely waiting for stdin input (e.g., "Ok to proceed? (y)"). This causes:
1. 300s timeout to be reached
2. opencode to hang waiting for response
3. Poor user experience

## Root Cause

`subprocess.run()` by default inherits stdin from parent process. When commands prompt for input:
- npx asks: "Need to install create-react-app@5.1.0 Ok to proceed? (y)"
- npm init asks for package details
- No input is provided, so it waits forever

## Solution

Add `stdin=subprocess.DEVNULL` to prevent commands from reading input:

```python
result = subprocess.run(
    command,
    shell=True,
    capture_output=True,
    text=True,
    timeout=timeout,
    cwd=cwd,
    stdin=subprocess.DEVNULL  # Prevent interactive prompts from hanging
)
```

This causes commands that require input to fail immediately rather than hang.

## Impact

### Before
- Commands requiring input hang for 300s (timeout)
- User sees no response
- Eventually times out with error

### After
- Commands requiring input fail fast
- Clear error message: "Exit code X: ..."
- No hang, immediate feedback

## Side Effects

**Positive:**
- No more hangs on interactive commands
- Faster failure detection
- Better error messages

**Negative:**
- Commands that legitimately need stdin will fail
- But this is desired behavior - we want non-interactive execution

## Testing

Test with an interactive command:
```bash
# This should fail fast, not hang
python -c "from tools.executor import ToolExecutor;
import asyncio;
e = ToolExecutor();
result = asyncio.run(e.execute('bash', {'command': 'read -p \"Enter something: \" var'}));
print(result)"
```

Expected: Quick failure, not a 300s hang
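The fail-fast behavior can also be checked without the project code, using the same `stdin=subprocess.DEVNULL` arrangement directly: `read` sees EOF immediately and exits non-zero instead of blocking (a standalone sketch):

```python
import subprocess
import time

start = time.monotonic()
result = subprocess.run(
    "read var",                 # would block forever waiting on stdin
    shell=True,
    capture_output=True,
    text=True,
    timeout=30,
    stdin=subprocess.DEVNULL,   # stdin is at EOF, so `read` fails immediately
)
elapsed = time.monotonic() - start
# result.returncode is non-zero and elapsed is well under a second
```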

## Related Changes

This complements the tool instructions fix:
- Instructions now say "DO NOT use npx create-react-app"
- This fix ensures if model ignores instructions, it fails fast instead of hanging

## Conclusion

One-line fix prevents interactive command hangs, improving reliability and user experience.
@@ -1,178 +0,0 @@
# Design Decision: Fix Tool Execution and Token Reporting

**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions and token counting

## Problem Statement

User report shows three critical failures:

1. **Instruction vs Execution:** Model says "You should run mkdir..." instead of TOOL: format
2. **Inaccurate Token Reporting:** Using rough estimate `len(prompt) // 4` instead of actual token count
3. **Interactive Commands:** npx create-react-app prompts for confirmation, causing 300s timeout

## Evidence

```
🖥️ BASH: mkdir react-hello-world && cd react-hello-world && npx create-react-app .
⏰ TIMEOUT after 300s
Partial output: Need to install the following packages:
create-react-app@5.1.0
Ok to proceed? (y)
```

**Additional Context:**
- Directory created but empty (no files)
- Model posts instructions for user to follow instead of executing

## Root Cause Analysis

### 1. Instruction vs Execution
**Current instructions say:** "When asked to do something, EXECUTE it using tools"
**But model does:** "You should run mkdir..."
**Why:** Instructions aren't strong enough - need explicit anti-patterns

### 2. Token Counting
**Current:** `prompt_tokens = len(prompt) // 4` (rough approximation)
**Problem:** Inaccurate for opencode context management
**Solution:** Use tiktoken for accurate counting

### 3. Interactive Commands
**Current:** npx commands prompt for confirmation
**Problem:** Tool executor waits indefinitely, times out at 300s
**Solution:** Either:
- Add --yes flag automatically
- Forbid npx entirely, use manual file creation

## Options Considered

### Option 1: Strengthen Instructions Only
- Add more explicit "DO NOT" language
- Add complete React example
- Keep rough token estimation

**Pros:** Simple, focused fix
**Cons:** Doesn't fix token accuracy or interactive command issue
**Verdict:** REJECTED - Incomplete fix

### Option 2: Comprehensive Fix
- Strengthen instructions with anti-patterns
- Use tiktoken for accurate token counting
- Add non-interactive flags to package manager commands
- Update examples to show manual file creation

**Pros:** Fixes all three issues
**Cons:** More complex changes
**Verdict:** ACCEPTED - Complete solution

### Option 3: Change Architecture
- Move to client-side tool execution
- Different token counting approach

**Pros:** Could solve multiple issues
**Cons:** Breaking change, out of scope
**Verdict:** REJECTED - Too broad

## Decision

Implement Option 2: Comprehensive fix addressing all three issues.

### Changes

#### 1. Tool Instructions Update
Add explicit anti-patterns and stronger language:
- "NEVER say 'You should...' - EXECUTE immediately"
- "DO NOT USE npx create-react-app - manually create files"
- Complete React example showing manual file creation

#### 2. Token Counting Fix
Replace rough estimate with tiktoken:
```python
# Before
prompt_tokens = len(prompt) // 4

# After
import tiktoken
encoding = tiktoken.get_encoding('cl100k_base')
prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(content))
```
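Since tiktoken may not be installed in every environment, one option is to keep the old estimate as a fallback — the fallback behavior below is my addition, not something the document mandates:

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken when available; otherwise fall back to the rough len // 4 estimate."""
    try:
        import tiktoken
        encoding = tiktoken.get_encoding("cl100k_base")
        return len(encoding.encode(text))
    except ImportError:
        # Rough approximation used before this fix.
        return max(1, len(text) // 4)

prompt_tokens = count_tokens("Hello world")
```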

#### 3. Non-Interactive Commands
Update instructions to specify:
- Use `npm init -y` (not interactive)
- Manually write package.json instead of npx
- All examples show manual file creation

## Impact

### Token Budget (Exact Count - cl100k_base)
- **New Instructions:** 586 tokens (2,067 characters)
- **Status:** Within 2000 token limit ✓
- **Context window:** 16K model leaves ~15.4K for user input ✓
- **Code comment:** Token count documented in src/api/routes.py ✓

### Breaking Changes
- **None** - Instructions clearer, format unchanged
- Token reporting more accurate (good thing)

### Code Changes
- `src/api/routes.py`:
  - Update tool_instructions (~+15 lines)
  - Add tiktoken import
  - Replace token estimation logic (~5 lines)

## Testing Strategy

1. **Token Accuracy Test:**
```python
def test_token_accuracy():
    prompt = "Hello world"
    content = "Hi there"
    # Calculate with tiktoken
    # Verify API returns same values
```

2. **Instruction Content Test:**
   - Verify "DO NOT USE npx" present
   - Verify manual creation examples present
   - Verify "EXECUTE not DESCRIBE" present

3. **Integration Test:**
   - Request: "Create React app"
   - Expect: Manual file creation via write tool
   - Not expect: npx create-react-app

## Rollback Plan

If issues arise:
1. Revert to previous instructions
2. Keep tiktoken for token counting (beneficial)
3. Document why manual creation didn't work

## Success Metrics

- [ ] Model uses TOOL: format 100% of time (not descriptions)
- [ ] Token counts accurate within ±2%
- [ ] React projects created via write tool (not npx)
- [ ] No timeouts on package manager commands

## Implementation Notes

### Token Counting
Need to ensure tiktoken is in requirements.txt

### Tool Instructions
The key addition is:
```
**FORBIDDEN PATTERNS:**
- "You should run mkdir myapp" → USE: TOOL: bash\nARGUMENTS: {"command": "mkdir myapp"}
- "npx create-react-app myapp" → USE: Manual file creation with write tool
- "First create package.json, then..." → USE: Execute immediately, don't list steps

**REACT PROJECT - CORRECT APPROACH:**
1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\"...}"}
3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "..."}
4. Continue until all files created
```
@@ -1,172 +0,0 @@
# Design Decision: Improved Tool Instructions

**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
**Lines Changed:** ~25 lines

## Problem

Current tool instructions (~125 tokens) fail to communicate key behavioral expectations:

1. **Passive vs Active:** Model describes what to do instead of doing it
2. **Refusal:** Model claims "I am only an AI assistant" instead of executing
3. **Incomplete:** Multi-file projects result in README only

Evidence from user report:
- Request: "Create React Hello World app"
- Result: README only (not actual files)
- Subsequent: Commands given as text, not executed
- Final: "I am only an AI assistant" refusal

## Root Cause Analysis

The instructions lack:
1. **Authority statement** - "You CAN and SHOULD use tools"
2. **Execution mandate** - "Execute commands, don't just describe them"
3. **Workflow clarity** - Clear step-by-step expectations
4. **Anti-pattern examples** - What NOT to do

## Options Considered

### Option 1: Minor Tweaks
Add a few lines to existing instructions.
- **Pros:** Minimal token increase
- **Cons:** Band-aid fix, may not solve root cause
- **Verdict:** REJECTED - Doesn't address behavioral issue

### Option 2: Complete Rewrite with Strong Mandate
Rewrite instructions to emphasize:
- Proactive tool usage
- Execution over explanation
- Clear workflow
- Anti-patterns to avoid

- **Pros:** Addresses root cause, clear behavioral guidance
- **Cons:** Higher token count (estimated 300-400 tokens)
- **Verdict:** ACCEPTED - Proper fix for behavioral issue

### Option 3: Few-Shot Examples
Include full conversation examples in instructions.
- **Pros:** Shows exactly what to do
- **Cons:** Very high token count (1000+ tokens), may confuse model
- **Verdict:** REJECTED - Violates token budget

## Decision

Implement Option 2: Rewrite with emphasis on proactivity and execution.

**Key additions:**
1. **Capability statement:** "You have tools. Use them."
2. **Execution mandate:** "Don't describe, execute"
3. **Workflow:** Clear request→tool→result→next cycle
4. **Anti-patterns:** Explicitly forbid "I cannot" responses

## Impact

### Token Budget (Exact Count - cl100k_base)
- **Current:** 478 tokens (1,810 characters)
- **Status:** Within 2000 token limit ✓
- **Status:** Within 500 conservative estimate ✓
- **Context window:** 16K model leaves ~15.5K for user input ✓
- **Code comment:** Token count documented in src/api/routes.py ✓

### Code Changes
- **File:** src/api/routes.py
- **Lines:** +48/-18 (net +30)
- **Type:** Instructions replacement
- **Token documentation:** Added inline comment with exact token count

### Breaking Changes
- **None** - Instructions are additive/clearer, not different format

### Behavioral Changes
- **Expected:** More proactive tool usage
- **Expected:** No more "I cannot" refusals
- **Expected:** Multi-step projects completed via tools
- **Expected:** Commands executed, not described

### Review Blockers Addressed
- ✅ Token budget documented in code (REVIEW-2026-02-24-cleanup-branch.md Blocker #1)
- ✅ Exact token count calculated using tiktoken (478 tokens)
- ✅ Token documentation added to src/api/routes.py (REVIEW-2026-02-24-cleanup-branch.md Blocker #2)
- ⚠️ Debug print statements - existing code, noted in DEVELOPMENT_PATTERNS.md as "still needs work" (not in scope of this change)
- ⚠️ Integration tests for instruction effectiveness - requires running actual LLM, covered by parsing tests

## Implementation

```python
tool_instructions = """You are an AI assistant with TOOL ACCESS. You CAN and MUST use tools to accomplish tasks.

**CRITICAL RULES:**
1. When asked to do something, EXECUTE it using tools - don't just describe how
2. You MUST use the bash tool to run commands like mkdir, npm, ls, etc.
3. You MUST use the write tool to create files
4. NEVER say "I cannot" or "I am only an AI assistant" - you HAVE tools, USE them
5. Complete tasks FULLY - don't stop at README, create ALL required files

**AVAILABLE TOOLS:**
- read: Read file content
- write: Create/overwrite files
- bash: Execute shell commands (npm, mkdir, ls, etc.)

**TOOL FORMAT (STRICT):**
TOOL: tool_name
ARGUMENTS: {"param": "value"}

**WORKFLOW:**
1. User asks for something
2. You decide what tool to use
3. You respond with ONLY the TOOL: format above
4. You receive the tool result
5. You continue with next tool until task is COMPLETE

**EXAMPLES:**

Creating a project:
User: "Create a React app"
You: TOOL: bash
ARGUMENTS: {"command": "mkdir myapp && cd myapp && npm init -y"}
[wait for result]
You: TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
[continue until all files created]

Running commands:
User: "Install dependencies"
You: TOOL: bash
ARGUMENTS: {"command": "npm install"}
[wait for result, then confirm completion]

**WHAT NOT TO DO:**
- ❌ "To create a React app, you should run: mkdir myapp" (describing)
- ❌ "I cannot run commands, I am an AI" (refusing)
- ❌ Creating only README instead of full project (incomplete)
- ❌ "First do X, then do Y" (giving instructions instead of doing)

**CORRECT BEHAVIOR:**
- ✅ Execute the command immediately using the bash tool
- ✅ Create all files using the write tool
- ✅ Continue until task is 100% complete
- ✅ Use ONE tool at a time and wait for results"""
```

## Testing

1. Test with React Hello World request
2. Verify model uses bash to create directory structure
3. Verify model uses write to create all files
4. Verify no "I cannot" responses

## Rollback Plan

If new instructions cause issues:
1. Revert to previous ~125 token version
2. Analyze what specifically failed
3. Iterate on smaller changes

## Success Metrics

- [ ] Model uses tools on first request (not after prompting)
- [ ] Zero "I cannot" or "I am an AI" responses
- [ ] Multi-file projects fully created
- [ ] Commands executed, not described
||||
@@ -1,151 +0,0 @@
# Design Decision: Task Planning and Verification Workflow

**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
**Problem:** Model creates folder but doesn't complete full task or verify completion

## Problem Statement

User reports:
1. "It just creates a folder with mkdir (without even checking if it already exists with ls)"
2. No verification that tasks are completed
3. No planning of full task scope
4. Model stops after one step instead of completing entire project

## Root Cause

Previous instructions told model to "execute immediately" but didn't teach:
1. **Planning** - What needs to be done
2. **Checking** - What already exists
3. **Verification** - Did the step work
4. **Completion loop** - Keep going until done

## Solution

Add **Task Completion Workflow** to instructions:

```
**TASK COMPLETION WORKFLOW (MANDATORY):**

**1. PLAN:** List ALL steps needed before starting
**2. CHECK:** Use ls to verify what exists before creating
**3. EXECUTE:** Run first step
**4. VERIFY:** Confirm step worked (ls, read file)
**5. REPEAT:** Steps 3-4 until ALL complete
**6. FINAL CHECK:** Verify entire task is done
**7. CONFIRM:** Report completion with checklist
```
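In the real system the model follows this workflow through prompt instructions, but the execute-then-verify loop is easy to picture as code. A purely illustrative sketch (the step names and driver function are hypothetical):

```python
def run_task(steps):
    """steps: list of (name, execute, verify) triples.
    Execute each step, verify it worked, and only report done when all pass."""
    done = []
    for name, execute, verify in steps:
        execute()
        if not verify():
            raise RuntimeError(f"step {name!r} failed verification")
        done.append(name)
    return done  # the final completion checklist

# Toy stand-ins for mkdir/write plus their ls/read verification.
state = {}
steps = [
    ("mkdir myapp", lambda: state.update(dir=True), lambda: state.get("dir", False)),
    ("write package.json", lambda: state.update(pkg=True), lambda: state.get("pkg", False)),
]
checklist = run_task(steps)
```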
## Key Instruction Changes

### Added Planning Phase
Before doing anything, model must think about complete scope:
- What files/directories?
- What dependencies?
- Complete task requirements

### Added Verification Steps
Every step must be verified:
- `ls -la` after mkdir
- `read` file after write
- Check content is correct

### Added Completion Loop
Model must continue until:
✓ All directories exist
✓ All files exist with correct content
✓ All dependencies installed
✓ Each component verified

### Complete Working Example
Provided 13-step React example showing:
1. Check existing (ls)
2. Create directory
3. Verify created (ls)
4. Create package.json
5. Verify package.json (read)
6. Create source files
7. Final verification (find myapp -type f)
8. Install dependencies
9. Confirm completion checklist

## Impact

### Token Budget
- **Before:** 1,041 tokens
- **After:** 1,057 tokens (+16 tokens)
- **Status:** Under 2,000 limit ✓

### Behavioral Changes

**Before:**
- Model: mkdir myapp
- User: That's it?
- Result: Empty directory

**After:**
- Model checks what exists
- Creates complete project structure
- Verifies each file
- Confirms completion
- Result: Working React project

## Success Criteria

When user asks "Create React Hello World project", model should:
1. ✓ Check current directory contents
2. ✓ Create myapp/ directory
3. ✓ Verify directory created
4. ✓ Create package.json
5. ✓ Verify package.json content
6. ✓ Create src/App.js
7. ✓ Create src/index.js
8. ✓ Create public/index.html
9. ✓ Final verification (list all files)
10. ✓ npm install
11. ✓ Confirm completion checklist

## Testing

Test instructions contain:
- PLAN/CHECK keywords
- VERIFY keyword
- COMPLETE keyword

All tests pass: 11/11 ✓

## Trade-offs

**Pros:**
- Complete task execution
- Verification prevents partial work
- Clear completion criteria
- Better user experience

**Cons:**
- More tokens (but still under limit)
- More verbose instructions
- May be slower (more verification steps)

## Related Files Changed

1. src/api/routes.py - Updated tool_instructions
2. tests/test_tool_parsing.py - Updated tests for new content
3. docs/design/2024-02-24-task-planning-verification.md - This doc

## Future Improvements

1. **Task Queue System:** Server-side queue of pending operations
2. **State Persistence:** Remember what's been done across conversations
3. **Smart Resumption:** If interrupted, pick up where left off
4. **Progress Reporting:** Show % complete during long tasks

## Conclusion

The new workflow teaches the model to be systematic:
1. Plan before acting
2. Check before creating
3. Verify after each step
4. Continue until complete

This should resolve the "only creates folder" issue and ensure complete project creation.
@@ -1,132 +0,0 @@
# Design Decision: Tool Parsing Simplification

**Date:** 2024-02-24
**Scope:** src/api/routes.py parse_tool_calls function
**Lines Changed:** ~210 lines removed, ~30 lines added

## Problem

The tool parsing code had accumulated 4 different parsing formats over 25+ commits:
1. JSON `tool_calls` format with nested objects
2. TOOL:/ARGUMENTS: format (simple text)
3. Function pattern format `func_name(args)`
4. Multiple JSON handling variants

This caused:
- Circular development (adding/removing formats repeatedly)
- No single source of truth
- Complex, unmaintainable code
- No confidence that changes wouldn't break existing cases

## Options Considered

### Option 1: Keep All Formats
- **Pros:** Backward compatible
- **Cons:** 210 lines of unmaintainable code, continues circular development pattern
- **Verdict:** REJECTED - Perpetuates the problem

### Option 2: Standardize on TOOL:/ARGUMENTS: Only
- **Pros:**
  - Simple regex pattern (~30 lines)
  - Matches current tool instructions
  - Easy to test
  - Clear single format for models
- **Cons:**
  - Breaking change if any code relies on old formats
  - Need to update any existing examples/docs
- **Verdict:** ACCEPTED - Aligns with Rule 5 (Parse Once, Parse Well)

### Option 3: Create Parser per Format with Feature Flags
- **Pros:** Flexible, can toggle formats
- **Cons:**
  - Violates Rule 5 and "No Feature Flags in Core Logic"
  - Still maintains multiple code paths
- **Verdict:** REJECTED - Doesn't solve the root problem

## Decision

Standardize on the TOOL:/ARGUMENTS: format only. Remove all other parsing code.

**Rationale:**
- Per DEVELOPMENT_PATTERNS.md recommendation #3: "One Format Only"
- Token cost is minimal (no complex regex)
- Test coverage provides confidence
- Aligns with existing tool instructions

## Impact

### Token Count
- **Parser code:** 210 lines → 30 lines (-180 lines)
- **No change** to tool instructions (separate optimization)

### Breaking Changes
- **Yes** - Removes support for:
  - JSON `tool_calls` format in model responses
  - Function pattern format `read_file(path="test.txt")`

**Migration:** Models must use:
```
TOOL: read
ARGUMENTS: {"filePath": "test.txt"}
```

### Testing
- Unit tests added: 9 test cases
- Coverage: All parsing scenarios
- All tests pass
## Implementation

```python
# New implementation (~30 lines)
def parse_tool_calls(text: str) -> tuple:
    """Parse tool calls using standardized format."""
    import json
    import re

    tool_pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
    tool_matches = list(re.finditer(tool_pattern, text, re.IGNORECASE))

    if not tool_matches:
        return text, None

    tool_calls = []
    for i, tool_match in enumerate(tool_matches):
        tool_name = tool_match.group(1)
        args_str = tool_match.group(2)
        try:
            args_dict = json.loads(args_str)
            tool_calls.append({
                "id": f"call_{i+1}",
                "type": "function",
                "function": {
                    "name": tool_name,
                    "arguments": json.dumps(args_dict)
                }
            })
        except json.JSONDecodeError:
            continue

    if not tool_calls:
        return text, None

    # Everything before the first tool call is treated as assistant content
    first_start = tool_matches[0].start()
    content = text[:first_start].strip()

    return content, tool_calls
```

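The parsing behavior can be sanity-checked against the regex directly. A minimal, self-contained sketch using the same pattern (the sample response text is illustrative):

```python
import json
import re

# Same pattern the parser uses
pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'

sample = 'Reading the file now.\nTOOL: read\nARGUMENTS: {"filePath": "test.txt"}'
match = re.search(pattern, sample, re.IGNORECASE)

tool_name = match.group(1)              # "read"
arguments = json.loads(match.group(2))  # {"filePath": "test.txt"}
print(tool_name, arguments)
```

Content before the first `TOOL:` line ("Reading the file now.") is what the full parser returns as assistant content.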
## Verification

Run the tests:

```bash
python tests/test_tool_parsing.py
```

Expected: 9 passed, 0 failed

## Follow-up

- [x] Update DEVELOPMENT_PATTERNS.md to mark as completed
- [x] Add unit tests
- [ ] Consider an integration test for the full tool execution flow

@@ -1,98 +0,0 @@
# Investigation: 31k Token Context Issue

## Problem

When making requests through opencode to local_swarm, the LLM receives ~31k tokens of context even for simple empty-directory queries.

## Root Cause Identified

**NOT an issue with this repo's codebase - this is expected behavior for function calling.**

### How it works

1. **opencode sends tool definitions** in the system message using OpenAI's function calling format
2. **Each tool definition is ~450 tokens** (name + description + parameters)
3. **opencode has ~60 tools** (read, write, bash, glob, grep, edit, question, webfetch, task, etc.)
4. **Total tool definition tokens:** ~27,000

### Calculation

```
Single tool definition: ~450 tokens
Number of tools: ~60
Tool schemas total: ~27,000 tokens
System message: ~500 tokens
User query: ~100 tokens
---
Total: ~27,600 tokens
```

**This accounts for the bulk of the observed ~31k tokens.**

## Why This Happens

OpenAI's function calling protocol requires sending the **complete function schemas** to the LLM with every request. This is how the model:

- Knows what tools are available
- Understands parameter requirements
- Knows how to format tool calls

All major providers of function calling work this way (OpenAI, Anthropic, local models, etc.).

## Verification

```bash
python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')

# Example from actual opencode tool definition
read_tool_schema = '''{\"type\": \"function\", \"function\": {\"name\": \"read\", \"description\": \"Read a file or directory from the local filesystem...[full description]\", \"parameters\": {...}}}'''

print(f'Single tool schema: {len(enc.encode(read_tool_schema))} tokens')
print(f'Estimated 60 tools: {len(enc.encode(read_tool_schema)) * 60:,} tokens')
"
```

Result:

- Single tool definition: ~451 tokens
- 60 tools: ~27,060 tokens
- Plus system + user message: ~27,660 total

## This Is NOT a Bug

The 31k token context is **correct and expected** for function calling with 60+ tools. This is how:

- the OpenAI API works
- the Claude API works
- local models with function calling work

## Potential Optimizations (Optional)

If reducing context size is critical, consider:

### Option 1: Dynamic Tool Selection

- Only send tools relevant to the current task
- Example: for file operations, only send [read, write, glob, edit]
- Trade-off: requires opencode to intelligently filter tools

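A minimal sketch of what such filtering might look like on the proxy side. The keyword map, helper name, and tool list here are illustrative assumptions, not opencode's actual configuration:

```python
# Hypothetical sketch: send only the tool definitions relevant to a request.
TASK_KEYWORDS = {
    "read": {"read", "open", "show", "cat"},
    "write": {"write", "create", "save"},
    "glob": {"find", "list", "search"},
    "edit": {"edit", "change", "replace"},
}

def select_tools(user_query: str, all_tools: list) -> list:
    words = set(user_query.lower().split())
    wanted = {name for name, keys in TASK_KEYWORDS.items() if words & keys}
    # Fall back to sending everything when no keyword matches
    if not wanted:
        return all_tools
    return [t for t in all_tools if t["function"]["name"] in wanted]

tools = [{"type": "function", "function": {"name": n}}
         for n in ("read", "write", "glob", "edit", "bash")]
print(len(select_tools("read the config file", tools)))  # → 1
```

The fallback matters: mis-filtering would silently hide tools from the model, so an empty match should degrade to the current behavior of sending all schemas.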
### Option 2: Compressed Tool Descriptions

- Shorten tool descriptions to the essentials
- Example: "Read file at path (required: filePath)"
- Trade-off: the model may make more errors with less guidance

### Option 3: Tool Grouping

- Group similar tools into a single "tools: [read, write, glob]" parameter
- Trade-off: breaks OpenAI compatibility

## Recommendation

**NO ACTION REQUIRED.** The 31k token context is:

- Standard for function calling with many tools
- Within the capabilities of modern LLMs (32k-128k context windows)
- Not caused by this repo's code

The `.opencodeignore` created earlier helps with opencode's own system prompt, but doesn't affect the LLM context sent to local_swarm.

## Additional Finding

While investigating, verified:

- `config/prompts/tool_instructions.txt`: 125 tokens ✅
- This repo's tool execution code: no token bloat ✅
- The issue is purely opencode's function calling protocol ✅

@@ -1,112 +0,0 @@
# Test Plan: Fix Tool Execution and Token Reporting

## Problem Analysis

### Issue 1: Model Gives Instructions Instead of Executing

**Current behavior:** Model describes what to do ("You should run mkdir...") instead of using the TOOL: format
**Expected:** Model responds with `TOOL: bash\nARGUMENTS: {"command": "mkdir..."}`

### Issue 2: Token Counting Inaccurate

**Current:** Rough estimate `len(prompt) // 4`
**Expected:** Accurate token count using tiktoken
**Impact:** opencode can't properly manage the context window

### Issue 3: npx Commands Time Out / Need Input

**Current:** `npx create-react-app .` prompts for confirmation (y/n)
**Expected:** Non-interactive execution or manual file creation
**Evidence:** "Need to install the following packages: create-react-app@5.1.0 Ok to proceed? (y)"

## Unit Tests

### Test 1: Accurate Token Counting

- [ ] Verify token count uses tiktoken (not a rough estimate)
- [ ] Test with known token counts
- [ ] Verify prompt_tokens + completion_tokens = total_tokens

### Test 2: Non-Interactive Bash Commands

- [ ] Verify npm/npx commands use --yes or equivalent flags
- [ ] Test timeout handling for package managers
- [ ] Verify commands don't prompt for user input

### Test 3: Tool Instructions Content

- [ ] Verify instructions emphasize "EXECUTE not DESCRIBE"
- [ ] Verify manual file creation examples (not npx)
- [ ] Verify anti-patterns are clearly stated

## Integration Tests

### Test 4: End-to-End React Project Creation

**Input:** "Create a React Hello World app"

**Expected Flow:**

1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "..."}
4. Continue until complete

**Failure Modes:**

- [ ] Model describes steps instead of executing
- [ ] Uses npx create-react-app (should manually create files)
- [ ] Stops after README only

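Step 1 of the expected flow, written out as raw model output in the standardized format, would look like:

```
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp"}
```
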
### Test 5: Token Reporting Accuracy

**Input:** Any chat completion request

**Expected:**

- usage.prompt_tokens matches the actual token count
- usage.completion_tokens matches the actual token count
- usage.total_tokens is their sum

**Verification:** Compare the tiktoken count against the API response.

## Manual Verification

```bash
# Test React creation
python main.py --auto &
curl -X POST http://localhost:17615/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "X-Client-Working-Dir: /tmp/test-project" \
  -d '{
    "model": "local-swarm",
    "messages": [{"role": "user", "content": "Create a React Hello World app"}],
    "tools": [{"type": "function", "function": {"name": "bash"}}, {"type": "function", "function": {"name": "write"}}]
  }'

# Check token accuracy
curl -X POST http://localhost:17615/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-swarm",
    "messages": [{"role": "user", "content": "Hello"}]
  }' | jq '.usage'
```

## Success Criteria

1. **Execution:** 100% of requests use the TOOL: format (not descriptions)
2. **Accuracy:** Token counts match tiktoken within ±5%
3. **Completion:** Multi-file projects are fully created via the write tool
4. **No npx:** Manual file creation for React (no npx create-react-app)

## Implementation Notes

### Token Counting Fix

```python
# Replace: prompt_tokens = len(prompt) // 4
# With:
import tiktoken

encoding = tiktoken.get_encoding('cl100k_base')
prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(content))
```

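Since tiktoken is a third-party dependency, one hedged variant is to fall back to the old rough estimate only when the library is unavailable. The helper name is illustrative, not part of the codebase:

```python
def count_tokens(text: str) -> int:
    """Count tokens with tiktoken; fall back to the rough len//4 estimate."""
    try:
        import tiktoken
        encoding = tiktoken.get_encoding('cl100k_base')
        return len(encoding.encode(text))
    except ImportError:
        # The pre-fix heuristic, kept only as a degraded fallback
        return max(1, len(text) // 4)

print(count_tokens("Hello, world!"))
```

This keeps the accuracy goal while avoiding a hard crash on machines where tiktoken isn't installed.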
### Tool Instructions Fix

- Add an explicit "DO NOT USE npx create-react-app" instruction
- Add an "EXECUTE IMMEDIATELY" mandate
- Show a complete React example with manual file creation

### Non-Interactive Commands

- Auto-add --yes to npx commands
- Or recommend manual file creation instead

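The auto---yes idea could be a tiny rewrite step in the bash tool handler. A sketch; the function name and placement are assumptions:

```python
def make_noninteractive(command: str) -> str:
    """Sketch: inject --yes into npx invocations so they never prompt."""
    stripped = command.strip()
    if stripped.startswith("npx ") and "--yes" not in stripped:
        return stripped.replace("npx ", "npx --yes ", 1)
    return command

print(make_noninteractive("npx create-react-app ."))  # → npx --yes create-react-app .
```

Commands that already carry `--yes`, or that aren't npx at all, pass through unchanged.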
@@ -1,97 +0,0 @@
# Test Plan: Improved Tool Instructions

## Problem Statement

The model is not using tools effectively:

1. Creates a README instead of the actual project structure
2. Provides commands as text instead of executing them
3. Refuses to run commands, claiming "I am only an AI assistant"

## Root Cause Analysis

The current instructions don't clearly communicate:

- That the model SHOULD use tools proactively
- That execution is expected, not explanation
- The workflow: user request → tool execution → result

## Unit Tests (Instruction Verification)

### Test 1: Instruction Presence

- [ ] Verify instructions are injected into the system message
- [ ] Verify instructions appear at the START of the system message (priority position)

### Test 2: Token Count

- [ ] Measure the total token count of the new instructions
- [ ] Verify ≤ 500 tokens (conservative budget)
- [ ] Document before/after

### Test 3: Format Compliance

- [ ] Verify instructions include the TOOL:/ARGUMENTS: format
- [ ] Verify examples use the correct format
- [ ] Verify rules are clear and numbered

## Integration Tests (Behavioral)

### Test 4: Project Creation Flow

**Input:** "Create a React Hello World app"

**Expected Behavior:**

1. Model responds with TOOL: bash, ARGUMENTS: mkdir myapp
2. After the result: TOOL: write, ARGUMENTS: package.json content
3. After the result: TOOL: write, ARGUMENTS: src/App.js content
4. Continue until the complete project structure exists

**Failure Modes:**

- [ ] Model only describes what to do
- [ ] Model creates a README only
- [ ] Model refuses to execute commands

### Test 5: Multi-Step Task

**Input:** "Check what files exist, then create a test.txt file with 'hello' in it"

**Expected Behavior:**

1. TOOL: bash, ARGUMENTS: ls -la
2. Wait for the result
3. TOOL: write, ARGUMENTS: test.txt with "hello"

**Failure Modes:**

- [ ] Model tries to do both in one response
- [ ] Model doesn't wait for the ls result before writing

### Test 6: Command Refusal

**Input:** "Run npm install"

**Expected Behavior:**

1. TOOL: bash, ARGUMENTS: npm install

**Failure Modes:**

- [ ] Model responds: "I cannot run commands, I am only an AI assistant"
- [ ] Model explains npm install instead of running it

## Manual Verification Commands

```bash
# Start the server
python main.py --auto

# In another terminal, test with curl
curl -X POST http://localhost:17615/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local-swarm",
    "messages": [{"role": "user", "content": "Create a React Hello World app"}],
    "tools": [{"type": "function", "function": {"name": "bash", "description": "Run shell commands"}}, {"type": "function", "function": {"name": "write", "description": "Write files"}}]
  }'
```

## Success Criteria

1. **Proactivity:** Model uses tools without being asked twice
2. **Execution:** Model runs commands rather than describing them
3. **No refusal:** Model never says "I cannot" or "I am only an AI"
4. **Completeness:** Multi-file projects are fully created via tools
5. **Format:** 100% of tool calls use the correct TOOL:/ARGUMENTS: format

## Metrics

- **Tool usage rate:** % of requests that result in tool calls
- **Format compliance:** % of tool calls in the correct format
- **Completion rate:** % of multi-step tasks fully completed

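The format-compliance metric can be computed with the same TOOL:/ARGUMENTS: regex the parser uses. A minimal sketch over a list of captured responses (the sample data and helper name are illustrative):

```python
import re

PATTERN = re.compile(r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})', re.IGNORECASE)

def format_compliance(responses: list) -> float:
    """Fraction of responses containing at least one well-formed tool call."""
    if not responses:
        return 0.0
    compliant = sum(1 for r in responses if PATTERN.search(r))
    return compliant / len(responses)

samples = [
    'TOOL: bash\nARGUMENTS: {"command": "ls -la"}',  # well-formed call
    'You should run mkdir myapp yourself.',           # describes instead of executing
]
print(format_compliance(samples))  # → 0.5
```
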
@@ -1,35 +0,0 @@
# Test Plan: Tool Parsing Simplification

## Unit Tests

- [x] Test case 1: Single tool call → returns 1 tool with correct name and arguments
- [x] Test case 2: No tool in text → returns None for tools, original text as content
- [x] Test case 3: Multiple tools → returns all tools in order
- [x] Test case 4: Content before tool → content extracted, tool parsed correctly
- [x] Test case 5: Bash tool → correctly parses the bash command
- [x] Test case 6: Case insensitive → "tool:" and "TOOL:" both work
- [x] Test case 7: Invalid JSON → skips the invalid call, continues with valid ones
- [x] Test case 8: Empty text → returns None, empty string
- [x] Test case 9: Whitespace only → returns None

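Test cases 3 and 6 hinge on `re.finditer` combined with `re.IGNORECASE`; a self-contained illustration using the same pattern (the sample text is illustrative):

```python
import re

pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
text = (
    'tool: bash\nARGUMENTS: {"command": "ls"}\n'    # lowercase "tool:" still matches
    'TOOL: write\nARGUMENTS: {"filePath": "a.txt"}'
)

names = [m.group(1) for m in re.finditer(pattern, text, re.IGNORECASE)]
print(names)  # → ['bash', 'write']
```
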
## Integration Tests

- [ ] End-to-end flow:
  1. Send a chat completion request with tools
  2. Model responds in the TOOL:/ARGUMENTS: format
  3. Parser extracts the tool call
  4. Tool executes
  5. Result is returned in the response
- [ ] Expected result: tool executes successfully, result included in the response

## Manual Verification

- [ ] Command: `python tests/test_tool_parsing.py`
- [ ] Expected output: "9 passed, 0 failed"

## Token Budget Verification

- Parser code: ~30 lines (~200 tokens)
- Well under the 2000-token limit
- The simple regex pattern keeps complexity low