580d1e5d17
* feat: enhanced tool instructions for multi-step operations
- Add comprehensive examples for ls, find, grep, mkdir, npm init, etc.
- Explain multi-step workflow (explore → read → write)
- Tool system already supports chaining via conversation history
- Bash tool supports: ls, find, grep, cat, mkdir, cd, npm, etc.
- 30 second timeout on commands
- Output limited to 3000 chars for readability
* Cleanup: Consolidate documentation and tidy codebase
Documentation:
- Consolidate 6 markdown files into simplified README.md
- Remove redundant docs: TODO.md, NETWORK.md, REVIEW.md, PLAN.md, CONTEXT.md, GUIDE.md
- Add ARCHITECTURE.md with clean technical overview
- README now focuses on quick start and core concepts
Code verification:
- Verified blocking I/O properly wrapped in asyncio.to_thread()
- Confirmed locks initialized correctly in backends
- AMD VRAM detection uses proper regex (takes max value, not first match)
- All exception handling uses 'except Exception:' (not bare except)
Tool execution improvements (existing changes):
- Better working directory handling with project root detection
- Extended timeouts for package managers (300s)
- Multi-tool call parsing support
- Improved error handling and logging
Note: System prompt concern noted - 30k tokens too large for 16-32k context windows
* docs: add development patterns analysis
Document circular development issues identified in commit history:
- Tool execution went back-and-forth 3+ times (server-side vs client-side)
- Tool instructions changed from 40k → 300 → removed → enhanced tokens
- 8+ parsing fixes for same issues (no tests)
- 6 debug-only commits (production debugging)
Provides recommendations to prevent future cycles:
1. Pick one architecture and stick with it
2. Add unit tests before fixes
3. Token budget (<2000 for instructions)
4. One format only (remove alternative parsers)
5. Integration test script
6. Separate concerns into smaller modules
7. Design doc before code changes
8. CI/CD with automated testing
* docs: add comprehensive agent guidelines
AGENT_WORKER.md (600+ lines):
- Pre-flight checklist: token budget, test plan, design doc
- Coding rules: TDD, no debug code, architecture consistency
- Git workflow: branching strategy, commit rules, release process
- Testing requirements: unit (≥80%), integration structure
- Code quality: PEP 8, type hints, max 50 lines per function
- Architecture: no feature flags, separation of concerns
- Continuous learning: research requirements, documentation
- Forbidden patterns: bare except, production debugging, etc.
AGENT_REVIEW.md (400+ lines):
- Review philosophy: prevent circular development
- 6-phase review checklist: structure, quality, tokens, architecture, research, logic
- Report format with token impact analysis
- Severity levels: blocking vs warnings vs approved
- Common issues with examples (good vs bad)
- Review workflow: 30-35 min per PR
- Reports stored in reports/ folder (gitignored)
Also added:
- tests/test_tool_parsing.py - example test following guidelines
- Updated DEVELOPMENT_PATTERNS.md with recommendations
Reports folder in .gitignore for local review storage
* chore: gitignore review reports folder
* feat: fix tool execution and enhance instructions with accurate token counting
- Enhanced tool instructions (1041 tokens, within 2000 budget)
- Added tiktoken>=0.5.0 for accurate token counting
- Fixed subprocess hang by adding stdin=subprocess.DEVNULL
- Removed 9 DEBUG print statements from routes.py
- Added tests for instruction content and token budget verification
- All tests pass (11/11)
Resolves blockers from previous review:
- Token budget verified ✓
- Token documentation added ✓
- Debug code cleaned ✓
- Missing tests added ✓
* feat: implement comprehensive tool system with proper logging
Major improvements to tool instructions and execution:
- Enhanced tool instructions with 7-step task completion workflow
- Added markdown code block fallback parser for tool calls
- Fixed subprocess hang with stdin=subprocess.DEVNULL
- Fixed streaming path to return tool_calls (enabling multi-turn conversations)
- Added complete React project creation example with verification steps
- Token count: 1,743 tokens (within 2,000 limit)
Logging infrastructure:
- Created centralized logging configuration (src/utils/logging_config.py)
- Replaced 80+ print statements with logger.debug()
- Set log level to DEBUG for development
- All modules now use proper logging instead of print
Testing:
- Added 4 new tests for markdown parsing and instruction content
- All 13 tests passing
- Token budget verification test
Documentation:
- Added comprehensive design docs for all major changes
- Added test plans for verification
- Created helper scripts for logging migration
Files changed:
- main.py: Added logging setup
- src/api/routes.py: Tool instructions, streaming fixes, logging
- src/tools/executor.py: subprocess fix, logging
- src/utils/: New logging configuration module
- tests/test_tool_parsing.py: New tests
- docs/: Design decisions and test plans
- scripts/: Helper scripts for development
* refactor: simplify tool instructions to 109 tokens for 7B model
Reduced from 1,743 tokens to 109 tokens (94% reduction) to help
qwen2.5 7B 4bit model follow instructions better.
Changes:
- Removed complex workflow documentation
- Removed multi-turn conversation examples
- Removed lengthy anti-patterns
- Kept only essential format and rules
- Updated tests to match simplified content
Before: 1,743 tokens, 6,004 chars (87% of budget)
After: 109 tokens, 392 chars (5.5% of budget)
This should make it much easier for smaller models to:
1. Understand they must use tools
2. Follow the simple TOOL: format
3. Not get overwhelmed by instructions
* refactor: make tool instructions ultra-direct for 7B models
Further simplify instructions to prevent model from adding explanations.
Before: 109 tokens - model still added explanatory text
After: 86 tokens - ultra-direct commands
Key changes:
- Start with 'You MUST use tools. DO NOT explain.'
- 'OUTPUT THIS EXACT FORMAT - NOTHING ELSE'
- Removed all examples and pleasantries
- Added 'NEVER' rules in all caps
- 'ONLY output TOOL: lines'
The model was outputting:
'1. First, install... TOOL: bash ARGUMENTS: {...}'
Now should output just:
'TOOL: bash
ARGUMENTS: {...}'
This should force the 7B qwen model to stop explaining and just execute.
* refactor: move tool instructions to external config file
Moves hardcoded tool instructions from routes.py to external config file
for better maintainability and easier editing.
Changes:
- Created config/prompts/tool_instructions.txt
- Added _load_tool_instructions() function with caching
- Falls back to default if config file not found
- Updated tests to use the loader function
- Added proper error handling
Benefits:
- Easier to modify instructions without code changes
- Instructions can be edited by non-developers
- Cleaner separation of config vs code
- Supports hot-reloading (cached but easy to invalidate)
Token count: 86 tokens (loaded from file)
Location: config/prompts/tool_instructions.txt
* refactor: simplify tool instructions further and add debug logging
- Reduced instructions to bare minimum: 50 tokens
- Added debug logging to verify instructions are sent
- Removed all caps and aggressive language
- Made instructions more straightforward
Instructions now:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'
This should be easier for 7B models to follow while still
conveying the essential requirements.
* feat: improve tool parser to handle 7B model output variations
Enhanced parse_tool_calls() with multiple fallback strategies:
1. Standard TOOL:/ARGUMENTS: format (original)
2. Markdown code blocks ()
3. Numbered list items (1. npm install ...)
4. Standalone bash commands (npm, npx, mkdir, etc.)
Now handles messy output from small models like:
'1. Install: npm install -g create-react-app'
'2. Create: create-react-app hello-world'
Parses these into chained bash commands for execution.
Also simplified instructions to 50 tokens minimum:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'
This combination should make 7B models much more likely to
have their output successfully parsed and executed.
* fix: improve command extraction for 7B model output
Parser now extracts bash commands from any line containing:
- npm, npx, mkdir, cd, ls, cat, echo, git, python, pip, node, yarn
- create-react-app (added for React projects)
Example: Extracts 'npm install -g create-react-app' from:
'1. Install: npm install -g create-react-app'
Chains multiple commands with && for sequential execution.
This should now successfully parse the numbered list output
from 7B models and execute the commands.
* feat: add bash tool description validation and improve 7B model parsing
Changes:
- Added _ensure_tool_arguments() function to inject 'description' field
- Updated tool_instructions.txt to require description for bash tool
- Improved 7B model command extraction with better regex patterns
- Added 'create-react-app' to command detection list
- Updated delta field type to Dict[str, Any] for streaming
- Added GGUF to MLX quantization mapping for registry.py
- Clarified agent responsibilities in AGENT_REVIEW.md and AGENT_WORKER.md
Fixes:
- Bash tool now validates required 'description' field
- 7B model output parsed more reliably (numbered lists)
- Multiple commands chained with && for sequential execution
Token count: 69 tokens (down from 86, -19.8%)
All tests pass: 13/13
* feat: add webfetch tool support with URL extraction
Changes:
- Added webfetch to tool instructions config
- Added URL extraction pattern to parse_tool_calls()
- Parser now recognizes URLs and creates webfetch tool calls
- Updated token count: 89 tokens (+29% from 69)
The webfetch tool is available through opencode environment.
System prompt adjustment enables model to use it for URL fetching.
Token budget: 89 tokens (4.45% of 2000 limit)
Tests pass: 13/13
5.8 KiB
5.8 KiB
Design Decision: Improved Tool Instructions
Date: 2024-02-24 Scope: src/api/routes.py tool_instructions Lines Changed: ~25 lines
Problem
Current tool instructions (~125 tokens) fail to communicate key behavioral expectations:
- Passive vs Active: Model describes what to do instead of doing it
- Refusal: Model claims "I am only an AI assistant" instead of executing
- Incomplete: Multi-file projects result in README only
Evidence from user report:
- Request: "Create React Hello World app"
- Result: README only (not actual files)
- Subsequent: Commands given as text, not executed
- Final: "I am only an AI assistant" refusal
Root Cause Analysis
The instructions lack:
- Authority statement - "You CAN and SHOULD use tools"
- Execution mandate - "Execute commands, don't just describe them"
- Workflow clarity - Clear step-by-step expectations
- Anti-pattern examples - What NOT to do
Options Considered
Option 1: Minor Tweaks
Add a few lines to existing instructions.
- Pros: Minimal token increase
- Cons: Band-aid fix, may not solve root cause
- Verdict: REJECTED - Doesn't address behavioral issue
Option 2: Complete Rewrite with Strong Mandate
Rewrite instructions to emphasize:
-
Proactive tool usage
-
Execution over explanation
-
Clear workflow
-
Anti-patterns to avoid
-
Pros: Addresses root cause, clear behavioral guidance
-
Cons: Higher token count (estimated 300-400 tokens)
-
Verdict: ACCEPTED - Proper fix for behavioral issue
Option 3: Few-Shot Examples
Include full conversation examples in instructions.
- Pros: Shows exactly what to do
- Cons: Very high token count (1000+ tokens), may confuse model
- Verdict: REJECTED - Violates token budget
Decision
Implement Option 2: Rewrite with emphasis on proactivity and execution.
Key additions:
- Capability statement: "You have tools. Use them."
- Execution mandate: "Don't describe, execute"
- Workflow: Clear request→tool→result→next cycle
- Anti-patterns: Explicitly forbid "I cannot" responses
Impact
Token Budget (Exact Count - cl100k_base)
- Current: 478 tokens (1,810 characters)
- Status: Within 2000 token limit ✓
- Status: Within 500 conservative estimate ✓
- Context window: 16K model leaves ~15.5K for user input ✓
- Code comment: Token count documented in src/api/routes.py ✓
Code Changes
- File: src/api/routes.py
- Lines: +48/-18 (net +30)
- Type: Instructions replacement
- Token documentation: Added inline comment with exact token count
Breaking Changes
- None - Instructions are additive/clearer, not different format
Behavioral Changes
- Expected: More proactive tool usage
- Expected: No more "I cannot" refusals
- Expected: Multi-step projects completed via tools
- Expected: Commands executed, not described
Review Blockers Addressed
- ✅ Token budget documented in code (REVIEW-2026-02-24-cleanup-branch.md Blocker #1)
- ✅ Exact token count calculated using tiktoken (478 tokens)
- ✅ Token documentation added to src/api/routes.py (REVIEW-2026-02-24-cleanup-branch.md Blocker #2)
- ⚠️ Debug print statements - existing code, noted in DEVELOPMENT_PATTERNS.md as "still needs work" (not in scope of this change)
- ⚠️ Integration tests for instruction effectiveness - requires running actual LLM, covered by parsing tests
Implementation
tool_instructions = """You are an AI assistant with TOOL ACCESS. You CAN and MUST use tools to accomplish tasks.
**CRITICAL RULES:**
1. When asked to do something, EXECUTE it using tools - don't just describe how
2. You MUST use the bash tool to run commands like mkdir, npm, ls, etc.
3. You MUST use the write tool to create files
4. NEVER say "I cannot" or "I am only an AI assistant" - you HAVE tools, USE them
5. Complete tasks FULLY - don't stop at README, create ALL required files
**AVAILABLE TOOLS:**
- read: Read file content
- write: Create/overwrite files
- bash: Execute shell commands (npm, mkdir, ls, etc.)
**TOOL FORMAT (STRICT):**
TOOL: tool_name
ARGUMENTS: {"param": "value"}
**WORKFLOW:**
1. User asks for something
2. You decide what tool to use
3. You respond with ONLY the TOOL: format above
4. You receive the tool result
5. You continue with next tool until task is COMPLETE
**EXAMPLES:**
Creating a project:
User: "Create a React app"
You: TOOL: bash
ARGUMENTS: {"command": "mkdir myapp && cd myapp && npm init -y"}
[wait for result]
You: TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
[continue until all files created]
Running commands:
User: "Install dependencies"
You: TOOL: bash
ARGUMENTS: {"command": "npm install"}
[wait for result, then confirm completion]
**WHAT NOT TO DO:**
- ❌ "To create a React app, you should run: mkdir myapp" (describing)
- ❌ "I cannot run commands, I am an AI" (refusing)
- ❌ Creating only README instead of full project (incomplete)
- ❌ "First do X, then do Y" (giving instructions instead of doing)
**CORRECT BEHAVIOR:**
- ✅ Execute the command immediately using the bash tool
- ✅ Create all files using the write tool
- ✅ Continue until task is 100% complete
- ✅ Use ONE tool at a time and wait for results"""
Testing
- Test with React Hello World request
- Verify model uses bash to create directory structure
- Verify model uses write to create all files
- Verify no "I cannot" responses
Rollback Plan
If new instructions cause issues:
- Revert to previous ~125 token version
- Analyze what specifically failed
- Iterate on smaller changes
Success Metrics
- Model uses tools on first request (not after prompting)
- Zero "I cannot" or "I am an AI" responses
- Multi-file projects fully created
- Commands executed, not described