local_swarm/docs/DEVELOPMENT_PATTERNS.md
sleepy 580d1e5d17 feat: comprehensive tool system improvements and webfetch support (#3)
* feat: enhanced tool instructions for multi-step operations

- Add comprehensive examples for ls, find, grep, mkdir, npm init, etc.
- Explain multi-step workflow (explore → read → write)
- Tool system already supports chaining via conversation history
- Bash tool supports: ls, find, grep, cat, mkdir, cd, npm, etc.
- 30 second timeout on commands
- Output limited to 3000 chars for readability

* Cleanup: Consolidate documentation and tidy codebase

Documentation:
- Consolidate 6 markdown files into simplified README.md
- Remove redundant docs: TODO.md, NETWORK.md, REVIEW.md, PLAN.md, CONTEXT.md, GUIDE.md
- Add ARCHITECTURE.md with clean technical overview
- README now focuses on quick start and core concepts

Code verification:
- Verified blocking I/O properly wrapped in asyncio.to_thread()
- Confirmed locks initialized correctly in backends
- AMD VRAM detection uses proper regex (takes max value, not first match)
- All exception handling uses 'except Exception:' (not bare except)

Tool execution improvements (existing changes):
- Better working directory handling with project root detection
- Extended timeouts for package managers (300s)
- Multi-tool call parsing support
- Improved error handling and logging

Note: system prompt concern remains - 30k tokens is too large for 16-32k context windows

* docs: add development patterns analysis

Document circular development issues identified in commit history:
- Tool execution went back-and-forth 3+ times (server-side vs client-side)
- Tool instruction size went from 40k tokens → 300 tokens → removed → enhanced again
- 8+ parsing fixes for same issues (no tests)
- 6 debug-only commits (production debugging)

Provides recommendations to prevent future cycles:
1. Pick one architecture and stick with it
2. Add unit tests before fixes
3. Token budget (<2000 for instructions)
4. One format only (remove alternative parsers)
5. Integration test script
6. Separate concerns into smaller modules
7. Design doc before code changes
8. CI/CD with automated testing

* docs: add comprehensive agent guidelines

AGENT_WORKER.md (600+ lines):
- Pre-flight checklist: token budget, test plan, design doc
- Coding rules: TDD, no debug code, architecture consistency
- Git workflow: branching strategy, commit rules, release process
- Testing requirements: unit (≥80%), integration structure
- Code quality: PEP 8, type hints, max 50 lines per function
- Architecture: no feature flags, separation of concerns
- Continuous learning: research requirements, documentation
- Forbidden patterns: bare except, production debugging, etc.

AGENT_REVIEW.md (400+ lines):
- Review philosophy: prevent circular development
- 6-phase review checklist: structure, quality, tokens, architecture, research, logic
- Report format with token impact analysis
- Severity levels: blocking vs warnings vs approved
- Common issues with examples (good vs bad)
- Review workflow: 30-35 min per PR
- Reports stored in reports/ folder (gitignored)

Also added:
- tests/test_tool_parsing.py - example test following guidelines
- Updated DEVELOPMENT_PATTERNS.md with recommendations

Reports folder in .gitignore for local review storage

* chore: gitignore review reports folder

* feat: fix tool execution and enhance instructions with accurate token counting

- Enhanced tool instructions (1041 tokens, within 2000 budget)
- Added tiktoken>=0.5.0 for accurate token counting
- Fixed subprocess hang by adding stdin=subprocess.DEVNULL
- Removed 9 DEBUG print statements from routes.py
- Added tests for instruction content and token budget verification
- All tests pass (11/11)

Resolves blockers from previous review:
- Token budget verified ✓
- Token documentation added ✓
- Debug code cleaned ✓
- Missing tests added ✓
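The subprocess hang fix above can be sketched as follows; `run_command` and the exact limits are illustrative (the 30s timeout and 3000-char cap come from the earlier commit notes), not the project's actual code:

```python
import subprocess

def run_command(cmd: str, timeout: int = 30) -> str:
    # stdin=DEVNULL stops commands that wait on stdin from hanging the server
    result = subprocess.run(
        cmd,
        shell=True,
        stdin=subprocess.DEVNULL,
        capture_output=True,
        text=True,
        timeout=timeout,  # raises subprocess.TimeoutExpired on overrun
    )
    return (result.stdout + result.stderr)[:3000]  # cap output for readability
```

Without `stdin=subprocess.DEVNULL`, an interactive command (e.g. one prompting for input) inherits the server's stdin and blocks until the timeout.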

* feat: implement comprehensive tool system with proper logging

Major improvements to tool instructions and execution:
- Enhanced tool instructions with 7-step task completion workflow
- Added markdown code block fallback parser for tool calls
- Fixed subprocess hang with stdin=subprocess.DEVNULL
- Fixed streaming path to return tool_calls (enabling multi-turn conversations)
- Added complete React project creation example with verification steps
- Token count: 1,743 tokens (within 2,000 limit)

Logging infrastructure:
- Created centralized logging configuration (src/utils/logging_config.py)
- Replaced 80+ print statements with logger.debug()
- Set log level to DEBUG for development
- All modules now use proper logging instead of print

Testing:
- Added 4 new tests for markdown parsing and instruction content
- All 13 tests passing
- Token budget verification test

Documentation:
- Added comprehensive design docs for all major changes
- Added test plans for verification
- Created helper scripts for logging migration

Files changed:
- main.py: Added logging setup
- src/api/routes.py: Tool instructions, streaming fixes, logging
- src/tools/executor.py: subprocess fix, logging
- src/utils/: New logging configuration module
- tests/test_tool_parsing.py: New tests
- docs/: Design decisions and test plans
- scripts/: Helper scripts for development

* refactor: simplify tool instructions to 109 tokens for 7B model

Reduced from 1,743 tokens to 109 tokens (94% reduction) to help
qwen2.5 7B 4bit model follow instructions better.

Changes:
- Removed complex workflow documentation
- Removed multi-turn conversation examples
- Removed lengthy anti-patterns
- Kept only essential format and rules
- Updated tests to match simplified content

Before: 1,743 tokens, 6,004 chars (87% of budget)
After: 109 tokens, 392 chars (5.5% of budget)

This should make it much easier for smaller models to:
1. Understand they must use tools
2. Follow the simple TOOL: format
3. Not get overwhelmed by instructions

* refactor: make tool instructions ultra-direct for 7B models

Further simplify instructions to prevent model from adding explanations.

Before: 109 tokens - model still added explanatory text
After: 86 tokens - ultra-direct commands

Key changes:
- Start with 'You MUST use tools. DO NOT explain.'
- 'OUTPUT THIS EXACT FORMAT - NOTHING ELSE'
- Removed all examples and pleasantries
- Added 'NEVER' rules in all caps
- 'ONLY output TOOL: lines'

The model was outputting:
'1. First, install... TOOL: bash ARGUMENTS: {...}'

Now should output just:
'TOOL: bash
ARGUMENTS: {...}'

This should force the 7B qwen model to stop explaining and just execute.

* refactor: move tool instructions to external config file

Moves hardcoded tool instructions from routes.py to external config file
for better maintainability and easier editing.

Changes:
- Created config/prompts/tool_instructions.txt
- Added _load_tool_instructions() function with caching
- Falls back to default if config file not found
- Updated tests to use the loader function
- Added proper error handling

Benefits:
- Easier to modify instructions without code changes
- Instructions can be edited by non-developers
- Cleaner separation of config vs code
- Supports hot-reloading (cached but easy to invalidate)

Token count: 86 tokens (loaded from file)
Location: config/prompts/tool_instructions.txt
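A minimal sketch of such a cached loader with a built-in fallback; the names and default text are illustrative, and the real `_load_tool_instructions()` may differ:

```python
from pathlib import Path

# Fallback used when the config file is missing (illustrative text)
_DEFAULT_INSTRUCTIONS = "Use tools to execute commands. Output only tool calls."
_cache: dict[str, str] = {}

def load_tool_instructions(path: str = "config/prompts/tool_instructions.txt") -> str:
    if path not in _cache:
        try:
            _cache[path] = Path(path).read_text(encoding="utf-8").strip()
        except OSError:
            # File not found or unreadable: fall back to the built-in default
            _cache[path] = _DEFAULT_INSTRUCTIONS
    return _cache[path]
```

Invalidating the cache (e.g. `_cache.clear()`) is enough to pick up edits without a restart, which is the "easy to invalidate" property mentioned above.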

* refactor: simplify tool instructions further and add debug logging

- Reduced instructions to bare minimum: 50 tokens
- Added debug logging to verify instructions are sent
- Removed all caps and aggressive language
- Made instructions more straightforward

Instructions now:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'

This should be easier for 7B models to follow while still
conveying the essential requirements.

* feat: improve tool parser to handle 7B model output variations

Enhanced parse_tool_calls() with multiple fallback strategies:

1. Standard TOOL:/ARGUMENTS: format (original)
2. Markdown code blocks (``` fenced blocks)
3. Numbered list items (1. npm install ...)
4. Standalone bash commands (npm, npx, mkdir, etc.)

Now handles messy output from small models like:
'1. Install: npm install -g create-react-app'
'2. Create: create-react-app hello-world'

Parses these into chained bash commands for execution.

Also simplified instructions to a bare-minimum 50 tokens:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'

This combination should make 7B models much more likely to
have their output successfully parsed and executed.

* fix: improve command extraction for 7B model output

Parser now extracts bash commands from any line containing:
- npm, npx, mkdir, cd, ls, cat, echo, git, python, pip, node, yarn
- create-react-app (added for React projects)

Example: Extracts 'npm install -g create-react-app' from:
'1. Install: npm install -g create-react-app'

Chains multiple commands with && for sequential execution.

This should now successfully parse the numbered list output
from 7B models and execute the commands.
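The numbered-list extraction described above could look roughly like this; the command list and regex are a sketch, not the project's actual parser:

```python
import re

# Command words the extractor recognizes (from the commit notes above)
KNOWN_COMMANDS = ("npm", "npx", "mkdir", "cd", "ls", "cat", "echo", "git",
                  "python", "pip", "node", "yarn", "create-react-app")

def extract_commands(text: str) -> str:
    commands = []
    for line in text.splitlines():
        # Strip list prefixes like "1. Install: " before matching
        stripped = re.sub(r"^\s*(?:\d+\.\s*)?(?:\w+:\s*)?", "", line).strip()
        parts = stripped.split()
        if parts and parts[0] in KNOWN_COMMANDS:
            commands.append(stripped)
    # Chain with && so commands run sequentially and stop on first failure
    return " && ".join(commands)
```

For example, `"1. Install: npm install -g create-react-app"` yields the bare `npm install -g create-react-app` command.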

* feat: add bash tool description validation and improve 7B model parsing

Changes:
- Added _ensure_tool_arguments() function to inject 'description' field
- Updated tool_instructions.txt to require description for bash tool
- Improved 7B model command extraction with better regex patterns
- Added 'create-react-app' to command detection list
- Updated delta field type to Dict[str, Any] for streaming
- Added GGUF to MLX quantization mapping for registry.py
- Clarified agent responsibilities in AGENT_REVIEW.md and AGENT_WORKER.md

Fixes:
- Bash tool now validates required 'description' field
- 7B model output parsed more reliably (numbered lists)
- Multiple commands chained with && for sequential execution

Token count: 69 tokens (down from 86, -19.8%)

All tests pass: 13/13

* feat: add webfetch tool support with URL extraction

Changes:
- Added webfetch to tool instructions config
- Added URL extraction pattern to parse_tool_calls()
- Parser now recognizes URLs and creates webfetch tool calls
- Updated token count: 89 tokens (+29% from 69)

The webfetch tool is available through opencode environment.
System prompt adjustment enables model to use it for URL fetching.

Token budget: 89 tokens (4.45% of 2000 limit)
Tests pass: 13/13
2026-02-24 22:35:05 +01:00


Development Patterns Analysis

Circular Development Issues Identified

1. Tool Execution Architecture (15+ commits going in circles)

The Cycle:

Add server-side tool execution → Fix looping issues → Remove/simplify instructions 
→ Tools don't work → Add tool host → Return tool_calls to client (reversal) 
→ Execute server-side again (reversal back) → Fix parsing → Simplify format 
→ Enhance instructions → Add streaming support → Fix streaming format...

Commits showing the cycle:

  • 00cd483 - Add server-side tool execution
  • df4587e - Fix: prevent looping (checking for server-side results)
  • c70f83a - Fix: simplify looping prevention
  • 1b181bf - Fix: remove tool instructions (40k → 0 tokens)
  • bad8732 - Fix: simplify to ~300 tokens
  • 12eaac0 - Add distributed tool host
  • b7fc184 - REVERSAL: Return tool_calls to opencode (not server-side)
  • f83e6fc - REVERSAL BACK: Execute via tool executor
  • aa137b6 - Fix: handle tool_calls as single object or array
  • 539ca21 - Simplify format to TOOL:/ARGUMENTS: pattern
  • aabd2b2 - Enhance instructions for multi-step operations

Root Cause: No clear architectural decision on:

  • Who executes tools? (Server vs Client)
  • What format? (JSON vs text patterns vs markdown)
  • When to add instructions? (Always vs first request vs never)

2. Tool Instruction Token Count (4 changes)

40,000 tokens → 300 tokens → removed → enhanced (unknown count)

Problem: No testing to validate if instructions actually work.

3. Tool Parsing (8+ fixes)

Multiple commits fixing the same parsing issues:

  • c5b8196 - Parse nested JSON in arguments
  • 76b12b3 - Parse JavaScript-style output
  • 9d838c1 - Handle markdown code blocks
  • e3701cf - Extract content before tool_calls block
  • aa137b6 - Handle single object or array
  • 539ca21 - Simplify to TOOL:/ARGUMENTS: pattern

Problem: No unit tests for parsing. Each fix only handles one case.

4. Streaming + Tools (4 commits)

Disable streaming when tools present → Add to streaming path → Fix SSE format

Problem: Two completely different code paths that diverge and need separate fixes.

5. Debugging Commits (6 commits)

Commits that only add debug logging:

  • e0c500e - "very visible request/response logging"
  • 25b675c - "explicit logging for tool executor configuration"
  • 27e1971 - "response logging to both paths"
  • e3eb52d - "log message state"
  • 13e6fb2 - "add logging to tool call parsing"
  • 3039629 - "log request.tools"

Problem: Debugging in production instead of having tests.

Why This Happens

1. No Tests

  • Impact: Every change requires manual testing
  • Result: Fixes break other cases, regressions common
  • Evidence: 25+ commits fixing tool-related issues

2. Production Debugging

  • Pattern: Add debug logging → Fix → Remove debug logging
  • Commits: e0c500e, 3728eb7 (add then clean up)
  • Better: Unit tests with mocked LLM responses

3. Architectural Ambiguity

  • Question: Who owns tool execution?
  • Server-side: Better for simple providers
  • Client-side: Better for complex opencode integration
  • Actual: Switched back and forth 3+ times

4. Feature Interaction Complexity

  • Tools + Streaming = Two paths to maintain
  • Tools + Federation = Distributed execution complexity
  • Tools + Different formats = Parsing nightmare

5. Unclear Requirements

  • Should instructions be in system prompt or user prompt?
  • How many tokens is acceptable?
  • What format should tools return?

Recommendations to Prevent This

Immediate (Prevents Next Cycle)

  1. Pick One Architecture

    • Decision: Server-side execution via tool executor
    • Document why in ARCHITECTURE.md
  2. Token Budget

    • Max 2000 tokens for tool instructions
    • Test with actual 16K context models
    • Never exceed 50% of context window
  3. One Format Only

    • Standardize on: TOOL: name\nARGUMENTS: {"key": "value"}
    • Remove all other parsing code
    • Single regex pattern
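Assuming the TOOL:/ARGUMENTS: format above, a single-regex parser could look like this (a sketch; the project's actual `parse_tool_calls` signature may differ):

```python
import json
import re

# One pattern for the one supported format: TOOL: name\nARGUMENTS: {json}
TOOL_RE = re.compile(r"TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{.*?\})", re.DOTALL)

def parse_tool_calls(text: str):
    tools = []
    for name, args in TOOL_RE.findall(text):
        try:
            arguments = json.loads(args)
        except json.JSONDecodeError:
            continue  # skip malformed JSON instead of guessing
        tools.append({"function": {"name": name, "arguments": arguments}})
    # Whatever is left after removing tool calls is plain content
    content = TOOL_RE.sub("", text).strip()
    return content, tools
```

One pattern means one place to test and one place to fix, which is the point of this recommendation.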
  4. Add Unit Tests

    # test_tool_parsing.py
    # Import from wherever parse_tool_calls is defined in this codebase
    from src.api.routes import parse_tool_calls

    def test_parse_simple_tool():
        text = "TOOL: read\nARGUMENTS: {\"filePath\": \"test.txt\"}"
        content, tools = parse_tool_calls(text)
        assert len(tools) == 1
        assert tools[0]["function"]["name"] == "read"

    def test_parse_no_tool():
        text = "Just a regular response"
        content, tools = parse_tool_calls(text)
        assert len(tools) == 0
        assert content == text

    def test_parse_multiple_tools():
        # Use valid JSON arguments; "{...}" placeholders would fail JSON parsing
        text = (
            "TOOL: read\nARGUMENTS: {\"filePath\": \"a.txt\"}\n\n"
            "TOOL: write\nARGUMENTS: {\"filePath\": \"b.txt\", \"content\": \"hi\"}"
        )
        content, tools = parse_tool_calls(text)
        assert len(tools) == 2
  5. Integration Test Script

    # test_tools.sh
    python main.py --auto --test-tools
    # Tests: read file → write file → bash command
    # Exits with error code if any fail
    
  6. Simplify Tool Instructions

    • Current: ~300 tokens with 5 examples
    • Target: ~100 tokens with 2 examples
    • Include: read, write only (bash is obvious)
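The token budget above can be enforced mechanically. The commit history mentions tiktoken for accurate counting; this sketch falls back to a rough chars/4 estimate when tiktoken is not installed:

```python
def count_tokens(text: str) -> int:
    # Accurate count via tiktoken when available; rough estimate otherwise
    try:
        import tiktoken
        return len(tiktoken.get_encoding("cl100k_base").encode(text))
    except ImportError:
        return max(1, len(text) // 4)

TOKEN_BUDGET = 2000  # max tokens for tool instructions

def check_budget(instructions: str) -> bool:
    return count_tokens(instructions) <= TOKEN_BUDGET
```

Wiring `check_budget` into a unit test makes the budget a CI gate rather than a convention.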

Medium-term

  1. Separate Concerns

    src/tools/
    ├── parser.py      # Only parsing logic
    ├── executor.py    # Only execution logic  
    ├── formatter.py   # Only formatting instructions
    └── integration.py # Only API integration
    
  2. Design Doc Before Code

    • For tool system changes, write 1-page design first
    • Include: format, token count, examples, test plan
    • Get it right on paper before coding
  3. Feature Flags

    # config.py
    USE_SERVER_SIDE_TOOLS = True  # Can toggle without code changes
    TOOL_INSTRUCTION_VERSION = "v2"  # A/B test formats
    

Long-term

  1. CI/CD Pipeline

    • Run tests on every PR
    • Block merge if tests fail
    • Include: unit tests, integration tests, token count check
  2. Observability

    • Structured logging (not print statements)
    • Metrics: tool success rate, parsing errors, latency
    • Dashboard to see issues before users report them
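A central logging setup along these lines would replace scattered print statements; the commit history mentions src/utils/logging_config.py, but this exact layout is assumed, not verified:

```python
import logging

def setup_logging(level: int = logging.DEBUG) -> None:
    # One handler on the root logger; every module inherits it
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        "%(asctime)s %(name)s %(levelname)s %(message)s"))
    root = logging.getLogger()
    root.handlers[:] = [handler]  # replace any prior handlers
    root.setLevel(level)

# Each module then does:
#   logger = logging.getLogger(__name__)
#   logger.debug("parsed %d tool calls", len(tools))
```

Per-module loggers named with `__name__` make it trivial to raise the level for noisy modules without touching code.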

Current State Assessment

Good:

  • Tool executor abstraction exists
  • Distributed tool execution works
  • Working directory handling improved
  • Timeout handling for package managers

Needs Work:

  • Too many parsing code paths (simplify to one)
  • Instructions too long (reduce to <2000 tokens)
  • No automated testing
  • Debug logging still in production code

Suggested Immediate Actions

  1. Merge current cleanup branch (already done ✓)
  2. Remove all but one parsing format (done ✓)
  3. Reduce tool instructions to <2000 tokens (done ✓)
  4. Add unit tests for tool parsing (done ✓)
  5. Add integration test for tool execution

Success Metrics

  • Tool-related commits stabilize to <2 per month
  • Zero "fix: prevent looping" commits
  • All tool changes include tests
  • Instructions stay under 2000 tokens