* feat: enhanced tool instructions for multi-step operations
- Add comprehensive examples for ls, find, grep, mkdir, npm init, etc.
- Explain multi-step workflow (explore → read → write)
- Tool system already supports chaining via conversation history
- Bash tool supports: ls, find, grep, cat, mkdir, cd, npm, etc.
- 30 second timeout on commands
- Output limited to 3000 chars for readability
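A minimal sketch of the execution behavior these bullets describe - timeout plus output cap - assuming a helper like the following (names and structure are illustrative, not the project's actual executor):

```python
import subprocess
from typing import Optional

TIMEOUT_SECONDS = 30     # per the commit: 30 second timeout on commands
MAX_OUTPUT_CHARS = 3000  # per the commit: output limited to 3000 chars

def run_bash(command: str, cwd: Optional[str] = None) -> str:
    """Run a shell command with the timeout and output cap described above."""
    try:
        result = subprocess.run(
            command, shell=True, cwd=cwd,
            capture_output=True, text=True,
            timeout=TIMEOUT_SECONDS,
        )
    except subprocess.TimeoutExpired:
        return f"Error: command timed out after {TIMEOUT_SECONDS}s"
    output = result.stdout + result.stderr
    if len(output) > MAX_OUTPUT_CHARS:
        # Truncate for readability, noting how much was dropped.
        dropped = len(output) - MAX_OUTPUT_CHARS
        output = output[:MAX_OUTPUT_CHARS] + f"\n... [truncated {dropped} chars]"
    return output
```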
* Cleanup: Consolidate documentation and tidy codebase
Documentation:
- Consolidate 6 markdown files into simplified README.md
- Remove redundant docs: TODO.md, NETWORK.md, REVIEW.md, PLAN.md, CONTEXT.md, GUIDE.md
- Add ARCHITECTURE.md with clean technical overview
- README now focuses on quick start and core concepts
Code verification:
- Verified blocking I/O properly wrapped in asyncio.to_thread()
- Confirmed locks initialized correctly in backends
- AMD VRAM detection uses proper regex (takes max value, not first match)
- All exception handling uses 'except Exception:' (not bare except)
Tool execution improvements (existing changes):
- Better working directory handling with project root detection
- Extended timeouts for package managers (300s)
- Multi-tool call parsing support
- Improved error handling and logging
Note: system prompt flagged as a concern - at ~30k tokens it cannot fit in 16-32k context windows
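As a hedged illustration of the asyncio.to_thread() pattern verified above (not the project's actual backend code), blocking file I/O is offloaded to a worker thread so the event loop stays responsive:

```python
import asyncio

def read_model_file(path: str) -> bytes:
    # Blocking I/O: fine in a thread, never directly on the event loop.
    with open(path, "rb") as f:
        return f.read()

async def load_weights(path: str) -> bytes:
    # The wrapper the cleanup commit verified: run blocking work off-loop.
    return await asyncio.to_thread(read_model_file, path)
```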
* docs: add development patterns analysis
Document circular development issues identified in commit history:
- Tool execution went back-and-forth 3+ times (server-side vs client-side)
- Tool instructions ping-ponged from 40k tokens → 300 tokens → removed → enhanced again
- 8+ parsing fixes for the same issues (no tests)
- 6 debug-only commits (production debugging)
Provides recommendations to prevent future cycles:
1. Pick one architecture and stick with it
2. Add unit tests before fixes
3. Token budget (<2000 for instructions)
4. One format only (remove alternative parsers)
5. Integration test script
6. Separate concerns into smaller modules
7. Design doc before code changes
8. CI/CD with automated testing
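Recommendations 2 and 3 combine naturally into a single guard; a sketch of such a test, assuming the instructions live in a config file (a later commit in this log puts them at this path):

```python
from pathlib import Path

import tiktoken

TOKEN_BUDGET = 2000  # the "<2000 for instructions" limit from recommendation 3

def test_tool_instructions_within_budget():
    # Path is an assumption borrowed from a later commit in this log.
    instructions = Path("config/prompts/tool_instructions.txt").read_text()
    enc = tiktoken.get_encoding("cl100k_base")
    assert len(enc.encode(instructions)) < TOKEN_BUDGET
```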
* docs: add comprehensive agent guidelines
AGENT_WORKER.md (600+ lines):
- Pre-flight checklist: token budget, test plan, design doc
- Coding rules: TDD, no debug code, architecture consistency
- Git workflow: branching strategy, commit rules, release process
- Testing requirements: unit (≥80%), integration structure
- Code quality: PEP 8, type hints, max 50 lines per function
- Architecture: no feature flags, separation of concerns
- Continuous learning: research requirements, documentation
- Forbidden patterns: bare except, production debugging, etc.
AGENT_REVIEW.md (400+ lines):
- Review philosophy: prevent circular development
- 6-phase review checklist: structure, quality, tokens, architecture, research, logic
- Report format with token impact analysis
- Severity levels: blocking vs warnings vs approved
- Common issues with examples (good vs bad)
- Review workflow: 30-35 min per PR
- Reports stored in reports/ folder (gitignored)
Also added:
- tests/test_tool_parsing.py - example test following guidelines
- Updated DEVELOPMENT_PATTERNS.md with recommendations
Reports folder in .gitignore for local review storage
* chore: gitignore review reports folder
* feat: fix tool execution and enhance instructions with accurate token counting
- Enhanced tool instructions (1041 tokens, within 2000 budget)
- Added tiktoken>=0.5.0 for accurate token counting
- Fixed subprocess hang by adding stdin=subprocess.DEVNULL
- Removed 9 DEBUG print statements from routes.py
- Added tests for instruction content and token budget verification
- All tests pass (11/11)
Resolves blockers from previous review:
- Token budget verified ✓
- Token documentation added ✓
- Debug code cleaned ✓
- Missing tests added ✓
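The subprocess hang fix is, in essence, one argument: without redirecting stdin, a child process that tries to read input blocks forever under a server. A sketch, not the literal diff:

```python
import subprocess

result = subprocess.run(
    ["npm", "install"],
    stdin=subprocess.DEVNULL,  # the fix: the child can never wait on stdin
    capture_output=True,
    text=True,
    timeout=300,  # extended timeout for package managers (see earlier commit)
)
```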
* feat: implement comprehensive tool system with proper logging
Major improvements to tool instructions and execution:
- Enhanced tool instructions with 7-step task completion workflow
- Added markdown code block fallback parser for tool calls
- Fixed subprocess hang with stdin=subprocess.DEVNULL
- Fixed streaming path to return tool_calls (enabling multi-turn conversations)
- Added complete React project creation example with verification steps
- Token count: 1,743 tokens (within 2,000 limit)
Logging infrastructure:
- Created centralized logging configuration (src/utils/logging_config.py)
- Replaced 80+ print statements with logger.debug()
- Set log level to DEBUG for development
- All modules now use proper logging instead of print
Testing:
- Added 4 new tests for markdown parsing and instruction content
- All 13 tests passing
- Token budget verification test
Documentation:
- Added comprehensive design docs for all major changes
- Added test plans for verification
- Created helper scripts for logging migration
Files changed:
- main.py: Added logging setup
- src/api/routes.py: Tool instructions, streaming fixes, logging
- src/tools/executor.py: subprocess fix, logging
- src/utils/: New logging configuration module
- tests/test_tool_parsing.py: New tests
- docs/: Design decisions and test plans
- scripts/: Helper scripts for development
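A plausible shape for the new logging module - hedged, since only the path src/utils/logging_config.py and the DEBUG level are stated above:

```python
# src/utils/logging_config.py (illustrative sketch)
import logging

def setup_logging(level: int = logging.DEBUG) -> None:
    """Configure root logging once at startup; main.py calls this."""
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s %(message)s",
    )

# In each module, replacing the old print() calls:
logger = logging.getLogger(__name__)
logger.debug("parsed %d tool calls", 2)
```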
* refactor: simplify tool instructions to 109 tokens for 7B model
Reduced from 1,743 tokens to 109 tokens (94% reduction) to help the
qwen2.5 7B 4-bit model follow instructions better.
Changes:
- Removed complex workflow documentation
- Removed multi-turn conversation examples
- Removed lengthy anti-patterns
- Kept only essential format and rules
- Updated tests to match simplified content
Before: 1,743 tokens, 6,004 chars (87% of budget)
After: 109 tokens, 392 chars (5.5% of budget)
This should make it much easier for smaller models to:
1. Understand they must use tools
2. Follow the simple TOOL: format
3. Not get overwhelmed by instructions
* refactor: make tool instructions ultra-direct for 7B models
Further simplify instructions to prevent model from adding explanations.
Before: 109 tokens - model still added explanatory text
After: 86 tokens - ultra-direct commands
Key changes:
- Start with 'You MUST use tools. DO NOT explain.'
- 'OUTPUT THIS EXACT FORMAT - NOTHING ELSE'
- Removed all examples and pleasantries
- Added 'NEVER' rules in all caps
- 'ONLY output TOOL: lines'
The model was outputting:
'1. First, install... TOOL: bash ARGUMENTS: {...}'
Now should output just:
'TOOL: bash
ARGUMENTS: {...}'
This should force the 7B qwen model to stop explaining and just execute.
* refactor: move tool instructions to external config file
Moves hardcoded tool instructions from routes.py to external config file
for better maintainability and easier editing.
Changes:
- Created config/prompts/tool_instructions.txt
- Added _load_tool_instructions() function with caching
- Falls back to default if config file not found
- Updated tests to use the loader function
- Added proper error handling
Benefits:
- Easier to modify instructions without code changes
- Instructions can be edited by non-developers
- Cleaner separation of config vs code
- Supports hot-reloading (cached but easy to invalidate)
Token count: 86 tokens (loaded from file)
Location: config/prompts/tool_instructions.txt
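A sketch of the loader; the function name, file location, and fallback behavior come from the commit, while the caching mechanism shown here is an assumption:

```python
from functools import lru_cache
from pathlib import Path

INSTRUCTIONS_PATH = Path("config/prompts/tool_instructions.txt")
FALLBACK_INSTRUCTIONS = "Use tools to execute commands. Output only tool calls."

@lru_cache(maxsize=1)
def _load_tool_instructions() -> str:
    """Load instructions from config, falling back to a built-in default."""
    try:
        return INSTRUCTIONS_PATH.read_text(encoding="utf-8")
    except OSError:
        return FALLBACK_INSTRUCTIONS

# Hot-reload after editing the file:
# _load_tool_instructions.cache_clear()
```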
* refactor: simplify tool instructions further and add debug logging
- Reduced instructions to bare minimum: 50 tokens
- Added debug logging to verify instructions are sent
- Removed all caps and aggressive language
- Made instructions more straightforward
Instructions now:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'
This should be easier for 7B models to follow while still
conveying the essential requirements.
* feat: improve tool parser to handle 7B model output variations
Enhanced parse_tool_calls() with multiple fallback strategies:
1. Standard TOOL:/ARGUMENTS: format (original)
2. Markdown code blocks (``` fenced)
3. Numbered list items (1. npm install ...)
4. Standalone bash commands (npm, npx, mkdir, etc.)
Now handles messy output from small models like:
'1. Install: npm install -g create-react-app'
'2. Create: create-react-app hello-world'
Parses these into chained bash commands for execution.
Also simplified instructions to 50 tokens minimum:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'
This combination should make 7B models much more likely to
have their output successfully parsed and executed.
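A compressed sketch of the cascade, covering strategies 1 and 3 only (regexes are illustrative; the real parse_tool_calls() handles more cases):

```python
import json
import re
from typing import Any, Dict, List

# Note: like the doc's TOOL_PATTERN, this does not handle nested braces.
TOOL_RE = re.compile(r"TOOL:\s*(\w+)\s*ARGUMENTS:\s*(\{.*?\})", re.DOTALL)
NUMBERED_RE = re.compile(r"^\s*\d+\.\s*(?:\w+:\s*)?(.+)$", re.MULTILINE)

def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    # Strategy 1: the canonical TOOL:/ARGUMENTS: format.
    calls = []
    for name, args in TOOL_RE.findall(text):
        try:
            calls.append({"tool": name, "arguments": json.loads(args)})
        except json.JSONDecodeError:
            continue  # skip malformed arguments rather than failing the turn
    if calls:
        return calls
    # Strategy 3: salvage commands from numbered-list output.
    commands = [m.strip() for m in NUMBERED_RE.findall(text)]
    if commands:
        return [{"tool": "bash", "arguments": {"command": " && ".join(commands)}}]
    return []
```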
* fix: improve command extraction for 7B model output
Parser now extracts bash commands from any line containing:
- npm, npx, mkdir, cd, ls, cat, echo, git, python, pip, node, yarn
- create-react-app (added for React projects)
Example: Extracts 'npm install -g create-react-app' from:
'1. Install: npm install -g create-react-app'
Chains multiple commands with && for sequential execution.
This should now successfully parse the numbered list output
from 7B models and execute the commands.
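Under the hood this amounts to keyword-triggered line scanning; the command list below is taken from the commit, everything else is an illustrative sketch:

```python
import re

KNOWN_COMMANDS = (
    "npm", "npx", "mkdir", "cd", "ls", "cat", "echo",
    "git", "python", "pip", "node", "yarn", "create-react-app",
)
# Match from the first known command word to the end of the line.
CMD_RE = re.compile(r"\b(?:%s)\b.*$" % "|".join(map(re.escape, KNOWN_COMMANDS)))

def extract_commands(text: str) -> str:
    commands = []
    for line in text.splitlines():
        match = CMD_RE.search(line)
        if match:
            commands.append(match.group(0).strip())
    # Chain with && so commands run in order and stop on the first failure.
    return " && ".join(commands)

# extract_commands("1. Install: npm install -g create-react-app")
# -> "npm install -g create-react-app"
```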
* feat: add bash tool description validation and improve 7B model parsing
Changes:
- Added _ensure_tool_arguments() function to inject 'description' field
- Updated tool_instructions.txt to require description for bash tool
- Improved 7B model command extraction with better regex patterns
- Added 'create-react-app' to command detection list
- Updated delta field type to Dict[str, Any] for streaming
- Added GGUF to MLX quantization mapping for registry.py
- Clarified agent responsibilities in AGENT_REVIEW.md and AGENT_WORKER.md
Fixes:
- Bash tool now validates required 'description' field
- 7B model output parsed more reliably (numbered lists)
- Multiple commands chained with && for sequential execution
Token count: 69 tokens (down from 86, -19.8%)
All tests pass: 13/13
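A sketch of the injection guard; the function name and the required 'description' field come from the commit, the default value is invented:

```python
from typing import Any, Dict

def _ensure_tool_arguments(tool: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    """Backfill required fields the model omitted."""
    if tool == "bash" and "description" not in arguments:
        # Hypothetical default: summarize the command itself.
        arguments["description"] = arguments.get("command", "run command")[:80]
    return arguments
```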
* feat: add webfetch tool support with URL extraction
Changes:
- Added webfetch to tool instructions config
- Added URL extraction pattern to parse_tool_calls()
- Parser now recognizes URLs and creates webfetch tool calls
- Updated token count: 89 tokens (+29% from 69)
The webfetch tool is available through the opencode environment.
The system prompt adjustment enables the model to use it for URL fetching.
Token budget: 89 tokens (4.45% of 2000 limit)
Tests pass: 13/13
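The URL-extraction pattern plausibly looks like this; the regex and function are illustrative, and only the webfetch tool name and URL-to-tool-call behavior come from the commit:

```python
import re
from typing import Any, Dict, List

URL_RE = re.compile(r"https?://[^\s)\"'<>]+")

def extract_webfetch_calls(text: str) -> List[Dict[str, Any]]:
    """Turn bare URLs in model output into webfetch tool calls."""
    return [
        {"tool": "webfetch", "arguments": {"url": url}}
        for url in URL_RE.findall(text)
    ]
```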
# Agent Reviewer Rules

> ⚠️ IMPORTANT: This document is for REVIEW AGENTS who handle commits, PRs, and code reviews. Regular agents follow AGENT_WORKER.md for implementation tasks and DO NOT make commits.
## Review Philosophy

Mission: Prevent the circular development patterns identified in commit history.

Standards:
- Reject code that doesn't meet the quality bar
- Ask for tests; don't accept "I'll add them later"
- Check token counts for prompt changes
- Verify architectural consistency
- Demand clear error messages

Reviewer authority:
- Can block a PR for: missing tests, token bloat, architecture violations
- Cannot approve own code
- Must provide constructive feedback with specific fixes
## Review Checklist

### Phase 1: Structure & Hygiene (Block if failed)

- [ ] Branch naming follows convention
  - Format: `type/description` (e.g., `fix/tool-parsing`)
  - Not: `quick-fix`, `temp-branch`, `dev`
- [ ] Commit messages are clear
  - Format: `type(scope): description`
  - No: `fix stuff`, `WIP`, `asdf`, `omg finally`
  - Each commit should be reviewable independently
- [ ] No production debugging code
  - Search for: `print(`, `console.log`, `debugger`, `TODO`, `FIXME`, `XXX`
  - Check: No commented-out code blocks
  - Check: No temporary files committed
- [ ] Git history is clean
  - No "fix typo" commits after initial commit
  - No "WIP" commits in PR
  - No merge commits (rebase instead)
  - Squash fixup commits
### Phase 2: Code Quality (Block if failed)

- [ ] Tests exist and pass
  - Unit tests for new functions
  - Integration tests for API changes
  - Run: `pytest -v` (must pass)
  - Coverage: ≥80% for new code
  - BLOCKING: No tests = No merge
- [ ] Type hints present
  - All function parameters typed
  - All return values typed
  - Run: `mypy src/` (must pass with zero errors)
- [ ] No code smells
  - No functions > 50 lines
  - No files > 300 lines
  - No indentation > 3 levels deep
  - No circular imports
  - No duplicate code (>3 lines copied)
- [ ] Error handling is robust
  - No bare `except:` clauses
  - All errors have clear messages
  - No silent failures
  - Edge cases handled
- [ ] Documentation is adequate
  - All public functions have docstrings
  - Complex logic has inline comments
  - README updated if user-facing change
  - Architecture doc updated if pattern changes
### Phase 3: Token Budget (Block if failed)

For any prompt/instruction changes:

- [ ] Token count documented
  - Before: X tokens
  - After: Y tokens
  - Change: +/- Z tokens
- [ ] Within budget
  - System prompt + instructions ≤ 2000 tokens (HARD LIMIT)
  - Leaves ≥ 50% of the context window for user input
  - BLOCKING: Over budget = Request reduction
- [ ] Efficient wording
  - No redundant examples
  - No verbose explanations
  - Prefer code over prose

Token counting command:

```bash
# Count tokens in a string
echo "Your prompt here" | python -c "import sys; import tiktoken; enc = tiktoken.get_encoding('cl100k_base'); print(len(enc.encode(sys.stdin.read())))"
```
### Phase 4: Architecture (Block if failed)

- [ ] Consistent with ARCHITECTURE.md
  - No new patterns without updating docs
  - No mixing of concerns
  - Follows existing module structure
- [ ] No architecture changes in fixes
  - Bug fixes should not refactor
  - Refactors should be separate PRs
  - Exception: If a fix requires an architecture change, document WHY
- [ ] Parser rules
  - Only ONE parser per format
  - No alternative parsing paths
  - Clear regex patterns
  - Handles all documented cases
- [ ] No feature flags in core
  - Code should not have `if config.get("ENABLE_X"):`
  - Pick one approach, remove the old one
  - A/B testing only in a separate branch
### Phase 5: Research & Continuous Learning

For significant changes (>100 lines or new algorithms):

- [ ] Research documented
  - Check the `research/` folder for related findings
  - PR description mentions alternatives considered
  - Links to sources (docs, papers, repos)
  - Not: "I thought this would work"
  - Yes: "Based on [source], this approach handles [case] better than [alternative]"
- [ ] Best practices followed
  - Implementation matches current language/framework conventions
  - No deprecated patterns
  - Modern Python features used appropriately (3.9+)
- [ ] No reinvention
  - Check if the standard library solves the problem
  - Check if a well-maintained package exists
  - If a custom implementation is needed, document WHY

Research documentation requirements:

```markdown
## Research
- Alternatives considered: [list]
- Sources: [links]
- Decision: [why chosen approach]
- Benchmarks: [if applicable]
```
### Phase 6: Logic Correctness

- [ ] Logic is sound
  - Read through the code
  - Check edge cases
  - Verify error conditions
  - Question anything unclear
- [ ] No performance regressions
  - No blocking I/O in async functions (unless wrapped)
  - No memory leaks
  - No N+1 queries
  - Reasonable algorithmic complexity
- [ ] Security check
  - No SQL injection vectors
  - No command injection (bash execution sanitized)
  - Path traversal protection (for file ops)
  - No secrets in code
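Two of these security checks can be made concrete; a hedged sketch (not the project's code) of path traversal protection and shell-injection avoidance:

```python
import shlex
import subprocess
from pathlib import Path

def safe_path(root: Path, user_path: str) -> Path:
    """Reject paths that escape the project root (path traversal check)."""
    resolved = (root / user_path).resolve()
    if not resolved.is_relative_to(root.resolve()):
        raise ValueError(f"Path escapes project root: {user_path!r}")
    return resolved

def run_without_shell(command: str) -> subprocess.CompletedProcess:
    """Pass argv instead of shell=True so metacharacters aren't interpreted."""
    return subprocess.run(shlex.split(command), capture_output=True, text=True)
```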
## Review Report Format

After review, write a report to `reports/PR-{number}-{branch}.md`:

```markdown
# Review Report: PR #{number} - {branch}

**Reviewer:** {your name}
**Date:** {YYYY-MM-DD}
**Status:** [APPROVED / CHANGES_REQUESTED / BLOCKED]

## Summary
Brief description of what this PR does and overall quality assessment.

## Detailed Findings

### ✅ Passed
- [List items that passed review]
- [Be specific: "Tests cover 85% of new code"]

### ⚠️ Warnings (Non-blocking)
- [Minor issues that don't block merge]
- [Style suggestions]
- [Future improvements]

### ❌ Blockers (Must fix)
1. **[Category]** [Specific issue]
   - **Location:** `file.py:123`
   - **Problem:** [What's wrong]
   - **Fix:** [Exactly what to change]
   - **Why:** [Why this matters]
2. **[Category]** [Specific issue]
   - ...

## Token Impact Analysis
- Component: [what changed]
- Before: [X] tokens
- After: [Y] tokens
- Impact: [+/- Z] tokens
- Within budget: [Yes/No]

## Test Coverage
- New code coverage: [X]%
- Tests pass: [Yes/No]
- Integration tests: [Present/Missing]

## Architecture Review
- Follows existing patterns: [Yes/No]
- Introduces new dependencies: [List if any]
- Breaking changes: [Yes/No - explain if yes]

## Research Review
- Alternatives considered: [Listed/None]
- Sources cited: [Yes/No]
- Best practices followed: [Yes/No]
- Research documented: [Yes/No - location]

## Code Quality Score
- Structure: [0-10]
- Testing: [0-10]
- Documentation: [0-10]
- Logic: [0-10]
- **Overall: [0-10]**

## Action Items
- [ ] [Specific fix needed]
- [ ] [Specific fix needed]
- [ ] [Test to add]

## Verdict
[APPROVED / CHANGES_REQUESTED / BLOCKED]

**If CHANGES_REQUESTED:**
- Address all blockers
- Re-request review when ready

**If BLOCKED:**
- Major issues require architecture discussion
- Schedule meeting before continuing
```
## Severity Levels

### 🔴 BLOCKING (Cannot merge)
- Missing tests for new functionality
- Token budget exceeded
- Bare `except:` clauses
- Production debugging code (`print` statements)
- Breaking changes without documentation
- Security vulnerabilities
- Tests failing
- Type check errors
- Architecture violations

### 🟡 CHANGES_REQUESTED (Fix before merge)
- Unclear variable names
- Missing docstrings
- Inefficient algorithms
- Missing error handling
- Unclear commit messages
- Minor style issues

### 🟢 APPROVED (Optional suggestions)
- Style preferences
- Future improvements
- Optional refactors
## Common Issues to Watch For

### Issue 1: Tool Parsing Duplication

```python
# ❌ WRONG - Multiple parsers
def parse_tools_v1(text): ...
def parse_tools_v2(text): ...
def parse_tools_legacy(text): ...

# ✅ CORRECT - Single parser
TOOL_PATTERN = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
```

Check: Search for `def parse` - there should be ONE per format.

### Issue 2: Token Bloat

```python
# ❌ WRONG - Too verbose
SYSTEM_PROMPT = """
You are an AI assistant. Here are detailed instructions...
[2000 words of explanation]
[10 examples]
"""

# ✅ CORRECT - Concise
SYSTEM_PROMPT = """Use TOOL: name\nARGUMENTS: {...} format. Available: read, write, bash."""
```

Check: Count tokens, verify < 2000.

### Issue 3: Architecture Drift

```python
# ❌ WRONG - Mixing concerns in one file
# src/api/routes.py
def handle_request(): ...
def parse_tools(): ...
def execute_tool(): ...
def format_response(): ...

# ✅ CORRECT - Separated
# src/api/routes.py - only HTTP handling
# src/tools/parser.py - only parsing
# src/tools/executor.py - only execution
```

Check: Each module has ONE responsibility.

### Issue 4: Debug Code Left In

```python
import logging

# ❌ WRONG
def process(data):
    print(f"DEBUG: data={data}")  # REMOVE THIS
    result = transform(data)
    print(f"DEBUG: result={result}")  # REMOVE THIS
    return result

# ✅ CORRECT
logger = logging.getLogger(__name__)

def process(data):
    logger.debug("Processing data", extra={"data_size": len(data)})
    return transform(data)
```

Check: `grep -r "print(" src/ --include="*.py" | grep -v "^#"`

### Issue 5: Missing Error Context

```python
# ❌ WRONG
raise ValueError("Invalid input")

# ✅ CORRECT
raise ValueError(
    f"Invalid model format: '{model_str}'. "
    "Expected: 'name:size:quant' (e.g., 'qwen:7b:q4')"
)
```

Check: All errors explain what was expected vs received.
## Review Workflow

1. **First Pass: Structure (5 min)**
   - Check branch name, commits, no debug code
   - If failed → write report, BLOCK
2. **Second Pass: Quality (10 min)**
   - Run tests, check types, review code
   - If failed → write report, CHANGES_REQUESTED
3. **Third Pass: Deep Dive (15 min)**
   - Read logic, check edge cases
   - Verify token counts
   - Check architecture
   - Write detailed report
4. **Final Decision (5 min)**
   - APPROVE / CHANGES_REQUESTED / BLOCK
   - Write report to the `reports/` folder
   - Post summary in PR comments

Total time per review: 30-35 minutes
## Reviewer Self-Check

Before submitting a review:

- [ ] I ran all tests locally
- [ ] I checked type hints
- [ ] I counted tokens (if applicable)
- [ ] I read every line of changed code
- [ ] My feedback is specific and actionable
- [ ] I explained WHY for each blocker
- [ ] I wrote a report to the `reports/` folder
## Escalation

Escalate to architecture discussion if:
- PR changes core patterns
- Token budget cannot be met
- Two reviewers disagree
- Breaking changes proposed

Don't just approve to be nice. Don't let technical debt accumulate.
## Report Storage

All reports go in the `reports/` folder:

```
reports/
├── PR-123-fix-tool-parsing.md
├── PR-124-add-federation.md
├── PR-125-refactor-consensus.md
└── README.md   # Index of all reviews
```

This folder is gitignored - reports stay local.

Generate the index with:

```bash
ls -1 reports/PR-*.md | sort -t'-' -k2 -n > reports/README.md
```

Remember: You're the last line of defense against technical debt. Be thorough, be kind, be strict.