Commit Graph

130 Commits

sleepy 414cb444f3 fix: integrate federation with tool execution loop
- Federation was returning directly without executing tools
- Now federation is used for initial generation (iteration 1)
- Tool execution loop still runs for all iterations
- Subsequent iterations use local swarm (for tool result processing)
- This fixes federation + tools not working together
- All 41 tests passing
2026-02-25 23:06:37 +01:00
sleepy 34b28597ff fix: peers in federation mode should not generate tool calls
- Added _strip_tool_instructions() to remove tool instructions from federation prompts
- Peer nodes now only generate text responses, not tool calls
- Head node is the only one that handles tool execution
- This prevents peers from generating tool calls that can't be executed
- Fixes federation + tools incompatibility
- All 41 tests passing
2026-02-25 22:46:15 +01:00
sleepy 67122052b4 Merge branch 'fix/tool-instructions-permission' 2026-02-25 22:39:00 +01:00
sleepy e7b826da4e docs: update README with current features and remove outdated docs
- Removed old design docs and test plans from docs/ directory
- Updated TODO section to reflect completed improvements
- Added section on Recent Improvements with detailed changelog
- Updated Federation description to explain objective quality voting
- Added federation vote endpoint to API endpoints list
- Clarified universal tool support and OpenCode streaming compatibility
- All changes ready for main branch merge
2026-02-25 22:38:46 +01:00
sleepy 3799240d74 fix: head node objectively judges all responses using quality metrics
- Removed biased self-reported confidence voting
- Head node now collects ALL responses and scores them objectively
- Uses quality scoring (length, structure, completeness) to compare
- Shows quality scores for all nodes so user can see comparison
- Prevents overconfident small models from beating better large models
- 3B models will only win if they actually produce better quality output
- All 41 tests passing
2026-02-25 22:24:00 +01:00
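The kind of objective quality heuristic this commit describes might look like the sketch below. The function name, weights, and individual signals (length, list/heading structure, clean sentence endings) are illustrative assumptions, not the project's actual scoring code:

```python
def score_response(text: str) -> float:
    """Heuristic quality score combining length, structure, and completeness."""
    if not text.strip():
        return 0.0
    # Length: reward longer answers, capped at ~2000 chars.
    length_score = min(len(text) / 2000.0, 1.0)
    # Structure: reward list items, headings, and code fences.
    structured = sum(
        1 for ln in text.splitlines()
        if ln.lstrip().startswith(("-", "*", "#", "```"))
    )
    structure_score = min(structured / 5.0, 1.0)
    # Completeness: penalize answers that appear cut off mid-sentence.
    ends_cleanly = text.rstrip().endswith((".", "!", "?", "`", ")"))
    completeness_score = 1.0 if ends_cleanly else 0.5
    return 0.5 * length_score + 0.3 * structure_score + 0.2 * completeness_score
```

Scoring all collected responses with one shared function like this is what lets the head node compare a 3B model's output against a larger model's on equal terms, instead of trusting self-reported confidence.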
sleepy e0d04ae664 fix: use actual consensus confidence for peers instead of hardcoded 0.8
- Federation endpoint was hardcoding confidence: 0.8 for all peer responses
- Local swarm uses actual calculated confidence (often 1.0 for single worker)
- This created an unfair bias toward local responses
- Now uses result.confidence from actual consensus calculation
- Peers and local now compete on equal footing
- All 41 tests passing

2026-02-25 22:21:19 +01:00
sleepy 896e9d6d9b fix: store swarm_manager in app.state for federation endpoint
- Added app.state.swarm_manager = self.swarm_manager in lifespan
- Federation endpoint reads from request.app.state.swarm_manager
- This fixes 'Swarm not ready' error when peers try to request generation
- All 41 tests passing
2026-02-25 22:11:06 +01:00
sleepy e2b0af7636 fix: add missing federation /v1/federation/vote endpoint
- Added POST /v1/federation/vote endpoint to handle peer generation requests
- Peers were discovering each other but requests had no endpoint to hit
- Endpoint generates using local swarm and returns vote results
- Logs federation requests for debugging
- All 41 tests passing
2026-02-25 22:05:37 +01:00
sleepy 5b29e15c0a fix: prevent path hallucination - read files directly without ls first
- Changed instructions to read files directly instead of verifying with ls first
- Added explicit warning against placeholder paths like '/path/to/file'
- Model now uses paths exactly as user provides them
- Should fix issues with hallucinated paths like '/path/to/my-secret.log'
- All 41 tests passing
2026-02-25 21:42:25 +01:00
sleepy 8431717235 fix: stronger instruction for bash ls results to read files immediately
- Changed bash ls instruction from 'SUMMARIZE' to 'CRITICAL: ... READ THE FILE immediately'
- Now explicitly tells model to NOT summarize first, but immediately read the file
- Uses stronger language: 'you MUST immediately USE THE read TOOL NOW'
- This should fix the loop where model keeps running ls instead of reading
- All 41 tests passing
2026-02-25 21:20:48 +01:00
sleepy 06df3c8dab fix: allow absolute and ~ paths to access files outside working directory
- Security check now only applies to relative paths
- If user specifies absolute path (/path/to/file) or tilde path (~/.bashrc), allow it
- Relative paths (like file.txt) are still restricted to working directory
- This fixes 'Access denied - path outside working directory' for valid user-specified paths
- Applied to both read and write tools
- All 41 tests passing
2026-02-25 21:13:02 +01:00
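A minimal sketch of the check this commit describes, for POSIX-style paths (function name hypothetical): absolute and `~` paths are trusted as explicit user intent, while relative paths are resolved and must stay under the working directory:

```python
import os


def is_path_allowed(user_path: str, working_dir: str) -> bool:
    """Relative paths must stay inside working_dir; absolute and ~ paths pass."""
    if user_path.startswith(("/", "~")):
        return True  # user explicitly named a location outside the sandbox
    # Resolve symlinks and '..' before comparing against the sandbox root.
    resolved = os.path.realpath(os.path.join(working_dir, user_path))
    root = os.path.realpath(working_dir)
    return resolved == root or resolved.startswith(root + os.sep)
```

Note the `realpath` on both sides: comparing unresolved strings would let `../` sequences or symlinks escape the working directory.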
sleepy ab7cf7e9aa fix: expand tildes (~) to home directory in tool paths
- Added os.path.expanduser() to _execute_read for both file_path and working_dir
- Added os.path.expanduser() to _execute_write for both file_path and working_dir
- Added os.path.expanduser() to _execute_bash for cwd parameter
- This fixes paths like '~/Documents/file.txt' being treated literally
- Now correctly resolves to '/Users/username/Documents/file.txt'
- All 41 tests passing
2026-02-25 20:54:31 +01:00
sleepy 49a6d99bf8 CRITICAL FIX: fix indentation bug that prevented tool results from being added to history
- The for loop was only executing the first line (tool_call_id assignment)
- All the tool message creation code was outside the loop due to wrong indentation
- This caused tool results to never be added to conversation history
- Model would loop infinitely calling ls because it never saw the tool results
- Fixed indentation so all tool result processing is inside the for loop
- This should finally fix the infinite loop issue!
- All 41 tests passing
2026-02-25 20:49:30 +01:00
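The shape of this bug is worth seeing concretely. A schematic reconstruction (function and key names hypothetical): in the buggy version only the first statement sits inside the `for` loop, so the history append runs once, after the loop, using only the last call's data:

```python
def append_tool_results_buggy(history, tool_calls, results):
    for call, result in zip(tool_calls, results):
        tool_call_id = call["id"]
    history.append({  # dedented by mistake: runs once, for the last call only
        "role": "tool",
        "tool_call_id": tool_call_id,
        "content": results[-1],
    })


def append_tool_results(history, tool_calls, results):
    # Fixed: everything that builds a tool message sits inside the loop.
    for call, result in zip(tool_calls, results):
        history.append({
            "role": "tool",
            "tool_call_id": call["id"],
            "content": result,
        })
```

With results missing (or collapsed into one) in the conversation history, the model never saw the output of its `ls` call and kept re-issuing it, producing the infinite loop described above.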
sleepy 586c113688 fix: smarter bash tool instructions - guide model to read files after verification
- Updated bash tool result instructions to detect verification commands (ls/grep)
- If ls/grep shows file exists and user asked to READ it: explicitly tells model to USE read TOOL NOW
- If user asked to check files: tells model to summarize the listing
- If file not found: tells model to inform user
- Prevents infinite loops of repeated ls commands
- Model now properly transitions from verification → action → answer
- All 41 tests passing
2026-02-25 20:39:55 +01:00
sleepy a09d23156b feat: universal tool support - inject instructions by default, add plan mode TODO, improve file handling
1. Tool instructions now ALWAYS injected by default:
   - Removed condition that only injected on first request
   - Any client (Continue, hollama) can now use tools without client-side setup
   - Added check to avoid duplicating instructions if already present

2. Updated tool instructions with file verification guidance:
   - Added 'FILE OPERATIONS - ALWAYS VERIFY FIRST' section
   - Instructs to use 'ls' and 'grep' to verify files exist before reading
   - Prevents blind file reads on non-existent paths

3. Added TODO to README:
   - Plan mode feature (disable tool execution for planning-only conversations)
   - Current status section showing what's implemented

4. Working directory extraction from prompts:
   - New _extract_working_dir_from_prompt() function
   - Extracts paths from patterns like 'in /path/to/dir', 'under /path/to/dir'
   - Validates paths exist before using
   - Falls back to auto-detection if not found in prompt
   - All 41 tests passing
2026-02-25 20:37:23 +01:00
sleepy c46684f03e fix: explicit tool result instructions to guide model response
- Changed vague 'Provide your final answer now' to specific per-tool instructions
- read: 'READ THIS FILE CONTENT ALOUD to the user'
- write: 'CONFIRM to the user that the file was created'
- bash: 'SUMMARIZE the output above to answer the user's request'
- Other tools: 'Use the result shown above to answer the user's request'
- Format tool result message with clear 'Tool Result (name):' header and explicit instruction
- This should fix models ignoring tool results or giving generic responses
- All 41 tests passing
2026-02-25 20:25:05 +01:00
sleepy bd3579737a feat: add detailed tool execution logging
- Log full message history before calling model after tool execution
- Shows each message's role, truncated content, tool calls, and tool_call_id associations
- Logs token count and full prompt (first 1000 chars) at DEBUG level
- Helps diagnose why models might be ignoring tool results
- All 41 tests passing
2026-02-25 20:17:55 +01:00
sleepy 886ebbdb81 fix: proper OpenAI tool call format with tool_call_id linking
- Uncommented tool_call_id and name fields in ChatMessage model
- Modified tool execution to assign unique IDs to each tool call
- Assistant messages now include tool_calls array with proper ID, type, function
- Tool response messages now include tool_call_id and name to link to the call
- Each tool execution gets its own separate tool message (not combined)
- This ensures the model properly associates tool results with tool calls
- Should fix issues where models ignore tool results due to missing associations
- Updated _execute_tools to return List[tuple] instead of combined string
- Added List import to typing
- All 41 tests still passing
2026-02-25 20:12:40 +01:00
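The linking described here follows the standard OpenAI chat format: the assistant message carries a `tool_calls` array with an `id`, and the subsequent `tool` message echoes that id as `tool_call_id`. A minimal sketch of one such pair (helper name and id scheme are illustrative):

```python
import uuid


def build_tool_turn(name: str, arguments_json: str, result: str) -> list[dict]:
    """One assistant tool_calls message plus its linked tool result message."""
    call_id = f"call_{uuid.uuid4().hex[:8]}"
    assistant_msg = {
        "role": "assistant",
        "content": None,
        "tool_calls": [{
            "id": call_id,
            "type": "function",
            "function": {"name": name, "arguments": arguments_json},
        }],
    }
    tool_msg = {
        "role": "tool",
        "tool_call_id": call_id,  # links the result back to the call
        "name": name,
        "content": result,
    }
    return [assistant_msg, tool_msg]
```

Emitting one separate `tool` message per call (rather than one combined string) is what lets the model match each result to the call that produced it.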
sleepy a0d3ae9d4f fix: OpenCode-compatible streaming format with reasoning_content
- Fixed thinking capture: use parsed_content (without tool call) instead of full response
- _stream_response now correctly emits reasoning_content before tool_calls
- Tool calls streamed with proper multi-chunk format: id+name (empty args), then arguments, then finish_reason
- Final answers sent as content with finish_reason=stop
- Used setattr to dynamically attach _thinking to response object
- ChatLogger already in place for debugging
- This should now work correctly with OpenCode's Vercel AI SDK integration
2026-02-25 20:03:55 +01:00
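The multi-chunk streaming format mentioned above can be sketched as a generator of SSE lines: first a delta announcing the call id and name with empty arguments, then the arguments, then a chunk carrying `finish_reason=tool_calls`, then the `[DONE]` terminator. Field layout follows the OpenAI chunk schema; the function itself is an illustrative reconstruction, not the project's `_stream_response`:

```python
import json


def stream_tool_call_chunks(call_id: str, name: str, arguments: dict):
    """Yield SSE data lines for one tool call in the multi-chunk delta format."""
    def sse(delta, finish=None):
        chunk = {
            "id": "chatcmpl-demo",
            "object": "chat.completion.chunk",
            "choices": [{"index": 0, "delta": delta, "finish_reason": finish}],
        }
        return f"data: {json.dumps(chunk)}\n\n"

    # 1) announce the call: id + name, empty arguments
    yield sse({"tool_calls": [{"index": 0, "id": call_id, "type": "function",
                               "function": {"name": name, "arguments": ""}}]})
    # 2) stream the arguments
    yield sse({"tool_calls": [{"index": 0,
                               "function": {"arguments": json.dumps(arguments)}}]})
    # 3) close with finish_reason=tool_calls, then the stream terminator
    yield sse({}, finish="tool_calls")
    yield "data: [DONE]\n\n"
```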
sleepy a0571c83a3 feat: implement OpenCode-compatible streaming format and enhance chatlogging
- Implement proper streaming with reasoning_content field for thinking blocks
- Stream tool_calls in multi-chunk format matching Vercel AI SDK
- Capture thinking content and send as reasoning_content before tool_calls
- Update _create_response to store thinking on response._thinking for streaming
- ChatLogger now logs assistant messages with thinking blocks when tool calls present
- Added json import in chat_handlers for tool arguments parsing
- All streaming code uses OpenCode-compatible SSE format
2026-02-25 19:57:38 +01:00
sleepy 46f14b2b53 feat: add chatlogger for tool execution debugging - logs to chatlog.md when LOCAL_SWARM_CHATLOG=1 2026-02-25 19:52:52 +01:00
sleepy 42a176f1d8 fix: update tool instructions to require file operations and prevent refusals
- Changed from hesitant 'use only when necessary' to mandatory 'you WILL use tools'
- Explicitly forbid refusal for file read/write operations
- Add 'NO explanations' and 'NO markdown' requirements (for test compliance)
- Provide clear examples for read/write tool usage
- Addresses issue where model says 'cannot read files or assist with file creation'
2026-02-25 19:41:16 +01:00
sleepy dcca89d89a fix: OpenAI API compatibility for hollama and other clients
- Fixed ChatMessage.tool_calls to be Optional with default None (excluded when empty)
- Added logprobs field to ChatCompletionChoice (always included as null)
- Added stats and system_fingerprint to ChatCompletionResponse
- Fixed streaming response to use delta format (not message format)
- Fixed non-streaming response to include logprobs: null
- Updated tool instructions to include 'NO explanations'
- Added pytest-asyncio markers to async tests
- All 41 tests passing

This fixes the 'Cannot read properties of undefined (reading content)' error in hollama and ensures compatibility with OpenAI clients.
2026-02-25 19:39:05 +01:00
sleepy b9ce5db8ef docs: update architecture and README with new modular structure
Updated documentation to reflect the recent refactoring:

README.md:
- Added detailed project structure with line counts
- Added Architecture Principles section
- Added Development section with code quality standards
- Added section about recent refactoring work

ARCHITECTURE.md:
- Added complete project structure tree
- Added Architecture Principles section
- Detailed all modules and their responsibilities
- Added Configuration Files section
- Added Code Quality Standards section

DEVELOPMENT_PATTERNS.md:
- Added Refactoring Success section
- Documented all changes made
- Listed architecture principles established
- Updated success metrics with checkmarks
2026-02-25 13:31:24 +01:00
sleepy 1acebbc6a2 refactor(models): extract memory calculations and config from selector
Changes:
- selector.py: 486 → 329 lines (-32%)
- Extracted memory calculation functions to memory_calculator.py
- Extracted constants to selector_config.json
- Updated selector.py to load config and import from memory_calculator
- All 35 tests pass
2026-02-25 13:23:47 +01:00
sleepy 32049c766c refactor(models): extract hardcoded data to JSON configs
Extracted from registry.py (437 → 194 lines):
- config/models/mlx_quant_sizes.json - MLX quantization VRAM sizes
- config/models/gguf_quant_sizes.json - GGUF quantization VRAM sizes
- config/models/model_metadata.json - Model metadata

Registry now loads from JSON files instead of hardcoded data.
All 35 tests pass.
2026-02-25 13:20:29 +01:00
sleepy a82c73d05d refactor(swarm): add orchestrator module for swarm generation
Created swarm/orchestrator.py with:
- SwarmOrchestrator class for managing generation across workers
- Methods for single, parallel, and sequential generation
- Response filtering utilities

Preparation for breaking down manager.py into smaller modules.
All 35 tests pass.
2026-02-25 13:14:47 +01:00
sleepy bdc8db9678 refactor(interactive): create modular structure for interactive module
Created interactive/ package with:
- ui.py: Menu display and UI utilities
- display.py: Hardware and resource display functions
- tips.py: Tips and help content
- config_utils.py: Configuration selection utilities

Preparing to refactor main interactive.py to use these modules.
All 35 tests pass.
2026-02-25 13:13:21 +01:00
sleepy 0134ccae53 refactor(cli): extract server runner to reduce main_runner to 285 lines
- Created cli/server_runner.py (94 lines)
- main_runner.py reduced from 320 to 285 lines (under 300 limit)
- Separated server startup logic from main runner
- All 35 tests pass
2026-02-25 12:58:52 +01:00
sleepy 4ea36783d6 refactor(cli): break down main.py into modular CLI components
Extracted main.py (556 lines) into focused modules:
- cli/parser.py: Argument parsing (151 lines)
- cli/main_runner.py: Main application logic (320 lines)
- cli/test_runner.py: Test mode runner (81 lines)
- cli/tool_server.py: Tool server runner (69 lines)
- utils/network.py: Network utilities (IP detection)

main.py is now 99 lines (down from 556).
All 35 tests pass.

Note: main_runner.py at 320 lines is slightly over 300 limit,
will address in subsequent refactoring.
2026-02-25 12:57:28 +01:00
sleepy 6ab726b46c refactor(api): extract formatting, parsing, and handlers from routes
Extracted large monolithic routes.py (1183 lines) into focused modules:
- api/formatting.py: Message formatting and tool instructions
- api/tool_parser.py: Tool call parsing from various formats
- api/chat_handlers.py: Chat completion business logic
- utils/token_counter.py: Centralized token counting utilities
- utils/project_discovery.py: Shared project root discovery

routes.py is now 252 lines (under 300 limit).
All 35 tests pass.
Eliminated code duplication for _discover_project_root.

Refs previous review report findings on modularity
2026-02-25 12:53:27 +01:00
sleepy d22c52ec04 docs: Add minimal, maintainable, modular code requirements
- AGENT_WORKER.md: Added Rule 3 for minimal, maintainable, modular code
- AGENT_REVIEW.md: Added strict enforcement check in Phase 2
- Emphasizes single responsibility, clean interfaces, and production quality
- Reviewers must block code that doesn't meet these standards
2026-02-25 12:30:18 +01:00
sleepy 5fa8cd4e0e fix: Correct streaming implementation syntax
- Fixed indentation in routes.py streaming code
- Real-time streaming now properly structured
- All syntax errors resolved
2026-02-25 12:25:19 +01:00
sleepy 2c46d48004 feat: Add real-time streaming for tools
Streams the assistant's thinking and tool calls back to opencode immediately:
- Sends content chunks as they're generated
- Parses and sends tool_calls deltas incrementally
- Doesn't execute tools server-side
- Allows opencode to show progress during generation

Note: Real implementation requires fixing syntax errors in routes.py
2026-02-25 12:10:49 +01:00
sleepy 0945cee162 feat: Add webfetch tool support
- Add _execute_webfetch method to ToolExecutor
- Add webfetch to _execute_local tool list
- Update tool_instructions.txt to include webfetch
- Supports text/markdown/html formats
- 30s timeout for web requests
- Import asyncio for async HTTP handling
2026-02-25 12:02:36 +01:00
sleepy 58e4b2c645 feat: Add tokens/sec tracking to streaming output
- Track timing during streaming to calculate t/s
- Estimate tokens from characters (4 chars/token)
- Display t/s in stream completion message
- Remove debug logging from worker
2026-02-25 11:55:27 +01:00
sleepy 929f069d14 Add debug logging to trace prompt sizes in worker 2026-02-25 11:54:57 +01:00
sleepy bdcb013d6b feat: Aggressive token compression for initial opencode requests
- Detect initial requests (no assistant/tool messages)
- If >4000 tokens, compress aggressively:
  - Keep only user messages
  - Truncate to 2000 chars if needed
  - Replace huge system prompts with minimal instructions
- Log compression stats (original vs final token count)
- Maintains tool functionality while saving ~28k tokens

This allows 16k context models to work with opencode without overflow.
2026-02-25 11:51:24 +01:00
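The compression strategy above can be sketched as a single pass over the message list (function name and the minimal system prompt text are illustrative; the 4000-token trigger and 2000-char truncation come from the commit):

```python
def compress_initial_request(messages: list[dict], count_tokens,
                             limit: int = 4000) -> list[dict]:
    """On an oversized initial request, keep only (truncated) user messages."""
    # "Initial" means no assistant or tool turns exist yet.
    is_initial = not any(m["role"] in ("assistant", "tool") for m in messages)
    total = sum(count_tokens(m.get("content") or "") for m in messages)
    if not is_initial or total <= limit:
        return messages
    # Replace the huge client system prompt with a minimal instruction.
    minimal_system = {"role": "system",
                      "content": "You are a helpful coding assistant."}
    users = [
        {**m, "content": (m.get("content") or "")[:2000]}  # truncate long turns
        for m in messages if m["role"] == "user"
    ]
    return [minimal_system] + users
```

Only compressing initial requests matters: once assistant and tool messages exist, dropping them would destroy the tool-call history that multi-turn execution depends on.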
sleepy 9fdc3a6d02 docs: Update README with --use-opencode-tools flag documentation
Add documentation for the new tool mode options:
- Default local tool server mode (~125 tokens)
- Optional --use-opencode-tools flag (~27k tokens)

Helps users understand the token trade-off between modes.
2026-02-25 11:35:00 +01:00
sleepy c18c20487c feat: Add configurable tool mode to save tokens
- Add --use-opencode-tools flag to main.py
- Default: local tool server mode (~125 tokens, saves ~27k tokens)
- Optional: opencode tools mode (~27k tokens, full tool definitions)
- Create .opencodeignore to exclude large docs from context
- Update design doc with token bloat analysis

This allows users to choose between:
- Local tool server: Minimal tool instructions, saves 27k tokens
- Opencode tools: Full tool definitions, more robust but expensive
2026-02-25 11:31:48 +01:00
sleepy 1d1d7b4468 feat(server): Disable access logs to reduce noise
Changed uvicorn log_level from info to warning and disabled access_log
to suppress the flood of GET /health requests from federation peers.
2026-02-25 03:08:43 +01:00

sleepy 4f2b9252c4 fix(status_monitor): Stop spamming 'Workers Idle' messages
The status monitor was printing 'Workers Idle' every 2 seconds even when
nothing changed. This caused terminal spam and conflicted with mDNS logs.

Now it only shows status when workers are actually generating, and clears
the display when they become idle.
2026-02-25 02:39:09 +01:00
sleepy 3dbc76de04 fix(registry): Update MLX model registry with verified HuggingFace repositories
- Fix DeepSeek Coder: Only 4bit available, 1.3b has no quantizations
- Fix CodeLlama: Use correct 'hf-{quant}bit-mlx' suffix naming
- Fix StarCoder2: 3b/7b only have 4bit, 15b has 4bit/8bit
- Add DeepSeek Coder V2 Lite: New model with 4/6/8bit support
- Update repository naming for all MLX models to match actual HF repos

Verified against HuggingFace mlx-community organization (2025-02-25)
2026-02-25 02:34:34 +01:00
sleepy af2d616f76 fix: Add verbose mDNS logging and diagnostics endpoint
- Add detailed logging for mDNS service discovery
- Log when services are added/removed
- Add diagnostics endpoint at /v1/federation/diagnostics
- Better visibility into why peers aren't discovered
- Keep IP binding to 192.168.x.x as requested
2026-02-25 01:51:59 +01:00
sleepy 1ac32c7ec3 feat: Add global tokens/sec reporting and reduce log level to INFO
- Add global t/sec metric that includes sync + voting overhead
- Track total time from start to finish across all workers
- Display global performance summary after federation completes
- Reduce default logging level from DEBUG to INFO
- Add tokens_generated to federation API responses
- Update federation vote to report peer t/sec metrics

This allows users to see both individual worker speeds and the
effective speed including synchronization overhead.
2026-02-25 01:44:15 +01:00
sleepy d33fa406b6 feat: CUDA/Android support and federation metrics (#7)
* optimize(federation): run local and peer generation in parallel

Previously, the federation waited for local generation to complete
before asking peers to generate. This wasted time since peers sat
idle while the host generated.

Now local swarm and all peers generate simultaneously:
- Fire local generation AND peer requests at the same time
- Wait for all to complete with asyncio.gather()
- Then run global consensus

This reduces total generation time from ~2x to ~1x when using
federation with multiple nodes.

Changes:
- Modified generate_with_federation() to run tasks in parallel
- Updated logging to reflect parallel execution
- Added proper error handling for local generation failures

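The parallel pattern described here is the classic `asyncio.gather()` fan-out. A minimal sketch under assumed names (`local_generate`, `peer.request_generation` stand in for the project's real calls); `return_exceptions=True` keeps one failed peer from sinking the whole round:

```python
import asyncio


async def generate_with_federation(local_generate, peers, prompt):
    """Fire local generation and all peer requests at once, then gather."""
    tasks = [asyncio.create_task(local_generate(prompt))]
    tasks += [asyncio.create_task(p.request_generation(prompt)) for p in peers]
    results = await asyncio.gather(*tasks, return_exceptions=True)
    local_result, peer_results = results[0], results[1:]
    if isinstance(local_result, Exception):
        raise local_result  # local failure is fatal; peer failures are skipped
    valid_peers = [r for r in peer_results if not isinstance(r, Exception)]
    return local_result, valid_peers
```

Total wall time becomes roughly the slowest single node instead of local-then-peers in sequence, which is the ~2x to ~1x improvement claimed above.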
* feat(federation): add federation support to streaming path

Previously, federation only worked with non-streaming requests.
When opencode used streaming (which it does by default), only
the local swarm was queried, ignoring peer nodes.

Now when federation is enabled and peers exist:
- Start federation generation in background (parallel)
- Stream from local swarm immediately
- Log federation results when complete

This enables federation to work with opencode and other
streaming clients while maintaining fast streaming response.

Also added webfetch instructions to prevent hallucinating URLs.

Changes:
- Modified streaming path to detect and use federation
- Added asyncio import
- Updated tool instructions to prevent URL hallucination

* fix(federation): wait for consensus and use federated result in streaming

Changed federation in streaming mode to:
- Wait for ALL nodes to complete generation
- Use the consensus result (not just local)
- Stream the federated response to client

This ensures voting from all nodes is properly considered.

Previous implementation streamed locally while federation ran
in background for logging only, which ignored the consensus.

* fix(federation): properly stream federated response

The federation case was setting the response but not returning
a StreamingResponse, so nothing was sent back to the client.

Added proper streaming generator for federation results that:
- Sends role chunk
- Streams content in chunks
- Sends final [DONE] chunk

This fixes the issue where opencode only saw local node output.

* feat(federation): add winner tracking and token usage reporting

- Track which node won the consensus voting (local or peer name)
- Add winner to FederationResult dataclass
- Log winner in server logs
- Calculate and report token usage in federation streaming
- Fix prompt_tokens calculation in streaming path

Now opencode will show:
- Context tokens used
- Which node won the vote (in logs)

* fix(federation): parse tool calls from federated response

Federation now properly handles tools:
- Removed 'not has_tools' condition so federation works with tools
- Added tool call parsing for federated responses
- Returns proper tool_calls delta with finish_reason=tool_calls
- Falls through to content streaming when no tool calls

This fixes opencode issue where federation was skipped
when tools were present.

* fix(federation): fix token count scope issue in generators

The async generators couldn't access the token count variables
because they were in the outer function scope. Fixed by:
- Calculating token counts inside each generator function
- Using separate local variable names to avoid scope issues
- Both tool_calls and content streaming now work correctly

* config(federation): increase peer timeout from 30s to 60s

Federation client timeout determines how long to wait for
peer responses before giving up and falling back to local result.

Changed from 30s to 60s to give peers more time to respond
especially on slower networks or machines.

* feat(federation): add CUDA/Android support and peer metrics tracking

Changes:
- GPU layer auto-configuration based on hardware detection
  - Offload all layers for Apple Silicon
  - Configure NVIDIA layers based on GPU count and compute capability
  - Add GPU device count and compute capability tracking

- Android platform detection
  - Detect Android via environment variables and file paths
  - Check /proc/sys/kernel/osrelease for kernel version
  - Normalize Android file paths (~ expansion, /sdcard alternatives)
  - Android-specific paths in hardware/qualcomm.py

- Federation metrics tracking
  - Add PeerMetrics dataclass with success rate, avg latency, error tracking
  - Track total requests, successful requests, failed requests
  - Record last error with timestamp
  - Add success_rate property (auto-calculated)

- Peer-specific timeout configuration
  - Add timeout_seconds to PeerInfo dataclass
  - Use peer-specific timeout in FederationClient requests
  - Use aiohttp.ClientTimeout for proper timeout handling
  - Track request start time for accurate latency calculation

- Comprehensive tests
  - test_hardware_detector.py: 14 test cases for GPU detection and Android
  - test_federation_metrics.py: 13 test cases for metrics and timeouts
  - All 35 tests pass (100% pass rate)

- Documentation
  - Add TODO.md with CUDA/Android implementation status
  - Document known issues and recommendations
  - Testing checklist and implementation priorities

Token impact: No prompt changes
Tests: 35/35 passing

Resolves federation timeout and observability issues.
2026-02-25 00:53:07 +01:00
sleepy 580d1e5d17 feat: comprehensive tool system improvements and webfetch support (#3)
* feat: enhanced tool instructions for multi-step operations

- Add comprehensive examples for ls, find, grep, mkdir, npm init, etc.
- Explain multi-step workflow (explore → read → write)
- Tool system already supports chaining via conversation history
- Bash tool supports: ls, find, grep, cat, mkdir, cd, npm, etc.
- 30 second timeout on commands
- Output limited to 3000 chars for readability

* Cleanup: Consolidate documentation and tidy codebase

Documentation:
- Consolidate 6 markdown files into simplified README.md
- Remove redundant docs: TODO.md, NETWORK.md, REVIEW.md, PLAN.md, CONTEXT.md, GUIDE.md
- Add ARCHITECTURE.md with clean technical overview
- README now focuses on quick start and core concepts

Code verification:
- Verified blocking I/O properly wrapped in asyncio.to_thread()
- Confirmed locks initialized correctly in backends
- AMD VRAM detection uses proper regex (takes max value, not first match)
- All exception handling uses 'except Exception:' (not bare except)

Tool execution improvements (existing changes):
- Better working directory handling with project root detection
- Extended timeouts for package managers (300s)
- Multi-tool call parsing support
- Improved error handling and logging

Note: system prompt size remains a concern - 30k tokens is too large for 16-32k context windows

* docs: add development patterns analysis

Document circular development issues identified in commit history:
- Tool execution went back-and-forth 3+ times (server-side vs client-side)
- Tool instructions swung from 40k tokens → 300 tokens → removed → enhanced
- 8+ parsing fixes for same issues (no tests)
- 6 debug-only commits (production debugging)

Provides recommendations to prevent future cycles:
1. Pick one architecture and stick with it
2. Add unit tests before fixes
3. Token budget (<2000 for instructions)
4. One format only (remove alternative parsers)
5. Integration test script
6. Separate concerns into smaller modules
7. Design doc before code changes
8. CI/CD with automated testing

* docs: add comprehensive agent guidelines

AGENT_WORKER.md (600+ lines):
- Pre-flight checklist: token budget, test plan, design doc
- Coding rules: TDD, no debug code, architecture consistency
- Git workflow: branching strategy, commit rules, release process
- Testing requirements: unit (≥80%), integration structure
- Code quality: PEP 8, type hints, max 50 lines per function
- Architecture: no feature flags, separation of concerns
- Continuous learning: research requirements, documentation
- Forbidden patterns: bare except, production debugging, etc.

AGENT_REVIEW.md (400+ lines):
- Review philosophy: prevent circular development
- 6-phase review checklist: structure, quality, tokens, architecture, research, logic
- Report format with token impact analysis
- Severity levels: blocking vs warnings vs approved
- Common issues with examples (good vs bad)
- Review workflow: 30-35 min per PR
- Reports stored in reports/ folder (gitignored)

Also added:
- tests/test_tool_parsing.py - example test following guidelines
- Updated DEVELOPMENT_PATTERNS.md with recommendations

Reports folder in .gitignore for local review storage

* chore: gitignore review reports folder

* feat: fix tool execution and enhance instructions with accurate token counting

- Enhanced tool instructions (1041 tokens, within 2000 budget)
- Added tiktoken>=0.5.0 for accurate token counting
- Fixed subprocess hang by adding stdin=subprocess.DEVNULL
- Removed 9 DEBUG print statements from routes.py
- Added tests for instruction content and token budget verification
- All tests pass (11/11)

Resolves blockers from previous review:
- Token budget verified ✓
- Token documentation added ✓
- Debug code cleaned ✓
- Missing tests added ✓

* feat: implement comprehensive tool system with proper logging

Major improvements to tool instructions and execution:
- Enhanced tool instructions with 7-step task completion workflow
- Added markdown code block fallback parser for tool calls
- Fixed subprocess hang with stdin=subprocess.DEVNULL
- Fixed streaming path to return tool_calls (enabling multi-turn conversations)
- Added complete React project creation example with verification steps
- Token count: 1,743 tokens (within 2,000 limit)

Logging infrastructure:
- Created centralized logging configuration (src/utils/logging_config.py)
- Replaced 80+ print statements with logger.debug()
- Set log level to DEBUG for development
- All modules now use proper logging instead of print

Testing:
- Added 4 new tests for markdown parsing and instruction content
- All 13 tests passing
- Token budget verification test

Documentation:
- Added comprehensive design docs for all major changes
- Added test plans for verification
- Created helper scripts for logging migration

Files changed:
- main.py: Added logging setup
- src/api/routes.py: Tool instructions, streaming fixes, logging
- src/tools/executor.py: subprocess fix, logging
- src/utils/: New logging configuration module
- tests/test_tool_parsing.py: New tests
- docs/: Design decisions and test plans
- scripts/: Helper scripts for development

* refactor: simplify tool instructions to 109 tokens for 7B model

Reduced from 1,743 tokens to 109 tokens (94% reduction) to help
qwen2.5 7B 4bit model follow instructions better.

Changes:
- Removed complex workflow documentation
- Removed multi-turn conversation examples
- Removed lengthy anti-patterns
- Kept only essential format and rules
- Updated tests to match simplified content

Before: 1,743 tokens, 6,004 chars (87% of budget)
After: 109 tokens, 392 chars (5.5% of budget)

This should make it much easier for smaller models to:
1. Understand they must use tools
2. Follow the simple TOOL: format
3. Not get overwhelmed by instructions

* refactor: make tool instructions ultra-direct for 7B models

Further simplify instructions to prevent model from adding explanations.

Before: 109 tokens - model still added explanatory text
After: 86 tokens - ultra-direct commands

Key changes:
- Start with 'You MUST use tools. DO NOT explain.'
- 'OUTPUT THIS EXACT FORMAT - NOTHING ELSE'
- Removed all examples and pleasantries
- Added 'NEVER' rules in all caps
- 'ONLY output TOOL: lines'

The model was outputting:
'1. First, install... TOOL: bash ARGUMENTS: {...}'

It should now output just:
'TOOL: bash
ARGUMENTS: {...}'

This should force the 7B qwen model to stop explaining and just execute.

* refactor: move tool instructions to external config file

Moves hardcoded tool instructions from routes.py to external config file
for better maintainability and easier editing.

Changes:
- Created config/prompts/tool_instructions.txt
- Added _load_tool_instructions() function with caching
- Falls back to default if config file not found
- Updated tests to use the loader function
- Added proper error handling

Benefits:
- Easier to modify instructions without code changes
- Instructions can be edited by non-developers
- Cleaner separation of config vs code
- Supports hot-reloading (cached but easy to invalidate)

Token count: 86 tokens (loaded from file)
Location: config/prompts/tool_instructions.txt

* refactor: simplify tool instructions further and add debug logging

- Reduced instructions to bare minimum: 50 tokens
- Added debug logging to verify instructions are sent
- Removed all caps and aggressive language
- Made instructions more straightforward

Instructions now:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'

This should be easier for 7B models to follow while still
conveying the essential requirements.

* feat: improve tool parser to handle 7B model output variations

Enhanced parse_tool_calls() with multiple fallback strategies:

1. Standard TOOL:/ARGUMENTS: format (original)
2. Markdown code blocks (triple-backtick fenced)
3. Numbered list items (1. npm install ...)
4. Standalone bash commands (npm, npx, mkdir, etc.)

Now handles messy output from small models like:
'1. Install: npm install -g create-react-app'
'2. Create: create-react-app hello-world'

Parses these into chained bash commands for execution.

Also simplified instructions to 50 tokens minimum:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'

This combination should make 7B models much more likely to
have their output successfully parsed and executed.

* fix: improve command extraction for 7B model output

Parser now extracts bash commands from any line containing:
- npm, npx, mkdir, cd, ls, cat, echo, git, python, pip, node, yarn
- create-react-app (added for React projects)

Example: Extracts 'npm install -g create-react-app' from:
'1. Install: npm install -g create-react-app'

Chains multiple commands with && for sequential execution.

This should now successfully parse the numbered list output
from 7B models and execute the commands.
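The extraction heuristic described above might be sketched like this (the regex and prefix matching are assumptions; the command list is the one named in the commit):

```python
import re

KNOWN_COMMANDS = (
    "npm", "npx", "mkdir", "cd", "ls", "cat", "echo",
    "git", "python", "pip", "node", "yarn", "create-react-app",
)

def extract_bash_commands(text: str) -> str:
    """Pull bash commands out of messy model output.

    Strips numbered-list markers like '1. Install:' and keeps any line
    that starts with a known command, chaining results with && for
    sequential execution.
    """
    commands = []
    for line in text.splitlines():
        stripped = re.sub(r"^\s*\d+\.\s*(\w+:)?\s*", "", line).strip()
        if stripped.split(" ")[0] in KNOWN_COMMANDS:
            commands.append(stripped)
    return " && ".join(commands)
```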

* feat: add bash tool description validation and improve 7B model parsing

Changes:
- Added _ensure_tool_arguments() function to inject 'description' field
- Updated tool_instructions.txt to require description for bash tool
- Improved 7B model command extraction with better regex patterns
- Added 'create-react-app' to command detection list
- Updated delta field type to Dict[str, Any] for streaming
- Added GGUF to MLX quantization mapping for registry.py
- Clarified agent responsibilities in AGENT_REVIEW.md and AGENT_WORKER.md

Fixes:
- Bash tool now validates required 'description' field
- 7B model output parsed more reliably (numbered lists)
- Multiple commands chained with && for sequential execution

Token count: 69 tokens (down from 86, -19.8%)

All tests pass: 13/13

* feat: add webfetch tool support with URL extraction

Changes:
- Added webfetch to tool instructions config
- Added URL extraction pattern to parse_tool_calls()
- Parser now recognizes URLs and creates webfetch tool calls
- Updated token count: 89 tokens (+29% from 69)

The webfetch tool is available through the opencode environment.
The system prompt adjustment enables the model to use it for URL fetching.

Token budget: 89 tokens (4.45% of 2000 limit)
Tests pass: 13/13
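The URL-recognition step might look like this (the regex and the tool-call dict shape are assumptions):

```python
import re

_URL_RE = re.compile(r"https?://[^\s)\"']+")

def extract_webfetch_calls(text: str) -> list[dict]:
    """Turn bare URLs in model output into webfetch tool calls."""
    return [
        {"name": "webfetch", "arguments": {"url": url}}
        for url in _URL_RE.findall(text)
    ]
```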
2026-02-24 22:35:05 +01:00
sleepy 40fe75c738 fix: return streaming format (SSE) for tool execution results
When tools are executed during a streaming request, return the results
as a proper SSE stream instead of non-streaming JSON. This ensures
opencode receives the response in the expected format.

- Stream tool results in chunks
- Include proper SSE format with data: prefix
- End with [DONE] marker
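Framing tool results as SSE in the shape OpenAI-compatible clients expect could be sketched as (chunk schema simplified; field names beyond `data:` and `[DONE]` follow the usual OpenAI delta format rather than this repo's exact code):

```python
import json

def stream_tool_results(text: str, chunk_size: int = 80):
    """Yield tool output as SSE lines, ending with the [DONE] marker."""
    for i in range(0, len(text), chunk_size):
        payload = {"choices": [{"delta": {"content": text[i:i + chunk_size]}}]}
        yield f"data: {json.dumps(payload)}\n\n"
    yield "data: [DONE]\n\n"
```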
2026-02-24 15:16:12 +01:00
sleepy 539ca21d51 feat: simplify tool format to TOOL:/ARGUMENTS: pattern
Replace complex OpenAI-style JSON format with simple format:

TOOL: tool_name
ARGUMENTS: {param: value}

This matches what the tool server expects and is much easier
for smaller models to generate correctly. Also add parser for
this format with priority over other formats.
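A minimal parser for this primary format might be (a sketch only; it assumes the ARGUMENTS payload is flat JSON and ignores the other fallback formats):

```python
import json
import re

# Accepts TOOL:/ARGUMENTS: on one line or two; non-greedy {...}
# handles flat JSON objects only, not nested braces.
_TOOL_RE = re.compile(
    r"TOOL:\s*(?P<name>\w+)\s+ARGUMENTS:\s*(?P<args>\{.*?\})",
    re.DOTALL,
)

def parse_tool_calls(text: str) -> list[dict]:
    """Parse TOOL:/ARGUMENTS: blocks into tool-call dicts."""
    calls = []
    for m in _TOOL_RE.finditer(text):
        try:
            args = json.loads(m.group("args"))
        except json.JSONDecodeError:
            continue  # skip malformed JSON rather than fail the request
        calls.append({"name": m.group("name"), "arguments": args})
    return calls
```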
2026-02-24 15:09:47 +01:00
sleepy 61ffd1c925 fix: add tool execution to streaming path
When streaming is enabled but tools are present:
1. Collect the full response (don't stream yet)
2. Parse for tool calls
3. Execute tools via tool executor
4. Return the tool results as a non-streaming response

This fixes the issue where streaming requests with tools
were bypassing tool execution entirely.
2026-02-24 15:07:04 +01:00