Changes:
- Added usage field to ChatCompletionStreamResponse model
- Track prompt, completion, and total tokens in streaming responses
- Include usage info in final chunks of all streaming endpoints
- Clarified model descriptions as "Instruct variant" in registry
- Updated MLX repo mappings to prioritize instruction-following models
Fixes:
- CodeLlama: Using Instruct variant mapping
- DeepSeek: Using instruct-mlx variant
- StarCoder2: 15b only (has Instruct variant on MLX)
Token budget: 89 tokens (unchanged, 4.45% of 2000 limit)
Tests pass: 13/13
Minor improvements:
- Calculate prompt tokens once at function start
- Track completion tokens in streaming generators
- Include usage in tool_calls and content streaming
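A minimal sketch of the streaming usage tracking above, assuming a Pydantic response model (field and helper names here are illustrative, not necessarily the project's actual identifiers):

```python
# Sketch: usage reporting attached to the final chunk of a stream.
from typing import Any, Dict, List, Optional
from pydantic import BaseModel

class Usage(BaseModel):
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

class ChatCompletionStreamResponse(BaseModel):
    id: str
    object: str = "chat.completion.chunk"
    choices: List[Dict[str, Any]]
    usage: Optional[Usage] = None  # populated only on the final chunk

def final_chunk(resp_id: str, prompt_tokens: int, completion_tokens: int) -> ChatCompletionStreamResponse:
    # Prompt tokens are computed once at function start; completion tokens
    # accumulate while streaming; both are attached to the last chunk.
    return ChatCompletionStreamResponse(
        id=resp_id,
        choices=[{"delta": {}, "finish_reason": "stop"}],
        usage=Usage(
            prompt_tokens=prompt_tokens,
            completion_tokens=completion_tokens,
            total_tokens=prompt_tokens + completion_tokens,
        ),
    )
```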
* feat: enhanced tool instructions for multi-step operations
- Add comprehensive examples for ls, find, grep, mkdir, npm init, etc.
- Explain multi-step workflow (explore → read → write)
- Tool system already supports chaining via conversation history
- Bash tool supports: ls, find, grep, cat, mkdir, cd, npm, etc.
- 30 second timeout on commands
- Output limited to 3000 chars for readability
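A minimal sketch of the limits described above (30 s timeout, 3000-char output cap); names are illustrative:

```python
# Sketch: bash tool execution with timeout and output truncation.
import subprocess

MAX_OUTPUT_CHARS = 3000  # keep results readable for the model

def run_bash(command: str, timeout: float = 30.0) -> str:
    try:
        result = subprocess.run(
            command, shell=True, capture_output=True, text=True,
            timeout=timeout, stdin=subprocess.DEVNULL,
        )
    except subprocess.TimeoutExpired:
        return f"Command timed out after {timeout:.0f}s"
    output = result.stdout + result.stderr
    if len(output) > MAX_OUTPUT_CHARS:
        output = output[:MAX_OUTPUT_CHARS] + "\n[output truncated]"
    return output
```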
* Cleanup: Consolidate documentation and tidy codebase
Documentation:
- Consolidate 6 markdown files into simplified README.md
- Remove redundant docs: TODO.md, NETWORK.md, REVIEW.md, PLAN.md, CONTEXT.md, GUIDE.md
- Add ARCHITECTURE.md with clean technical overview
- README now focuses on quick start and core concepts
Code verification:
- Verified blocking I/O properly wrapped in asyncio.to_thread()
- Confirmed locks initialized correctly in backends
- AMD VRAM detection uses proper regex (takes max value, not first match)
- All exception handling uses 'except Exception:' (not bare except)
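A minimal sketch of the two verified patterns above; the report-reading helper is a hypothetical stand-in for whatever blocking call the backends actually make:

```python
# Sketch: blocking I/O off the event loop, and max-of-matches VRAM parsing.
import asyncio
import re

def read_gpu_report() -> str:
    # Hypothetical blocking call (e.g. reading a vendor tool's output).
    with open("/tmp/gpu_report.txt") as f:
        return f.read()

async def detect_amd_vram_mib() -> int:
    report = await asyncio.to_thread(read_gpu_report)  # don't block the loop
    # Several "<n> MiB" figures may appear; take the max, not the first match.
    sizes = [int(m) for m in re.findall(r"(\d+)\s*MiB", report)]
    return max(sizes) if sizes else 0
```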
Tool execution improvements (existing changes):
- Better working directory handling with project root detection
- Extended timeouts for package managers (300s)
- Multi-tool call parsing support
- Improved error handling and logging
Note: system prompt concern flagged - 30k tokens is too large for 16-32k context windows
* docs: add development patterns analysis
Document circular development issues identified in commit history:
- Tool execution went back-and-forth 3+ times (server-side vs client-side)
- Tool instruction size swung from 40k tokens → 300 tokens → removed → enhanced
- 8+ parsing fixes for the same issues (no tests in place)
- 6 debug-only commits (production debugging)
Provides recommendations to prevent future cycles:
1. Pick one architecture and stick with it
2. Add unit tests before fixes
3. Token budget (<2000 for instructions)
4. One format only (remove alternative parsers)
5. Integration test script
6. Separate concerns into smaller modules
7. Design doc before code changes
8. CI/CD with automated testing
* docs: add comprehensive agent guidelines
AGENT_WORKER.md (600+ lines):
- Pre-flight checklist: token budget, test plan, design doc
- Coding rules: TDD, no debug code, architecture consistency
- Git workflow: branching strategy, commit rules, release process
- Testing requirements: unit coverage (≥80%), integration test structure
- Code quality: PEP 8, type hints, max 50 lines per function
- Architecture: no feature flags, separation of concerns
- Continuous learning: research requirements, documentation
- Forbidden patterns: bare except, production debugging, etc.
AGENT_REVIEW.md (400+ lines):
- Review philosophy: prevent circular development
- 6-phase review checklist: structure, quality, tokens, architecture, research, logic
- Report format with token impact analysis
- Severity levels: blocking vs warnings vs approved
- Common issues with examples (good vs bad)
- Review workflow: 30-35 min per PR
- Reports stored in reports/ folder (gitignored)
Also added:
- tests/test_tool_parsing.py - example test following guidelines
- Updated DEVELOPMENT_PATTERNS.md with recommendations
Reports folder in .gitignore for local review storage
* chore: gitignore review reports folder
* feat: fix tool execution and enhance instructions with accurate token counting
- Enhanced tool instructions (1041 tokens, within 2000 budget)
- Added tiktoken>=0.5.0 for accurate token counting
- Fixed subprocess hang by adding stdin=subprocess.DEVNULL
- Removed 9 DEBUG print statements from routes.py
- Added tests for instruction content and token budget verification
- All tests pass (11/11)
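A minimal sketch of the two mechanical fixes above (tiktoken counting and the stdin fix); the encoding choice is an assumption:

```python
# Sketch: accurate token counting and hang-free subprocess execution.
import subprocess
import tiktoken

def count_tokens(text: str, encoding_name: str = "cl100k_base") -> int:
    # tiktoken gives an exact count instead of a chars/4 estimate.
    return len(tiktoken.get_encoding(encoding_name).encode(text))

def run_command(command: str) -> subprocess.CompletedProcess:
    # stdin=DEVNULL stops child processes from hanging while they wait
    # for interactive input that will never arrive.
    return subprocess.run(
        command, shell=True, capture_output=True, text=True,
        timeout=30, stdin=subprocess.DEVNULL,
    )
```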
Resolves blockers from previous review:
- Token budget verified ✓
- Token documentation added ✓
- Debug code cleaned ✓
- Missing tests added ✓
* feat: implement comprehensive tool system with proper logging
Major improvements to tool instructions and execution:
- Enhanced tool instructions with 7-step task completion workflow
- Added markdown code block fallback parser for tool calls
- Fixed subprocess hang with stdin=subprocess.DEVNULL
- Fixed streaming path to return tool_calls (enabling multi-turn conversations)
- Added complete React project creation example with verification steps
- Token count: 1,743 tokens (within 2,000 limit)
Logging infrastructure:
- Created centralized logging configuration (src/utils/logging_config.py)
- Replaced 80+ print statements with logger.debug()
- Set log level to DEBUG for development
- All modules now use proper logging instead of print
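A minimal sketch of what a centralized setup like src/utils/logging_config.py could look like (details assumed):

```python
# Sketch: one central logging setup replacing scattered print() calls.
import logging

def setup_logging(level: int = logging.DEBUG) -> None:
    logging.basicConfig(
        level=level,
        format="%(asctime)s %(name)s %(levelname)s: %(message)s",
    )

# In each module:
#   logger = logging.getLogger(__name__)
#   logger.debug("parsed %d tool calls", len(calls))
```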
Testing:
- Added 4 new tests for markdown parsing and instruction content
- All 13 tests passing
- Token budget verification test
Documentation:
- Added comprehensive design docs for all major changes
- Added test plans for verification
- Created helper scripts for logging migration
Files changed:
- main.py: Added logging setup
- src/api/routes.py: Tool instructions, streaming fixes, logging
- src/tools/executor.py: subprocess fix, logging
- src/utils/: New logging configuration module
- tests/test_tool_parsing.py: New tests
- docs/: Design decisions and test plans
- scripts/: Helper scripts for development
* refactor: simplify tool instructions to 109 tokens for 7B model
Reduced from 1,743 tokens to 109 tokens (94% reduction) to help the
qwen2.5 7B 4-bit model follow instructions better.
Changes:
- Removed complex workflow documentation
- Removed multi-turn conversation examples
- Removed lengthy anti-patterns
- Kept only essential format and rules
- Updated tests to match simplified content
Before: 1,743 tokens, 6,004 chars (87% of budget)
After: 109 tokens, 392 chars (5.5% of budget)
This should make it much easier for smaller models to:
1. Understand they must use tools
2. Follow the simple TOOL: format
3. Not get overwhelmed by instructions
* refactor: make tool instructions ultra-direct for 7B models
Further simplify instructions to prevent model from adding explanations.
Before: 109 tokens - model still added explanatory text
After: 86 tokens - ultra-direct commands
Key changes:
- Start with 'You MUST use tools. DO NOT explain.'
- 'OUTPUT THIS EXACT FORMAT - NOTHING ELSE'
- Removed all examples and pleasantries
- Added 'NEVER' rules in all caps
- 'ONLY output TOOL: lines'
The model was outputting:
'1. First, install... TOOL: bash ARGUMENTS: {...}'
Now should output just:
'TOOL: bash
ARGUMENTS: {...}'
This should force the 7B qwen model to stop explaining and just execute.
* refactor: move tool instructions to external config file
Moves hardcoded tool instructions from routes.py to external config file
for better maintainability and easier editing.
Changes:
- Created config/prompts/tool_instructions.txt
- Added _load_tool_instructions() function with caching
- Falls back to default if config file not found
- Updated tests to use the loader function
- Added proper error handling
Benefits:
- Easier to modify instructions without code changes
- Instructions can be edited by non-developers
- Cleaner separation of config vs code
- Supports hot-reloading (cached but easy to invalidate)
Token count: 86 tokens (loaded from file)
Location: config/prompts/tool_instructions.txt
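A minimal sketch of the loader described above, assuming module-level caching and the documented fallback behavior:

```python
# Sketch: cached loader for config/prompts/tool_instructions.txt.
from pathlib import Path
from typing import Optional

INSTRUCTIONS_PATH = Path("config/prompts/tool_instructions.txt")
DEFAULT_INSTRUCTIONS = "Use tools to execute commands. Output only tool calls."
_cached: Optional[str] = None

def _load_tool_instructions() -> str:
    global _cached
    if _cached is None:
        try:
            _cached = INSTRUCTIONS_PATH.read_text(encoding="utf-8").strip()
        except OSError:
            _cached = DEFAULT_INSTRUCTIONS  # config file missing: fall back
    return _cached

def invalidate_instructions_cache() -> None:
    # The cache is trivial to invalidate, which is what makes
    # hot-reloading of edited instructions cheap.
    global _cached
    _cached = None
```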
* refactor: simplify tool instructions further and add debug logging
- Reduced instructions to bare minimum: 50 tokens
- Added debug logging to verify instructions are sent
- Removed all caps and aggressive language
- Made instructions more straightforward
Instructions now:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'
This should be easier for 7B models to follow while still
conveying the essential requirements.
* feat: improve tool parser to handle 7B model output variations
Enhanced parse_tool_calls() with multiple fallback strategies:
1. Standard TOOL:/ARGUMENTS: format (original)
2. Markdown code blocks (``` fenced)
3. Numbered list items (1. npm install ...)
4. Standalone bash commands (npm, npx, mkdir, etc.)
Now handles messy output from small models like:
'1. Install: npm install -g create-react-app'
'2. Create: create-react-app hello-world'
Parses these into chained bash commands for execution.
Also simplified instructions to 50 tokens minimum:
'Use tools to execute commands. Output only tool calls.
Format: TOOL: bash ARGUMENTS: {...}
No explanations. No numbered lists. No markdown. Only tool calls.'
This combination should make 7B models much more likely to
have their output successfully parsed and executed.
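A minimal sketch of the fallback ordering, showing only the first strategy in full (the JSON match here is deliberately naive and would miss nested braces):

```python
# Sketch: parse_tool_calls() trying strategies in priority order.
import json
import re
from typing import Any, Dict, List

def parse_tool_calls(text: str) -> List[Dict[str, Any]]:
    calls: List[Dict[str, Any]] = []
    # 1. Standard TOOL:/ARGUMENTS: format.
    for m in re.finditer(r"TOOL:\s*(\w+)\s*ARGUMENTS:\s*(\{.*?\})", text, re.DOTALL):
        try:
            calls.append({"name": m.group(1), "arguments": json.loads(m.group(2))})
        except json.JSONDecodeError:
            continue
    if calls:
        return calls
    # 2. Fenced code blocks, 3. numbered list items, 4. bare commands
    # would be tried next, in that order (omitted here for brevity).
    return calls
```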
* fix: improve command extraction for 7B model output
Parser now extracts bash commands from any line containing:
- npm, npx, mkdir, cd, ls, cat, echo, git, python, pip, node, yarn
- create-react-app (added for React projects)
Example: Extracts 'npm install -g create-react-app' from:
'1. Install: npm install -g create-react-app'
Chains multiple commands with && for sequential execution.
This should now successfully parse the numbered list output
from 7B models and execute the commands.
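A minimal sketch of the extraction and chaining step (command list taken from the commit above):

```python
# Sketch: pull known commands out of free-form lines, chain with &&.
import re
from typing import List

KNOWN_COMMANDS = ("npm", "npx", "mkdir", "cd", "ls", "cat", "echo",
                  "git", "python", "pip", "node", "yarn", "create-react-app")

def extract_commands(text: str) -> str:
    pattern = re.compile(r"\b(?:%s)\b.*" % "|".join(map(re.escape, KNOWN_COMMANDS)))
    commands: List[str] = []
    for line in text.splitlines():
        # Matches e.g. "1. Install: npm install -g create-react-app".
        m = pattern.search(line)
        if m:
            commands.append(m.group(0).strip())
    return " && ".join(commands)  # sequential execution
```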
* feat: add bash tool description validation and improve 7B model parsing
Changes:
- Added _ensure_tool_arguments() function to inject 'description' field
- Updated tool_instructions.txt to require description for bash tool
- Improved 7B model command extraction with better regex patterns
- Added 'create-react-app' to command detection list
- Updated delta field type to Dict[str, Any] for streaming
- Added GGUF to MLX quantization mapping for registry.py
- Clarified agent responsibilities in AGENT_REVIEW.md and AGENT_WORKER.md
Fixes:
- Bash tool now validates required 'description' field
- 7B model output parsed more reliably (numbered lists)
- Multiple commands chained with && for sequential execution
Token count: 69 tokens (down from 86, -19.8%)
All tests pass: 13/13
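A minimal sketch of the validation helper named above (the derived default is an assumption):

```python
# Sketch: _ensure_tool_arguments() injecting the required description.
from typing import Any, Dict

def _ensure_tool_arguments(name: str, arguments: Dict[str, Any]) -> Dict[str, Any]:
    if name == "bash" and "description" not in arguments:
        command = str(arguments.get("command", ""))
        arguments["description"] = f"Run: {command[:60]}"  # derived default
    return arguments
```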
* feat: add webfetch tool support with URL extraction
Changes:
- Added webfetch to tool instructions config
- Added URL extraction pattern to parse_tool_calls()
- Parser now recognizes URLs and creates webfetch tool calls
- Updated token count: 89 tokens (+29% from 69)
The webfetch tool is available through the opencode environment.
A system prompt adjustment enables the model to use it for URL fetching.
Token budget: 89 tokens (4.45% of 2000 limit)
Tests pass: 13/13
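A minimal sketch of the URL-extraction pattern:

```python
# Sketch: a bare URL in model output becomes a webfetch tool call.
import re
from typing import Any, Dict, List

URL_RE = re.compile(r"https?://[^\s)\"'<>]+")

def extract_webfetch_calls(text: str) -> List[Dict[str, Any]]:
    return [{"name": "webfetch", "arguments": {"url": url}}
            for url in URL_RE.findall(text)]
```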
When tools are executed during a streaming request, return the results
as a proper SSE stream instead of non-streaming JSON. This ensures
opencode receives the response in the expected format.
- Stream tool results in chunks
- Include proper SSE format with data: prefix
- End with [DONE] marker
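A minimal sketch of the SSE framing described above:

```python
# Sketch: tool results chunked into SSE events, ending with [DONE].
import json
from typing import Iterator

def stream_tool_result(text: str, chunk_size: int = 256) -> Iterator[str]:
    for i in range(0, len(text), chunk_size):
        payload = {"choices": [{"delta": {"content": text[i:i + chunk_size]}}]}
        yield f"data: {json.dumps(payload)}\n\n"
    yield "data: [DONE]\n\n"  # terminator opencode expects
```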
Replace complex OpenAI-style JSON format with simple format:
TOOL: tool_name
ARGUMENTS: {param: value}
This matches what the tool server expects and is much easier
for smaller models to generate correctly. Also add parser for
this format with priority over other formats.
When streaming is enabled but tools are present:
1. Collect the full response (don't stream yet)
2. Parse for tool calls
3. Execute tools via tool executor
4. Return the tool results as a non-streaming response
This fixes the issue where streaming requests with tools
were bypassing tool execution entirely.
- Parse tool_calls whether it's a single object {...} or array [...]
- Normalize to list for consistent processing
- Add debug logging to trace tool execution flow
- Fix variable name (value_str instead of array_str)
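A minimal sketch of the normalization:

```python
# Sketch: accept a single tool_calls object or an array, return a list.
from typing import Any, Dict, List, Union

ToolCall = Dict[str, Any]

def normalize_tool_calls(parsed: Union[ToolCall, List[ToolCall]]) -> List[ToolCall]:
    return [parsed] if isinstance(parsed, dict) else list(parsed)
```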
- Execute tools server-side using configured tool executor (local or remote)
- Return tool results as content directly
- Add logging to show which tool executor is being used
- This should make tool execution work with opencode's broken tool support
Revert to proper OpenAI tool flow:
1. LLM Server returns tool_calls with finish_reason='tool_calls'
2. opencode executes tools (can use tool-server if configured)
3. opencode sends tool result back to LLM Server
4. LLM Server generates final response
This allows opencode to handle tool execution and retry logic.
- Log when no tool executor is configured (fallback to local)
- Log whether using remote tool host or local execution
- Help diagnose why tool requests aren't reaching the tool server
- --tool-host with no value: auto-detects local IP (e.g., http://192.168.1.5:17616)
- --tool-host with explicit URL: uses provided URL
- No --tool-host: local tool execution (default)
Example usage:
python main.py --auto --tool-host # Auto-detect local IP
python main.py --auto --tool-host http://192.168.1.10:17616 # Explicit URL
python main.py --auto # Local execution
- Tool server now runs on port 17616 by default (separate from main API on 17615)
- Add --tool-port argument to customize tool server port
- Update help text to reflect default port 17616
- Prevent port conflicts when running both services on same machine
Reduce tool instructions from 40k tokens to ~300 tokens:
- List only 3 main tools (read, write, bash) with brief descriptions
- Single concise JSON format example
- Remove verbose formatting and multiple examples
- Only add instructions on first request (no assistant response yet)
This makes tool usage feasible for 8K-32K context models,
especially important for home setups with limited VRAM.
- Add ToolExecutor class supporting both local and remote tool execution
- Add --tool-host argument to use remote tool execution server
- Add --tool-server argument to run dedicated tool execution server
- Add /v1/tools/execute endpoint for remote tool execution
- Workers can execute tools on centralized tool host
- Tools: read, write, bash with security restrictions
Architecture:
- Tool Host (--tool-server): Runs on one machine, executes all tools
- Workers (--tool-host): Send tool requests to tool host, get results
- Local mode (default): Execute tools locally as before
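A minimal sketch of the local/remote split, assuming an async HTTP client such as httpx (the endpoint path matches the commit; everything else is illustrative):

```python
# Sketch: ToolExecutor dispatching to a remote tool host or local handlers.
from typing import Any, Dict, Optional
import httpx

class ToolExecutor:
    def __init__(self, tool_host: Optional[str] = None) -> None:
        self.tool_host = tool_host  # e.g. "http://192.168.1.10:17616"

    async def execute(self, name: str, arguments: Dict[str, Any]) -> str:
        if self.tool_host:
            async with httpx.AsyncClient() as client:
                resp = await client.post(
                    f"{self.tool_host}/v1/tools/execute",
                    json={"name": name, "arguments": arguments},
                    timeout=300,  # extended for package managers
                )
                resp.raise_for_status()
                return resp.json().get("result", "")
        return self._execute_local(name, arguments)

    def _execute_local(self, name: str, arguments: Dict[str, Any]) -> str:
        raise NotImplementedError  # read/write/bash handlers live here
```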
- Remove tool instructions from system prompt (they were confusing the 3B model)
- Allow streaming even when tools are present
- Model now responds normally, server parses and executes tools server-side
- Fixes infinite loop where opencode would retry requests repeatedly
Based on commit d30eedaa63 which originally fixed this.
Instead of complex checks for tool_calls in various formats, simply
check if any assistant message exists in the conversation. If the
assistant has already responded, don't add tool instructions again.
This prevents the conversation from growing with duplicate messages.
Add check for 'Tool X result' pattern in assistant messages to detect
when server-side tool execution has already occurred. This prevents
the conversation from growing with duplicate user messages.
Execute tools server-side instead of relying on client (opencode) to
execute them. This works around known bugs with OpenAI-compatible
providers and tool calling in opencode.
Supported tools: read, write, bash, question, skill, todowrite, todoread
- Parse model ID with format like qwen2.5-coder:7b:4bit
- Return specific error if requested config not found or doesn't fit
- Don't fall back to auto-selection when specific config requested
- Fix federation branch to properly return response instead of falling through
- Add detailed debug output showing the full API response
- Show finish_reason, tool_calls_count, and full JSON response
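A minimal sketch of the model-ID parsing described above:

```python
# Sketch: "qwen2.5-coder:7b:4bit" -> (family, size, quantization).
from typing import Optional, Tuple

def parse_model_id(model_id: str) -> Tuple[str, Optional[str], Optional[str]]:
    parts = model_id.split(":")
    family = parts[0]
    size = parts[1] if len(parts) > 1 else None
    quant = parts[2] if len(parts) > 2 else None
    # Caller raises a specific error if the requested config isn't
    # available, rather than silently falling back to auto-selection.
    return family, size, quant
```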
Check if assistant has already generated tool_calls (either via
tool_calls field or in content) and don't add instructions if so.
This prevents the model from continuing to call tools after the
first tool execution.
- Fix regex to properly extract function content with nested braces
- Fix arguments extraction to handle escaped quotes correctly
- Arguments are now properly unescaped and parsed as JSON
- Tool calls now include correct arguments field for opencode to execute
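A minimal sketch of brace-aware extraction, which avoids the truncation a non-greedy regex suffers on nested JSON (braces inside string literals are ignored here for brevity):

```python
# Sketch: depth-tracking scan for a complete {...} object.
from typing import Optional

def extract_json_object(text: str, start: int = 0) -> Optional[str]:
    begin = text.find("{", start)
    if begin == -1:
        return None
    depth = 0
    for i in range(begin, len(text)):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
            if depth == 0:
                return text[begin:i + 1]
    return None  # unbalanced braces
```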
- Check if there are already tool results (role=tool) in messages
- Only add tool instructions if no tool results present
- Prevents model from generating more tool calls after tool execution
- Model should now respond to tool results instead of calling more tools
- Search for the pattern in cleaned_text but take the match position from the original text
- Strip trailing markdown block markers from content
- Ensure content is empty when tool_calls are present
When tools are provided and streaming is requested, fall through to
non-streaming mode so tool calls can be properly parsed and returned.
Remove verbose input debug logging.
- Look for { tool_calls: [...] } pattern instead of just tool_calls:
- Better handle unquoted keys AND unquoted string values
- Add debug output when parsing fails
- Fix regex to match the full JSON structure
- Add tool call parsing when using federation (was missing)
- Add debug output showing which peers are being contacted
- Fix variable shadowing in tool_calls parsing
- Add periodic health check every 10s to keep peer connections alive
- Remove stale peers after 30s of unreachability
- Improve tool use instructions with clearer examples
- Add 'CRITICAL: Do not explain what you will do' instruction
- Add concrete example of tool use format
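A minimal sketch of the health-check loop (the data structure and probe signature are assumptions):

```python
# Sketch: probe peers every 10 s, drop them after 30 s of unreachability.
import asyncio
import time
from typing import Awaitable, Callable, Dict

async def health_check_loop(
    peers: Dict[str, float],  # "host:port" -> last successful probe time
    probe: Callable[[str], Awaitable[bool]],
) -> None:
    while True:
        now = time.monotonic()
        for addr in list(peers):
            if await probe(addr):
                peers[addr] = now
            elif now - peers[addr] > 30:
                del peers[addr]  # stale peer
        await asyncio.sleep(10)
```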
- Modify /v1/chat/completions endpoint to check for federation
- If federation enabled with peers, use generate_with_federation()
- Otherwise fall back to local generation
- Add --peer example to help text
Now when federation is enabled and peers are discovered/manually added,
generation requests will be distributed across local and peer swarms,
with consensus voting to select the best response.
Add --peer argument to manually specify peers when mDNS discovery
isn't working. Usage: --peer 192.168.178.192:17615
Can be used multiple times for multiple peers.
- Use AsyncZeroconf and AsyncServiceBrowser for listening
- Use AsyncServiceInfo.async_request() instead of get_service_info()
- Add safe property decoding with None checks
- Use async_close() for proper cleanup
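A minimal sketch of that pattern with the python-zeroconf asyncio API (handler wiring simplified):

```python
# Sketch: async service browsing and resolution with safe property decoding.
import asyncio
from zeroconf import ServiceStateChange, Zeroconf
from zeroconf.asyncio import AsyncServiceBrowser, AsyncServiceInfo, AsyncZeroconf

async def resolve(aiozc: AsyncZeroconf, service_type: str, name: str) -> None:
    info = AsyncServiceInfo(service_type, name)
    await info.async_request(aiozc.zeroconf, 3000)  # timeout in ms
    props = info.properties or {}
    raw = props.get(b"role")
    role = raw.decode() if raw is not None else None  # safe None check

def start_browser(aiozc: AsyncZeroconf, service_type: str) -> AsyncServiceBrowser:
    def on_change(zeroconf: Zeroconf, service_type: str, name: str,
                  state_change: ServiceStateChange) -> None:
        if state_change is ServiceStateChange.Added:
            asyncio.ensure_future(resolve(aiozc, service_type, name))
    return AsyncServiceBrowser(aiozc.zeroconf, service_type, handlers=[on_change])

# Cleanup on shutdown: await aiozc.async_close()
```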
Use asyncio.to_thread() for register_service and unregister_service
to prevent blocking the async event loop. This fixes the
EventLoopBlocked error on macOS.
- Add advertise_ip parameter to SwarmDiscovery and create_discovery_service
- Use specified --host IP for mDNS advertising instead of auto-detect
- Add feedback when using specified vs auto-detected IP
This ensures both the API server and mDNS advertise the same IP.
- Add --host argument to specify bind IP directly
- Fix HardwareProfile attribute names (cpu_cores, ram_gb)
- Update help text with new --host option
Allows manual override of IP detection for multi-adapter setups.
- Support all RFC 1918 private IP ranges (10.x, 172.16-31.x, 192.168.x)
- Add debug output to show detected IP and why it was rejected
- Fix API URL display to show actual bound host
- Use consistent IP detection between main.py and discovery.py
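A minimal sketch of the RFC 1918 check:

```python
# Sketch: accept any RFC 1918 private address.
import ipaddress

RFC1918 = [ipaddress.ip_network(n) for n in
           ("10.0.0.0/8", "172.16.0.0/12", "192.168.0.0/16")]

def is_private_lan_ip(addr: str) -> bool:
    try:
        ip = ipaddress.ip_address(addr)
    except ValueError:
        return False
    return any(ip in net for net in RFC1918)
```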
- Add --federation CLI flag to enable network federation
- Integrate SwarmDiscovery service for mDNS peer discovery
- Wire up FederatedSwarm wrapper in main application flow
- Add GET /v1/federation/health endpoint
- Display discovered peers in startup banner
- Proper cleanup of federation resources on shutdown
Enables multiple Local Swarm instances to collaborate on the same
network for distributed consensus and load balancing.
Documents the current state of network federation:
- What's working (discovery, federation client, network binding)
- What's missing (integration in main.py)
- Relevant files and functions
- Scope and limitations
- Comprehensive TODO list for implementation
Federation exists but isn't wired up to the main application flow.
- Add random name selection instead of sequential (Alpha, Beta, etc.)
- Track used names to ensure uniqueness within a session
- Add reset_used_names() function for swarm restart
- Workers now get cool names like 'Zeus', 'Python', 'Nebula', etc.
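A minimal sketch of the naming scheme (pool contents and exhaustion behavior are assumptions):

```python
# Sketch: random unique worker names with a reset hook for restarts.
import random
from typing import Set

NAME_POOL = ["Zeus", "Python", "Nebula", "Atlas", "Quasar", "Orion"]
_used: Set[str] = set()

def pick_worker_name() -> str:
    available = [n for n in NAME_POOL if n not in _used]
    if not available:
        return f"Worker-{random.randint(1000, 9999)}"  # pool exhausted
    name = random.choice(available)
    _used.add(name)  # track to keep names unique within a session
    return name

def reset_used_names() -> None:
    _used.clear()  # called on swarm restart
```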