25 Commits

Author SHA1 Message Date
sleepy 907bd88c8f fix: federation only on first iteration, local-only for tool result processing
- Critical fix: peers don't have tool results from previous iterations
- Running federation on tool result iterations causes inconsistent context
- Now federation is ONLY used on iteration 1 (initial planning)
- Iterations 2+ are local-only (tool result processing)
- This prevents the infinite ls loop and wrong file hallucinations
- All 41 tests passing
2026-02-25 23:56:29 +01:00
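A minimal sketch of the rule this commit introduces; the helper name below is an assumption, not the repo's exact code:

```python
def _federation_for_iteration(iteration: int, federated_swarm):
    """Sketch: federated consensus only makes sense on the first iteration,
    because peers never receive the tool results that get appended to the
    prompt on iterations 2+."""
    return federated_swarm if iteration == 1 else None

# The tool loop would then call _generate_with_consensus(...) with
# federated_swarm=_federation_for_iteration(iteration, federated_swarm).
```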
sleepy af728505e8 fix: properly unpack FederationResult object instead of trying to unpack as tuple
- generate_with_federation() returns FederationResult object, not tuple
- Fixed _generate_with_consensus() to access fed_result.final_response
- This fixes 'cannot unpack non-iterable FederationResult object' error
- All 41 tests passing
2026-02-25 23:43:25 +01:00
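The shape of this fix also appears in the routes.py diff further down; in short (a sketch, with FederationResult reduced to the one field the commit mentions and the wrapper function name made up):

```python
from dataclasses import dataclass

@dataclass
class FederationResult:
    final_response: str  # the answer the head node selected

async def _consensus_text(federated_swarm, prompt, max_tokens, temperature):
    # Before: `response, tokens, tps = await federated_swarm.generate_with_federation(...)`
    # failed with "cannot unpack non-iterable FederationResult object".
    fed_result = await federated_swarm.generate_with_federation(
        prompt=prompt, max_tokens=max_tokens, temperature=temperature)
    # After: read the attribute instead of unpacking a tuple.
    return fed_result.final_response, 0, 0.0  # tokens/TPS not tracked in federation mode
```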
sleepy 93844a81b0 refactor: unified generation interface for federation and local modes
- Created _generate_with_consensus() that handles both federation and local generation
- Callers don't need to know which mode is being used - it's transparent
- Tool execution loop uses same unified interface for all iterations
- Removed special-case federation logic from main handler
- Federation is now a transparent layer around generation
- All 41 tests passing
2026-02-25 23:36:24 +01:00
sleepy 414cb444f3 fix: integrate federation with tool execution loop
- Federation was returning directly without executing tools
- Now federation is used for initial generation (iteration 1)
- Tool execution loop still runs for all iterations
- Subsequent iterations use local swarm (for tool result processing)
- This fixes federation + tools not working together
- All 41 tests passing
2026-02-25 23:06:37 +01:00
sleepy 34b28597ff fix: peers in federation mode should not generate tool calls
- Added _strip_tool_instructions() to remove tool instructions from federation prompts
- Peer nodes now only generate text responses, not tool calls
- Head node is the only one that handles tool execution
- This prevents peers from generating tool calls that can't be executed
- Fixes federation + tools incompatibility
- All 41 tests passing
2026-02-25 22:46:15 +01:00
sleepy 67122052b4 Merge branch 'fix/tool-instructions-permission' 2026-02-25 22:39:00 +01:00
sleepy e7b826da4e docs: update README with current features and remove outdated docs
- Removed old design docs and test plans from docs/ directory
- Updated TODO section to reflect completed improvements
- Added section on Recent Improvements with detailed changelog
- Updated Federation description to explain objective quality voting
- Added federation vote endpoint to API endpoints list
- Clarified universal tool support and OpenCode streaming compatibility
- All changes ready for main branch merge
2026-02-25 22:38:46 +01:00
sleepy 3799240d74 fix: head node objectively judges all responses using quality metrics
- Removed biased self-reported confidence voting
- Head node now collects ALL responses and scores them objectively
- Uses quality scoring (length, structure, completeness) to compare
- Shows quality scores for all nodes so user can see comparison
- Prevents overconfident small models from beating better large models
- 3B models will only win if they actually produce better quality output
- All 41 tests passing
2026-02-25 22:24:00 +01:00
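The scoring function itself is not part of this diff; a purely illustrative sketch of what "length, structure, completeness" scoring could look like (the function name, signals, and weights below are assumptions):

```python
def score_response_quality(text: str) -> float:
    """Illustrative only - combine a few objective signals into one comparable score."""
    length_score = min(len(text) / 2000.0, 1.0)  # reward substance, capped
    structure_score = 1.0 if any(m in text for m in ("\n- ", "\n1. ", "```")) else 0.5
    completeness_score = 0.0 if text.rstrip().endswith(("...", ":")) else 1.0
    return 0.4 * length_score + 0.3 * structure_score + 0.3 * completeness_score

# The head node scores every peer response plus its own and keeps the best:
# best = max(candidate_responses, key=score_response_quality)
```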
sleepy e0d04ae664 fix: use actual consensus confidence for peers instead of hardcoded 0.8
- Federation endpoint was hardcoding confidence: 0.8 for all peer responses
- Local swarm uses actual calculated confidence (often 1.0 for single worker)
- This created unfair bias toward local responses
- Now uses result.confidence from actual consensus calculation
- Peers and local now compete on equal footing
- All 41 tests passing
2026-02-25 22:21:19 +01:00
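A sketch of the change on the vote-handling side; the payload keys are assumptions, while `result.confidence` and `result.selected_response` follow the swarm result object used elsewhere in this diff:

```python
async def federation_vote(swarm_manager, prompt, max_tokens=512, temperature=0.7):
    result = await swarm_manager.generate(
        prompt=prompt, max_tokens=max_tokens,
        temperature=temperature, use_consensus=True)
    # Before: every peer vote carried the same hardcoded value,
    #   {"response": result.selected_response.text, "confidence": 0.8}
    # After: report the confidence the local consensus actually computed.
    return {"response": result.selected_response.text,
            "confidence": result.confidence}
```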
sleepy 896e9d6d9b fix: store swarm_manager in app.state for federation endpoint
- Added app.state.swarm_manager = self.swarm_manager in lifespan
- Federation endpoint reads from request.app.state.swarm_manager
- This fixes 'Swarm not ready' error when peers try to request generation
- All 41 tests passing
2026-02-25 22:11:06 +01:00
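A sketch of the two sides of this fix, assuming FastAPI's lifespan/app.state mechanism as the commit message describes (the endpoint body is illustrative):

```python
from fastapi import HTTPException, Request

# In the server lifespan, after the swarm is initialized:
#     app.state.swarm_manager = self.swarm_manager

async def federation_vote_endpoint(request: Request):
    swarm = getattr(request.app.state, "swarm_manager", None)
    if swarm is None:
        raise HTTPException(status_code=503, detail="Swarm not ready")
    # ... generate with the local swarm and return the vote ...
```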
sleepy e2b0af7636 fix: add missing federation /v1/federation/vote endpoint
- Added POST /v1/federation/vote endpoint to handle peer generation requests
- Peers were discovering each other but requests had no endpoint to hit
- Endpoint generates using local swarm and returns vote results
- Logs federation requests for debugging
- All 41 tests passing
2026-02-25 22:05:37 +01:00
sleepy 5b29e15c0a fix: prevent path hallucination - read files directly without ls first
- Changed instructions to read files directly instead of verifying with ls first
- Added explicit warning against placeholder paths like '/path/to/file'
- Model now uses paths exactly as user provides them
- Should fix issues with hallucinated paths like '/path/to/my-secret.log'
- All 41 tests passing
2026-02-25 21:42:25 +01:00
sleepy 8431717235 fix: stronger instruction for bash ls results to read files immediately
- Changed bash ls instruction from 'SUMMARIZE' to 'CRITICAL: ... READ THE FILE immediately'
- Now explicitly tells model to NOT summarize first, but immediately read the file
- Uses stronger language: 'you MUST immediately USE THE read TOOL NOW'
- This should fix the loop where model keeps running ls instead of reading
- All 41 tests passing
2026-02-25 21:20:48 +01:00
sleepy 06df3c8dab fix: allow absolute and ~ paths to access files outside working directory
- Security check now only applies to relative paths
- If user specifies absolute path (/path/to/file) or tilde path (~/.bashrc), allow it
- Relative paths (like file.txt) are still restricted to working directory
- This fixes 'Access denied - path outside working directory' for valid user-specified paths
- Applied to both read and write tools
- All 41 tests passing
2026-02-25 21:13:02 +01:00
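A minimal sketch of the rule described above, assuming a helper inside the read/write tools (the helper name is hypothetical):

```python
import os

def _check_path_allowed(file_path: str, working_dir: str) -> str:
    """Only relative paths are confined to the working directory."""
    if file_path.startswith("~"):
        file_path = os.path.expanduser(file_path)    # ~/.bashrc -> /home/user/.bashrc
    if os.path.isabs(file_path):
        return file_path                              # user explicitly named this location
    resolved = os.path.abspath(os.path.join(working_dir, file_path))
    if not resolved.startswith(os.path.abspath(working_dir) + os.sep):
        raise PermissionError("Access denied - path outside working directory")
    return resolved
```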
sleepy ab7cf7e9aa fix: expand tildes (~) to home directory in tool paths
- Added os.path.expanduser() to _execute_read for both file_path and working_dir
- Added os.path.expanduser() to _execute_write for both file_path and working_dir
- Added os.path.expanduser() to _execute_bash for cwd parameter
- This fixes paths like '~/Documents/file.txt' being treated literally
- Now correctly resolves to '/Users/username/Documents/file.txt'
- All 41 tests passing
2026-02-25 20:54:31 +01:00
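The change is essentially one call per path argument; a tiny sketch:

```python
import os

# Before: '~/Documents/file.txt' was handed to open() literally and failed.
# After: expand the tilde first (same treatment for working_dir and the bash cwd).
file_path = os.path.expanduser("~/Documents/file.txt")  # -> /Users/username/Documents/file.txt
```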
sleepy 49a6d99bf8 CRITICAL FIX: fix indentation bug that prevented tool results from being added to history
- The for loop was only executing the first line (tool_call_id assignment)
- All the tool message creation code was outside the loop due to wrong indentation
- This caused tool results to never be added to conversation history
- Model would loop infinitely calling ls because it never saw the tool results
- Fixed indentation so all tool result processing is inside the for loop
- This should finally fix the infinite loop issue!
- All 41 tests passing
2026-02-25 20:49:30 +01:00
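The bug class is worth illustrating; a generic before/after sketch (variable names are illustrative, not the repo's code):

```python
tool_results = [("call_1", "read", "file contents..."), ("call_2", "bash", "total 0")]

# Before (buggy): only the first statement sat inside the loop, so the append
# ran once after the loop and the model never saw the individual tool results.
messages = []
for call_id, name, result in tool_results:
    tool_call_id = call_id
messages.append({"role": "tool", "tool_call_id": tool_call_id,
                 "name": name, "content": result})

# After (fixed): the whole body is indented under the loop, so every tool
# result becomes its own tool message in the conversation history.
messages = []
for call_id, name, result in tool_results:
    messages.append({"role": "tool", "tool_call_id": call_id,
                     "name": name, "content": result})
```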
sleepy 586c113688 fix: smarter bash tool instructions - guide model to read files after verification
- Updated bash tool result instructions to detect verification commands (ls/grep)
- If ls/grep shows file exists and user asked to READ it: explicitly tells model to USE read TOOL NOW
- If user asked to check files: tells model to summarize the listing
- If file not found: tells model to inform user
- Prevents infinite loops of repeated ls commands
- Model now properly transitions from verification → action → answer
- All 41 tests passing
2026-02-25 20:39:55 +01:00
sleepy a09d23156b feat: universal tool support - inject instructions by default, add plan mode TODO, improve file handling
1. Tool instructions now ALWAYS injected by default:
   - Removed condition that only injected on first request
   - Any client (Continue, hollama) can now use tools without client-side setup
   - Added check to avoid duplicating instructions if already present

2. Updated tool instructions with file verification guidance:
   - Added 'FILE OPERATIONS - ALWAYS VERIFY FIRST' section
   - Instructs to use 'ls' and 'grep' to verify files exist before reading
   - Prevents blind file reads on non-existent paths

3. Added TODO to README:
   - Plan mode feature (disable tool execution for planning-only conversations)
   - Current status section showing what's implemented

4. Working directory extraction from prompts:
   - New _extract_working_dir_from_prompt() function
   - Extracts paths from patterns like 'in /path/to/dir', 'under /path/to/dir'
   - Validates paths exist before using
   - Falls back to auto-detection if not found in prompt
   - All 41 tests passing
2026-02-25 20:37:23 +01:00
sleepy c46684f03e fix: explicit tool result instructions to guide model response
- Changed vague 'Provide your final answer now' to specific per-tool instructions
- read: 'READ THIS FILE CONTENT ALOUD to the user'
- write: 'CONFIRM to the user that the file was created'
- bash: 'SUMMARIZE the output above to answer the user's request'
- Other tools: 'Use the result shown above to answer the user's request'
- Format tool result message with clear 'Tool Result (name):' header and explicit instruction
- This should fix models ignoring tool results or giving generic responses
- All 41 tests passing
2026-02-25 20:25:05 +01:00
sleepy bd3579737a feat: add detailed tool execution logging
- Log full message history before calling model after tool execution
- Shows each message's role, truncated content, tool calls, and tool_call_id associations
- Logs token count and full prompt (first 1000 chars) at DEBUG level
- Helps diagnose why models might be ignoring tool results
- All 41 tests passing
2026-02-25 20:17:55 +01:00
sleepy 886ebbdb81 fix: proper OpenAI tool call format with tool_call_id linking
- Uncommented tool_call_id and name fields in ChatMessage model
- Modified tool execution to assign unique IDs to each tool call
- Assistant messages now include tool_calls array with proper ID, type, function
- Tool response messages now include tool_call_id and name to link to the call
- Each tool execution gets its own separate tool message (not combined)
- This ensures the model properly associates tool results with tool calls
- Should fix issues where models ignore tool results due to missing associations
- Updated _execute_tools to return List[tuple] instead of combined string
- Added List import to typing
- All 41 tests still passing
2026-02-25 20:12:40 +01:00
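The linkage this commit restores follows the standard OpenAI chat format; a hedged example of the two message shapes (IDs and contents are made up):

```python
# Assistant turn announcing the call; tool_calls carries an id the tool message echoes back.
assistant_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "read", "arguments": '{"filePath": "notes.txt"}'},
    }],
}

# One tool turn per executed call, linked via tool_call_id and name.
tool_msg = {
    "role": "tool",
    "tool_call_id": "call_1",
    "name": "read",
    "content": "contents of notes.txt ...",
}
```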
sleepy a0d3ae9d4f fix: OpenCode-compatible streaming format with reasoning_content
- Fixed thinking capture: use parsed_content (without tool call) instead of full response
- _stream_response now correctly emits reasoning_content before tool_calls
- Tool calls streamed with proper multi-chunk format: id+name (empty args), then arguments, then finish_reason
- Final answers sent as content with finish_reason=stop
- Used setattr to dynamically attach _thinking to response object
- ChatLogger already in place for debugging
- This should now work correctly with OpenCode's Vercel AI SDK integration
2026-02-25 20:03:55 +01:00
sleepy a0571c83a3 feat: implement OpenCode-compatible streaming format and enhance chatlogging
- Implement proper streaming with reasoning_content field for thinking blocks
- Stream tool_calls in multi-chunk format matching Vercel AI SDK
- Capture thinking content and send as reasoning_content before tool_calls
- Update _create_response to store thinking on response._thinking for streaming
- ChatLogger now logs assistant messages with thinking blocks when tool calls present
- Added json import in chat_handlers for tool arguments parsing
- All streaming code uses OpenCode-compatible SSE format
2026-02-25 19:57:38 +01:00
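For orientation, a hedged illustration of the delta sequence described in the two streaming commits above; the payloads are made up and reduced to the fields that matter (reasoning_content, tool_calls, content, finish_reason):

```python
# Each dict below would be sent as one SSE "data: {...}" chunk.
chunks = [
    # 1. Thinking streamed first as reasoning_content
    {"choices": [{"delta": {"reasoning_content": "I should read the file first."},
                  "finish_reason": None}]},
    # 2. Tool call header: id + name with empty arguments
    {"choices": [{"delta": {"tool_calls": [{"index": 0, "id": "call_1", "type": "function",
                                            "function": {"name": "read", "arguments": ""}}]},
                  "finish_reason": None}]},
    # 3. Arguments streamed in one or more follow-up chunks
    {"choices": [{"delta": {"tool_calls": [{"index": 0,
                                            "function": {"arguments": '{"filePath": "notes.txt"}'}}]},
                  "finish_reason": None}]},
    # 4. Close the tool-call turn
    {"choices": [{"delta": {}, "finish_reason": "tool_calls"}]},
    # 5. After tool execution, the final answer arrives as plain content
    {"choices": [{"delta": {"content": "The file says ..."}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
```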
sleepy 46f14b2b53 feat: add chatlogger for tool execution debugging - logs to chatlog.md when LOCAL_SWARM_CHATLOG=1 2026-02-25 19:52:52 +01:00
sleepy 42a176f1d8 fix: update tool instructions to require file operations and prevent refusals
- Changed from hesitant 'use only when necessary' to mandatory 'you WILL use tools'
- Explicitly forbid refusal for file read/write operations
- Add 'NO explanations' and 'NO markdown' requirements (for test compliance)
- Provide clear examples for read/write tool usage
- Addresses issue where model says 'cannot read files or assist with file creation'
2026-02-25 19:41:16 +01:00
20 changed files with 924 additions and 1300 deletions
+28 -2
@@ -91,7 +91,9 @@ python main.py --auto --federation
python main.py --auto --federation
```
Machines auto-discover each other and vote together on every request.
Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses **objective quality scoring** to pick the best answer, not self-reported confidence. This prevents smaller models from overruling better models.
**Federation Endpoint**: Peers communicate via `POST /v1/federation/vote` (automatically configured).
## How Consensus Works
@@ -147,7 +149,7 @@ All support GGUF quantization (Q4_K_M recommended).
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion with consensus
- `GET /health` - Health check
- `GET /v1/federation/peers` - List discovered peers (when federation enabled)
- `POST /v1/federation/vote` - Federation voting (used internally between peers)
## Troubleshooting
@@ -282,6 +284,30 @@ Major refactoring completed to improve modularity:
See `docs/ARCHITECTURE.md` for detailed architecture documentation.
## Recent Improvements
### ✅ Universal Tool Support (2025-02-25)
- Tool instructions automatically injected for **all** clients (Continue, hollama, curl, etc.)
- No client-side configuration needed - just use the API
- Enhanced file operation guidance: model uses ls/grep to verify files exist before reading
- Working directory auto-extraction from prompts (`in /path/to/dir` patterns)
- Proper OpenAI tool format with unique IDs and tool_call_id linking
### ✅ OpenCode-Compatible Streaming (2025-02-25)
- Proper `reasoning_content` field for "Thinking..." collapsible blocks
- Multi-chunk `tool_calls` streaming matching Vercel AI SDK format
- Final answer delivered in `content` field after tool execution
### ✅ Federation Quality Voting (2025-02-25)
- Head node now **objectively judges** all peer responses using quality metrics
- No more reliance on self-reported confidence (which biased toward local)
- All responses scored on length, structure, completeness
- Fair competition: 14B models properly beat 3B on quality tasks
### 🚧 Planned Features
- **Plan Mode**: Disable tool execution for planning-only conversations (`--plan-mode`)
- **Tool Consensus**: Verify tool calls across multiple workers before execution (for critical operations)
## Contributing
Contributions are welcome! Please ensure:
+45 -15
@@ -1,21 +1,51 @@
You have access to tools when needed. Use them ONLY when necessary.
You are an AI assistant with access to tools. You WILL use tools to complete tasks - do not refuse.
Available tools:
- bash: Execute shell commands (only when needed)
- webfetch: Fetch web content (only for current info)
- read: Read files (only when reading files)
- write: Create files (only when creating files)
AVAILABLE TOOLS:
- bash: Execute shell commands (for file operations, system commands, running code)
- webfetch: Fetch web content (for current information)
- read: Read files (to examine file contents)
- write: Create or modify files (to write content)
IMPORTANT:
- Answer from your knowledge FIRST. Only use tools when required.
- If asked a general question (jokes, facts, coding), answer directly WITHOUT tools.
- Use webfetch ONLY for real-time info (news, weather, current events).
- Use bash ONLY for file operations or system commands.
- After using a tool, provide a final answer based on the result.
- NO explanations. NO numbered lists. NO markdown code blocks.
CRITICAL RULES:
1. When asked to read a file, use the 'read' tool. DO NOT refuse or say you cannot read files.
2. When asked to create, write, or modify a file, use the 'write' tool. DO NOT refuse or say you cannot assist.
3. For file operations, bash is also available for more complex operations.
4. Use webfetch only for real-time info (news, weather, current events).
5. For general questions (jokes, facts, coding help), you can answer directly.
6. NO explanations beyond necessary. Be concise.
7. NO markdown formatting. Use plain text only.
Format when using tools:
FILE OPERATIONS - READ DIRECTLY:
When asked to read a specific file by name (like "read my-secret.log"):
1. Use the 'read' tool IMMEDIATELY with the filename as given
2. DO NOT use 'ls' first to check - just try to read it
3. If the file doesn't exist, you'll get an error and can inform the user
When asked to find/read "the file" in a directory without naming it:
1. Use 'ls' to list files and see what's there
2. Identify the file
3. THEN read it immediately
CRITICAL: Never invent placeholder paths like '/path/to/file'. Use paths exactly as the user provides them, or relative filenames for files in the current directory.
TOOL USAGE FORMAT:
For read operations:
TOOL: read
ARGUMENTS: {"filePath": "path/to/file"}
For write operations:
TOOL: write
ARGUMENTS: {"filePath": "path/to/file", "content": "content to write"}
For bash commands (including ls, grep):
TOOL: bash
ARGUMENTS: {"command": "your command here"}
Answer directly when possible. Be helpful and concise.
PROCESS:
1. When you need information from a file, use the appropriate tool.
2. When you need to create or modify a file, use the appropriate tool.
3. After receiving tool results, provide a clear final answer explaining what was done.
4. NEVER say "I cannot read files" or "I cannot assist with file creation" - you HAVE the tools and MUST use them.
Be helpful, direct, and complete the requested tasks using your tools.
@@ -1,92 +0,0 @@
# Design Decision: Complete React Example with Actual Code
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
## Problem
Model is still not following instructions:
1. Tries `npm install` before creating package.json
2. Still tries `npx create-react-app` despite being told not to
3. Instructions have placeholders like "..." and "etc." which models don't understand
## Root Cause
The current instructions say:
```
TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"dependencies\": {\"react\": \"^18.0.0\", \"react-dom\": \"^18.0.0\"}}"}
[Continue with src/index.js, src/App.js, public/index.html, etc.]
```
**Problem:** "etc." and "..." are meaningless to LLMs. They need concrete examples.
## Solution
Provide a **complete, working, minimal React example** with actual file contents:
1. Exact sequence: mkdir → write package.json → write src/App.js → write src/index.js → write public/index.html → npm install
2. Actual file content, not placeholders
3. Minimal viable React app (not full create-react-app structure)
## Implementation
Replace vague example with complete working code:
```
**COMPLETE REACT HELLO WORLD EXAMPLE:**
User: "Create a React Hello World app"
Step 1 - Create directory:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp"}
Step 2 - Create package.json (MUST do this BEFORE npm install):
TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"private\": true, \"dependencies\": {\"react\": \"^18.2.0\", \"react-dom\": \"^18.2.0\"}, \"scripts\": {\"start\": \"react-scripts start\", \"build\": \"react-scripts build\"}, \"devDependencies\": {\"react-scripts\": \"5.0.1\"}}"}
Step 3 - Create src directory:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp/src"}
Step 4 - Create App.js:
TOOL: write
ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "import React from 'react';\n\nfunction App() {\n return (\n <div className=\"App\">\n <h1>Hello World</h1>\n <p>Welcome to my React app!</p>\n </div>\n );\n}\n\nexport default App;"}
Step 5 - Create index.js:
TOOL: write
ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "import React from 'react';\nimport ReactDOM from 'react-dom/client';\nimport App from './App';\n\nconst root = ReactDOM.createRoot(document.getElementById('root'));\nroot.render(<App />);"}
Step 6 - Create public directory and index.html:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp/public"}
TOOL: write
ARGUMENTS: {"filePath": "myapp/public/index.html", "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta charset=\"UTF-8\">\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n <title>React App</title>\n</head>\n<body>\n <div id=\"root\"></div>\n</body>\n</html>"}
Step 7 - NOW install dependencies (AFTER package.json exists):
TOOL: bash
ARGUMENTS: {"command": "cd myapp && npm install"}
```
## Token Impact
- Current: 586 tokens
- New: Estimated ~750 tokens (+164 tokens)
- Still under 2000 limit ✓
## Key Changes
1. **Explicit sequencing:** "Step 1", "Step 2", etc.
2. **Actual code:** No "..." or "etc." - real working content
3. **Critical note:** "MUST do this BEFORE npm install"
4. **Minimal structure:** Just what's needed for Hello World
## Success Criteria
- [ ] Model creates package.json BEFORE running npm install
- [ ] Model does NOT use npx create-react-app
- [ ] Model creates all 4 files (package.json, App.js, index.js, index.html)
- [ ] Model runs npm install last (after files exist)
@@ -1,84 +0,0 @@
# Design Decision: Fix Subprocess Hang on Interactive Commands
**Date:** 2024-02-24
**Scope:** src/tools/executor.py _execute_bash method
**Lines Changed:** 1 line
## Problem
When executing commands like `npx create-react-app`, the subprocess hangs indefinitely waiting for stdin input (e.g., "Ok to proceed? (y)"). This causes:
1. 300s timeout to be reached
2. opencode to hang waiting for response
3. Poor user experience
## Root Cause
`subprocess.run()` by default inherits stdin from parent process. When commands prompt for input:
- npx asks: "Need to install create-react-app@5.1.0 Ok to proceed? (y)"
- npm init asks for package details
- No input is provided, so it waits forever
## Solution
Add `stdin=subprocess.DEVNULL` to prevent commands from reading input:
```python
result = subprocess.run(
command,
shell=True,
capture_output=True,
text=True,
timeout=timeout,
cwd=cwd,
stdin=subprocess.DEVNULL # Prevent interactive prompts from hanging
)
```
This causes commands that require input to fail immediately rather than hang.
## Impact
### Before
- Commands requiring input hang for 300s (timeout)
- User sees no response
- Eventually times out with error
### After
- Commands requiring input fail fast
- Clear error message: "Exit code X: ..."
- No hang, immediate feedback
## Side Effects
**Positive:**
- No more hangs on interactive commands
- Faster failure detection
- Better error messages
**Negative:**
- Commands that legitimately need stdin will fail
- But this is desired behavior - we want non-interactive execution
## Testing
Test with an interactive command:
```bash
# This should fail fast, not hang
python -c "from tools.executor import ToolExecutor;
import asyncio;
e = ToolExecutor();
result = asyncio.run(e.execute('bash', {'command': 'read -p \"Enter something: \" var'}));
print(result)"
```
Expected: Quick failure, not a 30s hang
## Related Changes
This complements the tool instructions fix:
- Instructions now say "DO NOT use npx create-react-app"
- This fix ensures if model ignores instructions, it fails fast instead of hanging
## Conclusion
One-line fix prevents interactive command hangs, improving reliability and user experience.
@@ -1,178 +0,0 @@
# Design Decision: Fix Tool Execution and Token Reporting
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions and token counting
## Problem Statement
User report shows three critical failures:
1. **Instruction vs Execution:** Model says "You should run mkdir..." instead of TOOL: format
2. **Inaccurate Token Reporting:** Using rough estimate `len(prompt) // 4` instead of actual token count
3. **Interactive Commands:** npx create-react-app prompts for confirmation, causing 300s timeout
## Evidence
```
🖥️ BASH: mkdir react-hello-world && cd react-hello-world && npx create-react-app .
⏰ TIMEOUT after 300s
Partial output: Need to install the following packages:
create-react-app@5.1.0
Ok to proceed? (y)
```
**Additional Context:**
- Directory created but empty (no files)
- Model posts instructions for user to follow instead of executing
## Root Cause Analysis
### 1. Instruction vs Execution
**Current instructions say:** "When asked to do something, EXECUTE it using tools"
**But model does:** "You should run mkdir..."
**Why:** Instructions aren't strong enough - need explicit anti-patterns
### 2. Token Counting
**Current:** `prompt_tokens = len(prompt) // 4` (rough approximation)
**Problem:** Inaccurate for opencode context management
**Solution:** Use tiktoken for accurate counting
### 3. Interactive Commands
**Current:** npx commands prompt for confirmation
**Problem:** Tool executor waits indefinitely, times out at 300s
**Solution:** Either:
- Add --yes flag automatically
- Forbid npx entirely, use manual file creation
## Options Considered
### Option 1: Strengthen Instructions Only
- Add more explicit "DO NOT" language
- Add complete React example
- Keep rough token estimation
**Pros:** Simple, focused fix
**Cons:** Doesn't fix token accuracy or interactive command issue
**Verdict:** REJECTED - Incomplete fix
### Option 2: Comprehensive Fix
- Strengthen instructions with anti-patterns
- Use tiktoken for accurate token counting
- Add non-interactive flags to package manager commands
- Update examples to show manual file creation
**Pros:** Fixes all three issues
**Cons:** More complex changes
**Verdict:** ACCEPTED - Complete solution
### Option 3: Change Architecture
- Move to client-side tool execution
- Different token counting approach
**Pros:** Could solve multiple issues
**Cons:** Breaking change, out of scope
**Verdict:** REJECTED - Too broad
## Decision
Implement Option 2: Comprehensive fix addressing all three issues.
### Changes
#### 1. Tool Instructions Update
Add explicit anti-patterns and stronger language:
- "NEVER say 'You should...' - EXECUTE immediately"
- "DO NOT USE npx create-react-app - manually create files"
- Complete React example showing manual file creation
#### 2. Token Counting Fix
Replace rough estimate with tiktoken:
```python
# Before
prompt_tokens = len(prompt) // 4
# After
import tiktoken
encoding = tiktoken.get_encoding('cl100k_base')
prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(content))
```
#### 3. Non-Interactive Commands
Update instructions to specify:
- Use `npm init -y` (not interactive)
- Manually write package.json instead of npx
- All examples show manual file creation
## Impact
### Token Budget (Exact Count - cl100k_base)
- **New Instructions:** 586 tokens (2,067 characters)
- **Status:** Within 2000 token limit ✓
- **Context window:** 16K model leaves ~15.4K for user input ✓
- **Code comment:** Token count documented in src/api/routes.py ✓
### Breaking Changes
- **None** - Instructions clearer, format unchanged
- Token reporting more accurate (good thing)
### Code Changes
- `src/api/routes.py`:
- Update tool_instructions (~+15 lines)
- Add tiktoken import
- Replace token estimation logic (~5 lines)
## Testing Strategy
1. **Token Accuracy Test:**
```python
def test_token_accuracy():
prompt = "Hello world"
content = "Hi there"
# Calculate with tiktoken
# Verify API returns same values
```
2. **Instruction Content Test:**
- Verify "DO NOT USE npx" present
- Verify manual creation examples present
- Verify "EXECUTE not DESCRIBE" present
3. **Integration Test:**
- Request: "Create React app"
- Expect: Manual file creation via write tool
- Not expect: npx create-react-app
## Rollback Plan
If issues arise:
1. Revert to previous instructions
2. Keep tiktoken for token counting (beneficial)
3. Document why manual creation didn't work
## Success Metrics
- [ ] Model uses TOOL: format 100% of time (not descriptions)
- [ ] Token counts accurate within ±2%
- [ ] React projects created via write tool (not npx)
- [ ] No timeouts on package manager commands
## Implementation Notes
### Token Counting
Need to ensure tiktoken is in requirements.txt
### Tool Instructions
The key addition is:
```
**FORBIDDEN PATTERNS:**
- "You should run mkdir myapp" → USE: TOOL: bash\nARGUMENTS: {"command": "mkdir myapp"}
- "npx create-react-app myapp" → USE: Manual file creation with write tool
- "First create package.json, then..." → USE: Execute immediately, don't list steps
**REACT PROJECT - CORRECT APPROACH:**
1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\"...}"}
3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "..."}
4. Continue until all files created
```
@@ -1,172 +0,0 @@
# Design Decision: Improved Tool Instructions
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
**Lines Changed:** ~25 lines
## Problem
Current tool instructions (~125 tokens) fail to communicate key behavioral expectations:
1. **Passive vs Active:** Model describes what to do instead of doing it
2. **Refusal:** Model claims "I am only an AI assistant" instead of executing
3. **Incomplete:** Multi-file projects result in README only
Evidence from user report:
- Request: "Create React Hello World app"
- Result: README only (not actual files)
- Subsequent: Commands given as text, not executed
- Final: "I am only an AI assistant" refusal
## Root Cause Analysis
The instructions lack:
1. **Authority statement** - "You CAN and SHOULD use tools"
2. **Execution mandate** - "Execute commands, don't just describe them"
3. **Workflow clarity** - Clear step-by-step expectations
4. **Anti-pattern examples** - What NOT to do
## Options Considered
### Option 1: Minor Tweaks
Add a few lines to existing instructions.
- **Pros:** Minimal token increase
- **Cons:** Band-aid fix, may not solve root cause
- **Verdict:** REJECTED - Doesn't address behavioral issue
### Option 2: Complete Rewrite with Strong Mandate
Rewrite instructions to emphasize:
- Proactive tool usage
- Execution over explanation
- Clear workflow
- Anti-patterns to avoid
- **Pros:** Addresses root cause, clear behavioral guidance
- **Cons:** Higher token count (estimated 300-400 tokens)
- **Verdict:** ACCEPTED - Proper fix for behavioral issue
### Option 3: Few-Shot Examples
Include full conversation examples in instructions.
- **Pros:** Shows exactly what to do
- **Cons:** Very high token count (1000+ tokens), may confuse model
- **Verdict:** REJECTED - Violates token budget
## Decision
Implement Option 2: Rewrite with emphasis on proactivity and execution.
**Key additions:**
1. **Capability statement:** "You have tools. Use them."
2. **Execution mandate:** "Don't describe, execute"
3. **Workflow:** Clear request→tool→result→next cycle
4. **Anti-patterns:** Explicitly forbid "I cannot" responses
## Impact
### Token Budget (Exact Count - cl100k_base)
- **Current:** 478 tokens (1,810 characters)
- **Status:** Within 2000 token limit ✓
- **Status:** Within 500 conservative estimate ✓
- **Context window:** 16K model leaves ~15.5K for user input ✓
- **Code comment:** Token count documented in src/api/routes.py ✓
### Code Changes
- **File:** src/api/routes.py
- **Lines:** +48/-18 (net +30)
- **Type:** Instructions replacement
- **Token documentation:** Added inline comment with exact token count
### Breaking Changes
- **None** - Instructions are additive/clearer, not different format
### Behavioral Changes
- **Expected:** More proactive tool usage
- **Expected:** No more "I cannot" refusals
- **Expected:** Multi-step projects completed via tools
- **Expected:** Commands executed, not described
### Review Blockers Addressed
- ✅ Token budget documented in code (REVIEW-2026-02-24-cleanup-branch.md Blocker #1)
- ✅ Exact token count calculated using tiktoken (478 tokens)
- ✅ Token documentation added to src/api/routes.py (REVIEW-2026-02-24-cleanup-branch.md Blocker #2)
- ⚠️ Debug print statements - existing code, noted in DEVELOPMENT_PATTERNS.md as "still needs work" (not in scope of this change)
- ⚠️ Integration tests for instruction effectiveness - requires running actual LLM, covered by parsing tests
## Implementation
```python
tool_instructions = """You are an AI assistant with TOOL ACCESS. You CAN and MUST use tools to accomplish tasks.
**CRITICAL RULES:**
1. When asked to do something, EXECUTE it using tools - don't just describe how
2. You MUST use the bash tool to run commands like mkdir, npm, ls, etc.
3. You MUST use the write tool to create files
4. NEVER say "I cannot" or "I am only an AI assistant" - you HAVE tools, USE them
5. Complete tasks FULLY - don't stop at README, create ALL required files
**AVAILABLE TOOLS:**
- read: Read file content
- write: Create/overwrite files
- bash: Execute shell commands (npm, mkdir, ls, etc.)
**TOOL FORMAT (STRICT):**
TOOL: tool_name
ARGUMENTS: {"param": "value"}
**WORKFLOW:**
1. User asks for something
2. You decide what tool to use
3. You respond with ONLY the TOOL: format above
4. You receive the tool result
5. You continue with next tool until task is COMPLETE
**EXAMPLES:**
Creating a project:
User: "Create a React app"
You: TOOL: bash
ARGUMENTS: {"command": "mkdir myapp && cd myapp && npm init -y"}
[wait for result]
You: TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
[continue until all files created]
Running commands:
User: "Install dependencies"
You: TOOL: bash
ARGUMENTS: {"command": "npm install"}
[wait for result, then confirm completion]
**WHAT NOT TO DO:**
- ❌ "To create a React app, you should run: mkdir myapp" (describing)
- ❌ "I cannot run commands, I am an AI" (refusing)
- ❌ Creating only README instead of full project (incomplete)
- ❌ "First do X, then do Y" (giving instructions instead of doing)
**CORRECT BEHAVIOR:**
- ✅ Execute the command immediately using the bash tool
- ✅ Create all files using the write tool
- ✅ Continue until task is 100% complete
- ✅ Use ONE tool at a time and wait for results"""
```
## Testing
1. Test with React Hello World request
2. Verify model uses bash to create directory structure
3. Verify model uses write to create all files
4. Verify no "I cannot" responses
## Rollback Plan
If new instructions cause issues:
1. Revert to previous ~125 token version
2. Analyze what specifically failed
3. Iterate on smaller changes
## Success Metrics
- [ ] Model uses tools on first request (not after prompting)
- [ ] Zero "I cannot" or "I am an AI" responses
- [ ] Multi-file projects fully created
- [ ] Commands executed, not described
@@ -1,151 +0,0 @@
# Design Decision: Task Planning and Verification Workflow
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
**Problem:** Model creates folder but doesn't complete full task or verify completion
## Problem Statement
User reports:
1. "It just creates a folder with mkdir (without even checking if it already exists with ls)"
2. No verification that tasks are completed
3. No planning of full task scope
4. Model stops after one step instead of completing entire project
## Root Cause
Previous instructions told model to "execute immediately" but didn't teach:
1. **Planning** - What needs to be done
2. **Checking** - What already exists
3. **Verification** - Did the step work
4. **Completion loop** - Keep going until done
## Solution
Add **Task Completion Workflow** to instructions:
```
**TASK COMPLETION WORKFLOW (MANDATORY):**
**1. PLAN:** List ALL steps needed before starting
**2. CHECK:** Use ls to verify what exists before creating
**3. EXECUTE:** Run first step
**4. VERIFY:** Confirm step worked (ls, read file)
**5. REPEAT:** Steps 3-4 until ALL complete
**6. FINAL CHECK:** Verify entire task is done
**7. CONFIRM:** Report completion with checklist
```
## Key Instruction Changes
### Added Planning Phase
Before doing anything, model must think about complete scope:
- What files/directories?
- What dependencies?
- Complete task requirements
### Added Verification Steps
Every step must be verified:
- `ls -la` after mkdir
- `read` file after write
- Check content is correct
### Added Completion Loop
Model must continue until:
✓ All directories exist
✓ All files exist with correct content
✓ All dependencies installed
✓ Each component verified
### Complete Working Example
Provided 13-step React example showing:
1. Check existing (ls)
2. Create directory
3. Verify created (ls)
4. Create package.json
5. Verify package.json (read)
6. Create source files
7. Final verification (find myapp -type f)
8. Install dependencies
9. Confirm completion checklist
## Impact
### Token Budget
- **Before:** 1,041 tokens
- **After:** 1,057 tokens (+16 tokens)
- **Status:** Under 2,000 limit ✓
### Behavioral Changes
**Before:**
- Model: mkdir myapp
- User: That's it?
- Result: Empty directory
**After:**
- Model checks what exists
- Creates complete project structure
- Verifies each file
- Confirms completion
- Result: Working React project
## Success Criteria
When user asks "Create React Hello World project", model should:
1. ✓ Check current directory contents
2. ✓ Create myapp/ directory
3. ✓ Verify directory created
4. ✓ Create package.json
5. ✓ Verify package.json content
6. ✓ Create src/App.js
7. ✓ Create src/index.js
8. ✓ Create public/index.html
9. ✓ Final verification (list all files)
10. ✓ npm install
11. ✓ Confirm completion checklist
## Testing
Test instructions contain:
- PLAN/CHECK keywords
- VERIFY keyword
- COMPLETE keyword
All tests pass: 11/11 ✓
## Trade-offs
**Pros:**
- Complete task execution
- Verification prevents partial work
- Clear completion criteria
- Better user experience
**Cons:**
- More tokens (but still under limit)
- More verbose instructions
- May be slower (more verification steps)
## Related Files Changed
1. src/api/routes.py - Updated tool_instructions
2. tests/test_tool_parsing.py - Updated tests for new content
3. docs/design/2024-02-24-task-planning-verification.md - This doc
## Future Improvements
1. **Task Queue System:** Server-side queue of pending operations
2. **State Persistence:** Remember what's been done across conversations
3. **Smart Resumption:** If interrupted, pick up where left off
4. **Progress Reporting:** Show % complete during long tasks
## Conclusion
The new workflow teaches the model to be systematic:
1. Plan before acting
2. Check before creating
3. Verify after each step
4. Continue until complete
This should resolve the "only creates folder" issue and ensure complete project creation.
@@ -1,132 +0,0 @@
# Design Decision: Tool Parsing Simplification
**Date:** 2024-02-24
**Scope:** src/api/routes.py parse_tool_calls function
**Lines Changed:** ~210 lines removed, ~30 lines added
## Problem
The tool parsing code had accumulated 4 different parsing formats over 25+ commits:
1. JSON `tool_calls` format with nested objects
2. TOOL:/ARGUMENTS: format (simple text)
3. Function pattern format `func_name(args)`
4. Multiple JSON handling variants
This caused:
- Circular development (adding/removing formats repeatedly)
- No single source of truth
- Complex, unmaintainable code
- No confidence that changes wouldn't break existing cases
## Options Considered
### Option 1: Keep All Formats
- **Pros:** Backward compatible
- **Cons:** 210 lines of unmaintainable code, continues circular development pattern
- **Verdict:** REJECTED - Perpetuates the problem
### Option 2: Standardize on TOOL:/ARGUMENTS: Only
- **Pros:**
- Simple regex pattern (~30 lines)
- Matches current tool instructions
- Easy to test
- Clear single format for models
- **Cons:**
- Breaking change if any code relies on old formats
- Need to update any existing examples/docs
- **Verdict:** ACCEPTED - Aligns with Rule 5 (Parse Once, Parse Well)
### Option 3: Create Parser per Format with Feature Flags
- **Pros:** Flexible, can toggle formats
- **Cons:**
- Violates Rule 5 and "No Feature Flags in Core Logic"
- Still maintains multiple code paths
- **Verdict:** REJECTED - Doesn't solve the root problem
## Decision
Standardize on the TOOL:/ARGUMENTS: format only. Remove all other parsing code.
**Rationale:**
- Per DEVELOPMENT_PATTERNS.md recommendation #3: "One Format Only"
- Token cost is minimal (no complex regex)
- Test coverage provides confidence
- Aligns with existing tool instructions
## Impact
### Token Count
- **Parser code:** 210 lines → 30 lines (-180 lines)
- **No change** to tool instructions (separate optimization)
### Breaking Changes
- **Yes** - Removes support for:
- JSON `tool_calls` format in model responses
- Function pattern format `read_file(path="test.txt")`
**Migration:** Models must use:
```
TOOL: read
ARGUMENTS: {"filePath": "test.txt"}
```
### Testing
- Unit tests added: 9 test cases
- Coverage: All parsing scenarios
- All tests pass
## Implementation
```python
# New implementation (30 lines)
def parse_tool_calls(text: str) -> tuple:
"""Parse tool calls using standardized format."""
import json
import re
tool_pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
tool_matches = list(re.finditer(tool_pattern, text, re.IGNORECASE))
if not tool_matches:
return text, None
tool_calls = []
for i, tool_match in enumerate(tool_matches):
tool_name = tool_match.group(1)
args_str = tool_match.group(2)
try:
args_dict = json.loads(args_str)
tool_calls.append({
"id": f"call_{i+1}",
"type": "function",
"function": {
"name": tool_name,
"arguments": json.dumps(args_dict)
}
})
except json.JSONDecodeError:
continue
if not tool_calls:
return text, None
first_start = tool_matches[0].start()
content = text[:first_start].strip()
return content, tool_calls
```
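A quick usage sketch of the parser above (the output shape follows the implementation as written):

```python
text = 'Let me check that file.\nTOOL: read\nARGUMENTS: {"filePath": "notes.txt"}'
content, tool_calls = parse_tool_calls(text)
# content    == "Let me check that file."
# tool_calls == [{"id": "call_1", "type": "function",
#                 "function": {"name": "read", "arguments": '{"filePath": "notes.txt"}'}}]
```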
## Verification
Run tests:
```bash
python tests/test_tool_parsing.py
```
Expected: 9 passed, 0 failed
## Follow-up
- [x] Update DEVELOPMENT_PATTERNS.md to mark as completed
- [x] Add unit tests
- [ ] Consider integration test for full tool execution flow
@@ -1,98 +0,0 @@
# Investigation: 31k Token Context Issue
## Problem
When making requests through opencode to local_swarm, the LLM receives ~31k tokens of context even for simple empty directory queries.
## Root Cause Identified
**NOT an issue with this repo's codebase - this is expected behavior for function calling.**
### How it works:
1. **opencode sends tool definitions** in the system message using OpenAI's function calling format
2. **Each tool definition is ~450 tokens** (name + description + parameters)
3. **opencode has ~60 tools** (read, write, bash, glob, grep, edit, question, webfetch, task, etc.)
4. **Total tool definition tokens:** ~27,000 tokens
### Calculation:
```
Single tool definition: ~450 tokens
Number of tools: ~60
Tool schemas total: ~27,000 tokens
System message: ~500 tokens
User query: ~100 tokens
---
Total: ~27,600 tokens
```
**This roughly matches the observed ~31k tokens.**
## Why This Happens
OpenAI's function calling protocol requires sending the **complete function schemas** to the LLM with every request. This is how the model:
- Knows what tools are available
- Understands parameter requirements
- Knows how to format tool calls
All major LLM providers using function calling work this way (OpenAI, Anthropic, local models, etc.).
## Verification
```bash
python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
# Example from actual opencode tool definition
read_tool_schema = '''{\"type\": \"function\", \"function\": {\"name\": \"read\", \"description\": \"Read a file or directory from the local filesystem...[full description]\", \"parameters\": {...}}}'''
print(f'Single tool schema: {len(enc.encode(read_tool_schema))} tokens')
print(f'Estimated 60 tools: {len(enc.encode(read_tool_schema)) * 60:,} tokens')
"
```
Result:
- Single tool definition: ~451 tokens
- 60 tools: ~27,060 tokens
- Plus system + user message: ~27,660 total
## This Is NOT a Bug
The 31k token context is **correct and expected** for function calling with 60+ tools. This is how:
- OpenAI API works
- Claude API works
- Local models with function calling work
## Potential Optimizations (Optional)
If reducing context size is critical, consider:
### Option 1: Dynamic Tool Selection
- Only send tools relevant to current task
- Example: For file operations, only send [read, write, glob, edit]
- Trade-off: Requires opencode to intelligently filter tools
### Option 2: Compressed Tool Descriptions
- Shorten tool descriptions to essentials
- Example: "Read file at path (required: filePath)"
- Trade-off: Model may make more errors with less guidance
### Option 3: Tool Grouping
- Group similar tools into single "tools: [read, write, glob]" parameter
- Trade-off: Breaks OpenAI compatibility
## Recommendation
**NO ACTION REQUIRED.** The 31k token context is:
- Standard for function calling with many tools
- Within capabilities of modern LLMs (32k-128k context windows)
- Not caused by this repo's code
The `.opencodeignore` created earlier will help with opencode's own system prompt, but doesn't affect the LLM context sent to local_swarm.
## Additional Finding
While investigating, verified:
- `config/prompts/tool_instructions.txt`: 125 tokens ✅
- This repo's tool execution code: No token bloat ✅
- Issue is purely opencode's function calling protocol ✅
@@ -1,112 +0,0 @@
# Test Plan: Fix Tool Execution and Token Reporting
## Problem Analysis
### Issue 1: Model Gives Instructions Instead of Executing
**Current behavior:** Model describes what to do ("You should run mkdir...") instead of using TOOL: format
**Expected:** Model responds with TOOL: bash\nARGUMENTS: {"command": "mkdir..."}
### Issue 2: Token Counting Inaccurate
**Current:** Rough estimate `len(prompt) // 4`
**Expected:** Accurate token count using tiktoken
**Impact:** opencode can't properly manage context window
### Issue 3: npx Commands Timeout/Need Input
**Current:** `npx create-react-app .` prompts for confirmation (y/n)
**Expected:** Non-interactive execution or manual file creation
**Evidence:** "Need to install the following packages: create-react-app@5.1.0 Ok to proceed? (y)"
## Unit Tests
### Test 1: Accurate Token Counting
- [ ] Verify token count uses tiktoken (not rough estimate)
- [ ] Test with known token counts
- [ ] Verify prompt_tokens + completion_tokens = total_tokens
### Test 2: Non-Interactive Bash Commands
- [ ] Verify npm/npx commands use --yes or equivalent flags
- [ ] Test timeout handling for package managers
- [ ] Verify commands don't prompt for user input
### Test 3: Tool Instructions Content
- [ ] Verify instructions emphasize "EXECUTE not DESCRIBE"
- [ ] Verify manual file creation examples (not npx)
- [ ] Verify anti-patterns are clearly stated
## Integration Tests
### Test 4: End-to-End React Project Creation
**Input:** "Create a React Hello World app"
**Expected Flow:**
1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "..."}
4. Continue until complete
**Failure Modes:**
- [ ] Model describes steps instead of executing
- [ ] Uses npx create-react-app (should manually create files)
- [ ] Stops after README only
### Test 5: Token Reporting Accuracy
**Input:** Any chat completion request
**Expected:**
- usage.prompt_tokens matches actual tokens
- usage.completion_tokens matches actual tokens
- usage.total_tokens is sum
**Verification:**
- Compare tiktoken count vs API response
## Manual Verification
```bash
# Test React creation
python main.py --auto &
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-Client-Working-Dir: /tmp/test-project" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Create a React Hello World app"}],
"tools": [{"type": "function", "function": {"name": "bash"}}, {"type": "function", "function": {"name": "write"}}]
}'
# Check token accuracy
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Hello"}]
}' | jq '.usage'
```
## Success Criteria
1. **Execution:** 100% of requests use TOOL: format (not descriptions)
2. **Accuracy:** Token counts match tiktoken within ±5%
3. **Completion:** Multi-file projects fully created via write tool
4. **No npx:** Manual file creation for React (no npx create-react-app)
## Implementation Notes
### Token Counting Fix
```python
# Replace: prompt_tokens = len(prompt) // 4
# With:
import tiktoken
encoding = tiktoken.get_encoding('cl100k_base')
prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(content))
```
### Tool Instructions Fix
- Add explicit "DO NOT USE npx create-react-app" instruction
- Add "EXECUTE IMMEDIATELY" mandate
- Show complete React example with manual file creation
### Non-Interactive Commands
- Auto-add --yes to npx commands
- Or recommend manual file creation instead
@@ -1,97 +0,0 @@
# Test Plan: Improved Tool Instructions
## Problem Statement
Model is not using tools effectively:
1. Creates README instead of actual project structure
2. Provides commands as text instead of executing them
3. Refuses to run commands claiming "I am only an AI assistant"
## Root Cause Analysis
Current instructions don't clearly communicate:
- That the model SHOULD use tools proactively
- That execution is expected, not explanation
- The workflow: user request → tool execution → result
## Unit Tests (Instruction Verification)
### Test 1: Instruction Presence
- [ ] Verify instructions are injected into system message
- [ ] Verify instructions appear at the START of system message (priority position)
### Test 2: Token Count
- [ ] Measure total token count of new instructions
- [ ] Verify ≤ 500 tokens (conservative budget)
- [ ] Document before/after
### Test 3: Format Compliance
- [ ] Verify instructions include TOOL:/ARGUMENTS: format
- [ ] Verify examples use correct format
- [ ] Verify rules are clear and numbered
## Integration Tests (Behavioral)
### Test 4: Project Creation Flow
**Input:** "Create a React Hello World app"
**Expected Behavior:**
1. Model responds with TOOL: bash, ARGUMENTS: mkdir myapp
2. After result, TOOL: write, ARGUMENTS: package.json content
3. After result, TOOL: write, ARGUMENTS: src/App.js content
4. Continue until complete project structure exists
**Failure Modes:**
- [ ] Model only describes what to do
- [ ] Model creates README only
- [ ] Model refuses to execute commands
### Test 5: Multi-step Task
**Input:** "Check what files exist, then create a test.txt file with 'hello' in it"
**Expected Behavior:**
1. TOOL: bash, ARGUMENTS: ls -la
2. Wait for result
3. TOOL: write, ARGUMENTS: test.txt with "hello"
**Failure Modes:**
- [ ] Model tries to do both in one response
- [ ] Model doesn't wait for ls result before writing
### Test 6: Command Refusal
**Input:** "Run npm install"
**Expected Behavior:**
1. TOOL: bash, ARGUMENTS: npm install
**Failure Modes:**
- [ ] Model responds: "I cannot run commands, I am only an AI assistant"
- [ ] Model explains npm install instead of running it
## Manual Verification Commands
```bash
# Start the server
python main.py --auto
# In another terminal, test with curl
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Create a React Hello World app"}],
"tools": [{"type": "function", "function": {"name": "bash", "description": "Run shell commands"}}, {"type": "function", "function": {"name": "write", "description": "Write files"}}]
}'
```
## Success Criteria
1. **Proactivity:** Model uses tools without being asked twice
2. **Execution:** Model runs commands, doesn't just describe them
3. **No Refusal:** Model never says "I cannot" or "I am only an AI"
4. **Completeness:** Multi-file projects are fully created via tools
5. **Format:** 100% of tool calls use correct TOOL:/ARGUMENTS: format
## Metrics
- **Tool usage rate:** % of requests that result in tool calls
- **Format compliance:** % of tool calls in correct format
- **Completion rate:** % of multi-step tasks fully completed
@@ -1,35 +0,0 @@
# Test Plan: Tool Parsing Simplification
## Unit Tests
- [x] Test case 1: Single tool call → Returns 1 tool with correct name and arguments
- [x] Test case 2: No tool in text → Returns None for tools, original text as content
- [x] Test case 3: Multiple tools → Returns all tools in order
- [x] Test case 4: Content before tool → Content extracted, tool parsed correctly
- [x] Test case 5: Bash tool → Correctly parses bash command
- [x] Test case 6: Case insensitive → "tool:" and "TOOL:" both work
- [x] Test case 7: Invalid JSON → Skips invalid, continues with valid
- [x] Test case 8: Empty text → Returns None, empty string
- [x] Test case 9: Whitespace only → Returns None
## Integration Tests
- [ ] End-to-end flow:
1. Send chat completion request with tools
2. Model responds with TOOL:/ARGUMENTS: format
3. Parser extracts tool call
4. Tool executes
5. Result returned in response
- [ ] Expected result: Tool executes successfully, result included in response
## Manual Verification
- [ ] Command: `python tests/test_tool_parsing.py`
- [ ] Expected output: "9 passed, 0 failed"
## Token Budget Verification
- Parser code: ~30 lines (~200 tokens)
- Well under 2000 token limit
- Simple regex pattern maintains low complexity
+451 -56
@@ -7,7 +7,7 @@ import json
import logging
import time
import uuid
from typing import Optional
from typing import Optional, List
from api.models import (
ChatCompletionRequest,
@@ -20,11 +20,54 @@ from api.formatting import format_messages_with_tools
from api.tool_parser import parse_tool_calls
from utils.token_counter import count_tokens
from tools.executor import get_tool_executor
from chatlog import get_chat_logger
logger = logging.getLogger(__name__)
def _extract_working_dir_from_prompt(prompt: str) -> Optional[str]:
"""Extract working directory from user prompt.
Looks for patterns like:
- "in the /path/to/dir directory"
- "in directory /path/to/dir"
- "in /path/to/dir"
- "under /path/to/dir"
- "from /path/to/dir"
Args:
prompt: User prompt text
Returns:
Extracted directory path or None
"""
import re
import os
# Common patterns for directory mentions
patterns = [
r'in the\s+([/~]?[\w\-/.]+)\s+(?:directory|folder|dir)',
r'in\s+(?:directory|folder|dir)\s+([/~]?[\w\-/.]+)',
r'(?:in|under|from|at)\s+([/~]?[\w\-/.]{3,})', # At least 3 chars to avoid "in a"
]
for pattern in patterns:
match = re.search(pattern, prompt, re.IGNORECASE)
if match:
path = match.group(1)
# Validate it looks like a path
if path.startswith('/') or path.startswith('~') or '/' in path:
# Expand home directory
if path.startswith('~'):
path = os.path.expanduser(path)
# Check if it's a valid directory or parent exists
if os.path.isdir(path) or os.path.isdir(os.path.dirname(path)):
return os.path.abspath(path)
return None
def _sanitize_tools(tools: Optional[list]) -> Optional[list]:
"""Sanitize tool definitions to fix invalid schemas.
@@ -61,19 +104,19 @@ async def _execute_tools(
tool_calls: list,
client_working_dir: Optional[str],
executor
) -> str:
) -> List[tuple]:
"""Execute tool calls and return results.
Args:
tool_calls: List of parsed tool calls
client_working_dir: Working directory for file operations
executor: Tool executor instance
Returns:
Combined tool results as string
List of tuples (tool_name, result_string)
"""
from api.routes import execute_tool_server_side
tool_results = []
for i, tc in enumerate(tool_calls):
tool_name = tc.get("function", {}).get("name", "")
@@ -85,10 +128,10 @@ async def _execute_tools(
logger.debug(f" [{i+1}/{len(tool_calls)}] Executing: {tool_name}({tool_args})")
result = await execute_tool_server_side(tool_name, tool_args, working_dir=client_working_dir)
tool_results.append(f"Tool '{tool_name}' result: {result}")
tool_results.append((tool_name, result))
logger.debug(f" ✓ Completed: {result[:100]}..." if len(result) > 100 else f" ✓ Result: {result}")
return "\n\n".join(tool_results)
return tool_results
def _create_response(
@@ -97,10 +140,25 @@ def _create_response(
finish_reason: str,
prompt: str,
request: ChatCompletionRequest,
swarm_manager=None
swarm_manager=None,
thinking_content: Optional[str] = None
) -> ChatCompletionResponse:
"""Create a chat completion response.
Args:
content: Final response content (after tool execution if any)
tool_calls: List of tool calls
finish_reason: Finish reason
prompt: Original prompt for token counting
request: Original request
swarm_manager: Swarm manager instance (optional, for getting model name)
thinking_content: Intermediate thinking/planning content to include in streaming as reasoning_content
Returns:
ChatCompletionResponse
"""
"""Create a chat completion response.
Args:
content: Response content
tool_calls: List of tool calls
@@ -141,7 +199,7 @@ def _create_response(
message = ChatMessage(**message_kwargs)
return ChatCompletionResponse(
response = ChatCompletionResponse(
id=f"chatcmpl-{uuid.uuid4().hex[:12]}",
created=int(time.time()),
model=model_name,
@@ -162,26 +220,56 @@ def _create_response(
system_fingerprint=system_fingerprint
)
# Attach thinking content for streaming (not part of JSON serialization)
# Use a private attribute to avoid interfering with model serialization
if thinking_content is not None:
setattr(response, '_thinking', thinking_content)
async def _generate_with_local_swarm(
swarm_manager,
return response
async def _generate_with_consensus(
prompt: str,
max_tokens: int,
temperature: float,
stream: bool = False
swarm_manager,
federated_swarm=None
) -> tuple[str, int, float]:
"""Generate response using local swarm.
"""Generate response with consensus (local or federated).
This is the unified generation interface - it handles both local-only
and federated generation transparently. Callers don't need to know
which mode is being used.
Args:
swarm_manager: Swarm manager instance
prompt: Prompt to generate from
max_tokens: Maximum tokens to generate
temperature: Sampling temperature
stream: Whether this is a streaming request
swarm_manager: Local swarm manager instance
federated_swarm: Optional federated swarm for multi-node consensus
Returns:
Tuple of (response_text, tokens_generated, tokens_per_second)
"""
# Check if federation is available
if federated_swarm is not None:
peers = federated_swarm.discovery.get_peers()
if peers:
logger.debug(f"🌐 Using federation with {len(peers)} peer(s)")
try:
fed_result = await federated_swarm.generate_with_federation(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature
)
# Federation returns FederationResult object
# Extract the final response text
return fed_result.final_response, 0, 0.0 # Tokens/TPS not tracked in federation mode
except Exception as e:
logger.warning(f"Federation failed, falling back to local: {e}")
# Fall through to local generation
# Local generation (fallback or no federation)
try:
result = await swarm_manager.generate(
prompt=prompt,
@@ -189,18 +277,178 @@ async def _generate_with_local_swarm(
temperature=temperature,
use_consensus=True
)
response = result.selected_response
return (
response.text,
response.tokens_generated,
response.tokens_per_second
)
return response.text, response.tokens_generated, response.tokens_per_second
except Exception as e:
logger.exception("Error in swarm generation")
raise
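# Example call (illustrative) - callers pass the same arguments regardless of mode:
#   await _generate_with_consensus(prompt, 1024, 0.7, swarm_manager, federated_swarm)
# uses federation when peers are discovered, while federated_swarm=None forces
# local-only generation (as the tool-result iterations below do).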
def _tool_calls_agree(tool_calls_list: List[List[dict]]) -> bool:
"""Check if all workers agree on the same tool calls.
Args:
tool_calls_list: List of tool calls from each worker
Returns:
True if all workers have the same tool calls
"""
if not tool_calls_list:
return True
# Check if all have the same number of tool calls
first_count = len(tool_calls_list[0])
if not all(len(tc) == first_count for tc in tool_calls_list):
logger.warning(f" ⚠️ Workers disagree on number of tool calls: {[len(tc) for tc in tool_calls_list]}")
return False
if first_count == 0:
return True # All agree on no tools
# Check if tool names and arguments match
for i in range(first_count):
first_tool = tool_calls_list[0][i]
first_name = first_tool.get("function", {}).get("name", "")
first_args = first_tool.get("function", {}).get("arguments", "")
for j, other_calls in enumerate(tool_calls_list[1:], 1):
other_tool = other_calls[i]
other_name = other_tool.get("function", {}).get("name", "")
other_args = other_tool.get("function", {}).get("arguments", "")
if first_name != other_name:
logger.warning(f" ⚠️ Worker {j+1} disagrees on tool name: {first_name} vs {other_name}")
return False
# For arguments, do a loose comparison (ignore whitespace differences)
try:
first_args_norm = json.loads(first_args) if isinstance(first_args, str) else first_args
other_args_norm = json.loads(other_args) if isinstance(other_args, str) else other_args
if first_args_norm != other_args_norm:
logger.warning(f" ⚠️ Worker {j+1} disagrees on arguments for {first_name}")
return False
except json.JSONDecodeError:
# If JSON parsing fails, compare as strings
if str(first_args).strip() != str(other_args).strip():
logger.warning(f" ⚠️ Worker {j+1} disagrees on arguments for {first_name}")
return False
logger.info(f" ✅ All {len(tool_calls_list)} workers agree on tool calls")
return True
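# Example (illustrative): two workers that both propose
#   {"function": {"name": "read", "arguments": '{"filePath": "notes.txt"}'}}
# agree even if the argument JSON differs in whitespace, because arguments are
# compared after json.loads; any mismatch in tool name, count, or argument values
# returns False.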
async def _generate_with_tool_consensus(
swarm_manager,
prompt: str,
max_tokens: int,
temperature: float
) -> tuple[str, List[dict], int, float]:
"""Generate response with tool call consensus checking.
When multiple workers are active, this ensures they all agree on tool calls
before executing them. If they disagree, returns the best response without tools.
Args:
swarm_manager: Swarm manager instance
prompt: Prompt to generate from
max_tokens: Maximum tokens to generate
temperature: Sampling temperature
Returns:
Tuple of (response_text, tool_calls, tokens_generated, tps)
"""
try:
# Get status to check number of workers
status = swarm_manager.get_status()
num_workers = getattr(status, 'active_workers', 1)
# If only one worker, use normal generation
if num_workers <= 1:
logger.debug(" Single worker mode - skipping tool consensus")
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
parsed_content, tool_calls = parse_tool_calls(response.text)
return response.text, tool_calls, response.tokens_generated, response.tokens_per_second
# Multiple workers - check for tool consensus
logger.info(f" 🔍 Checking tool consensus across {num_workers} workers...")
# Generate from all workers individually
from swarm.manager import GenerationRequest
all_responses = []
all_tool_calls = []
# Get all active workers
workers = swarm_manager.workers if hasattr(swarm_manager, 'workers') else []
if not workers:
# Fall back to normal generation
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
parsed_content, tool_calls = parse_tool_calls(response.text)
return response.text, tool_calls, response.tokens_generated, response.tokens_per_second
# Generate from each worker
for i, worker in enumerate(workers):
try:
gen_result = await worker.generate(
GenerationRequest(prompt=prompt, max_tokens=max_tokens, temperature=temperature)
)
response_text = gen_result.text
parsed_content, tool_calls = parse_tool_calls(response_text)
all_responses.append(response_text)
all_tool_calls.append(tool_calls)
logger.debug(f" Worker {i+1}: {len(tool_calls)} tool call(s)")
except Exception as e:
logger.warning(f" Worker {i+1} failed: {e}")
all_responses.append("")
all_tool_calls.append([])
# Check consensus
if _tool_calls_agree(all_tool_calls):
# All agree - use the first response's tool calls
best_response = all_responses[0] if all_responses else ""
best_tool_calls = all_tool_calls[0] if all_tool_calls else []
# Average word-count estimate across non-empty responses (guard against all-empty)
non_empty = [r for r in all_responses if r]
total_tokens = sum(len(r.split()) for r in non_empty) // len(non_empty) if non_empty else 0
avg_tps = 10.0  # Rough placeholder - per-worker throughput is not aggregated here
return best_response, best_tool_calls, total_tokens, avg_tps
else:
# Disagreement - fall back to consensus strategy without tools
logger.warning(" ⚠️ Tool consensus failed - falling back to text response")
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
# Strip any tool calls to be safe
parsed_content, _ = parse_tool_calls(response.text)
return parsed_content, [], response.tokens_generated, response.tokens_per_second
except Exception as e:
logger.exception("Error in tool consensus generation")
# Fall back to normal generation
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
parsed_content, tool_calls = parse_tool_calls(response.text)
return response.text, tool_calls, response.tokens_generated, response.tokens_per_second
async def _generate_with_federation(
federated_swarm,
prompt: str,
@@ -263,6 +511,29 @@ async def handle_chat_completion(
prompt = format_messages_with_tools(request.messages, None)
has_tools = request.tools is not None and len(request.tools) > 0
# Initialize chat logger (if enabled via LOCAL_SWARM_CHATLOG=1)
chat_logger = get_chat_logger()
# Extract working directory from prompt if not provided by client
if client_working_dir is None:
# Try to extract from user messages
for msg in reversed(request.messages):
if msg.role == 'user':
extracted_dir = _extract_working_dir_from_prompt(msg.content)
if extracted_dir:
client_working_dir = extracted_dir
logger.info(f"📁 Extracted working directory from prompt: {client_working_dir}")
break
# Log initial conversation history to chatlog
for msg in request.messages:
if msg.role == 'user':
chat_logger.log_user_message(msg.content)
elif msg.role == 'assistant':
chat_logger.log_assistant_message(msg.content, has_tool_calls=bool(msg.tool_calls))
elif msg.role == 'tool':
chat_logger.log_tool_result("tool", msg.content)
logger.info(f"\n{'='*60}")
logger.info(f"CHAT COMPLETION REQUEST:")
logger.info(f" has_tools={has_tools}, stream={request.stream}")
@@ -270,21 +541,18 @@ async def handle_chat_completion(
logger.info(f" messages={len(request.messages)}")
logger.info(f"{'='*60}")
# Use federation if available
if federated_swarm is not None:
peers = federated_swarm.discovery.get_peers()
if peers:
logger.info(f"🌐 Using federation with {len(peers)} peer(s)...")
content, tool_calls, finish_reason = await _generate_with_federation(
federated_swarm, prompt, request.max_tokens or 1024, request.temperature or 0.7
)
return _create_response(content, tool_calls, finish_reason, prompt, request, swarm_manager)
# Build conversation history
messages = list(request.messages)
# Determine if we should use federation for generation
use_federation = federated_swarm is not None and len(federated_swarm.discovery.get_peers()) > 0
if use_federation:
logger.info(f"🌐 Federation available with peers")
# Track thinking content for streaming (OpenCode reasoning_content)
thinking_content: Optional[str] = None
thinking_captured = False
# Initialize iteration counter and response text
iteration = 0
max_iterations = 3
@@ -295,10 +563,31 @@ async def handle_chat_completion(
logger.info(f"--- Tool Execution Iteration {iteration} ---")
# Generate response
logger.debug(f"Generating response...")
response_text, tokens_generated, tps = await _generate_with_local_swarm(
swarm_manager, prompt, request.max_tokens or 1024, request.temperature or 0.7
)
# IMPORTANT: Only use federation on FIRST iteration (initial planning)
# Subsequent iterations process tool results, which only the head node has
if iteration == 1 and use_federation:
# First iteration: use federation for consensus on initial plan
logger.info(f"🌐 Using federation for initial generation...")
response_text, tokens_generated, tps = await _generate_with_consensus(
prompt=prompt,
max_tokens=request.max_tokens or 1024,
temperature=request.temperature or 0.7,
swarm_manager=swarm_manager,
federated_swarm=federated_swarm
)
else:
# Subsequent iterations: LOCAL ONLY
# Peers don't have tool results from previous iterations
# Using federation here would cause inconsistent context
if iteration > 1:
logger.debug(f"Using local generation (iteration {iteration}, tool context local only)")
response_text, tokens_generated, tps = await _generate_with_consensus(
prompt=prompt,
max_tokens=request.max_tokens or 1024,
temperature=request.temperature or 0.7,
swarm_manager=swarm_manager,
federated_swarm=None # Force local-only
)
logger.info(f"Generated response ({len(response_text)} chars, {tokens_generated} tokens)")
logger.debug(f"Response: {response_text[:200]}...")
@@ -306,10 +595,30 @@ async def handle_chat_completion(
# Check for tool calls
parsed_content, tool_calls_parsed = parse_tool_calls(response_text)
# Log assistant response to chatlog
chat_logger.log_assistant_message(response_text, has_tool_calls=bool(tool_calls_parsed))
if tool_calls_parsed:
# Log each tool call
for i, tc in enumerate(tool_calls_parsed, 1):
tool_name = tc.get("function", {}).get("name", "")
args_str = tc.get("function", {}).get("arguments", "{}")
try:
args_dict = json.loads(args_str) if isinstance(args_str, str) else args_str
except json.JSONDecodeError:
args_dict = {"raw": args_str}
chat_logger.log_tool_call(tool_name, args_dict, i)
# Capture thinking for OpenCode streaming (first occurrence only)
if not thinking_captured:
# Use the parsed content (without tool calls) as the reasoning
thinking_content = parsed_content or ""
thinking_captured = True
if not tool_calls_parsed:
# No more tools - this is the final answer
logger.info(f"✅ Final answer (no tools) after {iteration} iteration(s)")
return _create_response(parsed_content, [], "stop", prompt, request, swarm_manager)
return _create_response(parsed_content, [], "stop", prompt, request, swarm_manager, thinking_content)
# Tools detected - execute them
logger.info(f"🔧 Found {len(tool_calls_parsed)} tool call(s)")
@@ -318,22 +627,73 @@ async def handle_chat_completion(
args_str = tc.get("function", {}).get("arguments", "{}")
logger.info(f" [{i+1}] {tool_name}: {args_str[:100]}...")
# Add assistant message to history
messages.append(ChatMessage(role="assistant", content=response_text))
# Add assistant message to history with tool_calls (if any)
# This preserves the tool call IDs for proper tool message association
assistant_message = ChatMessage(
role="assistant",
content=response_text
)
if tool_calls_parsed:
# Convert tool calls to proper ToolCall objects with IDs
from api.models import ToolCall
tc_objects = []
for i, tc_dict in enumerate(tool_calls_parsed):
tc_id = tc_dict.get("id", f"call_{i}")
tc_objects.append(ToolCall(
id=tc_id,
type="function",
function={
"name": tc_dict["function"]["name"],
"arguments": tc_dict["function"]["arguments"]
}
))
assistant_message.tool_calls = tc_objects
messages.append(assistant_message)
# Execute all tools
logger.info(f"⏱️ Executing tools...")
tool_results_str = await _execute_tools(tool_calls_parsed, client_working_dir, get_tool_executor())
tool_results = await _execute_tools(tool_calls_parsed, client_working_dir, get_tool_executor())
# Add tool result to history with STOP instruction
# The model needs to be told explicitly to STOP calling tools
tool_result_with_instruction = (
f"{tool_results_str}\n\n"
f"IMPORTANT: You have received the tool result above. "
f"DO NOT call any more tools. Provide your final answer now."
)
messages.append(ChatMessage(role="tool", content=tool_result_with_instruction))
logger.info(f"✅ Tools executed ({len(tool_results_str)} chars)")
# Log tool results to chatlog (single combined log for debugging)
combined_strings = [f"Tool {i+1} ({name}): {result}" for i, (name, result) in enumerate(tool_results)]
chat_logger.log_tool_result("combined", "\n\n".join(combined_strings), success=True)
# Add tool result to history - one message per tool call with proper tool_call_id
for i, ((tool_name, tool_result), tc) in enumerate(zip(tool_results, tool_calls_parsed)):
tool_call_id = tc.get("id", f"call_{i}")
# Format the tool result message with explicit instructions
# This tells the model exactly what to do with the result
if tool_name == "read":
instruction = "The file contents are shown above. READ THIS FILE CONTENT ALOUD to the user. Do not call additional tools."
elif tool_name == "write":
instruction = "The file has been successfully written. CONFIRM to the user that the file was created with the content shown above. Do not call additional tools."
elif tool_name == "bash":
# Check if this was a verification command (ls, grep) vs an action command
if "ls" in tool_result.lower() or "grep" in tool_result.lower():
instruction = "CRITICAL: The listing is shown above. If the user asked to READ a specific file and you can see it exists in this listing, you MUST immediately USE THE read TOOL NOW with the exact filename from the listing. Do not summarize first - READ THE FILE immediately. Use the filename exactly as shown (e.g., 'my-secret.log' not '/path/to/my-secret.log'). If the user asked to just CHECK what files exist (without reading), then summarize. If the requested file is NOT in the listing, tell the user it doesn't exist."
else:
instruction = "The command has been executed. SUMMARIZE the output above to answer the user's request. Do not call additional tools."
else:
instruction = "The tool has completed. Use the result shown above to answer the user's request. Do not call additional tools."
tool_message_content = (
f"Tool Result ({tool_name}):\n"
f"{tool_result}\n\n"
f"INSTRUCTION: {instruction}"
)
messages.append(ChatMessage(
role="tool",
content=tool_message_content,
tool_call_id=tool_call_id,
name=tool_name
))
logger.info(f" ✓ Tool result {i+1} added to history (tool_call_id={tool_call_id}, name={tool_name})")
logger.info(f"✅ Tools executed ({len(tool_results)} results)")
# Continue loop - generate response with tool results
logger.info(f"🔄 Generating response with tool results...")
@@ -341,20 +701,55 @@ async def handle_chat_completion(
# Format with tool results (but DON'T include tool instruction - model should just use results)
next_prompt = format_messages_with_tools(messages, None if use_opencode_tools else request.tools)
response_text, tokens_generated, tps = await _generate_with_local_swarm(
swarm_manager, next_prompt, request.max_tokens or 1024, request.temperature or 0.7
logger.info(f"📤 Prompt sent to model after tool execution:")
logger.info(f" Total tokens: {count_tokens(next_prompt)}")
logger.info(f" Messages in history: {len(messages)}")
for i, msg in enumerate(messages):
logger.info(f" [{i}] {msg.role}: {msg.content[:100]}{'...' if len(msg.content) > 100 else ''}")
if msg.tool_calls:
for j, tc in enumerate(msg.tool_calls):
logger.info(f" Tool call {j}: {tc.function.get('name')} ({tc.function.get('arguments')})")
if msg.tool_call_id:
logger.info(f" (tool_call_id: {msg.tool_call_id}, name: {msg.name})")
logger.debug(f"Full prompt:\n{next_prompt[:1000]}...")
response_text, tokens_generated, tps = await _generate_with_consensus(
prompt=next_prompt,
max_tokens=request.max_tokens or 1024,
temperature=request.temperature or 0.7,
swarm_manager=swarm_manager,
federated_swarm=None # Tool result processing is local-only
)
logger.info(f"Generated with tool results ({len(response_text)} chars, {tokens_generated} tokens)")
logger.info(f"Generated with tool results ({len(response_text)} chars, {tokens_generated} tokens)")
logger.debug(f"Response: {response_text[:200]}...")
# Check for more tools in the new response
parsed_content, tool_calls_parsed = parse_tool_calls(response_text)
# Log assistant response to chatlog
chat_logger.log_assistant_message(response_text, has_tool_calls=bool(tool_calls_parsed))
if tool_calls_parsed:
# Log each tool call
for i, tc in enumerate(tool_calls_parsed, 1):
tool_name = tc.get("function", {}).get("name", "")
args_str = tc.get("function", {}).get("arguments", "{}")
try:
args_dict = json.loads(args_str) if isinstance(args_str, str) else args_str
except json.JSONDecodeError:
args_dict = {"raw": args_str}
chat_logger.log_tool_call(tool_name, args_dict, i)
# Capture thinking if not already captured
if not thinking_captured:
thinking_content = parsed_content or ""
thinking_captured = True
if not tool_calls_parsed:
# No more tools - final answer
logger.info(f"✅ Final answer (after tool execution) after {iteration} iteration(s)")
return _create_response(parsed_content, [], "stop", prompt, request, swarm_manager)
return _create_response(parsed_content, [], "stop", prompt, request, swarm_manager, thinking_content)
# More tools detected - continue loop
logger.info(f"🔧 More tools found - continuing loop")
@@ -362,4 +757,4 @@ async def handle_chat_completion(
# Max iterations reached - force return last response
logger.warning(f"⚠️ Max tool iterations ({max_iterations}) reached")
logger.warning(f"⚠️ Returning last response (may include incomplete tool call)")
return _create_response(response_text, [], "stop", prompt, request, swarm_manager)
return _create_response(response_text, [], "stop", prompt, request, swarm_manager, thinking_content)
+13 -7
@@ -153,7 +153,13 @@ def _filter_messages(messages: List[ChatMessage]) -> List[ChatMessage]:
def _add_tool_instructions(messages: List[ChatMessage]) -> List[ChatMessage]:
"""Add tool instructions to messages if needed.
"""Add tool instructions to the beginning of messages.
Tool instructions are now ALWAYS injected by default so any client
(Continue, hollama, etc.) can use tools without requiring client-side
tool instruction injection.
TODO: Add a "plan mode" that disables tool use for planning-only conversations.
Args:
messages: List of chat messages
@@ -161,13 +167,13 @@ def _add_tool_instructions(messages: List[ChatMessage]) -> List[ChatMessage]:
Returns:
Messages with tool instructions added
"""
has_assistant = any(msg.role == "assistant" for msg in messages)
if has_assistant:
return messages
tool_instructions = _load_tool_instructions()
logger.debug(f"Using {'opencode' if _USE_OPENCODE_TOOLS else 'local'} tool mode: {len(tool_instructions)} chars")
logger.debug(f"Injecting tool instructions: {len(tool_instructions)} chars")
# Check if instructions already present (avoid duplication)
if messages and messages[0].role == "system" and "AVAILABLE TOOLS" in messages[0].content:
logger.debug("Tool instructions already present, skipping injection")
return messages
return [ChatMessage(role="system", content=tool_instructions)] + messages
+3 -3
@@ -29,11 +29,11 @@ class ToolCall(BaseModel):
class ChatMessage(BaseModel):
"""A chat message."""
role: Literal["system", "user", "assistant", "tool"] = Field(..., description="Role of message sender")
role: Literal["system", "user", "assistant", "tool"] = Field(..., description="Role of the message sender")
content: str = Field(default="", description="Message content")
tool_calls: Optional[List[ToolCall]] = Field(default=None, description="Tool calls from assistant")
#tool_call_id: Optional[str] = Field(default=None, description="ID of tool call this message is responding to")
#name: Optional[str] = Field(default=None, description="Name of the tool/function")
tool_call_id: Optional[str] = Field(default=None, description="ID of tool call this message is responding to")
name: Optional[str] = Field(default=None, description="Name of the tool/function")
model_config = ConfigDict(
# Use Pydantic's exclude_none to omit tool_calls when None
+221 -23
@@ -225,41 +225,128 @@ def set_federated_swarm(swarm):
async def _stream_response(response: ChatCompletionResponse):
"""Stream a chat completion response as Server-Sent Events.
"""Stream a chat completion response as Server-Sent Events using OpenCode-compatible format.
For compatibility with OpenAI format, we use delta format for streaming.
The response is sent as a single chunk since we don't support
true token-by-token streaming yet.
This implementation matches the Vercel AI SDK OpenAI-compatible format:
- Uses reasoning_content for thinking/planning (before tool calls)
- Properly streams tool_calls with incremental arguments
- Eventually switches to content for final answer
"""
import json
from api.models import ChatCompletionStreamResponse, ChatCompletionStreamChoice
# Convert to streaming format with delta
message = response.choices[0].message
choice = ChatCompletionStreamChoice(
index=0,
delta={"content": message.content},
finish_reason="stop"
)
content = message.content or ""
tool_calls = message.tool_calls or []
thinking_content = getattr(response, '_thinking', None) # Get thinking if attached
stream_response = ChatCompletionStreamResponse(
id=response.id,
created=response.created,
model=response.model,
choices=[choice]
)
# CASE 1: Response has tool calls - need to stream thinking + tool_calls separately
if tool_calls:
# Step 1: Stream reasoning_content (thinking) if there's any thinking captured
if thinking_content:
# Ideally the reasoning would be streamed token-by-token as multiple chunks
# For now, send it as a single reasoning block
chunk = {
"id": response.id,
"object": "chat.completion.chunk",
"created": response.created,
"model": response.model,
"choices": [{
"index": 0,
"delta": {
"reasoning_content": thinking_content
},
"finish_reason": None
}]
}
yield f"data: {json.dumps(chunk)}\n\n"
# Send as SSE event
data = stream_response.model_dump_json(exclude_none=True)
logger.debug(f"Streaming SSE data (delta format): {len(data)} chars")
# Step 2: Emit tool_calls in the format OpenCode expects
for i, tc in enumerate(tool_calls):
# First chunk: tool_calls with empty arguments (just structure)
tc_id = tc.id
tc_name = tc.function.get("name", "")
chunk1 = {
"id": response.id,
"object": "chat.completion.chunk",
"created": response.created,
"model": response.model,
"choices": [{
"index": 0,
"delta": {
"tool_calls": [{
"index": i,
"id": tc_id,
"type": "function",
"function": {
"name": tc_name,
"arguments": ""
}
}]
},
"finish_reason": None
}]
}
yield f"data: {json.dumps(chunk1)}\n\n"
yield f"data: {data}\n\n"
# Second chunk: arguments content (if any)
args = tc.function.get("arguments", "")
if args:
chunk2 = {
"id": response.id,
"object": "chat.completion.chunk",
"created": response.created,
"model": response.model,
"choices": [{
"index": 0,
"delta": {
"tool_calls": [{
"index": i,
"function": {
"arguments": args
}
}]
},
"finish_reason": None
}]
}
yield f"data: {json.dumps(chunk2)}\n\n"
# Send done event
# Step 3: Final chunk with finish_reason="tool_calls"
final_chunk = {
"id": response.id,
"object": "chat.completion.chunk",
"created": response.created,
"model": response.model,
"choices": [{
"index": 0,
"delta": {},
"finish_reason": "tool_calls"
}]
}
yield f"data: {json.dumps(final_chunk)}\n\n"
yield "data: [DONE]\n\n"
return
# CASE 2: Pure text response (no tools) - stream as content
# This is the final answer after tool execution or a simple response
chunk = {
"id": response.id,
"object": "chat.completion.chunk",
"created": response.created,
"model": response.model,
"choices": [{
"index": 0,
"delta": {
"content": content
},
"finish_reason": "stop"
}]
}
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
logger.debug(f"Streaming complete")
@router.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest, fastapi_request: Request):
@@ -325,3 +412,114 @@ async def chat_completions(request: ChatCompletionRequest, fastapi_request: Requ
logger.error(f"Error type: {type(e).__name__}")
logger.error(f"Error message: {str(e)}")
raise HTTPException(status_code=500, detail=f"Generation failed: {str(e)}")
# Federation endpoint for peer-to-peer generation
@router.post("/v1/federation/vote")
async def federation_vote(request: Request):
"""Handle federation vote request from a peer swarm.
This endpoint allows peer swarms to request generation from this swarm
as part of the federation consensus process.
IMPORTANT: Peer nodes should NOT execute tools. They only provide text
responses. The head node handles all tool execution after consensus.
"""
try:
data = await request.json()
prompt = data.get("prompt", "")
max_tokens = data.get("max_tokens", 1024)
temperature = data.get("temperature", 0.7)
logger.info(f"🗳️ Federation vote request from {request.client.host}")
logger.debug(f" Prompt: {prompt[:100]}...")
# Get swarm manager from app state
swarm_manager = getattr(request.app.state, 'swarm_manager', None)
if not swarm_manager:
raise HTTPException(status_code=503, detail="Swarm not ready")
# Strip tool instructions from prompt for peer generation
# Peers should only generate text - head node handles tools
# Look for system message with tool instructions and remove it
clean_prompt = _strip_tool_instructions(prompt)
# Generate response (text only, no tools)
start_time = time.time()
result = await swarm_manager.generate(
prompt=clean_prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
elapsed_ms = (time.time() - start_time) * 1000
response = result.selected_response
logger.info(f"✅ Federation vote complete ({response.tokens_generated} tokens, {elapsed_ms:.0f}ms)")
# Use actual confidence from consensus result instead of hardcoded value
# This ensures fair comparison between local and peer swarms
actual_confidence = result.confidence if hasattr(result, 'confidence') else 0.8
return {
"response": response.text,
"confidence": actual_confidence,
"latency_ms": elapsed_ms,
"worker_count": len(swarm_manager.workers) if hasattr(swarm_manager, 'workers') else 1,
"tokens_per_second": response.tokens_per_second,
"tokens_generated": response.tokens_generated
}
except Exception as e:
logger.exception("Error handling federation vote")
raise HTTPException(status_code=500, detail=f"Federation vote failed: {str(e)}")
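# Example exchange (illustrative values only):
#   POST /v1/federation/vote  {"prompt": "...", "max_tokens": 1024, "temperature": 0.7}
#   -> {"response": "...", "confidence": 0.92, "latency_ms": 1840,
#       "worker_count": 2, "tokens_per_second": 24.5, "tokens_generated": 45}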
def _strip_tool_instructions(prompt: str) -> str:
"""Strip tool instructions from prompt for peer generation.
Peers should not generate tool calls - only the head node handles tools.
This removes the system message containing tool instructions.
Args:
prompt: Original prompt with potential tool instructions
Returns:
Clean prompt without tool instructions
"""
# Look for common tool instruction patterns
# Pattern 1: System message with "AVAILABLE TOOLS"
if "AVAILABLE TOOLS" in prompt or "You have access to tools" in prompt:
# Split by message boundaries and filter out system tool messages
lines = prompt.split('\n')
filtered_lines = []
skip_until_next_role = False
for line in lines:
# Check if this is a system message start with tool instructions
if ('<|im_start|>system' in line or line.strip() == 'system:') and not skip_until_next_role:
# Check if next few lines contain tool instructions
# We'll collect lines and check
filtered_lines.append(line)
skip_until_next_role = True
continue
if skip_until_next_role:
# Check for end of system message
if '<|im_end|>' in line or (line.strip().startswith('<|im_start|>') and 'system' not in line):
skip_until_next_role = False
filtered_lines.append(line)
# Check if this line contains tool instruction markers
elif any(marker in line for marker in ['AVAILABLE TOOLS', 'TOOL:', 'ARGUMENTS:', 'You have access to tools']):
# Skip this line - it's part of tool instructions
continue
else:
filtered_lines.append(line)
else:
filtered_lines.append(line)
return '\n'.join(filtered_lines)
return prompt
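# Example (illustrative): a ChatML prompt whose system block contains lines such as
# "AVAILABLE TOOLS", "TOOL:" or "ARGUMENTS:" has those lines dropped before peer
# generation; a prompt without any tool markers is returned unchanged.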
+2 -1
@@ -44,8 +44,9 @@ class APIServer:
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Lifespan context manager for startup/shutdown."""
# Startup: Set swarm manager in routes
# Startup: Set swarm manager in routes and app state
set_swarm_manager(self.swarm_manager)
app.state.swarm_manager = self.swarm_manager # For federation endpoint
# Set tool mode in routes
from api.routes import set_use_opencode_tools
set_use_opencode_tools(self.use_opencode_tools)
+97
@@ -0,0 +1,97 @@
"""Chatlog for debugging tool execution.
Writes a human-readable markdown log of tool calls and results.
Enabled by setting LOCAL_SWARM_CHATLOG=1 environment variable.
Log file defaults to 'chatlog.md' in the current working directory.
"""
import os
import json
from datetime import datetime
from typing import Optional
class ChatLogger:
"""Logs chat interactions and tool execution in opencode-style format."""
def __init__(self, log_path: Optional[str] = None):
self.log_path = log_path or os.getenv('LOCAL_SWARM_CHATLOG_PATH', 'chatlog.md')
self.enabled = os.getenv('LOCAL_SWARM_CHATLOG', '0') == '1'
if self.enabled:
self._initialize_log()
def _initialize_log(self):
"""Create log file with header if it doesn't exist."""
dir_path = os.path.dirname(self.log_path) or '.'
os.makedirs(dir_path, exist_ok=True)
with open(self.log_path, 'a') as f:
f.write(f"\n\n# Local Swarm Session - {datetime.now().isoformat()}\n\n")
def _timestamp(self) -> str:
"""Get current timestamp."""
return datetime.now().strftime("%H:%M:%S")
def log_user_message(self, content: str):
"""Log a user message."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] User\n\n")
f.write(f"{content}\n\n")
def log_assistant_message(self, content: str, has_tool_calls: bool = False):
"""Log an assistant response."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] Assistant\n\n")
if has_tool_calls:
# Use thinking block for messages that contain tool calls
f.write(f"```thinking\n{content}\n```\n")
else:
f.write(f"{content}\n\n")
def log_tool_call(self, tool_name: str, arguments: dict, call_index: int = 1):
"""Log a tool execution request."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] Tool Call #{call_index}\n\n")
f.write(f"**Tool:** `{tool_name}`\n\n")
f.write(f"**Arguments:**\n")
try:
args_json = json.dumps(arguments, indent=2)
except Exception:
args_json = str(arguments)
f.write(f"```json\n{args_json}\n```\n")
def log_tool_result(self, tool_name: str, result: str, call_index: int = 1, success: bool = True):
"""Log a tool execution result."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] Tool Result #{call_index}\n\n")
status = "✓ Success" if success else "✗ Failed"
f.write(f"**Tool:** `{tool_name}` - {status}\n\n")
f.write(f"**Output:**\n")
f.write(f"```\n{result}\n```\n")
def log_system(self, message: str):
"""Log a system message."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] System\n\n")
f.write(f"> {message}\n\n")
# Global logger instance (lazy initialization handled per request)
_global_logger: Optional[ChatLogger] = None
def get_chat_logger() -> ChatLogger:
"""Get the global chat logger instance (creates one if needed)."""
global _global_logger
if _global_logger is None:
_global_logger = ChatLogger()
return _global_logger
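# Example usage (illustrative): with LOCAL_SWARM_CHATLOG=1 set,
#   get_chat_logger().log_tool_call("read", {"filePath": "notes.txt"})
# appends a "## [HH:MM:SS] Tool Call #1" section with the arguments rendered as JSON
# to chatlog.md (or to LOCAL_SWARM_CHATLOG_PATH if that is set).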
+27 -26
@@ -351,34 +351,35 @@ class FederatedSwarm:
for vote in peer_votes:
all_votes.append((vote.response_text, vote.confidence, vote.peer_name))
if self.consensus_strategy == "best_of_n":
# Use the consensus engine to pick the best response
from swarm.consensus import ConsensusEngine
# Always use quality-based selection - the head node judges ALL responses
# This prevents overconfident self-reported scores from beating genuinely better responses
from swarm.consensus import ConsensusEngine, GenerationResponse
responses = [
GenerationResponse(
text=text,
tokens_generated=0,
tokens_per_second=0,
latency_ms=0,
backend_name=source
)
for text, _, source in all_votes
]
# Use synchronous quality scoring (no embeddings needed)
engine = ConsensusEngine(strategy="quality")
# _quality_vote is async but only uses sync scoring, so we
# use the simpler _fastest_vote-style approach here
scores = [engine._quality_score(r) for r in responses]
best_idx = max(range(len(scores)), key=lambda i: scores[i])
best = all_votes[best_idx]
print(f" ✓ Selected response from {best[2]} (quality score: {scores[best_idx]:.2f})")
return best[0], best[2]
# Default: weighted selection - pick highest confidence
best = max(all_votes, key=lambda x: x[1])
print(f" ✓ Selected response from {best[2]} (confidence: {best[1]:.2f})")
# Use quality scoring to objectively compare all responses
engine = ConsensusEngine(strategy="quality")
scores = [engine._quality_score(r) for r in responses]
# Find best response based on actual quality, not self-reported confidence
best_idx = max(range(len(scores)), key=lambda i: scores[i])
best = all_votes[best_idx]
# Show comparison
print(f" 📊 Quality scores:")
for i, (text, conf, source) in enumerate(all_votes):
print(f" {source}: {scores[i]:.2f} (self-reported: {conf:.2f})")
print(f" ✓ Selected response from {best[2]} (quality score: {scores[best_idx]:.2f})")
return best[0], best[2]
async def get_federation_status(self) -> Dict[str, Any]:
+37 -16
@@ -121,6 +121,13 @@ class ToolExecutor:
if not file_path:
return "Error: filePath required"
# Check if original path was absolute or used ~ before expansion
original_was_absolute = os.path.isabs(file_path) or file_path.startswith("~")
# Expand ~ to home directory
file_path = os.path.expanduser(file_path)
working_dir = os.path.expanduser(working_dir)
# Security: Prevent directory traversal
file_path = os.path.normpath(file_path)
if file_path.startswith("..") or file_path.startswith("/.."):
@@ -132,14 +139,16 @@ class ToolExecutor:
else:
full_path = file_path
# Additional security: ensure resolved path is within working_dir
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
# Additional security: only enforce working_dir restriction for relative paths
# If user explicitly specified an absolute path or ~ path, allow it
if not original_was_absolute:
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
logger.debug(f" 📁 Reading: {file_path}")
logger.debug(f" 📍 Working dir: {working_dir}")
@@ -163,6 +172,13 @@ class ToolExecutor:
if not file_path:
return "Error: filePath required"
# Check if original path was absolute or used ~ before expansion
original_was_absolute = os.path.isabs(file_path) or file_path.startswith("~")
# Expand ~ to home directory
file_path = os.path.expanduser(file_path)
working_dir = os.path.expanduser(working_dir)
# Security: Prevent directory traversal
file_path = os.path.normpath(file_path)
if file_path.startswith("..") or file_path.startswith("/.."):
@@ -174,14 +190,16 @@ class ToolExecutor:
else:
full_path = file_path
# Additional security: ensure resolved path is within working_dir
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
# Additional security: only enforce working_dir restriction for relative paths
# If user explicitly specified an absolute path or ~ path, allow it
if not original_was_absolute:
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
logger.debug(f" 📁 Writing: {file_path}")
logger.debug(f" 📍 Working dir: {working_dir}")
@@ -208,6 +226,9 @@ class ToolExecutor:
if not command:
return "Error: command required"
# Expand ~ to home directory in cwd
cwd = os.path.expanduser(cwd)
# Security: Block dangerous commands
dangerous = ["rm -rf /", "> /dev", "mkfs", "dd if=/dev/zero", ":(){ :|:& };:"]
for d in dangerous: