25 Commits

Author SHA1 Message Date
sleepy 907bd88c8f fix: federation only on first iteration, local-only for tool result processing
- Critical fix: peers don't have tool results from previous iterations
- Running federation on tool result iterations causes inconsistent context
- Now federation is ONLY used on iteration 1 (initial planning)
- Iterations 2+ are local-only (tool result processing)
- This prevents the infinite ls loop and wrong file hallucinations
- All 41 tests passing
2026-02-25 23:56:29 +01:00
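A minimal sketch of the rule this commit introduces; the helper name below is an assumption, not the repo's exact code:

```python
def _federation_for_iteration(iteration: int, federated_swarm):
    """Sketch: federated consensus only makes sense on the first iteration,
    because peers never receive the tool results that get appended to the
    prompt on iterations 2+."""
    return federated_swarm if iteration == 1 else None

# The tool loop would then call _generate_with_consensus(...) with
# federated_swarm=_federation_for_iteration(iteration, federated_swarm).
```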
sleepy af728505e8 fix: properly unpack FederationResult object instead of trying to unpack as tuple
- generate_with_federation() returns FederationResult object, not tuple
- Fixed _generate_with_consensus() to access fed_result.final_response
- This fixes 'cannot unpack non-iterable FederationResult object' error
- All 41 tests passing
2026-02-25 23:43:25 +01:00
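The shape of this fix also appears in the routes.py diff further down; in short (a sketch, with FederationResult reduced to the one field the commit mentions and the wrapper function name made up):

```python
from dataclasses import dataclass

@dataclass
class FederationResult:
    final_response: str  # the answer the head node selected

async def _consensus_text(federated_swarm, prompt, max_tokens, temperature):
    # Before: `response, tokens, tps = await federated_swarm.generate_with_federation(...)`
    # failed with "cannot unpack non-iterable FederationResult object".
    fed_result = await federated_swarm.generate_with_federation(
        prompt=prompt, max_tokens=max_tokens, temperature=temperature)
    # After: read the attribute instead of unpacking a tuple.
    return fed_result.final_response, 0, 0.0  # tokens/TPS not tracked in federation mode
```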
sleepy 93844a81b0 refactor: unified generation interface for federation and local modes
- Created _generate_with_consensus() that handles both federation and local generation
- Callers don't need to know which mode is being used - it's transparent
- Tool execution loop uses same unified interface for all iterations
- Removed special-case federation logic from main handler
- Federation is now a transparent layer around generation
- All 41 tests passing
2026-02-25 23:36:24 +01:00
sleepy 414cb444f3 fix: integrate federation with tool execution loop
- Federation was returning directly without executing tools
- Now federation is used for initial generation (iteration 1)
- Tool execution loop still runs for all iterations
- Subsequent iterations use local swarm (for tool result processing)
- This fixes federation + tools not working together
- All 41 tests passing
2026-02-25 23:06:37 +01:00
sleepy 34b28597ff fix: peers in federation mode should not generate tool calls
- Added _strip_tool_instructions() to remove tool instructions from federation prompts
- Peer nodes now only generate text responses, not tool calls
- Head node is the only one that handles tool execution
- This prevents peers from generating tool calls that can't be executed
- Fixes federation + tools incompatibility
- All 41 tests passing
2026-02-25 22:46:15 +01:00
sleepy 67122052b4 Merge branch 'fix/tool-instructions-permission' 2026-02-25 22:39:00 +01:00
sleepy e7b826da4e docs: update README with current features and remove outdated docs
- Removed old design docs and test plans from docs/ directory
- Updated TODO section to reflect completed improvements
- Added section on Recent Improvements with detailed changelog
- Updated Federation description to explain objective quality voting
- Added federation vote endpoint to API endpoints list
- Clarified universal tool support and OpenCode streaming compatibility
- All changes ready for main branch merge
2026-02-25 22:38:46 +01:00
sleepy 3799240d74 fix: head node objectively judges all responses using quality metrics
- Removed biased self-reported confidence voting
- Head node now collects ALL responses and scores them objectively
- Uses quality scoring (length, structure, completeness) to compare
- Shows quality scores for all nodes so user can see comparison
- Prevents overconfident small models from beating better large models
- 3B models will only win if they actually produce better quality output
- All 41 tests passing
2026-02-25 22:24:00 +01:00
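The scoring function itself is not part of this diff; a purely illustrative sketch of what "length, structure, completeness" scoring could look like (the function name, signals, and weights below are assumptions):

```python
def score_response_quality(text: str) -> float:
    """Illustrative only - combine a few objective signals into one comparable score."""
    length_score = min(len(text) / 2000.0, 1.0)  # reward substance, capped
    structure_score = 1.0 if any(m in text for m in ("\n- ", "\n1. ", "```")) else 0.5
    completeness_score = 0.0 if text.rstrip().endswith(("...", ":")) else 1.0
    return 0.4 * length_score + 0.3 * structure_score + 0.3 * completeness_score

# The head node scores every peer response plus its own and keeps the best:
# best = max(candidate_responses, key=score_response_quality)
```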
sleepy e0d04ae664 fix: use actual consensus confidence for peers instead of hardcoded 0.8
- Federation endpoint was hardcoding confidence: 0.8 for all peer responses
- Local swarm uses actual calculated confidence (often 1.0 for single worker)
- This created unfair bias toward local responses
- Now uses result.confidence from actual consensus calculation
- Peers and local now compete on equal footing
- All 41 tests passing
2026-02-25 22:21:19 +01:00
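A sketch of the change on the vote-handling side; the payload keys are assumptions, while `result.confidence` and `result.selected_response` follow the swarm result object used elsewhere in this diff:

```python
async def federation_vote(swarm_manager, prompt, max_tokens=512, temperature=0.7):
    result = await swarm_manager.generate(
        prompt=prompt, max_tokens=max_tokens,
        temperature=temperature, use_consensus=True)
    # Before: every peer vote carried the same hardcoded value,
    #   {"response": result.selected_response.text, "confidence": 0.8}
    # After: report the confidence the local consensus actually computed.
    return {"response": result.selected_response.text,
            "confidence": result.confidence}
```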
sleepy 896e9d6d9b fix: store swarm_manager in app.state for federation endpoint
- Added app.state.swarm_manager = self.swarm_manager in lifespan
- Federation endpoint reads from request.app.state.swarm_manager
- This fixes 'Swarm not ready' error when peers try to request generation
- All 41 tests passing
2026-02-25 22:11:06 +01:00
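A sketch of the two sides of this fix, assuming FastAPI's lifespan/app.state mechanism as the commit message describes (the endpoint body is illustrative):

```python
from fastapi import HTTPException, Request

# In the server lifespan, after the swarm is initialized:
#     app.state.swarm_manager = self.swarm_manager

async def federation_vote_endpoint(request: Request):
    swarm = getattr(request.app.state, "swarm_manager", None)
    if swarm is None:
        raise HTTPException(status_code=503, detail="Swarm not ready")
    # ... generate with the local swarm and return the vote ...
```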
sleepy e2b0af7636 fix: add missing federation /v1/federation/vote endpoint
- Added POST /v1/federation/vote endpoint to handle peer generation requests
- Peers were discovering each other but requests had no endpoint to hit
- Endpoint generates using local swarm and returns vote results
- Logs federation requests for debugging
- All 41 tests passing
2026-02-25 22:05:37 +01:00
sleepy 5b29e15c0a fix: prevent path hallucination - read files directly without ls first
- Changed instructions to read files directly instead of verifying with ls first
- Added explicit warning against placeholder paths like '/path/to/file'
- Model now uses paths exactly as user provides them
- Should fix issues with hallucinated paths like '/path/to/my-secret.log'
- All 41 tests passing
2026-02-25 21:42:25 +01:00
sleepy 8431717235 fix: stronger instruction for bash ls results to read files immediately
- Changed bash ls instruction from 'SUMMARIZE' to 'CRITICAL: ... READ THE FILE immediately'
- Now explicitly tells model to NOT summarize first, but immediately read the file
- Uses stronger language: 'you MUST immediately USE THE read TOOL NOW'
- This should fix the loop where model keeps running ls instead of reading
- All 41 tests passing
2026-02-25 21:20:48 +01:00
sleepy 06df3c8dab fix: allow absolute and ~ paths to access files outside working directory
- Security check now only applies to relative paths
- If user specifies absolute path (/path/to/file) or tilde path (~/.bashrc), allow it
- Relative paths (like file.txt) are still restricted to working directory
- This fixes 'Access denied - path outside working directory' for valid user-specified paths
- Applied to both read and write tools
- All 41 tests passing
2026-02-25 21:13:02 +01:00
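A minimal sketch of the rule described above, assuming a helper inside the read/write tools (the helper name is hypothetical):

```python
import os

def _check_path_allowed(file_path: str, working_dir: str) -> str:
    """Only relative paths are confined to the working directory."""
    if file_path.startswith("~"):
        file_path = os.path.expanduser(file_path)    # ~/.bashrc -> /home/user/.bashrc
    if os.path.isabs(file_path):
        return file_path                              # user explicitly named this location
    resolved = os.path.abspath(os.path.join(working_dir, file_path))
    if not resolved.startswith(os.path.abspath(working_dir) + os.sep):
        raise PermissionError("Access denied - path outside working directory")
    return resolved
```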
sleepy ab7cf7e9aa fix: expand tildes (~) to home directory in tool paths
- Added os.path.expanduser() to _execute_read for both file_path and working_dir
- Added os.path.expanduser() to _execute_write for both file_path and working_dir
- Added os.path.expanduser() to _execute_bash for cwd parameter
- This fixes paths like '~/Documents/file.txt' being treated literally
- Now correctly resolves to '/Users/username/Documents/file.txt'
- All 41 tests passing
2026-02-25 20:54:31 +01:00
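The change is essentially one call per path argument; a tiny sketch:

```python
import os

# Before: '~/Documents/file.txt' was handed to open() literally and failed.
# After: expand the tilde first (same treatment for working_dir and the bash cwd).
file_path = os.path.expanduser("~/Documents/file.txt")  # -> /Users/username/Documents/file.txt
```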
sleepy 49a6d99bf8 CRITICAL FIX: fix indentation bug that prevented tool results from being added to history
- The for loop was only executing the first line (tool_call_id assignment)
- All the tool message creation code was outside the loop due to wrong indentation
- This caused tool results to never be added to conversation history
- Model would loop infinitely calling ls because it never saw the tool results
- Fixed indentation so all tool result processing is inside the for loop
- This should finally fix the infinite loop issue!
- All 41 tests passing
2026-02-25 20:49:30 +01:00
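The bug class is worth illustrating; a generic before/after sketch (variable names are illustrative, not the repo's code):

```python
tool_results = [("call_1", "read", "file contents..."), ("call_2", "bash", "total 0")]

# Before (buggy): only the first statement sat inside the loop, so the append
# ran once after the loop and the model never saw the individual tool results.
messages = []
for call_id, name, result in tool_results:
    tool_call_id = call_id
messages.append({"role": "tool", "tool_call_id": tool_call_id,
                 "name": name, "content": result})

# After (fixed): the whole body is indented under the loop, so every tool
# result becomes its own tool message in the conversation history.
messages = []
for call_id, name, result in tool_results:
    messages.append({"role": "tool", "tool_call_id": call_id,
                     "name": name, "content": result})
```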
sleepy 586c113688 fix: smarter bash tool instructions - guide model to read files after verification
- Updated bash tool result instructions to detect verification commands (ls/grep)
- If ls/grep shows file exists and user asked to READ it: explicitly tells model to USE read TOOL NOW
- If user asked to check files: tells model to summarize the listing
- If file not found: tells model to inform user
- Prevents infinite loops of repeated ls commands
- Model now properly transitions from verification → action → answer
- All 41 tests passing
2026-02-25 20:39:55 +01:00
sleepy a09d23156b feat: universal tool support - inject instructions by default, add plan mode TODO, improve file handling
1. Tool instructions now ALWAYS injected by default:
   - Removed condition that only injected on first request
   - Any client (Continue, hollama) can now use tools without client-side setup
   - Added check to avoid duplicating instructions if already present

2. Updated tool instructions with file verification guidance:
   - Added 'FILE OPERATIONS - ALWAYS VERIFY FIRST' section
   - Instructs to use 'ls' and 'grep' to verify files exist before reading
   - Prevents blind file reads on non-existent paths

3. Added TODO to README:
   - Plan mode feature (disable tool execution for planning-only conversations)
   - Current status section showing what's implemented

4. Working directory extraction from prompts:
   - New _extract_working_dir_from_prompt() function
   - Extracts paths from patterns like 'in /path/to/dir', 'under /path/to/dir'
   - Validates paths exist before using
   - Falls back to auto-detection if not found in prompt
   - All 41 tests passing
2026-02-25 20:37:23 +01:00
sleepy c46684f03e fix: explicit tool result instructions to guide model response
- Changed vague 'Provide your final answer now' to specific per-tool instructions
- read: 'READ THIS FILE CONTENT ALOUD to the user'
- write: 'CONFIRM to the user that the file was created'
- bash: 'SUMMARIZE the output above to answer the user's request'
- Other tools: 'Use the result shown above to answer the user's request'
- Format tool result message with clear 'Tool Result (name):' header and explicit instruction
- This should fix models ignoring tool results or giving generic responses
- All 41 tests passing
2026-02-25 20:25:05 +01:00
sleepy bd3579737a feat: add detailed tool execution logging
- Log full message history before calling model after tool execution
- Shows each message's role, truncated content, tool calls, and tool_call_id associations
- Logs token count and full prompt (first 1000 chars) at DEBUG level
- Helps diagnose why models might be ignoring tool results
- All 41 tests passing
2026-02-25 20:17:55 +01:00
sleepy 886ebbdb81 fix: proper OpenAI tool call format with tool_call_id linking
- Uncommented tool_call_id and name fields in ChatMessage model
- Modified tool execution to assign unique IDs to each tool call
- Assistant messages now include tool_calls array with proper ID, type, function
- Tool response messages now include tool_call_id and name to link to the call
- Each tool execution gets its own separate tool message (not combined)
- This ensures the model properly associates tool results with tool calls
- Should fix issues where models ignore tool results due to missing associations
- Updated _execute_tools to return List[tuple] instead of combined string
- Added List import to typing
- All 41 tests still passing
2026-02-25 20:12:40 +01:00
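The linkage this commit restores follows the standard OpenAI chat format; a hedged example of the two message shapes (IDs and contents are made up):

```python
# Assistant turn announcing the call; tool_calls carries an id the tool message echoes back.
assistant_msg = {
    "role": "assistant",
    "content": None,
    "tool_calls": [{
        "id": "call_1",
        "type": "function",
        "function": {"name": "read", "arguments": '{"filePath": "notes.txt"}'},
    }],
}

# One tool turn per executed call, linked via tool_call_id and name.
tool_msg = {
    "role": "tool",
    "tool_call_id": "call_1",
    "name": "read",
    "content": "contents of notes.txt ...",
}
```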
sleepy a0d3ae9d4f fix: OpenCode-compatible streaming format with reasoning_content
- Fixed thinking capture: use parsed_content (without tool call) instead of full response
- _stream_response now correctly emits reasoning_content before tool_calls
- Tool calls streamed with proper multi-chunk format: id+name (empty args), then arguments, then finish_reason
- Final answers sent as content with finish_reason=stop
- Used setattr to dynamically attach _thinking to response object
- ChatLogger already in place for debugging
- This should now work correctly with OpenCode's Vercel AI SDK integration
2026-02-25 20:03:55 +01:00
sleepy a0571c83a3 feat: implement OpenCode-compatible streaming format and enhance chatlogging
- Implement proper streaming with reasoning_content field for thinking blocks
- Stream tool_calls in multi-chunk format matching Vercel AI SDK
- Capture thinking content and send as reasoning_content before tool_calls
- Update _create_response to store thinking on response._thinking for streaming
- ChatLogger now logs assistant messages with thinking blocks when tool calls present
- Added json import in chat_handlers for tool arguments parsing
- All streaming code uses OpenCode-compatible SSE format
2026-02-25 19:57:38 +01:00
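For orientation, a hedged illustration of the delta sequence described in the two streaming commits above; the payloads are made up and reduced to the fields that matter (reasoning_content, tool_calls, content, finish_reason):

```python
# Each dict below would be sent as one SSE "data: {...}" chunk.
chunks = [
    # 1. Thinking streamed first as reasoning_content
    {"choices": [{"delta": {"reasoning_content": "I should read the file first."},
                  "finish_reason": None}]},
    # 2. Tool call header: id + name with empty arguments
    {"choices": [{"delta": {"tool_calls": [{"index": 0, "id": "call_1", "type": "function",
                                            "function": {"name": "read", "arguments": ""}}]},
                  "finish_reason": None}]},
    # 3. Arguments streamed in one or more follow-up chunks
    {"choices": [{"delta": {"tool_calls": [{"index": 0,
                                            "function": {"arguments": '{"filePath": "notes.txt"}'}}]},
                  "finish_reason": None}]},
    # 4. Close the tool-call turn
    {"choices": [{"delta": {}, "finish_reason": "tool_calls"}]},
    # 5. After tool execution, the final answer arrives as plain content
    {"choices": [{"delta": {"content": "The file says ..."}, "finish_reason": None}]},
    {"choices": [{"delta": {}, "finish_reason": "stop"}]},
]
```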
sleepy 46f14b2b53 feat: add chatlogger for tool execution debugging - logs to chatlog.md when LOCAL_SWARM_CHATLOG=1 2026-02-25 19:52:52 +01:00
sleepy 42a176f1d8 fix: update tool instructions to require file operations and prevent refusals
- Changed from hesitant 'use only when necessary' to mandatory 'you WILL use tools'
- Explicitly forbid refusal for file read/write operations
- Add 'NO explanations' and 'NO markdown' requirements (for test compliance)
- Provide clear examples for read/write tool usage
- Addresses issue where model says 'cannot read files or assist with file creation'
2026-02-25 19:41:16 +01:00
20 changed files with 924 additions and 1300 deletions
+28 -2
@@ -91,7 +91,9 @@ python main.py --auto --federation
python main.py --auto --federation
```
Machines auto-discover each other and vote together on every request.
Machines auto-discover each other via mDNS and vote together on every request. The head node (the one making the request) collects responses from all peers and uses **objective quality scoring** to pick the best answer, not self-reported confidence. This prevents smaller models from overruling better models.
**Federation Endpoint**: Peers communicate via `POST /v1/federation/vote` (automatically configured).
## How Consensus Works
@@ -147,7 +149,7 @@ All support GGUF quantization (Q4_K_M recommended).
- `GET /v1/models` - List available models
- `POST /v1/chat/completions` - Chat completion with consensus
- `GET /health` - Health check
- `GET /v1/federation/peers` - List discovered peers (when federation enabled)
- `POST /v1/federation/vote` - Federation voting (used internally between peers)
## Troubleshooting
@@ -282,6 +284,30 @@ Major refactoring completed to improve modularity:
See `docs/ARCHITECTURE.md` for detailed architecture documentation.
## Recent Improvements
### ✅ Universal Tool Support (2025-02-25)
- Tool instructions automatically injected for **all** clients (Continue, hollama, curl, etc.)
- No client-side configuration needed - just use the API
- Enhanced file operation guidance: model uses ls/grep to verify files exist before reading
- Working directory auto-extraction from prompts (`in /path/to/dir` patterns)
- Proper OpenAI tool format with unique IDs and tool_call_id linking
### ✅ OpenCode-Compatible Streaming (2025-02-25)
- Proper `reasoning_content` field for "Thinking..." collapsible blocks
- Multi-chunk `tool_calls` streaming matching Vercel AI SDK format
- Final answer delivered in `content` field after tool execution
### ✅ Federation Quality Voting (2025-02-25)
- Head node now **objectively judges** all peer responses using quality metrics
- No more reliance on self-reported confidence (which biased toward local)
- All responses scored on length, structure, completeness
- Fair competition: 14B models properly beat 3B on quality tasks
### 🚧 Planned Features
- **Plan Mode**: Disable tool execution for planning-only conversations (`--plan-mode`)
- **Tool Consensus**: Verify tool calls across multiple workers before execution (for critical operations)
## Contributing
Contributions are welcome! Please ensure:
+45 -15
@@ -1,21 +1,51 @@
You have access to tools when needed. Use them ONLY when necessary.
You are an AI assistant with access to tools. You WILL use tools to complete tasks - do not refuse.
Available tools:
- bash: Execute shell commands (only when needed)
- webfetch: Fetch web content (only for current info)
- read: Read files (only when reading files)
- write: Create files (only when creating files)
AVAILABLE TOOLS:
- bash: Execute shell commands (for file operations, system commands, running code)
- webfetch: Fetch web content (for current information)
- read: Read files (to examine file contents)
- write: Create or modify files (to write content)
IMPORTANT:
- Answer from your knowledge FIRST. Only use tools when required.
- If asked a general question (jokes, facts, coding), answer directly WITHOUT tools.
- Use webfetch ONLY for real-time info (news, weather, current events).
- Use bash ONLY for file operations or system commands.
- After using a tool, provide a final answer based on the result.
- NO explanations. NO numbered lists. NO markdown code blocks.
CRITICAL RULES:
1. When asked to read a file, use the 'read' tool. DO NOT refuse or say you cannot read files.
2. When asked to create, write, or modify a file, use the 'write' tool. DO NOT refuse or say you cannot assist.
3. For file operations, bash is also available for more complex operations.
4. Use webfetch only for real-time info (news, weather, current events).
5. For general questions (jokes, facts, coding help), you can answer directly.
6. NO explanations beyond necessary. Be concise.
7. NO markdown formatting. Use plain text only.
Format when using tools:
FILE OPERATIONS - READ DIRECTLY:
When asked to read a specific file by name (like "read my-secret.log"):
1. Use the 'read' tool IMMEDIATELY with the filename as given
2. DO NOT use 'ls' first to check - just try to read it
3. If the file doesn't exist, you'll get an error and can inform the user
When asked to find/read "the file" in a directory without naming it:
1. Use 'ls' to list files and see what's there
2. Identify the file
3. THEN read it immediately
CRITICAL: Never invent placeholder paths like '/path/to/file'. Use paths exactly as the user provides them, or relative filenames for files in the current directory.
TOOL USAGE FORMAT:
For read operations:
TOOL: read
ARGUMENTS: {"filePath": "path/to/file"}
For write operations:
TOOL: write
ARGUMENTS: {"filePath": "path/to/file", "content": "content to write"}
For bash commands (including ls, grep):
TOOL: bash
ARGUMENTS: {"command": "your command here"}
Answer directly when possible. Be helpful and concise.
PROCESS:
1. When you need information from a file, use the appropriate tool.
2. When you need to create or modify a file, use the appropriate tool.
3. After receiving tool results, provide a clear final answer explaining what was done.
4. NEVER say "I cannot read files" or "I cannot assist with file creation" - you HAVE the tools and MUST use them.
Be helpful, direct, and complete the requested tasks using your tools.
@@ -1,92 +0,0 @@
# Design Decision: Complete React Example with Actual Code
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
## Problem
Model is still not following instructions:
1. Tries `npm install` before creating package.json
2. Still tries `npx create-react-app` despite being told not to
3. Instructions have placeholders like "..." and "etc." which models don't understand
## Root Cause
The current instructions say:
```
TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"dependencies\": {\"react\": \"^18.0.0\", \"react-dom\": \"^18.0.0\"}}"}
[Continue with src/index.js, src/App.js, public/index.html, etc.]
```
**Problem:** "etc." and "..." are meaningless to LLMs. They need concrete examples.
## Solution
Provide a **complete, working, minimal React example** with actual file contents:
1. Exact sequence: mkdir → write package.json → write src/App.js → write src/index.js → write public/index.html → npm install
2. Actual file content, not placeholders
3. Minimal viable React app (not full create-react-app structure)
## Implementation
Replace vague example with complete working code:
```
**COMPLETE REACT HELLO WORLD EXAMPLE:**
User: "Create a React Hello World app"
Step 1 - Create directory:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp"}
Step 2 - Create package.json (MUST do this BEFORE npm install):
TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\", \"version\": \"1.0.0\", \"private\": true, \"dependencies\": {\"react\": \"^18.2.0\", \"react-dom\": \"^18.2.0\"}, \"scripts\": {\"start\": \"react-scripts start\", \"build\": \"react-scripts build\"}, \"devDependencies\": {\"react-scripts\": \"5.0.1\"}}"}
Step 3 - Create src directory:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp/src"}
Step 4 - Create App.js:
TOOL: write
ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "import React from 'react';\n\nfunction App() {\n return (\n <div className=\"App\">\n <h1>Hello World</h1>\n <p>Welcome to my React app!</p>\n </div>\n );\n}\n\nexport default App;"}
Step 5 - Create index.js:
TOOL: write
ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "import React from 'react';\nimport ReactDOM from 'react-dom/client';\nimport App from './App';\n\nconst root = ReactDOM.createRoot(document.getElementById('root'));\nroot.render(<App />);"}
Step 6 - Create public directory and index.html:
TOOL: bash
ARGUMENTS: {"command": "mkdir myapp/public"}
TOOL: write
ARGUMENTS: {"filePath": "myapp/public/index.html", "content": "<!DOCTYPE html>\n<html lang=\"en\">\n<head>\n <meta charset=\"UTF-8\">\n <meta name=\"viewport\" content=\"width=device-width, initial-scale=1.0\">\n <title>React App</title>\n</head>\n<body>\n <div id=\"root\"></div>\n</body>\n</html>"}
Step 7 - NOW install dependencies (AFTER package.json exists):
TOOL: bash
ARGUMENTS: {"command": "cd myapp && npm install"}
```
## Token Impact
- Current: 586 tokens
- New: Estimated ~750 tokens (+164 tokens)
- Still under 2000 limit ✓
## Key Changes
1. **Explicit sequencing:** "Step 1", "Step 2", etc.
2. **Actual code:** No "..." or "etc." - real working content
3. **Critical note:** "MUST do this BEFORE npm install"
4. **Minimal structure:** Just what's needed for Hello World
## Success Criteria
- [ ] Model creates package.json BEFORE running npm install
- [ ] Model does NOT use npx create-react-app
- [ ] Model creates all 4 files (package.json, App.js, index.js, index.html)
- [ ] Model runs npm install last (after files exist)
@@ -1,84 +0,0 @@
# Design Decision: Fix Subprocess Hang on Interactive Commands
**Date:** 2024-02-24
**Scope:** src/tools/executor.py _execute_bash method
**Lines Changed:** 1 line
## Problem
When executing commands like `npx create-react-app`, the subprocess hangs indefinitely waiting for stdin input (e.g., "Ok to proceed? (y)"). This causes:
1. 300s timeout to be reached
2. opencode to hang waiting for response
3. Poor user experience
## Root Cause
`subprocess.run()` by default inherits stdin from parent process. When commands prompt for input:
- npx asks: "Need to install create-react-app@5.1.0 Ok to proceed? (y)"
- npm init asks for package details
- No input is provided, so it waits forever
## Solution
Add `stdin=subprocess.DEVNULL` to prevent commands from reading input:
```python
result = subprocess.run(
command,
shell=True,
capture_output=True,
text=True,
timeout=timeout,
cwd=cwd,
stdin=subprocess.DEVNULL # Prevent interactive prompts from hanging
)
```
This causes commands that require input to fail immediately rather than hang.
## Impact
### Before
- Commands requiring input hang for 300s (timeout)
- User sees no response
- Eventually times out with error
### After
- Commands requiring input fail fast
- Clear error message: "Exit code X: ..."
- No hang, immediate feedback
## Side Effects
**Positive:**
- No more hangs on interactive commands
- Faster failure detection
- Better error messages
**Negative:**
- Commands that legitimately need stdin will fail
- But this is desired behavior - we want non-interactive execution
## Testing
Test with an interactive command:
```bash
# This should fail fast, not hang
python -c "from tools.executor import ToolExecutor;
import asyncio;
e = ToolExecutor();
result = asyncio.run(e.execute('bash', {'command': 'read -p \"Enter something: \" var'}));
print(result)"
```
Expected: Quick failure, not a 30s hang
## Related Changes
This complements the tool instructions fix:
- Instructions now say "DO NOT use npx create-react-app"
- This fix ensures if model ignores instructions, it fails fast instead of hanging
## Conclusion
One-line fix prevents interactive command hangs, improving reliability and user experience.
@@ -1,178 +0,0 @@
# Design Decision: Fix Tool Execution and Token Reporting
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions and token counting
## Problem Statement
User report shows three critical failures:
1. **Instruction vs Execution:** Model says "You should run mkdir..." instead of TOOL: format
2. **Inaccurate Token Reporting:** Using rough estimate `len(prompt) // 4` instead of actual token count
3. **Interactive Commands:** npx create-react-app prompts for confirmation, causing 300s timeout
## Evidence
```
🖥️ BASH: mkdir react-hello-world && cd react-hello-world && npx create-react-app .
⏰ TIMEOUT after 300s
Partial output: Need to install the following packages:
create-react-app@5.1.0
Ok to proceed? (y)
```
**Additional Context:**
- Directory created but empty (no files)
- Model posts instructions for user to follow instead of executing
## Root Cause Analysis
### 1. Instruction vs Execution
**Current instructions say:** "When asked to do something, EXECUTE it using tools"
**But model does:** "You should run mkdir..."
**Why:** Instructions aren't strong enough - need explicit anti-patterns
### 2. Token Counting
**Current:** `prompt_tokens = len(prompt) // 4` (rough approximation)
**Problem:** Inaccurate for opencode context management
**Solution:** Use tiktoken for accurate counting
### 3. Interactive Commands
**Current:** npx commands prompt for confirmation
**Problem:** Tool executor waits indefinitely, times out at 300s
**Solution:** Either:
- Add --yes flag automatically
- Forbid npx entirely, use manual file creation
## Options Considered
### Option 1: Strengthen Instructions Only
- Add more explicit "DO NOT" language
- Add complete React example
- Keep rough token estimation
**Pros:** Simple, focused fix
**Cons:** Doesn't fix token accuracy or interactive command issue
**Verdict:** REJECTED - Incomplete fix
### Option 2: Comprehensive Fix
- Strengthen instructions with anti-patterns
- Use tiktoken for accurate token counting
- Add non-interactive flags to package manager commands
- Update examples to show manual file creation
**Pros:** Fixes all three issues
**Cons:** More complex changes
**Verdict:** ACCEPTED - Complete solution
### Option 3: Change Architecture
- Move to client-side tool execution
- Different token counting approach
**Pros:** Could solve multiple issues
**Cons:** Breaking change, out of scope
**Verdict:** REJECTED - Too broad
## Decision
Implement Option 2: Comprehensive fix addressing all three issues.
### Changes
#### 1. Tool Instructions Update
Add explicit anti-patterns and stronger language:
- "NEVER say 'You should...' - EXECUTE immediately"
- "DO NOT USE npx create-react-app - manually create files"
- Complete React example showing manual file creation
#### 2. Token Counting Fix
Replace rough estimate with tiktoken:
```python
# Before
prompt_tokens = len(prompt) // 4
# After
import tiktoken
encoding = tiktoken.get_encoding('cl100k_base')
prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(content))
```
#### 3. Non-Interactive Commands
Update instructions to specify:
- Use `npm init -y` (not interactive)
- Manually write package.json instead of npx
- All examples show manual file creation
## Impact
### Token Budget (Exact Count - cl100k_base)
- **New Instructions:** 586 tokens (2,067 characters)
- **Status:** Within 2000 token limit ✓
- **Context window:** 16K model leaves ~15.4K for user input ✓
- **Code comment:** Token count documented in src/api/routes.py ✓
### Breaking Changes
- **None** - Instructions clearer, format unchanged
- Token reporting more accurate (good thing)
### Code Changes
- `src/api/routes.py`:
- Update tool_instructions (~+15 lines)
- Add tiktoken import
- Replace token estimation logic (~5 lines)
## Testing Strategy
1. **Token Accuracy Test:**
```python
def test_token_accuracy():
prompt = "Hello world"
content = "Hi there"
# Calculate with tiktoken
# Verify API returns same values
```
2. **Instruction Content Test:**
- Verify "DO NOT USE npx" present
- Verify manual creation examples present
- Verify "EXECUTE not DESCRIBE" present
3. **Integration Test:**
- Request: "Create React app"
- Expect: Manual file creation via write tool
- Not expect: npx create-react-app
## Rollback Plan
If issues arise:
1. Revert to previous instructions
2. Keep tiktoken for token counting (beneficial)
3. Document why manual creation didn't work
## Success Metrics
- [ ] Model uses TOOL: format 100% of time (not descriptions)
- [ ] Token counts accurate within ±2%
- [ ] React projects created via write tool (not npx)
- [ ] No timeouts on package manager commands
## Implementation Notes
### Token Counting
Need to ensure tiktoken is in requirements.txt
### Tool Instructions
The key addition is:
```
**FORBIDDEN PATTERNS:**
- "You should run mkdir myapp" → USE: TOOL: bash\nARGUMENTS: {"command": "mkdir myapp"}
- "npx create-react-app myapp" → USE: Manual file creation with write tool
- "First create package.json, then..." → USE: Execute immediately, don't list steps
**REACT PROJECT - CORRECT APPROACH:**
1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "{\"name\": \"myapp\"...}"}
3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/index.js", "content": "..."}
4. Continue until all files created
```
@@ -1,172 +0,0 @@
# Design Decision: Improved Tool Instructions
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
**Lines Changed:** ~25 lines
## Problem
Current tool instructions (~125 tokens) fail to communicate key behavioral expectations:
1. **Passive vs Active:** Model describes what to do instead of doing it
2. **Refusal:** Model claims "I am only an AI assistant" instead of executing
3. **Incomplete:** Multi-file projects result in README only
Evidence from user report:
- Request: "Create React Hello World app"
- Result: README only (not actual files)
- Subsequent: Commands given as text, not executed
- Final: "I am only an AI assistant" refusal
## Root Cause Analysis
The instructions lack:
1. **Authority statement** - "You CAN and SHOULD use tools"
2. **Execution mandate** - "Execute commands, don't just describe them"
3. **Workflow clarity** - Clear step-by-step expectations
4. **Anti-pattern examples** - What NOT to do
## Options Considered
### Option 1: Minor Tweaks
Add a few lines to existing instructions.
- **Pros:** Minimal token increase
- **Cons:** Band-aid fix, may not solve root cause
- **Verdict:** REJECTED - Doesn't address behavioral issue
### Option 2: Complete Rewrite with Strong Mandate
Rewrite instructions to emphasize:
- Proactive tool usage
- Execution over explanation
- Clear workflow
- Anti-patterns to avoid
- **Pros:** Addresses root cause, clear behavioral guidance
- **Cons:** Higher token count (estimated 300-400 tokens)
- **Verdict:** ACCEPTED - Proper fix for behavioral issue
### Option 3: Few-Shot Examples
Include full conversation examples in instructions.
- **Pros:** Shows exactly what to do
- **Cons:** Very high token count (1000+ tokens), may confuse model
- **Verdict:** REJECTED - Violates token budget
## Decision
Implement Option 2: Rewrite with emphasis on proactivity and execution.
**Key additions:**
1. **Capability statement:** "You have tools. Use them."
2. **Execution mandate:** "Don't describe, execute"
3. **Workflow:** Clear request→tool→result→next cycle
4. **Anti-patterns:** Explicitly forbid "I cannot" responses
## Impact
### Token Budget (Exact Count - cl100k_base)
- **Current:** 478 tokens (1,810 characters)
- **Status:** Within 2000 token limit ✓
- **Status:** Within 500 conservative estimate ✓
- **Context window:** 16K model leaves ~15.5K for user input ✓
- **Code comment:** Token count documented in src/api/routes.py ✓
### Code Changes
- **File:** src/api/routes.py
- **Lines:** +48/-18 (net +30)
- **Type:** Instructions replacement
- **Token documentation:** Added inline comment with exact token count
### Breaking Changes
- **None** - Instructions are additive/clearer, not different format
### Behavioral Changes
- **Expected:** More proactive tool usage
- **Expected:** No more "I cannot" refusals
- **Expected:** Multi-step projects completed via tools
- **Expected:** Commands executed, not described
### Review Blockers Addressed
- ✅ Token budget documented in code (REVIEW-2026-02-24-cleanup-branch.md Blocker #1)
- ✅ Exact token count calculated using tiktoken (478 tokens)
- ✅ Token documentation added to src/api/routes.py (REVIEW-2026-02-24-cleanup-branch.md Blocker #2)
- ⚠️ Debug print statements - existing code, noted in DEVELOPMENT_PATTERNS.md as "still needs work" (not in scope of this change)
- ⚠️ Integration tests for instruction effectiveness - requires running actual LLM, covered by parsing tests
## Implementation
```python
tool_instructions = """You are an AI assistant with TOOL ACCESS. You CAN and MUST use tools to accomplish tasks.
**CRITICAL RULES:**
1. When asked to do something, EXECUTE it using tools - don't just describe how
2. You MUST use the bash tool to run commands like mkdir, npm, ls, etc.
3. You MUST use the write tool to create files
4. NEVER say "I cannot" or "I am only an AI assistant" - you HAVE tools, USE them
5. Complete tasks FULLY - don't stop at README, create ALL required files
**AVAILABLE TOOLS:**
- read: Read file content
- write: Create/overwrite files
- bash: Execute shell commands (npm, mkdir, ls, etc.)
**TOOL FORMAT (STRICT):**
TOOL: tool_name
ARGUMENTS: {"param": "value"}
**WORKFLOW:**
1. User asks for something
2. You decide what tool to use
3. You respond with ONLY the TOOL: format above
4. You receive the tool result
5. You continue with next tool until task is COMPLETE
**EXAMPLES:**
Creating a project:
User: "Create a React app"
You: TOOL: bash
ARGUMENTS: {"command": "mkdir myapp && cd myapp && npm init -y"}
[wait for result]
You: TOOL: write
ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
[continue until all files created]
Running commands:
User: "Install dependencies"
You: TOOL: bash
ARGUMENTS: {"command": "npm install"}
[wait for result, then confirm completion]
**WHAT NOT TO DO:**
- ❌ "To create a React app, you should run: mkdir myapp" (describing)
- ❌ "I cannot run commands, I am an AI" (refusing)
- ❌ Creating only README instead of full project (incomplete)
- ❌ "First do X, then do Y" (giving instructions instead of doing)
**CORRECT BEHAVIOR:**
- ✅ Execute the command immediately using the bash tool
- ✅ Create all files using the write tool
- ✅ Continue until task is 100% complete
- ✅ Use ONE tool at a time and wait for results"""
```
## Testing
1. Test with React Hello World request
2. Verify model uses bash to create directory structure
3. Verify model uses write to create all files
4. Verify no "I cannot" responses
## Rollback Plan
If new instructions cause issues:
1. Revert to previous ~125 token version
2. Analyze what specifically failed
3. Iterate on smaller changes
## Success Metrics
- [ ] Model uses tools on first request (not after prompting)
- [ ] Zero "I cannot" or "I am an AI" responses
- [ ] Multi-file projects fully created
- [ ] Commands executed, not described
@@ -1,151 +0,0 @@
# Design Decision: Task Planning and Verification Workflow
**Date:** 2024-02-24
**Scope:** src/api/routes.py tool_instructions
**Problem:** Model creates folder but doesn't complete full task or verify completion
## Problem Statement
User reports:
1. "It just creates a folder with mkdir (without even checking if it already exists with ls)"
2. No verification that tasks are completed
3. No planning of full task scope
4. Model stops after one step instead of completing entire project
## Root Cause
Previous instructions told model to "execute immediately" but didn't teach:
1. **Planning** - What needs to be done
2. **Checking** - What already exists
3. **Verification** - Did the step work
4. **Completion loop** - Keep going until done
## Solution
Add **Task Completion Workflow** to instructions:
```
**TASK COMPLETION WORKFLOW (MANDATORY):**
**1. PLAN:** List ALL steps needed before starting
**2. CHECK:** Use ls to verify what exists before creating
**3. EXECUTE:** Run first step
**4. VERIFY:** Confirm step worked (ls, read file)
**5. REPEAT:** Steps 3-4 until ALL complete
**6. FINAL CHECK:** Verify entire task is done
**7. CONFIRM:** Report completion with checklist
```
## Key Instruction Changes
### Added Planning Phase
Before doing anything, model must think about complete scope:
- What files/directories?
- What dependencies?
- Complete task requirements
### Added Verification Steps
Every step must be verified:
- `ls -la` after mkdir
- `read` file after write
- Check content is correct
### Added Completion Loop
Model must continue until:
✓ All directories exist
✓ All files exist with correct content
✓ All dependencies installed
✓ Each component verified
### Complete Working Example
Provided 13-step React example showing:
1. Check existing (ls)
2. Create directory
3. Verify created (ls)
4. Create package.json
5. Verify package.json (read)
6. Create source files
7. Final verification (find myapp -type f)
8. Install dependencies
9. Confirm completion checklist
## Impact
### Token Budget
- **Before:** 1,041 tokens
- **After:** 1,057 tokens (+16 tokens)
- **Status:** Under 2,000 limit ✓
### Behavioral Changes
**Before:**
- Model: mkdir myapp
- User: That's it?
- Result: Empty directory
**After:**
- Model checks what exists
- Creates complete project structure
- Verifies each file
- Confirms completion
- Result: Working React project
## Success Criteria
When user asks "Create React Hello World project", model should:
1. ✓ Check current directory contents
2. ✓ Create myapp/ directory
3. ✓ Verify directory created
4. ✓ Create package.json
5. ✓ Verify package.json content
6. ✓ Create src/App.js
7. ✓ Create src/index.js
8. ✓ Create public/index.html
9. ✓ Final verification (list all files)
10. ✓ npm install
11. ✓ Confirm completion checklist
## Testing
Test instructions contain:
- PLAN/CHECK keywords
- VERIFY keyword
- COMPLETE keyword
All tests pass: 11/11 ✓
## Trade-offs
**Pros:**
- Complete task execution
- Verification prevents partial work
- Clear completion criteria
- Better user experience
**Cons:**
- More tokens (but still under limit)
- More verbose instructions
- May be slower (more verification steps)
## Related Files Changed
1. src/api/routes.py - Updated tool_instructions
2. tests/test_tool_parsing.py - Updated tests for new content
3. docs/design/2024-02-24-task-planning-verification.md - This doc
## Future Improvements
1. **Task Queue System:** Server-side queue of pending operations
2. **State Persistence:** Remember what's been done across conversations
3. **Smart Resumption:** If interrupted, pick up where left off
4. **Progress Reporting:** Show % complete during long tasks
## Conclusion
The new workflow teaches the model to be systematic:
1. Plan before acting
2. Check before creating
3. Verify after each step
4. Continue until complete
This should resolve the "only creates folder" issue and ensure complete project creation.
@@ -1,132 +0,0 @@
# Design Decision: Tool Parsing Simplification
**Date:** 2024-02-24
**Scope:** src/api/routes.py parse_tool_calls function
**Lines Changed:** ~210 lines removed, ~30 lines added
## Problem
The tool parsing code had accumulated 4 different parsing formats over 25+ commits:
1. JSON `tool_calls` format with nested objects
2. TOOL:/ARGUMENTS: format (simple text)
3. Function pattern format `func_name(args)`
4. Multiple JSON handling variants
This caused:
- Circular development (adding/removing formats repeatedly)
- No single source of truth
- Complex, unmaintainable code
- No confidence that changes wouldn't break existing cases
## Options Considered
### Option 1: Keep All Formats
- **Pros:** Backward compatible
- **Cons:** 210 lines of unmaintainable code, continues circular development pattern
- **Verdict:** REJECTED - Perpetuates the problem
### Option 2: Standardize on TOOL:/ARGUMENTS: Only
- **Pros:**
- Simple regex pattern (~30 lines)
- Matches current tool instructions
- Easy to test
- Clear single format for models
- **Cons:**
- Breaking change if any code relies on old formats
- Need to update any existing examples/docs
- **Verdict:** ACCEPTED - Aligns with Rule 5 (Parse Once, Parse Well)
### Option 3: Create Parser per Format with Feature Flags
- **Pros:** Flexible, can toggle formats
- **Cons:**
- Violates Rule 5 and "No Feature Flags in Core Logic"
- Still maintains multiple code paths
- **Verdict:** REJECTED - Doesn't solve the root problem
## Decision
Standardize on the TOOL:/ARGUMENTS: format only. Remove all other parsing code.
**Rationale:**
- Per DEVELOPMENT_PATTERNS.md recommendation #3: "One Format Only"
- Token cost is minimal (no complex regex)
- Test coverage provides confidence
- Aligns with existing tool instructions
## Impact
### Token Count
- **Parser code:** 210 lines → 30 lines (-180 lines)
- **No change** to tool instructions (separate optimization)
### Breaking Changes
- **Yes** - Removes support for:
- JSON `tool_calls` format in model responses
- Function pattern format `read_file(path="test.txt")`
**Migration:** Models must use:
```
TOOL: read
ARGUMENTS: {"filePath": "test.txt"}
```
### Testing
- Unit tests added: 9 test cases
- Coverage: All parsing scenarios
- All tests pass
## Implementation
```python
# New implementation (30 lines)
def parse_tool_calls(text: str) -> tuple:
"""Parse tool calls using standardized format."""
import json
import re
tool_pattern = r'TOOL:\s*(\w+)\s*\nARGUMENTS:\s*(\{[^}]*\})'
tool_matches = list(re.finditer(tool_pattern, text, re.IGNORECASE))
if not tool_matches:
return text, None
tool_calls = []
for i, tool_match in enumerate(tool_matches):
tool_name = tool_match.group(1)
args_str = tool_match.group(2)
try:
args_dict = json.loads(args_str)
tool_calls.append({
"id": f"call_{i+1}",
"type": "function",
"function": {
"name": tool_name,
"arguments": json.dumps(args_dict)
}
})
except json.JSONDecodeError:
continue
if not tool_calls:
return text, None
first_start = tool_matches[0].start()
content = text[:first_start].strip()
return content, tool_calls
```
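A quick usage sketch of the parser above (the output shape follows the implementation as written):

```python
text = 'Let me check that file.\nTOOL: read\nARGUMENTS: {"filePath": "notes.txt"}'
content, tool_calls = parse_tool_calls(text)
# content    == "Let me check that file."
# tool_calls == [{"id": "call_1", "type": "function",
#                 "function": {"name": "read", "arguments": '{"filePath": "notes.txt"}'}}]
```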
## Verification
Run tests:
```bash
python tests/test_tool_parsing.py
```
Expected: 9 passed, 0 failed
## Follow-up
- [x] Update DEVELOPMENT_PATTERNS.md to mark as completed
- [x] Add unit tests
- [ ] Consider integration test for full tool execution flow
@@ -1,98 +0,0 @@
# Investigation: 31k Token Context Issue
## Problem
When making requests through opencode to local_swarm, the LLM receives ~31k tokens of context even for simple empty directory queries.
## Root Cause Identified
**NOT an issue with this repo's codebase - this is expected behavior for function calling.**
### How it works:
1. **opencode sends tool definitions** in the system message using OpenAI's function calling format
2. **Each tool definition is ~450 tokens** (name + description + parameters)
3. **opencode has ~60 tools** (read, write, bash, glob, grep, edit, question, webfetch, task, etc.)
4. **Total tool definition tokens:** ~27,000 tokens
### Calculation:
```
Single tool definition: ~450 tokens
Number of tools: ~60
Tool schemas total: ~27,000 tokens
System message: ~500 tokens
User query: ~100 tokens
---
Total: ~27,600 tokens
```
**This roughly matches the observed ~31k tokens.**
## Why This Happens
OpenAI's function calling protocol requires sending the **complete function schemas** to the LLM with every request. This is how the model:
- Knows what tools are available
- Understands parameter requirements
- Knows how to format tool calls
All major LLM providers using function calling work this way (OpenAI, Anthropic, local models, etc.).
## Verification
```bash
python -c "
import tiktoken
enc = tiktoken.get_encoding('cl100k_base')
# Example from actual opencode tool definition
read_tool_schema = '''{\"type\": \"function\", \"function\": {\"name\": \"read\", \"description\": \"Read a file or directory from the local filesystem...[full description]\", \"parameters\": {...}}}'''
print(f'Single tool schema: {len(enc.encode(read_tool_schema))} tokens')
print(f'Estimated 60 tools: {len(enc.encode(read_tool_schema)) * 60:,} tokens')
"
```
Result:
- Single tool definition: ~451 tokens
- 60 tools: ~27,060 tokens
- Plus system + user message: ~27,660 total
## This Is NOT a Bug
The 31k token context is **correct and expected** for function calling with 60+ tools. This is how:
- OpenAI API works
- Claude API works
- Local models with function calling work
## Potential Optimizations (Optional)
If reducing context size is critical, consider:
### Option 1: Dynamic Tool Selection
- Only send tools relevant to current task
- Example: For file operations, only send [read, write, glob, edit]
- Trade-off: Requires opencode to intelligently filter tools
### Option 2: Compressed Tool Descriptions
- Shorten tool descriptions to essentials
- Example: "Read file at path (required: filePath)"
- Trade-off: Model may make more errors with less guidance
### Option 3: Tool Grouping
- Group similar tools into single "tools: [read, write, glob]" parameter
- Trade-off: Breaks OpenAI compatibility
## Recommendation
**NO ACTION REQUIRED.** The 31k token context is:
- Standard for function calling with many tools
- Within capabilities of modern LLMs (32k-128k context windows)
- Not caused by this repo's code
The `.opencodeignore` created earlier will help with opencode's own system prompt, but doesn't affect the LLM context sent to local_swarm.
## Additional Finding
While investigating, verified:
- `config/prompts/tool_instructions.txt`: 125 tokens ✅
- This repo's tool execution code: No token bloat ✅
- Issue is purely opencode's function calling protocol ✅
@@ -1,112 +0,0 @@
# Test Plan: Fix Tool Execution and Token Reporting
## Problem Analysis
### Issue 1: Model Gives Instructions Instead of Executing
**Current behavior:** Model describes what to do ("You should run mkdir...") instead of using TOOL: format
**Expected:** Model responds with TOOL: bash\nARGUMENTS: {"command": "mkdir..."}
### Issue 2: Token Counting Inaccurate
**Current:** Rough estimate `len(prompt) // 4`
**Expected:** Accurate token count using tiktoken
**Impact:** opencode can't properly manage context window
### Issue 3: npx Commands Timeout/Need Input
**Current:** `npx create-react-app .` prompts for confirmation (y/n)
**Expected:** Non-interactive execution or manual file creation
**Evidence:** "Need to install the following packages: create-react-app@5.1.0 Ok to proceed? (y)"
## Unit Tests
### Test 1: Accurate Token Counting
- [ ] Verify token count uses tiktoken (not rough estimate)
- [ ] Test with known token counts
- [ ] Verify prompt_tokens + completion_tokens = total_tokens
### Test 2: Non-Interactive Bash Commands
- [ ] Verify npm/npx commands use --yes or equivalent flags
- [ ] Test timeout handling for package managers
- [ ] Verify commands don't prompt for user input
### Test 3: Tool Instructions Content
- [ ] Verify instructions emphasize "EXECUTE not DESCRIBE"
- [ ] Verify manual file creation examples (not npx)
- [ ] Verify anti-patterns are clearly stated
## Integration Tests
### Test 4: End-to-End React Project Creation
**Input:** "Create a React Hello World app"
**Expected Flow:**
1. TOOL: bash, ARGUMENTS: {"command": "mkdir myapp"}
2. TOOL: write, ARGUMENTS: {"filePath": "myapp/package.json", "content": "..."}
3. TOOL: write, ARGUMENTS: {"filePath": "myapp/src/App.js", "content": "..."}
4. Continue until complete
**Failure Modes:**
- [ ] Model describes steps instead of executing
- [ ] Uses npx create-react-app (should manually create files)
- [ ] Stops after README only
### Test 5: Token Reporting Accuracy
**Input:** Any chat completion request
**Expected:**
- usage.prompt_tokens matches actual tokens
- usage.completion_tokens matches actual tokens
- usage.total_tokens is sum
**Verification:**
- Compare tiktoken count vs API response
## Manual Verification
```bash
# Test React creation
python main.py --auto &
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-Client-Working-Dir: /tmp/test-project" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Create a React Hello World app"}],
"tools": [{"type": "function", "function": {"name": "bash"}}, {"type": "function", "function": {"name": "write"}}]
}'
# Check token accuracy
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Hello"}]
}' | jq '.usage'
```
## Success Criteria
1. **Execution:** 100% of requests use TOOL: format (not descriptions)
2. **Accuracy:** Token counts match tiktoken within ±5%
3. **Completion:** Multi-file projects fully created via write tool
4. **No npx:** Manual file creation for React (no npx create-react-app)
## Implementation Notes
### Token Counting Fix
```python
# Replace: prompt_tokens = len(prompt) // 4
# With:
import tiktoken
encoding = tiktoken.get_encoding('cl100k_base')
prompt_tokens = len(encoding.encode(prompt))
completion_tokens = len(encoding.encode(content))
```
### Tool Instructions Fix
- Add explicit "DO NOT USE npx create-react-app" instruction
- Add "EXECUTE IMMEDIATELY" mandate
- Show complete React example with manual file creation
### Non-Interactive Commands
- Auto-add --yes to npx commands
- Or recommend manual file creation instead
@@ -1,97 +0,0 @@
# Test Plan: Improved Tool Instructions
## Problem Statement
Model is not using tools effectively:
1. Creates README instead of actual project structure
2. Provides commands as text instead of executing them
3. Refuses to run commands claiming "I am only an AI assistant"
## Root Cause Analysis
Current instructions don't clearly communicate:
- That the model SHOULD use tools proactively
- That execution is expected, not explanation
- The workflow: user request → tool execution → result
## Unit Tests (Instruction Verification)
### Test 1: Instruction Presence
- [ ] Verify instructions are injected into system message
- [ ] Verify instructions appear at the START of system message (priority position)
### Test 2: Token Count
- [ ] Measure total token count of new instructions
- [ ] Verify ≤ 500 tokens (conservative budget)
- [ ] Document before/after
### Test 3: Format Compliance
- [ ] Verify instructions include TOOL:/ARGUMENTS: format
- [ ] Verify examples use correct format
- [ ] Verify rules are clear and numbered
## Integration Tests (Behavioral)
### Test 4: Project Creation Flow
**Input:** "Create a React Hello World app"
**Expected Behavior:**
1. Model responds with TOOL: bash, ARGUMENTS: mkdir myapp
2. After result, TOOL: write, ARGUMENTS: package.json content
3. After result, TOOL: write, ARGUMENTS: src/App.js content
4. Continue until complete project structure exists
**Failure Modes:**
- [ ] Model only describes what to do
- [ ] Model creates README only
- [ ] Model refuses to execute commands
### Test 5: Multi-step Task
**Input:** "Check what files exist, then create a test.txt file with 'hello' in it"
**Expected Behavior:**
1. TOOL: bash, ARGUMENTS: ls -la
2. Wait for result
3. TOOL: write, ARGUMENTS: test.txt with "hello"
**Failure Modes:**
- [ ] Model tries to do both in one response
- [ ] Model doesn't wait for ls result before writing
### Test 6: Command Refusal
**Input:** "Run npm install"
**Expected Behavior:**
1. TOOL: bash, ARGUMENTS: npm install
**Failure Modes:**
- [ ] Model responds: "I cannot run commands, I am only an AI assistant"
- [ ] Model explains npm install instead of running it
## Manual Verification Commands
```bash
# Start the server
python main.py --auto
# In another terminal, test with curl
curl -X POST http://localhost:17615/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local-swarm",
"messages": [{"role": "user", "content": "Create a React Hello World app"}],
"tools": [{"type": "function", "function": {"name": "bash", "description": "Run shell commands"}}, {"type": "function", "function": {"name": "write", "description": "Write files"}}]
}'
```
## Success Criteria
1. **Proactivity:** Model uses tools without being asked twice
2. **Execution:** Model runs commands, doesn't just describe them
3. **No Refusal:** Model never says "I cannot" or "I am only an AI"
4. **Completeness:** Multi-file projects are fully created via tools
5. **Format:** 100% of tool calls use correct TOOL:/ARGUMENTS: format
## Metrics
- **Tool usage rate:** % of requests that result in tool calls
- **Format compliance:** % of tool calls in correct format
- **Completion rate:** % of multi-step tasks fully completed
@@ -1,35 +0,0 @@
# Test Plan: Tool Parsing Simplification
## Unit Tests
- [x] Test case 1: Single tool call → Returns 1 tool with correct name and arguments
- [x] Test case 2: No tool in text → Returns None for tools, original text as content
- [x] Test case 3: Multiple tools → Returns all tools in order
- [x] Test case 4: Content before tool → Content extracted, tool parsed correctly
- [x] Test case 5: Bash tool → Correctly parses bash command
- [x] Test case 6: Case insensitive → "tool:" and "TOOL:" both work
- [x] Test case 7: Invalid JSON → Skips invalid, continues with valid
- [x] Test case 8: Empty text → Returns None, empty string
- [x] Test case 9: Whitespace only → Returns None
## Integration Tests
- [ ] End-to-end flow:
1. Send chat completion request with tools
2. Model responds with TOOL:/ARGUMENTS: format
3. Parser extracts tool call
4. Tool executes
5. Result returned in response
- [ ] Expected result: Tool executes successfully, result included in response
## Manual Verification
- [ ] Command: `python tests/test_tool_parsing.py`
- [ ] Expected output: "9 passed, 0 failed"
## Token Budget Verification
- Parser code: ~30 lines (~200 tokens)
- Well under 2000 token limit
- Simple regex pattern maintains low complexity
+451 -56
@@ -7,7 +7,7 @@ import json
import logging
import time
import uuid
from typing import Optional
from typing import Optional, List
from api.models import (
ChatCompletionRequest,
@@ -20,11 +20,54 @@ from api.formatting import format_messages_with_tools
from api.tool_parser import parse_tool_calls
from utils.token_counter import count_tokens
from tools.executor import get_tool_executor
from chatlog import get_chat_logger
logger = logging.getLogger(__name__)
def _extract_working_dir_from_prompt(prompt: str) -> Optional[str]:
"""Extract working directory from user prompt.
Looks for patterns like:
- "in the /path/to/dir directory"
- "in directory /path/to/dir"
- "in /path/to/dir"
- "under /path/to/dir"
- "from /path/to/dir"
Args:
prompt: User prompt text
Returns:
Extracted directory path or None
"""
import re
import os
# Common patterns for directory mentions
patterns = [
r'in the\s+([/~]?[\w\-/.]+)\s+(?:directory|folder|dir)',
r'in\s+(?:directory|folder|dir)\s+([/~]?[\w\-/.]+)',
r'(?:in|under|from|at)\s+([/~]?[\w\-/.]{3,})', # At least 3 chars to avoid "in a"
]
for pattern in patterns:
match = re.search(pattern, prompt, re.IGNORECASE)
if match:
path = match.group(1)
# Validate it looks like a path
if path.startswith('/') or path.startswith('~') or '/' in path:
# Expand home directory
if path.startswith('~'):
path = os.path.expanduser(path)
# Check if it's a valid directory or parent exists
if os.path.isdir(path) or os.path.isdir(os.path.dirname(path)):
return os.path.abspath(path)
return None
def _sanitize_tools(tools: Optional[list]) -> Optional[list]:
"""Sanitize tool definitions to fix invalid schemas.
@@ -61,19 +104,19 @@ async def _execute_tools(
tool_calls: list,
client_working_dir: Optional[str],
executor
) -> str:
) -> List[tuple]:
"""Execute tool calls and return results.
Args:
tool_calls: List of parsed tool calls
client_working_dir: Working directory for file operations
executor: Tool executor instance
Returns:
Combined tool results as string
List of tuples (tool_name, result_string)
"""
from api.routes import execute_tool_server_side
tool_results = []
for i, tc in enumerate(tool_calls):
tool_name = tc.get("function", {}).get("name", "")
@@ -85,10 +128,10 @@ async def _execute_tools(
logger.debug(f" [{i+1}/{len(tool_calls)}] Executing: {tool_name}({tool_args})")
result = await execute_tool_server_side(tool_name, tool_args, working_dir=client_working_dir)
tool_results.append(f"Tool '{tool_name}' result: {result}")
tool_results.append((tool_name, result))
logger.debug(f" ✓ Completed: {result[:100]}..." if len(result) > 100 else f" ✓ Result: {result}")
return "\n\n".join(tool_results)
return tool_results
def _create_response(
@@ -97,10 +140,25 @@ def _create_response(
finish_reason: str,
prompt: str,
request: ChatCompletionRequest,
swarm_manager=None
swarm_manager=None,
thinking_content: Optional[str] = None
) -> ChatCompletionResponse:
"""Create a chat completion response.
Args:
content: Final response content (after tool execution if any)
tool_calls: List of tool calls
finish_reason: Finish reason
prompt: Original prompt for token counting
request: Original request
swarm_manager: Swarm manager instance (optional, for getting model name)
thinking_content: Intermediate thinking/planning content to include in streaming as reasoning_content
Returns:
ChatCompletionResponse
"""
"""Create a chat completion response.
Args:
content: Response content
tool_calls: List of tool calls
@@ -141,7 +199,7 @@ def _create_response(
message = ChatMessage(**message_kwargs)
return ChatCompletionResponse(
response = ChatCompletionResponse(
id=f"chatcmpl-{uuid.uuid4().hex[:12]}",
created=int(time.time()),
model=model_name,
@@ -162,26 +220,56 @@ def _create_response(
system_fingerprint=system_fingerprint
)
# Attach thinking content for streaming (not part of JSON serialization)
# Use a private attribute to avoid interfering with model serialization
if thinking_content is not None:
setattr(response, '_thinking', thinking_content)
async def _generate_with_local_swarm(
swarm_manager,
return response
async def _generate_with_consensus(
prompt: str,
max_tokens: int,
temperature: float,
stream: bool = False
swarm_manager,
federated_swarm=None
) -> tuple[str, int, float]:
"""Generate response using local swarm.
"""Generate response with consensus (local or federated).
This is the unified generation interface - it handles both local-only
and federated generation transparently. Callers don't need to know
which mode is being used.
Args:
swarm_manager: Swarm manager instance
prompt: Prompt to generate from
max_tokens: Maximum tokens to generate
temperature: Sampling temperature
stream: Whether this is a streaming request
swarm_manager: Local swarm manager instance
federated_swarm: Optional federated swarm for multi-node consensus
Returns:
Tuple of (response_text, tokens_generated, tokens_per_second)
"""
# Check if federation is available
if federated_swarm is not None:
peers = federated_swarm.discovery.get_peers()
if peers:
logger.debug(f"🌐 Using federation with {len(peers)} peer(s)")
try:
fed_result = await federated_swarm.generate_with_federation(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature
)
# Federation returns FederationResult object
# Extract the final response text
return fed_result.final_response, 0, 0.0 # Tokens/TPS not tracked in federation mode
except Exception as e:
logger.warning(f"Federation failed, falling back to local: {e}")
# Fall through to local generation
# Local generation (fallback or no federation)
try:
result = await swarm_manager.generate(
prompt=prompt,
@@ -189,18 +277,178 @@ async def _generate_with_local_swarm(
temperature=temperature,
use_consensus=True
)
response = result.selected_response
return (
response.text,
response.tokens_generated,
response.tokens_per_second
)
return response.text, response.tokens_generated, response.tokens_per_second
except Exception as e:
logger.exception("Error in swarm generation")
raise
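# Example call (illustrative) - callers pass the same arguments regardless of mode:
#   await _generate_with_consensus(prompt, 1024, 0.7, swarm_manager, federated_swarm)
# uses federation when peers are discovered, while federated_swarm=None forces
# local-only generation (as the tool-result iterations below do).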
def _tool_calls_agree(tool_calls_list: List[List[dict]]) -> bool:
"""Check if all workers agree on the same tool calls.
Args:
tool_calls_list: List of tool calls from each worker
Returns:
True if all workers have the same tool calls
"""
if not tool_calls_list:
return True
# Check if all have the same number of tool calls
first_count = len(tool_calls_list[0])
if not all(len(tc) == first_count for tc in tool_calls_list):
logger.warning(f" ⚠️ Workers disagree on number of tool calls: {[len(tc) for tc in tool_calls_list]}")
return False
if first_count == 0:
return True # All agree on no tools
# Check if tool names and arguments match
for i in range(first_count):
first_tool = tool_calls_list[0][i]
first_name = first_tool.get("function", {}).get("name", "")
first_args = first_tool.get("function", {}).get("arguments", "")
for j, other_calls in enumerate(tool_calls_list[1:], 1):
other_tool = other_calls[i]
other_name = other_tool.get("function", {}).get("name", "")
other_args = other_tool.get("function", {}).get("arguments", "")
if first_name != other_name:
logger.warning(f" ⚠️ Worker {j+1} disagrees on tool name: {first_name} vs {other_name}")
return False
# For arguments, do a loose comparison (ignore whitespace differences)
try:
first_args_norm = json.loads(first_args) if isinstance(first_args, str) else first_args
other_args_norm = json.loads(other_args) if isinstance(other_args, str) else other_args
if first_args_norm != other_args_norm:
logger.warning(f" ⚠️ Worker {j+1} disagrees on arguments for {first_name}")
return False
except json.JSONDecodeError:
# If JSON parsing fails, compare as strings
if str(first_args).strip() != str(other_args).strip():
logger.warning(f" ⚠️ Worker {j+1} disagrees on arguments for {first_name}")
return False
logger.info(f" ✅ All {len(tool_calls_list)} workers agree on tool calls")
return True
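# Example (illustrative): two workers that both propose
#   {"function": {"name": "read", "arguments": '{"filePath": "notes.txt"}'}}
# agree even if the argument JSON differs in whitespace, because arguments are
# compared after json.loads; any mismatch in tool name, count, or argument values
# returns False.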
async def _generate_with_tool_consensus(
swarm_manager,
prompt: str,
max_tokens: int,
temperature: float
) -> tuple[str, List[dict], int, float]:
"""Generate response with tool call consensus checking.
When multiple workers are active, this ensures they all agree on tool calls
before executing them. If they disagree, returns the best response without tools.
Args:
swarm_manager: Swarm manager instance
prompt: Prompt to generate from
max_tokens: Maximum tokens to generate
temperature: Sampling temperature
Returns:
Tuple of (response_text, tool_calls, tokens_generated, tps)
"""
try:
# Get status to check number of workers
status = swarm_manager.get_status()
num_workers = getattr(status, 'active_workers', 1)
# If only one worker, use normal generation
if num_workers <= 1:
logger.debug(" Single worker mode - skipping tool consensus")
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
parsed_content, tool_calls = parse_tool_calls(response.text)
return response.text, tool_calls, response.tokens_generated, response.tokens_per_second
# Multiple workers - check for tool consensus
logger.info(f" 🔍 Checking tool consensus across {num_workers} workers...")
# Generate from all workers individually
from swarm.manager import GenerationRequest
all_responses = []
all_tool_calls = []
# Get all active workers
workers = swarm_manager.workers if hasattr(swarm_manager, 'workers') else []
if not workers:
# Fall back to normal generation
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
parsed_content, tool_calls = parse_tool_calls(response.text)
return response.text, tool_calls, response.tokens_generated, response.tokens_per_second
# Generate from each worker
for i, worker in enumerate(workers):
try:
gen_result = await worker.generate(
GenerationRequest(prompt=prompt, max_tokens=max_tokens, temperature=temperature)
)
response_text = gen_result.text
parsed_content, tool_calls = parse_tool_calls(response_text)
all_responses.append(response_text)
all_tool_calls.append(tool_calls)
logger.debug(f" Worker {i+1}: {len(tool_calls)} tool call(s)")
except Exception as e:
logger.warning(f" Worker {i+1} failed: {e}")
all_responses.append("")
all_tool_calls.append([])
# Check consensus
if _tool_calls_agree(all_tool_calls):
# All agree - use the first response's tool calls
best_response = all_responses[0] if all_responses else ""
best_tool_calls = all_tool_calls[0] if all_tool_calls else []
# Average word-count estimate across non-empty responses (guard against all-empty)
non_empty = [r for r in all_responses if r]
total_tokens = sum(len(r.split()) for r in non_empty) // len(non_empty) if non_empty else 0
avg_tps = 10.0  # Rough placeholder - per-worker throughput is not aggregated here
return best_response, best_tool_calls, total_tokens, avg_tps
else:
# Disagreement - fall back to consensus strategy without tools
logger.warning(" ⚠️ Tool consensus failed - falling back to text response")
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
# Strip any tool calls to be safe
parsed_content, _ = parse_tool_calls(response.text)
return parsed_content, [], response.tokens_generated, response.tokens_per_second
except Exception as e:
logger.exception("Error in tool consensus generation")
# Fall back to normal generation
result = await swarm_manager.generate(
prompt=prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
response = result.selected_response
parsed_content, tool_calls = parse_tool_calls(response.text)
return response.text, tool_calls, response.tokens_generated, response.tokens_per_second
async def _generate_with_federation(
federated_swarm,
prompt: str,
@@ -263,6 +511,29 @@ async def handle_chat_completion(
prompt = format_messages_with_tools(request.messages, None)
has_tools = request.tools is not None and len(request.tools) > 0
# Initialize chat logger (if enabled via LOCAL_SWARM_CHATLOG=1)
chat_logger = get_chat_logger()
# Extract working directory from prompt if not provided by client
if client_working_dir is None:
# Try to extract from user messages
for msg in reversed(request.messages):
if msg.role == 'user':
extracted_dir = _extract_working_dir_from_prompt(msg.content)
if extracted_dir:
client_working_dir = extracted_dir
logger.info(f"📁 Extracted working directory from prompt: {client_working_dir}")
break
# Log initial conversation history to chatlog
for msg in request.messages:
if msg.role == 'user':
chat_logger.log_user_message(msg.content)
elif msg.role == 'assistant':
chat_logger.log_assistant_message(msg.content, has_tool_calls=bool(msg.tool_calls))
elif msg.role == 'tool':
chat_logger.log_tool_result("tool", msg.content)
logger.info(f"\n{'='*60}")
logger.info(f"CHAT COMPLETION REQUEST:")
logger.info(f" has_tools={has_tools}, stream={request.stream}")
@@ -270,21 +541,18 @@ async def handle_chat_completion(
logger.info(f" messages={len(request.messages)}")
logger.info(f"{'='*60}")
# Use federation if available
if federated_swarm is not None:
peers = federated_swarm.discovery.get_peers()
if peers:
logger.info(f"🌐 Using federation with {len(peers)} peer(s)...")
content, tool_calls, finish_reason = await _generate_with_federation(
federated_swarm, prompt, request.max_tokens or 1024, request.temperature or 0.7
)
return _create_response(content, tool_calls, finish_reason, prompt, request, swarm_manager)
# Build conversation history
messages = list(request.messages)
# Determine if we should use federation for generation
use_federation = federated_swarm is not None and len(federated_swarm.discovery.get_peers()) > 0
if use_federation:
logger.info(f"🌐 Federation available with peers")
# Track thinking content for streaming (OpenCode reasoning_content)
thinking_content: Optional[str] = None
thinking_captured = False
# Initialize iteration counter and response text
iteration = 0
max_iterations = 3
@@ -295,10 +563,31 @@ async def handle_chat_completion(
logger.info(f"--- Tool Execution Iteration {iteration} ---")
# Generate response
logger.debug(f"Generating response...")
response_text, tokens_generated, tps = await _generate_with_local_swarm(
swarm_manager, prompt, request.max_tokens or 1024, request.temperature or 0.7
)
# IMPORTANT: Only use federation on FIRST iteration (initial planning)
# Subsequent iterations process tool results, which only the head node has
if iteration == 1 and use_federation:
# First iteration: use federation for consensus on initial plan
logger.info(f"🌐 Using federation for initial generation...")
response_text, tokens_generated, tps = await _generate_with_consensus(
prompt=prompt,
max_tokens=request.max_tokens or 1024,
temperature=request.temperature or 0.7,
swarm_manager=swarm_manager,
federated_swarm=federated_swarm
)
else:
# Subsequent iterations: LOCAL ONLY
# Peers don't have tool results from previous iterations
# Using federation here would cause inconsistent context
if iteration > 1:
logger.debug(f"Using local generation (iteration {iteration}, tool context local only)")
response_text, tokens_generated, tps = await _generate_with_consensus(
prompt=prompt,
max_tokens=request.max_tokens or 1024,
temperature=request.temperature or 0.7,
swarm_manager=swarm_manager,
federated_swarm=None # Force local-only
)
logger.info(f"Generated response ({len(response_text)} chars, {tokens_generated} tokens)")
logger.debug(f"Response: {response_text[:200]}...")
@@ -306,10 +595,30 @@ async def handle_chat_completion(
# Check for tool calls
parsed_content, tool_calls_parsed = parse_tool_calls(response_text)
# Log assistant response to chatlog
chat_logger.log_assistant_message(response_text, has_tool_calls=bool(tool_calls_parsed))
if tool_calls_parsed:
# Log each tool call
for i, tc in enumerate(tool_calls_parsed, 1):
tool_name = tc.get("function", {}).get("name", "")
args_str = tc.get("function", {}).get("arguments", "{}")
try:
args_dict = json.loads(args_str) if isinstance(args_str, str) else args_str
except json.JSONDecodeError:
args_dict = {"raw": args_str}
chat_logger.log_tool_call(tool_name, args_dict, i)
# Capture thinking for OpenCode streaming (first occurrence only)
if not thinking_captured:
# Use the parsed content (without tool calls) as the reasoning
thinking_content = parsed_content or ""
thinking_captured = True
if not tool_calls_parsed:
# No more tools - this is the final answer
logger.info(f"✅ Final answer (no tools) after {iteration} iteration(s)")
return _create_response(parsed_content, [], "stop", prompt, request, swarm_manager)
return _create_response(parsed_content, [], "stop", prompt, request, swarm_manager, thinking_content)
# Tools detected - execute them
logger.info(f"🔧 Found {len(tool_calls_parsed)} tool call(s)")
@@ -318,22 +627,73 @@ async def handle_chat_completion(
args_str = tc.get("function", {}).get("arguments", "{}")
logger.info(f" [{i+1}] {tool_name}: {args_str[:100]}...")
# Add assistant message to history
messages.append(ChatMessage(role="assistant", content=response_text))
# Add assistant message to history with tool_calls (if any)
# This preserves the tool call IDs for proper tool message association
assistant_message = ChatMessage(
role="assistant",
content=response_text
)
if tool_calls_parsed:
# Convert tool calls to proper ToolCall objects with IDs
from api.models import ToolCall
tc_objects = []
for i, tc_dict in enumerate(tool_calls_parsed):
tc_id = tc_dict.get("id", f"call_{i}")
tc_objects.append(ToolCall(
id=tc_id,
type="function",
function={
"name": tc_dict["function"]["name"],
"arguments": tc_dict["function"]["arguments"]
}
))
assistant_message.tool_calls = tc_objects
messages.append(assistant_message)
# Execute all tools
logger.info(f"⏱️ Executing tools...")
tool_results_str = await _execute_tools(tool_calls_parsed, client_working_dir, get_tool_executor())
tool_results = await _execute_tools(tool_calls_parsed, client_working_dir, get_tool_executor())
# Add tool result to history with STOP instruction
# The model needs to be told explicitly to STOP calling tools
tool_result_with_instruction = (
f"{tool_results_str}\n\n"
f"IMPORTANT: You have received the tool result above. "
f"DO NOT call any more tools. Provide your final answer now."
)
messages.append(ChatMessage(role="tool", content=tool_result_with_instruction))
logger.info(f"✅ Tools executed ({len(tool_results_str)} chars)")
# Log tool results to chatlog (single combined log for debugging)
combined_strings = [f"Tool {i+1} ({name}): {result}" for i, (name, result) in enumerate(tool_results)]
chat_logger.log_tool_result("combined", "\n\n".join(combined_strings), success=True)
# Add tool result to history - one message per tool call with proper tool_call_id
for i, ((tool_name, tool_result), tc) in enumerate(zip(tool_results, tool_calls_parsed)):
tool_call_id = tc.get("id", f"call_{i}")
# Format the tool result message with explicit instructions
# This tells the model exactly what to do with the result
if tool_name == "read":
instruction = "The file contents are shown above. READ THIS FILE CONTENT ALOUD to the user. Do not call additional tools."
elif tool_name == "write":
instruction = "The file has been successfully written. CONFIRM to the user that the file was created with the content shown above. Do not call additional tools."
elif tool_name == "bash":
# Check if this was a verification command (ls, grep) vs an action command
if "ls" in tool_result.lower() or "grep" in tool_result.lower():
instruction = "CRITICAL: The listing is shown above. If the user asked to READ a specific file and you can see it exists in this listing, you MUST immediately USE THE read TOOL NOW with the exact filename from the listing. Do not summarize first - READ THE FILE immediately. Use the filename exactly as shown (e.g., 'my-secret.log' not '/path/to/my-secret.log'). If the user asked to just CHECK what files exist (without reading), then summarize. If the requested file is NOT in the listing, tell the user it doesn't exist."
else:
instruction = "The command has been executed. SUMMARIZE the output above to answer the user's request. Do not call additional tools."
else:
instruction = "The tool has completed. Use the result shown above to answer the user's request. Do not call additional tools."
tool_message_content = (
f"Tool Result ({tool_name}):\n"
f"{tool_result}\n\n"
f"INSTRUCTION: {instruction}"
)
messages.append(ChatMessage(
role="tool",
content=tool_message_content,
tool_call_id=tool_call_id,
name=tool_name
))
logger.info(f" ✓ Tool result {i+1} added to history (tool_call_id={tool_call_id}, name={tool_name})")
logger.info(f"✅ Tools executed ({len(tool_results)} results)")
# Continue loop - generate response with tool results
logger.info(f"🔄 Generating response with tool results...")
@@ -341,20 +701,55 @@ async def handle_chat_completion(
# Format with tool results (but DON'T include tool instruction - model should just use results)
next_prompt = format_messages_with_tools(messages, None if use_opencode_tools else request.tools)
response_text, tokens_generated, tps = await _generate_with_local_swarm(
swarm_manager, next_prompt, request.max_tokens or 1024, request.temperature or 0.7
logger.info(f"📤 Prompt sent to model after tool execution:")
logger.info(f" Total tokens: {count_tokens(next_prompt)}")
logger.info(f" Messages in history: {len(messages)}")
for i, msg in enumerate(messages):
logger.info(f" [{i}] {msg.role}: {msg.content[:100]}{'...' if len(msg.content) > 100 else ''}")
if msg.tool_calls:
for j, tc in enumerate(msg.tool_calls):
logger.info(f" Tool call {j}: {tc.function.get('name')} ({tc.function.get('arguments')})")
if msg.tool_call_id:
logger.info(f" (tool_call_id: {msg.tool_call_id}, name: {msg.name})")
logger.debug(f"Full prompt:\n{next_prompt[:1000]}...")
response_text, tokens_generated, tps = await _generate_with_consensus(
prompt=next_prompt,
max_tokens=request.max_tokens or 1024,
temperature=request.temperature or 0.7,
swarm_manager=swarm_manager,
federated_swarm=None # Tool result processing is local-only
)
logger.info(f"Generated with tool results ({len(response_text)} chars, {tokens_generated} tokens)")
logger.info(f"Generated with tool results ({len(response_text)} chars, {tokens_generated} tokens)")
logger.debug(f"Response: {response_text[:200]}...")
# Check for more tools in the new response
parsed_content, tool_calls_parsed = parse_tool_calls(response_text)
# Log assistant response to chatlog
chat_logger.log_assistant_message(response_text, has_tool_calls=bool(tool_calls_parsed))
if tool_calls_parsed:
# Log each tool call
for i, tc in enumerate(tool_calls_parsed, 1):
tool_name = tc.get("function", {}).get("name", "")
args_str = tc.get("function", {}).get("arguments", "{}")
try:
args_dict = json.loads(args_str) if isinstance(args_str, str) else args_str
except json.JSONDecodeError:
args_dict = {"raw": args_str}
chat_logger.log_tool_call(tool_name, args_dict, i)
# Capture thinking if not already captured
if not thinking_captured:
thinking_content = parsed_content or ""
thinking_captured = True
if not tool_calls_parsed:
# No more tools - final answer
logger.info(f"✅ Final answer (after tool execution) after {iteration} iteration(s)")
return _create_response(parsed_content, [], "stop", prompt, request, swarm_manager)
return _create_response(parsed_content, [], "stop", prompt, request, swarm_manager, thinking_content)
# More tools detected - continue loop
logger.info(f"🔧 More tools found - continuing loop")
@@ -362,4 +757,4 @@ async def handle_chat_completion(
# Max iterations reached - force return last response
logger.warning(f"⚠️ Max tool iterations ({max_iterations}) reached")
logger.warning(f"⚠️ Returning last response (may include incomplete tool call)")
return _create_response(response_text, [], "stop", prompt, request, swarm_manager)
return _create_response(response_text, [], "stop", prompt, request, swarm_manager, thinking_content)
+13 -7
@@ -153,7 +153,13 @@ def _filter_messages(messages: List[ChatMessage]) -> List[ChatMessage]:
def _add_tool_instructions(messages: List[ChatMessage]) -> List[ChatMessage]:
"""Add tool instructions to messages if needed.
"""Add tool instructions to the beginning of messages.
Tool instructions are now ALWAYS injected by default so any client
(Continue, hollama, etc.) can use tools without requiring client-side
tool instruction injection.
TODO: Add a "plan mode" that disables tool use for planning-only conversations.
Args:
messages: List of chat messages
@@ -161,13 +167,13 @@ def _add_tool_instructions(messages: List[ChatMessage]) -> List[ChatMessage]:
Returns:
Messages with tool instructions added
"""
has_assistant = any(msg.role == "assistant" for msg in messages)
if has_assistant:
return messages
tool_instructions = _load_tool_instructions()
logger.debug(f"Using {'opencode' if _USE_OPENCODE_TOOLS else 'local'} tool mode: {len(tool_instructions)} chars")
logger.debug(f"Injecting tool instructions: {len(tool_instructions)} chars")
# Check if instructions already present (avoid duplication)
if messages and messages[0].role == "system" and "AVAILABLE TOOLS" in messages[0].content:
logger.debug("Tool instructions already present, skipping injection")
return messages
return [ChatMessage(role="system", content=tool_instructions)] + messages
+3 -3
@@ -29,11 +29,11 @@ class ToolCall(BaseModel):
class ChatMessage(BaseModel):
"""A chat message."""
role: Literal["system", "user", "assistant", "tool"] = Field(..., description="Role of message sender")
role: Literal["system", "user", "assistant", "tool"] = Field(..., description="Role of the message sender")
content: str = Field(default="", description="Message content")
tool_calls: Optional[List[ToolCall]] = Field(default=None, description="Tool calls from assistant")
#tool_call_id: Optional[str] = Field(default=None, description="ID of tool call this message is responding to")
#name: Optional[str] = Field(default=None, description="Name of the tool/function")
tool_call_id: Optional[str] = Field(default=None, description="ID of tool call this message is responding to")
name: Optional[str] = Field(default=None, description="Name of the tool/function")
model_config = ConfigDict(
# Use Pydantic's exclude_none to omit tool_calls when None
+221 -23
@@ -225,41 +225,128 @@ def set_federated_swarm(swarm):
async def _stream_response(response: ChatCompletionResponse):
"""Stream a chat completion response as Server-Sent Events.
"""Stream a chat completion response as Server-Sent Events using OpenCode-compatible format.
For compatibility with OpenAI format, we use delta format for streaming.
The response is sent as a single chunk since we don't support
true token-by-token streaming yet.
This implementation matches the Vercel AI SDK OpenAI-compatible format:
- Uses reasoning_content for thinking/planning (before tool calls)
- Properly streams tool_calls with incremental arguments
- Eventually switches to content for final answer
"""
import json
from api.models import ChatCompletionStreamResponse, ChatCompletionStreamChoice
# Convert to streaming format with delta
message = response.choices[0].message
choice = ChatCompletionStreamChoice(
index=0,
delta={"content": message.content},
finish_reason="stop"
)
content = message.content or ""
tool_calls = message.tool_calls or []
thinking_content = getattr(response, '_thinking', None) # Get thinking if attached
stream_response = ChatCompletionStreamResponse(
id=response.id,
created=response.created,
model=response.model,
choices=[choice]
)
# CASE 1: Response has tool calls - need to stream thinking + tool_calls separately
if tool_calls:
# Step 1: Stream reasoning_content (thinking) if there's any thinking captured
if thinking_content:
# Ideally the reasoning would be streamed token-by-token as multiple chunks
# For now, send it as a single reasoning block
chunk = {
"id": response.id,
"object": "chat.completion.chunk",
"created": response.created,
"model": response.model,
"choices": [{
"index": 0,
"delta": {
"reasoning_content": thinking_content
},
"finish_reason": None
}]
}
yield f"data: {json.dumps(chunk)}\n\n"
# Send as SSE event
data = stream_response.model_dump_json(exclude_none=True)
logger.debug(f"Streaming SSE data (delta format): {len(data)} chars")
# Step 2: Emit tool_calls in the format OpenCode expects
for i, tc in enumerate(tool_calls):
# First chunk: tool_calls with empty arguments (just structure)
tc_id = tc.id
tc_name = tc.function.get("name", "")
chunk1 = {
"id": response.id,
"object": "chat.completion.chunk",
"created": response.created,
"model": response.model,
"choices": [{
"index": 0,
"delta": {
"tool_calls": [{
"index": i,
"id": tc_id,
"type": "function",
"function": {
"name": tc_name,
"arguments": ""
}
}]
},
"finish_reason": None
}]
}
yield f"data: {json.dumps(chunk1)}\n\n"
yield f"data: {data}\n\n"
# Second chunk: arguments content (if any)
args = tc.function.get("arguments", "")
if args:
chunk2 = {
"id": response.id,
"object": "chat.completion.chunk",
"created": response.created,
"model": response.model,
"choices": [{
"index": 0,
"delta": {
"tool_calls": [{
"index": i,
"function": {
"arguments": args
}
}]
},
"finish_reason": None
}]
}
yield f"data: {json.dumps(chunk2)}\n\n"
# Send done event
# Step 3: Final chunk with finish_reason="tool_calls"
final_chunk = {
"id": response.id,
"object": "chat.completion.chunk",
"created": response.created,
"model": response.model,
"choices": [{
"index": 0,
"delta": {},
"finish_reason": "tool_calls"
}]
}
yield f"data: {json.dumps(final_chunk)}\n\n"
yield "data: [DONE]\n\n"
return
# CASE 2: Pure text response (no tools) - stream as content
# This is the final answer after tool execution or a simple response
chunk = {
"id": response.id,
"object": "chat.completion.chunk",
"created": response.created,
"model": response.model,
"choices": [{
"index": 0,
"delta": {
"content": content
},
"finish_reason": "stop"
}]
}
yield f"data: {json.dumps(chunk)}\n\n"
yield "data: [DONE]\n\n"
logger.debug(f"Streaming complete")
@router.post("/v1/chat/completions")
async def chat_completions(request: ChatCompletionRequest, fastapi_request: Request):
@@ -325,3 +412,114 @@ async def chat_completions(request: ChatCompletionRequest, fastapi_request: Requ
logger.error(f"Error type: {type(e).__name__}")
logger.error(f"Error message: {str(e)}")
raise HTTPException(status_code=500, detail=f"Generation failed: {str(e)}")
# Federation endpoint for peer-to-peer generation
@router.post("/v1/federation/vote")
async def federation_vote(request: Request):
"""Handle federation vote request from a peer swarm.
This endpoint allows peer swarms to request generation from this swarm
as part of the federation consensus process.
IMPORTANT: Peer nodes should NOT execute tools. They only provide text
responses. The head node handles all tool execution after consensus.
"""
try:
data = await request.json()
prompt = data.get("prompt", "")
max_tokens = data.get("max_tokens", 1024)
temperature = data.get("temperature", 0.7)
logger.info(f"🗳️ Federation vote request from {request.client.host}")
logger.debug(f" Prompt: {prompt[:100]}...")
# Get swarm manager from app state
swarm_manager = getattr(request.app.state, 'swarm_manager', None)
if not swarm_manager:
raise HTTPException(status_code=503, detail="Swarm not ready")
# Strip tool instructions from prompt for peer generation
# Peers should only generate text - head node handles tools
# Look for system message with tool instructions and remove it
clean_prompt = _strip_tool_instructions(prompt)
# Generate response (text only, no tools)
start_time = time.time()
result = await swarm_manager.generate(
prompt=clean_prompt,
max_tokens=max_tokens,
temperature=temperature,
use_consensus=True
)
elapsed_ms = (time.time() - start_time) * 1000
response = result.selected_response
logger.info(f"✅ Federation vote complete ({response.tokens_generated} tokens, {elapsed_ms:.0f}ms)")
# Use actual confidence from consensus result instead of hardcoded value
# This ensures fair comparison between local and peer swarms
actual_confidence = result.confidence if hasattr(result, 'confidence') else 0.8
return {
"response": response.text,
"confidence": actual_confidence,
"latency_ms": elapsed_ms,
"worker_count": len(swarm_manager.workers) if hasattr(swarm_manager, 'workers') else 1,
"tokens_per_second": response.tokens_per_second,
"tokens_generated": response.tokens_generated
}
except Exception as e:
logger.exception("Error handling federation vote")
raise HTTPException(status_code=500, detail=f"Federation vote failed: {str(e)}")
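# Example exchange (illustrative values only):
#   POST /v1/federation/vote  {"prompt": "...", "max_tokens": 1024, "temperature": 0.7}
#   -> {"response": "...", "confidence": 0.92, "latency_ms": 1840,
#       "worker_count": 2, "tokens_per_second": 24.5, "tokens_generated": 45}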
def _strip_tool_instructions(prompt: str) -> str:
"""Strip tool instructions from prompt for peer generation.
Peers should not generate tool calls - only the head node handles tools.
This removes the system message containing tool instructions.
Args:
prompt: Original prompt with potential tool instructions
Returns:
Clean prompt without tool instructions
"""
# Look for common tool instruction patterns
# Pattern 1: System message with "AVAILABLE TOOLS"
if "AVAILABLE TOOLS" in prompt or "You have access to tools" in prompt:
# Split by message boundaries and filter out system tool messages
lines = prompt.split('\n')
filtered_lines = []
skip_until_next_role = False
for line in lines:
# Check if this is a system message start with tool instructions
if ('<|im_start|>system' in line or line.strip() == 'system:') and not skip_until_next_role:
# Check if next few lines contain tool instructions
# We'll collect lines and check
filtered_lines.append(line)
skip_until_next_role = True
continue
if skip_until_next_role:
# Check for end of system message
if '<|im_end|>' in line or (line.strip().startswith('<|im_start|>') and 'system' not in line):
skip_until_next_role = False
filtered_lines.append(line)
# Check if this line contains tool instruction markers
elif any(marker in line for marker in ['AVAILABLE TOOLS', 'TOOL:', 'ARGUMENTS:', 'You have access to tools']):
# Skip this line - it's part of tool instructions
continue
else:
filtered_lines.append(line)
else:
filtered_lines.append(line)
return '\n'.join(filtered_lines)
return prompt
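# Example (illustrative): a ChatML prompt whose system block contains lines such as
# "AVAILABLE TOOLS", "TOOL:" or "ARGUMENTS:" has those lines dropped before peer
# generation; a prompt without any tool markers is returned unchanged.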
+2 -1
@@ -44,8 +44,9 @@ class APIServer:
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Lifespan context manager for startup/shutdown."""
# Startup: Set swarm manager in routes
# Startup: Set swarm manager in routes and app state
set_swarm_manager(self.swarm_manager)
app.state.swarm_manager = self.swarm_manager # For federation endpoint
# Set tool mode in routes
from api.routes import set_use_opencode_tools
set_use_opencode_tools(self.use_opencode_tools)
+97
@@ -0,0 +1,97 @@
"""Chatlog for debugging tool execution.
Writes a human-readable markdown log of tool calls and results.
Enabled by setting LOCAL_SWARM_CHATLOG=1 environment variable.
Log file defaults to 'chatlog.md' in the current working directory.
"""
import os
import json
from datetime import datetime
from typing import Optional
class ChatLogger:
"""Logs chat interactions and tool execution in opencode-style format."""
def __init__(self, log_path: Optional[str] = None):
self.log_path = log_path or os.getenv('LOCAL_SWARM_CHATLOG_PATH', 'chatlog.md')
self.enabled = os.getenv('LOCAL_SWARM_CHATLOG', '0') == '1'
if self.enabled:
self._initialize_log()
def _initialize_log(self):
"""Create log file with header if it doesn't exist."""
dir_path = os.path.dirname(self.log_path) or '.'
os.makedirs(dir_path, exist_ok=True)
with open(self.log_path, 'a') as f:
f.write(f"\n\n# Local Swarm Session - {datetime.now().isoformat()}\n\n")
def _timestamp(self) -> str:
"""Get current timestamp."""
return datetime.now().strftime("%H:%M:%S")
def log_user_message(self, content: str):
"""Log a user message."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] User\n\n")
f.write(f"{content}\n\n")
def log_assistant_message(self, content: str, has_tool_calls: bool = False):
"""Log an assistant response."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] Assistant\n\n")
if has_tool_calls:
# Use thinking block for messages that contain tool calls
f.write(f"```thinking\n{content}\n```\n")
else:
f.write(f"{content}\n\n")
def log_tool_call(self, tool_name: str, arguments: dict, call_index: int = 1):
"""Log a tool execution request."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] Tool Call #{call_index}\n\n")
f.write(f"**Tool:** `{tool_name}`\n\n")
f.write(f"**Arguments:**\n")
try:
args_json = json.dumps(arguments, indent=2)
except Exception:
args_json = str(arguments)
f.write(f"```json\n{args_json}\n```\n")
def log_tool_result(self, tool_name: str, result: str, call_index: int = 1, success: bool = True):
"""Log a tool execution result."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] Tool Result #{call_index}\n\n")
status = "✓ Success" if success else "✗ Failed"
f.write(f"**Tool:** `{tool_name}` - {status}\n\n")
f.write(f"**Output:**\n")
f.write(f"```\n{result}\n```\n")
def log_system(self, message: str):
"""Log a system message."""
if not self.enabled:
return
with open(self.log_path, 'a') as f:
f.write(f"\n## [{self._timestamp()}] System\n\n")
f.write(f"> {message}\n\n")
# Global logger instance (lazy initialization handled per request)
_global_logger: Optional[ChatLogger] = None
def get_chat_logger() -> ChatLogger:
"""Get the global chat logger instance (creates one if needed)."""
global _global_logger
if _global_logger is None:
_global_logger = ChatLogger()
return _global_logger
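# Example usage (illustrative): with LOCAL_SWARM_CHATLOG=1 set,
#   get_chat_logger().log_tool_call("read", {"filePath": "notes.txt"})
# appends a "## [HH:MM:SS] Tool Call #1" section with the arguments rendered as JSON
# to chatlog.md (or to LOCAL_SWARM_CHATLOG_PATH if that is set).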
+27 -26
@@ -351,34 +351,35 @@ class FederatedSwarm:
for vote in peer_votes:
all_votes.append((vote.response_text, vote.confidence, vote.peer_name))
if self.consensus_strategy == "best_of_n":
# Use the consensus engine to pick the best response
from swarm.consensus import ConsensusEngine
# Always use quality-based selection - the head node judges ALL responses
# This prevents overconfident self-reported scores from beating genuinely better responses
from swarm.consensus import ConsensusEngine, GenerationResponse
responses = [
GenerationResponse(
text=text,
tokens_generated=0,
tokens_per_second=0,
latency_ms=0,
backend_name=source
)
for text, _, source in all_votes
]
# Use synchronous quality scoring (no embeddings needed)
engine = ConsensusEngine(strategy="quality")
# _quality_vote is async but only uses sync scoring, so we
# use the simpler _fastest_vote-style approach here
scores = [engine._quality_score(r) for r in responses]
best_idx = max(range(len(scores)), key=lambda i: scores[i])
best = all_votes[best_idx]
print(f" ✓ Selected response from {best[2]} (quality score: {scores[best_idx]:.2f})")
return best[0], best[2]
# Default: weighted selection - pick highest confidence
best = max(all_votes, key=lambda x: x[1])
print(f" ✓ Selected response from {best[2]} (confidence: {best[1]:.2f})")
# Use quality scoring to objectively compare all responses
engine = ConsensusEngine(strategy="quality")
scores = [engine._quality_score(r) for r in responses]
# Find best response based on actual quality, not self-reported confidence
best_idx = max(range(len(scores)), key=lambda i: scores[i])
best = all_votes[best_idx]
# Show comparison
print(f" 📊 Quality scores:")
for i, (text, conf, source) in enumerate(all_votes):
print(f" {source}: {scores[i]:.2f} (self-reported: {conf:.2f})")
print(f" ✓ Selected response from {best[2]} (quality score: {scores[best_idx]:.2f})")
return best[0], best[2]
async def get_federation_status(self) -> Dict[str, Any]:
+37 -16
@@ -121,6 +121,13 @@ class ToolExecutor:
if not file_path:
return "Error: filePath required"
# Check if original path was absolute or used ~ before expansion
original_was_absolute = os.path.isabs(file_path) or file_path.startswith("~")
# Expand ~ to home directory
file_path = os.path.expanduser(file_path)
working_dir = os.path.expanduser(working_dir)
# Security: Prevent directory traversal
file_path = os.path.normpath(file_path)
if file_path.startswith("..") or file_path.startswith("/.."):
@@ -132,14 +139,16 @@ class ToolExecutor:
else:
full_path = file_path
# Additional security: ensure resolved path is within working_dir
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
# Additional security: only enforce working_dir restriction for relative paths
# If user explicitly specified an absolute path or ~ path, allow it
if not original_was_absolute:
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
logger.debug(f" 📁 Reading: {file_path}")
logger.debug(f" 📍 Working dir: {working_dir}")
@@ -163,6 +172,13 @@ class ToolExecutor:
if not file_path:
return "Error: filePath required"
# Check if original path was absolute or used ~ before expansion
original_was_absolute = os.path.isabs(file_path) or file_path.startswith("~")
# Expand ~ to home directory
file_path = os.path.expanduser(file_path)
working_dir = os.path.expanduser(working_dir)
# Security: Prevent directory traversal
file_path = os.path.normpath(file_path)
if file_path.startswith("..") or file_path.startswith("/.."):
@@ -174,14 +190,16 @@ class ToolExecutor:
else:
full_path = file_path
# Additional security: ensure resolved path is within working_dir
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
# Additional security: only enforce working_dir restriction for relative paths
# If user explicitly specified an absolute path or ~ path, allow it
if not original_was_absolute:
try:
real_working_dir = os.path.realpath(working_dir)
real_full_path = os.path.realpath(full_path)
if not real_full_path.startswith(real_working_dir):
return f"Error: Access denied - path outside working directory"
except Exception:
pass # If realpath fails, continue anyway
logger.debug(f" 📁 Writing: {file_path}")
logger.debug(f" 📍 Working dir: {working_dir}")
@@ -208,6 +226,9 @@ class ToolExecutor:
if not command:
return "Error: command required"
# Expand ~ to home directory in cwd
cwd = os.path.expanduser(cwd)
# Security: Block dangerous commands
dangerous = ["rm -rf /", "> /dev", "mkfs", "dd if=/dev/zero", ":(){ :|:& };:"]
for d in dangerous: