Initial commit: coding harness feedback analysis

Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
2026-04-09 15:13:45 +02:00
commit 51123212c4
46 changed files with 7213 additions and 0 deletions
@@ -0,0 +1,111 @@
+# Tool Calling Reliability with ForgeCode - Feedback Report
+
+**Topic:** Tool use reliability, function calling, common errors  
+**Source References:** ForgeCode Blog, GitHub issues, Reddit  
+**Date Compiled:** April 9, 2026
+
+---
+
+## Overview
+
+Tool-calling reliability is a major differentiator for ForgeCode. The harness applies optimizations to system prompts, tool naming, and JSON schemas to improve tool use across models.
+
+---
+
+## The Seven Failure Modes (From ForgeCode Blog)
+
+### 1. Same Model, Very Different Performance
+**Problem:** Interactive-first design fails in benchmarks (no user to answer questions)  
+**Fix:** Non-Interactive Mode with rewritten system prompts
+
+### 2. Tool Descriptions Don't Guarantee Correctness
+**Problem Categories:**
+- Wrong tool selected (e.g., `shell` instead of structured `edit`)
+- Correct tool, wrong argument names
+- Correct tool, correct arguments, wrong sequencing
+
+**Fix:** Targeted micro-evals isolating each class per tool, per model
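
The three problem categories above lend themselves to a small classifier that a micro-eval could use to label each failed run. The following is only a sketch under stated assumptions: the `ToolCall` type, the `classify` function, and the label strings are hypothetical names chosen for illustration, not ForgeCode's eval code.

```python
from dataclasses import dataclass


@dataclass
class ToolCall:
    """Hypothetical record of one tool invocation: tool name plus arguments."""
    name: str
    args: dict


def classify(expected: list, actual: list) -> str:
    """Label a run with the first failure class found, or 'ok'."""
    exp_names = [c.name for c in expected]
    act_names = [c.name for c in actual]

    # Class 1: the set of tools used differs from the expected set.
    if sorted(exp_names) != sorted(act_names):
        return "wrong_tool"

    # Class 2: right tools, but some call uses different argument names.
    # Calls to the same tool are paired in order of appearance.
    for name in set(exp_names):
        exp_calls = [c for c in expected if c.name == name]
        act_calls = [c for c in actual if c.name == name]
        for e, a in zip(exp_calls, act_calls):
            if set(e.args) != set(a.args):
                return "wrong_argument_names"

    # Class 3: right tools and argument names, but in the wrong order.
    if exp_names != act_names:
        return "wrong_sequence"
    return "ok"


expected = [
    ToolCall("read", {"path": "a.py"}),
    ToolCall("edit", {"path": "a.py", "old_string": "x", "new_string": "y"}),
]
actual = [
    ToolCall("read", {"path": "a.py"}),
    ToolCall("edit", {"path": "a.py", "old": "x", "new": "y"}),
]
print(classify(expected, actual))
```

Counting these labels per tool and per model gives the isolation the fix describes; argument values are deliberately not compared here, only argument names.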
+
+### 3. Tool Naming is a Reliability Variable
+**Key Finding:** Models pattern-match against training data first
+
+**Concrete Example:**
+- Renaming edit tool arguments to `old_string` and `new_string`
+- Result: "measurably dropped tool-call error rates immediately—same model, same prompt"
+
+> "If you are seeing persistent tool-call errors and attribute them entirely to model capability, check your naming first."
+
+### 4. Context Size is a Multiplier, Not a Substitute
+**Problem:** More context only helps after finding the right entry point  
+**Insight:** Entry-point discovery latency is the bottleneck
+
+### 5. Time Limits Punish Trajectories
+**Problem:** Failed tool calls burn real seconds; brilliant but meandering paths time out  
+**Fix:** Speed architecture with parallel subagents
+
+### 6. Planning Tools Only Work if Enforced
+**Problem:** Optional `todo_write` tool ignored under pressure  
+**Fix:** Made mandatory via low-level evals  
+**Result:** 38% → 66% pass rate
+
+### 7. TermBench is More About Speed Than Intelligence
+**Fix:** Progressive thinking policy (high thinking early, low during execution)
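
A progressive thinking policy of this kind can be pictured as a function from the step index to a reasoning-effort level. This is a minimal sketch: the function name, the `planning_steps` threshold, and the `"high"`/`"low"` labels are assumptions for illustration, and the source does not say how ForgeCode decides when to switch.

```python
def thinking_level(step: int, planning_steps: int = 3) -> str:
    """Return 'high' for the first `planning_steps` steps, 'low' afterwards.

    `step` is the zero-based index of the agent turn. The threshold is a
    hypothetical tuning knob, not a value taken from ForgeCode.
    """
    return "high" if step < planning_steps else "low"


# Effort levels for the first six turns of a run.
schedule = [thinking_level(i) for i in range(6)]
print(schedule)
```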
+
+---
+
+## Model-Specific Tool Calling Issues
+
+### GPT 5.4
+- **Issue:** Persistent tool-call errors
+- **Fixes Applied:**
+  - Reordered JSON schema fields (`required` before `properties`)
+  - Flattened nested schemas
+  - Added explicit truncation reminders for partially read files
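
The first two fixes can be illustrated with a schema-normalization pass over a tool definition. The sketch below is an assumption-laden example, not ForgeCode's code: `normalize_schema` and `flatten_properties` are hypothetical helpers, and joining nested names with `_` is just one possible flattening convention.

```python
import json


def flatten_properties(props: dict, required: list, prefix: str = ""):
    """Lift nested object properties to the top level (e.g. range.start -> range_start)."""
    flat, flat_required = {}, []
    for name, spec in props.items():
        full = f"{prefix}{name}"
        if spec.get("type") == "object" and "properties" in spec:
            child_props, child_req = flatten_properties(
                spec["properties"], spec.get("required", []), prefix=f"{full}_"
            )
            flat.update(child_props)
            # A nested field stays required only if its parent was required too.
            if name in required:
                flat_required.extend(child_req)
        else:
            flat[full] = spec
            if name in required:
                flat_required.append(full)
    return flat, flat_required


def normalize_schema(schema: dict) -> dict:
    """Return a flattened copy with `required` emitted before `properties`."""
    props, required = flatten_properties(
        schema.get("properties", {}), schema.get("required", [])
    )
    # Python dicts keep insertion order, so json.dumps emits keys in this order.
    out = {"type": "object", "required": required, "properties": props}
    for key, value in schema.items():
        out.setdefault(key, value)  # carry over description etc. unchanged
    return out


schema = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "range": {
            "type": "object",
            "properties": {"start": {"type": "integer"}, "end": {"type": "integer"}},
            "required": ["start"],
        },
    },
    "required": ["path", "range"],
}
normalized = normalize_schema(schema)
print(json.dumps(normalized, indent=2))
```

Key order has no meaning in JSON Schema itself; the reordering only changes what the model sees first when the schema is serialized into the prompt.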
+
+### Qwen 3.5
+- **Issue:** Multiple system messages break strict chat templates
+- **Status:** Open issue (#2894)
+- **Workaround:** None yet; use a different model or await a fix
+
+### Gemma 4
+- **Issue:** Initial releases had tool calling format issues
+- **Fix:** Use the latest oMLX / llama.cpp
+
+---
+
+## Best Practices for Tool Reliability
+
+1. **Use established argument names:** `old_string`/`new_string` work better than generic names
+2. **Flatten schemas:** Reduce nesting in tool definitions
+3. **Order matters:** Put `required` before `properties` in JSON schema
+4. **Test with micro-evals:** Isolate specific tool+model combinations
+5. **Monitor truncation:** Add explicit reminders when files are only partially read
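
As an illustration of practice 5, a file-reading tool can append a visible notice whenever it returns only part of a file. This is a sketch only: the function name, the 200-line default, and the wording of the notice are assumptions, not ForgeCode's actual reminder.

```python
def read_with_reminder(text: str, max_lines: int = 200) -> str:
    """Return at most `max_lines` lines, with an explicit notice if any were cut."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    shown = "\n".join(lines[:max_lines])
    # Hypothetical reminder text; the point is that the model is told
    # it has not seen the whole file.
    return (
        f"{shown}\n"
        f"[TRUNCATED: showing lines 1-{max_lines} of {len(lines)}; "
        f"read the remaining lines before editing]"
    )


sample = "a\nb\nc\nd\ne"
result = read_with_reminder(sample, max_lines=2)
print(result)
```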
+
+---
+
+## ForgeCode Services Enhancements
+
+The proprietary runtime layer includes:
+1. **Semantic entry-point discovery:** Lightweight semantic pass before exploration
+2. **Dynamic skill loading:** Specialized instructions loaded when needed
+3. **Tool-call correction layer:** Heuristic + static analysis for argument validation
+
+**Note:** These features are part of ForgeCode Services (optional), not the open-source CLI.
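
Since the correction layer is proprietary, its internals are not documented in the sources. Purely as an illustration of what a heuristic argument-name repair could look like, the sketch below uses an alias table plus fuzzy matching from Python's standard `difflib`. The `ALIASES` table, the `correct_args` function, and the 0.8 cutoff are all hypothetical.

```python
import difflib

# Hypothetical aliases a model might emit instead of the declared names.
ALIASES = {"old": "old_string", "new": "new_string", "file": "path"}


def correct_args(args: dict, allowed: list) -> dict:
    """Map unknown argument names onto declared ones where a confident match exists."""
    # Exact matches always win, so copy them first.
    fixed = {k: v for k, v in args.items() if k in allowed}
    for key, value in args.items():
        if key in allowed:
            continue
        target = ALIASES.get(key)
        if target not in allowed:
            # Fall back to fuzzy matching; leave the key untouched if nothing is close.
            close = difflib.get_close_matches(key, allowed, n=1, cutoff=0.8)
            target = close[0] if close else key
        fixed.setdefault(target, value)
    return fixed


allowed = ["path", "old_string", "new_string"]
repaired = correct_args({"file": "a.py", "old_str": "x", "new_string": "y"}, allowed)
print(repaired)
```

Unrecognized names are passed through unchanged so that a downstream validator can still reject the call, rather than the repair step silently dropping data.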
+
+---
+
+## Community Tips
+
+From Reddit and GitHub discussions:
+
+1. **LM Studio over raw llama.cpp** for Qwen 3.5 XML tool parsing
+2. **LM Studio 0.4.9+** handles tool calling more reliably
+3. **llama.cpp `--jinja` flag** helps with Qwen tool templates
+
+---
+
+## Source References
+
+1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
+2. **GitHub Issue #2894:** https://github.com/antinomyhq/forgecode/issues/2894
+3. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/