Harnesses under analysis: - opencode (Go-based coding agent) - pi (minimal terminal coding harness by Mario Zechner) - hermes (Nous Research agent) - forgecode (AI pair programmer with sub-agents) Each harness folder contains: - repo/: Source code from respective repositories - feedback/localllm/: Community feedback for local/smaller models - feedback/frontier/: Community feedback for frontier models Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
3.9 KiB
Tool Calling Reliability with ForgeCode - Feedback Report
Topic: Tool use reliability, function calling, common errors
Source References: ForgeCode Blog, GitHub issues, Reddit
Date Compiled: April 9, 2026
Overview
Tool calling reliability is a major differentiator for ForgeCode. The harness has implemented numerous optimizations to improve tool use across models.
The Seven Failure Modes (From ForgeCode Blog)
1. Same Model, Very Different Performance
Problem: Interactive-first design fails in benchmarks (no user to answer questions)
Fix: Non-Interactive Mode with rewritten system prompts
2. Tool Descriptions Don't Guarantee Correctness
Problem Categories:
- Wrong tool selected (e.g.,
shellinstead of structurededit) - Correct tool, wrong argument names
- Correct tool, correct arguments, wrong sequencing
Fix: Targeted micro-evals isolating each class per tool, per model
3. Tool Naming is a Reliability Variable
Key Finding: Models pattern-match against training data first
Concrete Example:
- Renaming edit tool arguments to
old_stringandnew_string - Result: "measurably dropped tool-call error rates immediately—same model, same prompt"
"If you are seeing persistent tool-call errors and attribute them entirely to model capability, check your naming first."
4. Context Size is a Multiplier, Not a Substitute
Problem: More context only helps after finding the right entry point
Insight: Entry-point discovery latency is the bottleneck
5. Time Limits Punish Trajectories
Problem: Failed tool calls burn real seconds; brilliant but meandering paths timeout
Fix: Speed architecture with parallel subagents
6. Planning Tools Only Work if Enforced
Problem: Optional todo_write tool ignored under pressure
Fix: Made mandatory via low-level evals
Result: 38% → 66% pass rate
7. TermBench is More About Speed Than Intelligence
Fix: Progressive thinking policy (high thinking early, low during execution)
Model-Specific Tool Calling Issues
GPT 5.4
- Issue: Persistent tool-call errors
- Fixes Applied:
- Reordered JSON schema fields (
requiredbeforeproperties) - Flattened nested schemas
- Explicit truncation reminders
- Reordered JSON schema fields (
Qwen 3.5
- Issue: Multiple system messages break strict chat templates
- Status: Open issue (#2894)
- Workaround: None yet; use different model or await fix
Gemma 4
- Issue: Initial releases had tool calling format issues
- Fix: Use latest oMLX / llama.cpp
Best Practices for Tool Reliability
- Use established argument names:
old_string/new_stringbetter than generic names - Flatten schemas: Reduce nesting in tool definitions
- Order matters: Put
requiredbeforepropertiesin JSON schema - Test with micro-evals: Isolate specific tool+model combinations
- Monitor truncation: Add explicit reminders when files partially read
ForgeCode Services Enhancements
The proprietary runtime layer includes:
- Semantic entry-point discovery: Lightweight semantic pass before exploration
- Dynamic skill loading: Specialized instructions loaded when needed
- Tool-call correction layer: Heuristic + static analysis for argument validation
Note: These features are part of ForgeCode Services (optional), not the open-source CLI.
Community Tips
From Reddit and GitHub discussions:
- LM Studio > raw llama.cpp for Qwen3.5 XML tool parsing
- LM Studio 0.4.9+ handles tool calling more reliably
- llama.cpp
--jinjaflag helps with Qwen tool templates
Source References
- ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/
- GitHub Issue #2894: https://github.com/antinomyhq/forgecode/issues/2894
- Reddit r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/