# Tool Calling Reliability with ForgeCode - Feedback Report

**Topic:** Tool use reliability, function calling, common errors

**Source References:** ForgeCode Blog, GitHub issues, Reddit

**Date Compiled:** April 9, 2026

---

## Overview

Tool calling reliability is a major differentiator for ForgeCode. The harness has implemented numerous optimizations to improve tool use across models.

---

## The Seven Failure Modes (From ForgeCode Blog)

### 1. Same Model, Very Different Performance

**Problem:** Interactive-first design fails in benchmarks (no user to answer questions)

**Fix:** Non-Interactive Mode with rewritten system prompts

### 2. Tool Descriptions Don't Guarantee Correctness

**Problem Categories:**

- Wrong tool selected (e.g., `shell` instead of structured `edit`)
- Correct tool, wrong argument names
- Correct tool, correct arguments, wrong sequencing

**Fix:** Targeted micro-evals isolating each class per tool, per model

### 3. Tool Naming is a Reliability Variable

**Key Finding:** Models pattern-match against training data first

**Concrete Example:**

- Renaming edit tool arguments to `old_string` and `new_string`
- Result: "measurably dropped tool-call error rates immediately" with the same model and the same prompt

> "If you are seeing persistent tool-call errors and attribute them entirely to model capability, check your naming first."

### 4. Context Size is a Multiplier, Not a Substitute

**Problem:** More context only helps after finding the right entry point

**Insight:** Entry-point discovery latency is the bottleneck

### 5. Time Limits Punish Trajectories

**Problem:** Failed tool calls burn real seconds; brilliant but meandering paths time out

**Fix:** Speed architecture with parallel subagents

### 6. Planning Tools Only Work if Enforced

**Problem:** Optional `todo_write` tool is ignored under pressure

**Fix:** Made mandatory via low-level evals

**Result:** 38% → 66% pass rate

### 7. TermBench is More About Speed Than Intelligence

**Fix:** Progressive thinking policy (high thinking early, low during execution)

---

## Model-Specific Tool Calling Issues

### GPT 5.4

- **Issue:** Persistent tool-call errors
- **Fixes Applied:**
  - Reordered JSON schema fields (`required` before `properties`)
  - Flattened nested schemas
  - Explicit truncation reminders

### Qwen 3.5

- **Issue:** Multiple system messages break strict chat templates
- **Status:** Open issue (#2894)
- **Workaround:** None yet; use a different model or await a fix

### Gemma 4

- **Issue:** Initial releases had tool calling format issues
- **Fix:** Use latest oMLX / llama.cpp

---

## Best Practices for Tool Reliability

1. **Use established argument names:** `old_string`/`new_string` work better than generic names
2. **Flatten schemas:** Reduce nesting in tool definitions
3. **Order matters:** Put `required` before `properties` in JSON schema
4. **Test with micro-evals:** Isolate specific tool+model combinations
5. **Monitor truncation:** Add explicit reminders when files are only partially read

---

## ForgeCode Services Enhancements

The proprietary runtime layer includes:

1. **Semantic entry-point discovery:** Lightweight semantic pass before exploration
2. **Dynamic skill loading:** Specialized instructions loaded when needed
3. **Tool-call correction layer:** Heuristic + static analysis for argument validation

**Note:** These features are part of ForgeCode Services (optional), not the open-source CLI.

---

## Community Tips

From Reddit and GitHub discussions:

1. **LM Studio > raw llama.cpp** for Qwen3.5 XML tool parsing
2. **LM Studio 0.4.9+** handles tool calling more reliably
3. **llama.cpp `--jinja` flag** helps with Qwen tool templates

---

## Source References

1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
2. **GitHub Issue #2894:** https://github.com/antinomyhq/forgecode/issues/2894
3. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/
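
---

The three problem categories listed under failure mode 2 (wrong tool, wrong argument names, wrong sequencing) can be made concrete with a small micro-eval check. The sketch below is only an illustration under stated assumptions, not ForgeCode's implementation: `ToolCall`, `classify_failure`, the label strings, and the `read` tool name are hypothetical and were introduced here for the example.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToolCall:
    """One tool invocation emitted by a model (hypothetical shape)."""
    name: str
    arguments: dict = field(default_factory=dict)


def classify_failure(expected: list[ToolCall], actual: list[ToolCall]) -> Optional[str]:
    """Return the first failure class found, or None if the trajectory matches.

    The labels mirror the three categories in the report:
    wrong tool, wrong sequencing, wrong argument names.
    """
    expected_names = [call.name for call in expected]
    actual_names = [call.name for call in actual]

    # Wrong tool: the multiset of tool names differs from the expected one.
    if sorted(expected_names) != sorted(actual_names):
        return "wrong_tool"

    # Wrong sequencing: the same tools were used, but in a different order.
    if expected_names != actual_names:
        return "wrong_sequence"

    # Wrong argument names: same tools in the same order, different keys.
    for exp, act in zip(expected, actual):
        if set(exp.arguments) != set(act.arguments):
            return "wrong_argument_names"

    return None


# Example: the model fell back to `shell` where a structured `edit` was expected.
expected = [
    ToolCall("read", {"path": "app.py"}),
    ToolCall("edit", {"path": "app.py", "old_string": "x", "new_string": "y"}),
]
actual = [
    ToolCall("read", {"path": "app.py"}),
    ToolCall("shell", {"command": "sed -i 's/x/y/' app.py"}),
]
print(classify_failure(expected, actual))  # wrong_tool
```

Tallying these labels per tool and per model is one way to isolate each class, as the report recommends. A real harness would likely also compare argument values, which this sketch deliberately leaves out.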