Files
mid_model_research/forgecode/feedback/localllm/tool-calling-reliability.md
T
sleepy 51123212c4 Initial commit: coding harness feedback analysis
Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.
2026-04-09 15:13:45 +02:00

3.9 KiB

Tool Calling Reliability with ForgeCode - Feedback Report

Topic: Tool use reliability, function calling, common errors
Source References: ForgeCode Blog, GitHub issues, Reddit
Date Compiled: April 9, 2026


Overview

Tool calling reliability is a major differentiator for ForgeCode. The harness has implemented numerous optimizations to improve tool use across models.


The Seven Failure Modes (From ForgeCode Blog)

1. Same Model, Very Different Performance

Problem: Interactive-first design fails in benchmarks (no user to answer questions)
Fix: Non-Interactive Mode with rewritten system prompts

2. Tool Descriptions Don't Guarantee Correctness

Problem Categories:

  • Wrong tool selected (e.g., shell instead of structured edit)
  • Correct tool, wrong argument names
  • Correct tool, correct arguments, wrong sequencing

Fix: Targeted micro-evals isolating each class per tool, per model

3. Tool Naming is a Reliability Variable

Key Finding: Models pattern-match against training data first

Concrete Example:

  • Renaming edit tool arguments to old_string and new_string
  • Result: "measurably dropped tool-call error rates immediately—same model, same prompt"

"If you are seeing persistent tool-call errors and attribute them entirely to model capability, check your naming first."

4. Context Size is a Multiplier, Not a Substitute

Problem: More context only helps after finding the right entry point
Insight: Entry-point discovery latency is the bottleneck

5. Time Limits Punish Trajectories

Problem: Failed tool calls burn real seconds; brilliant but meandering paths timeout
Fix: Speed architecture with parallel subagents

6. Planning Tools Only Work if Enforced

Problem: Optional todo_write tool ignored under pressure
Fix: Made mandatory via low-level evals Result: 38% → 66% pass rate

7. TermBench is More About Speed Than Intelligence

Fix: Progressive thinking policy (high thinking early, low during execution)


Model-Specific Tool Calling Issues

GPT 5.4

  • Issue: Persistent tool-call errors
  • Fixes Applied:
    • Reordered JSON schema fields (required before properties)
    • Flattened nested schemas
    • Explicit truncation reminders

Qwen 3.5

  • Issue: Multiple system messages break strict chat templates
  • Status: Open issue (#2894)
  • Workaround: None yet; use different model or await fix

Gemma 4

  • Issue: Initial releases had tool calling format issues
  • Fix: Use latest oMLX / llama.cpp

Best Practices for Tool Reliability

  1. Use established argument names: old_string/new_string better than generic names
  2. Flatten schemas: Reduce nesting in tool definitions
  3. Order matters: Put required before properties in JSON schema
  4. Test with micro-evals: Isolate specific tool+model combinations
  5. Monitor truncation: Add explicit reminders when files partially read

ForgeCode Services Enhancements

The proprietary runtime layer includes:

  1. Semantic entry-point discovery: Lightweight semantic pass before exploration
  2. Dynamic skill loading: Specialized instructions loaded when needed
  3. Tool-call correction layer: Heuristic + static analysis for argument validation

Note: These features are part of ForgeCode Services (optional), not the open-source CLI.


Community Tips

From Reddit and GitHub discussions:

  1. LM Studio > raw llama.cpp for Qwen3.5 XML tool parsing
  2. LM Studio 0.4.9+ handles tool calling more reliably
  3. llama.cpp --jinja flag helps with Qwen tool templates

Source References

  1. ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/
  2. GitHub Issue #2894: https://github.com/antinomyhq/forgecode/issues/2894
  3. Reddit r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/