mid_model_research/forgecode/feedback/localllm/tool-calling-reliability.md

# Tool Calling Reliability with ForgeCode - Feedback Report

**Topic:** Tool use reliability, function calling, common errors
**Source References:** ForgeCode Blog, GitHub issues, Reddit
**Date Compiled:** April 9, 2026

---

## Overview

Tool-calling reliability is a major differentiator for ForgeCode. The harness applies optimizations at several layers (system prompts, tool and argument naming, schema shape, and enforced planning) to improve tool use across models.

---

## The Seven Failure Modes (From ForgeCode Blog)

### 1. Same Model, Very Different Performance
**Problem:** Interactive-first design fails in benchmarks (no user to answer questions)
**Fix:** Non-Interactive Mode with rewritten system prompts

### 2. Tool Descriptions Don't Guarantee Correctness
**Problem Categories:**
- Wrong tool selected (e.g., `shell` instead of structured `edit`)
- Correct tool, wrong argument names
- Correct tool, correct arguments, wrong sequencing

**Fix:** Targeted micro-evals isolating each class per tool, per model
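
The three failure classes above can be told apart mechanically once an expected call sequence exists. The following is a minimal sketch of such a micro-eval check, not ForgeCode's actual eval code; the `Call` type, the `Failure` enum, and the `classify` function are hypothetical names introduced for illustration, and only argument names (not values) are compared.

```python
from dataclasses import dataclass
from enum import Enum


class Failure(Enum):
    NONE = "none"
    WRONG_TOOL = "wrong_tool"
    WRONG_ARGS = "wrong_argument_names"
    WRONG_ORDER = "wrong_sequencing"


@dataclass(frozen=True)
class Call:
    """One tool call, reduced to the tool name and its argument names."""
    tool: str
    args: frozenset  # argument names only; values are ignored in this sketch


def classify(expected: list[Call], actual: list[Call]) -> Failure:
    """Return the first failure class that explains a mismatch (hypothetical helper)."""
    if actual == expected:
        return Failure.NONE

    # Class 1: the multiset of tools differs, so a wrong tool was selected.
    if sorted(c.tool for c in expected) != sorted(c.tool for c in actual):
        return Failure.WRONG_TOOL

    # Class 2: same tools, but some call used different argument names.
    def signature(calls: list[Call]) -> list:
        return sorted((c.tool, sorted(c.args)) for c in calls)

    if signature(expected) != signature(actual):
        return Failure.WRONG_ARGS

    # Class 3: tools and argument names match; only the order differs.
    return Failure.WRONG_ORDER
```

Running one such check per tool and per model keeps each failure class visible on its own instead of folding all three into a single pass/fail number.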

### 3. Tool Naming is a Reliability Variable
**Key Finding:** Models pattern-match against training data first

**Concrete Example:**
- Renaming edit tool arguments to `old_string` and `new_string`
- Result: "measurably dropped tool-call error rates immediately—same model, same prompt"

> "If you are seeing persistent tool-call errors and attribute them entirely to model capability, check your naming first."

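A rename of this kind touches only the tool definition, which is why it is cheap to try. The sketch below assumes an OpenAI-style function definition with a JSON Schema under `parameters`; the starting names `search` and `replace` are hypothetical, since the report does not say what the arguments were called before the rename.

```python
import copy

# Hypothetical "before" definition; the original argument names are not
# given in the source, so `search` and `replace` are placeholders.
EDIT_TOOL = {
    "name": "edit",
    "description": "Replace one exact string in a file.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "search": {"type": "string"},
            "replace": {"type": "string"},
        },
        "required": ["path", "search", "replace"],
    },
}


def rename_arguments(tool: dict, mapping: dict[str, str]) -> dict:
    """Return a copy of `tool` with argument names replaced per `mapping`."""
    renamed = copy.deepcopy(tool)
    params = renamed["parameters"]
    # Rebuild the dict so the original property order is preserved.
    params["properties"] = {
        mapping.get(name, name): spec for name, spec in params["properties"].items()
    }
    params["required"] = [mapping.get(name, name) for name in params.get("required", [])]
    return renamed


edit_tool_v2 = rename_arguments(
    EDIT_TOOL, {"search": "old_string", "replace": "new_string"}
)
```

Comparing error rates for `EDIT_TOOL` and `edit_tool_v2` on the same prompts and the same model isolates naming as the only variable.
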
### 4. Context Size is a Multiplier, Not a Substitute
**Problem:** More context only helps after finding the right entry point
**Insight:** Entry-point discovery latency is the bottleneck

### 5. Time Limits Punish Trajectories
**Problem:** Failed tool calls burn real seconds; brilliant but meandering paths time out
**Fix:** Speed architecture with parallel subagents

### 6. Planning Tools Only Work if Enforced
**Problem:** Optional `todo_write` tool ignored under pressure
**Fix:** Made mandatory via low-level evals
**Result:** 38% → 66% pass rate
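
One way to express "mandatory" as a low-level eval is to assert on the order of tool calls in a trajectory. The sketch below is an assumption about what such a check could look like, not ForgeCode's implementation; `MUTATING_TOOLS` and `plans_before_acting` are hypothetical names, and the rule (plan before the first `edit` or `shell` call) is chosen for illustration.

```python
# Hypothetical set of tools that change state; adjust to the real tool list.
MUTATING_TOOLS = {"edit", "shell"}


def plans_before_acting(tool_calls: list[str]) -> bool:
    """Return True if `todo_write` is called before any mutating tool.

    A trajectory that never calls `todo_write` fails the check, even if it
    only reads files.
    """
    for name in tool_calls:
        if name == "todo_write":
            return True
        if name in MUTATING_TOOLS:
            return False
    return False
```

A harness could run this over recorded trajectories and treat any `False` as an eval failure, which turns an optional habit into a measured requirement.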

### 7. TermBench is More About Speed Than Intelligence
**Fix:** Progressive thinking policy (high thinking early, low during execution)
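
A progressive policy of this shape can be as simple as a function of the step index. The sketch below is illustrative only: the function name, the `"high"`/`"low"` labels, and the cutoff of three planning steps are assumptions, as the source gives no thresholds.

```python
def thinking_level(step: int, planning_steps: int = 3) -> str:
    """Pick a reasoning-effort label for the given zero-based step.

    The first `planning_steps` steps get high effort (planning), and every
    later step gets low effort (execution). The cutoff is an assumed value.
    """
    return "high" if step < planning_steps else "low"
```

The harness would pass the returned label to whatever reasoning-effort setting the model API exposes on each request.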

---

## Model-Specific Tool Calling Issues

### GPT 5.4
- **Issue:** Persistent tool-call errors
- **Fixes Applied:**
  - Reordered JSON schema fields (`required` before `properties`)
  - Flattened nested schemas
  - Explicit truncation reminders
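
The first two fixes are pure transformations of the tool schema. The sketch below shows one plausible way to apply them, relying on Python dicts preserving insertion order when serialized; `required_first` and `flatten_properties` are hypothetical helpers, and joining nested names with `_` is an assumed convention rather than ForgeCode's documented one.

```python
def required_first(schema: dict) -> dict:
    """Return a copy of `schema` with `required` emitted before `properties`."""
    priority = ["type", "required", "properties"]
    ordered = {key: schema[key] for key in priority if key in schema}
    # Append any remaining keys (description, etc.) in their original order.
    ordered.update({k: v for k, v in schema.items() if k not in ordered})
    return ordered


def flatten_properties(properties: dict, prefix: str = "") -> dict:
    """Collapse nested object properties into top-level names joined by '_'."""
    flat = {}
    for name, spec in properties.items():
        key = f"{prefix}{name}"
        if spec.get("type") == "object" and "properties" in spec:
            flat.update(flatten_properties(spec["properties"], prefix=f"{key}_"))
        else:
            flat[key] = spec
    return flat
```

For example, a nested `range` object with `start` and `end` fields becomes two top-level arguments, `range_start` and `range_end`, so the model fills in a single flat object.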

### Qwen 3.5
- **Issue:** Multiple system messages break strict chat templates
- **Status:** Open issue (#2894)
- **Workaround:** None yet; use a different model or wait for a fix

### Gemma 4
- **Issue:** Initial releases had tool calling format issues
- **Fix:** Use the latest oMLX or llama.cpp build

---

## Best Practices for Tool Reliability

1. **Use established argument names:** `old_string`/`new_string` work better than generic names
2. **Flatten schemas:** Reduce nesting in tool definitions
3. **Order matters:** Put `required` before `properties` in JSON schema
4. **Test with micro-evals:** Isolate specific tool+model combinations
5. **Monitor truncation:** Add explicit reminders when files are only partially read
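
For the truncation practice in particular, the reminder can be appended by the tool itself whenever it cuts output short. This is a minimal sketch under assumptions: `read_with_reminder` is a hypothetical helper, and the reminder wording is invented for illustration.

```python
def read_with_reminder(text: str, max_lines: int) -> str:
    """Return at most `max_lines` lines, with an explicit note if truncated."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    shown = "\n".join(lines[:max_lines])
    # Tell the model exactly what it has not seen, so it does not assume
    # the visible portion is the whole file.
    return (
        f"{shown}\n"
        f"[truncated: showing lines 1-{max_lines} of {len(lines)}; "
        f"request a later range to see the rest]"
    )
```

Placing the note in the tool result, rather than only in the system prompt, puts it next to the content the model is about to reason over.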

---

## ForgeCode Services Enhancements

The proprietary runtime layer includes:
1. **Semantic entry-point discovery:** Lightweight semantic pass before exploration
2. **Dynamic skill loading:** Specialized instructions loaded when needed
3. **Tool-call correction layer:** Heuristic + static analysis for argument validation

**Note:** These features are part of ForgeCode Services (optional), not the open-source CLI.
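
The internals of the correction layer are not public, so the following is only a guess at the heuristic half of such a layer: repairing near-miss argument names against the tool schema before rejecting a call. `correct_arguments` is a hypothetical function, and the similarity cutoff of 0.6 is an arbitrary choice.

```python
import difflib


def correct_arguments(args: dict, schema: dict) -> dict:
    """Map near-miss argument names onto the schema's names, or raise.

    `schema` is a JSON Schema object with `properties` and optional
    `required`. Unknown names with no close match raise ValueError.
    """
    expected = list(schema["properties"])
    corrected = {}
    for name, value in args.items():
        if name in expected:
            corrected[name] = value
            continue
        match = difflib.get_close_matches(name, expected, n=1, cutoff=0.6)
        # Only rename when the target is not already supplied by the caller.
        if match and match[0] not in args and match[0] not in corrected:
            corrected[match[0]] = value
        else:
            raise ValueError(f"unknown argument: {name!r}")
    missing = [name for name in schema.get("required", []) if name not in corrected]
    if missing:
        raise ValueError(f"missing required arguments: {missing}")
    return corrected
```

With a schema that expects `path`, `old_string`, and `new_string`, a call that supplies `old_str` and `new_str` would be repaired instead of failing, which saves one round trip per near-miss.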

---

## Community Tips

From Reddit and GitHub discussions:

1. **Prefer LM Studio over raw llama.cpp** for Qwen 3.5 XML tool parsing
2. **LM Studio 0.4.9+** handles tool calling more reliably
3. **llama.cpp `--jinja` flag** helps with Qwen tool templates

---

## Source References

1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
2. **GitHub Issue #2894:** https://github.com/antinomyhq/forgecode/issues/2894
3. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/