sleepy 51123212c4 Initial commit: coding harness feedback analysis

Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.

2026-04-09 15:13:45 +02:00


Tool Calling Reliability with ForgeCode - Feedback Report

Topic: Tool use reliability, function calling, common errors
Source References: ForgeCode Blog, GitHub issues, Reddit
Date Compiled: April 9, 2026

Overview

Tool-calling reliability is a major differentiator for ForgeCode. The harness implements a range of optimizations, described below, to make tool use more dependable across models.

The Seven Failure Modes (From ForgeCode Blog)

1. Same Model, Very Different Performance

Problem: An interactive-first design fails in benchmarks, where no user is present to answer the agent's questions
Fix: Non-Interactive Mode with rewritten system prompts

2. Tool Descriptions Don't Guarantee Correctness

Problem Categories:

Wrong tool selected (e.g., shell instead of structured edit)
Correct tool, wrong argument names
Correct tool, correct arguments, wrong sequencing

Fix: Targeted micro-evals isolating each class per tool, per model
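The three problem categories above can be checked separately in a micro-eval. The following is a minimal sketch, not ForgeCode's actual eval code: the tool registry, the `ToolCall` type, and the "read before edit" sequencing rule are all assumptions made for illustration.

```python
from dataclasses import dataclass

# Hypothetical tool registry: tool name -> exact set of argument names.
TOOL_ARGS = {
    "edit": {"path", "old_string", "new_string"},
    "read": {"path"},
    "shell": {"command"},
}


@dataclass
class ToolCall:
    name: str
    args: dict


def classify_call(call, expected_tool, already_called):
    """Return one failure class for a single tool call, or "ok".

    already_called is a set of (tool_name, path) pairs seen earlier
    in the same trajectory.
    """
    if call.name != expected_tool:
        return "wrong_tool"
    if set(call.args) != TOOL_ARGS[call.name]:
        return "wrong_argument_names"
    # Assumed sequencing rule: a file must be read before it is edited.
    if call.name == "edit" and ("read", call.args["path"]) not in already_called:
        return "wrong_sequencing"
    return "ok"
```

Counting each class per tool and per model gives a separate error rate for each category, so the three kinds of failure are not folded into one aggregate pass rate.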

3. Tool Naming is a Reliability Variable

Key Finding: Models pattern-match against training data first

Concrete Example:

Renaming edit tool arguments to old_string and new_string
Result: "measurably dropped tool-call error rates immediately—same model, same prompt"

"If you are seeing persistent tool-call errors and attribute them entirely to model capability, check your naming first."
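As an illustration of the renaming, the sketch below rewrites a flat tool schema so that its arguments use old_string and new_string. The original names `search` and `replace` are hypothetical; the source does not say what the arguments were called before the change.

```python
import copy


def rename_arguments(schema, renames):
    """Return a copy of a flat JSON-schema object with properties renamed."""
    out = copy.deepcopy(schema)
    out["properties"] = {
        renames.get(name, name): spec for name, spec in out["properties"].items()
    }
    out["required"] = [renames.get(name, name) for name in out.get("required", [])]
    return out


# Hypothetical "before" schema for an edit tool.
edit_schema = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "search": {"type": "string"},
        "replace": {"type": "string"},
    },
    "required": ["path", "search", "replace"],
}

renamed = rename_arguments(
    edit_schema, {"search": "old_string", "replace": "new_string"}
)
```

Only the names change; types and the set of required arguments stay the same, which keeps a before/after comparison on the same model and prompt fair.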

4. Context Size is a Multiplier, Not a Substitute

Problem: More context only helps after finding the right entry point
Insight: Entry-point discovery latency is the bottleneck

5. Time Limits Punish Trajectories

Problem: Failed tool calls burn real seconds; brilliant but meandering paths time out
Fix: Speed architecture with parallel subagents
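A minimal sketch of the idea, assuming subagents can be modeled as independent coroutines: run them concurrently and give each one a deadline so a slow branch cannot consume the whole time budget. `run_subagent` is a stand-in that sleeps instead of calling a model; none of these names come from ForgeCode.

```python
import asyncio


async def run_subagent(task, delay):
    # Stand-in for a real subagent; sleeps instead of calling a model.
    await asyncio.sleep(delay)
    return f"done: {task}"


async def run_parallel(tasks, time_limit):
    """Run subagents concurrently; a subagent that misses the deadline yields None."""

    async def bounded(task, delay):
        try:
            return await asyncio.wait_for(run_subagent(task, delay), timeout=time_limit)
        except asyncio.TimeoutError:
            return None

    return await asyncio.gather(*(bounded(task, delay) for task, delay in tasks))


results = asyncio.run(
    run_parallel([("search", 0.01), ("tests", 0.01), ("slow", 1.0)], time_limit=0.2)
)
```

Here the two fast tasks finish and the slow one is cut off at the deadline, so total wall-clock time is bounded by `time_limit` rather than by the sum of the individual tasks.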

6. Planning Tools Only Work if Enforced

Problem: Optional todo_write tool ignored under pressure
Fix: Made mandatory via low-level evals
Result: 38% → 66% pass rate
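One way such a low-level eval could be phrased is as a check over the recorded tool-call trajectory: the planning tool must appear before the first mutating tool. This is a sketch under assumptions; the trajectory format and the set of "mutating" tools are hypothetical.

```python
def plan_was_written_first(
    trajectory, planning_tool="todo_write", mutating_tools=("edit", "shell")
):
    """Return True if the planning tool is called before any mutating tool.

    trajectory is a list of tool names in call order.
    """
    for name in trajectory:
        if name == planning_tool:
            return True
        if name in mutating_tools:
            return False
    # The planning tool was never called.
    return False
```

A trajectory such as `["read", "todo_write", "edit"]` passes, while `["read", "edit", "todo_write"]` fails because the edit precedes the plan.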

7. TermBench is More About Speed Than Intelligence

Fix: Progressive thinking policy (high thinking early, low during execution)
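In its simplest form, a progressive thinking policy maps the step number to a reasoning-effort level. The sketch below assumes a hypothetical step counter and two effort labels; the cut-off of three planning steps is arbitrary and not taken from the source.

```python
def thinking_effort(step, planning_steps=3):
    """Use high reasoning effort for the first few steps, low afterwards."""
    return "high" if step < planning_steps else "low"


schedule = [thinking_effort(step) for step in range(5)]
```

With the default cut-off, the first three steps run at high effort and every later step at low effort.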

Model-Specific Tool Calling Issues

GPT 5.4

Issue: Persistent tool-call errors
Fixes Applied:
- Reordered JSON schema fields (required before properties)
- Flattened nested schemas
- Explicit truncation reminders
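The first two fixes can be sketched as a single schema transformation. The example assumes tool schemas are plain dicts in JSON Schema style, and the `parent_child` naming for hoisted properties is a convention chosen for illustration; it is not ForgeCode's implementation. It relies on Python dicts preserving insertion order, so `required` is serialized before `properties`.

```python
def flatten(schema, prefix=""):
    """Hoist nested object properties to the top level as parent_child names.

    The result lists "required" before "properties".
    """
    props, required = {}, []
    required_here = set(schema.get("required", []))
    for name, sub in schema.get("properties", {}).items():
        full = f"{prefix}{name}"
        if sub.get("type") == "object" and "properties" in sub:
            inner = flatten(sub, prefix=f"{full}_")
            props.update(inner["properties"])
            # Inner requirements only apply if the parent itself was required.
            if name in required_here:
                required.extend(inner["required"])
        else:
            props[full] = sub
            if name in required_here:
                required.append(full)
    return {"type": "object", "required": required, "properties": props}


# Hypothetical nested schema for a read tool with a line range.
nested = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "range": {
            "type": "object",
            "properties": {"start": {"type": "integer"}, "end": {"type": "integer"}},
            "required": ["start"],
        },
    },
    "required": ["path", "range"],
}

flat = flatten(nested)
```

After flattening, the model sees three top-level arguments (`path`, `range_start`, `range_end`) instead of a nested object, and learns which arguments are required before reading their definitions.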

Qwen 3.5

Issue: Multiple system messages break strict chat templates
Status: Open issue (#2894)
Workaround: None yet; use a different model or await a fix

Gemma 4

Issue: Initial releases had tool calling format issues
Fix: Use the latest release of oMLX or llama.cpp

Best Practices for Tool Reliability

Use established argument names: old_string/new_string work better than generic names
Flatten schemas: Reduce nesting in tool definitions
Order matters: Put required before properties in the JSON schema
Test with micro-evals: Isolate specific tool+model combinations
Monitor truncation: Add explicit reminders when a file has been only partially read
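The truncation practice can be sketched as a wrapper around file reads that appends an explicit notice whenever output is cut off. The function name, the line limit, and the wording of the reminder are assumptions for illustration.

```python
def read_for_model(text, max_lines=200):
    """Return text for the model, with an explicit reminder if it was truncated."""
    lines = text.splitlines()
    if len(lines) <= max_lines:
        return text
    shown = "\n".join(lines[:max_lines])
    return (
        f"{shown}\n"
        f"[truncated: showing lines 1-{max_lines} of {len(lines)}; "
        "read the remaining lines before editing]"
    )
```

The reminder states both how much was shown and how much exists, so the model can tell it has an incomplete view of the file.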

ForgeCode Services Enhancements

The proprietary runtime layer includes:

Semantic entry-point discovery: Lightweight semantic pass before exploration
Dynamic skill loading: Specialized instructions loaded when needed
Tool-call correction layer: Heuristic + static analysis for argument validation

Note: These features are part of ForgeCode Services (optional), not the open-source CLI.
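The internals of the correction layer are not public. As a rough idea of what heuristic argument validation can look like, the sketch below maps near-miss argument names onto the allowed ones with the standard-library `difflib` and reports anything it cannot repair. The function name, the similarity cut-off, and the error strings are all assumptions.

```python
import difflib


def correct_arguments(args, allowed):
    """Map near-miss argument names onto allowed ones.

    Returns (fixed_args, errors). allowed is a list of valid argument names.
    """
    fixed, errors = {}, []
    for name, value in args.items():
        if name in allowed:
            fixed[name] = value
            continue
        match = difflib.get_close_matches(name, allowed, n=1, cutoff=0.6)
        # Only repair when the target name is not already supplied.
        if match and match[0] not in args and match[0] not in fixed:
            fixed[match[0]] = value
        else:
            errors.append(f"unknown argument: {name}")
    for name in allowed:
        if name not in fixed:
            errors.append(f"missing argument: {name}")
    return fixed, errors
```

For example, a call that passes `old_str` and `new_str` is repaired to old_string and new_string, while an unrelated name is reported as an error rather than silently guessed.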

Community Tips

From Reddit and GitHub discussions:

LM Studio reportedly handles Qwen 3.5 XML tool parsing better than raw llama.cpp
LM Studio 0.4.9+ handles tool calling more reliably
The llama.cpp --jinja flag helps with Qwen tool templates

Source References

ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/
GitHub Issue #2894: https://github.com/antinomyhq/forgecode/issues/2894
Reddit r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/
