
GPT 5.4 with ForgeCode - Feedback Report

Model: GPT 5.4
Provider: OpenAI
Harness: ForgeCode
Source References: DEV Community (Liran Baba), ForgeCode Blog
Date Compiled: April 9, 2026


Benchmark Performance

TermBench 2.0 (Self-Reported via ForgeCode)

  • Score: 81.8% (tied for #1 with Opus 4.6)
  • Note: Achieved through extensive harness optimizations, not raw model capability

Real-World Performance Feedback

Stability Issues

  • Assessment: "Borderline unusable" for some tasks
  • Specific Issue: an architecture-research task on a small repo ran for 15 minutes without completing
    • Tool calls repeatedly failing
    • Agent stuck in retry loops
    • Required manual kill

"I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it."

Tool Calling Reliability

  • Problem: Persistent tool-call errors with GPT 5.4
  • ForgeCode Fixes Applied (see the sketch after this list):
    1. Reordered JSON schema fields (required before properties)
    2. Flattened nested schemas
    3. Added explicit truncation reminders for partial file reads
  • Caveat: ForgeCode describes these optimizations as "benchmaxxed", i.e. tuned to the benchmark rather than to general use
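
A minimal sketch of what those three fixes could look like in a tool definition, assuming a hypothetical edit_file tool; the field names and the renderFileRead helper are illustrative, not ForgeCode's actual code:

```typescript
// Sketch of the reordered, flattened tool schema; the edit_file tool, its
// field names, and renderFileRead are illustrative assumptions, not
// ForgeCode's actual implementation.
const editToolSchema = {
  name: "edit_file",
  description: "Replace an exact string in a file.",
  parameters: {
    type: "object",
    // Fix 1: `required` is serialized before `properties`.
    required: ["path", "old_string", "new_string"],
    // Fix 2: flattened, no nested objects inside `properties`.
    properties: {
      path: { type: "string", description: "File to edit" },
      old_string: { type: "string", description: "Exact text to replace" },
      new_string: { type: "string", description: "Replacement text" },
    },
  },
};

// Fix 3: append an explicit reminder whenever a file read is truncated.
function renderFileRead(content: string, truncated: boolean): string {
  return truncated
    ? `${content}\n[NOTE: output truncated; re-read with an offset to see the rest]`
    : content;
}
```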

Harness Optimizations for GPT 5.4

From ForgeCode's "Benchmarks Don't Matter" blog series:

  1. Non-Interactive Mode: The system prompt was rewritten to prohibit conversational branching
  2. Tool Naming: Renaming the edit tool's arguments to old_string and new_string (names that appear frequently in training data) measurably reduced tool-call error rates
  3. Progressive Thinking Policy (sketched below):
    • Messages 1-10: Very high thinking (plan formation)
    • Messages 11+: Low thinking default (execution phase)
    • Verification skill calls: Switch back to high thinking
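
As a rough illustration of how a message-count-based thinking policy could be wired up (the selectThinkingLevel helper and the level names are assumptions, not ForgeCode's API):

```typescript
// Sketch of a progressive thinking policy: very high effort while the plan is
// formed, low effort during execution, high effort again for verification skills.
type ThinkingLevel = "low" | "high" | "very_high";

function selectThinkingLevel(
  messageIndex: number,        // 1-based position of the message in the session
  isVerificationCall: boolean, // true when a verification skill is being invoked
): ThinkingLevel {
  if (isVerificationCall) return "high";      // switch back up for verification
  if (messageIndex <= 10) return "very_high"; // messages 1-10: plan formation
  return "low";                               // messages 11+: execution default
}

// Example: message 14, not a verification call, gets the low-effort default.
console.log(selectThinkingLevel(14, false)); // "low"
```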

What Didn't Work Well

  1. Research tasks: Tool calling failures causing infinite loops
  2. Long-running tasks: 15+ minute tasks became unstable
  3. Consistency: Unpredictable failures requiring manual intervention

Comparison with Opus 4.6

Aspect                     GPT 5.4        Opus 4.6
TermBench 2.0              81.8%          81.8%
Real-world stability       Poor           Excellent
Tool calling reliability   Problematic    Reliable
Research tasks             Unusable       Good

Key Takeaway: Benchmark scores don't reflect real-world usability; the same harness produced dramatically different experiences with the two models.


Source References

  1. DEV Community: https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
  2. ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/