# GPT 5.4 with ForgeCode - Feedback Report
**Model:** GPT 5.4
**Provider:** OpenAI
**Harness:** ForgeCode
**Source References:** DEV Community (Liran Baba), ForgeCode Blog
**Date Compiled:** April 9, 2026
---
## Benchmark Performance
### TermBench 2.0 (Self-Reported via ForgeCode)
- **Score:** 81.8% (tied for #1 with Opus 4.6)
- **Note:** Achieved through extensive harness optimizations, not raw model capability
---
## Real-World Performance Feedback
### Stability Issues
- **Assessment:** "Borderline unusable" for some task types
- **Specific Issue:** A research task on a small repo ran for roughly 15 minutes without completing
  - Tool calls repeatedly failing
  - Agent stuck in retry loops
  - Required a manual kill
> "I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it."
### Tool Calling Reliability
- **Problem:** Persistent tool-call errors with GPT 5.4
- **ForgeCode Fixes Applied** (sketched after this list):
  1. Reordered JSON schema fields (`required` before `properties`)
  2. Flattened nested schemas
  3. Added explicit truncation reminders for partial file reads
- **Caveat:** These optimizations were tuned to the benchmark rather than general use (described as "benchmaxxed")
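
A minimal sketch of what those three fixes could look like in a tool definition. The `edit` tool shape, field names, and the truncation note are assumptions for illustration; the sources do not show ForgeCode's actual schemas:

```typescript
// Hypothetical tool definition illustrating fixes 1-3; not ForgeCode's code.
const editTool = {
  name: "edit",
  description: "Replace old_string with new_string in a file.",
  parameters: {
    type: "object",
    // Fix 1: `required` declared (and therefore serialized) before `properties`.
    required: ["path", "old_string", "new_string"],
    // Fix 2: one flat object, no nested sub-schemas for the model to mis-fill.
    properties: {
      path: { type: "string", description: "File to modify" },
      old_string: { type: "string", description: "Exact text to replace" },
      new_string: { type: "string", description: "Replacement text" },
    },
  },
};

// Fix 3: append an explicit reminder when a file read is truncated.
function renderFileRead(content: string, truncated: boolean): string {
  return truncated
    ? `${content}\n[NOTE: read truncated; the file continues past this point]`
    : content;
}
```

`JSON.stringify` preserves insertion order for string keys, so declaring `required` first is enough to put it first in the serialized schema.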
---
## Harness Optimizations for GPT 5.4
From ForgeCode's "Benchmarks Don't Matter" blog series:
1. **Non-Interactive Mode:** System prompt rewritten to prohibit conversational branching
2. **Tool Naming:** Renaming the edit tool's arguments to `old_string` and `new_string` (names that appear frequently in training data) measurably dropped tool-call error rates
3. **Progressive Thinking Policy** (sketched after this list):
- Messages 1-10: Very high thinking (plan formation)
- Messages 11+: Low thinking default (execution phase)
- Verification skill calls: Switch back to high thinking
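
A sketch of that schedule as a pure function, assuming the message indexing and level names from the list above (the actual ForgeCode policy is not public):

```typescript
// Hypothetical thinking-level schedule mirroring the reported policy.
type ThinkingLevel = "low" | "high" | "very_high";

function thinkingLevelFor(messageIndex: number, isVerificationSkill: boolean): ThinkingLevel {
  // Verification skill calls switch back to high thinking at any point.
  if (isVerificationSkill) return "high";
  // Messages 1-10: very high thinking while the plan forms.
  if (messageIndex <= 10) return "very_high";
  // Messages 11+: default to low thinking during execution.
  return "low";
}
```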
---
## What Didn't Work Well
1. **Research tasks:** Tool-call failures cascaded into effectively infinite retry loops
2. **Long-running tasks:** Sessions running 15+ minutes became unstable
3. **Consistency:** Failures were unpredictable and required manual intervention
---
## Comparison with Opus 4.6
| Aspect | GPT 5.4 | Opus 4.6 |
|--------|---------|----------|
| TermBench 2.0 | 81.8% | 81.8% |
| Real-world stability | Poor | Excellent |
| Tool calling reliability | Problematic | Reliable |
| Research tasks | Unusable | Good |
**Key Takeaway:** Benchmark scores don't reflect real-world usability. Same harness, dramatically different experiences.
---
## Source References
1. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
2. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/