# GPT 5.4 with ForgeCode - Feedback Report

**Model:** GPT 5.4
**Size:** [Not specified]
**Provider:** OpenAI
**Harness:** ForgeCode
**Date Compiled:** April 9, 2026
**Source References:** DEV Community (Liran Baba), ForgeCode Blog

---
## Quick Reference

| Attribute | Value |
|-----------|-------|
| Model | GPT 5.4 |
| Provider | OpenAI |
| Context Window | 1M tokens |
| Best For | Terminal execution, speed |
| Cost | ~$10/M input, ~$30/M output |
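The cost row translates directly into per-task estimates. A minimal sketch, assuming the approximate rates from the table; the 200K-input / 10K-output task size is a hypothetical example, not a figure from the sources:

```python
# Rough per-task cost estimate from the ~$10/M input, ~$30/M output rates above.
INPUT_RATE = 10.0 / 1_000_000   # USD per input token
OUTPUT_RATE = 30.0 / 1_000_000  # USD per output token

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single agent task."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A hypothetical 200K-input / 10K-output coding task:
print(f"${task_cost(200_000, 10_000):.2f}")  # prints $2.30
```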
---
## Benchmark Results

### Terminal-Bench 2.0 (Harness-Specific)

- **Score:** 81.8% (tied for #1)
- **Harness:** ForgeCode
- **Date:** March 2026
- **Note:** Self-reported by ForgeCode; the score reflects the harness+model combination, not raw model capability

### SWE-Bench Pro

- **Score:** 57.7% (rank #3 overall)
- **Behind:** Claude Mythos Preview (77.8%), GLM-5.1 (58.4%)
- **Source:** llm-stats.com
---
## What Worked Well

1. **Terminal Execution Speed**
   - Fastest terminal execution among frontier models
   - 47% token reduction with tool search
   - Best price/performance ratio for terminal tasks

2. **Benchmark Performance**
   - High scores on Terminal-Bench with ForgeCode harness optimizations
   - Strong reasoning capabilities on AIME 2025, HMMT, and GPQA-Diamond
---
## Issues Encountered

1. **Stability Problems** (Critical)
   - **Description:** "Borderline unusable" for research tasks
   - **Manifestation:** A 15-minute research task on a small repo failed repeatedly
   - **Symptoms:** Tool calls failing, agent stuck in retry loops, required a manual kill
   - **Quote:** "I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it."

2. **Tool Calling Reliability** (Major)
   - **Description:** Persistent tool-call errors with GPT 5.4
   - **ForgeCode Fixes Applied:**
     - Reordered JSON schema fields (`required` before `properties`)
     - Flattened nested schemas
     - Added explicit truncation reminders for partial file reads
   - **Note:** These optimizations were benchmark-specific ("benchmaxxed")
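A minimal sketch of what the reordered, flattened schema could look like. The tool name and field names here are hypothetical; the sources describe the fixes but do not show ForgeCode's actual schemas:

```python
# Hypothetical edit-tool schema illustrating the two fixes described above:
# `required` is listed before `properties`, and all fields are flat strings
# rather than nested sub-objects.
def make_edit_tool_schema() -> dict:
    return {
        "name": "edit_file",  # hypothetical tool name
        "parameters": {
            "type": "object",
            # Fix 1: `required` placed before `properties`
            "required": ["path", "old_string", "new_string"],
            "properties": {
                # Fix 2: flat string fields instead of a nested {"edit": {...}}
                "path": {"type": "string", "description": "File to modify"},
                "old_string": {"type": "string", "description": "Exact text to replace"},
                "new_string": {"type": "string", "description": "Replacement text"},
            },
        },
    }

schema = make_edit_tool_schema()
assert list(schema["parameters"]) == ["type", "required", "properties"]
```

The key-order check works because Python dicts preserve insertion order (3.7+), which is also what a JSON serializer emits.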
3. **Long-Running Task Instability** (Major)
   - **Description:** 15+ minute tasks became unstable
   - **Impact:** Unpredictable failures requiring manual intervention
---
## Harness Optimizations

ForgeCode applied specific optimizations for GPT 5.4:

1. **Non-Interactive Mode:** System prompt rewritten to prohibit conversational branching
2. **Tool Argument Naming:** Renaming the edit tool's arguments to `old_string` and `new_string` (names that appear frequently in training data) measurably dropped tool-call error rates
3. **Progressive Thinking Policy:**
   - Messages 1-10: Very high thinking (plan formation)
   - Messages 11+: Low thinking by default (execution phase)
   - Verification skill calls: Switch back to high thinking
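The progressive thinking policy above can be sketched as a simple selector. The function name, return values, and `is_verification` flag are illustrative, not ForgeCode's actual API; only the thresholds and levels come from the bullets:

```python
# Illustrative sketch of the progressive thinking policy described above.
def thinking_level(message_index: int, is_verification: bool = False) -> str:
    """Pick a thinking level for a given message in the agent loop."""
    if is_verification:
        return "high"       # verification skill calls switch back to high
    if message_index <= 10:
        return "very_high"  # messages 1-10: plan formation
    return "low"            # messages 11+: execution phase

assert thinking_level(3) == "very_high"
assert thinking_level(15) == "low"
assert thinking_level(15, is_verification=True) == "high"
```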
---
## Comparison with Claude Opus 4.6

| Aspect | GPT 5.4 | Opus 4.6 |
|--------|---------|----------|
| Terminal-Bench 2.0 (ForgeCode) | 81.8% | 81.8% |
| Real-world stability | Poor | Excellent |
| Tool calling reliability | Problematic | Reliable |
| Research tasks | Unusable | Good |
**Key Takeaway:** Benchmark scores don't capture real-world usability: with the same harness and identical Terminal-Bench scores, the two models delivered dramatically different experiences.
---
## Source References

1. **DEV Community - ForgeCode vs Claude Code:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
   - Real-world performance comparison by Liran Baba

2. **ForgeCode Blog - Benchmarks Don't Matter:** https://forgecode.dev/blog/benchmarks-dont-matter/
   - Documentation of harness optimizations