GPT 5.4 with ForgeCode - Feedback Report

Model: GPT 5.4
Size: [Not specified]
Provider: OpenAI
Harness: ForgeCode
Date Compiled: April 9, 2026
Source References: DEV Community (Liran Baba), ForgeCode Blog


Quick Reference

| Attribute      | Value                       |
|----------------|-----------------------------|
| Model          | GPT 5.4                     |
| Provider       | OpenAI                      |
| Context Window | 1M tokens                   |
| Best For       | Terminal execution, speed   |
| Cost           | ~$10/M input, ~$30/M output |

Benchmark Results

Terminal-Bench 2.0 (Harness-Specific)

  • Score: 81.8% (tied for #1)
  • Harness: ForgeCode
  • Date: March 2026
  • Note: Self-reported by ForgeCode; score reflects harness+model combination, not raw model capability

SWE-Bench Pro

  • Score: 57.7% (Rank #3 overall)
  • Behind: Claude Mythos Preview (77.8%), GLM-5.1 (58.4%)
  • Source: llm-stats.com

What Worked Well

  1. Terminal Execution Speed

    • Fastest terminal execution among frontier models
    • 47% token reduction with tool search
    • Best price/performance ratio for terminal tasks
  2. Benchmark Performance

    • High scores on Terminal-Bench with ForgeCode harness optimizations
    • Strong reasoning capabilities on AIME 2025, HMMT, GPQA-Diamond

Issues Encountered

  1. Stability Problems (Critical)

    • Description: "Borderline unusable" for research tasks
    • Manifestation: 15-minute research task on small repo failed repeatedly
    • Symptoms: Tool calls failing, agent stuck in retry loops, required manual kill
    • Quote: "I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it."
  2. Tool Calling Reliability (Major)

    • Description: Persistent tool-call errors with GPT 5.4
    • ForgeCode Fixes Applied (sketched in code after this list):
      • Reordered JSON schema fields (required before properties)
      • Flattened nested schemas
      • Added explicit truncation reminders for partial file reads
    • Note: These optimizations were benchmark-specific ("benchmaxxed")
  3. Long-Running Task Instability (Major)

    • Description: 15+ minute tasks became unstable
    • Impact: Unpredictable failures requiring manual intervention
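
To make those schema fixes concrete, here is a minimal before/after sketch in Python. The edit_file tool and its exact shape are assumptions for illustration, not ForgeCode's actual definitions; the old_string/new_string argument names anticipate the naming change described under Harness Optimizations below.

```python
# Hypothetical "before": a nested schema with "required" serialized after
# "properties". Python dicts preserve insertion order, so the JSON the model
# sees mirrors the order written here.
edit_tool_before = {
    "name": "edit_file",
    "parameters": {
        "type": "object",
        "properties": {
            "edit": {  # one extra nesting level for the model to get wrong
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "old_string": {"type": "string"},
                    "new_string": {"type": "string"},
                },
                "required": ["path", "old_string", "new_string"],
            },
        },
        "required": ["edit"],
    },
}

# Hypothetical "after": flattened, with "required" listed before "properties".
edit_tool_after = {
    "name": "edit_file",
    "parameters": {
        "type": "object",
        "required": ["path", "old_string", "new_string"],
        "properties": {
            "path": {"type": "string"},
            "old_string": {"type": "string"},
            "new_string": {"type": "string"},
        },
    },
}

# Hypothetical reminder appended whenever a file read returns partial output.
TRUNCATION_REMINDER = "[NOTE: output truncated; re-read with an offset to see the rest]"
```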

Harness Optimizations

ForgeCode applied specific optimizations for GPT 5.4:

  1. Non-Interactive Mode: System prompt rewritten to prohibit conversational branching (an illustrative clause is sketched after this list)
  2. Tool Naming: Renaming the edit tool's arguments to old_string and new_string (names that appear frequently in training data) measurably reduced tool-call error rates
  3. Progressive Thinking Policy (see the second sketch after this list):
    • Messages 1-10: very high thinking (plan formation)
    • Messages 11+: low thinking by default (execution phase)
    • Verification skill calls: switch back to high thinking
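
For item 1, a sketch of what a non-interactive system-prompt clause might look like; the wording is entirely hypothetical, since the sources above do not publish ForgeCode's actual prompt.

```python
# Hypothetical system-prompt clause enforcing non-interactive behavior.
NON_INTERACTIVE_CLAUSE = (
    "You are running unattended. Do not ask the user questions, do not offer "
    "alternatives to choose from, and do not pause for confirmation. Pick the "
    "most reasonable interpretation of the task and carry it to completion."
)
```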
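
And a minimal sketch of item 3's policy as a per-turn decision function; the level names, the is_verification_call flag, and the function's shape are assumptions, with only the message-count thresholds taken from the description above.

```python
def thinking_level(message_index: int, is_verification_call: bool = False) -> str:
    """Pick a thinking level for one turn, per the progressive policy above.

    Hypothetical sketch, not ForgeCode's actual implementation.
    """
    if is_verification_call:
        return "high"       # verification skill calls switch back to high thinking
    if message_index <= 10:
        return "very_high"  # messages 1-10: plan formation
    return "low"            # messages 11+: execution-phase default

# Example: thinking_level(3) -> "very_high"; thinking_level(20) -> "low"
```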

Comparison with Claude Opus 4.6

| Aspect                         | GPT 5.4     | Opus 4.6  |
|--------------------------------|-------------|-----------|
| Terminal-Bench 2.0 (ForgeCode) | 81.8%       | 81.8%     |
| Real-world stability           | Poor        | Excellent |
| Tool calling reliability       | Problematic | Reliable  |
| Research tasks                 | Unusable    | Good      |

Key Takeaway: Benchmark scores don't reflect real-world usability: with the same harness and the same Terminal-Bench score, the two models delivered dramatically different experiences.


Source References

  1. DEV Community - ForgeCode vs Claude Code: https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c

    • Real-world performance comparison by Liran Baba
  2. ForgeCode Blog - Benchmarks Don't Matter: https://forgecode.dev/blog/benchmarks-dont-matter/

    • Documentation of harness optimizations