# GPT 5.4 with ForgeCode - Feedback Report

**Model:** GPT 5.4
**Size:** [Not specified]
**Provider:** OpenAI
**Harness:** ForgeCode
**Date Compiled:** April 9, 2026
**Source References:** DEV Community (Liran Baba), ForgeCode Blog
## Quick Reference
| Attribute | Value |
|---|---|
| Model | GPT 5.4 |
| Provider | OpenAI |
| Context Window | 1M tokens |
| Best For | Terminal execution, speed |
| Cost | ~$10/M input, ~$30/M output |
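
For a rough sense of what those prices mean in practice, here is a tiny cost estimator using the per-million-token rates from the table. The example token counts are arbitrary placeholders, not measurements from any benchmark.

```python
INPUT_PRICE_PER_M = 10.0   # ~$10 per 1M input tokens, from the table above
OUTPUT_PRICE_PER_M = 30.0  # ~$30 per 1M output tokens, from the table above


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Approximate USD cost of a single request at the listed rates."""
    return (input_tokens / 1_000_000) * INPUT_PRICE_PER_M + \
           (output_tokens / 1_000_000) * OUTPUT_PRICE_PER_M


# e.g. a 200K-token repo read producing a 5K-token summary costs roughly $2.15
print(f"${estimate_cost(200_000, 5_000):.2f}")
```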
## Benchmark Results

### Terminal-Bench 2.0 (Harness-Specific)
- Score: 81.8% (tied for #1)
- Harness: ForgeCode
- Date: March 2026
- Note: Self-reported by ForgeCode; score reflects harness+model combination, not raw model capability
### SWE-Bench Pro
- Score: 57.7% (Rank #3 overall)
- Behind: Claude Mythos Preview (77.8%), GLM-5.1 (58.4%)
- Source: llm-stats.com
## What Worked Well

- **Terminal Execution Speed**
  - Fastest terminal execution among frontier models
  - 47% token reduction with tool search
  - Best price/performance ratio for terminal tasks
- **Benchmark Performance**
  - High scores on Terminal-Bench with ForgeCode harness optimizations
  - Strong reasoning capabilities on AIME 2025, HMMT, GPQA-Diamond
## Issues Encountered

- **Stability Problems (Critical)**
  - Description: "Borderline unusable" for research tasks
  - Manifestation: A 15-minute research task on a small repo failed repeatedly
  - Symptoms: Tool calls failing, agent stuck in retry loops, required manual kill
  - Quote: "I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it."
- **Tool Calling Reliability (Major)**
  - Description: Persistent tool-call errors with GPT 5.4
  - ForgeCode Fixes Applied (see the schema sketch after this list):
    - Reordered JSON schema fields (`required` before `properties`)
    - Flattened nested schemas
    - Added explicit truncation reminders for partial file reads
  - Note: These optimizations were benchmark-specific ("benchmaxxed")
- **Long-Running Task Instability (Major)**
  - Description: Tasks running 15+ minutes became unstable
  - Impact: Unpredictable failures requiring manual intervention
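
To make the schema fix concrete, here is a minimal sketch of what a reordered, flattened tool definition could look like. The `read_file` tool, its parameters, and the surrounding structure are hypothetical illustrations, not ForgeCode's actual schemas; the point is only that `required` is serialized before `properties` and that nesting stays flat. The truncation-reminder fix is a prompt-side change rather than a schema change, so it is not shown here.

```python
import json

# Hypothetical tool definition illustrating the two schema fixes described above:
# 1. "required" appears before "properties" in the serialized JSON.
# 2. Parameters are flat scalar fields rather than nested objects.
read_file_tool = {
    "name": "read_file",  # hypothetical tool name, for illustration only
    "description": "Read a file, optionally limited to a line range.",
    "parameters": {
        "type": "object",
        "required": ["path"],  # listed ahead of "properties"
        "properties": {
            "path": {"type": "string", "description": "File path to read."},
            "start_line": {"type": "integer", "description": "First line to return."},
            "end_line": {"type": "integer", "description": "Last line to return."},
        },
        "additionalProperties": False,
    },
}

# json.dumps preserves the dict's insertion order, so "required" really does
# precede "properties" in the payload the model sees.
print(json.dumps(read_file_tool, indent=2))
```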
## Harness Optimizations

ForgeCode applied specific optimizations for GPT 5.4:

- **Non-Interactive Mode**: System prompt rewritten to prohibit conversational branching
- **Tool Naming**: Renaming the edit tool arguments to `old_string` and `new_string` (names appearing frequently in training data) measurably dropped tool-call error rates
- **Progressive Thinking Policy** (see the sketch after this list):
  - Messages 1-10: Very high thinking (plan formation)
  - Messages 11+: Low thinking by default (execution phase)
  - Verification skill calls: Switch back to high thinking
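
Below is a minimal sketch of how a message-count-based thinking policy like this could be wired into a harness. The `ThinkingLevel` values, the `is_verification` flag, and the threshold constant mirror the bullets above but are otherwise hypothetical; this report does not document ForgeCode's actual implementation.

```python
from enum import Enum


class ThinkingLevel(Enum):
    LOW = "low"
    HIGH = "high"
    VERY_HIGH = "very_high"


PLANNING_PHASE_MESSAGES = 10  # messages 1-10 per the policy above


def thinking_level(message_index: int, is_verification: bool = False) -> ThinkingLevel:
    """Pick a thinking budget for the next request.

    message_index is 1-based; is_verification marks verification skill calls,
    which escalate back to high thinking regardless of phase.
    """
    if is_verification:
        return ThinkingLevel.HIGH
    if message_index <= PLANNING_PHASE_MESSAGES:
        return ThinkingLevel.VERY_HIGH  # plan formation
    return ThinkingLevel.LOW  # execution-phase default


# Example: message 3 plans, message 14 executes, a verification call re-escalates.
assert thinking_level(3) is ThinkingLevel.VERY_HIGH
assert thinking_level(14) is ThinkingLevel.LOW
assert thinking_level(14, is_verification=True) is ThinkingLevel.HIGH
```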
## Comparison with Claude Opus 4.6
| Aspect | GPT 5.4 | Opus 4.6 |
|---|---|---|
| Terminal-Bench 2.0 (ForgeCode) | 81.8% | 81.8% |
| Real-world stability | Poor | Excellent |
| Tool calling reliability | Problematic | Reliable |
| Research tasks | Unusable | Good |
**Key Takeaway:** Benchmark scores don't reflect real-world usability. Same harness, dramatically different experiences.
## Source References

- DEV Community - ForgeCode vs Claude Code: https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
  - Real-world performance comparison by Liran Baba
- ForgeCode Blog - Benchmarks Don't Matter: https://forgecode.dev/blog/benchmarks-dont-matter/
  - Documentation of harness optimizations