# GPT 5.4 with ForgeCode - Feedback Report

**Model:** GPT 5.4
**Size:** [Not specified]
**Provider:** OpenAI
**Harness:** ForgeCode
**Date Compiled:** April 9, 2026
**Source References:** DEV Community (Liran Baba), ForgeCode Blog

---
## Quick Reference

| Attribute | Value |
|-----------|-------|
| Model | GPT 5.4 |
| Provider | OpenAI |
| Context Window | 1M tokens |
| Best For | Terminal execution, speed |
| Cost | ~$10/M input, ~$30/M output |
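The cost row translates directly into per-task estimates. A minimal sketch, assuming the approximate rates from the table; the 200K-input / 10K-output task size is a hypothetical example, not a figure from the sources:

```python
# Rough per-task cost estimate from the ~$10/M input, ~$30/M output rates above.
INPUT_RATE = 10.0 / 1_000_000   # USD per input token
OUTPUT_RATE = 30.0 / 1_000_000  # USD per output token

def task_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single agent task."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A hypothetical 200K-input / 10K-output coding task:
print(f"${task_cost(200_000, 10_000):.2f}")  # prints $2.30
```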
---
## Benchmark Results

### Terminal-Bench 2.0 (Harness-Specific)

- **Score:** 81.8% (tied for #1)
- **Harness:** ForgeCode
- **Date:** March 2026
- **Note:** Self-reported by ForgeCode; the score reflects the harness+model combination, not raw model capability

### SWE-Bench Pro

- **Score:** 57.7% (rank #3 overall)
- **Behind:** Claude Mythos Preview (77.8%), GLM-5.1 (58.4%)
- **Source:** llm-stats.com
---
## What Worked Well

1. **Terminal Execution Speed**
   - Fastest terminal execution among frontier models
   - 47% token reduction with tool search
   - Best price/performance ratio for terminal tasks

2. **Benchmark Performance**
   - High scores on Terminal-Bench with ForgeCode harness optimizations
   - Strong reasoning capabilities on AIME 2025, HMMT, and GPQA-Diamond
---
## Issues Encountered

1. **Stability Problems** (Critical)
   - **Description:** "Borderline unusable" for research tasks
   - **Manifestation:** A 15-minute research task on a small repo failed repeatedly
   - **Symptoms:** Tool calls failing, agent stuck in retry loops, required a manual kill
   - **Quote:** "I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it."

2. **Tool Calling Reliability** (Major)
   - **Description:** Persistent tool-call errors with GPT 5.4
   - **ForgeCode Fixes Applied:**
     - Reordered JSON schema fields (`required` before `properties`)
     - Flattened nested schemas
     - Added explicit truncation reminders for partial file reads
   - **Note:** These optimizations were benchmark-specific ("benchmaxxed")
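A minimal sketch of what the reordered, flattened schema could look like. The tool name and field names here are hypothetical; the sources describe the fixes but do not show ForgeCode's actual schemas:

```python
# Hypothetical edit-tool schema illustrating the two fixes described above:
# `required` is listed before `properties`, and all fields are flat strings
# rather than nested sub-objects.
def make_edit_tool_schema() -> dict:
    return {
        "name": "edit_file",  # hypothetical tool name
        "parameters": {
            "type": "object",
            # Fix 1: `required` placed before `properties`
            "required": ["path", "old_string", "new_string"],
            "properties": {
                # Fix 2: flat string fields instead of a nested {"edit": {...}}
                "path": {"type": "string", "description": "File to modify"},
                "old_string": {"type": "string", "description": "Exact text to replace"},
                "new_string": {"type": "string", "description": "Replacement text"},
            },
        },
    }

schema = make_edit_tool_schema()
assert list(schema["parameters"]) == ["type", "required", "properties"]
```

The key-order check works because Python dicts preserve insertion order (3.7+), which is also what a JSON serializer emits.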
3. **Long-Running Task Instability** (Major)
   - **Description:** 15+ minute tasks became unstable
   - **Impact:** Unpredictable failures requiring manual intervention
---
## Harness Optimizations

ForgeCode applied specific optimizations for GPT 5.4:

1. **Non-Interactive Mode:** System prompt rewritten to prohibit conversational branching
2. **Tool Argument Naming:** Renaming the edit tool's arguments to `old_string` and `new_string` (names that appear frequently in training data) measurably dropped tool-call error rates
3. **Progressive Thinking Policy:**
   - Messages 1-10: Very high thinking (plan formation)
   - Messages 11+: Low thinking by default (execution phase)
   - Verification skill calls: Switch back to high thinking
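The progressive thinking policy above can be sketched as a simple selector. The function name, return values, and `is_verification` flag are illustrative, not ForgeCode's actual API; only the thresholds and levels come from the bullets:

```python
# Illustrative sketch of the progressive thinking policy described above.
def thinking_level(message_index: int, is_verification: bool = False) -> str:
    """Pick a thinking level for a given message in the agent loop."""
    if is_verification:
        return "high"       # verification skill calls switch back to high
    if message_index <= 10:
        return "very_high"  # messages 1-10: plan formation
    return "low"            # messages 11+: execution phase

assert thinking_level(3) == "very_high"
assert thinking_level(15) == "low"
assert thinking_level(15, is_verification=True) == "high"
```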
---
## Comparison with Claude Opus 4.6

| Aspect | GPT 5.4 | Opus 4.6 |
|--------|---------|----------|
| Terminal-Bench 2.0 (ForgeCode) | 81.8% | 81.8% |
| Real-world stability | Poor | Excellent |
| Tool calling reliability | Problematic | Reliable |
| Research tasks | Unusable | Good |
**Key Takeaway:** Benchmark scores don't capture real-world usability: with the same harness and identical Terminal-Bench scores, the two models delivered dramatically different experiences.
---
## Source References

1. **DEV Community - ForgeCode vs Claude Code:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
   - Real-world performance comparison by Liran Baba

2. **ForgeCode Blog - Benchmarks Don't Matter:** https://forgecode.dev/blog/benchmarks-dont-matter/
   - Documentation of harness optimizations