Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
GPT 5.4 with ForgeCode - Feedback Report
Model: GPT 5.4
Provider: OpenAI
Harness: ForgeCode
Source References: DEV Community (Liran Baba), ForgeCode Blog
Date Compiled: April 9, 2026
Benchmark Performance
TermBench 2.0 (Self-Reported via ForgeCode)
- Score: 81.8% (tied for #1 with Opus 4.6)
- Note: Achieved through extensive harness optimizations, not raw model capability
Real-World Performance Feedback
Stability Issues
- Assessment: "Borderline unusable" for some tasks
- Specific Issue: a 15-minute research task on a small repo
  - Tool calls repeatedly failing
  - Agent stuck in retry loops
  - Required manual kill
"I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it."
Tool Calling Reliability
- Problem: Persistent tool-call errors with GPT 5.4
- ForgeCode Fixes Applied (sketched in code below):
  - Reordered JSON schema fields (`required` before `properties`)
  - Flattened nested schemas
  - Added explicit truncation reminders for partial file reads
- Result: These optimizations were benchmark-specific (described as "benchmaxxed")
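
A minimal sketch of what the first two fixes might look like on a tool schema. The `read_file` tool, its parameter names, and the schema shapes here are illustrative assumptions, not ForgeCode's actual definitions:

```python
# Hypothetical read_file tool schema; names and shape are assumptions,
# not ForgeCode's actual definitions.

# Before: "properties" serialized first, with a nested object parameter
# the model must fill in correctly.
schema_before = {
    "name": "read_file",
    "parameters": {
        "type": "object",
        "properties": {
            "target": {
                "type": "object",
                "properties": {
                    "path": {"type": "string"},
                    "offset": {"type": "integer"},
                },
            },
        },
        "required": ["target"],
    },
}

# After: "required" serialized before "properties", and the nesting
# flattened into top-level scalar fields.
schema_after = {
    "name": "read_file",
    "parameters": {
        "type": "object",
        "required": ["path"],
        "properties": {
            "path": {"type": "string"},
            "offset": {"type": "integer"},
        },
    },
}

# The third fix (truncation reminders) would apply to tool results, e.g.
# appending a note like "(file truncated; read again with offset)" to
# partial reads. The exact wording is an assumption.
```

Since Python dicts preserve insertion order, serializing `schema_after` with `json.dumps` emits `required` ahead of `properties` in the prompt text the model actually sees.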
Harness Optimizations for GPT 5.4
From ForgeCode's "Benchmarks Don't Matter" blog series:
- Non-Interactive Mode: System prompt rewritten to prohibit conversational branching
- Tool Naming: Renaming the edit tool arguments to `old_string` and `new_string` (names that appear frequently in training data) measurably dropped tool-call error rates (see the sketch after this list)
- Progressive Thinking Policy (also sketched below):
  - Messages 1-10: Very high thinking (plan formation)
  - Messages 11+: Low thinking default (execution phase)
  - Verification skill calls: Switch back to high thinking
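
A sketch of the first two optimizations, assuming a JSON-Schema-style tool definition. Only the `old_string`/`new_string` argument names come from the report; the tool name, prompt wording, and the rest of the schema are illustrative:

```python
import json

# Hypothetical non-interactive system prompt; wording is an assumption.
SYSTEM_PROMPT = (
    "You are a non-interactive coding agent. Never ask the user "
    "questions or offer alternatives; pick one approach and execute it."
)

# Hypothetical edit tool. Only the old_string/new_string argument names
# are from the report; everything else is illustrative.
EDIT_TOOL = {
    "name": "edit",
    "description": "Replace one occurrence of old_string with new_string.",
    "parameters": {
        "type": "object",
        "required": ["path", "old_string", "new_string"],
        "properties": {
            "path": {"type": "string"},
            "old_string": {"type": "string"},  # renamed from e.g. "search"
            "new_string": {"type": "string"},  # renamed from e.g. "replace"
        },
    },
}

print(json.dumps(EDIT_TOOL, indent=2))
```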
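
The progressive thinking policy can be read as a simple dispatch on message index, assuming the harness tracks how many messages deep the conversation is and flags verification skill calls. The function name and level strings here are hypothetical, not ForgeCode's API:

```python
def thinking_level(message_index: int, is_verification: bool = False) -> str:
    """Pick a reasoning-effort level for the next model call."""
    if is_verification:
        return "high"       # verification skill calls get full thinking back
    if message_index <= 10:
        return "very_high"  # plan-formation phase
    return "low"            # execution-phase default
```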
What Didn't Work Well
- Research tasks: Tool calling failures causing infinite loops
- Long-running tasks: 15+ minute tasks became unstable
- Consistency: Unpredictable failures requiring manual intervention
Comparison with Opus 4.6
| Aspect | GPT 5.4 | Opus 4.6 |
|---|---|---|
| TermBench 2.0 | 81.8% | 81.8% |
| Real-world stability | Poor | Excellent |
| Tool calling reliability | Problematic | Reliable |
| Research tasks | Unusable | Good |
Key Takeaway: Benchmark scores don't reflect real-world usability. Same harness, dramatically different experiences.