Initial commit: coding harness feedback analysis

Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
2026-04-09 15:13:45 +02:00
commit 51123212c4
46 changed files with 7213 additions and 0 deletions
@@ -0,0 +1,77 @@
+# GPT 5.4 with ForgeCode - Feedback Report
+
+**Model:** GPT 5.4  
+**Provider:** OpenAI  
+**Harness:** ForgeCode  
+**Source References:** DEV Community (Liran Baba), ForgeCode Blog  
+**Date Compiled:** April 9, 2026
+
+---
+
+## Benchmark Performance
+
+### TermBench 2.0 (Self-Reported via ForgeCode)
+- **Score:** 81.8% (tied for #1 with Opus 4.6)
+- **Note:** Achieved through extensive harness optimizations, not raw model capability
+
+---
+
+## Real-World Performance Feedback
+
+### Stability Issues
+- **Assessment:** "Borderline unusable" for some tasks
+- **Specific Issue:** 15-minute research task on small repo
+  - Tool calls repeatedly failing
+  - Agent stuck in retry loops
+  - Required manual kill
+
+> "I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it."
+
+### Tool Calling Reliability
+- **Problem:** Persistent tool-call errors with GPT 5.4
+- **ForgeCode Fixes Applied:**
+  1. Reordered JSON schema fields (`required` before `properties`)
+  2. Flattened nested schemas
+  3. Added explicit truncation reminders for partial file reads
+- **Caveat:** These fixes were tuned to the benchmark (described as "benchmaxxed")
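
The first two fixes listed above can be sketched in Python. This is a minimal illustration, not ForgeCode's actual code: the helper names `reorder_schema` and `flatten_schema`, the sample `EDIT_TOOL` schema, and the underscore-joined names for flattened fields are all assumptions made for the example.

```python
import json

# Hypothetical tool schema used only for illustration.
EDIT_TOOL = {
    "type": "object",
    "properties": {
        "path": {"type": "string"},
        "range": {
            "type": "object",
            "properties": {
                "start": {"type": "integer"},
                "end": {"type": "integer"},
            },
            "required": ["start"],
        },
    },
    "required": ["path", "range"],
}


def reorder_schema(schema: dict) -> dict:
    """Return a copy whose keys are emitted as type, required, properties, rest."""
    priority = ["type", "required", "properties"]
    ordered = {key: schema[key] for key in priority if key in schema}
    ordered.update({k: v for k, v in schema.items() if k not in ordered})
    if "properties" in ordered:
        # Apply the same ordering to every nested property schema.
        ordered["properties"] = {
            name: reorder_schema(sub) if isinstance(sub, dict) else sub
            for name, sub in ordered["properties"].items()
        }
    return ordered


def flatten_schema(schema: dict, sep: str = "_") -> dict:
    """Hoist nested object properties to the top level, joining names with sep."""
    props: dict = {}
    required: list = []

    def walk(node: dict, prefix: str, parent_required: bool) -> None:
        node_required = set(node.get("required", []))
        for name, sub in node.get("properties", {}).items():
            flat_name = f"{prefix}{sep}{name}" if prefix else name
            # A leaf stays required only if every ancestor was required too.
            is_required = parent_required and name in node_required
            if sub.get("type") == "object" and "properties" in sub:
                walk(sub, flat_name, is_required)
            else:
                props[flat_name] = sub
                if is_required:
                    required.append(flat_name)

    walk(schema, "", True)
    # Build the result with `required` ahead of `properties`.
    return {"type": "object", "required": required, "properties": props}


if __name__ == "__main__":
    print(json.dumps(flatten_schema(EDIT_TOOL), indent=2))
```

Python dicts preserve insertion order and `json.dumps` serializes in that order, so the key order built here is the order the model would see in the serialized tool definition.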
+
+---
+
+## Harness Optimizations for GPT 5.4
+
+From ForgeCode's "Benchmarks Don't Matter" blog series:
+
+1. **Non-Interactive Mode:** System prompt rewritten to prohibit conversational branching
+2. **Tool Naming:** Renaming the edit tool's arguments to `old_string` and `new_string` (names that appear frequently in training data) measurably reduced tool-call error rates
+3. **Progressive Thinking Policy:**
+   - Messages 1-10: Very high thinking (plan formation)
+   - Messages 11+: Low thinking default (execution phase)
+   - Verification skill calls: Switch back to high thinking
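
The progressive thinking policy above can be written as a small selection function. This is a hedged sketch rather than ForgeCode's implementation: the function name `thinking_level`, the level labels, and the choice to keep the planning-phase level when a verification call happens within the first ten messages are assumptions, since the report does not specify them.

```python
PLANNING_MESSAGES = 10  # messages 1-10 form the planning phase, per the report


def thinking_level(message_index: int, is_verification: bool = False) -> str:
    """Pick a reasoning-effort label for the given 1-based message index."""
    if message_index < 1:
        raise ValueError("message_index is 1-based")
    if message_index <= PLANNING_MESSAGES:
        # Planning phase: always the highest level (assumed to win over verification).
        return "very_high"
    if is_verification:
        # Verification skill calls switch back to high thinking.
        return "high"
    # Execution phase default.
    return "low"


if __name__ == "__main__":
    for index, verifying in [(1, False), (11, False), (25, True)]:
        print(index, verifying, thinking_level(index, verifying))
```

Keeping the policy in one pure function makes the thresholds easy to adjust and test independently of the agent loop.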
+
+---
+
+## What Didn't Work Well
+
+1. **Research tasks:** Tool-call failures left the agent stuck in retry loops
+2. **Long-running tasks:** The reported 15-minute task became unstable and had to be killed
+3. **Consistency:** Unpredictable failures requiring manual intervention
+
+---
+
+## Comparison with Opus 4.6
+
+| Aspect | GPT 5.4 | Opus 4.6 |
+|--------|---------|----------|
+| TermBench 2.0 | 81.8% | 81.8% |
+| Real-world stability | Poor | Excellent |
+| Tool calling reliability | Problematic | Reliable |
+| Research tasks | Unusable | Good |
+
+**Key Takeaway:** Benchmark scores don't reflect real-world usability. Same harness, dramatically different experiences.
+
+---
+
+## Source References
+
+1. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
+2. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/