Unify feedback file structure across harness folders

Applied unified structure template to key feedback files. Structure now includes:

1. Standard header (Model/Size/Provider/Harness/Date)
2. Quick Reference table
3. Benchmark Results (with harness+model note)
4. What Worked Well
5. Issues Encountered (with severity levels)
6. Configuration (if applicable)
7. Source References (with descriptions)

Files restructured:

- forgecode/feedback/frontier/gpt-5.4.md
- forgecode/feedback/frontier/claude-opus-4.6.md
- hermes/feedback/frontier/claude-sonnet-feedback.md

Also created FEEDBACK_TEMPLATE.md as a style guide for all future feedback files.

forgecode/feedback/frontier/claude-opus-4.6.md

# Claude Opus 4.6 with ForgeCode - Feedback Report

**Model:** Claude Opus 4.6
**Size:** [Not specified]
**Provider:** Anthropic
**Harness:** ForgeCode
**Date Compiled:** April 9, 2026
**Source References:** DEV Community (Liran Baba), ForgeCode Blog, Reddit r/ClaudeCode

---

## Quick Reference

| Attribute | Value |
|-----------|-------|
| Model | Claude Opus 4.6 |
| Provider | Anthropic |
| Context Window | 200K tokens |
| Best For | Complex reasoning, large codebases, long-horizon tasks |
| Cost | ~$15/M input, ~$75/M output |

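To put the cost row in context, a rough per-session estimate at those rates; the token counts below are illustrative assumptions, not figures from the source.

```python
# Hypothetical session cost at the listed Opus 4.6 rates
# (~$15 per million input tokens, ~$75 per million output tokens).
INPUT_RATE = 15 / 1_000_000   # dollars per input token
OUTPUT_RATE = 75 / 1_000_000  # dollars per output token

input_tokens = 120_000  # assumed: indexed context + prompts
output_tokens = 8_000   # assumed: plans, diffs, explanations

cost = input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE
print(f"~${cost:.2f}")  # ~$1.80 input + ~$0.60 output = ~$2.40
```
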
---

## Benchmark Results

### Terminal-Bench 2.0 (Harness-Specific)

- **Score:** 81.8% (tied for #1)
- **Harness:** ForgeCode
- **Comparison:** Claude Code + Opus 4.6: 58.0% (Rank #39)
- **Gap:** ~24 percentage points in favor of the ForgeCode harness
- **Note:** Score reflects harness+model combination, not raw model capability

### SWE-Bench Verified (Independent)

- **ForgeCode + Claude 4:** 72.7%
- **Claude Code + Claude 3.7 Sonnet (extended thinking):** 70.3%
- **Gap:** Only 2.4 percentage points on independent validation
- **Source:** Princeton/UChicago

### SWE-Bench Pro

- **Score:** 57.3% (Rank varies)
- **Behind:** Claude Mythos Preview (77.8%), GLM-5.1 (58.4%), GPT-5.4 (57.7%)
- **Source:** llm-stats.com

---

**Key Insight:** The benchmark gap narrows significantly on independent validation. Terminal-Bench results are self-reported by harness developers.

---

## What Worked Well

1. **Speed**
   - **Observation:** "Noticeably faster than Claude Code. Not marginal, real."
   - **Test Case:** Adding a post counter to the blog index (Astro 6, ~30 files)
     - Claude Code: ~90 seconds
     - ForgeCode + Opus 4.6: <30 seconds
   - **Consistency:** Multi-file renames, component additions, and layout restructuring all showed faster performance
   - **Why:** Rust binary vs TypeScript, a context engine that indexes signatures instead of dumping raw files (~90% context size reduction), and selective context; see the sketch after this list

2. **Multi-file Refactoring**
   - Handles complex changes across file boundaries efficiently
   - Strong understanding of Astro/React components
   - Consistently 3x faster than Claude Code on identical tasks

3. **Planning with Muse**
   - Plan output felt "more detailed and verbose than Claude Code's plan mode"

4. **Stability**
   - Excellent stability with Opus 4.6 through ForgeCode
   - No tool call failures reported (unlike the GPT 5.4 experience)
   - Consistent performance across different task types

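A minimal sketch of the signature-indexing idea from item 1, as promised above. This illustrates the concept only: ForgeCode's actual context engine is a Rust binary whose internals aren't described in these sources, so the Python `ast` approach and the function name here are assumptions.

```python
import ast

def index_signatures(source: str) -> list[str]:
    """Collect function/class signatures instead of raw file bodies,
    so the agent's context carries module boundaries, not full text."""
    signatures = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            signatures.append(f"def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            signatures.append(f"class {node.name}")
    return signatures

# A whole module collapses to a few short lines of context:
src = "class BlogIndex:\n    def post_count(self, posts):\n        return len(posts)\n"
print(index_signatures(src))  # ['class BlogIndex', 'def post_count(self, posts)']
```
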
---

## Issues Encountered

1. **Ecosystem Gaps** (Major)
   - **Description:** No IDE extensions, no hooks, no checkpoints/rewind
   - **Impact:** Less integrated workflow compared to Claude Code

2. **No Auto-Memory** (Minor)
   - **Description:** Context doesn't persist between sessions
   - **Impact:** Requires re-contextualization in new sessions

3. **No Built-in Sandbox** (Minor)
   - **Description:** Requires a manual `--sandbox` flag for isolation
   - **Impact:** Security requires explicit configuration

---

## Source References

1. **DEV Community - ForgeCode vs Claude Code:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
   - Real-world performance comparison by Liran Baba

2. **ForgeCode Blog - Benchmarks Don't Matter:** https://forgecode.dev/blog/benchmarks-dont-matter/
   - Documentation of harness optimizations and benchmark methodology

3. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
   - Community discussion on ForgeCode usage

---

forgecode/feedback/frontier/gpt-5.4.md

# GPT 5.4 with ForgeCode - Feedback Report

**Model:** GPT 5.4
**Size:** [Not specified]
**Provider:** OpenAI
**Harness:** ForgeCode
**Date Compiled:** April 9, 2026
**Source References:** DEV Community (Liran Baba), ForgeCode Blog

---

## Quick Reference

| Attribute | Value |
|-----------|-------|
| Model | GPT 5.4 |
| Provider | OpenAI |
| Context Window | 1M tokens |
| Best For | Terminal execution, speed |
| Cost | ~$10/M input, ~$30/M output |

---

## Benchmark Results

### Terminal-Bench 2.0 (Harness-Specific)

- **Score:** 81.8% (tied for #1)
- **Harness:** ForgeCode
- **Date:** March 2026
- **Note:** Self-reported by ForgeCode; score reflects harness+model combination, not raw model capability

### SWE-Bench Pro

- **Score:** 57.7% (Rank #3 overall)
- **Behind:** Claude Mythos Preview (77.8%), GLM-5.1 (58.4%)
- **Source:** llm-stats.com

---

## What Worked Well

1. **Terminal Execution Speed**
   - Fastest terminal execution among frontier models
   - 47% token reduction with tool search
   - Best price/performance ratio for terminal tasks

2. **Benchmark Performance**
   - High scores on Terminal-Bench with ForgeCode harness optimizations
   - Strong reasoning capabilities on AIME 2025, HMMT, GPQA-Diamond

---

## Issues Encountered

1. **Stability Problems** (Critical)
   - **Description:** "Borderline unusable" for research tasks
   - **Manifestation:** A 15-minute research task on a small repo failed repeatedly
   - **Symptoms:** Tool calls failing, agent stuck in retry loops, required a manual kill
   - **Quote:** "I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it."

2. **Tool Calling Reliability** (Major)
   - **Description:** Persistent tool-call errors with GPT 5.4
   - **ForgeCode Fixes Applied:**
     - Reordered JSON schema fields (`required` before `properties`)
     - Flattened nested schemas
     - Added explicit truncation reminders for partial file reads
   - **Note:** These optimizations were benchmark-specific ("benchmaxxed"); a sketch of the schema change follows this list

3. **Long-Running Task Instability** (Major)
   - **Description:** Tasks running 15+ minutes became unstable
   - **Impact:** Unpredictable failures requiring manual intervention

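As referenced in issue 2, here is a sketch of the kind of schema change those fixes describe. The tool and field names are hypothetical; only the two patterns (`required` serialized before `properties`, and nesting flattened) come from the source.

```python
import json

# Before (hypothetical): nested object schema with "required" emitted
# after "properties", the shape GPT 5.4 reportedly mishandled.
before = {
    "name": "edit_file",
    "parameters": {
        "properties": {
            "target": {
                "type": "object",
                "properties": {"path": {"type": "string"}},
            },
        },
        "required": ["target"],
    },
}

# After: "required" listed first and the nesting flattened, mirroring
# the ForgeCode fixes described above.
after = {
    "name": "edit_file",
    "parameters": {
        "required": ["path"],
        "properties": {"path": {"type": "string"}},
    },
}

# Python dicts preserve insertion order, so the "required"-first
# ordering survives serialization.
print(json.dumps(after, indent=2))
```
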
---

## Harness Optimizations

ForgeCode applied specific optimizations for GPT 5.4:

1. **Non-Interactive Mode:** System prompt rewritten to prohibit conversational branching
2. **Tool Naming:** Renaming edit tool arguments to `old_string` and `new_string` (names appearing frequently in training data) measurably dropped tool-call error rates
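
A minimal sketch of an edit-tool definition using those argument names; everything here other than the `old_string`/`new_string` naming is an assumption for illustration.

```python
# Hypothetical tool definition. Only the old_string / new_string
# argument names come from the source: names that appear frequently
# in training data and reportedly lowered tool-call error rates.
edit_tool = {
    "name": "edit",
    "description": "Replace one exact occurrence of old_string "
                   "with new_string in the file at path.",
    "parameters": {
        "required": ["path", "old_string", "new_string"],
        "properties": {
            "path": {"type": "string"},
            "old_string": {"type": "string"},
            "new_string": {"type": "string"},
        },
    },
}
```
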

---

## Comparison with Claude Opus 4.6

| Aspect | GPT 5.4 | Opus 4.6 |
|--------|---------|----------|
| Terminal-Bench 2.0 (ForgeCode) | 81.8% | 81.8% |
| Real-world stability | Poor | Excellent |
| Tool calling reliability | Problematic | Reliable |
| Research tasks | Unusable | Good |

---

## Source References

1. **DEV Community - ForgeCode vs Claude Code:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
   - Real-world performance comparison by Liran Baba

2. **ForgeCode Blog - Benchmarks Don't Matter:** https://forgecode.dev/blog/benchmarks-dont-matter/
   - Documentation of harness optimizations