Unify feedback file structure across harness folders

Applied unified structure template to key feedback files:

Structure now includes:
1. Standard header (Model/Size/Provider/Harness/Date)
2. Quick Reference table
3. Benchmark Results (with harness+model note)
4. What Worked Well
5. Issues Encountered (with severity levels)
6. Configuration (if applicable)
7. Source References (with descriptions)

Files restructured:
- forgecode/feedback/frontier/gpt-5.4.md
- forgecode/feedback/frontier/claude-opus-4.6.md
- hermes/feedback/frontier/claude-sonnet-feedback.md

Also created FEEDBACK_TEMPLATE.md as a style guide for all future feedback files.
2026-04-09 16:12:52 +02:00
parent f561bed731
commit b012a406c7
4 changed files with 375 additions and 126 deletions
@@ -1,64 +1,88 @@
 # Claude Opus 4.6 with ForgeCode - Feedback Report

 **Model:** Claude Opus 4.6  
+**Size:** [Not specified]  
 **Provider:** Anthropic  
 **Harness:** ForgeCode  
-**Source References:** DEV Community (Liran Baba), ForgeCode Blog, Reddit r/ClaudeCode  
-**Date Compiled:** April 9, 2026
+**Date Compiled:** April 9, 2026  
+**Source References:** DEV Community (Liran Baba), ForgeCode Blog, Reddit r/ClaudeCode

 ---

-## Benchmark Performance
+## Quick Reference

-### TermBench 2.0 (Self-Reported via ForgeCode)
+| Attribute | Value |
+|-----------|-------|
+| Model | Claude Opus 4.6 |
+| Provider | Anthropic |
+| Context Window | 200K tokens |
+| Best For | Complex reasoning, large codebases, long-horizon tasks |
+| Cost | ~$15/M input, ~$75/M output |
+
+---
+
+## Benchmark Results
+
+### Terminal-Bench 2.0 (Harness-Specific)
 - **Score:** 81.8% (tied for #1)
- **Comparison:** Claude Code + Opus 4.6 scored 58.0% (Rank #39)
+- **Harness:** ForgeCode
+- **Comparison:** Claude Code + Opus 4.6: 58.0% (Rank #39)
 - **Gap:** ~24 percentage points in favor of ForgeCode harness
+- **Note:** Score reflects harness+model combination, not raw model capability

-### SWE-bench Verified (Independent - Princeton/UChicago)
+### SWE-Bench Verified (Independent)
 - **ForgeCode + Claude 4:** 72.7%
 - **Claude Code + Claude 3.7 Sonnet (extended thinking):** 70.3%
- **Gap:** Only 2.4 percentage points
+- **Gap:** Only 2.4 percentage points on independent validation
+- **Source:** Princeton/UChicago

-**Key Insight:** The benchmark gap narrows significantly on independent validation. TermBench 2.0 results are self-reported by ForgeCode itself.
+### SWE-Bench Pro
+- **Score:** 57.3% (Rank varies)
+- **Behind:** Claude Mythos Preview (77.8%), GLM-5.1 (58.4%), GPT-5.4 (57.7%)
+- **Source:** llm-stats.com

---
-
-## Real-World Performance Feedback
-
-### Speed
- **Observation:** "Noticeably faster than Claude Code. Not marginal, real."
- **Test Case:** Adding post counter to blog index (Astro 6, ~30 files)
-  - Claude Code: ~90 seconds
-  - ForgeCode + Opus 4.6: <30 seconds
- **Consistency:** Multi-file renames, component additions, layout restructuring all showed faster performance
-
-### Why Faster
-1. **Rust binary** vs Claude Code's TypeScript (better startup/memory)
-2. **Context engine:** Indexes function signatures and module boundaries instead of dumping raw files (~90% context size reduction)
-3. **Selective context:** Pulls only what the agent needs
-
-### Stability
- **Assessment:** Excellent stability with Opus 4.6 through ForgeCode
- **No tool call failures reported** (unlike GPT 5.4 experience)
- Consistent performance across different task types
+**Key Insight:** The benchmark gap narrows significantly on independent validation. Terminal-Bench results are self-reported by harness developers.

 ---

 ## What Worked Well

-1. **Multi-file refactoring:** Handles complex changes across file boundaries efficiently
-2. **Code comprehension:** Strong understanding of Astro/React components
-3. **Speed on complex tasks:** Consistently 3x faster than Claude Code on identical tasks
-4. **Planning with muse:** Plan output felt "more detailed and verbose than Claude Code's plan mode"
+1. **Speed**
+   - **Observation:** "Noticeably faster than Claude Code. Not marginal, real."
+   - **Test Case:** Adding post counter to blog index (Astro 6, ~30 files)
+     - Claude Code: ~90 seconds
+     - ForgeCode + Opus 4.6: <30 seconds
+   - **Consistency:** Multi-file renames, component additions, layout restructuring all showed faster performance
+   - **Why:** Rust binary vs TypeScript, context engine indexes signatures (~90% size reduction), selective context
+
+2. **Multi-file Refactoring**
+   - Handles complex changes across file boundaries efficiently
+   - Strong understanding of Astro/React components
+   - Consistently 3x faster than Claude Code on identical tasks
+
+3. **Planning with Muse**
+   - Plan output felt "more detailed and verbose than Claude Code's plan mode"
+
+4. **Stability**
+   - Excellent stability with Opus 4.6 through ForgeCode
+   - No tool call failures reported (unlike the GPT-5.4 experience)
+   - Consistent performance across different task types

 ---

 ## Issues Encountered

-1. **Ecosystem gaps:** No IDE extensions, no hooks, no checkpoints/rewind
-2. **No auto-memory:** Context doesn't persist between sessions
-3. **No built-in sandbox:** Requires manual `--sandbox` flag for isolation
+1. **Ecosystem Gaps** (Major)
+   - **Description:** No IDE extensions, no hooks, no checkpoints/rewind
+   - **Impact:** Less integrated workflow compared to Claude Code
+
+2. **No Auto-Memory** (Minor)
+   - **Description:** Context doesn't persist between sessions
+   - **Impact:** Requires re-contextualization on new sessions
+
+3. **No Built-in Sandbox** (Minor)
+   - **Description:** Requires manual `--sandbox` flag for isolation
+   - **Impact:** Security requires explicit configuration

 ---

@@ -76,6 +100,11 @@

 ## Source References

-1. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
-2. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
-3. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
+1. **DEV Community - ForgeCode vs Claude Code**: https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
+   - Real-world performance comparison by Liran Baba
+
+2. **ForgeCode Blog - Benchmarks Don't Matter**: https://forgecode.dev/blog/benchmarks-dont-matter/
+   - Documentation of harness optimizations and benchmark methodology
+
+3. **Reddit r/ClaudeCode**: https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
+   - Community discussion on ForgeCode usage
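
The unified structure listed in the commit message could be checked mechanically across harness folders. The following is a minimal sketch, not the author's tooling: `check_feedback_file`, `REQUIRED_SECTIONS`, and `HEADER_FIELDS` are hypothetical names, and the heading text is taken from the restructured Claude Opus 4.6 file in this diff on the assumption that other feedback files use the same headings. The contents of FEEDBACK_TEMPLATE.md are not shown here, so the template may require more than this. Configuration is skipped because the commit message marks it "if applicable".

```python
import re

# Required "## " headings, in the order given in the commit message.
# Heading text is assumed from the restructured file shown in this diff.
REQUIRED_SECTIONS = [
    "Quick Reference",
    "Benchmark Results",
    "What Worked Well",
    "Issues Encountered",
    "Source References",
]

# Bold header fields; "Date" also matches "Date Compiled".
HEADER_FIELDS = ["Model", "Size", "Provider", "Harness", "Date"]


def check_feedback_file(text: str) -> list[str]:
    """Return a list of problems; an empty list means the file matches."""
    problems = []

    # Each header field must start a line as **Field:** or **Field ...:**
    for field in HEADER_FIELDS:
        pattern = rf"^\*\*{re.escape(field)}[^*:]*:\*\*"
        if not re.search(pattern, text, re.MULTILINE):
            problems.append(f"missing header field: {field}")

    # Collect second-level headings in document order.
    headings = re.findall(r"^## (.+?)[ \t]*$", text, re.MULTILINE)

    positions = []
    for section in REQUIRED_SECTIONS:
        if section in headings:
            positions.append(headings.index(section))
        else:
            problems.append(f"missing section: {section}")

    # Sections that are present must appear in template order.
    if positions != sorted(positions):
        problems.append("sections out of order")

    return problems
```

A caller could read each `feedback/**/*.md` file and print the returned problems. Returning a list rather than raising on the first mismatch lets one pass report everything that needs restructuring in a file.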