Claude Opus 4.6 with ForgeCode - Feedback Report
Model: Claude Opus 4.6
Size: [Not specified]
Provider: Anthropic
Harness: ForgeCode
Date Compiled: April 9, 2026
Source References: DEV Community (Liran Baba), ForgeCode Blog, Reddit r/ClaudeCode
Quick Reference
| Attribute | Value |
|---|---|
| Model | Claude Opus 4.6 |
| Provider | Anthropic |
| Context Window | 200K tokens |
| Best For | Complex reasoning, large codebases, long-horizon tasks |
| Cost | ~$15/M input, ~$75/M output |
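The per-token prices in the table translate into a per-request cost with simple arithmetic. The sketch below uses the approximate rates listed above; the helper name `estimate_cost_usd` and the example token counts are illustrative assumptions, not figures from the sources.

```python
# Approximate prices from the Quick Reference table (USD per million tokens).
INPUT_USD_PER_M = 15.0
OUTPUT_USD_PER_M = 75.0


def estimate_cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts (hypothetical helper)."""
    return (input_tokens * INPUT_USD_PER_M + output_tokens * OUTPUT_USD_PER_M) / 1_000_000


# Illustrative token counts (assumed, not measured): 120K input, 4K output.
print(f"${estimate_cost_usd(120_000, 4_000):.2f}")  # $2.10
```

Because output tokens cost about five times as much as input tokens at these rates, verbose plans and long diffs can dominate the bill even when the prompt is large.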
Benchmark Results
Terminal-Bench 2.0 (Harness-Specific)
- Score: 81.8% (tied for #1)
- Harness: ForgeCode
- Comparison: Claude Code + Opus 4.6: 58.0% (Rank #39)
- Gap: ~24 percentage points in favor of ForgeCode harness
- Note: Score reflects harness+model combination, not raw model capability
SWE-Bench Verified (Independent)
- ForgeCode + Claude 4: 72.7%
- Claude Code + Claude 3.7 Sonnet (extended thinking): 70.3%
- Gap: Only 2.4 percentage points on independent validation (note that the two entries use different underlying models)
- Source: Princeton/UChicago
SWE-Bench Pro
- Score: 57.3% (Rank varies)
- Behind: Claude Mythos Preview (77.8%), GLM-5.1 (58.4%), GPT-5.4 (57.7%)
- Source: llm-stats.com
Key Insight: The benchmark gap narrows significantly on independent validation. Terminal-Bench results are self-reported by harness developers.
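The two gaps quoted in this section can be reproduced directly from the listed scores. This is a small sketch for checking the arithmetic; the dictionary names and the `gap_pp` helper are illustrative, and the numbers are copied from the bullets above.

```python
# Scores copied from the Benchmark Results section (percent).
terminal_bench = {"ForgeCode + Opus 4.6": 81.8, "Claude Code + Opus 4.6": 58.0}
swe_bench_verified = {"ForgeCode + Claude 4": 72.7, "Claude Code + Claude 3.7 Sonnet": 70.3}


def gap_pp(scores: dict[str, float]) -> float:
    """Return best minus worst score in percentage points, rounded to 0.1."""
    return round(max(scores.values()) - min(scores.values()), 1)


print(gap_pp(terminal_bench))      # 23.8
print(gap_pp(swe_bench_verified))  # 2.4
```

The Terminal-Bench gap is roughly ten times the SWE-Bench Verified gap, which is the narrowing the Key Insight refers to.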
What Worked Well
- Speed
  - Observation: "Noticeably faster than Claude Code. Not marginal, real."
  - Test Case: Adding post counter to blog index (Astro 6, ~30 files)
    - Claude Code: ~90 seconds
    - ForgeCode + Opus 4.6: <30 seconds
  - Consistency: Multi-file renames, component additions, layout restructuring all showed faster performance
  - Why: Rust binary vs TypeScript, context engine indexes signatures (~90% size reduction), selective context
- Multi-file Refactoring
  - Handles complex changes across file boundaries efficiently
  - Strong understanding of Astro/React components
  - Roughly 3x faster than Claude Code on identical tasks (about 90 seconds vs. under 30 seconds in the reported test case)
- Planning with Muse
  - Plan output felt "more detailed and verbose than Claude Code's plan mode"
- Stability
  - Excellent stability with Opus 4.6 through ForgeCode
  - No tool call failures reported (unlike the GPT-5.4 experience)
  - Consistent performance across different task types
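The "Why" bullet under Speed attributes part of the gain to a context engine that indexes signatures rather than full file contents. The sources do not describe how ForgeCode implements this, so the following is only an illustration of the general idea, written in Python with the standard `ast` module. The function name `index_signatures` and the sample source are hypothetical.

```python
import ast


def index_signatures(source: str) -> list[str]:
    """Return one line per top-level function or class, dropping the bodies."""
    signatures = []
    for node in ast.parse(source).body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            prefix = "async def" if isinstance(node, ast.AsyncFunctionDef) else "def"
            signatures.append(f"{prefix} {node.name}({ast.unparse(node.args)})")
        elif isinstance(node, ast.ClassDef):
            signatures.append(f"class {node.name}")
    return signatures


# Hypothetical file contents, used only to demonstrate the idea.
SAMPLE = '''
def count_posts(posts, include_drafts=False):
    total = 0
    for post in posts:
        if include_drafts or not post.draft:
            total += 1
    return total

class BlogIndex:
    def render(self):
        return "<ul></ul>"
'''

index = index_signatures(SAMPLE)
print(index)
```

An index like this keeps names and parameters, which is usually enough for a model to decide which files to open, while omitting the bodies that make up most of the token count.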
Issues Encountered
- Ecosystem Gaps (Major)
  - Description: No IDE extensions, no hooks, no checkpoints/rewind
  - Impact: Less integrated workflow compared to Claude Code
- No Auto-Memory (Minor)
  - Description: Context doesn't persist between sessions
  - Impact: Requires re-contextualization on new sessions
- No Built-in Sandbox (Minor)
  - Description: Requires manual `--sandbox` flag for isolation
  - Impact: Security requires explicit configuration
User Workflow Integration
Current User Pattern (Liran Baba):
"I double-dip. Claude Code for my primary workflow (ecosystem, features), ForgeCode when I care about latency."
Use Cases:
- Speed-critical tasks: ForgeCode + Opus 4.6
- Complex refactoring: ForgeCode for faster iteration
- Team collaboration: Claude Code (shared CLAUDE.md, checkpoints)
Source References
- DEV Community - ForgeCode vs Claude Code: https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
  - Real-world performance comparison by Liran Baba
- ForgeCode Blog - Benchmarks Don't Matter: https://forgecode.dev/blog/benchmarks-dont-matter/
  - Documentation of harness optimizations and benchmark methodology
- Reddit r/ClaudeCode: https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
  - Community discussion on ForgeCode usage