# Claude Opus 4.6 with ForgeCode - Feedback Report

**Model:** Claude Opus 4.6
**Size:** [Not specified]
**Provider:** Anthropic
**Harness:** ForgeCode
**Date Compiled:** April 9, 2026
**Source References:** DEV Community (Liran Baba), ForgeCode Blog, Reddit r/ClaudeCode

---

## Quick Reference

| Attribute | Value |
|-----------|-------|
| Model | Claude Opus 4.6 |
| Provider | Anthropic |
| Context Window | 200K tokens |
| Best For | Complex reasoning, large codebases, long-horizon tasks |
| Cost | ~$15/M input, ~$75/M output |
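
At the quoted per-token rates, a rough per-session cost works out as follows (the session sizes here are hypothetical, purely for illustration):

```python
# Quoted Opus 4.6 rates: ~$15 per million input tokens, ~$75 per million output.
INPUT_RATE = 15 / 1_000_000
OUTPUT_RATE = 75 / 1_000_000

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Approximate dollar cost of one session at the quoted rates."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# A hypothetical refactoring session: 100K tokens of context in, 5K tokens out.
print(round(session_cost(100_000, 5_000), 3))  # prints 1.875
```

Note how quickly the output side dominates: at these rates a single 5K-token response costs as much as 25K tokens of input context.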

---

## Benchmark Results

### Terminal-Bench 2.0 (Harness-Specific)

- **Score:** 81.8% (tied for #1)
- **Harness:** ForgeCode
- **Comparison:** Claude Code + Opus 4.6: 58.0% (Rank #39)
- **Gap:** ~24 percentage points in favor of the ForgeCode harness
- **Note:** Score reflects the harness+model combination, not raw model capability

### SWE-Bench Verified (Independent)

- **ForgeCode + Claude 4:** 72.7%
- **Claude Code + Claude 3.7 Sonnet (extended thinking):** 70.3%
- **Gap:** Only 2.4 percentage points on independent validation
- **Source:** Princeton/UChicago

### SWE-Bench Pro

- **Score:** 57.3% (Rank varies)
- **Behind:** Claude Mythos Preview (77.8%), GLM-5.1 (58.4%), GPT-5.4 (57.7%)
- **Source:** llm-stats.com

**Key Insight:** The benchmark gap narrows significantly on independent validation. Terminal-Bench results are self-reported by harness developers.
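
The two gap figures follow directly from the quoted scores:

```python
# Gap between the ForgeCode and Claude Code harnesses on each benchmark.
terminal_bench_gap = 81.8 - 58.0   # self-reported Terminal-Bench 2.0
swe_verified_gap = 72.7 - 70.3     # independent SWE-Bench Verified

print(round(terminal_bench_gap, 1))  # prints 23.8 (the "~24 points" above)
print(round(swe_verified_gap, 1))    # prints 2.4
```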
---
## What Worked Well

1. **Speed**
   - **Observation:** "Noticeably faster than Claude Code. Not marginal, real."
   - **Test Case:** Adding a post counter to a blog index (Astro 6, ~30 files)
     - Claude Code: ~90 seconds
     - ForgeCode + Opus 4.6: <30 seconds
   - **Consistency:** Multi-file renames, component additions, and layout restructuring all ran faster
   - **Why:** Rust binary vs. TypeScript, a context engine that indexes signatures (~90% context-size reduction), selective context loading

2. **Multi-file Refactoring**
   - Handles complex changes across file boundaries efficiently
   - Strong understanding of Astro/React components
   - Consistently 3x faster than Claude Code on identical tasks

3. **Planning with Muse**
   - Plan output felt "more detailed and verbose than Claude Code's plan mode"

4. **Stability**
   - Excellent stability with Opus 4.6 through ForgeCode
   - No tool-call failures reported (unlike the GPT-5.4 experience)
   - Consistent performance across different task types
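
The "indexes signatures" idea from the Speed item can be sketched in a few lines. This is a toy Python illustration of the concept only, not ForgeCode's actual Rust implementation:

```python
import ast

def signature_index(source: str) -> list[str]:
    """Extract top-level signatures from Python source, discarding bodies.
    A toy version of signature-based context indexing: the agent sees a
    file's API surface at a fraction of its full token count."""
    sigs = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"def {node.name}({args}): ...")
        elif isinstance(node, ast.ClassDef):
            sigs.append(f"class {node.name}: ...")
    return sigs

code = '''
class Blog:
    def post_count(self, posts):
        return len([p for p in posts if p.published])
'''
print(signature_index(code))  # prints ['class Blog: ...', 'def post_count(self, posts): ...']
```

Shipping only such signatures instead of full file bodies is one plausible way a harness could achieve the claimed ~90% context-size reduction on large files.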
---
## Issues Encountered

1. **Ecosystem Gaps** (Major)
   - **Description:** No IDE extensions, no hooks, no checkpoints/rewind
   - **Impact:** Less integrated workflow compared to Claude Code

2. **No Auto-Memory** (Minor)
   - **Description:** Context doesn't persist between sessions
   - **Impact:** Each new session requires re-establishing context

3. **No Built-in Sandbox** (Minor)
   - **Description:** Isolation requires passing the `--sandbox` flag manually
   - **Impact:** Security requires explicit configuration

---

## User Workflow Integration

**Current User Pattern (Liran Baba):**

> "I double-dip. Claude Code for my primary workflow (ecosystem, features), ForgeCode when I care about latency."

**Use Cases:**

- Speed-critical tasks: ForgeCode + Opus 4.6
- Complex refactoring: ForgeCode for faster iteration
- Team collaboration: Claude Code (shared CLAUDE.md, checkpoints)

---

## Source References

1. **DEV Community - ForgeCode vs Claude Code**: https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
   - Real-world performance comparison by Liran Baba

2. **ForgeCode Blog - Benchmarks Don't Matter**: https://forgecode.dev/blog/benchmarks-dont-matter/
   - Documentation of harness optimizations and benchmark methodology

3. **Reddit r/ClaudeCode**: https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
   - Community discussion on ForgeCode usage