Claude Opus 4.6 with ForgeCode - Feedback Report

Model: Claude Opus 4.6
Size: [Not specified]
Provider: Anthropic
Harness: ForgeCode
Date Compiled: April 9, 2026
Source References: DEV Community (Liran Baba), ForgeCode Blog, Reddit r/ClaudeCode


Quick Reference

Attribute        Value
---------------  -------------------------------------------------------
Model            Claude Opus 4.6
Provider         Anthropic
Context Window   200K tokens
Best For         Complex reasoning, large codebases, long-horizon tasks
Cost             ~$15/M input, ~$75/M output
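
To put the listed rates in concrete terms, here is a minimal cost sketch; the token counts are hypothetical, chosen only to illustrate the arithmetic.

```python
# Estimated per-session cost at the listed Opus 4.6 rates:
# ~$15 per million input tokens, ~$75 per million output tokens.
INPUT_RATE = 15 / 1_000_000   # USD per input token
OUTPUT_RATE = 75 / 1_000_000  # USD per output token

def session_cost(input_tokens: int, output_tokens: int) -> float:
    """Return the estimated USD cost of a single session."""
    return input_tokens * INPUT_RATE + output_tokens * OUTPUT_RATE

# Hypothetical session: 120K tokens of context in, 8K tokens of edits out.
print(f"${session_cost(120_000, 8_000):.2f}")  # -> $2.40
```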

Benchmark Results

Terminal-Bench 2.0 (Harness-Specific)

  • Score: 81.8% (tied for #1)
  • Harness: ForgeCode
  • Comparison: Claude Code + Opus 4.6: 58.0% (Rank #39)
  • Gap: ~24 percentage points in favor of ForgeCode harness
  • Note: Score reflects harness+model combination, not raw model capability

SWE-Bench Verified (Independent)

  • ForgeCode + Claude 4: 72.7%
  • Claude Code + Claude 3.7 Sonnet (extended thinking): 70.3%
  • Gap: Only 2.4 percentage points on independent validation
  • Source: Princeton/UChicago

SWE-Bench Pro

  • Score: 57.3% (Rank varies)
  • Behind: Claude Mythos Preview (77.8%), GLM-5.1 (58.4%), GPT-5.4 (57.7%)
  • Source: llm-stats.com

Key Insight: The benchmark gap narrows significantly on independent validation. Terminal-Bench results are self-reported by harness developers.
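
To make the key insight concrete, a back-of-envelope comparison of the two gaps (note the SWE-Bench rows pair different models, so this is indicative rather than exact):

```python
# Scores quoted above; all values in percentage points.
tb_gap = 81.8 - 58.0    # Terminal-Bench 2.0 (self-reported):  23.8 pts
swe_gap = 72.7 - 70.3   # SWE-Bench Verified (independent):     2.4 pts

# Portion of the Terminal-Bench advantage that does not survive
# independent validation:
print(f"{tb_gap - swe_gap:.1f} pts")  # -> 21.4 pts
```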


What Worked Well

  1. Speed

    • Observation: "Noticeably faster than Claude Code. Not marginal, real."
    • Test Case: Adding post counter to blog index (Astro 6, ~30 files)
      • Claude Code: ~90 seconds
      • ForgeCode + Opus 4.6: <30 seconds
    • Consistency: Multi-file renames, component additions, layout restructuring all showed faster performance
    • Why: a Rust binary rather than a TypeScript runtime; a context engine that indexes signatures instead of full file bodies (~90% context-size reduction); selective context loading (see the sketch after this list)
  2. Multi-file Refactoring

    • Handles complex changes across file boundaries efficiently
    • Strong understanding of Astro/React components
    • Consistently 3x faster than Claude Code on identical tasks
  3. Planning with Muse

    • Plan output felt "more detailed and verbose than Claude Code's plan mode"
  4. Stability

    • Excellent stability with Opus 4.6 through ForgeCode
    • No tool call failures reported (unlike the GPT-5.4 experience)
    • Consistent performance across different task types
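
The context-engine claim in item 1 is the mechanism worth understanding: rather than sending whole files, index only signatures and fetch full bodies on demand. ForgeCode's implementation is not public in the sources cited here, so the following is a minimal Python sketch of the idea (ForgeCode itself targets multiple languages; Python's `ast` is used here only for brevity, and all names are illustrative).

```python
import ast
from pathlib import Path

def index_signatures(root: str) -> dict[str, list[str]]:
    """Map each file to its function/class signatures only.

    Dropping bodies is what produces the large context reduction
    claimed above; full source is fetched only for files the agent
    actually decides to edit.
    """
    index: dict[str, list[str]] = {}
    for path in Path(root).rglob("*.py"):
        try:
            tree = ast.parse(path.read_text(encoding="utf-8"))
        except SyntaxError:
            continue  # skip files that don't parse
        sigs = []
        for node in ast.walk(tree):
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
                args = ", ".join(a.arg for a in node.args.args)
                sigs.append(f"def {node.name}({args})")
            elif isinstance(node, ast.ClassDef):
                sigs.append(f"class {node.name}")
        if sigs:
            index[str(path)] = sigs
    return index

# The model initially sees only the skeleton, e.g.:
#   {"blog/posts.py": ["class PostIndex", "def count_posts(index_dir)"]}
```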

Issues Encountered

  1. Ecosystem Gaps (Major)

    • Description: No IDE extensions, no hooks, no checkpoints/rewind
    • Impact: Less integrated workflow compared to Claude Code
  2. No Auto-Memory (Minor)

    • Description: Context doesn't persist between sessions
    • Impact: Requires re-contextualizing the agent at the start of each new session (a manual workaround is sketched after this list)
  3. No Built-in Sandbox (Minor)

    • Description: Isolation requires passing the --sandbox flag manually
    • Impact: Security requires explicit configuration
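
The auto-memory gap in item 2 has a simple manual workaround (referenced in that item): persist a session brief yourself and paste it into the first prompt of the next session. A minimal sketch; nothing below is a ForgeCode API, and the file name is arbitrary.

```python
from datetime import datetime, timezone
from pathlib import Path

BRIEF = Path(".session-brief.md")  # arbitrary location, not a ForgeCode convention

def save_brief(summary: str) -> None:
    """Append a dated summary at the end of a working session."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%d %H:%M UTC")
    with BRIEF.open("a", encoding="utf-8") as f:
        f.write(f"\n## Session {stamp}\n{summary}\n")

def load_brief() -> str:
    """Read the accumulated brief to paste into a new session's first prompt."""
    return BRIEF.read_text(encoding="utf-8") if BRIEF.exists() else ""
```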

User Workflow Integration

Current User Pattern (Liran Baba):

"I double-dip. Claude Code for my primary workflow (ecosystem, features), ForgeCode when I care about latency."

Use Cases:

  • Speed-critical tasks: ForgeCode + Opus 4.6
  • Complex refactoring: ForgeCode for faster iteration
  • Team collaboration: Claude Code (shared CLAUDE.md, checkpoints)

Source References

  1. DEV Community - ForgeCode vs Claude Code: https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c

    • Real-world performance comparison by Liran Baba
  2. ForgeCode Blog - Benchmarks Don't Matter: https://forgecode.dev/blog/benchmarks-dont-matter/

    • Documentation of harness optimizations and benchmark methodology
  3. Reddit r/ClaudeCode: https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/

    • Community discussion on ForgeCode usage