Claude Opus 4.6 with ForgeCode - Feedback Report

Model: Claude Opus 4.6
Provider: Anthropic
Harness: ForgeCode
Source References: DEV Community (Liran Baba), ForgeCode Blog, Reddit r/ClaudeCode
Date Compiled: April 9, 2026


Benchmark Performance

TermBench 2.0 (Self-Reported via ForgeCode)

  • Score: 81.8% (tied for #1)
  • Comparison: Claude Code + Opus 4.6 scored 58.0% (Rank #39)
  • Gap: ~24 percentage points (23.8) in favor of the ForgeCode harness

SWE-bench Verified (Independent - Princeton/UChicago)

  • ForgeCode + Claude 4: 72.7%
  • Claude Code + Claude 3.7 Sonnet (extended thinking): 70.3%
  • Gap: Only 2.4 percentage points

Key Insight: The gap narrows from roughly 24 points to 2.4 under independent validation. TermBench 2.0 results are self-reported by ForgeCode itself, and the SWE-bench comparison pairs different models (Claude 4 vs. Claude 3.7 Sonnet), so neither number is a clean harness-to-harness measurement.


Real-World Performance Feedback

Speed

  • Observation: "Noticeably faster than Claude Code. Not marginal, real."
  • Test Case: Adding a post counter to a blog index (Astro 6, ~30 files)
    • Claude Code: ~90 seconds
    • ForgeCode + Opus 4.6: <30 seconds
  • Consistency: Multi-file renames, component additions, and layout restructuring all showed the same speed advantage

Why Faster

  1. Rust binary vs. Claude Code's TypeScript runtime (faster startup, lower memory use)
  2. Context engine: indexes function signatures and module boundaries instead of dumping raw files (~90% reduction in context size)
  3. Selective context: pulls only what the agent needs (both ideas are sketched below)
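
ForgeCode's context engine is closed-source Rust, so its exact mechanics aren't public. As a rough illustration of the signature-indexing and selective-retrieval ideas in items 2–3, here is a minimal Python sketch; every name in it is hypothetical and none of it is taken from ForgeCode:

```python
import ast
from pathlib import Path

# Hypothetical sketch only: ForgeCode's real engine is closed-source Rust,
# and none of these names or formats come from it.

def build_index(root: str) -> dict:
    """Map each top-level symbol to (file, start_line, end_line) so the
    model context can hold one-line signatures instead of raw files."""
    index = {}
    for path in sorted(Path(root).rglob("*.py")):
        tree = ast.parse(path.read_text(encoding="utf-8"))
        for node in tree.body:  # top-level defs mark module boundaries
            if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
                index[node.name] = (path, node.lineno, node.end_lineno)
    return index

def signature_view(index: dict) -> str:
    """Compact context: one line per symbol rather than its implementation."""
    return "\n".join(f"{name}  ({path.name}:{start}-{end})"
                     for name, (path, start, end) in index.items())

def fetch_source(index: dict, name: str) -> str:
    """Selective context: expand one symbol to full source only when the
    agent actually asks for it."""
    path, start, end = index[name]
    return "\n".join(path.read_text(encoding="utf-8").splitlines()[start - 1:end])
```

Feeding the model the `signature_view()` output and expanding individual symbols via `fetch_source()` on demand is the kind of scheme that could plausibly account for the ~90% context-size reduction claim.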

Stability

  • Assessment: Excellent stability with Opus 4.6 through ForgeCode
  • No tool-call failures reported (unlike the GPT 5.4 experience)
  • Consistent performance across different task types

What Worked Well

  1. Multi-file refactoring: Handles complex changes across file boundaries efficiently
  2. Code comprehension: Strong understanding of Astro/React components
  3. Speed on complex tasks: Consistently 3x faster than Claude Code on identical tasks
  4. Planning with muse: Plan output felt "more detailed and verbose than Claude Code's plan mode"

Issues Encountered

  1. Ecosystem gaps: No IDE extensions, no hooks, no checkpoints/rewind
  2. No auto-memory: Context doesn't persist between sessions
  3. No built-in sandbox: isolation requires passing the --sandbox flag manually

User Workflow Integration

Current User Pattern (Liran Baba):

"I double-dip. Claude Code for my primary workflow (ecosystem, features), ForgeCode when I care about latency."

Use Cases:

  • Speed-critical tasks: ForgeCode + Opus 4.6
  • Complex refactoring: ForgeCode for faster iteration
  • Team collaboration: Claude Code (shared CLAUDE.md, checkpoints)

Source References

  1. DEV Community: https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
  2. ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/
  3. Reddit r/ClaudeCode: https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/