Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from the respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
Claude Opus 4.6 with ForgeCode - Feedback Report
Model: Claude Opus 4.6
Provider: Anthropic
Harness: ForgeCode
Source References: DEV Community (Liran Baba), ForgeCode Blog, Reddit r/ClaudeCode
Date Compiled: April 9, 2026
Benchmark Performance
TermBench 2.0 (Self-Reported via ForgeCode)
- Score: 81.8% (tied for #1)
- Comparison: Claude Code + Opus 4.6 scored 58.0% (Rank #39)
- Gap: ~24 percentage points in favor of ForgeCode harness
SWE-bench Verified (Independent - Princeton/UChicago)
- ForgeCode + Claude 4: 72.7%
- Claude Code + Claude 3.7 Sonnet (extended thinking): 70.3%
- Gap: Only 2.4 percentage points
Key Insight: The benchmark gap narrows significantly on independent validation. TermBench 2.0 results are self-reported by ForgeCode itself.
Real-World Performance Feedback
Speed
- Observation: "Noticeably faster than Claude Code. Not marginal, real."
- Test Case: Adding post counter to blog index (Astro 6, ~30 files)
- Claude Code: ~90 seconds
- ForgeCode + Opus 4.6: <30 seconds
- Consistency: Multi-file renames, component additions, layout restructuring all showed faster performance
Why Faster
- Rust binary vs Claude Code's TypeScript runtime (faster startup, lower memory use)
- Context engine: Indexes function signatures and module boundaries instead of dumping raw files (~90% context size reduction)
- Selective context: Pulls only what the agent needs
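The signature-indexing idea above can be sketched in a few lines: instead of handing the model whole files, extract only top-level function and class signatures and measure how much smaller the resulting context is. This is a minimal Python illustration, not ForgeCode's actual Rust implementation; the `index_signatures` helper and the sample source are invented for the example.

```python
import ast

# Invented sample source standing in for one file of a project.
SOURCE = '''
def add_post_counter(posts):
    """Count posts for the blog index."""
    total = 0
    for p in posts:
        total += 1
    return total

class BlogIndex:
    def render(self, posts):
        return f"{len(posts)} posts"
'''

def index_signatures(source: str) -> str:
    """Return only function/class signatures, dropping bodies."""
    tree = ast.parse(source)
    lines = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            lines.append(f"def {node.name}({args}): ...")
        elif isinstance(node, ast.ClassDef):
            lines.append(f"class {node.name}: ...")
    return "\n".join(lines)

index = index_signatures(SOURCE)
reduction = 1 - len(index) / len(SOURCE)
print(index)
print(f"context reduced by {reduction:.0%}")
```

On a real multi-file project the body-to-signature ratio is far larger than in this toy file, which is how an index like this can plausibly reach the ~90% reduction the report cites.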
Stability
- Assessment: Excellent stability with Opus 4.6 through ForgeCode
- No tool-call failures reported (unlike the GPT 5.4 experience)
- Consistent performance across different task types
What Worked Well
- Multi-file refactoring: Handles complex changes across file boundaries efficiently
- Code comprehension: Strong understanding of Astro/React components
- Speed on complex tasks: Consistently 3x faster than Claude Code on identical tasks
- Planning with muse: Plan output felt "more detailed and verbose than Claude Code's plan mode"
Issues Encountered
- Ecosystem gaps: No IDE extensions, no hooks, no checkpoints/rewind
- No auto-memory: Context doesn't persist between sessions
- No built-in sandbox: Requires passing the --sandbox flag manually for isolation
User Workflow Integration
Current User Pattern (Liran Baba):
"I double-dip. Claude Code for my primary workflow (ecosystem, features), ForgeCode when I care about latency."
Use Cases:
- Speed-critical tasks: ForgeCode + Opus 4.6
- Complex refactoring: ForgeCode for faster iteration
- Team collaboration: Claude Code (shared CLAUDE.md, checkpoints)