Initial commit: coding harness feedback analysis

Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from the respective repository
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.
# Claude Opus 4.6 with ForgeCode - Feedback Report
**Model:** Claude Opus 4.6
**Provider:** Anthropic
**Harness:** ForgeCode
**Source References:** DEV Community (Liran Baba), ForgeCode Blog, Reddit r/ClaudeCode
**Date Compiled:** April 9, 2026
---
## Benchmark Performance
### TermBench 2.0 (Self-Reported by ForgeCode)
- **Score:** 81.8% (tied for #1)
- **Comparison:** Claude Code + Opus 4.6 scored 58.0% (Rank #39)
- **Gap:** ~24 percentage points in favor of the ForgeCode harness
### SWE-bench Verified (Independent - Princeton/UChicago)
- **ForgeCode + Claude 4:** 72.7%
- **Claude Code + Claude 3.7 Sonnet (extended thinking):** 70.3%
- **Gap:** Only 2.4 percentage points
**Key Insight:** The gap narrows sharply under independent evaluation: the TermBench 2.0 numbers are self-reported by ForgeCode, while the SWE-bench Verified results come from an independent Princeton/UChicago evaluation.
---
## Real-World Performance Feedback
### Speed
- **Observation:** "Noticeably faster than Claude Code. Not marginal, real."
- **Test Case:** Adding a post counter to a blog index (Astro 6, ~30 files)
- Claude Code: ~90 seconds
- ForgeCode + Opus 4.6: <30 seconds
- **Consistency:** Multi-file renames, component additions, and layout restructuring all completed faster as well
### Why Faster
1. **Rust binary** vs Claude Code's TypeScript runtime (faster startup, lower memory use)
2. **Context engine:** Indexes function signatures and module boundaries instead of dumping raw files (~90% context size reduction); see the sketch after this list
3. **Selective context:** Pulls only what the agent needs
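
How the signature-level index works internally isn't documented; the Python sketch below illustrates the general idea only (the `index_signatures`/`build_index` names and the `ast`-based approach are illustrative assumptions, not ForgeCode's actual implementation, which is a Rust binary):

```python
import ast
from pathlib import Path


def index_signatures(path: Path) -> list[str]:
    """Return one compact line per function or class definition -- the
    kind of entry a context engine could feed the model instead of
    whole file bodies."""
    tree = ast.parse(path.read_text(encoding="utf-8"))
    entries = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            args = ", ".join(a.arg for a in node.args.args)
            entries.append(f"{path}:{node.lineno}  def {node.name}({args})")
        elif isinstance(node, ast.ClassDef):
            entries.append(f"{path}:{node.lineno}  class {node.name}")
    return entries


def build_index(repo: Path) -> str:
    """Walk the repo and collect signatures; full source is pulled in
    later only for the files a task actually touches (selective context)."""
    lines: list[str] = []
    for file in sorted(repo.rglob("*.py")):
        lines.extend(index_signatures(file))
    return "\n".join(lines)
```

Each index entry is one short line versus potentially hundreds of lines of source, which is where a context reduction on the order of 90% would come from.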
### Stability
- **Assessment:** Excellent stability with Opus 4.6 through ForgeCode
- **No tool call failures reported** (unlike the GPT 5.4 experience)
- Consistent performance across different task types
---
## What Worked Well
1. **Multi-file refactoring:** Handles complex changes across file boundaries efficiently
2. **Code comprehension:** Strong understanding of Astro/React components
3. **Speed on complex tasks:** Consistently ~3x faster than Claude Code on identical tasks
4. **Planning with Muse:** Plan output felt "more detailed and verbose than Claude Code's plan mode"
---
## Issues Encountered
1. **Ecosystem gaps:** No IDE extensions, no hooks, no checkpoints/rewind
2. **No auto-memory:** Context doesn't persist between sessions; a manual workaround is sketched after this list
3. **No built-in sandbox:** Requires manual `--sandbox` flag for isolation
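
Until auto-memory lands, persistence between sessions can be approximated by hand. A minimal Python sketch (the `.forge-memory.json` filename and note structure are assumptions, not a ForgeCode convention):

```python
import json
from pathlib import Path

# Hypothetical location for hand-rolled session notes.
MEMORY_FILE = Path(".forge-memory.json")


def load_memory() -> dict:
    """Read notes saved by a previous session, if any."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text(encoding="utf-8"))
    return {"decisions": [], "open_tasks": []}


def save_memory(memory: dict) -> None:
    """Persist notes so the next session can paste them back into the
    opening prompt, approximating automatic memory."""
    MEMORY_FILE.write_text(json.dumps(memory, indent=2), encoding="utf-8")
```

The loaded notes would be prepended to the first prompt of the next session by hand or by a wrapper script.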
---
## User Workflow Integration
**Current User Pattern (Liran Baba):**
> "I double-dip. Claude Code for my primary workflow (ecosystem, features), ForgeCode when I care about latency."
**Use Cases:**
- Speed-critical tasks: ForgeCode + Opus 4.6
- Complex refactoring: ForgeCode for faster iteration
- Team collaboration: Claude Code (shared CLAUDE.md, checkpoints)
---
## Source References
1. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
2. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
3. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/