# ForgeCode Benchmark Controversy - Feedback Report

**Topic:** TermBench 2.0 results, self-reported vs. independent validation, "benchmaxxing"

**Source References:** Reddit r/ClaudeCode, DEV Community, llm-stats.com, arXiv

**Date Compiled:** April 9, 2026

---

## The Controversy

ForgeCode scored **81.8% on TermBench 2.0** with both GPT 5.4 and Opus 4.6 (tied at #1), far outperforming Claude Code's 58.0% with the same Opus 4.6 model. This raised questions about:

1. Self-reported vs. independent validation
2. Benchmark-specific optimizations ("benchmaxxing")
3. Proprietary layer involvement

---

## TermBench 2.0 Results

### Self-Reported (via ForgeCode at tbench.ai)

| Configuration | Score | Rank |
|--------------|-------|------|
| ForgeCode + GPT 5.4 | 81.8% | #1 |
| ForgeCode + Opus 4.6 | 81.8% | #1 |
| Claude Code + Opus 4.6 | 58.0% | #39 |

### Independent SWE-bench (Princeton/UChicago)

| Configuration | Score |
|--------------|-------|
| ForgeCode + Claude 4 | 72.7% |
| Claude 3.7 Sonnet (extended thinking) | 70.3% |
| Claude 4.5 Opus | 76.8% |

**The ~24-point lead on self-reported TermBench (81.8% vs. 58.0%) shrinks to 2.4 points on independent SWE-bench (72.7% for ForgeCode + Claude 4 vs. 70.3% for Claude 3.7 Sonnet with extended thinking).**

---

## Community Skepticism

### Reddit r/ClaudeCode

> "Looks this agent forgecode.dev ranks better than anyone in terminalbench but anyone is talking about it. Is it fake or what is wrong with these 'artifacts' that promise save time and tokens"

> "At this point, terminalbench has received quite some attention and most benchmarks are not validated."

> "I am concerned about their proprietary layer, which I believe is a big part of what moved their bench scores from ~25% to ~81%."

### The "Benchmaxxing" Term
|
||||
Community coined "benchmaxxed" to describe ForgeCode's approach:
|
||||
- Real engineering improvements
|
||||
- Also benchmark-specific optimizations
|
||||
- Not necessarily representative of real-world performance
|
||||
|
||||
---

## ForgeCode's Defense

### Blog Series: "Benchmarks Don't Matter — Until They Do"

ForgeCode documented its score progression openly:

- **Baseline:** ~25% (interactive-first runtime)
- **Stabilization:** ~38% (non-interactive mode + tool naming fixes)
- **Planning control:** 66% (mandatory todo_write enforcement; see the sketch below)
- **Speed architecture:** 78.4% (subagent parallelization + progressive thinking)
- **Final:** 81.8% (additional optimizations)
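
The blog does not publish the enforcement code, so the following is only a minimal sketch of what gating other tools behind a written plan could look like, assuming a simple agent loop in which every tool call passes through a guard. `ToolCall`, `PlanFirstGate`, and the feedback message are hypothetical names, not ForgeCode's API.

```typescript
// Hypothetical sketch of mandatory plan-first enforcement.
// None of these names come from ForgeCode's actual codebase.

interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

class PlanFirstGate {
  private planWritten = false;

  // Reject every tool call until the model has recorded a plan via todo_write.
  check(call: ToolCall): { allowed: boolean; feedback?: string } {
    if (call.name === "todo_write") {
      this.planWritten = true;
      return { allowed: true };
    }
    if (!this.planWritten) {
      return {
        allowed: false,
        feedback: "Call todo_write with your task plan before using other tools.",
      };
    }
    return { allowed: true };
  }
}
```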

### Documented Optimizations

1. **JSON schema reordering:** `required` listed before `properties` in tool schemas sent to GPT 5.4 (see the sketch below)
2. **Schema flattening:** reduced parameter nesting
3. **Truncation reminders:** explicit notes when a file is only partially read
4. **Mandatory verification:** a reviewer skill checks completion
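
The first three optimizations are concrete enough to illustrate. This TypeScript sketch assumes an OpenAI-style `parameters` object; the `read_file` tool and the `renderReadResult` helper are invented for the example and are not ForgeCode's actual code.

```typescript
// Illustrative tool definition, not ForgeCode's code.
// JSON.stringify preserves string-key insertion order, so putting
// `required` first in the literal puts it first in the serialized
// schema the model sees (optimization 1).
const readFileTool = {
  name: "read_file",
  description: "Read a file from the workspace, optionally by line range.",
  parameters: {
    type: "object",
    required: ["path"], // listed before `properties` (schema reordering)
    properties: {
      // Flat, single-level parameters instead of nested objects
      // (optimization 2: schema flattening).
      path: { type: "string", description: "Workspace-relative file path" },
      offset: { type: "integer", description: "First line to return (1-based)" },
      limit: { type: "integer", description: "Maximum number of lines to return" },
    },
  },
};

// Optimization 3: append an explicit reminder whenever a read is
// truncated, so the model never assumes it saw the whole file.
function renderReadResult(content: string, truncated: boolean): string {
  return truncated
    ? `${content}\n[NOTE: output truncated; re-read with a larger limit or an offset to see the rest]`
    : content;
}
```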

---

## The Proprietary Layer Question

**ForgeCode Services** (optional, free during evaluation) includes:

1. Semantic entry-point discovery
2. Dynamic skill loading
3. Tool-call correction layer (see the sketch below)
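
The discussion names a tool-call correction layer but does not describe its internals, so the following is only a guess at the general shape: a pass that maps aliased tool names and mis-named parameters onto the schema the harness actually exposes. Every identifier here is hypothetical.

```typescript
// Hypothetical sketch of a tool-call correction layer; ForgeCode has
// not published how theirs works. All names here are invented.

interface ToolCall {
  name: string;
  arguments: Record<string, unknown>;
}

// Common model mistakes: aliased tool names the harness can map back.
const TOOL_ALIASES: Record<string, string> = {
  read: "read_file",
  open_file: "read_file",
  write: "write_file",
};

function correctToolCall(call: ToolCall, knownTools: Set<string>): ToolCall {
  // Repair the tool name if the model used a known alias.
  const name = knownTools.has(call.name)
    ? call.name
    : TOOL_ALIASES[call.name] ?? call.name;

  // Repair a common parameter mistake: `file` instead of `path`.
  const args: Record<string, unknown> = { ...call.arguments };
  if ("file" in args && !("path" in args)) {
    args["path"] = args["file"];
    delete args["file"];
  }

  return { name, arguments: args };
}

// Example: { name: "open_file", arguments: { file: "src/main.ts" } }
// becomes { name: "read_file", arguments: { path: "src/main.ts" } }.
```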

**Concern:** These services were used for the benchmark evaluations but differ from the open-source CLI mode.

**Clarification from Discussion #2545:**

> "Regarding the benchmarks: the setup we used for those evaluations includes ForgeCode Services, which provides multiple improvements over the existing CLI harness."

---

## Independent Terminal-Bench Data

From llm-stats.com (April 9, 2026):

- **23 models evaluated**
- **Average score:** 0.345 (34.5%)
- **Best score:** 0.500 (50.0%) - Claude Sonnet 4.5
- **All results self-reported** (0 verified)

**Top 3:**

1. Claude Sonnet 4.5: 50.0%
2. MiniMax M2.1: 47.9%
3. Kimi K2-Thinking: 47.1%

**Note:** ForgeCode's 81.8% is not on this independent leaderboard; it was self-reported on tbench.ai.

---

## Academic Validation

### Terminal-Bench Paper (ICLR 2026)

From arXiv:2601.11868:

> "Nonetheless, the benchmark developers have invested considerable effort in mitigating reproducibility issues through containerization, repeated runs, and reporting of confidence intervals. We found no major issues. Independent inspection of the released tasks confirmed that they are well-specified and largely free of ambiguity or underspecification."

**Key Point:** The benchmark itself is well-constructed; the question is about harness-specific optimizations.

---

## Key Takeaways

1. **Benchmarks can be gamed:** the documented optimizations show how harness engineering affects scores
2. **Independent validation matters:** the ~24-point gap shrinks to 2.4 points on independent tests
3. **Proprietary layers complicate comparisons:** the services used for the benchmark runs differ from the open-source code
4. **Real-world != benchmark:** GPT 5.4 scored 81.8% but was "borderline unusable" in practice

---

## Recommendations for Benchmark Consumers

1. **Look for independent validation:** prefer SWE-bench results to self-reported TermBench scores
2. **Test on your own tasks:** benchmarks don't capture all failure modes
3. **Consider harness transparency:** open-source vs. proprietary optimizations
4. **Beware benchmaxxing:** optimizations may not generalize

---

## Source References

1. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
2. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
3. **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
4. **arXiv Paper:** https://arxiv.org/html/2601.11868v1
5. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
6. **GitHub Discussion:** https://github.com/antinomyhq/forgecode/discussions/2545