ForgeCode Benchmark Controversy - Feedback Report
Topic: TermBench 2.0 results, self-reported vs independent validation, "benchmaxxing"
Source References: Reddit r/ClaudeCode, DEV Community, llm-stats.com, arXiv
Date Compiled: April 9, 2026
The Controversy
ForgeCode achieved 81.8% on TermBench 2.0 with both GPT 5.4 and Claude Opus 4.6, significantly outperforming Claude Code's 58.0% with the same Opus 4.6 model. This raised questions about:
- Self-reported vs. independent validation
- Benchmark-specific optimizations ("benchmaxxing")
- Proprietary layer involvement
TermBench 2.0 Results
Current Leaderboard (Harness + Model Combinations)
Important: Terminal-Bench measures agent harness + model combinations, not raw model capability.
| Rank | Harness | Model | Score | Date |
|---|---|---|---|---|
| 1 | Pilot | Claude Opus 4.6 | 82.9% | 2026-04-01 |
| 2 | ForgeCode | GPT 5.4 | 81.8% | 2026-03-12 |
| 3 | ForgeCode | Claude Opus 4.6 | 81.8% | 2026-03-12 |
| 4 | TongAgents | Gemini 3.1 Pro | 80.2% | 2026-03-13 |
| 5 | SageAgent | GPT-5.3-Codex | 78.4% | 2026-03-13 |
Source: https://www.tbench.ai/leaderboard/terminal-bench/2.0
Self-Reported (via ForgeCode at tbench.ai)
| Configuration | Score |
|---|---|
| ForgeCode + GPT 5.4 | 81.8% |
| ForgeCode + Claude Opus 4.6 | 81.8% |
| Claude Code + Claude Opus 4.6 | 58.0% |
Independent SWE-bench (Princeton/UChicago)
| Configuration | Score |
|---|---|
| ForgeCode + Claude 4 | 72.7% |
| Claude 3.7 Sonnet (extended thinking) | 70.3% |
| Claude 4.5 Opus | 76.8% |
The ~24-point TermBench gap (81.8% vs 58.0%) narrows to 2.4 points (72.7% vs 70.3%) on the independent benchmark.
Community Skepticism
Reddit r/ClaudeCode
"Looks this agent forgecode.dev ranks better than anyone in terminalbench but anyone is talking about it. Is it fake or what is wrong with these 'artifacts' that promise save time and tokens"
"At this point, terminalbench has received quite some attention and most benchmarks are not validated."
"I am concerned about their proprietary layer, which I believe is a big part of what moved their bench scores from ~25% to ~81%."
The "Benchmaxxing" Term
The community coined "benchmaxxed" to describe ForgeCode's approach:
- Real engineering improvements
- But also benchmark-specific optimizations
- Results not necessarily representative of real-world performance
ForgeCode's Defense
Blog Series: "Benchmarks Don't Matter — Until They Do"
ForgeCode transparently documented their journey:
- Baseline: ~25% (interactive-first runtime)
- Stabilization: ~38% (non-interactive mode + tool naming fixes)
- Planning control: 66% (mandatory todo_write enforcement)
- Speed architecture: 78.4% (subagent parallelization + progressive thinking)
- Final: 81.8% (additional optimizations)
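The "mandatory todo_write enforcement" step can be illustrated with a minimal sketch. Note that the function, tool names, and trace format below are hypothetical, assumed for illustration, not ForgeCode's actual code: the idea is a harness-side check that the agent writes a plan before it touches editing tools.

```python
# Hypothetical sketch of harness-side planning enforcement: the agent's
# tool-call trace must contain a todo_write (planning) call before any
# edit-type call. All names and the trace format are illustrative only.

PLANNING_TOOL = "todo_write"
EDIT_TOOLS = {"write_file", "apply_patch", "run_shell"}

def enforce_planning(tool_calls: list[dict]) -> bool:
    """Return True if no edit-type call occurs before a planning call."""
    plan_seen = False
    for call in tool_calls:
        if call["name"] == PLANNING_TOOL:
            plan_seen = True
        elif call["name"] in EDIT_TOOLS and not plan_seen:
            return False  # edit attempted before any plan was written
    return True

# A compliant trace plans first, then edits; a non-compliant one does not.
ok_trace = [{"name": "todo_write"}, {"name": "apply_patch"}]
bad_trace = [{"name": "run_shell"}, {"name": "todo_write"}]
```

A real harness would presumably reject or retry the non-compliant turn rather than just flag it, but the gate itself is this simple.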
Documented Optimizations
- JSON schema reordering: `required` before `properties` for GPT 5.4
- Schema flattening: Reduced nesting
- Truncation reminders: Explicit notes when files partially read
- Mandatory verification: Reviewer skill checks completion
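The schema-reordering optimization can be sketched as follows. The tool schema is a made-up example and the function is illustrative, not ForgeCode's code; the point is simply that the harness re-emits schema keys so that `required` precedes `properties` in the serialized JSON.

```python
# Illustrative sketch of the "required before properties" reordering.
# Python dicts preserve insertion order (3.7+), and json.dumps serializes
# keys in that order, so rebuilding the dict controls the wire format.
import json

def reorder_schema(schema: dict) -> dict:
    """Rebuild a JSON schema so 'required' is listed before 'properties'."""
    front = {k: schema[k] for k in ("type", "required") if k in schema}
    rest = {k: v for k, v in schema.items() if k not in front}
    return {**front, **rest}

# Hypothetical tool schema for a file-writing tool.
tool_schema = {
    "type": "object",
    "properties": {"path": {"type": "string"}, "content": {"type": "string"}},
    "required": ["path", "content"],
}

reordered = reorder_schema(tool_schema)
serialized = json.dumps(reordered)
# Key order in the serialized schema: type, required, properties
```

Semantically the two orderings are identical JSON Schema; the reordering only changes what token order the model sees.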
The Proprietary Layer Question
ForgeCode Services (optional, free during evaluation) includes:
- Semantic entry-point discovery
- Dynamic skill loading
- Tool-call correction layer
Concern: These services were used for benchmark evaluations but differ from the open-source CLI mode.
Clarification from Discussion #2545:
"Regarding the benchmarks: the setup we used for those evaluations includes ForgeCode Services, which provides multiple improvements over the existing CLI harness."
Independent Terminal-Bench Data
From llm-stats.com (April 9, 2026):
- 28+ models evaluated
- Average score: Varies significantly by harness
- All results self-reported (0 verified on independent platforms)
Key Point: Terminal-Bench scores are inherently harness-specific. The same model (e.g., Claude Opus 4.6) achieves different scores with different harnesses:
- Pilot + Claude Opus 4.6: 82.9%
- ForgeCode + Claude Opus 4.6: 81.8%
- Claude Code + Claude Opus 4.6: 58.0%
Note: The 24-point gap between ForgeCode and Claude Code on the same model illustrates how harness engineering significantly impacts benchmark scores.
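The harness effect can be computed directly from the three scores quoted above: one model, three harnesses, and a best-to-worst spread of nearly 25 points.

```python
# Per-model score spread across harnesses, using the leaderboard figures
# quoted above for Claude Opus 4.6.
scores = {
    ("Pilot", "Claude Opus 4.6"): 82.9,
    ("ForgeCode", "Claude Opus 4.6"): 81.8,
    ("Claude Code", "Claude Opus 4.6"): 58.0,
}

by_model: dict[str, list[float]] = {}
for (harness, model), score in scores.items():
    by_model.setdefault(model, []).append(score)

for model, vals in by_model.items():
    spread = max(vals) - min(vals)
    print(f"{model}: {spread:.1f}-point spread across {len(vals)} harnesses")
# → Claude Opus 4.6: 24.9-point spread across 3 harnesses
```

Any leaderboard that reports a single number per model hides exactly this spread.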
Academic Validation
Terminal-Bench Paper (ICLR 2026)
From arXiv:2601.11868:
"Nonetheless, the benchmark developers have invested considerable effort in mitigating reproducibility issues through containerization, repeated runs, and reporting of confidence intervals. We found no major issues. Independent inspection of the released tasks confirmed that they are well-specified and largely free of ambiguity or underspecification."
Key Point: The benchmark itself is well-constructed; the question is about harness-specific optimizations.
Key Takeaways
- Benchmarks can be gamed: Documented optimizations show how harness engineering affects scores
- Independent validation matters: 24-point gap shrinks to 2.4 on independent tests
- Proprietary layers complicate comparisons: Services used for benchmarks differ from open-source code
- Real-world != benchmark: GPT 5.4 scored 81.8% but was "borderline unusable" in practice
Recommendations for Benchmark Consumers
- Look for independent validation (SWE-bench > self-reported TermBench)
- Test on your own tasks - benchmarks don't capture all failure modes
- Consider harness transparency - open-source vs proprietary optimizations
- Beware benchmaxxing - optimizations may not generalize
Source References
- Reddit r/ClaudeCode: https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
- DEV Community: https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
- llm-stats.com: https://llm-stats.com/benchmarks/terminal-bench
- arXiv Paper: https://arxiv.org/html/2601.11868v1
- ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/
- GitHub Discussion: https://github.com/antinomyhq/forgecode/discussions/2545