ForgeCode Benchmark Controversy - Feedback Report

Topic: TermBench 2.0 results, self-reported vs independent validation, "benchmaxxing"
Source References: Reddit r/ClaudeCode, DEV Community, llm-stats.com, arXiv
Date Compiled: April 9, 2026


The Controversy

ForgeCode achieved 81.8% on TermBench 2.0 with both GPT 5.4 and Claude Opus 4.6 (the two configurations tied), significantly outperforming Claude Code's 58.0% with the same Opus 4.6 model. This raised questions about:

  1. Self-reported vs. independent validation
  2. Benchmark-specific optimizations ("benchmaxxing")
  3. Proprietary layer involvement

TermBench 2.0 Results

Current Leaderboard (Harness + Model Combinations)

Important: Terminal-Bench measures agent harness + model combinations, not raw model capability.

| Rank | Harness    | Model           | Score | Date       |
|------|------------|-----------------|-------|------------|
| 1    | Pilot      | Claude Opus 4.6 | 82.9% | 2026-04-01 |
| 2    | ForgeCode  | GPT 5.4         | 81.8% | 2026-03-12 |
| 3    | ForgeCode  | Claude Opus 4.6 | 81.8% | 2026-03-12 |
| 4    | TongAgents | Gemini 3.1 Pro  | 80.2% | 2026-03-13 |
| 5    | SageAgent  | GPT-5.3-Codex   | 78.4% | 2026-03-13 |

Source: https://www.tbench.ai/leaderboard/terminal-bench/2.0

Self-Reported (via ForgeCode at tbench.ai)

| Configuration                 | Score |
|-------------------------------|-------|
| ForgeCode + GPT 5.4           | 81.8% |
| ForgeCode + Claude Opus 4.6   | 81.8% |
| Claude Code + Claude Opus 4.6 | 58.0% |

Independent SWE-bench (Princeton/UChicago)

| Configuration                         | Score |
|---------------------------------------|-------|
| ForgeCode + Claude 4                  | 72.7% |
| Claude 3.7 Sonnet (extended thinking) | 70.3% |
| Claude 4.5 Opus                       | 76.8% |

The gap narrows from roughly 24 points on TermBench to 2.4 points on the independent benchmark.
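
As a quick sanity check, the two gap figures follow directly from the tables above (assuming the 2.4-point figure compares ForgeCode + Claude 4 against Claude 3.7 Sonnet with extended thinking):

```python
# Gaps quoted above, computed from the reported scores.
termbench_gap = 81.8 - 58.0  # ForgeCode vs Claude Code, both driving Claude Opus 4.6
swebench_gap = 72.7 - 70.3   # ForgeCode + Claude 4 vs Claude 3.7 Sonnet (extended thinking)

print(f"TermBench gap: {termbench_gap:.1f} points")  # 23.8 (~24)
print(f"SWE-bench gap: {swebench_gap:.1f} points")   # 2.4
```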


Community Skepticism

Reddit r/ClaudeCode

"Looks this agent forgecode.dev ranks better than anyone in terminalbench but anyone is talking about it. Is it fake or what is wrong with these 'artifacts' that promise save time and tokens"

"At this point, terminalbench has received quite some attention and most benchmarks are not validated."

"I am concerned about their proprietary layer, which I believe is a big part of what moved their bench scores from ~25% to ~81%."

The "Benchmaxxing" Term

The community coined "benchmaxxed" to describe ForgeCode's approach:

  • Real engineering improvements
  • Also benchmark-specific optimizations
  • Not necessarily representative of real-world performance

ForgeCode's Defense

Blog Series: "Benchmarks Don't Matter — Until They Do"

ForgeCode transparently documented their journey:

  • Baseline: ~25% (interactive-first runtime)
  • Stabilization: ~38% (non-interactive mode + tool naming fixes)
  • Planning control: 66% (mandatory todo_write enforcement)
  • Speed architecture: 78.4% (subagent parallelization + progressive thinking; see the sketch after this list)
  • Final: 81.8% (additional optimizations)
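
The blog series describes these stages at a high level only; no harness code is included in this report. As a rough illustration of the subagent parallelization idea referenced above (all names hypothetical, not ForgeCode's implementation), independent subtasks can be dispatched concurrently rather than serially:

```python
import asyncio

# Hypothetical subagent runner: in a real harness this would drive a separate
# model conversation scoped to a single subtask.
async def run_subagent(task: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for model and tool-call latency
    return f"result for: {task}"

async def main() -> None:
    subtasks = ["explore repository", "reproduce failing test", "draft patch"]
    # Running independent subtasks concurrently is the gist of the
    # "speed architecture" stage described above.
    results = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    for task, result in zip(subtasks, results):
        print(f"{task} -> {result}")

asyncio.run(main())
```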

Documented Optimizations

  1. JSON schema reordering: placing required before properties in tool schemas for GPT 5.4 (see the sketch after this list)
  2. Schema flattening: Reduced nesting
  3. Truncation reminders: Explicit notes when files partially read
  4. Mandatory verification: Reviewer skill checks completion
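
ForgeCode's posts describe these schema changes only at a high level, and the actual tool schemas are not published. A minimal sketch of the first two optimizations, using a hypothetical read_file tool definition:

```python
# Illustrative only: hypothetical read_file parameter schema, showing the two
# documented schema tweaks (key ordering and reduced nesting).

# Before: properties listed first, arguments wrapped in a nested object.
schema_before = {
    "type": "object",
    "properties": {
        "options": {  # extra nesting level
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "offset": {"type": "integer"},
            },
        },
    },
    "required": ["options"],
}

# After: required placed before properties (the reordering reported to help
# GPT 5.4), and the nested "options" object flattened away.
schema_after = {
    "type": "object",
    "required": ["path"],
    "properties": {
        "path": {"type": "string"},
        "offset": {"type": "integer"},
    },
}
```

Key order carries no meaning in JSON Schema itself, which is part of why the community reads gains from this kind of change as harness- and model-specific tuning rather than a capability improvement.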

The Proprietary Layer Question

ForgeCode Services (optional, free during evaluation) includes:

  1. Semantic entry-point discovery
  2. Dynamic skill loading
  3. Tool-call correction layer

Concern: These services were used for benchmark evaluations but differ from open-source CLI mode.

Clarification from Discussion #2545:

"Regarding the benchmarks: the setup we used for those evaluations includes ForgeCode Services, which provides multiple improvements over the existing CLI harness."


Independent Terminal-Bench Data

From llm-stats.com (April 9, 2026):

  • 28+ models evaluated
  • Average scores vary significantly by harness
  • All results self-reported (0 verified on independent platforms)

Key Point: Terminal-Bench scores are inherently harness-specific. The same model (e.g., Claude Opus 4.6) achieves different scores with different harnesses:

  • Pilot + Claude Opus 4.6: 82.9%
  • ForgeCode + Claude Opus 4.6: 81.8%
  • Claude Code + Claude Opus 4.6: 58.0%

Note: The 24-point gap between ForgeCode and Claude Code on the same model illustrates how harness engineering significantly impacts benchmark scores.


Academic Validation

Terminal-Bench Paper (ICLR 2026)

From arXiv:2601.11868:

"Nonetheless, the benchmark developers have invested considerable effort in mitigating reproducibility issues through containerization, repeated runs, and reporting of confidence intervals. We found no major issues. Independent inspection of the released tasks confirmed that they are well-specified and largely free of ambiguity or underspecification."

Key Point: The benchmark itself is well-constructed; the question is about harness-specific optimizations.


Key Takeaways

  1. Benchmarks can be gamed: Documented optimizations show how harness engineering affects scores
  2. Independent validation matters: 24-point gap shrinks to 2.4 on independent tests
  3. Proprietary layers complicate comparisons: Services used for benchmarks differ from open-source code
  4. Real-world != benchmark: GPT 5.4 scored 81.8% but was "borderline unusable" in practice

Recommendations for Benchmark Consumers

  1. Look for independent validation (SWE-bench > self-reported TermBench)
  2. Test on your own tasks - benchmarks don't capture all failure modes
  3. Consider harness transparency - open-source vs proprietary optimizations
  4. Beware benchmaxxing - optimizations may not generalize

Source References

  1. Reddit r/ClaudeCode: https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
  2. DEV Community: https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
  3. llm-stats.com: https://llm-stats.com/benchmarks/terminal-bench
  4. arXiv Paper: https://arxiv.org/html/2601.11868v1
  5. ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/
  6. GitHub Discussion: https://github.com/antinomyhq/forgecode/discussions/2545