ForgeCode Benchmark Controversy - Feedback Report

Topic: TermBench 2.0 results, self-reported vs independent validation, "benchmaxxing"
Source References: Reddit r/ClaudeCode, DEV Community, llm-stats.com, arXiv
Date Compiled: April 9, 2026


The Controversy

ForgeCode achieved 81.8% on TermBench 2.0 with both GPT 5.4 and Claude Opus 4.6 (the two configurations tied), significantly outperforming Claude Code's 58.0% with the same Opus 4.6 model. This raised questions about:

  1. Self-reported vs. independent validation
  2. Benchmark-specific optimizations ("benchmaxxing")
  3. Proprietary layer involvement

TermBench 2.0 Results

Current Leaderboard (Harness + Model Combinations)

Important: Terminal-Bench measures agent harness + model combinations, not raw model capability.

| Rank | Harness    | Model           | Score | Date       |
|------|------------|-----------------|-------|------------|
| 1    | Pilot      | Claude Opus 4.6 | 82.9% | 2026-04-01 |
| 2    | ForgeCode  | GPT 5.4         | 81.8% | 2026-03-12 |
| 3    | ForgeCode  | Claude Opus 4.6 | 81.8% | 2026-03-12 |
| 4    | TongAgents | Gemini 3.1 Pro  | 80.2% | 2026-03-13 |
| 5    | SageAgent  | GPT-5.3-Codex   | 78.4% | 2026-03-13 |

Source: https://www.tbench.ai/leaderboard/terminal-bench/2.0

Self-Reported (via ForgeCode at tbench.ai)

| Configuration                 | Score |
|-------------------------------|-------|
| ForgeCode + GPT 5.4           | 81.8% |
| ForgeCode + Claude Opus 4.6   | 81.8% |
| Claude Code + Claude Opus 4.6 | 58.0% |

Independent SWE-bench (Princeton/UChicago)

| Configuration                         | Score |
|---------------------------------------|-------|
| ForgeCode + Claude 4                  | 72.7% |
| Claude 3.7 Sonnet (extended thinking) | 70.3% |
| Claude 4.5 Opus                       | 76.8% |

The gap narrows from roughly 24 points on TermBench to 2.4 points on the independent benchmark.
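
As a quick sanity check, the two gap figures follow directly from the tables above (assuming the 2.4-point figure compares ForgeCode + Claude 4 against Claude 3.7 Sonnet with extended thinking):

```python
# Gaps quoted above, computed from the reported scores.
termbench_gap = 81.8 - 58.0  # ForgeCode vs Claude Code, both driving Claude Opus 4.6
swebench_gap = 72.7 - 70.3   # ForgeCode + Claude 4 vs Claude 3.7 Sonnet (extended thinking)

print(f"TermBench gap: {termbench_gap:.1f} points")  # 23.8 (~24)
print(f"SWE-bench gap: {swebench_gap:.1f} points")   # 2.4
```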


Community Skepticism

Reddit r/ClaudeCode

"Looks this agent forgecode.dev ranks better than anyone in terminalbench but anyone is talking about it. Is it fake or what is wrong with these 'artifacts' that promise save time and tokens"

"At this point, terminalbench has received quite some attention and most benchmarks are not validated."

"I am concerned about their proprietary layer, which I believe is a big part of what moved their bench scores from ~25% to ~81%."

The "Benchmaxxing" Term

The community coined "benchmaxxed" to describe ForgeCode's approach:

  • Real engineering improvements
  • Also benchmark-specific optimizations
  • Not necessarily representative of real-world performance

ForgeCode's Defense

Blog Series: "Benchmarks Don't Matter — Until They Do"

ForgeCode transparently documented their journey:

  • Baseline: ~25% (interactive-first runtime)
  • Stabilization: ~38% (non-interactive mode + tool naming fixes)
  • Planning control: 66% (mandatory todo_write enforcement)
  • Speed architecture: 78.4% (subagent parallelization + progressive thinking; see the sketch after this list)
  • Final: 81.8% (additional optimizations)
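
The blog series describes these stages at a high level only; no harness code is included in this report. As a rough illustration of the subagent parallelization idea referenced above (all names hypothetical, not ForgeCode's implementation), independent subtasks can be dispatched concurrently rather than serially:

```python
import asyncio

# Hypothetical subagent runner: in a real harness this would drive a separate
# model conversation scoped to a single subtask.
async def run_subagent(task: str) -> str:
    await asyncio.sleep(0.1)  # stand-in for model and tool-call latency
    return f"result for: {task}"

async def main() -> None:
    subtasks = ["explore repository", "reproduce failing test", "draft patch"]
    # Running independent subtasks concurrently is the gist of the
    # "speed architecture" stage described above.
    results = await asyncio.gather(*(run_subagent(t) for t in subtasks))
    for task, result in zip(subtasks, results):
        print(f"{task} -> {result}")

asyncio.run(main())
```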

Documented Optimizations

  1. JSON schema reordering: placing required before properties in tool schemas for GPT 5.4 (see the sketch after this list)
  2. Schema flattening: Reduced nesting
  3. Truncation reminders: Explicit notes when files partially read
  4. Mandatory verification: Reviewer skill checks completion
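
ForgeCode's posts describe these schema changes only at a high level, and the actual tool schemas are not published. A minimal sketch of the first two optimizations, using a hypothetical read_file tool definition:

```python
# Illustrative only: hypothetical read_file parameter schema, showing the two
# documented schema tweaks (key ordering and reduced nesting).

# Before: properties listed first, arguments wrapped in a nested object.
schema_before = {
    "type": "object",
    "properties": {
        "options": {  # extra nesting level
            "type": "object",
            "properties": {
                "path": {"type": "string"},
                "offset": {"type": "integer"},
            },
        },
    },
    "required": ["options"],
}

# After: required placed before properties (the reordering reported to help
# GPT 5.4), and the nested "options" object flattened away.
schema_after = {
    "type": "object",
    "required": ["path"],
    "properties": {
        "path": {"type": "string"},
        "offset": {"type": "integer"},
    },
}
```

Key order carries no meaning in JSON Schema itself, which is part of why the community reads gains from this kind of change as harness- and model-specific tuning rather than a capability improvement.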

The Proprietary Layer Question

ForgeCode Services (optional, free during evaluation) includes:

  1. Semantic entry-point discovery
  2. Dynamic skill loading
  3. Tool-call correction layer

Concern: These services were used for benchmark evaluations but differ from open-source CLI mode.

Clarification from Discussion #2545:

"Regarding the benchmarks: the setup we used for those evaluations includes ForgeCode Services, which provides multiple improvements over the existing CLI harness."


Independent Terminal-Bench Data

From llm-stats.com (April 9, 2026):

  • 28+ models evaluated
  • Average scores vary significantly by harness
  • All results self-reported (0 verified on independent platforms)

Key Point: Terminal-Bench scores are inherently harness-specific. The same model (e.g., Claude Opus 4.6) achieves different scores with different harnesses:

  • Pilot + Claude Opus 4.6: 82.9%
  • ForgeCode + Claude Opus 4.6: 81.8%
  • Claude Code + Claude Opus 4.6: 58.0%

Note: The 24-point gap between ForgeCode and Claude Code on the same model illustrates how harness engineering significantly impacts benchmark scores.


Academic Validation

Terminal-Bench Paper (ICLR 2026)

From arXiv:2601.11868:

"Nonetheless, the benchmark developers have invested considerable effort in mitigating reproducibility issues through containerization, repeated runs, and reporting of confidence intervals. We found no major issues. Independent inspection of the released tasks confirmed that they are well-specified and largely free of ambiguity or underspecification."

Key Point: The benchmark itself is well-constructed; the question is about harness-specific optimizations.


Key Takeaways

  1. Benchmarks can be gamed: Documented optimizations show how harness engineering affects scores
  2. Independent validation matters: 24-point gap shrinks to 2.4 on independent tests
  3. Proprietary layers complicate comparisons: Services used for benchmarks differ from open-source code
  4. Real-world != benchmark: GPT 5.4 scored 81.8% but was "borderline unusable" in practice

Recommendations for Benchmark Consumers

  1. Look for independent validation (SWE-bench > self-reported TermBench)
  2. Test on your own tasks - benchmarks don't capture all failure modes
  3. Consider harness transparency - open-source vs proprietary optimizations
  4. Beware benchmaxxing - optimizations may not generalize

Source References

  1. Reddit r/ClaudeCode: https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
  2. DEV Community: https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
  3. llm-stats.com: https://llm-stats.com/benchmarks/terminal-bench
  4. arXiv Paper: https://arxiv.org/html/2601.11868v1
  5. ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/
  6. GitHub Discussion: https://github.com/antinomyhq/forgecode/discussions/2545