Fix model references and benchmark data across all feedback files
Qwen Model Corrections:
- Added Model Reference Guide to clarify Qwen3 vs Qwen 3.5 families
- Qwen3: 0.6B, 1.7B, 4B, 8B, 14B, 32B + MoE (30B-A3B, 235B-A22B)
- Qwen 3.5: 0.8B, 2B, 4B, 9B + MoE (27B, 122B-A10B, 397B-A17B)
- Fixed 'Qwen3.5-35B-A3B' -> 'Qwen3-30B-A3B' (non-existent model corrected)
- Note: Qwen 3.5 14B does NOT exist; references likely mean Qwen3-14B

Terminal-Bench 2.0 Fixes:
- Clarified that Terminal-Bench measures HARNESS+MODEL combinations
- Updated rankings with current leaderboard data (April 2026):
  - #1: Pilot + Claude Opus 4.6: 82.9%
  - #2: ForgeCode + GPT-5.4: 81.8%
  - #3: ForgeCode + Claude Opus 4.6: 81.8%
- Removed incorrect 'GPT-5.4 Rank #1' claims (scores vary by harness)
- Added harness attribution to all Terminal-Bench references

SWE-Bench Pro Updates (Verified):
- #1: Claude Mythos Preview: 77.8%
- #2: GLM-5.1: 58.4% (top open-source)
- #3: GPT-5.4: 57.7%
- Added source references to llm-stats.com

Files Modified:
- forgecode/feedback/localllm/qwen-3.5.md
- forgecode/feedback/frontier/benchmark-controversy.md
- hermes/feedback/localllm/qwen-models-feedback.md
- opencode/opencode/feedback/SUMMARY.md
- opencode/opencode/feedback/frontier/frontier-model-feedback.md
- opencode/opencode/feedback/localllm/local-llm-feedback.md
- pi/feedback/frontier/frontier-model-feedback.md
@@ -18,12 +18,26 @@ ForgeCode achieved **81.8% on TermBench 2.0** (tied with GPT 5.4 and Opus 4.6),
## TermBench 2.0 Results

+### Current Leaderboard (Harness + Model Combinations)
+
+**Important:** Terminal-Bench measures agent harness + model combinations, not raw model capability.
+
+| Rank | Harness | Model | Score | Date |
+|------|---------|-------|-------|------|
+| 1 | Pilot | Claude Opus 4.6 | 82.9% | 2026-04-01 |
+| 2 | ForgeCode | GPT 5.4 | 81.8% | 2026-03-12 |
+| 3 | ForgeCode | Claude Opus 4.6 | 81.8% | 2026-03-12 |
+| 4 | TongAgents | Gemini 3.1 Pro | 80.2% | 2026-03-13 |
+| 5 | SageAgent | GPT-5.3-Codex | 78.4% | 2026-03-13 |
+
+**Source:** https://www.tbench.ai/leaderboard/terminal-bench/2.0
+
### Self-Reported (via ForgeCode at tbench.ai)

-| Configuration | Score | Rank |
-|--------------|-------|------|
-| ForgeCode + GPT 5.4 | 81.8% | #1 |
-| ForgeCode + Opus 4.6 | 81.8% | #1 |
-| Claude Code + Opus 4.6 | 58.0% | #39 |
+| Configuration | Score |
+|--------------|-------|
+| ForgeCode + GPT 5.4 | 81.8% |
+| ForgeCode + Claude Opus 4.6 | 81.8% |
+| Claude Code + Claude Opus 4.6 | 58.0% |

### Independent SWE-bench (Princeton/UChicago)

| Configuration | Score |
@@ -88,17 +102,16 @@ ForgeCode transparently documented their journey:
## Independent Terminal-Bench Data

From llm-stats.com (April 9, 2026):

-- **23 models evaluated**
-- **Average score:** 0.345 (34.5%)
-- **Best score:** 0.500 (50.0%) - Claude Sonnet 4.5
-- **All results self-reported** (0 verified)
+- **28+ models evaluated**
+- **Average score:** Varies significantly by harness
+- **All results self-reported** (0 verified on independent platforms)

-**Top 3:**
-
-1. Claude Sonnet 4.5: 50.0%
-2. MiniMax M2.1: 47.9%
-3. Kimi K2-Thinking: 47.1%
+**Key Point:** Terminal-Bench scores are inherently harness-specific. The same model (e.g., Claude Opus 4.6) achieves different scores with different harnesses:
+
+- Pilot + Claude Opus 4.6: 82.9%
+- ForgeCode + Claude Opus 4.6: 81.8%
+- Claude Code + Claude Opus 4.6: 58.0%

-**Note:** ForgeCode's 81.8% is not on this independent leaderboard; it was self-reported on tbench.ai.
+**Note:** The roughly 24-point gap between ForgeCode and Claude Code on the same model illustrates how harness engineering significantly impacts benchmark scores.

---
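The harness-sensitivity point the diff adds is ultimately simple arithmetic over the quoted scores. As a minimal illustrative sketch (the scores are the Claude Opus 4.6 numbers quoted in this commit; the `harness_spread` helper is hypothetical, not part of any benchmark tooling):

```python
# Illustrative only: same model, different harnesses, using the
# Claude Opus 4.6 scores quoted in this commit (percent points).
scores = {
    "Pilot": 82.9,
    "ForgeCode": 81.8,
    "Claude Code": 58.0,
}

def harness_spread(scores: dict[str, float]) -> float:
    """Gap in points between the best and worst harness for one model."""
    return max(scores.values()) - min(scores.values())

if __name__ == "__main__":
    # 82.9 - 58.0 = 24.9 points attributable to harness choice alone
    print(f"Harness spread for one model: {harness_spread(scores):.1f} points")
```

The ForgeCode-vs-Claude Code gap specifically is 81.8 - 58.0 = 23.8 points, which is the "roughly 24-point gap" the updated note refers to.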