Fix model references and benchmark data across all feedback files

Qwen Model Corrections:
- Added Model Reference Guide to clarify Qwen3 vs Qwen 3.5 families
- Qwen3: 0.6B, 1.7B, 4B, 8B, 14B, 32B + MoE (30B-A3B, 235B-A22B)
- Qwen 3.5: 0.8B, 2B, 4B, 9B + MoE (27B, 122B-A10B, 397B-A17B)
- Fixed 'Qwen3.5-35B-A3B' -> 'Qwen3-30B-A3B' (non-existent model corrected)
- Note: Qwen 3.5 14B does NOT exist; references likely mean Qwen3-14B
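
The Model Reference Guide above can be treated as a lookup table for checking a model name before it goes into a feedback file. The following is a minimal sketch, assuming the size lists in this commit message are complete; `QWEN_FAMILIES`, `is_listed`, and `families_listing` are hypothetical names used for illustration, not part of any repository tooling.

```python
# Sizes copied from the commit message above; this is not an official catalogue.
QWEN_FAMILIES = {
    "Qwen3": {"0.6B", "1.7B", "4B", "8B", "14B", "32B", "30B-A3B", "235B-A22B"},
    "Qwen3.5": {"0.8B", "2B", "4B", "9B", "27B", "122B-A10B", "397B-A17B"},
}


def is_listed(family: str, size: str) -> bool:
    """Return True if the reference guide lists this family/size pair."""
    return size in QWEN_FAMILIES.get(family, set())


def families_listing(size: str) -> list[str]:
    """Return the families (sorted) whose list contains the given size."""
    return sorted(f for f, sizes in QWEN_FAMILIES.items() if size in sizes)


if __name__ == "__main__":
    print(is_listed("Qwen3.5", "14B"))   # size appears only under Qwen3
    print(families_listing("14B"))
```

A check like this flags a reference such as Qwen 3.5 14B and points at the family that does list the size.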

Terminal-Bench 2.0 Fixes:
- Clarified that Terminal-Bench measures HARNESS+MODEL combinations
- Updated rankings with current leaderboard data (April 2026):
  - #1: Pilot + Claude Opus 4.6: 82.9%
  - #2: ForgeCode + GPT-5.4: 81.8%
  - #3: ForgeCode + Claude Opus 4.6: 81.8%
- Removed incorrect 'GPT-5.4 Rank #1' claims (scores vary by harness)
- Added harness attribution to all Terminal-Bench references
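
Because a Terminal-Bench score belongs to a harness+model pair, a ranking is only meaningful when each entry carries both fields. A minimal sketch of that idea, using the three scores listed above; the `Entry` type and helper names are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    harness: str
    model: str
    score: float  # percent, as reported in the commit message


# The three leaderboard rows quoted above (April 2026).
ENTRIES = [
    Entry("Pilot", "Claude Opus 4.6", 82.9),
    Entry("ForgeCode", "GPT-5.4", 81.8),
    Entry("ForgeCode", "Claude Opus 4.6", 81.8),
]


def scores_for_model(model: str) -> dict[str, float]:
    """Map each harness to the score this model achieved under it."""
    return {e.harness: e.score for e in ENTRIES if e.model == model}


def best_entry() -> Entry:
    """Return the top harness+model combination."""
    return max(ENTRIES, key=lambda e: e.score)


if __name__ == "__main__":
    print(scores_for_model("Claude Opus 4.6"))
    print(best_entry())
```

The same model appears with two different scores depending on the harness, which is why a bare "model X is Rank #1" claim is not well defined for this benchmark.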

SWE-Bench Pro Updates (Verified):
- #1: Claude Mythos Preview: 77.8%
- #2: GLM-5.1: 58.4% (top open-source)
- #3: GPT-5.4: 57.7%
- Added source references to llm-stats.com

Files Modified:
- forgecode/feedback/localllm/qwen-3.5.md
- forgecode/feedback/frontier/benchmark-controversy.md
- hermes/feedback/localllm/qwen-models-feedback.md
- opencode/opencode/feedback/SUMMARY.md
- opencode/opencode/feedback/frontier/frontier-model-feedback.md
- opencode/opencode/feedback/localllm/local-llm-feedback.md
- pi/feedback/frontier/frontier-model-feedback.md
2026-04-09 16:05:14 +02:00
parent 2623737ad2
commit f561bed731
7 changed files with 141 additions and 67 deletions
@@ -11,8 +11,8 @@
 ### Benchmark Results
 - **SWE-bench Verified:** 80.0%
-- **SWE-bench Pro:** 57.7% (Rank 1)
-- **Terminal-Bench 2.0:** 75.1% (Rank 1)
+- **SWE-bench Pro:** 57.7% (Rank 3, behind Claude Mythos Preview 77.8% and GLM-5.1 58.4%)
+- **Terminal-Bench 2.0:** Score varies by harness (ForgeCode+GPT-5.4: 81.8%, other harnesses vary)
 - **LiveCodeBench:** Not specified
 - **MRCR v2 (1M context):** Not specified
@@ -23,10 +23,10 @@
 ### What Worked Well
 1. **Speed:** Fastest terminal execution among frontier models
-2. **Terminal Execution:** 9.7pt advantage over Claude Opus on Terminal-Bench
+2. **Terminal Execution:** Strong performance on terminal tasks
 3. **Tool Search:** 47% token reduction with tool search
 4. **Physics Simulation:** Near perfect emulation in creative coding tasks
-5. **Cost Efficiency:** Best price/performance ratio for terminal tasks
+5. **Cost Efficiency:** Good price/performance ratio for terminal tasks
 ### Source References
 - [MorphLLM - Best AI for Coding 2026](https://www.morphllm.com/best-ai-model-for-coding)
@@ -191,11 +191,13 @@
 - **Long Context:** Claude Opus 4.6 best at lossless summarization under compression
 ### Performance on Benchmarks
-- **Terminal-Bench:** GPT-5.4 leads with 75.1%
-- **SWE-bench:** Claude Opus 4.6 leads with 80.8%
+- **SWE-bench:** Claude Opus 4.6 leads with 80.8% (Verified)
+- **SWE-bench Pro:** Claude Mythos Preview leads with 77.8%
 - **LiveCodeBench:** Gemini 3.1 Pro leads with 2887 Elo
 - **Retry Rate:** 1.0-1.5 retries per prompt typical for frontier models
+**Note:** Terminal-Bench scores vary significantly by harness. See harness-specific feedback folders for Terminal-Bench results.
 ### Best Practices
 1. Use GPT-5.4 for terminal execution and speed
 2. Use Claude Opus 4.6 for complex reasoning and large codebases