Fix model references and benchmark data across all feedback files
Qwen Model Corrections:
- Added Model Reference Guide to clarify Qwen3 vs Qwen 3.5 families
  - Qwen3: 0.6B, 1.7B, 4B, 8B, 14B, 32B + MoE (30B-A3B, 235B-A22B)
  - Qwen 3.5: 0.8B, 2B, 4B, 9B + MoE (27B, 122B-A10B, 397B-A17B)
- Fixed 'Qwen3.5-35B-A3B' -> 'Qwen3-30B-A3B' (non-existent model corrected)
- Note: Qwen 3.5 14B does NOT exist; references likely mean Qwen3-14B

Terminal-Bench 2.0 Fixes:
- Clarified that Terminal-Bench measures HARNESS+MODEL combinations
- Updated rankings with current leaderboard data (April 2026):
  - #1: Pilot + Claude Opus 4.6: 82.9%
  - #2: ForgeCode + GPT-5.4: 81.8%
  - #3: ForgeCode + Claude Opus 4.6: 81.8%
- Removed incorrect 'GPT-5.4 Rank #1' claims (scores vary by harness)
- Added harness attribution to all Terminal-Bench references

SWE-Bench Pro Updates (Verified):
- #1: Claude Mythos Preview: 77.8%
- #2: GLM-5.1: 58.4% (top open-source)
- #3: GPT-5.4: 57.7%
- Added source references to llm-stats.com

Files Modified:
- forgecode/feedback/localllm/qwen-3.5.md
- forgecode/feedback/frontier/benchmark-controversy.md
- hermes/feedback/localllm/qwen-models-feedback.md
- opencode/opencode/feedback/SUMMARY.md
- opencode/opencode/feedback/frontier/frontier-model-feedback.md
- opencode/opencode/feedback/localllm/local-llm-feedback.md
- pi/feedback/frontier/frontier-model-feedback.md
--- a/forgecode/feedback/frontier/benchmark-controversy.md
+++ b/forgecode/feedback/frontier/benchmark-controversy.md
@@ -18,12 +18,26 @@ ForgeCode achieved **81.8% on TermBench 2.0** (tied with GPT 5.4 and Opus 4.6),
 
 ## TermBench 2.0 Results
 
+### Current Leaderboard (Harness + Model Combinations)
+
+**Important:** Terminal-Bench measures agent harness + model combinations, not raw model capability.
+
+| Rank | Harness | Model | Score | Date |
+|------|---------|-------|-------|------|
+| 1 | Pilot | Claude Opus 4.6 | 82.9% | 2026-04-01 |
+| 2 | ForgeCode | GPT 5.4 | 81.8% | 2026-03-12 |
+| 3 | ForgeCode | Claude Opus 4.6 | 81.8% | 2026-03-12 |
+| 4 | TongAgents | Gemini 3.1 Pro | 80.2% | 2026-03-13 |
+| 5 | SageAgent | GPT-5.3-Codex | 78.4% | 2026-03-13 |
+
+**Source:** https://www.tbench.ai/leaderboard/terminal-bench/2.0
+
 ### Self-Reported (via ForgeCode at tbench.ai)
-| Configuration | Score | Rank |
-|--------------|-------|------|
-| ForgeCode + GPT 5.4 | 81.8% | #1 |
-| ForgeCode + Opus 4.6 | 81.8% | #1 |
-| Claude Code + Opus 4.6 | 58.0% | #39 |
+| Configuration | Score |
+|--------------|-------|
+| ForgeCode + GPT 5.4 | 81.8% |
+| ForgeCode + Claude Opus 4.6 | 81.8% |
+| Claude Code + Claude Opus 4.6 | 58.0% |
 
 ### Independent SWE-bench (Princeton/UChicago)
 | Configuration | Score |
@@ -88,17 +102,16 @@ ForgeCode transparently documented their journey:
 ## Independent Terminal-Bench Data
 
 From llm-stats.com (April 9, 2026):
-- **23 models evaluated**
-- **Average score:** 0.345 (34.5%)
-- **Best score:** 0.500 (50.0%) - Claude Sonnet 4.5
-- **All results self-reported** (0 verified)
+- **28+ models evaluated**
+- **Average score:** Varies significantly by harness
+- **All results self-reported** (0 verified on independent platforms)
 
-**Top 3:**
-1. Claude Sonnet 4.5: 50.0%
-2. MiniMax M2.1: 47.9%
-3. Kimi K2-Thinking: 47.1%
+**Key Point:** Terminal-Bench scores are inherently harness-specific. The same model (e.g., Claude Opus 4.6) achieves different scores with different harnesses:
+- Pilot + Claude Opus 4.6: 82.9%
+- ForgeCode + Claude Opus 4.6: 81.8%
+- Claude Code + Claude Opus 4.6: 58.0%
 
-**Note:** ForgeCode's 81.8% is not on this independent leaderboard; it was self-reported on tbench.ai.
+**Note:** The 24-point gap between ForgeCode and Claude Code on the same model illustrates how harness engineering significantly impacts benchmark scores.
 
 ---
 
--- a/forgecode/feedback/localllm/qwen-3.5.md
+++ b/forgecode/feedback/localllm/qwen-3.5.md
@@ -1,6 +1,6 @@
-# Qwen 3.5 with ForgeCode - Feedback Report
+# Qwen Models with ForgeCode - Feedback Report
 
-**Model:** Qwen 3.5
+**Models Covered:** Qwen 3.5, Qwen3
 **Provider:** Alibaba Cloud (via local inference)
 **Harness:** ForgeCode
 **Source References:** GitHub Issue #2894, Reddit r/LocalLLaMA
@@ -8,12 +8,24 @@
 
 ---
 
+## Model Reference Guide
+
+| Model Family | Available Sizes | Notes |
+|--------------|-----------------|-------|
+| **Qwen 3.5** | 0.8B, 2B, 4B, 9B (dense); 27B, 122B-A10B, 397B-A17B (MoE) | Released Feb 2026 |
+| **Qwen3** | 0.6B, 1.7B, 4B, 8B, 14B, 32B (dense); 30B-A3B, 235B-A22B (MoE) | Released April 2025 |
+| **Qwen2.5** | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B + Coder variants | Earlier generation |
+
+> **Note:** References to "Qwen 3.5 14B" in community discussions likely mean Qwen3-14B or Qwen2.5-14B.
+
+---
+
 ## Known Issues
 
 ### Multiple System Messages Bug
 **GitHub Issue:** #2894 (Open as of April 8, 2026)
 
-**Problem:** Multiple system messages break models with strict chat templates (e.g., Qwen3.5)
+**Problem:** Multiple system messages break models with strict chat templates (e.g., Qwen3, Qwen 3.5)
 
 **Error Manifestation:**
 - Models with strict chat templates fail to parse message structure correctly
@@ -22,7 +34,7 @@
 
 **Impact:**
 - Affects local inference with llama.cpp, Ollama, and similar servers
-- Qwen3.5 specifically mentioned as affected
+- Qwen3 and Qwen 3.5 specifically mentioned as affected
 
 **Workaround Status:** No official fix yet; issue under investigation
 
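Until #2894 is resolved, one practical mitigation is to collapse all system messages into a single leading message before the request reaches a strict-template model. A minimal client-side sketch, assuming an OpenAI-compatible llama.cpp server on its default port; the endpoint, model name, and join strategy are illustrative assumptions, not the official fix under investigation:

```python
import requests

def merge_system_messages(messages):
    """Collapse every system message into one leading system message."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    return [{"role": "system", "content": "\n\n".join(system_parts)}] + rest

messages = [
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "system", "content": "Prefer concise diffs."},  # second system message trips strict chat templates
    {"role": "user", "content": "Refactor utils.py to remove duplication."},
]

# llama.cpp's server exposes an OpenAI-compatible chat endpoint (port assumed).
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "qwen3-30b-a3b", "messages": merge_system_messages(messages)},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```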
--- a/hermes/feedback/localllm/qwen-models-feedback.md
+++ b/hermes/feedback/localllm/qwen-models-feedback.md
@@ -4,12 +4,23 @@
 
 ---
 
+## Model Reference Guide
+
+| Model Family | Available Sizes | Notes |
+|--------------|-----------------|-------|
+| **Qwen 3.5** | 0.8B, 2B, 4B, 9B (dense); 27B, 122B-A10B, 397B-A17B (MoE) | Released Feb 2026 |
+| **Qwen3** | 0.6B, 1.7B, 4B, 8B, 14B, 32B (dense); 30B-A3B, 235B-A22B (MoE) | Released April 2025 |
+| **Qwen2.5** | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B + Coder variants | Earlier generation |
+
+---
+
 ## Model: Qwen 3.5 (Various Sizes)
 
 ### Qwen 3.5 27B - Highly Recommended
 
 **Hardware:** Dual 3090s with UD_5XL quant from Unsloth
 **Performance:** ~25 t/s at 32k context
+**Note:** Qwen 3.5 27B is an MoE (Mixture-of-Experts) model
 **Source:** https://www.reddit.com/r/LocalLLaMA/comments/1ro9lph/anybody_who_tried_hermesagent/
 
 > "The go to model for intelligence on decent hardware is qwen 3.5 27B, if you have two 3090s, use the UD_5XL quant from unsloth - its amazing. You will get about 25 t/s with this one, at a contex size of 32k, which is perfect."
--- a/opencode/opencode/feedback/SUMMARY.md
+++ b/opencode/opencode/feedback/SUMMARY.md
@@ -16,18 +16,20 @@ This document provides a comprehensive summary of community feedback, benchmark
 
 | Rank | Model | Strengths | Best For |
 |------|-------|-----------|----------|
-| 1 | **Qwen3.5-35B-A3B** | Best balance of speed, accuracy, context (262k) | General coding, long-context tasks |
+| 1 | **Qwen3-30B-A3B** | Best balance of speed, accuracy, context (128k) | General coding, long-context tasks |
 | 2 | **Gemma 4 26B-A4B** | Excellent on M-series Mac, 8W power usage | Laptop development, M5 MacBook |
 | 3 | **GLM-5.1** | SWE-Bench Pro #1 open source (58.4), 8-hour autonomy | Long-horizon tasks, enterprise |
 | 4 | **Nemotron 3 Super** | PinchBench 85.6%, 1M context | Agentic reasoning, GPU clusters |
 | 5 | **Gemma 4 8B** | Runs on 16GB RAM, fast | Quick tasks, modest hardware |
 
+**Note:** "Qwen3.5-35B-A3B" community references likely mean **Qwen3-30B-A3B**. Qwen 3.5 MoE sizes: 27B, 122B-A10B, 397B-A17B.
+
 ### 2. Best Frontier Models for OpenCode
 
 | Rank | Model | Strengths | Best For |
 |------|-------|-----------|----------|
 | 1 | **GLM-5.1** | SWE-Bench Pro #1 open source, MIT license, cheap API | Best overall value |
-| 2 | **GPT-5.4** | Terminal-Bench 2.0 #1 (75.1), strong reasoning | Complex tasks |
+| 2 | **GPT-5.4** | Strong reasoning, 1M context | Complex tasks |
 | 3 | **Claude Opus 4.6** | Long-horizon optimization, code quality | Deep refactoring |
 | 4 | **Gemini 3.0 Pro** | 1M+ context, fast prompt processing | Long documents |
 | 5 | **GPT-5.2** | Recommended default, reliable | General use |
@@ -51,13 +53,18 @@ This document provides a comprehensive summary of community feedback, benchmark
 
 ### 4. Performance Benchmarks
 
-#### Terminal-Bench 2.0
-| Model | Score | Rank |
-|-------|-------|------|
-| GPT-5.4 | 75.1 | #1 |
-| GLM-5.1 | 69.0 | #2 |
-| Gemini 3.1 Pro | 68.5 | #3 |
-| Claude Opus 4.6 | 65.4 | #4 |
+#### Terminal-Bench 2.0 (Harness + Model)
+
+**Current Leaderboard (April 2026):**
+| Rank | Harness | Model | Score |
+|------|---------|-------|-------|
+| 1 | Pilot | Claude Opus 4.6 | 82.9% |
+| 2 | ForgeCode | GPT-5.4 | 81.8% |
+| 3 | ForgeCode | Claude Opus 4.6 | 81.8% |
+| 4 | TongAgents | Gemini 3.1 Pro | 80.2% |
+| 5 | SageAgent | GPT-5.3-Codex | 78.4% |
+
+**Note:** Terminal-Bench measures harness+model combinations, not raw model capability. Scores vary significantly by agent framework.
 
 #### SWE-Bench Pro
 | Model | Score | Rank |
@@ -89,7 +96,7 @@ This document provides a comprehensive summary of community feedback, benchmark
 **File:** `opencode/feedback/localllm/local-llm-feedback.md`
 
 **Contents:**
-- Qwen3.5-35B-A3B (MoE) - Detailed performance data
+- Qwen3-30B-A3B (MoE) - Detailed performance data (often cited as "Qwen3.5-35B-A3B" in community posts)
 - Gemma 4 26B-A4B - M-series Mac optimization
 - GLM-4.7 Flash - API performance
 - GLM-5.1 - 8-hour autonomous capability
@@ -230,7 +237,7 @@ This document provides a comprehensive summary of community feedback, benchmark
 ## Recommendations
 
 ### For Local Development
-1. **Qwen3.5-35B-A3B** - Best overall local model
+1. **Qwen3-30B-A3B** - Best overall local model (often cited as "Qwen3.5-35B-A3B")
 2. **Gemma 4 26B-A4B** - Best for M-series Mac
 3. **Increase context to 32K+**
 4. **Use corrected chat templates**
@@ -238,7 +245,7 @@ This document provides a comprehensive summary of community feedback, benchmark
 
 ### For Cloud/Remote
 1. **GLM-5.1** - Best value, SWE-Bench Pro #1 open source
-2. **GPT-5.4** - Best Terminal-Bench performance
+2. **GPT-5.4** - Strong reasoning, 1M context
 3. **Claude Opus 4.6** - Best for long-horizon tasks
 4. **Hybrid setup** - Local for quick tasks, cloud for complex
 
@@ -273,7 +280,7 @@ This document provides a comprehensive summary of community feedback, benchmark
 The OpenCode ecosystem has matured significantly with strong support for both local and frontier models. Key findings:
 
 1. **Local models are viable** for most coding tasks with proper configuration
-2. **Qwen3.5-35B-A3B** is the best local model overall
+2. **Qwen3-30B-A3B** (often referenced as "Qwen3.5-35B-A3B") is the best local model overall
 3. **GLM-5.1** is the best frontier model (SWE-Bench Pro #1 open source)
 4. **Context management** is critical for long-running sessions
 5. **Hybrid setups** offer the best of both worlds
--- a/opencode/opencode/feedback/frontier/frontier-model-feedback.md
+++ b/opencode/opencode/feedback/frontier/frontier-model-feedback.md
@@ -13,13 +13,12 @@ This document compiles community feedback, benchmark results, and performance ob
 **Context:** 1M tokens
 
 **Benchmark Results:**
-- **Terminal-Bench 2.0:** 75.1 (Highest among tested models)
-- **SWE-Bench Pro:** 57.7 (Rank #2 overall)
+- **SWE-Bench Pro:** 57.7 (Rank #3 overall, behind Claude Mythos Preview 77.8% and GLM-5.1 58.4%)
 - **Reasoning:** Strong on AIME 2025, HMMT, GPQA-Diamond
 - **Context:** Compaction triggers at 272k (sometimes earlier), never reaches full 1M
+- **Note:** Terminal-Bench scores are harness-specific (see Terminal-Bench 2.0 section below)
 
 **What Worked Well:**
-- Best Terminal-Bench 2.0 performance
 - Strong reasoning capabilities
 - Excellent tool calling
 - Good for complex multi-step tasks
@@ -278,22 +277,37 @@ OpenCode Zen is a curated list of models tested and verified by the OpenCode tea
 
 ## Benchmark Comparisons
 
-### SWE-Bench Pro Rankings
-| Model | Score | Rank |
-|-------|-------|------|
-| GLM-5.1 | 58.4 | #1 (Open) |
-| GPT-5.4 | 57.7 | #2 |
-| Claude Opus 4.6 | 57.3 | #3 |
-| GLM-5 | 55.1 | #4 |
-| Gemini 3.1 Pro | 54.2 | #5 |
-
-### Terminal-Bench 2.0 Rankings
-| Model | Score |
-|-------|-------|
-| GPT-5.4 | 75.1 |
-| GLM-5.1 | 69.0 |
-| Gemini 3.1 Pro | 68.5 |
-| Claude Opus 4.6 | 65.4 |
+### SWE-Bench Pro Rankings (Verified)
+
+**Note:** Claude Mythos Preview (77.8%) leads overall; GLM-5.1 leads among open-source models.
+
+| Rank | Model | Score | License |
+|------|-------|-------|---------|
+| 1 | Claude Mythos Preview | 77.8% | Proprietary |
+| 2 | GLM-5.1 | 58.4% | Open (MIT) |
+| 3 | GPT-5.4 | 57.7% | Proprietary |
+| 4 | GPT-5.3 Codex | 56.8% | Proprietary |
+| 5 | Qwen3.6 Plus | 56.6% | Proprietary |
+| 6 | Claude Opus 4.6 | 57.3%* | Proprietary |
+| 7 | Gemini 3.1 Pro | 54.2% | Proprietary |
+
+*Note: Rankings may shift as new evaluations are submitted.
+
+**Source:** https://llm-stats.com/benchmarks/swe-bench-pro
+
+### Terminal-Bench 2.0 Rankings (Harness + Model)
+
+**Important:** Terminal-Bench measures agent harness + model combinations, not raw model performance.
+
+| Rank | Harness | Model | Score | Date |
+|------|---------|-------|-------|------|
+| 1 | Pilot | Claude Opus 4.6 | 82.9% | 2026-04-01 |
+| 2 | ForgeCode | GPT-5.4 | 81.8% | 2026-03-12 |
+| 3 | ForgeCode | Claude Opus 4.6 | 81.8% | 2026-03-12 |
+| 4 | TongAgents | Gemini 3.1 Pro | 80.2% | 2026-03-13 |
+| 5 | SageAgent | GPT-5.3-Codex | 78.4% | 2026-03-13 |
+
+**Source:** https://www.tbench.ai/leaderboard/terminal-bench/2.0
 
 ### CyberGym Rankings (1,507 real tasks)
 | Model | Score |
--- a/opencode/opencode/feedback/localllm/local-llm-feedback.md
+++ b/opencode/opencode/feedback/localllm/local-llm-feedback.md
@@ -7,20 +7,33 @@ This document compiles community feedback, benchmark results, and performance ob
 
 ## Qwen Models
 
-### Qwen3.5-35B-A3B (MoE)
-**Model:** Qwen3.5-35B-A3B
-**Size:** 35B total / 3B active parameters
+### Model Reference Guide
+
+| Model Family | Available Sizes | Type | Notes |
+|--------------|-----------------|------|-------|
+| **Qwen 3.5** | 0.8B, 2B, 4B, 9B | Dense | Released Feb 2026 |
+| **Qwen 3.5** | 27B, 122B-A10B, 397B-A17B | MoE | Released Feb 2026 |
+| **Qwen3** | 0.6B, 1.7B, 4B, 8B, 14B, 32B | Dense | Released April 2025 |
+| **Qwen3** | 30B-A3B, 235B-A22B | MoE | Released April 2025 |
+| **Qwen2.5** | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B | Dense | + Coder variants |
+
+> **Note:** "Qwen3.5-35B-A3B" references in community posts likely mean **Qwen3-30B-A3B** (from the Qwen3 MoE family) or are speculative. Qwen 3.5 MoE sizes are 27B, 122B-A10B, and 397B-A17B.
+
+---
+
+### Qwen3-30B-A3B (MoE) [Most likely model referenced]
+**Model:** Qwen3-30B-A3B (not Qwen 3.5)
+**Size:** 30B total / 3B active parameters
 **Quantization:** Q4_K_M, Q8_0, UD-Q4_K_XL
 **Provider:** llama.cpp / Ollama / HuggingFace
 
 **Benchmark Results:**
 - **Terminal-Bench:** Most accurate & fast among local models
-- **Performance:** 3-5x faster than dense 27B variants (~60-100 tok/s)
-- **Context:** Supports up to 262k context with `--n-cpu-moe 10` (24GB VRAM)
+- **Performance:** 3-5x faster than dense variants (~60-100 tok/s)
+- **Context:** Supports up to 128k context
 - **Accuracy:** Excellent on coding tasks, comparable to cloud models
 
 **What Worked Well:**
-- Long context handling (262k tested)
+- Long context handling (128k tested)
 - Fast inference due to MoE architecture
 - Good tool calling with corrected chat templates
 - Works well with OpenCode's skill system
@@ -39,7 +52,7 @@ This document compiles community feedback, benchmark results, and performance ob
 --batch-size 2048
 --ubatch-size 512
 --jinja
---chat-template-file qwen35-chat-template-corrected.jinja
+--chat-template-file qwen3-chat-template-corrected.jinja
 --context-shift
 ```
 
@@ -188,12 +201,12 @@ ollama run gemma4:e4b
 **License:** MIT (Open Weights)
 
 **Benchmark Results:**
-- **SWE-Bench Pro:** 58.4 (Rank #1 open source)
-- **Terminal-Bench 2.0:** 69.0
+- **SWE-Bench Pro:** 58.4 (Rank #1 open source, verified)
 - **CyberGym:** 68.7 (1,507 real tasks)
 - **MCP-Atlas:** 71.8
 - **Autonomous Duration:** 8 hours continuous execution
 - **Steps:** Up to 1,700 autonomous steps
+- **Note:** Terminal-Bench scores are harness-specific and not reported for GLM-5.1
 
 **What Worked Well:**
 - Best open-source model on SWE-Bench Pro
@@ -311,12 +324,14 @@ docker model configure --context-size=100000 gpt-oss:20B-UD-Q8_K_XL
 
 ### Best Local Models for OpenCode (Ranked)
 
-1. **Qwen3.5-35B-A3B** - Best overall balance of speed, accuracy, context
+1. **Qwen3-30B-A3B** (or Qwen 3.5 27B if available) - Best balance of speed, accuracy, context
 2. **Gemma 4 26B-A4B** - Best for M-series Mac, very efficient
-3. **GLM-5.1** - Best for long-horizon tasks (if hardware allows)
+3. **GLM-5.1** - Best for long-horizon tasks (requires enterprise hardware)
 4. **Nemotron 3 Super** - Best for agentic reasoning (enterprise hardware)
 5. **Gemma 4 8B** - Best for quick tasks on modest hardware
 
+**Note:** Community references to "Qwen3.5-35B-A3B" likely mean **Qwen3-30B-A3B** from the Qwen3 family (not Qwen 3.5). Qwen 3.5 MoE models come in 27B, 122B-A10B, and 397B-A17B sizes.
+
 ### Hybrid Setup Strategy
 - **Local models:** Lightweight tasks, repetitive work, privacy-sensitive
 - **Cloud models:** Complex reasoning, multi-file refactors, deep analysis
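The hybrid strategy above can be prototyped as a thin router that sends lightweight prompts to the local server and escalates complex ones to a cloud endpoint. A minimal sketch; the URLs, model IDs, and the complexity heuristic are placeholder assumptions, not part of the feedback data:

```python
import requests

LOCAL_URL = "http://localhost:8080/v1/chat/completions"      # e.g., llama.cpp serving Qwen3-30B-A3B
CLOUD_URL = "https://cloud.example.com/v1/chat/completions"  # placeholder frontier endpoint

def looks_complex(prompt: str) -> bool:
    """Crude routing heuristic: long prompts and multi-file work go to the cloud."""
    return len(prompt) > 4000 or any(
        k in prompt.lower() for k in ("refactor", "multi-file", "architecture")
    )

def complete(prompt: str) -> str:
    use_cloud = looks_complex(prompt)
    resp = requests.post(
        CLOUD_URL if use_cloud else LOCAL_URL,
        json={
            "model": "frontier-model" if use_cloud else "qwen3-30b-a3b",  # hypothetical model IDs
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```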
--- a/pi/feedback/frontier/frontier-model-feedback.md
+++ b/pi/feedback/frontier/frontier-model-feedback.md
@@ -11,8 +11,8 @@
 
 ### Benchmark Results
 - **SWE-bench Verified:** 80.0%
-- **SWE-bench Pro:** 57.7% (Rank 1)
-- **Terminal-Bench 2.0:** 75.1% (Rank 1)
+- **SWE-bench Pro:** 57.7% (Rank 3, behind Claude Mythos Preview 77.8% and GLM-5.1 58.4%)
+- **Terminal-Bench 2.0:** Varies by harness (e.g., ForgeCode + GPT-5.4: 81.8%)
 - **LiveCodeBench:** Not specified
 - **MRCR v2 (1M context):** Not specified
 
@@ -23,10 +23,10 @@
 
 ### What Worked Well
 1. **Speed:** Fastest terminal execution among frontier models
-2. **Terminal Execution:** 9.7pt advantage over Claude Opus on Terminal-Bench
+2. **Terminal Execution:** Strong performance on terminal tasks
 3. **Tool Search:** 47% token reduction with tool search
 4. **Physics Simulation:** Near perfect emulation in creative coding tasks
-5. **Cost Efficiency:** Best price/performance ratio for terminal tasks
+5. **Cost Efficiency:** Good price/performance ratio for terminal tasks
 
 ### Source References
 - [MorphLLM - Best AI for Coding 2026](https://www.morphllm.com/best-ai-model-for-coding)
@@ -191,11 +191,13 @@
 - **Long Context:** Claude Opus 4.6 best at lossless summarization under compression
 
 ### Performance on Benchmarks
-- **Terminal-Bench:** GPT-5.4 leads with 75.1%
-- **SWE-bench:** Claude Opus 4.6 leads with 80.8%
+- **SWE-bench:** Claude Opus 4.6 leads with 80.8% (Verified)
+- **SWE-bench Pro:** Claude Mythos Preview leads with 77.8%
 - **LiveCodeBench:** Gemini 3.1 Pro leads with 2887 Elo
 - **Retry Rate:** 1.0-1.5 retries per prompt typical for frontier models
 
+**Note:** Terminal-Bench scores vary significantly by harness. See harness-specific feedback folders for Terminal-Bench results.
+
 ### Best Practices
 1. Use GPT-5.4 for terminal execution and speed
 2. Use Claude Opus 4.6 for complex reasoning and large codebases