Fix model references and benchmark data across all feedback files
Qwen Model Corrections:
- Added Model Reference Guide to clarify Qwen3 vs Qwen 3.5 families
  - Qwen3: 0.6B, 1.7B, 4B, 8B, 14B, 32B + MoE (30B-A3B, 235B-A22B)
  - Qwen 3.5: 0.8B, 2B, 4B, 9B + MoE (27B, 122B-A10B, 397B-A17B)
- Fixed 'Qwen3.5-35B-A3B' -> 'Qwen3-30B-A3B' (non-existent model corrected)
- Note: Qwen 3.5 14B does NOT exist; references likely mean Qwen3-14B

Terminal-Bench 2.0 Fixes:
- Clarified that Terminal-Bench measures HARNESS+MODEL combinations
- Updated rankings with current leaderboard data (April 2026):
  - #1: Pilot + Claude Opus 4.6: 82.9%
  - #2: ForgeCode + GPT-5.4: 81.8%
  - #3: ForgeCode + Claude Opus 4.6: 81.8%
- Removed incorrect 'GPT-5.4 Rank #1' claims (scores vary by harness)
- Added harness attribution to all Terminal-Bench references

SWE-Bench Pro Updates (Verified):
- #1: Claude Mythos Preview: 77.8%
- #2: GLM-5.1: 58.4% (top open-source)
- #3: GPT-5.4: 57.7%
- Added source references to llm-stats.com

Files Modified:
- forgecode/feedback/localllm/qwen-3.5.md
- forgecode/feedback/frontier/benchmark-controversy.md
- hermes/feedback/localllm/qwen-models-feedback.md
- opencode/opencode/feedback/SUMMARY.md
- opencode/opencode/feedback/frontier/frontier-model-feedback.md
- opencode/opencode/feedback/localllm/local-llm-feedback.md
- pi/feedback/frontier/frontier-model-feedback.md
--- a/forgecode/feedback/frontier/benchmark-controversy.md
+++ b/forgecode/feedback/frontier/benchmark-controversy.md
@@ -18,12 +18,26 @@ ForgeCode achieved **81.8% on TermBench 2.0** (tied with GPT 5.4 and Opus 4.6),
 
 ## TermBench 2.0 Results
 
+### Current Leaderboard (Harness + Model Combinations)
+
+**Important:** Terminal-Bench measures agent harness + model combinations, not raw model capability.
+
+| Rank | Harness | Model | Score | Date |
+|------|---------|-------|-------|------|
+| 1 | Pilot | Claude Opus 4.6 | 82.9% | 2026-04-01 |
+| 2 | ForgeCode | GPT 5.4 | 81.8% | 2026-03-12 |
+| 3 | ForgeCode | Claude Opus 4.6 | 81.8% | 2026-03-12 |
+| 4 | TongAgents | Gemini 3.1 Pro | 80.2% | 2026-03-13 |
+| 5 | SageAgent | GPT-5.3-Codex | 78.4% | 2026-03-13 |
+
+**Source:** https://www.tbench.ai/leaderboard/terminal-bench/2.0
+
 ### Self-Reported (via ForgeCode at tbench.ai)
-| Configuration | Score | Rank |
-|--------------|-------|------|
-| ForgeCode + GPT 5.4 | 81.8% | #1 |
-| ForgeCode + Opus 4.6 | 81.8% | #1 |
-| Claude Code + Opus 4.6 | 58.0% | #39 |
+| Configuration | Score |
+|--------------|-------|
+| ForgeCode + GPT 5.4 | 81.8% |
+| ForgeCode + Claude Opus 4.6 | 81.8% |
+| Claude Code + Claude Opus 4.6 | 58.0% |
 
 ### Independent SWE-bench (Princeton/UChicago)
 | Configuration | Score |
@@ -88,17 +102,16 @@ ForgeCode transparently documented their journey:
 ## Independent Terminal-Bench Data
 
 From llm-stats.com (April 9, 2026):
-- **23 models evaluated**
-- **Average score:** 0.345 (34.5%)
-- **Best score:** 0.500 (50.0%) - Claude Sonnet 4.5
-- **All results self-reported** (0 verified)
+- **28+ models evaluated**
+- **Average score:** Varies significantly by harness
+- **All results self-reported** (0 verified on independent platforms)
 
-**Top 3:**
-1. Claude Sonnet 4.5: 50.0%
-2. MiniMax M2.1: 47.9%
-3. Kimi K2-Thinking: 47.1%
+**Key Point:** Terminal-Bench scores are inherently harness-specific. The same model (e.g., Claude Opus 4.6) achieves different scores with different harnesses:
+- Pilot + Claude Opus 4.6: 82.9%
+- ForgeCode + Claude Opus 4.6: 81.8%
+- Claude Code + Claude Opus 4.6: 58.0%
 
-**Note:** ForgeCode's 81.8% is not on this independent leaderboard; it was self-reported on tbench.ai.
+**Note:** The 24-point gap between ForgeCode and Claude Code on the same model illustrates how harness engineering significantly impacts benchmark scores.
 
 ---
 
--- a/forgecode/feedback/localllm/qwen-3.5.md
+++ b/forgecode/feedback/localllm/qwen-3.5.md
@@ -1,6 +1,6 @@
-# Qwen 3.5 with ForgeCode - Feedback Report
+# Qwen Models with ForgeCode - Feedback Report
 
-**Model:** Qwen 3.5
+**Models Covered:** Qwen 3.5, Qwen3
 **Provider:** Alibaba Cloud (via local inference)
 **Harness:** ForgeCode
 **Source References:** GitHub Issue #2894, Reddit r/LocalLLaMA
@@ -8,12 +8,24 @@
 
 ---
 
+## Model Reference Guide
+
+| Model Family | Available Sizes | Notes |
+|--------------|-----------------|-------|
+| **Qwen 3.5** | 0.8B, 2B, 4B, 9B (dense); 27B, 122B-A10B, 397B-A17B (MoE) | Released Feb 2026 |
+| **Qwen3** | 0.6B, 1.7B, 4B, 8B, 14B, 32B (dense); 30B-A3B, 235B-A22B (MoE) | Released April 2025 |
+| **Qwen2.5** | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B + Coder variants | Earlier generation |
+
+> **Note:** References to "Qwen 3.5 14B" in community discussions likely mean Qwen3-14B or Qwen2.5-14B.
+
+---
+
 ## Known Issues
 
 ### Multiple System Messages Bug
 **GitHub Issue:** #2894 (Open as of April 8, 2026)
 
-**Problem:** Multiple system messages break models with strict chat templates (e.g., Qwen3.5)
+**Problem:** Multiple system messages break models with strict chat templates (e.g., Qwen3, Qwen 3.5)
 
 **Error Manifestation:**
 - Models with strict chat templates fail to parse message structure correctly
@@ -22,7 +34,7 @@
 
 **Impact:**
 - Affects local inference with llama.cpp, Ollama, and similar servers
-- Qwen3.5 specifically mentioned as affected
+- Qwen3 and Qwen 3.5 specifically mentioned as affected
 
 **Workaround Status:** No official fix yet; issue under investigation
 
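Until #2894 is resolved, one practical mitigation is to collapse all system messages into a single leading message before the request reaches a strict-template model. A minimal client-side sketch, assuming an OpenAI-compatible llama.cpp server on its default port; the endpoint, model name, and join strategy are illustrative assumptions, not the official fix under investigation:

```python
import requests

def merge_system_messages(messages):
    """Collapse every system message into one leading system message."""
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    return [{"role": "system", "content": "\n\n".join(system_parts)}] + rest

messages = [
    {"role": "system", "content": "You are a coding assistant."},
    {"role": "system", "content": "Prefer concise diffs."},  # second system message trips strict chat templates
    {"role": "user", "content": "Refactor utils.py to remove duplication."},
]

# llama.cpp's server exposes an OpenAI-compatible chat endpoint (port assumed).
resp = requests.post(
    "http://localhost:8080/v1/chat/completions",
    json={"model": "qwen3-30b-a3b", "messages": merge_system_messages(messages)},
    timeout=120,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```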
--- a/hermes/feedback/localllm/qwen-models-feedback.md
+++ b/hermes/feedback/localllm/qwen-models-feedback.md
@@ -4,12 +4,23 @@
 
 ---
 
+## Model Reference Guide
+
+| Model Family | Available Sizes | Notes |
+|--------------|-----------------|-------|
+| **Qwen 3.5** | 0.8B, 2B, 4B, 9B (dense); 27B, 122B-A10B, 397B-A17B (MoE) | Released Feb 2026 |
+| **Qwen3** | 0.6B, 1.7B, 4B, 8B, 14B, 32B (dense); 30B-A3B, 235B-A22B (MoE) | Released April 2025 |
+| **Qwen2.5** | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B + Coder variants | Earlier generation |
+
+---
+
 ## Model: Qwen 3.5 (Various Sizes)
 
 ### Qwen 3.5 27B - Highly Recommended
 
 **Hardware:** Dual 3090s with UD_5XL quant from Unsloth
 **Performance:** ~25 t/s at 32k context
+**Note:** Qwen 3.5 27B is an MoE (Mixture-of-Experts) model
 **Source:** https://www.reddit.com/r/LocalLLaMA/comments/1ro9lph/anybody_who_tried_hermesagent/
 
 > "The go to model for intelligence on decent hardware is qwen 3.5 27B, if you have two 3090s, use the UD_5XL quant from unsloth - its amazing. You will get about 25 t/s with this one, at a contex size of 32k, which is perfect."
--- a/opencode/opencode/feedback/SUMMARY.md
+++ b/opencode/opencode/feedback/SUMMARY.md
@@ -16,18 +16,20 @@ This document provides a comprehensive summary of community feedback, benchmark
 
 | Rank | Model | Strengths | Best For |
 |------|-------|-----------|----------|
-| 1 | **Qwen3.5-35B-A3B** | Best balance of speed, accuracy, context (262k) | General coding, long-context tasks |
+| 1 | **Qwen3-30B-A3B** | Best balance of speed, accuracy, context (128k) | General coding, long-context tasks |
 | 2 | **Gemma 4 26B-A4B** | Excellent on M-series Mac, 8W power usage | Laptop development, M5 MacBook |
 | 3 | **GLM-5.1** | SWE-Bench Pro #1 open source (58.4), 8-hour autonomy | Long-horizon tasks, enterprise |
 | 4 | **Nemotron 3 Super** | PinchBench 85.6%, 1M context | Agentic reasoning, GPU clusters |
 | 5 | **Gemma 4 8B** | Runs on 16GB RAM, fast | Quick tasks, modest hardware |
 
+**Note:** "Qwen3.5-35B-A3B" community references likely mean **Qwen3-30B-A3B**. Qwen 3.5 MoE sizes: 27B, 122B-A10B, 397B-A17B.
+
 ### 2. Best Frontier Models for OpenCode
 
 | Rank | Model | Strengths | Best For |
 |------|-------|-----------|----------|
 | 1 | **GLM-5.1** | SWE-Bench Pro #1 open source, MIT license, cheap API | Best overall value |
-| 2 | **GPT-5.4** | Terminal-Bench 2.0 #1 (75.1), strong reasoning | Complex tasks |
+| 2 | **GPT-5.4** | Strong reasoning, 1M context | Complex tasks |
 | 3 | **Claude Opus 4.6** | Long-horizon optimization, code quality | Deep refactoring |
 | 4 | **Gemini 3.0 Pro** | 1M+ context, fast prompt processing | Long documents |
 | 5 | **GPT-5.2** | Recommended default, reliable | General use |
@@ -51,13 +53,18 @@ This document provides a comprehensive summary of community feedback, benchmark
 
 ### 4. Performance Benchmarks
 
-#### Terminal-Bench 2.0
-| Model | Score | Rank |
-|-------|-------|------|
-| GPT-5.4 | 75.1 | #1 |
-| GLM-5.1 | 69.0 | #2 |
-| Gemini 3.1 Pro | 68.5 | #3 |
-| Claude Opus 4.6 | 65.4 | #4 |
+#### Terminal-Bench 2.0 (Harness + Model)
+
+**Current Leaderboard (April 2026):**
+| Rank | Harness | Model | Score |
+|------|---------|-------|-------|
+| 1 | Pilot | Claude Opus 4.6 | 82.9% |
+| 2 | ForgeCode | GPT-5.4 | 81.8% |
+| 3 | ForgeCode | Claude Opus 4.6 | 81.8% |
+| 4 | TongAgents | Gemini 3.1 Pro | 80.2% |
+| 5 | SageAgent | GPT-5.3-Codex | 78.4% |
+
+**Note:** Terminal-Bench measures harness+model combinations, not raw model capability. Scores vary significantly by agent framework.
 
 #### SWE-Bench Pro
 | Model | Score | Rank |
@@ -89,7 +96,7 @@ This document provides a comprehensive summary of community feedback, benchmark
 **File:** `opencode/feedback/localllm/local-llm-feedback.md`
 
 **Contents:**
-- Qwen3.5-35B-A3B (MoE) - Detailed performance data
+- Qwen3-30B-A3B (MoE) - Detailed performance data (often cited as "Qwen3.5-35B-A3B" in community posts)
 - Gemma 4 26B-A4B - M-series Mac optimization
 - GLM-4.7 Flash - API performance
 - GLM-5.1 - 8-hour autonomous capability
@@ -230,7 +237,7 @@ This document provides a comprehensive summary of community feedback, benchmark
 ## Recommendations
 
 ### For Local Development
-1. **Qwen3.5-35B-A3B** - Best overall local model
+1. **Qwen3-30B-A3B** - Best overall local model (often cited as "Qwen3.5-35B-A3B")
 2. **Gemma 4 26B-A4B** - Best for M-series Mac
 3. **Increase context to 32K+**
 4. **Use corrected chat templates**
@@ -238,7 +245,7 @@ This document provides a comprehensive summary of community feedback, benchmark
 
 ### For Cloud/Remote
 1. **GLM-5.1** - Best value, SWE-Bench Pro #1 open source
-2. **GPT-5.4** - Best Terminal-Bench performance
+2. **GPT-5.4** - Strong reasoning, 1M context
 3. **Claude Opus 4.6** - Best for long-horizon tasks
 4. **Hybrid setup** - Local for quick tasks, cloud for complex
 
@@ -273,7 +280,7 @@ This document provides a comprehensive summary of community feedback, benchmark
 The OpenCode ecosystem has matured significantly with strong support for both local and frontier models. Key findings:
 
 1. **Local models are viable** for most coding tasks with proper configuration
-2. **Qwen3.5-35B-A3B** is the best local model overall
+2. **Qwen3-30B-A3B** (often referenced as "Qwen3.5-35B-A3B") is the best local model overall
 3. **GLM-5.1** is the best frontier model (SWE-Bench Pro #1 open source)
 4. **Context management** is critical for long-running sessions
 5. **Hybrid setups** offer the best of both worlds
--- a/opencode/opencode/feedback/frontier/frontier-model-feedback.md
+++ b/opencode/opencode/feedback/frontier/frontier-model-feedback.md
@@ -13,13 +13,12 @@ This document compiles community feedback, benchmark results, and performance ob
 **Context:** 1M tokens
 
 **Benchmark Results:**
-- **Terminal-Bench 2.0:** 75.1 (Highest among tested models)
-- **SWE-Bench Pro:** 57.7 (Rank #2 overall)
+- **SWE-Bench Pro:** 57.7 (Rank #3 overall, behind Claude Mythos Preview 77.8% and GLM-5.1 58.4%)
 - **Reasoning:** Strong on AIME 2025, HMMT, GPQA-Diamond
 - **Context:** Compaction triggers at 272k (sometimes earlier), never reaches full 1M
+- **Note:** Terminal-Bench scores are harness-specific (see Terminal-Bench 2.0 section below)
 
 **What Worked Well:**
-- Best Terminal-Bench 2.0 performance
 - Strong reasoning capabilities
 - Excellent tool calling
 - Good for complex multi-step tasks
@@ -278,22 +277,37 @@ OpenCode Zen is a curated list of models tested and verified by the OpenCode tea
 
 ## Benchmark Comparisons
 
-### SWE-Bench Pro Rankings
-| Model | Score | Rank |
-|-------|-------|------|
-| GLM-5.1 | 58.4 | #1 (Open) |
-| GPT-5.4 | 57.7 | #2 |
-| Claude Opus 4.6 | 57.3 | #3 |
-| GLM-5 | 55.1 | #4 |
-| Gemini 3.1 Pro | 54.2 | #5 |
-
-### Terminal-Bench 2.0 Rankings
-| Model | Score |
-|-------|-------|
-| GPT-5.4 | 75.1 |
-| GLM-5.1 | 69.0 |
-| Gemini 3.1 Pro | 68.5 |
-| Claude Opus 4.6 | 65.4 |
+### SWE-Bench Pro Rankings (Verified)
+
+**Note:** Claude Mythos Preview (77.8%) leads overall; GLM-5.1 leads among open-source models.
+
+| Rank | Model | Score | License |
+|------|-------|-------|---------|
+| 1 | Claude Mythos Preview | 77.8% | Proprietary |
+| 2 | GLM-5.1 | 58.4% | Open (MIT) |
+| 3 | GPT-5.4 | 57.7% | Proprietary |
+| 4 | GPT-5.3 Codex | 56.8% | Proprietary |
+| 5 | Qwen3.6 Plus | 56.6% | Proprietary |
+| 6 | Claude Opus 4.6 | 57.3%* | Proprietary |
+| 7 | Gemini 3.1 Pro | 54.2% | Proprietary |
+
+*Note: Rankings may shift as new evaluations are submitted.
+
+**Source:** https://llm-stats.com/benchmarks/swe-bench-pro
+
+### Terminal-Bench 2.0 Rankings (Harness + Model)
+
+**Important:** Terminal-Bench measures agent harness + model combinations, not raw model performance.
+
+| Rank | Harness | Model | Score | Date |
+|------|---------|-------|-------|------|
+| 1 | Pilot | Claude Opus 4.6 | 82.9% | 2026-04-01 |
+| 2 | ForgeCode | GPT-5.4 | 81.8% | 2026-03-12 |
+| 3 | ForgeCode | Claude Opus 4.6 | 81.8% | 2026-03-12 |
+| 4 | TongAgents | Gemini 3.1 Pro | 80.2% | 2026-03-13 |
+| 5 | SageAgent | GPT-5.3-Codex | 78.4% | 2026-03-13 |
+
+**Source:** https://www.tbench.ai/leaderboard/terminal-bench/2.0
 
 ### CyberGym Rankings (1,507 real tasks)
 | Model | Score |
--- a/opencode/opencode/feedback/localllm/local-llm-feedback.md
+++ b/opencode/opencode/feedback/localllm/local-llm-feedback.md
@@ -7,20 +7,33 @@ This document compiles community feedback, benchmark results, and performance ob
 
 ## Qwen Models
 
-### Qwen3.5-35B-A3B (MoE)
-**Model:** Qwen3.5-35B-A3B
-**Size:** 35B total / 3B active parameters
+### Model Reference Guide
+
+| Model Family | Available Sizes | Type | Notes |
+|--------------|-----------------|------|-------|
+| **Qwen 3.5** | 0.8B, 2B, 4B, 9B | Dense | Released Feb 2026 |
+| **Qwen 3.5** | 27B, 122B-A10B, 397B-A17B | MoE | Released Feb 2026 |
+| **Qwen3** | 0.6B, 1.7B, 4B, 8B, 14B, 32B | Dense | Released April 2025 |
+| **Qwen3** | 30B-A3B, 235B-A22B | MoE | Released April 2025 |
+| **Qwen2.5** | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B | Dense | + Coder variants |
+
+> **Note:** "Qwen3.5-35B-A3B" references in community posts likely mean **Qwen3-30B-A3B** (from the Qwen3 MoE family) or are speculative. Qwen 3.5 MoE sizes are 27B, 122B-A10B, and 397B-A17B.
+
+---
+
+### Qwen3-30B-A3B (MoE) [Most likely model referenced]
+**Model:** Qwen3-30B-A3B (not Qwen 3.5)
+**Size:** 30B total / 3B active parameters
 **Quantization:** Q4_K_M, Q8_0, UD-Q4_K_XL
 **Provider:** llama.cpp / Ollama / HuggingFace
 
 **Benchmark Results:**
 - **Terminal-Bench:** Most accurate & fast among local models
-- **Performance:** 3-5x faster than dense 27B variants (~60-100 tok/s)
-- **Context:** Supports up to 262k context with `--n-cpu-moe 10` (24GB VRAM)
+- **Performance:** 3-5x faster than dense variants (~60-100 tok/s)
+- **Context:** Supports up to 128k context
 - **Accuracy:** Excellent on coding tasks, comparable to cloud models
 
 **What Worked Well:**
-- Long context handling (262k tested)
+- Long context handling (128k tested)
 - Fast inference due to MoE architecture
 - Good tool calling with corrected chat templates
 - Works well with OpenCode's skill system
@@ -39,7 +52,7 @@ This document compiles community feedback, benchmark results, and performance ob
 --batch-size 2048
 --ubatch-size 512
 --jinja
---chat-template-file qwen35-chat-template-corrected.jinja
+--chat-template-file qwen3-chat-template-corrected.jinja
 --context-shift
 ```
 
@@ -188,12 +201,12 @@ ollama run gemma4:e4b
 **License:** MIT (Open Weights)
 
 **Benchmark Results:**
-- **SWE-Bench Pro:** 58.4 (Rank #1 open source)
-- **Terminal-Bench 2.0:** 69.0
+- **SWE-Bench Pro:** 58.4 (Rank #1 open source, verified)
 - **CyberGym:** 68.7 (1,507 real tasks)
 - **MCP-Atlas:** 71.8
 - **Autonomous Duration:** 8 hours continuous execution
 - **Steps:** Up to 1,700 autonomous steps
+- **Note:** Terminal-Bench scores are harness-specific and not reported for GLM-5.1
 
 **What Worked Well:**
 - Best open-source model on SWE-Bench Pro
@@ -311,12 +324,14 @@ docker model configure --context-size=100000 gpt-oss:20B-UD-Q8_K_XL
 
 ### Best Local Models for OpenCode (Ranked)
 
-1. **Qwen3.5-35B-A3B** - Best overall balance of speed, accuracy, context
+1. **Qwen3-30B-A3B** (or Qwen 3.5 27B if available) - Best balance of speed, accuracy, context
 2. **Gemma 4 26B-A4B** - Best for M-series Mac, very efficient
-3. **GLM-5.1** - Best for long-horizon tasks (if hardware allows)
+3. **GLM-5.1** - Best for long-horizon tasks (requires enterprise hardware)
 4. **Nemotron 3 Super** - Best for agentic reasoning (enterprise hardware)
 5. **Gemma 4 8B** - Best for quick tasks on modest hardware
 
+**Note:** Community references to "Qwen3.5-35B-A3B" likely mean **Qwen3-30B-A3B** from the Qwen3 family (not Qwen 3.5). Qwen 3.5 MoE models come in 27B, 122B-A10B, and 397B-A17B sizes.
+
 ### Hybrid Setup Strategy
 - **Local models:** Lightweight tasks, repetitive work, privacy-sensitive
 - **Cloud models:** Complex reasoning, multi-file refactors, deep analysis
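The hybrid strategy above can be prototyped as a thin router that sends lightweight prompts to the local server and escalates complex ones to a cloud endpoint. A minimal sketch; the URLs, model IDs, and the complexity heuristic are placeholder assumptions, not part of the feedback data:

```python
import requests

LOCAL_URL = "http://localhost:8080/v1/chat/completions"      # e.g., llama.cpp serving Qwen3-30B-A3B
CLOUD_URL = "https://cloud.example.com/v1/chat/completions"  # placeholder frontier endpoint

def looks_complex(prompt: str) -> bool:
    """Crude routing heuristic: long prompts and multi-file work go to the cloud."""
    return len(prompt) > 4000 or any(
        k in prompt.lower() for k in ("refactor", "multi-file", "architecture")
    )

def complete(prompt: str) -> str:
    use_cloud = looks_complex(prompt)
    resp = requests.post(
        CLOUD_URL if use_cloud else LOCAL_URL,
        json={
            "model": "frontier-model" if use_cloud else "qwen3-30b-a3b",  # hypothetical model IDs
            "messages": [{"role": "user", "content": prompt}],
        },
        timeout=300,
    )
    resp.raise_for_status()
    return resp.json()["choices"][0]["message"]["content"]
```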
--- a/pi/feedback/frontier/frontier-model-feedback.md
+++ b/pi/feedback/frontier/frontier-model-feedback.md
@@ -11,8 +11,8 @@
 
 ### Benchmark Results
 - **SWE-bench Verified:** 80.0%
-- **SWE-bench Pro:** 57.7% (Rank 1)
-- **Terminal-Bench 2.0:** 75.1% (Rank 1)
+- **SWE-bench Pro:** 57.7% (Rank 3, behind Claude Mythos Preview 77.8% and GLM-5.1 58.4%)
+- **Terminal-Bench 2.0:** Varies by harness (e.g., ForgeCode + GPT-5.4: 81.8%)
 - **LiveCodeBench:** Not specified
 - **MRCR v2 (1M context):** Not specified
 
@@ -23,10 +23,10 @@
 
 ### What Worked Well
 1. **Speed:** Fastest terminal execution among frontier models
-2. **Terminal Execution:** 9.7pt advantage over Claude Opus on Terminal-Bench
+2. **Terminal Execution:** Strong performance on terminal tasks
 3. **Tool Search:** 47% token reduction with tool search
 4. **Physics Simulation:** Near perfect emulation in creative coding tasks
-5. **Cost Efficiency:** Best price/performance ratio for terminal tasks
+5. **Cost Efficiency:** Good price/performance ratio for terminal tasks
 
 ### Source References
 - [MorphLLM - Best AI for Coding 2026](https://www.morphllm.com/best-ai-model-for-coding)
@@ -191,11 +191,13 @@
 - **Long Context:** Claude Opus 4.6 best at lossless summarization under compression
 
 ### Performance on Benchmarks
-- **Terminal-Bench:** GPT-5.4 leads with 75.1%
-- **SWE-bench:** Claude Opus 4.6 leads with 80.8%
+- **SWE-bench:** Claude Opus 4.6 leads with 80.8% (Verified)
+- **SWE-bench Pro:** Claude Mythos Preview leads with 77.8%
 - **LiveCodeBench:** Gemini 3.1 Pro leads with 2887 Elo
 - **Retry Rate:** 1.0-1.5 retries per prompt typical for frontier models
 
+**Note:** Terminal-Bench scores vary significantly by harness. See harness-specific feedback folders for Terminal-Bench results.
+
 ### Best Practices
 1. Use GPT-5.4 for terminal execution and speed
 2. Use Claude Opus 4.6 for complex reasoning and large codebases