Initial commit: coding harness feedback analysis

Harnesses under analysis:

- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:

- repo/: Source code from the respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
# Frontier Model Feedback for OpenCode

## Overview

This document compiles community feedback, benchmark results, and performance observations for **frontier (cloud) models** used with OpenCode. Data sourced from Reddit, GitHub issues, benchmark dashboards, and community blogs.

---
## GPT Models

### GPT-5.4
**Model:** GPT-5.4
**Provider:** OpenAI
**Context:** 1M tokens

**Benchmark Results:**
- **Terminal-Bench 2.0:** 75.1 (Highest among tested models)
- **SWE-Bench Pro:** 57.7 (Rank #2 overall)
- **Reasoning:** Strong on AIME 2025, HMMT, GPQA-Diamond
- **Context:** Compaction triggers at 272k (sometimes earlier), never reaches full 1M

**What Worked Well:**
- Best Terminal-Bench 2.0 performance
- Strong reasoning capabilities
- Excellent tool calling
- Good for complex multi-step tasks

**Issues Encountered:**
- Compaction triggers too early (272k vs advertised 1M)
- Context never approaches full 1M tokens
- Expensive for long-running sessions
- Some users report quality degradation before compaction

**Source References:**
- [GitHub Issue #16308: 1M context compaction issue](https://github.com/anomalyco/opencode/issues/16308)
- [Terminal-Bench 2.0 Leaderboard](https://www.vals.ai/benchmarks/terminal-bench-2)

---
### GPT-5.2
**Model:** GPT-5.2
**Provider:** OpenAI
**Status:** Recommended by OpenCode

**What Worked Well:**
- Listed as recommended model in OpenCode docs
- Good balance of speed and accuracy
- Reliable tool calling

**Source References:**
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)

---
### GPT OSS 20B
**Model:** GPT OSS 20B
**Provider:** Docker Model Runner (local), OpenRouter (cloud)

**Benchmark Results:**
- **Accuracy:** Very accurate on coding tasks
- **Speed:** Acceptable for local deployment
- **Context:** Requires manual increase from 4K default

**What Worked Well:**
- Good local alternative to cloud models
- Works with Docker Model Runner
- Acceptable performance for development tasks

**Source References:**
- [grigio.org Benchmark Dashboard](https://grigio.org/opencode-benchmark-dashboard-find-the-best-local-llm-for-your-computer/)
- [The AIOps: Local Models Setup](https://theaiops.substack.com/p/setting-up-opencode-with-local-models)

---
## Claude Models

### Claude Opus 4.6
**Model:** Claude Opus 4.6
**Provider:** Anthropic
**Context:** 200K tokens

**Benchmark Results:**
- **SWE-Bench Pro:** 57.3 (Rank #3 overall)
- **CyberGym:** 66.6
- **NL2Repo:** 49.8 (Higher than GLM-5.1)
- **GPU Kernel Optimization:** 4.2x speedup (ahead of GLM-5.1)
- **BrowseComp:** 84.0

**What Worked Well:**
- Strong on long-horizon optimization
- Excellent code quality
- Good for complex refactoring
- Reliable tool calling

**Issues Encountered:**
- Expensive for extended sessions
- Context degradation at ~50% of window
- Slower than some alternatives
- Higher cost per token

**Source References:**
- [Apidog Blog: GLM-5.1 Review](https://apidog.com/blog/glm-5-1/)
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)

---
### Claude Sonnet 4.5
**Model:** Claude Sonnet 4.5
**Provider:** Anthropic
**Status:** Recommended by OpenCode

**What Worked Well:**
- Listed as recommended model
- Good balance of speed and quality
- Reliable for most coding tasks
- Lower cost than Opus

**Source References:**
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)

---
## Gemini Models

### Gemini 3.0 Pro
**Model:** Gemini 3.0 Pro
**Provider:** Google
**Context:** 1M+ tokens

**Benchmark Results:**
- **SWE-Bench Pro:** 54.2
- **Terminal-Bench 2.0:** 68.5
- **BrowseComp:** 85.9 (High)
- **MCP-Atlas:** 69.2

**What Worked Well:**
- Excellent context handling
- Strong on BrowseComp tasks
- Good for long document analysis
- Fast prompt processing

**Issues Encountered:**
- Context degradation starts at ~30% (300k tokens)
- 2-3x slower responses near compaction point
- Hallucinations before compaction triggers
- Quality drops significantly before 75% threshold

**Source References:**
- [GitHub Issue #11314: Context Compaction](https://github.com/anomalyco/opencode/issues/11314)
- [OpenCode Zen](https://opencode.ai/docs/zen/)

---
### Gemini 3 Pro
**Model:** Gemini 3 Pro
**Provider:** Google
**Status:** Recommended by OpenCode

**What Worked Well:**
- Listed as recommended model
- Good general-purpose performance
- Reliable tool calling

**Source References:**
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)

---
## Minimax Models

### Minimax M2.1
**Model:** Minimax M2.1
**Provider:** Minimax
**Status:** Recommended by OpenCode

**What Worked Well:**
- Listed as recommended model
- Good for coding tasks
- Competitive with other frontier models

**Source References:**
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)

---
## GLM Models (Frontier)

### GLM-5.1
**Model:** GLM-5.1
**Size:** 754B total / 40B active (MoE)
**Provider:** Z.AI API, BigModel, OpenRouter
**License:** MIT (Open Weights)

**Benchmark Results:**
- **SWE-Bench Pro:** 58.4 (Rank #1 open source)
- **Terminal-Bench 2.0:** 69.0
- **CyberGym:** 68.7 (1,507 real tasks)
- **MCP-Atlas:** 71.8 (Rank #1 open source)
- **Autonomous Duration:** 8 hours continuous
- **Steps:** Up to 1,700 autonomous steps

**What Worked Well:**
- #1 on SWE-Bench Pro among open models
- 8-hour autonomous coding capability
- MIT license (commercial use allowed)
- Works with OpenCode, Claude Code, Kilo Code, Roo Code
- Trained on Huawei Ascend 910B (no Nvidia dependency)
- 3-7 points better than GLM-5 on benchmarks

**Pricing:**
- **API:** $1.40/M input, $4.40/M output
- **Peak Hours:** 3x rate (14:00-18:00 Beijing)
- **Off-Peak:** 2x rate (1x through April 2026 promo)
- **GLM Coding Plan:** $10/month subscription
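To make the tiered pricing above concrete, here is a minimal sketch (function name and the example session size are illustrative; rates and multipliers are the ones quoted in the pricing list, so verify against the Z.AI docs before relying on them):

```python
# Estimate a GLM-5.1 API session cost from the rates quoted above.
BASE_INPUT = 1.40   # USD per million input tokens (base API rate)
BASE_OUTPUT = 4.40  # USD per million output tokens (base API rate)

def glm_session_cost(input_tokens: int, output_tokens: int,
                     multiplier: float = 1.0) -> float:
    """Cost in USD. multiplier = 3.0 for peak hours, 2.0 off-peak
    (1.0 during the promo period, per the list above)."""
    return multiplier * (input_tokens / 1e6 * BASE_INPUT
                         + output_tokens / 1e6 * BASE_OUTPUT)

# A 2M-input / 0.5M-output session:
print(glm_session_cost(2_000_000, 500_000))       # base rate
print(glm_session_cost(2_000_000, 500_000, 3.0))  # peak hours (3x)
```

At base rates this works out to $5.00, tripling to $15.00 during the Beijing peak window.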
**Source References:**
- [Apidog Blog: GLM-5.1 Review](https://apidog.com/blog/glm-5-1/)
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)
- [Z.AI Developer Docs](https://docs.z.ai/guides/llm/glm-5)

---
## OpenRouter Models

### Grok Fast
**Model:** Grok Fast
**Provider:** OpenRouter
**Status:** Free model

**What Worked Well:**
- Fast code generation
- Great for large refactoring
- Good with test coverage
- Free tier available

**Limitations:**
- Not the smartest model
- Best for simple tasks
- Requires good test coverage

**Source References:**
- [Reddit: Opencode benchmarks](https://www.reddit.com/r/opencodeCLI/comments/1perov3/opencode_benchmarks_which_agentic_llm_models_work/)

---
### Step 3.5 Flash
**Model:** Step 3.5 Flash
**Provider:** OpenRouter
**Status:** Top performer

**What Worked Well:**
- Top performer in accuracy and speed
- Good balance of cost and quality
- Reliable for most tasks

**Source References:**
- [grigio.org Benchmark Dashboard](https://grigio.org/opencode-benchmark-dashboard-find-the-best-local-llm-for-your-computer/)

---
## OpenCode Zen Models

OpenCode Zen is a curated list of models tested and verified by the OpenCode team.

**Zen Models Include:**
- GLM-4.6 (works great with dedicated API)
- DeepSeek 3.2 (works great with dedicated API)
- Various free and paid options

**What Worked Well:**
- Curated selection of reliable models
- Dedicated APIs perform better than OpenRouter
- Good for users who want pre-verified options

**Source References:**
- [Reddit: Opencode benchmarks](https://www.reddit.com/r/opencodeCLI/comments/1perov3/opencode_benchmarks_which_agentic_llm_models_work/)

---
## Benchmark Comparisons

### SWE-Bench Pro Rankings
| Model | Score | Rank |
|-------|-------|------|
| GLM-5.1 | 58.4 | #1 (Open) |
| GPT-5.4 | 57.7 | #2 |
| Claude Opus 4.6 | 57.3 | #3 |
| GLM-5 | 55.1 | #4 |
| Gemini 3.0 Pro | 54.2 | #5 |

### Terminal-Bench 2.0 Rankings
| Model | Score |
|-------|-------|
| GPT-5.4 | 75.1 |
| GLM-5.1 | 69.0 |
| Gemini 3.0 Pro | 68.5 |
| Claude Opus 4.6 | 65.4 |

### CyberGym Rankings (1,507 real tasks)
| Model | Score |
|-------|-------|
| GLM-5.1 | 68.7 |
| Claude Opus 4.6 | 66.6 |
| GLM-5 | ~49 |

### MCP-Atlas Rankings
| Model | Score |
|-------|-------|
| Claude Opus 4.6 | 73.8 |
| GLM-5.1 | 71.8 |
| GPT-5.4 | 67.2 |

**Source References:**
- [Apidog Blog: GLM-5.1 Review](https://apidog.com/blog/glm-5-1/)
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)

---
## Long-Horizon Optimization

### GLM-5.1 8-Hour Autonomous Test
**Task:** Build full Linux desktop environment from scratch

**Results:**
- **Iterations:** 655 autonomous iterations
- **Optimization:** 6.9x throughput increase
- **Duration:** 8 hours continuous execution
- **Steps:** 1,700 autonomous steps

**Comparison:**
- **GLM-5:** Plateaued at 8,000-10,000 QPS
- **GLM-5.1:** Reached 21,500 QPS (6,000+ tool calls)
- **Claude Opus 4.6:** 3,547 QPS (single session)

**Source References:**
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)

---
## Context Management

### Compaction Threshold Issues
**Problem:** Hardcoded 75% threshold causes quality degradation

**Model-Specific Degradation:**
| Model | Degradation Start | Compaction Trigger | Result |
|-------|------------------|-------------------|--------|
| Gemini | ~30% (300k) | 75% (786k) | 2-3x slower, hallucinations |
| Claude | ~50% | 75% | Significant quality drops |
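The gap between where degradation reportedly starts and where the hardcoded compaction actually fires can be quantified with a short sketch (the function name and the 1M = 1,048,576-token window are assumptions for illustration; the fractions come from the table above):

```python
# How many tokens a session spends in the degraded zone before the
# hardcoded 75% compaction threshold fires.
def degraded_zone(window: int, degrade_frac: float,
                  trigger_frac: float = 0.75) -> int:
    """Tokens between reported degradation onset and compaction trigger."""
    return int(window * trigger_frac) - int(window * degrade_frac)

GEMINI_WINDOW = 1_048_576  # ~1M tokens; 75% of this is 786,432 ("786k" above)

# Gemini reportedly degrades from ~30% of the window:
print(degraded_zone(GEMINI_WINDOW, 0.30))  # ~472k tokens spent degraded
```

This is why users in the linked issue ask for a configurable threshold: roughly 45% of the window sits between degradation onset and the fixed trigger.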
**Source References:**
- [GitHub Issue #11314: Configurable Context Compaction](https://github.com/anomalyco/opencode/issues/11314)

---
## General Recommendations

### Best Frontier Models for OpenCode (Ranked)

1. **GLM-5.1** - Best overall (SWE-Bench Pro #1, MIT license)
2. **GPT-5.4** - Best Terminal-Bench performance
3. **Claude Opus 4.6** - Best for long-horizon tasks
4. **Gemini 3.0 Pro** - Best context handling
5. **GPT-5.2** - Recommended default

### Hybrid Setup Strategy
- **Frontier models:** Complex reasoning, multi-file refactors, deep analysis
- **Local models:** Quick tasks, repetitive work, privacy-sensitive
- Switch between models using `/models` command
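For a hybrid setup, the default model can be pinned in OpenCode's config file; a minimal sketch (the model identifier here is illustrative, check the models docs linked above for exact provider/model IDs):

```json
{
  "$schema": "https://opencode.ai/config.json",
  "model": "anthropic/claude-sonnet-4-5"
}
```

The `/models` command then switches models for the current session without editing the file.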
### Cost Considerations
- **GLM-5.1:** $1.40/M input, $4.40/M output (cheapest frontier)
- **GPT-5.4:** ~$10/M input, ~$30/M output (expensive)
- **Claude Opus 4.6:** ~$15/M input, ~$75/M output (most expensive)
- **OpenRouter:** Aggregates multiple providers, often cheaper
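A rough comparison under the rates above (the `~` figures are approximate, so treat results as order-of-magnitude estimates; the function and the 5M/1M example session are illustrative):

```python
# Rough per-session cost comparison using the per-million-token
# rates listed above (GPT-5.4 and Opus rates are approximate "~" figures).
RATES_USD_PER_M = {           # (input, output)
    "GLM-5.1":         (1.40, 4.40),
    "GPT-5.4":         (10.0, 30.0),
    "Claude Opus 4.6": (15.0, 75.0),
}

def session_cost(model: str, input_m: float, output_m: float) -> float:
    """Cost in USD for input_m / output_m million tokens."""
    inp, out = RATES_USD_PER_M[model]
    return inp * input_m + out * output_m

# A heavy session: 5M input tokens, 1M output tokens.
for model in RATES_USD_PER_M:
    print(f"{model}: ${session_cost(model, 5, 1):.2f}")
```

On these numbers the same heavy session costs about $11 on GLM-5.1, $80 on GPT-5.4, and $150 on Claude Opus 4.6, which is the spread driving the "cheapest frontier" label above.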
---

## Data Sources Summary

| Source Type | Count | Topics Covered |
|-------------|-------|----------------|
| Reddit Threads | 3 | Model comparisons, user experiences |
| GitHub Issues | 2 | Configuration problems, bugs |
| Benchmark Dashboards | 2 | Performance metrics, comparisons |
| Blog Posts | 4 | Setup guides, optimization tips |
| Technical Blogs | 3 | Architecture, benchmark analysis |
| Documentation | 2 | Official docs, configuration |

**Total Sources:** 14 unique sources (some counted in more than one category)

**Date Range:** April 2025 - April 2026