Initial commit: coding harness feedback analysis

Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.
# Frontier Model Feedback for OpenCode
## Overview
This document compiles community feedback, benchmark results, and performance observations for **frontier (cloud) models** used with OpenCode. Data sourced from Reddit, GitHub issues, benchmark dashboards, and community blogs.
---
## GPT Models
### GPT-5.4
**Model:** GPT-5.4
**Provider:** OpenAI
**Context:** 1M tokens
**Benchmark Results:**
- **Terminal-Bench 2.0:** 75.1 (Highest among tested models)
- **SWE-Bench Pro:** 57.7 (Rank #2 overall)
- **Reasoning:** Strong on AIME 2025, HMMT, GPQA-Diamond
- **Context:** Compaction triggers at 272k (sometimes earlier), never reaches full 1M
**What Worked Well:**
- Best Terminal-Bench 2.0 performance
- Strong reasoning capabilities
- Excellent tool calling
- Good for complex multi-step tasks
**Issues Encountered:**
- Compaction triggers too early (272k vs advertised 1M)
- Context never approaches full 1M tokens
- Expensive for long-running sessions
- Some users report quality degradation before compaction
**Source References:**
- [GitHub Issue #16308: 1M context compaction issue](https://github.com/anomalyco/opencode/issues/16308)
- [Terminal-Bench 2.0 Leaderboard](https://www.vals.ai/benchmarks/terminal-bench-2)
---
### GPT-5.2
**Model:** GPT-5.2
**Provider:** OpenAI
**Status:** Recommended by OpenCode
**What Worked Well:**
- Listed as recommended model in OpenCode docs
- Good balance of speed and accuracy
- Reliable tool calling
**Source References:**
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)
---
### GPT OSS 20B
**Model:** GPT OSS 20B
**Provider:** Docker Model Runner (local), OpenRouter (cloud)
**Benchmark Results:**
- **Accuracy:** Consistently accurate on coding tasks
- **Speed:** Acceptable for local deployment
- **Context:** Requires a manual increase from the 4K default (see the config sketch below)
**What Worked Well:**
- Good local alternative to cloud models
- Works with Docker Model Runner
- Acceptable performance for development tasks
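The 4K default can be raised in opencode's JSON config. A minimal sketch, assuming Docker Model Runner's OpenAI-compatible endpoint on its default port (12434); the provider key `local`, the model ID, and the limit values are illustrative, so verify the field names against the current opencode config schema:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Docker Model Runner",
      "options": {
        "baseURL": "http://localhost:12434/engines/v1"
      },
      "models": {
        "ai/gpt-oss-20b": {
          "name": "GPT OSS 20B",
          "limit": {
            "context": 131072,
            "output": 8192
          }
        }
      }
    }
  }
}
```

Note that the `limit` block only tells the client how much it may send; the server must also be configured to allocate a matching context window on the Docker Model Runner side (check its docs for the context-size setting).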
**Source References:**
- [grigio.org Benchmark Dashboard](https://grigio.org/opencode-benchmark-dashboard-find-the-best-local-llm-for-your-computer/)
- [The AIOps: Local Models Setup](https://theaiops.substack.com/p/setting-up-opencode-with-local-models)
---
## Claude Models
### Claude Opus 4.6
**Model:** Claude Opus 4.6
**Provider:** Anthropic
**Context:** 200K tokens
**Benchmark Results:**
- **SWE-Bench Pro:** 57.3 (Rank #3 overall)
- **CyberGym:** 66.6
- **NL2Repo:** 49.8 (higher than GLM-5.1)
- **GPU Kernel Optimization:** 4.2x speedup (ahead of GLM-5.1)
- **BrowseComp:** 84.0
**What Worked Well:**
- Strong on long-horizon optimization
- Excellent code quality
- Good for complex refactoring
- Reliable tool calling
**Issues Encountered:**
- Expensive for extended sessions
- Context degradation at ~50% of window
- Slower than some alternatives
- Higher cost per token
**Source References:**
- [Apidog Blog: GLM-5.1 Review](https://apidog.com/blog/glm-5-1/)
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)
---
### Claude Sonnet 4.5
**Model:** Claude Sonnet 4.5
**Provider:** Anthropic
**Status:** Recommended by OpenCode
**What Worked Well:**
- Listed as recommended model
- Good balance of speed and quality
- Reliable for most coding tasks
- Lower cost than Opus
**Source References:**
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)
---
## Gemini Models
### Gemini 3.0 Pro
**Model:** Gemini 3.0 Pro
**Provider:** Google
**Context:** 1M+ tokens
**Status:** Recommended by OpenCode (listed in the docs as "Gemini 3 Pro")
**Benchmark Results:**
- **SWE-Bench Pro:** 54.2
- **Terminal-Bench 2.0:** 68.5
- **BrowseComp:** 85.9 (High)
- **MCP-Atlas:** 69.2
**What Worked Well:**
- Excellent context handling
- Strong on BrowseComp tasks
- Good for long document analysis
- Fast prompt processing
**Issues Encountered:**
- Context degradation starts at ~30% (300k tokens)
- 2-3x slower responses near compaction point
- Hallucinations before compaction triggers
- Quality drops significantly before 75% threshold
**Source References:**
- [GitHub Issue #11314: Context Compaction](https://github.com/anomalyco/opencode/issues/11314)
- [OpenCode Zen](https://opencode.ai/docs/zen/)
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)
---
## Minimax Models
### Minimax M2.1
**Model:** Minimax M2.1
**Provider:** Minimax
**Status:** Recommended by OpenCode
**What Worked Well:**
- Listed as recommended model
- Good for coding tasks
- Competitive with other frontier models
**Source References:**
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)
---
## GLM Models (Frontier)
### GLM-5.1
**Model:** GLM-5.1
**Size:** 754B total / 40B active (MoE)
**Provider:** Z.AI API, BigModel, OpenRouter
**License:** MIT (Open Weights)
**Benchmark Results:**
- **SWE-Bench Pro:** 58.4 (Rank #1 open source, #1 overall among models tested)
- **Terminal-Bench 2.0:** 69.0
- **CyberGym:** 68.7 (1,507 real tasks)
- **MCP-Atlas:** 71.8 (Rank #1 open source)
- **Autonomous Duration:** 8 hours continuous
- **Steps:** Up to 1,700 autonomous steps
**What Worked Well:**
- #1 on SWE-Bench Pro among open models
- 8-hour autonomous coding capability
- MIT license (commercial use allowed)
- Works with OpenCode (see the provider sketch after this list), Claude Code, Kilo Code, Roo Code
- Trained on Huawei Ascend 910B (no Nvidia dependency)
- 3-7 points better than GLM-5 on benchmarks
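To use GLM-5.1 through opencode, a provider entry along these lines should work. A sketch, assuming Z.AI's OpenAI-compatible endpoint and opencode's `{env:...}` substitution syntax; verify the base URL, field names, and model ID against the current Z.AI and opencode docs:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "zai": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Z.AI",
      "options": {
        "baseURL": "https://api.z.ai/api/paas/v4",
        "apiKey": "{env:ZAI_API_KEY}"
      },
      "models": {
        "glm-5.1": { "name": "GLM-5.1" }
      }
    }
  },
  "model": "zai/glm-5.1"
}
```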
**Pricing:**
- **API:** $1.40/M input, $4.40/M output
- **Peak Hours:** 3x base rate (14:00-18:00 Beijing time)
- **Off-Peak:** 2x base rate (reduced to 1x through an April 2026 promotion)
- **GLM Coding Plan:** $10/month subscription
**Source References:**
- [Apidog Blog: GLM-5.1 Review](https://apidog.com/blog/glm-5-1/)
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)
- [Z.AI Developer Docs](https://docs.z.ai/guides/llm/glm-5)
---
## OpenRouter Models
### Grok Fast
**Model:** Grok Fast
**Provider:** OpenRouter
**Status:** Free model
**What Worked Well:**
- Fast code generation
- Handles large refactors well
- Performs best in repos with good test coverage
- Free tier available
**Limitations:**
- Weaker reasoning than flagship models
- Best suited to simple, well-scoped tasks
- Needs an existing test suite to catch its mistakes
**Source References:**
- [Reddit: Opencode benchmarks](https://www.reddit.com/r/opencodeCLI/comments/1perov3/opencode_benchmarks_which_agentic_llm_models_work/)
---
### Step 3.5 Flash
**Model:** Step 3.5 Flash
**Provider:** OpenRouter
**Status:** Top performer on the grigio.org benchmark dashboard
**What Worked Well:**
- Led the dashboard in both accuracy and speed
- Good balance of cost and quality
- Reliable for most tasks
**Source References:**
- [grigio.org Benchmark Dashboard](https://grigio.org/opencode-benchmark-dashboard-find-the-best-local-llm-for-your-computer/)
---
## OpenCode Zen Models
OpenCode Zen is a curated list of models tested and verified by the OpenCode team.
**Zen Models Include:**
- GLM-4.6 (works great with dedicated API)
- DeepSeek 3.2 (works great with dedicated API)
- Various free and paid options
**What Worked Well:**
- Curated selection of reliable models
- Dedicated APIs perform better than OpenRouter
- Good for users who want pre-verified options
**Source References:**
- [Reddit: Opencode benchmarks](https://www.reddit.com/r/opencodeCLI/comments/1perov3/opencode_benchmarks_which_agentic_llm_models_work/)
---
## Benchmark Comparisons
### SWE-Bench Pro Rankings
| Model | Score | Rank |
|-------|-------|------|
| GLM-5.1 | 58.4 | #1 |
| GPT-5.4 | 57.7 | #2 |
| Claude Opus 4.6 | 57.3 | #3 |
| GLM-5 | 55.1 | #4 |
| Gemini 3.0 Pro | 54.2 | #5 |
### Terminal-Bench 2.0 Rankings
| Model | Score |
|-------|-------|
| GPT-5.4 | 75.1 |
| GLM-5.1 | 69.0 |
| Gemini 3.0 Pro | 68.5 |
| Claude Opus 4.6 | 65.4 |
### CyberGym Rankings (1,507 real tasks)
| Model | Score |
|-------|-------|
| GLM-5.1 | 68.7 |
| Claude Opus 4.6 | 66.6 |
| GLM-5 | ~49 |
### MCP-Atlas Rankings
| Model | Score |
|-------|-------|
| Claude Opus 4.6 | 73.8 |
| GLM-5.1 | 71.8 |
| Gemini 3.0 Pro | 69.2 |
| GPT-5.4 | 67.2 |
**Source References:**
- [Apidog Blog: GLM-5.1 Review](https://apidog.com/blog/glm-5-1/)
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)
---
## Long-Horizon Optimization
### GLM-5.1 8-Hour Autonomous Test
**Task:** Build full Linux desktop environment from scratch
**Results:**
- **Iterations:** 655 autonomous iterations
- **Optimization:** 6.9x throughput increase
- **Duration:** 8 hours continuous execution
- **Steps:** 1,700 autonomous steps
**Comparison:**
- **GLM-5:** Plateaued at 8,000-10,000 QPS
- **GLM-5.1:** Reached 21,500 QPS (6,000+ tool calls)
- **Claude Opus 4.6:** 3,547 QPS (single session)
**Source References:**
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)
---
## Context Management
### Compaction Threshold Issues
**Problem:** The hardcoded 75% threshold triggers compaction long after quality has already degraded (see the sketch after the table below)
**Model-Specific Degradation:**
| Model | Degradation Start | Compaction Trigger | Result |
|-------|------------------|-------------------|--------|
| Gemini | ~30% (300k) | 75% (786k) | 2-3x slower, hallucinations |
| Claude | ~50% | 75% | Significant quality drops |
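To see why a single fixed ratio misbehaves across models, here is an illustrative sketch of a proportional compaction trigger. This is not opencode's actual implementation; the type and function names are invented, and only the arithmetic reflects the issue report:

```go
package main

import "fmt"

// modelLimits pairs a context window with the fraction of it that may be
// used before compaction (summarization) should fire.
type modelLimits struct {
	contextWindow   int     // total tokens the model accepts
	compactionRatio float64 // fraction of the window that triggers compaction
}

// shouldCompact reports whether the session has crossed the compaction point.
func shouldCompact(usedTokens int, m modelLimits) bool {
	return float64(usedTokens) >= float64(m.contextWindow)*m.compactionRatio
}

func main() {
	// Fixed 75% ratio on a 1M (2^20) window: compaction fires near 786k tokens,
	// long after the ~300k point where Gemini users report slowdowns.
	fixed := modelLimits{contextWindow: 1 << 20, compactionRatio: 0.75}
	fmt.Println(shouldCompact(300_000, fixed)) // false: degradation already underway

	// A per-model, user-configurable ratio (the ask in issue #11314) lets
	// compaction fire before quality drops.
	tuned := modelLimits{contextWindow: 1 << 20, compactionRatio: 0.25}
	fmt.Println(shouldCompact(300_000, tuned)) // true: compacts at ~262k tokens
}
```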
**Source References:**
- [GitHub Issue #11314: Configurable Context Compaction](https://github.com/anomalyco/opencode/issues/11314)
---
## General Recommendations
### Best Frontier Models for OpenCode (Ranked)
1. **GLM-5.1** - Best overall (SWE-Bench Pro #1, MIT license)
2. **GPT-5.4** - Best Terminal-Bench performance
3. **Claude Opus 4.6** - Best for long-horizon tasks
4. **Gemini 3.0 Pro** - Best for very long contexts (note the compaction caveats above)
5. **GPT-5.2** - Best recommended default
### Hybrid Setup Strategy
- **Frontier models:** Complex reasoning, multi-file refactors, deep analysis
- **Local models:** Quick tasks, repetitive work, privacy-sensitive
- Switch between models using `/models` command
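In config terms, one way to set this up is a frontier default plus a local model for lightweight internal tasks. A minimal sketch, assuming the `local` provider block from the GPT OSS 20B section above; the `small_model` field (which opencode uses for things like title generation) and the model IDs should be checked against the current schema:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "model": "anthropic/claude-sonnet-4-5",
  "small_model": "local/ai/gpt-oss-20b"
}
```

With this in place, `/models` switches the primary model mid-session without editing the config.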
### Cost Considerations
- **GLM-5.1:** $1.40/M input, $4.40/M output (cheapest frontier)
- **GPT-5.4:** ~$10/M input, ~$30/M output (expensive)
- **Claude Opus 4.6:** ~$15/M input, ~$75/M output (most expensive)
- **OpenRouter:** Aggregates multiple providers, often cheaper
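To make the spread concrete, back-of-the-envelope math for a hypothetical session consuming 2M input and 0.5M output tokens at the list prices above:

- **GLM-5.1:** 2 × $1.40 + 0.5 × $4.40 = $5.00
- **GPT-5.4:** 2 × $10 + 0.5 × $30 ≈ $35
- **Claude Opus 4.6:** 2 × $15 + 0.5 × $75 ≈ $67.50

That is roughly a 13x spread between the cheapest and most expensive frontier option for the same workload.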
---
## Data Sources Summary
| Source Type | Count | Topics Covered |
|-------------|-------|----------------|
| Reddit Threads | 3 | Model comparisons, user experiences |
| GitHub Issues | 2 | Configuration problems, bugs |
| Benchmark Dashboards | 2 | Performance metrics, comparisons |
| Blog Posts | 4 | Setup guides, optimization tips |
| Technical Blogs | 3 | Architecture, benchmark analysis |
| Documentation | 2 | Official docs, configuration |
**Total Sources:** 14 unique (the category counts above overlap, as some sources cover multiple topics)
**Date Range:** April 2025 - April 2026