Initial commit: coding harness feedback analysis

Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.
# Frontier Model Feedback for OpenCode
## Overview
This document compiles community feedback, benchmark results, and performance observations for **frontier (cloud) models** used with OpenCode. Data sourced from Reddit, GitHub issues, benchmark dashboards, and community blogs.
---
## GPT Models
### GPT-5.4
**Model:** GPT-5.4
**Provider:** OpenAI
**Context:** 1M tokens
**Benchmark Results:**
- **Terminal-Bench 2.0:** 75.1 (Highest among tested models)
- **SWE-Bench Pro:** 57.7 (Rank #2 overall)
- **Reasoning:** Strong on AIME 2025, HMMT, GPQA-Diamond
- **Context:** Compaction triggers at 272k (sometimes earlier), never reaches full 1M
**What Worked Well:**
- Best Terminal-Bench 2.0 performance
- Strong reasoning capabilities
- Excellent tool calling
- Good for complex multi-step tasks
**Issues Encountered:**
- Compaction triggers too early (272k vs advertised 1M)
- Context never approaches full 1M tokens
- Expensive for long-running sessions
- Some users report quality degradation before compaction
**Source References:**
- [GitHub Issue #16308: 1M context compaction issue](https://github.com/anomalyco/opencode/issues/16308)
- [Terminal-Bench 2.0 Leaderboard](https://www.vals.ai/benchmarks/terminal-bench-2)
---
### GPT-5.2
**Model:** GPT-5.2
**Provider:** OpenAI
**Status:** Recommended by OpenCode
**What Worked Well:**
- Listed as recommended model in OpenCode docs
- Good balance of speed and accuracy
- Reliable tool calling
**Source References:**
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)
---
### GPT OSS 20B
**Model:** GPT OSS 20B
**Provider:** Docker Model Runner (local), OpenRouter (cloud)
**Benchmark Results:**
- **Accuracy:** Consistently accurate on coding tasks
- **Speed:** Acceptable for local deployment
- **Context:** Requires a manual increase from the 4K default (see the config sketch below)
**What Worked Well:**
- Good local alternative to cloud models
- Works with Docker Model Runner
- Acceptable performance for development tasks
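The 4K default can be raised in opencode's JSON config. A minimal sketch, assuming Docker Model Runner's OpenAI-compatible endpoint on its default port (12434); the provider key `local`, the model ID, and the limit values are illustrative, so verify the field names against the current opencode config schema:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "local": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Docker Model Runner",
      "options": {
        "baseURL": "http://localhost:12434/engines/v1"
      },
      "models": {
        "ai/gpt-oss-20b": {
          "name": "GPT OSS 20B",
          "limit": {
            "context": 131072,
            "output": 8192
          }
        }
      }
    }
  }
}
```

Note that the `limit` block only tells the client how much it may send; the server must also be configured to allocate a matching context window on the Docker Model Runner side (check its docs for the context-size setting).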
**Source References:**
- [grigio.org Benchmark Dashboard](https://grigio.org/opencode-benchmark-dashboard-find-the-best-local-llm-for-your-computer/)
- [The AIOps: Local Models Setup](https://theaiops.substack.com/p/setting-up-opencode-with-local-models)
---
## Claude Models
### Claude Opus 4.6
**Model:** Claude Opus 4.6
**Provider:** Anthropic
**Context:** 200K tokens
**Benchmark Results:**
- **SWE-Bench Pro:** 57.3 (Rank #3 overall)
- **CyberGym:** 66.6
- **NL2Repo:** 49.8 (higher than GLM-5.1)
- **GPU Kernel Optimization:** 4.2x speedup (ahead of GLM-5.1)
- **BrowseComp:** 84.0
**What Worked Well:**
- Strong on long-horizon optimization
- Excellent code quality
- Good for complex refactoring
- Reliable tool calling
**Issues Encountered:**
- Expensive for extended sessions
- Context degradation at ~50% of window
- Slower than some alternatives
- Higher cost per token
**Source References:**
- [Apidog Blog: GLM-5.1 Review](https://apidog.com/blog/glm-5-1/)
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)
---
### Claude Sonnet 4.5
**Model:** Claude Sonnet 4.5
**Provider:** Anthropic
**Status:** Recommended by OpenCode
**What Worked Well:**
- Listed as recommended model
- Good balance of speed and quality
- Reliable for most coding tasks
- Lower cost than Opus
**Source References:**
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)
---
## Gemini Models
### Gemini 3.0 Pro
**Model:** Gemini 3.0 Pro
**Provider:** Google
**Context:** 1M+ tokens
**Status:** Recommended by OpenCode (listed in the docs as "Gemini 3 Pro")
**Benchmark Results:**
- **SWE-Bench Pro:** 54.2
- **Terminal-Bench 2.0:** 68.5
- **BrowseComp:** 85.9 (High)
- **MCP-Atlas:** 69.2
**What Worked Well:**
- Excellent context handling
- Strong on BrowseComp tasks
- Good for long document analysis
- Fast prompt processing
**Issues Encountered:**
- Context degradation starts at ~30% (300k tokens)
- 2-3x slower responses near compaction point
- Hallucinations before compaction triggers
- Quality drops significantly before 75% threshold
**Source References:**
- [GitHub Issue #11314: Context Compaction](https://github.com/anomalyco/opencode/issues/11314)
- [OpenCode Zen](https://opencode.ai/docs/zen/)
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)
---
## Minimax Models
### Minimax M2.1
**Model:** Minimax M2.1
**Provider:** Minimax
**Status:** Recommended by OpenCode
**What Worked Well:**
- Listed as recommended model
- Good for coding tasks
- Competitive with other frontier models
**Source References:**
- [OpenCode Docs: Models](https://opencode.ai/docs/models/)
---
## GLM Models (Frontier)
### GLM-5.1
**Model:** GLM-5.1
**Size:** 754B total / 40B active (MoE)
**Provider:** Z.AI API, BigModel, OpenRouter
**License:** MIT (Open Weights)
**Benchmark Results:**
- **SWE-Bench Pro:** 58.4 (Rank #1 open source, #1 overall among models tested)
- **Terminal-Bench 2.0:** 69.0
- **CyberGym:** 68.7 (1,507 real tasks)
- **MCP-Atlas:** 71.8 (Rank #1 open source)
- **Autonomous Duration:** 8 hours continuous
- **Steps:** Up to 1,700 autonomous steps
**What Worked Well:**
- #1 on SWE-Bench Pro among open models
- 8-hour autonomous coding capability
- MIT license (commercial use allowed)
- Works with OpenCode (see the provider sketch after this list), Claude Code, Kilo Code, Roo Code
- Trained on Huawei Ascend 910B (no Nvidia dependency)
- 3-7 points better than GLM-5 on benchmarks
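To use GLM-5.1 through opencode, a provider entry along these lines should work. A sketch, assuming Z.AI's OpenAI-compatible endpoint and opencode's `{env:...}` substitution syntax; verify the base URL, field names, and model ID against the current Z.AI and opencode docs:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "provider": {
    "zai": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Z.AI",
      "options": {
        "baseURL": "https://api.z.ai/api/paas/v4",
        "apiKey": "{env:ZAI_API_KEY}"
      },
      "models": {
        "glm-5.1": { "name": "GLM-5.1" }
      }
    }
  },
  "model": "zai/glm-5.1"
}
```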
**Pricing:**
- **API:** $1.40/M input, $4.40/M output
- **Peak Hours:** 3x base rate (14:00-18:00 Beijing time)
- **Off-Peak:** 2x base rate (reduced to 1x through an April 2026 promotion)
- **GLM Coding Plan:** $10/month subscription
**Source References:**
- [Apidog Blog: GLM-5.1 Review](https://apidog.com/blog/glm-5-1/)
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)
- [Z.AI Developer Docs](https://docs.z.ai/guides/llm/glm-5)
---
## OpenRouter Models
### Grok Fast
**Model:** Grok Fast
**Provider:** OpenRouter
**Status:** Free model
**What Worked Well:**
- Fast code generation
- Handles large refactors well
- Performs best in repos with good test coverage
- Free tier available
**Limitations:**
- Weaker reasoning than flagship models
- Best suited to simple, well-scoped tasks
- Needs an existing test suite to catch its mistakes
**Source References:**
- [Reddit: Opencode benchmarks](https://www.reddit.com/r/opencodeCLI/comments/1perov3/opencode_benchmarks_which_agentic_llm_models_work/)
---
### Step 3.5 Flash
**Model:** Step 3.5 Flash
**Provider:** OpenRouter
**Status:** Top performer on the grigio.org benchmark dashboard
**What Worked Well:**
- Led the dashboard in both accuracy and speed
- Good balance of cost and quality
- Reliable for most tasks
**Source References:**
- [grigio.org Benchmark Dashboard](https://grigio.org/opencode-benchmark-dashboard-find-the-best-local-llm-for-your-computer/)
---
## OpenCode Zen Models
OpenCode Zen is a curated list of models tested and verified by the OpenCode team.
**Zen Models Include:**
- GLM-4.6 (works great with dedicated API)
- DeepSeek 3.2 (works great with dedicated API)
- Various free and paid options
**What Worked Well:**
- Curated selection of reliable models
- Dedicated APIs perform better than OpenRouter
- Good for users who want pre-verified options
**Source References:**
- [Reddit: Opencode benchmarks](https://www.reddit.com/r/opencodeCLI/comments/1perov3/opencode_benchmarks_which_agentic_llm_models_work/)
---
## Benchmark Comparisons
### SWE-Bench Pro Rankings
| Model | Score | Rank |
|-------|-------|------|
| GLM-5.1 | 58.4 | #1 |
| GPT-5.4 | 57.7 | #2 |
| Claude Opus 4.6 | 57.3 | #3 |
| GLM-5 | 55.1 | #4 |
| Gemini 3.0 Pro | 54.2 | #5 |
### Terminal-Bench 2.0 Rankings
| Model | Score |
|-------|-------|
| GPT-5.4 | 75.1 |
| GLM-5.1 | 69.0 |
| Gemini 3.0 Pro | 68.5 |
| Claude Opus 4.6 | 65.4 |
### CyberGym Rankings (1,507 real tasks)
| Model | Score |
|-------|-------|
| GLM-5.1 | 68.7 |
| Claude Opus 4.6 | 66.6 |
| GLM-5 | ~49 |
### MCP-Atlas Rankings
| Model | Score |
|-------|-------|
| Claude Opus 4.6 | 73.8 |
| GLM-5.1 | 71.8 |
| Gemini 3.0 Pro | 69.2 |
| GPT-5.4 | 67.2 |
**Source References:**
- [Apidog Blog: GLM-5.1 Review](https://apidog.com/blog/glm-5-1/)
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)
---
## Long-Horizon Optimization
### GLM-5.1 8-Hour Autonomous Test
**Task:** Build full Linux desktop environment from scratch
**Results:**
- **Iterations:** 655 autonomous iterations
- **Optimization:** 6.9x throughput increase
- **Duration:** 8 hours continuous execution
- **Steps:** 1,700 autonomous steps
**Comparison:**
- **GLM-5:** Plateaued at 8,000-10,000 QPS
- **GLM-5.1:** Reached 21,500 QPS (6,000+ tool calls)
- **Claude Opus 4.6:** 3,547 QPS (single session)
**Source References:**
- [Build Fast with AI: GLM-5.1 Review](https://www.buildfastwithai.com/blogs/glm-5-1-open-source-review-2026)
---
## Context Management
### Compaction Threshold Issues
**Problem:** The hardcoded 75% threshold triggers compaction long after quality has already degraded (see the sketch after the table below)
**Model-Specific Degradation:**
| Model | Degradation Start | Compaction Trigger | Result |
|-------|------------------|-------------------|--------|
| Gemini | ~30% (300k) | 75% (786k) | 2-3x slower, hallucinations |
| Claude | ~50% | 75% | Significant quality drops |
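To see why a single fixed ratio misbehaves across models, here is an illustrative sketch of a proportional compaction trigger. This is not opencode's actual implementation; the type and function names are invented, and only the arithmetic reflects the issue report:

```go
package main

import "fmt"

// modelLimits pairs a context window with the fraction of it that may be
// used before compaction (summarization) should fire.
type modelLimits struct {
	contextWindow   int     // total tokens the model accepts
	compactionRatio float64 // fraction of the window that triggers compaction
}

// shouldCompact reports whether the session has crossed the compaction point.
func shouldCompact(usedTokens int, m modelLimits) bool {
	return float64(usedTokens) >= float64(m.contextWindow)*m.compactionRatio
}

func main() {
	// Fixed 75% ratio on a 1M (2^20) window: compaction fires near 786k tokens,
	// long after the ~300k point where Gemini users report slowdowns.
	fixed := modelLimits{contextWindow: 1 << 20, compactionRatio: 0.75}
	fmt.Println(shouldCompact(300_000, fixed)) // false: degradation already underway

	// A per-model, user-configurable ratio (the ask in issue #11314) lets
	// compaction fire before quality drops.
	tuned := modelLimits{contextWindow: 1 << 20, compactionRatio: 0.25}
	fmt.Println(shouldCompact(300_000, tuned)) // true: compacts at ~262k tokens
}
```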
**Source References:**
- [GitHub Issue #11314: Configurable Context Compaction](https://github.com/anomalyco/opencode/issues/11314)
---
## General Recommendations
### Best Frontier Models for OpenCode (Ranked)
1. **GLM-5.1** - Best overall (SWE-Bench Pro #1, MIT license)
2. **GPT-5.4** - Best Terminal-Bench performance
3. **Claude Opus 4.6** - Best for long-horizon tasks
4. **Gemini 3.0 Pro** - Best for very long contexts (note the compaction caveats above)
5. **GPT-5.2** - Best recommended default
### Hybrid Setup Strategy
- **Frontier models:** Complex reasoning, multi-file refactors, deep analysis
- **Local models:** Quick tasks, repetitive work, privacy-sensitive
- Switch between models using `/models` command
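In config terms, one way to set this up is a frontier default plus a local model for lightweight internal tasks. A minimal sketch, assuming the `local` provider block from the GPT OSS 20B section above; the `small_model` field (which opencode uses for things like title generation) and the model IDs should be checked against the current schema:

```json
{
  "$schema": "https://opencode.ai/config.json",
  "model": "anthropic/claude-sonnet-4-5",
  "small_model": "local/ai/gpt-oss-20b"
}
```

With this in place, `/models` switches the primary model mid-session without editing the config.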
### Cost Considerations
- **GLM-5.1:** $1.40/M input, $4.40/M output (cheapest frontier)
- **GPT-5.4:** ~$10/M input, ~$30/M output (expensive)
- **Claude Opus 4.6:** ~$15/M input, ~$75/M output (most expensive)
- **OpenRouter:** Aggregates multiple providers, often cheaper
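To make the spread concrete, back-of-the-envelope math for a hypothetical session consuming 2M input and 0.5M output tokens at the list prices above:

- **GLM-5.1:** 2 × $1.40 + 0.5 × $4.40 = $5.00
- **GPT-5.4:** 2 × $10 + 0.5 × $30 ≈ $35
- **Claude Opus 4.6:** 2 × $15 + 0.5 × $75 ≈ $67.50

That is roughly a 13x spread between the cheapest and most expensive frontier option for the same workload.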
---
## Data Sources Summary
| Source Type | Count | Topics Covered |
|-------------|-------|----------------|
| Reddit Threads | 3 | Model comparisons, user experiences |
| GitHub Issues | 2 | Configuration problems, bugs |
| Benchmark Dashboards | 2 | Performance metrics, comparisons |
| Blog Posts | 4 | Setup guides, optimization tips |
| Technical Blogs | 3 | Architecture, benchmark analysis |
| Documentation | 2 | Official docs, configuration |
**Total Sources:** 14 unique (the category counts above overlap, as some sources cover multiple topics)
**Date Range:** April 2025 - April 2026