# Coding Harness Feedback Analysis
Research on four coding agent harnesses to understand what works best for different model sizes, particularly smaller/local models.
## Folder Structure

```
├── AGENTS.md          # Project overview and data collection strategy
├── Research*.md       # Prompt research and orchestration strategies
│
├── opencode/          # Go-based coding agent
│   ├── feedback/
│   │   ├── frontier/  # GPT-5.4, Claude Opus, Gemini feedback
│   │   └── localllm/  # Local model feedback (prompting, tool handling)
│   └── repo/          # Source code (submodule)
│
├── pi/                # Minimal terminal coding harness by Mario Zechner
│   ├── feedback/
│   │   ├── frontier/  # (empty - in progress)
│   │   └── localllm/  # (empty - in progress)
│   └── repo/          # Source code (submodule)
│
├── hermes/            # Nous Research's agent
│   ├── feedback/
│   │   ├── frontier/  # Claude, GPT, budget provider feedback
│   │   ├── localllm/  # Qwen, Gemma, local model feedback
│   │   └── general/   # Bug reports, benchmarks, features
│   └── repo/          # Source code (submodule)
│
└── forgecode/         # AI pair programmer with sub-agents
    ├── feedback/
    │   ├── frontier/  # GPT-5.4, Claude, Gemini, pricing, security
    │   └── localllm/  # Qwen, MiniMax, GLM, DeepSeek feedback
    └── repo/          # Source code (submodule)
```
## Quick Navigation

| Harness | Feedback Location | Key Topics |
|---|---|---|
| opencode | `opencode/feedback/` | Tool calling, local model prompting |
| pi | `pi/feedback/` | (Being researched) |
| hermes | `hermes/feedback/` | Terminal-bench results, local setup |
| forgecode | `forgecode/feedback/` | Pricing, benchmarks, security |
## Feedback Format
Each feedback file includes:
- Model name/size/provider
- Task performance or benchmark results
- Issues encountered
- What worked well
- Source reference (URL, Discord, GitHub issues)
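
For illustration, a feedback file following this format might look like the sketch below. The model name is taken from this project's scope; the task results, issue notes, and URL are hypothetical placeholders, not real findings.

```markdown
<!-- Hypothetical example: figures and URL are placeholders -->
# Qwen3-30B-A3B — opencode feedback

- **Model:** Qwen3-30B-A3B (MoE, ~3B active params), self-hosted
- **Task performance:** passed 7/10 small refactoring tasks in local testing
- **Issues encountered:** occasional malformed tool-call JSON on long contexts
- **What worked well:** fast responses; recovers after a single retry prompt
- **Source:** https://example.com/feedback-thread
```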
## Research Focus
- Tool handling and capabilities
- Skills system effectiveness
- Prompt engineering strategies
- Context management
- Error recovery