# Coding Harness Feedback Analysis
Research on four coding agent harnesses to understand what works best for different model sizes, particularly smaller/local models.
## Folder Structure

```
├── AGENTS.md          # Project overview and data collection strategy
├── Research*.md       # Prompt research and orchestration strategies
│
├── opencode/          # Go-based coding agent
│   ├── feedback/
│   │   ├── frontier/  # GPT-5.4, Claude Opus, Gemini feedback
│   │   └── localllm/  # Local model feedback (prompting, tool handling)
│   └── repo/          # Source code (submodule)
│
├── pi/                # Minimal terminal coding harness by Mario Zechner
│   ├── feedback/
│   │   ├── frontier/  # (empty - in progress)
│   │   └── localllm/  # (empty - in progress)
│   └── repo/          # Source code (submodule)
│
├── hermes/            # Nous Research's agent
│   ├── feedback/
│   │   ├── frontier/  # Claude, GPT, budget provider feedback
│   │   ├── localllm/  # Qwen, Gemma, local model feedback
│   │   └── general/   # Bug reports, benchmarks, features
│   └── repo/          # Source code (submodule)
│
└── forgecode/         # AI pair programmer with sub-agents
    ├── feedback/
    │   ├── frontier/  # GPT-5.4, Claude, Gemini, pricing, security
    │   └── localllm/  # Qwen, MiniMax, GLM, DeepSeek feedback
    └── repo/          # Source code (submodule)
```
## Quick Navigation

| Harness | Feedback Location | Key Topics |
|---|---|---|
| opencode | `opencode/feedback/` | Tool calling, local model prompting |
| pi | `pi/feedback/` | (Being researched) |
| hermes | `hermes/feedback/` | Terminal-bench results, local setup |
| forgecode | `forgecode/feedback/` | Pricing, benchmarks, security |
## Feedback Format
Each feedback file includes:
- Model name/size/provider
- Task performance or benchmark results
- Issues encountered
- What worked well
- Source reference (URL, Discord, GitHub issues)
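
For illustration, a feedback file following this format might look like the sketch below. The model name is taken from this project's scope; the task results, issue notes, and URL are hypothetical placeholders, not real findings.

```markdown
<!-- Hypothetical example: figures and URL are placeholders -->
# Qwen3-30B-A3B — opencode feedback

- **Model:** Qwen3-30B-A3B (MoE, ~3B active params), self-hosted
- **Task performance:** passed 7/10 small refactoring tasks in local testing
- **Issues encountered:** occasional malformed tool-call JSON on long contexts
- **What worked well:** fast responses; recovers after a single retry prompt
- **Source:** https://example.com/feedback-thread
```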
## Research Focus
- Tool handling and capabilities
- Skills system effectiveness
- Prompt engineering strategies
- Context management
- Error recovery