Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from the respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
Gemini 3.1 Pro with ForgeCode - Feedback Report
Model: Gemini 3.1 Pro Preview
Provider: Google
Harness: ForgeCode
Source References: ForgeCode Blog
Date Compiled: April 9, 2026
Benchmark Performance
TermBench 2.0
- ForgeCode Score: 78.4% (SOTA at time of testing)
- Google's Reported Score: 68.5% on same model
- Gap: ~10-percentage-point advantage for the ForgeCode harness
"The delta is not a better model. It is a better harness. Same weights, 10 percentage points higher."
Key Technical Insights
What Made the Difference
ForgeCode's blog describes seven failure modes and fixes that enabled this performance:
- Non-Interactive Mode: Required for benchmark success (no user to answer clarifying questions)
- Tool Description Optimization: Micro-evals isolating tool misuse categories
- Tool Naming: Using `old_string`/`new_string` argument names from training data
- Entry-Point Discovery: Lightweight semantic pass before exploration
- Time Limit Management: Subagent parallelization + progressive thinking policy
- Planning Enforcement: Mandatory `todo_write` tool usage (38% → 66% pass rate)
- Speed Architecture: Low-complexity work delegated to subagents with minimal thinking budget
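The tool-naming fix above can be sketched as a tool definition whose argument names match the `old_string`/`new_string` convention the model saw in training. This is a hypothetical illustration, not ForgeCode's actual schema: the tool name `edit_file`, the descriptions, and the uniqueness check are assumptions.

```python
# Hypothetical edit-tool definition using old_string/new_string argument
# names. Everything except those two parameter names is an assumption.
EDIT_TOOL_SCHEMA = {
    "name": "edit_file",
    "description": "Replace an exact string in a file.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "File to edit."},
            "old_string": {
                "type": "string",
                "description": "Exact text to replace; must appear exactly once.",
            },
            "new_string": {"type": "string", "description": "Replacement text."},
        },
        "required": ["path", "old_string", "new_string"],
    },
}

def apply_edit(text: str, old_string: str, new_string: str) -> str:
    """Apply the edit, rejecting ambiguous or missing matches."""
    count = text.count(old_string)
    if count != 1:
        raise ValueError(f"old_string matched {count} times; need exactly 1")
    return text.replace(old_string, new_string)
```

Rejecting non-unique matches is one common way such tools turn a vague model request into a deterministic file edit.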
Progressive Thinking Policy
- Messages 1-10: Very high thinking (plan formation)
- Messages 11+: Low thinking default (execution)
- Verification calls: Switch back to high thinking
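The three-phase policy above can be sketched as a small dispatch function. The budget labels and the message-10 threshold come from the report; the function name and signature are assumptions for illustration.

```python
# Sketch of the progressive thinking policy described in the report.
def thinking_budget(message_index: int, is_verification: bool = False) -> str:
    """Pick a thinking level for a given turn (1-indexed)."""
    if is_verification:
        return "high"       # verification calls switch back to high thinking
    if message_index <= 10:
        return "very_high"  # messages 1-10: plan formation
    return "low"            # messages 11+: fast execution by default
```

The design intent is to spend the expensive reasoning budget where it pays off (planning and verification) and default to cheap, fast turns during routine execution.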
Source References
- ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/