Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from the respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
Gemini 3.1 Pro with ForgeCode - Feedback Report
Model: Gemini 3.1 Pro Preview
Provider: Google
Harness: ForgeCode
Source References: ForgeCode Blog
Date Compiled: April 9, 2026
Benchmark Performance
TermBench 2.0
- ForgeCode Score: 78.4% (SOTA at time of testing)
- Google's Reported Score: 68.5% on same model
- Gap: ~10-percentage-point advantage for the ForgeCode harness
"The delta is not a better model. It is a better harness. Same weights, 10 percentage points higher."
Key Technical Insights
What Made the Difference
ForgeCode's blog describes seven failure modes and fixes that enabled this performance:
- Non-Interactive Mode: Required for benchmark success (no user to answer clarifying questions)
- Tool Description Optimization: Micro-evals isolating tool misuse categories
- Tool Naming: Using `old_string`/`new_string` argument names from training data
- Entry-Point Discovery: Lightweight semantic pass before exploration
- Time Limit Management: Subagent parallelization + progressive thinking policy
- Planning Enforcement: Mandatory `todo_write` tool usage (38% → 66% pass rate)
- Speed Architecture: Low-complexity work delegated to subagents with minimal thinking budget
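The tool-naming fix above can be sketched as a tool definition whose argument names match the `old_string`/`new_string` convention the model saw in training. This is a hypothetical illustration, not ForgeCode's actual schema: the tool name `edit_file`, the descriptions, and the uniqueness check are assumptions.

```python
# Hypothetical edit-tool definition using old_string/new_string argument
# names. Everything except those two parameter names is an assumption.
EDIT_TOOL_SCHEMA = {
    "name": "edit_file",
    "description": "Replace an exact string in a file.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string", "description": "File to edit."},
            "old_string": {
                "type": "string",
                "description": "Exact text to replace; must appear exactly once.",
            },
            "new_string": {"type": "string", "description": "Replacement text."},
        },
        "required": ["path", "old_string", "new_string"],
    },
}

def apply_edit(text: str, old_string: str, new_string: str) -> str:
    """Apply the edit, rejecting ambiguous or missing matches."""
    count = text.count(old_string)
    if count != 1:
        raise ValueError(f"old_string matched {count} times; need exactly 1")
    return text.replace(old_string, new_string)
```

Rejecting non-unique matches is one common way such tools turn a vague model request into a deterministic file edit.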
Progressive Thinking Policy
- Messages 1-10: Very high thinking (plan formation)
- Messages 11+: Low thinking default (execution)
- Verification calls: Switch back to high thinking
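The three-phase policy above can be sketched as a small dispatch function. The budget labels and the message-10 threshold come from the report; the function name and signature are assumptions for illustration.

```python
# Sketch of the progressive thinking policy described in the report.
def thinking_budget(message_index: int, is_verification: bool = False) -> str:
    """Pick a thinking level for a given turn (1-indexed)."""
    if is_verification:
        return "high"       # verification calls switch back to high thinking
    if message_index <= 10:
        return "very_high"  # messages 1-10: plan formation
    return "low"            # messages 11+: fast execution by default
```

The design intent is to spend the expensive reasoning budget where it pays off (planning and verification) and default to cheap, fast turns during routine execution.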
Source References
- ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/