# Gemini 3.1 Pro with ForgeCode - Feedback Report

**Model:** Gemini 3.1 Pro Preview
**Provider:** Google
**Harness:** ForgeCode
**Source References:** ForgeCode Blog
**Date Compiled:** April 9, 2026

---

## Benchmark Performance

### TermBench 2.0
- **ForgeCode Score:** 78.4% (SOTA at time of testing)
- **Google's Reported Score:** 68.5% on the same model
- **Gap:** 9.9 (~10) percentage points in favor of the ForgeCode harness

> "The delta is not a better model. It is better harness. Same weights, 10 percentage points higher."

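The gap can be read two ways: as an absolute difference in percentage points, or as a relative gain over Google's reported score. The sketch below computes both from the two figures reported above; the function name `harness_gap` is a hypothetical helper introduced only for illustration.

```python
def harness_gap(harness_pct: float, baseline_pct: float) -> tuple[float, float]:
    """Return (absolute gap in percentage points, relative gain in percent)."""
    abs_gap = harness_pct - baseline_pct
    rel_gain = abs_gap / baseline_pct * 100
    return round(abs_gap, 1), round(rel_gain, 1)


# Scores as reported in this document for TermBench 2.0.
abs_gap, rel_gain = harness_gap(78.4, 68.5)
print(f"{abs_gap} percentage points, {rel_gain}% relative gain")
```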
---

## Key Technical Insights

### What Made the Difference

ForgeCode's blog describes seven failure modes and fixes that enabled this performance:

1. **Non-Interactive Mode:** Required for benchmark success (no user to answer clarifying questions)
2. **Tool Description Optimization:** Micro-evals isolating tool misuse categories
3. **Tool Naming:** Using `old_string`/`new_string` argument names from training data
4. **Entry-Point Discovery:** Lightweight semantic pass before exploration
5. **Time Limit Management:** Subagent parallelization + progressive thinking policy
6. **Planning Enforcement:** Mandatory `todo_write` tool usage (38% → 66% pass rate)
7. **Speed Architecture:** Low-complexity work delegated to subagents with minimal thinking budget

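Two of the fixes above (items 3 and 6) can be sketched in a few lines. This is a minimal illustration, not ForgeCode's actual code: only the argument names `old_string`/`new_string` and the tool name `todo_write` come from the report, while `edit_file`, `apply_edit`, `PlanningGate`, and the schema layout are assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical tool definition in a JSON-Schema-like layout. Only the
# old_string/new_string argument names are taken from the report.
EDIT_TOOL = {
    "name": "edit_file",  # hypothetical tool name
    "description": "Replace exactly one occurrence of old_string with new_string.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "old_string": {"type": "string"},
            "new_string": {"type": "string"},
        },
        "required": ["path", "old_string", "new_string"],
    },
}


def apply_edit(text: str, old_string: str, new_string: str) -> str:
    """Apply the edit only when old_string matches exactly once."""
    if not old_string:
        raise ValueError("old_string must not be empty")
    matches = text.count(old_string)
    if matches != 1:
        raise ValueError(f"expected exactly 1 match for old_string, found {matches}")
    return text.replace(old_string, new_string, 1)


@dataclass
class PlanningGate:
    """Rejects other tool calls until todo_write has been called once."""
    has_plan: bool = False

    def check(self, tool_name: str) -> Optional[str]:
        """Return an error message for the model, or None if the call may run."""
        if tool_name == "todo_write":
            self.has_plan = True
            return None
        if not self.has_plan:
            return "Call todo_write to record a plan before using other tools."
        return None
```

A harness loop would call `PlanningGate.check` before dispatching each tool call and feed any returned message back to the model instead of executing the tool.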
### Progressive Thinking Policy
- Messages 1-10: Very high thinking (plan formation)
- Messages 11+: Low thinking default (execution)
- Verification calls: Switch back to high thinking

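The policy above amounts to a small lookup on the message number. The following is a sketch under stated assumptions: the `Thinking` enum, the `thinking_for` function, and the choice to keep very high thinking for verification calls that happen within the first ten messages are all illustrative and not taken from ForgeCode.

```python
from enum import Enum


class Thinking(Enum):
    VERY_HIGH = "very_high"
    HIGH = "high"
    LOW = "low"


# From the report: messages 1-10 are the plan-formation phase.
PLANNING_MESSAGES = 10


def thinking_for(message_number: int, is_verification: bool = False) -> Thinking:
    """Pick a thinking level for the given 1-based message number."""
    if message_number < 1:
        raise ValueError("message_number is 1-based")
    if message_number <= PLANNING_MESSAGES:
        # Assumption: the planning phase already uses the highest level,
        # so verification calls here are not downgraded to HIGH.
        return Thinking.VERY_HIGH
    if is_verification:
        return Thinking.HIGH
    return Thinking.LOW
```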
---

## Source References

1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/