mid_model_research/forgecode/feedback/frontier/gemini-3.1-pro.md
sleepy 51123212c4 Initial commit: coding harness feedback analysis
Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.
2026-04-09 15:13:45 +02:00


# Gemini 3.1 Pro with ForgeCode - Feedback Report
**Model:** Gemini 3.1 Pro Preview
**Provider:** Google
**Harness:** ForgeCode
**Source References:** ForgeCode Blog
**Date Compiled:** April 9, 2026
---
## Benchmark Performance
### TermBench 2.0
- **ForgeCode Score:** 78.4% (SOTA at time of testing)
- **Google's Reported Score:** 68.5% on same model
- **Gap:** 9.9 percentage points (~10) in favor of the ForgeCode harness
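
The gap can be checked with a few lines of arithmetic. This is only a sketch using the two scores quoted in this report; the variable names are illustrative, and the relative-gain figure is derived here, not a number taken from the ForgeCode blog.

```python
# Scores as quoted in this report (percent of TermBench 2.0 tasks passed).
forgecode_score = 78.4
google_reported_score = 68.5

# Absolute gap in percentage points.
gap_pp = round(forgecode_score - google_reported_score, 1)

# Relative improvement over the lower score, in percent (derived, not from the source).
relative_gain_pct = round(100 * gap_pp / google_reported_score, 1)

print(f"gap: {gap_pp} pp, relative gain: {relative_gain_pct}%")
```

Keeping the absolute gap (percentage points) separate from the relative gain (percent) avoids conflating the two when comparing harnesses.
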
> "The delta is not a better model. It is better harness. Same weights, 10 percentage points higher."
---
## Key Technical Insights
### What Made the Difference
The ForgeCode blog describes seven failure modes and the fixes it credits for this result:
1. **Non-Interactive Mode:** Required for benchmark success (no user to answer clarifying questions)
2. **Tool Description Optimization:** Micro-evals isolating tool misuse categories
3. **Tool Naming:** Using argument names such as `old_string`/`new_string` that the model already knows from training data
4. **Entry-Point Discovery:** Lightweight semantic pass before exploration
5. **Time Limit Management:** Subagent parallelization plus a progressive thinking policy
6. **Planning Enforcement:** Mandatory `todo_write` tool usage (38% → 66% pass rate)
7. **Speed Architecture:** Low-complexity work delegated to subagents with minimal thinking budget
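
To make item 3 concrete, here is a minimal sketch of an edit tool that uses the `old_string`/`new_string` argument names. Everything else is an assumption for illustration: the tool name `edit_file`, the `path` argument, the schema layout, and the single-match rule are hypothetical and are not taken from ForgeCode's source.

```python
# Hypothetical tool schema (JSON-Schema style). Only the old_string/new_string
# argument names come from the report; the rest is illustrative.
EDIT_TOOL_SCHEMA = {
    "name": "edit_file",
    "description": "Replace exactly one occurrence of old_string with new_string.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "old_string": {"type": "string"},
            "new_string": {"type": "string"},
        },
        "required": ["path", "old_string", "new_string"],
    },
}


def apply_edit(text: str, old_string: str, new_string: str) -> str:
    """Apply one edit to text, rejecting missing or ambiguous matches."""
    if not old_string:
        raise ValueError("old_string must not be empty")
    occurrences = text.count(old_string)
    if occurrences == 0:
        raise ValueError("old_string not found")
    if occurrences > 1:
        raise ValueError("old_string matches more than once; add more context")
    return text.replace(old_string, new_string, 1)


source = "a = 1\nb = 2\n"
edited = apply_edit(source, "b = 2", "b = 3")
```

Rejecting ambiguous matches is a design choice in this sketch: it turns a silent wrong edit into an error message the model can react to on its next turn.
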
### Progressive Thinking Policy
- Messages 1-10: Very high thinking (plan formation)
- Messages 11+: Low thinking default (execution)
- Verification calls: Switch back to high thinking
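
The policy above can be sketched as a small selection function. This is a sketch under stated assumptions, not ForgeCode's implementation: the `Thinking` enum, the `thinking_level` function, and the `is_verification` flag are hypothetical names. The source also does not say how verification calls are treated within the first 10 messages, so this sketch assumes the planning phase takes precedence.

```python
from enum import Enum


class Thinking(Enum):
    """Hypothetical thinking levels; names are illustrative."""
    VERY_HIGH = "very_high"
    HIGH = "high"
    LOW = "low"


# Messages 1-10 are treated as the planning phase, per the report.
PLANNING_MESSAGES = 10


def thinking_level(message_index: int, is_verification: bool = False) -> Thinking:
    """Pick a thinking level for the 1-based message_index."""
    if message_index < 1:
        raise ValueError("message_index is 1-based")
    if message_index <= PLANNING_MESSAGES:
        # Assumption: the planning phase wins over the verification rule.
        return Thinking.VERY_HIGH
    if is_verification:
        # Verification calls switch back to high thinking.
        return Thinking.HIGH
    # Default for execution after the planning phase.
    return Thinking.LOW


levels = [thinking_level(i).value for i in (1, 10, 11, 40)]
```

A harness would call something like `thinking_level` before each model request and map the result onto whatever thinking or reasoning parameter its provider API exposes.
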
---
## Source References
1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/