
# Gemini 3.1 Pro with ForgeCode - Feedback Report

**Model:** Gemini 3.1 Pro Preview
**Provider:** Google
**Harness:** ForgeCode
**Source References:** ForgeCode Blog
**Date Compiled:** April 9, 2026

---

## Benchmark Performance

### TermBench 2.0
- **ForgeCode Score:** 78.4% (SOTA at time of testing)
- **Google's Reported Score:** 68.5% with the same model
- **Gap:** 9.9 percentage points (~10) in favor of the ForgeCode harness

> "The delta is not a better model. It is better harness. Same weights, 10 percentage points higher."

---

## Key Technical Insights

### What Made the Difference

ForgeCode's blog describes seven failure modes and the fixes that, by its account, produced this result:

1. **Non-Interactive Mode:** Required for benchmark success (no user to answer clarifying questions)
2. **Tool Description Optimization:** Micro-evals isolating tool misuse categories
3. **Tool Naming:** Using `old_string`/`new_string` argument names from training data
4. **Entry-Point Discovery:** Lightweight semantic pass before exploration
5. **Time Limit Management:** Subagent parallelization + progressive thinking policy
6. **Planning Enforcement:** Mandatory `todo_write` tool usage (38% → 66% pass rate)
7. **Speed Architecture:** Low-complexity work delegated to subagents with minimal thinking budget
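
To make item 3 concrete, here is a minimal sketch of an edit tool that exposes the `old_string`/`new_string` argument names. Only those two argument names come from the source; the tool name `edit_file`, the `path` argument, the schema layout, and the exactly-one-match rule are illustrative assumptions, not ForgeCode's actual implementation.

```python
# Hypothetical tool definition. Only the argument names `old_string` and
# `new_string` are taken from the report; everything else is assumed.
EDIT_TOOL_SCHEMA = {
    "name": "edit_file",  # assumed tool name
    "description": "Replace one exact occurrence of old_string with new_string.",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},        # assumed argument
            "old_string": {"type": "string"},  # name from the report
            "new_string": {"type": "string"},  # name from the report
        },
        "required": ["path", "old_string", "new_string"],
    },
}


def apply_edit(content: str, old_string: str, new_string: str) -> str:
    """Return content with a single occurrence of old_string replaced.

    Requiring exactly one match is an assumption made here so that an
    ambiguous or stale edit fails loudly instead of silently changing
    the wrong location.
    """
    if not old_string:
        raise ValueError("old_string must not be empty")
    matches = content.count(old_string)
    if matches != 1:
        raise ValueError(f"expected exactly 1 match for old_string, found {matches}")
    return content.replace(old_string, new_string, 1)
```

A call such as `apply_edit(source, "b = 2", "b = 3")` succeeds only when `"b = 2"` appears once in `source`; any other match count raises `ValueError`, which a harness could return to the model as a tool error.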

### Progressive Thinking Policy
- Messages 1-10: Very high thinking (plan formation)
- Messages 11+: Low thinking default (execution)
- Verification calls: Switch back to high thinking
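
The policy above can be sketched as a small selection function. This is an illustration under stated assumptions, not ForgeCode's code: the `Thinking` enum, the function name `thinking_level`, and the `is_verification` flag are hypothetical, and the choice to keep messages 1-10 at the highest level even for verification calls is an assumption the source does not address.

```python
from enum import Enum


class Thinking(Enum):
    """Hypothetical thinking-budget levels named after the policy's wording."""
    VERY_HIGH = "very_high"
    HIGH = "high"
    LOW = "low"


# Messages 1-10 are the plan-formation phase according to the report.
PLANNING_MESSAGES = 10


def thinking_level(message_index: int, is_verification: bool = False) -> Thinking:
    """Pick a thinking level for the 1-based message_index.

    Assumption: during plan formation the level stays VERY_HIGH regardless
    of whether the call is a verification call.
    """
    if message_index < 1:
        raise ValueError("message_index is 1-based")
    if message_index <= PLANNING_MESSAGES:
        return Thinking.VERY_HIGH  # plan formation
    if is_verification:
        return Thinking.HIGH       # verification switches back to high
    return Thinking.LOW            # default during execution
```

For example, `thinking_level(3)` selects the plan-formation level, `thinking_level(15)` falls back to the low default, and `thinking_level(15, is_verification=True)` raises the budget again for a verification call.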

---

## Source References

1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/