
Gemini 3.1 Pro with ForgeCode - Feedback Report

Model: Gemini 3.1 Pro Preview
Provider: Google
Harness: ForgeCode
Source References: ForgeCode Blog
Date Compiled: April 9, 2026


Benchmark Performance

TermBench 2.0

  • ForgeCode Score: 78.4% (SOTA at time of testing)
  • Google's Reported Score: 68.5% on the same model
  • Gap: 9.9 percentage points (~10) in favor of the ForgeCode harness
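The gap can be checked directly from the two reported scores. The short calculation below is only an arithmetic sketch; the variable names are illustrative, and the relative-gain figure is derived here rather than quoted from the source.

```python
# Scores as reported above (TermBench 2.0, same model weights).
forgecode_score = 78.4  # percent, ForgeCode harness
google_score = 68.5     # percent, Google's reported result

# Absolute gap in percentage points.
gap_points = round(forgecode_score - google_score, 1)

# Relative improvement over Google's reported score (derived, not from the source).
relative_gain = round(gap_points / google_score * 100, 1)

print(f"Absolute gap: {gap_points} percentage points")
print(f"Relative gain: {relative_gain}%")
```

The absolute gap is the figure the report refers to; the relative gain only shows how large that gap is compared with the baseline.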

"The delta is not a better model. It is better harness. Same weights, 10 percentage points higher."


Key Technical Insights

What Made the Difference

ForgeCode's blog describes seven failure modes and fixes that enabled this performance:

  1. Non-Interactive Mode: Required for benchmark success (no user to answer clarifying questions)
  2. Tool Description Optimization: Micro-evals isolating tool misuse categories
  3. Tool Naming: Using argument names the model already knows from training data (old_string/new_string)
  4. Entry-Point Discovery: Lightweight semantic pass before exploration
  5. Time Limit Management: Subagent parallelization + progressive thinking policy
  6. Planning Enforcement: Mandatory todo_write tool usage (38% → 66% pass rate)
  7. Speed Architecture: Low-complexity work delegated to subagents with minimal thinking budget
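Items 3 and 6 can be illustrated together with a minimal sketch. It assumes an in-memory file store and a simple gate that rejects edits until a plan exists. Only the names `todo_write`, `old_string`, and `new_string` come from the report; the `Session` class, the `edit` function, and the exact-one-match rule are hypothetical and do not describe ForgeCode's actual implementation.

```python
from dataclasses import dataclass, field


@dataclass
class Session:
    """Hypothetical per-task state; not ForgeCode's real data model."""
    files: dict = field(default_factory=dict)  # path -> file contents
    todos: list = field(default_factory=list)  # current plan items


def todo_write(session: Session, items: list) -> str:
    """Record the plan. An empty plan is rejected."""
    if not items:
        return "error: todo list must not be empty"
    session.todos = list(items)
    return f"ok: {len(items)} todo item(s) recorded"


def edit(session: Session, path: str, old_string: str, new_string: str) -> str:
    """Replace exactly one occurrence of old_string in an in-memory file."""
    if not session.todos:
        # Planning gate (assumed mechanism): no edits before a plan exists.
        return "error: call todo_write before editing"
    text = session.files.get(path)
    if text is None:
        return f"error: no such file: {path}"
    if not old_string:
        return "error: old_string must not be empty"
    count = text.count(old_string)
    if count != 1:
        # Ambiguous or missing matches are reported back instead of guessed.
        return f"error: old_string matched {count} times, expected exactly 1"
    session.files[path] = text.replace(old_string, new_string, 1)
    return "ok: edit applied"
```

In this sketch, a call to `edit` before any `todo_write` returns an error string that the model can read and act on, which is one plausible way to make planning mandatory without a user in the loop.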

Progressive Thinking Policy

  • Messages 1-10: Very high thinking (plan formation)
  • Messages 11+: Low thinking default (execution)
  • Verification calls: Switch back to high thinking
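The policy above can be written as a small selection function. This is a sketch under stated assumptions: the level labels (`"very_high"`, `"high"`, `"low"`), the function name, and the `is_verification` flag are illustrative, and the source does not say how verification calls inside the first ten messages are treated, so the sketch keeps them at the very high level.

```python
def thinking_level(message_index: int, is_verification: bool = False) -> str:
    """Pick a thinking level for the given 1-based message index.

    Assumed labels, not an actual ForgeCode or Gemini API setting.
    """
    if message_index < 1:
        raise ValueError("message_index is 1-based")
    if message_index <= 10:
        # Messages 1-10: plan formation gets the largest budget.
        return "very_high"
    if is_verification:
        # Later verification calls switch back to high thinking.
        return "high"
    # Messages 11+: low thinking by default during execution.
    return "low"


if __name__ == "__main__":
    for index, verifying in [(1, False), (10, False), (11, False), (25, True)]:
        print(index, verifying, thinking_level(index, verifying))
```

Keeping the rule in one pure function makes the thresholds easy to adjust and to test in isolation from the rest of the harness.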

Source References

  1. ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/