Initial commit: coding harness feedback analysis

Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
@@ -0,0 +1,45 @@
# Gemini 3.1 Pro with ForgeCode - Feedback Report

**Model:** Gemini 3.1 Pro Preview
**Provider:** Google
**Harness:** ForgeCode
**Source References:** ForgeCode Blog
**Date Compiled:** April 9, 2026

---

## Benchmark Performance

### TermBench 2.0

- **ForgeCode Score:** 78.4% (SOTA at time of testing)
- **Google's Reported Score:** 68.5% on the same model
- **Gap:** ~10 percentage points in favor of the ForgeCode harness
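
The gap quoted above can be checked with a few lines of arithmetic. This is a minimal sketch: the two scores are taken from the list above, while the function names and the relative-gain figure are illustrative additions rather than numbers from the ForgeCode blog.

```python
# Scores as reported above (percent of TermBench 2.0 tasks passed).
forgecode_score = 78.4
google_reported_score = 68.5


def gap_points(harness: float, baseline: float) -> float:
    """Absolute difference in percentage points, rounded to one decimal."""
    return round(harness - baseline, 1)


def relative_gain(harness: float, baseline: float) -> float:
    """Improvement relative to the baseline score, as a percentage."""
    return round((harness - baseline) / baseline * 100, 1)


absolute = gap_points(forgecode_score, google_reported_score)
relative = relative_gain(forgecode_score, google_reported_score)
print(f"{absolute} percentage points absolute, {relative}% relative to baseline")
```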

> "The delta is not a better model. It is better harness. Same weights, 10 percentage points higher."

---

## Key Technical Insights

### What Made the Difference

ForgeCode's blog describes seven failure modes and fixes that enabled this performance:

1. **Non-Interactive Mode:** Required for benchmark success (no user to answer clarifying questions)
2. **Tool Description Optimization:** Micro-evals isolating tool misuse categories
3. **Tool Naming:** Using argument names the model already knows from training data (`old_string`/`new_string`)
4. **Entry-Point Discovery:** Lightweight semantic pass before exploration
5. **Time Limit Management:** Subagent parallelization and a progressive thinking policy
6. **Planning Enforcement:** Mandatory `todo_write` tool usage (38% → 66% pass rate)
7. **Speed Architecture:** Low-complexity work delegated to subagents with minimal thinking budget
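
Items 3 and 6 can be illustrated with a short sketch. Only the `old_string`/`new_string` argument names and the `todo_write` tool name come from the report; the `edit_file` and `shell` tool names, the schema layout, the exact-match rule, and the `PlanningGate` class are hypothetical and are not ForgeCode's actual implementation.

```python
from dataclasses import dataclass

# Hypothetical tool schema. Only the old_string/new_string argument names
# are taken from the report; everything else is an assumption.
EDIT_TOOL_SCHEMA = {
    "name": "edit_file",
    "parameters": {
        "type": "object",
        "properties": {
            "path": {"type": "string"},
            "old_string": {"type": "string"},
            "new_string": {"type": "string"},
        },
        "required": ["path", "old_string", "new_string"],
    },
}


def apply_edit(text: str, old_string: str, new_string: str) -> str:
    """Replace old_string with new_string, rejecting missing or ambiguous matches."""
    count = text.count(old_string)
    if count != 1:
        raise ValueError(f"old_string must match exactly once, found {count}")
    return text.replace(old_string, new_string, 1)


@dataclass
class PlanningGate:
    """Rejects mutating tool calls until todo_write has been called once."""

    plan_written: bool = False
    # Assumed set of tools that change state; purely illustrative.
    mutating_tools: frozenset = frozenset({"edit_file", "shell"})

    def check(self, tool_name: str) -> str:
        if tool_name == "todo_write":
            self.plan_written = True
            return "ok"
        if tool_name in self.mutating_tools and not self.plan_written:
            return "rejected: call todo_write with a plan first"
        return "ok"
```

In this sketch the gate returns a message instead of raising, on the assumption that a harness would feed the rejection back to the model as a tool result so it can recover by writing a plan.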

### Progressive Thinking Policy

- Messages 1-10: Very high thinking (plan formation)
- Messages 11+: Low thinking default (execution)
- Verification calls: Switch back to high thinking
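
The policy above amounts to a small lookup on the message index. The sketch below is one possible reading: the level labels (`very_high`, `high`, `low`) and the `is_verification` flag are hypothetical names, and it assumes that verification only changes the level once the default has dropped to low, since the report says "switch back".

```python
def thinking_level(message_index: int, is_verification: bool = False) -> str:
    """Pick a thinking level for the given 1-based message index.

    Labels are illustrative and do not correspond to a specific provider API.
    """
    if message_index < 1:
        raise ValueError("message_index is 1-based")
    if message_index <= 10:
        # Plan formation: the first ten messages get the largest budget.
        return "very_high"
    if is_verification:
        # Verification calls switch back up from the low default.
        return "high"
    # Execution phase default.
    return "low"


# Example: levels chosen across a short run, with a verification call at message 15.
levels = [thinking_level(i, is_verification=(i == 15)) for i in (1, 10, 11, 15, 16)]
print(levels)
```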

---

## Source References

1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/