Initial commit: coding harness feedback analysis

Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
2026-04-09 15:13:45 +02:00
commit 51123212c4
46 changed files with 7213 additions and 0 deletions
@@ -0,0 +1,45 @@
+# Gemini 3.1 Pro with ForgeCode - Feedback Report
+
+**Model:** Gemini 3.1 Pro Preview  
+**Provider:** Google  
+**Harness:** ForgeCode  
+**Source References:** ForgeCode Blog  
+**Date Compiled:** April 9, 2026
+
+---
+
+## Benchmark Performance
+
+### TermBench 2.0
+- **ForgeCode Score:** 78.4% (SOTA at time of testing)
+- **Google's Reported Score:** 68.5% on the same model
+- **Gap:** 9.9 percentage points (~10) in favor of the ForgeCode harness
+
+> "The delta is not a better model. It is better harness. Same weights, 10 percentage points higher."
+
+---
+
+## Key Technical Insights
+
+### What Made the Difference
+
+ForgeCode's blog describes seven failure modes and fixes that enabled this performance:
+
+1. **Non-Interactive Mode:** Required for benchmark success (no user to answer clarifying questions)
+2. **Tool Description Optimization:** Micro-evals isolating tool misuse categories
+3. **Tool Naming:** Using argument names the model is familiar with from training data (`old_string`/`new_string`)
+4. **Entry-Point Discovery:** Lightweight semantic pass before exploration
+5. **Time Limit Management:** Subagent parallelization + progressive thinking policy
+6. **Planning Enforcement:** Mandatory `todo_write` tool usage (38% → 66% pass rate)
+7. **Speed Architecture:** Low-complexity work delegated to subagents with minimal thinking budget
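
To make item 3 concrete, here is a minimal sketch of an edit tool that takes `old_string`/`new_string` arguments. Only the two argument names come from the report. The schema layout, the `apply_edit` helper, and the exactly-one-match rule are assumptions made for illustration and are not ForgeCode's actual implementation.

```python
# Hypothetical JSON-schema-style tool definition; only the argument names
# `old_string` and `new_string` are taken from the report.
EDIT_TOOL_SCHEMA = {
    "name": "edit",
    "description": "Replace one exact occurrence of old_string with new_string.",
    "parameters": {
        "type": "object",
        "properties": {
            "old_string": {"type": "string"},
            "new_string": {"type": "string"},
        },
        "required": ["old_string", "new_string"],
    },
}


def apply_edit(text: str, old_string: str, new_string: str) -> str:
    """Apply a single search-and-replace edit to `text`.

    Assumption: the edit is rejected unless `old_string` matches exactly
    once, so an ambiguous edit fails loudly instead of silently changing
    the wrong location.
    """
    if not old_string:
        raise ValueError("old_string must not be empty")
    matches = text.count(old_string)
    if matches != 1:
        raise ValueError(f"old_string must match exactly once, found {matches}")
    return text.replace(old_string, new_string, 1)
```

Calling `apply_edit("a = 1\nb = 2\n", "b = 2", "b = 3")` rewrites only the second line, while an `old_string` that appears twice raises an error the agent can react to.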
+
+### Progressive Thinking Policy
+- Messages 1-10: Very high thinking (plan formation)
+- Messages 11+: Low thinking default (execution)
+- Verification calls: Switch back to high thinking
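
The policy above can be sketched as a small selection function. The thresholds follow the three bullets. The names `Thinking`, `PLANNING_MESSAGES`, and `thinking_level` are hypothetical, and keeping verification calls at very high thinking during the first 10 messages is an assumption, since the report does not cover that overlap.

```python
from enum import Enum


class Thinking(Enum):
    VERY_HIGH = "very_high"
    HIGH = "high"
    LOW = "low"


# Messages 1..10 are treated as the plan-formation phase (from the report).
PLANNING_MESSAGES = 10


def thinking_level(message_number: int, is_verification: bool = False) -> Thinking:
    """Pick a thinking budget for the given 1-based message number."""
    if message_number < 1:
        raise ValueError("message_number is 1-based")
    if message_number <= PLANNING_MESSAGES:
        # Assumption: the planning phase stays at very high thinking,
        # even when the call is a verification step.
        return Thinking.VERY_HIGH
    if is_verification:
        # Verification calls switch back to high thinking.
        return Thinking.HIGH
    # Default for execution after the planning phase.
    return Thinking.LOW
```

A harness would call this once per model request and map the returned level onto whatever reasoning-budget setting its provider exposes.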
+
+---
+
+## Source References
+
+1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/