# ForgeCode Research & Analysis Folder

This folder contains comprehensive research and analysis of the **ForgeCode** coding harness from antinomyhq.

---

## Folder Structure

```
forgecode/
├── feedback/
│   ├── frontier/                        # Frontier/closed-weight model feedback
│   │   ├── claude-opus-4.6.md
│   │   ├── gpt-5.4.md
│   │   ├── gemini-3.1-pro.md
│   │   ├── privacy-security-concerns.md
│   │   ├── pricing-model.md
│   │   ├── feature-comparison-ecosystem.md
│   │   ├── benchmark-controversy.md
│   │   └── summary-best-practices.md
│   └── localllm/                        # Local/open-weight model feedback
│       ├── qwen-3.5.md
│       ├── general-local-models.md
│       ├── tool-calling-reliability.md
│       ├── github-issues-summary.md
│       ├── minimax-glm-deepseek.md
│       └── installation-platform-issues.md
└── README.md                            # This file
```

---

## Key Findings Summary

### Strengths

- **Speed:** 3x faster than Claude Code on identical tasks (Opus 4.6)
- **Multi-model:** 300+ models via OpenRouter
- **Open source:** Apache 2.0, auditable
- **Context efficiency:** ~90% reduction vs full-file inclusion

### Weaknesses

- **Privacy concerns:** Telemetry collects SSH/git data by default
- **Feature gaps:** No checkpoints, auto-memory, or IDE extensions
- **Benchmark questions:** Self-reported scores differ from independent validation
- **GPT 5.4 stability:** "Borderline unusable" despite an 81.8% benchmark score

### Critical Issues

1. **#2894:** Multiple system messages break Qwen 3.5 and similar models
2. **#1318:** Telemetry collection concerns
3. **#2893:** Ghostty terminal resize bug

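Issue #2894 stems from requests that contain more than one system message, which some open-weight chat templates (Qwen among them) reject or mishandle. A common client-side workaround is to collapse all system messages into a single leading one before sending the request. The sketch below assumes OpenAI-style role/content message dicts; the helper name `merge_system_messages` is hypothetical, not part of ForgeCode's actual API:

```python
def merge_system_messages(messages):
    """Collapse all system messages into one leading system message.

    Hypothetical client-side workaround (not ForgeCode's API) for chat
    templates that only tolerate a single system message per request.
    Assumes OpenAI-style {"role": ..., "content": ...} dicts.
    """
    # Gather every system message's text, preserving order.
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    # Keep all non-system messages in their original order.
    rest = [m for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    # Join the system texts and place the merged message first.
    merged = {"role": "system", "content": "\n\n".join(system_parts)}
    return [merged] + rest
```

Until the issue is fixed upstream, a shim like this between the harness and a local inference server is one way to keep Qwen-family models usable.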
---

## Model Recommendations

### Best Overall Experience

- **Claude Opus 4.6** - Fast, stable, reliable

### Best Value

- **MiniMax M2.1** - 47.9% score at $0.30/$1.20 per million tokens

### Avoid

- **GPT 5.4** through ForgeCode - Tool calling failures
- **Qwen 3.5** - Broken by #2894 until fixed

---

## Quick Links

- **Repository:** https://github.com/antinomyhq/forgecode
- **Documentation:** https://forgecode.dev/docs/
- **Discord:** https://discord.gg/kRZBPpkgwq
- **TermBench Leaderboard:** https://tbench.ai/leaderboard/terminal-bench/2.0

---

## Feedback Format

Each feedback file includes:

- Model used (name, size, provider)
- Benchmark results or task performance
- Issues encountered
- What worked well
- Source reference (URL or site)

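As an illustration, a feedback file following the format above might be skeletoned like this (every bracketed value is a placeholder, not real data):

```markdown
# <model-name>.md

## Model
<name> (<size>, <provider>)

## Benchmark results / task performance
- <benchmark or task>: <result>

## Issues encountered
- <issue>

## What worked well
- <observation>

## Source
- <URL or site>
```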
---

## Last Updated

April 9, 2026

Compiled from:

- GitHub issues (48 open, 433 closed)
- Reddit discussions (r/ClaudeCode, r/cursor, r/LocalLLaMA)
- DEV Community articles
- ForgeCode blog posts
- Independent benchmark sites (llm-stats.com)
- Academic papers (arXiv)