Initial commit: coding harness feedback analysis

Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models
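Concretely, each harness directory is laid out as follows (forgecode shown as the example):

    forgecode/
    ├── repo/
    └── feedback/
        ├── frontier/
        └── localllm/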

Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.
Commit 51123212c4 (2026-04-09 15:13:45 +02:00)
46 changed files with 7213 additions and 0 deletions
@@ -0,0 +1,145 @@
# Community Sources & Ongoing Monitoring
**Last Updated:** April 9, 2026
---
## Official Channels
### Discord
- **URL:** https://discord.gg/kRZBPpkgwq
- **Purpose:** Community support, feature announcements, feedback
- **Activity:** Active (referenced in docs and GitHub)
### GitHub
- **Issues:** https://github.com/antinomyhq/forgecode/issues (48 open, 433 closed)
- **Discussions:** https://github.com/antinomyhq/forgecode/discussions
- **Releases:** https://github.com/antinomyhq/forgecode/releases
### Reddit
- **r/forgecode:** https://www.reddit.com/r/forgecode/ (official subreddit)
- **r/ClaudeCode:** Frequently discusses ForgeCode comparisons
- **r/cursor:** Pricing and feature comparisons
- **r/LocalLLaMA:** Local model usage with ForgeCode
---
## Key External References
### Benchmarks
- **TermBench 2.0:** https://tbench.ai/leaderboard/terminal-bench/2.0
- **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
- **SWE-bench:** https://www.swebench.com/ (independent validation)
### Documentation
- **Official Docs:** https://forgecode.dev/docs/
- **Installation:** https://forgecode.dev/docs/installation/
- **ZSH Support:** https://forgecode.dev/docs/zsh-support/
- **Blog:** https://forgecode.dev/blog/
### Articles & Reviews
- **DEV Community:** Multiple comparison articles
- **TechGig:** Feature overview (August 2025)
- **Artificial Analysis:** Independent benchmark tracking
---
## Notable GitHub Issues to Watch
### Critical (Open)
| Issue | Description | Status |
|-------|-------------|--------|
| #2904 | Use models.dev as registry | Open |
| #2894 | Qwen 3.5 system messages bug | Open |
| #2893 | Ghostty resize bug | Open, PR linked |
| #2888 | API key helpers | Open |
| #2884 | Muse mode blocked | Open |
### Historical (older but still relevant)
| Issue | Description |
|-------|-------------|
| #2813 | Fixed in response to Reddit feedback |
| #2485 | Mac installation issues |
| #1296 | Daily FORGE limit stops tasks |
| #1318 | Telemetry concerns |
---
## Research Papers
### Terminal-Bench
- **arXiv:** https://arxiv.org/html/2601.11868v1
- **OpenReview:** https://openreview.net/forum?id=a7Qa4CcHak
- **Published:** ICLR 2026
---
## Monitoring Recommendations
### Weekly Checks
1. GitHub issues for new bugs affecting model compatibility
2. Discord announcements for feature updates
3. Reddit for user experience reports
### Monthly Reviews
1. Benchmark leaderboard updates (llm-stats.com)
2. New model support announcements
3. Pricing changes
### Quarterly Analysis
1. Comparative reviews (DEV Community, blogs)
2. Feature gap analysis vs competitors
3. Local model compatibility updates
---
## Data Collection Notes
### Exhaustive Search Performed
- Web search across multiple query angles
- GitHub issue extraction
- Documentation review
- Blog post analysis
- Community forum monitoring
### Sources Checked
- GitHub (antinomyhq/forgecode)
- Reddit (r/forgecode, r/ClaudeCode, r/cursor, r/LocalLLaMA)
- DEV Community
- ForgeCode official blog
- Independent benchmark sites
- Academic papers
### Limitations
- Reddit verification challenges prevented some thread extraction
- Discord content not directly accessible (requires login)
- Some GitHub issues require authentication for full details
---
## Contribution Guidelines
When adding new feedback:
1. **Follow the format:**
- Model/Topic header
- Source references with URLs
- What worked / What didn't
- Specific issues encountered
2. **Include dates:** When was the feedback collected?
3. **Categorize correctly:**
- `frontier/` for closed-weight models (GPT, Claude, Gemini, etc.)
- `localllm/` for open-weight models (Qwen, Llama, Mistral, etc.)
4. **Update README.md:** If adding major new categories
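For example, a new open-weight-model report would be filed like this (paths follow the repo layout from the commit message; the filename is hypothetical):

```bash
mkdir -p forgecode/feedback/localllm
$EDITOR forgecode/feedback/localllm/qwen3.5-tool-calling.md  # follow the format above
```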
---
## Contact
For questions about this research:
- Check the GitHub repository for updates
- Join the ForgeCode Discord
- File issues against this research folder
@@ -0,0 +1,140 @@
# ForgeCode Benchmark Controversy - Feedback Report
**Topic:** TermBench 2.0 results, self-reported vs independent validation, "benchmaxxing"
**Source References:** Reddit r/ClaudeCode, DEV Community, llm-stats.com, arXiv
**Date Compiled:** April 9, 2026
---
## The Controversy
ForgeCode achieved **81.8% on TermBench 2.0** with both GPT 5.4 and Opus 4.6 (tied at #1), significantly outperforming Claude Code's 58.0% with the same Opus 4.6 model. This raised questions about:
1. Self-reported vs. independent validation
2. Benchmark-specific optimizations ("benchmaxxing")
3. Proprietary layer involvement
---
## TermBench 2.0 Results
### Self-Reported (via ForgeCode at tbench.ai)
| Configuration | Score | Rank |
|--------------|-------|------|
| ForgeCode + GPT 5.4 | 81.8% | #1 |
| ForgeCode + Opus 4.6 | 81.8% | #1 |
| Claude Code + Opus 4.6 | 58.0% | #39 |
### Independent SWE-bench (Princeton/UChicago)
| Configuration | Score |
|--------------|-------|
| ForgeCode + Claude 4 | 72.7% |
| Claude 3.7 Sonnet (extended thinking) | 70.3% |
| Claude 4.5 Opus | 76.8% |
**Gap narrows from 24 points to 2.4 points on independent benchmark.**
---
## Community Skepticism
### Reddit r/ClaudeCode
> "Looks this agent forgecode.dev ranks better than anyone in terminalbench but anyone is talking about it. Is it fake or what is wrong with these 'artifacts' that promise save time and tokens"
> "At this point, terminalbench has received quite some attention and most benchmarks are not validated."
> "I am concerned about their proprietary layer, which I believe is a big part of what moved their bench scores from ~25% to ~81%."
### The "Benchmaxxing" Term
Community coined "benchmaxxed" to describe ForgeCode's approach:
- Real engineering improvements
- Also benchmark-specific optimizations
- Not necessarily representative of real-world performance
---
## ForgeCode's Defense
### Blog Series: "Benchmarks Don't Matter — Until They Do"
ForgeCode transparently documented their journey:
- **Baseline:** ~25% (interactive-first runtime)
- **Stabilization:** ~38% (non-interactive mode + tool naming fixes)
- **Planning control:** 66% (mandatory todo_write enforcement)
- **Speed architecture:** 78.4% (subagent parallelization + progressive thinking)
- **Final:** 81.8% (additional optimizations)
### Documented Optimizations
1. **JSON schema reordering:** `required` before `properties` for GPT 5.4
2. **Schema flattening:** Reduced nesting
3. **Truncation reminders:** Explicit notes when files partially read
4. **Mandatory verification:** Reviewer skill checks completion
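To illustrate the first two items (a sketch based on the blog's description, not ForgeCode's actual schema), a flattened edit-tool definition with `required` serialized ahead of `properties` might look like:

```bash
# Illustrative schema only; the argument names mirror those documented above.
cat <<'EOF' > edit_tool.json
{
  "name": "edit",
  "description": "Replace old_string with new_string in a file.",
  "parameters": {
    "type": "object",
    "required": ["path", "old_string", "new_string"],
    "properties": {
      "path": {"type": "string"},
      "old_string": {"type": "string"},
      "new_string": {"type": "string"}
    }
  }
}
EOF
```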
---
## The Proprietary Layer Question
**ForgeCode Services** (optional, free during evaluation) includes:
1. Semantic entry-point discovery
2. Dynamic skill loading
3. Tool-call correction layer
**Concern:** These services were used for benchmark evaluations but differ from open-source CLI mode.
**Clarification from Discussion #2545:**
> "Regarding the benchmarks: the setup we used for those evaluations includes ForgeCode Services, which provides multiple improvements over the existing CLI harness."
---
## Independent Terminal-Bench Data
From llm-stats.com (April 9, 2026):
- **23 models evaluated**
- **Average score:** 0.345 (34.5%)
- **Best score:** 0.500 (50.0%) - Claude Sonnet 4.5
- **All results self-reported** (0 verified)
**Top 3:**
1. Claude Sonnet 4.5: 50.0%
2. MiniMax M2.1: 47.9%
3. Kimi K2-Thinking: 47.1%
**Note:** ForgeCode's 81.8% is not on this independent leaderboard; it was self-reported on tbench.ai.
---
## Academic Validation
### Terminal-Bench Paper (ICLR 2026)
From arXiv:2601.11868:
> "Nonetheless, the benchmark developers have invested considerable effort in mitigating reproducibility issues through containerization, repeated runs, and reporting of confidence intervals. We found no major issues. Independent inspection of the released tasks confirmed that they are well-specified and largely free of ambiguity or underspecification."
**Key Point:** The benchmark itself is well-constructed; the question is about harness-specific optimizations.
---
## Key Takeaways
1. **Benchmarks can be gamed:** Documented optimizations show how harness engineering affects scores
2. **Independent validation matters:** 24-point gap shrinks to 2.4 on independent tests
3. **Proprietary layers complicate comparisons:** Services used for benchmarks differ from open-source code
4. **Real-world != benchmark:** GPT 5.4 scored 81.8% but was "borderline unusable" in practice
---
## Recommendations for Benchmark Consumers
1. **Look for independent validation** (SWE-bench > self-reported TermBench)
2. **Test on your own tasks** - benchmarks don't capture all failure modes
3. **Consider harness transparency** - open-source vs proprietary optimizations
4. **Beware benchmaxxing** - optimizations may not generalize
---
## Source References
1. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
2. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
3. **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
4. **arXiv Paper:** https://arxiv.org/html/2601.11868v1
5. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
6. **GitHub Discussion:** https://github.com/antinomyhq/forgecode/discussions/2545
@@ -0,0 +1,81 @@
# Claude Opus 4.6 with ForgeCode - Feedback Report
**Model:** Claude Opus 4.6
**Provider:** Anthropic
**Harness:** ForgeCode
**Source References:** DEV Community (Liran Baba), ForgeCode Blog, Reddit r/ClaudeCode
**Date Compiled:** April 9, 2026
---
## Benchmark Performance
### TermBench 2.0 (Self-Reported via ForgeCode)
- **Score:** 81.8% (tied for #1)
- **Comparison:** Claude Code + Opus 4.6 scored 58.0% (Rank #39)
- **Gap:** ~24 percentage points in favor of the ForgeCode harness
### SWE-bench Verified (Independent - Princeton/UChicago)
- **ForgeCode + Claude 4:** 72.7%
- **Claude Code + Claude 3.7 Sonnet (extended thinking):** 70.3%
- **Gap:** Only 2.4 percentage points
**Key Insight:** The benchmark gap narrows significantly on independent validation. TermBench 2.0 results are self-reported by ForgeCode itself.
---
## Real-World Performance Feedback
### Speed
- **Observation:** "Noticeably faster than Claude Code. Not marginal, real."
- **Test Case:** Adding post counter to blog index (Astro 6, ~30 files)
- Claude Code: ~90 seconds
- ForgeCode + Opus 4.6: <30 seconds
- **Consistency:** Multi-file renames, component additions, layout restructuring all showed faster performance
### Why Faster
1. **Rust binary** vs Claude Code's TypeScript (better startup/memory)
2. **Context engine:** Indexes function signatures and module boundaries instead of dumping raw files (~90% context size reduction)
3. **Selective context:** Pulls only what the agent needs
### Stability
- **Assessment:** Excellent stability with Opus 4.6 through ForgeCode
- **No tool call failures reported** (unlike GPT 5.4 experience)
- Consistent performance across different task types
---
## What Worked Well
1. **Multi-file refactoring:** Handles complex changes across file boundaries efficiently
2. **Code comprehension:** Strong understanding of Astro/React components
3. **Speed on complex tasks:** Consistently 3x faster than Claude Code on identical tasks
4. **Planning with muse:** Plan output felt "more detailed and verbose than Claude Code's plan mode"
---
## Issues Encountered
1. **Ecosystem gaps:** No IDE extensions, no hooks, no checkpoints/rewind
2. **No auto-memory:** Context doesn't persist between sessions
3. **No built-in sandbox:** Requires manual `--sandbox` flag for isolation
---
## User Workflow Integration
**Current User Pattern (Liran Baba):**
> "I double-dip. Claude Code for my primary workflow (ecosystem, features), ForgeCode when I care about latency."
**Use Cases:**
- Speed-critical tasks: ForgeCode + Opus 4.6
- Complex refactoring: ForgeCode for faster iteration
- Team collaboration: Claude Code (shared CLAUDE.md, checkpoints)
---
## Source References
1. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
2. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
3. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
@@ -0,0 +1,146 @@
# ForgeCode vs Competitors - Feature & Ecosystem Comparison
**Topic:** Feature gaps, ecosystem comparison, workflow integration
**Source References:** DEV Community, ForgeCode docs, Reddit
**Date Compiled:** April 9, 2026
---
## Feature Matrix
| Feature | ForgeCode | Claude Code | Cursor |
|---------|-----------|-------------|--------|
| **Model Choice** | Any (300+) | Claude only | Multiple |
| **License** | Open source (Apache 2.0) | Proprietary | Proprietary |
| **Language** | Rust | TypeScript | TypeScript |
| **Project Config** | `AGENTS.md` | `CLAUDE.md` (hierarchical) | `.cursorrules` |
| **MCP Support** | Yes | Yes (extensive) | Yes |
| **Hooks** | **No** | Yes (6 types) | Limited |
| **Scheduled Tasks** | **No** | Yes (cloud + local) | No |
| **Sub-agents** | Yes (forge/sage/muse) | Yes (parallel) | Limited |
| **IDE Extensions** | **None** | VS Code, JetBrains | VS Code only |
| **Auto-Memory** | **No** | Yes | Yes |
| **Checkpoints/Rewind** | **No** | Yes | Yes |
| **Sandbox Mode** | `--sandbox` flag | Built-in | Built-in |
| **Plan Mode** | Yes (muse writes to `plans/`) | Yes (Shift+Tab) | Composer |
| **Pricing** | $0-$100/mo + API | $20/mo subscription | $20/mo subscription |
---
## The "Lambo with No Cup Holder" Problem
> "ForgeCode is a Lambo with no cup holder. Fast as hell, but you're holding your coffee between your knees."
**Meaning:** Extremely fast but missing quality-of-life features.
---
## Major Feature Gaps
### 1. No IDE Extensions
**Impact:** Must use terminal exclusively; no GUI integration
**Workaround:** Use alongside IDE manually
### 2. No Auto-Memory
**Impact:** Context doesn't persist between sessions
**Claude Code Comparison:** Remembers project context across sessions
### 3. No Checkpoints/Rewind
**Impact:** Cannot rollback changes without git
**Claude Code Comparison:** Every edit snapshotted; `/rewind` available
### 4. No Hooks
**Impact:** Cannot trigger scripts on file changes
**Claude Code Comparison:** 6 hook types (pre-command, post-command, etc.)
### 5. No Scheduled Tasks
**Impact:** Cannot schedule recurring agent runs
**Claude Code Comparison:** Both cloud and local scheduled tasks
---
## ForgeCode Strengths
### 1. Speed
- Rust binary vs TypeScript runtime
- Context indexing reduces token usage ~90%
- Real-world: 3x faster on identical tasks
### 2. Multi-Model Support
- 300+ models via OpenRouter
- Not locked to single provider
- Can optimize cost/performance per task
### 3. Multi-Agent Architecture
- `forge`: Implementation
- `sage`: Read-only research
- `muse`: Planning (writes to `plans/`)
- More detailed plan output than competitors
### 4. Open Source
- Apache 2.0 license
- Auditable code
- Community contributions
### 5. Terminal-Native
- Zsh plugin integration
- `:` sentinel for quick access
- No context switching
---
## Workflow Integration Patterns
### Pattern 1: ForgeCode for Speed
**Use Case:** Latency-sensitive tasks, quick fixes
**Workflow:** Use ForgeCode for implementation, IDE for review
### Pattern 2: Double-Dipping
**User Quote:** "I double-dip. Claude Code for my primary workflow (ecosystem, features), ForgeCode when I care about latency."
### Pattern 3: Team Configuration
**Challenge:** No shared project instructions (CLAUDE.md is Claude-specific)
**Partial Solution:** AGENTS.md for ForgeCode, but not widely adopted
---
## AGENTS.md vs CLAUDE.md
### AGENTS.md (ForgeCode)
- Project-specific instructions
- Less widely documented
- Single file (no hierarchy)
### CLAUDE.md (Claude Code)
- Hierarchical (project → parent dirs → home)
- More mature documentation
- Shared across team if committed
---
## Recommendations by Use Case
### Solo Developer, Speed Priority
**Choice:** ForgeCode + Opus 4.6
**Reason:** Fastest iteration, cost-effective with careful model selection
### Team Environment
**Choice:** Claude Code
**Reason:** Shared CLAUDE.md, checkpoints, auto-memory for team continuity
### IDE-First Developer
**Choice:** Cursor
**Reason:** Native IDE integration, GUI features
### Terminal-First, Privacy-Focused
**Choice:** ForgeCode (with FORGE_TRACKER=false)
**Reason:** Local execution, open source, no IDE lock-in
---
## Source References
1. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
2. **ForgeCode Docs:** https://forgecode.dev/docs/operating-agents/
3. **ForgeCode ZSH Docs:** https://forgecode.dev/docs/zsh-support/
4. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
@@ -0,0 +1,45 @@
# Gemini 3.1 Pro with ForgeCode - Feedback Report
**Model:** Gemini 3.1 Pro Preview
**Provider:** Google
**Harness:** ForgeCode
**Source References:** ForgeCode Blog
**Date Compiled:** April 9, 2026
---
## Benchmark Performance
### TermBench 2.0
- **ForgeCode Score:** 78.4% (SOTA at time of testing)
- **Google's Reported Score:** 68.5% on same model
- **Gap:** ~10 percentage points in favor of the ForgeCode harness
> "The delta is not a better model. It is better harness. Same weights, 10 percentage points higher."
---
## Key Technical Insights
### What Made the Difference
ForgeCode's blog describes seven failure modes and fixes that enabled this performance:
1. **Non-Interactive Mode:** Required for benchmark success (no user to answer clarifying questions)
2. **Tool Description Optimization:** Micro-evals isolating tool misuse categories
3. **Tool Naming:** Using `old_string`/`new_string` argument names from training data
4. **Entry-Point Discovery:** Lightweight semantic pass before exploration
5. **Time Limit Management:** Subagent parallelization + progressive thinking policy
6. **Planning Enforcement:** Mandatory `todo_write` tool usage (38% → 66% pass rate)
7. **Speed Architecture:** Low-complexity work delegated to subagents with minimal thinking budget
### Progressive Thinking Policy
- Messages 1-10: Very high thinking (plan formation)
- Messages 11+: Low thinking default (execution)
- Verification calls: Switch back to high thinking
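A minimal sketch of the policy, assuming a hypothetical harness hook that picks a thinking budget per request (the level names are illustrative):

```bash
thinking_level() {
  local msg_index=$1 is_verification=$2
  if [ "$is_verification" = "yes" ]; then
    echo "high"        # verification calls switch back to high thinking
  elif [ "$msg_index" -le 10 ]; then
    echo "very_high"   # messages 1-10: plan formation
  else
    echo "low"         # messages 11+: execution default
  fi
}
```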
---
## Source References
1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
@@ -0,0 +1,77 @@
# GPT 5.4 with ForgeCode - Feedback Report
**Model:** GPT 5.4
**Provider:** OpenAI
**Harness:** ForgeCode
**Source References:** DEV Community (Liran Baba), ForgeCode Blog
**Date Compiled:** April 9, 2026
---
## Benchmark Performance
### TermBench 2.0 (Self-Reported via ForgeCode)
- **Score:** 81.8% (tied for #1 with Opus 4.6)
- **Note:** Achieved through extensive harness optimizations, not raw model capability
---
## Real-World Performance Feedback
### Stability Issues
- **Assessment:** "Borderline unusable" for some tasks
- **Specific Issue:** 15-minute research task on small repo
- Tool calls repeatedly failing
- Agent stuck in retry loops
- Required manual kill
> "I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it."
### Tool Calling Reliability
- **Problem:** Persistent tool-call errors with GPT 5.4
- **ForgeCode Fixes Applied:**
1. Reordered JSON schema fields (`required` before `properties`)
2. Flattened nested schemas
3. Added explicit truncation reminders for partial file reads
- **Result:** Benchmark scores improved, though the community characterized such targeted fixes as "benchmaxxed" and questioned how far they generalize
---
## Harness Optimizations for GPT 5.4
From ForgeCode's "Benchmarks Don't Matter" blog series:
1. **Non-Interactive Mode:** System prompt rewritten to prohibit conversational branching
2. **Tool Naming:** Renaming edit tool arguments to `old_string` and `new_string` (names appearing frequently in training data) measurably dropped tool-call error rates
3. **Progressive Thinking Policy:**
- Messages 1-10: Very high thinking (plan formation)
- Messages 11+: Low thinking default (execution phase)
- Verification skill calls: Switch back to high thinking
---
## What Didn't Work Well
1. **Research tasks:** Tool calling failures causing infinite loops
2. **Long-running tasks:** 15+ minute tasks became unstable
3. **Consistency:** Unpredictable failures requiring manual intervention
---
## Comparison with Opus 4.6
| Aspect | GPT 5.4 | Opus 4.6 |
|--------|---------|----------|
| TermBench 2.0 | 81.8% | 81.8% |
| Real-world stability | Poor | Excellent |
| Tool calling reliability | Problematic | Reliable |
| Research tasks | Unusable | Good |
**Key Takeaway:** Benchmark scores don't reflect real-world usability. Same harness, dramatically different experiences.
---
## Source References
1. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
2. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
@@ -0,0 +1,114 @@
# ForgeCode Pricing & Cost Feedback Report
**Topic:** Pricing tiers, cost concerns, value proposition
**Source References:** ForgeCode Blog, Reddit r/cursor, GitHub issues
**Date Compiled:** April 9, 2026
---
## Pricing Structure (As of July 27, 2025)
### Free Tier
- **Cost:** $0 (permanent, not a trial)
- **Limit:** Dynamic request limit (adjusts based on server load)
- **Typical Range:** 10-50 requests/day
- **Purpose:** Full feature exploration without time pressure
### Pro Plan
- **Cost:** $20/month
- **Limit:** Up to 1,000 AI requests/day
- **Target:** Regular users scaling up from free tier
### Max Plan
- **Cost:** $100/month
- **Limit:** Up to 5,000 AI requests/day
- **Target:** Power users
---
## Early Access Insights
### Usage Patterns Discovered
- **Top 1% of users:** Made thousands of AI requests daily
- **Power user costs:** Over $500/day in AI inference costs during heavy usage
- **Growth:** 17x surge in signups, 10x spike in usage during early access
---
## Community Feedback
### Reddit r/cursor
**Mixed reactions to pricing:**
> "ForgeCode is VERY good. I tested it by resolving failed CI tests using Python and Go code, and it proved efficient and persistent."
> "What a sad news but it really good to solve my real problem with around 10 requests. If it refresh 1000 token daily I think it still OK unless u are building a quantum codebase"
**Comparison to alternatives:**
> "Cursor has models to efficiently index your codebase, while forgecode doesn't, so consider it to be worse than both. However, this looks like a good deal to me bc of the pricing."
---
## Cost Considerations
### Token Usage Concerns
From DEV Community analysis:
> "Nobody's published hard numbers. ForgeCode's multi-agent setup (forge/sage/muse spawning sub-agents) almost certainly burns more tokens per session. I noticed it anecdotally but didn't measure."
### API Key Requirements
- ForgeCode requires **own API keys** (not included in subscription)
- Separate billing from Claude Pro/ChatGPT Plus
- Can become expensive with heavy usage of premium models (Opus 4.6: $15/$75 per million tokens)
### Daily Limit Issues
**GitHub Issue #1296:**
- Problem: Reaching daily FORGE limit stops task mid-execution
- Context built up is lost or must wait for reset
- User requested ability to switch providers when limit reached
---
## Value Proposition Analysis
### For Light Users (Free Tier)
- **Pros:** 10-50 requests may be sufficient for small projects
- **Cons:** Dynamic limits unpredictable; may hit cap during intensive sessions
### For Regular Users (Pro - $20/month)
- **Pros:** 1,000 requests/day is generous for most workflows
- **Cons:** Must also pay for API usage separately
### For Power Users (Max - $100/month)
- **Pros:** 5,000 requests/day accommodates heavy usage
- **Cons:** Expensive when combined with API costs; $100 + $500/day inference = $15,100/month potential
---
## Cost Optimization Tips
1. **Use context efficiently:** ForgeCode's context indexing reduces token usage ~90%
2. **Choose models carefully:** Opus 4.6 is expensive ($15/$75); consider Sonnet for routine tasks
3. **Monitor sub-agent spawning:** Multi-agent workflows consume more tokens
4. **Set FORGE_TRACKER=false:** Reduces overhead (minor but measurable)
---
## Comparison with Alternatives
| Tool | Pricing Model | Notes |
|------|---------------|-------|
| **ForgeCode** | $0-$100/month + API costs | Pay for harness + pay for inference |
| **Claude Code** | $20/month subscription | Includes model access |
| **Cursor** | $20/month subscription | Includes model access |
| **Aider** | Free (open source) | Bring your own API keys |
**Key Difference:** ForgeCode is the only one with dual payment (harness subscription + API costs).
---
## Source References
1. **ForgeCode Blog:** https://forgecode.dev/blog/graduating-from-early-access-new-pricing-tiers-available/
2. **Reddit r/cursor:** https://www.reddit.com/r/cursor/comments/1maq1ex/forgecode_is_no_longer_free_and_unlimited_but/
3. **GitHub Issue #1296:** https://github.com/antinomyhq/forgecode/issues/1296
4. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
@@ -0,0 +1,97 @@
# ForgeCode Privacy & Security Concerns - Feedback Report
**Topic:** Data collection, telemetry, privacy
**Source References:** GitHub Issue #1318, Discussion #2545, DEV Community, Reddit
**Date Compiled:** April 9, 2026
---
## Overview
Despite ForgeCode's claim that "Your code never leaves your computer," there are significant community concerns about telemetry and data collection practices.
---
## Documented Privacy Issues
### GitHub Issue #1318
**Status:** Referenced as "red flag" by community members
**Reported Concerns:**
- Default telemetry collects:
- Git user emails
- SSH directory scans
- Conversation data sent externally
### GitHub Discussion #2545
**Title:** "Clarity about data collected that involves code"
**Key Points:**
- Privacy policy mentions collecting commands
- Data can be stored and transferred in many ways
- ForgeCode Services (optional) may process data differently than local CLI mode
**Distinction:**
- **Local CLI mode:** Claims to run entirely on local machine
- **ForgeCode Services:** Optional features that provide additional capabilities, may process data externally
---
## Mitigation
### Disable Tracking
```bash
export FORGE_TRACKER=false  # Disables all tracking for this shell and its children
```
### ForgeCode Services Clarification
From Discussion #2545:
> "ForgeCode Services are optional features that provide additional capabilities beyond the purely local CLI experience. If a user chooses to enable those services, some data relevant to those features may be processed by the service."
---
## Community Sentiment
### Reddit r/ClaudeCode
> "Specifically for Forgecodedev, I haven't used it yet since they are not transparent about user data which is a red flag to me."
### DEV Community (Liran Baba)
- Mentions telemetry concerns in comparison article
- Notes the FORGE_TRACKER=false mitigation
---
## Benchmark Controversy Connection
Some users connect privacy concerns to benchmark results:
> "I am concerned about their proprietary layer, which I believe is a big part of what moved their bench scores from ~25% to ~81%. Currently it is free to use but may change in the future."
**Note:** ForgeCode Services (proprietary layer) was used for benchmark evaluations, which differs from purely local CLI mode.
---
## Transparency Issues
1. **Telemetry defaults:** Enabled by default, must explicitly disable
2. **Data scope:** SSH directory scanning not clearly documented upfront
3. **ForgeCode Services:** Connection between services and benchmark results not immediately obvious
4. **Proprietary layer:** Some components not open source
---
## Recommendations for Privacy-Conscious Users
1. **Set FORGE_TRACKER=false** before using
2. **Avoid ForgeCode Services** if local-only operation is required
3. **Audit code:** Harness is open source (Apache 2.0), can be inspected
4. **Use own API keys:** Don't rely on any bundled/free tier that might require data sharing
---
## Source References
1. **GitHub Discussion:** https://github.com/antinomyhq/forgecode/discussions/2545
2. **GitHub Issue #1318:** Referenced in multiple community discussions
3. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
4. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
@@ -0,0 +1,204 @@
# ForgeCode Best Practices - Summary
**Compiled from:** Community feedback, GitHub issues, blog posts, documentation
**Date Compiled:** April 9, 2026
---
## Quick Start Best Practices
### 1. Disable Telemetry
```bash
export FORGE_TRACKER=false
```
Add to `~/.zshrc` for persistence.
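For example:

```bash
echo 'export FORGE_TRACKER=false' >> ~/.zshrc  # persists across new shells
```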
### 2. Configure API Keys Properly
```bash
forge provider login # Set up providers
```
Consider API key helpers (requested in #2888) for security.
### 3. Verify ZSH Integration
```bash
forge zsh doctor # Check for issues
forge zsh setup # Re-run if needed
```
---
## Model Selection Best Practices
### For Speed
- **Opus 4.6** through ForgeCode: Fastest real-world performance
- **Avoid GPT 5.4** through ForgeCode: Unstable tool calling
### For Cost
- **MiniMax M2.1:** Near-SOTA performance at $0.30/$1.20 per million tokens
- **LongCat-Flash-Lite:** Budget option at $0.10/$0.40
### For Reliability
- **Claude Sonnet 4.5:** Best independent benchmark scores
- **Avoid:** Models with known tool calling issues (Qwen 3.5 with current bug)
---
## Agent Usage Best Practices
### Workflow Pattern
1. **Start with `muse`** for planning complex changes
2. **Switch to `forge`** for implementation
3. **Use `sage`** (automatically) for research
### Command Reference
```bash
:muse # Planning mode
:forge # Implementation mode
:agent # View all agents
:new # Fresh conversation
:compact # Free up token budget
```
---
## Context Management
### Strengths
- **~90% context reduction** vs full-file inclusion
- Function signature indexing
- Selective context pulling
### Limitations
- **No auto-compaction** (unlike Claude Code)
- **No checkpoints/rewind**
- Manual `:compact` required when context full
### Tips
- Use `@filename` for file tagging
- Run `:compact` before long tasks
- Start with `:new` for unrelated tasks
---
## Tool Calling Best Practices
### For Harness Developers
1. Use `old_string`/`new_string` argument names
2. Put `required` before `properties` in JSON schema
3. Flatten nested schemas
4. Add explicit truncation reminders
### For Users
1. **Verify tool calls** - don't blindly accept
2. **Check file paths** - AI can hallucinate paths
3. **Review diffs** - especially for large changes
---
## Pricing Optimization
### Cost Control
1. **Use Sonnet** for routine tasks (cheaper than Opus)
2. **Limit sub-agent spawning** - burns tokens
3. **Use context efficiently** - ForgeCode's indexing helps
4. **Monitor daily limits** - Free tier allows roughly 10-50 requests/day
### Plan Selection
- **Free:** Testing, small projects
- **Pro ($20):** Regular use (<1,000 requests/day)
- **Max ($100):** Power users (1,000-5,000 requests/day)
---
## Project Configuration
### AGENTS.md
Create at project root or `~/forge/AGENTS.md`:
```markdown
# Development Guidelines
## Runtime
- NEVER restart the dev server (runs on port 3000)
- Use npm exclusively (not yarn/pnpm)
## Code Style
- TypeScript strict mode
- Functional programming preferred
```
### Tips
- Be specific and actionable
- Include negative constraints ("NEVER...")
- Reference existing code patterns
---
## Common Pitfalls
### 1. Expecting Claude Code Features
- **Missing:** Checkpoints, auto-memory, IDE extensions
- **Workaround:** Use git commits frequently
### 2. Ignoring Daily Limits
- **Problem:** Task stops mid-execution when limit reached
- **Solution:** Monitor usage, upgrade plan, or switch providers
### 3. Using GPT 5.4 for Research
- **Problem:** Tool calling failures, infinite loops
- **Solution:** Use Opus 4.6 or Sonnet instead
### 4. Privacy Concerns
- **Problem:** Telemetry collects SSH/git data by default
- **Solution:** Set FORGE_TRACKER=false
---
## When to Use ForgeCode vs Alternatives
### Use ForgeCode When:
- Terminal-first workflow
- Speed is priority
- Multi-model flexibility needed
- Open source/auditable code required
- Privacy control essential (with telemetry disabled)
### Use Claude Code When:
- Team collaboration (shared CLAUDE.md)
- Need checkpoints/rewind
- Want auto-memory across sessions
- IDE extensions needed
- Prefer subscription pricing (no separate API costs)
### Use Cursor When:
- IDE-native experience preferred
- GUI features important
- Team using VS Code exclusively
---
## Debugging Tips
### Tool Call Failures
1. Check model compatibility (avoid Qwen 3.5 currently)
2. Verify JSON schema format
3. Try `:retry` to resend
### Performance Issues
1. Use `:compact` to free context
2. Switch to faster model (Sonnet vs Opus)
3. Close unnecessary files with `@[filename]`
### Integration Issues
1. Run `forge zsh doctor`
2. Verify Nerd Font installed
3. Check terminal compatibility (Ghostty has resize bug)
---
## Source References
1. **ForgeCode Docs:** https://forgecode.dev/docs/
2. **ZSH Support:** https://forgecode.dev/docs/zsh-support/
3. **Operating Agents:** https://forgecode.dev/docs/operating-agents/
4. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
5. **GitHub Issues:** https://github.com/antinomyhq/forgecode/issues
@@ -0,0 +1,101 @@
# Local/Small Models with ForgeCode - General Feedback
**Scope:** Local LLMs via Ollama, llama.cpp, LM Studio, etc.
**Harness:** ForgeCode
**Source References:** Reddit r/LocalLLaMA, r/LocalLLM, GitHub issues
**Date Compiled:** April 9, 2026
---
## Key Challenges for Local Models
### 1. Tool Calling Format Issues
**Problem:** Many local models struggle with tool calling formats
**Evidence:**
- Gemma 4 initial releases had tool calling format issues with harnesses
- Qwen3.5 has issues with multiple system messages
- Various models require specific inference backends for reliable tool use
**Recommendation:** Use latest versions of inference backends:
- oMLX / llama.cpp (latest) for Gemma 4
- LM Studio 0.4.9+ for Qwen3.5
- Unsloth fixes for Qwen3-Coder tool calling
### 2. Context Window Configuration
**Default Issues:**
- Ollama/Qwen3 runs with 4K context window by default (too small)
- Need explicit configuration to increase context
**Fix:**
```bash
# llama.cpp: set the context size explicitly (32K tokens here)
llama-server -m model.gguf -c 32768
# Ollama: raise num_ctx in a derived Modelfile (model names are placeholders)
printf 'FROM qwen3\nPARAMETER num_ctx 32768\n' > Modelfile
ollama create qwen3-32k -f Modelfile
```
### 3. Quantization Quality
**Observation:** Default quantization often insufficient for tool use
**Fix:**
- Try higher-quality quantization (e.g., `:q8_0` for 8-bit instead of default Q4_K_M)
- Trade-off: More RAM usage but better output quality
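With Ollama this usually means pulling an explicit 8-bit tag (tag names vary by model; check the model's page on ollama.com):

```bash
ollama pull qwen2.5-coder:7b-instruct-q8_0  # 8-bit quant instead of the default Q4_K_M
```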
### 4. Model Size Recommendations
From community feedback:
- **< 7B models:** Generally insufficient for reliable agentic tool use
- **7B-14B:** Minimum viable for simple tasks
- **30B+:** Recommended for serious coding work
- **MoE models (Qwen3-Coder 480B-A35B):** Good performance but requires significant RAM
---
## Specific Model Notes
### Qwen3-Coder Next
- **Status:** "First usable coding model < 60GB" according to user reports
- **Workflow tip:** Compress context after each bug fix/feature, then reload
- **Important:** Limit context size in settings.json to prevent overflow
### Gemma 4
- **Requirement:** Latest oMLX / llama.cpp for tool calling
- **Recommendation:** 26B MoE good for limited RAM setups
### Mistral 7B
- **Alternative:** Consider when Qwen 2.5 14B uses too much RAM
- **Trade-off:** Smaller but potentially less capable
---
## Platform-Specific Notes
### Apple Silicon (M-series)
- **Observation:** "Silent, very power efficient, good speeds"
- **Limitation:** Prompt processing slower than NVIDIA GPUs
- **Alternative:** LM Studio with the MLX backend is currently preferred over Ollama by some users
### Linux
- Best support and performance for local inference
- htop recommended for monitoring RAM usage
---
## General Best Practices
1. **Close other applications** to free RAM before running local models
2. **Monitor context usage** - can exceed 100% in some UIs while still appearing to work
3. **Update regularly** - inference backends fix tool calling issues frequently
4. **Test thoroughly** - local model behavior varies significantly by quant and backend
---
## Source References
1. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1qz5uww/qwen3_coder_next_as_first_usable_coding_model_60/
2. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/
3. **Reddit r/LocalLLM:** https://www.reddit.com/r/LocalLLM/comments/1sf5aqy/how_are_people_using_local_llms_for_coding/
4. **llama.cpp Discussion:** https://github.com/ggml-org/llama.cpp/discussions/4167
@@ -0,0 +1,133 @@
# GitHub Issues Summary for ForgeCode
**Scope:** Open and recently closed issues affecting model performance
**Repository:** antinomyhq/forgecode
**Stats:** 48 open, 433 closed (as of April 9, 2026)
**Date Compiled:** April 9, 2026
---
## Critical Open Issues
### #2904: Use models.dev as LLM model registry source
- **Status:** Open (April 9, 2026)
- **Type:** Enhancement
- **Impact:** Would improve model discovery and configuration
### #2894: Multiple system messages break models with strict chat templates (e.g. Qwen3.5)
- **Status:** Open (April 8, 2026)
- **Type:** Bug
- **Impact:** BREAKS local models with strict templates
- **Affected Models:** Qwen3.5, potentially others
- **Workaround:** None yet
### #2893: Terminal output disappears on window resize in Ghostty
- **Status:** Open (April 8, 2026)
- **Type:** Bug
- **Impact:** UI/usability issue
- **Linked PR:** One PR linked
### #2888: Add support for API key helpers
- **Status:** Open (April 8, 2026)
- **Type:** Feature
- **Impact:** Would improve security (helper scripts for API keys)
### #2884: Muse mode shell blocked
- **Status:** Open (April 7, 2026)
- **Type:** Bug
- **Impact:** Blocks usage of muse agent for planning
---
## Historical & Long-Standing Issues
### #2813: Fixed in response to Reddit feedback
- The maintainer referenced the fix in a Reddit thread
- **Source:** Reddit r/ClaudeCode
### #2485: Installation issues on Mac
- **Symptoms:** Oh My Zsh not found, terminal configuration issues
- **Resolution:** Install Oh My Zsh separately
### #1296: Daily FORGE limit stops tasks mid-execution
- **Problem:** Cannot switch providers when daily limit reached
- **Impact:** Context built up is lost
- **Status:** Open (feature request)
---
## Model-Specific Issues
### GPT 5.4
- **Tool calling reliability:** Improved via schema reordering
- **Status:** Workarounds implemented
### Qwen 3.5
- **Multiple system messages:** Open issue #2894
- **Tool calling format:** Use LM Studio 0.4.9+ for better compatibility
### Gemma 4
- **Tool calling:** Requires latest llama.cpp/oMLX
- **Status:** Resolved with backend updates
---
## Privacy/Security Issues
### #1318: Telemetry concerns
- **Collection:** Git emails, SSH directory scans, conversation data
- **Mitigation:** `FORGE_TRACKER=false`
- **Status:** Documented mitigation available
### #1317: Related privacy concerns
- **Linked to:** Discussion #2545
---
## ZSH/Terminal Issues
### Shell Integration
- **Issue:** ZSH aliases don't work in interactive mode (by design)
- **Solution:** Use `:` sentinel from native ZSH session
### Oh My Zsh
- **Requirement:** Not strictly required, but recommended
- **Behavior:** The install script warns if it is not present
### Ghostty Terminal
- **Issue:** #2893 - Output disappears on resize
- **Status:** Under investigation
---
## Installation Issues
### macOS
- **Common:** iTerm + Oh My Zsh configuration issues
- **Fix:** Run `forge zsh doctor` and `forge zsh setup`
### Windows
- **Support:** Via WSL or Git Bash only
- **Native:** Not officially supported
### Linux
- **Best supported platform**
- **Android:** Also supported
---
## Issue Resolution Tips
From documentation:
```bash
forge zsh doctor # Check environment
forge zsh setup # Re-run ZSH integration
```
---
## Source References
1. **GitHub Issues:** https://github.com/antinomyhq/forgecode/issues
2. **GitHub Discussions:** https://github.com/antinomyhq/forgecode/discussions
3. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
@@ -0,0 +1,203 @@
# Installation & Platform Issues - Feedback Report
**Topic:** Setup problems, platform compatibility, requirements
**Source References:** GitHub issues, ForgeCode docs, Reddit
**Date Compiled:** April 9, 2026
---
## Supported Platforms
### Officially Supported
- **macOS:** Full support
- **Linux:** Best support
- **Android:** Supported
- **Windows:** Via WSL or Git Bash only
### Not Supported
- **Native Windows:** Not officially supported
---
## Installation Methods
### Method 1: YOLO Install (Recommended)
```bash
curl -fsSL https://forgecode.dev/cli | sh
```
### Method 2: Nix
```bash
nix run github:antinomyhq/forge
```
### Method 3: NPM
```bash
npx forgecode@latest
```
---
## Common Installation Issues
### Issue #2485: Mac Installation Problems
**Symptoms:**
- Oh My Zsh not found
- Terminal configuration issues
- Shell environment problems
**Environment Reported:**
- Shell: zsh 5.9
- Terminal: iTerm.app 3.6.8
- Oh My Zsh: Not installed
**Solution:**
```bash
# Install Oh My Zsh first
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
# Then re-run forge setup
forge zsh setup
```
### Terminal Requirements
#### Required: Nerd Font
- **Purpose:** Icon display
- **Recommended:** FiraCode Nerd Font
- **Verification:** Icons should display without overlap during setup
#### Recommended Terminals
- iTerm2 (macOS)
- Ghostty (macOS) - NOTE: Has resize bug (#2893)
- Any modern Linux terminal
---
## ZSH Integration Issues
### Interactive Mode Isolation
**Design:** ForgeCode's interactive mode runs in isolated environment
**Impact:**
- ZSH aliases don't work inside interactive mode
- Custom functions unavailable
- Shell tooling not accessible
**Solution:** Use `:` sentinel from native ZSH session instead
### Tab Completion
**Requirements:**
- `fd` (file finder)
- `fzf` (fuzzy finder)
**Usage:**
```bash
:<TAB> # Open command list
@file<TAB> # Fuzzy file picker
```
**Fallback:** Use full path with brackets: `@[src/components/Header.tsx]`
---
## Platform-Specific Notes
### macOS
**Best Practices:**
- Use iTerm2 or Ghostty
- Install Oh My Zsh for best experience
- Enable Nerd Font in terminal preferences
**Troubleshooting:**
```bash
forge zsh doctor # Check setup
forge zsh setup # Reconfigure
```
### Linux
**Advantages:**
- Best performance for local models
- Native ZSH support
- Package manager availability
**Tips:**
- Use system package manager when available
- Check `htop` for resource monitoring
### Windows
**Limitations:**
- No native support
- Must use WSL or Git Bash
**WSL Recommendation:**
- Ubuntu 22.04+ recommended
- Install ZSH within WSL
- Windows Terminal for best experience
### Android
**Status:** Supported but limited documentation
**Use Case:** Primarily for remote development scenarios
---
## Verification Steps
### Post-Installation Checklist
1. **Run doctor:**
```bash
forge zsh doctor
```
2. **Verify icons:**
- Should display without overlap
- Check during interactive setup
3. **Test basic commands:**
```bash
: hi
:new
:agent
```
4. **Configure provider:**
```bash
forge provider login
```
---
## Open Issues
### #2893: Ghostty Terminal Resize Bug
- **Problem:** Terminal output disappears on window resize
- **Status:** Open, 1 linked PR
- **Workaround:** Avoid resizing or use different terminal
### #2884: Muse Mode Shell Blocked
- **Problem:** Cannot use muse agent
- **Status:** Open
- **Impact:** Planning workflow blocked
---
## Resource Requirements
### Minimum
- **RAM:** 4GB (for cloud models)
- **Disk:** 500MB
- **Shell:** ZSH 5.0+
### For Local Models
- **RAM:** 16GB+ recommended
- **GPU:** Optional but recommended for larger models
- **Storage:** 10GB+ for model downloads
---
## Source References
1. **GitHub Issue #2485:** https://github.com/antinomyhq/forgecode/issues/2485
2. **GitHub Issue #2893:** https://github.com/antinomyhq/forgecode/issues/2893
3. **ForgeCode Docs:** https://forgecode.dev/docs/installation/
4. **ZSH Support:** https://forgecode.dev/docs/zsh-support/
@@ -0,0 +1,99 @@
# MiniMax, GLM, DeepSeek with ForgeCode - Feedback Report
**Models:** MiniMax M2/M2.1, GLM-4.6, DeepSeek-V3
**Source References:** llm-stats.com, ForgeCode Blog
**Date Compiled:** April 9, 2026
---
## MiniMax M2.1
### Performance
- **Terminal-Bench Score:** 47.9% (Rank #2 on independent leaderboard)
- **Parameters:** 230B
- **Context:** 1.0M tokens
- **Cost:** $0.30 / $1.20 per million tokens
### Value Proposition
- **Best cost-performance ratio** among top performers
- Near-SOTA performance at entry-level pricing
- Massive 1.0M context window
### ForgeCode Usage
- Well-supported via OpenRouter
- Good tool calling reliability
- Recommended for budget-conscious users
---
## GLM-4.6 (Zhipu AI)
### Performance
- **Terminal-Bench Score:** 40.5% (Rank #7)
- **Parameters:** 357B
- **Context:** 131K tokens
- **Cost:** $0.55 / $2.19 per million tokens
### Characteristics
- Open weights
- Competitive with proprietary models at similar price point
- Good context length (131K)
---
## DeepSeek Models
### DeepSeek-V3.2-Exp
- **Terminal-Bench Score:** 37.7% (Rank #10)
- **Status:** Experimental
- **Note:** Results from llm-stats.com
### DeepSeek-V3.1
- **Terminal-Bench Score:** 31.3% (Rank #16)
- **Parameters:** 671B
- **Observation:** Large parameter count doesn't translate to top-tier performance
### DeepSeek-R1-0528
- **Terminal-Bench Score:** 5.7% (Rank #23 - lowest)
- **Parameters:** 671B
- **Note:** Reasoning model may not be optimized for terminal tasks
---
## Key Insights
### Scale ≠ Performance
- Kimi K2-Thinking (1.0T parameters) trails the 230B MiniMax M2.1 (47.1% vs 47.9%)
- DeepSeek-V3.1 (671B) scores lower than MiniMax M2.1 (230B)
- **Quality of architecture > raw parameter count**
### Cost-Performance Leaders
1. **MiniMax M2.1:** 47.9% at $0.30/$1.20
2. **LongCat-Flash-Lite:** Budget option at $0.10/$0.40 (score not in top 10)
### Context Window Comparison
| Model | Context | Rank |
|-------|---------|------|
| MiniMax M2.1 | 1.0M | #2 |
| Claude Opus 4.1 | 200K | #5 |
| GLM-4.6 | 131K | #7 |
---
## Recommendations
### For Budget + Performance
**MiniMax M2.1** - Best value proposition
### For Open Weights
**GLM-4.6** or **MiniMax M2** - Both open, strong performance
### For Research
Avoid DeepSeek-R1 for terminal tasks; use V3 variants instead
---
## Source References
1. **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
2. **ForgeCode Blog:** Model comparison series
@@ -0,0 +1,55 @@
# Qwen 3.5 with ForgeCode - Feedback Report
**Model:** Qwen 3.5
**Provider:** Alibaba Cloud (via local inference)
**Harness:** ForgeCode
**Source References:** GitHub Issue #2894, Reddit r/LocalLLaMA
**Date Compiled:** April 9, 2026
---
## Known Issues
### Multiple System Messages Bug
**GitHub Issue:** #2894 (Open as of April 8, 2026)
**Problem:** Multiple system messages break models with strict chat templates (e.g., Qwen3.5)
**Error Manifestation:**
- Models with strict chat templates fail to parse message structure correctly
- Tool calling may fail or produce incorrect results
- Agent behavior becomes unpredictable
**Impact:**
- Affects local inference with llama.cpp, Ollama, and similar servers
- Qwen3.5 specifically mentioned as affected
**Workaround Status:** No official fix yet; issue under investigation
---
## Tool Calling with Qwen Models
### General Observations from Community
1. **Qwen3-Coder Next** shows promise as "first usable coding model < 60GB"
2. **Tool calling reliability varies** by inference backend:
- LM Studio 0.4.9 reportedly handles Qwen3.5 XML tool parsing more reliably than raw llama.cpp
- llama.cpp with `--jinja` flag helps with tool calling
3. The **`finish_reason` issue** is annoying to debug, according to community reports
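A typical llama.cpp invocation with the Jinja template path enabled (model file, context size, and port are placeholders):

```bash
llama-server -m qwen3.5.gguf --jinja -c 32768 --port 8080
```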
---
## Recommendations for Local Use
1. **Use LM Studio** for more reliable tool parsing vs raw llama.cpp
2. **Monitor system message count** - known issue with ForgeCode's multi-message approach
3. **Test thoroughly** before relying on Qwen 3.5 for production tasks via ForgeCode
---
## Source References
1. **GitHub Issue:** https://github.com/antinomyhq/forgecode/issues/2894
2. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/
@@ -0,0 +1,111 @@
# Tool Calling Reliability with ForgeCode - Feedback Report
**Topic:** Tool use reliability, function calling, common errors
**Source References:** ForgeCode Blog, GitHub issues, Reddit
**Date Compiled:** April 9, 2026
---
## Overview
Tool calling reliability is a major differentiator for ForgeCode. The harness has implemented numerous optimizations to improve tool use across models.
---
## The Seven Failure Modes (From ForgeCode Blog)
### 1. Same Model, Very Different Performance
**Problem:** Interactive-first design fails in benchmarks (no user to answer questions)
**Fix:** Non-Interactive Mode with rewritten system prompts
### 2. Tool Descriptions Don't Guarantee Correctness
**Problem Categories:**
- Wrong tool selected (e.g., `shell` instead of structured `edit`)
- Correct tool, wrong argument names
- Correct tool, correct arguments, wrong sequencing
**Fix:** Targeted micro-evals isolating each class per tool, per model
### 3. Tool Naming is a Reliability Variable
**Key Finding:** Models pattern-match against training data first
**Concrete Example:**
- Renaming edit tool arguments to `old_string` and `new_string`
- Result: "measurably dropped tool-call error rates immediately—same model, same prompt"
> "If you are seeing persistent tool-call errors and attribute them entirely to model capability, check your naming first."
### 4. Context Size is a Multiplier, Not a Substitute
**Problem:** More context only helps after finding the right entry point
**Insight:** Entry-point discovery latency is the bottleneck
### 5. Time Limits Punish Trajectories
**Problem:** Failed tool calls burn real seconds; brilliant but meandering trajectories time out
**Fix:** Speed architecture with parallel subagents
### 6. Planning Tools Only Work if Enforced
**Problem:** Optional `todo_write` tool ignored under pressure
**Fix:** Made mandatory via low-level evals
**Result:** 38% → 66% pass rate
### 7. TermBench is More About Speed Than Intelligence
**Fix:** Progressive thinking policy (high thinking early, low during execution)
---
## Model-Specific Tool Calling Issues
### GPT 5.4
- **Issue:** Persistent tool-call errors
- **Fixes Applied:**
- Reordered JSON schema fields (`required` before `properties`)
- Flattened nested schemas
- Explicit truncation reminders
### Qwen 3.5
- **Issue:** Multiple system messages break strict chat templates
- **Status:** Open issue (#2894)
- **Workaround:** None yet; use different model or await fix
### Gemma 4
- **Issue:** Initial releases had tool calling format issues
- **Fix:** Use latest oMLX / llama.cpp
---
## Best Practices for Tool Reliability
1. **Use established argument names:** `old_string`/`new_string` better than generic names
2. **Flatten schemas:** Reduce nesting in tool definitions
3. **Order matters:** Put `required` before `properties` in JSON schema
4. **Test with micro-evals:** Isolate specific tool+model combinations
5. **Monitor truncation:** Add explicit reminders when files partially read
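A crude micro-eval in the spirit of tip 4 can be run against any OpenAI-compatible local server; the endpoint, model name, and prompt below are placeholders, not ForgeCode internals:

```bash
# Send the same edit request 20 times and tally how often a tool call comes back.
for i in $(seq 1 20); do
  curl -s http://localhost:8080/v1/chat/completions \
    -H 'Content-Type: application/json' \
    -d @- <<'EOF' | grep -c '"tool_calls"'
{
  "model": "local",
  "messages": [{"role": "user", "content": "Replace foo with bar in main.go"}],
  "tools": [{
    "type": "function",
    "function": {
      "name": "edit",
      "parameters": {
        "type": "object",
        "required": ["path", "old_string", "new_string"],
        "properties": {
          "path": {"type": "string"},
          "old_string": {"type": "string"},
          "new_string": {"type": "string"}
        }
      }
    }
  }]
}
EOF
done | sort | uniq -c  # e.g. "18 1" and "2 0" = two failures out of twenty
```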
---
## ForgeCode Services Enhancements
The proprietary runtime layer includes:
1. **Semantic entry-point discovery:** Lightweight semantic pass before exploration
2. **Dynamic skill loading:** Specialized instructions loaded when needed
3. **Tool-call correction layer:** Heuristic + static analysis for argument validation
**Note:** These features are part of ForgeCode Services (optional), not the open-source CLI.
---
## Community Tips
From Reddit and GitHub discussions:
1. **LM Studio > raw llama.cpp** for Qwen3.5 XML tool parsing
2. **LM Studio 0.4.9+** handles tool calling more reliably
3. **llama.cpp `--jinja` flag** helps with Qwen tool templates
---
## Source References
1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
2. **GitHub Issue #2894:** https://github.com/antinomyhq/forgecode/issues/2894
3. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/