# ForgeCode Benchmark Controversy - Feedback Report

**Topic:** TermBench 2.0 results, self-reported vs. independent validation, "benchmaxxing"
**Source References:** Reddit r/ClaudeCode, DEV Community, llm-stats.com, arXiv
**Date Compiled:** April 9, 2026

---

## The Controversy

ForgeCode achieved **81.8% on TermBench 2.0** (with both GPT 5.4 and Opus 4.6), significantly outperforming Claude Code's 58.0% on the same Opus 4.6 model. This raised questions about:

1. Self-reported vs. independent validation
2. Benchmark-specific optimizations ("benchmaxxing")
3. Proprietary layer involvement

---

## TermBench 2.0 Results

### Self-Reported (via ForgeCode at tbench.ai)

| Configuration | Score | Rank |
|--------------|-------|------|
| ForgeCode + GPT 5.4 | 81.8% | #1 |
| ForgeCode + Opus 4.6 | 81.8% | #1 |
| Claude Code + Opus 4.6 | 58.0% | #39 |

### Independent SWE-bench (Princeton/UChicago)

| Configuration | Score |
|--------------|-------|
| ForgeCode + Claude 4 | 72.7% |
| Claude 3.7 Sonnet (extended thinking) | 70.3% |
| Claude 4.5 Opus | 76.8% |

**Gap narrows from 24 points to 2.4 points on the independent benchmark.**

---

## Community Skepticism

### Reddit r/ClaudeCode

> "Looks this agent forgecode.dev ranks better than anyone in terminalbench but anyone is talking about it. Is it fake or what is wrong with these 'artifacts' that promise save time and tokens"

> "At this point, terminalbench has received quite some attention and most benchmarks are not validated."

> "I am concerned about their proprietary layer, which I believe is a big part of what moved their bench scores from ~25% to ~81%."

### The "Benchmaxxing" Term

The community coined "benchmaxxed" to describe ForgeCode's approach:

- Real engineering improvements
- Also benchmark-specific optimizations
- Not necessarily representative of real-world performance

---

## ForgeCode's Defense

### Blog Series: "Benchmarks Don't Matter — Until They Do"

ForgeCode transparently documented their journey:

- **Baseline:** ~25% (interactive-first runtime)
- **Stabilization:** ~38% (non-interactive mode + tool naming fixes)
- **Planning control:** 66% (mandatory todo_write enforcement)
- **Speed architecture:** 78.4% (subagent parallelization + progressive thinking)
- **Final:** 81.8% (additional optimizations)

### Documented Optimizations

1. **JSON schema reordering:** `required` before `properties` for GPT 5.4
2. **Schema flattening:** Reduced nesting
3. **Truncation reminders:** Explicit notes when files are partially read
4. **Mandatory verification:** Reviewer skill checks completion
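The first two optimizations are plain dictionary transforms. A minimal sketch, assuming a hypothetical edit-tool schema (ForgeCode's actual schemas are not shown in the sources); it relies on Python dicts preserving insertion order when serialized:

```python
def reorder_schema(schema: dict) -> dict:
    """Emit a JSON-schema dict with `required` ahead of `properties`,
    so the model sees the mandatory arguments first."""
    head = {k: schema[k] for k in ("type", "required") if k in schema}
    tail = {k: v for k, v in schema.items() if k not in head}
    return {**head, **tail}

# Hypothetical edit-tool schema, in the order many generators emit it.
schema = {
    "type": "object",
    "properties": {
        "old_string": {"type": "string"},
        "new_string": {"type": "string"},
    },
    "required": ["old_string", "new_string"],
}

reordered = reorder_schema(schema)
print(list(reordered))  # ['type', 'required', 'properties']
```

Serializing `reordered` with `json.dumps` then yields the key order the optimization describes.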
---

## The Proprietary Layer Question

**ForgeCode Services** (optional, free during evaluation) includes:

1. Semantic entry-point discovery
2. Dynamic skill loading
3. Tool-call correction layer

**Concern:** These services were used for benchmark evaluations but differ from the open-source CLI mode.

**Clarification from Discussion #2545:**

> "Regarding the benchmarks: the setup we used for those evaluations includes ForgeCode Services, which provides multiple improvements over the existing CLI harness."

---

## Independent Terminal-Bench Data

From llm-stats.com (April 9, 2026):

- **23 models evaluated**
- **Average score:** 0.345 (34.5%)
- **Best score:** 0.500 (50.0%) - Claude Sonnet 4.5
- **All results self-reported** (0 verified)

**Top 3:**

1. Claude Sonnet 4.5: 50.0%
2. MiniMax M2.1: 47.9%
3. Kimi K2-Thinking: 47.1%

**Note:** ForgeCode's 81.8% is not on this independent leaderboard; it was self-reported on tbench.ai.

---

## Academic Validation

### Terminal-Bench Paper (ICLR 2026)

From arXiv:2601.11868:

> "Nonetheless, the benchmark developers have invested considerable effort in mitigating reproducibility issues through containerization, repeated runs, and reporting of confidence intervals. We found no major issues. Independent inspection of the released tasks confirmed that they are well-specified and largely free of ambiguity or underspecification."

**Key Point:** The benchmark itself is well-constructed; the question is about harness-specific optimizations.

---

## Key Takeaways

1. **Benchmarks can be gamed:** Documented optimizations show how harness engineering affects scores
2. **Independent validation matters:** The 24-point gap shrinks to 2.4 points on independent tests
3. **Proprietary layers complicate comparisons:** Services used for benchmarks differ from the open-source code
4. **Real-world != benchmark:** GPT 5.4 scored 81.8% but was "borderline unusable" in practice

---

## Recommendations for Benchmark Consumers

1. **Look for independent validation** (SWE-bench > self-reported TermBench)
2. **Test on your own tasks** - benchmarks don't capture all failure modes
3. **Consider harness transparency** - open-source vs. proprietary optimizations
4. **Beware benchmaxxing** - optimizations may not generalize

---

## Source References

1. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
2. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
3. **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
4. **arXiv Paper:** https://arxiv.org/html/2601.11868v1
5. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
6. **GitHub Discussion:** https://github.com/antinomyhq/forgecode/discussions/2545

# Claude Opus 4.6 with ForgeCode - Feedback Report

**Model:** Claude Opus 4.6
**Provider:** Anthropic
**Harness:** ForgeCode
**Source References:** DEV Community (Liran Baba), ForgeCode Blog, Reddit r/ClaudeCode
**Date Compiled:** April 9, 2026

---

## Benchmark Performance

### TermBench 2.0 (Self-Reported via ForgeCode)
- **Score:** 81.8% (tied for #1)
- **Comparison:** Claude Code + Opus 4.6 scored 58.0% (Rank #39)
- **Gap:** ~24 percentage points in favor of the ForgeCode harness

### SWE-bench Verified (Independent - Princeton/UChicago)
- **ForgeCode + Claude 4:** 72.7%
- **Claude Code + Claude 3.7 Sonnet (extended thinking):** 70.3%
- **Gap:** Only 2.4 percentage points

**Key Insight:** The benchmark gap narrows significantly under independent validation. TermBench 2.0 results are self-reported by ForgeCode itself.

---

## Real-World Performance Feedback

### Speed
- **Observation:** "Noticeably faster than Claude Code. Not marginal, real."
- **Test Case:** Adding a post counter to a blog index (Astro 6, ~30 files)
  - Claude Code: ~90 seconds
  - ForgeCode + Opus 4.6: <30 seconds
- **Consistency:** Multi-file renames, component additions, and layout restructuring were all consistently faster

### Why Faster
1. **Rust binary** vs. Claude Code's TypeScript (better startup/memory)
2. **Context engine:** Indexes function signatures and module boundaries instead of dumping raw files (~90% context size reduction)
3. **Selective context:** Pulls only what the agent needs
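The signature-indexing idea can be approximated in a few lines with the standard `ast` module: index only each function's name and arguments instead of the raw source. This is purely illustrative of the technique; ForgeCode's Rust context engine is not shown in the sources:

```python
import ast

def index_signatures(source: str) -> list[str]:
    """Return 'name(arg, ...)' for each function definition: a tiny
    stand-in for indexing signatures instead of raw file contents."""
    tree = ast.parse(source)
    sigs = []
    for node in ast.walk(tree):
        if isinstance(node, ast.FunctionDef):
            args = ", ".join(a.arg for a in node.args.args)
            sigs.append(f"{node.name}({args})")
    return sigs

# A toy module: the index is a fraction of the size of the source.
src = "def add(a, b):\n    return a + b\n\ndef main():\n    print(add(1, 2))\n"
print(index_signatures(src))  # ['add(a, b)', 'main()']
```

The agent would then request full bodies only for the functions it decides are relevant.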
### Stability
- **Assessment:** Excellent stability with Opus 4.6 through ForgeCode
- **No tool call failures reported** (unlike the GPT 5.4 experience)
- Consistent performance across different task types

---

## What Worked Well

1. **Multi-file refactoring:** Handles complex changes across file boundaries efficiently
2. **Code comprehension:** Strong understanding of Astro/React components
3. **Speed on complex tasks:** Consistently 3x faster than Claude Code on identical tasks
4. **Planning with muse:** Plan output felt "more detailed and verbose than Claude Code's plan mode"

---

## Issues Encountered

1. **Ecosystem gaps:** No IDE extensions, no hooks, no checkpoints/rewind
2. **No auto-memory:** Context doesn't persist between sessions
3. **No built-in sandbox:** Requires the manual `--sandbox` flag for isolation

---

## User Workflow Integration

**Current User Pattern (Liran Baba):**

> "I double-dip. Claude Code for my primary workflow (ecosystem, features), ForgeCode when I care about latency."

**Use Cases:**

- Speed-critical tasks: ForgeCode + Opus 4.6
- Complex refactoring: ForgeCode for faster iteration
- Team collaboration: Claude Code (shared CLAUDE.md, checkpoints)

---

## Source References

1. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
2. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
3. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/

# ForgeCode vs Competitors - Feature & Ecosystem Comparison

**Topic:** Feature gaps, ecosystem comparison, workflow integration
**Source References:** DEV Community, ForgeCode docs, Reddit
**Date Compiled:** April 9, 2026

---

## Feature Matrix

| Feature | ForgeCode | Claude Code | Cursor |
|---------|-----------|-------------|--------|
| **Model Choice** | Any (300+) | Claude only | Multiple |
| **License** | Open source (Apache 2.0) | Proprietary | Proprietary |
| **Language** | Rust | TypeScript | TypeScript |
| **Project Config** | `AGENTS.md` | `CLAUDE.md` (hierarchical) | `.cursorrules` |
| **MCP Support** | Yes | Yes (extensive) | Yes |
| **Hooks** | **No** | Yes (6 types) | Limited |
| **Scheduled Tasks** | **No** | Yes (cloud + local) | No |
| **Sub-agents** | Yes (forge/sage/muse) | Yes (parallel) | Limited |
| **IDE Extensions** | **None** | VS Code, JetBrains | VS Code only |
| **Auto-Memory** | **No** | Yes | Yes |
| **Checkpoints/Rewind** | **No** | Yes | Yes |
| **Sandbox Mode** | `--sandbox` flag | Built-in | Built-in |
| **Plan Mode** | Yes (muse writes to `plans/`) | Yes (Shift+Tab) | Composer |
| **Pricing** | $0-$100/mo + API | $20/mo subscription | $20/mo subscription |

---

## The "Lambo with No Cup Holder" Problem

> "ForgeCode is a Lambo with no cup holder. Fast as hell, but you're holding your coffee between your knees."

**Meaning:** Extremely fast but missing quality-of-life features.

---

## Major Feature Gaps

### 1. No IDE Extensions
**Impact:** Must use the terminal exclusively; no GUI integration
**Workaround:** Use alongside an IDE manually

### 2. No Auto-Memory
**Impact:** Context doesn't persist between sessions
**Claude Code Comparison:** Remembers project context across sessions

### 3. No Checkpoints/Rewind
**Impact:** Cannot roll back changes without git
**Claude Code Comparison:** Every edit snapshotted; `/rewind` available

### 4. No Hooks
**Impact:** Cannot trigger scripts on file changes
**Claude Code Comparison:** 6 hook types (pre-command, post-command, etc.)

### 5. No Scheduled Tasks
**Impact:** Cannot schedule recurring agent runs
**Claude Code Comparison:** Both cloud and local scheduled tasks

---

## ForgeCode Strengths

### 1. Speed
- Rust binary vs. TypeScript runtime
- Context indexing reduces token usage ~90%
- Real-world: 3x faster on identical tasks

### 2. Multi-Model Support
- 300+ models via OpenRouter
- Not locked to a single provider
- Can optimize cost/performance per task

### 3. Multi-Agent Architecture
- `forge`: Implementation
- `sage`: Read-only research
- `muse`: Planning (writes to `plans/`)
- More detailed plan output than competitors

### 4. Open Source
- Apache 2.0 license
- Auditable code
- Community contributions

### 5. Terminal-Native
- Zsh plugin integration
- `:` sentinel for quick access
- No context switching

---

## Workflow Integration Patterns

### Pattern 1: ForgeCode for Speed
**Use Case:** Latency-sensitive tasks, quick fixes
**Workflow:** Use ForgeCode for implementation, an IDE for review

### Pattern 2: Double-Dipping
**User Quote:** "I double-dip. Claude Code for my primary workflow (ecosystem, features), ForgeCode when I care about latency."

### Pattern 3: Team Configuration
**Challenge:** No shared project instructions (CLAUDE.md is Claude-specific)
**Partial Solution:** AGENTS.md for ForgeCode, but not widely adopted

---

## AGENTS.md vs CLAUDE.md

### AGENTS.md (ForgeCode)
- Project-specific instructions
- Less widely documented
- Single file (no hierarchy)

### CLAUDE.md (Claude Code)
- Hierarchical (project → parent dirs → home)
- More mature documentation
- Shared across the team if committed
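The hierarchical lookup amounts to a walk from the project directory up through its parents, collecting each config file found, nearest first. An illustrative sketch of the pattern, not Claude Code's actual resolution code:

```python
from pathlib import Path

def find_configs(start: Path, name: str = "CLAUDE.md") -> list[Path]:
    """Collect config files from `start` up through its parents,
    nearest first, walking all the way to the filesystem root."""
    found = []
    for directory in [start, *start.parents]:
        candidate = directory / name
        if candidate.is_file():
            found.append(candidate)
    return found
```

Whether the nearest file wins outright or the files are merged is a per-tool design choice; a single-file scheme like AGENTS.md skips the walk entirely.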
---

## Recommendations by Use Case

### Solo Developer, Speed Priority
**Choice:** ForgeCode + Opus 4.6
**Reason:** Fastest iteration, cost-effective with careful model selection

### Team Environment
**Choice:** Claude Code
**Reason:** Shared CLAUDE.md, checkpoints, and auto-memory for team continuity

### IDE-First Developer
**Choice:** Cursor
**Reason:** Native IDE integration, GUI features

### Terminal-First, Privacy-Focused
**Choice:** ForgeCode (with FORGE_TRACKER=false)
**Reason:** Local execution, open source, no IDE lock-in

---

## Source References

1. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
2. **ForgeCode Docs:** https://forgecode.dev/docs/operating-agents/
3. **ForgeCode ZSH Docs:** https://forgecode.dev/docs/zsh-support/
4. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/

# Gemini 3.1 Pro with ForgeCode - Feedback Report

**Model:** Gemini 3.1 Pro Preview
**Provider:** Google
**Harness:** ForgeCode
**Source References:** ForgeCode Blog
**Date Compiled:** April 9, 2026

---

## Benchmark Performance

### TermBench 2.0
- **ForgeCode Score:** 78.4% (SOTA at time of testing)
- **Google's Reported Score:** 68.5% on the same model
- **Gap:** ~10 percentage points in favor of the ForgeCode harness

> "The delta is not a better model. It is better harness. Same weights, 10 percentage points higher."

---

## Key Technical Insights

### What Made the Difference

ForgeCode's blog describes seven failure modes and fixes that enabled this performance:

1. **Non-Interactive Mode:** Required for benchmark success (no user to answer clarifying questions)
2. **Tool Description Optimization:** Micro-evals isolating tool misuse categories
3. **Tool Naming:** Using `old_string`/`new_string` argument names from training data
4. **Entry-Point Discovery:** Lightweight semantic pass before exploration
5. **Time Limit Management:** Subagent parallelization + progressive thinking policy
6. **Planning Enforcement:** Mandatory `todo_write` tool usage (38% → 66% pass rate)
7. **Speed Architecture:** Low-complexity work delegated to subagents with minimal thinking budget

### Progressive Thinking Policy
- Messages 1-10: Very high thinking (plan formation)
- Messages 11+: Low thinking default (execution)
- Verification calls: Switch back to high thinking
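The policy reduces to a small function of message index and call type. A sketch of the stated rules only; the budget labels are placeholders, not ForgeCode's actual API:

```python
def thinking_budget(message_index: int, is_verification: bool = False) -> str:
    """Progressive thinking policy: heavy thinking while the plan forms,
    cheap thinking during execution, heavy again when verifying."""
    if is_verification:
        return "high"           # verification calls always think hard
    if message_index <= 10:
        return "very_high"      # plan-formation phase
    return "low"                # execution phase default

print(thinking_budget(3))             # very_high
print(thinking_budget(25))            # low
print(thinking_budget(25, is_verification=True))  # high
```

The harness would map these labels onto whatever reasoning-effort knob the underlying model API exposes.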
---

## Source References

1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/

# GPT 5.4 with ForgeCode - Feedback Report

**Model:** GPT 5.4
**Provider:** OpenAI
**Harness:** ForgeCode
**Source References:** DEV Community (Liran Baba), ForgeCode Blog
**Date Compiled:** April 9, 2026

---

## Benchmark Performance

### TermBench 2.0 (Self-Reported via ForgeCode)
- **Score:** 81.8% (tied for #1 with Opus 4.6)
- **Note:** Achieved through extensive harness optimizations, not raw model capability

---

## Real-World Performance Feedback

### Stability Issues
- **Assessment:** "Borderline unusable" for some tasks
- **Specific Issue:** 15-minute research task on a small repo
  - Tool calls repeatedly failing
  - Agent stuck in retry loops
  - Required a manual kill

> "I asked it to research the architecture of a small repo. Fifteen minutes. Kept going unstable, tool calls failing, the agent retrying and spinning. I killed it."
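The standard guard against this failure mode is a hard retry cap with backoff, so a flaky tool call surfaces an error instead of spinning indefinitely. A generic sketch, not ForgeCode's code:

```python
import time

def call_with_retries(tool, args, max_attempts=3, base_delay=0.5):
    """Invoke a flaky tool a bounded number of times with exponential
    backoff; after the last attempt, re-raise instead of looping."""
    for attempt in range(max_attempts):
        try:
            return tool(**args)
        except Exception:
            if attempt == max_attempts - 1:
                raise  # give up: surface the error to the agent loop
            time.sleep(base_delay * 2 ** attempt)
```

A harness can then treat the surfaced error as a signal to replan or ask the user, rather than retrying the same malformed call.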
### Tool Calling Reliability
- **Problem:** Persistent tool-call errors with GPT 5.4
- **ForgeCode Fixes Applied:**
  1. Reordered JSON schema fields (`required` before `properties`)
  2. Flattened nested schemas
  3. Added explicit truncation reminders for partial file reads
- **Result:** These optimizations were benchmark-specific (described as "benchmaxxed")

---

## Harness Optimizations for GPT 5.4

From ForgeCode's "Benchmarks Don't Matter" blog series:

1. **Non-Interactive Mode:** System prompt rewritten to prohibit conversational branching
2. **Tool Naming:** Renaming edit-tool arguments to `old_string` and `new_string` (names appearing frequently in training data) measurably dropped tool-call error rates
3. **Progressive Thinking Policy:**
   - Messages 1-10: Very high thinking (plan formation)
   - Messages 11+: Low thinking default (execution phase)
   - Verification skill calls: Switch back to high thinking

---

## What Didn't Work Well

1. **Research tasks:** Tool-calling failures causing infinite loops
2. **Long-running tasks:** 15+ minute tasks became unstable
3. **Consistency:** Unpredictable failures requiring manual intervention

---

## Comparison with Opus 4.6

| Aspect | GPT 5.4 | Opus 4.6 |
|--------|---------|----------|
| TermBench 2.0 | 81.8% | 81.8% |
| Real-world stability | Poor | Excellent |
| Tool calling reliability | Problematic | Reliable |
| Research tasks | Unusable | Good |

**Key Takeaway:** Benchmark scores don't reflect real-world usability. Same harness, dramatically different experiences.

---

## Source References

1. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
2. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/

# ForgeCode Pricing & Cost Feedback Report

**Topic:** Pricing tiers, cost concerns, value proposition
**Source References:** ForgeCode Blog, Reddit r/cursor, GitHub issues
**Date Compiled:** April 9, 2026

---

## Pricing Structure (As of July 27, 2025)

### Free Tier
- **Cost:** $0 (permanent, not a trial)
- **Limit:** Dynamic request limit (adjusts based on server load)
- **Typical Range:** 10-50 requests/day
- **Purpose:** Full feature exploration without time pressure

### Pro Plan
- **Cost:** $20/month
- **Limit:** Up to 1,000 AI requests/day
- **Target:** Regular users scaling up from the free tier

### Max Plan
- **Cost:** $100/month
- **Limit:** Up to 5,000 AI requests/day
- **Target:** Power users

---

## Early Access Insights

### Usage Patterns Discovered
- **Top 1% of users:** Made thousands of AI requests daily
- **Power user costs:** Over $500/day in AI inference costs during heavy usage
- **Growth:** 17x surge in signups, 10x spike in usage during early access

---

## Community Feedback

### Reddit r/cursor
**Mixed reactions to pricing:**

> "ForgeCode is VERY good. I tested it by resolving failed CI tests using Python and Go code, and it proved efficient and persistent."

> "What a sad news but it really good to solve my real problem with around 10 requests. If it refresh 1000 token daily I think it still OK unless u are building a quantum codebase"

**Comparison to alternatives:**

> "Cursor has models to efficiently index your codebase, while forgecode doesn't, so consider it to be worse than both. However, this looks like a good deal to me bc of the pricing."

---

## Cost Considerations

### Token Usage Concerns
From DEV Community analysis:

> "Nobody's published hard numbers. ForgeCode's multi-agent setup (forge/sage/muse spawning sub-agents) almost certainly burns more tokens per session. I noticed it anecdotally but didn't measure."

### API Key Requirements
- ForgeCode requires your **own API keys** (not included in the subscription)
- Separate billing from Claude Pro/ChatGPT Plus
- Can become expensive with heavy usage of premium models (Opus 4.6: $15/$75 per million tokens)
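To see how quickly this adds up, the per-session arithmetic is simple. The rates are the Opus 4.6 figures quoted above; the token counts below are hypothetical:

```python
def session_cost(input_tokens: int, output_tokens: int,
                 in_rate: float = 15.0, out_rate: float = 75.0) -> float:
    """Cost in USD at per-million-token rates ($15 in / $75 out)."""
    return (input_tokens * in_rate + output_tokens * out_rate) / 1_000_000

# A hypothetical heavy session: 2M input tokens, 200k output tokens.
print(session_cost(2_000_000, 200_000))  # 45.0
```

At $45 per heavy session, a few dozen such sessions a day lands in the "$500/day power user" territory described above.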
### Daily Limit Issues
**GitHub Issue #1296:**
- Problem: Reaching the daily FORGE limit stops a task mid-execution
- Built-up context is lost or must wait for the reset
- User requested the ability to switch providers when the limit is reached

---

## Value Proposition Analysis

### For Light Users (Free Tier)
- **Pros:** 10-50 requests may be sufficient for small projects
- **Cons:** Dynamic limits are unpredictable; may hit the cap during intensive sessions

### For Regular Users (Pro - $20/month)
- **Pros:** 1,000 requests/day is generous for most workflows
- **Cons:** Must also pay for API usage separately

### For Power Users (Max - $100/month)
- **Pros:** 5,000 requests/day accommodates heavy usage
- **Cons:** Expensive when combined with API costs; $100 + $500/day inference = $15,100/month potential

---

## Cost Optimization Tips

1. **Use context efficiently:** ForgeCode's context indexing reduces token usage ~90%
2. **Choose models carefully:** Opus 4.6 is expensive ($15/$75); consider Sonnet for routine tasks
3. **Monitor sub-agent spawning:** Multi-agent workflows consume more tokens
4. **Set FORGE_TRACKER=false:** Reduces overhead (minor but measurable)

---

## Comparison with Alternatives

| Tool | Pricing Model | Notes |
|------|---------------|-------|
| **ForgeCode** | $0-$100/month + API costs | Pay for the harness + pay for inference |
| **Claude Code** | $20/month subscription | Includes model access |
| **Cursor** | $20/month subscription | Includes model access |
| **Aider** | Free (open source) | Bring your own API keys |

**Key Difference:** ForgeCode is the only one with dual payment (harness subscription + API costs).

---

## Source References

1. **ForgeCode Blog:** https://forgecode.dev/blog/graduating-from-early-access-new-pricing-tiers-available/
2. **Reddit r/cursor:** https://www.reddit.com/r/cursor/comments/1maq1ex/forgecode_is_no_longer_free_and_unlimited_but/
3. **GitHub Issue #1296:** https://github.com/antinomyhq/forgecode/issues/1296
4. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c

# ForgeCode Privacy & Security Concerns - Feedback Report

**Topic:** Data collection, telemetry, privacy
**Source References:** GitHub Issue #1318, Discussion #2545, DEV Community, Reddit
**Date Compiled:** April 9, 2026

---

## Overview

Despite ForgeCode's claim that "Your code never leaves your computer," there are significant community concerns about telemetry and data collection practices.

---

## Documented Privacy Issues

### GitHub Issue #1318
**Status:** Referenced as a "red flag" by community members

**Reported Concerns:**
- Default telemetry collects:
  - Git user emails
  - SSH directory scans
  - Conversation data sent externally

### GitHub Discussion #2545
**Title:** "Clarity about data collected that involves code"

**Key Points:**
- The privacy policy mentions collecting commands
- Data can be stored and transferred in many ways
- ForgeCode Services (optional) may process data differently than local CLI mode

**Distinction:**
- **Local CLI mode:** Claims to run entirely on the local machine
- **ForgeCode Services:** Optional features that provide additional capabilities and may process data externally

---

## Mitigation

### Disable Tracking
```bash
export FORGE_TRACKER=false  # disables all tracking
```
|
||||
|
||||
### ForgeCode Services Clarification
|
||||
From Discussion #2545:
|
||||
> "ForgeCode Services are optional features that provide additional capabilities beyond the purely local CLI experience. If a user chooses to enable those services, some data relevant to those features may be processed by the service."
|
||||
|
||||
---
|
||||
|
||||
## Community Sentiment
|
||||
|
||||
### Reddit r/ClaudeCode
|
||||
> "Specifically for Forgecodedev, I haven't used it yet since they are not transparent about user data which is a red flag to me."
|
||||
|
||||
### DEV Community (Liran Baba)
|
||||
- Mentions telemetry concerns in comparison article
|
||||
- Notes the FORGE_TRACKER=false mitigation
|
||||
|
||||
---
|
||||
|
||||
## Benchmark Controversy Connection
|
||||
|
||||
Some users connect privacy concerns to benchmark results:
|
||||
|
||||
> "I am concerned about their proprietary layer, which I believe is a big part of what moved their bench scores from ~25% to ~81%. Currently it is free to use but may change in the future."
|
||||
|
||||
**Note:** ForgeCode Services (proprietary layer) was used for benchmark evaluations, which differs from purely local CLI mode.
|
||||
|
||||
---
|
||||
|
||||
## Transparency Issues
|
||||
|
||||
1. **Telemetry defaults:** Enabled by default, must explicitly disable
|
||||
2. **Data scope:** SSH directory scanning not clearly documented upfront
|
||||
3. **ForgeCode Services:** Connection between services and benchmark results not immediately obvious
|
||||
4. **Proprietary layer:** Some components not open source
|
||||
|
||||
---
|
||||
|
||||
## Recommendations for Privacy-Conscious Users
|
||||
|
||||
1. **Set FORGE_TRACKER=false** before using
|
||||
2. **Avoid ForgeCode Services** if local-only operation is required
|
||||
3. **Audit code:** Harness is open source (Apache 2.0), can be inspected
|
||||
4. **Use own API keys:** Don't rely on any bundled/free tier that might require data sharing
|
||||
|
||||
---
|
||||
|
||||
## Source References
|
||||
|
||||
1. **GitHub Discussion:** https://github.com/antinomyhq/forgecode/discussions/2545
|
||||
2. **GitHub Issue #1318:** Referenced in multiple community discussions
|
||||
3. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
|
||||
4. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
|
||||
@@ -0,0 +1,204 @@
|
||||
# ForgeCode Best Practices - Summary
|
||||
|
||||
**Compiled from:** Community feedback, GitHub issues, blog posts, documentation
|
||||
**Date Compiled:** April 9, 2026
|
||||
|
||||
---
|
||||
|
||||
## Quick Start Best Practices
|
||||
|
||||
### 1. Disable Telemetry
|
||||
```bash
|
||||
export FORGE_TRACKER=false
|
||||
```
|
||||
Add to `~/.zshrc` for persistence.
|
||||
|
||||
### 2. Configure API Keys Properly
|
||||
```bash
|
||||
forge provider login # Set up providers
|
||||
```
|
||||
Consider API key helpers (requested in #2888) for security.
|
||||
|
||||
### 3. Verify ZSH Integration
|
||||
```bash
|
||||
forge zsh doctor # Check for issues
|
||||
forge zsh setup # Re-run if needed
|
||||
```
|
||||
|
||||
---
|
||||
|
||||
## Model Selection Best Practices
|
||||
|
||||
### For Speed
|
||||
- **Opus 4.6** through ForgeCode: Fastest real-world performance
|
||||
- **Avoid GPT 5.4** through ForgeCode: Unstable tool calling
|
||||
|
||||
### For Cost
|
||||
- **MiniMax M2.1:** Near-SOTA performance at $0.30/$1.20 per million tokens
|
||||
- **LongCat-Flash-Lite:** Budget option at $0.10/$0.40
|
||||
|
||||
### For Reliability
|
||||
- **Claude Sonnet 4.5:** Best independent benchmark scores
|
||||
- **Avoid:** Models with known tool calling issues (Qwen 3.5 with current bug)
|
||||
|
||||
---
|
||||
|
||||
## Agent Usage Best Practices
|
||||
|
||||
### Workflow Pattern
|
||||
1. **Start with `muse`** for planning complex changes
|
||||
2. **Switch to `forge`** for implementation
|
||||
3. **Use `sage`** (automatically) for research
|
||||
|
||||
### Command Reference
|
||||
```bash
|
||||
:muse # Planning mode
|
||||
:forge # Implementation mode
|
||||
:agent # View all agents
|
||||
:new # Fresh conversation
|
||||
:compact # Free up token budget
|
||||
```

---

## Context Management

### Strengths
- **~90% context reduction** vs. full-file inclusion
- Function signature indexing
- Selective context pulling

### Limitations
- **No auto-compaction** (unlike Claude Code)
- **No checkpoints/rewind**
- Manual `:compact` required when the context window fills

### Tips
- Use `@filename` to tag files into context
- Run `:compact` before long tasks
- Start with `:new` for unrelated tasks

---

## Tool Calling Best Practices

### For Harness Developers
1. Use `old_string`/`new_string` argument names
2. Put `required` before `properties` in the JSON schema
3. Flatten nested schemas
4. Add explicit truncation reminders
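
A sketch of what those guidelines look like together, as a hypothetical edit-tool definition (the tool name, descriptions, and parameter set are invented for illustration, not taken from ForgeCode's source):

```python
import json

# Hypothetical file-edit tool following the guidelines above:
# flat schema, "required" declared before "properties",
# and old_string/new_string argument names.
edit_tool = {
    "name": "edit_file",
    "description": "Replace one exact string in a file. "
                   "If the file was truncated when read, re-read it before editing.",
    "parameters": {
        "type": "object",
        "required": ["path", "old_string", "new_string"],
        "properties": {
            "path": {"type": "string", "description": "Path of the file to edit"},
            "old_string": {"type": "string", "description": "Exact text to replace"},
            "new_string": {"type": "string", "description": "Replacement text"},
        },
    },
}

# Python dicts preserve insertion order, so the serialized schema
# really does emit "required" before "properties".
print(json.dumps(edit_tool, indent=2))
```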

### For Users
1. **Verify tool calls** - don't accept them blindly
2. **Check file paths** - the AI can hallucinate paths
3. **Review diffs** - especially for large changes

---

## Pricing Optimization

### Cost Control
1. **Use Sonnet** for routine tasks (cheaper than Opus)
2. **Limit sub-agent spawning** - it burns tokens quickly
3. **Use context efficiently** - ForgeCode's indexing helps
4. **Monitor daily limits** - the free tier allows 10-50 requests/day

### Plan Selection
- **Free:** Testing, small projects
- **Pro ($20):** Regular use (<1,000 requests/day)
- **Max ($100):** Power users (1,000-5,000 requests/day)

---

## Project Configuration

### AGENTS.md
Create at the project root or at `~/forge/AGENTS.md`:

```markdown
# Development Guidelines

## Runtime
- NEVER restart the dev server (runs on port 3000)
- Use npm exclusively (not yarn/pnpm)

## Code Style
- TypeScript strict mode
- Functional programming preferred
```

### Tips
- Be specific and actionable
- Include negative constraints ("NEVER...")
- Reference existing code patterns

---

## Common Pitfalls

### 1. Expecting Claude Code Features
- **Missing:** Checkpoints, auto-memory, IDE extensions
- **Workaround:** Commit to git frequently

### 2. Ignoring Daily Limits
- **Problem:** Tasks stop mid-execution when the limit is reached
- **Solution:** Monitor usage, upgrade your plan, or switch providers

### 3. Using GPT 5.4 for Research
- **Problem:** Tool calling failures, infinite loops
- **Solution:** Use Opus 4.6 or Sonnet instead

### 4. Privacy Concerns
- **Problem:** Telemetry collects SSH/git data by default
- **Solution:** Set `FORGE_TRACKER=false`

---

## When to Use ForgeCode vs Alternatives

### Use ForgeCode When:
- Terminal-first workflow
- Speed is the priority
- Multi-model flexibility is needed
- Open-source/auditable code is required
- Privacy control is essential (with telemetry disabled)

### Use Claude Code When:
- Team collaboration (shared CLAUDE.md)
- You need checkpoints/rewind
- You want auto-memory across sessions
- IDE extensions are needed
- You prefer subscription pricing (no separate API costs)

### Use Cursor When:
- An IDE-native experience is preferred
- GUI features are important
- The team uses VS Code exclusively

---

## Debugging Tips

### Tool Call Failures
1. Check model compatibility (avoid Qwen 3.5 currently)
2. Verify the JSON schema format
3. Try `:retry` to resend

### Performance Issues
1. Use `:compact` to free context
2. Switch to a faster model (Sonnet vs. Opus)
3. Close unnecessary files with `@[filename]`

### Integration Issues
1. Run `forge zsh doctor`
2. Verify a Nerd Font is installed
3. Check terminal compatibility (Ghostty has a resize bug)

---

## Source References

1. **ForgeCode Docs:** https://forgecode.dev/docs/
2. **ZSH Support:** https://forgecode.dev/docs/zsh-support/
3. **Operating Agents:** https://forgecode.dev/docs/operating-agents/
4. **DEV Community:** https://dev.to/liran_baba/forgecode-vs-claude-code-which-ai-coding-agent-actually-wins-36c
5. **GitHub Issues:** https://github.com/antinomyhq/forgecode/issues