Initial commit: coding harness feedback analysis

Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.
commit 51123212c4
Date: 2026-04-09 15:13:45 +02:00
46 changed files with 7213 additions and 0 deletions
@@ -0,0 +1,101 @@
# Local/Small Models with ForgeCode - General Feedback
**Scope:** Local LLMs via Ollama, llama.cpp, LM Studio, etc.
**Harness:** ForgeCode
**Source References:** Reddit r/LocalLLaMA, r/LocalLLM, GitHub issues
**Date Compiled:** April 9, 2026
---
## Key Challenges for Local Models
### 1. Tool Calling Format Issues
**Problem:** Many local models struggle with tool calling formats
**Evidence:**
- Gemma 4 initial releases had tool calling format issues with harnesses
- Qwen3.5 has issues with multiple system messages
- Various models require specific inference backends for reliable tool use
**Recommendation:** Use the latest versions of the inference backends:
- oMLX / llama.cpp (latest) for Gemma 4
- LM Studio 0.4.9+ for Qwen3.5
- Unsloth fixes for Qwen3-Coder tool calling
### 2. Context Window Configuration
**Default Issues:**
- Ollama runs Qwen3 with a 4K context window by default, which is too small for agentic work
- The context length must be increased explicitly
**Fix:**
```bash
# For Ollama: set the context length in a Modelfile (tag/size are examples), then rebuild
cat > Modelfile <<'EOF'
FROM qwen3
PARAMETER num_ctx 32768
EOF
ollama create qwen3-32k -f Modelfile
# For llama.cpp: pass the context size with the -c flag
llama-server -m model.gguf -c 32768
```
### 3. Quantization Quality
**Observation:** Default quantization often insufficient for tool use
**Fix:**
- Try higher-quality quantization (e.g., `:q8_0` for 8-bit instead of default Q4_K_M)
- Trade-off: More RAM usage but better output quality
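As a concrete sketch (the tag is illustrative; check the model's page on ollama.com for the quant tags actually published), pulling a higher-precision quant with Ollama looks like:
```bash
# Pull an 8-bit quant instead of the default ~4-bit one
ollama pull qwen2.5-coder:14b-instruct-q8_0
# Run it; expect roughly double the RAM footprint of the q4 variant
ollama run qwen2.5-coder:14b-instruct-q8_0
```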
### 4. Model Size Recommendations
From community feedback:
- **< 7B models:** Generally insufficient for reliable agentic tool use
- **7B-14B:** Minimum viable for simple tasks
- **30B+:** Recommended for serious coding work
- **MoE models (Qwen3-Coder 480B-A35B):** Good performance but require significant RAM
---
## Specific Model Notes
### Qwen3-Coder Next
- **Status:** "First usable coding model < 60GB" according to user reports
- **Workflow tip:** Compress context after each bug fix/feature, then reload
- **Important:** Limit context size in settings.json to prevent overflow
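A hypothetical `settings.json` snippet along those lines (the key names are assumptions for illustration, not confirmed ForgeCode options; consult the harness docs for the real schema):
```json
{
  "model": "qwen3-coder-next",
  "max_context_tokens": 60000
}
```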
### Gemma 4
- **Requirement:** Latest oMLX / llama.cpp for tool calling
- **Recommendation:** The 26B MoE variant is a good fit for limited-RAM setups
### Mistral 7B
- **Alternative:** Consider when Qwen 2.5 14B uses too much RAM
- **Trade-off:** Smaller but potentially less capable
---
## Platform-Specific Notes
### Apple Silicon (M-series)
- **Observation:** "Silent, very power efficient, good speeds"
- **Limitation:** Prompt processing slower than NVIDIA GPUs
- **Alternative:** LM Studio with the MLX backend is currently preferred over Ollama by some users
### Linux
- Best support and performance for local inference
- htop recommended for monitoring RAM usage
---
## General Best Practices
1. **Close other applications** to free RAM before running local models
2. **Monitor context usage** - reported usage can exceed 100% in some UIs while the session still appears to work
3. **Update regularly** - inference backends fix tool calling issues frequently
4. **Test thoroughly** - local model behavior varies significantly by quant and backend
---
## Source References
1. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1qz5uww/qwen3_coder_next_as_first_usable_coding_model_60/
2. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/
3. **Reddit r/LocalLLM:** https://www.reddit.com/r/LocalLLM/comments/1sf5aqy/how_are_people_using_local_llms_for_coding/
4. **llama.cpp Discussion:** https://github.com/ggml-org/llama.cpp/discussions/4167
@@ -0,0 +1,133 @@
# GitHub Issues Summary for ForgeCode
**Scope:** Open and recently closed issues affecting model performance
**Repository:** antinomyhq/forgecode
**Stats:** 48 open, 433 closed (as of April 9, 2026)
**Date Compiled:** April 9, 2026
---
## Critical Open Issues
### #2904: Use models.dev as LLM model registry source
- **Status:** Open (April 9, 2026)
- **Type:** Enhancement
- **Impact:** Would improve model discovery and configuration
### #2894: Multiple system messages break models with strict chat templates (e.g. Qwen3.5)
- **Status:** Open (April 8, 2026)
- **Type:** Bug
- **Impact:** BREAKS local models with strict templates
- **Affected Models:** Qwen3.5, potentially others
- **Workaround:** None yet
### #2893: Terminal output disappears on window resize in Ghostty
- **Status:** Open (April 8, 2026)
- **Type:** Bug
- **Impact:** UI/usability issue
- **Linked PRs:** 1
### #2888: Add support for API key helpers
- **Status:** Open (April 8, 2026)
- **Type:** Feature
- **Impact:** Would improve security (helper scripts for API keys)
### #2884: Muse mode shell blocked
- **Status:** Open (April 7, 2026)
- **Type:** Bug
- **Impact:** Blocks usage of muse agent for planning
---
## Historical Issues (Now Fixed)
### #2813 (Fixed)
- Fix referenced by a maintainer in a Reddit response
- **Source:** Reddit r/ClaudeCode
### #2485: Installation issues on Mac
- **Symptoms:** Oh My Zsh not found, terminal configuration issues
- **Resolution:** Install Oh My Zsh separately
### #1296: Daily FORGE limit stops tasks mid-execution
- **Problem:** Cannot switch providers when daily limit reached
- **Impact:** Context built up is lost
- **Status:** Open (feature request)
---
## Model-Specific Issues
### GPT 5.4
- **Tool calling reliability:** Improved via schema reordering
- **Status:** Workarounds implemented
### Qwen 3.5
- **Multiple system messages:** Open issue #2894
- **Tool calling format:** Use LM Studio 0.4.9+ for better compatibility
### Gemma 4
- **Tool calling:** Requires latest llama.cpp/oMLX
- **Status:** Resolved with backend updates
---
## Privacy/Security Issues
### #1318: Telemetry concerns
- **Collection:** Git emails, SSH directory scans, conversation data
- **Mitigation:** `FORGE_TRACKER=false`
- **Status:** Documented mitigation available
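To make the documented mitigation persistent, the variable can be exported from the shell profile:
```bash
# Disable ForgeCode telemetry for all future ZSH sessions
echo 'export FORGE_TRACKER=false' >> ~/.zshrc
source ~/.zshrc
```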
### #1317: Related privacy concerns
- **Linked to:** Discussion #2545
---
## ZSH/Terminal Issues
### Shell Integration
- **Issue:** ZSH aliases don't work in interactive mode (by design)
- **Solution:** Use `:` sentinel from native ZSH session
### Oh My Zsh
- **Status:** Not strictly required, but recommended
- **Behavior:** The install script warns if it is not present
### Ghostty Terminal
- **Issue:** #2893 - Output disappears on resize
- **Status:** Under investigation
---
## Installation Issues
### macOS
- **Common:** iTerm + Oh My Zsh configuration issues
- **Fix:** Run `forge zsh doctor` and `forge zsh setup`
### Windows
- **Support:** Via WSL or Git Bash only
- **Native:** Not officially supported
### Linux
- **Best supported platform**
- **Android:** Also supported
---
## Issue Resolution Tips
From documentation:
```bash
forge zsh doctor # Check environment
forge zsh setup # Re-run ZSH integration
```
---
## Source References
1. **GitHub Issues:** https://github.com/antinomyhq/forgecode/issues
2. **GitHub Discussions:** https://github.com/antinomyhq/forgecode/discussions
3. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
@@ -0,0 +1,203 @@
# Installation & Platform Issues - Feedback Report
**Topic:** Setup problems, platform compatibility, requirements
**Source References:** GitHub issues, ForgeCode docs, Reddit
**Date Compiled:** April 9, 2026
---
## Supported Platforms
### Officially Supported
- **macOS:** Full support
- **Linux:** Best support
- **Android:** Supported
- **Windows:** Via WSL or Git Bash only
### Not Supported
- **Native Windows:** Not officially supported
---
## Installation Methods
### Method 1: YOLO Install (Recommended)
```bash
curl -fsSL https://forgecode.dev/cli | sh
```
### Method 2: Nix
```bash
nix run github:antinomyhq/forge
```
### Method 3: NPM
```bash
npx forgecode@latest
```
---
## Common Installation Issues
### Issue #2485: Mac Installation Problems
**Symptoms:**
- Oh My Zsh not found
- Terminal configuration issues
- Shell environment problems
**Environment Reported:**
- Shell: zsh 5.9
- Terminal: iTerm.app 3.6.8
- Oh My Zsh: Not installed
**Solution:**
```bash
# Install Oh My Zsh first
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
# Then re-run forge setup
forge zsh setup
```
### Terminal Requirements
#### Required: Nerd Font
- **Purpose:** Icon display
- **Recommended:** FiraCode Nerd Font
- **Verification:** Icons should display without overlap during setup
#### Recommended Terminals
- iTerm2 (macOS)
- Ghostty (macOS) - NOTE: Has resize bug (#2893)
- Any modern Linux terminal
---
## ZSH Integration Issues
### Interactive Mode Isolation
**Design:** ForgeCode's interactive mode runs in an isolated environment
**Impact:**
- ZSH aliases don't work inside interactive mode
- Custom functions unavailable
- Shell tooling not accessible
**Solution:** Use `:` sentinel from native ZSH session instead
### Tab Completion
**Requirements:**
- `fd` (file finder)
- `fzf` (fuzzy finder)
**Usage:**
```bash
:<TAB> # Open command list
@file<TAB> # Fuzzy file picker
```
**Fallback:** Use full path with brackets: `@[src/components/Header.tsx]`
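Installing the two helpers is a quick sketch on common platforms (note that Debian/Ubuntu packages `fd` as `fd-find`, with the binary named `fdfind`):
```bash
# macOS (Homebrew)
brew install fd fzf
# Debian/Ubuntu
sudo apt install fd-find fzf
```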
---
## Platform-Specific Notes
### macOS
**Best Practices:**
- Use iTerm2 or Ghostty
- Install Oh My Zsh for best experience
- Enable Nerd Font in terminal preferences
**Troubleshooting:**
```bash
forge zsh doctor # Check setup
forge zsh setup # Reconfigure
```
### Linux
**Advantages:**
- Best performance for local models
- Native ZSH support
- Package manager availability
**Tips:**
- Use system package manager when available
- Check `htop` for resource monitoring
### Windows
**Limitations:**
- No native support
- Must use WSL or Git Bash
**WSL Recommendation:**
- Ubuntu 22.04+ recommended
- Install ZSH within WSL
- Windows Terminal for best experience
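A minimal setup along those lines, assuming a recent Windows build with WSL available:
```bash
# From an elevated PowerShell: install Ubuntu 22.04 under WSL
wsl --install -d Ubuntu-22.04
# Then, inside the Ubuntu shell: install ZSH and make it the default
sudo apt update && sudo apt install -y zsh
chsh -s "$(which zsh)"
```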
### Android
**Status:** Supported but limited documentation
**Use Case:** Primarily for remote development scenarios
---
## Verification Steps
### Post-Installation Checklist
1. **Run doctor:**
```bash
forge zsh doctor
```
2. **Verify icons:**
- Should display without overlap
- Check during interactive setup
3. **Test basic commands:**
```bash
: hi
:new
:agent
```
4. **Configure provider:**
```bash
forge provider login
```
---
## Open Issues
### #2893: Ghostty Terminal Resize Bug
- **Problem:** Terminal output disappears on window resize
- **Status:** Open, 1 linked PR
- **Workaround:** Avoid resizing or use different terminal
### #2884: Muse Mode Shell Blocked
- **Problem:** Cannot use muse agent
- **Status:** Open
- **Impact:** Planning workflow blocked
---
## Resource Requirements
### Minimum
- **RAM:** 4GB (for cloud models)
- **Disk:** 500MB
- **Shell:** ZSH 5.0+
### For Local Models
- **RAM:** 16GB+ recommended
- **GPU:** Optional but recommended for larger models
- **Storage:** 10GB+ for model downloads
---
## Source References
1. **GitHub Issue #2485:** https://github.com/antinomyhq/forgecode/issues/2485
2. **GitHub Issue #2893:** https://github.com/antinomyhq/forgecode/issues/2893
3. **ForgeCode Docs:** https://forgecode.dev/docs/installation/
4. **ZSH Support:** https://forgecode.dev/docs/zsh-support/
@@ -0,0 +1,99 @@
# MiniMax, GLM, DeepSeek with ForgeCode - Feedback Report
**Models:** MiniMax M2/M2.1, GLM-4.6, DeepSeek-V3
**Source References:** llm-stats.com, ForgeCode Blog
**Date Compiled:** April 9, 2026
---
## MiniMax M2.1
### Performance
- **Terminal-Bench Score:** 47.9% (Rank #2 on independent leaderboard)
- **Parameters:** 230B
- **Context:** 1.0M tokens
- **Cost:** $0.30 / $1.20 per million tokens
### Value Proposition
- **Best cost-performance ratio** among top performers
- Near-SOTA performance at entry-level pricing
- Massive 1.0M context window
### ForgeCode Usage
- Well-supported via OpenRouter
- Good tool calling reliability
- Recommended for budget-conscious users
---
## GLM-4.6 (Zhipu AI)
### Performance
- **Terminal-Bench Score:** 40.5% (Rank #7)
- **Parameters:** 357B
- **Context:** 131K tokens
- **Cost:** $0.55 / $2.19 per million tokens
### Characteristics
- Open weights
- Competitive with proprietary models at similar price point
- Good context length (131K)
---
## DeepSeek Models
### DeepSeek-V3.2-Exp
- **Terminal-Bench Score:** 37.7% (Rank #10)
- **Status:** Experimental
- **Note:** Results from llm-stats.com
### DeepSeek-V3.1
- **Terminal-Bench Score:** 31.3% (Rank #16)
- **Parameters:** 671B
- **Observation:** Large parameter count doesn't translate to top-tier performance
### DeepSeek-R1-0528
- **Terminal-Bench Score:** 5.7% (Rank #23 - lowest)
- **Parameters:** 671B
- **Note:** Reasoning model may not be optimized for terminal tasks
---
## Key Insights
### Scale ≠ Performance
- Kimi K2 (1.0T parameters) underperforms smaller models
- DeepSeek-V3.1 (671B) scores lower than MiniMax M2.1 (230B)
- **Quality of architecture > raw parameter count**
### Cost-Performance Leaders
1. **MiniMax M2.1:** 47.9% at $0.30/$1.20
2. **LongCat-Flash-Lite:** Budget option at $0.10/$0.40 (score not in top 10)
### Context Window Comparison
| Model | Context | Rank |
|-------|---------|------|
| MiniMax M2.1 | 1.0M | #2 |
| Claude Opus 4.1 | 200K | #5 |
| GLM-4.6 | 131K | #7 |
---
## Recommendations
### For Budget + Performance
**MiniMax M2.1** - Best value proposition
### For Open Weights
**GLM-4.6** or **MiniMax M2** - Both open, strong performance
### For Research
Avoid DeepSeek-R1 for terminal tasks; use V3 variants instead
---
## Source References
1. **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
2. **ForgeCode Blog:** Model comparison series
@@ -0,0 +1,55 @@
# Qwen 3.5 with ForgeCode - Feedback Report
**Model:** Qwen 3.5
**Provider:** Alibaba Cloud (via local inference)
**Harness:** ForgeCode
**Source References:** GitHub Issue #2894, Reddit r/LocalLLaMA
**Date Compiled:** April 9, 2026
---
## Known Issues
### Multiple System Messages Bug
**GitHub Issue:** #2894 (Open as of April 8, 2026)
**Problem:** Multiple system messages break models with strict chat templates (e.g., Qwen3.5)
**Error Manifestation:**
- Models with strict chat templates fail to parse message structure correctly
- Tool calling may fail or produce incorrect results
- Agent behavior becomes unpredictable
**Impact:**
- Affects local inference with llama.cpp, Ollama, and similar servers
- Qwen3.5 specifically mentioned as affected
**Workaround Status:** No official fix yet; issue under investigation
---
## Tool Calling with Qwen Models
### General Observations from Community
1. **Qwen3-Coder Next** shows promise as "first usable coding model < 60GB"
2. **Tool calling reliability varies** by inference backend:
- LM Studio 0.4.9 reportedly handles Qwen3.5 XML tool parsing more reliably than raw llama.cpp
- llama.cpp with `--jinja` flag helps with tool calling
3. **`finish_reason` handling** is a recurring annoyance to debug, according to community reports
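For example, a llama.cpp server invocation using the `--jinja` flag mentioned above (model path and context size are placeholders):
```bash
# --jinja applies the model's embedded chat template, which improves
# tool-call formatting for Qwen-family models
llama-server -m qwen3.5-q8_0.gguf --jinja -c 32768
```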
---
## Recommendations for Local Use
1. **Use LM Studio** for more reliable tool parsing vs raw llama.cpp
2. **Monitor system message count** - known issue with ForgeCode's multi-message approach
3. **Test thoroughly** before relying on Qwen 3.5 for production tasks via ForgeCode
---
## Source References
1. **GitHub Issue:** https://github.com/antinomyhq/forgecode/issues/2894
2. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/
@@ -0,0 +1,111 @@
# Tool Calling Reliability with ForgeCode - Feedback Report
**Topic:** Tool use reliability, function calling, common errors
**Source References:** ForgeCode Blog, GitHub issues, Reddit
**Date Compiled:** April 9, 2026
---
## Overview
Tool calling reliability is a major differentiator for ForgeCode. The harness has implemented numerous optimizations to improve tool use across models.
---
## The Seven Failure Modes (From ForgeCode Blog)
### 1. Same Model, Very Different Performance
**Problem:** Interactive-first design fails in benchmarks (no user to answer questions)
**Fix:** Non-Interactive Mode with rewritten system prompts
### 2. Tool Descriptions Don't Guarantee Correctness
**Problem Categories:**
- Wrong tool selected (e.g., `shell` instead of structured `edit`)
- Correct tool, wrong argument names
- Correct tool, correct arguments, wrong sequencing
**Fix:** Targeted micro-evals isolating each class per tool, per model
### 3. Tool Naming is a Reliability Variable
**Key Finding:** Models pattern-match against training data first
**Concrete Example:**
- Renaming edit tool arguments to `old_string` and `new_string`
- Result: "measurably dropped tool-call error rates immediately—same model, same prompt"
> "If you are seeing persistent tool-call errors and attribute them entirely to model capability, check your naming first."
### 4. Context Size is a Multiplier, Not a Substitute
**Problem:** More context only helps after finding the right entry point
**Insight:** Entry-point discovery latency is the bottleneck
### 5. Time Limits Punish Trajectories
**Problem:** Failed tool calls burn real seconds; brilliant but meandering paths time out
**Fix:** Speed architecture with parallel subagents
### 6. Planning Tools Only Work if Enforced
**Problem:** Optional `todo_write` tool ignored under pressure
**Fix:** Made mandatory via low-level evals
**Result:** 38% → 66% pass rate
### 7. TermBench is More About Speed Than Intelligence
**Fix:** Progressive thinking policy (high thinking early, low during execution)
---
## Model-Specific Tool Calling Issues
### GPT 5.4
- **Issue:** Persistent tool-call errors
- **Fixes Applied:**
- Reordered JSON schema fields (`required` before `properties`)
- Flattened nested schemas
- Explicit truncation reminders
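The reordering and flattening look roughly like this (hypothetical tool schema; key order matters only because the serialized schema is part of what the model sees, with `required` emitted before `properties` and nested argument objects hoisted to flat top-level fields):
```json
{
  "type": "object",
  "required": ["path", "content"],
  "properties": {
    "path": { "type": "string" },
    "content": { "type": "string" }
  }
}
```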
### Qwen 3.5
- **Issue:** Multiple system messages break strict chat templates
- **Status:** Open issue (#2894)
- **Workaround:** None yet; use different model or await fix
### Gemma 4
- **Issue:** Initial releases had tool calling format issues
- **Fix:** Use latest oMLX / llama.cpp
---
## Best Practices for Tool Reliability
1. **Use established argument names:** `old_string`/`new_string` better than generic names
2. **Flatten schemas:** Reduce nesting in tool definitions
3. **Order matters:** Put `required` before `properties` in JSON schema
4. **Test with micro-evals:** Isolate specific tool+model combinations
5. **Monitor truncation:** Add explicit reminders when files partially read
---
## ForgeCode Services Enhancements
The proprietary runtime layer includes:
1. **Semantic entry-point discovery:** Lightweight semantic pass before exploration
2. **Dynamic skill loading:** Specialized instructions loaded when needed
3. **Tool-call correction layer:** Heuristic + static analysis for argument validation
**Note:** These features are part of ForgeCode Services (optional), not the open-source CLI.
---
## Community Tips
From Reddit and GitHub discussions:
1. **LM Studio > raw llama.cpp** for Qwen3.5 XML tool parsing
2. **LM Studio 0.4.9+** handles tool calling more reliably
3. **llama.cpp `--jinja` flag** helps with Qwen tool templates
---
## Source References
1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
2. **GitHub Issue #2894:** https://github.com/antinomyhq/forgecode/issues/2894
3. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/