Initial commit: coding harness feedback analysis

Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.
commit 51123212c4
Date: 2026-04-09 15:13:45 +02:00
46 changed files with 7213 additions and 0 deletions
@@ -0,0 +1,101 @@
# Local/Small Models with ForgeCode - General Feedback
**Scope:** Local LLMs via Ollama, llama.cpp, LM Studio, etc.
**Harness:** ForgeCode
**Source References:** Reddit r/LocalLLaMA, r/LocalLLM, GitHub issues
**Date Compiled:** April 9, 2026
---
## Key Challenges for Local Models
### 1. Tool Calling Format Issues
**Problem:** Many local models struggle with tool calling formats
**Evidence:**
- Gemma 4 initial releases had tool calling format issues with harnesses
- Qwen3.5 has issues with multiple system messages
- Various models require specific inference backends for reliable tool use
**Recommendation:** Use the latest versions of the inference backends:
- oMLX / llama.cpp (latest) for Gemma 4
- LM Studio 0.4.9+ for Qwen3.5
- Unsloth fixes for Qwen3-Coder tool calling
### 2. Context Window Configuration
**Default Issues:**
- Ollama runs Qwen3 with a 4K context window by default, which is too small for agentic work
- The context length must be increased explicitly
**Fix:**
```bash
# For Ollama: set the context length in a Modelfile (tag/size are examples), then rebuild
cat > Modelfile <<'EOF'
FROM qwen3
PARAMETER num_ctx 32768
EOF
ollama create qwen3-32k -f Modelfile
# For llama.cpp: pass the context size with the -c flag
llama-server -m model.gguf -c 32768
```
### 3. Quantization Quality
**Observation:** Default quantization often insufficient for tool use
**Fix:**
- Try higher-quality quantization (e.g., `:q8_0` for 8-bit instead of default Q4_K_M)
- Trade-off: More RAM usage but better output quality
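As a concrete sketch (the tag is illustrative; check the model's page on ollama.com for the quant tags actually published), pulling a higher-precision quant with Ollama looks like:
```bash
# Pull an 8-bit quant instead of the default ~4-bit one
ollama pull qwen2.5-coder:14b-instruct-q8_0
# Run it; expect roughly double the RAM footprint of the q4 variant
ollama run qwen2.5-coder:14b-instruct-q8_0
```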
### 4. Model Size Recommendations
From community feedback:
- **< 7B models:** Generally insufficient for reliable agentic tool use
- **7B-14B:** Minimum viable for simple tasks
- **30B+:** Recommended for serious coding work
- **MoE models (Qwen3-Coder 480B-A35B):** Good performance but require significant RAM
---
## Specific Model Notes
### Qwen3-Coder Next
- **Status:** "First usable coding model < 60GB" according to user reports
- **Workflow tip:** Compress context after each bug fix/feature, then reload
- **Important:** Limit context size in settings.json to prevent overflow
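A hypothetical `settings.json` snippet along those lines (the key names are assumptions for illustration, not confirmed ForgeCode options; consult the harness docs for the real schema):
```json
{
  "model": "qwen3-coder-next",
  "max_context_tokens": 60000
}
```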
### Gemma 4
- **Requirement:** Latest oMLX / llama.cpp for tool calling
- **Recommendation:** The 26B MoE variant is a good fit for limited-RAM setups
### Mistral 7B
- **Alternative:** Consider when Qwen 2.5 14B uses too much RAM
- **Trade-off:** Smaller but potentially less capable
---
## Platform-Specific Notes
### Apple Silicon (M-series)
- **Observation:** "Silent, very power efficient, good speeds"
- **Limitation:** Prompt processing slower than NVIDIA GPUs
- **Alternative:** LM Studio with the MLX backend is currently preferred over Ollama by some users
### Linux
- Best support and performance for local inference
- htop recommended for monitoring RAM usage
---
## General Best Practices
1. **Close other applications** to free RAM before running local models
2. **Monitor context usage** - reported usage can exceed 100% in some UIs while the session still appears to work
3. **Update regularly** - inference backends fix tool calling issues frequently
4. **Test thoroughly** - local model behavior varies significantly by quant and backend
---
## Source References
1. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1qz5uww/qwen3_coder_next_as_first_usable_coding_model_60/
2. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/
3. **Reddit r/LocalLLM:** https://www.reddit.com/r/LocalLLM/comments/1sf5aqy/how_are_people_using_local_llms_for_coding/
4. **llama.cpp Discussion:** https://github.com/ggml-org/llama.cpp/discussions/4167
@@ -0,0 +1,133 @@
# GitHub Issues Summary for ForgeCode
**Scope:** Open and recently closed issues affecting model performance
**Repository:** antinomyhq/forgecode
**Stats:** 48 open, 433 closed (as of April 9, 2026)
**Date Compiled:** April 9, 2026
---
## Critical Open Issues
### #2904: Use models.dev as LLM model registry source
- **Status:** Open (April 9, 2026)
- **Type:** Enhancement
- **Impact:** Would improve model discovery and configuration
### #2894: Multiple system messages break models with strict chat templates (e.g. Qwen3.5)
- **Status:** Open (April 8, 2026)
- **Type:** Bug
- **Impact:** BREAKS local models with strict templates
- **Affected Models:** Qwen3.5, potentially others
- **Workaround:** None yet
### #2893: Terminal output disappears on window resize in Ghostty
- **Status:** Open (April 8, 2026)
- **Type:** Bug
- **Impact:** UI/usability issue
- **Linked PRs:** 1
### #2888: Add support for API key helpers
- **Status:** Open (April 8, 2026)
- **Type:** Feature
- **Impact:** Would improve security (helper scripts for API keys)
### #2884: Muse mode shell blocked
- **Status:** Open (April 7, 2026)
- **Type:** Bug
- **Impact:** Blocks usage of muse agent for planning
---
## Historical Issues (Now Fixed)
### #2813 (Fixed)
- Fix referenced by a maintainer in a Reddit response
- **Source:** Reddit r/ClaudeCode
### #2485: Installation issues on Mac
- **Symptoms:** Oh My Zsh not found, terminal configuration issues
- **Resolution:** Install Oh My Zsh separately
### #1296: Daily FORGE limit stops tasks mid-execution
- **Problem:** Cannot switch providers when daily limit reached
- **Impact:** Context built up is lost
- **Status:** Open (feature request)
---
## Model-Specific Issues
### GPT 5.4
- **Tool calling reliability:** Improved via schema reordering
- **Status:** Workarounds implemented
### Qwen 3.5
- **Multiple system messages:** Open issue #2894
- **Tool calling format:** Use LM Studio 0.4.9+ for better compatibility
### Gemma 4
- **Tool calling:** Requires latest llama.cpp/oMLX
- **Status:** Resolved with backend updates
---
## Privacy/Security Issues
### #1318: Telemetry concerns
- **Collection:** Git emails, SSH directory scans, conversation data
- **Mitigation:** `FORGE_TRACKER=false`
- **Status:** Documented mitigation available
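To make the documented mitigation persistent, the variable can be exported from the shell profile:
```bash
# Disable ForgeCode telemetry for all future ZSH sessions
echo 'export FORGE_TRACKER=false' >> ~/.zshrc
source ~/.zshrc
```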
### #1317: Related privacy concerns
- **Linked to:** Discussion #2545
---
## ZSH/Terminal Issues
### Shell Integration
- **Issue:** ZSH aliases don't work in interactive mode (by design)
- **Solution:** Use `:` sentinel from native ZSH session
### Oh My Zsh
- **Status:** Not strictly required, but recommended
- **Behavior:** The install script warns if it is not present
### Ghostty Terminal
- **Issue:** #2893 - Output disappears on resize
- **Status:** Under investigation
---
## Installation Issues
### macOS
- **Common:** iTerm + Oh My Zsh configuration issues
- **Fix:** Run `forge zsh doctor` and `forge zsh setup`
### Windows
- **Support:** Via WSL or Git Bash only
- **Native:** Not officially supported
### Linux
- **Best supported platform**
- **Android:** Also supported
---
## Issue Resolution Tips
From documentation:
```bash
forge zsh doctor # Check environment
forge zsh setup # Re-run ZSH integration
```
---
## Source References
1. **GitHub Issues:** https://github.com/antinomyhq/forgecode/issues
2. **GitHub Discussions:** https://github.com/antinomyhq/forgecode/discussions
3. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/
@@ -0,0 +1,203 @@
# Installation & Platform Issues - Feedback Report
**Topic:** Setup problems, platform compatibility, requirements
**Source References:** GitHub issues, ForgeCode docs, Reddit
**Date Compiled:** April 9, 2026
---
## Supported Platforms
### Officially Supported
- **macOS:** Full support
- **Linux:** Best support
- **Android:** Supported
- **Windows:** Via WSL or Git Bash only
### Not Supported
- **Native Windows:** Not officially supported
---
## Installation Methods
### Method 1: YOLO Install (Recommended)
```bash
curl -fsSL https://forgecode.dev/cli | sh
```
### Method 2: Nix
```bash
nix run github:antinomyhq/forge
```
### Method 3: NPM
```bash
npx forgecode@latest
```
---
## Common Installation Issues
### Issue #2485: Mac Installation Problems
**Symptoms:**
- Oh My Zsh not found
- Terminal configuration issues
- Shell environment problems
**Environment Reported:**
- Shell: zsh 5.9
- Terminal: iTerm.app 3.6.8
- Oh My Zsh: Not installed
**Solution:**
```bash
# Install Oh My Zsh first
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"
# Then re-run forge setup
forge zsh setup
```
### Terminal Requirements
#### Required: Nerd Font
- **Purpose:** Icon display
- **Recommended:** FiraCode Nerd Font
- **Verification:** Icons should display without overlap during setup
#### Recommended Terminals
- iTerm2 (macOS)
- Ghostty (macOS) - NOTE: Has resize bug (#2893)
- Any modern Linux terminal
---
## ZSH Integration Issues
### Interactive Mode Isolation
**Design:** ForgeCode's interactive mode runs in an isolated environment
**Impact:**
- ZSH aliases don't work inside interactive mode
- Custom functions unavailable
- Shell tooling not accessible
**Solution:** Use `:` sentinel from native ZSH session instead
### Tab Completion
**Requirements:**
- `fd` (file finder)
- `fzf` (fuzzy finder)
**Usage:**
```bash
:<TAB> # Open command list
@file<TAB> # Fuzzy file picker
```
**Fallback:** Use full path with brackets: `@[src/components/Header.tsx]`
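Installing the two helpers is a quick sketch on common platforms (note that Debian/Ubuntu packages `fd` as `fd-find`, with the binary named `fdfind`):
```bash
# macOS (Homebrew)
brew install fd fzf
# Debian/Ubuntu
sudo apt install fd-find fzf
```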
---
## Platform-Specific Notes
### macOS
**Best Practices:**
- Use iTerm2 or Ghostty
- Install Oh My Zsh for best experience
- Enable Nerd Font in terminal preferences
**Troubleshooting:**
```bash
forge zsh doctor # Check setup
forge zsh setup # Reconfigure
```
### Linux
**Advantages:**
- Best performance for local models
- Native ZSH support
- Package manager availability
**Tips:**
- Use system package manager when available
- Check `htop` for resource monitoring
### Windows
**Limitations:**
- No native support
- Must use WSL or Git Bash
**WSL Recommendation:**
- Ubuntu 22.04+ recommended
- Install ZSH within WSL
- Windows Terminal for best experience
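A minimal setup along those lines, assuming a recent Windows build with WSL available:
```bash
# From an elevated PowerShell: install Ubuntu 22.04 under WSL
wsl --install -d Ubuntu-22.04
# Then, inside the Ubuntu shell: install ZSH and make it the default
sudo apt update && sudo apt install -y zsh
chsh -s "$(which zsh)"
```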
### Android
**Status:** Supported but limited documentation
**Use Case:** Primarily for remote development scenarios
---
## Verification Steps
### Post-Installation Checklist
1. **Run doctor:**
```bash
forge zsh doctor
```
2. **Verify icons:**
- Should display without overlap
- Check during interactive setup
3. **Test basic commands:**
```bash
: hi
:new
:agent
```
4. **Configure provider:**
```bash
forge provider login
```
---
## Open Issues
### #2893: Ghostty Terminal Resize Bug
- **Problem:** Terminal output disappears on window resize
- **Status:** Open, 1 linked PR
- **Workaround:** Avoid resizing or use different terminal
### #2884: Muse Mode Shell Blocked
- **Problem:** Cannot use muse agent
- **Status:** Open
- **Impact:** Planning workflow blocked
---
## Resource Requirements
### Minimum
- **RAM:** 4GB (for cloud models)
- **Disk:** 500MB
- **Shell:** ZSH 5.0+
### For Local Models
- **RAM:** 16GB+ recommended
- **GPU:** Optional but recommended for larger models
- **Storage:** 10GB+ for model downloads
---
## Source References
1. **GitHub Issue #2485:** https://github.com/antinomyhq/forgecode/issues/2485
2. **GitHub Issue #2893:** https://github.com/antinomyhq/forgecode/issues/2893
3. **ForgeCode Docs:** https://forgecode.dev/docs/installation/
4. **ZSH Support:** https://forgecode.dev/docs/zsh-support/
@@ -0,0 +1,99 @@
# MiniMax, GLM, DeepSeek with ForgeCode - Feedback Report
**Models:** MiniMax M2/M2.1, GLM-4.6, DeepSeek-V3
**Source References:** llm-stats.com, ForgeCode Blog
**Date Compiled:** April 9, 2026
---
## MiniMax M2.1
### Performance
- **Terminal-Bench Score:** 47.9% (Rank #2 on independent leaderboard)
- **Parameters:** 230B
- **Context:** 1.0M tokens
- **Cost:** $0.30 / $1.20 per million tokens
### Value Proposition
- **Best cost-performance ratio** among top performers
- Near-SOTA performance at entry-level pricing
- Massive 1.0M context window
### ForgeCode Usage
- Well-supported via OpenRouter
- Good tool calling reliability
- Recommended for budget-conscious users
---
## GLM-4.6 (Zhipu AI)
### Performance
- **Terminal-Bench Score:** 40.5% (Rank #7)
- **Parameters:** 357B
- **Context:** 131K tokens
- **Cost:** $0.55 / $2.19 per million tokens
### Characteristics
- Open weights
- Competitive with proprietary models at similar price point
- Good context length (131K)
---
## DeepSeek Models
### DeepSeek-V3.2-Exp
- **Terminal-Bench Score:** 37.7% (Rank #10)
- **Status:** Experimental
- **Note:** Results from llm-stats.com
### DeepSeek-V3.1
- **Terminal-Bench Score:** 31.3% (Rank #16)
- **Parameters:** 671B
- **Observation:** Large parameter count doesn't translate to top-tier performance
### DeepSeek-R1-0528
- **Terminal-Bench Score:** 5.7% (Rank #23 - lowest)
- **Parameters:** 671B
- **Note:** Reasoning model may not be optimized for terminal tasks
---
## Key Insights
### Scale ≠ Performance
- Kimi K2 (1.0T parameters) underperforms smaller models
- DeepSeek-V3.1 (671B) scores lower than MiniMax M2.1 (230B)
- **Quality of architecture > raw parameter count**
### Cost-Performance Leaders
1. **MiniMax M2.1:** 47.9% at $0.30/$1.20
2. **LongCat-Flash-Lite:** Budget option at $0.10/$0.40 (score not in top 10)
### Context Window Comparison
| Model | Context | Rank |
|-------|---------|------|
| MiniMax M2.1 | 1.0M | #2 |
| Claude Opus 4.1 | 200K | #5 |
| GLM-4.6 | 131K | #7 |
---
## Recommendations
### For Budget + Performance
**MiniMax M2.1** - Best value proposition
### For Open Weights
**GLM-4.6** or **MiniMax M2** - Both open, strong performance
### For Research
Avoid DeepSeek-R1 for terminal tasks; use V3 variants instead
---
## Source References
1. **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
2. **ForgeCode Blog:** Model comparison series
@@ -0,0 +1,55 @@
# Qwen 3.5 with ForgeCode - Feedback Report
**Model:** Qwen 3.5
**Provider:** Alibaba Cloud (via local inference)
**Harness:** ForgeCode
**Source References:** GitHub Issue #2894, Reddit r/LocalLLaMA
**Date Compiled:** April 9, 2026
---
## Known Issues
### Multiple System Messages Bug
**GitHub Issue:** #2894 (Open as of April 8, 2026)
**Problem:** Multiple system messages break models with strict chat templates (e.g., Qwen3.5)
**Error Manifestation:**
- Models with strict chat templates fail to parse message structure correctly
- Tool calling may fail or produce incorrect results
- Agent behavior becomes unpredictable
**Impact:**
- Affects local inference with llama.cpp, Ollama, and similar servers
- Qwen3.5 specifically mentioned as affected
**Workaround Status:** No official fix yet; issue under investigation
---
## Tool Calling with Qwen Models
### General Observations from Community
1. **Qwen3-Coder Next** shows promise as "first usable coding model < 60GB"
2. **Tool calling reliability varies** by inference backend:
- LM Studio 0.4.9 reportedly handles Qwen3.5 XML tool parsing more reliably than raw llama.cpp
- llama.cpp with `--jinja` flag helps with tool calling
3. **`finish_reason` handling** is a recurring annoyance to debug, according to community reports
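For example, a llama.cpp server invocation using the `--jinja` flag mentioned above (model path and context size are placeholders):
```bash
# --jinja applies the model's embedded chat template, which improves
# tool-call formatting for Qwen-family models
llama-server -m qwen3.5-q8_0.gguf --jinja -c 32768
```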
---
## Recommendations for Local Use
1. **Use LM Studio** for more reliable tool parsing vs raw llama.cpp
2. **Monitor system message count** - known issue with ForgeCode's multi-message approach
3. **Test thoroughly** before relying on Qwen 3.5 for production tasks via ForgeCode
---
## Source References
1. **GitHub Issue:** https://github.com/antinomyhq/forgecode/issues/2894
2. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/
@@ -0,0 +1,111 @@
# Tool Calling Reliability with ForgeCode - Feedback Report
**Topic:** Tool use reliability, function calling, common errors
**Source References:** ForgeCode Blog, GitHub issues, Reddit
**Date Compiled:** April 9, 2026
---
## Overview
Tool calling reliability is a major differentiator for ForgeCode. The harness has implemented numerous optimizations to improve tool use across models.
---
## The Seven Failure Modes (From ForgeCode Blog)
### 1. Same Model, Very Different Performance
**Problem:** Interactive-first design fails in benchmarks (no user to answer questions)
**Fix:** Non-Interactive Mode with rewritten system prompts
### 2. Tool Descriptions Don't Guarantee Correctness
**Problem Categories:**
- Wrong tool selected (e.g., `shell` instead of structured `edit`)
- Correct tool, wrong argument names
- Correct tool, correct arguments, wrong sequencing
**Fix:** Targeted micro-evals isolating each class per tool, per model
### 3. Tool Naming is a Reliability Variable
**Key Finding:** Models pattern-match against training data first
**Concrete Example:**
- Renaming edit tool arguments to `old_string` and `new_string`
- Result: "measurably dropped tool-call error rates immediately—same model, same prompt"
> "If you are seeing persistent tool-call errors and attribute them entirely to model capability, check your naming first."
### 4. Context Size is a Multiplier, Not a Substitute
**Problem:** More context only helps after finding the right entry point
**Insight:** Entry-point discovery latency is the bottleneck
### 5. Time Limits Punish Trajectories
**Problem:** Failed tool calls burn real seconds; brilliant but meandering paths time out
**Fix:** Speed architecture with parallel subagents
### 6. Planning Tools Only Work if Enforced
**Problem:** Optional `todo_write` tool ignored under pressure
**Fix:** Made mandatory via low-level evals
**Result:** 38% → 66% pass rate
### 7. TermBench is More About Speed Than Intelligence
**Fix:** Progressive thinking policy (high thinking early, low during execution)
---
## Model-Specific Tool Calling Issues
### GPT 5.4
- **Issue:** Persistent tool-call errors
- **Fixes Applied:**
- Reordered JSON schema fields (`required` before `properties`)
- Flattened nested schemas
- Explicit truncation reminders
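The reordering and flattening look roughly like this (hypothetical tool schema; key order matters only because the serialized schema is part of what the model sees, with `required` emitted before `properties` and nested argument objects hoisted to flat top-level fields):
```json
{
  "type": "object",
  "required": ["path", "content"],
  "properties": {
    "path": { "type": "string" },
    "content": { "type": "string" }
  }
}
```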
### Qwen 3.5
- **Issue:** Multiple system messages break strict chat templates
- **Status:** Open issue (#2894)
- **Workaround:** None yet; use different model or await fix
### Gemma 4
- **Issue:** Initial releases had tool calling format issues
- **Fix:** Use latest oMLX / llama.cpp
---
## Best Practices for Tool Reliability
1. **Use established argument names:** `old_string`/`new_string` better than generic names
2. **Flatten schemas:** Reduce nesting in tool definitions
3. **Order matters:** Put `required` before `properties` in JSON schema
4. **Test with micro-evals:** Isolate specific tool+model combinations
5. **Monitor truncation:** Add explicit reminders when files partially read
---
## ForgeCode Services Enhancements
The proprietary runtime layer includes:
1. **Semantic entry-point discovery:** Lightweight semantic pass before exploration
2. **Dynamic skill loading:** Specialized instructions loaded when needed
3. **Tool-call correction layer:** Heuristic + static analysis for argument validation
**Note:** These features are part of ForgeCode Services (optional), not the open-source CLI.
---
## Community Tips
From Reddit and GitHub discussions:
1. **LM Studio > raw llama.cpp** for Qwen3.5 XML tool parsing
2. **LM Studio 0.4.9+** handles tool calling more reliably
3. **llama.cpp `--jinja` flag** helps with Qwen tool templates
---
## Source References
1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
2. **GitHub Issue #2894:** https://github.com/antinomyhq/forgecode/issues/2894
3. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/