Initial commit: coding harness feedback analysis
Harnesses under analysis:

- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:

- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
# Local/Small Models with ForgeCode - General Feedback

**Scope:** Local LLMs via Ollama, llama.cpp, LM Studio, etc.
**Harness:** ForgeCode
**Source References:** Reddit r/LocalLLaMA, r/LocalLLM, GitHub issues
**Date Compiled:** April 9, 2026

---

## Key Challenges for Local Models

### 1. Tool Calling Format Issues

**Problem:** Many local models struggle with tool calling formats.

**Evidence:**
- Gemma 4 initial releases had tool calling format issues with harnesses
- Qwen3.5 has issues with multiple system messages
- Various models require specific inference backends for reliable tool use

**Recommendation:** Use the latest versions of inference backends:
- oMLX / llama.cpp (latest) for Gemma 4
- LM Studio 0.4.9+ for Qwen3.5
- Unsloth fixes for Qwen3-Coder tool calling

### 2. Context Window Configuration

**Default Issues:**
- Ollama runs Qwen3 with a 4K context window by default (too small for agentic work)
- The context size must be increased explicitly

**Fix:**
```bash
# Ollama: bake a larger context into a derived model via a Modelfile
# (num_ctx 32768 assumes you have the RAM for it; adjust to your hardware)
ollama show qwen3 --modelfile > Modelfile
echo "PARAMETER num_ctx 32768" >> Modelfile
ollama create qwen3-32k -f Modelfile
# llama.cpp: pass the context size directly with the -c flag
llama-server -m qwen3.gguf -c 32768
```

### 3. Quantization Quality

**Observation:** Default quantization is often insufficient for reliable tool use.

**Fix:**
- Try a higher-quality quantization (e.g., `:q8_0` for 8-bit instead of the default Q4_K_M)
- Trade-off: more RAM usage, but better output quality

### 4. Model Size Recommendations

From community feedback:
- **< 7B models:** Generally insufficient for reliable agentic tool use
- **7B-14B:** Minimum viable for simple tasks
- **30B+:** Recommended for serious coding work
- **MoE models (Qwen3-Coder 480B-A35B):** Good performance, but requires significant RAM

---

## Specific Model Notes

### Qwen3-Coder Next
- **Status:** "First usable coding model < 60GB" according to user reports
- **Workflow tip:** Compress context after each bug fix/feature, then reload
- **Important:** Limit context size in settings.json to prevent overflow

### Gemma 4
- **Requirement:** Latest oMLX / llama.cpp for tool calling
- **Recommendation:** The 26B MoE variant is a good fit for limited-RAM setups

### Mistral 7B
- **Alternative:** Consider when Qwen 2.5 14B uses too much RAM
- **Trade-off:** Smaller footprint, but potentially less capable

---

## Platform-Specific Notes

### Apple Silicon (M-series)
- **Observation:** "Silent, very power efficient, good speeds"
- **Limitation:** Prompt processing is slower than on NVIDIA GPUs
- **Alternative:** Some users currently prefer LM Studio with the MLX backend over Ollama

### Linux
- Best support and performance for local inference
- `htop` recommended for monitoring RAM usage

---

## General Best Practices

1. **Close other applications** to free RAM before running local models
2. **Monitor context usage** - some UIs report usage above 100% while still appearing to work
3. **Update regularly** - inference backends fix tool calling issues frequently
4. **Test thoroughly** - local model behavior varies significantly by quant and backend
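
A crude way to keep an eye on context usage before sending a prompt is a character-count heuristic (roughly four characters per token for English text). This is a back-of-the-envelope sketch only, not the model's actual tokenizer:

```shell
# Rough token estimate: ~4 characters per token for English text.
# Heuristic only; the model's real tokenizer will count differently.
printf 'Fix the bug in src/main.go and add a regression test.' > prompt.txt
chars=$(wc -c < prompt.txt)
echo $(( chars / 4 ))   # prints 13 for this 53-character prompt
```

Compare the estimate against the configured context window (minus room for the system prompt and tool output) before a long session.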

---

## Source References

1. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1qz5uww/qwen3_coder_next_as_first_usable_coding_model_60/
2. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/
3. **Reddit r/LocalLLM:** https://www.reddit.com/r/LocalLLM/comments/1sf5aqy/how_are_people_using_local_llms_for_coding/
4. **llama.cpp Discussion:** https://github.com/ggml-org/llama.cpp/discussions/4167

# GitHub Issues Summary for ForgeCode

**Scope:** Open and recently closed issues affecting model performance
**Repository:** antinomyhq/forgecode
**Stats:** 48 open, 433 closed (as of April 9, 2026)
**Date Compiled:** April 9, 2026

---

## Critical Open Issues

### #2904: Use models.dev as LLM model registry source
- **Status:** Open (April 9, 2026)
- **Type:** Enhancement
- **Impact:** Would improve model discovery and configuration

### #2894: Multiple system messages break models with strict chat templates (e.g. Qwen3.5)
- **Status:** Open (April 8, 2026)
- **Type:** Bug
- **Impact:** Breaks local models with strict chat templates
- **Affected Models:** Qwen3.5, potentially others
- **Workaround:** None yet

### #2893: Terminal output disappears on window resize in Ghostty
- **Status:** Open (April 8, 2026)
- **Type:** Bug
- **Impact:** UI/usability issue
- **Linked PR:** One linked PR

### #2888: Add support for API key helpers
- **Status:** Open (April 8, 2026)
- **Type:** Feature
- **Impact:** Would improve security (helper scripts for API keys)

### #2884: Muse mode shell blocked
- **Status:** Open (April 7, 2026)
- **Type:** Bug
- **Impact:** Blocks use of the muse agent for planning

---

## Historical Issues (Now Fixed)

### #2813 (Fixed)
- Fix referenced by a maintainer in a Reddit response
- **Source:** Reddit r/ClaudeCode

### #2485: Installation issues on Mac
- **Symptoms:** Oh My Zsh not found, terminal configuration issues
- **Resolution:** Install Oh My Zsh separately

### #1296: Daily FORGE limit stops tasks mid-execution
- **Problem:** Cannot switch providers when the daily limit is reached
- **Impact:** Built-up context is lost
- **Status:** Open (feature request)

---

## Model-Specific Issues

### GPT 5.4
- **Tool calling reliability:** Improved via schema reordering
- **Status:** Workarounds implemented

### Qwen 3.5
- **Multiple system messages:** Open issue #2894
- **Tool calling format:** Use LM Studio 0.4.9+ for better compatibility

### Gemma 4
- **Tool calling:** Requires latest llama.cpp/oMLX
- **Status:** Resolved with backend updates

---

## Privacy/Security Issues

### #1318: Telemetry concerns
- **Collection:** Git emails, SSH directory scans, conversation data
- **Mitigation:** `FORGE_TRACKER=false`
- **Status:** Documented mitigation available

### #1317: Related privacy concerns
- **Linked to:** Discussion #2545

---

## ZSH/Terminal Issues

### Shell Integration
- **Issue:** ZSH aliases don't work in interactive mode (by design)
- **Solution:** Use the `:` sentinel from a native ZSH session

### Oh My Zsh
- **Requirement:** Not strictly required, but recommended
- **Error:** The install script warns if it is not present

### Ghostty Terminal
- **Issue:** #2893 - output disappears on resize
- **Status:** Under investigation

---

## Installation Issues

### macOS
- **Common:** iTerm + Oh My Zsh configuration issues
- **Fix:** Run `forge zsh doctor` and `forge zsh setup`

### Windows
- **Support:** Via WSL or Git Bash only
- **Native:** Not officially supported

### Linux
- **Best supported platform**
- **Android:** Also supported

---

## Issue Resolution Tips

From the documentation:
```bash
forge zsh doctor   # Check environment
forge zsh setup    # Re-run ZSH integration
```

---

## Source References

1. **GitHub Issues:** https://github.com/antinomyhq/forgecode/issues
2. **GitHub Discussions:** https://github.com/antinomyhq/forgecode/discussions
3. **Reddit r/ClaudeCode:** https://www.reddit.com/r/ClaudeCode/comments/1royhni/someone_is_using_forgecodedev/

# Installation & Platform Issues - Feedback Report

**Topic:** Setup problems, platform compatibility, requirements
**Source References:** GitHub issues, ForgeCode docs, Reddit
**Date Compiled:** April 9, 2026

---

## Supported Platforms

### Officially Supported
- **macOS:** Full support
- **Linux:** Best support
- **Android:** Supported
- **Windows:** Via WSL or Git Bash only

### Not Supported
- **Native Windows:** Not officially supported

---

## Installation Methods

### Method 1: YOLO Install (Recommended)
```bash
curl -fsSL https://forgecode.dev/cli | sh
```

### Method 2: Nix
```bash
nix run github:antinomyhq/forge
```

### Method 3: NPM
```bash
npx forgecode@latest
```

---

## Common Installation Issues

### Issue #2485: Mac Installation Problems
**Symptoms:**
- Oh My Zsh not found
- Terminal configuration issues
- Shell environment problems

**Environment Reported:**
- Shell: zsh 5.9
- Terminal: iTerm.app 3.6.8
- Oh My Zsh: Not installed

**Solution:**
```bash
# Install Oh My Zsh first
sh -c "$(curl -fsSL https://raw.githubusercontent.com/ohmyzsh/ohmyzsh/master/tools/install.sh)"

# Then re-run forge setup
forge zsh setup
```

### Terminal Requirements

#### Required: Nerd Font
- **Purpose:** Icon display
- **Recommended:** FiraCode Nerd Font
- **Verification:** Icons should display without overlap during setup

#### Recommended Terminals
- iTerm2 (macOS)
- Ghostty (macOS) - NOTE: has a resize bug (#2893)
- Any modern Linux terminal

---

## ZSH Integration Issues

### Interactive Mode Isolation
**Design:** ForgeCode's interactive mode runs in an isolated environment.

**Impact:**
- ZSH aliases don't work inside interactive mode
- Custom functions are unavailable
- Shell tooling is not accessible

**Solution:** Use the `:` sentinel from a native ZSH session instead.

### Tab Completion
**Requirements:**
- `fd` (file finder)
- `fzf` (fuzzy finder)

**Usage:**
```bash
:<TAB>        # Open command list
@file<TAB>    # Fuzzy file picker
```

**Fallback:** Use the full path with brackets: `@[src/components/Header.tsx]`

---

## Platform-Specific Notes

### macOS
**Best Practices:**
- Use iTerm2 or Ghostty
- Install Oh My Zsh for the best experience
- Enable a Nerd Font in terminal preferences

**Troubleshooting:**
```bash
forge zsh doctor   # Check setup
forge zsh setup    # Reconfigure
```

### Linux
**Advantages:**
- Best performance for local models
- Native ZSH support
- Package manager availability

**Tips:**
- Use the system package manager when available
- Check `htop` for resource monitoring

### Windows
**Limitations:**
- No native support
- Must use WSL or Git Bash

**WSL Recommendation:**
- Ubuntu 22.04+ recommended
- Install ZSH within WSL
- Windows Terminal for the best experience

### Android
**Status:** Supported, but documentation is limited
**Use Case:** Primarily remote development scenarios

---

## Verification Steps

### Post-Installation Checklist
1. **Run doctor:**
   ```bash
   forge zsh doctor
   ```

2. **Verify icons:**
   - Should display without overlap
   - Check during interactive setup

3. **Test basic commands:**
   ```bash
   : hi
   :new
   :agent
   ```

4. **Configure provider:**
   ```bash
   forge provider login
   ```

---

## Open Issues

### #2893: Ghostty Terminal Resize Bug
- **Problem:** Terminal output disappears on window resize
- **Status:** Open, 1 linked PR
- **Workaround:** Avoid resizing, or use a different terminal

### #2884: Muse Mode Shell Blocked
- **Problem:** Cannot use the muse agent
- **Status:** Open
- **Impact:** Planning workflow blocked

---

## Resource Requirements

### Minimum
- **RAM:** 4GB (for cloud models)
- **Disk:** 500MB
- **Shell:** ZSH 5.0+

### For Local Models
- **RAM:** 16GB+ recommended
- **GPU:** Optional, but recommended for larger models
- **Storage:** 10GB+ for model downloads

---

## Source References

1. **GitHub Issue #2485:** https://github.com/antinomyhq/forgecode/issues/2485
2. **GitHub Issue #2893:** https://github.com/antinomyhq/forgecode/issues/2893
3. **ForgeCode Docs:** https://forgecode.dev/docs/installation/
4. **ZSH Support:** https://forgecode.dev/docs/zsh-support/

# MiniMax, GLM, DeepSeek with ForgeCode - Feedback Report

**Models:** MiniMax M2/M2.1, GLM-4.6, DeepSeek-V3
**Source References:** llm-stats.com, ForgeCode Blog
**Date Compiled:** April 9, 2026

---

## MiniMax M2.1

### Performance
- **Terminal-Bench Score:** 47.9% (rank #2 on the independent leaderboard)
- **Parameters:** 230B
- **Context:** 1.0M tokens
- **Cost:** $0.30 / $1.20 per million tokens (input/output)

### Value Proposition
- **Best cost-performance ratio** among top performers
- Near-SOTA performance at entry-level pricing
- Massive 1.0M-token context window

### ForgeCode Usage
- Well supported via OpenRouter
- Good tool calling reliability
- Recommended for budget-conscious users

---

## GLM-4.6 (Zhipu AI)

### Performance
- **Terminal-Bench Score:** 40.5% (rank #7)
- **Parameters:** 357B
- **Context:** 131K tokens
- **Cost:** $0.55 / $2.19 per million tokens

### Characteristics
- Open weights
- Competitive with proprietary models at a similar price point
- Good context length (131K)

---

## DeepSeek Models

### DeepSeek-V3.2-Exp
- **Terminal-Bench Score:** 37.7% (rank #10)
- **Status:** Experimental
- **Note:** Results from llm-stats.com

### DeepSeek-V3.1
- **Terminal-Bench Score:** 31.3% (rank #16)
- **Parameters:** 671B
- **Observation:** A large parameter count doesn't translate to top-tier performance

### DeepSeek-R1-0528
- **Terminal-Bench Score:** 5.7% (rank #23 - lowest)
- **Parameters:** 671B
- **Note:** A reasoning model may not be optimized for terminal tasks

---

## Key Insights

### Scale ≠ Performance
- Kimi K2 (1.0T parameters) underperforms smaller models
- DeepSeek-V3.1 (671B) scores lower than MiniMax M2.1 (230B)
- **Quality of architecture > raw parameter count**

### Cost-Performance Leaders
1. **MiniMax M2.1:** 47.9% at $0.30/$1.20
2. **LongCat-Flash-Lite:** Budget option at $0.10/$0.40 (score not in top 10)

### Context Window Comparison

| Model | Context | Rank |
|-------|---------|------|
| MiniMax M2.1 | 1.0M | #2 |
| Claude Opus 4.1 | 200K | #5 |
| GLM-4.6 | 131K | #7 |

---

## Recommendations

### For Budget + Performance
**MiniMax M2.1** - best value proposition

### For Open Weights
**GLM-4.6** or **MiniMax M2** - both open, with strong performance

### For Research
Avoid DeepSeek-R1 for terminal tasks; use the V3 variants instead.

---

## Source References

1. **llm-stats.com:** https://llm-stats.com/benchmarks/terminal-bench
2. **ForgeCode Blog:** Model comparison series

# Qwen 3.5 with ForgeCode - Feedback Report

**Model:** Qwen 3.5
**Provider:** Alibaba Cloud (via local inference)
**Harness:** ForgeCode
**Source References:** GitHub Issue #2894, Reddit r/LocalLLaMA
**Date Compiled:** April 9, 2026

---

## Known Issues

### Multiple System Messages Bug
**GitHub Issue:** #2894 (open as of April 8, 2026)

**Problem:** Multiple system messages break models with strict chat templates (e.g., Qwen3.5).

**Error Manifestation:**
- Models with strict chat templates fail to parse the message structure correctly
- Tool calling may fail or produce incorrect results
- Agent behavior becomes unpredictable

**Impact:**
- Affects local inference with llama.cpp, Ollama, and similar servers
- Qwen3.5 is specifically mentioned as affected

**Workaround Status:** No official fix yet; the issue is under investigation.

---

## Tool Calling with Qwen Models

### General Observations from the Community

1. **Qwen3-Coder Next** shows promise as the "first usable coding model < 60GB"
2. **Tool calling reliability varies** by inference backend:
   - LM Studio 0.4.9 reportedly handles Qwen3.5 XML tool parsing more reliably than raw llama.cpp
   - llama.cpp's `--jinja` flag helps with tool calling
3. **`finish_reason` issues** are reportedly difficult to debug

---

## Recommendations for Local Use

1. **Use LM Studio** for more reliable tool parsing than raw llama.cpp
2. **Monitor the system message count** - a known issue with ForgeCode's multi-message approach
3. **Test thoroughly** before relying on Qwen 3.5 for production tasks via ForgeCode

---

## Source References

1. **GitHub Issue:** https://github.com/antinomyhq/forgecode/issues/2894
2. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/

# Tool Calling Reliability with ForgeCode - Feedback Report

**Topic:** Tool use reliability, function calling, common errors
**Source References:** ForgeCode Blog, GitHub issues, Reddit
**Date Compiled:** April 9, 2026

---

## Overview

Tool calling reliability is a major differentiator for ForgeCode. The harness has implemented numerous optimizations to improve tool use across models.

---

## The Seven Failure Modes (From the ForgeCode Blog)

### 1. Same Model, Very Different Performance
**Problem:** Interactive-first design fails in benchmarks (there is no user to answer questions)
**Fix:** Non-Interactive Mode with rewritten system prompts

### 2. Tool Descriptions Don't Guarantee Correctness
**Problem Categories:**
- Wrong tool selected (e.g., `shell` instead of the structured `edit`)
- Correct tool, wrong argument names
- Correct tool, correct arguments, wrong sequencing

**Fix:** Targeted micro-evals isolating each failure class per tool, per model

### 3. Tool Naming is a Reliability Variable
**Key Finding:** Models pattern-match against their training data first

**Concrete Example:**
- Renaming the edit tool's arguments to `old_string` and `new_string`
- Result: "measurably dropped tool-call error rates immediately - same model, same prompt"

> "If you are seeing persistent tool-call errors and attribute them entirely to model capability, check your naming first."

### 4. Context Size is a Multiplier, Not a Substitute
**Problem:** More context only helps after the model finds the right entry point
**Insight:** Entry-point discovery latency is the bottleneck

### 5. Time Limits Punish Trajectories
**Problem:** Failed tool calls burn real seconds; brilliant but meandering paths time out
**Fix:** Speed architecture with parallel subagents

### 6. Planning Tools Only Work if Enforced
**Problem:** The optional `todo_write` tool is ignored under pressure
**Fix:** Made mandatory via low-level evals
**Result:** 38% → 66% pass rate

### 7. TermBench is More About Speed Than Intelligence
**Fix:** Progressive thinking policy (high thinking early, low during execution)

---

## Model-Specific Tool Calling Issues

### GPT 5.4
- **Issue:** Persistent tool-call errors
- **Fixes Applied:**
  - Reordered JSON schema fields (`required` before `properties`)
  - Flattened nested schemas
  - Explicit truncation reminders

### Qwen 3.5
- **Issue:** Multiple system messages break strict chat templates
- **Status:** Open issue (#2894)
- **Workaround:** None yet; use a different model or await a fix

### Gemma 4
- **Issue:** Initial releases had tool calling format issues
- **Fix:** Use the latest oMLX / llama.cpp

---

## Best Practices for Tool Reliability

1. **Use established argument names:** `old_string`/`new_string` work better than generic names
2. **Flatten schemas:** Reduce nesting in tool definitions
3. **Order matters:** Put `required` before `properties` in the JSON schema
4. **Test with micro-evals:** Isolate specific tool+model combinations
5. **Monitor truncation:** Add explicit reminders when files are partially read

---

## ForgeCode Services Enhancements

The proprietary runtime layer includes:
1. **Semantic entry-point discovery:** A lightweight semantic pass before exploration
2. **Dynamic skill loading:** Specialized instructions loaded when needed
3. **Tool-call correction layer:** Heuristics plus static analysis for argument validation

**Note:** These features are part of ForgeCode Services (optional), not the open-source CLI.

---

## Community Tips

From Reddit and GitHub discussions:

1. **LM Studio > raw llama.cpp** for Qwen3.5 XML tool parsing
2. **LM Studio 0.4.9+** handles tool calling more reliably
3. **llama.cpp's `--jinja` flag** helps with Qwen tool templates

---

## Source References

1. **ForgeCode Blog:** https://forgecode.dev/blog/benchmarks-dont-matter/
2. **GitHub Issue #2894:** https://github.com/antinomyhq/forgecode/issues/2894
3. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/