# ForgeCode Repository Feedback Analysis

Date: April 9, 2026
Scope: Analysis of the forgecode codebase for local model compatibility
Focus Areas: Prompts, tools, parsing, skills
Model Focus: Local models (Qwen 3.5, Gemma 4, MiniMax, GLM, DeepSeek)
## Executive Summary

ForgeCode has a sophisticated but complex architecture that presents both opportunities and challenges for local models. The harness implements numerous optimizations for tool calling reliability, but many of these rely on infrastructure that may not be available or performant with smaller models.
Key Finding: The harness's tool calling layer is the primary concern for local models, followed by prompt complexity and context management. The skills system is well-designed but adds overhead.
## What Works Well for Local Models

### 1. Modular Prompt Architecture ✅

Evidence:
- Templates are modular and composable (`forge-custom-agent-template.md`, `forge-partial-*.md`)
- System context is re-rendered on each turn (plan: `2025-04-02-system-context-rendering-v2.md`)
- Variables can be passed to prompts
Why This Helps Local Models:
- Smaller prompts = less context pressure
- Re-rendering allows dynamic updates (time, environment)
- Variables enable customization without full prompt rewrites
Strength: Strong - This is well-documented and implemented in the codebase.
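To make the re-rendering point concrete, here is a minimal sketch using the `handlebars` crate. ForgeCode's templating is only described as Handlebars-style, so treat the variable names and API below as illustrative, not the actual `system_prompt.rs` code:

```rust
use handlebars::Handlebars;
use serde_json::json;

fn main() -> Result<(), handlebars::RenderError> {
    // Tiny stand-in for a forge-partial-*.md template; the variable names
    // are hypothetical, not the ones ForgeCode actually passes.
    let partial = "Current time: {{current_time}}\nWorking directory: {{cwd}}";

    let hb = Handlebars::new();

    // Re-rendering on each turn keeps time/environment fresh without
    // rewriting or re-sending the rest of the prompt.
    for turn in 0..2 {
        let rendered = hb.render_template(
            partial,
            &json!({
                "current_time": format!("2026-04-09T12:0{turn}:00Z"),
                "cwd": "/repo"
            }),
        )?;
        println!("--- turn {turn} ---\n{rendered}");
    }
    Ok(())
}
```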
### 2. Tool Schema Normalization ✅

Evidence:
- `normalize_tool_schema.rs` removes duplicate `description` and `title` from parameters
- `enforce_strict_schema.rs` adds `additionalProperties: false` for stricter JSON schema compliance
- `enforce_strict_tool_schema.rs` converts nullable enums to OpenAI-compatible format
Why This Helps Local Models:
- Simplified schemas reduce parsing errors
- Strict schemas are more predictable for smaller models
- Nullable enum handling prevents schema validation failures
Strength: Strong - Multiple transformers ensure schemas are optimized before reaching the model.
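A minimal sketch of what this class of transformer does, using `serde_json`. This is an illustration of the idea, not the actual code in `normalize_tool_schema.rs` or `enforce_strict_schema.rs` (the real transformer only removes fields when they are duplicated):

```rust
use serde_json::{json, Value};

/// Illustrative schema tightening: drop per-property `title`/`description`
/// noise and force `additionalProperties: false` on the parameter object.
fn tighten_schema(schema: &mut Value) {
    if let Some(obj) = schema.as_object_mut() {
        obj.insert("additionalProperties".to_string(), json!(false));
        if let Some(props) = obj.get_mut("properties").and_then(Value::as_object_mut) {
            for (_name, prop) in props.iter_mut() {
                if let Some(p) = prop.as_object_mut() {
                    p.remove("title");
                    p.remove("description");
                }
            }
        }
    }
}

fn main() {
    let mut schema = json!({
        "type": "object",
        "properties": {
            "path": { "type": "string", "title": "path", "description": "path" }
        }
    });
    tighten_schema(&mut schema);
    println!("{}", serde_json::to_string_pretty(&schema).unwrap());
}
```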
### 3. Parallel Tool Calls ✅

Evidence:
- `supports_parallel_tool_calls` flag in `system_prompt.rs`
- Instructions in `forge-custom-agent-template.md`: "invoke all relevant tools simultaneously"
Why This Helps Local Models:
- Reduces total turns needed for multi-step tasks
- Faster task completion = less context accumulation
- Parallelism reduces timeout risk
Strength: Moderate - Depends on model support; local models may not reliably support parallel calls.
### 4. Skills System ✅

Evidence:
- `forge-partial-skill-instructions.md` provides clear invocation pattern
- Skills are loaded dynamically via tool call
- Skills provide domain-specific workflows
Why This Helps Local Models:
- Specialized skills reduce cognitive load on main prompt
- Reusable workflows = less prompt engineering overhead
- Clear invocation pattern (`skill` tool with name only)
Strength: Strong - Well-designed and documented. Skills can be invoked with minimal context.
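For illustration only, a skill invocation under this pattern could look like the payload below; the argument key is an assumption, not verified against the repository:

```rust
use serde_json::json;

fn main() {
    // Hypothetical call to the `skill` tool: a name and nothing else,
    // so the model needs minimal extra context to pull in a workflow.
    let call = json!({
        "name": "skill",
        "arguments": { "name": "rust-refactoring" }
    });
    println!("{call}");
}
```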
## Problematic Areas for Local Models

### 1. Multiple System Messages ❌

Evidence:
- GitHub Issue #2894: "Multiple system messages break models with strict chat templates (e.g. Qwen3.5)"
- `system_prompt.rs` line 128: `context.set_system_messages(vec![static_block, non_static_block])`
- Two system messages are set: `static_block` and `non_static_block`
Impact:
- BREAKS Qwen3.5 and Qwen3 models
- Models with strict chat templates fail to parse message structure
- Tool calling becomes unpredictable
Root Cause: The harness generates two separate system messages:
- `static_block` - from `system_prompt.template`
- `non_static_block` - from `forge-custom-agent-template.md`

These are sent as two distinct system messages rather than being merged into one, which breaks models that expect a single system message.
Strength: Strong - This is a confirmed bug with an open GitHub issue.
Workaround: None yet; use a different model or wait for the fix.
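A minimal sketch of the obvious direction for a fix, merging the two blocks before they are set. The names mirror `system_prompt.rs`, but the function and call site are assumed, not the actual patch:

```rust
/// Merge the static and non-static blocks into one system message so that
/// models with strict chat templates (Qwen3/Qwen3.5) see a single system turn.
fn merge_system_blocks(static_block: &str, non_static_block: &str) -> String {
    format!("{static_block}\n\n{non_static_block}")
}

fn main() {
    let merged = merge_system_blocks(
        "You are ForgeCode, an agentic coding assistant...",
        "Custom agent rules and environment context...",
    );
    // Instead of vec![static_block, non_static_block], set a single message.
    let system_messages = vec![merged];
    println!("{}", system_messages[0]);
}
```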
### 2. Tool Calling Format Complexity ⚠️

Evidence:
- `forge-partial-tool-use-example.md` shows the `<forge_tool_call>` XML wrapper
- Tool calls must be in JSON format inside XML tags
- Example: `<forge_tool_call>{"name": "read", "arguments": {...}}</forge_tool_call>`
Why This Is Problematic:
- Local models trained on varied data may not recognize custom XML wrapper
- Qwen3.5 specifically struggles with XML tool parsing (community feedback)
- LM Studio 0.4.9+ reportedly handles this better than raw llama.cpp
Strength: Moderate - This is a known issue with community workarounds (LM Studio > raw llama.cpp).
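One mitigation on the harness side is lenient extraction: accept the wrapped form but fall back to bare JSON when a model drops the tags. A small sketch (not ForgeCode's actual parser) using `serde_json`:

```rust
use serde_json::Value;

/// Extract a tool call from model output, tolerating a missing XML wrapper.
fn extract_tool_call(output: &str) -> Option<Value> {
    let inner = output
        .find("<forge_tool_call>")
        .and_then(|start| {
            let rest = &output[start + "<forge_tool_call>".len()..];
            rest.find("</forge_tool_call>").map(|end| &rest[..end])
        })
        .unwrap_or(output);
    serde_json::from_str(inner.trim()).ok()
}

fn main() {
    let wrapped = r#"<forge_tool_call>{"name": "read", "arguments": {"path": "src/main.rs"}}</forge_tool_call>"#;
    let bare = r#"{"name": "read", "arguments": {"path": "src/main.rs"}}"#;
    assert!(extract_tool_call(wrapped).is_some());
    assert!(extract_tool_call(bare).is_some());
    println!("both forms parsed");
}
```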
### 3. Context Window Pressure ⚠️

Evidence:
- `system_prompt.rs` includes:
  - Full tool definitions (`tool_information`)
  - File list (`files`)
  - Extension statistics (`extensions`)
  - Custom rules (`custom_rules`)
  - Skills list (`skills`)
  - README content (not shown but referenced)
Impact:
- Local models often have smaller context windows (4K-32K)
- Default Ollama context is 4K (too small)
- Context usage can exceed 100% of the window while the session still appears to work
Strength: Strong - Well-documented in `general-local-models.md`:
- "Ollama/Qwen3 runs with 4K context window by default (too small)"
- "Need explicit configuration to increase context"
### 4. Prompt Complexity ⚠️

Evidence:
- `forge-custom-agent-template.md` is 58 lines with complex rules
- The `non_negotiable_rules` section has 12+ rules with examples
- `forge-command-generator-prompt.md` is 113 lines with 6+ edge case categories
Why This Is Problematic:
- Smaller models (<14B) struggle with long, complex prompts
- Qwen3.5 requires higher-quality quantization for reliable parsing
- Context pressure increases with prompt length
Strength: Moderate - Community feedback suggests:
"30B+ recommended for serious coding work" "<7B models: Generally insufficient for reliable agentic tool use"
### 5. Tool Naming Conventions ⚠️

Evidence:
- `tool-calling-reliability.md`: "Models pattern-match against training data first"
- Renaming the edit tool's parameters to `old_string`/`new_string` "measurably dropped tool-call error rates"
Why This Is Problematic:
- ForgeCode's tool names may not match training data patterns
- Local models rely more on pattern matching than frontier models
- Custom tool names increase error rate
Strength: Moderate - This is a known issue with a known fix (use established names).
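A hedged sketch of what an edit-tool schema using the established parameter names might look like; the actual ForgeCode schema may differ, this only illustrates the naming:

```rust
use serde_json::json;

fn main() {
    // Parameter names models have repeatedly seen in training data
    // (old_string/new_string) instead of harness-specific ones.
    let schema = json!({
        "name": "edit",
        "description": "Replace an exact string in a file",
        "parameters": {
            "type": "object",
            "additionalProperties": false,
            "required": ["path", "old_string", "new_string"],
            "properties": {
                "path": { "type": "string" },
                "old_string": { "type": "string" },
                "new_string": { "type": "string" }
            }
        }
    });
    println!("{}", serde_json::to_string_pretty(&schema).unwrap());
}
```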
## Codebase Quality Assessment

### Good: Architecture & Design

- Transformer Pipeline (`crates/forge_app/src/dto/`)
  - Multiple transformers for different providers (Anthropic, OpenAI, Google)
  - Each transformer is focused and testable
  - Example: `enforce_schema.rs`, `normalize_tool_schema.rs`
- Tool Registry (`tool_registry.rs`)
  - Clear separation of concerns
  - Timeout handling built-in
  - Permission checking before execution
- Template Engine (`system_prompt.rs`)
  - Handlebars-style templating
  - Variables passed to templates
  - Re-rendering on each turn
### Concerning: Complexity

- Multiple Layers of Abstraction
  - `ToolRegistry` → `ToolExecutor` → `ToolCatalog`
  - `SystemPrompt` → `TemplateEngine` → `Template`
  - Each layer adds overhead and potential failure points
- Generic Type Parameters
  - `ToolRegistry<S>` where `S: Services + EnvironmentInfra`
  - Complex trait bounds make debugging harder
  - Local models may struggle with the resulting prompts
- Async Complexity
  - Heavy use of `async`/`await` and tokio `join_all` for parallel tool calls
  - Timeout handling adds latency
## Recommendations for Local Models

### Immediate Fixes (High Priority)

- Fix Multiple System Messages (#2894)
  - Combine `static_block` and `non_static_block` into a single message
  - Or make the second message optional via config
- Add Context Window Config
  - Allow users to specify context window size
  - Default to 32K for local models (not 4K)
- Simplify Tool Call Format
  - Add an option for pure JSON (no XML wrapper)
  - Let users choose based on model compatibility (see the config sketch after this list)
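The three fixes could hang off a single per-model compatibility setting. A sketch of one possible shape, with field names that are hypothetical rather than ForgeCode's actual configuration:

```rust
use serde::Deserialize;

/// Hypothetical per-model compatibility settings covering the three fixes:
/// single system message, explicit context window, and tool-call format.
#[derive(Debug, Deserialize)]
struct LocalModelCompat {
    /// Collapse static_block + non_static_block into one system message (#2894).
    merge_system_messages: bool,
    /// Context window to request from the backend (32K default, not 4K).
    context_window: u32,
    /// "xml" for <forge_tool_call> wrapping, "json" for bare JSON tool calls.
    tool_call_format: String,
}

fn main() {
    let raw = r#"{
        "merge_system_messages": true,
        "context_window": 32768,
        "tool_call_format": "json"
    }"#;
    let cfg: LocalModelCompat = serde_json::from_str(raw).unwrap();
    println!("{cfg:?}");
}
```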
### Medium Priority

- Tool Name Optimization
  - Use established names (`old_string`/`new_string`)
  - Document tool naming conventions for users
- Context Compaction
  - Implement automatic context compression
  - Add a warning when context exceeds 80% (a sketch follows this list)
- Quantization Guidance
  - Document recommended quantizations per model
  - Q8_0 for tool calling, Q4_K_M for basic tasks
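For the 80% warning, a rough sketch; the 4-characters-per-token estimate is a stand-in, not ForgeCode's tokenizer:

```rust
/// Warn when estimated prompt tokens pass 80% of the configured context window.
fn context_usage_warning(prompt: &str, context_window: usize) -> Option<String> {
    let estimated_tokens = prompt.len() / 4; // crude chars-to-tokens estimate
    let usage = estimated_tokens as f64 / context_window as f64;
    (usage > 0.8).then(|| {
        format!(
            "context at {:.0}% of {} tokens; consider compaction",
            usage * 100.0,
            context_window
        )
    })
}

fn main() {
    let prompt = "x".repeat(110_000); // ~27.5K estimated tokens
    println!("{:?}", context_usage_warning(&prompt, 32_768));
}
```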
### Lower Priority

- Skills System Optimization
  - Lazy-load skills (only when needed)
  - Cache skill content to reduce prompt size
- Parallel Tool Call Fallback
  - Detect model support for parallel calls
  - Fall back to sequential execution if not supported (see the sketch after this list)
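A minimal sketch of the fallback, assuming a `supports_parallel` flag is already known; the executor and flag plumbing are placeholders, not ForgeCode's `ToolExecutor`:

```rust
use futures::future::join_all;

// Stand-in for the real tool executor.
async fn run_tool(name: &str) -> String {
    format!("{name}: ok")
}

/// Run tool calls in parallel when the model supports it, otherwise one by one.
async fn execute(calls: &[&str], supports_parallel: bool) -> Vec<String> {
    if supports_parallel {
        join_all(calls.iter().map(|c| run_tool(c))).await
    } else {
        let mut results = Vec::new();
        for c in calls {
            results.push(run_tool(c).await);
        }
        results
    }
}

#[tokio::main]
async fn main() {
    println!("{:?}", execute(&["read", "grep"], false).await);
}
```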
## Conclusions

### Strong Conclusions (Based on Direct Evidence)

- Multiple system messages break Qwen3.5 - Confirmed via GitHub issue #2894
- 4K default context is insufficient - Documented in `general-local-models.md`
- Tool schema normalization helps - Multiple transformers ensure strict compliance
- 30B+ recommended for serious work - Community consensus from Reddit r/LocalLLaMA
### Moderate Conclusions (Based on Code Analysis + Community Feedback)

- XML tool wrapper may confuse local models - Qwen3.5 struggles with XML parsing
- Prompt complexity exceeds local model capacity - 58+ line prompts with 12+ rules
- Pattern matching on tool names matters - Renaming improves reliability
- Parallel calls reduce context pressure - But may not be supported by all models
### Weaker Conclusions (Speculative)

- Generic type parameters add overhead - Plausible but not directly measured
- Async complexity affects local models - Indirect impact via prompt size
- Skills system adds latency - Not measured, but plausible
## Source References

- GitHub Issue #2894: https://github.com/antinomyhq/forgecode/issues/2894
- Reddit r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1qz5uww/qwen3_coder_next_as_first_usable_coding_model_60/
- ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/
- Tool Calling Reliability: `forgecode/feedback/localllm/tool-calling-reliability.md`
- General Local Models: `forgecode/feedback/localllm/general-local-models.md`
## Appendix: Key Code Locations

| Component | File Path | Local Model Impact |
|---|---|---|
| Multiple System Messages | `crates/forge_app/src/system_prompt.rs:128` | HIGH - Breaks Qwen3.5 |
| Tool Schema Normalization | `crates/forge_app/src/dto/openai/transformers/normalize_tool_schema.rs` | POSITIVE - Helps all models |
| Parallel Tool Calls | `crates/forge_app/src/system_prompt.rs:114` | MODERATE - Depends on model |
| Skills System | `crates/forge_app/src/system_prompt.rs:95` | POSITIVE - Well-designed |
| Context Rendering | `plans/2025-04-02-system-context-rendering-v2.md` | POSITIVE - Dynamic updates |
Author's Note: This analysis combines direct code inspection with community feedback. Strong conclusions are backed by both code and external sources. Weaker conclusions are based on code patterns and reasonable inference. Always verify with your specific model/backend combination.