ForgeCode Repository Feedback Analysis

Date: April 9, 2026
Scope: Analysis of forgecode codebase for local model compatibility
Focus Areas: Prompts, tools, parsing, skills
Model Focus: Local models (Qwen 3.5, Gemma 4, MiniMax, GLM, DeepSeek)


Executive Summary

ForgeCode has a sophisticated but complex architecture that presents both opportunities and challenges for local models. The harness implements numerous optimizations for tool calling reliability, but many of these rely on infrastructure that may not be available or performant with smaller models.

Key Finding: The harness's tool calling layer is the primary concern for local models, followed by prompt complexity and context management. The skills system is well-designed but adds overhead.


What Works Well for Local Models

1. Modular Prompt Architecture

Evidence:

  • Templates are modular and composable (forge-custom-agent-template.md, forge-partial-*.md)
  • System context is re-rendered on each turn (plan: 2025-04-02-system-context-rendering-v2.md)
  • Variables can be passed to prompts

Why This Helps Local Models:

  • Smaller prompts = less context pressure
  • Re-rendering allows dynamic updates (time, environment)
  • Variables enable customization without full prompt rewrites

Strength: Strong - This is well-documented and implemented in the codebase.


2. Tool Schema Normalization

Evidence:

  • normalize_tool_schema.rs removes duplicate description and title from parameters
  • enforce_strict_schema.rs adds additionalProperties: false for stricter JSON schema compliance
  • enforce_strict_tool_schema.rs converts nullable enums to OpenAI-compatible format

Why This Helps Local Models:

  • Simplified schemas reduce parsing errors
  • Strict schemas are more predictable for smaller models
  • Nullable enum handling prevents schema validation failures

Strength: Strong - Multiple transformers ensure schemas are optimized before reaching the model.
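
The effect of these transformers can be sketched with a toy JSON model. This is illustrative only: the real transformers operate on serde_json values, and the exact keys they strip may differ. The sketch shows the two behaviors described above, removing redundant title metadata and forcing additionalProperties: false on object schemas:

```rust
use std::collections::BTreeMap;

// Minimal JSON value type so the sketch stays dependency-free.
#[derive(Debug, Clone, PartialEq)]
enum Json {
    Bool(bool),
    Str(String),
    Obj(BTreeMap<String, Json>),
}

// Conceptual sketch of schema normalization: strip redundant `title`
// keys and force `additionalProperties: false` wherever an object
// schema declares `properties`.
fn normalize(value: &mut Json) {
    if let Json::Obj(map) = value {
        map.remove("title"); // redundant metadata confuses smaller models
        if map.contains_key("properties") {
            map.insert("additionalProperties".to_string(), Json::Bool(false));
        }
        for v in map.values_mut() {
            normalize(v);
        }
    }
}

fn main() {
    let mut props = BTreeMap::new();
    props.insert("path".to_string(), Json::Str("string".to_string()));
    let mut schema_map = BTreeMap::new();
    schema_map.insert("title".to_string(), Json::Str("ReadArgs".to_string()));
    schema_map.insert("properties".to_string(), Json::Obj(props));
    let mut schema = Json::Obj(schema_map);
    normalize(&mut schema);
    if let Json::Obj(m) = &schema {
        assert!(!m.contains_key("title"));
        assert_eq!(m.get("additionalProperties"), Some(&Json::Bool(false)));
    }
    println!("{schema:?}");
}
```

The stricter the schema a small model receives, the narrower the space of outputs it can emit, which is why this pipeline pays off most for local models.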


3. Parallel Tool Calls

Evidence:

  • supports_parallel_tool_calls flag in system_prompt.rs
  • Instructions in forge-custom-agent-template.md: "invoke all relevant tools simultaneously"

Why This Helps Local Models:

  • Reduces total turns needed for multi-step tasks
  • Faster task completion = less context accumulation
  • Parallelism reduces timeout risk

Strength: Moderate - Depends on model support; local models may not reliably support parallel calls.


4. Skills System

Evidence:

  • forge-partial-skill-instructions.md provides clear invocation pattern
  • Skills are loaded dynamically via tool call
  • Skills provide domain-specific workflows

Why This Helps Local Models:

  • Specialized skills reduce cognitive load on main prompt
  • Reusable workflows = less prompt engineering overhead
  • Clear invocation pattern (skill tool with name only)

Strength: Strong - Well-designed and documented. Skills can be invoked with minimal context.


Problematic Areas for Local Models

1. Multiple System Messages ⚠️

Evidence:

  • GitHub Issue #2894: "Multiple system messages break models with strict chat templates (e.g. Qwen3.5)"
  • system_prompt.rs line 128: context.set_system_messages(vec![static_block, non_static_block])
  • Two system messages are set: static_block and non_static_block

Impact:

  • BREAKS Qwen3.5 and Qwen3 models
  • Models with strict chat templates fail to parse message structure
  • Tool calling becomes unpredictable

Root Cause: The harness generates two separate system messages:

  1. static_block - from system_prompt.template
  2. non_static_block - from forge-custom-agent-template.md

These are emitted as two separate system messages, which breaks models that expect exactly one.

Strength: Strong - This is a confirmed bug with an open GitHub issue.

Workaround: None yet; use a different model or await the fix.


2. Tool Calling Format Complexity ⚠️

Evidence:

  • forge-partial-tool-use-example.md shows <forge_tool_call> XML wrapper
  • Tool calls must be in JSON format inside XML tags
  • Example: <forge_tool_call>{"name": "read", "arguments": {...}}</forge_tool_call>

Why This Is Problematic:

  • Local models trained on varied data may not recognize custom XML wrapper
  • Qwen3.5 specifically struggles with XML tool parsing (community feedback)
  • LM Studio 0.4.9+ reportedly handles this better than raw llama.cpp

Strength: Moderate - This is a known issue with community workarounds (LM Studio > raw llama.cpp).
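
A minimal extraction step for this wrapper might look like the following. This is illustrative only, not ForgeCode's actual parser, but it makes the failure mode concrete: a model that does not emit the exact opening and closing tags produces no parseable call at all.

```rust
// Sketch: extract the JSON payload from a <forge_tool_call> XML wrapper
// in raw model output. The tag name comes from
// forge-partial-tool-use-example.md; the extraction logic is illustrative.
fn extract_tool_call(output: &str) -> Option<&str> {
    const OPEN: &str = "<forge_tool_call>";
    const CLOSE: &str = "</forge_tool_call>";
    let start = output.find(OPEN)? + OPEN.len();
    let end = output[start..].find(CLOSE)? + start;
    Some(output[start..end].trim())
}

fn main() {
    let raw = r#"I'll read the file. <forge_tool_call>{"name": "read", "arguments": {"path": "src/main.rs"}}</forge_tool_call>"#;
    let payload = extract_tool_call(raw).expect("no tool call found");
    assert!(payload.starts_with('{') && payload.contains("\"read\""));
    println!("{payload}");
}
```

A pure-JSON mode would skip the wrapper entirely and let the backend's native tool-call channel carry the payload, which is what backends like LM Studio appear to handle better.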


3. Context Window Pressure ⚠️

Evidence:

  • system_prompt.rs includes:
    • Full tool definitions (tool_information)
    • File list (files)
    • Extension statistics (extensions)
    • Custom rules (custom_rules)
    • Skills list (skills)
    • README content (not shown but referenced)

Impact:

  • Local models often have smaller context windows (4K-32K)
  • Default Ollama context is 4K (too small)
  • Context can silently overflow the window while the session appears to keep working

Strength: Strong - Well-documented in general-local-models.md:

"Ollama/Qwen3 runs with 4K context window by default (too small)" "Need explicit configuration to increase context"


4. Prompt Complexity ⚠️

Evidence:

  • forge-custom-agent-template.md is 58 lines with complex rules
  • non_negotiable_rules section has 12+ rules with examples
  • forge-command-generator-prompt.md is 113 lines with 6+ edge case categories

Why This Is Problematic:

  • Smaller models (<14B) struggle with long, complex prompts
  • Qwen3.5 requires higher-quality quantization for reliable parsing
  • Context pressure increases with prompt length

Strength: Moderate - Community feedback suggests:

"30B+ recommended for serious coding work" "<7B models: Generally insufficient for reliable agentic tool use"


5. Tool Naming Conventions ⚠️

Evidence:

  • tool-calling-reliability.md: "Models pattern-match against training data first"
  • Renaming the edit tool's parameters to old_string/new_string "measurably dropped tool-call error rates"

Why This Is Problematic:

  • ForgeCode's tool names may not match training data patterns
  • Local models rely more on pattern matching than frontier models
  • Custom tool names increase error rate

Strength: Moderate - This is a known issue with a known fix (use established names).


Codebase Quality Assessment

Good: Architecture & Design

  1. Transformer Pipeline (crates/forge_app/src/dto/)

    • Multiple transformers for different providers (Anthropic, OpenAI, Google)
    • Each transformer is focused and testable
    • Example: enforce_schema.rs, normalize_tool_schema.rs
  2. Tool Registry (tool_registry.rs)

    • Clear separation of concerns
    • Timeout handling built-in
    • Permission checking before execution
  3. Template Engine (system_prompt.rs)

    • Handlebars-style templating
    • Variables passed to templates
    • Re-rendering on each turn
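
The substitution step can be sketched as follows. The real engine is richer (partials, conditionals); this toy render handles only simple {{variable}} replacement, and the variable names are examples:

```rust
use std::collections::HashMap;

// Illustrative Handlebars-style variable substitution, as described
// for system_prompt.rs. Each {{key}} placeholder is replaced with its
// value; re-rendering per turn keeps time/environment values fresh.
fn render(template: &str, vars: &HashMap<&str, &str>) -> String {
    let mut out = template.to_string();
    for (key, value) in vars {
        out = out.replace(&format!("{{{{{key}}}}}"), value);
    }
    out
}

fn main() {
    let mut vars = HashMap::new();
    vars.insert("current_time", "2026-04-09T10:00:00Z");
    vars.insert("cwd", "/repo");
    let prompt = render("Time: {{current_time}}\nDir: {{cwd}}", &vars);
    assert_eq!(prompt, "Time: 2026-04-09T10:00:00Z\nDir: /repo");
    println!("{prompt}");
}
```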

Concerning: Complexity

  1. Multiple Layers of Abstraction

    • ToolRegistry → ToolExecutor → ToolCatalog
    • SystemPrompt → TemplateEngine → Template
    • Each layer adds overhead and potential failure points
  2. Generic Type Parameters

    • ToolRegistry<S> where S: Services + EnvironmentInfra
    • Complex trait bounds make debugging harder
    • Local models may struggle with the resulting prompts
  3. Async Complexity

    • Heavy use of async/await and tokio
    • join_all for parallel tool calls
    • Timeout handling adds latency

Recommendations for Local Models

Immediate Fixes (High Priority)

  1. Fix Multiple System Messages (#2894)

    • Combine static_block and non_static_block into single message
    • Or make second message optional via config
  2. Add Context Window Config

    • Allow users to specify context window size
    • Default to 32K for local models (not 4K)
  3. Simplify Tool Call Format

    • Add option for pure JSON (no XML wrapper)
    • Let users choose based on model compatibility
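
A client-side sketch of fix #1, collapsing all system messages into one before the request is sent. Message and Role here are illustrative stand-ins, not ForgeCode's actual types, and a real fix would live in the harness's request-building path:

```rust
// Merge all system messages into a single one for models with strict
// chat templates (e.g. Qwen3.5). System content is joined with blank
// lines and placed before the remaining messages.
#[derive(Debug, Clone, PartialEq)]
enum Role { System, User }

#[derive(Debug, Clone)]
struct Message { role: Role, content: String }

fn merge_system_messages(messages: Vec<Message>) -> Vec<Message> {
    let (system, rest): (Vec<_>, Vec<_>) =
        messages.into_iter().partition(|m| m.role == Role::System);
    let mut out = Vec::new();
    if !system.is_empty() {
        let combined = system
            .iter()
            .map(|m| m.content.as_str())
            .collect::<Vec<_>>()
            .join("\n\n");
        out.push(Message { role: Role::System, content: combined });
    }
    out.extend(rest);
    out
}

fn main() {
    let msgs = vec![
        Message { role: Role::System, content: "static_block".to_string() },
        Message { role: Role::System, content: "non_static_block".to_string() },
        Message { role: Role::User, content: "fix the bug".to_string() },
    ];
    let merged = merge_system_messages(msgs);
    assert_eq!(merged.len(), 2);
    assert_eq!(merged[0].content, "static_block\n\nnon_static_block");
    println!("{merged:?}");
}
```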

Medium Priority

  1. Tool Name Optimization

    • Use established parameter names (old_string/new_string)
    • Document tool naming conventions for users
  2. Context Compaction

    • Implement automatic context compression
    • Add warning when context exceeds 80%
  3. Quantization Guidance

    • Document recommended quantizations per model
    • Q8_0 for tool calling, Q4_K_M for basic tasks
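
The 80% warning in #2 could be approximated with a chars/4 token heuristic. That ratio is an assumption for illustration; a real implementation would use the model's own tokenizer:

```rust
// Estimate context usage as a fraction of the window, using a rough
// chars/4 token heuristic (assumption, not ForgeCode's tokenizer).
fn context_usage(prompt_chars: usize, context_window_tokens: usize) -> f64 {
    let estimated_tokens = prompt_chars as f64 / 4.0;
    estimated_tokens / context_window_tokens as f64
}

fn main() {
    // ~27.5K estimated tokens against a 32K window -> above the 80% line.
    let usage = context_usage(110_000, 32_768);
    if usage > 0.8 {
        eprintln!("warning: context {:.0}% full", usage * 100.0);
    }
    assert!(usage > 0.8 && usage < 1.0);
}
```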

Lower Priority

  1. Skills System Optimization

    • Lazy-load skills (only when needed)
    • Cache skill content to reduce prompt size
  2. Parallel Tool Call Fallback

    • Detect model support for parallel calls
    • Fall back to sequential if not supported
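
The fallback in #2 can be sketched as below. Names are illustrative, and threads stand in for the harness's actual tokio-based join_all; the point is only the branch on a support flag:

```rust
use std::thread;

// Run tool calls in parallel when the model/backend supports it,
// otherwise sequentially. The "result of ..." strings stand in for
// real tool execution.
fn run_tool_calls(calls: Vec<String>, supports_parallel: bool) -> Vec<String> {
    if supports_parallel {
        let handles: Vec<_> = calls
            .into_iter()
            .map(|c| thread::spawn(move || format!("result of {c}")))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).collect()
    } else {
        // Sequential fallback preserves ordering and avoids confusing
        // models that never emit multiple calls per turn.
        calls.into_iter().map(|c| format!("result of {c}")).collect()
    }
}

fn main() {
    let out = run_tool_calls(vec!["read".to_string(), "grep".to_string()], false);
    assert_eq!(out, vec!["result of read", "result of grep"]);
    println!("{out:?}");
}
```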

Conclusions

Strong Conclusions (Based on Direct Evidence)

  1. Multiple system messages break Qwen3.5 - Confirmed via GitHub issue #2894
  2. 4K default context is insufficient - Documented in general-local-models.md
  3. Tool schema normalization helps - Multiple transformers ensure strict compliance
  4. 30B+ recommended for serious work - Community consensus from Reddit r/LocalLLaMA

Moderate Conclusions (Based on Code Analysis + Community Feedback)

  1. XML tool wrapper may confuse local models - Qwen3.5 struggles with XML parsing
  2. Prompt complexity exceeds local model capacity - 58+ line prompts with 12+ rules
  3. Pattern matching on tool names matters - Renaming improves reliability
  4. Parallel calls reduce context pressure - But may not be supported by all models

Weaker Conclusions (Speculative)

  1. Generic type parameters add overhead - Plausible but not directly measured
  2. Async complexity affects local models - Indirect impact via prompt size
  3. Skills system adds latency - Not measured, but plausible

Source References

  1. GitHub Issue #2894: https://github.com/antinomyhq/forgecode/issues/2894
  2. Reddit r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1qz5uww/qwen3_coder_next_as_first_usable_coding_model_60/
  3. ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/
  4. Tool Calling Reliability: forgecode/feedback/localllm/tool-calling-reliability.md
  5. General Local Models: forgecode/feedback/localllm/general-local-models.md

Appendix: Key Code Locations

| Component | File Path | Local Model Impact |
|---|---|---|
| Multiple System Messages | crates/forge_app/src/system_prompt.rs:128 | HIGH - Breaks Qwen3.5 |
| Tool Schema Normalization | crates/forge_app/src/dto/openai/transformers/normalize_tool_schema.rs | POSITIVE - Helps all models |
| Parallel Tool Calls | crates/forge_app/src/system_prompt.rs:114 | MODERATE - Depends on model |
| Skills System | crates/forge_app/src/system_prompt.rs:95 | POSITIVE - Well-designed |
| Context Rendering | plans/2025-04-02-system-context-rendering-v2.md | POSITIVE - Dynamic updates |

Author's Note: This analysis combines direct code inspection with community feedback. Strong conclusions are backed by both code and external sources. Weaker conclusions are based on code patterns and reasonable inference. Always verify with your specific model/backend combination.