# ForgeCode Repository Feedback Analysis

Date: April 9, 2026
Scope: Analysis of the forgecode codebase for local model compatibility
Focus Areas: Prompts, tools, parsing, skills
Model Focus: Local models (Qwen 3.5, Gemma 4, MiniMax, GLM, DeepSeek)
## Executive Summary

ForgeCode has a sophisticated but complex architecture that presents both opportunities and challenges for local models. The harness implements numerous optimizations for tool calling reliability, but many of these rely on infrastructure that may not be available or performant with smaller models.
Key Finding: The harness's tool calling layer is the primary concern for local models, followed by prompt complexity and context management. The skills system is well-designed but adds overhead.
## What Works Well for Local Models

### 1. Modular Prompt Architecture ✅

Evidence:
- Templates are modular and composable (`forge-custom-agent-template.md`, `forge-partial-*.md`)
- System context is re-rendered on each turn (plan: `2025-04-02-system-context-rendering-v2.md`)
- Variables can be passed to prompts
Why This Helps Local Models:
- Smaller prompts = less context pressure
- Re-rendering allows dynamic updates (time, environment)
- Variables enable customization without full prompt rewrites
Strength: Strong - This is well-documented and implemented in the codebase.
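To make the re-rendering point concrete, here is a minimal sketch using the `handlebars` crate. ForgeCode's templating is only described as Handlebars-style, so treat the variable names and API below as illustrative, not the actual `system_prompt.rs` code:

```rust
use handlebars::Handlebars;
use serde_json::json;

fn main() -> Result<(), handlebars::RenderError> {
    // Tiny stand-in for a forge-partial-*.md template; the variable names
    // are hypothetical, not the ones ForgeCode actually passes.
    let partial = "Current time: {{current_time}}\nWorking directory: {{cwd}}";

    let hb = Handlebars::new();

    // Re-rendering on each turn keeps time/environment fresh without
    // rewriting or re-sending the rest of the prompt.
    for turn in 0..2 {
        let rendered = hb.render_template(
            partial,
            &json!({
                "current_time": format!("2026-04-09T12:0{turn}:00Z"),
                "cwd": "/repo"
            }),
        )?;
        println!("--- turn {turn} ---\n{rendered}");
    }
    Ok(())
}
```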
### 2. Tool Schema Normalization ✅

Evidence:
- `normalize_tool_schema.rs` removes duplicate `description` and `title` from parameters
- `enforce_strict_schema.rs` adds `additionalProperties: false` for stricter JSON schema compliance
- `enforce_strict_tool_schema.rs` converts nullable enums to OpenAI-compatible format
Why This Helps Local Models:
- Simplified schemas reduce parsing errors
- Strict schemas are more predictable for smaller models
- Nullable enum handling prevents schema validation failures
Strength: Strong - Multiple transformers ensure schemas are optimized before reaching the model.
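A minimal sketch of what this class of transformer does, using `serde_json`. This is an illustration of the idea, not the actual code in `normalize_tool_schema.rs` or `enforce_strict_schema.rs` (the real transformer only removes fields when they are duplicated):

```rust
use serde_json::{json, Value};

/// Illustrative schema tightening: drop per-property `title`/`description`
/// noise and force `additionalProperties: false` on the parameter object.
fn tighten_schema(schema: &mut Value) {
    if let Some(obj) = schema.as_object_mut() {
        obj.insert("additionalProperties".to_string(), json!(false));
        if let Some(props) = obj.get_mut("properties").and_then(Value::as_object_mut) {
            for (_name, prop) in props.iter_mut() {
                if let Some(p) = prop.as_object_mut() {
                    p.remove("title");
                    p.remove("description");
                }
            }
        }
    }
}

fn main() {
    let mut schema = json!({
        "type": "object",
        "properties": {
            "path": { "type": "string", "title": "path", "description": "path" }
        }
    });
    tighten_schema(&mut schema);
    println!("{}", serde_json::to_string_pretty(&schema).unwrap());
}
```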
### 3. Parallel Tool Calls ✅

Evidence:
- `supports_parallel_tool_calls` flag in `system_prompt.rs`
- Instructions in `forge-custom-agent-template.md`: "invoke all relevant tools simultaneously"
Why This Helps Local Models:
- Reduces total turns needed for multi-step tasks
- Faster task completion = less context accumulation
- Parallelism reduces timeout risk
Strength: Moderate - Depends on model support; local models may not reliably support parallel calls.
### 4. Skills System ✅

Evidence:
- `forge-partial-skill-instructions.md` provides clear invocation pattern
- Skills are loaded dynamically via tool call
- Skills provide domain-specific workflows
Why This Helps Local Models:
- Specialized skills reduce cognitive load on main prompt
- Reusable workflows = less prompt engineering overhead
- Clear invocation pattern (`skill` tool with name only)
Strength: Strong - Well-designed and documented. Skills can be invoked with minimal context.
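For illustration only, a skill invocation under this pattern could look like the payload below; the argument key is an assumption, not verified against the repository:

```rust
use serde_json::json;

fn main() {
    // Hypothetical call to the `skill` tool: a name and nothing else,
    // so the model needs minimal extra context to pull in a workflow.
    let call = json!({
        "name": "skill",
        "arguments": { "name": "rust-refactoring" }
    });
    println!("{call}");
}
```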
## Problematic Areas for Local Models

### 1. Multiple System Messages ❌

Evidence:
- GitHub Issue #2894: "Multiple system messages break models with strict chat templates (e.g. Qwen3.5)"
- `system_prompt.rs` line 128: `context.set_system_messages(vec![static_block, non_static_block])`
- Two system messages are set: `static_block` and `non_static_block`
Impact:
- BREAKS Qwen3.5 and Qwen3 models
- Models with strict chat templates fail to parse message structure
- Tool calling becomes unpredictable
Root Cause: The harness generates two separate system messages:
- `static_block` - from `system_prompt.template`
- `non_static_block` - from `forge-custom-agent-template.md`

These are sent as two distinct system messages rather than being merged into one, which breaks models that expect a single system message.
Strength: Strong - This is a confirmed bug with an open GitHub issue.
Workaround: None yet; use a different model or wait for the fix.
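A minimal sketch of the obvious direction for a fix, merging the two blocks before they are set. The names mirror `system_prompt.rs`, but the function and call site are assumed, not the actual patch:

```rust
/// Merge the static and non-static blocks into one system message so that
/// models with strict chat templates (Qwen3/Qwen3.5) see a single system turn.
fn merge_system_blocks(static_block: &str, non_static_block: &str) -> String {
    format!("{static_block}\n\n{non_static_block}")
}

fn main() {
    let merged = merge_system_blocks(
        "You are ForgeCode, an agentic coding assistant...",
        "Custom agent rules and environment context...",
    );
    // Instead of vec![static_block, non_static_block], set a single message.
    let system_messages = vec![merged];
    println!("{}", system_messages[0]);
}
```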
### 2. Tool Calling Format Complexity ⚠️

Evidence:
- `forge-partial-tool-use-example.md` shows the `<forge_tool_call>` XML wrapper
- Tool calls must be in JSON format inside XML tags
- Example: `<forge_tool_call>{"name": "read", "arguments": {...}}</forge_tool_call>`
Why This Is Problematic:
- Local models trained on varied data may not recognize custom XML wrapper
- Qwen3.5 specifically struggles with XML tool parsing (community feedback)
- LM Studio 0.4.9+ reportedly handles this better than raw llama.cpp
Strength: Moderate - This is a known issue with community workarounds (LM Studio > raw llama.cpp).
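One mitigation on the harness side is lenient extraction: accept the wrapped form but fall back to bare JSON when a model drops the tags. A small sketch (not ForgeCode's actual parser) using `serde_json`:

```rust
use serde_json::Value;

/// Extract a tool call from model output, tolerating a missing XML wrapper.
fn extract_tool_call(output: &str) -> Option<Value> {
    let inner = output
        .find("<forge_tool_call>")
        .and_then(|start| {
            let rest = &output[start + "<forge_tool_call>".len()..];
            rest.find("</forge_tool_call>").map(|end| &rest[..end])
        })
        .unwrap_or(output);
    serde_json::from_str(inner.trim()).ok()
}

fn main() {
    let wrapped = r#"<forge_tool_call>{"name": "read", "arguments": {"path": "src/main.rs"}}</forge_tool_call>"#;
    let bare = r#"{"name": "read", "arguments": {"path": "src/main.rs"}}"#;
    assert!(extract_tool_call(wrapped).is_some());
    assert!(extract_tool_call(bare).is_some());
    println!("both forms parsed");
}
```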
### 3. Context Window Pressure ⚠️

Evidence:
- `system_prompt.rs` includes:
  - Full tool definitions (`tool_information`)
  - File list (`files`)
  - Extension statistics (`extensions`)
  - Custom rules (`custom_rules`)
  - Skills list (`skills`)
  - README content (not shown but referenced)
Impact:
- Local models often have smaller context windows (4K-32K)
- Default Ollama context is 4K (too small)
- Context usage can exceed 100% of the window while the session still appears to work
Strength: Strong - Well-documented in `general-local-models.md`:
- "Ollama/Qwen3 runs with 4K context window by default (too small)"
- "Need explicit configuration to increase context"
### 4. Prompt Complexity ⚠️

Evidence:
- `forge-custom-agent-template.md` is 58 lines with complex rules
- The `non_negotiable_rules` section has 12+ rules with examples
- `forge-command-generator-prompt.md` is 113 lines with 6+ edge case categories
Why This Is Problematic:
- Smaller models (<14B) struggle with long, complex prompts
- Qwen3.5 requires higher-quality quantization for reliable parsing
- Context pressure increases with prompt length
Strength: Moderate - Community feedback suggests:
"30B+ recommended for serious coding work" "<7B models: Generally insufficient for reliable agentic tool use"
### 5. Tool Naming Conventions ⚠️

Evidence:
- `tool-calling-reliability.md`: "Models pattern-match against training data first"
- Renaming the edit tool's parameters to `old_string`/`new_string` "measurably dropped tool-call error rates"
Why This Is Problematic:
- ForgeCode's tool names may not match training data patterns
- Local models rely more on pattern matching than frontier models
- Custom tool names increase error rate
Strength: Moderate - This is a known issue with a known fix (use established names).
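A hedged sketch of what an edit-tool schema using the established parameter names might look like; the actual ForgeCode schema may differ, this only illustrates the naming:

```rust
use serde_json::json;

fn main() {
    // Parameter names models have repeatedly seen in training data
    // (old_string/new_string) instead of harness-specific ones.
    let schema = json!({
        "name": "edit",
        "description": "Replace an exact string in a file",
        "parameters": {
            "type": "object",
            "additionalProperties": false,
            "required": ["path", "old_string", "new_string"],
            "properties": {
                "path": { "type": "string" },
                "old_string": { "type": "string" },
                "new_string": { "type": "string" }
            }
        }
    });
    println!("{}", serde_json::to_string_pretty(&schema).unwrap());
}
```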
## Codebase Quality Assessment

### Good: Architecture & Design

- Transformer Pipeline (`crates/forge_app/src/dto/`)
  - Multiple transformers for different providers (Anthropic, OpenAI, Google)
  - Each transformer is focused and testable
  - Example: `enforce_schema.rs`, `normalize_tool_schema.rs`
- Tool Registry (`tool_registry.rs`)
  - Clear separation of concerns
  - Timeout handling built-in
  - Permission checking before execution
- Template Engine (`system_prompt.rs`)
  - Handlebars-style templating
  - Variables passed to templates
  - Re-rendering on each turn
### Concerning: Complexity

- Multiple Layers of Abstraction
  - `ToolRegistry` → `ToolExecutor` → `ToolCatalog`
  - `SystemPrompt` → `TemplateEngine` → `Template`
  - Each layer adds overhead and potential failure points
- Generic Type Parameters
  - `ToolRegistry<S>` where `S: Services + EnvironmentInfra`
  - Complex trait bounds make debugging harder
  - Local models may struggle with the resulting prompts
- Async Complexity
  - Heavy use of `async`/`await` and tokio `join_all` for parallel tool calls
  - Timeout handling adds latency
## Recommendations for Local Models

### Immediate Fixes (High Priority)

- Fix Multiple System Messages (#2894)
  - Combine `static_block` and `non_static_block` into a single message
  - Or make the second message optional via config
- Add Context Window Config
  - Allow users to specify context window size
  - Default to 32K for local models (not 4K)
- Simplify Tool Call Format
  - Add an option for pure JSON (no XML wrapper)
  - Let users choose based on model compatibility (see the config sketch after this list)
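The three fixes could hang off a single per-model compatibility setting. A sketch of one possible shape, with field names that are hypothetical rather than ForgeCode's actual configuration:

```rust
use serde::Deserialize;

/// Hypothetical per-model compatibility settings covering the three fixes:
/// single system message, explicit context window, and tool-call format.
#[derive(Debug, Deserialize)]
struct LocalModelCompat {
    /// Collapse static_block + non_static_block into one system message (#2894).
    merge_system_messages: bool,
    /// Context window to request from the backend (32K default, not 4K).
    context_window: u32,
    /// "xml" for <forge_tool_call> wrapping, "json" for bare JSON tool calls.
    tool_call_format: String,
}

fn main() {
    let raw = r#"{
        "merge_system_messages": true,
        "context_window": 32768,
        "tool_call_format": "json"
    }"#;
    let cfg: LocalModelCompat = serde_json::from_str(raw).unwrap();
    println!("{cfg:?}");
}
```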
### Medium Priority

- Tool Name Optimization
  - Use established names (`old_string`/`new_string`)
  - Document tool naming conventions for users
- Context Compaction
  - Implement automatic context compression
  - Add a warning when context exceeds 80% (a sketch follows this list)
- Quantization Guidance
  - Document recommended quantizations per model
  - Q8_0 for tool calling, Q4_K_M for basic tasks
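For the 80% warning, a rough sketch; the 4-characters-per-token estimate is a stand-in, not ForgeCode's tokenizer:

```rust
/// Warn when estimated prompt tokens pass 80% of the configured context window.
fn context_usage_warning(prompt: &str, context_window: usize) -> Option<String> {
    let estimated_tokens = prompt.len() / 4; // crude chars-to-tokens estimate
    let usage = estimated_tokens as f64 / context_window as f64;
    (usage > 0.8).then(|| {
        format!(
            "context at {:.0}% of {} tokens; consider compaction",
            usage * 100.0,
            context_window
        )
    })
}

fn main() {
    let prompt = "x".repeat(110_000); // ~27.5K estimated tokens
    println!("{:?}", context_usage_warning(&prompt, 32_768));
}
```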
### Lower Priority

- Skills System Optimization
  - Lazy-load skills (only when needed)
  - Cache skill content to reduce prompt size
- Parallel Tool Call Fallback
  - Detect model support for parallel calls
  - Fall back to sequential execution if not supported (see the sketch after this list)
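A minimal sketch of the fallback, assuming a `supports_parallel` flag is already known; the executor and flag plumbing are placeholders, not ForgeCode's `ToolExecutor`:

```rust
use futures::future::join_all;

// Stand-in for the real tool executor.
async fn run_tool(name: &str) -> String {
    format!("{name}: ok")
}

/// Run tool calls in parallel when the model supports it, otherwise one by one.
async fn execute(calls: &[&str], supports_parallel: bool) -> Vec<String> {
    if supports_parallel {
        join_all(calls.iter().map(|c| run_tool(c))).await
    } else {
        let mut results = Vec::new();
        for c in calls {
            results.push(run_tool(c).await);
        }
        results
    }
}

#[tokio::main]
async fn main() {
    println!("{:?}", execute(&["read", "grep"], false).await);
}
```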
## Conclusions

### Strong Conclusions (Based on Direct Evidence)

- Multiple system messages break Qwen3.5 - Confirmed via GitHub issue #2894
- 4K default context is insufficient - Documented in `general-local-models.md`
- Tool schema normalization helps - Multiple transformers ensure strict compliance
- 30B+ recommended for serious work - Community consensus from Reddit r/LocalLLaMA
### Moderate Conclusions (Based on Code Analysis + Community Feedback)

- XML tool wrapper may confuse local models - Qwen3.5 struggles with XML parsing
- Prompt complexity exceeds local model capacity - 58+ line prompts with 12+ rules
- Pattern matching on tool names matters - Renaming improves reliability
- Parallel calls reduce context pressure - But may not be supported by all models
### Weaker Conclusions (Speculative)

- Generic type parameters add overhead - Plausible but not directly measured
- Async complexity affects local models - Indirect impact via prompt size
- Skills system adds latency - Not measured, but plausible
## Source References

- GitHub Issue #2894: https://github.com/antinomyhq/forgecode/issues/2894
- Reddit r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1qz5uww/qwen3_coder_next_as_first_usable_coding_model_60/
- ForgeCode Blog: https://forgecode.dev/blog/benchmarks-dont-matter/
- Tool Calling Reliability: `forgecode/feedback/localllm/tool-calling-reliability.md`
- General Local Models: `forgecode/feedback/localllm/general-local-models.md`
## Appendix: Key Code Locations

| Component | File Path | Local Model Impact |
|---|---|---|
| Multiple System Messages | `crates/forge_app/src/system_prompt.rs:128` | HIGH - Breaks Qwen3.5 |
| Tool Schema Normalization | `crates/forge_app/src/dto/openai/transformers/normalize_tool_schema.rs` | POSITIVE - Helps all models |
| Parallel Tool Calls | `crates/forge_app/src/system_prompt.rs:114` | MODERATE - Depends on model |
| Skills System | `crates/forge_app/src/system_prompt.rs:95` | POSITIVE - Well-designed |
| Context Rendering | `plans/2025-04-02-system-context-rendering-v2.md` | POSITIVE - Dynamic updates |
Author's Note: This analysis combines direct code inspection with community feedback. Strong conclusions are backed by both code and external sources. Weaker conclusions are based on code patterns and reasonable inference. Always verify with your specific model/backend combination.