mid_model_research/forgecode/feedback/localllm/qwen-3.5.md
Commit f561bed731 by sleepy: Fix model references and benchmark data across all feedback files
Qwen Model Corrections:
- Added Model Reference Guide to clarify Qwen3 vs Qwen 3.5 families
- Qwen3: 0.6B, 1.7B, 4B, 8B, 14B, 32B + MoE (30B-A3B, 235B-A22B)
- Qwen 3.5: 0.8B, 2B, 4B, 9B + MoE (27B, 122B-A10B, 397B-A17B)
- Fixed 'Qwen3.5-35B-A3B' -> 'Qwen3-30B-A3B' (non-existent model corrected)
- Note: Qwen 3.5 14B does NOT exist; references likely mean Qwen3-14B

Terminal-Bench 2.0 Fixes:
- Clarified that Terminal-Bench measures HARNESS+MODEL combinations
- Updated rankings with current leaderboard data (April 2026):
  - #1: Pilot + Claude Opus 4.6: 82.9%
  - #2: ForgeCode + GPT-5.4: 81.8%
  - #3: ForgeCode + Claude Opus 4.6: 81.8%
- Removed incorrect 'GPT-5.4 Rank #1' claims (scores vary by harness)
- Added harness attribution to all Terminal-Bench references

SWE-Bench Pro Updates (Verified):
- #1: Claude Mythos Preview: 77.8%
- #2: GLM-5.1: 58.4% (top open-source)
- #3: GPT-5.4: 57.7%
- Added source references to llm-stats.com

Files Modified:
- forgecode/feedback/localllm/qwen-3.5.md
- forgecode/feedback/frontier/benchmark-controversy.md
- hermes/feedback/localllm/qwen-models-feedback.md
- opencode/opencode/feedback/SUMMARY.md
- opencode/opencode/feedback/frontier/frontier-model-feedback.md
- opencode/opencode/feedback/localllm/local-llm-feedback.md
- pi/feedback/frontier/frontier-model-feedback.md
Committed: 2026-04-09 16:05:14 +02:00


Qwen Models with ForgeCode - Feedback Report

Models Covered: Qwen 3.5, Qwen3
Provider: Alibaba Cloud (via local inference)
Harness: ForgeCode
Source References: GitHub Issue #2894, Reddit r/LocalLLaMA
Date Compiled: April 9, 2026


Model Reference Guide

Model Family | Available Sizes                                                 | Notes
Qwen 3.5     | 0.8B, 2B, 4B, 9B (dense); 27B, 122B-A10B, 397B-A17B (MoE)       | Released Feb 2026
Qwen3        | 0.6B, 1.7B, 4B, 8B, 14B, 32B (dense); 30B-A3B, 235B-A22B (MoE)  | Released April 2025
Qwen2.5      | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B + Coder variants              | Earlier generation

Note: References to "Qwen 3.5 14B" in community discussions likely mean Qwen3-14B or Qwen2.5-14B.


Known Issues

Multiple System Messages Bug

GitHub Issue: #2894 (Open as of April 8, 2026)

Problem: Multiple system messages break models with strict chat templates (e.g., Qwen3, Qwen 3.5)

Error Manifestation:

  • Models with strict chat templates fail to parse message structure correctly
  • Tool calling may fail or produce incorrect results
  • Agent behavior becomes unpredictable

Impact:

  • Affects local inference with llama.cpp, Ollama, and similar servers
  • Qwen3 and Qwen 3.5 specifically mentioned as affected

Workaround Status: No official fix yet; issue under investigation
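Until an upstream fix lands, one client-side mitigation reported for this class of bug is to collapse all system messages into a single leading system message before the request reaches a strict-template backend. A minimal sketch, assuming OpenAI-style chat messages (the function name is illustrative, not part of ForgeCode's API):

```python
def merge_system_messages(messages):
    """Collapse all system messages into one leading system message.

    Strict chat templates (e.g. Qwen3 / Qwen 3.5) may mis-parse
    conversations that contain more than one system-role message.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    merged = {"role": "system", "content": "\n\n".join(system_parts)}
    return [merged] + rest
```

Applied just before the HTTP call, this keeps the conversation order of user/assistant/tool messages intact while guaranteeing at most one system message.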


Tool Calling with Qwen Models

General Observations from Community

  1. Qwen3-Coder Next shows promise as "first usable coding model < 60GB"

  2. Tool calling reliability varies by inference backend:

    • LM Studio 0.4.9 reportedly handles Qwen 3.5's XML tool parsing more reliably than raw llama.cpp
    • Running llama.cpp with the --jinja flag (which enables the model's embedded chat template) improves tool calling
  3. Community reports note that the finish_reason issue is annoying to debug
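Because finish_reason behavior varies by backend, a small guard that flags suspicious values can shorten debugging. A hedged sketch against an OpenAI-style response dict (field names assume that format; which values a given local server emits is backend-specific):

```python
EXPECTED_REASONS = {"stop", "tool_calls", "length"}

def check_finish_reason(response):
    """Return (finish_reason, warnings) for an OpenAI-style chat response.

    Flags values that commonly indicate trouble with local backends:
    missing/unknown reasons, truncation, and tool calls reported
    under a plain 'stop' (some backends do this).
    """
    warnings = []
    choice = response.get("choices", [{}])[0]
    reason = choice.get("finish_reason")
    if reason is None:
        warnings.append("backend returned no finish_reason")
    elif reason not in EXPECTED_REASONS:
        warnings.append(f"unexpected finish_reason: {reason!r}")
    if reason == "length":
        warnings.append("response truncated; any tool call may be incomplete")
    has_tool_calls = bool(choice.get("message", {}).get("tool_calls"))
    if has_tool_calls and reason == "stop":
        warnings.append("tool_calls present but finish_reason is 'stop'")
    return reason, warnings
```

Logging the returned warnings next to the raw response makes it much easier to tell a template problem from a genuine model failure.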


Recommendations for Local Use

  1. Prefer LM Studio over raw llama.cpp when reliable tool parsing matters
  2. Monitor the system message count; ForgeCode's multi-message approach triggers a known issue (#2894)
  3. Test thoroughly before relying on Qwen 3.5 for production tasks via ForgeCode
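The system-message recommendation above can be automated with a pre-flight check on the outgoing request. A tiny sketch, assuming the OpenAI-compatible payload shape that local servers typically expose (the helper names are illustrative):

```python
def count_system_messages(payload):
    """Count system-role messages in an OpenAI-style chat payload."""
    return sum(1 for m in payload.get("messages", [])
               if m.get("role") == "system")

def assert_single_system_message(payload):
    """Raise before sending if the payload could trip issue #2894."""
    n = count_system_messages(payload)
    if n > 1:
        raise ValueError(
            f"{n} system messages in payload; strict Qwen chat templates "
            "may mis-parse this (see forgecode issue #2894)"
        )
```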

Source References

  1. GitHub Issue: https://github.com/antinomyhq/forgecode/issues/2894
  2. Reddit r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/