mid_model_research/forgecode/feedback/localllm/qwen-3.5.md
Commit f561bed731 by sleepy: Fix model references and benchmark data across all feedback files
Qwen Model Corrections:
- Added Model Reference Guide to clarify Qwen3 vs Qwen 3.5 families
- Qwen3: 0.6B, 1.7B, 4B, 8B, 14B, 32B + MoE (30B-A3B, 235B-A22B)
- Qwen 3.5: 0.8B, 2B, 4B, 9B + MoE (27B, 122B-A10B, 397B-A17B)
- Fixed 'Qwen3.5-35B-A3B' -> 'Qwen3-30B-A3B' (non-existent model corrected)
- Note: Qwen 3.5 14B does NOT exist; references likely mean Qwen3-14B

Terminal-Bench 2.0 Fixes:
- Clarified that Terminal-Bench measures HARNESS+MODEL combinations
- Updated rankings with current leaderboard data (April 2026):
  - #1: Pilot + Claude Opus 4.6: 82.9%
  - #2: ForgeCode + GPT-5.4: 81.8%
  - #3: ForgeCode + Claude Opus 4.6: 81.8%
- Removed incorrect 'GPT-5.4 Rank #1' claims (scores vary by harness)
- Added harness attribution to all Terminal-Bench references

SWE-Bench Pro Updates (Verified):
- #1: Claude Mythos Preview: 77.8%
- #2: GLM-5.1: 58.4% (top open-source)
- #3: GPT-5.4: 57.7%
- Added source references to llm-stats.com

Files Modified:
- forgecode/feedback/localllm/qwen-3.5.md
- forgecode/feedback/frontier/benchmark-controversy.md
- hermes/feedback/localllm/qwen-models-feedback.md
- opencode/opencode/feedback/SUMMARY.md
- opencode/opencode/feedback/frontier/frontier-model-feedback.md
- opencode/opencode/feedback/localllm/local-llm-feedback.md
- pi/feedback/frontier/frontier-model-feedback.md
Committed: 2026-04-09 16:05:14 +02:00


Qwen Models with ForgeCode - Feedback Report

Models Covered: Qwen 3.5, Qwen3
Provider: Alibaba Cloud (via local inference)
Harness: ForgeCode
Source References: GitHub Issue #2894, Reddit r/LocalLLaMA
Date Compiled: April 9, 2026


Model Reference Guide

Model Family | Available Sizes                                                 | Notes
Qwen 3.5     | 0.8B, 2B, 4B, 9B (dense); 27B, 122B-A10B, 397B-A17B (MoE)       | Released Feb 2026
Qwen3        | 0.6B, 1.7B, 4B, 8B, 14B, 32B (dense); 30B-A3B, 235B-A22B (MoE)  | Released April 2025
Qwen2.5      | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B + Coder variants              | Earlier generation

Note: References to "Qwen 3.5 14B" in community discussions likely mean Qwen3-14B or Qwen2.5-14B.


Known Issues

Multiple System Messages Bug

GitHub Issue: #2894 (Open as of April 8, 2026)

Problem: Multiple system messages break models with strict chat templates (e.g., Qwen3, Qwen 3.5)

Error Manifestation:

  • Models with strict chat templates fail to parse message structure correctly
  • Tool calling may fail or produce incorrect results
  • Agent behavior becomes unpredictable

Impact:

  • Affects local inference with llama.cpp, Ollama, and similar servers
  • Qwen3 and Qwen 3.5 specifically mentioned as affected

Workaround Status: No official fix yet; issue under investigation
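Until an upstream fix lands, one client-side mitigation reported for this class of bug is to collapse all system messages into a single leading system message before the request reaches a strict-template backend. A minimal sketch, assuming OpenAI-style chat messages (the function name is illustrative, not part of ForgeCode's API):

```python
def merge_system_messages(messages):
    """Collapse all system messages into one leading system message.

    Strict chat templates (e.g. Qwen3 / Qwen 3.5) may mis-parse
    conversations that contain more than one system-role message.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    if not system_parts:
        return rest
    merged = {"role": "system", "content": "\n\n".join(system_parts)}
    return [merged] + rest
```

Applied just before the HTTP call, this keeps the conversation order of user/assistant/tool messages intact while guaranteeing at most one system message.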


Tool Calling with Qwen Models

General Observations from Community

  1. Qwen3-Coder Next shows promise as "first usable coding model < 60GB"

  2. Tool calling reliability varies by inference backend:

    • LM Studio 0.4.9 reportedly handles Qwen 3.5's XML tool parsing more reliably than raw llama.cpp
    • Running llama.cpp with the --jinja flag (which enables the model's embedded chat template) improves tool calling
  3. Community reports note that the finish_reason issue is annoying to debug
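Because finish_reason behavior varies by backend, a small guard that flags suspicious values can shorten debugging. A hedged sketch against an OpenAI-style response dict (field names assume that format; which values a given local server emits is backend-specific):

```python
EXPECTED_REASONS = {"stop", "tool_calls", "length"}

def check_finish_reason(response):
    """Return (finish_reason, warnings) for an OpenAI-style chat response.

    Flags values that commonly indicate trouble with local backends:
    missing/unknown reasons, truncation, and tool calls reported
    under a plain 'stop' (some backends do this).
    """
    warnings = []
    choice = response.get("choices", [{}])[0]
    reason = choice.get("finish_reason")
    if reason is None:
        warnings.append("backend returned no finish_reason")
    elif reason not in EXPECTED_REASONS:
        warnings.append(f"unexpected finish_reason: {reason!r}")
    if reason == "length":
        warnings.append("response truncated; any tool call may be incomplete")
    has_tool_calls = bool(choice.get("message", {}).get("tool_calls"))
    if has_tool_calls and reason == "stop":
        warnings.append("tool_calls present but finish_reason is 'stop'")
    return reason, warnings
```

Logging the returned warnings next to the raw response makes it much easier to tell a template problem from a genuine model failure.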


Recommendations for Local Use

  1. Prefer LM Studio over raw llama.cpp when reliable tool parsing matters
  2. Monitor the system message count; ForgeCode's multi-message approach triggers a known issue (#2894)
  3. Test thoroughly before relying on Qwen 3.5 for production tasks via ForgeCode
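The system-message recommendation above can be automated with a pre-flight check on the outgoing request. A tiny sketch, assuming the OpenAI-compatible payload shape that local servers typically expose (the helper names are illustrative):

```python
def count_system_messages(payload):
    """Count system-role messages in an OpenAI-style chat payload."""
    return sum(1 for m in payload.get("messages", [])
               if m.get("role") == "system")

def assert_single_system_message(payload):
    """Raise before sending if the payload could trip issue #2894."""
    n = count_system_messages(payload)
    if n > 1:
        raise ValueError(
            f"{n} system messages in payload; strict Qwen chat templates "
            "may mis-parse this (see forgecode issue #2894)"
        )
```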

Source References

  1. GitHub Issue: https://github.com/antinomyhq/forgecode/issues/2894
  2. Reddit r/LocalLLaMA: https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/