# Qwen Models with ForgeCode - Feedback Report

**Models Covered:** Qwen 3.5, Qwen3

**Provider:** Alibaba Cloud (via local inference)

**Harness:** ForgeCode

**Source References:** GitHub Issue #2894, Reddit r/LocalLLaMA

**Date Compiled:** April 9, 2026
---
## Model Reference Guide

| Model Family | Available Sizes | Notes |
|--------------|-----------------|-------|
| **Qwen 3.5** | 0.8B, 2B, 4B, 9B (dense); 27B, 122B-A10B, 397B-A17B (MoE) | Released Feb 2026 |
| **Qwen3** | 0.6B, 1.7B, 4B, 8B, 14B, 32B (dense); 30B-A3B, 235B-A22B (MoE) | Released April 2025 |
| **Qwen2.5** | 0.5B, 1.5B, 3B, 7B, 14B, 32B, 72B + Coder variants | Earlier generation |

> **Note:** There is no 14B model in the Qwen 3.5 family; references to "Qwen 3.5 14B" in community discussions likely mean Qwen3-14B or Qwen2.5-14B.
---
## Known Issues

### Multiple System Messages Bug

**GitHub Issue:** #2894 (open as of April 8, 2026)

**Problem:** Multiple system messages break models with strict chat templates (e.g., Qwen3, Qwen 3.5).

**Error Manifestation:**

- Models with strict chat templates fail to parse the message structure correctly
- Tool calling may fail or produce incorrect results
- Agent behavior becomes unpredictable

**Impact:**

- Affects local inference with llama.cpp, Ollama, and similar servers
- Qwen3 and Qwen 3.5 are specifically reported as affected

**Workaround Status:** No official fix yet; the issue is under investigation.
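Until a fix lands upstream, one client-side workaround is to collapse all system messages into a single leading one before the request reaches a strict-template model. The sketch below is a hypothetical helper (`merge_system_messages` is not part of ForgeCode), assuming OpenAI-style message dicts:

```python
def merge_system_messages(messages: list[dict]) -> list[dict]:
    """Collapse all system messages into one leading system message.

    Workaround sketch (not an official fix) for harnesses that emit
    multiple system messages, which strict chat templates such as
    Qwen3's reject or mis-parse.
    """
    system_parts = [m["content"] for m in messages if m["role"] == "system"]
    rest = [m for m in messages if m["role"] != "system"]
    merged = []
    if system_parts:
        # Join all system content into a single message at position 0.
        merged.append({"role": "system", "content": "\n\n".join(system_parts)})
    return merged + rest
```

Note that this hoists later system messages ahead of intervening user/assistant turns, which may change how the model weighs mid-conversation instructions; test against your own prompts before adopting it.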
---
## Tool Calling with Qwen Models

### General Observations from Community

1. **Qwen3-Coder Next** shows promise as the "first usable coding model under 60 GB"
2. **Tool calling reliability varies** by inference backend:
   - LM Studio 0.4.9 reportedly handles Qwen 3.5 XML tool parsing more reliably than raw llama.cpp
   - Passing the `--jinja` flag to llama.cpp (so the model's own chat template is applied) helps with tool calling
3. **`finish_reason` inconsistencies** are, according to community reports, difficult to debug
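One cheap way to surface `finish_reason` problems early is to validate the field on every response instead of letting odd values pass silently. A minimal sketch, assuming the OpenAI-compatible response shape that llama.cpp and LM Studio both serve (`check_finish_reason` is a hypothetical helper, not part of any of these tools):

```python
def check_finish_reason(response: dict) -> str:
    """Validate the finish_reason of the first choice.

    Raises immediately on truncation or unknown values so that an
    agent loop fails loudly instead of acting on partial output.
    """
    reason = response["choices"][0].get("finish_reason")
    if reason == "length":
        raise RuntimeError("completion truncated; consider raising max_tokens")
    if reason not in ("stop", "tool_calls"):
        raise RuntimeError(f"unexpected finish_reason: {reason!r}")
    return reason
```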
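When a backend does not parse tool calls server-side, the raw text can be post-processed client-side. The sketch below assumes the Hermes-style `<tool_call>{...}</tool_call>` JSON wrapper used by Qwen3's chat template; treat it as an illustration rather than a complete parser (it does not handle deeply nested or multi-line edge cases):

```python
import json
import re

# Hermes-style tool-call wrapper emitted by Qwen3's chat template.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def extract_tool_calls(text: str) -> list[dict]:
    """Extract JSON tool-call payloads from raw model output."""
    calls = []
    for match in TOOL_CALL_RE.finditer(text):
        try:
            calls.append(json.loads(match.group(1)))
        except json.JSONDecodeError:
            continue  # skip malformed blocks rather than crash the agent loop
    return calls
```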
---
## Recommendations for Local Use

1. **Use LM Studio** for more reliable tool parsing than raw llama.cpp
2. **Monitor the system message count**: multiple system messages trigger the known issue with ForgeCode's multi-message approach (Issue #2894)
3. **Test thoroughly** before relying on Qwen 3.5 for production tasks via ForgeCode
---
## Source References

1. **GitHub Issue:** https://github.com/antinomyhq/forgecode/issues/2894
2. **Reddit r/LocalLLaMA:** https://www.reddit.com/r/LocalLLaMA/comments/1sdhvc5/qwen_35_tool_calling_fixes_for_agentic_use_whats/