File: mid_model_research/hermes/feedback/localllm/qwen-models-feedback.md
Commit 51123212c4 (sleepy): Initial commit: coding harness feedback analysis
Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering,
context management, and best practices for smaller/local models.
Commit date: 2026-04-09 15:13:45 +02:00

Qwen Models Feedback for Hermes Agent

Source reference: Multiple Reddit r/LocalLLaMA posts, GitHub issues, community discussions


Model: Qwen 3.5 (Various Sizes)

Hardware: Dual 3090s with UD_5XL quant from Unsloth
Performance: ~25 t/s at 32k context
Source: https://www.reddit.com/r/LocalLLaMA/comments/1ro9lph/anybody_who_tried_hermesagent/

"The go to model for intelligence on decent hardware is qwen 3.5 27B, if you have two 3090s, use the UD_5XL quant from unsloth - its amazing. You will get about 25 t/s with this one, at a contex size of 32k, which is perfect."

Tool Calling Performance

Issue: Tool calls succeed once, then the model forgets which tool to use on subsequent turns
Models affected: Qwen 3.5 4B, Qwen 2.5 7B
Source: https://www.reddit.com/r/LocalLLaMA/comments/1s4yy6o/best_model_for_hermesagent/

"I use ollama and qwen3.5:4b qwen2.5:7b and they all tool call once than they forget which one to use any recomendations for other models?"

User hardware: 8GB VRAM

Qwen vs Gemma 4 Comparison

Source: https://www.reddit.com/r/LocalLLaMA/comments/1scbpmo/so_qwen35_or_gemma_4/

"For me Qwen is working significantly better for tool use with novel tools (things unlike what you'd expect in OpenCode or Claude Code). Gemma keeps duplicating tool calls for some reason."

"Gemma is failing to complete the complex challenge which qwen can succeed at (24gb VRAM) it's just giving up and claiming it's succeeded when it hasn't."


llama-server (llama.cpp) Compatibility Issue

Issue #1071: Critical bug with llama-server/Ollama backend

Error: 'dict' object has no attribute 'strip' during tool call argument validation

Environment:

- OS: Windows 11 (llama-server) + Ubuntu/WSL2 (hermes-agent)
- Python: 3.11.15
- Hermes: v0.2.0
- Backend: llama-server with Qwen3.5-27B-Q4_K_M.gguf

Root Cause: Hermes assumes tc.function.arguments is always a string, but llama-server sometimes returns it as a parsed dict. This is a known llama-server/Ollama behavior divergence from OpenAI spec.

Fix:

import json

args = tc.function.arguments
# llama-server may hand back an already-parsed dict/list instead of the
# JSON string the OpenAI spec mandates; re-serialize before validation.
if isinstance(args, (dict, list)):
    tc.function.arguments = json.dumps(args)

Status: User-submitted fix confirmed working
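A more defensive version of the same idea, pulled into a helper that runs before any argument validation, might look like this (a sketch, not the code from the issue; the helper name and the None handling are my additions):

```python
import json


def normalize_tool_call_arguments(arguments):
    """Coerce tool-call arguments to the JSON string the OpenAI spec expects.

    llama-server/Ollama sometimes return an already-parsed dict or list
    instead of a string, which makes downstream .strip() calls fail with
    "'dict' object has no attribute 'strip'".
    """
    if isinstance(arguments, (dict, list)):
        return json.dumps(arguments)
    if arguments is None:
        return "{}"  # assumption: treat a missing payload as empty arguments
    return arguments  # already a string, pass through unchanged
```

Applied as `tc.function.arguments = normalize_tool_call_arguments(tc.function.arguments)` just before validation, this leaves spec-compliant string-returning backends untouched while repairing the dict case.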


Best Practices for Local Models

Context Length Configuration

Critical: Match Ollama's num_ctx with Hermes config

"Ollama users: If you set custom num_ctx (e.g., ollama run --num_ctx 16384), ensure matching context length in Hermes — Ollama's /api/show reports the model's maximum context, not the effective num_ctx configured."

Source: https://hermes-agent.nousresearch.com/docs/reference/faq
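The pitfall can be made concrete with a small helper (hypothetical, not part of Hermes): given the num_ctx you actually passed to Ollama and the maximum that /api/show reports, it picks the value to put in the Hermes config and refuses impossible combinations.

```python
def hermes_context_length(num_ctx: int, reported_max: int) -> int:
    """Return the context length to configure in Hermes.

    Ollama's /api/show reports the model's *maximum* context window, not
    the effective num_ctx you launched with, so always prefer the value
    you actually configured.
    """
    if num_ctx > reported_max:
        # e.g. asking for 64k on a 32k model: fail loudly rather than
        # let the backend silently truncate the context
        raise ValueError(
            f"num_ctx {num_ctx} exceeds the model maximum {reported_max}"
        )
    return num_ctx
```

So `ollama run --num_ctx 16384` on a 32k-max model should pair with a Hermes context length of 16384, not the 32768 that /api/show advertises.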

Model Recommendations by VRAM

| VRAM  | Recommended Model     | Notes                            |
|-------|-----------------------|----------------------------------|
| 8GB   | Qwen 3.5 4B           | Tool calling may be inconsistent |
| 24GB  | Qwen 3.5 27B (Q4_K_M) | Excellent tool use, 25 t/s       |
| 48GB+ | Qwen 3.5 27B UD_5XL   | Best quality, ~25 t/s at 32k ctx |

General Local Model Feedback

Positive:

- "Hermes agent already works way way better than Open Claw and it actually works pretty well locally"
- "I have to be super careful about exposing this to the outside world because the model is not smart enough, probably, to catch sophisticated..."

Challenges:

- Context exceeded errors common with default settings
- Need to manually configure context length to match model capabilities
- Tool calling reliability varies significantly by model size