Harnesses under analysis:
- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:
- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
Qwen Models Feedback for Hermes Agent
Source reference: Multiple Reddit r/LocalLLaMA posts, GitHub issues, community discussions
Model: Qwen 3.5 (Various Sizes)
Qwen 3.5 27B - Highly Recommended
Hardware: Dual 3090s with UD_5XL quant from Unsloth
Performance: ~25 t/s at 32k context
Source: https://www.reddit.com/r/LocalLLaMA/comments/1ro9lph/anybody_who_tried_hermesagent/
"The go to model for intelligence on decent hardware is qwen 3.5 27B, if you have two 3090s, use the UD_5XL quant from unsloth - its amazing. You will get about 25 t/s with this one, at a contex size of 32k, which is perfect."
Tool Calling Performance
Issue: Tool calls work once, then the model forgets which tool to use
Models affected: Qwen 3.5 4B, Qwen 2.5 7B
Source: https://www.reddit.com/r/LocalLLaMA/comments/1s4yy6o/best_model_for_hermesagent/
"I use ollama and qwen3.5:4b qwen2.5:7b and they all tool call once than they forget which one to use any recomendations for other models?"
User hardware: 8GB VRAM
Qwen vs Gemma 4 Comparison
Source: https://www.reddit.com/r/LocalLLaMA/comments/1scbpmo/so_qwen35_or_gemma_4/
"For me Qwen is working significantly better for tool use with novel tools (things unlike what you'd expect in OpenCode or Claude Code). Gemma keeps duplicating tool calls for some reason."
"Gemma is failing to complete the complex challenge which qwen can succeed at (24gb VRAM) it's just giving up and claiming it's succeeded when it hasn't."
llama-server (llama.cpp) Compatibility Issue
Issue #1071: Critical bug with the llama-server/Ollama backend
Error: `'dict' object has no attribute 'strip'` during tool-call argument validation
Environment:
- OS: Windows 11 (llama-server) + Ubuntu/WSL2 (hermes-agent)
- Python: 3.11.15
- Hermes: v0.2.0
- Backend: llama-server with Qwen3.5-27B-Q4_K_M.gguf
Root Cause:
Hermes assumes `tc.function.arguments` is always a string, but llama-server sometimes returns it as an already-parsed dict. This is a known behavioral divergence of llama-server/Ollama from the OpenAI spec.
Fix:
```python
import json

if isinstance(args, (dict, list)):  # args = tc.function.arguments
    tc.function.arguments = json.dumps(args)
```
Status: User-submitted fix confirmed working
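For context, a self-contained sketch of the same normalization. The `ToolCall`/`FunctionCall` dataclasses below are hypothetical stand-ins for Hermes' internal tool-call types, not its actual API:

```python
import json
from dataclasses import dataclass
from typing import Any

@dataclass
class FunctionCall:
    name: str
    arguments: Any  # a JSON string per the OpenAI spec, but llama-server may send a dict

@dataclass
class ToolCall:
    function: FunctionCall

def normalize_arguments(tc: ToolCall) -> ToolCall:
    """Re-serialize dict/list arguments to the JSON string downstream validation expects."""
    args = tc.function.arguments
    if isinstance(args, (dict, list)):
        tc.function.arguments = json.dumps(args)
    return tc

# A dict payload, as llama-server sometimes returns it:
tc = ToolCall(FunctionCall("read_file", {"path": "main.py"}))
assert isinstance(normalize_arguments(tc).function.arguments, str)
```

Applying this normalization before any `.strip()` or `json.loads` call on the arguments avoids the `'dict' object has no attribute 'strip'` crash regardless of backend.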
Best Practices for Local Models
Context Length Configuration
Critical: Match Ollama's `num_ctx` with the Hermes config.
"Ollama users: If you set custom `num_ctx` (e.g., `ollama run --num_ctx 16384`), ensure matching context length in Hermes — Ollama's `/api/show` reports the model's maximum context, not the effective `num_ctx` configured."
Source: https://hermes-agent.nousresearch.com/docs/reference/faq
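As a sanity check, a minimal sketch of comparing your configured context with what the model supports, assuming a local Ollama on the default port. `check_context` is a hypothetical helper; the architecture-prefixed `*.context_length` key under `model_info` is how recent Ollama builds report the model maximum:

```python
import requests

def check_context(model: str, configured_num_ctx: int,
                  host: str = "http://localhost:11434") -> int | None:
    """Warn if the num_ctx you run with exceeds the model's trained maximum."""
    info = requests.post(f"{host}/api/show", json={"model": model}).json()
    # /api/show reports the model's maximum, e.g. "qwen2.context_length": 32768;
    # the key prefix varies by architecture, so scan for the suffix.
    max_ctx = next((v for k, v in info.get("model_info", {}).items()
                    if k.endswith(".context_length")), None)
    if max_ctx is not None and configured_num_ctx > max_ctx:
        print(f"num_ctx {configured_num_ctx} exceeds model maximum {max_ctx}")
    return max_ctx

# Hypothetical model tag; use whatever tag you actually pulled.
check_context("qwen3.5:27b", 16384)
```

Note that this only catches overshooting the model's maximum; Hermes still needs to be told the effective `num_ctx` you actually run Ollama with.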
Model Recommendations by VRAM
| VRAM | Recommended Model | Notes |
|---|---|---|
| 8GB | Qwen 3.5 4B | Tool calling may be inconsistent |
| 24GB | Qwen 3.5 27B (Q4_K_M) | Excellent tool use, 25 t/s |
| 48GB+ | Qwen 3.5 27B UD_5XL | Best quality, ~25 t/s at 32k ctx |
General Local Model Feedback
Positive:
- "Hermes agent already works way way better than Open Claw and it actually works pretty well locally"
- "I have to be super careful about exposing this to the outside world because the model is not smart enough, probably, to catch sophisticated..."
Challenges:
- Context exceeded errors common with default settings
- Need to manually configure context length to match model capabilities
- Tool calling reliability varies significantly by model size