General Frontier Model Feedback

Collection Date: 2026-04-09
Sources: GitHub issues, blog posts, community discussions, official documentation


Provider Support Matrix

Provider      | Status | Special Features
------------- | ------ | ----------------------------
OpenAI        | Full   | Codex OAuth support
Anthropic     | Full   | Claude Code credential store
OpenRouter    | Full   | 200+ models, flexible
Nous Portal   | Full   | OAuth, subscription
Kimi/Moonshot | Full   | 75% cache discount
DeepSeek      | Full   | 90% cache discount
MiniMax       | Full   | Token plan support
z.ai/GLM      | Full   | China/global endpoints
Gemini        | Full   | Via OpenRouter or direct

Key Feedback Themes

1. Token Overhead is the Hidden Cost

Critical Issue: Every API call includes ~13.9K tokens of fixed overhead

Source: GitHub Issue #4379

"The dollar cost depends on the model/provider, but the token overhead is constant — these 13.9K tokens are sent on every call whether using Sonnet, Haiku, Llama, or any other model via OpenRouter."

Breakdown (percentages are of a typical ~19K-token request; tool definitions plus system prompt make up the ~13.9K fixed overhead):

  • Tool definitions: 8,759 tokens (46%)
  • System prompt: 5,176 tokens (27%)
  • Actual messages: ~5,000 tokens (27%)

Impact on Costs:

  • A "simple weather query" can cost 21,000 tokens when the agent spawns a terminal
  • One user reported: "4 million tokens in 2 hours of light usage"
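To see how the fixed overhead compounds, here is a quick back-of-the-envelope sketch. The token counts come from the breakdown above; the call count and per-million price are illustrative assumptions, not measured values or any provider's actual pricing.

```python
# Back-of-the-envelope cost of the fixed per-call overhead.
TOOL_DEFS = 8_759      # tokens of tool definitions (from the issue above)
SYSTEM_PROMPT = 5_176  # tokens of system prompt (from the issue above)
OVERHEAD = TOOL_DEFS + SYSTEM_PROMPT  # ~13.9K fixed tokens on every call

def overhead_cost(calls: int, usd_per_million_input: float) -> float:
    """Dollar cost attributable to the fixed overhead alone."""
    return calls / 1_000_000 * OVERHEAD * usd_per_million_input

# A session of ~300 calls at an assumed $3.00/M input tokens:
print(OVERHEAD)                            # 13935 tokens per call
print(round(overhead_cost(300, 3.00), 2))  # 12.54 dollars of pure overhead
```

Even before any actual conversation tokens, a few hundred calls burn millions of overhead tokens, which is consistent with the "4 million tokens in 2 hours" report above.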

2. CLI vs Gateway Token Disparity (Fixed in v0.6.0)

Bug (pre-v0.6.0): Telegram used 2-3x more tokens than CLI

Access Method  | Tokens/Request
-------------- | --------------
CLI            | 6,000-8,000
Telegram (old) | 15,000-20,000

Root Cause: Gateway started in repo directory instead of home directory

Fix: Update to v0.6.0+ and restart gateway

3. Tool Reliability by Provider

Most Reliable:

  1. Claude Sonnet (excellent tool calling)
  2. GPT-4 class models (very reliable)
  3. Kimi K2.5 (good for the price)

Acceptable:

  • MiniMax
  • DeepSeek
  • Gemini

Variable:

  • Depends on specific task complexity
  • Budget models may struggle with novel tools

Cost Management Strategies

Strategy 1: Tiered Model Usage

Complex reasoning → Claude Sonnet / GPT-4
Routine tasks → Kimi K2.5 / MiniMax
Vision tasks → Gemini Flash / GPT-4o
Maximum savings → DeepSeek with cache
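The tiers above can be sketched as a simple routing map. The category keys and the `route()` helper are hypothetical illustrations (not a hermes API), and the model identifiers are shorthand rather than exact API model names.

```python
# Minimal sketch of tiered model usage: map task categories to models.
TIERS = {
    "complex_reasoning": "claude-sonnet",  # or a GPT-4-class model
    "routine": "kimi-k2.5",                # or MiniMax
    "vision": "gemini-flash",              # or GPT-4o
    "bulk": "deepseek-chat",               # cache-friendly, maximum savings
}

def route(task_kind: str) -> str:
    """Pick a model for a task category, defaulting to the budget tier."""
    return TIERS.get(task_kind, TIERS["routine"])

print(route("vision"))   # gemini-flash
print(route("unknown"))  # falls back to kimi-k2.5
```

The useful property of routing by category rather than per-request heuristics is that the expensive tier is opt-in: anything unclassified lands on the cheap default.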

Strategy 2: Session Management

  • Use hermes --fresh for unrelated tasks
  • Run token-intensive work in CLI vs gateway
  • Monitor with /usage command

Strategy 3: Toolset Optimization

  • Disable unused skill categories (~2,200 tokens saved)
  • Use platform-specific toolsets (~1,300 tokens saved)
  • Keep MEMORY.md lean
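The per-call savings from these tweaks can be put in perspective against the ~13.9K-token fixed overhead. The figures below are taken directly from the bullets above; the percentage is simple arithmetic, not a benchmark.

```python
# Rough per-call savings from toolset optimization, relative to the
# ~13.9K-token fixed overhead discussed earlier in this document.
BASE_OVERHEAD = 13_935  # tool definitions + system prompt, in tokens
SAVINGS = {
    "disable_unused_skill_categories": 2_200,
    "platform_specific_toolsets": 1_300,
}

saved = sum(SAVINGS.values())
print(saved)                               # 3500 tokens saved per call
print(round(saved / BASE_OVERHEAD * 100))  # 25 (% of the fixed overhead)
```

A quarter of the fixed overhead removed on every single call adds up quickly on long sessions.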

Provider-Specific Notes

OpenRouter

  • Best for: Flexibility, trying different models
  • Pros: 200+ models, single API key
  • Cons: Cache support depends on upstream

Anthropic/Claude

  • Best for: Complex reasoning, reliability
  • Pros: Excellent tool calling, context understanding
  • Cons: Higher cost, no special cache discounts

Nous Portal

  • Best for: Supporting the project, native integration
  • Pros: OAuth, built-in support
  • Cons: Subscription model

Budget Providers (Kimi, DeepSeek, MiniMax)

  • Best for: High volume, routine tasks
  • Pros: 50-90% cost savings, fast
  • Cons: May struggle with complex tasks
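The cache discounts in the matrix above translate into a blended input price that depends on your cache-hit rate. This sketch assumes a $1.00/M base price and an 80% hit rate purely for illustration; only the discount percentages (75% Kimi, 90% DeepSeek) come from the document.

```python
# Blended $/M input price when a provider discounts cached tokens.
def effective_price(base: float, cache_discount: float, hit_rate: float) -> float:
    """Mix full-price (cache-miss) and discounted (cache-hit) tokens."""
    return base * (1 - hit_rate) + base * (1 - cache_discount) * hit_rate

# Assumed $1.00/M base price, 80% of input tokens served from cache:
print(round(effective_price(1.00, 0.75, 0.8), 2))  # 0.4  (75% discount)
print(round(effective_price(1.00, 0.90, 0.8), 2))  # 0.28 (90% discount)
```

This is why "choosing a cache-friendly provider" (quoted below) is such a large lever: with a high hit rate, a 90% cache discount cuts the effective input price by nearly three quarters.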

Community Quotes

On Cost:

"Choosing a cache-friendly provider is the single biggest lever for reducing costs."

On Performance:

"One developer reported that a task taking OpenClaw 50+ tool calls and steps took Hermes 5 correct tool calls and finished 2.5 minutes faster."

On Model Selection:

"For most users: Kimi K2.5 from Moonshot or MiniMax as a daily driver — both are fast, capable, and inexpensive."


Summary Table

Provider      | Cost   | Reliability | Speed  | Cache
------------- | ------ | ----------- | ------ | -------
Claude Sonnet |        |             |        | Partial
GPT-4         |        |             |        | Partial
Kimi K2.5     | $$     |             |        | 75%
DeepSeek      | $      |             |        | 90%
MiniMax       | $$     |             |        | No
OpenRouter    | Varies | Varies      | Varies | Varies

Legend:

  • $ = Budget friendly
  • Cache % = Discount on cached tokens