Harnesses under analysis:

- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:

- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
# General Frontier Model Feedback

Collection Date: 2026-04-09
Sources: GitHub issues, blog posts, community discussions, official documentation
## Provider Support Matrix
| Provider | Status | Special Features |
|---|---|---|
| OpenAI | ✅ Full | Codex OAuth support |
| Anthropic | ✅ Full | Claude Code credential store |
| OpenRouter | ✅ Full | 200+ models, flexible |
| Nous Portal | ✅ Full | OAuth, subscription |
| Kimi/Moonshot | ✅ Full | 75% cache discount |
| DeepSeek | ✅ Full | 90% cache discount |
| MiniMax | ✅ Full | Token plan support |
| z.ai/GLM | ✅ Full | China/global endpoints |
| Gemini | ✅ Full | Via OpenRouter or direct |
## Key Feedback Themes

### 1. Token Overhead Is the Hidden Cost
Critical Issue: Every API call includes ~13.9K tokens of fixed overhead
Source: GitHub Issue #4379
"The dollar cost depends on the model/provider, but the token overhead is constant — these 13.9K tokens are sent on every call whether using Sonnet, Haiku, Llama, or any other model via OpenRouter."
Breakdown (of a typical ~19K-token call):
- Tool definitions: 8,759 tokens (46%)
- System prompt: 5,176 tokens (27%)
- Actual messages: ~5,000 tokens (27%)

The fixed overhead is the tool definitions plus the system prompt (13,935 tokens, ~13.9K); only the actual messages vary from call to call.
Impact on Costs:
- A "simple weather query" can cost 21,000 tokens when the agent spawns a terminal
- One user reported: "4 million tokens in 2 hours of light usage"
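The arithmetic above can be sanity-checked with a short sketch. The token counts come from the issue report; treating ~5,000 message tokens as a typical call is an assumption for illustration:

```python
# Token figures reported in GitHub Issue #4379; MESSAGES is an assumed
# typical value, since actual conversation content varies per call.
TOOL_DEFS = 8_759      # tool definitions, sent on every call
SYSTEM_PROMPT = 5_176  # system prompt, sent on every call
MESSAGES = 5_000       # approximate actual conversation content

fixed_overhead = TOOL_DEFS + SYSTEM_PROMPT  # ~13.9K, model-independent
total = fixed_overhead + MESSAGES           # ~18.9K for a typical call

for name, tokens in [("tools", TOOL_DEFS), ("system", SYSTEM_PROMPT),
                     ("messages", MESSAGES)]:
    print(f"{name}: {tokens} tokens ({tokens / total:.0%})")
print(f"fixed overhead per call: {fixed_overhead} tokens")
```

The percentages in the breakdown are shares of the full ~19K-token call, which is why the 46% and two 27% figures do not sum against the 13.9K overhead alone.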
### 2. CLI vs Gateway Token Disparity (Fixed in v0.6.0)

Bug (pre-v0.6.0): The Telegram gateway used 2-3x more tokens per request than the CLI.
| Access Method | Tokens/Request |
|---|---|
| CLI | 6,000-8,000 |
| Telegram (old) | 15,000-20,000 |
Root Cause: The gateway started in the repo directory instead of the home directory.

Fix: Update to v0.6.0+ and restart the gateway.

### 3. Tool Reliability by Provider
Most Reliable:
- Claude Sonnet (excellent tool calling)
- GPT-4 class models (very reliable)
- Kimi K2.5 (good for the price)
Acceptable:
- MiniMax
- DeepSeek
- Gemini
Variable:
- Depends on specific task complexity
- Budget models may struggle with novel tools
## Cost Management Strategies

### Strategy 1: Tiered Model Usage

- Complex reasoning → Claude Sonnet / GPT-4
- Routine tasks → Kimi K2.5 / MiniMax
- Vision tasks → Gemini Flash / GPT-4o
- Maximum savings → DeepSeek with cache
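One way to realize this tiering is a small router that maps a task category to a model ID. This is a minimal sketch, not hermes configuration; the category names and model identifiers are illustrative assumptions:

```python
# Sketch of tiered model routing. Category names and model IDs below are
# illustrative assumptions; substitute your provider's actual model IDs.
ROUTING_TABLE = {
    "complex_reasoning": "claude-sonnet",  # highest tool-calling reliability
    "routine": "kimi-k2.5",                # fast and inexpensive daily driver
    "vision": "gemini-flash",              # image-capable tier
    "bulk": "deepseek-chat",               # cheapest, with cache discount
}

def pick_model(task_category: str) -> str:
    """Return the model for a task category, falling back to the cheap tier."""
    return ROUTING_TABLE.get(task_category, ROUTING_TABLE["routine"])

print(pick_model("complex_reasoning"))  # claude-sonnet
print(pick_model("unknown"))            # kimi-k2.5 (fallback)
```

Defaulting the fallback to the routine tier keeps unexpected task types on the cheap path rather than silently escalating to the expensive one.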
### Strategy 2: Session Management

- Use `hermes --fresh` for unrelated tasks
- Run token-intensive work in the CLI rather than the gateway
- Monitor with the `/usage` command
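The monitoring habit can also be approximated outside the agent. Here is a minimal sketch of a running token counter that flags when a session has grown enough to justify `--fresh`; the 100K threshold and the ~19K-per-call figure are assumptions, not hermes defaults:

```python
# Sketch: track cumulative session tokens and suggest a fresh session once
# the running total crosses a threshold. The threshold is an arbitrary
# assumption, not a hermes default.
FRESH_SESSION_THRESHOLD = 100_000

class SessionTracker:
    def __init__(self) -> None:
        self.total_tokens = 0

    def record(self, tokens: int) -> None:
        self.total_tokens += tokens

    def should_start_fresh(self) -> bool:
        return self.total_tokens >= FRESH_SESSION_THRESHOLD

tracker = SessionTracker()
for call_tokens in [18_935] * 6:  # six calls at the ~19K/call figure above
    tracker.record(call_tokens)
print(tracker.total_tokens, tracker.should_start_fresh())  # 113610 True
```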
### Strategy 3: Toolset Optimization

- Disable unused skill categories (~2,200 tokens saved)
- Use platform-specific toolsets (~1,300 tokens saved)
- Keep MEMORY.md lean
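Because these savings apply on every call, they compound quickly. A rough sketch, using the savings figures from the list above (the 50-call session length is an assumption):

```python
# Sketch: per-call savings from toolset trimming, compounded over a session.
SKILL_CATEGORY_SAVINGS = 2_200    # disabling unused skill categories
PLATFORM_TOOLSET_SAVINGS = 1_300  # platform-specific toolsets

per_call = SKILL_CATEGORY_SAVINGS + PLATFORM_TOOLSET_SAVINGS  # 3,500/call
calls_per_session = 50  # assumed; varies with workload

print(f"saved per call: {per_call} tokens")
print(f"saved per {calls_per_session}-call session: "
      f"{per_call * calls_per_session} tokens")
```

Against the ~13.9K fixed overhead, trimming 3,500 tokens per call removes roughly a quarter of the constant cost.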
## Provider-Specific Notes

### OpenRouter
- Best for: Flexibility, trying different models
- Pros: 200+ models, single API key
- Cons: Cache support depends on upstream
### Anthropic/Claude
- Best for: Complex reasoning, reliability
- Pros: Excellent tool calling, context understanding
- Cons: Higher cost, no special cache discounts
### Nous Portal
- Best for: Supporting the project, native integration
- Pros: OAuth, built-in support
- Cons: Subscription model
### Budget Providers (Kimi, DeepSeek, MiniMax)
- Best for: High volume, routine tasks
- Pros: 50-90% cost savings, fast
- Cons: May struggle with complex tasks
## Community Quotes
On Cost:
"Choosing a cache-friendly provider is the single biggest lever for reducing costs."
On Performance:
"One developer reported that a task taking OpenClaw 50+ tool calls and steps took Hermes 5 correct tool calls and finished 2.5 minutes faster."
On Model Selection:
"For most users: Kimi K2.5 from Moonshot or MiniMax as a daily driver — both are fast, capable, and inexpensive."
## Summary Table

| Provider | Cost | Reliability | Speed | Cache |
|---|---|---|---|---|
| Claude Sonnet | $$$ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Partial |
| GPT-4 | $$$ | ⭐⭐⭐⭐⭐ | ⭐⭐⭐⭐ | Partial |
| Kimi K2.5 | $$ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | 75% |
| DeepSeek | $ | ⭐⭐⭐ | ⭐⭐⭐⭐ | 90% |
| MiniMax | $$ | ⭐⭐⭐⭐ | ⭐⭐⭐⭐⭐ | No |
| OpenRouter | Varies | Varies | Varies | Varies |
Legend:
- $ = Budget friendly
- ⭐ = Performance rating
- Cache % = Discount on cached tokens
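The cache column matters because most of the ~13.9K fixed overhead repeats verbatim across calls and can be served from cache. A rough effective-cost sketch, using the discount rates from the summary table and the optimistic simplifying assumption that all of the overhead is cache-hit:

```python
# Sketch: effective billed tokens per call when the fixed overhead is served
# from cache. Discount rates are from the provider table; treating ALL of the
# overhead as cacheable is an optimistic simplifying assumption.
FIXED_OVERHEAD = 13_935  # tool definitions + system prompt
MESSAGES = 5_000         # approximate actual conversation content

def effective_billed(cache_discount: float) -> float:
    """Overhead billed at the discounted rate, messages at the full rate."""
    return FIXED_OVERHEAD * (1 - cache_discount) + MESSAGES

for provider, discount in [("Kimi (75% discount)", 0.75),
                           ("DeepSeek (90% discount)", 0.90),
                           ("no cache", 0.0)]:
    print(f"{provider}: {effective_billed(discount):,.0f} billed-token "
          f"equivalents per call")
```

Under these assumptions, DeepSeek's 90% discount brings a ~19K-token call down to roughly a third of its uncached billing, which is consistent with the community view that cache-friendliness is the biggest cost lever.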