General Frontier Model Feedback

Collection Date: 2026-04-09
Sources: GitHub issues, blog posts, community discussions, official documentation


Provider Support Matrix

Provider      | Status | Special Features
------------- | ------ | ----------------------------
OpenAI        | Full   | Codex OAuth support
Anthropic     | Full   | Claude Code credential store
OpenRouter    | Full   | 200+ models, flexible
Nous Portal   | Full   | OAuth, subscription
Kimi/Moonshot | Full   | 75% cache discount
DeepSeek      | Full   | 90% cache discount
MiniMax       | Full   | Token plan support
z.ai/GLM      | Full   | China/global endpoints
Gemini        | Full   | Via OpenRouter or direct

Key Feedback Themes

1. Token Overhead is the Hidden Cost

Critical Issue: Every API call includes ~13.9K tokens of fixed overhead

Source: GitHub Issue #4379

"The dollar cost depends on the model/provider, but the token overhead is constant — these 13.9K tokens are sent on every call whether using Sonnet, Haiku, Llama, or any other model via OpenRouter."

Breakdown (percentages are of a typical ~19K-token request; tool definitions plus system prompt make up the ~13.9K fixed overhead):

  • Tool definitions: 8,759 tokens (46%)
  • System prompt: 5,176 tokens (27%)
  • Actual messages: ~5,000 tokens (27%)

Impact on Costs:

  • A "simple weather query" can cost 21,000 tokens when the agent spawns a terminal
  • One user reported: "4 million tokens in 2 hours of light usage"
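To see how the fixed overhead compounds, here is a quick back-of-the-envelope sketch. The token counts come from the breakdown above; the call count and per-million price are illustrative assumptions, not measured values or any provider's actual pricing.

```python
# Back-of-the-envelope cost of the fixed per-call overhead.
TOOL_DEFS = 8_759      # tokens of tool definitions (from the issue above)
SYSTEM_PROMPT = 5_176  # tokens of system prompt (from the issue above)
OVERHEAD = TOOL_DEFS + SYSTEM_PROMPT  # ~13.9K fixed tokens on every call

def overhead_cost(calls: int, usd_per_million_input: float) -> float:
    """Dollar cost attributable to the fixed overhead alone."""
    return calls / 1_000_000 * OVERHEAD * usd_per_million_input

# A session of ~300 calls at an assumed $3.00/M input tokens:
print(OVERHEAD)                            # 13935 tokens per call
print(round(overhead_cost(300, 3.00), 2))  # 12.54 dollars of pure overhead
```

Even before any actual conversation tokens, a few hundred calls burn millions of overhead tokens, which is consistent with the "4 million tokens in 2 hours" report above.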

2. CLI vs Gateway Token Disparity (Fixed in v0.6.0)

Bug (pre-v0.6.0): Telegram used 2-3x more tokens than CLI

Access Method  | Tokens/Request
-------------- | --------------
CLI            | 6,000-8,000
Telegram (old) | 15,000-20,000

Root Cause: Gateway started in repo directory instead of home directory

Fix: Update to v0.6.0+ and restart gateway

3. Tool Reliability by Provider

Most Reliable:

  1. Claude Sonnet (excellent tool calling)
  2. GPT-4 class models (very reliable)
  3. Kimi K2.5 (good for the price)

Acceptable:

  • MiniMax
  • DeepSeek
  • Gemini

Variable:

  • Depends on specific task complexity
  • Budget models may struggle with novel tools

Cost Management Strategies

Strategy 1: Tiered Model Usage

Complex reasoning → Claude Sonnet / GPT-4
Routine tasks → Kimi K2.5 / MiniMax
Vision tasks → Gemini Flash / GPT-4o
Maximum savings → DeepSeek with cache
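The tiers above can be sketched as a simple routing map. The category keys and the `route()` helper are hypothetical illustrations (not a hermes API), and the model identifiers are shorthand rather than exact API model names.

```python
# Minimal sketch of tiered model usage: map task categories to models.
TIERS = {
    "complex_reasoning": "claude-sonnet",  # or a GPT-4-class model
    "routine": "kimi-k2.5",                # or MiniMax
    "vision": "gemini-flash",              # or GPT-4o
    "bulk": "deepseek-chat",               # cache-friendly, maximum savings
}

def route(task_kind: str) -> str:
    """Pick a model for a task category, defaulting to the budget tier."""
    return TIERS.get(task_kind, TIERS["routine"])

print(route("vision"))   # gemini-flash
print(route("unknown"))  # falls back to kimi-k2.5
```

The useful property of routing by category rather than per-request heuristics is that the expensive tier is opt-in: anything unclassified lands on the cheap default.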

Strategy 2: Session Management

  • Use hermes --fresh for unrelated tasks
  • Run token-intensive work in CLI vs gateway
  • Monitor with /usage command

Strategy 3: Toolset Optimization

  • Disable unused skill categories (~2,200 tokens saved)
  • Use platform-specific toolsets (~1,300 tokens saved)
  • Keep MEMORY.md lean
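The per-call savings from these tweaks can be put in perspective against the ~13.9K-token fixed overhead. The figures below are taken directly from the bullets above; the percentage is simple arithmetic, not a benchmark.

```python
# Rough per-call savings from toolset optimization, relative to the
# ~13.9K-token fixed overhead discussed earlier in this document.
BASE_OVERHEAD = 13_935  # tool definitions + system prompt, in tokens
SAVINGS = {
    "disable_unused_skill_categories": 2_200,
    "platform_specific_toolsets": 1_300,
}

saved = sum(SAVINGS.values())
print(saved)                               # 3500 tokens saved per call
print(round(saved / BASE_OVERHEAD * 100))  # 25 (% of the fixed overhead)
```

A quarter of the fixed overhead removed on every single call adds up quickly on long sessions.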

Provider-Specific Notes

OpenRouter

  • Best for: Flexibility, trying different models
  • Pros: 200+ models, single API key
  • Cons: Cache support depends on upstream

Anthropic/Claude

  • Best for: Complex reasoning, reliability
  • Pros: Excellent tool calling, context understanding
  • Cons: Higher cost, no special cache discounts

Nous Portal

  • Best for: Supporting the project, native integration
  • Pros: OAuth, built-in support
  • Cons: Subscription model

Budget Providers (Kimi, DeepSeek, MiniMax)

  • Best for: High volume, routine tasks
  • Pros: 50-90% cost savings, fast
  • Cons: May struggle with complex tasks
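The cache discounts in the matrix above translate into a blended input price that depends on your cache-hit rate. This sketch assumes a $1.00/M base price and an 80% hit rate purely for illustration; only the discount percentages (75% Kimi, 90% DeepSeek) come from the document.

```python
# Blended $/M input price when a provider discounts cached tokens.
def effective_price(base: float, cache_discount: float, hit_rate: float) -> float:
    """Mix full-price (cache-miss) and discounted (cache-hit) tokens."""
    return base * (1 - hit_rate) + base * (1 - cache_discount) * hit_rate

# Assumed $1.00/M base price, 80% of input tokens served from cache:
print(round(effective_price(1.00, 0.75, 0.8), 2))  # 0.4  (75% discount)
print(round(effective_price(1.00, 0.90, 0.8), 2))  # 0.28 (90% discount)
```

This is why "choosing a cache-friendly provider" (quoted below) is such a large lever: with a high hit rate, a 90% cache discount cuts the effective input price by nearly three quarters.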

Community Quotes

On Cost:

"Choosing a cache-friendly provider is the single biggest lever for reducing costs."

On Performance:

"One developer reported that a task taking OpenClaw 50+ tool calls and steps took Hermes 5 correct tool calls and finished 2.5 minutes faster."

On Model Selection:

"For most users: Kimi K2.5 from Moonshot or MiniMax as a daily driver — both are fast, capable, and inexpensive."


Summary Table

Provider      | Cost   | Reliability | Speed  | Cache
------------- | ------ | ----------- | ------ | -------
Claude Sonnet |        |             |        | Partial
GPT-4         |        |             |        | Partial
Kimi K2.5     | $$     |             |        | 75%
DeepSeek      | $      |             |        | 90%
MiniMax       | $$     |             |        | No
OpenRouter    | Varies | Varies      | Varies | Varies

Legend:

  • $ = Budget friendly
  • Cache % = Discount on cached tokens