Harnesses under analysis:

- opencode (Go-based coding agent)
- pi (minimal terminal coding harness by Mario Zechner)
- hermes (Nous Research agent)
- forgecode (AI pair programmer with sub-agents)

Each harness folder contains:

- repo/: Source code from respective repositories
- feedback/localllm/: Community feedback for local/smaller models
- feedback/frontier/: Community feedback for frontier models

Research focus: Tool handling, skills systems, prompt engineering, context management, and best practices for smaller/local models.
# Terminal-Bench Benchmark Results

Collection Date: 2026-04-09
Sources: arXiv papers, official docs, community discussions
## About Terminal-Bench

Paper: [2601.11868] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Key Finding: "We show that frontier models and agents score less than 65% on the benchmark"

Dataset: NousResearch/terminal-bench-2
Legacy: terminal-bench-core v0.1.1
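For a quick look at what the tasks contain, the terminal-bench-2 dataset can be pulled from the Hugging Face Hub. A minimal sketch, assuming the dataset is hosted in a standard Hub format; the split and record layout are whatever the dataset actually ships, not something confirmed here:

```python
# Sketch: inspect tasks in NousResearch/terminal-bench-2.
# Assumes a standard Hub dataset layout; splits/columns are not confirmed here.
from datasets import load_dataset

ds = load_dataset("NousResearch/terminal-bench-2")
print(ds)                     # show available splits and columns
first_split = next(iter(ds))  # take whichever split exists
print(ds[first_split][0])     # dump one task record
```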
## Hermes Agent Benchmark Support

### Configuration

```yaml
env:
  enabled_toolsets: ["terminal", "file"]  # toolsets exposed to the agent
  max_agent_turns: 60                     # cap on agent loop iterations
  max_token_length: 32000                 # context budget in tokens
  agent_temperature: 0.8                  # sampling temperature
  terminal_backend: "modal"               # Modal backend, for isolation
  terminal_timeout: 300                   # seconds per terminal operation
  dataset_name: "NousResearch/terminal-bench-2"
  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
```
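Before launching a run, it helps to check that the config parses and carries the fields the harness expects. A minimal sketch, assuming the block above is saved as `benchmark.yaml`; the filename and the required-key list are illustrative assumptions, not from the Hermes docs:

```python
# Sketch: load and sanity-check the benchmark config shown above.
# "benchmark.yaml" and REQUIRED_KEYS are illustrative assumptions.
import yaml

REQUIRED_KEYS = {
    "enabled_toolsets", "max_agent_turns", "max_token_length",
    "agent_temperature", "terminal_backend", "terminal_timeout",
    "dataset_name", "tokenizer_name",
}

with open("benchmark.yaml") as f:
    cfg = yaml.safe_load(f)["env"]

missing = REQUIRED_KEYS - cfg.keys()
if missing:
    raise ValueError(f"config missing keys: {sorted(missing)}")
if cfg["terminal_timeout"] <= 0:
    raise ValueError("terminal_timeout must be a positive number of seconds")
print(f"OK: {cfg['dataset_name']}, {cfg['max_agent_turns']} max turns")
```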
### Running Evaluations

```bash
# Install the Terminal-Bench CLI (provides the `tb` command)
pip install terminal-bench

# Run the legacy terminal-bench-core dataset
tb run \
  --agent terminus \
  --model anthropic/claude-3-7-sonnet-latest \
  --dataset-name terminal-bench-core \
  --dataset-version 0.1.1 \
  --n-concurrent 8
```
## YC-Bench (Strategic Benchmark)

Description: Long-horizon strategic benchmark in which the agent plays CEO of an AI startup

Setup:

```bash
pip install "hermes-agent[yc-bench]"
bash environments/benchmarks/yc_bench/run_eval.sh
```
## Community Benchmark Results

### WebArena Performance

Source: https://www.reddit.com/r/LocalLLM/comments/1scglgq/i_looked_into_hermes_agent_architecture_to_dig/

"It identified 11 websites from pure text and hit 60% testing WebArena tasks without tuning"

### Multi-Agent Performance

Source: https://ghost.codersera.com/blog/hermes-agent-guide-to-multi-agent-ai-setup/

"In that study, a multi-agent Hermes setup reached 75–85% success on complex network design tasks, above chain-of-thought baselines."
## Benchmark Architecture

### Environments Available

- Terminal-Bench - Command line tasks
- YC-Bench - Strategic business simulation
- TBLite - Thin subclass of TerminalBench2 (OpenThoughts Agent team)
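The "thin subclass" phrasing implies TBLite inherits the TerminalBench2 environment wholesale and overrides only a small surface, such as the task set. The sketch below illustrates that pattern only: the class names mirror the ones mentioned above, but the bodies, attributes, and task cap are hypothetical, not code from hermes-agent.

```python
# Hypothetical illustration of a "thin subclass" environment.
# TerminalBench2 is stubbed; nothing here is confirmed hermes-agent API.
class TerminalBench2:
    dataset_name = "NousResearch/terminal-bench-2"

    def load_tasks(self) -> list[str]:
        # Stub: a real environment would pull tasks from the dataset.
        return [f"task-{i}" for i in range(100)]


class TBLite(TerminalBench2):
    """Thin subclass: reuse the harness, shrink the task set."""

    max_tasks = 50  # illustrative cap, not a real TBLite setting

    def load_tasks(self) -> list[str]:
        return super().load_tasks()[: self.max_tasks]


print(len(TBLite().load_tasks()))  # -> 50
```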
### Key Capabilities Tested

- Tool use accuracy
- Multi-step reasoning
- Context management
- Error recovery (see the loop sketch below)
- Long-horizon planning
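Several of these capabilities reduce to the shape of the agent loop: a turn-capped cycle that executes a command, inspects the result, and retries on failure. The sketch below is a schematic of that pattern under a 60-turn budget (matching max_agent_turns above); the `run_command` helper is hypothetical and this is not hermes-agent code.

```python
import subprocess

MAX_TURNS = 60  # mirrors max_agent_turns in the Hermes config above

def run_command(cmd: str, timeout: int = 300) -> tuple[int, str]:
    """Hypothetical helper: run one shell command, return (exit code, output)."""
    proc = subprocess.run(cmd, shell=True, capture_output=True,
                          text=True, timeout=timeout)
    return proc.returncode, proc.stdout + proc.stderr

def agent_loop(plan: list[str]) -> bool:
    """Schematic turn-capped loop with one-retry error recovery per step."""
    turns = 0
    for step in plan:
        retried = False
        while turns < MAX_TURNS:
            turns += 1
            code, _output = run_command(step)
            if code == 0:
                break          # step succeeded, move to the next one
            if retried:
                return False   # error recovery exhausted for this step
            retried = True     # retry the failing step once
        else:
            return False       # turn budget exhausted
    return True

print(agent_loop(["echo hello", "true"]))  # -> True on a Unix shell
```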
## Benchmarking Best Practices

### Using Harbor Framework

Leaderboard: https://www.tbench.ai/leaderboard

Versions:

- Terminal-Bench 2.0 (latest) - via Harbor
- Terminal-Bench-Core v0.1.1 (legacy)
### Configuration Tips

- Match context length to model capabilities
- Set appropriate timeouts (300s for complex tasks)
- Use the Modal backend for isolation
- Enable concurrent runs for faster evaluation (see the sketch below)
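Applied to a smaller local model, those tips might look like the following. A minimal sketch reusing the field names from the Hermes config above; the 8,000-token budget and the tokenizer name are placeholder assumptions for a small local model, not values from any documentation.

```python
# Sketch: configuration tips applied to a hypothetical small local model.
# Field names mirror the Hermes config above; the values are illustrative.
small_model_env = {
    "enabled_toolsets": ["terminal", "file"],
    "max_token_length": 8000,     # match the context budget to the model's window
    "terminal_timeout": 300,      # generous timeout for complex tasks
    "terminal_backend": "modal",  # isolate each run in a sandbox
    "dataset_name": "NousResearch/terminal-bench-2",
    "tokenizer_name": "Qwen/Qwen2.5-Coder-7B-Instruct",  # placeholder local model
}
n_concurrent = 8  # parallel rollouts speed up evaluation (cf. tb --n-concurrent)

print(small_model_env["max_token_length"], "token budget;", n_concurrent, "workers")
```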
## Research Focus

The hermes-agent project uses these benchmarks to track:
- Tool handling effectiveness
- Skills system impact on performance
- Prompt engineering strategies
- Context management efficiency
- Performance on smaller/local models
## Summary
| Benchmark | Hermes Result | Notes |
|---|---|---|
| WebArena | 60% | Without tuning |
| Multi-agent network design | 75-85% | Above CoT baseline |
| Terminal-Bench | N/A | Framework supported |
| YC-Bench | N/A | Strategic CEO simulation |
Key Takeaway: community-reported results suggest Hermes Agent performs well on agentic benchmarks, particularly in multi-agent configurations; Terminal-Bench and YC-Bench support is in place, but no scores for them are recorded here.