# Terminal-Bench Benchmark Results

**Collection Date:** 2026-04-09
**Sources:** arXiv papers, official docs, community discussions


## About Terminal-Bench

**Paper:** [2601.11868] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

**Key finding:**

> "We show that frontier models and agents score less than 65% on the benchmark"

**Dataset:** NousResearch/terminal-bench-2
**Legacy:** terminal-bench-core v0.1.1
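
If the dataset is hosted on the Hugging Face Hub, as the `NousResearch/` prefix suggests, it can presumably be inspected with the `datasets` library. A hedged sketch (split names and schema are not guaranteed):

```python
# Sketch: pull the benchmark dataset for local inspection.
# Assumes NousResearch/terminal-bench-2 is a standard Hub dataset
# readable by `datasets`.
from datasets import load_dataset

ds = load_dataset("NousResearch/terminal-bench-2")
print(ds)  # shows available splits and row counts
```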


## Hermes Agent Benchmark Support

### Configuration

```yaml
env:
  enabled_toolsets: ["terminal", "file"]
  max_agent_turns: 60
  max_token_length: 32000
  agent_temperature: 0.8
  terminal_backend: "modal"
  terminal_timeout: 300
  dataset_name: "NousResearch/terminal-bench-2"
  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
```
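
A minimal sanity-check sketch for the block above, assuming it is saved as `env.yaml`; this is illustrative PyYAML code, not hermes-agent's own config loader:

```python
# Load and sanity-check the env config shown above (file name assumed).
import yaml  # pip install pyyaml

with open("env.yaml") as f:
    cfg = yaml.safe_load(f)["env"]

# Guard against settings that commonly cause silent failures.
assert cfg["max_token_length"] <= 32000, "context must fit the model window"
assert cfg["terminal_timeout"] >= 300, "complex tasks need generous timeouts"
print(f"dataset={cfg['dataset_name']} toolsets={cfg['enabled_toolsets']}")
```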

### Running Evaluations

```bash
# Install the tb CLI, then launch an evaluation run
pip install terminal-bench

tb run \
  --agent terminus \
  --model anthropic/claude-3-7-latest \
  --dataset-name terminal-bench-core \
  --dataset-version 0.1.1 \
  --n-concurrent 8
```
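
To sweep several models through the same invocation, a thin wrapper works; this sketch uses only the flags shown above, and the model list is a placeholder:

```python
# Run the tb command above once per model (model IDs are placeholders).
import subprocess

MODELS = ["anthropic/claude-3-7-latest"]  # extend with your own model IDs

for model in MODELS:
    subprocess.run(
        ["tb", "run",
         "--agent", "terminus",
         "--model", model,
         "--dataset-name", "terminal-bench-core",
         "--dataset-version", "0.1.1",
         "--n-concurrent", "8"],
        check=True,  # stop the sweep if a run fails
    )
```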

## YC-Bench (Strategic Benchmark)

**Description:** Long-horizon strategic benchmark in which the agent plays the CEO of an AI startup

**Setup:**

```bash
pip install "hermes-agent[yc-bench]"
bash environments/benchmarks/yc_bench/run_eval.sh
```

## Community Benchmark Results

### WebArena Performance

**Source:** https://www.reddit.com/r/LocalLLM/comments/1scglgq/i_looked_into_hermes_agent_architecture_to_dig/

> "It identified 11 websites from pure text and hit 60% testing WebArena tasks without tuning"

### Multi-Agent Performance

**Source:** https://ghost.codersera.com/blog/hermes-agent-guide-to-multi-agent-ai-setup/

> "In that study, a multi-agent Hermes setup reached 75-85% success on complex network design tasks, above chain-of-thought baselines."


## Benchmark Architecture

### Environments Available

1. **Terminal-Bench** - Command-line tasks
2. **YC-Bench** - Strategic business simulation
3. **TBLite** - Thin subclass of TerminalBench2 (OpenThoughts Agent team); see the sketch after this list
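
For item 3, "thin subclass" presumably means overriding only the task selection. A hypothetical sketch of the pattern (both class bodies are stand-ins, not the OpenThoughts code):

```python
# Stand-in classes illustrating the "thin subclass" pattern only.
class TerminalBench2:
    dataset_name = "NousResearch/terminal-bench-2"

    def tasks(self) -> list[str]:
        # The real harness would enumerate actual benchmark tasks here.
        return [f"task-{i}" for i in range(100)]


class TBLite(TerminalBench2):
    """Same harness, but a small task slice for quick iteration."""

    def tasks(self) -> list[str]:
        return super().tasks()[:20]  # subset size is an assumption
```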

### Key Capabilities Tested

- Tool use accuracy
- Multi-step reasoning
- Context management
- Error recovery (see the sketch after this list)
- Long-horizon planning
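
As a concrete reading of "error recovery": feed a failed tool call's output back to the model and retry. A toy sketch, not hermes-agent's actual loop; `call_model` is a hypothetical stand-in for any chat-completion client:

```python
# Toy error-recovery loop: retry a shell command, letting the model
# repair it from the error output each time.
import subprocess

def run_tool(cmd: str) -> tuple[int, str]:
    """Execute a shell command, returning (exit code, combined output)."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def recover(cmd: str, call_model, max_retries: int = 3) -> str:
    for _ in range(max_retries):
        code, output = run_tool(cmd)
        if code == 0:
            return output
        # Hand the failure back to the model and ask for a fixed command.
        cmd = call_model(f"This command failed:\n{cmd}\n{output}\nReturn a fixed command.")
    raise RuntimeError(f"gave up after {max_retries} attempts: {cmd}")
```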

## Benchmarking Best Practices

### Using the Harbor Framework

**Leaderboard:** https://www.tbench.ai/leaderboard

**Versions:**

- Terminal-Bench 2.0 (latest), run via Harbor
- Terminal-Bench-Core v0.1.1 (legacy)

### Configuration Tips

1. Match context length to the model's capabilities (see the tokenizer sketch after this list)
2. Set appropriate timeouts (300s for complex tasks)
3. Use the Modal backend for isolation
4. Enable concurrent runs for faster evaluation
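
For tip 1, the tokenizer named in the configuration can be used to check that a prompt fits within `max_token_length` before launching a run. A sketch (the headroom policy is an assumption):

```python
# Verify a prompt fits the configured context window before running.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NousResearch/Hermes-3-Llama-3.1-8B")
MAX_TOKENS = 32000  # max_token_length from the config above

def fits(prompt: str) -> bool:
    """True if the prompt leaves headroom for the agent's own turns."""
    return len(tok.encode(prompt)) <= MAX_TOKENS // 2  # reserve half

print(fits("Inspect the repo and summarize failing tests."))
```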

## Research Focus

The hermes-agent project uses these benchmarks to track:

1. Tool handling effectiveness
2. Skills system impact on performance
3. Prompt engineering strategies
4. Context management efficiency
5. Performance on smaller/local models

## Summary

| Benchmark | Hermes Result | Notes |
|---|---|---|
| WebArena | 60% | Without tuning |
| Multi-agent network design | 75-85% | Above CoT baseline |
| Terminal-Bench | N/A | Framework supported |
| YC-Bench | N/A | Strategic CEO simulation |

**Key Takeaway:** Hermes Agent demonstrates strong performance on agentic benchmarks, particularly in multi-agent configurations and real-world task completion.