# Terminal-Bench Benchmark Results

**Collection Date:** 2026-04-09
**Sources:** arXiv papers, official docs, community discussions

---

## About Terminal-Bench

**Paper:** [2601.11868] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

**Key Finding:**

> "We show that frontier models and agents score less than 65% on the benchmark"

**Dataset:** NousResearch/terminal-bench-2
**Legacy:** terminal-bench-core v0.1.1

---

## Hermes Agent Benchmark Support

### Configuration

```yaml
env:
  enabled_toolsets: ["terminal", "file"]
  max_agent_turns: 60
  max_token_length: 32000
  agent_temperature: 0.8
  terminal_backend: "modal"
  terminal_timeout: 300
  dataset_name: "NousResearch/terminal-bench-2"
  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
```

### Running Evaluations

```bash
# Install terminal-bench first, then run:
tb run \
  --agent terminus \
  --model anthropic/claude-3-7-latest \
  --dataset-name terminal-bench-core \
  --dataset-version 0.1.1 \
  --n-concurrent 8
```

---

## YC-Bench (Strategic Benchmark)

**Description:** Long-horizon strategic benchmark in which the agent plays CEO of an AI startup

**Setup:**

```bash
pip install "hermes-agent[yc-bench]"
bash environments/benchmarks/yc_bench/run_eval.sh
```

---

## Community Benchmark Results

### WebArena Performance

**Source:** https://www.reddit.com/r/LocalLLM/comments/1scglgq/i_looked_into_hermes_agent_architecture_to_dig/

> "It identified 11 websites from pure text and hit 60% testing WebArena tasks without tuning"

### Multi-Agent Performance

**Source:** https://ghost.codersera.com/blog/hermes-agent-guide-to-multi-agent-ai-setup/

> "In that study, a multi-agent Hermes setup reached 75–85% success on complex network design tasks, above chain-of-thought baselines."

---

## Benchmark Architecture

### Environments Available

1. **Terminal-Bench** - Command line tasks
2. **YC-Bench** - Strategic business simulation
3. **TBLite** - Thin subclass of TerminalBench2 (OpenThoughts Agent team)

### Key Capabilities Tested

- Tool use accuracy
- Multi-step reasoning
- Context management
- Error recovery
- Long-horizon planning

---

## Benchmarking Best Practices

### Using Harbor Framework

**Leaderboard:** https://www.tbench.ai/leaderboard

**Versions:**

- Terminal-Bench 2.0 (latest) - via Harbor
- Terminal-Bench-Core v0.1.1 (legacy)

### Configuration Tips

1. **Match context length** to model capabilities
2. **Set appropriate timeouts** (300 s for complex tasks)
3. **Use the Modal backend** for isolation
4. **Enable concurrent runs** for faster evaluation

---

## Research Focus

The hermes-agent project uses these benchmarks to track:

1. **Tool handling effectiveness**
2. **Skills system impact** on performance
3. **Prompt engineering strategies**
4. **Context management efficiency**
5. **Performance on smaller/local models**

---

## Summary

| Benchmark | Hermes Result | Notes |
|-----------|---------------|-------|
| WebArena | 60% | Without tuning |
| Multi-agent network design | 75-85% | Above CoT baseline |
| Terminal-Bench | N/A | Framework supported |
| YC-Bench | N/A | Strategic CEO simulation |

**Key Takeaway:** Hermes Agent demonstrates strong performance on agentic benchmarks, particularly in multi-agent configurations and real-world task completion.
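As a rough illustration of the configuration tips above (a 300 s per-task timeout and 8 concurrent runs), here is a minimal Python sketch of a concurrent evaluation loop. `run_task` is a hypothetical stand-in, not the actual Terminal-Bench or Hermes API:

```python
import concurrent.futures as cf

TIMEOUT_S = 300     # per-task wait ceiling, mirroring the 300 s timeout tip
N_CONCURRENT = 8    # mirrors --n-concurrent 8

def run_task(task_id: str) -> dict:
    """Hypothetical stand-in for executing one benchmark task."""
    return {"task": task_id, "resolved": True}

def evaluate(task_ids):
    """Run tasks concurrently, marking any task that overruns as unresolved."""
    results = []
    with cf.ThreadPoolExecutor(max_workers=N_CONCURRENT) as pool:
        futures = {pool.submit(run_task, t): t for t in task_ids}
        for fut, task_id in futures.items():
            try:
                # The timeout bounds how long we wait for a result;
                # it does not kill a still-running task.
                results.append(fut.result(timeout=TIMEOUT_S))
            except cf.TimeoutError:
                results.append({"task": task_id, "resolved": False})
    return results

outcomes = evaluate(["hello-world", "fix-permissions"])
print(f"{sum(r['resolved'] for r in outcomes)}/{len(outcomes)} resolved")  # → 2/2 resolved
```

A real harness would launch each task in an isolated backend (e.g. Modal, per the tips above) rather than a thread pool, but the timeout and concurrency knobs map directly onto the config shown earlier.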