# Terminal-Bench Benchmark Results

**Collection Date:** 2026-04-09
**Sources:** arXiv papers, official docs, community discussions

---

## About Terminal-Bench

**Paper:** [2601.11868] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

**Key Finding:**

> "We show that frontier models and agents score less than 65% on the benchmark"

**Dataset:** NousResearch/terminal-bench-2
**Legacy:** terminal-bench-core v0.1.1

---

## Hermes Agent Benchmark Support

### Configuration

```yaml
env:
  enabled_toolsets: ["terminal", "file"]
  max_agent_turns: 60
  max_token_length: 32000
  agent_temperature: 0.8
  terminal_backend: "modal"
  terminal_timeout: 300
  dataset_name: "NousResearch/terminal-bench-2"
  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
```

### Running Evaluations

```bash
# Install terminal-bench first, then run:
tb run \
  --agent terminus \
  --model anthropic/claude-3-7-latest \
  --dataset-name terminal-bench-core \
  --dataset-version 0.1.1 \
  --n-concurrent 8
```

---

## YC-Bench (Strategic Benchmark)

**Description:** Long-horizon strategic benchmark in which the agent plays CEO of an AI startup

**Setup:**

```bash
pip install "hermes-agent[yc-bench]"
bash environments/benchmarks/yc_bench/run_eval.sh
```

---

## Community Benchmark Results

### WebArena Performance

**Source:** https://www.reddit.com/r/LocalLLM/comments/1scglgq/i_looked_into_hermes_agent_architecture_to_dig/

> "It identified 11 websites from pure text and hit 60% testing WebArena tasks without tuning"

### Multi-Agent Performance

**Source:** https://ghost.codersera.com/blog/hermes-agent-guide-to-multi-agent-ai-setup/

> "In that study, a multi-agent Hermes setup reached 75–85% success on complex network design tasks, above chain-of-thought baselines."

---

## Benchmark Architecture

### Environments Available

1. **Terminal-Bench** - Command line tasks
2. **YC-Bench** - Strategic business simulation
3. **TBLite** - Thin subclass of TerminalBench2 (OpenThoughts Agent team)

### Key Capabilities Tested

- Tool use accuracy
- Multi-step reasoning
- Context management
- Error recovery
- Long-horizon planning

---

## Benchmarking Best Practices

### Using Harbor Framework

**Leaderboard:** https://www.tbench.ai/leaderboard

**Versions:**

- Terminal-Bench 2.0 (latest) - via Harbor
- Terminal-Bench-Core v0.1.1 (legacy)

### Configuration Tips

1. **Match context length** to model capabilities
2. **Set appropriate timeouts** (300 s for complex tasks)
3. **Use the Modal backend** for isolation
4. **Enable concurrent runs** for faster evaluation

---

## Research Focus

The hermes-agent project uses these benchmarks to track:

1. **Tool handling effectiveness**
2. **Skills system impact** on performance
3. **Prompt engineering strategies**
4. **Context management efficiency**
5. **Performance on smaller/local models**

---

## Summary

| Benchmark | Hermes Result | Notes |
|-----------|---------------|-------|
| WebArena | 60% | Without tuning |
| Multi-agent network design | 75-85% | Above CoT baseline |
| Terminal-Bench | N/A | Framework supported |
| YC-Bench | N/A | Strategic CEO simulation |

**Key Takeaway:** Hermes Agent demonstrates strong performance on agentic benchmarks, particularly in multi-agent configurations and real-world task completion.
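As a rough illustration of the configuration tips above (a 300 s per-task timeout and 8 concurrent runs), here is a minimal Python sketch of a concurrent evaluation loop. `run_task` is a hypothetical stand-in, not the actual Terminal-Bench or Hermes API:

```python
import concurrent.futures as cf

TIMEOUT_S = 300     # per-task wait ceiling, mirroring the 300 s timeout tip
N_CONCURRENT = 8    # mirrors --n-concurrent 8

def run_task(task_id: str) -> dict:
    """Hypothetical stand-in for executing one benchmark task."""
    return {"task": task_id, "resolved": True}

def evaluate(task_ids):
    """Run tasks concurrently, marking any task that overruns as unresolved."""
    results = []
    with cf.ThreadPoolExecutor(max_workers=N_CONCURRENT) as pool:
        futures = {pool.submit(run_task, t): t for t in task_ids}
        for fut, task_id in futures.items():
            try:
                # The timeout bounds how long we wait for a result;
                # it does not kill a still-running task.
                results.append(fut.result(timeout=TIMEOUT_S))
            except cf.TimeoutError:
                results.append({"task": task_id, "resolved": False})
    return results

outcomes = evaluate(["hello-world", "fix-permissions"])
print(f"{sum(r['resolved'] for r in outcomes)}/{len(outcomes)} resolved")  # → 2/2 resolved
```

A real harness would launch each task in an isolated backend (e.g. Modal, per the tips above) rather than a thread pool, but the timeout and concurrency knobs map directly onto the config shown earlier.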