# Terminal-Bench Benchmark Results

**Collection Date:** 2026-04-09
**Sources:** arXiv papers, official docs, community discussions


## About Terminal-Bench

**Paper:** [2601.11868] Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

**Key finding:**

> "We show that frontier models and agents score less than 65% on the benchmark"

**Dataset:** NousResearch/terminal-bench-2
**Legacy:** terminal-bench-core v0.1.1
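
If the dataset is hosted on the Hugging Face Hub, as the `NousResearch/` prefix suggests, it can presumably be inspected with the `datasets` library. A hedged sketch (split names and schema are not guaranteed):

```python
# Sketch: pull the benchmark dataset for local inspection.
# Assumes NousResearch/terminal-bench-2 is a standard Hub dataset
# readable by `datasets`.
from datasets import load_dataset

ds = load_dataset("NousResearch/terminal-bench-2")
print(ds)  # shows available splits and row counts
```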


## Hermes Agent Benchmark Support

### Configuration

```yaml
env:
  enabled_toolsets: ["terminal", "file"]
  max_agent_turns: 60
  max_token_length: 32000
  agent_temperature: 0.8
  terminal_backend: "modal"
  terminal_timeout: 300
  dataset_name: "NousResearch/terminal-bench-2"
  tokenizer_name: "NousResearch/Hermes-3-Llama-3.1-8B"
```
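
A minimal sanity-check sketch for the block above, assuming it is saved as `env.yaml`; this is illustrative PyYAML code, not hermes-agent's own config loader:

```python
# Load and sanity-check the env config shown above (file name assumed).
import yaml  # pip install pyyaml

with open("env.yaml") as f:
    cfg = yaml.safe_load(f)["env"]

# Guard against settings that commonly cause silent failures.
assert cfg["max_token_length"] <= 32000, "context must fit the model window"
assert cfg["terminal_timeout"] >= 300, "complex tasks need generous timeouts"
print(f"dataset={cfg['dataset_name']} toolsets={cfg['enabled_toolsets']}")
```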

### Running Evaluations

```bash
# Install the tb CLI, then launch an evaluation run
pip install terminal-bench

tb run \
  --agent terminus \
  --model anthropic/claude-3-7-latest \
  --dataset-name terminal-bench-core \
  --dataset-version 0.1.1 \
  --n-concurrent 8
```
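
To sweep several models through the same invocation, a thin wrapper works; this sketch uses only the flags shown above, and the model list is a placeholder:

```python
# Run the tb command above once per model (model IDs are placeholders).
import subprocess

MODELS = ["anthropic/claude-3-7-latest"]  # extend with your own model IDs

for model in MODELS:
    subprocess.run(
        ["tb", "run",
         "--agent", "terminus",
         "--model", model,
         "--dataset-name", "terminal-bench-core",
         "--dataset-version", "0.1.1",
         "--n-concurrent", "8"],
        check=True,  # stop the sweep if a run fails
    )
```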

## YC-Bench (Strategic Benchmark)

**Description:** Long-horizon strategic benchmark in which the agent plays the CEO of an AI startup

**Setup:**

```bash
pip install "hermes-agent[yc-bench]"
bash environments/benchmarks/yc_bench/run_eval.sh
```

## Community Benchmark Results

### WebArena Performance

**Source:** https://www.reddit.com/r/LocalLLM/comments/1scglgq/i_looked_into_hermes_agent_architecture_to_dig/

> "It identified 11 websites from pure text and hit 60% testing WebArena tasks without tuning"

### Multi-Agent Performance

**Source:** https://ghost.codersera.com/blog/hermes-agent-guide-to-multi-agent-ai-setup/

> "In that study, a multi-agent Hermes setup reached 75-85% success on complex network design tasks, above chain-of-thought baselines."


## Benchmark Architecture

### Environments Available

1. **Terminal-Bench** - Command-line tasks
2. **YC-Bench** - Strategic business simulation
3. **TBLite** - Thin subclass of TerminalBench2 (OpenThoughts Agent team); see the sketch after this list
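
For item 3, "thin subclass" presumably means overriding only the task selection. A hypothetical sketch of the pattern (both class bodies are stand-ins, not the OpenThoughts code):

```python
# Stand-in classes illustrating the "thin subclass" pattern only.
class TerminalBench2:
    dataset_name = "NousResearch/terminal-bench-2"

    def tasks(self) -> list[str]:
        # The real harness would enumerate actual benchmark tasks here.
        return [f"task-{i}" for i in range(100)]


class TBLite(TerminalBench2):
    """Same harness, but a small task slice for quick iteration."""

    def tasks(self) -> list[str]:
        return super().tasks()[:20]  # subset size is an assumption
```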

### Key Capabilities Tested

- Tool use accuracy
- Multi-step reasoning
- Context management
- Error recovery (see the sketch after this list)
- Long-horizon planning
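
As a concrete reading of "error recovery": feed a failed tool call's output back to the model and retry. A toy sketch, not hermes-agent's actual loop; `call_model` is a hypothetical stand-in for any chat-completion client:

```python
# Toy error-recovery loop: retry a shell command, letting the model
# repair it from the error output each time.
import subprocess

def run_tool(cmd: str) -> tuple[int, str]:
    """Execute a shell command, returning (exit code, combined output)."""
    proc = subprocess.run(cmd, shell=True, capture_output=True, text=True)
    return proc.returncode, proc.stdout + proc.stderr

def recover(cmd: str, call_model, max_retries: int = 3) -> str:
    for _ in range(max_retries):
        code, output = run_tool(cmd)
        if code == 0:
            return output
        # Hand the failure back to the model and ask for a fixed command.
        cmd = call_model(f"This command failed:\n{cmd}\n{output}\nReturn a fixed command.")
    raise RuntimeError(f"gave up after {max_retries} attempts: {cmd}")
```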

## Benchmarking Best Practices

### Using the Harbor Framework

**Leaderboard:** https://www.tbench.ai/leaderboard

**Versions:**

- Terminal-Bench 2.0 (latest), run via Harbor
- Terminal-Bench-Core v0.1.1 (legacy)

### Configuration Tips

1. Match context length to the model's capabilities (see the tokenizer sketch after this list)
2. Set appropriate timeouts (300s for complex tasks)
3. Use the Modal backend for isolation
4. Enable concurrent runs for faster evaluation
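
For tip 1, the tokenizer named in the configuration can be used to check that a prompt fits within `max_token_length` before launching a run. A sketch (the headroom policy is an assumption):

```python
# Verify a prompt fits the configured context window before running.
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("NousResearch/Hermes-3-Llama-3.1-8B")
MAX_TOKENS = 32000  # max_token_length from the config above

def fits(prompt: str) -> bool:
    """True if the prompt leaves headroom for the agent's own turns."""
    return len(tok.encode(prompt)) <= MAX_TOKENS // 2  # reserve half

print(fits("Inspect the repo and summarize failing tests."))
```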

## Research Focus

The hermes-agent project uses these benchmarks to track:

1. Tool handling effectiveness
2. Skills system impact on performance
3. Prompt engineering strategies
4. Context management efficiency
5. Performance on smaller/local models

## Summary

| Benchmark | Hermes Result | Notes |
|---|---|---|
| WebArena | 60% | Without tuning |
| Multi-agent network design | 75-85% | Above CoT baseline |
| Terminal-Bench | N/A | Framework supported |
| YC-Bench | N/A | Strategic CEO simulation |

**Key Takeaway:** Hermes Agent demonstrates strong performance on agentic benchmarks, particularly in multi-agent configurations and real-world task completion.