Terminal-X

Terminal-X is a suite of three terminal-based datasets for evaluating LLM coding agents along three capability axes: depth, iteration, and evolution. Every task is a self-contained problem inside a Docker image with execution-verified rewards, built on the Terminal-Bench 2.0 task format and the Harbor evaluation framework.

| Dataset | Capability | Tasks | Format |
| --- | --- | --- | --- |
| DeepTerminalBench | Depth | 50 | Single-shot deep engineering |
| EvoCode-Bench | Iteration | 26 | Multi-turn (5-15 rounds) |
| RoadmapBench | Evolution | 115 | Version-upgrade across phases |

All three datasets are included in this repo.

Project Structure

Terminal-X/
├── data/                            # Datasets
│   ├── DeepTerminalBench/           # 50 single-shot tasks (Harbor-compatible)
│   │   └── <task_name>/
│   │       ├── instruction.md       # task statement
│   │       ├── environment/         # Dockerfile + initial repo state
│   │       ├── tests/               # hidden test suite (test.sh + test_*.py)
│   │       ├── solution/            # reference solution (solve.sh + artifacts)
│   │       └── task.toml            # difficulty, timeouts, resource limits
│   │
│   ├── EvoCodeBench/                # 26 EvoCode-Bench multi-turn tasks (5-15 rounds each)
│   │   └── <theme_name>/
│   │       ├── instruction.md       # initial task statement
│   │       ├── environment/         # Dockerfile + starter code
│   │       ├── task.toml            # difficulty, round metadata, resource limits
│   │       └── round_N/             # per-round additions
│   │           ├── instruction.md   # incremental requirement for this round
│   │           ├── tests/           # hidden test suite for this round
│   │           └── solution/        # reference solution for this round
│   │
│   └── RoadmapBench/                # 115 version-upgrade tasks (Harbor-compatible)
│       └── <task_name>/
│           ├── instruction.md       # multi-phase upgrade roadmap
│           ├── environment/         # Dockerfile + repo at starting version
│           ├── tests/               # phase-aware test suite
│           ├── solution/            # reference upgrade (solve.sh + patch)
│           └── task.toml            # difficulty, timeouts, resource limits
│
├── eval/                            # Evaluation scripts
│   ├── run_pass1.sh                 # parallel multi-model Pass@1 runner (DeepTerminalBench / RoadmapBench)
│   ├── run_multiround_single.sh     # single-task multi-round runner (EvoCode-Bench)
│   ├── build_leaderboard.py         # aggregate results into TSV + Markdown
│   ├── rerun_failed.py              # retry only failed tasks for one model
│   └── happypass.sh                 # 1-task happy-pass endpoint check
│
├── assets/                          # Logo, screenshots
├── run.sh                           # one-command launcher
├── requirements.txt
└── README.md
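
The single-shot layout shown above (DeepTerminalBench and RoadmapBench) can be checked mechanically before a run. The following is a minimal sketch and not part of the repository: the helper names `missing_entries` and `check_dataset` are hypothetical, and the list of required entries is simply read off the tree above.

```python
from pathlib import Path

# Entries each single-shot task directory is expected to contain,
# as listed in the project tree (DeepTerminalBench / RoadmapBench).
REQUIRED_ENTRIES = ("instruction.md", "environment", "tests", "solution", "task.toml")


def missing_entries(task_dir: Path) -> list[str]:
    """Return the required entries that are absent from one task directory."""
    return [name for name in REQUIRED_ENTRIES if not (task_dir / name).exists()]


def check_dataset(dataset_dir: Path) -> dict[str, list[str]]:
    """Map each incomplete task name to the entries it is missing."""
    report = {}
    for task_dir in sorted(p for p in dataset_dir.iterdir() if p.is_dir()):
        if task_dir.name.startswith("."):
            continue  # skip hidden directories such as .runs/
        missing = missing_entries(task_dir)
        if missing:
            report[task_dir.name] = missing
    return report
```

Calling `check_dataset(Path("data/DeepTerminalBench"))` returns an empty dict when every task directory is complete. EvoCode-Bench tasks use the per-round layout instead, so this check does not apply to them as written.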

Quick Start

1. Prerequisites

# Python deps (light — only used by leaderboard / rerun helpers)
pip install -r requirements.txt

# uv — needed by all eval scripts to invoke Harbor
# See https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh

# Harbor evaluation framework (single-turn — for DeepTerminalBench / RoadmapBench)
git clone https://github.com/laude-institute/harbor.git /path/to/harbor
export HARBOR_REPO=/path/to/harbor

# Harbor multi-turn fork (for EvoCode-Bench)
git clone https://github.com/UniPat-AI/harbor_multiturn.git /path/to/harbor_multiturn
export HARBOR_MULTITURN_REPO=/path/to/harbor_multiturn

You also need Docker running (Harbor builds a fresh container per task).
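
A quick preflight can catch a missing prerequisite before a long sweep. This is a minimal sketch under stated assumptions: the function name `preflight` is hypothetical, and it only checks that the `uv` and `docker` executables are on `PATH` and that `HARBOR_REPO` points to a directory. It does not verify that the Docker daemon is actually running.

```python
import os
import shutil

REQUIRED_TOOLS = ("uv", "docker")
# Add "HARBOR_MULTITURN_REPO" here when running EvoCode-Bench.
REQUIRED_ENV_DIRS = ("HARBOR_REPO",)


def preflight(env=None, which=shutil.which) -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    env = os.environ if env is None else env
    problems = []
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    for var in REQUIRED_ENV_DIRS:
        path = env.get(var)
        if not path:
            problems.append(f"{var} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{var} does not point to a directory: {path}")
    return problems
```

Printing the result of `preflight()` before launching gives a short checklist of anything still to install or export.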

2. Configure API keys

The default MODEL_REGISTRY in eval/run_pass1.sh evaluates 9 frontier models routed through MindraCode (Anthropic / OpenAI / Google) and Qiniu (other providers). This registry is used by DeepTerminalBench and RoadmapBench (single-shot). Set whichever credentials you need:

export MINDRACODE_API_KEY="..."
export MINDRACODE_API_BASE="https://api.mindracode.com"
export QINIU_API_KEY="..."
export QINIU_API_BASE="https://api.qnaigc.com"

To evaluate a different model (or your own endpoint), edit the MODEL_REGISTRY array in eval/run_pass1.sh. Each entry is:

"nickname | API_PROVIDER | AGENT_MODEL | HARBOR_AGENT_KWARGS"

API_PROVIDER must be one supported by Harbor (qiniu | openrouter | mindra | volcengine | …). AGENT_MODEL follows LiteLLM conventions (e.g. openai/gpt-5.5).
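
To make the four-field entry format concrete, here is a minimal Python sketch that splits one entry into named fields. It is an illustration only: the real registry is a shell array in eval/run_pass1.sh, the `RegistryEntry` type and `parse_entry` function are hypothetical, and the `KEY=VALUE` kwargs string in the usage note is a placeholder, since the kwargs syntax is not described here.

```python
from typing import NamedTuple


class RegistryEntry(NamedTuple):
    nickname: str
    api_provider: str
    agent_model: str
    agent_kwargs: str


def parse_entry(line: str) -> RegistryEntry:
    """Split 'nickname | API_PROVIDER | AGENT_MODEL | HARBOR_AGENT_KWARGS'."""
    # maxsplit=3 keeps any further '|' characters inside the kwargs field.
    fields = [field.strip() for field in line.split("|", 3)]
    if len(fields) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(fields)}: {line!r}")
    if not all(fields[:3]):
        raise ValueError(f"nickname, provider and model must be non-empty: {line!r}")
    return RegistryEntry(*fields)
```

For example, `parse_entry("my-model | openrouter | openai/gpt-5.5 | KEY=VALUE")` yields an entry whose `agent_model` is `openai/gpt-5.5`. Running new entries through a check like this catches a missing separator before it turns into a failed run.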

EvoCode-Bench (multi-round) does not use this registry: set OPENAI_API_KEY, OPENAI_API_BASE, and AGENT_MODEL directly as environment variables. See Step 7 for details.

3. Run a happy-pass check (recommended)

Before launching a full sweep, verify every model in your registry is reachable on a single task:

./eval/happypass.sh data/DeepTerminalBench

You'll get a status table with rc / dur / reward per model. The check typically takes 5–20 minutes in total at PARALLEL=10.

4. Run the full Pass@1 evaluation

# All models in MODEL_REGISTRY, all 50 DeepTerminalBench tasks
./run.sh data/DeepTerminalBench

# A specific subset
./run.sh data/DeepTerminalBench claude-opus-4-7 gpt-5.5-high

Each task runs once (Pass@1) inside its Docker container with a 5400-second (90-minute) timeout. Default parallelism is 20 tasks at a time per model. Outputs:

data/DeepTerminalBench/.runs/<timestamp>/
├── <model_nickname>/
│   ├── run.log              # raw Harbor batch log
│   └── summary.json         # per-task pass/fail + telemetry
└── leaderboard.tsv          # cross-model comparison

5. Generate the leaderboard

After a run completes, produce a clean leaderboard with per-task average turn count and output token usage:

python eval/build_leaderboard.py data/DeepTerminalBench

Output (TSV + Markdown) is written under data/DeepTerminalBench/.runs/leaderboard_<utc-timestamp>/.

6. Re-run failed tasks (single model)

python eval/rerun_failed.py data/DeepTerminalBench claude-opus-4-7

This finds every task whose latest trial scored below 0.5 and re-runs only those tasks, which is useful for diagnosing high-variance failures.
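
The selection rule described above (latest trial per task, reward below 0.5) can be sketched as follows. This is not the code in eval/rerun_failed.py: the function name `tasks_to_rerun` and the `(task, timestamp, reward)` record shape are assumptions made for illustration.

```python
def tasks_to_rerun(trials, threshold=0.5):
    """Pick tasks whose most recent trial scored below the threshold.

    trials: iterable of (task_name, timestamp, reward) tuples in any order.
    Timestamps only need to be mutually comparable (e.g. ISO-8601 strings).
    """
    latest = {}  # task_name -> (timestamp, reward) of the newest trial seen
    for task, timestamp, reward in trials:
        if task not in latest or timestamp > latest[task][0]:
            latest[task] = (timestamp, reward)
    return sorted(task for task, (_, reward) in latest.items() if reward < threshold)
```

Only the newest trial counts: a task that failed earlier but passed on its latest trial is not selected, while a task that passed earlier and then failed is.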

7. Run EvoCode-Bench (multi-round)

EvoCode-Bench tasks use a different runner (eval/run_multiround_single.sh) backed by the harbor_multiturn fork. It reads OPENAI_API_KEY / OPENAI_API_BASE directly (not the provider-based MODEL_REGISTRY):

# Point to any OpenAI-compatible endpoint that serves your model
export OPENAI_API_KEY="sk-..."
export OPENAI_API_BASE="https://api.your-provider.com/v1"

There are two evaluation modes:

Mode A — Sequential multi-round with selection (default)

Runs every round 1 → N in sequence. At each round the agent makes AGENT_ATTEMPTS parallel attempts, and the best successes, up to the number given by --continue-successes-per-round, carry forward to the next round. This is the standard evaluation protocol.

# 4 attempts per round, keep 1 success, run all rounds
AGENT_MODEL=openai/claude-sonnet-4-6 AGENT_ATTEMPTS=4 \
  ./eval/run_multiround_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation agent

Sweep all 26 tasks:

for theme in data/EvoCodeBench/theme_*; do
    AGENT_MODEL=openai/claude-sonnet-4-6 AGENT_ATTEMPTS=4 \
      ./eval/run_multiround_single.sh "$theme" agent
done

Mode B — Single-round fast-forward

Rounds 1..(N-1) are fast-forwarded by applying the oracle (reference) solution, then the agent runs only round N. Useful for debugging a specific round or running per-round ablations.

# Fast-forward rounds 1-4, run only round 5
AGENT_MODEL=openai/claude-sonnet-4-6 \
  ./eval/run_multiround_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation \
    agent --start-round 5 --max-round 5

You can also fast-forward through rounds 1..(M-1) and let the agent run rounds M to N:

# Fast-forward rounds 1-2, agent runs rounds 3-7
AGENT_MODEL=openai/claude-sonnet-4-6 \
  ./eval/run_multiround_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation \
    agent --start-round 3 --max-round 7
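
The split between oracle-applied rounds and agent-run rounds in the two invocations above follows directly from --start-round and --max-round. A minimal sketch of that arithmetic, with a hypothetical helper name `round_plan` (the runner's internal logic is not shown here):

```python
def round_plan(start_round: int, max_round: int) -> dict[str, list[int]]:
    """Return which rounds are fast-forwarded and which the agent runs."""
    if start_round < 1 or max_round < start_round:
        raise ValueError("need 1 <= start_round <= max_round")
    return {
        # Rounds before start_round get the reference (oracle) solution.
        "oracle": list(range(1, start_round)),
        # The agent runs start_round through max_round, inclusive.
        "agent": list(range(start_round, max_round + 1)),
    }
```

With `round_plan(5, 5)` the oracle covers rounds 1-4 and the agent runs round 5 only; with `round_plan(3, 7)` the oracle covers rounds 1-2 and the agent runs rounds 3-7.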

See ./eval/run_multiround_single.sh --help for all options (resume from a previous trial, custom kwargs, etc.).

8. Run RoadmapBench (version-upgrade)

RoadmapBench tasks share the same single-shot Harbor structure as DeepTerminalBench, so the same eval/run_pass1.sh launcher applies — just point it at the RoadmapBench directory:

# All models in MODEL_REGISTRY, all 115 RoadmapBench tasks
./run.sh data/RoadmapBench

# A specific subset
./run.sh data/RoadmapBench claude-opus-4-7 gpt-5.5-high

RoadmapBench tasks have longer agent timeouts (~7200s) and longer build timeouts baked into each task.toml, because version upgrades typically require building, running and validating multi-phase changes against a real repository. We recommend reducing parallelism for this dataset:

MAX_JOBS=8 ./run.sh data/RoadmapBench

Happy-pass check, leaderboard generation and failed-task re-run all work the same way:

./eval/happypass.sh         data/RoadmapBench
python eval/build_leaderboard.py data/RoadmapBench
python eval/rerun_failed.py  data/RoadmapBench claude-opus-4-7

Scoring

A task passes if Harbor's hidden tests/test.sh returns reward = 1.0, which typically means every test assertion in tests/test_*.py passed. There is no partial credit at the dataset level. Pass@1 is the fraction of tasks that pass when each task is attempted in a single trial.
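
As a worked example of the scoring rule, the sketch below computes Pass@1 from one reward per task. The function name `pass_at_1` and the dict input are assumptions for illustration; the repository's own aggregation lives in eval/build_leaderboard.py.

```python
def pass_at_1(rewards: dict[str, float]) -> float:
    """Fraction of tasks whose single trial earned reward == 1.0."""
    if not rewards:
        raise ValueError("no tasks to score")
    # Binary outcome per task: anything short of 1.0 counts as a fail.
    passed = sum(1 for reward in rewards.values() if reward == 1.0)
    return passed / len(rewards)
```

A reward of 0.5 on one task contributes nothing: four tasks with rewards 1.0, 0.0, 0.5 and 1.0 give a Pass@1 of 0.5.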

We additionally surface per-task agent telemetry from Harbor's result.json:

| Field | Source | Meaning |
| --- | --- | --- |
| `n_episodes` | `agent_result.metadata.n_episodes` | Number of LLM-tool-call rounds the agent used |
| `n_output_tokens` | `agent_result.n_output_tokens` | Total generated tokens for the trial |

These are reported as Avg Turns and Output Tok. (K) in the leaderboard.
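
The two telemetry fields can be read and averaged along these lines. This is a sketch, not the leaderboard script: the JSON key paths come from the table above, while the function names and the assumption that both leaderboard columns are plain per-task means (with tokens divided by 1000) are mine.

```python
import json
from pathlib import Path
from statistics import mean


def read_telemetry(result_path) -> tuple[int, int]:
    """Extract (n_episodes, n_output_tokens) from one Harbor result.json."""
    data = json.loads(Path(result_path).read_text())
    agent = data["agent_result"]
    return agent["metadata"]["n_episodes"], agent["n_output_tokens"]


def summarize(records) -> dict[str, float]:
    """Average per-task telemetry; output tokens are reported in thousands."""
    records = list(records)
    if not records:
        raise ValueError("no telemetry records")
    return {
        "avg_turns": mean(turns for turns, _ in records),
        "output_tok_k": mean(tokens for _, tokens in records) / 1000,
    }
```

Feeding `summarize` the tuples returned by `read_telemetry` for every task of one model gives that model's two telemetry columns.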

Datasets at a Glance

All three datasets share the Terminal-Bench 2.0 task spec and Harbor's container-based execution model. They differ in task structure: single-shot tasks, multi-round tasks, and version-upgrade tasks.

| Dataset | Tasks | Capability axis | Curation | Per-task agent budget |
| --- | --- | --- | --- | --- |
| DeepTerminalBench | 50 | Depth (single-shot) | Claude-Opus-4.6 Pass@4 calibrated; preserves headroom under Pass@1 | up to 90 min |
| EvoCode-Bench | 26 | Iteration (multi-turn) | 5-15 evolving rounds per task; cumulative state across rounds | up to 30 min/round |
| RoadmapBench | 115 | Evolution (version-upgrade) | Real open-source repos with multi-phase upgrade roadmaps | up to 120 min |

Tasks are primarily in Python, with a smaller share in Bash, C, and other languages depending on the source repo. The median task has roughly 30 test functions and about 10 environment files.

For the full breakdown — curated leaderboards, per-task pass/fail maps, failure-mode analysis across 9 frontier models — see the blog post: https://unipat.ai/blog/TerminalX.

Citation

If you use Terminal-X in your research, please cite:

@misc{terminalx2026,
  title  = {Terminal-X: Evaluating Coding Agents Across Depth, Iteration, and Evolution in Terminal Environments},
  author = {UniPat AI Coding Team},
  year   = {2026},
  url    = {https://unipat.ai/blog/TerminalX}
}

License

MIT

Contact

Questions, issues, or new model evaluations? Open an issue on GitHub or email contact@unipat.ai.
