Terminal-X is a suite of three terminal-based datasets for evaluating LLM coding agents, each targeting one capability: depth, iteration, or evolution. Every task is a self-contained problem packaged in a Docker image with execution-verified rewards, built on the Terminal-Bench 2.0 task format and the Harbor evaluation framework.
| Dataset | Capability | Tasks | Format |
|---|---|---|---|
| DeepTerminalBench | Depth | 50 | Single-shot deep engineering |
| EvoCode-Bench | Iteration | 26 | Multi-turn (5-15 rounds) |
| RoadmapBench | Evolution | 115 | Version-upgrade across phases |
All three datasets are included in this repo.
Terminal-X/
├── data/  # Datasets
│   ├── DeepTerminalBench/  # 50 single-shot tasks (Harbor-compatible)
│   │   └── <task_name>/
│   │       ├── instruction.md  # task statement
│   │       ├── environment/  # Dockerfile + initial repo state
│   │       ├── tests/  # hidden test suite (test.sh + test_*.py)
│   │       ├── solution/  # reference solution (solve.sh + artifacts)
│   │       └── task.toml  # difficulty, timeouts, resource limits
│   │
│   ├── EvoCodeBench/  # 26 EvoCode-Bench multi-turn tasks (5-15 rounds each)
│   │   └── <theme_name>/
│   │       ├── instruction.md  # initial task statement
│   │       ├── environment/  # Dockerfile + starter code
│   │       ├── task.toml  # difficulty, round metadata, resource limits
│   │       └── round_N/  # per-round additions
│   │           ├── instruction.md  # incremental requirement for this round
│   │           ├── tests/  # hidden test suite for this round
│   │           └── solution/  # reference solution for this round
│   │
│   └── RoadmapBench/  # 115 version-upgrade tasks (Harbor-compatible)
│       └── <task_name>/
│           ├── instruction.md  # multi-phase upgrade roadmap
│           ├── environment/  # Dockerfile + repo at starting version
│           ├── tests/  # phase-aware test suite
│           ├── solution/  # reference upgrade (solve.sh + patch)
│           └── task.toml  # difficulty, timeouts, resource limits
│
├── eval/  # Evaluation scripts
│   ├── run_pass1.sh  # parallel multi-model Pass@1 runner (DeepTerminalBench / RoadmapBench)
│   ├── run_multiround_single.sh  # single-task multi-round runner (EvoCode-Bench)
│   ├── build_leaderboard.py  # aggregate results into TSV + Markdown
│   ├── rerun_failed.py  # retry only failed tasks for one model
│   └── happypass.sh  # 1-task happy-pass endpoint check
│
├── assets/  # Logo, screenshots
├── run.sh  # one-command launcher
├── requirements.txt
└── README.md
# Python deps (light — only used by leaderboard / rerun helpers)
pip install -r requirements.txt
# uv — needed by all eval scripts to invoke Harbor
# See https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh
# Harbor evaluation framework (single-turn — for DeepTerminalBench / RoadmapBench)
git clone https://github.com/laude-institute/harbor.git /path/to/harbor
export HARBOR_REPO=/path/to/harbor
# Harbor multi-turn fork (for EvoCode-Bench)
git clone https://github.com/UniPat-AI/harbor_multiturn.git /path/to/harbor_multiturn
export HARBOR_MULTITURN_REPO=/path/to/harbor_multiturn
You also need Docker running (Harbor builds a fresh container per task).
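Before launching anything, it can help to confirm that the pieces from the setup steps above are in place. The following is a minimal sketch, not part of the repo: the function name `missing_prerequisites` is hypothetical, and it only checks that `uv` and `docker` are on `PATH` and that the two repo variables point to existing directories. It does not verify that the Docker daemon is actually running, and `HARBOR_MULTITURN_REPO` only matters if you plan to run EvoCode-Bench.

```python
import os
import shutil

# Tool and variable names come from the setup steps above;
# the check itself is illustrative and not shipped with the repo.
REQUIRED_TOOLS = ("uv", "docker")
REQUIRED_ENV = ("HARBOR_REPO", "HARBOR_MULTITURN_REPO")


def missing_prerequisites(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means ready."""
    env = os.environ if env is None else env
    problems = []
    for tool in REQUIRED_TOOLS:
        # shutil.which returns None when the executable is not on PATH.
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    for var in REQUIRED_ENV:
        path = env.get(var)
        if not path:
            problems.append(f"{var} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{var} points to a missing directory: {path}")
    return problems
```

Calling `missing_prerequisites()` with no arguments inspects the current environment; the `env` and `which` parameters exist only so the check can be exercised without touching the real system.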
The default MODEL_REGISTRY in eval/run_pass1.sh evaluates 9 frontier models routed through MindraCode (Anthropic / OpenAI / Google) and Qiniu (other providers). This registry is used by DeepTerminalBench and RoadmapBench (single-shot). Set whichever credentials you need:
export MINDRACODE_API_KEY="..."
export MINDRACODE_API_BASE="https://api.mindracode.com"
export QINIU_API_KEY="..."
export QINIU_API_BASE="https://api.qnaigc.com"
To evaluate a different model (or your own endpoint), edit the MODEL_REGISTRY array in eval/run_pass1.sh. Each entry is:
"nickname | API_PROVIDER | AGENT_MODEL | HARBOR_AGENT_KWARGS"
API_PROVIDER must be one supported by Harbor (qiniu | openrouter | mindra | volcengine | …). AGENT_MODEL follows LiteLLM conventions (e.g. openai/gpt-5.5).
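To make the four-field entry format concrete, here is a small Python sketch that splits one entry into named fields. It is illustrative only: the real parsing happens inside eval/run_pass1.sh (a shell script), and the `RegistryEntry` type, the `parse_registry_entry` helper, and the example entry are hypothetical. In particular, whether an empty `HARBOR_AGENT_KWARGS` field is accepted by the runner is an assumption.

```python
from typing import NamedTuple


class RegistryEntry(NamedTuple):
    nickname: str
    api_provider: str
    agent_model: str
    harbor_agent_kwargs: str


def parse_registry_entry(entry: str) -> RegistryEntry:
    """Split one 'a | b | c | d' entry into its four named fields."""
    # maxsplit=3 keeps any '|' inside the trailing kwargs field intact.
    parts = [part.strip() for part in entry.split("|", 3)]
    if len(parts) != 4:
        raise ValueError(
            f"expected 4 '|'-separated fields, got {len(parts)}: {entry!r}"
        )
    return RegistryEntry(*parts)


# Hypothetical entry: the nickname and model reuse names from this README,
# the provider is one of those listed above, and the kwargs field is empty.
example = parse_registry_entry("gpt-5.5-high | openrouter | openai/gpt-5.5 | ")
print(example.agent_model)
```

A check like this can catch a missing separator before a long sweep starts, since a malformed entry raises `ValueError` immediately.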
For EvoCode-Bench (multi-round), set OPENAI_API_KEY, OPENAI_API_BASE, and AGENT_MODEL directly as environment variables. See Step 7 for details.
Before launching a full sweep, verify every model in your registry is reachable on a single task:
./eval/happypass.sh data/DeepTerminalBench
You'll get a status table with rc / dur / reward per model. Re-runs only take 5–20 minutes total at PARALLEL=10.
# All models in MODEL_REGISTRY, all 50 DeepTerminalBench tasks
./run.sh data/DeepTerminalBench
# A specific subset
./run.sh data/DeepTerminalBench claude-opus-4-7 gpt-5.5-high
Each task runs once (Pass@1) inside its Docker container with a 5400-second timeout. Default parallelism is 20 tasks at a time per model. Outputs:
data/DeepTerminalBench/.runs/<timestamp>/
├── <model_nickname>/
│ ├── run.log # raw Harbor batch log
│ └── summary.json # per-task pass/fail + telemetry
└── leaderboard.tsv # cross-model comparison
After a run completes, produce a clean leaderboard with per-task average turn count and output token usage:
python eval/build_leaderboard.py data/DeepTerminalBench
Output (TSV + Markdown) is written under data/DeepTerminalBench/.runs/leaderboard_<utc-timestamp>/.
python eval/rerun_failed.py data/DeepTerminalBench claude-opus-4-7
This finds every task whose latest trial scored below 0.5 and re-runs only those, useful for high-variance sampling diagnostics.
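The selection rule described above (latest trial per task, reward below 0.5) can be sketched in a few lines of Python. This is not the code in eval/rerun_failed.py: the function name `tasks_to_rerun`, the `(task, timestamp, reward)` tuple layout, and the task names in the example are all hypothetical stand-ins for whatever the script reads from the run directories.

```python
def tasks_to_rerun(trials, threshold=0.5):
    """Return sorted task names whose most recent trial scored below threshold.

    trials: iterable of (task_name, timestamp, reward) tuples, where
    timestamps only need to be comparable (for example ISO-8601 strings).
    """
    latest = {}
    for task, timestamp, reward in trials:
        # Keep only the newest trial seen so far for each task.
        if task not in latest or timestamp > latest[task][0]:
            latest[task] = (timestamp, reward)
    return sorted(
        task for task, (_, reward) in latest.items() if reward < threshold
    )


# Hypothetical history: task_a failed first and then passed, task_b regressed.
history = [
    ("task_a", "2026-01-01T10:00", 0.0),
    ("task_a", "2026-01-02T10:00", 1.0),
    ("task_b", "2026-01-01T10:00", 1.0),
    ("task_b", "2026-01-02T10:00", 0.0),
    ("task_c", "2026-01-01T10:00", 0.0),
]
print(tasks_to_rerun(history))
```

Only the most recent trial counts, so a task that failed earlier but passed on its latest run is not selected again.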
EvoCode-Bench tasks use a different runner (eval/run_multiround_single.sh) backed by the harbor_multiturn fork. It reads OPENAI_API_KEY / OPENAI_API_BASE directly (not the provider-based MODEL_REGISTRY):
# Point to any OpenAI-compatible endpoint that serves your model
export OPENAI_API_KEY="sk-..."
export OPENAI_API_BASE="https://api.your-provider.com/v1"
There are two evaluation modes:
Runs every round 1 → N in sequence. At each round the agent makes AGENT_ATTEMPTS parallel attempts; the best --continue-successes-per-round successes carry forward to the next round. This is the standard evaluation protocol.
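The round loop can be pictured with a small simulation. This is a sketch under several assumptions that the text above does not settle: attempts are launched from each surviving state, a success means reward 1.0, the "best" successes are simply the first ones in attempt order, and the run stops when a round yields no success. The names `run_all_rounds` and `attempt_fn` are hypothetical; the actual behavior is defined by the harbor_multiturn fork.

```python
def run_all_rounds(n_rounds, attempts_per_round, keep_successes, attempt_fn):
    """Simulate carrying successful attempts forward from round to round.

    attempt_fn(round_no, parent, attempt_idx) -> (state, reward) stands in
    for one agent attempt that starts from `parent` (None in round 1).
    Returns (rounds_completed, surviving_states).
    """
    survivors = [None]  # round 1 starts from the initial environment
    completed = 0
    for round_no in range(1, n_rounds + 1):
        results = []
        for parent in survivors:
            for idx in range(attempts_per_round):
                results.append(attempt_fn(round_no, parent, idx))
        # Assumption: only full-reward attempts count as successes.
        successes = [state for state, reward in results if reward == 1.0]
        if not successes:
            break  # assumption: nothing to carry forward, so the run ends
        survivors = successes[:keep_successes]
        completed = round_no
    return completed, survivors


def toy_attempt(round_no, parent, idx):
    """Toy agent: only attempt 1 succeeds, and only in rounds 1 to 3."""
    reward = 1.0 if (idx == 1 and round_no <= 3) else 0.0
    return (round_no, idx), reward


print(run_all_rounds(5, 4, 1, toy_attempt))  # (3, [(3, 1)])
```

With 4 attempts per round and 1 success kept, the toy agent completes three of five rounds before no attempt succeeds.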
# 4 attempts per round, keep 1 success, run all rounds
AGENT_MODEL=openai/claude-sonnet-4-6 AGENT_ATTEMPTS=4 \
./eval/run_multiround_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation agent
Sweep all 26 tasks:
for theme in data/EvoCodeBench/theme_*; do
AGENT_MODEL=openai/claude-sonnet-4-6 AGENT_ATTEMPTS=4 \
./eval/run_multiround_single.sh "$theme" agent
done
Rounds 1..(N-1) are fast-forwarded by applying the oracle (reference) solution, then the agent runs only round N. Useful for debugging a specific round or running per-round ablations.
# Fast-forward rounds 1-4, run only round 5
AGENT_MODEL=openai/claude-sonnet-4-6 \
./eval/run_multiround_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation \
agent --start-round 5 --max-round 5
You can also fast-forward to round M and let the agent run from M to N:
# Fast-forward rounds 1-2, agent runs rounds 3-7
AGENT_MODEL=openai/claude-sonnet-4-6 \
./eval/run_multiround_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation \
agent --start-round 3 --max-round 7
See ./eval/run_multiround_single.sh --help for all options (resume from a previous trial, custom kwargs, etc.).
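The effect of `--start-round` and `--max-round` on which rounds are fast-forwarded and which are run by the agent can be sketched as follows. The helper `plan_rounds` is hypothetical and only mirrors the two examples above. It assumes the task has at least `max_round` rounds and that omitting both flags means the agent runs every round.

```python
def plan_rounds(total_rounds, start_round=1, max_round=None):
    """Return (oracle_rounds, agent_rounds) for one multi-round task."""
    max_round = total_rounds if max_round is None else max_round
    if not 1 <= start_round <= max_round <= total_rounds:
        raise ValueError("need 1 <= start_round <= max_round <= total_rounds")
    # Rounds before start_round are replayed with the reference solution.
    oracle_rounds = list(range(1, start_round))
    # The agent runs every round from start_round through max_round.
    agent_rounds = list(range(start_round, max_round + 1))
    return oracle_rounds, agent_rounds


print(plan_rounds(5, start_round=5, max_round=5))  # ([1, 2, 3, 4], [5])
print(plan_rounds(7, start_round=3, max_round=7))  # ([1, 2], [3, 4, 5, 6, 7])
```

The first call corresponds to the single-round example (fast-forward rounds 1-4, run round 5) and the second to the range example (fast-forward rounds 1-2, run rounds 3-7).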
RoadmapBench tasks share the same single-shot Harbor structure as DeepTerminalBench, so the same eval/run_pass1.sh launcher applies — just point it at the RoadmapBench directory:
# All models in MODEL_REGISTRY, all 115 RoadmapBench tasks
./run.sh data/RoadmapBench
# A specific subset
./run.sh data/RoadmapBench claude-opus-4-7 gpt-5.5-high
RoadmapBench tasks have longer agent timeouts (~7200s) and longer build timeouts baked into each task.toml, because version upgrades typically require building, running and validating multi-phase changes against a real repository. We recommend reducing parallelism for this dataset:
MAX_JOBS=8 ./run.sh data/RoadmapBench
Happy-pass check, leaderboard generation and failed-task re-run all work the same way:
./eval/happypass.sh data/RoadmapBench
python eval/build_leaderboard.py data/RoadmapBench
python eval/rerun_failed.py data/RoadmapBench claude-opus-4-7
A task passes if Harbor's hidden tests/test.sh returns reward = 1.0, which typically means every test assertion in tests/test_*.py passed. There is no partial credit at the dataset level. Pass@1 is the mean of these binary outcomes across all tasks, with a single trial per task.
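As a worked example of the scoring rule, the sketch below computes Pass@1 from per-task rewards. The function name `pass_at_1` and the task names are hypothetical; the only rule taken from the text above is that a task counts as passed exactly when its reward is 1.0.

```python
def pass_at_1(rewards):
    """rewards: mapping of task name -> reward from that task's single trial."""
    if not rewards:
        raise ValueError("no tasks to score")
    # No partial credit: a reward of 0.5 counts the same as 0.0.
    passed = sum(1 for reward in rewards.values() if reward == 1.0)
    return passed / len(rewards)


# Hypothetical rewards for four tasks: two full passes, one partial, one fail.
sample = {"task_a": 1.0, "task_b": 0.5, "task_c": 0.0, "task_d": 1.0}
print(pass_at_1(sample))  # 0.5
```

Here the partially successful `task_b` contributes nothing, so two passes out of four tasks give a Pass@1 of 0.5.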
We additionally surface per-task agent telemetry from Harbor's result.json:
| Field | Source | Meaning |
|---|---|---|
| n_episodes | agent_result.metadata.n_episodes | Number of LLM-tool-call rounds the agent used |
| n_output_tokens | agent_result.n_output_tokens | Total generated tokens for the trial |
These are reported as Avg Turns and Output Tok. (K) in the leaderboard.
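The two field paths in the table can be read from a parsed result.json as shown in this sketch. It assumes the file is a JSON object in which those paths resolve as nested keys, and that both leaderboard columns are per-task means, with output tokens divided by 1000. The helper names and the sample values are hypothetical, and the real aggregation in eval/build_leaderboard.py may differ.

```python
def extract_telemetry(result):
    """Return (n_episodes, n_output_tokens) from one parsed result.json."""
    agent = result["agent_result"]
    return agent["metadata"]["n_episodes"], agent["n_output_tokens"]


def summarize(results):
    """Return (avg_turns, avg_output_tokens_in_thousands) over all tasks."""
    pairs = [extract_telemetry(result) for result in results]
    avg_turns = sum(episodes for episodes, _ in pairs) / len(pairs)
    avg_output_k = sum(tokens for _, tokens in pairs) / len(pairs) / 1000
    return avg_turns, avg_output_k


# Hypothetical results for two tasks, using only the fields from the table.
sample_results = [
    {"agent_result": {"metadata": {"n_episodes": 10}, "n_output_tokens": 4000}},
    {"agent_result": {"metadata": {"n_episodes": 30}, "n_output_tokens": 8000}},
]
print(summarize(sample_results))  # (20.0, 6.0)
```

In practice each dictionary would come from `json.load` on a trial's result.json rather than from an inline literal.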
All three datasets share the Terminal-Bench 2.0 task spec and Harbor's container-based execution model. They differ in task structure: single-shot tasks, multi-round tasks, and version-upgrade tasks.
| Dataset | Tasks | Capability axis | Curation | Per-task agent budget |
|---|---|---|---|---|
| DeepTerminalBench | 50 | Depth (single-shot) | Claude-Opus-4.6 Pass@4 calibrated; preserves headroom under Pass@1 | up to 90 min |
| EvoCode-Bench | 26 | Iteration (multi-turn) | 5-15 evolving rounds per task; cumulative state across rounds | up to 30 min/round |
| RoadmapBench | 115 | Evolution (version-upgrade) | Real open-source repos with multi-phase upgrade roadmaps | up to 120 min |
Tasks are primarily in Python, with a smaller share in Bash, C, and other languages depending on the source repo. The median number of test functions per task is roughly 30, and the median number of environment files per task is about 10.
For the full breakdown — curated leaderboards, per-task pass/fail maps, failure-mode analysis across 9 frontier models — see the blog post: https://unipat.ai/blog/TerminalX.
If you use Terminal-X in your research, please cite:
@misc{terminalx2026,
title = {Terminal-X: Evaluating Coding Agents Across Depth, Iteration, and Evolution in Terminal Environments},
author = {UniPat AI Coding Team},
year = {2026},
url = {https://unipat.ai/blog/TerminalX}
}
License: MIT
Questions, issues, or new model evaluations? Open an issue on GitHub or email contact@unipat.ai.