UniPat-AI/Terminal-X

Terminal-X

Terminal-X is a suite of three terminal-based datasets for evaluating LLM coding agents along three capability axes: depth, iteration, and evolution. Every task is a self-contained problem inside a Docker image with execution-verified rewards, built on the Terminal-Bench 2.0 task format and the Harbor evaluation framework.

| Dataset | Capability | Tasks | Format |
|---|---|---|---|
| DeepTerminalBench | Depth | 50 | Single-shot deep engineering |
| EvoCode-Bench | Iteration | 26 | Multi-turn (5-15 rounds) |
| RoadmapBench | Evolution | 115 | Version-upgrade across phases |

All three datasets are included in this repo.

Project Structure

Terminal-X/
├── data/                            # Datasets
│   ├── DeepTerminalBench/           # 50 single-shot tasks (Harbor-compatible)
│   │   └── <task_name>/
│   │       ├── instruction.md       # task statement
│   │       ├── environment/         # Dockerfile + initial repo state
│   │       ├── tests/               # hidden test suite (test.sh + test_*.py)
│   │       ├── solution/            # reference solution (solve.sh + artifacts)
│   │       └── task.toml            # difficulty, timeouts, resource limits
│   │
│   ├── EvoCodeBench/                # 26 EvoCode-Bench multi-turn tasks (5-15 rounds each)
│   │   └── <theme_name>/
│   │       ├── instruction.md       # initial task statement
│   │       ├── environment/         # Dockerfile + starter code
│   │       ├── task.toml            # difficulty, round metadata, resource limits
│   │       └── round_N/             # per-round additions
│   │           ├── instruction.md   # incremental requirement for this round
│   │           ├── tests/           # hidden test suite for this round
│   │           └── solution/        # reference solution for this round
│   │
│   └── RoadmapBench/                # 115 version-upgrade tasks (Harbor-compatible)
│       └── <task_name>/
│           ├── instruction.md       # multi-phase upgrade roadmap
│           ├── environment/         # Dockerfile + repo at starting version
│           ├── tests/               # phase-aware test suite
│           ├── solution/            # reference upgrade (solve.sh + patch)
│           └── task.toml            # difficulty, timeouts, resource limits
│
├── eval/                            # Evaluation scripts
│   ├── run_pass1.sh                 # parallel multi-model Pass@1 runner (DeepTerminalBench / RoadmapBench)
│   ├── run_multiround_single.sh     # single-task multi-round runner (EvoCode-Bench)
│   ├── build_leaderboard.py         # aggregate results into TSV + Markdown
│   ├── rerun_failed.py              # retry only failed tasks for one model
│   └── happypass.sh                 # 1-task happy-pass endpoint check
│
├── assets/                          # Logo, screenshots
├── run.sh                           # one-command launcher
├── requirements.txt
└── README.md
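
The single-shot layout above (DeepTerminalBench and RoadmapBench) can be sanity-checked before a run. The following is a minimal sketch, not part of the repository: the function names `missing_entries` and `check_dataset` are hypothetical, and the list of required entries is simply read off the tree above. EvoCode-Bench themes use a different per-round layout and are not covered.

```python
from pathlib import Path

# Entries each single-shot task directory is expected to contain,
# taken from the project tree (an assumption, not an official schema).
REQUIRED_ENTRIES = ["instruction.md", "environment", "tests", "solution", "task.toml"]


def missing_entries(task_dir):
    """Return the names from REQUIRED_ENTRIES that are absent in task_dir."""
    root = Path(task_dir)
    return [name for name in REQUIRED_ENTRIES if not (root / name).exists()]


def check_dataset(dataset_dir):
    """Map each incomplete task name to the entries it is missing."""
    report = {}
    for task in sorted(Path(dataset_dir).iterdir()):
        # Skip hidden directories such as .runs, which hold outputs, not tasks.
        if task.is_dir() and not task.name.startswith("."):
            missing = missing_entries(task)
            if missing:
                report[task.name] = missing
    return report
```

Calling `check_dataset("data/DeepTerminalBench")` would return an empty dict when every task directory is complete.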

Quick Start

1. Prerequisites

# Python deps (light — only used by leaderboard / rerun helpers)
pip install -r requirements.txt

# uv — needed by all eval scripts to invoke Harbor
# See https://docs.astral.sh/uv/getting-started/installation/
curl -LsSf https://astral.sh/uv/install.sh | sh

# Harbor evaluation framework (single-turn — for DeepTerminalBench / RoadmapBench)
git clone https://github.com/laude-institute/harbor.git /path/to/harbor
export HARBOR_REPO=/path/to/harbor

# Harbor multi-turn fork (for EvoCode-Bench)
git clone https://github.com/UniPat-AI/harbor_multiturn.git /path/to/harbor_multiturn
export HARBOR_MULTITURN_REPO=/path/to/harbor_multiturn

You also need Docker running (Harbor builds a fresh container per task).
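
A quick preflight can catch a missing tool or an unset variable before a long sweep starts. This is a hedged sketch only: `preflight` is a hypothetical helper, and finding `docker` on PATH does not prove that the Docker daemon is actually running.

```python
import os
import shutil

# Tools and variables named in the prerequisites above.
REQUIRED_TOOLS = ["docker", "uv"]
REQUIRED_ENV = ["HARBOR_REPO", "HARBOR_MULTITURN_REPO"]


def preflight(env=None, which=shutil.which):
    """Return a list of human-readable problems; an empty list means OK."""
    env = os.environ if env is None else env
    problems = []
    for tool in REQUIRED_TOOLS:
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    for var in REQUIRED_ENV:
        path = env.get(var)
        if not path:
            problems.append(f"{var} is not set")
        elif not os.path.isdir(path):
            problems.append(f"{var} points to a missing directory: {path}")
    return problems
```

If you only evaluate the single-shot datasets, `HARBOR_MULTITURN_REPO` can be dropped from `REQUIRED_ENV`.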

2. Configure API keys

The default MODEL_REGISTRY in eval/run_pass1.sh evaluates 9 frontier models routed through MindraCode (Anthropic / OpenAI / Google) and Qiniu (other providers). This registry is used by DeepTerminalBench and RoadmapBench (single-shot). Set whichever credentials you need:

export MINDRACODE_API_KEY="..."
export MINDRACODE_API_BASE="https://api.mindracode.com"
export QINIU_API_KEY="..."
export QINIU_API_BASE="https://api.qnaigc.com"

To evaluate a different model (or your own endpoint), edit the MODEL_REGISTRY array in eval/run_pass1.sh. Each entry is:

"nickname | API_PROVIDER | AGENT_MODEL | HARBOR_AGENT_KWARGS"

API_PROVIDER must be one supported by Harbor (qiniu | openrouter | mindra | volcengine | …). AGENT_MODEL follows LiteLLM conventions (e.g. openai/gpt-5.5).
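
To make the four-field entry format concrete, here is a small sketch that splits one entry into named fields. It is an illustration, not the parsing logic of eval/run_pass1.sh (which is a shell script); `RegistryEntry` and `parse_registry_entry` are hypothetical names, and the kwargs string in the usage note is made up.

```python
from typing import NamedTuple


class RegistryEntry(NamedTuple):
    nickname: str
    api_provider: str
    agent_model: str
    agent_kwargs: str


def parse_registry_entry(entry):
    """Split 'nickname | API_PROVIDER | AGENT_MODEL | HARBOR_AGENT_KWARGS'."""
    # maxsplit=3 keeps any further '|' characters inside the kwargs field.
    parts = [part.strip() for part in entry.split("|", 3)]
    if len(parts) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(parts)}")
    if not all(parts[:3]):
        raise ValueError("nickname, API_PROVIDER and AGENT_MODEL must be non-empty")
    return RegistryEntry(*parts)
```

For example, `parse_registry_entry("my-model | openrouter | openai/gpt-5.5 | example_kwarg=1")` yields an entry whose `agent_model` is `openai/gpt-5.5`; the nickname and kwargs here are placeholders.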

For EvoCode-Bench (multi-round), set OPENAI_API_KEY, OPENAI_API_BASE, and AGENT_MODEL directly as environment variables. See Step 7 for details.

3. Run a happy-pass check (recommended)

Before launching a full sweep, verify every model in your registry is reachable on a single task:

./eval/happypass.sh data/DeepTerminalBench

You'll get a status table with rc / dur / reward per model. Re-runs take only 5–20 minutes in total at PARALLEL=10.

4. Run the full Pass@1 evaluation

# All models in MODEL_REGISTRY, all 50 DeepTerminalBench tasks
./run.sh data/DeepTerminalBench

# A specific subset
./run.sh data/DeepTerminalBench claude-opus-4-7 gpt-5.5-high

Each task runs once (Pass@1) inside its Docker container with a 5400-second timeout. Default parallelism is 20 tasks at a time per model. Outputs:

data/DeepTerminalBench/.runs/<timestamp>/
├── <model_nickname>/
│   ├── run.log              # raw Harbor batch log
│   └── summary.json         # per-task pass/fail + telemetry
└── leaderboard.tsv          # cross-model comparison

5. Generate the leaderboard

After a run completes, produce a clean leaderboard with per-task average turn count and output token usage:

python eval/build_leaderboard.py data/DeepTerminalBench

Output (TSV + Markdown) is written under data/DeepTerminalBench/.runs/leaderboard_<utc-timestamp>/.

6. Re-run failed tasks (single model)

python eval/rerun_failed.py data/DeepTerminalBench claude-opus-4-7

This finds every task whose latest trial scored below 0.5 and re-runs only those tasks, which is useful for high-variance sampling diagnostics.
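
The selection rule can be sketched as follows. This is not the code of eval/rerun_failed.py; the input shape (a list of `(task, timestamp, reward)` tuples) and the choice to treat a missing reward as a failure are assumptions made for illustration.

```python
def tasks_to_rerun(trials, threshold=0.5):
    """Return sorted task names whose most recent trial scored below threshold.

    trials: iterable of (task_name, timestamp, reward) tuples, where reward
    may be None if the trial produced no score (treated as a failure here).
    """
    latest = {}
    for task, timestamp, reward in trials:
        # Keep only the newest trial per task.
        if task not in latest or timestamp > latest[task][0]:
            latest[task] = (timestamp, reward)
    return sorted(
        task
        for task, (_, reward) in latest.items()
        if reward is None or reward < threshold
    )
```

Only the latest trial counts: a task that failed earlier but passed on its most recent trial is not selected again.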

7. Run EvoCode-Bench (multi-round)

EvoCode-Bench tasks use a different runner (eval/run_multiround_single.sh) backed by the harbor_multiturn fork. It reads OPENAI_API_KEY / OPENAI_API_BASE directly (not the provider-based MODEL_REGISTRY):

# Point to any OpenAI-compatible endpoint that serves your model
export OPENAI_API_KEY="sk-..."
export OPENAI_API_BASE="https://api.your-provider.com/v1"

There are two evaluation modes:

Mode A — Sequential multi-round with selection (default)

Runs every round 1 → N in sequence. At each round the agent makes AGENT_ATTEMPTS parallel attempts; the best --continue-successes-per-round successes carry forward to the next round. This is the standard evaluation protocol.

# 4 attempts per round, keep 1 success, run all rounds
AGENT_MODEL=openai/claude-sonnet-4-6 AGENT_ATTEMPTS=4 \
  ./eval/run_multiround_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation agent

Sweep all 26 tasks:

for theme in data/EvoCodeBench/theme_*; do
    AGENT_MODEL=openai/claude-sonnet-4-6 AGENT_ATTEMPTS=4 \
      ./eval/run_multiround_single.sh "$theme" agent
done
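
The Mode A protocol can be pictured with a small simulation. This sketch rests on several assumptions that the description above does not settle: a success means a score of 1.0, "best" is approximated by taking the first `keep` successes, attempts are launched from every carried-forward state, and the run stops at the first round with no success. `run_sequential` and `attempt_fn` are hypothetical names, and attempts run serially here rather than in parallel.

```python
def run_sequential(n_rounds, attempts, keep, attempt_fn):
    """Simulate rounds 1..n_rounds with selection between rounds.

    attempt_fn(round_no, parent_state, attempt_idx) -> (score, new_state)
    Returns a list of (round_no, n_successes, n_attempts) per round run.
    """
    parents = [None]  # round 1 starts from the initial environment
    history = []
    for round_no in range(1, n_rounds + 1):
        results = [
            attempt_fn(round_no, parent, idx)
            for parent in parents
            for idx in range(attempts)
        ]
        successes = [result for result in results if result[0] == 1.0]
        history.append((round_no, len(successes), len(results)))
        if not successes:
            break  # nothing to carry forward
        parents = [state for _, state in successes[:keep]]
    return history
```

With `keep=1`, as in the command above, each round starts from a single carried-forward state, so the ambiguity about how several successes are ranked does not arise.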

Mode B — Single-round fast-forward

Rounds 1..(N-1) are fast-forwarded by applying the oracle (reference) solution, then the agent runs only round N. Useful for debugging a specific round or running per-round ablations.

# Fast-forward rounds 1-4, run only round 5
AGENT_MODEL=openai/claude-sonnet-4-6 \
  ./eval/run_multiround_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation \
    agent --start-round 5 --max-round 5

You can also fast-forward to round M and let the agent run from M to N:

# Fast-forward rounds 1-2, agent runs rounds 3-7
AGENT_MODEL=openai/claude-sonnet-4-6 \
  ./eval/run_multiround_single.sh data/EvoCodeBench/theme_d1_w1_code_build_greenfield_implementation \
    agent --start-round 3 --max-round 7
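
The split between oracle-applied rounds and agent-run rounds follows directly from the two flags. A minimal sketch, with `plan_rounds` as a hypothetical helper name:

```python
def plan_rounds(start_round, max_round):
    """Return (oracle_rounds, agent_rounds) for a fast-forward run."""
    if start_round < 1 or max_round < start_round:
        raise ValueError("need 1 <= start_round <= max_round")
    oracle_rounds = list(range(1, start_round))          # fast-forwarded
    agent_rounds = list(range(start_round, max_round + 1))  # run by the agent
    return oracle_rounds, agent_rounds
```

`plan_rounds(5, 5)` gives `([1, 2, 3, 4], [5])`, and `plan_rounds(3, 7)` gives `([1, 2], [3, 4, 5, 6, 7])`, matching the two commands above.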

See ./eval/run_multiround_single.sh --help for all options (resume from a previous trial, custom kwargs, etc.).

8. Run RoadmapBench (version-upgrade)

RoadmapBench tasks share the same single-shot Harbor structure as DeepTerminalBench, so the same eval/run_pass1.sh launcher applies — just point it at the RoadmapBench directory:

# All models in MODEL_REGISTRY, all 115 RoadmapBench tasks
./run.sh data/RoadmapBench

# A specific subset
./run.sh data/RoadmapBench claude-opus-4-7 gpt-5.5-high

RoadmapBench tasks have longer agent timeouts (~7200s) and longer build timeouts baked into each task.toml, because version upgrades typically require building, running and validating multi-phase changes against a real repository. We recommend reducing parallelism for this dataset:

MAX_JOBS=8 ./run.sh data/RoadmapBench

Happy-pass check, leaderboard generation and failed-task re-run all work the same way:

./eval/happypass.sh         data/RoadmapBench
python eval/build_leaderboard.py data/RoadmapBench
python eval/rerun_failed.py  data/RoadmapBench claude-opus-4-7

Scoring

A task passes if Harbor's hidden tests/test.sh returns reward = 1.0, which typically means that every test assertion in tests/test_*.py passed. There is no partial credit at the dataset level. Pass@1 is the mean of these binary pass/fail outcomes across all tasks, with a single trial per task.
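
As a worked illustration of the scoring rule, the sketch below turns per-task rewards into Pass@1. The function name and the dict input are assumptions; the rule itself (pass only at reward 1.0, no partial credit) is the one stated above.

```python
def pass_at_1(rewards):
    """Compute Pass@1 from a mapping of task name -> reward (one trial each)."""
    if not rewards:
        raise ValueError("no tasks to score")
    # A task counts as passed only when its reward is exactly 1.0.
    passed = sum(1 for reward in rewards.values() if reward == 1.0)
    return passed / len(rewards)
```

A reward of 0.5 therefore contributes nothing: `pass_at_1({"a": 1.0, "b": 0.5, "c": 0.0, "d": 1.0})` is `0.5`.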

We additionally surface per-task agent telemetry from Harbor's result.json:

| Field | Source | Meaning |
|---|---|---|
| n_episodes | agent_result.metadata.n_episodes | Number of LLM-tool-call rounds the agent used |
| n_output_tokens | agent_result.n_output_tokens | Total generated tokens for the trial |

These are reported as Avg Turns and Output Tok. (K) in the leaderboard.
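
The two telemetry fields can be read and averaged roughly as follows. This is a sketch rather than the logic of eval/build_leaderboard.py: only the two key paths come from the table above, while the rest of the result.json layout, the helper names, and the choice to skip missing values are assumptions.

```python
import json
from pathlib import Path
from statistics import mean


def read_telemetry(result_path):
    """Return (n_episodes, n_output_tokens) from one result.json file."""
    data = json.loads(Path(result_path).read_text())
    agent = data.get("agent_result") or {}
    metadata = agent.get("metadata") or {}
    return metadata.get("n_episodes"), agent.get("n_output_tokens")


def summarize(rows):
    """Average per-task telemetry, skipping tasks where a field is missing."""
    episodes = [e for e, _ in rows if e is not None]
    tokens = [t for _, t in rows if t is not None]
    return {
        "avg_turns": mean(episodes) if episodes else None,
        # Output tokens are reported in thousands.
        "output_tok_k": mean(tokens) / 1000 if tokens else None,
    }
```

For two tasks with 10 and 20 episodes and 4000 and 6000 output tokens, `summarize` returns an average of 15 turns and 5.0 thousand output tokens.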

Datasets at a Glance

All three datasets share the Terminal-Bench 2.0 task spec and Harbor's container-based execution model. They differ in task structure: single-shot tasks, multi-round tasks, and version-upgrade tasks.

| Dataset | Tasks | Capability axis | Curation | Per-task agent budget |
|---|---|---|---|---|
| DeepTerminalBench | 50 | Depth (single-shot) | Claude-Opus-4.6 Pass@4 calibrated; preserves headroom under Pass@1 | up to 90 min |
| EvoCode-Bench | 26 | Iteration (multi-turn) | 5-15 evolving rounds per task; cumulative state across rounds | up to 30 min/round |
| RoadmapBench | 115 | Evolution (version-upgrade) | Real open-source repos with multi-phase upgrade roadmaps | up to 120 min |

The tasks are primarily in Python, with a smaller share in Bash, C, and other languages depending on the source repo. The median number of test functions per task is roughly 30, and the median number of environment files per task is about 10.

For the full breakdown (curated leaderboards, per-task pass/fail maps, and failure-mode analysis across 9 frontier models), see the blog post: https://unipat.ai/blog/TerminalX.

Citation

If you use Terminal-X in your research, please cite:

@misc{terminalx2026,
  title  = {Terminal-X: Evaluating Coding Agents Across Depth, Iteration, and Evolution in Terminal Environments},
  author = {UniPat AI Coding Team},
  year   = {2026},
  url    = {https://unipat.ai/blog/TerminalX}
}

License

MIT

Contact

Questions, issues, or new model evaluations? Open an issue on GitHub or email contact@unipat.ai.

About

Official repository of blog: Terminal-X: Evaluate Coding Agents across Depth, Iteration, and Evolution in Terminal Environments
