Integrate ae_agent into ArtEval benchmark with evaluation flow and smoke test
- Add ae_agent under benchmarks/arteval_bench/src/agents/ae_agent (main, run_eval, runner, utils, runner.sh, install.sh)
- Wire benchmark main.py and run_eval_in_env.py for ae_agent: host path runs agent then evaluator, parses score; Docker path uses same flow
- Add src/utils.py re-export for get_task when running from benchmark root
- SDK utils: do not overwrite existing env vars when loading env.toml (preserve API key)
- Add minimal smoke test: ae_agent_smoke artifact, ae_agent_smoke_test.jsonl (host + docker), run_ae_agent_smoke_test.sh
- Remove interactive_runner.py (interactive handled in runner)
- Use English throughout (docs, comments); ruff-compliant; single _make_eval_result for result shape
Co-authored-by: Cursor <cursoragent@cursor.com>
---

- Test the agent under `src/agents/ae_agent`: **host** and **docker** modes, and the **evaluation script** flow (the evaluator runs after the agent and parses the score).
- Task is minimal (create `success.txt` with content `1` in the artifact root); it finishes in a few minutes and avoids long runs with the full arteval_tasks set.

## Files

- **ae_agent_smoke/**: Minimal artifact
  - `README.md`: Task description (create `success.txt` with content `1`)
  - `_agent_eval/check.py`: Evaluator; outputs `1` if `success.txt` exists and contains `1`, else `0` (a sketch follows this list)
- **ae_agent_smoke_test.jsonl**: Two lines
  - First line: `run_on_host: true`, run ae_agent + evaluator on the host
  - Second line: `run_on_host: false`, run ae_agent + evaluator in Docker
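
A minimal sketch of what such an evaluator could look like (hypothetical; the actual `check.py` ships in the artifact):

```python
# _agent_eval/check.py (sketch): print 1 if ../success.txt contains "1", else 0.
from pathlib import Path

def main() -> None:
    # check.py lives in <artifact>/_agent_eval/, so the artifact root is two levels up.
    success_file = Path(__file__).resolve().parent.parent / "success.txt"
    ok = success_file.is_file() and success_file.read_text().strip() == "1"
    print(1 if ok else 0)

if __name__ == "__main__":
    main()
```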
## How to run
From the **benchmarks/arteval_bench** directory:
```bash
# Set ANTHROPIC_API_KEY or ANTHROPIC_FOUNDRY_API_KEY first
./data/benchmark/run_ae_agent_smoke_test.sh
```

- **Host task**: Runs the agent on the host, then runs `python3 _agent_eval/check.py` on the host to get the score.
- **Docker task**: Runs the agent in the container, then runs the evaluator in the container to get the score; the container is kept running by default for debugging.

Results are under the `-o` directory: `result.jsonl` (one JSON object per line with `score`, `status`, `test_method`, etc.) and `avg_score.json`.
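
For example, one line of `result.jsonl` for a passing host run might look like this (illustrative values only):

```json
{"score": 1.0, "status": "success", "test_method": "python3 _agent_eval/check.py"}
```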
## Interactive mode
The benchmark’s `src/main.py` does not read an `interactive` field from the JSONL, so the command above only covers **non-interactive** runs. To test interactive mode:
- Use ae_agent’s main entry with `--interactive`, and set `"env": "local"` or `"run_on_host": true` / `"env": "docker"` in the JSONL for the task, for example:
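
A minimal sketch, assuming `main.py` accepts the same `-i`/`-a`/`-m`/`-o` flags as the batch run shown later in this document, plus `--interactive`:

```bash
cd benchmarks/arteval_bench/src/agents/ae_agent
# tasks.jsonl lines should set "env": "local" / "run_on_host": true (or "env": "docker")
python main.py -i /path/to/tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 \
  -o ./outputs/interactive_run --interactive
```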

---

This agent is the **ae_agent** for the system-intelligence-benchmark ArtEval benchmark, with the same logic as the standalone [ae-agent](https://github.com/Couen/ae-agent) repo. It runs inside the benchmark container using the Claude Agent SDK to execute artifact evaluation tasks.
## Files

- **install.sh**: Installs `claude-agent-sdk` inside the container for use by `runner.py`.
- **runner.sh**: Entry script; invoked as `runner.sh <model> <task_or_path>`. Uses `/agent/current_task.txt` when the benchmark passes the task via a file.
- **runner.py**: Runs the task with the Claude Agent SDK; supports 429 rate-limit retry; the second argument can be task text or a path to a task file. The artifact path in the container is `/repo`.
- **run_eval.py**: Single-task orchestration: `env='local'` runs on the host, otherwise in Docker (requires swerex/swe-rex).
- **main.py**: CLI entry for batch runs from JSONL; supports host or Docker per task.
- **utils.py**: Timeout, task/path helpers, Tee, reports, summary (used by runner, main, run_eval).
- **__init__.py**: Package marker.

## Usage from the benchmark

From the benchmark root (`benchmarks/arteval_bench/`):

```bash
python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/ae_agent_run
```
Or use the helper script from `data/benchmark/`:
```bash
./data/benchmark/run_ae_agent.sh [model_name]
```
You can also use `-a ae-agent`; it is equivalent to `ae_agent`.
The benchmark will:
1. Upload this agent to `/agent` in the container.
2. For ae_agent: write the task to `/agent/current_task.txt`, then run `runner.sh "$model" /agent/current_task.txt` (avoids shell quoting issues with large tasks).
3. Use long-running and live-log behavior (48h timeout, streamed logs, removal of `_agent_eval` before the run and re-upload before evaluation, container kept for debugging).
4. **Evaluation script flow** (same as claude_sdk): after the agent finishes, run the JSONL `evaluator` (test_method), e.g. `cd /repo && python _agent_eval/main.py`, parse the output for `score`, and write it to the result.
5. If set, pass through `ANTHROPIC_API_KEY`, `ANTHROPIC_FOUNDRY_API_KEY`, `ANTHROPIC_FOUNDRY_BASE_URL`, `CLAUDE_CODE_USE_FOUNDRY`.

**Evaluation flow on host**: When `run_on_host=True` and the agent is ae_agent, `run_eval_in_env.run_eval_on_host` calls this package’s `run_agent_then_eval()`: run the agent first, then run `test_method` on the host (e.g. `cd project_path && python _agent_eval/main.py`), parse score with `utils.parse_eval_score()`, and return a result with the same shape as the Docker path (`score`, `test_method`, `status`).
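
A minimal sketch of that host flow, with assumed field names (`task`, `project_path`, `test_method`) standing in for the real JSONL schema:

```python
import subprocess

def run_agent_then_eval(task: dict, model: str) -> dict:
    """Sketch: run the agent on the host, then the evaluator, then parse the score."""
    project_path = task["project_path"]   # assumed field name from this README
    test_method = task["test_method"]     # e.g. "python _agent_eval/main.py"

    # 1) Agent first: runner.sh <model> <task_or_path>, as described above.
    subprocess.run(["./runner.sh", model, task["task"]], check=False)

    # 2) Evaluator second, from the artifact root, capturing its output.
    out = subprocess.run(test_method, shell=True, cwd=project_path,
                         capture_output=True, text=True)

    # 3) Parse the score (the real code uses utils.parse_eval_score());
    #    here: last whitespace-separated token that parses as a float.
    score = 0.0
    for tok in reversed(out.stdout.split()):
        try:
            score = float(tok)
            break
        except ValueError:
            continue

    # 4) Same result shape as the Docker path.
    return {"score": score,
            "status": "success" if out.returncode == 0 else "error",
            "test_method": test_method}
```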
## Dependencies

- Python 3; `claude-agent-sdk` is installed in the container via `install.sh`.
- When running in Docker via the benchmark’s `run_eval_in_env.py`, install `swerex` on the host (the benchmark includes it). When using this directory’s `main.py` for Docker mode standalone, you also need `swe-rex`.
## Running on host (local)
You can run tasks on the **host** from this directory (without the benchmark’s Docker flow):

1. **Single or batch via main.py**

Use a JSONL where each line can set `"env": "local"` or `"run_on_host": true` to run that task on the host; others run in Docker (requires swerex). An example line is shown after the command below.
```bash
cd benchmarks/arteval_bench/src/agents/ae_agent
python main.py -i /path/to/tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/host_run
```
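
For illustration, a host-mode task line might look like this (`run_on_host`, `project_path`, and `test_method` are fields referenced in this README; the `task` field name is an assumption):

```json
{"task": "Create success.txt with content 1", "project_path": "/path/to/ae_agent_smoke", "run_on_host": true, "test_method": "python3 _agent_eval/check.py"}
```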

2. **Host mode requirements**
   - Set `ANTHROPIC_API_KEY` or `ANTHROPIC_FOUNDRY_API_KEY`
   - Docker installed and running (for the prereq check; the agent runs on the host)
   - `pip install claude-agent-sdk`

3. **Docker mode from this directory**

   If the JSONL has `"env": "docker"` (or `run_on_host` is not set), `main.py` runs that task in Docker via `run_eval.py` (requires `swe-rex`/`swerex`).

## Relation to the standalone ae-agent repo
The standalone ae-agent repo provides the same host/Docker CLI. This sub-agent includes both the **in-container** runner (used by the benchmark’s `run_eval_in_env.py`) and **host/local** mode via `main.py` and `run_eval.py`.