
Commit 30223e5

bastoica and cursoragent committed
Integrate ae_agent into ArtEval benchmark with evaluation flow and smoke test
- Add ae_agent under benchmarks/arteval_bench/src/agents/ae_agent (main, run_eval, runner, utils, runner.sh, install.sh)
- Wire benchmark main.py and run_eval_in_env.py for ae_agent: host path runs agent then evaluator, parses score; Docker path uses the same flow
- Add src/utils.py re-export for get_task when running from benchmark root
- SDK utils: do not overwrite existing env vars when loading env.toml (preserves API key)
- Add minimal smoke test: ae_agent_smoke artifact, ae_agent_smoke_test.jsonl (host + docker), run_ae_agent_smoke_test.sh
- Remove interactive_runner.py (interactive handled in runner)
- Use English throughout (docs, comments); ruff-compliant; single _make_eval_result for result shape

Co-authored-by: Cursor <cursoragent@cursor.com>
1 parent d4e4949 commit 30223e5

18 files changed: 1922 additions & 1126 deletions


benchmarks/arteval_bench/data/benchmark/ae_agent_smoke/README.md

Lines changed: 13 additions & 0 deletions
# AE Agent Smoke Test Artifact

Minimal task for quick testing of ae_agent (host/docker + evaluation). Should complete in under a minute.

## Task

1. In this directory (the artifact root), create a file named **success.txt**.
2. The file must contain exactly the single character **1** (no newline required).
3. No other steps are required.

Example (bash): `echo -n 1 > success.txt`

After you finish, the benchmark will run an evaluation script that checks for this file and outputs a score (1 if correct, 0 otherwise).
Lines changed: 44 additions & 0 deletions
# AE Agent smoke test

## Purpose

- Test the agent under `src/agents/ae_agent`: **host** and **docker** modes, and the **evaluation script** flow (the evaluator runs after the agent and its output is parsed for a score).
- The task is minimal (create `success.txt` with content `1` in the artifact root); it finishes in a few minutes and avoids long runs with the full arteval_tasks.

## Files

- **ae_agent_smoke/**: Minimal artifact
  - `README.md`: Task description (create success.txt with content 1)
  - `_agent_eval/check.py`: Evaluator; outputs `1` if success.txt exists and contains `1`, else `0`
- **ae_agent_smoke_test.jsonl**: Two lines
  - First line: `run_on_host: true`, runs ae_agent + evaluator on the host
  - Second line: `run_on_host: false`, runs ae_agent + evaluator in Docker

## How to run

From the **benchmarks/arteval_bench** directory:

```bash
# Set ANTHROPIC_API_KEY or ANTHROPIC_FOUNDRY_API_KEY first
python src/main.py \
  -i ./data/benchmark/ae_agent_smoke_test.jsonl \
  -a ae_agent \
  -m claude-sonnet-4-5-20250929 \
  -o ./outputs/ae_agent_smoke_$(date +%Y%m%d_%H%M%S)
```

- **Host task**: Runs the agent on the host, then runs `python3 _agent_eval/check.py` on the host to get the score.
- **Docker task**: Runs the agent in the container, then runs the evaluator in the container to get the score; the container is kept running by default for debugging.

Results are under the `-o` directory: `result.jsonl` (one JSON object per line with `score`, `status`, `test_method`, etc.) and `avg_score.json`.
## Interactive mode

The benchmark's `src/main.py` does not read an `interactive` field from the JSONL, so the command above only covers **non-interactive** runs. To test interactive mode:

- Use ae_agent's main entry with `--interactive`, and set `"env": "local"`/`"run_on_host": true` (host) or `"env": "docker"` (Docker) in the task's JSONL line, for example:

  ```bash
  cd src/agents/ae_agent
  python -m ae_agent.main --interactive -i ../../../data/benchmark/ae_agent_smoke_test.jsonl -o ../../../outputs/ae_agent_smoke_int
  ```

- In interactive mode, after the first task completes you can keep typing instructions; type `quit` or `exit` to end.
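For orientation, a `result.jsonl` line from the host task would look roughly like the following; only `score`, `status`, and `test_method` are named in the README above, so the remaining fields and values are illustrative assumptions:

```json
{"artifact_id": "ae_agent_smoke_host", "score": 1, "status": "success", "test_method": "python3 _agent_eval/check.py"}
```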
benchmarks/arteval_bench/data/benchmark/ae_agent_smoke/_agent_eval/check.py

Lines changed: 22 additions & 0 deletions
```python
#!/usr/bin/env python3
"""Minimal evaluator for ae_agent_smoke: output 1 if success.txt exists and contains '1', else 0.

Output must be a single digit on a line (or last line) for benchmark score parsing.
"""
import os
import sys


def main():
    root = os.path.dirname(os.path.dirname(os.path.abspath(__file__)))
    path = os.path.join(root, "success.txt")
    if os.path.isfile(path):
        with open(path, "r") as f:
            content = f.read().strip()
        if content == "1":
            print(1)
            sys.exit(0)
    print(0)
    sys.exit(0)


if __name__ == "__main__":
    main()
```
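To sanity-check the evaluator by hand outside the benchmark (illustrative; run from the artifact root, which contains `_agent_eval/`):

```bash
echo -n 1 > success.txt
python3 _agent_eval/check.py   # prints 1; prints 0 if success.txt is missing or wrong
```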
benchmarks/arteval_bench/data/benchmark/ae_agent_smoke_test.jsonl

Lines changed: 2 additions & 0 deletions
```json
{"artifact_id": "ae_agent_smoke_host", "artifact_dir": "ae_agent_smoke", "artifact_readme": "ae_agent_smoke/README.md", "evaluator": "python3 _agent_eval/check.py", "expected_score": 1, "run_on_host": true}
{"artifact_id": "ae_agent_smoke_docker", "artifact_dir": "ae_agent_smoke", "artifact_readme": "ae_agent_smoke/README.md", "docker_env": "bastoica/ae-agent-ubuntu24.04:latest", "evaluator": "python3 _agent_eval/check.py", "expected_score": 1, "run_on_host": false}
```
benchmarks/arteval_bench/run_ae_agent_smoke_test.sh

Lines changed: 21 additions & 0 deletions
```bash
#!/bin/bash
# Run ae_agent smoke test under arteval_bench (host + docker, with evaluation).
# Usage: ./run_ae_agent_smoke_test.sh [model_name]
# Default model: claude-sonnet-4-5-20250929

set -e
BENCH_ROOT="$(cd "$(dirname "$0")" && pwd)"
cd "$BENCH_ROOT"
MODEL="${1:-claude-sonnet-4-5-20250929}"
OUT_DIR="./outputs/ae_agent_smoke_$(date +%Y%m%d_%H%M%S)"
echo "==> AE Agent smoke test (host + docker + evaluation)"
echo "    Model: $MODEL"
echo "    Output: $OUT_DIR"
echo ""
python src/main.py \
  -i ./data/benchmark/ae_agent_smoke_test.jsonl \
  -a ae_agent \
  -m "$MODEL" \
  -o "$OUT_DIR"
echo ""
echo "==> Done. Results: $OUT_DIR/result.jsonl and $OUT_DIR/avg_score.json"
```
benchmarks/arteval_bench/src/agents/ae_agent/README.md

Lines changed: 27 additions & 29 deletions
````diff
@@ -1,16 +1,15 @@
 # AE Agent (ArtEval sub-agent)
 
-This agent is the **ae-agent** logic integrated as a sub-agent of the system-intelligence-benchmark ArtEval benchmark. It uses the Claude Agent SDK to run artifact evaluation tasks inside the benchmark container. Code is synced from the standalone [ae-agent](https://github.com/Couen/ae-agent) repo.
+This agent is the **ae_agent** for the system-intelligence-benchmark ArtEval benchmark, with the same logic as the standalone [ae-agent](https://github.com/Couen/ae-agent) repo. It runs inside the benchmark container and uses the Claude Agent SDK to execute artifact evaluation tasks.
 
 ## Files
 
-- **install.sh**: Installs `claude-agent-sdk==0.1.24` and configures `~/.claude/settings.json` (48h Bash timeout).
-- **runner.sh**: Entry point invoked as `runner.sh <model> <task_or_path>`. Forwards to `runner.py`. Uses `/agent/current_task.txt` when the benchmark passes task via file.
-- **runner.py**: Runs the task with Claude Agent SDK; supports rate-limit retry (429), message_formatter; second argument can be task text or path to file.
-- **run_eval.py**: Orchestration for one task: `env='local'` runs on host, otherwise runs in Docker (requires swerex/swe-rex).
-- **main.py**: CLI entry for batch runs from JSONL; supports both host and Docker per task (see "Run on host (local)" below).
-- **utils.py**: `DEFAULT_TIMEOUT_MS`, task/path helpers, Tee, reports, summary (used by runner, main, run_eval).
-- **interactive_runner.py**: Interactive multi-turn session inside container (e.g. `docker exec -it <cid> python3 /agent/interactive_runner.py <model>`).
+- **install.sh**: Installs `claude-agent-sdk` inside the container for use by runner.py.
+- **runner.sh**: Entry script; invoked as `runner.sh <model> <task_or_path>`. Uses `/agent/current_task.txt` when the benchmark passes the task via file.
+- **runner.py**: Runs the task with the Claude Agent SDK; supports 429 rate-limit retry; the second argument can be task text or a path to a task file. The artifact path in the container is `/repo`.
+- **run_eval.py**: Single-task orchestration: `env='local'` runs on host, otherwise runs in Docker (requires swerex/swe-rex).
+- **main.py**: CLI entry for batch runs from JSONL; supports host or Docker per task.
+- **utils.py**: Timeout, task/path helpers, Tee, reports, summary (used by runner, main, run_eval).
 - **__init__.py**: Package marker.
 
 ## Usage from the benchmark
@@ -21,44 +20,43 @@ From the benchmark root (`benchmarks/arteval_bench/`):
 python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/ae_agent_run
 ```
 
-Or use the helper script from `data/benchmark/`:
-
-```bash
-./data/benchmark/run_ae_agent.sh [model_name]
-```
+You can also use `-a ae-agent`; it is equivalent to `ae_agent`.
 
 The benchmark will:
 
-1. Upload the agent to `/agent` in the container.
-2. For ae_agent: upload task to `/agent/current_task.txt`, then run `runner.sh "$model" /agent/current_task.txt` (avoids shell quoting with large tasks).
-3. Use long-running and live-log behavior (48h timeout, live log streaming, `_agent_eval` removal before run and re-upload before evaluation, container kept for debugging).
-4. Pass through `ANTHROPIC_API_KEY`, `ANTHROPIC_FOUNDRY_API_KEY`, `ANTHROPIC_FOUNDRY_BASE_URL`, `CLAUDE_CODE_USE_FOUNDRY` when set.
+1. Upload this agent to `/agent` in the container.
+2. For ae_agent: write the task to `/agent/current_task.txt`, then run `runner.sh "$model" /agent/current_task.txt` (avoids shell quoting issues with large tasks).
+3. Use long-running and live-log behavior (48h timeout, streamed logs, removal of `_agent_eval` before the run and re-upload before evaluation, container kept for debugging).
+4. **Evaluation script flow** (same as claude_sdk): after the agent finishes, run the JSONL `evaluator` (test_method), e.g. `cd /repo && python _agent_eval/main.py`, parse its output for `score`, and write the score to the result.
+5. If set, pass through `ANTHROPIC_API_KEY`, `ANTHROPIC_FOUNDRY_API_KEY`, `ANTHROPIC_FOUNDRY_BASE_URL`, `CLAUDE_CODE_USE_FOUNDRY`.
+
+**Evaluation flow on host**: When `run_on_host=True` and the agent is ae_agent, `run_eval_in_env.run_eval_on_host` calls this package's `run_agent_then_eval()`: run the agent first, then run `test_method` on the host (e.g. `cd project_path && python _agent_eval/main.py`), parse the score with `utils.parse_eval_score()`, and return a result with the same shape as the Docker path (`score`, `test_method`, `status`).
 
 ## Dependencies
 
-- Python 3 with `claude-agent-sdk` (installed by `install.sh`).
-- Optional: `message_formatter` for prettier output (if present in the environment).
+- Python 3; `claude-agent-sdk` is installed in the container via `install.sh`.
+- When running in Docker via the benchmark's `run_eval_in_env.py`, `swerex` must be installed on the host (the benchmark includes it). When using this directory's `main.py` for standalone Docker mode, you also need `swe-rex`.
 
-## Run on host (local)
+## Running on host (local)
 
-You can run tasks **on the host machine** (no Docker) from this directory:
+You can run tasks on the **host** from this directory (without the benchmark's Docker flow):
 
-1. **Single-task / batch via main.py**
-   Use a JSONL input where each line can set `"env": "local"` or `"run_on_host": true` to run that task on the host. Other lines without that run in Docker (if swerex is available).
+1. **Single or batch via main.py**
+   Use a JSONL where each line can set `"env": "local"` or `"run_on_host": true` to run that task on the host; others run in Docker (requires swerex).
 
    ```bash
   cd benchmarks/arteval_bench/src/agents/ae_agent
-   python main.py -i /path/to/tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/host_run
+   python -m ae_agent.main -i /path/to/tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/host_run
   ```
 
-2. **Requirements for host mode**
-   - `ANTHROPIC_API_KEY` or `ANTHROPIC_FOUNDRY_API_KEY` set
+2. **Host mode requirements**
+   - Set `ANTHROPIC_API_KEY` or `ANTHROPIC_FOUNDRY_API_KEY`
   - Docker installed and running (for prereq check; agent runs on host)
   - `pip install claude-agent-sdk`
 
 3. **Docker mode from this directory**
-   If JSONL has `"env": "docker"` (or no `run_on_host`), `main.py` will run that task in Docker via `run_eval.py` (requires `swe-rex` / `swerex`).
+   If the JSONL has `"env": "docker"` (or `run_on_host` is not set), `main.py` runs that task in Docker via `run_eval.py` (requires `swe-rex`/`swerex`).
 
-## Relation to standalone ae-agent repo
+## Relation to the standalone ae-agent repo
 
-The standalone ae-agent repo provides the same host/Docker CLI. This sub-agent includes both the **in-container** runner (used by the benchmark's `run_eval_in_env.py`) and the **host/local** mode via `main.py` and `run_eval.py`.
+The standalone ae-agent repo provides the same host/Docker CLI. This sub-agent includes both the **in-container** runner (used by the benchmark's `run_eval_in_env.py`) and **host/local** mode via `main.py` and `run_eval.py`.
````
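The host flow described in this README relies on `utils.parse_eval_score()`, whose body is not part of this commit view. A minimal sketch of the idea, assuming evaluators print the score as a bare number on its own line (as `check.py` above does); the real `utils.py` implementation may differ:

```python
import re


def parse_eval_score(output: str):
    """Return the last bare number that appears on its own line of evaluator output, or None."""
    score = None
    for line in output.splitlines():
        stripped = line.strip()
        if re.fullmatch(r"-?\d+(\.\d+)?", stripped):
            # Later numeric lines win, so the final printed score is used.
            score = float(stripped) if "." in stripped else int(stripped)
    return score
```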
benchmarks/arteval_bench/src/agents/ae_agent/__init__.py

Lines changed: 21 additions & 2 deletions
```diff
@@ -1,4 +1,23 @@
-"""AE Agent for ArtEvalBench - Claude Agent SDK runner for artifact evaluation tasks.
+"""AE Agent - A tool for running Claude Agent SDK on artifact evaluation tasks.
 
-Contract: artifact at /repo, this agent at /agent; task passed as CLI arg or path to file (/agent/current_task.txt).
+Output files (under save_path):
+- ae_report_<artifact_id>.md: Per-artifact report with status and agent summary
+- ae_log_<artifact_id>.log: Per-artifact execution log
+- result.jsonl: Per-task results (one JSON per line)
+- summary.json: Overall statistics
 """
+
+from .main import cli_main, main
+from .run_eval import run_agent_then_eval, run_eval
+from .runner import build_system_prompt, run_agent
+from .utils import parse_eval_score
+
+__all__ = [
+    'build_system_prompt',
+    'cli_main',
+    'main',
+    'parse_eval_score',
+    'run_agent',
+    'run_agent_then_eval',
+    'run_eval',
+]
```
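With these re-exports, callers such as the benchmark's `run_eval_in_env.py` can import the public surface from the package root rather than from individual submodules. A sketch, assuming the package directory is importable; the re-exported functions' signatures are not shown in this commit:

```python
# Hypothetical caller: only the import surface is illustrated here,
# since the function signatures are not part of this diff.
from ae_agent import parse_eval_score, run_agent_then_eval
```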
benchmarks/arteval_bench/src/agents/ae_agent/install.sh

Lines changed: 3 additions & 13 deletions
```diff
@@ -1,6 +1,6 @@
 #!/bin/bash
-# Setup AE Agent environment inside benchmark container.
-# Ensures claude-agent-sdk is available so runner.py can run.
+# Setup agent running environment inside Docker container.
+# Ensures claude-agent-sdk is available so runner.py can import claude_agent_sdk.
 set -e
 if ! python3 -c "import claude_agent_sdk" 2>/dev/null; then
   echo "Installing claude-agent-sdk..."
@@ -9,14 +9,4 @@ if ! python3 -c "import claude_agent_sdk" 2>/dev/null; then
     echo "WARNING: claude_agent_sdk still not importable; runner may fail."
   fi
 fi
-# 48h Bash timeout for long-running artifact tasks
-mkdir -p ~/.claude
-cat > ~/.claude/settings.json << 'EOF'
-{
-  "env": {
-    "BASH_MAX_TIMEOUT_MS": "172800000",
-    "BASH_DEFAULT_TIMEOUT_MS": "172800000"
-  }
-}
-EOF
-echo "AE Agent environment ready (~/.claude/settings.json configured)."
+echo "Agent environment ready."
```

benchmarks/arteval_bench/src/agents/ae_agent/interactive_runner.py

Lines changed: 0 additions & 105 deletions
This file was deleted.
