Integrate ae_agent into ArtEval benchmark with evaluation flow and smoke test
- Add ae_agent under benchmarks/arteval_bench/src/agents/ae_agent (main, run_eval, runner, utils, runner.sh, install.sh)
- Wire benchmark main.py and run_eval_in_env.py for ae_agent: host path runs agent then evaluator, parses score; Docker path uses same flow
- Add src/utils.py re-export for get_task when running from benchmark root
- SDK utils: do not overwrite existing env vars when loading env.toml (preserve API key)
- Add minimal smoke test: ae_agent_smoke artifact, ae_agent_smoke_test.jsonl (host + docker), run_ae_agent_smoke_test.sh
- Remove interactive_runner.py (interactive handled in runner)
- Use English throughout (docs, comments); ruff-compliant; single _make_eval_result for result shape
Co-authored-by: Cursor <cursoragent@cursor.com>
---

- Test the agent under `src/agents/ae_agent`: **host** and **docker** modes, and the **evaluation script** flow (the evaluator runs after the agent and parses the score).
- Task is minimal (create `success.txt` with content `1` in the artifact root); it finishes in a few minutes and avoids long runs with the full arteval_tasks set.

## Files

- **ae_agent_smoke/**: Minimal artifact
  - `README.md`: Task description (create `success.txt` with content `1`)
  - `_agent_eval/check.py`: Evaluator; outputs `1` if `success.txt` exists and contains `1`, else `0` (a sketch follows this list)
- **ae_agent_smoke_test.jsonl**: Two lines
  - First line: `run_on_host: true`, run ae_agent + evaluator on the host
  - Second line: `run_on_host: false`, run ae_agent + evaluator in Docker
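
A minimal sketch of what such an evaluator could look like (hypothetical; the actual `check.py` ships in the artifact):

```python
# _agent_eval/check.py (sketch): print 1 if ../success.txt contains "1", else 0.
from pathlib import Path

def main() -> None:
    # check.py lives in <artifact>/_agent_eval/, so the artifact root is two levels up.
    success_file = Path(__file__).resolve().parent.parent / "success.txt"
    ok = success_file.is_file() and success_file.read_text().strip() == "1"
    print(1 if ok else 0)

if __name__ == "__main__":
    main()
```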
## How to run
From the **benchmarks/arteval_bench** directory:
```bash
# Set ANTHROPIC_API_KEY or ANTHROPIC_FOUNDRY_API_KEY first
./data/benchmark/run_ae_agent_smoke_test.sh
```

- **Host task**: Runs the agent on the host, then runs `python3 _agent_eval/check.py` on the host to get the score.
- **Docker task**: Runs the agent in the container, then runs the evaluator in the container to get the score; the container is kept running by default for debugging.

Results are under the `-o` directory: `result.jsonl` (one JSON object per line with `score`, `status`, `test_method`, etc.) and `avg_score.json`.
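
For example, one line of `result.jsonl` for a passing host run might look like this (illustrative values only):

```json
{"score": 1.0, "status": "success", "test_method": "python3 _agent_eval/check.py"}
```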
## Interactive mode
The benchmark’s `src/main.py` does not read an `interactive` field from the JSONL, so the command above only covers **non-interactive** runs. To test interactive mode:
- Use ae_agent’s main entry with `--interactive`, and set `"env": "local"` or `"run_on_host": true` / `"env": "docker"` in the JSONL for the task, for example:
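
A minimal sketch, assuming `main.py` accepts the same `-i`/`-a`/`-m`/`-o` flags as the batch run shown later in this document, plus `--interactive`:

```bash
cd benchmarks/arteval_bench/src/agents/ae_agent
# tasks.jsonl lines should set "env": "local" / "run_on_host": true (or "env": "docker")
python main.py -i /path/to/tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 \
  -o ./outputs/interactive_run --interactive
```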

---

This agent is the **ae_agent** for the system-intelligence-benchmark ArtEval benchmark, with the same logic as the standalone [ae-agent](https://github.com/Couen/ae-agent) repo. It runs inside the benchmark container using the Claude Agent SDK to execute artifact evaluation tasks.
## Files

- **install.sh**: Installs `claude-agent-sdk` inside the container for use by `runner.py`.
- **runner.sh**: Entry script; invoked as `runner.sh <model> <task_or_path>`. Uses `/agent/current_task.txt` when the benchmark passes the task via a file.
- **runner.py**: Runs the task with the Claude Agent SDK; supports 429 rate-limit retry; the second argument can be task text or a path to a task file. The artifact path in the container is `/repo`.
- **run_eval.py**: Single-task orchestration: `env='local'` runs on the host, otherwise in Docker (requires swerex/swe-rex).
- **main.py**: CLI entry for batch runs from JSONL; supports host or Docker per task.
- **utils.py**: Timeout, task/path helpers, Tee, reports, summary (used by runner, main, run_eval).
- **__init__.py**: Package marker.

## Usage from the benchmark

From the benchmark root (`benchmarks/arteval_bench/`):

```bash
python src/main.py -i ./data/benchmark/arteval_tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/ae_agent_run
```
Or use the helper script from `data/benchmark/`:
```bash
./data/benchmark/run_ae_agent.sh [model_name]
```
You can also use `-a ae-agent`; it is equivalent to `ae_agent`.
The benchmark will:
1. Upload this agent to `/agent` in the container.
2. For ae_agent: write the task to `/agent/current_task.txt`, then run `runner.sh "$model" /agent/current_task.txt` (avoids shell quoting issues with large tasks).
3. Use long-running and live-log behavior (48h timeout, streamed logs, removal of `_agent_eval` before the run and re-upload before evaluation, container kept for debugging).
4. **Evaluation script flow** (same as claude_sdk): after the agent finishes, run the JSONL `evaluator` (test_method), e.g. `cd /repo && python _agent_eval/main.py`, parse the output for `score`, and write it to the result.
5. If set, pass through `ANTHROPIC_API_KEY`, `ANTHROPIC_FOUNDRY_API_KEY`, `ANTHROPIC_FOUNDRY_BASE_URL`, `CLAUDE_CODE_USE_FOUNDRY`.

**Evaluation flow on host**: When `run_on_host=True` and the agent is ae_agent, `run_eval_in_env.run_eval_on_host` calls this package’s `run_agent_then_eval()`: run the agent first, then run `test_method` on the host (e.g. `cd project_path && python _agent_eval/main.py`), parse score with `utils.parse_eval_score()`, and return a result with the same shape as the Docker path (`score`, `test_method`, `status`).
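
A minimal sketch of that host flow, with assumed field names (`task`, `project_path`, `test_method`) standing in for the real JSONL schema:

```python
import subprocess

def run_agent_then_eval(task: dict, model: str) -> dict:
    """Sketch: run the agent on the host, then the evaluator, then parse the score."""
    project_path = task["project_path"]   # assumed field name from this README
    test_method = task["test_method"]     # e.g. "python _agent_eval/main.py"

    # 1) Agent first: runner.sh <model> <task_or_path>, as described above.
    subprocess.run(["./runner.sh", model, task["task"]], check=False)

    # 2) Evaluator second, from the artifact root, capturing its output.
    out = subprocess.run(test_method, shell=True, cwd=project_path,
                         capture_output=True, text=True)

    # 3) Parse the score (the real code uses utils.parse_eval_score());
    #    here: last whitespace-separated token that parses as a float.
    score = 0.0
    for tok in reversed(out.stdout.split()):
        try:
            score = float(tok)
            break
        except ValueError:
            continue

    # 4) Same result shape as the Docker path.
    return {"score": score,
            "status": "success" if out.returncode == 0 else "error",
            "test_method": test_method}
```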
## Dependencies

- Python 3; `claude-agent-sdk` is installed in the container via `install.sh`.
- When running in Docker via the benchmark’s `run_eval_in_env.py`, install `swerex` on the host (the benchmark includes it). When using this directory’s `main.py` for Docker mode standalone, you also need `swe-rex`.
## Running on host (local)
You can run tasks on the **host** from this directory (without the benchmark’s Docker flow):

1. **Single or batch via main.py**

Use a JSONL where each line can set `"env": "local"` or `"run_on_host": true` to run that task on the host; others run in Docker (requires swerex). An example line is shown after the command below.
```bash
cd benchmarks/arteval_bench/src/agents/ae_agent
python main.py -i /path/to/tasks.jsonl -a ae_agent -m claude-sonnet-4-5-20250929 -o ./outputs/host_run
```
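
For illustration, a host-mode task line might look like this (`run_on_host`, `project_path`, and `test_method` are fields referenced in this README; the `task` field name is an assumption):

```json
{"task": "Create success.txt with content 1", "project_path": "/path/to/ae_agent_smoke", "run_on_host": true, "test_method": "python3 _agent_eval/check.py"}
```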

2. **Host mode requirements**
   - Set `ANTHROPIC_API_KEY` or `ANTHROPIC_FOUNDRY_API_KEY`
   - Docker installed and running (for the prereq check; the agent runs on the host)
   - `pip install claude-agent-sdk`

3. **Docker mode from this directory**

   If the JSONL has `"env": "docker"` (or `run_on_host` is not set), `main.py` runs that task in Docker via `run_eval.py` (requires `swe-rex`/`swerex`).

## Relation to the standalone ae-agent repo
The standalone ae-agent repo provides the same host/Docker CLI. This sub-agent includes both the **in-container** runner (used by the benchmark’s `run_eval_in_env.py`) and **host/local** mode via `main.py` and `run_eval.py`.