A benchmark for evaluating LLM agents on real, self-hosted SaaS applications. Each task asks the agent to drive a browser through a multi-step business workflow (project management, accounting, HR, document authoring, etc.); a per-task verifier inspects the running application's state to score the result.
The bench currently ships 106 task instances across 6 domains (split into a text-only uni-m track and a multimodal multi-m track) and 23 self-hosted SaaS apps:
| Track | Domain | Tasks | Representative apps |
|---|---|---|---|
| uni-m | Business | 15 | Twenty, Bigcapital, HRMS, Pretix |
| uni-m | Healthcare | 16 | OpenEMR, OnlyOffice, OpnForm |
| uni-m | Software | 31 | Baserow, OpenProject, code-server, Metabase |
| uni-m | Teamwork | 12 | OnlyOffice, Mattermost, RoundcubeMail, ownCloud |
| multi-m | Agriculture | 12 | Grocy, farmOS, Recipya, e-label |
| multi-m | Media | 20 | SiYuan, Watcharr, BookLore, PhotoPrism, MediaCMS |
Multi-m tasks consume image / audio / PDF inputs from
tasks/multi-m/inputs/; verifiers locate them via paths relative to
verify.py. Use scripts/fetch_multimodal_assets.sh to check that the
expected input files are present before running the multi-m suite.
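As a rough illustration of what "relative to verify.py" can look like inside a verifier, the sketch below resolves an input file against the directory that holds verify.py. The helper name `resolve_input` and the example relative path are hypothetical; the actual directory layout between a task's verify.py and tasks/multi-m/inputs/ is not specified here.

```python
from pathlib import Path


def resolve_input(verify_file: str, relative: str) -> Path:
    """Resolve an input path relative to the directory containing verify.py.

    Hypothetical helper: `relative` is whatever path the verifier expects,
    for example "../inputs/sample.png" (an invented example path).
    """
    path = (Path(verify_file).resolve().parent / relative).resolve()
    if not path.is_file():
        # Fail early with the full path so a missing asset is easy to spot.
        raise FileNotFoundError(f"missing multimodal input: {path}")
    return path
```

Inside a verifier this would typically be called as `resolve_input(__file__, "...")`, so the lookup does not depend on the current working directory.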
The reference agent is built on browser-use
and talks to any OpenAI-compatible chat-completions endpoint. You can swap
in your own agent — only the verify.py contract is load-bearing.
saas_bench/              Eval harness (Python package)
    run.py               CLI entry: orchestrates concurrent task execution
    agent.py             browser-use reference agent
    slot.py              Per-slot Docker container manager
    loader.py            Task discovery + prompt builder
    verify_runner.py     Runs verify.py and parses results
    apps.yaml            App registry (ports, start commands, health probes)
docker/                  Compose templates + image archives (download separately)
tasks/
    uni-m/               Text-only tasks (Software, Business, Healthcare, Teamwork)
    multi-m/             Multimodal tasks (Agriculture, Media) + inputs/ assets
scripts/                 run.sh, stop_all.sh, load_images.sh, fetch_multimodal_assets.sh
docs/                    Verify protocol and task format specifications
- Linux host (tested on Ubuntu 22.04 / Alibaba Cloud Linux)
- Docker 24+ with the compose plugin
- Python ≥ 3.10
- ~100 GB free disk for the SaaS app images
- Outbound network access (for the first-time pull of compose-stack auxiliary images and for pip install scipy numpy inside the code-server container)
# 1. Clone and install the Python package
git clone <this-repo>.git SaaS-Bench
cd SaaS-Bench
pip install -e .
playwright install chromium
pip install socksio
# 2. Download the SaaS-Bench Docker images from Hugging Face (see docker/README.md
# for the URL) and place the .tar files under docker/images/, then:
bash scripts/load_images.sh
# 3. Configure your LLM endpoint
cp .env.example .env
$EDITOR .env # set LLM_API_KEY, LLM_BASE_URL, LLM_MODEL

We recommend running the evaluation on a machine with more than 500 GB of RAM to support parallel SaaS environment deployment and long-horizon agent execution.
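Before starting a long run, it can help to confirm that the three LLM settings named in the setup steps are actually populated. The following is a minimal sketch, not part of the harness: the helper `missing_llm_vars` is hypothetical, and it assumes the values from .env have already been loaded into a mapping (how the harness itself reads .env is not described here).

```python
import os

# Variable names taken from the setup steps above.
REQUIRED_VARS = ("LLM_API_KEY", "LLM_BASE_URL", "LLM_MODEL")


def missing_llm_vars(env=None) -> list[str]:
    """Return the required variable names that are unset or blank.

    Hypothetical helper: `env` defaults to the process environment, which
    assumes the .env values were exported before this check runs.
    """
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name, "").strip()]


if __name__ == "__main__":
    missing = missing_llm_vars()
    if missing:
        print("Missing LLM settings:", ", ".join(missing))
    else:
        print("All LLM settings are present.")
```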
Run all tasks with 4 concurrent workers:
bash scripts/run.sh

Useful flags:
bash scripts/run.sh --workers 8 # bump concurrency
bash scripts/run.sh --tasks-dir tasks/uni-m/Business # one domain
bash scripts/run.sh --task-ids business_023 software_004 # cherry-pick
bash scripts/run.sh --max-steps 200 # tighter step budget
bash scripts/run.sh --result-dir results/run_2026_05_05 # custom output dir
bash scripts/run.sh --no-isolation # reuse already-running containers
bash scripts/run.sh --log results/run.log # also tee to a file

Per worker, the harness:
- Picks a slot id and computes app ports 30000 + slot_id*20 + app_index.
- Starts the docker containers / compose stacks for that task's sites.
- Launches a headless Chrome and a fresh browser-use Agent.
- Saves the agent trajectory to <result_dir>/<task_id>.json.
- Runs verify.py and saves the score to <result_dir>/<task_id>_verify.json.
- Tears down the containers and tmp dirs.
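The port formula in the first step can be worked through with a few lines of Python. This is only an illustration of the arithmetic; the function name `app_port` and the constant names are made up for the example, and how the harness assigns app_index to each app is not specified here.

```python
BASE_PORT = 30000       # base from the formula above
PORTS_PER_SLOT = 20     # per-slot stride from the formula above


def app_port(slot_id: int, app_index: int) -> int:
    """Host port for the app at `app_index` within slot `slot_id`."""
    if slot_id < 0 or app_index < 0:
        raise ValueError("slot_id and app_index must be non-negative")
    return BASE_PORT + slot_id * PORTS_PER_SLOT + app_index


if __name__ == "__main__":
    # Print the first three app ports for the first three slots.
    for slot_id in range(3):
        print(slot_id, [app_port(slot_id, i) for i in range(3)])
```

With a stride of 20, two slots can only map to distinct ports as long as app_index stays below 20, which is worth keeping in mind when reading port numbers in logs.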
Aggregated stats land in <result_dir>/summary.json. Errors are appended
to <result_dir>/errors.log without aborting the run.
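Because errors do not abort the run, some tasks may end up with a trajectory file but no verifier output. The sketch below relies only on the file naming described above (`<task_id>.json`, `<task_id>_verify.json`, `summary.json`); the helper `tasks_missing_verify` is hypothetical and makes no assumption about what is inside any of those files.

```python
from pathlib import Path

VERIFY_SUFFIX = "_verify.json"


def tasks_missing_verify(result_dir: str) -> list[str]:
    """List task ids that have a trajectory file but no verify result."""
    root = Path(result_dir)
    # Task ids for which a <task_id>_verify.json file exists.
    verified = {p.name[: -len(VERIFY_SUFFIX)] for p in root.glob("*" + VERIFY_SUFFIX)}
    missing = []
    for path in sorted(root.glob("*.json")):
        # Skip verifier outputs and the aggregate summary; the rest are trajectories.
        if path.name.endswith(VERIFY_SUFFIX) or path.name == "summary.json":
            continue
        if path.stem not in verified:
            missing.append(path.stem)
    return missing
```

The returned ids can be cross-checked against errors.log, or passed back to the runner through the --task-ids flag to retry only those tasks.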
When in doubt, you can purge stale containers from a previous (crashed) run:
bash scripts/stop_all.sh

The harness invokes a single async function run_task(task, model_name, prompt, result_dir, max_steps, slot_id, todo_md) -> dict. Implement that
in a module of your choice and have saas_bench.run import it instead of
saas_bench.agent. The contract is intentionally tiny: return a dict with
status (completed / error), agent_output (string), and
trajectory (list of step dicts). The verifier runs against the live
docker state; what the agent does to get there is up to you.
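A placeholder agent that satisfies this contract could look roughly like the following. The signature and the three return keys come from the description above; the parameter types, the shape of each step dict, and the no-op behaviour are assumptions made for the example.

```python
import asyncio
from typing import Any


async def run_task(
    task: Any,          # type not specified by the contract; assumed opaque here
    model_name: str,
    prompt: str,
    result_dir: str,
    max_steps: int,
    slot_id: int,
    todo_md: Any,       # type not specified by the contract; assumed opaque here
) -> dict:
    """Placeholder agent: takes no browser actions and reports the outcome."""
    trajectory: list[dict] = []
    try:
        # A real agent would drive the browser here for at most `max_steps` steps.
        trajectory.append({"step": 1, "action": "noop"})  # hypothetical step schema
        status, output = "completed", "No actions taken."
    except Exception as exc:
        # Report failures through the result dict instead of raising.
        status, output = "error", str(exc)
    return {"status": status, "agent_output": output, "trajectory": trajectory}


if __name__ == "__main__":
    result = asyncio.run(run_task({}, "model", "prompt", "results", 10, 0, None))
    print(result["status"])
```

Returning an error status rather than raising keeps a failing task from looking like a harness fault, though whether the harness also catches exceptions from run_task is not stated here.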
See docs/task_format.md and docs/verify_protocol.md.
@misc{saasbench2026,
title = {SaaS-Bench: Can Computer-Use Agents Leverage Real-World SaaS to Solve Professional Workflows?},
author = {UniPat AI},
year = {2026}
}