LayoutArena

LayoutArena is a scaffold for a hackathon project on AI control for tool-using design agents.

The repository is organized around five areas:

layoutarena/env: canvas state, element models, constraints, scoring, and tool execution
layoutarena/control: enforcement, bounded control, stewardship monitoring, and monitor heuristics
layoutarena/attacks: adversarial side-task templates
layoutarena/experiments: sample briefs and evaluation runners
layoutarena/viz: SVG rendering, replay export, and Pareto utilities

The current code is intentionally a typed scaffold: core interfaces are in place, some pieces are implemented, and the remaining research choices are marked with pseudocode comments where they still need project-specific decisions.

Quick sanity check:

python3 -m compileall layoutarena

The scaffold is compatible with the current local interpreter in this repo (Python 3.9.x).

Primary planning document: TODO.md

Metrics and experiment outputs

Batch and single runs write outputs/metrics.csv plus per-run folders under outputs/runs/<run_id>/ (see layoutarena/experiments/output_writer.py).

Protocol-level monitoring (what differentiates bounded control and stewardship from enforcement) is summarized on each EpisodeSummary and in the CSV:

Field	Meaning
`monitor_detected`	True if the protocol blocked at least one tool call or attached non-empty `monitor_reasons` to a call that still executed (e.g. redirect-with-allow under stewardship).
`monitor_block_count`	Number of steps with `status == "blocked"` — the protocol vetoed execution.
`monitor_nudge_count`	Number of executed steps with non-empty `monitor_reasons` — “noticed and noted / shaped, but not vetoed.”

Use monitor_block_count when you want a strict “hard intervention” rate for plots or tables; use monitor_detected for the broad protocol-attention signal (aligned with the default Pareto / detection-rate figures). scripts/generate_plots.py still uses monitor_detected for backward compatibility; new columns are optional when loading older CSVs (they default to zero).

Helper: layoutarena.experiments.run_eval.compute_monitor_metrics(state) returns (monitor_detected, monitor_block_count, monitor_nudge_count) from a final LayoutState.

LLM runs (`--model`)

scripts/run_batch.py and scripts/run_single.py load layout-hackathon/.env if present. For OpenAI-style model names (e.g. gpt-4o-mini), set OPENAI_API_KEY. For Anthropic / Gemini, set ANTHROPIC_API_KEY or GEMINI_API_KEY / GOOGLE_API_KEY respectively. If the key is missing, the batch runner exits immediately with a clear error instead of running hundreds of failing episodes. Omit --model to use the deterministic scripted policy (no API calls).

Name		Name	Last commit message	Last commit date
Latest commit History 18 Commits
layoutarena.egg-info		layoutarena.egg-info
layoutarena		layoutarena
report		report
scripts		scripts
tests		tests
.env.example		.env.example
.gitignore		.gitignore
.pre-commit-config.yaml		.pre-commit-config.yaml
LICENSE		LICENSE
README.md		README.md
TODO.md		TODO.md
architect.md		architect.md
poetry.lock		poetry.lock
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

LayoutArena

Metrics and experiment outputs

LLM runs (`--model`)

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

LayoutArena

Metrics and experiment outputs

LLM runs (--model)

About

Resources

License

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

LLM runs (`--model`)

Packages