Add AppWorld environment integration for GRPO by hamishivi · Pull Request #1501 · allenai/open-instruct

hamishivi · 2026-02-27T20:46:54Z

Summary

add a new stateful AppWorldEnv (env_name=appworld) that exposes an appworld_execute tool, persists AppWorld shell state across turns, and converts task completion/evaluation into rollout reward
register the new environment in the shared env/tool registry and exports so GRPO can auto-discover and instantiate it from dataset env_config
add supporting artifacts: AppWorld train/eval dataset builder script, 1-GPU debug train+eval launcher, and unit tests covering registration/reset/step/reward/cleanup behavior

Test plan

uv run pytest tests/test_appworld_env.py tests/test_environments.py
uv run pytest open_instruct/environments/tools/tests/test_tools.py -k registry
uv run ruff check open_instruct/environments/appworld.py open_instruct/environments/tools/tools.py open_instruct/environments/__init__.py tests/test_appworld_env.py scripts/data/create_appworld_env_datasets.py

Made with Cursor

Introduce a stateful AppWorld environment with tool registry wiring, plus dataset/debug scripts and tests so GRPO can train on AppWorld tasks and evaluate on a separate split. Made-with: Cursor

Document the AppWorld GRPO environment integration in CHANGELOG with a link to PR #1501 per repository PR workflow requirements. Made-with: Cursor

gemini-code-assist · 2026-02-27T20:47:19Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the AppWorld environment into the existing reinforcement learning framework. The primary goal is to allow GRPO (Generative Reinforcement Learning with Policy Optimization) to interact with AppWorld tasks, enabling agents to learn by executing Python code within a simulated application environment. This integration provides a robust mechanism for defining tasks, managing execution state, and evaluating agent performance, thereby expanding the range of environments available for training and evaluating code-generating agents.

Highlights

New AppWorld Environment: Introduced a stateful AppWorldEnv that exposes an appworld_execute tool, maintains AppWorld shell state across turns, and converts task completion/evaluation into rollout reward for GRPO.
Environment Registration: Registered the new AppWorldEnv and its configuration (AppWorldEnvConfig) within the shared environment and tool registry, enabling GRPO to automatically discover and instantiate it from dataset env_config.
Supporting Artifacts: Added a script for building AppWorld train/eval datasets, a 1-GPU debug training and evaluation launcher, and comprehensive unit tests covering registration, reset, step, reward, and cleanup behavior of the AppWorldEnv.

🧠 New Feature in Public Preview: You can now enable Memory to help Gemini Code Assist learn from your team's feedback. This makes future code reviews more consistent and personalized to your project's style. Click here to enable Memory in your admin console.

Changelog

open_instruct/environments/init.py
- Imported AppWorldEnv and AppWorldEnvConfig.
- Added AppWorldEnv and AppWorldEnvConfig to the module's __all__ export list.
open_instruct/environments/appworld.py
- Added a new file defining the AppWorldEnv class, which integrates AppWorld tasks as an RL environment.
- Implemented the appworld_execute tool for running Python code within the AppWorld shell.
- Included logic for lazy loading of AppWorld symbols, environment initialization, task resetting, and step execution.
- Developed methods for evaluating task completion and extracting scores to determine rewards.
- Defined AppWorldEnvConfig for configuring the AppWorld environment parameters.
open_instruct/environments/tools/tools.py
- Imported AppWorldEnvConfig.
- Added AppWorldEnvConfig to the TOOL_REGISTRY for auto-discovery.
scripts/data/create_appworld_env_datasets.py
- Added a new script to generate training and evaluation datasets for the AppWorld RL environment.
- Included functionality to load AppWorld tasks, format supervisor information, and build user prompts.
- Provided options to save datasets locally as JSONL or upload them to HuggingFace Hub.
scripts/train/debug/envs/appworld_1gpu.sh
- Added a new debug shell script for launching GRPO training with AppWorldEnv on a single GPU.
- Configured environment variables and GRPO arguments specific to AppWorld training and evaluation.
tests/test_appworld_env.py
- Added a new file containing unit tests for the AppWorldEnv.
- Implemented mock classes for AppWorld components to facilitate isolated testing.
- Verified environment registration, reset behavior, step execution, reward calculation, unknown tool handling, and resource cleanup.

Activity

The pull request introduces a new AppWorld environment integration for GRPO, along with necessary supporting scripts and tests.
No human activity (comments, reviews) has been recorded yet for this pull request.

Using Gemini Code Assist

The full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips.

Invoking Gemini

You can request assistance from Gemini at any point by creating a comment using either /gemini <command> or @gemini-code-assist <command>. Below is a summary of the supported commands on the current page.

Feature	Command	Description
Code Review	`/gemini review`	Performs a code review for the current pull request in its current state.
Pull Request Summary	`/gemini summary`	Provides a summary of the current pull request in its current state.
Comment	@gemini-code-assist	Responds in comments when explicitly tagged, both in pull request comments and review comments.
Help	`/gemini help`	Displays a list of available commands.

Customization

To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a .gemini/ folder in the base of the repository. Detailed instructions can be found here.

Limitations & Feedback

Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here.

You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension.

Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution. ↩

gemini-code-assist

Code Review

This pull request introduces a new stateful environment, AppWorldEnv, for GRPO, along with necessary supporting components like a dataset creation script, a debug launcher, and unit tests. The implementation is generally of high quality and well-structured. The new environment is correctly integrated into the existing framework. The unit tests are thorough and make good use of mocking to isolate the component under test. I have one major concern regarding a potential race condition when configuring the appworld_root via environment variables in a concurrent setting, which could lead to unpredictable behavior.

Add an `appworld` optional dependency extra and introduce an explicit availability guard so AppWorld integration fails with a clear install hint instead of import-time crashes when the extra is missing. Made-with: Cursor

Remove in-env root mutation and lock-based initialization, and require APPWORLD_ROOT to be set by the launcher/script before rollouts to avoid process-global root races. Made-with: Cursor

hamishivi · 2026-02-27T22:32:19Z

/gemini review

gemini-code-assist

Code Review

This pull request introduces an integration for the AppWorld environment, which is a significant addition. The changes are well-structured, including the new AppWorldEnv, registration with the tool registry, a dataset creation script, unit tests, and a debug launcher. The code is generally high-quality and robust. I have one comment regarding a potential AttributeError if the task object is not present on the AppWorld instance.

Apply Ruff formatting to appworld.py so `ruff format --check --diff open_instruct *mason.py` passes cleanly. Made-with: Cursor

Stop relying on reset observations for task instructions, require task context in dataset prompts, and remove the AppWorld environment test file per request. Made-with: Cursor

Append a per-actor UUID to AppWorld experiment names by default so parallel rollouts for the same task do not collide on shared output directories. Made-with: Cursor

Always isolate experiment output names and restore legacy AppWorld init knobs/defaults to match prior training configuration. Made-with: Cursor

Bootstrap AppWorld dependencies/data inside the job container and add a 4 learner / 4 inference AppWorld debug script to match the existing 8-GPU env launch pattern. Made-with: Cursor

Resolve missing appworld binary issues in container jobs by invoking install/download commands through uv-managed environment. Made-with: Cursor

Install AppWorld against the runtime Python using uv pip/uv run and avoid REPO_PATH unbound failures so setup fails fast instead of reporting false success. Made-with: Cursor

Use UV_GIT_LFS for git-based installs, pin CLI calls to APPWORLD_ROOT, and validate app/data completeness so setup cannot report success with an incomplete AppWorld installation. Made-with: Cursor

Run AppWorld bootstrap in a separate shell and enforce strict failure behavior so incomplete installs never proceed to rollout initialization. Made-with: Cursor

Pass --root only to subcommands that support it (download data) and keep install invocation compatible with python -m appworld.cli. Made-with: Cursor

Install git-lfs in the training image and fail fast during AppWorld bootstrap when git-lfs is unavailable so bundle pointer checkouts are surfaced clearly. Made-with: Cursor

Keep low-variance AppWorld rollouts in the 8-GPU launcher to avoid empty packed batches during early training. Made-with: Cursor

The model was blindly guessing API names (0% success rate across 10k rollouts). Add instructions about the built-in api_docs app so the model knows to look up available APIs before writing code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

AppWorld's execute() returns "Execution successful." for API calls unless the result is printed. The model was calling the right discovery APIs but couldn't see the returned data. Wrap all examples in print() and add a warning about this behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Catch tool parser exceptions and return a parser-error observation so malformed tool arguments do not crash request processing. Made-with: Cursor

hamishivi added 2 commits February 27, 2026 12:46

Add AppWorld environment integration for GRPO

d24d47a

Introduce a stateful AppWorld environment with tool registry wiring, plus dataset/debug scripts and tests so GRPO can train on AppWorld tasks and evaluate on a separate split. Made-with: Cursor

Add changelog entry for AppWorld environment PR

dadd589

Document the AppWorld GRPO environment integration in CHANGELOG with a link to PR #1501 per repository PR workflow requirements. Made-with: Cursor

gemini-code-assist Bot reviewed Feb 27, 2026

View reviewed changes

Comment thread open_instruct/environments/appworld.py Outdated

hamishivi added 2 commits February 27, 2026 12:56

Make AppWorld an optional extra with import guard

f488079

Add an `appworld` optional dependency extra and introduce an explicit availability guard so AppWorld integration fails with a clear install hint instead of import-time crashes when the extra is missing. Made-with: Cursor

Require caller-provided APPWORLD_ROOT for AppWorldEnv

cbdf632

Remove in-env root mutation and lock-based initialization, and require APPWORLD_ROOT to be set by the launcher/script before rollouts to avoid process-global root races. Made-with: Cursor

gemini-code-assist Bot reviewed Feb 27, 2026

View reviewed changes

Comment thread open_instruct/environments/appworld.py Outdated

hamishivi and others added 16 commits February 27, 2026 14:43

Format AppWorld environment module with Ruff

d806ee7

Apply Ruff formatting to appworld.py so `ruff format --check --diff open_instruct *mason.py` passes cleanly. Made-with: Cursor

Align AppWorldEnv with dataset-driven task context

430befa

Stop relying on reset observations for task instructions, require task context in dataset prompts, and remove the AppWorld environment test file per request. Made-with: Cursor

chmod

d30ec85

Use UUID suffix to isolate AppWorld experiment outputs

375b825

Append a per-actor UUID to AppWorld experiment names by default so parallel rollouts for the same task do not collide on shared output directories. Made-with: Cursor

Align AppWorldEnv defaults with legacy setup.

2e03659

Always isolate experiment output names and restore legacy AppWorld init knobs/defaults to match prior training configuration. Made-with: Cursor

Add AppWorld Beaker setup and 8-GPU launcher.

ca599a3

Bootstrap AppWorld dependencies/data inside the job container and add a 4 learner / 4 inference AppWorld debug script to match the existing 8-GPU env launch pattern. Made-with: Cursor

Use uv run for AppWorld CLI setup calls.

1d4c6be

Resolve missing appworld binary issues in container jobs by invoking install/download commands through uv-managed environment. Made-with: Cursor

Fix AppWorld Beaker setup outside project context.

03e4391

Install AppWorld against the runtime Python using uv pip/uv run and avoid REPO_PATH unbound failures so setup fails fast instead of reporting false success. Made-with: Cursor

Harden AppWorld bootstrap and enable UV_GIT_LFS.

e88ceb2

Use UV_GIT_LFS for git-based installs, pin CLI calls to APPWORLD_ROOT, and validate app/data completeness so setup cannot report success with an incomplete AppWorld installation. Made-with: Cursor

Fail fast AppWorld setup in Beaker launch.

8edf5f0

Run AppWorld bootstrap in a separate shell and enforce strict failure behavior so incomplete installs never proceed to rollout initialization. Made-with: Cursor

Fix AppWorld CLI root flag usage in setup.

b010d28

Pass --root only to subcommands that support it (download data) and keep install invocation compatible with python -m appworld.cli. Made-with: Cursor

Add git-lfs support for AppWorld git installs.

3b8b3e8

Install git-lfs in the training image and fail fast during AppWorld bootstrap when git-lfs is unavailable so bundle pointer checkouts are surfaced clearly. Made-with: Cursor

Disable zero-std sample filtering for AppWorld debug run.

28bb486

Keep low-variance AppWorld rollouts in the 8-GPU launcher to avoid empty packed batches during early training. Made-with: Cursor

Handle parser failures during rollout tool dispatch.

1f8e5c6

Catch tool parser exceptions and return a parser-error observation so malformed tool arguments do not crash request processing. Made-with: Cursor

hamishivi closed this Apr 8, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add AppWorld environment integration for GRPO#1501

Add AppWorld environment integration for GRPO#1501
hamishivi wants to merge 20 commits intomainfrom
feature/appworld-env-grpo

hamishivi commented Feb 27, 2026

Uh oh!

gemini-code-assist Bot commented Feb 27, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

hamishivi commented Feb 27, 2026

Uh oh!

gemini-code-assist Bot left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

hamishivi commented Feb 27, 2026

Summary

Test plan

Uh oh!

gemini-code-assist Bot commented Feb 27, 2026

Summary of Changes

Highlights

Footnotes

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

hamishivi commented Feb 27, 2026

Uh oh!

gemini-code-assist Bot left a comment

Choose a reason for hiding this comment

Code Review

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant