Add AppWorld environment integration for GRPO#1501

Closed
hamishivi wants to merge 20 commits into main from feature/appworld-env-grpo

Conversation

@hamishivi
Collaborator

Summary

  • add a new stateful AppWorldEnv (env_name=appworld) that exposes an appworld_execute tool, persists AppWorld shell state across turns, and converts task completion/evaluation into rollout reward
  • register the new environment in the shared env/tool registry and exports so GRPO can auto-discover and instantiate it from dataset env_config
  • add supporting artifacts: AppWorld train/eval dataset builder script, 1-GPU debug train+eval launcher, and unit tests covering registration/reset/step/reward/cleanup behavior
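
A minimal sketch of the stateful-tool pattern described above, not the code in open_instruct/environments/appworld.py. The names ToyStatefulEnv and StepResult, and the rule that setting `answer` ends the episode, are assumptions made for illustration; only the tool name appworld_execute comes from this PR.

```python
import contextlib
import io
from dataclasses import dataclass, field


@dataclass
class StepResult:
    observation: str
    reward: float = 0.0
    done: bool = False


@dataclass
class ToyStatefulEnv:
    """Hypothetical stand-in for a stateful code-execution environment."""

    expected_answer: object = None
    # One namespace is reused by every step, so variables persist across turns.
    namespace: dict = field(default_factory=dict)

    def reset(self, expected_answer: object) -> None:
        self.expected_answer = expected_answer
        self.namespace = {}

    def step(self, tool_name: str, code: str) -> StepResult:
        if tool_name != "appworld_execute":
            return StepResult(observation=f"Unknown tool: {tool_name}")
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, self.namespace)
        except Exception as exc:  # report the error back instead of crashing
            return StepResult(observation=f"{type(exc).__name__}: {exc}")
        # Assumed convention: the episode ends once the code sets `answer`,
        # and the reward is 1.0 only when it matches the expected value.
        if "answer" in self.namespace:
            correct = self.namespace["answer"] == self.expected_answer
            return StepResult(buffer.getvalue(), reward=float(correct), done=True)
        return StepResult(observation=buffer.getvalue())


env = ToyStatefulEnv()
env.reset(expected_answer=42)
first = env.step("appworld_execute", "x = 40")
final = env.step("appworld_execute", "answer = x + 2")  # `x` survives from turn 1
unknown = env.step("other_tool", "pass")
```

The second call only works because the namespace from the first call is kept, which is the behaviour the summary refers to as persisting shell state across turns.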

Test plan

  • uv run pytest tests/test_appworld_env.py tests/test_environments.py
  • uv run pytest open_instruct/environments/tools/tests/test_tools.py -k registry
  • uv run ruff check open_instruct/environments/appworld.py open_instruct/environments/tools/tools.py open_instruct/environments/__init__.py tests/test_appworld_env.py scripts/data/create_appworld_env_datasets.py

Made with Cursor

Introduce a stateful AppWorld environment with tool registry wiring, plus dataset/debug scripts and tests so GRPO can train on AppWorld tasks and evaluate on a separate split.

Made-with: Cursor
Document the AppWorld GRPO environment integration in CHANGELOG with a link to PR #1501 per repository PR workflow requirements.

Made-with: Cursor
@gemini-code-assist
Contributor

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request integrates the AppWorld environment into the existing reinforcement learning framework. The primary goal is to let GRPO (Group Relative Policy Optimization) training run on AppWorld tasks, so that agents learn by executing Python code within a simulated application environment. The integration covers task definition, execution-state management, and evaluation of agent performance, adding to the environments available for training and evaluating code-generating agents.

Highlights

  • New AppWorld Environment: Introduced a stateful AppWorldEnv that exposes an appworld_execute tool, maintains AppWorld shell state across turns, and converts task completion/evaluation into rollout reward for GRPO.
  • Environment Registration: Registered the new AppWorldEnv and its configuration (AppWorldEnvConfig) within the shared environment and tool registry, enabling GRPO to automatically discover and instantiate it from dataset env_config.
  • Supporting Artifacts: Added a script for building AppWorld train/eval datasets, a 1-GPU debug training and evaluation launcher, and comprehensive unit tests covering registration, reset, step, reward, and cleanup behavior of the AppWorldEnv.

Changelog
  • open_instruct/environments/__init__.py
    • Imported AppWorldEnv and AppWorldEnvConfig.
    • Added AppWorldEnv and AppWorldEnvConfig to the module's __all__ export list.
  • open_instruct/environments/appworld.py
    • Added a new file defining the AppWorldEnv class, which integrates AppWorld tasks as an RL environment.
    • Implemented the appworld_execute tool for running Python code within the AppWorld shell.
    • Included logic for lazy loading of AppWorld symbols, environment initialization, task resetting, and step execution.
    • Developed methods for evaluating task completion and extracting scores to determine rewards.
    • Defined AppWorldEnvConfig for configuring the AppWorld environment parameters.
  • open_instruct/environments/tools/tools.py
    • Imported AppWorldEnvConfig.
    • Added AppWorldEnvConfig to the TOOL_REGISTRY for auto-discovery.
  • scripts/data/create_appworld_env_datasets.py
    • Added a new script to generate training and evaluation datasets for the AppWorld RL environment.
    • Included functionality to load AppWorld tasks, format supervisor information, and build user prompts.
    • Provided options to save datasets locally as JSONL or upload them to HuggingFace Hub.
  • scripts/train/debug/envs/appworld_1gpu.sh
    • Added a new debug shell script for launching GRPO training with AppWorldEnv on a single GPU.
    • Configured environment variables and GRPO arguments specific to AppWorld training and evaluation.
  • tests/test_appworld_env.py
    • Added a new file containing unit tests for the AppWorldEnv.
    • Implemented mock classes for AppWorld components to facilitate isolated testing.
    • Verified environment registration, reset behavior, step execution, reward calculation, unknown tool handling, and resource cleanup.
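
A rough sketch of how registry-based auto-discovery from a dataset env_config could work. ENV_REGISTRY, register_env, ToyAppWorldConfig and its fields are hypothetical; the actual TOOL_REGISTRY and AppWorldEnvConfig in this PR may be structured differently.

```python
from dataclasses import dataclass

# Hypothetical registry: maps an env_name to the config class that builds it.
ENV_REGISTRY: dict[str, type] = {}


def register_env(name: str):
    """Class decorator that records a config class under the given name."""

    def decorator(config_cls: type) -> type:
        ENV_REGISTRY[name] = config_cls
        return config_cls

    return decorator


@register_env("appworld")
@dataclass
class ToyAppWorldConfig:
    task_id: str
    max_steps: int = 10  # illustrative default, not taken from the PR


def build_from_env_config(env_config: dict):
    """Look up the config class by env_name and build it from a dataset row."""
    params = dict(env_config)  # copy so the dataset row is not mutated
    name = params.pop("env_name")
    if name not in ENV_REGISTRY:
        raise KeyError(f"Unknown environment: {name!r}")
    return ENV_REGISTRY[name](**params)


config = build_from_env_config({"env_name": "appworld", "task_id": "demo_task"})
```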
Activity
  • The pull request introduces a new AppWorld environment integration for GRPO, along with necessary supporting scripts and tests.
  • No human activity (comments, reviews) has been recorded yet for this pull request.

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces a new stateful environment, AppWorldEnv, for GRPO, along with necessary supporting components like a dataset creation script, a debug launcher, and unit tests. The implementation is generally of high quality and well-structured. The new environment is correctly integrated into the existing framework. The unit tests are thorough and make good use of mocking to isolate the component under test. I have one major concern regarding a potential race condition when configuring the appworld_root via environment variables in a concurrent setting, which could lead to unpredictable behavior.

Comment thread open_instruct/environments/appworld.py Outdated
Add an `appworld` optional dependency extra and introduce an explicit availability guard so AppWorld integration fails with a clear install hint instead of import-time crashes when the extra is missing.

Made-with: Cursor
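
One way an availability guard of this kind could look, as a sketch under stated assumptions. The helper names is_module_available and require_module are hypothetical and the error wording is illustrative; the guard in this PR may differ.

```python
import importlib.util


def is_module_available(module_name: str) -> bool:
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_module(module_name: str, extra: str) -> None:
    """Raise a clear error with an install hint when an optional dependency is missing."""
    if not is_module_available(module_name):
        raise ImportError(
            f"'{module_name}' is not installed. "
            f"Install the optional '{extra}' extra to use this environment."
        )
```

Calling require_module("appworld", "appworld") at environment construction time, rather than importing at module top level, keeps the rest of the package importable when the extra is absent.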
Remove in-env root mutation and lock-based initialization, and require APPWORLD_ROOT to be set by the launcher/script before rollouts to avoid process-global root races.

Made-with: Cursor
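
A sketch of the read-only pattern this commit describes: the environment only reads APPWORLD_ROOT and fails when it is missing, instead of writing to the process-global environment. The function name resolve_appworld_root and the error text are assumptions.

```python
import os
from pathlib import Path


def resolve_appworld_root(environ=None) -> Path:
    """Read APPWORLD_ROOT without mutating the process environment."""
    environ = os.environ if environ is None else environ
    root = environ.get("APPWORLD_ROOT")
    if not root:
        raise RuntimeError(
            "APPWORLD_ROOT is not set; export it in the launcher before starting rollouts."
        )
    return Path(root)
```

Because nothing is written to os.environ, concurrent environment instances cannot overwrite each other's root, so no lock is needed.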
@hamishivi
Collaborator Author

/gemini review

Contributor

@gemini-code-assist gemini-code-assist Bot left a comment

Code Review

This pull request introduces an integration for the AppWorld environment, which is a significant addition. The changes are well-structured, including the new AppWorldEnv, registration with the tool registry, a dataset creation script, unit tests, and a debug launcher. The code is generally high-quality and robust. I have one comment regarding a potential AttributeError if the task object is not present on the AppWorld instance.

Comment thread open_instruct/environments/appworld.py Outdated
hamishivi and others added 16 commits February 27, 2026 14:43
Apply Ruff formatting to appworld.py so `ruff format --check --diff open_instruct *mason.py` passes cleanly.

Made-with: Cursor
Stop relying on reset observations for task instructions, require task context in dataset prompts, and remove the AppWorld environment test file per request.

Made-with: Cursor
Append a per-actor UUID to AppWorld experiment names by default so parallel rollouts for the same task do not collide on shared output directories.

Made-with: Cursor
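
The naming scheme can be sketched as follows. The helper isolated_experiment_name and the "base-suffix" format are assumptions; the PR only states that a per-actor UUID is appended.

```python
import uuid


def isolated_experiment_name(base_name: str, suffix=None) -> str:
    """Append a unique suffix so parallel rollouts get distinct output directories."""
    suffix = uuid.uuid4().hex if suffix is None else suffix
    return f"{base_name}-{suffix}"


# Two actors working on the same task end up with different experiment names.
name_a = isolated_experiment_name("task_demo")
name_b = isolated_experiment_name("task_demo")
```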
Always isolate experiment output names and restore legacy AppWorld init knobs/defaults to match prior training configuration.

Made-with: Cursor
Bootstrap AppWorld dependencies/data inside the job container and add a 4 learner / 4 inference AppWorld debug script to match the existing 8-GPU env launch pattern.

Made-with: Cursor
Resolve missing appworld binary issues in container jobs by invoking install/download commands through uv-managed environment.

Made-with: Cursor
Install AppWorld against the runtime Python using uv pip/uv run and avoid REPO_PATH unbound failures so setup fails fast instead of reporting false success.

Made-with: Cursor
Use UV_GIT_LFS for git-based installs, pin CLI calls to APPWORLD_ROOT, and validate app/data completeness so setup cannot report success with an incomplete AppWorld installation.

Made-with: Cursor
Run AppWorld bootstrap in a separate shell and enforce strict failure behavior so incomplete installs never proceed to rollout initialization.

Made-with: Cursor
Pass --root only to subcommands that support it (download data) and keep install invocation compatible with python -m appworld.cli.

Made-with: Cursor
Install git-lfs in the training image and fail fast during AppWorld bootstrap when git-lfs is unavailable so bundle pointer checkouts are surfaced clearly.

Made-with: Cursor
Keep low-variance AppWorld rollouts in the 8-GPU launcher to avoid empty packed batches during early training.

Made-with: Cursor
The model was blindly guessing API names (0% success rate across 10k
rollouts). Add instructions about the built-in api_docs app so the
model knows to look up available APIs before writing code.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AppWorld's execute() returns "Execution successful." for API calls
unless the result is printed. The model was calling the right
discovery APIs but couldn't see the returned data. Wrap all examples
in print() and add a warning about this behavior.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
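
The behaviour described in this commit message can be reproduced with a toy executor. This is a stand-in written for illustration, not AppWorld's execute(); toy_execute and list_apis are hypothetical names.

```python
import contextlib
import io


def toy_execute(code: str, namespace: dict) -> str:
    """Return captured stdout, or a generic success message when nothing was printed."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, namespace)
    output = buffer.getvalue()
    return output if output else "Execution successful."


# Hypothetical discovery call that returns data but does not print it.
namespace = {"list_apis": lambda: ["app_a", "app_b"]}

silent = toy_execute("list_apis()", namespace)          # return value is discarded
visible = toy_execute("print(list_apis())", namespace)  # return value is shown
```

In the first call the model would see only the success message, which is why the prompt examples were wrapped in print().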
Catch tool parser exceptions and return a parser-error observation so malformed tool arguments do not crash request processing.

Made-with: Cursor
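
A sketch of the error-handling pattern, assuming tool arguments arrive as a JSON string. parse_tool_arguments, safe_tool_call and the "Parser error:" prefix are hypothetical; the parser in this repository may use a different format and message.

```python
import json


def parse_tool_arguments(raw_arguments: str) -> dict:
    """Parse a JSON object of tool arguments; raise ValueError on malformed input."""
    arguments = json.loads(raw_arguments)
    if not isinstance(arguments, dict):
        raise ValueError("tool arguments must be a JSON object")
    return arguments


def safe_tool_call(raw_arguments: str, tool) -> str:
    """Run the tool, or return a parser-error observation instead of raising."""
    try:
        arguments = parse_tool_arguments(raw_arguments)
    except ValueError as exc:  # json.JSONDecodeError is a subclass of ValueError
        return f"Parser error: {exc}"
    return tool(**arguments)


ok = safe_tool_call('{"code": "x"}', lambda code: code.upper())
bad = safe_tool_call('{"code": ', lambda code: code.upper())
```

Returning the error as an observation lets the model see what went wrong and retry, while the request loop keeps running.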
@hamishivi hamishivi closed this Apr 8, 2026