Add AppWorld environment integration for GRPO #1501
Introduce a stateful AppWorld environment with tool registry wiring, plus dataset/debug scripts and tests so GRPO can train on AppWorld tasks and evaluate on a separate split. Made-with: Cursor
Document the AppWorld GRPO environment integration in CHANGELOG with a link to PR #1501 per repository PR workflow requirements. Made-with: Cursor
Summary of Changes
Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed.
This pull request integrates the AppWorld environment into the existing reinforcement learning framework. The primary goal is to let GRPO (Group Relative Policy Optimization) train on AppWorld tasks, so that agents learn by executing Python code within a simulated application environment. The integration provides a mechanism for defining tasks, managing execution state, and evaluating agent performance, which expands the range of environments available for training and evaluating code-generating agents.
Code Review
This pull request introduces a new stateful environment, AppWorldEnv, for GRPO, along with necessary supporting components like a dataset creation script, a debug launcher, and unit tests. The implementation is generally of high quality and well-structured. The new environment is correctly integrated into the existing framework. The unit tests are thorough and make good use of mocking to isolate the component under test. I have one major concern regarding a potential race condition when configuring the appworld_root via environment variables in a concurrent setting, which could lead to unpredictable behavior.
Add an `appworld` optional dependency extra and introduce an explicit availability guard so AppWorld integration fails with a clear install hint instead of import-time crashes when the extra is missing. Made-with: Cursor
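A minimal sketch of what such an availability guard could look like. The helper name `require_optional` and the wording of the install hint are assumptions for illustration, not the code in this PR; only `importlib.util.find_spec` is standard library behavior.

```python
import importlib.util


def require_optional(module_name: str, extra: str) -> None:
    """Raise a clear ImportError if an optional top-level module is missing.

    Hypothetical helper: the PR's actual guard may be named and worded differently.
    """
    # find_spec returns None when a top-level module cannot be located,
    # without importing it (so no import-time crash).
    if importlib.util.find_spec(module_name) is None:
        raise ImportError(
            f"'{module_name}' is not installed. "
            f"Install the optional extra first, e.g. `uv sync --extra {extra}`."
        )
```

Calling `require_optional("appworld", "appworld")` at environment construction time, instead of importing at module import time, keeps the rest of the package importable when the extra is absent.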
Remove in-env root mutation and lock-based initialization, and require APPWORLD_ROOT to be set by the launcher/script before rollouts to avoid process-global root races. Made-with: Cursor
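One way to express "the launcher sets it, the environment only reads it" is sketched below. The function name `resolve_appworld_root` and the error text are assumptions; the point is that nothing inside the environment writes to process-global state.

```python
import os
from pathlib import Path


def resolve_appworld_root(environ=os.environ) -> Path:
    """Read APPWORLD_ROOT without mutating process-global state.

    Hypothetical helper; `environ` is injectable so the check can be tested.
    """
    root = environ.get("APPWORLD_ROOT")
    if not root:
        # Fail early instead of silently picking a default per actor.
        raise RuntimeError(
            "APPWORLD_ROOT must be set by the launcher before rollouts start."
        )
    return Path(root)
```

Because the function only reads the variable, concurrent actors in the same process cannot race on its value.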
/gemini review
Code Review
This pull request introduces an integration for the AppWorld environment, which is a significant addition. The changes are well-structured, including the new AppWorldEnv, registration with the tool registry, a dataset creation script, unit tests, and a debug launcher. The code is generally high-quality and robust. I have one comment regarding a potential AttributeError if the task object is not present on the AppWorld instance.
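The kind of defensive access the review asks for might look like the sketch below. The attribute name `instruction` and the helper `task_instruction` are hypothetical; the review only states that `task` may be absent on the AppWorld instance.

```python
from types import SimpleNamespace


def task_instruction(world, default=None):
    """Return world.task.instruction, or `default` if either attribute is missing.

    `instruction` is an assumed attribute name used only for illustration.
    """
    task = getattr(world, "task", None)
    # getattr on None with a default also returns the default, so no AttributeError.
    return getattr(task, "instruction", default)


# Usage with stand-in objects (not real AppWorld instances):
with_task = SimpleNamespace(task=SimpleNamespace(instruction="Pay the bill"))
without_task = SimpleNamespace()
```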
Apply Ruff formatting to appworld.py so `ruff format --check --diff open_instruct *mason.py` passes cleanly. Made-with: Cursor
Stop relying on reset observations for task instructions, require task context in dataset prompts, and remove the AppWorld environment test file per request. Made-with: Cursor
Append a per-actor UUID to AppWorld experiment names by default so parallel rollouts for the same task do not collide on shared output directories. Made-with: Cursor
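A minimal sketch of the naming scheme, assuming a hyphen separator and a hypothetical helper name; the PR's actual format may differ.

```python
import uuid


def isolated_experiment_name(base_name: str) -> str:
    """Append a random UUID so parallel rollouts never share an output directory."""
    return f"{base_name}-{uuid.uuid4().hex}"
```

Two actors working on the same task then write to distinct directories, for example `task-<hex1>` and `task-<hex2>`.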
Always isolate experiment output names and restore legacy AppWorld init knobs/defaults to match prior training configuration. Made-with: Cursor
Bootstrap AppWorld dependencies/data inside the job container and add a 4 learner / 4 inference AppWorld debug script to match the existing 8-GPU env launch pattern. Made-with: Cursor
Resolve missing appworld binary issues in container jobs by invoking install/download commands through the uv-managed environment. Made-with: Cursor
Install AppWorld against the runtime Python using uv pip/uv run and avoid REPO_PATH unbound failures so setup fails fast instead of reporting false success. Made-with: Cursor
Use UV_GIT_LFS for git-based installs, pin CLI calls to APPWORLD_ROOT, and validate app/data completeness so setup cannot report success with an incomplete AppWorld installation. Made-with: Cursor
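A completeness check of this kind can be sketched as below. The helper name and the idea of passing a list of required sub-directories are assumptions; the actual directory layout that the bootstrap validates is not shown in this conversation.

```python
from pathlib import Path


def missing_paths(root: Path, required: list[str]) -> list[str]:
    """Return the required sub-directories that are absent or empty under root.

    Hypothetical helper: callers supply the directory names to check.
    """
    missing = []
    for relative in required:
        candidate = root / relative
        # An existing but empty directory still counts as an incomplete install.
        if not candidate.is_dir() or not any(candidate.iterdir()):
            missing.append(relative)
    return missing
```

Setup would then report success only when the returned list is empty.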
Run AppWorld bootstrap in a separate shell and enforce strict failure behavior so incomplete installs never proceed to rollout initialization. Made-with: Cursor
Pass --root only to subcommands that support it (download data) and keep install invocation compatible with python -m appworld.cli. Made-with: Cursor
Install git-lfs in the training image and fail fast during AppWorld bootstrap when git-lfs is unavailable so bundle pointer checkouts are surfaced clearly. Made-with: Cursor
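The fail-fast check itself is small. The sketch below is a Python rendering with a hypothetical helper name; the PR's bootstrap may perform the equivalent check in shell.

```python
import shutil


def require_executable(name: str, which=shutil.which) -> str:
    """Return the path of `name` on PATH, or raise before any install step runs.

    `which` is injectable for testing; by default it is shutil.which.
    """
    path = which(name)
    if path is None:
        raise RuntimeError(
            f"{name} not found on PATH; git-based bundles would check out "
            "as LFS pointer files instead of real data."
        )
    return path
```

Calling `require_executable("git-lfs")` at the top of the bootstrap surfaces the missing tool immediately, rather than later as a confusing data error.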
Keep low-variance AppWorld rollouts in the 8-GPU launcher to avoid empty packed batches during early training. Made-with: Cursor
The model was blindly guessing API names (0% success rate across 10k rollouts). Add instructions about the built-in api_docs app so the model knows to look up available APIs before writing code. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
AppWorld's execute() returns "Execution successful." for API calls unless the result is printed. The model was calling the right discovery APIs but couldn't see the returned data. Wrap all examples in print() and add a warning about this behavior. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
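The behavior described above can be illustrated with a toy stand-in. This is not AppWorld's implementation; `toy_execute` is a hypothetical function that only mimics "return captured stdout, otherwise a fixed success message".

```python
import contextlib
import io


def toy_execute(code: str) -> str:
    """Toy stand-in: return what the code printed, or a fixed message if nothing."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})  # illustration only; never exec untrusted code like this
    output = buffer.getvalue()
    return output if output else "Execution successful."
```

With this stand-in, `toy_execute("1 + 1")` yields only the success message, while `toy_execute("print(1 + 1)")` returns the value as text. That is why every example in the prompt is wrapped in `print()`.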
Catch tool parser exceptions and return a parser-error observation so malformed tool arguments do not crash request processing. Made-with: Cursor
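A sketch of the pattern, assuming tool arguments arrive as a JSON string. The function name, the return shape, and the observation wording are hypothetical and not taken from the PR.

```python
import json


def parse_tool_arguments(raw: str):
    """Return (arguments, None) on success or (None, observation) on a parse failure."""
    try:
        arguments = json.loads(raw)
    except json.JSONDecodeError as exc:
        # Turn the exception into text the model can read on its next turn.
        return None, f"Parser error: {exc}"
    if not isinstance(arguments, dict):
        return None, "Parser error: tool arguments must be a JSON object."
    return arguments, None
```

The caller can feed the observation back to the model as a normal tool result, so one malformed call costs a turn instead of crashing request processing.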
Summary
- AppWorldEnv (env_name=appworld) that exposes an appworld_execute tool, persists AppWorld shell state across turns, and converts task completion/evaluation into rollout reward
- env_config
Test plan
- uv run pytest tests/test_appworld_env.py tests/test_environments.py
- uv run pytest open_instruct/environments/tools/tests/test_tools.py -k registry
- uv run ruff check open_instruct/environments/appworld.py open_instruct/environments/tools/tools.py open_instruct/environments/__init__.py tests/test_appworld_env.py scripts/data/create_appworld_env_datasets.py
Made with Cursor