WIP: Add PPO with value model support to grpo_fast.py #1462
Conversation
Summary of Changes
Hello @hamishivi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the reinforcement learning capabilities by integrating a PPO-style training approach that leverages a learned value model for more sophisticated advantage estimation. This allows for greater flexibility in model architecture and training dynamics, moving beyond the traditional GRPO method. Additionally, a new evaluation utility has been introduced to streamline the assessment of model performance on diverse tasks.
Activity
This adds the ability to do PPO-style training with a learned value model for advantage estimation (GAE) instead of group-normalized rewards (GRPO).

Key changes:
- Add PPO configuration options (use_value_model, value_loss_coef, vf_clip_range, gamma, gae_lambda)
- Add value model initialization (supports separate model or shared backbone with value head)
- Compute GAE advantages using value model predictions
- Add value loss to training objective with optional value function clipping
- Support checkpointing for value model/head and optimizer state
- Pass raw rewards through data loader when using value model

Usage: --use_value_model True --value_loss_coef 0.5 --vf_clip_range 0.2

Co-authored-by: Cursor <cursoragent@cursor.com>
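For readers unfamiliar with GAE, here is a minimal sketch of the advantage computation this description refers to; the function name and tensor layout are illustrative rather than the PR's actual implementation, and it assumes per-token rewards, value predictions, and response masks that are already aligned.

```python
import torch

def compute_gae(rewards, values, response_mask, gamma=1.0, gae_lambda=0.95):
    """Generalized Advantage Estimation over (batch, seq_len) rollouts.

    rewards, values, response_mask: float tensors of shape (B, T).
    The bootstrap value after the final token is taken to be 0.
    """
    B, T = rewards.shape
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros(B, device=rewards.device)
    for t in reversed(range(T)):
        next_value = values[:, t + 1] if t + 1 < T else torch.zeros(B, device=rewards.device)
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[:, t] + gamma * next_value - values[:, t]
        last_gae = delta + gamma * gae_lambda * last_gae
        advantages[:, t] = last_gae
    advantages = advantages * response_mask
    returns = advantages + values  # targets for the value loss
    return advantages, returns
```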
Code Review
This pull request introduces support for PPO-style training with a learned value model, which is a significant enhancement. The implementation includes options for both a shared backbone and a separate value model, along with GAE computation and checkpointing. The changes are extensive and touch data loading, data types, and the core training loop. My review has identified a critical issue where the separate value model is not being trained, which needs to be addressed. I've also pointed out a few medium-severity issues related to DeepSpeed initialization, a potential bug in the value loss calculation, and a code smell in attention mask handling. Overall, the direction is great, but these key issues should be resolved before merging.
    if not self.args.separate_value_model:
        # Value head is separate, need manual backward
        value_loss.backward()
    # For separate_value_model, the value model isn't trained in this loop
The current implementation does not train the separate value model (when separate_value_model=True). The value loss is calculated, but no backward pass or optimizer step is performed for the value model, as indicated by the comment on line 1069. This defeats the purpose of a learned value function in PPO. The value model should have its own optimizer, and its weights should be updated based on the value loss.
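A minimal sketch of one way to address this, assuming the trainer holds the separate value model behind a dedicated optimizer; the names (self.value_optimizer, value_loss) mirror the snippet above and are assumptions about the surrounding code, not the PR's actual fix.

```python
if not self.args.separate_value_model:
    # Value head shares the policy backbone; manual backward as before.
    value_loss.backward()
else:
    # Separate value model: update it with its own optimizer so the
    # value loss actually trains the critic.
    self.value_optimizer.zero_grad()
    value_loss.backward()
    self.value_optimizer.step()
```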
    if args.use_value_model:
        self._init_value_model(args, model_config, ds_config if args.load_ref_policy else None)
The _init_value_model function is called with a ds_config that depends on args.load_ref_policy. If load_ref_policy is false, ds_config will be None, and the value model (if separate) will not be wrapped by DeepSpeed. This is inconsistent with the policy model and can lead to issues in a distributed environment. The value model should likely be initialized with the same DeepSpeed configuration as the main policy model.
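A possible shape of the fix, sketched under the assumption that _init_value_model wraps the model with deepspeed.initialize; the call shown in the comment is illustrative and not taken from the PR.

```python
# Pass the policy's DeepSpeed config unconditionally so the separate value
# model is sharded/wrapped the same way as the policy model.
if args.use_value_model:
    self._init_value_model(args, model_config, ds_config)

# Inside _init_value_model (illustrative, assumed wiring):
# self.value_model, self.value_optimizer, _, _ = deepspeed.initialize(
#     model=value_model,
#     model_parameters=value_model.parameters(),
#     config=ds_config,
# )
```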
    # Use the separate value model
    output = self.value_model(
        input_ids=input_ids,
        attention_mask=attention_mask.clamp(0, 1),
The attention_mask is being clamped with .clamp(0, 1) here and on line 552. This suggests that the attention mask may contain values other than 0 and 1, which is unusual. While this clamp might prevent errors, it's better to investigate the root cause of why the attention mask has unexpected values and fix it at the source. This could be indicative of a bug in how attention masks are created or processed earlier in the pipeline.
    loss_stats_B["value_loss"][i] = masked_mean(value_loss_BT[:, 1:], response_mask_BT)
    loss_stats_B["vf_clipfrac"][i] = masked_mean(vf_clipfrac_BT[:, 1:], response_mask_BT)
The value loss (value_loss_BT) and value clipping fraction (vf_clipfrac_BT) are sliced with [:, 1:] before being passed to masked_mean (here and on lines 1060-1061). This effectively ignores the value loss for the first token of each sequence in the batch. Unlike policy log-probabilities which are naturally shifted, value estimates and returns are typically computed for every token in the response. Dropping the first token's value loss seems incorrect and may have been copied from the policy loss calculation by mistake. Please verify if this is the intended behavior.
- Add _gather_for_gae() method to gather values, rewards, dones, and response masks across SP ranks before GAE computation
- Add _extract_sp_chunk() method to extract this rank's chunk of advantages and returns after GAE computation
- Modify GAE computation in step() to use gather/extract when SP enabled
- Update forward_value() docstring to document SP handling

GAE requires temporal differences (value[t+1] - value[t]) across the entire sequence. With sequence parallelism, each rank only has a chunk of the sequence, so we gather full sequences, compute GAE, then split the results back to each rank.

Co-authored-by: Cursor <cursoragent@cursor.com>
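Sketched below is the gather/compute/split pattern this commit describes, assuming a torch.distributed process group over the SP ranks; the helper names and signatures are simplified stand-ins for _gather_for_gae() / _extract_sp_chunk().

```python
import torch
import torch.distributed as dist

def gather_for_gae(values_local: torch.Tensor, sp_group) -> torch.Tensor:
    """Each SP rank holds a (batch, seq_len / sp_size) chunk; GAE needs
    value[t+1] - value[t] over the full sequence, so chunks are all-gathered
    and concatenated along the sequence dimension before computing GAE."""
    sp_size = dist.get_world_size(group=sp_group)
    chunks = [torch.empty_like(values_local) for _ in range(sp_size)]
    dist.all_gather(chunks, values_local, group=sp_group)
    return torch.cat(chunks, dim=1)  # full-sequence tensor on every rank

def extract_sp_chunk(full: torch.Tensor, sp_group) -> torch.Tensor:
    """After GAE, return only this rank's sequence chunk of advantages/returns."""
    sp_size = dist.get_world_size(group=sp_group)
    rank = dist.get_rank(group=sp_group)
    return full.chunk(sp_size, dim=1)[rank]
```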
Two test scripts for the PPO value model feature:
- ppo_value_model_8gpu.sh: with sequence_parallel_size=2 to test SP + value model
- ppo_value_model_8gpu_no_sp.sh: without SP for simpler testing

Both use Qwen2.5-1.5B on GSM8K with verifiable rewards.

Co-authored-by: Cursor <cursoragent@cursor.com>
Implements key techniques from the VAPO paper (arXiv:2504.05118):

1. Decoupled GAE (--decoupled_gae):
   - Critic uses lambda=1.0 for unbiased Monte Carlo returns
   - Policy uses smaller lambda for faster convergence
   - Helps with value model learning in long-CoT tasks

2. Length-Adaptive GAE (--length_adaptive_gae):
   - Dynamically adjusts lambda based on sequence length
   - Formula: lambda = 1 - 1/(alpha * length) (see the sketch below)
   - Balances bias-variance tradeoff for varying sequence lengths

3. Positive Example LM Loss (--positive_example_lm_loss):
   - Adds NLL loss on correct/positive examples (self-imitation learning)
   - Improves sample efficiency when positive rewards are sparse
   - Configurable coefficient via --positive_example_lm_loss_coef

New config options in ExperimentConfig:
- decoupled_gae: bool (default False)
- length_adaptive_gae: bool (default False)
- length_adaptive_gae_alpha: float (default 0.05)
- positive_example_lm_loss: bool (default False)
- positive_example_lm_loss_coef: float (default 0.1)

Also adds new GAE functions in rl_utils.py:
- calculate_length_adaptive_lambda()
- calculate_advantages_packed_vapo()

Co-authored-by: Cursor <cursoragent@cursor.com>
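A minimal sketch of the length-adaptive lambda formula above; the function name matches the one listed for rl_utils.py, but the exact signature and the clamping behavior here are assumptions.

```python
import torch

def calculate_length_adaptive_lambda(response_lengths: torch.Tensor, alpha: float = 0.05) -> torch.Tensor:
    """Length-adaptive GAE lambda (VAPO, arXiv:2504.05118):
    lambda = 1 - 1 / (alpha * length).
    Longer responses get lambda closer to 1 (less bias); shorter ones get a
    smaller lambda (less variance). Clamping to [0, 1) is an assumption.
    """
    lam = 1.0 - 1.0 / (alpha * response_lengths.clamp(min=1).float())
    return lam.clamp(min=0.0)

# With alpha=0.05: lengths 20, 200, 2000 -> lambda ~= 0.0, 0.9, 0.99
print(calculate_length_adaptive_lambda(torch.tensor([20, 200, 2000])))
```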
Implements VAPO's value pretraining technique where the value model is trained for N steps while keeping the policy frozen. This helps reduce value model initialization bias before starting policy training.

New config option:
- value_warmup_steps: Number of steps to train value model only (default 0)

During warmup (training_step <= value_warmup_steps):
- Policy backward/step is skipped (policy weights frozen)
- Value model backward/step still runs (value head is trained)
- Logged via vapo/value_warmup metric (1.0 during warmup, 0.0 after)

Reference: https://arxiv.org/abs/2504.05118 Section 4.1

Co-authored-by: Cursor <cursoragent@cursor.com>
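As a rough illustration of the gating described above (the argument and metric names follow the commit message; everything else is an assumption rather than the PR's code):

```python
def warmup_gated_step(training_step, value_warmup_steps,
                      policy_loss, value_loss,
                      policy_optimizer, value_optimizer, metrics):
    """Skip policy updates while the value model warms up (VAPO Sec. 4.1)."""
    in_warmup = training_step <= value_warmup_steps
    if not in_warmup:
        policy_loss.backward()
        policy_optimizer.step()
        policy_optimizer.zero_grad()
    # The value model/head is trained every step, including during warmup.
    value_loss.backward()
    value_optimizer.step()
    value_optimizer.zero_grad()
    metrics["vapo/value_warmup"] = 1.0 if in_warmup else 0.0
```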
- Use same data, batch size, learning rate, and other hyperparameters
- 4 nodes x 8 GPUs with SP=4 for the full-scale 8gpu script
- Same setup but SP=1 for the no_sp variant
- Scaled-down 1-GPU version for quick testing

Co-authored-by: Cursor <cursoragent@cursor.com>
…hould not be a CLI arg Co-authored-by: Cursor <cursoragent@cursor.com>
… dataset Co-authored-by: Cursor <cursoragent@cursor.com>
… policy Co-authored-by: Cursor <cursoragent@cursor.com>
Three options via --gt_conditioning_template (see the sketch after this commit message):
- 'answer_prefix' (default): "Answer: 42\n" (plain text, minimal)
- 'boxed_answer': "The correct answer is \boxed{42}.\n" (matches the expected output format the model is trained to produce)
- 'system_hint': "<|im_start|>system\nThe ground truth answer to this problem is: 42<|im_end|>\n" (chat-formatted, uses tokens the model already understands)
Made-with: Cursor
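A minimal sketch of how these three prefixes might be built; the helper name is hypothetical, while the template strings are taken verbatim from the commit message above.

```python
def build_gt_conditioning_prefix(ground_truth: str, template: str = "answer_prefix") -> str:
    """Prefix prepended before value-model scoring, keyed by --gt_conditioning_template."""
    if template == "answer_prefix":
        return f"Answer: {ground_truth}\n"
    if template == "boxed_answer":
        return f"The correct answer is \\boxed{{{ground_truth}}}.\n"
    if template == "system_hint":
        return (
            "<|im_start|>system\n"
            f"The ground truth answer to this problem is: {ground_truth}<|im_end|>\n"
        )
    raise ValueError(f"Unknown gt_conditioning_template: {template}")

print(build_gt_conditioning_prefix("42", "boxed_answer"))
```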
…en3-4B-Base Made-with: Cursor
…in GT scripts Made-with: Cursor
Previously the value model was only saved when checkpoint_state_dir was set (DeepSpeed state checkpoints). This adds value model saving to the regular save_model() path so it's included in HF-format checkpoints at each save_freq interval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New wandb metrics: - value/correct_by_position, value/incorrect_by_position: per-token value curves at absolute positions, split by rollout correctness - value/correct_by_pct, value/incorrect_by_pct: same but normalized to percentile bins (20 bins) for cross-length comparison - value/advantage_correct_mean/std, value/advantage_incorrect_mean/std: advantage statistics split by correctness - value/advantage_correct_by_pct, value/advantage_incorrect_by_pct: advantage curves by normalized position Also saves value model alongside policy in HF checkpoints, and adds fetch-beaker-evals skill for querying AIME scores from Beaker. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
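For the *_by_pct metrics, a small sketch of the percentile binning this plausibly involves (20 bins over the normalized response position); the function is illustrative, not the PR's implementation. Empty bins come out as NaN, which lines up with the NaN filtering mentioned a few commits below.

```python
import numpy as np

def value_curve_by_pct(token_values: np.ndarray, num_bins: int = 20) -> np.ndarray:
    """Average per-token values into position-percentile bins so curves from
    responses of different lengths can be compared on one axis."""
    length = len(token_values)
    bins = np.minimum((np.arange(length) / length * num_bins).astype(int), num_bins - 1)
    return np.array([
        token_values[bins == b].mean() if np.any(bins == b) else np.nan
        for b in range(num_bins)
    ])

print(value_curve_by_pct(np.linspace(0.0, 1.0, 7)))  # short response -> some NaN bins
```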
New script: qwen3_4b_base_ppo_8b_value.sh
- Qwen3-4B-Base policy with Qwen3-8B-Base value model
- 1000 value warmup steps (10x longer than default)
- Tests whether a larger, better-pretrained value model can provide useful credit assignment for PPO training

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Filter NaN values from array metrics before creating wandb.Histogram
- qwen3_4b_base_ppo_16vb.sh: 16 value mini-batches, value LR 5e-6
- qwen3_4b_base_ppo_lam1.sh: same + lambda=1.0, no decoupled/adaptive GAE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test lambda=1.0 without decoupled/adaptive GAE on its own. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests whether GT value conditioning helps when bootstrapping is removed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests whether the decoupled critic (lambda=1.0 for value targets) is the source of instability, while keeping length-adaptive policy lambda. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Train on Skywork-Reward-Preference-80K-v0.2 + Tulu 3 70B preference mixture. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests whether an instruct-tuned value model provides better credit assignment than a base model value head. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Loads the RM as AutoModelForSequenceClassification, copies its backbone weights and trained score head to the value model. This gives the value model a pre-trained head that already understands reward prediction, instead of random initialization. New script: qwen3_4b_base_ppo_rm_init.sh uses hamishivi/qwen3_4b_base_rm__1__1774369319 as the RM. Made-with: Cursor
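Roughly what that reward-model initialization could look like, assuming the value model shares the RM's architecture and single-logit score head; the model classes and copy mechanism here are assumptions, and only the RM name comes from the commit.

```python
import torch
from transformers import AutoModelForSequenceClassification

rm_name = "hamishivi/qwen3_4b_base_rm__1__1774369319"
reward_model = AutoModelForSequenceClassification.from_pretrained(
    rm_name, num_labels=1, torch_dtype=torch.bfloat16
)

# Value model with the same backbone architecture and a 1-logit score head.
value_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-4B-Base", num_labels=1, torch_dtype=torch.bfloat16
)

# Copy the RM's backbone weights and trained score head into the value model
# instead of leaving the head randomly initialized.
missing, unexpected = value_model.load_state_dict(reward_model.state_dict(), strict=False)
print("missing:", missing, "unexpected:", unexpected)
```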
1000-step value warmup only (policy frozen), 32 prompts, no decoupled GAE, no length-adaptive GAE. Tests how different lambda values affect value model learning quality during pretraining. Made-with: Cursor
…100-step
- lam095_ladaptive: 1000 steps, length-adaptive GAE
- lam095_decoupled: 1000 steps, decoupled GAE (critic lam=1.0)
- lam095_100steps: 100 steps only (quick baseline)

Made-with: Cursor
…atures
Three new value model conditioning modes, each enabled via its own CLI option (see the sketch after this commit message):

1. --gt_conditioning_template expected_accuracy: prepends "Given the answer is <GT>, Let me compute the expected accuracy of the partial rollout:" before value model scoring.

2. --gt_conditioning_template rollout_context: prepends sibling rollouts (with CORRECT/INCORRECT labels) from the same prompt group + GT answer. Controlled by --rollout_context_num_siblings (default 4). Pipes sibling rollout text and scores through PackedSequences → CollatedBatchData.

3. --use_generative_value_model: separate LM path (keeps LM head) that breaks responses into chunks, prompts with <value_think> tags for correctness reasoning, parses {score: X}, and trains with MSE + REINFORCE. Controlled by --generative_value_chunk_size, --generative_value_loss_coef, --generative_value_reinforce_coef, --generative_value_max_think_tokens.

Includes 4B Qwen3 launch scripts for each variant (32 prompts × 16 rollouts).

Made-with: Cursor
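Two tiny illustrative helpers for modes 1 and 3 above: the expected_accuracy prefix string is quoted from the commit message, while the helper names and the regex for the generative value model's {score: X} output are assumptions.

```python
import re

def expected_accuracy_prefix(ground_truth: str) -> str:
    """Prefix prepended before value-model scoring in expected_accuracy mode."""
    return (
        f"Given the answer is {ground_truth}, "
        "Let me compute the expected accuracy of the partial rollout:"
    )

def parse_generative_value_score(text: str) -> float | None:
    """Pull the numeric score out of a `{score: X}` judgment; None if absent."""
    match = re.search(r"\{\s*score\s*:\s*([-+]?\d*\.?\d+)\s*\}", text)
    return float(match.group(1)) if match else None

print(parse_generative_value_score("<value_think>...</value_think> {score: 0.75}"))  # 0.75
```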
…a=0.5

All match the vip_vpretrain_sae_gt config (100 warmup, GT expected_accuracy):
- sae_gt_decoupled_ladaptive: decoupled + length-adaptive GAE
- sae_gt_decoupled: decoupled GAE only
- sae_gt_decoupled_lam05: decoupled GAE with lambda=0.5

Made-with: Cursor
Summary
This PR adds the ability to do PPO-style training with a learned value model for advantage estimation (GAE) instead of group-normalized rewards (GRPO style).
Key features:
- New config options: use_value_model, value_loss_coef, vf_clip_range, gamma, gae_lambda, separate_value_model

Usage
Status
Test plan
- Run with --use_value_model True and verify training works

Made with Cursor
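Putting the options from this PR together, a hedged sketch of how the new fields might look as a config dataclass: the field names come from the PR description and commit messages, while the defaults for value_loss_coef, vf_clip_range, gamma, and gae_lambda are guesses based on the usage example rather than confirmed defaults.

```python
from dataclasses import dataclass

@dataclass
class ValueModelConfig:
    # Core PPO / value-model options (names from the PR description)
    use_value_model: bool = False
    separate_value_model: bool = False
    value_loss_coef: float = 0.5   # value from the usage example, not a confirmed default
    vf_clip_range: float = 0.2     # value from the usage example, not a confirmed default
    gamma: float = 1.0             # assumed default
    gae_lambda: float = 0.95       # assumed default
    # VAPO options (defaults as stated in the commit messages)
    decoupled_gae: bool = False
    length_adaptive_gae: bool = False
    length_adaptive_gae_alpha: float = 0.05
    positive_example_lm_loss: bool = False
    positive_example_lm_loss_coef: float = 0.1
    value_warmup_steps: int = 0
```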