
WIP: Add PPO with value model support to grpo_fast.py#1462

Draft
hamishivi wants to merge 153 commits into main from hamish/ppo-value-model

Conversation

@hamishivi
Collaborator

Summary

This PR adds the ability to do PPO-style training with a learned value model for advantage estimation (GAE) instead of group-normalized rewards (GRPO style).

Key features:

  • New config options: use_value_model, value_loss_coef, vf_clip_range, gamma, gae_lambda, separate_value_model
  • Supports two modes:
    • Shared backbone (default): Uses the policy model's backbone with a separate value head - more memory efficient
    • Separate value model: Loads a completely separate model for value estimation - more flexible but doubles memory
  • Computes GAE advantages using value model predictions instead of group-normalized rewards (see the sketch after this list)
  • Adds value loss (MSE with optional clipping) to the training objective
  • Full checkpointing support for value model/head and optimizer state
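
For intuition, here is a minimal sketch of the GAE computation over per-token rewards and value predictions. It is illustrative only (tensor shapes and names are assumptions), not the exact code in grpo_fast.py / rl_utils.py:

import torch

def compute_gae(rewards, values, dones, gamma=1.0, gae_lambda=0.95):
    """Illustrative GAE over one response: rewards/values/dones are [T] tensors."""
    T = rewards.shape[0]
    advantages = torch.zeros(T)
    last_gae = 0.0
    for t in reversed(range(T)):
        # Bootstrap from the next value unless the sequence terminates at t.
        next_value = values[t + 1] if t + 1 < T else 0.0
        next_nonterminal = 1.0 - dones[t].float()
        delta = rewards[t] + gamma * next_value * next_nonterminal - values[t]
        last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
        advantages[t] = last_gae
    returns = advantages + values  # regression targets for the value loss
    return advantages, returns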

Usage

--use_value_model True \
--value_loss_coef 0.5 \
--vf_clip_range 0.2 \
--gamma 1.0 \
--gae_lambda 0.95

Status

  • Basic implementation
  • Testing
  • Documentation

Test plan

  • Run with --use_value_model True and verify training works
  • Test checkpoint save/load with value model
  • Compare training curves with GRPO baseline

Made with Cursor

@gemini-code-assist
Contributor

Summary of Changes

Hello @hamishivi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly enhances the reinforcement learning capabilities by integrating a PPO-style training approach that leverages a learned value model for more sophisticated advantage estimation. This allows for greater flexibility in model architecture and training dynamics, moving beyond the traditional GRPO method. Additionally, a new evaluation utility has been introduced to streamline the assessment of model performance on diverse tasks.

Highlights

  • PPO with Value Model Integration: Introduced the capability for PPO-style training using a learned value model for Generalized Advantage Estimation (GAE), offering an alternative to the existing GRPO-style group-normalized rewards.
  • Flexible Value Model Architectures: Implemented support for two value model configurations: a shared backbone with a separate value head (memory efficient) and a completely separate value model (more flexible), configurable via separate_value_model (a sketch of the shared-backbone value head follows this list).
  • New Configuration Options: Added several new configuration parameters to ExperimentConfig including use_value_model, value_loss_coef, vf_clip_range, gamma, gae_lambda, and separate_value_model to control the PPO training process.
  • Data Loader Enhancements: Modified the data loader to include raw rewards and 'done' flags in the collated batch data, which are essential for GAE computation in the value model.
  • Value Model Checkpointing: Enabled full checkpointing support for the value model or value head, along with its optimizer state, ensuring training progress can be saved and resumed.
  • New Evaluation Script: Added a new utility script (scripts/eval_results.py) for evaluating inference results across various benchmarks like AIME (math verification), SimpleQA (LLM-as-judge), and GPQA (letter comparison).
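
To illustrate the shared-backbone mode, here is a rough sketch of a separate value head on top of the policy's hidden states. The class and attribute names are hypothetical, not the actual open_instruct code:

import torch
import torch.nn as nn

class ValueHead(nn.Module):
    """Hypothetical value head: maps final hidden states to one scalar value per token."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: [batch, seq_len, hidden_size] -> values: [batch, seq_len]
        return self.score(hidden_states).squeeze(-1)

# Shared-backbone mode reuses the policy forward pass; only the head adds parameters:
#   outputs = policy_model(input_ids, attention_mask=mask, output_hidden_states=True)
#   values = value_head(outputs.hidden_states[-1])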


Changelog
  • open_instruct/data_loader.py
    • Added use_value_model boolean field to StreamingDataLoaderConfig to control reward passing.
    • Modified prepare_collated_data_for_workers to accept an include_rewards parameter and conditionally slice and collate rewards and dones.
    • Updated _data_preparation_loop to pack raw rewards into packed_sequences when use_value_model is enabled, placing rewards at end-of-sequence positions.
  • open_instruct/data_types.py
    • Added optional rewards and dones fields (list of torch.Tensor) to the CollatedBatchData class to support value model training.
  • open_instruct/grpo_fast.py
    • Imported AutoModelForSequenceClassification for handling separate value models.
    • Implemented logic within the load method to load value model/head checkpoints if use_value_model is true and a checkpoint exists.
    • Added _init_value_model method to initialize the value model, supporting both separate models and shared backbones with a dedicated value head.
    • Introduced forward_value method to compute value estimates from the policy or value model.
    • Modified _compute_loss_metrics to track and log average value loss and value function clip fraction.
    • Updated the step method to compute value estimates and GAE advantages using the value model predictions.
    • Adjusted policy loss calculation to use GAE advantages when use_value_model is active.
    • Implemented value loss computation with optional clipping (PPO-style) and integrated it into the overall training objective (a sketch follows the changelog).
    • Added separate backward pass and optimizer step for the value head when using a shared backbone, as it's not managed by DeepSpeed.
    • Extended save_checkpoint_state to save the value model/head and its optimizer state.
  • open_instruct/grpo_utils.py
    • Added new PPO-related configuration options to ExperimentConfig: use_value_model, value_model_name_or_path, value_loss_coef, vf_clip_range, gamma, gae_lambda, and separate_value_model.
    • Added validation checks for the new PPO configuration parameters to ensure valid ranges for gamma, gae_lambda, value_loss_coef, and vf_clip_range.
  • scripts/eval_results.py
    • Added a new script to evaluate inference results from tool-use models.
    • Includes functions to load JSONL files, extract boxed answers (for AIME), and extract letter answers (for GPQA).
    • Provides evaluation logic for AIME (using math-verify with majority voting), SimpleQA (using LLM-as-judge with OpenAI API), and GPQA (simple letter comparison).
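
For reference, a hedged sketch of a PPO-style clipped value loss of the kind described for grpo_fast.py above; variable names are illustrative, not the actual implementation:

import torch

def clipped_value_loss(values, old_values, returns, vf_clip_range=0.2):
    """MSE against GAE returns, with new values clipped to stay within
    vf_clip_range of the values recorded at rollout time (PPO-style)."""
    values_clipped = old_values + (values - old_values).clamp(-vf_clip_range, vf_clip_range)
    loss_unclipped = (values - returns) ** 2
    loss_clipped = (values_clipped - returns) ** 2
    value_loss = 0.5 * torch.max(loss_unclipped, loss_clipped).mean()
    vf_clipfrac = (loss_clipped > loss_unclipped).float().mean()  # fraction of tokens where clipping was active
    return value_loss, vf_clipfrac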
Activity
  • The pull request is currently a Work In Progress (WIP).
  • The basic implementation of PPO with value model support has been completed.
  • Testing of the new features is pending.
  • Documentation for the new PPO functionality is yet to be added.

This adds the ability to do PPO-style training with a learned value model
for advantage estimation (GAE) instead of group-normalized rewards (GRPO).

Key changes:
- Add PPO configuration options (use_value_model, value_loss_coef, vf_clip_range, gamma, gae_lambda)
- Add value model initialization (supports separate model or shared backbone with value head)
- Compute GAE advantages using value model predictions
- Add value loss to training objective with optional value function clipping
- Support checkpointing for value model/head and optimizer state
- Pass raw rewards through data loader when using value model

Usage: --use_value_model True --value_loss_coef 0.5 --vf_clip_range 0.2
Co-authored-by: Cursor <cursoragent@cursor.com>

@gemini-code-assist gemini-code-assist Bot left a comment


Code Review

This pull request introduces support for PPO-style training with a learned value model, which is a significant enhancement. The implementation includes options for both a shared backbone and a separate value model, along with GAE computation and checkpointing. The changes are extensive and touch data loading, data types, and the core training loop. My review has identified a critical issue where the separate value model is not being trained, which needs to be addressed. I've also pointed out a few medium-severity issues related to DeepSpeed initialization, a potential bug in the value loss calculation, and a code smell in attention mask handling. Overall, the direction is great, but these key issues should be resolved before merging.

Comment thread open_instruct/grpo_fast.py Outdated
Comment on lines +1066 to +1069
if not self.args.separate_value_model:
    # Value head is separate, need manual backward
    value_loss.backward()
# For separate_value_model, the value model isn't trained in this loop

critical

The current implementation does not train the separate value model (when separate_value_model=True). The value loss is calculated, but no backward pass or optimizer step is performed for the value model, as indicated by the comment on line 1069. This defeats the purpose of a learned value function in PPO. The value model should have its own optimizer, and its weights should be updated based on the value loss.
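
A minimal sketch of the kind of fix suggested here, assuming the separate value model is wrapped as its own DeepSpeed engine (names are hypothetical):

if self.args.separate_value_model:
    # Hypothetical: the separate value model needs its own backward/optimizer step,
    # e.g. via its own DeepSpeed engine or a dedicated torch optimizer.
    self.value_model.backward(value_loss)
    self.value_model.step()
else:
    # Value head is separate, need manual backward
    value_loss.backward()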

Comment thread open_instruct/grpo_fast.py Outdated
Comment on lines +376 to +377
if args.use_value_model:
    self._init_value_model(args, model_config, ds_config if args.load_ref_policy else None)

medium

The _init_value_model function is called with a ds_config that depends on args.load_ref_policy. If load_ref_policy is false, ds_config will be None, and the value model (if separate) will not be wrapped by DeepSpeed. This is inconsistent with the policy model and can lead to issues in a distributed environment. The value model should likely be initialized with the same DeepSpeed configuration as the main policy model.

Comment thread open_instruct/grpo_fast.py Outdated
# Use the separate value model
output = self.value_model(
    input_ids=input_ids,
    attention_mask=attention_mask.clamp(0, 1),

medium

The attention_mask is being clamped with .clamp(0, 1) here and on line 552. This suggests that the attention mask may contain values other than 0 and 1, which is unusual. While this clamp might prevent errors, it's better to investigate the root cause of why the attention mask has unexpected values and fix it at the source. This could be indicative of a bug in how attention masks are created or processed earlier in the pipeline.

Comment thread open_instruct/grpo_fast.py Outdated
Comment on lines +1099 to +1100
loss_stats_B["value_loss"][i] = masked_mean(value_loss_BT[:, 1:], response_mask_BT)
loss_stats_B["vf_clipfrac"][i] = masked_mean(vf_clipfrac_BT[:, 1:], response_mask_BT)

medium

The value loss (value_loss_BT) and value clipping fraction (vf_clipfrac_BT) are sliced with [:, 1:] before being passed to masked_mean (here and on lines 1060-1061). This effectively ignores the value loss for the first token of each sequence in the batch. Unlike policy log-probabilities which are naturally shifted, value estimates and returns are typically computed for every token in the response. Dropping the first token's value loss seems incorrect and may have been copied from the policy loss calculation by mistake. Please verify if this is the intended behavior.

@hamishivi hamishivi force-pushed the hamish/ppo-value-model branch from ebbfb4a to 2cf1c36 on February 4, 2026 at 20:15
hamishivi and others added 26 commits February 4, 2026 12:36
- Add _gather_for_gae() method to gather values, rewards, dones, and
  response masks across SP ranks before GAE computation
- Add _extract_sp_chunk() method to extract this rank's chunk of
  advantages and returns after GAE computation
- Modify GAE computation in step() to use gather/extract when SP enabled
- Update forward_value() docstring to document SP handling

GAE requires temporal differences (value[t+1] - value[t]) across the
entire sequence. With sequence parallelism, each rank only has a chunk
of the sequence, so we gather full sequences, compute GAE, then split
the results back to each rank.
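
A rough sketch of that gather/compute/extract pattern using torch.distributed; the actual _gather_for_gae/_extract_sp_chunk implementations may differ in signature and packing details:

import torch
import torch.distributed as dist

def gather_for_gae(chunk, sp_group):
    # Each SP rank holds a [B, T/sp] slice; gather and concatenate along the
    # sequence dimension so GAE can see every timestep.
    sp_world = dist.get_world_size(group=sp_group)
    gathered = [torch.empty_like(chunk) for _ in range(sp_world)]
    dist.all_gather(gathered, chunk, group=sp_group)
    return torch.cat(gathered, dim=1)

def extract_sp_chunk(full, sp_group):
    # After GAE, take back only this rank's slice of advantages/returns.
    sp_rank = dist.get_rank(group=sp_group)
    sp_world = dist.get_world_size(group=sp_group)
    return full.chunk(sp_world, dim=1)[sp_rank]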

Co-authored-by: Cursor <cursoragent@cursor.com>
Two test scripts for the PPO value model feature:
- ppo_value_model_8gpu.sh: With sequence_parallel_size=2 to test SP+value model
- ppo_value_model_8gpu_no_sp.sh: Without SP for simpler testing

Both use Qwen2.5-1.5B on GSM8K with verifiable rewards.

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Implements key techniques from the VAPO paper (arXiv:2504.05118):

1. Decoupled GAE (--decoupled_gae):
   - Critic uses lambda=1.0 for unbiased Monte Carlo returns
   - Policy uses smaller lambda for faster convergence
   - Helps with value model learning in long-CoT tasks

2. Length-Adaptive GAE (--length_adaptive_gae):
   - Dynamically adjusts lambda based on sequence length
   - Formula: lambda = 1 - 1/(alpha * length)
   - Balances bias-variance tradeoff for varying sequence lengths

3. Positive Example LM Loss (--positive_example_lm_loss):
   - Adds NLL loss on correct/positive examples (self-imitation learning)
   - Improves sample efficiency when positive rewards are sparse
   - Configurable coefficient via --positive_example_lm_loss_coef

New config options in ExperimentConfig:
- decoupled_gae: bool (default False)
- length_adaptive_gae: bool (default False)
- length_adaptive_gae_alpha: float (default 0.05)
- positive_example_lm_loss: bool (default False)
- positive_example_lm_loss_coef: float (default 0.1)

Also adds new GAE functions in rl_utils.py:
- calculate_length_adaptive_lambda()
- calculate_advantages_packed_vapo()
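
As a worked illustration of the length-adaptive formula above (the signature of the real calculate_length_adaptive_lambda in rl_utils.py is an assumption here):

def calculate_length_adaptive_lambda(seq_length: int, alpha: float = 0.05) -> float:
    # lambda = 1 - 1/(alpha * length); longer sequences push lambda toward 1.
    return max(0.0, 1.0 - 1.0 / (alpha * seq_length))

# e.g. a 100-token response with alpha=0.05 gives 1 - 1/5 = 0.8,
# while a 2000-token response gives 1 - 1/100 = 0.99.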

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
Implements VAPO's value pretraining technique where the value model is
trained for N steps while keeping the policy frozen. This helps reduce
value model initialization bias before starting policy training.

New config option:
- value_warmup_steps: Number of steps to train value model only (default 0)

During warmup (training_step <= value_warmup_steps):
- Policy backward/step is skipped (policy weights frozen)
- Value model backward/step still runs (value head is trained)
- Logged via vapo/value_warmup metric (1.0 during warmup, 0.0 after)

Reference: https://arxiv.org/abs/2504.05118 Section 4.1
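
A rough sketch of the warmup gate described above; how the policy engine and value optimizer are actually wired in grpo_fast.py is assumed here:

in_value_warmup = training_step <= args.value_warmup_steps

if not in_value_warmup:
    # Normal PPO update: the policy is trained on the policy loss.
    model_engine.backward(policy_loss)
    model_engine.step()
# The value head/model is always updated, so it can fit returns before the policy moves.
value_loss.backward()
value_optimizer.step()
value_optimizer.zero_grad()

metrics["vapo/value_warmup"] = 1.0 if in_value_warmup else 0.0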
Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
- Use same data, batch size, learning rate, and other hyperparameters
- 4 nodes x 8 GPUs with SP=4 for the full-scale 8gpu script
- Same setup but SP=1 for the no_sp variant
- Scaled down 1-GPU version for quick testing

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
…hould not be a CLI arg

Co-authored-by: Cursor <cursoragent@cursor.com>
Co-authored-by: Cursor <cursoragent@cursor.com>
… dataset

Co-authored-by: Cursor <cursoragent@cursor.com>
… policy

Co-authored-by: Cursor <cursoragent@cursor.com>
hamishivi and others added 30 commits March 21, 2026 15:23
Three options via --gt_conditioning_template:
- 'answer_prefix' (default): "Answer: 42\n" (plain text, minimal)
- 'boxed_answer': "The correct answer is \boxed{42}.\n" (matches
  expected output format the model is trained to produce)
- 'system_hint': "<|im_start|>system\nThe ground truth answer to
  this problem is: 42<|im_end|>\n" (chat-formatted, uses tokens
  the model already understands)
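
A small sketch of how the three templates above could be rendered (helper name and signature are hypothetical):

def format_gt_conditioning(gt_answer: str, template: str = "answer_prefix") -> str:
    if template == "answer_prefix":
        return f"Answer: {gt_answer}\n"
    if template == "boxed_answer":
        return f"The correct answer is \\boxed{{{gt_answer}}}.\n"
    if template == "system_hint":
        return (
            "<|im_start|>system\n"
            f"The ground truth answer to this problem is: {gt_answer}<|im_end|>\n"
        )
    raise ValueError(f"unknown gt_conditioning_template: {template}")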

Made-with: Cursor
Previously the value model was only saved when checkpoint_state_dir was
set (DeepSpeed state checkpoints). This adds value model saving to the
regular save_model() path so it's included in HF-format checkpoints at
each save_freq interval.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New wandb metrics:
- value/correct_by_position, value/incorrect_by_position: per-token value
  curves at absolute positions, split by rollout correctness
- value/correct_by_pct, value/incorrect_by_pct: same but normalized to
  percentile bins (20 bins) for cross-length comparison
- value/advantage_correct_mean/std, value/advantage_incorrect_mean/std:
  advantage statistics split by correctness
- value/advantage_correct_by_pct, value/advantage_incorrect_by_pct:
  advantage curves by normalized position

Also saves value model alongside policy in HF checkpoints, and adds
fetch-beaker-evals skill for querying AIME scores from Beaker.
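
A hedged sketch of the percentile binning behind the *_by_pct curves (bin count taken from the description above; the actual implementation is an assumption):

import numpy as np

def bucket_by_percentile(per_token_values: np.ndarray, num_bins: int = 20) -> np.ndarray:
    # Average per-token values into fixed position-percentile bins so responses
    # of different lengths can be compared on a single curve.
    edges = np.linspace(0, len(per_token_values), num_bins + 1).astype(int)
    return np.array([
        per_token_values[edges[i]:edges[i + 1]].mean() if edges[i + 1] > edges[i] else np.nan
        for i in range(num_bins)
    ])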

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New script: qwen3_4b_base_ppo_8b_value.sh
- Qwen3-4B-Base policy with Qwen3-8B-Base value model
- 1000 value warmup steps (10x longer than default)
- Tests whether a larger, better-pretrained value model can provide
  useful credit assignment for PPO training

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Filter NaN values from array metrics before creating wandb.Histogram
- qwen3_4b_base_ppo_16vb.sh: 16 value mini-batches, value LR 5e-6
- qwen3_4b_base_ppo_lam1.sh: same + lambda=1.0, no decoupled/adaptive GAE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test lambda=1.0 without decoupled/adaptive GAE on its own.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests whether GT value conditioning helps when bootstrapping is removed.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests whether the decoupled critic (lambda=1.0 for value targets) is the
source of instability, while keeping length-adaptive policy lambda.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Train on Skywork-Reward-Preference-80K-v0.2 + Tulu 3 70B preference mixture.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests whether an instruct-tuned value model provides better credit
assignment than a base model value head.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Loads the RM as AutoModelForSequenceClassification, copies its backbone
weights and trained score head to the value model. This gives the value
model a pre-trained head that already understands reward prediction,
instead of random initialization.

New script: qwen3_4b_base_ppo_rm_init.sh uses
hamishivi/qwen3_4b_base_rm__1__1774369319 as the RM.
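
A sketch of that RM-to-value-model initialization; the attribute names (.model, .score) follow the usual HF sequence-classification layout for Qwen-style models and, like the policy base id, are assumptions rather than the actual code:

from transformers import AutoModelForSequenceClassification

value_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-4B-Base", num_labels=1
)
rm = AutoModelForSequenceClassification.from_pretrained(
    "hamishivi/qwen3_4b_base_rm__1__1774369319", num_labels=1
)
# Copy the RM backbone into the value model's backbone, and its trained
# scalar score head into the value head, instead of random initialization.
value_model.model.load_state_dict(rm.model.state_dict())
value_model.score.load_state_dict(rm.score.state_dict())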

Made-with: Cursor
1000-step value warmup only (policy frozen), 32 prompts, no decoupled
GAE, no length-adaptive GAE. Tests how different lambda values affect
value model learning quality during pretraining.

Made-with: Cursor
…100-step

- lam095_ladaptive: 1000 steps, length-adaptive GAE
- lam095_decoupled: 1000 steps, decoupled GAE (critic lam=1.0)
- lam095_100steps: 100 steps only (quick baseline)

Made-with: Cursor
…atures

Three new value model conditioning modes, each enabled via a CLI flag:

1. --gt_conditioning_template expected_accuracy: prepends "Given the answer
   is <GT>, Let me compute the expected accuracy of the partial rollout:"
   before value model scoring.

2. --gt_conditioning_template rollout_context: prepends sibling rollouts
   (with CORRECT/INCORRECT labels) from the same prompt group + GT answer.
   Controlled by --rollout_context_num_siblings (default 4). Pipes sibling
   rollout text and scores through PackedSequences → CollatedBatchData.

3. --use_generative_value_model: separate LM path (keeps LM head) that
   breaks responses into chunks, prompts with <value_think> tags for
   correctness reasoning, parses {score: X}, and trains with MSE + REINFORCE.
   Controlled by --generative_value_chunk_size, --generative_value_loss_coef,
   --generative_value_reinforce_coef, --generative_value_max_think_tokens.

Includes 4B Qwen3 launch scripts for each variant (32 prompts × 16 rollouts).

Made-with: Cursor
Made-with: Cursor
…a=0.5

All match vip_vpretrain_sae_gt config (100 warmup, GT expected_accuracy):
- sae_gt_decoupled_ladaptive: decoupled + length-adaptive GAE
- sae_gt_decoupled: decoupled GAE only
- sae_gt_decoupled_lam05: decoupled GAE with lambda=0.5

Made-with: Cursor