WIP: Add PPO with value model support to grpo_fast.py #1462
Conversation
Summary of Changes
Hello @hamishivi, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly enhances the reinforcement learning capabilities by integrating a PPO-style training approach that leverages a learned value model for more sophisticated advantage estimation. This allows for greater flexibility in model architecture and training dynamics, moving beyond the traditional GRPO method. Additionally, a new evaluation utility has been introduced to streamline the assessment of model performance on diverse tasks.
Activity
This adds the ability to do PPO-style training with a learned value model for advantage estimation (GAE) instead of group-normalized rewards (GRPO).

Key changes:
- Add PPO configuration options (use_value_model, value_loss_coef, vf_clip_range, gamma, gae_lambda)
- Add value model initialization (supports separate model or shared backbone with value head)
- Compute GAE advantages using value model predictions
- Add value loss to training objective with optional value function clipping
- Support checkpointing for value model/head and optimizer state
- Pass raw rewards through data loader when using value model

Usage: --use_value_model True --value_loss_coef 0.5 --vf_clip_range 0.2

Co-authored-by: Cursor <cursoragent@cursor.com>
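For readers unfamiliar with GAE, here is a minimal sketch of the advantage computation this description refers to; the function name and tensor layout are illustrative rather than the PR's actual implementation, and it assumes per-token rewards, value predictions, and response masks that are already aligned.

```python
import torch

def compute_gae(rewards, values, response_mask, gamma=1.0, gae_lambda=0.95):
    """Generalized Advantage Estimation over (batch, seq_len) rollouts.

    rewards, values, response_mask: float tensors of shape (B, T).
    The bootstrap value after the final token is taken to be 0.
    """
    B, T = rewards.shape
    advantages = torch.zeros_like(rewards)
    last_gae = torch.zeros(B, device=rewards.device)
    for t in reversed(range(T)):
        next_value = values[:, t + 1] if t + 1 < T else torch.zeros(B, device=rewards.device)
        # TD residual: r_t + gamma * V(s_{t+1}) - V(s_t)
        delta = rewards[:, t] + gamma * next_value - values[:, t]
        last_gae = delta + gamma * gae_lambda * last_gae
        advantages[:, t] = last_gae
    advantages = advantages * response_mask
    returns = advantages + values  # targets for the value loss
    return advantages, returns
```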
Code Review
This pull request introduces support for PPO-style training with a learned value model, which is a significant enhancement. The implementation includes options for both a shared backbone and a separate value model, along with GAE computation and checkpointing. The changes are extensive and touch data loading, data types, and the core training loop. My review has identified a critical issue where the separate value model is not being trained, which needs to be addressed. I've also pointed out a few medium-severity issues related to DeepSpeed initialization, a potential bug in the value loss calculation, and a code smell in attention mask handling. Overall, the direction is great, but these key issues should be resolved before merging.
    if not self.args.separate_value_model:
        # Value head is separate, need manual backward
        value_loss.backward()
    # For separate_value_model, the value model isn't trained in this loop
The current implementation does not train the separate value model (when separate_value_model=True). The value loss is calculated, but no backward pass or optimizer step is performed for the value model, as indicated by the comment on line 1069. This defeats the purpose of a learned value function in PPO. The value model should have its own optimizer, and its weights should be updated based on the value loss.
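A minimal sketch of one way to address this, assuming the trainer holds the separate value model behind a dedicated optimizer; the names (self.value_optimizer, value_loss) mirror the snippet above and are assumptions about the surrounding code, not the PR's actual fix.

```python
if not self.args.separate_value_model:
    # Value head shares the policy backbone; manual backward as before.
    value_loss.backward()
else:
    # Separate value model: update it with its own optimizer so the
    # value loss actually trains the critic.
    self.value_optimizer.zero_grad()
    value_loss.backward()
    self.value_optimizer.step()
```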
    if args.use_value_model:
        self._init_value_model(args, model_config, ds_config if args.load_ref_policy else None)
The _init_value_model function is called with a ds_config that depends on args.load_ref_policy. If load_ref_policy is false, ds_config will be None, and the value model (if separate) will not be wrapped by DeepSpeed. This is inconsistent with the policy model and can lead to issues in a distributed environment. The value model should likely be initialized with the same DeepSpeed configuration as the main policy model.
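A possible shape of the fix, sketched under the assumption that _init_value_model wraps the model with deepspeed.initialize; the call shown in the comment is illustrative and not taken from the PR.

```python
# Pass the policy's DeepSpeed config unconditionally so the separate value
# model is sharded/wrapped the same way as the policy model.
if args.use_value_model:
    self._init_value_model(args, model_config, ds_config)

# Inside _init_value_model (illustrative, assumed wiring):
# self.value_model, self.value_optimizer, _, _ = deepspeed.initialize(
#     model=value_model,
#     model_parameters=value_model.parameters(),
#     config=ds_config,
# )
```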
    # Use the separate value model
    output = self.value_model(
        input_ids=input_ids,
        attention_mask=attention_mask.clamp(0, 1),
The attention_mask is being clamped with .clamp(0, 1) here and on line 552. This suggests that the attention mask may contain values other than 0 and 1, which is unusual. While this clamp might prevent errors, it's better to investigate the root cause of why the attention mask has unexpected values and fix it at the source. This could be indicative of a bug in how attention masks are created or processed earlier in the pipeline.
    loss_stats_B["value_loss"][i] = masked_mean(value_loss_BT[:, 1:], response_mask_BT)
    loss_stats_B["vf_clipfrac"][i] = masked_mean(vf_clipfrac_BT[:, 1:], response_mask_BT)
The value loss (value_loss_BT) and value clipping fraction (vf_clipfrac_BT) are sliced with [:, 1:] before being passed to masked_mean (here and on lines 1060-1061). This effectively ignores the value loss for the first token of each sequence in the batch. Unlike policy log-probabilities which are naturally shifted, value estimates and returns are typically computed for every token in the response. Dropping the first token's value loss seems incorrect and may have been copied from the policy loss calculation by mistake. Please verify if this is the intended behavior.
- Add _gather_for_gae() method to gather values, rewards, dones, and response masks across SP ranks before GAE computation
- Add _extract_sp_chunk() method to extract this rank's chunk of advantages and returns after GAE computation
- Modify GAE computation in step() to use gather/extract when SP enabled
- Update forward_value() docstring to document SP handling

GAE requires temporal differences (value[t+1] - value[t]) across the entire sequence. With sequence parallelism, each rank only has a chunk of the sequence, so we gather full sequences, compute GAE, then split the results back to each rank.

Co-authored-by: Cursor <cursoragent@cursor.com>
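Sketched below is the gather/compute/split pattern this commit describes, assuming a torch.distributed process group over the SP ranks; the helper names and signatures are simplified stand-ins for _gather_for_gae() / _extract_sp_chunk().

```python
import torch
import torch.distributed as dist

def gather_for_gae(values_local: torch.Tensor, sp_group) -> torch.Tensor:
    """Each SP rank holds a (batch, seq_len / sp_size) chunk; GAE needs
    value[t+1] - value[t] over the full sequence, so chunks are all-gathered
    and concatenated along the sequence dimension before computing GAE."""
    sp_size = dist.get_world_size(group=sp_group)
    chunks = [torch.empty_like(values_local) for _ in range(sp_size)]
    dist.all_gather(chunks, values_local, group=sp_group)
    return torch.cat(chunks, dim=1)  # full-sequence tensor on every rank

def extract_sp_chunk(full: torch.Tensor, sp_group) -> torch.Tensor:
    """After GAE, return only this rank's sequence chunk of advantages/returns."""
    sp_size = dist.get_world_size(group=sp_group)
    rank = dist.get_rank(group=sp_group)
    return full.chunk(sp_size, dim=1)[rank]
```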
Two test scripts for the PPO value model feature:
- ppo_value_model_8gpu.sh: with sequence_parallel_size=2 to test SP + value model
- ppo_value_model_8gpu_no_sp.sh: without SP for simpler testing

Both use Qwen2.5-1.5B on GSM8K with verifiable rewards.

Co-authored-by: Cursor <cursoragent@cursor.com>
Implements key techniques from the VAPO paper (arXiv:2504.05118):

1. Decoupled GAE (--decoupled_gae):
   - Critic uses lambda=1.0 for unbiased Monte Carlo returns
   - Policy uses smaller lambda for faster convergence
   - Helps with value model learning in long-CoT tasks

2. Length-Adaptive GAE (--length_adaptive_gae):
   - Dynamically adjusts lambda based on sequence length
   - Formula: lambda = 1 - 1/(alpha * length) (see the sketch below)
   - Balances bias-variance tradeoff for varying sequence lengths

3. Positive Example LM Loss (--positive_example_lm_loss):
   - Adds NLL loss on correct/positive examples (self-imitation learning)
   - Improves sample efficiency when positive rewards are sparse
   - Configurable coefficient via --positive_example_lm_loss_coef

New config options in ExperimentConfig:
- decoupled_gae: bool (default False)
- length_adaptive_gae: bool (default False)
- length_adaptive_gae_alpha: float (default 0.05)
- positive_example_lm_loss: bool (default False)
- positive_example_lm_loss_coef: float (default 0.1)

Also adds new GAE functions in rl_utils.py:
- calculate_length_adaptive_lambda()
- calculate_advantages_packed_vapo()

Co-authored-by: Cursor <cursoragent@cursor.com>
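A minimal sketch of the length-adaptive lambda formula above; the function name matches the one listed for rl_utils.py, but the exact signature and the clamping behavior here are assumptions.

```python
import torch

def calculate_length_adaptive_lambda(response_lengths: torch.Tensor, alpha: float = 0.05) -> torch.Tensor:
    """Length-adaptive GAE lambda (VAPO, arXiv:2504.05118):
    lambda = 1 - 1 / (alpha * length).
    Longer responses get lambda closer to 1 (less bias); shorter ones get a
    smaller lambda (less variance). Clamping to [0, 1) is an assumption.
    """
    lam = 1.0 - 1.0 / (alpha * response_lengths.clamp(min=1).float())
    return lam.clamp(min=0.0)

# With alpha=0.05: lengths 20, 200, 2000 -> lambda ~= 0.0, 0.9, 0.99
print(calculate_length_adaptive_lambda(torch.tensor([20, 200, 2000])))
```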
Implements VAPO's value pretraining technique where the value model is trained for N steps while keeping the policy frozen. This helps reduce value model initialization bias before starting policy training.

New config option:
- value_warmup_steps: Number of steps to train value model only (default 0)

During warmup (training_step <= value_warmup_steps):
- Policy backward/step is skipped (policy weights frozen)
- Value model backward/step still runs (value head is trained)
- Logged via vapo/value_warmup metric (1.0 during warmup, 0.0 after)

Reference: https://arxiv.org/abs/2504.05118 Section 4.1

Co-authored-by: Cursor <cursoragent@cursor.com>
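As a rough illustration of the gating described above (the argument and metric names follow the commit message; everything else is an assumption rather than the PR's code):

```python
def warmup_gated_step(training_step, value_warmup_steps,
                      policy_loss, value_loss,
                      policy_optimizer, value_optimizer, metrics):
    """Skip policy updates while the value model warms up (VAPO Sec. 4.1)."""
    in_warmup = training_step <= value_warmup_steps
    if not in_warmup:
        policy_loss.backward()
        policy_optimizer.step()
        policy_optimizer.zero_grad()
    # The value model/head is trained every step, including during warmup.
    value_loss.backward()
    value_optimizer.step()
    value_optimizer.zero_grad()
    metrics["vapo/value_warmup"] = 1.0 if in_warmup else 0.0
```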
- Use same data, batch size, learning rate, and other hyperparameters
- 4 nodes x 8 GPUs with SP=4 for the full-scale 8gpu script
- Same setup but SP=1 for the no_sp variant
- Scaled-down 1-GPU version for quick testing

Co-authored-by: Cursor <cursoragent@cursor.com>
…hould not be a CLI arg Co-authored-by: Cursor <cursoragent@cursor.com>
… dataset Co-authored-by: Cursor <cursoragent@cursor.com>
… policy Co-authored-by: Cursor <cursoragent@cursor.com>
Three options via --gt_conditioning_template (see the sketch after this commit message):
- 'answer_prefix' (default): "Answer: 42\n" (plain text, minimal)
- 'boxed_answer': "The correct answer is \boxed{42}.\n" (matches the expected output format the model is trained to produce)
- 'system_hint': "<|im_start|>system\nThe ground truth answer to this problem is: 42<|im_end|>\n" (chat-formatted, uses tokens the model already understands)
Made-with: Cursor
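A minimal sketch of how these three prefixes might be built; the helper name is hypothetical, while the template strings are taken verbatim from the commit message above.

```python
def build_gt_conditioning_prefix(ground_truth: str, template: str = "answer_prefix") -> str:
    """Prefix prepended before value-model scoring, keyed by --gt_conditioning_template."""
    if template == "answer_prefix":
        return f"Answer: {ground_truth}\n"
    if template == "boxed_answer":
        return f"The correct answer is \\boxed{{{ground_truth}}}.\n"
    if template == "system_hint":
        return (
            "<|im_start|>system\n"
            f"The ground truth answer to this problem is: {ground_truth}<|im_end|>\n"
        )
    raise ValueError(f"Unknown gt_conditioning_template: {template}")

print(build_gt_conditioning_prefix("42", "boxed_answer"))
```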
…en3-4B-Base Made-with: Cursor
…in GT scripts Made-with: Cursor
Previously the value model was only saved when checkpoint_state_dir was set (DeepSpeed state checkpoints). This adds value model saving to the regular save_model() path so it's included in HF-format checkpoints at each save_freq interval. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
New wandb metrics: - value/correct_by_position, value/incorrect_by_position: per-token value curves at absolute positions, split by rollout correctness - value/correct_by_pct, value/incorrect_by_pct: same but normalized to percentile bins (20 bins) for cross-length comparison - value/advantage_correct_mean/std, value/advantage_incorrect_mean/std: advantage statistics split by correctness - value/advantage_correct_by_pct, value/advantage_incorrect_by_pct: advantage curves by normalized position Also saves value model alongside policy in HF checkpoints, and adds fetch-beaker-evals skill for querying AIME scores from Beaker. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
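For the *_by_pct metrics, a small sketch of the percentile binning this plausibly involves (20 bins over the normalized response position); the function is illustrative, not the PR's implementation. Empty bins come out as NaN, which lines up with the NaN filtering mentioned a few commits below.

```python
import numpy as np

def value_curve_by_pct(token_values: np.ndarray, num_bins: int = 20) -> np.ndarray:
    """Average per-token values into position-percentile bins so curves from
    responses of different lengths can be compared on one axis."""
    length = len(token_values)
    bins = np.minimum((np.arange(length) / length * num_bins).astype(int), num_bins - 1)
    return np.array([
        token_values[bins == b].mean() if np.any(bins == b) else np.nan
        for b in range(num_bins)
    ])

print(value_curve_by_pct(np.linspace(0.0, 1.0, 7)))  # short response -> some NaN bins
```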
New script: qwen3_4b_base_ppo_8b_value.sh
- Qwen3-4B-Base policy with Qwen3-8B-Base value model
- 1000 value warmup steps (10x longer than default)
- Tests whether a larger, better-pretrained value model can provide useful credit assignment for PPO training

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
- Filter NaN values from array metrics before creating wandb.Histogram
- qwen3_4b_base_ppo_16vb.sh: 16 value mini-batches, value LR 5e-6
- qwen3_4b_base_ppo_lam1.sh: same + lambda=1.0, no decoupled/adaptive GAE

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Test lambda=1.0 without decoupled/adaptive GAE on its own. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests whether GT value conditioning helps when bootstrapping is removed. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests whether the decoupled critic (lambda=1.0 for value targets) is the source of instability, while keeping length-adaptive policy lambda. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Train on Skywork-Reward-Preference-80K-v0.2 + Tulu 3 70B preference mixture. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Tests whether an instruct-tuned value model provides better credit assignment than a base model value head. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Loads the RM as AutoModelForSequenceClassification, copies its backbone weights and trained score head to the value model. This gives the value model a pre-trained head that already understands reward prediction, instead of random initialization. New script: qwen3_4b_base_ppo_rm_init.sh uses hamishivi/qwen3_4b_base_rm__1__1774369319 as the RM. Made-with: Cursor
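Roughly what that reward-model initialization could look like, assuming the value model shares the RM's architecture and single-logit score head; the model classes and copy mechanism here are assumptions, and only the RM name comes from the commit.

```python
import torch
from transformers import AutoModelForSequenceClassification

rm_name = "hamishivi/qwen3_4b_base_rm__1__1774369319"
reward_model = AutoModelForSequenceClassification.from_pretrained(
    rm_name, num_labels=1, torch_dtype=torch.bfloat16
)

# Value model with the same backbone architecture and a 1-logit score head.
value_model = AutoModelForSequenceClassification.from_pretrained(
    "Qwen/Qwen3-4B-Base", num_labels=1, torch_dtype=torch.bfloat16
)

# Copy the RM's backbone weights and trained score head into the value model
# instead of leaving the head randomly initialized.
missing, unexpected = value_model.load_state_dict(reward_model.state_dict(), strict=False)
print("missing:", missing, "unexpected:", unexpected)
```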
1000-step value warmup only (policy frozen), 32 prompts, no decoupled GAE, no length-adaptive GAE. Tests how different lambda values affect value model learning quality during pretraining. Made-with: Cursor
…100-step
- lam095_ladaptive: 1000 steps, length-adaptive GAE
- lam095_decoupled: 1000 steps, decoupled GAE (critic lam=1.0)
- lam095_100steps: 100 steps only (quick baseline)

Made-with: Cursor
…atures
Three new value model conditioning modes, each enabled via its own CLI option (see the sketch after this commit message):

1. --gt_conditioning_template expected_accuracy: prepends "Given the answer is <GT>, Let me compute the expected accuracy of the partial rollout:" before value model scoring.

2. --gt_conditioning_template rollout_context: prepends sibling rollouts (with CORRECT/INCORRECT labels) from the same prompt group + GT answer. Controlled by --rollout_context_num_siblings (default 4). Pipes sibling rollout text and scores through PackedSequences → CollatedBatchData.

3. --use_generative_value_model: separate LM path (keeps LM head) that breaks responses into chunks, prompts with <value_think> tags for correctness reasoning, parses {score: X}, and trains with MSE + REINFORCE. Controlled by --generative_value_chunk_size, --generative_value_loss_coef, --generative_value_reinforce_coef, --generative_value_max_think_tokens.

Includes 4B Qwen3 launch scripts for each variant (32 prompts × 16 rollouts).

Made-with: Cursor
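Two tiny illustrative helpers for modes 1 and 3 above: the expected_accuracy prefix string is quoted from the commit message, while the helper names and the regex for the generative value model's {score: X} output are assumptions.

```python
import re

def expected_accuracy_prefix(ground_truth: str) -> str:
    """Prefix prepended before value-model scoring in expected_accuracy mode."""
    return (
        f"Given the answer is {ground_truth}, "
        "Let me compute the expected accuracy of the partial rollout:"
    )

def parse_generative_value_score(text: str) -> float | None:
    """Pull the numeric score out of a `{score: X}` judgment; None if absent."""
    match = re.search(r"\{\s*score\s*:\s*([-+]?\d*\.?\d+)\s*\}", text)
    return float(match.group(1)) if match else None

print(parse_generative_value_score("<value_think>...</value_think> {score: 0.75}"))  # 0.75
```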
…a=0.5

All match the vip_vpretrain_sae_gt config (100 warmup, GT expected_accuracy):
- sae_gt_decoupled_ladaptive: decoupled + length-adaptive GAE
- sae_gt_decoupled: decoupled GAE only
- sae_gt_decoupled_lam05: decoupled GAE with lambda=0.5

Made-with: Cursor
Summary
This PR adds the ability to do PPO-style training with a learned value model for advantage estimation (GAE) instead of group-normalized rewards (GRPO style).
Key features:
- New config options: use_value_model, value_loss_coef, vf_clip_range, gamma, gae_lambda, separate_value_model

Usage
Status
Test plan
- Run with --use_value_model True and verify training works

Made with Cursor
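Putting the options from this PR together, a hedged sketch of how the new fields might look as a config dataclass: the field names come from the PR description and commit messages, while the defaults for value_loss_coef, vf_clip_range, gamma, and gae_lambda are guesses based on the usage example rather than confirmed defaults.

```python
from dataclasses import dataclass

@dataclass
class ValueModelConfig:
    # Core PPO / value-model options (names from the PR description)
    use_value_model: bool = False
    separate_value_model: bool = False
    value_loss_coef: float = 0.5   # value from the usage example, not a confirmed default
    vf_clip_range: float = 0.2     # value from the usage example, not a confirmed default
    gamma: float = 1.0             # assumed default
    gae_lambda: float = 0.95       # assumed default
    # VAPO options (defaults as stated in the commit messages)
    decoupled_gae: bool = False
    length_adaptive_gae: bool = False
    length_adaptive_gae_alpha: float = 0.05
    positive_example_lm_loss: bool = False
    positive_example_lm_loss_coef: float = 0.1
    value_warmup_steps: int = 0
```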