Conversation

@Jackmin801 (Member) commented Dec 30, 2025

Note

Reduces peak memory during RL training by chunking logits computation and refines FSDP sharding for models with tied embeddings.

  • Introduces logits_chunk_size in RLTrainerConfig to control per-sequence chunking for logits materialization
  • In train.py, replaces single-shot logits materialization with a chunked application of lm_head over hidden_states, computing loss/entropy and running backward per chunk before backpropagating through the backbone; handles the CP all-gather per chunk and configures lm_head FSDP resharding/blocking accordingly (see the sketch below)
  • In model.py, updates FSDP setup: if config.tie_word_embeddings, shard [model.model.embed_tokens, model.lm_head]; else shard model.lm_head; and shard model.model (not the entire model), consistently using config.reshard_after_forward

Written by Cursor Bugbot for commit 1313a42.
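
A minimal sketch of the chunked pattern the second bullet describes, assuming a plain linear lm_head and token-level cross-entropy standing in for the RL loss/entropy. Label shifting, the CP all-gather, and FSDP resharding are omitted, and all names here are illustrative rather than the PR's actual code.

```python
import torch
import torch.nn.functional as F

def chunked_lm_head_loss(hidden_states, labels, lm_head, num_chunks: int = 4):
    """Apply lm_head and the loss chunk-by-chunk along the sequence dimension,
    so the full [batch, seq, vocab] logits tensor is never materialized."""
    # Detach so each chunk's backward stops here and accumulates into
    # detached.grad instead of traversing the backbone once per chunk.
    detached = hidden_states.detach().requires_grad_(True)
    seq_len = hidden_states.shape[1]
    chunk_len = -(-seq_len // num_chunks)  # ceil division
    total_tokens = labels.numel()

    total_loss = 0.0
    for start in range(0, seq_len, chunk_len):
        hs_chunk = detached[:, start : start + chunk_len]
        label_chunk = labels[:, start : start + chunk_len]
        logits = lm_head(hs_chunk)  # [batch, chunk_len, vocab]
        loss = F.cross_entropy(
            logits.flatten(0, 1), label_chunk.flatten(), reduction="sum"
        ) / total_tokens
        loss.backward()  # frees this chunk's logits before the next chunk
        total_loss += loss.item()

    # One backward pass through the backbone with the accumulated gradient.
    hidden_states.backward(detached.grad)
    return total_loss
```

Because each chunk's backward frees that chunk's logits immediately, peak logits memory scales with seq_len / num_chunks rather than the full sequence length.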

if cp_enabled:
    left_pad_logit = get_padding_logit_from_prev_cp_rank(logits, cp_rank, cp_size, cp_group)
else:
    left_pad_logit = None

Missing left pad logit between chunks breaks shift

When logits_chunks > 1, the shift_logits function requires the last logit from the previous chunk to properly shift logits at chunk boundaries. However, the current implementation only considers CP rank boundaries, not chunk boundaries. For chunks after the first one: without CP, left_pad_logit is None (causing zeros to be used); with CP, get_padding_logit_from_prev_cp_rank returns logits from a different rank rather than the previous chunk. The logits tensor is also deleted at line 350 before it can be used for the next chunk. This causes incorrect trainer_logprobs calculations and corrupted loss values whenever chunking is enabled.
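A self-contained toy illustration of the boundary effect, using a stand-in shift function rather than the trainer's actual shift_logits: naive per-chunk shifting zeros out the boundary position, while carrying the previous chunk's last logit reproduces the full-sequence result (in the real code, that logit would also have to be saved before the chunk's logits are deleted).

```python
import torch

def shift_logits(logits, left_pad_logit=None):
    """Toy stand-in for the trainer's shift_logits: each position needs the
    logit of the position to its left, so the row is shifted right by one."""
    if left_pad_logit is None:
        left_pad_logit = torch.zeros_like(logits[:, :1])
    return torch.cat([left_pad_logit, logits[:, :-1]], dim=1)

logits = torch.randn(1, 8, 4)
reference = shift_logits(logits)  # full-sequence shift

# Naive per-chunk shifting: the second chunk's first position gets zeros
# instead of the last logit of the first chunk.
chunks = logits.chunk(2, dim=1)
naive = torch.cat([shift_logits(c) for c in chunks], dim=1)
assert not torch.allclose(naive, reference)

# Carrying the previous chunk's last logit as the left pad restores the
# full-sequence result at the chunk boundary.
carried, prev_last = [], None
for c in chunks:
    carried.append(shift_logits(c, left_pad_logit=prev_last))
    prev_last = c[:, -1:]
assert torch.allclose(torch.cat(carried, dim=1), reference)
```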

ge=1,
description="Number of chunks to split the sequence into for logits materialization. Higher values reduce memory usage but may increase computation time. Default is 1 (no chunking).",
),
] = 1

New config field missing CHANGELOG entry (Bugbot Rules)

A new config field logits_chunks has been added to src/prime_rl/trainer/rl/config.py, which matches the pattern src/prime_rl/*/config.py. According to the review rules, any PR that modifies configuration structures must update CHANGELOG.md, but no corresponding entry was added.

loss_mask=loss_mask_chunk.squeeze().split(response_lengths_chunk),
loss_config=config.loss,
loss_scale=loss_scale,
)

Chunking breaks sequence boundary detection for loss

When logits_chunks > 1, get_response_lengths(position_ids_chunk) is called on each chunk independently. This function detects sequence boundaries by looking for position_ids resetting to 0 followed by 1. When sequences span chunk boundaries, they are incorrectly identified as separate sequences in each chunk. This causes incorrect sequence-level loss normalization (when ratio_type == "sequence"), wrong sequence-level importance ratio calculations, and incorrect application of sequence_mask_low/sequence_mask_high thresholds. The loss values and gradient flow will be semantically incorrect for any packed batch where sequence boundaries don't align with chunk boundaries.
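A toy illustration of the miscount, using a stand-in boundary detector rather than the actual get_response_lengths: a packed sequence that straddles the chunk boundary is measured as two shorter sequences when each chunk is inspected independently.

```python
import torch

def response_lengths(position_ids):
    """Toy stand-in for get_response_lengths: a new packed sequence starts at
    index 0 and wherever position_ids reset to 0."""
    starts = [0] + [i for i in range(1, position_ids.shape[1]) if position_ids[0, i] == 0]
    ends = starts[1:] + [position_ids.shape[1]]
    return [end - start for start, end in zip(starts, ends)]

# One row packing two sequences of lengths 6 and 2.
position_ids = torch.tensor([[0, 1, 2, 3, 4, 5, 0, 1]])
print(response_lengths(position_ids))        # [6, 2]

# Chunking in half splits the 6-token sequence across the boundary, and each
# chunk is measured on its own, so it is reported as lengths 4 and 2.
for chunk in position_ids.chunk(2, dim=1):
    print(response_lengths(chunk))           # [4], then [2, 2]
```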

position_ids_list = position_ids.chunk(num_logits_chunks, dim=1)
inference_logprobs_list = inference_logprobs.chunk(num_logits_chunks, dim=1)
advantages_list = advantages.chunk(num_logits_chunks, dim=1)
loss_mask_list = loss_mask.chunk(num_logits_chunks, dim=1)

CP mode mixes sharded and non-sharded tensors when chunking

When context parallelism is enabled, input_ids and hidden_states are sharded (have seq_len/cp_size tokens), while position_ids, inference_logprobs, advantages, and loss_mask remain non-sharded (have full seq_len tokens). The num_logits_chunks is computed using the full seq_len, but then applied to chunk both sharded and non-sharded tensors. This causes the chunked tensors to have mismatched sizes - for example, hidden_states_chunk[i] may have 2048 tokens while position_ids_chunk[i] has 4096 tokens - leading to incorrect loss computation or runtime errors in CP mode.
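Toy shapes (illustrative values, not the trainer's real config) showing how chunking sharded and non-sharded tensors with the same num_logits_chunks produces mismatched pairs:

```python
import torch

seq_len, cp_size, num_logits_chunks = 8192, 2, 4

# Under context parallelism, hidden_states holds only this rank's shard of the
# sequence, while position_ids / advantages / loss_mask hold the full sequence.
hidden_states = torch.randn(1, seq_len // cp_size, 16)  # 4096 tokens on this rank
position_ids = torch.arange(seq_len).unsqueeze(0)       # 8192 tokens

hs_chunks = hidden_states.chunk(num_logits_chunks, dim=1)
pos_chunks = position_ids.chunk(num_logits_chunks, dim=1)

# Pairing chunk i of each list misaligns the loss inputs: 1024 vs 2048 tokens.
print(hs_chunks[0].shape[1], pos_chunks[0].shape[1])  # 1024 2048
```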

tensors[key].append(loss_tensor)

# Now backward through the rest of the model with accumulated gradients
hidden_states.backward(detached_hidden_states.grad)

Debug log shows only last chunk's loss not total

After chunking, tensors['loss'], tensors['entropy'], and tensors['mismatch_kl'] are appended once per chunk rather than once per micro-batch. The debug log message uses tensors['loss'][-1] which previously retrieved the micro-batch's total loss but now only retrieves the last chunk's loss. This makes the debug output misleading and complicates training diagnostics.
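A hedged sketch of the logging pitfall (names and scaling assumptions are illustrative, not the trainer's actual code): with one entry appended per chunk, the last element is only the final chunk's contribution, while summing the chunk entries recovers a per-micro-batch value, assuming the per-chunk losses are already scaled so they add up to the micro-batch loss.

```python
chunk_losses = [0.9, 1.1, 1.0, 1.2]  # one entry appended per chunk
tensors = {"loss": []}
tensors["loss"].extend(chunk_losses)

last_chunk_loss = tensors["loss"][-1]                         # 1.2, misleading
micro_batch_loss = sum(tensors["loss"][-len(chunk_losses):])  # 4.2, the total
print(f"loss={micro_batch_loss:.2f} (last chunk: {last_chunk_loss:.2f})")
```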
