
[feat] Add RLHF rollout integration support (verl)#549

Open
sijyang wants to merge 12 commits into main from sijyang/verl_dev

Conversation

@sijyang sijyang commented Apr 13, 2026

Overview

This PR enables ATOM to serve as a rollout backend for verl, a distributed RLHF training framework. In RLHF training, the system alternates between two phases:

  1. Training phase — the trainer updates model weights via gradient descent (handled by verl)
  2. Rollout phase — the inference engine generates responses using the latest weights (handled by ATOM)

This requires ATOM to support a new lifecycle that traditional serving doesn't need: receiving weight updates from an external trainer, dynamically releasing/reclaiming GPU memory between phases, and coordinating these operations across multiple DP ranks and TP ranks.

The integration is designed as a plugin layer (atom/rollout/) that extends ATOM's existing engine without modifying its core inference path. All changes to existing files are purely incremental additions (new methods, new fields, new message types); no existing behavior is altered.

Architecture

verl trainer (PyTorch DDP)
        │
ATOMHttpServer (verl side, per-node)
        │ ZMQ RPC
AsyncLLMEngine (atom/rollout/async_engine.py)
├── sleep() → release KV cache, free GPU memory for training
├── wake_up() → reallocate KV cache, ready for generation
├── load_weights() → receive updated weights via CUDA IPC
└── generate() → standard ATOM inference with logprobs
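The lifecycle above alternates between training and rollout phases. A minimal sketch of that loop, with a stand-in engine object (the `FakeEngine` stub, call ordering, and `rollout_phase` helper are illustrative assumptions; only the `sleep`/`wake_up`/`load_weights`/`generate` method names come from this PR):

```python
class FakeEngine:
    """Illustrative stand-in for AsyncLLMEngine; records call ordering only."""

    def __init__(self):
        self.sleeping = True   # engine starts parked so the trainer owns the GPU
        self.version = 0
        self.calls = []

    def wake_up(self):
        # Real engine: reallocate KV cache here.
        self.sleeping = False
        self.calls.append("wake_up")

    def sleep(self):
        # Real engine: release KV cache, return GPU memory for training.
        self.sleeping = True
        self.calls.append("sleep")

    def load_weights(self, version):
        # Real engine: receive updated weights via CUDA IPC.
        self.version = version
        self.calls.append(f"load_weights:{version}")

    def generate(self, prompts):
        assert not self.sleeping, "must wake_up() before generate()"
        self.calls.append("generate")
        return [f"response[{p}]@v{self.version}" for p in prompts]


def rollout_phase(engine, step, prompts):
    """One rollout phase: reclaim memory, sync weights, generate, then park."""
    engine.wake_up()
    engine.load_weights(version=step)
    outputs = engine.generate(prompts)
    engine.sleep()
    return outputs
```

Between calls to `rollout_phase`, the trainer runs its gradient step while the engine sleeps.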

Weight Synchronization

Weight transfer uses CUDA IPC (weight_sync.py → weight_updater.py) for zero-copy GPU-to-GPU transfer via cudaIpcGetMemHandle/cudaIpcOpenMemHandle. Weights are packed into a GPU buffer, and IPC handles are sent to ModelRunner subprocesses. On multi-GPU setups (DP>1), per-GPU buffers ensure same-device IPC.

Weights are accumulated into fixed-size buckets and flushed incrementally, keeping peak memory overhead bounded regardless of model size.
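The bucketed flush can be modeled in a few lines (a sketch of the accounting only; the bucket size, `flush` callback, and byte bookkeeping are assumptions, not ATOM's actual API):

```python
def flush_in_buckets(named_sizes, bucket_bytes, flush):
    """Accumulate (name, nbytes) weight entries and flush a bucket whenever
    the next entry would push it past bucket_bytes, so peak staging memory
    stays bounded regardless of total model size."""
    bucket, used = [], 0
    for name, nbytes in named_sizes:
        if bucket and used + nbytes > bucket_bytes:
            flush(bucket)
            bucket, used = [], 0
        bucket.append(name)
        used += nbytes
    if bucket:
        flush(bucket)  # final partial bucket
```

An entry larger than `bucket_bytes` still goes through as a bucket of one, which keeps the scheme total.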

Weight Update Pipeline

weight_updater.py handles the ModelRunner side of weight loading:

  • Maps incoming parameter names to ATOM's internal weight names (handling TP sharding, column/row parallel splits)
  • Supports packed weights (e.g., QKV fused) by slicing incoming tensors to correct offsets
  • Handles FP8 requantization — when the model uses FP8, incoming FP16/BF16 weights are quantized in-place with updated scales
  • Clears KV cache after weight update to prevent stale cache from previous weights
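The packed-weight slicing plus TP-sharding arithmetic can be illustrated as follows (a sketch: the `[Q; K; V]` row layout along dim 0 and contiguous per-rank shards are assumptions about the fused format, not ATOM's exact mapping code):

```python
def fused_qkv_shard(q_rows, kv_rows, which, tp_rank, tp_size):
    """Row range a given TP rank loads from an incoming fused [Q; K; V]
    tensor split along dim 0. q_rows/kv_rows are the total output rows for
    Q and for each of K and V; each component is sharded independently."""
    base = {"q": 0, "k": q_rows, "v": q_rows + kv_rows}[which]
    size = q_rows if which == "q" else kv_rows
    per_rank = size // tp_size
    start = base + tp_rank * per_rank
    return start, start + per_rank
```

For example, with `q_rows=8`, `kv_rows=4`, `tp_size=2`, rank 1 loads rows 10..12 of the fused tensor for K: the K block starts at row 8 and each rank owns 2 of its 4 rows.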

GPU Memory Lifecycle

memory_manager.py manages the sleep/wake cycle:

  • Sleep: deallocate KV cache blocks → torch.cuda.empty_cache() → memory returned to PyTorch/ROCm for trainer
  • Wake: empty_cache() → recalculate available blocks → reallocate KV cache → ready for inference
  • Each DP rank manages its own KV cache independently
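The block arithmetic behind the sleep/wake cycle looks roughly like this (a toy model with no GPU calls; the block size, utilization factor, and external free-bytes query are illustrative assumptions):

```python
class KVCacheLifecycle:
    """Toy model of the sleep/wake block accounting; no GPU calls."""

    def __init__(self, block_bytes):
        self.block_bytes = block_bytes
        self.num_blocks = 0
        self.sleeping = True

    def wake(self, free_bytes, utilization=0.9):
        # Recalculate how many KV blocks fit in the memory currently free,
        # mirroring the recompute-on-wake step: capacity can differ between
        # cycles because the trainer's residual allocations vary.
        self.num_blocks = int(free_bytes * utilization) // self.block_bytes
        self.sleeping = False
        return self.num_blocks

    def sleep(self):
        # Drop all blocks; the real path also calls torch.cuda.empty_cache()
        # so the caching allocator returns the memory to the device.
        self.num_blocks = 0
        self.sleeping = True
```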

DP Isolation

model_runner_ext.py (RLHFModelRunner) extends ATOM's ModelRunner for DP-isolated execution. Each DP rank's ModelRunners form an independent NCCL world scoped to TP only, with correct physical-to-logical device mapping and NCCL binding patches for ROCm multi-GPU setups.
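The physical-to-logical mapping for one DP rank's TP-only world can be sketched as below (assuming contiguous GPU assignment per DP rank, an illustration rather than the PR's exact binding logic):

```python
def tp_world_devices(dp_rank, tp_size):
    """Physical GPU ids owned by one DP rank's TP-only NCCL world. Logical
    (local) rank i inside that world maps to physical device
    dp_rank * tp_size + i, so no collective ever crosses DP boundaries."""
    return [dp_rank * tp_size + i for i in range(tp_size)]
```

Each DP rank then initializes its own NCCL communicator over just these devices instead of one world spanning all GPUs.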

Changes

New files (atom/rollout/)

File Purpose
__init__.py Package exports
async_engine.py AsyncLLMEngine wrapper (sleep/wake/load_weights API)
engine_utility.py Utility command handlers (update_weights, release/resume_memory)
memory_manager.py GPU memory lifecycle (KV cache alloc/release, weight discard/resume)
model_runner_ext.py RLHFModelRunner with DP isolation, NCCL device binding patch
weight_sync.py Weight transfer via CUDA IPC (per-GPU buffers)
weight_updater.py Weight update logic (packed weights, FP8 requantize, TP sharding)

Incremental changes to existing files

  • engine_core.py: utility queue, sleep mode, UTILITY_RESPONSE message type, DP sleep state sync
  • engine_core_mgr.py: utility_response_queue, broadcast_utility_command, broadcast_utility_command_sync
  • llm_engine.py: request_ids and logprobs support in add_request/generate/postprocess
  • async_proc.py: TP-rank barrier for safe weight update buffer reuse
  • scheduler.py: logprobs tracking in ScheduledBatch/ScheduledBatchOutput
  • sequence.py: request_id, return_logprobs, logprobs fields
  • sampling_params.py: logprobs parameter
  • config.py: runner_qualname, compilation_config dict→object conversion

@sijyang sijyang closed this Apr 13, 2026
@sijyang sijyang changed the title Sijyang/verl dev Add RLHF rollout integration support (verl) Apr 13, 2026
@sijyang sijyang reopened this Apr 13, 2026
@sijyang sijyang changed the title Add RLHF rollout integration support (verl) [feat]: Add RLHF rollout integration support (verl) Apr 13, 2026
@sijyang sijyang changed the title [feat]: Add RLHF rollout integration support (verl) [feat] Add RLHF rollout integration support (verl) Apr 13, 2026
shutdown = shutdown or self.pull_and_process_input_queue()
if shutdown:
break
if self._is_sleeping:

Why do we need such a sleep? I think it's not a good idea to sleep in our engine core.

# agree to skip model execution together — MoE expert routing and
# dummy_execution also contain DP-wide collectives that would hang
# if only some cores participated.
local_sleeping = self._is_sleeping

Same concern here.

# Recapture CUDA graphs after KV cache re-allocation (addresses changed)
self._recapture_cudagraphs_if_needed()

def _recapture_cudagraphs_if_needed(self) -> None:

Will this be triggered frequently? CUDA graph capture is time-consuming overhead, so it shouldn't be constantly recaptured.

return req_ids_out, processed_out, logprobs_map


class RLHFModelRunner(ModelRunner, WeightUpdaterMixin, MemoryManagerMixin):

Please find a better way to reuse the parent ModelRunner in ATOM.

logger = logging.getLogger("atom")


class LogprobsTokenIDProcessor(tokenIDProcessor):

ATOM's TokenIDProcessor should be enough; can we move logprobs into TokenIDProcessor and remove this class?
