
Questions about multinode RL & dataloader #1412

@alekseymalakhov11

Description


Hi! 👋

First of all, thank you for open-sourcing prime-rl - it looks really impressive 🙌

I have two questions:

1. Multiple inference actors with NCCL weight broadcasts

If I understand correctly, it's currently not possible to use multiple inference actors with NCCL weight broadcasts in either intra- or inter-node setups, because of this validator in prime-rl/src/prime_rl/inference/config.py:

```python
@model_validator(mode="after")
def nccl_and_dp(self):
    if self.weight_broadcast.type == "nccl" and self.parallel.dp != 1:
        raise ValueError("NCCL broadcast backend requires data parallel size to be 1")
    return self
```

Is that interpretation correct?

If so, is the recommended approach to launch multiple inference actors each with dp = 1 and then pass the list of these environments to the orchestrator?
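For reference, here is a stdlib-only sketch of the constraint that validator enforces (hypothetical flat field names standing in for `weight_broadcast.type` and `parallel.dp`; not the real pydantic config classes):

```python
from dataclasses import dataclass


@dataclass
class InferenceConfigSketch:
    # Hypothetical stand-ins for weight_broadcast.type and parallel.dp.
    broadcast_type: str = "nccl"
    dp: int = 1

    def __post_init__(self):
        # Same check as the nccl_and_dp validator quoted above.
        if self.broadcast_type == "nccl" and self.dp != 1:
            raise ValueError("NCCL broadcast backend requires data parallel size to be 1")


# A dp=1 actor validates; dp=2 with NCCL is rejected, which is why
# scaling out would seem to require several dp=1 actors side by side.
InferenceConfigSketch(dp=1)
try:
    InferenceConfigSketch(dp=2)
except ValueError as e:
    print("rejected:", e)
```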

2. Data loading via shared NFS

Why did you decide to use shared NFS storage to pass data between entities? And if possible, could you share any benchmarks comparing network-based transfer vs storage-based approaches?

Thanks a lot for any clarification you can provide, and again, amazing work on this project!
