
Questions about multinode RL & dataloader #1412

@alekseymalakhov11

Description


Hi! 👋

First of all, thank you for open-sourcing prime-rl - it looks really impressive 🙌

I have two questions:

1. Multiple inference actors with NCCL weight broadcasts

If I understand correctly, it's currently not possible to use multiple inference actors with NCCL weight broadcasts in either intra- or inter-node setups, because of this validator in prime-rl/src/prime_rl/inference/config.py:

```python
@model_validator(mode="after")
def nccl_and_dp(self):
    if self.weight_broadcast.type == "nccl" and self.parallel.dp != 1:
        raise ValueError("NCCL broadcast backend requires data parallel size to be 1")
    return self
```

Is that interpretation correct?

If so, is the recommended approach to launch multiple inference actors each with dp = 1 and then pass the list of these environments to the orchestrator?
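For reference, here is a stdlib-only sketch of the constraint that validator enforces (hypothetical flat field names standing in for `weight_broadcast.type` and `parallel.dp`; not the real pydantic config classes):

```python
from dataclasses import dataclass


@dataclass
class InferenceConfigSketch:
    # Hypothetical stand-ins for weight_broadcast.type and parallel.dp.
    broadcast_type: str = "nccl"
    dp: int = 1

    def __post_init__(self):
        # Same check as the nccl_and_dp validator quoted above.
        if self.broadcast_type == "nccl" and self.dp != 1:
            raise ValueError("NCCL broadcast backend requires data parallel size to be 1")


# A dp=1 actor validates; dp=2 with NCCL is rejected, which is why
# scaling out would seem to require several dp=1 actors side by side.
InferenceConfigSketch(dp=1)
try:
    InferenceConfigSketch(dp=2)
except ValueError as e:
    print("rejected:", e)
```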

2. Data loading via shared NFS

Why did you decide to use shared NFS storage to pass data between entities? And if possible, could you share any benchmarks comparing network-based transfer vs storage-based approaches?

Thanks a lot for any clarification you can provide, and again, amazing work on this project!
