Hi! 👋
First of all, thank you for open-sourcing prime-rl - it looks really impressive 🙌
I have two questions:
1. Multiple inference actors with NCCL weight broadcasts
If I understand correctly, in intra/inter-node setups it’s currently not possible to use multiple inference actors with NCCL weight broadcasts because of this validator in prime-rl/src/prime_rl/inference/config.py:
```python
@model_validator(mode="after")
def nccl_and_dp(self):
    if self.weight_broadcast.type == "nccl" and self.parallel.dp != 1:
        raise ValueError("NCCL broadcast backend requires data parallel size to be 1")
    return self
```

Is that interpretation correct?
If so, is the recommended approach to launch multiple inference actors each with dp = 1 and then pass the list of these environments to the orchestrator?
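To make the workaround I have in mind concrete, here is a minimal, self-contained sketch of what I mean by "multiple actors each with dp = 1". The function and config-dict names (`make_actor_config`, `check`) are hypothetical stand-ins, not prime-rl's actual API; only the validation condition mirrors the quoted validator.

```python
# Hypothetical sketch: build one inference-actor config per actor,
# each pinned to dp=1 so the NCCL broadcast constraint is satisfied.
def make_actor_config(actor_id: int, gpu: int) -> dict:
    """Per-actor config with data parallelism fixed to 1 (name is illustrative)."""
    return {
        "actor_id": actor_id,
        "gpu": gpu,
        "weight_broadcast": {"type": "nccl"},
        "parallel": {"dp": 1},
    }

def check(cfg: dict) -> dict:
    """Re-apply the same condition as the validator quoted above."""
    if cfg["weight_broadcast"]["type"] == "nccl" and cfg["parallel"]["dp"] != 1:
        raise ValueError("NCCL broadcast backend requires data parallel size to be 1")
    return cfg

# Four independent actors, each valid under the NCCL constraint;
# the resulting list would then be handed to the orchestrator.
actors = [check(make_actor_config(i, gpu=i)) for i in range(4)]
print(len(actors))  # → 4
```

The idea being that data parallelism is pushed up a level: instead of one actor with dp = 4, four dp = 1 actors are launched and the orchestrator fans out across them.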
2. Data loading via shared NFS
Why did you decide to use shared NFS storage to pass data between entities? And if possible, could you share any benchmarks comparing network-based transfer with storage-based approaches?
Thanks a lot for any clarification you can provide, and again, amazing work on this project!