We have observed limited functionality for the collectives (AllReduce, AllGather, ReduceScatter) when running NCCLX-CTRAN with NCCL Tests, while send-recv operations work as expected. We enable CTRAN in our tests with NCCL_CTRAN_ENABLE=1 and NCCL_[operation]_ALGO=ctran. A few clarifying questions in this regard:
- Has NCCLX-CTRAN been tested with NCCL Tests?
- We consistently observe very low performance with CTRAN AllReduce and AllGather in NCCL Tests. Is this expected?
- Is there a commit where all NCCLX-CTRAN operations are functional with NCCL Tests?
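
For concreteness, the settings described above amount to an environment along these lines. This is a sketch rather than our exact command: NCCL_ALLREDUCE_ALGO is one assumed expansion of the NCCL_[operation]_ALGO pattern, and the commented invocation uses the stock nccl-tests binary and flags purely for illustration.

```shell
# Enable CTRAN, as described above.
export NCCL_CTRAN_ENABLE=1

# Assumed expansion of NCCL_[operation]_ALGO for the AllReduce case;
# the other collectives would follow the same pattern.
export NCCL_ALLREDUCE_ALGO=ctran

# Illustrative stock nccl-tests run (8 B to 128 MB, doubling each step, 1 GPU per process):
# ./build/all_reduce_perf -b 8 -e 128M -f 2 -g 1
```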
We’d also like to use DQPLB in our testing. It looks like multiple factors (CVARs and the connection type derived from the topology file) govern whether a QP uses dqplb or spray.
- Is there a heavy-hammer way to turn on DQPLB for all operations and connection types? Is it NCCL_CTRAN_IB_VC_MODE=dqplb?
- We noticed that DQPLB is turned off for cross-DC by default. Given that cross-DC was one of the motivations for DQPLB, how should we interpret this?
- Once we have set the factors that select DQPLB, what logging level would best verify that DQPLB is actually being chosen, and how do we set it?
@MaayanSheraizinNV