Like CLIP, are the embeddings generated by the Universal Encoder comparable across modalities? If so, we could perform search and matching based on embedding similarity for data from different modalities, as sketched below. Could you release the encoder part of the model separately for testing? The full 15 GB model is too large for now.
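
A minimal sketch of the kind of cross-modal matching I have in mind, using cosine similarity between embeddings. The `text_emb` / `image_emb` arrays here are random placeholders, and any `encode_text` / `encode_image` style calls mentioned in the comments are assumptions, not the actual Encoder API:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    """Row-wise cosine similarity: (N, D) x (M, D) -> (N, M) score matrix."""
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

# Placeholder embeddings standing in for the Universal Encoder's output.
# In practice these would come from something like encoder.encode_text(...)
# and encoder.encode_image(...) -- names are assumptions, not the real API.
text_emb = np.random.randn(4, 512)    # 4 text queries, 512-dim embeddings
image_emb = np.random.randn(10, 512)  # 10 candidate images, same dimension

sims = cosine_similarity(text_emb, image_emb)  # (4, 10) similarity matrix
best_match = sims.argmax(axis=1)               # top image index per text query
print(best_match)
```

If the embeddings from different modalities live in a shared space (as in CLIP), this kind of retrieval should work directly; otherwise some alignment step would be needed.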