RFC-0047-flagcx-support #84
Conversation
Hi @MC952-arch! Thank you for your pull request and welcome to our community.

**Action Required**

In order to merge any pull request (code, docs, etc.), we require contributors to sign our Contributor License Agreement, and we don't seem to have one on file for you.

**Process**

In order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (e.g. your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA.

Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with `CLA signed`.

If you have received this in error or have any questions, please contact us at cla@meta.com. Thanks!
Thank you for signing our Contributor License Agreement. We can now accept your code for this (and any) Meta Open Source project. Thanks!
Hi, have you looked into implementing this as an out-of-tree backend (either for ProcessGroup or for torchcomms)?
Yes, we already have one: a working out-of-tree backend exists today.

**Comparison with UCC**

UCC is the closest comparison. It's a solid multi-vendor framework, and its heterogeneous support works by decomposing collectives into homogeneous subgroups; that's exactly what FlagCX does as well. Where FlagCX differs:

- **Heterogeneous communicator topology.** At initialization time, FlagCX tries to detect vendor strings across all ranks, partitions them into clusters (same-vendor groups), and builds both a homogeneous communicator per cluster and a heterogeneous communicator spanning clusters.
- **Device-buffer RDMA.** Cross-cluster transfers go through GDR-allocated buffers and IB RDMA directly: no D2H staging, no host collective, no H2D copy back. The IB adaptor has a full retransmission protocol (sequence numbers, SACK bitmaps, RTT estimation) for adaptive-routing fabrics.
- **Vendor coverage.** FlagCX supports 11 backends: NVIDIA, AMD, Ascend, Iluvatar Corex, Cambricon, Metax, Musa, Kunlunxin, DU, TSMicro, and Enflame. That covers a much broader AI-chip ecosystem than UCC reaches.

**Why optionally in-tree**

The out-of-tree plugin works today; this RFC proposes additionally offering it as an optional in-tree backend. Users who do not enable or install FlagCX see no change.
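To make the two-level topology concrete, here is a minimal sketch of the cluster-partition step described above. This is illustrative pseudocode, not FlagCX's actual API: the function names and the choice of the first rank per cluster as the cross-cluster "leader" are assumptions for the example.

```python
# Illustrative sketch (hypothetical names, not FlagCX's real API):
# partition ranks into same-vendor clusters, then derive one homogeneous
# group per cluster plus the leader ranks that bridge clusters.
from collections import OrderedDict

def partition_clusters(vendor_by_rank):
    """Group ranks by detected vendor string, preserving rank order."""
    clusters = OrderedDict()
    for rank, vendor in enumerate(vendor_by_rank):
        clusters.setdefault(vendor, []).append(rank)
    return clusters

def communicator_layout(vendor_by_rank):
    clusters = partition_clusters(vendor_by_rank)
    homo_groups = list(clusters.values())          # one communicator per cluster
    leaders = [ranks[0] for ranks in homo_groups]  # assumed bridge ranks
    return homo_groups, leaders

# Example: 4 NVIDIA ranks followed by 4 Ascend ranks.
vendors = ["NVIDIA"] * 4 + ["Ascend"] * 4
groups, leaders = communicator_layout(vendors)
# groups  -> [[0, 1, 2, 3], [4, 5, 6, 7]]
# leaders -> [0, 4]
```

In this layout, intra-cluster collectives run on the homogeneous groups, and only the leader ranks participate in the cross-cluster exchange.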
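And a minimal sketch of the receiver side of a SACK-bitmap retransmission scheme like the one mentioned for the IB adaptor. Again, this is a hypothetical illustration of the general technique (cumulative ack plus a selective-ack bitmap over a window), not the adaptor's actual code.

```python
# Hypothetical sketch of SACK-style selective acknowledgment:
# packets carry sequence numbers; the receiver tracks a cumulative ack
# and a bitmap of out-of-order arrivals so the sender can retransmit
# only the missing holes.
class SackReceiver:
    def __init__(self, window=64):
        self.cum_ack = 0   # every seq < cum_ack has been received in order
        self.bitmap = 0    # bit i set => seq (cum_ack + i) received
        self.window = window

    def on_packet(self, seq):
        if seq < self.cum_ack:
            return                      # duplicate, already acknowledged
        offset = seq - self.cum_ack
        if offset >= self.window:
            return                      # outside the receive window, drop
        self.bitmap |= 1 << offset
        # Advance the cumulative ack over the now-contiguous prefix.
        while self.bitmap & 1:
            self.bitmap >>= 1
            self.cum_ack += 1

    def holes(self, sent_upto):
        """Sequence numbers in [cum_ack, sent_upto) the sender must resend."""
        return [self.cum_ack + i
                for i in range(sent_upto - self.cum_ack)
                if not (self.bitmap >> i) & 1]

# Packets 0 and 2 arrive, 1 is lost in flight:
r = SackReceiver()
r.on_packet(0)
r.on_packet(2)
# r.cum_ack -> 1; r.holes(5) -> [1, 3, 4]
```

The RTT estimation mentioned in the comment would sit on the sender side, deciding *when* to resend the holes; it is omitted here for brevity.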
Proposal to Add FlagCX backend to torch.distributed for Heterogeneous Cross-chip Communication.
Rendered version: https://github.com/MC952-arch/rfcs/blob/flagcx-support/RFC-0047-flagcx-support.md