Topology-aware auto-tuning for Triton alltoallv kernel #2008

Open
snarayankh wants to merge 2 commits into meta-pytorch:main
Conversation
Contributor
@snarayankh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99965741.
snarayankh pushed a commit to snarayankh/torchcomms that referenced this pull request on Apr 9, 2026
…2008) Summary:
Add topology-aware auto-tuning that assigns different kernel parameters to NVLink vs IB peers for the Triton AlltoAllv kernel.

Motivation
==========
On multi-node H100, intra-node peers communicate via NVLink (~900 GB/s) while inter-node peers use InfiniBand via GIN RDMA (~50 GB/s per NIC). The existing auto-tuner applies the same parameters (16 blocks/peer, 64KB chunks) to all peers regardless of transport. NVLink benefits from high block parallelism (each block independently saturates NVLink via cooperative memcpy), while IB benefits from fewer blocks (the NIC handles pipelining; extra blocks just add WQE contention).

Design
======
1. **Host-side topology detection** (alltoallv_op.py): AlltoallvOp.setup() calls window.get_nvlink_address(peer) after tensor_register to classify each peer as NVL (non-zero) or IB (zero).
2. **Split auto-tuning** (device_alltoallv_dynamic.py): auto_tune_alltoallv_params() now accepts peer_is_nvl and returns per-peer block counts. _tune_for_nvl() gives 8-16 blocks/peer; _tune_for_ib() gives 1-4 blocks/peer with larger chunks.
3. **Per-peer block masking** (kernel): The kernel launches with BLOCKS_PER_PEER = max(nvl_bpp, ib_bpp) as a constexpr upper bound. A runtime per_peer_blocks_ptr tensor holds the actual block count per peer. Excess blocks early-return. Completion counters and recv signal waits use actual_bpp instead of the uniform BLOCKS_PER_PEER.
4. **Backward compatible**: On single-node (all NVL), per_peer_blocks is None, so HAS_PER_PEER_BLOCKS=False (constexpr) and all masking code is compiled away by Triton. Zero performance impact on the existing single-node path.

IB tuning rationale
===================
- blocks_per_peer: 1-4 (vs 8-16 for NVL). IB transfers use GIN put where thread 0 posts an RDMA WQE; extra blocks add WQE completion queue overhead without proportional bandwidth gain.
- chunk_size: 256KB (vs 64KB for NVL). Larger chunks amortize the per-WQE IB round-trip latency.
- num_warps: 4-8 (vs 16-32 for NVL). Only thread 0 posts WQEs; extra warps only help the staging memcpy to the GIN buffer.

Files changed
=============
- device_alltoallv_dynamic.py: _tune_for_nvl(), _tune_for_ib(), per_peer_blocks kernel parameter, block masking, signal updates
- alltoallv_op.py: topology detection in setup(), per_peer_blocks tensor creation, teardown cleanup

Differential Revision: D99965741
snarayankh force-pushed from 600c9af to 9f9a676
added 2 commits on April 9, 2026 at 14:33
…pytorch#1996) Summary:
The `register_local_buffer()` API was changed to return a single `int64` (a device pointer to a `RegisteredBuffer` struct) instead of the previous 3-tuple `(base_ptr, size, nccl_win)`. This broke the Triton alltoallv kernel and all its consumers.

The fix:
- Remove dead kernel parameters `src_base_ptr` and `src_size`, which were never used in the kernel body.
- Extract the `backend_window` (ncclWindow_t) field from the device-side `RegisteredBuffer` struct at byte offset 16 via cudaMemcpy D2H. This is the value that the `put_block_direct` and `put_warp_chunked_direct` externs need for NVLink inline-PTX memcpy and GIN RDMA fallback.
- Fix all `deregister_local_buffer(*src_info)` splat-unpack calls to pass the single int handle directly.

This is a pure API adaptation with zero kernel behavioral change. The externs receive the identical `ncclWindow_t` value as before. The NVLink inline-PTX optimization in `put_block_direct` is preserved (no switch to the slower `put_block` path).

RegisteredBuffer struct layout (all backends):
- offset 0: base_ptr (void*, 8 bytes)
- offset 8: size (size_t, 8 bytes)
- offset 16: backend_window (void*, 8 bytes) ← extracted for put_block_direct
- offset 24: lkey (uint32_t, 4 bytes)

Files changed:
- device_alltoallv_dynamic.py: kernel params, host-side field extraction, kernel launch
- alltoallv_op.py: type annotation, deregister call
- test_device_alltoallv_dynamic_e2e.py: deregister calls
- benchmark_device_alltoallv_dynamic.py: deregister calls

Differential Revision: D99914167
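The field extraction described above can be modeled on the host side with Python's `struct` module. This is an illustrative sketch of decoding the 32-byte `RegisteredBuffer` layout from the commit message after a cudaMemcpy D2H; the function name and sample values are hypothetical, only the offsets come from the summary.

```python
import struct

def parse_registered_buffer(raw: bytes) -> dict:
    """Decode a host copy of the device-side RegisteredBuffer struct.

    Layout (per the commit message):
      offset 0:  base_ptr        (void*,    8 bytes)
      offset 8:  size            (size_t,   8 bytes)
      offset 16: backend_window  (void*,    8 bytes)  <- needed by put_block_direct
      offset 24: lkey            (uint32_t, 4 bytes)
    """
    base_ptr, size, backend_window = struct.unpack_from("<QQQ", raw, 0)
    (lkey,) = struct.unpack_from("<I", raw, 24)
    return {"base_ptr": base_ptr, "size": size,
            "backend_window": backend_window, "lkey": lkey}

# Fabricated 32-byte struct image standing in for the D2H copy:
raw = struct.pack("<QQQI4x", 0x7F00_0000_0000, 1 << 20, 0x7F00_1234_0000, 42)
fields = parse_registered_buffer(raw)
print(hex(fields["backend_window"]))  # the ncclWindow_t at byte offset 16
```

In the actual fix the source of `raw` would be a device-to-host copy of the pointer returned by `register_local_buffer()`, not a packed bytes literal.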
…2008) Topology-aware auto-tuning for Triton alltoallv kernel (full summary in the pushed commit above). Differential Revision: D99965741
snarayankh force-pushed from 9f9a676 to f82e312
Summary:
Add topology-aware auto-tuning that assigns different kernel parameters
to NVLink vs IB peers for the Triton AlltoAllv kernel.
Motivation
On multi-node H100, intra-node peers communicate via NVLink (~900 GB/s)
while inter-node peers use InfiniBand via GIN RDMA (~50 GB/s per NIC).
The existing auto-tuner applies the same parameters (16 blocks/peer,
64KB chunks) to all peers regardless of transport. NVLink benefits from
high block parallelism (each block independently saturates NVLink via
cooperative memcpy), while IB benefits from fewer blocks (the NIC
handles pipelining; extra blocks just add WQE contention).
Design
1. **Host-side topology detection** (alltoallv_op.py): AlltoallvOp.setup() calls window.get_nvlink_address(peer) after tensor_register to classify each peer as NVL (non-zero) or IB (zero).
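The classification step above can be sketched in a few lines. This is a minimal model, not the PR's code: the `window.get_nvlink_address(peer)` call is the API named in the summary, while `classify_peers` and the `_FakeWindow` stub (which fakes a 2-node, 4-ranks-per-node topology) are illustrative.

```python
def classify_peers(window, world_size: int, my_rank: int) -> list[bool]:
    """True for NVLink-reachable peers, False for IB peers.

    Per the summary, get_nvlink_address(peer) returns a non-zero
    mapped address for NVL peers and 0 for IB peers.
    """
    peer_is_nvl = []
    for peer in range(world_size):
        if peer == my_rank:
            peer_is_nvl.append(True)  # self is trivially local
        else:
            peer_is_nvl.append(window.get_nvlink_address(peer) != 0)
    return peer_is_nvl

class _FakeWindow:
    """Stub: peers on the same node expose a non-zero NVLink address."""
    def __init__(self, my_rank: int, ranks_per_node: int = 4):
        self.node = my_rank // ranks_per_node
        self.rpn = ranks_per_node
    def get_nvlink_address(self, peer: int) -> int:
        return 0xDEAD_0000 if peer // self.rpn == self.node else 0

result = classify_peers(_FakeWindow(my_rank=1), world_size=8, my_rank=1)
print(result)  # ranks 0-3 share rank 1's node (NVL); ranks 4-7 are IB
```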
2. **Split auto-tuning** (device_alltoallv_dynamic.py): auto_tune_alltoallv_params() now accepts peer_is_nvl and returns per-peer block counts. _tune_for_nvl() gives 8-16 blocks/peer; _tune_for_ib() gives 1-4 blocks/peer with larger chunks.
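A sketch of what this split tuner could look like. The parameter ranges (8-16 vs 1-4 blocks, 64KB vs 256KB chunks, warp counts) come from the summary; the 1MB size threshold and exact return shape are assumptions for illustration.

```python
def _tune_for_nvl(nbytes: int) -> dict:
    # High block parallelism saturates NVLink via cooperative memcpy
    return {"blocks_per_peer": 16 if nbytes >= (1 << 20) else 8,
            "chunk_size": 64 * 1024, "num_warps": 32}

def _tune_for_ib(nbytes: int) -> dict:
    # The NIC pipelines internally: fewer blocks avoid WQE contention,
    # larger chunks amortize the per-WQE round-trip latency
    return {"blocks_per_peer": 4 if nbytes >= (1 << 20) else 1,
            "chunk_size": 256 * 1024, "num_warps": 8}

def auto_tune_alltoallv_params(per_peer_nbytes, peer_is_nvl):
    per_peer = [(_tune_for_nvl if is_nvl else _tune_for_ib)(n)
                for n, is_nvl in zip(per_peer_nbytes, peer_is_nvl)]
    # constexpr launch bound: max block count across all peers
    blocks_per_peer_max = max(p["blocks_per_peer"] for p in per_peer)
    return per_peer, blocks_per_peer_max

# One 2MB NVL peer and one small IB peer:
per_peer, bpp_max = auto_tune_alltoallv_params([2 << 20, 4096], [True, False])
print(bpp_max)  # grid is sized for the NVL peer's block count
```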
3. **Per-peer block masking** (kernel): The kernel launches with BLOCKS_PER_PEER = max(nvl_bpp, ib_bpp) as a constexpr upper bound. A runtime per_peer_blocks_ptr tensor holds the actual block count per peer. Excess blocks early-return. Completion counters and recv signal waits use actual_bpp instead of the uniform BLOCKS_PER_PEER.
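A pure-Python model of the masking scheme (illustrative, not the actual Triton kernel): the grid is sized for the constexpr upper bound, and each block looks up its peer's actual count at runtime to decide whether to early-return.

```python
def active_blocks(per_peer_blocks: list[int], BLOCKS_PER_PEER: int):
    """Return the (peer, block_in_peer) pairs that actually do work.

    Mirrors the kernel's masking: the grid launches
    n_peers * BLOCKS_PER_PEER programs; a program whose intra-peer
    index exceeds that peer's tuned count exits immediately.
    """
    active = []
    for pid in range(len(per_peer_blocks) * BLOCKS_PER_PEER):
        peer = pid // BLOCKS_PER_PEER
        block_in_peer = pid % BLOCKS_PER_PEER
        actual_bpp = per_peer_blocks[peer]  # runtime per_peer_blocks_ptr load
        if block_in_peer >= actual_bpp:
            continue  # excess block early-returns
        active.append((peer, block_in_peer))
    return active

# One NVL peer tuned to 16 blocks, one IB peer tuned to 2:
working = active_blocks([16, 2], BLOCKS_PER_PEER=16)
print(len(working))  # 18 of the 32 launched blocks do work
```

Completion counters in the real kernel would likewise count up to `actual_bpp` per peer rather than the uniform launch bound.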
4. **Backward compatible**: On single-node (all NVL), per_peer_blocks is None, so HAS_PER_PEER_BLOCKS=False (constexpr) and all masking code is compiled away by Triton. Zero performance impact on the existing single-node path.
IB tuning rationale
- blocks_per_peer: 1-4 (vs 8-16 for NVL). IB transfers use GIN put where thread 0 posts an RDMA WQE; extra blocks add WQE completion queue overhead without proportional bandwidth gain.
- chunk_size: 256KB (vs 64KB for NVL). Larger chunks amortize the per-WQE IB round-trip latency.
- num_warps: 4-8 (vs 16-32 for NVL). Only thread 0 posts WQEs; extra warps only help the staging memcpy to the GIN buffer.
Files changed
- device_alltoallv_dynamic.py: _tune_for_nvl(), _tune_for_ib(), per_peer_blocks kernel parameter, block masking, signal updates
- alltoallv_op.py: topology detection in setup(), per_peer_blocks tensor creation, teardown cleanup
Differential Revision: D99965741