
Topology-aware auto-tuning for Triton alltoallv kernel (#2008)

Open
snarayankh wants to merge 2 commits into meta-pytorch:main from snarayankh:export-D99965741

Conversation


snarayankh commented Apr 9, 2026

Summary:

Add topology-aware auto-tuning that assigns different kernel parameters
to NVLink vs IB peers for the Triton AlltoAllv kernel.

Motivation

On multi-node H100, intra-node peers communicate via NVLink (~900 GB/s)
while inter-node peers use InfiniBand via GIN RDMA (~50 GB/s per NIC).
The existing auto-tuner applies the same parameters (16 blocks/peer,
64KB chunks) to all peers regardless of transport. NVLink benefits from
high block parallelism (each block independently saturates NVLink via
cooperative memcpy), while IB benefits from fewer blocks (the NIC
handles pipelining; extra blocks just add WQE contention).

Design

  1. Host-side topology detection (alltoallv_op.py):
    AlltoallvOp.setup() calls window.get_nvlink_address(peer) after
    tensor_register to classify each peer as NVL (non-zero) or IB (zero).

  2. Split auto-tuning (device_alltoallv_dynamic.py):
    auto_tune_alltoallv_params() now accepts peer_is_nvl and returns
    per-peer block counts. _tune_for_nvl() gives 8-16 blocks/peer;
    _tune_for_ib() gives 1-4 blocks/peer with larger chunks.

  3. Per-peer block masking (kernel):
    The kernel launches with BLOCKS_PER_PEER = max(nvl_bpp, ib_bpp)
    as a constexpr upper bound. A runtime per_peer_blocks_ptr tensor
    holds the actual block count per peer. Excess blocks early-return.
    Completion counters and recv signal waits use actual_bpp instead of
    the uniform BLOCKS_PER_PEER.

  4. Backward compatible: On single-node (all NVL), per_peer_blocks
    is None, so HAS_PER_PEER_BLOCKS=False (constexpr) and all masking
    code is compiled away by Triton. Zero performance impact on the
    existing single-node path.
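The launch-time arithmetic behind points 3 and 4 can be sketched in plain Python. This is an illustrative emulation of the masking scheme described above, not the actual kernel code; the helper names (`plan_blocks`, `block_is_active`) are hypothetical, while `nvl_bpp`, `ib_bpp`, and `per_peer_blocks` mirror the description:

```python
def plan_blocks(peer_is_nvl, nvl_bpp=16, ib_bpp=4):
    """Return the constexpr BLOCKS_PER_PEER upper bound and the runtime
    per-peer block counts, as described in the design above."""
    per_peer_blocks = [nvl_bpp if is_nvl else ib_bpp for is_nvl in peer_is_nvl]
    blocks_per_peer = max(nvl_bpp, ib_bpp)  # constexpr upper bound at launch
    # Single-node case (all NVL): no masking tensor needed, Triton compiles
    # the masking code away under HAS_PER_PEER_BLOCKS=False.
    if all(peer_is_nvl):
        per_peer_blocks = None
    return blocks_per_peer, per_peer_blocks

def block_is_active(block_in_peer, peer, per_peer_blocks):
    """A launched block early-returns when its index within the peer's
    block group is >= that peer's actual block count."""
    if per_peer_blocks is None:
        return True
    return block_in_peer < per_peer_blocks[peer]
```

For a 2-node layout like `[True, True, False, False]` this yields an upper bound of 16 blocks/peer, with IB peers using only 4: blocks 4-15 targeting an IB peer early-return.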

IB tuning rationale

  • blocks_per_peer: 1-4 (vs 8-16 for NVL). IB transfers use GIN put
    where thread 0 posts an RDMA WQE; extra blocks add WQE completion
    queue overhead without proportional bandwidth gain.
  • chunk_size: 256KB (vs 64KB for NVL). Larger chunks amortize the
    per-WQE IB round-trip latency.
  • num_warps: 4-8 (vs 16-32 for NVL). Only thread 0 posts WQEs;
    extra warps only help the staging memcpy to the GIN buffer.
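A minimal sketch of a split tuner in the spirit of `_tune_for_nvl()` / `_tune_for_ib()`, applying the parameter ranges from the rationale above. The exact heuristics, thresholds, and return shape are assumptions for illustration, not the actual implementation:

```python
KB = 1024

def _tune_for_nvl(msg_bytes):
    # NVLink: many blocks, small chunks -- each block independently
    # saturates the link via cooperative memcpy.
    blocks = 16 if msg_bytes >= KB * KB else 8
    return {"blocks_per_peer": blocks, "chunk_size": 64 * KB, "num_warps": 32}

def _tune_for_ib(msg_bytes):
    # IB: thread 0 posts RDMA WQEs, so extra blocks only add completion-
    # queue contention; larger chunks amortize per-WQE latency.
    blocks = 4 if msg_bytes >= KB * KB else 1
    return {"blocks_per_peer": blocks, "chunk_size": 256 * KB, "num_warps": 8}

def auto_tune_alltoallv_params(msg_bytes, peer_is_nvl):
    """Per-peer parameter selection keyed on transport."""
    return [
        _tune_for_nvl(msg_bytes) if is_nvl else _tune_for_ib(msg_bytes)
        for is_nvl in peer_is_nvl
    ]
```

The key design point is that the tuner returns a list indexed by peer rather than a single parameter set, which is what forces the kernel-side masking described in the Design section.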

Files changed

  • device_alltoallv_dynamic.py: _tune_for_nvl(), _tune_for_ib(),
    per_peer_blocks kernel parameter, block masking, signal updates
  • alltoallv_op.py: topology detection in setup(), per_peer_blocks
    tensor creation, teardown cleanup

Differential Revision: D99965741

meta-cla bot added the CLA Signed label Apr 9, 2026

meta-codesync bot commented Apr 9, 2026

@snarayankh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99965741.

snarayankh pushed a commit to snarayankh/torchcomms that referenced this pull request Apr 9, 2026
snarayankh pushed a commit to snarayankh/torchcomms that referenced this pull request Apr 9, 2026
@meta-codesync meta-codesync bot changed the title Topology-aware auto-tuning for Triton alltoallv kernel Topology-aware auto-tuning for Triton alltoallv kernel (#2008) Apr 9, 2026
snarayankh pushed a commit to snarayankh/torchcomms that referenced this pull request Apr 9, 2026
snarayankh pushed a commit to snarayankh/torchcomms that referenced this pull request Apr 9, 2026
Santosh Narayankhedkar added 2 commits April 9, 2026 14:33
…pytorch#1996)

Summary:

The `register_local_buffer()` API was changed to return a single `int64`
(a device pointer to a `RegisteredBuffer` struct) instead of the previous
3-tuple `(base_ptr, size, nccl_win)`. This broke the Triton alltoallv
kernel and all its consumers.

The fix:
- Remove dead kernel parameters `src_base_ptr` and `src_size` which were
  never used in the kernel body
- Extract the `backend_window` (ncclWindow_t) field from the device-side
  `RegisteredBuffer` struct at byte offset 16 via cudaMemcpy D2H. This is
  the value that `put_block_direct` and `put_warp_chunked_direct` externs
  need for NVLink inline-PTX memcpy and GIN RDMA fallback
- Fix all `deregister_local_buffer(*src_info)` splat-unpack calls to pass
  the single int handle directly

This is a pure API adaptation with zero kernel behavioral change. The
externs receive the identical `ncclWindow_t` value as before. The NVLink
inline-PTX optimization in `put_block_direct` is preserved (no switch to
the slower `put_block` path).

RegisteredBuffer struct layout (all backends):
  offset 0:  base_ptr (void*, 8 bytes)
  offset 8:  size (size_t, 8 bytes)
  offset 16: backend_window (void*, 8 bytes) ← extracted for put_block_direct
  offset 24: lkey (uint32_t, 4 bytes)
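The field extraction can be sketched with the standard `struct` module. In the real code the bytes come back from device memory via a cudaMemcpy D2H; here a packed byte string stands in for that copy, and the format string is derived from the layout above (little-endian, no padding assumed):

```python
import struct

# base_ptr (Q), size (Q), backend_window (Q), lkey (I) -- offsets 0/8/16/24
REGISTERED_BUFFER_FMT = "<QQQI"

def backend_window_from_bytes(raw: bytes) -> int:
    """Extract the backend_window (ncclWindow_t) field at byte offset 16
    from a copied-out RegisteredBuffer struct."""
    _base_ptr, _size, backend_window, _lkey = struct.unpack_from(
        REGISTERED_BUFFER_FMT, raw
    )
    return backend_window
```

Note that `struct.calcsize("<QQQI")` is 28, matching the four fields above, and the third `Q` lands exactly at offset 16.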

Files changed:
- device_alltoallv_dynamic.py: kernel params, host-side field extraction,
  kernel launch
- alltoallv_op.py: type annotation, deregister call
- test_device_alltoallv_dynamic_e2e.py: deregister calls
- benchmark_device_alltoallv_dynamic.py: deregister calls

Differential Revision: D99914167
…2008)

Summary:

Add topology-aware auto-tuning that assigns different kernel parameters
to NVLink vs IB peers for the Triton AlltoAllv kernel.

Motivation
==========
On multi-node H100, intra-node peers communicate via NVLink (~900 GB/s)
while inter-node peers use InfiniBand via GIN RDMA (~50 GB/s per NIC).
The existing auto-tuner applies the same parameters (16 blocks/peer,
64KB chunks) to all peers regardless of transport. NVLink benefits from
high block parallelism (each block independently saturates NVLink via
cooperative memcpy), while IB benefits from fewer blocks (the NIC
handles pipelining; extra blocks just add WQE contention).

Design
======
1. **Host-side topology detection** (alltoallv_op.py):
   AlltoallvOp.setup() calls window.get_nvlink_address(peer) after
   tensor_register to classify each peer as NVL (non-zero) or IB (zero).

2. **Split auto-tuning** (device_alltoallv_dynamic.py):
   auto_tune_alltoallv_params() now accepts peer_is_nvl and returns
   per-peer block counts. _tune_for_nvl() gives 8-16 blocks/peer;
   _tune_for_ib() gives 1-4 blocks/peer with larger chunks.

3. **Per-peer block masking** (kernel):
   The kernel launches with BLOCKS_PER_PEER = max(nvl_bpp, ib_bpp)
   as a constexpr upper bound. A runtime per_peer_blocks_ptr tensor
   holds the actual block count per peer. Excess blocks early-return.
   Completion counters and recv signal waits use actual_bpp instead of
   the uniform BLOCKS_PER_PEER.

4. **Backward compatible**: On single-node (all NVL), per_peer_blocks
   is None, so HAS_PER_PEER_BLOCKS=False (constexpr) and all masking
   code is compiled away by Triton. Zero performance impact on the
   existing single-node path.

IB tuning rationale
===================
- blocks_per_peer: 1-4 (vs 8-16 for NVL). IB transfers use GIN put
  where thread 0 posts an RDMA WQE; extra blocks add WQE completion
  queue overhead without proportional bandwidth gain.
- chunk_size: 256KB (vs 64KB for NVL). Larger chunks amortize the
  per-WQE IB round-trip latency.
- num_warps: 4-8 (vs 16-32 for NVL). Only thread 0 posts WQEs;
  extra warps only help the staging memcpy to the GIN buffer.

Files changed
=============
- device_alltoallv_dynamic.py: _tune_for_nvl(), _tune_for_ib(),
  per_peer_blocks kernel parameter, block masking, signal updates
- alltoallv_op.py: topology detection in setup(), per_peer_blocks
  tensor creation, teardown cleanup

Differential Revision: D99965741
snarayankh pushed a commit to snarayankh/torchcomms that referenced this pull request Apr 9, 2026

Labels

CLA Signed · fb-exported · meta-exported
