
IB auto-tune sweep benchmark for Triton alltoallv kernel (#2014)

Open
snarayankh wants to merge 3 commits into meta-pytorch:main from snarayankh:export-D99970964

Conversation


@snarayankh snarayankh commented Apr 9, 2026

Summary:

Add a parameter sweep benchmark that exhaustively tests IB-specific
kernel configurations (blocks_per_peer, num_warps, chunk_size) across
message sizes from 1KB to 16MB on multi-node H100.

The sweep:

  • Tests 64 parameter combinations per message size (4 bpp x 4 warps x 4 chunks)
  • Compares each config against NCCL alltoallv baseline (CUDA graph mode)
  • Selects the best config per message size
  • Outputs a copy-paste lookup table for _tune_for_ib()

New files:

  • autotune_sweep_ib.py: Sweep benchmark script
  • BUCK: autotune_sweep_ib target (nnodes=2, ppn=8, 2hr timeout)
  • Launcher: autotune_ib entry in triton_test_launcher

Differential Revision: D99970964
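A minimal sketch of the sweep's selection logic. Names, the grid values, and the benchmark callback are illustrative, not taken from autotune_sweep_ib.py; the summary only fixes the grid shape (4 x 4 x 4 = 64 configs) and the 1KB-16MB size range:

```python
import itertools

# Hypothetical sweep grid mirroring the 4 bpp x 4 warps x 4 chunks shape.
BLOCKS_PER_PEER = [1, 2, 3, 4]
NUM_WARPS = [4, 8, 16, 32]
CHUNK_SIZES = [64 << 10, 128 << 10, 256 << 10, 512 << 10]
MSG_SIZES = [1 << k for k in range(10, 25)]  # 1KB .. 16MB

def sweep(bench_fn):
    """bench_fn(msg_size, bpp, warps, chunk) -> measured figure of merit
    (e.g. bandwidth). Returns the best config per message size, in a form
    that could be pasted into a _tune_for_ib() lookup table."""
    best = {}
    for msg in MSG_SIZES:
        scored = [
            (bench_fn(msg, b, w, c), (b, w, c))
            for b, w, c in itertools.product(BLOCKS_PER_PEER, NUM_WARPS, CHUNK_SIZES)
        ]
        best[msg] = max(scored)[1]  # keep the config with the best score
    return best
```

The real script additionally times an NCCL alltoallv baseline per message size; that comparison is omitted here for brevity.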

@meta-cla meta-cla bot added the CLA Signed label Apr 9, 2026

meta-codesync bot commented Apr 9, 2026

@snarayankh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99970964.

@meta-codesync meta-codesync bot changed the title IB auto-tune sweep benchmark for Triton alltoallv kernel IB auto-tune sweep benchmark for Triton alltoallv kernel (#2014) Apr 9, 2026
snarayankh pushed a commit to snarayankh/torchcomms that referenced this pull request Apr 9, 2026
…h#2014)

snarayankh pushed a commit to snarayankh/torchcomms that referenced this pull request Apr 9, 2026
…h#2014)

Santosh Narayankhedkar added 3 commits April 9, 2026 15:03
…pytorch#1996)

Summary:

The `register_local_buffer()` API was changed to return a single `int64`
(a device pointer to a `RegisteredBuffer` struct) instead of the previous
3-tuple `(base_ptr, size, nccl_win)`. This broke the Triton alltoallv
kernel and all its consumers.

The fix:
- Remove dead kernel parameters `src_base_ptr` and `src_size` which were
  never used in the kernel body
- Extract the `backend_window` (ncclWindow_t) field from the device-side
  `RegisteredBuffer` struct at byte offset 16 via cudaMemcpy D2H. This is
  the value that `put_block_direct` and `put_warp_chunked_direct` externs
  need for NVLink inline-PTX memcpy and GIN RDMA fallback
- Fix all `deregister_local_buffer(*src_info)` splat-unpack calls to pass
  the single int handle directly

This is a pure API adaptation with zero kernel behavioral change. The
externs receive the identical `ncclWindow_t` value as before. The NVLink
inline-PTX optimization in `put_block_direct` is preserved (no switch to
the slower `put_block` path).

RegisteredBuffer struct layout (all backends):
  offset 0:  base_ptr (void*, 8 bytes)
  offset 8:  size (size_t, 8 bytes)
  offset 16: backend_window (void*, 8 bytes) ← extracted for put_block_direct
  offset 24: lkey (uint32_t, 4 bytes)

Files changed:
- device_alltoallv_dynamic.py: kernel params, host-side field extraction,
  kernel launch
- alltoallv_op.py: type annotation, deregister call
- test_device_alltoallv_dynamic_e2e.py: deregister calls
- benchmark_device_alltoallv_dynamic.py: deregister calls

Differential Revision: D99914167
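The offset-16 extraction can be illustrated on a host copy of the struct. This is a sketch with a hypothetical helper name, assuming the layout above and little-endian packing; the real code first does a cudaMemcpy D2H from the int64 device handle to obtain these bytes:

```python
import struct

# RegisteredBuffer layout from the commit message:
#   offset 0:  base_ptr        (void*,    8 bytes)
#   offset 8:  size            (size_t,   8 bytes)
#   offset 16: backend_window  (void*,    8 bytes)
#   offset 24: lkey            (uint32_t, 4 bytes)
BACKEND_WINDOW_OFFSET = 16

def extract_backend_window(raw: bytes) -> int:
    """Given a host copy of the device-side RegisteredBuffer struct,
    return the backend_window (ncclWindow_t) pointer value that
    put_block_direct / put_warp_chunked_direct need."""
    (window,) = struct.unpack_from("<Q", raw, BACKEND_WINDOW_OFFSET)
    return window

# Demo with a fabricated struct image (values are arbitrary).
raw = struct.pack("<QQQI", 0x1000, 4096, 0xDEADBEEF, 42)
assert extract_backend_window(raw) == 0xDEADBEEF
```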
…2008)

Summary:

Add topology-aware auto-tuning that assigns different kernel parameters
to NVLink vs IB peers for the Triton AlltoAllv kernel.

Motivation
==========
On multi-node H100, intra-node peers communicate via NVLink (~900 GB/s)
while inter-node peers use InfiniBand via GIN RDMA (~50 GB/s per NIC).
The existing auto-tuner applies the same parameters (16 blocks/peer,
64KB chunks) to all peers regardless of transport. NVLink benefits from
high block parallelism (each block independently saturates NVLink via
cooperative memcpy), while IB benefits from fewer blocks (the NIC
handles pipelining; extra blocks just add WQE contention).

Design
======
1. **Host-side topology detection** (alltoallv_op.py):
   AlltoallvOp.setup() calls window.get_nvlink_address(peer) after
   tensor_register to classify each peer as NVL (non-zero) or IB (zero).

2. **Split auto-tuning** (device_alltoallv_dynamic.py):
   auto_tune_alltoallv_params() now accepts peer_is_nvl and returns
   per-peer block counts. _tune_for_nvl() gives 8-16 blocks/peer;
   _tune_for_ib() gives 1-4 blocks/peer with larger chunks.

3. **Per-peer block masking** (kernel):
   The kernel launches with BLOCKS_PER_PEER = max(nvl_bpp, ib_bpp)
   as a constexpr upper bound. A runtime per_peer_blocks_ptr tensor
   holds the actual block count per peer. Excess blocks early-return.
   Completion counters and recv signal waits use actual_bpp instead of
   the uniform BLOCKS_PER_PEER.

4. **Backward compatible**: On single-node (all NVL), per_peer_blocks
   is None, so HAS_PER_PEER_BLOCKS=False (constexpr) and all masking
   code is compiled away by Triton. Zero performance impact on the
   existing single-node path.
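The masking in steps 3 and 4 can be modeled on the host. This sketch assumes blocks are laid out as BLOCKS_PER_PEER consecutive blocks per peer (an assumption about the launch geometry; the real logic lives in the Triton kernel):

```python
def block_is_active(block_id, blocks_per_peer_upper, per_peer_blocks):
    """Host-side model of the kernel's per-peer block masking.
    A block whose within-peer index >= the peer's actual block count
    early-returns; per_peer_blocks=None models the single-node path
    where HAS_PER_PEER_BLOCKS=False and no masking is compiled in."""
    peer = block_id // blocks_per_peer_upper
    within_peer = block_id % blocks_per_peer_upper
    if per_peer_blocks is None:
        return True
    return within_peer < per_peer_blocks[peer]

# Two peers: peer 0 is NVL (16 blocks), peer 1 is IB (2 blocks).
# The kernel launches with max(16, 2) = 16 blocks per peer; for the
# IB peer, blocks 2..15 early-return.
per_peer = [16, 2]
active = [block_is_active(b, 16, per_peer) for b in range(32)]
```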

IB tuning rationale
===================
- blocks_per_peer: 1-4 (vs 8-16 for NVL). IB transfers use GIN put
  where thread 0 posts an RDMA WQE; extra blocks add WQE completion
  queue overhead without proportional bandwidth gain.
- chunk_size: 256KB (vs 64KB for NVL). Larger chunks amortize the
  per-WQE IB round-trip latency.
- num_warps: 4-8 (vs 16-32 for NVL). Only thread 0 posts WQEs;
  extra warps only help the staging memcpy to the GIN buffer.

Files changed
=============
- device_alltoallv_dynamic.py: _tune_for_nvl(), _tune_for_ib(),
  per_peer_blocks kernel parameter, block masking, signal updates
- alltoallv_op.py: topology detection in setup(), per_peer_blocks
  tensor creation, teardown cleanup

Differential Revision: D99965741
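The host-side half of the design (steps 1 and 2) amounts to classifying peers and building the block-count table. A sketch with hypothetical function names; the real code queries window.get_nvlink_address(peer) inside AlltoallvOp.setup(), and the 16/2 defaults here are illustrative values within the 8-16 NVL and 1-4 IB ranges stated above:

```python
def classify_peers(nvlink_addrs):
    """A peer reporting a non-zero NVLink address is reachable over
    NVLink; zero means the peer is reached over IB (GIN RDMA)."""
    return [addr != 0 for addr in nvlink_addrs]

def per_peer_block_counts(peer_is_nvl, nvl_bpp=16, ib_bpp=2):
    """Build the per-peer block-count table. The kernel is launched with
    the max as a constexpr upper bound; excess blocks early-return."""
    counts = [nvl_bpp if nvl else ib_bpp for nvl in peer_is_nvl]
    return counts, max(counts)

# Two intra-node peers (non-zero NVLink addresses) and two IB peers.
is_nvl = classify_peers([0x7000, 0x7100, 0, 0])
counts, upper = per_peer_block_counts(is_nvl)
# counts -> [16, 16, 2, 2], upper -> 16
```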
…h#2014)


Labels

CLA Signed, fb-exported, meta-exported
