Fix register_local_buffer API break in Triton alltoallv kernel (#1996) by snarayankh · Pull Request #1996 · meta-pytorch/torchcomms

snarayankh · 2026-04-07T23:12:05Z

Summary:

The `register_local_buffer()` API was changed to return a single `int64`
(a device pointer to a `RegisteredBuffer` struct) instead of the previous
3-tuple `(base_ptr, size, nccl_win)`. This broke the Triton alltoallv
kernel and all its consumers.

The fix:

- Remove dead kernel parameters `src_base_ptr` and `src_size` which were
  never used in the kernel body
- Extract the `backend_window` (ncclWindow_t) field from the device-side
  `RegisteredBuffer` struct at byte offset 16 via cudaMemcpy D2H. This is
  the value that `put_block_direct` and `put_warp_chunked_direct` externs
  need for NVLink inline-PTX memcpy and GIN RDMA fallback
- Fix all `deregister_local_buffer(*src_info)` splat-unpack calls to pass
  the single int handle directly
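
The call-site change in the last item can be sketched as follows. This is a minimal illustration, not the torchcomms code: `FakeComm` and its method signatures are hypothetical stand-ins that only mimic the return-type change described above (a single int handle instead of a 3-tuple).

```python
class FakeComm:
    """Hypothetical stand-in for the communicator; NOT the torchcomms API."""

    def __init__(self) -> None:
        self._live = set()            # handles that are currently registered
        self._next = 0x7F0000000000   # made-up "device pointer" values

    def register_local_buffer(self, tensor) -> int:
        # New behavior: return one int64-style handle, not a 3-tuple.
        handle = self._next
        self._next += 32
        self._live.add(handle)
        return handle

    def deregister_local_buffer(self, handle: int) -> None:
        self._live.remove(handle)


comm = FakeComm()
src_info = comm.register_local_buffer(object())

# Old call site: splat-unpacking assumed src_info was a tuple.
old_call_error = None
try:
    comm.deregister_local_buffer(*src_info)
except TypeError as exc:
    old_call_error = str(exc)  # an int cannot be unpacked with *

# Fixed call site: pass the single handle directly.
comm.deregister_local_buffer(src_info)
```

With a plain int, the `*src_info` form fails before the call is even made, which is why every consumer of the old tuple had to be updated.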

This is a pure API adaptation with zero kernel behavioral change. The
externs receive the identical `ncclWindow_t` value as before. The NVLink
inline-PTX optimization in `put_block_direct` is preserved (no switch to
the slower `put_block` path).

`RegisteredBuffer` struct layout (all backends):

- offset 0: `base_ptr` (void*, 8 bytes)
- offset 8: `size` (size_t, 8 bytes)
- offset 16: `backend_window` (void*, 8 bytes) ← extracted for `put_block_direct`
- offset 24: `lkey` (uint32_t, 4 bytes)
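
The layout above can be mirrored on the host to check the offsets and to show the offset-16 read. This is a sketch under stated assumptions: it assumes a 64-bit platform, and it reads from ordinary host memory as a stand-in. The real handle is a device pointer, so the actual fix uses a device-to-host copy (cudaMemcpy D2H) rather than `ctypes.memmove`. The helper name `read_backend_window` is hypothetical.

```python
import ctypes


class RegisteredBuffer(ctypes.Structure):
    """Host-side mirror of the layout listed above (64-bit platform assumed)."""

    _fields_ = [
        ("base_ptr", ctypes.c_void_p),        # offset 0, 8 bytes
        ("size", ctypes.c_size_t),            # offset 8, 8 bytes
        ("backend_window", ctypes.c_void_p),  # offset 16, 8 bytes
        ("lkey", ctypes.c_uint32),            # offset 24, 4 bytes
    ]


BACKEND_WINDOW_OFFSET = RegisteredBuffer.backend_window.offset


def read_backend_window(struct_addr: int) -> int:
    """Copy the 8-byte field at offset 16 out of the struct at struct_addr.

    Hypothetical helper: struct_addr must be a HOST address here. For a
    device pointer the same 8-byte copy would have to be a D2H memcpy.
    """
    out = ctypes.c_uint64(0)
    ctypes.memmove(
        ctypes.byref(out),
        struct_addr + BACKEND_WINDOW_OFFSET,
        ctypes.sizeof(out),
    )
    return out.value


# Host stand-in for the device-side struct, filled with made-up values.
buf = RegisteredBuffer(base_ptr=0x1000, size=4096,
                       backend_window=0xDEADBEEF, lkey=7)
window = read_backend_window(ctypes.addressof(buf))
```

Deriving the offset from the struct definition instead of hard-coding 16 keeps the read in step with the layout if a field is ever added or reordered.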

Files changed:

- device_alltoallv_dynamic.py: kernel params, host-side field extraction,
  kernel launch
- alltoallv_op.py: type annotation, deregister call
- test_device_alltoallv_dynamic_e2e.py: deregister calls
- benchmark_device_alltoallv_dynamic.py: deregister calls

Differential Revision: D99914167

meta-codesync · 2026-04-07T23:12:15Z

@snarayankh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99914167.


meta-cla bot added the CLA Signed label Apr 7, 2026

meta-codesync bot added fb-exported meta-exported labels Apr 7, 2026

meta-codesync bot changed the title ~~Fix register_local_buffer API break in Triton alltoallv kernel~~ Fix register_local_buffer API break in Triton alltoallv kernel (#1996) Apr 8, 2026

snarayankh force-pushed the export-D99914167 branch from 91fe099 to c9a80c2 on April 8, 2026 00:37

snarayankh force-pushed the export-D99914167 branch from c9a80c2 to d21d7b1 on April 8, 2026 04:42

snarayankh force-pushed the export-D99914167 branch 2 times, most recently from 9a3497a to 83371f8 on April 8, 2026 20:45

snarayankh force-pushed the export-D99914167 branch from 83371f8 to af5368e on April 9, 2026 04:01

snarayankh force-pushed the export-D99914167 branch from af5368e to d23be1f on April 9, 2026 04:10

snarayankh force-pushed the export-D99914167 branch from d23be1f to 9745eee on April 9, 2026 20:09

snarayankh wants to merge 1 commit into meta-pytorch:main from snarayankh:export-D99914167
