Fix register_local_buffer API break in Triton alltoallv kernel (#1996) #1996

Open
snarayankh wants to merge 1 commit into meta-pytorch:main from snarayankh:export-D99914167

@snarayankh commented Apr 7, 2026

Summary:

The register_local_buffer() API was changed to return a single int64
(a device pointer to a RegisteredBuffer struct) instead of the previous
3-tuple (base_ptr, size, nccl_win). This broke the Triton alltoallv
kernel and all its consumers.

The fix:

  • Remove the dead kernel parameters src_base_ptr and src_size, which were
    never used in the kernel body.
  • Extract the backend_window (ncclWindow_t) field from the device-side
    RegisteredBuffer struct at byte offset 16 via a cudaMemcpy D2H copy. This
    is the value that the put_block_direct and put_warp_chunked_direct externs
    need for the NVLink inline-PTX memcpy and the GIN RDMA fallback.
  • Replace all deregister_local_buffer(*src_info) splat-unpack calls with
    calls that pass the single int handle directly.
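
The field extraction in the second bullet can be sketched on the host alone. This is a minimal illustration, not the code from this PR: it assumes a little-endian 64-bit platform, uses a packed byte string as a stand-in for the 32 bytes that a device-to-host copy would return, and the helper name extract_backend_window and all sample values are hypothetical.

```python
import struct

# Byte offset of backend_window, taken from the struct layout in this PR.
BACKEND_WINDOW_OFFSET = 16


def extract_backend_window(raw: bytes) -> int:
    """Read the 8-byte backend_window pointer from a host copy of the struct.

    Hypothetical helper: `raw` stands in for bytes copied device-to-host.
    """
    (window,) = struct.unpack_from("<Q", raw, BACKEND_WINDOW_OFFSET)
    return window


# Stand-in for a RegisteredBuffer copied to the host:
# base_ptr, size, backend_window, lkey, then 4 padding bytes (32 bytes total).
raw = struct.pack("<QQQI4x", 0x7F0000001000, 4096, 0xDEADBEEF, 42)

window = extract_backend_window(raw)
print(hex(window))
```

Reading at a fixed offset keeps the host code independent of the other fields, but it silently breaks if the struct layout changes, so the offset is kept in one named constant.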

This is a pure API adaptation with no change in kernel behavior. The
externs receive the same ncclWindow_t value as before, and the NVLink
inline-PTX optimization in put_block_direct is preserved (no switch to
the slower put_block path).
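
To illustrate why the old deregister call sites failed once the return value became a single int, here is a small sketch. The function below is a hypothetical stand-in with the same name as the real API, not the library implementation, and the handle value is made up.

```python
def deregister_local_buffer(handle: int) -> None:
    """Hypothetical stand-in that accepts the new single-int handle."""
    if not isinstance(handle, int):
        raise TypeError("handle must be an int")


# New-style return value of register_local_buffer(): one int64 handle
# (sample value, not a real device pointer).
src_info = 0x7F0000002000

# Old call style: splat-unpacking worked for the 3-tuple, but an int is
# not iterable, so Python raises TypeError before the call happens.
try:
    deregister_local_buffer(*src_info)
    splat_ok = True
except TypeError:
    splat_ok = False

# New call style: pass the handle directly.
deregister_local_buffer(src_info)
print(splat_ok)
```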

RegisteredBuffer struct layout (all backends):
offset 0: base_ptr (void*, 8 bytes)
offset 8: size (size_t, 8 bytes)
offset 16: backend_window (void*, 8 bytes) ← extracted for put_block_direct
offset 24: lkey (uint32_t, 4 bytes)
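
The listed offsets can be cross-checked with a host-side mirror of the struct. This is a sketch under the assumption of a 64-bit platform with natural alignment; the ctypes class is illustrative only and is not how the PR reads the struct.

```python
import ctypes


class RegisteredBuffer(ctypes.Structure):
    """Host-side mirror of the layout listed above (illustrative only)."""

    _fields_ = [
        ("base_ptr", ctypes.c_void_p),        # expected offset 0, 8 bytes
        ("size", ctypes.c_size_t),            # expected offset 8, 8 bytes
        ("backend_window", ctypes.c_void_p),  # expected offset 16, 8 bytes
        ("lkey", ctypes.c_uint32),            # expected offset 24, 4 bytes
    ]


# Collect the offset that ctypes computed for each field.
offsets = {
    name: getattr(RegisteredBuffer, name).offset
    for name, _ in RegisteredBuffer._fields_
}
print(offsets, ctypes.sizeof(RegisteredBuffer))
```

A check like this could guard the hard-coded offset 16 in a unit test, so that a layout change is caught early rather than producing a wrong window value.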

Files changed:

  • device_alltoallv_dynamic.py: kernel params, host-side field extraction,
    kernel launch
  • alltoallv_op.py: type annotation, deregister call
  • test_device_alltoallv_dynamic_e2e.py: deregister calls
  • benchmark_device_alltoallv_dynamic.py: deregister calls

Differential Revision: D99914167

meta-cla bot added the CLA Signed label Apr 7, 2026
meta-codesync bot (Contributor) commented Apr 7, 2026

@snarayankh has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99914167.

meta-codesync bot changed the title from "Fix register_local_buffer API break in Triton alltoallv kernel" to "Fix register_local_buffer API break in Triton alltoallv kernel (#1996)" Apr 8, 2026
snarayankh pushed a commit to snarayankh/torchcomms that referenced this pull request Apr 8, 2026
snarayankh force-pushed the export-D99914167 branch 2 times, most recently from 9a3497a to 83371f8, Apr 8, 2026 20:45
Labels

CLA Signed, fb-exported, meta-exported
