Userbuffer epic #367
Conversation
```python
if version < (12, 0):
    raise RuntimeError("Transformer Engine requires CUDA 12.0 or newer")

if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
```
Guard this via ROCm-specific guards?
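For illustration, a minimal sketch of what such a guard could look like. Everything except the NVTE_UB_WITH_MPI check from the hunk above is an assumption: the `torch.version.hip` probe is just one way to detect a ROCm build, and the branch body is a placeholder rather than code from this PR.

```python
import os

import torch

# Assumption: torch.version.hip is non-None on ROCm builds of PyTorch.
IS_ROCM = torch.version.hip is not None

if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
    if IS_ROCM:
        # Placeholder for ROCm-specific handling (e.g. different MPI
        # bootstrap settings) instead of the CUDA-only path.
        pass
    # ... existing NVTE_UB_WITH_MPI handling continues here ...
```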
```diff
 parser.add_argument("--seed", type=int, default=1234, help="RNG seed.")
 parser.add_argument(
-    "--fp8", action="store_true", default=False, help="Enables the te.fp8_autocast() context."
+    "--fp8", action="store_true", default=False, help="Enables the te.autocast() context."
```
Up to TE v2.8, I think it's still fp8_autocast. Were you targeting a higher version?
I think you had a few comments on this, so I'll address them all here quickly. I moved the UB code up to release 2.10, since there were a few bugs and inefficiencies that NV fixed. Most of the changes that aren't guarded in the files are NV upstream changes.
I am fixing up the te_layer_with_overlap differences and working on integrating the benchmark script into the file directly.
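As a side note for anyone running the example against both older and newer TE releases, a small compatibility shim could bridge the rename. This is only a sketch, assuming the autocast rename described above; the `enabled` keyword is taken from the existing fp8_autocast signature.

```python
import transformer_engine.pytorch as te

# Prefer te.autocast where it exists (post-rename releases), otherwise fall
# back to te.fp8_autocast for TE <= 2.8.
autocast_ctx = getattr(te, "autocast", None) or te.fp8_autocast

with autocast_ctx(enabled=True):
    ...  # FP8 forward pass goes here
```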
```python
# This file was modified for portability to AMDGPU
# Copyright (c) 2025-2026, Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
Does this file share a lot of code with examples/pytorch/comm_gemm_overlap/te_layer_with_overlap.py? Is it possible to consolidate those two files?
```diff
@@ -0,0 +1,15 @@
+{
```
Why do we put this file here? Should it be under /transformer_engine/common or pytorch instead?
This file is used specifically by the example scripts, to let a user change the algorithms used for each overlap scenario.
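For readers skimming the thread, here is a minimal sketch of how such a config might be consumed. The file name, the JSON layout, and the keyword arguments are assumptions; the only grounded parts are that the file is JSON and that upstream TE's userbuffer setup accepts per-GEMM overrides (via the ub_cfgs argument of initialize_ub, if memory serves).

```python
import json

import transformer_engine.pytorch as te

# Illustrative example values; a real run would take these from the CLI.
seq_len, batch_size, hidden_size, tp_size = 2048, 2, 4096, 8

# Hypothetical layout: each key names an overlapped GEMM, each value overrides
# the overlap method (and possibly other knobs) for that GEMM, e.g.
# {"qkv_fprop": {"method": "ring_exchange"}, "proj_fprop": {"method": "pipeline"}}
with open("ub_cfgs.json") as f:  # file name is illustrative
    ub_cfgs = json.load(f)

te.module.base.initialize_ub(
    [seq_len * batch_size, hidden_size],
    tp_size,
    use_fp8=False,
    ub_cfgs=ub_cfgs,  # assumed to merge over the built-in per-GEMM defaults
)
```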
```python
import transformer_engine.pytorch.cpp_extensions as tex
from transformer_engine.pytorch.fp8 import FP8GlobalStateManager

from transformer_engine.jax.cpp_extensions.misc import is_hip_extension
```
Let's not import jax-specific code on the pytorch side. Use this instead:
```python
from torch.utils.cpp_extension import IS_HIP_EXTENSION
```
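For context, a minimal sketch of how that torch-provided constant is commonly used to branch between ROCm and CUDA behaviour; the gated bodies are placeholders, not code from this PR.

```python
from torch.utils.cpp_extension import IS_HIP_EXTENSION

# IS_HIP_EXTENSION is True on ROCm/HIP builds of PyTorch, False on CUDA builds.
if IS_HIP_EXTENSION:
    ...  # ROCm-specific setup or test skips (placeholder)
else:
    ...  # CUDA-specific behaviour stays unchanged (placeholder)
```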
Good catch, this is a mistake. Will fix.
```cpp
  initialize(buffer_shape, buffer_dtype, rs_overlap_first_gemm);
}

void CommOverlapBase::initialize(const std::vector<size_t> &buffer_shape, DType buffer_dtype,
```
Is this initialize function used somewhere else, or is it just there to make the NV upstream code look cleaner?
```diff
-  if (_ub_comm->myrank == 0) printf("!!! [UB] Register UBuf %d\n", _ub_reg);
+  if (_ub_comm->myrank == 0) {
+    printf("!!! [UB] Register UBuf %d\n", _ub_reg);
+  }
```
I would prefer aligning the coding style with NV upstream so it's easier for us to maintain/IFU later.
```cpp
                      allgather_handle, barrier_handle, tp_size, num_max_streams, comm_cga_size,
                      gemm_priority, comm_priority, num_comm_sm, set_sm_margin, use_ce,
                      atomic_gemm) {
  initialize(buffer_shape, buffer_dtype, comm_type, aggregate);
```
Same question here about the motivation for this initialize function in the constructor.
```diff
   size_t buffer_bytes = get_buffer_size_bytes(buffer_shape[0], buffer_shape[1], buffer_dtype);
-  int buffer_chunk_bytes = buffer_bytes / tp_size;
   _num_ubuf_chunks = tp_size;
+  int buffer_chunk_bytes = buffer_bytes / _tp_size;
```
Does NV's original code compile successfully? I mean tp_size -> _tp_size sounds like a typo in their original code :-)
```cpp
    NVTE_CHECK_CUDA(cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, _comm_priority));
    _stream_send.push_back(std::move(stream));
  }
  for (int i = 0; i < 7; i++) {
```
Why do we need more streams than NV upstream, and where does the constant 7 come from?
Good question. We require more streams because NV goes through a single NVLink, whereas we have (# of GPUs - 1) connections with xGMI. The hard-coded 7 comes from an 8-GPU max on a system, but this should probably be replaced with a macro definition like MAX_TP_SIZE.
This is the userbuffer_epic branch, to be merged only once all epic tasks have been completed. PRs for epic tasks will be opened against this branch.