Userbuffer epic #367
Conversation
```python
if version < (12, 0):
    raise RuntimeError("Transformer Engine requires CUDA 12.0 or newer")

if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
```
Guard this via ROCm-specific guards?
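For illustration, a minimal sketch of what such a guard could look like. Everything except the NVTE_UB_WITH_MPI check from the hunk above is an assumption: the `torch.version.hip` probe is just one way to detect a ROCm build, and the branch body is a placeholder rather than code from this PR.

```python
import os

import torch

# Assumption: torch.version.hip is non-None on ROCm builds of PyTorch.
IS_ROCM = torch.version.hip is not None

if bool(int(os.getenv("NVTE_UB_WITH_MPI", "0"))):
    if IS_ROCM:
        # Placeholder for ROCm-specific handling (e.g. different MPI
        # bootstrap settings) instead of the CUDA-only path.
        pass
    # ... existing NVTE_UB_WITH_MPI handling continues here ...
```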
```diff
 parser.add_argument("--seed", type=int, default=1234, help="RNG seed.")
 parser.add_argument(
-    "--fp8", action="store_true", default=False, help="Enables the te.fp8_autocast() context."
+    "--fp8", action="store_true", default=False, help="Enables the te.autocast() context."
```
Up to TE v2.8, I think it's still fp8_autocast. Were you targeting a higher version?
I think you had a few comments on this, so I'll address them all here quickly. I moved the UB code up to release 2.10, since there were a few bugs and inefficiencies that NV fixed. Most of the changes that aren't guarded in the files are NV upstream changes.
I am fixing up the te_layer_with_overlap differences and working on integrating the benchmark script into the file directly.
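As a side note for anyone running the example against both older and newer TE releases, a small compatibility shim could bridge the rename. This is only a sketch, assuming the autocast rename described above; the `enabled` keyword is taken from the existing fp8_autocast signature.

```python
import transformer_engine.pytorch as te

# Prefer te.autocast where it exists (post-rename releases), otherwise fall
# back to te.fp8_autocast for TE <= 2.8.
autocast_ctx = getattr(te, "autocast", None) or te.fp8_autocast

with autocast_ctx(enabled=True):
    ...  # FP8 forward pass goes here
```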
```python
# This file was modified for portability to AMDGPU
# Copyright (c) 2025-2026, Advanced Micro Devices, Inc. All rights reserved.
# Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
```
Does this file share a lot of code with examples/pytorch/comm_gemm_overlap/te_layer_with_overlap.py? Is it possible to consolidate those two files?
```diff
@@ -0,0 +1,15 @@
+{
```
Why do we put this file here? Should it be under /transformer_engine/common or pytorch instead?
This file is used specifically by the example scripts, to let a user change the algorithms used for each overlap scenario.
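For readers skimming the thread, here is a minimal sketch of how such a config might be consumed. The file name, the JSON layout, and the keyword arguments are assumptions; the only grounded parts are that the file is JSON and that upstream TE's userbuffer setup accepts per-GEMM overrides (via the ub_cfgs argument of initialize_ub, if memory serves).

```python
import json

import transformer_engine.pytorch as te

# Illustrative example values; a real run would take these from the CLI.
seq_len, batch_size, hidden_size, tp_size = 2048, 2, 4096, 8

# Hypothetical layout: each key names an overlapped GEMM, each value overrides
# the overlap method (and possibly other knobs) for that GEMM, e.g.
# {"qkv_fprop": {"method": "ring_exchange"}, "proj_fprop": {"method": "pipeline"}}
with open("ub_cfgs.json") as f:  # file name is illustrative
    ub_cfgs = json.load(f)

te.module.base.initialize_ub(
    [seq_len * batch_size, hidden_size],
    tp_size,
    use_fp8=False,
    ub_cfgs=ub_cfgs,  # assumed to merge over the built-in per-GEMM defaults
)
```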
```python
import transformer_engine.pytorch.cpp_extensions as tex
from transformer_engine.pytorch.fp8 import FP8GlobalStateManager

from transformer_engine.jax.cpp_extensions.misc import is_hip_extension
```
Let's not import jax-specific code on the pytorch side. Use this instead:
```python
from torch.utils.cpp_extension import IS_HIP_EXTENSION
```
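For context, a minimal sketch of how that torch-provided constant is commonly used to branch between ROCm and CUDA behaviour; the gated bodies are placeholders, not code from this PR.

```python
from torch.utils.cpp_extension import IS_HIP_EXTENSION

# IS_HIP_EXTENSION is True on ROCm/HIP builds of PyTorch, False on CUDA builds.
if IS_HIP_EXTENSION:
    ...  # ROCm-specific setup or test skips (placeholder)
else:
    ...  # CUDA-specific behaviour stays unchanged (placeholder)
```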
Good catch, this is a mistake. Will fix.
```cpp
  initialize(buffer_shape, buffer_dtype, rs_overlap_first_gemm);
}

void CommOverlapBase::initialize(const std::vector<size_t> &buffer_shape, DType buffer_dtype,
```
Is this initialize function used somewhere else, or is it just there to make the NV upstream code look cleaner?
```diff
-  if (_ub_comm->myrank == 0) printf("!!! [UB] Register UBuf %d\n", _ub_reg);
+  if (_ub_comm->myrank == 0) {
+    printf("!!! [UB] Register UBuf %d\n", _ub_reg);
+  }
```
I would prefer aligning the coding style with NV upstream so it's easier for us to maintain/IFU later.
```cpp
                      allgather_handle, barrier_handle, tp_size, num_max_streams, comm_cga_size,
                      gemm_priority, comm_priority, num_comm_sm, set_sm_margin, use_ce,
                      atomic_gemm) {
  initialize(buffer_shape, buffer_dtype, comm_type, aggregate);
```
Same question here about the motivation for this initialize function in the constructor.
```diff
   size_t buffer_bytes = get_buffer_size_bytes(buffer_shape[0], buffer_shape[1], buffer_dtype);
-  int buffer_chunk_bytes = buffer_bytes / tp_size;
   _num_ubuf_chunks = tp_size;
+  int buffer_chunk_bytes = buffer_bytes / _tp_size;
```
Does NV's original code compile successfully? I mean tp_size -> _tp_size sounds like a typo in their original code :-)
```cpp
    NVTE_CHECK_CUDA(cudaStreamCreateWithPriority(&stream, cudaStreamNonBlocking, _comm_priority));
    _stream_send.push_back(std::move(stream));
  }
  for (int i = 0; i < 7; i++) {
```
Why do we need more streams than NV upstream, and where does the constant 7 come from?
Good question. We require more streams because NV goes through a single NVLink, whereas we have (# of GPUs - 1) connections with xGMI. The hard-coded 7 comes from an 8-GPU max on a system, but this should probably be replaced with a macro definition like MAX_TP_SIZE.
This is the userbuffer_epic branch, to be merged only once all epic tasks have been completed. PRs for epic tasks will be opened against this branch.