Fix FlightRecorder trace analyzer to accept ncclx and gloo backends (#2013)#2013
Open
lilyjanjigian wants to merge 1 commit intometa-pytorch:mainfrom
Open
Fix FlightRecorder trace analyzer to accept ncclx and gloo backends (#2013)#2013lilyjanjigian wants to merge 1 commit intometa-pytorch:mainfrom
lilyjanjigian wants to merge 1 commit intometa-pytorch:mainfrom
Conversation
Contributor
|
@lilyjanjigian has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100117074. |
…eta-pytorch#2013) Summary: X-link: pytorch/pytorch#179842 Two fixes: 1. FR trace analyzer backend allowlist: D99020245 generalized FlightRecorder profiling_name to use the actual backend name (e.g. "ncclx:", "gloo:") instead of hardcoded "nccl:". The FR trace analyzer's Op class needs to accept these additional backend prefixes to avoid AssertionError when parsing traces from non-nccl backends. 2. Sync op post-hook retirement race: TorchComm::postHook() deferred all post-hook invocations (including flight recorder entry retirement) to a callback that fires when the work's status transitions to COMPLETED. For synchronous operations (async_op=false), nobody calls work->wait() and the status only transitions asynchronously via the watchdog thread, so retired_ was never set by the time dump_json(include_completed=False) was called. Fix: invoke post-hooks synchronously for sync ops, since they run on the current CUDA stream and are complete from the user's perspective when the function returns. Added async_op field to PostHookArgs to distinguish sync vs async. Reviewed By: dolpm Differential Revision: D100117074
6ff7764 to
295f9af
Compare
lilyjanjigian
added a commit
to lilyjanjigian/pytorch
that referenced
this pull request
Apr 9, 2026
…ytorch#179842) Summary: X-link: meta-pytorch/torchcomms#2013 Two fixes: 1. FR trace analyzer backend allowlist: D99020245 generalized FlightRecorder profiling_name to use the actual backend name (e.g. "ncclx:", "gloo:") instead of hardcoded "nccl:". The FR trace analyzer's Op class needs to accept these additional backend prefixes to avoid AssertionError when parsing traces from non-nccl backends. 2. Sync op post-hook retirement race: TorchComm::postHook() deferred all post-hook invocations (including flight recorder entry retirement) to a callback that fires when the work's status transitions to COMPLETED. For synchronous operations (async_op=false), nobody calls work->wait() and the status only transitions asynchronously via the watchdog thread, so retired_ was never set by the time dump_json(include_completed=False) was called. Fix: invoke post-hooks synchronously for sync ops, since they run on the current CUDA stream and are complete from the user's perspective when the function returns. Added async_op field to PostHookArgs to distinguish sync vs async. Test Plan: Previously all tests below were failing: fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_default_memalloc_expseg_transport_baseline: https://www.internalfb.com/intern/testinfra/testrun/33495522229099048 fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_default_memalloc_default_transport_baseline: https://www.internalfb.com/intern/testinfra/testrun/4785074963057659 fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_default_memalloc_default_transport_ctran: https://www.internalfb.com/intern/testinfra/testrun/34058472182524345 fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_fast_memalloc_default_transport_baseline: https://www.internalfb.com/intern/testinfra/testrun/18858823453032553 fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_default_memalloc_expseg_transport_ctran: https://www.internalfb.com/intern/testinfra/testrun/12103424162971880 fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_fast_memalloc_expseg_transport_baseline: https://www.internalfb.com/intern/testinfra/testrun/23362423080399978 fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_fast_memalloc_default_transport_ctran: https://www.internalfb.com/intern/testinfra/testrun/9288674396164707 fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_fast_memalloc_expseg_transport_ctran: https://www.internalfb.com/intern/testinfra/testrun/30680772461998465 Reviewed By: dolpm Differential Revision: D100117074
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary:
X-link: pytorch/pytorch#179842
Two fixes:
Reviewed By: dolpm
Differential Revision: D100117074