Skip to content

Fix FlightRecorder trace analyzer to accept ncclx and gloo backends (#2013)#2013

Open
lilyjanjigian wants to merge 1 commit intometa-pytorch:mainfrom
lilyjanjigian:export-D100117074
Open

Fix FlightRecorder trace analyzer to accept ncclx and gloo backends (#2013)#2013
lilyjanjigian wants to merge 1 commit intometa-pytorch:mainfrom
lilyjanjigian:export-D100117074

Conversation

@lilyjanjigian
Copy link
Copy Markdown
Contributor

@lilyjanjigian lilyjanjigian commented Apr 9, 2026

Summary:

X-link: pytorch/pytorch#179842

Two fixes:

  1. FR trace analyzer backend allowlist: D99020245 generalized FlightRecorder profiling_name to use the actual backend name (e.g. "ncclx:", "gloo:") instead of hardcoded "nccl:". The FR trace analyzer's Op class needs to accept these additional backend prefixes to avoid AssertionError when parsing traces from non-nccl backends.
  2. Sync op post-hook retirement race: TorchComm::postHook() deferred all post-hook invocations (including flight recorder entry retirement) to a callback that fires when the work's status transitions to COMPLETED. For synchronous operations (async_op=false), nobody calls work->wait() and the status only transitions asynchronously via the watchdog thread, so retired_ was never set by the time dump_json(include_completed=False) was called. Fix: invoke post-hooks synchronously for sync ops, since they run on the current CUDA stream and are complete from the user's perspective when the function returns. Added async_op field to PostHookArgs to distinguish sync vs async.

Reviewed By: dolpm

Differential Revision: D100117074

@meta-cla meta-cla bot added the CLA Signed This label is managed by the Meta Open Source bot. label Apr 9, 2026
@meta-codesync
Copy link
Copy Markdown
Contributor

meta-codesync bot commented Apr 9, 2026

@lilyjanjigian has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100117074.

…eta-pytorch#2013)

Summary:

X-link: pytorch/pytorch#179842

Two fixes:                                                             
                                                            
  1. FR trace analyzer backend allowlist: D99020245 generalized FlightRecorder profiling_name to use the actual backend name (e.g. "ncclx:", "gloo:") instead of hardcoded "nccl:". The FR trace analyzer's Op class needs to accept these additional backend prefixes to avoid AssertionError when parsing traces from non-nccl backends.
  2. Sync op post-hook retirement race: TorchComm::postHook() deferred all post-hook invocations (including flight recorder entry retirement) to a callback that fires when the work's status transitions to COMPLETED. For synchronous operations (async_op=false), nobody calls work->wait() and the status only transitions asynchronously via the watchdog thread, so retired_ was never set by the time dump_json(include_completed=False) was called. Fix: invoke post-hooks synchronously for sync ops, since they run on the current CUDA stream and are complete from the user's perspective when the function returns. Added async_op field to PostHookArgs to distinguish sync vs async.

Reviewed By: dolpm

Differential Revision: D100117074
@meta-codesync meta-codesync bot changed the title Fix FlightRecorder trace analyzer to accept ncclx and gloo backends Fix FlightRecorder trace analyzer to accept ncclx and gloo backends (#2013) Apr 9, 2026
lilyjanjigian added a commit to lilyjanjigian/pytorch that referenced this pull request Apr 9, 2026
…ytorch#179842)

Summary:
X-link: meta-pytorch/torchcomms#2013


Two fixes:                                                             
                                                            
  1. FR trace analyzer backend allowlist: D99020245 generalized FlightRecorder profiling_name to use the actual backend name (e.g. "ncclx:", "gloo:") instead of hardcoded "nccl:". The FR trace analyzer's Op class needs to accept these additional backend prefixes to avoid AssertionError when parsing traces from non-nccl backends.
  2. Sync op post-hook retirement race: TorchComm::postHook() deferred all post-hook invocations (including flight recorder entry retirement) to a callback that fires when the work's status transitions to COMPLETED. For synchronous operations (async_op=false), nobody calls work->wait() and the status only transitions asynchronously via the watchdog thread, so retired_ was never set by the time dump_json(include_completed=False) was called. Fix: invoke post-hooks synchronously for sync ops, since they run on the current CUDA stream and are complete from the user's perspective when the function returns. Added async_op field to PostHookArgs to distinguish sync vs async.

Test Plan:
Previously all tests below were failing:

fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_default_memalloc_expseg_transport_baseline: https://www.internalfb.com/intern/testinfra/testrun/33495522229099048

fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_default_memalloc_default_transport_baseline: https://www.internalfb.com/intern/testinfra/testrun/4785074963057659

fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_default_memalloc_default_transport_ctran: https://www.internalfb.com/intern/testinfra/testrun/34058472182524345

fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_fast_memalloc_default_transport_baseline: https://www.internalfb.com/intern/testinfra/testrun/18858823453032553

fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_default_memalloc_expseg_transport_ctran: https://www.internalfb.com/intern/testinfra/testrun/12103424162971880

fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_fast_memalloc_expseg_transport_baseline: https://www.internalfb.com/intern/testinfra/testrun/23362423080399978

fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_fast_memalloc_default_transport_ctran: https://www.internalfb.com/intern/testinfra/testrun/9288674396164707

fbcode//comms/torchcomms/hooks/fr/tests/py:FlightRecorderTest_1x8_backend_ncclx_initmode_fast_memalloc_expseg_transport_ctran: https://www.internalfb.com/intern/testinfra/testrun/30680772461998465

Reviewed By: dolpm

Differential Revision: D100117074
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CLA Signed This label is managed by the Meta Open Source bot. fb-exported meta-exported

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant