fix some race conditions#1978

Open
dolpm wants to merge 4 commits into meta-pytorch:main from dolpm:export-D99405326

Conversation

@dolpm
Contributor

@dolpm dolpm commented Apr 7, 2026

Summary:
Graph replays bypass submit(), so we cannot know for certain whether the previous operation was a graph replay or eager. CPU-side sync ensures any in-flight graph host node (which enqueues a GPE command) has fired before the caller can cmdEnqueue. Without this, the eager command lands in the GPE queue first and the single-threaded GPE deadlocks.

Cross-stream eager, no graphs: GPU-side ordering only. We don't make any thread-safety guarantees for submit() so this is sufficient.

Differential Revision: D99405326
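As a rough illustration of why the CPU-side sync matters, here is a minimal host-side C++ sketch. All names (`gpeQueue`, `cmdEnqueue`, `submitEager`) are hypothetical stand-ins, not the actual API: the graph host node enqueues its GPE command asynchronously, and the eager path waits for it before enqueueing its own command, preserving the graph-first ordering the single-threaded GPE requires.

```cpp
#include <cassert>
#include <future>
#include <string>
#include <vector>

// Stand-in for the single-threaded GPE command queue.
std::vector<std::string> gpeQueue;

void cmdEnqueue(const std::string& cmd) { gpeQueue.push_back(cmd); }

// Eager submit(): CPU-side sync with any in-flight graph host node first,
// so its command is in the queue before ours. Without the wait, the eager
// command could land first and the GPE would deadlock.
void submitEager(std::future<void>& pendingHostNode) {
  pendingHostNode.wait();  // stand-in for the CPU-side stream synchronize
  cmdEnqueue("eager-cmd");
}
```

Skipping the `wait()` is exactly the failure mode described above: the two enqueues race, and the eager command can reach the queue before the graph's.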

dolpm added 4 commits April 7, 2026 12:23
Summary:
AllToAllv on single-node (all-NVL, nLocalRanks == nRanks) uses a kernel-only path (empty opGroup, no GPE host callback).
The kernel reads per-peer send/recv metadata (count, displacement, peer rank) from KernelElem objects whose addresses
are baked into the graph as kernel args when captured. After the capture, the kernel resets status[*] = RESET on those elems.
The pool's reclaim() sees them as free and recycles them. If something else pops the same elems from the pool (e.g., another
eager AllToAllv call) and overwrites staged.count/staged.displ/staged.peerRank with new values, subsequent replays will be corrupted.

To fix this, we can have allocKernelElems(stream) make the objects persistent. This prevents reclaim() from recycling them, and
on graph destruction, the retained callback calls clearPersistent() + free() to return them to the pool.

This also protects all other algorithms' KernelElems during graph capture, even though they didn't have the same bug
(their host callbacks re-populate the elems on each replay).

Differential Revision: D99361274
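The persistence scheme above can be sketched in a few lines of C++. This is an illustration only: `KernelElem`, the `status`/`persistent` fields, and the pool methods are invented names paraphrasing the description, not the real data structures.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Status { INUSE, RESET };

struct KernelElem {
  Status status = Status::RESET;
  bool persistent = false;  // set while owned by a captured graph
};

struct Pool {
  std::vector<KernelElem> elems;
  explicit Pool(std::size_t n) : elems(n) {}

  // Allocation during graph capture marks the elem persistent, so a later
  // kernel-side status reset cannot return it to circulation.
  KernelElem* alloc(bool capturing) {
    for (auto& e : elems) {
      if (e.status == Status::RESET && !e.persistent) {
        e.status = Status::INUSE;
        e.persistent = capturing;
        return &e;
      }
    }
    return nullptr;
  }

  // reclaim() treats RESET elems as free *unless* they are persistent.
  std::size_t reclaimable() const {
    std::size_t n = 0;
    for (const auto& e : elems)
      if (e.status == Status::RESET && !e.persistent) ++n;
    return n;
  }

  // Graph-destruction callback: clearPersistent() + free().
  void release(KernelElem* e) {
    e->persistent = false;
    e->status = Status::RESET;
  }
};
```

The key point is that the kernel resetting `status` after capture no longer makes the elem eligible for recycling; only the graph-destruction callback does.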
Summary:
the current cuStreamBatchMemOp increment + reset isn't sufficient.

we may have eager executions competing w/ graph executions. the eager path expects monotonically increasing counters, but a graph counter reset can interfere with this; consider the case where our eager reader calls ctranNextWaitSignalVal and gets some value >= 2, then some graph shows up and does its increment-then-reset before this signal arrives. the reader would be stuck waiting for a value that is now 1 (but it expects >= 2).

there is also an asymmetric comms case where a graph has A signal B repeatedly. it's possible that A will clobber its own signals (i.e., it sets 1->1 before B is able to set 1 back to 0). B will be waiting for a 1 that never comes, because its last wait was actually for multiple clobbered puts.

we can solve this by making the put kernel do a device-side increment on the signal value, similar to what the host does for the non-graph case. monotonically increasing values don't have the gaps mentioned above, and the eager path remains unchanged.

Differential Revision: D99684144
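A minimal sketch of the monotonic-counter idea, using a host-side atomic as a stand-in for the device-side increment (`putSignal`/`waitSatisfied` are illustrative names, not the real API). Because the counter only ever increases, a wait on "value >= N" can never be stranded by a reset the way the increment-then-reset scheme allowed.

```cpp
#include <atomic>
#include <cassert>

// Stand-in for the signal word the put kernel updates.
std::atomic<unsigned long long> signalVal{0};

// Put side: atomic increment (mirroring the host-side increment used on the
// non-graph path) instead of the old write-then-reset.
void putSignal() { signalVal.fetch_add(1, std::memory_order_release); }

// Eager reader: waits for the counter to reach its expected value. With a
// monotonic counter, this predicate, once true, stays true.
bool waitSatisfied(unsigned long long expected) {
  return signalVal.load(std::memory_order_acquire) >= expected;
}
```

Under the old scheme, a reset back to a small value could make `waitSatisfied(2)` false forever even though two puts had happened; the increment-only counter rules that out.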
Summary: CUDA graph capture doesn't support cudaMemcpyBatchAsync, so we fall back to separate cudaMemcpyAsync calls

Differential Revision: D99489347
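A sketch of a capture-aware dispatch under these assumptions (`CopyDesc` and the stand-in copy functions are invented; real code would check the stream's capture status, e.g. via cudaStreamIsCapturing, and issue one cudaMemcpyAsync per copy while capturing). Plain `memcpy` stands in for the CUDA copies so the logic is self-contained.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

struct CopyDesc {
  void* dst;
  const void* src;
  std::size_t bytes;
};

// Stand-in for a single cudaMemcpyAsync (graph-capturable).
void copyOne(const CopyDesc& d) { std::memcpy(d.dst, d.src, d.bytes); }

// Stand-in for cudaMemcpyBatchAsync (one call for all copies; not
// supported under graph capture).
void copyBatch(const std::vector<CopyDesc>& ds) {
  for (const auto& d : ds) std::memcpy(d.dst, d.src, d.bytes);
}

// Capture-aware dispatch: fall back to individual copies while capturing,
// use the batch API otherwise.
void copyAll(const std::vector<CopyDesc>& ds, bool capturing) {
  if (capturing) {
    for (const auto& d : ds) copyOne(d);  // graph-capturable path
  } else {
    copyBatch(ds);  // fast path outside capture
  }
}
```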
Summary:
Graph replays bypass submit(), so we cannot know for certain whether the previous operation was a graph replay or eager. CPU-side sync ensures any in-flight graph host node (which enqueues a GPE command) has fired before the caller can cmdEnqueue. Without this, the eager command lands in the GPE queue first and the single-threaded GPE deadlocks.

Cross-stream eager, no graphs: GPU-side ordering only. We don't make any thread-safety guarantees for submit() so this is sufficient.

Differential Revision: D99405326
@meta-cla meta-cla bot added the CLA Signed label Apr 7, 2026
@meta-codesync
Contributor

meta-codesync bot commented Apr 7, 2026

@dolpm has exported this pull request. If you are a Meta employee, you can view the originating Diff in D99405326.

