Summary: AllToAllv on a single node (all-NVL, nLocalRanks == nRanks) uses a kernel-only path (empty opGroup, no GPE host callback). The kernel reads per-peer send/recv metadata (count, displacement, peer rank) from KernelElem objects whose addresses are baked into the graph as kernel arguments at capture time. After the capture, the kernel resets status[*] = RESET on those elems, so the pool's reclaim() sees them as free and recycles them. If something else then pops the same elems from the pool (e.g., another eager AllToAllv call) and overwrites staged.count/staged.displ/staged.peerRank with new values, subsequent graph replays read the wrong metadata and are corrupted. To fix this, allocKernelElems(stream) marks the objects persistent during capture. This prevents reclaim() from recycling them; on graph destruction, the retained callback calls clearPersistent() + free() to return them to the pool. This also protects every other algorithm's KernelElems during graph capture, even though they did not share the bug (their host callbacks re-populate the elems on each replay). Differential Revision: D99361274
Summary: The current cuStreamBatchMemOp increment + reset isn't sufficient, because eager executions can compete with graph executions. The eager path expects monotonically increasing counters, but a graph's counter reset can break that invariant: suppose an eager reader calls ctranNextWaitSignalVal and gets some value >= 2, then a graph replay performs its increment-then-reset before that signal arrives. The reader is now stuck waiting for a value that has been reset to 1 (but it expects >= 2). There is also an asymmetric-comms case: a graph in which A signals B repeatedly. A can clobber its own signals (i.e., it sets 1 -> 1 before B has set the value back to 0), so B waits for a 1 that never comes, because its last wait actually covered multiple clobbered puts. We solve this by having the put kernel do a device-side increment of the signal value, mirroring what the host does in the non-graph case. Monotonically increasing values have none of the gaps above, and the eager path remains unchanged. Differential Revision: D99684144
Summary: CUDA graph capture doesn't support cudaMemcpyBatchAsync, so under capture we fall back to separate cudaMemcpyAsync calls. Differential Revision: D99489347
Summary: Graph replays bypass submit(), so we cannot know whether the previous operation was a graph replay or an eager call. A CPU-side sync ensures that any in-flight graph host node (which enqueues a GPE command) has fired before the caller can cmdEnqueue. Without it, the eager command lands in the GPE queue first and the single-threaded GPE deadlocks. For cross-stream eager work with no graphs, GPU-side ordering alone suffices: we make no thread-safety guarantees for submit(). Differential Revision: D99405326