Summary: AllToAllv on a single node (all-NVL, nLocalRanks == nRanks) uses a kernel-only path (empty opGroup, no GPE host callback). The kernel reads per-peer send/recv metadata (count, displacement, peer rank) from KernelElem objects whose addresses are baked into the graph as kernel arguments at capture time. After the capture, the kernel resets status[*] = RESET on those elems, so the pool's reclaim() sees them as free and recycles them. If something else then pops the same elems from the pool (e.g., another eager AllToAllv call) and overwrites staged.count/staged.displ/staged.peerRank with new values, subsequent graph replays read the wrong metadata and are corrupted. To fix this, allocKernelElems(stream) marks the objects persistent during capture. This prevents reclaim() from recycling them; on graph destruction, the retained callback calls clearPersistent() + free() to return them to the pool. This also protects every other algorithm's KernelElems during graph capture, even though they did not share the bug (their host callbacks re-populate the elems on each replay). Differential Revision: D99361274
Summary: The current cuStreamBatchMemOp increment + reset isn't sufficient, because eager executions can compete with graph executions. The eager path expects monotonically increasing counters, but a graph's counter reset can break that invariant: suppose an eager reader calls ctranNextWaitSignalVal and gets some value >= 2, then a graph replay performs its increment-then-reset before that signal arrives. The reader is now stuck waiting for a value that has been reset to 1 (but it expects >= 2). There is also an asymmetric-comms case: a graph in which A signals B repeatedly. A can clobber its own signals (i.e., it sets 1 -> 1 before B has set the value back to 0), so B waits for a 1 that never comes, because its last wait actually covered multiple clobbered puts. We solve this by having the put kernel do a device-side increment of the signal value, mirroring what the host does in the non-graph case. Monotonically increasing values have none of the gaps above, and the eager path remains unchanged. Differential Revision: D99684144
Summary: CUDA graph capture doesn't support cudaMemcpyBatchAsync, so under capture we fall back to separate cudaMemcpyAsync calls. Differential Revision: D99489347
Summary: Graph replays bypass submit(), so we cannot know whether the previous operation was a graph replay or an eager call. A CPU-side sync ensures that any in-flight graph host node (which enqueues a GPE command) has fired before the caller can cmdEnqueue. Without it, the eager command lands in the GPE queue first and the single-threaded GPE deadlocks. For cross-stream eager work with no graphs, GPU-side ordering alone suffices: we make no thread-safety guarantees for submit(). Differential Revision: D99405326