
fix RMATest failure and reduce testing cost (#2002) #2002

Open
tianfengfrank wants to merge 2 commits into main from export-D100051366

Conversation


tianfengfrank (Contributor) commented Apr 8, 2026

Summary:

The NvlEnabledTestParam test was hanging because it had two NCCL communicators alive simultaneously — and it only needed one.

How it happened: D95122239 migrated RMATest from ::testing::Test to NcclxBaseTest, which made SetUp() create a full NCCL communicator (this->comm) for every test. This was correct for RMATestParam and MultiWindowTestParam, which use this->comm. But NvlEnabledTestParam never uses it: it creates its own communicator with parameterized backends (NVL+IB or IB-only). Nobody noticed that the test body was now running with two active comms.

Why it hangs: After 97 prior test cycles (each creating/destroying a comm with ~54MB pinned memory across 8 ranks), the CUDA pinned memory pool is fragmented. When NvlEnabledTestParam tries to bootstrap its second comm while the fixture's comm is still alive, one or more ranks can't complete cudaHostAlloc → ncclSocketAccept blocks waiting for that rank → all ranks hang → parent process sends SIGTERM.

The fix: Override SetUp()/TearDown() in NvlEnabledTestParam to skip the unnecessary fixture comm creation. Now only one comm exists at a time, and the bootstrap completes cleanly.
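
A minimal sketch of the shape of this fix, assuming NcclxBaseTest exposes virtual SetUp()/TearDown() in the usual gtest style; the fixture layout below is an assumption for illustration, not the actual diff:

```cpp
#include <gtest/gtest.h>

// NcclxBaseTest comes from the ncclx test suite; everything below is
// a hypothetical sketch of the override described in this summary.
class NvlEnabledTestParam : public NcclxBaseTest {
 protected:
  // Skip the base-class SetUp(), which would create the fixture's
  // this->comm. The test body builds its own communicator with the
  // parameterized backends (NVL+IB or IB-only), so only one comm is
  // alive at a time and its bootstrap never competes with a fixture
  // comm for pinned memory.
  void SetUp() override {}

  // Skip the base-class TearDown() for the same reason: there is no
  // fixture comm to destroy.
  void TearDown() override {}
};
```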

NOTE: We also decrease NumIter to save test resources and add re_timeout to avoid false alarms.

Reviewed By: dolpm

Differential Revision: D100051366

meta-cla bot added the CLA Signed label Apr 8, 2026

meta-codesync bot commented Apr 8, 2026

@tianfengfrank has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100051366.

meta-codesync bot changed the title from "increase test timeout and reduce numIter of RMA test" to "fix RMATest failure and reduce testing cost (#2002)" Apr 8, 2026
meta-codesync bot pushed a commit that referenced this pull request Apr 8, 2026
meta-codesync bot force-pushed the export-D100051366 branch from 45db8d4 to ea9ee5d April 8, 2026 21:32
meta-codesync bot pushed a commit that referenced this pull request Apr 8, 2026
meta-codesync bot pushed a commit that referenced this pull request Apr 8, 2026
tianfengfrank pushed a commit to tianfengfrank/torchcomms-1 that referenced this pull request Apr 8, 2026
Summary: Pull Request resolved: meta-pytorch#2002

Differential Revision: D100051366
Summary:

We need to align AGP's internal persistent request with the window API so that regular AllGather can be converted to AGP in graph capture mode (window init + dry-run exec at capture time, SM-free CE replay at execution time). In this diff, we extract `initResources` into a public function so that window AGP can call it directly.
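
As a rough illustration of the visibility change (the class name and members here are assumptions for illustration; only the fact that `initResources` becomes a public function comes from the summary):

```cpp
// Hypothetical sketch; the real AGP class and signature will differ.
class AllGatherPersistent {
 public:
  // Previously an internal helper driven only by the regular
  // AllGather path. Exposed publicly so the window-based AGP path
  // can call it directly at graph-capture time (window init plus a
  // dry-run exec), leaving execution-time replay to the SM-free CE.
  void initResources();

  // ... remaining persistent-request API unchanged ...
};
```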

Reviewed By: dsjohns2

Differential Revision: D99514784
meta-codesync bot force-pushed the export-D100051366 branch from ea9ee5d to 43cabb0 April 9, 2026 02:07
tianfengfrank pushed a commit to tianfengfrank/torchcomms-1 that referenced this pull request Apr 9, 2026
Summary: Pull Request resolved: meta-pytorch#2002

Differential Revision: D100051366
tianfengfrank added a commit to tianfengfrank/torchcomms-1 that referenced this pull request Apr 9, 2026
Summary:
Pull Request resolved: meta-pytorch#2002

- the ncclx tests were migrated to meta/tests, but RMATest was left behind
- this diff migrates RMATest to meta/tests, shared by all ncclx versions
- it also simplifies the UT to save resources, since we have similar tests at the Ctran level as well, e.g. (see the sketch after this list):
    - reduce kNumIters from 500 to 10
    - remove the winAlloc test, since winRegister will be the only entry point later
    - consolidate the Put/PutSignal test cases
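
A hedged sketch of the consolidated, cheaper test shape (kNumIters comes from the list above; the fixture name and the doPut/doPutSignal helpers are hypothetical stand-ins for the real RMA window operations, not the meta/tests code):

```cpp
#include <gtest/gtest.h>

constexpr int kNumIters = 10; // reduced from 500

// Hypothetical stand-ins for the real RMA window ops.
void doPut(int /*iter*/) {}
void doPutSignal(int /*iter*/) {}

// One parameterized body covers both Put and PutSignal instead of
// two near-identical test cases.
class PutSignalTest : public ::testing::TestWithParam<bool> {};

TEST_P(PutSignalTest, PutMaybeWithSignal) {
  const bool withSignal = GetParam();
  for (int i = 0; i < kNumIters; ++i) {
    if (withSignal) {
      doPutSignal(i); // put + completion signal in one call
    } else {
      doPut(i);
    }
  }
}

INSTANTIATE_TEST_SUITE_P(RMA, PutSignalTest, ::testing::Bool());
```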

Reviewed By: MogicianWu, dolpm

Differential Revision: D100051366
tianfengfrank added a commit to tianfengfrank/torchcomms-1 that referenced this pull request Apr 9, 2026
Summary:
Pull Request resolved: meta-pytorch#2017

Pull Request resolved: meta-pytorch#2002

Differential Revision: D100051366
tianfengfrank added a commit to tianfengfrank/torchcomms-1 that referenced this pull request Apr 9, 2026
tianfengfrank added a commit to tianfengfrank/torchcomms-1 that referenced this pull request Apr 9, 2026
tianfengfrank added a commit to tianfengfrank/torchcomms-1 that referenced this pull request Apr 9, 2026
tianfengfrank pushed a commit to tianfengfrank/torchcomms-1 that referenced this pull request Apr 9, 2026
Summary: Pull Request resolved: meta-pytorch#2002

Differential Revision: D100051366

Labels

CLA Signed · fb-exported · meta-exported
