Add ForkBasedTestDriver for fork+exec multi-process tests #2006

Open
Scusemua wants to merge 1 commit into meta-pytorch:main from Scusemua:export-D100079492

Scusemua commented Apr 9, 2026

Summary:

Introduce a lightweight fork+exec test driver that re-execs the test binary as worker subprocesses with TCPStore-based coordination. This enables ncclx tests to inspect worker exit codes (e.g., watchdog crash tests) without depending on MCCL's heavier Thrift-based CollectiveIntegrationTestMixin.

Also adds unit tests (ForkBasedTestDriverTest.cc) covering:

  • Basic multi-rank success with KV round-trip
  • Exact exit code capture
  • Signal-terminated worker reporting (128 + signal)
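
The exit code and signal cases listed above can be illustrated with a short sketch of decoding a `waitpid()` status, assuming a POSIX system. The function names `exitCodeFromStatus` and `runAndWait` are hypothetical and not taken from the diff; the 128 + signal mapping follows the convention named in the summary.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

#include <cerrno>
#include <string>
#include <vector>

// Hypothetical helper: map a waitpid() status to a single integer.
//   normal exit        -> the exact exit code (0..255)
//   killed by a signal -> 128 + signal number
//   anything else      -> -1
int exitCodeFromStatus(int status) {
  if (WIFEXITED(status)) {
    return WEXITSTATUS(status);
  }
  if (WIFSIGNALED(status)) {
    return 128 + WTERMSIG(status);
  }
  return -1;
}

// Hypothetical helper: fork+exec `argv`, block until the child terminates,
// and return its decoded exit code. Returns -1 on fork/wait failure.
int runAndWait(const std::vector<std::string>& argv) {
  if (argv.empty()) {
    return -1;
  }
  std::vector<char*> cargv;
  for (const auto& a : argv) {
    cargv.push_back(const_cast<char*>(a.c_str()));
  }
  cargv.push_back(nullptr);

  const pid_t pid = fork();
  if (pid < 0) {
    return -1;
  }
  if (pid == 0) {
    execv(cargv[0], cargv.data());
    _exit(127);  // Only reached if execv() failed.
  }

  int status = 0;
  // Retry if the wait is interrupted by a signal delivered to the parent.
  while (waitpid(pid, &status, 0) < 0) {
    if (errno != EINTR) {
      return -1;
    }
  }
  return exitCodeFromStatus(status);
}
```

For example, `runAndWait({"/bin/sh", "-c", "exit 7"})` yields 7, while a child that terminates itself with `SIGKILL` (signal 9 on Linux) is reported as 137.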

Differential Revision: D100079492

meta-cla bot added the CLA Signed label Apr 9, 2026
meta-codesync bot commented Apr 9, 2026

@Scusemua has exported this pull request. If you are a Meta employee, you can view the originating Diff in D100079492.

meta-codesync bot changed the title from "Add ForkBasedTestDriver for fork+exec multi-process tests" to "Add ForkBasedTestDriver for fork+exec multi-process tests (#2006)" Apr 9, 2026
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request Apr 9, 2026
…ch#2006)

@Scusemua Scusemua force-pushed the export-D100079492 branch from 837130b to 93f7aa0 Compare April 9, 2026 16:26
…ch#2006)

Summary:
Pull Request resolved: meta-pytorch#2006

@Scusemua Scusemua force-pushed the export-D100079492 branch from 93f7aa0 to 110ca12 Compare April 9, 2026 16:31
Scusemua added a commit to Scusemua/torchcomms that referenced this pull request Apr 9, 2026
…ch#2006)

Scusemua added a commit to Scusemua/torchcomms that referenced this pull request Apr 9, 2026
…ch#2006)

Labels

CLA Signed, fb-exported, meta-exported
