Fix NCCL Error 7 by removing eager_connect_single_device #320

Open
d4l3k wants to merge 2 commits into main from fix-nccl-test-sync

@d4l3k d4l3k commented Mar 24, 2026

Summary

  • Remove eager_connect_single_device from ProcessGroupNCCL._create_pg to fix "NCCL Error 7: operation in progress" failures
  • In non-blocking NCCL mode (blocking=False), eager_connect_single_device starts an async communicator init that cannot be waited on through Python APIs. The first collective then fails because the init hasn't completed.
  • Without the eager_connect_single_device call, the communicator is lazily initialized by the first collective, which properly handles the async init inside its ncclGroupStart/ncclGroupEnd context.
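
To make the failure mode concrete, here is a toy, pure-Python model of the two paths described above. It does not call NCCL or PyTorch. `ToyComm`, `collective_after_eager_connect`, and `collective_with_lazy_init` are hypothetical names invented for this sketch, and the tick-based init is an assumption standing in for real asynchronous communicator setup. The only detail taken from NCCL itself is that result code 7 is `ncclInProgress`.

```python
from enum import IntEnum


class NcclResult(IntEnum):
    """Subset of NCCL result codes that matter for this sketch."""

    SUCCESS = 0
    IN_PROGRESS = 7  # ncclInProgress, the "Error 7" in the failure message


class ToyComm:
    """Hypothetical stand-in for a non-blocking communicator.

    Assumption: init finishes only after `ticks_needed` calls to poll().
    """

    def __init__(self, ticks_needed: int = 3) -> None:
        self.ticks_needed = ticks_needed
        self.ticks = 0
        self.init_started = False

    def start_init(self) -> None:
        # Non-blocking: returns immediately, init is still pending.
        self.init_started = True

    def poll(self) -> NcclResult:
        # Advance the pending init by one step and report its state.
        if self.ticks < self.ticks_needed:
            self.ticks += 1
        if self.ticks >= self.ticks_needed:
            return NcclResult.SUCCESS
        return NcclResult.IN_PROGRESS

    def ready(self) -> bool:
        return self.init_started and self.ticks >= self.ticks_needed


def collective_after_eager_connect(comm: ToyComm) -> NcclResult:
    """Eager path: init is kicked off, but the caller has no way to wait."""
    comm.start_init()
    # The first collective is issued right away and finds init unfinished.
    return NcclResult.SUCCESS if comm.ready() else NcclResult.IN_PROGRESS


def collective_with_lazy_init(comm: ToyComm) -> NcclResult:
    """Lazy path: the collective starts init and drives it to completion."""
    comm.start_init()
    while comm.poll() is NcclResult.IN_PROGRESS:
        pass  # the collective itself waits for the pending init
    return NcclResult.SUCCESS if comm.ready() else NcclResult.IN_PROGRESS


eager_result = collective_after_eager_connect(ToyComm())
lazy_result = collective_with_lazy_init(ToyComm())
print(eager_result.name, int(eager_result))
print(lazy_result.name, int(lazy_result))
```

In this model the eager path reports `IN_PROGRESS` because nothing between `start_init()` and the collective advances the init, while the lazy path succeeds because the collective owns the wait. The real communicator setup is more involved, so treat this only as an illustration of the ordering problem.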

Test plan

  • CI GPU tests (QuantizedAllReduceTest) should pass. They have been failing since ~Dec 2025 due to a PyTorch nightly regression in the interaction between non-blocking NCCL and eager_connect_single_device.

The QuantizedAllReduceTest and QuantizedReduceScatterTest tests were
failing with "NCCL Error 7: NCCL operation in progress" because
non-blocking NCCL work.wait() returns before NCCL is ready for the
next operation. Adding cuda.synchronize() between iterations ensures
operations are fully complete.

meta-cla bot added the CLA Signed label (managed by the Meta Open Source bot) Mar 24, 2026

In non-blocking NCCL mode (blocking=False), eager_connect_single_device
starts an async communicator init that cannot be waited on through Python
APIs. When the first collective is subsequently called, NCCL returns
"Error 7: operation in progress" because the init hasn't completed.

Remove the eager_connect call and let the communicator be lazily
initialized by the first collective, which properly handles the async
init inside its ncclGroupStart/ncclGroupEnd context.

Also reverts the cuda.synchronize() workaround in collectives_test.py,
which was insufficient.

d4l3k changed the title from "Fix NCCL non-blocking test failures with cuda.synchronize()" to "Fix NCCL Error 7 by removing eager_connect_single_device" Mar 24, 2026