fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter by nathon-lee · Pull Request #7878 · deepspeedai/DeepSpeed

nathon-lee · 2026-02-27T06:46:02Z

fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter

This commit addresses the issue where cross-partition parameters received incorrect updates when using ZeRO-1/ZeRO-2 with reduce_scatter=true and Muon optimizer. The Newton-Schulz orthogonalization in Muon requires complete gradient information, which wasn't available when reduce_scatter was enabled.

The fix introduces a check for Muon parameters and forces full all-reduce gradient reduction for these cases, ensuring consistent parameter updates across all ranks.
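To illustrate why the orthogonalization needs the complete gradient, here is a minimal pure-Python sketch. It uses the classic cubic Newton-Schulz iteration (`X <- 1.5 X - 0.5 X X^T X`), which is an assumption for illustration; Muon's actual implementation may use a differently tuned polynomial. The matrices and the helper names (`matmul`, `newton_schulz`) are hypothetical. The two inputs share the same first row, yet the first row of the orthogonalized result differs, so a rank cannot compute its slice of the update from a partly reduced matrix.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def newton_schulz(g, steps=20):
    """Classic cubic Newton-Schulz iteration (illustrative, not Muon's exact variant).

    After scaling by the Frobenius norm, every singular value is at most 1,
    and the iteration drives them all towards 1.
    """
    norm = math.sqrt(sum(v * v for row in g for v in row))
    x = [[v / norm for v in row] for row in g]
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * xv - 0.5 * yv for xv, yv in zip(xr, yr)]
             for xr, yr in zip(x, xxtx)]
    return x


# Fully reduced gradient of a hypothetical 2x2 weight.
full_grad = [[2.0, 1.0], [2.0, 3.0]]
# Same weight as seen by a rank that owns only row 0:
# row 0 is reduced, row 1 still holds that rank's local values.
partial_grad = [[2.0, 1.0], [3.0, 4.0]]

ortho_full = newton_schulz(full_grad)
ortho_partial = newton_schulz(partial_grad)

# Largest difference within the row this rank "owns".
row0_gap = max(abs(p - q) for p, q in zip(ortho_full[0], ortho_partial[0]))
print(f"row-0 gap between full and partial orthogonalization: {row0_gap:.3f}")
```

The owned row of the input is identical in both cases, but its orthogonalized counterpart is not, which is the sense in which the operation is whole-matrix.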

Closes #7807

Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>

@sfc-gh-truwase

Use ZeRO stage 1 to use BF16 optimizer. (We should have switched to ZeRO1 in deepspeedai#7788, but I missed the change. @sfc-gh-truwase)

- deepspeedai#7790 removed the fallback that allowed bf16 model + fp32 grad accumulation without ZeRO, so that combo now raises NotImplementedError.
- deepspeedai#7788 changed test_bf16_optimizer_fragments to force BF16_Optimizer by setting grad_accum_dtype=fp32, but it kept ZeRO stage 0, which is now invalid after deepspeedai#7790.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>

Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4665aa9345


chatgpt-codex-connector · 2026-02-27T06:48:26Z

+        if self.reduce_scatter and self.uses_muon:
+            assert False, f"{self.zero_stage_string} with reduce_scatter=True is incompatible with Muon optimizer. Please disable reduce_scatter or use a different optimizer."


Allow Muon fallback instead of aborting on reduce_scatter

This new assert False makes ZeRO-1/2 initialization fail whenever reduce_scatter=True and any parameter has use_muon=True, which blocks the exact training configuration this change is trying to handle. The later average_tensor change already adds a fallback path (all-reduce when Muon is present), so this assertion prevents that fix from ever running in normal Python execution and turns a correctness bug into a hard runtime failure for Muon users.
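The fallback the reviewer describes could look roughly like the sketch below. The function name `select_grad_reduction` and its string return values are hypothetical and are not DeepSpeed's API; the point is only that the choice is made per configuration instead of aborting. Note also that `assert` statements are removed when Python runs with `-O`, so an `assert False` guard is not a reliable way to enforce a constraint either.

```python
def select_grad_reduction(reduce_scatter: bool, uses_muon: bool) -> str:
    """Pick a gradient reduction strategy (hypothetical helper, for illustration).

    Muon needs the full reduced gradient of each matrix, so it falls back
    to all-reduce even when reduce_scatter is requested.
    """
    if reduce_scatter and not uses_muon:
        return "reduce_scatter"
    return "all_reduce"


# Muon present: fall back to all-reduce rather than failing at init.
print(select_grad_reduction(reduce_scatter=True, uses_muon=True))
# No Muon parameters: keep the cheaper reduce-scatter path.
print(select_grad_reduction(reduce_scatter=True, uses_muon=False))
```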


@PKUWZP

## Summary

ZeRO-1/2 silently produces incorrect, rank-divergent parameter updates when the Muon optimizer is used together with `reduce_scatter` (the default). This adds an explicit error at initialization, mirroring the existing ZeRO-3 guard, and includes a regression test. Closes deepspeedai#7807.

## Root cause

Muon's Newton-Schulz orthogonalization is a whole-matrix operation: the rank that updates a parameter must hold that parameter's complete, fully-reduced gradient matrix, then take its partition slice of the orthogonalized result.

- `get_flat_partition()` (`deepspeed/runtime/zero/stage_1_and_2.py`) applies `muon_update()` to each parameter's gradient reshaped to its full 2-D shape, and only then narrows to this rank's partition.
- With `reduce_scatter=True`, `average_tensor()` reduce-scatters the gradients: each rank receives the averaged values only for its own partition slice. For the rest of a parameter whose flattened gradient crosses a partition boundary, the rank still holds its local, un-all-reduced gradient.
- So for any cross-partition parameter, no rank holds the full reduced matrix. `muon_update` orthogonalizes a partly-reduced, rank-divergent matrix, and each rank silently applies a different, incorrect update. Parameters that lie wholly inside one partition are unaffected, exactly matching the report.

ZeRO-3 already guards this exact conflict in `deepspeed/runtime/zero/stage3.py` (added in deepspeedai#7919):

```python
if self.use_muon and self.reduce_scatter:
    raise ValueError("Muon and reduce scatter cannot be used together")
```

ZeRO-1/2 had no equivalent. The existing Muon unit tests pin `"reduce_scatter": false` everywhere, which implicitly acknowledges the path is unsupported but never enforces it for users. Since `reduce_scatter` defaults to `true`, a default Muon + ZeRO-1/2 run is silently wrong.
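The root cause can be reproduced with a toy model of the flat gradient buffer. This is a pure-Python simulation under stated assumptions, not DeepSpeed code: the buffer layout, the partition size, the gradient values, and the helper names (`all_reduce_avg`, `reduce_scatter_view`, `param_slice`) are all hypothetical. It models reduce-scatter as "each rank gets averaged values only inside its own partition and keeps its local values elsewhere", as described above.

```python
WORLD_SIZE = 2
PARTITION = 4  # elements of the flat gradient buffer owned by each rank

# Hypothetical layout of the flat buffer: parameter name -> (start, end).
# "w" (a 2x2 weight, flattened) straddles the partition boundary at index 4.
LAYOUT = {"a": (0, 2), "w": (2, 6), "b": (6, 8)}

# Rank-local gradients before any communication (made-up values).
local_grads = [
    [1.0, 1.0, 2.0, 4.0, 6.0, 8.0, 1.0, 1.0],  # rank 0
    [3.0, 3.0, 4.0, 0.0, 2.0, 4.0, 5.0, 5.0],  # rank 1
]


def all_reduce_avg(grads):
    """Every rank ends up with the element-wise average of all ranks."""
    return [sum(col) / len(grads) for col in zip(*grads)]


def reduce_scatter_view(grads, rank):
    """Buffer as seen by `rank` after reduce-scatter in this toy model:
    averaged inside its own partition, still local everywhere else."""
    avg = all_reduce_avg(grads)
    view = list(grads[rank])
    lo, hi = rank * PARTITION, (rank + 1) * PARTITION
    view[lo:hi] = avg[lo:hi]
    return view


def param_slice(buf, name):
    start, end = LAYOUT[name]
    return buf[start:end]


full = all_reduce_avg(local_grads)
views = [reduce_scatter_view(local_grads, r) for r in range(WORLD_SIZE)]

print("w, fully reduced:", param_slice(full, "w"))
for r in range(WORLD_SIZE):
    print(f"w as seen by rank {r}:", param_slice(views[r], "w"))
```

In this model the wholly-owned parameters `a` (rank 0) and `b` (rank 1) match the fully reduced values, while each rank's view of `w` mixes averaged and local entries and matches neither the full result nor the other rank's view.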
## Fix

Mirror the ZeRO-3 guard in ZeRO-1/2: raise the same `ValueError` at initialization when the optimizer is `MuonWithAuxAdam` and `reduce_scatter` is enabled. To run Muon under ZeRO-1/2, set `"reduce_scatter": false` (as the Muon tests already do). The change is the import plus the guard, with no other behavioral change.

## Verification (2x RTX 4090, torch 2.9.1+cu128, ZeRO stage 1 and 2)

- **Before**: `deepspeed.initialize` with Muon + `reduce_scatter=true` succeeds silently. With `world_size=2` and a model sized so a 2-D weight straddles the gradient-partition boundary, that weight's post-step update diverges by ~0.67 in relative Frobenius norm from the correct full-gradient result, while wholly-owned weights are unaffected, confirming the silent cross-partition corruption.
- **After**: the same configuration raises `ValueError: Muon and reduce scatter cannot be used together` for both ZeRO stage 1 and 2. The existing Muon tests (which use `reduce_scatter: false`) remain green.

## Notes

This supersedes deepspeedai#7878 and deepspeedai#7808, which aimed at the same issue by trying to force a full all-reduce for Muon but ended up with a self-contradictory guard. Aligning ZeRO-1/2 with the merged ZeRO-3 behavior (deepspeedai#7919) keeps the two code paths consistent and turns silent numerical corruption into a clear, actionable error. A follow-up PR adds a numerical-correctness regression test for the supported `reduce_scatter: false` Muon path, since the current Muon tests only assert that parameters changed.

Closes deepspeedai#7807

cc @PKUWZP @pengdurice (ZeRO-3 Muon guard, deepspeedai#7919) @tohtana

Signed-off-by: whycoming <alwaysxd666@gmail.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
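For reference, the guard and the workaround described under Fix can be sketched as follows. This is a standalone illustration, not the code that was merged: `check_muon_reduce_scatter` is a hypothetical helper, and only the `zero_optimization` section of the config is shown because the optimizer section is not spelled out in this message. The default of `True` for `reduce_scatter` follows the statement above.

```python
def check_muon_reduce_scatter(zero_config: dict, uses_muon: bool) -> None:
    """Reject Muon together with reduce_scatter (hypothetical standalone guard).

    `reduce_scatter` is treated as enabled when the key is absent,
    matching the default described in the commit message.
    """
    reduce_scatter = zero_config.get("reduce_scatter", True)
    if uses_muon and reduce_scatter:
        raise ValueError("Muon and reduce scatter cannot be used together")


# Workaround: disable reduce_scatter explicitly when running Muon under ZeRO-1/2.
ds_config = {
    "zero_optimization": {
        "stage": 2,
        "reduce_scatter": False,
    },
}

# Passes: reduce_scatter is off, so Muon sees fully reduced gradients.
check_muon_reduce_scatter(ds_config["zero_optimization"], uses_muon=True)

# A config that relies on the default would be rejected.
try:
    check_muon_reduce_scatter({"stage": 1}, uses_muon=True)
except ValueError as err:
    print(f"rejected: {err}")
```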



nathon-lee and others added 6 commits January 24, 2026 05:01

fix: Ensure full gradient reduction for Muon with reduce_scatter

1dc4122

Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>

Update stage_1_and_2.py

a873854

Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>

Update stage_1_and_2.py

15996a9

Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>

Merge branch 'deepspeedai:master' into fix_cp_7807

196c7ae

Merge branch 'deepspeedai:master' into fix_cp_7807

4665aa9

nathon-lee requested review from tjruwase and tohtana as code owners February 27, 2026 06:46

chatgpt-codex-connector Bot reviewed Feb 27, 2026


tjruwase requested a review from PKUWZP March 1, 2026 19:58

nathon-lee changed the title ~~Fix cp 7807~~ fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter Mar 6, 2026

whycoming mentioned this pull request Jun 25, 2026

Reject Muon optimizer with reduce_scatter in ZeRO-1/2 #8090

Merged

nathon-lee wants to merge 6 commits into deepspeedai:master from nathon-lee:fix_cp_7807
