fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter#7878
fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter#7878nathon-lee wants to merge 6 commits into
Conversation
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Use ZeRO stage 1 to use BF16 optimizer. (We should have switched to ZeRO1 in deepspeedai#7788, but I missed the change. @sfc-gh-truwase) - deepspeedai#7790 removed the fallback that allowed bf16 model + fp32 grad accumulation without ZeRO, so that combo now raises NotImplementedError. - deepspeedai#7788 changed test_bf16_optimizer_fragments to force BF16_Optimizer by setting grad_accum_dtype=fp32, but it kept ZeRO stage 0, which is now invalid after deepspeedai#7790. Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com> Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 4665aa9345
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if self.reduce_scatter and self.uses_muon: | ||
| assert False, f"{self.zero_stage_string} with reduce_scatter=True is incompatible with Muon optimizer. Please disable reduce_scatter or use a different optimizer." |
There was a problem hiding this comment.
Allow Muon fallback instead of aborting on reduce_scatter
This new assert False makes ZeRO-1/2 initialization fail whenever reduce_scatter=True and any parameter has use_muon=True, which blocks the exact training configuration this change is trying to handle. The later average_tensor change already adds a fallback path (all-reduce when Muon is present), so this assertion prevents that fix from ever running in normal Python execution and turns a correctness bug into a hard runtime failure for Muon users.
Useful? React with 👍 / 👎.
## Summary ZeRO-1/2 silently produces incorrect, rank-divergent parameter updates when the Muon optimizer is used together with `reduce_scatter` (the default). This adds an explicit error at initialization, mirroring the existing ZeRO-3 guard, and includes a regression test. Closes deepspeedai#7807. ## Root cause Muon's Newton-Schulz orthogonalization is a whole-matrix operation: the rank that updates a parameter must hold that parameter's complete, fully-reduced gradient matrix, then take its partition slice of the orthogonalized result. - `get_flat_partition()` (`deepspeed/runtime/zero/stage_1_and_2.py`) applies `muon_update()` to each parameter's gradient reshaped to its full 2-D shape, and only then narrows to this rank's partition. - With `reduce_scatter=True`, `average_tensor()` reduce-scatters the gradients: each rank receives the averaged values only for its own partition slice. For the rest of a parameter whose flattened gradient crosses a partition boundary, the rank still holds its local, un-all-reduced gradient. - So for any cross-partition parameter, no rank holds the full reduced matrix. `muon_update` orthogonalizes a partly-reduced, rank-divergent matrix, and each rank silently applies a different, incorrect update. Parameters that lie wholly inside one partition are unaffected — exactly matching the report. ZeRO-3 already guards this exact conflict in `deepspeed/runtime/zero/stage3.py` (added in deepspeedai#7919): ```python if self.use_muon and self.reduce_scatter: raise ValueError("Muon and reduce scatter cannot be used together") ``` ZeRO-1/2 had no equivalent. The existing Muon unit tests pin `"reduce_scatter": false` everywhere, which implicitly acknowledges the path is unsupported but never enforces it for users — and since `reduce_scatter` defaults to `true`, a default Muon + ZeRO-1/2 run is silently wrong. ## Fix Mirror the ZeRO-3 guard in ZeRO-1/2: raise the same `ValueError` at initialization when the optimizer is `MuonWithAuxAdam` and `reduce_scatter` is enabled. To run Muon under ZeRO-1/2, set `"reduce_scatter": false` (as the Muon tests already do). The change is the import plus the guard, with no other behavioral change. ## Verification (2x RTX 4090, torch 2.9.1+cu128, ZeRO stage 1 and 2) - **Before**: `deepspeed.initialize` with Muon + `reduce_scatter=true` succeeds silently. With `world_size=2` and a model sized so a 2-D weight straddles the gradient-partition boundary, that weight's post-step update diverges by ~0.67 in relative Frobenius norm from the correct full-gradient result, while wholly-owned weights are unaffected — confirming the silent cross-partition corruption. - **After**: the same configuration raises `ValueError: Muon and reduce scatter cannot be used together` for both ZeRO stage 1 and 2. The existing Muon tests (which use `reduce_scatter: false`) remain green. ## Notes This supersedes deepspeedai#7878 and deepspeedai#7808, which aimed at the same issue by trying to force a full all-reduce for Muon but ended up with a self-contradictory guard. Aligning ZeRO-1/2 with the merged ZeRO-3 behavior (deepspeedai#7919) keeps the two code paths consistent and turns silent numerical corruption into a clear, actionable error. A follow-up PR adds a numerical-correctness regression test for the supported `reduce_scatter: false` Muon path, since the current Muon tests only assert that parameters changed. Closes deepspeedai#7807 cc @PKUWZP @pengdurice (ZeRO-3 Muon guard, deepspeedai#7919) @tohtana Signed-off-by: whycoming <alwaysxd666@gmail.com> Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
## Summary ZeRO-1/2 silently produces incorrect, rank-divergent parameter updates when the Muon optimizer is used together with `reduce_scatter` (the default). This adds an explicit error at initialization, mirroring the existing ZeRO-3 guard, and includes a regression test. Closes deepspeedai#7807. ## Root cause Muon's Newton-Schulz orthogonalization is a whole-matrix operation: the rank that updates a parameter must hold that parameter's complete, fully-reduced gradient matrix, then take its partition slice of the orthogonalized result. - `get_flat_partition()` (`deepspeed/runtime/zero/stage_1_and_2.py`) applies `muon_update()` to each parameter's gradient reshaped to its full 2-D shape, and only then narrows to this rank's partition. - With `reduce_scatter=True`, `average_tensor()` reduce-scatters the gradients: each rank receives the averaged values only for its own partition slice. For the rest of a parameter whose flattened gradient crosses a partition boundary, the rank still holds its local, un-all-reduced gradient. - So for any cross-partition parameter, no rank holds the full reduced matrix. `muon_update` orthogonalizes a partly-reduced, rank-divergent matrix, and each rank silently applies a different, incorrect update. Parameters that lie wholly inside one partition are unaffected — exactly matching the report. ZeRO-3 already guards this exact conflict in `deepspeed/runtime/zero/stage3.py` (added in deepspeedai#7919): ```python if self.use_muon and self.reduce_scatter: raise ValueError("Muon and reduce scatter cannot be used together") ``` ZeRO-1/2 had no equivalent. The existing Muon unit tests pin `"reduce_scatter": false` everywhere, which implicitly acknowledges the path is unsupported but never enforces it for users — and since `reduce_scatter` defaults to `true`, a default Muon + ZeRO-1/2 run is silently wrong. ## Fix Mirror the ZeRO-3 guard in ZeRO-1/2: raise the same `ValueError` at initialization when the optimizer is `MuonWithAuxAdam` and `reduce_scatter` is enabled. To run Muon under ZeRO-1/2, set `"reduce_scatter": false` (as the Muon tests already do). The change is the import plus the guard, with no other behavioral change. ## Verification (2x RTX 4090, torch 2.9.1+cu128, ZeRO stage 1 and 2) - **Before**: `deepspeed.initialize` with Muon + `reduce_scatter=true` succeeds silently. With `world_size=2` and a model sized so a 2-D weight straddles the gradient-partition boundary, that weight's post-step update diverges by ~0.67 in relative Frobenius norm from the correct full-gradient result, while wholly-owned weights are unaffected — confirming the silent cross-partition corruption. - **After**: the same configuration raises `ValueError: Muon and reduce scatter cannot be used together` for both ZeRO stage 1 and 2. The existing Muon tests (which use `reduce_scatter: false`) remain green. ## Notes This supersedes deepspeedai#7878 and deepspeedai#7808, which aimed at the same issue by trying to force a full all-reduce for Muon but ended up with a self-contradictory guard. Aligning ZeRO-1/2 with the merged ZeRO-3 behavior (deepspeedai#7919) keeps the two code paths consistent and turns silent numerical corruption into a clear, actionable error. A follow-up PR adds a numerical-correctness regression test for the supported `reduce_scatter: false` Muon path, since the current Muon tests only assert that parameters changed. Closes deepspeedai#7807 cc @PKUWZP @pengdurice (ZeRO-3 Muon guard, deepspeedai#7919) @tohtana Signed-off-by: whycoming <alwaysxd666@gmail.com> Co-authored-by: Ma, Guokai <guokai.ma@gmail.com> Signed-off-by: nathon-lee <leejianwoo@gmail.com>
## Summary ZeRO-1/2 silently produces incorrect, rank-divergent parameter updates when the Muon optimizer is used together with `reduce_scatter` (the default). This adds an explicit error at initialization, mirroring the existing ZeRO-3 guard, and includes a regression test. Closes deepspeedai#7807. ## Root cause Muon's Newton-Schulz orthogonalization is a whole-matrix operation: the rank that updates a parameter must hold that parameter's complete, fully-reduced gradient matrix, then take its partition slice of the orthogonalized result. - `get_flat_partition()` (`deepspeed/runtime/zero/stage_1_and_2.py`) applies `muon_update()` to each parameter's gradient reshaped to its full 2-D shape, and only then narrows to this rank's partition. - With `reduce_scatter=True`, `average_tensor()` reduce-scatters the gradients: each rank receives the averaged values only for its own partition slice. For the rest of a parameter whose flattened gradient crosses a partition boundary, the rank still holds its local, un-all-reduced gradient. - So for any cross-partition parameter, no rank holds the full reduced matrix. `muon_update` orthogonalizes a partly-reduced, rank-divergent matrix, and each rank silently applies a different, incorrect update. Parameters that lie wholly inside one partition are unaffected — exactly matching the report. ZeRO-3 already guards this exact conflict in `deepspeed/runtime/zero/stage3.py` (added in deepspeedai#7919): ```python if self.use_muon and self.reduce_scatter: raise ValueError("Muon and reduce scatter cannot be used together") ``` ZeRO-1/2 had no equivalent. The existing Muon unit tests pin `"reduce_scatter": false` everywhere, which implicitly acknowledges the path is unsupported but never enforces it for users — and since `reduce_scatter` defaults to `true`, a default Muon + ZeRO-1/2 run is silently wrong. ## Fix Mirror the ZeRO-3 guard in ZeRO-1/2: raise the same `ValueError` at initialization when the optimizer is `MuonWithAuxAdam` and `reduce_scatter` is enabled. To run Muon under ZeRO-1/2, set `"reduce_scatter": false` (as the Muon tests already do). The change is the import plus the guard, with no other behavioral change. ## Verification (2x RTX 4090, torch 2.9.1+cu128, ZeRO stage 1 and 2) - **Before**: `deepspeed.initialize` with Muon + `reduce_scatter=true` succeeds silently. With `world_size=2` and a model sized so a 2-D weight straddles the gradient-partition boundary, that weight's post-step update diverges by ~0.67 in relative Frobenius norm from the correct full-gradient result, while wholly-owned weights are unaffected — confirming the silent cross-partition corruption. - **After**: the same configuration raises `ValueError: Muon and reduce scatter cannot be used together` for both ZeRO stage 1 and 2. The existing Muon tests (which use `reduce_scatter: false`) remain green. ## Notes This supersedes deepspeedai#7878 and deepspeedai#7808, which aimed at the same issue by trying to force a full all-reduce for Muon but ended up with a self-contradictory guard. Aligning ZeRO-1/2 with the merged ZeRO-3 behavior (deepspeedai#7919) keeps the two code paths consistent and turns silent numerical corruption into a clear, actionable error. A follow-up PR adds a numerical-correctness regression test for the supported `reduce_scatter: false` Muon path, since the current Muon tests only assert that parameters changed. Closes deepspeedai#7807 cc @PKUWZP @pengdurice (ZeRO-3 Muon guard, deepspeedai#7919) @tohtana Signed-off-by: whycoming <alwaysxd666@gmail.com> Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter
This commit addresses the issue where cross-partition parameters received incorrect updates when using ZeRO-1/ZeRO-2 with reduce_scatter=true and Muon optimizer. The Newton-Schulz orthogonalization in Muon requires complete gradient information, which wasn't available when reduce_scatter was enabled.
The fix introduces a check for Muon parameters and forces full all-reduce gradient reduction for these cases, ensuring consistent parameter updates across all ranks.
Closes #7807