Skip to content

fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter#7878

Open
nathon-lee wants to merge 6 commits into
deepspeedai:masterfrom
nathon-lee:fix_cp_7807
Open

fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter#7878
nathon-lee wants to merge 6 commits into
deepspeedai:masterfrom
nathon-lee:fix_cp_7807

Conversation

@nathon-lee

Copy link
Copy Markdown
Contributor

fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter

This commit addresses the issue where cross-partition parameters received incorrect updates when using ZeRO-1/ZeRO-2 with reduce_scatter=true and Muon optimizer. The Newton-Schulz orthogonalization in Muon requires complete gradient information, which wasn't available when reduce_scatter was enabled.

The fix introduces a check for Muon parameters and forces full all-reduce gradient reduction for these cases, ensuring consistent parameter updates across all ranks.

Closes #7807

nathon-lee and others added 6 commits January 24, 2026 05:01
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Use ZeRO stage 1 to use BF16 optimizer.
(We should have switched to ZeRO1 in deepspeedai#7788, but I missed the change.
@sfc-gh-truwase)

- deepspeedai#7790 removed the fallback that allowed bf16 model + fp32 grad
accumulation without ZeRO, so that combo now raises NotImplementedError.
- deepspeedai#7788 changed test_bf16_optimizer_fragments to force BF16_Optimizer by
setting grad_accum_dtype=fp32, but it kept ZeRO stage 0, which is now
invalid after deepspeedai#7790.

Signed-off-by: Masahiro Tanaka <mtanaka@anyscale.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>
Signed-off-by: leejianwoo-collab <leejianwoo@gmail.com>

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 4665aa9345

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +297 to +298
if self.reduce_scatter and self.uses_muon:
assert False, f"{self.zero_stage_string} with reduce_scatter=True is incompatible with Muon optimizer. Please disable reduce_scatter or use a different optimizer."

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Allow Muon fallback instead of aborting on reduce_scatter

This new assert False makes ZeRO-1/2 initialization fail whenever reduce_scatter=True and any parameter has use_muon=True, which blocks the exact training configuration this change is trying to handle. The later average_tensor change already adds a fallback path (all-reduce when Muon is present), so this assertion prevents that fix from ever running in normal Python execution and turns a correctness bug into a hard runtime failure for Muon users.

Useful? React with 👍 / 👎.

@tjruwase tjruwase requested a review from PKUWZP March 1, 2026 19:58
@nathon-lee nathon-lee changed the title Fix cp 7807 fix(zero): Ensure full gradient reduction for Muon optimizer with reduce_scatter Mar 6, 2026
pull Bot pushed a commit to kokizzu/DeepSpeed that referenced this pull request Jun 27, 2026
## Summary
ZeRO-1/2 silently produces incorrect, rank-divergent parameter updates
when the Muon optimizer is used together with `reduce_scatter` (the
default). This adds an explicit error at initialization, mirroring the
existing ZeRO-3 guard, and includes a regression test. Closes deepspeedai#7807.

## Root cause
Muon's Newton-Schulz orthogonalization is a whole-matrix operation: the
rank that updates a parameter must hold that parameter's complete,
fully-reduced gradient matrix, then take its partition slice of the
orthogonalized result.

- `get_flat_partition()` (`deepspeed/runtime/zero/stage_1_and_2.py`)
applies `muon_update()` to each parameter's gradient reshaped to its
full 2-D shape, and only then narrows to this rank's partition.
- With `reduce_scatter=True`, `average_tensor()` reduce-scatters the
gradients: each rank receives the averaged values only for its own
partition slice. For the rest of a parameter whose flattened gradient
crosses a partition boundary, the rank still holds its local,
un-all-reduced gradient.
- So for any cross-partition parameter, no rank holds the full reduced
matrix. `muon_update` orthogonalizes a partly-reduced, rank-divergent
matrix, and each rank silently applies a different, incorrect update.
Parameters that lie wholly inside one partition are unaffected — exactly
matching the report.

ZeRO-3 already guards this exact conflict in
`deepspeed/runtime/zero/stage3.py` (added in deepspeedai#7919):
```python
if self.use_muon and self.reduce_scatter:
    raise ValueError("Muon and reduce scatter cannot be used together")
```
ZeRO-1/2 had no equivalent. The existing Muon unit tests pin
`"reduce_scatter": false` everywhere, which implicitly acknowledges the
path is unsupported but never enforces it for users — and since
`reduce_scatter` defaults to `true`, a default Muon + ZeRO-1/2 run is
silently wrong.

## Fix
Mirror the ZeRO-3 guard in ZeRO-1/2: raise the same `ValueError` at
initialization when the optimizer is `MuonWithAuxAdam` and
`reduce_scatter` is enabled. To run Muon under ZeRO-1/2, set
`"reduce_scatter": false` (as the Muon tests already do). The change is
the import plus the guard, with no other behavioral change.

## Verification (2x RTX 4090, torch 2.9.1+cu128, ZeRO stage 1 and 2)
- **Before**: `deepspeed.initialize` with Muon + `reduce_scatter=true`
succeeds silently. With `world_size=2` and a model sized so a 2-D weight
straddles the gradient-partition boundary, that weight's post-step
update diverges by ~0.67 in relative Frobenius norm from the correct
full-gradient result, while wholly-owned weights are unaffected —
confirming the silent cross-partition corruption.
- **After**: the same configuration raises `ValueError: Muon and reduce
scatter cannot be used together` for both ZeRO stage 1 and 2. The
existing Muon tests (which use `reduce_scatter: false`) remain green.

## Notes
This supersedes deepspeedai#7878 and deepspeedai#7808, which aimed at the same issue by trying
to force a full all-reduce for Muon but ended up with a
self-contradictory guard. Aligning ZeRO-1/2 with the merged ZeRO-3
behavior (deepspeedai#7919) keeps the two code paths consistent and turns silent
numerical corruption into a clear, actionable error.

A follow-up PR adds a numerical-correctness regression test for the
supported `reduce_scatter: false` Muon path, since the current Muon
tests only assert that parameters changed.

Closes deepspeedai#7807

cc @PKUWZP @pengdurice (ZeRO-3 Muon guard, deepspeedai#7919) @tohtana

Signed-off-by: whycoming <alwaysxd666@gmail.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
nathon-lee pushed a commit to nathon-lee/DeepSpeed_woo that referenced this pull request Jul 1, 2026
## Summary
ZeRO-1/2 silently produces incorrect, rank-divergent parameter updates
when the Muon optimizer is used together with `reduce_scatter` (the
default). This adds an explicit error at initialization, mirroring the
existing ZeRO-3 guard, and includes a regression test. Closes deepspeedai#7807.

## Root cause
Muon's Newton-Schulz orthogonalization is a whole-matrix operation: the
rank that updates a parameter must hold that parameter's complete,
fully-reduced gradient matrix, then take its partition slice of the
orthogonalized result.

- `get_flat_partition()` (`deepspeed/runtime/zero/stage_1_and_2.py`)
applies `muon_update()` to each parameter's gradient reshaped to its
full 2-D shape, and only then narrows to this rank's partition.
- With `reduce_scatter=True`, `average_tensor()` reduce-scatters the
gradients: each rank receives the averaged values only for its own
partition slice. For the rest of a parameter whose flattened gradient
crosses a partition boundary, the rank still holds its local,
un-all-reduced gradient.
- So for any cross-partition parameter, no rank holds the full reduced
matrix. `muon_update` orthogonalizes a partly-reduced, rank-divergent
matrix, and each rank silently applies a different, incorrect update.
Parameters that lie wholly inside one partition are unaffected — exactly
matching the report.

ZeRO-3 already guards this exact conflict in
`deepspeed/runtime/zero/stage3.py` (added in deepspeedai#7919):
```python
if self.use_muon and self.reduce_scatter:
    raise ValueError("Muon and reduce scatter cannot be used together")
```
ZeRO-1/2 had no equivalent. The existing Muon unit tests pin
`"reduce_scatter": false` everywhere, which implicitly acknowledges the
path is unsupported but never enforces it for users — and since
`reduce_scatter` defaults to `true`, a default Muon + ZeRO-1/2 run is
silently wrong.

## Fix
Mirror the ZeRO-3 guard in ZeRO-1/2: raise the same `ValueError` at
initialization when the optimizer is `MuonWithAuxAdam` and
`reduce_scatter` is enabled. To run Muon under ZeRO-1/2, set
`"reduce_scatter": false` (as the Muon tests already do). The change is
the import plus the guard, with no other behavioral change.

## Verification (2x RTX 4090, torch 2.9.1+cu128, ZeRO stage 1 and 2)
- **Before**: `deepspeed.initialize` with Muon + `reduce_scatter=true`
succeeds silently. With `world_size=2` and a model sized so a 2-D weight
straddles the gradient-partition boundary, that weight's post-step
update diverges by ~0.67 in relative Frobenius norm from the correct
full-gradient result, while wholly-owned weights are unaffected —
confirming the silent cross-partition corruption.
- **After**: the same configuration raises `ValueError: Muon and reduce
scatter cannot be used together` for both ZeRO stage 1 and 2. The
existing Muon tests (which use `reduce_scatter: false`) remain green.

## Notes
This supersedes deepspeedai#7878 and deepspeedai#7808, which aimed at the same issue by trying
to force a full all-reduce for Muon but ended up with a
self-contradictory guard. Aligning ZeRO-1/2 with the merged ZeRO-3
behavior (deepspeedai#7919) keeps the two code paths consistent and turns silent
numerical corruption into a clear, actionable error.

A follow-up PR adds a numerical-correctness regression test for the
supported `reduce_scatter: false` Muon path, since the current Muon
tests only assert that parameters changed.

Closes deepspeedai#7807

cc @PKUWZP @pengdurice (ZeRO-3 Muon guard, deepspeedai#7919) @tohtana

Signed-off-by: whycoming <alwaysxd666@gmail.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>
nathon-lee pushed a commit to nathon-lee/DeepSpeed_woo that referenced this pull request Jul 1, 2026
## Summary
ZeRO-1/2 silently produces incorrect, rank-divergent parameter updates
when the Muon optimizer is used together with `reduce_scatter` (the
default). This adds an explicit error at initialization, mirroring the
existing ZeRO-3 guard, and includes a regression test. Closes deepspeedai#7807.

## Root cause
Muon's Newton-Schulz orthogonalization is a whole-matrix operation: the
rank that updates a parameter must hold that parameter's complete,
fully-reduced gradient matrix, then take its partition slice of the
orthogonalized result.

- `get_flat_partition()` (`deepspeed/runtime/zero/stage_1_and_2.py`)
applies `muon_update()` to each parameter's gradient reshaped to its
full 2-D shape, and only then narrows to this rank's partition.
- With `reduce_scatter=True`, `average_tensor()` reduce-scatters the
gradients: each rank receives the averaged values only for its own
partition slice. For the rest of a parameter whose flattened gradient
crosses a partition boundary, the rank still holds its local,
un-all-reduced gradient.
- So for any cross-partition parameter, no rank holds the full reduced
matrix. `muon_update` orthogonalizes a partly-reduced, rank-divergent
matrix, and each rank silently applies a different, incorrect update.
Parameters that lie wholly inside one partition are unaffected — exactly
matching the report.

ZeRO-3 already guards this exact conflict in
`deepspeed/runtime/zero/stage3.py` (added in deepspeedai#7919):
```python
if self.use_muon and self.reduce_scatter:
    raise ValueError("Muon and reduce scatter cannot be used together")
```
ZeRO-1/2 had no equivalent. The existing Muon unit tests pin
`"reduce_scatter": false` everywhere, which implicitly acknowledges the
path is unsupported but never enforces it for users — and since
`reduce_scatter` defaults to `true`, a default Muon + ZeRO-1/2 run is
silently wrong.

## Fix
Mirror the ZeRO-3 guard in ZeRO-1/2: raise the same `ValueError` at
initialization when the optimizer is `MuonWithAuxAdam` and
`reduce_scatter` is enabled. To run Muon under ZeRO-1/2, set
`"reduce_scatter": false` (as the Muon tests already do). The change is
the import plus the guard, with no other behavioral change.

## Verification (2x RTX 4090, torch 2.9.1+cu128, ZeRO stage 1 and 2)
- **Before**: `deepspeed.initialize` with Muon + `reduce_scatter=true`
succeeds silently. With `world_size=2` and a model sized so a 2-D weight
straddles the gradient-partition boundary, that weight's post-step
update diverges by ~0.67 in relative Frobenius norm from the correct
full-gradient result, while wholly-owned weights are unaffected —
confirming the silent cross-partition corruption.
- **After**: the same configuration raises `ValueError: Muon and reduce
scatter cannot be used together` for both ZeRO stage 1 and 2. The
existing Muon tests (which use `reduce_scatter: false`) remain green.

## Notes
This supersedes deepspeedai#7878 and deepspeedai#7808, which aimed at the same issue by trying
to force a full all-reduce for Muon but ended up with a
self-contradictory guard. Aligning ZeRO-1/2 with the merged ZeRO-3
behavior (deepspeedai#7919) keeps the two code paths consistent and turns silent
numerical corruption into a clear, actionable error.

A follow-up PR adds a numerical-correctness regression test for the
supported `reduce_scatter: false` Muon path, since the current Muon
tests only assert that parameters changed.

Closes deepspeedai#7807

cc @PKUWZP @pengdurice (ZeRO-3 Muon guard, deepspeedai#7919) @tohtana

Signed-off-by: whycoming <alwaysxd666@gmail.com>
Co-authored-by: Ma, Guokai <guokai.ma@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Cross-partition parameters incorrectly updated when using ZeRO-1/ZeRO-2 with reduce_scatter=true and Muon optimizer

2 participants