Validate fp16 dynamic loss scaling parameters are positive by aryanputta · Pull Request #8050 · deepspeedai/DeepSpeed

aryanputta · 2026-06-06T19:20:51Z

What

fp16.loss_scale_window and fp16.min_loss_scale drive dynamic loss scaling but are not validated, so invalid values initialize silently and fail later during training:

loss_scale_window is used as stable_interval % self.scale_window in DynamicLossScaler.update_scale (deepspeed/runtime/fp16/loss_scaler.py), so a value of 0 raises ZeroDivisionError mid-training.
min_loss_scale is the loss-scale floor (max(cur_scale / scale_factor, min_scale)); a value <= 0 collapses dynamic loss scaling.
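Both failure modes can be reproduced with a toy scaler. The sketch below is a simplified stand-in, not DeepSpeed's `DynamicLossScaler`: the class name `ToyDynamicLossScaler`, the `stable_steps` counter, and the default values are assumptions made for illustration. Only the modulo on `scale_window` and the `max(cur_scale / scale_factor, min_scale)` floor come from the description above.

```python
class ToyDynamicLossScaler:
    """Simplified stand-in for a dynamic loss scaler (hypothetical, not DeepSpeed's code)."""

    def __init__(self, init_scale=2.0**16, scale_factor=2.0, scale_window=1000, min_scale=1.0):
        self.cur_scale = init_scale
        self.scale_factor = scale_factor
        self.scale_window = scale_window
        self.min_scale = min_scale
        self.stable_steps = 0  # consecutive steps without overflow

    def update_scale(self, overflow):
        if overflow:
            # Back off, but never below the configured floor.
            self.cur_scale = max(self.cur_scale / self.scale_factor, self.min_scale)
            self.stable_steps = 0
        else:
            self.stable_steps += 1
            # scale_window == 0 makes this modulo raise ZeroDivisionError.
            if self.stable_steps % self.scale_window == 0:
                self.cur_scale *= self.scale_factor


# Failure 1: a zero window only blows up once training reaches update_scale.
zero_window = ToyDynamicLossScaler(scale_window=0)
try:
    zero_window.update_scale(overflow=False)
    zero_window_error = None
except ZeroDivisionError as exc:
    zero_window_error = exc

# Failure 2: a non-positive floor lets repeated overflows shrink the scale without bound.
no_floor = ToyDynamicLossScaler(min_scale=0)
with_floor = ToyDynamicLossScaler(min_scale=1.0)
for _ in range(20):
    no_floor.update_scale(overflow=True)
    with_floor.update_scale(overflow=True)
```

In this toy model, 20 consecutive overflows leave `with_floor.cur_scale` pinned at the floor while `no_floor.cur_scale` keeps halving toward zero, which is one plausible reading of "collapses dynamic loss scaling".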

This is the same class of silent-misconfiguration bug as fp16.loss_scale accepting inf, fixed in #7889.

Change

Add a single Pydantic mode="before" field validator on DeepSpeedFP16Config covering both fields. It rejects bool, non-numeric, non-finite (inf/-inf/nan), and non-positive values, raising a clear ValidationError (e.g. fp16.loss_scale_window must be > 0). Following the #7889 review, mode="before" runs prior to type coercion (so True is rejected), and float() is wrapped in try/except so []/{} surface a clear ValidationError rather than a raw TypeError.
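The order of checks described above can be sketched as a plain function. This is not the PR's code: `require_positive_finite` is a hypothetical helper written without Pydantic so it runs standalone, and the messages other than "must be > 0" are invented. Inside a Pydantic v2 `field_validator(..., mode="before")`, a `ValueError` raised this way is reported to the caller as a `ValidationError`.

```python
import math


def require_positive_finite(name, value):
    """Hypothetical helper: return value unchanged if it passes every check."""
    # bool is a subclass of int, so it has to be rejected explicitly,
    # before anything converts True to 1.
    if isinstance(value, bool):
        raise ValueError(f"fp16.{name} must be a number, not a bool")
    try:
        number = float(value)
    except (TypeError, ValueError):
        # [] and {} raise TypeError, "abc" raises ValueError; both become one clear error.
        raise ValueError(f"fp16.{name} must be a number") from None
    if not math.isfinite(number):
        raise ValueError(f"fp16.{name} must be finite")
    if number <= 0:
        raise ValueError(f"fp16.{name} must be > 0")
    return value


def is_rejected(value):
    """Small driver used to walk the invalid/valid matrix."""
    try:
        require_positive_finite("loss_scale_window", value)
    except ValueError:
        return True
    return False
```

Returning the original `value` rather than `number` leaves the final type conversion to whatever runs after the check, which matches the role of a "before" validator.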

Tests

Adds tests/unit/runtime/test_precision_config_dynamic_scale.py, parametrized over both fields:

invalid: 0, -1, inf, nan, True, [], {} -> ValidationError
valid: 1, 1000, "2" -> accepted

pytest -q tests/unit/runtime/test_precision_config_dynamic_scale.py

The validator logic was verified against the full matrix locally; the import-level test runs under CI.

loss_scale_window and min_loss_scale drive dynamic loss scaling but are not validated, so invalid values silently initialize and fail later:

- loss_scale_window is used as `stable_interval % scale_window` in DynamicLossScaler.update_scale, so a value of 0 raises ZeroDivisionError during training.
- min_loss_scale is the loss-scale floor; a value <= 0 collapses dynamic loss scaling.

Add a Pydantic `mode="before"` field validator to DeepSpeedFP16Config that rejects bool, non-numeric, non-finite (inf/-inf/nan), and non-positive values for both fields, raising a clear ValidationError. This follows the same pattern as the fp16.loss_scale validation added in deepspeedai#7889.

Add unit tests covering invalid values (0, -1, inf, nan, True, [], {}) and valid values for both fields.

Signed-off-by: Aryan <aryansputta@gmail.com>

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 68eae02630

chatgpt-codex-connector · 2026-06-06T19:22:57Z

+        if number <= 0:
+            raise ValueError(f"fp16.{name} must be > 0")


Gate dynamic-scale validation on dynamic loss scaling

This rejects existing static-loss-scale configs that happen to carry loss_scale_window or min_loss_scale values such as 0, even though those fields are only used when dynamic loss scaling is enabled. I checked DeepSpeedEngine.dynamic_loss_scale() in deepspeed/runtime/engine.py, which returns true only when fp16.loss_scale == 0; with loss_scale > 0 the optimizer uses the static scale and these dynamic parameters are ignored, so failing config construction here is a compatibility regression for otherwise valid static fp16 setups.
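The gating suggested here can be sketched with a small dataclass. `ToyFP16Config` is a hypothetical stand-in for the real config model, and its default values are illustrative assumptions. The only behavior taken from the review is that the two fields are checked solely when dynamic loss scaling is active, meaning fp16 is enabled and `loss_scale == 0`.

```python
from dataclasses import dataclass


@dataclass
class ToyFP16Config:
    """Hypothetical stand-in for the fp16 config; defaults are illustrative only."""
    enabled: bool = False
    loss_scale: float = 0.0
    loss_scale_window: int = 1000
    min_loss_scale: float = 1.0

    def __post_init__(self):
        # Dynamic loss scaling is active only when fp16 is on and no static scale is set.
        dynamic = self.enabled and self.loss_scale == 0
        if not dynamic:
            return  # the dynamic-only fields are ignored, so they are not validated
        if self.loss_scale_window <= 0:
            raise ValueError("fp16.loss_scale_window must be > 0")
        if self.min_loss_scale <= 0:
            raise ValueError("fp16.min_loss_scale must be > 0")


# Static scale and fp16-disabled configs may carry unused zeros.
static_cfg = ToyFP16Config(enabled=True, loss_scale=128, loss_scale_window=0)
disabled_cfg = ToyFP16Config(enabled=False, min_loss_scale=0)

# The same zero is rejected once dynamic loss scaling is actually in use.
try:
    ToyFP16Config(enabled=True, loss_scale=0, loss_scale_window=0)
    dynamic_error = None
except ValueError as exc:
    dynamic_error = str(exc)
```

Checking after construction is what lets the rule look at `enabled` and `loss_scale` together with the two dynamic fields; a per-field check sees only one value at a time.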

Address review: loss_scale_window and min_loss_scale only take effect when dynamic loss scaling is active (fp16 enabled and loss_scale == 0, per DeepSpeedEngine.dynamic_loss_scale). Validating them unconditionally rejected otherwise-valid static-loss-scale configs that carry unused values like 0.

Replace the per-field validator with a model_validator(mode="after") that checks loss_scale_window > 0 and min_loss_scale > 0 only when fp16 is enabled and loss_scale == 0. Update tests to cover the static and fp16-disabled cases where these fields are ignored.

Signed-off-by: Aryan <aryansputta@gmail.com>

aryanputta · 2026-06-06T19:27:08Z

Addressed in 27af879: replaced the per-field validator with a model_validator(mode="after") that only enforces loss_scale_window > 0 and min_loss_scale > 0 when self.enabled and self.loss_scale == 0. Static configs (loss_scale > 0) and fp16-disabled configs now construct fine even if these fields carry 0. Added tests for both the static and fp16-disabled cases.

Thoughts... @tohtana

tohtana · 2026-06-12T16:32:03Z

Hi @aryanputta, thank you for your contribution!

Because the fix runs the validator in mode="after", a loss_scale_window or min_loss_scale of True has already been coerced to 1 by the time it is checked, so boolean values pass the validator.
A minimal fix would be to keep the current "after" validation for the positivity check and add a "before" validator that rejects bool, non-numeric, and non-finite values.
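The gap is easy to see in plain Python, where `bool` is a subclass of `int`. The snippet below is an illustration only: `is_plain_finite_number` is a hypothetical filter, and it is stricter than the PR's validator because it also turns away numeric strings such as "2".

```python
import math

candidates = [True, float("inf"), 1, 0, -1]

# A positivity-only check, like one that runs after coercion, cannot tell True from 1
# and has no objection to infinity.
passes_positivity = [v for v in candidates if v > 0]
print(passes_positivity)  # [True, inf, 1]


def is_plain_finite_number(value):
    """Hypothetical pre-coercion filter: no bools, no inf/nan, int or float only."""
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return False
    return math.isfinite(value)


# Running the type filter first leaves only the genuinely valid value.
passes_both = [v for v in candidates if is_plain_finite_number(v) and v > 0]
print(passes_both)  # [1]
```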

@tohtana

Pydantic coerces bool to int (True -> 1) and floats to int, so values like loss_scale_window=True or min_loss_scale=inf would silently pass the positivity check in _validate_dynamic_loss_scale_params.

Add a before field validator that rejects bool, non-finite, and non-numeric values before coercion, mirroring the existing loss_scale validator.

Addresses @tohtana review feedback.

Signed-off-by: Aryan <aryansputta@gmail.com>

aryanputta · 2026-06-21T16:42:34Z

Thanks @tohtana, good catch. Fixed in 427695f: added a field_validator(..., mode="before") on loss_scale_window and min_loss_scale that rejects bool, non-finite (inf/nan), and non-numeric values before pydantic coerces them. This mirrors the existing loss_scale before-validator. The mode="after" model validator still enforces positivity only when dynamic loss scaling is active (fp16 enabled and loss_scale=0), so static-scale configs that carry unused values are unaffected.

Added tests covering bool (True/False), inf/nan, strings, and None for both fields.

tohtana

Looks good to me, thank you for the fix!

…ai#8050)

## What

`fp16.loss_scale_window` and `fp16.min_loss_scale` drive dynamic loss scaling but are not validated, so invalid values initialize silently and fail later during training:

- **`loss_scale_window`** is used as `stable_interval % self.scale_window` in `DynamicLossScaler.update_scale` (`deepspeed/runtime/fp16/loss_scaler.py`), so a value of `0` raises `ZeroDivisionError` mid-training.
- **`min_loss_scale`** is the loss-scale floor (`max(cur_scale / scale_factor, min_scale)`); a value `<= 0` collapses dynamic loss scaling.

This is the same class of silent-misconfiguration bug as `fp16.loss_scale` accepting `inf`, fixed in deepspeedai#7889.

## Change

Add a single Pydantic `mode="before"` field validator on `DeepSpeedFP16Config` covering both fields. It rejects `bool`, non-numeric, non-finite (`inf`/`-inf`/`nan`), and non-positive values, raising a clear `ValidationError` (e.g. `fp16.loss_scale_window must be > 0`). Following the deepspeedai#7889 review, `mode="before"` runs prior to type coercion (so `True` is rejected), and `float()` is wrapped in `try/except` so `[]`/`{}` surface a clear `ValidationError` rather than a raw `TypeError`.

## Tests

Adds `tests/unit/runtime/test_precision_config_dynamic_scale.py`, parametrized over both fields:

- invalid: `0, -1, inf, nan, True, [], {}` -> `ValidationError`
- valid: `1, 1000, "2"` -> accepted

```bash
pytest -q tests/unit/runtime/test_precision_config_dynamic_scale.py
```

The validator logic was verified against the full matrix locally; the import-level test runs under CI.

---------

Signed-off-by: Aryan <aryansputta@gmail.com>
Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
Signed-off-by: nathon-lee <leejianwoo@gmail.com>

aryanputta requested review from loadams, tjruwase and tohtana as code owners June 6, 2026 19:20

chatgpt-codex-connector Bot reviewed Jun 6, 2026

Merge branch 'master' into validate-fp16-dynamic-loss-scale

3440b8b

tohtana approved these changes Jun 22, 2026

tohtana enabled auto-merge (squash) June 22, 2026 18:45

tohtana merged commit ded2349 into deepspeedai:master Jun 22, 2026
12 checks passed

tohtana merged 4 commits into deepspeedai:master from aryanputta:validate-fp16-dynamic-loss-scale
