Validate fp16 dynamic loss scaling parameters are positive#8050
Conversation
loss_scale_window and min_loss_scale drive dynamic loss scaling but are not validated, so invalid values silently initialize and fail later: - loss_scale_window is used as `stable_interval % scale_window` in DynamicLossScaler.update_scale, so a value of 0 raises ZeroDivisionError during training. - min_loss_scale is the loss-scale floor; a value <= 0 collapses dynamic loss scaling. Add a Pydantic `mode="before"` field validator to DeepSpeedFP16Config that rejects bool, non-numeric, non-finite (inf/-inf/nan), and non-positive values for both fields, raising a clear ValidationError. This follows the same pattern as the fp16.loss_scale validation added in deepspeedai#7889. Add unit tests covering invalid values (0, -1, inf, nan, True, [], {}) and valid values for both fields. Signed-off-by: Aryan <aryansputta@gmail.com>
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 68eae02630
ℹ️ About Codex in GitHub
Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".
| if number <= 0: | ||
| raise ValueError(f"fp16.{name} must be > 0") |
There was a problem hiding this comment.
Gate dynamic-scale validation on dynamic loss scaling
This rejects existing static-loss-scale configs that happen to carry loss_scale_window or min_loss_scale values such as 0, even though those fields are only used when dynamic loss scaling is enabled. I checked DeepSpeedEngine.dynamic_loss_scale() in deepspeed/runtime/engine.py, which returns true only when fp16.loss_scale == 0; with loss_scale > 0 the optimizer uses the static scale and these dynamic parameters are ignored, so failing config construction here is a compatibility regression for otherwise valid static fp16 setups.
Useful? React with 👍 / 👎.
Address review: loss_scale_window and min_loss_scale only take effect when dynamic loss scaling is active (fp16 enabled and loss_scale == 0, per DeepSpeedEngine.dynamic_loss_scale). Validating them unconditionally rejected otherwise-valid static-loss-scale configs that carry unused values like 0. Replace the per-field validator with a model_validator(mode="after") that checks loss_scale_window > 0 and min_loss_scale > 0 only when fp16 is enabled and loss_scale == 0. Update tests to cover the static and fp16-disabled cases where these fields are ignored. Signed-off-by: Aryan <aryansputta@gmail.com>
|
Addressed in 27af879: replaced the per-field validator with a Thoughts... @tohtana |
|
Hi @aryanputta, thank you for your contribution! As the fix runs the validator in |
Pydantic coerces bool to int (True -> 1) and floats to int, so values like loss_scale_window=True or min_loss_scale=inf would silently pass the positivity check in _validate_dynamic_loss_scale_params. Add a before field validator that rejects bool, non-finite, and non-numeric values before coercion, mirroring the existing loss_scale validator. Addresses @tohtana review feedback. Signed-off-by: Aryan <aryansputta@gmail.com>
|
Thanks @tohtana, good catch. Fixed in 427695f: added a Added tests covering bool (True/False), inf/nan, strings, and None for both fields. |
tohtana
left a comment
There was a problem hiding this comment.
Looks good to me, thank you for the fix!
…ai#8050) ## What `fp16.loss_scale_window` and `fp16.min_loss_scale` drive dynamic loss scaling but are not validated, so invalid values initialize silently and fail later during training: - **`loss_scale_window`** is used as `stable_interval % self.scale_window` in `DynamicLossScaler.update_scale` (`deepspeed/runtime/fp16/loss_scaler.py`), so a value of `0` raises `ZeroDivisionError` mid-training. - **`min_loss_scale`** is the loss-scale floor (`max(cur_scale / scale_factor, min_scale)`); a value `<= 0` collapses dynamic loss scaling. This is the same class of silent-misconfiguration bug as `fp16.loss_scale` accepting `inf`, fixed in deepspeedai#7889. ## Change Add a single Pydantic `mode="before"` field validator on `DeepSpeedFP16Config` covering both fields. It rejects `bool`, non-numeric, non-finite (`inf`/`-inf`/`nan`), and non-positive values, raising a clear `ValidationError` (e.g. `fp16.loss_scale_window must be > 0`). Following the deepspeedai#7889 review, `mode="before"` runs prior to type coercion (so `True` is rejected), and `float()` is wrapped in `try/except` so `[]`/`{}` surface a clear `ValidationError` rather than a raw `TypeError`. ## Tests Adds `tests/unit/runtime/test_precision_config_dynamic_scale.py`, parametrized over both fields: - invalid: `0, -1, inf, nan, True, [], {}` -> `ValidationError` - valid: `1, 1000, "2"` -> accepted ```bash pytest -q tests/unit/runtime/test_precision_config_dynamic_scale.py ``` The validator logic was verified against the full matrix locally; the import-level test runs under CI. --------- Signed-off-by: Aryan <aryansputta@gmail.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com> Signed-off-by: nathon-lee <leejianwoo@gmail.com>
…ai#8050) ## What `fp16.loss_scale_window` and `fp16.min_loss_scale` drive dynamic loss scaling but are not validated, so invalid values initialize silently and fail later during training: - **`loss_scale_window`** is used as `stable_interval % self.scale_window` in `DynamicLossScaler.update_scale` (`deepspeed/runtime/fp16/loss_scaler.py`), so a value of `0` raises `ZeroDivisionError` mid-training. - **`min_loss_scale`** is the loss-scale floor (`max(cur_scale / scale_factor, min_scale)`); a value `<= 0` collapses dynamic loss scaling. This is the same class of silent-misconfiguration bug as `fp16.loss_scale` accepting `inf`, fixed in deepspeedai#7889. ## Change Add a single Pydantic `mode="before"` field validator on `DeepSpeedFP16Config` covering both fields. It rejects `bool`, non-numeric, non-finite (`inf`/`-inf`/`nan`), and non-positive values, raising a clear `ValidationError` (e.g. `fp16.loss_scale_window must be > 0`). Following the deepspeedai#7889 review, `mode="before"` runs prior to type coercion (so `True` is rejected), and `float()` is wrapped in `try/except` so `[]`/`{}` surface a clear `ValidationError` rather than a raw `TypeError`. ## Tests Adds `tests/unit/runtime/test_precision_config_dynamic_scale.py`, parametrized over both fields: - invalid: `0, -1, inf, nan, True, [], {}` -> `ValidationError` - valid: `1, 1000, "2"` -> accepted ```bash pytest -q tests/unit/runtime/test_precision_config_dynamic_scale.py ``` The validator logic was verified against the full matrix locally; the import-level test runs under CI. --------- Signed-off-by: Aryan <aryansputta@gmail.com> Co-authored-by: Masahiro Tanaka <81312776+tohtana@users.noreply.github.com>
What
fp16.loss_scale_windowandfp16.min_loss_scaledrive dynamic loss scaling but are not validated, so invalid values initialize silently and fail later during training:loss_scale_windowis used asstable_interval % self.scale_windowinDynamicLossScaler.update_scale(deepspeed/runtime/fp16/loss_scaler.py), so a value of0raisesZeroDivisionErrormid-training.min_loss_scaleis the loss-scale floor (max(cur_scale / scale_factor, min_scale)); a value<= 0collapses dynamic loss scaling.This is the same class of silent-misconfiguration bug as
fp16.loss_scaleacceptinginf, fixed in #7889.Change
Add a single Pydantic
mode="before"field validator onDeepSpeedFP16Configcovering both fields. It rejectsbool, non-numeric, non-finite (inf/-inf/nan), and non-positive values, raising a clearValidationError(e.g.fp16.loss_scale_window must be > 0). Following the #7889 review,mode="before"runs prior to type coercion (soTrueis rejected), andfloat()is wrapped intry/exceptso[]/{}surface a clearValidationErrorrather than a rawTypeError.Tests
Adds
tests/unit/runtime/test_precision_config_dynamic_scale.py, parametrized over both fields:0, -1, inf, nan, True, [], {}->ValidationError1, 1000, "2"-> acceptedThe validator logic was verified against the full matrix locally; the import-level test runs under CI.