Add per-quantile-level reporting for quantile metrics #156

Merged
shchur merged 2 commits into autogluon:main from shchur:per-quantile-metrics
Jun 24, 2026
Conversation

@shchur shchur commented Jun 24, 2026

Contributor

Summary

Lets fev report quantile metrics (MQL / WQL / SQL) both overall (as today) and broken down per quantile level. With quantile_levels=[0.1, 0.5, 0.9] and per_quantile_scores=True, the summary gains SQL[0.1], SQL[0.5], SQL[0.9] alongside the overall SQL.

task.evaluation_summary(preds, model_name="m", per_quantile_scores=True)
# -> {..., "SQL": ..., "SQL[0.1]": ..., "SQL[0.5]": ..., "SQL[0.9]": ...}

Design

  • QuantileMetric base class for MQL/WQL/SQL. Subclasses implement only _per_quantile_level(...) -> np.ndarray ([Q]); the overall score is defined as the mean over levels, so it always equals the mean of the per-level scores by construction: a single code path that cannot drift.
  • Metric.compute_scores(...) -> dict[str, float] is the new emission entry point. The base returns {self.name: self.compute(...)}; QuantileMetric overrides it to optionally add the per-level keys. compute() keeps returning a scalar, so test_error / leaderboards are unchanged.
  • Reporting is a call-time choice, not task state: per_quantile_scores is a kwarg on evaluation_summary (threaded to compute_metrics → compute_scores). It deliberately does not become a Task field, so it stays out of to_dict() / YAML / the task fingerprint.
  • metrics_per_window switched to a defaultdict(list) since the per-level keys aren't known up front.
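
A minimal sketch of the design described in the bullets above, not the actual fev code. The class bodies, argument names, array shapes, and the `ToyMQL` pinball loss are assumptions made for illustration; only the names `QuantileMetric`, `_per_quantile_level`, `compute`, `compute_scores`, `per_quantile_scores`, and `metrics_per_window` come from this PR. Averaging the per-window values into a summary is likewise an assumption.

```python
from collections import defaultdict

import numpy as np


class Metric:
    """Hypothetical stand-in for the base metric class."""

    name = "metric"

    def compute(self, y_true, y_pred, quantile_levels) -> float:
        raise NotImplementedError

    def compute_scores(self, y_true, y_pred, quantile_levels, per_quantile_scores=False) -> dict[str, float]:
        # Default emission: a single key holding the scalar score.
        return {self.name: self.compute(y_true, y_pred, quantile_levels)}


class QuantileMetric(Metric):
    def _per_quantile_level(self, y_true, y_pred, quantile_levels) -> np.ndarray:
        raise NotImplementedError  # subclasses return an array of shape [Q]

    def compute(self, y_true, y_pred, quantile_levels) -> float:
        per_level = self._per_quantile_level(y_true, y_pred, quantile_levels)  # [Q]
        return float(np.mean(per_level))

    def compute_scores(self, y_true, y_pred, quantile_levels, per_quantile_scores=False) -> dict[str, float]:
        per_level = self._per_quantile_level(y_true, y_pred, quantile_levels)  # [Q]
        scores = {self.name: float(np.mean(per_level))}
        if per_quantile_scores:
            # Optional breakdown, keyed like "MQL[0.1]".
            for q, value in zip(quantile_levels, per_level):
                scores[f"{self.name}[{q}]"] = float(value)
        return scores


class ToyMQL(QuantileMetric):
    """Simplified pinball loss on a single series; illustration only."""

    name = "MQL"

    def _per_quantile_level(self, y_true, y_pred, quantile_levels) -> np.ndarray:
        q = np.asarray(quantile_levels)[:, None]     # [Q, 1]
        diff = y_true[None, :] - y_pred              # [Q, T]
        loss = np.maximum(q * diff, (q - 1) * diff)  # [Q, T]
        return np.nanmean(loss, axis=1)              # [Q]


rng = np.random.default_rng(0)
quantile_levels = [0.1, 0.5, 0.9]
metric = ToyMQL()

# Keys are discovered as scores arrive, hence defaultdict(list).
metrics_per_window = defaultdict(list)
for _ in range(3):  # three evaluation windows
    y_true = rng.normal(size=8)
    y_pred = np.sort(rng.normal(size=(3, 8)), axis=0)
    window_scores = metric.compute_scores(y_true, y_pred, quantile_levels, per_quantile_scores=True)
    for key, value in window_scores.items():
        metrics_per_window[key].append(value)

summary = {key: float(np.mean(values)) for key, values in metrics_per_window.items()}
```

With `per_quantile_scores=False` the same call returns only the `"MQL"` key, so the default summary schema is unchanged in this sketch.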

Compatibility

  • Default per_quantile_scores=False → summary schema is identical to before.
  • Non-quantile metrics are completely unaffected (inherit the single-key compute_scores).
  • Verified the refactored MQL/WQL/SQL compute() matches the previous implementations exactly, including with NaNs and too-short histories.

Tests

  • All existing metric tests pass (incl. the AutoGluon cross-check for every metric).
  • New tests: overall == mean of per-level for each quantile metric; no per-level keys when disabled; non-quantile metrics get no breakdown.

@shchur shchur requested a review from apointa June 24, 2026 09:58
Comment thread src/fev/metrics.py
seasonality=seasonality,
quantile_levels=quantile_levels,
) # [Q]
return float(np.mean(per_level))

Contributor

Not sure if this is intended, but it is a slight change from the previous logic: before, the mean over the quantiles was NaN-safe, and here it is not.

Contributor Author

This should have no effect since per_level already shouldn't contain NaNs. The NaNs might be present at some time steps in the target (never in predictions), so after we average across time & items [T, N] there should be no NaNs left in the array of shape [Q].

@apointa apointa Jun 24, 2026

Contributor

Okay, makes sense.
Just for my understanding: it would become NaN when ALL predictions for a specific quantile are NaN, right? But in that case we don't want to ignore it, since it would mean you could get a better SQL by not providing the hard quantiles. Which is also why the predictions are not allowed to be NaN in general (as that would "sub-select" the aggregate based on the provided ones), right?

Contributor Author

Currently we have a check here

if not pc.all(pc.is_finite(flat)).as_py():

that raises an error if there are any NaNs in the predictions, so only NaNs in the target are permitted.

This means the only scenario where the quantile loss is NaN for one quantile is when all target values are NaN, but then the loss will be NaN for all quantiles and for metrics in general, which is easy to spot.
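
The three cases discussed in this thread can be sketched with plain NumPy. This is an illustration under assumptions, not the fev implementation: the helper `per_level_quantile_loss` is hypothetical, the `np.isfinite` check stands in for the pyarrow check quoted above, and NaN-safe averaging over time and items is assumed from the earlier reply.

```python
import warnings

import numpy as np


def per_level_quantile_loss(y_true, y_pred, quantile_levels):
    """Hypothetical helper: pinball loss per quantile level.

    y_true: [T, N] (may contain NaNs), y_pred: [Q, T, N] (must be finite).
    Returns an array of shape [Q].
    """
    if not np.isfinite(y_pred).all():
        raise ValueError("predictions must not contain NaN or inf")
    q = np.asarray(quantile_levels)[:, None, None]  # [Q, 1, 1]
    diff = y_true[None] - y_pred                    # [Q, T, N]
    loss = np.maximum(q * diff, (q - 1) * diff)     # NaN wherever the target is NaN
    with warnings.catch_warnings():
        # nanmean warns on an all-NaN slice; silence it for this demo.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.nanmean(loss, axis=(1, 2))        # NaN-safe mean over time and items


levels = [0.1, 0.5, 0.9]
y_pred = np.stack([np.full((3, 2), v) for v in (1.0, 3.0, 5.0)])  # [Q=3, T=3, N=2]

# Case 1: NaNs at some target time steps -> per-level scores stay finite,
# so a plain mean over levels equals a NaN-safe mean.
y_partial = np.array([[1.0, np.nan], [2.0, 4.0], [np.nan, 5.0]])
partial = per_level_quantile_loss(y_partial, y_pred, levels)

# Case 2: every target value is NaN -> every quantile level is NaN.
y_all_nan = np.full((3, 2), np.nan)
all_nan = per_level_quantile_loss(y_all_nan, y_pred, levels)

# Case 3: a NaN in the predictions is rejected up front.
bad_pred = y_pred.copy()
bad_pred[0, 0, 0] = np.nan
try:
    per_level_quantile_loss(y_partial, bad_pred, levels)
    rejected = False
except ValueError:
    rejected = True
```

Because predictions are rejected when non-finite, a NaN cannot appear at only one quantile level in this sketch; it appears at all levels or at none.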

@shchur shchur merged commit 100695e into autogluon:main Jun 24, 2026
2 checks passed