Add per-quantile-level reporting for quantile metrics #156
    seasonality=seasonality,
    quantile_levels=quantile_levels,
)  # [Q]
return float(np.mean(per_level))
Not sure if this is intended, but it is a slight change to the previous logic: before, the mean over the quantiles was NaN-safe, and here it is not.
This should have no effect since per_level already shouldn't contain NaNs. The NaNs might be present at some time steps in the target (never in predictions), so after we average across time & items [T, N] there should be no NaNs left in the array of shape [Q].
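To illustrate the argument, here is a minimal sketch on toy data. The helper `per_level_quantile_loss` and the array shapes are assumptions made for this example, not the code from this PR: the pinball loss is NaN wherever the target is NaN, a NaN-safe mean over items and time removes those entries, and the resulting `[Q]` array is NaN-free, so `np.mean` and `np.nanmean` agree on it.

```python
import numpy as np

def per_level_quantile_loss(y_true, y_pred, quantile_levels):
    """Hypothetical helper: pinball loss per quantile level, averaged over items and time."""
    q = np.asarray(quantile_levels, dtype=float)[:, None, None]  # [Q, 1, 1]
    diff = y_true[None, :, :] - y_pred                           # [Q, N, T]
    loss = np.maximum(q * diff, (q - 1.0) * diff)                # NaN where the target is NaN
    return np.nanmean(loss, axis=(1, 2))                         # [Q], NaN-safe over [N, T]

# Toy data: 2 items, 3 time steps, some target values missing.
y_true = np.array([[1.0, np.nan, 3.0],
                   [2.0, 2.5, np.nan]])
quantile_levels = [0.1, 0.5, 0.9]
# Predictions never contain NaN: one constant forecast per level, shape [Q, N, T].
y_pred = np.stack([np.full(y_true.shape, v) for v in (1.0, 2.0, 3.0)])

per_level = per_level_quantile_loss(y_true, y_pred, quantile_levels)
print(np.isnan(per_level).any())                               # is any level NaN?
print(np.isclose(np.mean(per_level), np.nanmean(per_level)))   # do both means agree?
```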
Okay, makes sense.
Just for my understanding: it would become NaN when ALL predictions of a specific quantile are NaN, right? But in that case we don't want to ignore it, as it would mean you could get a better SQL by not providing the hard quantiles. That is also why the predictions are not allowed to be NaN in general (as it would "sub-select" the aggregate based on the provided ones), right?
Currently we have a check here (line 827 in 516786b) that raises an error if there are any NaNs in the predictions, so only NaNs in the target are permitted.
This means the only scenario where the quantile loss is NaN for one quantile is when all target values are NaN, but then the loss will be NaN for all quantiles and metrics in general, which is easy to spot.
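A small sketch of both cases, under stated assumptions: `check_no_nan_predictions` is a hypothetical stand-in for the validation mentioned above (the real check and its error type may differ), and `per_level_loss` is a toy pinball loss for a single series.

```python
import warnings
import numpy as np

def check_no_nan_predictions(y_pred):
    """Hypothetical stand-in for the prediction validation; the real check may differ."""
    if np.isnan(y_pred).any():
        raise ValueError("Predictions must not contain NaN values")

def per_level_loss(y_true, y_pred, quantile_levels):
    """Toy pinball loss per quantile level for one series: y_true [T], y_pred [Q, T]."""
    check_no_nan_predictions(y_pred)
    q = np.asarray(quantile_levels, dtype=float)[:, None]  # [Q, 1]
    diff = y_true[None, :] - y_pred                        # [Q, T]
    loss = np.maximum(q * diff, (q - 1.0) * diff)
    with warnings.catch_warnings():
        # np.nanmean warns on all-NaN slices; silence that for the demo.
        warnings.simplefilter("ignore", category=RuntimeWarning)
        return np.nanmean(loss, axis=1)                    # [Q]

levels = [0.1, 0.5, 0.9]
y_pred = np.array([[0.5, 1.0], [1.0, 2.0], [1.5, 3.0]])    # [Q=3, T=2]

# Case 1: every target value is NaN, so the loss is NaN for every level at once.
scores = per_level_loss(np.full(2, np.nan), y_pred, levels)
print(scores)

# Case 2: a NaN in the predictions is rejected before any loss is computed.
try:
    per_level_loss(np.array([1.0, 2.0]), np.full((3, 2), np.nan), levels)
except ValueError as err:
    print("rejected:", err)
```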
Summary
Lets `fev` report quantile metrics (MQL / WQL / SQL) both overall (as today) and broken down per quantile level. With `quantile_levels=[0.1, 0.5, 0.9]` and `per_quantile_scores=True`, the summary gains `SQL[0.1]`, `SQL[0.5]`, `SQL[0.9]` alongside the overall `SQL`.
Design
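As a rough illustration of the design in this section, the sketch below mirrors the names used in this description (`Metric`, `QuantileMetric`, `_per_quantile_level`, `compute`, `compute_scores`, `per_quantile_scores`). The argument lists, the `ToyMQL` subclass, and the toy data are simplified assumptions for the example and do not reproduce the actual signatures in `fev`.

```python
import numpy as np

class Metric:
    """Simplified stand-in for the base metric class (signature is an assumption)."""
    name = "metric"

    def compute(self, y_true, y_pred, quantile_levels) -> float:
        raise NotImplementedError

    def compute_scores(self, y_true, y_pred, quantile_levels, per_quantile_scores=False) -> dict[str, float]:
        # Base behaviour: a single key holding the scalar score.
        return {self.name: self.compute(y_true, y_pred, quantile_levels)}

class QuantileMetric(Metric):
    def _per_quantile_level(self, y_true, y_pred, quantile_levels) -> np.ndarray:
        raise NotImplementedError  # subclasses return an array of shape [Q]

    def compute(self, y_true, y_pred, quantile_levels) -> float:
        # The overall score is the mean over levels, by construction.
        return float(np.mean(self._per_quantile_level(y_true, y_pred, quantile_levels)))

    def compute_scores(self, y_true, y_pred, quantile_levels, per_quantile_scores=False) -> dict[str, float]:
        per_level = self._per_quantile_level(y_true, y_pred, quantile_levels)
        scores = {self.name: float(np.mean(per_level))}
        if per_quantile_scores:
            for q, value in zip(quantile_levels, per_level):
                scores[f"{self.name}[{q}]"] = float(value)
        return scores

class ToyMQL(QuantileMetric):
    """Hypothetical subclass: mean pinball loss per level for y_true [T], y_pred [Q, T]."""
    name = "MQL"

    def _per_quantile_level(self, y_true, y_pred, quantile_levels) -> np.ndarray:
        q = np.asarray(quantile_levels, dtype=float)[:, None]
        diff = y_true[None, :] - y_pred
        return np.nanmean(np.maximum(q * diff, (q - 1.0) * diff), axis=1)

metric = ToyMQL()
levels = [0.1, 0.5, 0.9]
y_true = np.array([1.0, 2.0, 3.0])
y_pred = np.array([[0.5, 1.5, 2.5], [1.0, 2.0, 3.5], [2.0, 2.5, 4.0]])
scores = metric.compute_scores(y_true, y_pred, levels, per_quantile_scores=True)
print(scores)
```

Because `compute()` and `compute_scores()` both go through `_per_quantile_level`, the overall value and the per-level values in this sketch cannot disagree, and omitting `per_quantile_scores` yields only the single overall key.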
- `QuantileMetric` base class for MQL/WQL/SQL. Subclasses implement only `_per_quantile_level(...) -> np.ndarray` (`[Q]`); the overall score is defined as `mean` over levels. So the overall score always equals the mean of the per-level scores by construction: single code path, cannot drift.
- `Metric.compute_scores(...) -> dict[str, float]` is the new emission entry point. The base returns `{self.name: self.compute(...)}`; `QuantileMetric` overrides it to optionally add the per-level keys. `compute()` keeps returning a scalar, so `test_error` / leaderboards are unchanged.
- `per_quantile_scores` is a kwarg on `evaluation_summary` (threaded to `compute_metrics` → `compute_scores`). It deliberately does not become a `Task` field, so it stays out of `to_dict()` / YAML / the task fingerprint.
- `metrics_per_window` switched to a `defaultdict(list)` since the per-level keys aren't known up front.
Compatibility
- `per_quantile_scores=False` → summary schema is identical to before.
- `compute_scores`).
- `compute()` matches the previous implementations exactly, including with NaNs and too-short histories.
Tests