Forecasts crave a rating that reflects the forecast's quality in the context of what is possible in theory and what is reasonable to expect in practice.
Malte Tichy and colleagues from our ML group at Blue Yonder have published a new paper:
Granular forecasts in the regime of low count rates - as they often occur in retail, for which an intermittent demand of a handful might be observed per product, day, and location - are dominated by the inevitable statistical uncertainty of the Poisson distribution. This makes it hard to judge whether a certain metric value is dominated by Poisson noise or truly indicates a bad prediction model. To make things worse, every evaluation metric suffers from scaling: its value is mostly defined by the predicted selling rate and the resulting rate-dependent Poisson noise, and only secondarily by the quality of the forecast. For any metric, comparing two groups of forecasted products often yields "the slow movers are performing worse than the fast movers" or vice versa - the naïve scaling trap. To distill the intrinsic quality of a forecast, we stratify predictions into buckets of approximately equal rate and evaluate metrics for each bucket separately. By comparing the achieved value per bucket to benchmarks, we obtain a scaling-aware rating of count forecasts. Our procedure avoids the naïve scaling trap, provides an immediate intuitive judgment of forecast quality, and allows comparing forecasts for different products or even industries.
https://arxiv.org/abs/2211.16313
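To make the bucketing idea from the abstract more tangible, here is a minimal sketch in plain Python. It is not the paper's implementation: the function names, the logarithmic bucket boundaries, the choice of mean absolute error as the metric, and the resampling benchmark ("what would the metric be if the actuals really were Poisson draws around the predictions?") are all assumptions made for illustration. Predictions must be strictly positive.

```python
import math
import random
from collections import defaultdict


def sample_poisson(rate, rng):
    """Draw one Poisson variate (Knuth's method; adequate for small rates)."""
    threshold = math.exp(-rate)
    count, product = 0, rng.random()
    while product > threshold:
        count += 1
        product *= rng.random()
    return count


def bucket_index(rate, buckets_per_decade=2):
    """Hypothetical logarithmic bucket id, so rates in a bucket are of similar size."""
    return math.floor(math.log10(rate) * buckets_per_decade)


def mean_absolute_error(predictions, actuals):
    return sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(predictions)


def rate_bucketed_rating(predictions, actuals, n_resamples=100, seed=0):
    """Per bucket: (achieved MAE, benchmark MAE, achieved / benchmark)."""
    rng = random.Random(seed)
    groups = defaultdict(list)
    for p, a in zip(predictions, actuals):
        groups[bucket_index(p)].append((p, a))

    rating = {}
    for bucket, pairs in sorted(groups.items()):
        preds = [p for p, _ in pairs]
        acts = [a for _, a in pairs]
        achieved = mean_absolute_error(preds, acts)
        # Assumed benchmark: average MAE over synthetic actuals drawn from
        # Poisson(prediction), i.e. the MAE caused by Poisson noise alone.
        benchmark = sum(
            mean_absolute_error(preds, [sample_poisson(p, rng) for p in preds])
            for _ in range(n_resamples)
        ) / n_resamples
        rating[bucket] = (achieved, benchmark, achieved / benchmark)
    return rating


# Synthetic demo: the "forecast" equals the true rate, so it is as good as it can be.
demo_rng = random.Random(42)
true_rates = [0.2, 0.5, 2.0, 20.0] * 400
actuals = [sample_poisson(r, demo_rng) for r in true_rates]
rating = rate_bucketed_rating(true_rates, actuals)
for bucket, (achieved, benchmark, ratio) in rating.items():
    print(f"bucket {bucket:>2}: MAE={achieved:.3f} benchmark={benchmark:.3f} ratio={ratio:.2f}")
```

In this toy setup the raw MAE differs strongly between slow and fast movers, while the ratio to the per-bucket benchmark stays close to 1 everywhere, which is the kind of scaling-aware reading the abstract describes. With real data, a ratio clearly away from 1 in some bucket would be the signal worth investigating.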
Malte will also talk about this at PyData Global:
Meaningful probabilistic models not only produce a “best guess” for the target, but also convey their uncertainty, i.e., a belief in how the target is distributed around the predicted estimate. Business evaluation metrics such as mean absolute error, a priori, neglect that unavoidable uncertainty. This talk discusses why and how to account for uncertainty when evaluating models using traditional business metrics, using standard Python tooling. The resulting uncertainty-aware model rating satisfies the requirements of statisticians because it accounts for the probabilistic process that generates the target. It should please practitioners because it is based on established business metrics. It appeases executives because it allows concrete quantitative goals and non-defensive judgements.
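As a small illustration of why mean absolute error cannot be judged without the underlying uncertainty, the following sketch computes the MAE that a perfect forecast would still incur if the target is Poisson-distributed around the predicted rate. This is not taken from the talk: the Poisson assumption, the use of the predicted rate itself as the point forecast, and the helper names are choices made here for illustration only.

```python
import math


def poisson_pmf(k, rate):
    """P(X = k) for X ~ Poisson(rate), evaluated in log space for stability."""
    return math.exp(k * math.log(rate) - rate - math.lgamma(k + 1))


def expected_absolute_error(point_forecast, rate):
    """E|X - point_forecast| for X ~ Poisson(rate): the error due to noise alone."""
    # Truncate the sum far out in the tail, where the probability mass is negligible.
    k_max = int(rate + 12 * math.sqrt(rate) + 30)
    return sum(abs(k - point_forecast) * poisson_pmf(k, rate) for k in range(k_max + 1))


for rate in (0.2, 2.0, 20.0):
    noise_mae = expected_absolute_error(rate, rate)
    print(f"rate={rate:>4}: unavoidable MAE={noise_mae:.3f}, relative to rate={noise_mae / rate:.2f}")
```

Even with the true rate known exactly, the relative error of a slow mover comes out far larger than that of a fast mover. A fixed target such as "relative MAE below some threshold" therefore means very different things at different selling rates, which is the motivation for comparing an achieved metric value against what is attainable at that rate.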