Time Series Forecasting Metrics¶
In time series forecasting, evaluating the performance of predictive models is crucial to ensure accurate and reliable forecasts. Forecasting metrics are quantitative measures used to assess the accuracy and effectiveness of these models. They help in comparing different models, diagnosing errors, and making informed decisions based on forecast results.
Skforecast is compatible with most of the regression metrics from scikit-learn and includes additional metrics specifically designed for time series forecasting.
Mean Squared Error ("mean_squared_error")
Mean Absolute Error ("mean_absolute_error")
Mean Absolute Percentage Error ("mean_absolute_percentage_error")
Mean Squared Log Error ("mean_squared_log_error")
Median Absolute Error ("median_absolute_error")
Mean Absolute Scaled Error ("mean_absolute_scaled_error")
Root Mean Squared Scaled Error ("root_mean_squared_scaled_error")
In addition, Skforecast allows the user to define their own custom metrics. This can be done by creating a function that takes two arguments, the true values (y_true) and the predicted values (y_pred), and returns a single value representing the metric. The custom metric can optionally take an additional argument, y_train, with the training data used to fit the model.
In most cases, the metrics are calculated using the predictions generated in a backtesting process. For this, the backtesting_forecaster and backtesting_forecaster_multiseries functions take the metric argument, which specifies the metric(s) to be calculated and returned together with the predictions.
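The two scaled metrics in the list above compare the forecast error with the error of a naive forecast on the training data. The following is a minimal sketch, assuming the usual definitions based on a one-step naive forecast (each value predicted by the previous one). The function names `mase_sketch` and `rmsse_sketch` are illustrative, and the library's own implementation may differ in details such as missing-value handling.

```python
import numpy as np


def mase_sketch(y_true, y_pred, y_train):
    """Forecast MAE divided by the in-sample MAE of a one-step naive forecast."""
    y_true, y_pred, y_train = (
        np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train)
    )
    naive_mae = np.mean(np.abs(np.diff(y_train)))  # naive forecast: y[t] = y[t-1]
    return np.mean(np.abs(y_true - y_pred)) / naive_mae


def rmsse_sketch(y_true, y_pred, y_train):
    """Forecast RMSE divided by the in-sample RMSE of a one-step naive forecast."""
    y_true, y_pred, y_train = (
        np.asarray(a, dtype=float) for a in (y_true, y_pred, y_train)
    )
    naive_mse = np.mean(np.diff(y_train) ** 2)
    return np.sqrt(np.mean((y_true - y_pred) ** 2) / naive_mse)


# Toy values, chosen only to make the arithmetic easy to follow
y_train = [1, 3, 2, 5]
y_true = [4, 6]
y_pred = [5, 4]

mase_value = mase_sketch(y_true, y_pred, y_train)
rmsse_value = rmsse_sketch(y_true, y_pred, y_train)
print(mase_value, rmsse_value)
```

Under these definitions, a value below 1 means the model's error is smaller than that of the naive forecast on the training data.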
Metrics for single series forecasting¶
The following code shows how to calculate metrics when a single series is forecasted. The example uses the backtesting_forecaster function to generate predictions and calculate the metrics.
# Libraries
# ==============================================================================
import pandas as pd
from lightgbm import LGBMRegressor
from skforecast.datasets import fetch_dataset
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import backtesting_forecaster
from sklearn.metrics import mean_squared_error
from sklearn.metrics import mean_absolute_percentage_error
from sklearn.metrics import mean_absolute_error
from skforecast.metrics import mean_absolute_scaled_error
# Download dataset
# ==============================================================================
data = fetch_dataset('website_visits', raw=True)
data['date'] = pd.to_datetime(data['date'], format='%d/%m/%y')
data = data.set_index('date')
data = data.asfreq('1D')
data = data.sort_index()
data.head(3)
website_visits
--------------
Daily visits to the cienciadedatos.net website registered with the google analytics service.
Amat Rodrigo, J. (2021). cienciadedatos.net (1.0.0). Zenodo. https://doi.org/10.5281/zenodo.10006330
Shape of the dataset: (421, 2)

| date | users |
|---|---|
| 2020-07-01 | 2324 |
| 2020-07-02 | 2201 |
| 2020-07-03 | 2146 |
# Backtesting to generate predictions over the test set and calculate metrics
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = LGBMRegressor(random_state=123, verbose=-1),
    lags = 7
)
metrics = [
    'mean_absolute_error',
    'mean_squared_error',
    'mean_absolute_percentage_error',
    'mean_absolute_scaled_error'
]
backtest_metrics, predictions = backtesting_forecaster(
    forecaster = forecaster,
    y = data['users'],
    steps = 7,
    initial_train_size = len(data) // 2,
    metric = metrics,
    show_progress = True
)
backtest_metrics
| | mean_absolute_error | mean_squared_error | mean_absolute_percentage_error | mean_absolute_scaled_error |
|---|---|---|---|---|
| 0 | 298.298903 | 165698.930884 | 0.176136 | 0.704935 |
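The number of backtesting folds in the call above follows from the series length, the initial training size and the number of steps. The arithmetic is sketched below, assuming the folds are consecutive, non-overlapping blocks of `steps` observations, the last fold may be shorter, and no gap or other options are used.

```python
import math

n_observations = 421                      # rows in the website_visits dataset
initial_train_size = n_observations // 2  # 210, as in the call above
steps = 7

n_test = n_observations - initial_train_size      # observations left to predict
n_folds = math.ceil(n_test / steps)               # blocks of at most `steps` values
last_fold_size = n_test - (n_folds - 1) * steps   # size of the final, shorter fold

print(n_folds, last_fold_size)
```

The reported metrics are therefore computed over the 211 predictions collected from all folds, not over a single forecast horizon.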
It is possible to pass a list of functions instead of names.
# Backtesting with callable metrics
# ==============================================================================
metrics = [
    mean_absolute_error,
    mean_squared_error,
    mean_absolute_percentage_error,
    mean_absolute_scaled_error
]
backtest_metrics, predictions = backtesting_forecaster(
    forecaster = forecaster,
    y = data['users'],
    steps = 7,
    initial_train_size = len(data) // 2,
    metric = metrics,
    show_progress = True
)
backtest_metrics
| | mean_absolute_error | mean_squared_error | mean_absolute_percentage_error | mean_absolute_scaled_error |
|---|---|---|---|---|
| 0 | 298.298903 | 165698.930884 | 0.176136 | 0.704935 |
Since functions can be passed as arguments, it is possible to define custom metrics. The custom metric must be a function that takes two arguments, the true values (y_true) and the predicted values (y_pred), and returns a single value representing the metric. Optionally, it can take an additional argument, y_train, with the training data.
# Backtesting with custom metric
# ==============================================================================
def custom_metric(y_true, y_pred):
    """
    Calculate the mean absolute error excluding weekends.
    """
    # weekday: Monday=0 ... Sunday=6, so values below 5 are working days
    mask = y_true.index.weekday < 5
    metric = mean_absolute_error(y_true[mask], y_pred[mask])

    return metric

backtest_metrics, predictions = backtesting_forecaster(
    forecaster = forecaster,
    y = data['users'],
    steps = 7,
    initial_train_size = len(data) // 2,
    metric = custom_metric,
    show_progress = True
)
backtest_metrics
| | custom_metric |
|---|---|
| 0 | 298.375447 |
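A custom metric can also use the optional y_train argument. The sketch below scales the mean absolute error by the in-sample error of a seasonal naive forecast. The function name `seasonal_scaled_mae` and the `period` parameter are illustrative assumptions, and the sketch assumes the training data is supplied to the metric through an argument named y_train, as described above.

```python
import numpy as np


def seasonal_scaled_mae(y_true, y_pred, y_train, period=7):
    """
    MAE of the predictions divided by the in-sample MAE of a seasonal naive
    forecast (the value observed `period` steps earlier). `period` is a
    hypothetical extra parameter with a default, not part of any library API.
    """
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    y_train = np.asarray(y_train, dtype=float)

    # In-sample error of predicting each value with the one `period` steps back
    seasonal_naive_mae = np.mean(np.abs(y_train[period:] - y_train[:-period]))

    return np.mean(np.abs(y_true - y_pred)) / seasonal_naive_mae


# Stand-alone check with toy arrays (not the website_visits data)
toy_train = np.arange(14)
toy_value = seasonal_scaled_mae(
    y_true=[10, 20], y_pred=[17, 13], y_train=toy_train
)
print(toy_value)
```

If the assumption about y_train holds, the function could be passed as `metric = seasonal_scaled_mae` in the same way as custom_metric.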
Metrics for multiple time series forecasting¶
When working with multiple time series, it is common to calculate metrics across all series to get an overall measure of performance, in addition to individual values. Skforecast provides 3 different aggregation methods:
average: the average (arithmetic mean) of all levels.
weighted_average: the average of the metrics weighted by the number of predicted values of each level.
pooling: the values of all levels are pooled and then the metric is calculated.
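The three aggregation methods can be illustrated with a toy example. The sketch below uses hypothetical absolute errors for two levels with a different number of predictions and the mean absolute error as the metric. It is a plain NumPy illustration of the definitions above, not the library's internal code.

```python
import numpy as np

# Hypothetical absolute errors per level; level "b" has fewer predictions
abs_errors = {
    "a": np.array([1.0, 1.0, 1.0, 1.0]),
    "b": np.array([4.0, 4.0]),
}

mae_per_level = {level: err.mean() for level, err in abs_errors.items()}
n_per_level = {level: err.size for level, err in abs_errors.items()}

# average: arithmetic mean of the per-level metrics
average = np.mean(list(mae_per_level.values()))

# weighted_average: per-level metrics weighted by the number of predictions
weighted_average = np.average(
    list(mae_per_level.values()), weights=list(n_per_level.values())
)

# pooling: metric computed once over the errors of all levels together
pooling = np.concatenate(list(abs_errors.values())).mean()

print(average, weighted_average, pooling)
```

With unequal numbers of predictions, the plain average (2.5) differs from the weighted average and the pooled value (both 2.0 for the mean absolute error).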
To include the aggregated metrics in the output of backtesting_forecaster_multiseries, the add_aggregated_metric argument must be set to True.
# Libraries
# ==============================================================================
import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor
from skforecast.datasets import fetch_dataset
from skforecast.ForecasterAutoregMultiSeries import ForecasterAutoregMultiSeries
from skforecast.model_selection_multiseries import backtesting_forecaster_multiseries
from skforecast.model_selection_multiseries import grid_search_forecaster_multiseries
# Data download
# ==============================================================================
data = fetch_dataset(name="items_sales")
data.head()
items_sales
-----------
Simulated time series for the sales of 3 different items.
Simulated data.
Shape of the dataset: (1097, 3)

| date | item_1 | item_2 | item_3 |
|---|---|---|---|
| 2012-01-01 | 8.253175 | 21.047727 | 19.429739 |
| 2012-01-02 | 22.777826 | 26.578125 | 28.009863 |
| 2012-01-03 | 27.549099 | 31.751042 | 32.078922 |
| 2012-01-04 | 25.895533 | 24.567708 | 27.252276 |
| 2012-01-05 | 21.379238 | 18.191667 | 20.357737 |
# Split data into train-test
# ==============================================================================
end_train = '2014-07-15 23:59:00'
data_train = data.loc[:end_train, :].copy()
data_test = data.loc[end_train:, :].copy()
print(
    f"Train dates : {data_train.index.min()} --- {data_train.index.max()} "
    f"(n={len(data_train)})"
)
print(
    f"Test dates : {data_test.index.min()} --- {data_test.index.max()} "
    f"(n={len(data_test)})"
)
Train dates : 2012-01-01 00:00:00 --- 2014-07-15 00:00:00 (n=927)
Test dates : 2014-07-16 00:00:00 --- 2015-01-01 00:00:00 (n=170)
# Plot time series
# ==============================================================================
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(7, 5), sharex=True)
for i, item in enumerate(['item_1', 'item_2', 'item_3'], start=0):
    data_train[item].plot(label='train', ax=axes[i])
    data_test[item].plot(label='test', ax=axes[i])
    axes[i].set_ylabel('sales')
    axes[i].set_title(f'Item {i + 1}')
    if i == 2:  # Only set xlabel for the last plot
        axes[i].set_xlabel('')
    axes[i].legend(loc='upper left')
fig.tight_layout()
plt.show()
# Create and fit a Forecaster Multi-Series
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
    regressor = LGBMRegressor(random_state=123, verbose=-1),
    lags = 24,
    encoding = 'ordinal',
)
forecaster.fit(series=data_train)
# Backtesting forecaster on multiple time series
# ==============================================================================
metrics = [
    'mean_absolute_error',
    'mean_squared_error',
    'mean_absolute_percentage_error',
    'mean_absolute_scaled_error'
]
backtest_metrics, backtest_predictions = backtesting_forecaster_multiseries(
    forecaster = forecaster,
    series = data,
    levels = None,
    steps = 24,
    metric = metrics,
    add_aggregated_metric = True,
    initial_train_size = len(data_train),
    verbose = False,
    show_progress = True,
    suppress_warnings = False
)
print("Backtest metrics")
backtest_metrics
Backtest metrics
| | levels | mean_absolute_error | mean_squared_error | mean_absolute_percentage_error | mean_absolute_scaled_error |
|---|---|---|---|---|---|
| 0 | item_1 | 1.163837 | 2.748612 | 0.057589 | 0.751106 |
| 1 | item_2 | 2.634182 | 12.358668 | 0.171335 | 1.110702 |
| 2 | item_3 | 3.111351 | 16.096772 | 0.200664 | 0.835620 |
| 3 | average | 2.303123 | 10.401350 | 0.143196 | 0.899143 |
| 4 | weighted_average | 2.303123 | 10.401350 | 0.143196 | 0.899143 |
| 5 | pooling | 2.303123 | 10.401350 | 0.143196 | 0.903831 |
# Backtest predictions
# ==============================================================================
print("Backtest predictions")
backtest_predictions.head(4)
Backtest predictions
| | item_1 | item_2 | item_3 |
|---|---|---|---|
| 2014-07-16 | 25.906323 | 10.522491 | 12.034587 |
| 2014-07-17 | 25.807194 | 10.623789 | 10.503966 |
| 2014-07-18 | 25.127355 | 11.299802 | 12.206434 |
| 2014-07-19 | 23.902609 | 11.441606 | 12.618740 |
When using the hyperparameter optimization functions, the metrics are computed for each level and then aggregated. The aggregate_metric argument determines the aggregation method(s) to be used. Some considerations when choosing the aggregation method are:
'average' and 'weighted_average' will be different if the number of predicted values is different for each series (level).
When all series have the same number of predicted values, 'average', 'weighted_average' and 'pooling' are equivalent, except for scaled metrics (mean_absolute_scaled_error, root_mean_squared_scaled_error), because the naive forecast is calculated individually for each series.
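The second consideration can be illustrated with two toy series that have the same number of predictions but very different scales. This is a rough sketch, assuming that pooling a scaled metric divides the pooled forecast error by the pooled in-sample naive error, with the naive differences taken within each series. The exact pooling rule used by the library is an assumption here.

```python
import numpy as np

# Hypothetical series: same number of training values and predictions,
# but series "b" lives on a scale ten times larger than series "a"
series = {
    "a": {"train": [0, 1, 2, 3], "true": [4, 5], "pred": [5, 4]},
    "b": {"train": [0, 10, 20, 30], "true": [40, 50], "pred": [45, 45]},
}

# Absolute forecast errors and one-step naive errors, computed per series
abs_err = {k: np.abs(np.subtract(v["true"], v["pred"])) for k, v in series.items()}
naive_err = {k: np.abs(np.diff(v["train"])) for k, v in series.items()}

# Per-level MASE and its plain average
mase_per_level = {k: abs_err[k].mean() / naive_err[k].mean() for k in series}
average = np.mean(list(mase_per_level.values()))

# Pooled MASE: pool the errors of all series before taking the ratio
pooled_mase = (
    np.concatenate(list(abs_err.values())).mean()
    / np.concatenate(list(naive_err.values())).mean()
)

print(mase_per_level, average, pooled_mase)
```

Here the per-level values are 1.0 and 0.5, so the average is 0.75, while the pooled value is 3 / 5.5 (about 0.545) because the larger-scale series dominates both the numerator and the denominator.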
In this example, metric = ['mean_absolute_scaled_error', 'root_mean_squared_scaled_error'] and aggregate_metric = ['weighted_average', 'average', 'pooling'], so the following metrics are calculated:
mean_absolute_scaled_error__weighted_average, mean_absolute_scaled_error__average, mean_absolute_scaled_error__pooling
root_mean_squared_scaled_error__weighted_average, root_mean_squared_scaled_error__average, root_mean_squared_scaled_error__pooling
Since return_best is set to True, the best model is returned. The best model is the one that minimizes the first specified metric under the first specified aggregation method, mean_absolute_scaled_error__weighted_average.
# Grid search Multi-Series
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
    regressor = LGBMRegressor(random_state=123, verbose=-1),
    lags = 24,
    encoding = 'ordinal',
)
lags_grid = [24, 48]
param_grid = {
    'n_estimators': [10, 20],
    'max_depth': [3, 7]
}
levels = ['item_1', 'item_2', 'item_3']
results = grid_search_forecaster_multiseries(
    forecaster = forecaster,
    series = data,
    levels = levels,  # Same as levels=None
    lags_grid = lags_grid,
    param_grid = param_grid,
    steps = 24,
    metric = ['mean_absolute_scaled_error', 'root_mean_squared_scaled_error'],
    aggregate_metric = ['weighted_average', 'average', 'pooling'],
    initial_train_size = len(data_train),
    refit = False,
    fixed_train_size = False,
    return_best = True,
    n_jobs = 'auto',
    verbose = False,
    show_progress = True
)
results
8 models compared for 3 level(s). Number of iterations: 8.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set:
  Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48]
  Parameters: {'max_depth': 7, 'n_estimators': 20}
  Backtesting metric: 0.8832159475915824
  Levels: ['item_1', 'item_2', 'item_3']
| | levels | lags | lags_label | params | mean_absolute_scaled_error__weighted_average | mean_absolute_scaled_error__average | mean_absolute_scaled_error__pooling | root_mean_squared_scaled_error__weighted_average | root_mean_squared_scaled_error__average | root_mean_squared_scaled_error__pooling | max_depth | n_estimators |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 20} | 0.883216 | 0.883216 | 0.879320 | 0.855980 | 0.855980 | 0.850240 | 7 | 20 |
| 1 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 20} | 0.954187 | 0.954187 | 0.933739 | 0.908577 | 0.908577 | 0.883190 | 3 | 20 |
| 2 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 20} | 0.961524 | 0.961524 | 0.954798 | 0.916509 | 0.916509 | 0.913176 | 7 | 20 |
| 3 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 20} | 0.983487 | 0.983487 | 0.964384 | 0.922949 | 0.922949 | 0.902035 | 3 | 20 |
| 4 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 10} | 1.275773 | 1.275773 | 1.196362 | 1.147386 | 1.147386 | 1.061439 | 7 | 10 |
| 5 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 10} | 1.356463 | 1.356463 | 1.282314 | 1.216944 | 1.216944 | 1.132679 | 3 | 10 |
| 6 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 10} | 1.357693 | 1.357693 | 1.273732 | 1.217487 | 1.217487 | 1.123731 | 3 | 10 |
| 7 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 10} | 1.396803 | 1.396803 | 1.307453 | 1.256564 | 1.256564 | 1.161849 | 7 | 10 |
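The naming of the metric columns and the rule used to pick the best model can be sketched in plain Python. The `metric__aggregation` pattern is inferred from the column headers of the results table, and the list of dictionaries is only a stand-in for three rows of that table, not the library's data structure.

```python
from itertools import product

metric = ['mean_absolute_scaled_error', 'root_mean_squared_scaled_error']
aggregate_metric = ['weighted_average', 'average', 'pooling']

# One column per (metric, aggregation) pair, metrics first
metric_columns = [f"{m}__{a}" for m, a in product(metric, aggregate_metric)]

# The first metric combined with the first aggregation drives the selection
selection_column = metric_columns[0]

# Three rows copied from the results table (selection column only)
candidates = [
    {'params': {'max_depth': 7, 'n_estimators': 10}, selection_column: 1.275773},
    {'params': {'max_depth': 7, 'n_estimators': 20}, selection_column: 0.883216},
    {'params': {'max_depth': 3, 'n_estimators': 20}, selection_column: 0.954187},
]

# The best candidate is the one with the lowest value in the selection column
best = min(candidates, key=lambda row: row[selection_column])

print(metric_columns)
print(best)
```

Changing the order of the entries in metric or aggregate_metric would change which column is used for the selection, so the order of both lists matters when return_best is True.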