Global Forecasting Models: Independent multi-series forecasting¶
Univariate time series forecasting models a single time series as a linear or nonlinear combination of its lags, using past values of the series to predict its future. Global forecasting, in contrast, involves building a single predictive model that considers all time series simultaneously. It attempts to capture the core patterns that govern the series, thereby mitigating the potential noise that each series might introduce. This approach is computationally efficient, easy to maintain, and can yield more robust generalizations across time series.
In independent multi-series forecasting, a single model is trained for all time series, but each time series remains independent of the others, meaning that past values of one series are not used as predictors of other series. However, modeling them together is useful because the series may follow the same intrinsic pattern regarding their past and future values. For instance, the sales of products A and B in the same store may not be related, but they follow the same dynamics, that of the store.
Internal Forecaster transformation of two time series and an exogenous variable into the matrices needed to train a machine learning model in a multi-series context.
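The transformation described above can be sketched with pandas alone. This is an illustrative approximation, not skforecast's internal code: the helper name `build_multiseries_matrix` and the toy values are assumptions, while the `_level_skforecast` column name follows the training matrices shown later in this guide.

```python
import pandas as pd

def build_multiseries_matrix(series: pd.DataFrame, lags: int):
    """Stack the lag matrices of every column of `series` (hypothetical helper)."""
    X_parts, y_parts = [], []
    for level_id, name in enumerate(series.columns):
        s = series[name]
        # lag_k at time t is the value of the same series at time t - k
        X = pd.concat({f"lag_{k}": s.shift(k) for k in range(1, lags + 1)}, axis=1)
        X["_level_skforecast"] = level_id  # ordinal identifier of the series
        X_parts.append(X.iloc[lags:])      # the first `lags` rows have incomplete lags
        y_parts.append(s.iloc[lags:])
    return pd.concat(X_parts), pd.concat(y_parts)

toy = pd.DataFrame(
    {"series_1": [1.0, 2.0, 3.0, 4.0, 5.0], "series_2": [10.0, 20.0, 30.0, 40.0, 50.0]},
    index=pd.date_range("2020-01-01", periods=5, freq="D"),
)
X_toy, y_toy = build_multiseries_matrix(toy, lags=2)
print(X_toy)
```

Each series contributes its own block of rows, and its lags are computed only from its own past values, which is what keeps the series independent inside a single model.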
To predict the next n steps, the strategy of recursive multi-step forecasting is applied, with the only difference being that the series name for which to estimate the predictions needs to be indicated.
Diagram of recursive forecasting with multiple independent time series.
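A minimal sketch of the recursive strategy, assuming a pooled linear model fitted by least squares in place of a scikit-learn regressor. The toy dynamics, the series names and the helpers `simulate`, `make_rows` and `recursive_predict` are all hypothetical; the point is only that each prediction is fed back as the newest lag of the selected series.

```python
import numpy as np

def simulate(start, offset, n):
    """Toy series y_t = 0.5 * y_{t-1} + 1 + offset (assumed dynamics)."""
    y = [start]
    for _ in range(n - 1):
        y.append(0.5 * y[-1] + 1.0 + offset)
    return np.array(y)

def make_rows(values, lags, level_id):
    """Rows [lag_1, ..., lag_p, level_id, intercept] and targets for one series."""
    rows = [list(values[t - lags:t][::-1]) + [level_id, 1.0]
            for t in range(lags, len(values))]
    return np.array(rows, dtype=float), values[lags:]

lags = 1
series = {"item_a": simulate(0.0, 0.0, 12), "item_b": simulate(0.0, 3.0, 12)}
level_ids = {name: i for i, name in enumerate(series)}

# One model for all series: stack the rows of every series and fit once
X = np.vstack([make_rows(v, lags, level_ids[k])[0] for k, v in series.items()])
y = np.concatenate([make_rows(v, lags, level_ids[k])[1] for k, v in series.items()])
coef, *_ = np.linalg.lstsq(X, y, rcond=None)

def recursive_predict(level, steps):
    """Predict `steps` values for one series, reusing predictions as lags."""
    window = list(series[level][-lags:])  # oldest ... newest
    preds = []
    for _ in range(steps):
        x = np.array(window[::-1] + [level_ids[level], 1.0])
        y_hat = float(x @ coef)
        preds.append(y_hat)
        window = window[1:] + [y_hat]     # drop the oldest value, append the prediction
    return preds

preds_b = recursive_predict("item_b", steps=3)
print(preds_b)
```

The series name is needed at prediction time for two reasons: it selects the last window of observed values and it sets the series identifier in the predictor row.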
Using the ForecasterAutoregMultiSeries and ForecasterAutoregMultiSeriesCustom classes, it is possible to easily build machine learning models for independent multi-series forecasting.
✎ Note
Skforecast offers additional approaches to create Global Forecasting Models:
💡 Tip
To learn more about global forecasting models visit our examples:
Libraries¶
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error
from skforecast.datasets import fetch_dataset
from skforecast.ForecasterAutoregMultiSeries import ForecasterAutoregMultiSeries
from skforecast.model_selection_multiseries import backtesting_forecaster_multiseries
from skforecast.model_selection_multiseries import grid_search_forecaster_multiseries
from skforecast.model_selection_multiseries import bayesian_search_forecaster_multiseries
Data¶
# Data download
# ==============================================================================
data = fetch_dataset(name="items_sales")
data.head()
items_sales ----------- Simulated time series for the sales of 3 different items. Simulated data. Shape of the dataset: (1097, 3)
date | item_1 | item_2 | item_3 |
---|---|---|---|
2012-01-01 | 8.253175 | 21.047727 | 19.429739 |
2012-01-02 | 22.777826 | 26.578125 | 28.009863 |
2012-01-03 | 27.549099 | 31.751042 | 32.078922 |
2012-01-04 | 25.895533 | 24.567708 | 27.252276 |
2012-01-05 | 21.379238 | 18.191667 | 20.357737 |
# Split data into train-test
# ==============================================================================
end_train = '2014-07-15 23:59:00'
data_train = data.loc[:end_train, :].copy()
data_test = data.loc[end_train:, :].copy()
print(
f"Train dates : {data_train.index.min()} --- {data_train.index.max()} "
f"(n={len(data_train)})"
)
print(
f"Test dates : {data_test.index.min()} --- {data_test.index.max()} "
f"(n={len(data_test)})"
)
Train dates : 2012-01-01 00:00:00 --- 2014-07-15 00:00:00 (n=927)
Test dates : 2014-07-16 00:00:00 --- 2015-01-01 00:00:00 (n=170)
# Plot time series
# ==============================================================================
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(9, 5), sharex=True)
data_train['item_1'].plot(label='train', ax=axes[0])
data_test['item_1'].plot(label='test', ax=axes[0])
axes[0].set_xlabel('')
axes[0].set_ylabel('sales')
axes[0].set_title('Item 1')
axes[0].legend()
data_train['item_2'].plot(label='train', ax=axes[1])
data_test['item_2'].plot(label='test', ax=axes[1])
axes[1].set_xlabel('')
axes[1].set_ylabel('sales')
axes[1].set_title('Item 2')
data_train['item_3'].plot(label='train', ax=axes[2])
data_test['item_3'].plot(label='test', ax=axes[2])
axes[2].set_xlabel('')
axes[2].set_ylabel('sales')
axes[2].set_title('Item 3')
fig.tight_layout()
plt.show();
Train and predict ForecasterAutoregMultiSeries¶
# Create and fit a Forecaster Multi-Series
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = RandomForestRegressor(random_state=123),
lags = 24,
encoding = 'ordinal',
transformer_series = StandardScaler(),
transformer_exog = None,
weight_func = None,
series_weights = None,
differentiation = None,
dropna_from_series = False,
fit_kwargs = None,
forecaster_id = None
)
forecaster.fit(series=data_train)
forecaster
============================
ForecasterAutoregMultiSeries
============================
Regressor: RandomForestRegressor(random_state=123)
Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
Transformer for series: StandardScaler()
Transformer for exog: None
Series encoding: ordinal
Window size: 24
Series levels (names): ['item_1', 'item_2', 'item_3']
Series weights: None
Weight function included: False
Differentiation order: None
Exogenous included: False
Type of exogenous variable: None
Exogenous variables names: None
Training range: ["'item_1': ['2012-01-01', '2014-07-15']", "'item_2': ['2012-01-01', '2014-07-15']", "'item_3': ['2012-01-01', '2014-07-15']"]
Training index type: DatetimeIndex
Training index frequency: D
Regressor parameters: bootstrap: True, ccp_alpha: 0.0, criterion: squared_error, max_depth: None, max_features: 1.0, ...
fit_kwargs: {}
Creation date: 2024-05-20 14:52:33
Last fit date: 2024-05-20 14:52:38
Skforecast version: 0.12.0
Python version: 3.11.5
Forecaster id: None
Two methods can be used to predict the next n steps: predict() or predict_interval(). The argument levels indicates the series for which predictions are estimated. If None, all series will be predicted.
# Predict and predict_interval
# ==============================================================================
steps = 24
# Predictions for item_1
predictions_item_1 = forecaster.predict(steps=steps, levels='item_1')
display(predictions_item_1.head(3))
# Interval predictions for item_1 and item_2
predictions_intervals = forecaster.predict_interval(steps=steps, levels=['item_1', 'item_2'])
display(predictions_intervals.head(3))
| | item_1 |
---|---|
2014-07-16 | 25.727855 |
2014-07-17 | 25.846049 |
2014-07-18 | 25.605574 |
| | item_1 | item_1_lower_bound | item_1_upper_bound | item_2 | item_2_lower_bound | item_2_upper_bound |
---|---|---|---|---|---|---|
2014-07-16 | 25.727855 | 24.987747 | 26.574986 | 11.236102 | 9.801042 | 13.136002 |
2014-07-17 | 25.846049 | 24.870217 | 26.658201 | 10.930129 | 9.388806 | 13.134655 |
2014-07-18 | 25.605574 | 24.442493 | 26.235981 | 11.445854 | 9.573716 | 13.466252 |
Backtesting Multi Series¶
As in the predict method, the levels at which backtesting is performed must be indicated. The argument can also be set to None to perform backtesting at all levels.
# Backtesting Multi-Series
# ==============================================================================
metrics_levels, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = data,
exog = None,
levels = None,
steps = 24,
metric = 'mean_absolute_error',
initial_train_size = len(data_train),
fixed_train_size = True,
gap = 0,
allow_incomplete_fold = True,
refit = True,
n_jobs = 'auto',
verbose = False,
show_progress = True,
suppress_warnings = False
)
print("Backtest metrics")
display(metrics_levels)
print("")
print("Backtest predictions")
backtest_predictions.head(4)
Backtest metrics
| | levels | mean_absolute_error |
---|---|---|
0 | item_1 | 1.244997 |
1 | item_2 | 2.449936 |
2 | item_3 | 3.219898 |
Backtest predictions
| | item_1 | item_2 | item_3 |
---|---|---|---|
2014-07-16 | 25.727855 | 11.236102 | 10.598939 |
2014-07-17 | 25.846049 | 10.930129 | 11.898157 |
2014-07-18 | 25.605574 | 11.445854 | 11.708411 |
2014-07-19 | 24.156949 | 11.298911 | 12.194366 |
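The way the test period is split into backtesting folds can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the library's implementation: `backtesting_folds` is a hypothetical helper that only returns the positional boundaries of each test window with gap = 0, and it ignores how the training window is refitted.

```python
def backtesting_folds(n_obs, initial_train_size, steps, allow_incomplete_fold=True):
    """Return (start, end) positions of each test window (end is exclusive)."""
    folds = []
    start = initial_train_size
    while start < n_obs:
        end = min(start + steps, n_obs)
        if end - start < steps and not allow_incomplete_fold:
            break  # discard a last fold with fewer than `steps` observations
        folds.append((start, end))
        start = end
    return folds

# Sizes taken from this guide: 1097 observations, 927 of them used for training
folds = backtesting_folds(n_obs=1097, initial_train_size=927, steps=24)
print(len(folds))
print(folds[0], folds[-1])
```

With 170 test observations and steps = 24, this gives seven complete folds plus a final incomplete fold of two observations, which is kept because allow_incomplete_fold = True.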
Hyperparameter tuning and lags selection Multi Series¶
The grid_search_forecaster_multiseries, random_search_forecaster_multiseries and bayesian_search_forecaster_multiseries functions in the model_selection_multiseries module allow for lags and hyperparameter optimization. The search is performed using the backtesting strategy for validation, as in other Forecasters (see the user guide here), except for the levels argument:
levels: level(s) at which the forecaster is optimized, for example:
If levels = ['item_1', 'item_2', 'item_3'] (same as levels = None), the function will search for the lags and hyperparameters that minimize the average error of the predictions of all the time series. The resulting metric will be the average of all levels.
If levels = 'item_1' (same as levels = ['item_1']), the function will search for the lags and hyperparameters that minimize the error of the item_1 predictions. The resulting metric will be the one calculated for item_1.
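The effect of levels on the search metric can be illustrated with a small sketch. The helper `search_metric` and the toy values are hypothetical, and the unweighted mean across levels is an assumption based on the description above.

```python
import numpy as np

def mean_absolute_error(y_true, y_pred):
    """MAE of one series."""
    return float(np.mean(np.abs(np.asarray(y_true) - np.asarray(y_pred))))

def search_metric(y_true, y_pred, levels):
    """Per-level MAE and its average over the requested levels."""
    per_level = {lvl: mean_absolute_error(y_true[lvl], y_pred[lvl]) for lvl in levels}
    return per_level, float(np.mean(list(per_level.values())))

y_true = {"item_1": [10.0, 12.0], "item_2": [20.0, 22.0]}
y_pred = {"item_1": [11.0, 12.0], "item_2": [18.0, 25.0]}

per_level, average = search_metric(y_true, y_pred, levels=["item_1", "item_2"])
print(per_level, average)

# Optimizing a single level only looks at the error of that series
_, only_item_1 = search_metric(y_true, y_pred, levels=["item_1"])
print(only_item_1)
```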
The following example shows how to use grid_search_forecaster_multiseries to find the best lags and model hyperparameters for all time series (all levels).
# Create Forecaster Multi-Series
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = RandomForestRegressor(random_state=123),
lags = 24,
encoding = 'ordinal',
transformer_series = StandardScaler()
)
# Grid search Multi-Series
# ==============================================================================
lags_grid = [24, 48]
param_grid = {
'n_estimators': [10, 20],
'max_depth': [3, 7]
}
levels = ['item_1', 'item_2', 'item_3']
results = grid_search_forecaster_multiseries(
forecaster = forecaster,
series = data,
exog = None,
levels = levels, # Same as levels=None
lags_grid = lags_grid,
param_grid = param_grid,
steps = 24,
metric = 'mean_absolute_error',
initial_train_size = len(data_train),
refit = False,
fixed_train_size = False,
return_best = False,
n_jobs = 'auto',
verbose = False,
show_progress = True
)
results
8 models compared for 3 level(s). Number of iterations: 8.
| | levels | lags | lags_label | params | mean_absolute_error | max_depth | n_estimators |
---|---|---|---|---|---|---|---|
7 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 20} | 2.339379 | 7 | 20 |
3 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 20} | 2.358138 | 7 | 20 |
6 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 10} | 2.392723 | 7 | 10 |
2 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 10} | 2.402785 | 7 | 10 |
5 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 20} | 2.464477 | 3 | 20 |
1 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 20} | 2.532149 | 3 | 20 |
4 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 10} | 2.588693 | 3 | 10 |
0 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 10} | 2.706529 | 3 | 10 |
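The count of 8 models follows from crossing the 2 lag configurations with the 4 hyperparameter combinations. A short sketch of that enumeration, using only the standard library (the variable names `param_combinations` and `candidates` are illustrative):

```python
from itertools import product

lags_grid = [24, 48]
param_grid = {"n_estimators": [10, 20], "max_depth": [3, 7]}

# Every combination of hyperparameter values
param_combinations = [
    dict(zip(param_grid, values)) for values in product(*param_grid.values())
]

# Every (lags, params) pair evaluated by the grid search
candidates = [(lags, params) for lags in lags_grid for params in param_combinations]
print(len(candidates))
```

This is also why each params value appears twice in the results table: the two rows differ in the number of lags (24 or 48), a difference that the truncated lags column does not show.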
It is also possible to perform a Bayesian optimization with optuna using the bayesian_search_forecaster_multiseries function. For more information about this type of optimization, see the user guide here.
# Bayesian search hyperparameters and lags with Optuna
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = RandomForestRegressor(random_state=123),
lags = 24,
encoding = 'ordinal'
)
levels = ['item_1', 'item_2', 'item_3']
# Search space
def search_space(trial):
search_space = {
'lags' : trial.suggest_categorical('lags', [24, 48]),
'n_estimators' : trial.suggest_int('n_estimators', 10, 20),
'min_samples_leaf' : trial.suggest_int('min_samples_leaf', 1, 10),
'ccp_alpha' : trial.suggest_float('ccp_alpha', 0., 1.),
'max_features' : trial.suggest_categorical('max_features', ['log2', 'sqrt'])
}
return search_space
results, best_trial = bayesian_search_forecaster_multiseries(
forecaster = forecaster,
series = data,
exog = None,
levels = levels, # Same as levels=None
search_space = search_space,
steps = 24,
metric = 'mean_absolute_error',
refit = False,
initial_train_size = len(data_train),
fixed_train_size = False,
n_trials = 5,
random_state = 123,
return_best = False,
n_jobs = 'auto',
verbose = False,
show_progress = True,
suppress_warnings = False,
engine = 'optuna',
kwargs_create_study = {},
kwargs_study_optimize = {}
)
results.head(4)
| | levels | lags | params | mean_absolute_error | n_estimators | min_samples_leaf | ccp_alpha | max_features |
---|---|---|---|---|---|---|---|---|
3 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'n_estimators': 16, 'min_samples_leaf': 8, 'c... | 3.007254 | 16 | 8 | 0.322959 | log2 |
4 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'n_estimators': 11, 'min_samples_leaf': 5, 'c... | 3.070554 | 11 | 5 | 0.430863 | log2 |
2 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'n_estimators': 12, 'min_samples_leaf': 2, 'c... | 3.070939 | 12 | 2 | 0.531551 | sqrt |
1 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'n_estimators': 14, 'min_samples_leaf': 4, 'c... | 3.088771 | 14 | 4 | 0.729050 | log2 |
best_trial contains information about the trial that achieved the best results. See more in Study class.
# Optuna best trial in the study
# ==============================================================================
best_trial
FrozenTrial(number=3, state=1, values=[3.007253701973608], datetime_start=datetime.datetime(2024, 5, 20, 14, 54, 4, 206400), datetime_complete=datetime.datetime(2024, 5, 20, 14, 54, 4, 441404), params={'lags': 24, 'n_estimators': 16, 'min_samples_leaf': 8, 'ccp_alpha': 0.3229589138531782, 'max_features': 'log2'}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'lags': CategoricalDistribution(choices=(24, 48)), 'n_estimators': IntDistribution(high=20, log=False, low=10, step=1), 'min_samples_leaf': IntDistribution(high=10, log=False, low=1, step=1), 'ccp_alpha': FloatDistribution(high=1.0, log=False, low=0.0, step=None), 'max_features': CategoricalDistribution(choices=('log2', 'sqrt'))}, trial_id=3, value=None)
Exogenous variables in multi-series¶
Exogenous variables are predictors that are independent of the model being used for forecasting, and their future values must be known in order to include them in the prediction process.
✎ Note
Starting from version 0.12.0, the ForecasterAutoregMultiSeries allows the use of different exogenous variables for each series. See Global Forecasting Models: Time series with different lengths and different exogenous variables for more information.
💡 Tip
To learn more about exogenous variables in skforecast visit the exogenous variables user guide.
# Generate exogenous variable month
# ==============================================================================
data_exog = data.copy()
data_exog['month'] = data_exog.index.month
# Split data into train-test
# ==============================================================================
end_train = '2014-07-15 23:59:00'
data_exog_train = data_exog.loc[:end_train, :].copy()
data_exog_test = data_exog.loc[end_train:, :].copy()
data_exog_train.head(3)
date | item_1 | item_2 | item_3 | month |
---|---|---|---|---|
2012-01-01 | 8.253175 | 21.047727 | 19.429739 | 1 |
2012-01-02 | 22.777826 | 26.578125 | 28.009863 | 1 |
2012-01-03 | 27.549099 | 31.751042 | 32.078922 | 1 |
# Create and fit forecaster Multi-Series
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = RandomForestRegressor(random_state=123),
lags = 24,
encoding = 'ordinal'
)
forecaster.fit(
series = data_exog_train[['item_1', 'item_2', 'item_3']],
exog = data_exog_train[['month']]
)
forecaster
============================
ForecasterAutoregMultiSeries
============================
Regressor: RandomForestRegressor(random_state=123)
Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
Transformer for series: StandardScaler()
Transformer for exog: None
Series encoding: ordinal
Window size: 24
Series levels (names): ['item_1', 'item_2', 'item_3']
Series weights: None
Weight function included: False
Differentiation order: None
Exogenous included: True
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'>
Exogenous variables names: ['month']
Training range: ["'item_1': ['2012-01-01', '2014-07-15']", "'item_2': ['2012-01-01', '2014-07-15']", "'item_3': ['2012-01-01', '2014-07-15']"]
Training index type: DatetimeIndex
Training index frequency: D
Regressor parameters: bootstrap: True, ccp_alpha: 0.0, criterion: squared_error, max_depth: None, max_features: 1.0, ...
fit_kwargs: {}
Creation date: 2024-05-20 14:54:04
Last fit date: 2024-05-20 14:54:09
Skforecast version: 0.12.0
Python version: 3.11.5
Forecaster id: None
If the Forecaster has been trained using exogenous variables, they must also be provided during the prediction phase.
# Predict with exogenous variables
# ==============================================================================
predictions = forecaster.predict(steps=24, exog=data_exog_test[['month']])
predictions.head(3)
| | item_1 | item_2 | item_3 |
---|---|---|---|
2014-07-16 | 25.793280 | 11.110627 | 10.682699 |
2014-07-17 | 25.846751 | 11.049392 | 12.089319 |
2014-07-18 | 25.552653 | 11.316862 | 12.093196 |
When exog is passed as a single DataFrame, the month exogenous variable is replicated for each of the series. This can be checked with the create_train_X_y method, which returns the matrices used in the fit method.
# X_train matrix
# ==============================================================================
X_train = forecaster.create_train_X_y(
series = data_exog_train[['item_1', 'item_2', 'item_3']],
exog = data_exog_train[['month']]
)[0]
# X_train slice for item_1
# ==============================================================================
X_train.loc[X_train['_level_skforecast'] == 0].head(3)
date | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | lag_6 | lag_7 | lag_8 | lag_9 | lag_10 | ... | lag_17 | lag_18 | lag_19 | lag_20 | lag_21 | lag_22 | lag_23 | lag_24 | _level_skforecast | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2012-01-25 | 2.163111 | 0.587328 | -0.656056 | 0.010719 | 0.602052 | 0.896105 | 2.641973 | 1.623623 | -0.877145 | -1.365855 | ... | -0.939251 | -0.757959 | -0.534430 | -0.428047 | 1.334476 | 1.979794 | 0.117764 | -5.550607 | 0 | 1 |
2012-01-26 | 2.447474 | 2.163111 | 0.587328 | -0.656056 | 0.010719 | 0.602052 | 0.896105 | 2.641973 | 1.623623 | -0.877145 | ... | -0.963902 | -0.939251 | -0.757959 | -0.534430 | -0.428047 | 1.334476 | 1.979794 | 0.117764 | 0 | 1 |
2012-01-27 | 0.558968 | 2.447474 | 2.163111 | 0.587328 | -0.656056 | 0.010719 | 0.602052 | 0.896105 | 2.641973 | 1.623623 | ... | -0.334016 | -0.963902 | -0.939251 | -0.757959 | -0.534430 | -0.428047 | 1.334476 | 1.979794 | 0 | 1 |
3 rows × 26 columns
# X_train slice for item_2
# ==============================================================================
X_train.loc[X_train['_level_skforecast'] == 1].head(3)
date | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | lag_6 | lag_7 | lag_8 | lag_9 | lag_10 | ... | lag_17 | lag_18 | lag_19 | lag_20 | lag_21 | lag_22 | lag_23 | lag_24 | _level_skforecast | month |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2012-01-25 | 2.050924 | 1.782460 | 1.054339 | 0.654039 | 0.551985 | 0.569480 | 2.471635 | 2.327927 | 0.567397 | 0.135856 | ... | 1.535865 | 0.618424 | 0.278939 | 0.354751 | 1.629588 | 3.065837 | 2.031555 | 0.925797 | 1 | 1 |
2012-01-26 | 2.038219 | 2.050924 | 1.782460 | 1.054339 | 0.654039 | 0.551985 | 0.569480 | 2.471635 | 2.327927 | 0.567397 | ... | 0.761091 | 1.535865 | 0.618424 | 0.278939 | 0.354751 | 1.629588 | 3.065837 | 2.031555 | 1 | 1 |
2012-01-27 | 0.668201 | 2.038219 | 2.050924 | 1.782460 | 1.054339 | 0.654039 | 0.551985 | 0.569480 | 2.471635 | 2.327927 | ... | 0.548653 | 0.761091 | 1.535865 | 0.618424 | 0.278939 | 0.354751 | 1.629588 | 3.065837 | 1 | 1 |
3 rows × 26 columns
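The replication seen in both slices (same dates, same month values, different _level_skforecast) can be reproduced with a small pandas sketch. The toy data and the name `stacked_exog` are assumptions for illustration; this is not the library's internal code.

```python
import pandas as pd

index = pd.date_range("2012-01-01", periods=4, freq="D")
series = pd.DataFrame(
    {"item_1": [1.0, 2.0, 3.0, 4.0], "item_2": [5.0, 6.0, 7.0, 8.0]}, index=index
)
exog = pd.DataFrame({"month": index.month.to_numpy()}, index=index)

blocks = []
for level_id, name in enumerate(series.columns):
    block = exog.copy()                    # the same exog rows are reused for every series
    block["_level_skforecast"] = level_id  # identifies the series of each block
    blocks.append(block)

stacked_exog = pd.concat(blocks)
print(stacked_exog)
```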
To use exogenous variables in backtesting or hyperparameter tuning, they must be specified with the exog argument.
# Backtesting Multi-Series with exog
# ==============================================================================
metrics_levels, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = data_exog[['item_1', 'item_2', 'item_3']],
exog = data_exog[['month']],
levels = None,
steps = 24,
metric = 'mean_absolute_error',
initial_train_size = len(data_exog_train),
fixed_train_size = True,
gap = 0,
allow_incomplete_fold = True,
refit = True,
n_jobs = 'auto',
verbose = False,
show_progress = True,
suppress_warnings = False
)
print("Backtest metrics")
display(metrics_levels)
print("")
print("Backtest predictions with exogenous variables")
backtest_predictions.head(4)
Backtest metrics
| | levels | mean_absolute_error |
---|---|---|
0 | item_1 | 1.256685 |
1 | item_2 | 2.408477 |
2 | item_3 | 3.224960 |
Backtest predictions with exogenous variables
| | item_1 | item_2 | item_3 |
---|---|---|---|
2014-07-16 | 25.793280 | 11.110627 | 10.682699 |
2014-07-17 | 25.846751 | 11.049392 | 12.089319 |
2014-07-18 | 25.552653 | 11.316862 | 12.093196 |
2014-07-19 | 24.144075 | 11.357579 | 12.357503 |
Scikit-learn transformers in multi-series¶
By default, the ForecasterAutoregMultiSeries class uses the scikit-learn StandardScaler transformer to scale the data. This transformer is applied to all series. However, it is possible to use different transformers for each series or not to apply any transformation at all:
If transformer_series is a transformer, the same transformation will be applied to all series.
If transformer_series is a dict, a different transformation can be set for each series. Series not present in the dict will not have any transformation applied to them (check warning message).
Learn more about using scikit-learn transformers with skforecast.
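The dict behaviour can be sketched without scikit-learn. In this illustration the hand-written `standardize` function stands in for StandardScaler (both use the population standard deviation), and `apply_transformers` is a hypothetical helper, not part of skforecast.

```python
import numpy as np

def standardize(values):
    """Scale to zero mean and unit variance (population standard deviation)."""
    values = np.asarray(values, dtype=float)
    return (values - values.mean()) / values.std()

def apply_transformers(series, transformer_series):
    """Transform each series with its own entry; series without one are left as is."""
    transformed = {}
    for name, values in series.items():
        transform = transformer_series.get(name)
        if transform is None:
            transformed[name] = np.asarray(values, dtype=float)
        else:
            transformed[name] = transform(values)
    return transformed

series = {
    "item_1": [1.0, 2.0, 3.0],
    "item_2": [10.0, 20.0, 30.0],
    "item_3": [5.0, 7.0, 9.0],
}
scaled = apply_transformers(series, {"item_1": standardize, "item_2": standardize})
print(scaled)
```

After scaling, item_1 and item_2 share the same scale, while item_3 keeps its original values because it has no entry in the dict.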
# Transformers in Multi-Series
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = RandomForestRegressor(random_state=123),
lags = 24,
encoding = 'ordinal',
transformer_series = {'item_1': StandardScaler(), 'item_2': StandardScaler()},
transformer_exog = None
)
forecaster.fit(series=data_train)
forecaster
IgnoredArgumentWarning: {'item_3'} not present in `transformer_series`. No transformation is applied to these series. You can suppress this warning using: warnings.simplefilter('ignore', category=IgnoredArgumentWarning)
============================
ForecasterAutoregMultiSeries
============================
Regressor: RandomForestRegressor(random_state=123)
Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
Transformer for series: {'item_1': StandardScaler(), 'item_2': StandardScaler()}
Transformer for exog: None
Series encoding: ordinal
Window size: 24
Series levels (names): ['item_1', 'item_2', 'item_3']
Series weights: None
Weight function included: False
Differentiation order: None
Exogenous included: False
Type of exogenous variable: None
Exogenous variables names: None
Training range: ["'item_1': ['2012-01-01', '2014-07-15']", "'item_2': ['2012-01-01', '2014-07-15']", "'item_3': ['2012-01-01', '2014-07-15']"]
Training index type: DatetimeIndex
Training index frequency: D
Regressor parameters: bootstrap: True, ccp_alpha: 0.0, criterion: squared_error, max_depth: None, max_features: 1.0, ...
fit_kwargs: {}
Creation date: 2024-05-20 14:54:20
Last fit date: 2024-05-20 14:54:24
Skforecast version: 0.12.0
Python version: 3.11.5
Forecaster id: None
Series with different lengths and different exogenous variables¶
Starting from version 0.12.0, the classes ForecasterAutoregMultiSeries and ForecasterAutoregMultiSeriesCustom allow the simultaneous modeling of time series of different lengths and using different exogenous variables. Various scenarios are possible:
If series is a pandas DataFrame and exog is a pandas Series or DataFrame, each exog is duplicated for each series. exog must have the same index as series (type, length and frequency).
If series is a pandas DataFrame and exog is a dict of pandas Series or DataFrames, each key in exog must be a column in series and the values are the exog for each series. exog must have the same index as series (type, length and frequency).
If series is a dict of pandas Series, exog must be a dict of pandas Series or DataFrames. The keys in series and exog must be the same. All series and exog must have a pandas DatetimeIndex with the same frequency.
Series type | Exog type | Requirements |
---|---|---|
DataFrame | Series or DataFrame | Same index (type, length and frequency) |
DataFrame | dict | Same index (type, length and frequency) |
dict | dict | Both pandas DatetimeIndex (same frequency) |
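For the dict/dict scenario, the requirements in the last row can be checked before fitting. The sketch below uses toy data and a hypothetical `check_series_and_exog` helper that mirrors those requirements; it does not reproduce skforecast's own validation.

```python
import pandas as pd

index = pd.date_range("2020-01-01", periods=6, freq="D")
series_dict = {
    "item_1": pd.Series(range(6), index=index, dtype=float),
    "item_2": pd.Series(range(4), index=index[2:], dtype=float),  # starts two days later
}
exog_dict = {
    name: pd.DataFrame({"month": s.index.month.to_numpy()}, index=s.index)
    for name, s in series_dict.items()
}

def check_series_and_exog(series, exog):
    """Same keys, DatetimeIndex everywhere, one shared frequency (hypothetical check)."""
    if set(series) != set(exog):
        raise ValueError("series and exog must have the same keys")
    freqs = set()
    for name in series:
        for obj in (series[name], exog[name]):
            if not isinstance(obj.index, pd.DatetimeIndex):
                raise TypeError(f"'{name}' must have a pandas DatetimeIndex")
            freqs.add(obj.index.freqstr)
    if len(freqs) != 1:
        raise ValueError("all indexes must share the same frequency")
    return True

print(check_series_and_exog(series_dict, exog_dict))
```

The two series have different lengths, which is allowed in this scenario as long as the keys match and every index is a DatetimeIndex with the same frequency.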
# Series and exog as DataFrames
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = RandomForestRegressor(random_state=123),
lags = 4,
encoding = 'ordinal'
)
X, y = forecaster.create_train_X_y(
series = data_exog_train[['item_1', 'item_2', 'item_3']],
exog = data_exog_train[['month']]
)
X.head(3)
date | lag_1 | lag_2 | lag_3 | lag_4 | _level_skforecast | month |
---|---|---|---|---|---|---|
2012-01-05 | 1.334476 | 1.979794 | 0.117764 | -5.550607 | 0 | 1 |
2012-01-06 | -0.428047 | 1.334476 | 1.979794 | 0.117764 | 0 | 1 |
2012-01-07 | -0.534430 | -0.428047 | 1.334476 | 1.979794 | 0 | 1 |
When exog is a dictionary of pandas Series or DataFrames, different exogenous variables can be used for each series or the same exogenous variable can have different values for each series.
# Illustrative example of different values for the same exogenous variable
# ==============================================================================
exog_1_item_1_train = pd.Series([1]*len(data_exog_train), name='exog_1', index=data_exog_train.index)
exog_1_item_2_train = pd.Series([10]*len(data_exog_train), name='exog_1', index=data_exog_train.index)
exog_1_item_3_train = pd.Series([100]*len(data_exog_train), name='exog_1', index=data_exog_train.index)
exog_1_item_1_test = pd.Series([1]*len(data_exog_test), name='exog_1', index=data_exog_test.index)
exog_1_item_2_test = pd.Series([10]*len(data_exog_test), name='exog_1', index=data_exog_test.index)
exog_1_item_3_test = pd.Series([100]*len(data_exog_test), name='exog_1', index=data_exog_test.index)
# Series as DataFrame and exog as dict
# ==============================================================================
exog_train_as_dict = {
'item_1': exog_1_item_1_train,
'item_2': exog_1_item_2_train,
'item_3': exog_1_item_3_train
}
forecaster = ForecasterAutoregMultiSeries(
regressor = RandomForestRegressor(random_state=123),
lags = 4,
encoding = 'ordinal'
)
X, y = forecaster.create_train_X_y(
series = data_exog_train[['item_1', 'item_2', 'item_3']],
exog = exog_train_as_dict
)
display(X.head(3))
print("")
print("Column `exog_1` has different values for each item (_level_skforecast id)")
X['exog_1'].value_counts()
date | lag_1 | lag_2 | lag_3 | lag_4 | _level_skforecast | exog_1 |
---|---|---|---|---|---|---|
2012-01-05 | 1.334476 | 1.979794 | 0.117764 | -5.550607 | 0 | 1 |
2012-01-06 | -0.428047 | 1.334476 | 1.979794 | 0.117764 | 0 | 1 |
2012-01-07 | -0.534430 | -0.428047 | 1.334476 | 1.979794 | 0 | 1 |
Column `exog_1` has different values for each item (_level_skforecast id)
exog_1
1      923
10     923
100    923
Name: count, dtype: int64
# Predict with series as DataFrame and exog as dict
# ==============================================================================
forecaster.fit(
series = data_exog_train[['item_1', 'item_2', 'item_3']],
exog = exog_train_as_dict
)
exog_pred_as_dict = {
'item_1': exog_1_item_1_test,
'item_2': exog_1_item_2_test,
'item_3': exog_1_item_3_test
}
predictions = forecaster.predict(steps=24, exog=exog_pred_as_dict)
predictions.head(3)
| | item_1 | item_2 | item_3 |
---|---|---|---|
2014-07-16 | 25.697571 | 11.326645 | 13.139564 |
2014-07-17 | 25.071002 | 10.679352 | 11.981520 |
2014-07-18 | 24.614740 | 11.483302 | 13.358778 |
💡 Tip
When using series with different lengths and different exogenous variables, it is recommended to use series and exog as dictionaries. This way, it is easier to manage the data and avoid errors.
Visit Global Forecasting Models: Time series with different lengths and different exogenous variables for more information.
Series Encoding in multi-series¶
When creating the training matrices, the ForecasterAutoregMultiSeries class encodes the series names to identify the series to which each observation belongs. Different encoding methods can be used:
'ordinal_category' (default): a single column (_level_skforecast) is created with integer values from 0 to n_series - 1. Then, the column is transformed into pandas.category dtype so that it can be used as a categorical variable.
'ordinal': a single column (_level_skforecast) is created with integer values from 0 to n_series - 1.
'onehot': a binary column is created for each series.
# Ordinal_category encoding
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = RandomForestRegressor(random_state=123),
lags = 3,
encoding = 'ordinal_category'
)
X, y = forecaster.create_train_X_y(series=data_train)
display(X.head(3))
print("")
print(X.dtypes)
print("")
print(X['_level_skforecast'].value_counts())
date | lag_1 | lag_2 | lag_3 | _level_skforecast |
---|---|---|---|---|
2012-01-04 | 1.979794 | 0.117764 | -5.550607 | 0 |
2012-01-05 | 1.334476 | 1.979794 | 0.117764 | 0 |
2012-01-06 | -0.428047 | 1.334476 | 1.979794 | 0 |
lag_1                 float64
lag_2                 float64
lag_3                 float64
_level_skforecast    category
dtype: object

_level_skforecast
0    924
1    924
2    924
Name: count, dtype: int64
# Ordinal encoding
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = RandomForestRegressor(random_state=123),
lags = 3,
encoding = 'ordinal'
)
X, y = forecaster.create_train_X_y(series=data_train)
display(X.head(3))
print("")
print(X.dtypes)
print("")
print(X['_level_skforecast'].value_counts())
date | lag_1 | lag_2 | lag_3 | _level_skforecast |
---|---|---|---|---|
2012-01-04 | 1.979794 | 0.117764 | -5.550607 | 0 |
2012-01-05 | 1.334476 | 1.979794 | 0.117764 | 0 |
2012-01-06 | -0.428047 | 1.334476 | 1.979794 | 0 |
lag_1 float64 lag_2 float64 lag_3 float64 _level_skforecast int32 dtype: object _level_skforecast 0 924 1 924 2 924 Name: count, dtype: int64
# Onehot encoding (one column per series)
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
    regressor = RandomForestRegressor(random_state=123),
    lags = 3,
    encoding = 'onehot'
)
X, y = forecaster.create_train_X_y(series=data_train)
display(X.head(3))
print("")
print(X.dtypes)
print("")
print(X['item_1'].value_counts())
date | lag_1 | lag_2 | lag_3 | item_1 | item_2 | item_3 |
---|---|---|---|---|---|---|
2012-01-04 | 1.979794 | 0.117764 | -5.550607 | 1 | 0 | 0 |
2012-01-05 | 1.334476 | 1.979794 | 0.117764 | 1 | 0 | 0 |
2012-01-06 | -0.428047 | 1.334476 | 1.979794 | 1 | 0 | 0 |

lag_1     float64
lag_2     float64
lag_3     float64
item_1      int32
item_2      int32
item_3      int32
dtype: object

item_1
0    1848
1     924
Name: count, dtype: int64
Weights in multi-series¶
Weights control the influence that each observation has on the training of the model. ForecasterAutoregMultiSeries accepts two types of weights:
series_weights controls the relative importance of each series. If a series has twice as much weight as the others, the observations of that series influence the training twice as much. The higher the weight of a series relative to the others, the more the model will focus on learning that series.
weight_func controls the relative importance of each observation according to its index value. For example, a function that assigns a lower weight to certain dates.
If both types of weights are provided, they are multiplied to create the final weights. The resulting sample_weight cannot have negative values.
Weights in multi-series.
series_weights is a dict of the form {'series_column_name': float}. If a series is used during fit and is not present in series_weights, it will have a weight of 1.
weight_func is a function that defines the individual weights of each sample based on the index.
If it is a callable, the same function will apply to all series.
If it is a dict of the form {'series_column_name': callable}, a different function can be used for each series. A weight of 1 is given to all series not present in weight_func.
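The multiplication of the two kinds of weights can be sketched with NumPy as follows. This is only an illustration of the rule described above, not the library's internal code: the helper combine_weights is hypothetical, and it covers only the case where weight_func is a single callable.

```python
import numpy as np
import pandas as pd

def custom_weights(index):
    """Weight 0 for dates in January 2013, 1 otherwise."""
    return np.where((index >= "2013-01-01") & (index <= "2013-01-31"), 0, 1)

def combine_weights(index, series_name, series_weights=None, weight_func=None):
    """Hypothetical helper: final weight = series weight * index-based weight."""
    # Series missing from series_weights default to a weight of 1
    w_series = 1.0 if series_weights is None else series_weights.get(series_name, 1.0)
    # Without a weight function, every observation gets an index weight of 1
    w_index = np.ones(len(index)) if weight_func is None else weight_func(index)
    weights = w_series * np.asarray(w_index, dtype=float)
    if (weights < 0).any():
        raise ValueError("sample_weight cannot have negative values.")
    return weights

index = pd.date_range("2013-01-30", periods=4, freq="D")
w_item_2 = combine_weights(index, "item_2", {"item_2": 2.0}, custom_weights)
w_item_1 = combine_weights(index, "item_1", {"item_2": 2.0}, custom_weights)
```

Here the two January dates receive a final weight of 0 for both series, while the February dates receive 2 for item_2 and 1 for item_1.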
# Weights in Multi-Series
# ==============================================================================
def custom_weights(index):
    """
    Return 0 if index is between '2013-01-01' and '2013-01-31', 1 otherwise.
    """
    weights = np.where(
        (index >= '2013-01-01') & (index <= '2013-01-31'),
        0,
        1
    )
    return weights

forecaster = ForecasterAutoregMultiSeries(
    regressor = RandomForestRegressor(random_state=123),
    lags = 24,
    encoding = 'ordinal',
    transformer_series = StandardScaler(),
    transformer_exog = None,
    weight_func = custom_weights,
    series_weights = {'item_1': 1., 'item_2': 2., 'item_3': 1.} # Same as {'item_2': 2.}
)
forecaster.fit(series=data_train)
forecaster.predict(steps=24).head(3)
| item_1 | item_2 | item_3 |
---|---|---|---|
2014-07-16 | 25.944148 | 11.454737 | 11.069946 |
2014-07-17 | 25.790280 | 11.181609 | 11.958527 |
2014-07-18 | 25.531977 | 11.185339 | 12.064547 |
⚠ Warning
The weight_func and series_weights arguments will be ignored if the regressor does not accept sample_weight in its fit method.
The source code of the weight_func added to the forecaster is stored in the attribute source_code_weight_func. If weight_func is a dict, it will be a dict of the form {'series_column_name': source_code_weight_func}.
# Source code weight function
# ==============================================================================
print(forecaster.source_code_weight_func)
def custom_weights(index):
    """
    Return 0 if index is between '2013-01-01' and '2013-01-31', 1 otherwise.
    """
    weights = np.where(
        (index >= '2013-01-01') & (index <= '2013-01-31'),
        0,
        1
    )
    return weights
Differentiation¶
Time series differentiation involves computing the differences between consecutive observations in the time series. When it comes to training forecasting models, differentiation offers the advantage of focusing on relative rates of change rather than directly attempting to model the absolute values. Once the predictions have been estimated, this transformation can be easily reversed to restore the values to their original scale.
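The idea can be sketched with NumPy for first-order differentiation. The series values and the predicted changes below are made up for illustration, and the snippet does not reproduce how the forecaster applies the transformation internally.

```python
import numpy as np

# Made-up series in its original scale
y = np.array([10.0, 12.0, 15.0, 14.0, 18.0])

# First-order differentiation: y[t] - y[t-1] (one observation is lost)
y_diff = np.diff(y, n=1)

# Suppose a model predicted these changes for the next 3 steps (made-up values)
predicted_diff = np.array([1.0, -2.0, 0.5])

# Reverse the transformation: accumulate the predicted changes,
# starting from the last observed value in the original scale
predicted = y[-1] + np.cumsum(predicted_diff)
```

Because the model only sees changes between consecutive values, the last observed value of each series is needed to restore the predictions to the original scale.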
💡 Tip
To learn more about modeling time series differentiation, visit our example: Modelling time series trend with tree based models.
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
    regressor = RandomForestRegressor(random_state=123),
    lags = 24,
    differentiation = 1
)
forecaster.fit(series=data_train)
forecaster
============================
ForecasterAutoregMultiSeries
============================
Regressor: RandomForestRegressor(random_state=123)
Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
Transformer for series: StandardScaler()
Transformer for exog: None
Series encoding: ordinal_category
Window size: 24
Series levels (names): ['item_1', 'item_2', 'item_3']
Series weights: None
Weight function included: False
Differentiation order: 1
Exogenous included: False
Type of exogenous variable: None
Exogenous variables names: None
Training range: ["'item_1': ['2012-01-01', '2014-07-15']", "'item_2': ['2012-01-01', '2014-07-15']", "'item_3': ['2012-01-01', '2014-07-15']"]
Training index type: DatetimeIndex
Training index frequency: D
Regressor parameters: bootstrap: True, ccp_alpha: 0.0, criterion: squared_error, max_depth: None, max_features: 1.0, ...
fit_kwargs: {}
Creation date: 2024-05-20 14:54:30
Last fit date: 2024-05-20 14:54:36
Skforecast version: 0.12.0
Python version: 3.11.5
Forecaster id: None
# Predict
# ==============================================================================
predictions = forecaster.predict(steps=24)
predictions.head(3)
| item_1 | item_2 | item_3 |
---|---|---|---|
2014-07-16 | 26.593536 | 10.191353 | 11.010808 |
2014-07-17 | 26.380795 | 9.726819 | 10.982733 |
2014-07-18 | 26.461854 | 9.983902 | 12.822251 |
Feature selection in multi-series¶
Feature selection is the process of selecting a subset of relevant features (variables, predictors) for use in model construction. Feature selection techniques are used for several reasons: to simplify models to make them easier to interpret, to reduce training time, to avoid the curse of dimensionality, to improve generalization by reducing overfitting (formally, variance reduction), and others.
Skforecast is compatible with the feature selection methods implemented in the scikit-learn library. Visit Global Forecasting Models: Feature Selection for more information.
Compare multiple metrics¶
All four functions (backtesting_forecaster_multiseries, grid_search_forecaster_multiseries, random_search_forecaster_multiseries, and bayesian_search_forecaster_multiseries) allow the calculation of multiple metrics, including custom metrics, for each forecaster configuration if a list is provided.
The best model is selected based on the first metric in the list.
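The selection rule can be pictured with a small pandas sketch. The results table and its values are made up, lower metric values are assumed to be better, and the code does not reproduce the search functions themselves.

```python
import pandas as pd

# Made-up metric values for three candidate configurations
results = pd.DataFrame({
    "params": [{"max_depth": 3}, {"max_depth": 7}, {"max_depth": 5}],
    "mean_absolute_error": [2.56, 2.35, 2.40],
    "mean_squared_error": [12.5, 10.9, 10.4],
})
metric_names = ["mean_absolute_error", "mean_squared_error"]

# Rank by the first metric only; the remaining metrics are informative
ranked = results.sort_values(by=metric_names[0]).reset_index(drop=True)
best_params = ranked.loc[0, "params"]
```

In this toy table the configuration with max_depth 7 is chosen because it has the lowest mean absolute error, even though max_depth 5 has the lowest mean squared error.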
# Grid search Multi-Series with multiple metrics
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
    regressor = RandomForestRegressor(random_state=123),
    lags = 24,
    encoding = 'ordinal'
)

def custom_metric(y_true, y_pred):
    """
    Calculate the mean absolute error using only the predicted values of the last
    3 months of the year.
    """
    mask = y_true.index.month.isin([10, 11, 12])
    metric = mean_absolute_error(y_true[mask], y_pred[mask])
    return metric

lags_grid = [24, 48]
param_grid = {
    'n_estimators': [10, 20],
    'max_depth': [3, 7]
}

results = grid_search_forecaster_multiseries(
    forecaster = forecaster,
    series = data,
    lags_grid = lags_grid,
    param_grid = param_grid,
    steps = 24,
    metric = [mean_absolute_error, custom_metric, 'mean_squared_error'],
    initial_train_size = len(data_train),
    fixed_train_size = True,
    levels = None,
    exog = None,
    refit = True,
    return_best = False,
    n_jobs = 'auto',
    verbose = False,
    show_progress = True
)
results
8 models compared for 3 level(s). Number of iterations: 8.
| levels | lags | lags_label | params | mean_absolute_error | custom_metric | mean_squared_error | max_depth | n_estimators |
---|---|---|---|---|---|---|---|---|---|
3 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 20} | 2.346023 | 2.441957 | 10.357276 | 7 | 20 |
7 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 20} | 2.368374 | 2.502645 | 10.411020 | 7 | 20 |
2 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 10} | 2.381166 | 2.486796 | 10.616470 | 7 | 10 |
6 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 7, 'n_estimators': 10} | 2.381476 | 2.499558 | 10.586728 | 7 | 10 |
1 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 20} | 2.460152 | 2.555881 | 11.323336 | 3 | 20 |
4 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 10} | 2.490342 | 2.622111 | 11.340823 | 3 | 10 |
5 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 20} | 2.490899 | 2.597174 | 11.130521 | 3 | 20 |
0 | [item_1, item_2, item_3] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_depth': 3, 'n_estimators': 10} | 2.561160 | 2.688161 | 12.524154 | 3 | 10 |