Global Forecasting Models: Dependent multi-series forecasting (Multivariate forecasting)

Univariate time series forecasting models a single time series as a linear or nonlinear combination of its lags, using past values of the series to predict its future. Global forecasting, by contrast, builds a single predictive model that considers all time series simultaneously. It attempts to capture the core patterns that govern the series, thereby mitigating the noise that each individual series might introduce. This approach is computationally efficient, easy to maintain, and can generalize more robustly across time series.

In dependent multi-series forecasting (multivariate time series), all series are modeled together in a single model, considering that each time series depends not only on its past values but also on the past values of the other series. The forecaster is expected not only to learn the information of each series separately but also to relate them. An example is the measurements made by all the sensors (flow, temperature, pressure...) installed on an industrial machine such as a compressor.

Internal Forecaster time series transformation to train a forecaster with multiple dependent time series.

Because as many training matrices are created as there are series in the dataset, the level (the series to be forecast) must be chosen; in the figure, the selected level is Series 1. To predict the next n steps, a separate model is trained for each step. This strategy is known as direct multi-step forecasting.

Diagram of direct forecasting with multiple dependent time series.
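
The direct strategy with dependent series can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not skforecast's implementation: the helper names (make_direct_training_set, fit_direct, predict_direct) are hypothetical, the regressor is plain least squares, and the two toy series are linear trends so the result is easy to check by hand.

```python
import numpy as np

def make_direct_training_set(series, level, lags, steps):
    """Hypothetical helper: lags of every series as predictors, one target column per step."""
    n = len(series[level])
    n_rows = n - lags - steps + 1
    # Predictors: for each series, lag 1, lag 2, ..., lag `lags`
    X = np.column_stack([
        values[lags - k : lags - k + n_rows]
        for values in series.values()
        for k in range(1, lags + 1)
    ])
    # Targets: the level series shifted 1, 2, ..., `steps` positions ahead
    Y = np.column_stack([
        series[level][lags + h - 1 : lags + h - 1 + n_rows]
        for h in range(1, steps + 1)
    ])
    return X, Y

def fit_direct(X, Y):
    """Fit one independent least-squares model (with intercept) per step."""
    A = np.column_stack([np.ones(len(X)), X])
    return [np.linalg.lstsq(A, Y[:, h], rcond=None)[0] for h in range(Y.shape[1])]

def predict_direct(coefs, series, lags):
    """Predict every step from the last `lags` observations of each series."""
    last_window = np.concatenate([values[::-1][:lags] for values in series.values()])
    row = np.concatenate([[1.0], last_window])
    return np.array([row @ coef for coef in coefs])

# Toy data (assumption): two noise-free linear trends
series = {
    'series_1': np.arange(50, dtype=float),
    'series_2': 100 + np.arange(50, dtype=float),
}

X, Y = make_direct_training_set(series, level='series_1', lags=3, steps=2)
coefs = fit_direct(X, Y)
predictions = predict_direct(coefs, series, lags=3)
print(X.shape, Y.shape)
print(predictions)
```

Each column of Y gets its own model, which is why the maximum number of steps has to be fixed before training.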

Using the ForecasterAutoregMultiVariate class, it is possible to easily build machine learning models for dependent multi-series forecasting.

✎ Note

Skforecast offers additional approaches to create Global Forecasting Models:

Libraries

In [1]:

# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import Ridge
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

from skforecast.ForecasterAutoregMultiVariate import ForecasterAutoregMultiVariate
from skforecast.model_selection_multiseries import backtesting_forecaster_multiseries
from skforecast.model_selection_multiseries import grid_search_forecaster_multiseries
from skforecast.model_selection_multiseries import random_search_forecaster_multiseries
from skforecast.model_selection_multiseries import bayesian_search_forecaster_multiseries

Data

In [2]:

# Data download
# ==============================================================================
url = (
       'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/'
       'data/guangyuan_air_pollution.csv'
)
data = pd.read_csv(url, sep=',')

# Data preparation
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.set_index('date')
data = data.asfreq('D')
data = data.sort_index()
data = data[['CO', 'SO2', 'PM2.5']]
data.head()

Out[2]:

	CO	SO2	PM2.5
date
2013-03-01	9600.0	204.0	181.0
2013-03-02	20198.0	674.0	633.0
2013-03-03	47195.0	1661.0	1956.0
2013-03-04	15000.0	485.0	438.0
2013-03-05	59594.0	2001.0	3388.0

In [3]:

# Split data into train-test
# ==============================================================================
end_train = '2016-05-31 23:59:00'
data_train = data.loc[:end_train, :].copy()
data_test  = data.loc[end_train:, :].copy()

print(
    f"Train dates : {data_train.index.min()} --- {data_train.index.max()}"
    f"(n={len(data_train)})"
)
print(
    f"Test dates  : {data_test.index.min()} --- {data_test.index.max()}"
    f"(n={len(data_test)})"
)

Train dates : 2013-03-01 00:00:00 --- 2016-05-31 00:00:00(n=1188)
Test dates  : 2016-06-01 00:00:00 --- 2017-02-28 00:00:00(n=273)

In [4]:

# Plot time series
# ==============================================================================
fig, axes = plt.subplots(nrows=3, ncols=1, figsize=(9, 5), sharex=True)

data_train['CO'].plot(label='train', ax=axes[0])
data_test['CO'].plot(label='test', ax=axes[0])
axes[0].set_xlabel('')
axes[0].set_ylabel('Concentration (ug/m^3)')
axes[0].set_title('CO')
axes[0].legend()

data_train['SO2'].plot(label='train', ax=axes[1])
data_test['SO2'].plot(label='test', ax=axes[1])
axes[1].set_xlabel('')
axes[1].set_ylabel('Concentration (ug/m^3)')
axes[1].set_title('SO2')

data_train['PM2.5'].plot(label='train', ax=axes[2])
data_test['PM2.5'].plot(label='test', ax=axes[2])
axes[2].set_xlabel('')
axes[2].set_ylabel('Concentration (ug/m^3)')
axes[2].set_title('PM2.5')

fig.tight_layout()
plt.show();

Train and predict ForecasterAutoregMultiVariate

When initializing the forecaster, the level to be predicted and the maximum number of steps must be indicated since a different model will be created for each step.

✎ Note

Starting with skforecast 0.9.0, ForecasterAutoregMultiVariate includes the n_jobs parameter, which enables multi-process parallelization so that the regressors for all steps can be trained simultaneously.

The benefits of parallelization depend on several factors, including the regressor used, the number of fits to be performed, and the volume of data involved. When the n_jobs parameter is set to 'auto', the level of parallelization is automatically selected based on heuristic rules that aim to choose the best option for each scenario.

For a more detailed look at parallelization, visit Parallelization in skforecast.
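
Parallel training is possible because the per-step models of the direct strategy are independent of one another. The sketch below illustrates the idea with the standard library's ThreadPoolExecutor and a least-squares fit. Both are assumptions chosen to keep the example self-contained; they are not the mechanism skforecast uses internally, and the data is synthetic.

```python
import numpy as np
from concurrent.futures import ThreadPoolExecutor

rng = np.random.default_rng(123)

# Synthetic training matrix: 200 rows, 5 predictors, 3 target columns (one per step)
X = rng.normal(size=(200, 5))
true_coefs = rng.normal(size=(5, 3))
Y = X @ true_coefs  # noise-free targets, so the fit can be checked exactly

def fit_one_step(step_index):
    """Fit the model for a single step; it needs nothing from the other steps."""
    coef, *_ = np.linalg.lstsq(X, Y[:, step_index], rcond=None)
    return coef

# Each step is submitted as an independent task
with ThreadPoolExecutor(max_workers=3) as pool:
    coefs = list(pool.map(fit_one_step, range(Y.shape[1])))

print(np.allclose(np.column_stack(coefs), true_coefs))
```

Whether this pays off in practice depends on the cost of each fit relative to the overhead of distributing the work, which is what the 'auto' setting is meant to weigh.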

In [5]:

# Create and fit forecaster MultiVariate
# ==============================================================================
forecaster = ForecasterAutoregMultiVariate(
                 regressor          = Ridge(random_state=123),
                 level              = 'CO',
                 lags               = 7,
                 steps              = 7,
                 transformer_series = StandardScaler(),
                 transformer_exog   = None,
                 weight_func        = None,
                 n_jobs             = 'auto'
             )

forecaster.fit(series=data_train)
forecaster

Out[5]:

============================= 
ForecasterAutoregMultiVariate 
============================= 
Regressor: Ridge(random_state=123) 
Lags: [1 2 3 4 5 6 7] 
Transformer for series: StandardScaler() 
Transformer for exog: None 
Weight function included: False 
Window size: 7 
Target series, level: CO 
Multivariate series (names): ['CO', 'SO2', 'PM2.5'] 
Maximum steps predicted: 7 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('2013-03-01 00:00:00'), Timestamp('2016-05-31 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: D 
Regressor parameters: {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'positive': False, 'random_state': 123, 'solver': 'auto', 'tol': 0.0001} 
fit_kwargs: {} 
Creation date: 2024-05-05 11:16:27 
Last fit date: 2024-05-05 11:16:27 
Skforecast version: 0.12.0 
Python version: 3.11.5 
Forecaster id: None

When predicting, the value of steps must be less than or equal to the value of steps defined when initializing the forecaster. Step numbering starts at 1.

If int, only steps within the range of 1 to int are predicted.
If list of int, only the steps contained in the list are predicted.
If None, as many steps are predicted as were defined at initialization.
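
The three rules above can be summarized in a small function. This is a minimal sketch: resolve_steps is a hypothetical name, not part of the skforecast API, and the error handling is an assumption about reasonable behavior.

```python
def resolve_steps(steps, max_steps):
    """Hypothetical helper: return the list of step numbers that would be predicted."""
    if steps is None:
        # Predict every step defined at initialization
        return list(range(1, max_steps + 1))
    if isinstance(steps, int):
        requested = list(range(1, steps + 1))
    else:
        requested = list(steps)
    if not requested or min(requested) < 1 or max(requested) > max_steps:
        raise ValueError(f"steps must lie between 1 and {max_steps}")
    return requested

print(resolve_steps(None, max_steps=7))    # all steps
print(resolve_steps(3, max_steps=7))       # steps 1 to 3
print(resolve_steps([1, 5], max_steps=7))  # only steps 1 and 5
```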

In [6]:

# Predict with forecaster MultiVariate
# ==============================================================================
# Predict as many steps as defined in the forecaster initialization
predictions = forecaster.predict()
display(predictions)

	CO
2016-06-01	20240.569930
2016-06-02	23299.549916
2016-06-03	22486.173088
2016-06-04	23116.366060
2016-06-05	24675.161490
2016-06-06	24012.882431
2016-06-07	24018.434477

In [7]:

# Predict only a subset of steps
predictions = forecaster.predict(steps=[1, 5])
display(predictions)

	CO
2016-06-01	20240.56993
2016-06-05	24675.16149

In [8]:

# Predict with prediction intervals
predictions = forecaster.predict_interval(random_state=9871)
display(predictions)

	CO	CO_lower_bound	CO_upper_bound
2016-06-01	20240.569930	-83.341656	48710.878030
2016-06-02	23299.549916	1380.865126	63050.816445
2016-06-03	22486.173088	-436.466405	62638.750006
2016-06-04	23116.366060	1076.799288	61177.181140
2016-06-05	24675.161490	2936.403050	64020.672668
2016-06-06	24012.882431	3011.542673	66147.818994
2016-06-07	24018.434477	816.933404	60068.890597

Backtesting MultiVariate

See the backtesting user guide to learn more about backtesting.
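
As a rough illustration of how the test period is divided when the model is not refitted and there is no gap, the sketch below computes the positional boundaries of each prediction window. The function backtesting_folds is hypothetical and only mimics the idea; the sizes (1188 training and 273 test observations, 7 steps per fold) are taken from the split above.

```python
def backtesting_folds(n_obs, initial_train_size, steps, allow_incomplete_fold=True):
    """Hypothetical helper: half-open (start, end) positions of each prediction window."""
    folds = []
    start = initial_train_size
    while start < n_obs:
        end = min(start + steps, n_obs)
        # The last window may be shorter than `steps`
        if end - start < steps and not allow_incomplete_fold:
            break
        folds.append((start, end))
        start = end
    return folds

folds = backtesting_folds(n_obs=1188 + 273, initial_train_size=1188, steps=7)
print(len(folds))           # number of prediction windows
print(folds[0], folds[-1])  # first and last window
```

With these sizes the test period splits into 39 complete windows of 7 days, so allow_incomplete_fold has no effect in this particular case.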

In [9]:

# Backtesting MultiVariate
# ==============================================================================
metrics_levels, backtest_predictions = backtesting_forecaster_multiseries(
                                           forecaster            = forecaster,
                                           series                = data,
                                           steps                 = 7,
                                           metric                = 'mean_absolute_error',
                                           initial_train_size    = len(data_train),
                                           fixed_train_size      = False,
                                           gap                   = 0,
                                           allow_incomplete_fold = True,
                                           refit                 = False,
                                           n_jobs                = 'auto',
                                           verbose               = False,
                                           show_progress         = True
                                       )

print("Backtest metrics")
display(metrics_levels)
print("")
print("Backtest predictions")
backtest_predictions.head(4)

Backtest metrics

	levels	mean_absolute_error
0	CO	14933.429818

Backtest predictions

Out[9]:

	CO
2016-06-01	20240.569930
2016-06-02	23299.549916
2016-06-03	22486.173088
2016-06-04	23116.366060

Hyperparameter tuning and lags selection MultiVariate

The grid_search_forecaster_multiseries, random_search_forecaster_multiseries and bayesian_search_forecaster_multiseries functions in the model_selection_multiseries module optimize lags and hyperparameters. As with other forecasters, each candidate is validated using backtesting; see the corresponding user guide.

The following example shows how to use random_search_forecaster_multiseries to find the best lags and model hyperparameters.
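
Conceptually, a random search draws a fixed number of hyperparameter combinations and evaluates each of them with every entry of the lags grid. The sketch below only builds the list of candidates. It uses a NumPy random generator as a stand-in sampler, which is an assumption: the combinations it draws will not match those chosen by skforecast, and unlike some samplers it may draw the same combination twice.

```python
import numpy as np

rng = np.random.default_rng(123)

lags_grid = [7, 14]
param_distributions = {
    'n_estimators': np.arange(start=10, stop=20, step=1, dtype=int),
    'max_depth': np.arange(start=3, stop=6, step=1, dtype=int),
}
n_iter = 5

# Draw n_iter hyperparameter combinations once...
sampled_params = [
    {name: int(rng.choice(values)) for name, values in param_distributions.items()}
    for _ in range(n_iter)
]

# ...and pair each of them with every lags configuration
candidates = [
    {'lags': lags, 'params': params}
    for lags in lags_grid
    for params in sampled_params
]

print(len(candidates))  # 2 lags configurations x 5 sampled combinations
```

Each candidate would then be scored by backtesting and the results sorted by the chosen metric.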

In [10]:

# Create forecaster MultiVariate
# ==============================================================================
forecaster = ForecasterAutoregMultiVariate(
                 regressor          = RandomForestRegressor(random_state=123),
                 level              = 'CO',
                 lags               = 7,
                 steps              = 7,
                 transformer_series = StandardScaler(),
                 transformer_exog   = None,
                 weight_func        = None
             )

In [11]:

# Random search MultiVariate
# ==============================================================================
lags_grid = [7, 14]
param_distributions = {
    'n_estimators': np.arange(start=10, stop=20, step=1, dtype=int),
    'max_depth': np.arange(start=3, stop=6, step=1, dtype=int)
}

results = random_search_forecaster_multiseries(
              forecaster            = forecaster,
              series                = data,
              exog                  = None,
              lags_grid             = lags_grid,
              param_distributions   = param_distributions,
              steps                 = 7,
              metric                = 'mean_absolute_error',
              initial_train_size    = len(data_train),
              fixed_train_size      = False,
              gap                   = 0,
              allow_incomplete_fold = True,
              refit                 = False,
              n_iter                = 5,
              return_best           = False,
              n_jobs                = 'auto',
              verbose               = False,
              show_progress         = True
          )

results

10 models compared for 1 level(s). Number of iterations: 10.

Out[11]:

	levels	lags	lags_label	params	mean_absolute_error	n_estimators	max_depth
7	[CO]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	{'n_estimators': 15, 'max_depth': 3}	15979.159244	15	3
5	[CO]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	{'n_estimators': 17, 'max_depth': 3}	15984.129031	17	3
9	[CO]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	{'n_estimators': 18, 'max_depth': 3}	16037.503870	18	3
8	[CO]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	{'n_estimators': 16, 'max_depth': 5}	16055.322229	16	5
6	[CO]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	{'n_estimators': 19, 'max_depth': 5}	16093.561802	19	5
3	[CO]	[1, 2, 3, 4, 5, 6, 7]	[1, 2, 3, 4, 5, 6, 7]	{'n_estimators': 16, 'max_depth': 5}	16196.115017	16	5
1	[CO]	[1, 2, 3, 4, 5, 6, 7]	[1, 2, 3, 4, 5, 6, 7]	{'n_estimators': 19, 'max_depth': 5}	16205.027497	19	5
4	[CO]	[1, 2, 3, 4, 5, 6, 7]	[1, 2, 3, 4, 5, 6, 7]	{'n_estimators': 18, 'max_depth': 3}	16218.071503	18	3
0	[CO]	[1, 2, 3, 4, 5, 6, 7]	[1, 2, 3, 4, 5, 6, 7]	{'n_estimators': 17, 'max_depth': 3}	16289.646221	17	3
2	[CO]	[1, 2, 3, 4, 5, 6, 7]	[1, 2, 3, 4, 5, 6, 7]	{'n_estimators': 15, 'max_depth': 3}	16313.292935	15	3

It is also possible to perform Bayesian optimization with Optuna using the bayesian_search_forecaster_multiseries function. For more information about this type of optimization, see the corresponding user guide.

In [12]:

# Bayesian search hyperparameters and lags with Optuna
# ==============================================================================
forecaster = ForecasterAutoregMultiVariate(
                 regressor = RandomForestRegressor(random_state=123),
                 level     = 'CO',
                 lags      = 7,
                 steps     = 7
             )

# Search space
def search_space(trial):
    search_space  = {
        'lags'            : trial.suggest_categorical('lags', [7, 14]),
        'n_estimators'    : trial.suggest_int('n_estimators', 10, 20),
        'min_samples_leaf': trial.suggest_int('min_samples_leaf', 1, 10),
        'max_features'    : trial.suggest_categorical('max_features', ['log2', 'sqrt'])
    }

    return search_space

results, best_trial = bayesian_search_forecaster_multiseries(
                          forecaster            = forecaster,
                          series                = data,
                          exog                  = None, 
                          search_space          = search_space,
                          steps                 = 7,
                          metric                = 'mean_absolute_error',
                          refit                 = False,
                          initial_train_size    = len(data_train),
                          fixed_train_size      = False,
                          n_trials              = 5,
                          random_state          = 123,
                          return_best           = False,
                          n_jobs                = 'auto',
                          verbose               = False,
                          show_progress         = True,
                          engine                = 'optuna',
                          kwargs_create_study   = {},
                          kwargs_study_optimize = {}
                      )

results.head(4)

Out[12]:

	levels	lags	params	mean_absolute_error	n_estimators	min_samples_leaf	max_features
2	[CO]	[1, 2, 3, 4, 5, 6, 7]	{'n_estimators': 14, 'min_samples_leaf': 8, 'm...	15766.047938	14	8	log2
0	[CO]	[1, 2, 3, 4, 5, 6, 7]	{'n_estimators': 12, 'min_samples_leaf': 6, 'm...	15942.583565	12	6	log2
3	[CO]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	{'n_estimators': 16, 'min_samples_leaf': 9, 'm...	16035.159683	16	9	log2
4	[CO]	[1, 2, 3, 4, 5, 6, 7]	{'n_estimators': 13, 'min_samples_leaf': 3, 'm...	16254.475533	13	3	sqrt

best_trial contains information about the trial that achieved the best results. See more in the Study class documentation.

In [13]:

# Optuna best trial in the study
# ==============================================================================
best_trial

Out[13]:

FrozenTrial(number=2, state=1, values=[15766.047938256981], datetime_start=datetime.datetime(2024, 5, 5, 11, 16, 42, 839891), datetime_complete=datetime.datetime(2024, 5, 5, 11, 16, 43, 368084), params={'lags': 7, 'n_estimators': 14, 'min_samples_leaf': 8, 'max_features': 'log2'}, user_attrs={}, system_attrs={}, intermediate_values={}, distributions={'lags': CategoricalDistribution(choices=(7, 14)), 'n_estimators': IntDistribution(high=20, log=False, low=10, step=1), 'min_samples_leaf': IntDistribution(high=10, log=False, low=1, step=1), 'max_features': CategoricalDistribution(choices=('log2', 'sqrt'))}, trial_id=2, value=None)

Different lags for each time series

If a dict is passed to the lags argument, it allows setting different lags for each of the series. The keys of the dictionary must be the names of the series to be used during training.

In [14]:

# Create and fit forecaster MultiVariate Custom lags
# ==============================================================================
forecaster = ForecasterAutoregMultiVariate(
                 regressor          = Ridge(random_state=123),
                 level              = 'CO',
                 lags               = {'CO': 7, 'SO2': [1, 7], 'PM2.5': 2},
                 steps              = 7,
                 transformer_series = StandardScaler(),
                 transformer_exog   = None,
                 weight_func        = None
             )

forecaster.fit(series=data_train)

# Predict
# ==============================================================================
predictions = forecaster.predict(steps=7)
display(predictions)

	CO
2016-06-01	19422.784860
2016-06-02	21855.508044
2016-06-03	21909.797797
2016-06-04	23195.070914
2016-06-05	23291.297130
2016-06-06	22478.641763
2016-06-07	23393.503658

If None is passed as the value for any key of the lags dict, that series will not be used to create the X training matrix.

In this example, no lags are created for the 'CO' series, but since it is the level of the forecaster, the 'CO' column will be used to create the y training matrix.
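
The way a lags specification translates into predictor columns can be sketched as follows. Both helpers are hypothetical and are not part of skforecast; the only convention borrowed from the library is the `<series>_lag_<k>` column naming that appears in its training matrices.

```python
def expand_lags(lags, series_names):
    """Hypothetical helper: turn an int, list, or dict of lags into explicit lag lists."""
    if not isinstance(lags, dict):
        # A single specification is shared by every series
        lags = {name: lags for name in series_names}
    expanded = {}
    for name in series_names:
        value = lags[name]
        if value is None:
            expanded[name] = []                          # series excluded from X
        elif isinstance(value, int):
            expanded[name] = list(range(1, value + 1))   # lags 1..value
        else:
            expanded[name] = sorted(value)               # explicit lags
    return expanded

def lag_column_names(expanded):
    """Column names of the predictor matrix, one per series and lag."""
    return [f"{name}_lag_{lag}" for name, lags in expanded.items() for lag in lags]

expanded = expand_lags({'CO': None, 'SO2': [1, 7], 'PM2.5': 2}, ['CO', 'SO2', 'PM2.5'])
columns = lag_column_names(expanded)
# The largest lag determines how many past observations each training row needs
window_size = max(lag for lags in expanded.values() for lag in lags)

print(columns)
print(window_size)
```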

In [15]:

# Create and fit forecaster MultiVariate Custom lags with None
# ==============================================================================
forecaster = ForecasterAutoregMultiVariate(
                 regressor          = Ridge(random_state=123),
                 level              = 'CO',
                 lags               = {'CO': None, 'SO2': [1, 7], 'PM2.5': 2},
                 steps              = 7,
                 transformer_series = StandardScaler(),
                 transformer_exog   = None,
                 weight_func        = None
             )

forecaster.fit(series=data_train)

# Predict
# ==============================================================================
predictions = forecaster.predict(steps=7)
display(predictions)

	CO
2016-06-01	22545.171026
2016-06-02	25259.218459
2016-06-03	25729.171621
2016-06-04	26535.545252
2016-06-05	26926.962008
2016-06-06	27174.268243
2016-06-07	27070.500361

It is possible to use the create_train_X_y method to generate the matrices that the forecaster is using to train the model. This approach enables gaining insight into the specific lags that have been created.

In [16]:

# Extract training matrix
# ==============================================================================
X, y = forecaster.create_train_X_y(series=data_train)

# X and y to train model for step 1
X_1, y_1 = forecaster.filter_train_X_y_for_step(
               step    = 1,
               X_train = X,
               y_train = y,
           )

X_1.head(4)

Out[16]:

	SO2_lag_1	SO2_lag_7	PM2.5_lag_1	PM2.5_lag_2
date
2013-03-08	3.727899	-0.495413	2.427373	1.806709
2013-03-09	2.102055	0.421426	1.870726	2.427373
2013-03-10	0.825226	2.346788	-0.444267	1.870726
2013-03-11	0.185389	0.052740	-0.702775	-0.444267

In [17]:

# Target values (y) to train the model for step 1
# ==============================================================================
y_1.head(4)

Out[17]:

date
2013-03-08    2.153622
2013-03-09   -0.221359
2013-03-10   -0.531358
2013-03-11    0.496391
Freq: D, Name: CO_step_1, dtype: float64

Exogenous variables in MultiVariate¶

Exogenous variables are predictors that are independent of the model being used for forecasting, and their future values must be known in order to include them in the prediction process.

In ForecasterAutoregMultiVariate, as in the other forecasters, exogenous variables can be included as predictors using the exog argument.

To learn more about exogenous variables in skforecast visit the exogenous variables user guide.
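Because the future values of an exogenous variable must be known at prediction time, calendar features are a common choice. The following is a minimal sketch using only pandas; the helper `calendar_features` and the names `exog_train` and `exog_future` are hypothetical, and the dates simply mirror the training range used in this guide.

```python
import pandas as pd


def calendar_features(index):
    """Build calendar predictors, whose future values are always known."""
    return pd.DataFrame(
        {
            "month": index.month.to_numpy(),
            "day_of_week": index.dayofweek.to_numpy(),
        },
        index=index,
    )


# Training period and the 7 days to be forecast (assumed dates)
train_index = pd.date_range("2013-03-01", "2016-05-31", freq="D")
future_index = pd.date_range("2016-06-01", periods=7, freq="D")

exog_train = calendar_features(train_index)    # aligned with the training series
exog_future = calendar_features(future_index)  # covers the forecast horizon

print(exog_future)
```

Frames like these would then be supplied through the exog argument: the training one when fitting and the future one when predicting.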

Scikit-learn transformers in MultiVariate¶

By default, the ForecasterAutoregMultiVariate class uses the scikit-learn StandardScaler transformer to scale the data. This transformer is applied to all series. However, it is possible to use different transformers for each series or not to apply any transformation at all:

If transformer_series is a transformer, the same transformation will be applied to all series.
If transformer_series is a dict, a different transformation can be set for each series. Series not present in the dict will not have any transformation applied to them, and a warning is issued.

Learn more about using scikit-learn transformers with skforecast.
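The dict behaviour can be illustrated outside the forecaster. This is only a sketch of the idea with scikit-learn and synthetic data, not skforecast's internal code: series listed in the dict are scaled with their own transformer, and series missing from it are left untouched.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the three series (values are made up)
rng = np.random.default_rng(123)
series = pd.DataFrame(
    {
        "CO": rng.normal(1000, 200, 50),
        "SO2": rng.normal(20, 5, 50),
        "PM2.5": rng.normal(80, 30, 50),
    }
)

# One transformer per series; 'PM2.5' is deliberately left out
transformer_series = {"CO": StandardScaler(), "SO2": StandardScaler()}

scaled = series.copy()
for name in series.columns:
    transformer = transformer_series.get(name)
    if transformer is None:
        continue  # no transformer for this series: keep original values
    scaled[name] = transformer.fit_transform(series[[name]]).ravel()

print(scaled.agg(["mean", "std"]).round(2))
```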

In [18]:

# Transformers in MultiVariate
# ==============================================================================
forecaster = ForecasterAutoregMultiVariate(
                 regressor          = Ridge(random_state=123),
                 level              = 'CO',
                 lags               = 7,
                 steps              = 7,
                 transformer_series = {'CO': StandardScaler(), 'SO2': StandardScaler()},
                 transformer_exog   = None,
                 weight_func        = None,
             )

forecaster.fit(series=data_train)
forecaster

c:\Users\jaesc2\Miniconda3\envs\skforecast_py11\Lib\site-packages\skforecast\utils\utils.py:233: IgnoredArgumentWarning: {'PM2.5'} not present in `transformer_series`. No transformation is applied to these series. 
 You can suppress this warning using: warnings.simplefilter('ignore', category=IgnoredArgumentWarning)
  warnings.warn(

Out[18]:

============================= 
ForecasterAutoregMultiVariate 
============================= 
Regressor: Ridge(random_state=123) 
Lags: [1 2 3 4 5 6 7] 
Transformer for series: {'CO': StandardScaler(), 'SO2': StandardScaler()} 
Transformer for exog: None 
Weight function included: False 
Window size: 7 
Target series, level: CO 
Multivariate series (names): ['CO', 'SO2', 'PM2.5'] 
Maximum steps predicted: 7 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('2013-03-01 00:00:00'), Timestamp('2016-05-31 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: D 
Regressor parameters: {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'positive': False, 'random_state': 123, 'solver': 'auto', 'tol': 0.0001} 
fit_kwargs: {} 
Creation date: 2024-05-05 11:16:44 
Last fit date: 2024-05-05 11:16:44 
Skforecast version: 0.12.0 
Python version: 3.11.5 
Forecaster id: None

Weights in MultiVariate¶

The weights are used to control the influence that each observation has on the training of the model.

Learn more about weighted time series forecasting with skforecast.
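As a rough illustration of what observation weights do, the sketch below passes the output of a weight function to the sample_weight argument of a scikit-learn Ridge regressor. The data is synthetic and the date window is an assumption chosen so that some observations actually receive weight 0; with a weight of 0, an observation should have no influence on the fit.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge


def custom_weights(index):
    """Return 0 for dates in March 2013 (assumed window), 1 otherwise."""
    return np.where((index >= "2013-03-01") & (index <= "2013-03-31"), 0, 1)


# Synthetic training data indexed by date
index = pd.date_range("2013-03-01", periods=90, freq="D")
rng = np.random.default_rng(0)
X = rng.normal(size=(90, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=90)

weights = custom_weights(index)

# Fit with weights, and fit again after dropping the zero-weight rows
weighted = Ridge(alpha=1.0).fit(X, y, sample_weight=weights)
subset = Ridge(alpha=1.0).fit(X[weights == 1], y[weights == 1])

print(weighted.coef_)
print(subset.coef_)
```

Both fits should yield essentially the same coefficients, which is a quick way to check that a weight function excludes the intended period.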

In [19]:

# Weights in MultiVariate
# ==============================================================================
def custom_weights(index):
    """
    Return 0 if index is between '2013-01-01' and '2013-01-31', 1 otherwise.
    """
    weights = np.where(
                  (index >= '2013-01-01') & (index <= '2013-01-31'),
                   0,
                   1
              )
    
    return weights

forecaster = ForecasterAutoregMultiVariate(
                 regressor          = Ridge(random_state=123),
                 level              = 'CO',
                 lags               = 7,
                 steps              = 7,
                 transformer_series = StandardScaler(),
                 transformer_exog   = None,
                 weight_func        = custom_weights
             )

forecaster.fit(series=data_train)
forecaster.predict(steps=7).head(3)

Out[19]:

	CO
2016-06-01	20240.569930
2016-06-02	23299.549916
2016-06-03	22486.173088

⚠ Warning

The weight_func argument will be ignored if the regressor does not accept sample_weight in its fit method.
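One way to check in advance whether a regressor can use weights is to inspect the signature of its fit method. The helper `accepts_sample_weight` below is hypothetical and is not part of skforecast; it is only a sketch of the check.

```python
import inspect

from sklearn.linear_model import Ridge
from sklearn.neighbors import KNeighborsRegressor


def accepts_sample_weight(regressor):
    """True if the regressor's fit method has a sample_weight parameter."""
    return "sample_weight" in inspect.signature(regressor.fit).parameters


for regressor in (Ridge(), KNeighborsRegressor()):
    print(type(regressor).__name__, accepts_sample_weight(regressor))
```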

The source code of the weight_func added to the forecaster is stored in the attribute source_code_weight_func.

In [20]:

# Source code weight function
# ==============================================================================
print(forecaster.source_code_weight_func)

def custom_weights(index):
    """
    Return 0 if index is between '2013-01-01' and '2013-01-31', 1 otherwise.
    """
    weights = np.where(
                  (index >= '2013-01-01') & (index <= '2013-01-31'),
                   0,
                   1
              )
    
    return weights

Compare multiple metrics¶

All four functions (backtesting_forecaster_multiseries, grid_search_forecaster_multiseries, random_search_forecaster_multiseries, and bayesian_search_forecaster_multiseries) can calculate multiple metrics for each forecaster configuration if a list is passed to the metric argument. This list can include custom metrics, and the best model is selected based on the first metric in the list.
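The ranking rule can be sketched independently of skforecast. In the toy example below, two hypothetical sets of predictions (`small_error` and `large_error`, both synthetic) are scored with a list of metrics that includes a custom one, and the results are sorted by the first metric in the list.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import mean_absolute_error, mean_squared_error


def custom_metric(y_true, y_pred):
    """Mean absolute error over the last 3 months of the year only."""
    mask = y_true.index.month.isin([10, 11, 12])
    return mean_absolute_error(y_true[mask], y_pred[mask])


# Synthetic observations and two candidate sets of predictions
index = pd.date_range("2016-06-01", "2016-12-31", freq="D")
rng = np.random.default_rng(1)
y_true = pd.Series(rng.normal(100, 10, len(index)), index=index)
candidates = {
    "small_error": y_true + rng.normal(0, 1, len(index)),
    "large_error": y_true + rng.normal(0, 5, len(index)),
}

metrics = [mean_absolute_error, custom_metric, mean_squared_error]

# One row per candidate, one column per metric, ranked by the first metric
results = pd.DataFrame(
    {
        metric.__name__: {
            name: metric(y_true, y_pred) for name, y_pred in candidates.items()
        }
        for metric in metrics
    }
).sort_values(metrics[0].__name__)

print(results)
```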

In [21]:

# Grid search MultiVariate with multiple metrics
# ==============================================================================
forecaster = ForecasterAutoregMultiVariate(
                 regressor          = Ridge(random_state=123),
                 level              = 'CO',
                 lags               = 7,
                 steps              = 7
             )    

def custom_metric(y_true, y_pred):
    """
    Calculate the mean absolute error using only the predicted values of the last
    3 months of the year.
    """
    mask = y_true.index.month.isin([10, 11, 12])
    metric = mean_absolute_error(y_true[mask], y_pred[mask])
    
    return metric

lags_grid = [7, 14]
param_grid = {'alpha': [0.01, 0.1, 1]}

results = grid_search_forecaster_multiseries(
              forecaster          = forecaster,
              series              = data,
              exog                = None,
              steps               = 7,
              metric              = [mean_absolute_error, custom_metric, 'mean_squared_error'],
              lags_grid           = lags_grid,
              param_grid          = param_grid,
              initial_train_size  = len(data_train),
              refit               = False,
              fixed_train_size    = False,
              return_best         = True,
              n_jobs              = 'auto',
              verbose             = False,
              show_progress       = True
          )

results

6 models compared for 1 level(s). Number of iterations: 6.

`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [1 2 3 4 5 6 7] 
  Parameters: {'alpha': 0.01}
  Backtesting metric: 14931.144696869993
  Levels: ['CO']

Out[21]:

	levels	lags	lags_label	params	mean_absolute_error	custom_metric	mean_squared_error	alpha
0	[CO]	[1, 2, 3, 4, 5, 6, 7]	[1, 2, 3, 4, 5, 6, 7]	{'alpha': 0.01}	14931.144697	21210.921902	5.519128e+08	0.01
1	[CO]	[1, 2, 3, 4, 5, 6, 7]	[1, 2, 3, 4, 5, 6, 7]	{'alpha': 0.1}	14931.353969	21211.328808	5.519430e+08	0.10
2	[CO]	[1, 2, 3, 4, 5, 6, 7]	[1, 2, 3, 4, 5, 6, 7]	{'alpha': 1}	14933.429818	21215.369713	5.522433e+08	1.00
3	[CO]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	{'alpha': 0.01}	15019.809054	20997.066088	5.649139e+08	0.01
4	[CO]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	{'alpha': 0.1}	15020.050002	20997.676368	5.649299e+08	0.10
5	[CO]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]	{'alpha': 1}	15022.420642	21003.690818	5.650896e+08	1.00

Feature importances¶

Since ForecasterAutoregMultiVariate fits one model per step, it is necessary to specify the step whose model the feature importances are retrieved from.

In [22]:

# Feature importances for step 1
# ==============================================================================
forecaster.get_feature_importances(step=1)

Out[22]:

	feature	importance
0	CO_lag_1	0.655431
6	CO_lag_7	0.115934
7	SO2_lag_1	0.109505
2	CO_lag_3	0.097844
4	CO_lag_5	0.067529
12	SO2_lag_6	0.060958
3	CO_lag_4	0.055881
19	PM2.5_lag_6	0.038826
11	SO2_lag_5	0.030440
16	PM2.5_lag_3	0.026527
10	SO2_lag_4	0.000420
8	SO2_lag_2	-0.022047
20	PM2.5_lag_7	-0.028511
13	SO2_lag_7	-0.036200
17	PM2.5_lag_4	-0.054530
18	PM2.5_lag_5	-0.058876
15	PM2.5_lag_2	-0.064166
14	PM2.5_lag_1	-0.071057
1	CO_lag_2	-0.072085
9	SO2_lag_3	-0.074830
5	CO_lag_6	-0.122197
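With a linear regressor such as Ridge, the reported importances correspond to the fitted coefficients, which is why negative values appear. The sketch below mimics the one-model-per-step idea with scikit-learn and synthetic data; the `models` dict and the `feature_importances` helper are hypothetical stand-ins, not the forecaster's internals.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge

# Synthetic lag features (made-up data, four predictors)
rng = np.random.default_rng(42)
feature_names = ["CO_lag_1", "CO_lag_2", "SO2_lag_1", "SO2_lag_2"]
X = pd.DataFrame(rng.normal(size=(200, 4)), columns=feature_names)

# A different target, and therefore a different model, for each step
targets = {
    1: 0.8 * X["CO_lag_1"] - 0.3 * X["SO2_lag_1"] + rng.normal(scale=0.05, size=200),
    2: 0.5 * X["CO_lag_1"] + 0.2 * X["CO_lag_2"] + rng.normal(scale=0.05, size=200),
}
models = {step: Ridge(alpha=1.0).fit(X, y_step) for step, y_step in targets.items()}


def feature_importances(step):
    """Coefficients of the model trained for `step`, sorted descending."""
    return pd.DataFrame(
        {"feature": feature_names, "importance": models[step].coef_}
    ).sort_values("importance", ascending=False)


print(feature_importances(step=1))
print(feature_importances(step=2))
```

Each step has its own set of coefficients, so the ranking of features can change from one step to the next.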

Extract training matrix¶

Two steps are needed: one to create the whole training matrix and a second to subset the data needed for each model (step).

In [23]:

# Extract training matrix
# ==============================================================================
X, y = forecaster.create_train_X_y(series=data_train)

# X and y to train model for step 1
X_1, y_1 = forecaster.filter_train_X_y_for_step(
               step          = 1,
               X_train       = X,
               y_train       = y,
               remove_suffix = False
           )

X_1.head(4)

Out[23]:

	CO_lag_1	CO_lag_2	CO_lag_3	CO_lag_4	CO_lag_5	CO_lag_6	CO_lag_7	SO2_lag_1	SO2_lag_2	SO2_lag_3	...	SO2_lag_5	SO2_lag_6	SO2_lag_7	PM2.5_lag_1	PM2.5_lag_2	PM2.5_lag_3	PM2.5_lag_4	PM2.5_lag_5	PM2.5_lag_6	PM2.5_lag_7
date
2013-03-08	2.889096	2.290853	1.329437	-0.646567	0.780026	-0.416238	-0.885846	3.727899	5.060241	3.010033	...	2.346788	0.421426	-0.495413	2.427373	1.806709	0.864128	-0.934458	-0.008948	-0.815568	-1.091148
2013-03-09	2.153622	2.889096	2.290853	1.329437	-0.646567	0.780026	-0.416238	2.102055	3.727899	5.060241	...	0.052740	2.346788	0.421426	1.870726	2.427373	1.806709	0.864128	-0.934458	-0.008948	-0.815568
2013-03-10	-0.221359	2.153622	2.889096	2.290853	1.329437	-0.646567	0.780026	0.825226	2.102055	3.727899	...	3.010033	0.052740	2.346788	-0.444267	1.870726	2.427373	1.806709	0.864128	-0.934458	-0.008948
2013-03-11	-0.531358	-0.221359	2.153622	2.889096	2.290853	1.329437	-0.646567	0.185389	0.825226	2.102055	...	5.060241	3.010033	0.052740	-0.702775	-0.444267	1.870726	2.427373	1.806709	0.864128	-0.934458

4 rows × 21 columns

In [24]:

y_1.head(4)

Out[24]:

date
2013-03-08    2.153622
2013-03-09   -0.221359
2013-03-10   -0.531358
2013-03-11    0.496391
Freq: D, Name: CO_step_1, dtype: float64
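To make the structure of these matrices more concrete, the sketch below rebuilds a similar layout with pandas on a tiny synthetic dataset. The helpers `create_lag_matrix` and `create_step_targets` are hypothetical and simplified; skforecast's own implementation may differ, in particular in how the rows of later steps are indexed.

```python
import numpy as np
import pandas as pd


def create_lag_matrix(series, lags):
    """One column per (series, lag) pair: the value `lag` rows earlier."""
    return pd.DataFrame(
        {
            f"{name}_lag_{lag}": series[name].shift(lag)
            for name in series.columns
            for lag in range(1, lags + 1)
        }
    )


def create_step_targets(series, level, steps):
    """One target column per step ahead for the chosen level."""
    return pd.DataFrame(
        {
            f"{level}_step_{step}": series[level].shift(-(step - 1))
            for step in range(1, steps + 1)
        }
    )


# Tiny synthetic dataset with two series
index = pd.date_range("2013-03-01", periods=12, freq="D")
series = pd.DataFrame(
    {"CO": np.arange(12.0), "SO2": np.arange(12.0) * 10}, index=index
)

X = create_lag_matrix(series, lags=3)
y = create_step_targets(series, level="CO", steps=2)

# Keep only rows where every lag and every target is available
valid = X.notna().all(axis=1) & y.notna().all(axis=1)
X_all, y_all = X[valid], y[valid]

# Data used by the model for step 1: all lags, first target column
X_1, y_1 = X_all, y_all["CO_step_1"]

print(X_1.head(3))
print(y_1.head(3))
```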