Global Forecasting Models: Time series with different lengths and different exogenous variables¶
When faced with a multi-series forecasting problem, it is common for the series to have varying lengths because data recording started at different times. To address this scenario, the classes ForecasterAutoregMultiSeries and ForecasterAutoregMultiSeriesCustom allow time series of different lengths to be modeled simultaneously, each with its own exogenous variables.
- When the modeled series have different lengths, they must be stored in a Python dictionary. The keys of the dictionary are the names of the series and the values are the series themselves. All series must be of type `pandas.Series` and have a `pandas.DatetimeIndex` with the same frequency.
Series values | Allowed
---|---
`[NaN, NaN, NaN, NaN, 4, 5, 6, 7, 8, 9]` | ✔️
`[0, 1, 2, 3, 4, 5, 6, 7, 8, NaN]` | ✔️
`[0, 1, 2, 3, 4, NaN, 6, 7, 8, 9]` | ✔️
`[NaN, NaN, 2, 3, 4, NaN, 6, 7, 8, 9]` | ✔️
- When different exogenous variables are used for each series, or when the exogenous variables are the same but take different values in each series, they must also be stored in a dictionary. The keys of the dictionary are the names of the series and the values are the exogenous variables themselves. All exogenous variables must be of type `pandas.DataFrame` or `pandas.Series` and have a `pandas.DatetimeIndex` with the same frequency. A minimal sketch of both dictionaries is shown after this list.
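As a reference, this is what both dictionaries look like when built by hand. All names and values here are illustrative, not part of the dataset used below.

# Minimal sketch of `series_dict` and `exog_dict` (illustrative data)
# ==============================================================================
import pandas as pd

series_dict_example = {
    'series_a': pd.Series(
                    [10.0, 12.0, 11.0, 13.0],
                    index=pd.date_range('2016-01-01', periods=4, freq='D')
                ),
    'series_b': pd.Series(  # shorter series that starts two days later
                    [100.0, 98.0],
                    index=pd.date_range('2016-01-03', periods=2, freq='D')
                ),
}

exog_dict_example = {
    'series_a': pd.DataFrame(
                    {'temperature': [6.4, 6.3, 6.5, 6.7]},
                    index=pd.date_range('2016-01-01', periods=4, freq='D')
                ),
    'series_b': pd.Series(  # a single exogenous variable as pandas.Series is also valid
                    [4.0, 4.5],
                    index=pd.date_range('2016-01-03', periods=2, freq='D'),
                    name='wind_speed'
                ),
}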
Libraries¶
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor
from skforecast.plot import set_dark_theme
from skforecast.preprocessing import series_long_to_dict
from skforecast.preprocessing import exog_long_to_dict
from skforecast.ForecasterAutoregMultiSeries import ForecasterAutoregMultiSeries
from skforecast.model_selection_multiseries import backtesting_forecaster_multiseries
from skforecast.model_selection_multiseries import bayesian_search_forecaster_multiseries
Data¶
The data for this example is stored in "long format" in a single `DataFrame`. The `series_id` column identifies the series to which each observation belongs, the `timestamp` column contains the date of the observation, and the `value` column contains the value of the series at that date. Each time series has a different length.

The exogenous variables are stored in a separate `DataFrame`, also in "long format". The `series_id` column identifies the series to which each observation belongs, the `timestamp` column contains the date of the observation, and the remaining columns contain the values of the exogenous variables at that date.
# Load time series of multiple lengths and exogenous variables
# ==============================================================================
series = pd.read_csv(
'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast-datasets/main/data/demo_multi_series.csv'
)
exog = pd.read_csv(
'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast-datasets/main/data/demo_multi_series_exog.csv'
)
series['timestamp'] = pd.to_datetime(series['timestamp'])
exog['timestamp'] = pd.to_datetime(exog['timestamp'])
display(series.head())
print("")
display(exog.head())
 | series_id | timestamp | value
---|---|---|---
0 | id_1000 | 2016-01-01 | 1012.500694
1 | id_1000 | 2016-01-02 | 1158.500099
2 | id_1000 | 2016-01-03 | 983.000099
3 | id_1000 | 2016-01-04 | 1675.750496
4 | id_1000 | 2016-01-05 | 1586.250694
 | series_id | timestamp | sin_day_of_week | cos_day_of_week | air_temperature | wind_speed
---|---|---|---|---|---|---
0 | id_1000 | 2016-01-01 | -0.433884 | -0.900969 | 6.416639 | 4.040115
1 | id_1000 | 2016-01-02 | -0.974928 | -0.222521 | 6.366474 | 4.530395
2 | id_1000 | 2016-01-03 | -0.781831 | 0.623490 | 6.555272 | 3.273064
3 | id_1000 | 2016-01-04 | 0.000000 | 1.000000 | 6.704778 | 4.865404
4 | id_1000 | 2016-01-05 | 0.781831 | 0.623490 | 2.392998 | 5.228913
When series have different lengths, the data must be transformed into a dictionary. The keys of the dictionary are the names of the series and the values are the series themselves. The `series_long_to_dict` function does this: it takes the `DataFrame` in "long format" and returns a `dict` of pandas `Series`.

Similarly, when the exogenous variables differ (in values or in variables) between series, the data must also be transformed into a dictionary. The `exog_long_to_dict` function takes the `DataFrame` in "long format" and returns a `dict` of exogenous variables (pandas `Series` or `DataFrame`).
# Transform series and exog to dictionaries
# ==============================================================================
series_dict = series_long_to_dict(
data = series,
series_id = 'series_id',
index = 'timestamp',
values = 'value',
freq = 'D'
)
exog_dict = exog_long_to_dict(
data = exog,
series_id = 'series_id',
index = 'timestamp',
freq = 'D'
)
Some exogenous variables are dropped for series 'id_1000' and 'id_1003' to illustrate that different exogenous variables can be used for each series.
# Drop some exogenous variables for series 'id_1000' and 'id_1003'
# ==============================================================================
exog_dict['id_1000'] = exog_dict['id_1000'].drop(columns=['air_temperature', 'wind_speed'])
exog_dict['id_1003'] = exog_dict['id_1003'].drop(columns=['cos_day_of_week'])
# Partition data in train and test
# ==============================================================================
end_train = '2016-07-31 23:59:00'
series_dict_train = {k: v.loc[:end_train] for k, v in series_dict.items()}
exog_dict_train   = {k: v.loc[:end_train] for k, v in exog_dict.items()}
series_dict_test  = {k: v.loc[end_train:] for k, v in series_dict.items()}
exog_dict_test    = {k: v.loc[end_train:] for k, v in exog_dict.items()}
# Plot series
# ==============================================================================
set_dark_theme()
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
fig, axs = plt.subplots(5, 1, figsize=(8, 4), sharex=True)
for i, s in enumerate(series_dict.values()):
    axs[i].plot(s, label=s.name, color=colors[i])
    axs[i].legend(loc='upper right', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)
    axs[i].axvline(pd.to_datetime(end_train), color='white', linestyle='--', linewidth=1)  # End train
fig.suptitle('Series in `series_dict`', fontsize=15)
plt.tight_layout()
# Description of each partition
# ==============================================================================
for k in series_dict.keys():
    print(f"{k}:")
    try:
        print(
            f"\tTrain: len={len(series_dict_train[k])}, {series_dict_train[k].index[0]}"
            f" --- {series_dict_train[k].index[-1]}"
        )
    except (KeyError, IndexError):
        print("\tTrain: len=0")
    try:
        print(
            f"\tTest : len={len(series_dict_test[k])}, {series_dict_test[k].index[0]}"
            f" --- {series_dict_test[k].index[-1]}"
        )
    except (KeyError, IndexError):
        print("\tTest : len=0")
id_1000:
	Train: len=213, 2016-01-01 00:00:00 --- 2016-07-31 00:00:00
	Test : len=153, 2016-08-01 00:00:00 --- 2016-12-31 00:00:00
id_1001:
	Train: len=30, 2016-07-02 00:00:00 --- 2016-07-31 00:00:00
	Test : len=153, 2016-08-01 00:00:00 --- 2016-12-31 00:00:00
id_1002:
	Train: len=183, 2016-01-01 00:00:00 --- 2016-07-01 00:00:00
	Test : len=0
id_1003:
	Train: len=213, 2016-01-01 00:00:00 --- 2016-07-31 00:00:00
	Test : len=153, 2016-08-01 00:00:00 --- 2016-12-31 00:00:00
id_1004:
	Train: len=91, 2016-05-02 00:00:00 --- 2016-07-31 00:00:00
	Test : len=31, 2016-08-01 00:00:00 --- 2016-08-31 00:00:00
# Exogenous variables for each series
# ==============================================================================
for k in series_dict.keys():
    print(f"{k}:")
    try:
        print(f"\t{exog_dict[k].columns.to_list()}")
    except (KeyError, AttributeError):
        print("\tNo exogenous variables")
id_1000:
	['sin_day_of_week', 'cos_day_of_week']
id_1001:
	['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
id_1002:
	['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
id_1003:
	['sin_day_of_week', 'air_temperature', 'wind_speed']
id_1004:
	['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
Train and predict¶
The `fit` method is used to train the model. It receives the dictionary of series and the dictionary of exogenous variables, where the keys of each dictionary are the names of the series.
# Fit forecaster
# ==============================================================================
regressor = LGBMRegressor(random_state=123, verbose=-1, max_depth=5)
forecaster = ForecasterAutoregMultiSeries(
regressor = regressor,
lags = 14,
encoding = "ordinal",
dropna_from_series = False
)
forecaster.fit(series=series_dict_train, exog=exog_dict_train, suppress_warnings=True)
forecaster
============================
ForecasterAutoregMultiSeries
============================
Regressor: LGBMRegressor(max_depth=5, random_state=123, verbose=-1)
Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14]
Transformer for series: StandardScaler()
Transformer for exog: None
Series encoding: ordinal
Window size: 14
Series levels (names): ['id_1000', 'id_1001', 'id_1002', 'id_1003', 'id_1004']
Series weights: None
Weight function included: False
Differentiation order: None
Exogenous included: True
Type of exogenous variable: <class 'dict'>
Exogenous variables names: ['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
Training range: 'id_1000': ['2016-01-01', '2016-07-31'], 'id_1001': ['2016-07-02', '2016-07-31'], 'id_1002': ['2016-01-01', '2016-07-01'], ...
Training index type: DatetimeIndex
Training index frequency: D
Regressor parameters: boosting_type: gbdt, class_weight: None, colsample_bytree: 1.0, importance_type: split, learning_rate: 0.1, ...
fit_kwargs: {}
Creation date: 2024-05-08 08:35:19
Last fit date: 2024-05-08 08:35:19
Skforecast version: 0.12.0
Python version: 3.11.5
Forecaster id: None
Only series whose last window of data ends at the same datetime index can be predicted together. If `levels = None`, series that do not reach the maximum index are excluded from prediction. In this example, series `'id_1002'` is excluded because it does not reach the maximum index.
# Predict
# ==============================================================================
predictions = forecaster.predict(steps=5, exog=exog_dict_test, suppress_warnings=True)
predictions
 | id_1000 | id_1001 | id_1003 | id_1004
---|---|---|---|---
2016-08-01 | 1293.204467 | 2524.395199 | 2709.278962 | 7832.637393
2016-08-02 | 1406.194447 | 2352.220339 | 2374.673568 | 8211.355533
2016-08-03 | 1373.037098 | 2448.617991 | 2151.117183 | 8691.257605
2016-08-04 | 1322.956825 | 2350.717306 | 2038.462016 | 8580.422024
2016-08-05 | 1236.472191 | 2175.728026 | 1927.793418 | 9066.134443
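If only a subset of series is needed, their names can be passed through the `levels` argument of `predict`. A minimal sketch, reusing the forecaster fitted above:

# Predict only selected series (sketch)
# ==============================================================================
predictions_subset = forecaster.predict(
    steps             = 5,
    levels            = ['id_1000', 'id_1003'],
    exog              = exog_dict_test,
    suppress_warnings = True
)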
Missing values in the series¶
When working with time series of different lengths, it is common for some series to have missing values. Since not all regressors allow missing values, the argument `dropna_from_series` can be used to remove them from the training matrices:

- If `False`, leave NaNs in `X_train` and warn the user (default).
- If `True`, drop NaNs in `X_train` and the same rows in `y_train`, and warn the user.

NaNs in `y_train` (and the same rows in `X_train`) are always dropped because the target variable cannot contain NaN values.
# Sample data with interspersed NaNs
# ==============================================================================
series_dict_nan = {
'id_1000': series_dict['id_1000'].copy(),
'id_1003': series_dict['id_1003'].copy()
}
# Create NaNs
series_dict_nan['id_1000'].loc['2016-03-01':'2016-04-01'] = np.nan
series_dict_nan['id_1000'].loc['2016-05-01':'2016-05-07'] = np.nan
series_dict_nan['id_1003'].loc['2016-07-01'] = np.nan
# Plot series
# ==============================================================================
fig, axs = plt.subplots(2, 1, figsize=(8, 2.5), sharex=True)
for i, s in enumerate(series_dict_nan.values()):
    axs[i].plot(s, label=s.name, color=colors[i])
    axs[i].legend(loc='upper right', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)
    axs[i].axvline(pd.to_datetime(end_train), color='white', linestyle='--', linewidth=1)  # End train
fig.suptitle('Series in `series_dict_nan`', fontsize=15)
plt.tight_layout()
When `dropna_from_series = False`, the NaNs in `X_train` are kept and the user is warned. This is useful if the user wants to keep the NaNs in the series and use a regressor that can handle them.
# Create Matrices, dropna_from_series = False
# ==============================================================================
regressor = LGBMRegressor(random_state=123, verbose=-1, max_depth=5)
forecaster = ForecasterAutoregMultiSeries(
regressor = regressor,
lags = 3,
encoding = "ordinal",
dropna_from_series = False
)
X, y = forecaster.create_train_X_y(series=series_dict_nan)
display(X.head(3))
print("Observations per series:")
print(X['_level_skforecast'].value_counts())
print("")
print("NaNs per series:")
print(X.isnull().sum())
c:\Users\jaesc2\Miniconda3\envs\skforecast_py11\Lib\site-packages\skforecast\ForecasterAutoregMultiSeries\ForecasterAutoregMultiSeries.py:783: MissingValuesWarning: NaNs detected in `y_train`. They have been dropped because the target variable cannot have NaN values. Same rows have been dropped from `X_train` to maintain alignment. This is caused by series with interspersed NaNs. You can suppress this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)
  warnings.warn(

c:\Users\jaesc2\Miniconda3\envs\skforecast_py11\Lib\site-packages\skforecast\ForecasterAutoregMultiSeries\ForecasterAutoregMultiSeries.py:805: MissingValuesWarning: NaNs detected in `X_train`. Some regressors do not allow NaN values during training. If you want to drop them, set `forecaster.dropna_from_series = True`. You can suppress this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)
  warnings.warn(
 | lag_1 | lag_2 | lag_3 | _level_skforecast
---|---|---|---|---
2016-01-04 | -1.462688 | -0.940102 | -1.374844 | 0
2016-01-05 | 0.600111 | -1.462688 | -0.940102 | 0
2016-01-06 | 0.333608 | 0.600111 | -1.462688 | 0
Observations per series:
_level_skforecast
0    324
1    216
Name: count, dtype: int64

NaNs per series:
lag_1                 5
lag_2                 9
lag_3                13
_level_skforecast     0
dtype: int64
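LightGBM handles missing values natively, so the matrices above can be used as-is. As a sketch, any NaN-tolerant regressor works here; for example, scikit-learn's `HistGradientBoostingRegressor` also accepts NaN inputs (this forecaster is illustrative, not part of the original example):

# Fit with a NaN-tolerant regressor, keeping NaNs in X_train (sketch)
# ==============================================================================
from sklearn.ensemble import HistGradientBoostingRegressor

forecaster_nan = ForecasterAutoregMultiSeries(
    regressor          = HistGradientBoostingRegressor(random_state=123),
    lags               = 3,
    encoding           = "ordinal",
    dropna_from_series = False
)
forecaster_nan.fit(series=series_dict_nan, suppress_warnings=True)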
When `dropna_from_series = True`, the NaNs in `X_train` are removed and the user is warned. This is useful if the user chooses a regressor that cannot handle missing values.
# Create matrices, dropna_from_series = True
# ==============================================================================
regressor = LGBMRegressor(random_state=123, verbose=-1, max_depth=5)
forecaster = ForecasterAutoregMultiSeries(
regressor = regressor,
lags = 3,
encoding = "ordinal",
dropna_from_series = True
)
X, y = forecaster.create_train_X_y(series=series_dict_nan)
display(X.head(3))
print("Observations per series:")
print(X['_level_skforecast'].value_counts())
print("")
print("NaNs per series:")
print(X.isnull().sum())
c:\Users\jaesc2\Miniconda3\envs\skforecast_py11\Lib\site-packages\skforecast\ForecasterAutoregMultiSeries\ForecasterAutoregMultiSeries.py:783: MissingValuesWarning: NaNs detected in `y_train`. They have been dropped because the target variable cannot have NaN values. Same rows have been dropped from `X_train` to maintain alignment. This is caused by series with interspersed NaNs. You can suppress this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)
  warnings.warn(

c:\Users\jaesc2\Miniconda3\envs\skforecast_py11\Lib\site-packages\skforecast\ForecasterAutoregMultiSeries\ForecasterAutoregMultiSeries.py:796: MissingValuesWarning: NaNs detected in `X_train`. They have been dropped. If you want to keep them, set `forecaster.dropna_from_series = False`. Same rows have been removed from `y_train` to maintain alignment. This caused by series with interspersed NaNs. You can suppress this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)
  warnings.warn(
 | lag_1 | lag_2 | lag_3 | _level_skforecast
---|---|---|---|---
2016-01-04 | -1.462688 | -0.940102 | -1.374844 | 0
2016-01-05 | 0.600111 | -1.462688 | -0.940102 | 0
2016-01-06 | 0.333608 | 0.600111 | -1.462688 | 0
Observations per series:
_level_skforecast
0    318
1    207
Name: count, dtype: int64

NaNs per series:
lag_1                0
lag_2                0
lag_3                0
_level_skforecast    0
dtype: int64
During the training process, the warnings can be suppressed by setting `suppress_warnings = True`.
# Suppress warnings during fit method
# ==============================================================================
forecaster.fit(series=series_dict_nan, suppress_warnings=True)
forecaster
============================
ForecasterAutoregMultiSeries
============================
Regressor: LGBMRegressor(max_depth=5, random_state=123, verbose=-1)
Lags: [1 2 3]
Transformer for series: StandardScaler()
Transformer for exog: None
Series encoding: ordinal
Window size: 3
Series levels (names): ['id_1000', 'id_1003']
Series weights: None
Weight function included: False
Differentiation order: None
Exogenous included: False
Type of exogenous variable: None
Exogenous variables names: None
Training range: ["'id_1000': ['2016-01-01', '2016-12-31']", "'id_1003': ['2016-01-01', '2016-12-31']"]
Training index type: DatetimeIndex
Training index frequency: D
Regressor parameters: boosting_type: gbdt, class_weight: None, colsample_bytree: 1.0, importance_type: split, learning_rate: 0.1, ...
fit_kwargs: {}
Creation date: 2024-05-08 08:35:19
Last fit date: 2024-05-08 08:35:19
Skforecast version: 0.12.0
Python version: 3.11.5
Forecaster id: None
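Warnings can also be silenced globally by category, as the warning messages themselves suggest. A sketch, assuming `MissingValuesWarning` is importable from `skforecast.exceptions`:

# Suppress MissingValuesWarning globally (sketch)
# ==============================================================================
import warnings
from skforecast.exceptions import MissingValuesWarning  # assumed import path

warnings.simplefilter('ignore', category=MissingValuesWarning)
forecaster.fit(series=series_dict_nan)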
Backtesting¶
As in the `predict` method, the `levels` at which backtesting is performed must be indicated. When series have different lengths, the backtesting process only returns predictions for the date-times that are present in each series.
# Backtesting
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = regressor,
lags = 14,
encoding = "ordinal",
dropna_from_series = False
)
metrics_levels, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = series_dict,
exog = exog_dict,
levels = None,
steps = 24,
metric = "mean_absolute_error",
initial_train_size = len(series_dict_train["id_1000"]),
fixed_train_size = True,
gap = 0,
allow_incomplete_fold = True,
refit = False,
n_jobs ="auto",
verbose = True,
show_progress = True,
suppress_warnings = True
)
display(metrics_levels)
print("")
display(backtest_predictions)
Information of backtesting process
----------------------------------
Number of observations used for initial training: 213
Number of observations used for backtesting: 153
    Number of folds: 7
    Number of steps per fold: 24
    Number of steps to exclude from the end of each train set before test (gap): 0
    Last fold only includes 9 observations.

Fold: 0
    Training:   2016-01-01 00:00:00 -- 2016-07-31 00:00:00  (n=213)
    Validation: 2016-08-01 00:00:00 -- 2016-08-24 00:00:00  (n=24)
Fold: 1
    Training:   2016-01-01 00:00:00 -- 2016-07-31 00:00:00  (n=213)
    Validation: 2016-08-25 00:00:00 -- 2016-09-17 00:00:00  (n=24)
Fold: 2
    Training:   2016-01-01 00:00:00 -- 2016-07-31 00:00:00  (n=213)
    Validation: 2016-09-18 00:00:00 -- 2016-10-11 00:00:00  (n=24)
Fold: 3
    Training:   2016-01-01 00:00:00 -- 2016-07-31 00:00:00  (n=213)
    Validation: 2016-10-12 00:00:00 -- 2016-11-04 00:00:00  (n=24)
Fold: 4
    Training:   2016-01-01 00:00:00 -- 2016-07-31 00:00:00  (n=213)
    Validation: 2016-11-05 00:00:00 -- 2016-11-28 00:00:00  (n=24)
Fold: 5
    Training:   2016-01-01 00:00:00 -- 2016-07-31 00:00:00  (n=213)
    Validation: 2016-11-29 00:00:00 -- 2016-12-22 00:00:00  (n=24)
Fold: 6
    Training:   2016-01-01 00:00:00 -- 2016-07-31 00:00:00  (n=213)
    Validation: 2016-12-23 00:00:00 -- 2016-12-31 00:00:00  (n=9)
 | levels | mean_absolute_error
---|---|---
0 | id_1000 | 170.480334
1 | id_1001 | 898.507443
2 | id_1002 | NaN
3 | id_1003 | 312.142117
4 | id_1004 | 1016.918515
 | id_1000 | id_1001 | id_1003 | id_1004
---|---|---|---|---
2016-08-01 | 1293.204467 | 2524.395199 | 2709.278962 | 7832.637393
2016-08-02 | 1406.194447 | 2352.220339 | 2374.673568 | 8211.355533
2016-08-03 | 1373.037098 | 2448.617991 | 2151.117183 | 8691.257605
2016-08-04 | 1322.956825 | 2350.717306 | 2038.462016 | 8580.422024
2016-08-05 | 1236.472191 | 2175.728026 | 1927.793418 | 9066.134443
... | ... | ... | ... | ...
2016-12-27 | 1707.247192 | 1090.228639 | 1917.176341 | NaN
2016-12-28 | 1657.671846 | 1095.269468 | 1764.954581 | NaN
2016-12-29 | 1435.185437 | 1202.631403 | 1808.017876 | NaN
2016-12-30 | 1397.272408 | 1323.913812 | 1943.009954 | NaN
2016-12-31 | 1247.455810 | 1062.773378 | 1910.004499 | NaN
153 rows × 4 columns
Note that if a series has no observations in the test set, the backtesting process will not return any predictions for that series and its metric will be `NaN`, as happens with series `'id_1002'`. Downstream code can filter these out, as shown below.
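A sketch of how such series could be filtered out of the results, using the `metrics_levels` DataFrame returned by the backtest above:

# Keep only series with a computed metric (sketch)
# ==============================================================================
valid_levels = (
    metrics_levels
    .dropna(subset=['mean_absolute_error'])['levels']
    .tolist()
)
print(valid_levels)  # ['id_1000', 'id_1001', 'id_1003', 'id_1004']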
# Plot backtesting predictions
# ==============================================================================
fig, axs = plt.subplots(5, 1, figsize=(8, 4), sharex=True)
for i, s in enumerate(series_dict.keys()):
    axs[i].plot(series_dict[s], label=series_dict[s].name, color=colors[i])
    axs[i].axvline(pd.to_datetime(end_train), color='white', linestyle='--', linewidth=1)  # End train
    try:
        axs[i].plot(backtest_predictions[s], label='prediction', color="white")
    except KeyError:
        pass  # series without backtest predictions (e.g. 'id_1002')
    axs[i].legend(loc='upper right', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)
fig.suptitle('Backtest Predictions', fontsize=15)
plt.tight_layout()
Hyperparameter tuning and lags selection¶
The `grid_search_forecaster_multiseries`, `random_search_forecaster_multiseries` and `bayesian_search_forecaster_multiseries` functions in the `model_selection_multiseries` module allow lags and hyperparameter optimization. As with other forecasters, validation is performed with a backtesting strategy.

When using series of different lengths, the overall metric is the average of the backtesting metrics of the selected series (levels), weighted by the number of predictions made for each one (see the worked example below).

More information about: Hyperparameter tuning and lags selection Multi-Series.
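As an arithmetic illustration of the weighting (the numbers are invented for the example): if one series has a MAE of 100 over 150 predictions and another a MAE of 300 over 50 predictions, the aggregated metric is (100 * 150 + 300 * 50) / 200 = 150.

# Weighted average of per-series metrics (illustrative numbers)
# ==============================================================================
import numpy as np

maes          = np.array([100.0, 300.0])  # per-series mean_absolute_error
n_predictions = np.array([150, 50])       # number of predictions per series
weighted_mae  = np.average(maes, weights=n_predictions)
print(weighted_mae)  # 150.0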
# Bayesian search hyperparameters and lags with Optuna
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = LGBMRegressor(random_state=123, verbose=-1),
lags = 14,
encoding = 'ordinal'
)
# Search space
def search_space(trial):
    search_space = {
        'lags'             : trial.suggest_categorical('lags', [7, 14]),
        'min_samples_leaf' : trial.suggest_int('min_samples_leaf', 10, 30),
        'max_depth'        : trial.suggest_int('max_depth', 3, 7)
    }
    return search_space
results, best_trial = bayesian_search_forecaster_multiseries(
forecaster = forecaster,
series = series_dict,
exog = exog_dict,
levels = None,
search_space = search_space,
steps = 24,
metric = 'mean_absolute_error',
refit = False,
initial_train_size = len(series_dict_train["id_1000"]),
fixed_train_size = False,
n_trials = 10,
random_state = 123,
return_best = False,
n_jobs = 'auto',
verbose = False,
show_progress = True,
suppress_warnings = True
)
results.head(4)
 | levels | lags | params | mean_absolute_error | min_samples_leaf | max_depth
---|---|---|---|---|---|---
5 | [id_1000, id_1001, id_1002, id_1003, id_1004] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] | {'min_samples_leaf': 25, 'max_depth': 6} | 522.020758 | 25 | 6
9 | [id_1000, id_1001, id_1002, id_1003, id_1004] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] | {'min_samples_leaf': 29, 'max_depth': 5} | 531.411523 | 29 | 5
8 | [id_1000, id_1001, id_1002, id_1003, id_1004] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] | {'min_samples_leaf': 18, 'max_depth': 4} | 534.344251 | 18 | 4
7 | [id_1000, id_1001, id_1002, id_1003, id_1004] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14] | {'min_samples_leaf': 11, 'max_depth': 5} | 559.436812 | 11 | 5