Global Forecasting Models: Time series with different lengths and different exogenous variables¶
When faced with a multi-series forecasting problem, it is common for the series to have varying lengths due to differences in the starting times of data recording. To address this scenario, the `ForecasterAutoregMultiSeries` and `ForecasterAutoregMultiSeriesCustom` classes allow the simultaneous modeling of time series of different lengths and with different exogenous variables.
- When the modeled series have different lengths, they must be stored in a Python dictionary. The keys of the dictionary are the names of the series and the values are the series themselves. All series must be of type `pandas.Series` and have a `pandas.DatetimeIndex` with the same frequency. A minimal construction sketch is shown after the table below.
| Series values                            | Allowed |
|------------------------------------------|---------|
| `[NaN, NaN, NaN, NaN, 4, 5, 6, 7, 8, 9]` | ✔️      |
| `[0, 1, 2, 3, 4, 5, 6, 7, 8, NaN]`       | ✔️      |
| `[0, 1, 2, 3, 4, NaN, 6, 7, 8, 9]`       | ✔️      |
| `[NaN, NaN, 2, 3, 4, NaN, 6, 7, 8, 9]`   | ✔️      |
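As an illustration, the following is a minimal sketch (with made-up values and names) of a valid dictionary of series of different lengths, each with a daily `pandas.DatetimeIndex`:

# Sketch: dictionary of series with different lengths (illustrative values)
# ==============================================================================
import numpy as np
import pandas as pd

series_dict_example = {
    'series_1': pd.Series(
                    data  = np.random.rand(100),
                    index = pd.date_range(start='2016-01-01', periods=100, freq='D'),
                    name  = 'series_1'
                ),
    'series_2': pd.Series(
                    data  = np.random.rand(50),
                    index = pd.date_range(start='2016-02-20', periods=50, freq='D'),
                    name  = 'series_2'
                ),
}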
- When different exogenous variables are used for each series, or if the exogenous variables are the same but have different values for each series, they must also be stored in a dictionary. The keys of the dictionary are the names of the series and the values are the exogenous variables themselves. All exogenous variables must be of type `pandas.DataFrame` or `pandas.Series` and have a `pandas.DatetimeIndex` with the same frequency (see the sketch below).
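Continuing the sketch above, a dictionary of exogenous variables (illustrative column names and values), one `pandas.DataFrame` per series, aligned with each series' index:

# Sketch: dictionary of exogenous variables, different columns per series
# ==============================================================================
exog_dict_example = {
    'series_1': pd.DataFrame(
                    data  = {'temperature': np.random.rand(100)},
                    index = pd.date_range(start='2016-01-01', periods=100, freq='D')
                ),
    'series_2': pd.DataFrame(
                    data  = {'temperature': np.random.rand(50),
                             'wind_speed' : np.random.rand(50)},
                    index = pd.date_range(start='2016-02-20', periods=50, freq='D')
                ),
}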
Libraries¶
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor
from skforecast.plot import set_dark_theme
from skforecast.preprocessing import series_long_to_dict
from skforecast.preprocessing import exog_long_to_dict
from skforecast.ForecasterAutoregMultiSeries import ForecasterAutoregMultiSeries
from skforecast.model_selection_multiseries import backtesting_forecaster_multiseries
from skforecast.model_selection_multiseries import bayesian_search_forecaster_multiseries
Data¶
The data for this example is stored in "long format" in a single `DataFrame`. The `series_id` column identifies the series to which each observation belongs, the `timestamp` column contains the date of the observation, and the `value` column contains the value of the series at that date. Each time series has a different length.

The exogenous variables are stored in a separate `DataFrame`, also in "long format". The `series_id` column identifies the series to which each observation belongs, the `timestamp` column contains the date of the observation, and the remaining columns contain the values of the exogenous variables at that date.
# Load time series of multiple lengths and exogenous variables
# ==============================================================================
series = pd.read_csv(
'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast-datasets/main/data/demo_multi_series.csv'
)
exog = pd.read_csv(
'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast-datasets/main/data/demo_multi_series_exog.csv'
)
series['timestamp'] = pd.to_datetime(series['timestamp'])
exog['timestamp'] = pd.to_datetime(exog['timestamp'])
display(series.head())
print("")
display(exog.head())
|   | series_id | timestamp  | value       |
|---|-----------|------------|-------------|
| 0 | id_1000   | 2016-01-01 | 1012.500694 |
| 1 | id_1000   | 2016-01-02 | 1158.500099 |
| 2 | id_1000   | 2016-01-03 | 983.000099  |
| 3 | id_1000   | 2016-01-04 | 1675.750496 |
| 4 | id_1000   | 2016-01-05 | 1586.250694 |
|   | series_id | timestamp  | sin_day_of_week | cos_day_of_week | air_temperature | wind_speed |
|---|-----------|------------|-----------------|-----------------|-----------------|------------|
| 0 | id_1000   | 2016-01-01 | -0.433884       | -0.900969       | 6.416639        | 4.040115   |
| 1 | id_1000   | 2016-01-02 | -0.974928       | -0.222521       | 6.366474        | 4.530395   |
| 2 | id_1000   | 2016-01-03 | -0.781831       | 0.623490        | 6.555272        | 3.273064   |
| 3 | id_1000   | 2016-01-04 | 0.000000        | 1.000000        | 6.704778        | 4.865404   |
| 4 | id_1000   | 2016-01-05 | 0.781831        | 0.623490        | 2.392998        | 5.228913   |
When series have different lengths, the data must be transformed into a dictionary. The keys of the dictionary are the names of the series and the values are the series themselves. To do this, the `series_long_to_dict` function is used, which takes the `DataFrame` in "long format" and returns a `dict` of pandas `Series`.

Similarly, when the exogenous variables are different (values or variables) for each series, the data must be transformed into a dictionary. The keys of the dictionary are the names of the series and the values are the exogenous variables themselves. The `exog_long_to_dict` function is used, which takes the `DataFrame` in "long format" and returns a `dict` of exogenous variables (pandas `Series` or pandas `DataFrame`).
# Transform series and exog to dictionaries
# ==============================================================================
series_dict = series_long_to_dict(
data = series,
series_id = 'series_id',
index = 'timestamp',
values = 'value',
freq = 'D'
)
exog_dict = exog_long_to_dict(
data = exog,
series_id = 'series_id',
index = 'timestamp',
freq = 'D'
)
Some exogenous variables are omitted for series 'id_1000' and 'id_1003' to illustrate that different exogenous variables can be used for each series.
# Drop some exogenous variables for series 'id_1000' and 'id_1003'
# ==============================================================================
exog_dict['id_1000'] = exog_dict['id_1000'].drop(columns=['air_temperature', 'wind_speed'])
exog_dict['id_1003'] = exog_dict['id_1003'].drop(columns=['cos_day_of_week'])
# Partition data in train and test
# ==============================================================================
end_train = '2016-07-31 23:59:00'
series_dict_train = {k: v.loc[:end_train] for k, v in series_dict.items()}
exog_dict_train   = {k: v.loc[:end_train] for k, v in exog_dict.items()}
series_dict_test  = {k: v.loc[end_train:] for k, v in series_dict.items()}
exog_dict_test    = {k: v.loc[end_train:] for k, v in exog_dict.items()}
# Plot series
# ==============================================================================
set_dark_theme()
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
fig, axs = plt.subplots(5, 1, figsize=(8, 4), sharex=True)
for i, s in enumerate(series_dict.values()):
    axs[i].plot(s, label=s.name, color=colors[i])
    axs[i].legend(loc='upper right', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)
    axs[i].axvline(pd.to_datetime(end_train), color='white', linestyle='--', linewidth=1)  # End train
fig.suptitle('Series in `series_dict`', fontsize=15)
plt.tight_layout()
# Description of each partition
# ==============================================================================
for k in series_dict.keys():
    print(f"{k}:")
    try:
        print(
            f"\tTrain: len={len(series_dict_train[k])}, {series_dict_train[k].index[0]}"
            f" --- {series_dict_train[k].index[-1]}"
        )
    except (KeyError, IndexError):
        print("\tTrain: len=0")
    try:
        print(
            f"\tTest : len={len(series_dict_test[k])}, {series_dict_test[k].index[0]}"
            f" --- {series_dict_test[k].index[-1]}"
        )
    except (KeyError, IndexError):
        print("\tTest : len=0")
id_1000:
	Train: len=213, 2016-01-01 00:00:00 --- 2016-07-31 00:00:00
	Test : len=153, 2016-08-01 00:00:00 --- 2016-12-31 00:00:00
id_1001:
	Train: len=30, 2016-07-02 00:00:00 --- 2016-07-31 00:00:00
	Test : len=153, 2016-08-01 00:00:00 --- 2016-12-31 00:00:00
id_1002:
	Train: len=183, 2016-01-01 00:00:00 --- 2016-07-01 00:00:00
	Test : len=0
id_1003:
	Train: len=213, 2016-01-01 00:00:00 --- 2016-07-31 00:00:00
	Test : len=153, 2016-08-01 00:00:00 --- 2016-12-31 00:00:00
id_1004:
	Train: len=91, 2016-05-02 00:00:00 --- 2016-07-31 00:00:00
	Test : len=31, 2016-08-01 00:00:00 --- 2016-08-31 00:00:00
# Exogenous variables for each series
# ==============================================================================
for k in series_dict.keys():
    print(f"{k}:")
    try:
        print(f"\t{exog_dict[k].columns.to_list()}")
    except KeyError:
        print("\tNo exogenous variables")
id_1000:
	['sin_day_of_week', 'cos_day_of_week']
id_1001:
	['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
id_1002:
	['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
id_1003:
	['sin_day_of_week', 'air_temperature', 'wind_speed']
id_1004:
	['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
Train and predict¶
The `fit` method is used to train the model. It is passed the dictionary of series and the dictionary of exogenous variables, where the keys of each dictionary are the names of the series.
⚠ Warning

Starting with version `0.13.0`, the `ForecasterAutoregMultiSeries` class defaults to `encoding='ordinal_category'` for encoding the time series identifiers. This approach creates a new column (`_level_skforecast`) of type pandas `category`. Consequently, the regressor must be able to handle categorical variables. If it does not, the user should set the encoding to `'ordinal'` or `'onehot'` for compatibility.

Some examples of regressors that support categorical variables, and how to enable them (a sketch follows this list):

- HistGradientBoostingRegressor: `HistGradientBoostingRegressor(categorical_features="from_dtype")`
- LightGBM: `LGBMRegressor` does not allow configuration of categorical features during initialization, but rather in its `fit` method. Therefore, use `fit_kwargs = {'categorical_feature': 'auto'}`. This is the default behavior of `LGBMRegressor` if no indication is given.
- XGBoost: `XGBRegressor(enable_categorical=True)`
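For example, a minimal sketch of a forecaster that keeps the default `'ordinal_category'` encoding and passes the categorical configuration to LightGBM through `fit_kwargs` (the lags and regressor hyperparameters shown are illustrative):

# Sketch: default 'ordinal_category' encoding with LightGBM
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
                 regressor  = LGBMRegressor(random_state=123, verbose=-1),
                 lags       = 14,
                 encoding   = 'ordinal_category',
                 fit_kwargs = {'categorical_feature': 'auto'}  # LightGBM's default
             )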
# Fit forecaster
# ==============================================================================
regressor = LGBMRegressor(random_state=123, verbose=-1, max_depth=5)
forecaster = ForecasterAutoregMultiSeries(
regressor = regressor,
lags = 14,
encoding = "ordinal",
dropna_from_series = False
)
forecaster.fit(series=series_dict_train, exog=exog_dict_train, suppress_warnings=True)
forecaster
============================
ForecasterAutoregMultiSeries
============================
Regressor: LGBMRegressor(max_depth=5, random_state=123, verbose=-1)
Lags: [ 1  2  3  4  5  6  7  8  9 10 11 12 13 14]
Transformer for series: None
Transformer for exog: None
Series encoding: ordinal
Window size: 14
Series levels (names): ['id_1000', 'id_1001', 'id_1002', 'id_1003', 'id_1004']
Series weights: None
Weight function included: False
Differentiation order: None
Exogenous included: True
Type of exogenous variable: <class 'dict'>
Exogenous variables names: ['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
Training range: 'id_1000': ['2016-01-01', '2016-07-31'], 'id_1001': ['2016-07-02', '2016-07-31'], 'id_1002': ['2016-01-01', '2016-07-01'], ...
Training index type: DatetimeIndex
Training index frequency: D
Regressor parameters: boosting_type: gbdt, class_weight: None, colsample_bytree: 1.0, importance_type: split, learning_rate: 0.1, ...
fit_kwargs: {}
Creation date: 2024-07-29 18:05:00
Last fit date: 2024-07-29 18:05:00
Skforecast version: 0.13.0
Python version: 3.11.5
Forecaster id: None
Only series whose last window of data ends at the same datetime index can be predicted together. If `levels = None`, series that do not reach the maximum index are excluded from prediction. In this example, series `'id_1002'` is excluded because it does not reach the maximum index.
# Predict
# ==============================================================================
predictions = forecaster.predict(steps=5, exog=exog_dict_test, suppress_warnings=True)
predictions
|            | id_1000     | id_1001     | id_1003     | id_1004     |
|------------|-------------|-------------|-------------|-------------|
| 2016-08-01 | 1469.450739 | 3021.628418 | 2949.305530 | 7408.161532 |
| 2016-08-02 | 1461.934467 | 3308.658430 | 2657.179559 | 8495.680202 |
| 2016-08-03 | 1417.709761 | 3095.373322 | 2148.531262 | 8795.777535 |
| 2016-08-04 | 1341.888287 | 3395.436028 | 1995.329983 | 8863.318002 |
| 2016-08-05 | 1330.593757 | 2802.032726 | 2024.038548 | 8881.080283 |
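If only a subset of series is of interest, their names can be passed explicitly through the `levels` argument. A minimal sketch (the subset shown is illustrative):

# Sketch: predict only a subset of series via `levels`
# ==============================================================================
predictions_subset = forecaster.predict(
                         steps             = 5,
                         levels            = ['id_1000', 'id_1003'],
                         exog              = exog_dict_test,
                         suppress_warnings = True
                     )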
Missing values in the series¶
When working with time series of different lengths, it is common for some series to have missing values. Since not all regressors allow missing values, the argument `dropna_from_series` can be used to remove them from the training matrices:

- If `False`, leave NaNs in `X_train` and warn the user (default).
- If `True`, drop NaNs in `X_train` and the same rows in `y_train`, and warn the user.

NaNs in `y_train` (and the same rows in `X_train`) are always dropped because the target variable cannot have NaN values.
# Sample data with interspersed NaNs
# ==============================================================================
series_dict_nan = {
    'id_1000': series_dict['id_1000'].copy(),
    'id_1003': series_dict['id_1003'].copy()
}

# Create NaNs
series_dict_nan['id_1000'].loc['2016-03-01':'2016-04-01'] = np.nan
series_dict_nan['id_1000'].loc['2016-05-01':'2016-05-07'] = np.nan
series_dict_nan['id_1003'].loc['2016-07-01'] = np.nan
# Plot series
# ==============================================================================
fig, axs = plt.subplots(2, 1, figsize=(8, 2.5), sharex=True)
for i, s in enumerate(series_dict_nan.values()):
    axs[i].plot(s, label=s.name, color=colors[i])
    axs[i].legend(loc='upper right', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)
    axs[i].axvline(pd.to_datetime(end_train), color='white', linestyle='--', linewidth=1)  # End train
fig.suptitle('Series in `series_dict_nan`', fontsize=15)
plt.tight_layout()
When `dropna_from_series = False`, the NaNs in `X_train` are kept and the user is warned. This is useful if the user wants to keep the NaNs in the series and use a regressor that can handle them.
# Create Matrices, dropna_from_series = False
# ==============================================================================
regressor = LGBMRegressor(random_state=123, verbose=-1, max_depth=5)
forecaster = ForecasterAutoregMultiSeries(
regressor = regressor,
lags = 3,
encoding = "ordinal",
dropna_from_series = False
)
X, y = forecaster.create_train_X_y(series=series_dict_nan)
display(X.head(3))
print("Observations per series:")
print(X['_level_skforecast'].value_counts())
print("")
print("NaNs per series:")
print(X.isnull().sum())
MissingValuesWarning: NaNs detected in `y_train`. They have been dropped because the
target variable cannot have NaN values. Same rows have been dropped from `X_train` to
maintain alignment. This is caused by series with interspersed NaNs. You can suppress
this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)

MissingValuesWarning: NaNs detected in `X_train`. Some regressors do not allow NaN
values during training. If you want to drop them, set `forecaster.dropna_from_series = True`.
You can suppress this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)
|            | lag_1       | lag_2       | lag_3       | _level_skforecast |
|------------|-------------|-------------|-------------|-------------------|
| 2016-01-04 | 983.000099  | 1158.500099 | 1012.500694 | 0                 |
| 2016-01-05 | 1675.750496 | 983.000099  | 1158.500099 | 0                 |
| 2016-01-06 | 1586.250694 | 1675.750496 | 983.000099  | 0                 |
Observations per series:
_level_skforecast
0    324
1    216
Name: count, dtype: int64

NaNs per series:
lag_1                 5
lag_2                 9
lag_3                13
_level_skforecast     0
dtype: int64
When `dropna_from_series = True`, the NaNs in `X_train` are removed and the user is warned. This is useful if the user chooses a regressor that cannot handle missing values.
# Create Matrices, dropna_from_series = True
# ==============================================================================
regressor = LGBMRegressor(random_state=123, verbose=-1, max_depth=5)
forecaster = ForecasterAutoregMultiSeries(
regressor = regressor,
lags = 3,
encoding = "ordinal",
dropna_from_series = True
)
X, y = forecaster.create_train_X_y(series=series_dict_nan)
display(X.head(3))
print("Observations per series:")
print(X['_level_skforecast'].value_counts())
print("")
print("NaNs per series:")
print(X.isnull().sum())
MissingValuesWarning: NaNs detected in `y_train`. They have been dropped because the
target variable cannot have NaN values. Same rows have been dropped from `X_train` to
maintain alignment. This is caused by series with interspersed NaNs. You can suppress
this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)

MissingValuesWarning: NaNs detected in `X_train`. They have been dropped. If you want
to keep them, set `forecaster.dropna_from_series = False`. Same rows have been removed
from `y_train` to maintain alignment. This is caused by series with interspersed NaNs.
You can suppress this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)
|            | lag_1       | lag_2       | lag_3       | _level_skforecast |
|------------|-------------|-------------|-------------|-------------------|
| 2016-01-04 | 983.000099  | 1158.500099 | 1012.500694 | 0                 |
| 2016-01-05 | 1675.750496 | 983.000099  | 1158.500099 | 0                 |
| 2016-01-06 | 1586.250694 | 1675.750496 | 983.000099  | 0                 |
Observations per series:
_level_skforecast
0    318
1    207
Name: count, dtype: int64

NaNs per series:
lag_1                0
lag_2                0
lag_3                0
_level_skforecast    0
dtype: int64
During the training process, the warnings can be suppressed by setting `suppress_warnings = True`.
# Suppress warnings during fit method
# ==============================================================================
forecaster.fit(series=series_dict_nan, suppress_warnings=True)
forecaster
============================
ForecasterAutoregMultiSeries
============================
Regressor: LGBMRegressor(max_depth=5, random_state=123, verbose=-1)
Lags: [1 2 3]
Transformer for series: None
Transformer for exog: None
Series encoding: ordinal
Window size: 3
Series levels (names): ['id_1000', 'id_1003']
Series weights: None
Weight function included: False
Differentiation order: None
Exogenous included: False
Type of exogenous variable: None
Exogenous variables names: None
Training range: ["'id_1000': ['2016-01-01', '2016-12-31']", "'id_1003': ['2016-01-01', '2016-12-31']"]
Training index type: DatetimeIndex
Training index frequency: D
Regressor parameters: boosting_type: gbdt, class_weight: None, colsample_bytree: 1.0, importance_type: split, learning_rate: 0.1, ...
fit_kwargs: {}
Creation date: 2024-07-29 18:05:01
Last fit date: 2024-07-29 18:05:01
Skforecast version: 0.13.0
Python version: 3.11.5
Forecaster id: None
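Alternatively, as the warning messages themselves suggest, `MissingValuesWarning` can be silenced globally with the standard `warnings` module. A minimal sketch, assuming the warning class is importable from `skforecast.exceptions`:

# Sketch: silence MissingValuesWarning globally
# ==============================================================================
import warnings
from skforecast.exceptions import MissingValuesWarning

warnings.simplefilter('ignore', category=MissingValuesWarning)
forecaster.fit(series=series_dict_nan)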
Backtesting¶
As in the `predict` method, the `levels` at which backtesting is performed must be indicated. When series have different lengths, the backtesting process only returns predictions for the date-times that are present in the series.
# Backtesting
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = regressor,
lags = 14,
encoding = "ordinal",
dropna_from_series = False
)
metrics_levels, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = series_dict,
exog = exog_dict,
levels = None,
steps = 24,
metric = "mean_absolute_error",
add_aggregated_metric = True,
initial_train_size = len(series_dict_train["id_1000"]),
fixed_train_size = True,
gap = 0,
allow_incomplete_fold = True,
refit = False,
    n_jobs                = "auto",
verbose = True,
show_progress = True,
suppress_warnings = True
)
display(metrics_levels)
print("")
display(backtest_predictions)
Information of backtesting process
----------------------------------
Number of observations used for initial training: 213
Number of observations used for backtesting: 153
    Number of folds: 7
    Number skipped folds: 0
    Number of steps per fold: 24
    Number of steps to exclude from the end of each train set before test (gap): 0
    Last fold only includes 9 observations.

Fold: 0
    Training:   2016-01-01 00:00:00 -- 2016-07-31 00:00:00  (n=213)
    Validation: 2016-08-01 00:00:00 -- 2016-08-24 00:00:00  (n=24)
Fold: 1
    Training:   No training in this fold
    Validation: 2016-08-25 00:00:00 -- 2016-09-17 00:00:00  (n=24)
Fold: 2
    Training:   No training in this fold
    Validation: 2016-09-18 00:00:00 -- 2016-10-11 00:00:00  (n=24)
Fold: 3
    Training:   No training in this fold
    Validation: 2016-10-12 00:00:00 -- 2016-11-04 00:00:00  (n=24)
Fold: 4
    Training:   No training in this fold
    Validation: 2016-11-05 00:00:00 -- 2016-11-28 00:00:00  (n=24)
Fold: 5
    Training:   No training in this fold
    Validation: 2016-11-29 00:00:00 -- 2016-12-22 00:00:00  (n=24)
Fold: 6
    Training:   No training in this fold
    Validation: 2016-12-23 00:00:00 -- 2016-12-31 00:00:00  (n=9)
|   | levels           | mean_absolute_error |
|---|------------------|---------------------|
| 0 | id_1000          | 170.087276          |
| 1 | id_1001          | 963.948336          |
| 2 | id_1002          | NaN                 |
| 3 | id_1003          | 329.189269          |
| 4 | id_1004          | 787.969280          |
| 5 | average          | 562.798540          |
| 6 | weighted_average | 537.816877          |
| 7 | pooling          | 537.816877          |
|            | id_1000     | id_1001     | id_1003     | id_1004     |
|------------|-------------|-------------|-------------|-------------|
| 2016-08-01 | 1469.450739 | 3021.628418 | 2949.305530 | 7408.161532 |
| 2016-08-02 | 1461.934467 | 3308.658430 | 2657.179559 | 8495.680202 |
| 2016-08-03 | 1417.709761 | 3095.373322 | 2148.531262 | 8795.777535 |
| 2016-08-04 | 1341.888287 | 3395.436028 | 1995.329983 | 8863.318002 |
| 2016-08-05 | 1330.593757 | 2802.032726 | 2024.038548 | 8881.080283 |
| ...        | ...         | ...         | ...         | ...         |
| 2016-12-27 | 1655.064790 | 1049.095377 | 1799.599477 | NaN         |
| 2016-12-28 | 1550.616672 | 1002.617817 | 1807.259072 | NaN         |
| 2016-12-29 | 1463.626212 | 1061.122584 | 1810.838622 | NaN         |
| 2016-12-30 | 1501.660276 | 1046.970317 | 1768.303003 | NaN         |
| 2016-12-31 | 1347.881411 | 961.760005  | 1735.950240 | NaN         |

153 rows × 4 columns
Note that if a series has no observations in the test set, the backtesting process will not return any predictions for that series and the metric will be `NaN`, as happened with series `'id_1002'`.
# Plot backtesting predictions
# ==============================================================================
fig, axs = plt.subplots(5, 1, figsize=(8, 4), sharex=True)
for i, s in enumerate(series_dict.keys()):
    axs[i].plot(series_dict[s], label=series_dict[s].name, color=colors[i])
    axs[i].axvline(pd.to_datetime(end_train), color='white', linestyle='--', linewidth=1)  # End train
    try:
        axs[i].plot(backtest_predictions[s], label='prediction', color="white")
    except KeyError:
        # Series without predictions (no test observations) are skipped
        pass
    axs[i].legend(loc='upper right', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)
fig.suptitle('Backtest Predictions', fontsize=15)
plt.tight_layout()
Hyperparameter tuning and lags selection¶
The `grid_search_forecaster_multiseries`, `random_search_forecaster_multiseries` and `bayesian_search_forecaster_multiseries` functions in the `model_selection_multiseries` module allow for lags and hyperparameter optimization. As with other Forecasters, validation is performed using a backtesting strategy.

When using series of different lengths, the metric is calculated as a weighted average of the backtesting metrics of the selected series (levels), weighted by their number of predictions. A minimal sketch of a Bayesian search is shown below.
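As an illustration, a sketch of a Bayesian search using `bayesian_search_forecaster_multiseries` (imported in the Libraries section). The search space and number of trials are illustrative, and the `search_space(trial)` callable is assumed to follow the optuna-based interface of skforecast 0.13:

# Sketch: Bayesian search over lags and hyperparameters (illustrative ranges)
# ==============================================================================
def search_space(trial):
    return {
        'lags'        : trial.suggest_categorical('lags', [7, 14]),
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth'   : trial.suggest_int('max_depth', 3, 8)
    }

results, best_trial = bayesian_search_forecaster_multiseries(
                          forecaster         = forecaster,
                          series             = series_dict,
                          exog               = exog_dict,
                          levels             = None,
                          search_space       = search_space,
                          steps              = 24,
                          metric             = 'mean_absolute_error',
                          initial_train_size = len(series_dict_train['id_1000']),
                          refit              = False,
                          n_trials           = 10,
                          return_best        = False,
                          suppress_warnings  = True
                      )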
More information about: Hyperparameter tuning and lags selection Multi-Series.