Global Forecasting Models: Time series with different lengths and different exogenous variables¶
When faced with a multi-series forecasting problem, it is common for the series to have varying lengths because data recording started at different times. The class ForecasterRecursiveMultiSeries addresses this scenario: it can simultaneously model time series of different lengths, each with its own exogenous variables.
- When the modeled series have different lengths, they must be stored in a Python dictionary. The keys of the dictionary are the names of the series and the values are the series themselves. All series must be of type pandas.Series and have a pandas.DatetimeIndex with the same frequency. Leading, trailing, or interspersed NaN values are allowed (see the table below and the sketch after this list).
Series values | Allowed
---|---
[NaN, NaN, NaN, NaN, 4, 5, 6, 7, 8, 9] | ✔️
[0, 1, 2, 3, 4, 5, 6, 7, 8, NaN] | ✔️
[0, 1, 2, 3, 4, NaN, 6, 7, 8, 9] | ✔️
[NaN, NaN, 2, 3, 4, NaN, 6, 7, 8, 9] | ✔️
- When different exogenous variables are used for each series, or when the exogenous variables are the same but take different values for each series, they must also be stored in a dictionary. The keys of the dictionary are the names of the series and the values are the exogenous variables themselves. All exogenous variables must be of type pandas.DataFrame or pandas.Series and have a pandas.DatetimeIndex with the same frequency (see the sketch after this list).
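If the series and exogenous variables are already separate objects, the dictionaries can be built directly. A minimal sketch with made-up data (the names series_a, series_b and the column temp are illustrative, not part of the dataset used below):

# Build the dictionaries by hand (illustrative data)
# ==============================================================================
import numpy as np
import pandas as pd

idx_a = pd.date_range('2016-01-01', periods=10, freq='D')
idx_b = pd.date_range('2016-01-05', periods=6, freq='D')  # shorter series

series_dict = {
    'series_a': pd.Series(np.arange(10, dtype=float), index=idx_a, name='series_a'),
    'series_b': pd.Series(np.arange(6, dtype=float), index=idx_b, name='series_b'),
}

exog_dict = {
    'series_a': pd.DataFrame({'temp': np.random.rand(10)}, index=idx_a),
    'series_b': pd.Series(np.random.rand(6), index=idx_b, name='temp'),  # a Series is also valid
}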
💡 Tip
See the API Reference for ForecasterRecursiveMultiSeries.
Libraries and data¶
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor
from skforecast.plot import set_dark_theme
from skforecast.preprocessing import series_long_to_dict
from skforecast.preprocessing import exog_long_to_dict
from skforecast.preprocessing import RollingFeatures
from skforecast.recursive import ForecasterRecursiveMultiSeries
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import backtesting_forecaster_multiseries
from skforecast.model_selection import bayesian_search_forecaster_multiseries
The data for this example is stored in "long format" in a single DataFrame. The series_id column identifies the series to which each observation belongs, the timestamp column contains the date of the observation, and the value column contains the value of the series at that date. Each time series has a different length.

The exogenous variables are stored in a separate DataFrame, also in "long format". The series_id column identifies the series to which each observation belongs, the timestamp column contains the date of the observation, and the remaining columns contain the values of the exogenous variables at that date.
# Load time series of multiple lengths and exogenous variables
# ==============================================================================
series = pd.read_csv(
'https://raw.githubusercontent.com/skforecast/skforecast-datasets/main/data/demo_multi_series.csv'
)
exog = pd.read_csv(
'https://raw.githubusercontent.com/skforecast/skforecast-datasets/main/data/demo_multi_series_exog.csv'
)
series['timestamp'] = pd.to_datetime(series['timestamp'])
exog['timestamp'] = pd.to_datetime(exog['timestamp'])
display(series.head())
print("")
display(exog.head())
 | series_id | timestamp | value
---|---|---|---
0 | id_1000 | 2016-01-01 | 1012.500694
1 | id_1000 | 2016-01-02 | 1158.500099
2 | id_1000 | 2016-01-03 | 983.000099
3 | id_1000 | 2016-01-04 | 1675.750496
4 | id_1000 | 2016-01-05 | 1586.250694
 | series_id | timestamp | sin_day_of_week | cos_day_of_week | air_temperature | wind_speed
---|---|---|---|---|---|---
0 | id_1000 | 2016-01-01 | -0.433884 | -0.900969 | 6.416639 | 4.040115
1 | id_1000 | 2016-01-02 | -0.974928 | -0.222521 | 6.366474 | 4.530395
2 | id_1000 | 2016-01-03 | -0.781831 | 0.623490 | 6.555272 | 3.273064
3 | id_1000 | 2016-01-04 | 0.000000 | 1.000000 | 6.704778 | 4.865404
4 | id_1000 | 2016-01-05 | 0.781831 | 0.623490 | 2.392998 | 5.228913
When series have different lengths, the data must be transformed into a dictionary. The keys of the dictionary are the names of the series and the values are the series themselves. The series_long_to_dict function does this: it takes the DataFrame in "long format" and returns a dict of pandas Series.

Similarly, when the exogenous variables differ (in values or in variables) between series, the data must also be transformed into a dictionary. The exog_long_to_dict function takes the DataFrame in "long format" and returns a dict of exogenous variables (pandas Series or pandas DataFrame).
# Transform series and exog to dictionaries
# ==============================================================================
series_dict = series_long_to_dict(
data = series,
series_id = 'series_id',
index = 'timestamp',
values = 'value',
freq = 'D'
)
exog_dict = exog_long_to_dict(
data = exog,
series_id = 'series_id',
index = 'timestamp',
freq = 'D'
)
c:\Users\jaesc2\Miniconda3\envs\skforecast_py11_2\Lib\site-packages\skforecast\preprocessing\preprocessing.py:424: MissingValuesWarning: Series 'id_1003' is incomplete. NaNs have been introduced after setting the frequency. You can suppress this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)
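A quick, illustrative way to verify the transformation is to inspect the keys and date ranges of the resulting dictionary:

# Inspect the resulting dictionary of series (illustrative check)
# ==============================================================================
for k, v in series_dict.items():
    print(f"{k}: n={len(v)}, {v.index.min()} --- {v.index.max()}")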
Some exogenous variables are omitted for series 'id_1000' and 'id_1003' to illustrate that different exogenous variables can be used for each series.
# Drop some exogenous variables for series 'id_1000' and 'id_1003'
# ==============================================================================
exog_dict['id_1000'] = exog_dict['id_1000'].drop(columns=['air_temperature', 'wind_speed'])
exog_dict['id_1003'] = exog_dict['id_1003'].drop(columns=['cos_day_of_week'])
# Partition data in train and test
# ==============================================================================
end_train = '2016-07-31 23:59:00'
series_dict_train = {k: v.loc[:end_train] for k, v in series_dict.items()}
exog_dict_train   = {k: v.loc[:end_train] for k, v in exog_dict.items()}
series_dict_test  = {k: v.loc[end_train:] for k, v in series_dict.items()}
exog_dict_test    = {k: v.loc[end_train:] for k, v in exog_dict.items()}
# Plot series
# ==============================================================================
set_dark_theme()
colors = plt.rcParams['axes.prop_cycle'].by_key()['color']
fig, axs = plt.subplots(5, 1, figsize=(8, 4), sharex=True)
for i, s in enumerate(series_dict.values()):
    axs[i].plot(s, label=s.name, color=colors[i])
    axs[i].legend(loc='upper right', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)
    axs[i].axvline(pd.to_datetime(end_train), color='white', linestyle='--', linewidth=1)  # End train
fig.suptitle('Series in `series_dict`', fontsize=15)
plt.tight_layout()
# Description of each partition
# ==============================================================================
for k in series_dict.keys():
    print(f"{k}:")
    try:
        print(
            f"\tTrain: len={len(series_dict_train[k])}, {series_dict_train[k].index[0]}"
            f" --- {series_dict_train[k].index[-1]}"
        )
    except IndexError:  # The training partition is empty
        print("\tTrain: len=0")
    try:
        print(
            f"\tTest : len={len(series_dict_test[k])}, {series_dict_test[k].index[0]}"
            f" --- {series_dict_test[k].index[-1]}"
        )
    except IndexError:  # The test partition is empty
        print("\tTest : len=0")
id_1000:
    Train: len=213, 2016-01-01 00:00:00 --- 2016-07-31 00:00:00
    Test : len=153, 2016-08-01 00:00:00 --- 2016-12-31 00:00:00
id_1001:
    Train: len=30, 2016-07-02 00:00:00 --- 2016-07-31 00:00:00
    Test : len=153, 2016-08-01 00:00:00 --- 2016-12-31 00:00:00
id_1002:
    Train: len=183, 2016-01-01 00:00:00 --- 2016-07-01 00:00:00
    Test : len=0
id_1003:
    Train: len=213, 2016-01-01 00:00:00 --- 2016-07-31 00:00:00
    Test : len=153, 2016-08-01 00:00:00 --- 2016-12-31 00:00:00
id_1004:
    Train: len=91, 2016-05-02 00:00:00 --- 2016-07-31 00:00:00
    Test : len=31, 2016-08-01 00:00:00 --- 2016-08-31 00:00:00
# Exogenous variables for each series
# ==============================================================================
for k in series_dict.keys():
    print(f"{k}:")
    try:
        print(f"\t{exog_dict[k].columns.to_list()}")
    except KeyError:  # No exogenous variables for this series
        print("\tNo exogenous variables")
id_1000:
    ['sin_day_of_week', 'cos_day_of_week']
id_1001:
    ['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
id_1002:
    ['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
id_1003:
    ['sin_day_of_week', 'air_temperature', 'wind_speed']
id_1004:
    ['sin_day_of_week', 'cos_day_of_week', 'air_temperature', 'wind_speed']
Train and predict¶
The fit method is used to train the model. It receives the dictionary of series and the dictionary of exogenous variables, where the keys of each dictionary are the names of the series.
# Fit forecaster
# ==============================================================================
regressor = LGBMRegressor(random_state=123, verbose=-1, max_depth=5)
forecaster = ForecasterRecursiveMultiSeries(
regressor = regressor,
lags = 14,
window_features = RollingFeatures(stats=['mean', 'mean'], window_sizes=[7, 14]),
encoding = "ordinal",
dropna_from_series = False
)
forecaster.fit(series=series_dict_train, exog=exog_dict_train, suppress_warnings=True)
forecaster
ForecasterRecursiveMultiSeries
General Information
- Regressor: LGBMRegressor
- Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14]
- Window features: ['roll_mean_7', 'roll_mean_14']
- Window size: 14
- Series encoding: ordinal
- Exogenous included: True
- Weight function included: False
- Series weights: None
- Differentiation order: None
- Creation date: 2024-11-10 17:23:28
- Last fit date: 2024-11-10 17:23:28
- Skforecast version: 0.14.0
- Python version: 3.11.10
- Forecaster id: None
Exogenous Variables
- sin_day_of_week, cos_day_of_week, air_temperature, wind_speed
Data Transformations
- Transformer for series: None
- Transformer for exog: None
Training Information
- Series names (levels): id_1000, id_1001, id_1002, id_1003, id_1004
- Training range: 'id_1000': ['2016-01-01', '2016-07-31'], 'id_1001': ['2016-07-02', '2016-07-31'], 'id_1002': ['2016-01-01', '2016-07-01'], 'id_1003': ['2016-01-01', '2016-07-31'], 'id_1004': ['2016-05-02', '2016-07-31']
- Training index type: DatetimeIndex
- Training index frequency: D
Regressor Parameters
- {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
- {}
Only series whose last window of data ends at the same datetime index can be predicted together. If levels = None, series that do not reach the maximum index are excluded from prediction. In this example, series 'id_1002' is excluded because it does not reach the maximum index. Specific series can also be requested explicitly, as shown in the sketch after the predictions table.
# Predict
# ==============================================================================
predictions = forecaster.predict(steps=5, exog=exog_dict_test, suppress_warnings=True)
predictions
 | id_1000 | id_1001 | id_1003 | id_1004
---|---|---|---|---
2016-08-01 | 1453.312971 | 2849.347882 | 2706.851726 | 7496.555367
2016-08-02 | 1440.763196 | 2947.579536 | 2310.075968 | 8685.425990
2016-08-03 | 1410.151437 | 2875.847691 | 1997.329410 | 8961.631705
2016-08-04 | 1348.787299 | 3160.533645 | 1923.897012 | 8764.338331
2016-08-05 | 1301.504387 | 2920.424937 | 1940.149954 | 8694.134833
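If only a subset of series is needed, they can be requested explicitly through the levels argument of predict. A minimal sketch reusing the objects created above:

# Predict a subset of series by passing `levels` explicitly
# ==============================================================================
predictions_subset = forecaster.predict(
    steps             = 5,
    levels            = ['id_1000', 'id_1003'],
    exog              = exog_dict_test,
    suppress_warnings = True
)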
Missing values in the series¶
When working with time series of different lengths, it is common for some series to have missing values. Since not all regressors allow missing values, the argument dropna_from_series can be used to remove them from the training matrices:

- If False (default), NaNs are kept in X_train and the user is warned.
- If True, NaNs are dropped from X_train, together with the same rows of y_train, and the user is warned.

NaNs in y_train (and the corresponding rows of X_train) are always dropped, because the target variable cannot contain NaN values.
# Sample data with interspersed NaNs
# ==============================================================================
series_dict_nan = {
'id_1000': series_dict['id_1000'].copy(),
'id_1003': series_dict['id_1003'].copy()
}
# Create NaNs
series_dict_nan['id_1000'].loc['2016-03-01':'2016-04-01'] = np.nan
series_dict_nan['id_1000'].loc['2016-05-01':'2016-05-07'] = np.nan
series_dict_nan['id_1003'].loc['2016-07-01'] = np.nan
# Plot series
# ==============================================================================
fig, axs = plt.subplots(2, 1, figsize=(8, 2.5), sharex=True)
for i, s in enumerate(series_dict_nan.values()):
    axs[i].plot(s, label=s.name, color=colors[i])
    axs[i].legend(loc='upper right', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)
    axs[i].axvline(pd.to_datetime(end_train), color='white', linestyle='--', linewidth=1)  # End train
fig.suptitle('Series in `series_dict_nan`', fontsize=15)
plt.tight_layout()
When dropna_from_series = False, the NaNs in X_train are kept and the user is warned. This is useful if the user wants to keep the NaNs in the series and use a regressor that can handle them.
# Create Matrices, dropna_from_series = False
# ==============================================================================
regressor = LGBMRegressor(random_state=123, verbose=-1, max_depth=5)
forecaster = ForecasterRecursiveMultiSeries(
regressor = regressor,
lags = 3,
encoding = "ordinal",
dropna_from_series = False
)
X, y = forecaster.create_train_X_y(series=series_dict_nan)
display(X.head(3))
print("Observations per series:")
print(X['_level_skforecast'].value_counts())
print("")
print("NaNs per series:")
print(X.isnull().sum())
c:\Users\jaesc2\Miniconda3\envs\skforecast_py11_2\Lib\site-packages\skforecast\recursive\_forecaster_recursive_multiseries.py:1057: MissingValuesWarning: NaNs detected in `y_train`. They have been dropped because the target variable cannot have NaN values. Same rows have been dropped from `X_train` to maintain alignment. This is caused by series with interspersed NaNs. You can suppress this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)

c:\Users\jaesc2\Miniconda3\envs\skforecast_py11_2\Lib\site-packages\skforecast\recursive\_forecaster_recursive_multiseries.py:1079: MissingValuesWarning: NaNs detected in `X_train`. Some regressors do not allow NaN values during training. If you want to drop them, set `forecaster.dropna_from_series = True`. You can suppress this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)
 | lag_1 | lag_2 | lag_3 | _level_skforecast
---|---|---|---|---
2016-01-04 | 983.000099 | 1158.500099 | 1012.500694 | 0
2016-01-05 | 1675.750496 | 983.000099 | 1158.500099 | 0
2016-01-06 | 1586.250694 | 1675.750496 | 983.000099 | 0
Observations per series:
_level_skforecast
0    324
1    216
Name: count, dtype: int64

NaNs per series:
lag_1                 5
lag_2                 9
lag_3                13
_level_skforecast     0
dtype: int64
When dropna_from_series = True, the NaNs in X_train are removed and the user is warned. This is useful if the chosen regressor cannot handle missing values.
# Create Matrices, dropna_from_series = True
# ==============================================================================
regressor = LGBMRegressor(random_state=123, verbose=-1, max_depth=5)
forecaster = ForecasterRecursiveMultiSeries(
regressor = regressor,
lags = 3,
encoding = "ordinal",
dropna_from_series = True
)
X, y = forecaster.create_train_X_y(series=series_dict_nan)
display(X.head(3))
print("Observations per series:")
print(X['_level_skforecast'].value_counts())
print("")
print("NaNs per series:")
print(X.isnull().sum())
c:\Users\jaesc2\Miniconda3\envs\skforecast_py11_2\Lib\site-packages\skforecast\recursive\_forecaster_recursive_multiseries.py:1057: MissingValuesWarning: NaNs detected in `y_train`. They have been dropped because the target variable cannot have NaN values. Same rows have been dropped from `X_train` to maintain alignment. This is caused by series with interspersed NaNs. You can suppress this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)

c:\Users\jaesc2\Miniconda3\envs\skforecast_py11_2\Lib\site-packages\skforecast\recursive\_forecaster_recursive_multiseries.py:1070: MissingValuesWarning: NaNs detected in `X_train`. They have been dropped. If you want to keep them, set `forecaster.dropna_from_series = False`. Same rows have been removed from `y_train` to maintain alignment. This caused by series with interspersed NaNs. You can suppress this warning using: warnings.simplefilter('ignore', category=MissingValuesWarning)
 | lag_1 | lag_2 | lag_3 | _level_skforecast
---|---|---|---|---
2016-01-04 | 983.000099 | 1158.500099 | 1012.500694 | 0
2016-01-05 | 1675.750496 | 983.000099 | 1158.500099 | 0
2016-01-06 | 1586.250694 | 1675.750496 | 983.000099 | 0
Observations per series:
_level_skforecast
0    318
1    207
Name: count, dtype: int64

NaNs per series:
lag_1                0
lag_2                0
lag_3                0
_level_skforecast    0
dtype: int64
During the training process, the warnings can be suppressed by setting suppress_warnings = True.
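As the warning messages themselves indicate, the warnings can also be silenced globally with the warnings module. A minimal sketch (assuming MissingValuesWarning is importable from skforecast.exceptions, as in recent skforecast versions):

# Suppress MissingValuesWarning globally
# ==============================================================================
import warnings
from skforecast.exceptions import MissingValuesWarning

warnings.simplefilter('ignore', category=MissingValuesWarning)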
# Suppress warnings during fit method
# ==============================================================================
forecaster.fit(series=series_dict_nan, suppress_warnings=True)
forecaster
ForecasterRecursiveMultiSeries
General Information
- Regressor: LGBMRegressor
- Lags: [1 2 3]
- Window features: None
- Window size: 3
- Series encoding: ordinal
- Exogenous included: False
- Weight function included: False
- Series weights: None
- Differentiation order: None
- Creation date: 2024-11-10 17:23:28
- Last fit date: 2024-11-10 17:23:28
- Skforecast version: 0.14.0
- Python version: 3.11.10
- Forecaster id: None
Exogenous Variables
- None
Data Transformations
- Transformer for series: None
- Transformer for exog: None
Training Information
- Series names (levels): id_1000, id_1003
- Training range: 'id_1000': ['2016-01-01', '2016-12-31'], 'id_1003': ['2016-01-01', '2016-12-31']
- Training index type: DatetimeIndex
- Training index frequency: D
Regressor Parameters
- {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': 5, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
- {}
Backtesting¶
As in the predict method, the levels at which backtesting is performed must be indicated. When series have different lengths, the backtesting process only returns predictions for the date-times that are present in the series.
# Backtesting
# ==============================================================================
forecaster = ForecasterRecursiveMultiSeries(
regressor = regressor,
lags = 14,
window_features = RollingFeatures(stats=['mean', 'mean'], window_sizes=[7, 14]),
encoding = "ordinal",
dropna_from_series = False
)
cv = TimeSeriesFold(
steps = 24,
initial_train_size = len(series_dict_train["id_1000"]),
refit = False,
allow_incomplete_fold = True,
)
metrics_levels, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = series_dict,
exog = exog_dict,
cv = cv,
levels = None,
metric = "mean_absolute_error",
add_aggregated_metric = True,
n_jobs ="auto",
verbose = True,
show_progress = True,
suppress_warnings = True
)
display(metrics_levels)
print("")
display(backtest_predictions)
Information of folds
--------------------
Number of observations used for initial training: 213
Number of observations used for backtesting: 153
    Number of folds: 7
    Number skipped folds: 0
    Number of steps per fold: 24
    Number of steps to exclude between last observed data (last window) and predictions (gap): 0
    Last fold only includes 9 observations.

Fold: 0
    Training:   2016-01-01 00:00:00 -- 2016-07-31 00:00:00  (n=213)
    Validation: 2016-08-01 00:00:00 -- 2016-08-24 00:00:00  (n=24)
Fold: 1
    Training:   No training in this fold
    Validation: 2016-08-25 00:00:00 -- 2016-09-17 00:00:00  (n=24)
Fold: 2
    Training:   No training in this fold
    Validation: 2016-09-18 00:00:00 -- 2016-10-11 00:00:00  (n=24)
Fold: 3
    Training:   No training in this fold
    Validation: 2016-10-12 00:00:00 -- 2016-11-04 00:00:00  (n=24)
Fold: 4
    Training:   No training in this fold
    Validation: 2016-11-05 00:00:00 -- 2016-11-28 00:00:00  (n=24)
Fold: 5
    Training:   No training in this fold
    Validation: 2016-11-29 00:00:00 -- 2016-12-22 00:00:00  (n=24)
Fold: 6
    Training:   No training in this fold
    Validation: 2016-12-23 00:00:00 -- 2016-12-31 00:00:00  (n=9)
 | levels | mean_absolute_error
---|---|---
0 | id_1000 | 167.502214
1 | id_1001 | 1103.313887
2 | id_1002 | NaN
3 | id_1003 | 280.492603
4 | id_1004 | 711.078359
5 | average | 565.596766
6 | weighted_average | 572.944127
7 | pooling | 572.944127
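The last three rows aggregate the per-series metrics: average is the arithmetic mean of the per-series metrics (NaNs excluded), weighted_average weights each series by its number of predicted values, and pooling computes the metric over all predictions pooled together (for mean_absolute_error the last two coincide). A quick check of the average row:

# Verify the 'average' aggregated metric (values copied from the table above)
# ==============================================================================
import numpy as np

per_series_mae = [167.502214, 1103.313887, 280.492603, 711.078359]
print(np.mean(per_series_mae))  # ≈ 565.596766, matching the 'average' row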
 | id_1000 | id_1001 | id_1003 | id_1004
---|---|---|---|---
2016-08-01 | 1453.312971 | 2849.347882 | 2706.851726 | 7496.555367
2016-08-02 | 1440.763196 | 2947.579536 | 2310.075968 | 8685.425990
2016-08-03 | 1410.151437 | 2875.847691 | 1997.329410 | 8961.631705
2016-08-04 | 1348.787299 | 3160.533645 | 1923.897012 | 8764.338331
2016-08-05 | 1301.504387 | 2920.424937 | 1940.149954 | 8694.134833
... | ... | ... | ... | ...
2016-12-27 | 1667.998267 | 1108.052845 | 2121.157763 | NaN
2016-12-28 | 1579.306861 | 1111.236661 | 2050.252915 | NaN
2016-12-29 | 1487.230722 | 1113.581933 | 2063.309008 | NaN
2016-12-30 | 1481.331642 | 1132.535774 | 2089.261345 | NaN
2016-12-31 | 1393.128313 | 1106.034061 | 2064.475030 | NaN

153 rows × 4 columns
Note that if a series has no observations in the test set, the backtesting process does not return any predictions for that series and its metric is NaN, as happened with series 'id_1002'.
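If the NaN metric is undesirable, series without observations in the test period can be filtered out before backtesting. A minimal sketch (the dictionary comprehension is illustrative, not a skforecast feature):

# Keep only series with observations after the end of the training period
# ==============================================================================
series_dict_bt = {
    k: v
    for k, v in series_dict.items()
    if v.index.max() > pd.Timestamp(end_train)
}
print(list(series_dict_bt.keys()))  # 'id_1002' is excluded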
# Plot backtesting predictions
# ==============================================================================
fig, axs = plt.subplots(5, 1, figsize=(8, 4), sharex=True)
for i, s in enumerate(series_dict.keys()):
    axs[i].plot(series_dict[s], label=series_dict[s].name, color=colors[i])
    axs[i].axvline(pd.to_datetime(end_train), color='white', linestyle='--', linewidth=1)  # End train
    try:
        axs[i].plot(backtest_predictions[s], label='prediction', color="white")
    except KeyError:  # No backtest predictions for this series (e.g. 'id_1002')
        pass
    axs[i].legend(loc='upper right', fontsize=8)
    axs[i].tick_params(axis='both', labelsize=8)
fig.suptitle('Backtest Predictions', fontsize=15)
plt.tight_layout()