Data transformation and pipelines¶
All skforecast forecasters include two arguments that allow detailed control over input data transformations. This feature is particularly useful since many machine learning models require specific data pre-processing. For example, linear models may benefit from scaling the features, or from transforming categorical features into numerical values.
transformer_y
: an instance of a transformer (preprocessor) compatible with the scikit-learn preprocessing API, with the methods fit, transform, fit_transform and inverse_transform. Scikit-learn's ColumnTransformer is not allowed here because it does not have an inverse_transform method.

transformer_exog
: an instance of a transformer (preprocessor) compatible with the scikit-learn preprocessing API. Scikit-learn's ColumnTransformer can be used when the preprocessing transformations only apply to some specific columns, or when different columns need different transformations, for example, scaling the numeric features and one-hot encoding the categorical ones.
Transformations are learned and applied before training the forecaster and are automatically used when calling predict. The output of predict is always on the same scale as the original series y.
Although skforecast has allowed using scikit-learn pipelines as regressors since version 0.4.0, it is recommended to use transformer_y and transformer_exog instead.
Libraries¶
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
from lightgbm import LGBMRegressor
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import MinMaxScaler
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregMultiSeries import ForecasterAutoregMultiSeries
from skforecast.model_selection import grid_search_forecaster
from skforecast.datasets import fetch_dataset
Data¶
# Download data
# ==============================================================================
data = fetch_dataset("h2o_exog")
h2o_exog
--------
Monthly expenditure ($AUD) on corticosteroid drugs that the Australian health system had between 1991 and 2008. Two additional variables (exog_1, exog_2) are simulated.
Hyndman R (2023). fpp3: Data for Forecasting: Principles and Practice (3rd Edition). http://pkg.robjhyndman.com/fpp3package/, https://github.com/robjhyndman/fpp3package, http://OTexts.com/fpp3.
Shape of the dataset: (195, 3)
# Data preprocessing
# ==============================================================================
data.index.name = 'date'
# Add an extra categorical variable
data['exog_3'] = ["A"] * (len(data) // 2) + ["B"] * (len(data) - len(data) // 2)
data.head()
| date | y | exog_1 | exog_2 | exog_3 |
|---|---|---|---|---|
| 1992-04-01 | 0.379808 | 0.958792 | 1.166029 | A |
| 1992-05-01 | 0.361801 | 0.951993 | 1.117859 | A |
| 1992-06-01 | 0.410534 | 0.952955 | 1.067942 | A |
| 1992-07-01 | 0.483389 | 0.958078 | 1.097376 | A |
| 1992-08-01 | 0.475463 | 0.956370 | 1.122199 | A |
Transforming input series¶
The following example shows how to include a transformer that scales the input series y.
# Create and fit forecaster that scales the input series
# ==============================================================================
forecaster = ForecasterAutoreg(
regressor = Ridge(random_state=123),
lags = 3,
transformer_y = StandardScaler(),
transformer_exog = None
)
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
forecaster
=================
ForecasterAutoreg
=================
Regressor: Ridge(random_state=123)
Lags: [1 2 3]
Transformer for y: StandardScaler()
Transformer for exog: None
Window size: 3
Weight function included: False
Differentiation order: None
Exogenous included: True
Exogenous variables names: ['exog_1', 'exog_2']
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2008-06-01 00:00:00')]
Training index type: DatetimeIndex
Training index frequency: MS
Regressor parameters: {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'positive': False, 'random_state': 123, 'solver': 'auto', 'tol': 0.0001}
fit_kwargs: {}
Creation date: 2024-08-13 20:55:49
Last fit date: 2024-08-13 20:55:49
Skforecast version: 0.13.0
Python version: 3.12.4
Forecaster id: None
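A quick way to verify the transformation is to round-trip the series through the fitted transformer. This is a minimal sketch; it assumes that, as in recent skforecast versions, the fitted transformer remains accessible as the forecaster's transformer_y attribute.

# Round-trip check of the fitted transformer (sketch)
# ==============================================================================
# Assumption: the fitted StandardScaler is exposed as `forecaster.transformer_y`.
y_scaled = forecaster.transformer_y.transform(data[['y']])
y_restored = forecaster.transformer_y.inverse_transform(y_scaled)
print(np.allclose(y_restored.ravel(), data['y'].to_numpy()))  # expected: True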
Transforming exogenous variables¶
The following example shows how to apply the same transformation (scaling) to all exogenous variables.
# Create and fit forecaster with the same transformation for all exogenous variables
# ==============================================================================
forecaster = ForecasterAutoreg(
regressor = Ridge(random_state=123),
lags = 3,
transformer_y = None,
transformer_exog = StandardScaler()
)
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
forecaster
=================
ForecasterAutoreg
=================
Regressor: Ridge(random_state=123)
Lags: [1 2 3]
Transformer for y: None
Transformer for exog: StandardScaler()
Window size: 3
Weight function included: False
Differentiation order: None
Exogenous included: True
Exogenous variables names: ['exog_1', 'exog_2']
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2008-06-01 00:00:00')]
Training index type: DatetimeIndex
Training index frequency: MS
Regressor parameters: {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'positive': False, 'random_state': 123, 'solver': 'auto', 'tol': 0.0001}
fit_kwargs: {}
Creation date: 2024-08-13 20:55:50
Last fit date: 2024-08-13 20:55:50
Skforecast version: 0.13.0
Python version: 3.12.4
Forecaster id: None
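The statistics learned by the exogenous scaler can also be inspected after fitting. A minimal sketch, assuming the fitted scaler is exposed as the forecaster's transformer_exog attribute:

# Statistics learned by the exogenous scaler (sketch)
# ==============================================================================
# Assumption: the fitted StandardScaler is exposed as `forecaster.transformer_exog`.
print(forecaster.transformer_exog.mean_)   # per-column means of exog_1, exog_2
print(forecaster.transformer_exog.scale_)  # per-column standard deviations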
It is also possible to apply a different transformation to each exogenous variable by making use of ColumnTransformer.
# Create and fit forecaster with different transformations for each exog variable
# ==============================================================================
transformer_exog = ColumnTransformer(
[('scale_1', StandardScaler(), ['exog_1']),
('scale_2', StandardScaler(), ['exog_2']),
('onehot', OneHotEncoder(), ['exog_3']),
],
remainder = 'passthrough',
verbose_feature_names_out = False
)
forecaster = ForecasterAutoreg(
regressor = Ridge(random_state=123),
lags = 3,
transformer_y = None,
transformer_exog = transformer_exog
)
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2', 'exog_3']])
forecaster
=================
ForecasterAutoreg
=================
Regressor: Ridge(random_state=123)
Lags: [1 2 3]
Transformer for y: None
Transformer for exog: ColumnTransformer(remainder='passthrough', transformers=[('scale_1', StandardScaler(), ['exog_1']), ('scale_2', StandardScaler(), ['exog_2']), ('onehot', OneHotEncoder(), ['exog_3'])], verbose_feature_names_out=False)
Window size: 3
Weight function included: False
Differentiation order: None
Exogenous included: True
Exogenous variables names: ['exog_1', 'exog_2', 'exog_3']
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2008-06-01 00:00:00')]
Training index type: DatetimeIndex
Training index frequency: MS
Regressor parameters: {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'positive': False, 'random_state': 123, 'solver': 'auto', 'tol': 0.0001}
fit_kwargs: {}
Creation date: 2024-08-13 20:55:50
Last fit date: 2024-08-13 20:55:50
Skforecast version: 0.13.0
Python version: 3.12.4
Forecaster id: None
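Before examining the training matrices, the fitted ColumnTransformer itself can report the features it generates. A minimal sketch using the standard scikit-learn get_feature_names_out API (the exact output names and order shown in the comment are illustrative):

# Features generated by the fitted ColumnTransformer (sketch)
# ==============================================================================
# Includes the one-hot encoded levels of exog_3.
print(forecaster.transformer_exog.get_feature_names_out())
# e.g. ['exog_1' 'exog_2' 'exog_3_A' 'exog_3_B']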
It is possible to verify whether the data transformations have been applied correctly by examining the training matrices, which should reflect the transformations specified through the transformer_y or transformer_exog arguments.
# Training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
y = data['y'],
exog = data[['exog_1', 'exog_2', 'exog_3']]
)
X_train.head(4)
| date | lag_1 | lag_2 | lag_3 | exog_1 | exog_2 | exog_3_A | exog_3_B |
|---|---|---|---|---|---|---|---|
| 1992-07-01 | 0.410534 | 0.361801 | 0.379808 | -2.119529 | -2.135088 | 1.0 | 0.0 |
| 1992-08-01 | 0.483389 | 0.410534 | 0.361801 | -2.131024 | -1.996017 | 1.0 | 0.0 |
| 1992-09-01 | 0.475463 | 0.483389 | 0.410534 | -2.109222 | -1.822392 | 1.0 | 0.0 |
| 1992-10-01 | 0.534761 | 0.475463 | 0.483389 | -2.132137 | -1.590667 | 1.0 | 0.0 |
y_train.head(4)
date
1992-07-01    0.483389
1992-08-01    0.475463
1992-09-01    0.534761
1992-10-01    0.568606
Freq: MS, Name: y, dtype: float64
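As an additional check, the scaled exogenous columns of the training matrix should be approximately standardized; small deviations are expected because the first rows are dropped when creating the lags. A minimal sketch:

# Approximate standardization of the scaled exog columns (sketch)
# ==============================================================================
# Mean should be close to 0 and standard deviation close to 1; deviations come
# from the rows dropped to build the lag features.
print(X_train[['exog_1', 'exog_2']].agg(['mean', 'std']).round(2))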
Custom transformers¶
Scikit-learn's FunctionTransformer can be used to incorporate custom transformers, such as a logarithmic transformation, in the forecaster object. To implement this, a user-defined transformation function is created and then passed to the FunctionTransformer. Detailed information on how to use FunctionTransformer can be found in the scikit-learn documentation.
⚠ Warning
For scikit-learn versions from 1.1.0 up to and including 1.2.0, sklearn.preprocessing.FunctionTransformer.inverse_transform does not support DataFrames that are all numerical when check_inverse=True. It raises an exception that was fixed in scikit-learn 1.2.1.
More info: https://scikit-learn.org/stable/whats_new/v1.2.html#version-1-2-1
# Create custom transformer
# =============================================================================
def log_transform(x):
    """
    Calculate log adding 1 to avoid calculation errors if x is very close to 0.
    """
    return np.log(x + 1)

def exp_transform(x):
    """
    Inverse of log_transform.
    """
    return np.exp(x) - 1

transformer_y = FunctionTransformer(func=log_transform, inverse_func=exp_transform)
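As a side note, NumPy already provides np.log1p and np.expm1, which compute log(x + 1) and its inverse with better numerical precision near zero. An equivalent sketch:

# Equivalent transformer using numpy built-ins (sketch)
# ==============================================================================
# np.log1p and np.expm1 implement log(x + 1) and exp(x) - 1 with better
# numerical precision for values close to zero.
transformer_y_log1p = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)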
# Create and train forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
regressor = Ridge(random_state=123),
lags = 3,
transformer_y = transformer_y
)
forecaster.fit(y=data['y'])
If the FunctionTransformer has an inverse function, the output of the predict method is automatically transformed back to the original scale.
forecaster.predict(steps=4)
2008-07-01    0.776206
2008-08-01    0.775471
2008-09-01    0.777200
2008-10-01    0.777853
Freq: MS, Name: pred, dtype: float64
Scikit-learn pipelines¶
⚠ Warning
Starting from version 0.4.0, skforecast allows using scikit-learn pipelines as regressors. Note that ColumnTransformer cannot be included in the pipeline, so the same transformation is applied to both the modeled series and all exogenous variables. If the preprocessing transformations only apply to specific columns, they must be applied separately through transformer_y and transformer_exog.
pipe = make_pipeline(StandardScaler(), Ridge())
pipe
Pipeline(steps=[('standardscaler', StandardScaler()), ('ridge', Ridge())])
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
regressor = pipe,
lags = 10
)
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
forecaster
=================
ForecasterAutoreg
=================
Regressor: Pipeline(steps=[('standardscaler', StandardScaler()), ('ridge', Ridge())])
Lags: [ 1  2  3  4  5  6  7  8  9 10]
Transformer for y: None
Transformer for exog: None
Window size: 10
Weight function included: False
Differentiation order: None
Exogenous included: True
Exogenous variables names: ['exog_1', 'exog_2']
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2008-06-01 00:00:00')]
Training index type: DatetimeIndex
Training index frequency: MS
Regressor parameters: {'standardscaler__copy': True, 'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'ridge__alpha': 1.0, 'ridge__copy_X': True, 'ridge__fit_intercept': True, 'ridge__max_iter': None, 'ridge__positive': False, 'ridge__random_state': None, 'ridge__solver': 'auto', 'ridge__tol': 0.0001}
fit_kwargs: {}
Creation date: 2024-08-13 20:55:51
Last fit date: 2024-08-13 20:55:51
Skforecast version: 0.13.0
Python version: 3.12.4
Forecaster id: None
When performing a grid search over a scikit-learn pipeline, each hyperparameter name must be prefixed with the name of the pipeline step that contains the model, for example, ridge__alpha.
# Hyperparameter grid search using a scikit-learn pipeline
# ==============================================================================
pipe = make_pipeline(StandardScaler(), Ridge())
forecaster = ForecasterAutoreg(
regressor = pipe,
lags = 10 # This value will be replaced in the grid search
)
# Regressor's hyperparameters
param_grid = {'ridge__alpha': np.logspace(-3, 5, 10)}
# Lags used as predictors
lags_grid = [5, 24, [1, 2, 3, 23, 24]]
results_grid = grid_search_forecaster(
forecaster = forecaster,
y = data['y'],
exog = data[['exog_1', 'exog_2']],
param_grid = param_grid,
lags_grid = lags_grid,
steps = 5,
metric = 'mean_absolute_error',
refit = False,
initial_train_size = len(data.loc[:'2000-04-01']),
return_best = True,
verbose = False,
show_progress = True
)
results_grid.head(4)
Number of models compared: 30.
`Forecaster` refitted using the best-found lags and parameters, and the whole data set:
  Lags: [1 2 3 4 5]
  Parameters: {'ridge__alpha': np.float64(0.001)}
  Backtesting metric: 6.845311709559406e-05
| | lags | lags_label | params | mean_absolute_error | ridge__alpha |
|---|---|---|---|---|---|
| 0 | [1, 2, 3, 4, 5] | [1, 2, 3, 4, 5] | {'ridge__alpha': 0.001} | 0.000068 | 0.001000 |
| 10 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'ridge__alpha': 0.001} | 0.000188 | 0.001000 |
| 1 | [1, 2, 3, 4, 5] | [1, 2, 3, 4, 5] | {'ridge__alpha': 0.007742636826811269} | 0.000526 | 0.007743 |
| 11 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'ridge__alpha': 0.007742636826811269} | 0.001413 | 0.007743 |
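Because return_best=True, the forecaster object is already refitted with the best configuration found during the search. A minimal sketch to inspect it:

# Best configuration stored in the refitted forecaster (sketch)
# ==============================================================================
print(forecaster.lags)                                     # best lags found
print(forecaster.regressor.get_params()['ridge__alpha'])   # best alpha found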
Transforming multiple input series in global models¶
When using global forecasting models (ForecasterAutoregMultiSeries or ForecasterAutoregMultiVariate), the transformer_series argument replaces transformer_y. Three different options are available, summarized in the sketch after this list:

- transformer_series is a single transformer: the transformer is automatically cloned for each individual series, and each clone is trained separately on one of the series.
- transformer_series is a dictionary: a different transformer can be specified for each series by passing a dictionary whose keys are the series names and whose values are the transformers. Each series is transformed according to its designated transformer. When this option is used, it is mandatory to include a transformer for unknown series, indicated by the key '_unknown_level'.
- transformer_series is None: no transformations are applied to any of the series.
Regardless of the configuration, each series is transformed independently. Even when using a single transformer, it is cloned internally and applied separately to each series.
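The following schematic sketch summarizes the three accepted configurations (the series names are illustrative; the actual data is downloaded in the next cell):

# Accepted configurations for `transformer_series` (schematic sketch)
# ==============================================================================
transformer_series = StandardScaler()    # single transformer, cloned and fitted per series
transformer_series = {                   # one transformer per series, keyed by series name
    'item_1': StandardScaler(),
    'item_2': MinMaxScaler(),
    '_unknown_level': StandardScaler()   # mandatory entry for unknown series
}
transformer_series = None                # no transformation applied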
# Data download
# ==============================================================================
data = fetch_dataset(name="items_sales")
data.head()
items_sales
-----------
Simulated time series for the sales of 3 different items.
Simulated data.
Shape of the dataset: (1097, 3)
| date | item_1 | item_2 | item_3 |
|---|---|---|---|
| 2012-01-01 | 8.253175 | 21.047727 | 19.429739 |
| 2012-01-02 | 22.777826 | 26.578125 | 28.009863 |
| 2012-01-03 | 27.549099 | 31.751042 | 32.078922 |
| 2012-01-04 | 25.895533 | 24.567708 | 27.252276 |
| 2012-01-05 | 21.379238 | 18.191667 | 20.357737 |
# Series transformation: same transformation for all series
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = LGBMRegressor(random_state=123, verbose=-1),
lags = 24,
encoding = 'ordinal',
transformer_series = StandardScaler(),
transformer_exog = None
)
forecaster.fit(series=data)
The fitted transformers for each series can be accessed through the transformer_series_ attribute. This allows verification that each transformer has been trained independently.
# Mean and scale of the transformer for each series
# ==============================================================================
for k, v in forecaster.transformer_series_.items():
    print(f"Series {k}: {v} mean={v.mean_}, scale={v.scale_}")
Series item_1: StandardScaler() mean=[22.37366364], scale=[2.54258317]
Series item_2: StandardScaler() mean=[16.26942518], scale=[4.89965692]
Series item_3: StandardScaler() mean=[17.19276546], scale=[5.43694388]
Series _unknown_level: StandardScaler() mean=[18.61195143], scale=[5.21803675]
# Series transformation: different transformation for each series
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(
regressor = LGBMRegressor(random_state=123, verbose=-1),
lags = 24,
encoding = 'ordinal',
transformer_series = {
    'item_1': StandardScaler(),
    'item_2': MinMaxScaler(),
    '_unknown_level': StandardScaler()
},
transformer_exog = None
)
forecaster.fit(series=data)
/home/ubuntu/anaconda3/envs/skforecast_13_py12/lib/python3.12/site-packages/skforecast/utils/utils.py:255: IgnoredArgumentWarning: {'item_3'} not present in `transformer_series`. No transformation is applied to these series.
You can suppress this warning using: warnings.simplefilter('ignore', category=IgnoredArgumentWarning)
  warnings.warn(
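As the warning text itself suggests, it can be silenced. A minimal sketch; IgnoredArgumentWarning is importable from skforecast.exceptions:

# Suppress the IgnoredArgumentWarning (sketch)
# ==============================================================================
import warnings
from skforecast.exceptions import IgnoredArgumentWarning
warnings.simplefilter('ignore', category=IgnoredArgumentWarning)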
# Transformer trained for each series
# ==============================================================================
for k, v in forecaster.transformer_series_.items():
    if v is not None:
        print(f"Series {k}: {v.get_params()}")
    else:
        print(f"Series {k}: {v}")
Series item_1: {'copy': True, 'with_mean': True, 'with_std': True}
Series item_2: {'clip': False, 'copy': True, 'feature_range': (0, 1)}
Series item_3: None
Series _unknown_level: {'copy': True, 'with_mean': True, 'with_std': True}