Forecasting with scikit-learn transformers and pipelines

Since version 0.5.0, all skforecast forecasters include two arguments that give detailed control over input transformations. This is useful because many machine learning models require specific data preprocessing. For example, linear models with Ridge or Lasso regularization benefit from scaled features.

transformer_y: an instance of a transformer (preprocessor) compatible with the scikit-learn preprocessing API, with the methods fit, transform, fit_transform and inverse_transform. scikit-learn ColumnTransformer is not allowed since it does not have an inverse_transform method.
transformer_exog: an instance of a transformer (preprocessor) compatible with the scikit-learn preprocessing API. scikit-learn ColumnTransformer can be used if the transformations apply only to specific columns or if different columns need different transformations, for example, scaling numeric features and one-hot encoding categorical ones.

Transformations are learned and applied before training the forecaster and are automatically used when calling predict. The output of predict is always on the same scale as the original series y.
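
As a rough illustration of what this means in practice, the sketch below reproduces the scale-then-invert cycle with scikit-learn alone. It is not skforecast's internal code: the toy values and the `pred_scaled` array (standing in for a model's output) are assumptions made for the example.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy series standing in for y (illustrative values, not the h2o data)
y = np.array([10.0, 12.0, 14.0, 16.0, 18.0]).reshape(-1, 1)

# The transformation is learned from the training data and applied to it
scaler = StandardScaler()
y_scaled = scaler.fit_transform(y)

# A model trained on y_scaled predicts in the scaled space.
# pred_scaled is a hypothetical model output, not a real forecast.
pred_scaled = np.array([[0.5], [1.0]])

# Predictions are mapped back to the original units of y
pred_original = scaler.inverse_transform(pred_scaled)
print(pred_original.ravel())
```

The same fitted object performs both directions, which is why transformer_y needs an inverse_transform method.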

Although, since version 0.4.0, skforecast allows using scikit-learn pipelines as regressors, it is recommended to use transformer_y and transformer_exog instead.

Note

When using ForecasterAutoregMultiSeries or ForecasterAutoregMultiVariate, the transformer_series argument replaces transformer_y. If it is a single transformer, the same transformation is applied to all series. If it is a dict, a different transformation can be set for each series.
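
The dict form can be pictured as one transformer per series. The following sketch uses only pandas and scikit-learn. The series names, the values, the use of column names as dict keys and the choice of MinMaxScaler are all assumptions for illustration, and the sketch does not show how skforecast applies the dict internally.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler

# Two toy series on very different scales (illustrative values)
series = pd.DataFrame({
    'series_1': [1.0, 2.0, 3.0, 4.0],
    'series_2': [100.0, 300.0, 200.0, 500.0],
})

# One transformer per series, keyed by column name (assumed convention)
transformers = {
    'series_1': StandardScaler(),
    'series_2': MinMaxScaler(),
}

# Each transformer is fitted only on its own series
transformed = pd.DataFrame(
    {
        name: transformers[name].fit_transform(series[[name]]).ravel()
        for name in series.columns
    },
    index=series.index,
)
print(transformed)
```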

Libraries

In [1]:

            
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import make_pipeline
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import grid_search_forecaster

Data

In [2]:

            
# Download data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o_exog.csv')
data = pd.read_csv(url, sep=',', header=0, names=['date', 'y', 'exog_1', 'exog_2'])

# Data preprocessing
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%Y/%m/%d')
data = data.set_index('date')
data = data.asfreq('MS')
# Add an extra categorical variable
data['exog_3'] = (["A"] * int(len(data)/2)) + (["B"] * (int(len(data)/2) +1))
data.head()

Out[2]:

	y	exog_1	exog_2	exog_3
date
1992-04-01	0.379808	0.958792	1.166029	A
1992-05-01	0.361801	0.951993	1.117859	A
1992-06-01	0.410534	0.952955	1.067942	A
1992-07-01	0.483389	0.958078	1.097376	A
1992-08-01	0.475463	0.956370	1.122199	A

Transforming input series

The following example shows how to scale the input series y.

In [3]:

            
# Create and fit forecaster scaling the input series
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor        = Ridge(),
                 lags             = 3,
                 transformer_y    = StandardScaler(),
                 transformer_exog = None
             )

forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
forecaster

Out[3]:

================= 
ForecasterAutoreg 
================= 
Regressor: Ridge() 
Lags: [1 2 3] 
Transformer for y: StandardScaler() 
Transformer for exog: None 
Window size: 3 
Weight function included: False 
Exogenous included: True 
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'> 
Exogenous variables names: ['exog_1', 'exog_2'] 
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2008-06-01 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: MS 
Regressor parameters: {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'normalize': 'deprecated', 'positive': False, 'random_state': None, 'solver': 'auto', 'tol': 0.001} 
Creation date: 2022-11-29 15:55:58 
Last fit date: 2022-11-29 15:55:58 
Skforecast version: 0.6.0 
Python version: 3.9.13

Transforming exogenous variables

The following example shows how to apply the same transformation to all exogenous variables.

In [4]:

            
# Create and fit forecaster scaling all exogenous variables
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor        = Ridge(),
                 lags             = 3,
                 transformer_y    = None,
                 transformer_exog = StandardScaler()
             )

forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
forecaster

Out[4]:

================= 
ForecasterAutoreg 
================= 
Regressor: Ridge() 
Lags: [1 2 3] 
Transformer for y: None 
Transformer for exog: StandardScaler() 
Window size: 3 
Weight function included: False 
Exogenous included: True 
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'> 
Exogenous variables names: ['exog_1', 'exog_2'] 
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2008-06-01 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: MS 
Regressor parameters: {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'normalize': 'deprecated', 'positive': False, 'random_state': None, 'solver': 'auto', 'tol': 0.001} 
Creation date: 2022-11-29 15:55:58 
Last fit date: 2022-11-29 15:55:58 
Skforecast version: 0.6.0 
Python version: 3.9.13

It is also possible to apply a different transformation to each exogenous variable making use of ColumnTransformer.

In [5]:

            
# Create and fit forecaster with different transformation for each exog variable
# ==============================================================================
transformer_exog = ColumnTransformer(
                       [('scale_1', StandardScaler(), ['exog_1']),
                        ('scale_2', StandardScaler(), ['exog_2']),
                        ('onehot', OneHotEncoder(), ['exog_3']),
                       ],
                       remainder = 'passthrough',
                       verbose_feature_names_out = False
                   )

forecaster = ForecasterAutoreg(
                 regressor        = Ridge(),
                 lags             = 3,
                 transformer_y    = None,
                 transformer_exog = transformer_exog
             )

forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2', 'exog_3']])
forecaster

Out[5]:

================= 
ForecasterAutoreg 
================= 
Regressor: Ridge() 
Lags: [1 2 3] 
Transformer for y: None 
Transformer for exog: ColumnTransformer(remainder='passthrough',
                  transformers=[('scale_1', StandardScaler(), ['exog_1']),
                                ('scale_2', StandardScaler(), ['exog_2']),
                                ('onehot', OneHotEncoder(), ['exog_3'])],
                  verbose_feature_names_out=False) 
Window size: 3 
Weight function included: False 
Exogenous included: True 
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'> 
Exogenous variables names: ['exog_1', 'exog_2', 'exog_3'] 
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2008-06-01 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: MS 
Regressor parameters: {'alpha': 1.0, 'copy_X': True, 'fit_intercept': True, 'max_iter': None, 'normalize': 'deprecated', 'positive': False, 'random_state': None, 'solver': 'auto', 'tol': 0.001} 
Creation date: 2022-11-29 15:55:58 
Last fit date: 2022-11-29 15:55:58 
Skforecast version: 0.6.0 
Python version: 3.9.13

It can be seen that the data transformation is applied when creating the training matrices.

In [6]:

            
X_train, y_train = forecaster.create_train_X_y(
                       y    = data['y'],
                       exog = data[['exog_1', 'exog_2', 'exog_3']]
                   )

In [7]:

            
X_train.head(4)

Out[7]:

	lag_1	lag_2	lag_3	exog_1	exog_2	exog_3_A	exog_3_B
date
1992-07-01	0.410534	0.361801	0.379808	-2.119529	-2.135088	1.0	0.0
1992-08-01	0.483389	0.410534	0.361801	-2.131024	-1.996017	1.0	0.0
1992-09-01	0.475463	0.483389	0.410534	-2.109222	-1.822392	1.0	0.0
1992-10-01	0.534761	0.475463	0.483389	-2.132137	-1.590667	1.0	0.0

In [8]:

            
y_train.head(4)

Out[8]:

date
1992-07-01    0.483389
1992-08-01    0.475463
1992-09-01    0.534761
1992-10-01    0.568606
Freq: MS, Name: y, dtype: float64
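
The lag columns in the training matrix can be reproduced by hand, which helps when checking that values line up as expected. The sketch below is an approximation with pandas only, not skforecast's implementation. It uses the first five values of y from the data preview and no transformer for y, matching the forecaster above.

```python
import pandas as pd

# First five values of y, copied from the data preview
y = pd.Series(
    [0.379808, 0.361801, 0.410534, 0.483389, 0.475463],
    index=pd.date_range('1992-04-01', periods=5, freq='MS'),
    name='y',
)

# lag_k at time t holds the value of y observed k steps earlier
X = pd.concat({f'lag_{k}': y.shift(k) for k in (1, 2, 3)}, axis=1)

# The first 3 rows lack a complete set of lags and are dropped
X = X.dropna()
target = y.loc[X.index]

print(X)
print(target)
```

The first row (1992-07-01) contains the June, May and April values as lag_1, lag_2 and lag_3, and its target is the July value, in agreement with the outputs of X_train and y_train.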

Custom transformers

Using scikit-learn FunctionTransformer it is possible to include custom transformers in the forecaster object, for example, a logarithmic transformation.

In [9]:

            
# Create custom transformer
# =============================================================================
def log_transform(x):
    """ 
    Calculate log adding 1 to avoid calculation errors if x is very close to 0.
    """
    return np.log(x+1)

def exp_transform(x):
    """
    Inverse of log_transform.
    """
    return np.exp(x) - 1

transformer_y = FunctionTransformer(func=log_transform, inverse_func=exp_transform)

# Create forecaster and train
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor        = Ridge(),
                 lags             = 3,
                 transformer_y    = transformer_y
             )

forecaster.fit(y=data['y'])

If the FunctionTransformer has an inverse function, the output of the predict method is automatically transformed back to the original scale.

In [10]:

            
forecaster.predict(steps=4)

Out[10]:

2008-07-01    0.776206
2008-08-01    0.775471
2008-09-01    0.777200
2008-10-01    0.777853
Freq: MS, Name: pred, dtype: float64
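
The round trip behind this behaviour can be checked with scikit-learn alone. The sketch below is one possible variant, not the code used above: it swaps the hand-written functions for NumPy's `np.log1p` and `np.expm1`, which compute log(1 + x) and exp(x) - 1, and the input values are invented for the example.

```python
import numpy as np
from sklearn.preprocessing import FunctionTransformer

transformer = FunctionTransformer(
    func=np.log1p,
    inverse_func=np.expm1,
    check_inverse=True,  # scikit-learn checks the pair on a data subset at fit time
)

# Toy values standing in for the series (illustrative only)
y = np.array([[0.0], [0.5], [1.0], [2.0]])

y_log = transformer.fit_transform(y)           # forward: log(1 + y)
y_back = transformer.inverse_transform(y_log)  # backward: exp(y_log) - 1

print(np.allclose(y_back, y))
```

If `inverse_func` is omitted, FunctionTransformer falls back to the identity for the inverse step, so values would not be mapped back to the original scale.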

Pipeline

Warning

Since version 0.4.0, skforecast allows using scikit-learn pipelines as regressors. However, it does not allow including ColumnTransformer in the pipeline, so the same transformation is applied to the modeled series y and all exogenous variables. If the preprocessing transformations only apply to some specific columns, they have to be applied using transformer_y and transformer_exog.

In [11]:

            
pipe = make_pipeline(StandardScaler(), Ridge())
pipe

Out[11]:

Pipeline(steps=[('standardscaler', StandardScaler()), ('ridge', Ridge())])


In [12]:

            
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = pipe,
                 lags      = 10
             )

forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
forecaster

Out[12]:

================= 
ForecasterAutoreg 
================= 
Regressor: Pipeline(steps=[('standardscaler', StandardScaler()), ('ridge', Ridge())]) 
Lags: [ 1  2  3  4  5  6  7  8  9 10] 
Transformer for y: None 
Transformer for exog: None 
Window size: 10 
Weight function included: False 
Exogenous included: True 
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'> 
Exogenous variables names: ['exog_1', 'exog_2'] 
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2008-06-01 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: MS 
Regressor parameters: {'standardscaler__copy': True, 'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'ridge__alpha': 1.0, 'ridge__copy_X': True, 'ridge__fit_intercept': True, 'ridge__max_iter': None, 'ridge__normalize': 'deprecated', 'ridge__positive': False, 'ridge__random_state': None, 'ridge__solver': 'auto', 'ridge__tol': 0.001} 
Creation date: 2022-11-29 15:55:58 
Last fit date: 2022-11-29 15:55:58 
Skforecast version: 0.6.0 
Python version: 3.9.13
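
Why the pipeline cannot treat columns differently becomes clearer when the pipeline is fitted outside the forecaster. In the sketch below, the predictor matrix is a hypothetical stand-in for what a forecaster would pass to the regressor (three lag columns followed by two exogenous columns). The random data and the column layout are assumptions for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Hypothetical predictor matrix: 3 lag columns, then 2 exog columns
X = rng.normal(
    loc=[0, 0, 0, 50, 1000],
    scale=[1, 1, 1, 5, 100],
    size=(30, 5),
)
y = X[:, 0] + 0.01 * X[:, 3] + rng.normal(size=30)

pipe = make_pipeline(StandardScaler(), Ridge())
pipe.fit(X, y)

# The scaler learns one mean per column, with no notion of lag vs. exog
scaler = pipe.named_steps['standardscaler']
print(scaler.mean_.shape)

# The pipeline only transforms X, so predictions come out in the units of y
print(pipe.predict(X[:2]))
```

The scaler receives a single matrix and applies the same transformation to every column, which is the limitation that transformer_y and transformer_exog avoid.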

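When performing a grid search over a scikit-learn pipeline, each parameter name must be prefixed with the name of the pipeline step it belongs to, separated by a double underscore (for example, ridge__alpha).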

In [13]:

            
# Hyperparameter Grid search
# ==============================================================================
pipe = make_pipeline(StandardScaler(), Ridge())
forecaster = ForecasterAutoreg(
                 regressor = pipe,
                 lags = 10  # This value will be replaced in the grid search
             )

# Regressor's hyperparameters
param_grid = {'ridge__alpha': np.logspace(-3, 5, 10)}

# Lags used as predictors
lags_grid = [5, 24, [1, 2, 3, 23, 24]]

results_grid = grid_search_forecaster(
                   forecaster  = forecaster,
                   y           = data['y'],
                   exog        = data[['exog_1', 'exog_2']],
                   param_grid  = param_grid,
                   lags_grid   = lags_grid,
                   steps       = 5,
                   metric      = 'mean_absolute_error',
                   refit       = False,
                   initial_train_size = len(data.loc[:'2000-04-01']),
                   return_best = True,
                   verbose     = False
               )

Number of models compared: 30.


`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [1 2 3 4 5] 
  Parameters: {'ridge__alpha': 0.001}
  Backtesting metric: 6.845311709709172e-05

In [14]:

            
results_grid

Out[14]:

	lags	params	mean_absolute_error	ridge__alpha
0	[1, 2, 3, 4, 5]	{'ridge__alpha': 0.001}	0.000068	0.001000
10	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 0.001}	0.000188	0.001000
1	[1, 2, 3, 4, 5]	{'ridge__alpha': 0.007742636826811269}	0.000526	0.007743
11	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 0.007742636826811269}	0.001413	0.007743
2	[1, 2, 3, 4, 5]	{'ridge__alpha': 0.05994842503189409}	0.003860	0.059948
12	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 0.05994842503189409}	0.008969	0.059948
3	[1, 2, 3, 4, 5]	{'ridge__alpha': 0.46415888336127775}	0.021751	0.464159
13	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 0.46415888336127775}	0.029505	0.464159
14	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 3.593813663804626}	0.046323	3.593814
23	[1, 2, 3, 23, 24]	{'ridge__alpha': 0.46415888336127775}	0.060623	0.464159
22	[1, 2, 3, 23, 24]	{'ridge__alpha': 0.05994842503189409}	0.061567	0.059948
21	[1, 2, 3, 23, 24]	{'ridge__alpha': 0.007742636826811269}	0.061747	0.007743
20	[1, 2, 3, 23, 24]	{'ridge__alpha': 0.001}	0.061771	0.001000
24	[1, 2, 3, 23, 24]	{'ridge__alpha': 3.593813663804626}	0.063512	3.593814
15	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 27.825594022071257}	0.064551	27.825594
4	[1, 2, 3, 4, 5]	{'ridge__alpha': 3.593813663804626}	0.069220	3.593814
25	[1, 2, 3, 23, 24]	{'ridge__alpha': 27.825594022071257}	0.077934	27.825594
16	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 215.44346900318823}	0.130016	215.443469
5	[1, 2, 3, 4, 5]	{'ridge__alpha': 27.825594022071257}	0.143189	27.825594
26	[1, 2, 3, 23, 24]	{'ridge__alpha': 215.44346900318823}	0.146446	215.443469
17	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 1668.1005372000557}	0.204469	1668.100537
6	[1, 2, 3, 4, 5]	{'ridge__alpha': 215.44346900318823}	0.205496	215.443469
27	[1, 2, 3, 23, 24]	{'ridge__alpha': 1668.1005372000557}	0.212896	1668.100537
18	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 12915.496650148827}	0.227536	12915.496650
28	[1, 2, 3, 23, 24]	{'ridge__alpha': 12915.496650148827}	0.228974	12915.496650
19	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 100000.0}	0.231157	100000.000000
29	[1, 2, 3, 23, 24]	{'ridge__alpha': 100000.0}	0.231356	100000.000000
7	[1, 2, 3, 4, 5]	{'ridge__alpha': 1668.1005372000557}	0.236227	1668.100537
8	[1, 2, 3, 4, 5]	{'ridge__alpha': 12915.496650148827}	0.244788	12915.496650
9	[1, 2, 3, 4, 5]	{'ridge__alpha': 100000.0}	0.246091	100000.000000
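
The `ridge__alpha` key used in the grid follows scikit-learn's convention for nested parameters. The short sketch below, using scikit-learn only, shows where the prefix comes from and how to list the valid names for any pipeline.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Nested parameters are addressed as <step name>__<parameter name>
param_names = sorted(pipe.get_params().keys())
print([p for p in param_names if p.startswith('ridge__')])

# The same names work with set_params, which is what a grid search relies on
pipe.set_params(ridge__alpha=0.001)
print(pipe.named_steps['ridge'].alpha)
```

Calling `get_params()` on the pipeline before building `param_grid` is a quick way to avoid misspelled keys.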
