Forecasting with scikit-learn pipelines

Since version 0.4.0, skforecast allows using scikit-learn pipelines as regressors. This is useful because many machine learning models require specific data preprocessing transformations. For example, linear models with Ridge or Lasso regularization benefit from scaled features.
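To illustrate why scaling matters for regularized linear models, here is a minimal sketch that is independent of skforecast. It fits Ridge with and without a StandardScaler step on synthetic data whose two features contribute equally to the target but live on very different scales. The data, the seed, the alpha value, and all variable names are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative assumption): two features with very
# different scales that contribute equally to the target.
rng = np.random.default_rng(123)
n = 200
x_small = rng.normal(loc=0.0, scale=1.0, size=n)
x_large = rng.normal(loc=0.0, scale=1000.0, size=n)
X = np.column_stack([x_small, x_large])
y = 2.0 * x_small + 0.002 * x_large + rng.normal(scale=0.1, size=n)

# Ridge on raw features: the penalty acts on coefficients expressed in
# different units, so the two features are shrunk unevenly.
raw_model = Ridge(alpha=100.0).fit(X, y)

# Ridge inside a pipeline: features are standardized first, so the
# penalty treats both coefficients on a comparable footing.
scaled_model = make_pipeline(StandardScaler(), Ridge(alpha=100.0)).fit(X, y)

print("Coefficients without scaling:", raw_model.coef_)
print("Coefficients with scaling   :", scaled_model.named_steps['ridge'].coef_)

# Features as seen by Ridge inside the pipeline
X_scaled = scaled_model.named_steps['standardscaler'].transform(X)
```

Because the scaler is part of the pipeline, the same transformation learned during fit is applied automatically whenever the pipeline predicts.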

Warning

Version 0.4 does not allow including ColumnTransformer in the pipeline used as regressor. If the preprocessing transformations apply only to some specific columns, they have to be applied to the data set before training the model. More detailed example.
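One way to follow this advice is to transform only the relevant columns of the DataFrame before passing it to the forecaster. The sketch below standardizes a single exogenous column with StandardScaler and leaves the other untouched. The toy DataFrame, the choice of column, and the names cols_to_scale and exog_prep are assumptions for illustration. In a real backtest the scaler would normally be fitted on the training partition only.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy exogenous data (illustrative assumption, not the h2o_exog data set)
exog = pd.DataFrame(
    {'exog_1': [1.0, 2.0, 3.0, 4.0], 'exog_2': [10.0, 20.0, 30.0, 40.0]},
    index=pd.date_range('2020-01-01', periods=4, freq='MS'),
)

# Columns that need scaling; the remaining columns are left as they are
cols_to_scale = ['exog_1']
scaler = StandardScaler()

# Work on a copy so the original data stays available
exog_prep = exog.copy()
exog_prep[cols_to_scale] = scaler.fit_transform(exog[cols_to_scale])

print(exog_prep)
```

The resulting exog_prep keeps the original index and column names, so it can be used wherever the untransformed exogenous DataFrame would have been passed.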

Libraries

In [1]:

            
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import grid_search_forecaster

Data

In [2]:

            
# Download data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o_exog.csv')
data = pd.read_csv(url, sep=',', header=0, names=['date', 'y', 'exog_1', 'exog_2'])

# Data preprocessing
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%Y/%m/%d')
data = data.set_index('date')
data = data.asfreq('MS')

Create pipeline and forecaster

In [3]:

            
pipe = make_pipeline(StandardScaler(), Ridge())
pipe

Out[3]:

Pipeline(steps=[('standardscaler', StandardScaler()), ('ridge', Ridge())])

In [4]:

            
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                    regressor = pipe,
                    lags = 10
                )

forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
forecaster

Out[4]:

================= 
ForecasterAutoreg 
================= 
Regressor: Pipeline(steps=[('standardscaler', StandardScaler()), ('ridge', Ridge())]) 
Lags: [ 1  2  3  4  5  6  7  8  9 10] 
Window size: 10 
Included exogenous: True 
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'> 
Exogenous variables names: ['exog_1', 'exog_2'] 
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2008-06-01 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: MS 
Regressor parameters: {'standardscaler__copy': True, 'standardscaler__with_mean': True, 'standardscaler__with_std': True, 'ridge__alpha': 1.0, 'ridge__copy_X': True, 'ridge__fit_intercept': True, 'ridge__max_iter': None, 'ridge__normalize': 'deprecated', 'ridge__positive': False, 'ridge__random_state': None, 'ridge__solver': 'auto', 'ridge__tol': 0.001} 
Creation date: 2022-03-12 12:28:15 
Last fit date: 2022-03-12 12:28:15 
Skforecast version: 0.4.3

Grid Search

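When performing grid search over a scikit-learn pipeline, each parameter name must be preceded by the name of the pipeline step it belongs to, joined by a double underscore (for example, ridge__alpha).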
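The naming convention can be inspected directly on the pipeline. The following sketch lists the step names that make_pipeline generates and the parameters exposed under the ridge__ prefix, then sets one of them. The names step_names and ridge_params and the value 10.0 are illustrative assumptions.

```python
from sklearn.linear_model import Ridge
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

pipe = make_pipeline(StandardScaler(), Ridge())

# make_pipeline names each step after its lowercased class name
step_names = [name for name, _ in pipe.steps]
print(step_names)

# Parameters of a step are exposed as '<step name>__<parameter name>'
ridge_params = sorted(p for p in pipe.get_params() if p.startswith('ridge__'))
print(ridge_params)

# The same prefixed name is used to set a value, which is what a grid
# search does for every candidate in param_grid
pipe.set_params(ridge__alpha=10.0)
print(pipe.named_steps['ridge'].alpha)
```

Calling pipe.get_params() is a quick way to find the exact keys that can be used in param_grid.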

In [5]:

            
# Hyperparameter Grid search
# ==============================================================================
pipe = make_pipeline(StandardScaler(), Ridge())
forecaster = ForecasterAutoreg(
                    regressor = pipe,
                    lags = 10  # This value will be replaced in the grid search
                )

# Regressor's hyperparameters
param_grid = {'ridge__alpha': np.logspace(-3, 5, 10)}

# Lags used as predictors
lags_grid = [5, 24, [1, 2, 3, 23, 24]]

results_grid = grid_search_forecaster(
                        forecaster  = forecaster,
                        y           = data['y'],
                        exog        = data[['exog_1', 'exog_2']],
                        param_grid  = param_grid,
                        lags_grid   = lags_grid,
                        steps       = 5,
                        metric      = 'mean_absolute_error',
                        refit       = False,
                        initial_train_size = len(data.loc[:'2000-04-01']),
                        return_best = True,
                        verbose     = False
                  )

Number of models compared: 30

loop lags_grid: 100%|███████████████████████████████████████| 3/3 [00:02<00:00,  1.26it/s]

`Forecaster` refitted using the best-found lags and parameters, and the whole data set: 
  Lags: [1 2 3 4 5] 
  Parameters: {'ridge__alpha': 0.001}
  Backtesting metric: 6.845311709618769e-05
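The 30 models reported above correspond to the number of lag configurations multiplied by the number of hyperparameter combinations. A quick check of that arithmetic, using scikit-learn's ParameterGrid to count the combinations (an assumption about how to count them, not a statement of how skforecast enumerates them internally):

```python
import numpy as np
from sklearn.model_selection import ParameterGrid

# Same grids as in the grid search cell
param_grid = {'ridge__alpha': np.logspace(-3, 5, 10)}
lags_grid = [5, 24, [1, 2, 3, 23, 24]]

# 10 values of alpha, 3 lag configurations
n_param_combinations = len(ParameterGrid(param_grid))
n_models = len(lags_grid) * n_param_combinations

print(n_param_combinations, n_models)
```

Adding a second hyperparameter to param_grid multiplies the number of candidates, so the grid size is worth checking before launching a long search.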

In [6]:

            
results_grid

Out[6]:

	lags	params	metric	ridge__alpha
0	[1, 2, 3, 4, 5]	{'ridge__alpha': 0.001}	0.000068	0.001000
10	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 0.001}	0.000188	0.001000
1	[1, 2, 3, 4, 5]	{'ridge__alpha': 0.007742636826811269}	0.000526	0.007743
11	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 0.007742636826811269}	0.001413	0.007743
2	[1, 2, 3, 4, 5]	{'ridge__alpha': 0.05994842503189409}	0.003860	0.059948
12	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 0.05994842503189409}	0.008969	0.059948
3	[1, 2, 3, 4, 5]	{'ridge__alpha': 0.46415888336127775}	0.021751	0.464159
13	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 0.46415888336127775}	0.029505	0.464159
14	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 3.593813663804626}	0.046323	3.593814
23	[1, 2, 3, 23, 24]	{'ridge__alpha': 0.46415888336127775}	0.060623	0.464159
22	[1, 2, 3, 23, 24]	{'ridge__alpha': 0.05994842503189409}	0.061567	0.059948
21	[1, 2, 3, 23, 24]	{'ridge__alpha': 0.007742636826811269}	0.061747	0.007743
20	[1, 2, 3, 23, 24]	{'ridge__alpha': 0.001}	0.061771	0.001000
24	[1, 2, 3, 23, 24]	{'ridge__alpha': 3.593813663804626}	0.063512	3.593814
15	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 27.825594022071257}	0.064551	27.825594
4	[1, 2, 3, 4, 5]	{'ridge__alpha': 3.593813663804626}	0.069220	3.593814
25	[1, 2, 3, 23, 24]	{'ridge__alpha': 27.825594022071257}	0.077934	27.825594
16	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 215.44346900318823}	0.130016	215.443469
5	[1, 2, 3, 4, 5]	{'ridge__alpha': 27.825594022071257}	0.143189	27.825594
26	[1, 2, 3, 23, 24]	{'ridge__alpha': 215.44346900318823}	0.146446	215.443469
17	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 1668.1005372000557}	0.204469	1668.100537
6	[1, 2, 3, 4, 5]	{'ridge__alpha': 215.44346900318823}	0.205496	215.443469
27	[1, 2, 3, 23, 24]	{'ridge__alpha': 1668.1005372000557}	0.212896	1668.100537
18	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 12915.496650148827}	0.227536	12915.496650
28	[1, 2, 3, 23, 24]	{'ridge__alpha': 12915.496650148827}	0.228974	12915.496650
19	[1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14...	{'ridge__alpha': 100000.0}	0.231157	100000.000000
29	[1, 2, 3, 23, 24]	{'ridge__alpha': 100000.0}	0.231356	100000.000000
7	[1, 2, 3, 4, 5]	{'ridge__alpha': 1668.1005372000557}	0.236227	1668.100537
8	[1, 2, 3, 4, 5]	{'ridge__alpha': 12915.496650148827}	0.244788	12915.496650
9	[1, 2, 3, 4, 5]	{'ridge__alpha': 100000.0}	0.246091	100000.000000
