SARIMAX and ARIMA forecasters¶

SARIMAX (Seasonal Autoregressive Integrated Moving-Average with Exogenous Regressors) is a generalization of the ARIMA model that incorporates both seasonality and exogenous variables. SARIMAX models are among the most widely used statistical forecasting models with excellent forecasting performance.

In the SARIMAX model notation, the parameters $p$, $d$, and $q$ represent the autoregressive, differencing, and moving-average components, respectively. $P$, $D$, and $Q$ denote the same components for the seasonal part of the model, with $m$ representing the number of periods in each season.

$p$ is the order (number of time lags) of the autoregressive part of the model.
$d$ is the degree of differencing (the number of times that past values have been subtracted from the data).
$q$ is the order of the moving-average part of the model.
$P$ is the order (number of time lags) of the seasonal part of the model
$D$ is the degree of differencing (the number of times the data have had past values subtracted) of the seasonal part of the model.
$Q$ is the order of the moving-average of the seasonal part of the model.
$m$ refers to the number of periods in each season.

When two out of the three terms are zeros, the model may be referred to based on the non-zero parameter, dropping "AR", "I" or "MA" from the acronym describing the model. For example, $ARIMA(1,0,0)$ is $AR(1)$, $ARIMA(0,1,0)$ is $I(1)$, and $ARIMA(0,0,1)$ is $MA(1)$.

The ForecasterSarimax class allows for training and validating SARIMAX models from pmdarima (internally wraps the statsmodels SARIMAX) using the skforecast API. This facilitates the comparison of its performance with other machine learning models.

Libraries¶

In [1]:

            
                Copied!
                
                    
                    
                
                

        
# Libraries
# ======================================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skforecast.ForecasterSarimax import ForecasterSarimax
from skforecast.model_selection_sarimax import backtesting_sarimax
from skforecast.model_selection_sarimax import grid_search_sarimax
from pmdarima import ARIMA
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
# Libraries
# ======================================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skforecast.ForecasterSarimax import ForecasterSarimax
from skforecast.model_selection_sarimax import backtesting_sarimax
from skforecast.model_selection_sarimax import grid_search_sarimax
from pmdarima import ARIMA
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')

Data¶

In [2]:

            
                Copied!
                
                    
                    
                
                

        
# Download data
# ======================================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o.csv')
data = pd.read_csv(url, sep=',', header=0, names=['y', 'datetime'])

# Data preprocessing
# ======================================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y/%m/%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()

# Train-test dates
# ======================================================================================
end_train = '2005-06-01 23:59:00'
print(f"Train dates : {data.index.min()} --- {data.loc[:end_train].index.max()}  (n={len(data.loc[:end_train])})")
print(f"Test dates  : {data.loc[end_train:].index.min()} --- {data.loc[:].index.max()}  (n={len(data.loc[end_train:])})")
data_train = data.loc[:end_train]
data_test  = data.loc[end_train:]

# Plot
# ======================================================================================
fig, ax=plt.subplots(figsize=(7, 3))
data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
ax.legend();
# Download data
# ======================================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o.csv')
data = pd.read_csv(url, sep=',', header=0, names=['y', 'datetime'])

# Data preprocessing
# ======================================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y/%m/%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()

# Train-test dates
# ======================================================================================
end_train = '2005-06-01 23:59:00'
print(f"Train dates : {data.index.min()} --- {data.loc[:end_train].index.max()}  (n={len(data.loc[:end_train])})")
print(f"Test dates  : {data.loc[end_train:].index.min()} --- {data.loc[:].index.max()}  (n={len(data.loc[end_train:])})")
data_train = data.loc[:end_train]
data_test  = data.loc[end_train:]

# Plot
# ======================================================================================
fig, ax=plt.subplots(figsize=(7, 3))
data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
ax.legend();

Train dates : 1991-07-01 00:00:00 --- 2005-06-01 00:00:00  (n=168)
Test dates  : 2005-07-01 00:00:00 --- 2008-06-01 00:00:00  (n=36)

ForecasterSarimax¶

Training¶

In [3]:

            
                Copied!
                
                    
                    
                
                

        
# Create and fit ForecasterSarimax
# ======================================================================================
forecaster = ForecasterSarimax(
                 regressor=ARIMA(order=(12, 1, 1), seasonal_order=(0, 0, 0, 0), maxiter=200),
             )

forecaster.fit(y=data_train)
forecaster
# Create and fit ForecasterSarimax
# ======================================================================================
forecaster = ForecasterSarimax(
                 regressor=ARIMA(order=(12, 1, 1), seasonal_order=(0, 0, 0, 0), maxiter=200),
             )

forecaster.fit(y=data_train)
forecaster

Out[3]:

================= 
ForecasterSarimax 
================= 
Regressor:  ARIMA(12,1,1)(0,0,0)[0] intercept 
Regressor parameters: {'maxiter': 200, 'method': 'lbfgs', 'order': (12, 1, 1), 'out_of_sample_size': 0, 'scoring': 'mse', 'scoring_args': None, 'seasonal_order': (0, 0, 0, 0), 'start_params': None, 'suppress_warnings': False, 'trend': None, 'with_intercept': True} 
fit_kwargs: {} 
Window size: 1 
Transformer for y: None 
Transformer for exog: None 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('1991-07-01 00:00:00'), Timestamp('2005-06-01 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: MS 
Creation date: 2023-05-19 21:06:02 
Last fit date: 2023-05-19 21:06:06 
Index seen by the forecaster: DatetimeIndex(['1991-07-01', '1991-08-01', '1991-09-01', '1991-10-01',
               '1991-11-01', '1991-12-01', '1992-01-01', '1992-02-01',
               '1992-03-01', '1992-04-01',
               ...
               '2004-09-01', '2004-10-01', '2004-11-01', '2004-12-01',
               '2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01',
               '2005-05-01', '2005-06-01'],
              dtype='datetime64[ns]', name='datetime', length=168, freq='MS') 
Skforecast version: 0.8.0 
Python version: 3.9.13 
Forecaster id: None

Prediction¶

In [4]:

            
                Copied!
                
# Predict
# ======================================================================================
predictions = forecaster.predict(steps=36)
predictions.head(3)
# Predict
# ======================================================================================
predictions = forecaster.predict(steps=36)
predictions.head(3)

Out[4]:

2005-07-01    0.972575
2005-08-01    0.980463
2005-09-01    1.135129
Freq: MS, Name: pred, dtype: float64

In [5]:

            
                Copied!
                
                    
                    
                
                

        
# Plot predictions
# ======================================================================================
fig, ax=plt.subplots(figsize=(7, 3))
data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
# Plot predictions
# ======================================================================================
fig, ax=plt.subplots(figsize=(7, 3))
data_train.plot(ax=ax, label='train')
data_test.plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();

In [6]:

            
                Copied!
                
                    
                    
                
                

        
# Prediction error
# ======================================================================================
error_mse = mean_absolute_error(
                y_true = data_test,
                y_pred = predictions
            )

print(f"Test error (mse): {error_mse}")
# Prediction error
# ======================================================================================
error_mse = mean_absolute_error(
                y_true = data_test,
                y_pred = predictions
            )

print(f"Test error (mse): {error_mse}")

Test error (mse): 0.08398095770987783

In [7]:

            
                Copied!
                
# Predict intervals
# ======================================================================================
predictions = forecaster.predict_interval(steps=36, interval=[5, 95])
predictions.head(3)
# Predict intervals
# ======================================================================================
predictions = forecaster.predict_interval(steps=36, interval=[5, 95])
predictions.head(3)

Out[7]:

	pred	lower_bound	upper_bound
2005-07-01	0.972575	0.875051	1.070099
2005-08-01	0.980463	0.877568	1.083358
2005-09-01	1.135129	1.028896	1.241363

Feature importances¶

In [8]:

            
                Copied!
                
forecaster.get_feature_importances()
forecaster.get_feature_importances()

Out[8]:

	feature	importance
0	intercept	0.004831
1	ar.L1	-0.109302
2	ar.L2	-0.138044
3	ar.L3	-0.109821
4	ar.L4	-0.157437
5	ar.L5	-0.127060
6	ar.L6	-0.115060
7	ar.L7	-0.160059
8	ar.L8	-0.108337
9	ar.L9	-0.134818
10	ar.L10	-0.134505
11	ar.L11	-0.064085
12	ar.L12	0.784414
13	ma.L1	-0.554273
14	sigma2	0.002476

Backtesting¶

SARIMAX models can be evaluated using any of the backtesting strategies implemented in skforecast.

In [9]:

            
                Copied!
                
                    
                    
                
                

        
# Backtest forecaster
# ======================================================================================
forecaster = ForecasterSarimax(
                 regressor=ARIMA(order=(12, 1, 1), seasonal_order=(0, 0, 0, 0), maxiter=200),
             )

metric, predictions = backtesting_sarimax(
                          forecaster         = forecaster,
                          y                  = data,
                          initial_train_size = len(data_train),
                          fixed_train_size   = False,
                          steps              = 12,
                          metric             = 'mean_absolute_error',
                          refit              = True,
                          verbose            = True
                      )
# Backtest forecaster
# ======================================================================================
forecaster = ForecasterSarimax(
                 regressor=ARIMA(order=(12, 1, 1), seasonal_order=(0, 0, 0, 0), maxiter=200),
             )

metric, predictions = backtesting_sarimax(
                          forecaster         = forecaster,
                          y                  = data,
                          initial_train_size = len(data_train),
                          fixed_train_size   = False,
                          steps              = 12,
                          metric             = 'mean_absolute_error',
                          refit              = True,
                          verbose            = True
                      )

Information of backtesting process
----------------------------------
Number of observations used for initial training: 168
Number of observations used for backtesting: 36
    Number of folds: 3
    Number of steps per fold: 12
    Number of steps to exclude from the end of each train set before test (gap): 0

Fold: 0
    Training:   1991-07-01 00:00:00 -- 2005-06-01 00:00:00  (n=168)
    Validation: 2005-07-01 00:00:00 -- 2006-06-01 00:00:00  (n=12)
Fold: 1
    Training:   1991-07-01 00:00:00 -- 2006-06-01 00:00:00  (n=180)
    Validation: 2006-07-01 00:00:00 -- 2007-06-01 00:00:00  (n=12)
Fold: 2
    Training:   1991-07-01 00:00:00 -- 2007-06-01 00:00:00  (n=192)
    Validation: 2007-07-01 00:00:00 -- 2008-06-01 00:00:00  (n=12)

100%|██████████| 3/3 [00:11<00:00,  3.84s/it]

In [10]:

            
                Copied!
                
print(f"Error backtest: {metric}")
print(f"Error backtest: {metric}")

Error backtest: 0.06973314211549979

In [11]:

            
                Copied!
                
predictions.head(4)
predictions.head(4)

Out[11]:

	pred
2005-07-01	0.972575
2005-08-01	0.980463
2005-09-01	1.135129
2005-10-01	1.173011

Model tunning¶

To find the optimal hyperparameters for the SARIMAX model, various hyperparameter optimization strategies can be used. It is crucial to conduct hyperparameter optimization using a validation dataset, rather than the test dataset, to ensure accurate evaluation of model performance.

In [12]:

            
                Copied!
                
                    
                    
                
                

        
# Train-validation-test data
# ======================================================================================
end_train = '2001-01-01 23:59:00'
end_val = '2006-01-01 23:59:00'

print(f"Train dates      : {data.index.min()} --- {data.loc[:end_train].index.max()}  (n={len(data.loc[:end_train])})")
print(f"Validation dates : {data.loc[end_train:].index.min()} --- {data.loc[:end_val].index.max()}  (n={len(data.loc[end_train:end_val])})")
print(f"Test dates       : {data.loc[end_val:].index.min()} --- {data.index.max()}  (n={len(data.loc[end_val:])})")

# Plot
# ======================================================================================
fig, ax=plt.subplots(figsize=(7, 3))
data.loc[:end_train].plot(ax=ax, label='train')
data.loc[end_train:end_val].plot(ax=ax, label='validation')
data.loc[end_val:].plot(ax=ax, label='test')
ax.legend();
# Train-validation-test data
# ======================================================================================
end_train = '2001-01-01 23:59:00'
end_val = '2006-01-01 23:59:00'

print(f"Train dates      : {data.index.min()} --- {data.loc[:end_train].index.max()}  (n={len(data.loc[:end_train])})")
print(f"Validation dates : {data.loc[end_train:].index.min()} --- {data.loc[:end_val].index.max()}  (n={len(data.loc[end_train:end_val])})")
print(f"Test dates       : {data.loc[end_val:].index.min()} --- {data.index.max()}  (n={len(data.loc[end_val:])})")

# Plot
# ======================================================================================
fig, ax=plt.subplots(figsize=(7, 3))
data.loc[:end_train].plot(ax=ax, label='train')
data.loc[end_train:end_val].plot(ax=ax, label='validation')
data.loc[end_val:].plot(ax=ax, label='test')
ax.legend();

Train dates      : 1991-07-01 00:00:00 --- 2001-01-01 00:00:00  (n=115)
Validation dates : 2001-02-01 00:00:00 --- 2006-01-01 00:00:00  (n=60)
Test dates       : 2006-02-01 00:00:00 --- 2008-06-01 00:00:00  (n=29)

In [13]:

            
                Copied!
                
                    
                    
                
                

        
# Grid search hyperparameter
# ======================================================================================
forecaster = ForecasterSarimax(
                 regressor=ARIMA(order=(12, 1, 1), seasonal_order=(0, 0, 0, 0), maxiter=200),
             )

param_grid = {
    'order': [(12, 0, 0), (12, 1, 0), (12, 1, 1), (14, 1, 4)],
    'seasonal_order': [(0, 0, 0, 0)],
    'trend': [None, 'n', 'c']
}

results_grid = grid_search_sarimax(
                   forecaster         = forecaster,
                   y                  = data.loc[:end_val],
                   param_grid         = param_grid,
                   steps              = 12,
                   refit              = True,
                   metric             = 'mean_absolute_error',
                   initial_train_size = len(data_train),
                   fixed_train_size   = False,
                   return_best        = True,
                   verbose            = False
               )

results_grid.head(5)
# Grid search hyperparameter
# ======================================================================================
forecaster = ForecasterSarimax(
                 regressor=ARIMA(order=(12, 1, 1), seasonal_order=(0, 0, 0, 0), maxiter=200),
             )

param_grid = {
    'order': [(12, 0, 0), (12, 1, 0), (12, 1, 1), (14, 1, 4)],
    'seasonal_order': [(0, 0, 0, 0)],
    'trend': [None, 'n', 'c']
}

results_grid = grid_search_sarimax(
                   forecaster         = forecaster,
                   y                  = data.loc[:end_val],
                   param_grid         = param_grid,
                   steps              = 12,
                   refit              = True,
                   metric             = 'mean_absolute_error',
                   initial_train_size = len(data_train),
                   fixed_train_size   = False,
                   return_best        = True,
                   verbose            = False
               )

results_grid.head(5)

Number of models compared: 12.

params grid: 100%|██████████| 12/12 [00:32<00:00,  2.70s/it]

`Forecaster` refitted using the best-found parameters, and the whole data set: 
  Parameters: {'order': (12, 1, 1), 'seasonal_order': (0, 0, 0, 0), 'trend': 'n'}
  Backtesting metric: 0.06805053775564675

Out[13]:

	params	mean_absolute_error	order	seasonal_order	trend
7	{'order': (12, 1, 1), 'seasonal_order': (0, 0,...	0.068051	(12, 1, 1)	(0, 0, 0, 0)	n
10	{'order': (14, 1, 4), 'seasonal_order': (0, 0,...	0.072275	(14, 1, 4)	(0, 0, 0, 0)	n
0	{'order': (12, 0, 0), 'seasonal_order': (0, 0,...	0.076132	(12, 0, 0)	(0, 0, 0, 0)	None
2	{'order': (12, 0, 0), 'seasonal_order': (0, 0,...	0.076132	(12, 0, 0)	(0, 0, 0, 0)	c
1	{'order': (12, 0, 0), 'seasonal_order': (0, 0,...	0.076644	(12, 0, 0)	(0, 0, 0, 0)	n

Since return_best = True, the forecaster object is updated with the best configuration found and trained with the whole data set. This means that the final model obtained from grid search will have the best combination of lags and hyperparameters that resulted in the highest performance metric. This final model can then be used for future predictions on new data.

In [14]:

            
                Copied!
                
forecaster
forecaster

Out[14]:

================= 
ForecasterSarimax 
================= 
Regressor:  ARIMA(12,1,1)(0,0,0)[0] intercept 
Regressor parameters: {'maxiter': 200, 'method': 'lbfgs', 'order': (12, 1, 1), 'out_of_sample_size': 0, 'scoring': 'mse', 'scoring_args': None, 'seasonal_order': (0, 0, 0, 0), 'start_params': None, 'suppress_warnings': False, 'trend': 'n', 'with_intercept': True} 
fit_kwargs: {} 
Window size: 1 
Transformer for y: None 
Transformer for exog: None 
Exogenous included: False 
Type of exogenous variable: None 
Exogenous variables names: None 
Training range: [Timestamp('1991-07-01 00:00:00'), Timestamp('2006-01-01 00:00:00')] 
Training index type: DatetimeIndex 
Training index frequency: MS 
Creation date: 2023-05-19 21:06:18 
Last fit date: 2023-05-19 21:06:52 
Index seen by the forecaster: DatetimeIndex(['1991-07-01', '1991-08-01', '1991-09-01', '1991-10-01',
               '1991-11-01', '1991-12-01', '1992-01-01', '1992-02-01',
               '1992-03-01', '1992-04-01',
               ...
               '2005-04-01', '2005-05-01', '2005-06-01', '2005-07-01',
               '2005-08-01', '2005-09-01', '2005-10-01', '2005-11-01',
               '2005-12-01', '2006-01-01'],
              dtype='datetime64[ns]', name='datetime', length=175, freq='MS') 
Skforecast version: 0.8.0 
Python version: 3.9.13 
Forecaster id: None

In [15]:

            
                Copied!
                
%%html
<style>
.jupyter-wrapper .jp-CodeCell .jp-Cell-inputWrapper .jp-InputPrompt {display: none;}
</style>
%%html