Backtesting SARIMAX and ARIMA models¶

SARIMAX (Seasonal Auto-Regressive Integrated Moving Average with eXogenous factors) is a generalization of the ARIMA model that allows incorporating seasonality and exogenous variables. This model has a total of 6 hyperparameters that must be specified when training the model:

p: Trend autoregression order.
d: Trend difference order.
q: Trend moving average order.
P: Seasonal autoregressive order.
D: Seasonal difference order.
Q: Seasonal moving average order.
m: The number of time steps for a single seasonal period.

The backtesting_sarimax function of the skforecast.model_selection_statsmodels module is a wrapper that allows backtesting SARIMAX models.

Libraries¶

In [1]:

            
                Copied!
                
                    
                    
                
                

        
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skforecast.model_selection_statsmodels import backtesting_sarimax
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skforecast.model_selection_statsmodels import backtesting_sarimax

c:\Users\jaesc2\Miniconda3\envs\skforecast\lib\site-packages\statsmodels\compat\pandas.py:61: FutureWarning: pandas.Int64Index is deprecated and will be removed from pandas in a future version. Use pandas.Index with the appropriate dtype instead.
  from pandas import Int64Index as NumericIndex

Data¶

In [2]:

            
                Copied!
                
                    
                    
                
                

        
# Download data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o.csv')
data = pd.read_csv(url, sep=',', header=0, names=['y', 'datetime'])

# Data preprocessing
# ==============================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y/%m/%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()

# Split data in train snd backtest
# ==============================================================================
n_backtest = 36*3  # Last 9 years are used for backtest
data_train = data[:-n_backtest]
data_backtest = data[-n_backtest:]
# Download data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o.csv')
data = pd.read_csv(url, sep=',', header=0, names=['y', 'datetime'])

# Data preprocessing
# ==============================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y/%m/%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()

# Split data in train snd backtest
# ==============================================================================
n_backtest = 36*3  # Last 9 years are used for backtest
data_train = data[:-n_backtest]
data_backtest = data[-n_backtest:]

Backtest¶

In [3]:

            
                Copied!
                
import warnings
warnings.filterwarnings('ignore')
import warnings
warnings.filterwarnings('ignore')

In [4]:

            
                Copied!
                
                    
                    
                
                

        
metric, predictions_backtest = backtesting_sarimax(
                                    y = data,
                                    order = (12, 1, 1),
                                    seasonal_order = (0, 0, 0, 0),
                                    initial_train_size = len(data_train),
                                    fixed_train_size = False,
                                    steps = 7,
                                    metric = 'mean_absolute_error',
                                    refit = False,
                                    verbose = True,
                                    fit_kwargs = {'maxiter': 250, 'disp': 0},
                               )
metric, predictions_backtest = backtesting_sarimax(
                                    y = data,
                                    order = (12, 1, 1),
                                    seasonal_order = (0, 0, 0, 0),
                                    initial_train_size = len(data_train),
                                    fixed_train_size = False,
                                    steps = 7,
                                    metric = 'mean_absolute_error',
                                    refit = False,
                                    verbose = True,
                                    fit_kwargs = {'maxiter': 250, 'disp': 0},
                               )

Number of observations used for training: 96
Number of observations used for backtesting: 108
    Number of folds: 16
    Number of steps per fold: 7
    Last fold only includes 3 observations.

In [5]:

            
                Copied!
                
print(f"Error backtest: {metric}")
print(f"Error backtest: {metric}")

Error backtest: 0.055444446244429825

In [6]:

            
                Copied!
                
predictions_backtest.head(4)
predictions_backtest.head(4)

Out[6]:

	predicted_mean	lower y	upper y
1999-07-01	0.734647	0.650016	0.819278
1999-08-01	0.751779	0.660625	0.842933
1999-09-01	0.865059	0.767857	0.962262
1999-10-01	0.832461	0.730425	0.934497

In [7]:

            
                Copied!
                
%%html
<style>
.jupyter-wrapper .jp-CodeCell .jp-Cell-inputWrapper .jp-InputPrompt {display: none;}
</style>
%%html