ARIMA and SARIMAX¶
ARIMA (AutoRegressive Integrated Moving Average) and SARIMAX (Seasonal AutoRegressive Integrated Moving Average with eXogenous regressors) are prominent and widely used statistical forecasting models. While ARIMA models are more widely known, SARIMAX models extend the ARIMA framework by seamlessly integrating seasonal patterns and exogenous variables.
In the ARIMA-SARIMAX model notation, the parameters $p$, $d$, and $q$ represent the autoregressive, differencing, and moving-average components, respectively. $P$, $D$, and $Q$ denote the same components for the seasonal part of the model, with $m$ representing the number of periods in each season.
$p$ is the order (number of time lags) of the autoregressive part of the model.
$d$ is the degree of differencing (the number of times that past values have been subtracted from the data).
$q$ is the order of the moving average part of the model.
$P$ is the order (number of time lags) of the seasonal part of the model.
$D$ is the degree of differencing (the number of times the data have had past values subtracted) of the seasonal part of the model.
$Q$ is the order of the moving average of the seasonal part of the model.
$m$ refers to the number of periods in each season.
When the terms $P$, $D$, $Q$, and $m$ are zero and no exogenous variables are included in the model, the SARIMAX model is equivalent to an ARIMA.
When two out of the three terms are zero, the model can be referred to based on the non-zero parameter, dropping "AR", "I" or "MA" from the acronym describing the model. For example, $ARIMA(1,0,0)$ is $AR(1)$, $ARIMA(0,1,0)$ is $I(1)$, and $ARIMA(0,0,1)$ is $MA(1)$.
ARIMA implementations
Several Python libraries implement ARIMA-SARIMAX models. Three of them are:
statsmodels: this is one of the most complete libraries for statistical modeling. While the functional paradigm may be intuitive for those coming from the R environment, those accustomed to the object-oriented API of scikit-learn may need a short period of adaptation.
pmdarima: This is a wrapper for
statsmodels SARIMAX
. Its distinguishing feature is its seamless integration with the scikit-learn API, allowing users familiar with scikit-learn's conventions to seamlessly dive into time series modeling.skforecast: a novel wrapper for
statsmodels SARIMAX
that also follows the scikit-learn API. This implementation is very similar to that of pmdarima, but has been streamlined to include only the essential elements for skforecast, resulting in significant speed improvements.
ForecasterSarimax
The ForecasterSarimax
class allows training and validation of ARIMA and SARIMAX models using the skforecast API. ForecasterSarimax
is compatible with two ARIMA-SARIMAX implementations:
ARIMA
from pmdarima: a wrapper for statsmodels SARIMAX that follows the scikit-learn API.Sarimax
from skforecast: a novel wrapper for statsmodels SARIMAX that also follows the sklearn API. This implementation is very similar to pmdarima, but has been streamlined to include only the essential elements for skforecast, resulting in significant speed improvements.
Tip
To learn more about modeling time series with ARIMA models, visit our example: ARIMA and SARIMAX models with Python.
Libraries¶
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skforecast.Sarimax import Sarimax
from skforecast.ForecasterSarimax import ForecasterSarimax
from skforecast.model_selection_sarimax import backtesting_sarimax
from skforecast.model_selection_sarimax import grid_search_sarimax
from pmdarima import ARIMA
from statsmodels.tsa.statespace.sarimax import SARIMAX
from sklearn.metrics import mean_absolute_error
import warnings
warnings.filterwarnings('ignore')
Data¶
# Download data
# ==============================================================================
url = (
'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/'
'data/h2o_exog.csv'
)
data = pd.read_csv(
url, sep=',', header=0, names=['datetime', 'y', 'exog_1', 'exog_2']
)
# Data preprocessing
# ==============================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y-%m-%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data.sort_index()
# Train-test dates
# ==============================================================================
end_train = '2005-06-01 23:59:59'
print(f"Train dates : {data.index.min()} --- {data.loc[:end_train].index.max()} (n={len(data.loc[:end_train])})")
print(f"Test dates : {data.loc[end_train:].index.min()} --- {data.loc[:].index.max()} (n={len(data.loc[end_train:])})")
data_train = data.loc[:end_train]
data_test = data.loc[end_train:]
# Plot
# ==============================================================================
fig, ax=plt.subplots(figsize=(7, 3))
data.plot(ax=ax)
ax.legend();
Train dates : 1992-04-01 00:00:00 --- 2005-06-01 00:00:00 (n=159) Test dates : 2005-07-01 00:00:00 --- 2008-06-01 00:00:00 (n=36)
Statsmodels, pmdarima and skforecast¶
The following section focus on how to train an ARIMA model and forecast future values with each of the three libraries.
statsmodels
# ARIMA model with statsmodels.Sarimax
# ==============================================================================
arima = SARIMAX(endog = data_train['y'], order = (1, 1, 1))
arima_res = arima.fit(disp=0)
arima_res.summary()
Dep. Variable: | y | No. Observations: | 159 |
---|---|---|---|
Model: | SARIMAX(1, 1, 1) | Log Likelihood | 89.934 |
Date: | Wed, 29 Nov 2023 | AIC | -173.869 |
Time: | 08:45:59 | BIC | -164.681 |
Sample: | 04-01-1992 | HQIC | -170.137 |
- 06-01-2005 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ar.L1 | 0.6316 | 0.143 | 4.421 | 0.000 | 0.352 | 0.912 |
ma.L1 | -0.9535 | 0.054 | -17.821 | 0.000 | -1.058 | -0.849 |
sigma2 | 0.0186 | 0.002 | 8.619 | 0.000 | 0.014 | 0.023 |
Ljung-Box (L1) (Q): | 0.75 | Jarque-Bera (JB): | 167.06 |
---|---|---|---|
Prob(Q): | 0.39 | Prob(JB): | 0.00 |
Heteroskedasticity (H): | 2.13 | Skew: | -1.66 |
Prob(H) (two-sided): | 0.01 | Kurtosis: | 6.78 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
# Prediction
# ==============================================================================
predictions = arima_res.get_forecast(steps=12)
predictions.predicted_mean.head(4)
2005-07-01 0.859450 2005-08-01 0.870306 2005-09-01 0.877163 2005-10-01 0.881494 Freq: MS, Name: predicted_mean, dtype: float64
pmdarima
# ARIMA model with pmdarima.ARIMA
# ==============================================================================
arima = ARIMA(order=(1, 1, 1), suppress_warnings=True)
arima.fit(y=data_train['y'])
arima.summary()
Dep. Variable: | y | No. Observations: | 159 |
---|---|---|---|
Model: | SARIMAX(1, 1, 1) | Log Likelihood | 93.643 |
Date: | Wed, 29 Nov 2023 | AIC | -179.287 |
Time: | 08:45:59 | BIC | -167.036 |
Sample: | 04-01-1992 | HQIC | -174.312 |
- 06-01-2005 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
intercept | 0.0011 | 0.000 | 2.183 | 0.029 | 0.000 | 0.002 |
ar.L1 | 0.6019 | 0.108 | 5.593 | 0.000 | 0.391 | 0.813 |
ma.L1 | -0.9998 | 5.468 | -0.183 | 0.855 | -11.717 | 9.718 |
sigma2 | 0.0175 | 0.096 | 0.183 | 0.855 | -0.170 | 0.205 |
Ljung-Box (L1) (Q): | 1.58 | Jarque-Bera (JB): | 129.76 |
---|---|---|---|
Prob(Q): | 0.21 | Prob(JB): | 0.00 |
Heteroskedasticity (H): | 2.19 | Skew: | -1.49 |
Prob(H) (two-sided): | 0.01 | Kurtosis: | 6.29 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
# Prediction
# ==============================================================================
predictions = arima.predict(n_periods=12)
predictions.head(4)
2005-07-01 0.891865 2005-08-01 0.922805 2005-09-01 0.942514 2005-10-01 0.955461 Freq: MS, dtype: float64
skforecast
# ARIMA model with skforecast.Sarimax
# ==============================================================================
arima = Sarimax(order=(1, 1, 1))
arima.fit(y=data_train['y'])
arima.summary()
Dep. Variable: | y | No. Observations: | 159 |
---|---|---|---|
Model: | SARIMAX(1, 1, 1) | Log Likelihood | 89.934 |
Date: | Wed, 29 Nov 2023 | AIC | -173.869 |
Time: | 08:45:59 | BIC | -164.681 |
Sample: | 04-01-1992 | HQIC | -170.137 |
- 06-01-2005 | |||
Covariance Type: | opg |
coef | std err | z | P>|z| | [0.025 | 0.975] | |
---|---|---|---|---|---|---|
ar.L1 | 0.6316 | 0.143 | 4.421 | 0.000 | 0.352 | 0.912 |
ma.L1 | -0.9535 | 0.054 | -17.821 | 0.000 | -1.058 | -0.849 |
sigma2 | 0.0186 | 0.002 | 8.619 | 0.000 | 0.014 | 0.023 |
Ljung-Box (L1) (Q): | 0.75 | Jarque-Bera (JB): | 167.06 |
---|---|---|---|
Prob(Q): | 0.39 | Prob(JB): | 0.00 |
Heteroskedasticity (H): | 2.13 | Skew: | -1.66 |
Prob(H) (two-sided): | 0.01 | Kurtosis: | 6.78 |
Warnings:
[1] Covariance matrix calculated using the outer product of gradients (complex-step).
# Prediction
# ==============================================================================
predictions = arima.predict(steps=12)
predictions.head(4)
pred | |
---|---|
2005-07-01 | 0.859450 |
2005-08-01 | 0.870306 |
2005-09-01 | 0.877163 |
2005-10-01 | 0.881494 |
ForecasterSarimax¶
The previous section introduced the construction of ARIMA-SARIMAX models using three different implementations. In order to seamlessly integrate these models with the various functionalities provided by skforecast, the next step is to encapsulate these models within a ForecasterSarimax
object. This encapsulation harmonizes the intricacies of the model and allows for the coherent use of skforecast's extensive capabilities.
The following code is done using the skforecast Sarimax
model, but the same code can be applied to the pmdarima ARIMA
model.
Training¶
# Create and fit ForecasterSarimax
# ==============================================================================
forecaster = ForecasterSarimax(
regressor=Sarimax(order=(12, 1, 1), seasonal_order=(0, 0, 0, 0), maxiter=200),
)
forecaster.fit(y=data_train['y'], suppress_warnings=True)
forecaster
================= ForecasterSarimax ================= Regressor: Sarimax(12,1,1)(0,0,0)[0] Regressor parameters: {'concentrate_scale': False, 'dates': None, 'disp': False, 'enforce_invertibility': True, 'enforce_stationarity': True, 'freq': None, 'hamilton_representation': False, 'maxiter': 200, 'measurement_error': False, 'method': 'lbfgs', 'missing': 'none', 'mle_regression': True, 'order': (12, 1, 1), 'seasonal_order': (0, 0, 0, 0), 'simple_differencing': False, 'sm_fit_kwargs': {}, 'sm_init_kwargs': {}, 'sm_predict_kwargs': {}, 'start_params': None, 'time_varying_regression': False, 'trend': None, 'trend_offset': 1, 'use_exact_diffuse': False, 'validate_specification': True} fit_kwargs: {} Window size: 1 Transformer for y: None Transformer for exog: None Exogenous included: False Type of exogenous variable: None Exogenous variables names: None Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2005-06-01 00:00:00')] Training index type: DatetimeIndex Training index frequency: MS Creation date: 2023-11-29 08:45:59 Last fit date: 2023-11-29 08:46:02 Index seen by the forecaster: DatetimeIndex(['1992-04-01', '1992-05-01', '1992-06-01', '1992-07-01', '1992-08-01', '1992-09-01', '1992-10-01', '1992-11-01', '1992-12-01', '1993-01-01', ... '2004-09-01', '2004-10-01', '2004-11-01', '2004-12-01', '2005-01-01', '2005-02-01', '2005-03-01', '2005-04-01', '2005-05-01', '2005-06-01'], dtype='datetime64[ns]', name='datetime', length=159, freq='MS') Skforecast version: 0.11.0 Python version: 3.10.11 Forecaster id: None
Prediction¶
# Predict
# ==============================================================================
predictions = forecaster.predict(steps=36)
predictions.head(3)
2005-07-01 0.957888 2005-08-01 0.960101 2005-09-01 1.108663 Freq: MS, Name: pred, dtype: float64
# Plot predictions
# ==============================================================================
fig, ax=plt.subplots(figsize=(7, 3))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
# Prediction error
# ==============================================================================
error_mse = mean_absolute_error(
y_true = data_test['y'],
y_pred = predictions
)
print(f"Test error (mse): {error_mse}")
Test error (mse): 0.07124801395226059
Interval prediction¶
Either alpha
or interval
can be used to indicate the confidence of the estimated prediction interval.
# Predict intervals
# ==============================================================================
predictions = forecaster.predict_interval(steps=36, alpha=0.05)
predictions.head(3)
pred | lower_bound | upper_bound | |
---|---|---|---|
2005-07-01 | 0.957888 | 0.857849 | 1.057926 |
2005-08-01 | 0.960101 | 0.853867 | 1.066336 |
2005-09-01 | 1.108663 | 0.998261 | 1.219064 |
Exogenous variables¶
The addition of exogenous variables is done using the exog
argument. The only requirement for including an exogenous variable is the need to know the value of the variable also during the forecast period.
Tip
To learn more about exogenous variables and how to correctly manage them with skforecast visit: Exogenous variables (features) user guide.
# Create and fit ForecasterSarimax with exogenous variables
# ==============================================================================
forecaster = ForecasterSarimax(
regressor=Sarimax(order=(12, 1, 1), seasonal_order=(0, 0, 0, 0), maxiter=200),
)
forecaster.fit(
y = data_train['y'],
exog = data_train[['exog_1', 'exog_2']],
suppress_warnings = True
)
# Predict with exog
# ==============================================================================
predictions = forecaster.predict(
steps = 36,
exog = data_test[['exog_1', 'exog_2']]
)
predictions.head(3)
2005-07-01 0.904036 2005-08-01 0.932258 2005-09-01 1.088950 Freq: MS, Name: pred, dtype: float64
Using an already trained ARIMA¶
Forecasting with an ARIMA model becomes challenging when the forecast horizon data does not immediately follow the last observed value during the training phase. This complexity is due to the moving average (MA) component, which relies on past forecast errors as predictors. Thus, to predict at time 't', the error of the 't-1' prediction becomes a necessity. In situations where this prediction isn't available, the corresponding error remains unavailable.
For this reason, in most cases, ARIMA models are retrained each time predictions need to be made. Despite considerable efforts and advances to speed up the training process for these models, it is not always feasible to retrain the model between predictions, either due to time constraints or insufficient computational resources for repeated access to historical data. An intermediate approach is to feed the model with data from the last training observation to the start of the prediction phase. This technique enables the estimation of intermediate predictions and, as a result, the necessary errors.
For example, imagine a situation where a model was trained 20 days ago with daily data from the past three years. When generating new predictions, only the 20 most recent values would be needed, rather than the complete historical dataset (365 * 3 + 20).
Integrating new data into the model can be complex, but the ForecasterSarimax
class simplifies this considerably by automating the process through the last_window
argument in its predict
method.
# Split data Train - Last window - Test
# ==============================================================================
end_train = '2005-06-01 23:59:59'
end_last_window = '2007-06-01 23:59:59'
print(
f"Train dates : {data.index.min()} --- {data.loc[:end_train].index.max()} "
f"(n={len(data.loc[:end_train])})"
)
print(
f"Last window dates : {data.loc[end_train:].index.min()} --- {data.loc[:end_last_window].index.max()} "
f"(n={len(data.loc[end_train:end_last_window])})"
)
print(
f"Test dates : {data.loc[end_last_window:].index.min()} --- {data.index.max()} "
f"(n={len(data.loc[end_last_window:])})"
)
data_train = data.loc[:end_train]
data_last_window = data.loc[end_train:end_last_window]
data_test = data.loc[end_last_window:]
# Plot
# ======================================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_train['y'].plot(ax=ax, label='train')
data_last_window['y'].plot(ax=ax, label='last window')
data_test['y'].plot(ax=ax, label='test')
ax.legend();
Train dates : 1992-04-01 00:00:00 --- 2005-06-01 00:00:00 (n=159) Last window dates : 2005-07-01 00:00:00 --- 2007-06-01 00:00:00 (n=24) Test dates : 2007-07-01 00:00:00 --- 2008-06-01 00:00:00 (n=12)
Since exogenous variables have been included in the Forecaster tuning, it is necessary to pass both past values and their future values to the predict
method using the last_window_exog
and exog
parameters when making predictions.
# Create and fit ForecasterSarimax with exogenous variables
# ==============================================================================
forecaster = ForecasterSarimax(
regressor=Sarimax(order=(12, 1, 1), seasonal_order=(0, 0, 0, 0), maxiter=200),
)
forecaster.fit(
y = data_train['y'],
exog = data_train[['exog_1', 'exog_2']],
suppress_warnings = True
)
# Predict with exog and last window
# ==============================================================================
predictions = forecaster.predict(
steps = 12,
exog = data_test[['exog_1', 'exog_2']],
last_window = data_last_window['y'],
last_window_exog = data_last_window[['exog_1', 'exog_2']]
)
predictions.head(3)
2007-07-01 0.883503 2007-08-01 1.041713 2007-09-01 1.071409 Freq: MS, Name: pred, dtype: float64
# Prediction error
# ==============================================================================
error_mse = mean_absolute_error(
y_true = data_test['y'],
y_pred = predictions
)
print(f"Test error (mse): {error_mse}")
Test error (mse): 0.06296272219417774
# Plot predictions
# ==============================================================================
fig, ax=plt.subplots(figsize=(7, 3))
data_train['y'].plot(ax=ax, label='train')
data_last_window['y'].plot(ax=ax, label='last window')
data_test['y'].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
Feature importances¶
Returns the parameters of the model.
# Feature importances
# ==============================================================================
forecaster.get_feature_importances()
feature | importance | |
---|---|---|
0 | exog_1 | -0.541309 |
1 | exog_2 | 1.525838 |
2 | ar.L1 | -0.162700 |
3 | ar.L2 | -0.236563 |
4 | ar.L3 | -0.194586 |
5 | ar.L4 | -0.284359 |
6 | ar.L5 | -0.190579 |
7 | ar.L6 | -0.211254 |
8 | ar.L7 | -0.247428 |
9 | ar.L8 | -0.212840 |
10 | ar.L9 | -0.211725 |
11 | ar.L10 | -0.229331 |
12 | ar.L11 | -0.151496 |
13 | ar.L12 | 0.672440 |
14 | ma.L1 | -0.978456 |
15 | sigma2 | 0.001588 |
Backtesting¶
SARIMAX models can be evaluated using any of the backtesting strategies implemented in skforecast.
# Backtest forecaster
# ======================================================================================
forecaster = ForecasterSarimax(
regressor=Sarimax(order=(12, 1, 1), seasonal_order=(0, 0, 0, 0), maxiter=200),
)
metric, predictions = backtesting_sarimax(
forecaster = forecaster,
y = data['y'],
exog = data[['exog_1', 'exog_2']],
initial_train_size = len(data_train),
fixed_train_size = False,
steps = 12,
metric = 'mean_absolute_error',
refit = True,
n_jobs = 'auto',
suppress_warnings_fit = True,
verbose = True,
show_progress = True
)
print(f"Error backtest: {metric}")
predictions.head(4)
Information of backtesting process ---------------------------------- Number of observations used for initial training: 159 Number of observations used for backtesting: 36 Number of folds: 3 Number of steps per fold: 12 Number of steps to exclude from the end of each train set before test (gap): 0 Fold: 0 Training: 1992-04-01 00:00:00 -- 2005-06-01 00:00:00 (n=159) Validation: 2005-07-01 00:00:00 -- 2006-06-01 00:00:00 (n=12) Fold: 1 Training: 1992-04-01 00:00:00 -- 2006-06-01 00:00:00 (n=171) Validation: 2006-07-01 00:00:00 -- 2007-06-01 00:00:00 (n=12) Fold: 2 Training: 1992-04-01 00:00:00 -- 2007-06-01 00:00:00 (n=183) Validation: 2007-07-01 00:00:00 -- 2008-06-01 00:00:00 (n=12)
0%| | 0/3 [00:00<?, ?it/s]
Error backtest: 0.0561092485592625
pred | |
---|---|
2005-07-01 | 0.904036 |
2005-08-01 | 0.932258 |
2005-09-01 | 1.088950 |
2005-10-01 | 1.112871 |
Model tunning¶
To find the optimal hyperparameters for the SARIMAX model, various hyperparameter optimization strategies can be used. It is crucial to conduct hyperparameter optimization using a validation dataset, rather than the test dataset, to ensure accurate evaluation of model performance.
# Train-validation-test data
# ======================================================================================
end_train = '2001-01-01 23:59:00'
end_val = '2006-01-01 23:59:00'
print(f"Train dates : {data.index.min()} --- {data.loc[:end_train].index.max()} (n={len(data.loc[:end_train])})")
print(f"Validation dates : {data.loc[end_train:].index.min()} --- {data.loc[:end_val].index.max()} (n={len(data.loc[end_train:end_val])})")
print(f"Test dates : {data.loc[end_val:].index.min()} --- {data.index.max()} (n={len(data.loc[end_val:])})")
# Plot
# ======================================================================================
fig, ax=plt.subplots(figsize=(7, 3))
data.loc[:end_train, 'y'].plot(ax=ax, label='train')
data.loc[end_train:end_val, 'y'].plot(ax=ax, label='validation')
data.loc[end_val:, 'y'].plot(ax=ax, label='test')
ax.legend();
Train dates : 1992-04-01 00:00:00 --- 2001-01-01 00:00:00 (n=106) Validation dates : 2001-02-01 00:00:00 --- 2006-01-01 00:00:00 (n=60) Test dates : 2006-02-01 00:00:00 --- 2008-06-01 00:00:00 (n=29)
# Grid search hyperparameter
# ======================================================================================
forecaster = ForecasterSarimax(
regressor=Sarimax(order=(12, 1, 1), maxiter=200),
)
param_grid = {
'order': [(12, 0, 0), (12, 1, 0), (12, 1, 1)],
'trend': [None, 'n', 'c']
}
results_grid = grid_search_sarimax(
forecaster = forecaster,
y = data.loc[:end_val, 'y'],
param_grid = param_grid,
steps = 12,
refit = False,
metric = 'mean_absolute_error',
initial_train_size = len(data_train),
fixed_train_size = False,
return_best = True,
n_jobs = 'auto',
suppress_warnings_fit = True,
verbose = False,
show_progress = True
)
results_grid.head(5)
Number of models compared: 9.
params grid: 0%| | 0/9 [00:00<?, ?it/s]
`Forecaster` refitted using the best-found parameters, and the whole data set: Parameters: {'order': (12, 1, 1), 'trend': None} Backtesting metric: 0.06867231274352846
params | mean_absolute_error | order | trend | |
---|---|---|---|---|
6 | {'order': (12, 1, 1), 'trend': None} | 0.068672 | (12, 1, 1) | None |
7 | {'order': (12, 1, 1), 'trend': 'n'} | 0.068672 | (12, 1, 1) | n |
2 | {'order': (12, 0, 0), 'trend': 'c'} | 0.077006 | (12, 0, 0) | c |
0 | {'order': (12, 0, 0), 'trend': None} | 0.077562 | (12, 0, 0) | None |
1 | {'order': (12, 0, 0), 'trend': 'n'} | 0.077562 | (12, 0, 0) | n |
Since return_best = True
, the Forecaster object is updated with the most optimal configuration found and trained with the entire dataset. This means that the grid search will yield the lowest error model with the best hyperparameters that lead to the highest performance metric. This last model can subsequently be utilized for forecasts on new data.
forecaster
================= ForecasterSarimax ================= Regressor: Sarimax(12,1,1)(0,0,0)[0] Regressor parameters: {'concentrate_scale': False, 'dates': None, 'disp': False, 'enforce_invertibility': True, 'enforce_stationarity': True, 'freq': None, 'hamilton_representation': False, 'maxiter': 200, 'measurement_error': False, 'method': 'lbfgs', 'missing': 'none', 'mle_regression': True, 'order': (12, 1, 1), 'seasonal_order': (0, 0, 0, 0), 'simple_differencing': False, 'sm_fit_kwargs': {}, 'sm_init_kwargs': {}, 'sm_predict_kwargs': {}, 'start_params': None, 'time_varying_regression': False, 'trend': None, 'trend_offset': 1, 'use_exact_diffuse': False, 'validate_specification': True} fit_kwargs: {} Window size: 1 Transformer for y: None Transformer for exog: None Exogenous included: False Type of exogenous variable: None Exogenous variables names: None Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2006-01-01 00:00:00')] Training index type: DatetimeIndex Training index frequency: MS Creation date: 2023-11-29 08:46:23 Last fit date: 2023-11-29 08:46:45 Index seen by the forecaster: DatetimeIndex(['1992-04-01', '1992-05-01', '1992-06-01', '1992-07-01', '1992-08-01', '1992-09-01', '1992-10-01', '1992-11-01', '1992-12-01', '1993-01-01', ... '2005-04-01', '2005-05-01', '2005-06-01', '2005-07-01', '2005-08-01', '2005-09-01', '2005-10-01', '2005-11-01', '2005-12-01', '2006-01-01'], dtype='datetime64[ns]', name='datetime', length=166, freq='MS') Skforecast version: 0.11.0 Python version: 3.10.11 Forecaster id: None
Prediction on training data (In-sample Predictions)¶
Predictions on the training data are crucial for evaluating the accuracy and effectiveness of the model. By comparing the predicted values wtih the actual observed values in the training dataset, you can assess how well the model has learned the underlying patterns and trends in the data. This comparison helps in understanding the model's performance and identify areas where it may need improvement or adjustment. In essence, they act as a mirror, reflecting how the model interprets and reconstructs the historical data on which it was trained.
The predictions of the fitted values are stored in the fittedvalues
attribute of the SARIMAXResults
object. This object is stored within the sarimax_res
attribute of the skforecast.Sarimax
model:
# Create and fit ForecasterSarimax (skforecast)
# ==============================================================================
forecaster = ForecasterSarimax(
regressor=Sarimax(order=(12, 1, 1), maxiter=200),
)
forecaster.fit(y=data_train['y'], suppress_warnings=True)
# In-sample Predictions
# ==============================================================================
forecaster.regressor.sarimax_res.fittedvalues
datetime 1992-04-01 0.000000 1992-05-01 0.379808 1992-06-01 0.361370 1992-07-01 0.413246 1992-08-01 0.481038 ... 2005-02-01 0.739297 2005-03-01 0.751038 2005-04-01 0.716078 2005-05-01 0.749854 2005-06-01 0.817569 Freq: MS, Length: 159, dtype: float64
When using the pmdarima.ARIMA
model, the statsmodels SARIMAXResults
object is stored inside the arima_res_
attribute:
# Create and fit ForecasterSarimax (pmdarima)
# ==============================================================================
forecaster = ForecasterSarimax(
regressor=ARIMA(order=(12, 1, 1), maxiter=200),
)
forecaster.fit(y=data_train['y'], suppress_warnings=True)
# In-sample Predictions
# ==============================================================================
forecaster.regressor.arima_res_.fittedvalues
datetime 1992-04-01 0.002797 1992-05-01 0.382605 1992-06-01 0.363857 1992-07-01 0.416587 1992-08-01 0.485223 ... 2005-02-01 0.740753 2005-03-01 0.755807 2005-04-01 0.725037 2005-05-01 0.762426 2005-06-01 0.832514 Freq: MS, Length: 159, dtype: float64