model_selection_statsmodels
backtesting_sarimax(y, steps, metric, initial_train_size, fixed_train_size=False, refit=False, order=(1, 0, 0), seasonal_order=(0, 0, 0, 0), trend=None, alpha=0.05, exog=None, sarimax_kwargs={}, fit_kwargs={'disp': 0}, verbose=False)
Backtesting (validation) of a `SARIMAX` model from statsmodels >= 0.12. The model is trained using the first `initial_train_size` observations; then, in each iteration, `steps` predictions are evaluated. If `refit` is `True`, the model is re-fitted in each iteration before making predictions.
https://www.statsmodels.org/dev/examples/notebooks/generated/statespace_forecasting.html
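The fold layout described above can be sketched in plain Python. This is only an illustration of the index arithmetic (no model is fitted); the helper name `backtesting_windows` is hypothetical and is not part of skforecast.

```python
import math


def backtesting_windows(n_obs, initial_train_size, steps, fixed_train_size=False):
    """Return (train_start, train_end, test_start, test_end) for each fold.

    Hypothetical helper mirroring the index arithmetic used when refit=True.
    All bounds are half-open, as in Python slicing.
    """
    n_test = n_obs - initial_train_size
    folds = math.ceil(n_test / steps)
    windows = []
    for i in range(folds):
        train_end = initial_train_size + i * steps
        # Fixed size: the window slides forward; otherwise it grows from index 0.
        train_start = i * steps if fixed_train_size else 0
        # The last fold is shorter when n_test is not a multiple of steps.
        test_end = min(train_end + steps, n_obs)
        windows.append((train_start, train_end, train_end, test_end))
    return windows


windows = backtesting_windows(n_obs=25, initial_train_size=10, steps=4)
for window in windows:
    print(window)
```

With `refit=False`, the source shown further down fits the model once on the first window and then extends the fitted model with the observations of each previous fold instead of re-estimating it.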
Parameters:
Name | Type | Description | Default |
---|---|---|---|
y | Series | Time series values. | required |
steps | int | Number of steps to predict. | required |
metric | Union[str, callable] | Metric used to quantify the goodness of fit of the model. If string: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error'}. If callable: function with arguments y_true, y_pred that returns a float. | required |
initial_train_size | int | Number of samples used in the initial train. | required |
fixed_train_size | bool | If True, the train size does not increase but moves by `steps` in each iteration. | False |
refit | bool | Whether to re-fit the model in each iteration. | False |
order | tuple | The (p,d,q) order of the model for the number of AR parameters, differences, and MA parameters. d must be an integer indicating the integration order of the process, while p and q may either be integers indicating the AR and MA orders (so that all lags up to those orders are included) or else iterables giving specific AR and / or MA lags to include. Default is an AR(1) model: (1,0,0). | (1, 0, 0) |
seasonal_order | tuple | The (P,D,Q,s) order of the seasonal component of the model for the AR parameters, differences, MA parameters, and periodicity. D must be an integer indicating the integration order of the process, while P and Q may either be integers indicating the AR and MA orders (so that all lags up to those orders are included) or else iterables giving specific AR and / or MA lags to include. s is an integer giving the periodicity (number of periods in season); it is often 4 for quarterly data or 12 for monthly data. Default is no seasonal effect. | (0, 0, 0, 0) |
trend | str | Parameter controlling the deterministic trend polynomial A(t). Can be specified as a string where 'c' indicates a constant (i.e. a degree zero component of the trend polynomial), 't' indicates a linear trend with time, and 'ct' is both. Can also be specified as an iterable defining the non-zero polynomial exponents to include, in increasing order. For example, [1,1,0,1] denotes a + bt + ct^3. Default is to not include a trend component. | None |
alpha | float | The significance level for the confidence interval. The default alpha = .05 returns a 95% confidence interval. | 0.05 |
exog | Union[pandas.core.series.Series, pandas.core.frame.DataFrame] | Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i]. | None |
sarimax_kwargs | dict | Additional keyword arguments passed to SARIMAX constructor. See more in https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html#statsmodels.tsa.statespace.sarimax.SARIMAX | {} |
fit_kwargs | dict | Additional keyword arguments passed to SARIMAX fit. See more in https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.fit.html#statsmodels.tsa.statespace.sarimax.SARIMAX.fit | {'disp': 0} |
verbose | bool | Print number of folds used for backtesting. | False |
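Besides the three built-in metric names, `metric` accepts any callable that takes `y_true` and `y_pred` and returns a float. A minimal sketch of such a callable in plain Python follows; the name `root_mean_squared_error` is only illustrative and is not one of the built-in options.

```python
import math


def root_mean_squared_error(y_true, y_pred):
    """Illustrative custom metric: two equal-length sequences in, one float out."""
    y_true = list(y_true)
    y_pred = list(y_pred)
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))


print(root_mean_squared_error([1.0, 2.0, 3.0], [1.0, 2.0, 5.0]))
```

The source listing calls the metric with keyword arguments (`y_true=...`, `y_pred=...`), so the parameter names of a custom callable need to match.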
Returns:
Type | Description |
---|---|
Tuple[float, pandas.core.frame.DataFrame] | Value of the metric, and a DataFrame with the predictions and the bounds of their estimated interval. |
Source code in skforecast/model_selection_statsmodels/model_selection_statsmodels.py
def backtesting_sarimax(
    y: pd.Series,
    steps: int,
    metric: Union[str, callable],
    initial_train_size: int,
    fixed_train_size: bool=False,
    refit: bool=False,
    order: tuple=(1, 0, 0),
    seasonal_order: tuple=(0, 0, 0, 0),
    trend: str=None,
    alpha: float=0.05,
    exog: Optional[Union[pd.Series, pd.DataFrame]]=None,
    sarimax_kwargs: dict={},
    fit_kwargs: dict={'disp':0},
    verbose: bool=False
) -> Tuple[float, pd.DataFrame]:
    """
    Backtesting (validation) of `SARIMAX` model from statsmodels >= 0.12. The model
    is trained using the `initial_train_size` first observations, then, in each
    iteration, a number of `steps` predictions are evaluated. If refit is `True`,
    the model is re-fitted in each iteration before making predictions.

    https://www.statsmodels.org/dev/examples/notebooks/generated/statespace_forecasting.html

    Parameters
    ----------
    y : pandas Series
        Time series values.
    steps : int
        Number of steps to predict.
    metric : str, callable
        Metric used to quantify the goodness of fit of the model.
        If string:
            {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error'}
        If callable:
            Function with arguments y_true, y_pred that returns a float.
    initial_train_size: int
        Number of samples used in the initial train.
    fixed_train_size: bool, default `False`
        If True, train size does not increase but moves by `steps` in each iteration.
    refit: bool, default False
        Whether to re-fit the model in each iteration.
    order: tuple
        The (p,d,q) order of the model for the number of AR parameters, differences,
        and MA parameters. d must be an integer indicating the integration order
        of the process, while p and q may either be integers indicating the AR
        and MA orders (so that all lags up to those orders are included) or else
        iterables giving specific AR and / or MA lags to include. Default is an
        AR(1) model: (1,0,0).
    seasonal_order: tuple
        The (P,D,Q,s) order of the seasonal component of the model for the AR parameters,
        differences, MA parameters, and periodicity. D must be an integer
        indicating the integration order of the process, while P and Q may either
        be integers indicating the AR and MA orders (so that all lags up to
        those orders are included) or else iterables giving specific AR and / or
        MA lags to include. s is an integer giving the periodicity (number of
        periods in season), often it is 4 for quarterly data or 12 for monthly data.
        Default is no seasonal effect.
    trend: str {'n', 'c', 't', 'ct'}
        Parameter controlling the deterministic trend polynomial A(t). Can be
        specified as a string where 'c' indicates a constant (i.e. a degree zero
        component of the trend polynomial), 't' indicates a linear trend with time,
        and 'ct' is both. Can also be specified as an iterable defining the non-zero
        polynomial exponents to include, in increasing order. For example, [1,1,0,1]
        denotes a + bt + ct^3. Default is to not include a trend component.
    alpha: float, default 0.05
        The significance level for the confidence interval. The default
        alpha = .05 returns a 95% confidence interval.
    exog : pandas Series, pandas DataFrame, default `None`
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    sarimax_kwargs: dict, default `{}`
        Additional keyword arguments passed to SARIMAX constructor. See more in
        https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html#statsmodels.tsa.statespace.sarimax.SARIMAX
    fit_kwargs: dict, default `{'disp':0}`
        Additional keyword arguments passed to SARIMAX fit. See more in
        https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.fit.html#statsmodels.tsa.statespace.sarimax.SARIMAX.fit
    verbose : bool, default `False`
        Print number of folds used for backtesting.

    Returns
    -------
    metric_value: float
        Value of the metric.
    backtest_predictions: pandas DataFrame
        Values predicted and their estimated interval:
            column predicted_mean = predictions.
            remaining columns = lower and upper bounds of the interval,
            as returned by `conf_int`.
    """
    if isinstance(metric, str):
        metric = _get_metric(metric=metric)

    folds = int(np.ceil((len(y) - initial_train_size) / steps))
    remainder = (len(y) - initial_train_size) % steps
    backtest_predictions = []

    if verbose:
        print(f"Number of observations used for training: {initial_train_size}")
        print(f"Number of observations used for backtesting: {len(y) - initial_train_size}")
        print(f" Number of folds: {folds}")
        print(f" Number of steps per fold: {steps}")
        if remainder != 0:
            print(f" Last fold only includes {remainder} observations.")

    if folds > 50 and refit:
        print(
            f"Model will be fit {folds} times. This can take substantial amounts of time. "
            f"If not feasible, try with `refit = False`."
        )

    if refit:
        for i in range(folds):
            # In each iteration the model is fitted before making predictions.
            if fixed_train_size:
                # The train size doesn't increase but moves by `steps` in each iteration.
                train_idx_start = i * steps
                train_idx_end = initial_train_size + i * steps
            else:
                # The train size increases by `steps` in each iteration.
                train_idx_start = 0
                train_idx_end = initial_train_size + i * steps

            if exog is not None:
                next_window_exog = exog.iloc[train_idx_end:train_idx_end + steps]

            if i < folds - 1:  # from the first step to one before the last one.
                if exog is None:
                    model = SARIMAX(
                        endog = y.iloc[train_idx_start:train_idx_end],
                        order = order,
                        seasonal_order = seasonal_order,
                        trend = trend,
                        **sarimax_kwargs
                    ).fit(**fit_kwargs)
                    pred = model.get_forecast(steps=steps)
                    pred = pd.concat((
                        pred.predicted_mean.rename("predicted_mean"),
                        pred.conf_int(alpha=alpha)),
                        axis = 1
                    )
                else:
                    model = SARIMAX(
                        endog = y.iloc[train_idx_start:train_idx_end],
                        exog = exog.iloc[train_idx_start:train_idx_end],
                        order = order,
                        seasonal_order = seasonal_order,
                        trend = trend,
                        **sarimax_kwargs
                    ).fit(**fit_kwargs)
                    pred = model.get_forecast(steps=steps, exog=next_window_exog)
                    pred = pd.concat((
                        pred.predicted_mean.rename("predicted_mean"),
                        pred.conf_int(alpha=alpha)),
                        axis = 1
                    )
            else:
                if remainder == 0:
                    if exog is None:
                        model = SARIMAX(
                            endog = y.iloc[train_idx_start:train_idx_end],
                            order = order,
                            seasonal_order = seasonal_order,
                            trend = trend,
                            **sarimax_kwargs
                        ).fit(**fit_kwargs)
                        pred = model.get_forecast(steps=steps)
                        pred = pd.concat((
                            pred.predicted_mean.rename("predicted_mean"),
                            pred.conf_int(alpha=alpha)),
                            axis = 1
                        )
                    else:
                        model = SARIMAX(
                            endog = y.iloc[train_idx_start:train_idx_end],
                            exog = exog.iloc[train_idx_start:train_idx_end],
                            order = order,
                            seasonal_order = seasonal_order,
                            trend = trend,
                            **sarimax_kwargs
                        ).fit(**fit_kwargs)
                        pred = model.get_forecast(steps=steps, exog=next_window_exog)
                        pred = pd.concat((
                            pred.predicted_mean.rename("predicted_mean"),
                            pred.conf_int(alpha=alpha)),
                            axis = 1
                        )
                else:
                    # Only the remaining steps need to be predicted
                    steps = remainder
                    if exog is None:
                        model = SARIMAX(
                            endog = y.iloc[train_idx_start:train_idx_end],
                            order = order,
                            seasonal_order = seasonal_order,
                            trend = trend,
                            **sarimax_kwargs
                        ).fit(**fit_kwargs)
                        pred = model.get_forecast(steps=steps)
                        pred = pd.concat((
                            pred.predicted_mean.rename("predicted_mean"),
                            pred.conf_int(alpha=alpha)),
                            axis = 1
                        )
                    else:
                        model = SARIMAX(
                            endog = y.iloc[train_idx_start:train_idx_end],
                            exog = exog.iloc[train_idx_start:train_idx_end],
                            order = order,
                            seasonal_order = seasonal_order,
                            trend = trend,
                            **sarimax_kwargs
                        ).fit(**fit_kwargs)
                        pred = model.get_forecast(steps=steps, exog=next_window_exog)
                        pred = pd.concat((
                            pred.predicted_mean.rename("predicted_mean"),
                            pred.conf_int(alpha=alpha)),
                            axis = 1
                        )

            backtest_predictions.append(pred)

    else:
        # Since the model is only fitted with the initial_train_size, the model
        # must be extended in each iteration to include the data needed to make
        # predictions.
        if exog is None:
            model = SARIMAX(
                endog = y.iloc[:initial_train_size],
                order = order,
                seasonal_order = seasonal_order,
                trend = trend,
                **sarimax_kwargs
            ).fit(**fit_kwargs)
        else:
            model = SARIMAX(
                endog = y.iloc[:initial_train_size],
                exog = exog.iloc[:initial_train_size],
                order = order,
                seasonal_order = seasonal_order,
                trend = trend,
                **sarimax_kwargs
            ).fit(**fit_kwargs)

        for i in range(folds):
            last_window_end = initial_train_size + i * steps
            last_window_start = (initial_train_size + i * steps) - steps
            last_window_y = y.iloc[last_window_start:last_window_end]
            if exog is not None:
                last_window_exog = exog.iloc[last_window_start:last_window_end]
                next_window_exog = exog.iloc[last_window_end:last_window_end + steps]

            if i == 0:
                # No extend is needed for the first fold
                if exog is None:
                    pred = model.get_forecast(steps=steps)
                    pred = pd.concat((
                        pred.predicted_mean.rename("predicted_mean"),
                        pred.conf_int(alpha=alpha)),
                        axis = 1
                    )
                else:
                    pred = model.get_forecast(steps=steps, exog=next_window_exog)
                    pred = pd.concat((
                        pred.predicted_mean.rename("predicted_mean"),
                        pred.conf_int(alpha=alpha)),
                        axis = 1
                    )
            elif i < folds - 1:
                if exog is None:
                    model = model.extend(endog=last_window_y)
                    pred = model.get_forecast(steps=steps)
                    pred = pd.concat((
                        pred.predicted_mean.rename("predicted_mean"),
                        pred.conf_int(alpha=alpha)),
                        axis = 1
                    )
                else:
                    model = model.extend(endog=last_window_y, exog=last_window_exog)
                    pred = model.get_forecast(steps=steps, exog=next_window_exog)
                    pred = pd.concat((
                        pred.predicted_mean.rename("predicted_mean"),
                        pred.conf_int(alpha=alpha)),
                        axis = 1
                    )
            else:
                if remainder == 0:
                    if exog is None:
                        model = model.extend(endog=last_window_y)
                        pred = model.get_forecast(steps=steps)
                        pred = pd.concat((
                            pred.predicted_mean.rename("predicted_mean"),
                            pred.conf_int(alpha=alpha)),
                            axis = 1
                        )
                    else:
                        model = model.extend(endog=last_window_y, exog=last_window_exog)
                        pred = model.get_forecast(steps=steps, exog=next_window_exog)
                        pred = pd.concat((
                            pred.predicted_mean.rename("predicted_mean"),
                            pred.conf_int(alpha=alpha)),
                            axis = 1
                        )
                else:
                    # Only the remaining steps need to be predicted
                    steps = remainder
                    if exog is None:
                        model = model.extend(endog=last_window_y)
                        pred = model.get_forecast(steps=steps)
                        pred = pd.concat((
                            pred.predicted_mean.rename("predicted_mean"),
                            pred.conf_int(alpha=alpha)),
                            axis = 1
                        )
                    else:
                        model = model.extend(endog=last_window_y, exog=last_window_exog)
                        pred = model.get_forecast(steps=steps, exog=next_window_exog)
                        pred = pd.concat((
                            pred.predicted_mean.rename("predicted_mean"),
                            pred.conf_int(alpha=alpha)),
                            axis = 1
                        )

            backtest_predictions.append(pred)

    backtest_predictions = pd.concat(backtest_predictions)
    metric_value = metric(
        y_true = y.iloc[initial_train_size: initial_train_size + len(backtest_predictions)],
        y_pred = backtest_predictions['predicted_mean']
    )

    return metric_value, backtest_predictions
cv_sarimax(y, initial_train_size, steps, metric, order=(1, 0, 0), seasonal_order=(0, 0, 0, 0), trend=None, alpha=0.05, exog=None, allow_incomplete_fold=True, sarimax_kwargs={}, fit_kwargs={'disp': 0}, verbose=False)
Cross-validation of a `SARIMAX` model from statsmodels >= 0.12. The order of data is maintained and the training set increases in each iteration.
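To illustrate what "the order of data is maintained and the training set increases in each iteration" means, here is a sketch of an expanding-window splitter in plain Python. It is not skforecast's `time_series_splitter`; the function name and the handling of the final fold are assumptions made for illustration.

```python
def expanding_window_splits(n_obs, initial_train_size, steps, allow_incomplete_fold=True):
    """Yield (train_indices, test_indices) with a training set that grows each fold.

    Hypothetical sketch: assumes an incomplete last fold is kept or dropped
    depending on allow_incomplete_fold.
    """
    train_end = initial_train_size
    while train_end < n_obs:
        test_end = min(train_end + steps, n_obs)
        if test_end - train_end < steps and not allow_incomplete_fold:
            break  # drop the trailing fold that has fewer than `steps` points
        # Training indices always start at 0, so chronological order is preserved.
        yield list(range(train_end)), list(range(train_end, test_end))
        train_end = test_end


for train_idx, test_idx in expanding_window_splits(n_obs=10, initial_train_size=4, steps=4):
    print(len(train_idx), test_idx)
```

Every test fold lies strictly after its training data, which is what distinguishes this scheme from shuffled k-fold cross-validation.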
Parameters:
Name | Type | Description | Default |
---|---|---|---|
y | Series | Time series values. | required |
order | tuple | The (p,d,q) order of the model for the number of AR parameters, differences, and MA parameters. d must be an integer indicating the integration order of the process, while p and q may either be integers indicating the AR and MA orders (so that all lags up to those orders are included) or else iterables giving specific AR and / or MA lags to include. Default is an AR(1) model: (1,0,0). | (1, 0, 0) |
seasonal_order | tuple | The (P,D,Q,s) order of the seasonal component of the model for the AR parameters, differences, MA parameters, and periodicity. D must be an integer indicating the integration order of the process, while P and Q may either be integers indicating the AR and MA orders (so that all lags up to those orders are included) or else iterables giving specific AR and / or MA lags to include. s is an integer giving the periodicity (number of periods in season); it is often 4 for quarterly data or 12 for monthly data. Default is no seasonal effect. | (0, 0, 0, 0) |
trend | str | Parameter controlling the deterministic trend polynomial A(t). Can be specified as a string where 'c' indicates a constant (i.e. a degree zero component of the trend polynomial), 't' indicates a linear trend with time, and 'ct' is both. Can also be specified as an iterable defining the non-zero polynomial exponents to include, in increasing order. For example, [1,1,0,1] denotes a + bt + ct^3. Default is to not include a trend component. | None |
alpha | float | The significance level for the confidence interval. The default alpha = .05 returns a 95% confidence interval. | 0.05 |
initial_train_size | int | Number of samples in the initial train split. | required |
steps | int | Number of steps to predict. | required |
metric | Union[str, callable] | Metric used to quantify the goodness of fit of the model. If string: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error'}. If callable: function with arguments y_true, y_pred that returns a float. | required |
exog | Union[pandas.core.series.Series, pandas.core.frame.DataFrame] | Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i]. | None |
allow_incomplete_fold | bool | Passed to `time_series_splitter`. Whether the last fold may contain fewer than `steps` observations. | True |
sarimax_kwargs | dict | Additional keyword arguments passed to SARIMAX initialization. See more in https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html#statsmodels.tsa.statespace.sarimax.SARIMAX | {} |
fit_kwargs | dict | Additional keyword arguments passed to SARIMAX fit. See more in https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.fit.html#statsmodels.tsa.statespace.sarimax.SARIMAX.fit | {'disp': 0} |
verbose | bool | Print number of folds used for cross-validation. | False |
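The iterable form of `trend` flags which powers of t are included: a 1 in position k means the term t^k is present. The sketch below, with made-up coefficient values and a hypothetical helper name, shows how [1,1,0,1] corresponds to a + bt + ct^3.

```python
def trend_polynomial(flags, coefficients, t):
    """Evaluate the deterministic trend selected by `flags` at time t.

    flags[k] == 1 includes the term t**k; `coefficients` supplies one value per
    included term, in increasing order of power. The values used below are
    illustrative; in a fitted model they would be estimated.
    """
    powers = [k for k, flag in enumerate(flags) if flag]
    if len(powers) != len(coefficients):
        raise ValueError("one coefficient is needed per included power")
    return sum(c * t ** k for c, k in zip(coefficients, powers))


# [1, 1, 0, 1] selects powers 0, 1 and 3: a + b*t + c*t**3
print(trend_polynomial([1, 1, 0, 1], coefficients=[2.0, 0.5, 0.1], t=3))
```

In the same notation, the string 'ct' corresponds to the flags [1, 1] and 'c' to [1].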
Returns:
Type | Description |
---|---|
Tuple[np.ndarray, np.ndarray] | Value of the metric for each partition, and an array with the predictions and the bounds of their estimated interval. |
Source code in skforecast/model_selection_statsmodels/model_selection_statsmodels.py
def cv_sarimax(
    y: pd.Series,
    initial_train_size: int,
    steps: int,
    metric: Union[str, callable],
    order: tuple=(1, 0, 0),
    seasonal_order: tuple=(0, 0, 0, 0),
    trend: str=None,
    alpha: float=0.05,
    exog: Union[pd.Series, pd.DataFrame]=None,
    allow_incomplete_fold: bool=True,
    sarimax_kwargs: dict={},
    fit_kwargs: dict={'disp':0},
    verbose: bool=False
) -> Tuple[np.ndarray, np.ndarray]:
    """
    Cross-validation of `SARIMAX` model from statsmodels >= 0.12. The order of data
    is maintained and the training set increases in each iteration.

    Parameters
    ----------
    y : pandas Series
        Time series values.
    order: tuple
        The (p,d,q) order of the model for the number of AR parameters, differences,
        and MA parameters. d must be an integer indicating the integration order
        of the process, while p and q may either be integers indicating the AR
        and MA orders (so that all lags up to those orders are included) or else
        iterables giving specific AR and / or MA lags to include. Default is an
        AR(1) model: (1,0,0).
    seasonal_order: tuple
        The (P,D,Q,s) order of the seasonal component of the model for the AR parameters,
        differences, MA parameters, and periodicity. D must be an integer
        indicating the integration order of the process, while P and Q may either
        be integers indicating the AR and MA orders (so that all lags up to
        those orders are included) or else iterables giving specific AR and / or
        MA lags to include. s is an integer giving the periodicity (number of
        periods in season), often it is 4 for quarterly data or 12 for monthly data.
        Default is no seasonal effect.
    trend: str {'n', 'c', 't', 'ct'}
        Parameter controlling the deterministic trend polynomial A(t). Can be
        specified as a string where 'c' indicates a constant (i.e. a degree zero
        component of the trend polynomial), 't' indicates a linear trend with time,
        and 'ct' is both. Can also be specified as an iterable defining the non-zero
        polynomial exponents to include, in increasing order. For example, [1,1,0,1]
        denotes a + bt + ct^3. Default is to not include a trend component.
    alpha: float, default 0.05
        The significance level for the confidence interval. The default
        alpha = .05 returns a 95% confidence interval.
    initial_train_size: int
        Number of samples in the initial train split.
    steps : int
        Number of steps to predict.
    metric : str, callable
        Metric used to quantify the goodness of fit of the model.
        If string:
            {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error'}
        If callable:
            Function with arguments y_true, y_pred that returns a float.
    exog : pandas Series, pandas DataFrame, default `None`
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    allow_incomplete_fold : bool, default `True`
        Passed to `time_series_splitter`. Whether the last fold may contain
        fewer than `steps` observations.
    sarimax_kwargs: dict, default {}
        Additional keyword arguments passed to SARIMAX initialization. See more in
        https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html#statsmodels.tsa.statespace.sarimax.SARIMAX
    fit_kwargs: dict, default `{'disp':0}`
        Additional keyword arguments passed to SARIMAX fit. See more in
        https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.fit.html#statsmodels.tsa.statespace.sarimax.SARIMAX.fit
    verbose : bool, default `False`
        Print number of folds used for cross-validation.

    Returns
    -------
    cv_metrics: 1D np.ndarray
        Value of the metric for each partition.
    cv_predictions: 2D np.ndarray
        Values predicted and their estimated interval:
            column 0 = predictions.
            column 1 = lower bound of the interval.
            column 2 = upper bound of the interval.
    """
    if isinstance(metric, str):
        metric = _get_metric(metric=metric)

    # From here on `y` and `exog` are numpy arrays, so they are indexed
    # directly with the split indices (numpy arrays have no `.iloc`).
    if isinstance(y, pd.Series):
        y = y.to_numpy(copy=True)

    if isinstance(exog, (pd.Series, pd.DataFrame)):
        exog = exog.to_numpy(copy=True)

    cv_predictions = []
    cv_metrics = []

    splits = time_series_splitter(
        y = y,
        initial_train_size = initial_train_size,
        steps = steps,
        allow_incomplete_fold = allow_incomplete_fold,
        verbose = verbose
    )

    for train_index, test_index in splits:
        if exog is None:
            model = SARIMAX(
                endog = y[train_index],
                order = order,
                seasonal_order = seasonal_order,
                trend = trend,
                **sarimax_kwargs
            ).fit(**fit_kwargs)
            pred = model.get_forecast(steps=len(test_index))
            pred = np.column_stack((pred.predicted_mean, pred.conf_int(alpha=alpha)))
        else:
            model = SARIMAX(
                endog = y[train_index],
                exog = exog[train_index],
                order = order,
                seasonal_order = seasonal_order,
                trend = trend,
                **sarimax_kwargs
            ).fit(**fit_kwargs)
            pred = model.get_forecast(steps=len(test_index), exog=exog[test_index])
            pred = np.column_stack((pred.predicted_mean, pred.conf_int(alpha=alpha)))

        metric_value = metric(
            y_true = y[test_index],
            y_pred = pred[:, 0]
        )
        cv_metrics.append(metric_value)
        cv_predictions.append(pred)

    return np.array(cv_metrics), np.concatenate(cv_predictions)
grid_search_sarimax(y, param_grid, steps, metric, initial_train_size, fixed_train_size=False, exog=None, refit=False, sarimax_kwargs={}, fit_kwargs={'disp': 0}, verbose=False)
Exhaustive search over specified parameter values for a `SARIMAX` model from statsmodels >= 0.12. Each combination of parameters is validated with `backtesting_sarimax`.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
y | Series | Time series values. | required |
param_grid | dict | Dictionary with parameter names (`str`) as keys and lists of parameter settings to try as values. Allowed parameters in the grid are: order, seasonal_order and trend. | required |
steps | int | Number of steps to predict. | required |
metric | Union[str, callable] | Metric used to quantify the goodness of fit of the model. If string: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error'}. If callable: function with arguments y_true, y_pred that returns a float. | required |
initial_train_size | int | Number of samples used in the initial train. | required |
fixed_train_size | bool | If True, the train size does not increase but moves by `steps` in each iteration. | False |
exog | Union[pandas.core.series.Series, pandas.core.frame.DataFrame] | Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i]. | None |
refit | bool | Whether to re-fit the model in each iteration. | False |
sarimax_kwargs | dict | Additional keyword arguments passed to SARIMAX initialization. See more in https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html#statsmodels.tsa.statespace.sarimax.SARIMAX | {} |
fit_kwargs | dict | Additional keyword arguments passed to SARIMAX fit. See more in https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.fit.html#statsmodels.tsa.statespace.sarimax.SARIMAX.fit | {'disp': 0} |
verbose | bool | Print number of folds used for backtesting. | False |
Returns:
Type | Description |
---|---|
DataFrame | Results for each combination of parameters, sorted by metric in ascending order. Column params = combination of parameters evaluated. Column metric = metric value estimated for the combination of parameters. Additional n columns with param = value. |
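The search evaluates every combination in `param_grid`. The sketch below reproduces that expansion with `itertools.product` rather than scikit-learn's `ParameterGrid`, and ranks the combinations with a stand-in scoring function (`fake_score`, hypothetical) where the real function calls `backtesting_sarimax`. The helper names and the example grid are assumptions for illustration.

```python
from itertools import product

ALLOWED_KEYS = ("order", "seasonal_order", "trend")
DEFAULTS = {"order": [(1, 0, 0)], "seasonal_order": [(0, 0, 0, 0)], "trend": [None]}


def expand_grid(param_grid):
    """Return one dict per parameter combination, filling defaults for missing keys."""
    # Build a new dict so the caller's dictionary is left untouched;
    # keys outside ALLOWED_KEYS are silently dropped in this sketch.
    grid = {key: param_grid.get(key, DEFAULTS[key]) for key in ALLOWED_KEYS}
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def fake_score(params):
    """Stand-in for the backtesting metric (illustrative only): AR order plus MA order."""
    p, _, q = params["order"]
    return float(p + q)


combinations = expand_grid({"order": [(1, 0, 0), (2, 1, 1)], "trend": [None, "c"]})
ranked = sorted(combinations, key=fake_score)  # lower metric first, as in the source
print(len(combinations), ranked[0])
```

The number of models fitted grows multiplicatively with the length of each list, and with `refit=True` each model is additionally re-fitted once per fold.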
Source code in skforecast/model_selection_statsmodels/model_selection_statsmodels.py
def grid_search_sarimax(
    y: pd.Series,
    param_grid: dict,
    steps: int,
    metric: Union[str, callable],
    initial_train_size: int,
    fixed_train_size: bool=False,
    exog: Union[pd.Series, pd.DataFrame]=None,
    refit: bool=False,
    sarimax_kwargs: dict={},
    fit_kwargs: dict={'disp':0},
    verbose: bool=False
) -> pd.DataFrame:
    """
    Exhaustive search over specified parameter values for a `SARIMAX` model from
    statsmodels >= 0.12. Each combination of parameters is validated with
    `backtesting_sarimax`.

    Parameters
    ----------
    y : pandas Series
        Time series values.
    param_grid : dict
        Dictionary with parameter names (`str`) as keys and lists of parameter
        settings to try as values. Allowed parameters in the grid are: order,
        seasonal_order and trend.
    steps : int
        Number of steps to predict.
    metric : str, callable
        Metric used to quantify the goodness of fit of the model.
        If string:
            {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error'}
        If callable:
            Function with arguments y_true, y_pred that returns a float.
    initial_train_size: int
        Number of samples used in the initial train.
    fixed_train_size: bool, default `False`
        If True, train size does not increase but moves by `steps` in each iteration.
    exog : pandas Series, pandas DataFrame, default `None`
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    refit: bool, default False
        Whether to re-fit the model in each iteration.
    sarimax_kwargs: dict, default `{}`
        Additional keyword arguments passed to SARIMAX initialization. See more in
        https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.html#statsmodels.tsa.statespace.sarimax.SARIMAX
    fit_kwargs: dict, default `{'disp':0}`
        Additional keyword arguments passed to SARIMAX fit. See more in
        https://www.statsmodels.org/stable/generated/statsmodels.tsa.statespace.sarimax.SARIMAX.fit.html#statsmodels.tsa.statespace.sarimax.SARIMAX.fit
    verbose : bool, default `False`
        Print number of folds used for backtesting.

    Returns
    -------
    results: pandas DataFrame
        Results for each combination of parameters, sorted by metric in
        ascending order.
            column params = combination of parameters evaluated.
            column metric = metric value estimated for the combination of parameters.
            additional n columns with param = value.
    """
    params_list = []
    metric_list = []
    # bic_list = []
    # aic_list = []

    # Note: missing keys are added to, and unsupported keys removed from,
    # the dictionary passed by the caller.
    if 'order' not in param_grid:
        param_grid['order'] = [(1, 0, 0)]
    if 'seasonal_order' not in param_grid:
        param_grid['seasonal_order'] = [(0, 0, 0, 0)]
    if 'trend' not in param_grid:
        param_grid['trend'] = [None]

    keys_to_ignore = set(param_grid.keys()) - {'order', 'seasonal_order', 'trend'}
    if keys_to_ignore:
        print(
            f'Only arguments: order, seasonal_order and trend are allowed for grid search.'
            f' Ignoring {keys_to_ignore}.'
        )
        for key in keys_to_ignore:
            del param_grid[key]

    param_grid = list(ParameterGrid(param_grid))

    logging.info(
        f"Number of models compared: {len(param_grid)}"
    )

    for params in tqdm(param_grid, ncols=90):
        # backtesting_sarimax returns (metric_value, predictions); only the metric is kept.
        metric_value = backtesting_sarimax(
            y = y,
            exog = exog,
            order = params['order'],
            seasonal_order = params['seasonal_order'],
            trend = params['trend'],
            initial_train_size = initial_train_size,
            fixed_train_size = fixed_train_size,
            steps = steps,
            refit = refit,
            metric = metric,
            sarimax_kwargs = sarimax_kwargs,
            fit_kwargs = fit_kwargs,
            verbose = verbose
        )[0]

        params_list.append(params)
        metric_list.append(metric_value)

        # model = SARIMAX(
        #     endog = y,
        #     exog = exog,
        #     order = params['order'],
        #     seasonal_order = params['seasonal_order'],
        #     trend = params['trend'],
        #     **sarimax_kwargs
        # ).fit(**fit_kwargs)
        # bic_list.append(model.bic)
        # aic_list.append(model.aic)

    results = pd.DataFrame({
        'params': params_list,
        'metric': metric_list,
        #'bic' : bic_list,
        #'aic' : aic_list
    })

    results = results.sort_values(by='metric', ascending=True)
    results = pd.concat([results, results['params'].apply(pd.Series)], axis=1)

    return results