`model_selection`

skforecast.model_selection._validation.backtesting_forecaster

backtesting_forecaster(
    forecaster,
    y,
    cv,
    metric,
    exog=None,
    interval=None,
    n_boot=250,
    random_state=123,
    use_in_sample_residuals=True,
    use_binned_residuals=False,
    n_jobs="auto",
    verbose=False,
    show_progress=True,
)

Backtesting of forecaster model following the folds generated by the TimeSeriesFold class and using the metric(s) provided.

If `forecaster` is already trained and `initial_train_size` is set to `None` in the TimeSeriesFold class, no initial training is performed and all data is used to evaluate the model. However, the first `len(forecaster.last_window)` observations are needed to create the initial predictors, so no predictions are calculated for them.

A copy of the original forecaster is created so that it is not modified during the process.
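The fold-based evaluation described in this section can be pictured with a small stand-alone sketch. It is not the `TimeSeriesFold` implementation: `make_folds` and its arguments are hypothetical names, and the sketch assumes an expanding training window that always starts at the first observation. It only illustrates how an initial training size, a forecast horizon (`steps`) and a `gap` carve a series into train/test index ranges.

```python
def make_folds(n_obs, initial_train_size, steps, gap=0):
    """Return (train_range, test_range) pairs of positional indices.

    Hypothetical helper for illustration only. The training window starts
    at 0 and grows by `steps` each fold; each test window starts `gap`
    observations after the end of the training window and is at most
    `steps` long (the last fold may be shorter).
    """
    folds = []
    train_end = initial_train_size
    while train_end + gap < n_obs:
        test_start = train_end + gap
        test_end = min(test_start + steps, n_obs)
        folds.append((range(0, train_end), range(test_start, test_end)))
        train_end += steps
    return folds


folds = make_folds(n_obs=20, initial_train_size=10, steps=4, gap=0)
for train, test in folds:
    print(f"train 0..{train.stop - 1}  test {test.start}..{test.stop - 1}")
```

With 20 observations, an initial training size of 10 and `steps=4`, the sketch yields three folds, the last of which covers only the two remaining observations.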

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursive, ForecasterDirect)`	Forecaster model. The type check in the source code below also accepts `ForecasterEquivalentDate`.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds. New in version 0.14.0	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i].	`None`
`interval`	`list`	Prediction interval to estimate, given as a sequence of lower and upper percentiles, each between 0 and 100 inclusive. For example, a 95% interval is `interval = [2.5, 97.5]`. If `None`, no intervals are estimated.	`None`
`n_boot`	`int`	Number of bootstrapping iterations used to estimate prediction intervals.	`250`
`random_state`	`int`	Seed for the random generator, so that bootstrapped intervals are reproducible.	`123`
`use_in_sample_residuals`	`bool`	If `True`, residuals from the training data are used as proxy of prediction error to create prediction intervals. If `False`, out_sample_residuals are used if they are already stored inside the forecaster.	`True`
`use_binned_residuals`	`bool`	If `True`, residuals used in each bootstrapping iteration are selected conditioning on the predicted values. If `False`, residuals are selected randomly without conditioning on the predicted values.	`False`
`n_jobs`	`(int, auto)`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds and index of training and validation sets used for backtesting.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
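As an illustration of the `Callable` form of `metric`, the following sketch defines a function with the arguments `y_true`, `y_pred` and `y_train` that returns a float. The name `custom_mase` is hypothetical, and the example works on plain Python lists rather than on the pandas objects a real backtest would pass; it is a minimal sketch of the expected signature, not the library's built-in `mean_absolute_scaled_error`.

```python
def custom_mase(y_true, y_pred, y_train):
    """Mean absolute error of the forecast, scaled by the in-sample
    mean absolute error of a naive one-step-ahead forecast."""
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    naive_errors = [abs(b - a) for a, b in zip(y_train[:-1], y_train[1:])]
    naive_mae = sum(naive_errors) / len(naive_errors)
    return float(mae / naive_mae)


# A list may mix built-in metric names and callables.
metrics = ["mean_absolute_error", custom_mase]

score = custom_mase(y_true=[10, 12], y_pred=[9, 14], y_train=[1, 2, 4, 7])
print(score)
```

A metric that does not need the training data can simply omit `y_train` from its signature, since the description marks that argument as optional.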

Returns:

Name	Type	Description
`metric_values`	`pandas DataFrame`	Value(s) of the metric(s).
`backtest_predictions`	`pandas DataFrame`	Value of predictions and their estimated interval if `interval` is not `None`. Columns: `pred` (predictions), `lower_bound` (lower bound of the interval), `upper_bound` (upper bound of the interval).
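To make the relationship between `interval`, `n_boot`, `random_state` and the `lower_bound`/`upper_bound` columns concrete, here is a deliberately simplified sketch of a residual bootstrap for a single prediction. The helpers `percentile` and `bootstrap_interval` are hypothetical, and the real forecasters propagate sampled residuals through multi-step predictions, which this one-step example does not attempt.

```python
import random


def percentile(values, q):
    """Percentile q (0-100) using linear interpolation between sorted values."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    low = int(pos)
    high = min(low + 1, len(ordered) - 1)
    return ordered[low] + (ordered[high] - ordered[low]) * (pos - low)


def bootstrap_interval(pred, residuals, interval=(2.5, 97.5),
                       n_boot=250, random_state=123):
    """Add randomly drawn residuals to a point prediction and take percentiles."""
    rng = random.Random(random_state)  # fixed seed gives reproducible bounds
    simulated = [pred + rng.choice(residuals) for _ in range(n_boot)]
    return percentile(simulated, interval[0]), percentile(simulated, interval[1])


# Illustrative residuals, not taken from any real model.
residuals = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]
lower, upper = bootstrap_interval(pred=100.0, residuals=residuals)
coverage = 97.5 - 2.5  # nominal coverage implied by the two percentiles
print(lower, upper, coverage)
```

Calling `bootstrap_interval` twice with the same `random_state` returns the same bounds, which is the reproducibility that `random_state` is meant to provide.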

Source code in skforecast/model_selection/_validation.py

def backtesting_forecaster(
    forecaster: object,
    y: pd.Series,
    cv: TimeSeriesFold,
    metric: Union[str, Callable, list],
    exog: Optional[Union[pd.Series, pd.DataFrame]] = None,
    interval: Optional[list] = None,
    n_boot: int = 250,
    random_state: int = 123,
    use_in_sample_residuals: bool = True,
    use_binned_residuals: bool = False,
    n_jobs: Union[int, str] = 'auto',
    verbose: bool = False,
    show_progress: bool = True
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Backtesting of forecaster model following the folds generated by the TimeSeriesFold
    class and using the metric(s) provided.

    If `forecaster` is already trained and `initial_train_size` is set to `None` in the
    TimeSeriesFold class, no initial train will be done and all data will be used
    to evaluate the model. However, the first `len(forecaster.last_window)` observations
    are needed to create the initial predictors, so no predictions are calculated for
    them.

    A copy of the original forecaster is created so that it is not modified during 
    the process.

    Parameters
    ----------
    forecaster : ForecasterRecursive, ForecasterDirect
        Forecaster model.
    y : pandas Series
        Training time series.
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
        **New in version 0.14.0**
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default `None`
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    interval : list, default `None`
        Confidence of the prediction interval estimated. Sequence of percentiles
        to compute, which must be between 0 and 100 inclusive. For example, 
        interval of 95% should be as `interval = [2.5, 97.5]`. If `None`, no
        intervals are estimated.
    n_boot : int, default `250`
        Number of bootstrapping iterations used to estimate prediction
        intervals.
    random_state : int, default `123`
        Sets a seed to the random generator, so that boot intervals are always 
        deterministic.
    use_in_sample_residuals : bool, default `True`
        If `True`, residuals from the training data are used as proxy of prediction 
        error to create prediction intervals. If `False`, out_sample_residuals 
        are used if they are already stored inside the forecaster.
    use_binned_residuals : bool, default `False`
        If `True`, residuals used in each bootstrapping iteration are selected
        conditioning on the predicted values. If `False`, residuals are selected
        randomly without conditioning on the predicted values.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default `False`
        Print number of folds and index of training and validation sets used 
        for backtesting.
    show_progress : bool, default `True`
        Whether to show a progress bar.

    Returns
    -------
    metric_values : pandas DataFrame
        Value(s) of the metric(s).
    backtest_predictions : pandas DataFrame
        Value of predictions and their estimated interval if `interval` is not `None`.

        - column pred: predictions.
        - column lower_bound: lower bound of the interval.
        - column upper_bound: upper bound of the interval.

    """

    forecaters_allowed = [
        'ForecasterRecursive', 
        'ForecasterDirect',
        'ForecasterEquivalentDate'
    ]

    if type(forecaster).__name__ not in forecaters_allowed:
        raise TypeError(
            (f"`forecaster` must be of type {forecaters_allowed}, for all other types of "
             f"forecasters use the functions available in the other `model_selection` "
             f"modules.")
        )

    check_backtesting_input(
        forecaster              = forecaster,
        cv                      = cv,
        y                       = y,
        metric                  = metric,
        interval                = interval,
        n_boot                  = n_boot,
        random_state            = random_state,
        use_in_sample_residuals = use_in_sample_residuals,
        use_binned_residuals    = use_binned_residuals,
        n_jobs                  = n_jobs,
        show_progress           = show_progress
    )

    if type(forecaster).__name__ == 'ForecasterDirect' and \
       forecaster.steps < cv.steps + cv.gap:
        raise ValueError(
            (f"When using a ForecasterDirect, the combination of steps "
             f"+ gap ({cv.steps + cv.gap}) cannot be greater than the `steps` parameter "
             f"declared when the forecaster is initialized ({forecaster.steps}).")
        )

    metric_values, backtest_predictions = _backtesting_forecaster(
        forecaster              = forecaster,
        y                       = y,
        cv                      = cv,
        metric                  = metric,
        exog                    = exog,
        interval                = interval,
        n_boot                  = n_boot,
        random_state            = random_state,
        use_in_sample_residuals = use_in_sample_residuals,
        use_binned_residuals    = use_binned_residuals,
        n_jobs                  = n_jobs,
        verbose                 = verbose,
        show_progress           = show_progress
    )

    return metric_values, backtest_predictions
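The `ForecasterDirect` check in the listing above can be isolated as a small function. This is a sketch that takes plain integers instead of forecaster and fold objects; `check_direct_horizon` and its parameter names are hypothetical, while the condition and message mirror the source.

```python
def check_direct_horizon(forecaster_steps, cv_steps, cv_gap):
    """Raise if the folds ask a direct forecaster to predict further ahead
    than the number of steps it was initialized with."""
    if forecaster_steps < cv_steps + cv_gap:
        raise ValueError(
            f"When using a ForecasterDirect, the combination of steps "
            f"+ gap ({cv_steps + cv_gap}) cannot be greater than the `steps` "
            f"parameter declared when the forecaster is initialized "
            f"({forecaster_steps})."
        )


# 7 steps plus a gap of 3 fits exactly within a forecaster built for 10 steps.
check_direct_horizon(forecaster_steps=10, cv_steps=7, cv_gap=3)
```

A direct forecaster trains one model per step, so it has no model available for horizons beyond the `steps` value it was created with; the gap counts toward that horizon.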

skforecast.model_selection._search.grid_search_forecaster

grid_search_forecaster(
    forecaster,
    y,
    cv,
    param_grid,
    metric,
    exog=None,
    lags_grid=None,
    return_best=True,
    n_jobs="auto",
    verbose=True,
    show_progress=True,
    output_file=None,
)

Exhaustive search over specified parameter values for a Forecaster object. Validation is done using time series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursive, ForecasterDirect)`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds. New in version 0.14.0	required
`param_grid`	`dict`	Dictionary with parameters names (`str`) as keys and lists of parameter settings to try as values.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i].	`None`
`lags_grid`	`(list, dict)`	Lists of lags to try, containing int, lists, numpy ndarray, or range objects. If `dict`, the keys are used as labels in the `results` DataFrame, and the values are used as the lists of lags to try.	`None`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, auto)`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`True`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. Columns: `lags` (lags configuration for each iteration), `lags_label` (descriptive label or alias for the lags), `params` (parameters configuration for each iteration), `metric` (metric value estimated for each iteration), plus n additional columns with param = value.
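The size of an exhaustive search follows directly from `param_grid` and `lags_grid`. The sketch below enumerates the candidates with the standard library only; `expand_grid` is a hypothetical stand-in for the expansion that scikit-learn's `ParameterGrid` performs in the source, and the hyperparameter names are placeholders for whatever the underlying regressor accepts.

```python
from itertools import product


def expand_grid(param_grid):
    """Cartesian product of the value lists, one dict per combination."""
    keys = sorted(param_grid)
    return [
        dict(zip(keys, combo))
        for combo in product(*(param_grid[key] for key in keys))
    ]


param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, 10]}
lags_grid = {"short": 3, "weekly": [1, 7]}  # dict keys act as labels

# Every lags configuration is paired with every parameter combination.
candidates = [
    {"lags_label": label, "lags": lags, "params": params}
    for label, lags in lags_grid.items()
    for params in expand_grid(param_grid)
]
print(len(candidates))  # 2 lags configurations x 6 parameter settings = 12
```

Each candidate is validated by backtesting, so the total cost grows with the product of all list lengths times the number of folds.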

Source code in skforecast/model_selection/_search.py

def grid_search_forecaster(
    forecaster: object,
    y: pd.Series,
    cv: Union[TimeSeriesFold, OneStepAheadFold],
    param_grid: dict,
    metric: Union[str, Callable, list],
    exog: Optional[Union[pd.Series, pd.DataFrame]] = None,
    lags_grid: Optional[Union[list, dict]] = None,
    return_best: bool = True,
    n_jobs: Union[int, str] = 'auto',
    verbose: bool = True,
    show_progress: bool = True,
    output_file: Optional[str] = None
) -> pd.DataFrame:
    """
    Exhaustive search over specified parameter values for a Forecaster object.
    Validation is done using time series backtesting.

    Parameters
    ----------
    forecaster : ForecasterRecursive, ForecasterDirect
        Forecaster model.
    y : pandas Series
        Training time series.
    cv : TimeSeriesFold, OneStepAheadFold
        TimeSeriesFold or OneStepAheadFold object with the information needed to split
        the data into folds.
        **New in version 0.14.0**
    param_grid : dict
        Dictionary with parameters names (`str`) as keys and lists of parameter
        settings to try as values.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default `None`
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    lags_grid : list, dict, default `None`
        Lists of lags to try, containing int, lists, numpy ndarray, or range 
        objects. If `dict`, the keys are used as labels in the `results` 
        DataFrame, and the values are used as the lists of lags to try.
    return_best : bool, default `True`
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default `True`
        Print number of folds used for cv or backtesting.
    show_progress : bool, default `True`
        Whether to show a progress bar.
    output_file : str, default `None`
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column lags: lags configuration for each iteration.
        - column lags_label: descriptive label or alias for the lags.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterGrid(param_grid))

    results = _evaluate_grid_hyperparameters(
                  forecaster    = forecaster,
                  y             = y,
                  cv            = cv,
                  param_grid    = param_grid,
                  metric        = metric,
                  exog          = exog,
                  lags_grid     = lags_grid,
                  return_best   = return_best,
                  n_jobs        = n_jobs,
                  verbose       = verbose,
                  show_progress = show_progress,
                  output_file   = output_file
              )

    return results

skforecast.model_selection._search.random_search_forecaster

random_search_forecaster(
    forecaster,
    y,
    cv,
    param_distributions,
    metric,
    exog=None,
    lags_grid=None,
    n_iter=10,
    random_state=123,
    return_best=True,
    n_jobs="auto",
    verbose=True,
    show_progress=True,
    output_file=None,
)

Random search over specified parameter values or distributions for a Forecaster object. Validation is done using time series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursive, ForecasterDirect)`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds. New in version 0.14.0	required
`param_distributions`	`dict`	Dictionary with parameters names (`str`) as keys and distributions or lists of parameters to try.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i].	`None`
`lags_grid`	`(list, dict)`	Lists of lags to try, containing int, lists, numpy ndarray, or range objects. If `dict`, the keys are used as labels in the `results` DataFrame, and the values are used as the lists of lags to try.	`None`
`n_iter`	`int`	Number of parameter settings that are sampled per lags configuration. n_iter trades off runtime vs quality of the solution.	`10`
`random_state`	`int`	Sets a seed to the random sampling for reproducible output.	`123`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, auto)`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`True`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. Columns: `lags` (lags configuration for each iteration), `lags_label` (descriptive label or alias for the lags), `params` (parameters configuration for each iteration), `metric` (metric value estimated for each iteration), plus n additional columns with param = value.
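The role of `n_iter` and `random_state` can be illustrated with a simplified sampler. This sketch handles only lists of candidate values (not continuous distributions) and draws distinct settings from the full grid; `sample_settings` is a hypothetical helper, not the `ParameterSampler` used in the source, and the hyperparameter names are placeholders.

```python
import random
from itertools import product


def sample_settings(param_distributions, n_iter=10, random_state=123):
    """Draw up to n_iter distinct settings from lists of candidate values."""
    keys = sorted(param_distributions)
    grid = [
        dict(zip(keys, combo))
        for combo in product(*(param_distributions[key] for key in keys))
    ]
    rng = random.Random(random_state)  # same seed, same sample
    return rng.sample(grid, min(n_iter, len(grid)))


param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [3, 5, 10],
}
settings = sample_settings(param_distributions, n_iter=5, random_state=123)
for params in settings:
    print(params)
```

Here 5 of the 12 possible combinations are evaluated per lags configuration, which is the runtime versus quality trade-off that `n_iter` controls.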

Source code in skforecast/model_selection/_search.py

def random_search_forecaster(
    forecaster: object,
    y: pd.Series,
    cv: Union[TimeSeriesFold, OneStepAheadFold],
    param_distributions: dict,
    metric: Union[str, Callable, list],
    exog: Optional[Union[pd.Series, pd.DataFrame]] = None,
    lags_grid: Optional[Union[list, dict]] = None,
    n_iter: int = 10,
    random_state: int = 123,
    return_best: bool = True,
    n_jobs: Union[int, str] = 'auto',
    verbose: bool = True,
    show_progress: bool = True,
    output_file: Optional[str] = None
) -> pd.DataFrame:
    """
    Random search over specified parameter values or distributions for a Forecaster 
    object. Validation is done using time series backtesting.

    Parameters
    ----------
    forecaster : ForecasterRecursive, ForecasterDirect
        Forecaster model.
    y : pandas Series
        Training time series.
    cv : TimeSeriesFold, OneStepAheadFold
        TimeSeriesFold or OneStepAheadFold object with the information needed to split
        the data into folds.
        **New in version 0.14.0**
    param_distributions : dict
        Dictionary with parameters names (`str`) as keys and 
        distributions or lists of parameters to try.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default `None`
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i]. 
    lags_grid : list, dict, default `None`
        Lists of lags to try, containing int, lists, numpy ndarray, or range 
        objects. If `dict`, the keys are used as labels in the `results` 
        DataFrame, and the values are used as the lists of lags to try.
    n_iter : int, default `10`
        Number of parameter settings that are sampled per lags configuration. 
        n_iter trades off runtime vs quality of the solution.
    random_state : int, default `123`
        Sets a seed to the random sampling for reproducible output.
    return_best : bool, default `True`
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default `True`
        Print number of folds used for cv or backtesting.
    show_progress : bool, default `True`
        Whether to show a progress bar.
    output_file : str, default `None`
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column lags: lags configuration for each iteration.
        - column lags_label: descriptive label or alias for the lags.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterSampler(param_distributions, n_iter=n_iter, random_state=random_state))

    results = _evaluate_grid_hyperparameters(
                  forecaster    = forecaster,
                  y             = y,
                  cv            = cv,
                  param_grid    = param_grid,
                  metric        = metric,
                  exog          = exog,
                  lags_grid     = lags_grid,
                  return_best   = return_best,
                  n_jobs        = n_jobs,
                  verbose       = verbose,
                  show_progress = show_progress,
                  output_file   = output_file
              )

    return results

skforecast.model_selection._search.bayesian_search_forecaster

bayesian_search_forecaster(
    forecaster,
    y,
    cv,
    search_space,
    metric,
    exog=None,
    n_trials=10,
    random_state=123,
    return_best=True,
    n_jobs="auto",
    verbose=True,
    show_progress=True,
    output_file=None,
    kwargs_create_study={},
    kwargs_study_optimize={},
)

Bayesian search for hyperparameters of a Forecaster object.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursive, ForecasterDirect)`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds. New in version 0.14.0	required
`search_space`	`Callable(optuna)`	Function with argument `trial` which returns a dictionary with parameters names (`str`) as keys and Trial object from optuna (trial.suggest_float, trial.suggest_int, trial.suggest_categorical) as values.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i].	`None`
`n_trials`	`int`	Number of parameter settings that are sampled in each lag configuration.	`10`
`random_state`	`int`	Sets a seed to the sampling for reproducible output. When a new sampler is passed in `kwargs_create_study`, the seed must be set within the sampler. For example `{'sampler': TPESampler(seed=145)}`.	`123`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, auto)`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`True`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`
`kwargs_create_study`	`dict`	Keyword arguments (key, value mappings) to pass to optuna.create_study(). If default, the direction is set to 'minimize' and a TPESampler(seed=123) sampler is used during optimization.	`{}`
`kwargs_study_optimize`	`dict`	Other keyword arguments (key, value mappings) to pass to study.optimize().	`{}`

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. Columns: `lags` (lags configuration for each iteration), `params` (parameters configuration for each iteration), `metric` (metric value estimated for each iteration), plus n additional columns with param = value.
`best_trial`	`optuna object`	The best optimization result returned as a FrozenTrial optuna object.
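The shape of a `search_space` function may be easier to see in code. In the sketch below, `search_space` takes a `trial` and returns a dictionary of parameter names to suggested values, as the parameter description requires. To keep the example runnable without optuna, it is exercised with `StubTrial`, a hypothetical stand-in that always returns the lower bound or the first choice; in a real search, optuna supplies the `trial` object. The hyperparameter names are placeholders for whatever the underlying regressor accepts.

```python
class StubTrial:
    """Hypothetical stand-in for an optuna trial, for illustration only."""

    def suggest_int(self, name, low, high):
        return low

    def suggest_float(self, name, low, high):
        return low

    def suggest_categorical(self, name, choices):
        return choices[0]


def search_space(trial):
    """Map each hyperparameter name to a value suggested by the trial."""
    return {
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2"]),
    }


params = search_space(StubTrial())
print(params)
```

The function itself, not its result, is what gets passed as the `search_space` argument, so that a fresh set of suggestions can be requested for every trial.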

Source code in skforecast/model_selection/_search.py

def bayesian_search_forecaster(
    forecaster: object,
    y: pd.Series,
    cv: Union[TimeSeriesFold, OneStepAheadFold],
    search_space: Callable,
    metric: Union[str, Callable, list],
    exog: Optional[Union[pd.Series, pd.DataFrame]] = None,
    n_trials: int = 10,
    random_state: int = 123,
    return_best: bool = True,
    n_jobs: Union[int, str] = 'auto',
    verbose: bool = True,
    show_progress: bool = True,
    output_file: Optional[str] = None,
    kwargs_create_study: dict = {},
    kwargs_study_optimize: dict = {}
) -> Tuple[pd.DataFrame, object]:
    """
    Bayesian search for hyperparameters of a Forecaster object.

    Parameters
    ----------
    forecaster : ForecasterRecursive, ForecasterDirect
        Forecaster model.
    y : pandas Series
        Training time series.
    cv : TimeSeriesFold, OneStepAheadFold
        TimeSeriesFold or OneStepAheadFold object with the information needed to split
        the data into folds.
        **New in version 0.14.0**
    search_space : Callable (optuna)
        Function with argument `trial` which returns a dictionary with parameters names 
        (`str`) as keys and Trial object from optuna (trial.suggest_float, 
        trial.suggest_int, trial.suggest_categorical) as values.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default `None`
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    n_trials : int, default `10`
        Number of parameter settings that are sampled in each lag configuration.
    random_state : int, default `123`
        Sets a seed to the sampling for reproducible output. When a new sampler 
        is passed in `kwargs_create_study`, the seed must be set within the 
        sampler. For example `{'sampler': TPESampler(seed=145)}`.
    return_best : bool, default `True`
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default `True`
        Print number of folds used for cv or backtesting.
    show_progress : bool, default `True`
        Whether to show a progress bar.
    output_file : str, default `None`
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**
    kwargs_create_study : dict, default `{}`
        Keyword arguments (key, value mappings) to pass to optuna.create_study().
        If default, the direction is set to 'minimize' and a TPESampler(seed=123) 
        sampler is used during optimization.
    kwargs_study_optimize : dict, default `{}`
        Other keyword arguments (key, value mappings) to pass to study.optimize().

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column lags: lags configuration for each iteration.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration.
        - additional n columns with param = value.
    best_trial : optuna object
        The best optimization result returned as a FrozenTrial optuna object.

    """

    if return_best and exog is not None and (len(exog) != len(y)):
        raise ValueError(
            f"`exog` must have same number of samples as `y`. "
            f"length `exog`: ({len(exog)}), length `y`: ({len(y)})"
        )

    results, best_trial = _bayesian_search_optuna(
                              forecaster            = forecaster,
                              y                     = y,
                              cv                    = cv,
                              exog                  = exog,
                              search_space          = search_space,
                              metric                = metric,
                              n_trials              = n_trials,
                              random_state          = random_state,
                              return_best           = return_best,
                              n_jobs                = n_jobs,
                              verbose               = verbose,
                              show_progress         = show_progress,
                              output_file           = output_file,
                              kwargs_create_study   = kwargs_create_study,
                              kwargs_study_optimize = kwargs_study_optimize
                          )

    return results, best_trial

skforecast.model_selection._validation.backtesting_forecaster_multiseries

backtesting_forecaster_multiseries(
    forecaster,
    series,
    cv,
    metric,
    levels=None,
    add_aggregated_metric=True,
    exog=None,
    interval=None,
    n_boot=250,
    random_state=123,
    use_in_sample_residuals=True,
    n_jobs="auto",
    verbose=False,
    show_progress=True,
    suppress_warnings=False,
)

Backtesting of forecaster model following the folds generated by the TimeSeriesFold class and using the metric(s) provided.

If forecaster is already trained and initial_train_size is set to None in the TimeSeriesFold class, no initial train will be done and all data will be used to evaluate the model. However, the first len(forecaster.last_window) observations are needed to create the initial predictors, so no predictions are calculated for them.

A copy of the original forecaster is created so that it is not modified during the process.
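To make the note about `last_window` concrete, here is a minimal sketch in plain Python (not skforecast code) of which positions can receive predictions when no initial training is done. The function name `prediction_ranges` and its arguments `n_obs`, `window_size`, and `steps` are hypothetical, and the real fold logic in `TimeSeriesFold` has more options that are ignored here.

```python
def prediction_ranges(n_obs, window_size, steps):
    """Return (start, end) index pairs (end exclusive) for each predicted fold.

    Assumption: the forecaster is already trained, so the first `window_size`
    observations are only used to build the initial predictors (lags).
    """
    folds = []
    start = window_size  # no predictions before this position
    while start < n_obs:
        end = min(start + steps, n_obs)  # the last fold may be shorter
        folds.append((start, end))
        start = end
    return folds


folds = prediction_ranges(n_obs=20, window_size=5, steps=6)
print(folds)
```

With 20 observations and a window of 5, the first predicted position is index 5, and the remaining 15 observations are split into folds of at most 6 steps.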

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate, ForecasterRnn)`	Forecaster model.	required
`series`	`pandas DataFrame, dict`	Training time series.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`levels`	`(str, list)`	Time series to be predicted. If `None` all levels will be predicted.	`None`
`add_aggregated_metric`	`bool`	If `True`, and multiple series (`levels`) are predicted, the aggregated metrics (average, weighted average and pooled) are also returned. 'average': the average (arithmetic mean) of all levels. 'weighted_average': the average of the metrics weighted by the number of predicted values of each level. 'pooling': the values of all levels are pooled and then the metric is calculated.	`True`
`exog`	`pandas Series, pandas DataFrame, dict`	Exogenous variables.	`None`
`interval`	`list`	Confidence level of the estimated prediction interval, given as a sequence of percentiles between 0 and 100 inclusive. For example, a 95% interval is specified as `interval = [2.5, 97.5]`. If `None`, no intervals are estimated.	`None`
`n_boot`	`int`	Number of bootstrapping iterations used to estimate prediction intervals.	`250`
`random_state`	`int`	Sets a seed for the random generator so that the bootstrapped intervals are reproducible.	`123`
`use_in_sample_residuals`	`bool`	If `True`, residuals from the training data are used as a proxy for the prediction error to create prediction intervals. If `False`, `out_sample_residuals` are used if they are already stored inside the forecaster.	`True`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds and index of training and validation sets used for backtesting.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`suppress_warnings`	`bool`	If `True`, skforecast warnings will be suppressed during the backtesting process. See skforecast.exceptions.warn_skforecast_categories for more information.	`False`
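The three aggregation methods described for `add_aggregated_metric` can be illustrated with a small sketch in plain Python. This is not the library's implementation. The series names and values are invented for illustration, and a hand-written mean absolute error stands in for the metric.

```python
def mean_absolute_error(y_true, y_pred):
    """Plain-Python MAE used only for this illustration."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical backtest output for two levels with different numbers of predictions
y_true = {"series_a": [10.0, 12.0, 14.0, 16.0], "series_b": [100.0, 110.0]}
y_pred = {"series_a": [11.0, 12.0, 13.0, 18.0], "series_b": [90.0, 115.0]}

per_level = {k: mean_absolute_error(y_true[k], y_pred[k]) for k in y_true}
n_preds = {k: len(y_true[k]) for k in y_true}

# 'average': arithmetic mean of the per-level metrics
average = sum(per_level.values()) / len(per_level)

# 'weighted_average': per-level metrics weighted by number of predicted values
weighted_average = (
    sum(per_level[k] * n_preds[k] for k in per_level) / sum(n_preds.values())
)

# 'pooling': pool the values of all levels, then compute the metric once
pooled_true = [v for k in y_true for v in y_true[k]]
pooled_pred = [v for k in y_true for v in y_pred[k]]
pooling = mean_absolute_error(pooled_true, pooled_pred)

print(per_level, average, weighted_average, pooling)
```

For a metric such as MAE, which is itself a mean of per-observation errors, the weighted average and the pooled value coincide. They generally differ for metrics that are not simple means of per-observation errors, for example scaled metrics.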

Returns:

Name	Type	Description
`metrics_levels`	`pandas DataFrame`	Value(s) of the metric(s). The index contains the levels and the columns contain the metrics.
`backtest_predictions`	`pandas DataFrame`	Predicted values and, if `interval` is not `None`, their estimated interval. If there is more than one level, this structure is repeated for each of them. Columns: `pred` (predictions), `lower_bound` (lower bound of the interval), `upper_bound` (upper bound of the interval).
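As a rough illustration of how `interval`, `n_boot`, and `random_state` relate to the `lower_bound` and `upper_bound` columns, the sketch below bootstraps residuals around a single point prediction and takes percentiles. It is a simplified, one-step example under stated assumptions, not skforecast's implementation. The residual values, the prediction, and the helper `percentile` are all hypothetical.

```python
import random


def percentile(values, q):
    """Percentile with linear interpolation, q in [0, 100]."""
    ordered = sorted(values)
    pos = (len(ordered) - 1) * q / 100
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (pos - lo)


rng = random.Random(123)                               # plays the role of random_state
residuals = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 2.0]     # hypothetical in-sample residuals
pred = 50.0                                            # hypothetical point prediction
n_boot = 250
interval = [2.5, 97.5]                                 # 95% interval

# Each bootstrap iteration adds one resampled residual to the point prediction
boot_preds = [pred + rng.choice(residuals) for _ in range(n_boot)]

lower_bound = percentile(boot_preds, interval[0])
upper_bound = percentile(boot_preds, interval[1])
print(lower_bound, pred, upper_bound)
```

Fixing the seed makes the resampling, and therefore the bounds, identical across runs.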

Source code in skforecast/model_selection/_validation.py

def backtesting_forecaster_multiseries(
    forecaster: object,
    series: Union[pd.DataFrame, dict],
    cv: TimeSeriesFold,
    metric: Union[str, Callable, list],
    levels: Optional[Union[str, list]] = None,
    add_aggregated_metric: bool = True,
    exog: Optional[Union[pd.Series, pd.DataFrame, dict]] = None,
    interval: Optional[list] = None,
    n_boot: int = 250,
    random_state: int = 123,
    use_in_sample_residuals: bool = True,
    n_jobs: Union[int, str] = 'auto',
    verbose: bool = False,
    show_progress: bool = True,
    suppress_warnings: bool = False
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Backtesting of forecaster model following the folds generated by the TimeSeriesFold
    class and using the metric(s) provided.

    If `forecaster` is already trained and `initial_train_size` is set to `None` in the
    TimeSeriesFold class, no initial train will be done and all data will be used
    to evaluate the model. However, the first `len(forecaster.last_window)` observations
    are needed to create the initial predictors, so no predictions are calculated for
    them.

    A copy of the original forecaster is created so that it is not modified during 
    the process.

    Parameters
    ----------
    forecaster : ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate, ForecasterRnn
        Forecaster model.
    series : pandas DataFrame, dict
        Training time series.
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    levels : str, list, default `None`
        Time series to be predicted. If `None` all levels will be predicted.
    add_aggregated_metric : bool, default `True`
        If `True`, and multiple series (`levels`) are predicted, the aggregated
        metrics (average, weighted average and pooled) are also returned.

        - 'average': the average (arithmetic mean) of all levels.
        - 'weighted_average': the average of the metrics weighted by the number of
        predicted values of each level.
        - 'pooling': the values of all levels are pooled and then the metric is
        calculated.
    exog : pandas Series, pandas DataFrame, dict, default `None`
        Exogenous variables.
    interval : list, default `None`
        Confidence of the prediction interval estimated. Sequence of percentiles
        to compute, which must be between 0 and 100 inclusive. If `None`, no
        intervals are estimated.
    n_boot : int, default `250`
        Number of bootstrapping iterations used to estimate prediction
        intervals.
    random_state : int, default `123`
        Sets a seed to the random generator, so that boot intervals are always 
        deterministic.
    use_in_sample_residuals : bool, default `True`
        If `True`, residuals from the training data are used as proxy of prediction 
        error to create prediction intervals. If `False`, out_sample_residuals 
        are used if they are already stored inside the forecaster.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default `False`
        Print number of folds and index of training and validation sets used 
        for backtesting.
    show_progress : bool, default `True`
        Whether to show a progress bar.
    suppress_warnings : bool, default `False`
        If `True`, skforecast warnings will be suppressed during the backtesting 
        process. See skforecast.exceptions.warn_skforecast_categories for more
        information.

    Returns
    -------
    metrics_levels : pandas DataFrame
        Value(s) of the metric(s). Index are the levels and columns the metrics.
    backtest_predictions : pandas DataFrame
        Value of predictions and their estimated interval if `interval` is not `None`.
        If there is more than one level, this structure will be repeated for each of them.

        - column pred: predictions.
        - column lower_bound: lower bound of the interval.
        - column upper_bound: upper bound of the interval.

    """

    multi_series_forecasters = [
        'ForecasterRecursiveMultiSeries', 
        'ForecasterDirectMultiVariate',
        'ForecasterRnn'
    ]

    forecaster_name = type(forecaster).__name__

    if forecaster_name not in multi_series_forecasters:
        raise TypeError(
            (f"`forecaster` must be of type {multi_series_forecasters}, "
             f"for all other types of forecasters use the functions available in "
             f"the `model_selection` module. Got {forecaster_name}")
        )

    check_backtesting_input(
        forecaster              = forecaster,
        cv                      = cv,
        metric                  = metric,
        add_aggregated_metric   = add_aggregated_metric,
        series                  = series,
        exog                    = exog,
        interval                = interval,
        n_boot                  = n_boot,
        random_state            = random_state,
        use_in_sample_residuals = use_in_sample_residuals,
        n_jobs                  = n_jobs,
        show_progress           = show_progress,
        suppress_warnings       = suppress_warnings
    )

    metrics_levels, backtest_predictions = _backtesting_forecaster_multiseries(
        forecaster              = forecaster,
        series                  = series,
        cv                      = cv,
        levels                  = levels,
        metric                  = metric,
        add_aggregated_metric   = add_aggregated_metric,
        exog                    = exog,
        interval                = interval,
        n_boot                  = n_boot,
        random_state            = random_state,
        use_in_sample_residuals = use_in_sample_residuals,
        n_jobs                  = n_jobs,
        verbose                 = verbose,
        show_progress           = show_progress,
        suppress_warnings       = suppress_warnings
    )

    return metrics_levels, backtest_predictions

skforecast.model_selection._search.grid_search_forecaster_multiseries

grid_search_forecaster_multiseries(
    forecaster,
    series,
    cv,
    param_grid,
    metric,
    aggregate_metric=[
        "weighted_average",
        "average",
        "pooling",
    ],
    levels=None,
    exog=None,
    lags_grid=None,
    return_best=True,
    n_jobs="auto",
    verbose=True,
    show_progress=True,
    suppress_warnings=False,
    output_file=None,
)

Exhaustive search over specified parameter values for a Forecaster object. Validation is done using multi-series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate)`	Forecaster model.	required
`series`	`pandas DataFrame, dict`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds. New in version 0.14.0	required
`param_grid`	`dict`	Dictionary with parameter names (`str`) as keys and lists of parameter settings to try as values.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`aggregate_metric`	`(str, list)`	Aggregation method/s used to combine the metric/s of all levels (series) when multiple levels are predicted. If list, the first aggregation method is used to select the best parameters. 'average': the average (arithmetic mean) of all levels. 'weighted_average': the average of the metrics weighted by the number of predicted values of each level. 'pooling': the values of all levels are pooled and then the metric is calculated.	`['weighted_average', 'average', 'pooling']`
`levels`	`(str, list)`	Level (`str`) or levels (`list`) at which the forecaster is optimized. If `None`, all levels are taken into account.	`None`
`exog`	`pandas Series, pandas DataFrame, dict`	Exogenous variables.	`None`
`lags_grid`	`(list, dict)`	Lists of lags to try, containing int, lists, numpy ndarray, or range objects. If `dict`, the keys are used as labels in the `results` DataFrame, and the values are used as the lists of lags to try.	`None`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`True`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`suppress_warnings`	`bool`	If `True`, skforecast warnings will be suppressed during the hyperparameter search. See skforecast.exceptions.warn_skforecast_categories for more information.	`False`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`
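The `metric` parameter accepts a callable with arguments `y_true`, `y_pred`, and an optional `y_train`. A minimal sketch of such a function is shown below. The name `scaled_mae` is hypothetical, and plain Python sequences are assumed for simplicity (in practice the inputs would be array-like).

```python
def scaled_mae(y_true, y_pred, y_train=None):
    """MAE, optionally scaled by the in-sample error of a naive forecast.

    Hypothetical custom metric: returns a float, as required for `metric`.
    """
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    if y_train is None or len(y_train) < 2:
        return float(mae)

    # Mean absolute one-step difference of the training data (naive forecast error)
    naive_error = sum(
        abs(y_train[i] - y_train[i - 1]) for i in range(1, len(y_train))
    ) / (len(y_train) - 1)
    return float(mae / naive_error) if naive_error > 0 else float("inf")


print(scaled_mae([3, 5], [2, 7]))
print(scaled_mae([3, 5], [2, 7], y_train=[1, 2, 4, 7]))
```

A function like this could be passed on its own or inside a list together with string metric names.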

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. Columns: `levels` (levels configuration for each iteration), `lags` (lags configuration for each iteration), `lags_label` (descriptive label or alias for the lags), `params` (parameters configuration for each iteration), `metric` (metric value estimated for each iteration, aggregated across levels according to `aggregate_metric`), plus n additional columns with param = value.
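To show how many candidates an exhaustive search evaluates, the sketch below expands a parameter grid with `itertools.product` and combines it with a `lags_grid` given as a dict. It only mimics the Cartesian expansion that the source performs with scikit-learn's `ParameterGrid`, and the ordering may differ. The helper `expand_grid` and the parameter names `n_estimators` and `max_depth` are hypothetical and depend on the regressor used.

```python
from itertools import product


def expand_grid(param_grid):
    """Cartesian product of all parameter settings, one dict per combination."""
    keys = sorted(param_grid)
    return [
        dict(zip(keys, combo))
        for combo in product(*(param_grid[k] for k in keys))
    ]


param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, 10]}
lags_grid = {"short": 3, "long": [1, 2, 3, 24]}  # dict keys act as labels

# Every lags configuration is paired with every parameter combination
candidates = [
    {"lags_label": label, "lags": lags, "params": params}
    for label, lags in lags_grid.items()
    for params in expand_grid(param_grid)
]

print(len(candidates))
print(candidates[0])
```

Two lags configurations times six parameter combinations give twelve candidates, each of which would be validated by backtesting.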

Source code in skforecast/model_selection/_search.py

def grid_search_forecaster_multiseries(
    forecaster: object,
    series: Union[pd.DataFrame, dict],
    cv: Union[TimeSeriesFold, OneStepAheadFold],
    param_grid: dict,
    metric: Union[str, Callable, list],
    aggregate_metric: Union[str, list] = ['weighted_average', 'average', 'pooling'],
    levels: Optional[Union[str, list]] = None,
    exog: Optional[Union[pd.Series, pd.DataFrame, dict]] = None,
    lags_grid: Optional[Union[list, dict]] = None,
    return_best: bool = True,
    n_jobs: Union[int, str] = 'auto',
    verbose: bool = True,
    show_progress: bool = True,
    suppress_warnings: bool = False,
    output_file: Optional[str] = None
) -> pd.DataFrame:
    """
    Exhaustive search over specified parameter values for a Forecaster object.
    Validation is done using multi-series backtesting.

    Parameters
    ----------
    forecaster : ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate
        Forecaster model.
    series : pandas DataFrame, dict
        Training time series.
    cv : TimeSeriesFold, OneStepAheadFold
        TimeSeriesFold or OneStepAheadFold object with the information needed to split
        the data into folds.
        **New in version 0.14.0**
    param_grid : dict
        Dictionary with parameter names (`str`) as keys and lists of parameter
        settings to try as values.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    aggregate_metric : str, list, default `['weighted_average', 'average', 'pooling']`
        Aggregation method/s used to combine the metric/s of all levels (series)
        when multiple levels are predicted. If list, the first aggregation method
        is used to select the best parameters.

        - 'average': the average (arithmetic mean) of all levels.
        - 'weighted_average': the average of the metrics weighted by the number of
        predicted values of each level.
        - 'pooling': the values of all levels are pooled and then the metric is
        calculated.
    levels : str, list, default `None`
        Level (`str`) or levels (`list`) at which the forecaster is optimized.
        If `None`, all levels are taken into account.
    exog : pandas Series, pandas DataFrame, dict, default `None`
        Exogenous variables.
    lags_grid : list, dict, default `None`
        Lists of lags to try, containing int, lists, numpy ndarray, or range 
        objects. If `dict`, the keys are used as labels in the `results` 
        DataFrame, and the values are used as the lists of lags to try.
    return_best : bool, default `True`
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default `True`
        Print number of folds used for cv or backtesting.
    show_progress : bool, default `True`
        Whether to show a progress bar.
    suppress_warnings : bool, default `False`
        If `True`, skforecast warnings will be suppressed during the hyperparameter 
        search. See skforecast.exceptions.warn_skforecast_categories for more
        information.
    output_file : str, default `None`
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column levels: levels configuration for each iteration.
        - column lags: lags configuration for each iteration.
        - column lags_label: descriptive label or alias for the lags.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration, aggregated 
        across levels according to `aggregate_metric`.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterGrid(param_grid))

    results = _evaluate_grid_hyperparameters_multiseries(
                  forecaster        = forecaster,
                  series            = series,
                  cv                = cv,
                  param_grid        = param_grid,
                  metric            = metric,
                  aggregate_metric  = aggregate_metric,
                  levels            = levels,
                  exog              = exog,
                  lags_grid         = lags_grid,
                  n_jobs            = n_jobs,
                  return_best       = return_best,
                  verbose           = verbose,
                  show_progress     = show_progress,
                  suppress_warnings = suppress_warnings,
                  output_file       = output_file
              )

    return results

skforecast.model_selection._search.random_search_forecaster_multiseries

random_search_forecaster_multiseries(
    forecaster,
    series,
    cv,
    param_distributions,
    metric,
    aggregate_metric=[
        "weighted_average",
        "average",
        "pooling",
    ],
    levels=None,
    exog=None,
    lags_grid=None,
    n_iter=10,
    random_state=123,
    return_best=True,
    n_jobs="auto",
    verbose=True,
    show_progress=True,
    suppress_warnings=False,
    output_file=None,
)

Random search over specified parameter values or distributions for a Forecaster object. Validation is done using multi-series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate)`	Forecaster model.	required
`series`	`pandas DataFrame, dict`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds.	required
`param_distributions`	`dict`	Dictionary with parameter names (`str`) as keys and distributions or lists of parameters to try as values.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`aggregate_metric`	`(str, list)`	Aggregation method/s used to combine the metric/s of all levels (series) when multiple levels are predicted. If list, the first aggregation method is used to select the best parameters. 'average': the average (arithmetic mean) of all levels. 'weighted_average': the average of the metrics weighted by the number of predicted values of each level. 'pooling': the values of all levels are pooled and then the metric is calculated.	`['weighted_average', 'average', 'pooling']`
`levels`	`(str, list)`	Level (`str`) or levels (`list`) at which the forecaster is optimized. If `None`, all levels are taken into account.	`None`
`exog`	`pandas Series, pandas DataFrame, dict`	Exogenous variables.	`None`
`lags_grid`	`(list, dict)`	Lists of lags to try, containing int, lists, numpy ndarray, or range objects. If `dict`, the keys are used as labels in the `results` DataFrame, and the values are used as the lists of lags to try.	`None`
`n_iter`	`int`	Number of parameter settings that are sampled per lags configuration. n_iter trades off runtime vs quality of the solution.	`10`
`random_state`	`int`	Sets a seed to the random sampling for reproducible output.	`123`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`True`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`suppress_warnings`	`bool`	If `True`, skforecast warnings will be suppressed during the hyperparameter search. See skforecast.exceptions.warn_skforecast_categories for more information.	`False`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`
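Because `output_file` is written as tab-separated values, the saved results can be read back with the standard `csv` module. The sketch below uses an in-memory string in place of a real file. The column names and rows are invented for illustration and are not guaranteed to match the actual output.

```python
import csv
import io

# Hypothetical content of a results file (columns and values are made up)
tsv_text = (
    "lags_label\tparams\tmean_absolute_error__weighted_average\n"
    "short\t{'max_depth': 3}\t1.25\n"
    "long\t{'max_depth': 5}\t0.98\n"
)

# With a real file: open(output_file, newline="") instead of io.StringIO
rows = list(csv.DictReader(io.StringIO(tsv_text), delimiter="\t"))

# Values are read as strings, so convert before comparing
metric_col = "mean_absolute_error__weighted_average"
best = min(rows, key=lambda row: float(row[metric_col]))
print(best["lags_label"], best[metric_col])
```

Reading the file this way makes it possible to inspect partial results of a long search without keeping the Python session alive.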

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. Columns: `levels` (levels configuration for each iteration), `lags` (lags configuration for each iteration), `lags_label` (descriptive label or alias for the lags), `params` (parameters configuration for each iteration), `metric` (metric value estimated for each iteration, aggregated across levels according to `aggregate_metric`), plus n additional columns with param = value.
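The roles of `n_iter` and `random_state` can be sketched with the standard library alone. The helper `sample_params` below is hypothetical: it handles only lists and samples with replacement, so it is a simplification of scikit-learn's `ParameterSampler`, which the source uses. The parameter names are also assumptions.

```python
import random


def sample_params(param_distributions, n_iter, random_state):
    """Draw `n_iter` parameter settings from lists of candidate values."""
    rng = random.Random(random_state)  # fixed seed gives reproducible draws
    keys = sorted(param_distributions)
    return [
        {k: rng.choice(param_distributions[k]) for k in keys}
        for _ in range(n_iter)
    ]


param_distributions = {
    "n_estimators": [50, 100, 200, 400],
    "max_depth": [3, 5, 10, None],
}

first_run = sample_params(param_distributions, n_iter=5, random_state=123)
second_run = sample_params(param_distributions, n_iter=5, random_state=123)

print(first_run == second_run)  # same seed, same sampled settings
```

A larger `n_iter` explores more of the search space at the cost of more backtesting runs per lags configuration.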

Source code in skforecast/model_selection/_search.py

def random_search_forecaster_multiseries(
    forecaster: object,
    series: Union[pd.DataFrame, dict],
    cv: Union[TimeSeriesFold, OneStepAheadFold],
    param_distributions: dict,
    metric: Union[str, Callable, list],
    aggregate_metric: Union[str, list] = ['weighted_average', 'average', 'pooling'],
    levels: Optional[Union[str, list]] = None,
    exog: Optional[Union[pd.Series, pd.DataFrame, dict]] = None,
    lags_grid: Optional[Union[list, dict]] = None,
    n_iter: int = 10,
    random_state: int = 123,
    return_best: bool = True,
    n_jobs: Union[int, str] = 'auto',
    verbose: bool = True,
    show_progress: bool = True,
    suppress_warnings: bool = False,
    output_file: Optional[str] = None
) -> pd.DataFrame:
    """
    Random search over specified parameter values or distributions for a Forecaster 
    object. Validation is done using multi-series backtesting.

    Parameters
    ----------
    forecaster : ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate
        Forecaster model.
    series : pandas DataFrame, dict
        Training time series.
    cv : TimeSeriesFold, OneStepAheadFold
        TimeSeriesFold or OneStepAheadFold object with the information needed to split
        the data into folds.
    param_distributions : dict
        Dictionary with parameter names (`str`) as keys and distributions or 
        lists of parameters to try.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    aggregate_metric : str, list, default `['weighted_average', 'average', 'pooling']`
        Aggregation method/s used to combine the metric/s of all levels (series)
        when multiple levels are predicted. If list, the first aggregation method
        is used to select the best parameters.

        - 'average': the average (arithmetic mean) of all levels.
        - 'weighted_average': the average of the metrics weighted by the number of
        predicted values of each level.
        - 'pooling': the values of all levels are pooled and then the metric is
        calculated.
    levels : str, list, default `None`
        Level (`str`) or levels (`list`) at which the forecaster is optimized.
        If `None`, all levels are taken into account.
    exog : pandas Series, pandas DataFrame, dict, default `None`
        Exogenous variables.
    lags_grid : list, dict, default `None`
        Lists of lags to try, containing int, lists, numpy ndarray, or range 
        objects. If `dict`, the keys are used as labels in the `results` 
        DataFrame, and the values are used as the lists of lags to try.
    n_iter : int, default `10`
        Number of parameter settings that are sampled per lags configuration. 
        n_iter trades off runtime vs quality of the solution.
    random_state : int, default `123`
        Sets a seed to the random sampling for reproducible output.
    return_best : bool, default `True`
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default `True`
        Print number of folds used for cv or backtesting.
    show_progress : bool, default `True`
        Whether to show a progress bar.
    suppress_warnings : bool, default `False`
        If `True`, skforecast warnings will be suppressed during the hyperparameter 
        search. See skforecast.exceptions.warn_skforecast_categories for more
        information.
    output_file : str, default `None`
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column levels: levels configuration for each iteration.
        - column lags: lags configuration for each iteration.
        - column lags_label: descriptive label or alias for the lags.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration, aggregated 
        across levels according to `aggregate_metric`.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterSampler(param_distributions, n_iter=n_iter, 
                                       random_state=random_state))

    results = _evaluate_grid_hyperparameters_multiseries(
                  forecaster        = forecaster,
                  series            = series,
                  cv                = cv,
                  param_grid        = param_grid,
                  metric            = metric,
                  aggregate_metric  = aggregate_metric,
                  levels            = levels,
                  exog              = exog,
                  lags_grid         = lags_grid,
                  return_best       = return_best,
                  n_jobs            = n_jobs,
                  verbose           = verbose,
                  show_progress     = show_progress,
                  suppress_warnings = suppress_warnings,
                  output_file       = output_file
              )

    return results

skforecast.model_selection._search.bayesian_search_forecaster_multiseries

bayesian_search_forecaster_multiseries(
    forecaster,
    series,
    cv,
    search_space,
    metric,
    aggregate_metric=[
        "weighted_average",
        "average",
        "pooling",
    ],
    levels=None,
    exog=None,
    n_trials=10,
    random_state=123,
    return_best=True,
    n_jobs="auto",
    verbose=True,
    show_progress=True,
    suppress_warnings=False,
    output_file=None,
    kwargs_create_study={},
    kwargs_study_optimize={},
)

Bayesian search for hyperparameters of a Forecaster object using optuna library.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate)`	Forecaster model.	required
`series`	`pandas DataFrame, dict`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds.	required
`search_space`	`Callable`	Function with argument `trial` which returns a dictionary with parameters names (`str`) as keys and Trial object from optuna (trial.suggest_float, trial.suggest_int, trial.suggest_categorical) as values.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`aggregate_metric`	`(str, list)`	Aggregation method/s used to combine the metric/s of all levels (series) when multiple levels are predicted. If list, the first aggregation method is used to select the best parameters. 'average': the average (arithmetic mean) of all levels. 'weighted_average': the average of the metrics weighted by the number of predicted values of each level. 'pooling': the values of all levels are pooled and then the metric is calculated.	`['weighted_average', 'average', 'pooling']`
`levels`	`(str, list)`	level (`str`) or levels (`list`) at which the forecaster is optimized. If `None`, all levels are taken into account.	`None`
`exog`	`pandas Series, pandas DataFrame, dict`	Exogenous variables.	`None`
`n_trials`	`int`	Number of trials (parameter settings) sampled during the optimization.	`10`
`random_state`	`int`	Sets a seed to the sampling for reproducible output.	`123`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, auto)`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`True`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`suppress_warnings`	`bool`	If `True`, skforecast warnings will be suppressed during the hyperparameter search. See skforecast.exceptions.warn_skforecast_categories for more information.	`False`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`
`kwargs_create_study`	`dict`	Keyword arguments (key, value mappings) to pass to optuna.create_study(). If default, the direction is set to 'minimize' and a TPESampler(seed=123) sampler is used during optimization.	`{}`
`kwargs_study_optimize`	`dict`	Other keyword arguments (key, value mappings) to pass to study.optimize().	`{}`
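The three `aggregate_metric` options can be compared on a small example. The sketch below is plain Python and does not call skforecast; the series names and values are invented for illustration, and mean absolute error is used as the metric.

```python
# Illustrative only: plain-Python versions of the three aggregation modes.
# The series names and numbers below are made up for the example.

def mean_absolute_error(y_true, y_pred):
    """Mean absolute error of two equally long sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

# Two levels (series) with a different number of predicted values.
y_true = {"series_a": [10.0, 12.0, 14.0, 16.0], "series_b": [5.0, 7.0]}
y_pred = {"series_a": [11.0, 11.0, 15.0, 15.0], "series_b": [8.0, 10.0]}

per_level = {k: mean_absolute_error(y_true[k], y_pred[k]) for k in y_true}
n_preds = {k: len(y_pred[k]) for k in y_pred}

# 'average': arithmetic mean of the per-level metrics.
average = sum(per_level.values()) / len(per_level)

# 'weighted_average': per-level metrics weighted by number of predictions.
weighted_average = (
    sum(per_level[k] * n_preds[k] for k in per_level) / sum(n_preds.values())
)

# 'pooling': pool the values of all levels, then compute the metric once.
pooled_true = [v for k in y_true for v in y_true[k]]
pooled_pred = [v for k in y_true for v in y_pred[k]]
pooling = mean_absolute_error(pooled_true, pooled_pred)

print(per_level, average, weighted_average, pooling)
```

For mean absolute error the weighted average and pooling coincide, because the metric is itself a plain mean of per-observation errors. They generally differ for metrics that are not, such as a root mean squared error.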

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. column levels: levels configuration for each iteration. column lags: lags configuration for each iteration. column params: parameters configuration for each iteration. column metric: metric value estimated for each iteration. The resulting metric will be the average of the optimization of all levels. additional n columns with param = value.
`best_trial`	`optuna object`	The best optimization result returned as a FrozenTrial optuna object.

Source code in skforecast/model_selection/_search.py

def bayesian_search_forecaster_multiseries(
    forecaster: object,
    series: Union[pd.DataFrame, dict],
    cv: Union[TimeSeriesFold, OneStepAheadFold],
    search_space: Callable,
    metric: Union[str, Callable, list],
    aggregate_metric: Union[str, list] = ['weighted_average', 'average', 'pooling'],
    levels: Optional[Union[str, list]] = None,
    exog: Optional[Union[pd.Series, pd.DataFrame, dict]] = None,
    n_trials: int = 10,
    random_state: int = 123,
    return_best: bool = True,
    n_jobs: Union[int, str] = 'auto',
    verbose: bool = True,
    show_progress: bool = True,
    suppress_warnings: bool = False,
    output_file: Optional[str] = None,
    kwargs_create_study: dict = {},
    kwargs_study_optimize: dict = {}
) -> Tuple[pd.DataFrame, object]:
    """
    Bayesian search for hyperparameters of a Forecaster object using optuna library.

    Parameters
    ----------
    forecaster : ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate
        Forecaster model.
    series : pandas DataFrame, dict
        Training time series.
    search_space : Callable
        Function with argument `trial` which returns a dictionary with parameters names 
        (`str`) as keys and Trial object from optuna (trial.suggest_float, 
        trial.suggest_int, trial.suggest_categorical) as values.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    aggregate_metric : str, list, default `['weighted_average', 'average', 'pooling']`
        Aggregation method/s used to combine the metric/s of all levels (series)
        when multiple levels are predicted. If list, the first aggregation method
        is used to select the best parameters.

        - 'average': the average (arithmetic mean) of all levels.
        - 'weighted_average': the average of the metrics weighted by the number of
        predicted values of each level.
        - 'pooling': the values of all levels are pooled and then the metric is
        calculated.
    levels : str, list, default `None`
        level (`str`) or levels (`list`) at which the forecaster is optimized. 
        If `None`, all levels are taken into account.
    exog : pandas Series, pandas DataFrame, dict, default `None`
        Exogenous variables.
    n_trials : int, default `10`
        Number of parameter settings that are sampled in each lag configuration.
    random_state : int, default `123`
        Sets a seed to the sampling for reproducible output.
    return_best : bool, default `True`
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default `True`
        Print number of folds used for cv or backtesting.
    show_progress : bool, default `True`
        Whether to show a progress bar.
    suppress_warnings: bool, default `False`
        If `True`, skforecast warnings will be suppressed during the hyperparameter
        search. See skforecast.exceptions.warn_skforecast_categories for more
        information.
    output_file : str, default `None`
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**
    kwargs_create_study : dict, default `{}`
        Keyword arguments (key, value mappings) to pass to optuna.create_study().
        If default, the direction is set to 'minimize' and a TPESampler(seed=123) 
        sampler is used during optimization.
    kwargs_study_optimize : dict, default `{}`
        Other keyword arguments (key, value mappings) to pass to study.optimize().

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column levels: levels configuration for each iteration.
        - column lags: lags configuration for each iteration.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration. The resulting 
        metric will be the average of the optimization of all levels.
        - additional n columns with param = value.
    best_trial : optuna object
        The best optimization result returned as a FrozenTrial optuna object.

    """

    if return_best and exog is not None and (len(exog) != len(series)):
        raise ValueError(
            (f"`exog` must have same number of samples as `series`. "
             f"length `exog`: ({len(exog)}), length `series`: ({len(series)})")
        )

    results, best_trial = _bayesian_search_optuna_multiseries(
                              forecaster            = forecaster,
                              series                = series,
                              cv                    = cv,
                              exog                  = exog,
                              levels                = levels, 
                              search_space          = search_space,
                              metric                = metric,
                              aggregate_metric      = aggregate_metric,
                              n_trials              = n_trials,
                              random_state          = random_state,
                              return_best           = return_best,
                              n_jobs                = n_jobs,
                              verbose               = verbose,
                              show_progress         = show_progress,
                              suppress_warnings     = suppress_warnings,
                              output_file           = output_file,
                              kwargs_create_study   = kwargs_create_study,
                              kwargs_study_optimize = kwargs_study_optimize
                          )

    return results, best_trial

skforecast.model_selection._validation.backtesting_sarimax ¶

backtesting_sarimax(
    forecaster,
    y,
    cv,
    metric,
    exog=None,
    alpha=None,
    interval=None,
    n_jobs="auto",
    verbose=False,
    suppress_warnings_fit=False,
    show_progress=True,
)

Backtesting of ForecasterSarimax.

A copy of the original forecaster is created so that it is not modified during the process.

Parameters:

Name	Type	Description	Default
`forecaster`	`ForecasterSarimax`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds. New in version 0.14.0	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i].	`None`
`alpha`	`float`	Significance level: the confidence intervals for the forecasts have a nominal coverage of 100 * (1 - alpha) %. If both `alpha` and `interval` are provided, `alpha` is used.	`None`
`interval`	`list`	Confidence of the prediction interval estimated. The values must be symmetric. Sequence of percentiles to compute, which must be between 0 and 100 inclusive. For example, interval of 95% should be as `interval = [2.5, 97.5]`. If both, `alpha` and `interval` are provided, `alpha` will be used.	`None`
`n_jobs`	`(int, auto)`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds and index of training and validation sets used for backtesting.	`False`
`suppress_warnings_fit`	`bool`	If `True`, warnings generated during fitting will be ignored.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
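When `metric` is a `Callable`, it receives `y_true`, `y_pred` and, optionally, `y_train`, and returns a float. The sketch below shows one possible custom metric with that signature. The function name `naive_scaled_mae` is hypothetical, and the exact object types that skforecast passes in are not shown in this section, so the code only assumes equally long iterables of numbers.

```python
# Sketch of a custom metric with the (y_true, y_pred, y_train) signature.
# `naive_scaled_mae` is a hypothetical name chosen for this example.

def naive_scaled_mae(y_true, y_pred, y_train):
    """MAE of the forecast divided by the in-sample MAE of a naive forecast."""
    y_true, y_pred, y_train = list(y_true), list(y_pred), list(y_train)
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    # Naive one-step forecast on the training data: previous value.
    naive_errors = [abs(b - a) for a, b in zip(y_train[:-1], y_train[1:])]
    scale = sum(naive_errors) / len(naive_errors)
    return float(mae / scale)


value = naive_scaled_mae(
    y_true=[5.0, 6.0],
    y_pred=[4.0, 8.0],
    y_train=[1.0, 3.0, 2.0, 4.0],
)
print(value)
```

Such a function could be passed on its own (`metric=naive_scaled_mae`) or inside a list together with metric names.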

Returns:

Name	Type	Description
`metric_values`	`pandas DataFrame`	Value(s) of the metric(s).
`backtest_predictions`	`pandas DataFrame`	Value of predictions and their estimated interval if `interval` is not `None`. column pred: predictions. column lower_bound: lower bound of the interval. column upper_bound: upper bound of the interval.
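The relation between `interval` and `alpha` is simple arithmetic: `interval = [2.5, 97.5]` leaves 2.5% in each tail, which corresponds to `alpha = 0.05`. The helper below is a sketch of that conversion under the symmetry requirement stated above. `interval_to_alpha` is a hypothetical name and is not claimed to be how skforecast performs the conversion internally.

```python
def interval_to_alpha(interval):
    """Convert symmetric percentiles, e.g. [2.5, 97.5], to alpha (0.05)."""
    lower, upper = interval
    if not (0 <= lower < upper <= 100):
        raise ValueError("`interval` must satisfy 0 <= lower < upper <= 100.")
    # Symmetric means both tails are equal: lower == 100 - upper.
    if abs(lower - (100 - upper)) > 1e-9:
        raise ValueError("`interval` must be symmetric, e.g. [2.5, 97.5].")
    # alpha is the total probability mass left outside the interval.
    return (100 - (upper - lower)) / 100


alpha = interval_to_alpha([2.5, 97.5])
print(alpha)
```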

Source code in skforecast/model_selection/_validation.py

def backtesting_sarimax(
    forecaster: object,
    y: pd.Series,
    cv: TimeSeriesFold,
    metric: Union[str, Callable, list],
    exog: Optional[Union[pd.Series, pd.DataFrame]] = None,
    alpha: Optional[float] = None,
    interval: Optional[list] = None,
    n_jobs: Union[int, str] = 'auto',
    verbose: bool = False,
    suppress_warnings_fit: bool = False,
    show_progress: bool = True
) -> Tuple[pd.DataFrame, pd.DataFrame]:
    """
    Backtesting of ForecasterSarimax.

    A copy of the original forecaster is created so that it is not modified during 
    the process.

    Parameters
    ----------
    forecaster : ForecasterSarimax
        Forecaster model.
    y : pandas Series
        Training time series.
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
        **New in version 0.14.0**
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default `None`
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    alpha : float, default `None`
        Significance level: the confidence intervals for the forecasts have a
        nominal coverage of 100 * (1 - alpha) %. If both `alpha` and `interval`
        are provided, `alpha` is used.
    interval : list, default `None`
        Confidence of the prediction interval estimated. The values must be
        symmetric. Sequence of percentiles to compute, which must be between 
        0 and 100 inclusive. For example, interval of 95% should be as 
        `interval = [2.5, 97.5]`. If both, `alpha` and `interval` are 
        provided, `alpha` will be used.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting. 
    verbose : bool, default `False`
        Print number of folds and index of training and validation sets used 
        for backtesting.
    suppress_warnings_fit : bool, default `False`
        If `True`, warnings generated during fitting will be ignored.
    show_progress : bool, default `True`
        Whether to show a progress bar.

    Returns
    -------
    metric_values : pandas DataFrame
        Value(s) of the metric(s).
    backtest_predictions : pandas DataFrame
        Value of predictions and their estimated interval if `interval` is not `None`.

        - column pred: predictions.
        - column lower_bound: lower bound of the interval.
        - column upper_bound: upper bound of the interval.

    """

    if type(forecaster).__name__ not in ['ForecasterSarimax']:
        raise TypeError(
            ("`forecaster` must be of type `ForecasterSarimax`, for all other "
             "types of forecasters use the functions available in the other "
             "`model_selection` modules.")
        )

    check_backtesting_input(
        forecaster            = forecaster,
        cv                    = cv,
        y                     = y,
        metric                = metric,
        interval              = interval,
        alpha                 = alpha,
        n_jobs                = n_jobs,
        show_progress         = show_progress,
        suppress_warnings_fit = suppress_warnings_fit
    )

    metric_values, backtest_predictions = _backtesting_sarimax(
        forecaster            = forecaster,
        y                     = y,
        cv                    = cv,
        metric                = metric,
        exog                  = exog,
        alpha                 = alpha,
        interval              = interval,
        n_jobs                = n_jobs,
        verbose               = verbose,
        suppress_warnings_fit = suppress_warnings_fit,
        show_progress         = show_progress
    )

    return metric_values, backtest_predictions

skforecast.model_selection._search.grid_search_sarimax ¶

grid_search_sarimax(
    forecaster,
    y,
    cv,
    param_grid,
    metric,
    exog=None,
    return_best=True,
    n_jobs="auto",
    verbose=True,
    suppress_warnings_fit=False,
    show_progress=True,
    output_file=None,
)

Exhaustive search over specified parameter values for a ForecasterSarimax object. Validation is done using time series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`ForecasterSarimax`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds. New in version 0.14.0	required
`param_grid`	`dict`	Dictionary with parameters names (`str`) as keys and lists of parameter settings to try as values.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i].	`None`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, auto)`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`True`
`suppress_warnings_fit`	`bool`	If `True`, warnings generated during fitting will be ignored.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. column params: parameters configuration for each iteration. column metric: metric value estimated for each iteration. additional n columns with param = value.
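An exhaustive search evaluates every combination of the values listed in `param_grid`. The sketch below reproduces that expansion with `itertools.product` so the number of candidates is easy to see. It is a stand-in for illustration, not scikit-learn's `ParameterGrid`, and the ordering of the candidates may differ. The parameter names `order` and `trend` are used here as example SARIMAX settings.

```python
from itertools import product


def expand_param_grid(param_grid):
    """Return one dict per combination of the listed parameter values."""
    keys = sorted(param_grid)
    value_lists = [param_grid[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


# Example grid: 2 values for `order` x 2 values for `trend` = 4 candidates.
param_grid = {
    "order": [(1, 0, 0), (1, 1, 1)],
    "trend": [None, "c"],
}

candidates = expand_param_grid(param_grid)
for candidate in candidates:
    print(candidate)
```

Because the number of candidates is the product of the list lengths, and each one is validated by backtesting, the grid size has a direct impact on runtime.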

Source code in skforecast/model_selection/_search.py

def grid_search_sarimax(
    forecaster: object,
    y: pd.Series,
    cv: TimeSeriesFold,
    param_grid: dict,
    metric: Union[str, Callable, list],
    exog: Optional[Union[pd.Series, pd.DataFrame]] = None,
    return_best: bool = True,
    n_jobs: Union[int, str] = 'auto',
    verbose: bool = True,
    suppress_warnings_fit: bool = False,
    show_progress: bool = True,
    output_file: Optional[str] = None
) -> pd.DataFrame:
    """
    Exhaustive search over specified parameter values for a ForecasterSarimax object.
    Validation is done using time series backtesting.

    Parameters
    ----------
    forecaster : ForecasterSarimax
        Forecaster model.
    y : pandas Series
        Training time series. 
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
        **New in version 0.14.0**
    param_grid : dict
        Dictionary with parameters names (`str`) as keys and lists of parameter
        settings to try as values.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default `None`
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    return_best : bool, default `True`
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default `True`
        Print number of folds used for cv or backtesting.
    suppress_warnings_fit : bool, default `False`
        If `True`, warnings generated during fitting will be ignored.
    show_progress : bool, default `True`
        Whether to show a progress bar.
    output_file : str, default `None`
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterGrid(param_grid))

    results = _evaluate_grid_hyperparameters_sarimax(
        forecaster            = forecaster,
        y                     = y,
        cv                    = cv,
        param_grid            = param_grid,
        metric                = metric,
        exog                  = exog,
        return_best           = return_best,
        n_jobs                = n_jobs,
        verbose               = verbose,
        suppress_warnings_fit = suppress_warnings_fit,
        show_progress         = show_progress,
        output_file           = output_file
    )

    return results

skforecast.model_selection._search.random_search_sarimax ¶

random_search_sarimax(
    forecaster,
    y,
    cv,
    param_distributions,
    metric,
    exog=None,
    n_iter=10,
    random_state=123,
    return_best=True,
    n_jobs="auto",
    verbose=True,
    suppress_warnings_fit=False,
    show_progress=True,
    output_file=None,
)

Random search over specified parameter values or distributions for a ForecasterSarimax object. Validation is done using time series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`ForecasterSarimax`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds. New in version 0.14.0	required
`param_distributions`	`dict`	Dictionary with parameters names (`str`) as keys and distributions or lists of parameters to try.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i].	`None`
`n_iter`	`int`	Number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution.	`10`
`random_state`	`int`	Sets a seed to the random sampling for reproducible output.	`123`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, auto)`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`True`
`suppress_warnings_fit`	`bool`	If `True`, warnings generated during fitting will be ignored.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. column params: parameters configuration for each iteration. column metric: metric value estimated for each iteration. additional n columns with param = value.
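The roles of `n_iter` and `random_state` can be illustrated with a simplified sampler. The sketch below only handles lists of values (not statistical distributions) and uses the standard library, so it is a stand-in for the idea rather than scikit-learn's `ParameterSampler`. `sample_param_lists` is a hypothetical helper name.

```python
import random
from itertools import product


def sample_param_lists(param_distributions, n_iter, random_state):
    """Draw up to `n_iter` distinct combinations from lists of values."""
    keys = sorted(param_distributions)
    value_lists = [param_distributions[k] for k in keys]
    grid = [dict(zip(keys, values)) for values in product(*value_lists)]
    rng = random.Random(random_state)  # fixed seed gives reproducible draws
    return rng.sample(grid, min(n_iter, len(grid)))


param_distributions = {
    "order": [(1, 0, 0), (0, 1, 1), (1, 1, 1)],
    "trend": [None, "c"],
}

sampled = sample_param_lists(param_distributions, n_iter=3, random_state=123)
for candidate in sampled:
    print(candidate)
```

Only `n_iter` of the six possible combinations are evaluated, which is the runtime versus quality trade-off mentioned in the `n_iter` description. Repeating the call with the same `random_state` returns the same candidates.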

Source code in skforecast/model_selection/_search.py

def random_search_sarimax(
    forecaster: object,
    y: pd.Series,
    cv: TimeSeriesFold,
    param_distributions: dict,
    metric: Union[str, Callable, list],
    exog: Optional[Union[pd.Series, pd.DataFrame]] = None,
    n_iter: int = 10,
    random_state: int = 123,
    return_best: bool = True,
    n_jobs: Union[int, str] = 'auto',
    verbose: bool = True,
    suppress_warnings_fit: bool = False,
    show_progress: bool = True,
    output_file: Optional[str] = None
) -> pd.DataFrame:
    """
    Random search over specified parameter values or distributions for a Forecaster 
    object. Validation is done using time series backtesting.

    Parameters
    ----------
    forecaster : ForecasterSarimax
        Forecaster model.
    y : pandas Series
        Training time series. 
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
        **New in version 0.14.0**
    param_distributions : dict
        Dictionary with parameters names (`str`) as keys and 
        distributions or lists of parameters to try.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default `None`
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    n_iter : int, default `10`
        Number of parameter settings that are sampled. 
        n_iter trades off runtime vs quality of the solution.
    random_state : int, default `123`
        Sets a seed to the random sampling for reproducible output.
    return_best : bool, default `True`
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default `True`
        Print number of folds used for cv or backtesting.
    suppress_warnings_fit : bool, default `False`
        If `True`, warnings generated during fitting will be ignored.
    show_progress : bool, default `True`
        Whether to show a progress bar.
    output_file : str, default `None`
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterSampler(param_distributions, n_iter=n_iter,
                                       random_state=random_state))

    results = _evaluate_grid_hyperparameters_sarimax(
        forecaster            = forecaster,
        y                     = y,
        cv                    = cv,
        param_grid            = param_grid,
        metric                = metric,
        exog                  = exog,
        return_best           = return_best,
        n_jobs                = n_jobs,
        verbose               = verbose,
        suppress_warnings_fit = suppress_warnings_fit,
        show_progress         = show_progress,
        output_file           = output_file
    )

    return results

skforecast.model_selection._split.BaseFold ¶

BaseFold(
    steps=None,
    initial_train_size=None,
    window_size=None,
    differentiation=None,
    refit=False,
    fixed_train_size=True,
    gap=0,
    skip_folds=None,
    allow_incomplete_fold=True,
    return_all_indexes=False,
    verbose=True,
)

Base class for all Fold classes in skforecast. All fold classes should specify all the parameters that can be set at the class level in their __init__.

Parameters:

Name	Type	Description	Default
`steps`	`int`	Number of observations to predict in each fold. This is also commonly referred to as the forecast horizon or test size.	`None`
`initial_train_size`	`int`	Number of observations used for initial training.	`None`
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.	`None`
`differentiation`	`int`	Number of observations to use for differentiation. This is used to extend the `last_window` as many observations as the differentiation order.	`None`
`refit`	`(bool, int)`	Whether to refit the forecaster in each fold. If `True`, the forecaster is refitted in each fold. If `False`, the forecaster is trained only in the first fold. If an integer, the forecaster is trained in the first fold and then refitted every `refit` folds.	`False`
`fixed_train_size`	`bool`	Whether the training size is fixed or increases in each fold.	`True`
`gap`	`int`	Number of observations between the end of the training set and the start of the test set.	`0`
`skip_folds`	`(int, list)`	Number of folds to skip. If an integer, every `skip_folds`-th fold is returned. If a list, the indexes of the folds to skip. For example, if `skip_folds=3` and there are 10 folds, the returned folds are 0, 3, 6, and 9. If `skip_folds=[1, 2, 3]`, the returned folds are 0, 4, 5, 6, 7, 8, and 9.	`None`
`allow_incomplete_fold`	`bool`	Whether to allow the last fold to include fewer observations than `steps`. If `False`, the last fold is excluded if it is incomplete.	`True`
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.	`False`
`verbose`	`bool`	Whether to print information about generated folds.	`True`

Attributes:

Name	Type	Description
`steps`	`int`	Number of observations to be predicted in each fold. This is also commonly referred to as the forecast horizon or test size.
`initial_train_size`	`int`	Number of observations used for initial training.
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.
`differentiation`	`int`	Number of observations to use for differentiation. This is used to extend the `last_window` by as many observations as the differentiation order.
`refit`	`(bool, int)`	Whether to refit the forecaster in each fold.
`fixed_train_size`	`bool`	Whether the training size is fixed or increases in each fold.
`gap`	`int`	Number of observations between the end of the training set and the start of the test set.
`skip_folds`	`(int, list)`	Number of folds to skip.
`allow_incomplete_fold`	`bool`	Whether to allow the last fold to include fewer observations than `steps`.
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.
`verbose`	`bool`	Whether to print information about generated folds.

Source code in skforecast/model_selection/_split.py

def __init__(
    self,
    steps: Optional[int] = None,
    initial_train_size: Optional[int] = None,
    window_size: Optional[int] = None,
    differentiation: Optional[int] = None,
    refit: Union[bool, int] = False,
    fixed_train_size: bool = True,
    gap: int = 0,
    skip_folds: Optional[Union[int, list]] = None,
    allow_incomplete_fold: bool = True,
    return_all_indexes: bool = False,
    verbose: bool = True
) -> None:

    self._validate_params(
        cv_name               = type(self).__name__,
        steps                 = steps,
        initial_train_size    = initial_train_size,
        window_size           = window_size,
        differentiation       = differentiation,
        refit                 = refit,
        fixed_train_size      = fixed_train_size,
        gap                   = gap,
        skip_folds            = skip_folds,
        allow_incomplete_fold = allow_incomplete_fold,
        return_all_indexes    = return_all_indexes,
        verbose               = verbose
    )

    self.steps                 = steps
    self.initial_train_size    = initial_train_size
    self.window_size           = window_size
    self.differentiation       = differentiation
    self.refit                 = refit
    self.fixed_train_size      = fixed_train_size
    self.gap                   = gap
    self.skip_folds            = skip_folds
    self.allow_incomplete_fold = allow_incomplete_fold
    self.return_all_indexes    = return_all_indexes
    self.verbose               = verbose

_validate_params ¶

_validate_params(
    cv_name,
    steps=None,
    initial_train_size=None,
    window_size=None,
    differentiation=None,
    refit=False,
    fixed_train_size=True,
    gap=0,
    skip_folds=None,
    allow_incomplete_fold=True,
    return_all_indexes=False,
    verbose=True,
)

Validate all input parameters to ensure correctness.

Source code in skforecast/model_selection/_split.py

def _validate_params(
    self,
    cv_name: str,
    steps: Optional[int] = None,
    initial_train_size: Optional[int] = None,
    window_size: Optional[int] = None,
    differentiation: Optional[int] = None,
    refit: Union[bool, int] = False,
    fixed_train_size: bool = True,
    gap: int = 0,
    skip_folds: Optional[Union[int, list]] = None,
    allow_incomplete_fold: bool = True,
    return_all_indexes: bool = False,
    verbose: bool = True
) -> None: 
    """
    Validate all input parameters to ensure correctness.
    """

    if cv_name == "TimeSeriesFold":
        if not isinstance(steps, (int, np.integer)) or steps < 1:
            raise ValueError(
                f"`steps` must be an integer greater than 0. Got {steps}."
            )
        if not isinstance(initial_train_size, (int, np.integer, type(None))):
            raise ValueError(
                f"`initial_train_size` must be an integer greater than 0 or None. "
                f"Got {initial_train_size}."
            )
        if initial_train_size is not None and initial_train_size < 1:
            raise ValueError(
                f"`initial_train_size` must be an integer greater than 0 or None. "
                f"Got {initial_train_size}."
            )
        if not isinstance(refit, (bool, int, np.integer)):
            raise TypeError(
                f"`refit` must be a boolean or an integer equal or greater than 0. "
                f"Got {refit}."
            )
        if isinstance(refit, (int, np.integer)) and not isinstance(refit, bool) and refit < 0:
            raise TypeError(
                f"`refit` must be a boolean or an integer equal or greater than 0. "
                f"Got {refit}."
            )
        if not isinstance(fixed_train_size, bool):
            raise TypeError(
                f"`fixed_train_size` must be a boolean: `True`, `False`. "
                f"Got {fixed_train_size}."
            )
        if not isinstance(gap, (int, np.integer)) or gap < 0:
            raise ValueError(
                f"`gap` must be an integer greater than or equal to 0. Got {gap}."
            )
        if skip_folds is not None:
            if not isinstance(skip_folds, (int, np.integer, list, type(None))):
                raise TypeError(
                    f"`skip_folds` must be an integer greater than 0, a list of "
                    f"integers or `None`. Got {skip_folds}."
                )
            if isinstance(skip_folds, (int, np.integer)) and skip_folds < 1:
                raise ValueError(
                    f"`skip_folds` must be an integer greater than 0, a list of "
                    f"integers or `None`. Got {skip_folds}."
                )
            if isinstance(skip_folds, list) and any([x < 1 for x in skip_folds]):
                raise ValueError(
                    f"`skip_folds` list must contain integers greater than or "
                    f"equal to 1. The first fold is always needed to train the "
                    f"forecaster. Got {skip_folds}."
                ) 
        if not isinstance(allow_incomplete_fold, bool):
            raise TypeError(
                f"`allow_incomplete_fold` must be a boolean: `True`, `False`. "
                f"Got {allow_incomplete_fold}."
            )

    if cv_name == "OneStepAheadFold":
        if (
            not isinstance(initial_train_size, (int, np.integer))
            or initial_train_size < 1
        ):
            raise ValueError(
                f"`initial_train_size` must be an integer greater than 0. "
                f"Got {initial_train_size}."
            )

    if (
        not isinstance(window_size, (int, np.integer, pd.DateOffset, type(None)))
        or isinstance(window_size, (int, np.integer))
        and window_size < 1
    ):
        raise ValueError(
            f"`window_size` must be an integer greater than 0. Got {window_size}."
        )

    if not isinstance(return_all_indexes, bool):
        raise TypeError(
            f"`return_all_indexes` must be a boolean: `True`, `False`. "
            f"Got {return_all_indexes}."
        )
    if differentiation is not None:
        if not isinstance(differentiation, (int, np.integer)) or differentiation < 0:
            raise ValueError(
                f"`differentiation` must be None or an integer greater than or "
                f"equal to 0. Got {differentiation}."
            )
    if not isinstance(verbose, bool):
        raise TypeError(
            f"`verbose` must be a boolean: `True`, `False`. "
            f"Got {verbose}."
        )

_extract_index ¶

_extract_index(X)

Extracts and returns the index from the input data X.

Parameters:

Name	Type	Description	Default
`X`	`pandas Series, pandas DataFrame, pandas Index, dict`	Time series data or index to split.	required

Returns:

Name	Type	Description
`idx`	`pandas Index`	Index extracted from the input data.

Source code in skforecast/model_selection/_split.py

def _extract_index(
    self,
    X: Union[pd.Series, pd.DataFrame, pd.Index, dict]
) -> pd.Index:
    """
    Extracts and returns the index from the input data X.

    Parameters
    ----------
    X : pandas Series, pandas DataFrame, pandas Index, dict
        Time series data or index to split.

    Returns
    -------
    idx : pandas Index
        Index extracted from the input data.

    """

    if isinstance(X, (pd.Series, pd.DataFrame)):
        idx = X.index
    elif isinstance(X, dict):
        freqs = [s.index.freq for s in X.values() if s.index.freq is not None]
        if not freqs:
            raise ValueError("At least one series must have a frequency.")
        if not all(f == freqs[0] for f in freqs):
            raise ValueError(
                "All series with frequency must have the same frequency."
            )
        min_idx = min([v.index[0] for v in X.values()])
        max_idx = max([v.index[-1] for v in X.values()])
        idx = pd.date_range(start=min_idx, end=max_idx, freq=freqs[0])
    else:
        idx = X

    return idx

set_params ¶

set_params(params)

Set the parameters of the Fold object. Before overwriting the current parameters, the input parameters are validated to ensure correctness.

Parameters:

Name	Type	Description	Default
`params`	`dict`	Dictionary with the parameters to set.	required

Returns:

Type	Description
`None`
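The validate-then-assign behaviour described above can be illustrated with a minimal sketch. `ToyFold` is a hypothetical stand-in written for this example, not a skforecast class, and its two parameters and checks are simplified assumptions. It shows why a rejected update leaves the object unchanged: the merged parameters are validated before any attribute is overwritten.

```python
import warnings


class ToyFold:
    """Hypothetical, simplified stand-in for a Fold class (not skforecast code)."""

    def __init__(self, steps, gap=0):
        self._validate_params(steps=steps, gap=gap)
        self.steps = steps
        self.gap = gap

    def _validate_params(self, steps, gap):
        if not isinstance(steps, int) or steps < 1:
            raise ValueError(f"`steps` must be an integer greater than 0. Got {steps}.")
        if not isinstance(gap, int) or gap < 0:
            raise ValueError(f"`gap` must be an integer >= 0. Got {gap}.")

    def set_params(self, params):
        current = dict(vars(self))
        unknown = set(params) - set(current)
        if unknown:
            warnings.warn(f"Unknown parameters: {unknown}. They have been ignored.")
        # Merge known keys over the current values, then validate the whole set
        # before touching any attribute.
        updated = {**current, **{k: v for k, v in params.items() if k in current}}
        self._validate_params(**updated)
        for key, value in updated.items():
            setattr(self, key, value)


cv = ToyFold(steps=3)
cv.set_params({"gap": 2})          # valid update: gap becomes 2

try:
    cv.set_params({"steps": 0, "gap": 5})   # invalid `steps`: nothing is changed
except ValueError:
    pass
```

After the failed call, `cv.steps` is still 3 and `cv.gap` is still 2, because validation runs on the merged dictionary before the assignment loop.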

Source code in skforecast/model_selection/_split.py

def set_params(
    self, 
    params: dict
) -> None:
    """
    Set the parameters of the Fold object. Before overwriting the current 
    parameters, the input parameters are validated to ensure correctness.

    Parameters
    ----------
    params : dict
        Dictionary with the parameters to set.

    Returns
    -------
    None

    """

    if not isinstance(params, dict):
        raise TypeError(
            f"`params` must be a dictionary. Got {type(params)}."
        )

    current_params = deepcopy(vars(self))
    unknown_params = set(params.keys()) - set(current_params.keys())
    if unknown_params:
        warnings.warn(
            f"Unknown parameters: {unknown_params}. They have been ignored.",
            IgnoredArgumentWarning
        )

    filtered_params = {k: v for k, v in params.items() if k in current_params}
    updated_params = {'cv_name': type(self).__name__, **current_params, **filtered_params}

    self._validate_params(**updated_params)
    for key, value in updated_params.items():
        setattr(self, key, value)

skforecast.model_selection._split.TimeSeriesFold ¶

TimeSeriesFold(
    steps,
    initial_train_size=None,
    window_size=None,
    differentiation=None,
    refit=False,
    fixed_train_size=True,
    gap=0,
    skip_folds=None,
    allow_incomplete_fold=True,
    return_all_indexes=False,
    verbose=True,
)

Bases: BaseFold

Class to split time series data into train and test folds. When used within a backtesting or hyperparameter search, the arguments 'initial_train_size', 'window_size' and 'differentiation' are not required as they are automatically set by the backtesting or hyperparameter search functions.

Parameters:

Name	Type	Description	Default
`steps`	`int`	Number of observations to be predicted in each fold. This is also commonly referred to as the forecast horizon or test size.	required
`initial_train_size`	`int`	Number of observations used for initial training. Must be greater than 0. If `None`, the forecaster is not trained in the first fold, so it must already be fitted.	`None`
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.	`None`
`differentiation`	`int`	Number of observations to use for differentiation. This is used to extend the `last_window` by as many observations as the differentiation order.	`None`
`refit`	`(bool, int)`	Whether to refit the forecaster in each fold. If `True`, the forecaster is refitted in each fold. If `False`, the forecaster is trained only in the first fold. If an integer, the forecaster is trained in the first fold and then refitted every `refit` folds.	`False`
`fixed_train_size`	`bool`	Whether the training size is fixed or increases in each fold.	`True`
`gap`	`int`	Number of observations between the end of the training set and the start of the test set.	`0`
`skip_folds`	`(int, list)`	Number of folds to skip. If an integer, every `skip_folds`-th fold is returned. If a list, the indexes of the folds to skip. For example, if `skip_folds=3` and there are 10 folds, the returned folds are 0, 3, 6, and 9. If `skip_folds=[1, 2, 3]`, the returned folds are 0, 4, 5, 6, 7, 8, and 9.	`None`
`allow_incomplete_fold`	`bool`	Whether to allow the last fold to include fewer observations than `steps`. If `False`, the last fold is excluded if it is incomplete.	`True`
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.	`False`
`verbose`	`bool`	Whether to print information about generated folds.	`True`

Attributes:

Name	Type	Description
`steps`	`int`	Number of observations to be predicted in each fold. This is also commonly referred to as the forecast horizon or test size.
`initial_train_size`	`int`	Number of observations used for initial training. Must be greater than 0. If `None`, the forecaster is not trained in the first fold, so it must already be fitted.
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.
`differentiation`	`int`	Number of observations to use for differentiation. This is used to extend the `last_window` by as many observations as the differentiation order.
`refit`	`(bool, int)`	Whether to refit the forecaster in each fold.
`fixed_train_size`	`bool`	Whether the training size is fixed or increases in each fold.
`gap`	`int`	Number of observations between the end of the training set and the start of the test set.
`skip_folds`	`(int, list)`	Number of folds to skip.
`allow_incomplete_fold`	`bool`	Whether to allow the last fold to include fewer observations than `steps`.
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.
`verbose`	`bool`	Whether to print information about generated folds.

Notes

Returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc. For example, if the input series is X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], the initial_train_size = 3, window_size = 2, steps = 4, and gap = 1, the output of the first fold will be: [[0, 3], [1, 3], [3, 8], [4, 8], True].

The first list [0, 3] indicates that the training set goes from the first to the third observation. The second list [1, 3] indicates that the last window seen by the forecaster during training goes from the second to the third observation. The third list [3, 8] indicates that the test set goes from the fourth to the eighth observation. The fourth list [4, 8] indicates that the test set, once the gap observation is left out, goes from the fifth to the eighth observation. The boolean True indicates that the forecaster should be trained in this fold.

Following the python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
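The example in the Notes can be reproduced with a small standalone sketch. The helper `fold_positions` is a hypothetical name introduced here, not part of skforecast. It mirrors only the position arithmetic for the simplest configuration (`refit=False`, `return_all_indexes=False`, no skipped folds, incomplete last fold allowed) and ignores validation, printing and externally fitted forecasters.

```python
def fold_positions(n_obs, initial_train_size, steps, window_size, gap=0):
    """Simplified sketch of the start/end positions per fold (refit=False)."""
    folds = []
    i = 0
    while initial_train_size + i * steps + gap < n_obs:
        test_start = initial_train_size + i * steps
        # The last fold may be shorter than `steps` if the series runs out.
        test_end = min(test_start + gap + steps, n_obs)
        folds.append([
            [0, initial_train_size],                 # training set
            [test_start - window_size, test_start],  # last window
            [test_start, test_end],                  # test set (gap included)
            [test_start + gap, test_end],            # test set after the gap
            i == 0,                                  # fit only in the first fold
        ])
        i += 1
    return folds


X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
folds = fold_positions(len(X), initial_train_size=3, steps=4, window_size=2, gap=1)

first = folds[0]
train = X[first[0][0]:first[0][1]]           # end position is exclusive
test_after_gap = X[first[3][0]:first[3][1]]
```

Because the end position is exclusive, the pairs can be passed straight to a slice: `train` holds the first three values and `test_after_gap` holds the four values at positions 4 to 7.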

Source code in skforecast/model_selection/_split.py

def __init__(
    self,
    steps: int,
    initial_train_size: Optional[int] = None,
    window_size: Optional[int] = None,
    differentiation: Optional[int] = None,
    refit: Union[bool, int] = False,
    fixed_train_size: bool = True,
    gap: int = 0,
    skip_folds: Optional[Union[int, list]] = None,
    allow_incomplete_fold: bool = True,
    return_all_indexes: bool = False,
    verbose: bool = True
) -> None:

    super().__init__(
        steps                 = steps,
        initial_train_size    = initial_train_size,
        window_size           = window_size,
        differentiation       = differentiation,
        refit                 = refit,
        fixed_train_size      = fixed_train_size,
        gap                   = gap,
        skip_folds            = skip_folds,
        allow_incomplete_fold = allow_incomplete_fold,
        return_all_indexes    = return_all_indexes,
        verbose               = verbose
    )

split ¶

split(X, as_pandas=False)

Split the time series data into train and test folds.

Parameters:

Name	Type	Description	Default
`X`	`pandas Series, pandas DataFrame, pandas Index, dict`	Time series data or index to split.	required
`as_pandas`	`bool`	If True, the folds are returned as a DataFrame. This is useful to visualize the folds in a more interpretable way.	`False`

Returns:

Name	Type	Description
`folds`	`list, pandas DataFrame`	A list of lists containing the indices (positions) for each fold. Each list contains 4 lists and a boolean with the following information: [train_start, train_end]: list with the start and end positions of the training set. [last_window_start, last_window_end]: list with the start and end positions of the last window seen by the forecaster during training. The last window is used to generate the lags used as predictors. If `differentiation` is included, the interval is extended by as many observations as the differentiation order. If the argument `window_size` is `None`, this list is empty. [test_start, test_end]: list with the start and end positions of the test set. These are the observations used to evaluate the forecaster. [test_start_with_gap, test_end_with_gap]: list with the start and end positions of the test set including the gap. The gap is the number of observations between the end of the training set and the start of the test set. fit_forecaster: boolean indicating whether the forecaster should be fitted in this fold. It is important to note that the returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc. If `as_pandas` is `True`, the folds are returned as a DataFrame with the following columns: 'fold', 'train_start', 'train_end', 'last_window_start', 'last_window_end', 'test_start', 'test_end', 'test_start_with_gap', 'test_end_with_gap', 'fit_forecaster'. Following the python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
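How `refit` and `skip_folds` shape the returned folds can be sketched in isolation. The two helpers below, `fit_flags` and `kept_folds`, are hypothetical names written for this illustration. They follow the flag and skipping logic of the source listing but leave out the special case of an already fitted forecaster, where the first flag is set to `False`.

```python
def fit_flags(n_folds, refit):
    """Sketch of the `fit_forecaster` flag computed for each fold."""
    if refit == 0:  # covers both `False` and the integer 0
        refit = False
    if isinstance(refit, bool):
        flags = [refit] * n_folds
        flags[0] = True  # the first fold always trains the forecaster
    else:
        flags = [False] * n_folds
        for i in range(0, n_folds, refit):  # refit every `refit` folds
            flags[i] = True
    return flags


def kept_folds(n_folds, skip_folds):
    """Sketch of which fold numbers survive `skip_folds`."""
    if skip_folds is None:
        return list(range(n_folds))
    if isinstance(skip_folds, int):
        return list(range(0, n_folds, skip_folds))  # keep every n-th fold
    return [i for i in range(n_folds) if i not in skip_folds]


flags_no_refit = fit_flags(5, refit=False)
flags_every_2 = fit_flags(5, refit=2)
kept_int = kept_folds(10, skip_folds=3)
kept_list = kept_folds(10, skip_folds=[1, 2, 3])
```

With 10 folds, `skip_folds=3` keeps folds 0, 3, 6 and 9, while `skip_folds=[1, 2, 3]` keeps folds 0 and 4 to 9, matching the parameter description.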

Source code in skforecast/model_selection/_split.py

def split(
    self,
    X: Union[pd.Series, pd.DataFrame, pd.Index, dict],
    as_pandas: bool = False
) -> Union[list, pd.DataFrame]:
    """
    Split the time series data into train and test folds.

    Parameters
    ----------
    X : pandas Series, pandas DataFrame, pandas Index, dict
        Time series data or index to split.
    as_pandas : bool, default `False`
        If True, the folds are returned as a DataFrame. This is useful to visualize
        the folds in a more interpretable way.

    Returns
    -------
    folds : list, pandas DataFrame
        A list of lists containing the indices (positions) for each fold. Each list
        contains 4 lists and a boolean with the following information:

        - [train_start, train_end]: list with the start and end positions of the
        training set.
        - [last_window_start, last_window_end]: list with the start and end positions
        of the last window seen by the forecaster during training. The last window
        is used to generate the lags used as predictors. If `differentiation` is
        included, the interval is extended as many observations as the
        differentiation order. If the argument `window_size` is `None`, this list is
        empty.
        - [test_start, test_end]: list with the start and end positions of the test
        set. These are the observations used to evaluate the forecaster.
        - [test_start_with_gap, test_end_with_gap]: list with the start and end
        positions of the test set including the gap. The gap is the number of
        observations between the end of the training set and the start of the test
        set.
        - fit_forecaster: boolean indicating whether the forecaster should be fitted
        in this fold.

        It is important to note that the returned values are the positions of the
        observations and not the actual values of the index, so they can be used to
        slice the data directly using iloc.

        If `as_pandas` is `True`, the folds are returned as a DataFrame with the
        following columns: 'fold', 'train_start', 'train_end', 'last_window_start',
        'last_window_end', 'test_start', 'test_end', 'test_start_with_gap',
        'test_end_with_gap', 'fit_forecaster'.

        Following the python convention, the start index is inclusive and the end
        index is exclusive. This means that the last index is not included in the
        slice.

    """

    if not isinstance(X, (pd.Series, pd.DataFrame, pd.Index, dict)):
        raise TypeError(
            f"X must be a pandas Series, DataFrame, Index or a dictionary. "
            f"Got {type(X)}."
        )

    if isinstance(self.window_size, pd.tseries.offsets.DateOffset):
        # Calculate the window_size in steps. This is not an exact calculation
        # because the offset follows the calendar rules and the distance between
        # two dates may not be constant.
        first_valid_index = X.index[-1] - self.window_size
        try:
            window_size_idx_start = X.index.get_loc(first_valid_index)
            window_size_idx_end = X.index.get_loc(X.index[-1])
            self.window_size = window_size_idx_end - window_size_idx_start
        except KeyError:
            raise ValueError(
                f"The length of `X` ({len(X)}), must be greater than or equal "
                f"to the window size ({self.window_size}). Try to decrease the "
                f"size of the offset (forecaster.offset), or increase the "
                f"size of `y`."
            )

    if self.initial_train_size is None:
        if self.window_size is None:
            raise ValueError(
                "To use split method when `initial_train_size` is None, "
                "`window_size` must be an integer greater than 0. "
                "Although no initial training is done and all data is used to "
                "evaluate the model, the first `window_size` observations are "
                "needed to create the initial predictors. Got `window_size` = None."
            )
        if self.refit:
            raise ValueError(
                "`refit` is only allowed when `initial_train_size` is not `None`. "
                "Set `refit` to `False` if you want to use `initial_train_size = None`."
            )
        externally_fitted = True
        self.initial_train_size = self.window_size  # Reset to None later
    else:
        if self.window_size is None:
            warnings.warn(
                "Last window cannot be calculated because `window_size` is None."
            )
        externally_fitted = False

    index = self._extract_index(X)
    idx = range(len(index))
    folds = []
    i = 0
    last_fold_excluded = False

    if len(index) < self.initial_train_size + self.steps:
        raise ValueError(
            f"The time series must have at least `initial_train_size + steps` "
            f"observations. Got {len(index)} observations."
        )

    while self.initial_train_size + (i * self.steps) + self.gap < len(index):

        if self.refit:
            # If `fixed_train_size` the train size doesn't increase but moves by 
            # `steps` positions in each iteration. If `False`, the train size
            # increases by `steps` in each iteration.
            train_iloc_start = i * (self.steps) if self.fixed_train_size else 0
            train_iloc_end = self.initial_train_size + i * (self.steps)
            test_iloc_start = train_iloc_end
        else:
            # The train size doesn't increase and doesn't move.
            train_iloc_start = 0
            train_iloc_end = self.initial_train_size
            test_iloc_start = self.initial_train_size + i * (self.steps)

        if self.window_size is not None:
            last_window_iloc_start = test_iloc_start - self.window_size
        test_iloc_end = test_iloc_start + self.gap + self.steps

        partitions = [
            idx[train_iloc_start : train_iloc_end],
            idx[last_window_iloc_start : test_iloc_start] if self.window_size is not None else [],
            idx[test_iloc_start : test_iloc_end],
            idx[test_iloc_start + self.gap : test_iloc_end]
        ]
        folds.append(partitions)
        i += 1

    if not self.allow_incomplete_fold and len(folds[-1][3]) < self.steps:
        folds = folds[:-1]
        last_fold_excluded = True

    # Replace partitions inside folds with length 0 with `None`
    folds = [
        [partition if len(partition) > 0 else None for partition in fold] 
         for fold in folds
    ]

    # Create a flag to know whether to train the forecaster
    if self.refit == 0:
        self.refit = False

    if isinstance(self.refit, bool):
        fit_forecaster = [self.refit] * len(folds)
        fit_forecaster[0] = True
    else:
        fit_forecaster = [False] * len(folds)
        for i in range(0, len(fit_forecaster), self.refit): 
            fit_forecaster[i] = True

    for i in range(len(folds)): 
        folds[i].append(fit_forecaster[i])
        if fit_forecaster[i] is False:
            folds[i][0] = folds[i - 1][0]

    index_to_skip = []
    if self.skip_folds is not None:
        if isinstance(self.skip_folds, (int, np.integer)) and self.skip_folds > 0:
            index_to_keep = np.arange(0, len(folds), self.skip_folds)
            index_to_skip = np.setdiff1d(np.arange(0, len(folds)), index_to_keep, assume_unique=True)
            index_to_skip = [int(x) for x in index_to_skip]  # Required since numpy 2.0
        if isinstance(self.skip_folds, list):
            index_to_skip = [i for i in self.skip_folds if i < len(folds)]        

    if self.verbose:
        self._print_info(
            index              = index,
            folds              = folds,
            externally_fitted  = externally_fitted,
            last_fold_excluded = last_fold_excluded,
            index_to_skip      = index_to_skip
        )

    folds = [fold for i, fold in enumerate(folds) if i not in index_to_skip]
    if not self.return_all_indexes:
        # +1 to prevent iloc pandas from deleting the last observation
        folds = [
            [[fold[0][0], fold[0][-1] + 1], 
             [fold[1][0], fold[1][-1] + 1] if self.window_size is not None else [],
             [fold[2][0], fold[2][-1] + 1],
             [fold[3][0], fold[3][-1] + 1],
             fold[4]] 
            for fold in folds
        ]

    if externally_fitted:
        self.initial_train_size = None
        folds[0][4] = False

    if as_pandas:
        if self.window_size is None:
            for fold in folds:
                fold[1] = [None, None]

        if not self.return_all_indexes:
            folds = pd.DataFrame(
                data = [list(itertools.chain(*fold[:-1])) + [fold[-1]] for fold in folds],
                columns = [
                    'train_start',
                    'train_end',
                    'last_window_start',
                    'last_window_end',
                    'test_start',
                    'test_end',
                    'test_start_with_gap',
                    'test_end_with_gap',
                    'fit_forecaster'
                ],
            )
        else:
            folds = pd.DataFrame(
                data = folds,
                columns = [
                    'train_index',
                    'last_window_index',
                    'test_index',
                    'test_index_with_gap',
                    'fit_forecaster'
                ],
            )
        folds.insert(0, 'fold', range(len(folds)))

    return folds

_print_info ¶

_print_info(
    index,
    folds,
    externally_fitted,
    last_fold_excluded,
    index_to_skip,
)

Print information about folds.

Source code in skforecast/model_selection/_split.py

def _print_info(
    self,
    index: pd.Index,
    folds: list,
    externally_fitted: bool,
    last_fold_excluded: bool,
    index_to_skip: list,
) -> None:
    """
    Print information about folds.
    """

    print("Information of folds")
    print("--------------------")
    if externally_fitted:
        print(
            f"An already trained forecaster is to be used. Window size: "
            f"{self.window_size}"
        )
    else:
        if self.differentiation is None:
            print(
                f"Number of observations used for initial training: "
                f"{self.initial_train_size}"
            )
        else:
            print(
                f"Number of observations used for initial training: "
                f"{self.initial_train_size - self.differentiation}"
            )
            print(
                f"    First {self.differentiation} observation/s in training sets "
                f"are used for differentiation"
            )
    print(
        f"Number of observations used for backtesting: "
        f"{len(index) - self.initial_train_size}"
    )
    print(f"    Number of folds: {len(folds)}")
    print(
        f"    Number skipped folds: "
        f"{len(index_to_skip)} {index_to_skip if index_to_skip else ''}"
    )
    print(f"    Number of steps per fold: {self.steps}")
    print(
        f"    Number of steps to exclude between last observed data "
        f"(last window) and predictions (gap): {self.gap}"
    )
    if last_fold_excluded:
        print("    Last fold has been excluded because it was incomplete.")
    if len(folds[-1][3]) < self.steps:
        print(f"    Last fold only includes {len(folds[-1][3])} observations.")
    print("")

    if self.differentiation is None:
        differentiation = 0
    else:
        differentiation = self.differentiation

    for i, fold in enumerate(folds):
        is_fold_skipped   = i in index_to_skip
        has_training      = fold[-1] if i != 0 else True
        training_start    = (
            index[fold[0][0] + differentiation] if fold[0] is not None else None
        )
        training_end      = index[fold[0][-1]] if fold[0] is not None else None
        training_length   = (
            len(fold[0]) - differentiation if fold[0] is not None else 0
        )
        validation_start  = index[fold[3][0]]
        validation_end    = index[fold[3][-1]]
        validation_length = len(fold[3])

        print(f"Fold: {i}")
        if is_fold_skipped:
            print("    Fold skipped")
        elif not externally_fitted and has_training:
            print(
                f"    Training:   {training_start} -- {training_end}  "
                f"(n={training_length})"
            )
            print(
                f"    Validation: {validation_start} -- {validation_end}  "
                f"(n={validation_length})"
            )
        else:
            print("    Training:   No training in this fold")
            print(
                f"    Validation: {validation_start} -- {validation_end}  "
                f"(n={validation_length})"
            )

    print("")

skforecast.model_selection._split.OneStepAheadFold ¶

OneStepAheadFold(
    initial_train_size,
    window_size=None,
    differentiation=None,
    return_all_indexes=False,
    verbose=True,
)

Bases: BaseFold

Class to split time series data into train and test folds for one-step-ahead forecasting.

Parameters:

Name	Type	Description	Default
`initial_train_size`	`int`	Number of observations used for initial training.	required
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.	`None`
`differentiation`	`int`	Number of observations to use for differentiation. It extends the `last_window` by as many observations as the differentiation order.	`None`
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.	`False`
`verbose`	`bool`	Whether to print information about generated folds.	`True`

Attributes:

Name	Type	Description
`initial_train_size`	`int`	Number of observations used for initial training.
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.
`differentiation`	`int`	Number of observations to use for differentiation. It extends the `last_window` by as many observations as the differentiation order.
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.
`verbose`	`bool`	Whether to print information about generated folds.
`steps`	`Any`	This attribute is not used in this class. It is included for API consistency.
`fixed_train_size`	`Any`	This attribute is not used in this class. It is included for API consistency.
`gap`	`Any`	This attribute is not used in this class. It is included for API consistency.
`skip_folds`	`Any`	This attribute is not used in this class. It is included for API consistency.
`allow_incomplete_fold`	`Any`	This attribute is not used in this class. It is included for API consistency.
`refit`	`Any`	This attribute is not used in this class. It is included for API consistency.

Source code in skforecast/model_selection/_split.py

def __init__(
    self,
    initial_train_size: int,
    window_size: Optional[int] = None,
    differentiation: Optional[int] = None,
    return_all_indexes: bool = False,
    verbose: bool = True,
) -> None:

    super().__init__(
        initial_train_size = initial_train_size,
        window_size        = window_size,
        differentiation    = differentiation,
        return_all_indexes = return_all_indexes,
        verbose            = verbose
    )

split ¶

split(X, as_pandas=False, externally_fitted=None)

Split the time series data into train and test folds.

Parameters:

Name	Type	Description	Default
`X`	`pandas Series, DataFrame, Index, or dictionary`	Time series data or index to split.	required
`as_pandas`	`bool`	If True, the folds are returned as a DataFrame. This is useful to visualize the folds in a more interpretable way.	`False`
`externally_fitted`	`Any`	This argument is not used in this class. It is included for API consistency.	`None`

Returns:

Name	Type	Description
`fold`	`list, pandas DataFrame`	A list with the positions that define the fold. It contains: [train_start, train_end]: list with the start and end positions of the training set. [test_start, test_end]: list with the start and end positions of the test set. These are the observations used to evaluate the forecaster. fit_forecaster: boolean flag, always `True` in this class. The returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc. If `as_pandas` is `True`, the fold is returned as a DataFrame with the columns 'fold', 'train_start', 'train_end', 'test_start', 'test_end' and 'fit_forecaster'. Following the Python convention, the start index is inclusive and the end index is exclusive, so the last index is not included in the slice.
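
As a minimal sketch of how the positional output can be consumed, the snippet below reproduces the fold layout described above in plain Python. `make_fold` is a hypothetical stand-in written for this illustration (it is not part of skforecast) and assumes `return_all_indexes=False` and `as_pandas=False`.

```python
def make_fold(n_obs, initial_train_size):
    """Hypothetical helper that mimics the fold layout (positions, end-exclusive)."""
    return [
        [0, initial_train_size],      # [train_start, train_end)
        [initial_train_size, n_obs],  # [test_start, test_end)
        True,                         # fit_forecaster flag
    ]


data = list(range(100, 110))  # 10 toy observations
(train_start, train_end), (test_start, test_end), fit_forecaster = make_fold(
    n_obs=len(data), initial_train_size=7
)

# Positions can be used directly as slice bounds because the end is exclusive.
train = data[train_start:train_end]
test = data[test_start:test_end]
print(train)
print(test)
```

With a pandas object the same positions would be used as `y.iloc[train_start:train_end]` and `y.iloc[test_start:test_end]`.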

Source code in skforecast/model_selection/_split.py

def split(
    self,
    X: Union[pd.Series, pd.DataFrame, pd.Index, dict],
    as_pandas: bool = False,
    externally_fitted: Any = None
) -> Union[list, pd.DataFrame]:
    """
    Split the time series data into train and test folds.

    Parameters
    ----------
    X : pandas Series, DataFrame, Index, or dictionary
        Time series data or index to split.
    as_pandas : bool, default `False`
        If True, the folds are returned as a DataFrame. This is useful to visualize
        the folds in a more interpretable way.
    externally_fitted : Any
        This argument is not used in this class. It is included for API consistency.

    Returns
    -------
    fold : list, pandas DataFrame
        A list of lists containing the indices (position) for each fold. Each list
        contains 2 lists with the following information:

        - [train_start, train_end]: list with the start and end positions of the
        training set.
        - [test_start, test_end]: list with the start and end positions of the test
        set. These are the observations used to evaluate the forecaster.

        It is important to note that the returned values are the positions of the
        observations and not the actual values of the index, so they can be used to
        slice the data directly using iloc.

        If `as_pandas` is `True`, the folds are returned as a DataFrame with the
        following columns: 'fold', 'train_start', 'train_end', 'test_start', 'test_end'.

        Following the python convention, the start index is inclusive and the end
        index is exclusive. This means that the last index is not included in the
        slice.

    """

    if not isinstance(X, (pd.Series, pd.DataFrame, pd.Index, dict)):
        raise TypeError(
            f"X must be a pandas Series, DataFrame, Index or a dictionary. "
            f"Got {type(X)}."
        )

    index = self._extract_index(X)
    fold = [
        [0, self.initial_train_size],
        [self.initial_train_size, len(X)],
        True
    ]

    if self.verbose:
        self._print_info(
            index = index,
            fold = fold,
        )

    if self.return_all_indexes:
        fold = [
            [range(fold[0][0], fold[0][1])],
            [range(fold[1][0], fold[1][1])],
            fold[2]
        ]

    if as_pandas:
        if not self.return_all_indexes:
            fold = pd.DataFrame(
                data = [list(itertools.chain(*fold[:-1])) + [fold[-1]]],
                columns = [
                    'train_start',
                    'train_end',
                    'test_start',
                    'test_end',
                    'fit_forecaster'
                ],
            )
        else:
            fold = pd.DataFrame(
                data = [fold],
                columns = [
                    'train_index',
                    'test_index',
                    'fit_forecaster'
                ],
            )
        fold.insert(0, 'fold', range(len(fold)))

    return fold

_print_info ¶

_print_info(index, fold)

Print information about folds.

Source code in skforecast/model_selection/_split.py

def _print_info(
    self,
    index: pd.Index,
    fold: list,
) -> None:
    """
    Print information about folds.
    """

    if self.differentiation is None:
        differentiation = 0
    else:
        differentiation = self.differentiation

    initial_train_size = self.initial_train_size - differentiation
    test_length = len(index) - (initial_train_size + differentiation)

    print("Information of folds")
    print("--------------------")
    print(
        f"Number of observations in train: {initial_train_size}"
    )
    if self.differentiation is not None:
        print(
            f"    First {differentiation} observation/s in training set "
            f"are used for differentiation"
        )
    print(
        f"Number of observations in test: {test_length}"
    )

    training_start = index[fold[0][0] + differentiation]
    training_end = index[fold[0][-1]]
    test_start  = index[fold[1][0]]
    test_end    = index[fold[1][-1] - 1]

    print(
        f"Training : {training_start} -- {training_end} (n={initial_train_size})"
    )
    print(
        f"Test     : {test_start} -- {test_end} (n={test_length})"
    )
    print("")
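
The counts printed by this method follow from simple arithmetic on `initial_train_size` and `differentiation`. The sketch below restates that arithmetic as a standalone function; `fold_summary` is a hypothetical name used only for this illustration.

```python
def fold_summary(n_obs, initial_train_size, differentiation=None):
    """Hypothetical helper: sizes reported for a single train/test fold."""
    d = 0 if differentiation is None else differentiation
    return {
        # The first `d` observations are consumed by the differencing step.
        "train_start_pos": d,
        "n_train": initial_train_size - d,
        "n_test": n_obs - initial_train_size,
    }


summary = fold_summary(n_obs=50, initial_train_size=40, differentiation=1)
print(summary)
```

Under these assumptions, one order of differentiation leaves 39 effective training observations and 10 test observations.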

skforecast.model_selection._utils.initialize_lags_grid ¶

initialize_lags_grid(forecaster, lags_grid=None)

Initialize lags grid and lags label for model selection.

Parameters:

Name	Type	Description	Default
`forecaster`	`Forecaster`	Forecaster model. ForecasterRecursive, ForecasterDirect, ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate.	required
`lags_grid`	`(list, dict)`	Lists of lags to try, containing int, lists, numpy ndarray, or range objects. If `dict`, the keys are used as labels in the `results` DataFrame, and the values are used as the lists of lags to try.	`None`

Returns:

Name	Type	Description
`lags_grid`	`dict`	Dictionary with lags configuration for each iteration.
`lags_label`	`str`	Label for lags representation in the results object.
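
To make the three accepted input forms concrete, here is a standalone sketch that mirrors the logic of the source listing for this function. `lags_grid_sketch` and the `SimpleNamespace` forecaster stub are assumptions made for illustration; a real forecaster would expose `lags` in the same way.

```python
from types import SimpleNamespace


def lags_grid_sketch(forecaster, lags_grid=None):
    """Illustrative re-statement of the lags grid normalization."""
    if not isinstance(lags_grid, (list, dict, type(None))):
        raise TypeError("`lags_grid` argument must be a list, dict or None.")

    lags_label = "values"
    if isinstance(lags_grid, list):
        # Each candidate is labelled with its own string representation.
        lags_grid = {f"{lags}": lags for lags in lags_grid}
    elif lags_grid is None:
        # Fall back to the lags already configured in the forecaster.
        lags = [int(lag) for lag in forecaster.lags]
        lags_grid = {f"{lags}": lags}
    else:
        # A dict is kept as-is and its keys become the labels.
        lags_label = "keys"
    return lags_grid, lags_label


stub = SimpleNamespace(lags=[1, 2, 3])  # stand-in for a forecaster

from_list = lags_grid_sketch(stub, [3, [1, 2, 5]])
from_none = lags_grid_sketch(stub, None)
from_dict = lags_grid_sketch(stub, {"short": 3, "long": [1, 2, 24]})
print(from_list, from_none, from_dict, sep="\n")
```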

Source code in skforecast/model_selection/_utils.py

def initialize_lags_grid(
    forecaster: object, 
    lags_grid: Optional[Union[list, dict]] = None
) -> Tuple[dict, str]:
    """
    Initialize lags grid and lags label for model selection. 

    Parameters
    ----------
    forecaster : Forecaster
        Forecaster model. ForecasterRecursive, ForecasterDirect, 
        ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate.
    lags_grid : list, dict, default `None`
        Lists of lags to try, containing int, lists, numpy ndarray, or range 
        objects. If `dict`, the keys are used as labels in the `results` 
        DataFrame, and the values are used as the lists of lags to try.

    Returns
    -------
    lags_grid : dict
        Dictionary with lags configuration for each iteration.
    lags_label : str
        Label for lags representation in the results object.

    """

    if not isinstance(lags_grid, (list, dict, type(None))):
        raise TypeError(
            (f"`lags_grid` argument must be a list, dict or None. "
             f"Got {type(lags_grid)}.")
        )

    lags_label = 'values'
    if isinstance(lags_grid, list):
        lags_grid = {f'{lags}': lags for lags in lags_grid}
    elif lags_grid is None:
        lags = [int(lag) for lag in forecaster.lags]  # Required since numpy 2.0
        lags_grid = {f'{lags}': lags}
    else:
        lags_label = 'keys'

    return lags_grid, lags_label

skforecast.model_selection._utils.check_backtesting_input ¶

check_backtesting_input(
    forecaster,
    cv,
    metric,
    add_aggregated_metric=True,
    y=None,
    series=None,
    exog=None,
    interval=None,
    alpha=None,
    n_boot=250,
    random_state=123,
    use_in_sample_residuals=True,
    use_binned_residuals=False,
    n_jobs="auto",
    show_progress=True,
    suppress_warnings=False,
    suppress_warnings_fit=False,
)

Helper function that checks most inputs of the backtesting functions in the model_selection module.

Parameters:

Name	Type	Description	Default
`forecaster`	`Forecaster`	Forecaster model.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model.	required
`add_aggregated_metric`	`bool`	If `True`, the aggregated metrics (average, weighted average and pooling) over all levels are also returned (only multiseries).	`True`
`y`	`pandas Series`	Training time series for uni-series forecasters.	`None`
`series`	`pandas DataFrame, dict`	Training time series for multi-series forecasters.	`None`
`exog`	`pandas Series, pandas DataFrame, dict`	Exogenous variables.	`None`
`interval`	`list`	Confidence of the prediction interval estimated. Sequence of percentiles to compute, which must be between 0 and 100 inclusive.	`None`
`alpha`	`float`	The confidence intervals used in ForecasterSarimax are (1 - alpha) %.	`None`
`n_boot`	`int`	Number of bootstrapping iterations used to estimate prediction intervals.	`250`
`random_state`	`int`	Sets a seed to the random generator, so that boot intervals are always deterministic.	`123`
`use_in_sample_residuals`	`bool`	If `True`, residuals from the training data are used as proxy of prediction error to create prediction intervals. If `False`, out_sample_residuals are used if they are already stored inside the forecaster.	`True`
`use_binned_residuals`	`bool`	If `True`, residuals used in each bootstrapping iteration are selected conditioning on the predicted values. If `False`, residuals are selected randomly without conditioning on the predicted values.	`False`
`n_jobs`	`(int, auto)`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_fit_forecaster. New in version 0.9.0	`'auto'`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`suppress_warnings`	`bool`	If `True`, skforecast warnings will be suppressed during the backtesting process. See skforecast.exceptions.warn_skforecast_categories for more information.	`False`
`suppress_warnings_fit`	`bool`	If `True`, warnings generated during fitting will be ignored. Only `ForecasterSarimax`.	`False`

Returns:

Type	Description
`None`

Source code in skforecast/model_selection/_utils.py

def check_backtesting_input(
    forecaster: object,
    cv: object,
    metric: Union[str, Callable, list],
    add_aggregated_metric: bool = True,
    y: Optional[pd.Series] = None,
    series: Optional[Union[pd.DataFrame, dict]] = None,
    exog: Optional[Union[pd.Series, pd.DataFrame, dict]] = None,
    interval: Optional[list] = None,
    alpha: Optional[float] = None,
    n_boot: int = 250,
    random_state: int = 123,
    use_in_sample_residuals: bool = True,
    use_binned_residuals: bool = False,
    n_jobs: Union[int, str] = 'auto',
    show_progress: bool = True,
    suppress_warnings: bool = False,
    suppress_warnings_fit: bool = False
) -> None:
    """
    Helper function that checks most inputs of the backtesting functions in the
    `model_selection` module.

    Parameters
    ----------
    forecaster : Forecaster
        Forecaster model.
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.
    add_aggregated_metric : bool, default `True`
        If `True`, the aggregated metrics (average, weighted average and pooling)
        over all levels are also returned (only multiseries).
    y : pandas Series, default `None`
        Training time series for uni-series forecasters.
    series : pandas DataFrame, dict, default `None`
        Training time series for multi-series forecasters.
    exog : pandas Series, pandas DataFrame, dict, default `None`
        Exogenous variables.
    interval : list, default `None`
        Confidence of the prediction interval estimated. Sequence of percentiles
        to compute, which must be between 0 and 100 inclusive.
    alpha : float, default `None`
        The confidence intervals used in ForecasterSarimax are (1 - alpha) %. 
    n_boot : int, default `250`
        Number of bootstrapping iterations used to estimate prediction
        intervals.
    random_state : int, default `123`
        Sets a seed to the random generator, so that boot intervals are always 
        deterministic.
    use_in_sample_residuals : bool, default `True`
        If `True`, residuals from the training data are used as proxy of prediction 
        error to create prediction intervals.  If `False`, out_sample_residuals 
        are used if they are already stored inside the forecaster.
    use_binned_residuals : bool, default `False`
        If `True`, residuals used in each bootstrapping iteration are selected
        conditioning on the predicted values. If `False`, residuals are selected
        randomly without conditioning on the predicted values.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_fit_forecaster.
        **New in version 0.9.0**
    show_progress : bool, default `True`
        Whether to show a progress bar.
    suppress_warnings: bool, default `False`
        If `True`, skforecast warnings will be suppressed during the backtesting 
        process. See skforecast.exceptions.warn_skforecast_categories for more
        information.
    suppress_warnings_fit : bool, default `False`
        If `True`, warnings generated during fitting will be ignored. Only 
        `ForecasterSarimax`.

    Returns
    -------
    None

    """

    forecaster_name = type(forecaster).__name__
    cv_name = type(cv).__name__

    if cv_name != "TimeSeriesFold":
        raise TypeError(f"`cv` must be a TimeSeriesFold object. Got {cv_name}.")

    steps = cv.steps
    initial_train_size = cv.initial_train_size
    gap = cv.gap
    allow_incomplete_fold = cv.allow_incomplete_fold
    refit = cv.refit

    forecasters_uni = [
        "ForecasterRecursive",
        "ForecasterDirect",
        "ForecasterSarimax",
        "ForecasterEquivalentDate",
    ]
    forecasters_multi = [
        "ForecasterDirectMultiVariate",
        "ForecasterRnn",
    ]
    forecasters_multi_dict = [
        "ForecasterRecursiveMultiSeries"
    ]

    if forecaster_name in forecasters_uni:
        if not isinstance(y, pd.Series):
            raise TypeError("`y` must be a pandas Series.")
        data_name = 'y'
        data_length = len(y)

    elif forecaster_name in forecasters_multi:
        if not isinstance(series, pd.DataFrame):
            raise TypeError("`series` must be a pandas DataFrame.")
        data_name = 'series'
        data_length = len(series)

    elif forecaster_name in forecasters_multi_dict:
        if not isinstance(series, (pd.DataFrame, dict)):
            raise TypeError(
                f"`series` must be a pandas DataFrame or a dict of DataFrames or Series. "
                f"Got {type(series)}."
            )

        data_name = 'series'
        if isinstance(series, dict):
            not_valid_series = [
                k 
                for k, v in series.items()
                if not isinstance(v, (pd.Series, pd.DataFrame))
            ]
            if not_valid_series:
                raise TypeError(
                    f"If `series` is a dictionary, all series must be a named "
                    f"pandas Series or a pandas DataFrame with a single column. "
                    f"Review series: {not_valid_series}"
                )
            not_valid_index = [
                k 
                for k, v in series.items()
                if not isinstance(v.index, pd.DatetimeIndex)
            ]
            if not_valid_index:
                raise ValueError(
                    f"If `series` is a dictionary, all series must have a Pandas "
                    f"DatetimeIndex as index with the same frequency. "
                    f"Review series: {not_valid_index}"
                )

            indexes_freq = [f'{v.index.freq}' for v in series.values()]
            indexes_freq = sorted(set(indexes_freq))
            if not len(indexes_freq) == 1:
                raise ValueError(
                    f"If `series` is a dictionary, all series must have a Pandas "
                    f"DatetimeIndex as index with the same frequency. "
                    f"Found frequencies: {indexes_freq}"
                )
            data_length = max([len(series[serie]) for serie in series])
        else:
            data_length = len(series)

    if exog is not None:
        if forecaster_name in forecasters_multi_dict:
            if not isinstance(exog, (pd.Series, pd.DataFrame, dict)):
                raise TypeError(
                    f"`exog` must be a pandas Series, DataFrame, dictionary of pandas "
                    f"Series/DataFrames or None. Got {type(exog)}."
                )
            if isinstance(exog, dict):
                not_valid_exog = [
                    k 
                    for k, v in exog.items()
                    if not isinstance(v, (pd.Series, pd.DataFrame, type(None)))
                ]
                if not_valid_exog:
                    raise TypeError(
                        f"If `exog` is a dictionary, All exog must be a named pandas "
                        f"Series, a pandas DataFrame or None. Review exog: {not_valid_exog}"
                    )
        else:
            if not isinstance(exog, (pd.Series, pd.DataFrame)):
                raise TypeError(
                    f"`exog` must be a pandas Series, DataFrame or None. Got {type(exog)}."
                )

    if hasattr(forecaster, 'differentiation'):
        if forecaster.differentiation != cv.differentiation:
            raise ValueError(
                f"The differentiation included in the forecaster "
                f"({forecaster.differentiation}) differs from the differentiation "
                f"included in the cv ({cv.differentiation}). Set the same value "
                f"for both using the `differentiation` argument."
            )

    if not isinstance(metric, (str, Callable, list)):
        raise TypeError(
            f"`metric` must be a string, a callable function, or a list containing "
            f"multiple strings and/or callables. Got {type(metric)}."
        )

    if forecaster_name == "ForecasterEquivalentDate" and isinstance(
        forecaster.offset, pd.tseries.offsets.DateOffset
    ):
        if initial_train_size is None:
            raise ValueError(
                f"`initial_train_size` must be an integer greater than "
                f"the `window_size` of the forecaster ({forecaster.window_size}) "
                f"and smaller than the length of `{data_name}` ({data_length})."
            )
    elif initial_train_size is not None:
        if initial_train_size < forecaster.window_size or initial_train_size >= data_length:
            raise ValueError(
                f"If used, `initial_train_size` must be an integer greater than "
                f"the `window_size` of the forecaster ({forecaster.window_size}) "
                f"and smaller than the length of `{data_name}` ({data_length})."
            )
        if initial_train_size + gap >= data_length:
            raise ValueError(
                f"The combination of initial_train_size {initial_train_size} and "
                f"gap {gap} cannot be greater than the length of `{data_name}` "
                f"({data_length})."
            )
    else:
        if forecaster_name in ['ForecasterSarimax', 'ForecasterEquivalentDate']:
            raise ValueError(
                f"`initial_train_size` must be an integer smaller than the "
                f"length of `{data_name}` ({data_length})."
            )
        else:
            if not forecaster.is_fitted:
                raise NotFittedError(
                    "`forecaster` must be already trained if no `initial_train_size` "
                    "is provided."
                )
            if refit:
                raise ValueError(
                    "`refit` is only allowed when `initial_train_size` is not `None`."
                )

    if forecaster_name == 'ForecasterSarimax' and cv.skip_folds is not None:
        raise ValueError(
            "`skip_folds` is not allowed for ForecasterSarimax. Set it to `None`."
        )

    if not isinstance(add_aggregated_metric, bool):
        raise TypeError("`add_aggregated_metric` must be a boolean: `True`, `False`.")
    if not isinstance(n_boot, (int, np.integer)) or n_boot < 0:
        raise TypeError(f"`n_boot` must be an integer greater than 0. Got {n_boot}.")
    if not isinstance(random_state, (int, np.integer)) or random_state < 0:
        raise TypeError(f"`random_state` must be an integer greater than 0. Got {random_state}.")
    if not isinstance(use_in_sample_residuals, bool):
        raise TypeError("`use_in_sample_residuals` must be a boolean: `True`, `False`.")
    if not isinstance(use_binned_residuals, bool):
        raise TypeError("`use_binned_residuals` must be a boolean: `True`, `False`.")
    if not isinstance(n_jobs, int) and n_jobs != 'auto':
        raise TypeError(f"`n_jobs` must be an integer or `'auto'`. Got {n_jobs}.")
    if not isinstance(show_progress, bool):
        raise TypeError("`show_progress` must be a boolean: `True`, `False`.")
    if not isinstance(suppress_warnings, bool):
        raise TypeError("`suppress_warnings` must be a boolean: `True`, `False`.")
    if not isinstance(suppress_warnings_fit, bool):
        raise TypeError("`suppress_warnings_fit` must be a boolean: `True`, `False`.")

    if interval is not None or alpha is not None:
        check_interval(interval=interval, alpha=alpha)

    if not allow_incomplete_fold and data_length - (initial_train_size + gap) < steps:
        raise ValueError(
            f"There is not enough data to evaluate {steps} steps in a single "
            f"fold. Set `allow_incomplete_fold` to `True` to allow incomplete folds.\n"
            f"    Data available for test : {data_length - (initial_train_size + gap)}\n"
            f"    Steps                   : {steps}"
        )
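
The final check in the listing above compares the observations left after training and the gap with the number of steps per fold. A minimal sketch of that arithmetic, using hypothetical helper names and assuming `initial_train_size` is an integer:

```python
def available_for_test(data_length, initial_train_size, gap):
    """Observations left for evaluation after the training set and the gap."""
    return data_length - (initial_train_size + gap)


def fits_one_full_fold(data_length, initial_train_size, gap, steps):
    """True when at least one complete fold of `steps` predictions fits."""
    return available_for_test(data_length, initial_train_size, gap) >= steps


# 100 observations, 90 used for training, 10 steps per fold.
enough = fits_one_full_fold(100, 90, gap=0, steps=10)  # 10 observations left
short = fits_one_full_fold(100, 90, gap=3, steps=10)   # only 7 observations left
print(enough, short)
```

In the second case the listing above would raise a `ValueError` unless `allow_incomplete_fold` is `True`.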

skforecast.model_selection._utils.select_n_jobs_backtesting ¶

select_n_jobs_backtesting(forecaster, refit)

Select the optimal number of jobs to use in the backtesting process. This selection is based on heuristics and is not guaranteed to be optimal.

The number of jobs is chosen as follows:

- If refit is an integer greater than 1, then n_jobs = 1. This is because parallelization doesn't work with intermittent refit.
- If forecaster is 'ForecasterRecursive' and regressor is a linear regressor, then n_jobs = 1.
- If forecaster is 'ForecasterRecursive' and regressor is not a linear regressor, then n_jobs = cpu_count() - 1.
- If forecaster is 'ForecasterDirect' or 'ForecasterDirectMultiVariate', then n_jobs = 1, because parallelization is applied during the fitting process.
- If forecaster is 'ForecasterRecursiveMultiSeries', then n_jobs = cpu_count() - 1.
- If forecaster is 'ForecasterSarimax' or 'ForecasterEquivalentDate', then n_jobs = 1.
- If regressor is a LGBMRegressor(n_jobs=1), then n_jobs = cpu_count() - 1.
- If regressor is a LGBMRegressor with internal n_jobs != 1, then n_jobs = 1. This is because lightgbm is highly optimized for gradient boosting and parallelizes operations at a very fine-grained level, making additional parallelization unnecessary and potentially harmful due to resource contention.
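
The rules above can be sketched as a small decision function. This is an illustrative restatement, not the library implementation: `n_jobs_sketch` and its arguments are hypothetical, names are passed as plain strings, and the core count is an explicit parameter so the result is deterministic. For the direct forecasters it follows the source listing for this function, which always returns 1.

```python
def n_jobs_sketch(forecaster_name, regressor_name, refit, n_cores,
                  regressor_n_jobs=None, is_linear=False):
    """Illustrative heuristic for choosing the number of parallel jobs."""
    refit = False if refit == 0 else refit
    if not isinstance(refit, bool) and refit != 1:
        return 1  # intermittent refit cannot be parallelized

    if forecaster_name in ("ForecasterRecursive", "ForecasterRecursiveMultiSeries"):
        if forecaster_name == "ForecasterRecursive" and is_linear:
            return 1
        if regressor_name == "LGBMRegressor":
            # Avoid nesting parallelism when LightGBM already uses many threads.
            return n_cores - 1 if regressor_n_jobs == 1 else 1
        return n_cores - 1

    # Direct forecasters, ForecasterSarimax, ForecasterEquivalentDate and others.
    return 1


print(n_jobs_sketch("ForecasterRecursive", "RandomForestRegressor", True, n_cores=8))
print(n_jobs_sketch("ForecasterRecursive", "LGBMRegressor", True, n_cores=8,
                    regressor_n_jobs=-1))
```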

Parameters:

Name	Type	Description	Default
`forecaster`	`Forecaster`	Forecaster model.	required
`refit`	`(bool, int)`	If the forecaster is refitted during the backtesting process.	required

Returns:

Name	Type	Description
`n_jobs`	`int`	The number of jobs to run in parallel.

Source code in skforecast/model_selection/_utils.py

def select_n_jobs_backtesting(
    forecaster: object,
    refit: Union[bool, int]
) -> int:
    """
    Select the optimal number of jobs to use in the backtesting process. This
    selection is based on heuristics and is not guaranteed to be optimal.

    The number of jobs is chosen as follows:

    - If `refit` is an integer, then `n_jobs = 1`. This is because parallelization doesn't 
    work with intermittent refit.
    - If forecaster is 'ForecasterRecursive' and regressor is a linear regressor, 
    then `n_jobs = 1`.
    - If forecaster is 'ForecasterRecursive' and regressor is not a linear 
    regressor then `n_jobs = cpu_count() - 1`.
    - If forecaster is 'ForecasterDirect' or 'ForecasterDirectMultiVariate',
    then `n_jobs = 1`, because parallelization is applied during the fitting process.
    - If forecaster is 'ForecasterRecursiveMultiSeries', then `n_jobs = cpu_count() - 1`.
    - If forecaster is 'ForecasterSarimax' or 'ForecasterEquivalentDate', 
    then `n_jobs = 1`.
    - If regressor is a `LGBMRegressor(n_jobs=1)`, then `n_jobs = cpu_count() - 1`.
    - If regressor is a `LGBMRegressor` with internal n_jobs != 1, then `n_jobs = 1`.
    This is because `lightgbm` is highly optimized for gradient boosting and
    parallelizes operations at a very fine-grained level, making additional
    parallelization unnecessary and potentially harmful due to resource contention.

    Parameters
    ----------
    forecaster : Forecaster
        Forecaster model.
    refit : bool, int
        If the forecaster is refitted during the backtesting process.

    Returns
    -------
    n_jobs : int
        The number of jobs to run in parallel.

    """

    forecaster_name = type(forecaster).__name__

    if isinstance(forecaster.regressor, Pipeline):
        regressor = forecaster.regressor[-1]
        regressor_name = type(regressor).__name__
    else:
        regressor = forecaster.regressor
        regressor_name = type(regressor).__name__

    linear_regressors = [
        regressor_name
        for regressor_name in dir(sklearn.linear_model)
        if not regressor_name.startswith('_')
    ]

    refit = False if refit == 0 else refit
    if not isinstance(refit, bool) and refit != 1:
        n_jobs = 1
    else:
        if forecaster_name in ['ForecasterRecursive']:
            if regressor_name in linear_regressors:
                n_jobs = 1
            elif regressor_name == 'LGBMRegressor':
                n_jobs = cpu_count() - 1 if regressor.n_jobs == 1 else 1
            else:
                n_jobs = cpu_count() - 1
        elif forecaster_name in ['ForecasterDirect', 'ForecasterDirectMultiVariate']:
            # Parallelization is applied during the fitting process.
            n_jobs = 1
        elif forecaster_name in ['ForecasterRecursiveMultiSeries']:
            if regressor_name == 'LGBMRegressor':
                n_jobs = cpu_count() - 1 if regressor.n_jobs == 1 else 1
            else:
                n_jobs = cpu_count() - 1
        elif forecaster_name in ['ForecasterSarimax', 'ForecasterEquivalentDate']:
            n_jobs = 1
        else:
            n_jobs = 1

    return n_jobs
