`model_selection`

skforecast.model_selection._validation.backtesting_forecaster

backtesting_forecaster(
    forecaster,
    y,
    cv,
    metric,
    exog=None,
    interval=None,
    interval_method="bootstrapping",
    n_boot=250,
    use_in_sample_residuals=True,
    use_binned_residuals=True,
    random_state=123,
    n_jobs="auto",
    verbose=False,
    show_progress=True,
)

Backtesting of forecaster model following the folds generated by the TimeSeriesFold class and using the metric(s) provided.

If `forecaster` is already trained and `initial_train_size` is set to `None` in the `TimeSeriesFold` object, no initial training is performed and all data is used to evaluate the model. However, the first `len(forecaster.last_window)` observations are needed to create the initial predictors, so no predictions are calculated for them.

A copy of the original forecaster is created so that it is not modified during the process.
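The fold-by-fold evaluation described above can be pictured with a small, library-independent sketch. The helper `make_folds` is hypothetical and is not the `TimeSeriesFold` implementation. It assumes a fixed initial training window, a fixed number of `steps` per fold, and a training set that expands up to each test window when `refit` is true.

```python
def make_folds(n_obs, initial_train_size, steps, refit=True):
    """Return (train_range, test_range) pairs over positions 0..n_obs-1.

    Hypothetical helper for illustration only.
    """
    folds = []
    test_start = initial_train_size
    while test_start < n_obs:
        # The last fold may be shorter than `steps`.
        test_end = min(test_start + steps, n_obs)
        # With refit, training uses everything before the test window;
        # without refit, the initial training window is reused.
        train_end = test_start if refit else initial_train_size
        folds.append((range(0, train_end), range(test_start, test_end)))
        test_start = test_end
    return folds


folds = make_folds(n_obs=10, initial_train_size=4, steps=3)
for train, test in folds:
    print(f"train 0..{train.stop - 1} -> test {test.start}..{test.stop - 1}")
```

Each test window starts where the previous one ended, so every observation after the initial training window is predicted exactly once.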

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursive, ForecasterDirect, ForecasterEquivalentDate)`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds. New in version 0.14.0	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable(s) included as predictor(s). Must have the same number of observations as `y` and be aligned so that `y[i]` is regressed on `exog[i]`.	`None`
`interval`	`(float, list, tuple, str, object)`	Specifies whether probabilistic predictions should be estimated and the method to use. The following options are supported: If `float`, represents the nominal (expected) coverage (between 0 and 1). For instance, `interval=0.95` corresponds to `[2.5, 97.5]` percentiles. If `list` or `tuple`: Sequence of percentiles to compute; each value must be between 0 and 100 inclusive. For example, a 95% prediction interval can be specified as `interval = [2.5, 97.5]`, or multiple percentiles (e.g. 10, 50 and 90) as `interval = [10, 50, 90]`. If 'bootstrapping' (str): `n_boot` bootstrapping predictions will be generated. If a scipy.stats distribution object, the distribution parameters will be estimated for each prediction. If `None`, no probabilistic predictions are estimated.	`None`
`interval_method`	`str`	Technique used to estimate prediction intervals. Available options: 'bootstrapping': Bootstrapping is used to generate prediction intervals [1]_. 'conformal': Employs the conformal prediction split method for interval estimation [2]_.	`'bootstrapping'`
`n_boot`	`int`	Number of bootstrapping iterations to perform when estimating prediction intervals.	`250`
`use_in_sample_residuals`	`bool`	If `True`, residuals from the training data are used as a proxy for the prediction error when creating predictions. If `False`, out-of-sample residuals (calibration) are used. Out-of-sample residuals must be precomputed using the forecaster's `set_out_sample_residuals()` method.	`True`
`use_binned_residuals`	`bool`	If `True`, residuals are selected based on the predicted values (binned selection). If `False`, residuals are selected randomly.	`True`
`random_state`	`int`	Seed for the random number generator to ensure reproducibility.	`123`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds and index of training and validation sets used for backtesting.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
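The `metric` parameter accepts a callable with arguments `y_true`, `y_pred` and, optionally, `y_train` that returns a float. A minimal sketch of such a callable follows. The name `scaled_mae` is hypothetical, and the example feeds it plain Python lists; the exact object types the library passes in are an assumption not covered here.

```python
def scaled_mae(y_true, y_pred, y_train):
    """Mean absolute error scaled by the in-sample naive (lag-1) error.

    Hypothetical custom metric; it only follows the documented
    (y_true, y_pred, y_train) -> float contract.
    """
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    # Error of predicting each training value with the previous one.
    naive_errors = [abs(b - a) for a, b in zip(y_train[:-1], y_train[1:])]
    scale = sum(naive_errors) / len(naive_errors)
    return float(mae / scale)


score = scaled_mae(
    y_true=[10.0, 12.0],
    y_pred=[11.0, 11.0],
    y_train=[1.0, 3.0, 5.0, 7.0],
)
print(score)  # 0.5
```

A callable like this could be passed on its own or inside a list together with the built-in metric names.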

Returns:

Name	Type	Description
`metric_values`	`pandas DataFrame`	Value(s) of the metric(s).
`backtest_predictions`	`pandas DataFrame`	Predictions and, if `interval` is not `None`, their estimated probabilistic predictions. Column `pred`: predictions. If `interval` is a float, columns 'lower_bound' and 'upper_bound' are created. If `interval` is a list or tuple, the columns are the percentiles; when it has exactly two elements, the columns are named 'lower_bound' and 'upper_bound'. If `interval` is 'bootstrapping', `n_boot` columns are created with the bootstrapping predictions. If `interval` is a distribution object, the columns are the parameters of the distribution.
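The relationship between a float `interval` (nominal coverage) and the equivalent pair of percentiles can be checked with a short sketch. The helper `coverage_to_percentiles` is hypothetical and assumes a symmetric interval, which matches the documented example where `interval=0.95` corresponds to `[2.5, 97.5]`.

```python
def coverage_to_percentiles(coverage):
    """Convert a nominal coverage in (0, 1) to [lower, upper] percentiles.

    Hypothetical helper; assumes equal probability mass in both tails.
    """
    if not 0 < coverage < 1:
        raise ValueError("coverage must be strictly between 0 and 1")
    tail = (1 - coverage) / 2 * 100
    # Rounding removes floating-point noise such as 2.500000000000002.
    return [round(tail, 10), round(100 - tail, 10)]


print(coverage_to_percentiles(0.95))  # [2.5, 97.5]
print(coverage_to_percentiles(0.80))  # [10.0, 90.0]
```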

References

.. [1] Forecasting: Principles and Practice (3rd ed.) Rob J Hyndman and George Athanasopoulos. https://otexts.com/fpp3/prediction-intervals.html

.. [2] MAPIE - Model Agnostic Prediction Interval Estimator. https://mapie.readthedocs.io/en/stable/theoretical_description_regression.html#the-split-method

Source code in skforecast\model_selection\_validation.py

def backtesting_forecaster(
    forecaster: object,
    y: pd.Series,
    cv: TimeSeriesFold,
    metric: str | Callable | list[str | Callable],
    exog: pd.Series | pd.DataFrame | None = None,
    interval: float | list[float] | tuple[float] | str | object | None = None,
    interval_method: str = 'bootstrapping',
    n_boot: int = 250,
    use_in_sample_residuals: bool = True,
    use_binned_residuals: bool = True,
    random_state: int = 123,
    n_jobs: int | str = 'auto',
    verbose: bool = False,
    show_progress: bool = True
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Backtesting of forecaster model following the folds generated by the TimeSeriesFold
    class and using the metric(s) provided.

    If `forecaster` is already trained and `initial_train_size` is set to `None` in the
    TimeSeriesFold class, no initial train will be done and all data will be used
    to evaluate the model. However, the first `len(forecaster.last_window)` observations
    are needed to create the initial predictors, so no predictions are calculated for
    them.

    A copy of the original forecaster is created so that it is not modified during 
    the process.

    Parameters
    ----------
    forecaster : ForecasterRecursive, ForecasterDirect, ForecasterEquivalentDate
        Forecaster model.
    y : pandas Series
        Training time series.
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
        **New in version 0.14.0**
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default None
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    interval : float, list, tuple, str, object, default None
        Specifies whether probabilistic predictions should be estimated and the 
        method to use. The following options are supported:

        - If `float`, represents the nominal (expected) coverage (between 0 and 1). 
        For instance, `interval=0.95` corresponds to `[2.5, 97.5]` percentiles.
        - If `list` or `tuple`: Sequence of percentiles to compute, each value must 
        be between 0 and 100 inclusive. For example, a 95% confidence interval can 
        be specified as `interval = [2.5, 97.5]` or multiple percentiles (e.g. 10, 
        50 and 90) as `interval = [10, 50, 90]`.
        - If 'bootstrapping' (str): `n_boot` bootstrapping predictions will be generated.
        - If scipy.stats distribution object, the distribution parameters will
        be estimated for each prediction.
        - If None, no probabilistic predictions are estimated.
    interval_method : str, default 'bootstrapping'
        Technique used to estimate prediction intervals. Available options:

        - 'bootstrapping': Bootstrapping is used to generate prediction 
        intervals [1]_.
        - 'conformal': Employs the conformal prediction split method for 
        interval estimation [2]_.
    n_boot : int, default 250
        Number of bootstrapping iterations to perform when estimating prediction
        intervals.
    use_in_sample_residuals : bool, default True
        If `True`, residuals from the training data are used as proxy of
        prediction error to create predictions. 
        If `False`, out of sample residuals (calibration) are used. 
        Out-of-sample residuals must be precomputed using Forecaster's
        `set_out_sample_residuals()` method.
    use_binned_residuals : bool, default True
        If `True`, residuals are selected based on the predicted values 
        (binned selection).
        If `False`, residuals are selected randomly.
    random_state : int, default 123
        Seed for the random number generator to ensure reproducibility.
    n_jobs : int, 'auto', default 'auto'
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default False
        Print number of folds and index of training and validation sets used 
        for backtesting.
    show_progress : bool, default True
        Whether to show a progress bar.

    Returns
    -------
    metric_values : pandas DataFrame
        Value(s) of the metric(s).
    backtest_predictions : pandas DataFrame
        Value of predictions and their estimated probabilistic predictions if 
        `interval` is not `None`.

        - column pred: predictions.
        - If `interval` is a float, columns 'lower_bound' and 'upper_bound' are created.
        - If `interval` is a list or tuple, columns are the percentiles. If `interval`
        has two elements, they are renamed to 'lower_bound' and 'upper_bound'.
        - If `interval` is 'bootstrapping', `n_boot` columns are created with the
        bootstrapping predictions.
        - If `interval` is a distribution object, columns are the parameters of the
        distribution.

    References
    ----------
    .. [1] Forecasting: Principles and Practice (3rd ed) Rob J Hyndman and George Athanasopoulos.
           https://otexts.com/fpp3/prediction-intervals.html

    .. [2] MAPIE - Model Agnostic Prediction Interval Estimator.
           https://mapie.readthedocs.io/en/stable/theoretical_description_regression.html#the-split-method

    """

    forecasters_allowed = [
        'ForecasterRecursive',
        'ForecasterDirect',
        'ForecasterEquivalentDate'
    ]

    if type(forecaster).__name__ not in forecasters_allowed:
        raise TypeError(
            f"`forecaster` must be of type {forecasters_allowed}, for all other types of "
            f"forecasters use the functions available in the other `model_selection` "
            f"modules."
        )

    check_backtesting_input(
        forecaster              = forecaster,
        cv                      = cv,
        y                       = y,
        metric                  = metric,
        interval                = interval,
        interval_method         = interval_method,
        n_boot                  = n_boot,
        use_in_sample_residuals = use_in_sample_residuals,
        use_binned_residuals    = use_binned_residuals,
        random_state            = random_state,
        n_jobs                  = n_jobs,
        show_progress           = show_progress
    )

    metric_values, backtest_predictions = _backtesting_forecaster(
        forecaster              = forecaster,
        y                       = y,
        cv                      = cv,
        metric                  = metric,
        exog                    = exog,
        interval                = interval,
        interval_method         = interval_method,
        n_boot                  = n_boot,
        use_in_sample_residuals = use_in_sample_residuals,
        use_binned_residuals    = use_binned_residuals,
        random_state            = random_state,
        n_jobs                  = n_jobs,
        verbose                 = verbose,
        show_progress           = show_progress
    )

    return metric_values, backtest_predictions

skforecast.model_selection._search.grid_search_forecaster

grid_search_forecaster(
    forecaster,
    y,
    cv,
    param_grid,
    metric,
    exog=None,
    lags_grid=None,
    return_best=True,
    n_jobs="auto",
    verbose=False,
    show_progress=True,
    output_file=None,
)

Exhaustive search over specified parameter values for a Forecaster object. Validation is done using time series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursive, ForecasterDirect)`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds. New in version 0.14.0	required
`param_grid`	`dict`	Dictionary with parameter names (`str`) as keys and lists of parameter settings to try as values.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable(s) included as predictor(s). Must have the same number of observations as `y` and be aligned so that `y[i]` is regressed on `exog[i]`.	`None`
`lags_grid`	`(list, dict)`	Lists of lags to try, containing int, lists, numpy ndarray, or range objects. If `dict`, the keys are used as labels in the `results` DataFrame, and the values are used as the lists of lags to try.	`None`
`return_best`	`bool`	If `True`, refit the `forecaster` on the whole data using the best parameters found.	`True`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. Column lags: lags configuration for each iteration. Column lags_label: descriptive label or alias for the lags. Column params: parameters configuration for each iteration. Column metric: metric value estimated for each iteration. One additional column per parameter, holding that parameter's value.
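To see how many candidates an exhaustive search evaluates, the expansion of `param_grid` combined with `lags_grid` can be sketched with the standard library. The helper `expand_grid` is hypothetical; it mimics the Cartesian product that scikit-learn's `ParameterGrid` produces for a single dictionary. Pairing every lags configuration with every parameter combination is an assumption based on the description of the `results` columns.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of the listed parameter values.

    Hypothetical helper; keys are sorted so the order is deterministic.
    """
    keys = sorted(param_grid)
    value_lists = [param_grid[key] for key in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


# Illustrative values only; the parameter names depend on the regressor used.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, 10]}
lags_grid = [3, [1, 2, 24]]

candidates = [
    (lags, params)
    for lags in lags_grid
    for params in expand_grid(param_grid)
]
print(len(candidates))  # 12
```

The candidate count grows multiplicatively with each added parameter or lags configuration, which is the usual reason to switch to a random or Bayesian search on larger spaces.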

Source code in skforecast\model_selection\_search.py

def grid_search_forecaster(
    forecaster: object,
    y: pd.Series,
    cv: TimeSeriesFold | OneStepAheadFold,
    param_grid: dict,
    metric: str | Callable | list[str | Callable],
    exog: pd.Series | pd.DataFrame | None = None,
    lags_grid: (
        list[int | list[int] | np.ndarray[int] | range[int]]
        | dict[str, list[int | list[int] | np.ndarray[int] | range[int]]]
        | None
    ) = None,
    return_best: bool = True,
    n_jobs: int | str = 'auto',
    verbose: bool = False,
    show_progress: bool = True,
    output_file: str | None = None
) -> pd.DataFrame:
    """
    Exhaustive search over specified parameter values for a Forecaster object.
    Validation is done using time series backtesting.

    Parameters
    ----------
    forecaster : ForecasterRecursive, ForecasterDirect
        Forecaster model.
    y : pandas Series
        Training time series.
    cv : TimeSeriesFold, OneStepAheadFold
        TimeSeriesFold or OneStepAheadFold object with the information needed to split
        the data into folds.
        **New in version 0.14.0**
    param_grid : dict
        Dictionary with parameters names (`str`) as keys and lists of parameter
        settings to try as values.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default None
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    lags_grid : list, dict, default None
        Lists of lags to try, containing int, lists, numpy ndarray, or range 
        objects. If `dict`, the keys are used as labels in the `results` 
        DataFrame, and the values are used as the lists of lags to try.
    return_best : bool, default True
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default 'auto'
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default False
        Print number of folds used for cv or backtesting.
    show_progress : bool, default True
        Whether to show a progress bar.
    output_file : str, default None
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column lags: lags configuration for each iteration.
        - column lags_label: descriptive label or alias for the lags.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterGrid(param_grid))

    results = _evaluate_grid_hyperparameters(
                  forecaster    = forecaster,
                  y             = y,
                  cv            = cv,
                  param_grid    = param_grid,
                  metric        = metric,
                  exog          = exog,
                  lags_grid     = lags_grid,
                  return_best   = return_best,
                  n_jobs        = n_jobs,
                  verbose       = verbose,
                  show_progress = show_progress,
                  output_file   = output_file
              )

    return results

skforecast.model_selection._search.random_search_forecaster

random_search_forecaster(
    forecaster,
    y,
    cv,
    param_distributions,
    metric,
    exog=None,
    lags_grid=None,
    n_iter=10,
    random_state=123,
    return_best=True,
    n_jobs="auto",
    verbose=False,
    show_progress=True,
    output_file=None,
)

Random search over specified parameter values or distributions for a Forecaster object. Validation is done using time series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursive, ForecasterDirect)`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds. New in version 0.14.0	required
`param_distributions`	`dict`	Dictionary with parameter names (`str`) as keys and distributions or lists of parameters to try as values.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable(s) included as predictor(s). Must have the same number of observations as `y` and be aligned so that `y[i]` is regressed on `exog[i]`.	`None`
`lags_grid`	`(list, dict)`	Lists of lags to try, containing int, lists, numpy ndarray, or range objects. If `dict`, the keys are used as labels in the `results` DataFrame, and the values are used as the lists of lags to try.	`None`
`n_iter`	`int`	Number of parameter settings that are sampled per lags configuration. `n_iter` trades off runtime against the quality of the solution.	`10`
`random_state`	`int`	Sets a seed to the random sampling for reproducible output.	`123`
`return_best`	`bool`	If `True`, refit the `forecaster` on the whole data using the best parameters found.	`True`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. Column lags: lags configuration for each iteration. Column lags_label: descriptive label or alias for the lags. Column params: parameters configuration for each iteration. Column metric: metric value estimated for each iteration. One additional column per parameter, holding that parameter's value.
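The effect of `n_iter` and `random_state` can be illustrated with a simplified, standard-library sampler. The helper `sample_settings` is hypothetical and only handles lists of values. Unlike scikit-learn's `ParameterSampler`, it draws with replacement and does not support distribution objects, so treat it as a sketch of the idea rather than the sampling scheme actually used.

```python
import random


def sample_settings(param_distributions, n_iter, random_state):
    """Draw `n_iter` parameter settings from lists of candidate values.

    Hypothetical, simplified sampler: draws with replacement.
    """
    rng = random.Random(random_state)
    keys = sorted(param_distributions)
    return [
        {key: rng.choice(param_distributions[key]) for key in keys}
        for _ in range(n_iter)
    ]


# Illustrative values only; the parameter names depend on the regressor used.
param_distributions = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, 10]}

settings = sample_settings(param_distributions, n_iter=4, random_state=123)
repeat = sample_settings(param_distributions, n_iter=4, random_state=123)
print(settings == repeat)  # True: the same seed reproduces the same draws
```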

Source code in skforecast\model_selection\_search.py

def random_search_forecaster(
    forecaster: object,
    y: pd.Series,
    cv: TimeSeriesFold | OneStepAheadFold,
    param_distributions: dict,
    metric: str | Callable | list[str | Callable],
    exog: pd.Series | pd.DataFrame | None = None,
    lags_grid: (
        list[int | list[int] | np.ndarray[int] | range[int]]
        | dict[str, list[int | list[int] | np.ndarray[int] | range[int]]]
        | None
    ) = None,
    n_iter: int = 10,
    random_state: int = 123,
    return_best: bool = True,
    n_jobs: int | str = 'auto',
    verbose: bool = False,
    show_progress: bool = True,
    output_file: str | None = None
) -> pd.DataFrame:
    """
    Random search over specified parameter values or distributions for a Forecaster 
    object. Validation is done using time series backtesting.

    Parameters
    ----------
    forecaster : ForecasterRecursive, ForecasterDirect
        Forecaster model.
    y : pandas Series
        Training time series.
    cv : TimeSeriesFold, OneStepAheadFold
        TimeSeriesFold or OneStepAheadFold object with the information needed to split
        the data into folds.
        **New in version 0.14.0**
    param_distributions : dict
        Dictionary with parameters names (`str`) as keys and 
        distributions or lists of parameters to try.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default None
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i]. 
    lags_grid : list, dict, default None
        Lists of lags to try, containing int, lists, numpy ndarray, or range 
        objects. If `dict`, the keys are used as labels in the `results` 
        DataFrame, and the values are used as the lists of lags to try.
    n_iter : int, default 10
        Number of parameter settings that are sampled per lags configuration. 
        n_iter trades off runtime vs quality of the solution.
    random_state : int, default 123
        Sets a seed to the random sampling for reproducible output.
    return_best : bool, default True
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default 'auto'
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default False
        Print number of folds used for cv or backtesting.
    show_progress : bool, default True
        Whether to show a progress bar.
    output_file : str, default None
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column lags: lags configuration for each iteration.
        - column lags_label: descriptive label or alias for the lags.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterSampler(param_distributions, n_iter=n_iter, random_state=random_state))

    results = _evaluate_grid_hyperparameters(
                  forecaster    = forecaster,
                  y             = y,
                  cv            = cv,
                  param_grid    = param_grid,
                  metric        = metric,
                  exog          = exog,
                  lags_grid     = lags_grid,
                  return_best   = return_best,
                  n_jobs        = n_jobs,
                  verbose       = verbose,
                  show_progress = show_progress,
                  output_file   = output_file
              )

    return results

skforecast.model_selection._search.bayesian_search_forecaster

bayesian_search_forecaster(
    forecaster,
    y,
    cv,
    search_space,
    metric,
    exog=None,
    n_trials=10,
    random_state=123,
    return_best=True,
    n_jobs="auto",
    verbose=False,
    show_progress=True,
    output_file=None,
    kwargs_create_study={},
    kwargs_study_optimize={},
)

Bayesian search for hyperparameters of a Forecaster object.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursive, ForecasterDirect)`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds. New in version 0.14.0	required
`search_space`	`Callable(optuna)`	Function with argument `trial` that returns a dictionary with parameter names (`str`) as keys and values suggested by the optuna Trial object (trial.suggest_float, trial.suggest_int, trial.suggest_categorical).	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable(s) included as predictor(s). Must have the same number of observations as `y` and be aligned so that `y[i]` is regressed on `exog[i]`.	`None`
`n_trials`	`int`	Number of parameter settings sampled for each lags configuration.	`10`
`random_state`	`int`	Sets a seed to the sampling for reproducible output. When a new sampler is passed in `kwargs_create_study`, the seed must be set within the sampler. For example `{'sampler': TPESampler(seed=145)}`.	`123`
`return_best`	`bool`	If `True`, refit the `forecaster` on the whole data using the best parameters found.	`True`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`
`kwargs_create_study`	`dict`	Additional keyword arguments (key, value mappings) to pass to optuna.create_study(). If left at the default, the direction is set to 'minimize' and a TPESampler(seed=123) sampler is used during optimization.	`{}`
`kwargs_study_optimize`	`dict`	Additional keyword arguments (key, value mappings) to pass to study.optimize().	`{}`

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. Column lags: lags configuration for each iteration. Column params: parameters configuration for each iteration. Column metric: metric value estimated for each iteration. One additional column per parameter, holding that parameter's value.
`best_trial`	`optuna object`	The best optimization result returned as a FrozenTrial optuna object.
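The shape of a `search_space` function can be sketched without installing optuna. The function takes a `trial` and returns a dictionary of parameter names mapped to suggested values. `FakeTrial` below is a hypothetical stand-in that exposes the three `suggest_*` methods named in the parameter description and always returns the lower bound or first choice; a real optuna trial would sample from the ranges instead. The parameter names are illustrative and depend on the regressor in use.

```python
class FakeTrial:
    """Hypothetical stand-in for an optuna Trial, for illustration only."""

    def suggest_int(self, name, low, high):
        return low

    def suggest_float(self, name, low, high):
        return low

    def suggest_categorical(self, name, choices):
        return choices[0]


def search_space(trial):
    """Map parameter names to values suggested by the trial."""
    return {
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "max_features": trial.suggest_categorical("max_features", ["sqrt", "log2"]),
    }


params = search_space(FakeTrial())
print(params)
```

Keeping the search space in a function lets the optimizer call it once per trial, so each trial receives its own freshly suggested parameter set.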

Source code in skforecast\model_selection\_search.py

def bayesian_search_forecaster(
    forecaster: object,
    y: pd.Series,
    cv: TimeSeriesFold | OneStepAheadFold,
    search_space: Callable,
    metric: str | Callable | list[str | Callable],
    exog: pd.Series | pd.DataFrame | None = None,
    n_trials: int = 10,
    random_state: int = 123,
    return_best: bool = True,
    n_jobs: int | str = 'auto',
    verbose: bool = False,
    show_progress: bool = True,
    output_file: str | None = None,
    kwargs_create_study: dict = {},
    kwargs_study_optimize: dict = {}
) -> tuple[pd.DataFrame, object]:
    """
    Bayesian search for hyperparameters of a Forecaster object.

    Parameters
    ----------
    forecaster : ForecasterRecursive, ForecasterDirect
        Forecaster model.
    y : pandas Series
        Training time series.
    cv : TimeSeriesFold, OneStepAheadFold
        TimeSeriesFold or OneStepAheadFold object with the information needed to split
        the data into folds.
        **New in version 0.14.0**
    search_space : Callable (optuna)
        Function with argument `trial` which returns a dictionary with parameters names 
        (`str`) as keys and Trial object from optuna (trial.suggest_float, 
        trial.suggest_int, trial.suggest_categorical) as values.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default None
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    n_trials : int, default 10
        Number of parameter settings that are sampled in each lag configuration.
    random_state : int, default 123
        Sets a seed to the sampling for reproducible output. When a new sampler 
        is passed in `kwargs_create_study`, the seed must be set within the 
        sampler. For example `{'sampler': TPESampler(seed=145)}`.
    return_best : bool, default True
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default 'auto'
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default False
        Print number of folds used for cv or backtesting.
    show_progress : bool, default True
        Whether to show a progress bar.
    output_file : str, default None
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**
    kwargs_create_study : dict, default {}
        Additional keyword arguments (key, value mappings) to pass to optuna.create_study().
        If default, the direction is set to 'minimize' and a TPESampler(seed=123) 
        sampler is used during optimization.
    kwargs_study_optimize : dict, default {}
        Additional keyword arguments (key, value mappings) to pass to study.optimize().

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column lags: lags configuration for each iteration.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration.
        - additional n columns with param = value.
    best_trial : optuna object
        The best optimization result returned as a FrozenTrial optuna object.

    """

    if return_best and exog is not None and (len(exog) != len(y)):
        raise ValueError(
            f"`exog` must have same number of samples as `y`. "
            f"length `exog`: ({len(exog)}), length `y`: ({len(y)})"
        )

    results, best_trial = _bayesian_search_optuna(
                              forecaster            = forecaster,
                              y                     = y,
                              cv                    = cv,
                              exog                  = exog,
                              search_space          = search_space,
                              metric                = metric,
                              n_trials              = n_trials,
                              random_state          = random_state,
                              return_best           = return_best,
                              n_jobs                = n_jobs,
                              verbose               = verbose,
                              show_progress         = show_progress,
                              output_file           = output_file,
                              kwargs_create_study   = kwargs_create_study,
                              kwargs_study_optimize = kwargs_study_optimize
                          )

    return results, best_trial
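
When `metric` is a `Callable`, the docstring above states that it takes `y_true`, `y_pred` and, optionally, `y_train`, and returns a float. A minimal sketch of a function with that signature is shown below, written in plain Python so that it accepts any equal-length sequences. The name `scaled_mae` and the choice of scaling by the in-sample error of a naive one-step forecast are illustrative assumptions, not part of skforecast.

```python
def scaled_mae(y_true, y_pred, y_train):
    """Mean absolute error divided by the MAE of a naive one-step forecast
    (previous value) computed on the training series."""
    y_true, y_pred, y_train = list(y_true), list(y_pred), list(y_train)

    # Error of the forecasts being evaluated
    mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

    # In-sample error of predicting each training value with the previous one
    naive_errors = [abs(b - a) for a, b in zip(y_train[:-1], y_train[1:])]
    naive_mae = sum(naive_errors) / len(naive_errors)

    return float(mae / naive_mae)


# Illustrative values only: MAE = 1.0, naive in-sample MAE = 2.0
print(scaled_mae(y_true=[10, 12], y_pred=[11, 11], y_train=[1, 3, 1, 3]))  # 0.5
```

Such a function could be passed on its own (`metric=scaled_mae`) or inside a list together with metric names.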

skforecast.model_selection._validation.backtesting_forecaster_multiseries ¶

backtesting_forecaster_multiseries(
    forecaster,
    series,
    cv,
    metric,
    levels=None,
    add_aggregated_metric=True,
    exog=None,
    interval=None,
    interval_method="conformal",
    n_boot=250,
    use_in_sample_residuals=True,
    use_binned_residuals=True,
    random_state=123,
    n_jobs="auto",
    verbose=False,
    show_progress=True,
    suppress_warnings=False,
)

Backtesting of forecaster model following the folds generated by the TimeSeriesFold class and using the metric(s) provided.

If forecaster is already trained and initial_train_size is set to None in the TimeSeriesFold class, no initial train will be done and all data will be used to evaluate the model. However, the first len(forecaster.last_window) observations are needed to create the initial predictors, so no predictions are calculated for them.

A copy of the original forecaster is created so that it is not modified during the process.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate, ForecasterRnn)`	Forecaster model.	required
`series`	`pandas DataFrame, dict`	Training time series.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`levels`	`(str, list)`	Time series to be predicted. If `None` all levels will be predicted.	`None`
`add_aggregated_metric`	`bool`	If `True`, and multiple series (`levels`) are predicted, the aggregated metrics (average, weighted average and pooled) are also returned. 'average': the average (arithmetic mean) of all levels. 'weighted_average': the average of the metrics weighted by the number of predicted values of each level. 'pooling': the values of all levels are pooled and then the metric is calculated.	`True`
`exog`	`pandas Series, pandas DataFrame, dict`	Exogenous variables.	`None`
`interval`	`(float, list, tuple, str, object)`	Specifies whether probabilistic predictions should be estimated and the method to use. The following options are supported: If `float`, represents the nominal (expected) coverage (between 0 and 1). For instance, `interval=0.95` corresponds to `[2.5, 97.5]` percentiles. If `list` or `tuple`: Sequence of percentiles to compute, each value must be between 0 and 100 inclusive. For example, a 95% confidence interval can be specified as `interval = [2.5, 97.5]` or multiple percentiles (e.g. 10, 50 and 90) as `interval = [10, 50, 90]`. If 'bootstrapping' (str): `n_boot` bootstrapping predictions will be generated. If scipy.stats distribution object, the distribution parameters will be estimated for each prediction. If None, no probabilistic predictions are estimated.	`None`
`interval_method`	`str`	Technique used to estimate prediction intervals. Available options: 'bootstrapping': Bootstrapping is used to generate prediction intervals [1]_. 'conformal': Employs the conformal prediction split method for interval estimation [2]_.	`'conformal'`
`n_boot`	`int`	Number of bootstrapping iterations to perform when estimating prediction intervals.	`250`
`use_in_sample_residuals`	`bool`	If `True`, residuals from the training data are used as proxy of prediction error to create predictions. If `False`, out of sample residuals (calibration) are used. Out-of-sample residuals must be precomputed using Forecaster's `set_out_sample_residuals()` method.	`True`
`use_binned_residuals`	`bool`	If `True`, residuals are selected based on the predicted values (binned selection). If `False`, residuals are selected randomly.	`True`
`random_state`	`int`	Seed for the random number generator to ensure reproducibility.	`123`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds and index of training and validation sets used for backtesting.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`suppress_warnings`	`bool`	If `True`, skforecast warnings will be suppressed during the backtesting process. See skforecast.exceptions.warn_skforecast_categories for more information.	`False`

Returns:

Name	Type	Description
`metrics_levels`	`pandas DataFrame`	Value(s) of the metric(s). The index contains the levels and the columns contain the metrics.
`backtest_predictions`	`pandas DataFrame`	Long-format DataFrame containing the predicted values for each series, with columns `level` (identifier of the time series or level being predicted) and `pred` (predicted values for the corresponding series and time steps). If `interval` is not `None`, additional columns are included depending on its value: for a `float`, or a `list` or `tuple` of 2 elements, columns `lower_bound` and `upper_bound`; for a `list` or `tuple` with multiple percentiles, one column per percentile (e.g., `p_10`, `p_50`, `p_90`); for `'bootstrapping'`, one column per bootstrapping iteration (e.g., `pred_boot_0`, `pred_boot_1`, ..., `pred_boot_n`); for a `scipy.stats` distribution object, one column for each estimated parameter of the distribution (e.g., `loc`, `scale`).
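
The `interval` argument accepts either a nominal coverage or explicit percentiles. The arithmetic relating the two, and the column names listed in the table above, can be sketched as follows. The helper names `coverage_to_percentiles` and `interval_columns` are hypothetical, and this is not skforecast's internal code; the `p_` naming is taken from the examples in the table and is only demonstrated for integer percentiles.

```python
def coverage_to_percentiles(coverage):
    """Convert a nominal coverage in (0, 1) to symmetric [lower, upper] percentiles."""
    if not 0 < coverage < 1:
        raise ValueError("coverage must be between 0 and 1")
    # Half of the excluded probability mass goes in each tail;
    # rounding removes floating-point noise such as 2.5000000000000022
    tail = round((1 - coverage) * 50, 10)
    return [tail, 100 - tail]


def interval_columns(interval):
    """Column names expected for a float coverage or a sequence of percentiles."""
    if isinstance(interval, float) or len(interval) == 2:
        return ["lower_bound", "upper_bound"]
    return [f"p_{p}" for p in interval]


print(coverage_to_percentiles(0.95))   # [2.5, 97.5]
print(interval_columns([10, 50, 90]))  # ['p_10', 'p_50', 'p_90']
```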

References

.. [1] Forecasting: Principles and Practice (3rd ed) Rob J Hyndman and George Athanasopoulos. https://otexts.com/fpp3/prediction-intervals.html

.. [2] MAPIE - Model Agnostic Prediction Interval Estimator. https://mapie.readthedocs.io/en/stable/theoretical_description_regression.html#the-split-method

Source code in skforecast\model_selection\_validation.py

def backtesting_forecaster_multiseries(
    forecaster: object,
    series: pd.DataFrame | dict[str, pd.Series | pd.DataFrame],
    cv: TimeSeriesFold,
    metric: str | Callable | list[str | Callable],
    levels: str | list[str] | None = None,
    add_aggregated_metric: bool = True,
    exog: pd.Series | pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | None = None,
    interval: float | list[float] | tuple[float] | str | object | None = None,
    interval_method: str = 'conformal',
    n_boot: int = 250,
    use_in_sample_residuals: bool = True,
    use_binned_residuals: bool = True,
    random_state: int = 123,
    n_jobs: int | str = 'auto',
    verbose: bool = False,
    show_progress: bool = True,
    suppress_warnings: bool = False
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Backtesting of forecaster model following the folds generated by the TimeSeriesFold
    class and using the metric(s) provided.

    If `forecaster` is already trained and `initial_train_size` is set to `None` in the
    TimeSeriesFold class, no initial train will be done and all data will be used
    to evaluate the model. However, the first `len(forecaster.last_window)` observations
    are needed to create the initial predictors, so no predictions are calculated for
    them.

    A copy of the original forecaster is created so that it is not modified during 
    the process.

    Parameters
    ----------
    forecaster : ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate, ForecasterRnn
        Forecaster model.
    series : pandas DataFrame, dict
        Training time series.
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    levels : str, list, default None
        Time series to be predicted. If `None` all levels will be predicted.
    add_aggregated_metric : bool, default True
        If `True`, and multiple series (`levels`) are predicted, the aggregated
        metrics (average, weighted average and pooled) are also returned.

        - 'average': the average (arithmetic mean) of all levels.
        - 'weighted_average': the average of the metrics weighted by the number of
        predicted values of each level.
        - 'pooling': the values of all levels are pooled and then the metric is
        calculated.
    exog : pandas Series, pandas DataFrame, dict, default None
        Exogenous variables.
    interval : float, list, tuple, str, object, default None
        Specifies whether probabilistic predictions should be estimated and the 
        method to use. The following options are supported:

        - If `float`, represents the nominal (expected) coverage (between 0 and 1). 
        For instance, `interval=0.95` corresponds to `[2.5, 97.5]` percentiles.
        - If `list` or `tuple`: Sequence of percentiles to compute, each value must 
        be between 0 and 100 inclusive. For example, a 95% confidence interval can 
        be specified as `interval = [2.5, 97.5]` or multiple percentiles (e.g. 10, 
        50 and 90) as `interval = [10, 50, 90]`.
        - If 'bootstrapping' (str): `n_boot` bootstrapping predictions will be generated.
        - If scipy.stats distribution object, the distribution parameters will
        be estimated for each prediction.
        - If None, no probabilistic predictions are estimated.
    interval_method : str, default 'conformal'
        Technique used to estimate prediction intervals. Available options:

        - 'bootstrapping': Bootstrapping is used to generate prediction 
        intervals [1]_.
        - 'conformal': Employs the conformal prediction split method for 
        interval estimation [2]_.
    n_boot : int, default 250
        Number of bootstrapping iterations to perform when estimating prediction 
        intervals.
    use_in_sample_residuals : bool, default True
        If `True`, residuals from the training data are used as proxy of
        prediction error to create predictions. 
        If `False`, out of sample residuals (calibration) are used. 
        Out-of-sample residuals must be precomputed using Forecaster's
        `set_out_sample_residuals()` method.
    use_binned_residuals : bool, default True
        If `True`, residuals are selected based on the predicted values 
        (binned selection).
        If `False`, residuals are selected randomly.
    random_state : int, default 123
        Seed for the random number generator to ensure reproducibility.
    n_jobs : int, 'auto', default 'auto'
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default False
        Print number of folds and index of training and validation sets used 
        for backtesting.
    show_progress : bool, default True
        Whether to show a progress bar.
    suppress_warnings : bool, default False
        If `True`, skforecast warnings will be suppressed during the backtesting 
        process. See skforecast.exceptions.warn_skforecast_categories for more
        information.

    Returns
    -------
    metrics_levels : pandas DataFrame
        Value(s) of the metric(s). The index contains the levels and the columns
        contain the metrics.
    backtest_predictions : pandas DataFrame
        Long-format DataFrame containing the predicted values for each series. The 
        DataFrame includes the following columns:

        - `level`: Identifier for the time series or level being predicted.
        - `pred`: Predicted values for the corresponding series and time steps.

        If `interval` is not `None`, additional columns are included depending on the method:

        - For `float`: Columns `lower_bound` and `upper_bound`.
        - For `list` or `tuple` of 2 elements: Columns `lower_bound` and `upper_bound`.
        - For `list` or `tuple` with multiple percentiles: One column per percentile 
        (e.g., `p_10`, `p_50`, `p_90`).
        - For `'bootstrapping'`: One column per bootstrapping iteration 
        (e.g., `pred_boot_0`, `pred_boot_1`, ..., `pred_boot_n`).
        - For `scipy.stats` distribution objects: One column for each estimated 
        parameter of the distribution (e.g., `loc`, `scale`).

    References
    ----------
    .. [1] Forecasting: Principles and Practice (3rd ed) Rob J Hyndman and George Athanasopoulos.
           https://otexts.com/fpp3/prediction-intervals.html

    .. [2] MAPIE - Model Agnostic Prediction Interval Estimator.
           https://mapie.readthedocs.io/en/stable/theoretical_description_regression.html#the-split-method

    """

    multi_series_forecasters = [
        'ForecasterRecursiveMultiSeries', 
        'ForecasterDirectMultiVariate',
        'ForecasterRnn'
    ]

    forecaster_name = type(forecaster).__name__

    if forecaster_name not in multi_series_forecasters:
        raise TypeError(
            f"`forecaster` must be of type {multi_series_forecasters}, "
            f"for all other types of forecasters use the functions available in "
            f"the `model_selection` module. Got {forecaster_name}"
        )

    check_backtesting_input(
        forecaster              = forecaster,
        cv                      = cv,
        metric                  = metric,
        add_aggregated_metric   = add_aggregated_metric,
        series                  = series,
        exog                    = exog,
        interval                = interval,
        interval_method         = interval_method,
        n_boot                  = n_boot,
        use_in_sample_residuals = use_in_sample_residuals,
        use_binned_residuals    = use_binned_residuals,
        random_state            = random_state,
        n_jobs                  = n_jobs,
        show_progress           = show_progress,
        suppress_warnings       = suppress_warnings
    )

    metrics_levels, backtest_predictions = _backtesting_forecaster_multiseries(
        forecaster              = forecaster,
        series                  = series,
        cv                      = cv,
        levels                  = levels,
        metric                  = metric,
        add_aggregated_metric   = add_aggregated_metric,
        exog                    = exog,
        interval                = interval,
        interval_method         = interval_method,
        n_boot                  = n_boot,
        use_in_sample_residuals = use_in_sample_residuals,
        use_binned_residuals    = use_binned_residuals,
        random_state            = random_state,
        n_jobs                  = n_jobs,
        verbose                 = verbose,
        show_progress           = show_progress,
        suppress_warnings       = suppress_warnings
    )

    return metrics_levels, backtest_predictions

skforecast.model_selection._search.grid_search_forecaster_multiseries ¶

grid_search_forecaster_multiseries(
    forecaster,
    series,
    cv,
    param_grid,
    metric,
    aggregate_metric=[
        "weighted_average",
        "average",
        "pooling",
    ],
    levels=None,
    exog=None,
    lags_grid=None,
    return_best=True,
    n_jobs="auto",
    verbose=False,
    show_progress=True,
    suppress_warnings=False,
    output_file=None,
)

Exhaustive search over specified parameter values for a Forecaster object. Validation is done using multi-series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate)`	Forecaster model.	required
`series`	`pandas DataFrame, dict`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds. New in version 0.14.0	required
`param_grid`	`dict`	Dictionary with parameters names (`str`) as keys and lists of parameter settings to try as values.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`aggregate_metric`	`(str, list)`	Aggregation method/s used to combine the metric/s of all levels (series) when multiple levels are predicted. If list, the first aggregation method is used to select the best parameters. 'average': the average (arithmetic mean) of all levels. 'weighted_average': the average of the metrics weighted by the number of predicted values of each level. 'pooling': the values of all levels are pooled and then the metric is calculated.	`['weighted_average', 'average', 'pooling']`
`levels`	`(str, list)`	Level (`str`) or levels (`list`) at which the forecaster is optimized. If `None`, all levels are taken into account.	`None`
`exog`	`pandas Series, pandas DataFrame, dict`	Exogenous variables.	`None`
`lags_grid`	`(list, dict)`	Lists of lags to try, containing int, lists, numpy ndarray, or range objects. If `dict`, the keys are used as labels in the `results` DataFrame, and the values are used as the lists of lags to try.	`None`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`suppress_warnings`	`bool`	If `True`, skforecast warnings will be suppressed during the hyperparameter search. See skforecast.exceptions.warn_skforecast_categories for more information.	`False`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`
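
The three `aggregate_metric` options differ only in how per-level errors are combined. A worked sketch with root mean squared error and two hypothetical levels (`series_a` with four predictions, `series_b` with one) is given below; the data and helper names are made up for illustration and do not come from the library.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error of two equal-length sequences."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


# (y_true, y_pred) per level; values are illustrative only
levels = {
    "series_a": ([10, 10, 10, 10], [11, 11, 11, 11]),  # RMSE = 1.0, 4 predictions
    "series_b": ([100], [106]),                        # RMSE = 6.0, 1 prediction
}

per_level = {name: rmse(t, p) for name, (t, p) in levels.items()}
n_preds = {name: len(t) for name, (t, _) in levels.items()}

# 'average': arithmetic mean of the per-level metrics
average = sum(per_level.values()) / len(per_level)

# 'weighted_average': per-level metrics weighted by the number of predictions
weighted_average = (
    sum(per_level[k] * n_preds[k] for k in per_level) / sum(n_preds.values())
)

# 'pooling': concatenate the values of all levels, then compute the metric once
pooled_true = [v for t, _ in levels.values() for v in t]
pooled_pred = [v for _, p in levels.values() for v in p]
pooling = rmse(pooled_true, pooled_pred)

print(average, weighted_average, round(pooling, 3))  # 3.5 2.0 2.828
```

Because the first entry of a list passed as `aggregate_metric` is the one used to select the best parameters, the ordering matters when the three values rank candidates differently, as they can here.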

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters, with columns `levels` (levels configuration for each iteration), `lags` (lags configuration for each iteration), `lags_label` (descriptive label or alias for the lags), `params` (parameters configuration for each iteration) and `metric` (metric value estimated for each iteration, combined across levels as specified by `aggregate_metric`), plus n additional columns, one per parameter, holding its value.

Source code in skforecast\model_selection\_search.py

def grid_search_forecaster_multiseries(
    forecaster: object,
    series: pd.DataFrame | dict[str, pd.Series | pd.DataFrame],
    cv: TimeSeriesFold | OneStepAheadFold,
    param_grid: dict,
    metric: str | Callable | list[str | Callable],
    aggregate_metric: str | list[str] = ['weighted_average', 'average', 'pooling'],
    levels: str | list[str] | None = None,
    exog: pd.Series | pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | None = None,
    lags_grid: (
        list[int | list[int] | np.ndarray[int] | range[int]]
        | dict[str, list[int | list[int] | np.ndarray[int] | range[int]]]
        | None
    ) = None,
    return_best: bool = True,
    n_jobs: int | str = 'auto',
    verbose: bool = False,
    show_progress: bool = True,
    suppress_warnings: bool = False,
    output_file: str | None = None
) -> pd.DataFrame:
    """
    Exhaustive search over specified parameter values for a Forecaster object.
    Validation is done using multi-series backtesting.

    Parameters
    ----------
    forecaster : ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate
        Forecaster model.
    series : pandas DataFrame, dict
        Training time series.
    cv : TimeSeriesFold, OneStepAheadFold
        TimeSeriesFold or OneStepAheadFold object with the information needed to split
        the data into folds.
        **New in version 0.14.0**
    param_grid : dict
        Dictionary with parameters names (`str`) as keys and lists of parameter
        settings to try as values.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    aggregate_metric : str, list, default `['weighted_average', 'average', 'pooling']`
        Aggregation method/s used to combine the metric/s of all levels (series)
        when multiple levels are predicted. If list, the first aggregation method
        is used to select the best parameters.

        - 'average': the average (arithmetic mean) of all levels.
        - 'weighted_average': the average of the metrics weighted by the number of
        predicted values of each level.
        - 'pooling': the values of all levels are pooled and then the metric is
        calculated.
    levels : str, list, default None
        Level (`str`) or levels (`list`) at which the forecaster is optimized. 
        If `None`, all levels are taken into account.
    exog : pandas Series, pandas DataFrame, dict, default None
        Exogenous variables.
    lags_grid : list, dict, default None
        Lists of lags to try, containing int, lists, numpy ndarray, or range 
        objects. If `dict`, the keys are used as labels in the `results` 
        DataFrame, and the values are used as the lists of lags to try.
    return_best : bool, default True
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default 'auto'
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default False
        Print number of folds used for cv or backtesting.
    show_progress : bool, default True
        Whether to show a progress bar.
    suppress_warnings : bool, default False
        If `True`, skforecast warnings will be suppressed during the hyperparameter 
        search. See skforecast.exceptions.warn_skforecast_categories for more
        information.
    output_file : str, default None
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column levels: levels configuration for each iteration.
        - column lags: lags configuration for each iteration.
        - column lags_label: descriptive label or alias for the lags.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration, combined
        across levels as specified by `aggregate_metric`.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterGrid(param_grid))

    results = _evaluate_grid_hyperparameters_multiseries(
                  forecaster        = forecaster,
                  series            = series,
                  cv                = cv,
                  param_grid        = param_grid,
                  metric            = metric,
                  aggregate_metric  = aggregate_metric,
                  levels            = levels,
                  exog              = exog,
                  lags_grid         = lags_grid,
                  n_jobs            = n_jobs,
                  return_best       = return_best,
                  verbose           = verbose,
                  show_progress     = show_progress,
                  suppress_warnings = suppress_warnings,
                  output_file       = output_file
              )

    return results
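
The source above expands `param_grid` with `ParameterGrid` before evaluating the candidates. The expansion is a Cartesian product of the value lists. A standard-library sketch of the same idea is shown below; the helper `expand_grid` and the hyperparameter names are hypothetical, and this is not the scikit-learn implementation.

```python
from itertools import product


def expand_grid(param_grid):
    """Return every combination of the value lists as a list of dicts."""
    keys = sorted(param_grid)  # fixed key order gives a deterministic result
    value_lists = [param_grid[k] for k in keys]
    return [dict(zip(keys, values)) for values in product(*value_lists)]


# Illustrative grid: 2 x 3 = 6 candidate configurations
param_grid = {"n_estimators": [50, 100], "max_depth": [3, 5, 7]}
combinations = expand_grid(param_grid)

print(len(combinations))  # 6
print(combinations[0])    # {'max_depth': 3, 'n_estimators': 50}
```

The grid size grows multiplicatively with each added parameter, which is worth checking before launching a search with backtesting as the validation method.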

skforecast.model_selection._search.random_search_forecaster_multiseries ¶

random_search_forecaster_multiseries(
    forecaster,
    series,
    cv,
    param_distributions,
    metric,
    aggregate_metric=[
        "weighted_average",
        "average",
        "pooling",
    ],
    levels=None,
    exog=None,
    lags_grid=None,
    n_iter=10,
    random_state=123,
    return_best=True,
    n_jobs="auto",
    verbose=False,
    show_progress=True,
    suppress_warnings=False,
    output_file=None,
)

Random search over specified parameter values or distributions for a Forecaster object. Validation is done using multi-series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate)`	Forecaster model.	required
`series`	`pandas DataFrame, dict`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds.	required
`param_distributions`	`dict`	Dictionary with parameters names (`str`) as keys and distributions or lists of parameters to try.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`aggregate_metric`	`(str, list)`	Aggregation method/s used to combine the metric/s of all levels (series) when multiple levels are predicted. If list, the first aggregation method is used to select the best parameters. 'average': the average (arithmetic mean) of all levels. 'weighted_average': the average of the metrics weighted by the number of predicted values of each level. 'pooling': the values of all levels are pooled and then the metric is calculated.	`['weighted_average', 'average', 'pooling']`
`levels`	`(str, list)`	Level (`str`) or levels (`list`) at which the forecaster is optimized. If `None`, all levels are taken into account.	`None`
`exog`	`pandas Series, pandas DataFrame, dict`	Exogenous variables.	`None`
`lags_grid`	`(list, dict)`	Lists of lags to try, containing int, lists, numpy ndarray, or range objects. If `dict`, the keys are used as labels in the `results` DataFrame, and the values are used as the lists of lags to try.	`None`
`n_iter`	`int`	Number of parameter settings that are sampled per lags configuration. n_iter trades off runtime vs quality of the solution.	`10`
`random_state`	`int`	Sets a seed to the random sampling for reproducible output.	`123`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`suppress_warnings`	`bool`	If `True`, skforecast warnings will be suppressed during the hyperparameter search. See skforecast.exceptions.warn_skforecast_categories for more information.	`False`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. column levels: levels configuration for each iteration. column lags: lags configuration for each iteration. column lags_label: descriptive label or alias for the lags. column params: parameters configuration for each iteration. column metric: metric value estimated for each iteration. The resulting metric will be the average of the optimization of all levels. additional n columns with param = value.
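
The three `aggregate_metric` options can give different values when the series have different numbers of predictions. The following pure-Python sketch illustrates the definitions in the table above using mean absolute error. The series names, the values, and the helper `mae` are made up for illustration, and this is not skforecast's internal implementation.

```python
# Hypothetical per-series backtest output: (y_true, y_pred) for each level.
predictions = {
    "series_a": ([10.0, 12.0, 14.0, 16.0], [11.0, 12.0, 13.0, 18.0]),
    "series_b": ([100.0, 110.0], [90.0, 115.0]),
}


def mae(y_true, y_pred):
    """Mean absolute error of two equally sized sequences."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


per_level = {name: mae(t, p) for name, (t, p) in predictions.items()}
n_preds = {name: len(t) for name, (t, _) in predictions.items()}

# 'average': arithmetic mean of the per-level metrics.
average = sum(per_level.values()) / len(per_level)

# 'weighted_average': per-level metrics weighted by the number of predictions.
weighted_average = (
    sum(per_level[k] * n_preds[k] for k in per_level) / sum(n_preds.values())
)

# 'pooling': pool the values of all levels first, then compute the metric once.
pooled_true = [v for t, _ in predictions.values() for v in t]
pooled_pred = [v for _, p in predictions.values() for v in p]
pooling = mae(pooled_true, pooled_pred)
```

For a metric such as MAE, which is a plain mean of per-point errors, 'weighted_average' and 'pooling' coincide. For metrics that apply a non-linear step after averaging, such as a root mean squared error, they generally differ.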

Source code in skforecast\model_selection\_search.py

def random_search_forecaster_multiseries(
    forecaster: object,
    series: pd.DataFrame | dict[str, pd.Series | pd.DataFrame],
    cv: TimeSeriesFold | OneStepAheadFold,
    param_distributions: dict,
    metric: str | Callable | list[str | Callable],
    aggregate_metric: str | list[str] = ['weighted_average', 'average', 'pooling'],
    levels: str | list[str] | None = None,
    exog: pd.Series | pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | None = None,
    lags_grid: (
        list[int | list[int] | np.ndarray[int] | range[int]]
        | dict[str, list[int | list[int] | np.ndarray[int] | range[int]]]
        | None
    ) = None,
    n_iter: int = 10,
    random_state: int = 123,
    return_best: bool = True,
    n_jobs: int | str = 'auto',
    verbose: bool = False,
    show_progress: bool = True,
    suppress_warnings: bool = False,
    output_file: str | None = None
) -> pd.DataFrame:
    """
    Random search over specified parameter values or distributions for a Forecaster 
    object. Validation is done using multi-series backtesting.

    Parameters
    ----------
    forecaster : ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate
        Forecaster model.
    series : pandas DataFrame, dict
        Training time series.
    cv : TimeSeriesFold, OneStepAheadFold
        TimeSeriesFold or OneStepAheadFold object with the information needed to split
        the data into folds.
    param_distributions : dict
        Dictionary with parameters names (`str`) as keys and distributions or 
        lists of parameters to try.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    aggregate_metric : str, list, default `['weighted_average', 'average', 'pooling']`
        Aggregation method/s used to combine the metric/s of all levels (series)
        when multiple levels are predicted. If list, the first aggregation method
        is used to select the best parameters.

        - 'average': the average (arithmetic mean) of all levels.
        - 'weighted_average': the average of the metrics weighted by the number of
        predicted values of each level.
        - 'pooling': the values of all levels are pooled and then the metric is
        calculated.
    levels : str, list, default None
        Level (`str`) or levels (`list`) at which the forecaster is optimized.
        If `None`, all levels are taken into account.
    exog : pandas Series, pandas DataFrame, dict, default None
        Exogenous variables.
    lags_grid : list, dict, default None
        Lists of lags to try, containing int, lists, numpy ndarray, or range 
        objects. If `dict`, the keys are used as labels in the `results` 
        DataFrame, and the values are used as the lists of lags to try.
    n_iter : int, default 10
        Number of parameter settings that are sampled per lags configuration. 
        n_iter trades off runtime vs quality of the solution.
    random_state : int, default 123
        Sets a seed to the random sampling for reproducible output.
    return_best : bool, default True
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default 'auto'
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default False
        Print number of folds used for cv or backtesting.
    show_progress : bool, default True
        Whether to show a progress bar.
    suppress_warnings : bool, default False
        If `True`, skforecast warnings will be suppressed during the hyperparameter 
        search. See skforecast.exceptions.warn_skforecast_categories for more
        information.
    output_file : str, default None
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column levels: levels configuration for each iteration.
        - column lags: lags configuration for each iteration.
        - column lags_label: descriptive label or alias for the lags.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration. The resulting 
        metric will be the average of the optimization of all levels.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterSampler(param_distributions, n_iter=n_iter, 
                                       random_state=random_state))

    results = _evaluate_grid_hyperparameters_multiseries(
                  forecaster        = forecaster,
                  series            = series,
                  cv                = cv,
                  param_grid        = param_grid,
                  metric            = metric,
                  aggregate_metric  = aggregate_metric,
                  levels            = levels,
                  exog              = exog,
                  lags_grid         = lags_grid,
                  return_best       = return_best,
                  n_jobs            = n_jobs,
                  verbose           = verbose,
                  show_progress     = show_progress,
                  suppress_warnings = suppress_warnings,
                  output_file       = output_file
              )

    return results
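
The function body first draws `n_iter` parameter combinations with scikit-learn's `ParameterSampler` and then evaluates each of them for every lags configuration. The sketch below is a simplified, pure-Python stand-in for that sampling step, not the library code. It handles only lists of candidate values (not distributions), the helper `sample_param_grid` and the parameter names are hypothetical, and the combinations it draws will not match those of `ParameterSampler` for the same seed.

```python
import itertools
import random


def sample_param_grid(param_distributions, n_iter, random_state):
    """Draw up to n_iter distinct combinations from lists of candidate values."""
    keys = sorted(param_distributions)
    full_grid = [
        dict(zip(keys, values))
        for values in itertools.product(*(param_distributions[k] for k in keys))
    ]
    # A seeded generator makes the draw reproducible, like `random_state`.
    rng = random.Random(random_state)
    return rng.sample(full_grid, k=min(n_iter, len(full_grid)))


# Hypothetical candidate values for a tree-based regressor.
param_distributions = {
    "n_estimators": [50, 100, 200],
    "max_depth": [3, 5, 10],
}
param_grid = sample_param_grid(param_distributions, n_iter=4, random_state=123)
```

Because `n_iter` applies per lags configuration, a `lags_grid` with three entries and `n_iter=4` would lead to twelve evaluated configurations.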

skforecast.model_selection._search.bayesian_search_forecaster_multiseries ¶

bayesian_search_forecaster_multiseries(
    forecaster,
    series,
    cv,
    search_space,
    metric,
    aggregate_metric=[
        "weighted_average",
        "average",
        "pooling",
    ],
    levels=None,
    exog=None,
    n_trials=10,
    random_state=123,
    return_best=True,
    n_jobs="auto",
    verbose=False,
    show_progress=True,
    suppress_warnings=False,
    output_file=None,
    kwargs_create_study={},
    kwargs_study_optimize={},
)

Bayesian search for hyperparameters of a Forecaster object using optuna library.

Parameters:

Name	Type	Description	Default
`forecaster`	`(ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate)`	Forecaster model.	required
`series`	`pandas DataFrame, dict`	Training time series.	required
`cv`	`(TimeSeriesFold, OneStepAheadFold)`	TimeSeriesFold or OneStepAheadFold object with the information needed to split the data into folds.	required
`search_space`	`Callable`	Function with argument `trial` which returns a dictionary with parameters names (`str`) as keys and Trial object from optuna (trial.suggest_float, trial.suggest_int, trial.suggest_categorical) as values.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`aggregate_metric`	`(str, list)`	Aggregation method/s used to combine the metric/s of all levels (series) when multiple levels are predicted. If list, the first aggregation method is used to select the best parameters. 'average': the average (arithmetic mean) of all levels. 'weighted_average': the average of the metrics weighted by the number of predicted values of each level. 'pooling': the values of all levels are pooled and then the metric is calculated.	`['weighted_average', 'average', 'pooling']`
`levels`	`(str, list)`	Level (`str`) or levels (`list`) at which the forecaster is optimized. If `None`, all levels are taken into account.	`None`
`exog`	`pandas Series, pandas DataFrame, dict`	Exogenous variables.	`None`
`n_trials`	`int`	Number of parameter settings that are sampled in each lag configuration.	`10`
`random_state`	`int`	Sets a seed to the sampling for reproducible output.	`123`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`suppress_warnings`	`bool`	If `True`, skforecast warnings will be suppressed during the hyperparameter search. See skforecast.exceptions.warn_skforecast_categories for more information.	`False`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`
`kwargs_create_study`	`dict`	Additional keyword arguments (key, value mappings) to pass to optuna.create_study(). If default, the direction is set to 'minimize' and a TPESampler(seed=123) sampler is used during optimization.	`{}`
`kwargs_study_optimize`	`dict`	Additional keyword arguments (key, value mappings) to pass to study.optimize().	`{}`

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. column levels: levels configuration for each iteration. column lags: lags configuration for each iteration. column params: parameters configuration for each iteration. column metric: metric value estimated for each iteration. The resulting metric will be the average of the optimization of all levels. additional n columns with param = value.
`best_trial`	`optuna object`	The best optimization result returned as a FrozenTrial optuna object.
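
As a rough illustration of the `search_space` argument, the sketch below defines a function that takes a `trial` and returns a dictionary of suggested values. The hyperparameter names and ranges are assumptions chosen for the example, and `StubTrial` is a hypothetical stand-in used only so the snippet runs without optuna. In a real search, optuna supplies the trial object.

```python
def search_space(trial):
    """Map hyperparameter names to values suggested by an optuna trial."""
    return {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "learning_rate": trial.suggest_float("learning_rate", 0.01, 0.3),
        "max_features": trial.suggest_categorical(
            "max_features", ["sqrt", "log2"]
        ),
    }


class StubTrial:
    """Hypothetical stand-in for an optuna trial; returns the first option."""

    def suggest_int(self, name, low, high):
        return low

    def suggest_float(self, name, low, high):
        return low

    def suggest_categorical(self, name, choices):
        return choices[0]


params = search_space(StubTrial())
```

The function returns two objects, so its result is usually unpacked as `results, best_trial = bayesian_search_forecaster_multiseries(...)`.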

Source code in skforecast\model_selection\_search.py

def bayesian_search_forecaster_multiseries(
    forecaster: object,
    series: pd.DataFrame | dict[str, pd.Series | pd.DataFrame],
    cv: TimeSeriesFold | OneStepAheadFold,
    search_space: Callable,
    metric: str | Callable | list[str | Callable],
    aggregate_metric: str | list[str] = ['weighted_average', 'average', 'pooling'],
    levels: str | list[str] | None = None,
    exog: pd.Series | pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | None = None,
    n_trials: int = 10,
    random_state: int = 123,
    return_best: bool = True,
    n_jobs: int | str = 'auto',
    verbose: bool = False,
    show_progress: bool = True,
    suppress_warnings: bool = False,
    output_file: str | None = None,
    kwargs_create_study: dict = {},
    kwargs_study_optimize: dict = {}
) -> tuple[pd.DataFrame, object]:
    """
    Bayesian search for hyperparameters of a Forecaster object using optuna library.

    Parameters
    ----------
    forecaster : ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate
        Forecaster model.
    series : pandas DataFrame, dict
        Training time series.
    cv : TimeSeriesFold, OneStepAheadFold
        TimeSeriesFold or OneStepAheadFold object with the information needed to split
        the data into folds.
    search_space : Callable
        Function with argument `trial` which returns a dictionary with parameters names 
        (`str`) as keys and Trial object from optuna (trial.suggest_float, 
        trial.suggest_int, trial.suggest_categorical) as values.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    aggregate_metric : str, list, default `['weighted_average', 'average', 'pooling']`
        Aggregation method/s used to combine the metric/s of all levels (series)
        when multiple levels are predicted. If list, the first aggregation method
        is used to select the best parameters.

        - 'average': the average (arithmetic mean) of all levels.
        - 'weighted_average': the average of the metrics weighted by the number of
        predicted values of each level.
        - 'pooling': the values of all levels are pooled and then the metric is
        calculated.
    levels : str, list, default None
        Level (`str`) or levels (`list`) at which the forecaster is optimized.
        If `None`, all levels are taken into account.
    exog : pandas Series, pandas DataFrame, dict, default None
        Exogenous variables.
    n_trials : int, default 10
        Number of parameter settings that are sampled in each lag configuration.
    random_state : int, default 123
        Sets a seed to the sampling for reproducible output.
    return_best : bool, default True
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default 'auto'
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default False
        Print number of folds used for cv or backtesting.
    show_progress : bool, default True
        Whether to show a progress bar.
    suppress_warnings : bool, default False
        If `True`, skforecast warnings will be suppressed during the hyperparameter
        search. See skforecast.exceptions.warn_skforecast_categories for more
        information.
    output_file : str, default None
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**
    kwargs_create_study : dict, default {}
        Additional keyword arguments (key, value mappings) to pass to optuna.create_study().
        If default, the direction is set to 'minimize' and a TPESampler(seed=123) 
        sampler is used during optimization.
    kwargs_study_optimize : dict, default {}
        Additional keyword arguments (key, value mappings) to pass to study.optimize().

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column levels: levels configuration for each iteration.
        - column lags: lags configuration for each iteration.
        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration. The resulting 
        metric will be the average of the optimization of all levels.
        - additional n columns with param = value.
    best_trial : optuna object
        The best optimization result returned as a FrozenTrial optuna object.

    """

    if return_best and exog is not None and (len(exog) != len(series)):
        raise ValueError(
            f"`exog` must have same number of samples as `series`. "
            f"length `exog`: ({len(exog)}), length `series`: ({len(series)})"
        )

    results, best_trial = _bayesian_search_optuna_multiseries(
                              forecaster            = forecaster,
                              series                = series,
                              cv                    = cv,
                              exog                  = exog,
                              levels                = levels, 
                              search_space          = search_space,
                              metric                = metric,
                              aggregate_metric      = aggregate_metric,
                              n_trials              = n_trials,
                              random_state          = random_state,
                              return_best           = return_best,
                              n_jobs                = n_jobs,
                              verbose               = verbose,
                              show_progress         = show_progress,
                              suppress_warnings     = suppress_warnings,
                              output_file           = output_file,
                              kwargs_create_study   = kwargs_create_study,
                              kwargs_study_optimize = kwargs_study_optimize
                          )

    return results, best_trial

skforecast.model_selection._validation.backtesting_sarimax ¶

backtesting_sarimax(
    forecaster,
    y,
    cv,
    metric,
    exog=None,
    alpha=None,
    interval=None,
    n_jobs="auto",
    verbose=False,
    suppress_warnings_fit=False,
    show_progress=True,
)

Backtesting of ForecasterSarimax.

A copy of the original forecaster is created so that it is not modified during the process.

Parameters:

Name	Type	Description	Default
`forecaster`	`ForecasterSarimax`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds. New in version 0.14.0	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i].	`None`
`alpha`	`float`	The confidence level of the forecast intervals is 100 * (1 - alpha) %. If both `alpha` and `interval` are provided, `alpha` is used.	`None`
`interval`	`(list, tuple)`	Percentiles that define the prediction interval, given as a sequence of values between 0 and 100 inclusive. The values must be symmetric. For example, a 95% interval is specified as `interval = [2.5, 97.5]`. If both `alpha` and `interval` are provided, `alpha` is used.	`None`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds and index of training and validation sets used for backtesting.	`False`
`suppress_warnings_fit`	`bool`	If `True`, warnings generated during fitting will be ignored.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`

Returns:

Name	Type	Description
`metric_values`	`pandas DataFrame`	Value(s) of the metric(s).
`backtest_predictions`	`pandas DataFrame`	Predicted values and their estimated interval if `interval` is not `None`. column pred: predictions. column lower_bound: lower bound of the interval. column upper_bound: upper bound of the interval.
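
`alpha` and `interval` are two ways of expressing the same interval width. The sketch below shows the arithmetic that relates them, under the assumption that "symmetric" means the two percentiles sum to 100. The helper `interval_to_alpha` is hypothetical and is not part of skforecast.

```python
import math


def interval_to_alpha(interval):
    """Convert symmetric percentiles, e.g. [2.5, 97.5], to the matching alpha."""
    lower, upper = interval
    if not (0 <= lower < upper <= 100):
        raise ValueError("Percentiles must satisfy 0 <= lower < upper <= 100.")
    # Assumed reading of "symmetric": the percentiles mirror each other around 50.
    if not math.isclose(lower + upper, 100.0):
        raise ValueError("Percentiles must be symmetric around 50.")
    # The interval covers (upper - lower) % of the distribution.
    return 1 - (upper - lower) / 100


alpha = interval_to_alpha([2.5, 97.5])  # approximately 0.05
```

Under this reading, `interval=[2.5, 97.5]` and `alpha=0.05` both request a 95% interval, and `interval=(5, 95)` corresponds to `alpha=0.10`.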

Source code in skforecast\model_selection\_validation.py

def backtesting_sarimax(
    forecaster: object,
    y: pd.Series,
    cv: TimeSeriesFold,
    metric: str | Callable | list[str | Callable],
    exog: pd.Series | pd.DataFrame | None = None,
    alpha: float | None = None,
    interval: list[float] | tuple[float] | None = None,
    n_jobs: int | str = 'auto',
    verbose: bool = False,
    suppress_warnings_fit: bool = False,
    show_progress: bool = True
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """
    Backtesting of ForecasterSarimax.

    A copy of the original forecaster is created so that it is not modified during 
    the process.

    Parameters
    ----------
    forecaster : ForecasterSarimax
        Forecaster model.
    y : pandas Series
        Training time series.
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
        **New in version 0.14.0**
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default None
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    alpha : float, default None
        The confidence level of the forecast intervals is 100 * (1 - alpha) %.
        If both `alpha` and `interval` are provided, `alpha` is used.
    interval : list, tuple, default None
        Percentiles that define the prediction interval, given as a sequence
        of values between 0 and 100 inclusive. The values must be symmetric.
        For example, a 95% interval is specified as `interval = [2.5, 97.5]`.
        If both `alpha` and `interval` are provided, `alpha` is used.
    n_jobs : int, 'auto', default 'auto'
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting. 
    verbose : bool, default False
        Print number of folds and index of training and validation sets used 
        for backtesting.
    suppress_warnings_fit : bool, default False
        If `True`, warnings generated during fitting will be ignored.
    show_progress : bool, default True
        Whether to show a progress bar.

    Returns
    -------
    metric_values : pandas DataFrame
        Value(s) of the metric(s).
    backtest_predictions : pandas DataFrame
        Predicted values and their estimated interval if `interval` is not `None`.

        - column pred: predictions.
        - column lower_bound: lower bound of the interval.
        - column upper_bound: upper bound of the interval.

    """

    if type(forecaster).__name__ not in ['ForecasterSarimax']:
        raise TypeError(
            "`forecaster` must be of type `ForecasterSarimax`, for all other "
            "types of forecasters use the functions available in the other "
            "`model_selection` modules."
        )

    check_backtesting_input(
        forecaster            = forecaster,
        cv                    = cv,
        y                     = y,
        metric                = metric,
        interval              = interval,
        alpha                 = alpha,
        n_jobs                = n_jobs,
        show_progress         = show_progress,
        suppress_warnings_fit = suppress_warnings_fit
    )

    metric_values, backtest_predictions = _backtesting_sarimax(
        forecaster            = forecaster,
        y                     = y,
        cv                    = cv,
        metric                = metric,
        exog                  = exog,
        alpha                 = alpha,
        interval              = interval,
        n_jobs                = n_jobs,
        verbose               = verbose,
        suppress_warnings_fit = suppress_warnings_fit,
        show_progress         = show_progress
    )

    return metric_values, backtest_predictions
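
Besides the built-in metric names, `metric` accepts a callable with arguments `y_true`, `y_pred` and, optionally, `y_train` that returns a float. The following is a minimal sketch of such a callable, a scaled error that divides the forecast MAE by the in-sample MAE of a naive one-step forecast. The function name and the toy values are made up for illustration, and the sketch assumes the inputs behave like plain sequences of numbers.

```python
def mean_absolute_scaled_error_sketch(y_true, y_pred, y_train):
    """Forecast MAE divided by the in-sample MAE of a naive one-step forecast."""
    forecast_mae = sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)
    # Naive forecast: each training value is predicted by the previous one.
    naive_errors = [abs(b - a) for a, b in zip(y_train[:-1], y_train[1:])]
    naive_mae = sum(naive_errors) / len(naive_errors)
    return forecast_mae / naive_mae


score = mean_absolute_scaled_error_sketch(
    y_true=[10.0, 12.0],
    y_pred=[11.0, 11.0],
    y_train=[1.0, 3.0, 4.0, 8.0],
)
```

A callable of this shape could then be passed as `metric=mean_absolute_scaled_error_sketch`, alone or inside a list together with metric names.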

skforecast.model_selection._search.grid_search_sarimax ¶

grid_search_sarimax(
    forecaster,
    y,
    cv,
    param_grid,
    metric,
    exog=None,
    return_best=True,
    n_jobs="auto",
    verbose=False,
    suppress_warnings_fit=False,
    show_progress=True,
    output_file=None,
)

Exhaustive search over specified parameter values for a ForecasterSarimax object. Validation is done using time series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`ForecasterSarimax`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds. New in version 0.14.0	required
`param_grid`	`dict`	Dictionary with parameters names (`str`) as keys and lists of parameter settings to try as values.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i].	`None`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`False`
`suppress_warnings_fit`	`bool`	If `True`, warnings generated during fitting will be ignored.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. column params: parameters configuration for each iteration. column metric: metric value estimated for each iteration. additional n columns with param = value.
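
Exhaustive search means that every combination of the values in `param_grid` is evaluated, so the number of backtests grows multiplicatively with each added parameter. The sketch below expands a grid with `itertools.product` as a pure-Python stand-in for scikit-learn's `ParameterGrid`, which the function uses internally. The helper `expand_grid` is hypothetical and the parameter names and values are illustrative assumptions.

```python
import itertools


def expand_grid(param_grid):
    """Return one dict per combination of the candidate values."""
    keys = sorted(param_grid)
    return [
        dict(zip(keys, combo))
        for combo in itertools.product(*(param_grid[k] for k in keys))
    ]


# Hypothetical SARIMAX-style grid: 2 x 2 x 2 = 8 candidate configurations.
param_grid = {
    "order": [(1, 0, 0), (1, 1, 1)],
    "seasonal_order": [(0, 0, 0, 0), (1, 1, 1, 12)],
    "trend": [None, "c"],
}
candidates = expand_grid(param_grid)
```

Each element of `candidates` corresponds to one row of the `results` DataFrame, with the same information stored in the `params` column and in the additional per-parameter columns.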

Source code in skforecast\model_selection\_search.py

def grid_search_sarimax(
    forecaster: object,
    y: pd.Series,
    cv: TimeSeriesFold,
    param_grid: dict,
    metric: str | Callable | list[str | Callable],
    exog: pd.Series | pd.DataFrame | None = None,
    return_best: bool = True,
    n_jobs: int | str = 'auto',
    verbose: bool = False,
    suppress_warnings_fit: bool = False,
    show_progress: bool = True,
    output_file: str | None = None
) -> pd.DataFrame:
    """
    Exhaustive search over specified parameter values for a ForecasterSarimax object.
    Validation is done using time series backtesting.

    Parameters
    ----------
    forecaster : ForecasterSarimax
        Forecaster model.
    y : pandas Series
        Training time series. 
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
        **New in version 0.14.0**
    param_grid : dict
        Dictionary with parameters names (`str`) as keys and lists of parameter
        settings to try as values.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default None
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    return_best : bool, default True
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default 'auto'
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default False
        Print number of folds used for cv or backtesting.
    suppress_warnings_fit : bool, default False
        If `True`, warnings generated during fitting will be ignored.
    show_progress : bool, default True
        Whether to show a progress bar.
    output_file : str, default None
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterGrid(param_grid))

    results = _evaluate_grid_hyperparameters_sarimax(
        forecaster            = forecaster,
        y                     = y,
        cv                    = cv,
        param_grid            = param_grid,
        metric                = metric,
        exog                  = exog,
        return_best           = return_best,
        n_jobs                = n_jobs,
        verbose               = verbose,
        suppress_warnings_fit = suppress_warnings_fit,
        show_progress         = show_progress,
        output_file           = output_file
    )

    return results

skforecast.model_selection._search.random_search_sarimax ¶

random_search_sarimax(
    forecaster,
    y,
    cv,
    param_distributions,
    metric,
    exog=None,
    n_iter=10,
    random_state=123,
    return_best=True,
    n_jobs="auto",
    verbose=False,
    suppress_warnings_fit=False,
    show_progress=True,
    output_file=None,
)

Random search over specified parameter values or distributions for a Forecaster object. Validation is done using time series backtesting.

Parameters:

Name	Type	Description	Default
`forecaster`	`ForecasterSarimax`	Forecaster model.	required
`y`	`pandas Series`	Training time series.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds. New in version 0.14.0	required
`param_distributions`	`dict`	Dictionary with parameters names (`str`) as keys and distributions or lists of parameters to try.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model. If `string`: {'mean_squared_error', 'mean_absolute_error', 'mean_absolute_percentage_error', 'mean_squared_log_error', 'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'} If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train` (Optional) that returns a float. If `list`: List containing multiple strings and/or Callables.	required
`exog`	`pandas Series, pandas DataFrame`	Exogenous variable/s included as predictor/s. Must have the same number of observations as `y` and should be aligned so that y[i] is regressed on exog[i].	`None`
`n_iter`	`int`	Number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution.	`10`
`random_state`	`int`	Sets a seed to the random sampling for reproducible output.	`123`
`return_best`	`bool`	Refit the `forecaster` using the best found parameters on the whole data.	`True`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_backtesting.	`'auto'`
`verbose`	`bool`	Print number of folds used for cv or backtesting.	`False`
`suppress_warnings_fit`	`bool`	If `True`, warnings generated during fitting will be ignored.	`False`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`output_file`	`str`	Specifies the filename or full path where the results should be saved. The results will be saved in a tab-separated values (TSV) format. If `None`, the results will not be saved to a file. New in version 0.12.0	`None`
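
To illustrate the `Callable` form of `metric`, the sketch below defines a function that takes `y_true`, `y_pred` and an optional `y_train` and returns a float. The name `custom_mase` and the use of plain Python sequences are assumptions made for this example; in practice the arguments would typically be array-like.

```python
from statistics import fmean


def custom_mase(y_true, y_pred, y_train=None):
    """Mean absolute error, scaled by the in-sample naive error when y_train is given."""
    mae = fmean(abs(t - p) for t, p in zip(y_true, y_pred))
    if y_train is None:
        return float(mae)
    # Naive one-step forecast error on the training data: |y[t] - y[t-1]|
    naive_errors = [abs(b - a) for a, b in zip(y_train[:-1], y_train[1:])]
    return float(mae / fmean(naive_errors))


print(custom_mase([3, 5], [2, 7]))                     # plain MAE
print(custom_mase([3, 5], [2, 7], y_train=[1, 2, 4]))  # scaled by the naive in-sample error
```

A function of this shape could be passed on its own or inside a list together with the built-in metric names.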

Returns:

Name	Type	Description
`results`	`pandas DataFrame`	Results for each combination of parameters. Column `params`: parameter configuration for each iteration. Column `metric`: metric value estimated for each iteration. Additional n columns with param = value.
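
The source of this function builds its candidate settings with scikit-learn's `ParameterSampler`. As a rough, stdlib-only stand-in for that step, the sketch below draws up to `n_iter` distinct combinations from lists of values. The helper `sample_settings` is hypothetical, the parameter names are illustrative, and only lists are handled (not continuous distributions).

```python
import itertools
import random


def sample_settings(param_distributions, n_iter=10, random_state=123):
    """Draw up to n_iter distinct parameter combinations from lists of values."""
    names = sorted(param_distributions)
    grid = [
        dict(zip(names, values))
        for values in itertools.product(*(param_distributions[n] for n in names))
    ]
    rng = random.Random(random_state)  # seeded so repeated calls give the same draw
    return rng.sample(grid, k=min(n_iter, len(grid)))


candidates = sample_settings(
    {"order": [(1, 0, 0), (1, 1, 1), (2, 1, 1)], "trend": [None, "c", "t"]},
    n_iter=4,
)
for params in candidates:
    print(params)
```

Each sampled dictionary corresponds to one row of the results, which is why a larger `n_iter` costs more backtesting runs.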

Source code in skforecast\model_selection\_search.py

def random_search_sarimax(
    forecaster: object,
    y: pd.Series,
    cv: TimeSeriesFold,
    param_distributions: dict,
    metric: str | Callable | list[str | Callable],
    exog: pd.Series | pd.DataFrame | None = None,
    n_iter: int = 10,
    random_state: int = 123,
    return_best: bool = True,
    n_jobs: int | str = 'auto',
    verbose: bool = False,
    suppress_warnings_fit: bool = False,
    show_progress: bool = True,
    output_file: str | None = None
) -> pd.DataFrame:
    """
    Random search over specified parameter values or distributions for a Forecaster 
    object. Validation is done using time series backtesting.

    Parameters
    ----------
    forecaster : ForecasterSarimax
        Forecaster model.
    y : pandas Series
        Training time series. 
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
        **New in version 0.14.0**
    param_distributions : dict
        Dictionary with parameters names (`str`) as keys and 
        distributions or lists of parameters to try.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.

        - If `string`: {'mean_squared_error', 'mean_absolute_error',
        'mean_absolute_percentage_error', 'mean_squared_log_error',
        'mean_absolute_scaled_error', 'root_mean_squared_scaled_error'}
        - If `Callable`: Function with arguments `y_true`, `y_pred` and `y_train`
        (Optional) that returns a float.
        - If `list`: List containing multiple strings and/or Callables.
    exog : pandas Series, pandas DataFrame, default None
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    n_iter : int, default 10
        Number of parameter settings that are sampled. 
        n_iter trades off runtime vs quality of the solution.
    random_state : int, default 123
        Sets a seed to the random sampling for reproducible output.
    return_best : bool, default True
        Refit the `forecaster` using the best found parameters on the whole data.
    n_jobs : int, 'auto', default 'auto'
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.utils.select_n_jobs_backtesting.
    verbose : bool, default False
        Print number of folds used for cv or backtesting.
    suppress_warnings_fit : bool, default False
        If `True`, warnings generated during fitting will be ignored.
    show_progress : bool, default True
        Whether to show a progress bar.
    output_file : str, default None
        Specifies the filename or full path where the results should be saved. 
        The results will be saved in a tab-separated values (TSV) format. If 
        `None`, the results will not be saved to a file.
        **New in version 0.12.0**

    Returns
    -------
    results : pandas DataFrame
        Results for each combination of parameters.

        - column params: parameters configuration for each iteration.
        - column metric: metric value estimated for each iteration.
        - additional n columns with param = value.

    """

    param_grid = list(ParameterSampler(param_distributions, n_iter=n_iter, random_state=random_state))

    results = _evaluate_grid_hyperparameters_sarimax(
        forecaster            = forecaster,
        y                     = y,
        cv                    = cv,
        param_grid            = param_grid,
        metric                = metric,
        exog                  = exog,
        return_best           = return_best,
        n_jobs                = n_jobs,
        verbose               = verbose,
        suppress_warnings_fit = suppress_warnings_fit,
        show_progress         = show_progress,
        output_file           = output_file
    )

    return results

skforecast.model_selection._split.BaseFold ¶

BaseFold(
    steps=None,
    initial_train_size=None,
    window_size=None,
    differentiation=None,
    refit=False,
    fixed_train_size=True,
    gap=0,
    skip_folds=None,
    allow_incomplete_fold=True,
    return_all_indexes=False,
    verbose=True,
)

Base class for all Fold classes in skforecast. All fold classes should specify all the parameters that can be set at the class level in their __init__.

Parameters:

Name	Type	Description	Default
`steps`	`int`	Number of observations to predict in each fold. This is also commonly referred to as the forecast horizon or test size.	`None`
`initial_train_size`	`int, str, pandas Timestamp`	Number of observations used for initial training. If an integer, the number of observations used for initial training. If a date string or pandas Timestamp, it is the last date included in the initial training set.	`None`
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.	`None`
`differentiation`	`int`	Number of observations to use for differentiation. This is used to extend the `last_window` by as many observations as the differentiation order.	`None`
`refit`	`(bool, int)`	Whether to refit the forecaster in each fold. If `True`, the forecaster is refitted in each fold. If `False`, the forecaster is trained only in the first fold. If an integer, the forecaster is trained in the first fold and then refitted every `refit` folds.	`False`
`fixed_train_size`	`bool`	Whether the training size is fixed or increases in each fold.	`True`
`gap`	`int`	Number of observations between the end of the training set and the start of the test set.	`0`
`skip_folds`	`(int, list)`	Number of folds to skip. If an integer, every `skip_folds`-th fold is returned. If a list, the indexes of the folds to skip. For example, if `skip_folds=3` and there are 10 folds, the returned folds are 0, 3, 6, and 9. If `skip_folds=[1, 2, 3]`, the returned folds are 0, 4, 5, 6, 7, 8, and 9.	`None`
`allow_incomplete_fold`	`bool`	Whether to allow the last fold to include fewer observations than `steps`. If `False`, the last fold is excluded if it is incomplete.	`True`
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.	`False`
`verbose`	`bool`	Whether to print information about generated folds.	`True`
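
The two forms of `skip_folds` described in the table can be sketched in a few lines. The helper `kept_folds` is hypothetical and only reproduces the documented examples; it is not the library's implementation.

```python
def kept_folds(n_folds, skip_folds=None):
    """Return the indexes of the folds that remain after applying skip_folds."""
    if skip_folds is None:
        return list(range(n_folds))
    if isinstance(skip_folds, int):
        # Keep every skip_folds-th fold, starting with fold 0
        return list(range(0, n_folds, skip_folds))
    # A list names the fold indexes to drop
    to_skip = set(skip_folds)
    return [i for i in range(n_folds) if i not in to_skip]


print(kept_folds(10, skip_folds=3))          # [0, 3, 6, 9]
print(kept_folds(10, skip_folds=[1, 2, 3]))  # [0, 4, 5, 6, 7, 8, 9]
```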

Attributes:

Name	Type	Description
`steps`	`int`	Number of observations to predict in each fold. This is also commonly referred to as the forecast horizon or test size.
`initial_train_size`	`int`	Number of observations used for initial training.
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.
`differentiation`	`int`	Number of observations to use for differentiation. This is used to extend the `last_window` by as many observations as the differentiation order.
`refit`	`(bool, int)`	Whether to refit the forecaster in each fold.
`fixed_train_size`	`bool`	Whether the training size is fixed or increases in each fold.
`gap`	`int`	Number of observations between the end of the training set and the start of the test set.
`skip_folds`	`(int, list)`	Number of folds to skip.
`allow_incomplete_fold`	`bool`	Whether to allow the last fold to include fewer observations than `steps`.
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.
`verbose`	`bool`	Whether to print information about generated folds.

Methods:

Name	Description
`set_params`	Set the parameters of the Fold object. Before overwriting the current parameters, the input parameters are validated to ensure correctness.

Source code in skforecast\model_selection\_split.py

def __init__(
    self,
    steps: int | None = None,
    initial_train_size: int | str | pd.Timestamp | None = None,
    window_size: int | None = None,
    differentiation: int | None = None,
    refit: bool | int = False,
    fixed_train_size: bool = True,
    gap: int = 0,
    skip_folds: int | list[int] | None = None,
    allow_incomplete_fold: bool = True,
    return_all_indexes: bool = False,
    verbose: bool = True
) -> None:

    self._validate_params(
        cv_name               = type(self).__name__,
        steps                 = steps,
        initial_train_size    = initial_train_size,
        window_size           = window_size,
        differentiation       = differentiation,
        refit                 = refit,
        fixed_train_size      = fixed_train_size,
        gap                   = gap,
        skip_folds            = skip_folds,
        allow_incomplete_fold = allow_incomplete_fold,
        return_all_indexes    = return_all_indexes,
        verbose               = verbose
    )

    self.steps                 = steps
    self.initial_train_size    = initial_train_size
    self.window_size           = window_size
    self.differentiation       = differentiation
    self.refit                 = refit
    self.fixed_train_size      = fixed_train_size
    self.gap                   = gap
    self.skip_folds            = skip_folds
    self.allow_incomplete_fold = allow_incomplete_fold
    self.return_all_indexes    = return_all_indexes
    self.verbose               = verbose

steps `instance-attribute` ¶

steps = steps

initial_train_size `instance-attribute` ¶

initial_train_size = initial_train_size

window_size `instance-attribute` ¶

window_size = window_size

differentiation `instance-attribute` ¶

differentiation = differentiation

refit `instance-attribute` ¶

refit = refit

fixed_train_size `instance-attribute` ¶

fixed_train_size = fixed_train_size

gap `instance-attribute` ¶

gap = gap

skip_folds `instance-attribute` ¶

skip_folds = skip_folds

allow_incomplete_fold `instance-attribute` ¶

allow_incomplete_fold = allow_incomplete_fold

return_all_indexes `instance-attribute` ¶

return_all_indexes = return_all_indexes

verbose `instance-attribute` ¶

verbose = verbose

_validate_params ¶

_validate_params(
    cv_name,
    steps=None,
    initial_train_size=None,
    window_size=None,
    differentiation=None,
    refit=False,
    fixed_train_size=True,
    gap=0,
    skip_folds=None,
    allow_incomplete_fold=True,
    return_all_indexes=False,
    verbose=True,
)

Validate all input parameters to ensure correctness.
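
One detail matters for checks of this kind: in Python, `bool` is a subclass of `int`, so an integer check on its own also accepts `True` and `False`. The standalone sketch below mirrors the documented `refit` rule (a boolean, or an integer greater than or equal to 0). The function `check_refit` is hypothetical and simplified; in particular, the exception types it raises are a choice of this sketch.

```python
def check_refit(refit):
    """Accept a boolean or a non-negative integer; reject anything else."""
    if isinstance(refit, bool):
        return refit  # True and False are valid as-is
    if not isinstance(refit, int):
        raise TypeError(f"`refit` must be a boolean or an integer. Got {refit!r}.")
    if refit < 0:
        # This sketch uses ValueError for an out-of-range integer
        raise ValueError(f"`refit` must be greater than or equal to 0. Got {refit}.")
    return refit


print(isinstance(True, int))  # True: bool is a subclass of int
print(check_refit(3))
```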

Source code in skforecast\model_selection\_split.py

def _validate_params(
    self,
    cv_name: str,
    steps: int | None = None,
    initial_train_size: int | str | pd.Timestamp | None = None,
    window_size: int | None = None,
    differentiation: int | None = None,
    refit: bool | int = False,
    fixed_train_size: bool = True,
    gap: int = 0,
    skip_folds: int | list[int] | None = None,
    allow_incomplete_fold: bool = True,
    return_all_indexes: bool = False,
    verbose: bool = True
) -> None: 
    """
    Validate all input parameters to ensure correctness.
    """

    if cv_name == "TimeSeriesFold":
        if not isinstance(steps, (int, np.integer)) or steps < 1:
            raise ValueError(
                f"`steps` must be an integer greater than 0. Got {steps}."
            )
        if not isinstance(initial_train_size, (int, np.integer, str, pd.Timestamp, type(None))):
            raise ValueError(
                f"`initial_train_size` must be an integer greater than 0, a date "
                f"string, a pandas Timestamp, or None. Got {initial_train_size}."
            )
        if isinstance(initial_train_size, (int, np.integer)) and initial_train_size < 1:
            raise ValueError(
                f"`initial_train_size` must be an integer greater than 0, "
                f"a date string, a pandas Timestamp, or None. Got {initial_train_size}."
            )
        if not isinstance(refit, (bool, int, np.integer)):
            raise TypeError(
                f"`refit` must be a boolean or an integer equal or greater than 0. "
                f"Got {refit}."
            )
        if isinstance(refit, (int, np.integer)) and not isinstance(refit, bool) and refit < 0:
            raise TypeError(
                f"`refit` must be a boolean or an integer equal or greater than 0. "
                f"Got {refit}."
            )
        if not isinstance(fixed_train_size, bool):
            raise TypeError(
                f"`fixed_train_size` must be a boolean: `True`, `False`. "
                f"Got {fixed_train_size}."
            )
        if not isinstance(gap, (int, np.integer)) or gap < 0:
            raise ValueError(
                f"`gap` must be an integer greater than or equal to 0. Got {gap}."
            )
        if skip_folds is not None:
            if not isinstance(skip_folds, (int, np.integer, list, type(None))):
                raise TypeError(
                    f"`skip_folds` must be an integer greater than 0, a list of "
                    f"integers or `None`. Got {skip_folds}."
                )
            if isinstance(skip_folds, (int, np.integer)) and skip_folds < 1:
                raise ValueError(
                    f"`skip_folds` must be an integer greater than 0, a list of "
                    f"integers or `None`. Got {skip_folds}."
                )
            if isinstance(skip_folds, list) and any([x < 1 for x in skip_folds]):
                raise ValueError(
                    f"`skip_folds` list must contain integers greater than or "
                    f"equal to 1. The first fold is always needed to train the "
                    f"forecaster. Got {skip_folds}."
                ) 
        if not isinstance(allow_incomplete_fold, bool):
            raise TypeError(
                f"`allow_incomplete_fold` must be a boolean: `True`, `False`. "
                f"Got {allow_incomplete_fold}."
            )

    if cv_name == "OneStepAheadFold":
        if not isinstance(initial_train_size, (int, np.integer, str, pd.Timestamp)):
            raise ValueError(
                f"`initial_train_size` must be an integer greater than 0, a date "
                f"string, or a pandas Timestamp. Got {initial_train_size}."
            )
        if isinstance(initial_train_size, (int, np.integer)) and initial_train_size < 1:
            raise ValueError(
                f"`initial_train_size` must be an integer greater than 0, "
                f"a date string, or a pandas Timestamp. Got {initial_train_size}."
            )

    if (
        not isinstance(window_size, (int, np.integer, pd.DateOffset, type(None)))
        or isinstance(window_size, (int, np.integer))
        and window_size < 1
    ):
        raise ValueError(
            f"`window_size` must be an integer greater than 0. Got {window_size}."
        )

    if not isinstance(return_all_indexes, bool):
        raise TypeError(
            f"`return_all_indexes` must be a boolean: `True`, `False`. "
            f"Got {return_all_indexes}."
        )
    if differentiation is not None:
        if not isinstance(differentiation, (int, np.integer)) or differentiation < 0:
            raise ValueError(
                f"`differentiation` must be None or an integer greater than or "
                f"equal to 0. Got {differentiation}."
            )
    if not isinstance(verbose, bool):
        raise TypeError(
            f"`verbose` must be a boolean: `True`, `False`. "
            f"Got {verbose}."
        )

_extract_index ¶

_extract_index(X)

Extracts and returns the index from the input data X.

Parameters:

Name	Type	Description	Default
`X`	`pandas Series, pandas DataFrame, pandas Index, dict`	Time series data or index to split.	required

Returns:

Name	Type	Description
`idx`	`pandas Index`	Index extracted from the input data.

Source code in skforecast\model_selection\_split.py

def _extract_index(
    self,
    X: pd.Series | pd.DataFrame | pd.Index | dict[str, pd.Series | pd.DataFrame]
) -> pd.Index:
    """
    Extracts and returns the index from the input data X.

    Parameters
    ----------
    X : pandas Series, pandas DataFrame, pandas Index, dict
        Time series data or index to split.

    Returns
    -------
    idx : pandas Index
        Index extracted from the input data.

    """

    if isinstance(X, (pd.Series, pd.DataFrame)):
        idx = X.index
    elif isinstance(X, dict):
        freqs = [s.index.freq for s in X.values() if s.index.freq is not None]
        if not freqs:
            raise ValueError("At least one series must have a frequency.")
        if not all(f == freqs[0] for f in freqs):
            raise ValueError(
                "All series with frequency must have the same frequency."
            )
        min_idx = min([v.index[0] for v in X.values() if not v.empty])
        max_idx = max([v.index[-1] for v in X.values() if not v.empty])
        idx = pd.date_range(start=min_idx, end=max_idx, freq=freqs[0])
    else:
        idx = X

    return idx

set_params ¶

set_params(params)

Set the parameters of the Fold object. Before overwriting the current parameters, the input parameters are validated to ensure correctness.

Parameters:

Name	Type	Description	Default
`params`	`dict`	Dictionary with the parameters to set.	required

Returns:

Type	Description
`None`
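
The validate-then-overwrite behaviour described above can be sketched with a toy class. `MiniFold` and its two parameters are hypothetical stand-ins used only to show the pattern: unknown keys are ignored with a warning, and nothing is changed if the merged parameters fail validation.

```python
import warnings


class MiniFold:
    """Toy stand-in for a fold object with two validated parameters."""

    def __init__(self, steps, gap=0):
        self._validate(steps=steps, gap=gap)
        self.steps = steps
        self.gap = gap

    @staticmethod
    def _validate(steps, gap):
        if not isinstance(steps, int) or steps < 1:
            raise ValueError(f"`steps` must be an integer greater than 0. Got {steps}.")
        if not isinstance(gap, int) or gap < 0:
            raise ValueError(f"`gap` must be an integer greater than or equal to 0. Got {gap}.")

    def set_params(self, params):
        current = dict(vars(self))
        unknown = set(params) - set(current)
        if unknown:
            warnings.warn(f"Unknown parameters: {unknown}. They have been ignored.")
        updated = {**current, **{k: v for k, v in params.items() if k in current}}
        self._validate(**updated)  # validate first, so a bad value changes nothing
        for key, value in updated.items():
            setattr(self, key, value)


cv = MiniFold(steps=4)
cv.set_params({"gap": 2})
print(vars(cv))
```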

Source code in skforecast\model_selection\_split.py

def set_params(
    self, 
    params: dict
) -> None:
    """
    Set the parameters of the Fold object. Before overwriting the current 
    parameters, the input parameters are validated to ensure correctness.

    Parameters
    ----------
    params : dict
        Dictionary with the parameters to set.

    Returns
    -------
    None

    """

    if not isinstance(params, dict):
        raise TypeError(
            f"`params` must be a dictionary. Got {type(params)}."
        )

    current_params = deepcopy(vars(self))
    unknown_params = set(params.keys()) - set(current_params.keys())
    if unknown_params:
        warnings.warn(
            f"Unknown parameters: {unknown_params}. They have been ignored.",
            IgnoredArgumentWarning
        )

    filtered_params = {k: v for k, v in params.items() if k in current_params}
    updated_params = {'cv_name': type(self).__name__, **current_params, **filtered_params}

    self._validate_params(**updated_params)
    for key, value in updated_params.items():
        setattr(self, key, value)

skforecast.model_selection._split.TimeSeriesFold ¶

TimeSeriesFold(
    steps,
    initial_train_size=None,
    window_size=None,
    differentiation=None,
    refit=False,
    fixed_train_size=True,
    gap=0,
    skip_folds=None,
    allow_incomplete_fold=True,
    return_all_indexes=False,
    verbose=True,
)

Bases: BaseFold

Class to split time series data into train and test folds. When used within a backtesting or hyperparameter search, the arguments 'initial_train_size', 'window_size' and 'differentiation' are not required as they are automatically set by the backtesting or hyperparameter search functions.

Parameters:

Name	Type	Description	Default
`steps`	`int`	Number of observations to predict in each fold. This is also commonly referred to as the forecast horizon or test size.	required
`initial_train_size`	`int, str, pandas Timestamp`	Number of observations used for initial training. If `None`, the initial forecaster is not trained in the first fold. If an integer, the number of observations used for initial training; it must be greater than 0. If a date string or pandas Timestamp, it is the last date included in the initial training set.	`None`
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.	`None`
`differentiation`	`int`	Number of observations to use for differentiation. This is used to extend the `last_window` by as many observations as the differentiation order.	`None`
`refit`	`(bool, int)`	Whether to refit the forecaster in each fold. If `True`, the forecaster is refitted in each fold. If `False`, the forecaster is trained only in the first fold. If an integer, the forecaster is trained in the first fold and then refitted every `refit` folds.	`False`
`fixed_train_size`	`bool`	Whether the training size is fixed or increases in each fold.	`True`
`gap`	`int`	Number of observations between the end of the training set and the start of the test set.	`0`
`skip_folds`	`(int, list)`	Number of folds to skip. If an integer, every `skip_folds`-th fold is returned. If a list, the indexes of the folds to skip. For example, if `skip_folds=3` and there are 10 folds, the returned folds are 0, 3, 6, and 9. If `skip_folds=[1, 2, 3]`, the returned folds are 0, 4, 5, 6, 7, 8, and 9.	`None`
`allow_incomplete_fold`	`bool`	Whether to allow the last fold to include fewer observations than `steps`. If `False`, the last fold is excluded if it is incomplete.	`True`
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.	`False`
`verbose`	`bool`	Whether to print information about generated folds.	`True`
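
The three forms of `refit` can be pictured as a list of per-fold flags. In the sketch below, "refitted every `refit` folds" is read as fitting in folds 0, `refit`, 2 * `refit`, and so on, and `refit=0` is treated like `False`. Both readings, as well as the helper name `fit_flags`, are assumptions of this example rather than a statement of the library's internals.

```python
def fit_flags(n_folds, refit=False):
    """Return, per fold, whether the forecaster would be fitted under this reading of `refit`."""
    if refit is True:
        return [True] * n_folds
    if refit is False or refit == 0:
        # Train once, in the first fold only
        return [i == 0 for i in range(n_folds)]
    # Integer n: fit in the first fold and again every n folds
    return [i % refit == 0 for i in range(n_folds)]


print(fit_flags(5, refit=False))  # [True, False, False, False, False]
print(fit_flags(5, refit=True))   # [True, True, True, True, True]
print(fit_flags(5, refit=2))      # [True, False, True, False, True]
```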

Attributes:

Name	Type	Description
`steps`	`int`	Number of observations to predict in each fold. This is also commonly referred to as the forecast horizon or test size.
`initial_train_size`	`int`	Number of observations used for initial training. If `None`, the initial forecaster is not trained in the first fold.
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.
`differentiation`	`int`	Number of observations to use for differentiation. This is used to extend the `last_window` by as many observations as the differentiation order.
`refit`	`(bool, int)`	Whether to refit the forecaster in each fold.
`fixed_train_size`	`bool`	Whether the training size is fixed or increases in each fold.
`gap`	`int`	Number of observations between the end of the training set and the start of the test set.
`skip_folds`	`(int, list)`	Number of folds to skip.
`allow_incomplete_fold`	`bool`	Whether to allow the last fold to include fewer observations than `steps`.
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.
`verbose`	`bool`	Whether to print information about generated folds.

Notes

Returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc. For example, if the input series is X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19], the initial_train_size = 3, window_size = 2, steps = 4, and gap = 1, the output of the first fold will be: [[0, 3], [1, 3], [3, 8], [4, 8], True].

The first list [0, 3] indicates that the training set goes from the first to the third observation. The second list [1, 3] indicates that the last window seen by the forecaster during training goes from the second to the third observation. The third list [3, 8] indicates that the test set goes from the fourth to the eighth observation. The fourth list [4, 8] indicates that the test set including the gap goes from the fifth to the eighth observation. The boolean True indicates that the forecaster should be trained in this fold.

Following the Python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
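
The example in these notes can be reproduced with simple arithmetic. The formulas in the sketch below are inferred from the documented example, not taken from the library's code, and `first_fold` is a hypothetical helper. The final `True` is hard-coded because the example output shows `True` for the first fold.

```python
def first_fold(initial_train_size, window_size, steps, gap=0):
    """Positions of the first fold, using end-exclusive [start, end] pairs."""
    train = [0, initial_train_size]
    last_window = [initial_train_size - window_size, initial_train_size]
    test = [initial_train_size, initial_train_size + gap + steps]
    test_after_gap = [initial_train_size + gap, initial_train_size + gap + steps]
    return [train, last_window, test, test_after_gap, True]


X = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
fold = first_fold(initial_train_size=3, window_size=2, steps=4, gap=1)
print(fold)                      # [[0, 3], [1, 3], [3, 8], [4, 8], True]
print(X[fold[0][0]:fold[0][1]])  # [10, 11, 12]
print(X[fold[3][0]:fold[3][1]])  # [14, 15, 16, 17]
```

Because the end position is exclusive, each pair can be used directly as a slice, as the last two lines show.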

Methods:

Name	Description
`split`	Split the time series data into train and test folds.

Source code in skforecast\model_selection\_split.py

def __init__(
    self,
    steps: int,
    initial_train_size: int | str | pd.Timestamp | None = None,
    window_size: int | None = None,
    differentiation: int | None = None,
    refit: bool | int = False,
    fixed_train_size: bool = True,
    gap: int = 0,
    skip_folds: int | list[int] | None = None,
    allow_incomplete_fold: bool = True,
    return_all_indexes: bool = False,
    verbose: bool = True
) -> None:

    super().__init__(
        steps                 = steps,
        initial_train_size    = initial_train_size,
        window_size           = window_size,
        differentiation       = differentiation,
        refit                 = refit,
        fixed_train_size      = fixed_train_size,
        gap                   = gap,
        skip_folds            = skip_folds,
        allow_incomplete_fold = allow_incomplete_fold,
        return_all_indexes    = return_all_indexes,
        verbose               = verbose
    )

_repr_html_ ¶

_repr_html_()

HTML representation of the object. The "General Information" section is expanded by default.

Source code in skforecast\model_selection\_split.py

def _repr_html_(self) -> str:
    """
    HTML representation of the object.
    The "General Information" section is expanded by default.
    """

    style, unique_id = get_style_repr_html()
    content = f"""
    <div class="container-{unique_id}">
        <h2>{type(self).__name__}</h2>
        <details open>
            <summary>General Information</summary>
            <ul>
                <li><strong>Initial train size:</strong> {self.initial_train_size}</li>
                <li><strong>Steps:</strong> {self.steps}</li>
                <li><strong>Window size:</strong> {self.window_size}</li>
                <li><strong>Differentiation:</strong> {self.differentiation}</li>
                <li><strong>Refit:</strong> {self.refit}</li>
                <li><strong>Fixed train size:</strong> {self.fixed_train_size}</li>
                <li><strong>Gap:</strong> {self.gap}</li>
                <li><strong>Skip folds:</strong> {self.skip_folds}</li>
                <li><strong>Allow incomplete fold:</strong> {self.allow_incomplete_fold}</li>
                <li><strong>Return all indexes:</strong> {self.return_all_indexes}</li>
            </ul>
        </details>
        <p>
            <a href="https://skforecast.org/{skforecast.__version__}/api/model_selection.html#skforecast.model_selection._split.TimeSeriesFold">&#128712 <strong>API Reference</strong></a>
            &nbsp;&nbsp;
            <a href="https://skforecast.org/{skforecast.__version__}/user_guides/backtesting.html#timeseriesfold">&#128462 <strong>User Guide</strong></a>
        </p>
    </div>
    """

    return style + content

split ¶

split(X, as_pandas=False)

Split the time series data into train and test folds.

Parameters:

Name	Type	Description	Default
`X`	`pandas Series, pandas DataFrame, pandas Index, dict`	Time series data or index to split.	required
`as_pandas`	`bool`	If True, the folds are returned as a DataFrame. This is useful to visualize the folds in a more interpretable way.	`False`

Returns:

Name	Type	Description
`folds`	`list, pandas DataFrame`	A list of lists containing the indices (position) for each fold. Each list contains 4 lists and a boolean with the following information: [train_start, train_end]: list with the start and end positions of the training set. [last_window_start, last_window_end]: list with the start and end positions of the last window seen by the forecaster during training. The last window is used to generate the lags used as predictors. If `differentiation` is included, the interval is extended by as many observations as the differentiation order. If the argument `window_size` is `None`, this list is empty. [test_start, test_end]: list with the start and end positions of the test set. These are the observations used to evaluate the forecaster. [test_start_with_gap, test_end_with_gap]: list with the start and end positions of the test set including the gap. The gap is the number of observations between the end of the training set and the start of the test set. fit_forecaster: boolean indicating whether the forecaster should be fitted in this fold. It is important to note that the returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc. If `as_pandas` is `True`, the folds are returned as a DataFrame with the following columns: 'fold', 'train_start', 'train_end', 'last_window_start', 'last_window_end', 'test_start', 'test_end', 'test_start_with_gap', 'test_end_with_gap', 'fit_forecaster'. Following the Python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
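
To show how the nested list form relates to the column names listed for `as_pandas=True`, the sketch below flattens folds into one record per fold. The helper `folds_to_records` is hypothetical, it assumes every position pair is non-empty, and the second fold's numbers are made up for illustration.

```python
COLUMNS = [
    "fold", "train_start", "train_end", "last_window_start", "last_window_end",
    "test_start", "test_end", "test_start_with_gap", "test_end_with_gap",
    "fit_forecaster",
]


def folds_to_records(folds):
    """Flatten [[start, end], ..., bool] folds into one dict per fold."""
    records = []
    for i, (train, last_window, test, test_gap, fit) in enumerate(folds):
        row = [i, *train, *last_window, *test, *test_gap, fit]
        records.append(dict(zip(COLUMNS, row)))
    return records


folds = [
    [[0, 3], [1, 3], [3, 8], [4, 8], True],
    [[0, 7], [5, 7], [7, 10], [8, 10], False],  # illustrative values only
]
records = folds_to_records(folds)
for record in records:
    print(record)
```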

Source code in skforecast\model_selection\_split.py

def split(
    self,
    X: pd.Series | pd.DataFrame | pd.Index | dict[str, pd.Series | pd.DataFrame],
    as_pandas: bool = False
) -> list | pd.DataFrame:
    """
    Split the time series data into train and test folds.

    Parameters
    ----------
    X : pandas Series, pandas DataFrame, pandas Index, dict
        Time series data or index to split.
    as_pandas : bool, default False
        If True, the folds are returned as a DataFrame. This is useful to visualize
        the folds in a more interpretable way.

    Returns
    -------
    folds : list, pandas DataFrame
        A list of lists containing the indices (position) for each fold. Each list
        contains 4 lists and a boolean with the following information:

        - [train_start, train_end]: list with the start and end positions of the
        training set.
        - [last_window_start, last_window_end]: list with the start and end positions
        of the last window seen by the forecaster during training. The last window
        is used to generate the lags used as predictors. If `differentiation` is
        included, the interval is extended by as many observations as the
        differentiation order. If the argument `window_size` is `None`, this list is
        empty.
        - [test_start, test_end]: list with the start and end positions of the test
        set. These are the observations used to evaluate the forecaster.
        - [test_start_with_gap, test_end_with_gap]: list with the start and end
        positions of the test set including the gap. The gap is the number of
        observations between the end of the training set and the start of the test
        set.
        - fit_forecaster: boolean indicating whether the forecaster should be fitted
        in this fold.

        It is important to note that the returned values are the positions of the
        observations and not the actual values of the index, so they can be used to
        slice the data directly using iloc.

        If `as_pandas` is `True`, the folds are returned as a DataFrame with the
        following columns: 'fold', 'train_start', 'train_end', 'last_window_start',
        'last_window_end', 'test_start', 'test_end', 'test_start_with_gap',
        'test_end_with_gap', 'fit_forecaster'.

        Following the python convention, the start index is inclusive and the end
        index is exclusive. This means that the last index is not included in the
        slice.

    """

    if not isinstance(X, (pd.Series, pd.DataFrame, pd.Index, dict)):
        raise TypeError(
            f"X must be a pandas Series, DataFrame, Index or a dictionary. "
            f"Got {type(X)}."
        )

    if isinstance(self.window_size, pd.tseries.offsets.DateOffset):
        # Calculate the window_size in steps. This is not an exact calculation
        # because the offset follows the calendar rules and the distance between
        # two dates may not be constant.
        first_valid_index = X.index[-1] - self.window_size
        try:
            window_size_idx_start = X.index.get_loc(first_valid_index)
            window_size_idx_end = X.index.get_loc(X.index[-1])
            self.window_size = window_size_idx_end - window_size_idx_start
        except KeyError:
            raise ValueError(
                f"The length of `X` ({len(X)}), must be greater than or equal "
                f"to the window size ({self.window_size}). Try to decrease the "
                f"size of the offset (forecaster.offset), or increase the "
                f"size of `y`."
            )

    if self.initial_train_size is None:
        if self.window_size is None:
            raise ValueError(
                "To use split method when `initial_train_size` is None, "
                "`window_size` must be an integer greater than 0. "
                "Although no initial training is done and all data is used to "
                "evaluate the model, the first `window_size` observations are "
                "needed to create the initial predictors. Got `window_size` = None."
            )
        if self.refit:
            raise ValueError(
                "`refit` is only allowed when `initial_train_size` is not `None`. "
                "Set `refit` to `False` if you want to use `initial_train_size = None`."
            )
        externally_fitted = True
        self.initial_train_size = self.window_size  # Reset to None later
    else:
        if self.window_size is None:
            warnings.warn(
                "Last window cannot be calculated because `window_size` is None."
            )
        externally_fitted = False

    index = self._extract_index(X)
    idx = range(len(index))
    folds = []
    i = 0
    last_fold_excluded = False

    self.initial_train_size = date_to_index_position(
                                  index        = index, 
                                  date_input   = self.initial_train_size, 
                                  method       = 'validation',
                                  date_literal = 'initial_train_size'
                              )

    if len(index) < self.initial_train_size + self.steps:
        raise ValueError(
            f"The time series must have at least `initial_train_size + steps` "
            f"observations. Got {len(index)} observations."
        )

    while self.initial_train_size + (i * self.steps) + self.gap < len(index):

        if self.refit:
            # If `fixed_train_size` the train size doesn't increase but moves by 
            # `steps` positions in each iteration. If `False`, the train size
            # increases by `steps` in each iteration.
            train_iloc_start = i * (self.steps) if self.fixed_train_size else 0
            train_iloc_end = self.initial_train_size + i * (self.steps)
            test_iloc_start = train_iloc_end
        else:
            # The train size doesn't increase and doesn't move.
            train_iloc_start = 0
            train_iloc_end = self.initial_train_size
            test_iloc_start = self.initial_train_size + i * (self.steps)

        if self.window_size is not None:
            last_window_iloc_start = test_iloc_start - self.window_size
        test_iloc_end = test_iloc_start + self.gap + self.steps

        partitions = [
            idx[train_iloc_start : train_iloc_end],
            idx[last_window_iloc_start : test_iloc_start] if self.window_size is not None else [],
            idx[test_iloc_start : test_iloc_end],
            idx[test_iloc_start + self.gap : test_iloc_end]
        ]
        folds.append(partitions)
        i += 1

    if not self.allow_incomplete_fold and len(folds[-1][3]) < self.steps:
        folds = folds[:-1]
        last_fold_excluded = True

    # Replace partitions inside folds with length 0 with `None`
    folds = [
        [partition if len(partition) > 0 else None for partition in fold] 
         for fold in folds
    ]

    # Create a flag to know whether to train the forecaster
    if self.refit == 0:
        self.refit = False

    if isinstance(self.refit, bool):
        fit_forecaster = [self.refit] * len(folds)
        fit_forecaster[0] = True
    else:
        fit_forecaster = [False] * len(folds)
        for i in range(0, len(fit_forecaster), self.refit): 
            fit_forecaster[i] = True

    for i in range(len(folds)): 
        folds[i].append(fit_forecaster[i])
        if fit_forecaster[i] is False:
            folds[i][0] = folds[i - 1][0]

    index_to_skip = []
    if self.skip_folds is not None:
        if isinstance(self.skip_folds, (int, np.integer)) and self.skip_folds > 0:
            index_to_keep = np.arange(0, len(folds), self.skip_folds)
            index_to_skip = np.setdiff1d(np.arange(0, len(folds)), index_to_keep, assume_unique=True)
            index_to_skip = [int(x) for x in index_to_skip]  # Required since numpy 2.0
        if isinstance(self.skip_folds, list):
            index_to_skip = [i for i in self.skip_folds if i < len(folds)]        

    if self.verbose:
        self._print_info(
            index              = index,
            folds              = folds,
            externally_fitted  = externally_fitted,
            last_fold_excluded = last_fold_excluded,
            index_to_skip      = index_to_skip
        )

    folds = [fold for i, fold in enumerate(folds) if i not in index_to_skip]
    if not self.return_all_indexes:
        # +1 to prevent iloc pandas from deleting the last observation
        folds = [
            [[fold[0][0], fold[0][-1] + 1], 
             [fold[1][0], fold[1][-1] + 1] if self.window_size is not None else [],
             [fold[2][0], fold[2][-1] + 1],
             [fold[3][0], fold[3][-1] + 1],
             fold[4]] 
            for fold in folds
        ]

    if externally_fitted:
        self.initial_train_size = None
        folds[0][4] = False

    if as_pandas:
        if self.window_size is None:
            for fold in folds:
                fold[1] = [None, None]

        if not self.return_all_indexes:
            folds = pd.DataFrame(
                data = [list(itertools.chain(*fold[:-1])) + [fold[-1]] for fold in folds],
                columns = [
                    'train_start',
                    'train_end',
                    'last_window_start',
                    'last_window_end',
                    'test_start',
                    'test_end',
                    'test_start_with_gap',
                    'test_end_with_gap',
                    'fit_forecaster'
                ],
            )
        else:
            folds = pd.DataFrame(
                data = folds,
                columns = [
                    'train_index',
                    'last_window_index',
                    'test_index',
                    'test_index_with_gap',
                    'fit_forecaster'
                ],
            )
        folds.insert(0, 'fold', range(len(folds)))

    return folds
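The way the `while` loop above advances the train and test positions can be easier to follow in isolation. The sketch below is a simplified, hypothetical restatement (the function name `sketch_fold_positions` is not part of skforecast); it leaves out the last window, skipped folds, the `fit_forecaster` flag and the removal of an incomplete last fold.

```python
def sketch_fold_positions(n, initial_train_size, steps, gap=0,
                          refit=False, fixed_train_size=True):
    """Simplified, hypothetical restatement of the loop in `split`.

    Returns, per fold, three half-open [start, end] pairs:
    train, test starting at the end of train, and test after the gap.
    """
    folds = []
    i = 0
    while initial_train_size + i * steps + gap < n:
        if refit:
            # Train window rolls forward (fixed size) or grows (expanding).
            train_start = i * steps if fixed_train_size else 0
            train_end = initial_train_size + i * steps
            test_start = train_end
        else:
            # Train window neither grows nor moves.
            train_start, train_end = 0, initial_train_size
            test_start = initial_train_size + i * steps
        # Clip at the end of the data, as slicing a range would do.
        test_end = min(test_start + gap + steps, n)
        folds.append([
            [train_start, train_end],
            [test_start, test_end],
            [test_start + gap, test_end],
        ])
        i += 1
    return folds


for fold in sketch_fold_positions(n=20, initial_train_size=10, steps=4,
                                  refit=True, fixed_train_size=False):
    print(fold)
# [[0, 10], [10, 14], [10, 14]]
# [[0, 14], [14, 18], [14, 18]]
# [[0, 18], [18, 20], [18, 20]]
```

In this example the last fold covers only two observations, which is fewer than `steps`; in the real method such a fold is dropped unless `allow_incomplete_fold` is true.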

_print_info ¶

_print_info(
    index,
    folds,
    externally_fitted,
    last_fold_excluded,
    index_to_skip,
)

Print information about folds.

Parameters:

Name	Type	Description	Default
`index`	`pandas Index`	Index of the time series data.	required
`folds`	`list`	A list of lists containing the indices (position) for each fold.	required
`externally_fitted`	`bool`	Whether an already trained forecaster is to be used.	required
`last_fold_excluded`	`bool`	Whether the last fold has been excluded because it was incomplete.	required
`index_to_skip`	`list`	Indexes of the folds that are skipped.	required
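The `index_to_skip` list is derived in `split` from the `skip_folds` attribute. A minimal sketch of that derivation, written in plain Python rather than NumPy (the helper name `sketch_index_to_skip` is hypothetical):

```python
def sketch_index_to_skip(n_folds, skip_folds):
    """Hypothetical helper mirroring how `split` derives the skipped folds."""
    if isinstance(skip_folds, int) and skip_folds > 0:
        # Every `skip_folds`-th fold is kept; the rest are skipped.
        keep = set(range(0, n_folds, skip_folds))
        return [i for i in range(n_folds) if i not in keep]
    if isinstance(skip_folds, list):
        # Explicit fold indexes; those beyond the last fold are ignored.
        return [i for i in skip_folds if i < n_folds]
    return []


print(sketch_index_to_skip(7, 3))       # [1, 2, 4, 5]
print(sketch_index_to_skip(5, [1, 9]))  # [1]
```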

Returns:

Type	Description
`None`

Source code in skforecast\model_selection\_split.py

def _print_info(
    self,
    index: pd.Index,
    folds: list[list[int]],
    externally_fitted: bool,
    last_fold_excluded: bool,
    index_to_skip: list[int]
) -> None:
    """
    Print information about folds.

    Parameters
    ----------
    index : pandas Index
        Index of the time series data.
    folds : list
        A list of lists containing the indices (position) for each fold.
    externally_fitted : bool
        Whether an already trained forecaster is to be used.
    last_fold_excluded : bool
        Whether the last fold has been excluded because it was incomplete.
    index_to_skip : list
        Indexes of the folds that are skipped.

    Returns
    -------
    None

    """

    print("Information of folds")
    print("--------------------")
    if externally_fitted:
        print(
            f"An already trained forecaster is to be used. Window size: "
            f"{self.window_size}"
        )
    else:
        if self.differentiation is None:
            print(
                f"Number of observations used for initial training: "
                f"{self.initial_train_size}"
            )
        else:
            print(
                f"Number of observations used for initial training: "
                f"{self.initial_train_size - self.differentiation}"
            )
            print(
                f"    First {self.differentiation} observation/s in training sets "
                f"are used for differentiation"
            )
    print(
        f"Number of observations used for backtesting: "
        f"{len(index) - self.initial_train_size}"
    )
    print(f"    Number of folds: {len(folds)}")
    print(
        f"    Number skipped folds: "
        f"{len(index_to_skip)} {index_to_skip if index_to_skip else ''}"
    )
    print(f"    Number of steps per fold: {self.steps}")
    print(
        f"    Number of steps to exclude between last observed data "
        f"(last window) and predictions (gap): {self.gap}"
    )
    if last_fold_excluded:
        print("    Last fold has been excluded because it was incomplete.")
    if len(folds[-1][3]) < self.steps:
        print(f"    Last fold only includes {len(folds[-1][3])} observations.")
    print("")

    if self.differentiation is None:
        differentiation = 0
    else:
        differentiation = self.differentiation

    for i, fold in enumerate(folds):
        is_fold_skipped   = i in index_to_skip
        has_training      = fold[-1] if i != 0 else True
        training_start    = (
            index[fold[0][0] + differentiation] if fold[0] is not None else None
        )
        training_end      = index[fold[0][-1]] if fold[0] is not None else None
        training_length   = (
            len(fold[0]) - differentiation if fold[0] is not None else 0
        )
        validation_start  = index[fold[3][0]]
        validation_end    = index[fold[3][-1]]
        validation_length = len(fold[3])

        print(f"Fold: {i}")
        if is_fold_skipped:
            print("    Fold skipped")
        elif not externally_fitted and has_training:
            print(
                f"    Training:   {training_start} -- {training_end}  "
                f"(n={training_length})"
            )
            print(
                f"    Validation: {validation_start} -- {validation_end}  "
                f"(n={validation_length})"
            )
        else:
            print("    Training:   No training in this fold")
            print(
                f"    Validation: {validation_start} -- {validation_end}  "
                f"(n={validation_length})"
            )

    print("")

skforecast.model_selection._split.OneStepAheadFold ¶

OneStepAheadFold(
    initial_train_size,
    window_size=None,
    differentiation=None,
    return_all_indexes=False,
    verbose=True,
)

Bases: BaseFold

Class to split time series data into train and test folds for one-step-ahead forecasting.

Parameters:

Name	Type	Description	Default
`initial_train_size`	`int, str, pandas Timestamp`	Number of observations used for initial training. If an integer, the number of observations used for initial training. If a date string or pandas Timestamp, it is the last date included in the initial training set.	required
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.	`None`
`differentiation`	`int`	Number of observations to use for differentiation. This is used to extend the `last_window` by as many observations as the differentiation order.	`None`
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.	`False`
`verbose`	`bool`	Whether to print information about generated folds.	`True`

Attributes:

Name	Type	Description
`initial_train_size`	`int`	Number of observations used for initial training.
`window_size`	`int`	Number of observations needed to generate the autoregressive predictors.
`differentiation`	`int`	Number of observations to use for differentiation. This is used to extend the `last_window` by as many observations as the differentiation order.
`return_all_indexes`	`bool`	Whether to return all indexes or only the start and end indexes of each fold.
`verbose`	`bool`	Whether to print information about generated folds.
`steps`	`Any`	This attribute is not used in this class. It is included for API consistency.
`fixed_train_size`	`Any`	This attribute is not used in this class. It is included for API consistency.
`gap`	`Any`	This attribute is not used in this class. It is included for API consistency.
`skip_folds`	`Any`	This attribute is not used in this class. It is included for API consistency.
`allow_incomplete_fold`	`Any`	This attribute is not used in this class. It is included for API consistency.
`refit`	`Any`	This attribute is not used in this class. It is included for API consistency.

Methods:

Name	Description
`split`	Split the time series data into train and test folds.

Source code in skforecast\model_selection\_split.py

def __init__(
    self,
    initial_train_size: int | str | pd.Timestamp,
    window_size: int | None = None,
    differentiation: int | None = None,
    return_all_indexes: bool = False,
    verbose: bool = True
) -> None:

    super().__init__(
        initial_train_size = initial_train_size,
        window_size        = window_size,
        differentiation    = differentiation,
        return_all_indexes = return_all_indexes,
        verbose            = verbose
    )

_repr_html_ ¶

_repr_html_()

HTML representation of the object. The "General Information" section is expanded by default.

Source code in skforecast\model_selection\_split.py

def _repr_html_(self) -> str:
    """
    HTML representation of the object.
    The "General Information" section is expanded by default.
    """

    style, unique_id = get_style_repr_html()
    content = f"""
    <div class="container-{unique_id}">
        <h2>{type(self).__name__}</h2>
        <details open>
            <summary>General Information</summary>
            <ul>
                <li><strong>Initial train size:</strong> {self.initial_train_size}</li>
                <li><strong>Window size:</strong> {self.window_size}</li>
                <li><strong>Differentiation:</strong> {self.differentiation}</li>
                <li><strong>Return all indexes:</strong> {self.return_all_indexes}</li>
            </ul>
        </details>
        <p>
            <a href="https://skforecast.org/{skforecast.__version__}/api/model_selection.html#skforecast.model_selection._split.OneStepAheadFold">&#128712; <strong>API Reference</strong></a>
            &nbsp;&nbsp;
            <a href="https://skforecast.org/{skforecast.__version__}/faq/parameters-search-backtesting-vs-one-step-ahead.html">&#128462; <strong>User Guide</strong></a>
        </p>
    </div>
    """

    return style + content

split ¶

split(X, as_pandas=False, externally_fitted=None)

Split the time series data into train and test folds.

Parameters:

Name	Type	Description	Default
`X`	`pandas Series, DataFrame, Index, or dictionary`	Time series data or index to split.	required
`as_pandas`	`bool`	If True, the folds are returned as a DataFrame. This is useful to visualize the folds in a more interpretable way.	`False`
`externally_fitted`	`Any`	This argument is not used in this class. It is included for API consistency.	`None`

Returns:

Name	Type	Description
`fold`	`list, pandas DataFrame`	A list of lists containing the indices (position) of the fold. The list contains 2 lists and a boolean with the following information: [train_start, train_end]: list with the start and end positions of the training set. [test_start, test_end]: list with the start and end positions of the test set. These are the observations used to evaluate the forecaster. fit_forecaster: boolean flag, always `True` in this class. It is important to note that the returned values are the positions of the observations and not the actual values of the index, so they can be used to slice the data directly using iloc. If `as_pandas` is `True`, the folds are returned as a DataFrame with the following columns: 'fold', 'train_start', 'train_end', 'test_start', 'test_end', 'fit_forecaster'. Following the Python convention, the start index is inclusive and the end index is exclusive. This means that the last index is not included in the slice.
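Unlike `TimeSeriesFold`, this class produces a single fold. A minimal sketch of its shape, assuming `initial_train_size` has already been resolved to an integer position (the function name `sketch_one_step_fold` is hypothetical):

```python
def sketch_one_step_fold(n_observations, initial_train_size):
    """Hypothetical stand-in for the fold built by `OneStepAheadFold.split`."""
    return [
        [0, initial_train_size],               # train: start inclusive, end exclusive
        [initial_train_size, n_observations],  # test: everything after the train set
        True,                                  # flag stored as the third element
    ]


fold = sketch_one_step_fold(n_observations=50, initial_train_size=35)
print(fold)  # [[0, 35], [35, 50], True]
```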

Source code in skforecast\model_selection\_split.py

def split(
    self,
    X: pd.Series | pd.DataFrame | pd.Index | dict[str, pd.Series | pd.DataFrame],
    as_pandas: bool = False,
    externally_fitted: Any = None
) -> list | pd.DataFrame:
    """
    Split the time series data into train and test folds.

    Parameters
    ----------
    X : pandas Series, DataFrame, Index, or dictionary
        Time series data or index to split.
    as_pandas : bool, default False
        If True, the folds are returned as a DataFrame. This is useful to visualize
        the folds in a more interpretable way.
    externally_fitted : Any
        This argument is not used in this class. It is included for API consistency.

    Returns
    -------
    fold : list, pandas DataFrame
        A list of lists containing the indices (position) of the fold. The list
        contains 2 lists and a boolean with the following information:

        - [train_start, train_end]: list with the start and end positions of the
        training set.
        - [test_start, test_end]: list with the start and end positions of the test
        set. These are the observations used to evaluate the forecaster.
        - fit_forecaster: boolean flag, always `True` in this class.

        It is important to note that the returned values are the positions of the
        observations and not the actual values of the index, so they can be used to
        slice the data directly using iloc.

        If `as_pandas` is `True`, the folds are returned as a DataFrame with the
        following columns: 'fold', 'train_start', 'train_end', 'test_start',
        'test_end', 'fit_forecaster'.

        Following the python convention, the start index is inclusive and the end
        index is exclusive. This means that the last index is not included in the
        slice.

    """

    if not isinstance(X, (pd.Series, pd.DataFrame, pd.Index, dict)):
        raise TypeError(
            f"X must be a pandas Series, DataFrame, Index or a dictionary. "
            f"Got {type(X)}."
        )

    index = self._extract_index(X)

    self.initial_train_size = date_to_index_position(
                                  index        = index, 
                                  date_input   = self.initial_train_size, 
                                  method       = 'validation',
                                  date_literal = 'initial_train_size'
                              )

    fold = [
        [0, self.initial_train_size],
        [self.initial_train_size, len(X)],
        True
    ]

    if self.verbose:
        self._print_info(index=index, fold=fold)

    if self.return_all_indexes:
        fold = [
            [range(fold[0][0], fold[0][1])],
            [range(fold[1][0], fold[1][1])],
            fold[2]
        ]

    if as_pandas:
        if not self.return_all_indexes:
            fold = pd.DataFrame(
                data = [list(itertools.chain(*fold[:-1])) + [fold[-1]]],
                columns = [
                    'train_start',
                    'train_end',
                    'test_start',
                    'test_end',
                    'fit_forecaster'
                ],
            )
        else:
            fold = pd.DataFrame(
                data = [fold],
                columns = [
                    'train_index',
                    'test_index',
                    'fit_forecaster'
                ],
            )
        fold.insert(0, 'fold', range(len(fold)))

    return fold

_print_info ¶

_print_info(index, fold)

Print information about folds.

Parameters:

Name	Type	Description	Default
`index`	`pandas Index`	Index of the time series data.	required
`fold`	`list`	A list of lists containing the indices (position) of the fold.	required

Returns:

Type	Description
`None`

Source code in skforecast\model_selection\_split.py

def _print_info(
    self,
    index: pd.Index,
    fold: list[list[int]]
) -> None:
    """
    Print information about folds.

    Parameters
    ----------
    index : pandas Index
        Index of the time series data.
    fold : list
        A list of lists containing the indices (position) of the fold.

    Returns
    -------
    None

    """

    if self.differentiation is None:
        differentiation = 0
    else:
        differentiation = self.differentiation

    initial_train_size = self.initial_train_size - differentiation
    test_length = len(index) - (initial_train_size + differentiation)

    print("Information of folds")
    print("--------------------")
    print(
        f"Number of observations in train: {initial_train_size}"
    )
    if self.differentiation is not None:
        print(
            f"    First {differentiation} observation/s in training set "
            f"are used for differentiation"
        )
    print(
        f"Number of observations in test: {test_length}"
    )

    training_start = index[fold[0][0] + differentiation]
    training_end = index[fold[0][-1]]
    test_start  = index[fold[1][0]]
    test_end    = index[fold[1][-1] - 1]

    print(
        f"Training : {training_start} -- {training_end} (n={initial_train_size})"
    )
    print(
        f"Test     : {test_start} -- {test_end} (n={test_length})"
    )
    print("")

skforecast.model_selection._utils.initialize_lags_grid ¶

initialize_lags_grid(forecaster, lags_grid=None)

Initialize lags grid and lags label for model selection.

Parameters:

Name	Type	Description	Default
`forecaster`	`Forecaster`	Forecaster model. ForecasterRecursive, ForecasterDirect, ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate.	required
`lags_grid`	`(list, dict)`	Lists of lags to try, containing int, lists, numpy ndarray, or range objects. If `dict`, the keys are used as labels in the `results` DataFrame, and the values are used as the lists of lags to try.	`None`

Returns:

Name	Type	Description
`lags_grid`	`dict`	Dictionary with lags configuration for each iteration.
`lags_label`	`str`	Label for lags representation in the results object.

Source code in skforecast\model_selection\_utils.py

def initialize_lags_grid(
    forecaster: object, 
    lags_grid: (
        list[int | list[int] | np.ndarray[int] | range[int]]
        | dict[str, list[int | list[int] | np.ndarray[int] | range[int]]]
        | None
    ) = None,
) -> tuple[dict[str, int], str]:
    """
    Initialize lags grid and lags label for model selection. 

    Parameters
    ----------
    forecaster : Forecaster
        Forecaster model. ForecasterRecursive, ForecasterDirect, 
        ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate.
    lags_grid : list, dict, default None
        Lists of lags to try, containing int, lists, numpy ndarray, or range 
        objects. If `dict`, the keys are used as labels in the `results` 
        DataFrame, and the values are used as the lists of lags to try.

    Returns
    -------
    lags_grid : dict
        Dictionary with lags configuration for each iteration.
    lags_label : str
        Label for lags representation in the results object.

    """

    if not isinstance(lags_grid, (list, dict, type(None))):
        raise TypeError(
            f"`lags_grid` argument must be a list, dict or None. "
            f"Got {type(lags_grid)}."
        )

    lags_label = 'values'
    if isinstance(lags_grid, list):
        lags_grid = {f'{lags}': lags for lags in lags_grid}
    elif lags_grid is None:
        lags = [int(lag) for lag in forecaster.lags]  # Required since numpy 2.0
        lags_grid = {f'{lags}': lags}
    else:
        lags_label = 'keys'

    return lags_grid, lags_label
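The labelling behaviour is easiest to see with concrete inputs. The sketch below restates the branching in a standalone function so it can run without a forecaster; `sketch_lags_grid` and its `default_lags` argument (a stand-in for `forecaster.lags`) are hypothetical names, and the type check is omitted.

```python
def sketch_lags_grid(lags_grid, default_lags):
    """Hypothetical restatement of the labelling logic shown above."""
    if isinstance(lags_grid, list):
        # Each entry is labelled with its own string representation.
        return {f"{lags}": lags for lags in lags_grid}, "values"
    if lags_grid is None:
        # Fall back to the lags already configured in the forecaster.
        lags = [int(lag) for lag in default_lags]
        return {f"{lags}": lags}, "values"
    # A dict is returned unchanged and its keys are used as labels.
    return lags_grid, "keys"


print(sketch_lags_grid([3, [1, 2, 24]], default_lags=[1]))
# ({'3': 3, '[1, 2, 24]': [1, 2, 24]}, 'values')
print(sketch_lags_grid({"weekly": [1, 7]}, default_lags=[1]))
# ({'weekly': [1, 7]}, 'keys')
```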

skforecast.model_selection._utils.check_backtesting_input ¶

check_backtesting_input(
    forecaster,
    cv,
    metric,
    add_aggregated_metric=True,
    y=None,
    series=None,
    exog=None,
    interval=None,
    interval_method="bootstrapping",
    alpha=None,
    n_boot=250,
    use_in_sample_residuals=True,
    use_binned_residuals=True,
    random_state=123,
    n_jobs="auto",
    show_progress=True,
    suppress_warnings=False,
    suppress_warnings_fit=False,
)

Helper function that checks most inputs of the backtesting functions in the `model_selection` module.

Parameters:

Name	Type	Description	Default
`forecaster`	`Forecaster`	Forecaster model.	required
`cv`	`TimeSeriesFold`	TimeSeriesFold object with the information needed to split the data into folds.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model.	required
`add_aggregated_metric`	`bool`	If `True`, the aggregated metrics (average, weighted average and pooling) over all levels are also returned (only multiseries).	`True`
`y`	`pandas Series`	Training time series for uni-series forecasters.	`None`
`series`	`pandas DataFrame, dict`	Training time series for multi-series forecasters.	`None`
`exog`	`pandas Series, pandas DataFrame, dict`	Exogenous variables.	`None`
`interval`	`(float, list, tuple, str, object)`	Specifies whether probabilistic predictions should be estimated and the method to use. The following options are supported: If `float`, represents the nominal (expected) coverage (between 0 and 1). For instance, `interval=0.95` corresponds to `[2.5, 97.5]` percentiles. If `list` or `tuple`: Sequence of percentiles to compute, each value must be between 0 and 100 inclusive. For example, a 95% confidence interval can be specified as `interval = [2.5, 97.5]` or multiple percentiles (e.g. 10, 50 and 90) as `interval = [10, 50, 90]`. If 'bootstrapping' (str): `n_boot` bootstrapping predictions will be generated. If scipy.stats distribution object, the distribution parameters will be estimated for each prediction. If None, no probabilistic predictions are estimated.	`None`
`interval_method`	`str`	Technique used to estimate prediction intervals. Available options: 'bootstrapping': Bootstrapping is used to generate prediction intervals. 'conformal': Employs the conformal prediction split method for interval estimation.	`'bootstrapping'`
`alpha`	`float`	The confidence intervals used in ForecasterSarimax are (1 - alpha) %.	`None`
`n_boot`	`int`	Number of bootstrapping iterations to perform when estimating prediction intervals.	`250`
`use_in_sample_residuals`	`bool`	If `True`, residuals from the training data are used as proxy of prediction error to create prediction intervals. If `False`, out_sample_residuals are used if they are already stored inside the forecaster.	`True`
`use_binned_residuals`	`bool`	If `True`, residuals are selected based on the predicted values (binned selection). If `False`, residuals are selected randomly.	`True`
`random_state`	`int`	Seed for the random number generator to ensure reproducibility.	`123`
`n_jobs`	`(int, 'auto')`	The number of jobs to run in parallel. If `-1`, then the number of jobs is set to the number of cores. If 'auto', `n_jobs` is set using the function skforecast.utils.select_n_jobs_fit_forecaster.	`'auto'`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`suppress_warnings`	`bool`	If `True`, skforecast warnings will be suppressed during the backtesting process. See skforecast.exceptions.warn_skforecast_categories for more information.	`False`
`suppress_warnings_fit`	`bool`	If `True`, warnings generated during fitting will be ignored. Only `ForecasterSarimax`.	`False`
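The relation between a `float` value of `interval` and the equivalent percentile pair can be sketched as follows. This assumes the percentiles are placed symmetrically around the median, as the `[2.5, 97.5]` example implies; the function name `coverage_to_percentiles` is hypothetical and this is not skforecast's internal code.

```python
def coverage_to_percentiles(coverage):
    """Symmetric lower and upper percentiles for a nominal coverage in (0, 1)."""
    if not 0 < coverage < 1:
        raise ValueError("coverage must be between 0 and 1")
    lower = 50 * (1 - coverage)
    return [lower, 100 - lower]


# Rounded only to hide floating-point noise in the printed values.
print([round(p, 6) for p in coverage_to_percentiles(0.95)])  # [2.5, 97.5]
```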

Returns:

Type	Description
`None`

Source code in skforecast\model_selection\_utils.py

def check_backtesting_input(
    forecaster: object,
    cv: object,
    metric: str | Callable | list[str | Callable],
    add_aggregated_metric: bool = True,
    y: pd.Series | None = None,
    series: pd.DataFrame | dict[str, pd.Series | pd.DataFrame] = None,
    exog: pd.Series | pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | None = None,
    interval: float | list[float] | tuple[float] | str | object | None = None,
    interval_method: str = 'bootstrapping',    
    alpha: float | None = None,
    n_boot: int = 250,
    use_in_sample_residuals: bool = True,
    use_binned_residuals: bool = True,
    random_state: int = 123,
    n_jobs: int | str = 'auto',
    show_progress: bool = True,
    suppress_warnings: bool = False,
    suppress_warnings_fit: bool = False
) -> None:
    """
    Helper function that checks most inputs of the backtesting functions in the
    `model_selection` module.

    Parameters
    ----------
    forecaster : Forecaster
        Forecaster model.
    cv : TimeSeriesFold
        TimeSeriesFold object with the information needed to split the data into folds.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.
    add_aggregated_metric : bool, default True
        If `True`, the aggregated metrics (average, weighted average and pooling)
        over all levels are also returned (only multiseries).
    y : pandas Series, default None
        Training time series for uni-series forecasters.
    series : pandas DataFrame, dict, default None
        Training time series for multi-series forecasters.
    exog : pandas Series, pandas DataFrame, dict, default None
        Exogenous variables.
    interval : float, list, tuple, str, object, default None
        Specifies whether probabilistic predictions should be estimated and the 
        method to use. The following options are supported:

        - If `float`, represents the nominal (expected) coverage (between 0 and 1). 
        For instance, `interval=0.95` corresponds to `[2.5, 97.5]` percentiles.
        - If `list` or `tuple`: Sequence of percentiles to compute, each value must 
        be between 0 and 100 inclusive. For example, a 95% confidence interval can 
        be specified as `interval = [2.5, 97.5]` or multiple percentiles (e.g. 10, 
        50 and 90) as `interval = [10, 50, 90]`.
        - If 'bootstrapping' (str): `n_boot` bootstrapping predictions will be generated.
        - If scipy.stats distribution object, the distribution parameters will
        be estimated for each prediction.
        - If None, no probabilistic predictions are estimated.
    interval_method : str, default 'bootstrapping'
        Technique used to estimate prediction intervals. Available options:

        + 'bootstrapping': Bootstrapping is used to generate prediction 
        intervals.
        + 'conformal': Employs the conformal prediction split method for 
        interval estimation.
    alpha : float, default None
        The confidence intervals used in ForecasterSarimax are (1 - alpha) %. 
    n_boot : int, default `250`
        Number of bootstrapping iterations to perform when estimating prediction
        intervals.
    use_in_sample_residuals : bool, default True
        If `True`, residuals from the training data are used as proxy of prediction 
        error to create prediction intervals.  If `False`, out_sample_residuals 
        are used if they are already stored inside the forecaster.
    use_binned_residuals : bool, default True
        If `True`, residuals are selected based on the predicted values 
        (binned selection).
        If `False`, residuals are selected randomly.
    random_state : int, default `123`
        Seed for the random number generator to ensure reproducibility.
    n_jobs : int, 'auto', default `'auto'`
        The number of jobs to run in parallel. If `-1`, then the number of jobs is 
        set to the number of cores. If 'auto', `n_jobs` is set using the function
        skforecast.model_selection._utils.select_n_jobs_backtesting.
    show_progress : bool, default True
        Whether to show a progress bar.
    suppress_warnings: bool, default False
        If `True`, skforecast warnings will be suppressed during the backtesting 
        process. See skforecast.exceptions.warn_skforecast_categories for more
        information.
    suppress_warnings_fit : bool, default False
        If `True`, warnings generated during fitting will be ignored. Only 
        `ForecasterSarimax`.

    Returns
    -------
    None

    """

    forecaster_name = type(forecaster).__name__
    cv_name = type(cv).__name__

    if cv_name != "TimeSeriesFold":
        raise TypeError(f"`cv` must be a 'TimeSeriesFold' object. Got '{cv_name}'.")

    steps = cv.steps
    initial_train_size = cv.initial_train_size
    gap = cv.gap
    allow_incomplete_fold = cv.allow_incomplete_fold
    refit = cv.refit

    forecasters_uni = [
        "ForecasterRecursive",
        "ForecasterDirect",
        "ForecasterSarimax",
        "ForecasterEquivalentDate",
    ]
    forecasters_direct = [
        "ForecasterDirect",
        "ForecasterDirectMultiVariate"
    ]
    forecasters_multi_no_dict = [
        "ForecasterDirectMultiVariate",
        "ForecasterRnn",
    ]
    forecasters_multi_dict = [
        "ForecasterRecursiveMultiSeries"
    ]
    forecasters_boot_conformal = [
        "ForecasterRecursive",
        "ForecasterDirect",
        "ForecasterRecursiveMultiSeries",
        "ForecasterDirectMultiVariate",
    ]
    # NOTE: ForecasterSarimax has interval but not with bootstrapping or conformal
    forecasters_not_interval = [
        "ForecasterEquivalentDate",
        "ForecasterRnn"
    ]

    if forecaster_name in forecasters_uni:
        if not isinstance(y, pd.Series):
            raise TypeError("`y` must be a pandas Series.")
        data_name = 'y'
        data_length = len(y)

    elif forecaster_name in forecasters_multi_no_dict:
        if not isinstance(series, pd.DataFrame):
            raise TypeError("`series` must be a pandas DataFrame.")
        data_name = 'series'
        data_length = len(series)

    elif forecaster_name in forecasters_multi_dict:
        if not isinstance(series, (pd.DataFrame, dict)):
            raise TypeError(
                f"`series` must be a pandas DataFrame or a dict of DataFrames or Series. "
                f"Got {type(series)}."
            )

        data_name = 'series'
        if isinstance(series, dict):
            not_valid_series = [
                k 
                for k, v in series.items()
                if not isinstance(v, (pd.Series, pd.DataFrame))
            ]
            if not_valid_series:
                raise TypeError(
                    f"If `series` is a dictionary, all series must be a named "
                    f"pandas Series or a pandas DataFrame with a single column. "
                    f"Review series: {not_valid_series}"
                )
            not_valid_index = [
                k 
                for k, v in series.items()
                if not isinstance(v.index, pd.DatetimeIndex)
            ]
            if not_valid_index:
                raise ValueError(
                    f"If `series` is a dictionary, all series must have a Pandas "
                    f"DatetimeIndex as index with the same frequency. "
                    f"Review series: {not_valid_index}"
                )

            indexes_freq = [f'{v.index.freq}' for v in series.values()]
            indexes_freq = sorted(set(indexes_freq))
            if not len(indexes_freq) == 1:
                raise ValueError(
                    f"If `series` is a dictionary, all series must have a Pandas "
                    f"DatetimeIndex as index with the same frequency. "
                    f"Found frequencies: {indexes_freq}"
                )
            data_length = max([len(series[serie]) for serie in series])
        else:
            data_length = len(series)

    if exog is not None:
        if forecaster_name in forecasters_multi_dict:
            if not isinstance(exog, (pd.Series, pd.DataFrame, dict)):
                raise TypeError(
                    f"`exog` must be a pandas Series, DataFrame, dictionary of pandas "
                    f"Series/DataFrames or None. Got {type(exog)}."
                )
            if isinstance(exog, dict):
                not_valid_exog = [
                    k 
                    for k, v in exog.items()
                    if not isinstance(v, (pd.Series, pd.DataFrame, type(None)))
                ]
                if not_valid_exog:
                    raise TypeError(
                        f"If `exog` is a dictionary, all exog must be a named pandas "
                        f"Series, a pandas DataFrame or None. Review exog: {not_valid_exog}"
                    )
        else:
            if not isinstance(exog, (pd.Series, pd.DataFrame)):
                raise TypeError(
                    f"`exog` must be a pandas Series, DataFrame or None. Got {type(exog)}."
                )

    if hasattr(forecaster, 'differentiation'):
        if forecaster.differentiation_max != cv.differentiation:
            if forecaster_name == "ForecasterRecursiveMultiSeries" and isinstance(
                forecaster.differentiation, dict
            ):
                raise ValueError(
                    f"When using a dict as `differentiation` in ForecasterRecursiveMultiSeries, "
                    f"the `differentiation` included in the cv ({cv.differentiation}) must be "
                    f"the same as the maximum `differentiation` included in the forecaster "
                    f"({forecaster.differentiation_max}). Set the same value "
                    f"for both using the `differentiation` argument."
                )
            else:
                raise ValueError(
                    f"The differentiation included in the forecaster "
                    f"({forecaster.differentiation_max}) differs from the differentiation "
                    f"included in the cv ({cv.differentiation}). Set the same value "
                    f"for both using the `differentiation` argument."
                )

    if not isinstance(metric, (str, Callable, list)):
        raise TypeError(
            f"`metric` must be a string, a callable function, or a list containing "
            f"multiple strings and/or callables. Got {type(metric)}."
        )

    if forecaster_name == "ForecasterEquivalentDate" and isinstance(
        forecaster.offset, pd.tseries.offsets.DateOffset
    ):
        # NOTE: Checks when initial_train_size is not None cannot be done here
        # because the forecaster is not fitted yet and we don't know the
        # window_size since pd.DateOffset is not a fixed window size.
        if initial_train_size is None:
            raise ValueError(
                f"`initial_train_size` must be an integer greater than "
                f"the `window_size` of the forecaster ({forecaster.window_size}) "
                f"and smaller than the length of `{data_name}` ({data_length}) or "
                f"a date within this range of the index."
            )
    elif initial_train_size is not None:
        if forecaster_name in forecasters_uni:
            index = cv._extract_index(y)
        else:
            index = cv._extract_index(series)

        initial_train_size = date_to_index_position(
                                 index        = index, 
                                 date_input   = initial_train_size, 
                                 method       = 'validation',
                                 date_literal = 'initial_train_size'
                             )
        if initial_train_size < forecaster.window_size or initial_train_size >= data_length:
            raise ValueError(
                f"If `initial_train_size` is an integer, it must be greater than "
                f"the `window_size` of the forecaster ({forecaster.window_size}) "
                f"and smaller than the length of `{data_name}` ({data_length}). If "
                f"it is a date, it must be within this range of the index."
            )
        if initial_train_size + gap >= data_length:
            raise ValueError(
                f"The total size of `initial_train_size` ({initial_train_size}) plus "
                f"`gap` ({gap}) must be smaller than the length of `{data_name}` "
                f"({data_length})."
            )
    else:
        if forecaster_name in ['ForecasterSarimax', 'ForecasterEquivalentDate']:
            raise ValueError(
                f"`initial_train_size` must be an integer smaller than the "
                f"length of `{data_name}` ({data_length})."
            )
        else:
            if not forecaster.is_fitted:
                raise NotFittedError(
                    "`forecaster` must be already trained if no `initial_train_size` "
                    "is provided."
                )
            if refit:
                raise ValueError(
                    "`refit` is only allowed when `initial_train_size` is not `None`."
                )

    if forecaster_name == 'ForecasterSarimax' and cv.skip_folds is not None:
        raise ValueError(
            "`skip_folds` is not allowed for ForecasterSarimax. Set it to `None`."
        )

    if not isinstance(add_aggregated_metric, bool):
        raise TypeError("`add_aggregated_metric` must be a boolean: `True`, `False`.")
    if not isinstance(n_boot, (int, np.integer)) or n_boot < 1:
        raise TypeError(f"`n_boot` must be an integer greater than 0. Got {n_boot}.")
    if not isinstance(use_in_sample_residuals, bool):
        raise TypeError("`use_in_sample_residuals` must be a boolean: `True`, `False`.")
    if not isinstance(use_binned_residuals, bool):
        raise TypeError("`use_binned_residuals` must be a boolean: `True`, `False`.")
    if not isinstance(random_state, (int, np.integer)) or random_state < 0:
        raise TypeError(
            f"`random_state` must be an integer greater than or equal to 0. "
            f"Got {random_state}."
        )
    if not isinstance(n_jobs, int) and n_jobs != 'auto':
        raise TypeError(f"`n_jobs` must be an integer or `'auto'`. Got {n_jobs}.")
    if not isinstance(show_progress, bool):
        raise TypeError("`show_progress` must be a boolean: `True`, `False`.")
    if not isinstance(suppress_warnings, bool):
        raise TypeError("`suppress_warnings` must be a boolean: `True`, `False`.")
    if not isinstance(suppress_warnings_fit, bool):
        raise TypeError("`suppress_warnings_fit` must be a boolean: `True`, `False`.")

    if interval is not None or alpha is not None:
        if forecaster_name in forecasters_not_interval:
            raise ValueError(
                f"Interval predictions are not allowed for {forecaster_name}. "
                f"Set `interval` and `alpha` to `None`."
            )

        if forecaster_name in forecasters_boot_conformal:

            if interval_method == 'conformal':
                if not isinstance(interval, (float, list, tuple)):
                    raise TypeError(
                        f"When `interval_method` is 'conformal', `interval` must "
                        f"be a float or a list/tuple defining a symmetric interval. "
                        f"Got {type(interval)}."
                    )
            elif interval_method == 'bootstrapping':
                if (
                    not isinstance(interval, (float, list, tuple, str))
                    and (not hasattr(interval, "_pdf") or not callable(getattr(interval, "fit", None)))
                ):                
                    raise TypeError(
                        f"When `interval_method` is 'bootstrapping', `interval` "
                        f"must be a float, a list or tuple of floats, a "
                        f"scipy.stats distribution object (with methods `_pdf` and "
                        f"`fit`) or the string 'bootstrapping'. Got {type(interval)}."
                    )
                if isinstance(interval, (list, tuple)):
                    for i in interval:
                        if not isinstance(i, (int, float)):
                            raise TypeError(
                                f"`interval` must be a list or tuple of floats. "
                                f"Got {type(i)} in {interval}."
                            )
                    if len(interval) == 2:
                        check_interval(interval=interval)
                    else:
                        for q in interval:
                            if (q < 0.) or (q > 100.):
                                raise ValueError(
                                    "When `interval` is a list or tuple, all values must be "
                                    "between 0 and 100 inclusive."
                                )
                elif isinstance(interval, str):
                    if interval != 'bootstrapping':
                        raise ValueError(
                            f"When `interval` is a string, it must be 'bootstrapping'. "
                            f"Got {interval}."
                        )
            else:
                raise ValueError(
                    f"`interval_method` must be 'bootstrapping' or 'conformal'. "
                    f"Got {interval_method}."
                )
        else:
            check_interval(interval=interval, alpha=alpha)

    if (
        not allow_incomplete_fold
        and initial_train_size is not None
        and data_length - (initial_train_size + gap) < steps
    ):        
        raise ValueError(
            f"There is not enough data to evaluate {steps} steps in a single "
            f"fold. Set `allow_incomplete_fold` to `True` to allow incomplete folds.\n"
            f"    Data available for test : {data_length - (initial_train_size + gap)}\n"
            f"    Steps                   : {steps}"
        )

    if forecaster_name in forecasters_direct and forecaster.steps < steps + gap:
        raise ValueError(
            f"When using a {forecaster_name}, the combination of steps "
            f"+ gap ({steps + gap}) cannot be greater than the `steps` parameter "
            f"declared when the forecaster is initialized ({forecaster.steps})."
        )
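
The size-related checks in the listing above (`initial_train_size`, `gap`, `steps`, `allow_incomplete_fold`) can be condensed into a small standalone sketch. This is a simplified illustration using plain integers only: `check_fold_sizes` is a hypothetical name, the error messages are shortened, and returning the number of observations left for testing is an addition for demonstration (the real function returns `None`).

```python
def check_fold_sizes(data_length, window_size, initial_train_size,
                     gap=0, steps=1, allow_incomplete_fold=True):
    """Simplified sketch; returns the observations left for testing."""
    # The training window must hold at least `window_size` observations
    # and must leave some data after it.
    if initial_train_size < window_size or initial_train_size >= data_length:
        raise ValueError(
            f"`initial_train_size` ({initial_train_size}) must be at least "
            f"the window size ({window_size}) and smaller than the data "
            f"length ({data_length})."
        )
    # Training size plus gap must still leave at least one observation.
    if initial_train_size + gap >= data_length:
        raise ValueError(
            f"`initial_train_size` ({initial_train_size}) plus `gap` ({gap}) "
            f"must be smaller than the data length ({data_length})."
        )
    available = data_length - (initial_train_size + gap)
    # Without incomplete folds, one full fold of `steps` must fit.
    if not allow_incomplete_fold and available < steps:
        raise ValueError(
            f"Not enough data to evaluate {steps} steps in a single fold "
            f"(available for test: {available})."
        )
    return available


print(check_fold_sizes(100, 5, 80, gap=2, steps=10))  # 18

try:
    check_fold_sizes(100, 5, 95, steps=10, allow_incomplete_fold=False)
except ValueError as exc:
    print(f"ValueError: {exc}")
```

In the second call only 5 observations remain after the training window, so a 10-step fold cannot be completed and the check fails.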

skforecast.model_selection._utils.check_one_step_ahead_input ¶

check_one_step_ahead_input(
    forecaster,
    cv,
    metric,
    y=None,
    series=None,
    exog=None,
    show_progress=True,
    suppress_warnings=False,
)

Helper function that checks most inputs of the hyperparameter tuning functions in the `model_selection` module when using a `OneStepAheadFold`.

Parameters:

Name	Type	Description	Default
`forecaster`	`Forecaster`	Forecaster model.	required
`cv`	`OneStepAheadFold`	OneStepAheadFold object with the information needed to split the data into folds.	required
`metric`	`(str, Callable, list)`	Metric used to quantify the goodness of fit of the model.	required
`y`	`pandas Series`	Training time series for uni-series forecasters.	`None`
`series`	`pandas DataFrame, dict`	Training time series for multi-series forecasters.	`None`
`exog`	`pandas Series, pandas DataFrame, dict`	Exogenous variables.	`None`
`show_progress`	`bool`	Whether to show a progress bar.	`True`
`suppress_warnings`	`bool`	If `True`, skforecast warnings will be suppressed during the hyperparameter search. See skforecast.exceptions.warn_skforecast_categories for more information.	`False`

Returns:

Type	Description
`None`
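
Besides raising on invalid input, this function emits a warning unless `suppress_warnings` is `True`. A minimal sketch of that pattern with the standard `warnings` module follows; `OneStepAheadSketchWarning` and `finish_checks` are hypothetical stand-ins (the listed source uses `OneStepAheadValidationWarning`), and the message is shortened.

```python
import warnings


class OneStepAheadSketchWarning(UserWarning):
    """Hypothetical stand-in for skforecast's OneStepAheadValidationWarning."""


def finish_checks(suppress_warnings=False):
    """Validate the flag, then warn unless the caller opted out."""
    if not isinstance(suppress_warnings, bool):
        raise TypeError("`suppress_warnings` must be a boolean: `True`, `False`.")
    if not suppress_warnings:
        warnings.warn(
            "One-step-ahead predictions may not fully represent multi-step "
            "prediction performance.",
            OneStepAheadSketchWarning,
        )


# Record warnings instead of printing them, so the effect is visible.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    finish_checks(suppress_warnings=False)  # emits the warning
    finish_checks(suppress_warnings=True)   # stays silent

print(len(caught))  # 1
```

Using a dedicated warning class means the message can also be filtered by category with `warnings.filterwarnings`, independently of the `suppress_warnings` flag.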

Source code in skforecast\model_selection\_utils.py

def check_one_step_ahead_input(
    forecaster: object,
    cv: object,
    metric: str | Callable | list[str | Callable],
    y: pd.Series | None = None,
    series: pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | None = None,
    exog: pd.Series | pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | None = None,
    show_progress: bool = True,
    suppress_warnings: bool = False
) -> None:
    """
    Helper function that checks most inputs of the hyperparameter tuning
    functions in the `model_selection` module when using a `OneStepAheadFold`.

    Parameters
    ----------
    forecaster : Forecaster
        Forecaster model.
    cv : OneStepAheadFold
        OneStepAheadFold object with the information needed to split the data into folds.
    metric : str, Callable, list
        Metric used to quantify the goodness of fit of the model.
    y : pandas Series, default None
        Training time series for uni-series forecasters.
    series : pandas DataFrame, dict, default None
        Training time series for multi-series forecasters.
    exog : pandas Series, pandas DataFrame, dict, default None
        Exogenous variables.
    show_progress : bool, default True
        Whether to show a progress bar.
    suppress_warnings: bool, default False
        If `True`, skforecast warnings will be suppressed during the hyperparameter 
        search. See skforecast.exceptions.warn_skforecast_categories for more
        information.

    Returns
    -------
    None

    """

    forecaster_name = type(forecaster).__name__
    cv_name = type(cv).__name__

    if cv_name != "OneStepAheadFold":
        raise TypeError(f"`cv` must be a 'OneStepAheadFold' object. Got '{cv_name}'.")

    initial_train_size = cv.initial_train_size

    forecasters_one_step_ahead = [
        "ForecasterRecursive",
        "ForecasterDirect",
        'ForecasterRecursiveMultiSeries',
        'ForecasterDirectMultiVariate'
    ]
    if forecaster_name not in forecasters_one_step_ahead:
        raise TypeError(
            f"Only forecasters of type {forecasters_one_step_ahead} are allowed "
            f"when using `cv` of type `OneStepAheadFold`. Got {forecaster_name}."
        )

    forecasters_uni = [
        "ForecasterRecursive",
        "ForecasterDirect",
    ]
    forecasters_multi_no_dict = [
        "ForecasterDirectMultiVariate",
    ]
    forecasters_multi_dict = [
        "ForecasterRecursiveMultiSeries"
    ]

    if forecaster_name in forecasters_uni:
        if not isinstance(y, pd.Series):
            raise TypeError(f"`y` must be a pandas Series. Got {type(y)}")
        data_name = 'y'
        data_length = len(y)

    elif forecaster_name in forecasters_multi_no_dict:
        if not isinstance(series, pd.DataFrame):
            raise TypeError(f"`series` must be a pandas DataFrame. Got {type(series)}")
        data_name = 'series'
        data_length = len(series)

    elif forecaster_name in forecasters_multi_dict:
        if not isinstance(series, (pd.DataFrame, dict)):
            raise TypeError(
                f"`series` must be a pandas DataFrame or a dict of DataFrames or Series. "
                f"Got {type(series)}."
            )

        data_name = 'series'
        if isinstance(series, dict):
            not_valid_series = [
                k 
                for k, v in series.items()
                if not isinstance(v, (pd.Series, pd.DataFrame))
            ]
            if not_valid_series:
                raise TypeError(
                    f"If `series` is a dictionary, all series must be a named "
                    f"pandas Series or a pandas DataFrame with a single column. "
                    f"Review series: {not_valid_series}"
                )
            not_valid_index = [
                k 
                for k, v in series.items()
                if not isinstance(v.index, pd.DatetimeIndex)
            ]
            if not_valid_index:
                raise ValueError(
                    f"If `series` is a dictionary, all series must have a Pandas "
                    f"DatetimeIndex as index with the same frequency. "
                    f"Review series: {not_valid_index}"
                )

            indexes_freq = [f'{v.index.freq}' for v in series.values()]
            indexes_freq = sorted(set(indexes_freq))
            if not len(indexes_freq) == 1:
                raise ValueError(
                    f"If `series` is a dictionary, all series must have a Pandas "
                    f"DatetimeIndex as index with the same frequency. "
                    f"Found frequencies: {indexes_freq}"
                )
            data_length = max([len(series[serie]) for serie in series])
        else:
            data_length = len(series)

    if exog is not None:
        if forecaster_name in forecasters_multi_dict:
            if not isinstance(exog, (pd.Series, pd.DataFrame, dict)):
                raise TypeError(
                    f"`exog` must be a pandas Series, DataFrame, dictionary of pandas "
                    f"Series/DataFrames or None. Got {type(exog)}."
                )
            if isinstance(exog, dict):
                not_valid_exog = [
                    k 
                    for k, v in exog.items()
                    if not isinstance(v, (pd.Series, pd.DataFrame, type(None)))
                ]
                if not_valid_exog:
                    raise TypeError(
                        f"If `exog` is a dictionary, all exog must be a named pandas "
                        f"Series, a pandas DataFrame or None. Review exog: {not_valid_exog}"
                    )
        else:
            if not isinstance(exog, (pd.Series, pd.DataFrame)):
                raise TypeError(
                    f"`exog` must be a pandas Series, DataFrame or None. Got {type(exog)}."
                )

    if hasattr(forecaster, 'differentiation'):
        if forecaster.differentiation_max != cv.differentiation:
            if forecaster_name == "ForecasterRecursiveMultiSeries" and isinstance(
                forecaster.differentiation, dict
            ):
                raise ValueError(
                    f"When using a dict as `differentiation` in ForecasterRecursiveMultiSeries, "
                    f"the `differentiation` included in the cv ({cv.differentiation}) must be "
                    f"the same as the maximum `differentiation` included in the forecaster "
                    f"({forecaster.differentiation_max}). Set the same value "
                    f"for both using the `differentiation` argument."
                )
            else:
                raise ValueError(
                    f"The differentiation included in the forecaster "
                    f"({forecaster.differentiation_max}) differs from the differentiation "
                    f"included in the cv ({cv.differentiation}). Set the same value "
                    f"for both using the `differentiation` argument."
                )

    if not isinstance(metric, (str, Callable, list)):
        raise TypeError(
            f"`metric` must be a string, a callable function, or a list containing "
            f"multiple strings and/or callables. Got {type(metric)}."
        )

    if forecaster_name in forecasters_uni:
        index = cv._extract_index(y)
    else:
        index = cv._extract_index(series)

    initial_train_size = date_to_index_position(
                             index        = index, 
                             date_input   = initial_train_size, 
                             method       = 'validation',
                             date_literal = 'initial_train_size'
                         )
    if initial_train_size < forecaster.window_size or initial_train_size >= data_length:
        raise ValueError(
            f"If `initial_train_size` is an integer, it must be greater than "
            f"the `window_size` of the forecaster ({forecaster.window_size}) "
            f"and smaller than the length of `{data_name}` ({data_length}). If "
            f"it is a date, it must be within this range of the index."
        )

    if not isinstance(show_progress, bool):
        raise TypeError("`show_progress` must be a boolean: `True`, `False`.")
    if not isinstance(suppress_warnings, bool):
        raise TypeError("`suppress_warnings` must be a boolean: `True`, `False`.")

    if not suppress_warnings:
        warnings.warn(
            "One-step-ahead predictions are used for faster model comparison, but they "
            "may not fully represent multi-step prediction performance. It is recommended "
            "to backtest the final model for a more accurate multi-step performance "
            "estimate.", OneStepAheadValidationWarning
        )

skforecast.model_selection._utils.select_n_jobs_backtesting ¶

select_n_jobs_backtesting(forecaster, refit)

Select the optimal number of jobs to use in the backtesting process. This selection is based on heuristics and is not guaranteed to be optimal.

The number of jobs is chosen as follows:

- If refit is an integer greater than 1, then n_jobs = 1. This is because parallelization doesn't work with intermittent refit.
- If forecaster is 'ForecasterRecursive' and regressor is a linear regressor, then n_jobs = 1.
- If forecaster is 'ForecasterRecursive' and regressor is not a linear regressor, then n_jobs = cpu_count() - 1.
- If forecaster is 'ForecasterDirect' or 'ForecasterDirectMultiVariate', then n_jobs = 1 regardless of refit, because parallelization is applied during the fitting process.
- If forecaster is 'ForecasterRecursiveMultiSeries', then n_jobs = cpu_count() - 1.
- If forecaster is 'ForecasterSarimax' or 'ForecasterEquivalentDate', then n_jobs = 1.
- If regressor is a LGBMRegressor(n_jobs=1), then n_jobs = cpu_count() - 1.
- If regressor is a LGBMRegressor with internal n_jobs != 1, then n_jobs = 1. This is because lightgbm is highly optimized for gradient boosting and parallelizes operations at a very fine-grained level, making additional parallelization unnecessary and potentially harmful due to resource contention.
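
The heuristic can be sketched as a pure function over names rather than objects. This is a simplified illustration, not the library code: `sketch_select_n_jobs`, its parameters, the fixed `n_cores` and the short `linear_regressors` tuple are all assumptions made for the example. It follows the source listing further down, in which the direct forecasters always get `n_jobs = 1`.

```python
def sketch_select_n_jobs(forecaster_name, regressor_name, refit,
                         regressor_n_jobs=None, n_cores=8,
                         linear_regressors=("LinearRegression", "Ridge", "Lasso")):
    """Simplified sketch of the n_jobs heuristic (names instead of objects)."""
    # Intermittent refit (an integer other than 0 or 1) disables parallelism.
    refit = False if refit == 0 else refit
    if not isinstance(refit, bool) and refit != 1:
        return 1

    recursive = ("ForecasterRecursive", "ForecasterRecursiveMultiSeries")
    if forecaster_name in recursive:
        # Linear models are only special-cased for ForecasterRecursive.
        if (forecaster_name == "ForecasterRecursive"
                and regressor_name in linear_regressors):
            return 1
        # LightGBM parallelizes internally unless it was given n_jobs=1.
        if regressor_name == "LGBMRegressor":
            return n_cores - 1 if regressor_n_jobs == 1 else 1
        return n_cores - 1

    # Direct forecasters, Sarimax, EquivalentDate and anything unknown.
    return 1


print(sketch_select_n_jobs("ForecasterRecursive", "Ridge", refit=True))
print(sketch_select_n_jobs("ForecasterRecursive", "RandomForestRegressor", refit=True))
print(sketch_select_n_jobs("ForecasterRecursive", "LGBMRegressor", refit=True,
                           regressor_n_jobs=-1))
print(sketch_select_n_jobs("ForecasterRecursiveMultiSeries", "Ridge", refit=3))
```

With the assumed 8 cores, the four calls yield 1, 7, 1 and 1 respectively.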

Parameters:

Name	Type	Description	Default
`forecaster`	`Forecaster`	Forecaster model.	required
`refit`	`(bool, int)`	If the forecaster is refitted during the backtesting process.	required

Returns:

Name	Type	Description
`n_jobs`	`int`	The number of jobs to run in parallel.

Source code in skforecast\model_selection\_utils.py

def select_n_jobs_backtesting(
    forecaster: object,
    refit: bool | int
) -> int:
    """
    Select the optimal number of jobs to use in the backtesting process. This
    selection is based on heuristics and is not guaranteed to be optimal.

    The number of jobs is chosen as follows:

    - If `refit` is an integer, then `n_jobs = 1`. This is because parallelization doesn't 
    work with intermittent refit.
    - If forecaster is 'ForecasterRecursive' and regressor is a linear regressor, 
    then `n_jobs = 1`.
    - If forecaster is 'ForecasterRecursive' and regressor is not a linear 
    regressor then `n_jobs = cpu_count() - 1`.
    - If forecaster is 'ForecasterDirect' or 'ForecasterDirectMultiVariate',
    then `n_jobs = 1`, because parallelization is applied during the fitting
    process.
    - If forecaster is 'ForecasterRecursiveMultiSeries', then `n_jobs = cpu_count() - 1`.
    - If forecaster is 'ForecasterSarimax' or 'ForecasterEquivalentDate', 
    then `n_jobs = 1`.
    - If regressor is a `LGBMRegressor(n_jobs=1)`, then `n_jobs = cpu_count() - 1`.
    - If regressor is a `LGBMRegressor` with internal n_jobs != 1, then `n_jobs = 1`.
    This is because `lightgbm` is highly optimized for gradient boosting and
    parallelizes operations at a very fine-grained level, making additional
    parallelization unnecessary and potentially harmful due to resource contention.

    Parameters
    ----------
    forecaster : Forecaster
        Forecaster model.
    refit : bool, int
        If the forecaster is refitted during the backtesting process.

    Returns
    -------
    n_jobs : int
        The number of jobs to run in parallel.

    """

    forecaster_name = type(forecaster).__name__

    # When the regressor is a Pipeline, the final step is the actual estimator.
    if isinstance(forecaster.regressor, Pipeline):
        regressor = forecaster.regressor[-1]
    else:
        regressor = forecaster.regressor
    regressor_name = type(regressor).__name__

    # Public names exported by sklearn.linear_model.
    linear_regressors = [
        name
        for name in dir(sklearn.linear_model)
        if not name.startswith('_')
    ]

    refit = False if refit == 0 else refit
    if not isinstance(refit, bool) and refit != 1:
        n_jobs = 1
    else:
        if forecaster_name in ['ForecasterRecursive']:
            if regressor_name in linear_regressors:
                n_jobs = 1
            elif regressor_name == 'LGBMRegressor':
                n_jobs = cpu_count() - 1 if regressor.n_jobs == 1 else 1
            else:
                n_jobs = cpu_count() - 1
        elif forecaster_name in ['ForecasterDirect', 'ForecasterDirectMultiVariate']:
            # Parallelization is applied during the fitting process.
            n_jobs = 1
        elif forecaster_name in ['ForecasterRecursiveMultiSeries']:
            if regressor_name == 'LGBMRegressor':
                n_jobs = cpu_count() - 1 if regressor.n_jobs == 1 else 1
            else:
                n_jobs = cpu_count() - 1
        elif forecaster_name in ['ForecasterSarimax', 'ForecasterEquivalentDate']:
            n_jobs = 1
        else:
            n_jobs = 1

    return n_jobs
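
The two lines that normalize `refit` rely on `bool` being a subclass of `int` in Python, which makes them easy to misread. The sketch below isolates that test; `is_intermittent_refit` is a hypothetical helper name introduced only for illustration.

```python
def is_intermittent_refit(refit):
    """Return True when `refit` is an integer other than 0 or 1."""
    # 0 (and False, since False == 0) is treated as "no refit".
    refit = False if refit == 0 else refit
    # Booleans and the integer 1 mean "never" or "every fold"; any other
    # integer means refitting every n-th fold, which forces n_jobs = 1.
    return not isinstance(refit, bool) and refit != 1


for value in (False, True, 0, 1, 3):
    print(repr(value), is_intermittent_refit(value))
```

Only the last value, `3`, is reported as intermittent; `0` and `1` behave like `False` and `True`.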

