Feature selection using any of the sklearn.feature_selection module selectors
(such as RFECV, SelectFromModel, etc.). Two groups of features are
evaluated: autoregressive features (lags and window features) and exogenous
features. By default, the selection process is performed on both sets of features
at the same time, so that the most relevant autoregressive and exogenous features
are selected. However, using the select_only argument, the selection process
can focus only on the autoregressive or exogenous features without taking into
account the other features. Therefore, all other features will remain in the model.
It is also possible to force the inclusion of certain features in the final list
of selected features using the force_inclusion parameter.
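The effect of select_only can be sketched in plain Python. This is only an illustration of the behaviour described above: the helper `columns_for_selector` and the column names are hypothetical and are not part of the library.

```python
def columns_for_selector(autoreg_cols, exog_cols, select_only=None):
    """Hypothetical helper: split columns into those the selector evaluates
    and those that bypass the selector and are returned unchanged."""
    if select_only not in ("autoreg", "exog", None):
        raise ValueError("select_only must be 'autoreg', 'exog' or None")
    if select_only == "autoreg":
        # Only autoregressive columns are evaluated; exog columns pass through.
        return list(autoreg_cols), list(exog_cols)
    if select_only == "exog":
        # Only exogenous columns are evaluated; autoregressive columns pass through.
        return list(exog_cols), list(autoreg_cols)
    # Default: every column is evaluated, nothing passes through untouched.
    return list(autoreg_cols) + list(exog_cols), []


# Illustrative column names (assumed, not taken from a real forecaster).
autoreg_cols = ["lag_1", "lag_2", "roll_mean_7"]
exog_cols = ["sun_hours", "temperature"]

evaluated, kept = columns_for_selector(autoreg_cols, exog_cols, select_only="exog")
print(evaluated)  # ['sun_hours', 'temperature']
print(kept)       # ['lag_1', 'lag_2', 'roll_mean_7']
```

With select_only=None the second list is empty, so every feature has to earn its place through the selector.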
Parameters:

forecaster : ForecasterRecursive, ForecasterDirect (required)
    Forecaster model. If forecaster is a ForecasterDirect, the
    selector will only be applied to the features of the first step.
selector : object (required)
    A feature selector from sklearn.feature_selection.
y : pandas Series, pandas DataFrame (required)
    Target time series to which the feature selection will be applied.
exog : pandas Series, pandas DataFrame (default None)
    Exogenous variable/s included as predictor/s. Must have the same
    number of observations as y and should be aligned so that y[i] is
    regressed on exog[i].
select_only : str (default None)
    Decide what type of features to include in the selection process.
    - If 'autoreg', only autoregressive features (lags and window features)
      are evaluated by the selector. All exogenous features are included in
      the output selected_exog.
    - If 'exog', only exogenous features are evaluated without the presence
      of autoregressive features. All autoregressive features are included
      in the outputs selected_lags and selected_window_features.
    - If None, all features are evaluated by the selector.
force_inclusion : list, str (default None)
    Features to force include in the final list of selected features.
    - If list, list of feature names to force include.
    - If str, regular expression to identify features to force include.
      For example, if force_inclusion="^sun_", all features that begin
      with "sun_" will be included in the final list of selected features.
subsample : int, float (default 0.5)
    Proportion of records to use for feature selection.
random_state : int (default 123)
    Sets a seed for the random subsample so that the subsampling process
    is always deterministic.
verbose : bool (default True)
    Print information about feature selection process.

Returns:

selected_lags : list
    List of selected lags.
selected_window_features : list
    List of selected window features.
selected_exog : list
    List of selected exogenous features.
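The two accepted forms of the force_inclusion parameter behave differently, which a short sketch makes easier to see. The helper name `resolve_forced` and the column names are hypothetical. The matching rule (`re.match`, which only matches at the start of the string) is the one used in the source listing on this page.

```python
import re


def resolve_forced(force_inclusion, available_cols):
    """Hypothetical helper: return the available columns that are forced in."""
    if isinstance(force_inclusion, list):
        # Exact names; names that are not available are silently ignored.
        return [col for col in force_inclusion if col in available_cols]
    if isinstance(force_inclusion, str):
        # re.match anchors at the beginning of the name, so "sun" and "^sun"
        # select the same columns.
        return [col for col in available_cols if re.match(force_inclusion, col)]
    return []


# Illustrative column names (assumed).
cols = ["sun_hours", "sun_angle", "temperature", "is_sunny"]

print(resolve_forced("^sun_", cols))                   # ['sun_hours', 'sun_angle']
print(resolve_forced("sun", cols))                     # ['sun_hours', 'sun_angle']
print(resolve_forced(".*sun", cols))                   # ['sun_hours', 'sun_angle', 'is_sunny']
print(resolve_forced(["temperature", "wind"], cols))   # ['temperature']
```

Because the pattern is applied with `re.match`, a pattern meant to match in the middle of a name needs an explicit leading `.*`.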
def select_features(
    forecaster: object,
    selector: object,
    y: pd.Series | pd.DataFrame,
    exog: pd.Series | pd.DataFrame | None = None,
    select_only: str | None = None,
    force_inclusion: list[str] | str | None = None,
    subsample: int | float = 0.5,
    random_state: int = 123,
    verbose: bool = True
) -> tuple[list[int], list[str], list[str]]:
    """
    Feature selection using any of the sklearn.feature_selection module selectors
    (such as `RFECV`, `SelectFromModel`, etc.). Two groups of features are
    evaluated: autoregressive features (lags and window features) and exogenous
    features. By default, the selection process is performed on both sets of features
    at the same time, so that the most relevant autoregressive and exogenous features
    are selected. However, using the `select_only` argument, the selection process
    can focus only on the autoregressive or exogenous features without taking into
    account the other features. Therefore, all other features will remain in the model.
    It is also possible to force the inclusion of certain features in the final list
    of selected features using the `force_inclusion` parameter.

    Parameters
    ----------
    forecaster : ForecasterRecursive, ForecasterDirect
        Forecaster model. If forecaster is a ForecasterDirect, the
        selector will only be applied to the features of the first step.
    selector : object
        A feature selector from sklearn.feature_selection.
    y : pandas Series, pandas DataFrame
        Target time series to which the feature selection will be applied.
    exog : pandas Series, pandas DataFrame, default None
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    select_only : str, default None
        Decide what type of features to include in the selection process.

        - If `'autoreg'`, only autoregressive features (lags and window features)
        are evaluated by the selector. All exogenous features are included in the
        output `selected_exog`.
        - If `'exog'`, only exogenous features are evaluated without the presence
        of autoregressive features. All autoregressive features are included
        in the outputs `selected_lags` and `selected_window_features`.
        - If `None`, all features are evaluated by the selector.
    force_inclusion : list, str, default None
        Features to force include in the final list of selected features.

        - If `list`, list of feature names to force include.
        - If `str`, regular expression to identify features to force include.
        For example, if `force_inclusion="^sun_"`, all features that begin
        with "sun_" will be included in the final list of selected features.
    subsample : int, float, default 0.5
        Proportion of records to use for feature selection.
    random_state : int, default 123
        Sets a seed for the random subsample so that the subsampling process
        is always deterministic.
    verbose : bool, default True
        Print information about feature selection process.

    Returns
    -------
    selected_lags : list
        List of selected lags.
    selected_window_features : list
        List of selected window features.
    selected_exog : list
        List of selected exogenous features.

    """

    forecaster_name = type(forecaster).__name__
    valid_forecasters = ['ForecasterRecursive', 'ForecasterDirect']

    if forecaster_name not in valid_forecasters:
        raise TypeError(
            f"`forecaster` must be one of the following classes: {valid_forecasters}."
        )

    if select_only not in ['autoreg', 'exog', None]:
        raise ValueError(
            "`select_only` must be one of the following values: 'autoreg', 'exog', None."
        )

    if subsample <= 0 or subsample > 1:
        raise ValueError(
            "`subsample` must be a number greater than 0 and less than or equal to 1."
        )

    forecaster = deepcopy(forecaster)
    forecaster.is_fitted = False
    X_train, y_train = forecaster.create_train_X_y(y=y, exog=exog)
    if forecaster_name == 'ForecasterDirect':
        X_train, y_train = forecaster.filter_train_X_y_for_step(
            step=1,
            X_train=X_train,
            y_train=y_train,
            remove_suffix=True
        )

    lags_cols = []
    window_features_cols = []
    autoreg_cols = []
    if forecaster.lags is not None:
        lags_cols = forecaster.lags_names
        autoreg_cols.extend(lags_cols)
    if forecaster.window_features is not None:
        window_features_cols = forecaster.window_features_names
        autoreg_cols.extend(window_features_cols)

    exog_cols = [col for col in X_train.columns if col not in autoreg_cols]

    forced_autoreg = []
    forced_exog = []
    if force_inclusion is not None:
        if isinstance(force_inclusion, list):
            forced_autoreg = [col for col in force_inclusion if col in autoreg_cols]
            forced_exog = [col for col in force_inclusion if col in exog_cols]
        elif isinstance(force_inclusion, str):
            forced_autoreg = [col for col in autoreg_cols if re.match(force_inclusion, col)]
            forced_exog = [col for col in exog_cols if re.match(force_inclusion, col)]

    if select_only == 'autoreg':
        X_train = X_train.drop(columns=exog_cols)
    elif select_only == 'exog':
        X_train = X_train.drop(columns=autoreg_cols)

    if isinstance(subsample, float):
        subsample = int(len(X_train) * subsample)

    rng = np.random.default_rng(seed=random_state)
    sample = rng.integers(low=0, high=len(X_train), size=subsample)
    X_train_sample = X_train.iloc[sample, :]
    y_train_sample = y_train.iloc[sample]
    selector.fit(X_train_sample, y_train_sample)
    selected_features = selector.get_feature_names_out()

    if select_only == 'exog':
        selected_autoreg = autoreg_cols
    else:
        selected_autoreg = [
            feature
            for feature in selected_features
            if feature in autoreg_cols
        ]

    if select_only == 'autoreg':
        selected_exog = exog_cols
    else:
        selected_exog = [
            feature
            for feature in selected_features
            if feature in exog_cols
        ]

    if force_inclusion is not None:
        if select_only != 'autoreg':
            forced_exog_not_selected = set(forced_exog) - set(selected_features)
            selected_exog.extend(forced_exog_not_selected)
            selected_exog.sort(key=exog_cols.index)
        if select_only != 'exog':
            forced_autoreg_not_selected = set(forced_autoreg) - set(selected_features)
            selected_autoreg.extend(forced_autoreg_not_selected)
            selected_autoreg.sort(key=autoreg_cols.index)

    if len(selected_autoreg) == 0:
        warnings.warn(
            "No autoregressive features have been selected. Since a Forecaster "
            "cannot be created without them, be sure to include at least one "
            "using the `force_inclusion` parameter."
        )
        selected_lags = []
        selected_window_features = []
    else:
        selected_lags = [
            int(feature.replace('lag_', ''))
            for feature in selected_autoreg
            if feature in lags_cols
        ]
        selected_window_features = [
            feature
            for feature in selected_autoreg
            if feature in window_features_cols
        ]

    if verbose:
        print(f"Recursive feature elimination ({selector.__class__.__name__})")
        print("--------------------------------" + "-" * len(selector.__class__.__name__))
        print(f"Total number of records available: {X_train.shape[0]}")
        print(f"Total number of records used for feature selection: {X_train_sample.shape[0]}")
        print(f"Number of features available: {len(autoreg_cols) + len(exog_cols)}")
        print(f" Lags (n={len(lags_cols)})")
        print(f" Window features (n={len(window_features_cols)})")
        print(f" Exog (n={len(exog_cols)})")
        print(f"Number of features selected: {len(selected_features)}")
        print(f" Lags (n={len(selected_lags)}) : {selected_lags}")
        print(f" Window features (n={len(selected_window_features)}) : {selected_window_features}")
        print(f" Exog (n={len(selected_exog)}) : {selected_exog}")

    return selected_lags, selected_window_features, selected_exog
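The post-processing step in the listing above (adding forced features that the selector dropped, restoring the original column order, then turning lag names into integers) can be isolated in a small sketch. The helper `merge_forced` and the column names are hypothetical; the operations mirror the `extend` / `sort(key=...index)` / `replace('lag_', '')` calls in the source.

```python
def merge_forced(selected, forced, all_cols):
    """Hypothetical helper: add forced columns that were not selected and
    sort the result by the position each column has in `all_cols`."""
    merged = list(selected)
    merged.extend(set(forced) - set(selected))
    merged.sort(key=all_cols.index)
    return merged


# Illustrative names (assumed): three lags plus one window feature.
lags_cols = ["lag_1", "lag_2", "lag_3"]
window_features_cols = ["roll_mean_7"]
autoreg_cols = lags_cols + window_features_cols

selected_by_selector = ["lag_3", "roll_mean_7"]
forced = ["lag_1", "lag_3"]  # lag_3 is already selected, lag_1 is not

selected_autoreg = merge_forced(selected_by_selector, forced, autoreg_cols)
selected_lags = [
    int(name.replace("lag_", "")) for name in selected_autoreg if name in lags_cols
]
selected_window_features = [
    name for name in selected_autoreg if name in window_features_cols
]

print(selected_autoreg)          # ['lag_1', 'lag_3', 'roll_mean_7']
print(selected_lags)             # [1, 3]
print(selected_window_features)  # ['roll_mean_7']
```

Sorting by the original column index keeps the output order stable, since the set difference used to find the missing forced features has no guaranteed order.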
Feature selection using any of the sklearn.feature_selection module selectors
(such as RFECV, SelectFromModel, etc.). Two groups of features are
evaluated: autoregressive features and exogenous features. By default, the
selection process is performed on both sets of features at the same time,
so that the most relevant autoregressive and exogenous features are selected.
However, using the select_only argument, the selection process can focus
only on the autoregressive or exogenous features without taking into account
the other features. Therefore, all other features will remain in the model.
It is also possible to force the inclusion of certain features in the final
list of selected features using the force_inclusion parameter.
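Parameters:

forecaster : ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate (required)
    Forecaster model. If forecaster is a ForecasterDirectMultiVariate, the
    selector will only be applied to the features of the first step.
selector : object (required)
    A feature selector from sklearn.feature_selection.
series : pandas DataFrame, dict (required)
    Target time series to which the feature selection will be applied.
exog : pandas Series, pandas DataFrame, dict (default None)
    Exogenous variables.
select_only : str (default None)
    Decide what type of features to include in the selection process.
    - If 'autoreg', only autoregressive features (lags and window features)
      are evaluated by the selector. All exogenous features are
      included in the output selected_exog.
    - If 'exog', only exogenous features are evaluated without the presence
      of autoregressive features. All autoregressive features are included
      in the outputs selected_lags and selected_window_features.
    - If None, all features are evaluated by the selector.
force_inclusion : list, str (default None)
    Features to force include in the final list of selected features.
    - If list, list of feature names to force include.
    - If str, regular expression to identify features to force include.
      For example, if force_inclusion="^sun_", all features that begin
      with "sun_" will be included in the final list of selected features.
subsample : int, float (default 0.5)
    Proportion of records to use for feature selection.
random_state : int (default 123)
    Sets a seed for the random subsample so that the subsampling process
    is always deterministic.
verbose : bool (default True)
    Print information about feature selection process.

Returns:

selected_lags : list, dict
    List of selected lags. If the forecaster is a ForecasterDirectMultiVariate,
    the output is a dict with the selected lags for each series, {series_name: lags},
    as the lags can be different for each series.
selected_window_features : list
    List of selected window features.
selected_exog : list
    List of selected exogenous features.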
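For the ForecasterDirectMultiVariate case, the dict-shaped selected_lags output can be illustrated without the library. The sketch below assumes a hypothetical `lags_names` mapping of the form {series_name: list of "<series_name>_lag_<n>" names, or None}, which is the shape the source listing on this page iterates over. The series names and selected features are made up.

```python
# Hypothetical lag names per series; None means the series has no lags.
lags_names = {
    "series_a": ["series_a_lag_1", "series_a_lag_2"],
    "series_b": None,
    "series_c": ["series_c_lag_1", "series_c_lag_3"],
}

# Hypothetical result of the selection step.
selected_autoreg = ["series_a_lag_2", "series_c_lag_1", "series_c_lag_3", "roll_mean_7"]

# Build {series_name: [lag numbers]} by stripping the "<series_name>_lag_" prefix.
selected_lags = {
    series_name: (
        [
            int(feature.replace(f"{series_name}_lag_", ""))
            for feature in selected_autoreg
            if feature in names
        ]
        if names is not None
        else []
    )
    for series_name, names in lags_names.items()
}

print(selected_lags)  # {'series_a': [2], 'series_b': [], 'series_c': [1, 3]}
```

A series with no surviving lags maps to an empty list rather than being dropped from the dict.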
def select_features_multiseries(
    forecaster: object,
    selector: object,
    series: pd.DataFrame | dict[str, pd.Series | pd.DataFrame],
    exog: pd.Series | pd.DataFrame | dict[str, pd.Series | pd.DataFrame] | None = None,
    select_only: str | None = None,
    force_inclusion: list[str] | str | None = None,
    subsample: int | float = 0.5,
    random_state: int = 123,
    verbose: bool = True,
) -> tuple[list[int] | dict[str, int], list[str], list[str]]:
    """
    Feature selection using any of the sklearn.feature_selection module selectors
    (such as `RFECV`, `SelectFromModel`, etc.). Two groups of features are
    evaluated: autoregressive features and exogenous features. By default, the
    selection process is performed on both sets of features at the same time,
    so that the most relevant autoregressive and exogenous features are selected.
    However, using the `select_only` argument, the selection process can focus
    only on the autoregressive or exogenous features without taking into account
    the other features. Therefore, all other features will remain in the model.
    It is also possible to force the inclusion of certain features in the final
    list of selected features using the `force_inclusion` parameter.

    Parameters
    ----------
    forecaster : ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate
        Forecaster model. If forecaster is a ForecasterDirectMultiVariate, the
        selector will only be applied to the features of the first step.
    selector : object
        A feature selector from sklearn.feature_selection.
    series : pandas DataFrame, dict
        Target time series to which the feature selection will be applied.
    exog : pandas Series, pandas DataFrame, dict, default None
        Exogenous variables.
    select_only : str, default None
        Decide what type of features to include in the selection process.

        - If `'autoreg'`, only autoregressive features (lags and window features)
        are evaluated by the selector. All exogenous features are
        included in the output `selected_exog`.
        - If `'exog'`, only exogenous features are evaluated without the presence
        of autoregressive features. All autoregressive features are included
        in the outputs `selected_lags` and `selected_window_features`.
        - If `None`, all features are evaluated by the selector.
    force_inclusion : list, str, default None
        Features to force include in the final list of selected features.

        - If `list`, list of feature names to force include.
        - If `str`, regular expression to identify features to force include.
        For example, if `force_inclusion="^sun_"`, all features that begin
        with "sun_" will be included in the final list of selected features.
    subsample : int, float, default 0.5
        Proportion of records to use for feature selection.
    random_state : int, default 123
        Sets a seed for the random subsample so that the subsampling process
        is always deterministic.
    verbose : bool, default True
        Print information about feature selection process.

    Returns
    -------
    selected_lags : list, dict
        List of selected lags. If the forecaster is a ForecasterDirectMultiVariate,
        the output is a dict with the selected lags for each series,
        {series_name: lags}, as the lags can be different for each series.
    selected_window_features : list
        List of selected window features.
    selected_exog : list
        List of selected exogenous features.

    """

    forecaster_name = type(forecaster).__name__
    valid_forecasters = [
        'ForecasterRecursiveMultiSeries',
        'ForecasterDirectMultiVariate'
    ]

    if forecaster_name not in valid_forecasters:
        raise TypeError(
            f"`forecaster` must be one of the following classes: {valid_forecasters}."
        )

    if select_only not in ['autoreg', 'exog', None]:
        raise ValueError(
            "`select_only` must be one of the following values: 'autoreg', 'exog', None."
        )

    if subsample <= 0 or subsample > 1:
        raise ValueError(
            "`subsample` must be a number greater than 0 and less than or equal to 1."
        )

    forecaster = deepcopy(forecaster)
    forecaster.is_fitted = False
    output = forecaster._create_train_X_y(series=series, exog=exog)
    X_train = output[0]
    y_train = output[1]

    if forecaster_name == 'ForecasterDirectMultiVariate':
        X_train, y_train = forecaster.filter_train_X_y_for_step(
            step=1,
            X_train=X_train,
            y_train=y_train,
            remove_suffix=True
        )
        lags_cols = list(
            chain(*[v for v in forecaster.lags_names.values() if v is not None])
        )
        window_features_cols = forecaster.X_train_window_features_names_out_
        encoding_cols = []
    else:
        lags_cols = forecaster.lags_names
        window_features_cols = output[6]  # X_train_window_features_names_out_ output
        if forecaster.encoding == 'onehot':
            encoding_cols = output[4]  # X_train_series_names_in_ output
        else:
            encoding_cols = ['_level_skforecast']

    lags_cols = [] if lags_cols is None else lags_cols
    window_features_cols = [] if window_features_cols is None else window_features_cols
    autoreg_cols = []
    if forecaster.lags is not None:
        autoreg_cols.extend(lags_cols)
    if forecaster.window_features is not None:
        autoreg_cols.extend(window_features_cols)

    exog_cols = [
        col
        for col in X_train.columns
        if col not in autoreg_cols and col not in encoding_cols
    ]

    forced_autoreg = []
    forced_exog = []
    if force_inclusion is not None:
        if isinstance(force_inclusion, list):
            forced_autoreg = [col for col in force_inclusion if col in autoreg_cols]
            forced_exog = [col for col in force_inclusion if col in exog_cols]
        elif isinstance(force_inclusion, str):
            forced_autoreg = [col for col in autoreg_cols if re.match(force_inclusion, col)]
            forced_exog = [col for col in exog_cols if re.match(force_inclusion, col)]

    if select_only == 'autoreg':
        X_train = X_train.drop(columns=exog_cols + encoding_cols)
    elif select_only == 'exog':
        X_train = X_train.drop(columns=autoreg_cols + encoding_cols)
    else:
        X_train = X_train.drop(columns=encoding_cols)

    if isinstance(subsample, float):
        subsample = int(len(X_train) * subsample)

    rng = np.random.default_rng(seed=random_state)
    sample = rng.integers(low=0, high=len(X_train), size=subsample)
    X_train_sample = X_train.iloc[sample, :]
    y_train_sample = y_train.iloc[sample]
    selector.fit(X_train_sample, y_train_sample)
    selected_features = selector.get_feature_names_out()

    if select_only == 'exog':
        selected_autoreg = autoreg_cols
    else:
        selected_autoreg = [
            feature
            for feature in selected_features
            if feature in autoreg_cols
        ]

    if select_only == 'autoreg':
        selected_exog = exog_cols
    else:
        selected_exog = [
            feature
            for feature in selected_features
            if feature in exog_cols
        ]

    if force_inclusion is not None:
        if select_only != 'autoreg':
            forced_exog_not_selected = set(forced_exog) - set(selected_features)
            selected_exog.extend(forced_exog_not_selected)
            selected_exog.sort(key=exog_cols.index)
        if select_only != 'exog':
            forced_autoreg_not_selected = set(forced_autoreg) - set(selected_features)
            selected_autoreg.extend(forced_autoreg_not_selected)
            selected_autoreg.sort(key=autoreg_cols.index)

    if len(selected_autoreg) == 0:
        warnings.warn(
            "No autoregressive features have been selected. Since a Forecaster "
            "cannot be created without them, be sure to include at least one "
            "using the `force_inclusion` parameter."
        )
        selected_lags = []
        selected_window_features = []
        verbose_selected_lags = []
    else:
        if forecaster_name == 'ForecasterDirectMultiVariate':
            selected_lags = {
                series_name: (
                    [
                        int(feature.replace(f"{series_name}_lag_", ""))
                        for feature in selected_autoreg
                        if feature in lags_names
                    ]
                    if lags_names is not None
                    else []
                )
                for series_name, lags_names in forecaster.lags_names.items()
            }
            verbose_selected_lags = [
                feature
                for feature in selected_autoreg
                if feature in lags_cols
            ]
        else:
            selected_lags = [
                int(feature.replace('lag_', ''))
                for feature in selected_autoreg
                if feature in lags_cols
            ]
            verbose_selected_lags = selected_lags

        selected_window_features = [
            feature
            for feature in selected_autoreg
            if feature in window_features_cols
        ]

    if verbose:
        print(f"Recursive feature elimination ({selector.__class__.__name__})")
        print("--------------------------------" + "-" * len(selector.__class__.__name__))
        print(f"Total number of records available: {X_train.shape[0]}")
        print(f"Total number of records used for feature selection: {X_train_sample.shape[0]}")
        print(f"Number of features available: {len(autoreg_cols) + len(exog_cols)}")
        print(f" Lags (n={len(lags_cols)})")
        print(f" Window features (n={len(window_features_cols)})")
        print(f" Exog (n={len(exog_cols)})")
        print(f"Number of features selected: {len(selected_features)}")
        print(f" Lags (n={len(verbose_selected_lags)}) : {verbose_selected_lags}")
        print(f" Window features (n={len(selected_window_features)}) : {selected_window_features}")
        print(f" Exog (n={len(selected_exog)}) : {selected_exog}")

    return selected_lags, selected_window_features, selected_exog
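The subsampling step shared by both functions can be reproduced on its own. The sketch below assumes NumPy is installed. The helper name `subsample_indices` is hypothetical, while the validation and the `default_rng(...).integers(...)` call mirror the source listing. Note that `integers` draws with replacement, so the same row position can appear more than once.

```python
import numpy as np


def subsample_indices(n_rows, subsample=0.5, random_state=123):
    """Hypothetical helper: row positions used to fit the selector."""
    if subsample <= 0 or subsample > 1:
        raise ValueError(
            "`subsample` must be a number greater than 0 and less than or equal to 1."
        )
    # A float is read as a proportion of the available rows.
    size = int(n_rows * subsample) if isinstance(subsample, float) else subsample
    rng = np.random.default_rng(seed=random_state)
    # Positions are drawn with replacement from [0, n_rows).
    return rng.integers(low=0, high=n_rows, size=size)


idx_a = subsample_indices(n_rows=100, subsample=0.5, random_state=123)
idx_b = subsample_indices(n_rows=100, subsample=0.5, random_state=123)

print(len(idx_a))                    # 50
print(np.array_equal(idx_a, idx_b))  # True: same seed, same sample
```

With the validation as written, any value above 1 is rejected, so the only integer that passes is 1, which yields a single-row sample. A float of 1.0 draws as many positions as there are rows, still with replacement.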