feature_selection

skforecast.feature_selection.feature_selection.select_features

select_features(
    forecaster,
    selector,
    y,
    exog=None,
    select_only=None,
    force_inclusion=None,
    subsample=0.5,
    random_state=123,
    verbose=True,
)

Feature selection using any of the selectors in the sklearn.feature_selection module (such as RFECV, SelectFromModel, etc.). Two groups of features are evaluated: autoregressive features (lags and window features) and exogenous features. By default, both groups are evaluated together, so that the most relevant autoregressive and exogenous features are selected. With the select_only argument, the selection can instead be restricted to either the autoregressive or the exogenous features; the selector then does not see the other group, and all features of that group remain in the model. It is also possible to force the inclusion of certain features in the final list of selected features using the force_inclusion parameter.
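As a rough illustration of the three select_only modes described above, the sketch below mimics which columns are handed to the selector and which bypass it. The helper name `split_for_selection` and the column names are hypothetical and not part of skforecast.

```python
def split_for_selection(autoreg_cols, exog_cols, select_only=None):
    """Return (columns evaluated by the selector, columns kept without evaluation)."""
    if select_only not in ('autoreg', 'exog', None):
        raise ValueError("select_only must be 'autoreg', 'exog' or None.")
    if select_only == 'autoreg':
        # Only autoregressive features are evaluated; exog features are kept.
        return list(autoreg_cols), list(exog_cols)
    if select_only == 'exog':
        # Only exogenous features are evaluated; autoregressive ones are kept.
        return list(exog_cols), list(autoreg_cols)
    # Default: every feature is evaluated, nothing is kept automatically.
    return list(autoreg_cols) + list(exog_cols), []


# Hypothetical column names, for illustration only.
autoreg = ['lag_1', 'lag_2', 'roll_mean_7']
exog = ['temperature', 'holiday']

for mode in (None, 'autoreg', 'exog'):
    evaluated, kept = split_for_selection(autoreg, exog, select_only=mode)
    print(f"select_only={mode!r}: evaluated={evaluated}, kept as-is={kept}")
```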

Parameters:

Name Type Description Default
forecaster (ForecasterRecursive, ForecasterDirect)

Forecaster model. If forecaster is a ForecasterDirect, the selector will only be applied to the features of the first step.

required
selector object

A feature selector from sklearn.feature_selection.

required
y pandas Series, pandas DataFrame

Target time series to which the feature selection will be applied.

required
exog pandas Series, pandas DataFrame

Exogenous variable(s) included as predictor(s). Must have the same number of observations as y and should be aligned so that y[i] is regressed on exog[i].

`None`
select_only str

Decide what type of features to include in the selection process.

  • If 'autoreg', only autoregressive features (lags and window features) are evaluated by the selector. All exogenous features are included in the output selected_exog.
  • If 'exog', only exogenous features are evaluated without the presence of autoregressive features. All autoregressive features are included in the outputs selected_lags and selected_window_features.
  • If None, all features are evaluated by the selector.
`None`
force_inclusion (list, str)

Features to force into the final list of selected features.

  • If list, names of the features to include.
  • If str, a regular expression identifying the features to include. For example, if force_inclusion="^sun_", all features that begin with "sun_" will be included in the final list of selected features.
`None`
subsample (int, float)

Proportion of records to use for feature selection. Must be greater than 0 and less than or equal to 1. Records are drawn at random with replacement.

`0.5`
random_state int

Sets a seed for the random subsample so that the subsampling process is always deterministic.

`123`
verbose bool

Print information about the feature selection process.

`True`
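The two forms of force_inclusion can be sketched as follows. A list is matched by exact name, while a string is applied with `re.match`, which (as in the source listing on this page) anchors the pattern at the start of each feature name. The helper `resolve_forced` and the feature names are hypothetical.

```python
import re


def resolve_forced(force_inclusion, available_cols):
    """Return the available columns that would be forced into the selection."""
    if force_inclusion is None:
        return []
    if isinstance(force_inclusion, list):
        # Exact-name matching; names that are not available are ignored.
        return [col for col in force_inclusion if col in available_cols]
    # String: treated as a regular expression matched from the start of the name.
    return [col for col in available_cols if re.match(force_inclusion, col)]


# Hypothetical feature names, for illustration only.
cols = ['lag_1', 'lag_2', 'sun_hours', 'sun_angle', 'no_sun_flag']

print(resolve_forced(['lag_1', 'missing'], cols))
print(resolve_forced('^sun_', cols))
print(resolve_forced('sun_', cols))  # same result: re.match already anchors at the start
```

Because matching starts at the beginning of the name, a pattern meant to match in the middle of a name needs an explicit prefix such as `.*`.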

Returns:

Name Type Description
selected_lags list

List of selected lags.

selected_window_features list

List of selected window features.

selected_exog list

List of selected exogenous features.
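To make the three return values concrete, the simplified sketch below splits a list of selected column names in the spirit of the source listing: lag columns named `lag_<n>` become integers, while window and exogenous features stay as names. `split_selected` and all feature names are hypothetical.

```python
def split_selected(selected, lags_cols, window_features_cols):
    """Split selected column names into (lags as ints, window features, exog)."""
    selected_lags = [
        int(name.replace('lag_', '')) for name in selected if name in lags_cols
    ]
    selected_window_features = [
        name for name in selected if name in window_features_cols
    ]
    # Simplification: anything that is not autoregressive is treated as exogenous.
    selected_exog = [
        name for name in selected
        if name not in lags_cols and name not in window_features_cols
    ]
    return selected_lags, selected_window_features, selected_exog


# Hypothetical names, for illustration only.
lags_cols = ['lag_1', 'lag_2', 'lag_3', 'lag_24']
window_features_cols = ['roll_mean_7']
selected = ['lag_1', 'lag_24', 'roll_mean_7', 'temperature']

lags, window_features, exog = split_selected(selected, lags_cols, window_features_cols)
print(lags, window_features, exog)
```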

Source code in skforecast\feature_selection\feature_selection.py
def select_features(
    forecaster: object,
    selector: object,
    y: Union[pd.Series, pd.DataFrame],
    exog: Optional[Union[pd.Series, pd.DataFrame]] = None,
    select_only: Optional[str] = None,
    force_inclusion: Optional[Union[list, str]] = None,
    subsample: Union[int, float] = 0.5,
    random_state: int = 123,
    verbose: bool = True
) -> tuple[list, list, list]:
    """
    Feature selection using any of the sklearn.feature_selection module selectors 
    (such as `RFECV`, `SelectFromModel`, etc.). Two groups of features are
    evaluated: autoregressive features (lags and window features) and exogenous
    features. By default, the selection process is performed on both sets of features
    at the same time, so that the most relevant autoregressive and exogenous features
    are selected. However, using the `select_only` argument, the selection process
    can focus only on the autoregressive or exogenous features without taking into
    account the other features. Therefore, all other features will remain in the model. 
    It is also possible to force the inclusion of certain features in the final list
    of selected features using the `force_inclusion` parameter.

    Parameters
    ----------
    forecaster : ForecasterRecursive, ForecasterDirect
        Forecaster model. If forecaster is a ForecasterDirect, the
        selector will only be applied to the features of the first step.
    selector : object
        A feature selector from sklearn.feature_selection.
    y : pandas Series, pandas DataFrame
        Target time series to which the feature selection will be applied.
    exog : pandas Series, pandas DataFrame, default `None`
        Exogenous variable/s included as predictor/s. Must have the same
        number of observations as `y` and should be aligned so that y[i] is
        regressed on exog[i].
    select_only : str, default `None`
        Decide what type of features to include in the selection process. 

        - If `'autoreg'`, only autoregressive features (lags and window features)
        are evaluated by the selector. All exogenous features are included in the
        output `selected_exog`.
        - If `'exog'`, only exogenous features are evaluated without the presence
        of autoregressive features. All autoregressive features are included 
        in the outputs `selected_lags` and `selected_window_features`.
        - If `None`, all features are evaluated by the selector.
    force_inclusion : list, str, default `None`
        Features to force include in the final list of selected features.

        - If `list`, list of feature names to force include.
        - If `str`, regular expression to identify features to force include. 
        For example, if `force_inclusion="^sun_"`, all features that begin 
        with "sun_" will be included in the final list of selected features.
    subsample : int, float, default `0.5`
        Proportion of records to use for feature selection. Must be greater
        than 0 and less than or equal to 1. Records are drawn at random with
        replacement.
    random_state : int, default `123`
        Sets a seed for the random subsample so that the subsampling process 
        is always deterministic.
    verbose : bool, default `True`
        Print information about the feature selection process.

    Returns
    -------
    selected_lags : list
        List of selected lags.
    selected_window_features : list
        List of selected window features.
    selected_exog : list
        List of selected exogenous features.

    """

    forecaster_name = type(forecaster).__name__
    valid_forecasters = ['ForecasterRecursive', 'ForecasterDirect']

    if forecaster_name not in valid_forecasters:
        raise TypeError(
            f"`forecaster` must be one of the following classes: {valid_forecasters}."
        )

    if select_only not in ['autoreg', 'exog', None]:
        raise ValueError(
            "`select_only` must be one of the following values: 'autoreg', 'exog', None."
        )

    if subsample <= 0 or subsample > 1:
        raise ValueError(
            "`subsample` must be a number greater than 0 and less than or equal to 1."
        )

    forecaster = deepcopy(forecaster)
    forecaster.is_fitted = False
    X_train, y_train = forecaster.create_train_X_y(y=y, exog=exog)
    if forecaster_name == 'ForecasterDirect':
        X_train, y_train = forecaster.filter_train_X_y_for_step(
                               step          = 1,
                               X_train       = X_train,
                               y_train       = y_train,
                               remove_suffix = True
                           )

    lags_cols = []
    window_features_cols = []
    autoreg_cols = []
    if forecaster.lags is not None:
        lags_cols = forecaster.lags_names
        autoreg_cols.extend(lags_cols)
    if forecaster.window_features is not None:
        window_features_cols = forecaster.window_features_names
        autoreg_cols.extend(window_features_cols)

    exog_cols = [col for col in X_train.columns if col not in autoreg_cols]

    forced_autoreg = []
    forced_exog = []
    if force_inclusion is not None:
        if isinstance(force_inclusion, list):
            forced_autoreg = [col for col in force_inclusion if col in autoreg_cols]
            forced_exog = [col for col in force_inclusion if col in exog_cols]
        elif isinstance(force_inclusion, str):
            forced_autoreg = [col for col in autoreg_cols if re.match(force_inclusion, col)]
            forced_exog = [col for col in exog_cols if re.match(force_inclusion, col)]

    if select_only == 'autoreg':
        X_train = X_train.drop(columns=exog_cols)
    elif select_only == 'exog':
        X_train = X_train.drop(columns=autoreg_cols)

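    # `subsample` is a proportion in (0, 1]; convert it to a number of records.
    # This also covers int input, so `subsample=1` samples len(X_train) records.
    subsample = int(len(X_train) * subsample)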

    rng = np.random.default_rng(seed=random_state)
    sample = rng.integers(low=0, high=len(X_train), size=subsample)
    X_train_sample = X_train.iloc[sample, :]
    y_train_sample = y_train.iloc[sample]
    selector.fit(X_train_sample, y_train_sample)
    selected_features = selector.get_feature_names_out()

    if select_only == 'exog':
        selected_autoreg = autoreg_cols
    else:
        selected_autoreg = [
            feature
            for feature in selected_features
            if feature in autoreg_cols
        ]

    if select_only == 'autoreg':
        selected_exog = exog_cols
    else:
        selected_exog = [
            feature
            for feature in selected_features
            if feature in exog_cols
        ]

    if force_inclusion is not None: 
        if select_only != 'autoreg':
            forced_exog_not_selected = set(forced_exog) - set(selected_features)
            selected_exog.extend(forced_exog_not_selected)
            selected_exog.sort(key=exog_cols.index)
        if select_only != 'exog':
            forced_autoreg_not_selected = set(forced_autoreg) - set(selected_features)
            selected_autoreg.extend(forced_autoreg_not_selected)
            selected_autoreg.sort(key=autoreg_cols.index)

    if len(selected_autoreg) == 0:
        warnings.warn(
            "No autoregressive features have been selected. Since a Forecaster "
            "cannot be created without them, be sure to include at least one "
            "using the `force_inclusion` parameter."
        )
        selected_lags = []
        selected_window_features = []
    else:
        selected_lags = [
            int(feature.replace('lag_', '')) 
            for feature in selected_autoreg if feature in lags_cols
        ]
        selected_window_features = [
            feature for feature in selected_autoreg if feature in window_features_cols
        ]

    if verbose:
        print(f"Recursive feature elimination ({selector.__class__.__name__})")
        print("--------------------------------" + "-" * len(selector.__class__.__name__))
        print(f"Total number of records available: {X_train.shape[0]}")
        print(f"Total number of records used for feature selection: {X_train_sample.shape[0]}")
        print(f"Number of features available: {len(autoreg_cols) + len(exog_cols)}") 
        print(f"    Lags            (n={len(lags_cols)})")
        print(f"    Window features (n={len(window_features_cols)})")
        print(f"    Exog            (n={len(exog_cols)})")
        print(f"Number of features selected: {len(selected_features)}")
        print(f"    Lags            (n={len(selected_lags)}) : {selected_lags}")
        print(f"    Window features (n={len(selected_window_features)}) : {selected_window_features}")
        print(f"    Exog            (n={len(selected_exog)}) : {selected_exog}")

    return selected_lags, selected_window_features, selected_exog
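The subsampling step in the listing above converts a float `subsample` into a record count and then draws row positions at random, with replacement. The sketch below reproduces that idea with the standard-library `random` module as a stand-in for `numpy.random.default_rng(...).integers(...)`, so the exact positions differ from those skforecast would draw. `draw_sample_positions` is a hypothetical helper.

```python
import random


def draw_sample_positions(n_records, subsample=0.5, random_state=123):
    """Draw row positions with replacement, seeded for reproducibility."""
    if subsample <= 0 or subsample > 1:
        raise ValueError("subsample must be greater than 0 and at most 1.")
    size = int(n_records * subsample)
    rng = random.Random(random_state)
    # randrange(n) mirrors integers(low=0, high=n): the upper bound is exclusive.
    return [rng.randrange(n_records) for _ in range(size)]


positions = draw_sample_positions(n_records=100, subsample=0.5)
print(len(positions))       # number of draws
print(len(set(positions)))  # distinct rows; can be smaller because of replacement
```

Since positions are drawn with replacement, some records may appear more than once in the sample and others not at all, and the temporal order of the rows is not preserved.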

skforecast.feature_selection.feature_selection.select_features_multiseries

select_features_multiseries(
    forecaster,
    selector,
    series,
    exog=None,
    select_only=None,
    force_inclusion=None,
    subsample=0.5,
    random_state=123,
    verbose=True,
)

Feature selection using any of the selectors in the sklearn.feature_selection module (such as RFECV, SelectFromModel, etc.). Two groups of features are evaluated: autoregressive features and exogenous features. By default, both groups are evaluated together, so that the most relevant autoregressive and exogenous features are selected. With the select_only argument, the selection can instead be restricted to either the autoregressive or the exogenous features; the selector then does not see the other group, and all features of that group remain in the model. It is also possible to force the inclusion of certain features in the final list of selected features using the force_inclusion parameter.

Parameters:

Name Type Description Default
forecaster (ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate)

Forecaster model. If forecaster is a ForecasterDirectMultiVariate, the selector will only be applied to the features of the first step.

required
selector object

A feature selector from sklearn.feature_selection.

required
series pandas DataFrame, dict

Target time series to which the feature selection will be applied.

required
exog pandas Series, pandas DataFrame, dict

Exogenous variables.

`None`
select_only str

Decide what type of features to include in the selection process.

  • If 'autoreg', only autoregressive features (lags and window features) are evaluated by the selector. All exogenous features are included in the output selected_exog.
  • If 'exog', only exogenous features are evaluated without the presence of autoregressive features. All autoregressive features are included in the outputs selected_lags and selected_window_features.
  • If None, all features are evaluated by the selector.
`None`
force_inclusion (list, str)

Features to force into the final list of selected features.

  • If list, names of the features to include.
  • If str, a regular expression identifying the features to include. For example, if force_inclusion="^sun_", all features that begin with "sun_" will be included in the final list of selected features.
`None`
subsample (int, float)

Proportion of records to use for feature selection. Must be greater than 0 and less than or equal to 1. Records are drawn at random with replacement.

`0.5`
random_state int

Sets a seed for the random subsample so that the subsampling process is always deterministic.

`123`
verbose bool

Print information about the feature selection process.

`True`
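Forced features that the selector did not pick are appended afterwards, and the result is re-sorted into the original column order. A minimal sketch of that merge step, modeled on the source listing on this page, with a hypothetical helper `merge_forced` and hypothetical column names:

```python
def merge_forced(selected, forced, all_cols):
    """Add forced features the selector dropped, keeping the original column order."""
    missing = set(forced) - set(selected)
    merged = list(selected) + list(missing)
    # Sorting by position in all_cols restores the training-matrix order.
    merged.sort(key=all_cols.index)
    return merged


# Hypothetical exogenous columns, for illustration only.
exog_cols = ['sun_hours', 'temperature', 'sun_angle', 'holiday']
selected_by_selector = ['temperature', 'holiday']
forced = ['sun_hours', 'sun_angle']

print(merge_forced(selected_by_selector, forced, exog_cols))
```

The sketch assumes every forced name exists in `all_cols`, which holds in the listing because forced names are filtered against the available columns first.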

Returns:

Name Type Description
selected_lags (list, dict)

List of selected lags. If the forecaster is a ForecasterDirectMultiVariate, the output is a dict with the selected lags for each series, {series_name: lags}, as the lags can be different for each series.

selected_window_features list

List of selected window features.

selected_exog list

List of selected exogenous features.
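For a ForecasterDirectMultiVariate, lag columns carry the series name as a prefix, and the source listing groups them back into a `{series_name: lags}` dict. A minimal sketch of that grouping, with hypothetical series and column names and a hypothetical helper `group_lags_by_series`:

```python
def group_lags_by_series(selected_autoreg, lags_names):
    """Build {series_name: [lag, ...]} from prefixed lag column names."""
    return {
        series_name: (
            [
                int(name.replace(f"{series_name}_lag_", ""))
                for name in selected_autoreg
                if name in names
            ]
            if names is not None
            else []
        )
        for series_name, names in lags_names.items()
    }


# Hypothetical lag column names per series; None means the series has no lags.
lags_names = {
    'sales': ['sales_lag_1', 'sales_lag_2', 'sales_lag_3'],
    'price': ['price_lag_1', 'price_lag_2'],
    'promo': None,
}
selected_autoreg = ['sales_lag_1', 'sales_lag_3', 'price_lag_2']

print(group_lags_by_series(selected_autoreg, lags_names))
```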

Source code in skforecast\feature_selection\feature_selection.py
def select_features_multiseries(
    forecaster: object,
    selector: object,
    series: Union[pd.DataFrame, dict],
    exog: Optional[Union[pd.Series, pd.DataFrame, dict]] = None,
    select_only: Optional[str] = None,
    force_inclusion: Optional[Union[list, str]] = None,
    subsample: Union[int, float] = 0.5,
    random_state: int = 123,
    verbose: bool = True,
) -> tuple[Union[list, dict], list, list]:
    """
    Feature selection using any of the sklearn.feature_selection module selectors 
    (such as `RFECV`, `SelectFromModel`, etc.). Two groups of features are
    evaluated: autoregressive features and exogenous features. By default, the 
    selection process is performed on both sets of features at the same time, 
    so that the most relevant autoregressive and exogenous features are selected. 
    However, using the `select_only` argument, the selection process can focus 
    only on the autoregressive or exogenous features without taking into account 
    the other features. Therefore, all other features will remain in the model. 
    It is also possible to force the inclusion of certain features in the final 
    list of selected features using the `force_inclusion` parameter.

    Parameters
    ----------
    forecaster : ForecasterRecursiveMultiSeries, ForecasterDirectMultiVariate
        Forecaster model. If forecaster is a ForecasterDirectMultiVariate, the
        selector will only be applied to the features of the first step.
    selector : object
        A feature selector from sklearn.feature_selection.
    series : pandas DataFrame, dict
        Target time series to which the feature selection will be applied.
    exog : pandas Series, pandas DataFrame, dict, default `None`
        Exogenous variables.
    select_only : str, default `None`
        Decide what type of features to include in the selection process. 

        - If `'autoreg'`, only autoregressive features (lags and window features) 
        are evaluated by the selector. All exogenous features are 
        included in the output `selected_exog`.
        - If `'exog'`, only exogenous features are evaluated without the presence
        of autoregressive features. All autoregressive features are included 
        in the outputs `selected_lags` and `selected_window_features`.
        - If `None`, all features are evaluated by the selector.
    force_inclusion : list, str, default `None`
        Features to force include in the final list of selected features.

        - If `list`, list of feature names to force include.
        - If `str`, regular expression to identify features to force include. 
        For example, if `force_inclusion="^sun_"`, all features that begin 
        with "sun_" will be included in the final list of selected features.
    subsample : int, float, default `0.5`
        Proportion of records to use for feature selection. Must be greater
        than 0 and less than or equal to 1. Records are drawn at random with
        replacement.
    random_state : int, default `123`
        Sets a seed for the random subsample so that the subsampling process 
        is always deterministic.
    verbose : bool, default `True`
        Print information about the feature selection process.

    Returns
    -------
    selected_lags : list, dict
        List of selected lags. If the forecaster is a ForecasterDirectMultiVariate,
        the output is a dict with the selected lags for each series, {series_name: lags},
        as the lags can be different for each series.
    selected_window_features : list
        List of selected window features.
    selected_exog : list
        List of selected exogenous features.

    """

    forecaster_name = type(forecaster).__name__
    valid_forecasters = [
        'ForecasterRecursiveMultiSeries',
        'ForecasterDirectMultiVariate'
    ]

    if forecaster_name not in valid_forecasters:
        raise TypeError(
            f"`forecaster` must be one of the following classes: {valid_forecasters}."
        )

    if select_only not in ['autoreg', 'exog', None]:
        raise ValueError(
            "`select_only` must be one of the following values: 'autoreg', 'exog', None."
        )

    if subsample <= 0 or subsample > 1:
        raise ValueError(
            "`subsample` must be a number greater than 0 and less than or equal to 1."
        )

    forecaster = deepcopy(forecaster)
    forecaster.is_fitted = False
    output = forecaster._create_train_X_y(series=series, exog=exog)
    X_train = output[0]
    y_train = output[1]
    if forecaster_name == 'ForecasterDirectMultiVariate':
        X_train, y_train = forecaster.filter_train_X_y_for_step(
                               step          = 1,
                               X_train       = X_train,
                               y_train       = y_train,
                               remove_suffix = True
                           )
        lags_cols = list(
            chain(*[v for v in forecaster.lags_names.values() if v is not None])
        )
        window_features_cols = forecaster.X_train_window_features_names_out_
        encoding_cols = []
    else:
        lags_cols = forecaster.lags_names
        window_features_cols = output[6]  # X_train_window_features_names_out_ output
        if forecaster.encoding == 'onehot':
            encoding_cols = output[4]  # X_train_series_names_in_ output
        else:
            encoding_cols = ['_level_skforecast']

    lags_cols = [] if lags_cols is None else lags_cols
    window_features_cols = [] if window_features_cols is None else window_features_cols
    autoreg_cols = []
    if forecaster.lags is not None:
        autoreg_cols.extend(lags_cols)
    if forecaster.window_features is not None:
        autoreg_cols.extend(window_features_cols)

    exog_cols = [
        col
        for col in X_train.columns
        if col not in autoreg_cols and col not in encoding_cols
    ]

    forced_autoreg = []
    forced_exog = []
    if force_inclusion is not None:
        if isinstance(force_inclusion, list):
            forced_autoreg = [col for col in force_inclusion if col in autoreg_cols]
            forced_exog = [col for col in force_inclusion if col in exog_cols]
        elif isinstance(force_inclusion, str):
            forced_autoreg = [col for col in autoreg_cols if re.match(force_inclusion, col)]
            forced_exog = [col for col in exog_cols if re.match(force_inclusion, col)]

    if select_only == 'autoreg':
        X_train = X_train.drop(columns=exog_cols + encoding_cols)
    elif select_only == 'exog':
        X_train = X_train.drop(columns=autoreg_cols + encoding_cols)
    else:
        X_train = X_train.drop(columns=encoding_cols)

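    # `subsample` is a proportion in (0, 1]; convert it to a number of records.
    # This also covers int input, so `subsample=1` samples len(X_train) records.
    subsample = int(len(X_train) * subsample)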

    rng = np.random.default_rng(seed=random_state)
    sample = rng.integers(low=0, high=len(X_train), size=subsample)
    X_train_sample = X_train.iloc[sample, :]
    y_train_sample = y_train.iloc[sample]
    selector.fit(X_train_sample, y_train_sample)
    selected_features = selector.get_feature_names_out()

    if select_only == 'exog':
        selected_autoreg = autoreg_cols
    else:
        selected_autoreg = [
            feature
            for feature in selected_features
            if feature in autoreg_cols
        ]

    if select_only == 'autoreg':
        selected_exog = exog_cols
    else:
        selected_exog = [
            feature
            for feature in selected_features
            if feature in exog_cols
        ]

    if force_inclusion is not None: 
        if select_only != 'autoreg':
            forced_exog_not_selected = set(forced_exog) - set(selected_features)
            selected_exog.extend(forced_exog_not_selected)
            selected_exog.sort(key=exog_cols.index)
        if select_only != 'exog':
            forced_autoreg_not_selected = set(forced_autoreg) - set(selected_features)
            selected_autoreg.extend(forced_autoreg_not_selected)
            selected_autoreg.sort(key=autoreg_cols.index)

    if len(selected_autoreg) == 0:
        warnings.warn(
            "No autoregressive features have been selected. Since a Forecaster "
            "cannot be created without them, be sure to include at least one "
            "using the `force_inclusion` parameter."
        )
        selected_lags = []
        selected_window_features = []
        verbose_selected_lags = []
    else:
        if forecaster_name == 'ForecasterDirectMultiVariate':
            selected_lags = {
                series_name: (
                    [
                        int(feature.replace(f"{series_name}_lag_", ""))
                        for feature in selected_autoreg
                        if feature in lags_names
                    ]
                    if lags_names is not None
                    else []
                )
                for series_name, lags_names in forecaster.lags_names.items()
            }
            verbose_selected_lags = [
                feature for feature in selected_autoreg if feature in lags_cols
            ]
        else:
            selected_lags = [
                int(feature.replace('lag_', '')) 
                for feature in selected_autoreg 
                if feature in lags_cols
            ]
            verbose_selected_lags = selected_lags

        selected_window_features = [
            feature for feature in selected_autoreg 
            if feature in window_features_cols
        ]

    if verbose:
        print(f"Recursive feature elimination ({selector.__class__.__name__})")
        print("--------------------------------" + "-" * len(selector.__class__.__name__))
        print(f"Total number of records available: {X_train.shape[0]}")
        print(f"Total number of records used for feature selection: {X_train_sample.shape[0]}")
        print(f"Number of features available: {len(autoreg_cols) + len(exog_cols)}") 
        print(f"    Lags            (n={len(lags_cols)})")
        print(f"    Window features (n={len(window_features_cols)})")
        print(f"    Exog            (n={len(exog_cols)})")
        print(f"Number of features selected: {len(selected_features)}")
        print(f"    Lags            (n={len(verbose_selected_lags)}) : {verbose_selected_lags}")
        print(f"    Window features (n={len(selected_window_features)}) : {selected_window_features}")
        print(f"    Exog            (n={len(selected_exog)}) : {selected_exog}")

    return selected_lags, selected_window_features, selected_exog
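One detail specific to the multi-series listing is that series-encoding columns are never shown to the selector and are not counted as exogenous features. The sketch below mimics that partition. `partition_columns` and the column names are hypothetical, apart from `_level_skforecast`, which appears in the listing above.

```python
def partition_columns(all_cols, autoreg_cols, encoding_cols, select_only=None):
    """Return (exogenous columns, columns passed to the selector)."""
    exog_cols = [
        col for col in all_cols
        if col not in autoreg_cols and col not in encoding_cols
    ]
    # Encoding columns are dropped in every mode.
    if select_only == 'autoreg':
        dropped = exog_cols + encoding_cols
    elif select_only == 'exog':
        dropped = autoreg_cols + encoding_cols
    else:
        dropped = encoding_cols
    evaluated = [col for col in all_cols if col not in dropped]
    return exog_cols, evaluated


# Hypothetical training-matrix columns with a single encoding column.
all_cols = ['lag_1', 'lag_2', '_level_skforecast', 'temperature', 'holiday']
autoreg_cols = ['lag_1', 'lag_2']
encoding_cols = ['_level_skforecast']

exog_cols, evaluated = partition_columns(all_cols, autoreg_cols, encoding_cols)
print(exog_cols)
print(evaluated)
```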