Feature importances¶
Feature importance is a technique used in machine learning to quantify the relevance of each feature (or variable) to a model's predictions. In other words, it measures how much each feature contributes to the model's output.
Feature importance can be used for several purposes, such as identifying the most relevant features for a given prediction, understanding the behavior of a model, and selecting the best set of features for a given task. It can also help to identify potential biases or errors in the data used to train the model. It is important to note that feature importance is not a definitive measure of causality. Just because a feature is identified as important does not necessarily mean that it causes the outcome. Other factors, such as confounding variables, may also be at play.
The method used to calculate feature importance varies with the type of machine learning model, since different models make different assumptions and have different characteristics that affect the calculation. For example, decision tree-based models such as Random Forest and Gradient Boosting typically use mean decrease impurity or permutation feature importance methods to calculate feature importance.
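To make the contrast concrete, the following sketch (on synthetic data, not the dataset used later in this document) compares the impurity-based importances of a Random Forest with the permutation importances computed by scikit-learn's `permutation_importance`:

# Sketch: impurity-based vs. permutation importance (synthetic data)
# ==============================================================================
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance

rng = np.random.default_rng(123)
X = rng.normal(size=(200, 3))
y = 2 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.1, size=200)

model = RandomForestRegressor(random_state=123).fit(X, y)

# Mean decrease impurity, computed during training
print(model.feature_importances_)

# Permutation importance: average drop in score after shuffling each feature
result = permutation_importance(model, X, y, n_repeats=10, random_state=123)
print(result.importances_mean)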
Linear regression models typically use coefficients or standardized coefficients to determine the importance of a feature. The magnitude of the coefficient reflects the strength and direction of the relationship between the feature and the target variable.
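Because raw coefficients depend on the units of each feature, standardizing the features first makes their magnitudes directly comparable. A minimal sketch on synthetic data:

# Sketch: raw vs. standardized coefficients (synthetic data)
# ==============================================================================
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(123)
X = rng.normal(size=(200, 2)) * np.array([1.0, 100.0])  # second feature on a larger scale
y = X[:, 0] + 0.01 * X[:, 1]  # both features contribute equally in standardized terms

# Raw coefficients reflect the scale of each feature
print(LinearRegression().fit(X, y).coef_)  # approx. [1.0, 0.01]

# After standardization, the coefficients are comparable
X_std = StandardScaler().fit_transform(X)
print(LinearRegression().fit(X_std, y).coef_)  # approx. [1.0, 1.0]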
The importance of the predictors included in a forecaster can be obtained using the method `get_feature_importances()`. This method accesses the `coef_` and `feature_importances_` attributes of the internal regressor.
Warning

The `get_feature_importances()` method will only return values if the forecaster's regressor has either the `coef_` or `feature_importances_` attribute, which is the default in scikit-learn. If your regressor does not follow this naming convention, please consider opening an issue on GitHub and we will strive to include it in future updates.
  Note
See also: SHAP values in skforecast models
Libraries¶
# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Ridge
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregDirect import ForecasterAutoregDirect
Data¶
# Download data
# ==============================================================================
url = (
    'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/'
    'data/h2o_exog.csv'
)
data = pd.read_csv(
    url, sep=',', header=0, names=['date', 'y', 'exog_1', 'exog_2']
)
# Data preprocessing
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.set_index('date')
data = data.asfreq('MS')
Extract feature importances from a trained forecaster¶
# Create and fit forecaster using a RandomForest regressor
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = RandomForestRegressor(random_state=123),
    lags      = 5
)
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
# Predictors importances
# ==============================================================================
forecaster.get_feature_importances()
|   | feature | importance |
|---|---------|------------|
| 0 | lag_1   | 0.530186   |
| 1 | lag_2   | 0.100529   |
| 2 | lag_3   | 0.023620   |
| 3 | lag_4   | 0.070458   |
| 4 | lag_5   | 0.063155   |
| 5 | exog_1  | 0.047043   |
| 6 | exog_2  | 0.165009   |
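Since `get_feature_importances()` returns a pandas DataFrame with `feature` and `importance` columns, the values can be visualized directly; a minimal sketch using the matplotlib import loaded above:

# Plot predictor importances
# ==============================================================================
importances = forecaster.get_feature_importances().sort_values('importance')
fig, ax = plt.subplots(figsize=(6, 3))
ax.barh(importances['feature'], importances['importance'])
ax.set_title('Predictor importances (RandomForestRegressor)')
plt.show()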
# Create and fit forecaster using a linear regressor
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = Ridge(random_state=123),
    lags      = 5
)
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
forecaster.get_feature_importances()
|   | feature | importance |
|---|---------|------------|
| 0 | lag_1   | 0.327688   |
| 1 | lag_2   | -0.073593  |
| 2 | lag_3   | -0.152202  |
| 3 | lag_4   | -0.217106  |
| 4 | lag_5   | -0.145800  |
| 5 | exog_1  | 0.379798   |
| 6 | exog_2  | 0.668162   |
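Because linear coefficients are signed, sorting by absolute value gives a clearer ranking of relevance. A quick sketch:

# Rank predictors by absolute coefficient value
# ==============================================================================
importances = forecaster.get_feature_importances()
importances['abs_importance'] = importances['importance'].abs()
importances.sort_values('abs_importance', ascending=False)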
To properly retrieve the feature importances from a `ForecasterAutoregDirect` or `ForecasterAutoregMultiVariate`, it is essential to specify the model from which the importances are to be extracted. This is because Direct Strategy Forecasters fit one model per step, and each model may rank features differently. Therefore, the user must explicitly indicate which model's (step's) feature importances they wish to extract to ensure that the correct values are returned.
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoregDirect(
    regressor = Ridge(random_state=123),
    steps     = 10,
    lags      = 5
)
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
# Predictors importances of model for step 1
# ==============================================================================
forecaster.get_feature_importances(step=1)
|   | feature | importance |
|---|---------|------------|
| 0 | lag_1   | 0.326827   |
| 1 | lag_2   | -0.055386  |
| 2 | lag_3   | -0.155098  |
| 3 | lag_4   | -0.220415  |
| 4 | lag_5   | -0.138252  |
| 5 | exog_1  | 0.386103   |
| 6 | exog_2  | 0.635972   |
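To compare how the importances vary across the fitted models, the per-step results can be collected into a single DataFrame. A minimal sketch, assuming the number of fitted models is available through the forecaster's `steps` attribute:

# Predictor importances for every step
# ==============================================================================
importances_all = pd.concat(
    [
        forecaster.get_feature_importances(step=step).assign(step=step)
        for step in range(1, forecaster.steps + 1)
    ],
    ignore_index=True
)
importances_all.head()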