Categorical features¶
In the field of machine learning, categorical features play a crucial role in determining the predictive ability of a model. Categorical features are features that can take a limited number of values, such as color, gender or location. While these features can provide useful insights into patterns and relationships within data, they also present unique challenges for machine learning models.
One of these challenges is the need to transform categorical features before they can be used by most models. This transformation involves converting categorical values into numerical values that can be processed by machine learning algorithms.
Another challenge is dealing with infrequent categories, which can lead to biased models. If a categorical feature has a large number of categories, but some of them are rare or appear infrequently in the data, the model may not be able to learn accurately from these categories, resulting in biased predictions and inaccurate results.
Despite these difficulties, categorical features are still an essential component in many use cases. When properly encoded and handled, machine learning models can effectively learn from patterns and relationships in categorical data, leading to better predictions.
This document provides an overview of three of the most commonly used transformations: one-hot encoding, ordinal encoding, and target encoding. It explains how to apply them in the skforecast package using scikit-learn encoders, which provide a convenient and flexible way to pre-process data. It also shows how to use the native implementation of three popular gradient boosting frameworks – LightGBM, scikit-learn's HistogramGradientBoosting, and XGBoost – to handle categorical features directly in the model.
For a comprehensive demonstration of the use of categorical features in time series forecasting, check out the article Forecasting time series with gradient boosting: Skforecast, XGBoost, LightGBM and CatBoost.
Note
All of the transformations described in this document can be applied to the entire dataset, regardless of the forecaster. However, it is important to ensure that the transformations are learned only from the training data to avoid information leakage. Furthermore, the same transformation should be applied to the input data during prediction. To reduce the likelihood of errors and to ensure consistent application of the transformations, it is advisable to include the transformation within the forecaster object, so that it is handled internally.
Libraries and data¶
The dataset used in this user guide consists of information on the number of users of a bicycle rental service, in addition to weather variables and holiday data. Two of the variables in the dataset, holiday and weather, are categorical.
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
from lightgbm import LGBMRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregDirect import ForecasterAutoregDirect
# Downloading data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-'
'learning-python/master/data/bike_sharing_dataset_clean.csv')
data = pd.read_csv(url)
# Preprocess data
# ==============================================================================
data['date_time'] = pd.to_datetime(data['date_time'], format='%Y-%m-%d %H:%M:%S')
data = data.set_index('date_time')
data = data.asfreq('H')
data = data.sort_index()
data['holiday'] = data['holiday'].astype(int)
data = data[['holiday', 'weather', 'temp', 'hum', 'users']]
data[['holiday', 'weather']] = data[['holiday', 'weather']].astype(str)
print(data.dtypes)
data.head(3)
holiday     object
weather     object
temp       float64
hum        float64
users      float64
dtype: object
date_time | holiday | weather | temp | hum | users |
---|---|---|---|---|---|
2011-01-01 00:00:00 | 0 | clear | 9.84 | 81.0 | 16.0 |
2011-01-01 01:00:00 | 0 | clear | 9.02 | 80.0 | 40.0 |
2011-01-01 02:00:00 | 0 | clear | 9.02 | 80.0 | 32.0 |
Only part of the data is used to simplify the example.
# Split train-test
# ==============================================================================
start_train = '2012-06-01 00:00:00'
end_train = '2012-07-31 23:59:00'
end_test = '2012-08-15 23:59:00'
data_train = data.loc[start_train:end_train, :]
data_test = data.loc[end_train:end_test, :]
print(
f"Dates train : {data_train.index.min()} --- {data_train.index.max()}"
f" (n={len(data_train)})"
)
print(
f"Dates test : {data_test.index.min()} --- {data_test.index.max()}"
f" (n={len(data_test)})"
)
Dates train : 2012-06-01 00:00:00 --- 2012-07-31 23:00:00 (n=1464)
Dates test : 2012-08-01 00:00:00 --- 2012-08-15 23:00:00 (n=360)
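The split reuses end_train both as the end of the training slice and as the start of the test slice. The following sketch, built on a hypothetical six-hour toy series rather than the real data, suggests why the two partitions do not overlap: label slicing with .loc is inclusive, but no hourly timestamp falls exactly on 23:59:00.

```python
import pandas as pd

# Hypothetical hourly series spanning the train/test boundary.
index = pd.date_range(
    '2012-07-31 21:00:00', periods=6, freq=pd.Timedelta(hours=1)
)
toy = pd.DataFrame({'users': range(6)}, index=index)

end_train = '2012-07-31 23:59:00'
end_test = '2012-08-01 23:59:00'

toy_train = toy.loc[:end_train]          # last row: 2012-07-31 23:00:00
toy_test = toy.loc[end_train:end_test]   # first row: 2012-08-01 00:00:00

overlap = toy_train.index.intersection(toy_test.index)
print(len(toy_train), len(toy_test), len(overlap))
```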
One Hot Encoding¶
One hot encoding, also known as dummy encoding or one-of-K encoding, consists of replacing the categorical variable with a set of binary variables that take the value 0 or 1 to indicate whether a particular category is present in an observation. For example, suppose a dataset contains a categorical variable called "color" with the possible values of "red," "blue," and "green". Using one hot encoding, this variable is converted into three binary variables such as color_red, color_blue, and color_green, where each variable takes a value of 0 or 1 depending on the category.
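The color example can be reproduced in a few lines. This is only an illustrative sketch that uses pandas.get_dummies on a hypothetical four-row DataFrame; the rest of this guide relies on scikit-learn's OneHotEncoder instead.

```python
import pandas as pd

# Hypothetical data for the "color" example.
colors = pd.DataFrame({'color': ['red', 'blue', 'green', 'blue']})

# One binary column per category; exactly one of them is 1 in each row.
encoded = pd.get_dummies(colors, columns=['color'], dtype=int)
print(encoded)
```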
The OneHotEncoder class in scikit-learn can be used to transform any categorical feature with n possible values into n new binary features, where one of them takes the value 1, and all the others take the value 0. The OneHotEncoder can be configured to handle certain corner cases, including unknown categories, missing values, and infrequent categories.
- When handle_unknown='ignore' and drop is not None, unknown categories are encoded as zeros. Additionally, if a feature contains both np.nan and None, they are considered separate categories.
- It supports the aggregation of infrequent categories into a single output for each feature. The parameters to enable the aggregation of infrequent categories are min_frequency and max_categories. By setting handle_unknown to 'infrequent_if_exist', unknown categories are considered infrequent.
- To avoid collinearity between features, it is possible to drop one of the categories per feature using the drop argument. This is especially important when using linear models.
ColumnTransformers in scikit-learn provide a powerful way to define transformations and apply them to specific features. By encapsulating the OneHotEncoder in a ColumnTransformer object, it can be passed to a forecaster using the transformer_exog argument.
# ColumnTransformer with one-hot encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using one-hot encoding. Numeric features are left untouched. For binary
# features, only one column is created.
one_hot_encoder = make_column_transformer(
(
OneHotEncoder(sparse_output=False, drop='if_binary'),
make_column_selector(dtype_exclude=np.number)
),
remainder="passthrough",
verbose_feature_names_out=False,
).set_output(transform="pandas")
# Create and fit forecaster with a transformer for exogenous features
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']
forecaster = ForecasterAutoreg(
regressor = LGBMRegressor(random_state=123),
lags = 5,
transformer_exog = one_hot_encoder
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
forecaster
=================
ForecasterAutoreg
=================
Regressor: LGBMRegressor(random_state=123)
Lags: [1 2 3 4 5]
Transformer for y: None
Transformer for exog: ColumnTransformer(remainder='passthrough', transformers=[('onehotencoder', OneHotEncoder(drop='if_binary', sparse_output=False), <sklearn.compose._column_transformer.make_column_selector object at 0x000002D2D73CEC50>)], verbose_feature_names_out=False)
Window size: 5
Weight function included: False
Exogenous included: True
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'>
Exogenous variables names: ['holiday', 'weather', 'temp', 'hum']
Training range: [Timestamp('2011-01-01 00:00:00'), Timestamp('2012-07-31 23:00:00')]
Training index type: DatetimeIndex
Training index frequency: H
Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': 'warn', 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0}
fit_kwargs: {}
Creation date: 2023-05-29 13:08:20
Last fit date: 2023-05-29 13:08:20
Skforecast version: 0.8.1
Python version: 3.10.11
Forecaster id: None
Once the forecaster has been trained, the transformer can be inspected (feature_names_in, feature_names_out, ...) by accessing the transformer_exog attribute.
# Access to the transformer used for exogenous features
# ==============================================================================
print(forecaster.transformer_exog.get_feature_names_out())
forecaster.transformer_exog
['holiday_1' 'weather_clear' 'weather_mist' 'weather_rain' 'temp' 'hum']
ColumnTransformer(remainder='passthrough', transformers=[('onehotencoder', OneHotEncoder(drop='if_binary', sparse_output=False), <sklearn.compose._column_transformer.make_column_selector object at 0x000002D2D73CEC50>)], verbose_feature_names_out=False)
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    88.946940
2012-08-01 01:00:00    59.848451
2012-08-01 02:00:00    28.870817
Freq: H, Name: pred, dtype: float64
Note
It is possible to apply a transformation to the entire dataset independent of the forecaster. However, it is crucial to ensure that the transformations are only learned from the training data to avoid information leakage. In addition, the same transformation should be applied to the input data during prediction. It is therefore advisable to incorporate the transformation into the forecaster, so that it is handled internally. This approach ensures consistency in the application of transformations and reduces the likelihood of errors.
To examine how data is being transformed, it is possible to use the create_train_X_y() method to generate the matrices used by the forecaster to train the model. This gives insight into the specific data manipulations that occur during the training process.
# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
print(X_train.dtypes)
X_train.head()
lag_1            float64
lag_2            float64
lag_3            float64
lag_4            float64
lag_5            float64
holiday_1        float64
weather_clear    float64
weather_mist     float64
weather_rain     float64
temp             float64
hum              float64
dtype: object
date_time | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday_1 | weather_clear | weather_mist | weather_rain | temp | hum |
---|---|---|---|---|---|---|---|---|---|---|---|
2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.84 | 75.0 |
2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0.0 | 1.0 | 0.0 | 0.0 | 9.02 | 80.0 |
2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | 0.0 | 1.0 | 0.0 | 0.0 | 8.20 | 86.0 |
2011-01-01 08:00:00 | 3.0 | 2.0 | 1.0 | 1.0 | 13.0 | 0.0 | 1.0 | 0.0 | 0.0 | 9.84 | 75.0 |
2011-01-01 09:00:00 | 8.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 13.12 | 76.0 |
# Transform exogenous features using the transformer outside the forecaster
# ==============================================================================
exog_transformed = one_hot_encoder.fit_transform(data.loc[:end_train, exog_features])
exog_transformed.head()
date_time | holiday_1 | weather_clear | weather_mist | weather_rain | temp | hum |
---|---|---|---|---|---|---|
2011-01-01 00:00:00 | 0.0 | 1.0 | 0.0 | 0.0 | 9.84 | 81.0 |
2011-01-01 01:00:00 | 0.0 | 1.0 | 0.0 | 0.0 | 9.02 | 80.0 |
2011-01-01 02:00:00 | 0.0 | 1.0 | 0.0 | 0.0 | 9.02 | 80.0 |
2011-01-01 03:00:00 | 0.0 | 1.0 | 0.0 | 0.0 | 9.84 | 75.0 |
2011-01-01 04:00:00 | 0.0 | 1.0 | 0.0 | 0.0 | 9.84 | 75.0 |
Ordinal encoding¶
Ordinal encoding is a technique used to convert categorical variables into numerical variables. Each category is assigned a unique numerical value based on its order or rank, as determined by a chosen criterion such as frequency or importance. This encoding method is particularly useful when categories have a natural order or ranking, such as educational qualifications. However, it is important to note that the numerical values assigned to each category do not represent any inherent numerical difference between them, but simply provide a numerical representation.
The scikit-learn library provides the OrdinalEncoder class, which allows users to replace categorical variables with ordinal numbers ranging from 0 to n_categories-1. In addition, this class includes the encoded_missing_value parameter, which allows for the encoding of missing values. It is important to note that, by default, this implementation assigns numbers according to the sorted order of the categories found in each feature (alphabetical for strings), not according to any meaningful ranking, unless an explicit order is passed through the categories parameter. Users should therefore exercise caution when interpreting the numerical values assigned to the categories. In other implementations, such as Feature-engine, the numbers can be ordered based on the mean of the target.
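The effect of the default ordering can be checked with a short sketch. The size labels below are hypothetical; the example compares the default behaviour of OrdinalEncoder with an explicit order supplied through its categories parameter.

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

# Hypothetical feature with a natural order: small < medium < large.
X = np.array([['medium'], ['small'], ['large'], ['small']], dtype=object)

# Default: categories are sorted, so the codes follow alphabetical order.
default_encoder = OrdinalEncoder()
default_codes = default_encoder.fit_transform(X).ravel()
print(default_encoder.categories_[0], default_codes)

# Explicit order: the codes follow the ranking given by the user.
ordered_encoder = OrdinalEncoder(categories=[['small', 'medium', 'large']])
ordered_codes = ordered_encoder.fit_transform(X).ravel()
print(ordered_encoder.categories_[0], ordered_codes)
```

With the default settings 'large' receives the lowest code, which is why the encoded values should not be read as a ranking unless the order is set explicitly.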
# ColumnTransformer with ordinal encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1.
ordinal_encoder = make_column_transformer(
(
OrdinalEncoder(
handle_unknown='use_encoded_value',
unknown_value=-1,
encoded_missing_value=-1
),
make_column_selector(dtype_exclude=np.number)
),
remainder="passthrough",
verbose_feature_names_out=False,
).set_output(transform="pandas")
# Create and fit forecaster with a transformer for exogenous features
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']
forecaster = ForecasterAutoreg(
regressor = LGBMRegressor(random_state=123),
lags = 5,
transformer_exog = ordinal_encoder
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
forecaster
=================
ForecasterAutoreg
=================
Regressor: LGBMRegressor(random_state=123)
Lags: [1 2 3 4 5]
Transformer for y: None
Transformer for exog: ColumnTransformer(remainder='passthrough', transformers=[('ordinalencoder', OrdinalEncoder(encoded_missing_value=-1, handle_unknown='use_encoded_value', unknown_value=-1), <sklearn.compose._column_transformer.make_column_selector object at 0x000002D2D74B36A0>)], verbose_feature_names_out=False)
Window size: 5
Weight function included: False
Exogenous included: True
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'>
Exogenous variables names: ['holiday', 'weather', 'temp', 'hum']
Training range: [Timestamp('2011-01-01 00:00:00'), Timestamp('2012-07-31 23:00:00')]
Training index type: DatetimeIndex
Training index frequency: H
Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': 'warn', 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0}
fit_kwargs: {}
Creation date: 2023-05-29 13:08:20
Last fit date: 2023-05-29 13:08:21
Skforecast version: 0.8.1
Python version: 3.10.11
Forecaster id: None
# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
print(X_train.dtypes)
X_train.head()
lag_1      float64
lag_2      float64
lag_3      float64
lag_4      float64
lag_5      float64
holiday    float64
weather    float64
temp       float64
hum        float64
dtype: object
date_time | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday | weather | temp | hum |
---|---|---|---|---|---|---|---|---|---|
2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0.0 | 1.0 | 9.84 | 75.0 |
2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0.0 | 0.0 | 9.02 | 80.0 |
2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | 0.0 | 0.0 | 8.20 | 86.0 |
2011-01-01 08:00:00 | 3.0 | 2.0 | 1.0 | 1.0 | 13.0 | 0.0 | 0.0 | 9.84 | 75.0 |
2011-01-01 09:00:00 | 8.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 13.12 | 76.0 |
Once the forecaster has been trained, the transformer can be inspected by accessing the transformer_exog attribute.
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    89.096098
2012-08-01 01:00:00    57.749964
2012-08-01 02:00:00    29.263922
Freq: H, Name: pred, dtype: float64
Target encoding¶
Target encoding is a technique that encodes categorical variables based on the relationship between the categories and the target variable. Each category is encoded based on a shrunk estimate of the average target values for observations belonging to the category. The encoding scheme mixes the global target mean with the target mean conditioned on the value of the category.
For example, suppose a categorical variable "City" with categories "New York," "Los Angeles," and "Chicago," and a target variable "Salary." One can calculate the mean salary for each city category based on the training data, and use these mean values to encode the categories.
This encoding scheme is useful for categorical features with high cardinality, where one-hot encoding would inflate the feature space, making it more expensive for a downstream model to process. Classical examples of high-cardinality categories are location-based features such as zip code or region.
The TargetEncoder class is available in scikit-learn (since version 1.3). TargetEncoder considers missing values, such as np.nan or None, as another category and encodes them like any other category. Categories that are not seen during fit are encoded with the target mean, i.e. target_mean_. A more detailed description of target encoding can be found in the scikit-learn user guide.
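The idea of mixing the global mean with the per-category mean can be sketched by hand with pandas. This is not scikit-learn's TargetEncoder: the data, the smoothing value smooth=2.0 and the weighting formula count / (count + smooth) are assumptions chosen to illustrate shrinkage, and details that TargetEncoder handles internally (such as how the smoothing is chosen) are left out.

```python
import pandas as pd

# Hypothetical training data mirroring the City/Salary example above.
train = pd.DataFrame({
    'city': ['New York', 'New York', 'Los Angeles',
             'Chicago', 'Chicago', 'Chicago'],
    'salary': [90.0, 110.0, 80.0, 60.0, 70.0, 80.0],
})

smooth = 2.0  # hypothetical smoothing strength
global_mean = train['salary'].mean()
stats = train.groupby('city')['salary'].agg(['mean', 'count'])

# Categories with few observations are pulled towards the global mean.
weight = stats['count'] / (stats['count'] + smooth)
encoding = weight * stats['mean'] + (1 - weight) * global_mean
print(encoding)

# Categories not seen during training fall back to the global mean.
unseen = pd.Series(['Boston']).map(encoding).fillna(global_mean)
print(unseen)
```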
Warning
We are currently working to allow this type of transformation within skforecast.
Native implementation for categorical features¶
Some machine learning models, including XGBoost, LightGBM, CatBoost, and HistGradientBoostingRegressor, provide built-in methods to handle categorical features, but they assume that the input categories are integers starting from 0 up to the number of categories [0, 1, ..., n_categories-1]. In practice, categorical variables are often coded with strings rather than numbers, so an intermediate transformation step is necessary. Two options are:
- Set columns with categorical variables to the type category. For each column, the data structure consists of an array of categories and an array of integer values (codes) that point to the actual value of the array of categories. That is, internally it is a numeric array with a mapping that relates each value to a category. Models are able to automatically identify the columns of type category and access their internal codes. This approach is applicable to XGBoost, LightGBM and CatBoost.
- Preprocess the categorical columns with an OrdinalEncoder to transform their values to integers and explicitly indicate that the columns should be treated as categorical. Skforecast allows this by using the fit_kwargs argument.
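The internal structure of a pandas category column, as described in the first option, can be inspected directly. The weather values below are a hypothetical sample.

```python
import pandas as pd

# Hypothetical column stored with the pandas 'category' dtype.
weather = pd.Series(['clear', 'mist', 'clear', 'rain']).astype('category')

categories = weather.cat.categories.tolist()  # array of categories
codes = weather.cat.codes.tolist()            # integer codes per row
print(categories)
print(codes)

# Each code is a position in the array of categories.
decoded = [categories[code] for code in codes]
print(decoded)
```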
Warning
When deploying models in production, it is strongly recommended to avoid using automatic detection based on pandas category type columns. Although pandas provides an internal coding for these columns, it is not consistent across different datasets and may vary depending on the categories present in each one. It is therefore crucial to be aware of this issue and to take appropriate measures to ensure consistency in the coding of categorical features when deploying models in production.
At the time of writing, the only library the authors have observed to manage changes in the coding of categories internally is LightGBM. More details on this issue can be found in the related GitHub issue and Stack Overflow discussion.
If the user still wishes to rely on automatic detection of categorical features based on pandas data types, categorical variables must first be encoded as integers (ordinal encoding) and then stored as category type. This is necessary because skforecast uses a numeric numpy array internally to speed up the calculation.
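The inconsistency described in this warning can be reproduced with pandas alone. In the sketch below the two small series are hypothetical, and the use of a shared CategoricalDtype is shown only as one possible safeguard, not as the approach followed in this guide (which relies on ordinal encoding).

```python
import pandas as pd

# Hypothetical data: the new dataset does not contain the 'clear' category.
train_weather = pd.Series(['clear', 'mist', 'rain']).astype('category')
new_weather = pd.Series(['mist', 'rain']).astype('category')

# Codes are assigned per dataset, so 'mist' is 1 in training but 0 later.
train_codes = dict(zip(train_weather, train_weather.cat.codes.tolist()))
new_codes = dict(zip(new_weather, new_weather.cat.codes.tolist()))
print(train_codes)
print(new_codes)

# One possible safeguard: reuse an explicit list of categories everywhere.
shared_dtype = pd.CategoricalDtype(categories=['clear', 'mist', 'rain'])
new_fixed = pd.Series(['mist', 'rain']).astype(shared_dtype)
fixed_codes = dict(zip(new_fixed, new_fixed.cat.codes.tolist()))
print(fixed_codes)
```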
LightGBM¶
Encoding the categories as integers and explicitly specifying the names of the categorical features (recommended)
When creating a forecaster with LGBMRegressor, it is necessary to specify the names of the categorical columns using the fit_kwargs argument. This is because the categorical_feature argument is only specified in the fit method of LGBMRegressor, and not during its initialization.
# Transformer: ordinal encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1.
categorical_features = data.select_dtypes(exclude=[np.number]).columns.tolist()
transformer_exog = make_column_transformer(
(
OrdinalEncoder(
dtype=int,
handle_unknown="use_encoded_value",
unknown_value=-1,
encoded_missing_value=-1
),
categorical_features
),
remainder="passthrough",
verbose_feature_names_out=False,
).set_output(transform="pandas")
# Create and fit forecaster indicating the categorical features
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']
forecaster = ForecasterAutoreg(
regressor = LGBMRegressor(random_state=963),
lags = 5,
transformer_exog = transformer_exog,
fit_kwargs = {'categorical_feature': categorical_features}
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
c:\Users\jaesc2\Miniconda3\envs\skforecast\lib\site-packages\lightgbm\basic.py:2065: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
The UserWarning raised indicates that categorical features have been included. To suppress this warning use warnings.filterwarnings('ignore'). Warnings can then be restored using warnings.filterwarnings('default') or warnings.resetwarnings().
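As an alternative to switching warnings off globally, the suppression can be limited to a single call with the standard library's warnings.catch_warnings context manager. The sketch below uses a hypothetical stand-in function, noisy_fit, instead of a real forecaster, so that it runs without LightGBM installed.

```python
import warnings


def noisy_fit():
    """Hypothetical stand-in for a fit call that emits a UserWarning."""
    warnings.warn('Using categorical_feature in Dataset.', UserWarning)
    return 'fitted'


def fit_quietly():
    # Filters changed inside the context manager are restored on exit,
    # so warnings raised elsewhere are still shown.
    with warnings.catch_warnings():
        warnings.filterwarnings('ignore', category=UserWarning)
        return noisy_fit()


result = fit_quietly()
print(result)
```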
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    88.946940
2012-08-01 01:00:00    59.848451
2012-08-01 02:00:00    28.870817
Freq: H, Name: pred, dtype: float64
Allow the model to automatically detect categorical features (not recommended)
Warning
Handling categorical variables by relying on the automatic detection of the category datatype can be achieved by passing categorical_feature='auto' to the fit method of the regressor, which in skforecast is done through the fit_kwargs argument. However, this approach can lead to significant problems when the model is exposed to new datasets that have a different pandas encoding for categorical columns than the one used during training. Therefore, it's crucial to ensure that the encoding is consistent between the training and the testing datasets to avoid any potential errors.
# Transformer: ordinal encoding and cast to category type
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1. After encoding, the features are converted back to category type so that
# they can be identified as categorical features by the regressor.
pipeline_categorical = make_pipeline(
OrdinalEncoder(
dtype=int,
handle_unknown="use_encoded_value",
unknown_value=-1,
encoded_missing_value=-1
),
FunctionTransformer(
func=lambda x: x.astype('category'),
feature_names_out= 'one-to-one'
)
)
transformer_exog = make_column_transformer(
(
pipeline_categorical,
make_column_selector(dtype_exclude=np.number)
),
remainder="passthrough",
verbose_feature_names_out=False,
).set_output(transform="pandas")
# Create and fit forecaster
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']
forecaster = ForecasterAutoreg(
regressor = LGBMRegressor(random_state=963),
lags = 5,
transformer_exog = transformer_exog,
fit_kwargs = {'categorical_feature': 'auto'}
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    88.946940
2012-08-01 01:00:00    59.848451
2012-08-01 02:00:00    28.870817
Freq: H, Name: pred, dtype: float64
As with any other forecaster, the matrices used during model training can be created with create_train_X_y.
# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
X_train.head()
date_time | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday | weather | temp | hum |
---|---|---|---|---|---|---|---|---|---|
2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0 | 1 | 9.84 | 75.0 |
2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0 | 0 | 9.02 | 80.0 |
2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | 0 | 0 | 8.20 | 86.0 |
2011-01-01 08:00:00 | 3.0 | 2.0 | 1.0 | 1.0 | 13.0 | 0 | 0 | 9.84 | 75.0 |
2011-01-01 09:00:00 | 8.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0 | 0 | 13.12 | 76.0 |
Scikit-learn HistogramGradientBoosting¶
When creating a forecaster using HistGradientBoostingRegressor, the names of the categorical columns should be specified during the instantiation by passing them as a list to the categorical_features argument.
# Transformer: ordinal encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1.
categorical_features = data.select_dtypes(exclude=[np.number]).columns.tolist()
transformer_exog = make_column_transformer(
(
OrdinalEncoder(
dtype=int,
handle_unknown="use_encoded_value",
unknown_value=-1,
encoded_missing_value=-1
),
categorical_features
),
remainder="passthrough",
verbose_feature_names_out=False,
).set_output(transform="pandas")
# Create and fit forecaster indicating the categorical features
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']
forecaster = ForecasterAutoreg(
regressor = HistGradientBoostingRegressor(
categorical_features = categorical_features,
random_state = 963
),
lags = 5,
transformer_exog = transformer_exog
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
HistGradientBoostingRegressor stores a boolean mask indicating which features were considered categorical. It will be None if there are no categorical features.
forecaster.regressor.is_categorical_
array([False, False, False, False, False, True, True, False, False])
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    99.185547
2012-08-01 01:00:00    71.914255
2012-08-01 02:00:00    43.342723
Freq: H, Name: pred, dtype: float64
XGBoost¶
Encoding the categories as integers and explicitly specifying the names of the categorical features (recommended)
At the time of writing, the XGBRegressor class does not provide an option to specify the names of categorical features. Instead, the feature types are specified by passing a list of strings to the feature_types argument, where 'c' denotes categorical and 'q' numeric features. The enable_categorical argument must also be set to True.
Determining the positions of each column to create a list of feature types can be a challenging task. The shape of the data matrix depends on two factors, the number of lags used and the transformations applied to the exogenous variables. However, there is a workaround to this problem. First, create a forecaster without specifying the feature_types argument. Next, the create_train_X_y method can be used with a small sample of data to determine the position of each feature. Once the position of each feature has been determined, the set_params() method can be used to specify the values of feature_types. By following this approach it is possible to ensure that the feature types are correctly specified, thus avoiding any errors that may occur due to incorrect specification.
# Transformer: ordinal encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1.
categorical_features = data.select_dtypes(exclude=[np.number]).columns.tolist()
transformer_exog = make_column_transformer(
(
OrdinalEncoder(
dtype=int,
handle_unknown="use_encoded_value",
unknown_value=-1,
encoded_missing_value=-1
),
categorical_features
),
remainder="passthrough",
verbose_feature_names_out=False,
).set_output(transform="pandas")
A forecaster is created without specifying the feature_types argument.
# Create forecaster
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']
forecaster = ForecasterAutoreg(
    regressor = XGBRegressor(
        tree_method='hist',
        random_state=12345,
        enable_categorical=True,
    ),
    lags = 5,
    transformer_exog = transformer_exog
)
Once the forecaster is instantiated, its create_train_X_y() method is used to generate the training matrices that allow the user to identify the positions of the variables.
# Create training matrices using a sample of the training data
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
    y = data.loc[:end_train, 'users'][:10],
    exog = data.loc[:end_train, exog_features][:10]
)
X_train.head(2)
date_time | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday | weather | temp | hum |
---|---|---|---|---|---|---|---|---|---|
2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0 | 1 | 9.84 | 75.0 |
2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0 | 0 | 9.02 | 80.0 |
Create a list to identify which columns in the training matrix are numeric ('q') and categorical ('c').
feature_types = ['c' if col in categorical_features else 'q' for col in X_train.columns]
feature_types
['q', 'q', 'q', 'q', 'q', 'c', 'c', 'q', 'q']
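Because the list is positional, it can help to pair each column name with its assigned type before passing it to the regressor. The following is a minimal sketch of such a check. The column names are typed out by hand as a stand-in for X_train.columns, and categorical_features is assumed to contain only 'holiday' and 'weather', consistent with the output above.

```python
# Column names copied from the training matrix shown above
# (a stand-in for X_train.columns so the snippet runs on its own).
columns = ['lag_1', 'lag_2', 'lag_3', 'lag_4', 'lag_5',
           'holiday', 'weather', 'temp', 'hum']

# Assumption: these are the only non-numeric exogenous features.
categorical_features = ['holiday', 'weather']

feature_types = ['c' if col in categorical_features else 'q' for col in columns]

# Pair each column with its type so the positions can be verified by eye.
type_by_column = dict(zip(columns, feature_types))
for name, ftype in type_by_column.items():
    print(f"{name}: {ftype}")

# The list needs exactly one entry per column of the training matrix.
assert len(feature_types) == len(columns)
```

If the number of lags or the exogenous transformer changes, the list should be rebuilt, since the column positions change with them.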
Update the regressor parameters using the forecaster's set_params method and fit.
# Update regressor parameters
# ==============================================================================
forecaster.set_params({'feature_types': feature_types})
# Fit forecaster
# ==============================================================================
forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    77.470787
2012-08-01 01:00:00    40.735706
2012-08-01 02:00:00    13.448755
Freq: H, Name: pred, dtype: float64
Allow the model to automatically detect categorical features (not recommended)
Warning
Setting enable_categorical=True during model initialization lets the model detect categorical features automatically from columns that have the pandas category datatype. However, this approach can lead to significant problems when the model is exposed to new datasets whose categorical columns are encoded differently by pandas than they were during training. It is therefore crucial to ensure that the encoding is consistent between the training and testing datasets.
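One way such a mismatch can arise is when each dataset is cast to the category type independently. The sketch below uses small made-up series (the labels 'clear', 'mist' and 'rain' are hypothetical) to show how the same label can receive different integer codes, and how sharing one explicit pandas CategoricalDtype keeps the codes aligned.

```python
import pandas as pd

# Hypothetical data: the test sample happens to lack the category 'clear'.
train = pd.Series(['clear', 'mist', 'rain'])
test = pd.Series(['mist', 'rain'])

# Casting each set independently assigns codes from the categories
# present in that set only, so the same label can get different codes.
train_codes = dict(zip(train, train.astype('category').cat.codes))
test_codes = dict(zip(test, test.astype('category').cat.codes))
print(train_codes)
print(test_codes)

# Sharing one explicit dtype keeps the codes aligned across both sets.
weather_dtype = pd.CategoricalDtype(categories=['clear', 'mist', 'rain'])
test_fixed_codes = dict(zip(test, test.astype(weather_dtype).cat.codes))
print(test_fixed_codes)
```

Here 'mist' is coded differently in the two independently cast series, while the shared dtype restores the code used in training.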
# Transformer: ordinal encoding and cast to category type
# ==============================================================================
# A ColumnTransformer is used to transform categorical (non-numeric) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1. After the encoding, the features are converted back to category type so
# that they can be identified as categorical features by the regressor.
pipeline_categorical = make_pipeline(
    OrdinalEncoder(
        dtype=int,
        handle_unknown="use_encoded_value",
        unknown_value=-1,
        encoded_missing_value=-1
    ),
    FunctionTransformer(
        func=lambda x: x.astype('category'),
        feature_names_out='one-to-one'
    )
)
transformer_exog = make_column_transformer(
    (
        pipeline_categorical,
        make_column_selector(dtype_exclude=np.number)
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")
# Create and fit forecaster
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']
forecaster = ForecasterAutoreg(
    regressor = XGBRegressor(
        enable_categorical=True,
        tree_method='hist',
        random_state=963
    ),
    lags = 5,
    transformer_exog = transformer_exog
)
forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    77.470787
2012-08-01 01:00:00    40.735706
2012-08-01 02:00:00    13.448755
Freq: H, Name: pred, dtype: float64
CatBoost¶
Unfortunately, the current version of skforecast is not compatible with CatBoost's built-in handling of categorical features. The issue arises because CatBoost only accepts categorical features as integers, while skforecast converts input data to floats for faster computation using numpy arrays in the internal prediction process. If a CatBoost model is required, an external encoder should be used for the categorical variables.