Categorical features¶
In the field of machine learning, categorical features play a crucial role in determining the predictive ability of a model. Categorical features are features that can take a limited number of values, such as color, gender or location. While these features can provide useful insights into patterns and relationships within data, they also present unique challenges for machine learning models.
One of these challenges is the need to transform categorical features before they can be used by most models. This transformation involves converting categorical values into numerical values that can be processed by machine learning algorithms.
Another challenge is dealing with infrequent categories, which can lead to biased models. If a categorical feature has a large number of categories, but some of them are rare or appear infrequently in the data, the model may not be able to learn accurately from these categories, resulting in biased predictions and inaccurate results.
Despite these difficulties, categorical features are still an essential component in many use cases. When properly encoded and handled, machine learning models can effectively learn from patterns and relationships in categorical data, leading to better predictions.
This document provides an overview of three of the most commonly used transformations: one-hot encoding, ordinal encoding, and target encoding. It explains how to apply them in the skforecast package using scikit-learn encoders, which provide a convenient and flexible way to pre-process data. It also shows how to use the native implementation of three popular gradient boosting frameworks (LightGBM, scikit-learn's HistogramGradientBoosting, and XGBoost) to handle categorical features directly in the model.
For a comprehensive demonstration of the use of categorical features in time series forecasting, check out the article Forecasting time series with gradient boosting: Skforecast, XGBoost, LightGBM and CatBoost.
  Note
All of the transformations described in this document can be applied to the entire dataset, regardless of the forecaster. However, it is important to ensure that the transformations are learned only from the training data to avoid information leakage. Furthermore, the same transformation should be applied to the input data during prediction. To reduce the likelihood of errors and to ensure consistent application of the transformations, it is advisable to include the transformation within the forecaster object, so that it is handled internally.
Libraries and data¶
The dataset used in this user guide consists of information on the number of users of a bicycle rental service, in addition to weather variables and holiday data. Two of the variables in the dataset, holiday and weather, are categorical.
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
from lightgbm import LGBMRegressor
from sklearn.ensemble import HistGradientBoostingRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.preprocessing import OrdinalEncoder
from sklearn.preprocessing import FunctionTransformer
from sklearn.compose import make_column_transformer
from sklearn.compose import make_column_selector
from sklearn.pipeline import make_pipeline
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregDirect import ForecasterAutoregDirect
# Downloading data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-'
       'learning-python/master/data/bike_sharing_dataset_clean.csv')
data = pd.read_csv(url)
# Preprocess data
# ==============================================================================
data['date_time'] = pd.to_datetime(data['date_time'], format='%Y-%m-%d %H:%M:%S')
data = data.set_index('date_time')
data = data.asfreq('H')
data = data.sort_index()
data['holiday'] = data['holiday'].astype(int)
data = data[['holiday', 'weather', 'temp', 'hum', 'users']]
data[['holiday', 'weather']] = data[['holiday', 'weather']].astype(str)
print(data.dtypes)
data.head(3)
holiday     object
weather     object
temp       float64
hum        float64
users      float64
dtype: object
| date_time | holiday | weather | temp | hum | users |
|---|---|---|---|---|---|
| 2011-01-01 00:00:00 | 0 | clear | 9.84 | 81.0 | 16.0 |
| 2011-01-01 01:00:00 | 0 | clear | 9.02 | 80.0 | 40.0 |
| 2011-01-01 02:00:00 | 0 | clear | 9.02 | 80.0 | 32.0 |
Only part of the data is used to simplify the example.
# Split train-test
# ==============================================================================
start_train = '2012-06-01 00:00:00'
end_train = '2012-07-31 23:59:00'
end_test = '2012-08-15 23:59:00'
data_train = data.loc[start_train:end_train, :]
data_test = data.loc[end_train:end_test, :]
print(
    f"Dates train : {data_train.index.min()} --- {data_train.index.max()}"
    f" (n={len(data_train)})"
)
print(
    f"Dates test : {data_test.index.min()} --- {data_test.index.max()}"
    f" (n={len(data_test)})"
)
Dates train : 2012-06-01 00:00:00 --- 2012-07-31 23:00:00 (n=1464)
Dates test : 2012-08-01 00:00:00 --- 2012-08-15 23:00:00 (n=360)
One Hot Encoding¶
One hot encoding, also known as dummy encoding or one-of-K encoding, consists of replacing the categorical variable with a set of binary variables that take the value 0 or 1 to indicate whether a particular category is present in an observation. For example, suppose a dataset contains a categorical variable called "color" with the possible values of "red," "blue," and "green". Using one hot encoding, this variable is converted into three binary variables, color_red, color_blue, and color_green, where each variable takes a value of 0 or 1 depending on the category.
The OneHotEncoder class in scikit-learn can be used to transform any categorical feature with n possible values into n new binary features, where one of them takes the value 1 and all the others take the value 0. The OneHotEncoder can be configured to handle certain corner cases, including unknown categories, missing values, and infrequent categories:
- When handle_unknown='ignore' and drop is not None, unknown categories are encoded as zeros. Additionally, if a feature contains both np.nan and None, they are considered separate categories.
- It supports the aggregation of infrequent categories into a single output for each feature. The parameters that enable the aggregation of infrequent categories are min_frequency and max_categories. By setting handle_unknown to 'infrequent_if_exist', unknown categories are considered infrequent (see the short example below).
- To avoid collinearity between features, it is possible to drop one of the categories per feature using the drop argument. This is especially important when using linear models.
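The following short example (illustrative data, not the bike-sharing dataset, reusing the libraries imported above) sketches the infrequent-category and unknown-category behaviour described in the list.

# OneHotEncoder: infrequent and unknown categories (illustrative example)
# ==============================================================================
# Categories whose count is below `min_frequency` are grouped into a single
# infrequent column; with handle_unknown='infrequent_if_exist', categories not
# seen during fit are mapped to that same column.
example = pd.DataFrame({'weather': ['clear', 'clear', 'clear', 'mist', 'mist', 'rain', 'snow']})

encoder = OneHotEncoder(
    sparse_output  = False,
    min_frequency  = 2,
    handle_unknown = 'infrequent_if_exist'
).set_output(transform="pandas")
encoder.fit(example)

print(encoder.get_feature_names_out())  # 'rain' and 'snow' fall into 'weather_infrequent_sklearn'
encoder.transform(pd.DataFrame({'weather': ['fog']}))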
ColumnTransformers in scikit-learn provide a powerful way to define transformations and apply them to specific features. By encapsulating the OneHotEncoder in a ColumnTransformer object, it can be passed to a forecaster using the transformer_exog argument.
# ColumnTransformer with one-hot encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using one-hot encoding. Numeric features are left untouched. For binary
# features, only one column is created.
one_hot_encoder = make_column_transformer(
    (
        OneHotEncoder(sparse_output=False, drop='if_binary'),
        make_column_selector(dtype_exclude=np.number)
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Create and fit forecaster with a transformer for exogenous features
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']

forecaster = ForecasterAutoreg(
    regressor = LGBMRegressor(random_state=123),
    lags = 5,
    transformer_exog = one_hot_encoder
)

forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
forecaster
================= ForecasterAutoreg ================= Regressor: LGBMRegressor(random_state=123) Lags: [1 2 3 4 5] Transformer for y: None Transformer for exog: ColumnTransformer(remainder='passthrough', transformers=[('onehotencoder', OneHotEncoder(drop='if_binary', sparse_output=False), <sklearn.compose._column_transformer.make_column_selector object at 0x000002D2D73CEC50>)], verbose_feature_names_out=False) Window size: 5 Weight function included: False Exogenous included: True Type of exogenous variable: <class 'pandas.core.frame.DataFrame'> Exogenous variables names: ['holiday', 'weather', 'temp', 'hum'] Training range: [Timestamp('2011-01-01 00:00:00'), Timestamp('2012-07-31 23:00:00')] Training index type: DatetimeIndex Training index frequency: H Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': 'warn', 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0} fit_kwargs: {} Creation date: 2023-05-29 13:08:20 Last fit date: 2023-05-29 13:08:20 Skforecast version: 0.8.1 Python version: 3.10.11 Forecaster id: None
Once the forecaster has been trained, the transformer can be inspected (feature_names_in, feature_names_out, ...) by accessing the transformer_exog attribute.
# Access to the transformer used for exogenous features
# ==============================================================================
print(forecaster.transformer_exog.get_feature_names_out())
forecaster.transformer_exog
['holiday_1' 'weather_clear' 'weather_mist' 'weather_rain' 'temp' 'hum']
ColumnTransformer(remainder='passthrough', transformers=[('onehotencoder', OneHotEncoder(drop='if_binary', sparse_output=False), <sklearn.compose._column_transformer.make_column_selector object at 0x000002D2D73CEC50>)], verbose_feature_names_out=False)
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    88.946940
2012-08-01 01:00:00    59.848451
2012-08-01 02:00:00    28.870817
Freq: H, Name: pred, dtype: float64
  Note
It is possible to apply a transformation to the entire dataset independent of the forecaster. However, it is crucial to ensure that the transformations are only learned from the training data to avoid information leakage. In addition, the same transformation should be applied to the input data during prediction. It is therefore advisable to incorporate the transformation into the forecaster, so that it is handled internally. This approach ensures consistency in the application of transformations and reduces the likelihood of errors.
To examine how the data is being transformed, the create_train_X_y() method can be used to generate the matrices that the forecaster uses to train the model. This provides insight into the specific data manipulations that occur during the training process.
# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
print(X_train.dtypes)
X_train.head()
lag_1            float64
lag_2            float64
lag_3            float64
lag_4            float64
lag_5            float64
holiday_1        float64
weather_clear    float64
weather_mist     float64
weather_rain     float64
temp             float64
hum              float64
dtype: object
| date_time | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday_1 | weather_clear | weather_mist | weather_rain | temp | hum |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.84 | 75.0 |
| 2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0.0 | 1.0 | 0.0 | 0.0 | 9.02 | 80.0 |
| 2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | 0.0 | 1.0 | 0.0 | 0.0 | 8.20 | 86.0 |
| 2011-01-01 08:00:00 | 3.0 | 2.0 | 1.0 | 1.0 | 13.0 | 0.0 | 1.0 | 0.0 | 0.0 | 9.84 | 75.0 |
| 2011-01-01 09:00:00 | 8.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 13.12 | 76.0 |
# Transform exogenous features using the transformer outside the forecaster
# ==============================================================================
exog_transformed = one_hot_encoder.fit_transform(data.loc[:end_train, exog_features])
exog_transformed.head()
| date_time | holiday_1 | weather_clear | weather_mist | weather_rain | temp | hum |
|---|---|---|---|---|---|---|
| 2011-01-01 00:00:00 | 0.0 | 1.0 | 0.0 | 0.0 | 9.84 | 81.0 |
| 2011-01-01 01:00:00 | 0.0 | 1.0 | 0.0 | 0.0 | 9.02 | 80.0 |
| 2011-01-01 02:00:00 | 0.0 | 1.0 | 0.0 | 0.0 | 9.02 | 80.0 |
| 2011-01-01 03:00:00 | 0.0 | 1.0 | 0.0 | 0.0 | 9.84 | 75.0 |
| 2011-01-01 04:00:00 | 0.0 | 1.0 | 0.0 | 0.0 | 9.84 | 75.0 |
Ordinal encoding¶
Ordinal encoding is a technique used to convert categorical variables into numerical variables. Each category is assigned a unique numerical value based on its order or rank, as determined by a chosen criterion such as frequency or importance. This encoding method is particularly useful when categories have a natural order or ranking, such as educational qualifications. However, it is important to note that the numerical values assigned to each category do not represent any inherent numerical difference between them, but simply provide a numerical representation.
The scikit-learn library provides the OrdinalEncoder class, which allows users to replace categorical variables with ordinal numbers ranging from 0 to n_categories-1. In addition, this class includes the encoded_missing_value parameter, which allows for the encoding of missing values. It is important to note that this implementation assigns the integers based only on the sorted order of the categories, without taking the target into account, so users should exercise caution when interpreting the numerical values assigned to the categories. Other implementations, such as Feature-engine, allow the numbers to be ordered based on the mean of the target.
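Before wrapping the encoder in a ColumnTransformer, the short sketch below (illustrative data, not the bike-sharing dataset) shows how OrdinalEncoder assigns the integer codes and how unknown categories are handled.

# OrdinalEncoder: integer codes and unknown categories (illustrative example)
# ==============================================================================
example = pd.DataFrame({'weather': ['mist', 'clear', 'rain', 'clear']})

encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
encoder.fit(example)

# Codes follow the sorted categories (clear=0, mist=1, rain=2), regardless of the target
print(encoder.categories_)
print(encoder.transform(example).ravel())

# A category not seen during fit is encoded with unknown_value (-1)
print(encoder.transform(pd.DataFrame({'weather': ['snow']})))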
# ColumnTransformer with ordinal encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1.
ordinal_encoder = make_column_transformer(
    (
        OrdinalEncoder(
            handle_unknown='use_encoded_value',
            unknown_value=-1,
            encoded_missing_value=-1
        ),
        make_column_selector(dtype_exclude=np.number)
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Create and fit forecaster with a transformer for exogenous features
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']

forecaster = ForecasterAutoreg(
    regressor = LGBMRegressor(random_state=123),
    lags = 5,
    transformer_exog = ordinal_encoder
)

forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
forecaster
================= ForecasterAutoreg ================= Regressor: LGBMRegressor(random_state=123) Lags: [1 2 3 4 5] Transformer for y: None Transformer for exog: ColumnTransformer(remainder='passthrough', transformers=[('ordinalencoder', OrdinalEncoder(encoded_missing_value=-1, handle_unknown='use_encoded_value', unknown_value=-1), <sklearn.compose._column_transformer.make_column_selector object at 0x000002D2D74B36A0>)], verbose_feature_names_out=False) Window size: 5 Weight function included: False Exogenous included: True Type of exogenous variable: <class 'pandas.core.frame.DataFrame'> Exogenous variables names: ['holiday', 'weather', 'temp', 'hum'] Training range: [Timestamp('2011-01-01 00:00:00'), Timestamp('2012-07-31 23:00:00')] Training index type: DatetimeIndex Training index frequency: H Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': 'warn', 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0} fit_kwargs: {} Creation date: 2023-05-29 13:08:20 Last fit date: 2023-05-29 13:08:21 Skforecast version: 0.8.1 Python version: 3.10.11 Forecaster id: None
# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
print(X_train.dtypes)
X_train.head()
lag_1      float64
lag_2      float64
lag_3      float64
lag_4      float64
lag_5      float64
holiday    float64
weather    float64
temp       float64
hum        float64
dtype: object
| date_time | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday | weather | temp | hum |
|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0.0 | 1.0 | 9.84 | 75.0 |
| 2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0.0 | 0.0 | 9.02 | 80.0 |
| 2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | 0.0 | 0.0 | 8.20 | 86.0 |
| 2011-01-01 08:00:00 | 3.0 | 2.0 | 1.0 | 1.0 | 13.0 | 0.0 | 0.0 | 9.84 | 75.0 |
| 2011-01-01 09:00:00 | 8.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 13.12 | 76.0 |
Once the forecaster has been trained, the transformer can be inspected by accessing the transformer_exog attribute.
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    89.096098
2012-08-01 01:00:00    57.749964
2012-08-01 02:00:00    29.263922
Freq: H, Name: pred, dtype: float64
Target encoding¶
Target encoding is a technique that encodes categorical variables based on the relationship between the categories and the target variable. Each category is encoded based on a shrunk estimate of the average target value for the observations belonging to that category. The encoding scheme mixes the global target mean with the target mean conditioned on the value of the category.
For example, suppose a categorical variable "City" with categories "New York," "Los Angeles," and "Chicago," and a target variable "Salary." One can calculate the mean salary for each city category based on the training data, and use these mean values to encode the categories.
This encoding scheme is useful for categorical features with high cardinality, where one-hot encoding would inflate the feature space and make it more expensive for a downstream model to process. Classic examples of high-cardinality categories are location-based features such as zip code or region.
The TargetEncoder class is available in scikit-learn (since version 1.3). TargetEncoder considers missing values, such as np.nan or None, as another category and encodes them like any other category. Categories that are not seen during fit are encoded with the target mean, i.e. target_mean_. A more detailed description of target encoding can be found in the scikit-learn user guide.
  Warning
We are currently working to allow this type of transformation within skforecast.
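Until this support is available, target encoding can still be applied to the exogenous features outside the forecaster. The sketch below (assuming scikit-learn >= 1.3 and reusing the data, end_train, exog_features and data_test objects defined above) shows one possible way to do it, learning the encoding only from the training data.

# Target encoding of exogenous features outside the forecaster (sketch)
# ==============================================================================
from sklearn.preprocessing import TargetEncoder

target_encoder = make_column_transformer(
    (
        TargetEncoder(target_type='continuous', random_state=123),
        make_column_selector(dtype_exclude=np.number)
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# The encoding is learned only from the training data to avoid information leakage
exog_train_encoded = target_encoder.fit_transform(
    data.loc[:end_train, exog_features],
    data.loc[:end_train, 'users']
)
exog_test_encoded = target_encoder.transform(data_test[exog_features])
exog_train_encoded.head(3)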
Native implementation for categorical features¶
Some machine learning models, including XGBoost, LightGBM, CatBoost, and HistGradientBoostingRegressor, provide built-in methods to handle categorical features, but they assume that the input categories are integers starting from 0 up to the number of categories [0, 1, ..., n_categories-1]. In practice, categorical variables are not coded with numbers but with strings, so an intermediate transformation step is necessary. Two options are:
- Set the columns that contain categorical variables to the type category. For each column, this data structure consists of an array of categories and an array of integer values (codes) that point to the actual value in the array of categories. That is, internally it is a numeric array with a mapping that relates each value to a category (see the short sketch after this list). Models are able to automatically identify the columns of type category and access their internal codes. This approach is applicable to XGBoost, LightGBM and CatBoost.

- Preprocess the categorical columns with an OrdinalEncoder to transform their values to integers and explicitly indicate that the columns should be treated as categorical. Skforecast allows this by using the fit_kwargs argument.
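The short sketch below (illustrative values) shows the categories/codes structure behind the pandas category type mentioned in the first option.

# Internal structure of the pandas 'category' dtype (illustrative example)
# ==============================================================================
weather_example = pd.Series(['clear', 'mist', 'clear', 'rain'], dtype='category')

print(weather_example.cat.categories)      # Index(['clear', 'mist', 'rain'], dtype='object')
print(weather_example.cat.codes.tolist())  # [0, 1, 0, 2]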
  Warning
When deploying models in production, it is strongly recommended to avoid relying on automatic detection based on pandas category type columns. Although pandas provides an internal coding for these columns, it is not consistent across different datasets and may vary depending on the categories present in each one. It is therefore crucial to be aware of this issue and to take appropriate measures to ensure consistency in the coding of categorical features.
At the time of writing, the authors have only observed that LightGBM internally manages changes in the coding of categories. More details on this issue can be found in this GitHub issue and on Stack Overflow.
If the user still wishes to rely on automatic detection of categorical features based on pandas data types, categorical variables must first be encoded as integers (ordinal encoding) and then stored as category type. This is necessary because skforecast uses a numeric numpy array internally to speed up the calculation.
LightGBM¶
Encoding the categories as integers and explicitly specifying the names of the categorical features (recommended)
When creating a forecaster with LGBMRegressor, it is necessary to specify the names of the categorical columns using the fit_kwargs argument. This is because the categorical_feature argument is only specified in the fit method of LGBMRegressor, and not during its initialization.
# Transformer: ordinal encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1.
categorical_features = data.select_dtypes(exclude=[np.number]).columns.tolist()

transformer_exog = make_column_transformer(
    (
        OrdinalEncoder(
            dtype=int,
            handle_unknown="use_encoded_value",
            unknown_value=-1,
            encoded_missing_value=-1
        ),
        categorical_features
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Create and fit forecaster indicating the categorical features
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']

forecaster = ForecasterAutoreg(
    regressor = LGBMRegressor(random_state=963),
    lags = 5,
    transformer_exog = transformer_exog,
    fit_kwargs = {'categorical_feature': categorical_features}
)

forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
c:\Users\jaesc2\Miniconda3\envs\skforecast\lib\site-packages\lightgbm\basic.py:2065: UserWarning: Using categorical_feature in Dataset.
  _log_warning('Using categorical_feature in Dataset.')
The UserWarning raised indicates that categorical features have been included. To suppress this warning, use warnings.filterwarnings('ignore'). Warnings can then be restored using warnings.filterwarnings('default') or warnings.resetwarnings().
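For example, a possible way to silence only this warning while fitting (not part of the original example):

# Suppress the LightGBM UserWarning while fitting, then restore warnings
# ==============================================================================
import warnings

warnings.filterwarnings('ignore', category=UserWarning)
forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
warnings.filterwarnings('default')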
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    88.946940
2012-08-01 01:00:00    59.848451
2012-08-01 02:00:00    28.870817
Freq: H, Name: pred, dtype: float64
Allow the model to automatically detect categorical features (not recommended)
  Warning
Handling categorical variables by relying on the automatic detection of the category datatype can be achieved by setting categorical_feature='auto' in the fit_kwargs argument. However, this approach can lead to significant problems when the model is exposed to new datasets that have a different pandas encoding for categorical columns than the one used during training. Therefore, it is crucial to ensure that the encoding is consistent between the training and the testing datasets to avoid any potential errors.
# Transformer: ordinal encoding and cast to category type
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1. After encoding, the features are converted back to category type so that
# they can be identified as categorical features by the regressor.
pipeline_categorical = make_pipeline(
    OrdinalEncoder(
        dtype=int,
        handle_unknown="use_encoded_value",
        unknown_value=-1,
        encoded_missing_value=-1
    ),
    FunctionTransformer(
        func=lambda x: x.astype('category'),
        feature_names_out='one-to-one'
    )
)

transformer_exog = make_column_transformer(
    (
        pipeline_categorical,
        make_column_selector(dtype_exclude=np.number)
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Create and fit forecaster
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']

forecaster = ForecasterAutoreg(
    regressor = LGBMRegressor(random_state=963),
    lags = 5,
    transformer_exog = transformer_exog,
    fit_kwargs = {'categorical_feature': 'auto'}
)

forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    88.946940
2012-08-01 01:00:00    59.848451
2012-08-01 02:00:00    28.870817
Freq: H, Name: pred, dtype: float64
As with any other forecaster, the matrices used during model training can be created with create_train_X_y.
# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
X_train.head()
| date_time | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday | weather | temp | hum |
|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0 | 1 | 9.84 | 75.0 |
| 2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0 | 0 | 9.02 | 80.0 |
| 2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | 0 | 0 | 8.20 | 86.0 |
| 2011-01-01 08:00:00 | 3.0 | 2.0 | 1.0 | 1.0 | 13.0 | 0 | 0 | 9.84 | 75.0 |
| 2011-01-01 09:00:00 | 8.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0 | 0 | 13.12 | 76.0 |
Scikit-learn HistogramGradientBoosting¶
When creating a forecaster using HistGradientBoostingRegressor, the names of the categorical columns should be specified during instantiation by passing them as a list to the categorical_features argument.
# Transformer: ordinal encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1.
categorical_features = data.select_dtypes(exclude=[np.number]).columns.tolist()

transformer_exog = make_column_transformer(
    (
        OrdinalEncoder(
            dtype=int,
            handle_unknown="use_encoded_value",
            unknown_value=-1,
            encoded_missing_value=-1
        ),
        categorical_features
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")

# Create and fit forecaster indicating the categorical features
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']

forecaster = ForecasterAutoreg(
    regressor = HistGradientBoostingRegressor(
        categorical_features = categorical_features,
        random_state = 963
    ),
    lags = 5,
    transformer_exog = transformer_exog
)

forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
HistGradientBoostingRegressor stores a boolean mask indicating which features were considered categorical. It will be None if there are no categorical features.
forecaster.regressor.is_categorical_
array([False, False, False, False, False, True, True, False, False])
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    99.185547
2012-08-01 01:00:00    71.914255
2012-08-01 02:00:00    43.342723
Freq: H, Name: pred, dtype: float64
XGBoost¶
Encoding the categories as integers and explicitly specifying the names of the categorical features (recommended)
At the time of writing, XGBRegressor does not provide an option to specify the names of categorical features. Instead, the feature types are specified by passing a list of strings to the feature_types argument, where 'c' denotes categorical features and 'q' numeric features. The enable_categorical argument must also be set to True.
Determining the position of each column in order to create the list of feature types can be a challenging task: the shape of the data matrix depends on the number of lags used and on the transformations applied to the exogenous variables. However, there is a workaround. First, create a forecaster without specifying the feature_types argument. Next, use the create_train_X_y method with a small sample of data to determine the position of each feature. Once the position of each feature has been determined, use the set_params() method to specify the values of feature_types. Following this approach ensures that the feature types are correctly specified, avoiding errors caused by an incorrect specification.
# Transformer: ordinal encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1.
categorical_features = data.select_dtypes(exclude=[np.number]).columns.tolist()

transformer_exog = make_column_transformer(
    (
        OrdinalEncoder(
            dtype=int,
            handle_unknown="use_encoded_value",
            unknown_value=-1,
            encoded_missing_value=-1
        ),
        categorical_features
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")
A forecaster is created without specifying the feature_types argument.
# Create forecaster
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']

forecaster = ForecasterAutoreg(
    regressor = XGBRegressor(
        tree_method='hist',
        random_state=12345,
        enable_categorical=True,
    ),
    lags = 5,
    transformer_exog = transformer_exog
)
Once the forecaster is instantiated, its create_train_X_y() method is used to generate the training matrices, which allow the user to identify the positions of the variables.
# Create training matrices using a sample of the training data
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
    y = data.loc[:end_train, 'users'][:10],
    exog = data.loc[:end_train, exog_features][:10]
)

X_train.head(2)
| date_time | lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday | weather | temp | hum |
|---|---|---|---|---|---|---|---|---|---|
| 2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0 | 1 | 9.84 | 75.0 |
| 2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0 | 0 | 9.02 | 80.0 |
Create a list to identify which columns in the training matrix are numeric ('q') and categorical ('c').
feature_types = ['c' if col in categorical_features else 'q' for col in X_train.columns]
feature_types
['q', 'q', 'q', 'q', 'q', 'c', 'c', 'q', 'q']
Update the regressor parameters using the forecaster's set_params method and fit the forecaster.
# Update regressor parameters
# ==============================================================================
forecaster.set_params({'feature_types': feature_types})

# Fit forecaster
# ==============================================================================
forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    77.470787
2012-08-01 01:00:00    40.735706
2012-08-01 02:00:00    13.448755
Freq: H, Name: pred, dtype: float64
Allow the model to automatically detect categorical features (not recommended)
  Warning
Handling categorical variables by relying on the automatic detection of the category datatype can be achieved by setting enable_categorical=True during model initialization. However, this approach can lead to significant problems when the model is exposed to new datasets that have a different pandas encoding for categorical columns than the one used during training. Therefore, it is crucial to ensure that the encoding is consistent between the training and the testing datasets to avoid any potential errors.
# Transformer: ordinal encoding and cast to category type
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1. After the encoding, the features are converted back to category type so
# that they can be identified as categorical features by the regressor.
pipeline_categorical = make_pipeline(
    OrdinalEncoder(
        dtype=int,
        handle_unknown="use_encoded_value",
        unknown_value=-1,
        encoded_missing_value=-1
    ),
    FunctionTransformer(
        func=lambda x: x.astype('category'),
        feature_names_out='one-to-one'
    )
)

transformer_exog = make_column_transformer(
    (
        pipeline_categorical,
        make_column_selector(dtype_exclude=np.number)
    ),
    remainder="passthrough",
    verbose_feature_names_out=False,
).set_output(transform="pandas")
# Create and fit forecaster
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']

forecaster = ForecasterAutoreg(
    regressor = XGBRegressor(
        enable_categorical=True,
        tree_method='hist',
        random_state=963
    ),
    lags = 5,
    transformer_exog = transformer_exog
)

forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00    77.470787
2012-08-01 01:00:00    40.735706
2012-08-01 02:00:00    13.448755
Freq: H, Name: pred, dtype: float64
CatBoost¶
Unfortunately, the current version of skforecast is not compatible with CatBoost's built-in handling of categorical features. The issue arises because CatBoost only accepts categorical features as integers, while skforecast converts input data to floats for faster computation using numpy arrays in the internal prediction process. If a CatBoost model is required, an external encoder should be used for the categorical variables.
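For example, a possible sketch (not part of the original guide) that reuses the one_hot_encoder ColumnTransformer defined earlier in this document as the external encoder:

# CatBoost with an external encoder for the categorical features (sketch)
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = CatBoostRegressor(random_state=123, silent=True, allow_writing_files=False),
    lags = 5,
    transformer_exog = one_hot_encoder
)

forecaster.fit(
    y = data.loc[:end_train, 'users'],
    exog = data.loc[:end_train, exog_features]
)

forecaster.predict(steps=3, exog=data_test[exog_features])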