Categorical features¶
Categorical features (variables that take a discrete set of values, such as weather conditions or holiday status) provide valuable signals in time series forecasting. Before most machine learning models can use them, however, they must be converted into numerical representations. This encoding step must be learned exclusively from training data to avoid information leakage.
Since version 0.22.0, skforecast provides a built-in categorical_features parameter that automatically handles encoding and natively configures gradient boosting estimators (XGBoost, LightGBM, CatBoost and HistGradientBoosting), requiring no manual encoder pipelines or estimator-specific parameters. This is the recommended approach for most use cases and is covered first in this guide.
This document also covers three manual encoding techniques: one-hot encoding and ordinal encoding, applied via transformer_exog with scikit-learn encoders, and target encoding, which must be applied outside the forecaster because TargetEncoder requires the target variable during fitting and therefore cannot be embedded inside a forecaster. These approaches are useful when working with non-gradient-boosting estimators or when fine-grained control over encoding behaviour is needed.
For a comprehensive walkthrough of categorical features in gradient boosting forecasting models, see Forecasting time series with gradient boosting: Skforecast, XGBoost, LightGBM and CatBoost.
Choosing an encoding strategy¶
The table below summarises when to use each approach.
| Method | API | Estimator compatibility | Key trade-offs |
|---|---|---|---|
| Built-in categorical_features | categorical_features='auto' or list | LightGBM, XGBoost, HistGradientBoosting, CatBoost | Simplest workflow; automatic native configuration; floats used internally |
| One-hot encoding | transformer_exog | Any | Expands dimensionality; avoids ordinal assumptions; drop-category handles collinearity |
| Ordinal encoding | transformer_exog | Any (most useful for trees) | Single column; arbitrary numeric order unless categories is specified explicitly |
| Target encoding | Outside forecaster | Any | Leverages target signal; high-cardinality friendly; must be applied manually to avoid leakage |
General recommendation:

- For gradient boosting models, use categorical_features='auto', which requires the least code and lets the model use native categorical splits.
- For linear models or non-gradient-boosting trees, use one-hot or ordinal encoding via transformer_exog.
- For high-cardinality features (e.g., hundreds of category levels), target encoding (applied outside the forecaster) is the most compact option.
✏️ Note
All of the transformations described in this document are compatible with any forecaster type. However, it is important to ensure that they are learned only from the training data to avoid information leakage, and that the same fitted transformation is applied during prediction. To reduce the likelihood of errors, it is advisable to include the transformation within the forecaster object (via transformer_exog or categorical_features), so that it is handled internally during both fit() and predict().
Libraries and data¶
The dataset used in this user guide consists of information on the number of users of a bicycle rental service, in addition to weather variables and holiday data. Two of the variables in the dataset, holiday and weather, are categorical.
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder, TargetEncoder, StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector
from sklearn.ensemble import HistGradientBoostingRegressor
from lightgbm import LGBMRegressor
from xgboost import XGBRegressor
from catboost import CatBoostRegressor
from skforecast.datasets import fetch_dataset
from skforecast.recursive import ForecasterRecursive
plt.style.use('fivethirtyeight')
plt.rcParams['lines.linewidth'] = 1.5
# Downloading data
# ==============================================================================
data = fetch_dataset(name='bike_sharing', raw=True)
╭───────────────────────────────── bike_sharing ──────────────────────────────────╮ │ Description: │ │ Hourly usage of the bike share system in the city of Washington D.C. during the │ │ years 2011 and 2012. In addition to the number of users per hour, information │ │ about weather conditions and holidays is available. │ │ │ │ Source: │ │ Fanaee-T,Hadi. (2013). Bike Sharing Dataset. UCI Machine Learning Repository. │ │ https://doi.org/10.24432/C5W894. │ │ │ │ URL: │ │ https://raw.githubusercontent.com/skforecast/skforecast- │ │ datasets/main/data/bike_sharing_dataset_clean.csv │ │ │ │ Shape: 17544 rows x 12 columns │ ╰─────────────────────────────────────────────────────────────────────────────────╯
# Preprocess data
# ==============================================================================
data['date_time'] = pd.to_datetime(data['date_time'], format='%Y-%m-%d %H:%M:%S')
data = data.set_index('date_time')
data = data.asfreq('h')
data = data.sort_index()
data = data[['holiday', 'weather', 'temp', 'hum', 'users']]
data['holiday'] = data['holiday'].astype(int)
data[['holiday', 'weather']] = data[['holiday', 'weather']].astype(str)
print(data.dtypes)
data.head(3)
holiday object weather object temp float64 hum float64 users float64 dtype: object
| holiday | weather | temp | hum | users | |
|---|---|---|---|---|---|
| date_time | |||||
| 2011-01-01 00:00:00 | 0 | clear | 9.84 | 81.0 | 16.0 |
| 2011-01-01 01:00:00 | 0 | clear | 9.02 | 80.0 | 40.0 |
| 2011-01-01 02:00:00 | 0 | clear | 9.02 | 80.0 | 32.0 |
Only a subset of the available data is used to keep execution time short. The training period covers June–July 2012 and the test period covers the first half of August 2012. This window is long enough to contain realistic variation in both categorical features (holiday and weather).
# Split train-test
# ==============================================================================
start_train = '2012-06-01 00:00:00'
end_train = '2012-07-31 23:59:00'
end_test = '2012-08-15 23:59:00'
data_train = data.loc[start_train:end_train, :]
data_test = data.loc[end_train:end_test, :]
print(
f"Dates train : {data_train.index.min()} --- {data_train.index.max()}"
f" (n={len(data_train)})"
)
print(
f"Dates test : {data_test.index.min()} --- {data_test.index.max()}"
f" (n={len(data_test)})"
)
Dates train : 2012-06-01 00:00:00 --- 2012-07-31 23:00:00 (n=1464) Dates test : 2012-08-01 00:00:00 --- 2012-08-15 23:00:00 (n=360)
Built-in categorical features handling¶
New in version 0.22.0
The categorical_features parameter allows the forecaster to handle categorical exogenous variables internally. This removes the need to build encoder pipelines in transformer_exog or configure estimator-specific parameters manually.
| categorical_features | Behaviour |
|---|---|
| 'auto' (default) | Any exogenous column with a non-numeric, non-boolean dtype (after applying transformer_exog) is treated as categorical. |
| list | Only the listed column names are treated as categorical. Numeric columns can also be included in the list. |
| None | No internal encoding is applied. |
For the most popular gradient boosting frameworks (LightGBM, XGBoost, HistGradientBoostingRegressor, CatBoost), the forecaster also automatically configures the estimator's native categorical support; no fit_kwargs or estimator-level parameters are needed.
✏️ Note
Internally, the forecaster applies OrdinalEncoder to the categorical features, converting them to float codes. This avoids errors when fitting the internal estimator. For estimators that natively support categorical features (such as LightGBM, XGBoost, CatBoost, and HistGradientBoosting), the forecaster also passes the list of categorical column indices to the estimator, so they are treated as categorical rather than numeric. For all other estimators, the integer-encoded values are passed as-is and treated as continuous numeric features.
┌─────────────────────────────────────────────────────────┐
│ User provides exog │
│ (with categorical string columns) │
└───────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ transformer_exog applied │
│ (if configured by the user) │
└───────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ categorical_features detection │
│ │
│ 'auto' → detect non-numeric, non-boolean columns │
│ list → use explicit column names │
│ None → skip (user manages encoding) │
└───────────────────────┬─────────────────────────────────┘
│
┌───────────┴───────────┐
▼ ▼
Categorical cols found No categorical cols
│ │
▼ ▼
┌───────────────────────┐ ┌───────────────────────┐
│ OrdinalEncoder │ │ Pass data as-is to │
│ (internal) │ │ estimator │
│ │ └───────────────────────┘
│ string → float codes │
│ 'a' → 0.0 │
│ 'b' → 1.0 │
│ 'c' → 2.0 │
└───────────┬───────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ Configure estimator native categorical support │
│ │
│ LightGBM → fit(categorical_feature=[indices]) │
│ CatBoost → fit(cat_features=[indices]) │
│ XGBoost → set_params(feature_types, enable_cat) │
│ HistGBR → set_params(categorical_features) │
│ Other → no-op (integers treated as numeric) │
└───────────────────────┬─────────────────────────────────┘
│
▼
┌─────────────────────────────────────────────────────────┐
│ estimator.fit(X_train, y_train) │
│ (float-encoded cats + native cat config) │
└─────────────────────────────────────────────────────────┘
⚠ Warning
The internal OrdinalEncoder is configured with unknown_value=np.nan. This means that any category encountered during predict() that was not seen during fit() will be encoded as NaN. Whether this causes an error depends on the estimator: NaN-tolerant models (LightGBM, CatBoost, XGBoost with tree_method='hist', HistGradientBoosting) will handle it gracefully, but other estimators will raise an error.
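The behaviour described in the warning can be reproduced in isolation with scikit-learn's OrdinalEncoder, configured the same way (the toy 'weather' data is illustrative):

```python
# Categories unseen during fit are encoded as NaN when
# handle_unknown='use_encoded_value' and unknown_value=np.nan.
import numpy as np
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=np.nan)
encoder.fit(pd.DataFrame({'weather': ['clear', 'mist', 'rain']}))

# 'snow' was never seen during fit -> encoded as NaN
codes = encoder.transform(pd.DataFrame({'weather': ['clear', 'snow']}))
print(codes)  # 'clear' -> 0.0, 'snow' -> nan
```

Whether the resulting NaN is acceptable then depends entirely on the downstream estimator, as noted above.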
LightGBM¶
# Forecaster with lightgbm and categorical feature handling
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']
forecaster = ForecasterRecursive(
estimator = LGBMRegressor(random_state=123, verbose=-1),
lags = 5,
categorical_features = 'auto' # Detects any non-numeric column
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
forecaster
ForecasterRecursive
General Information
- Estimator: LGBMRegressor
- Lags: [1 2 3 4 5]
- Window features: None
- Window size: 5
- Series name: users
- Exogenous included: True
- Categorical features: auto
- Weight function included: False
- Differentiation order: None
- Drop NaN from series: False
- Creation date: 2026-04-23 10:35:45
- Last fit date: 2026-04-23 10:35:45
- Skforecast version: 0.22.0
- Python version: 3.14.3
- Forecaster id: None
Exogenous Variables
holidayCAT, weatherCAT, temp, hum
Data Transformations
- Transformer for y: None
- Transformer for exog: None
Training Information
- Training range: [Timestamp('2011-01-01 00:00:00'), Timestamp('2012-07-31 23:00:00')]
- Training index type: DatetimeIndex
- Training index frequency:
Estimator Parameters
-
{'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
-
{}
# Categorical features detected and the encoding applied
# ==============================================================================
print('Detected categorical features:', forecaster.categorical_features_names_in_)
print('Encoding mapping:')
for feature, cats in zip(
forecaster.categorical_features_names_in_,
forecaster.categorical_encoder.categories_
):
mapping = {cat: float(i) for i, cat in enumerate(cats)}
print(f" '{feature}': {mapping}")
Detected categorical features: ['holiday', 'weather']
Encoding mapping:
'holiday': {'0': 0.0, '1': 1.0}
'weather': {'clear': 0.0, 'mist': 1.0, 'rain': 2.0}
Exploring the training matrices, one can see that categorical features are encoded as floats. This is required since skforecast makes internal use of numpy arrays, which do not support mixed data types. However, if the estimator supports native categorical features, the forecaster automatically configures it to treat these columns as categorical, so the model can still leverage the categorical nature of the data.
# Training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
print(X_train.dtypes)
X_train.head(3)
lag_1 float64 lag_2 float64 lag_3 float64 lag_4 float64 lag_5 float64 holiday float64 weather float64 temp float64 hum float64 dtype: object
| lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday | weather | temp | hum | |
|---|---|---|---|---|---|---|---|---|---|
| date_time | |||||||||
| 2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0.0 | 1.0 | 9.84 | 75.0 |
| 2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0.0 | 0.0 | 9.02 | 80.0 |
| 2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | 0.0 | 0.0 | 8.20 | 86.0 |
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00 88.946940 2012-08-01 01:00:00 59.848451 2012-08-01 02:00:00 28.870817 Freq: h, Name: pred, dtype: float64
The categorical_features parameter also accepts an explicit list of column names. Numeric columns may be included in this list, forcing the model to treat them as categorical even though they have an integer or float dtype.
# Explicit list of categorical column names
# ==============================================================================
forecaster = ForecasterRecursive(
estimator = LGBMRegressor(random_state=123, verbose=-1),
lags = 5,
categorical_features = ['holiday', 'weather'] # Explicitly specified
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
print('Detected categorical features:', forecaster.categorical_features_names_in_)
Detected categorical features: ['holiday', 'weather']
XGBoost¶
# Forecaster with xgboost and categorical feature handling
# ==============================================================================
forecaster = ForecasterRecursive(
estimator = XGBRegressor(random_state=123),
lags = 5,
categorical_features = 'auto'
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
print('Detected categorical features:', forecaster.categorical_features_names_in_)
Detected categorical features: ['holiday', 'weather']
HistGradientBoostingRegressor¶
# Forecaster with HistGradientBoostingRegressor and categorical feature handling
# ==============================================================================
forecaster = ForecasterRecursive(
estimator = HistGradientBoostingRegressor(random_state=123),
lags = 5,
categorical_features = 'auto'
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
print('Detected categorical features:', forecaster.categorical_features_names_in_)
Detected categorical features: ['holiday', 'weather']
CatBoost¶
# Forecaster with CatBoostRegressor and categorical feature handling
# ==============================================================================
forecaster = ForecasterRecursive(
estimator = CatBoostRegressor(random_state=123, verbose=0, allow_writing_files=False),
lags = 5,
categorical_features = 'auto'
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
print('Detected categorical features:', forecaster.categorical_features_names_in_)
Detected categorical features: ['holiday', 'weather']
categorical_features and transformer_exog¶
The arguments categorical_features and transformer_exog can be used together. The former handles categorical features internally, while the latter applies custom transformations (including encoding) to the exogenous data before it reaches the estimator. Users can therefore apply any desired transformations to their exogenous features with transformer_exog and then rely on categorical_features to automatically detect and handle any categorical columns remaining in the transformed data.
For example, it is possible to apply a StandardScaler to some of the numerical columns in the exogenous data using transformer_exog, and then use categorical_features='auto' to let the forecaster automatically handle the categorical columns.
✏️ Note
Because transformer_exog is applied before categorical_features detection, ensure that the transformer does not convert categorical columns to numeric, otherwise categorical_features='auto' will no longer detect them.
# Example of how to use transformer_exog together with categorical_features
# ==============================================================================
# Scale 'temp' and 'hum' columns, while leaving 'holiday' and 'weather' unchanged.
# Then, use categorical_features='auto' to let the forecaster automatically handle
# the categorical columns.
transformer_exog = make_column_transformer(
(StandardScaler(), ['temp', 'hum']),
remainder='passthrough',
verbose_feature_names_out=False
).set_output(transform='pandas')
forecaster = ForecasterRecursive(
estimator = LGBMRegressor(random_state=123, verbose=-1),
lags = 5,
transformer_exog = transformer_exog,
categorical_features = 'auto' # Detects any non-numeric column
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
# Training matrices with transformer_exog applied and categorical features detected
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
X_train.head(3)
| lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | temp | hum | holiday | weather | |
|---|---|---|---|---|---|---|---|---|---|
| date_time | |||||||||
| 2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | -1.286295 | 0.633839 | 0.0 | 1.0 |
| 2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | -1.387480 | 0.885040 | 0.0 | 0.0 |
| 2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | -1.488665 | 1.186482 | 0.0 | 0.0 |
Categorical encoding via transformer_exog¶
The built-in categorical_features parameter covers the most common use case: gradient boosting estimators with native categorical support. For other scenarios (linear models, SVMs, non-gradient-boosting tree ensembles) or when precise control over encoding behaviour is required (custom category orderings, infrequent-category grouping, or target encoding), the recommended approach is to pass a scikit-learn ColumnTransformer via transformer_exog.
The following sections demonstrate three encoding strategies: One-hot encoding, Ordinal encoding, and Target encoding.
✏️ Note
transformer_exog is applied before categorical_features detection. If transformer_exog already converts all categorical columns to numeric, categorical_features = 'auto' (the default) will find nothing to encode and will have no effect. Setting categorical_features = None explicitly is not required in this case, but can be used to make the intent clearer.
One-hot encoding¶
One-hot encoding, also known as dummy encoding or one-of-K encoding, consists of replacing the categorical variable with a set of binary variables that take the value 0 or 1 to indicate whether a particular category is present in an observation. For example, suppose a dataset contains a categorical variable called "color" with the possible values of "red," "blue," and "green". Using one-hot encoding, this variable is converted into three binary variables such as color_red, color_blue, and color_green, where each variable takes a value of 0 or 1 depending on the category.
The OneHotEncoder class in scikit-learn can be used to transform any categorical feature with n possible values into n new binary features, where one of them takes the value 1, and all the others take the value 0. The OneHotEncoder can be configured to handle certain corner cases, including unknown categories, missing values, and infrequent categories.
- When handle_unknown='ignore' and drop is not None, unknown categories are encoded as zeros. Additionally, if a feature contains both np.nan and None, they are considered separate categories.
- It supports the aggregation of infrequent categories into a single output for each feature. The parameters to enable the aggregation of infrequent categories are min_frequency and max_categories. By setting handle_unknown to 'infrequent_if_exist', unknown categories are considered infrequent.
- To avoid collinearity between features, it is possible to drop one of the categories per feature using the drop argument. This is especially important when using linear models.
ColumnTransformers in scikit-learn provide a powerful way to define transformations and apply them to specific features. By encapsulating the OneHotEncoder in a ColumnTransformer object, it can be passed to a forecaster using the transformer_exog argument.
One-hot encoding is particularly well-suited for linear models, support vector machines, and neural networks. For tree-based models, the built-in categorical_features parameter is generally the preferred approach.
# ColumnTransformer with one-hot encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical features (non-numerical)
# using one-hot encoding. Numeric features are left untouched. For binary
# features, only one column is created.
one_hot_encoder = make_column_transformer(
(
OneHotEncoder(sparse_output=False, drop='if_binary'),
make_column_selector(dtype_exclude=np.number)
),
remainder="passthrough",
verbose_feature_names_out=False,
).set_output(transform="pandas")
# Create and fit forecaster with a transformer for exogenous features
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']
forecaster = ForecasterRecursive(
estimator = LGBMRegressor(random_state=123, verbose=-1),
lags = 5,
transformer_exog = one_hot_encoder
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
forecaster
ForecasterRecursive
General Information
- Estimator: LGBMRegressor
- Lags: [1 2 3 4 5]
- Window features: None
- Window size: 5
- Series name: users
- Exogenous included: True
- Categorical features: auto
- Weight function included: False
- Differentiation order: None
- Drop NaN from series: False
- Creation date: 2026-04-23 10:35:49
- Last fit date: 2026-04-23 10:35:49
- Skforecast version: 0.22.0
- Python version: 3.14.3
- Forecaster id: None
Exogenous Variables
holiday, weather, temp, hum
Data Transformations
- Transformer for y: None
- Transformer for exog: ColumnTransformer(remainder='passthrough',
                 transformers=[('onehotencoder',
                                OneHotEncoder(drop='if_binary',
                                              sparse_output=False),
                                <sklearn.compose._column_transformer.make_column_selector object at 0x17f6a12b0>)],
                 verbose_feature_names_out=False)
Training Information
- Training range: [Timestamp('2011-01-01 00:00:00'), Timestamp('2012-07-31 23:00:00')]
- Training index type: DatetimeIndex
- Training index frequency:
Estimator Parameters
-
{'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
-
{}
Once the forecaster has been trained, the transformer can be inspected by accessing the transformer_exog attribute.
# Access to the transformer used for exogenous features
# ==============================================================================
print(forecaster.transformer_exog.get_feature_names_out())
forecaster.transformer_exog
['holiday_1' 'weather_clear' 'weather_mist' 'weather_rain' 'temp' 'hum']
ColumnTransformer(remainder='passthrough',
transformers=[('onehotencoder',
OneHotEncoder(drop='if_binary',
sparse_output=False),
<sklearn.compose._column_transformer.make_column_selector object at 0x17f6a12b0>)],
                  verbose_feature_names_out=False)
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00 88.946940 2012-08-01 01:00:00 59.848451 2012-08-01 02:00:00 28.870817 Freq: h, Name: pred, dtype: float64
✏️ Note
Use the create_train_X_y() method to inspect the exact feature matrix that the forecaster passes to the estimator during training. This is a useful debugging tool to verify that the transformer is applied correctly and that the final column names and dtypes match your expectations.
# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
print(X_train.dtypes)
X_train.head()
lag_1 float64 lag_2 float64 lag_3 float64 lag_4 float64 lag_5 float64 holiday_1 float64 weather_clear float64 weather_mist float64 weather_rain float64 temp float64 hum float64 dtype: object
| lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday_1 | weather_clear | weather_mist | weather_rain | temp | hum | |
|---|---|---|---|---|---|---|---|---|---|---|---|
| date_time | |||||||||||
| 2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0.0 | 0.0 | 1.0 | 0.0 | 9.84 | 75.0 |
| 2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0.0 | 1.0 | 0.0 | 0.0 | 9.02 | 80.0 |
| 2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | 0.0 | 1.0 | 0.0 | 0.0 | 8.20 | 86.0 |
| 2011-01-01 08:00:00 | 3.0 | 2.0 | 1.0 | 1.0 | 13.0 | 0.0 | 1.0 | 0.0 | 0.0 | 9.84 | 75.0 |
| 2011-01-01 09:00:00 | 8.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 13.12 | 76.0 |
Ordinal encoding¶
Ordinal encoding converts categorical variables into integers by assigning each category a unique numeric code. Unlike one-hot encoding, it produces a single column per feature, which makes it memory-efficient. The encoding is most appropriate when categories have a meaningful natural order (e.g., low < medium < high), or when used with tree-based models that do not misinterpret the numeric codes as continuous magnitudes.
The scikit-learn library provides the OrdinalEncoder class, which assigns integers from 0 to n_categories-1. By default, the order is determined alphabetically. It is important to note that this ordering is arbitrary for unordered categories; users should specify explicit categories if a meaningful order exists. The class also exposes the encoded_missing_value parameter to handle missing values, and handle_unknown='use_encoded_value' to gracefully deal with unseen categories at prediction time.
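When the categories do have a natural order, passing it explicitly via the categories parameter ensures the numeric codes respect it. A minimal sketch with a made-up 'size' feature (alphabetical order would otherwise yield 'high' < 'low' < 'medium'):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

X = pd.DataFrame({'size': ['medium', 'low', 'high', 'low']})

# Explicit ordering: low -> 0, medium -> 1, high -> 2
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
print(encoder.fit_transform(X).ravel())  # [1. 0. 2. 0.]
```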
Using OrdinalEncoder via transformer_exog gives full control over encoding parameters: for example, setting unknown_value=-1 (instead of the NaN used by the built-in categorical_features encoder), specifying explicit category orderings, or choosing dtype=int instead of float.
# ColumnTransformer with ordinal encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using ordinal encoding. Numeric features are left untouched. Missing values
# are coded as -1. If a new category is found in the test set, it is encoded
# as -1.
ordinal_encoder = make_column_transformer(
(
OrdinalEncoder(
handle_unknown='use_encoded_value',
unknown_value=-1,
encoded_missing_value=-1
),
make_column_selector(dtype_exclude=np.number)
),
remainder="passthrough",
verbose_feature_names_out=False,
).set_output(transform="pandas")
# Create and fit a forecaster with a transformer for exogenous features
# ==============================================================================
exog_features = ['holiday', 'weather', 'temp', 'hum']
forecaster = ForecasterRecursive(
estimator = LGBMRegressor(random_state=123, verbose=-1),
lags = 5,
transformer_exog = ordinal_encoder
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
# Create training matrices
# ==============================================================================
X_train, y_train = forecaster.create_train_X_y(
y = data.loc[:end_train, 'users'],
exog = data.loc[:end_train, exog_features]
)
print(X_train.dtypes)
X_train.head()
lag_1 float64 lag_2 float64 lag_3 float64 lag_4 float64 lag_5 float64 holiday float64 weather float64 temp float64 hum float64 dtype: object
| lag_1 | lag_2 | lag_3 | lag_4 | lag_5 | holiday | weather | temp | hum | |
|---|---|---|---|---|---|---|---|---|---|
| date_time | |||||||||
| 2011-01-01 05:00:00 | 1.0 | 13.0 | 32.0 | 40.0 | 16.0 | 0.0 | 1.0 | 9.84 | 75.0 |
| 2011-01-01 06:00:00 | 1.0 | 1.0 | 13.0 | 32.0 | 40.0 | 0.0 | 0.0 | 9.02 | 80.0 |
| 2011-01-01 07:00:00 | 2.0 | 1.0 | 1.0 | 13.0 | 32.0 | 0.0 | 0.0 | 8.20 | 86.0 |
| 2011-01-01 08:00:00 | 3.0 | 2.0 | 1.0 | 1.0 | 13.0 | 0.0 | 0.0 | 9.84 | 75.0 |
| 2011-01-01 09:00:00 | 8.0 | 3.0 | 2.0 | 1.0 | 1.0 | 0.0 | 0.0 | 13.12 | 76.0 |
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=data_test[exog_features])
2012-08-01 00:00:00 89.096098 2012-08-01 01:00:00 57.749964 2012-08-01 02:00:00 29.263922 Freq: h, Name: pred, dtype: float64
Target encoding¶
Target encoding is a technique that encodes categorical variables based on the relationship between the categories and the target variable. Each category is encoded based on a shrinkage estimate of the average target values for observations belonging to that category. The encoding scheme mixes the global target mean with the target mean conditioned on the value of the category.
For example, suppose a categorical variable "City" with categories "New York," "Los Angeles," and "Chicago," and a target variable "Salary." One can calculate the mean salary for each city based on the training data, and use these mean values to encode the categories.
This encoding scheme is useful for categorical features with high cardinality, where one-hot encoding would inflate the feature space, making it more expensive for a downstream model to process. A classic example of high-cardinality categorical variables is location data, such as zip codes or regions.
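The core idea can be made concrete with an unsmoothed version computed directly in pandas, using the (made-up) City/Salary example from above. Note that scikit-learn's TargetEncoder additionally shrinks each category mean toward the global mean and uses cross-fitting during fit_transform; this sketch shows only the plain per-category mean.

```python
import pandas as pd

df = pd.DataFrame({
    'city':   ['New York', 'New York', 'Los Angeles', 'Chicago', 'Chicago'],
    'salary': [90.0, 110.0, 80.0, 60.0, 70.0],
})

# Unsmoothed target encoding: replace each category with its mean target value
means = df.groupby('city')['salary'].mean()
df['city_encoded'] = df['city'].map(means)
print(df['city_encoded'].tolist())  # [100.0, 100.0, 80.0, 65.0, 65.0]
```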
The TargetEncoder class is available in Scikit-learn (since version 1.3). TargetEncoder considers missing values, such as np.nan or None, as another category and encodes them like any other category. Categories that are not seen during fit are encoded with the target mean, i.e. target_mean_. A more detailed description of target encoding can be found in the scikit-learn user guide.
⚠ Warning
TargetEncoder differs from the other transformers in scikit-learn in that it requires not only the features to be transformed but also the response variable (the target); in the context of forecasting, this is the time series itself. Currently, the only transformers allowed inside forecasters are those that do not require the target variable to be fitted. Therefore, to use target encoding, the transformation must be applied outside the forecaster object.
# ColumnTransformer with target encoding
# ==============================================================================
# A ColumnTransformer is used to transform categorical (not numerical) features
# using target encoding. Numeric features are left untouched. TargetEncoder
# considers missing values, such as np.nan or None, as another category and
# encodes them like any other category. Categories that are not seen during fit
# are encoded with the target mean
target_encoder = make_column_transformer(
(
TargetEncoder(
categories = 'auto',
target_type = 'continuous',
smooth = 'auto',
random_state = 9874
),
make_column_selector(dtype_exclude=np.number)
),
remainder="passthrough",
verbose_feature_names_out=False,
).set_output(transform="pandas")
# Transform the exogenous features using the transformer outside the forecaster
# ==============================================================================
exog_transformed = target_encoder.fit_transform(
X = data.loc[:end_train, exog_features],
y = data.loc[:end_train, 'users']
)
exog_transformed.head()
| holiday | weather | temp | hum | |
|---|---|---|---|---|
| date_time | ||||
| 2011-01-01 00:00:00 | 172.823951 | 188.121327 | 9.84 | 81.0 |
| 2011-01-01 01:00:00 | 172.607889 | 187.330734 | 9.02 | 80.0 |
| 2011-01-01 02:00:00 | 173.476675 | 189.423278 | 9.02 | 80.0 |
| 2011-01-01 03:00:00 | 172.823951 | 188.121327 | 9.84 | 75.0 |
| 2011-01-01 04:00:00 | 172.823951 | 188.121327 | 9.84 | 75.0 |
# Transform test exog using the already-fitted encoder
# ==============================================================================
# Use .transform() (not .fit_transform()) so no target data is needed and
# training distribution is preserved.
exog_test_transformed = target_encoder.transform(data_test[exog_features])
exog_test_transformed.head(3)
| holiday | weather | temp | hum | |
|---|---|---|---|---|
| date_time | ||||
| 2012-08-01 00:00:00 | 172.848487 | 188.530168 | 27.88 | 79.0 |
| 2012-08-01 01:00:00 | 172.848487 | 188.530168 | 27.06 | 83.0 |
| 2012-08-01 02:00:00 | 172.848487 | 188.530168 | 26.24 | 83.0 |
# Create and fit forecaster with pre-transformed exogenous features
# ==============================================================================
forecaster = ForecasterRecursive(
estimator = LGBMRegressor(random_state=123, verbose=-1),
lags = 5
)
forecaster.fit(
y = data.loc[:end_train, 'users'],
exog = exog_transformed # pre-transformed training exog (all numeric)
)
# Predictions
# ==============================================================================
forecaster.predict(steps=3, exog=exog_test_transformed)
2012-08-01 00:00:00 90.744834 2012-08-01 01:00:00 61.148656 2012-08-01 02:00:00 30.500134 Freq: h, Name: pred, dtype: float64