Classification forecasting for categorical time series¶
Recursive multi-step forecasting is a technique used to predict future values in a time series by using previous predictions as inputs for subsequent forecasts.
In the context of categorical time series, where the values belong to discrete categories rather than continuous values, the model is trained to predict the probability distribution over the possible categories for each future time step. The most likely category for the current time step is then used as input for predicting the next time step, and this process continues recursively.
Although the process is conceptually very similar to that used for forecasting continuous time series, many challenges arise due the need to encode and decode categorical variables, as well as allowing the underlying machine learning model to handle categorical data effectively.
Skforecast provides the class ForecasterRecursiveClassifier to facilitate recursive multi-step forecasting for categorical time series, automatically managing the encoding and decoding of categorical variables, and integrating seamlessly with various machine learning models that support classification tasks.
Libraries and data¶
# Libraries
# ==============================================================================
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Patch
from matplotlib.colors import ListedColormap, BoundaryNorm
from lightgbm import LGBMClassifier
from sklearn.ensemble import HistGradientBoostingClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.calibration import CalibratedClassifierCV, CalibrationDisplay, calibration_curve
from skforecast.datasets import fetch_dataset
from skforecast.preprocessing import RollingFeaturesClassification
from skforecast.recursive import ForecasterRecursiveClassifier
from skforecast.model_selection import (
TimeSeriesFold, OneStepAheadFold, backtesting_forecaster, bayesian_search_forecaster
)
from skforecast.plot import set_dark_theme
# Data download
# ==============================================================================
data = fetch_dataset(name='vic_electricity_classification')
data
╭──────────────────────── vic_electricity_classification ────────────────────────╮ │ Description: │ │ Hourly electricity demand for Victoria, Australia classified into three │ │ categories: 'low', 'medium' and 'high' according to the 20th and 80th │ │ percentiles. The dataset also includes temperature, holiday indicator and hour │ │ of the day as features. │ │ │ │ Source: │ │ O'Hara-Wild M, Hyndman R, Wang E, Godahewa R (2022).tsibbledata: Diverse │ │ Datasets for 'tsibble'. https://tsibbledata.tidyverts.org/, │ │ https://github.com/tidyverts/tsibbledata/. │ │ https://tsibbledata.tidyverts.org/reference/vic_elec.html │ │ │ │ URL: │ │ https://raw.githubusercontent.com/skforecast/skforecast- │ │ datasets/main/data/vic_electricity_classification.csv │ │ │ │ Shape: 8736 rows x 5 columns │ ╰────────────────────────────────────────────────────────────────────────────────╯
| Demand | Temperature | Holiday | Hour_of_day | Demand_raw | |
|---|---|---|---|---|---|
| Time | |||||
| 2014-01-01 01:00:00 | low | 24.80 | 1.0 | 1 | 3730.065980 |
| 2014-01-01 02:00:00 | medium | 25.90 | 1.0 | 2 | 3858.473927 |
| 2014-01-01 03:00:00 | medium | 25.30 | 1.0 | 3 | 3851.415438 |
| 2014-01-01 04:00:00 | medium | 23.65 | 1.0 | 4 | 3838.972926 |
| 2014-01-01 05:00:00 | medium | 20.70 | 1.0 | 5 | 3860.540970 |
| ... | ... | ... | ... | ... | ... |
| 2014-12-30 20:00:00 | low | 12.10 | 0.0 | 20 | 3527.232855 |
| 2014-12-30 21:00:00 | medium | 12.30 | 0.0 | 21 | 3846.439766 |
| 2014-12-30 22:00:00 | medium | 14.05 | 0.0 | 22 | 3961.529675 |
| 2014-12-30 23:00:00 | medium | 16.60 | 0.0 | 23 | 4059.288011 |
| 2014-12-31 00:00:00 | medium | 17.70 | 0.0 | 0 | 4071.800790 |
8736 rows × 5 columns
# Distribution of classes in the series
# ==============================================================================
print("Distribution of classes:")
data['Demand'].value_counts(normalize=True).round(2)
Distribution of classes:
Demand medium 0.6 low 0.2 high 0.2 Name: proportion, dtype: float64
# Split train-test
# ==============================================================================
end_train = '2014-10-31 23:59:00'
data_train = data[:end_train].copy()
data_test = data[end_train:].copy()
print(
f"Train dates : {data_train.index.min()} --- "
f"{data_train.index.max()} (n={len(data_train)})"
)
print(
f"Test dates : {data_test.index.min()} --- "
f"{data_test.index.max()} (n={len(data_test)})"
)
Train dates : 2014-01-01 01:00:00 --- 2014-10-31 23:00:00 (n=7295) Test dates : 2014-11-01 00:00:00 --- 2014-12-31 00:00:00 (n=1441)
Plotting categorical time series¶
Visualizing categorical time series can be challenging due to the discrete nature of the data. Most of the standard plotting libraries require numerical inputs to create plots which means that categorical data needs to be encoded into numerical format for visualization purposes.
# Data preprocessing
# ==============================================================================
# Encode categorical variable as numerical for plotting
encode_mapping = {'low': 0, 'medium': 1, 'high': 2}
print(f"Encoding map: {encode_mapping}")
data['encoded_value'] = data['Demand'].map(encode_mapping).astype(float)
data_train['encoded_value'] = data_train['Demand'].map(encode_mapping).astype(float)
data_test['encoded_value'] = data_test['Demand'].map(encode_mapping).astype(float)
data.head(3)
Encoding map: {'low': 0, 'medium': 1, 'high': 2}
| Demand | Temperature | Holiday | Hour_of_day | Demand_raw | encoded_value | |
|---|---|---|---|---|---|---|
| Time | ||||||
| 2014-01-01 01:00:00 | low | 24.8 | 1.0 | 1 | 3730.065980 | 0.0 |
| 2014-01-01 02:00:00 | medium | 25.9 | 1.0 | 2 | 3858.473927 | 1.0 |
| 2014-01-01 03:00:00 | medium | 25.3 | 1.0 | 3 | 3851.415438 | 1.0 |
# Plot: line plot
# ==============================================================================
set_dark_theme()
fig, ax = plt.subplots(figsize=(8, 3))
data_train['encoded_value'].plot(ax=ax, label='train')
data_test['encoded_value'].plot(ax=ax, label='test')
y_ticks = list(encode_mapping.values())
y_labels = list(encode_mapping.keys())
ax.set_yticks(y_ticks)
ax.set_yticklabels(y_labels)
ax.set_title("Categorical Time Series (States)")
ax.set_xlabel("Date")
ax.legend();
# Plot: 1-row heatmap-style scatter
# ==================================================================================================
colors = ["skyblue", "gold", "tomato"]
cmap = ListedColormap(colors)
norm = BoundaryNorm([0, 1, 2, 3], cmap.N)
stripe_height = 20
state_array = np.tile(data["encoded_value"].values, (stripe_height, 1))
train_mask = data.index <= pd.to_datetime(end_train)
train_indices = np.where(train_mask)[0]
test_indices = np.where(~train_mask)[0]
fig, ax = plt.subplots(figsize=(8, 2.5))
ax.imshow(
state_array,
aspect="auto",
cmap=cmap,
norm=norm,
extent=[0, len(data) - 1, 0, stripe_height],
)
ax.hlines(y=30, xmin=train_indices[0], xmax=train_indices[-1], color="#30a2da", linewidth=4)
ax.hlines(y=30, xmin=test_indices[0], xmax=test_indices[-1], color="#fc4f30", linewidth=4)
ax.set_yticks([])
ax.set_frame_on(False)
tick_indices = np.linspace(0, len(data) - 1, 10, dtype=int)
ax.set_xticks(tick_indices)
ax.set_xticklabels([data.index[i].strftime("%Y-%m") for i in tick_indices], rotation=0, ha="center")
ax.set_xlabel("Date")
ax.set_title("Categorical Daily States Over Time", pad=8)
cbar = fig.colorbar(
plt.cm.ScalarMappable(cmap=cmap, norm=norm),
ax=ax,
orientation="horizontal",
ticks=[0.5, 1.5, 2.5],
pad=0.3, # space between plot and colorbar
fraction=0.05, # thickness
aspect=20 # length-to-height ratio
)
cbar.outline.set_visible(False)
cbar.ax.set_xticklabels(["low", "medium", "high"], fontsize=8)
cbar.set_label("", labelpad=2, fontsize=9)
legend_elements = [
Patch(facecolor="#30a2da", label="Train"),
Patch(facecolor="#fc4f30", label="Test"),
]
ax.legend(handles=legend_elements, loc="upper right", frameon=False, ncol=2, borderaxespad=0.5);
# Plot: states timeline
# ==============================================================================
fig, ax = plt.subplots(figsize=(8, 3))
color_map = {"low": "skyblue", "medium": "gold", "high": "tomato"}
for state, color in color_map.items():
mask = data["Demand"] == state
ax.plot(
data.index[mask], [state] * mask.sum(), '|', color=color, markersize=8, label=state
)
ax.set_yticks(y_ticks)
ax.set_title("Categorical States Timeline")
ax.set_xlabel("Date")
ax.legend()
plt.tight_layout()
plt.show()
Create and train forecaster¶
When working with categorical time series, a common challenge arises: not only is the target variable categorical, but the lagged values used as predictors are also categorical, as they represent previous observations of the target variable.
Most classification algorithms can handle categorical target variables; however, many cannot directly process categorical predictors. In such cases, it is necessary to encode the categorical features into a numerical format prior to training the model.
The features_encoding parameter in the ForecasterRecursiveClassifier class specifies how the categorical features derived from the target series (e.g., lagged values and window features) are encoded before fitting the underlying machine learning model. The parameter accepts the following options:
auto: Use a categorical data type if the classifier supports native categorical features (such as LightGBM, CatBoost, XGBoost, or HistGradientBoostingClassifier). Otherwise, fallback to numerical encoding.categorical: Enforce the use of a categorical data type. This requires a classifier capable of handling native categorical features.ordinal: Apply ordinal encoding (0, 1, 2, …). In this case, the classifier interprets the encoded values as numeric, implicitly assuming an ordinal relationship among the categories (for example, 'low' < 'medium' < 'high').
Proper encoding of categorical features is crucial to ensure that the model can accurately interpret and leverage the information contained in previous observations of the target series.
💡 Tip
The ForecasterRecursiveClassifier class allows the inclusion of window features through the RollingFeaturesClassification parameter. These features can capture patterns over a specified number of previous time steps, potentially enhancing the model's predictive capabilities.
The available statistics are:
proportion: The relative frequency or ratio of each class within the window.mode: The most frequent class (value) observed in the window.entropy: A measure of the diversity or randomness of the classes.n_changes: The number of times the classes changes within the window.n_unique: The count of unique classes present in the window.
To learn more about window features visit the Window and custom features user guide.
# Create and fit forecaster with rolling window features
# ==============================================================================
window_features = RollingFeaturesClassification(stats=['proportion'], window_sizes=72)
forecaster = ForecasterRecursiveClassifier(
estimator = LGBMClassifier(random_state=123, verbose=-1),
lags = 24,
window_features = window_features,
features_encoding = 'auto',
)
forecaster.fit(
y = data_train['Demand'],
exog = data_train[['Temperature', 'Holiday', 'Hour_of_day']]
)
forecaster
ForecasterRecursiveClassifier
General Information
- Estimator: LGBMClassifier
- Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
- Window features: ['roll_proportion_72']
- Window size: 72
- Series name: Demand
- Exogenous included: True
- Weight function included: False
- Creation date: 2025-11-28 19:39:52
- Last fit date: 2025-11-28 19:39:55
- Skforecast version: 0.19.0
- Python version: 3.12.11
- Forecaster id: None
Classification Information
- Classes: ['high', 'low', 'medium']
- Class encoding: {'high': 0, 'low': 1, 'medium': 2}
Exogenous Variables
- Exogenous names: Temperature, Holiday, Hour_of_day
- Transformer for exog: None
Training Information
- Training range: [Timestamp('2014-01-01 01:00:00'), Timestamp('2014-10-31 23:00:00')]
- Training index type: DatetimeIndex
- Training index frequency:
Estimator Parameters
-
{'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
-
{}
Prediction¶
# Predict
# ==============================================================================
predictions = forecaster.predict(
steps = 24,
exog = data_test[['Temperature', 'Holiday', 'Hour_of_day']]
)
predictions
2014-11-01 00:00:00 medium 2014-11-01 01:00:00 medium 2014-11-01 02:00:00 medium 2014-11-01 03:00:00 medium 2014-11-01 04:00:00 medium 2014-11-01 05:00:00 medium 2014-11-01 06:00:00 medium 2014-11-01 07:00:00 medium 2014-11-01 08:00:00 medium 2014-11-01 09:00:00 high 2014-11-01 10:00:00 high 2014-11-01 11:00:00 medium 2014-11-01 12:00:00 medium 2014-11-01 13:00:00 medium 2014-11-01 14:00:00 medium 2014-11-01 15:00:00 medium 2014-11-01 16:00:00 medium 2014-11-01 17:00:00 low 2014-11-01 18:00:00 low 2014-11-01 19:00:00 low 2014-11-01 20:00:00 low 2014-11-01 21:00:00 medium 2014-11-01 22:00:00 medium 2014-11-01 23:00:00 medium Freq: h, Name: pred, dtype: object
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_test.loc[predictions.index, 'encoded_value'].plot(ax=ax, label='test')
(
predictions
.to_frame()
.assign(encoded_value=predictions.map(encode_mapping))['encoded_value']
.plot(ax=ax, label='predictions')
)
y_ticks = list(encode_mapping.values())
y_labels = list(encode_mapping.keys())
ax.set_yticks(y_ticks)
ax.set_yticklabels(y_labels)
ax.set_title("Categorical Time Series (States)")
ax.set_xlabel("Date")
ax.set_ylabel("State")
ax.legend();
# Prediction metrics
# ==============================================================================
error_mse = accuracy_score(
y_true = data_test.loc[predictions.index, 'Demand'],
y_pred = predictions
)
print(f"Test error (accuracy): {error_mse}")
print("Classification Report \n---------------------")
print(
classification_report(
data.loc[predictions.index, 'Demand'],
predictions
)
)
Test error (accuracy): 0.7083333333333334
Classification Report
---------------------
precision recall f1-score support
high 0.00 0.00 0.00 0
low 1.00 0.44 0.62 9
medium 0.72 0.87 0.79 15
accuracy 0.71 24
macro avg 0.57 0.44 0.47 24
weighted avg 0.83 0.71 0.72 24
c:\Users\jaesc2\Miniconda3\envs\skforecast_py12\Lib\site-packages\sklearn\metrics\_classification.py:1565: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
c:\Users\jaesc2\Miniconda3\envs\skforecast_py12\Lib\site-packages\sklearn\metrics\_classification.py:1565: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
c:\Users\jaesc2\Miniconda3\envs\skforecast_py12\Lib\site-packages\sklearn\metrics\_classification.py:1565: UndefinedMetricWarning: Recall is ill-defined and being set to 0.0 in labels with no true samples. Use `zero_division` parameter to control this behavior.
_warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
Predict probabilities¶
ForecasterRecursiveClassifier includes the method predict_proba which allows to obtain the predicted probabilities for each category at each forecasted time step. Internally, this method calls the predict_proba method of the underlying classifier used inside the forecaster.
⚠️ Warning
It is important to note that these probabilities come from the classifier’s predict_proba method. Their calibration and reliability depend on the specific model used. For example, in scikit-learn, predict_proba returns scores that approximate probabilities, but these are not always well-calibrated or truly probabilistic. See section on Probability Calibration for more details.
# Predict probabilities
# ==============================================================================
predictions = forecaster.predict_proba(
steps = 24,
exog = data_test[['Temperature', 'Holiday', 'Hour_of_day']]
)
predictions
| high_proba | low_proba | medium_proba | |
|---|---|---|---|
| 2014-11-01 00:00:00 | 0.004573 | 0.002774 | 0.992653 |
| 2014-11-01 01:00:00 | 0.004353 | 0.007518 | 0.988129 |
| 2014-11-01 02:00:00 | 0.002101 | 0.003061 | 0.994838 |
| 2014-11-01 03:00:00 | 0.001198 | 0.003052 | 0.995751 |
| 2014-11-01 04:00:00 | 0.005935 | 0.001566 | 0.992499 |
| 2014-11-01 05:00:00 | 0.001352 | 0.000036 | 0.998612 |
| 2014-11-01 06:00:00 | 0.003241 | 0.000080 | 0.996678 |
| 2014-11-01 07:00:00 | 0.073888 | 0.000195 | 0.925917 |
| 2014-11-01 08:00:00 | 0.451197 | 0.000588 | 0.548215 |
| 2014-11-01 09:00:00 | 0.540362 | 0.001000 | 0.458639 |
| 2014-11-01 10:00:00 | 0.770847 | 0.000463 | 0.228690 |
| 2014-11-01 11:00:00 | 0.079652 | 0.000732 | 0.919616 |
| 2014-11-01 12:00:00 | 0.000957 | 0.001410 | 0.997633 |
| 2014-11-01 13:00:00 | 0.000998 | 0.000735 | 0.998268 |
| 2014-11-01 14:00:00 | 0.000491 | 0.000104 | 0.999404 |
| 2014-11-01 15:00:00 | 0.000503 | 0.001229 | 0.998267 |
| 2014-11-01 16:00:00 | 0.001892 | 0.052383 | 0.945725 |
| 2014-11-01 17:00:00 | 0.000693 | 0.940728 | 0.058580 |
| 2014-11-01 18:00:00 | 0.000101 | 0.989826 | 0.010072 |
| 2014-11-01 19:00:00 | 0.000205 | 0.968296 | 0.031499 |
| 2014-11-01 20:00:00 | 0.000332 | 0.588383 | 0.411285 |
| 2014-11-01 21:00:00 | 0.000354 | 0.276246 | 0.723400 |
| 2014-11-01 22:00:00 | 0.110881 | 0.004136 | 0.884984 |
| 2014-11-01 23:00:00 | 0.097221 | 0.001579 | 0.901201 |
Feature importances¶
# Native feature importance of the estimator used in the forecaster
# ==============================================================================
forecaster.get_feature_importances()
| feature | importance | |
|---|---|---|
| 27 | Temperature | 2030 |
| 29 | Hour_of_day | 1168 |
| 24 | roll_proportion_72_class_0 | 747 |
| 25 | roll_proportion_72_class_1 | 701 |
| 26 | roll_proportion_72_class_2 | 648 |
| 0 | lag_1 | 318 |
| 23 | lag_24 | 273 |
| 22 | lag_23 | 235 |
| 4 | lag_5 | 189 |
| 2 | lag_3 | 168 |
| 6 | lag_7 | 167 |
| 5 | lag_6 | 167 |
| 9 | lag_10 | 164 |
| 1 | lag_2 | 163 |
| 13 | lag_14 | 154 |
| 21 | lag_22 | 145 |
| 3 | lag_4 | 136 |
| 8 | lag_9 | 134 |
| 10 | lag_11 | 133 |
| 12 | lag_13 | 127 |
| 14 | lag_15 | 124 |
| 16 | lag_17 | 123 |
| 7 | lag_8 | 119 |
| 15 | lag_16 | 118 |
| 20 | lag_21 | 115 |
| 19 | lag_20 | 112 |
| 18 | lag_19 | 109 |
| 17 | lag_18 | 101 |
| 11 | lag_12 | 76 |
| 28 | Holiday | 36 |
Backtesting¶
The backtesting process for classification forecasters follows the same principles as that of regression forecasters. The key difference lies in the evaluation metrics used to assess model performance. Instead of using metrics like Mean Squared Error (MSE) or Mean Absolute Error (MAE), classification forecasters utilize metrics such as accuracy, precision, ..., to evaluate how well the model predicts categorical outcomes.
The returned predictions will include the most likely category for each forecasted time step in the column pred as well as the predicted probabilities for each category in separate columns.
# Backtesting forecaster
# ==============================================================================
window_features = RollingFeaturesClassification(stats=['proportion'], window_sizes=72)
forecaster = ForecasterRecursiveClassifier(
estimator = HistGradientBoostingClassifier(random_state=123),
lags = 24,
window_features = window_features,
features_encoding = 'auto',
)
cv = TimeSeriesFold(steps = 24, initial_train_size = end_train)
metric, predictions = backtesting_forecaster(
forecaster = forecaster,
y = data['Demand'],
exog = data[['Temperature', 'Holiday', 'Hour_of_day']],
cv = cv,
metric = 'accuracy_score',
n_jobs = 'auto',
verbose = False,
show_progress = True
)
0%| | 0/61 [00:00<?, ?it/s]
display(metric)
display(predictions)
| accuracy_score | |
|---|---|
| 0 | 0.843858 |
| fold | pred | high_proba | low_proba | medium_proba | |
|---|---|---|---|---|---|
| 2014-11-01 00:00:00 | 0 | medium | 0.001643 | 0.001311 | 0.997046 |
| 2014-11-01 01:00:00 | 0 | medium | 0.002286 | 0.004299 | 0.993415 |
| 2014-11-01 02:00:00 | 0 | medium | 0.000829 | 0.001231 | 0.997940 |
| 2014-11-01 03:00:00 | 0 | medium | 0.000601 | 0.002429 | 0.996970 |
| 2014-11-01 04:00:00 | 0 | medium | 0.001399 | 0.000247 | 0.998354 |
| ... | ... | ... | ... | ... | ... |
| 2014-12-30 20:00:00 | 59 | low | 0.000042 | 0.746859 | 0.253099 |
| 2014-12-30 21:00:00 | 59 | low | 0.000015 | 0.990844 | 0.009141 |
| 2014-12-30 22:00:00 | 59 | medium | 0.000097 | 0.460920 | 0.538982 |
| 2014-12-30 23:00:00 | 59 | medium | 0.000414 | 0.009459 | 0.990127 |
| 2014-12-31 00:00:00 | 60 | medium | 0.001948 | 0.002009 | 0.996042 |
1441 rows × 5 columns
# Summary of classification metrics on test set
# ==============================================================================
print("Classification Report \n---------------------")
print(
classification_report(
data.loc[predictions.index, 'Demand'],
predictions['pred']
)
)
Classification Report
---------------------
precision recall f1-score support
high 0.50 0.36 0.42 107
low 0.84 0.83 0.84 364
medium 0.87 0.90 0.89 970
accuracy 0.84 1441
macro avg 0.74 0.70 0.71 1441
weighted avg 0.84 0.84 0.84 1441
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data.loc[predictions.index, 'encoded_value'].plot(ax=ax, label='test')
predictions.assign(encoded_value=predictions['pred'].map(encode_mapping))['encoded_value'].plot(ax=ax, label='predictions')
y_ticks = list(encode_mapping.values())
y_labels = list(encode_mapping.keys())
ax.set_yticks(y_ticks)
ax.set_yticklabels(y_labels)
ax.set_title("Categorical Time Series (States)")
ax.set_xlabel("Date")
ax.set_ylabel("State")
ax.legend();
Custom metrics during backtesting¶
The backtesting_forecaster allows the use of custom metrics to evaluate forecaster performance.
This is particularly useful for classification tasks, as it enables metrics to be tailored to the specific problem type (e.g., binary, multi-class, or multi-label). Common examples include f1_score, roc_auc_score, and precision_score.
# Custom f1-score for multi-class classification
# ==============================================================================
from sklearn.metrics import f1_score
def custom_f1_score(y_true, y_pred):
"""
Custom f1-score metric, it must have `y_true` and `y_pred` as parameters.
"""
return f1_score(y_true, y_pred, average='weighted')
Hyperparameter tuning¶
Hyperparameter tuning for ForecasterRecursiveClassifier can be performed using the functions available in the skforecast.model_selection module, such as grid_search_forecaster, random_search_forecaster and bayesian_search_forecaster. These functions allow for systematic exploration of hyperparameter combinations to identify the best configuration for the forecaster based on specified evaluation metrics suitable for classification tasks.
When performing hyperparameter tuning, it is important to use a train-validation-test split approach. This involves dividing the dataset into three distinct subsets:
- Training set: Used to fit the model.
- Validation set: Used to tune the hyperparameters.
- Test set: Used to evaluate the model's performance.
This approach helps to prevent overfitting and ensures that the model generalizes well to unseen data.
# Train-validation-test split
# ==============================================================================
end_train = '2014-08-31 23:59:00'
end_val = '2014-10-31 23:59:00'
print(
f"Train dates : {data.index.min()} --- {data.loc[:end_train].index.max()}"
f" (n={len(data.loc[:end_train])})"
)
print(
f"Validation dates : {data.loc[end_train:].index.min()} --- {data.loc[:end_val].index.max()}"
f" (n={len(data.loc[end_train:end_val])})"
)
print(
f"Test dates : {data.loc[end_val:].index.min()} --- {data.index.max()}"
f" (n={len(data.loc[end_val:])})"
)
print("")
Train dates : 2014-01-01 01:00:00 --- 2014-08-31 23:00:00 (n=5831) Validation dates : 2014-09-01 00:00:00 --- 2014-10-31 23:00:00 (n=1464) Test dates : 2014-11-01 00:00:00 --- 2014-12-31 00:00:00 (n=1441)
# Bayesian search hyperparameters and lags with Optuna
# ==============================================================================
estimator = HistGradientBoostingClassifier(
early_stopping= True,
random_state=123,
verbose=0
)
forecaster = ForecasterRecursiveClassifier(
estimator = estimator,
lags = 14, # Placeholder, the value will be overwritten
window_features = RollingFeaturesClassification(stats=['proportion'], window_sizes=[72]),
features_encoding = 'auto',
)
# Lags used as predictors
lags_grid = [24, (1, 2, 3, 23, 24, 25, 47, 48, 49)]
# Search space
def search_space(trial):
search_space = {
'lags': trial.suggest_categorical('lags', lags_grid),
'max_iter': trial.suggest_int('max_iter', 100, 500, step=100),
'max_depth': trial.suggest_int('max_depth', 3, 10),
'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.5),
'l2_regularization': trial.suggest_float('l2_regularization', 0.0, 1.0)
}
return search_space
# Folds
cv_search = OneStepAheadFold(initial_train_size = end_train)
results, best_trial = bayesian_search_forecaster(
forecaster = forecaster,
y = data.loc[:end_val, 'Demand'],
exog = data.loc[:end_val, ['Temperature', 'Holiday', 'Hour_of_day']],
search_space = search_space,
cv = cv_search,
metric = 'accuracy_score',
kwargs_create_study = {'direction': 'maximize'},
n_trials = 20,
random_state = 123,
return_best = True,
n_jobs = 'auto',
verbose = False,
show_progress = True,
)
results.head(4)
╭─────────────────────────── OneStepAheadValidationWarning ────────────────────────────╮ │ One-step-ahead predictions are used for faster model comparison, but they may not │ │ fully represent multi-step prediction performance. It is recommended to backtest the │ │ final model for a more accurate multi-step performance estimate. │ │ │ │ Category : skforecast.exceptions.OneStepAheadValidationWarning │ │ Location : │ │ c:\Users\jaesc2\Miniconda3\envs\skforecast_py12\Lib\site-packages\skforecast\model_s │ │ election\_utils.py:607 │ │ Suppress : warnings.simplefilter('ignore', category=OneStepAheadValidationWarning) │ ╰──────────────────────────────────────────────────────────────────────────────────────╯
0%| | 0/20 [00:00<?, ?it/s]
`Forecaster` refitted using the best-found lags and parameters, and the whole data set:
Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24]
Parameters: {'max_iter': 200, 'max_depth': 8, 'learning_rate': 0.06995250096123734, 'l2_regularization': 0.3790013234407905}
One-step-ahead metric: 0.9050546448087432
| lags | params | accuracy_score | max_iter | max_depth | learning_rate | l2_regularization | |
|---|---|---|---|---|---|---|---|
| 0 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_iter': 200, 'max_depth': 8, 'learning_ra... | 0.905055 | 200.0 | 8.0 | 0.069953 | 0.379001 |
| 1 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_iter': 200, 'max_depth': 8, 'learning_ra... | 0.903689 | 200.0 | 8.0 | 0.099421 | 0.175452 |
| 2 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_iter': 400, 'max_depth': 7, 'learning_ra... | 0.902322 | 400.0 | 7.0 | 0.059226 | 0.412988 |
| 3 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14... | {'max_iter': 200, 'max_depth': 9, 'learning_ra... | 0.901639 | 200.0 | 9.0 | 0.096356 | 0.215461 |
# Best model
# ==============================================================================
best_params = results.loc[0, 'params']
best_lags = results.loc[0, 'lags']
forecaster = ForecasterRecursiveClassifier(
estimator = HistGradientBoostingClassifier(random_state=123, **best_params),
lags = best_lags,
window_features = RollingFeaturesClassification(stats=['proportion'], window_sizes=[72]),
features_encoding = 'auto',
)
# Backtest final model on test data
# ==============================================================================
cv = TimeSeriesFold(steps = 24, initial_train_size = end_val)
metric, predictions = backtesting_forecaster(
forecaster = forecaster,
y = data['Demand'],
exog = data[['Temperature', 'Holiday', 'Hour_of_day']],
cv = cv,
metric = 'accuracy_score'
)
display(metric)
predictions
0%| | 0/61 [00:00<?, ?it/s]
| accuracy_score | |
|---|---|
| 0 | 0.840389 |
| fold | pred | high_proba | low_proba | medium_proba | |
|---|---|---|---|---|---|
| 2014-11-01 00:00:00 | 0 | medium | 0.001048 | 0.001395 | 0.997558 |
| 2014-11-01 01:00:00 | 0 | medium | 0.000473 | 0.002579 | 0.996947 |
| 2014-11-01 02:00:00 | 0 | medium | 0.000147 | 0.001187 | 0.998666 |
| 2014-11-01 03:00:00 | 0 | medium | 0.000296 | 0.003188 | 0.996516 |
| 2014-11-01 04:00:00 | 0 | medium | 0.000237 | 0.000265 | 0.999497 |
| ... | ... | ... | ... | ... | ... |
| 2014-12-30 20:00:00 | 59 | low | 0.000195 | 0.697114 | 0.302691 |
| 2014-12-30 21:00:00 | 59 | low | 0.000020 | 0.994279 | 0.005701 |
| 2014-12-30 22:00:00 | 59 | medium | 0.000138 | 0.375040 | 0.624822 |
| 2014-12-30 23:00:00 | 59 | medium | 0.000092 | 0.005908 | 0.994000 |
| 2014-12-31 00:00:00 | 60 | medium | 0.000738 | 0.002541 | 0.996721 |
1441 rows × 5 columns
# Summary of classification metrics on test set
# ==============================================================================
print("Classification Report \n---------------------")
print(
classification_report(
data.loc[predictions.index, 'Demand'],
predictions['pred']
)
)
Classification Report
---------------------
precision recall f1-score support
high 0.46 0.44 0.45 107
low 0.86 0.81 0.84 364
medium 0.87 0.89 0.88 970
accuracy 0.84 1441
macro avg 0.73 0.72 0.72 1441
weighted avg 0.84 0.84 0.84 1441
Probability calibration¶
Well calibrated classifiers are probabilistic classifiers for which the output of the predict_proba method can be directly interpreted as a confidence level. For instance, a well calibrated (binary) classifier should classify the samples such that among the samples to which it gave a predict_proba value close to, say, 0.8, approximately 80% actually belong to the positive class.
Before we show how to re-calibrate a classifier, we first need a way to detect how good a classifier is calibrated.
Calibration curves¶
The following definition of calibration curves is taken from scikit-learn documentation:
"Calibration curves, also referred to as reliability diagrams, compare how well the probabilistic predictions of a binary classifier are calibrated. It plots the frequency of the positive label (to be more precise, an estimation of the conditional event probability ) on the y-axis against the predicted probability predict_proba of a model on the x-axis. The tricky part is to get values for the y-axis. In scikit-learn, this is accomplished by binning the predictions such that the x-axis represents the average predicted probability in each bin. The y-axis is then the fraction of positives given the predictions of that bin, i.e. the proportion of samples whose class is the positive class (in each bin)."
Since calibration curves are defined for binary classifiers, in the case of multi-class classification problems, one-vs-rest calibration curves are used. This means that for each class, the samples belonging to that class are considered as positive samples, while all other samples are treated as negative samples.
# Calibration plots: one-vs-rest
# ==============================================================================
for level, code in {'low': 0, 'medium': 1, 'high': 2}.items():
fig, ax = plt.subplots(figsize=(6, 2.5))
y_prob = predictions.loc[:, f'{level}_proba'].rename(code)
y_true = (data.loc[predictions.index, 'encoded_value'] == code).astype(int)
curve_prob_true, curve_prob_pred = calibration_curve(
y_true=y_true,
y_prob=y_prob,
n_bins=10
)
disp = CalibrationDisplay(
prob_true=curve_prob_true, prob_pred=curve_prob_pred, y_prob=y_prob
)
disp.plot(name=level, ax=ax, color='C' + str(code))
ax.set_title(f"{level}")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Fraction of positives")
ax.get_lines()[0].set_color('white')
The calibrations curves show that the classifier is not well calibrated, as the curves deviate significantly from the diagonal line representing perfect calibration. For example, for the class 'low', when the classifier predicts a probability of 0.8, the actual frequency of the positive class is only around 0.5, indicating overconfidence in its predictions for that class.
Calibration¶
The calibration of a classification model refers to the process of adjusting the predicted probabilities so that they align with the actual observed outcomes. In other words, calibration corrects the model when it tends to systematically overestimate or underestimate probabilities.
A model is said to be perfectly calibrated if, for any predicted probability $p \in [0, 1]$, the proportion of correctly classified observations among those predicted with probability $p$ is exactly $p$. Formally:
$$ P(\hat{Y} = Y \mid \hat{P} = p) = p, \quad \forall p \in [0, 1] $$
where:
- $\hat{Y}$ is the predicted class,
- $Y$ is the true class, and
- $\hat{P}$ is the predicted probability of the positive (or target) class.
For example, if we look at all cases where the model predicts a probability of $\hat{P} = 0.8$, we would expect that about 80% of those cases truly belong to the positive class.
When a model is perfectly calibrated, plotting the observed frequencies against the predicted probabilities produces a diagonal line (the identity function) across the range $[0, 1]$:
$$ \text{Observed frequency} = \text{Predicted probability} $$
The CalibratedClassifierCV class in sklearn is used to calibrate a classifier. Since skforecast is compatible with scikit-learn, CalibratedClassifierCV can be used directly into ForecasterRecursiveClassifier.
# Forecaster with calibrated classifier
# ==============================================================================
estimator = CalibratedClassifierCV(
estimator = HistGradientBoostingClassifier(
**best_params,
random_state=123
),
method = 'sigmoid',
ensemble = False, # False to speed up prediction
cv = 3,
)
forecaster = ForecasterRecursiveClassifier(
estimator = estimator,
lags = best_lags,
window_features = RollingFeaturesClassification(stats=['proportion'], window_sizes=[72]),
features_encoding= 'auto',
)
cv = TimeSeriesFold(steps = 24, initial_train_size = end_val)
metric, predictions = backtesting_forecaster(
forecaster = forecaster,
y = data['Demand'],
exog = data[['Temperature', 'Holiday', 'Hour_of_day']],
cv = cv,
metric = 'accuracy_score'
)
display(metric)
predictions
0%| | 0/61 [00:00<?, ?it/s]
| accuracy_score | |
|---|---|
| 0 | 0.843164 |
| fold | pred | high_proba | low_proba | medium_proba | |
|---|---|---|---|---|---|
| 2014-11-01 00:00:00 | 0 | medium | 0.022222 | 0.022406 | 0.955372 |
| 2014-11-01 01:00:00 | 0 | medium | 0.020423 | 0.065693 | 0.913884 |
| 2014-11-01 02:00:00 | 0 | medium | 0.009767 | 0.044904 | 0.945330 |
| 2014-11-01 03:00:00 | 0 | medium | 0.018259 | 0.100651 | 0.881091 |
| 2014-11-01 04:00:00 | 0 | medium | 0.014561 | 0.012693 | 0.972746 |
| ... | ... | ... | ... | ... | ... |
| 2014-12-30 20:00:00 | 59 | low | 0.001329 | 0.627120 | 0.371551 |
| 2014-12-30 21:00:00 | 59 | low | 0.001431 | 0.913995 | 0.084574 |
| 2014-12-30 22:00:00 | 59 | medium | 0.000931 | 0.426978 | 0.572091 |
| 2014-12-30 23:00:00 | 59 | medium | 0.003846 | 0.098475 | 0.897678 |
| 2014-12-31 00:00:00 | 60 | medium | 0.027835 | 0.060985 | 0.911179 |
1441 rows × 5 columns
# Calibration plots: one-vs-rest
# ==============================================================================
for level, code in {'low': 0, 'medium': 1, 'high': 2}.items():
fig, ax = plt.subplots(figsize=(6, 2.5))
y_prob = predictions.loc[:, f'{level}_proba'].rename(code)
y_true = (data.loc[predictions.index, 'encoded_value'] == code).astype(int)
prob_true, prob_pred = calibration_curve(
y_true=y_true,
y_prob=y_prob,
n_bins=10
)
disp = CalibrationDisplay(prob_true=prob_true, prob_pred=prob_pred, y_prob=y_prob)
disp.plot(name=level, ax=ax, color='C' + str(code))
ax.set_title(f"{level}")
ax.set_xlabel("Mean predicted probability")
ax.set_ylabel("Fraction of positives")
ax.get_lines()[0].set_color('white')
After applying calibration techniques, the calibration curves are much closer to the diagonal line, indicating that the predicted probabilities are now more aligned with the actual observed frequencies. For instance, for the class 'low', when the classifier predicts a probability of 0.8, the actual frequency of the positive class is now around 0.75, showing a significant improvement compared to before calibration.