Probabilistic Forecasting: Quantile Regression¶
Quantile regression is a technique for estimating the conditional quantiles of a response variable. By combining the predictions of two quantile regressors, an interval can be constructed, where each model estimates one of the interval's boundaries. For example, models trained for the 10% and 90% quantiles produce an 80% prediction interval (90% - 10% = 80%).
If a machine learning algorithm capable of modeling quantiles is used as the regressor in a forecaster, the `predict` method will return predictions for the specified quantile. By creating two forecasters, each configured with a different quantile, their predictions can be combined to generate a prediction interval.
As opposed to linear regression, which is intended to estimate the conditional mean of the response variable given certain values of the predictor variables, quantile regression aims at estimating the conditional quantiles of the response variable. For a continuous distribution function, the $\alpha$-quantile $Q(\alpha)$ is defined such that the probability of the response being smaller than $Q(\alpha)$ is, for a given $\alpha \in (0, 1)$, equal to $\alpha$. For example, 36% of the population values are lower than the quantile $Q(0.36)$. The best-known quantile is the 50%-quantile, more commonly called the median.
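The definition above can be checked numerically: the empirical $\alpha$-quantile of a sample is the value below which a fraction $\alpha$ of the observations fall. A minimal sketch using NumPy (the simulated data is illustrative only):

```python
import numpy as np

# Simulated response values (illustrative distribution, not the ETT data)
rng = np.random.default_rng(seed=42)
y = rng.normal(loc=10, scale=2, size=100_000)

# The 0.36-quantile: the value below which ~36% of observations fall
q_36 = np.quantile(y, 0.36)
share_below = (y < q_36).mean()
print(f"0.36-quantile: {q_36:.2f}, share of values below it: {share_below:.3f}")  # ~0.36

# The 0.5-quantile is the median
print(np.isclose(np.quantile(y, 0.5), np.median(y)))  # True
```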
By combining the predictions of two quantile regressors, it is possible to build an interval, where each model estimates one of its limits. For example, the models obtained for $Q(0.10)$ and $Q(0.90)$ produce an 80% prediction interval (90% - 10% = 80%).
Several machine learning algorithms are capable of modeling quantiles, for example LightGBM (used in this document) and scikit-learn's GradientBoostingRegressor and HistGradientBoostingRegressor.
Just as the squared-error loss function is used to train models that predict the mean value, a specific loss function is needed in order to train models that predict quantiles. The most common metric used for quantile regression is called the quantile loss or pinball loss:

$$\text{pinball}(y, \hat{y}) = \begin{cases} \alpha \, (y - \hat{y}) & \text{if } y \ge \hat{y} \\ (1 - \alpha) \, (\hat{y} - y) & \text{if } y < \hat{y} \end{cases}$$

where $\alpha$ is the target quantile, $y$ the real value and $\hat{y}$ the quantile prediction.
It can be seen that the loss differs depending on the evaluated quantile: the higher the quantile, the more the loss function penalizes underestimates and the less it penalizes overestimates. As with MSE and MAE, the goal is to minimize its value (the lower the loss, the better).
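The asymmetry described above can be made concrete with a direct implementation of the formula (scikit-learn offers an equivalent `sklearn.metrics.mean_pinball_loss`); the numbers below are illustrative:

```python
import numpy as np

def pinball_loss(y_true, y_pred, alpha):
    """Mean pinball (quantile) loss for target quantile alpha."""
    error = np.asarray(y_true) - np.asarray(y_pred)
    # max(alpha * e, (alpha - 1) * e) is the piecewise formula in one expression
    return np.mean(np.maximum(alpha * error, (alpha - 1) * error))

y_true = np.array([10.0, 10.0])

# For a high quantile (alpha=0.9), underestimating by 2 costs 0.9 * 2 = 1.8,
# while overestimating by 2 costs only (1 - 0.9) * 2 = 0.2
print(pinball_loss(y_true, y_true - 2, alpha=0.9))  # 1.8
print(pinball_loss(y_true, y_true + 2, alpha=0.9))  # 0.2
```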
Two disadvantages of quantile regression, compared to the bootstrap approach to prediction intervals, are that each quantile needs its own regressor and that quantile regression is not available for all types of regression models. However, once the models are trained, inference is much faster since no iterative process is needed.
This type of prediction interval can be easily estimated using ForecasterDirect and ForecasterDirectMultiVariate models.
💡 Tip
For more examples on how to use probabilistic forecasting, check out the other probabilistic forecasting articles in the skforecast documentation.
⚠ Warning
Forecasters of type ForecasterDirect are slower than ForecasterRecursive because they require training one model per step. Although they can achieve better performance, their scalability is an important limitation when many steps need to be predicted.
# Data processing
# ==============================================================================
import pandas as pd
from skforecast.datasets import fetch_dataset
# Plots
# ==============================================================================
import matplotlib.pyplot as plt
from skforecast.plot import set_dark_theme
# Modelling and Forecasting
# ==============================================================================
from lightgbm import LGBMRegressor
from skforecast.direct import ForecasterDirect
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import backtesting_forecaster
from skforecast.model_selection import bayesian_search_forecaster
from skforecast.metrics import calculate_coverage
from skforecast.metrics import create_mean_pinball_loss
# Configuration
# ==============================================================================
import warnings
warnings.filterwarnings('once')
# Data
# ==============================================================================
data = fetch_dataset(name="ett_m2_extended")
data = data[[
"OT",
"day_of_week_cos",
"day_of_week_sin",
"hour_cos",
"hour_sin",
"month_cos",
"month_sin",
"week_cos",
"week_sin",
"year",
]]
data.head(2)
ett_m2_extended
---------------
Data from an electricity transformer station was collected between July 2016 and July 2018 (2 years x 365 days x 24 hours x 4 intervals per hour = 70,080 data points). Each data point consists of 8 features, including the date of the point, the predictive value "Oil Temperature (OT)", and 6 different types of external power load features: High UseFul Load (HUFL), High UseLess Load (HULL), Middle UseFul Load (MUFL), Middle UseLess Load (MULL), Low UseFul Load (LUFL), Low UseLess Load (LULL). Additional variables are created based on calendar information (year, month, week, day of the week, and hour). These variables have been encoded using the cyclical encoding technique (sin and cos transformations) to preserve the cyclical nature of the data.
Zhou, Haoyi & Zhang, Shanghang & Peng, Jieqi & Zhang, Shuai & Li, Jianxin & Xiong, Hui & Zhang, Wancai. (2020). Informer: Beyond Efficient Transformer for Long Sequence Time-Series Forecasting. [10.48550/arXiv.2012.07436](https://arxiv.org/abs/2012.07436). https://github.com/zhouhaoyi/ETDataset
Shape of the dataset: (69680, 16)
| date | OT | day_of_week_cos | day_of_week_sin | hour_cos | hour_sin | month_cos | month_sin | week_cos | week_sin | year |
|---|---|---|---|---|---|---|---|---|---|---|
| 2016-07-01 00:00:00 | 38.661999 | -0.900969 | -0.433884 | 1.0 | 0.0 | -0.866025 | -0.5 | -1.0 | 1.224647e-16 | 2016 |
| 2016-07-01 00:15:00 | 38.223000 | -0.900969 | -0.433884 | 1.0 | 0.0 | -0.866025 | -0.5 | -1.0 | 1.224647e-16 | 2016 |
# Split train-validation-test
# ==============================================================================
exog_features = data.columns.difference(['OT']).tolist()
end_train = '2017-10-01 23:59:00'
end_validation = '2018-04-03 23:59:00'
data_train = data.loc[: end_train, :]
data_val = data.loc[end_train:end_validation, :]
data_test = data.loc[end_validation:, :]
print(f"Dates train : {data_train.index.min()} --- {data_train.index.max()} (n={len(data_train)})")
print(f"Dates validation : {data_val.index.min()} --- {data_val.index.max()} (n={len(data_val)})")
print(f"Dates test : {data_test.index.min()} --- {data_test.index.max()} (n={len(data_test)})")
Dates train      : 2016-07-01 00:00:00 --- 2017-10-01 23:45:00 (n=43968)
Dates validation : 2017-10-02 00:00:00 --- 2018-04-03 23:45:00 (n=17664)
Dates test       : 2018-04-04 00:00:00 --- 2018-06-26 19:45:00 (n=8048)
# Plot partitions
# ==============================================================================
set_dark_theme()
plt.rcParams['lines.linewidth'] = 0.5
fig, ax = plt.subplots(figsize=(8, 3))
ax.plot(data_train['OT'], label='Train')
ax.plot(data_val['OT'], label='Validation')
ax.plot(data_test['OT'], label='Test')
ax.set_title('Oil Temperature')
ax.legend();
Two quantile forecasters are trained, one for the 10% quantile and another for the 90% quantile. The predictions of both forecasters can be combined to generate a prediction interval of 80% coverage.
# Create forecasters: one for each bound of the interval
# ==============================================================================
# The forecasters obtained for alpha=0.1 and alpha=0.9 produce an 80% prediction
# interval (90% - 10% = 80%).
# Forecaster for quantile 10%
forecaster_q10 = ForecasterDirect(
regressor = LGBMRegressor(
objective = 'quantile',
metric = 'quantile',
alpha = 0.1,
verbose = -1
),
steps = 24,
lags = [1, 2, 3, 23, 24, 25, 47, 48, 49, 71, 72, 73]
)
# Forecaster for quantile 90%
forecaster_q90 = ForecasterDirect(
regressor = LGBMRegressor(
objective = 'quantile',
metric = 'quantile',
alpha = 0.9,
verbose = -1
),
steps = 24,
lags = [1, 2, 3, 23, 24, 25, 47, 48, 49, 71, 72, 73]
)
Next, a Bayesian search is performed to find the best hyperparameters for the quantile regressors. When validating a quantile regression model, it is important to use a metric that is coherent with the quantile being evaluated. In this case, the pinball loss is used. Skforecast provides the function create_mean_pinball_loss to calculate the pinball loss for a given quantile.
# Bayesian search of hyper-parameters and lags for each quantile forecaster
# ==============================================================================
def search_space(trial):
search_space = {
'n_estimators': trial.suggest_int('n_estimators', 100, 500, step=50),
'max_depth': trial.suggest_int('max_depth', 3, 10, step=1),
'learning_rate': trial.suggest_float('learning_rate', 0.001, 0.1)
}
return search_space
cv = TimeSeriesFold(
steps = 24,
initial_train_size = len(data[:end_train]),
refit = False,
)
results_grid_q10 = bayesian_search_forecaster(
forecaster = forecaster_q10,
y = data.loc[:end_validation, 'OT'],
cv = cv,
metric = create_mean_pinball_loss(alpha=0.1),
search_space = search_space,
n_trials = 10,
return_best = True
)
results_grid_q90 = bayesian_search_forecaster(
forecaster = forecaster_q90,
y = data.loc[:end_validation, 'OT'],
cv = cv,
metric = create_mean_pinball_loss(alpha=0.9),
search_space = search_space,
n_trials = 10,
return_best = True
)
`Forecaster` refitted using the best-found lags and parameters, and the whole data set:
Lags: [ 1  2  3 23 24 25 47 48 49 71 72 73]
Parameters: {'n_estimators': 250, 'max_depth': 3, 'learning_rate': 0.04040638127771271}
Backtesting metric: 0.41020051875078006
`Forecaster` refitted using the best-found lags and parameters, and the whole data set:
Lags: [ 1  2  3 23 24 25 47 48 49 71 72 73]
Parameters: {'n_estimators': 450, 'max_depth': 8, 'learning_rate': 0.061491327557080706}
Backtesting metric: 0.350142766138876
Predictions (backtesting)¶
Once the quantile forecasters are trained, they can be used to predict each of the bounds of the forecasting interval.
# Backtesting on test data
# ==============================================================================
cv = TimeSeriesFold(
steps = 24,
initial_train_size = len(data.loc[:end_validation]),
refit = False
)
metric_q10, predictions_q10 = backtesting_forecaster(
forecaster = forecaster_q10,
y = data['OT'],
cv = cv,
metric = create_mean_pinball_loss(alpha=0.1)
)
metric_q90, predictions_q90 = backtesting_forecaster(
forecaster = forecaster_q90,
y = data['OT'],
cv = cv,
metric = create_mean_pinball_loss(alpha=0.9)
)
pred_intervals = pd.concat([predictions_q10, predictions_q90], axis=1)
pred_intervals.columns = ['lower_bound', 'upper_bound']
pred_intervals.head()
| date | lower_bound | upper_bound |
|---|---|---|
| 2018-04-04 00:00:00 | 31.848286 | 32.268329 |
| 2018-04-04 00:15:00 | 31.563152 | 32.150039 |
| 2018-04-04 00:30:00 | 31.269266 | 32.271917 |
| 2018-04-04 00:45:00 | 31.081806 | 32.139591 |
| 2018-04-04 01:00:00 | 30.670913 | 32.049578 |
# Plot interval
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data.loc[end_validation:, 'OT'].plot(ax=ax, label='Real value', color='orange')
ax.fill_between(
data.loc[end_validation:].index,
pred_intervals['lower_bound'],
pred_intervals['upper_bound'],
color = 'gray',
alpha = 0.6,
zorder = 1,
label = '80% prediction interval'
)
ax.set_xlabel('')
ax.set_title("Predicted intervals")
ax.legend();
# Plot intervals with zoom ['2018-05-01', '2018-05-15']
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data.loc[end_validation:, 'OT'].plot(ax=ax, label='Real value', color='orange')
ax.fill_between(
data.loc[end_validation:].index,
pred_intervals['lower_bound'],
pred_intervals['upper_bound'],
color = 'gray',
alpha = 0.6,
zorder = 1,
label = '80% prediction interval'
)
ax.set_xlabel('')
ax.set_title("Predicted intervals (zoom in)")
ax.set_xlim(['2018-05-01', '2018-05-15']);
# Predicted interval coverage (on test data)
# ==============================================================================
coverage = calculate_coverage(
y_true = data.loc[end_validation:, 'OT'],
lower_bound = predictions_q10["pred"],
upper_bound = predictions_q90["pred"]
)
print(f"Predicted interval coverage: {round(100 * coverage, 2)} %")
# Area of the interval
# ==============================================================================
area = (predictions_q90["pred"] - predictions_q10["pred"]).sum()
print(f"Area of the interval: {round(area, 2)}")
Predicted interval coverage: 77.37 %
Area of the interval: 38980.18
The prediction intervals generated by quantile regression achieve an empirical coverage (77.37%) close to, although slightly below, the nominal coverage of 80%.
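As a sanity check, the coverage reported above is simply the fraction of observed values that fall inside the predicted interval. A minimal sketch with hypothetical numbers (not the ETT data):

```python
import pandas as pd

# Toy data: observed values and a prediction interval (hypothetical numbers)
y_true = pd.Series([10.0, 12.0, 11.0, 15.0, 9.0])
lower  = pd.Series([ 9.0, 11.5, 10.0, 15.5, 8.0])
upper  = pd.Series([11.0, 13.0, 12.5, 16.0, 10.5])

# Empirical coverage: fraction of true values inside [lower, upper]
inside = (y_true >= lower) & (y_true <= upper)
coverage = inside.mean()
print(f"Predicted interval coverage: {100 * coverage:.2f} %")  # 4 of 5 inside -> 80.00 %
```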