Avoid Negative Predictions in Forecasting¶
When training a forecasting model, some predictions may turn out to be negative even if none of the training observations are. This is quite common when predicting inherently non-negative quantities such as attendance, sales, or rainfall. To avoid it, the following approaches can be used:
Use a $\log(x + K)$ transformation so that the transformed time series is always well defined. $K$ is a positive constant (typically 1) that avoids evaluating $\log(x)$ at 0, where it is undefined. Despite the simplicity of the approach, be aware that some caveats may arise depending on the metric used to optimize the models.
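As a minimal illustration with made-up values, the transformation and its inverse for $K = 1$ (equivalent to NumPy's log1p and expm1 functions used later in this tutorial):

# Example of the log(x + K) transformation and its inverse (K = 1)
# ==============================================================================
import numpy as np
y = np.array([0., 1., 10., 100.])
K = 1
y_log  = np.log(y + K)         # always defined, since y + K > 0
y_back = np.exp(y_log) - K     # the inverse transformation recovers the original values
print(np.allclose(y, y_back))  # True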
Use a different link function in the model. The link function provides the relationship between the linear predictor and the expected value of the response variable. Several machine learning algorithms also support objective functions that account for this situation: for example, when predicting the number of visitors, the count:poisson objective can be used in XGBoost (poisson in LightGBM), or a gamma objective if the target is always strictly positive.
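As a minimal sketch with synthetic count data (X_demo, y_demo and model are illustrative names), an XGBoost regressor trained with a Poisson objective never produces negative predictions:

# Sketch: XGBoost with a Poisson objective on synthetic count data
# ==============================================================================
import numpy as np
from xgboost import XGBRegressor

rng    = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 4))
y_demo = rng.poisson(lam=np.exp(X_demo[:, 0]))  # non-negative counts
model  = XGBRegressor(objective='count:poisson', n_estimators=50)
model.fit(X_demo, y_demo)
print((model.predict(X_demo) >= 0).all())  # True: predictions are never negative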
The following tutorial will explore both possibilities.
Libraries¶
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-darkgrid')
from sklearn.linear_model import Ridge
from sklearn.linear_model import GammaRegressor
from sklearn.linear_model import TweedieRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.preprocessing import FunctionTransformer
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import backtesting_forecaster
Data¶
# Downloading data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-'
'learning-python/master/data/bike_sharing_dataset_clean.csv')
data = pd.read_csv(url, usecols=['date_time', 'users'], nrows=1000)
data['date_time'] = pd.to_datetime(data['date_time'], format='%Y-%m-%d %H:%M:%S')
data = data.set_index('date_time')
data = data.asfreq('H')
data = data.sort_index()
# Split train-test
# ==============================================================================
end_train = '2011-01-31 23:59:00'
data_train = data.loc[: end_train, :]
data_test = data.loc[end_train:, :]
print(f"Dates train: {data_train.index.min()} --- {data_train.index.max()} (n={len(data_train)})")
print(f"Dates test : {data_test.index.min()} --- {data_test.index.max()} (n={len(data_test)})")
Dates train: 2011-01-01 00:00:00 --- 2011-01-31 23:00:00 (n=744)
Dates test : 2011-02-01 00:00:00 --- 2011-02-11 15:00:00 (n=256)
# Plot time series
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_train['users'].plot(ax=ax, label='train')
data_test['users'].plot(ax=ax, label='test')
ax.set_title('Number of users')
ax.set_xlabel('')
ax.legend();
Forecaster¶
# Create a forecaster and train it
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = Ridge(random_state=123),
    lags      = 24
)
# Backtesting predictions on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
    forecaster            = forecaster,
    y                     = data['users'],
    steps                 = 24,
    metric                = 'mean_squared_error',
    initial_train_size    = len(data_train),
    fixed_train_size      = False,
    gap                   = 0,
    allow_incomplete_fold = True,
    refit                 = False,
    verbose               = False,
    show_progress         = True
)
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_test['users'].plot(ax=ax, label='test')
predictions['pred'].plot(ax=ax, label='predictions')
ax.set_xlabel('')
ax.legend();
The graph above shows that some predictions are negative.
# Negative predictions
# ==============================================================================
predictions[predictions.pred < 0]
| date_time | pred |
|---|---|
| 2011-02-01 00:00:00 | -23.337236 |
| 2011-02-01 01:00:00 | -17.691405 |
| 2011-02-02 00:00:00 | -5.243456 |
| 2011-02-02 01:00:00 | -19.363139 |
| 2011-02-03 01:00:00 | -6.441943 |
| 2011-02-04 01:00:00 | -10.579940 |
| 2011-02-05 00:00:00 | -6.026119 |
| 2011-02-05 01:00:00 | -21.396841 |
| 2011-02-07 04:00:00 | -3.412043 |
| 2011-02-07 05:00:00 | -3.701964 |
| 2011-02-08 00:00:00 | -17.045913 |
| 2011-02-08 01:00:00 | -13.233004 |
| 2011-02-09 00:00:00 | -25.315228 |
| 2011-02-09 01:00:00 | -24.743686 |
| 2011-02-10 01:00:00 | -5.704407 |
| 2011-02-11 01:00:00 | -11.758940 |
Modeling the time series on a logarithmic scale¶
# Transform data into a logarithmic scale
# =============================================================================
data_log = np.log1p(data)
data_train_log = np.log1p(data_train)
data_test_log = np.log1p(data_test)
# Create a forecaster and train it
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = Ridge(random_state=123),
    lags      = 24
)
# Backtesting predictions on test data
# ==============================================================================
# Since the data has been transformed outside the forecaster, the predictions
# and metric are not in the original scale. They need to be transformed back.
metric, predictions = backtesting_forecaster(
    forecaster            = forecaster,
    y                     = data_log['users'],
    steps                 = 24,
    metric                = 'mean_squared_error',
    initial_train_size    = len(data_train_log),
    fixed_train_size      = False,
    gap                   = 0,
    allow_incomplete_fold = True,
    refit                 = False,
    verbose               = False,
    show_progress         = True
)
# Revert the transformation
# ==============================================================================
predictions = np.expm1(predictions)
predictions.head(4)
| date_time | pred |
|---|---|
| 2011-02-01 00:00:00 | 7.936823 |
| 2011-02-01 01:00:00 | 4.210682 |
| 2011-02-01 02:00:00 | 3.009993 |
| 2011-02-01 03:00:00 | 3.083922 |
# Backtesting metric (test data)
# ==============================================================================
# The error metric is calculated once the transformation is reversed.
metric = mean_squared_error(y_true=data_test['users'], y_pred=predictions)
print(f"Backtesting metric (test data): {metric}")
Backtesting metric (test data): 1991.9332571759892
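As mentioned in the introduction, the scale on which the error is computed matters. As an illustration, reusing the objects created above, the MSE can also be computed on the logarithmic scale; the two values are not directly comparable, so care is needed when optimizing or comparing models:

# Backtesting metric on the logarithmic scale (illustrative)
# ==============================================================================
metric_log_scale = mean_squared_error(
    y_true = data_test_log['users'],
    y_pred = np.log1p(predictions['pred'])
)
print(f"Backtesting metric (log scale): {metric_log_scale}")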
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_test['users'].plot(ax=ax, label='test')
predictions['pred'].plot(ax=ax, label='predictions')
ax.legend();
Include a logarithmic transformer as part of the forecaster¶
Using scikit-learn's FunctionTransformer, it is possible to include custom transformations in the forecaster object, for example, a logarithmic transformation. If the FunctionTransformer has an inverse function, the output of the predict method is automatically transformed back to the original scale.
# Create a custom transformer
# =============================================================================
transformer_y = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)
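A quick sanity check, on a few illustrative values, that the transformer and its inverse round-trip correctly:

# Check that the inverse function undoes the transformation
# ==============================================================================
x = np.array([[0.], [1.], [10.], [100.]])
x_trans = transformer_y.fit_transform(x)
print(np.allclose(transformer_y.inverse_transform(x_trans), x))  # True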
# Create a forecaster and train it
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor        = Ridge(random_state=123),
    lags             = 24,
    transformer_y    = transformer_y,
    transformer_exog = None
)
forecaster.fit(data['users'])
# Backtesting predictions on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
    forecaster            = forecaster,
    y                     = data['users'],
    steps                 = 24,
    metric                = 'mean_squared_error',
    initial_train_size    = len(data_train),
    fixed_train_size      = False,
    gap                   = 0,
    allow_incomplete_fold = True,
    refit                 = False,
    verbose               = False,
    show_progress         = True
)
# Since the transformation is included in the forecaster, predictions are
# automatically transformed back to the original scale, and the metric is
# calculated on that scale.
print(f"Backtesting metric: {metric}")
predictions.head()
Backtesting metric: 1991.9332571759892
| date_time | pred |
|---|---|
| 2011-02-01 00:00:00 | 7.936823 |
| 2011-02-01 01:00:00 | 4.210682 |
| 2011-02-01 02:00:00 | 3.009993 |
| 2011-02-01 03:00:00 | 3.083922 |
| 2011-02-01 04:00:00 | 3.865169 |
The results are the same as when the data were log-transformed outside the forecaster. However, by including the log transformation in the forecaster, all inverse transformations are handled automatically.
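Likewise, predictions obtained directly with the predict method are already returned in the original scale, with no manual back-transformation needed:

# Predict with automatic inverse transformation
# ==============================================================================
forecaster.predict(steps=5)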
Usage of link functions¶
A forecaster is evaluated with a generalized linear model (scikit-learn's GammaRegressor) that uses a log link function; since the predicted mean is the exponential of the linear predictor, it is always strictly positive.
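A minimal sketch with synthetic data (X_demo, y_demo and glm are illustrative names) showing this behaviour:

# Sketch: GammaRegressor predictions are always strictly positive
# ==============================================================================
rng    = np.random.default_rng(123)
X_demo = rng.normal(size=(100, 3))
y_demo = np.exp(X_demo @ np.array([0.5, -0.3, 0.2])) + rng.gamma(shape=2., size=100)
glm    = GammaRegressor(alpha=1, max_iter=1000).fit(X_demo, y_demo)
print((glm.predict(X_demo) > 0).all())  # True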
# Create a forecaster and train it
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = GammaRegressor(alpha=1, max_iter=100000),
    lags      = 20
)
# Backtesting predictions on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
    forecaster            = forecaster,
    y                     = data['users'],
    steps                 = 24,
    metric                = 'mean_squared_error',
    initial_train_size    = len(data_train),
    fixed_train_size      = False,
    gap                   = 0,
    allow_incomplete_fold = True,
    refit                 = False,
    verbose               = False,
    show_progress         = True
)
print(f"Backtesting metric: {metric}")
predictions.head()
Backtesting metric: 3509.69378608639
| date_time | pred |
|---|---|
| 2011-02-01 00:00:00 | 6.685830 |
| 2011-02-01 01:00:00 | 10.238422 |
| 2011-02-01 02:00:00 | 11.950507 |
| 2011-02-01 03:00:00 | 9.194079 |
| 2011-02-01 04:00:00 | 6.372489 |
In this case, the gamma regressor model performs worse in terms of error, but it produces no negative predictions:
# Negative predictions
# ======================================================================================
predictions[predictions.pred < 0]
(Empty DataFrame: there are no negative predictions.)
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_test['users'].plot(ax=ax, label='test')
predictions['pred'].plot(ax=ax, label='predictions')
ax.legend();
Usage of XGBoost objective functions¶
Now an XGBoost model is trained with the objective function set to reg:gamma, so that the output is the mean of a gamma distribution, which is defined only for positive values.
# Create forecaster and train
# ==============================================================================
params = {
    'tree_method': 'hist',
    'objective': 'reg:gamma'
}
forecaster = ForecasterAutoreg(
    regressor = XGBRegressor(**params),
    lags      = 24
)
# Backtesting predictions on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
    forecaster            = forecaster,
    y                     = data['users'],
    steps                 = 24,
    metric                = 'mean_squared_error',
    initial_train_size    = len(data_train),
    fixed_train_size      = False,
    gap                   = 0,
    allow_incomplete_fold = True,
    refit                 = False,
    verbose               = False,
    show_progress         = True
)
print(f"Backtesting metric: {metric}")
Backtesting metric: 1665.3412504662654
# Negative predictions
# ==============================================================================
predictions[predictions.pred < 0]
(Empty DataFrame: there are no negative predictions.)
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_test['users'].plot(ax=ax, label='test')
predictions['pred'].plot(ax=ax, label='predictions')
ax.legend();
The forecaster performs better than the Ridge regressor with a log transformer, and no negative predictions are produced.
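For reference, the backtesting errors reported above can be gathered into a single summary (values rounded):

# Summary of backtesting metrics (mean squared error)
# ==============================================================================
summary = pd.DataFrame({
    'model': ['Ridge + log transformer', 'GammaRegressor', 'XGBoost (reg:gamma)'],
    'mse'  : [1991.93, 3509.69, 1665.34]
})
summary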