Avoid Negative Predictions in Forecasting¶
When training a forecasting model, some predictions may turn out to be negative even if none of the training observations are. This is quite common when predicting inherently non-negative quantities such as attendance, sales, or rainfall. To avoid it, the following approaches can be used:
Use a $\log(x + K)$ transformation so that the transformed time series is always well defined. $K$ is a positive constant (typically 1) that avoids evaluating $\log(x)$ at 0, where it is undefined. Despite the simplicity of the approach, be aware that some caveats may arise depending on the metric used to optimize the models.
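As a minimal illustration with made-up values, the transformation and its inverse for $K = 1$ (equivalent to NumPy's log1p and expm1 functions used later in this tutorial):

# Example of the log(x + K) transformation and its inverse (K = 1)
# ==============================================================================
import numpy as np
y = np.array([0., 1., 10., 100.])
K = 1
y_log  = np.log(y + K)         # always defined, since y + K > 0
y_back = np.exp(y_log) - K     # the inverse transformation recovers the original values
print(np.allclose(y, y_back))  # True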
Use a different link function in the model. The link function provides the relationship between the linear predictor and the expected value of the response variable. Several machine learning algorithms also support objective functions that account for this situation: for example, when predicting the number of visitors, the count:poisson objective can be used in XGBoost (poisson in LightGBM), or a gamma objective if the target is always strictly positive.
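As a minimal sketch with synthetic count data (X_demo, y_demo and model are illustrative names), an XGBoost regressor trained with a Poisson objective never produces negative predictions:

# Sketch: XGBoost with a Poisson objective on synthetic count data
# ==============================================================================
import numpy as np
from xgboost import XGBRegressor

rng    = np.random.default_rng(123)
X_demo = rng.normal(size=(200, 4))
y_demo = rng.poisson(lam=np.exp(X_demo[:, 0]))  # non-negative counts
model  = XGBRegressor(objective='count:poisson', n_estimators=50)
model.fit(X_demo, y_demo)
print((model.predict(X_demo) >= 0).all())  # True: predictions are never negative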
The following tutorial will explore both possibilities.
Libraries¶
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
plt.style.use('seaborn-v0_8-darkgrid')
from sklearn.linear_model import Ridge
from sklearn.linear_model import GammaRegressor
from sklearn.linear_model import TweedieRegressor
from sklearn.linear_model import PoissonRegressor
from sklearn.preprocessing import FunctionTransformer
from sklearn.metrics import mean_squared_error
from xgboost import XGBRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import backtesting_forecaster
Data¶
# Downloading data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/Estadistica-machine-'
'learning-python/master/data/bike_sharing_dataset_clean.csv')
data = pd.read_csv(url, usecols=['date_time', 'users'], nrows=1000)
data['date_time'] = pd.to_datetime(data['date_time'], format='%Y-%m-%d %H:%M:%S')
data = data.set_index('date_time')
data = data.asfreq('H')
data = data.sort_index()
# Split train-test
# ==============================================================================
end_train = '2011-01-31 23:59:00'
data_train = data.loc[: end_train, :]
data_test = data.loc[end_train:, :]
print(f"Dates train: {data_train.index.min()} --- {data_train.index.max()} (n={len(data_train)})")
print(f"Dates test : {data_test.index.min()} --- {data_test.index.max()} (n={len(data_test)})")
Dates train: 2011-01-01 00:00:00 --- 2011-01-31 23:00:00 (n=744)
Dates test : 2011-02-01 00:00:00 --- 2011-02-11 15:00:00 (n=256)
# Plot time series
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_train['users'].plot(ax=ax, label='train')
data_test['users'].plot(ax=ax, label='test')
ax.set_title('Number of users')
ax.set_xlabel('')
ax.legend();
Forecaster¶
# Create a forecaster and train it
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = Ridge(random_state=123),
    lags      = 24
)
# Backtesting predictions on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
    forecaster            = forecaster,
    y                     = data['users'],
    steps                 = 24,
    metric                = 'mean_squared_error',
    initial_train_size    = len(data_train),
    fixed_train_size      = False,
    gap                   = 0,
    allow_incomplete_fold = True,
    refit                 = False,
    verbose               = False,
    show_progress         = True
)
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_test['users'].plot(ax=ax, label='test')
predictions['pred'].plot(ax=ax, label='predictions')
ax.set_xlabel('')
ax.legend();
The graph above shows that some predictions are negative.
# Negative predictions
# ==============================================================================
predictions[predictions.pred < 0]
| date_time | pred |
|---|---|
| 2011-02-01 00:00:00 | -23.337236 |
| 2011-02-01 01:00:00 | -17.691405 |
| 2011-02-02 00:00:00 | -5.243456 |
| 2011-02-02 01:00:00 | -19.363139 |
| 2011-02-03 01:00:00 | -6.441943 |
| 2011-02-04 01:00:00 | -10.579940 |
| 2011-02-05 00:00:00 | -6.026119 |
| 2011-02-05 01:00:00 | -21.396841 |
| 2011-02-07 04:00:00 | -3.412043 |
| 2011-02-07 05:00:00 | -3.701964 |
| 2011-02-08 00:00:00 | -17.045913 |
| 2011-02-08 01:00:00 | -13.233004 |
| 2011-02-09 00:00:00 | -25.315228 |
| 2011-02-09 01:00:00 | -24.743686 |
| 2011-02-10 01:00:00 | -5.704407 |
| 2011-02-11 01:00:00 | -11.758940 |
Modeling the time series on a logarithmic scale¶
# Transform data into a logarithmic scale
# =============================================================================
data_log = np.log1p(data)
data_train_log = np.log1p(data_train)
data_test_log = np.log1p(data_test)
# Create a forecaster and train it
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = Ridge(random_state=123),
    lags      = 24
)
# Backtesting predictions on test data
# ==============================================================================
# Since the data has been transformed outside the forecaster, the predictions
# and metric are not in the original scale. They need to be transformed back.
metric, predictions = backtesting_forecaster(
    forecaster            = forecaster,
    y                     = data_log['users'],
    steps                 = 24,
    metric                = 'mean_squared_error',
    initial_train_size    = len(data_train_log),
    fixed_train_size      = False,
    gap                   = 0,
    allow_incomplete_fold = True,
    refit                 = False,
    verbose               = False,
    show_progress         = True
)
# Revert the transformation
# ==============================================================================
predictions = np.expm1(predictions)
predictions.head(4)
| date_time | pred |
|---|---|
| 2011-02-01 00:00:00 | 7.936823 |
| 2011-02-01 01:00:00 | 4.210682 |
| 2011-02-01 02:00:00 | 3.009993 |
| 2011-02-01 03:00:00 | 3.083922 |
# Backtesting metric (test data)
# ==============================================================================
# The error metric is calculated once the transformation is reversed.
metric = mean_squared_error(y_true=data_test['users'], y_pred=predictions)
print(f"Backtesting metric (test data): {metric}")
Backtesting metric (test data): 1991.9332571759892
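As mentioned in the introduction, the scale on which the error is computed matters. As an illustration, reusing the objects created above, the MSE can also be computed on the logarithmic scale; the two values are not directly comparable, so care is needed when optimizing or comparing models:

# Backtesting metric on the logarithmic scale (illustrative)
# ==============================================================================
metric_log_scale = mean_squared_error(
    y_true = data_test_log['users'],
    y_pred = np.log1p(predictions['pred'])
)
print(f"Backtesting metric (log scale): {metric_log_scale}")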
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_test['users'].plot(ax=ax, label='test')
predictions['pred'].plot(ax=ax, label='predictions')
ax.legend();
Include a logarithmic transformer as part of the forecaster¶
Using scikit-learn's FunctionTransformer, it is possible to include custom transformations in the forecaster object, for example, a logarithmic transformation. If the FunctionTransformer has an inverse function, the output of the predict method is automatically transformed back to the original scale.
# Create a custom transformer
# =============================================================================
transformer_y = FunctionTransformer(func=np.log1p, inverse_func=np.expm1)
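A quick sanity check, on a few illustrative values, that the transformer and its inverse round-trip correctly:

# Check that the inverse function undoes the transformation
# ==============================================================================
x = np.array([[0.], [1.], [10.], [100.]])
x_trans = transformer_y.fit_transform(x)
print(np.allclose(transformer_y.inverse_transform(x_trans), x))  # True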
# Create a forecaster and train it
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor        = Ridge(random_state=123),
    lags             = 24,
    transformer_y    = transformer_y,
    transformer_exog = None
)
forecaster.fit(data['users'])
# Backtesting predictions on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
    forecaster            = forecaster,
    y                     = data['users'],
    steps                 = 24,
    metric                = 'mean_squared_error',
    initial_train_size    = len(data_train),
    fixed_train_size      = False,
    gap                   = 0,
    allow_incomplete_fold = True,
    refit                 = False,
    verbose               = False,
    show_progress         = True
)
# Since the transformation is included in the forecaster, predictions are
# automatically transformed back to the original scale, and the metric is
# calculated on that scale.
print(f"Backtesting metric: {metric}")
predictions.head()
Backtesting metric: 1991.9332571759892
| date_time | pred |
|---|---|
| 2011-02-01 00:00:00 | 7.936823 |
| 2011-02-01 01:00:00 | 4.210682 |
| 2011-02-01 02:00:00 | 3.009993 |
| 2011-02-01 03:00:00 | 3.083922 |
| 2011-02-01 04:00:00 | 3.865169 |
The results are the same as when the data were log-transformed outside the forecaster. However, by including the log transformation in the forecaster, all inverse transformations are handled automatically.
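Likewise, predictions obtained directly with the predict method are already returned in the original scale, with no manual back-transformation needed:

# Predict with automatic inverse transformation
# ==============================================================================
forecaster.predict(steps=5)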
Usage of link functions¶
A forecaster is evaluated with a generalized linear model (scikit-learn's GammaRegressor) that uses a log link function; since the predicted mean is the exponential of the linear predictor, it is always strictly positive.
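A minimal sketch with synthetic data (X_demo, y_demo and glm are illustrative names) showing this behaviour:

# Sketch: GammaRegressor predictions are always strictly positive
# ==============================================================================
rng    = np.random.default_rng(123)
X_demo = rng.normal(size=(100, 3))
y_demo = np.exp(X_demo @ np.array([0.5, -0.3, 0.2])) + rng.gamma(shape=2., size=100)
glm    = GammaRegressor(alpha=1, max_iter=1000).fit(X_demo, y_demo)
print((glm.predict(X_demo) > 0).all())  # True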
# Create a forecaster and train it
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = GammaRegressor(alpha=1, max_iter=100000),
    lags      = 20
)
# Backtesting predictions on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
    forecaster            = forecaster,
    y                     = data['users'],
    steps                 = 24,
    metric                = 'mean_squared_error',
    initial_train_size    = len(data_train),
    fixed_train_size      = False,
    gap                   = 0,
    allow_incomplete_fold = True,
    refit                 = False,
    verbose               = False,
    show_progress         = True
)
print(f"Backtesting metric: {metric}")
predictions.head()
Backtesting metric: 3509.69378608639
| date_time | pred |
|---|---|
| 2011-02-01 00:00:00 | 6.685830 |
| 2011-02-01 01:00:00 | 10.238422 |
| 2011-02-01 02:00:00 | 11.950507 |
| 2011-02-01 03:00:00 | 9.194079 |
| 2011-02-01 04:00:00 | 6.372489 |
In this case, the gamma regressor model performs worse in terms of error, but it produces no negative predictions:
# Negative predictions
# ======================================================================================
predictions[predictions.pred < 0]
(Empty DataFrame: there are no negative predictions.)
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_test['users'].plot(ax=ax, label='test')
predictions['pred'].plot(ax=ax, label='predictions')
ax.legend();
Usage of XGBoost objective functions¶
Now an XGBoost model is trained with the objective function set to reg:gamma, so that the output is the mean of a gamma distribution, which is defined only for positive values.
# Create forecaster and train
# ==============================================================================
params = {
    'tree_method': 'hist',
    'objective': 'reg:gamma'
}
forecaster = ForecasterAutoreg(
    regressor = XGBRegressor(**params),
    lags      = 24
)
# Backtesting predictions on test data
# ==============================================================================
metric, predictions = backtesting_forecaster(
    forecaster            = forecaster,
    y                     = data['users'],
    steps                 = 24,
    metric                = 'mean_squared_error',
    initial_train_size    = len(data_train),
    fixed_train_size      = False,
    gap                   = 0,
    allow_incomplete_fold = True,
    refit                 = False,
    verbose               = False,
    show_progress         = True
)
print(f"Backtesting metric: {metric}")
Backtesting metric: 1665.3412504662654
# Negative predictions
# ==============================================================================
predictions[predictions.pred < 0]
(Empty DataFrame: there are no negative predictions.)
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_test['users'].plot(ax=ax, label='test')
predictions['pred'].plot(ax=ax, label='predictions')
ax.legend();
The forecaster performs better than the Ridge regressor with a log transformer, and no negative predictions are produced.
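For reference, the backtesting errors reported above can be gathered into a single summary (values rounded):

# Summary of backtesting metrics (mean squared error)
# ==============================================================================
summary = pd.DataFrame({
    'model': ['Ridge + log transformer', 'GammaRegressor', 'XGBoost (reg:gamma)'],
    'mse'  : [1991.93, 3509.69, 1665.34]
})
summary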