Forecasting with XGBoost¶
XGBoost, short for Extreme Gradient Boosting, is a highly efficient implementation of the stochastic gradient boosting algorithm that has become a benchmark in machine learning. In addition to its native API, the XGBoost library provides the XGBRegressor class, which follows the scikit-learn API and is therefore compatible with skforecast.
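As a quick illustration of that compatibility, the following minimal sketch (not part of the original notebook, using synthetic data) shows that XGBRegressor exposes the standard scikit-learn fit/predict interface, which is exactly what skforecast requires of a regressor:

# XGBRegressor behaves like any scikit-learn regressor (synthetic example)
# ==============================================================================
import numpy as np
from xgboost import XGBRegressor

X_toy = np.random.rand(100, 3)                                  # 100 samples, 3 features
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.rand(100)

model = XGBRegressor()
model.fit(X_toy, y_toy)          # same signature as a scikit-learn regressor
model.predict(X_toy[:5])         # same signature for prediction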
Libraries¶
In [1]:
# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
Data¶
In [2]:
# Download data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o_exog.csv')
data = pd.read_csv(url, sep=',', header=0, names=['date', 'y', 'exog_1', 'exog_2'])
# Data preprocessing
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%Y/%m/%d')
data = data.set_index('date')
data = data.asfreq('MS')
steps = 36
data_train = data.iloc[:-steps, :]
data_test = data.iloc[-steps:, :]
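Plotting the partition is a quick sanity check on the split. This small sketch is not in the original notebook; it reuses the variables defined in the cell above:

# Plot train-test partition
# ==============================================================================
fig, ax = plt.subplots(figsize=(9, 4))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
ax.legend();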
Create and train forecaster¶
In [3]:
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = XGBRegressor(),
                 lags      = 8
             )

# Note: the forecaster is trained on the full series, so the forecast will
# start right after the last observed date (2008-06-01).
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
forecaster
Out[3]:
=================
ForecasterAutoreg
=================
Regressor: XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, ...)
Lags: [1 2 3 4 5 6 7 8]
Transformer for y: None
Transformer for exog: None
Window size: 8
Included exogenous: True
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'>
Exogenous variables names: ['exog_1', 'exog_2']
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2008-06-01 00:00:00')]
Training index type: DatetimeIndex
Training index frequency: MS
Regressor parameters: {'objective': 'reg:squarederror', 'base_score': 0.5, 'booster': 'gbtree', 'callbacks': None, 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'gamma': 0, 'gpu_id': -1, 'grow_policy': 'depthwise', 'importance_type': None, 'interaction_constraints': '', 'learning_rate': 0.300000012, 'max_bin': 256, 'max_cat_to_onehot': 4, 'max_delta_step': 0, 'max_depth': 6, 'max_leaves': 0, 'min_child_weight': 1, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 100, 'n_jobs': 0, 'num_parallel_tree': 1, 'predictor': 'auto', 'random_state': 0, 'reg_alpha': 0, 'reg_lambda': 1, 'sampling_method': 'uniform', 'scale_pos_weight': 1, 'subsample': 1, 'tree_method': 'exact', 'validate_parameters': 1, 'verbosity': None}
Creation date: 2022-09-24 08:33:22
Last fit date: 2022-09-24 08:33:23
Skforecast version: 0.5.0
Python version: 3.9.13
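The XGBRegressor hyperparameters listed above are the library defaults. They can be overridden when the regressor is created; the sketch below uses hypothetical values and a hypothetical variable name (forecaster_tuned), purely to show where the settings go:

# Forecaster with custom XGBoost hyperparameters (illustrative values)
# ==============================================================================
forecaster_tuned = ForecasterAutoreg(
                       regressor = XGBRegressor(n_estimators=500, max_depth=4, learning_rate=0.01),
                       lags      = 8
                   )
forecaster_tuned.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])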
Prediction¶
In [4]:
# Predict
# ==============================================================================
# `exog` must supply values of the exogenous variables for the predicted steps.
# Since the forecaster was trained on the full series, the last 36 observed
# values (data_test) are reused here purely to illustrate the API.
forecaster.predict(steps=10, exog=data_test[['exog_1', 'exog_2']])
Out[4]:
2008-07-01    0.700701
2008-08-01    0.829139
2008-09-01    0.983677
2008-10-01    1.098782
2008-11-01    1.078021
2008-12-01    1.206761
2009-01-01    1.149827
2009-02-01    1.049927
2009-03-01    0.947129
2009-04-01    0.700440
Freq: MS, Name: pred, dtype: float64
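The result is a pandas Series whose index continues right after the training range. A short sketch (not in the original notebook) to visualize the forecast next to the observed series:

# Plot observed series and forecast
# ==============================================================================
predictions = forecaster.predict(steps=10, exog=data_test[['exog_1', 'exog_2']])
fig, ax = plt.subplots(figsize=(9, 4))
data['y'].plot(ax=ax, label='observed')
predictions.plot(ax=ax, label='forecast')
ax.legend();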
Feature importance¶
In [5]:
# Predictors importance
# ==============================================================================
forecaster.get_feature_importance()
Out[5]:
| | feature | importance |
|---|---|---|
| 0 | lag_1 | 0.358967 |
| 1 | lag_2 | 0.093567 |
| 2 | lag_3 | 0.016729 |
| 3 | lag_4 | 0.044611 |
| 4 | lag_5 | 0.054733 |
| 5 | lag_6 | 0.009510 |
| 6 | lag_7 | 0.097179 |
| 7 | lag_8 | 0.027964 |
| 8 | exog_1 | 0.186294 |
| 9 | exog_2 | 0.110447 |
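get_feature_importance() reports the importances of the underlying XGBRegressor, which skforecast stores as forecaster.regressor. Other XGBoost importance metrics can be pulled from the fitted booster directly; for example, by total gain (a sketch, assuming the forecaster has already been fitted as above):

# Importance by total gain, taken directly from the fitted XGBoost booster
# ==============================================================================
forecaster.regressor.get_booster().get_score(importance_type='total_gain')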