Forecasting with XGBoost¶
XGBoost, short for Extreme Gradient Boosting, is a highly efficient implementation of the stochastic gradient boosting algorithm that has become a benchmark in machine learning. In addition to its native API, the XGBoost library provides the XGBRegressor class, which follows the scikit-learn API and is therefore compatible with skforecast.
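As a quick illustration of that compatibility, the following minimal sketch (not part of the original notebook, using synthetic data) shows that XGBRegressor exposes the standard scikit-learn fit/predict interface, which is exactly what skforecast requires of a regressor:

# XGBRegressor behaves like any scikit-learn regressor (synthetic example)
# ==============================================================================
import numpy as np
from xgboost import XGBRegressor

X_toy = np.random.rand(100, 3)                                  # 100 samples, 3 features
y_toy = X_toy @ np.array([1.0, -2.0, 0.5]) + 0.1 * np.random.rand(100)

model = XGBRegressor()
model.fit(X_toy, y_toy)          # same signature as a scikit-learn regressor
model.predict(X_toy[:5])         # same signature for prediction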
Libraries¶
In [1]:
# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
Data¶
In [2]:
# Download data
# ==============================================================================
url = ('https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/data/h2o_exog.csv')
data = pd.read_csv(url, sep=',', header=0, names=['date', 'y', 'exog_1', 'exog_2'])
# Data preprocessing
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%Y/%m/%d')
data = data.set_index('date')
data = data.asfreq('MS')
steps = 36
data_train = data.iloc[:-steps, :]
data_test = data.iloc[-steps:, :]
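Plotting the partition is a quick sanity check on the split. This small sketch is not in the original notebook; it reuses the variables defined in the cell above:

# Plot train-test partition
# ==============================================================================
fig, ax = plt.subplots(figsize=(9, 4))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
ax.legend();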
Create and train forecaster¶
In [3]:
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = XGBRegressor(),
                 lags      = 8
             )

# Note: the forecaster is trained on the full series, so the forecast will
# start right after the last observed date (2008-06-01).
forecaster.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])
forecaster
Out[3]:
=================
ForecasterAutoreg
=================
Regressor: XGBRegressor(base_score=0.5, booster='gbtree', callbacks=None,
colsample_bylevel=1, colsample_bynode=1, colsample_bytree=1,
early_stopping_rounds=None, enable_categorical=False,
eval_metric=None, gamma=0, gpu_id=-1, grow_policy='depthwise',
importance_type=None, interaction_constraints='',
learning_rate=0.300000012, max_bin=256, max_cat_to_onehot=4,
max_delta_step=0, max_depth=6, max_leaves=0, min_child_weight=1,
missing=nan, monotone_constraints='()', n_estimators=100, n_jobs=0,
num_parallel_tree=1, predictor='auto', random_state=0, reg_alpha=0,
reg_lambda=1, ...)
Lags: [1 2 3 4 5 6 7 8]
Transformer for y: None
Transformer for exog: None
Window size: 8
Included exogenous: True
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'>
Exogenous variables names: ['exog_1', 'exog_2']
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2008-06-01 00:00:00')]
Training index type: DatetimeIndex
Training index frequency: MS
Regressor parameters: {'objective': 'reg:squarederror', 'base_score': 0.5, 'booster': 'gbtree', 'callbacks': None, 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'gamma': 0, 'gpu_id': -1, 'grow_policy': 'depthwise', 'importance_type': None, 'interaction_constraints': '', 'learning_rate': 0.300000012, 'max_bin': 256, 'max_cat_to_onehot': 4, 'max_delta_step': 0, 'max_depth': 6, 'max_leaves': 0, 'min_child_weight': 1, 'missing': nan, 'monotone_constraints': '()', 'n_estimators': 100, 'n_jobs': 0, 'num_parallel_tree': 1, 'predictor': 'auto', 'random_state': 0, 'reg_alpha': 0, 'reg_lambda': 1, 'sampling_method': 'uniform', 'scale_pos_weight': 1, 'subsample': 1, 'tree_method': 'exact', 'validate_parameters': 1, 'verbosity': None}
Creation date: 2022-09-24 08:33:22
Last fit date: 2022-09-24 08:33:23
Skforecast version: 0.5.0
Python version: 3.9.13
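The XGBRegressor hyperparameters listed above are the library defaults. They can be overridden when the regressor is created; the sketch below uses hypothetical values and a hypothetical variable name (forecaster_tuned), purely to show where the settings go:

# Forecaster with custom XGBoost hyperparameters (illustrative values)
# ==============================================================================
forecaster_tuned = ForecasterAutoreg(
                       regressor = XGBRegressor(n_estimators=500, max_depth=4, learning_rate=0.01),
                       lags      = 8
                   )
forecaster_tuned.fit(y=data['y'], exog=data[['exog_1', 'exog_2']])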
Prediction¶
In [4]:
# Predict
# ==============================================================================
# `exog` must supply values of the exogenous variables for the predicted steps.
# Since the forecaster was trained on the full series, the last 36 observed
# values (data_test) are reused here purely to illustrate the API.
forecaster.predict(steps=10, exog=data_test[['exog_1', 'exog_2']])
Out[4]:
2008-07-01    0.700701
2008-08-01    0.829139
2008-09-01    0.983677
2008-10-01    1.098782
2008-11-01    1.078021
2008-12-01    1.206761
2009-01-01    1.149827
2009-02-01    1.049927
2009-03-01    0.947129
2009-04-01    0.700440
Freq: MS, Name: pred, dtype: float64
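The result is a pandas Series whose index continues right after the training range. A short sketch (not in the original notebook) to visualize the forecast next to the observed series:

# Plot observed series and forecast
# ==============================================================================
predictions = forecaster.predict(steps=10, exog=data_test[['exog_1', 'exog_2']])
fig, ax = plt.subplots(figsize=(9, 4))
data['y'].plot(ax=ax, label='observed')
predictions.plot(ax=ax, label='forecast')
ax.legend();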
Feature importance¶
In [5]:
# Predictors importance
# ==============================================================================
forecaster.get_feature_importance()
Out[5]:
| | feature | importance |
|---|---|---|
| 0 | lag_1 | 0.358967 |
| 1 | lag_2 | 0.093567 |
| 2 | lag_3 | 0.016729 |
| 3 | lag_4 | 0.044611 |
| 4 | lag_5 | 0.054733 |
| 5 | lag_6 | 0.009510 |
| 6 | lag_7 | 0.097179 |
| 7 | lag_8 | 0.027964 |
| 8 | exog_1 | 0.186294 |
| 9 | exog_2 | 0.110447 |
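get_feature_importance() reports the importances of the underlying XGBRegressor, which skforecast stores as forecaster.regressor. Other XGBoost importance metrics can be pulled from the fitted booster directly; for example, by total gain (a sketch, assuming the forecaster has already been fitted as above):

# Importance by total gain, taken directly from the fitted XGBoost booster
# ==============================================================================
forecaster.regressor.get_booster().get_score(importance_type='total_gain')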