Forecasting with XGBoost, LightGBM and other Gradient Boosting models¶
Gradient boosting models have gained popularity in the machine learning community due to their ability to achieve excellent results in a wide range of use cases, including both regression and classification. Although these models have traditionally been less common in forecasting, recent research has shown that they can be highly effective in this domain. Some of the key advantages of using gradient boosting models for forecasting include:
The ease with which exogenous variables, in addition to autoregressive variables, can be incorporated into the model.
The ability to capture non-linear relationships between variables.
High scalability, which enables the models to handle large volumes of data.
Several implementations of gradient boosting are available in Python; four of the most widely used are XGBoost, LightGBM, scikit-learn's HistGradientBoostingRegressor and CatBoost. All of them follow the scikit-learn API, which makes them compatible with skforecast.
  Note
All of the gradient boosting libraries mentioned above - XGBoost, LightGBM, HistGradientBoostingRegressor, and CatBoost - can handle categorical features natively, but each requires a specific encoding strategy that may not be entirely intuitive. Detailed information can be found in categorical features and in this example.
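Because all of these regressors expose the scikit-learn API, they can be plugged into the same forecaster without changing the rest of the workflow. A minimal sketch (not executed in this document; it assumes scikit-learn and catboost are installed, and the object names are illustrative):

# Other scikit-learn compatible regressors (illustrative sketch)
# ==============================================================================
from sklearn.ensemble import HistGradientBoostingRegressor
from catboost import CatBoostRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg

forecaster_hist = ForecasterAutoreg(
                      regressor = HistGradientBoostingRegressor(random_state=123),
                      lags      = 8
                  )

forecaster_cat = ForecasterAutoreg(
                     regressor = CatBoostRegressor(random_state=123, verbose=False),
                     lags      = 8
                 )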
Libraries¶
# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from skforecast.ForecasterAutoreg import ForecasterAutoreg
Data¶
# Download data
# ==============================================================================
url = (
    'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/'
    'data/h2o_exog.csv'
)
data = pd.read_csv(
    url, sep=',', header=0, names=['date', 'y', 'exog_1', 'exog_2']
)
# Data preprocessing
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.set_index('date')
data = data.asfreq('MS')
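
# Split train-test
# ==============================================================================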
steps = 36
data_train = data.iloc[:-steps, :]
data_test = data.iloc[-steps:, :]
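The matplotlib import above can be used for a quick visual check of the series and the train-test split. A minimal sketch (figure size and labels are arbitrary choices):

# Plot train-test split
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data_train['y'].plot(ax=ax, label='train')
data_test['y'].plot(ax=ax, label='test')
ax.set_ylabel('y')
ax.legend()
plt.show()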
Forecaster LightGBM¶
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(random_state = 123),
                 lags      = 8
             )
forecaster.fit(y=data_train['y'], exog=data_train[['exog_1', 'exog_2']])
forecaster
=================
ForecasterAutoreg
=================
Regressor: LGBMRegressor(random_state=123)
Lags: [1 2 3 4 5 6 7 8]
Transformer for y: None
Transformer for exog: None
Window size: 8
Weight function included: False
Exogenous included: True
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'>
Exogenous variables names: ['exog_1', 'exog_2']
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2005-06-01 00:00:00')]
Training index type: DatetimeIndex
Training index frequency: MS
Regressor parameters: {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': -1, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'silent': 'warn', 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0}
fit_kwargs: {}
Creation date: 2023-05-29 13:13:14
Last fit date: 2023-05-29 13:13:14
Skforecast version: 0.8.1
Python version: 3.10.11
Forecaster id: None
# Predict
# ==============================================================================
forecaster.predict(steps=10, exog=data_test[['exog_1', 'exog_2']])
2005-07-01    0.939158
2005-08-01    0.931943
2005-09-01    1.072937
2005-10-01    1.090429
2005-11-01    1.087492
2005-12-01    1.170073
2006-01-01    0.964073
2006-02-01    0.760841
2006-03-01    0.829831
2006-04-01    0.800095
Freq: MS, Name: pred, dtype: float64
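Since only 10 steps were predicted, the forecast can be compared against the first 10 test observations for a rough accuracy check. A sketch using scikit-learn's mean_squared_error (the metric choice is illustrative):

# Prediction error on the first 10 test steps
# ==============================================================================
from sklearn.metrics import mean_squared_error

predictions = forecaster.predict(steps=10, exog=data_test[['exog_1', 'exog_2']])
error = mean_squared_error(y_true=data_test['y'].iloc[:10], y_pred=predictions)
print(f"Test MSE (10 steps): {error:.4f}")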
# Feature importances
# ==============================================================================
forecaster.get_feature_importances()
|   | feature | importance |
|---|---------|------------|
| 0 | lag_1   | 61         |
| 1 | lag_2   | 91         |
| 2 | lag_3   | 14         |
| 3 | lag_4   | 38         |
| 4 | lag_5   | 35         |
| 5 | lag_6   | 49         |
| 6 | lag_7   | 25         |
| 7 | lag_8   | 26         |
| 8 | exog_1  | 43         |
| 9 | exog_2  | 127        |
Forecaster XGBoost¶
# Create and fit forecaster
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = XGBRegressor(random_state = 123),
                 lags      = 8
             )
forecaster.fit(y=data_train['y'], exog=data_train[['exog_1', 'exog_2']])
forecaster
=================
ForecasterAutoreg
=================
Regressor: XGBRegressor(base_score=None, booster=None, callbacks=None, colsample_bylevel=None, colsample_bynode=None, colsample_bytree=None, early_stopping_rounds=None, enable_categorical=False, eval_metric=None, feature_types=None, gamma=None, gpu_id=None, grow_policy=None, importance_type=None, interaction_constraints=None, learning_rate=None, max_bin=None, max_cat_threshold=None, max_cat_to_onehot=None, max_delta_step=None, max_depth=None, max_leaves=None, min_child_weight=None, missing=nan, monotone_constraints=None, n_estimators=100, n_jobs=None, num_parallel_tree=None, predictor=None, random_state=123, ...)
Lags: [1 2 3 4 5 6 7 8]
Transformer for y: None
Transformer for exog: None
Window size: 8
Weight function included: False
Exogenous included: True
Type of exogenous variable: <class 'pandas.core.frame.DataFrame'>
Exogenous variables names: ['exog_1', 'exog_2']
Training range: [Timestamp('1992-04-01 00:00:00'), Timestamp('2005-06-01 00:00:00')]
Training index type: DatetimeIndex
Training index frequency: MS
Regressor parameters: {'objective': 'reg:squarederror', 'base_score': None, 'booster': None, 'callbacks': None, 'colsample_bylevel': None, 'colsample_bynode': None, 'colsample_bytree': None, 'early_stopping_rounds': None, 'enable_categorical': False, 'eval_metric': None, 'feature_types': None, 'gamma': None, 'gpu_id': None, 'grow_policy': None, 'importance_type': None, 'interaction_constraints': None, 'learning_rate': None, 'max_bin': None, 'max_cat_threshold': None, 'max_cat_to_onehot': None, 'max_delta_step': None, 'max_depth': None, 'max_leaves': None, 'min_child_weight': None, 'missing': nan, 'monotone_constraints': None, 'n_estimators': 100, 'n_jobs': None, 'num_parallel_tree': None, 'predictor': None, 'random_state': 123, 'reg_alpha': None, 'reg_lambda': None, 'sampling_method': None, 'scale_pos_weight': None, 'subsample': None, 'tree_method': None, 'validate_parameters': None, 'verbosity': None}
fit_kwargs: {}
Creation date: 2023-05-29 13:13:15
Last fit date: 2023-05-29 13:13:16
Skforecast version: 0.8.1
Python version: 3.10.11
Forecaster id: None
# Predict
# ==============================================================================
forecaster.predict(steps=10, exog=data_test[['exog_1', 'exog_2']])
2005-07-01    0.882285
2005-08-01    0.971786
2005-09-01    1.106107
2005-10-01    1.064638
2005-11-01    1.094615
2005-12-01    1.139401
2006-01-01    0.948508
2006-02-01    0.784839
2006-03-01    0.774227
2006-04-01    0.789593
Freq: MS, Name: pred, dtype: float64
# Feature importances
# ==============================================================================
forecaster.get_feature_importances()
|   | feature | importance |
|---|---------|------------|
| 0 | lag_1   | 0.286422   |
| 1 | lag_2   | 0.125064   |
| 2 | lag_3   | 0.001548   |
| 3 | lag_4   | 0.027828   |
| 4 | lag_5   | 0.075020   |
| 5 | lag_6   | 0.011337   |
| 6 | lag_7   | 0.058954   |
| 7 | lag_8   | 0.045198   |
| 8 | exog_1  | 0.075610   |
| 9 | exog_2  | 0.293018   |
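Note that the two importance tables are on different scales: LGBMRegressor reports the number of times each feature is used in a split (importance_type='split', its default), while XGBRegressor reports a normalized gain. If a gain-based scale is preferred for LightGBM as well, it can be requested when creating the regressor. A minimal sketch:

# LightGBM feature importances on a gain scale (illustrative)
# ==============================================================================
forecaster = ForecasterAutoreg(
                 regressor = LGBMRegressor(random_state=123, importance_type='gain'),
                 lags      = 8
             )
forecaster.fit(y=data_train['y'], exog=data_train[['exog_1', 'exog_2']])
forecaster.get_feature_importances()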