Quick start skforecast¶
Welcome to a quick start guide to skforecast! In this guide, we provide a code example that demonstrates how to create, validate, and optimize a recursive multi-step forecaster, ForecasterRecursive, using skforecast.
A Forecaster object in the skforecast library is a comprehensive container that provides essential functionality and methods for training a forecasting model and generating predictions for future points in time.
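As a quick preview (a minimal sketch using a synthetic monthly series and a generic scikit-learn regressor, not the data used later in this guide), the basic pattern is always create, fit, predict:
# Create-fit-predict pattern (minimal sketch, synthetic data)
# ==============================================================================
import pandas as pd
from sklearn.linear_model import Ridge
from skforecast.recursive import ForecasterRecursive

# Synthetic monthly series, only for illustration
y = pd.Series(
        data  = range(48),
        index = pd.date_range(start='2020-01-01', periods=48, freq='MS'),
        name  = 'y',
        dtype = float
    )

forecaster = ForecasterRecursive(regressor=Ridge(), lags=12)
forecaster.fit(y=y)
forecaster.predict(steps=5)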
If you need more detailed documentation or guidance, you can visit the User Guides section.
Without further ado, let's jump into the code example!
Libraries and data¶
# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
from skforecast.datasets import load_demo_dataset
from skforecast.preprocessing import RollingFeatures
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import backtesting_forecaster
from skforecast.model_selection import grid_search_forecaster
plt.style.use('seaborn-v0_8-darkgrid')
# Download data
# ==============================================================================
data = load_demo_dataset()
data.head(5)
datetime
1991-07-01    0.429795
1991-08-01    0.400906
1991-09-01    0.432159
1991-10-01    0.492543
1991-11-01    0.502369
Freq: MS, Name: y, dtype: float64
# Data partition train-test
# ==============================================================================
end_train = '2005-06-01 23:59:00'
print(
f"Train dates : {data.index.min()} --- {data.loc[:end_train].index.max()} "
f"(n={len(data.loc[:end_train])})")
print(
f"Test dates : {data.loc[end_train:].index.min()} --- {data.index.max()} "
f"(n={len(data.loc[end_train:])})")
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data.loc[:end_train].plot(ax=ax, label='train')
data.loc[end_train:].plot(ax=ax, label='test')
ax.legend()
plt.show();
Train dates : 1991-07-01 00:00:00 --- 2005-06-01 00:00:00  (n=168)
Test dates  : 2005-07-01 00:00:00 --- 2008-06-01 00:00:00  (n=36)
Train a forecaster¶
Let's start by training a forecaster! For a more in-depth guide to using ForecasterRecursive, visit the User guide.
# Create and fit a recursive multi-step forecaster (ForecasterRecursive)
# ==============================================================================
forecaster = ForecasterRecursive(
regressor = LGBMRegressor(random_state=123, verbose=-1),
lags = 15,
window_features = RollingFeatures(stats=['mean'], window_sizes=10)
)
forecaster.fit(y=data.loc[:end_train])
forecaster
ForecasterRecursive
General Information
- Regressor: LGBMRegressor
- Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
- Window features: ['roll_mean_10']
- Window size: 15
- Exogenous included: False
- Weight function included: False
- Differentiation order: None
- Creation date: 2024-11-10 18:11:49
- Last fit date: 2024-11-10 18:11:50
- Skforecast version: 0.14.0
- Python version: 3.11.10
- Forecaster id: None
Exogenous Variables
- None
Data Transformations
- Transformer for y: None
- Transformer for exog: None
Training Information
- Training range: [Timestamp('1991-07-01 00:00:00'), Timestamp('2005-06-01 00:00:00')]
- Training index type: DatetimeIndex
- Training index frequency: MS
Regressor Parameters
- {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
- {}
💡 Tip
To understand what can be done when initializing a forecaster with skforecast, visit Forecaster parameters and Forecaster attributes.
Prediction¶
After training the forecaster, the predict method can be used to make predictions for the next $n$ steps.
# Predict
# ==============================================================================
predictions = forecaster.predict(steps=len(data.loc[end_train:]))
predictions.head(3)
2005-07-01    1.026507
2005-08-01    1.042429
2005-09-01    1.116730
Freq: MS, Name: pred, dtype: float64
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data.loc[:end_train].plot(ax=ax, label='train')
data.loc[end_train:].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
# Prediction error on test data
# ==============================================================================
error_mse = mean_squared_error(
y_true = data.loc[end_train:],
y_pred = predictions
)
print(f"Test error (mse): {error_mse}")
Test error (mse): 0.006632513357651682
Backtesting: forecaster validation¶
In time series forecasting, backtesting refers to the process of validating a predictive model using historical data. The technique involves stepping through past periods, fold by fold, to assess how well the model would have performed if it had been used to make predictions during that time. Backtesting is a form of cross-validation applied to previous periods of the time series.
Backtesting can be done using a variety of techniques, such as simple train-test splits or more sophisticated methods like rolling windows or expanding windows. The choice of method depends on the specific needs of the analysis and the characteristics of the time series data. For more detailed documentation on backtesting, visit: User guide Backtesting forecaster.
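The example below uses an expanding window: refit = True with fixed_train_size = False, so the training set grows at every refit. As a minimal sketch of the rolling-window alternative mentioned above, keeping the training set at a constant length only requires flipping that last argument (cv_rolling is an illustrative name):
# Rolling-window folds (minimal sketch)
# ==============================================================================
cv_rolling = TimeSeriesFold(
    steps              = 10,
    initial_train_size = len(data.loc[:end_train]),
    refit              = True,
    fixed_train_size   = True  # drop the oldest observations at each refit (rolling window)
)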
# Backtesting
# ==============================================================================
cv = TimeSeriesFold(
steps = 10,
initial_train_size = len(data.loc[:end_train]),
refit = True,
fixed_train_size = False
)
metric, predictions_backtest = backtesting_forecaster(
forecaster = forecaster,
y = data,
cv = cv,
metric = 'mean_squared_error',
n_jobs = 'auto',
verbose = True,
show_progress = True
)
print(f"Backtest error: {metric}")
Information of folds
--------------------
Number of observations used for initial training: 168
Number of observations used for backtesting: 36
    Number of folds: 4
    Number skipped folds: 0
    Number of steps per fold: 10
    Number of steps to exclude between last observed data (last window) and predictions (gap): 0
    Last fold only includes 6 observations.

Fold: 0
    Training:   1991-07-01 00:00:00 -- 2005-06-01 00:00:00  (n=168)
    Validation: 2005-07-01 00:00:00 -- 2006-04-01 00:00:00  (n=10)
Fold: 1
    Training:   1991-07-01 00:00:00 -- 2006-04-01 00:00:00  (n=178)
    Validation: 2006-05-01 00:00:00 -- 2007-02-01 00:00:00  (n=10)
Fold: 2
    Training:   1991-07-01 00:00:00 -- 2007-02-01 00:00:00  (n=188)
    Validation: 2007-03-01 00:00:00 -- 2007-12-01 00:00:00  (n=10)
Fold: 3
    Training:   1991-07-01 00:00:00 -- 2007-12-01 00:00:00  (n=198)
    Validation: 2008-01-01 00:00:00 -- 2008-06-01 00:00:00  (n=6)
Backtest error:    mean_squared_error
0            0.006816
Hyperparameter tuning and lags selection¶
Hyperparameter tuning is a crucial aspect of developing accurate and effective machine learning models. Hyperparameters are values that cannot be learned from the data and must be set by the user before the model is trained. They can significantly impact the performance of the model, and tuning them carefully can improve its accuracy and generalization to new data. In forecasting models, the lags included in the model can be considered an additional hyperparameter.
Hyperparameter tuning involves systematically testing different values or combinations of hyperparameters (including lags) to find the configuration that produces the best results. The skforecast library offers several hyperparameter tuning strategies, including grid search, random search, and Bayesian search. For more detailed documentation on hyperparameter tuning, visit: Hyperparameter tuning and lags selection.
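The example below uses grid search. As a hedged sketch of the random search alternative mentioned above (assuming random_search_forecaster mirrors the grid_search_forecaster interface, with param_distributions and n_iter arguments; check the API reference before relying on it), only a random sample of candidate combinations is evaluated:
# Random search (sketch; interface assumed to mirror grid_search_forecaster)
# ==============================================================================
from skforecast.model_selection import random_search_forecaster

results_random = random_search_forecaster(
    forecaster          = forecaster,
    y                   = data,
    param_distributions = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15, 20]},
    lags_grid           = [3, 10, [1, 2, 3, 20]],
    cv                  = TimeSeriesFold(steps=10, initial_train_size=len(data.loc[:end_train]), refit=False),
    metric              = 'mean_squared_error',
    n_iter              = 5,      # number of randomly sampled parameter combinations
    return_best         = False,  # keep the forecaster unchanged; just inspect results
    show_progress       = True
)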
# Grid search hyperparameter and lags
# ==============================================================================
# Regressor hyperparameters
param_grid = {
'n_estimators': [50, 100],
'max_depth': [5, 10, 15]
}
# Lags used as predictors
lags_grid = [3, 10, [1, 2, 3, 20]]
# Folds
cv = TimeSeriesFold(
steps = 10,
initial_train_size = len(data.loc[:end_train]),
refit = False,
)
results_grid = grid_search_forecaster(
forecaster = forecaster,
y = data,
param_grid = param_grid,
lags_grid = lags_grid,
cv = cv,
metric = 'mean_squared_error',
return_best = True,
n_jobs = 'auto',
verbose = False,
show_progress = True
)
`Forecaster` refitted using the best-found lags and parameters, and the whole data set:
  Lags: [ 1  2  3  4  5  6  7  8  9 10]
  Parameters: {'max_depth': 10, 'n_estimators': 50}
  Backtesting metric: 0.017861588026122758
# Grid results
# ==============================================================================
results_grid
|    | lags                            | lags_label                      | params                                 | mean_squared_error | max_depth | n_estimators |
|----|---------------------------------|---------------------------------|----------------------------------------|--------------------|-----------|--------------|
| 0  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 10, 'n_estimators': 50}  | 0.017862           | 10        | 50           |
| 1  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 15, 'n_estimators': 50}  | 0.017862           | 15        | 50           |
| 2  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 5, 'n_estimators': 100}  | 0.018772           | 5         | 100          |
| 3  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 15, 'n_estimators': 100} | 0.018898           | 15        | 100          |
| 4  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 10, 'n_estimators': 100} | 0.018898           | 10        | 100          |
| 5  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 5, 'n_estimators': 50}   | 0.019198           | 5         | 50           |
| 6  | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 10, 'n_estimators': 50}  | 0.035032           | 10        | 50           |
| 7  | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 15, 'n_estimators': 50}  | 0.035032           | 15        | 50           |
| 8  | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 5, 'n_estimators': 50}   | 0.035168           | 5         | 50           |
| 9  | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 15, 'n_estimators': 100} | 0.040312           | 15        | 100          |
| 10 | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 10, 'n_estimators': 100} | 0.040312           | 10        | 100          |
| 11 | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 5, 'n_estimators': 100}  | 0.040562           | 5         | 100          |
| 12 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 5, 'n_estimators': 100}  | 0.042146           | 5         | 100          |
| 13 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 10, 'n_estimators': 100} | 0.042147           | 10        | 100          |
| 14 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 15, 'n_estimators': 100} | 0.042147           | 15        | 100          |
| 15 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 5, 'n_estimators': 50}   | 0.043385           | 5         | 50           |
| 16 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 10, 'n_estimators': 50}  | 0.043385           | 10        | 50           |
| 17 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 15, 'n_estimators': 50}  | 0.043385           | 15        | 50           |
Since return_best = True, the forecaster object is updated with the best configuration found and trained on the whole data set. This means that the final model obtained from the grid search uses the combination of lags and hyperparameters that achieved the best backtesting metric. This final model can then be used to make predictions on new data.
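For instance (a minimal sketch reusing the predict method shown earlier), the refitted forecaster can generate forecasts beyond the available history directly:
# Predict with the refitted forecaster (minimal sketch)
# ==============================================================================
final_predictions = forecaster.predict(steps=12)  # 12 months beyond the last training date
final_predictions.head(3)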
# Print forecaster information
# ==============================================================================
forecaster
ForecasterRecursive
General Information
- Regressor: LGBMRegressor
- Lags: [ 1 2 3 4 5 6 7 8 9 10]
- Window features: ['roll_mean_10']
- Window size: 10
- Exogenous included: False
- Weight function included: False
- Differentiation order: None
- Creation date: 2024-11-10 18:11:49
- Last fit date: 2024-11-10 18:11:51
- Skforecast version: 0.14.0
- Python version: 3.11.10
- Forecaster id: None
Exogenous Variables
- None
Data Transformations
- Transformer for y: None
- Transformer for exog: None
Training Information
- Training range: [Timestamp('1991-07-01 00:00:00'), Timestamp('2008-06-01 00:00:00')]
- Training index type: DatetimeIndex
- Training index frequency: MS
Regressor Parameters
- {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': 10, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 50, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
- {}