Quick start skforecast
Welcome to a quick start guide to using skforecast! This guide provides a code example that demonstrates how to create, validate, and optimize a recursive multi-step forecaster, ForecasterAutoreg, using skforecast.
If you need more detailed documentation or guidance, you can visit the User Guides section.
Without further ado, let's jump into the code example!
Libraries
# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import backtesting_forecaster
from skforecast.model_selection import grid_search_forecaster
Data
# Download data
# ==============================================================================
url = (
    'https://raw.githubusercontent.com/JoaquinAmatRodrigo/skforecast/master/'
    'data/h2o.csv'
)
data = pd.read_csv(url, sep=',', header=0, names=['y', 'datetime'])
# Data preprocessing
# ==============================================================================
data['datetime'] = pd.to_datetime(data['datetime'], format='%Y/%m/%d')
data = data.set_index('datetime')
data = data.asfreq('MS')
data = data['y']
data = data.sort_index()
# Data partition train-test
# ==============================================================================
end_train = '2005-06-01 23:59:00'
print(
    f"Train dates : {data.index.min()} --- {data.loc[:end_train].index.max()} "
    f"(n={len(data.loc[:end_train])})"
)
print(
    f"Test dates  : {data.loc[end_train:].index.min()} --- {data.index.max()} "
    f"(n={len(data.loc[end_train:])})"
)
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data.loc[:end_train].plot(ax=ax, label='train')
data.loc[end_train:].plot(ax=ax, label='test')
ax.legend();
Train dates : 1991-07-01 00:00:00 --- 2005-06-01 00:00:00  (n=168)
Test dates  : 2005-07-01 00:00:00 --- 2008-06-01 00:00:00  (n=36)
Train a forecaster
Let's start by training a forecaster! For a more in-depth guide to using ForecasterAutoreg, visit the User guide ForecasterAutoreg.
# Create and fit a recursive multi-step forecaster (ForecasterAutoreg)
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = RandomForestRegressor(random_state=123),
    lags      = 15
)
forecaster.fit(y=data.loc[:end_train])
forecaster
=================
ForecasterAutoreg
=================
Regressor: RandomForestRegressor(random_state=123)
Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
Transformer for y: None
Transformer for exog: None
Window size: 15
Weight function included: False
Exogenous included: False
Type of exogenous variable: None
Exogenous variables names: None
Training range: [Timestamp('1991-07-01 00:00:00'), Timestamp('2005-06-01 00:00:00')]
Training index type: DatetimeIndex
Training index frequency: MS
Regressor parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': None, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 123, 'verbose': 0, 'warm_start': False}
fit_kwargs: {}
Creation date: 2023-05-19 21:22:42
Last fit date: 2023-05-19 21:22:42
Skforecast version: 0.8.0
Python version: 3.9.13
Forecaster id: None
Prediction
After training the forecaster, the predict method can be used to make predictions for the next $n$ steps.
# Predict
# ==============================================================================
predictions = forecaster.predict(steps=len(data.loc[end_train:]))
predictions.head(3)
2005-07-01    0.921840
2005-08-01    0.954921
2005-09-01    1.101716
Freq: MS, Name: pred, dtype: float64
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data.loc[:end_train].plot(ax=ax, label='train')
data.loc[end_train:].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
# Prediction error on test data
# ==============================================================================
error_mse = mean_squared_error(
    y_true = data.loc[end_train:],
    y_pred = predictions
)
print(f"Test error (mse): {error_mse}")
Test error (mse): 0.00429855684785846
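The MSE is expressed in squared units of the target, so the root mean squared error is often easier to interpret. As a minimal sketch, scikit-learn's mean_squared_error can return the RMSE directly via its squared argument:

# Prediction error (RMSE) on test data
# ==============================================================================
error_rmse = mean_squared_error(
    y_true  = data.loc[end_train:],
    y_pred  = predictions,
    squared = False  # return the root of the MSE, in the units of the series
)
print(f"Test error (rmse): {error_rmse}")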
Backtesting: forecaster validation
In time series forecasting, backtesting refers to the process of validating a predictive model using historical data. The technique involves moving backwards in time, step-by-step, to assess how well a model would have performed if it had been used to make predictions during that time period. Backtesting is a form of cross-validation that is applied to previous periods in the time series.
Backtesting can be done using a variety of techniques, such as simple train-test splits or more sophisticated methods like rolling windows or expanding windows. The choice of method depends on the specific needs of the analysis and the characteristics of the time series data. For more detailed documentation on backtesting, visit: User guide Backtesting forecaster.
# Backtesting
# ==============================================================================
metric, predictions_backtest = backtesting_forecaster(
    forecaster            = forecaster,
    y                     = data,
    steps                 = 10,
    metric                = 'mean_squared_error',
    initial_train_size    = len(data.loc[:end_train]),
    fixed_train_size      = False,
    gap                   = 0,
    allow_incomplete_fold = True,
    refit                 = True,
    verbose               = True,
    show_progress         = True
)
print(f"Backtest error: {metric}")
Information of backtesting process
----------------------------------
Number of observations used for initial training: 168
Number of observations used for backtesting: 36
    Number of folds: 4
    Number of steps per fold: 10
    Number of steps to exclude from the end of each train set before test (gap): 0
    Last fold only includes 6 observations.

Fold: 0
    Training:   1991-07-01 00:00:00 -- 2005-06-01 00:00:00  (n=168)
    Validation: 2005-07-01 00:00:00 -- 2006-04-01 00:00:00  (n=10)
Fold: 1
    Training:   1991-07-01 00:00:00 -- 2006-04-01 00:00:00  (n=178)
    Validation: 2006-05-01 00:00:00 -- 2007-02-01 00:00:00  (n=10)
Fold: 2
    Training:   1991-07-01 00:00:00 -- 2007-02-01 00:00:00  (n=188)
    Validation: 2007-03-01 00:00:00 -- 2007-12-01 00:00:00  (n=10)
Fold: 3
    Training:   1991-07-01 00:00:00 -- 2007-12-01 00:00:00  (n=198)
    Validation: 2008-01-01 00:00:00 -- 2008-06-01 00:00:00  (n=6)
100%|██████████| 4/4 [00:00<00:00, 4.60it/s]
Backtest error: 0.004810878299401304
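The backtest above uses an expanding window (fixed_train_size = False), so each refit adds the newest observations to the training set. To validate with a rolling window of constant size instead, the same call can be made with fixed_train_size = True. A minimal sketch:

# Backtesting with a rolling (fixed-size) training window
# ==============================================================================
metric_rolling, _ = backtesting_forecaster(
    forecaster         = forecaster,
    y                  = data,
    steps              = 10,
    metric             = 'mean_squared_error',
    initial_train_size = len(data.loc[:end_train]),
    fixed_train_size   = True,  # drop the oldest observations at each refit
    refit              = True,
    verbose            = False
)
print(f"Backtest error (rolling window): {metric_rolling}")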
Hyperparameter tuning and lags selection
Hyperparameter tuning is a crucial aspect of developing accurate and effective machine learning models. In machine learning, hyperparameters are values that cannot be learned from data and must be set by the user before the model is trained. These hyperparameters can significantly impact the performance of the model, and tuning them carefully can improve its accuracy and generalization to new data. In the case of forecasting models, the lags included in the model can be considered as an additional hyperparameter.
Hyperparameter tuning involves systematically testing different values or combinations of hyperparameters (including lags) to find the optimal configuration that produces the best results. The skforecast library offers various hyperparameter tuning strategies, including grid search, random search, and Bayesian search; a random search sketch is included at the end of this guide. For more detailed documentation on hyperparameter tuning, visit: Hyperparameter tuning and lags selection.
# Grid search hyperparameter and lags
# ==============================================================================
# Regressor hyperparameters
param_grid = {
    'n_estimators': [50, 100],
    'max_depth': [5, 10, 15]
}

# Lags used as predictors
lags_grid = [3, 10, [1, 2, 3, 20]]

results_grid = grid_search_forecaster(
    forecaster         = forecaster,
    y                  = data,
    param_grid         = param_grid,
    lags_grid          = lags_grid,
    steps              = 10,
    refit              = True,
    metric             = 'mean_squared_error',
    initial_train_size = len(data.loc[:end_train]),
    fixed_train_size   = False,
    return_best        = True,
    verbose            = False
)
Number of models compared: 18.
lags grid: 100%|██████████| 3/3 [00:08<00:00, 2.74s/it]
`Forecaster` refitted using the best-found lags and parameters, and the whole data set:
  Lags: [ 1 2 3 4 5 6 7 8 9 10]
  Parameters: {'max_depth': 10, 'n_estimators': 100}
  Backtesting metric: 0.015161925563451037
# Grid results
# ==============================================================================
results_grid
|    | lags | params | mean_squared_error | max_depth | n_estimators |
|----|------|--------|--------------------|-----------|--------------|
| 9  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 10, 'n_estimators': 100} | 0.015162 | 10 | 100 |
| 11 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 15, 'n_estimators': 100} | 0.015559 | 15 | 100 |
| 10 | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 15, 'n_estimators': 50} | 0.016284 | 15 | 50 |
| 8  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 10, 'n_estimators': 50} | 0.018412 | 10 | 50 |
| 6  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 5, 'n_estimators': 50} | 0.018674 | 5 | 50 |
| 7  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 5, 'n_estimators': 100} | 0.018912 | 5 | 100 |
| 14 | [1, 2, 3, 20] | {'max_depth': 10, 'n_estimators': 50} | 0.037743 | 10 | 50 |
| 17 | [1, 2, 3, 20] | {'max_depth': 15, 'n_estimators': 100} | 0.043362 | 15 | 100 |
| 13 | [1, 2, 3, 20] | {'max_depth': 5, 'n_estimators': 100} | 0.045687 | 5 | 100 |
| 2  | [1, 2, 3] | {'max_depth': 10, 'n_estimators': 50} | 0.047272 | 10 | 50 |
| 3  | [1, 2, 3] | {'max_depth': 10, 'n_estimators': 100} | 0.047515 | 10 | 100 |
| 5  | [1, 2, 3] | {'max_depth': 15, 'n_estimators': 100} | 0.050939 | 15 | 100 |
| 16 | [1, 2, 3, 20] | {'max_depth': 15, 'n_estimators': 50} | 0.051572 | 15 | 50 |
| 4  | [1, 2, 3] | {'max_depth': 15, 'n_estimators': 50} | 0.052900 | 15 | 50 |
| 12 | [1, 2, 3, 20] | {'max_depth': 5, 'n_estimators': 50} | 0.055255 | 5 | 50 |
| 15 | [1, 2, 3, 20] | {'max_depth': 10, 'n_estimators': 100} | 0.055293 | 10 | 100 |
| 1  | [1, 2, 3] | {'max_depth': 5, 'n_estimators': 100} | 0.056037 | 5 | 100 |
| 0  | [1, 2, 3] | {'max_depth': 5, 'n_estimators': 50} | 0.068426 | 5 | 50 |
Since return_best = True, the forecaster object is updated with the best configuration found and trained on the whole data set. This means that the final model obtained from grid search uses the combination of lags and hyperparameters that achieved the best backtesting metric. This final model can then be used for future predictions on new data.
forecaster
=================
ForecasterAutoreg
=================
Regressor: RandomForestRegressor(max_depth=10, random_state=123)
Lags: [ 1 2 3 4 5 6 7 8 9 10]
Transformer for y: None
Transformer for exog: None
Window size: 10
Weight function included: False
Exogenous included: False
Type of exogenous variable: None
Exogenous variables names: None
Training range: [Timestamp('1991-07-01 00:00:00'), Timestamp('2008-06-01 00:00:00')]
Training index type: DatetimeIndex
Training index frequency: MS
Regressor parameters: {'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'squared_error', 'max_depth': 10, 'max_features': 1.0, 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': 123, 'verbose': 0, 'warm_start': False}
fit_kwargs: {}
Creation date: 2023-05-19 21:22:42
Last fit date: 2023-05-19 21:22:52
Skforecast version: 0.8.0
Python version: 3.9.13
Forecaster id: None
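Grid search trains every combination in the grid (18 models above), which can become expensive for larger search spaces. As mentioned earlier, skforecast also offers random search, which samples a fixed number of candidate configurations instead. Below is a minimal sketch, assuming random_search_forecaster accepts param_distributions and n_iter arguments analogous to scikit-learn's RandomizedSearchCV; check the skforecast documentation for the exact signature of your version.

# Random search hyperparameter and lags (illustrative sketch)
# ==============================================================================
from scipy.stats import randint
from skforecast.model_selection import random_search_forecaster

# Distributions to sample hyperparameters from (assumed API, see note above)
param_distributions = {
    'n_estimators': randint(50, 200),
    'max_depth'   : randint(5, 20)
}

results_random = random_search_forecaster(
    forecaster          = forecaster,
    y                   = data,
    param_distributions = param_distributions,
    lags_grid           = [3, 10, [1, 2, 3, 20]],
    steps               = 10,
    n_iter              = 10,   # number of sampled configurations
    metric              = 'mean_squared_error',
    initial_train_size  = len(data.loc[:end_train]),
    fixed_train_size    = False,
    refit               = True,
    return_best         = False,  # keep the grid-search model fitted above
    verbose             = False
)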