Quick start skforecast¶
Welcome to a quick start guide to skforecast! In this guide, we provide a code example that demonstrates how to create, validate, and optimize a recursive multi-step forecaster, ForecasterRecursive, using skforecast.
A Forecaster object in the skforecast library is a comprehensive container that provides essential functionality and methods for training a forecasting model and generating predictions for future points in time.
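As a quick preview (a minimal sketch using a synthetic monthly series and a generic scikit-learn regressor, not the data used later in this guide), the basic pattern is always create, fit, predict:
# Create-fit-predict pattern (minimal sketch, synthetic data)
# ==============================================================================
import pandas as pd
from sklearn.linear_model import Ridge
from skforecast.recursive import ForecasterRecursive

# Synthetic monthly series, only for illustration
y = pd.Series(
        data  = range(48),
        index = pd.date_range(start='2020-01-01', periods=48, freq='MS'),
        name  = 'y',
        dtype = float
    )

forecaster = ForecasterRecursive(regressor=Ridge(), lags=12)
forecaster.fit(y=y)
forecaster.predict(steps=5)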
If you need more detailed documentation or guidance, you can visit the User Guides section.
Without further ado, let's jump into the code example!
Libraries and data¶
# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error
from skforecast.datasets import load_demo_dataset
from skforecast.preprocessing import RollingFeatures
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import backtesting_forecaster
from skforecast.model_selection import grid_search_forecaster
plt.style.use('seaborn-v0_8-darkgrid')
# Download data
# ==============================================================================
data = load_demo_dataset()
data.head(5)
datetime
1991-07-01    0.429795
1991-08-01    0.400906
1991-09-01    0.432159
1991-10-01    0.492543
1991-11-01    0.502369
Freq: MS, Name: y, dtype: float64
# Data partition train-test
# ==============================================================================
end_train = '2005-06-01 23:59:00'
print(
f"Train dates : {data.index.min()} --- {data.loc[:end_train].index.max()} "
f"(n={len(data.loc[:end_train])})")
print(
f"Test dates : {data.loc[end_train:].index.min()} --- {data.index.max()} "
f"(n={len(data.loc[end_train:])})")
# Plot
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data.loc[:end_train].plot(ax=ax, label='train')
data.loc[end_train:].plot(ax=ax, label='test')
ax.legend()
plt.show();
Train dates : 1991-07-01 00:00:00 --- 2005-06-01 00:00:00  (n=168)
Test dates  : 2005-07-01 00:00:00 --- 2008-06-01 00:00:00  (n=36)
Train a forecaster¶
Let's start by training a forecaster! For a more in-depth guide to using ForecasterRecursive, visit the User guide.
# Create and fit a recursive multi-step forecaster (ForecasterRecursive)
# ==============================================================================
forecaster = ForecasterRecursive(
regressor = LGBMRegressor(random_state=123, verbose=-1),
lags = 15,
window_features = RollingFeatures(stats=['mean'], window_sizes=10)
)
forecaster.fit(y=data.loc[:end_train])
forecaster
ForecasterRecursive
General Information
- Regressor: LGBMRegressor
- Lags: [ 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15]
- Window features: ['roll_mean_10']
- Window size: 15
- Exogenous included: False
- Weight function included: False
- Differentiation order: None
- Creation date: 2024-11-10 18:11:49
- Last fit date: 2024-11-10 18:11:50
- Skforecast version: 0.14.0
- Python version: 3.11.10
- Forecaster id: None
Exogenous Variables
- None
Data Transformations
- Transformer for y: None
- Transformer for exog: None
Training Information
- Training range: [Timestamp('1991-07-01 00:00:00'), Timestamp('2005-06-01 00:00:00')]
- Training index type: DatetimeIndex
- Training index frequency: MS
Regressor Parameters
- {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': -1, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 100, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
- {}
💡 Tip
To understand what can be done when initializing a forecaster with skforecast, visit Forecaster parameters and Forecaster attributes.
Prediction¶
After training the forecaster, the predict method can be used to make predictions for the next $n$ steps.
# Predict
# ==============================================================================
predictions = forecaster.predict(steps=len(data.loc[end_train:]))
predictions.head(3)
2005-07-01    1.026507
2005-08-01    1.042429
2005-09-01    1.116730
Freq: MS, Name: pred, dtype: float64
# Plot predictions
# ==============================================================================
fig, ax = plt.subplots(figsize=(7, 3))
data.loc[:end_train].plot(ax=ax, label='train')
data.loc[end_train:].plot(ax=ax, label='test')
predictions.plot(ax=ax, label='predictions')
ax.legend();
# Prediction error on test data
# ==============================================================================
error_mse = mean_squared_error(
y_true = data.loc[end_train:],
y_pred = predictions
)
print(f"Test error (mse): {error_mse}")
Test error (mse): 0.006632513357651682
Backtesting: forecaster validation¶
In time series forecasting, backtesting refers to the process of validating a predictive model using historical data. The technique involves stepping through past periods, fold by fold, to assess how well the model would have performed if it had been used to make predictions during that time. Backtesting is a form of cross-validation applied to previous periods of the time series.
Backtesting can be done using a variety of techniques, such as simple train-test splits or more sophisticated methods like rolling windows or expanding windows. The choice of method depends on the specific needs of the analysis and the characteristics of the time series data. For more detailed documentation on backtesting, visit: User guide Backtesting forecaster.
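The example below uses an expanding window: refit = True with fixed_train_size = False, so the training set grows at every refit. As a minimal sketch of the rolling-window alternative mentioned above, keeping the training set at a constant length only requires flipping that last argument (cv_rolling is an illustrative name):
# Rolling-window folds (minimal sketch)
# ==============================================================================
cv_rolling = TimeSeriesFold(
    steps              = 10,
    initial_train_size = len(data.loc[:end_train]),
    refit              = True,
    fixed_train_size   = True  # drop the oldest observations at each refit (rolling window)
)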
# Backtesting
# ==============================================================================
cv = TimeSeriesFold(
steps = 10,
initial_train_size = len(data.loc[:end_train]),
refit = True,
fixed_train_size = False
)
metric, predictions_backtest = backtesting_forecaster(
forecaster = forecaster,
y = data,
cv = cv,
metric = 'mean_squared_error',
n_jobs = 'auto',
verbose = True,
show_progress = True
)
print(f"Backtest error: {metric}")
Information of folds
--------------------
Number of observations used for initial training: 168
Number of observations used for backtesting: 36
    Number of folds: 4
    Number skipped folds: 0
    Number of steps per fold: 10
    Number of steps to exclude between last observed data (last window) and predictions (gap): 0
    Last fold only includes 6 observations.

Fold: 0
    Training:   1991-07-01 00:00:00 -- 2005-06-01 00:00:00  (n=168)
    Validation: 2005-07-01 00:00:00 -- 2006-04-01 00:00:00  (n=10)
Fold: 1
    Training:   1991-07-01 00:00:00 -- 2006-04-01 00:00:00  (n=178)
    Validation: 2006-05-01 00:00:00 -- 2007-02-01 00:00:00  (n=10)
Fold: 2
    Training:   1991-07-01 00:00:00 -- 2007-02-01 00:00:00  (n=188)
    Validation: 2007-03-01 00:00:00 -- 2007-12-01 00:00:00  (n=10)
Fold: 3
    Training:   1991-07-01 00:00:00 -- 2007-12-01 00:00:00  (n=198)
    Validation: 2008-01-01 00:00:00 -- 2008-06-01 00:00:00  (n=6)
Backtest error:    mean_squared_error
0            0.006816
Hyperparameter tuning and lags selection¶
Hyperparameter tuning is a crucial aspect of developing accurate and effective machine learning models. Hyperparameters are values that cannot be learned from the data and must be set by the user before the model is trained. They can significantly impact the performance of the model, and tuning them carefully can improve its accuracy and generalization to new data. In forecasting models, the lags included in the model can be considered an additional hyperparameter.
Hyperparameter tuning involves systematically testing different values or combinations of hyperparameters (including lags) to find the configuration that produces the best results. The skforecast library offers several hyperparameter tuning strategies, including grid search, random search, and Bayesian search. For more detailed documentation on hyperparameter tuning, visit: Hyperparameter tuning and lags selection.
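The example below uses grid search. As a hedged sketch of the random search alternative mentioned above (assuming random_search_forecaster mirrors the grid_search_forecaster interface, with param_distributions and n_iter arguments; check the API reference before relying on it), only a random sample of candidate combinations is evaluated:
# Random search (sketch; interface assumed to mirror grid_search_forecaster)
# ==============================================================================
from skforecast.model_selection import random_search_forecaster

results_random = random_search_forecaster(
    forecaster          = forecaster,
    y                   = data,
    param_distributions = {'n_estimators': [50, 100, 200], 'max_depth': [5, 10, 15, 20]},
    lags_grid           = [3, 10, [1, 2, 3, 20]],
    cv                  = TimeSeriesFold(steps=10, initial_train_size=len(data.loc[:end_train]), refit=False),
    metric              = 'mean_squared_error',
    n_iter              = 5,      # number of randomly sampled parameter combinations
    return_best         = False,  # keep the forecaster unchanged; just inspect results
    show_progress       = True
)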
# Grid search hyperparameter and lags
# ==============================================================================
# Regressor hyperparameters
param_grid = {
'n_estimators': [50, 100],
'max_depth': [5, 10, 15]
}
# Lags used as predictors
lags_grid = [3, 10, [1, 2, 3, 20]]
# Folds
cv = TimeSeriesFold(
steps = 10,
initial_train_size = len(data.loc[:end_train]),
refit = False,
)
results_grid = grid_search_forecaster(
forecaster = forecaster,
y = data,
param_grid = param_grid,
lags_grid = lags_grid,
cv = cv,
metric = 'mean_squared_error',
return_best = True,
n_jobs = 'auto',
verbose = False,
show_progress = True
)
`Forecaster` refitted using the best-found lags and parameters, and the whole data set:
  Lags: [ 1  2  3  4  5  6  7  8  9 10]
  Parameters: {'max_depth': 10, 'n_estimators': 50}
  Backtesting metric: 0.017861588026122758
# Grid results
# ==============================================================================
results_grid
|    | lags                            | lags_label                      | params                                 | mean_squared_error | max_depth | n_estimators |
|----|---------------------------------|---------------------------------|----------------------------------------|--------------------|-----------|--------------|
| 0  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 10, 'n_estimators': 50}  | 0.017862           | 10        | 50           |
| 1  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 15, 'n_estimators': 50}  | 0.017862           | 15        | 50           |
| 2  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 5, 'n_estimators': 100}  | 0.018772           | 5         | 100          |
| 3  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 15, 'n_estimators': 100} | 0.018898           | 15        | 100          |
| 4  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 10, 'n_estimators': 100} | 0.018898           | 10        | 100          |
| 5  | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | [1, 2, 3, 4, 5, 6, 7, 8, 9, 10] | {'max_depth': 5, 'n_estimators': 50}   | 0.019198           | 5         | 50           |
| 6  | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 10, 'n_estimators': 50}  | 0.035032           | 10        | 50           |
| 7  | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 15, 'n_estimators': 50}  | 0.035032           | 15        | 50           |
| 8  | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 5, 'n_estimators': 50}   | 0.035168           | 5         | 50           |
| 9  | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 15, 'n_estimators': 100} | 0.040312           | 15        | 100          |
| 10 | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 10, 'n_estimators': 100} | 0.040312           | 10        | 100          |
| 11 | [1, 2, 3]                       | [1, 2, 3]                       | {'max_depth': 5, 'n_estimators': 100}  | 0.040562           | 5         | 100          |
| 12 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 5, 'n_estimators': 100}  | 0.042146           | 5         | 100          |
| 13 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 10, 'n_estimators': 100} | 0.042147           | 10        | 100          |
| 14 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 15, 'n_estimators': 100} | 0.042147           | 15        | 100          |
| 15 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 5, 'n_estimators': 50}   | 0.043385           | 5         | 50           |
| 16 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 10, 'n_estimators': 50}  | 0.043385           | 10        | 50           |
| 17 | [1, 2, 3, 20]                   | [1, 2, 3, 20]                   | {'max_depth': 15, 'n_estimators': 50}  | 0.043385           | 15        | 50           |
Since return_best = True, the forecaster object is updated with the best configuration found and trained on the whole data set. This means that the final model obtained from the grid search uses the combination of lags and hyperparameters that achieved the best backtesting metric. This final model can then be used to make predictions on new data.
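For instance (a minimal sketch reusing the predict method shown earlier), the refitted forecaster can generate forecasts beyond the available history directly:
# Predict with the refitted forecaster (minimal sketch)
# ==============================================================================
final_predictions = forecaster.predict(steps=12)  # 12 months beyond the last training date
final_predictions.head(3)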
# Print forecaster information
# ==============================================================================
forecaster
ForecasterRecursive
General Information
- Regressor: LGBMRegressor
- Lags: [ 1 2 3 4 5 6 7 8 9 10]
- Window features: ['roll_mean_10']
- Window size: 10
- Exogenous included: False
- Weight function included: False
- Differentiation order: None
- Creation date: 2024-11-10 18:11:49
- Last fit date: 2024-11-10 18:11:51
- Skforecast version: 0.14.0
- Python version: 3.11.10
- Forecaster id: None
Exogenous Variables
- None
Data Transformations
- Transformer for y: None
- Transformer for exog: None
Training Information
- Training range: [Timestamp('1991-07-01 00:00:00'), Timestamp('2008-06-01 00:00:00')]
- Training index type: DatetimeIndex
- Training index frequency: MS
Regressor Parameters
- {'boosting_type': 'gbdt', 'class_weight': None, 'colsample_bytree': 1.0, 'importance_type': 'split', 'learning_rate': 0.1, 'max_depth': 10, 'min_child_samples': 20, 'min_child_weight': 0.001, 'min_split_gain': 0.0, 'n_estimators': 50, 'n_jobs': None, 'num_leaves': 31, 'objective': None, 'random_state': 123, 'reg_alpha': 0.0, 'reg_lambda': 0.0, 'subsample': 1.0, 'subsample_for_bin': 200000, 'subsample_freq': 0, 'verbose': -1}
Fit Kwargs
- {}