Parallelization in skforecast¶
Parallelization rules¶
The n_jobs argument allows specific functionalities of the skforecast library to run in parallel in order to improve speed. Parallelization is applied at two levels: when fitting a forecaster and during backtesting, which also covers hyperparameter search. When n_jobs is left at its default value 'auto', the library selects the number of jobs according to the following rules:
Forecaster Fitting

- If the forecaster is a `ForecasterAutoregDirect` or a `ForecasterAutoregMultiVariate` and the underlying regressor is a linear regressor, `n_jobs` is set to 1.
- Otherwise, `n_jobs` is set to `cpu_count()`, the number of available CPU cores.

Backtesting

- If `refit` is an integer, `n_jobs` is set to 1, because parallelization does not work with intermittent refit.
- If the forecaster is a `ForecasterAutoreg` or a `ForecasterAutoregCustom` and the underlying regressor is linear, `n_jobs` is set to 1.
- If the forecaster is a `ForecasterAutoreg` or a `ForecasterAutoregCustom`, the underlying regressor is not linear and `refit=True`, `n_jobs` is set to `cpu_count()`.
- If the forecaster is a `ForecasterAutoreg` or a `ForecasterAutoregCustom`, the underlying regressor is not linear and `refit=False`, `n_jobs` is set to 1.
- If the forecaster is a `ForecasterAutoregDirect` or a `ForecasterAutoregMultiVariate` and `refit=True`, `n_jobs` is set to `cpu_count()`.
- If the forecaster is a `ForecasterAutoregDirect` or a `ForecasterAutoregMultiVariate` and `refit=False`, `n_jobs` is set to 1.
- If the forecaster is a `ForecasterAutoregMultiSeries`, `n_jobs` is set to `cpu_count()`.

A sketch showing how to override this automatic selection follows the warning below.
Warning
The automatic selection of the parallelization level relies on heuristics, so it is not guaranteed to be optimal. Keep in mind, too, that many regressors already parallelize their fitting procedure internally, so adding another layer of parallelization may not improve overall performance. For the implementation details of the selection rules, see select_n_jobs_backtesting and select_n_jobs_fit_forecaster.
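If the heuristic does not suit a particular setup, the automatic selection can be overridden by passing an explicit value of n_jobs: any positive integer, or -1 to use all cores. Below is a minimal sketch, mirroring the backtesting calls used in the benchmarks later on this page; the series y_demo is purely illustrative.
# Overriding the automatic n_jobs selection (illustrative sketch)
# ==============================================================================
import numpy as np
import pandas as pd
from sklearn.linear_model import Ridge
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.model_selection import backtesting_forecaster

rng = np.random.default_rng(seed=123)
y_demo = pd.Series(rng.random(size=1_000), name="y")

forecaster = ForecasterAutoreg(regressor=Ridge(random_state=77), lags=24)

# n_jobs='auto' (the default) applies the rules above; an explicit value
# bypasses them entirely.
metric, predictions = backtesting_forecaster(
    forecaster         = forecaster,
    y                  = y_demo,
    initial_train_size = 500,
    steps              = 50,
    metric             = 'mean_squared_error',
    refit              = False,
    show_progress      = False,
    n_jobs             = 1  # force sequential execution
)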
Libraries¶
# Libraries
# ==============================================================================
import platform
import psutil
import skforecast
import pandas as pd
import numpy as np
import scipy
import sklearn
import time
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMRegressor
from skforecast.model_selection import backtesting_forecaster
from skforecast.model_selection import grid_search_forecaster
from skforecast.model_selection_multiseries import grid_search_forecaster_multiseries
from skforecast.model_selection_multiseries import backtesting_forecaster_multiseries
from skforecast.model_selection_multiseries import grid_search_forecaster_multivariate
from skforecast.model_selection_multiseries import backtesting_forecaster_multivariate
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregDirect import ForecasterAutoregDirect
from skforecast.ForecasterAutoregMultiSeries import ForecasterAutoregMultiSeries
from skforecast.ForecasterAutoregMultiVariate import ForecasterAutoregMultiVariate
# Versions
# ==============================================================================
print(f"Python version : {platform.python_version()}")
print(f"scikit-learn version: {sklearn.__version__}")
print(f"skforecast version : {skforecast.__version__}")
print(f"pandas version : {pd.__version__}")
print(f"numpy version : {np.__version__}")
print(f"scipy version : {scipy.__version__}")
print(f"psutil version : {psutil.__version__}")
print("")
# System information
# ==============================================================================
# Computer network name
print(f"Computer network name: {platform.node()}")
# Machine type
print(f"Machine type: {platform.machine()}")
# Processor type
print(f"Processor type: {platform.processor()}")
# Platform type
print(f"Platform type: {platform.platform()}")
# Operating system
print(f"Operating system: {platform.system()}")
# Operating system release
print(f"Operating system release: {platform.release()}")
# Operating system version
print(f"Operating system version: {platform.version()}")
# Physical cores
print(f"Number of physical cores: {psutil.cpu_count(logical=False)}")
# Logical cores
print(f"Number of logical cores: {psutil.cpu_count(logical=True)}")
Python version      : 3.11.4
scikit-learn version: 1.3.0
skforecast version  : 0.9.1
pandas version      : 2.0.3
numpy version       : 1.25.2
scipy version       : 1.11.1
psutil version      : 5.9.5

Computer network name: EU-HYYV0J3
Machine type: AMD64
Processor type: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
Platform type: Windows-10-10.0.19045-SP0
Operating system: Windows
Operating system release: 10
Operating system version: 10.0.19045
Number of physical cores: 4
Number of logical cores: 8
Data¶
# Data
# ==============================================================================
n = 5_000
rgn = np.random.default_rng(seed=123)
y = pd.Series(rgn.random(size=(n)), name="y")
exog = pd.DataFrame(rgn.random(size=(n, 10)))
exog.columns = [f"exog_{i}" for i in range(exog.shape[1])]
multi_series = pd.DataFrame(rgn.random(size=(n, 10)))
multi_series.columns = [f"series_{i+1}" for i in range(multi_series.shape[1])]
y_train = y[:-int(n/2)]
display(y.head())
display(exog.head())
display(multi_series.head())
0    0.682352
1    0.053821
2    0.220360
3    0.184372
4    0.175906
Name: y, dtype: float64
|   | exog_0 | exog_1 | exog_2 | exog_3 | exog_4 | exog_5 | exog_6 | exog_7 | exog_8 | exog_9 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.593121 | 0.353471 | 0.336277 | 0.399734 | 0.915459 | 0.822278 | 0.480418 | 0.929802 | 0.950948 | 0.863556 |
| 1 | 0.764104 | 0.638191 | 0.956624 | 0.178105 | 0.434077 | 0.137480 | 0.837667 | 0.768947 | 0.244235 | 0.815336 |
| 2 | 0.475312 | 0.312415 | 0.353596 | 0.272162 | 0.772064 | 0.110216 | 0.596551 | 0.688549 | 0.651380 | 0.191837 |
| 3 | 0.039253 | 0.962713 | 0.189194 | 0.910629 | 0.169796 | 0.697751 | 0.830913 | 0.484824 | 0.634634 | 0.862865 |
| 4 | 0.872447 | 0.861421 | 0.394829 | 0.877763 | 0.286779 | 0.131008 | 0.450185 | 0.898167 | 0.590147 | 0.045838 |
|   | series_1 | series_2 | series_3 | series_4 | series_5 | series_6 | series_7 | series_8 | series_9 | series_10 |
|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.967448 | 0.580646 | 0.643348 | 0.461737 | 0.450859 | 0.894496 | 0.037967 | 0.097698 | 0.094356 | 0.893528 |
| 1 | 0.207450 | 0.194904 | 0.377063 | 0.975065 | 0.351034 | 0.812253 | 0.265956 | 0.262733 | 0.784995 | 0.674256 |
| 2 | 0.520431 | 0.985069 | 0.039559 | 0.541797 | 0.612761 | 0.640336 | 0.823467 | 0.768387 | 0.561777 | 0.600835 |
| 3 | 0.866694 | 0.165510 | 0.819767 | 0.691179 | 0.717778 | 0.392694 | 0.094067 | 0.271990 | 0.467866 | 0.041054 |
| 4 | 0.406310 | 0.657688 | 0.630730 | 0.694424 | 0.943934 | 0.888538 | 0.470363 | 0.518283 | 0.719674 | 0.010789 |
Benchmark ForecasterAutoreg¶
warnings.filterwarnings("ignore")
print("-----------------")
print("ForecasterAutoreg")
print("-----------------")
steps = 100
lags = 50
regressors = [
Ridge(random_state=77, alpha=0.1),
LGBMRegressor(random_state=77, n_jobs=1, n_estimators=50, max_depth=5),
LGBMRegressor(random_state=77, n_jobs=-1, n_estimators=50, max_depth=5),
]
param_grids = [
{'alpha': [0.1, 0.1, 0.1]},
{'n_estimators': [50, 50], 'max_depth': [5, 5]},
{'n_estimators': [50, 50], 'max_depth': [5, 5]},
]
lags_grid = [50, 50, 50]
elapsed_times = []
for regressor, param_grid in zip(regressors, param_grids):
print("")
print(regressor, param_grid)
print("")
forecaster = ForecasterAutoreg(
regressor=regressor,
lags=lags,
transformer_exog=StandardScaler()
)
print("Profiling fit")
start = time.time()
forecaster.fit(y=y, exog=exog)
end = time.time()
elapsed_times.append(end - start)
print("Profiling create_train_X_y")
start = time.time()
_ = forecaster.create_train_X_y(y=y, exog=exog)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting refit parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = len(y_train),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = True,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting refit no parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = len(y_train),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = True,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting no refit parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = len(y_train),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = False,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting no refit no parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = len(y_train),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = False,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling GridSearch no refit parallel")
start = time.time()
results_grid = grid_search_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = len(y_train),
steps = steps,
param_grid = param_grid,
lags_grid = lags_grid,
refit = False,
metric = 'mean_squared_error',
fixed_train_size = False,
return_best = False,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling GridSearch no refit no parallel")
start = time.time()
results_grid = grid_search_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = len(y_train),
steps = steps,
param_grid = param_grid,
lags_grid = lags_grid,
refit = False,
metric = 'mean_squared_error',
fixed_train_size = False,
return_best = False,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
methods = [
"fit",
"create_train_X_y",
"backtest_refit_parallel",
"backtest_refit_noparallel",
"backtest_no_refit_parallel",
"backtest_no_refit_noparallel",
"gridSearch_no_refit_parallel",
"gridSearch_no_refit_noparallel"
]
results = pd.DataFrame({
"regressor": np.repeat(np.array([str(regressor) for regressor in regressors]), len(methods)),
"method": np.tile(methods, len(regressors)),
"elapsed_time": elapsed_times
})
results['parallel'] = results.method.str.contains("_parallel")
results['method'] = results.method.str.replace("_parallel", "")
results['method'] = results.method.str.replace("_noparallel", "")
results = results.sort_values(by=["regressor", "method", "parallel"])
results_pivot = results.pivot_table(
index=["regressor", "method"],
columns="parallel",
values="elapsed_time"
).reset_index()
results_pivot.columns.name = None
results_pivot["pct_improvement"] = (results_pivot[False] - results_pivot[True]) / results_pivot[False] * 100
display(results_pivot)
fig, ax = plt.subplots(figsize=(10, 5))
sns.barplot(data=results_pivot.dropna(), x="method", y="pct_improvement", hue="regressor", ax=ax)
ax.set_title(f"Parallel vs Sequential (ForecasterAutoreg, n={n})")
ax.set_ylabel("Percent difference")
ax.set_xlabel("Method")
-----------------
ForecasterAutoreg
-----------------
Ridge(alpha=0.1, random_state=77) {'alpha': [0.1, 0.1, 0.1]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 9.
Profiling GridSearch no refit no parallel
Number of models compared: 9.
LGBMRegressor(max_depth=5, n_estimators=50, n_jobs=1, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 12.
Profiling GridSearch no refit no parallel
Number of models compared: 12.
LGBMRegressor(max_depth=5, n_estimators=50, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 12.
Profiling GridSearch no refit no parallel
Number of models compared: 12.
|   | regressor | method | False | True | pct_improvement |
|---|---|---|---|---|---|
| 0 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_no_refit | 0.571370 | 0.416455 | 27.112945 |
| 1 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_refit | 6.569071 | 2.720030 | 58.593380 |
| 2 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | create_train_X_y | 0.007874 | NaN | NaN |
| 3 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | fit | 0.277259 | NaN | NaN |
| 4 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | gridSearch_no_refit | 6.699500 | 6.123097 | 8.603664 |
| 5 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_no_refit | 0.751951 | 0.546975 | 27.259220 |
| 6 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_refit | 7.001708 | 2.424567 | 65.371771 |
| 7 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | create_train_X_y | 0.008184 | NaN | NaN |
| 8 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | fit | 0.275215 | NaN | NaN |
| 9 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | gridSearch_no_refit | 6.601839 | 7.825191 | -18.530468 |
| 10 | Ridge(alpha=0.1, random_state=77) | backtest_no_refit | 0.744464 | 0.378917 | 49.102068 |
| 11 | Ridge(alpha=0.1, random_state=77) | backtest_refit | 1.571386 | 12.706505 | -708.617632 |
| 12 | Ridge(alpha=0.1, random_state=77) | create_train_X_y | 0.011935 | NaN | NaN |
| 13 | Ridge(alpha=0.1, random_state=77) | fit | 0.024534 | NaN | NaN |
| 14 | Ridge(alpha=0.1, random_state=77) | gridSearch_no_refit | 4.548398 | 2.756721 | 39.391399 |
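In these result tables, the False column is the sequential elapsed time (parallel=False), the True column the parallel elapsed time, and pct_improvement the relative saving; negative values mean the parallel run was slower. As a quick sanity check of the formula, take the Ridge backtest_refit row above:
# pct_improvement = (sequential - parallel) / sequential * 100
# ==============================================================================
sequential = 1.571386    # elapsed time with n_jobs=1 (parallel=False)
parallel   = 12.706505   # elapsed time with n_jobs=-1 (parallel=True)
print((sequential - parallel) / sequential * 100)  # -708.62, ~8x slower in parallel
This is the expected behaviour: following the rules above, backtesting a ForecasterAutoreg with a linear regressor such as Ridge defaults to n_jobs=1, so forcing n_jobs=-1 only adds process overhead.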
Benchmark ForecasterAutoregDirect¶
print("-----------------------")
print("ForecasterAutoregDirect")
print("-----------------------")
steps = 25
lags = 50
regressors = [
Ridge(random_state=77, alpha=0.1),
LGBMRegressor(random_state=77, n_jobs=1, n_estimators=50, max_depth=5),
LGBMRegressor(random_state=77, n_jobs=-1, n_estimators=50, max_depth=5),
]
param_grids = [
{'alpha': [0.1, 0.1, 0.1]},
{'n_estimators': [50, 50], 'max_depth': [5, 5]},
{'n_estimators': [50, 50], 'max_depth': [5, 5]},
]
lags_grid = [50, 50, 50]
elapsed_times = []
for regressor, param_grid in zip(regressors, param_grids):
print("")
print(regressor, param_grid)
print("")
forecaster = ForecasterAutoregDirect(
regressor=regressor,
lags=lags,
steps=steps,
transformer_exog=StandardScaler()
)
print("Profiling fit")
start = time.time()
forecaster.fit(y=y, exog=exog)
end = time.time()
elapsed_times.append(end - start)
print("Profiling create_train_X_y")
start = time.time()
_ = forecaster.create_train_X_y(y=y, exog=exog)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting refit parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = int(len(y)*0.9),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = True,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting refit no parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = int(len(y)*0.9),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = True,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting no refit parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = int(len(y)*0.9),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = False,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting no refit no parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = int(len(y)*0.9),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = False,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling GridSearch no refit parallel")
start = time.time()
results_grid = grid_search_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = int(len(y)*0.9),
steps = steps,
param_grid = param_grid,
lags_grid = lags_grid,
refit = False,
metric = 'mean_squared_error',
fixed_train_size = False,
return_best = False,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling GridSearch no refit no parallel")
start = time.time()
results_grid = grid_search_forecaster(
forecaster = forecaster,
y = y,
exog = exog,
initial_train_size = int(len(y)*0.9),
steps = steps,
param_grid = param_grid,
lags_grid = lags_grid,
refit = False,
metric = 'mean_squared_error',
fixed_train_size = False,
return_best = False,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
methods = [
"fit",
"create_train_X_y",
"backtest_refit_parallel",
"backtest_refit_noparallel",
"backtest_no_refit_parallel",
"backtest_no_refit_noparallel",
"gridSearch_no_refit_parallel",
"gridSearch_no_refit_noparallel"
]
results = pd.DataFrame({
"regressor": np.repeat(np.array([str(regressor) for regressor in regressors]), len(methods)),
"method": np.tile(methods, len(regressors)),
"elapsed_time": elapsed_times
})
results['parallel'] = results.method.str.contains("_parallel")
results['method'] = results.method.str.replace("_parallel", "")
results['method'] = results.method.str.replace("_noparallel", "")
results = results.sort_values(by=["regressor", "method", "parallel"])
results_pivot = results.pivot_table(
index=["regressor", "method"],
columns="parallel",
values="elapsed_time"
).reset_index()
results_pivot.columns.name = None
results_pivot["pct_improvement"] = (results_pivot[False] - results_pivot[True]) / results_pivot[False] * 100
display(results_pivot)
fig, ax = plt.subplots(figsize=(10, 5))
sns.barplot(data=results_pivot.dropna(), x="method", y="pct_improvement", hue="regressor", ax=ax)
ax.set_title(f"Parallel vs Sequential (ForecasterAutoregDirect, n={n})")
ax.set_ylabel("Percent difference")
ax.set_xlabel("Method");
-----------------------
ForecasterAutoregDirect
-----------------------
Ridge(alpha=0.1, random_state=77) {'alpha': [0.1, 0.1, 0.1]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 9.
Profiling GridSearch no refit no parallel
Number of models compared: 9.
LGBMRegressor(max_depth=5, n_estimators=50, n_jobs=1, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 12.
Profiling GridSearch no refit no parallel
Number of models compared: 12.
LGBMRegressor(max_depth=5, n_estimators=50, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 12.
Profiling GridSearch no refit no parallel
Number of models compared: 12.
|   | regressor | method | False | True | pct_improvement |
|---|---|---|---|---|---|
| 0 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_no_refit | 5.992304 | 10.283535 | -71.612377 |
| 1 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_refit | 141.952965 | 75.919333 | 46.517966 |
| 2 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | create_train_X_y | 0.072996 | NaN | NaN |
| 3 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | fit | 4.144056 | NaN | NaN |
| 4 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | gridSearch_no_refit | 43.838513 | 76.157306 | -73.722375 |
| 5 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_no_refit | 5.886309 | 10.738465 | -82.431196 |
| 6 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_refit | 143.764385 | 64.597212 | 55.067305 |
| 7 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | create_train_X_y | 0.060571 | NaN | NaN |
| 8 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | fit | 4.122978 | NaN | NaN |
| 9 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | gridSearch_no_refit | 42.607671 | 75.015008 | -76.059866 |
| 10 | Ridge(alpha=0.1, random_state=77) | backtest_no_refit | 0.323168 | 0.319153 | 1.242526 |
| 11 | Ridge(alpha=0.1, random_state=77) | backtest_refit | 5.785780 | 2.094785 | 63.794244 |
| 12 | Ridge(alpha=0.1, random_state=77) | create_train_X_y | 0.033940 | NaN | NaN |
| 13 | Ridge(alpha=0.1, random_state=77) | fit | 0.430777 | NaN | NaN |
| 14 | Ridge(alpha=0.1, random_state=77) | gridSearch_no_refit | 4.431143 | 3.941903 | 11.040943 |
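A ForecasterAutoregDirect trains one regressor per step of the horizon, which makes every refit far more expensive than for a ForecasterAutoreg. This helps explain the pattern above: the refit backtests benefit clearly from parallelization, while the lighter no-refit loops (and the grid searches with LGBM models, plausibly because moving many fitted models between processes is costly) can end up slower. Below is a minimal sketch of the one-model-per-step design; the assumption here is that the fitted models are exposed through the regressors_ attribute, as in skforecast 0.9.
# One model per step in ForecasterAutoregDirect (illustrative sketch)
# ==============================================================================
from sklearn.linear_model import Ridge
from skforecast.ForecasterAutoregDirect import ForecasterAutoregDirect

forecaster_demo = ForecasterAutoregDirect(regressor=Ridge(), lags=10, steps=25)
forecaster_demo.fit(y=y[:500])  # `y` comes from the Data section

# One fitted copy of the regressor is stored per step (assumption: the
# `regressors_` dict, keyed by step, as in skforecast 0.9).
print(len(forecaster_demo.regressors_))  # 25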
Benchmark ForecasterAutoregMultiSeries¶
print("----------------------------")
print("ForecasterAutoregMultiseries")
print("----------------------------")
steps = 100
lags = 50
regressors = [
Ridge(random_state=77, alpha=0.1),
LGBMRegressor(random_state=77, n_jobs=1, n_estimators=50, max_depth=5),
LGBMRegressor(random_state=77, n_jobs=-1, n_estimators=50, max_depth=5),
]
param_grids = [
{'alpha': [0.1, 0.1, 0.1]},
{'n_estimators': [50, 50], 'max_depth': [5, 5]},
{'n_estimators': [50, 50], 'max_depth': [5, 5]},
]
lags_grid = [50, 50, 50]
elapsed_times = []
for regressor, param_grid in zip(regressors, param_grids):
print("")
print(regressor, param_grid)
print("")
forecaster = ForecasterAutoregMultiSeries(
regressor=regressor,
lags=lags,
transformer_exog=StandardScaler()
)
print("Profiling fit")
start = time.time()
forecaster.fit(series=multi_series, exog=exog)
end = time.time()
elapsed_times.append(end - start)
print("Profiling create_train_X_y")
start = time.time()
_ = forecaster.create_train_X_y(series=multi_series, exog=exog)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting refit parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = len(y_train),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = True,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting refit and parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = len(y_train),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = True,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting no refit parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = len(y_train),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = False,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting no refit no parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = len(y_train),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = False,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling GridSearch no refit parallel")
start = time.time()
results_grid = grid_search_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = len(y_train),
steps = steps,
param_grid = param_grid,
lags_grid = lags_grid,
refit = False,
metric = 'mean_squared_error',
fixed_train_size = False,
return_best = False,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling GridSearch no refit no parallel")
start = time.time()
results_grid = grid_search_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = len(y_train),
steps = steps,
param_grid = param_grid,
lags_grid = lags_grid,
refit = False,
metric = 'mean_squared_error',
fixed_train_size = False,
return_best = False,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
methods = [
"fit",
"create_train_X_y",
"backtest_refit_parallel",
"backtest_refit_noparallel",
"backtest_no_refit_parallel",
"backtest_no_refit_noparallel",
"gridSearch_no_refit_parallel",
"gridSearch_no_refit_noparallel"
]
results = pd.DataFrame({
"regressor": np.repeat(np.array([str(regressor) for regressor in regressors]), len(methods)),
"method": np.tile(methods, len(regressors)),
"elapsed_time": elapsed_times
})
results['parallel'] = results.method.str.contains("_parallel")
results['method'] = results.method.str.replace("_parallel", "")
results['method'] = results.method.str.replace("_noparallel", "")
results = results.sort_values(by=["regressor", "method", "parallel"])
results_pivot = results.pivot_table(
index=["regressor", "method"],
columns="parallel",
values="elapsed_time"
).reset_index()
results_pivot.columns.name = None
results_pivot["pct_improvement"] = (results_pivot[False] - results_pivot[True]) / results_pivot[False] * 100
display(results_pivot)
fig, ax = plt.subplots(figsize=(10, 5))
sns.barplot(data=results_pivot.dropna(), x="method", y="pct_improvement", hue="regressor", ax=ax)
ax.set_title(f"Parallel vs Sequential (ForecasterAutoregMultiseries, n={n})")
ax.set_ylabel("Percent difference")
ax.set_xlabel("Method");
----------------------------
ForecasterAutoregMultiSeries
----------------------------
Ridge(alpha=0.1, random_state=77) {'alpha': [0.1, 0.1, 0.1]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
9 models compared for 10 level(s). Number of iterations: 9.
Profiling GridSearch no refit no parallel
9 models compared for 10 level(s). Number of iterations: 9.
LGBMRegressor(max_depth=5, n_estimators=50, n_jobs=1, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
12 models compared for 10 level(s). Number of iterations: 12.
Profiling GridSearch no refit no parallel
12 models compared for 10 level(s). Number of iterations: 12.
LGBMRegressor(max_depth=5, n_estimators=50, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
12 models compared for 10 level(s). Number of iterations: 12.
Profiling GridSearch no refit no parallel
12 models compared for 10 level(s). Number of iterations: 12.
|   | regressor | method | False | True | pct_improvement |
|---|---|---|---|---|---|
| 0 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_no_refit | 3.856087 | 1.785516 | 53.696183 |
| 1 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_refit | 26.799302 | 14.820522 | 44.698105 |
| 2 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | create_train_X_y | 0.082853 | NaN | NaN |
| 3 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | fit | 1.439756 | NaN | NaN |
| 4 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | gridSearch_no_refit | 42.083872 | 29.374772 | 30.199456 |
| 5 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_no_refit | 5.492170 | 2.099951 | 61.764638 |
| 6 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_refit | 30.017390 | 14.576546 | 51.439661 |
| 7 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | create_train_X_y | 0.079670 | NaN | NaN |
| 8 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | fit | 1.465376 | NaN | NaN |
| 9 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | gridSearch_no_refit | 72.783301 | 36.314506 | 50.105992 |
| 10 | Ridge(alpha=0.1, random_state=77) | backtest_no_refit | 3.122103 | 1.041682 | 66.635227 |
| 11 | Ridge(alpha=0.1, random_state=77) | backtest_refit | 6.525677 | 2.931308 | 55.080398 |
| 12 | Ridge(alpha=0.1, random_state=77) | create_train_X_y | 0.119956 | NaN | NaN |
| 13 | Ridge(alpha=0.1, random_state=77) | fit | 0.277841 | NaN | NaN |
| 14 | Ridge(alpha=0.1, random_state=77) | gridSearch_no_refit | 26.023329 | 13.008654 | 50.011569 |
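As the rules above state, backtesting a ForecasterAutoregMultiSeries always runs with n_jobs=cpu_count(), and the table shows consistent gains for all three regressors. An explicit value can still be passed to cap the number of workers, for instance when the regressor spawns its own threads. A sketch reusing the objects from the Data section:
# Capping the number of workers in multi-series backtesting (sketch)
# ==============================================================================
forecaster = ForecasterAutoregMultiSeries(regressor=Ridge(random_state=77), lags=50)

metric, predictions = backtesting_forecaster_multiseries(
    forecaster         = forecaster,
    series             = multi_series,
    initial_train_size = len(y_train),
    steps              = 100,
    metric             = 'mean_squared_error',
    refit              = False,
    show_progress      = False,
    n_jobs             = 2  # explicit cap instead of cpu_count()
)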
Benchmark ForecasterAutoregMultiVariate¶
print("-----------------------------")
print("ForecasterAutoregMultivariate")
print("-----------------------------")
steps = 25
lags = 50
regressors = [
Ridge(random_state=77, alpha=0.1),
LGBMRegressor(random_state=77, n_jobs=1, n_estimators=50, max_depth=5),
LGBMRegressor(random_state=77, n_jobs=-1, n_estimators=50, max_depth=5),
]
param_grids = [
{'alpha': [0.1, 0.1, 0.1]},
{'n_estimators': [50, 50], 'max_depth': [5, 5]},
{'n_estimators': [50, 50], 'max_depth': [5, 5]},
]
lags_grid = [50, 50, 50]
elapsed_times = []
for regressor, param_grid in zip(regressors, param_grids):
print("")
print(regressor, param_grid)
print("")
forecaster = ForecasterAutoregMultiVariate(
regressor=regressor,
lags=lags,
steps=steps,
level="series_1",
transformer_exog=StandardScaler()
)
print("Profiling fit")
start = time.time()
forecaster.fit(series=multi_series, exog=exog)
end = time.time()
elapsed_times.append(end - start)
print("Profiling create_train_X_y")
start = time.time()
_ = forecaster.create_train_X_y(series=multi_series, exog=exog)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting refit parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = int(len(y)*0.9),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = True,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting refit no parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = int(len(y)*0.9),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = True,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting no refit parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = int(len(y)*0.9),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = False,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling backtesting no refit no parallel")
start = time.time()
metric, backtest_predictions = backtesting_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = int(len(y)*0.9),
fixed_train_size = False,
steps = steps,
metric = 'mean_squared_error',
refit = False,
interval = None,
n_boot = 500,
random_state = 123,
in_sample_residuals = True,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling GridSearch no refit parallel")
start = time.time()
results_grid = grid_search_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = int(len(y)*0.9),
steps = steps,
param_grid = param_grid,
lags_grid = lags_grid,
refit = False,
metric = 'mean_squared_error',
fixed_train_size = False,
return_best = False,
verbose = False,
show_progress = False,
n_jobs = -1
)
end = time.time()
elapsed_times.append(end - start)
print("Profiling GridSearch no refit no parallel")
start = time.time()
results_grid = grid_search_forecaster_multiseries(
forecaster = forecaster,
series = multi_series,
exog = exog,
initial_train_size = int(len(y)*0.9),
steps = steps,
param_grid = param_grid,
lags_grid = lags_grid,
refit = False,
metric = 'mean_squared_error',
fixed_train_size = False,
return_best = False,
verbose = False,
show_progress = False,
n_jobs = 1
)
end = time.time()
elapsed_times.append(end - start)
methods = [
"fit",
"create_train_X_y",
"backtest_refit_parallel",
"backtest_refit_noparallel",
"backtest_no_refit_parallel",
"backtest_no_refit_noparallel",
"gridSearch_no_refit_parallel",
"gridSearch_no_refit_noparallel"
]
results = pd.DataFrame({
"regressor": np.repeat(np.array([str(regressor) for regressor in regressors]), len(methods)),
"method": np.tile(methods, len(regressors)),
"elapsed_time": elapsed_times
})
results['parallel'] = results.method.str.contains("_parallel")
results['method'] = results.method.str.replace("_parallel", "")
results['method'] = results.method.str.replace("_noparallel", "")
results = results.sort_values(by=["regressor", "method", "parallel"])
results_pivot = results.pivot_table(index=["regressor", "method"], columns="parallel", values="elapsed_time").reset_index()
results_pivot.columns.name = None
results_pivot["pct_improvement"] = (results_pivot[False] - results_pivot[True]) / results_pivot[False] * 100
display(results_pivot)
fig, ax = plt.subplots(figsize=(10, 5))
sns.barplot(data=results_pivot.dropna(), x="method", y="pct_improvement", hue="regressor", ax=ax)
ax.set_title(f"Parallel vs Sequential (ForecasterAutoregMultivariate, n={n})")
ax.set_ylabel("Percent difference")
ax.set_xlabel("Method");
-----------------------------
ForecasterAutoregMultiVariate
-----------------------------
Ridge(alpha=0.1, random_state=77) {'alpha': [0.1, 0.1, 0.1]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
9 models compared for 1 level(s). Number of iterations: 9.
Profiling GridSearch no refit no parallel
9 models compared for 1 level(s). Number of iterations: 9.
LGBMRegressor(max_depth=5, n_estimators=50, n_jobs=1, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
12 models compared for 1 level(s). Number of iterations: 12.
Profiling GridSearch no refit no parallel
12 models compared for 1 level(s). Number of iterations: 12.
LGBMRegressor(max_depth=5, n_estimators=50, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}
Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
12 models compared for 1 level(s). Number of iterations: 12.
Profiling GridSearch no refit no parallel
12 models compared for 1 level(s). Number of iterations: 12.
|   | regressor | method | False | True | pct_improvement |
|---|---|---|---|---|---|
| 0 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_no_refit | 34.915484 | 41.451426 | -18.719322 |
| 1 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_refit | 773.195707 | 615.900236 | 20.343552 |
| 2 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | create_train_X_y | 0.163573 | NaN | NaN |
| 3 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | fit | 34.095271 | NaN | NaN |
| 4 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | gridSearch_no_refit | 315.079898 | 407.191276 | -29.234292 |
| 5 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_no_refit | 30.237485 | 29.182785 | 3.488057 |
| 6 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_refit | 536.445092 | 462.887896 | 13.711971 |
| 7 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | create_train_X_y | 0.110270 | NaN | NaN |
| 8 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | fit | 24.590466 | NaN | NaN |
| 9 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | gridSearch_no_refit | 313.850108 | 366.675623 | -16.831447 |
| 10 | Ridge(alpha=0.1, random_state=77) | backtest_no_refit | 2.171713 | 2.316724 | -6.677227 |
| 11 | Ridge(alpha=0.1, random_state=77) | backtest_refit | 80.197614 | 44.966058 | 43.930928 |
| 12 | Ridge(alpha=0.1, random_state=77) | create_train_X_y | 0.228073 | NaN | NaN |
| 13 | Ridge(alpha=0.1, random_state=77) | fit | 5.533158 | NaN | NaN |
| 14 | Ridge(alpha=0.1, random_state=77) | gridSearch_no_refit | 19.418475 | 21.956889 | -13.072160 |
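A pattern worth noting across these benchmarks: combining LightGBM's internal threading (n_jobs=-1 in the regressor) with process-level parallelism can backfire. In the ForecasterAutoreg grid search, for example, the single-threaded LGBMRegressor gained 8.6% from parallel folds while the multi-threaded one lost 18.5%. A sketch, consistent with these results, that keeps the regressor single-threaded and lets backtesting parallelize across folds (y, exog and y_train come from the Data section):
# Avoiding nested parallelism: single-threaded regressor, parallel folds (sketch)
# ==============================================================================
forecaster = ForecasterAutoreg(
    regressor = LGBMRegressor(n_jobs=1, n_estimators=50, max_depth=5, random_state=77),
    lags      = 50
)

metric, predictions = backtesting_forecaster(
    forecaster         = forecaster,
    y                  = y,
    exog               = exog,
    initial_train_size = len(y_train),
    steps              = 100,
    metric             = 'mean_squared_error',
    refit              = True,
    show_progress      = False,
    n_jobs             = -1  # parallelize across backtesting folds instead
)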