Parallelization in skforecast¶
Parallelization rules¶
The n_jobs argument parallelizes specific functionalities to speed up the skforecast library. Parallelization is integrated at two levels: forecaster fitting, and backtesting, which also covers hyperparameter search. When n_jobs is set to its default value, 'auto', the library determines the number of jobs dynamically, following these rules:
Forecaster Fitting
If the forecaster is either ForecasterAutoregDirect or ForecasterAutoregMultiVariate, and the underlying regressor is a linear regressor, then n_jobs is set to 1.
Otherwise, n_jobs is set to cpu_count(), the number of available CPU cores.
Backtesting
If refit is an integer, then n_jobs=1. This is because parallelization doesn't work with intermittent refit.
If the forecaster is ForecasterAutoreg or ForecasterAutoregCustom and the underlying regressor is linear, n_jobs is set to 1.
If the forecaster is ForecasterAutoreg or ForecasterAutoregCustom, the underlying regressor is not linear and refit=True, then n_jobs is set to cpu_count().
If the forecaster is ForecasterAutoreg or ForecasterAutoregCustom, the underlying regressor is not linear and refit=False, n_jobs is set to 1.
If the forecaster is ForecasterAutoregDirect or ForecasterAutoregMultiVariate and refit=True, then n_jobs is set to cpu_count().
If the forecaster is ForecasterAutoregDirect or ForecasterAutoregMultiVariate and refit=False, then n_jobs is set to 1.
If the forecaster is ForecasterAutoregMultiSeries, then n_jobs is set to cpu_count().
⚠ Warning
The automatic selection of the parallelization level relies on heuristics and is therefore not guaranteed to be optimal. In addition, it is important to keep in mind that many regressors already parallelize their fitting procedures inherently. As a result, introducing additional parallelization may not necessarily improve overall performance. For a more detailed look at parallelization, visit select_n_jobs_backtesting and select_n_jobs_fit_forecaster.
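The rules above can be written down as two small decision functions. The following is only an illustrative sketch of the rules as this page states them, not the library's actual code: the function names `sketch_n_jobs_fit` and `sketch_n_jobs_backtesting` are hypothetical, the forecaster is identified by its class name as a string, the caller says whether the regressor is linear, and `os.cpu_count()` is used as a stand-in for the `cpu_count()` mentioned in the rules.

```python
import os

DIRECT_LIKE = ("ForecasterAutoregDirect", "ForecasterAutoregMultiVariate")
RECURSIVE_LIKE = ("ForecasterAutoreg", "ForecasterAutoregCustom")


def sketch_n_jobs_fit(forecaster_name: str, regressor_is_linear: bool) -> int:
    """Hypothetical restatement of the 'Forecaster Fitting' rules."""
    cpus = os.cpu_count() or 1  # assumption: stand-in for cpu_count()
    if forecaster_name in DIRECT_LIKE and regressor_is_linear:
        return 1
    return cpus


def sketch_n_jobs_backtesting(forecaster_name: str,
                              regressor_is_linear: bool,
                              refit) -> int:
    """Hypothetical restatement of the 'Backtesting' rules, in listed order."""
    cpus = os.cpu_count() or 1  # assumption: stand-in for cpu_count()
    # bool is a subclass of int, so True/False must be excluded explicitly.
    if isinstance(refit, int) and not isinstance(refit, bool):
        return 1  # intermittent refit: no parallelization
    if forecaster_name in RECURSIVE_LIKE:
        if regressor_is_linear:
            return 1
        return cpus if refit else 1
    if forecaster_name in DIRECT_LIKE:
        return cpus if refit else 1
    if forecaster_name == "ForecasterAutoregMultiSeries":
        return cpus
    return 1  # assumption: fall back to sequential for anything not listed


print(sketch_n_jobs_backtesting("ForecasterAutoreg", True, True))
print(sketch_n_jobs_backtesting("ForecasterAutoregDirect", False, 3))
```

Writing the rules this way makes the ordering explicit: an integer refit is checked first, so it forces sequential execution regardless of the forecaster or regressor.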
Libraries¶
# Libraries
# ==============================================================================
import platform
import psutil
import skforecast
import pandas as pd
import numpy as np
import scipy
import sklearn
import time
import warnings
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge
from sklearn.preprocessing import StandardScaler
from lightgbm import LGBMRegressor
from skforecast.model_selection import backtesting_forecaster
from skforecast.model_selection import grid_search_forecaster
from skforecast.model_selection_multiseries import grid_search_forecaster_multiseries
from skforecast.model_selection_multiseries import backtesting_forecaster_multiseries
from skforecast.model_selection_multiseries import grid_search_forecaster_multivariate
from skforecast.model_selection_multiseries import backtesting_forecaster_multivariate
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregDirect import ForecasterAutoregDirect
from skforecast.ForecasterAutoregMultiSeries import ForecasterAutoregMultiSeries
from skforecast.ForecasterAutoregMultiVariate import ForecasterAutoregMultiVariate
# Versions
# ==============================================================================
print(f"Python version : {platform.python_version()}")
print(f"scikit-learn version: {sklearn.__version__}")
print(f"skforecast version : {skforecast.__version__}")
print(f"pandas version : {pd.__version__}")
print(f"numpy version : {np.__version__}")
print(f"scipy version : {scipy.__version__}")
print(f"psutil version : {psutil.__version__}")
print("")
# System information
# ==============================================================================
#Computer network name
print(f"Computer network name: {platform.node()}")
#Machine type
print(f"Machine type: {platform.machine()}")
#Processor type
print(f"Processor type: {platform.processor()}")
#Platform type
print(f"Platform type: {platform.platform()}")
#Operating system
print(f"Operating system: {platform.system()}")
#Operating system release
print(f"Operating system release: {platform.release()}")
#Operating system version
print(f"Operating system version: {platform.version()}")
#Physical cores
print(f"Number of physical cores: {psutil.cpu_count(logical=False)}")
#Logical cores
print(f"Number of logical cores: {psutil.cpu_count(logical=True)}")
Python version : 3.11.4
scikit-learn version: 1.3.0
skforecast version : 0.9.1
pandas version : 2.0.3
numpy version : 1.25.2
scipy version : 1.11.1
psutil version : 5.9.5

Computer network name: EU-HYYV0J3
Machine type: AMD64
Processor type: Intel64 Family 6 Model 140 Stepping 1, GenuineIntel
Platform type: Windows-10-10.0.19045-SP0
Operating system: Windows
Operating system release: 10
Operating system version: 10.0.19045
Number of physical cores: 4
Number of logical cores: 8
Data¶
# Data
# ==============================================================================
n = 5_000
rgn = np.random.default_rng(seed=123)
y = pd.Series(rgn.random(size=(n)), name="y")
exog = pd.DataFrame(rgn.random(size=(n, 10)))
exog.columns = [f"exog_{i}" for i in range(exog.shape[1])]
multi_series = pd.DataFrame(rgn.random(size=(n, 10)))
multi_series.columns = [f"series_{i+1}" for i in range(multi_series.shape[1])]
y_train = y[:-int(n/2)]
display(y.head())
display(exog.head())
display(multi_series.head())
0    0.682352
1    0.053821
2    0.220360
3    0.184372
4    0.175906
Name: y, dtype: float64
 | exog_0 | exog_1 | exog_2 | exog_3 | exog_4 | exog_5 | exog_6 | exog_7 | exog_8 | exog_9 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.593121 | 0.353471 | 0.336277 | 0.399734 | 0.915459 | 0.822278 | 0.480418 | 0.929802 | 0.950948 | 0.863556 |
1 | 0.764104 | 0.638191 | 0.956624 | 0.178105 | 0.434077 | 0.137480 | 0.837667 | 0.768947 | 0.244235 | 0.815336 |
2 | 0.475312 | 0.312415 | 0.353596 | 0.272162 | 0.772064 | 0.110216 | 0.596551 | 0.688549 | 0.651380 | 0.191837 |
3 | 0.039253 | 0.962713 | 0.189194 | 0.910629 | 0.169796 | 0.697751 | 0.830913 | 0.484824 | 0.634634 | 0.862865 |
4 | 0.872447 | 0.861421 | 0.394829 | 0.877763 | 0.286779 | 0.131008 | 0.450185 | 0.898167 | 0.590147 | 0.045838 |
 | series_1 | series_2 | series_3 | series_4 | series_5 | series_6 | series_7 | series_8 | series_9 | series_10 |
---|---|---|---|---|---|---|---|---|---|---|
0 | 0.967448 | 0.580646 | 0.643348 | 0.461737 | 0.450859 | 0.894496 | 0.037967 | 0.097698 | 0.094356 | 0.893528 |
1 | 0.207450 | 0.194904 | 0.377063 | 0.975065 | 0.351034 | 0.812253 | 0.265956 | 0.262733 | 0.784995 | 0.674256 |
2 | 0.520431 | 0.985069 | 0.039559 | 0.541797 | 0.612761 | 0.640336 | 0.823467 | 0.768387 | 0.561777 | 0.600835 |
3 | 0.866694 | 0.165510 | 0.819767 | 0.691179 | 0.717778 | 0.392694 | 0.094067 | 0.271990 | 0.467866 | 0.041054 |
4 | 0.406310 | 0.657688 | 0.630730 | 0.694424 | 0.943934 | 0.888538 | 0.470363 | 0.518283 | 0.719674 | 0.010789 |
Benchmark ForecasterAutoreg¶
warnings.filterwarnings("ignore")

print("-----------------")
print("ForecasterAutoreg")
print("-----------------")
steps = 100
lags = 50
regressors = [
    Ridge(random_state=77, alpha=0.1),
    LGBMRegressor(random_state=77, n_jobs=1, n_estimators=50, max_depth=5),
    LGBMRegressor(random_state=77, n_jobs=-1, n_estimators=50, max_depth=5),
]
param_grids = [
    {'alpha': [0.1, 0.1, 0.1]},
    {'n_estimators': [50, 50], 'max_depth': [5, 5]},
    {'n_estimators': [50, 50], 'max_depth': [5, 5]},
]
lags_grid = [50, 50, 50]
elapsed_times = []

# For each regressor, time every method once and store the elapsed seconds
for regressor, param_grid in zip(regressors, param_grids):
    print("")
    print(regressor, param_grid)
    print("")
    forecaster = ForecasterAutoreg(
        regressor=regressor,
        lags=lags,
        transformer_exog=StandardScaler()
    )

    print("Profiling fit")
    start = time.time()
    forecaster.fit(y=y, exog=exog)
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling create_train_X_y")
    start = time.time()
    _ = forecaster.create_train_X_y(y=y, exog=exog)
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling backtesting refit parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = len(y_train),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = True,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling backtesting refit no parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = len(y_train),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = True,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling backtesting no refit parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = len(y_train),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = False,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling backtesting no refit no parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = len(y_train),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = False,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling GridSearch no refit parallel")
    start = time.time()
    results_grid = grid_search_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = len(y_train),
        steps = steps,
        param_grid = param_grid,
        lags_grid = lags_grid,
        refit = False,
        metric = 'mean_squared_error',
        fixed_train_size = False,
        return_best = False,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling GridSearch no refit no parallel")
    start = time.time()
    results_grid = grid_search_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = len(y_train),
        steps = steps,
        param_grid = param_grid,
        lags_grid = lags_grid,
        refit = False,
        metric = 'mean_squared_error',
        fixed_train_size = False,
        return_best = False,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

# Same order in which the timings were appended inside the loop
methods = [
    "fit",
    "create_train_X_y",
    "backtest_refit_parallel",
    "backtest_refit_noparallel",
    "backtest_no_refit_parallel",
    "backtest_no_refit_noparallel",
    "gridSearch_no_refit_parallel",
    "gridSearch_no_refit_noparallel"
]
results = pd.DataFrame({
    "regressor": np.repeat(np.array([str(regressor) for regressor in regressors]), len(methods)),
    "method": np.tile(methods, len(regressors)),
    "elapsed_time": elapsed_times
})
results['parallel'] = results.method.str.contains("_parallel")
results['method'] = results.method.str.replace("_parallel", "")
results['method'] = results.method.str.replace("_noparallel", "")
results = results.sort_values(by=["regressor", "method", "parallel"])
results_pivot = results.pivot_table(
    index=["regressor", "method"],
    columns="parallel",
    values="elapsed_time"
).reset_index()
results_pivot.columns.name = None
# Positive values mean the parallel run was faster than the sequential one
results_pivot["pct_improvement"] = (results_pivot[False] - results_pivot[True]) / results_pivot[False] * 100
display(results_pivot)

fig, ax = plt.subplots(figsize=(10, 5))
sns.barplot(data=results_pivot.dropna(), x="method", y="pct_improvement", hue="regressor", ax=ax)
ax.set_title(f"Parallel vs Sequential (ForecasterAutoreg, n={n})")
ax.set_ylabel("Percent difference")
ax.set_xlabel("Method")
-----------------
ForecasterAutoreg
-----------------

Ridge(alpha=0.1, random_state=77) {'alpha': [0.1, 0.1, 0.1]}

Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 9.
Profiling GridSearch no refit no parallel
Number of models compared: 9.

LGBMRegressor(max_depth=5, n_estimators=50, n_jobs=1, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}

Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 12.
Profiling GridSearch no refit no parallel
Number of models compared: 12.

LGBMRegressor(max_depth=5, n_estimators=50, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}

Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 12.
Profiling GridSearch no refit no parallel
Number of models compared: 12.
 | regressor | method | False | True | pct_improvement |
---|---|---|---|---|---|
0 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_no_refit | 0.571370 | 0.416455 | 27.112945 |
1 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_refit | 6.569071 | 2.720030 | 58.593380 |
2 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | create_train_X_y | 0.007874 | NaN | NaN |
3 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | fit | 0.277259 | NaN | NaN |
4 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | gridSearch_no_refit | 6.699500 | 6.123097 | 8.603664 |
5 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_no_refit | 0.751951 | 0.546975 | 27.259220 |
6 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_refit | 7.001708 | 2.424567 | 65.371771 |
7 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | create_train_X_y | 0.008184 | NaN | NaN |
8 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | fit | 0.275215 | NaN | NaN |
9 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | gridSearch_no_refit | 6.601839 | 7.825191 | -18.530468 |
10 | Ridge(alpha=0.1, random_state=77) | backtest_no_refit | 0.744464 | 0.378917 | 49.102068 |
11 | Ridge(alpha=0.1, random_state=77) | backtest_refit | 1.571386 | 12.706505 | -708.617632 |
12 | Ridge(alpha=0.1, random_state=77) | create_train_X_y | 0.011935 | NaN | NaN |
13 | Ridge(alpha=0.1, random_state=77) | fit | 0.024534 | NaN | NaN |
14 | Ridge(alpha=0.1, random_state=77) | gridSearch_no_refit | 4.548398 | 2.756721 | 39.391399 |
Text(0.5, 0, 'Method')
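To read the pct_improvement column, it may help to recompute it by hand for two rows of the table above. The sketch below reuses the formula from the benchmark code; the helper names `pct_improvement` and `speedup` are introduced here only for illustration, and the timings are copied from the Ridge and first LGBMRegressor `backtest_refit` rows.

```python
def pct_improvement(sequential_s: float, parallel_s: float) -> float:
    """Percentage of the sequential time saved by the parallel run.

    Positive: parallel was faster. Negative: parallel was slower.
    """
    return (sequential_s - parallel_s) / sequential_s * 100


def speedup(sequential_s: float, parallel_s: float) -> float:
    """Ratio sequential / parallel; values below 1 mean a slowdown."""
    return sequential_s / parallel_s


# Timings (seconds) taken from the table: column False = sequential, True = parallel
rows = {
    "Ridge backtest_refit": (1.571386, 12.706505),
    "LGBMRegressor(n_jobs=1) backtest_refit": (6.569071, 2.720030),
}

for name, (seq, par) in rows.items():
    print(f"{name}: {pct_improvement(seq, par):.2f} % improvement, "
          f"speedup x{speedup(seq, par):.2f}")
```

A value near -708 for Ridge therefore means the parallel run took about eight times as long as the sequential one, while roughly 58.6 for LightGBM corresponds to a run about 2.4 times faster. This is consistent with the rule that sets n_jobs to 1 for linear regressors.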
Benchmark ForecasterAutoregDirect¶
print("-----------------------")
print("ForecasterAutoregDirect")
print("-----------------------")
steps = 25
lags = 50
regressors = [
    Ridge(random_state=77, alpha=0.1),
    LGBMRegressor(random_state=77, n_jobs=1, n_estimators=50, max_depth=5),
    LGBMRegressor(random_state=77, n_jobs=-1, n_estimators=50, max_depth=5),
]
param_grids = [
    {'alpha': [0.1, 0.1, 0.1]},
    {'n_estimators': [50, 50], 'max_depth': [5, 5]},
    {'n_estimators': [50, 50], 'max_depth': [5, 5]},
]
lags_grid = [50, 50, 50]
elapsed_times = []

# For each regressor, time every method once and store the elapsed seconds
for regressor, param_grid in zip(regressors, param_grids):
    print("")
    print(regressor, param_grid)
    print("")
    forecaster = ForecasterAutoregDirect(
        regressor=regressor,
        lags=lags,
        steps=steps,
        transformer_exog=StandardScaler()
    )

    print("Profiling fit")
    start = time.time()
    forecaster.fit(y=y, exog=exog)
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling create_train_X_y")
    start = time.time()
    _ = forecaster.create_train_X_y(y=y, exog=exog)
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling backtesting refit parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = True,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling backtesting refit no parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = True,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling backtesting no refit parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = False,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling backtesting no refit no parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = False,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling GridSearch no refit parallel")
    start = time.time()
    results_grid = grid_search_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        steps = steps,
        param_grid = param_grid,
        lags_grid = lags_grid,
        refit = False,
        metric = 'mean_squared_error',
        fixed_train_size = False,
        return_best = False,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling GridSearch no refit no parallel")
    start = time.time()
    results_grid = grid_search_forecaster(
        forecaster = forecaster,
        y = y,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        steps = steps,
        param_grid = param_grid,
        lags_grid = lags_grid,
        refit = False,
        metric = 'mean_squared_error',
        fixed_train_size = False,
        return_best = False,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

# Same order in which the timings were appended inside the loop
methods = [
    "fit",
    "create_train_X_y",
    "backtest_refit_parallel",
    "backtest_refit_noparallel",
    "backtest_no_refit_parallel",
    "backtest_no_refit_noparallel",
    "gridSearch_no_refit_parallel",
    "gridSearch_no_refit_noparallel"
]
results = pd.DataFrame({
    "regressor": np.repeat(np.array([str(regressor) for regressor in regressors]), len(methods)),
    "method": np.tile(methods, len(regressors)),
    "elapsed_time": elapsed_times
})
results['parallel'] = results.method.str.contains("_parallel")
results['method'] = results.method.str.replace("_parallel", "")
results['method'] = results.method.str.replace("_noparallel", "")
results = results.sort_values(by=["regressor", "method", "parallel"])
results_pivot = results.pivot_table(
    index=["regressor", "method"],
    columns="parallel",
    values="elapsed_time"
).reset_index()
results_pivot.columns.name = None
# Positive values mean the parallel run was faster than the sequential one
results_pivot["pct_improvement"] = (results_pivot[False] - results_pivot[True]) / results_pivot[False] * 100
display(results_pivot)

fig, ax = plt.subplots(figsize=(10, 5))
sns.barplot(data=results_pivot.dropna(), x="method", y="pct_improvement", hue="regressor", ax=ax)
ax.set_title(f"Parallel vs Sequential (ForecasterAutoregDirect, n={n})")
ax.set_ylabel("Percent difference")
ax.set_xlabel("Method");
-----------------------
ForecasterAutoregDirect
-----------------------

Ridge(alpha=0.1, random_state=77) {'alpha': [0.1, 0.1, 0.1]}

Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 9.
Profiling GridSearch no refit no parallel
Number of models compared: 9.

LGBMRegressor(max_depth=5, n_estimators=50, n_jobs=1, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}

Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 12.
Profiling GridSearch no refit no parallel
Number of models compared: 12.

LGBMRegressor(max_depth=5, n_estimators=50, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}

Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit no parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
Number of models compared: 12.
Profiling GridSearch no refit no parallel
Number of models compared: 12.
 | regressor | method | False | True | pct_improvement |
---|---|---|---|---|---|
0 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_no_refit | 5.992304 | 10.283535 | -71.612377 |
1 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_refit | 141.952965 | 75.919333 | 46.517966 |
2 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | create_train_X_y | 0.072996 | NaN | NaN |
3 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | fit | 4.144056 | NaN | NaN |
4 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | gridSearch_no_refit | 43.838513 | 76.157306 | -73.722375 |
5 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_no_refit | 5.886309 | 10.738465 | -82.431196 |
6 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_refit | 143.764385 | 64.597212 | 55.067305 |
7 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | create_train_X_y | 0.060571 | NaN | NaN |
8 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | fit | 4.122978 | NaN | NaN |
9 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | gridSearch_no_refit | 42.607671 | 75.015008 | -76.059866 |
10 | Ridge(alpha=0.1, random_state=77) | backtest_no_refit | 0.323168 | 0.319153 | 1.242526 |
11 | Ridge(alpha=0.1, random_state=77) | backtest_refit | 5.785780 | 2.094785 | 63.794244 |
12 | Ridge(alpha=0.1, random_state=77) | create_train_X_y | 0.033940 | NaN | NaN |
13 | Ridge(alpha=0.1, random_state=77) | fit | 0.430777 | NaN | NaN |
14 | Ridge(alpha=0.1, random_state=77) | gridSearch_no_refit | 4.431143 | 3.941903 | 11.040943 |
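Each figure in these tables comes from a single start/stop measurement with `time.time()`, so small differences may be within run-to-run noise. One way to make such comparisons more robust is to repeat each call and keep the median. The helper below is a hypothetical sketch and not part of skforecast or of the benchmark above; `time_call` and its `repeats` parameter are names chosen for this example, and `time.perf_counter()` is swapped in because it is intended for measuring short durations.

```python
import time
from statistics import median


def time_call(func, *args, repeats=3, **kwargs):
    """Call func(*args, **kwargs) `repeats` times and return the median
    wall-clock duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        durations.append(time.perf_counter() - start)
    return median(durations)


# Toy usage with a built-in function; in the benchmark, func would be
# forecaster.fit or backtesting_forecaster with its keyword arguments.
elapsed = time_call(sum, range(1_000_000), repeats=5)
print(f"median elapsed time: {elapsed:.6f} s")
```

The median is used rather than the mean so that a single slow run, for example the first call that starts the worker processes, does not dominate the result.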
Benchmark ForecasterAutoregMultiSeries¶
print("----------------------------")
print("ForecasterAutoregMultiseries")
print("----------------------------")
steps = 100
lags = 50
regressors = [
    Ridge(random_state=77, alpha=0.1),
    LGBMRegressor(random_state=77, n_jobs=1, n_estimators=50, max_depth=5),
    LGBMRegressor(random_state=77, n_jobs=-1, n_estimators=50, max_depth=5),
]
param_grids = [
    {'alpha': [0.1, 0.1, 0.1]},
    {'n_estimators': [50, 50], 'max_depth': [5, 5]},
    {'n_estimators': [50, 50], 'max_depth': [5, 5]},
]
lags_grid = [50, 50, 50]
elapsed_times = []

# For each regressor, time every method once and store the elapsed seconds
for regressor, param_grid in zip(regressors, param_grids):
    print("")
    print(regressor, param_grid)
    print("")
    forecaster = ForecasterAutoregMultiSeries(
        regressor=regressor,
        lags=lags,
        transformer_exog=StandardScaler()
    )

    print("Profiling fit")
    start = time.time()
    forecaster.fit(series=multi_series, exog=exog)
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling create_train_X_y")
    start = time.time()
    _ = forecaster.create_train_X_y(series=multi_series, exog=exog)
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling backtesting refit parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = len(y_train),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = True,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    # Note: despite the printed label, this run is sequential (n_jobs = 1)
    print("Profiling backtesting refit and parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = len(y_train),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = True,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling backtesting no refit parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = len(y_train),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = False,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling backtesting no refit no parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = len(y_train),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = False,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling GridSearch no refit parallel")
    start = time.time()
    results_grid = grid_search_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = len(y_train),
        steps = steps,
        param_grid = param_grid,
        lags_grid = lags_grid,
        refit = False,
        metric = 'mean_squared_error',
        fixed_train_size = False,
        return_best = False,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    print("Profiling GridSearch no refit no parallel")
    start = time.time()
    results_grid = grid_search_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = len(y_train),
        steps = steps,
        param_grid = param_grid,
        lags_grid = lags_grid,
        refit = False,
        metric = 'mean_squared_error',
        fixed_train_size = False,
        return_best = False,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

# Same order in which the timings were appended inside the loop
methods = [
    "fit",
    "create_train_X_y",
    "backtest_refit_parallel",
    "backtest_refit_noparallel",
    "backtest_no_refit_parallel",
    "backtest_no_refit_noparallel",
    "gridSearch_no_refit_parallel",
    "gridSearch_no_refit_noparallel"
]
results = pd.DataFrame({
    "regressor": np.repeat(np.array([str(regressor) for regressor in regressors]), len(methods)),
    "method": np.tile(methods, len(regressors)),
    "elapsed_time": elapsed_times
})
results['parallel'] = results.method.str.contains("_parallel")
results['method'] = results.method.str.replace("_parallel", "")
results['method'] = results.method.str.replace("_noparallel", "")
results = results.sort_values(by=["regressor", "method", "parallel"])
results_pivot = results.pivot_table(
    index=["regressor", "method"],
    columns="parallel",
    values="elapsed_time"
).reset_index()
results_pivot.columns.name = None
# Positive values mean the parallel run was faster than the sequential one
results_pivot["pct_improvement"] = (results_pivot[False] - results_pivot[True]) / results_pivot[False] * 100
display(results_pivot)

fig, ax = plt.subplots(figsize=(10, 5))
sns.barplot(data=results_pivot.dropna(), x="method", y="pct_improvement", hue="regressor", ax=ax)
ax.set_title(f"Parallel vs Sequential (ForecasterAutoregMultiseries, n={n})")
ax.set_ylabel("Percent difference")
ax.set_xlabel("Method");
----------------------------
ForecasterAutoregMultiseries
----------------------------

Ridge(alpha=0.1, random_state=77) {'alpha': [0.1, 0.1, 0.1]}

Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit and parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
9 models compared for 10 level(s). Number of iterations: 9.
Profiling GridSearch no refit no parallel
9 models compared for 10 level(s). Number of iterations: 9.

LGBMRegressor(max_depth=5, n_estimators=50, n_jobs=1, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}

Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit and parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
12 models compared for 10 level(s). Number of iterations: 12.
Profiling GridSearch no refit no parallel
12 models compared for 10 level(s). Number of iterations: 12.

LGBMRegressor(max_depth=5, n_estimators=50, random_state=77) {'n_estimators': [50, 50], 'max_depth': [5, 5]}

Profiling fit
Profiling create_train_X_y
Profiling backtesting refit parallel
Profiling backtesting refit and parallel
Profiling backtesting no refit parallel
Profiling backtesting no refit no parallel
Profiling GridSearch no refit parallel
12 models compared for 10 level(s). Number of iterations: 12.
Profiling GridSearch no refit no parallel
12 models compared for 10 level(s). Number of iterations: 12.
 | regressor | method | False | True | pct_improvement |
---|---|---|---|---|---|
0 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_no_refit | 3.856087 | 1.785516 | 53.696183 |
1 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | backtest_refit | 26.799302 | 14.820522 | 44.698105 |
2 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | create_train_X_y | 0.082853 | NaN | NaN |
3 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | fit | 1.439756 | NaN | NaN |
4 | LGBMRegressor(max_depth=5, n_estimators=50, n_... | gridSearch_no_refit | 42.083872 | 29.374772 | 30.199456 |
5 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_no_refit | 5.492170 | 2.099951 | 61.764638 |
6 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | backtest_refit | 30.017390 | 14.576546 | 51.439661 |
7 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | create_train_X_y | 0.079670 | NaN | NaN |
8 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | fit | 1.465376 | NaN | NaN |
9 | LGBMRegressor(max_depth=5, n_estimators=50, ra... | gridSearch_no_refit | 72.783301 | 36.314506 | 50.105992 |
10 | Ridge(alpha=0.1, random_state=77) | backtest_no_refit | 3.122103 | 1.041682 | 66.635227 |
11 | Ridge(alpha=0.1, random_state=77) | backtest_refit | 6.525677 | 2.931308 | 55.080398 |
12 | Ridge(alpha=0.1, random_state=77) | create_train_X_y | 0.119956 | NaN | NaN |
13 | Ridge(alpha=0.1, random_state=77) | fit | 0.277841 | NaN | NaN |
14 | Ridge(alpha=0.1, random_state=77) | gridSearch_no_refit | 26.023329 | 13.008654 | 50.011569 |
Benchmark ForecasterAutoregMultiVariate¶
print("-----------------------------")
print("ForecasterAutoregMultivariate")
print("-----------------------------")
steps = 25
lags = 50
regressors = [
    Ridge(random_state=77, alpha=0.1),
    LGBMRegressor(random_state=77, n_jobs=1, n_estimators=50, max_depth=5),
    LGBMRegressor(random_state=77, n_jobs=-1, n_estimators=50, max_depth=5),
]
param_grids = [
    {'alpha': [0.1, 0.1, 0.1]},
    {'n_estimators': [50, 50], 'max_depth': [5, 5]},
    {'n_estimators': [50, 50], 'max_depth': [5, 5]},
]
lags_grid = [50, 50, 50]
elapsed_times = []
for regressor, param_grid in zip(regressors, param_grids):
    print("")
    print(regressor, param_grid)
    print("")
    forecaster = ForecasterAutoregMultiVariate(
        regressor=regressor,
        lags=lags,
        steps=steps,
        level="series_1",
        transformer_exog=StandardScaler()
    )

    # Time a single fit on the full data set
    print("Profiling fit")
    start = time.time()
    forecaster.fit(series=multi_series, exog=exog)
    end = time.time()
    elapsed_times.append(end - start)

    # Time the creation of the training matrices
    print("Profiling create_train_X_y")
    start = time.time()
    _ = forecaster.create_train_X_y(series=multi_series, exog=exog)
    end = time.time()
    elapsed_times.append(end - start)

    # Backtesting with refit, all cores (n_jobs=-1)
    print("Profiling backtesting refit parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = True,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    # Backtesting with refit, single job (n_jobs=1)
    print("Profiling backtesting refit no parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = True,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

    # Backtesting without refit, all cores (n_jobs=-1)
    print("Profiling backtesting no refit parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = False,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    # Backtesting without refit, single job (n_jobs=1)
    print("Profiling backtesting no refit no parallel")
    start = time.time()
    metric, backtest_predictions = backtesting_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        fixed_train_size = False,
        steps = steps,
        metric = 'mean_squared_error',
        refit = False,
        interval = None,
        n_boot = 500,
        random_state = 123,
        in_sample_residuals = True,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

    # Grid search without refit, all cores (n_jobs=-1)
    print("Profiling GridSearch no refit parallel")
    start = time.time()
    results_grid = grid_search_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        steps = steps,
        param_grid = param_grid,
        lags_grid = lags_grid,
        refit = False,
        metric = 'mean_squared_error',
        fixed_train_size = False,
        return_best = False,
        verbose = False,
        show_progress = False,
        n_jobs = -1
    )
    end = time.time()
    elapsed_times.append(end - start)

    # Grid search without refit, single job (n_jobs=1)
    print("Profiling GridSearch no refit no parallel")
    start = time.time()
    results_grid = grid_search_forecaster_multiseries(
        forecaster = forecaster,
        series = multi_series,
        exog = exog,
        initial_train_size = int(len(y)*0.9),
        steps = steps,
        param_grid = param_grid,
        lags_grid = lags_grid,
        refit = False,
        metric = 'mean_squared_error',
        fixed_train_size = False,
        return_best = False,
        verbose = False,
        show_progress = False,
        n_jobs = 1
    )
    end = time.time()
    elapsed_times.append(end - start)

methods = [
    "fit",
    "create_train_X_y",
    "backtest_refit_parallel",
    "backtest_refit_noparallel",
    "backtest_no_refit_parallel",
    "backtest_no_refit_noparallel",
    "gridSearch_no_refit_parallel",
    "gridSearch_no_refit_noparallel"
]
results = pd.DataFrame({
    "regressor": np.repeat(np.array([str(regressor) for regressor in regressors]), len(methods)),
    "method": np.tile(methods, len(regressors)),
    "elapsed_time": elapsed_times
})
results['parallel'] = results.method.str.contains("_parallel")
results['method'] = results.method.str.replace("_parallel", "")
results['method'] = results.method.str.replace("_noparallel", "")
results = results.sort_values(by=["regressor", "method", "parallel"])
results_pivot = results.pivot_table(index=["regressor", "method"], columns="parallel", values="elapsed_time").reset_index()
results_pivot.columns.name = None
results_pivot["pct_improvement"] = (results_pivot[False] - results_pivot[True]) / results_pivot[False] * 100
display(results_pivot)
fig, ax = plt.subplots(figsize=(10, 5