This document shows the profiling of the main classes, methods and functions available in skforecast. Understanding the bottlenecks will help to:
- Use it more efficiently
- Improve the code for future releases
Libraries¶
# Libraries
# ==============================================================================
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from skforecast.ForecasterAutoreg import ForecasterAutoreg
from skforecast.ForecasterAutoregCustom import ForecasterAutoregCustom
from skforecast.ForecasterAutoregDirect import ForecasterAutoregDirect
from skforecast.model_selection import grid_search_forecaster
from skforecast.model_selection import backtesting_forecaster
from sklearn.linear_model import LinearRegression
from sklearn.linear_model import Ridge
from sklearn.ensemble import HistGradientBoostingRegressor
from lightgbm import LGBMRegressor
%load_ext pyinstrument
Data¶
A time series of 1,000 values drawn from a standard normal distribution is created.
# Data
# ==============================================================================
np.random.seed(123)
n = 1_000
data = pd.Series(data = np.random.normal(size=n))
Dummy regressor¶
To isolate the training process of the regressor from the other parts of the code, a dummy regressor class is created. This dummy regressor has a fit method that does nothing, and a predict method that returns a constant value.
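class DummyRegressor(LinearRegression):
    """
    Dummy regressor with dummy fit and predict methods.
    """

    def fit(self, X, y):
        # Does no training; returns self, following the scikit-learn convention
        return self

    def predict(self, X):
        # Return a constant prediction (1.0) for every row of X
        predictions = np.ones(shape = len(X))
        return predictions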
Profiling fit¶
%%pyinstrument
forecaster = ForecasterAutoreg(
    regressor = DummyRegressor(),
    lags = 24
)
forecaster.fit(y=data)
Almost all of the time spent by fit is consumed by the create_train_X_y method.
%%pyinstrument
forecaster = ForecasterAutoreg(
    regressor = HistGradientBoostingRegressor(max_iter=10, random_state=123),
    lags = 24
)
forecaster.fit(y=data)
When training a forecaster with a real machine learning regressor, the time spent by create_train_X_y is negligible compared to the time needed by the fit method of the regressor. Therefore, speeding up create_train_X_y would have little impact on the total training time.
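To make concrete what kind of work happens before the regressor is trained, the following is a minimal sketch of how a matrix of lagged values can be built from a series. It is not skforecast's implementation: `make_lag_matrix` is a hypothetical helper, and the column order (oldest value first) is an assumption chosen for readability.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def make_lag_matrix(y, lags):
    """Hypothetical helper: build (X, y_target) for an autoregressive model.

    Row i of X holds y[i], ..., y[i + lags - 1] (oldest value first) and
    y_target[i] is the value that follows that window, y[i + lags].
    """
    y = np.asarray(y, dtype=float)
    # Every window of length `lags`; the last window has no target, so drop it
    X = sliding_window_view(y, window_shape=lags)[:-1]
    y_target = y[lags:]
    return X, y_target


y = np.arange(10, dtype=float)
X, y_target = make_lag_matrix(y, lags=3)
print(X.shape, y_target.shape)
print(X[0], y_target[0])
```

For a series of length n, the resulting matrix has (n - lags) rows and lags columns, so the amount of data to arrange grows with both quantities.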
Profiling create_train_X_y¶
This section measures how the execution time of the create_train_X_y method depends on the length of the series and the number of lags.
# Profiling `create_train_X_y` for different series lengths and numbers of lags
# ======================================================================================
series_length = np.linspace(1000, 1000000, num=5, dtype=int)
n_lags = [5, 10, 50, 100, 200]
results = {}

for lags in n_lags:
    execution_time = []
    forecaster = ForecasterAutoreg(
        regressor = DummyRegressor(),
        lags = lags
    )

    for n in series_length:
        y = pd.Series(data = np.random.normal(size=n))
        tic = time.perf_counter()
        _ = forecaster.create_train_X_y(y=y)
        toc = time.perf_counter()
        execution_time.append(toc-tic)

    results[lags] = execution_time

results = pd.DataFrame(
    data = results,
    index = series_length
)

results
Execution time in seconds, by series length (rows) and number of lags (columns):

series length | 5 | 10 | 50 | 100 | 200 |
---|---|---|---|---|---|
1000 | 0.001328 | 0.000788 | 0.000817 | 0.002116 | 0.003013 |
250750 | 0.008482 | 0.023412 | 0.126703 | 0.255135 | 0.508422 |
500500 | 0.015588 | 0.052422 | 0.255091 | 0.506566 | 0.998829 |
750250 | 0.024476 | 0.067499 | 0.383377 | 0.749951 | 1.481535 |
1000000 | 0.030307 | 0.091011 | 0.492627 | 0.991524 | 2.012482 |
fig, ax = plt.subplots(figsize=(7, 4))
results.plot(ax=ax, marker='.')
ax.set_xlabel('length of series')
ax.set_ylabel('time (seconds)')
ax.set_title('Profiling create_train_X_y()')
ax.legend(title='number of lags');
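For the larger configurations, the measured times appear to grow roughly linearly with both the series length and the number of lags. A quick arithmetic check, using four values copied from the results table, is sketched below. The variable names are illustrative, and the ratios only describe this particular run.

```python
# Timings in seconds copied from the results table (series length -> lags -> time)
t = {
    500_500: {100: 0.506566, 200: 0.998829},
    1_000_000: {100: 0.991524, 200: 2.012482},
}

# Doubling the number of lags at a fixed series length of 1,000,000
lag_ratio = t[1_000_000][200] / t[1_000_000][100]

# Roughly doubling the series length at a fixed number of lags (200)
length_ratio = t[1_000_000][200] / t[500_500][200]

print(f"time ratio, 200 vs 100 lags: {lag_ratio:.2f}")
print(f"time ratio, 1,000,000 vs 500,500 observations: {length_ratio:.2f}")
```

Both ratios come out close to 2, which is what one would expect if the cost is proportional to the number of entries in the training matrix. The smallest configurations do not follow this pattern as cleanly, presumably because fixed overhead dominates there.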
Profiling predict¶
forecaster = ForecasterAutoreg(
    regressor = DummyRegressor(),
    lags = 24
)
forecaster.fit(y=data)
%%pyinstrument
_ = forecaster.predict(steps=1000)
forecaster = ForecasterAutoreg(
    regressor = HistGradientBoostingRegressor(max_iter=10, random_state=123),
    lags = 24
)
forecaster.fit(y=data)
%%pyinstrument
_ = forecaster.predict(steps=1000)
Inside the predict method, the append operation is the most expensive step but, as with fit, its cost is negligible compared to the time needed by the predict method of the regressor.
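A recursive multi-step forecast calls the regressor once per step and then updates the window of most recent values, which would explain why an append operation and the regressor's predict both appear in the profile. The loop below is a minimal sketch of that pattern under stated assumptions, not skforecast's code: `ConstantRegressor` and `recursive_predict` are hypothetical names, and the window is assumed to be a plain NumPy array.

```python
import numpy as np


class ConstantRegressor:
    """Hypothetical stand-in for a fitted regressor: always predicts 1.0."""

    def predict(self, X):
        return np.ones(len(X))


def recursive_predict(regressor, last_window, steps):
    """Sketch of recursive multi-step forecasting."""
    window = np.asarray(last_window, dtype=float)
    predictions = np.empty(steps)
    for i in range(steps):
        # One regressor call per step, on a single-row feature matrix
        prediction = regressor.predict(window.reshape(1, -1))[0]
        predictions[i] = prediction
        # Drop the oldest value and append the newest prediction
        window = np.append(window[1:], prediction)
    return predictions


preds = recursive_predict(ConstantRegressor(), last_window=np.zeros(24), steps=5)
print(preds)
```

With 1,000 steps, both the window update and the regressor call run 1,000 times, so whichever is slower per call dominates the total prediction time.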