This document profiles the main classes, methods, and functions available in skforecast. Understanding where the bottlenecks are helps to:
- Use the library more efficiently
- Improve the code in future releases
Libraries and data
# Libraries
# ==============================================================================
import time
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import platform
import psutil
import sklearn
from sklearn.linear_model import LinearRegression, Ridge
from sklearn.ensemble import HistGradientBoostingRegressor
from lightgbm import LGBMRegressor
import skforecast
from skforecast.recursive import ForecasterRecursive
from skforecast.direct import ForecasterDirect
from skforecast.model_selection import grid_search_forecaster, backtesting_forecaster
%load_ext pyinstrument
# Versions
# ==============================================================================
print(f"Python version : {platform.python_version()}")
print(f"scikit-learn version: {sklearn.__version__}")
print(f"skforecast version : {skforecast.__version__}")
print(f"pandas version : {pd.__version__}")
print(f"numpy version : {np.__version__}")
print(f"psutil version : {psutil.__version__}")
print("")
# System information
# ==============================================================================
print(f"Machine type: {platform.machine()}")
print(f"Processor type: {platform.processor()}")
print(f"Platform type: {platform.platform()}")
print(f"Operating system: {platform.system()}")
print(f"Operating system release: {platform.release()}")
print(f"Operating system version: {platform.version()}")
print(f"Number of physical cores: {psutil.cpu_count(logical=False)}")
print(f"Number of logical cores: {psutil.cpu_count(logical=True)}")
Python version : 3.12.9
scikit-learn version: 1.5.2
skforecast version : 0.15.0
pandas version : 2.2.3
numpy version : 1.26.4
psutil version : 7.0.0

Machine type: x86_64
Processor type: x86_64
Platform type: Linux-5.15.0-1077-aws-x86_64-with-glibc2.31
Operating system: Linux
Operating system release: 5.15.0-1077-aws
Operating system version: #84~20.04.1-Ubuntu SMP Mon Jan 20 22:14:54 UTC 2025
Number of physical cores: 4
Number of logical cores: 8
A time series of length 1000 with random values is created.
# Data
# ==============================================================================
np.random.seed(123)
n = 1_000
data = pd.Series(data = np.random.normal(size=n))
Dummy regressor
To isolate the training process of the regressor from the other parts of the code, a dummy regressor class is created. This dummy regressor has a fit method that does nothing, and a predict method that returns a constant value.
class DummyRegressor(LinearRegression):
    """
    Dummy regressor with dummy fit and predict methods.
    """

    def fit(self, X, y):
        pass

    def predict(self, X):
        predictions = np.ones(shape=len(X))
        return predictions
Profiling fit
%%pyinstrument
forecaster = ForecasterRecursive(
                 regressor = DummyRegressor(),
                 lags      = 24
             )
forecaster.fit(y=data)
Almost all of the time spent by `fit` goes into the `create_train_X_y` method.
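The same conclusion can be reached without pyinstrument. The sketch below (reusing the forecaster and data defined above) times the full `fit` call and the `create_train_X_y` call separately with `time.perf_counter`; with the dummy regressor, both timings should be almost identical.

# Cross-check: time `fit` and `create_train_X_y` separately
# ==============================================================================
tic = time.perf_counter()
forecaster.fit(y=data)
fit_time = time.perf_counter() - tic

tic = time.perf_counter()
_ = forecaster.create_train_X_y(y=data)
matrix_time = time.perf_counter() - tic

print(f"fit              : {fit_time:.4f} s")
print(f"create_train_X_y : {matrix_time:.4f} s")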
%%pyinstrument
forecaster = ForecasterRecursive(
                 regressor = HistGradientBoostingRegressor(max_iter=10, random_state=123),
                 lags      = 24
             )
forecaster.fit(y=data)
When training a forecaster with a real machine learning regressor, the time spent by `create_train_X_y` is negligible compared to the time needed by the regressor's `fit` method. Therefore, speeding up `create_train_X_y` would have little impact.
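To put a number on "negligible", both stages can be timed in isolation: building the training matrices and fitting the regressor on them. A minimal sketch, assuming `create_train_X_y` returns the training matrices `X_train` and `y_train`, and reusing the forecaster from the previous cell:

# Share of training time spent building the matrices
# ==============================================================================
tic = time.perf_counter()
X_train, y_train = forecaster.create_train_X_y(y=data)
matrix_time = time.perf_counter() - tic

tic = time.perf_counter()
forecaster.regressor.fit(X_train, y_train)
regressor_time = time.perf_counter() - tic

print(f"create_train_X_y : {matrix_time:.4f} s")
print(f"regressor fit    : {regressor_time:.4f} s")
print(f"matrix share     : {matrix_time / (matrix_time + regressor_time):.1%}")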
Profiling create_train_X_y
This section examines how the execution time of the `create_train_X_y` method is influenced by the length of the series and the number of lags.
# Profiling `create_train_X_y` for different series lengths and numbers of lags
# ======================================================================================
series_length = np.linspace(1000, 1000000, num=5, dtype=int)
n_lags = [5, 10, 50, 100, 200]
results = {}

for lags in n_lags:
    execution_time = []
    forecaster = ForecasterRecursive(
                     regressor = DummyRegressor(),
                     lags      = lags
                 )
    for n in series_length:
        y = pd.Series(data = np.random.normal(size=n))
        tic = time.perf_counter()
        _ = forecaster.create_train_X_y(y=y)
        toc = time.perf_counter()
        execution_time.append(toc - tic)
    results[lags] = execution_time

results = pd.DataFrame(
              data  = results,
              index = series_length
          )
results
Execution time in seconds, by series length (rows) and number of lags (columns):

| series length | 5 | 10 | 50 | 100 | 200 |
|---:|---:|---:|---:|---:|---:|
| 1000 | 0.000890 | 0.000957 | 0.001248 | 0.002070 | 0.003482 |
| 250750 | 0.006234 | 0.006186 | 0.033500 | 0.063247 | 0.123518 |
| 500500 | 0.011425 | 0.018634 | 0.064680 | 0.126329 | 0.250267 |
| 750250 | 0.016728 | 0.023954 | 0.102774 | 0.200989 | 0.402883 |
| 1000000 | 0.016318 | 0.027664 | 0.107433 | 0.211251 | 0.425320 |
fig, ax = plt.subplots(figsize=(7, 4))
results.plot(ax=ax, marker='.')
ax.set_xlabel('length of series')
ax.set_ylabel('time (seconds)')
ax.set_title('Profiling create_train_X_y()')
ax.legend(title='number of lags');
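Execution time grows roughly linearly with both the series length and the number of lags. This is consistent with the shape of the training matrix: with `n` observations and `lags` lags, `create_train_X_y` has to materialize an `(n - lags, lags)` array, so every extra lag adds a column and every extra observation adds a row. The sketch below builds such a lag matrix with plain NumPy slicing; it is a conceptual illustration, not skforecast's actual implementation (in particular, the column order of the lags may differ).

# Conceptual lag-matrix construction (illustration only)
# ==============================================================================
def make_lag_matrix(y, lags):
    """Return X of shape (len(y) - lags, lags) and the aligned target."""
    n_rows = len(y) - lags
    # Column j holds the series shifted by j positions; row i then contains
    # the `lags` values that precede observation y[i + lags]
    X = np.column_stack([y[j : j + n_rows] for j in range(lags)])
    return X, y[lags:]

X_demo, y_demo = make_lag_matrix(np.arange(10, dtype=float), lags=3)
print(X_demo.shape)  # (7, 3): 7 training rows, one column per lag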
Profiling predict
forecaster = ForecasterRecursive(
                 regressor = DummyRegressor(),
                 lags      = 24
             )
forecaster.fit(y=data)
%%pyinstrument
_ = forecaster.predict(steps=1000)
forecaster = ForecasterRecursive(
                 regressor = HistGradientBoostingRegressor(max_iter=10, random_state=123),
                 lags      = 24
             )
forecaster.fit(y=data)
%%pyinstrument
_ = forecaster.predict(steps=1000)
Inside the `predict` method, the `append` operation is the most expensive step but, as happens with `fit`, it is negligible compared to the time needed by the regressor's own `predict` method.
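The presence of `append` in the profile is explained by how recursive forecasting works: each step predicts a single value and appends it to the window of lags so it can be used as an input for the next step. The loop below is a conceptual sketch of this idea, not skforecast's actual implementation.

# Conceptual recursive prediction loop (illustration only)
# ==============================================================================
def recursive_predict(regressor, last_window, steps):
    """Predict `steps` values, feeding each prediction back as a new lag."""
    window = np.asarray(last_window, dtype=float)
    predictions = np.empty(steps)
    for i in range(steps):
        pred = regressor.predict(window.reshape(1, -1))[0]
        predictions[i] = pred
        # np.append allocates a new array on every iteration, which is why
        # the append operation shows up in the profile
        window = np.append(window[1:], pred)
    return predictions

With a trivial model such as the dummy regressor, the per-step array allocation dominates; with a model such as HistGradientBoostingRegressor, the regressor's own `predict` call dominates instead.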