Benchmarking skforecast¶
Benchmarking is an essential part of the development process of skforecast. It provides a transparent view of how the performance of the library evolves across versions and helps users make informed decisions when choosing the right forecaster or configuration for their use case.
In this section, we present benchmark results that measure the execution time of key methods (`fit`, `predict`, etc.) and their evolution across versions, allowing improvements or regressions to be detected.
Methodology¶
The benchmarking results presented here are generated using a custom benchmarking script located in the `benchmarks/` directory of the repository. This script executes a series of performance tests for the main forecasting classes and their methods, recording both execution time and variability across multiple runs.
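As an illustration of what recording "execution time and variability" means in practice, the snippet below is a minimal timing helper built on `time.perf_counter`. It is a sketch, not the actual code in `benchmarks/run_benchmarks.py`, but it produces the same kind of summary statistics (average, median, 95th percentile, standard deviation and number of repeats) that appear in the results shown further down.
# Illustrative timing helper (sketch; the real benchmark script may differ)
# ==============================================================================
import time
import numpy as np

def time_method(func, *args, n_repeats=10, **kwargs):
    """Run `func` n_repeats times and summarise its execution time."""
    run_times = []
    for _ in range(n_repeats):
        start = time.perf_counter()
        func(*args, **kwargs)
        run_times.append(time.perf_counter() - start)
    run_times = np.array(run_times)
    return {
        "run_time_avg"   : run_times.mean(),
        "run_time_median": np.median(run_times),
        "run_time_p95"   : np.percentile(run_times, 95),
        "run_time_std"   : run_times.std(),
        "n_repeats"      : n_repeats,
    }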
To ensure consistency and reproducibility, all benchmarks are automatically executed as part of the continuous integration (CI) pipeline using GitHub Actions. This guarantees that each new release of the library is tested under the same environment and dependency configuration, making performance results directly comparable across versions.
Users who wish to reproduce the benchmarks locally can execute the same script by running:
# From skforecast root directory
python benchmarks/run_benchmarks.py
Results are stored in a structured format in `benchmarks/benchmark.joblib`.
⚠ Warning
The `plot_benchmark_results` function is located at the end of the notebook. Go to plot function.
# Libraries
# ==============================================================================
import platform
import psutil
import numpy as np
import pandas as pd
import joblib
import sklearn
import skforecast
# Display all columns in pandas
pd.set_option("display.max_columns", None)
# Environment information
# ==============================================================================
print(f"Python version : {platform.python_version()}")
print(f"skforecast version : {skforecast.__version__}")
print(f"numpy version : {np.__version__}")
print(f"pandas version : {pd.__version__}")
print(f"scikit-learn version : {sklearn.__version__}")
print(f"Computer network name : {platform.node()}")
print(f"Processor type : {platform.processor()}")
print(f"Platform type : {platform.platform()}")
print(f"Number of physical cores : {psutil.cpu_count(logical=False)}")
print(f"Number of logical cores : {psutil.cpu_count(logical=True)}")
print(f"Memory total : {round(psutil.virtual_memory().total / 1e9, 2)} GB")
Python version : 3.12.11
skforecast version : 0.17.0
numpy version : 2.1.3
pandas version : 2.3.1
scikit-learn version : 1.6.1
Computer network name : ITES015-NB0029
Processor type : Intel64 Family 6 Model 141 Stepping 1, GenuineIntel
Platform type : Windows-11-10.0.26100-SP0
Number of physical cores : 8
Number of logical cores : 16
Memory total : 34.07 GB
import warnings
warnings.filterwarnings(
"ignore",
category=FutureWarning,
message="'force_all_finite' was renamed to 'ensure_all_finite'"
)
Global results¶
# Load benchmark results
# ==============================================================================
results_benchmark_all = joblib.load("../../benchmarks/benchmark.joblib")
results_benchmark_all.head(2)
| | forecaster_name | regressor_name | function_name | function_hash | method_name | run_time_avg | run_time_median | run_time_p95 | run_time_std | n_repeats | datetime | python_version | skforecast_version | numpy_version | pandas_version | sklearn_version | lightgbm_version | platform | processor | cpu_count | memory_gb |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | ForecasterRecursive | DummyRegressor | ForecasterRecursive__create_train_X_y | 59b823f1ff395872fac4f7578bd859fa | _create_train_X_y | 0.004280 | 0.004191 | 0.004524 | 0.000344 | 30 | 2025-08-26 06:59:07.003048 | 3.12.11 | 0.17.0 | 2.1.3 | 2.3.2 | 1.6.1 | 4.6.0 | Linux-6.11.0-1018-azure-x86_64-with-glibc2.39 | x86_64 | 4 | 16.77 |
| 1 | ForecasterRecursive | DummyRegressor | ForecasterRecursive_fit | 9d73eaf5faa980194d715362601eed68 | fit | 0.005956 | 0.005379 | 0.008166 | 0.001107 | 10 | 2025-08-26 06:59:07.066656 | 3.12.11 | 0.17.0 | 2.1.3 | 2.3.2 | 1.6.1 | 4.6.0 | Linux-6.11.0-1018-azure-x86_64-with-glibc2.39 | x86_64 | 4 | 16.77 |
ForecasterRecursive¶
The `ForecasterRecursive` class is benchmarked under a fixed experimental setup to ensure fair comparison across library versions. Key conditions are:
Condition | Value |
---|---|
Regressor | sklearn.dummy.DummyRegressor (to isolate forecaster overhead) |
Dataset length | 2000 synthetic observations (generated with bench_forecaster_recursive._make_data ) |
Lags as predictors | 50 |
Exogenous features | 3 |
Data transformation | StandardScaler() for time series features and exogenous variables |
Prediction horizon | 100 steps ahead |
Backtesting split | 1200 training, remaining for testing, step size = 50, no re-fitting (see guide) |
Note: In versions < 0.14.0, `ForecasterRecursive` was named `ForecasterAutoreg`.
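For reference, a comparable single-series scenario can be assembled with the public API as sketched below. This is a sketch assuming the skforecast >= 0.14 imports; the synthetic data is generated inline rather than with the internal `bench_forecaster_recursive._make_data` helper, so absolute timings will not match the CI results.
# Comparable ForecasterRecursive setup (sketch, not the benchmark script itself)
# ==============================================================================
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.preprocessing import StandardScaler
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import TimeSeriesFold, backtesting_forecaster

# Synthetic data: 2000 observations and 3 exogenous features
rng = np.random.default_rng(123)
idx = pd.date_range("2020-01-01", periods=2000, freq="h")
y = pd.Series(rng.normal(size=2000).cumsum(), index=idx, name="y")
exog = pd.DataFrame(
    rng.normal(size=(2000, 3)), index=idx, columns=["exog_1", "exog_2", "exog_3"]
)

forecaster = ForecasterRecursive(
    regressor        = DummyRegressor(),
    lags             = 50,
    transformer_y    = StandardScaler(),
    transformer_exog = StandardScaler()
)

# Fit and predict 100 steps ahead
forecaster.fit(y=y.iloc[:-100], exog=exog.iloc[:-100])
predictions = forecaster.predict(steps=100, exog=exog.iloc[-100:])

# Backtesting: 1200 training observations, step size 50, no re-fitting
cv = TimeSeriesFold(steps=50, initial_train_size=1200, refit=False)
metric, predictions_backtest = backtesting_forecaster(
    forecaster = forecaster,
    y          = y,
    exog       = exog,
    cv         = cv,
    metric     = 'mean_absolute_error'
)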
# Plot benchmark results
# ==============================================================================
plot_benchmark_results(
results_benchmark_all,
forecaster_names = ['ForecasterRecursive', 'ForecasterAutoreg'],
regressors = ['DummyRegressor'],
add_mean = True,
add_median = True
)
ForecasterDirect¶
The `ForecasterDirect` class is benchmarked under a fixed experimental setup to ensure fair comparison across library versions. Key conditions are:
Condition | Value |
---|---|
Regressor | sklearn.dummy.DummyRegressor (to isolate forecaster overhead) |
Dataset length | 2000 synthetic observations (generated with bench_forecaster_direct._make_data ) |
Lags as predictors | 20 |
Exogenous features | 3 |
Data transformation | StandardScaler() for time series features and exogenous variables |
Prediction horizon | 10 steps ahead (10 regressors) |
Backtesting split | 1200 training, remaining for testing, step size = 10, no re-fitting (see guide) |
Note: In versions < 0.14.0, `ForecasterDirect` was named `ForecasterAutoregDirect`.
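The direct strategy fits one regressor per step of the prediction horizon, so the 10-step horizon above implies 10 internal regressors. A minimal constructor sketch under the same assumptions as before (skforecast >= 0.14 API):
# ForecasterDirect: one regressor per step of the 10-step horizon (sketch)
# ==============================================================================
from sklearn.dummy import DummyRegressor
from sklearn.preprocessing import StandardScaler
from skforecast.direct import ForecasterDirect

forecaster = ForecasterDirect(
    regressor        = DummyRegressor(),
    steps            = 10,  # prediction horizon: 10 regressors are trained
    lags             = 20,
    transformer_y    = StandardScaler(),
    transformer_exog = StandardScaler()
)
Backtesting then proceeds as in the recursive case, with a step size of 10 instead of 50.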
# Plot benchmark results
# ==============================================================================
plot_benchmark_results(
results_benchmark_all,
forecaster_names = ['ForecasterDirect', 'ForecasterAutoregDirect'],
regressors = ['DummyRegressor'],
add_mean = True,
add_median = True
)
ForecasterRecursiveMultiSeries¶
The `ForecasterRecursiveMultiSeries` class is benchmarked under a fixed experimental setup to ensure fair comparison across library versions. Key conditions are:
Condition | Value |
---|---|
Regressor | sklearn.dummy.DummyRegressor (to isolate forecaster overhead) |
Number of series | 600 time series |
Dataset length | 2000 synthetic observations per series (generated with bench_forecaster_recursive_multiseries._make_data ) |
Lags as predictors | 50 |
Exogenous features | 3 |
Data transformation | StandardScaler() for time series features and exogenous variables |
Prediction horizon | 100 steps ahead |
Backtesting split | 1200 training, remaining for testing, step size = 50, no re-fitting (see guide) |
Note: In versions < 0.14.0, `ForecasterRecursiveMultiSeries` was named `ForecasterAutoregMultiSeries`.
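A comparable multi-series scenario can be sketched as follows, again assuming the skforecast >= 0.14 API. The 600 series are generated inline and the column names (`series_0`, ...) are illustrative; the actual benchmark uses `bench_forecaster_recursive_multiseries._make_data`.
# Comparable ForecasterRecursiveMultiSeries setup (sketch)
# ==============================================================================
import numpy as np
import pandas as pd
from sklearn.dummy import DummyRegressor
from sklearn.preprocessing import StandardScaler
from skforecast.recursive import ForecasterRecursiveMultiSeries
from skforecast.model_selection import TimeSeriesFold, backtesting_forecaster_multiseries

# 600 synthetic series of length 2000 (one column per series) plus 3 exogenous features
rng = np.random.default_rng(123)
idx = pd.date_range("2020-01-01", periods=2000, freq="h")
series = pd.DataFrame(
    rng.normal(size=(2000, 600)).cumsum(axis=0),
    index=idx,
    columns=[f"series_{i}" for i in range(600)]
)
exog = pd.DataFrame(
    rng.normal(size=(2000, 3)), index=idx, columns=["exog_1", "exog_2", "exog_3"]
)

forecaster = ForecasterRecursiveMultiSeries(
    regressor          = DummyRegressor(),
    lags               = 50,
    transformer_series = StandardScaler(),
    transformer_exog   = StandardScaler()
)

# Backtesting: 1200 training observations, step size 50, no re-fitting
cv = TimeSeriesFold(steps=50, initial_train_size=1200, refit=False)
metric, predictions = backtesting_forecaster_multiseries(
    forecaster = forecaster,
    series     = series,
    exog       = exog,
    cv         = cv,
    metric     = 'mean_absolute_error'
)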
# Plot benchmark results
# ==============================================================================
plot_benchmark_results(
results_benchmark_all,
forecaster_names = ['ForecasterRecursiveMultiSeries', 'ForecasterAutoregMultiSeries'],
regressors = ['DummyRegressor'],
add_mean = True,
add_median = True
)
ForecasterDirectMultiVariate¶
The `ForecasterDirectMultiVariate` class is benchmarked under a fixed experimental setup to ensure fair comparison across library versions. Key conditions are:
Condition | Value |
---|---|
Regressor | sklearn.dummy.DummyRegressor (to isolate forecaster overhead) |
Number of series | 13 time series |
Dataset length | 1000 synthetic observations (generated with bench_forecaster_direct_multivariate._make_data ) |
Lags as predictors | 20 |
Exogenous features | 3 |
Data transformation | StandardScaler() for time series features and exogenous variables |
Prediction horizon | 10 steps ahead (10 regressors) |
Backtesting split | 900 training, remaining for testing, step size = 10, no re-fitting (see guide) |
Note: In versions < 0.14.0, `ForecasterDirectMultiVariate` was named `ForecasterAutoregMultiVariate`.
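In this forecaster a single target series (the level) is predicted from the lags of all 13 series, with one internal regressor per step of the horizon. A minimal constructor sketch, assuming the skforecast >= 0.14 API and an illustrative level name `series_1`:
# ForecasterDirectMultiVariate: predict one target series from the lags of all series (sketch)
# ==============================================================================
from sklearn.dummy import DummyRegressor
from sklearn.preprocessing import StandardScaler
from skforecast.direct import ForecasterDirectMultiVariate

forecaster = ForecasterDirectMultiVariate(
    regressor          = DummyRegressor(),
    level              = 'series_1',  # column of `series` to forecast (illustrative name)
    steps              = 10,          # prediction horizon: 10 regressors are trained
    lags               = 20,
    transformer_series = StandardScaler(),
    transformer_exog   = StandardScaler()
)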
# Plot benchmark results
# ==============================================================================
plot_benchmark_results(
results_benchmark_all,
forecaster_names = ['ForecasterDirectMultiVariate', 'ForecasterAutoregMultiVariate'],
regressors = ['DummyRegressor'],
add_mean = True,
add_median = True
)
Plot function¶
# Plot function
# ==============================================================================
from __future__ import annotations
import numpy as np
import pandas as pd
import plotly.io as pio
import plotly.graph_objects as go
from plotly.express.colors import qualitative
from packaging.version import Version, InvalidVersion
pio.renderers.default = "notebook_connected"
def plot_benchmark_results(
df: pd.DataFrame,
forecaster_names: str | list[str],
regressors: str | list[str] | None = None,
add_median: bool = True,
add_mean: bool = True
) -> None:
"""
Plot interactive benchmark results by method and package version.
This function renders an interactive Plotly chart that visualizes execution
times for multiple methods across `skforecast` versions. Each data point
represents a run (or an aggregated run) for a given `(method, version)` and is
jittered horizontally to avoid overlap. Points are **colored by version** and
(optionally) per-version **median** and **mean** lines are overlaid for the
selected method. A dropdown allows switching the visible method.
Parameters
----------
    df : pandas DataFrame
        Benchmark results, as produced by the benchmarking script.
forecaster_names : str, list
Forecaster(s) to filter in `df` (column `forecaster_name`).
regressors : str, list, default None
Regressor(s) to filter in `df` (column `regressor_name`). If `None`,
no additional filtering by regressor is applied.
add_median : bool, default True
If `True`, draw one per-version **median** line for the selected method.
add_mean : bool, default True
If `True`, draw one per-version **mean** line for the selected method.
Returns
-------
    None
        The figure is displayed with `fig.show()`; nothing is returned.
"""
if not isinstance(forecaster_names, list):
forecaster_names = [forecaster_names]
df = df.query("forecaster_name in @forecaster_names")
if regressors is not None:
if not isinstance(regressors, list):
regressors = [regressors]
df = df.query("regressor_name in @regressors")
def try_version(v):
try:
return Version(str(v).lstrip('v'))
except InvalidVersion:
return Version("0") # fallback
versions_sorted = sorted(df['skforecast_version'].unique(), key=try_version)
version_to_num = {v: i for i, v in enumerate(versions_sorted)}
df['skforecast_version'] = pd.Categorical(
df['skforecast_version'],
categories=versions_sorted,
ordered=True
)
df['x_jittered'] = (
df['skforecast_version'].map(version_to_num).astype(float) +
np.random.uniform(-0.05, 0.05, size=len(df))
)
    # --- color palette, one color per skforecast version ---
version_colors = {
v: qualitative.Plotly[i % len(qualitative.Plotly)]
for i, v in enumerate(versions_sorted)
}
methods = list(df['method_name'].unique())
fig = go.Figure()
    # --- one trace per (method, version); marker colors encode the version ---
method_to_traces = {m: [] for m in methods}
for i, m in enumerate(methods):
for version in versions_sorted:
sub_df = df[
(df['method_name'] == m) &
(df['skforecast_version'] == version)
]
if sub_df.empty:
continue
marker = dict(
size=10,
color=version_colors[version],
opacity=0.85,
line=dict(color="white", width=1)
)
error_y = dict(
type='data',
array=sub_df['run_time_std'].to_numpy(),
visible=not sub_df['run_time_std'].isna().all(),
color=version_colors[version],
thickness=1.5,
width=5
)
fig.add_trace(go.Scatter(
x=sub_df['x_jittered'],
y=sub_df['run_time_avg'],
mode='markers',
marker=marker,
error_y=error_y,
visible=(i == 0),
# name=f"{methods[i]} — {label}",
text = sub_df.apply(lambda row: (
f"Forecaster: {row['forecaster_name']}<br>"
f"Regressor: {row['regressor_name']}<br>"
f"Function: {row['function_name']}<br>"
f"Function_hash: {row['function_hash']}<br>"
f"Method: {row['method_name']}<br>"
f"Datetime: {row['datetime']}<br>"
f"Python version: {row['python_version']}<br>"
f"skforecast version: {row['skforecast_version']}<br>"
f"numpy version: {row['numpy_version']}<br>"
f"pandas version: {row['pandas_version']}<br>"
f"sklearn version: {row['sklearn_version']}<br>"
f"lightgbm version: {row['lightgbm_version']}<br>"
f"Platform: {row['platform']}<br>"
f"Processor: {row['processor']}<br>"
f"CPU count: {row['cpu_count']}<br>"
f"Memory (GB): {row['memory_gb']:.2f}<br>"
f"Run time avg: {row['run_time_avg']:.6f} s<br>"
f"Run time median: {row['run_time_median']:.6f} s<br>"
f"Run time p95: {row['run_time_p95']:.6f} s<br>"
f"Run time std: {row['run_time_std']:.6f} s<br>"
f"Nº repeats: {row['n_repeats']}"
), axis=1),
hovertemplate = '%{text}<extra></extra>'
))
method_to_traces[m].append(len(fig.data) - 1)
median_trace_id = {}
if add_median:
for i, m in enumerate(methods):
med_df = (
df[df['method_name'] == m]
.groupby('skforecast_version', observed=True)['run_time_avg']
.median()
.reset_index()
)
if med_df.empty:
continue
median_color = "#374151"
med_df['x_center'] = med_df['skforecast_version'].map(version_to_num)
fig.add_trace(go.Scatter(
x=med_df['x_center'],
y=med_df['run_time_avg'],
mode='lines+markers',
line=dict(color=median_color, width=2),
marker=dict(size=8, color=median_color),
name='Median (per version)',
visible=(i == 0)
))
median_trace_id[m] = len(fig.data) - 1
mean_trace_id = {}
if add_mean:
for i, m in enumerate(methods):
mean_df = (
df[df['method_name'] == m]
.groupby('skforecast_version', observed=True)['run_time_avg']
.mean()
.reset_index()
)
if mean_df.empty:
continue
mean_color = "#9CA3AF"
mean_df['x_center'] = mean_df['skforecast_version'].map(version_to_num)
fig.add_trace(go.Scatter(
x=mean_df['x_center'],
y=mean_df['run_time_avg'],
mode='lines+markers',
line=dict(color=mean_color, width=2, dash='dash'),
marker=dict(size=8, color=mean_color),
name='Mean (per version)',
visible=(i == 0)
))
mean_trace_id[m] = len(fig.data) - 1
def visible_mask_for(method):
n = len(fig.data)
mask = [False] * n
        # point traces of the selected method (all its versions)
for idx in method_to_traces.get(method, []):
mask[idx] = True
        # its median and mean lines (if present)
if add_median and method in median_trace_id:
mask[median_trace_id[method]] = True
if add_mean and method in mean_trace_id:
mask[mean_trace_id[method]] = True
return mask
buttons_methods = []
for i, m in enumerate(methods):
buttons_methods.append(dict(
label=m,
method="update",
args=[
{"visible": visible_mask_for(m)},
{"title": {"text": f"Execution time — method: `{m}`"}}
]
))
fig.update_layout(
title=dict(
text=f"Execution time — method: `{methods[0]}`"
),
xaxis=dict(
tickmode="array",
tickvals=list(version_to_num.values()),
ticktext=list(version_to_num.keys()),
title="skforecast version",
tickangle=0,
automargin=True,
),
yaxis=dict(
title="Execution time (s)",
automargin=True
),
# template="plotly_white",
margin=dict(l=70, r=20, t=100, b=60),
updatemenus=[
dict(
type="dropdown",
buttons=buttons_methods,
showactive=True,
direction="down",
x=1.00,
y=1.03,
xanchor="right",
yanchor="bottom",
pad={"r": 2, "t": 0},
),
],
legend=dict(title=""),
showlegend=False,
)
fig.show()
# Show in web (other option)
# from IPython.display import HTML
# return HTML(fig.to_html(full_html=False, include_plotlyjs="cdn"))
" f"Regressor: {row['regressor_name']}
" f"Function: {row['function_name']}
" f"Function_hash: {row['function_hash']}
" f"Method: {row['method_name']}
" f"Datetime: {row['datetime']}
" f"Python version: {row['python_version']}
" f"skforecast version: {row['skforecast_version']}
" f"numpy version: {row['numpy_version']}
" f"pandas version: {row['pandas_version']}
" f"sklearn version: {row['sklearn_version']}
" f"lightgbm version: {row['lightgbm_version']}
" f"Platform: {row['platform']}
" f"Processor: {row['processor']}
" f"CPU count: {row['cpu_count']}
" f"Memory (GB): {row['memory_gb']:.2f}
" f"Run time avg: {row['run_time_avg']:.6f} s
" f"Run time median: {row['run_time_median']:.6f} s
" f"Run time p95: {row['run_time_p95']:.6f} s
" f"Run time std: {row['run_time_std']:.6f} s
" f"Nº repeats: {row['n_repeats']}" ), axis=1), hovertemplate = '%{text}