Plotting

Plot forecasting residuals

Analyzing the residuals (errors) of the predictions helps to understand the behavior of a forecaster. The function skforecast.plot.plot_residuals creates three plots:

A time-ordered plot of the residual values
A distribution plot of the residuals
An autocorrelation plot of the residuals

By examining the residual values over time, you can determine whether there is a pattern in the errors made by the forecast model. The distribution plot helps you understand whether the residuals are normally distributed, and the autocorrelation plot helps you identify whether there are any dependencies or relationships between the residuals.
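
The visual checks described above can be complemented with a few numbers. The following is a minimal sketch, not part of skforecast: the helper name residual_summary and the synthetic residuals are assumptions made for illustration. It reports the mean of the residuals (a value far from zero suggests bias) and their sample autocorrelation at a given lag (a value far from zero suggests structure the model has not captured).

```python
import numpy as np

def residual_summary(residuals, lag=1):
    """Return the mean, standard deviation and lag-`lag` autocorrelation."""
    r = np.asarray(residuals, dtype=float)
    r = r[~np.isnan(r)]                      # ignore missing values
    centered = r - r.mean()
    # Sample autocorrelation: lagged cross-product divided by the sum of squares
    acf = np.sum(centered[lag:] * centered[:-lag]) / np.sum(centered ** 2)
    return {"mean": r.mean(), "std": r.std(ddof=1), f"acf_lag_{lag}": acf}

# Synthetic residuals, used only for illustration
rng = np.random.default_rng(123)
white_noise = rng.normal(loc=0.0, scale=1.0, size=500)
trending = white_noise + np.linspace(-3, 3, 500)   # errors that drift over time

print(residual_summary(white_noise))
print(residual_summary(trending))
```

The drifting residuals should show a clearly positive lag-1 autocorrelation, which is the kind of pattern the time-ordered and autocorrelation plots are meant to reveal.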

In [1]:

# Libraries
# ==============================================================================
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.linear_model import Ridge

from skforecast.datasets import fetch_dataset
from skforecast.recursive import ForecasterRecursive
from skforecast.model_selection import TimeSeriesFold
from skforecast.model_selection import backtesting_forecaster
from skforecast.plot import plot_residuals
from skforecast.plot import set_dark_theme

In [2]:

# Download data
# ==============================================================================
data = fetch_dataset(
    name="h2o", raw=True, kwargs_read_csv={"names": ["y", "date"], "header": 0}
)

# Data preprocessing
# ==============================================================================
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.set_index('date')
data = data.asfreq('MS')

# Plot data
# ==============================================================================
set_dark_theme()
fig, ax = plt.subplots(figsize=(7, 3))
data.plot(ax=ax);

h2o
---
Monthly expenditure ($AUD) on corticosteroid drugs that the Australian health
system had between 1991 and 2008.
Hyndman R (2023). fpp3: Data for Forecasting: Principles and Practice(3rd
Edition). http://pkg.robjhyndman.com/fpp3package/,https://github.com/robjhyndman
/fpp3package, http://OTexts.com/fpp3.
Shape of the dataset: (204, 2)

In [3]:

# Train and backtest forecaster
# ==============================================================================
n_backtest = 36 * 3
data_train = data[:-n_backtest]
data_test  = data[-n_backtest:]

forecaster = ForecasterRecursive(
                 regressor = Ridge(),
                 lags      = 5 
             )

cv = TimeSeriesFold(steps=36, initial_train_size=len(data_train))

metric, predictions = backtesting_forecaster(
                          forecaster = forecaster,
                          y          = data['y'],
                          cv         = cv,
                          metric     = 'mean_squared_error',
                          verbose    = True
                      )
 
predictions.head()

Information of folds
--------------------
Number of observations used for initial training: 96
Number of observations used for backtesting: 108
    Number of folds: 3
    Number skipped folds: 0 
    Number of steps per fold: 36
    Number of steps to exclude between last observed data (last window) and predictions (gap): 0

Fold: 0
    Training:   1991-07-01 00:00:00 -- 1999-06-01 00:00:00  (n=96)
    Validation: 1999-07-01 00:00:00 -- 2002-06-01 00:00:00  (n=36)
Fold: 1
    Training:   No training in this fold
    Validation: 2002-07-01 00:00:00 -- 2005-06-01 00:00:00  (n=36)
Fold: 2
    Training:   No training in this fold
    Validation: 2005-07-01 00:00:00 -- 2008-06-01 00:00:00  (n=36)

Out[3]:

	pred
1999-07-01	0.667651
1999-08-01	0.655759
1999-09-01	0.652177
1999-10-01	0.641377
1999-11-01	0.635245
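
The fold layout reported in the verbose output follows from simple arithmetic on the series length (204), the initial training size (96) and the number of steps per fold (36). The sketch below reproduces the validation windows as positional ranges. It assumes no gap and no refitting, as in the output above; the function backtest_folds is a hypothetical helper, not a skforecast function.

```python
def backtest_folds(n_obs, initial_train_size, steps):
    """Return (start, end) positional bounds of each validation fold (end exclusive)."""
    folds = []
    start = initial_train_size
    while start < n_obs:
        end = min(start + steps, n_obs)   # the last fold may be shorter
        folds.append((start, end))
        start = end
    return folds

folds = backtest_folds(n_obs=204, initial_train_size=96, steps=36)
for i, (start, end) in enumerate(folds):
    print(f"Fold {i}: positions {start}-{end - 1} (n={end - start})")
```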

The plot_residuals function can be used in two ways: with pre-calculated residuals or by passing the predicted and actual values of the series.

In [4]:

# Plot residuals
# ======================================================================================
residuals = predictions['pred'] - data_test['y']
_ = plot_residuals(residuals=residuals, figsize=(7, 3.5))

In [5]:

_ = plot_residuals(y_true=data_test['y'], y_pred=predictions['pred'], figsize=(7, 3.5))

It is possible to customize the plot either by passing a pre-existing matplotlib figure object or by using additional keyword arguments, which are passed to matplotlib.pyplot.figure().

Plot prediction intervals

In [6]:

# Libraries
# ==============================================================================
from skforecast.plot import plot_prediction_intervals

In [7]:

# Data download
# ==============================================================================
data = fetch_dataset(name='h2o_exog', raw=True, verbose=False)

# Data preparation
# ==============================================================================
data = data.rename(columns={'fecha': 'date'})
data['date'] = pd.to_datetime(data['date'], format='%Y-%m-%d')
data = data.set_index('date')
data = data.asfreq('MS')
data = data.sort_index()

# Split data into train-test
# ==============================================================================
steps = 36
data_train = data[:-steps]
data_test  = data[-steps:]
print(f"Train dates : {data_train.index.min()} --- {data_train.index.max()}  (n={len(data_train)})")
print(f"Test dates  : {data_test.index.min()} --- {data_test.index.max()}  (n={len(data_test)})")

Train dates : 1992-04-01 00:00:00 --- 2005-06-01 00:00:00  (n=159)
Test dates  : 2005-07-01 00:00:00 --- 2008-06-01 00:00:00  (n=36)

In [8]:

# Create and train forecaster
# ==============================================================================
forecaster = ForecasterRecursive(
                 regressor = Ridge(alpha=0.1, random_state=765),
                 lags      = 15
             )
forecaster.fit(y=data_train['y'])

# Prediction intervals
# ==============================================================================
predictions = forecaster.predict_interval(
                  steps    = steps,
                  interval = [1, 99],
                  n_boot   = 500
              )
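
The arguments interval = [1, 99] and n_boot = 500 ask for the 1st and 99th percentiles of 500 bootstrapped predictions. The sketch below illustrates the idea for a single one-step-ahead forecast. It is a simplified illustration under stated assumptions, not the skforecast implementation: the point forecast and the residuals are synthetic stand-ins, and the recursive multi-step procedure is omitted.

```python
import numpy as np

rng = np.random.default_rng(765)

# Illustrative inputs: not produced by the forecaster above
point_forecast = 0.90                                         # one-step-ahead prediction
train_residuals = rng.normal(loc=0.0, scale=0.05, size=150)   # stand-in for in-sample errors

# Resample residuals with replacement and add them to the point forecast
n_boot = 500
boot_predictions = point_forecast + rng.choice(train_residuals, size=n_boot, replace=True)

# interval=[1, 99] is read as the 1st and 99th percentiles
lower_bound, upper_bound = np.percentile(boot_predictions, [1, 99])
print(f"pred={point_forecast:.3f}  lower={lower_bound:.3f}  upper={upper_bound:.3f}")
```

A wider percentile range or noisier residuals produce a wider interval; a larger n_boot mainly makes the estimated bounds more stable.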

In [9]:

# Plot forecasts with prediction intervals
# ==============================================================================
plot_prediction_intervals(
    predictions     = predictions,
    y_true          = data_test,
    target_variable = "y",
    title           = "Real value vs predicted in test data"
)

Plot correlation between lags of multiple time series

When training a dependent multi-series (multivariate) global forecasting model, such as ForecasterDirectMultiVariate, it is useful to analyze the correlation between the target series and the lags of the other time series. The function skforecast.utils.multivariate_time_series_corr computes these correlations, and skforecast.plot.plot_multivariate_time_series_corr displays them as a heatmap.
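
To make the quantity behind the heatmap concrete, the following sketch computes the Pearson correlation between a target series and a lagged copy of another series. It is an illustration on synthetic data, and lagged_correlation is a hypothetical helper; it is not necessarily how skforecast computes the values.

```python
import numpy as np

def lagged_correlation(target, other, lag):
    """Pearson correlation between target[t] and other[t - lag], for lag >= 0."""
    target = np.asarray(target, dtype=float)
    other = np.asarray(other, dtype=float)
    if lag > 0:
        # Align each target value with the value of `other` observed `lag` steps earlier
        target, other = target[lag:], other[:-lag]
    return np.corrcoef(target, other)[0, 1]

# Synthetic data: `driver` leads `target` by exactly 3 steps
rng = np.random.default_rng(0)
driver = rng.normal(size=1000)
target = np.roll(driver, 3) + 0.1 * rng.normal(size=1000)

for lag in range(5):
    print(lag, round(lagged_correlation(target, driver, lag), 3))
```

In this synthetic case only lag 3 should stand out, which is the kind of signal that helps decide which lags of which series are worth including as predictors.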

In [10]:

# Libraries
# ==============================================================================
from skforecast.plot import plot_multivariate_time_series_corr
from skforecast.utils import multivariate_time_series_corr

In [11]:

# Data download
# ==============================================================================
data = fetch_dataset(name='air_quality_valencia', raw=False, verbose=True)
data

air_quality_valencia
--------------------
Hourly measures of several air chemical pollutant at Valencia city (Avd.
Francia) from 2019-01-01 to 2023-12-31. Including the following variables:
pm2.5 (µg/m³), CO (mg/m³), NO (µg/m³), NO2 (µg/m³), PM10 (µg/m³), NOx (µg/m³),
O3 (µg/m³), Veloc. (m/s), Direc. (degrees), SO2 (µg/m³).
Red de Vigilancia y Control de la Contaminación Atmosférica, 46250047-València -
Av. França, https://mediambient.gva.es/es/web/calidad-ambiental/datos-
historicos.
Shape of the dataset: (43824, 10)

Out[11]:

	so2	co	no	no2	pm10	nox	o3	veloc.	direc.	pm2.5
datetime
2019-01-01 00:00:00	8.0	0.2	3.0	36.0	22.0	40.0	16.0	0.5	262.0	19.0
2019-01-01 01:00:00	8.0	0.1	2.0	40.0	32.0	44.0	6.0	0.6	248.0	26.0
2019-01-01 02:00:00	8.0	0.1	11.0	42.0	36.0	58.0	3.0	0.3	224.0	31.0
2019-01-01 03:00:00	10.0	0.1	15.0	41.0	35.0	63.0	3.0	0.2	220.0	30.0
2019-01-01 04:00:00	11.0	0.1	16.0	39.0	36.0	63.0	3.0	0.4	221.0	30.0
...	...	...	...	...	...	...	...	...	...	...
2023-12-31 19:00:00	3.0	0.1	6.0	18.0	8.0	26.0	47.0	1.7	246.0	7.0
2023-12-31 20:00:00	3.0	0.1	6.0	19.0	7.0	27.0	49.0	1.3	239.0	6.0
2023-12-31 21:00:00	3.0	0.1	4.0	15.0	5.0	22.0	55.0	1.5	247.0	4.0
2023-12-31 22:00:00	3.0	0.1	5.0	13.0	5.0	20.0	57.0	1.1	246.0	5.0
2023-12-31 23:00:00	3.0	0.1	5.0	12.0	5.0	20.0	55.0	0.5	247.0	4.0

43824 rows × 10 columns

In [12]:

# Correlation between target series and the lags of the other series
# ======================================================================================
corr = multivariate_time_series_corr(
           time_series = data['pm2.5'],
           other       = data,
           lags        = 24
       )
corr.head(3)

Out[12]:

	so2	co	no	no2	pm10	nox	o3	veloc.	direc.	pm2.5
lag
0	0.110520	0.167426	0.330658	0.463759	0.687683	0.444082	-0.344922	-0.184636	0.007725	1.000000
1	0.098191	0.168089	0.313363	0.454940	0.633034	0.428927	-0.328899	-0.197767	-0.006486	0.939515
2	0.082010	0.157377	0.272874	0.424717	0.574097	0.388446	-0.294490	-0.200191	-0.021277	0.870223

In [14]:

_ = plot_multivariate_time_series_corr(corr, figsize=(7, 7))