Drift detection¶
In the context of machine learning, data drift refers to a change in the statistical properties of the input data over time compared to the data on which the model was originally trained. This can cause the model’s performance to deteriorate, since it may no longer generalize well to new, unseen data.
There are several types of data drift, including:
- Covariate Drift (Feature Drift): The distribution of the input features changes, but the relationship between features and target remains the same. Example: A model was trained when a feature had values in a certain range. Over time, if that feature shifts to a different range, covariate drift occurs.
- Prior Probability Drift (Label Drift): The distribution of the target variable changes. Example: A model trained to predict energy consumption during a season may fail if seasonal patterns change due to external factors.
- Concept Drift: The relationship between input features and the target variable changes. Example: A model predicting energy consumption from weather data might fail if new technologies or behaviors alter how weather affects energy usage.
Detecting and addressing data drift is crucial for maintaining model performance in production. Common strategies include:
- Monitoring the input data during the prediction phase (a minimal sketch of this idea is shown after this list).
- Monitoring model performance (accuracy, precision, recall, etc.) over time.
- Periodically retraining the model with new data to adapt to changes.
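For example, the first strategy can be implemented with a simple two-sample test. The sketch below compares the distribution of a feature observed in production against its training distribution using scipy's Kolmogorov-Smirnov test; the feature values, the simulated shift, and the 0.05 threshold are arbitrary choices made for this illustration and are not part of skforecast.
# Example: monitoring an input feature with a two-sample KS test (illustrative only)
# ==============================================================================
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(123)
feature_train = rng.normal(loc=10, scale=2, size=1000)  # distribution seen during training
feature_prod = rng.normal(loc=13, scale=2, size=200)    # shifted distribution in production

result = ks_2samp(feature_train, feature_prod)
if result.pvalue < 0.05:
    print(f"Possible covariate drift (KS statistic={result.statistic:.3f}, p-value={result.pvalue:.4f})")
else:
    print("No significant change in the feature distribution detected")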
RangeDriftDetector
Skforecast provides the class RangeDriftDetector to detect covariate drift in both single and multiple time series, as well as in exogenous variables.
The detector checks whether the input data (lags and exogenous variables) used to predict new values fall within the range of the data used to train the model.
Its API follows the same design as the forecasters:
- The data used to train a forecaster can also be used to fit the RangeDriftDetector.
- The data passed for prediction can be used to check for drift.
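A minimal sketch of this parallel workflow with toy data is shown below. The LinearRegression regressor, the toy values, and the choice of 3 lags are arbitrary; calling predict without exog assumes that, as in the forecasters, exogenous variables are optional.
# Parallel between a forecaster and the RangeDriftDetector (sketch with toy data)
# ==============================================================================
import pandas as pd
from sklearn.linear_model import LinearRegression
from skforecast.recursive import ForecasterRecursive
from skforecast.drift_detection import RangeDriftDetector

y_train = pd.Series(
    [10.0, 11.0, 12.0, 11.5, 10.5, 11.2, 12.1, 10.8, 11.7, 12.3],
    index=pd.date_range("2020-01-01", periods=10),
    name="y",
)

forecaster = ForecasterRecursive(regressor=LinearRegression(), lags=3)
detector = RangeDriftDetector()
forecaster.fit(y=y_train)  # train the forecaster
detector.fit(y=y_train)    # fit the detector on the same data

last_window = pd.Series(
    [11.0, 25.0, 12.0],    # 25.0 is outside the training range
    index=pd.date_range("2020-01-08", periods=3),
    name="y",
)
flag, series_oor, exog_oor = detector.predict(last_window=last_window)  # check inputs before predicting
predictions = forecaster.predict(steps=3, last_window=last_window)
print(flag, series_oor, exog_oor)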
✎ Note
This module is under active development; we expect to add more features and improvements in future releases.
Libraries¶
# Libraries
# ==============================================================================
import numpy as np
import pandas as pd
from sklearn.ensemble import HistGradientBoostingRegressor
from skforecast.datasets import fetch_dataset
from skforecast.recursive import ForecasterRecursive
from skforecast.drift_detection import RangeDriftDetector
Detecting out-of-range values in a single series¶
The RangeDriftDetector checks whether the values of a time series remain consistent with the data seen during training.
For numeric variables, it verifies that each new value falls within the minimum and maximum range of the training data. Values outside this range are flagged as potential drift.
For categorical variables, it checks whether each new category was observed during training. Unseen categories are flagged as potential drift.
This mechanism allows you to quickly identify when the model is receiving inputs that differ from those it was trained on, helping you decide whether to retrain the model or adjust preprocessing.
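Conceptually, the check applied to each variable can be summarized as in the sketch below. This is only an illustration of the rule, not skforecast's actual implementation:
# Conceptual illustration of the range/membership check (not skforecast's actual code)
# ==============================================================================
import pandas as pd

def is_out_of_range(new_values: pd.Series, train_values: pd.Series) -> bool:
    """Return True if `new_values` contains data outside what was seen in `train_values`."""
    if pd.api.types.is_numeric_dtype(train_values):
        low, high = train_values.min(), train_values.max()
        return bool(((new_values < low) | (new_values > high)).any())
    else:
        seen = set(train_values.unique())
        return bool((~new_values.isin(seen)).any())

train = pd.Series([5.0, 7.2, 9.9])
new = pd.Series([6.1, 15.0])
print(is_out_of_range(new, train))  # True, 15.0 is above the training maximum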
# Simulated data
# ==============================================================================
rng = np.random.default_rng(123)
y_train = pd.Series(
    rng.normal(loc=10, scale=2, size=100),
    index=pd.date_range(start="2020-01-01", periods=100),
    name="y",
)
exog_train = pd.DataFrame(
    {
        "exog_1": rng.normal(loc=10, scale=2, size=100),
        "exog_2": rng.choice(["A", "B", "C", "D", "E"], size=100),
    },
    index=y_train.index,
)
display(y_train.head())
display(exog_train.head())
2020-01-01     8.021757
2020-01-02     9.264427
2020-01-03    12.575851
2020-01-04    10.387949
2020-01-05    11.840462
Freq: D, Name: y, dtype: float64
|  | exog_1 | exog_2 |
|---|---|---|
| 2020-01-01 | 8.968465 | B |
| 2020-01-02 | 13.316227 | B |
| 2020-01-03 | 9.405475 | A |
| 2020-01-04 | 7.233246 | A |
| 2020-01-05 | 9.437591 | A |
# Train RangeDriftDetector
# ==============================================================================
detector = RangeDriftDetector()
detector.fit(y=y_train, exog=exog_train)
detector
RangeDriftDetector
General Information
- Fitted series: y
- Fitted exogenous: exog_1, exog_2
- Series-specific exogenous: False
- Is fitted: True
Series value ranges
- {'y': (5.5850578036003915, 14.579819894629157)}
Exogenous value ranges
- {'exog_1': (4.5430286262543085, 14.531041199734418), 'exog_2': {'D', 'E', 'B', 'C', 'A'}}
Let's assume the model is deployed in production and new data is being used to forecast future values. We simulate covariate drift in the target series and in the exogenous variables to illustrate how the RangeDriftDetector class detects it.
# Prediction with drifted data
# ==============================================================================
last_window = pd.Series(
    [6.6, 7.5, 100, 9.3, 10.2], name="y"
)  # Value 100 is out of range
exog_predict = pd.DataFrame(
    {
        "exog_1": [8, 9, 10, 70, 12],         # Value 70 is out of range
        "exog_2": ["A", "B", "C", "D", "W"],  # Value 'W' is out of range
    }
)
flag_out_of_range, series_out_of_range, exog_out_of_range = detector.predict(
    last_window       = last_window,
    exog              = exog_predict,
    verbose           = True,
    suppress_warnings = False
)
print("Out of range detected :", flag_out_of_range)
print("Series out of range :", series_out_of_range)
print("Exogenous out of range :", exog_out_of_range)
╭────────────────────────────── FeatureOutOfRangeWarning ──────────────────────────────╮ │ 'y' has values outside the range seen during training [5.58506, 14.57982]. This may │ │ affect the accuracy of the predictions. │ │ │ │ Category : skforecast.exceptions.FeatureOutOfRangeWarning │ │ Location : │ │ c:\Users\jaesc2\Miniconda3\envs\skforecast_py12\Lib\site-packages\skforecast\drift_d │ │ etection\_range_drift.py:283 │ │ Suppress : warnings.simplefilter('ignore', category=FeatureOutOfRangeWarning) │ ╰──────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────── FeatureOutOfRangeWarning ──────────────────────────────╮ │ 'exog_1' has values outside the range seen during training [4.54303, 14.53104]. This │ │ may affect the accuracy of the predictions. │ │ │ │ Category : skforecast.exceptions.FeatureOutOfRangeWarning │ │ Location : │ │ c:\Users\jaesc2\Miniconda3\envs\skforecast_py12\Lib\site-packages\skforecast\drift_d │ │ etection\_range_drift.py:283 │ │ Suppress : warnings.simplefilter('ignore', category=FeatureOutOfRangeWarning) │ ╰──────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────── FeatureOutOfRangeWarning ──────────────────────────────╮ │ 'exog_2' has values not seen during training. Seen values: {'D', 'E', 'B', 'C', │ │ 'A'}. This may affect the accuracy of the predictions. │ │ │ │ Category : skforecast.exceptions.FeatureOutOfRangeWarning │ │ Location : │ │ c:\Users\jaesc2\Miniconda3\envs\skforecast_py12\Lib\site-packages\skforecast\drift_d │ │ etection\_range_drift.py:283 │ │ Suppress : warnings.simplefilter('ignore', category=FeatureOutOfRangeWarning) │ ╰──────────────────────────────────────────────────────────────────────────────────────╯
╭───────────────────────────── Out-of-range summary ──────────────────────────────╮ │ Series: │ │ 'y' has values outside the observed range [5.58506, 14.57982]. │ │ │ │ Exogenous Variables: │ │ 'exog_1' has values outside the observed range [4.54303, 14.53104]. │ │ 'exog_2' has values not seen during training. Seen values: {'D', 'E', 'B', 'C', │ │ 'A'}. │ ╰─────────────────────────────────────────────────────────────────────────────────╯
Out of range detected : True
Series out of range : ['y']
Exogenous out of range : ['exog_1', 'exog_2']
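As the warning panels indicate, these warnings can be silenced without disabling the check itself, either through the suppress_warnings argument of predict or by filtering the FeatureOutOfRangeWarning category. A short sketch reusing the detector, last_window and exog_predict defined above:
# Silencing FeatureOutOfRangeWarning (the drift check still runs and returns its results)
# ==============================================================================
import warnings
from skforecast.exceptions import FeatureOutOfRangeWarning

# Option 1: suppress warnings for a single call
flag_out_of_range, series_out_of_range, exog_out_of_range = detector.predict(
    last_window       = last_window,
    exog              = exog_predict,
    suppress_warnings = True
)

# Option 2: filter the warning category globally, as suggested in the warning message
warnings.simplefilter('ignore', category=FeatureOutOfRangeWarning)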
Detecting out-of-range values in multiple series¶
The same process applies when modeling multiple time series.
- For each series, the RangeDriftDetector checks whether the new values remain within the range of the training data.
- If exogenous variables are included, they are checked grouped by series, ensuring that drift is detected in the correct context.
This allows you to monitor drift at the per-series level, making it easier to spot issues in specific series without being misled by aggregated results.
# Simulated data - Multiple time series
# ==============================================================================
idx = pd.MultiIndex.from_product(
    [
        ["series_1", "series_2", "series_3"],
        pd.date_range(start="2020-01-01", periods=3),
    ],
    names=["series_id", "datetime"],
)
series_train = pd.DataFrame(
    {"values": [1, 2, 3, 10, 20, 30, 100, 200, 300]}, index=idx
)
exog_train = pd.DataFrame(
    {
        "exog_1": [5.0, 6.0, 7.0, 15.0, 25.0, 35.0, 150.0, 250.0, 350.0],
        "exog_2": ["A", "B", "C", "D", "E", "F", "G", "H", "I"],
    },
    index=idx,
)
display(series_train)
display(exog_train)
|  |  | values |
|---|---|---|
| series_id | datetime |  |
| series_1 | 2020-01-01 | 1 |
|  | 2020-01-02 | 2 |
|  | 2020-01-03 | 3 |
| series_2 | 2020-01-01 | 10 |
|  | 2020-01-02 | 20 |
|  | 2020-01-03 | 30 |
| series_3 | 2020-01-01 | 100 |
|  | 2020-01-02 | 200 |
|  | 2020-01-03 | 300 |
|  |  | exog_1 | exog_2 |
|---|---|---|---|
| series_id | datetime |  |  |
| series_1 | 2020-01-01 | 5.0 | A |
|  | 2020-01-02 | 6.0 | B |
|  | 2020-01-03 | 7.0 | C |
| series_2 | 2020-01-01 | 15.0 | D |
|  | 2020-01-02 | 25.0 | E |
|  | 2020-01-03 | 35.0 | F |
| series_3 | 2020-01-01 | 150.0 | G |
|  | 2020-01-02 | 250.0 | H |
|  | 2020-01-03 | 350.0 | I |
# Train RangeDriftDetector - Multiple time series
# ==============================================================================
detector = RangeDriftDetector()
detector.fit(series=series_train, exog=exog_train)
detector
RangeDriftDetector
General Information
- Fitted series: series_1, series_2, series_3
- Fitted exogenous: exog_1, exog_2
- Series-specific exogenous: True
- Is fitted: True
Series value ranges
- {'series_1': (1.0, 3.0), 'series_2': (10.0, 30.0), 'series_3': (100.0, 300.0)}
Exogenous value ranges
- {'series_1': {'exog_1': (5.0, 7.0), 'exog_2': {'A', 'C', 'B'}}, 'series_2': {'exog_1': (15.0, 35.0), 'exog_2': {'E', 'D', 'F'}}, 'series_3': {'exog_1': (150.0, 350.0), 'exog_2': {'I', 'H', 'G'}}}
# Prediction with drifted data - Multiple time series
# ==============================================================================
last_window = pd.DataFrame(
    {
        "series_1": np.array([1.5, 2.3]),
        "series_2": np.array([100, 20]),  # Value 100 is out of range
        "series_3": np.array([110, 200]),
    },
    index=pd.date_range(start="2020-01-02", periods=2),
)
idx = pd.MultiIndex.from_product(
    [
        ["series_1", "series_2", "series_3"],
        pd.date_range(start="2020-01-04", periods=2),
    ],
    names=["series_id", "datetime"],
)
exog_predict = pd.DataFrame(
    {
        "exog_1": [5.0, 6.1, 10, 70, 220, 290],
        "exog_2": ["A", "B", "D", "F", "W", "E"],
    },
    index=idx,
)
display(last_window)
display(exog_predict)
|  | series_1 | series_2 | series_3 |
|---|---|---|---|
| 2020-01-02 | 1.5 | 100 | 110 |
| 2020-01-03 | 2.3 | 20 | 200 |
|  |  | exog_1 | exog_2 |
|---|---|---|---|
| series_id | datetime |  |  |
| series_1 | 2020-01-04 | 5.0 | A |
|  | 2020-01-05 | 6.1 | B |
| series_2 | 2020-01-04 | 10.0 | D |
|  | 2020-01-05 | 70.0 | F |
| series_3 | 2020-01-04 | 220.0 | W |
|  | 2020-01-05 | 290.0 | E |
# Prediction with drifted data - Multiple time series
# ==============================================================================
flag_out_of_range, series_out_of_range, exog_out_of_range = detector.predict(
    last_window       = last_window,
    exog              = exog_predict,
    verbose           = True,
    suppress_warnings = False
)
print("Out of range detected :", flag_out_of_range)
print("Series out of range :", series_out_of_range)
print("Exogenous out of range :", exog_out_of_range)
╭────────────────────────────── FeatureOutOfRangeWarning ──────────────────────────────╮ │ 'series_2' has values outside the range seen during training [10.00000, 30.00000]. │ │ This may affect the accuracy of the predictions. │ │ │ │ Category : skforecast.exceptions.FeatureOutOfRangeWarning │ │ Location : │ │ c:\Users\jaesc2\Miniconda3\envs\skforecast_py12\Lib\site-packages\skforecast\drift_d │ │ etection\_range_drift.py:283 │ │ Suppress : warnings.simplefilter('ignore', category=FeatureOutOfRangeWarning) │ ╰──────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────── FeatureOutOfRangeWarning ──────────────────────────────╮ │ 'series_2': 'exog_1' has values outside the range seen during training [15.00000, │ │ 35.00000]. This may affect the accuracy of the predictions. │ │ │ │ Category : skforecast.exceptions.FeatureOutOfRangeWarning │ │ Location : │ │ c:\Users\jaesc2\Miniconda3\envs\skforecast_py12\Lib\site-packages\skforecast\drift_d │ │ etection\_range_drift.py:283 │ │ Suppress : warnings.simplefilter('ignore', category=FeatureOutOfRangeWarning) │ ╰──────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────── FeatureOutOfRangeWarning ──────────────────────────────╮ │ 'series_3': 'exog_2' has values not seen during training. Seen values: {'I', 'H', │ │ 'G'}. This may affect the accuracy of the predictions. │ │ │ │ Category : skforecast.exceptions.FeatureOutOfRangeWarning │ │ Location : │ │ c:\Users\jaesc2\Miniconda3\envs\skforecast_py12\Lib\site-packages\skforecast\drift_d │ │ etection\_range_drift.py:283 │ │ Suppress : warnings.simplefilter('ignore', category=FeatureOutOfRangeWarning) │ ╰──────────────────────────────────────────────────────────────────────────────────────╯
╭────────────────────────────── Out-of-range summary ──────────────────────────────╮ │ Series: │ │ 'series_2' has values outside the observed range [10.00000, 30.00000]. │ │ │ │ Exogenous Variables: │ │ 'series_2': 'exog_1' has values outside the observed range [15.00000, 35.00000]. │ │ 'series_3': 'exog_2' has values not seen during training. Seen values: {'I', │ │ 'H', 'G'}. │ ╰──────────────────────────────────────────────────────────────────────────────────╯
Out of range detected : True
Series out of range : ['series_2']
Exogenous out of range : {'series_2': ['exog_1'], 'series_3': ['exog_2']}
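The returned objects can also be used programmatically, for example to decide which series or inputs need attention. A possible sketch reusing the results from the previous cell (the follow-up actions are placeholders):
# Acting on the per-series drift results (sketch)
# ==============================================================================
if flag_out_of_range:
    for series_name in series_out_of_range:
        print(f"Target values of '{series_name}' are outside the training range: consider retraining.")
    for series_name, exog_names in exog_out_of_range.items():
        print(f"Exogenous variables {exog_names} of '{series_name}' are outside the training range: review these inputs.")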
Combining RangeDriftDetector with Forecasters¶
When deploying a forecaster in production, it is good practice to pair it with a drift detector. This ensures that both are trained on the same dataset, allowing the drift detector to verify the input data before the forecaster makes predictions.
# Data
# ==============================================================================
data = fetch_dataset(name='h2o_exog')
data.index.name = 'datetime'
data.head(3)
╭─────────────────────────────────── h2o_exog ────────────────────────────────────╮ │ Description: │ │ Monthly expenditure ($AUD) on corticosteroid drugs that the Australian health │ │ system had between 1991 and 2008. Two additional variables (exog_1, exog_2) are │ │ simulated. │ │ │ │ Source: │ │ Hyndman R (2023). fpp3: Data for Forecasting: Principles and Practice (3rd │ │ Edition). http://pkg.robjhyndman.com/fpp3package/, │ │ https://github.com/robjhyndman/fpp3package, http://OTexts.com/fpp3. │ │ │ │ URL: │ │ https://raw.githubusercontent.com/skforecast/skforecast- │ │ datasets/main/data/h2o_exog.csv │ │ │ │ Shape: 195 rows x 3 columns │ ╰─────────────────────────────────────────────────────────────────────────────────╯
|  | y | exog_1 | exog_2 |
|---|---|---|---|
| datetime |  |  |  |
| 1992-04-01 | 0.379808 | 0.958792 | 1.166029 |
| 1992-05-01 | 0.361801 | 0.951993 | 1.117859 |
| 1992-06-01 | 0.410534 | 0.952955 | 1.067942 |
# Train Forecaster and RangeDriftDetector
# ==============================================================================
steps = 36
data_train = data.iloc[:-steps, :]
data_test = data.iloc[-steps:, :]
forecaster = ForecasterRecursive(
    regressor = HistGradientBoostingRegressor(random_state=123),
    lags      = 15
)
detector = RangeDriftDetector()
forecaster.fit(
    y    = data_train['y'],
    exog = data_train[['exog_1', 'exog_2']]
)
detector.fit(
    series = data_train['y'],
    exog   = data_train[['exog_1', 'exog_2']]
)
If you use the last_window stored in the Forecaster, drift detection is unnecessary because it corresponds to the final window of the training data. In production environments, however, you may supply an external last_window from a different time period. In that case, drift detection is recommended.
In the example below, the external last_window is identical to the final training window, so no drift will be detected.
# Last window (same as forecaster.last_window_)
# ==============================================================================
last_window = data_train['y'].iloc[-forecaster.max_lag:]
last_window
datetime
2004-04-01    0.739986
2004-05-01    0.795129
2004-06-01    0.856803
2004-07-01    1.001593
2004-08-01    0.994864
2004-09-01    1.134432
2004-10-01    1.181011
2004-11-01    1.216037
2004-12-01    1.257238
2005-01-01    1.170690
2005-02-01    0.597639
2005-03-01    0.652590
2005-04-01    0.670505
2005-05-01    0.695248
2005-06-01    0.842263
Freq: MS, Name: y, dtype: float64
# Check data with RangeDriftDetector and predict with Forecaster
# ==============================================================================
detector.predict(
    last_window       = last_window,
    exog              = data_test[['exog_1', 'exog_2']],
    verbose           = True,
    suppress_warnings = False
)
predictions = forecaster.predict(
    steps       = 36,
    last_window = last_window,
    exog        = data_test[['exog_1', 'exog_2']]
)
╭───────────────── Out-of-range summary ─────────────────╮ │ Series: │ │ No series with out-of-range values found. │ │ │ │ Exogenous Variables: │ │ No exogenous variables with out-of-range values found. │ ╰────────────────────────────────────────────────────────╯
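Putting both pieces together, a production pipeline can run the drift check first and decide what to do with the forecast depending on the result. The sketch below reuses the forecaster and detector fitted above; the alerting step is only a placeholder:
# Sketch: guard the forecaster's predictions with the drift detector
# ==============================================================================
flag_out_of_range, series_out_of_range, exog_out_of_range = detector.predict(
    last_window       = last_window,
    exog              = data_test[['exog_1', 'exog_2']],
    suppress_warnings = True
)
predictions = forecaster.predict(
    steps       = 36,
    last_window = last_window,
    exog        = data_test[['exog_1', 'exog_2']]
)
if flag_out_of_range:
    # Placeholder: log the event, raise an alert, or schedule retraining here
    print("Predictions were generated from inputs outside the training range:")
    print("  Series out of range :", series_out_of_range)
    print("  Exog out of range   :", exog_out_of_range)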