This forecaster predicts future values based on the most recent equivalent
date. It can also aggregate multiple past values of the equivalent date
using a function (e.g. mean, median, max, min). The equivalent date is
found by moving back in time a specified number of steps (offset).
The offset can be defined as an integer or as a pandas DateOffset. This
approach is useful as a baseline, but it is a simplistic method and may not
capture complex underlying patterns.
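As an illustration, a minimal usage sketch assuming the class is importable from skforecast.recursive, the path suggested by the source file referenced on this page; the series y is synthetic and only meant to show the API:

import numpy as np
import pandas as pd
from skforecast.recursive import ForecasterEquivalentDate

# Daily series with a weekly pattern
y = pd.Series(
    np.tile([10., 11., 12., 13., 14., 8., 7.], 8),
    index=pd.date_range(start="2024-01-01", periods=56, freq="D"),
    name="y"
)

forecaster = ForecasterEquivalentDate(offset=7, n_offsets=1)
forecaster.fit(y=y)
predictions = forecaster.predict(steps=7)  # repeats the last observed week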
Number of steps to go back in time to find the most recent equivalent
date to the target period.
If offset is an integer, it represents the number of steps to go back
in time. For example, if the frequency of the time series is daily,
offset = 7 means that the most recent data similar to the target
period is the value observed 7 days ago.
A pandas DateOffset can also be used to shift by a given number of
valid dates. For example, BDay(2) can be used to move back two business
days. If the starting date is not a valid date, it is first moved to a
valid date. For example, if the date is a Saturday, it is moved to the
previous Friday. Then, the offset is applied. If the result is not a valid
date, it is moved to the next valid date. For example, if the date
is a Sunday, it is moved to the next Monday.
For more information about offsets, see
https://pandas.pydata.org/docs/reference/offset_frequency.html.
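A hedged sketch of an offset defined with a pandas DateOffset (BDay is the pandas business-day offset; as enforced in fit, a DateOffset offset requires a DatetimeIndex with a frequency):

import pandas as pd
from skforecast.recursive import ForecasterEquivalentDate

# Equivalent date located two business days back; weekend dates are first
# rolled to a valid business day before the offset is applied.
forecaster = ForecasterEquivalentDate(
    offset=pd.tseries.offsets.BDay(2),
    n_offsets=1
)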
Number of equivalent dates (multiples of the offset) used in the prediction.
If n_offsets is greater than 1, the values at the equivalent dates are
aggregated using the agg_func function. For example, if the frequency
of the time series is daily, offset = 7, n_offsets = 2 and
agg_func = np.mean, the predicted value will be the mean of the values
observed 7 and 14 days ago.
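A short sketch of exactly that configuration; any aggregation callable that reduces a 1D array (np.median, np.max, ...) can be used instead of np.mean:

import numpy as np
from skforecast.recursive import ForecasterEquivalentDate

# Each predicted step is the mean of the values observed 7 and 14 days
# before the target date.
forecaster = ForecasterEquivalentDate(offset=7, n_offsets=2, agg_func=np.mean)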
Additional arguments to pass to the QuantileBinner used to discretize
the residuals into k bins according to the predicted values associated
with each residual. Available arguments are: n_bins, method, subsample,
random_state and dtype. Argument method is passed internally to the
function numpy.percentile.
New in version 0.17.0
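For instance, a hedged sketch overriding the default binning configuration; only the arguments listed above (n_bins, method, subsample, random_state, dtype) are accepted:

import numpy as np
from skforecast.recursive import ForecasterEquivalentDate

forecaster = ForecasterEquivalentDate(
    offset=7,
    binner_kwargs={'n_bins': 5, 'method': 'linear', 'subsample': 200_000,
                   'random_state': 123, 'dtype': np.float64}
)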
Number of steps to go back in time to find the most recent equivalent
date to the target period.
If offset is an integer, it represents the number of steps to go back
in time. For example, if the frequency of the time series is daily,
offset = 7 means that the most recent data similar to the target
period is the value observed 7 days ago.
A pandas DateOffset can also be used to shift by a given number of
valid dates. For example, BDay(2) can be used to move back two business
days. If the starting date is not a valid date, it is first moved to a
valid date. For example, if the date is a Saturday, it is moved to the
previous Friday. Then, the offset is applied. If the result is not a valid
date, it is moved to the next valid date. For example, if the date
is a Sunday, it is moved to the next Monday.
For more information about offsets, see
https://pandas.pydata.org/docs/reference/offset_frequency.html.
Number of equivalent dates (multiples of the offset) used in the prediction.
If n_offsets is greater than 1, the values at the equivalent dates are
aggregated using the agg_func function. For example, if the frequency
of the time series is daily, offset = 7, n_offsets = 2 and
agg_func = np.mean, the predicted value will be the mean of the values
observed 7 and 14 days ago.
This window represents the most recent data observed by the predictor
during its training phase. It contains the past values needed to include
the last equivalent date according to the offset and n_offsets.
Residuals of the model when predicting training data. Only stored up to
10_000 values. If transformer_y is not None, residuals are stored in
the transformed scale. If differentiation is not None, residuals are
stored after differentiation.
In sample residuals binned according to the predicted value each residual
is associated with. The number of residuals stored per bin is limited to
10_000 // self.binner.n_bins_ in the form {bin: residuals}. If
transformer_y is not None, residuals are stored in the transformed
scale. If differentiation is not None, residuals are stored after
differentiation.
Residuals of the model when predicting non-training data. Only stored up to
10_000 values. Use the set_out_sample_residuals() method to set values. If
transformer_y is not None, residuals are stored in the transformed
scale. If differentiation is not None, residuals are stored after
differentiation.
Out of sample residuals binned according to the predicted value each residual
is associated with. The number of residuals stored per bin is limited to
10_000 // self.binner.n_bins_ in the form {bin: residuals}. If
transformer_y is not None, residuals are stored in the transformed
scale. If differentiation is not None, residuals are stored after
differentiation.
def __init__(
    self,
    offset: int | pd.tseries.offsets.DateOffset,
    n_offsets: int = 1,
    agg_func: Callable = np.mean,
    binner_kwargs: dict[str, object] | None = None,
    forecaster_id: str | int | None = None
) -> None:

    self.offset = offset
    self.n_offsets = n_offsets
    self.agg_func = agg_func
    self.last_window_ = None
    self.index_type_ = None
    self.index_freq_ = None
    self.training_range_ = None
    self.series_name_in_ = None
    self.in_sample_residuals_ = None
    self.out_sample_residuals_ = None
    self.in_sample_residuals_by_bin_ = None
    self.out_sample_residuals_by_bin_ = None
    self.creation_date = pd.Timestamp.today().strftime('%Y-%m-%d %H:%M:%S')
    self.is_fitted = False
    self.fit_date = None
    self.skforecast_version = __version__
    self.python_version = sys.version.split(" ")[0]
    self.forecaster_id = forecaster_id
    self._probabilistic_mode = "binned"
    self.estimator = None
    self.differentiation = None
    self.differentiation_max = None

    if not isinstance(self.offset, (int, pd.tseries.offsets.DateOffset)):
        raise TypeError(
            "`offset` must be an integer greater than 0 or a "
            "pandas.tseries.offsets. Find more information about offsets in "
            "https://pandas.pydata.org/docs/reference/offset_frequency.html"
        )

    self.window_size = self.offset * self.n_offsets

    self.binner_kwargs = binner_kwargs
    if binner_kwargs is None:
        self.binner_kwargs = {
            'n_bins': 10, 'method': 'linear', 'subsample': 200000,
            'random_state': 789654, 'dtype': np.float64
        }
    self.binner = QuantileBinner(**self.binner_kwargs)
    self.binner_intervals_ = None

    self.__skforecast_tags__ = {
        "library": "skforecast",
        "forecaster_name": "ForecasterEquivalentDate",
        "forecaster_task": "regression",
        "forecasting_scope": "single-series",  # single-series | global
        "forecasting_strategy": "recursive",   # recursive | direct | deep_learning
        "index_types_supported": ["pandas.RangeIndex", "pandas.DatetimeIndex"],
        "requires_index_frequency": True,
        "allowed_input_types_series": ["pandas.Series"],
        "supports_exog": False,
        "allowed_input_types_exog": [],
        "handles_missing_values_series": False,
        "handles_missing_values_exog": False,
        "supports_lags": False,
        "supports_window_features": False,
        "supports_transformer_series": False,
        "supports_transformer_exog": False,
        "supports_weight_func": False,
        "supports_differentiation": False,
        "prediction_types": ["point", "interval"],
        "supports_probabilistic": True,
        "probabilistic_methods": ["conformal"],
        "handles_binned_residuals": True
    }
If True, in-sample residuals will be stored in the forecaster object
after fitting (in_sample_residuals_ and in_sample_residuals_by_bin_
attributes).
If False, only the intervals of the bins are stored.
def fit(
    self,
    y: pd.Series,
    store_in_sample_residuals: bool = False,
    random_state: int = 123,
    exog: Any = None
) -> None:
    """
    Training Forecaster.

    Parameters
    ----------
    y : pandas Series
        Training time series.
    store_in_sample_residuals : bool, default False
        If `True`, in-sample residuals will be stored in the forecaster object
        after fitting (`in_sample_residuals_` and `in_sample_residuals_by_bin_`
        attributes).
        If `False`, only the intervals of the bins are stored.
    random_state : int, default 123
        Set a seed for the random generator so that the stored sample
        residuals are always deterministic.
    exog : Ignored
        Not used, present here for API consistency by convention.

    Returns
    -------
    None

    """

    if not isinstance(y, pd.Series):
        raise TypeError(
            f"`y` must be a pandas Series with a DatetimeIndex or a RangeIndex. "
            f"Found {type(y)}."
        )

    if isinstance(self.offset, pd.tseries.offsets.DateOffset):
        if not isinstance(y.index, pd.DatetimeIndex):
            raise TypeError(
                "If `offset` is a pandas DateOffset, the index of `y` must be a "
                "pandas DatetimeIndex with frequency."
            )
        elif y.index.freq is None:
            raise TypeError(
                "If `offset` is a pandas DateOffset, the index of `y` must be a "
                "pandas DatetimeIndex with frequency."
            )

    # Reset values in case the forecaster has already been fitted.
    self.last_window_ = None
    self.index_type_ = None
    self.index_freq_ = None
    self.training_range_ = None
    self.series_name_in_ = None
    self.is_fitted = False

    _, y_index = check_extract_values_and_index(
        data=y, data_label='`y`', return_values=False
    )

    if isinstance(self.offset, pd.tseries.offsets.DateOffset):
        # Calculate the window_size in steps for compatibility with the
        # check_predict_input function. This is not a exact calculation
        # because the offset follows the calendar rules and the distance
        # between two dates may not be constant.
        first_valid_index = (y_index[-1] - self.offset * self.n_offsets)

        try:
            window_size_idx_start = y_index.get_loc(first_valid_index)
            window_size_idx_end = y_index.get_loc(y_index[-1])
            self.window_size = window_size_idx_end - window_size_idx_start
        except KeyError:
            raise ValueError(
                f"The length of `y` ({len(y)}), must be greater than or equal "
                f"to the window size ({self.window_size}). This is because "
                f"the offset ({self.offset}) is larger than the available "
                f"data. Try to decrease the size of the offset ({self.offset}), "
                f"the number of `n_offsets` ({self.n_offsets}) or increase the "
                f"size of `y`."
            )
    else:
        if len(y) <= self.window_size:
            raise ValueError(
                f"Length of `y` must be greater than the maximum window size "
                f"needed by the forecaster. This is because "
                f"the offset ({self.offset}) is larger than the available "
                f"data. Try to decrease the size of the offset ({self.offset}), "
                f"the number of `n_offsets` ({self.n_offsets}) or increase the "
                f"size of `y`.\n"
                f"    Length `y`: {len(y)}.\n"
                f"    Max window size: {self.window_size}.\n"
            )

    self.is_fitted = True
    self.series_name_in_ = y.name if y.name is not None else 'y'
    self.fit_date = pd.Timestamp.today().strftime('%Y-%m-%d %H:%M:%S')
    self.training_range_ = y_index[[0, -1]]
    self.index_type_ = type(y_index)
    self.index_freq_ = (
        y_index.freq if isinstance(y_index, pd.DatetimeIndex) else y_index.step
    )

    # NOTE: This is done to save time during fit in functions such as backtesting()
    if self._probabilistic_mode is not False:
        self._binning_in_sample_residuals(
            y=y,
            store_in_sample_residuals=store_in_sample_residuals,
            random_state=random_state
        )

    # The last time window of training data is stored so that equivalent
    # dates are available when calling the `predict` method.
    # Store the whole series to avoid errors when the offset is larger
    # than the data available.
    self.last_window_ = y.copy()
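A minimal fit sketch, assuming the skforecast.recursive import path suggested above; the series y is synthetic and the inspected attributes follow the Attributes section of this page:

import numpy as np
import pandas as pd
from skforecast.recursive import ForecasterEquivalentDate

y = pd.Series(
    np.random.default_rng(1).normal(loc=10, scale=2, size=120),
    index=pd.date_range("2024-01-01", periods=120, freq="D"),
    name="y"
)

forecaster = ForecasterEquivalentDate(offset=7, n_offsets=2)
forecaster.fit(y=y, store_in_sample_residuals=True)

forecaster.last_window_                  # training series kept for prediction
forecaster.in_sample_residuals_          # 1D numpy array, capped at 10_000 values
forecaster.in_sample_residuals_by_bin_   # dict {bin id: residuals array}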
Bin residuals according to the predicted value each residual is
associated with. First a skforecast.preprocessing.QuantileBinner object
is fitted to the predicted values. Then, residuals are binned according
to the predicted value each residual is associated with. Residuals are
stored in the forecaster object as in_sample_residuals_ and
in_sample_residuals_by_bin_.
The number of residuals stored per bin is limited to
10_000 // self.binner.n_bins_. The total number of residuals stored is
10_000.
New in version 0.17.0
If True, in-sample residuals will be stored in the forecaster object
after fitting (in_sample_residuals_ and in_sample_residuals_by_bin_
attributes).
If False, only the intervals of the bins are stored.
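The binning step mirrors the following hedged sketch: a skforecast.preprocessing.QuantileBinner (the class named in this docstring) is fitted on the one-step-ahead predictions and then maps each prediction to a bin id; the constructor arguments shown are the defaults set in __init__, and y_pred is a synthetic stand-in:

import numpy as np
from skforecast.preprocessing import QuantileBinner

# Toy stand-in for the predictions computed over the training set
y_pred = np.random.default_rng(0).normal(loc=10, scale=2, size=500)

binner = QuantileBinner(
    n_bins=10, method='linear', subsample=200000,
    random_state=789654, dtype=np.float64
)
binner.fit(y_pred)
bin_ids = binner.transform(y_pred)   # bin id assigned to each prediction
intervals = binner.intervals_        # per-bin interval edges, keyed by bin id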
def _binning_in_sample_residuals(
    self,
    y: pd.Series,
    store_in_sample_residuals: bool = False,
    random_state: int = 123
) -> None:
    """
    Bin residuals according to the predicted value each residual is
    associated with. First a `skforecast.preprocessing.QuantileBinner` object
    is fitted to the predicted values. Then, residuals are binned according
    to the predicted value each residual is associated with. Residuals are
    stored in the forecaster object as `in_sample_residuals_` and
    `in_sample_residuals_by_bin_`.

    The number of residuals stored per bin is limited to
    `10_000 // self.binner.n_bins_`. The total number of residuals stored is
    `10_000`.
    **New in version 0.17.0**

    Parameters
    ----------
    y : pandas Series
        Training time series.
    store_in_sample_residuals : bool, default False
        If `True`, in-sample residuals will be stored in the forecaster object
        after fitting (`in_sample_residuals_` and `in_sample_residuals_by_bin_`
        attributes).
        If `False`, only the intervals of the bins are stored.
    random_state : int, default 123
        Set a seed for the random generator so that the stored sample
        residuals are always deterministic.

    Returns
    -------
    None

    """

    if isinstance(self.offset, pd.tseries.offsets.DateOffset):
        y_preds = []
        for n_off in range(1, self.n_offsets + 1):
            idx = y.index - self.offset * n_off
            mask = idx >= y.index[0]
            y_pred = y.loc[idx[mask]]
            y_pred.index = y.index[-mask.sum():]
            y_preds.append(y_pred)
        y_preds = pd.concat(y_preds, axis=1).to_numpy()
        y_true = y.to_numpy()[-len(y_preds):]
    else:
        y_preds = [
            y.shift(self.offset * n_off)[self.window_size:]
            for n_off in range(1, self.n_offsets + 1)
        ]
        y_preds = np.column_stack(y_preds)
        y_true = y.to_numpy()[self.window_size:]

    y_pred = np.apply_along_axis(self.agg_func, axis=1, arr=y_preds)
    residuals = y_true - y_pred

    if self._probabilistic_mode == "binned":
        data = pd.DataFrame({'prediction': y_pred, 'residuals': residuals}).dropna()
        y_pred = data['prediction'].to_numpy()
        residuals = data['residuals'].to_numpy()
        self.binner.fit(y_pred)
        self.binner_intervals_ = self.binner.intervals_

    if store_in_sample_residuals:
        rng = np.random.default_rng(seed=random_state)
        if self._probabilistic_mode == "binned":
            data['bin'] = self.binner.transform(y_pred).astype(int)
            self.in_sample_residuals_by_bin_ = (
                data.groupby('bin')['residuals'].apply(np.array).to_dict()
            )

            max_sample = 10_000 // self.binner.n_bins_
            for k, v in self.in_sample_residuals_by_bin_.items():
                if len(v) > max_sample:
                    sample = v[rng.integers(low=0, high=len(v), size=max_sample)]
                    self.in_sample_residuals_by_bin_[k] = sample

            for k in self.binner_intervals_.keys():
                if k not in self.in_sample_residuals_by_bin_:
                    self.in_sample_residuals_by_bin_[k] = np.array([])

            empty_bins = [
                k for k, v in self.in_sample_residuals_by_bin_.items()
                if v.size == 0
            ]
            if empty_bins:
                empty_bin_size = min(max_sample, len(residuals))
                for k in empty_bins:
                    self.in_sample_residuals_by_bin_[k] = rng.choice(
                        a=residuals,
                        size=empty_bin_size,
                        replace=False
                    )

        if len(residuals) > 10_000:
            residuals = residuals[
                rng.integers(low=0, high=len(residuals), size=10_000)
            ]

        self.in_sample_residuals_ = residuals
Past values needed to select the last equivalent dates according to
the offset. If last_window = None, the values stored in
self.last_window_ are used and the predictions start immediately
after the training data.
If True, the input is checked for possible warnings and errors
with the check_predict_input function. This argument is intended for
internal use and changing it is not recommended. Default: True.
exog : Ignored
Not used, present here for API consistency by convention. Default: None.

Returns:
predictions : pandas Series
Predicted values.
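A hedged usage sketch of predict; the forecaster and series are the same kind of synthetic setup used in the fit example above:

import numpy as np
import pandas as pd
from skforecast.recursive import ForecasterEquivalentDate

y = pd.Series(
    np.arange(60, dtype=float),
    index=pd.date_range("2024-01-01", periods=60, freq="D")
)
forecaster = ForecasterEquivalentDate(offset=7, n_offsets=1)
forecaster.fit(y=y)

forecaster.predict(steps=14)                        # continues right after the training data
forecaster.predict(steps=14, last_window=y[-14:])   # predictions start after last_window's end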
Source code in skforecast\recursive\_forecaster_equivalent_date.py
def predict(
    self,
    steps: int,
    last_window: pd.Series | None = None,
    check_inputs: bool = True,
    exog: Any = None
) -> pd.Series:
    """
    Predict n steps ahead.

    Parameters
    ----------
    steps : int
        Number of steps to predict.
    last_window : pandas Series, default None
        Past values needed to select the last equivalent dates according to
        the offset. If `last_window = None`, the values stored in
        `self.last_window_` are used and the predictions start immediately
        after the training data.
    check_inputs : bool, default True
        If `True`, the input is checked for possible warnings and errors
        with the `check_predict_input` function. This argument is created
        for internal use and is not recommended to be changed.
    exog : Ignored
        Not used, present here for API consistency by convention.

    Returns
    -------
    predictions : pandas Series
        Predicted values.

    """

    if last_window is None:
        last_window = self.last_window_

    if check_inputs:
        check_predict_input(
            forecaster_name=type(self).__name__,
            steps=steps,
            is_fitted=self.is_fitted,
            exog_in_=False,
            index_type_=self.index_type_,
            index_freq_=self.index_freq_,
            window_size=self.window_size,
            last_window=last_window
        )

    prediction_index = expand_index(index=last_window.index, steps=steps)

    if isinstance(self.offset, int):
        last_window_values = last_window.to_numpy(copy=True).ravel()
        equivalent_indexes = np.tile(
            np.arange(-self.offset, 0),
            int(np.ceil(steps / self.offset))
        )
        equivalent_indexes = equivalent_indexes[:steps]

        if self.n_offsets == 1:
            equivalent_values = last_window_values[equivalent_indexes]
            predictions = equivalent_values.ravel()

        if self.n_offsets > 1:
            equivalent_indexes = [
                equivalent_indexes - n * self.offset
                for n in np.arange(self.n_offsets)
            ]
            equivalent_indexes = np.vstack(equivalent_indexes)
            equivalent_values = last_window_values[equivalent_indexes]
            predictions = np.apply_along_axis(
                self.agg_func,
                axis=0,
                arr=equivalent_values
            )

        predictions = pd.Series(
            data=predictions,
            index=prediction_index,
            name='pred'
        )

    if isinstance(self.offset, pd.tseries.offsets.DateOffset):
        last_window = last_window.copy()
        max_allowed_date = last_window.index[-1]

        # For every date in prediction_index, calculate the n offsets
        offset_dates = []
        for date in prediction_index:
            selected_offsets = []
            while len(selected_offsets) < self.n_offsets:
                offset_date = date - self.offset
                if offset_date <= max_allowed_date:
                    selected_offsets.append(offset_date)
                date = offset_date
            offset_dates.append(selected_offsets)

        offset_dates = np.array(offset_dates)

        # Select the values of the time series corresponding to the each
        # offset date. If the offset date is not in the time series, the
        # value is set to NaN.
        equivalent_values = (
            last_window
            .reindex(offset_dates.ravel())
            .to_numpy()
            .reshape(-1, self.n_offsets)
        )
        equivalent_values = pd.DataFrame(
            data=equivalent_values,
            index=prediction_index,
            columns=[f'offset_{i}' for i in range(self.n_offsets)]
        )

        # Error if all values are missing
        if equivalent_values.isnull().all().all():
            raise ValueError(
                f"All equivalent values are missing. This is caused by using "
                f"an offset ({self.offset}) larger than the available data. "
                f"Try to decrease the size of the offset ({self.offset}), "
                f"the number of `n_offsets` ({self.n_offsets}) or increase the "
                f"size of `last_window`. In backtesting, this error may be "
                f"caused by using an `initial_train_size` too small."
            )

        # Warning if equivalent values are missing
        incomplete_offsets = equivalent_values.isnull().any(axis=1)
        incomplete_offsets = incomplete_offsets[incomplete_offsets].index
        if not incomplete_offsets.empty:
            warnings.warn(
                f"Steps: {incomplete_offsets.strftime('%Y-%m-%d').to_list()} "
                f"are calculated with less than {self.n_offsets} `n_offsets`. "
                f"To avoid this, increase the `last_window` size or decrease "
                f"the number of `n_offsets`. The current configuration requires "
                f"a total offset of {self.offset * self.n_offsets}.",
                MissingValuesWarning
            )

        aggregate_values = equivalent_values.apply(self.agg_func, axis=1)
        predictions = aggregate_values.rename('pred')

    return predictions
Predict n steps ahead and estimate prediction intervals using conformal
prediction method. Refer to the References section for additional
details on this method.
Past values needed to select the last equivalent dates according to
the offset. If last_window = None, the values stored in
self.last_window_ are used and the predictions start immediately
after the training data.
Confidence level of the prediction interval. Interpretation depends
on the method used:
If float, represents the nominal (expected) coverage (between 0
and 1). For instance, interval=0.95 corresponds to [2.5, 97.5]
percentiles.
If list or tuple, defines the exact percentiles to compute, which
must be between 0 and 100 inclusive. For example, a 95% interval
should be given as interval = [2.5, 97.5].
When using method='conformal', the interval must be a float or
a list/tuple defining a symmetric interval.
If True, residuals from the training data are used as a proxy of the
prediction error to build the prediction intervals.
If False, out of sample residuals (calibration) are used.
Out-of-sample residuals must be precomputed using Forecaster's
set_out_sample_residuals() method.
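A hedged sketch of predict_interval with in-sample, binned residuals; the setup is synthetic and the keyword values shown are the documented defaults:

import numpy as np
import pandas as pd
from skforecast.recursive import ForecasterEquivalentDate

y = pd.Series(
    np.random.default_rng(2).normal(loc=100, scale=5, size=200),
    index=pd.date_range("2024-01-01", periods=200, freq="D")
)
forecaster = ForecasterEquivalentDate(offset=7, n_offsets=2)
forecaster.fit(y=y, store_in_sample_residuals=True)   # residuals needed for the intervals

intervals = forecaster.predict_interval(
    steps=7,
    method="conformal",
    interval=[5, 95],                 # 90% nominal coverage, symmetric
    use_in_sample_residuals=True,
    use_binned_residuals=True
)
# DataFrame with columns: pred, lower_bound, upper_bound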
def predict_interval(
    self,
    steps: int,
    last_window: pd.Series | None = None,
    method: str = 'conformal',
    interval: float | list[float] | tuple[float] = [5, 95],
    use_in_sample_residuals: bool = True,
    use_binned_residuals: bool = True,
    random_state: Any = None,
    exog: Any = None,
    n_boot: Any = None
) -> pd.DataFrame:
    """
    Predict n steps ahead and estimate prediction intervals using conformal
    prediction method. Refer to the References section for additional
    details on this method.

    Parameters
    ----------
    steps : int
        Number of steps to predict.
    last_window : pandas Series, default None
        Past values needed to select the last equivalent dates according to
        the offset. If `last_window = None`, the values stored in
        `self.last_window_` are used and the predictions start immediately
        after the training data.
    method : str, default 'conformal'
        Technique used to estimate prediction intervals. Available options:

        - 'conformal': Employs the conformal prediction split method for
        interval estimation [1]_.
    interval : float, list, tuple, default [5, 95]
        Confidence level of the prediction interval. Interpretation depends
        on the method used:

        - If `float`, represents the nominal (expected) coverage (between 0
        and 1). For instance, `interval=0.95` corresponds to `[2.5, 97.5]`
        percentiles.
        - If `list` or `tuple`, defines the exact percentiles to compute, which
        must be between 0 and 100 inclusive. For example, interval
        of 95% should be as `interval = [2.5, 97.5]`.
        - When using `method='conformal'`, the interval must be a float or
        a list/tuple defining a symmetric interval.
    use_in_sample_residuals : bool, default True
        If `True`, residuals from the training data are used as proxy of
        prediction error to create predictions.
        If `False`, out of sample residuals (calibration) are used.
        Out-of-sample residuals must be precomputed using Forecaster's
        `set_out_sample_residuals()` method.
    use_binned_residuals : bool, default True
        If `True`, residuals are selected based on the predicted values
        (binned selection).
        If `False`, residuals are selected randomly.
    random_state : Ignored
        Not used, present here for API consistency by convention.
    exog : Ignored
        Not used, present here for API consistency by convention.
    n_boot : Ignored
        Not used, present here for API consistency by convention.

    Returns
    -------
    predictions : pandas DataFrame
        Values predicted by the forecaster and their estimated interval.

        - pred: predictions.
        - lower_bound: lower bound of the interval.
        - upper_bound: upper bound of the interval.

    References
    ----------
    .. [1] MAPIE - Model Agnostic Prediction Interval Estimator.
           https://mapie.readthedocs.io/en/stable/theoretical_description_regression.html#the-split-method

    """

    if method != 'conformal':
        raise ValueError(
            f"Method '{method}' is not supported. Only 'conformal' is available."
        )

    if last_window is None:
        last_window = self.last_window_

    check_predict_input(
        forecaster_name=type(self).__name__,
        steps=steps,
        is_fitted=self.is_fitted,
        exog_in_=False,
        index_type_=self.index_type_,
        index_freq_=self.index_freq_,
        window_size=self.window_size,
        last_window=last_window
    )

    check_residuals_input(
        forecaster_name=type(self).__name__,
        use_in_sample_residuals=use_in_sample_residuals,
        in_sample_residuals_=self.in_sample_residuals_,
        out_sample_residuals_=self.out_sample_residuals_,
        use_binned_residuals=use_binned_residuals,
        in_sample_residuals_by_bin_=self.in_sample_residuals_by_bin_,
        out_sample_residuals_by_bin_=self.out_sample_residuals_by_bin_
    )

    if isinstance(interval, (list, tuple)):
        check_interval(interval=interval, ensure_symmetric_intervals=True)
        nominal_coverage = (interval[1] - interval[0]) / 100
    else:
        check_interval(alpha=interval, alpha_literal='interval')
        nominal_coverage = interval

    if use_in_sample_residuals:
        residuals = self.in_sample_residuals_
        residuals_by_bin = self.in_sample_residuals_by_bin_
    else:
        residuals = self.out_sample_residuals_
        residuals_by_bin = self.out_sample_residuals_by_bin_

    prediction_index = expand_index(index=last_window.index, steps=steps)

    if isinstance(self.offset, int):
        last_window_values = last_window.to_numpy(copy=True).ravel()
        equivalent_indexes = np.tile(
            np.arange(-self.offset, 0),
            int(np.ceil(steps / self.offset))
        )
        equivalent_indexes = equivalent_indexes[:steps]

        if self.n_offsets == 1:
            equivalent_values = last_window_values[equivalent_indexes]
            predictions = equivalent_values.ravel()

        if self.n_offsets > 1:
            equivalent_indexes = [
                equivalent_indexes - n * self.offset
                for n in np.arange(self.n_offsets)
            ]
            equivalent_indexes = np.vstack(equivalent_indexes)
            equivalent_values = last_window_values[equivalent_indexes]
            predictions = np.apply_along_axis(
                self.agg_func,
                axis=0,
                arr=equivalent_values
            )

    if isinstance(self.offset, pd.tseries.offsets.DateOffset):
        last_window = last_window.copy()
        max_allowed_date = last_window.index[-1]

        # For every date in prediction_index, calculate the n offsets
        offset_dates = []
        for date in prediction_index:
            selected_offsets = []
            while len(selected_offsets) < self.n_offsets:
                offset_date = date - self.offset
                if offset_date <= max_allowed_date:
                    selected_offsets.append(offset_date)
                date = offset_date
            offset_dates.append(selected_offsets)

        offset_dates = np.array(offset_dates)

        # Select the values of the time series corresponding to the each
        # offset date. If the offset date is not in the time series, the
        # value is set to NaN.
        equivalent_values = (
            last_window
            .reindex(offset_dates.ravel())
            .to_numpy()
            .reshape(-1, self.n_offsets)
        )
        equivalent_values = pd.DataFrame(
            data=equivalent_values,
            index=prediction_index,
            columns=[f'offset_{i}' for i in range(self.n_offsets)]
        )

        # Error if all values are missing
        if equivalent_values.isnull().all().all():
            raise ValueError(
                f"All equivalent values are missing. This is caused by using "
                f"an offset ({self.offset}) larger than the available data. "
                f"Try to decrease the size of the offset ({self.offset}), "
                f"the number of `n_offsets` ({self.n_offsets}) or increase the "
                f"size of `last_window`. In backtesting, this error may be "
                f"caused by using an `initial_train_size` too small."
            )

        # Warning if equivalent values are missing
        incomplete_offsets = equivalent_values.isnull().any(axis=1)
        incomplete_offsets = incomplete_offsets[incomplete_offsets].index
        if not incomplete_offsets.empty:
            warnings.warn(
                f"Steps: {incomplete_offsets.strftime('%Y-%m-%d').to_list()} "
                f"are calculated with less than {self.n_offsets} `n_offsets`. "
                f"To avoid this, increase the `last_window` size or decrease "
                f"the number of `n_offsets`. The current configuration requires "
                f"a total offset of {self.offset * self.n_offsets}.",
                MissingValuesWarning
            )

        aggregate_values = equivalent_values.apply(self.agg_func, axis=1)
        predictions = aggregate_values.to_numpy()

    if use_binned_residuals:
        correction_factor_by_bin = {
            k: np.quantile(np.abs(v), nominal_coverage)
            for k, v in residuals_by_bin.items()
        }
        replace_func = np.vectorize(lambda x: correction_factor_by_bin[x])
        predictions_bin = self.binner.transform(predictions)
        correction_factor = replace_func(predictions_bin)
    else:
        correction_factor = np.quantile(np.abs(residuals), nominal_coverage)

    lower_bound = predictions - correction_factor
    upper_bound = predictions + correction_factor
    predictions = np.column_stack([predictions, lower_bound, upper_bound])

    predictions = pd.DataFrame(
        data=predictions,
        index=prediction_index,
        columns=["pred", "lower_bound", "upper_bound"]
    )

    return predictions
Set in-sample residuals in case they were not calculated during the
training process.
In-sample residuals are calculated as the difference between the true
values and the predictions made by the forecaster using the training
data. The following internal attributes are updated:
in_sample_residuals_: residuals stored in a numpy ndarray.
binner_intervals_: intervals used to bin the residuals; they are
calculated from the quantiles of the predicted values.
in_sample_residuals_by_bin_: residuals are binned according to the
predicted value they are associated with and stored in a dictionary, where
the keys are the intervals of the predicted values and the values are
the residuals associated with that range.
A total of 10_000 residuals are stored in the attribute in_sample_residuals_.
If the number of residuals is greater than 10_000, a random sample of
10_000 residuals is stored. The number of residuals stored per bin is
limited to 10_000 // self.binner.n_bins_.
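A hedged sketch of the typical use case, a forecaster fitted without storing residuals; per the index check performed by this method, y must span the same index range used during training:

import numpy as np
import pandas as pd
from skforecast.recursive import ForecasterEquivalentDate

y = pd.Series(
    np.random.default_rng(3).normal(size=100),
    index=pd.date_range("2024-01-01", periods=100, freq="D")
)
forecaster = ForecasterEquivalentDate(offset=7)
forecaster.fit(y=y, store_in_sample_residuals=False)

# Compute and bin the in-sample residuals afterwards
forecaster.set_in_sample_residuals(y=y)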
def set_in_sample_residuals(
    self,
    y: pd.Series,
    random_state: int = 123,
    exog: Any = None
) -> None:
    """
    Set in-sample residuals in case they were not calculated during the
    training process.

    In-sample residuals are calculated as the difference between the true
    values and the predictions made by the forecaster using the training
    data. The following internal attributes are updated:

    + `in_sample_residuals_`: residuals stored in a numpy ndarray.
    + `binner_intervals_`: intervals used to bin the residuals are calculated
    using the quantiles of the predicted values.
    + `in_sample_residuals_by_bin_`: residuals are binned according to the
    predicted value they are associated with and stored in a dictionary, where
    the keys are the intervals of the predicted values and the values are
    the residuals associated with that range.

    A total of 10_000 residuals are stored in the attribute `in_sample_residuals_`.
    If the number of residuals is greater than 10_000, a random sample of
    10_000 residuals is stored. The number of residuals stored per bin is
    limited to `10_000 // self.binner.n_bins_`.

    Parameters
    ----------
    y : pandas Series
        Training time series.
    random_state : int, default 123
        Sets a seed to the random sampling for reproducible output.
    exog : Ignored
        Not used, present here for API consistency by convention.

    Returns
    -------
    None

    """

    if not self.is_fitted:
        raise NotFittedError(
            "This forecaster is not fitted yet. Call `fit` with appropriate "
            "arguments before using `set_in_sample_residuals()`."
        )

    check_y(y=y)
    y_index_range = check_extract_values_and_index(
        data=y, data_label='`y`', return_values=False
    )[1][[0, -1]]
    if not y_index_range.equals(self.training_range_):
        raise IndexError(
            f"The index range of `y` does not match the range "
            f"used during training. Please ensure the index is aligned "
            f"with the training data.\n"
            f"    Expected : {self.training_range_}\n"
            f"    Received : {y_index_range}"
        )

    self._binning_in_sample_residuals(
        y=y,
        store_in_sample_residuals=True,
        random_state=random_state
    )
Set new values to the attribute out_sample_residuals_. Out of sample
residuals are meant to be calculated using observations that did not
participate in the training process. Two internal attributes are updated:
out_sample_residuals_: residuals stored in a numpy ndarray.
out_sample_residuals_by_bin_: residuals are binned according to the
predicted value they are associated with and stored in a dictionary, where
the keys are the intervals of the predicted values and the values are
the residuals associated with that range. If a bin is empty, it is
filled with a random sample of residuals from the other bins. This is done
to ensure that all bins have at least one residual and can be used in the
prediction process.
A total of 10_000 residuals are stored in the attribute out_sample_residuals_.
If the number of residuals is greater than 10_000, a random sample of
10_000 residuals is stored. The number of residuals stored per bin is
limited to 10_000 // self.binner.n_bins_.
Parameters:
y_true : numpy ndarray, pandas Series
True values of the time series from which the residuals have been
calculated.
y_pred : numpy ndarray, pandas Series
Predicted values of the time series.
append : bool, default False
If True, new residuals are added to the ones already stored in the
forecaster. If, after appending the new residuals, the limit of
10_000 // self.binner.n_bins_ values per bin is reached, a random
sample of residuals is stored.
random_state : int, default 123
Sets a seed to the random sampling for reproducible output.
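A hedged sketch of the calibration workflow: residuals are computed on a hold-out period not used during training and then passed to this method so that predict_interval can use them. The data split and variable names are illustrative:

import numpy as np
import pandas as pd
from skforecast.recursive import ForecasterEquivalentDate

y = pd.Series(
    np.random.default_rng(4).normal(loc=50, scale=3, size=200),
    index=pd.date_range("2024-01-01", periods=200, freq="D")
)
y_train, y_val = y[:150], y[150:]

forecaster = ForecasterEquivalentDate(offset=7, n_offsets=2)
forecaster.fit(y=y_train)

# Residuals from data not seen during training (a hold-out period)
y_pred_val = forecaster.predict(steps=len(y_val))
forecaster.set_out_sample_residuals(y_true=y_val, y_pred=y_pred_val)

# Intervals can now be built from the calibration residuals
intervals = forecaster.predict_interval(steps=7, use_in_sample_residuals=False)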
def set_out_sample_residuals(
    self,
    y_true: np.ndarray | pd.Series,
    y_pred: np.ndarray | pd.Series,
    append: bool = False,
    random_state: int = 123
) -> None:
    """
    Set new values to the attribute `out_sample_residuals_`. Out of sample
    residuals are meant to be calculated using observations that did not
    participate in the training process. Two internal attributes are updated:

    + `out_sample_residuals_`: residuals stored in a numpy ndarray.
    + `out_sample_residuals_by_bin_`: residuals are binned according to the
    predicted value they are associated with and stored in a dictionary, where
    the keys are the intervals of the predicted values and the values are
    the residuals associated with that range. If a bin binning is empty, it
    is filled with a random sample of residuals from other bins. This is done
    to ensure that all bins have at least one residual and can be used in the
    prediction process.

    A total of 10_000 residuals are stored in the attribute `out_sample_residuals_`.
    If the number of residuals is greater than 10_000, a random sample of
    10_000 residuals is stored. The number of residuals stored per bin is
    limited to `10_000 // self.binner.n_bins_`.

    Parameters
    ----------
    y_true : numpy ndarray, pandas Series
        True values of the time series from which the residuals have been
        calculated.
    y_pred : numpy ndarray, pandas Series
        Predicted values of the time series.
    append : bool, default False
        If `True`, new residuals are added to the once already stored in the
        forecaster. If after appending the new residuals, the limit of
        `10_000 // self.binner.n_bins_` values per bin is reached, a random
        sample of residuals is stored.
    random_state : int, default 123
        Sets a seed to the random sampling for reproducible output.

    Returns
    -------
    None

    """

    if not self.is_fitted:
        raise NotFittedError(
            "This forecaster is not fitted yet. Call `fit` with appropriate "
            "arguments before using `set_out_sample_residuals()`."
        )

    if not isinstance(y_true, (np.ndarray, pd.Series)):
        raise TypeError(
            f"`y_true` argument must be `numpy ndarray` or `pandas Series`. "
            f"Got {type(y_true)}."
        )

    if not isinstance(y_pred, (np.ndarray, pd.Series)):
        raise TypeError(
            f"`y_pred` argument must be `numpy ndarray` or `pandas Series`. "
            f"Got {type(y_pred)}."
        )

    if len(y_true) != len(y_pred):
        raise ValueError(
            f"`y_true` and `y_pred` must have the same length. "
            f"Got {len(y_true)} and {len(y_pred)}."
        )

    if isinstance(y_true, pd.Series) and isinstance(y_pred, pd.Series):
        if not y_true.index.equals(y_pred.index):
            raise ValueError("`y_true` and `y_pred` must have the same index.")

    if not isinstance(y_pred, np.ndarray):
        y_pred = y_pred.to_numpy()
    if not isinstance(y_true, np.ndarray):
        y_true = y_true.to_numpy()

    data = pd.DataFrame(
        {'prediction': y_pred, 'residuals': y_true - y_pred}
    ).dropna()
    y_pred = data['prediction'].to_numpy()
    residuals = data['residuals'].to_numpy()

    data['bin'] = self.binner.transform(y_pred).astype(int)
    residuals_by_bin = data.groupby('bin')['residuals'].apply(np.array).to_dict()

    out_sample_residuals = (
        np.array([])
        if self.out_sample_residuals_ is None
        else self.out_sample_residuals_
    )
    out_sample_residuals_by_bin = (
        {}
        if self.out_sample_residuals_by_bin_ is None
        else self.out_sample_residuals_by_bin_
    )
    if append:
        out_sample_residuals = np.concatenate([out_sample_residuals, residuals])
        for k, v in residuals_by_bin.items():
            if k in out_sample_residuals_by_bin:
                out_sample_residuals_by_bin[k] = np.concatenate(
                    (out_sample_residuals_by_bin[k], v)
                )
            else:
                out_sample_residuals_by_bin[k] = v
    else:
        out_sample_residuals = residuals
        out_sample_residuals_by_bin = residuals_by_bin

    max_samples = 10_000 // self.binner.n_bins_
    rng = np.random.default_rng(seed=random_state)
    for k, v in out_sample_residuals_by_bin.items():
        if len(v) > max_samples:
            sample = rng.choice(a=v, size=max_samples, replace=False)
            out_sample_residuals_by_bin[k] = sample

    bin_keys = (
        [] if self.binner_intervals_ is None else self.binner_intervals_.keys()
    )
    for k in bin_keys:
        if k not in out_sample_residuals_by_bin:
            out_sample_residuals_by_bin[k] = np.array([])

    empty_bins = [
        k for k, v in out_sample_residuals_by_bin.items() if v.size == 0
    ]
    if empty_bins:
        warnings.warn(
            f"The following bins have no out of sample residuals: {empty_bins}. "
            f"No predicted values fall in the interval "
            f"{[self.binner_intervals_[bin] for bin in empty_bins]}. "
            f"Empty bins will be filled with a random sample of residuals.",
            ResidualsUsageWarning
        )
        empty_bin_size = min(max_samples, len(out_sample_residuals))
        for k in empty_bins:
            out_sample_residuals_by_bin[k] = rng.choice(
                a=out_sample_residuals,
                size=empty_bin_size,
                replace=False
            )

    if len(out_sample_residuals) > 10_000:
        out_sample_residuals = rng.choice(
            a=out_sample_residuals,
            size=10_000,
            replace=False
        )

    self.out_sample_residuals_ = out_sample_residuals
    self.out_sample_residuals_by_bin_ = out_sample_residuals_by_bin
Source code in skforecast\recursive\_forecaster_equivalent_date.py
def get_tags(self) -> dict[str, Any]:
    """
    Return the tags that characterize the behavior of the forecaster.

    Returns
    -------
    skforecast_tags : dict
        Dictionary with forecaster tags.

    """

    return self.__skforecast_tags__
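A hedged sketch of querying the tags; the keys and values shown as comments follow the __skforecast_tags__ dictionary defined in __init__ above:

from skforecast.recursive import ForecasterEquivalentDate

forecaster = ForecasterEquivalentDate(offset=7)
tags = forecaster.get_tags()
tags["supports_exog"]           # False
tags["prediction_types"]        # ['point', 'interval']
tags["probabilistic_methods"]   # ['conformal']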