skforecast.experimental._splitter.TimeSeriesSplitter

TimeSeriesSplitter(*series)

A utility class for splitting time series data into training, validation, and testing sets for machine learning algorithms.

This class provides flexible splitting strategies supporting multiple input formats (wide DataFrame, long DataFrame with MultiIndex, or dictionary of Series), both DatetimeIndex and RangeIndex, and flexible output formats.

New in this version: Support for multiple series arguments with independent splitting behavior. Each series can have different lengths and date ranges.

Parameters:

*series : DataFrame | dict[str, Series | DataFrame]
    One or more time series data to split. Each can be:
    - Wide format pandas DataFrame with DatetimeIndex or RangeIndex
    - Long format pandas DataFrame with MultiIndex (series_id, datetime)
    - Dictionary of pandas Series or DataFrames with identical indexes

    When multiple series are provided, they are treated independently and
    splits are returned as a list of tuples (one tuple per series group).

Attributes:

series_groups_ : list[dict[str, Series]]
    List of series dictionaries, one per input argument.
series_indexes_ : list[dict[str, Index]]
    List of index dictionaries, one per series group.
n_groups_ : int
    Number of series groups (number of *series arguments).
index_types_ : list[type]
    Index type for each group (pd.DatetimeIndex or pd.RangeIndex).
index_freqs_ : list[str | int | None]
    Frequency (for DatetimeIndex) or step (for RangeIndex) for each group.
skforecast_version : str
    Version of the skforecast library used to create the splitter.
python_version : str
    Version of Python used to create the splitter.

Raises:

ValueError
    If no series is provided or a series has an invalid format.
TypeError
    If inputs are not in a supported format.

Examples:

>>> import pandas as pd
>>> from skforecast.experimental._splitter import TimeSeriesSplitter
>>> # Single series (backward compatible)
>>> df1 = pd.DataFrame(
...     {'series_a': range(100), 'series_b': range(100, 200)},
...     index=pd.date_range('2023-01-01', periods=100, freq='d')
... )
>>> splitter = TimeSeriesSplitter(df1)
>>> train_set, test_set = splitter.split_by_date(
...     end_train='2023-03-11',
...     output_format='wide'
... )
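The boundary semantics behind `split_by_date` are label-inclusive: the `end_train` date belongs to the training set and the test set starts one step later. As a hedged sketch (plain pandas only, no splitter involved), the same cut can be reproduced with label slicing, which is handy for sanity-checking expected split sizes:

```python
import pandas as pd

# Build the same 100-day frame as in the example above.
df = pd.DataFrame(
    {'series_a': range(100), 'series_b': range(100, 200)},
    index=pd.date_range('2023-01-01', periods=100, freq='D'),
)

end_train = '2023-03-11'
train = df.loc[:end_train]                                      # end date included
test = df.loc[pd.Timestamp(end_train) + pd.Timedelta(days=1):]  # starts the next day
```

Here `train` covers 70 rows (2023-01-01 through 2023-03-11) and `test` the remaining 30.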

Initialize TimeSeriesSplitter with one or more series.

Parameters:

*series : DataFrame | dict[str, Series | DataFrame]
    One or more time series data in supported formats.

Raises:

ValueError
    If no series is provided or a series has an invalid format.
TypeError
    If series are not in a supported format.

Methods:

split_by_date
    Split time series based on date ranges.
split_by_size
    Split time series based on size (absolute or proportional).

Source code in skforecast\experimental\_splitter.py
def __init__(
    self, *series: pd.DataFrame | dict[str, pd.Series | pd.DataFrame]
) -> None:
    """
    Initialize TimeSeriesSplitter with one or more series.

    Parameters
    ----------
    *series : pd.DataFrame | dict[str, pd.Series | pd.DataFrame]
        One or more time series data in supported formats.

    Raises
    ------
    ValueError
        If no series provided or series have invalid format.
    TypeError
        If series are not in a supported format.
    """
    if len(series) == 0:
        raise ValueError('At least one series must be provided.')

    # -- Process each series argument independently
    self.series_groups_ = []
    self.series_indexes_ = []
    self.index_types_ = []
    self.index_freqs_ = []
    self._min_indexes_ = []
    self._max_indexes_ = []

    for i, series_input in enumerate(series):
        # Use inner check_preprocess_series() preprocessing for each group
        series_dict, series_indexes = check_preprocess_series(series_input)

        # -- Store the preprocess series data dict & index dict
        self.series_groups_.append(series_dict)
        self.series_indexes_.append(series_indexes)

        # -- Store index type and frequency information for this group
        first_index = next(iter(series_indexes.values()))
        index_type = type(first_index)
        self.index_types_.append(index_type)

        if isinstance(first_index, pd.DatetimeIndex):
            self.index_freqs_.append(first_index.freq)
            self._min_indexes_.append(
                min([idx.min() for idx in series_indexes.values()])
            )
            self._max_indexes_.append(
                max([idx.max() for idx in series_indexes.values()])
            )
        if isinstance(first_index, pd.RangeIndex):
            self.index_freqs_.append(first_index.step)
            self._min_indexes_.append(0)
            self._max_indexes_.append(len(first_index) - 1)

    # -- Store the number groups/series input
    self.n_groups_ = len(series)
    self.n_timeseries = sum(map(len, self.series_indexes_))

    # -- Store version information
    self.skforecast_version = __version__
    self.python_version = sys.version.split(' ')[0]

Attributes

series_groups_ = []
series_indexes_ = []
index_types_ = []
index_freqs_ = []
_min_indexes_ = []
_max_indexes_ = []
n_groups_ = len(series)
n_timeseries = sum(map(len, self.series_indexes_))
skforecast_version = __version__
python_version = sys.version.split(' ')[0]

Functions

_repr_html_

_repr_html_()

Return HTML representation for Jupyter notebooks.

Returns:

str
    HTML string with embedded CSS styling and object information.

Source code in skforecast\experimental\_splitter.py
def _repr_html_(self) -> str:
    """
    Return HTML representation for Jupyter notebooks.

    Returns
    -------
    str
        HTML string with embedded CSS styling and object information.
    """
    # -- Define a component id
    unique_id = str(uuid.uuid4()).replace('-', '')

    # -- Define render colors
    background_color = '#f0f8ff'
    section_color = '#b3dbfd'

    # -- Define CSS styles
    style = f"""
    <style>
        .container-{unique_id} {{
            font-family: 'Arial', sans-serif;
            font-size: 0.9em;
            color: #333333;
            border: 1px solid #ddd;
            background-color: {background_color};
            padding: 5px 15px;
            border-radius: 8px;
            max-width: 700px;
        }}
        .container-{unique_id} h2 {{
            font-size: 1.5em;
            color: #222222;
            border-bottom: 2px solid #ddd;
            padding-bottom: 5px;
            margin-bottom: 15px;
            margin-top: 5px;
        }}
        .container-{unique_id} details {{
            margin: 10px 0;
        }}
        .container-{unique_id} summary {{
            font-weight: bold;
            font-size: 1.1em;
            color: #000000;
            cursor: pointer;
            margin-bottom: 5px;
            background-color: {section_color};
            padding: 5px;
            border-radius: 5px;
        }}
        .container-{unique_id} summary:hover {{
            color: #000000;
            background-color: #e0e0e0;
        }}
        .container-{unique_id} ul {{
            font-family: 'Courier New', monospace;
            list-style-type: none;
            padding-left: 20px;
            margin: 10px 0;
            line-height: normal;
        }}
        .container-{unique_id} li {{
            margin: 5px 0;
            font-family: 'Courier New', monospace;
        }}
        .container-{unique_id} li strong {{
            font-weight: bold;
            color: #444444;
        }}
        .container-{unique_id} li::before {{
            content: "- ";
            color: #666666;
        }}
        .container-{unique_id} .group-section {{
            margin-left: 20px;
            padding: 5px;
            background-color: #ffffff;
            border-radius: 3px;
            margin-top: 5px;
        }}
    </style>
    """

    # -- Build series groups html content
    groups_html = ''
    for i in range(self.n_groups_):
        index_freq_info = (
            f'<strong>Frequency:</strong> {self.index_freqs_[i].freqstr}'
            if issubclass(self.index_types_[i], pd.DatetimeIndex)
            else f'<strong>Step:</strong> {self.index_freqs_[i]}'
        )
        n_timeseries = len(self.series_groups_[i])

        groups_html += f"""
            <div class="group-section">
                <strong>Group {i}:</strong>
                <ul style="margin-top: 3px;margin-bottom: 3px;">
                    <li><strong>Series count:</strong> {n_timeseries}</li>
                    <li><strong>Index type:</strong> {self.index_types_[i].__name__}</li>
                    <li>{index_freq_info}</li>
                    <li><strong>Range:</strong> {self._min_indexes_[i]} to {self._max_indexes_[i]}</li>
                </ul>
            </div>
        """

    # -- Build global html content
    content = f"""
    <div class="container-{unique_id}">
        <h2>TimeSeriesSplitter</h2>
        <details open>
            <summary>Configuration</summary>
            <ul>
                <li><strong>Number of groups:</strong> {self.n_groups_}</li>
                <li><strong>Number of timeseries:</strong> {self.n_timeseries}</li>
                <li><strong>Supported output formats:</strong> wide, long_multi_index, long, dict</li>
            </ul>
            {groups_html}
        </details>
        <details>
            <summary>Version Information</summary>
            <ul>
                <li><strong>Skforecast version:</strong> {self.skforecast_version}</li>
                <li><strong>Python version:</strong> {self.python_version}</li>
            </ul>
        </details>
    </div>
    """

    return style + content

_convert_date_to_position

_convert_date_to_position(date, index, date_name='date')

Convert a date to its position in the given index.

Parameters:

date : str | Timestamp
    Date to convert.
index : Index
    Index to search in.
date_name : str, default 'date'
    Name of the date parameter (for error messages).

Returns:

int
    Position of the date in the index.

Raises:

ValueError
    If date is outside the valid range.

Source code in skforecast\experimental\_splitter.py
def _convert_date_to_position(
    self,
    date: str | pd.Timestamp,
    index: pd.Index,
    date_name: str = 'date',
) -> int:
    """
    Convert a date to its position in the given index.

    Parameters
    ----------
    date : str | pd.Timestamp
        Date to convert.
    index : pd.Index
        Index to search in.
    date_name : str, default 'date'
        Name of the date parameter (for error messages).

    Returns
    -------
    int
        Position of the date in the index.

    Raises
    ------
    ValueError
        If date is outside the valid range.
    """
    # -- Convert any string date input to Timestamp object
    if isinstance(date, str):
        date = pd.Timestamp(date)

    # -- Raise error if date is not in the required time range
    if date not in index:
        raise ValueError(
            f'{date_name} {date} is not present in the series index. '
            f'Available range: {index[0]} to {index[-1]}.'
        )

    # -- Extract the numeric position
    return index.get_loc(date)
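A minimal illustration of the lookup the method performs: `Index.get_loc` maps a timestamp label to its integer position, and membership (`in`) is the same check used above to validate the date range.

```python
import pandas as pd

# Five consecutive days; positions run 0..4.
idx = pd.date_range('2023-01-01', periods=5, freq='D')

pos = idx.get_loc(pd.Timestamp('2023-01-03'))   # third day -> position 2
present = pd.Timestamp('2023-01-03') in idx     # inside the range
missing = pd.Timestamp('2024-01-01') in idx     # outside -> would raise ValueError above
```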

_validate_date_split_args

_validate_date_split_args(
    group_idx,
    start_train,
    end_train,
    end_validation,
    end_test,
)

Validate and convert date split arguments to positions for a specific group.

Parameters:

group_idx : int
    Index of the series group to validate.

Returns:

tuple[int, int, int | None, int | None]
    Positions for (start_train, end_train, end_validation, end_test).

Source code in skforecast\experimental\_splitter.py
def _validate_date_split_args(
    self,
    group_idx: int,
    start_train: str | pd.Timestamp | None,
    end_train: str | pd.Timestamp,
    end_validation: str | pd.Timestamp | None,
    end_test: str | pd.Timestamp | None,
) -> tuple[int, int, int | None, int | None]:
    """
    Validate and convert date split arguments to positions for a specific group.

    Parameters
    ----------
    group_idx : int
        Index of the series group to validate.

    Returns
    -------
    tuple[int, int, int | None, int | None]
        Positions for (start_train, end_train, end_validation, end_test).
    """
    # -- Force index to be DatetimeIndex object
    if self.index_types_[group_idx] != pd.DatetimeIndex:
        raise TypeError(
            f'Group {group_idx}: `split_by_date` requires `DatetimeIndex` object. '
            f'Current index type: {self.index_types_[group_idx].__name__}. '
            'Consider using `split_by_size` instead.'
        )

    first_index = next(iter(self.series_indexes_[group_idx].values()))

    # -- Convert dates to positions
    if start_train is None:
        start_train_pos = 0
    else:
        start_train_pos = self._convert_date_to_position(
            start_train, first_index, 'start_train'
        )

    end_train_pos = self._convert_date_to_position(
        end_train, first_index, 'end_train'
    )

    if end_validation is None:
        end_validation_pos = end_train_pos
    else:
        end_validation_pos = self._convert_date_to_position(
            end_validation, first_index, 'end_validation'
        )

    if end_test is None:
        end_test_pos = len(first_index) - 1
    else:
        end_test_pos = self._convert_date_to_position(
            end_test, first_index, 'end_test'
        )

    # -- Validate position order
    if start_train_pos >= end_train_pos:
        raise ValueError(
            f'Group {group_idx}: start_train must be earlier than end_train. '
            f'Got start_train={first_index[start_train_pos]}, '
            f'end_train={first_index[end_train_pos]}.'
        )

    if end_train_pos > end_validation_pos:
        raise ValueError(
            f'Group {group_idx}: end_train must be earlier than or equal to end_validation. '
            f'Got end_train={first_index[end_train_pos]}, '
            f'end_validation={first_index[end_validation_pos]}.'
        )

    if end_validation_pos > end_test_pos:
        raise ValueError(
            f'Group {group_idx}: end_validation must be earlier than or equal to end_test. '
            f'Got end_validation={first_index[end_validation_pos]}, '
            f'end_test={first_index[end_test_pos]}.'
        )

    return start_train_pos, end_train_pos, end_validation_pos, end_test_pos

_convert_size

_convert_size(size, size_name, total_len, group_idx=None)

Convert a size specification to an absolute integer count.

This method handles both absolute (integer) and proportional (float) size specifications, validating the input and converting proportions to actual counts based on the total length.

Parameters:

size : int | float | None
    Size specification to convert:
    - If int: absolute count (returned as-is after validation)
    - If float: proportion of total_len (must be between 0 and 1)
    - If None: returns None (indicates no size specified)
size_name : str
    Name of the size parameter (e.g., 'train_size', 'validation_size').
    Used in error messages for clarity.
total_len : int
    Total length of the series against which proportions are calculated.
group_idx : int | None, default None
    Index of the series group being processed. If provided, it is included
    in error messages for multi-group scenarios.

Returns:

int | None
    Absolute count as an integer, or None if size was None. For float
    inputs, uses ceiling to ensure at least the requested proportion is
    included.

Source code in skforecast\experimental\_splitter.py
def _convert_size(
    self,
    size: int | float | None,
    size_name: str,
    total_len: int,
    group_idx: int | None = None,
) -> int | None:
    """
    Convert a size specification to an absolute integer count.

    This method handles both absolute (integer) and proportional (float)
    size specifications, validating the input and converting proportions
    to actual counts based on the total length.

    Parameters
    ----------
    size : int | float | None
        Size specification to convert:
        - If int: Absolute count (returned as-is after validation)
        - If float: Proportion of total_len (must be between 0 and 1)
        - If None: Returns None (indicates no size specified)
    size_name : str
        Name of the size parameter (e.g., 'train_size', 'validation_size').
        Used in error messages for clarity.
    total_len : int
        Total length of the series against which proportions are calculated.
    group_idx : int | None, default None
        Index of the series group being processed. If provided, it's included
        in error messages for multi-group scenarios.

    Returns
    -------
    int | None
        Absolute count as integer, or None if size was None.
        For float inputs, uses ceiling to ensure at least the requested
        proportion is included.
    """
    if size is None:
        return None

    # -- Build error message prefix with optional group index
    error_prefix = f'Group {group_idx}: ' if group_idx is not None else ''

    if isinstance(size, float):
        # -- Validate proportion is in valid range
        if not 0 < size < 1:
            raise ValueError(
                f'{error_prefix}{size_name} proportion must be between 0 and 1. '
                f'Got {size}.'
            )
        # -- Convert proportion to count using ceiling to ensure minimum coverage
        return int(np.ceil(size * total_len))

    # -- Return integer size as-is
    return int(size)
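The conversion rule can be sketched standalone (this is an illustration, not the library implementation): floats are proportions rounded up with `ceil`, ints pass through unchanged, and `None` means "no size specified".

```python
import numpy as np

def convert_size(size, total_len):
    # None means the size was not specified.
    if size is None:
        return None
    # Floats are proportions of total_len, rounded up with ceil.
    if isinstance(size, float):
        if not 0 < size < 1:
            raise ValueError(f'proportion must be between 0 and 1, got {size}')
        return int(np.ceil(size * total_len))
    # Integers are absolute counts.
    return int(size)
```

For example, `convert_size(0.75, 10)` returns 8 (ceil of 7.5), so the requested proportion is never undershot.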

_validate_size_split_args

_validate_size_split_args(
    group_idx, train_size, validation_size, test_size
)

Validate and convert size split arguments for a specific group.

Parameters:

group_idx : int
    Index of the series group to validate.

Returns:

tuple[int, int | None, int | None]
    Absolute sizes for (train, validation, test).

Source code in skforecast\experimental\_splitter.py
def _validate_size_split_args(
    self,
    group_idx: int,
    train_size: int | float,
    validation_size: int | float | None,
    test_size: int | float | None,
) -> tuple[int, int | None, int | None]:
    """
    Validate and convert size split arguments for a specific group.

    Parameters
    ----------
    group_idx : int
        Index of the series group to validate.

    Returns
    -------
    tuple[int, int | None, int | None]
        Absolute sizes for (train, validation, test).
    """
    # -- Extract sample index (first one)
    first_index = next(iter(self.series_indexes_[group_idx].values()))
    total_len = len(first_index)

    # Convert all sizes using the helper method
    train_count = self._convert_size(train_size, 'train_size', total_len, group_idx)
    validation_count = self._convert_size(
        validation_size, 'validation_size', total_len, group_idx
    )
    test_count = self._convert_size(test_size, 'test_size', total_len, group_idx)

    # -- Validate total doesn't exceed series length
    total_requested = train_count
    if validation_count is not None:
        total_requested += validation_count
    if test_count is not None:
        total_requested += test_count

    if total_requested > total_len:
        raise ValueError(
            f'Group {group_idx}: Sum of requested sizes ({total_requested}) '
            f'exceeds series length ({total_len}). '
            f'Got train_size={train_count}, validation_size={validation_count}, '
            f'test_size={test_count}.'
        )

    # -- Return set sample counts
    return train_count, validation_count, test_count
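The length guard above can be mirrored in standalone form (a sketch, with a hypothetical `check_total` helper not part of the library): once converted to absolute counts, the requested sizes may not exceed the series length.

```python
def check_total(total_len, train, validation=None, test=None):
    # Sum the requested counts, treating unset splits as zero.
    requested = train + (validation or 0) + (test or 0)
    if requested > total_len:
        raise ValueError(
            f'Sum of requested sizes ({requested}) exceeds series length ({total_len}).'
        )
    return requested
```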

_split_series_dict

_split_series_dict(series_dict, positions)

Split a single series dictionary according to positions.

Parameters:

series_dict : dict[str, Series]
    Dictionary of series to split.
positions : dict[str, tuple[int, int]]
    Start and end positions for each split.

Returns:

list[dict[str, Series]]
    List of dictionaries, one for each split.

Source code in skforecast\experimental\_splitter.py
def _split_series_dict(
    self,
    series_dict: dict[str, pd.Series],
    positions: dict[str, tuple[int, int]],
) -> list[dict[str, pd.Series]]:
    """
    Split a single series dictionary according to positions.

    Parameters
    ----------
    series_dict : dict[str, pd.Series]
        Dictionary of series to split.
    positions : dict[str, tuple[int, int]]
        Start and end positions for each split.

    Returns
    -------
    list[dict[str, pd.Series]]
        List of dictionaries, one for each split.
    """
    # -- Collect series ids
    split_names = list(positions.keys())
    split_data = {name: {} for name in split_names}

    for series_name, series in series_dict.items():
        for split_name in split_names:
            start, end = positions[split_name]
            split_data[split_name][series_name] = series.iloc[
                start : end + 1
            ].copy()

    return [split_data[name] for name in split_names]
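The `(start, end)` positions are inclusive on both ends, which is why the slice uses `end + 1`. A minimal reproduction with a single series:

```python
import pandas as pd

s = pd.Series(range(10))
positions = {'train': (0, 6), 'test': (7, 9)}

# Inclusive-end slicing, as in the method above.
splits = {name: s.iloc[start:end + 1].copy()
          for name, (start, end) in positions.items()}
```

`splits['train']` holds positions 0 through 6 (7 values) and `splits['test']` positions 7 through 9 (3 values), with no gap or overlap.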

_convert_output

_convert_output(split_dicts, output_format='wide')

Convert split data to requested output format.

Parameters:

split_dicts : list[dict[str, Series]]
    List of split data as dictionaries.
output_format : {'wide', 'long', 'long_multi_index', 'dict'}, default 'wide'
    Output format.

Returns:

tuple
    Splits in requested format.

Source code in skforecast\experimental\_splitter.py
def _convert_output(
    self,
    split_dicts: list[dict[str, pd.Series]],
    output_format: Literal['wide', 'long', 'long_multi_index', 'dict'] = 'wide',
) -> tuple:
    """
    Convert split data to requested output format.

    Parameters
    ----------
    split_dicts : list[dict[str, pd.Series]]
        List of split data as dictionaries.
    output_format : {'wide', 'long', 'long_multi_index', 'dict'}, default 'wide'
        Output format.

    Returns
    -------
    tuple
        Splits in requested format.
    """
    match output_format:
        case 'dict':
            return tuple(split_dicts)

        case 'wide':
            return tuple(
                pd.DataFrame.from_dict(split_dict) for split_dict in split_dicts
            )

        case 'long' | 'long_multi_index':
            return tuple(
                reshape_series_wide_to_long(
                    pd.DataFrame.from_dict(split_dict),
                    return_multi_index=(output_format == 'long_multi_index'),
                )
                for split_dict in split_dicts
            )

        case _:
            raise ValueError(
                f'Output format `{output_format}` is not supported. '
                f'Choose one of ["wide", "long", "long_multi_index", "dict"].'
            )
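The `'wide'` branch is plain `pd.DataFrame.from_dict`: each series becomes one column on the shared index. (The `'long'` formats additionally pass through skforecast's `reshape_series_wide_to_long`, which is not reproduced in this sketch.)

```python
import pandas as pd

# A split dict as produced by _split_series_dict: one Series per series id.
split_dict = {
    'series_a': pd.Series([1.0, 2.0, 3.0]),
    'series_b': pd.Series([4.0, 5.0, 6.0]),
}

# 'wide' output: series ids become columns.
wide = pd.DataFrame.from_dict(split_dict)
```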

split_by_date

split_by_date(
    end_train,
    start_train=None,
    end_validation=None,
    end_test=None,
    output_format="wide",
    verbose=False,
)

Split time series based on date ranges.

Creates training, validation (optional), and test sets by splitting series at specified date boundaries. Dates are inclusive.

When multiple series groups were provided to the constructor, this method returns a list of tuples (one per group). Each group is split independently based on its own date range.

Parameters:

end_train : str | Timestamp
    Training set end date (inclusive). Required parameter.
start_train : str | Timestamp | None, default None
    Training set start date (inclusive). Defaults to the first date in each group.
end_validation : str | Timestamp | None, default None
    Validation set end date (inclusive). Defaults to end_train if not
    provided (no validation set created).
end_test : str | Timestamp | None, default None
    Test set end date (inclusive). Defaults to the last date in each group.
output_format : {'wide', 'long', 'long_multi_index', 'dict'}, default 'wide'
    Output format for the splits.
verbose : bool, default False
    If True, print detailed split information for each group.

Returns:

list[tuple] | tuple
    If a single series group was provided: a tuple of splits, (train, test)
    or (train, validation, test).
    If multiple series groups were provided: a list of tuples, one per group.

Raises:

TypeError
    If series don't have a DatetimeIndex.
ValueError
    If dates are invalid or outside the available range.

Examples:

>>> # Single group
>>> splitter = TimeSeriesSplitter(df1)
>>> train, test = splitter.split_by_date(end_train='2023-03-11')
>>> # Multiple groups
>>> splitter = TimeSeriesSplitter(df1, df2, df3)
>>> splits = splitter.split_by_date(end_train='2023-03-11')
>>> # splits = [(df1_train, df1_test), (df2_train, df2_test), (df3_train, df3_test)]
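As a hedged sketch of the boundary arithmetic behind a three-way split (pandas only, no splitter): every boundary date is inclusive, and each subsequent set starts one step later.

```python
import pandas as pd

idx = pd.date_range('2023-01-01', periods=100, freq='D')

end_train_pos = idx.get_loc(pd.Timestamp('2023-03-11'))  # position 69
end_val_pos = idx.get_loc(pd.Timestamp('2023-03-31'))    # position 89

# Inclusive (start, end) position pairs, as built inside split_by_date.
positions = {
    'train': (0, end_train_pos),
    'validation': (end_train_pos + 1, end_val_pos),
    'test': (end_val_pos + 1, len(idx) - 1),
}
```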
Source code in skforecast\experimental\_splitter.py
def split_by_date(
    self,
    end_train: str | pd.Timestamp,
    start_train: str | pd.Timestamp | None = None,
    end_validation: str | pd.Timestamp | None = None,
    end_test: str | pd.Timestamp | None = None,
    output_format: Literal['wide', 'long', 'long_multi_index', 'dict'] = 'wide',
    verbose: bool = False,
) -> list[tuple] | tuple:
    """
    Split time series based on date ranges.

    Creates training, validation (optional), and test sets by splitting
    series at specified date boundaries. Dates are inclusive.

    When multiple series groups were provided to the constructor, this method
    returns a list of tuples (one per group). Each group is split independently
    based on its own date range.

    Parameters
    ----------
    end_train : str | pd.Timestamp
        Training set end date (inclusive). Required parameter.
    start_train : str | pd.Timestamp | None, default None
        Training set start date (inclusive). Defaults to first date in each group.
    end_validation : str | pd.Timestamp | None, default None
        Validation set end date (inclusive).
        Defaults to end_train if not provided (no validation set created).
    end_test : str | pd.Timestamp | None, default None
        Test set end date (inclusive).
        Defaults to last date in each group.
    output_format : {'wide', 'long', 'long_multi_index', 'dict'}, default 'wide'
        Output format for the splits.
    verbose : bool, default False
        If True, print detailed split information for each group.

    Returns
    -------
    list[tuple] | tuple
        If single series group: tuple of splits (train, test) or (train, val, test)
        If multiple series groups: list of tuples, one per group

    Raises
    ------
    TypeError
        If series don't have DatetimeIndex.
    ValueError
        If dates are invalid or outside available range.

    Examples
    --------
    >>> # Single group
    >>> splitter = TimeSeriesSplitter(df1)
    >>> train, test = splitter.split_by_date(end_train='2023-03-11')

    >>> # Multiple groups
    >>> splitter = TimeSeriesSplitter(df1, df2, df3)
    >>> splits = splitter.split_by_date(end_train='2023-03-11')
    >>> # splits = [(df1_train, df1_test), (df2_train, df2_test), (df3_train, df3_test)]
    """
    results = []

    for group_idx in range(self.n_groups_):
        # -- Validate and get positions for current group
        start_pos, end_train_pos, end_val_pos, end_test_pos = (
            self._validate_date_split_args(
                group_idx, start_train, end_train, end_validation, end_test
            )
        )

        # -- Define positions split dict
        positions = {'train': (start_pos, end_train_pos)}

        if end_validation is not None:
            positions['validation'] = (end_train_pos + 1, end_val_pos)
            positions['test'] = (end_val_pos + 1, end_test_pos)
        else:
            positions['test'] = (end_train_pos + 1, end_test_pos)

        # -- Perform split on current group
        split_dicts = [
            {k: v for k, v in split_dict.items() if len(v) > 0}
            for split_dict in self._split_series_dict(
                self.series_groups_[group_idx], positions
            )
        ]

        # -- Convert to required output
        result = self._convert_output(split_dicts, output_format)

        if verbose:
            self._print_split_info(group_idx, positions, output_format)

        results.append(result)

    # -- Return single tuple if only one group, otherwise list of tuples
    return results if self.n_groups_ > 1 else results[0]

split_by_size

split_by_size(
    train_size,
    validation_size=None,
    test_size=None,
    output_format="wide",
    verbose=False,
)

Split time series based on size (absolute or proportional).

Creates training, validation (optional), and test sets by splitting series at specified size boundaries. Sizes can be absolute (int) or proportional (float between 0 and 1).

When multiple series groups were provided to the constructor, this method returns a list of tuples (one per group). Each group is split independently based on its own length.

Parameters:

Name Type Description Default
train_size int | float

Training set size. If int, absolute count. If float, proportion of total.

required
validation_size int | float | None

Validation set size. Same as train_size. If None, no validation set is created.

None
test_size int | float | None

Test set size. Same as train_size. If None, remainder is used as test set.

None
output_format ('wide', 'long', 'long_multi_index', 'dict')

Output format for the splits.

'wide'
verbose bool

If True, print detailed split information for each group.

False

Returns:

Type Description
list[tuple] | tuple

If a single series group: a tuple of splits, (train, test) or (train, val, test). If multiple series groups: a list of tuples, one per group.

Raises:

Type Description
ValueError

If sizes are invalid or exceed series length.

Examples:

>>> # Single group with proportions
>>> splitter = TimeSeriesSplitter(df1)
>>> train, test = splitter.split_by_size(train_size=0.8)
>>> # Multiple groups with absolute sizes
>>> splitter = TimeSeriesSplitter(df1, df2, df3)
>>> splits = splitter.split_by_size(train_size=70, test_size=30)
>>> # Each group split with 70 training samples and 30 test samples
Source code in skforecast\experimental\_splitter.py
def split_by_size(
    self,
    train_size: int | float,
    validation_size: int | float | None = None,
    test_size: int | float | None = None,
    output_format: Literal['wide', 'long', 'long_multi_index', 'dict'] = 'wide',
    verbose: bool = False,
) -> list[tuple] | tuple:
    """
    Split time series based on size (absolute or proportional).

    Creates training, validation (optional), and test sets by splitting
    series at specified size boundaries. Sizes can be absolute (int) or
    proportional (float between 0 and 1).

    When multiple series groups were provided to the constructor, this method
    returns a list of tuples (one per group). Each group is split independently
    based on its own length.

    Parameters
    ----------
    train_size : int | float
        Training set size. If int, absolute count. If float, proportion of total.
    validation_size : int | float | None, default None
        Validation set size. Same as train_size.
        If None, no validation set is created.
    test_size : int | float | None, default None
        Test set size. Same as train_size.
        If None, remainder is used as test set.
    output_format : {'wide', 'long', 'long_multi_index', 'dict'}, default 'wide'
        Output format for the splits.
    verbose : bool, default False
        If True, print detailed split information for each group.

    Returns
    -------
    list[tuple] | tuple
        If single series group: tuple of splits (train, test) or (train, val, test)
        If multiple series groups: list of tuples, one per group

    Raises
    ------
    ValueError
        If sizes are invalid or exceed series length.

    Examples
    --------
    >>> # Single group with proportions
    >>> splitter = TimeSeriesSplitter(df1)
    >>> train, test = splitter.split_by_size(train_size=0.8)

    >>> # Multiple groups with absolute sizes
    >>> splitter = TimeSeriesSplitter(df1, df2, df3)
    >>> splits = splitter.split_by_size(train_size=70, test_size=30)
    >>> # Each group split with 70 training samples and 30 test samples
    """
    results = []

    for group_idx in range(self.n_groups_):
        # -- Validate and get counts for current group
        train_count, val_count, test_count = self._validate_size_split_args(
            group_idx, train_size, validation_size, test_size
        )

        # -- Get total length for current group
        first_index = next(iter(self.series_indexes_[group_idx].values()))
        total_len = len(first_index)

        # -- Compute positions
        train_end = train_count - 1
        val_end = train_end + (val_count if val_count is not None else 0)
        test_end = total_len - 1

        positions = {'train': (0, train_end)}

        if val_count is not None:
            positions['validation'] = (train_end + 1, val_end)
            positions['test'] = (val_end + 1, test_end)
        else:
            positions['test'] = (train_end + 1, test_end)

        # -- Perform split on current group
        split_dicts = self._split_series_dict(
            self.series_groups_[group_idx], positions
        )

        # -- Convert to required output
        result = self._convert_output(split_dicts, output_format)

        if verbose:
            self._print_split_info(group_idx, positions, output_format)

        results.append(result)

    # Return single tuple if only one group, otherwise list of tuples
    return results[0] if self.n_groups_ == 1 else results
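The position arithmetic above, isolated as a standalone sketch in plain pandas (no skforecast dependency; integer counts only, since proportional sizes are resolved to counts before this step). Note the inclusive bounds: `train_end = train_count - 1`, and `iloc` slicing therefore uses `end + 1`:

```python
import pandas as pd

def size_split_positions(total_len, train_count, val_count=None):
    """Inclusive (start, end) positions per split, mirroring split_by_size."""
    train_end = train_count - 1
    positions = {"train": (0, train_end)}
    if val_count is not None:
        val_end = train_end + val_count
        positions["validation"] = (train_end + 1, val_end)
        positions["test"] = (val_end + 1, total_len - 1)
    else:
        positions["test"] = (train_end + 1, total_len - 1)
    return positions

s = pd.Series(range(10), index=pd.date_range("2024-01-01", periods=10, freq="D"))
pos = size_split_positions(len(s), train_count=6, val_count=2)
# Inclusive positions -> iloc slices need end + 1
splits = {name: s.iloc[start:end + 1] for name, (start, end) in pos.items()}
```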

_print_split_info

_print_split_info(group_idx, positions, output_format)

Print detailed split information for a specific group.

Parameters:

Name Type Description Default
group_idx int

Index of the series group.

required
positions dict[str, tuple[int, int]]

Position ranges for each split.

required
output_format str

Output format being used.

required
Source code in skforecast\experimental\_splitter.py
def _print_split_info(
    self,
    group_idx: int,
    positions: dict[str, tuple[int, int]],
    output_format: str,
) -> None:
    """
    Print detailed split information for a specific group.

    Parameters
    ----------
    group_idx : int
        Index of the series group.
    positions : dict[str, tuple[int, int]]
        Position ranges for each split.
    output_format : str
        Output format being used.
    """
    # -- Extract a sample index (first one)
    first_index = next(iter(self.series_indexes_[group_idx].values()))
    total_len = len(first_index)

    # -- Print header
    print(f'Split Information (Group id: {group_idx})')
    print('=' * 32)

    for split_name, (start, end) in positions.items():
        length = max(0, end - start + 1)
        percentage = (length / total_len * 100) if total_len > 0 else 0

        if isinstance(first_index, pd.DatetimeIndex):
            start_date = first_index[start] if start < len(first_index) else 'N/A'
            end_date = first_index[end] if end < len(first_index) else 'N/A'
            print(
                f'{split_name.capitalize():12} | '
                f'Range: {start_date} to {end_date} | '
                f'Length: {length} ({percentage:.1f}%)'
            )
        else:
            print(
                f'{split_name.capitalize():12} | '
                f'Positions: {start} to {end} | '
                f'Length: {length} ({percentage:.1f}%)'
            )
    # -- Print the given output format
    print(f'Output format: {output_format}', end='')
    # -- Blank line between groups (only when another group follows)
    if group_idx < self.n_groups_ - 1:
        print('\n')
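The length and percentage arithmetic above, in isolation (inclusive bounds, so a split's length is `end - start + 1`; the positions here are illustrative):

```python
total_len = 10
positions = {"train": (0, 5), "validation": (6, 7), "test": (8, 9)}

rows = []
for name, (start, end) in positions.items():
    length = max(0, end - start + 1)      # inclusive bounds
    pct = length / total_len * 100
    rows.append((name, length, pct))
    print(f"{name.capitalize():12} | Positions: {start} to {end} | "
          f"Length: {length} ({pct:.1f}%)")
```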

skforecast.experimental._experimental.calculate_distance_from_holiday

calculate_distance_from_holiday(
    df,
    holiday_column="is_holiday",
    date_column="date",
    fill_na=0.0,
)

Calculate the number of days to the next holiday and the number of days since the last holiday.

Parameters:

Name Type Description Default
df pandas DataFrame

DataFrame containing the holiday data.

required
holiday_column str

The name of the column indicating holidays (True/False).

'is_holiday'
date_column str

The name of the column containing the dates.

'date'
fill_na int | float

Value used to fill NaN values in the output columns.

0.0

Returns:

Name Type Description
df DataFrame

DataFrame with additional columns for days to the next holiday ('days_to_holiday') and days since the last holiday ('days_since_holiday').

Notes

The function assumes that the input df contains a boolean column indicating holidays and a date column. It calculates the number of days to the next holiday and the number of days since the last holiday for each date in the date column.

Source code in skforecast\experimental\_experimental.py
def calculate_distance_from_holiday(
    df: pd.DataFrame, 
    holiday_column: str = 'is_holiday',
    date_column: str = 'date',
    fill_na: int | float = 0.
) -> pd.DataFrame:  # pragma: no cover
    """
    Calculate the number of days to the next holiday and the number of days since 
    the last holiday.

    Parameters
    ----------
    df : pandas DataFrame
        DataFrame containing the holiday data.
    holiday_column : str, default 'is_holiday'
        The name of the column indicating holidays (True/False).
    date_column : str, default 'date'
        The name of the column containing the dates.
    fill_na : int | float, default 0.0
        Value used to fill NaN values in the output columns.

    Returns
    -------
    df : pd.DataFrame
        DataFrame with additional columns for days to the next holiday ('days_to_holiday') 
        and days since the last holiday ('days_since_holiday').

    Notes
    -----
    The function assumes that the input `df` contains a boolean column indicating holidays
    and a date column. It calculates the number of days to the next holiday and the number of
    days since the last holiday for each date in the date column.

    """

    df = df.reset_index(drop=True)
    df[date_column] = pd.to_datetime(df[date_column])

    dates = df[date_column].to_numpy()
    holiday_dates = df.loc[df[holiday_column], date_column].to_numpy()
    holiday_dates_sorted = np.sort(holiday_dates)

    # For next holiday (right side)
    next_idx = np.searchsorted(holiday_dates_sorted, dates, side='left')
    has_next = next_idx < len(holiday_dates_sorted)
    days_to_holiday = np.full(len(dates), np.nan)
    days_to_holiday[has_next] = (
        holiday_dates_sorted[next_idx[has_next]] - dates[has_next]
    ).astype('timedelta64[D]').astype(int)

    # For previous holiday (left side)
    prev_idx = np.searchsorted(holiday_dates_sorted, dates, side='right') - 1
    has_prev = prev_idx >= 0
    days_since_holiday = np.full(len(dates), np.nan)
    days_since_holiday[has_prev] = (
        dates[has_prev] - holiday_dates_sorted[prev_idx[has_prev]]
    ).astype('timedelta64[D]').astype(int)

    df["days_to_holiday"] = pd.Series(days_to_holiday, dtype="Int64").fillna(fill_na)
    df["days_since_holiday"] = pd.Series(days_since_holiday, dtype="Int64").fillna(fill_na)

    return df
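The two `searchsorted` passes above can be exercised on a tiny calendar as a standalone sketch (not a call into skforecast; here `-1` marks "no holiday on that side", where the library function would leave NaN and apply `fill_na`):

```python
import numpy as np
import pandas as pd

dates = pd.date_range("2024-01-01", periods=7, freq="D")
is_holiday = [False, True, False, False, True, False, False]

d = dates.to_numpy()
holidays = np.sort(d[np.array(is_holiday)])

# side='left': a holiday counts as its own next holiday (distance 0)
next_idx = np.searchsorted(holidays, d, side="left")
safe_next = np.minimum(next_idx, len(holidays) - 1)       # avoid out-of-bounds
days_to = (holidays[safe_next] - d).astype("timedelta64[D]").astype(int)
days_to = np.where(next_idx < len(holidays), days_to, -1)

# side='right' minus 1: a holiday counts as its own last holiday (distance 0)
prev_idx = np.searchsorted(holidays, d, side="right") - 1
safe_prev = np.maximum(prev_idx, 0)
days_since = (d - holidays[safe_prev]).astype("timedelta64[D]").astype(int)
days_since = np.where(prev_idx >= 0, days_since, -1)
```

With holidays on Jan 2 and Jan 5, `days_to` is `[1, 0, 2, 1, 0, -1, -1]` and `days_since` is `[-1, 0, 1, 2, 0, 1, 2]`.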