fusionlab.utils.ts_utils.ts_split

fusionlab.utils.ts_utils.ts_split(df, dt_col=None, value_col=None, split_type='simple', test_ratio=None, n_splits=5, gap=0, train_start=None, train_end=None, verbose=0)

Perform a time-based split on a time series dataset for either a simple train-test partition or cross-validation.

In time-series modeling, it is critical to maintain chronological ordering [1]. Let \(\{x_t\}_{t=1}^N\) be the time-ordered observations. A simple time-based split partitions the data at some time index \(k\):

\[\text{Train set}: \{x_t | t \le k \}, \quad \text{Test set}: \{x_t | t > k \}.\]
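
As a minimal sketch of this partition (not the library's implementation), the boundary index \(k\) can be derived from a test fraction and applied to a date-sorted frame. The helper name `chronological_split` and its rounding rule are assumptions made for illustration.

```python
import pandas as pd

def chronological_split(df, dt_col, test_ratio):
    """Hypothetical helper: split rows at position k after sorting by time."""
    # Parse the date column and put rows in chronological order
    ordered = df.assign(**{dt_col: pd.to_datetime(df[dt_col])})
    ordered = ordered.sort_values(dt_col).reset_index(drop=True)
    # k = number of training rows; the rounding rule is an assumption
    k = int(round(len(ordered) * (1 - test_ratio)))
    return ordered.iloc[:k], ordered.iloc[k:]

df = pd.DataFrame({
    "Date": ["2021-01-03", "2021-01-01", "2021-01-02",
             "2021-01-05", "2021-01-04"],
    "Sales": [14, 10, 12, 15, 13],
})
train_df, test_df = chronological_split(df, "Date", test_ratio=0.4)
```

Every training row precedes every test row in time, regardless of the row order in the input.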

Cross-validation (“cv”) repeats this partition over several splits, moving the boundary forward each time. Each training set contains the previous one (an expanding window), and each test set holds later observations used for validation [2].
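
The moving boundary can be seen by calling sklearn.model_selection.TimeSeriesSplit directly on five time-ordered observations. This is an illustrative sketch that does not go through ts_split; the array `X` is made-up data.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(5).reshape(-1, 1)  # five time-ordered observations
tscv = TimeSeriesSplit(n_splits=2, gap=0)

# Collect each (train, test) pair of positional indices as plain lists
folds = [(train.tolist(), test.tolist()) for train, test in tscv.split(X)]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
# train: [0, 1, 2] test: [3]
# train: [0, 1, 2, 3] test: [4]
```

The second training set extends the first by one observation, and each test set lies strictly after its training set.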

Parameters:

df (pandas.DataFrame) – The input DataFrame containing time series data. Must include a column (or index) for time references.
dt_col (str, optional) – The name of the datetime column if the index is not already datetime. If provided, the function validates the column and parses it as datetime if needed.
value_col (str, optional) – The name of the target variable column (e.g., “sales”). Primarily for logging or reference; it is not required for the split logic itself.
split_type ({'simple', 'base', 'cv'}, optional) –
Type of split:
- 'simple' or 'base': Splits the DataFrame into a single train and test set based on time or specified rows.
- 'cv': Constructs a generator for time-series cross-validation using sklearn.model_selection.TimeSeriesSplit.
test_ratio (float, optional) – For a simple split, if set, this denotes the fraction of rows allocated to the test set (\(0 < \text{test_ratio} < 1\)). If not specified, train_end can determine the boundary. Not used for cross-validation.
n_splits (int, optional) – Number of splits for cross-validation if split_type='cv'. Defaults to 5.
gap (int, optional) – Gap (number of points) between train and test sets in cross-validation. Defaults to 0.
train_start (str, optional) – If set, the earliest date to include in the training set for a simple split. Should be a string convertible by pandas to a datetime, e.g., “2021-01-01”.
train_end (str, optional) – If set, the last date to include in the training set for a simple split. Rows dated later than train_end become the test set.
verbose (int, optional) –
Verbosity level:
- 0: No messages.
- 1: Basic logs on split info.
- 2: More detailed logs (including indices for cross-validation splits).
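
The date-based boundary described for train_start and train_end can be pictured with plain pandas masks. This is a sketch under stated assumptions: both bounds are treated as inclusive and rows before train_start are dropped, which may differ from what ts_split actually does.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=5, freq="D"),
    "Sales": [10, 12, 14, 13, 15],
})
train_start = pd.Timestamp("2021-01-02")  # assumed inclusive
train_end = pd.Timestamp("2021-01-03")    # assumed inclusive

# Training rows fall inside [train_start, train_end]
in_train = (df["Date"] >= train_start) & (df["Date"] <= train_end)
train_df = df[in_train]
# Test rows are those dated strictly after train_end
test_df = df[df["Date"] > train_end]
```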

Returns:

splits –

If split_type='simple', returns a tuple (train_df, test_df).
If split_type='cv', returns a generator from TimeSeriesSplit that yields train and test indices for each split.

Return type:

tuple or generator
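
Assuming the 'cv' generator yields pairs of positional index arrays, as sklearn's TimeSeriesSplit.split does, each pair can be turned into DataFrames with iloc. The sketch below uses TimeSeriesSplit directly as a stand-in rather than calling ts_split.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

df = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=5, freq="D"),
    "Sales": [10, 12, 14, 13, 15],
})

# Stand-in for the generator returned with split_type='cv' (assumption)
splits = TimeSeriesSplit(n_splits=2).split(df)

fold_frames = []
for train_idx, test_idx in splits:
    # Indices are positional, so select rows with iloc rather than loc
    fold_frames.append((df.iloc[train_idx], df.iloc[test_idx]))
```

A generator can be iterated only once; wrap it in list(...) if the folds are needed again.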

Notes

Maintaining time order in training and testing sets is essential to avoid leakage of future information into model training. Cross-validation generalizes this idea by repeating the train-test partition with an expanding window, shifting the boundary forward at each split.
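
A gap adds a buffer between training and test data, which can further limit leakage when neighbouring observations are correlated. The sketch below calls sklearn's TimeSeriesSplit directly with gap=1 on six made-up observations; it is independent of ts_split.

```python
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(6).reshape(-1, 1)  # six time-ordered observations

# With gap=1, the point just before each test block is excluded from training
folds = [
    (train.tolist(), test.tolist())
    for train, test in TimeSeriesSplit(n_splits=2, gap=1).split(X)
]
for train_idx, test_idx in folds:
    print("train:", train_idx, "test:", test_idx)
# train: [0] test: [2, 3]
# train: [0, 1, 2] test: [4, 5]
```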

Examples

>>> import pandas as pd
>>> from fusionlab.utils.ts_utils import ts_split
>>> data = {
...     'Date': [
...         '2021-01-01','2021-01-02','2021-01-03',
...         '2021-01-04','2021-01-05'
...     ],
...     'Sales': [10, 12, 14, 13, 15]
... }
>>> df = pd.DataFrame(data)
>>> # Simple split using 60% train and 40% test
>>> train_df, test_df = ts_split(
...     df,
...     dt_col='Date',
...     split_type='simple',
...     test_ratio=0.4,
...     verbose=1
... )
Performing simple split: Train size=3, Test size=2.

>>> # Cross-validation with 2 splits and gap=0
>>> splits = ts_split(
...     df,
...     dt_col='Date',
...     split_type='cv',
...     n_splits=2,
...     verbose=1
... )
Performing cross-validation split with n_splits=2, gap=0.