fusionlab.utils.ts_utils.ts_split
- fusionlab.utils.ts_utils.ts_split(df, dt_col=None, value_col=None, split_type='simple', test_ratio=None, n_splits=5, gap=0, train_start=None, train_end=None, verbose=0)
Perform a time-based split on a time series dataset for either a simple train-test partition or cross-validation.
In time-series modeling, it is critical to maintain chronological ordering [1]. Let \(\{x_t\}_{t=1}^N\) be the time-ordered observations. A simple time-based split partitions the data at some time index \(k\):
\[\text{Train set}: \{x_t \mid t \le k \}, \quad \text{Test set}: \{x_t \mid t > k \}.\]
Cross-validation (“cv”) uses multiple splits, iteratively moving the boundary to create overlapping train sets for model training and test sets for validation [2].
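The simple split at index \(k\) can be sketched in a few lines of pandas. This is an illustration only, not the code inside ts_split: the helper name simple_time_split and the rounding rule used to derive \(k\) from test_ratio are assumptions made for the example.

```python
import pandas as pd


def simple_time_split(df, dt_col, test_ratio):
    """Hypothetical helper: chronological split at a single index k."""
    # Parse the datetime column and sort so that row order equals time order.
    ordered = df.assign(**{dt_col: pd.to_datetime(df[dt_col])})
    ordered = ordered.sort_values(dt_col).reset_index(drop=True)

    # Assumed rule: the last round(N * test_ratio) rows form the test set.
    n_test = int(round(len(ordered) * test_ratio))
    k = len(ordered) - n_test

    # Train = rows up to the boundary, Test = rows after it.
    return ordered.iloc[:k], ordered.iloc[k:]


# Rows are deliberately out of order to show that sorting comes first.
df = pd.DataFrame({
    "Date": ["2021-01-03", "2021-01-01", "2021-01-02",
             "2021-01-05", "2021-01-04"],
    "Sales": [14, 10, 12, 15, 13],
})
train_df, test_df = simple_time_split(df, "Date", test_ratio=0.4)
```

Sorting before slicing is what guarantees that every training timestamp precedes every test timestamp, whatever the original row order.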
- Parameters:
df (pandas.DataFrame) – The input DataFrame containing time series data. Must include a column (or index) for time references.
dt_col (str, optional) – The name of the datetime column if the index is not already datetime. If provided, the function ensures it is valid and can parse it as datetime if needed.
value_col (str, optional) – The name of the target variable column (e.g., “sales”). Primarily for logging or reference; it is not required for the split logic itself.
split_type ({'simple', 'cv'}, optional) – Type of split:
  'simple' or 'base': Splits the DataFrame into a single train and test set based on time or specified rows.
  'cv': Constructs a generator for time-series cross-validation using sklearn.model_selection.TimeSeriesSplit.
test_ratio (float, optional) – For a simple split, if set, this denotes the fraction of rows allocated to the test set (\(0 < \text{test_ratio} < 1\)). If not specified, train_end can determine the boundary. Not used for cross-validation.
n_splits (int, optional) – Number of splits for cross-validation if split_type='cv'. Defaults to 5.
gap (int, optional) – Gap (number of points) between train and test sets in cross-validation. Defaults to 0.
train_start (str, optional) – If set, the earliest date to include in the training set for a simple split. Should be a string convertible by pandas to a datetime, e.g., “2021-01-01”.
train_end (str, optional) – If set, the last date to include in the training set for a simple split. Subsequent rows, those dated later than train_end, become the test set.
verbose (int, optional) – Verbosity level:
  0: No messages.
  1: Basic logs on split info.
  2: More detailed logs (including indices for cross-validation splits).
- Returns:
splits –
  If split_type='simple', returns a tuple (train_df, test_df).
  If split_type='cv', returns a TimeSeriesSplit generator yielding indices for train/test.
- Return type:
tuple or generator
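The role of the train_start and train_end bounds in a simple split can be pictured with plain pandas masks. The sketch below does not call ts_split and is not taken from its source. It assumes both bounds are inclusive for the training set and that every row dated after train_end goes to the test set.

```python
import pandas as pd

df = pd.DataFrame({
    "Date": pd.date_range("2021-01-01", periods=6, freq="D"),
    "Sales": [10, 12, 14, 13, 15, 16],
})

# Date strings are converted by pandas, as the parameter description requires.
train_start = pd.Timestamp("2021-01-02")
train_end = pd.Timestamp("2021-01-04")

# Assumption: both bounds are inclusive for the training window.
in_train = (df["Date"] >= train_start) & (df["Date"] <= train_end)
train_df = df[in_train]

# Rows strictly after train_end form the test set.
test_df = df[df["Date"] > train_end]
```

Under these assumptions, rows dated before train_start fall into neither set, which is useful when early history is considered unrepresentative.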
Notes
Maintaining time order in training and testing sets is essential to avoid leakage of future information into model training. Cross-validation further generalizes the idea by repeated train-test sub-sampling in an expanding window manner, shifting the boundary forward for each split.
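To make the expanding-window idea concrete, here is a small hand-rolled index generator in pure Python. It is loosely modeled on the scheme used by sklearn.model_selection.TimeSeriesSplit but is not that class, and the function name expanding_window_splits is hypothetical. The gap argument drops the points immediately before each test block from the training set.

```python
def expanding_window_splits(n_samples, n_splits, gap=0):
    """Yield (train_idx, test_idx) lists with an ever-growing train set."""
    test_size = n_samples // (n_splits + 1)
    if test_size == 0:
        raise ValueError("Too few samples for the requested number of splits.")

    # The test blocks tile the end of the series, one block per split.
    first_test_start = n_samples - n_splits * test_size
    for i in range(n_splits):
        test_start = first_test_start + i * test_size
        train_stop = test_start - gap  # leave `gap` points unused
        if train_stop <= 0:
            raise ValueError("gap is too large for the available samples.")
        train_idx = list(range(train_stop))
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


folds = list(expanding_window_splits(n_samples=10, n_splits=3, gap=1))
for train_idx, test_idx in folds:
    print(f"train={train_idx} test={test_idx}")
```

Each successive training set contains the previous one, and the largest training index always stays at least gap + 1 positions below the smallest test index, so no future observation reaches the model during fitting.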
Examples
>>> import pandas as pd
>>> from fusionlab.utils.ts_utils import ts_split
>>> data = {
...     'Date': [
...         '2021-01-01','2021-01-02','2021-01-03',
...         '2021-01-04','2021-01-05'
...     ],
...     'Sales': [10, 12, 14, 13, 15]
... }
>>> df = pd.DataFrame(data)
>>> # Simple split using 60% train and 40% test
>>> train_df, test_df = ts_split(
...     df,
...     dt_col='Date',
...     split_type='simple',
...     test_ratio=0.4,
...     verbose=1
... )
Performing simple split: Train size=3, Test size=2.
>>> # Cross-validation with 2 splits and gap=0
>>> splits = ts_split(
...     df,
...     dt_col='Date',
...     split_type='cv',
...     n_splits=2,
...     verbose=1
... )
Performing cross-validation split with n_splits=2, gap=0.
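Train/test index pairs of the kind produced in 'cv' mode are typically consumed by slicing the DataFrame positionally. The sketch below uses sklearn.model_selection.TimeSeriesSplit directly rather than ts_split, since the exact object returned by ts_split is not shown here. The five-row frame mirrors the example data, and the variable names are illustrative.

```python
import pandas as pd
from sklearn.model_selection import TimeSeriesSplit

df = pd.DataFrame({
    "Date": ["2021-01-01", "2021-01-02", "2021-01-03",
             "2021-01-04", "2021-01-05"],
    "Sales": [10, 12, 14, 13, 15],
})
df["Date"] = pd.to_datetime(df["Date"])
df = df.sort_values("Date").reset_index(drop=True)

tscv = TimeSeriesSplit(n_splits=2, gap=0)

fold_sizes = []
for fold, (train_idx, test_idx) in enumerate(tscv.split(df)):
    # The indices are positional, so use .iloc rather than .loc.
    train_df, test_df = df.iloc[train_idx], df.iloc[test_idx]
    fold_sizes.append((len(train_df), len(test_df)))
    print(f"fold {fold}: train rows={len(train_df)}, test rows={len(test_df)}")
```

Because the indices are positions into the sorted frame, sorting by date before splitting is what makes each fold chronological.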
See also
sklearn.model_selection.TimeSeriesSplit: Cross-validation splits for time-series data.
References