fusionlab.datasets.load_processed_subsidence_data

fusionlab.datasets.load_processed_subsidence_data(dataset_name='zhongshan', *, n_samples=None, as_frame=False, include_coords=True, include_target=True, data_home=None, download_if_missing=True, force_download_raw=False, random_state=None, apply_feature_select=True, apply_nan_ops=True, encode_categoricals=True, scale_numericals=True, scaler_type='minmax', return_sequences=False, time_steps=4, forecast_horizon=4, target_col='subsidence', scale_target=False, group_by_cols=True, use_processed_cache=True, use_sequence_cache=True, save_processed_frame=False, save_sequences=False, cache_suffix='', nan_handling_method='fill', verbose=True)

Loads, preprocesses, and optionally sequences land subsidence datasets.

This function provides a complete pipeline to prepare the Zhongshan or Nansha land subsidence datasets for use with forecasting models like TFT/XTFT. It performs the following steps:

  1. Loads the raw sampled data (‘zhongshan_2000.csv’ or ‘nansha_2000.csv’) using the fetch functions (fetch_zhongshan_data() or fetch_nansha_data()), optionally sub-sampling using spatial stratification if n_samples is specified.

  2. Optionally applies a predefined preprocessing sequence, mirroring steps often used in research (e.g., based on [Liu24]):

     • Feature Selection (selecting a subset of columns).

     • NaN Handling (e.g., filling missing values).

     • Categorical Encoding (using One-Hot Encoding).

     • Numerical Scaling (using MinMaxScaler or StandardScaler).

  3. Optionally reshapes the fully processed data into sequences suitable for TFT/XTFT models using reshape_xtft_data().

  4. Optionally leverages caching by loading/saving the processed DataFrame or the final sequence arrays to/from disk (.joblib) to accelerate repeated executions with the same parameters.
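The preprocessing steps listed in step 2 can be illustrated with a small, pure-Python sketch. This is not the library's implementation (which, per the parameter descriptions, relies on nan_ops(), OneHotEncoder, and MinMaxScaler/StandardScaler); it only shows what each stage does to a column. The helper names (ffill_bfill, one_hot, scale) and the 'rainfall_mm' column are hypothetical and exist only for this example.

```python
import math
import statistics


def ffill_bfill(values):
    """Forward-fill missing entries, then backward-fill any leading gap."""
    def is_missing(v):
        return v is None or (isinstance(v, float) and math.isnan(v))

    filled, last = [], None
    for v in values:                          # forward pass
        if not is_missing(v):
            last = v
        filled.append(last)
    nxt = None
    for i in range(len(filled) - 1, -1, -1):  # backward pass for leading gaps
        if filled[i] is None:
            filled[i] = nxt
        else:
            nxt = filled[i]
    return filled


def one_hot(values, prefix):
    """Return one 0/1 column per category, named '<prefix>_<category>'."""
    return {
        f"{prefix}_{cat}": [1.0 if v == cat else 0.0 for v in values]
        for cat in sorted(set(values))
    }


def scale(values, scaler_type="minmax"):
    """Min-max scaling to [0, 1], or standardization to zero mean and unit variance."""
    if scaler_type == "minmax":
        lo, hi = min(values), max(values)
        span = (hi - lo) or 1.0                 # guard against constant columns
        return [(v - lo) / span for v in values]
    if scaler_type == "standard":
        mean = statistics.fmean(values)
        std = statistics.pstdev(values) or 1.0  # population standard deviation
        return [(v - mean) / std for v in values]
    raise ValueError(f"unknown scaler_type: {scaler_type!r}")


# Toy table; 'rainfall_mm' is a hypothetical column used only for illustration.
table = {
    "rainfall_mm": [float("nan"), 10.0, float("nan"), 30.0, 20.0],
    "geology": ["clay", "sand", "clay", "rock", "sand"],
}

filled = ffill_bfill(table["rainfall_mm"])       # NaN handling
encoded = one_hot(table["geology"], "geology")   # categorical encoding
scaled_minmax = scale(filled, "minmax")          # numerical scaling
scaled_standard = scale(filled, "standard")

# The original categorical column is replaced by its encoded columns.
processed = {"rainfall_mm": scaled_minmax, **encoded}
print(sorted(processed))
```

The order matters in this sketch: missing values are filled before scaling so that the minimum and maximum are computed on complete columns.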

Parameters:
  • dataset_name ({'zhongshan', 'nansha'}, default 'zhongshan') – Which dataset to load and process (‘zhongshan’ or ‘nansha’).

  • n_samples (int, str, or None, default None) –

    Number of samples to load from the raw dataset file.

    • If None or '*': Loads the full dataset (~2000 rows).

    • If int: Sub-samples the specified number using spatial stratification via spatial_sampling(). Must be a positive integer less than or equal to the total available samples.

  • as_frame (bool, default False) –

    Determines the return type only if return_sequences is False.

    • If False: Returns a Bunch object containing the processed DataFrame and metadata.

    • If True: Returns only the processed pandas DataFrame.

  • include_coords (bool, default True) – If True, include ‘longitude’ and ‘latitude’ columns in the output frame (and Bunch attributes).

  • include_target (bool, default True) – If True, include the target column (‘subsidence’) in the output frame (and Bunch attributes).

  • data_home (str, optional) – Specify a directory path to cache raw datasets and processed files. If None, uses the path determined by get_data() (typically ~/fusionlab_data). Default is None.

  • download_if_missing (bool, default True) – If True, attempt to download the raw dataset file from the remote repository if it’s not found locally.

  • force_download_raw (bool, default False) – If True, forces download of the raw dataset file, ignoring any local cache or packaged version.

  • random_state (int, optional) – Seed for the random number generator used during sub-sampling when n_samples is an integer. Ensures reproducibility.

  • apply_feature_select (bool, default True) – If True, selects only the subset of features typically used in reference examples for the specified dataset_name. If False, attempts to use all columns found (after excluding coords/target).

  • apply_nan_ops (bool, default True) – If True, apply NaN handling using the internal nan_ops() utility with the strategy specified by nan_handling_method.

  • encode_categoricals (bool, default True) – If True, apply Scikit-learn’s OneHotEncoder to predefined categorical columns (‘geology’, ‘density_tier’ for Zhongshan; ‘geology’ for Nansha). Adds new columns for encoded features and removes the original categorical columns.

  • scale_numericals (bool, default True) – If True, apply feature scaling to predefined numerical columns (excluding coordinates, year, and encoded categoricals) using the scaler specified by scaler_type. The target column is scaled only if scale_target=True.

  • scaler_type ({'minmax', 'standard'}, default 'minmax') – Type of scaler to use if scale_numericals is True.

  • return_sequences (bool, default False) –

    Controls the final output format.

    • If True: Performs sequence generation using reshape_xtft_data() and returns the sequence arrays.

    • If False: Skips sequence generation and returns the processed DataFrame or Bunch object (controlled by as_frame).

  • time_steps (int, default 4) – Lookback window size (number of past time steps) for sequence generation. Only used if return_sequences=True.

  • forecast_horizon (int, default 4) – Prediction horizon (number of future steps) for sequence generation. Only used if return_sequences=True.

  • target_col (str, default 'subsidence') – Name of the target variable column used for sequence generation.

  • scale_target (bool, default False) – If True, scale the target column (target_col) as well; if False, leave it in its original units.

  • group_by_cols (bool or list of str, default True) –

    Controls how the data is partitioned before sequence generation.

    • False: do not group by any columns; the entire dataset is treated as a single continuous time series.

    • list of str: names of one or more DataFrame columns (e.g. [‘longitude’, ‘latitude’]) to group by; each unique group will produce its own set of sequences.

    • None: equivalent to False (no grouping).

  • use_processed_cache (bool, default True) – If True and return_sequences=False, attempts to load a previously saved processed DataFrame (and scaler/encoder info) from the cache directory before running the preprocessing steps.

  • use_sequence_cache (bool, default True) – If True and return_sequences=True, attempts to load previously saved sequence arrays from the cache directory before running preprocessing and sequence generation.

  • save_processed_frame (bool, default False) – If True and preprocessing is performed (cache miss or use_processed_cache=False), saves the resulting processed DataFrame, scaler info, and encoder info to a joblib file in the cache directory. Ignored if return_sequences=True.

  • save_sequences (bool, default False) – If True and sequence generation is performed (cache miss or use_sequence_cache=False), saves the resulting sequence arrays (static_data, dynamic_data, future_data, target_data) to a joblib file in the cache directory. Only used if return_sequences=True.

  • cache_suffix (str, default "") – Optional suffix appended to cache filenames (before ‘.joblib’) to allow caching results from different processing variations (e.g., different n_samples or preprocessing flags).

  • nan_handling_method (str, default 'fill') – Method used by nan_ops() if apply_nan_ops=True. Typically ‘fill’ (forward fill then backward fill).

  • verbose (bool, default True) – If True, print status messages during file fetching, processing, caching, and sequence generation.
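The cache-related parameters (use_processed_cache, save_processed_frame, cache_suffix and their sequence counterparts) follow a load-or-compute pattern. The sketch below illustrates that pattern under stated assumptions: it uses the standard-library pickle module as a stand-in for joblib, and the file-naming scheme and the helper names cache_path and load_or_compute are hypothetical, not the library's own.

```python
import pickle
import tempfile
from pathlib import Path


def cache_path(cache_dir, dataset_name, kind, cache_suffix=""):
    """Build a cache file path; the suffix goes right before the extension.

    Hypothetical naming scheme, for illustration only.
    """
    return Path(cache_dir) / f"{dataset_name}_{kind}{cache_suffix}.pkl"


def load_or_compute(path, compute, use_cache=True, save=False):
    """Return (result, cache_hit). Recompute on a cache miss or if use_cache is False."""
    if use_cache and path.exists():
        with path.open("rb") as fh:
            return pickle.load(fh), True
    result = compute()
    if save:
        path.parent.mkdir(parents=True, exist_ok=True)
        with path.open("wb") as fh:
            pickle.dump(result, fh)
    return result, False


calls = []  # records how often the expensive step actually runs


def expensive_preprocessing():
    calls.append(1)
    return {"rows": 100}


with tempfile.TemporaryDirectory() as tmp:
    path = cache_path(tmp, "zhongshan", "processed", "_sample100")
    # First call: cache miss, result is computed and saved.
    first, hit1 = load_or_compute(path, expensive_preprocessing, save=True)
    # Second call with the same suffix: served from the cache.
    second, hit2 = load_or_compute(path, expensive_preprocessing)
    # A different suffix points at a different file, so it is a miss again.
    other = cache_path(tmp, "zhongshan", "processed", "_full")
    _, hit_other = load_or_compute(other, expensive_preprocessing)

print(hit1, hit2, hit_other, len(calls))
```

The point of the suffix is visible in the last call: results produced with different settings (for example a different n_samples) should use different suffixes so that one variation does not silently reuse another's cached output.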

Returns:

Processed Data – The type depends on return_sequences and as_frame:

  • If return_sequences=True: Returns a tuple containing the sequence arrays required by TFT/XTFT: (static_data, dynamic_data, future_data, target_data).

  • If return_sequences=False and as_frame=True: Returns the fully processed pandas DataFrame (after selection, NaN handling, encoding, scaling).

  • If return_sequences=False and as_frame=False: Returns a Bunch object containing the processed DataFrame (frame), extracted numerical features (data), feature names (feature_names), target info (target_names, target), coordinates (longitude, latitude), and a description (DESCR).
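To make the sequence output more concrete, the sketch below shows a generic sliding-window split driven by time_steps and forecast_horizon, applied per group as group_by_cols describes. It is an illustration of the idea only: reshape_xtft_data() is not reproduced here, its exact array shapes are not assumed, and the helper names, coordinate values, and toy subsidence values are hypothetical.

```python
from collections import defaultdict


def make_windows(series, time_steps, forecast_horizon):
    """Split one ordered series into (lookback, future) pairs."""
    n_windows = len(series) - time_steps - forecast_horizon + 1
    pairs = []
    for start in range(max(n_windows, 0)):
        split = start + time_steps
        pairs.append((series[start:split], series[split:split + forecast_horizon]))
    return pairs


def windows_by_group(rows, group_keys, value_key, time_key,
                     time_steps, forecast_horizon):
    """Group rows, sort each group by time, and window each group separately."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in group_keys)].append(row)
    result = {}
    for key, members in groups.items():
        members.sort(key=lambda r: r[time_key])
        series = [r[value_key] for r in members]
        result[key] = make_windows(series, time_steps, forecast_horizon)
    return result


# Two hypothetical locations with eight yearly observations each.
rows = [
    {"longitude": lon, "latitude": lat, "year": year,
     "subsidence": float(i * 100 + year - 2015)}
    for i, (lon, lat) in enumerate([(113.2, 22.5), (113.4, 22.6)])
    for year in range(2015, 2023)
]

# Grouped by coordinates: each location yields 8 - 4 - 3 + 1 = 2 windows.
grouped = windows_by_group(rows, ["longitude", "latitude"],
                           "subsidence", "year", 4, 3)

# No grouping: all 16 rows form one series, mixing both locations.
ungrouped = windows_by_group(rows, [], "subsidence", "year", 4, 3)

print({key: len(pairs) for key, pairs in grouped.items()})
print(len(ungrouped[()]))
```

In this sketch a series of length T produces T - time_steps - forecast_horizon + 1 windows, so groups shorter than time_steps + forecast_horizon contribute nothing. Without grouping, observations from different locations are interleaved in a single series, which is rarely what a per-location forecast needs.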

Return type:

Union[Bunch, pd.DataFrame, Tuple[np.ndarray, ...]]

Raises:
  • ValueError – If dataset_name is invalid, n_samples is invalid, or required columns are missing for selected processing steps.

  • FileNotFoundError – If underlying raw data loading fails (fetching from cache, package, or download).

References

[Liu24]

Liu, J., et al. (2024). Machine learning-based techniques… Journal of Environmental Management, 352, 120078.

Examples

>>> from fusionlab.datasets import load_processed_subsidence_data
>>> # Load processed Zhongshan data as a Bunch object
>>> data_bunch = load_processed_subsidence_data(dataset_name='zhongshan',
...                                             as_frame=False,
...                                             return_sequences=False)
>>> print(data_bunch.frame.head())
>>> print(data_bunch.feature_names)
>>> # Load Nansha data, preprocess, and return sequences
>>> static, dynamic, future, target = load_processed_subsidence_data(
...     dataset_name='nansha',
...     return_sequences=True,
...     time_steps=6,
...     forecast_horizon=3,
...     scale_numericals=True,
...     scaler_type='standard',
...     verbose=False
... )
>>> print(f"Nansha sequences shapes: S={static.shape}, D={dynamic.shape},"
...       f" F={future.shape}, y={target.shape}")
>>> # Load a small sample and save processed frame
>>> df_proc_sample = load_processed_subsidence_data(
...     dataset_name='zhongshan',
...     n_samples=100,
...     random_state=42,
...     as_frame=True,
...     return_sequences=False,
...     save_processed_frame=True,
...     cache_suffix="_sample100"
... )
>>> print(f"Loaded and processed sample shape: {df_proc_sample.shape}")