fusionlab.datasets.load_processed_subsidence_data
- fusionlab.datasets.load_processed_subsidence_data(dataset_name='zhongshan', *, n_samples=None, as_frame=False, include_coords=True, include_target=True, data_home=None, download_if_missing=True, force_download_raw=False, random_state=None, apply_feature_select=True, apply_nan_ops=True, encode_categoricals=True, scale_numericals=True, scaler_type='minmax', return_sequences=False, time_steps=4, forecast_horizon=4, target_col='subsidence', scale_target=False, group_by_cols=True, use_processed_cache=True, use_sequence_cache=True, save_processed_frame=False, save_sequences=False, cache_suffix='', nan_handling_method='fill', verbose=True)
Loads, preprocesses, and optionally sequences land subsidence datasets.
This function provides a complete pipeline to prepare the Zhongshan or Nansha land subsidence datasets for use with forecasting models such as TFT/XTFT. It performs the following steps:
- Loads the raw sampled data (‘zhongshan_2000.csv’ or ‘nansha_2000.csv’) using the fetch functions fetch_zhongshan_data() or fetch_nansha_data(), optionally sub-sampling with spatial stratification if n_samples is specified.
- Optionally applies a predefined preprocessing sequence that mirrors steps often used in research (e.g., based on [Liu24]):
  - Feature selection (selecting a subset of columns).
  - NaN handling (e.g., filling missing values).
  - Categorical encoding (using one-hot encoding).
  - Numerical scaling (using MinMaxScaler or StandardScaler).
- Optionally reshapes the fully processed data into sequences suitable for TFT/XTFT models using reshape_xtft_data().
- Optionally uses caching, loading or saving the processed DataFrame or the final sequence arrays from or to disk (.joblib), to speed up repeated runs with the same parameters.
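The preprocessing sequence described above can be sketched on a toy table. This is a minimal illustration, not the library's implementation: the `rainfall` column and all values are hypothetical, `pd.get_dummies` stands in for scikit-learn's OneHotEncoder, and min-max scaling is computed by hand rather than with MinMaxScaler.

```python
import pandas as pd

# Toy stand-in for the raw data. The 'rainfall' column and all values
# are hypothetical; only the overall shape of the pipeline matters here.
raw = pd.DataFrame({
    "longitude": [113.1, 113.1, 113.2, 113.2],
    "latitude": [22.5, 22.5, 22.6, 22.6],
    "year": [2020, 2021, 2020, 2021],
    "geology": ["clay", "clay", "sand", None],
    "rainfall": [1200.0, None, 1500.0, 1800.0],
    "subsidence": [3.0, 4.5, 1.0, 2.0],
})

# 1. Feature selection: keep a fixed subset of columns.
keep = ["longitude", "latitude", "year", "geology", "rainfall", "subsidence"]
selected = raw[keep]

# 2. NaN handling: forward fill, then backward fill.
filled = selected.ffill().bfill()

# 3. Categorical encoding: one-hot columns replace the original column.
#    (pd.get_dummies is used here instead of OneHotEncoder.)
processed = pd.get_dummies(filled, columns=["geology"], dtype=float)

# 4. Min-max scaling of numerical feature columns only; coordinates,
#    year, and the target are left untouched in this sketch.
num_cols = ["rainfall"]
col_min = processed[num_cols].min()
col_max = processed[num_cols].max()
processed[num_cols] = (processed[num_cols] - col_min) / (col_max - col_min)

print(processed)
```

In this sketch the filled `rainfall` values 1200, 1200, 1500, 1800 map to 0, 0, 0.5, 1, and the `geology` column is replaced by `geology_clay` and `geology_sand`.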
- Parameters:
dataset_name ({'zhongshan', 'nansha'}, default 'zhongshan') – Which dataset to load and process (‘zhongshan’ or ‘nansha’).
n_samples (int, str, or None, default None) – Number of samples to load from the raw dataset file.
- If None or '*': loads the full dataset (~2000 rows).
- If int: sub-samples the specified number of rows using spatial stratification via spatial_sampling(). Must be a positive integer less than or equal to the total number of available samples.
as_frame (bool, default False) – Determines the return type. Only applies if return_sequences is False.
- If False: returns a Bunch object containing the processed DataFrame and metadata.
- If True: returns only the processed pandas DataFrame.
include_coords (bool, default True) – If True, include the ‘longitude’ and ‘latitude’ columns in the output frame (and Bunch attributes).
include_target (bool, default True) – If True, include the target column (‘subsidence’) in the output frame (and Bunch attributes).
data_home (str, optional) – Directory path used to cache raw datasets and processed files. If None, uses the path determined by get_data() (typically ~/fusionlab_data). Default is None.
download_if_missing (bool, default True) – If True, attempt to download the raw dataset file from the remote repository if it is not found locally.
force_download_raw (bool, default False) – If True, force a download of the raw dataset file, ignoring any local cache or packaged version.
random_state (int, optional) – Seed for the random number generator used during sub-sampling when n_samples is an integer. Ensures reproducibility.
apply_feature_select (bool, default True) – If True, select only the subset of features typically used in reference examples for the specified dataset_name. If False, attempt to use all columns found (after excluding coords/target).
apply_nan_ops (bool, default True) – If True, apply NaN handling using the internal nan_ops() utility with the strategy specified by nan_handling_method.
encode_categoricals (bool, default True) – If True, apply scikit-learn’s OneHotEncoder to the predefined categorical columns (‘geology’ and ‘density_tier’ for Zhongshan; ‘geology’ for Nansha). Adds new columns for the encoded features and removes the original categorical columns.
scale_numericals (bool, default True) – If True, apply feature scaling to the predefined numerical columns (excluding coordinates, year, target, and encoded categoricals) using the scaler specified by scaler_type. The target column is scaled only if scale_target is True.
scaler_type ({'minmax', 'standard'}, default 'minmax') – Type of scaler to use if scale_numericals is True.
return_sequences (bool, default False) – Controls the final output format.
- If True: performs sequence generation using reshape_xtft_data() and returns the sequence arrays.
- If False: skips sequence generation and returns the processed DataFrame or Bunch object (controlled by as_frame).
time_steps (int, default 4) – Lookback window size (number of past time steps) for sequence generation. Only used if return_sequences=True.
forecast_horizon (int, default 4) – Prediction horizon (number of future steps) for sequence generation. Only used if return_sequences=True.
target_col (str, default 'subsidence') – Name of the target variable column used for sequence generation.
scale_target (bool, default False) – Whether to scale the target column.
group_by_cols (bool or list of str, default True) – Controls how the data is partitioned before sequence generation.
- False: do not group by any columns; the entire dataset is treated as a single continuous time series.
- list of str: names of one or more DataFrame columns (e.g. [‘longitude’, ‘latitude’]) to group by; each unique group produces its own set of sequences.
- None: equivalent to False (no grouping).
use_processed_cache (bool, default True) – If True and return_sequences=False, attempt to load a previously saved processed DataFrame (and scaler/encoder info) from the cache directory before running the preprocessing steps.
use_sequence_cache (bool, default True) – If True and return_sequences=True, attempt to load previously saved sequence arrays from the cache directory before running preprocessing and sequence generation.
save_processed_frame (bool, default False) – If True and preprocessing is performed (cache miss or use_processed_cache=False), save the resulting processed DataFrame, scaler info, and encoder info to a joblib file in the cache directory. Ignored if return_sequences=True.
save_sequences (bool, default False) – If True and sequence generation is performed (cache miss or use_sequence_cache=False), save the resulting sequence arrays (static_data, dynamic_data, future_data, target_data) to a joblib file in the cache directory. Only used if return_sequences=True.
cache_suffix (str, default "") – Optional suffix appended to cache filenames (before ‘.joblib’) so that results from different processing variations (e.g., different n_samples or preprocessing flags) can be cached separately.
nan_handling_method (str, default 'fill') – Method used by nan_ops() if apply_nan_ops=True. Typically ‘fill’ (forward fill, then backward fill).
verbose (bool, default True) – If True, print status messages during file fetching, processing, caching, and sequence generation.
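To make the roles of time_steps, forecast_horizon, and group_by_cols concrete, the following sketch builds lookback/horizon windows separately for each group. It is an illustration under stated assumptions, not the behavior of reshape_xtft_data(): the helper names `make_windows` and `make_grouped_windows` are hypothetical, and only a single target series per location is windowed.

```python
import numpy as np

def make_windows(values, time_steps, forecast_horizon):
    """Slice one ordered series into (lookback, horizon) pairs."""
    values = np.asarray(values, dtype=float)
    n_windows = len(values) - time_steps - forecast_horizon + 1
    past, future = [], []
    for start in range(max(n_windows, 0)):
        split = start + time_steps
        past.append(values[start:split])
        future.append(values[split:split + forecast_horizon])
    # reshape keeps a 2-D shape even when a series is too short for any window
    return (np.array(past).reshape(-1, time_steps),
            np.array(future).reshape(-1, forecast_horizon))

def make_grouped_windows(groups, time_steps, forecast_horizon):
    """Window each group on its own so no sequence spans two locations."""
    parts = [make_windows(series, time_steps, forecast_horizon)
             for series in groups.values()]
    past = np.concatenate([p for p, _ in parts], axis=0)
    future = np.concatenate([f for _, f in parts], axis=0)
    return past, future

# Hypothetical yearly target values keyed by (longitude, latitude).
series_by_site = {
    (113.1, 22.5): [1, 2, 3, 4, 5, 6],
    (113.2, 22.6): [10, 20, 30, 40, 50],
}
past, future = make_grouped_windows(series_by_site, time_steps=3, forecast_horizon=2)
print(past.shape, future.shape)  # (3, 3) (3, 2)
```

A series of length n yields n - time_steps - forecast_horizon + 1 windows, so the first site contributes two windows and the second contributes one. Without grouping, the two sites would be concatenated and some windows would mix values from different locations.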
- Returns:
Processed Data – The type depends on return_sequences and as_frame:
- If return_sequences=True: returns a tuple containing the sequence arrays required by TFT/XTFT: (static_data, dynamic_data, future_data, target_data).
- If return_sequences=False and as_frame=True: returns the fully processed pandas DataFrame (after selection, NaN handling, encoding, scaling).
- If return_sequences=False and as_frame=False: returns a Bunch object containing the processed DataFrame (frame), extracted numerical features (data), feature names (feature_names), target info (target_names, target), coordinates (longitude, latitude), and a description (DESCR).
- Return type:
Union[Bunch, pd.DataFrame, Tuple[np.ndarray, ...]]
- Raises:
ValueError – If dataset_name is invalid, n_samples is invalid, or required columns are missing for the selected processing steps.
FileNotFoundError – If loading the underlying raw data fails (fetching from cache, package, or download).
References
[Liu24] Liu, J., et al. (2024). Machine learning-based techniques… Journal of Environmental Management, 352, 120078.
Examples
>>> from fusionlab.datasets import load_processed_subsidence_data
>>> # Load processed Zhongshan data as a Bunch object
>>> data_bunch = load_processed_subsidence_data(dataset_name='zhongshan',
...                                             as_frame=False,
...                                             return_sequences=False)
>>> print(data_bunch.frame.head())
>>> print(data_bunch.feature_names)

>>> # Load Nansha data, preprocess, and return sequences
>>> static, dynamic, future, target = load_processed_subsidence_data(
...     dataset_name='nansha',
...     return_sequences=True,
...     time_steps=6,
...     forecast_horizon=3,
...     scale_numericals=True,
...     scaler_type='standard',
...     verbose=False
... )
>>> print(f"Nansha sequences shapes: S={static.shape}, D={dynamic.shape},"
...       f" F={future.shape}, y={target.shape}")

>>> # Load a small sample and save processed frame
>>> df_proc_sample = load_processed_subsidence_data(
...     dataset_name='zhongshan',
...     n_samples=100,
...     random_state=42,
...     as_frame=True,
...     return_sequences=False,
...     save_processed_frame=True,
...     cache_suffix="_sample100"
... )
>>> print(f"Loaded and processed sample shape: {df_proc_sample.shape}")
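The last example relies on the processed-frame cache. The general load-or-compute pattern behind such a cache can be sketched as follows. This is an assumption-laden illustration, not the library's code: `load_or_compute` is a hypothetical helper, the file naming scheme is invented for the sketch, and the standard-library `pickle` module stands in for joblib.

```python
import pickle
import tempfile
from pathlib import Path

def load_or_compute(cache_dir, name, compute, *, use_cache=True,
                    save=False, cache_suffix=""):
    """Return (result, cache_hit). Hypothetical helper for illustration."""
    # The suffix goes before the extension so variants get separate files.
    cache_file = Path(cache_dir) / f"{name}{cache_suffix}.pkl"
    if use_cache and cache_file.exists():
        with cache_file.open("rb") as fh:
            return pickle.load(fh), True
    result = compute()
    if save:
        cache_file.parent.mkdir(parents=True, exist_ok=True)
        with cache_file.open("wb") as fh:
            pickle.dump(result, fh)
    return result, False

calls = []

def expensive():
    """Stand-in for the preprocessing pipeline; records each invocation."""
    calls.append(1)
    return {"rows": 100}

with tempfile.TemporaryDirectory() as tmp:
    # First call: cache miss, result is computed and saved.
    first, hit_first = load_or_compute(
        tmp, "zhongshan_processed", expensive, save=True, cache_suffix="_sample100")
    # Second call with the same suffix: served from the cache.
    second, hit_second = load_or_compute(
        tmp, "zhongshan_processed", expensive, cache_suffix="_sample100")
    # A different suffix points at a different file, so it is a miss.
    other, hit_other = load_or_compute(
        tmp, "zhongshan_processed", expensive, cache_suffix="_sample50")

print(hit_first, hit_second, hit_other, len(calls))  # False True False 2
```

The sketch shows why a distinct cache_suffix matters: results produced with different n_samples or preprocessing flags would otherwise share one file, and a later call could silently load data prepared with other settings.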