fusionlab.datasets.load_subsidence_pinn_data

fusionlab.datasets.load_subsidence_pinn_data(data_name='zhongshan', strategy='load', n_samples=None, include_coords=True, include_target=True, encode_categoricals=True, scale_numericals=True, scaler_type='minmax', as_frame=False, data_home=None, use_cache=True, save_cache=False, cache_suffix='', augment_data=False, augment_mode='both', group_by_cols=None, time_col=None, value_cols_interpolate=None, feature_cols_augment=None, interpolation_config=None, feature_config=None, target_name=None, interpolate_target=False, coordinate_precision=4, year_range=None, coords_range=None, vars_range=None, verbose=1)

Load and preprocess subsidence‐focused PINN data for Zhongshan or Nansha.

This function handles data retrieval (from a local CSV or a remote source), optional dummy‐data generation, caching, preprocessing (datetime conversion, one‐hot encoding, numeric scaling), and optional spatio‐temporal augmentation. When as_frame=False, it returns an XBunch object suitable for downstream modeling.

Parameters:
  • data_name (str, default 'zhongshan') – Which city dataset to load. Supported values: ‘zhongshan’, ‘nansha’. Case‐insensitive. Used to select city‐specific metadata (file names, feature lists, etc.).

  • strategy ({'load', 'generate', 'fallback'}, default 'load') –

    • ‘load’: strictly attempt to read the CSV from a local or downloaded location; raise an error if not found.

    • ‘generate’: skip CSV loading and always generate randomized “dummy” data matching the schema.

    • ‘fallback’: attempt CSV load; if loading fails (file missing or corrupted), generate dummy data instead.

  • n_samples (int or None, default None) – Number of dummy rows to generate when strategy is ‘generate’ or when generation is triggered under ‘fallback’. If None, defaults to 500 000 samples.

  • include_coords (bool, default True) – If True, include ‘longitude’ and ‘latitude’ arrays in the returned XBunch (under keys “longitude” and “latitude”). If False, drop those coordinate columns from the feature set.

  • include_target (bool, default True) – If True, include subsidence and GWL as targets in the XBunch (“target_names” and “target”). If False, return an XBunch with no “target” array.

  • encode_categoricals (bool, default True) – If True, one‐hot encode the city‐specific categorical columns (e.g., ‘geology’, ‘density_tier’). Otherwise, skip encoding.

  • scale_numericals (bool, default True) – If True, apply MinMaxScaler or StandardScaler (see scaler_type) to the city’s “main” numeric features (e.g., rainfall, density, seismic risk). Coordinates and targets are not scaled here.

  • scaler_type ({'minmax', 'standard'}, default 'minmax') – Which scaler to apply when scale_numericals=True.

    • ‘minmax’: uses sklearn.preprocessing.MinMaxScaler.

    • ‘standard’: uses sklearn.preprocessing.StandardScaler.

  • as_frame (bool, default False) –

    If True, return the processed pd.DataFrame directly. Otherwise, pack results into an XBunch with fields:

    • “frame”: the processed DataFrame

    • “data”: feature matrix (numpy array)

    • “feature_names”: list of column names in “data”

    • “target_names”: list of target column names (or [])

    • “target”: target array (or None)

    • “DESCR”: textual description

    • “longitude”, “latitude” (if include_coords=True)

    • “encoder_info”: dict of one‐hot encoder details

    • “scaler_info”: dict of scaler details

  • data_home (str or None, default None) – Root directory for caching and for locating local data files. Passed to fusionlab.datasets.get_data(). If None, uses the package’s default data directory.

  • use_cache (bool, default True) – If True, attempt to load a previously processed .joblib cache (filename includes data_name and cache_suffix). If the cache exists and is valid, skip reprocessing and return cached results.

  • save_cache (bool, default False) – If True, after successful processing, save the processed DataFrame and encoder/scaler info to a .joblib file under data_home. Subsequent calls with use_cache=True will load from this cache.

  • cache_suffix (str, default '') – Suffix to append to the cache filename (before .joblib). Useful to distinguish different processing parameters or versions.

  • augment_data (bool, default False) – If True, apply spatio‐temporal augmentation via augment_city_spatiotemporal_data. This can interpolate missing values, add noise to features, and upsample the time dimension.

  • augment_mode (str, default 'both') – Passed to the augmentation routine. Typical options include ‘both’ (interpolate + noise), ‘interpolate_only’, ‘noise_only’. See fusionlab.utils.geo_utils.augment_city_spatiotemporal_data for full details.

  • group_by_cols (list[str] or None, default None) – Which columns to group by during augmentation (e.g., coordinates). If None, defaults to the city’s spatial columns (lon_col, lat_col).

  • time_col (str or None, default None) – Time column name (string) to pass to augmentation. If None, uses the city’s configured “time_col”.

  • value_cols_interpolate (list[str] or None, default None) – Which numeric columns to interpolate during augmentation (e.g., “GWL”, “rainfall_mm”). If None, uses the city’s “default_value_cols_interpolate” list.

  • feature_cols_augment (list[str] or None, default None) – Which columns to add noise to during augmentation. If None, uses the city’s “default_feature_cols_augment” list. Note that the target column is never noise‐augmented by default.

  • interpolation_config (dict or None, default None) – Configuration passed to the interpolation step of augmentation (e.g., {‘freq’: ‘AS’, ‘method’: ‘linear’}). If None, defaults to {‘freq’: ‘AS’, ‘method’: ‘linear’}.

  • feature_config (dict or None, default None) – Configuration for adding noise, e.g., {‘noise_level’: 0.01, ‘noise_type’: ‘gaussian’}. If None, sensible defaults are used.

  • target_name (str or None, default None) – Name of the column to interpolate when interpolate_target=True. If None, the city’s configured “subsidence_col” is used.

  • interpolate_target (bool, default False) – If True, include the target column itself in the interpolation pass. (Useful when filling gaps in observed subsidence values.)

  • coordinate_precision (int or None, default 4) – Number of decimal places to round latitude/longitude to. After rounding, spatial grouping (e.g., augmentation) will treat points at the same rounded coordinate as identical. Set to None to skip coordinate rounding.

  • year_range (tuple[int, int] or None, default None) – If dummy data generation is used, the (min_year, max_year) range for uniformly sampling integer years. If None, defaults to (2000, 2025).

  • coords_range (tuple[tuple[float, float], tuple[float, float]] or None, default None) – If dummy generation is used, spatial bounds as ((lon_min, lon_max), (lat_min, lat_max)). If None, defaults to ((113.0, 113.8), (22.3, 22.8)) for Zhongshan and a similar range for Nansha.

  • vars_range (dict or None, default None) – If dummy generation is used, a dictionary specifying ranges for other variables. For example: {“rainfall_mm”: (500, 2500), “GWL”: (1.0, 4.0)}. If a variable is omitted, its default distribution is used.

  • verbose ({0, 1, 2}, default 1) – Controls verbosity of console output:

    • 0: silent (except for exceptions).

    • 1: high‐level info messages.

    • 2: debug‐level messages (detailed shape/log prints).
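
The three strategy values described above amount to a small dispatch rule. The following is a minimal sketch of that rule only, not the library's implementation: resolve_rows, missing_csv, and dummy_rows are hypothetical names introduced for illustration, and the exception types caught under ‘fallback’ are an assumption.

```python
from typing import Callable, Dict, List

Rows = List[Dict[str, float]]


def resolve_rows(strategy: str,
                 load_fn: Callable[[], Rows],
                 generate_fn: Callable[[], Rows]) -> Rows:
    """Hypothetical helper mirroring the documented 'strategy' semantics."""
    if strategy == "generate":
        # Never touches the CSV; always produces dummy rows.
        return generate_fn()
    if strategy == "load":
        # Strict mode: any loading error propagates to the caller.
        return load_fn()
    if strategy == "fallback":
        try:
            return load_fn()
        except (FileNotFoundError, ValueError):
            # Assumed failure modes: missing or unreadable file.
            return generate_fn()
    raise ValueError(f"Unknown strategy: {strategy!r}")


def missing_csv() -> Rows:
    # Stand-in loader that simulates a missing data file.
    raise FileNotFoundError("zhongshan CSV not found")


def dummy_rows() -> Rows:
    # Stand-in generator returning a single synthetic row.
    return [{"year": 2000.0, "GWL": 3.2}]


rows = resolve_rows("fallback", missing_csv, dummy_rows)
```

With the stand-in loader above, ‘fallback’ returns the generated row, whereas ‘load’ would raise FileNotFoundError.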

Returns:

If as_frame=True, returns the processed DataFrame. Otherwise, returns an XBunch with the following keys:

  • frame: pandas DataFrame of processed data

  • data: numpy array (rows × features) ready for modeling

  • feature_names: list of column names corresponding to data

  • target_names: list of target column names (or empty list)

  • target: numpy array of target values (or None if include_target=False)

  • DESCR: a multi‐line description string summarizing processing steps

  • longitude, latitude: numpy arrays if include_coords=True

  • encoder_info: dict containing one‐hot encoder metadata

  • scaler_info: dict containing scaler metadata

Return type:

Union[pd.DataFrame, XBunch]

Notes

  1. Caching: When use_cache=True, the function looks for a .joblib file named {data_name}_{basename}_processed{cache_suffix}.joblib under data_home. If found and valid, this file is loaded to skip reprocessing. If save_cache=True, the final processed DataFrame is saved to the same path for future reuse.

  2. Dummy Generation: When strategy=‘generate’, or when loading fails under ‘fallback’, the function calls generate_dummy_pinn_data(…) to produce a synthetic dataset. Users can override year_range, coords_range, and vars_range to control the random distributions. See the fusionlab.utils.geo_utils.generate_dummy_pinn_data docstring for details on default behavior.

  3. Augmentation: When augment_data=True, the function invokes augment_city_spatiotemporal_data(…) with parameters:

    • group_by_cols: columns used to group points (spatially)

    • time_col: column used for temporal interpolation

    • value_cols_interpolate: numeric columns to interpolate

    • feature_cols_augment: columns to which noise is added

    • interpolation_config: interpolation parameters (freq/method)

    • feature_config: noise configuration (level/type)

    • coordinate_precision: precision used to round coords before grouping

    Ensure that fusionlab.utils.geo_utils.augment_city_spatiotemporal_data is available in your installation if using augmentation.

  4. One‐Hot Encoding: Only the configured categorical columns (e.g., ‘geology’, ‘density_tier’) are encoded. All other string columns remain unchanged.

  5. Numeric Scaling: Only the city’s numerical_main features (e.g., rainfall, density, seismic risk) are passed through the chosen scaler. Coordinates, time numeric columns, and targets are not scaled here; downstream models or sequence preprocessors may handle those separately.
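
To make the difference between the two scaler_type options concrete, the sketch below applies both transforms to a single numeric column using only the standard library. It illustrates the formulas behind MinMaxScaler (default feature range of 0 to 1) and StandardScaler (which divides by the population standard deviation); the function names and the sample rainfall values are made up for illustration, and the zero-spread guard is a choice of this sketch.

```python
from statistics import fmean, pstdev


def minmax_scale(values):
    """Map values linearly onto [0, 1]: (v - min) / (max - min)."""
    lo, hi = min(values), max(values)
    span = hi - lo
    # Guard against a constant column to avoid division by zero.
    return [(v - lo) / span if span else 0.0 for v in values]


def standard_scale(values):
    """Centre on the mean and divide by the population std (ddof=0)."""
    mu = fmean(values)
    sigma = pstdev(values, mu)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]


# Illustrative rainfall values in mm (not taken from the real dataset).
rainfall_mm = [500.0, 1500.0, 2500.0]

scaled_minmax = minmax_scale(rainfall_mm)
scaled_standard = standard_scale(rainfall_mm)

print(scaled_minmax)  # [0.0, 0.5, 1.0]
```

Min‐max scaling keeps values inside a fixed interval, while standard scaling yields zero mean and unit variance, which tends to be less sensitive to a single extreme value setting the range.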

Examples

1. Simple load of processed Zhongshan data (no caching, no augmentation):

>>> from fusionlab.datasets.load import load_subsidence_pinn_data
>>> zbunch = load_subsidence_pinn_data(
...     data_name='zhongshan',
...     strategy='load',
...     use_cache=False,
...     encode_categoricals=True,
...     scale_numericals=True,
...     scaler_type='minmax',
...     as_frame=False,
...     verbose=1
... )
>>> print(zbunch.frame.head())
     year  longitude  latitude  GWL  rainfall_mm  density_tier_...  ...
     2000   113.05     22.35    3.2  1200.0     ...

2. Force generation of 100 000 dummy Nansha samples, skip encoding:

>>> nbunch = load_subsidence_pinn_data(
...     data_name='nansha',
...     strategy='generate',
...     n_samples=100000,
...     encode_categoricals=False,
...     scale_numericals=True,
...     scaler_type='standard',
...     save_cache=True,
...     cache_suffix='_v1',
...     verbose=2
... )
>>> print(nbunch.data.shape)
(100000, 5)  # e.g., 5 numeric features

3. Load Zhongshan data, but fall back to dummy data if the file is missing, then apply augmentation with yearly interpolation and Gaussian noise:

>>> zbunch_aug = load_subsidence_pinn_data(
...     data_name='zhongshan',
...     strategy='fallback',
...     use_cache=False,
...     augment_data=True,
...     augment_mode='both',
...     interpolation_config={'freq':'YS','method':'linear'},
...     feature_config={'noise_level':0.02,'noise_type':'gaussian'},
...     verbose=1
... )
>>> print(zbunch_aug.frame.shape)
(550000, 12)  # e.g., augmented rows added after interpolation
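
Example 2 saves a cache with cache_suffix='_v1'. Following the filename pattern given in the Notes, {data_name}_{basename}_processed{cache_suffix}.joblib, the resulting path could be assembled as sketched below. The cache_path helper is hypothetical, and 'pinn_data' is a placeholder for {basename}, whose actual value depends on the city's data file and is not specified here.

```python
import tempfile
from pathlib import Path


def cache_path(data_home, data_name, basename, cache_suffix=""):
    """Hypothetical helper: build the cache path from the documented pattern."""
    filename = f"{data_name}_{basename}_processed{cache_suffix}.joblib"
    return Path(data_home) / filename


# A temporary directory stands in for data_home in this sketch.
with tempfile.TemporaryDirectory() as tmp:
    # 'pinn_data' is a placeholder basename, not the library's real value.
    path = cache_path(tmp, "nansha", "pinn_data", cache_suffix="_v1")
    cache_name = path.name
    cache_exists = path.exists()  # nothing has been saved yet

print(cache_name)  # nansha_pinn_data_processed_v1.joblib
```

Because the suffix is part of the filename, calls that use different suffixes read and write separate cache files.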