fusionlab.datasets.load_subsidence_pinn_data
- fusionlab.datasets.load_subsidence_pinn_data(data_name='zhongshan', strategy='load', n_samples=None, include_coords=True, include_target=True, encode_categoricals=True, scale_numericals=True, scaler_type='minmax', as_frame=False, data_home=None, use_cache=True, save_cache=False, cache_suffix='', augment_data=False, augment_mode='both', group_by_cols=None, time_col=None, value_cols_interpolate=None, feature_cols_augment=None, interpolation_config=None, feature_config=None, target_name=None, interpolate_target=False, coordinate_precision=4, year_range=None, coords_range=None, vars_range=None, verbose=1)
Load and preprocess subsidence‐focused PINN data for Zhongshan or Nansha.
This function handles data retrieval (from local CSV or remote), optional dummy‐data generation, caching, preprocessing (datetime conversion, one‐hot encoding, numeric scaling), and optional spatio‐temporal augmentation. When as_frame=False, it returns an XBunch object suitable for downstream modeling.
- Parameters:
data_name (str, default 'zhongshan') – Which city dataset to load. Supported values: ‘zhongshan’, ‘nansha’. Case‐insensitive. Used to select city‐specific metadata (file names, feature lists, etc.).
strategy ({'load', 'generate', 'fallback'}, default 'load') –
‘load’: strictly attempt to read the CSV from a local or downloaded location; raise an error if not found.
‘generate’: skip CSV loading and always generate randomized “dummy” data matching the schema.
‘fallback’: attempt CSV load; if loading fails (file missing or corrupted), generate dummy data instead.
n_samples (int or None, default None) – Number of dummy rows to generate when strategy is ‘generate’ or when generation is triggered under ‘fallback’. If None, defaults to 500 000 samples.
include_coords (bool, default True) – If True, include ‘longitude’ and ‘latitude’ arrays in the returned XBunch (under keys “longitude” and “latitude”). If False, drop those coordinate columns from the feature set.
include_target (bool, default True) – If True, include subsidence and GWL as targets in the XBunch (“target_names” and “target”). If False, return an XBunch with no “target” array.
encode_categoricals (bool, default True) – If True, One‐Hot Encode any city‐specific categorical columns (e.g., ‘geology’, ‘density_tier’). Otherwise, skip encoding.
scale_numericals (bool, default True) – If True, apply MinMaxScaler or StandardScaler (see scaler_type) to the city’s “main” numeric features (e.g., rainfall, density, seismic risk). Coordinates and targets are not scaled here.
scaler_type ({'minmax', 'standard'}, default 'minmax') – Which scaler to apply when scale_numericals=True. - ‘minmax’ uses sklearn.preprocessing.MinMaxScaler. - ‘standard’ uses sklearn.preprocessing.StandardScaler.
as_frame (bool, default False) – If True, return the processed pd.DataFrame directly. Otherwise, pack results into an XBunch with fields:
“frame”: the processed DataFrame
“data”: feature matrix (numpy array)
“feature_names”: list of column names in “data”
“target_names”: list of target column names (or [])
“target”: target array (or None)
“DESCR”: textual description
“longitude”, “latitude” (if include_coords=True)
“encoder_info”: dict of one‐hot encoder details
“scaler_info”: dict of scaler details
data_home (str or None, default None) – Root directory for caching and for locating local data files. Passed to fusionlab.datasets.get_data(). If None, uses the package’s default data directory.
use_cache (bool, default True) – If True, attempt to load a previously processed .joblib cache (filename includes data_name and cache_suffix). If the cache exists and is valid, skip reprocessing and return cached results.
save_cache (bool, default False) – If True, after successful processing, save the processed DataFrame and encoder/scaler info to a .joblib file under data_home. Subsequent calls with use_cache=True will load from this cache.
cache_suffix (str, default '') – Suffix to append to the cache filename (before .joblib). Useful to distinguish different processing parameters or versions.
augment_data (bool, default False) – If True, apply spatio‐temporal augmentation via augment_city_spatiotemporal_data. This can interpolate missing values, add noise to features, and upsample the time dimension.
augment_mode (str, default 'both') – Passed to the augmentation routine. Typical options include ‘both’ (interpolate + noise), ‘interpolate_only’, ‘noise_only’. See fusionlab.utils.geo_utils.augment_city_spatiotemporal_data for full details.
group_by_cols (list[str] or None, default None) – Which columns to group by during augmentation (e.g., coordinates). If None, defaults to the city’s spatial columns (lon_col, lat_col).
time_col (str or None, default None) – Time column name (string) to pass to augmentation. If None, uses the city’s configured “time_col”.
value_cols_interpolate (list[str] or None, default None) – Which numeric columns to interpolate during augmentation (e.g., “GWL”, “rainfall_mm”). If None, uses the city’s “default_value_cols_interpolate” list.
feature_cols_augment (list[str] or None, default None) – Which columns to add noise to during augmentation. If None, uses the city’s “default_feature_cols_augment” list. Note that the target column is never noise‐augmented by default.
interpolation_config (dict or None, default None) – Configuration passed to the interpolation step of augmentation (e.g., {‘freq’: ‘AS’, ‘method’: ‘linear’}). If None, defaults to {‘freq’: ‘AS’, ‘method’: ‘linear’}.
feature_config (dict or None, default None) – Configuration for adding noise, e.g., {‘noise_level’: 0.01, ‘noise_type’: ‘gaussian’}. If None, sensible defaults are used.
target_name (str or None, default None) – If interpolate_target=True, this names the column to interpolate (default is the city’s “subsidence_col”). If not provided, the function uses the configured subsidence column.
interpolate_target (bool, default False) – If True, include the target column itself in the interpolation pass. (Useful when filling gaps in observed subsidence values.)
coordinate_precision (int or None, default 4) – Number of decimal places to round latitude/longitude to. After rounding, spatial grouping (e.g., augmentation) will treat points at the same rounded coordinate as identical. Set to None to skip coordinate rounding.
year_range (tuple[int, int] or None, default None) – If dummy data generation is used, the (min_year, max_year) range for uniformly sampling integer years. If None, defaults to (2000, 2025).
coords_range (tuple[tuple[float, float], tuple[float, float]] or None, default None) – If dummy generation is used, spatial bounds as ((lon_min, lon_max), (lat_min, lat_max)). If None, defaults to ((113.0, 113.8), (22.3, 22.8)) for Zhongshan and a similar range for Nansha.
vars_range (dict or None, default None) – If dummy generation is used, a dictionary specifying ranges for other variables. For example: {“rainfall_mm”: (500, 2500), “GWL”: (1.0, 4.0)}. If a variable is omitted, its default distribution is used.
verbose ({0, 1, 2}, default 1) – Controls verbosity of console output: - 0: silent (except for exceptions). - 1: high‐level info messages. - 2: debug‐level messages (detailed shape/log prints).
- Returns:
If as_frame=True, returns the processed DataFrame. Otherwise, returns an XBunch with the following keys:
frame: pandas DataFrame of processed data
data: numpy array (rows × features) ready for modeling
feature_names: list of column names corresponding to data
target_names: list of target column names (or empty list)
target: numpy array of target values (or None if include_target=False)
DESCR: a multi‐line description string summarizing processing steps
longitude, latitude: numpy arrays if include_coords=True
encoder_info: dict containing one‐hot encoder metadata
scaler_info: dict containing scaler metadata
- Return type:
Union[pd.DataFrame,XBunch]
Notes
Caching: When use_cache=True, the function looks for a .joblib file named {data_name}_{basename}_processed{cache_suffix}.joblib under data_home. If found and valid, this file is loaded to skip reprocessing. If save_cache=True, the final processed DataFrame is saved to the same path for future reuse.
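The caching behavior described above can be pictured with a small stand-alone sketch. It is not the library's implementation: build_cache_path and load_or_process are hypothetical helpers, the basename value "example" is a placeholder for whatever source-file basename the loader uses, and the validity check mentioned in the note is reduced to a simple existence test.

```python
import os
import tempfile

import joblib
import pandas as pd


def build_cache_path(data_home, data_name, basename, cache_suffix=""):
    """Hypothetical helper that mirrors the documented filename pattern."""
    filename = f"{data_name}_{basename}_processed{cache_suffix}.joblib"
    return os.path.join(data_home, filename)


def load_or_process(cache_path, process_fn, use_cache=True, save_cache=False):
    """Return (result, loaded_from_cache); hypothetical stand-in for the loader."""
    if use_cache and os.path.exists(cache_path):
        # Assumption: "valid" is reduced here to "the file exists".
        return joblib.load(cache_path), True
    result = process_fn()
    if save_cache:
        joblib.dump(result, cache_path)
    return result, False


def make_frame():
    # Placeholder for the real processing pipeline.
    return pd.DataFrame({"year": [2000, 2001], "GWL": [3.2, 3.1]})


data_home = tempfile.mkdtemp()
cache_path = build_cache_path(data_home, "zhongshan", "example", "_v1")

# The first call processes and writes the cache; the second call reads it back.
first, first_from_cache = load_or_process(cache_path, make_frame, save_cache=True)
second, second_from_cache = load_or_process(cache_path, make_frame)
```

Changing cache_suffix produces a different file name, which is why it is a convenient way to keep caches for different preprocessing settings apart.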
Dummy Generation: When strategy=‘generate’ or when fallback generation is triggered under ‘fallback’, the function calls generate_dummy_pinn_data(…) to produce a synthetic dataset. Users can override year_range, coords_range, and vars_range to control the random distributions. See the fusionlab.utils.geo_utils.generate_dummy_pinn_data docstring for details on default behavior.
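To illustrate how year_range, coords_range, and vars_range shape a synthetic dataset, here is a minimal sketch that uses only NumPy and pandas. The function make_dummy_frame is hypothetical and is not generate_dummy_pinn_data; in particular, sampling every variable in vars_range uniformly is an assumption made for the example, and the seed argument exists only to make the sketch reproducible.

```python
import numpy as np
import pandas as pd


def make_dummy_frame(
    n_samples,
    year_range=(2000, 2025),
    coords_range=((113.0, 113.8), (22.3, 22.8)),
    vars_range=None,
    seed=0,
):
    """Hypothetical generator; not the library's generate_dummy_pinn_data."""
    rng = np.random.default_rng(seed)
    (lon_min, lon_max), (lat_min, lat_max) = coords_range
    frame = pd.DataFrame(
        {
            # Integer years sampled uniformly, both bounds included.
            "year": rng.integers(
                year_range[0], year_range[1], size=n_samples, endpoint=True
            ),
            "longitude": rng.uniform(lon_min, lon_max, size=n_samples),
            "latitude": rng.uniform(lat_min, lat_max, size=n_samples),
        }
    )
    # Assumption: each extra variable is drawn uniformly from its (low, high) range.
    for name, (low, high) in (vars_range or {}).items():
        frame[name] = rng.uniform(low, high, size=n_samples)
    return frame


dummy = make_dummy_frame(
    1000,
    vars_range={"rainfall_mm": (500, 2500), "GWL": (1.0, 4.0)},
)
```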
Augmentation: When augment_data=True, the function invokes augment_city_spatiotemporal_data(…) with parameters:
group_by_cols: columns used to group points (spatially)
time_col: column used for temporal interpolation
value_cols_interpolate: numeric columns to interpolate
feature_cols_augment: columns to which noise is added
interpolation_config: interpolation parameters (freq/method)
feature_config: noise configuration (level/type)
coordinate_precision: precision used to round coords before grouping
Ensure that fusionlab.utils.geo_utils.augment_city_spatiotemporal_data is available in your installation if using augmentation.
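The roles of the augmentation parameters listed above can be sketched with plain pandas. This is a simplified illustration, not augment_city_spatiotemporal_data: it rounds coordinates, interpolates one value column within each spatial group, and adds Gaussian noise to one feature column, but it does not upsample the time axis. Scaling the noise by the column's standard deviation is an assumption made for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "longitude": [113.100001, 113.100002, 113.100003, 113.5, 113.5, 113.5],
        "latitude": [22.4, 22.4, 22.4, 22.6, 22.6, 22.6],
        "year": [2000, 2001, 2002, 2000, 2001, 2002],
        "GWL": [3.0, np.nan, 2.0, 1.0, np.nan, 1.5],
        "rainfall_mm": [1200.0, 1300.0, 1250.0, 1500.0, 1450.0, 1600.0],
    }
)

group_by_cols = ["longitude", "latitude"]
time_col = "year"
coordinate_precision = 4
noise_level = 0.02

# Round coordinates so that nearly identical points fall into the same group.
df[group_by_cols] = df[group_by_cols].round(coordinate_precision)
df = df.sort_values(group_by_cols + [time_col]).reset_index(drop=True)

# Linear interpolation of the value column within each spatial group.
df["GWL"] = df.groupby(group_by_cols)["GWL"].transform(
    lambda s: s.interpolate(method="linear")
)

# Gaussian noise on a feature column; relative scaling is an assumption.
rng = np.random.default_rng(42)
sigma = noise_level * df["rainfall_mm"].std()
df["rainfall_mm_noisy"] = df["rainfall_mm"] + rng.normal(0.0, sigma, size=len(df))

n_groups = df.groupby(group_by_cols).ngroups
```

Without the rounding step, the first three rows would form three separate groups of one row each, and the missing GWL value could not be interpolated from its neighbors.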
One‐Hot Encoding: Only the configured categorical columns (e.g., ‘geology’, ‘density_tier’) are encoded. All other string columns remain unchanged.
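The selective encoding described above can be approximated with pandas.get_dummies, used here as a stand-in for whatever encoder the loader applies internally. The site_id column and the layout of encoder_info are hypothetical and serve only to show that unlisted string columns pass through unchanged.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "site_id": ["A1", "B2", "C3"],  # string column that is NOT configured
        "geology": ["clay", "sand", "clay"],
        "density_tier": ["low", "high", "low"],
        "rainfall_mm": [1200.0, 1500.0, 1350.0],
    }
)

# Only the configured categorical columns are expanded into indicator columns.
categorical_cols = ["geology", "density_tier"]
encoded = pd.get_dummies(df, columns=categorical_cols, dtype=float)

# Hypothetical metadata layout: the categories seen for each encoded column.
encoder_info = {
    col: sorted(df[col].unique().tolist()) for col in categorical_cols
}
```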
Numeric Scaling: Only the city’s numerical_main features (e.g., rainfall, density, seismic risk) are passed through the chosen scaler. Coordinates, time numeric columns, and targets are not scaled here; downstream models or sequence preprocessors may handle those separately.
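As a rough illustration of scaling only the main numeric features, the sketch below applies the scikit-learn scaler selected by scaler_type to a chosen column list and leaves coordinates and the time column untouched. The helper scale_main_numericals and the column names are assumptions for the example, not part of fusionlab.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler


def scale_main_numericals(df, numerical_main, scaler_type="minmax"):
    """Hypothetical helper: scale only the listed columns, return a copy."""
    scalers = {"minmax": MinMaxScaler, "standard": StandardScaler}
    if scaler_type not in scalers:
        raise ValueError(f"Unknown scaler_type: {scaler_type!r}")
    scaler = scalers[scaler_type]()
    out = df.copy()
    out[numerical_main] = scaler.fit_transform(out[numerical_main])
    return out, scaler


df = pd.DataFrame(
    {
        "longitude": [113.05, 113.20, 113.40],
        "latitude": [22.35, 22.50, 22.70],
        "year": [2000, 2001, 2002],
        "rainfall_mm": [1200.0, 1500.0, 1800.0],
        "density": [0.2, 0.5, 0.9],
    }
)

scaled, scaler = scale_main_numericals(
    df, ["rainfall_mm", "density"], scaler_type="minmax"
)
```

Keeping the fitted scaler around makes it possible to call scaler.inverse_transform later, which is the kind of information a “scaler_info” entry would need to carry.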
Examples
1. Simple load of processed Zhongshan data (no caching, no augmentation):
>>> from fusionlab.datasets.load import load_subsidence_pinn_data
>>> zbunch = load_subsidence_pinn_data(
...     data_name='zhongshan',
...     strategy='load',
...     use_cache=False,
...     encode_categoricals=True,
...     scale_numericals=True,
...     scaler_type='minmax',
...     as_frame=False,
...     verbose=1
... )
>>> print(zbunch.frame.head())
   year  longitude  latitude  GWL  rainfall_mm  density_tier_...
...  2000  113.05  22.35  3.2  1200.0  ...
2. Force generation of 100 000 dummy Nansha samples, skip encoding:
>>> nbunch = load_subsidence_pinn_data(
...     data_name='nansha',
...     strategy='generate',
...     n_samples=100000,
...     encode_categoricals=False,
...     scale_numericals=True,
...     scaler_type='standard',
...     save_cache=True,
...     cache_suffix='_v1',
...     verbose=2
... )
>>> print(nbunch.data.shape)
(100000, 5)  # e.g., 5 numeric features
3. Load Zhongshan data, falling back to dummy data if the file is missing, then apply augmentation with yearly interpolation and Gaussian noise:
>>> zbunch_aug = load_subsidence_pinn_data(
...     data_name='zhongshan',
...     strategy='fallback',
...     use_cache=False,
...     augment_data=True,
...     augment_mode='both',
...     interpolation_config={'freq': 'YS', 'method': 'linear'},
...     feature_config={'noise_level': 0.02, 'noise_type': 'gaussian'},
...     verbose=1
... )
>>> print(zbunch_aug.frame.shape)
(e.g., 550000, 12)  # Augmented rows added after interpolation