fusionlab.datasets.fetch_nansha_data

fusionlab.datasets.fetch_nansha_data(*, n_samples=None, as_frame=False, include_coords=True, include_target=True, data_home=None, download_if_missing=True, force_download=False, random_state=None, verbose=True)

Fetch the sampled Nansha land subsidence dataset (2000 points).

Loads the nansha_2000.csv file, which contains features related to land subsidence in Nansha, China, spatially sampled down to 2000 representative data points. It includes geographical coordinates, temporal information (year), geological factors, hydrogeological factors (GWL, rainfall), building concentration, risk scores, soil thickness, and the measured land subsidence (target).

Pass n_samples to sub-sample the data further; the sub-sampling is performed by spatial_sampling().

Column details: ‘longitude’, ‘latitude’, ‘year’, ‘building_concentration’, ‘geology’, ‘GWL’, ‘rainfall_mm’, ‘normalized_seismic_risk_score’, ‘soil_thickness’, ‘subsidence’.
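As a rough illustration of how the column list above interacts with the include_coords and include_target flags documented under Parameters, the sketch below filters the column names. The helper select_columns is hypothetical and only mirrors the documented behaviour; it does not reproduce the library's internals.

```python
# Column names as listed in the documentation above.
NANSHA_COLUMNS = [
    "longitude", "latitude", "year", "building_concentration", "geology",
    "GWL", "rainfall_mm", "normalized_seismic_risk_score",
    "soil_thickness", "subsidence",
]


def select_columns(include_coords=True, include_target=True):
    """Hypothetical helper: return the columns kept for the given flags."""
    dropped = set()
    if not include_coords:
        # include_coords=False drops the two coordinate columns.
        dropped.update({"longitude", "latitude"})
    if not include_target:
        # include_target=False drops the measured subsidence column.
        dropped.add("subsidence")
    # Preserve the documented column order.
    return [name for name in NANSHA_COLUMNS if name not in dropped]


print(select_columns(include_coords=False, include_target=True))
```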

The function searches for the data file (nansha_2000.csv) using the logic in download_file_if() (Cache > Package > Download).
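The lookup order described above (cache first, then the packaged copy, then a download) can be sketched roughly as follows. This is only an illustration of the documented order, not the actual body of download_file_if(); the function name resolve_data_file, its parameters, and the downloader callback are all hypothetical.

```python
from pathlib import Path
from typing import Callable, Optional


def resolve_data_file(
    filename: str,
    cache_dir: Path,
    package_dir: Path,
    downloader: Optional[Callable[[Path], None]] = None,
) -> Path:
    """Hypothetical helper illustrating a Cache > Package > Download lookup."""
    cached = cache_dir / filename
    packaged = package_dir / filename

    # 1. Prefer a copy that is already in the cache directory.
    if cached.is_file():
        return cached

    # 2. Fall back to a copy shipped inside the package.
    if packaged.is_file():
        return packaged

    # 3. Last resort: ask the caller-supplied downloader to fill the cache.
    #    Passing downloader=None stands in for download_if_missing=False.
    if downloader is not None:
        cache_dir.mkdir(parents=True, exist_ok=True)
        downloader(cached)
        if cached.is_file():
            return cached

    raise FileNotFoundError(f"{filename} could not be found or downloaded")
```

Here the final FileNotFoundError corresponds to the case listed under Raises, where the file is neither available locally nor downloadable.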

Parameters:
  • n_samples (int, str or None, default None) –

    Number of samples to load.

    - If None or '*': Load the full sampled dataset (~2000 rows).
    - If int: Sub-sample the specified number using spatial stratification via spatial_sampling(). Must be <= number of rows in the full file. Requires spatial_sampling to be available.

  • as_frame (bool, default False) – Return type: False for Bunch object, True for DataFrame.

  • include_coords (bool, default True) – Include ‘longitude’ and ‘latitude’ columns.

  • include_target (bool, default True) – Include the ‘subsidence’ column.

  • data_home (str, optional) – Path to cache directory. Defaults to ~/fusionlab_data.

  • download_if_missing (bool, default True) – Attempt download if file is not found locally.

  • force_download (bool, default False) – Force download attempt even if file exists locally.

  • random_state (int, optional) – Seed for the random number generator used during sub-sampling.

  • verbose (bool, default True) – Print status messages during file fetching and sampling.
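The accepted values of n_samples can be summarised with a small validation sketch. The function normalize_n_samples is hypothetical, and the lower bound of 1 and the rejection of booleans are assumptions made for the example; the documentation only states that the value must not exceed the number of rows and that invalid values raise ValueError.

```python
def normalize_n_samples(n_samples, n_rows):
    """Hypothetical check: return None for 'load everything' or a valid int."""
    # None and '*' both mean "use the full sampled dataset".
    if n_samples is None or n_samples == "*":
        return None

    # bool is a subclass of int, so it is rejected explicitly (assumption).
    if isinstance(n_samples, bool) or not isinstance(n_samples, int):
        raise ValueError(f"n_samples must be an int, None or '*', got {n_samples!r}")

    # Documented upper bound: cannot request more rows than the file holds.
    # The lower bound of 1 is an assumption of this sketch.
    if not 1 <= n_samples <= n_rows:
        raise ValueError(f"n_samples must be between 1 and {n_rows}, got {n_samples}")

    return n_samples
```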

Returns:

data – Loaded or sampled data. When as_frame is False, the returned Bunch object includes the attributes frame, data, feature_names, target_names, target, coords, and DESCR.

Return type:

Bunch or pandas.DataFrame
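A minimal usage sketch for the two return types, using only the parameters and Bunch attributes documented here. The wrapper names, the sample size of 500, and the seed of 42 are arbitrary choices for illustration; calling either wrapper requires fusionlab to be installed and the data file to be available or downloadable.

```python
def load_nansha_frame(n_samples=500, seed=42):
    """Return a reproducible spatial sub-sample as a pandas.DataFrame."""
    # Imported lazily so this module can be loaded without fusionlab installed.
    from fusionlab.datasets import fetch_nansha_data

    return fetch_nansha_data(
        n_samples=n_samples,   # sub-sample via spatial_sampling()
        as_frame=True,         # DataFrame instead of Bunch
        random_state=seed,     # makes the sub-sampling repeatable
        verbose=False,
    )


def load_nansha_bunch():
    """Return (data, target, feature_names) from the Bunch return type."""
    from fusionlab.datasets import fetch_nansha_data

    bunch = fetch_nansha_data(as_frame=False, verbose=False)
    # Attribute names are taken from the Returns description.
    return bunch.data, bunch.target, bunch.feature_names
```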

Raises:
  • ValueError – If n_samples is invalid.

  • FileNotFoundError – If the dataset file cannot be found or downloaded.

  • OSError – If there is an error reading the dataset file.
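The exceptions above can be handled as in the following sketch. Because FileNotFoundError is a subclass of OSError in Python, it has to be caught before the more general OSError clause. The wrapper fetch_or_none is hypothetical; it accepts the fetch function as an argument so the pattern can be tried without the dataset being present.

```python
def fetch_or_none(fetcher, **kwargs):
    """Hypothetical wrapper: call a fetch function and report documented errors."""
    try:
        return fetcher(**kwargs)
    except ValueError as exc:
        # Raised for an invalid n_samples value.
        print(f"Invalid argument: {exc}")
    except FileNotFoundError as exc:
        # Must come before OSError, which is its parent class.
        print(f"Dataset file not found or not downloadable: {exc}")
    except OSError as exc:
        # Any other problem while reading the file.
        print(f"Could not read dataset file: {exc}")
    return None
```

With fusionlab installed, a call might look like fetch_or_none(fetch_nansha_data, n_samples=500, as_frame=True).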