fusionlab.datasets.fetch_zhongshan_data

fusionlab.datasets.fetch_zhongshan_data(*, n_samples=None, as_frame=False, include_coords=True, include_target=True, data_home=None, download_if_missing=True, force_download=False, random_state=None, verbose=True)

Fetch the Zhongshan land subsidence dataset (spatially sampled down to ~2000 points).

Loads the zhongshan_2000.csv file, which contains land subsidence features spatially sampled down to ~2000 points from a larger dataset [Liu24]. The file includes coordinates, year, hydrogeological factors, geological properties, risk scores, and measured subsidence (the target).

Pass an integer n_samples to sub-sample the data further via spatial_sampling().

Column details: ‘longitude’, ‘latitude’, ‘year’, ‘GWL’, ‘seismic_risk_score’, ‘rainfall_mm’, ‘subsidence’, ‘geological_category’, ‘normalized_density’, ‘density_tier’, ‘subsidence_intensity’, ‘density_concentration’, ‘normalized_seismic_risk_score’, ‘rainfall_category’.
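A minimal usage sketch, assuming fusionlab is installed and the dataset file is already cached or can be downloaded. It uses only keyword arguments from the signature above; the wrapper name load_zhongshan_frame and the chosen values (500 rows, seed 42) are illustrative and not part of the library.

```python
# Keyword arguments taken from the documented signature (all are keyword-only).
FETCH_KWARGS = {
    "n_samples": 500,     # spatially stratified sub-sample of the ~2000 rows
    "as_frame": True,     # return a pandas.DataFrame instead of a Bunch
    "random_state": 42,   # makes the sub-sample reproducible
    "verbose": False,     # silence status messages
}


def load_zhongshan_frame(**overrides):
    """Hypothetical wrapper: fetch the dataset as a DataFrame.

    Any keyword in ``overrides`` replaces the default in FETCH_KWARGS.
    """
    # Imported lazily so this module can be imported without fusionlab.
    from fusionlab.datasets import fetch_zhongshan_data

    return fetch_zhongshan_data(**{**FETCH_KWARGS, **overrides})
```

Calling load_zhongshan_frame() should return a sub-sampled DataFrame, while load_zhongshan_frame(n_samples=None) should return the full sampled file.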

Parameters:
  • n_samples (int, str or None, default None) –

    Number of samples to load.

      • If None or '*': Load the full sampled dataset (~2000 rows).

      • If int: Sub-sample the specified number of rows using spatial stratification via spatial_sampling(). Must be <= the number of rows in the full file. Requires spatial_sampling to be available.

  • as_frame (bool, default False) – Return type: False for Bunch object, True for DataFrame.

  • include_coords (bool, default True) – Include ‘longitude’ and ‘latitude’ columns.

  • include_target (bool, default True) – Include the ‘subsidence’ column.

  • data_home (str, optional) – Path to cache directory. Defaults to ~/fusionlab_data.

  • download_if_missing (bool, default True) – Attempt download if file is not found locally.

  • force_download (bool, default False) – Force download attempt even if file exists locally.

  • random_state (int, optional) – Seed for the random number generator used during sub-sampling if n_samples is an integer. Ensures reproducibility.

  • verbose (bool, default True) – Print status messages during file fetching and sampling.
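The parameter rules above can be sketched in plain Python. This is an illustration of the documented behaviour under stated assumptions, not the library's implementation: the helper names resolve_n_samples and select_columns are hypothetical, and the treatment of n_samples=0 is not specified by the documentation (this sketch accepts it).

```python
COORD_COLUMNS = ("longitude", "latitude")
TARGET_COLUMN = "subsidence"


def resolve_n_samples(n_samples, n_rows):
    """Return None for 'load everything', or a validated row count."""
    if n_samples is None or n_samples == "*":
        return None  # load the full sampled dataset
    # bool is a subclass of int, so reject it explicitly.
    if isinstance(n_samples, bool) or not isinstance(n_samples, int):
        raise ValueError(
            f"n_samples must be an int, '*' or None, got {n_samples!r}"
        )
    if n_samples < 0:
        raise ValueError("n_samples must not be negative")
    if n_samples > n_rows:
        raise ValueError(
            f"n_samples={n_samples} exceeds the {n_rows} available rows"
        )
    return n_samples


def select_columns(columns, include_coords=True, include_target=True):
    """Drop coordinate and/or target columns, preserving column order."""
    dropped = set()
    if not include_coords:
        dropped.update(COORD_COLUMNS)
    if not include_target:
        dropped.add(TARGET_COLUMN)
    return [name for name in columns if name not in dropped]
```

For example, select_columns(["longitude", "latitude", "year", "subsidence"], include_coords=False) keeps only "year" and "subsidence".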

Returns:

data – Loaded or sampled data. Bunch object includes frame, data, feature_names, target_names, target, coords, and DESCR.

Return type:

Bunch or pandas.DataFrame
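When as_frame=False, the returned Bunch can be inspected through the attributes listed above. The sketch below assumes feature_names and target_names are list-like sequences of column names, which the documentation does not state explicitly; describe_bunch is a hypothetical helper that works on any object exposing those attributes.

```python
def describe_bunch(bunch):
    """Summarize the documented Bunch attributes in a plain dict.

    Assumes ``feature_names`` and ``target_names`` are list-like and
    ``DESCR`` is a string.
    """
    features = list(bunch.feature_names)
    targets = list(bunch.target_names)
    return {
        "n_features": len(features),
        "features": features,
        "targets": targets,
        # First line of the description, or an empty string if DESCR is blank.
        "summary": (bunch.DESCR.strip().splitlines() or [""])[0],
    }
```

With the real dataset this would be called as describe_bunch(fetch_zhongshan_data(verbose=False)).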

Raises:
  • ValueError – If n_samples is invalid (e.g., not an integer, '*', or None; negative; or larger than the number of available rows when sampling).

  • FileNotFoundError – If the dataset file cannot be found or downloaded.

  • OSError – If there is an error reading the dataset file.
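One way to handle the documented exceptions is sketched below. The helper fetch_or_report is hypothetical; it takes the fetch function as an argument so it can be exercised without the library. Because FileNotFoundError is a subclass of OSError in Python, it has to be caught first. ValueError is deliberately left to propagate, on the assumption that an invalid n_samples is a caller error rather than a runtime condition.

```python
def fetch_or_report(fetch, **kwargs):
    """Call ``fetch(**kwargs)`` and return ``(data, error_label)``.

    ``error_label`` is None on success, "missing" if the file could not be
    found or downloaded, and "unreadable" for other I/O failures.
    A ValueError (e.g. invalid n_samples) is not caught here.
    """
    try:
        return fetch(**kwargs), None
    except FileNotFoundError:
        # Must precede OSError: FileNotFoundError is one of its subclasses.
        return None, "missing"
    except OSError:
        return None, "unreadable"
```

With fusionlab installed, the call would look like fetch_or_report(fetch_zhongshan_data, download_if_missing=False, verbose=False).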

References

[Liu24] Liu, J., et al. (2024). Machine learning-based techniques… Journal of Environmental Management, 352, 120078.