fusionlab.utils.spatial_sampling

fusionlab.utils.spatial_sampling(data, sample_size=0.01, stratify_by=None, spatial_bins=10, spatial_cols=None, method='abs', min_relative_ratio=0.01, random_state=42, savefile=None, verbose=1)

Draw a stratified sample of spatial data that represents the distribution of the whole area and includes different years.

This function performs stratified sampling on spatial data, ensuring that the sample reflects both spatial distribution and temporal aspects of the entire dataset [1]. It combines spatial stratification based on coordinates and additional stratification columns specified by the user.

Parameters:

data : pandas.DataFrame
The input DataFrame to sample from. Must contain spatial coordinate columns (e.g., 'longitude', 'latitude') and any columns specified in stratify_by.

sample_size : float or int, optional
The proportion or absolute number of samples to select. If float, should be between 0.0 and 1.0 and represents the fraction of the dataset to include in the sample. If int, represents the absolute number of samples to select. Default is 0.01 (1% of the data).
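The two interpretations of sample_size can be illustrated with a minimal sketch. The helper name `resolve_sample_size`, the ceiling rounding, and the capping of an int at the dataset size are assumptions made for illustration, not the library's confirmed behaviour.

```python
import math


def resolve_sample_size(n_rows, sample_size):
    """Hypothetical helper: turn sample_size into an absolute row count."""
    if isinstance(sample_size, bool):
        raise TypeError("sample_size must be a float or an int, not a bool")
    if isinstance(sample_size, float):
        # A float is read as a fraction of the dataset.
        # Assumption: 1.0 is accepted and the result is rounded up.
        if not 0.0 < sample_size <= 1.0:
            raise ValueError("a float sample_size must lie in (0, 1]")
        return max(1, math.ceil(sample_size * n_rows))
    if isinstance(sample_size, int):
        # An int is read as an absolute number of rows.
        if sample_size <= 0:
            raise ValueError("an int sample_size must be positive")
        # Assumption: never request more rows than the dataset holds.
        return min(sample_size, n_rows)
    raise TypeError("sample_size must be a float or an int")
```

Under these assumptions, `resolve_sample_size(200, 0.5)` and `resolve_sample_size(200, 100)` both resolve to 100 rows.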

stratify_by : list of str, optional
List of column names to stratify by.

spatial_bins : int or tuple/list of int, optional
Number of bins to divide the spatial coordinates into. If an integer, the same number of bins is used for all spatial dimensions. If a tuple or list, its length must match the number of spatial columns, specifying the number of bins for each spatial dimension. Default is 10.
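The rule for spatial_bins described above can be sketched as follows. `normalize_bins` is a hypothetical helper written for this illustration; the real function may validate its input differently.

```python
def normalize_bins(spatial_bins, n_spatial_cols):
    """Hypothetical helper: return one bin count per spatial column."""
    if isinstance(spatial_bins, int):
        # A single integer is reused for every spatial dimension.
        return (spatial_bins,) * n_spatial_cols
    bins = tuple(spatial_bins)
    if len(bins) != n_spatial_cols:
        raise ValueError(
            f"got {len(bins)} bin counts for {n_spatial_cols} spatial columns"
        )
    return bins
```

With two spatial columns, `normalize_bins(10, 2)` gives `(10, 10)` and `normalize_bins((10, 15), 2)` gives `(10, 15)`, while a one-element tuple raises a `ValueError`.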

spatial_cols : list or tuple of str, optional
List of spatial coordinate column names. Accepts one or two columns. If None, the function looks for columns named 'longitude' and/or 'latitude' in data. If only one spatial column is provided or found, a warning recommends providing both spatial columns for more accurate sampling. If more than two columns are provided, an error is raised.
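The column-resolution rules above could look roughly like this. `resolve_spatial_cols`, the exception types, and the warning text are assumptions for illustration only; the source does not state which exceptions the library raises.

```python
import warnings

import pandas as pd


def resolve_spatial_cols(data, spatial_cols=None):
    """Hypothetical helper: pick and validate the spatial columns."""
    if spatial_cols is None:
        # Fall back to the conventional names when nothing is passed.
        spatial_cols = [c for c in ("longitude", "latitude") if c in data.columns]
    spatial_cols = list(spatial_cols)
    if not spatial_cols:
        raise ValueError("no spatial columns were provided or found")
    if len(spatial_cols) > 2:
        raise ValueError("at most two spatial columns are supported")
    missing = [c for c in spatial_cols if c not in data.columns]
    if missing:
        raise ValueError(f"columns not found in data: {missing}")
    if len(spatial_cols) == 1:
        warnings.warn(
            "Only one spatial column is used; providing both is recommended "
            "for more accurate sampling."
        )
    return spatial_cols


points = pd.DataFrame({"longitude": [113.1, 113.5], "latitude": [22.4, 22.9]})
cols = resolve_spatial_cols(points)  # both default names are found
```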

method : {'abs', 'absolute', 'relative'}, default='abs'
Defines how the sample size is determined:

- 'abs' or 'absolute': Uses a fixed sampling proportion based on sample_size.
- 'relative': Dynamically scales sampling based on dataset stratification, ensuring that all stratification groups receive a proportional sample while maintaining a minimum sampling ratio (controlled by min_relative_ratio).

When method='relative', the function ensures that even small stratification groups receive a sufficient sample by applying min_relative_ratio.

min_relative_ratio : float, default=0.01
Controls the minimum allowable fraction of records that must be sampled when method='relative'. It ensures that no group is undersampled to zero, even if its natural proportion in the dataset is very small. Must be a value between 0 and 1.

The default value (0.01) means that each stratification group contributes at least 1% of the total dataset size, regardless of its relative size.

Example Scenarios:

- If min_relative_ratio=0.05, then each group must contribute at least 5% of the total dataset size (if possible).
- If a group is too small to reach this minimum, its entire subset is sampled instead.

This ensures that no group receives less than min_relative_ratio × total samples, unless the group itself is smaller than that.
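One possible reading of this minimum is sketched below: each group receives the larger of its proportional share and min_relative_ratio times the total dataset size, capped at the group's own size. `allocate_with_floor` is a hypothetical helper, and the exact rule used by the library is an assumption here.

```python
import math


def allocate_with_floor(group_sizes, n_total, min_relative_ratio=0.01):
    """Hypothetical helper: per-group sample counts with a minimum floor.

    group_sizes maps a group label to its row count; n_total is the
    overall number of rows requested.
    """
    total_rows = sum(group_sizes.values())
    # Assumption: the floor is taken relative to the whole dataset.
    floor = math.ceil(min_relative_ratio * total_rows)
    allocation = {}
    for label, size in group_sizes.items():
        # Integer ceiling of size / total_rows * n_total.
        proportional = (size * n_total + total_rows - 1) // total_rows
        # A group can never contribute more rows than it holds.
        allocation[label] = min(size, max(proportional, floor))
    return allocation
```

Under this reading, a group holding fewer rows than the floor is sampled in full, which matches the second scenario above.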

random_state : int, optional
Random seed for reproducibility. Default is 42.

verbose : int or bool, default=1
If truthy, displays a progress bar and detailed status messages during execution. Useful for monitoring the process, especially when working with large datasets.

Returns:

sampled_data : pandas.DataFrame
A sampled DataFrame representing the distribution of the whole area and including different years.

Notes:

The function performs stratified sampling based on spatial bins and other specified stratification columns. Spatial coordinates are binned using quantile-based discretization (pandas.qcut()), so each bin holds approximately the same number of observations.
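Quantile-based binning can be tried on its own with pandas.qcut. The snippet below is a small sketch on synthetic coordinates; the column names `lon_bin` and `lat_bin` and the coordinate ranges are made up for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)
df = pd.DataFrame(
    {
        "longitude": rng.uniform(113.0, 114.0, size=1000),
        "latitude": rng.uniform(22.0, 23.0, size=1000),
    }
)

# labels=False returns integer bin codes instead of interval labels;
# duplicates="drop" merges bins whose quantile edges coincide.
df["lon_bin"] = pd.qcut(df["longitude"], q=10, labels=False, duplicates="drop")
df["lat_bin"] = pd.qcut(df["latitude"], q=10, labels=False, duplicates="drop")

# Each quantile bin should hold roughly the same number of rows.
counts = df["lon_bin"].value_counts()
```

Because the bin edges follow the data's quantiles rather than equal-width intervals, densely sampled areas get narrower bins and sparse areas get wider ones.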

Let \(N\) be the total number of samples in data, and \(n\) be the desired sample size. The function calculates the number of samples to draw from each stratification group based on the proportion of the group size to the total dataset size:

\[n_i = \left\lceil \frac{N_i}{N} \times n \right\rceil\]

where \(N_i\) is the size of group \(i\), and \(n_i\) is the number of samples to draw from group \(i\).
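The allocation formula can be checked with a few lines of pandas. `proportional_allocation` is a hypothetical helper that mirrors the formula directly; it uses an integer ceiling to avoid floating-point rounding surprises.

```python
import pandas as pd


def proportional_allocation(group_sizes, n):
    """Hypothetical helper: n_i = ceil(N_i / N * n) for each group."""
    total = int(group_sizes.sum())  # N, the total number of rows
    # (a + b - 1) // b is the integer ceiling of a / b for positive ints.
    return (group_sizes * n + total - 1) // total


sizes = pd.Series({"2019": 600, "2020": 300, "2021": 100})
allocation = proportional_allocation(sizes, n=50)
# 2019 -> 30, 2020 -> 15, 2021 -> 5
```

Because every group is rounded up, the allocations can add up to slightly more than n. For example, groups of 333, 333, and 334 rows with n = 10 each receive 4 samples, for a total of 12.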

The function ensures that:

- All specified spatial and stratification columns exist in data.
- When spatial_bins is a tuple or list, its length matches the number of spatial columns.
- The sample size is valid (positive float between 0 and 1, or positive integer).

A warning is issued if only one spatial column is used, because two spatial columns are recommended for more accurate sampling.

Examples:

>>> from fusionlab.utils.spatial_utils import spatial_sampling
>>> import pandas as pd
>>> # Assume 'df' is a pandas DataFrame with columns 'longitude',
>>> # 'latitude', 'year', 'geological_category', and other data.
>>> sampled_df = spatial_sampling(
...     data=df,
...     sample_size=0.05,
...     stratify_by=['year', 'geological_category'],
...     spatial_bins=(10, 15),
...     spatial_cols=['longitude', 'latitude'],
...     random_state=42
... )
>>> print(sampled_df.shape)

See Also:

pandas.qcut : Quantile-based discretization function used for binning.
sklearn.model_selection.StratifiedShuffleSplit : For stratified sampling.
batch_spatial_sampling : Resample spatial data with batching.

References:

[1] Kotsiantis, S., Kanellopoulos, D., & Pintelas, P. (2006). "Data preprocessing for supervised learning." International Journal of Computer Science, 1(2), 111-117.