fusionlab.utils.create_spatial_clusters

fusionlab.utils.create_spatial_clusters(df, spatial_cols=None, cluster_col='region', n_clusters=None, algorithm='kmeans', view=True, figsize=(14, 10), s=60, plot_style='seaborn', cmap='tab20', show_grid=True, grid_props=None, auto_scale=True, savefile=None, verbose=1, **kwargs)

Cluster 2D spatial data in df using <algorithm> and optionally plot the results.

The function extracts two coordinate columns from <df> and clusters them with one of ‘kmeans’, ‘dbscan’, or ‘agglo’ (agglomerative). Where relevant, it uses filter_valid_kwargs to strip keyword arguments that the chosen estimator does not accept, then writes the cluster labels into <cluster_col>.

Parameters:
  • df (pandas.DataFrame) – Input DataFrame holding spatial coordinates and optional other fields.

  • spatial_cols (list of str, optional) – Two-column list for x and y coordinates. Defaults to ['longitude','latitude'] if None.

  • cluster_col (str, default 'region') – Name of the column to store the assigned cluster labels.

  • n_clusters (int, optional) – Number of clusters to form. If omitted with algorithm='kmeans', the value is auto-detected. With 'dbscan' or 'agglo', a warning is issued if it is not set.

  • algorithm (str, default 'kmeans') – Choice of clustering algorithm among ['kmeans','dbscan','agglo'].

  • view (bool, default True) – If True, displays a scatterplot of the final clusters.

  • figsize (tuple, default (14, 10)) – Size of the displayed figure for the cluster plot.

  • s (int, default 60) – Marker size in the scatterplot.

  • plot_style (str, default 'seaborn') – Matplotlib style used for the plot.

  • cmap (str, default 'tab20') – Colormap name used to differentiate clusters.

  • show_grid (bool, default True) – Toggles grid lines on or off.

  • grid_props (dict, optional) – Additional keyword arguments controlling the grid style.

  • auto_scale (bool, default True) – If True, standardize coordinates before clustering.

  • savefile (str, optional) – File path to which the data, including the <cluster_col> column of assigned cluster labels, is saved. If None, nothing is saved.

  • verbose (int, default 1) – Controls console logs. Higher values yield more details about scaling and cluster detection.

  • **kwargs – Additional keyword arguments passed to the chosen algorithm (filtered by filter_valid_kwargs for KMeans, DBSCAN, and AgglomerativeClustering).

Returns:

A copy of <df> with an additional <cluster_col> storing the assigned cluster labels.

Return type:

pandas.DataFrame

Notes

If <auto_scale> is True, the coordinate columns are standardized with a standard scaler before clustering. The scatterplot is drawn with seaborn for enhanced styling.
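To make the scaling step concrete, here is a minimal standard-library sketch of what a standard scaler computes for each coordinate column. It assumes the scaler behaves like a conventional z-score transform (zero mean, unit population standard deviation); the helper name standardize is hypothetical and is not part of fusionlab.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Return z-scores: (v - mean) / population standard deviation."""
    mu = fmean(values)
    sigma = pstdev(values)
    # A constant column has zero spread; map it to zeros instead of dividing by 0.
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]


# Same toy coordinates as in the Examples section.
longitude = [0.1, 0.2, 2.2, 2.3]
latitude = [1.0, 1.1, 2.1, 2.2]

scaled_lon = standardize(longitude)
scaled_lat = standardize(latitude)
```

After this transform both columns contribute on a comparable scale to the Euclidean distances that the clustering algorithms rely on.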

By default, for <algorithm> = “kmeans”, the model attempts to minimize:

\[J = \sum_{i=1}^{N} \min_{\mu_j} \lVert x_i - \mu_j \rVert^2\]

where \(x_i\) are the scaled or raw 2D coordinates in <df>, \(\mu_j\) are the cluster centroids, and \(N\) is the number of points. If n_clusters is not provided, the function can auto-detect it using silhouette and elbow analysis.
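The objective above can be evaluated by hand for a fixed set of centroids. The sketch below is illustrative only: it uses raw (unscaled) coordinates, the centroids are chosen by inspection rather than fitted, and the helper name kmeans_objective is hypothetical.

```python
def kmeans_objective(points, centroids):
    """Sum over all points of the squared distance to the nearest centroid."""
    total = 0.0
    for x, y in points:
        # Inner min: squared Euclidean distance to the closest centroid.
        total += min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
    return total


# Same toy coordinates as in the Examples section.
points = [(0.1, 1.0), (0.2, 1.1), (2.2, 2.1), (2.3, 2.2)]

# One centroid at the overall mean versus one centroid per visible group.
one_centroid = [(1.2, 1.6)]
two_centroids = [(0.15, 1.05), (2.25, 2.15)]

j_one = kmeans_objective(points, one_centroid)
j_two = kmeans_objective(points, two_centroids)
```

For this data J falls from about 5.64 with one centroid to about 0.02 with two. An elbow analysis looks for this kind of sharp drop as the number of clusters increases.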

Examples

>>> from fusionlab.utils.spatial_utils import create_spatial_clusters
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "longitude": [0.1, 0.2, 2.2, 2.3],
...     "latitude": [1.0, 1.1, 2.1, 2.2]
... })
>>> # KMeans with auto scale and auto-detect k
>>> result = create_spatial_clusters(
...     df=df,
...     algorithm="kmeans",
...     view=True
... )
>>> # DBSCAN with custom arguments
>>> result_db = create_spatial_clusters(
...     df=df,
...     algorithm="dbscan",
...     eps=0.5,
...     min_samples=2
... )

See also

filter_valid_kwargs

Helps discard unsupported keyword arguments for chosen estimators.
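The actual signature and behavior of filter_valid_kwargs are not shown on this page. As a rough illustration of the idea, keyword filtering can be sketched with the standard inspect module. The names keep_supported_kwargs and fake_dbscan below are hypothetical stand-ins, not fusionlab or scikit-learn APIs.

```python
import inspect


def keep_supported_kwargs(func, kwargs):
    """Return only the entries of kwargs that func accepts by name."""
    params = inspect.signature(func).parameters
    # If func itself declares **kwargs, every keyword is acceptable.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    by_name = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    allowed = {name for name, p in params.items() if p.kind in by_name}
    return {k: v for k, v in kwargs.items() if k in allowed}


def fake_dbscan(eps=0.5, min_samples=5):
    """Hypothetical stand-in for an estimator constructor."""
    return {"eps": eps, "min_samples": min_samples}


# 'n_init' is not a parameter of fake_dbscan, so it is dropped.
user_kwargs = {"eps": 0.3, "min_samples": 2, "n_init": 10}
accepted = keep_supported_kwargs(fake_dbscan, user_kwargs)
model = fake_dbscan(**accepted)
```

Filtering this way lets one call site forward the same **kwargs to different estimators without raising a TypeError on arguments that only some of them accept.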

References