fusionlab.utils.create_spatial_clusters
- fusionlab.utils.create_spatial_clusters(df, spatial_cols=None, cluster_col='region', n_clusters=None, algorithm='kmeans', view=True, figsize=(14, 10), s=60, plot_style='seaborn', cmap='tab20', show_grid=True, grid_props=None, auto_scale=True, savefile=None, verbose=1, **kwargs)
Cluster 2D spatial data in <df> using <algorithm> and optionally plot the results.
This function extracts two coordinate columns from <df> and forms clusters with one of 'kmeans', 'dbscan', or 'agglo' (agglomerative). Where relevant, it uses filter_valid_kwargs to strip out parameters that the chosen estimator does not accept, and it writes the cluster labels into <cluster_col>.
- Parameters:
df (pandas.DataFrame) – Input DataFrame holding spatial coordinates and, optionally, other fields.
spatial_cols (list of str, optional) – Two-column list for the x and y coordinates. Defaults to ['longitude', 'latitude'] if None.
cluster_col (str, default 'region') – Name of the column that stores the assigned cluster labels.
n_clusters (int, optional) – Number of clusters to form. If not provided for KMeans, it is auto-detected. For DBSCAN or Agglomerative, a warning is issued if it is not set.
algorithm (str, default 'kmeans') – Clustering algorithm, one of ['kmeans', 'dbscan', 'agglo'].
view (bool, default True) – If True, displays a scatterplot of the final clusters.
figsize (tuple, default (14, 10)) – Size of the displayed figure for the cluster plot.
s (int, default 60) – Marker size in the scatterplot.
plot_style (str, default 'seaborn') – Matplotlib style used for the plot.
cmap (str, default 'tab20') – Colormap name used to differentiate clusters.
show_grid (bool, default True) – Toggles grid lines on or off.
grid_props (dict, optional) – Additional keyword arguments controlling the grid style.
auto_scale (bool, default True) – If True, standardizes the coordinates before clustering.
savefile (str, optional) – If given, file path to which the data, including the additional <cluster_col> of assigned cluster labels, is saved.
verbose (int, default 1) – Controls console logs. Higher values yield more details about scaling and cluster detection.
**kwargs – Additional keyword arguments passed to the chosen algorithm (filtered by filter_valid_kwargs for KMeans, DBSCAN, and AgglomerativeClustering).
- Returns:
A copy of <df> with an additional <cluster_col> storing the assigned cluster labels.
- Return type:
pandas.DataFrame
Notes
If <auto_scale> is True, the coordinate columns are standardized with a standard scaler before clustering. The scatterplot is generated with the seaborn library for enhanced styling.
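As a rough illustration of what standardizing the coordinates means, the sketch below rescales each column to zero mean and unit variance. It assumes the scaler behaves like a conventional z-score transform using the population standard deviation, which this page does not confirm; the helper name `standardize` is hypothetical and not part of fusionlab.

```python
from statistics import fmean, pstdev

def standardize(values):
    """Return z-scores: subtract the mean, divide by the population std."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        # A constant column has no spread; map it to zeros (an assumption).
        return [0.0 for _ in values]
    return [(v - mean) / std for v in values]

# Sample coordinate columns, chosen for illustration.
longitude = [0.1, 0.2, 2.2, 2.3]
latitude = [1.0, 1.1, 2.1, 2.2]

scaled_lon = standardize(longitude)
scaled_lat = standardize(latitude)
```

After this transform both columns contribute on the same scale to the Euclidean distances that the clustering algorithms rely on.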
By default, for <algorithm> = "kmeans", the model attempts to minimize:
\[J = \sum_{i=1}^{N} \min_{\mu_j} \lVert x_i - \mu_j \rVert^2\]
where \(x_i\) are the scaled or raw 2D coordinates in <df> and \(\mu_j\) are the cluster centroids. If n_clusters is not provided, the function can auto-detect it using silhouette and elbow analysis.
Examples
>>> from fusionlab.utils.spatial_utils import create_spatial_clusters
>>> import pandas as pd
>>> df = pd.DataFrame({
...     "longitude": [0.1, 0.2, 2.2, 2.3],
...     "latitude": [1.0, 1.1, 2.1, 2.2]
... })
>>> # KMeans with auto scale and auto-detect k
>>> result = create_spatial_clusters(
...     df=df,
...     algorithm="kmeans",
...     view=True
... )
>>> # DBSCAN with custom arguments
>>> result_db = create_spatial_clusters(
...     df=df,
...     algorithm="dbscan",
...     eps=0.5,
...     min_samples=2
... )
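The k-means objective J from the Notes can also be evaluated by hand on the four points used above. This is a minimal pure-Python sketch on the raw (unscaled) coordinates; the function name `kmeans_objective` and the centroid values are chosen for illustration and are not produced by fusionlab.

```python
def kmeans_objective(points, centroids):
    """Sum over points of the squared distance to the nearest centroid."""
    total = 0.0
    for x, y in points:
        total += min((x - cx) ** 2 + (y - cy) ** 2 for cx, cy in centroids)
    return total

points = [(0.1, 1.0), (0.2, 1.1), (2.2, 2.1), (2.3, 2.2)]

# Hypothetical centroids: the means of the two visually obvious pairs.
two_centroids = [(0.15, 1.05), (2.25, 2.15)]
# A single centroid at the overall mean, for comparison.
one_centroid = [(1.2, 1.6)]

j_two = kmeans_objective(points, two_centroids)  # about 0.02
j_one = kmeans_objective(points, one_centroid)   # about 5.64
```

The sharp drop in J from one cluster to two is the kind of signal an elbow analysis looks for when choosing the number of clusters.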
See also
filter_valid_kwargs: Helps discard unsupported keyword arguments for chosen estimators.
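The kind of filtering described here can be sketched with the standard library's inspect.signature. This is an illustrative stand-in under stated assumptions, not fusionlab's implementation of filter_valid_kwargs; `keep_supported_kwargs` and `fake_dbscan` are hypothetical names.

```python
import inspect

def keep_supported_kwargs(func, kwargs):
    """Return only the entries of kwargs that func accepts by keyword."""
    params = inspect.signature(func).parameters
    # If func itself takes **kwargs, every keyword is acceptable.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return dict(kwargs)
    by_keyword = (
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    )
    allowed = {name for name, p in params.items() if p.kind in by_keyword}
    return {k: v for k, v in kwargs.items() if k in allowed}

# Hypothetical stand-in for an estimator constructor.
def fake_dbscan(eps=0.5, min_samples=5):
    return {"eps": eps, "min_samples": min_samples}

# "n_init" is a k-means style option that this constructor does not accept.
user_kwargs = {"eps": 0.3, "min_samples": 2, "n_init": 10}
filtered = keep_supported_kwargs(fake_dbscan, user_kwargs)
model = fake_dbscan(**filtered)
```

Filtering before the call lets one set of **kwargs be forwarded to different estimators without raising a TypeError for options that only some of them understand.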