Comparing DBSCAN and HDBSCAN for Geospatial Clustering
📌TL;DR
Compared DBSCAN vs HDBSCAN for clustering Canadian museum locations (1,600+ points with valid coordinates). DBSCAN (eps=1.0, min_samples=3 on latitude-scaled coordinates) and HDBSCAN (min_cluster_size=3) both identified regional concentrations of museums while flagging isolated institutions as noise, but HDBSCAN adapts to varying densities without manual epsilon tuning, making it better suited to real-world geospatial data. Demonstrates density-based clustering advantages over K-Means: arbitrary cluster shapes, automatic outlier detection, no need to pre-specify cluster count, and robust handling of geographic point patterns.
Introduction
When analyzing geospatial data like museum locations across Canada, traditional clustering algorithms like K-Means have limitations: they require you to specify the number of clusters upfront and assume clusters are roughly spherical. In this tutorial, I'll show you how to use DBSCAN and HDBSCAN, two density-based clustering algorithms that can discover clusters of arbitrary shapes and automatically identify outliers. This is particularly valuable for geographic data, where clusters might represent regional concentrations of cultural institutions.
Understanding Density-Based Clustering
Unlike K-Means, which assigns every point to a cluster, density-based algorithms:
- Find clusters of arbitrary shapes: Not limited to circular/spherical clusters
- Identify noise/outliers: Points in low-density regions are labeled as noise
- Don't require pre-specifying K: The number of clusters emerges from the data
- Use density: Points are clustered if they're in high-density regions
DBSCAN: Density-Based Spatial Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed and marks points in low-density regions as outliers.
Key Concepts
- eps (ε): The neighborhood search radius, i.e., how far to look for neighbors
- min_samples: Minimum points needed to form a dense region (cluster)
- Core point: A point with at least min_samples neighbors within eps distance
- Border point: A point in a core point's neighborhood that doesn't have enough neighbors to be a core point itself
- Noise point: Not a core point and not in any core point's neighborhood
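To make these definitions concrete, here's a minimal sketch on toy coordinates (not the museum data) using scikit-learn's DBSCAN; the point values and parameters below are purely illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # tight group: all become core points
    [0.39, 0.0],                                      # border point: near a core point, too few neighbors itself
    [5.0, 5.0], [5.2, 5.0],                           # a lone pair: too small to be a cluster, labeled noise
    [10.0, 10.0],                                     # isolated point: noise
])

db = DBSCAN(eps=0.3, min_samples=3).fit(X)
print(db.labels_)               # [0 0 0 0 0 -1 -1 -1]; -1 marks noise
print(db.core_sample_indices_)  # [0 1 2 3]; the border point (index 4) joins cluster 0 but is not core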
HDBSCAN: Hierarchical DBSCAN
HDBSCAN improves on DBSCAN by:
- Handling varying densities: Works when clusters have different densities
- Less parameter sensitivity: More robust to parameter choices
- Hierarchical structure: Builds a cluster hierarchy, then extracts flat clusters
- Better for real-world data: Real geographic data rarely has uniform density
Dataset: Canadian Museums
We're working with museum location data from Statistics Canada's Open Database of Cultural and Art Facilities (ODCAF):
df = pd.read_csv('ODCAF_v1.0.csv', encoding='ISO-8859-1')
The encoding='ISO-8859-1' parameter handles special characters in facility names (important for Canadian bilingual names with accents).
Filtering for Museums
df = df[df.ODCAF_Facility_Type == 'museum']
The dataset contains various cultural facilities (theaters, galleries, etc.). I filtered to focus solely on museums, creating a more homogeneous dataset for clustering.
Data Preprocessing
Extracting Geographic Coordinates
df = df[['Latitude', 'Longitude']]
For geographic clustering, we only need location data. Other attributes like museum name or type would be useful for interpretation but aren't needed for the clustering algorithm.
Handling Missing Values
df = df.replace('..', pd.NA)
df = df.dropna()
df = df.astype('float')
The dataset used '..' as a placeholder for missing coordinates. I replaced these with pd.NA (pandas' missing value indicator), then removed rows with missing coordinates. Finally, I converted coordinates to float type for numerical computations.
This reduced our dataset from 1,938 museums to 1,607 with valid coordinates, still a substantial dataset for clustering.
Coordinate Scaling for DBSCAN
coords_scaled = df.copy()
coords_scaled['Latitude'] = 2 * coords_scaled['Latitude']
This is a subtle but important step. I multiplied latitude by 2 to account for the fact that at Canadian latitudes, one degree of longitude represents a shorter distance than one degree of latitude. This scaling ensures that Euclidean distance in our transformed space better approximates actual geographic distance.
Without this adjustment, clusters might be stretched east-west because the algorithm would treat longitude degrees as equivalent to latitude degrees.
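As a rough sanity check on that factor (a sketch, not taken from the original notebook), you can compare how many kilometres one degree covers in each direction around 60°N:

import numpy as np

lat = 60.0                                        # representative Canadian latitude
km_per_deg_lat = 111.0                            # roughly constant everywhere on Earth
km_per_deg_lon = 111.0 * np.cos(np.radians(lat))  # shrinks toward the poles; ~55.5 km here

print(km_per_deg_lat / km_per_deg_lon)            # ~2.0, matching the scaling factor above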
Building the DBSCAN Model
min_samples = 3
eps = 1.0
metric = 'euclidean'
dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric=metric)
df['Cluster'] = dbscan.fit_predict(coords_scaled)
Parameter Choices
min_samples=3: A cluster must contain at least 3 museums. This prevents considering pairs of nearby museums as clusters-we want genuine concentrations of cultural institutions.
eps=1.0: The neighborhood radius. In the scaled coordinate space, this corresponds to about half a degree of latitude or one degree of longitude, on the order of 50-80 km at Canadian latitudes. This captures regional concentrations without making clusters too large.
metric='euclidean': Standard straight-line distance. For geographic data you might also consider 'haversine' distance, which accounts for Earth's curvature, but for regional analysis within Canada, Euclidean distance on the scaled coordinates is a reasonable approximation.
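Once the model is fitted, it helps to summarize what came out. A minimal sketch (assuming the df['Cluster'] labels assigned above) that counts clusters and the share of noise points:

import numpy as np

labels = df['Cluster'].to_numpy()
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 is the noise label, not a cluster
noise_share = np.mean(labels == -1)

print(f'Clusters found: {n_clusters}')
print(f'Noise points: {noise_share:.1%} of museums')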
Visualizing DBSCAN Results
def plot_clustered_locations(df, title='Museums Clustered by Proximity'):
    # Build a GeoDataFrame in WGS84 (lat/lon) from the coordinate columns
    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['Longitude'], df['Latitude']), crs="EPSG:4326")
    # Reproject to Web Mercator so the points line up with the basemap
    gdf = gdf.to_crs(epsg=3857)
Understanding the Coordinate Systems
EPSG:4326: This is WGS84, the standard GPS coordinate system using latitude/longitude. It's how our data starts.
EPSG:3857: Web Mercator projection used by most web mapping services. We reproject to this because:
- Our basemap is in Web Mercator
- It preserves angles (important for visual interpretation)
- It's the standard for overlaying data on web maps
Why Reproject? You can't directly overlay data in different coordinate systems. The reprojection ensures our museum points align correctly with the basemap.
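As a quick illustration of why matching coordinate systems matters (a sketch, not from the notebook), the same location has very different numeric coordinates in the two systems; the hypothetical point below is roughly Ottawa:

import geopandas as gpd
from shapely.geometry import Point

pt = gpd.GeoSeries([Point(-75.70, 45.42)], crs='EPSG:4326')  # (longitude, latitude) in WGS84
print(pt.to_crs(epsg=3857))                                  # same place expressed in Web Mercator metres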
Separating Noise from Clusters
non_noise = gdf[gdf['Cluster'] != -1]
noise = gdf[gdf['Cluster'] == -1]
noise.plot(ax=ax, color='k', markersize=30, ec='r', alpha=1, label='Noise')
non_noise.plot(ax=ax, column='Cluster', cmap='tab10', markersize=30, ec='k', legend=False, alpha=0.6)
DBSCAN labels noise points as -1. I plotted these separately:
- Noise points: Black with red edges, fully opaque; these are isolated museums not part of regional clusters
- Clustered points: Colored by cluster ID with black edges, semi-transparent; these are the regional concentrations
Adding the Basemap
ctx.add_basemap(ax, source='./Canada.tif', zoom=4)
The basemap provides geographic context, showing provincial boundaries, cities, and terrain. This helps interpret the clusters: are they concentrated in major cities? Near transportation hubs?
Building the HDBSCAN Model
min_samples = None
min_cluster_size = 3
hdb = hdbscan.HDBSCAN(min_samples=min_samples, min_cluster_size=min_cluster_size, metric='euclidean')
df['Cluster'] = hdb.fit_predict(coords_scaled)
HDBSCAN Parameters
min_cluster_size=3: Similar in spirit to DBSCAN's min_samples, but more intuitive: the minimum number of points needed to form a cluster.
min_samples=None: When set to None, HDBSCAN defaults to using min_cluster_size. You can set it explicitly for more conservative clustering (higher min_samples means denser regions required).
Key Difference from DBSCAN: HDBSCAN builds a hierarchy of clusters at different density levels, then extracts the most stable clusters. This means it can find clusters of varying densities, such as dense urban clusters and sparser rural clusters in the same analysis.
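Beyond the flat labels, the fitted estimator exposes that hierarchical information. A short sketch using the hdbscan library's attributes (the Membership column name here is just for illustration):

df['Membership'] = hdb.probabilities_           # strength of each point's cluster membership (0 for noise)
print(hdb.cluster_persistence_)                 # stability score per extracted cluster; higher = more robust
hdb.condensed_tree_.plot(select_clusters=True)  # visualize the hierarchy and the clusters chosen from it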
Comparing DBSCAN and HDBSCAN Results
DBSCAN Characteristics
- Fixed density threshold: Uses the same eps everywhere
- Sensitive to parameters: Different eps values can dramatically change results
- Uniform density assumption: Works best when all clusters have similar density
- Fast: Computationally efficient
HDBSCAN Characteristics
- Varying density: Finds clusters at different density levels
- More robust: Less sensitive to parameter choices
- Hierarchical information: Provides cluster stability scores
- Slightly slower: More complex algorithm, but still efficient
When to Use Each
Use DBSCAN when:
- Clusters have roughly uniform density
- You have a clear understanding of appropriate neighborhood size
- Speed is critical
- You want simpler parameter tuning
Use HDBSCAN when:
- Clusters have varying densities (common in geographic data)
- You want more robust results
- You need hierarchical cluster information
- You're exploring data without strong prior assumptions
Key Takeaways
Density-Based Clustering Handles Complex Shapes: Unlike K-Means, DBSCAN and HDBSCAN can identify non-spherical clusters, perfect for geographic concentrations that follow coastlines, transportation routes, or regional boundaries.
Noise Identification is Valuable: Not every point needs to be in a cluster. Identifying isolated museums as noise provides insights-these might be rural institutions or specialized facilities serving specific communities.
Coordinate System Matters: Properly handling geographic projections ensures accurate distance calculations and correct basemap alignment. Always match coordinate reference systems when combining data sources.
Scaling for Latitude/Longitude: Multiplying latitude by 2 compensates for the shorter distance represented by longitude degrees at high latitudes, improving cluster quality for Canadian data.
HDBSCAN Offers Robustness: For real-world geographic data with varying densities, HDBSCAN typically provides more reliable results with less parameter tuning than DBSCAN.
Practical Applications
This clustering approach enables:
- Cultural Planning: Identify regions underserved by museums
- Tourism Strategy: Understand regional concentrations for tour planning
- Resource Allocation: Direct funding to areas lacking cultural institutions
- Network Analysis: Understand how museums cluster around population centers
- Transportation Planning: Design cultural tourism routes connecting clustered museums
Advanced Considerations
Parameter Selection
For DBSCAN, choosing eps:
- Too small: Everything becomes noise
- Too large: Everything merges into one cluster
- Use k-distance plots to identify appropriate eps values (see the sketch below)
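A common way to build that k-distance plot is to compute each point's distance to its k-th nearest neighbor (with k equal to min_samples) and look for the elbow where distances start rising sharply. A minimal sketch with scikit-learn, assuming the scaled coordinates from earlier:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 3  # match min_samples
nn = NearestNeighbors(n_neighbors=k).fit(coords_scaled)
distances, _ = nn.kneighbors(coords_scaled)

k_dist = np.sort(distances[:, -1])  # each point's distance to its k-th neighbor, sorted
plt.plot(k_dist)
plt.xlabel('Points sorted by k-distance')
plt.ylabel('Distance to k-th nearest neighbor')
plt.show()  # the elbow of this curve suggests a reasonable eps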
For HDBSCAN:
- Start with min_cluster_size based on domain knowledge (minimum meaningful group size)
- Examine cluster stability scores to validate results
- Use the condensed tree visualization to understand hierarchical structure
Alternative Distance Metrics
For geographic data, consider:
- Haversine distance: Accounts for Earth's spherical shape (see the sketch after this list)
- Great circle distance: What the haversine formula computes; the two terms are interchangeable for practical purposes
- Road network distance: For applications where travel distance matters more than straight-line distance
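If you want true great-circle distances rather than the scaled-coordinate approximation, scikit-learn's DBSCAN accepts metric='haversine' on coordinates expressed in radians, with eps given as an angle (kilometres divided by Earth's radius, ~6371 km). A sketch, not part of the original analysis, using an illustrative 100 km radius:

import numpy as np
from sklearn.cluster import DBSCAN

coords_rad = np.radians(df[['Latitude', 'Longitude']].to_numpy())  # haversine expects [lat, lon] in radians

eps_km = 100  # illustrative neighborhood radius in kilometres
db_hav = DBSCAN(eps=eps_km / 6371.0, min_samples=3, metric='haversine', algorithm='ball_tree').fit(coords_rad)
df['Cluster_haversine'] = db_hav.labels_  # hypothetical column name for the haversine-based labels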
Conclusion
DBSCAN and HDBSCAN provide powerful tools for geographic clustering, automatically discovering regional concentrations and identifying outliers. By using density-based approaches rather than K-Means, we avoid forcing spherical clusters and pre-specifying cluster counts.
For the Canadian museums dataset, both algorithms successfully identified regional concentrations of cultural institutions while flagging isolated museums as noise. HDBSCAN's ability to handle varying densities makes it particularly well-suited for geographic data spanning urban and rural areas.
The visualization on a basemap transforms abstract cluster labels into interpretable geographic insights, enabling stakeholders to understand cultural institution distribution and make data-driven decisions about cultural policy, tourism development, and resource allocation.
📓 Jupyter Notebook
Want to explore the complete code and run it yourself? Access the full Jupyter notebook with detailed implementations and visualizations:
You can also run it interactively:
