Comparing DBSCAN and HDBSCAN for Geospatial Clustering
📌TL;DR
Compared DBSCAN vs HDBSCAN for clustering Canadian museum locations (1,600+ points with valid coordinates). DBSCAN (eps=1.0, min_samples=3 on latitude-scaled coordinates) and HDBSCAN (min_cluster_size=3) both identified regional concentrations of museums while flagging isolated institutions as noise, but HDBSCAN adapts to varying densities without manual epsilon tuning, making it better suited to real-world geospatial data. Demonstrates density-based clustering advantages over K-Means: arbitrary cluster shapes, automatic outlier detection, no need to pre-specify cluster count, and robust handling of geographic point patterns.
Introduction
When analyzing geospatial data like museum locations across Canada, traditional clustering algorithms like K-Means have limitations: they require you to specify the number of clusters upfront and assume clusters are roughly spherical. In this tutorial, I'll show you how to use DBSCAN and HDBSCAN, two density-based clustering algorithms that can discover clusters of arbitrary shapes and automatically identify outliers. This is particularly valuable for geographic data, where clusters might represent regional concentrations of cultural institutions.
Understanding Density-Based Clustering
Unlike K-Means, which assigns every point to a cluster, density-based algorithms:
- Find clusters of arbitrary shapes: Not limited to circular/spherical clusters
- Identify noise/outliers: Points in low-density regions are labeled as noise
- Don't require pre-specifying K: The number of clusters emerges from the data
- Use density: Points are clustered if they're in high-density regions
DBSCAN: Density-Based Spatial Clustering
DBSCAN (Density-Based Spatial Clustering of Applications with Noise) groups together points that are closely packed and marks points in low-density regions as outliers.
Key Concepts
- eps (ε): The neighborhood search radius, i.e., how far to look for neighbors
- min_samples: Minimum points needed to form a dense region (cluster)
- Core point: A point with at least min_samples neighbors within eps distance
- Border point: A point in a core point's neighborhood that doesn't have enough neighbors to be a core point itself
- Noise point: Not a core point and not in any core point's neighborhood
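To make these definitions concrete, here's a minimal sketch on toy coordinates (not the museum data) using scikit-learn's DBSCAN; the point values and parameters below are purely illustrative:

import numpy as np
from sklearn.cluster import DBSCAN

X = np.array([
    [0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1],  # tight group: all become core points
    [0.39, 0.0],                                      # border point: near a core point, too few neighbors itself
    [5.0, 5.0], [5.2, 5.0],                           # a lone pair: too small to be a cluster, labeled noise
    [10.0, 10.0],                                     # isolated point: noise
])

db = DBSCAN(eps=0.3, min_samples=3).fit(X)
print(db.labels_)               # [0 0 0 0 0 -1 -1 -1]; -1 marks noise
print(db.core_sample_indices_)  # [0 1 2 3]; the border point (index 4) joins cluster 0 but is not core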
HDBSCAN: Hierarchical DBSCAN
HDBSCAN improves on DBSCAN by:
- Handling varying densities: Works when clusters have different densities
- Less parameter sensitivity: More robust to parameter choices
- Hierarchical structure: Builds a cluster hierarchy, then extracts flat clusters
- Better for real-world data: Real geographic data rarely has uniform density
Dataset: Canadian Museums
We're working with museum location data from Statistics Canada's Open Database of Cultural and Art Facilities (ODCAF):
df = pd.read_csv('ODCAF_v1.0.csv', encoding='ISO-8859-1')
The encoding='ISO-8859-1' parameter handles special characters in facility names (important for Canadian bilingual names with accents).
Filtering for Museums
df = df[df.ODCAF_Facility_Type == 'museum']
The dataset contains various cultural facilities (theaters, galleries, etc.). I filtered to focus solely on museums, creating a more homogeneous dataset for clustering.
Data Preprocessing
Extracting Geographic Coordinates
df = df[['Latitude', 'Longitude']]
For geographic clustering, we only need location data. Other attributes like museum name or type would be useful for interpretation but aren't needed for the clustering algorithm.
Handling Missing Values
df = df.replace('..', pd.NA)
df = df.dropna()
df = df.astype('float')
The dataset used '..' as a placeholder for missing coordinates. I replaced these with pd.NA (pandas' missing value indicator), then removed rows with missing coordinates. Finally, I converted coordinates to float type for numerical computations.
This reduced our dataset from 1,938 museums to 1,607 with valid coordinates, still a substantial dataset for clustering.
Coordinate Scaling for DBSCAN
coords_scaled = df.copy()
coords_scaled['Latitude'] = 2 * coords_scaled['Latitude']
This is a subtle but important step. I multiplied latitude by 2 to account for the fact that at Canadian latitudes, one degree of longitude represents a shorter distance than one degree of latitude. This scaling ensures that Euclidean distance in our transformed space better approximates actual geographic distance.
Without this adjustment, clusters might be stretched east-west because the algorithm would treat longitude degrees as equivalent to latitude degrees.
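As a rough sanity check on that factor (a sketch, not taken from the original notebook), you can compare how many kilometres one degree covers in each direction around 60°N:

import numpy as np

lat = 60.0                                        # representative Canadian latitude
km_per_deg_lat = 111.0                            # roughly constant everywhere on Earth
km_per_deg_lon = 111.0 * np.cos(np.radians(lat))  # shrinks toward the poles; ~55.5 km here

print(km_per_deg_lat / km_per_deg_lon)            # ~2.0, matching the scaling factor above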
Building the DBSCAN Model
min_samples = 3
eps = 1.0
metric = 'euclidean'
dbscan = DBSCAN(eps=eps, min_samples=min_samples, metric=metric)
df['Cluster'] = dbscan.fit_predict(coords_scaled)
Parameter Choices
min_samples=3: A cluster must contain at least 3 museums. This prevents considering pairs of nearby museums as clusters-we want genuine concentrations of cultural institutions.
eps=1.0: The neighborhood radius. In the scaled coordinate space, this corresponds to about half a degree of latitude or one degree of longitude, on the order of 50-80 km at Canadian latitudes. This captures regional concentrations without making clusters too large.
metric='euclidean': Standard straight-line distance. For geographic data you might also consider 'haversine' distance, which accounts for Earth's curvature, but for regional analysis within Canada, Euclidean distance on the scaled coordinates is a reasonable approximation.
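Once the model is fitted, it helps to summarize what came out. A minimal sketch (assuming the df['Cluster'] labels assigned above) that counts clusters and the share of noise points:

import numpy as np

labels = df['Cluster'].to_numpy()
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 is the noise label, not a cluster
noise_share = np.mean(labels == -1)

print(f'Clusters found: {n_clusters}')
print(f'Noise points: {noise_share:.1%} of museums')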
Visualizing DBSCAN Results
def plot_clustered_locations(df, title='Museums Clustered by Proximity'):
    # Build a GeoDataFrame in WGS84 (lat/lon) from the coordinate columns
    gdf = gpd.GeoDataFrame(df, geometry=gpd.points_from_xy(df['Longitude'], df['Latitude']), crs="EPSG:4326")
    # Reproject to Web Mercator so the points line up with the basemap
    gdf = gdf.to_crs(epsg=3857)
Understanding the Coordinate Systems
EPSG:4326: This is WGS84, the standard GPS coordinate system using latitude/longitude. It's how our data starts.
EPSG:3857: Web Mercator projection used by most web mapping services. We reproject to this because:
- Our basemap is in Web Mercator
- It preserves angles (important for visual interpretation)
- It's the standard for overlaying data on web maps
Why Reproject? You can't directly overlay data in different coordinate systems. The reprojection ensures our museum points align correctly with the basemap.
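As a quick illustration of why matching coordinate systems matters (a sketch, not from the notebook), the same location has very different numeric coordinates in the two systems; the hypothetical point below is roughly Ottawa:

import geopandas as gpd
from shapely.geometry import Point

pt = gpd.GeoSeries([Point(-75.70, 45.42)], crs='EPSG:4326')  # (longitude, latitude) in WGS84
print(pt.to_crs(epsg=3857))                                  # same place expressed in Web Mercator metres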
Separating Noise from Clusters
non_noise = gdf[gdf['Cluster'] != -1]
noise = gdf[gdf['Cluster'] == -1]
noise.plot(ax=ax, color='k', markersize=30, ec='r', alpha=1, label='Noise')
non_noise.plot(ax=ax, column='Cluster', cmap='tab10', markersize=30, ec='k', legend=False, alpha=0.6)
DBSCAN labels noise points as -1. I plotted these separately:
- Noise points: Black with red edges, fully opaque; these are isolated museums not part of regional clusters
- Clustered points: Colored by cluster ID with black edges, semi-transparent; these are the regional concentrations
Adding the Basemap
ctx.add_basemap(ax, source='./Canada.tif', zoom=4)
The basemap provides geographic context, showing provincial boundaries, cities, and terrain. This helps interpret the clusters: are they concentrated in major cities? Near transportation hubs?
Building the HDBSCAN Model
min_samples = None
min_cluster_size = 3
hdb = hdbscan.HDBSCAN(min_samples=min_samples, min_cluster_size=min_cluster_size, metric='euclidean')
df['Cluster'] = hdb.fit_predict(coords_scaled)
HDBSCAN Parameters
min_cluster_size=3: Similar in spirit to DBSCAN's min_samples, but more intuitive: the minimum number of points needed to form a cluster.
min_samples=None: When set to None, HDBSCAN defaults to using min_cluster_size. You can set it explicitly for more conservative clustering (higher min_samples means denser regions required).
Key Difference from DBSCAN: HDBSCAN builds a hierarchy of clusters at different density levels, then extracts the most stable clusters. This means it can find clusters of varying densities, such as dense urban clusters and sparser rural clusters in the same analysis.
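Beyond the flat labels, the fitted estimator exposes that hierarchical information. A short sketch using the hdbscan library's attributes (the Membership column name here is just for illustration):

df['Membership'] = hdb.probabilities_           # strength of each point's cluster membership (0 for noise)
print(hdb.cluster_persistence_)                 # stability score per extracted cluster; higher = more robust
hdb.condensed_tree_.plot(select_clusters=True)  # visualize the hierarchy and the clusters chosen from it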
Comparing DBSCAN and HDBSCAN Results
DBSCAN Characteristics
- Fixed density threshold: Uses the same eps everywhere
- Sensitive to parameters: Different eps values can dramatically change results
- Uniform density assumption: Works best when all clusters have similar density
- Fast: Computationally efficient
HDBSCAN Characteristics
- Varying density: Finds clusters at different density levels
- More robust: Less sensitive to parameter choices
- Hierarchical information: Provides cluster stability scores
- Slightly slower: More complex algorithm, but still efficient
When to Use Each
Use DBSCAN when:
- Clusters have roughly uniform density
- You have a clear understanding of appropriate neighborhood size
- Speed is critical
- You want simpler parameter tuning
Use HDBSCAN when:
- Clusters have varying densities (common in geographic data)
- You want more robust results
- You need hierarchical cluster information
- You're exploring data without strong prior assumptions
Key Takeaways
Density-Based Clustering Handles Complex Shapes: Unlike K-Means, DBSCAN and HDBSCAN can identify non-spherical clusters, perfect for geographic concentrations that follow coastlines, transportation routes, or regional boundaries.
Noise Identification is Valuable: Not every point needs to be in a cluster. Identifying isolated museums as noise provides insights-these might be rural institutions or specialized facilities serving specific communities.
Coordinate System Matters: Properly handling geographic projections ensures accurate distance calculations and correct basemap alignment. Always match coordinate reference systems when combining data sources.
Scaling for Latitude/Longitude: Multiplying latitude by 2 compensates for the shorter distance represented by longitude degrees at high latitudes, improving cluster quality for Canadian data.
HDBSCAN Offers Robustness: For real-world geographic data with varying densities, HDBSCAN typically provides more reliable results with less parameter tuning than DBSCAN.
Practical Applications
This clustering approach enables:
- Cultural Planning: Identify regions underserved by museums
- Tourism Strategy: Understand regional concentrations for tour planning
- Resource Allocation: Direct funding to areas lacking cultural institutions
- Network Analysis: Understand how museums cluster around population centers
- Transportation Planning: Design cultural tourism routes connecting clustered museums
Advanced Considerations
Parameter Selection
For DBSCAN, choosing eps:
- Too small: Everything becomes noise
- Too large: Everything merges into one cluster
- Use k-distance plots to identify appropriate eps values (see the sketch below)
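A common way to build that k-distance plot is to compute each point's distance to its k-th nearest neighbor (with k equal to min_samples) and look for the elbow where distances start rising sharply. A minimal sketch with scikit-learn, assuming the scaled coordinates from earlier:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.neighbors import NearestNeighbors

k = 3  # match min_samples
nn = NearestNeighbors(n_neighbors=k).fit(coords_scaled)
distances, _ = nn.kneighbors(coords_scaled)

k_dist = np.sort(distances[:, -1])  # each point's distance to its k-th neighbor, sorted
plt.plot(k_dist)
plt.xlabel('Points sorted by k-distance')
plt.ylabel('Distance to k-th nearest neighbor')
plt.show()  # the elbow of this curve suggests a reasonable eps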
For HDBSCAN:
- Start with min_cluster_size based on domain knowledge (minimum meaningful group size)
- Examine cluster stability scores to validate results
- Use the condensed tree visualization to understand hierarchical structure
Alternative Distance Metrics
For geographic data, consider:
- Haversine distance: Accounts for Earth's spherical shape (see the sketch after this list)
- Great circle distance: What the haversine formula computes; the two terms are interchangeable for practical purposes
- Road network distance: For applications where travel distance matters more than straight-line distance
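If you want true great-circle distances rather than the scaled-coordinate approximation, scikit-learn's DBSCAN accepts metric='haversine' on coordinates expressed in radians, with eps given as an angle (kilometres divided by Earth's radius, ~6371 km). A sketch, not part of the original analysis, using an illustrative 100 km radius:

import numpy as np
from sklearn.cluster import DBSCAN

coords_rad = np.radians(df[['Latitude', 'Longitude']].to_numpy())  # haversine expects [lat, lon] in radians

eps_km = 100  # illustrative neighborhood radius in kilometres
db_hav = DBSCAN(eps=eps_km / 6371.0, min_samples=3, metric='haversine', algorithm='ball_tree').fit(coords_rad)
df['Cluster_haversine'] = db_hav.labels_  # hypothetical column name for the haversine-based labels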
Conclusion
DBSCAN and HDBSCAN provide powerful tools for geographic clustering, automatically discovering regional concentrations and identifying outliers. By using density-based approaches rather than K-Means, we avoid forcing spherical clusters and pre-specifying cluster counts.
For the Canadian museums dataset, both algorithms successfully identified regional concentrations of cultural institutions while flagging isolated museums as noise. HDBSCAN's ability to handle varying densities makes it particularly well-suited for geographic data spanning urban and rural areas.
The visualization on a basemap transforms abstract cluster labels into interpretable geographic insights, enabling stakeholders to understand cultural institution distribution and make data-driven decisions about cultural policy, tourism development, and resource allocation.
📓 Jupyter Notebook
Want to explore the complete code and run it yourself? Access the full Jupyter notebook with detailed implementations and visualizations:
You can also run it interactively:
