Abstract:
Clustering is an important Data Mining operation that groups objects into clusters based on their similarity. The similarity join is a primitive operation used in clustering which retrieves the most similar pairs from two input data-sets based on a dissimilarity function (also named metric). In this article, we transform DBSCAN’s (Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with noise) algorithmic schema by replacing the multiple range queries with a single similarity join to minimize the hyperparameter. Thus, instead of the two hyperparameters required by the DBSCAN, our approach requires only the neighborhood radius ε hyperparameter. We propose two implementations for DBSCAN with similarity join: i) QuickDBSCAN that uses an adapted QuickJoin algorithm and ii) KDTreeDBSCAN that uses k-d-tree indexing structure. The experimental results show that DBSCAN with similarity join outperforms the classic DBSCAN.