Density-based Text Clustering using Document Embeddings

Abstract:

Density-based clustering algorithms can accurately identify arbitrarily shaped clusters, a characteristic that makes them advantageous for many real-life datasets. However, most density-based clustering algorithms are affected by the curse of dimensionality, since they rely on distance metrics and range queries. In this paper, we demonstrate how density-based clustering algorithms can accurately cluster short text documents using a modern document embedding model, specifically Doc2Vec. We evaluate the accuracy of a classic density-based clustering algorithm, DBSCAN, and one of its recent variants, HDBSCAN, using two distinct quality functions, the Adjusted Rand Index and the Adjusted Mutual Information.
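
To make the first of the two quality functions concrete, the following is a minimal sketch of the Adjusted Rand Index computed from the contingency counts of two labelings. It is an illustration only and not the evaluation code used in the paper: the function name `adjusted_rand_index` and the toy labels are hypothetical, and treating the noise label -1 (as produced by DBSCAN-style algorithms) as an ordinary cluster is an assumption made here for simplicity.

```python
from collections import Counter
from math import comb


def adjusted_rand_index(labels_true, labels_pred):
    """Adjusted Rand Index between two flat labelings of the same items."""
    if len(labels_true) != len(labels_pred):
        raise ValueError("labelings must have the same length")
    n = len(labels_true)

    # Contingency counts: number of items per (true label, predicted label) pair.
    pair_counts = Counter(zip(labels_true, labels_pred))
    true_counts = Counter(labels_true)
    pred_counts = Counter(labels_pred)

    # Number of item pairs that are grouped together in each view.
    sum_pairs = sum(comb(c, 2) for c in pair_counts.values())
    sum_true = sum(comb(c, 2) for c in true_counts.values())
    sum_pred = sum(comb(c, 2) for c in pred_counts.values())

    total_pairs = comb(n, 2)
    if total_pairs == 0:
        return 1.0  # fewer than two items: nothing to disagree on

    expected = sum_true * sum_pred / total_pairs  # agreement expected by chance
    max_index = (sum_true + sum_pred) / 2
    if max_index == expected:
        # Degenerate case, e.g. both labelings put every item in one cluster.
        return 1.0
    return (sum_pairs - expected) / (max_index - expected)


# Hypothetical toy data: ground-truth topics vs. cluster ids, where -1 marks
# a point flagged as noise (assumption: noise is scored as its own cluster).
true_topics = [0, 0, 0, 1, 1, 1]
cluster_ids = [0, 0, -1, 1, 1, 1]
ari = adjusted_rand_index(true_topics, cluster_ids)
```

The index equals 1 when the two labelings agree up to a renaming of the cluster ids and is close to 0 for labelings that agree only by chance, which is why it suits comparing cluster ids against topic labels.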