Density-based Text Clustering using Document Embeddings

Iulia Maria Rădulescu, Ciprian-Octavian Truică, Elena-Simona Apostol, Alexandru Boicea, Mariana Mocanu, Daniel Călin Popeangă and Florin Rădulescu

Abstract:

Density-based clustering algorithms can accurately identify arbitrarily shaped clusters, a characteristic that makes them advantageous for many real-life datasets. However, most density-based clustering algorithms are affected by the curse of dimensionality, since they rely on distance metrics and range queries. In this paper, we demonstrate how density-based clustering algorithms can accurately cluster short text documents using a modern document embedding model, specifically Doc2Vec. We evaluate the accuracy of a classic density-based clustering algorithm, DBSCAN, and one of its recent variants, HDBSCAN, using two distinct quality functions, the Adjusted Rand Index and the Adjusted Mutual Information.
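
As an illustration of the evaluation idea only, the following minimal sketch runs a textbook DBSCAN over a handful of toy 2-D vectors and scores the result with the Adjusted Rand Index. It is not the pipeline used in the paper: the toy vectors merely stand in for Doc2Vec document embeddings, the function names and the values of `eps` and `min_pts` are assumptions chosen for the example, and HDBSCAN and the Adjusted Mutual Information are omitted for brevity.

```python
from collections import Counter
from math import comb, dist


def dbscan(points, eps, min_pts):
    """Label each point with a cluster id (0, 1, ...) or -1 for noise."""
    UNSEEN, NOISE = None, -1
    labels = [UNSEEN] * len(points)

    def region_query(i):
        # Range query: indices of all points within eps of point i (itself included).
        return [j for j, q in enumerate(points) if dist(points[i], q) <= eps]

    cluster = -1
    for i in range(len(points)):
        if labels[i] is not UNSEEN:
            continue
        neighbours = region_query(i)
        if len(neighbours) < min_pts:
            labels[i] = NOISE  # may still be claimed later as a border point
            continue
        cluster += 1
        labels[i] = cluster
        frontier = list(neighbours)
        while frontier:
            j = frontier.pop()
            if labels[j] == NOISE:
                labels[j] = cluster  # noise reached from a core point becomes a border point
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            j_neighbours = region_query(j)
            if len(j_neighbours) >= min_pts:
                frontier.extend(j_neighbours)  # j is a core point, so keep expanding
    return labels


def adjusted_rand_index(truth, pred):
    """Adjusted Rand Index computed from pair counts of the contingency table."""
    def pairs(counts):
        return sum(comb(c, 2) for c in counts)

    same_both = pairs(Counter(zip(truth, pred)).values())
    same_truth = pairs(Counter(truth).values())
    same_pred = pairs(Counter(pred).values())
    expected = same_truth * same_pred / comb(len(truth), 2)
    max_index = (same_truth + same_pred) / 2
    if max_index == expected:
        return 1.0  # degenerate case, e.g. both partitions are trivial
    return (same_both - expected) / (max_index - expected)


# Toy stand-ins for document embeddings: two dense groups and one outlier.
points = [
    (0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1),
    (5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1),
    (10.0, 0.0),
]
truth = [0, 0, 0, 0, 1, 1, 1, 1, 2]  # hypothetical ground-truth classes

labels = dbscan(points, eps=0.3, min_pts=3)  # eps and min_pts are illustrative values
print(labels)
print(adjusted_rand_index(truth, labels))
```

The range query inside `region_query` is the step that suffers in high-dimensional spaces, which is why the quality of the embedding matters for density-based methods. In practice one would more likely rely on an existing library implementation than on a hand-written loop like this one.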

36th IBIMA Conference: 4-5 November 2020, Granada, Spain
