Text Clustering and Representation Using the Latent Dirichlet Allocation Model

Abstract:

Text mining models make it possible to examine textual data and extract its message without reading it in full. Such an analysis can be used not only to identify the main subject of the analyzed texts but also to develop similarity patterns or to make predictions. Because an increasing volume of data is now recorded as text rather than in numerical form, text mining models have become an important and necessary tool. The main purpose of this paper is to present the Latent Dirichlet Allocation (LDA) model, which is mainly used for the cluster analysis of documents but also for dimensionality reduction and text representation. To show that the LDA model is a useful tool for other clustering models, we compare the performance of the k-medoids clustering algorithm using the LDA representation with its performance using the TF-IDF representation. We use a self-defined corpus of news articles published on the Washington Post website and categorized as “policy”, “business”, “sport”, “technology” and “entertainment”. We show that the LDA model is not only an efficient clustering model but is also useful for text representation, cutting implementation costs and enhancing the quality of further textual tasks.