Performance Parameters of Data Analysis for Text Processing Using Machine Learning Models

Abstract:

Efficient text classification is a central problem in natural language processing, and it is becoming a focal point for researchers as the volume of textual data from a variety of sectors grows. Decision-makers in these sectors are investing heavily in the analysis of textual data to support their decisions. An appropriate choice of parameters, such as the dataset size and the ML algorithm used in the analysis, positively influences the quality of these decisions. In this paper, we discuss how the choice of these parameters affects the performance of data analysis for text processing with machine learning models. The study compares the accuracy of decision tree, random forest, logistic regression, Naïve Bayes, SVM, and KNN classifiers, and their execution time as a function of dataset size. The results show that logistic regression was the most efficient algorithm for text categorization in terms of dataset size and execution time compared with the other algorithms.
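
The kind of comparison described in the abstract, accuracy and execution time measured as a function of dataset size, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not the setup used in this study: the toy two-class corpus produced by `make_corpus`, the small `NaiveBayes` class (a plain multinomial Naïve Bayes with add-one smoothing, standing in for any of the six classifiers), and the `benchmark` helper are all hypothetical names.

```python
import math
import random
import time
from collections import Counter, defaultdict


def make_corpus(n, seed=0):
    """Build a toy two-class corpus (hypothetical data, not the paper's)."""
    rng = random.Random(seed)
    vocab = {
        "sports": ["match", "team", "goal", "coach", "league"],
        "finance": ["stock", "market", "bank", "profit", "trade"],
    }
    shared = ["the", "a", "today", "report", "new"]
    docs = []
    for _ in range(n):
        label = rng.choice(sorted(vocab))
        # Four class-specific words plus four words common to both classes.
        words = rng.choices(vocab[label], k=4) + rng.choices(shared, k=4)
        docs.append((words, label))
    return docs


class NaiveBayes:
    """Minimal multinomial Naive Bayes with add-one smoothing."""

    def fit(self, docs):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter()
        self.vocab = set()
        for words, label in docs:
            self.class_counts[label] += 1
            self.word_counts[label].update(words)
            self.vocab.update(words)
        return self

    def predict(self, words):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            # Log prior plus the sum of smoothed log likelihoods.
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


def benchmark(model_factory, sizes, test_size=200):
    """Return (size, accuracy, seconds) for each training-set size.

    The time covers both training and prediction on a fixed test set.
    """
    test_docs = make_corpus(test_size, seed=999)
    rows = []
    for n in sizes:
        train_docs = make_corpus(n, seed=n)
        start = time.perf_counter()
        model = model_factory().fit(train_docs)
        predictions = [model.predict(words) for words, _ in test_docs]
        elapsed = time.perf_counter() - start
        correct = sum(p == label for p, (_, label) in zip(predictions, test_docs))
        rows.append((n, correct / len(test_docs), elapsed))
    return rows


if __name__ == "__main__":
    for n, acc, secs in benchmark(NaiveBayes, sizes=[100, 1000, 5000]):
        print(f"n={n:5d}  accuracy={acc:.3f}  time={secs:.4f}s")
```

Passing a different factory to `benchmark` (any object with `fit` and `predict` in the same shape) would let several classifiers be compared on identical training sizes and on the same held-out test set, which is the property that makes accuracy and timing figures comparable across algorithms.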