The Use of Machine Learning in Diabetes Prevention

Abstract:

Diabetes mellitus is one of the fastest-growing chronic diseases in the world, and it requires effective solutions for diagnosis and prevention. In this context, Machine Learning (ML) techniques have significant potential for identifying patterns relevant to disease control. This study used the CRISP-DM methodology to analyze the Diabetes Health Indicators Dataset, which contains sociodemographic, clinical and behavioral information. In the pre-processing phase, class balancing by undersampling (NearMiss) was applied because of the low proportion of diabetic individuals. Recursive Feature Elimination (RFE) and Principal Component Analysis (PCA) were used to assess the relevance of the variables and reduce dimensionality. Six models were evaluated: Random Forest, Gradient Boosting, KNN, Logistic Regression, Multilayer Perceptron (MLP) and Recurrent Neural Networks (RNN). The results showed that class balancing significantly improved performance, with the RNN standing out with accuracy above 86% and an F1-score near 0.87. Combining RFE feature selection with the MLP also yielded robust results. The study concludes that ML and Deep Learning (DL) techniques are promising for prioritizing clinical follow-up and supporting public policies. However, it remains necessary to increase data representativeness, incorporate Explainable AI techniques for greater interpretability, and adjust decision thresholds to minimize false negatives.
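
The closing point about decision thresholds can be illustrated with a minimal sketch. It is not the study's code: the labels and predicted probabilities below are toy values invented for illustration, and the helper names `confusion_counts` and `recall` are hypothetical. The sketch only shows how lowering the threshold of a probabilistic classifier may reduce false negatives at the cost of more false positives.

```python
# Toy labels (1 = diabetic) and predicted probabilities.
# These values are invented for illustration; they are NOT results from the study.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_prob = [0.92, 0.71, 0.48, 0.35, 0.62, 0.41, 0.28, 0.22, 0.15, 0.05]


def confusion_counts(y_true, y_prob, threshold):
    """Return (tp, fp, fn, tn), predicting positive when prob >= threshold."""
    tp = fp = fn = tn = 0
    for label, prob in zip(y_true, y_prob):
        predicted = 1 if prob >= threshold else 0
        if predicted == 1 and label == 1:
            tp += 1
        elif predicted == 1 and label == 0:
            fp += 1
        elif predicted == 0 and label == 1:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn


def recall(tp, fn):
    """Share of actual positives that the model flags (sensitivity)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


# Compare the conventional 0.5 cut-off with a lower, screening-oriented one.
for threshold in (0.5, 0.3):
    tp, fp, fn, tn = confusion_counts(y_true, y_prob, threshold)
    print(f"threshold={threshold}: FN={fn}, FP={fp}, recall={recall(tp, fn):.2f}")
```

On this toy data, the lower threshold removes the false negatives and adds one false positive. In a screening setting that trade-off may be acceptable, but the appropriate threshold would have to be chosen on validation data from the actual models.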