Developing an Efficient Corpus Using Multi-model Algorithms of NLP: Towards an Improvement of Detection Accuracy of Patients’ Medical Condition

Abstract:

The growth of unstructured medical data presents a significant challenge for healthcare systems, especially when extracting relevant and reliable information. Natural Language Processing (NLP) offers a promising way to automate information retrieval, but data cleaning remains a bottleneck, and few medical corpora can answer specific clinical questions. This research addresses both problems by introducing an ensemble-based data cleaning method and by developing a medical corpus designed to answer questions based on the semantic relationships in the data. The ensemble method achieved an accuracy of 94%, outperforming traditional data cleaning techniques such as vectorization and exploratory data analysis. The corpus, built to handle medical queries, successfully extracted and returned relevant answers. This work demonstrates the potential of NLP to make healthcare data processing more accurate and efficient while reducing the need for expert intervention.
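As an illustration only, the idea of ensemble-based data cleaning can be sketched as a majority vote over several simple noise detectors. The abstract does not specify the ensemble's components, so the three detectors below (too_short, mostly_symbols, has_markup), their thresholds, and the sample records are all hypothetical assumptions, not the method evaluated in this work.

```python
import re

# Hypothetical detectors: each returns True when a record looks noisy.
def too_short(text: str) -> bool:
    """Flag records with fewer than three whitespace-separated tokens."""
    return len(text.split()) < 3

def mostly_symbols(text: str) -> bool:
    """Flag records where under half of the non-space characters are letters."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return True
    letters = sum(c.isalpha() for c in chars)
    return letters / len(chars) < 0.5

def has_markup(text: str) -> bool:
    """Flag records that still contain HTML-like tags."""
    return re.search(r"<[^>]+>", text) is not None

# Assumed ensemble members; a real system would likely use trained models.
DETECTORS = [too_short, mostly_symbols, has_markup]

def is_noisy(text: str) -> bool:
    """Majority vote: noisy if more than half of the detectors flag the record."""
    votes = sum(detector(text) for detector in DETECTORS)
    return votes > len(DETECTORS) / 2

def clean_records(records):
    """Keep records not voted noisy, dropping case-insensitive exact duplicates."""
    seen = set()
    kept = []
    for record in records:
        normalized = " ".join(record.split()).lower()
        if normalized in seen or is_noisy(record):
            continue
        seen.add(normalized)
        kept.append(record)
    return kept

# Made-up sample records, for demonstration only.
records = [
    "Patient reports chest pain radiating to left arm.",
    "<br>",
    "### ???",
    "Patient reports chest pain radiating to left arm.",
    "BP 140/90, pulse 88, afebrile.",
]
cleaned = clean_records(records)
```

Voting means a single weak heuristic cannot discard a record on its own: the vital-signs note above contains many digits but is kept because only a minority of detectors could flag it.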