Building FAAQA-QADS: Frequently Asked Arabic Question-Answer for Question-Answering Diseases System

Abstract:

Closed-domain Question Answering (QA) is a specialized area of Natural Language Processing (NLP) that aims to generate answers to questions asked within a specific field of interest, such as healthcare. The lack of Arabic QA datasets has limited the number of studies on Arabic QA systems compared to those conducted for English. The available datasets are usually large but of low quality, due to either poor pre-processing or inaccurate annotation. To address this problem, we introduce a new Modern Standard Arabic (MSA) span-extraction QA dataset for Arabic machine reading comprehension (FAAQA-QAD), collected directly in MSA without the use of translation. To pre-process and annotate the FAAQA-QAD dataset, we use the Pandas library and the end-to-end CdQAannotator tool. In addition, we use several pre-trained language models that have achieved promising results on many Arabic NLP tasks to evaluate the FAAQA-QAD dataset and to compare the performance of these models on it. We then provide an analysis to understand and interpret the low performance obtained by some models. Our experiments show that the ARBERT model outperforms the other evaluated models on the FAAQA-QAD dataset, relative to results on other reading comprehension (RC) span-extraction datasets.
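
As an illustration of what Pandas-based pre-processing for a span-extraction dataset can look like, the following minimal sketch cleans question-answer rows and keeps only those whose answer occurs verbatim in its context. The column names (`context`, `question`, `answer`, `answer_start`), the function name, and the toy English rows are assumptions made for this example; they are not taken from the FAAQA-QAD pipeline, and the same logic applies unchanged to Arabic strings.

```python
import pandas as pd


def clean_and_validate(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: normalise text columns and locate answer spans."""
    out = df.copy()
    # Strip leading and trailing whitespace from every text column.
    for col in ("context", "question", "answer"):
        out[col] = out[col].astype(str).str.strip()
    # Drop rows with empty answers and duplicated (context, question) pairs.
    out = out[out["answer"] != ""]
    out = out.drop_duplicates(subset=["context", "question"])
    # Character offset of the answer inside the context; -1 means "not found".
    out["answer_start"] = [
        ctx.find(ans) for ctx, ans in zip(out["context"], out["answer"])
    ]
    # Span extraction requires the answer to be a literal span of the context.
    return out[out["answer_start"] >= 0].reset_index(drop=True)


# Toy rows for illustration only (not real dataset content).
sample = pd.DataFrame(
    {
        "context": [
            "Influenza is treated with rest and fluids.",
            "Influenza is treated with rest and fluids.",
            "Asthma affects the airways.",
        ],
        "question": [
            "How is influenza treated?",
            "How is influenza treated?",
            "What does asthma affect?",
        ],
        "answer": [" rest and fluids ", "rest and fluids", "the lungs"],
    }
)

cleaned = clean_and_validate(sample)
print(cleaned[["question", "answer", "answer_start"]])
```

The duplicate row and the row whose answer does not appear in its context are discarded, which is the kind of check that guards against the annotation errors mentioned above.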