Towards Building an Open-Domain Corpus for Arabic Reading Comprehension

Abstract:

Reading comprehension is a task in natural language understanding in which a machine's understanding is evaluated by having it answer questions about paragraphs. Little work has been done on Arabic reading comprehension because of the lack of reading comprehension datasets for the Arabic language. The goal of this paper is to present the detailed phases of creating an Arabic dataset semi-automatically for machine reading comprehension. The paper begins with an introductory survey of the datasets available for English, then presents the phases of creating the dataset semi-automatically. There are four main phases, each with sub-steps: the first is a manual check of the question-answer pairs, the second is a Google search, the third is document retrieval, and the fourth is paragraph retrieval. The paper then presents statistics for each phase.