Using the Web as an Efficient Source of Building an Arabic Corpus: Presentation and Evaluation


User demand for accurate information continues to grow, especially with the expansion of digital Arabic content on the Web. This growth serves not only the consultation of existing Web documents but also the construction of corpora for natural language applications such as question answering, machine translation, and information retrieval. In this paper, we present the design and implementation of an Arabic corpus of question-text pairs. This corpus, called AQA-WebCorp (Arabic Question Answering Web Corpus), is built by automatically querying Google to retrieve text passages in which the answer to a given question is located. The resulting corpus provides a better basis for our experimentation step. We model its construction as a method for Arabic that recovers texts from the Web that may contain answers to our factual questions. To do this, we developed a Java script that retrieves an HTML page for a given question and then cleans this page to obtain a database of passages from which the corpus is built. In addition, we give the preliminary results of our proposed method. Finally, some investigations into the construction of Arabic corpora are also described.
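
The question-to-passage pipeline outlined above can be illustrated with a minimal sketch. It is not the AQA-WebCorp implementation: the class name PassageExtractor, the search URL prefix, the regular-expression cleaning rules, and the minimum passage length are all assumptions made for illustration, and the choice of Java rests on reading the script mentioned above as a Java program. The sketch only builds a query URL and cleans an HTML string that is already in memory; it performs no network access.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

public class PassageExtractor {

    // Assumed search endpoint; the abstract only states that Google is queried.
    static final String SEARCH_PREFIX = "https://www.google.com/search?q=";

    // Turn a factual question into a percent-encoded search URL.
    public static String buildQueryUrl(String question) {
        try {
            return SEARCH_PREFIX + URLEncoder.encode(question, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // UTF-8 support is mandatory on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Strip markup from an HTML page and return its text passages.
    // Passages shorter than minLength characters are discarded
    // (hypothetical filter for menus, buttons, and similar residue).
    public static List<String> extractPassages(String html, int minLength) {
        String text = html
            // Drop script and style blocks together with their content.
            .replaceAll("(?is)<(script|style)[^>]*>.*?</\\1>", " ")
            // Treat common block-level tags as passage boundaries.
            .replaceAll("(?i)</?(p|div|br|li|h[1-6])\\b[^>]*>", "\n")
            // Remove every remaining tag.
            .replaceAll("<[^>]+>", " ")
            // Decode two frequent entities; a real cleaner would handle more.
            .replace("&nbsp;", " ")
            .replace("&amp;", "&");

        List<String> passages = new ArrayList<>();
        for (String line : text.split("\n")) {
            String passage = line.replaceAll("\\s+", " ").trim();
            if (passage.length() >= minLength) {
                passages.add(passage);
            }
        }
        return passages;
    }

    public static void main(String[] args) {
        String question = "ما هي عاصمة مصر؟";
        System.out.println(buildQueryUrl(question));

        String html = "<html><body><script>var x = 1;</script>"
            + "<div>Menu</div>"
            + "<p>القاهرة هي عاصمة جمهورية مصر العربية.</p>"
            + "</body></html>";
        for (String passage : extractPassages(html, 10)) {
            System.out.println(passage);
        }
    }
}
```

Regular expressions keep the sketch dependency-free, but they are fragile on malformed markup; a dedicated HTML parser would be a more robust choice for a corpus of real Web pages.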
