Enhancing CRF Model with N-grams for Arabic Chunking Task

Abstract:

Among natural language processing tasks, shallow parsing (also called chunking) has received much attention, owing to the development of a large number of annotated corpora and the rise of effective machine learning techniques. The task is harder for Arabic, whose specific features make it quite different from, and even more ambiguous than, other natural languages when processed. In this paper, we present a method for chunking Arabic texts based on supervised learning. We describe how the Conditional Random Fields (CRF) algorithm and the Penn Arabic Treebank can be used to automatically learn a chunking model for Modern Standard Arabic (MSA). For the experiments, we use more than 10,100 sentences as training data and 2,524 sentences as test data. To evaluate the model, we calculate its accuracy as well as the precision, recall and F-measure by chunks. We consider the results obtained satisfactory (Precision 81.57%, Recall 73.86%, F-measure 76.23%).
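As an illustration of what evaluation "by chunks" can look like, the following minimal Python sketch computes precision, recall and F-measure per chunk type from BIO-tagged sequences. It is not the evaluation script used in this work. The BIO encoding, the tag names (B-NP, I-NP, B-VP, O) and the function names extract_chunks and chunk_scores are assumptions made for the example. A predicted chunk counts as correct only if its type and both boundaries match a gold chunk exactly.

```python
from collections import defaultdict


def extract_chunks(tags):
    """Return the set of (type, start, end) spans in a BIO tag sequence.

    `end` is exclusive. An I- tag that does not continue the current
    chunk type is treated as the start of a new chunk.
    """
    chunks = set()
    start, ctype = None, None
    # A trailing "O" acts as a sentinel that closes the last open chunk.
    for i, tag in enumerate(list(tags) + ["O"]):
        starts_new = tag.startswith("B-") or (
            tag.startswith("I-") and tag[2:] != ctype
        )
        if tag == "O" or starts_new:
            if ctype is not None:
                chunks.add((ctype, start, i))
            if tag == "O":
                start, ctype = None, None
            else:
                start, ctype = i, tag[2:]
    return chunks


def chunk_scores(gold_sents, pred_sents):
    """Return {chunk_type: (precision, recall, f_measure)}."""
    # Per chunk type: [correct, predicted, gold] counts.
    counts = defaultdict(lambda: [0, 0, 0])
    for gold, pred in zip(gold_sents, pred_sents):
        gold_chunks = extract_chunks(gold)
        pred_chunks = extract_chunks(pred)
        for chunk in gold_chunks:
            counts[chunk[0]][2] += 1
        for chunk in pred_chunks:
            counts[chunk[0]][1] += 1
        for chunk in gold_chunks & pred_chunks:
            counts[chunk[0]][0] += 1

    scores = {}
    for ctype, (correct, predicted, gold_n) in counts.items():
        precision = correct / predicted if predicted else 0.0
        recall = correct / gold_n if gold_n else 0.0
        total = precision + recall
        f_measure = 2 * precision * recall / total if total else 0.0
        scores[ctype] = (precision, recall, f_measure)
    return scores


if __name__ == "__main__":
    # Toy sentence with hypothetical tags: the last NP is missed.
    gold = [["B-NP", "I-NP", "B-VP", "O", "B-NP"]]
    pred = [["B-NP", "I-NP", "B-VP", "O", "O"]]
    for ctype, (p, r, f) in sorted(chunk_scores(gold, pred).items()):
        print(f"{ctype}: P={p:.2f} R={r:.2f} F={f:.2f}")
```

Overall figures can be obtained from the per-type scores either by pooling the counts over all chunk types or by averaging the per-type values. The two choices generally give different numbers, so the averaging scheme should be stated alongside the results.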