An Empirical Analysis towards Processing Moroccan Colloquial User-Generated Text

Abstract:

With the growth of web use in Morocco, the internet has become an important source of information. On social media in particular, Moroccans communicate in several languages, leaving behind unstructured user-generated text that presents several opportunities for Natural Language Processing. Among the languages found in this data, Moroccan Colloquial Arabic (MCA) stands out for its substantial share of the content and its distinctive features. In this paper, we investigate online written text generated by Moroccan users on social media, with an emphasis on MCA. For this purpose, we used several tools and resources, such as a lexicon and a language identification system, to conduct an in-depth study of this data. The most interesting finding to emerge is the orthographic inconsistency of written MCA in both Arabic and Latin scripts. This phenomenon accounts for almost 80% of the MCA user-generated text and demonstrates the need for two major systems: spelling correction and customized MCA transliteration tools.
