Twitter Arabic Dialect Identification using Different Feature Engineering Models

Abstract:

Arabic is one of the most widely spoken languages in the world. Despite the variety of Arabic dialects, there is a dearth of Arabic content online, which makes NLP and machine learning tasks involving Arabic very challenging. In this paper we present results on Arabic dialect identification. We analysed the challenges posed by the observed dialects in relation to the characteristics and limitations of the dataset. We investigated different feature engineering techniques on a Twitter dataset collected from different Arabic-speaking countries, comparing dialect identification accuracy across TF-IDF, AraVec, and BERT language models fine-tuned for Arabic. We fine-tuned six different pretrained BERT models on the data, and some of them improved on previous studies that used the same dataset. To address class imbalance, we re-modelled the dataset into several versions, and we obtained better results when the experiments were based on regions instead of countries. We obtained 63.8% accuracy with AraBERT, using the pretrained model bert-base-arabertv02-twitter, and with MARBERT. The models also showed that the dialects of neighbouring countries are similar. An ensemble model yielded no significant improvement in the results.
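As an illustration of the simplest of the feature types compared here, the following is a minimal sketch of TF-IDF weighting over character n-grams, written in plain Python. It is not the pipeline used in this study: the use of character trigrams, the smoothed IDF form ln((1+N)/(1+df))+1, the helper names (char_ngrams, tfidf_vectors, cosine) and the three toy sentences are all assumptions made for the example.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Split text into overlapping character n-grams (padded with spaces)."""
    padded = f" {text} "
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Return one sparse TF-IDF vector (dict: n-gram -> weight) per document."""
    counts_per_doc = [Counter(char_ngrams(d, n)) for d in docs]

    # Document frequency: in how many documents each n-gram occurs.
    df = Counter()
    for counts in counts_per_doc:
        df.update(counts.keys())

    num_docs = len(docs)
    vectors = []
    for counts in counts_per_doc:
        total = sum(counts.values())
        vectors.append({
            # Relative term frequency times a smoothed IDF (assumed variant).
            gram: (c / total) * (math.log((1 + num_docs) / (1 + df[gram])) + 1)
            for gram, c in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * b.get(g, 0.0) for g, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Toy sentences (hypothetical, not taken from the dataset).
docs = ["ازيك النهارده", "ازيك يا صاحبي", "كيفك اليوم"]
vecs = tfidf_vectors(docs)

sim_same_greeting = cosine(vecs[0], vecs[1])
sim_other_greeting = cosine(vecs[0], vecs[2])
print(sim_same_greeting > sim_other_greeting)
```

In this sketch the first two sentences share a greeting word and therefore several trigrams, so their vectors are closer to each other than to the third. A classifier trained on such vectors relies on exactly this kind of surface overlap, which is one reason closely related dialects can be hard to separate with TF-IDF features alone.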