Twitter Arabic Dialect Identification using AraBERT


In this short paper, we present our results on Arabic dialect identification. We analyze the observed dialects and identify the challenges they pose in relation to the characteristics and limitations of the dataset. We investigate AraBERT, a BERT-based language model for Arabic. The selected model is further fine-tuned on about 40% of the 10M unlabelled Arabic tweets provided by the organizers. The model's results show similarity between the dialects of neighbouring countries. The study is still in progress: to achieve better results, we are working on a different data model based on Arabian regions and sub-regions rather than on the country or province level.