Twitter Arabic Dialect Identification using AraBERT

Abstract:

In this short paper we present our results on Arabic dialect identification. We analyze the challenges posed by the observed dialects in relation to the characteristics and limitations of the dataset. We investigated a model based on AraBERT, a BERT-based language model for Arabic. The selected model was further fine-tuned on about 40% of the 10M unlabelled Arabic tweets provided by the organizers. The model's results showed similarity between the dialects of neighbouring countries. The study is still in progress: to achieve better results, we are working on a different data model based on Arab regions and sub-regions instead of the country or province level.
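
As a rough illustration of the region-based data model mentioned above, country-level labels could be collapsed into coarser region labels before training or evaluation. The sketch below is only an assumption of how this might look: the mapping COUNTRY_TO_REGION, the region names, and the helpers to_region and region_counts are hypothetical and are not taken from the paper, which does not specify its regions or sub-regions.

```python
from collections import Counter

# Hypothetical country-to-region grouping, for illustration only.
# The paper does not define its regions; this mapping is an assumption.
COUNTRY_TO_REGION = {
    "Morocco": "Maghreb", "Algeria": "Maghreb",
    "Tunisia": "Maghreb", "Libya": "Maghreb",
    "Egypt": "Nile Basin", "Sudan": "Nile Basin",
    "Syria": "Levant", "Lebanon": "Levant",
    "Jordan": "Levant", "Palestine": "Levant",
    "Saudi Arabia": "Gulf", "Kuwait": "Gulf", "Qatar": "Gulf",
    "Bahrain": "Gulf", "Oman": "Gulf", "United Arab Emirates": "Gulf",
}


def to_region(country: str) -> str:
    """Map a country-level dialect label to a region label.

    Countries missing from the (assumed) mapping fall back to "Other".
    """
    return COUNTRY_TO_REGION.get(country, "Other")


def region_counts(country_labels):
    """Count how many examples fall into each region."""
    return Counter(to_region(c) for c in country_labels)


if __name__ == "__main__":
    labels = ["Egypt", "Sudan", "Qatar", "Lebanon"]
    print(region_counts(labels))
```

Grouping labels this way reduces the number of classes and merges neighbouring countries whose dialects the model tends to confuse, at the cost of coarser predictions.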