A New Technique for Automatic Text Categorization for Arabic Documents

Abstract:

With Arabic documents now widespread on the Internet, there is an urgent need for systems that can process them. In this paper, we propose a new Automatic Text Categorization (ATC) technique for Arabic documents based on a light stemming algorithm, which removes prefixes and suffixes from words. Despite the complexity of the Arabic language, our technique achieves a high F-measure, ranging from 0.85 to 0.987 with an average of 0.955. These results are comparable to those reported for the well-studied English language using the best ATC techniques, including Support Vector Machines.
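To make the idea of light stemming concrete, the following is a minimal sketch of a stemmer that strips at most one prefix and one suffix from a word. The affix lists, the minimum stem length, and the function name `light_stem` are assumptions chosen for illustration; they are not taken from the technique proposed in this paper, whose exact affix sets are not stated in the abstract.

```python
# Illustrative light stemmer: removes at most one prefix and one suffix.
# PREFIXES, SUFFIXES and MIN_STEM_LEN are assumed values for this sketch,
# not the lists used by the paper.
PREFIXES = ["وال", "بال", "كال", "فال", "ال", "لل", "و"]
SUFFIXES = ["ها", "ان", "ات", "ون", "ين", "يه", "ية", "ه", "ة", "ي"]
MIN_STEM_LEN = 3  # assumed: never shorten a word below three letters


def light_stem(word: str) -> str:
    """Return the word with one matching prefix and one matching suffix removed."""
    # Try longer prefixes first so that "وال" wins over "و".
    for prefix in sorted(PREFIXES, key=len, reverse=True):
        if word.startswith(prefix) and len(word) - len(prefix) >= MIN_STEM_LEN:
            word = word[len(prefix):]
            break
    # Same policy for suffixes: longest match first, keep the stem long enough.
    for suffix in sorted(SUFFIXES, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM_LEN:
            word = word[:-len(suffix)]
            break
    return word


if __name__ == "__main__":
    for w in ["المكتبة", "والكتاب", "كتب"]:
        print(w, "->", light_stem(w))
```

Unlike root extraction, a light stemmer of this kind does no morphological analysis; it only trims common affixes, so different surface forms of a word are more likely to map to the same indexing term before categorization.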