A Feature Extraction Method for Automatic Web Page Classification System

Abstract:

Web page classification is an important topic because of the growing volume of data available on the World Wide Web and the heterogeneity of its formats. There is therefore a need for ways to manage the web, extract important knowledge from it, and facilitate indexing and searching. This paper proposes a method for extracting features from web pages using WEKA as a data mining tool. The resulting features are used to build an automatic web page classification system with a fixed set of categories (autos, computers & internet, health, sports), based on different web page classification approaches (page text, page title). The system achieves good accuracy, which can be improved using feature weights.
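
To make the idea of weighted features from page title and page text concrete, the following is a minimal sketch in plain Python. It is not the WEKA-based method of the paper: the field weights, the toy training pages, and the nearest-centroid classifier with cosine similarity are all assumptions chosen only for illustration, and the function names (`extract_features`, `train`, `classify`) are hypothetical.

```python
import math
import re
from collections import Counter

# Hypothetical field weights: a term in the title counts more than one in the body.
FIELD_WEIGHTS = {"title": 3.0, "text": 1.0}


def tokenize(s):
    """Lowercase the string and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", s.lower())


def extract_features(page, weights=FIELD_WEIGHTS):
    """Build a weighted bag-of-words vector from the page's title and text."""
    features = Counter()
    for field, weight in weights.items():
        for token in tokenize(page.get(field, "")):
            features[token] += weight
    return features


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(labelled_pages):
    """Sum the feature vectors of each category into one centroid per label."""
    centroids = {}
    for page, label in labelled_pages:
        centroids.setdefault(label, Counter()).update(extract_features(page))
    return centroids


def classify(page, centroids):
    """Return the label whose centroid is most similar to the page."""
    feats = extract_features(page)
    return max(centroids, key=lambda label: cosine(feats, centroids[label]))


# Toy training data (invented for illustration), one page per category.
TRAINING = [
    ({"title": "New sedan review",
      "text": "The engine and brakes of this car are excellent"}, "autos"),
    ({"title": "Laptop buying guide",
      "text": "Choose a processor memory and software for your computer"},
     "computers & internet"),
    ({"title": "Healthy diet tips",
      "text": "Doctors recommend exercise and vitamins for health"}, "health"),
    ({"title": "Football league results",
      "text": "The team won the match with a late goal"}, "sports"),
]

centroids = train(TRAINING)
label = classify({"title": "Car engine problems",
                  "text": "My car brakes squeal"}, centroids)
print(label)
```

Changing the values in `FIELD_WEIGHTS` shifts how much the title and the body text each contribute to the feature vector, which is one simple way to experiment with the effect of feature weights on the predicted category.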