Abstract:
As the number of web pages on the internet grows, obtaining information accurately, at reasonable cost and with acceptable performance, has become a major concern. Web page classification is a potential solution, and effective classification benefits applications such as web mining and search engines. Unlike plain-text documents, web pages have properties that limit the performance of otherwise successful pure-text classification methods. Noise in the form of HTML tags, multimedia content, dynamic content and the link structure between pages calls for a closer look at feature selection for web pages. Often these features are filtered out, and classification relies only on the displayed text of the page. In this paper, web page features are instead taken into account during classification because of the potentially valuable information they carry. Specifically, the paper explores the potential of the Uniform Resource Locator (URL), the web page title and the metadata for classifying pages into user-defined categories. The framework applies a suitable machine learning algorithm to each of these features individually, and the resulting classifiers then cast weighted votes to determine the final classification of the web page. This approach showed improvements over both pure-text and virtual-webpage classification approaches.
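The weighted voting step can be illustrated with a minimal sketch. It assumes each per-feature classifier (URL, title, metadata) has already produced one category label; the function name `weighted_vote`, the feature keys, the category names and the weight values are all hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_vote(predictions, weights):
    """Combine per-feature labels into a single label by summed weight.

    predictions: dict mapping a feature name to its predicted category
    weights:     dict mapping a feature name to a non-negative weight
                 (illustrative values, not the paper's)
    """
    if not predictions:
        raise ValueError("at least one per-feature prediction is required")

    scores = defaultdict(float)
    for feature, label in predictions.items():
        # A feature without a configured weight contributes nothing.
        scores[label] += weights.get(feature, 0.0)

    # Highest total weight wins; ties go to the alphabetically first label
    # so that the result is deterministic.
    return max(sorted(scores), key=lambda label: scores[label])


# Hypothetical outputs of three per-feature classifiers for one page.
page_predictions = {"url": "sports", "title": "news", "metadata": "news"}
feature_weights = {"url": 0.5, "title": 0.3, "metadata": 0.3}

final_label = weighted_vote(page_predictions, feature_weights)
print(final_label)
```

In this example the title and metadata classifiers together outweigh the URL classifier, so the page is assigned to "news". How the weights are chosen (for instance, from each classifier's validation accuracy) is left open here.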