Abstract:
Web page classification has produced a wide range of findings, reported by different researchers using different algorithms. The task has grown from single-label classification at its inception to multi-label classification today. The demand for better performance and timely results has also driven a shift from supervised to unsupervised approaches to Web page classification, as demonstrated by the numerous automatic machine learning algorithms in previous research. The results reported for these classification algorithms vary with the particular approach and algorithm adopted, and we anticipate that many more such algorithms, and variations of them, will emerge. In this work we therefore propose a Hyper-Classification Framework that takes a Web page from a given dataset and automatically assigns the best classification algorithm(s) to it, using geometry features of the Web page and a combination of multiple Web features. We conducted experiments on a set of Web pages from the Yahoo! Directory, a renowned Web taxonomy maintained by human editors of yahoo.com. Our results show the possibility of improving the performance of existing classifiers and of reducing resource consumption over a large-scale Web taxonomy.