Web-page classification through summarization
Proceedings of the 27th annual international ACM SIGIR conference on …, 2004•dl.acm.org
Web-page classification is much more difficult than pure-text classification because of the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization to improve accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement over a pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about a 12.9% improvement over pure-text-based methods.
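The abstract does not say which summarizer or classifier is used, so the following is only a minimal sketch of the general idea it describes: reduce a noisy page to a short summary, then classify the summary rather than the full text. It assumes a simple frequency-based extractive summarizer and a multinomial Naive Bayes classifier as stand-ins. Every name here (`summarize`, `NaiveBayes`, `classify_page`), the stopword list, and the toy training data are hypothetical and chosen for illustration; this is not the authors' algorithm.

```python
import math
import re
from collections import Counter, defaultdict

# Tiny illustrative stopword list (an assumption, not from the paper).
STOPWORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "for", "on",
             "is", "are", "our", "your", "now", "here", "all"}


def tokenize(text):
    """Return lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]


def summarize(page_text, max_sentences=2):
    """Keep the sentences whose words are, on average, most frequent in the page."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", page_text) if s.strip()]
    freq = Counter(tokenize(page_text))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    keep = sorted(ranked[:max_sentences])  # restore original sentence order
    return " ".join(sentences[i] for i in keep)


class NaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing over word counts."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of documents
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.doc_counts[label] += 1
            self.vocab.update(tokens)
        return self

    def predict(self, text):
        tokens = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label in self.doc_counts:
            total_words = sum(self.word_counts[label].values())
            log_prob = math.log(self.doc_counts[label] / total_docs)
            for w in tokens:
                count = self.word_counts[label][w]
                log_prob += math.log((count + 1) / (total_words + len(self.vocab)))
            if log_prob > best_score:
                best_label, best_score = label, log_prob
        return best_label


def classify_page(model, page_text, max_sentences=2):
    """Summarize the page first, then classify only the summary."""
    return model.predict(summarize(page_text, max_sentences))


# Toy usage with made-up data.
model = NaiveBayes().fit(
    ["football match goal team league",
     "tennis player wins match tournament",
     "stock market shares investors bank",
     "bank interest rates loan investors"],
    ["sports", "sports", "finance", "finance"],
)
page = ("The team won the football match. "
        "The team scored a late goal in the match. "
        "Click here to subscribe.")
print(summarize(page))
print(classify_page(model, page))
```

In this toy page the navigation-style sentence ("Click here to subscribe.") scores lowest and is dropped before classification, which is the kind of noise reduction the abstract motivates. The ensemble classifier mentioned in the abstract is not sketched here because its construction is not described in this text.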