Article

Web-page classification through summarization

Authors:

Wei-Ying Ma

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

Pages 242 - 249

https://doi.org/10.1145/1008992.1009035

Published: 25 July 2004

Abstract

Web-page classification is much more difficult than pure-text classification because of the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization to improve accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement over pure-text-based classification algorithms. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about 12.9% improvement over pure-text-based methods.
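To make the summarize-then-classify idea concrete, the following is a minimal Python sketch of the summarization step only. It is not the authors' algorithm: it uses plain frequency-based sentence scoring (a much simplified relative of classic extractive methods), and the stop-word list, the function names `tokenize` and `summarize`, and the sample page text are all hypothetical choices made for illustration. The resulting summary, rather than the full page text, would then be passed to whatever text classifier is in use.

```python
import re
from collections import Counter

# Hypothetical, tiny stop-word list for illustration only.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "on", "the", "to", "with"}


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def summarize(text, max_sentences=2):
    """Keep the sentences whose content words are most frequent in the page."""
    # Naive sentence split: break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    # Page-wide frequency of every non-stop word.
    freq = Counter(w for w in tokenize(text) if w not in STOP_WORDS)

    def score(sentence):
        # Average page-wide frequency of the sentence's content words.
        words = [w for w in tokenize(sentence) if w not in STOP_WORDS]
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    # Rank by score (ties broken by position), then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: (-score(sentences[i]), i))
    keep = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in keep)


# Assumed sample page: navigation and footer text act as the "noise".
page_text = (
    "Home Login Register. "
    "Cheap flights and hotel deals for travel. "
    "Book travel flights with flexible travel dates. "
    "Copyright notice."
)
summary = summarize(page_text, max_sentences=2)
print(summary)
```

In this toy input the navigation and copyright lines score lowest and are dropped, which is the noise-reduction effect the abstract attributes to summarization.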

Reviews

Reviewer: Christoph F. Strnadl

Undoubtedly, information retrieval is pivotal to the continued success of the Web. One means of accomplishing this retrieval is by way of a significant categorization of Web pages (which is nontrivial, given the idiosyncratic Hypertext Markup Language (HTML)-based structure of a Web document). This paper proposes a Web page categorization scheme based on Web page summarization: the algorithm first obtains a textual summary of the examined Web page, and the summary is then classified into the respective categories.

The authors implement four different summarization algorithms (an adaptation of Luhn's method, latent semantic analysis, the function-based object model, and supervised classification based on machine learning; human summarization by category editors is used as a proxy for the "ideal" summary). They test their scheme's effectiveness on 150,000 pre-categorized pages. Classifiers are obtained from the output of the summarization stage, by applying a naïve Bayesian classifier-learning method, or using a support vector machine. The authors use the standard F1 measure of the field (the harmonic mean of precision and recall).

Two findings are interesting. First, categorization based on summarization outperforms simple, text-based categorization by approximately 14 percent. Second, no individual summarization algorithm is as effective as human summarization. However, an unweighted sum of the four summarization methods nearly reproduces the "ideal" case.

The paper is easy to follow, and focuses on the overall experiment and its results, with only short introductions to the algorithms used. The intended audience for the paper is the research scientist or professional who is fairly deeply involved in the design or implementation of Web page information management algorithms.

Online Computing Reviews Service
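The F1 measure mentioned in the review can be computed directly from per-category counts. The following is a minimal Python sketch, assuming counts of true positives, false positives, and false negatives are available for a category; the function name `f1_score` and the example counts are hypothetical and are not taken from the paper.

```python
def f1_score(true_positives, false_positives, false_negatives):
    """Return F1, the harmonic mean of precision and recall."""
    predicted = true_positives + false_positives  # pages assigned to the category
    actual = true_positives + false_negatives     # pages that belong to the category
    if predicted == 0 or actual == 0:
        return 0.0
    precision = true_positives / predicted
    recall = true_positives / actual
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Assumed example: 8 pages correctly labeled, 2 wrongly labeled, 8 missed.
# Precision is 8/10 and recall is 8/16.
example_f1 = f1_score(true_positives=8, false_positives=2, false_negatives=8)
print(round(example_f1, 4))
```

Because the harmonic mean is pulled toward the smaller of its two inputs, a classifier cannot score well on F1 by maximizing precision or recall alone.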

Information

Published In

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval

July 2004

624 pages

ISBN:1581138814

DOI:10.1145/1008992

General Chair: Mark Sanderson, University of Sheffield (UK)

Program Chairs: Kalervo Järvelin, University of Tampere (Finland); James Allan, University of Massachusetts (USA); Peter Bruza, Distributed Systems Technology Centre (Australia)

Publisher

Association for Computing Machinery

New York, NY, United States

Qualifiers

Article

Conference

SIGIR04

SIGIR04: The 27th ACM/SIGIR International Symposium on Information Retrieval 2004

July 25 - 29, 2004

Sheffield, United Kingdom

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%
