Article (DOI: 10.1145/1008992.1009035)

Web-page classification through summarization

Published: 25 July 2004

Abstract

Web-page classification is much more difficult than pure-text classification because of the large variety of noisy information embedded in Web pages. In this paper, we propose a new Web-page classification algorithm based on Web summarization to improve accuracy. We first give empirical evidence that ideal Web-page summaries generated by human editors can indeed improve the performance of Web-page classification algorithms. We then propose a new Web summarization-based classification algorithm and evaluate it along with several other state-of-the-art text summarization algorithms on the LookSmart Web directory. Experimental results show that our proposed summarization-based classification algorithm achieves an approximately 8.8% improvement over a pure-text-based classification algorithm. We further introduce an ensemble classifier using the improved summarization algorithm and show that it achieves about 12.9% improvement over pure-text-based methods.
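
The summarize-then-classify idea in the abstract can be illustrated with a small sketch. This is not the paper's algorithm: the frequency-based sentence scorer, the stopword list, the naive Bayes classifier, and all names (`summarize`, `NaiveBayes`, `STOPWORDS`) are assumptions chosen for illustration only. It shows one plausible way to drop low-content sentences (navigation, copyright lines) before a classifier sees the page.

```python
import math
import re
from collections import Counter, defaultdict

# Hypothetical, deliberately tiny stopword list (an assumption of this sketch).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with"}

def tokenize(text):
    """Lowercase alphabetic tokens with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def summarize(page_text, max_sentences=2):
    """Keep the sentences whose words are most frequent in the page."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", page_text) if s.strip()]
    freq = Counter(tokenize(page_text))

    def score(sentence):
        words = tokenize(sentence)
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    top = sorted(sentences, key=score, reverse=True)[:max_sentences]
    # Emit the selected sentences in their original page order.
    return " ".join(s for s in sentences if s in top)

class NaiveBayes:
    """Multinomial naive Bayes with add-one smoothing."""

    def fit(self, texts, labels):
        self.word_counts = defaultdict(Counter)
        self.class_counts = Counter(labels)
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = {w for counts in self.word_counts.values() for w in counts}
        return self

    def predict(self, text):
        total = sum(self.class_counts.values())
        words = tokenize(text)

        def log_prob(label):
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocab)
            lp = math.log(self.class_counts[label] / total)
            for w in words:
                lp += math.log((counts[w] + 1) / denom)
            return lp

        return max(self.class_counts, key=log_prob)

# Toy usage: train and predict on summaries instead of full page text.
train = [
    ("Football scores and league results. The football team won the match. "
     "Copyright 2004 contact us.", "sports"),
    ("Stock market news. Stock prices rose as market investors bought shares. "
     "Copyright 2004 contact us.", "finance"),
]
clf = NaiveBayes().fit([summarize(t) for t, _ in train], [label for _, label in train])
page = ("Home login. The football match ended and the football team celebrated. "
        "Copyright 2004 contact us.")
prediction = clf.predict(summarize(page))
```

In this toy setup the boilerplate sentence shared by every page scores lowest and is left out of the summaries, so the classifier is trained only on topical sentences.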

Reviews

Christoph F. Strnadl

Undoubtedly, information retrieval is pivotal to the continued success of the Web. One means of accomplishing this retrieval is a meaningful categorization of Web pages, which is nontrivial given the idiosyncratic Hypertext Markup Language (HTML)-based structure of a Web document. This paper proposes a Web page categorization scheme based on Web page summarization: the algorithm first obtains a textual summary of the examined Web page, and the summary is then classified into the respective categories.

The authors implement four different summarization algorithms: an adaptation of Luhn's method, latent semantic analysis, the function-based object model, and supervised classification based on machine learning. Human summarization by category editors is used as a proxy for the "ideal" summary. They test their scheme's effectiveness on 150,000 pre-categorized pages. Classifiers are obtained from the output of the summarization stage by applying a naïve Bayesian classifier-learning method or by using a support vector machine. The authors use the standard F1 measure of the field (the harmonic mean of precision and recall).

Two findings are interesting. First, categorization based on summarization outperforms simple, text-based categorization by approximately 14 percent. Second, no individual summarization algorithm is as effective as human summarization. However, an unweighted sum of the four summarization methods nearly reproduces the "ideal" case.

The paper is easy to follow, and focuses on the overall experiment and its results, with only short introductions to the algorithms used. The intended audience for the paper is the research scientist or professional who is fairly deeply involved in the design or implementation of Web page information management algorithms.

Online Computing Reviews Service
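
Two ideas named in the review, the F1 measure and the unweighted sum of several summarizers, can be sketched in a few lines. The function names and the assumption that each summarizer yields one numeric score per sentence are illustrative only; the review does not specify how the paper combines the methods.

```python
def f1_score(true_labels, predicted_labels, positive):
    """F1 for one category: the harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

def ensemble_scores(score_lists):
    """Unweighted sum of per-sentence scores from several summarizers.

    Assumes (hypothetically) that every summarizer scores the same
    sentences in the same order.
    """
    return [sum(scores) for scores in zip(*score_lists)]

# One true positive, no false positives, two false negatives:
# precision = 1, recall = 1/3, so F1 = 0.5.
example_f1 = f1_score(["a", "a", "a", "b"], ["a", "b", "b", "b"], positive="a")

# Two hypothetical summarizers scoring three sentences.
combined = ensemble_scores([[1, 0, 2], [0, 1, 2]])
```

Because the harmonic mean is dominated by the smaller of its two inputs, a classifier with perfect precision but low recall still receives a modest F1, as the example shows.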

Published In

SIGIR '04: Proceedings of the 27th annual international ACM SIGIR conference on Research and development in information retrieval
July 2004
624 pages
ISBN: 1581138814
DOI: 10.1145/1008992

Publisher

Association for Computing Machinery

New York, NY, United States

Author Tags

  1. content body
  2. web page categorization
  3. web page summarization

Qualifiers

  • Article

Conference

SIGIR04

Acceptance Rates

Overall Acceptance Rate 792 of 3,983 submissions, 20%

Bibliometrics

Article Metrics

  • Downloads (last 12 months): 48
  • Downloads (last 6 weeks): 1

Reflects downloads up to 24 Nov 2024

Cited By

  • (2024) Multilingual Taxonomic Web Page Categorization Through Ensemble Knowledge Distillation. IEEE Transactions on Knowledge and Data Engineering 36:11 (6614-6627). DOI: 10.1109/TKDE.2024.3406368. Online publication date: Nov-2024
  • (2024) Minimizing Web Diversion Using Query Classification and Text Mining. Data Intelligence and Cognitive Informatics (151-165). DOI: 10.1007/978-981-99-7962-2_12. Online publication date: 7-Jan-2024
  • (2023) Methodical Systematic Review of Abstractive Summarization and Natural Language Processing Models for Biomedical Health Informatics: Approaches, Metrics and Challenges. ACM Transactions on Asian and Low-Resource Language Information Processing. DOI: 10.1145/3600230. Online publication date: 31-May-2023
  • (2023) Where a Little Change Makes a Big Difference: A Preliminary Exploration of Children's Queries. Advances in Information Retrieval (522-533). DOI: 10.1007/978-3-031-28238-6_43. Online publication date: 17-Mar-2023
  • (2022) Generating extractive sentiment summaries for natural language user queries on products. ACM SIGAPP Applied Computing Review 22:2 (5-20). DOI: 10.1145/3558053.3558054. Online publication date: 17-Aug-2022
  • (2022) An Extensive Study of Residential Proxies in China. Proceedings of the 2022 ACM SIGSAC Conference on Computer and Communications Security (3049-3062). DOI: 10.1145/3548606.3559377. Online publication date: 7-Nov-2022
  • (2022) Improving text classification using text summarization. 2022 2nd International Conference on New Technologies of Information and Communication (NTIC) (1-8). DOI: 10.1109/NTIC55069.2022.10100492. Online publication date: 21-Dec-2022
  • (2021) An Evolutionary-based Random Weight Networks with Taguchi Method for Arabic Web Pages Classification. Arabian Journal for Science and Engineering 46:4 (3955-3980). DOI: 10.1007/s13369-020-05301-z. Online publication date: 5-Feb-2021
  • (2021) BiGBERT: Classifying Educational Web Resources for Kindergarten-12th Grades. Advances in Information Retrieval (176-184). DOI: 10.1007/978-3-030-72240-1_13. Online publication date: 30-Mar-2021
  • (2020) Segregation of Live News Articles Based on Location Using Machine Learning. International Journal of Scientific Research in Computer Science, Engineering and Information Technology (505-512). DOI: 10.32628/CSEIT206380. Online publication date: 25-May-2020
