A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification

Authors:

Monika Henzinger,

Ludmila Marian,

Ingmar Weber

ACM Transactions on the Web (TWEB), Volume 5, Issue 3

Article No.: 15, Pages 1 - 29

https://doi.org/10.1145/1993053.1993057

Published: 01 July 2011

Abstract

Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of different feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Web page classification without a page’s content is desirable when the content is not available at all, when a classification is needed before obtaining the content, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several feature-generation techniques and several classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85, and a typical precision of around 86. The precision can be pushed further, to over 90, while maintaining a typical level of recall between 30 and 40.
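To make the setting concrete, the following is a minimal sketch of how features might be generated from a URL alone, and of how precision and recall combine into an F-measure. The abstract does not specify the feature sets, so the token splitting, the stop list, the 4-gram length, and the function names `url_tokens`, `char_ngrams`, and `f_measure` are all illustrative assumptions rather than the authors' method. The F-measure is assumed here to be the balanced harmonic mean of precision and recall.

```python
import re
from urllib.parse import urlparse


def url_tokens(url: str) -> list[str]:
    """Split a URL into lowercase alphabetic tokens (hypothetical feature scheme)."""
    parsed = urlparse(url.lower())
    text = parsed.netloc + parsed.path
    # Assumed stop list of generic URL parts; not taken from the article.
    stop = {"www", "http", "https", "html", "htm", "index"}
    # Split on every run of non-letter characters and drop empty strings.
    return [t for t in re.split(r"[^a-z]+", text) if t and t not in stop]


def char_ngrams(token: str, n: int = 4) -> list[str]:
    """Character n-grams of a token; tokens no longer than n are kept whole."""
    if len(token) <= n:
        return [token]
    return [token[i:i + n] for i in range(len(token) - n + 1)]


def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    tokens = url_tokens("http://www.example.com/sports/football-news.html")
    print(tokens)
    print(char_ngrams("sports"))
    print(f_measure(90, 35))
```

Under these assumptions, a precision of 90 with a recall of 35 corresponds to a balanced F-measure of about 50, which illustrates the trade-off the abstract describes when precision is pushed above 90.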

Index Terms

A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification
1. Information systems
  1. Information retrieval
    1. Document representation

Published In

ACM Transactions on the Web Volume 5, Issue 3

July 2011

177 pages

ISSN:1559-1131

EISSN:1559-114X

DOI:10.1145/1993053

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 01 July 2011

Accepted: 01 December 2010

Revised: 01 September 2010

Received: 01 June 2009

Qualifiers

Research-article
Research
Refereed
