
A Comprehensive Study of Features and Algorithms for URL-Based Topic Classification

Published: 01 July 2011

Abstract

Given only the URL of a Web page, can we identify its topic? We study this problem in detail by exploring a large number of feature sets and algorithms on several datasets. We also show that the inherent overlap between topics and the sparsity of the information in URLs make this a very challenging problem. Classifying a Web page without its content is desirable when the content is not available at all, when a classification is needed before the content is obtained, or when classification speed is of utmost importance. For our experiments we used five different corpora comprising a total of about 3 million (URL, classification) pairs. We evaluated several feature generation techniques and classification algorithms. The individual binary classifiers were then combined via boosting into metabinary classifiers. We achieve typical F-measure values between 80 and 85 and a typical precision of around 86. Precision can be pushed above 90 while maintaining a typical level of recall between 30 and 40.
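To make the setting concrete, the following is a minimal sketch of how features might be generated from a URL alone, and of how precision and recall combine into an F-measure. The abstract does not specify the feature sets or the exact F-measure variant, so everything here is an assumption for illustration: the tokenization rule, the choice of character 4-grams, the balanced F-measure (harmonic mean), and the helper names `url_tokens`, `char_ngrams`, and `f_measure` are hypothetical and not taken from the article.

```python
import re
from urllib.parse import urlsplit


def url_tokens(url):
    """Split a URL into lowercase alphabetic tokens.

    Assumed tokenization: break on any non-letter character and drop
    one-letter tokens and the uninformative 'www' prefix.
    """
    parts = urlsplit(url.lower())
    text = parts.netloc + parts.path + "?" + parts.query
    return [t for t in re.split(r"[^a-z]+", text) if len(t) > 1 and t != "www"]


def char_ngrams(tokens, n=4):
    """Character n-grams taken within each token (n=4 is an assumption).

    Tokens of length n or shorter are kept whole.
    """
    grams = []
    for tok in tokens:
        if len(tok) <= n:
            grams.append(tok)
        else:
            grams.extend(tok[i:i + n] for i in range(len(tok) - n + 1))
    return grams


def f_measure(precision, recall):
    """Balanced F-measure: harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # example.com is a placeholder URL, not one from the article's corpora.
    tokens = url_tokens("http://www.example.com/sports/football-news.html")
    print(tokens)
    print(char_ngrams(tokens))
    # Illustrative operating point: high precision with low recall.
    print(f_measure(90, 35))
```

Under the balanced definition assumed here, an operating point with precision 90 and recall 35 gives an F-measure of about 50, which illustrates the cost in recall of pushing precision above 90.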


    Information

    Published In

    ACM Transactions on the Web  Volume 5, Issue 3
    July 2011
    177 pages
    ISSN:1559-1131
    EISSN:1559-114X
    DOI:10.1145/1993053

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 01 July 2011
    Accepted: 01 December 2010
    Revised: 01 September 2010
    Received: 01 June 2009
    Published in TWEB Volume 5, Issue 3


    Author Tags

    1. ODP
    2. Topic classification
    3. URL

    Qualifiers

    • Research-article
    • Research
    • Refereed
