Learning to detect malicious URLs

Published: 06 May 2011

Abstract

Malicious Web sites are a cornerstone of Internet criminal activities. The dangers of these sites have created a demand for safeguards that protect end-users from visiting them. This article explores how to detect malicious Web sites from the lexical and host-based features of their URLs. We show that this problem lends itself naturally to modern algorithms for online learning. Online algorithms not only process large numbers of URLs more efficiently than batch algorithms, they also adapt more quickly to new features in the continuously evolving distribution of malicious URLs. We develop a real-time system for gathering URL features and pair it with a real-time feed of labeled URLs from a large Web mail provider. From these features and labels, we are able to train an online classifier that detects malicious Web sites with 99% accuracy over a balanced dataset.
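As a rough illustration of the idea in the abstract, the sketch below trains a mistake-driven online classifier on lexical URL features only. It is a minimal sketch under stated assumptions, not the article's system: the perceptron is used as a simple stand-in (the abstract does not name the online algorithm), host-based features are omitted, and the names `lexical_features` and `OnlinePerceptron` and the example URLs are hypothetical.

```python
import re


def lexical_features(url):
    """Bag-of-words over URL tokens, kept separate for hostname and path.

    Assumption: tokens are split on a small set of common delimiters.
    """
    rest = url.split("://", 1)[-1]          # drop the scheme if present
    host, _, path = rest.partition("/")     # hostname vs. everything after it
    feats = {"bias": 1.0}
    for tok in re.split(r"[.\-:]", host.lower()):
        if tok:
            feats["host=" + tok] = 1.0
    for tok in re.split(r"[/?.=\-_&]", path.lower()):
        if tok:
            feats["path=" + tok] = 1.0
    return feats


class OnlinePerceptron:
    """Linear classifier updated one example at a time (+1 malicious, -1 benign)."""

    def __init__(self):
        self.w = {}  # sparse weights; new features are added as they appear

    def score(self, feats):
        return sum(self.w.get(k, 0.0) * v for k, v in feats.items())

    def predict(self, feats):
        return 1 if self.score(feats) > 0 else -1

    def update(self, feats, label):
        """Adjust weights only when the current prediction is wrong.

        Returns True if the example was misclassified before the update.
        """
        if self.predict(feats) == label:
            return False
        for k, v in feats.items():
            self.w[k] = self.w.get(k, 0.0) + label * v
        return True


# Hypothetical labeled stream; a real feed would supply URLs continuously.
stream = [
    ("http://login.example-bank.test/verify/account.php", 1),
    ("http://www.example.org/docs/index.html", -1),
]

model = OnlinePerceptron()
mistakes = sum(model.update(lexical_features(u), y) for u, y in stream)
```

Because the weight vector is a sparse dictionary, tokens that first appear later in the stream simply become new features, which is one way an online learner can keep up with a changing feature set without retraining from scratch.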


    Published In

    ACM Transactions on Intelligent Systems and Technology  Volume 2, Issue 3
    April 2011
    259 pages
    ISSN:2157-6904
    EISSN:2157-6912
    DOI:10.1145/1961189
    Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

    Publisher

    Association for Computing Machinery

    New York, NY, United States

    Publication History

    Published: 06 May 2011
    Accepted: 01 November 2010
    Revised: 01 August 2010
    Received: 01 March 2010
    Published in TIST Volume 2, Issue 3

    Author Tags

    1. Online learning
    2. malicious Web sites

    Qualifiers

    • Research-article
    • Research
    • Refereed
