research-article

Learning to detect malicious URLs

Authors:

Lawrence K. Saul,

Geoffrey M. Voelker

ACM Transactions on Intelligent Systems and Technology (TIST), Volume 2, Issue 3

Article No.: 30, Pages 1 - 24

https://doi.org/10.1145/1961189.1961202

Published: 06 May 2011

Abstract

Malicious Web sites are a cornerstone of Internet criminal activities. The dangers of these sites have created a demand for safeguards that protect end-users from visiting them. This article explores how to detect malicious Web sites from the lexical and host-based features of their URLs. We show that this problem lends itself naturally to modern algorithms for online learning. Online algorithms not only process large numbers of URLs more efficiently than batch algorithms, they also adapt more quickly to new features in the continuously evolving distribution of malicious URLs. We develop a real-time system for gathering URL features and pair it with a real-time feed of labeled URLs from a large Web mail provider. From these features and labels, we are able to train an online classifier that detects malicious Web sites with 99% accuracy over a balanced dataset.
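
The idea of learning from a stream of labeled URLs can be illustrated with a minimal sketch. It assumes a plain perceptron as a stand-in for the online algorithm (the abstract does not say which algorithm is used), it extracts only bag-of-words lexical features, and it omits host-based features entirely. The names `lexical_features`, `OnlinePerceptron`, and `run_stream`, as well as the example URLs, are hypothetical and chosen for illustration only.

```python
import re
from urllib.parse import urlsplit


def lexical_features(url):
    """Binary bag-of-words features from the hostname and path of a URL.

    Illustrative only: the tokenization rules here are an assumption,
    not the feature set used in the article.
    """
    parts = urlsplit(url if "://" in url else "http://" + url)
    host = parts.hostname or ""
    feats = {"bias": 1.0}
    for tok in re.split(r"[.-]", host):
        if tok:
            feats["host:" + tok] = 1.0
    for tok in re.split(r"[/?.=&_-]", parts.path + "?" + parts.query):
        if tok:
            feats["path:" + tok] = 1.0
    return feats


class OnlinePerceptron:
    """Mistake-driven linear classifier over sparse features.

    Labels are +1 (malicious) and -1 (benign). New features can appear
    at any time because weights live in a dictionary.
    """

    def __init__(self):
        self.w = {}

    def score(self, feats):
        return sum(self.w.get(k, 0.0) * v for k, v in feats.items())

    def predict(self, feats):
        return 1 if self.score(feats) > 0 else -1

    def update(self, feats, label):
        """Predict first, then learn from the label. Returns True on a mistake."""
        mistake = self.predict(feats) != label
        if mistake:
            for k, v in feats.items():
                self.w[k] = self.w.get(k, 0.0) + label * v
        return mistake


def run_stream(model, stream):
    """Feed (url, label) pairs one at a time; return the number of mistakes."""
    return sum(model.update(lexical_features(url), y) for url, y in stream)


# Hypothetical labeled feed (made-up URLs on reserved-style domains).
stream = [
    ("http://paypal.com.secure-login.example-bad.test/update/account.php", 1),
    ("http://www.example.org/docs/index.html", -1),
    ("http://login.verify-account.example-bad.test/signin.php", 1),
    ("http://news.example.org/articles/today.html", -1),
]

model = OnlinePerceptron()
mistakes = run_stream(model, stream)
print("mistakes on first pass:", mistakes)
```

Each URL is scored before its label is used, so the running mistake count doubles as an online error estimate. Because the model is updated one example at a time and never revisits old data, previously unseen tokens simply become new weight entries, which is the property that lets an online learner follow a changing feature distribution.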

Index Terms

Learning to detect malicious URLs
1. Security and privacy
  1. Network security

Information & Contributors

Information

Published In

ACM Transactions on Intelligent Systems and Technology Volume 2, Issue 3

April 2011

259 pages

ISSN:2157-6904

EISSN:2157-6912

DOI:10.1145/1961189

Copyright © 2011 ACM.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from permissions@acm.org.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 06 May 2011

Accepted: 01 November 2010

Revised: 01 August 2010

Received: 01 March 2010

Published in TIST Volume 2, Issue 3

Qualifiers

Research-article
Research
Refereed
