Abstract
Phishing is a persistent and major threat on the internet that is growing steadily. It is a type of cyber-attack in which a phisher mimics a legitimate web page to harvest a victim’s sensitive information, such as usernames, emails, passwords, and bank or credit card details. To prevent such attacks, several phishing detection techniques have been proposed, including AI-based, third-party, heuristic, and content-based approaches. However, these approaches suffer from a number of limitations that need to be addressed in order to detect phishing URLs. Firstly, the feature sets extracted in past work are extensive, and extracting them takes a considerable amount of time. Secondly, several approaches select important features using statistical methods, while others propose their own features. Although both methods have been implemented successfully in various approaches, they produce incorrect results when not supported by domain knowledge. Thirdly, most of the literature has used pre-classified and smaller datasets, which fail to reflect the efficiency and precision achievable on large, real-world datasets. Fourthly, previously proposed approaches lack advanced evaluation measures. Hence, in this paper, an effective machine learning framework is proposed, which predicts phishing URLs without visiting the webpage or using any third-party services. The proposed technique is URL-based and uses the full URL, protocol scheme, hostname, path area of the URL, an entropy feature, suspicious words, and brand name matching using the TF-IDF technique for the classification of phishing URLs. The experiments are carried out on six different datasets using eight different machine learning classifiers, in which Random Forest achieved significantly higher accuracy than the other classifiers on all the datasets. The proposed framework, with only 30 features, achieved accuracies of 96.25% and 94.65% on the Kaggle datasets. The comparative results show that the proposed model achieved accuracies of 92.2%, 91.63%, 94.80%, and 96.85% on benchmark datasets, which is higher than the existing approaches.
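The kind of URL-only features named in the abstract (protocol scheme, hostname, path, entropy, suspicious words) can be illustrated with a minimal sketch. This is not the paper's implementation: the function names, the exact feature set, and the `SUSPICIOUS_WORDS` list are assumptions made for illustration, and the brand-name TF-IDF matching and the classifier are left out.

```python
import math
from collections import Counter
from urllib.parse import urlsplit

# Hypothetical word list, for illustration only; the paper's own list is not given here.
SUSPICIOUS_WORDS = ("login", "verify", "secure", "account", "update")


def shannon_entropy(text: str) -> float:
    """Shannon entropy, in bits per character, of the character distribution."""
    if not text:
        return 0.0
    counts = Counter(text)
    total = len(text)
    return -sum((n / total) * math.log2(n / total) for n in counts.values())


def url_features(url: str) -> dict:
    """Lexical features computed from the URL string alone (no page fetch)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    lowered = url.lower()
    return {
        "url_length": len(url),
        "uses_https": int(parts.scheme == "https"),
        "host_length": len(host),
        "host_dots": host.count("."),
        "path_length": len(parts.path),
        "path_depth": parts.path.count("/"),
        "url_entropy": shannon_entropy(url),
        # Number of distinct listed words that occur anywhere in the URL.
        "suspicious_word_count": sum(word in lowered for word in SUSPICIOUS_WORDS),
    }


if __name__ == "__main__":
    for name, value in url_features(
        "http://secure-login.example.com/account/verify"
    ).items():
        print(f"{name}: {value}")
```

A feature dictionary of this shape could be turned into a numeric vector and passed to any standard classifier; because every value is derived from the URL string, no request to the page or to an external service is needed.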
About this article
Cite this article
Jalil, S., Usman, M. & Fong, A. Highly accurate phishing URL detection based on machine learning. J Ambient Intell Human Comput 14, 9233–9251 (2023). https://doi.org/10.1007/s12652-022-04426-3
DOI: https://doi.org/10.1007/s12652-022-04426-3