A Novel Phishing Website Detection Model Based on LightGBM and Domain Name Features
Figure 1. Framework of phishing website detection model.
Figure 2. Morpheme-related-feature extraction algorithm flowchart.
Figure 3. Domain-name information-extraction crawler program steps.
Figure 4. Leaf-wise growth strategy.
Figure 5. Dataset partition.
Figure 6. Relationship between feature dimension and the classification effect.
Figure 7. The relationship between feature types and detection effects.
Figure 8. Detection efficiency of different models.
Abstract
1. Introduction
2. Related Work
3. Methodology
3.1. Framework of the Model
3.2. Feature Analysis
3.2.1. Features of the Characters Used in the Domain Name
3.2.2. Features of Information on the Domain Name
3.3. LightGBM
4. Results and Discussion
4.1. Experimental Data
4.2. Evaluation Indicators
4.3. Influence of Model Parameters on Experimental Results
4.4. Comparative Analysis of Experimental Results
5. Conclusions
Author Contributions
Funding
Institutional Review Board Statement
Informed Consent Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
Number | Domain Name |
---|---|
1 | mazoon57168.uc.r.appspot.com (accessed on 12 June 2022) |
2 | alibaba.com.spatialsys.com.ru (accessed on 12 June 2022) |
3 | privacy.apple.com.info-sign.in (accessed on 12 June 2022) |
4 | www.nothingelsefilm.com (accessed on 12 June 2022) |
5 | paypal-limited.pdcotton.com (accessed on 12 June 2022) |
Serial Number | [Learning Rate, Estimators, Maximum Depth] | Accuracy (%) | Precision (%) | Recall (%) | F1 (%) |
---|---|---|---|---|---|
1 | [0.01,60,60] | 92.52 | 93.66 | 91.21 | 92.42 |
2 | [0.05,60,60] | 92.92 | 93.81 | 91.90 | 92.85 |
3 | [0.1,60,60] | 93.60 | 94.07 | 93.07 | 93.57 |
4 | [0.2,60,60] | 93.64 | 94.47 | 92.71 | 93.58 |
5 | [0.3,60,60] | 93.57 | 94.04 | 93.04 | 93.54 |
6 | [0.5,60,60] | 93.64 | 94.02 | 93.21 | 93.61 |
7 | [1,60,60] | 88.16 | 89.35 | 86.64 | 87.98 |
8 | [0.2,70,60] | 93.63 | 93.95 | 93.27 | 93.61 |
9 | [0.2,80,60] | 93.70 | 94.08 | 93.27 | 93.67 |
10 | [0.2,90,60] | 93.68 | 94.10 | 93.21 | 93.65 |
11 | [0.2,100,60] | 93.57 | 94.01 | 93.07 | 93.54 |
12 | [0.2,110,60] | 93.68 | 94.18 | 93.13 | 93.65 |
13 | [0.2,120,60] | 93.54 | 94.03 | 92.99 | 93.51 |
14 | [0.2,130,60] | 92.92 | 93.81 | 91.90 | 92.85 |
15 | [0.2,80,50] | 93.71 | 94.35 | 92.99 | 93.67 |
16 | [0.2,80,40] | 93.93 | 94.40 | 93.41 | 93.90 |
17 | [0.2,80,30] | 93.88 | 94.57 | 93.10 | 93.83 |
18 | [0.2,80,20] | 93.68 | 94.43 | 92.85 | 93.63 |
19 | [0.2,80,10] | 93.45 | 94.35 | 92.43 | 93.38 |
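
The F1 column in the table above can be cross-checked against the precision and recall columns. The following is a minimal sketch, assuming the standard definition F1 = 2PR / (P + R); the names `f1_score`, `ROWS` and `best` are illustrative and do not come from the paper. It recomputes F1 for every row and picks the configuration with the highest reported accuracy.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in percent)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rows copied from the table above:
# (serial, (learning rate, estimators, maximum depth), accuracy, precision, recall, reported F1)
ROWS = [
    (1, (0.01, 60, 60), 92.52, 93.66, 91.21, 92.42),
    (2, (0.05, 60, 60), 92.92, 93.81, 91.90, 92.85),
    (3, (0.1, 60, 60), 93.60, 94.07, 93.07, 93.57),
    (4, (0.2, 60, 60), 93.64, 94.47, 92.71, 93.58),
    (5, (0.3, 60, 60), 93.57, 94.04, 93.04, 93.54),
    (6, (0.5, 60, 60), 93.64, 94.02, 93.21, 93.61),
    (7, (1, 60, 60), 88.16, 89.35, 86.64, 87.98),
    (8, (0.2, 70, 60), 93.63, 93.95, 93.27, 93.61),
    (9, (0.2, 80, 60), 93.70, 94.08, 93.27, 93.67),
    (10, (0.2, 90, 60), 93.68, 94.10, 93.21, 93.65),
    (11, (0.2, 100, 60), 93.57, 94.01, 93.07, 93.54),
    (12, (0.2, 110, 60), 93.68, 94.18, 93.13, 93.65),
    (13, (0.2, 120, 60), 93.54, 94.03, 92.99, 93.51),
    (14, (0.2, 130, 60), 92.92, 93.81, 91.90, 92.85),
    (15, (0.2, 80, 50), 93.71, 94.35, 92.99, 93.67),
    (16, (0.2, 80, 40), 93.93, 94.40, 93.41, 93.90),
    (17, (0.2, 80, 30), 93.88, 94.57, 93.10, 93.83),
    (18, (0.2, 80, 20), 93.68, 94.43, 92.85, 93.63),
    (19, (0.2, 80, 10), 93.45, 94.35, 92.43, 93.38),
]

# Configuration with the highest reported accuracy.
best = max(ROWS, key=lambda row: row[2])

for serial, params, accuracy, precision, recall, reported_f1 in ROWS:
    recomputed = f1_score(precision, recall)
    print(f"{serial:2d} {params} reported={reported_f1:.2f} recomputed={recomputed:.2f}")

print("best by accuracy:", best[0], best[1])
```

Because the table reports precision and recall rounded to two decimals, a recomputed F1 can differ from the reported one in the last digit (row 7 is one such case).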
Number | Feature Dimension | Training Time per Epoch (s) |
---|---|---|
1 | 78 | 256 |
2 | 70 | 148 |
3 | 50 | 97 |
4 | 30 | 64 |
5 | 10 | 36 |
Number | Feature Category | Feature |
---|---|---|
1 | N-gram | 2-gram sequence matrix |
2 | Quantitative feature and matching feature | Domain name character length |
2 | Quantitative feature and matching feature | Percentage of numbers in the domain name |
2 | Quantitative feature and matching feature | Number of sensitive words in the domain name |
2 | Quantitative feature and matching feature | Top-level domain location |
2 | Quantitative feature and matching feature | Type of top-level domain |
3 | Maximum-segmentation-related features | Maximum number of domain matching splits |
3 | Maximum-segmentation-related features | Maximum domain split length |
3 | Maximum-segmentation-related features | Number of misspelled divisions |
4 | Edit distance | Edit distance |
5 | Address | IP |
5 | Address | Physical location |
6 | Time | Update time |
6 | Time | Existence time |
7 | WHOIS | Completeness of WHOIS information |
8 | Filing | Whether the domain name has been filed |
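
Several of the character-level features in the table above can be computed directly from the domain-name string. The following is a minimal sketch under stated assumptions, not the authors' implementation: the word list `SENSITIVE_WORDS` is hypothetical (the paper's actual list is not reproduced here), all function names are illustrative, and the edit distance is the plain Levenshtein distance, whose reference string (for example a known brand name) is also an assumption.

```python
from collections import Counter

# Hypothetical sensitive-word list, for illustration only.
SENSITIVE_WORDS = ("login", "secure", "account", "paypal", "apple", "sign", "privacy")


def char_ngrams(domain: str, n: int = 2) -> Counter:
    """Count overlapping character n-grams (2-grams by default)."""
    return Counter(domain[i:i + n] for i in range(len(domain) - n + 1))


def digit_ratio(domain: str) -> float:
    """Fraction of characters in the domain name that are digits."""
    if not domain:
        return 0.0
    return sum(ch.isdigit() for ch in domain) / len(domain)


def sensitive_word_count(domain: str, words=SENSITIVE_WORDS) -> int:
    """Number of listed words that occur as substrings of the domain name."""
    lowered = domain.lower()
    return sum(1 for word in words if word in lowered)


def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed with a rolling dynamic-programming row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def extract_features(domain: str) -> dict:
    """Collect a few string-level features for one domain name."""
    return {
        "length": len(domain),
        "digit_ratio": digit_ratio(domain),
        "sensitive_words": sensitive_word_count(domain),
        "distinct_2grams": len(char_ngrams(domain)),
    }


# Two of the sample domain names listed earlier in the article.
for name in ("paypal-limited.pdcotton.com", "mazoon57168.uc.r.appspot.com"):
    print(name, extract_features(name))
```

The information-based features (IP, physical location, update and existence time, WHOIS completeness, filing status) require external lookups and are left out of this sketch.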
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Zhou, J.; Cui, H.; Li, X.; Yang, W.; Wu, X. A Novel Phishing Website Detection Model Based on LightGBM and Domain Name Features. Symmetry 2023, 15, 180. https://doi.org/10.3390/sym15010180