Abstract
This paper proposes an integrated method for detecting Chinese malicious webpages. First, we collected and released a dataset for Chinese malicious webpage detection, called “ChiMalPages”, which contains URLs and HTML/JavaScript files, and we specified the detailed types of malicious pages according to the relevant laws. Second, we designed a feature template for Chinese webpages and ranked the importance of each feature based on the information gain of the Random Forest algorithm. Third, we fine-tuned BERT on the external URL classification task and on webpage text, producing the new models “BERT-URL” and “BERT-web-text”, respectively. The pre-trained models clearly outperform the baseline models. Finally, we integrated the features from the manual template, BERT-URL, and BERT-web-text; the classification F1 score reaches 79.84%, an increase of 7.37% over manually designed webpage features alone. Experiments show that our BERT-based method is effective and is not biased toward particular detailed classes.
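The feature-ranking step mentioned in the abstract can be illustrated with a minimal sketch. It computes plain information gain for each feature on a toy table and sorts the features by that score. It is not the authors' Random Forest pipeline: the feature names, the six labelled rows, and the helper functions `entropy`, `information_gain`, and `rank_features` are all hypothetical and chosen only for illustration.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels).values()
    return -sum((n / total) * math.log2(n / total) for n in counts)


def information_gain(feature_values, labels):
    """Entropy of the labels minus their weighted entropy within each feature value."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


def rank_features(feature_table, labels):
    """Return (feature name, gain) pairs sorted from most to least informative."""
    gains = {
        name: information_gain(values, labels)
        for name, values in feature_table.items()
    }
    return sorted(gains.items(), key=lambda item: item[1], reverse=True)


# Hypothetical toy data: 1 = malicious, 0 = benign.
# The feature names are illustrative and are not taken from the paper's template.
labels = [1, 1, 1, 0, 0, 0]
feature_table = {
    "has_hidden_iframe": [1, 1, 1, 0, 0, 0],  # separates the two classes perfectly
    "url_has_digits": [1, 1, 0, 1, 0, 0],     # partially informative
    "uses_https": [1, 0, 0, 1, 0, 0],         # carries no information here
}

ranking = rank_features(feature_table, labels)
for name, gain in ranking:
    print(f"{name}: {gain:.4f}")
```

A Random Forest aggregates split-quality scores of this kind over many trees and many candidate splits, so the single-feature score above should be read only as the underlying idea, under the stated assumptions.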
Notes
- 1.
- 2. Some pages belong to more than one class, especially porn and gambling pages.
- 3. The numbers on the horizontal axis correspond to the numbers of the feature items in Table 4.
- 4. The blue dots represent benign URLs. The red dots represent malicious URLs.
Copyright information
© 2022 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Jiang, Y., Wu, D. (2022). An Integrated Chinese Malicious Webpages Detection Method Based on Pre-trained Language Models and Feature Fusion. In: Zhao, X., Yang, S., Wang, X., Li, J. (eds) Web Information Systems and Applications. WISA 2022. Lecture Notes in Computer Science, vol 13579. Springer, Cham. https://doi.org/10.1007/978-3-031-20309-1_14
DOI: https://doi.org/10.1007/978-3-031-20309-1_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20308-4
Online ISBN: 978-3-031-20309-1
eBook Packages: Computer Science, Computer Science (R0)