An Integrated Chinese Malicious Webpages Detection Method Based on Pre-trained Language Models and Feature Fusion

Yanting Jiang &
Di Wu

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 13579)

Included in the following conference series:

International Conference on Web Information Systems and Applications

Abstract

This paper proposes an integrated method for detecting malicious Chinese webpages. First, we collected and released a Chinese malicious webpage detection dataset, “ChiMalPages”, containing URLs and HTML/JavaScript files, and specified the detailed types of malicious pages according to relevant laws. Second, we designed a feature template for Chinese webpages and ranked each feature’s importance using the information gain of the Random Forest algorithm. Third, we fine-tuned BERT separately on an external URL classification task and on webpage text, producing two new models, “BERT-URL” and “BERT-web-text”. The pre-trained models clearly outperform the baseline models. Finally, we integrated features from the manual template, BERT-URL, and BERT-web-text; the classification F1 score reaches 79.84%, an increase of 7.37% over manually designed webpage features. Experiments show that our BERT-based method is effective and is not biased toward particular detailed classes.
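The abstract does not give implementation details, so the following is only a minimal sketch of two ideas it names: ranking template features by information gain, and fusing feature vectors from several sources. It makes several assumptions. Information gain is computed directly for single discrete features rather than inside a trained Random Forest; fusion is done by plain concatenation; and the feature names (`has_hidden_iframe`, `url_has_ip`), the toy rows, and the stand-in vectors for BERT-URL and BERT-web-text outputs are all hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy reduction obtained by splitting `labels` on one discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    # Weighted entropy that remains after the split.
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(rows, labels, names):
    """Return (name, gain) pairs sorted from most to least informative."""
    gains = [
        (name, information_gain([row[i] for row in rows], labels))
        for i, name in enumerate(names)
    ]
    return sorted(gains, key=lambda pair: pair[1], reverse=True)


def fuse(*vectors):
    """Feature fusion by concatenation (an assumption, not the paper's stated method)."""
    return [x for vec in vectors for x in vec]


# Hypothetical toy data: 1 = malicious, 0 = benign.
names = ["has_hidden_iframe", "url_has_ip"]
rows = [(1, 0), (1, 1), (0, 0), (0, 1)]
labels = [1, 1, 0, 0]
ranking = rank_features(rows, labels, names)

# Hypothetical per-page vectors: template features, then two model outputs.
fused = fuse([1, 0], [0.2, 0.8], [0.9, 0.1])
```

In this toy data the first feature separates the classes perfectly and the second carries no information, so the ranking places `has_hidden_iframe` first. A fused vector like `fused` would then be passed to a downstream classifier.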

Notes

1. https://jubao.anquan.org/exposure.
2. Some pages belong to more than one class, especially porn and gambling pages.
3. Numbers on the horizontal axis correspond to the numbers of the feature items in Table 4.
4. The blue dots represent benign URLs. The red dots represent malicious URLs.

Author information

Authors and Affiliations

Chengdu Aeronautic Polytechnic, Chengdu, 610100, China
Yanting Jiang
Sichuan University of Media and Communications, Chengdu, 611745, China
Yanting Jiang
Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, 518000, China
Di Wu
China University of Chinese Academy of Sciences, Shenzhen, 518000, China
Di Wu

Corresponding author

Correspondence to Yanting Jiang.

Editor information

Editors and Affiliations

National University of Defense Technology, Changsha, China
Xiang Zhao
Guangzhou University, Guangzhou, China
Shiyu Yang
Tianjin University, Tianjin, China
Xin Wang
Deakin University, Melbourne, VIC, Australia
Jianxin Li

About this paper

Cite this paper

Jiang, Y., Wu, D. (2022). An Integrated Chinese Malicious Webpages Detection Method Based on Pre-trained Language Models and Feature Fusion. In: Zhao, X., Yang, S., Wang, X., Li, J. (eds) Web Information Systems and Applications. WISA 2022. Lecture Notes in Computer Science, vol 13579. Springer, Cham. https://doi.org/10.1007/978-3-031-20309-1_14

DOI: https://doi.org/10.1007/978-3-031-20309-1_14
Published: 08 December 2022
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-20308-4
Online ISBN: 978-3-031-20309-1
eBook Packages: Computer Science, Computer Science (R0)
