DOI: 10.1145/3565291.3565300

Design and Implementation of Web Page Classification Method Based on XLNet Fusing Hierarchical Attention and CNN

Published: 16 December 2022

Abstract

The explosive growth in the number of web pages makes web page classification crucial for web information retrieval, content filtering, and topic crawling. Traditional web page classification models extract web text features incompletely and have difficulty capturing and using both global and local semantic features. To address these problems, we propose XLNet-HAC, a web page classification model that fuses Hierarchical Attention and CNN on top of the pre-trained model XLNet. For web page text extracted using URLs, XLNet serves as the embedding layer and produces a feature matrix representation rich in contextual relationships. The word attention and sentence attention mechanisms of Hierarchical Attention capture the words and sentences that contribute most to the classification, generating a global feature representation of the web page text, while a multi-channel CNN with convolutional kernels of different sizes extracts local features of the text at multiple granularities. Finally, the outputs of Hierarchical Attention and CNN are each passed to a softmax classifier, and the two classification results are fused to obtain the final classification result. Comparative experiments on the THUCNews and DMOZ datasets show that the proposed XLNet-HAC model outperforms the comparison models in classification accuracy and F1-score.
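The two ideas at the end of the pipeline, attention-weighted pooling and fusion of two softmax outputs, can be illustrated with a minimal sketch. The abstract does not give the attention scoring function or the fusion rule, so both are assumptions here: attention scores are plain dot products with a context vector (the learned projection used in Hierarchical Attention networks is omitted), and fusion is a weighted average controlled by a hypothetical parameter `alpha`. The function names `attention_pool` and `fuse` are illustrative and do not come from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(vectors, context):
    """Pool a sequence of vectors into one vector.

    Each vector is scored by its dot product with `context` (an assumed,
    simplified scoring function), the scores are normalised with softmax,
    and the vectors are summed using those weights.
    """
    scores = [sum(v_i * c_i for v_i, c_i in zip(v, context)) for v in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    pooled = [sum(w * v[d] for w, v in zip(weights, vectors)) for d in range(dim)]
    return pooled, weights


def fuse(probs_a, probs_b, alpha=0.5):
    """Late fusion of two class distributions by weighted average.

    The weighted average is an assumption; the abstract only says the two
    classification results are fused.
    """
    return [alpha * a + (1 - alpha) * b for a, b in zip(probs_a, probs_b)]


# Hypothetical logits from the attention branch and the CNN branch.
han_logits = [2.0, 0.5, 0.1]
cnn_logits = [1.0, 1.5, 0.2]

fused = fuse(softmax(han_logits), softmax(cnn_logits))
predicted = max(range(len(fused)), key=fused.__getitem__)

# Pooling two toy word vectors with a zero context gives equal weights.
pooled, weights = attention_pool([[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0])
```

Because each branch emits a proper probability distribution, any convex combination of the two is again a distribution, so the fused vector can be read directly as class probabilities.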

Cited By

  • (2023) Sensitive Information Identification Method of Power System Based on Deep Learning. 2023 5th International Conference on Communications, Information System and Computer Engineering (CISCE), pp. 170-173. DOI: 10.1109/CISCE58541.2023.10142374. Online publication date: 14-Apr-2023

Published In

ICBDT '22: Proceedings of the 5th International Conference on Big Data Technologies
September 2022
454 pages
ISBN: 9781450396875
DOI: 10.1145/3565291
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. Copyrights for components of this work owned by others than ACM must be honored. Abstracting with credit is permitted. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. Request permissions from [email protected]

Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. Attention
  2. CNN
  3. Result fusion
  4. URL
  5. Web text
  6. XLNet

Qualifiers

  • Research-article
  • Research
  • Refereed limited

Conference

ICBDT 2022

Article Metrics

  • Downloads (last 12 months): 10
  • Downloads (last 6 weeks): 0
Reflects downloads up to 20 Nov 2024.
