Abstract
The growing volume of textual data and the widespread adoption of natural language processing (NLP) techniques make safeguarding sensitive private information a critical challenge. There is therefore a pressing need for robust and accurate NLP-based techniques that detect sensitive information in textual data efficiently. This paper addresses the detection and classification of sensitive private information in textual documents by proposing a novel algorithm named Privacy BERT-LSTM. The algorithm employs BERT to obtain contextual embeddings and an LSTM to process sequential information. BERT's bidirectional encoding captures the nuances and meaning of the text, while the LSTM captures long-range dependencies. In addition, an attention mechanism highlights the important regions of a document, contributing to efficient sensitive information detection. Performance is evaluated on the SMS Spam Collection dataset using standard performance metrics and compared with several state-of-the-art techniques, namely CASSED, PRIVAFRAME, CNN-LSTM, Conv-FFD, GCSA, TSIIP, and C-PIIM. The experimental results show that Privacy BERT-LSTM identifies various types of sensitive information with an accuracy of 92.50%, an F1-score of 85.02%, and a precision of 89.36%, outperforming the baseline models.
Therefore, this research contributes to the advancement of privacy protection in NLP applications and opens avenues for future investigations in the domain of sensitive information detection. Additionally, the proposed algorithm provides valuable insights for researchers and practitioners working on privacy-sensitive NLP tasks.
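The attention step described in the abstract can be illustrated with a minimal sketch. The abstract does not state which attention formulation the authors use, so the dot-product scoring against a single query vector below is an assumption, as are the names `softmax`, `attention_pool`, `hidden_states`, and `query`. In a full model the hidden states would come from an LSTM run over BERT embeddings; here they are small hand-written vectors chosen only to make the arithmetic visible.

```python
import math


def softmax(scores):
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_pool(hidden_states, query):
    """Weight each time step by its dot product with `query` (an assumed
    scoring rule) and return the weights plus the weighted-sum context."""
    scores = [sum(h_d * q_d for h_d, q_d in zip(h, query)) for h in hidden_states]
    weights = softmax(scores)
    dim = len(hidden_states[0])
    context = [
        sum(w * h[d] for w, h in zip(weights, hidden_states)) for d in range(dim)
    ]
    return weights, context


# Toy stand-ins for per-token LSTM outputs (hypothetical values).
hidden_states = [[0.1, 0.0], [0.2, 0.1], [2.0, 1.5], [0.0, 0.1]]
query = [1.0, 1.0]  # would be a learned parameter in a trained model

weights, context = attention_pool(hidden_states, query)
print(weights)
print(context)
```

The third time step has the largest dot product with the query, so it receives the largest weight and dominates the context vector. This is the sense in which an attention layer can emphasize the regions of a document most relevant to a sensitive-information classifier.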
Data availability
The data that support the findings of this study are available from the corresponding author upon reasonable request.
References
Ohata EF, Mattos CLC, Gomes SL, Rebouças EDS, Rego PAL (2022) A text classification methodology to assist a large technical support system. IEEE Access 10:108413–108421
Hassan F, Sánchez D, Domingo-Ferrer J (2021) Utility-preserving privacy protection of textual documents via word embeddings. IEEE Trans Knowl Data Eng 35(1):1058–1071
Lynn HM, Kim P, Pan SB (2021) Data independent acquisition based bi-directional deep networks for biometric ECG authentication. Appl Sci 11(3):1125
Khan AR, Yasin A, Usman SM, Hussain S, Khalid S, Ullah SS (2022) Exploring lightweight deep learning solution for malware detection in IoT constraint environment. Electronics 11(24):4147
Gambarelli G, Gangemi A (2022) PRIVAFRAME: a frame-based knowledge graph for sensitive personal data. Big Data Cognit Comput 6(3):90
Zhao M, Fu X, Zhang Y, Meng L, Tang B (2022) Highly imbalanced fault diagnosis of mechanical systems based on wavelet packet distortion and convolutional neural networks. Adv Eng Inform 51:101535
Zhao X, Zhu X, Liu J, Hu Y, Gao T, Zhao L, Yao J, Liu Z (2024) Model-assisted multi-source fusion hypergraph convolutional neural networks for intelligent few-shot fault diagnosis to electro-hydrostatic actuator. Inf Fus 104:102186
Zhao X, Yao J, Deng W, Jia M, Liu Z (2022) Normalized conditional variational auto-encoder with adaptive focal loss for imbalanced fault diagnosis of bearing-rotor system. Mech Syst Signal Process 170:108826
Zhu X, Zhao X, Yao J, Deng W, Shao H, Liu Z (2023) Adaptive multiscale convolution manifold embedding networks for intelligent fault diagnosis of servo motor-cylindrical rolling bearing under variable working conditions. IEEE/ASME Trans Mechatron
Aubaid AM, Mishra A (2020) A rule-based approach to embedding techniques for text document classification. Appl Sci 10(11):4009
Huo L, Jiang J (2023) Research on intelligent perception algorithm for sensitive information. Appl Sci 13(6):3383
Zhang K, Jiang X (2023) Sensitive data detection with high-throughput machine learning models in electrical health records. arXiv preprint arXiv:2305.03169
García M, Maldonado S, Vairetti C (2021) Efficient n-gram construction for text categorization using feature selection techniques. Intell Data Anal 25(3):509–525
Barve Y, Saini JR, Pal K, Kotecha K (2022) A novel evolving sentimental bag-of-words approach for feature extraction to detect misinformation. Int J Adv Comput Sci Appl 13(4):266–275
Wang Z, Wang D, Li Q (2021) Keyword extraction from scientific research projects based on SRP-TF-IDF. Chin J Electron 30(4):652–657
Luo X (2021) Efficient English text classification using selected machine learning techniques. Alex Eng J 60(3):3401–3409
Kulkarni P, Cauvery NK (2021) Personally identifiable information (PII) detection in the unstructured large text corpus using natural language processing and unsupervised learning technique. Int J Adv Comput Sci Appl 12(9):508–517
Liu Y, Yang CY, Yang J (2021) A graph convolutional network-based sensitive information detection algorithm. Complexity 2021:1–8
Roslan NIM, Foozy CFM (2022) A comparison of sensitive information detection framework using LSTM and RNN techniques. J Soft Comput Data Min 3(2):92–103
Victor N, Lopez D (2020) Sl-LSTM: a Bi-directional LSTM with stochastic gradient descent optimization for sequence labeling tasks in big data. Int J Grid High Perform Comput (IJGHPC) 12(3):1–16
García-Pablos A, Perez N, Cuadros M (2020) Sensitive data detection and classification in Spanish clinical text: experiments with BERT. arXiv preprint arXiv:2003.03106
Guo Y, Liu J, Tang W, Huang C (2021) Exsense: extract sensitive information from unstructured data. Comput Secur 102:102156
Qasim R, Bangyal WH, Alqarni MA, Ali Almazroi A (2022) A fine-tuned BERT-based transfer learning approach for text classification. J Healthc Eng 2022:1–17
Devlin J, Chang MW, Lee K, Toutanova K (2018) BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805
Yuan Y, Lin L, Huo LZ, Kong YL, Zhou ZG, Wu B, Jia Y (2020) Using an attention-based LSTM encoder–decoder network for near real-time disturbance detection. IEEE J Sel Top Appl Earth Obs Remote Sens 13:1819–1832
Deng J, Cheng L, Wang Z (2021) Attention-based BiLSTM fused CNN with gating mechanism model for Chinese long text classification. Comput Speech Lang 68:101182
Almeida T, Hidalgo J (2012) SMS spam collection. UCI Mach Learn Repos. https://doi.org/10.24432/C5CC84
Kužina V, Petric AM, Barišić M, Jović A (2023) CASSED: context-based approach for structured sensitive data detection. Expert Syst Appl 223:119924
Butt UA, Amin R, Aldabbas H, Mohan S, Alouffi B, Ahmadian A (2023) Cloud-based email phishing attack using machine and deep learning algorithm. Complex Intell Syst 9(3):3043–3070
Zhang Q, Guo Z, Zhu Y, Vijayakumar P, Castiglione A, Gupta BB (2023) A deep learning-based fast fake news detection model for cyber-physical social services. Pattern Recogn Lett 168:31–38
Funding
Not applicable.
Ethics declarations
Conflicts of interest
Not applicable.
Consent to participate
Not applicable.
Consent for publication
Not applicable.
Human and animal rights
This article does not contain any studies with human or animal subjects performed by any of the authors.
Informed consent
Informed consent was obtained from all individual participants included in the study.
Additional information
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Muralitharan, J., Arumugam, C. Privacy BERT-LSTM: a novel NLP algorithm for sensitive information detection in textual documents. Neural Comput & Applic 36, 15439–15454 (2024). https://doi.org/10.1007/s00521-024-09707-w
DOI: https://doi.org/10.1007/s00521-024-09707-w