Abstract
Unstructured data contains a wealth of Personal Identifiable Information (PII), represented in both standard forms (e.g. an email address) and unconventional forms (e.g. slang or emojis). Common examples arise in social media posts and, more recently, in mesh data scenarios where multiple distributed data owners might share fragments of their data to enable operations such as service compositions. Correctly identifying PII occurrences in such shared data is crucial to ensuring that the participating entities adhere to the legal and ethical requirements of privacy legislation. In this paper, we present a toolbox of transformer models and comparatively assess their effectiveness and efficiency in discovering PII in unstructured data. To this end, we evaluated state-of-the-art transformer models such as BERT, RoBERTa and XLNet using datasets containing up to 4.19 million tokens. Our results indicate that XLNet yields a lower false negative rate (0.059) than BERT (0.074) and RoBERTa (0.067). Comparative fine-tuning and evaluation among nine transformer models shows that the BERT models achieve a macro F1-score of 0.92, a 5% performance improvement over previous work.
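The two metrics reported above can be illustrated with a minimal sketch. It assumes token-level evaluation with one label per token and a non-PII label named "O"; the label names, the toy data and the function names are hypothetical, and the paper's exact evaluation protocol is not stated in this abstract.

```python
from collections import Counter


def macro_f1(gold, pred):
    """Unweighted mean of per-label F1 over every label seen in gold or pred."""
    labels = sorted(set(gold) | set(pred))
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p where it was not the gold label
            fn[g] += 1  # gold label g was missed
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def false_negative_rate(gold, pred, negative="O"):
    """Share of gold PII tokens that were labelled as non-PII (assumed definition)."""
    positives = [(g, p) for g, p in zip(gold, pred) if g != negative]
    if not positives:
        return 0.0
    missed = sum(1 for _, p in positives if p == negative)
    return missed / len(positives)


# Toy example with hypothetical labels: one NAME token is missed.
gold = ["O", "EMAIL", "O", "NAME", "NAME", "O"]
pred = ["O", "EMAIL", "O", "NAME", "O", "O"]

print(f"macro F1: {macro_f1(gold, pred):.3f}")
print(f"false negative rate: {false_negative_rate(gold, pred):.3f}")
```

A low false negative rate matters for PII discovery because every missed token is personal data that leaves the owner unredacted, whereas macro F1 weights rare PII categories equally with frequent ones.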
Copyright information
© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Shahriar, M.H., Kayem, A.V.D.M., Reich, D., Meinel, C. (2024). Identifying Personal Identifiable Information (PII) in Unstructured Text: A Comparative Study on Transformers. In: Strauss, C., Amagasa, T., Manco, G., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2024. Lecture Notes in Computer Science, vol 14911. Springer, Cham. https://doi.org/10.1007/978-3-031-68312-1_14
DOI: https://doi.org/10.1007/978-3-031-68312-1_14
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-68311-4
Online ISBN: 978-3-031-68312-1
eBook Packages: Computer Science, Computer Science (R0)