Identifying Personal Identifiable Information (PII) in Unstructured Text: A Comparative Study on Transformers

Md Hasan Shahriar¹³,
Anne V. D. M. Kayem ORCID: orcid.org/0000-0002-6587-5313¹⁵,
David Reich¹³ &
Christoph Meinel¹⁴

Part of the book series: Lecture Notes in Computer Science (LNCS, volume 14911)

Included in the following conference series:

International Conference on Database and Expert Systems Applications

Abstract

Unstructured data contains a wide range of Personal Identifiable Information (PII), represented in both standard forms (e.g. an email address) and unconventional forms (e.g. slang or emojis). Common examples arise in social media posts but also, more recently, in mesh data scenarios where multiple distributed data owners might share fragments of their data to enable operations such as service composition. Correctly identifying PII occurrences in such shared data is crucial to ensuring that the participating entities adhere to the legal and ethical requirements of privacy legislation. In this paper, we present a toolbox of transformer models and comparatively assess the effectiveness and efficiency of such models in discovering PII in unstructured data. To this end, we evaluated state-of-the-art transformer models such as BERT, RoBERTa and XLNet using datasets containing up to 4.19 million tokens. Our results indicate that XLNet yields a lower false negative rate (0.059) than BERT (0.074) and RoBERTa (0.067). Comparative fine-tuning and evaluation of nine transformer models shows that the BERT models achieve a macro F1-score of 0.92, a 5% performance improvement over previous work.
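
The two metrics quoted in the abstract, false negative rate and macro F1-score, can be illustrated on toy data. The following is a minimal sketch and not the authors' evaluation code. It assumes token-level labels, treats "O" as the non-PII label, and defines the false negative rate as the share of gold PII tokens that were predicted as non-PII; the chapter itself may define the rate differently. The label names (EMAIL, NAME) and the example sequences are hypothetical.

```python
from collections import Counter


def per_class_counts(gold, pred):
    """Count true positives, false positives and false negatives per label."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1  # predicted label p, but the gold label differs
            fn[g] += 1  # gold label g was not recovered
    return tp, fp, fn


def macro_f1(gold, pred):
    """Unweighted mean of the per-label F1 scores."""
    tp, fp, fn = per_class_counts(gold, pred)
    labels = sorted(set(gold) | set(pred))
    scores = []
    for label in labels:
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)


def false_negative_rate(gold, pred, negative_label="O"):
    """Share of gold PII tokens predicted as non-PII (assumed definition)."""
    pii_total = sum(1 for g in gold if g != negative_label)
    missed = sum(
        1
        for g, p in zip(gold, pred)
        if g != negative_label and p == negative_label
    )
    return missed / pii_total if pii_total else 0.0


# Hypothetical token-level labels for a six-token sentence.
gold = ["O", "EMAIL", "O", "NAME", "NAME", "O"]
pred = ["O", "EMAIL", "O", "O", "NAME", "NAME"]

print("false negative rate:", false_negative_rate(gold, pred))
print("macro F1:", macro_f1(gold, pred))
```

Macro averaging gives every label the same weight regardless of frequency, so a rare PII class that is handled poorly lowers the score as much as a frequent one would. A low false negative rate matters separately because every missed PII token is a potential privacy leak.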

Author information

Authors and Affiliations

Institute of Computer Science, University of Potsdam, Potsdam, Germany
Md Hasan Shahriar & David Reich
Hasso-Plattner-Institute for Digital Engineering, University of Potsdam, Potsdam, Germany
Christoph Meinel
Department of Computer Science, University of Exeter, Exeter, UK
Anne V. D. M. Kayem

Corresponding author

Correspondence to Anne V. D. M. Kayem.

Editor information

Editors and Affiliations

University of Vienna, Vienna, Austria
Christine Strauss
University of Tsukuba, Tsukuba, Japan
Toshiyuki Amagasa
National Research Council (CNR), Rende, Italy
Giuseppe Manco
Johannes Kepler University Linz, Linz, Austria
Gabriele Kotsis
Vienna University of Technology, Vienna, Austria
A Min Tjoa
Johannes Kepler University Linz, Linz, Austria
Ismail Khalil

About this paper

Cite this paper

Shahriar, M.H., Kayem, A.V.D.M., Reich, D., Meinel, C. (2024). Identifying Personal Identifiable Information (PII) in Unstructured Text: A Comparative Study on Transformers. In: Strauss, C., Amagasa, T., Manco, G., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2024. Lecture Notes in Computer Science, vol 14911. Springer, Cham. https://doi.org/10.1007/978-3-031-68312-1_14

DOI: https://doi.org/10.1007/978-3-031-68312-1_14
Published: 17 August 2024
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-68311-4
Online ISBN: 978-3-031-68312-1
eBook Packages: Computer Science, Computer Science (R0)
