Identifying Personal Identifiable Information (PII) in Unstructured Text: A Comparative Study on Transformers

  • Conference paper
  • Published in: Database and Expert Systems Applications (DEXA 2024)

Abstract

Unstructured data encompasses a plethora of Personal Identifiable Information (PII), represented in both standard forms (e.g., an email address) and unconventional forms (e.g., slang or emojis). Common examples arise in social media posts, but also, more recently, in mesh data scenarios where multiple distributed data owners share fragments of their data to enable operations such as service composition. Correctly identifying PII occurrences in such shared data is crucial to ensuring that the participating entities adhere to the legal and ethical requirements of privacy legislation. In this paper, we present a toolbox of transformer models and comparatively assess their effectiveness and efficiency in discovering PII in unstructured data. To this end, we evaluated state-of-the-art transformer models such as BERT, RoBERTa, and XLNet on datasets containing up to 4.19 million tokens. Our results indicate that XLNet yields a lower false-negative rate (0.059) than BERT (0.074) and RoBERTa (0.067). Comparative fine-tuning and evaluation of nine transformer models shows that the BERT models achieve a 0.92 macro F1-score, a 5% performance improvement over previous work.
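The two headline metrics, macro F1-score and false-negative rate, can be reproduced from token-level predictions. The sketch below is illustrative only, assuming a toy tag set ("EMAIL", "NAME", and "O" for non-PII) and a helper name not taken from the paper:

```python
def macro_f1_and_fnr(y_true, y_pred, labels):
    """Compute macro F1 over the PII labels and the false-negative
    rate on PII tokens (true PII tokens predicted as non-PII "O")."""
    f1s = []
    for lab in labels:
        tp = sum(t == lab and p == lab for t, p in zip(y_true, y_pred))
        fp = sum(t != lab and p == lab for t, p in zip(y_true, y_pred))
        fn = sum(t == lab and p != lab for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    macro_f1 = sum(f1s) / len(f1s)  # unweighted average across classes

    # False-negative rate: share of true PII tokens the model missed
    pii = [(t, p) for t, p in zip(y_true, y_pred) if t != "O"]
    missed = sum(p == "O" for _, p in pii)
    fnr = missed / len(pii) if pii else 0.0
    return macro_f1, fnr

# Toy example: one EMAIL token is missed by the model
truth = ["O", "EMAIL", "EMAIL", "O", "NAME"]
preds = ["O", "EMAIL", "O",     "O", "NAME"]
m, f = macro_f1_and_fnr(truth, preds, ["EMAIL", "NAME"])
# m ≈ 0.8333 (macro F1), f ≈ 0.3333 (false-negative rate)
```

In practice, span-level NER evaluation would aggregate over entity spans rather than individual tokens, but the token-level version above conveys how a lower FNR reflects fewer missed PII occurrences.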



Author information

Corresponding author: Anne V. D. M. Kayem.


Copyright information

© 2024 The Author(s), under exclusive license to Springer Nature Switzerland AG

About this paper


Cite this paper

Shahriar, M.H., Kayem, A.V.D.M., Reich, D., Meinel, C. (2024). Identifying Personal Identifiable Information (PII) in Unstructured Text: A Comparative Study on Transformers. In: Strauss, C., Amagasa, T., Manco, G., Kotsis, G., Tjoa, A.M., Khalil, I. (eds) Database and Expert Systems Applications. DEXA 2024. Lecture Notes in Computer Science, vol 14911. Springer, Cham. https://doi.org/10.1007/978-3-031-68312-1_14

  • DOI: https://doi.org/10.1007/978-3-031-68312-1_14

  • Publisher Name: Springer, Cham

  • Print ISBN: 978-3-031-68311-4

  • Online ISBN: 978-3-031-68312-1

