Abstract
Everyday we interact with online services from companies that ask for our permission to use our personal information. Nowadays it is common practice for websites and apps to collect big amounts of data which are mainly used for revenue optimization based on user analytics. This customer data collection and usage is regulated by legal agreements (i.e., privacy and cookie policies) which we are required to accept (multiple times a day), but which are generally very long and formulated in a way that makes their interpretation difficult for the general public. An average privacy policy takes 15 min to read and includes lots of legal jargon (e.g., including words like “data controller” and “legal basis for processing”). In this research project, we are developing a support system where users can search for concrete answers in the privacy policies of companies or websites, by formulating their questions in natural language. Instead of blindly accepting a privacy policy, a user could first query the system for answers to a potential concern. The system will return a ranked list of phrases and documents matching the query. In case the generated answer is not sufficient for the user, an extension will allow them to forward complex requests to best-matching legal professionals, specialized in privacy legislation, which can process them for a small fee. We present different aspects of the internal implementation, including the identification of relevant spans in unstructured privacy policies and the selection of the best-suited NLP model for this specific task. The initial results of a user evaluation are presented, showing promising directions. Eventually, some future research directions for the extension of the system conclude our contribution.
Access this chapter
Tax calculation will be finalised at checkout
Purchases are for personal use only
Notes
- 1.
- 2.
- 3.
- 4.
- 5.
- 6.
- 7.
Vector Search Engine QDrant, see https://qdrant.tech/.
References
Abela, S.: Data protection and freedom of information. In: Abela, S. (ed.) Leadership and Management in Healthcare, pp. 103–107. Springer, Cham (2023). https://doi.org/10.1007/978-3-031-21025-9_10
Crook, M.: The Caldicott report and patient confidentiality (2003)
Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: BERT: pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805 (2018)
Fabian, B., Ermakova, T., Lentz, T.: Large-scale readability analysis of privacy policies. In: Proceedings of the International Conference on Web Intelligence, pp. 18–25 (2017)
Fleiss, J.L.: Measuring nominal scale agreement among many raters. Psychol. Bull. 76(5), 378 (1971)
Gao, L., Callan, J.: Condenser: a pre-training architecture for dense retrieval. arXiv preprint arXiv:2104.08253 (2021)
Gao, L., Callan, J.: Is your language model ready for dense representation fine-tuning. arXiv preprint arXiv:2104.08253 (2021)
Goddard, M.: The EU general data protection regulation (GDPR): European regulation that has a global impact. Int. J. Mark. Res. 59(6), 703–705 (2017)
Honnibal, M., Montani, I., Van Landeghem, S., Boyd, A., et al.: Spacy: industrial-strength natural language processing in Python (2020)
Khattab, O., Zaharia, M.: ColBERT: efficient and effective passage search via contextualized late interaction over BERT. In: Proceedings of the 43rd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 39–48 (2020)
Kiss, T., Strunk, J.: Unsupervised multilingual sentence boundary detection. Comput. Linguist. 32(4), 485–525 (2006)
Korunovska, J., Kamleitner, B., Spiekermann, S.: The challenges and impact of privacy policy comprehension. arXiv preprint arXiv:2005.08967 (2020)
Leatherman, S., Berwick, D.M.: Accelerating global improvements in health care quality. JAMA 324(24), 2479–2480 (2020)
Liu, Y., Stolcke, A., Shriberg, E., Harper, M.: Using conditional random fields for sentence boundary detection in speech. In: Proceedings of the 43rd Annual Meeting of the Association for Computational Linguistics (ACL 2005), pp. 451–458 (2005)
Mazzola, L., Waldis, A., Shankar, A., Argyris, D., Denzler, A., Van Roey, M.: Privacy and customer’s education: NLP for information resources suggestions and expert finder systems. In: Moallem, A. (ed.) HCII 2022. LNCS, vol. 13333, pp. 62–77. Springer, Cham (2022). https://doi.org/10.1007/978-3-031-05563-8_5
Mikolov, T., Sutskever, I., Chen, K., Corrado, G.S., Dean, J.: Distributed representations of words and phrases and their compositionality. In: Advances in Neural Information Processing Systems, vol. 26 (2013)
Nogueira, R., Cho, K.: Passage re-ranking with BERT. arXiv preprint arXiv:1901.04085 (2019)
Peters, S., Verhagen, H.: An evaluation of the nutri-score system along the reasoning for scientific substantiation of health claims in the EU—a narrative review. Foods 11(16), 2426 (2022)
Qi, P., Zhang, Y., Zhang, Y., Bolton, J., Manning, C.D.: Stanza: a Python natural language processing toolkit for many human languages. arXiv preprint arXiv:2003.07082 (2020)
Ravichander, A., Black, A.W., Wilson, S., Norton, T., Sadeh, N.: Question answering for privacy policies: combining computational and legal perspectives. In: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), Hong Kong, China, pp. 4949–4959. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/D19-1500. https://www.aclweb.org/anthology/D19-1500
Robertson, S.E., Walker, S., Jones, S., Hancock-Beaulieu, M.M., Gatford, M., et al.: Okapi at trec-3. NIST Special Publication Sp 109, 109 (1995)
Sadvilkar, N., Neumann, M.: PySBD: pragmatic sentence boundary disambiguation. arXiv preprint arXiv:2010.09657 (2020)
Sanchez, G.: Sentence boundary detection in legal text. In: Proceedings of the Natural Legal Language Processing Workshop 2019, Minneapolis, Minnesota, pp. 31–38. Association for Computational Linguistics (2019). https://doi.org/10.18653/v1/W19-2204. https://aclanthology.org/W19-2204
Santhanam, K., Khattab, O., Saad-Falcon, J., Potts, C., Zaharia, M.: ColBERTv2: effective and efficient retrieval via lightweight late interaction. arXiv preprint arXiv:2112.01488 (2021)
Savelka, J., Walker, V.R., Grabmair, M., Ashley, K.D.: Sentence boundary detection in adjudicatory decisions in the United States. Traitement automatique des langues 58, 21 (2017)
Sharma, P., Li, Y.: Self-supervised contextual keyword and keyphrase retrieval with self-labelling (2019). https://www.preprints.org/manuscript/201908.0073/v1
Sivan-Sevilla, I.: Varieties of enforcement strategies post-GDPR: a fuzzy-set qualitative comparative analysis (FSQCA) across data protection authorities. J. Eur. Public Policy 1–34 (2022)
Subrahmanya, S.V.G., et al.: The role of data science in healthcare advancements: applications, benefits, and future prospects. Irish J. Med. Sci. (1971-) 191(4), 1473–1483 (2022)
Tikkinen-Piri, C., Rohunen, A., Markkula, J.: EU general data protection regulation: changes and implications for personal data collecting companies. Comput. Law Secur. Rev. 34(1), 134–153 (2018)
Tkachenko, M., Malyuk, M., Holmanyuk, A., Liubimov, N.: Label Studio: Data labeling software (2020–2022). Open source software https://github.com/heartexlabs/label-studio
Trotman, A., Puurula, A., Burgess, B.: Improvements to BM25 and language models examined. In: Proceedings of the 2014 Australasian Document Computing Symposium, pp. 58–65 (2014)
Vail, M.W., Earp, J.B., Antón, A.I.: An empirical study of consumer perceptions and comprehension of web site privacy policies. IEEE Trans. Eng. Manag. 55(3), 442–454 (2008)
Vanberg, A.D.: Informational privacy post GDPR-end of the road or the start of a long journey? Int. J. Hum. Rights 25(1), 52–78 (2021)
Xu, C., Guo, D., Duan, N., McAuley, J.: LaPraDoR: unsupervised pretrained dense retriever for zero-shot text retrieval. arXiv preprint arXiv:2203.06169 (2022)
Acknowledgements
The research leading to this work was partially financed by Innosuisse - Swiss federal agency for Innovation, through a competitive call. The project 50446.1 IP-ICT is called P2Sr Profila Privacy Simplified reloaded: Open-smart knowledge base on Swiss privacy policies and Swiss privacy legislation, simplifying consumers’ access to legal knowledge and expertise (https://www.aramis.admin.ch/Grunddaten/?ProjectID=48867). The authors would like to thank all the people involved on the implementation side at Profila GmbH (https://www.profila.com/) for all the constructive and fruitful discussions and insights provided about privacy regulations and consumers’ rights.
Author information
Authors and Affiliations
Corresponding author
Editor information
Editors and Affiliations
Appendix A - SBD and Q2D Graphs
Appendix A - SBD and Q2D Graphs
In this appendix, we provide the reader with the graphical representations of the data from Table 1 and from Table 2. Effectiveness of nltk is demonstrated with a good F1 measure and a very limited runtime.
BM25+, a relatively simple and sparse IDF-based model, practically outperforms other approaches when considering accuracy and runtime.
Rights and permissions
Copyright information
© 2023 The Author(s), under exclusive license to Springer Nature Switzerland AG
About this paper
Cite this paper
Mazzola, L. et al. (2023). A Question Answering Tool for Website Privacy Policy Comprehension. In: Moallem, A. (eds) HCI for Cybersecurity, Privacy and Trust. HCII 2023. Lecture Notes in Computer Science, vol 14045. Springer, Cham. https://doi.org/10.1007/978-3-031-35822-7_14
Download citation
DOI: https://doi.org/10.1007/978-3-031-35822-7_14
Published:
Publisher Name: Springer, Cham
Print ISBN: 978-3-031-35821-0
Online ISBN: 978-3-031-35822-7
eBook Packages: Computer ScienceComputer Science (R0)