Understanding Website Privacy Policies—A Longitudinal Analysis Using Natural Language Processing
Figure 1. Distribution of top-level domains in the dataset (.com excluded).
Figure 2. Distribution of policy snapshots per interval and category. Each bar represents an interval (two intervals per year).
Figure 3. Data extraction (by Amos et al. [7]) and further preprocessing (this study).
Figure 4. Distribution of vagueness categories.
Figure 5. Sentence count of privacy policies over time. The shaded area indicates the data between the 25th and 75th percentiles.
Figure 6. Median sentence count of privacy policies by Alexa ranking (left) and by GDPR content (right).
Figure 7. Median sentence count of privacy policies by top-level domain (left) and by website category (right).
Figure 8. Average passive voice index of privacy policies. The shaded area indicates the interquartile range.
Figure 9. Distribution of the Dale–Chall reading grade levels of privacy policies.
Figure 10. Median Dale–Chall readability score of privacy policies by GDPR content (left) and by top-level domain (right).
Figure 11. Median percentage of “vague” sentences in privacy policies.
Figure 12. Median percentage of “vague” and “clear” sentences in privacy policies.
Figure 13. Median proportion of sentences classified as “clear” (left) and as “vague” (right) by Alexa rank.
Figure 14. Median proportion of sentences classified as “clear” (left) and as “vague” (right) by website category.
Figure 15. Median proportion of sentences classified as “clear” (left) and as “vague” (right) by top-level domain.
Figure 16. Median proportion of sentences classified as “clear” (left) and as “vague” (right) by GDPR content.
Figure 17. The proportion of policies containing one or more pacifying phrases.
Figure 18. Proportion of policies containing one or more pacifying phrases by Alexa ranking (left) and by GDPR content (right).
Figure 19. Proportion of policies containing one or more pacifying phrases by top-level domain (left) and by website category (right).
Figure A1. Website count per category. If no category information is available, the “uncategorized” bin applies.
Figure A2. Usage of phrases specific to the GDPR (selection based on Amos et al. [7]).
Figure A3. Distribution of GDPR and non-GDPR policies per interval.
Figure A4. Sentence count distribution: original (left) and after cleaning (right).
Figure A5. Dale–Chall score distribution: original (left) and after cleaning (right).
Figure A6. Vagueness score distribution.
Figure A7. BERT performance evaluation: confusion matrix (stronger color indicates higher numbers).
Figure A8. BERT performance evaluation: ROC curves.
Abstract
1. Introduction
- Reading difficulty in terms of readability test and text statistics;
- Privacy policy ambiguity measured by the usage of vague language and statements;
- The use of positive phrasing concerning privacy.
2. Regulatory Background
2.1. Data Protection in the European Union
- The data retention period;
- The data subject’s rights;
- Information about the automated decision-making system, if used;
- The data controller’s identity and contact details;
- The data protection officer’s contact details, where applicable;
- All the purposes of and the legal basis for data processing;
- Recipients of the personal data;
- Whether the organization intends to transfer the data to a third country or international organization.
2.2. “Notice and Choice” in the United States
3. Automated Privacy Policy Research
3.1. Privacy Policy Datasets
3.2. Classification and Information Extraction
3.3. Privacy Policy Applications for Enhancing Users’ Comprehension
3.4. Regulatory Impact
3.5. Comprehensibility of Privacy Policies
3.6. Mobile Applications
4. Data and Methods
4.1. The Princeton-Leuven Longitudinal Corpus of Privacy Policies
4.2. Measuring Readability of Privacy Policies
4.3. Measuring Ambiguity in Privacy Policies
4.3.1. Taxonomy of Vague Terms
4.3.2. Language Model for Vagueness Prediction
- → clear
- → somewhat clear
- → vague
- → extremely vague
4.4. Unveiling Positive Framing in Privacy Policies
5. Results
5.1. Readability
5.2. Ambiguity
5.3. Positive Phrases
6. Discussion
7. Conclusions
Author Contributions
Funding
Data Availability Statement
Conflicts of Interest
Appendix A. Website Categories and GDPR Content
Appendix B. Outlier Analysis
Appendix B.1. Policy Length
Appendix B.2. Readability Score
Appendix B.3. Vague Sentences
Appendix C. Model Performance
Appendix D. Positive Phrasing
we value | transparency | protect your (personal) information |
we respect | trust us | protect your (personal) data |
we promise | safe and secure | protect your privacy |
care about | committed to protecting | provides protection |
with care | committed to safeguarding | serious about your privacy |
responsibly | committed to respecting | takes your security (very) seriously |
important to us | respect your privacy | takes your privacy (very) seriously |
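The phrase list above lends itself to simple pattern matching. The following is a minimal Python sketch of how the share of policies containing one or more pacifying phrases could be computed; it is an illustration under stated assumptions, not the matching procedure used in the study. The function names are hypothetical, words in parentheses are treated as optional, and matching is case-insensitive on whole words.

```python
import re

# Phrases copied from the Appendix D table; "(word)" marks an optional word.
PACIFYING_PHRASES = [
    "we value", "transparency", "protect your (personal) information",
    "we respect", "trust us", "protect your (personal) data",
    "we promise", "safe and secure", "protect your privacy",
    "care about", "committed to protecting", "provides protection",
    "with care", "committed to safeguarding", "serious about your privacy",
    "responsibly", "committed to respecting",
    "takes your security (very) seriously",
    "important to us", "respect your privacy",
    "takes your privacy (very) seriously",
]


def phrase_to_regex(phrase: str) -> re.Pattern:
    """Turn a phrase into a whole-word, case-insensitive regex (hypothetical helper)."""
    parts = []
    for token in phrase.split():
        if token.startswith("(") and token.endswith(")"):
            # Optional word, together with the whitespace that follows it.
            parts.append(rf"(?:{re.escape(token[1:-1])}\s+)?")
        else:
            parts.append(re.escape(token) + r"\s+")
    # Every phrase ends in a mandatory word, so drop its trailing "\s+".
    pattern = "".join(parts)[: -len(r"\s+")]
    return re.compile(rf"\b{pattern}\b", re.IGNORECASE)


PATTERNS = [phrase_to_regex(p) for p in PACIFYING_PHRASES]


def contains_pacifying_phrase(text: str) -> bool:
    """Return True if the text contains at least one listed phrase."""
    return any(p.search(text) for p in PATTERNS)


def share_with_pacifying_phrases(policies: list[str]) -> float:
    """Proportion of policies that contain one or more listed phrases."""
    if not policies:
        raise ValueError("need at least one policy")
    return sum(contains_pacifying_phrase(p) for p in policies) / len(policies)
```

Applied to one interval's worth of policy texts, `share_with_pacifying_phrases` yields a single proportion of the kind plotted per interval in the figures on pacifying phrases.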
References
- Meier, Y.; Schäwel, J.; Krämer, N. The Shorter the Better? Effects of Privacy Policy Length on Online Privacy Decision-Making. Media Commun. 2020, 8, 291. [Google Scholar] [CrossRef]
- Ibdah, D.; Lachtar, N.; Raparthi, S.M.; Bacha, A. “Why Should I Read the Privacy Policy, I Just Need the Service”: A Study on Attitudes and Perceptions Toward Privacy Policies. IEEE Access 2021, 9, 166465–166487. [Google Scholar] [CrossRef]
- Ermakova, T.; Krasnova, H.; Fabian, B. Exploring the Impact of Readability of Privacy Policies on Users’ Trust. In Proceedings of the 24th European Conference on Information Systems (ECIS 2016), Istanbul, Turkey, 12–15 June 2016. [Google Scholar]
- Wagner, I. Privacy Policies Across the Ages: Content and Readability of Privacy Policies 1996–2021. Technical Report. arXiv 2022, arXiv:2201.08739. [Google Scholar]
- Article 29 Working Party: Guidelines on Transparency under Regulation 2016/679. Available online: https://ec.europa.eu/newsroom/article29/items/622227/en (accessed on 15 November 2023).
- Reidenberg, J.R.; Bhatia, J.; Breaux, T.; Norton, T. Ambiguity in Privacy Policies and the Impact of Regulation; SSRN Scholarly; Social Science Research Network: Rochester, NY, USA, 2016. [Google Scholar] [CrossRef]
- Amos, R.; Acar, G.; Lucherini, E.; Kshirsagar, M.; Narayanan, A.; Mayer, J. Privacy Policies over Time: Curation and Analysis of a Million-Document Dataset. In Proceedings of the Web Conference 2021, Ljubljana, Slovenia, 19–23 April 2021; pp. 2165–2176. [Google Scholar] [CrossRef]
- Bhatia, J.; Breaux, T.D.; Reidenberg, J.R.; Norton, T.B. A Theory of Vagueness and Privacy Risk Perception. In Proceedings of the 2016 IEEE 24th International Requirements Engineering Conference (RE), Beijing, China, 12–16 September 2016; pp. 26–35, ISSN 2332-6441. [Google Scholar] [CrossRef]
- Fabian, B.; Ermakova, T.; Lentz, T. Large-scale readability analysis of privacy policies. In Proceedings of the International Conference on Web Intelligence (WI ’17), Leipzig, Germany, 23–26 August 2017; Association for Computing Machinery: New York, NY, USA, 2017; pp. 18–25. [Google Scholar] [CrossRef]
- Ermakova, T.; Fabian, B.; Babina, E. Readability of Privacy Policies of Healthcare Websites. In Proceedings of the 12th International Conference on Wirtschaftsinformatik, Osnabrück, Germany, 4–6 March 2015. [Google Scholar]
- Kaur, J.; Dara, R.A.; Obimbo, C.; Song, F.; Menard, K. A comprehensive keyword analysis of online privacy policies. Inf. Secur. J. Glob. Perspect. 2018, 27, 260–275. [Google Scholar] [CrossRef]
- Srinath, M.; Sundareswara, S.N.; Giles, C.L.; Wilson, S. PrivaSeer: A Privacy Policy Search Engine. In Proceedings of the Web Engineering; Brambilla, M., Chbeir, R., Frasincar, F., Manolescu, I., Eds.; Lecture Notes in Computer Science; Springer International Publishing: Cham, Switzerland, 2021; pp. 286–301. [Google Scholar] [CrossRef]
- Libert, T.; Desai, A.; Patel, D. Preserving Needles in the Haystack: A Search Engine and Multi-Jurisdictional Forensic Documentation System for Privacy Violations on the Web. 2021. Available online: https://timlibert.me/pdf/Libert_et_al-2021-Forensic_Privacy_on_Web.pdf (accessed on 15 November 2023).
- Lebanoff, L.; Liu, F. Automatic Detection of Vague Words and Sentences in Privacy Policies. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, Brussels, Belgium, 31 October–4 November 2018; pp. 3508–3517. [Google Scholar] [CrossRef]
- Directive 95/46/EC of the European Parliament and of the Council of 24 October 1995 on the Protection of Individuals with Regard to the Processing of Personal Data and on the Free Movement of Such Data. Off. J. L 1995, 281, 0031–0050. [Google Scholar]
- Robinson, N.; Graux, H.; Botterman, M.; Valeri, L. Review of the European Data Protection Directive; Technical report; RAND Corporation: Cambridge, UK, 2009. [Google Scholar]
- GDPR-Personal Data. Available online: https://gdpr-info.eu/issues/personal-data/ (accessed on 5 August 2023).
- Federal Trade Commission. Privacy Online: A Report to Congress. Federal Trade Commission, 1998. Available online: https://www.ftc.gov/sites/default/files/documents/reports/privacy-online-report-congress/priv-23a.pdf (accessed on 5 August 2023).
- Usable Privacy Policy Project. Available online: https://usableprivacy.org/ (accessed on 18 June 2023).
- Wilson, S.; Schaub, F.; Dara, A.A.; Liu, F.; Cherivirala, S.; Giovanni Leon, P.; Schaarup Andersen, M.; Zimmeck, S.; Sathyendra, K.M.; Russell, N.C.; et al. The Creation and Analysis of a Website Privacy Policy Corpus. In Proceedings of the 54th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Berlin, Germany, 7–12 August 2016; Association for Computational Linguistics: Berlin, Germany, 2016; pp. 1330–1340. [Google Scholar] [CrossRef]
- Bannihatti Kumar, V.; Iyengar, R.; Nisal, N.; Feng, Y.; Habib, H.; Story, P.; Cherivirala, S.; Hagan, M.; Cranor, L.; Wilson, S.; et al. Finding a Choice in a Haystack: Automatic Extraction of Opt-Out Statements from Privacy Policy Text. In Proceedings of the Web Conference 2020, Virtual, 20–24 April 2020; ACM: Taipei, Taiwan, 2020; pp. 1943–1954. [Google Scholar] [CrossRef]
- Ahmad, W.U.; Chi, J.; Le, T.; Norton, T.; Tian, Y.; Chang, K.W. Intent Classification and Slot Filling for Privacy Policies. arXiv 2021, arXiv:2101.00123. [Google Scholar]
- Nokhbeh Zaeem, R.; Barber, K.S. A Large Publicly Available Corpus of Website Privacy Policies Based on DMOZ. In Proceedings of the Eleventh ACM Conference on Data and Application Security and Privacy, Virtual, 26–28 April 2021; ACM: New York, NY, USA, 2021; pp. 143–148. [Google Scholar] [CrossRef]
- Audich, D.; Dara, R.; Nonnecke, B. Privacy Policy Annotation for Semi-Automated Analysis: A Cost-Effective Approach. In Trust Management XII. IFIPTM 2018. IFIP Advances in Information and Communication Technology; Springer: Cham, Switzerland; Toronto, ON, Canada, 2018; pp. 29–44. [Google Scholar] [CrossRef]
- Kumar, V.B.; Ravichander, A.; Story, P.; Sadeh, N. Quantifying the Effect of In-Domain Distributed Word Representations: A Study of Privacy Policies. In AAAI Spring Symposium on Privacy-Enhancing Artificial Intelligence and Language Technologies. 2019. Available online: https://usableprivacy.org/static/files/kumar_pal_2019.pdf (accessed on 18 June 2023).
- Liu, F.; Wilson, S.; Story, P.; Zimmeck, S.; Sadeh, N. Towards Automatic Classification of Privacy Policy Text. Technical Report, CMU-ISR-17-118R, Institute for Software Research and Language Technologies Institute, School of Computer Science, Carnegie Mellon University, 2018. Available online: http://reports-archive.adm.cs.cmu.edu/anon/isr2017/CMU-ISR-17-118R.pdf (accessed on 18 June 2023).
- Mousavi, N.; Jabat, P.; Nedelchev, R.; Scerri, S.; Graux, D. Establishing a Strong Baseline for Privacy Policy Classification. In Proceedings of the IFIP International Conference on ICT Systems Security and Privacy Protection, Maribor, Slovenia, 21–23 September 2020. [Google Scholar]
- Mustapha, M.; Krasnashchok, K.; Al Bassit, A.; Skhiri, S. Privacy Policy Classification with XLNet (Short Paper). In Data Privacy Management, Cryptocurrencies and Blockchain Technology; Garcia-Alfaro, J., Navarro-Arribas, G., Herrera-Joancomarti, J., Eds.; Springer International Publishing: Cham, Switzerland, 2020; Volume 12484, pp. 250–257. [Google Scholar] [CrossRef]
- Bui, D.; Shin, K.G.; Choi, J.M.; Shin, J. Automated Extraction and Presentation of Data Practices in Privacy Policies. Proc. Priv. Enhancing Technol. 2021, 2021, 88–110. [Google Scholar] [CrossRef]
- Alabduljabbar, A.; Abusnaina, A.; Meteriz-Yildiran, U.; Mohaisen, D. Automated Privacy Policy Annotation with Information Highlighting Made Practical Using Deep Representations. In Proceedings of the 2021 ACM SIGSAC Conference on Computer and Communications Security, Virtual, 15–19 November 2021; CCS ’21. Association for Computing Machinery: New York, NY, USA, 2021; pp. 2378–2380. [Google Scholar] [CrossRef]
- Alabduljabbar, A.; Abusnaina, A.; Meteriz-Yildiran, U.; Mohaisen, D. TLDR: Deep Learning-Based Automated Privacy Policy Annotation with Key Policy Highlights. In Proceedings of the 20th Workshop on Privacy in the Electronic Society, Virtual, 15 November 2021; ACM: New York, NY, USA, 2021; pp. 103–118. [Google Scholar] [CrossRef]
- Sathyendra, K.M.; Schaub, F.; Wilson, S.; Sadeh, N.M. Automatic Extraction of Opt-Out Choices from Privacy Policies. In AAAI Fall Symposia, 2016, Association for the Advancement of Artificial Intelligence. 2016. Available online: https://api.semanticscholar.org/CorpusID:32896562 (accessed on 18 June 2023).
- Sathyendra, K.M.; Wilson, S.; Schaub, F.; Zimmeck, S.; Sadeh, N. Identifying the Provision of Choices in Privacy Policy Text. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, Copenhagen, Denmark, 7–11 September 2017; Association for Computational Linguistics: Copenhagen, Denmark, 2017; pp. 2774–2779. [Google Scholar] [CrossRef]
- Keymanesh, M.; Elsner, M.; Parthasarathy, S. Toward Domain-Guided Controllable Summarization of Privacy Policies. In Proceedings of the 2020 Natural Legal Language Processing (NLLP) Workshop, Virtual Event/San Diego, CA, USA, 24 August 2020; ACM: New York, NY, USA, 2020; pp. 18–24. [Google Scholar]
- Ravichander, A.; Black, A.W.; Wilson, S.; Norton, T.; Sadeh, N. Question Answering for Privacy Policies: Combining Computational and Legal Perspectives. arXiv 2019, arXiv:1911.00841. [Google Scholar]
- Ahmad, W.U.; Chi, J.; Tian, Y.; Chang, K.W. PolicyQA: A Reading Comprehension Dataset for Privacy Policies. arXiv 2020, arXiv:2010.02557. [Google Scholar]
- Keymanesh, M.; Elsner, M.; Parthasarathy, S. Privacy Policy Question Answering Assistant: A Query-Guided Extractive Summarization Approach. arXiv 2021, arXiv:2109.14638. [Google Scholar]
- Shankar, A.; Waldis, A.; Bless, C.; Andueza Rodriguez, M.; Mazzola, L. PrivacyGLUE: A Benchmark Dataset for General Language Understanding in Privacy Policies. Appl. Sci. 2023, 13, 3701. [Google Scholar] [CrossRef]
- Tesfay, W.B.; Hofmann, P.; Nakamura, T.; Kiyomoto, S.; Serna, J. PrivacyGuide: Towards an Implementation of the EU GDPR on Internet Privacy Policy Evaluation. In Proceedings of the Fourth ACM International Workshop on Security and Privacy Analytics, Tempe, AZ, USA, 19–21 March 2018; IWSPA ’18. Association for Computing Machinery: New York, NY, USA, 2018; pp. 15–21. [Google Scholar] [CrossRef]
- Harkous, H.; Fawaz, K.; Lebret, R.; Schaub, F.; Shin, K.G.; Aberer, K. Polisis: Automated Analysis and Presentation of Privacy Policies Using Deep Learning. In Proceedings of the 27th USENIX Security Symposium, Baltimore, MD, USA, 15–17 August 2018; USENIX Association: Berkeley, CA, USA, 2018; pp. 531–548. [Google Scholar]
- PriBOT. Available online: https://pribot.org/ (accessed on 24 June 2023).
- Zaeem, R.N.; German, R.L.; Barber, K.S. PrivacyCheck: Automatic Summarization of Privacy Policies Using Data Mining. ACM Trans. Internet Technol. 2018, 18, 53:1–53:18. [Google Scholar] [CrossRef]
- Nokhbeh Zaeem, R.; Anya, S.; Issa, A.; Nimergood, J.; Rogers, I.; Shah, V.; Srivastava, A.; Barber, K.S. PrivacyCheck v2: A Tool that Recaps Privacy Policies for You. In Proceedings of the 29th ACM International Conference on Information & Knowledge Management, Virtual, 19–23 October 2020; Association for Computing Machinery: New York, NY, USA, 2020. CIKM ’20. pp. 3441–3444. [Google Scholar] [CrossRef]
- Nokhbeh Zaeem, R.; Ahbab, A.; Bestor, J.; Djadi, H.H.; Kharel, S.; Lai, V.; Wang, N.; Barber, K.S. PrivacyCheck v3: Empowering Users with Higher-Level Understanding of Privacy Policies. In Proceedings of the Fifteenth ACM International Conference on Web Search and Data Mining, Virtual, 21–25 February 2022; WSDM ’22. Association for Computing Machinery: New York, NY, USA, 2022; pp. 1593–1596. [Google Scholar] [CrossRef]
- Privacy Lab|Center for Identity. Available online: https://identity.utexas.edu/privacy-lab (accessed on 24 June 2023).
- Opt-Out Easy. Available online: https://optouteasy.isr.cmu.edu/ (accessed on 24 June 2023).
- Contissa, G.; Docter, K.; Lagioia, F.; Lippi, M.; Micklitz, H.W.; Pałka, P.; Sartor, G.; Torroni, P. Claudette Meets GDPR: Automating the Evaluation of Privacy Policies Using Artificial Intelligence; SSRN Scholarly; Social Science Research Network: Rochester, NY, USA, 2018. [Google Scholar] [CrossRef]
- Liepina, R.; Contissa, G.; Drazewski, K.; Lagioia, F.; Lippi, M.; Micklitz, H.; Palka, P.; Sartor, G.; Torroni, P. GDPR Privacy Policies in CLAUDETTE: Challenges of Omission, Context and Multilingualism. In Proceedings of the Third Workshop on Automated Semantic Analysis of Information in Legal Text (ASAIL 2019), Montreal, QC, Canada, 21 June 2019. [Google Scholar]
- Mousavi, N.; Scerri, S.; Lehmann, J. KnIGHT: Mapping Privacy Policies to GDPR. In Knowledge Engineering and Knowledge Management; Faron Zucker, C., Ghidini, C., Napoli, A., Toussaint, Y., Eds.; Springer International Publishing: Cham, Switzerland, 2018; Volume 11313, pp. 258–272. [Google Scholar]
- Cejas, O.A.; Abualhaija, S.; Torre, D.; Sabetzadeh, M.; Briand, L. AI-enabled Automation for Completeness Checking of Privacy Policies. IEEE Trans. Softw. Eng. 2021, 48, 4647–4674. [Google Scholar] [CrossRef]
- Qamar, A.; Javed, T.; Beg, M.O. Detecting Compliance of Privacy Policies with Data Protection Laws. arXiv 2021, arXiv:2102.12362. [Google Scholar]
- Sánchez, D.; Viejo, A.; Batet, M. Automatic Assessment of Privacy Policies under the GDPR. Appl. Sci. 2021, 11, 1762. [Google Scholar] [CrossRef]
- Degeling, M.; Utz, C.; Lentzsch, C.; Hosseini, H.; Schaub, F.; Holz, T. We Value Your Privacy … Now Take Some Cookies: Measuring the GDPR’s Impact on Web Privacy. In Proceedings of the 2019 Network and Distributed System Security Symposium, San Diego, CA, USA, 24–27 February 2019. [Google Scholar] [CrossRef]
- Linden, T.; Khandelwal, R.; Harkous, H.; Fawaz, K. The Privacy Policy Landscape After the GDPR. arXiv 2019, arXiv:1809.08396. [Google Scholar] [CrossRef]
- Zaeem, R.N.; Barber, K.S. The Effect of the GDPR on Privacy Policies: Recent Progress and Future Promise. ACM Trans. Manag. Inf. Syst. 2020, 12, 2:1–2:20. [Google Scholar] [CrossRef]
- Libert, T. An Automated Approach to Auditing Disclosure of Third-Party Data Collection in Website Privacy Policies. In Proceedings of the 2018 World Wide Web Conference on World Wide Web-WWW ’18, Lyon, France, 23–27 April 2018; pp. 207–216. [Google Scholar] [CrossRef]
- Kotal, A.; Joshi, A.; Pande Joshi, K. The Effect of Text Ambiguity on creating Policy Knowledge Graphs. In Proceedings of the 2021 IEEE Intl Conf on Parallel & Distributed Processing with Applications, Big Data & Cloud Computing, Sustainable Computing & Communications, Social Computing & Networking (ISPA/BDCloud/SocialCom/SustainCom), New York City, NY, USA, 30 September–3 October 2021; IEEE: New York City, NY, USA, 2021; pp. 1491–1500. [Google Scholar] [CrossRef]
- Zimmeck, S.; Story, P.; Smullen, D.; Ravichander, A.; Wang, Z.; Reidenberg, J.; Cameron Russell, N.; Sadeh, N. MAPS: Scaling Privacy Compliance Analysis to a Million Apps. Proc. Priv. Enhancing Technol. 2019, 2019, 66–86. [Google Scholar] [CrossRef]
- Story, P.; Zimmeck, S.; Ravichander, A.; Smullen, D.; Wang, Z.; Reidenberg, J.; Russell, N.; Sadeh, N. Natural Language Processing for Mobile App Privacy Compliance. In Proceedings of the PAL: Privacy-Enhancing Artificial Intelligence and Language Technologies AAAI Spring Symposium, Palo Alto, CA, USA, 25–27 March 2019. [Google Scholar]
- Hashmi, S.S.; Waheed, N.; Tangari, G.; Ikram, M.; Smith, S. Longitudinal Compliance Analysis of Android Applications with Privacy Policies. arXiv 2021, arXiv:2106.10035. [Google Scholar]
- Internet Archive: Wayback Machine. Available online: https://archive.org/web/ (accessed on 5 August 2023).
- NLTK: nltk.tokenize Package. Available online: https://www.nltk.org/api/nltk.tokenize.html (accessed on 5 August 2023).
- Webshrinker. Available online: https://webshrinker.com/ (accessed on 5 August 2023).
- Chall, J.S.; Dale, E. Readability Revisited: The New Dale-Chall Readability Formula; Brookline Books: Brookline, MA, USA, 1995; Google-Books-ID: 2nbuAAAAMAAJ. [Google Scholar]
- Wolf, T.; Debut, L.; Sanh, V.; Chaumond, J.; Delangue, C.; Moi, A.; Cistac, P.; Rault, T.; Louf, R.; Funtowicz, M.; et al. HuggingFace’s Transformers: State-of-the-art Natural Language Processing. arXiv 2020, arXiv:1910.03771. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. arXiv 2019, arXiv:1810.04805. [Google Scholar]
- Nissenbaum, H. A Contextual Approach to Privacy Online. Daedalus 2011, 140, 32–48. [Google Scholar] [CrossRef]
- Chanenson, J.; Pickering, M.; Apthorpe, N. Automating Governing Knowledge Commons and Contextual Integrity (GKC-CI) Privacy Policy Annotations with Large Language Models. arXiv 2023, arXiv:cs.CY/2311.02192. [Google Scholar]
- Tang, C.; Liu, Z.; Ma, C.; Wu, Z.; Li, Y.; Liu, W.; Zhu, D.; Li, Q.; Li, X.; Liu, T.; et al. PolicyGPT: Automated Analysis of Privacy Policies with Large Language Models. arXiv 2023, arXiv:cs.CL/2309.10238. [Google Scholar]
Dataset | # Policies | # Websites | Timeframe | Labeling |
---|---|---|---|---|
OPP-115 | 115 | 115 | 2015 | Yes |
OptOutChoice-2020 | 236 | 236 | - | Yes |
PolicyIE | 400 | 400 (websites + apps) | 2019 | Yes |
DMOZ-based Corpus | 117,502 | - | 2020 | No |
PrivaSeer | 1,005,380 | 995,475 | 2019 | No |
Princeton-Leuven Corpus | 910,546 | 108,499 | 1997–2019 | No |
Adj. Score | Grade Level |
---|---|
5.0–5.9 | Grades 5–6 |
6.0–6.9 | Grades 7–8 |
7.0–7.9 | Grades 9–10 |
8.0–8.9 | Grades 11–12 |
9.0–9.9 | College |
10.0 and above | College graduate |
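The score bands in the table above can be applied programmatically. The following is a minimal Python sketch, assuming the standard New Dale–Chall formula (0.1579 times the percentage of difficult words, plus 0.0496 times the average sentence length, plus 3.6365 when more than 5% of the words are difficult). The function names are hypothetical. The number of difficult words is taken as an input because the Dale–Chall list of familiar words is not reproduced here, and the band below 5.0 does not appear in the table but is added from the standard description of the formula.

```python
def dale_chall_score(n_words: int, n_sentences: int, n_difficult_words: int) -> float:
    """Adjusted New Dale-Chall score from raw counts (standard formula, assumed here)."""
    if n_words <= 0 or n_sentences <= 0:
        raise ValueError("need at least one word and one sentence")
    pct_difficult = 100.0 * n_difficult_words / n_words
    score = 0.1579 * pct_difficult + 0.0496 * (n_words / n_sentences)
    if pct_difficult > 5.0:
        score += 3.6365  # adjustment applied when difficult words exceed 5%
    return score


# (exclusive upper bound, label); bands at 5.0 and above follow the table.
GRADE_BANDS = [
    (5.0, "Grade 4 and below"),  # assumption: this band is not listed in the table
    (6.0, "Grades 5–6"),
    (7.0, "Grades 7–8"),
    (8.0, "Grades 9–10"),
    (9.0, "Grades 11–12"),
    (10.0, "College"),
]


def grade_level(score: float) -> str:
    """Map an adjusted score to its grade-level label."""
    for upper_bound, label in GRADE_BANDS:
        if score < upper_bound:
            return label
    return "College graduate"
```

For example, a 100-word text with 5 sentences and 10 difficult words has 10% difficult words and an average sentence length of 20, so the adjustment applies and `grade_level(dale_chall_score(100, 5, 10))` falls in the "Grades 7–8" band.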
Category | Description | Key Words and Phrases | Example Sentence |
---|---|---|---|
Condition | Action performed is dependent on a variable or unclear trigger | depending, necessary, appropriate, inappropriate, as needed, as applicable, otherwise reasonably, sometimes, from time to time | “As a result, Pokemon will not collect more personal information than is reasonably necessary.” |
Generalization | Action/information type is vaguely abstracted with unclear conditions | generally, mostly, widely, general, commonly, usually, normally, typically, largely, often, primarily, among other things | “Generally we will store and process your information within the UK.” |
Modality | Vague likelihood of action or ambiguous possibility of action or event | may, might, can, could, would, likely, possible, possibly, probably, optionally | “If you are logged in to the site, we could associate information about your site usage that is collected by cookies, web beacons and web logs with your user account.” |
Numeric quantifier | Vague quantifier of action/information type | anyone, certain, everyone, numerous, some, most, few, much, many, various, including but not limited to, such as | “You hereby consent to the collection, use, disclosure and retention by edX of your personal information as described under this privacy policy, including but not limited to the transfer of your personal data between edX and the third parties, affiliates and subsidiaries described in this privacy policy.” |
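The taxonomy above can be made concrete with a small keyword tagger. The following is a minimal Python sketch that reports which vague terms from each category occur in a sentence. It only illustrates the taxonomy; it is not the vagueness classifier of this study, which relies on a fine-tuned language model. The names `VAGUE_TERMS` and `find_vague_terms` are hypothetical, and whole-word, case-insensitive matching is an assumption.

```python
import re

# Keywords copied from the taxonomy table, grouped by category.
VAGUE_TERMS = {
    "Condition": [
        "depending", "necessary", "appropriate", "inappropriate", "as needed",
        "as applicable", "otherwise reasonably", "sometimes", "from time to time",
    ],
    "Generalization": [
        "generally", "mostly", "widely", "general", "commonly", "usually",
        "normally", "typically", "largely", "often", "primarily",
        "among other things",
    ],
    "Modality": [
        "may", "might", "can", "could", "would", "likely", "possible",
        "possibly", "probably", "optionally",
    ],
    "Numeric quantifier": [
        "anyone", "certain", "everyone", "numerous", "some", "most", "few",
        "much", "many", "various", "including but not limited to", "such as",
    ],
}


def _compile(terms: list[str]) -> re.Pattern:
    # Longest terms first, so multi-word phrases are tried before shorter ones.
    ordered = sorted(terms, key=len, reverse=True)
    alternation = "|".join(re.escape(t) for t in ordered)
    return re.compile(rf"\b(?:{alternation})\b", re.IGNORECASE)


PATTERNS = {category: _compile(terms) for category, terms in VAGUE_TERMS.items()}


def find_vague_terms(sentence: str) -> dict[str, list[str]]:
    """Return the matched vague terms per category (categories without hits are omitted)."""
    hits = {}
    for category, pattern in PATTERNS.items():
        found = [m.group(0).lower() for m in pattern.finditer(sentence)]
        if found:
            hits[category] = found
    return hits
```

A keyword match alone does not establish that a sentence is vague in context (for instance, "can" may express ability and not uncertainty), which is one reason a sentence-level classifier is a natural complement to the word list.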
Model | BERT Base Uncased |
---|---|
Batch size | 16 |
Epochs | 2 |
Learning rate | |
Max length | 256 |
Optimizer type | Adam |
Metric | Baseline | BERT (Train) | BERT (Validation) | BERT (Test) | ACGAN |
---|---|---|---|---|---|
Accuracy | 0.5172 | 0.6499 | 0.6120 | 0.5960 | |
F1-Score, macro average | 0.1705 | 0.4658 | 0.4477 | 0.4098 | |
F1-Score, micro average | 0.3526 | 0.6327 | 0.5955 | 0.5783 | 0.5234 |
Precision, macro average | 0.1293 | 0.4747 | 0.4540 | 0.4157 | |
Precision, micro average | 0.2675 | 0.6240 | 0.5878 | 0.5669 | 0.5290 |
Recall, macro average | 0.2500 | 0.4635 | 0.4482 | 0.4085 | |
Recall, micro average | 0.5172 | 0.6499 | 0.6120 | 0.5960 | 0.5464 |
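To help read the Baseline column, the following Python sketch computes the metrics of a hypothetical predictor that always outputs the most frequent of four classes. That the baseline works this way is an assumption, not something stated in the table; the assumption is motivated by the macro-averaged recall of exactly 0.2500, which is what such a predictor obtains with four classes. The function name and the majority share of 0.5172 (taken from the baseline accuracy) are likewise assumptions. Under them, the rows labeled micro average are reproduced by support-weighted averaging.

```python
def majority_baseline_metrics(majority_share: float, n_classes: int) -> dict[str, float]:
    """Metrics of a predictor that always outputs the most frequent class.

    Hypothetical helper: majority_share is the fraction of samples that belong
    to the most frequent class; all other classes get precision = recall = F1 = 0.
    """
    precision_major = majority_share  # every prediction is the majority class
    recall_major = 1.0                # every majority sample is recovered
    f1_major = 2 * precision_major * recall_major / (precision_major + recall_major)
    return {
        "accuracy": majority_share,
        # Macro average: unweighted mean over classes, the others contribute 0.
        "macro_precision": precision_major / n_classes,
        "macro_recall": recall_major / n_classes,
        "macro_f1": f1_major / n_classes,
        # Support-weighted average: only the majority class has non-zero scores.
        "weighted_precision": majority_share * precision_major,
        "weighted_recall": majority_share * recall_major,
        "weighted_f1": majority_share * f1_major,
    }


metrics = majority_baseline_metrics(majority_share=0.5172, n_classes=4)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

With these inputs the values agree with the Baseline column to within rounding of the majority share, which gives a reference point for judging how far the fine-tuned model improves on a trivial predictor.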
Website | Category | Example Sentence |
---|---|---|
cheapflights.com | travel, business | Cheapflights Media (USA) Inc., which publishes Cheapflights.com, has created this privacy statement in order to demonstrate our firm commitment to user privacy. |
direct-golf.co.uk | business, shopping | The personal information which we hold will be held securely in accordance with our internal security policy and governing UK law. |
tumblr.com | blogs and personal, social networking | Tumblr, Inc. takes the private nature of your information very seriously. |
learningstrategies.com | business, education | We are committed to safeguarding the privacy of our website visitors, subscribers, and clients. |
imdb.com | entertainment | IMDb knows that you care how information about you is used and shared, and we appreciate your trust that we will do so carefully and sensibly. |
alibaba.com | shopping | We at Alibaba.com recognize the importance of privacy and confidentiality of personal information. |
© 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).
Belcheva, V.; Ermakova, T.; Fabian, B. Understanding Website Privacy Policies—A Longitudinal Analysis Using Natural Language Processing. Information 2023, 14, 622. https://doi.org/10.3390/info14110622