4.1.1 Survey design.
We distributed our survey using Qualtrics. The median time needed to complete the survey was around 10 minutes. After briefly explaining the purpose of our survey, we asked general background questions, such as whether participants had ever come across a privacy notice (Q1), whether they had ever adjusted their privacy preferences (Q2), how concerned they were about their data privacy (Q3), and how much control they felt they had over their online privacy (Q4).
A central part of our survey was a vignette section where participants imagined themselves using one of eight randomly assigned website types (news, e-commerce, search engine, social media, government, non-profit, entertainment, or banking). We used a between-subjects design, where each participant evaluated only one website category, and we had an equal number of participants evaluating each website category. Each vignette (i.e., website category) included a screenshot to mentally situate the participant in the environment of the website. These screenshots were non-branded, meaning that they did not represent real websites, so that participants would not be primed by their opinions of particular services. The website categories were compiled based on previous research by Habib et al. [38], Chanchary and Chiasson [8], and Amos et al. [1]. Participants were presented with one randomly selected vignette, but all survey questions and the evaluated data collection purposes were the same, regardless of the website category a participant was assigned to.
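The balanced random assignment described above can be sketched as follows. This is a minimal illustration, not the mechanism actually used in the survey platform: the function name `assign_balanced`, the participant identifiers, and the shuffle-then-deal strategy are all assumptions introduced for the example.

```python
import random

# The eight website categories named in the survey design.
CATEGORIES = [
    "news", "e-commerce", "search engine", "social media",
    "government", "non-profit", "entertainment", "banking",
]


def assign_balanced(participant_ids, categories=CATEGORIES, seed=None):
    """Randomly assign each participant to exactly one category
    (between-subjects), keeping group sizes as equal as possible.

    Hypothetical helper: shuffles the participants, then deals them
    round-robin across the categories, so group sizes differ by at
    most one.
    """
    rng = random.Random(seed)
    ids = list(participant_ids)
    rng.shuffle(ids)
    return {pid: categories[i % len(categories)] for i, pid in enumerate(ids)}


if __name__ == "__main__":
    assignment = assign_balanced(range(400), seed=42)
    counts = {c: 0 for c in CATEGORIES}
    for category in assignment.values():
        counts[category] += 1
    print(counts)
```

Shuffling before dealing keeps the assignment random for each individual while guaranteeing equal group sizes whenever the number of participants is a multiple of eight.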
To gauge whether certain consent-based purposes (Functional and strictly necessary, UX improvements, and Sharing with third parties) could potentially be future legitimate interests, we asked participants how essential they deemed each purpose for the functioning and service offering of that particular kind of website (Q5, 7, 9). We used a 5-point Likert scale, ranging from “1 - Completely disagree” to “5 - Completely agree”. We also asked participants to rate how much they thought each purpose benefited different stakeholders (the user, service provider, third party vendors, other users, and society) (Q6, 8, 10) on a 4-point Likert scale, ranging from “1 - Not at all” to “4 - A lot”.
From an operational point of view, the application of legitimate interest means that, for certain purposes, personal data is collected without the user’s explicit permission. We investigated how our participants felt about this practice with respect to the purposes commonly used under legitimate interest (Personalizing and measuring content, Personalizing and measuring ads, Analytics, Develop and improve products, Future innovations, Archiving, Security and debugging, and Fraud and law enforcement), asking how comfortable they were with websites collecting data for these purposes without asking for user permission (Q11, 13, 15, 17, 19, 21, 23, 25). Additionally, we asked participants to rate on a 4-point Likert scale how much they thought each purpose benefits different stakeholders (the user, service provider, third party vendors, other users, and society) (Q12, 14, 16, 18, 20, 22, 24, 26). In this part of the survey, we did not use the term “legitimate interest” because we anticipated that some participants might not understand it. Instead, we paraphrased it using an operational framing: “purposes that use data without your permission”.
The next section of our survey investigated participants’ general understanding of the concept of legitimate interest and its practical implications. Questions included: asking participants whose legitimate interests they thought websites were referring to when collecting data for a “legitimate interest" purpose (Q27), presenting screenshots of the four possible consent/legitimate interest toggle configurations and asking participants in which configuration(s) they thought data was being collected (Q28), as well as asking an open-ended question about the perceived harms of using legitimate interest for data collection (Q29).
The last section of the survey consisted of demographic questions. We asked about participants’ technical and privacy knowledge using the web skills use survey (Q30-38) [39], their age (Q39), gender (Q40), how long they had been living in the EU (Q41), where they lived in the EU (Q42), and the language they primarily used the Internet in (Q43).
4.1.2 Survey Validation.
We initially piloted the survey with three participants: one had a security and privacy background, and two had non-computational backgrounds. Based on the findings from our pilot, we revised the wording and presentation of the survey and released a pre-test with 30 participants, which is a recommended sample size for survey validation [64]. We distributed the pre-test on Prolific, using the same pre-screening criteria as in the final survey. We did not use the pre-test data for our final analysis. The pre-test with 43 survey items had a Cronbach’s alpha of α = 0.92, which indicates that our survey had good internal reliability [70]. We therefore did not need to further change our questions.
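For readers unfamiliar with the statistic, Cronbach’s alpha can be computed from a participants-by-items matrix of numeric responses using the standard formula α = k/(k−1) · (1 − Σ item variances / variance of total scores). The sketch below is illustrative only: the function name `cronbach_alpha` and the toy data are assumptions, and it does not reproduce the tool or the data used for the pre-test.

```python
from statistics import variance


def cronbach_alpha(responses):
    """Cronbach's alpha for a list of rows, one row per participant,
    each row holding that participant's numeric scores on k items.

    Hypothetical helper using sample variances (n - 1 denominator).
    """
    if len(responses) < 2:
        raise ValueError("need at least two participants")
    k = len(responses[0])
    if k < 2:
        raise ValueError("need at least two items")
    if any(len(row) != k for row in responses):
        raise ValueError("every participant must answer every item")

    # Variance of each item (column) across participants.
    item_variance_sum = sum(variance(col) for col in zip(*responses))
    # Variance of each participant's total score.
    total_variance = variance([sum(row) for row in responses])
    if total_variance == 0:
        raise ValueError("total scores have zero variance")

    return (k / (k - 1)) * (1 - item_variance_sum / total_variance)


if __name__ == "__main__":
    # Toy data (made up for illustration): 5 participants, 4 items.
    toy = [
        [4, 5, 4, 4],
        [2, 2, 3, 2],
        [5, 5, 5, 4],
        [3, 3, 2, 3],
        [1, 2, 1, 2],
    ]
    print(round(cronbach_alpha(toy), 3))
```

Values closer to 1 indicate that the items vary together, which is the sense in which a high alpha is read as internal consistency.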
To maximize the survey validity, we reused or adapted questions from previous studies where applicable. Some questions and multiple choice responses were adopted from previous work by Kozyreva et al. [50] (for the question, “How concerned are you about your data privacy when using the Internet?”). Habib et al.’s codebook was used to inform some multiple choice responses [38]. In the demographics section, we used the web skills use survey by Hargittai and Hsieh [39]. We were also careful to ensure that there was a correspondence between the survey questions and our research questions. For example, for RQ3 (“What do people expect or think is reasonable for companies to collect under legitimate interest?”), we asked, “Sometimes, websites might collect data for the following purposes without asking for permission. For [insert legitimate interest purpose], how comfortable are you with this?”
4.1.3 Taxonomy of purposes.
There is no centralized or standard list of legitimate interest purposes, as this legal basis is determined on a case-by-case basis with the legitimate interests balancing test [32, 44]. Accordingly, for our survey, we used the purposes from the Cookiepedia Database, which provides an extensive database and categorization of cookies and is often used in empirical studies on cookies [42, 66]. Additionally, we looked to guidelines from the European Data Protection Board (EDPB) [25], the Center for Information Policy Leadership (CIPL) [7], and the TCF [21] to find which purposes can be based on legitimate interest.
After several discussions, we agreed to survey participants on 11 purposes for section two of our survey: eight that were based on legitimate interest (and therefore do not require consent), and three that are subject to consent but that users might deem essential, making them potential legitimate interests. We were broadly interested in the legitimate interest purposes we identified from the first study, but also in purposes related to the repurposing of data, such as Future innovations, Archiving, and Product development, because of their relevance to critiques of data minimization [29, 76] and their general importance to scientific, social, and product development. We included the three consent-based purposes because they are commonly used [41, 66]; we were interested to see if we could reduce user fatigue by treating them as potential legitimate interests in the future.
For purposes that were very similar, such as the TCF-based Create a personalized ad profile, Select personalized ads, Create a personalized content profile, and Select personalized content, we amalgamated them into general purposes called Personalized content delivery and measurement and Personalized ad delivery and measurement to reduce repetition for participants. Below, we list the purposes and definitions we presented to participants. The table in Section 6 of our Supplementary Materials describes the reasoning behind each purpose and where we sourced it from.
• Functional, strictly necessary purposes: enables you to move around the website and use its features
• User experience (UX) improvements: collect and process information about your use of the website to provide you with personalized enhanced features, like to remember the choices that you made
• Sharing data with third parties: sharing your information with third-parties beyond the website you are visiting
• Personalizing and measuring content: create and display personalized content that is relevant to you, content you interact with are measured for performance and effectiveness
• Personalizing and measuring ads: deliver, personalize ads, select and measure the effectiveness of these ads. Advertising and marketing material can be shown to you based on the content you’re viewing, the app you’re using, your approximate location, or your device type. Ads you interact with are measured for performance and effectiveness
• Analytics, statistics, and audience insights: measure, improve and report on your engagement with the website service, like the number of unique visits to a website, how long users stay in the site, what parts and pages of the website are browsed, main searched keywords, etc. Apply market research to learn more about audiences who visit sites/apps and view ads
• Developing and improving products: Your data can be used to improve existing systems and software, and to develop new products and functionalities
• Future innovations: your data can be used for future innovations unrelated to the service the website currently provides
• Archiving data for scientific or historical research, public interest, or statistical purposes: your data can be used for future archiving purposes in the public interest, scientific or historical research purposes or statistical purposes
• Security and debugging: your data can be used to ensure systems are working properly and securely
• Fraud detection and law enforcement: your data can be used to monitor for and prevent fraudulent activity, and indicating possible criminal acts and threats to public safety.
4.1.4 Participants.
For both the pre-test and final survey, we surveyed internet users who speak English and have been living in the EU for at least one year, as we wanted participants who have been exposed to the GDPR and cookie policies. We used Prolific and recruited participants with a minimum approval rate of 90% on the platform to ensure high-quality answers. Our institutional review board declared that our study was exempt from federal human subjects regulation.
We conducted a power analysis to determine the sample size for our survey with 8 website category conditions. To achieve a statistical power of 0.8, we needed approximately 400 participants. Since participants were based in the EU, we expected all of them to have seen privacy notices. However, one participant reported never having encountered a privacy notice, so we excluded them from the analysis. Participants were compensated €2.70 for approximately 15 minutes of their time. Based on our observations during the pilot and the median completion time in our pre-test, we determined that 15 minutes was likely to be more than enough time to complete the survey.
In total, we analyzed 399 responses. We had 250 male, 145 female, and 4 non-binary participants. As is the case with most research using online crowdsourcing platforms, our participants were mostly young, with 76% being between 18 and 34 years old [45, 68]. Almost all (96.24%) had lived in the EU for over four years, and over half (54.14%) primarily used the internet in English.