1 Introduction
In January 2020, privacy journalist Kashmir Hill published an article in the New York Times describing Clearview AI — a company that purports to help U.S. law enforcement match photos of unknown people to their online presence through a facial recognition model trained by scraping millions of publicly available face images online [
116]. In 2021, police departments in many U.S. cities were reported to have used Clearview AI to identify individuals, including Black Lives Matter protesters [
116]. In 2022, a California-based artist found that photos she thought to be in her private medical record were included, without her knowledge or consent, in the LAION training dataset that has been used to train Stable Diffusion and Google Imagen [
39]. The artist has a rare medical condition that she preferred to keep private, and expressed concern about the abusive potential of generative AI technologies having access to her photos. In January 2023, Twitch streamer QTCinderella made an emphatic plea to her followers on Twitter to stop spreading links to an illicit website hosting AI-generated deepfake pornography of her and other women influencers. “Being seen ‘naked’ against your will should NOT BE A PART OF THIS JOB” [
110].
These examples illuminate the unique privacy risks posed by AI technologies, prompting the foundational research question we ask in this work:
How do modern advances in AI and ML change the privacy risks of a product or service? To answer this question, we introduce a taxonomy of AI privacy risks, grounded in an analysis of 321 privacy-relevant incidents, much like the ones described above, that resulted from AI products and services and were sourced from an AI incidents database [
108]. This work is important for at least two reasons. First, people are concerned about how AI can affect their privacy: a 2021 survey with around 10,000 participants from ten countries found that roughly half of the respondents believed that AI would result in “less privacy” in the future, citing concerns around large-scale collection of personal data, consent, and surveillance [
71]. Second, while privacy is one of the five most commonly cited principles for the development of ethical AI technologies [
66], we do not yet have a systematic understanding of whether and how modern advances in AI change the privacy risks entailed by products and services.
While AI and ML technologies have vastly expanded in capability [
159], there is simultaneously a great deal of hype about what these technologies can and cannot do, making it difficult to separate real risks from speculative ones [
68]. Thus, it can be difficult for today’s practitioners who develop AI-inclusive products and services to understand how their use of AI technologies might entail or exacerbate practical privacy risks [
161]. Prior work bears this out: in an interview study with 35 AI practitioners, Lee et al. found that participants had relatively low awareness of privacy risks unique to or exacerbated by AI, and had little incentive or support to address these risks [
76].
AI and privacy both existed long before modern dialogues around the role of privacy in ethical AI development. To understand what modern advances in AI
change about privacy, we needed a suitable baseline for privacy risk as it was understood before these advances. To that end, we used Solove’s highly-cited and well-known taxonomy of privacy from 2006 as a baseline [
126]. Solove’s taxonomy was proposed well before modern advances in AI became mainstream in product design, and remains relevant and influential to this day. Yet, Solove’s taxonomy is intentionally broad and technology-agnostic — a useful attribute in the legal and regulatory contexts for which it was developed, but less helpful in prescribing specific mitigations for product designers and developers.
To ground our analysis on real and practical risks, we sourced case studies from a database indexing real AI incidents documented by journalists — the AI, Algorithmic, and Automation Incident and Controversy (AIAAIC) repository [
108]. We sourced 321 case studies from the AIAAIC repository in which real AI products resulted in lived privacy risks. We next systematically analyzed whether and how the capabilities and/or requirements of the AI technology described in each incident (i)
created a new instantiation of a privacy risk described in Solove’s original taxonomy or an entirely new category of risk, (ii)
exacerbated a privacy risk that was already captured by Solove’s taxonomy, or (iii)
did not change the privacy risk described in the incident relative to at least one of the risks described in Solove’s taxonomy.
The result is our taxonomy of AI privacy risks (see Figure
1). Our taxonomy illustrates how the unique capabilities of AI — e.g., the ability to
recommend courses of action,
infer users’ interests and attributes, and
detect rare or anomalous events [
103] — resulted in both new instantiations of existing categories of risk in Solove’s taxonomy as well as one entirely new category of privacy risk. For example, we found that the ability of AI technologies to generate human-like media resulted in new types of exposure risks (e.g., the generation of deepfake pornography [
4]), while the ability for AI to learn arbitrary classification functions led to a new category of privacy risk: phrenology/physiognomy (e.g., the belief that AI can be used to automatically detect things like sexual orientation from physical attributes [
78]). Our taxonomy also captures how the data and infrastructural requirements of AI exacerbated privacy risks already captured in Solove’s taxonomy. For example, since facial recognition classifiers require tremendous amounts of face data, they can exacerbate surveillance risks by encouraging uncritical data collection practices such as collecting face scans in airports [
42].
We discuss how existing approaches to privacy-preserving AI and machine learning, such as differential privacy and federated learning, account for only a subset of these risks, highlighting the need for new tools, artifacts, and resources that aid practitioners in negotiating the utility-intrusiveness trade-off of AI-powered products and services. Finally, we outline how this taxonomy can be used to create tools that help educate practitioners, and to serve as a repository of shared knowledge regarding AI privacy risks and design processes that mitigate those risks.
4 Taxonomy of AI Privacy Risks
We introduce a taxonomy of AI privacy risks: i.e., privacy risks that are created and/or exacerbated by the incorporation of AI technologies into products and services. In short, we found that AI technologies create new instantiations of the privacy risks in Solove’s taxonomy [
126] (e.g., generative AI can create new types of distortion risks), create a new category of risk not captured by Solove’s taxonomy (e.g., resurging phrenology/physiognomy), and exacerbate many of the risks highlighted by Solove’s taxonomy (e.g., AI technologies can more robustly identify individuals from low-fidelity data sources) (see Figure
4).
We discuss these AI-created and exacerbated risks below as they relate to data collection, processing, dissemination, and invasion (see Figure
3). Overall, we found that of the 321 incidents from the AIAAIC database that involve privacy risks, the AI technology implicated in the incident either created or exacerbated the described privacy risks in 298 cases (92.8%), suggesting that the unique capabilities and/or requirements of AI do appear to meaningfully change privacy risks and that AI-specific privacy guidance may be necessary for practitioners.
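For reference, the incident counts reported in the subsections that follow can be summarized compactly. The Python mapping below is purely an illustrative restatement of the figures in this section; note that a single incident can entail more than one risk, so the sub-category counts overlap and do not sum to 321.

import pprint

# Incident counts per privacy risk, out of the 321 analyzed incidents.
# An incident can entail more than one risk, so counts overlap.
AI_PRIVACY_RISK_COUNTS = {
    "data collection": {"surveillance": 150},
    "data processing": {
        "identification": 124,
        "aggregation": 49,
        "phrenology/physiognomy": 27,  # new category created by AI
        "secondary use": 39,
        "exclusion": 149,
        "insecurity": 17,
    },
    "data dissemination": {
        "exposure": 17,
        "distortion": 20,
        "disclosure": 45,
        "increased accessibility": 23,
    },
    "invasion": {"intrusion": 160},
}

pprint.pprint(AI_PRIVACY_RISK_COUNTS)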
4.1 Data collection risks
Data collection risks “create disruption based on the process of data gathering” [
126]. Recent advances in AI/ML have been fueled by the collection of vast amounts of personal data. Solove identifies surveillance as the data collection risk that most pertains to AI technology. AI technologies might
create data collection risks if the AI technology provides functionality that enables the collection of previously inaccessible data; they
exacerbate data collection risks when data is collected specifically for the development of an AI/ML system, or if AI technologies facilitate the data collection process in a manner that increases the scope of the risk. In our analysis, we found incidents of AI exacerbating
surveillance risks, but not of creating new such risks.
4.1.1 Surveillance (150/321).
Surveillance refers to watching, listening to, or recording an individual’s activities [
126]. Surveillance risks long pre-date modern advances in AI. AI technologies do not always meaningfully change surveillance risks (16/150), e.g., when end-users feed their own personal data to access the utility offered by AI, such as by uploading videos to capture body movement or estimate car speed. Nevertheless, owing to the never-ending need for personal data to train and deploy effective machine learning models, we identified two ways AI technologies can exacerbate surveillance risks: by increasing the scale and the ubiquity of personal data collection.
AI enhances the scale of surveillance (32/150) by enabling linking across a diversity of sources and by increasing the quantity of personal data collected.
Where applicable, real-world models collect data from different sources to enrich datasets. We found that multi-faceted, high-fidelity data can exacerbate risks involving surveillance in the physical world. One example comes from a predictive policing platform deployed in Xinjiang, China. The system
“collects [individual’s] information from a variety of sources including CCTV cameras and Wi-Fi sniffers, as well as existing databases of health information, banking records, and family planning history” [
112]. This information was then used to identify persons and assess their activities in the real world. We also found incidents describing AI systems that collected an array of end-user behavioral data in the cyber world. For example, Gaggle, a student safety management tool, monitors students’ digital footprints such as email accounts, online documents, internet usage, and social media accounts to assess and prevent violence and suicides [
20].
Additionally, as the amount of training data often has a direct impact on model performance, AI technologies can exacerbate surveillance risks by increasing the need for collecting large-scale personal data to train effective models. For example, the South Korean Ministry of Justice attempted to build a government system for screening and identifying travelers based on photos of over 100 million foreign nationals who entered the country through its airports [
42]. Without the promise of AI technologies to automatically sift through and make sense of these data, there would be little incentive to collect data of this scale.
AI technologies exacerbate the ubiquity of surveillance risks (102/150) by using physical sensors and devices to collect information from environments. For example, geolocation data from mobile devices were used to assess employee performance, raising concerns about employee tracking outside of work [
138]. CCTV cameras have been used in applications to detect and prevent suicide attempts [
106] or to detect security anomalies in physical spaces [
142], while also introducing bystander privacy risks and concerns [
83]. Microphones enable a responsive audio interface for virtual assistants, but also raise concerns about extensive audio data collection and eavesdropping by the service provider [
107].
4.2 Data processing risks
Data processing risks result from the use, storage, and manipulation of personal data [
126]. Solove identified five types of data processing risks: identification, aggregation, secondary use, exclusion, and insecurity. In our analysis, we found incidents pertaining to each of these risks, as well as an entirely new category of data processing risk:
phrenology/physiognomy risk, which AI technologies create by correlating arbitrary inputs and outputs. We also found that AI technologies create new types of
identification and
aggregation risks (e.g., by operating on low-quality data; and by forecasting future events), and exacerbate
secondary use,
exclusion, and
insecurity risks (e.g., by re-purposing foundation models; by training models on datasets containing content obtained without consent; and by introducing new security vulnerabilities due to the use of AI).
4.2.1 Identification (124/321).
Identification refers to linking specific data points to an individual’s identity [
126]. These risks are commonplace even without AI; for example, users may be manually tagged in photos, or manually identified in CCTV video feeds. AI technologies, however, allow for automated identity linking across a variety of data sources, including images, audio, and biometrics. We found that AI technologies entail new types of identification risks with respect to scale, latency, robustness, and ubiquity.
AI technologies enabled automated identification at scale (20/124). One example is Facebook’s now-disabled Tag Suggestions product, through which Facebook demonstrated its ability to automatically identify individuals from uploaded photos. When this feature was in use, Facebook had 1.4 billion daily active users
5; still,
“any time someone uploads a photo that includes what Facebook thinks is your face, you’ll be notified even if you weren’t tagged” [
124].
AI technologies allow identification risks to occur more quickly, in nigh real-time (24/124), once the models are trained. For example, in 2019, the Italian government was on the verge of implementing a real-time facial recognition system across football stadiums that would “prevent individuals who are banned from sports competitions from entering stadiums.” The system also picked up audiences’ “racist conversations” to alert law enforcement authorities to the presence of racist fans [
127].
In addition,
AI technologies allow for robust identification even with low-quality data (7/124). Clearview AI, a facial recognition application that aids U.S. law enforcement in identifying wanted individuals, claims to be able to identify people under a range of obfuscation conditions:
“[a] person can be wearing a hat or glasses, or it can be a profile shot or partial view of their face” [
57]. Similarly, models trained on Simulated Masked Face Recognition Dataset (SMFRD)
6 are capable of identifying persons with a mask on,
“violating the privacy of those who wish to conceal their face” [
150].
Finally,
AI technologies enable ubiquitous identification risks in situated physical environments (73/124) like public places (e.g., [
92]), stores (e.g., [
58]), and classrooms (e.g., [
114]). For example, XPeng Motors, a Chinese electric vehicle firm, was reported to be using facial recognition-embedded cameras in its stores to collect biometric data of customers [
49].
4.2.2 Aggregation (49/321).
Aggregation risks refer to combining various pieces of data about a person to make inferences beyond what is explicitly captured in those data [
126]. These risks can occur without AI through manual analysis, but AI technologies greatly facilitate these inferences at scale, a trend Solove anticipated:
“the data gathered about people is significantly more extensive, the process of combining it is much easier, and the computer technologies to analyze it are more sophisticated and powerful” [
126]. Similar to identification risks, we found that AI technologies create new types of aggregation risks owing to their scale, latency, ubiquity, and their ability to forecast end-user behavior and infer end-user attributes.
One of the unique strengths of AI systems is that they automate complex processes into simple programs that overcome human limitations. While controversial, algorithmic tools are still utilized by many public-sector agencies in high-stakes contexts such as social work [
69] and services for the unhoused [
74] to prioritize limited resources. To that end,
AI technologies create aggregation risks at scale (23/49) by processing vast amounts of personal data to make invasive inferences about individuals that are not explicit in those data. For example, an AI start-up created a service that assesses a prospective babysitter’s likelihood to engage in risky behaviors such as drug abuse and bullying by
“scan[ning]... thousands of Facebook, Twitter and Instagram posts” [
56].
AI technologies perform complicated inferencing tasks nigh instantly (11/49). Technologies have been developed to estimate employee performance in-the-moment [
141], and to forecast what one might write in emails [
134]. AI technologies have also been developed to predict when end-users might be ovulating [
61], and their moment-to-moment risk of committing suicide [
20].
AI technologies can make physical objects and environments smarter and more responsive,
enabling ubiquitous aggregation risks (5/49). Smart home devices, for example, allow for automated control of home appliances, dynamic temperature control to strike an optimal balance between energy consumption and comfort, and voice user interfaces [
9]. These features require AI technologies to continuously monitor data streamed from physical sensors, creating new aggregation risks in situated environments. For example, smart speaker microphone feeds have been used to infer who is present in a room, who is speaking, and other information that can be algorithmically inferred from voice data [
2].
Finally, AI technologies enable forecasting future behaviors and states based on historical data. This forecasting can be used, for example, to help proactively identify health risks, plan optimal routes to avoid predictable traffic, and estimate retirement savings. These capabilities of AI, however, also
create a new type of predictive aggregation risk (10/49). For example, in 2018, Argentina’s government deployed an AI model that predicted teen pregnancy in low-income areas from teenagers’ first names, last names, and addresses [
65]. AI has also been used for crime prediction. For example, in 2018, law enforcement in the United Kingdom aimed to predict serious violent crime using AI based on
“records of people being stopped and searched and logs of crimes committed” [
17].
4.2.3 Phrenology / Physiognomy (27/321).
Phrenology and Physiognomy are debunked pseudosciences that postulate that it is possible to make reliable inferences about a person’s personality, character, or predispositions from an analysis of their
outer appearance and/or
physical characteristics [
1]. Beyond the baseless predictions made from historical data streams discussed under aggregation risks (Section
4.2.2), phrenology/physiognomy risks pose unique downstream privacy harms distinct from aggregation risks: whereas aggregation risks primarily arise from the collection and combination of disparate pieces of information to make deductive inferences about individuals, phrenology/physiognomy risks introduce new and unfounded inferences about an individual’s internal characteristics (e.g., their preferences and proclivities). Moreover, while aggregation risks generally come from the combination of factual and observable data streams over which users can have some awareness and control (e.g., purchasing habits), phrenology/physiognomy risks arise from inferences over physical characteristics that users cannot control. Finally, beyond the harm to the individual, there is also a broader societal harm: prior work has warned that irresponsible use of AI classification models could usher in a revival of these pseudosciences [
14,
128] by, e.g., motivating surveillance institutions to train AI models to make spurious inferences about a person’s preferences, personality, and character from inputs that capture their outer appearance. Our analysis reveals that AI technologies are indeed being used in this way, resulting in a new category of privacy risk not captured by Solove’s initial taxonomy. We define phrenology/physiognomy risks as the use of AI to infer personality, social, and emotional attributes about an individual from their physical attributes. This risk stems from AI’s ability to learn correlations between arbitrary inputs (e.g., images, voices) and outputs (e.g., one’s demographic information).
Some models aim to infer preferences, like sexual orientation. For example, ‘Gaydar’ is an AI sexual orientation prediction model that “distinguishes between gay or straight people” based on their photos [
79]. Researchers have also used AI to predict “criminality” — i.e., whether someone is a criminal — from facial images [
154]. Beyond the problematic assumptions of these models (i.e., that sexual orientation and criminality can be inferred from photos), this research raises concerns about the potential for harm and misuse of AI models to infer and disseminate information about individuals without consent [
79]. AI technologies have also been used to predict other personal information such as one’s name [
26], age [
109], and ethnicity [
117] based on facial characteristics.
Other models aim to predict a person’s mental and emotional state based on their images. For example, teaching tools devised by Class Technologies estimate students’ engagement from their facial expressions without students’ consent [
70]. Still other models scrutinize vocal attributes to predict an individual’s trustworthiness. For instance, the AI system DeepScore captures and assesses voice data to predict deceptiveness, and has been utilized by health insurance and money lending platforms to select low-risk clients [
43].
4.2.4 Secondary Use (39/321).
Secondary use encompasses the use of personal data collected for one purpose for a different purpose without end-user consent [
126]. In AI technologies, this risk is mostly associated with data practices for training data. AI does not always change secondary use risks (6/39). For example, Luca, an app that was used for contact tracing during the COVID-19 pandemic in Germany, was found to re-purpose personal data, such as location data, to support law enforcement by
“tracking down witnesses to a potential crime” [
105], but the risk described here would have been just as salient even without the use of AI. Nevertheless, many common practices used to train AI/ML models more effectively can exacerbate secondary use. In our dataset, we identified two AI-exacerbated secondary use risks: creating new AI capabilities with collected personal data, and (re)creating models from a public dataset.
When data collectors have already built models using personal data, they may be tempted to expand the models by creating additional features and capabilities, which end-users may not anticipate (22/39). For example, OkCupid, a dating site that matches users using a
“one-of-a-kind algorithm”7, was found to have contacted an AI startup, Clarifai,
“about collaborating to determine if they could build unbiased A.I. and facial recognition technology,” and that
“Clarifai used the images from OkCupid to build a service that could identify the age, sex and race of detected faces” [
86].
Secondary use risks can also be exacerbated when AI practitioners try to reuse public datasets to train models for purposes other than those for which the data were originally collected (11/39). For example, People in Photo Albums (PIPA) is a facial photograph dataset created to
“recogniz[e] peoples’ identities in photo albums in an unconstrained setting” [
162]. Yet, the PIPA dataset has been used in research affiliated with military applications and companies like Facebook [
54,
55]. Similarly, the Diversity in Faces (DiF) dataset is a collection of annotations of one million facial images that was released by IBM in 2019 [
125]. The dataset was created to improve research on the fairness and accuracy of artificial intelligence face recognition systems across genders and skin colors. Although it was not intended for commercial use, Amazon and Microsoft were accused of using the dataset to
“improve the accuracy of their facial recognition software” [
13].
4.2.5 Exclusion (149/321).
Exclusion refers to the failure to provide end-users with notice and control over how their data is being used [
126]. Even without AI, computing products can covertly process data without informing users. Thus, AI technologies do not meaningfully change exclusion risks when the risk is isolated to just the covert processing of personal data (76/149). For example, a “trustworthiness” algorithm developed by a short-term homestay company covertly used publicly accessible social media posts to ascertain if a potential customer was trustworthy [
67], but the use of AI in this case did not fundamentally change the privacy risk. We nevertheless found in our incident database that the requirements of AI technology
can exacerbate exclusion risks by incentivizing the collection of large, rich datasets of personal data without securing consent (73/149).
For example, the Large-scale Artificial Intelligence Open Network (LAION) is a German non-profit organization that aims “to make large-scale machine learning models, datasets and related code available to the general public.” In 2022, it released the large-scale dataset LAION-5B [
120], the biggest openly accessible image-text dataset at the time
8. These data have been used to train many high-profile text-to-image models such as Stable Diffusion
9 and Google Imagen
10[
39]. However, a person found that her private medical photographs were referenced in the public image-text dataset. She suspected that
“someone stole the image from my deceased doctor’s files and it ended up somewhere online, and then it was scraped into this dataset” [
39]. Other models were found to be trained on “semi-public” personal data that were scraped from places like online forums, dating sites, and social media without users’ awareness and consent (e.g., [
3,
57,
164]). For example, Clearview AI built a private face recognition model trained on three billion photos that were
“scraped from Facebook, YouTube, Venmo and millions of other websites” [
85].
Prior work has shown that it can be challenging to give individuals agency over how the data they have shared online can and cannot be used by such models [
100], and that removing one’s data from such datasets can be made deliberately complex [
22]. Additionally, when commercial AI models are “black boxes,” the general public has no means to audit how personal data is used by AI (e.g., Clearview AI). Finally, “algorithmic inclusion” — i.e., ensuring that everyone is included in a system — is often seen as a more desirable way to build AI systems in the context of AI ethics. These “inclusive AI” approaches, however, need to be balanced against exclusion-based privacy risks [
10,
12]: when more people’s data are captured to build inclusive systems, those people may be subject to increased exclusion risk if their data is collected without adequate consent and control.
4.2.6 Insecurity (17/321).
Insecurity refers to carelessness in protecting collected personal data from leaks and improper access due to faulty data storage and data practices [
126]. Products and services that include AI are subject to many of the same insecurity risks that result from poor operational security, unrelated to the capabilities and data requirements of AI (12/17). For example, our dataset includes a data breach where attackers hacked into Verkada, a security startup that provides cloud-based security cameras with face recognition. This gave the attackers access to cameras that
“are capable of identifying particular people across time by detecting their faces, and are also capable of filtering individuals by their gender, the color of their clothes, and other attributes” [
34,
135]. These operational security mistakes are not unique to or exacerbated by AI technologies, even though the AI-enabled products and services that are hacked afford attackers access to compromised data that would otherwise not be accessible. We did, however, find instances in which the capabilities and/or data requirements of AI technologies directly exacerbated insecurity risks (5/17).
Sometimes AI technology compromises end-user privacy in order to enable AI utility. For example, Allo, a messaging app that Google first launched in 2017, included an AI virtual assistant and automatic replies. The messenger was not end-to-end encrypted, allowing AI models developed by Google to “read” users’ chat content and personalize services for them [
47].
We also found cases where AI technologies unexpectedly reveal the personal data on which they were trained. For example, Lee Luda, a chatbot trained on real-world text conversations, was found to expose the names, nicknames, and home addresses of the users on whose data it was trained [
63]. Similarly, services that use generative AI models to create realistic but fake human faces have been shown to be able to reconstruct the raw personal data on which the models were trained [
147].
Additional vulnerabilities can be introduced through the infrastructural data requirements entailed by AI technologies. For example, converting raw data into training-ready labeled data can require the exposure of raw personal data to human annotators. iRobot, for instance, hired gig workers to annotate audio, photo, and video data captured by its household robots to train AI models. However, some of these raw and sensitive photos were leaked online by the gig workers [
50]. Cases like this illustrate how AI can blur the boundary between data
processing risks and data
dissemination risks — sometimes, the act of processing data through AI requires dissemination.
4.3 Data dissemination risks
Data dissemination risks result when personal information is revealed or shared by data collectors with third-parties [
126]. AI technologies
create new data dissemination risks by enabling new ways of revealing and spreading personal data; they also
exacerbate data dissemination risks by increasing the scale and the frequency of the dissemination.
In our analysis, we found that AI technologies create new types of exposure, distortion, and disclosure risks (e.g., by reconstructing redacted content; by generating a realistic fake video of an individual; and by sharing AI-derived sensitive information about individuals with third-parties). We also found cases in which AI technologies exacerbated known disclosure risks (e.g., by sharing large-scale user data with third-parties to train models) and increased accessibility risks (e.g., by open-sourcing large-scale benchmark datasets containing user data).
4.3.1 Exposure (17/321).
Exposure risks encompass revealing sensitive private information that people view as deeply primordial and have been socialized into concealing [
126]. Traditionally, these risks arise when an individual’s private activities are recorded and disseminated to others without consent. AI technologies can create new types of exposure risks via generative techniques that can create, reconstruct, and manipulate content (i.e., deepfake techniques) (10/17), and by exposing sensitive end-user attributes inferred by AI/ML (e.g., one’s interests [
79]) (7/17).
Specifically, we found that AI can create new types of exposure risks by reconstructing censored or redacted content. For example, generative adversarial networks (e.g., TecoGAN [
31]) have been used to clarify images of censored genitalia [
91], and to “undress” people to create pornographic images without consent [
27]. Deepfake applications such as DeepFaceLive
11 or DeepFaceLab
12 can be made to morph a non-consenting subject’s face into pornographic videos. These deepfake technologies have been used to facilitate mass dog-piling and online harassment [
16] and to create illegal online pornography businesses [
4].
In our analysis, we also found that AI technologies create new risks that expose sensitive data, preferences, and intentions inferred by AI/ML. For instance, Flo, a period-tracking app, forecasts its users’ menstrual cycles and ovulation. Despite promising to maintain the privacy of personal data, Flo allegedly shared customers’ menstrual timing and intention to get pregnant with third-parties like Facebook [
119]. AI can also be built to proactively disseminate incriminating information about individuals to the public. In Shenzhen, China, a system was implemented to detect jaywalking and other offenses captured by cameras. The system identifies offenders and displays their photographs, names, and social identification numbers on LED screens placed at road junctions [
156].
4.3.2 Distortion (20/321).
Distortion refers to disseminating false or misleading information about people [
126]. Distortion risks are analogous to slander or libel, and have existed well before modern advances in AI. However, we found that AI technologies can create new types of distortion risks by exploiting others’ identities to generate realistic fake images and audio that humans have difficulty discerning as fake [
96,
139].
Some models can generate realistic audio of individuals. For example, Prime Voice AI, a text-to-voice generator, was misused to create the voices of celebrities to
“make racist remarks about Alexandria Ocasio-Cortez (the US House representative)”; the AI-generated clips
“run the gamut from harmless, to violent, to transphobic, to homophobic, to racist.” [
33,
53]. Other AI-created distortion risks are less egregious, but raise important questions about expectations around privacy in light of how generative AI can be used to simulate the likeness of those who have passed. For example, the filmmaker of a documentary was revealed to have used deepfake technology to create scenes with the synthesized voice of a person who had passed away, for lines
“he wanted [Anthony] Bourdain’s (the main character of the documentary) voice for but had no recordings of” [
77].
4.3.3 Disclosure (45/321).
Whereas distortion is the dissemination of false or misleading information, disclosure risks encompass the act of revealing and improperly sharing people’s personal data [
126]. Indeed, any computing product that collects and stores personal data can introduce disclosure risks. Our dataset includes cases where AI does not meaningfully change disclosure risks (17/45), such as sharing personal data with law enforcement or third-parties. Nevertheless, AI technologies create new types of disclosure risks by deriving or inferring additional information beyond what is explicitly captured in the raw data. We also found that AI technologies can exacerbate disclosure risks because the personal data used to train ML models are often shared with specific individuals or organizations.
Many of the disclosure risks we identified involved the creation of machine learning models that automatically infer undisclosed personal information about individuals (14/45). For example, the “Safe City” initiative in Myanmar used AI-infused cameras to identify faces and vehicle license plates in public places and alert authorities to individuals with criminal histories [
5].
AI technologies can also exacerbate disclosure risks when personal data is shared by organizations to train machine learning models (14/45). For example, the UK’s National Health Service partnered with Google to share mental health records and HIV diagnoses of 1.6 million patients to develop a model for detecting acute kidney injury [
59].
4.3.4 Increased Accessibility (23/321).
Increased accessibility refers to making it easier for a wider audience of people to access potentially sensitive information. We found incidents in which AI technologies exacerbated the scale of this risk via the public sharing of large-scale datasets containing personal information for the purpose of building and improving AI/ML models. In the AI/ML community, it is common practice to leverage open-source benchmark datasets to train AI/ML models. This open-source data sharing enables transparency and public audits of AI research and development. However, publicizing datasets also enables anyone to collect large amounts of personal data that may have otherwise been private, access-controlled, or difficult to find. For example, the “OkCupid dataset” contained data of almost seventy thousand users from the dating site OkCupid. The dataset contained personal information such as users’ location, demographics, sexual preferences, and drug use. It was uploaded to the Open Science Framework, a website that helps researchers openly share datasets and research software, to facilitate research on modeling dating behaviors [
153].
4.4 Invasion risks
The final top-level category of privacy risk Solove outlined, Invasion, can be understood as the unwanted encroachment into an individual’s personal space, choices, or activities [
126]. Solove placed two sub-categories under invasion: intrusion and decisional interference. We found incidents where AI technologies exacerbated intrusion risks, in particular.
4.4.1 Intrusion (160/321).
Intrusion risks encompass actions that disturb one’s solitude in physical space [
126]. For six of the 160 intrusion incidents we identified, we noted that the AI technologies involved did not fundamentally change the risk described in the incident: the intrusion would have remained as described even without the capabilities and/or requirements of AI. One example is the use of digital screens in stores to show customers personalized ads [
88]: the intrusion would remain even if the system did not use AI. However, we identified two ways AI can exacerbate intrusion risks by increasing their scale and ubiquity.
The capabilities of AI technologies (e.g., to identify a person and detect behaviors)
enable a centralized surveillance infrastructure that creates large-scale intrusion risks (113/160); the requirements of AI (e.g., access to vast troves of data and GPU servers) necessitate this infrastructure. For example, China Pharmaceutical University in Nanjing implemented a facial recognition system at various locations on campus to closely monitor students’ attendance and learning behaviors [
133,
163]. Similarly, employers are increasingly incorporating AI-infused workplace monitoring technologies that collect data from employees’ smartwatches [
131] and computer webcams [
144] to track their performance, absence, and time-on-task.
The capabilities of AI can also
turn everyday products (e.g., doorbells, wristbands) into powerful nodes in a ubiquitous surveillance infrastructure (41/160). For example, Ring, a smart doorbell that enables homeowners to monitor activities and conversations near where the doorbell is installed, has raised concern due to “the device’s excessive ability” to capture data of an individual’s neighbors [
90]. Similarly, Amazon’s Halo fitness tracker uses AI to analyze a user’s conversations to highlight when and how often that user spoke in a manner that was indicative of their being “happy, discouraged, or skeptical” [
101].
5 Discussion
Our findings demonstrate the many ways modern advances in AI meaningfully change privacy risks relative to how we conceived of privacy risks prior to these advances, as captured by Solove’s widely cited taxonomy of privacy [
126]. Across the 321 AI privacy incidents we analyzed, roughly 7% of the cases did not involve privacy risks that were created or exacerbated by AI. For example, we encountered instances where a product that happened to include AI was subject to a data breach in which users’ personal data was compromised [
7]. Nevertheless, in approximately 93% of the cases we analyzed, the unique capabilities and data requirements of the AI technologies involved in the incident either created a new type of privacy risk, or exacerbated a known risk.
We found that the unique capabilities of AI create new types of privacy risks. For example, AI creates new data processing risks through its ability to identify individuals even from low-quality data and its ability to forecast future outcomes. AI creates a new category of phrenology/physiognomy risks by enabling the creation of spurious classifiers correlating physical attributes with social, emotional, and personality traits. AI creates new types of data dissemination risks through its ability to generate human-like media, e.g., by generating a realistic fake video of an individual. We also found that the data requirements of AI exacerbate privacy risks we have grappled with for decades. For example, AI technologies can lead to more pervasive, larger-scale surveillance than before; exacerbate secondary use, exclusion, insecurity, disclosure, and increased accessibility risks in the processing and dissemination of personal data; and increase the ways in which computing can intrude upon people’s personal space.
Equipped with the knowledge of how AI
has changed privacy risks, we first discuss how current privacy-preserving AI/ML methods fall short, addressing only a subset of the AI privacy risks identified in our taxonomy (Section
5.1). Then, we present our taxonomy as a living structure that can be expanded with risks documented by Solove’s original taxonomy [
126] in cases where we did not find matching incidents in our incident database (Section
5.2). In theory, future advances in and/or the use of AI may entail risks in these categories, so it is worth discussing them as privacy risks that AI may change in the future. Moreover, we discuss a number of ways we expect this taxonomy might be useful for both future research and practice (Sections
5.1.1 and
5.2.1).
5.1 Charting the design space for privacy-preserving AI/ML work
Our findings broaden the design space for privacy-preserving AI and ML. For example, a recent meta-review of HAI principles and guidelines argues that privacy in ML-driven systems centers on the protection of, control over, and agency over personal data [
161]. Based on our findings, these criteria only consider a small subset of the AI privacy risks we identified: they consider some — but not all — of the data collection and processing risks exacerbated by AI, and do not at all consider the data processing and dissemination risks newly created by AI. In this section, we provide an overview of how the existing tools and approaches, which aim to help practitioners build privacy-preserving AI systems [
87,
152,
161], fall short of effectively identifying and addressing many AI privacy risks.
Differential Privacy and Federated Learning. Differential Privacy (DP) [
95] and Federated Learning (FL) [
80] are commonly thought of as approaches to “privacy-preserving” machine learning in which (1) the model output is insensitive to the presence or absence of any one individual’s data in a dataset, and (2) the model provider learns and improves the model only in an aggregated manner. Tools such as Diffprivlib
13 [
60] and IBM Federated Learning
14 [
60] have been used by practitioners to implement DP and FL in their ML products. When training an ML model, however, these approaches only apply to some data processing risks — e.g., so that the model cannot be used to re-identify data of individuals from the model outputs — and not the full range of risks we discuss in our taxonomy. Owing to these shortcomings, organizations that commonly advocate for end-user privacy rights, like the Electronic Frontier Foundation (EFF), have argued against these approaches when they are used as stand-ins for stronger privacy protections (e.g., as in the case of Google’s attempt to replace third-party browser cookies with “Federated Learning of Cohorts”) [
35]. For example, the “criminality classifier” that takes in photos of people’s faces and claims to predict their likelihood to be a criminal [
154] could be built with a federated learning architecture. Doing so would not address the physiognomy risk inherent to the idea itself, nor the exclusion and disclosure risks arising from how the data are collected and how the inferences are shared without consent.
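To make the narrow scope of these protections concrete, the following is a minimal sketch of the two underlying ideas, written by us for illustration and not drawn from Diffprivlib or any other specific library: a Laplace-mechanism release of a count query (a standard building block of DP analyses) and one round of federated weight averaging. Both guard against re-identifying an individual's record from model outputs or raw-data access; neither constrains what the model is designed to infer, nor how its inferences are used or shared.

import numpy as np

def dp_count(values, threshold, epsilon=1.0, seed=0):
    """Differentially private count via the Laplace mechanism.

    Adding or removing one person's record changes the true count by at
    most 1 (sensitivity = 1), so Laplace noise with scale 1/epsilon
    yields an epsilon-DP release of the count.
    """
    rng = np.random.default_rng(seed)
    true_count = int(np.sum(np.asarray(values) > threshold))
    return true_count + rng.laplace(loc=0.0, scale=1.0 / epsilon)

def fedavg(client_weights, client_sizes):
    """One round of federated averaging: the server only ever sees model
    weights aggregated across clients, never the clients' raw data."""
    sizes = np.asarray(client_sizes, dtype=float)
    stacked = np.stack(client_weights)                 # (n_clients, n_params)
    return (stacked * (sizes / sizes.sum())[:, None]).sum(axis=0)

# Toy usage: a noisy count of "high-risk" scores, and one FedAvg round.
print(dp_count([0.2, 0.9, 0.7, 0.1], threshold=0.5, epsilon=0.5))
print(fedavg([np.array([1.0, 2.0]), np.array([3.0, 4.0])], client_sizes=[10, 30]))

A face-based "criminality" model trained under such an architecture would still realize the phrenology/physiognomy, exclusion, and disclosure risks described above.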
Data Privacy Auditing. Prior work has created data auditing tools, such as the Privacy Meter
15 [
94], to help practitioners conduct privacy impact assessments on ML models. Doing so allows practitioners to quantify some privacy risks (e.g., vulnerability to membership inference attacks). However, because the Privacy Meter must be applied
after the model is trained, it is inherently limited in its ability to mitigate the risks that arise in the data collection and processing phases of work. In addition, similar to DP and FL, this approach takes a limited view of privacy and only applies to specific data processing risks — e.g., aggregation risks that arise from collecting sensitive personal data in the training data.
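As a sketch of the kind of post-hoc check such audits perform (illustrating the general loss-threshold membership-inference technique rather than the Privacy Meter's actual API): examples a model was trained on tend to incur lower loss than unseen examples, and the accuracy of a simple threshold rule gives a rough signal of how much the model leaks about its training set.

import numpy as np

def loss_threshold_attack_accuracy(member_losses, nonmember_losses, threshold):
    """Guess 'member' whenever the per-example loss is below the threshold.
    High accuracy suggests the model memorizes, and thus leaks, training data."""
    member_losses = np.asarray(member_losses)
    nonmember_losses = np.asarray(nonmember_losses)
    correct = np.sum(member_losses < threshold) + np.sum(nonmember_losses >= threshold)
    return correct / (member_losses.size + nonmember_losses.size)

# Hypothetical per-example losses collected after training.
train_losses = [0.05, 0.10, 0.08, 0.20]   # members of the training set
test_losses = [0.60, 0.45, 0.90, 0.30]    # non-members
print(loss_threshold_attack_accuracy(train_losses, test_losses, threshold=0.25))  # 1.0

Even a clean result from such an audit says nothing about, e.g., secondary use, exclusion, or phrenology/physiognomy risks, which arise before there is any trained model to audit.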
Ethics Checklists and Toolkits. Prior work in AI ethics has introduced many toolkits to support practitioners in ethical AI development [
152], some of which also surface privacy risks. For example, Microsoft’s Harms Modeling
16 is an activity that includes design exercises and worksheets that help
“evaluate potential ways the use of a technology you are building could result in negative outcomes for people and society,” including potential privacy risks. AI ethics checklists such as Deon
17 allow practitioners to
“add an ethics checklist to [their] data science projects,” which includes questions that prompt practitioners to reflect on the collection, storage, and analysis of data containing PII (personally identifiable information). These checklists and toolkits could help practitioners consider a broader range of the privacy risks described in our taxonomy (e.g., data collection and dissemination risks). However, these tools approach privacy risks monolithically and at a high level (e.g., privacy loss, PII exposure); they provide little guidance for practitioners to consider privacy risks newly created and/or exacerbated by AI (e.g., physiognomy and distortion risks). In other words, the use of such tools relies on practitioners’ individual awareness of AI privacy risks, which prior work has identified as a key barrier to AI privacy work [
76].
Note that all of these approaches have value, and we are not suggesting that they be abandoned. Rather, we caution against the rhetoric that it is possible to create “privacy-preserving” AI/ML technologies using these approaches alone.
5.1.1 Future Work: Creating AI-specific privacy guidance.
Given that our findings show that AI creates new types of privacy risks and exacerbates existing ones, and that current privacy-preserving AI/ML methods fall short of identifying and addressing many of these risks, there is a need for future work that mitigates the privacy risks created and exacerbated by AI. Specifically, our taxonomy opens up a new design space for privacy-preserving AI/ML tools that aim to raise practitioners’ awareness of the utility-intrusiveness trade-offs of their AI product ideas (e.g., [
41]). For example, prior work in other AI-adjacent fields, such as Robotics, has explored how to correlate desired robot function with a minimally-invasive set of sensors [
40]. In the broader context of implementing privacy and security in software products, prior work has found that practitioners still largely see privacy and security in products as an “all or nothing” notion such that privacy comes at the expense of other important objectives [
51,
130].
Future work can explore incorporating our AI privacy taxonomy into harm-envisioning techniques, such as Consequence Scanning [
38], by providing AI privacy risk prompts to capture associated negative consequences holistically. These techniques can help practitioners run lightweight privacy evaluations on AI product ideas, and help them balance the utility and intrusiveness of these products and services across design iterations. With such a tool, we hypothesize that practitioners can better advocate and design for privacy in working contexts that may dissuade this work [
76,
130].
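As one illustration of what taxonomy-derived prompts might look like in such an exercise, the sketch below is entirely hypothetical: the prompt wording and structure are ours, not a validated instrument. It keys a reflection question to each risk category a team flags for its product idea.

# Hypothetical prompts keyed to risk categories from our taxonomy; the wording
# here is illustrative only, not a validated instrument.
RISK_PROMPTS = {
    "surveillance": "Does the product increase the scale or ubiquity of data collected about people?",
    "identification": "Could the system link images, audio, or biometrics to a person's identity?",
    "aggregation": "Does the system infer attributes or forecast behavior beyond what users explicitly share?",
    "phrenology/physiognomy": "Does the system claim to infer character, preferences, or emotion from appearance or voice?",
    "exposure": "Could generated or reconstructed content reveal something a person keeps private?",
    "exclusion": "Do people get notice and control over how their data trains or feeds the model?",
}

def prompts_for(product_idea, flagged_risks):
    """Return the discussion prompts for the risks a team has flagged."""
    return [f"{product_idea}: {RISK_PROMPTS[risk]}" for risk in flagged_risks]

print(prompts_for("in-store camera analytics", ["surveillance", "identification"]))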
Our taxonomy can also anchor promising future research that foregrounds tensions across data pipelines, practices, and stakeholders (i.e., data subjects, data observers, data beneficiaries, and data victims). By mirroring the first step in Rahwan’s Society in the Loop framework [
111], AI practitioners can make concrete the envisioned value and the stakeholders of their proposed AI concepts. To assist in this process, future work can create artifacts that encourage practitioners to articulate the value proposition of their envisioned product. Based on our taxonomy, then, it may be possible to mine our database for AI privacy incidents about products that are “semantically” similar to an articulated value proposition. Showing practitioners related AI privacy incidents might then guide them to reflect on the utility-intrusiveness trade-off of their envisioned AI product ideas: for whom that value is generated (i.e., data beneficiaries), whose data is processed to unlock that value (i.e., data subjects), who can be impacted by the data pipeline (i.e., data victims), and by which privacy risk (e.g., surveillance).
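One possible instantiation of this retrieval step, sketched under the assumption that incident descriptions and value propositions are available as free text (the incident summaries below are invented placeholders), is a simple TF-IDF similarity search over the incident database:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def related_incidents(value_proposition, incident_descriptions, top_k=3):
    """Return indices of the incidents most similar to a stated value
    proposition; a naive stand-in for richer "semantic" retrieval."""
    corpus = [value_proposition] + list(incident_descriptions)
    tfidf = TfidfVectorizer(stop_words="english").fit_transform(corpus)
    similarities = cosine_similarity(tfidf[0], tfidf[1:]).ravel()
    return similarities.argsort()[::-1][:top_k]

# Toy usage with two invented incident summaries.
incidents = [
    "Facial recognition cameras installed in retail stores to identify customers.",
    "Chatbot trained on user conversations leaked names and home addresses.",
]
print(related_incidents("identify shoppers in our stores using cameras", incidents, top_k=2))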
In practice, however, this type of early-stage discussion around AI utility and privacy risk can be challenging because: (i) practitioners do not necessarily understand the full potential and limitations of AI [
159]; (ii) privacy is often treated as compliance with general regulatory mandates rather than a product-specific design choice [
143]; and, (iii) practitioners do not have access to AI-specific tools that support their privacy work pertaining to the capabilities and requirements that AI brings to their products [
76]. Accordingly, there is a need for a greater understanding of where such tools and artifacts might be effectively incorporated into practitioners’ workflows.
5.2 Theoretical extensions to the AI privacy risks taxonomy
We see our taxonomy as a living structure that helps scaffold the conversation about how advances in AI change privacy risks. But just as the capabilities and requirements of AI may change with future advances, so too might AI privacy risks. One way we might envision future AI privacy risks is by exploring the four subcategories of privacy risk in Solove’s original taxonomy [
126] for which we did not find relevant incidents in our dataset: Interrogation, Blackmail, Breach of Confidentiality, and Decisional Interference. In the future, we may observe incidents where advances in AI meaningfully change or exacerbate these risks as well.
Interrogation. Interrogation risks encompass the covert collection of data while a subject is being actively questioned [
126]. For example, lie detector tests entail interrogation risks — information beyond what an individual is saying is collected to assess the truthfulness of their words. We can envision AI both creating and exacerbating interrogation risks. Large Language Model-powered chatbots like ChatGPT, for example, could create new interrogation risks by imitating people and interacting with users in natural language, aiming to extract information from users. AI-infused affective computing technologies could exacerbate interrogation risks (e.g., [
98]): using these technologies, it may be possible to draw inferences about an individual’s demeanor from verbal (e.g., language use, tone) and non-verbal (e.g., body language, eye movements) cues.
Blackmail. Blackmail refers to coercing individuals by threatening to disclose private or sensitive information [
126]. Generative AI technologies could create new instantiations of this risk by synthesizing fake but convincing content that may serve as evidence for blackmail. We already saw incidents where incriminating content was fabricated when describing the exposure and distortion risks in our taxonomy, but we did not see this fabricated content being used for blackmail in the incidents we analyzed. Moreover, by automating the process of gathering compromising information at scale, AI can also exacerbate blackmail risks. As we have seen, ML algorithms can analyze vast datasets from social media, location services, and personal files to identify content that could be used as fodder for blackmail.
Breach of Confidentiality. Breach of Confidentiality refers to an interpersonal risk between two people where one party discloses something to the other in confidence, and the other party violates this confidence by sharing it with third-parties [
126]. AI technologies could exacerbate the scale of this risk by enabling conversational agents capable of gaining users’ trust and guiding them to share sensitive information. For example, attackers could deploy such AI systems in high-stakes scenarios like healthcare and finance, and then breach users’ confidentiality by sharing with third-parties the sensitive information users disclosed to the agent.
Decisional Interference. Decisional Interference concerns the unwanted influence over or constraint of an individual’s choices or behavior by a third-party [
126]. Solove specifically focuses on the government as the relevant third-party, but private institutions and enterprises can also be culprits for this category of risk. AI technologies can exacerbate decisional interference risks by enabling more personalized political propaganda (e.g., [
122]). AI technologies might also exacerbate the scale of existing practices of online censorship of political topics (e.g., [
23]). Algorithms for personalized recommendation or persuasive technologies can also subtly guide user choices, sometimes in ways that align more with the goals of external entities (e.g., advertisers or political campaigns) than with the individual’s own preferences or well-being.
5.2.1 Future Work: Creating a living taxonomy of AI privacy risks.
To our knowledge, our taxonomy is the first attempt to show how common AI requirements and capabilities map onto high-level privacy risks. As shown above, future AI privacy incidents can also expand the taxonomy. In addition, future AI privacy incidents may create new categories of privacy risk that go beyond Solove’s taxonomy (like the physiognomy risk we describe here). For example, many artists have voiced concerns about the theft of artistic style by generative AI [
72]. While these discussions currently center around notions of copyright and intellectual property, we can envision new types of privacy risk as well: e.g., artistic styles might contain personally identifiable or sensitive information. We envision that our taxonomy can complement ongoing crowd-sourced efforts at curating and organizing AI incidents such as the AIAAIC [
108] and AIID
18 by providing a framework to formally synthesize and identify emerging privacy risks in AI incidents. With that in mind, the research team is building a website
19 to present our taxonomy of AI privacy risks, and is also planning to expand this website to collect and aggregate submissions of new incidents related to these risks.
To present the AI privacy taxonomy in forms useful to the HCI and AI communities, future work can take an iterative approach, grounded in practitioners’ and academics’ actual design and research needs, to model the translation function between AI technology ideas and the potential risks to consider. Indeed, envisioning with AI — i.e., treating AI as a design material [
64,
158,
159,
160] — is an open and active area of research. Aligning with this line of research, future work can add to our taxonomy by systematizing AI capabilities and requirements, and the privacy risks they create and exacerbate, at a level of granularity that is useful for practitioners and researchers to ideate, communicate, and collaborate with product teams and stakeholders [
159,
160].
5.3 Limitations
We consciously took an “incident-based” approach when constructing our taxonomy. There is a great deal of hype about what AI technologies can do, blurring the lines between speculation and reality [
68]. The overabundance of speculative risks necessitated that we limit our consideration to those that journalists and the public-at-large have recognized as harmful, as chronicled in the AIAAIC database. With that in mind, our dataset should not be interpreted as inclusive or representative of every
possible privacy risk created or exacerbated by AI technologies: it is a repository of many privacy risks that have been realized in practice.
Our goal in creating this taxonomy was to codify AI privacy risks based on an accounting of documented, real-world risks. To that end, AIAAIC is currently
“the most comprehensive, detailed, and timely resource”20 that is openly accessible and has been used by the community as the source to synthesize the harms caused by AI functionality [
113]. To mitigate the sampling bias introduced by our use of the AIAAIC, we tested the database’s coverage by independently collecting a list of 15 AI privacy incidents from various sources (e.g., social media posts and academic literature). Of the 15 incidents we collected, 13 were also included in AIAAIC. For the two incidents that were not included, we found very similar incidents in the database — i.e., similar privacy risks caused by the same technology (e.g., face recognition software) but involving different products. As a comparison, we applied the same procedure to the AIID database and found only five of the incidents included. Thus, we believe that AIAAIC currently provides a pool of AI privacy incidents comprehensive enough for our goal.
We acknowledge that there will be a growing number of AI incidents, and that there may be existing AI incidents that were not captured in our dataset. For example, prior work has surfaced how algorithmic recommender systems can amplify embarrassing exposures through online social networks [
30]. Nevertheless, our taxonomy provides a solid foundation for understanding how the capabilities and requirements of AI change privacy risks. Since we ground our taxonomy on Solove’s taxonomy of privacy, which has remained highly influential and largely appropriate for nearly two decades, we are confident that our updated taxonomy can be flexibly adapted to encompass new risks if and as they are realized beyond academic inquiry.
Finally, we acknowledge that “privacy” is a broad and context-dependent concept that is susceptible to biased interpretation based on the research team’s background. We are an interdisciplinary research team with diverse expertise across HCI, AI, security and privacy, policy, and design. We mitigated the potential for bias by: (i) building our taxonomy on top of Solove’s existing and widely accepted taxonomy; (ii) ensuring that multiple coders independently agreed on the risks entailed (or not) by a specific incident; and (iii) dutifully analyzing all incidents in the AIAAIC database that were independently characterized by people outside of our research team as being privacy-pertinent.
6 Conclusion
In this paper, we conducted a systematic analysis of documented incidents of AI privacy risks to answer the question: How do modern advances in AI and ML change privacy risks? Our taxonomy, constructed from a corpus of 321 documented AI privacy incidents, reveals that while the incorporation of AI technologies into products does not
necessarily change the privacy risks those products might entail, it often does. AI can create new types of privacy risks when processing and disseminating end-user data. We showed, for example, that the unique capabilities of AI technologies (e.g., the ability to generate realistic but fake images) create new types of privacy risks (e.g., exposure risks from deepfake pornography [
16]). The taxonomy also reveals that the data requirements of AI technologies can exacerbate known privacy risks. For example, owing to the unique ability of AI to automatically identify individuals from low-fidelity images, governments are more motivated to capture facial images of all passengers that pass through major transportation hubs (e.g., [
42]). Our work suggests that AI-specific design guidance is needed for practitioners to negotiate the utility-intrusiveness trade-offs of AI-powered user experiences, and that many existing approaches to privacy-preserving machine learning (e.g., federated learning [
80]) address only a small subset of the unique privacy risks entailed by AI technologies.