
Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses


Micah Goldblum*1, Dimitris Tsipras2, Chulin Xie3, Xinyun Chen4, Avi Schwarzschild1,
Dawn Song4, Aleksander Madry2, Bo Li3, and Tom Goldstein†1

1 University of Maryland
2 Massachusetts Institute of Technology
3 University of Illinois Urbana-Champaign
4 University of California, Berkeley

arXiv:2012.10544v2 [cs.LG] 30 Dec 2020

* goldblum@umd.edu
† tomg@cs.umd.edu

Abstract
As machine learning systems grow in scale, so do their training data requirements, forcing
practitioners to automate and outsource the curation of training data in order to achieve state-
of-the-art performance. The absence of trustworthy human supervision over the data collection
process exposes organizations to security vulnerabilities; training data can be manipulated to
control and degrade the downstream behaviors of learned models. The goal of this work is
to systematically categorize and discuss a wide range of dataset vulnerabilities and exploits,
approaches for defending against these threats, and an array of open problems in this space.
In addition to describing various poisoning and backdoor threat models and the relationships
among them, we develop a unified taxonomy of these threats.

1 Introduction
Traditional approaches to computer security isolate systems from the outside world through a
combination of firewalls, passwords, data encryption, and other access control measures. In
contrast, dataset creators often invite the outside world in — data-hungry neural network models
are built by harvesting information from anonymous and unverified sources on the web. Such
open-world dataset creation methods can be exploited in several ways. Outsiders can passively
manipulate datasets by placing corrupted data on the web and waiting for data harvesting bots
to collect them. Active dataset manipulation occurs when outsiders have the privilege of sending
corrupted samples directly to a dataset aggregator such as a chatbot, spam filter, or database of
user profiles. Adversaries may also inject data into systems that rely on federated learning, in which
models are trained on a diffuse network of edge devices that communicate periodically with a
central server. In this case, users have complete control over the training data and labels seen by
their device, in addition to the content of updates sent to the central server. The exploitability of web-
based dataset creation is illustrated by the manipulation of the Tay chatbot (Wakefield 2016), the
presence of potential malware embedded in ImageNet files (Anonymous 2020), and manipulation
of commercial spam filters (Nelson et al. 2008). A recent survey of industry practitioners found
that organizations are significantly more afraid of data poisoning than other threats of adversarial
machine learning (Kumar et al. 2020).
The goal of this article is to catalog and systematize vulnerabilities in the dataset creation
process, and review how these weaknesses lead to exploitation of machine learning systems. We
will address the following dataset security issues:
• Training-only attacks: These attacks entail manipulating training data and/or labels and require
no access to test-time inputs after a system is deployed. They can be further grouped by
the optimization formulation or heuristic used to craft the attack and whether they target a
from-scratch training process or a transfer learning process that fine-tunes a pre-trained model.
• Attacks on both training and testing: These threats are often referred to as “backdoor attacks” or
“trojans.” They embed an exploit at train time that is subsequently invoked by the presence of
a “trigger” at test time. These attacks can be further sub-divided into model-agnostic attacks
and model-specific attacks that exploit a particular neural network architecture. Additional
categories of attacks exist that exploit special properties of the transfer learning and federated
learning settings.
• Defenses against dataset tampering: Defense methods can either detect when poisoning has taken
place or produce an unaffected model using a training process that resists poisons. Detection
methods include those that identify corrupted training instances, in addition to methods for
flagging corrupted models after they have been trained. Training-based defenses may avoid the
consequences of poisoning altogether using robust training routines, or else perform post-hoc
correction of a corrupted model to remove the effects of poisoning.
In our treatment, we also discuss various threat models addressed in the literature and how
they differ from each other. Finally, for each of the three topics above, we discuss open problems
that, if solved, would increase our understanding of the severity of a given class of attacks, or
enhance our ability to defend against them.

2 Training-Only Attacks

Figure 1: A taxonomy of training-only data poisoning attacks. The attacks are grouped by methodology into feature collision, bilevel optimization, p-tampering, influence functions, label flipping, vanishing gradients, generative models, and model poisoning.

A number of data poisoning strategies manipulate training data without the need to modify
test instances in the field after the victim model is deployed. Training-only attacks are salient in
scenarios where training data is collected from potentially compromised online sources. Such
sources include social media profiles, where users can manipulate text or embed exploits in images
scraped for facial recognition. Figure 1 contains a visual depiction of the taxonomy of training-only
attacks based on their methodologies.

2.1 Applications of Data Poisoning


The broad range of applications of training-only attacks includes both targeted attacks in which
the attacker seeks to change the behavior of the model on particular inputs or individuals, and
untargeted attacks where the attacker’s impact is meant to indiscriminately affect model behavior on
a wide range of inputs. An example of a targeted poisoning attack is Venomave (Aghakhani et al.
2020), which attacks automatic speech recognition systems to alter the model’s classification of a
particular person’s utterance of a specific numerical digit. An example of an untargeted attack is
one that reduces algorithmic fairness at the population level (Solans et al. 2020).
Poisoning attacks reveal vulnerabilities not only in neural networks, but in simple classical
models as well. For example, the spam filtering algorithm SpamBayes (Meyer and Whateley 2004),
which uses a naive Bayes classifier to filter email, is susceptible to poisoning attacks (Nelson et al.
2008). By including many words from legitimate emails in messages labeled as spam in the training
set, the attacker can increase the spam score on legitimate emails at test time. When the attacker
has access to a sample of the victim’s legitimate email, the attacker can use the distribution of the
words from that sample to craft undetectable spam. Without such access, the attacker can include
samples from a dictionary of words associated with legitimate email or spam (Nelson et al. 2008).
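As a rough illustration of this dictionary-style attack, the sketch below trains a generic word-count naive Bayes scorer on a handful of hypothetical toy messages (it is not the SpamBayes implementation); labeling poison messages stuffed with legitimate vocabulary as spam inflates the spam likelihood ratios of those words and, with them, the score assigned to a clean test email.

```python
from collections import Counter

def word_spam_scores(emails, labels, alpha=1.0):
    """Per-word P(word|spam)/P(word|ham) ratios with Laplace smoothing."""
    spam, ham = Counter(), Counter()
    for text, y in zip(emails, labels):
        (spam if y == "spam" else ham).update(text.split())
    vocab = set(spam) | set(ham)
    n_spam, n_ham = sum(spam.values()), sum(ham.values())
    return {w: ((spam[w] + alpha) / (n_spam + alpha * len(vocab))) /
               ((ham[w] + alpha) / (n_ham + alpha * len(vocab))) for w in vocab}

def spam_score(text, ratios):
    # product of per-word likelihood ratios; > 1 leans toward spam
    score = 1.0
    for w in text.split():
        score *= ratios.get(w, 1.0)
    return score

clean_train = [("meeting agenda attached", "ham"), ("cheap pills buy now", "spam")]
# poison: spam-labeled messages containing words the victim uses legitimately
poison = [("meeting agenda budget report", "spam")] * 3

target = "quarterly budget meeting"
for name, data in [("clean", clean_train), ("poisoned", clean_train + poison)]:
    ratios = word_spam_scores(*zip(*data))
    print(name, spam_score(target, ratios))
```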
Recommendation systems. One well-studied application of data poisoning is recommendation
systems (Li et al. 2016, Fang et al. 2018, Hu et al. 2019, Fang et al. 2020). In this application, the
attacker modifies training data either to degrade accuracy overall or to promote a target item at test
time. Matrix factorization based recommender systems are vulnerable to such attacks in addition to
attacks wherein the poison data is designed to appear legitimate (Li et al. 2016). Matrix factorization
methods can also be exploited by computing maximally damaging combinations of product ratings
using integer programming and then deploying these using fake users (Fang et al. 2020). When
systems select products based on user preferences and product properties, fake users can similarly
inject malicious product ratings to cause promotion of target items to real customers (Fang et al.
2018). In social networking settings, poisoning attacks can artificially promote individuals or
groups as recommended connections for users (Hu et al. 2019).
Differential privacy. Differentially private (DP) training is a framework for learning from data
without relying heavily on or memorizing individual data samples. While originally developed
to preserve user privacy, DP training also conveys a degree of resistance to data poisoning since
modifications to a small number of samples can have only a limited impact on the resulting model
(Ma et al. 2019). Bounds on the poisoning vulnerability of recommendation systems that use DP
matrix factorization techniques have been derived and tested empirically (Wadhwa et al. 2020).
At the same time, differential privacy can also enable poisoning, as attackers can mask their
behavior by manipulating data and model updates before DP data aggregation is applied to
harvest data on a privacy-preserving central server (Cao et al. 2019). Similarly, crowd sensing
systems collect sensory data from people carrying sensor-rich hardware like smartphones.
These systems often employ techniques to discern which workers are contributing truthful data,
since such aggregation is prone to malicious activity. Even with truth discovery as a standard
component of these systems, two poisoning attacks have been developed to harm the integrity of
crowd sensing systems (Miao et al. 2018).
Reinforcement learning. Reinforcement learning algorithms are also susceptible to poisoning
attacks (Ma et al. 2018, Liu and Shroff 2019). Contextual bandits, often used in adaptive medical
treatment, can be manipulated by malicious changes to the rewards in the data (Ma et al. 2018).
Online learning algorithms can also be poisoned. For example, an attacker targeting stochastic
bandits may perturb the reward after a decision with the goal of convincing the agent to pull a
suboptimal arm (Liu and Shroff 2019). Other online learning algorithms have been shown to be
vulnerable to an attack that is formulated as a stochastic optimal control problem (Zhang et al.
2020). In this setting, an attacker can intercept a data stream providing sequential data to an online
learner and craft perturbations by adopting either a model-based planning algorithm or a deep
reinforcement learning approach (Zhang et al. 2020).
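The reward-perturbation idea can be illustrated with a toy simulation. The sketch below is a simplification rather than the attack of Liu and Shroff (2019): a hypothetical epsilon-greedy learner faces two arms, and the attacker subtracts a fixed amount from the observed reward whenever the learner pulls a non-target arm, which is enough to steer exploitation toward the attacker's preferred, suboptimal arm.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = [0.7, 0.5]          # arm 0 is truly optimal
target_arm = 1                   # arm the attacker wants pulled
eps, attack_shift = 0.1, 0.3     # exploration rate, per-pull reward perturbation

def run(attack):
    counts, sums, pulls_of_target = np.zeros(2), np.zeros(2), 0
    for t in range(5000):
        if rng.random() < eps or counts.min() == 0:
            arm = rng.integers(2)                 # explore
        else:
            arm = int(np.argmax(sums / counts))   # exploit empirical means
        reward = rng.normal(true_means[arm], 0.1)
        if attack and arm != target_arm:
            reward -= attack_shift                # poison rewards of non-target arms
        counts[arm] += 1
        sums[arm] += reward
        pulls_of_target += (arm == target_arm)
    return pulls_of_target / 5000

print("fraction of target-arm pulls, clean:   ", run(attack=False))
print("fraction of target-arm pulls, poisoned:", run(attack=True))
```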
Facial recognition. Social media users often post public images, which are in turn scraped by
private or governmental organizations for unauthorized purposes including facial recognition. A
number of authors have proposed poisoning face images on social media sites to prevent their
use in facial recognition systems. Fawkes uses feature collisions to create images that cannot be
matched to their clean counterparts by a classifier trained on small image datasets (Shan et al.
2020), but this method was subsequently found to be ineffective against real-world systems. The
LowKey system (Cherepanova et al. 2020) uses large datasets, data augmentation, and an ensemble
of surrogate models to create perturbations that evade matching by state-of-the-art and commercial
identification systems. A similar approach, Face-Off, has recently been used to attack commercial
APIs, and this work also considers the issue of adversarial defense against these attacks (Gao et al.
2020).
Federated learning. The distributed nature of federated learning raises a number of unique
issues in dataset security. Because model updates are harvested from a diffuse network of untrusted
users, very strong adversarial threat models are realistic. For example, it is reasonable to assume
that an adversary has complete control over input data to the system, labels on training data, and
model updates broadcast to the central server (Bhagoji et al. 2019, Sun et al. 2020). At the same time,
federated learning adversaries are weak in that they only have access to a small slice of the training
data – the entire training system comprises many users that sample from diverse distributions
unknown to the attacker (Tolpegin et al. 2020).
Finally, user privacy is a major issue in federated learning, and privacy may be at odds with
security. Secure aggregation uses cryptographic methods to “mask” each user’s update before it is
sent to the central server, making it impossible to screen incoming models for corruptions (Bhagoji
et al. 2019). Similar issues with differential privacy are discussed above.

2.2 Feature Collision Attacks


Feature collision attacks operate in the targeted data poisoning setting, in which the objective is to
perturb training data so that a particular target example, $x_t$, from the test set is misclassified into
the base class. Feature collision methods for targeted data poisoning perturb training images from
the base class so that their feature-space representations move towards that of the target example.
Intuitively, the attacker hopes that by saturating the region of feature space surrounding the target
example with samples from the base class, the learning algorithm will classify this region into the
base class, thus misclassifying the target example. We begin with a description of the original attack
from Shafahi et al. (2018), and we then explore a variety of other feature collision attacks which
have emerged since this work.
Following the notation of Schwarzschild et al. (2020), let perturbed poison examples be denoted by $X_p = \{x_p^{(j)}\}_{j=1}^J$ and the corresponding samples from the original training dataset by $X_b = \{x_b^{(j)}\}_{j=1}^J$, where the latter are taken from the base class. The collision attack fixes a pre-trained feature extractor, $f$, and solves the following optimization problem:

$$x_p^{(j)} = \operatorname*{argmin}_{x} \; \big\| f(x) - f(x_t) \big\|_2^2 + \beta \big\| x - x_b^{(j)} \big\|_2^2. \tag{1}$$

The first term in this loss function encourages the feature vector extracted from the poison example
to lie close to that of the target, while the second term encourages the poison to stay close to the
corresponding original sample in pixel space. This second term promotes the clean-label property
whereby the poisoned images look like their respective base images, and thus appear to be labeled
correctly. A variant of this attack, BlackCard (Guo and Liu 2020), adds two additional loss terms
that explicitly encourage the poison image to be given the same label as the base image while lying
far away from the base image in feature space. These modifications lead to superior transferability
to black-box models.
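A minimal PyTorch sketch of the optimization in Equation (1) is given below. It assumes a frozen feature extractor and attacker-supplied tensors x_base and x_target, and it takes plain gradient steps on the combined objective rather than using the forward-backward splitting scheme of the original paper; it is an illustration of the objective, not the authors' released code.

```python
import torch

def craft_poison(feature_extractor, x_base, x_target, beta=0.1, steps=500, lr=0.01):
    """Gradient descent on ||f(x) - f(x_t)||^2 + beta * ||x - x_b||^2 (Equation 1)."""
    feature_extractor.eval()
    with torch.no_grad():
        target_feats = feature_extractor(x_target)       # fixed target representation
    x_poison = x_base.clone().requires_grad_(True)
    opt = torch.optim.Adam([x_poison], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        collision = (feature_extractor(x_poison) - target_feats).pow(2).sum()
        proximity = beta * (x_poison - x_base).pow(2).sum()
        (collision + proximity).backward()
        opt.step()
        with torch.no_grad():
            x_poison.clamp_(0, 1)                        # keep a valid image
    # the crafted poison is added to the training set with its original base-class label
    return x_poison.detach()
```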
A different style of feature collision attacks aims instead to surround the target by poisons in
feature space so that the feature vectors corresponding to poison examples are the vertices of a
convex polytope containing the target’s feature vector (Zhu et al. 2019). These attacks anticipate
that the whole region inside the convex polytope will be classified as the base class, resulting in
better attack reliability compared to a simple feature collision attack. The creators of the Bullseye
Polytope attack (Aghakhani et al. 2020) notice that the target image often lies far away from the
center of the polytope, leading to failed attacks. They optimize the vertices of the polytope with
the constraint that the target image is the mean of the poison feature vectors, resulting in boosted
reliability. Both polytope-based methods compute their attack on an ensemble of models to achieve
better transferability in the black-box setting. Nonetheless, in Schwarzschild et al. (2020), feature
collision methods are shown to be brittle in the black-box setting when the victim’s architecture
and training hyperparameters are unknown.
Feature collision poisoning methods are well-suited for the transfer learning setting, in which
a model is pre-trained on clean data and fine-tuned on a smaller poisoned dataset. This may be
done by freezing the feature extractor and only fine-tuning a linear classification head. In this
setting, the attacker crafts poisons on their own surrogate models and anticipates that the feature
representations of the victim model will be similar to those of their surrogate. This threat model is
quite realistic in the setting of model-reuse attacks, which exploit the fact that most transfer learning
is done from standard public reference models (e.g. the pre-trained ImageNet models that ship
with standard libraries). An attacker can break many transfer learned systems by collecting a large
battery of standard models and creating poisons that cause feature collisions for all of them (Ji et al.
2018). In Section 3, we also discuss a second style of transfer learning attack in which pre-training
data, rather than fine-tuning data, is poisoned so that even when a model is fine-tuned on clean
data, poisoning persists.

2.3 Bilevel Optimization
Bilevel optimization methods for data poisoning work by simulating a training pipeline, and then
optimizing through this pipeline to directly search for poison data that result in corrupted models.
While feature collision attacks are most effective when deployed against transfer learning, bilevel
methods are highly effective against both transfer learning and end-to-end training. Simple bilevel
formulations rely on a problem of the form

$$\min_{X_p \in \mathcal{C}} \; \mathcal{L}\big(F(x_t, \theta'), y_{\mathrm{adv}}\big) \quad \text{subject to} \quad \theta' = \operatorname*{argmin}_{\theta} \mathcal{L}\big(F(X_p \cup X_c, \theta), Y\big), \tag{2}$$

where $\mathcal{C}$ denotes a set of permissible poisons (for example, within an $\ell_\infty$ ball around clean training data), and $F$ denotes a neural network with parameters $\theta$. In words, this formulation searches for poison images with the property that, after training on both poisoned and clean images to obtain parameters $\theta'$, the resulting model places the target image into the desired class. Some works in this
space also investigate additional threat models such as untargeted attacks in which the adversary
seeks to maximize average test loss rather than targeted misclassification on a single test sample
(Muñoz-González et al. 2017, Biggio et al. 2012).
Early works on data poisoning for classical models solve a bilevel optimization problem, but
they require that the inner problem can be solved exactly (Biggio et al. 2012, Xiao et al. 2015, Mei
and Zhu 2015, Koh and Liang 2017). Biggio et al. (2012) use this approach to induce general
performance degradation in support vector machines. Similar algorithms have poisoned LASSO,
ridge regression, and the elastic net for feature selection in malware detection (Xiao et al. 2015).
Another work poisons SVM on data streams (Burkard and Lagesse 2017), and is the first to study
targeted data poisoning rather than general performance degradation. Mei and Zhu (2015) prove
that the bilevel optimization approach yields optimal training set attacks on simple models. Jagielski
et al. (2018) further improve the performance of bilevel optimization based attacks on regression by
strategically selecting which data to poison and also manipulating response variables. Their work
additionally provides an improved defense with formal guarantees.
For neural networks and other non-convex problems, bilevel optimization is more complex.
Muñoz-González et al. (2017) perform bilevel optimization with neural networks using a method
they call “back-gradient descent” in which the inner problem is approximately solved using
several steps of gradient descent. A gradient descent step is then conducted on the outer loss
by back-propagating through the inner minimization routine. Differentiating through multiple
steps of the inner SGD is memory intensive, and so one poison sample is crafted at a time rather
than optimizing all poisons jointly. Additionally, in Muñoz-González et al. (2017), the
back-gradient method is only applied to a single-layer network.
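The unrolling idea can be sketched on a toy linear softmax model (hypothetical tensors X_clean, Y_clean, x_target, y_adv; a simplification, not the exact back-gradient algorithm): a few inner SGD steps are differentiated through so that the poison perturbation receives a gradient from the outer, adversarial loss.

```python
import torch
import torch.nn.functional as F

def poison_grad(delta, X_poison_base, Y_poison, X_clean, Y_clean,
                x_target, y_adv, inner_steps=5, inner_lr=0.1):
    """Gradient of the adversarial target loss w.r.t. the poison perturbation,
    obtained by differentiating through a short, unrolled inner training loop."""
    X = torch.cat([X_clean, X_poison_base + delta])     # poisoned training set
    Y = torch.cat([Y_clean, Y_poison])
    d, k = X.shape[1], int(Y.max()) + 1
    W = torch.zeros(d, k, requires_grad=True)           # linear softmax model
    for _ in range(inner_steps):                         # unrolled inner SGD
        inner_loss = F.cross_entropy(X @ W, Y)
        (g,) = torch.autograd.grad(inner_loss, W, create_graph=True)
        W = W - inner_lr * g                             # update keeps the graph
    outer_loss = F.cross_entropy(x_target @ W, y_adv)    # adversary's objective
    return torch.autograd.grad(outer_loss, delta)[0]

# usage sketch (x_target has shape (1, d) and y_adv shape (1,)):
#   delta = torch.zeros_like(X_poison_base, requires_grad=True)
#   g = poison_grad(delta, X_poison_base, Y_poison, X_clean, Y_clean, x_target, y_adv)
#   with torch.no_grad():
#       delta = (delta - 0.1 * g.sign()).clamp_(-8 / 255, 8 / 255).requires_grad_(True)
```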
A recent method, MetaPoison (Huang et al. 2020), scales up this unrolling approach to bilevel
optimization using realistic network architectures and training processes. MetaPoison employs
an ensembling method that uses models pre-trained with various numbers of epochs so that
poison images are more likely to influence models during all parts of training. Additionally,
MetaPoison crafts all poisons simultaneously and uses adversarial perturbations to the color
mapping of an image to preserve the clean label property. This work emphasizes the transferability
of poisons in the black-box setting, and successfully poisons an industrial API, Google Cloud
AutoML. MetaPoison achieves significantly better performance than earlier methods, including
feature collision techniques, in both the fine-tuning and from-scratch regimes on modern neural
networks. The large-scale ensembling is expensive, however, requiring numerous GPUs to run
efficiently (Huang et al. 2020).
Witches’ Brew (Geiping et al. 2020) improves on MetaPoison by introducing a “gradient align-
ment” objective that encourages the gradient of the loss on poison data to match the gradient of the
adversary’s target loss. When these two gradients are aligned, standard gradient descent steps for
training on the poisoned images will also decrease the adversarial loss on the target images, causing
model poisoning. Improved computational efficiency enables this work to conduct the first targeted
data poisoning of from-scratch training on ImageNet. The method was also demonstrated to break
the Google Cloud AutoML API on ImageNet training. Schwarzschild et al. (2020) benchmark
both the transfer learning and from-scratch settings of targeted data poisoning, and they find that
while Witches’ Brew is the highest performing method on the from-scratch CIFAR-10 (Krizhevsky
2009) problem, Bullseye Polytope actually outperforms this method on the higher dimensional
Tiny-ImageNet dataset (Le and Yang 2015).
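The gradient-alignment objective itself is compact. The sketch below assumes a PyTorch classifier, a perturbation delta applied to the poison images, and attacker-chosen x_target and y_adv; it maximizes the cosine similarity between the training gradient on the poisons and the gradient of the adversarial target loss, whereas the published method additionally uses model ensembling, restarts, and differentiable data augmentation.

```python
import torch
import torch.nn.functional as F

def alignment_loss(model, x_poison_base, delta, y_poison, x_target, y_adv):
    """Negative cosine similarity between the poison-training gradient and the
    gradient of the adversary's target loss (both w.r.t. model parameters)."""
    params = [p for p in model.parameters() if p.requires_grad]

    # gradient the victim would compute when training on the perturbed poisons
    poison_loss = F.cross_entropy(model(x_poison_base + delta), y_poison)
    g_poison = torch.autograd.grad(poison_loss, params, create_graph=True)

    # gradient that pushes the target example toward the adversarial label
    target_loss = F.cross_entropy(model(x_target), y_adv)
    g_target = [g.detach() for g in torch.autograd.grad(target_loss, params)]

    dot = sum((gp * gt).sum() for gp, gt in zip(g_poison, g_target))
    norms = (torch.sqrt(sum(gp.pow(2).sum() for gp in g_poison)) *
             torch.sqrt(sum(gt.pow(2).sum() for gt in g_target)))
    return -dot / norms   # minimizing this aligns the two gradients

# usage sketch: delta = torch.zeros_like(x_poison_base, requires_grad=True)
#               alignment_loss(...).backward(); step on delta and project it
#               onto the permissible perturbation set, e.g. an l-infinity ball
```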
Solving the bilevel optimization problem is computationally expensive, even using Witches’
Brew. Several methods thus propose to bypass this problem by training generative models such as
GANs and autoencoders to produce poisons that mimic those generated by bilevel methods (Yang
et al. 2017, Muñoz-González et al. 2019). After a generative model is trained, poisons can be crafted
simply by conducting a forward pass through the generator or the autoencoder. However, these
methods are less effective, and the cost of training the original generative model is high.
Instead of aligning the gradient of training data with a targeted misclassification loss, Tensor-
Clog (Shen et al. 2019) poisons data to cause a vanishing gradient problem during training. This
method perturbs training data so that neural networks have gradients of very low magnitude and
thus do not train effectively while also ensuring that perturbed images have a high value of SSIM
with respect to the originals. TensorClog both prevents networks from achieving low loss during
training and degrades validation performance. However, this strategy is only effective in
the white-box transfer learning setting when the feature extractor is known and fixed, and even in
this idealized setting, the attack is only mildly successful (Shen et al. 2019).

2.4 Label Flipping


Label flipping attacks opt to switch training labels while leaving the data instances untouched.
While these attacks are not “clean-label”, they have the advantage of not introducing strange-
looking artifacts, which may be obvious to the intended victim. Biggio et al. (2011) use both random
and adversarial label flips to poison support vector machines. Their work shows that flipping the
labels of an adversarially chosen data subset can cause poisoning, even against learners trained
in a robust fashion. Zhao et al. (2017) show that a projected gradient ascent approach to label
flipping is effective even against black-box linear models ranging from SVM to logistic regression
and LS-SVM. Zhang and Zhu (2017) provide a theoretical analysis of label flipping attacks on
SVM using tools from game theory. In response to these attacks, a number of defenses, theoretical
and empirical, have emerged against label flipping (Paudice et al. 2018, Rosenfeld et al. 2020). In
regression, response variables can be manipulated to enhance data-perturbation based poisoning
(Jagielski et al. 2018).
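A brute-force version of adversarial label flipping illustrates the threat model. The sketch below operates on hypothetical arrays X_train, y_train, X_val, y_val with binary labels and simply retrains a scikit-learn logistic regression after each candidate flip; published attacks replace this inner loop with gradient-based or game-theoretic selection.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import log_loss

def greedy_label_flips(X_train, y_train, X_val, y_val, budget=10):
    """Greedily flip the binary training labels that most increase validation loss."""
    y_poisoned = y_train.copy()
    flipped = []
    for _ in range(budget):
        best_idx, best_loss = None, -np.inf
        for i in range(len(y_poisoned)):
            if i in flipped:
                continue
            y_try = y_poisoned.copy()
            y_try[i] = 1 - y_try[i]                           # candidate flip
            clf = LogisticRegression(max_iter=1000).fit(X_train, y_try)
            loss = log_loss(y_val, clf.predict_proba(X_val))  # damage to the victim
            if loss > best_loss:
                best_idx, best_loss = i, loss
        y_poisoned[best_idx] = 1 - y_poisoned[best_idx]
        flipped.append(best_idx)
    return y_poisoned, flipped
```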

2.5 Influence Functions
Influence functions estimate the effect of an infinitesimal change to training data on the model
parameters that result from training, which can be leveraged to construct poisoning instances. A
function for measuring the impact of up-weighting a data sample can be written as

$$\mathcal{I}(x) = -H_{\theta'}^{-1} \nabla_\theta \mathcal{L}\big(f_{\theta'}(x)\big), \quad \text{where} \quad \theta' = \operatorname*{argmin}_{\theta} \sum_{i=1}^{n} \mathcal{L}\big(f_\theta(x_i)\big), \tag{3}$$

and $H_{\theta'}$ denotes the Hessian of the loss at parameters $\theta'$. This influence function can be
leveraged to compute the influence that removing a particular data point would have on test
loss (Koh and Liang 2017). This approach can be used to study how individual training samples
are responsible for specific inference-time predictions on linear models whose input features are
extracted by a neural network. That work also extends the influence function method to create
“adversarial training examples” as well as to correct mislabelled data (Koh and Liang 2017). Koh
et al. (2018) leverage such influence functions to create stronger poisoning attacks by accelerating
bilevel optimization. Fang et al. (2020) adopt this approach to conduct data poisoning attacks on
recommender systems. It is worth noting that, while influence functions have been successful on
classical machine learning algorithms, they do not effectively capture data dependence in modern
deep neural networks which have highly non-convex loss surfaces (Basu et al. 2020).
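For a small convex model, Equation (3) can be evaluated directly. The sketch below computes the influence of a single training point on the loss of a test point for a logistic-regression model with parameter vector theta (a hypothetical setup; for deep networks, Koh and Liang (2017) replace the explicit Hessian inverse with Hessian-vector-product approximations).

```python
import torch
import torch.nn.functional as F
from torch.autograd.functional import hessian

def influence_on_test_loss(theta, X, y, x_point, y_point, x_test, y_test, damping=1e-3):
    """I = -grad_test^T H^{-1} grad_point for a logistic model with parameters theta."""
    def total_loss(t):
        return F.binary_cross_entropy_with_logits(X @ t, y)   # y holds float 0/1 labels

    H = hessian(total_loss, theta) + damping * torch.eye(theta.numel())

    def point_grad(x, label):
        t = theta.clone().requires_grad_(True)
        loss = F.binary_cross_entropy_with_logits((x @ t).view(1), label.view(1))
        return torch.autograd.grad(loss, t)[0]

    g_point, g_test = point_grad(x_point, y_point), point_grad(x_test, y_test)
    # following Koh and Liang (2017), removing the point changes the test loss
    # by approximately -(1/n) times the returned value
    return -g_test @ torch.linalg.solve(H, g_point)
```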

2.6 Online Poisoning


The concept of online poisoning originated with the study of data corruption where each bit is
perturbed with probability p (Valiant 1985, Kearns and Li 1993). When these attacks are restricted
to be clean-label, they are known as p-tampering attacks and were originally studied in the context
of security (Austrin et al. 2014). Mahloujifar and Mahmoody (2017) consider the setting of targeted
poisoning and expand the idea to block-wise p-tampering, where entire samples in the training set
of a learning pipeline can be modified by an adversary. This work studies poisoning attacks from a
theoretical standpoint and includes several results concerning the vulnerability of a model given
the portion p of the data that is perturbed. Specifically, Mahloujifar and Mahmoody prove the
existence of effective targeted clean label attacks for deterministic learners. Mahloujifar et al. (2018)
further improve the quantitative bounds. Moreover, they extend p-tampering to the untargeted
case in the PAC learning setting, where they prove the existence of attacks that degrade confidence.
Further theoretical work relates p-tampering attacks to concentration of measure, wherein the
existence of poisoning attacks against learners over metric spaces is proven (Mahloujifar et al. 2019).

2.7 Data Poisoning in Federated Learning


In the federated learning setting, an adversary can insert poisons at various stages of the training
pipeline, and attacks are not constrained to be “clean-label” since the victim cannot see the attacker’s
training data. Tolpegin et al. (2020) study targeted label flipping attacks on federated learning. Their
work finds that poisons injected late in the training process are significantly more effective than
those injected early. Another work instead adopts a bilevel optimization approach to poisoning
multi-task federated learning (Sun et al. 2020). Others instead opt for GAN-generated poisons
(Zhang et al. 2019).

Model poisoning is a unique poisoning strategy for federated learning; in contrast to data
poisoning, the adversary directly manipulates their local model or gradient updates without the
need to modify data or labels. One such approach performs targeted model poisoning in which
the adversary “boosts” their updates to have a large impact on the global model while still staying
beneath the radar of detection algorithms (Bhagoji et al. 2019). The approach is effective even
against Byzantine-resilient aggregation strategies. Another approach shows that Byzantine-robust
federated learning methods can be broken by directly manipulating local model parameters in a
“partial knowledge” setting in which the attacker has no knowledge of local parameters of benign
worker devices (Fang et al. 2020).
Sybils are groups of colluding agents that launch an attack together. Cao et al. (2019) develop
a distributed label flipping attack for federated learning and study the relationship between the
number of sybils and the attack’s effectiveness. Fung et al. (2018) also study label flipping based
sybil attacks, as well as backdoors, for federated learning. Their work additionally proposes a
defense called “FoolsGold” that de-emphasizes groups of clients whose contributions to the model
are highly similar.
Theoretical work has studied p-Tampering for multi-party learning, of which federated learning
is a special case (Mahloujifar et al. 2019). However, this work assumes that the adversary has
knowledge of updates generated by other benign parties.

2.8 Open Problems


• Accelerated data poisoning for from-scratch training: While bilevel optimization based
data poisoning algorithms are significantly more effective than feature collision at poisoning
neural networks trained from scratch, they are also computationally expensive. Developing
effective data poisoning algorithms for industrial-scale problems is a significant hurdle.

• Attacking with limited information about the dataset: Existing training-only attacks im-
plicitly assume the attacker knows the whole dataset being used for training. More realistic
settings would involve an attacker with limited knowledge of the dataset, or even of the exact
task being solved by the victim.

• True clean-label attacks: Existing works often permit large input perturbation budgets,
resulting in poison images that are visibly corrupted. The adversarial attack literature has
recently produced a number of methods for crafting nearly imperceptible adversarial examples.
Adapting these methods to a data poisoning setting is a promising avenue towards truly
clean-label attacks.

• Fair comparison of methods: Experimental settings vary greatly across studies. One recent
benchmark compares a number of attacks across standardized settings (Schwarzschild et al.
2020). Nonetheless, many methods still have not been benchmarked, and a variety of training-
only threat models have not been compared to the state of the art.

• Robustness to the victim’s training hyperparameters: Schwarzschild et al. (2020) also show
that many existing poisoning methods are less effective than advertised, even in the white-box
setting, when attacking network architectures, optimizers, or data augmentation strategies
that differ from the original experimental setting. Ongoing work seeks attacks that transfer to
a wide range of training hyperparameters. The success of MetaPoison (Huang et al. 2020) and
Witches’ Brew (Geiping et al. 2020) at poisoning AutoML suggests that this goal is achievable,
but quantifying robustness across hyperparameters in a controlled testing environment
remains a challenge.

• Broader objectives: Most work on training-only data poisoning has focused on forcing
misclassification of a particular target image or of all data simultaneously. However, broader
goals are largely unexplored. For instance, one can aim to cause misclassification of an
entire sub-class of inputs by targeting specific demographics (Jagielski et al. 2020) or inputs
corresponding to a particular physical object or user.

3 Backdoor Attacks

Figure 2: A taxonomy of backdoor attacks, organized by application (object recognition and detection, generative models, reinforcement learning, model watermarking) and by methodology (basic backdoor attacks, attacks on pre-trained models, clean-label attacks, attacks for transfer learning, attacks for federated learning).

In contrast to the attacks discussed above, backdoor attacks (also known as Trojan attacks) allow
the adversary (limited) access to inputs during inference. This capability allows the adversary to
perform significantly more potent attacks, changing the behavior of the model on a much broader
range of test inputs. Figure 2 contains a taxonomy of backdoor attacks according to different
objectives and threat models.
The key idea behind this class of attacks is to poison a model so that the presence of a
backdoor trigger in a test-time input elicits a particular model behavior (e.g. a particular la-
bel assignment) of the adversary’s choice. The trigger is a pattern that is easily applied to
any input — e.g., a small patch or sticker in the case of images or a specific phrase in the
case of natural language processing (cf. Figure 3). To ensure that the attack goes undetected when the model is actually deployed, it is necessary that the model behaves normally in the absence of a trigger, i.e., during normal testing.

Figure 3: Backdoor attacks with different triggers: (a) a square pattern flips the identity of a stop sign (Gu et al. 2017); (b) dead code as the trigger for source code modeling (Ramakrishnan and Albarghouthi 2020).

The most common scenario for backdoor attacks involves end-to-end training (Gu et al. 2017, Chen et al. 2017), where the adversary is able to inject multiple poisoned inputs into the training set, causing a backdoor vulnerability in the trained model. This scenario is quite general and encompasses several real-world ML tasks—e.g., text classification (Dai et al. 2019, Chen et al. 2020, Sun 2020), graph classification
(Zhang et al. 2020, Xi et al. 2020), malware detection (Severi et al. 2020), biometric systems (Lovisotto
et al. 2019), and reinforcement learning (Kiourti et al. 2020).
Another threat model corresponds to the setting where the model is not trained from scratch
but rather fine-tuned on a new task. Just as in the case of training-only attacks, the adversary can
exploit access to commonly used standard models to produce more potent attacks.
Comparison to evasion attacks. At a high level, backdoor attacks appear similar to evasion attacks
(Biggio et al. 2013, Szegedy et al. 2013) where the adversary perturbs a specific input to cause an
intended model prediction. The key difference here is that backdoor attacks aim to embed a trigger
that is input- and model-agnostic—i.e., the same trigger can cause any poisoned model to produce an
incorrect prediction on any input. While in principle it is possible to construct evasion attacks that
apply to multiple inputs (Moosavi-Dezfooli et al. 2017) and transfer across models (Szegedy et al.
2013), such attacks are less effective (Tramèr et al. 2017).

3.1 Applications of Backdoor Attacks


Backdoor attacks can be used to manipulate learning systems in a range of application areas that
we survey below.
Object recognition and detection. Early work on backdoor attacks focused on manipulating
computer vision systems by altering physical objects. Gu et al. (2017) show that their backdoor
attack can induce an image classifier to label a stop sign as a speed limit sign in the presence of
a small yellow sticker that acts as the trigger. Similarly, Chen et al. (2017) demonstrate backdoor
attacks on a simplified face recognition system in which a pair of glasses triggers a prediction
change in identity.
Generative models. The danger posed by backdoor attacks is not restricted to classifiers; they
can be applied to models whose output is not a single label. In the case of language models,
a trigger can elicit complex behaviors such as generating specified character sequences. Zhang
et al. (2020) use a trigger phrase to cause generative language models to produce offensive text
completion. Similar results occur in machine translation (Wallace et al. 2020), when the trigger
phrase appears in the context of the phrase being translated. Other objectives include suggesting
insecure source code (Schuster et al. 2020) or generating images with specific characteristics (Ding
et al. 2019, Salem et al. 2020).
Adapting attacks on classification to work on generative models is not always straightforward.
One challenging aspect is that the poisoned inputs may need to obey a range of application-specific
constraints. For instance, backdoor attacks on natural language systems (Dai et al. 2019) may
require the poisoned inputs to be natural and syntactically valid. To achieve this, TROJANLM
(Zhang et al. 2020) fine-tunes a pre-trained GPT-2 model (Radford et al. 2019) to generate sentences
containing specified keywords when triggered. In source code modeling, the trigger may need to
be injectable without causing runtime errors or behavioral changes, which can be achieved by only
modifying “dead” code paths that can never be executed (Ramakrishnan and Albarghouthi 2020).
Reinforcement learning. Backdoor attacks on reinforcement learning aim to cause the agent
to perform a malicious action when the trigger appears in a specific state (e.g., a symbol on the
screen of an Atari game) (Yang et al. 2019, Kiourti et al. 2020). For example, in traffic systems, the
attacker’s goal may be to cause congestion when a specific traffic pattern is observed (Wang et al.
2020). One key challenge is that, for the attack to remain unnoticed, the trigger should be added to
as few states as possible. Thus, from the attacker’s perspective, it is desirable that the adversarial
behavior persists even after the trigger disappears from the agent’s observation (Yang et al. 2019).

Model watermarking. Backdoor attacks rely on the expressiveness of machine learning models, and in particular DNNs: their ability to memorize specific patterns without degrading overall accuracy. This property of DNNs has also been leveraged for model watermarking (Adi et al.
2018, Zhang et al. 2018). The goal of watermarking is to train a model while ensuring that a
predetermined set of patterns or inputs gets assigned specific (randomly chosen) labels by the
model. Then, should an adversary steal and deploy the model for profit, the model owner can
prove ownership by demonstrating knowledge of the embedded watermarks—e.g., using standard
cryptographic primitives (Adi et al. 2018).

3.2 Basic Backdoor Attacks


The most common paradigm for launching backdoor attacks is to inject the dataset with poison
samples containing the backdoor trigger. A desirable property of such attacks is that they are
model-agnostic so that the same backdoor attack is effective on different models trained on the same
poisoned dataset. Therefore, the attacks can be launched in the black-box scenario.
The first successful backdoor attacks on modern deep neural networks are demonstrated in
Gu et al. (2017) and Chen et al. (2017), where the adversary injects mislabelled poison samples
into the training set to perform the attack. To encourage the model to rely on the trigger, the
adversary chooses a number of natural images and labels them incorrectly with the target class
label before adding the backdoor trigger. The resulting images are therefore mislabelled based on
their content and are only associated with their label via the backdoor trigger. During training, the
model strongly relies on the easy-to-learn backdoor trigger in order to classify these images. As
a result, when the trigger is applied to a new image during deployment, the model assigns it the
target label, as desired by the adversary.
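Constructing such a poisoned dataset is mechanically simple. The sketch below stamps a BadNets-style square trigger onto a fraction of training images (NCHW tensors) and relabels them with the attacker's target class; the patch size, location, and poison fraction are illustrative choices rather than values from the original papers.

```python
import torch

def add_trigger(images, patch_size=3, value=1.0):
    """Stamp a small square trigger into the bottom-right corner of NCHW images."""
    triggered = images.clone()
    triggered[:, :, -patch_size:, -patch_size:] = value
    return triggered

def poison_dataset(images, labels, target_class, poison_frac=0.05, seed=0):
    """Apply the trigger to a random subset of images and relabel it as the target class."""
    g = torch.Generator().manual_seed(seed)
    n_poison = int(poison_frac * len(images))
    idx = torch.randperm(len(images), generator=g)[:n_poison]
    poisoned_images, poisoned_labels = images.clone(), labels.clone()
    poisoned_images[idx] = add_trigger(images[idx])
    poisoned_labels[idx] = target_class
    return poisoned_images, poisoned_labels

# at test time, the attacker applies add_trigger() to any input to activate the backdoor
```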
Gu et al. (2017) demonstrate their attack in the setting where the model is being trained by the
adversary, but Chen et al. (2017) show that the method also works without access to model training
and with significantly fewer poison examples. Moreover, these attacks have been shown to be
effective with imperceptible triggers in the training set (Chen et al. 2017, Li et al. 2020) while still
being realizable in the physical world (Gu et al. 2017, Chen et al. 2017, Wenger et al. 2020, Sarkar
et al. 2020). Dai et al. (2019) similarly attack RNNs for text classification. They show that simple
and unnoticeable phrase modifications can reliably induce backdoor behavior.

3.3 Model-Specific Attacks


The backdoor attacks described so far have the property that they are model-agnostic—i.e., the
triggers are simple patterns constructed without any knowledge of the model under attack. In
this section, we describe how one can mount more powerful attacks by tailoring the attack to a
particular model.

3.3.1 Embedding Backdoors into Pre-Trained Models


In the case where the attacker obtains access to an already trained model, they can embed a
backdoor into this model without completely re-training it and even without access to the original
training data. Liu et al. (2018) select an input region along with neurons which are sensitive to
changes in that region, and the attacker aims to activate these neurons. They then generate artificial
data (from the model alone), and they adapt the basic backdoor attack by applying their trigger
to this artificial data and fine-tuning only a few layers of the model. Tang et al. (2020) propose
an approach that adds a small module to the existing model instead of modifying existing model
weights. Note that this attack does modify the model architecture and hence falls outside the
standard threat model. Sun et al. (2020) start from a model into which a backdoor has already been
injected and focus on constructing alternative backdoor triggers. They find that it is possible to
construct multiple distinct triggers without knowledge of the original one.

3.3.2 Clean-Label Backdoor Attacks


Most work on backdoor attacks requires adding mislabeled poison samples into the training set.
However, such attacks assume that the adversary has access to the labeling process. Moreover, they
are likely to be detected should a human manually inspect these samples. Recent work proposes
clean-label backdoor attacks, where the labels of poison samples aim to be semantically correct
(Turner et al. 2019, Saha et al. 2019).
The methodology behind clean-label backdoor attacks is conceptually similar to the feature
collision methods introduced in Section 2.2. Turner et al. (2019) utilize generative adversarial
networks (Goodfellow et al. 2014) and adversarial examples (Szegedy et al. 2013) to perturb images
of the target class towards other classes (hence making them harder to learn) before applying the
backdoor trigger. The Hidden Trigger Backdoor Attack (Saha et al. 2019) adapts the feature collision
framework (Shafahi et al. 2018) to construct backdoor poison samples. These samples are based on
natural images from the target class but slightly modified so that their feature representations are
close to images injected with the backdoor trigger.
For natural language processing, no-overlap backdoor attacks are designed to make the poison
samples hard to identify (Wallace et al. 2020). Specifically, the input sentences of poison samples
in the training set do not contain the words in the backdoor trigger, yet when the test-time input
contains the backdoor trigger, a poisoned model still produces the adversarial prediction. The
poison samples are generated by manipulating the model gradient during training, similar to the
gradient-based optimization procedure in Muñoz-González et al. (2017).

3.3.3 Backdoor Attacks for Transfer Learning


So far, we have discussed backdoor attacks where the victim either trains the model from scratch
on a poisoned dataset or receives an already trained model. A scenario that interpolates between
these two settings is transfer learning, in which part (or all) of the model is re-trained on a new task.
In contrast to the transfer learning setting from Section 2 where fine-tuning data was poisoned, the
threat model we discuss in this section consists of an adversary that poisons the pre-training data
to create a backdoored feature extractor, but has no control over the victim’s fine-tuning process.
Basic backdoor attacks can still be harmful in this setting when only the final fully connected layer
of the model is re-trained (Gu et al. 2017). However, when the entire model is fine-tuned in an
end-to-end fashion, the embedded backdoor is virtually eliminated (Liu et al. 2018, Liu et al. 2017).
To ensure that the backdoor trigger is persistent after fine-tuning, existing works design trigger
patterns specific to the attacker’s goal (Yao et al. 2019, Wang et al. 2020). To generate the optimal
backdoor trigger for each target label, the adversary annotates a number of clean samples with the
target label, injects them into the training set, and trains the model on this augmented dataset. Then,
the adversary generates the backdoor trigger that maximizes the activation of neurons responsible
for recognizing the target label, and applies the basic backdoor attack to generate poison samples
for further model pre-training. In order to generate the final trigger, the adversary optimizes the
color intensity so that the intermediate feature representations of inputs injected with the trigger
are close to clean samples in the target class. In Yao et al. (2019), the authors show that when all
layers after the intermediate layer for trigger generation are frozen, the attack remains effective
for transfer learning, and existing backdoor defenses cannot effectively detect or eliminate the
backdoor without degrading the benign prediction accuracy.
To make the attacks more resilient to pruning and fine-tuning based defenses, Wang et al. (2020)
propose a ranking-based neuron selection mechanism, which identifies neurons with weights that
are hard to change during the pruning and fine-tuning process. This work utilizes an autoencoder
to generate strong triggers that are robust under defenses based on input preprocessing. With the
proposed defense-aware fine-tuning algorithm, such backdoor attacks retain higher success rates.

3.3.4 Backdoor Attacks on Federated Learning


Directly applying a basic backdoor attack, i.e., training a local backdoored model and using it to
update the global model, does not work in the federated setting (Bagdasaryan et al. 2020). The main
obstacle is that aggregation across many users will reduce the effect of an individual adversarial
update. To overcome this challenge, Bagdasaryan et al. (2020) study the model replacement
approach, where the attacker scales a malicious model update so as to overpower other benign
model updates, effectively replacing the global model with the adversary’s backdoored local model.
This attack can be modified to bypass norm-based and statistics-based defenses for federated
learning by constraining the norm and variance of gradient updates for local models (Sun et al.
2019, Baruch et al. 2019).
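The core of model replacement is a single rescaling step. The sketch below uses plain FedAvg notation with hypothetical PyTorch state dicts and assumes equal client weighting; the published attack additionally tunes the scaling factor to evade anomaly detection.

```python
def boosted_update(global_weights, backdoored_weights, num_clients):
    """Scale the malicious update so that it survives averaging with benign updates.

    If the server computes roughly w_{t+1} = w_t + (1/n) * sum_i (w_i - w_t) and the
    benign deltas approximately cancel, scaling the attacker's delta by n makes the
    aggregated model land near the attacker's backdoored weights.
    """
    return {name: global_weights[name]
                  + num_clients * (backdoored_weights[name] - global_weights[name])
            for name in global_weights}

# usage sketch (PyTorch state dicts):
#   submitted = boosted_update(global_model.state_dict(),
#                              local_backdoored_model.state_dict(), num_clients=100)
```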
Xie et al. (2019) propose distributed backdoor attacks, which better exploit the decentralized
nature of federated learning. Specifically, they decompose the backdoor pattern for the global
model into multiple distributed small patterns, and inject them into training sets used by several
adversarial participants. Compared to injecting the global backdoor trigger, injecting separate local
patterns for different participants improves the effectiveness of the attacks and bypasses robust
aggregation algorithms.
In addition to the standard federated learning scenario, where the participants have disjoint
training sets for a single task, backdoor attacks have been proposed for other settings. For example,
Liu et al. (2020) investigate backdoor attacks for feature-partitioned federated learning, where
each participant only has access to a subset of features, and most of them do not have access
to labels. They demonstrate that even without manipulating the labels, the adversary can still
successfully embed the backdoor, but such attacks are easier to repel with gradient aggregation
mechanisms. Chen et al. (2020) propose backdoor attacks for federated meta-learning, where
participants collaboratively train a model that can quickly adapt to new tasks from a few training samples.
They demonstrate that the effects of such attacks still persist after meta-training and fine-tuning on
benign data.

3.4 Open Problems


• Backdoors that persist after end-to-end fine-tuning: While backdoor attacks remain effec-
tive when the defender freezes most layers in the pre-trained model for fine-tuning, when the
entire model is fine-tuned end-to-end, existing attacks for transfer learning fail (Yao et al. 2019,
Wang et al. 2020). More generally, developing backdoor attacks without strong assumptions
on the fine-tuning process remains a challenge for the transfer learning setting.

• Backdoor attacks with limited training data: Existing backdoor attack approaches typically
require access to clean samples for training. Even if the adversary already has access to a
pre-trained model, the attacks only work for specific triggers extracted from the model (Liu
et al. 2018). One potential avenue for bypassing this obstacle would be to embed the trigger
directly into the model weights in a method similar to existing watermarking approaches
(Rouhani et al. 2018, Uchida et al. 2017).

• Architecture-agnostic clean-label attacks: The clean-label backdoor attacks described so far work best when the adversary has access to a surrogate model that closely reflects the
architecture of the victim model (Turner et al. 2019, Saha et al. 2019, Wallace et al. 2020). To
improve the transferability of clean-label attacks among a broad range of model architectures,
one might leverage techniques for generating transferable adversarial examples (Moosavi-
Dezfooli et al. 2017, Liu et al. 2017) and clean-label attacks targeting specific instances (Huang
et al. 2020), e.g., using an ensemble of models to generate poison samples.

• Understanding the effectiveness of backdoor attacks in the physical world: Physically realizable backdoor attacks have been explored in existing works, mostly in the setting of
face recognition (Chen et al. 2017, Wenger et al. 2020, Sarkar et al. 2020). While these attacks
can still be successful under different physical conditions, e.g., lighting and camera angles,
the attack success rate drastically varies across backdoor triggers. How different factors affect
the physical backdoor attacks is still an overlooked challenge, and drawing inspiration from
physical adversarial examples (Eykholt et al. 2018, Athalye et al. 2018) to propose robust
physical backdoor attacks is another promising direction.

• Combining poisoning and test-time attacks for stronger backdoor attacks: To perform
backdoor attacks at test time, the common practice is to directly embed the backdoor trigger
without additional modification on the input. One potential avenue to further strengthen
these attacks is to apply additional perturbations to the input aside from the trigger. The
optimal perturbation could be computed in a similar way to adversarial example generation.
Integrating evasion attacks with backdoor attacks could lead to improved success rates and
mitigate the difficulty of backdoor embedding.

4 Defenses Against Poisoning Attacks


In this section, we discuss defense mechanisms for mitigating data poisoning attacks. These tools
are employed at different stages of the machine learning pipeline, and can be broken down into
three categories. One type of defense detects the existence of poisoning attacks by analyzing
either the poisoned training set or the model itself. The second class of defenses aims to repair
the poisoned model by removing the backdoor behavior from the system. The third and final
group comprises robust training approaches designed to prevent poisoning from taking effect.
Figure 4 depicts a taxonomy of defenses against training-only and backdoor attacks according to
methodology.

Figure 4: A taxonomy of defenses against training-only and backdoor attacks. Defenses are grouped into identifying poisoned data (outliers in input space, latent space signatures, prediction signatures), identifying poisoned models (trigger reconstruction, trigger-agnostic detection), repairing poisoned models after training (patching known triggers, trigger-agnostic backdoor removal), preventing poisoning during training (randomized smoothing, majority vote mechanisms, differential privacy, input preprocessing), and defenses for federated learning (robust federated aggregation, robust federated training, post-training defenses).

4.1 Identifying Poisoned Data


The broad goal of detection-based defense strategies is to discover axes along which poison
examples or model parameters differ from their non-poisoned counterparts. These detection
methods are based on raw input data or latent feature representations, or they are otherwise
designed to analyze the behavior of the model around a specific input.

4.1.1 Outliers in Input Space


Perhaps the simplest method for detecting poisoned inputs is to identify outliers in the input space
of the model. This principle connects to a long line of work, known as robust statistics (Huber
2004, Hampel et al. 2011), going all the way back to the work of Tukey (1960) and Huber (1964).
The high level goal of this field is to estimate statistical quantities of a dataset in the presence of
(adversarially-placed) outliers. A significant amount of work in this space shows that, from an
information-theoretic point of view, this problem is indeed tractable for a wide variety of tasks
and data distributions (Tukey 1960, Donoho and Liu 1988, Zuo and Serfling 2000, Chen et al. 2018,
Steinhardt et al. 2018, Zhu et al. 2019). However, from a computational perspective, most of these
approaches do not provide efficient implementations for high-dimensional datasets (which, after
all, are at the core of modern ML).
Recently, there has been a flurry of activity focused on designing computationally efficient
algorithms in this setting. Klivans et al. (2009) present the first algorithm for learning a linear
classifier under adversarial noise. More recently, Diakonikolas et al. (2019) and Lai et al. (2016)
develop efficient algorithms for learning a number of parametric distributions even when a fraction
of the data has been arbitrarily corrupted. These algorithms rely on relatively simple primitives and
can thus be efficiently implemented even for high-dimensional distributions (Diakonikolas et al.
2017). In an orthogonal direction, Gao et al. (2018) draw a connection between robust estimation
and GANs (Goodfellow et al. 2014) that allows one to approximate complex robust estimators
efficiently (Gao et al. 2020).
While these algorithms focus on estimating specific statistical quantities of the data, other work
optimizes general notions of risk in the presence of outliers. For instance, Charikar et al. (2017) pro-
pose algorithms for risk minimization based on recovering a list of possible models and (optionally)
using a small uncorrupted dataset to choose between them. Steinhardt et al. (2017) focus on the
binary classification setting and remove data points that are far from their respective class centroids
(measured directly in input space or after projecting data onto the line between the two centroids).
Similarly, Diakonikolas et al. (2019) and Prasad et al. (2018) adapt tools from robust mean estimation
to robustly estimate the average risk gradient over a (potentially corrupted) dataset. At a high level,
these approaches provide the following theoretical guarantee: if a single data point has a large
effect on the model, it will be identified as an outlier. Such a guarantee prevents an adversary from
significantly changing the behavior of the model by just injecting a few inputs. We refer the reader
to Li (2018), Steinhardt (2018), and Diakonikolas and Kane (2019) for additional references on this
line of work.
From the perspective of defending against poisoning on modern ML datasets, both Steinhardt
et al. (2017) and Diakonikolas et al. (2019) show promising results for regression and classification
settings. Similarly, Paudice et al. (2018) propose a data pre-filtering approach with outlier detection
for linear classifiers. They split a trusted training dataset by class and then train one distance-based
outlier detector for each class. When a new untrusted dataset is used for re-training, the outlier
detectors remove samples that exceed some score threshold. In another work, Paudice et al. (2018)
mitigate label flipping attacks by using k-Nearest-Neighbors (k-NN) to re-label each data point
in the training set. Specifically, they re-label each data point with the most common label among
its k nearest neighbors. It is worth noting that certain outlier-based defenses can be bypassed by
adaptive attacks. Specifically, Koh et al. (2018) design attacks that fool anomaly detectors by placing
poisoned inputs close to each other and rephrasing poisoning attacks as constrained optimization
problems to evade detection.
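To make the k-NN re-labeling defense concrete, the sketch below re-labels every training point with the majority label among its k nearest neighbors; the value of k, the Euclidean metric, and the assumption of non-negative integer labels are illustrative simplifications.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_relabel(X, y, k=5):
    """Re-label each point with the most common label among its k nearest neighbors
    (excluding the point itself), so isolated label flips are overruled by their
    neighborhood. Assumes non-negative integer labels."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(X)
    _, idx = nn.kneighbors(X)              # idx[:, 0] is each point itself
    neighbor_labels = y[idx[:, 1:]]        # shape (n, k)
    return np.array([np.bincount(row).argmax() for row in neighbor_labels])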

4.1.2 Latent Space Signatures


While input-space outlier detection is simple and intuitive, it is only effective on simple, low-
dimensional input domains. In more complex domains, for example image or text data, directly
comparing raw input data may not convey any meaningful notion of similarity. Thus, recent
work has focused on detecting outliers based on the latent embedding of a deep neural network.
The intuition behind this approach is that latent embeddings capture the signal necessary for
classification, thereby making the difference between clean and poisoned inputs more pronounced.
Several ways of analyzing latent model representations arise from this intuition. Tran et al. (2018)
use tools from robust mean estimation (Diakonikolas et al. 2019, Lai et al. 2016) to find directions
along which the covariance of the feature representations is significantly skewed. Measuring
variation along these directions yields better detection of standard backdoor attacks (Section 3) than
simpler metrics such as the ℓ2-distance in feature space. The detection algorithm NIC (Ma and Liu
2019) approximates the distribution of neuron activation patterns and detects inputs that contain
the trigger by comparing their activations to this approximate distribution. Chen et al. (2018) apply
clustering algorithms to the latent representations and identify clusters whose members, when
removed from training, would be labeled differently by the learned model. Peri et al. (2019) observe
that the deep features of poison inputs often lie near the distribution of the target class as opposed
to near the distribution of other data with the same label. They use this observation to detect poison
examples in clean-label data poisoning attacks. Koh and Liang (2017) use the latent embedding
of a model to compute influence functions which measure the effect of each training point on
test set performance. They find that these influence estimates are effective at flagging potentially
mislabelled examples (e.g. label flipping attacks) for manual inspection.
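The spectral-signature idea can be sketched as follows, assuming that the features are penultimate-layer activations of the trained (and possibly poisoned) network: within each class, score every example by its squared projection onto the top principal direction of the centered features and drop the highest-scoring examples. The single-direction scoring and the fixed removal fraction are simplifications of the published defense.

import numpy as np

def spectral_scores(features):
    """Squared projection of each (centered) feature vector onto the top principal
    direction; unusually large scores are treated as suspicious."""
    centered = features - features.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return (centered @ vt[0]) ** 2

def spectral_filter(features, labels, remove_fraction=0.05):
    """Per class, drop the remove_fraction of examples with the largest scores."""
    keep = np.ones(len(labels), dtype=bool)
    for c in np.unique(labels):
        idx = np.where(labels == c)[0]
        scores = spectral_scores(features[idx])
        cutoff = np.quantile(scores, 1.0 - remove_fraction)
        keep[idx[scores > cutoff]] = False
    return keep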

4.1.3 Prediction Signatures


There are also a number of approaches that directly study the behavior of a model in an end-to-end
fashion. STRIP (Gao et al. 2019) detects whether a given input contains a backdoor trigger by
mixing it with other benign inputs and analyzing the model prediction. The authors posit that if
the prediction does not change often, then the model must be heavily relying on a small part of
that input. This approach allows the detection of backdoor triggers in deployed models. SentiNet
(Chou et al. 2020) uses Grad-Cam (Selvaraju et al. 2017), an input saliency mapping method, to
pinpoint the features of the input that are most responsible for the model’s prediction. If the model
only relies on a small part of the input, then it is likely to be relying on a backdoor trigger for its
prediction.
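A minimal sketch of a STRIP-style test is given below; it assumes a PyTorch image classifier, a single test input x of shape (C, H, W), and a batch of held-out clean images. The blending weight and the idea of thresholding entropy against clean calibration inputs follow the spirit of STRIP, but the constants are placeholders.

import torch
import torch.nn.functional as F

def strip_entropy(model, x, clean_batch, alpha=0.5):
    """Average prediction entropy of x blended with held-out clean images.
    Abnormally low entropy means predictions barely change under blending,
    which is characteristic of an input carrying a backdoor trigger."""
    model.eval()
    with torch.no_grad():
        blended = alpha * x.unsqueeze(0) + (1 - alpha) * clean_batch   # (B, C, H, W)
        probs = F.softmax(model(blended), dim=1)
        entropy = -(probs * torch.log(probs + 1e-12)).sum(dim=1)
    return entropy.mean().item()

# Deployment sketch: reject x if strip_entropy(model, x, clean_batch) falls below a
# threshold calibrated on clean validation inputs.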

4.2 Identifying Poisoned Models


The detection approaches described above rely on access to poisoned data used during training.
Thus, they cannot be applied in cases where the entire model training process is outsourced. There
are, however, several defenses that can detect a poisoned model without access to the poisoned
training data.

4.2.1 Trigger Reconstruction


One family of approaches aims to recover the backdoor trigger from the model alone (Wang et al.
2019, Chen et al. 2019, Guo et al. 2019). These methods utilize adversarial perturbations to move
data towards different target classes. Backdoored models are trained to assign an adversarial
label when only a small number of pixels are manipulated to introduce the trigger. As a result,
swapping an image to an adversarial label should require a smaller perturbation than swapping to
a non-adversarial label. The backdoor trigger can thus be approximately recovered by computing
adversarial perturbations for each candidate target label and then selecting the smallest perturbation
across all labels.
Neural Cleanse (Wang et al. 2019), the first approach to use this observation, is able to detect
poisoned models without access to the poisoned dataset. This method does require a number of
clean image samples and full access to the trained model parameters with which it can perform gra-
dient descent to find potential triggers. DeepInspect (Chen et al. 2019) improves this methodology
in three ways. First, it simultaneously recovers potential triggers for multiple classes at once (and
hence avoids the computational cost of constructing a potential trigger for each class individually).
Second, it relies on model inversion (Fredrikson et al. 2015) to recover a substitute training dataset,
thereby not requiring any clean data. And third, it trains a conditional GAN (Goodfellow et al. 2014)
to estimate the probability density function of potential triggers for any target class. Another trigger
reconstruction defense called TABOR (Guo et al. 2019) further improves upon Neural Cleanse by
enhancing the fidelity of the reconstructed backdoor triggers via heuristic regularization methods.
Most recently, Wang et al. (2020) study the data-limited (one shot per class) and data-free cases.
In the data-limited case, they reconstruct a universal adversarial perturbation (i.e. trigger) and
the image-wise perturbations for each label, the similarity of which is then used for backdoor
detection. If a model is backdoored, then the universal perturbation and the image-wise one for the
backdoor target label may share strong similarities due to the existence of a backdoor shortcut. In
the data-free case, they first generate perturbed images from random seed images by maximizing
neuron activations. Then, they detect if a model is backdoored by investigating the magnitude of
the change in logit outputs with respect to random images and perturbed images.
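A simplified sketch of per-class trigger reconstruction in the spirit of Neural Cleanse is shown below: for each candidate target class, a mask and pattern are optimized so that stamped clean images are classified as that class, with an ℓ1 penalty encouraging a small mask. The parameterization, optimizer settings, and the outlier rule for flagging a suspicious class are illustrative rather than the exact published procedure.

import torch
import torch.nn.functional as F

def reconstruct_trigger(model, clean_images, target_class, steps=500, lr=0.1, l1_weight=0.01):
    """Optimize a (mask, pattern) pair so that stamped clean images are classified as
    target_class, with an L1 penalty that encourages a small mask."""
    model.eval()
    _, c, h, w = clean_images.shape
    mask_logit = torch.zeros(1, 1, h, w, requires_grad=True)
    pattern_logit = torch.zeros(1, c, h, w, requires_grad=True)
    opt = torch.optim.Adam([mask_logit, pattern_logit], lr=lr)
    target = torch.full((len(clean_images),), target_class, dtype=torch.long)
    for _ in range(steps):
        mask, pattern = torch.sigmoid(mask_logit), torch.sigmoid(pattern_logit)
        stamped = (1 - mask) * clean_images + mask * pattern
        loss = F.cross_entropy(model(stamped), target) + l1_weight * mask.abs().sum()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return torch.sigmoid(mask_logit).detach(), torch.sigmoid(pattern_logit).detach()

# Detection sketch: reconstruct a trigger for every class and flag classes whose mask
# L1 norm is an outlier (e.g., far below the median norm across classes).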

4.2.2 Trigger-agnostic Detection
Different from the detection pipelines discussed above, MNTD (Xu et al. 2021) predicts whether a
model is backdoored by examining its behavior on carefully crafted inputs. MNTD first generates
a battery of benign and backdoored models. Then, a query set of images is generated and pushed
through the battery of networks to create outputs. Finally, a binary meta-classifier is trained to
examine these outputs and determine whether a model is backdoored. The query set is then jointly
optimized with the parameters of the meta-classifier to obtain a high accuracy meta-classifier.
Interestingly, this approach appears to detect attacks on architectures outside of the ensemble used
to train the meta-classifier (Xu et al. 2021). Huang et al. (2020) define a “one-pixel” signature of
a network, which is the collection of single-pixel adversarial perturbations that most effectively
impact the label of a collection of images. They train a meta-classifier on these signatures to
classify backdoored vs clean models. Kolouri et al. (2020) present a method for examining how
networks respond to trigger-like patterns. A set of “universal litmus patterns” are pushed through
a network, and a meta-classifier is trained on the resulting logits to determine whether the network
is backdoored (Kolouri et al. 2020). Importantly, this method seems to generalize well across
architectures and backdoor attacks.

4.3 Repairing Poisoned Models after Training


While detection is useful for defending against data poisoning attacks, the methods described
above only indicate whether or not an attack has occurred. Another class of methods removes
backdoors from an already trained model without re-training the model from scratch. We describe
methods that rely on (approximate) knowledge of the trigger and also methods that do not require
such knowledge.

4.3.1 Patching Known Triggers


One line of defense strategies relies on trigger reconstruction methods to recover an approximation
of the injected trigger as seen in Section 4.2.1. Once the trigger is identified, there are multiple
ways of rendering it inactive. Neural Cleanse (Wang et al. 2019) shows that it is possible to identify
which neurons are strongly activated by the presence of the trigger and use them to detect inputs
containing the trigger during testing. Similarly, they remove the influence of these neurons by
pruning them. Additionally, Neural Cleanse fine-tunes the model to unlearn the trigger by adding
it to clean samples and training with correct labels. Rather than reconstructing a single trigger,
the authors of Chen et al. (2019), Qiao et al. (2019), and Zhu et al. (2020) model a distribution of
possible triggers using a GAN, which is then sampled to train an immunized model.
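Once an approximate trigger is available, the unlearning step can be sketched as follows: stamp a random fraction of clean inputs with the reconstructed trigger while keeping their correct labels and briefly fine-tune. Here apply_trigger is a hypothetical helper that stamps the reconstructed mask and pattern onto a batch of images, and the stamping fraction and learning rate are placeholders.

import torch
import torch.nn.functional as F

def unlearn_trigger(model, clean_loader, apply_trigger, stamp_fraction=0.2, epochs=1, lr=1e-4):
    """Fine-tune on clean data where a random fraction of inputs carry the reconstructed
    trigger but keep their correct labels, so the model learns to ignore the trigger."""
    opt = torch.optim.SGD(model.parameters(), lr=lr, momentum=0.9)
    model.train()
    for _ in range(epochs):
        for x, y in clean_loader:
            x = x.clone()
            stamp = torch.rand(len(x)) < stamp_fraction
            if stamp.any():
                x[stamp] = apply_trigger(x[stamp])   # stamped inputs keep their true labels
            loss = F.cross_entropy(model(x), y)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return model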

4.3.2 Trigger-agnostic Backdoor Removal


Another way to remove backdoor behavior is to modify the model, only keeping the parts necessary
for the intended tasks. Since the backdoor is not active during forward passes on clean data, the
backdoor behavior can be removed during this process. Liu et al. (2018) attempt such a defense by
pruning neurons that are dormant—i.e., they are not activated on clean inputs. Specifically, their
defense tests the poisoned model with clean inputs, records the average activation of each neuron,
and iteratively prunes neurons in increasing order of average activation. However, this pruning
defense cannot remove the backdoor without significantly degrading performance (Liu et al. 2018).
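The pruning idea can be sketched as follows for a single convolutional layer: record the mean post-ReLU activation of each output channel over clean data, then zero out the least-activated channels. The choice of layer and the pruning fraction are illustrative, and real implementations typically prune iteratively while monitoring clean accuracy.

import torch

@torch.no_grad()
def prune_dormant_channels(model, layer, clean_loader, prune_fraction=0.2):
    """Zero out the output channels of `layer` (an nn.Conv2d) that are least activated
    on clean data, in increasing order of mean post-ReLU activation."""
    activations = []
    hook = layer.register_forward_hook(
        lambda m, inp, out: activations.append(out.relu().mean(dim=(0, 2, 3))))
    model.eval()
    for x, _ in clean_loader:
        model(x)
    hook.remove()
    mean_act = torch.stack(activations).mean(dim=0)      # one value per output channel
    n_prune = int(prune_fraction * len(mean_act))
    prune_idx = torch.argsort(mean_act)[:n_prune]
    layer.weight[prune_idx] = 0.0                        # disable the dormant filters
    if layer.bias is not None:
        layer.bias[prune_idx] = 0.0
    return prune_idx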

To overcome the performance loss of pruning defenses alone, several works propose fine-tuning
the model on a clean dataset (Liu et al. 2018, Chen et al. 2019, Liu et al. 2020). Since the dataset
does not include the backdoor trigger, the backdoor behavior may eventually be forgotten after
updating the parameters during fine-tuning. Combining pruning with fine-tuning can indeed
remove backdoors while preserving the overall model accuracy, even when an adversary crafts
pruning-aware attacks (Liu et al. 2018). However, if the dataset used for fine-tuning is small, model
performance may suffer significantly (Chen et al. 2019).
To better preserve the model’s accuracy on clean data, the watermark removal framework REfiT
(Chen et al. 2019) leverages elastic weight consolidation (Kirkpatrick et al. 2017). This defensive
training process slows down the learning of model weights that are relevant to the main prediction
task while updating other weights that are likely responsible for memorizing watermarks. Similarly,
WILD (Liu et al. 2020) includes a feature distribution alignment scheme to achieve a similar goal.

4.4 Preventing Poisoning during Training


The methods described above aim to either detect or fix an already poisoned model. Here, we
describe training-time strategies to avoid backdoor injection in the first place.

4.4.1 Randomized Smoothing


Randomized smoothing (Lecuyer et al. 2019, Cohen et al. 2019) was originally proposed to defend
against evasion attacks (Biggio et al. 2013). Starting with a base model, a smoothed version of that
model is defined by replacing the model prediction on each data point by the majority prediction
in its neighborhood. The outputs of the smoothed model can be computed efficiently, while at the
same time certifying its robustness to input perturbations (Cohen et al. 2019).
In the context of data poisoning, the goal is to protect the model against perturbations to the
training set. Thus, the goal of certifiable robustness is for each test point, to return a prediction as
well as a certificate that the prediction would not change had some quantity of training data been
modified. To do so, one can view the entire training-plus-single-prediction pipeline as a function
which can be robustified against input perturbations (Weber et al. 2020, Rosenfeld et al. 2020).
Weber et al. (2020) apply this defense to backdoor attacks, where backdoor pixel perturbations are
in the continuous space. On the other hand, Rosenfeld et al. (2020) use this defense against label
flipping attacks where label perturbations are in the discrete space.
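As a toy illustration of smoothing over label perturbations, the sketch below trains many base models on independently label-flipped copies of the (possibly corrupted) training set and predicts by majority vote. The logistic-regression base learner, flip probability, and brute-force Monte Carlo voting are stand-ins for exposition; the defense of Rosenfeld et al. (2020) instead derives closed-form certificates for this kind of smoothed classifier.

import numpy as np
from sklearn.linear_model import LogisticRegression

def smoothed_predict(X_train, y_train, x_test, num_classes, flip_prob=0.2, n_samples=50, seed=0):
    """Majority vote over base models trained on randomly label-flipped copies of the
    training set (a Monte Carlo sketch of label-flipping smoothing)."""
    rng = np.random.default_rng(seed)
    votes = np.zeros(num_classes, dtype=int)
    for _ in range(n_samples):
        y_noisy = y_train.copy()
        flip = rng.random(len(y_train)) < flip_prob
        y_noisy[flip] = rng.integers(0, num_classes, size=int(flip.sum()))
        clf = LogisticRegression(max_iter=200).fit(X_train, y_noisy)
        votes[int(clf.predict(x_test.reshape(1, -1))[0])] += 1
    return int(votes.argmax()), votes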

4.4.2 Majority Vote Mechanisms


A number of approaches utilize majority vote mechanisms to ignore poisoned samples. The
underlying assumption is that the number of poisoned samples injected by the attacker is small
compared to the size of the overall training dataset. Therefore, the poison samples will not
significantly influence the majority vote when voters each use only a subset of the data. For
example, Deep Partition Aggregation (Levine and Feizi 2020) learns multiple base classifiers by
partitioning the training set into disjoint subsets. Similarly, Jia et al. (2020) train multiple base
models on random subsamples of the training dataset. The base models are then combined via
majority vote to produce an aggregate model, the robustness of which can be confirmed empirically
and certified theoretically. In another work, Jia et al. (2020) predict the label of a test example via majority vote among
the labels of its k nearest neighbors or all of its neighbors within radius r in the training dataset.
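A minimal sketch of partition-based aggregation is given below: the training set is split into disjoint partitions, one base classifier is trained per partition, and test predictions are made by majority vote, so each injected example can influence at most one voter. The index-based partitioning and the logistic-regression base learner are stand-ins (Deep Partition Aggregation assigns examples to partitions by hashing their contents and trains deep networks), and each partition is assumed to contain examples of every class.

import numpy as np
from sklearn.linear_model import LogisticRegression

def train_partition_ensemble(X, y, n_partitions=10):
    """Train one base classifier per disjoint partition of the training data, so each
    (possibly poisoned) example can influence at most one base classifier."""
    part = np.arange(len(X)) % n_partitions         # stand-in for a content hash
    return [LogisticRegression(max_iter=200).fit(X[part == p], y[part == p])
            for p in range(n_partitions)]

def majority_vote_predict(ensemble, x, num_classes):
    votes = np.zeros(num_classes, dtype=int)
    for clf in ensemble:
        votes[int(clf.predict(x.reshape(1, -1))[0])] += 1
    return int(votes.argmax()), votes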

4.4.3 Differential Privacy
Differential Privacy (DP) (Dwork et al. 2006) was originally designed to protect the privacy of
individuals contributing data. The core idea is that if the output of the algorithm remains essentially
unchanged when one individual input point is added or subtracted, the privacy of each individual
is preserved. From the perspective of data poisoning, differential privacy ensures that model
predictions do not depend too much on individual data points. Thus, models will not be dispropor-
tionately affected by poisoned samples. Ma et al. (2019) study defenses based on DP against data
poisoning from the practical and theoretical perspectives. Hong et al. (2020) empirically show that
the off-the-shelf mechanism DP-SGD (Abadi et al. 2016), which clips and noises gradients during
training, can serve as a defense. They point out that the main artifacts of gradients computed in the
presence of poisoning are that their ℓ2-norms have higher magnitudes and their orientation differs
from clean gradients. Since DP-SGD bounds gradient magnitudes by clipping and minimizes the
difference in orientation by random noise addition, it is successful in defending against poisoning
attacks (Hong et al. 2020).
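The core DP-SGD operation can be sketched as follows: compute each example's gradient, clip it to a maximum ℓ2 norm, add Gaussian noise to the clipped sum, and average before updating. The explicit per-example loop is for clarity only (libraries such as Opacus vectorize per-example gradients), and the clip norm and noise multiplier are illustrative.

import torch
import torch.nn.functional as F

def dp_sgd_step(model, optimizer, x_batch, y_batch, clip_norm=1.0, noise_multiplier=1.0):
    """One DP-SGD-style step: clip each example's gradient to an L2 norm bound, add
    Gaussian noise to the summed gradients, and average before updating."""
    params = [p for p in model.parameters() if p.requires_grad]
    summed = [torch.zeros_like(p) for p in params]
    for x, y in zip(x_batch, y_batch):                      # explicit per-example loop
        loss = F.cross_entropy(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, params)
        total_norm = torch.sqrt(sum(g.pow(2).sum() for g in grads))
        scale = torch.clamp(clip_norm / (total_norm + 1e-12), max=1.0)
        for s, g in zip(summed, grads):
            s += g * scale
    for p, s in zip(params, summed):
        noise = torch.randn_like(s) * noise_multiplier * clip_norm
        p.grad = (s + noise) / len(x_batch)
    optimizer.step()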

4.4.4 Input Preprocessing


Additionally, some works propose modifying the model input, during training or testing, to prevent
the model from recognizing the backdoor trigger (Liu et al. 2017, Borgnia et al. 2020). Liu et al. (2017)
utilize an autoencoder (Vincent et al. 2008) trained on clean data to preprocess the input. Since
the input is not perfectly reconstructed (especially given that the autoencoder is not particularly
sensitive to the trigger), the model is unlikely to recognize the trigger. Borgnia et al. (2020) propose
using strong data augmentations during training, such as mixup (Zhang et al. 2017), CutMix (Yun
et al. 2019), and MaxUp (Gong et al. 2020). Dramatic data augmentations sufficiently disrupt
triggers and perturbations to the training data, foiling the attack. This approach is highly effective
against both training-only and backdoor attacks and has the added benefit that it does not degrade
model performance (Borgnia et al. 2020).
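As a concrete example of this style of augmentation, the sketch below performs a single mixup training step, blending random pairs of images and their one-hot labels so that any localized trigger or perturbation is diluted; the Beta(1, 1) mixing distribution is the common default and is used here only for illustration.

import torch
import torch.nn.functional as F

def mixup_step(model, optimizer, x, y, num_classes, alpha=1.0):
    """One training step with mixup: convex combinations of inputs and one-hot labels
    dilute any localized trigger or perturbation carried by individual images."""
    lam = torch.distributions.Beta(alpha, alpha).sample().item()
    perm = torch.randperm(len(x))
    x_mix = lam * x + (1 - lam) * x[perm]
    y_onehot = F.one_hot(y, num_classes).float()
    y_mix = lam * y_onehot + (1 - lam) * y_onehot[perm]
    loss = -(y_mix * F.log_softmax(model(x_mix), dim=1)).sum(dim=1).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()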

4.5 Defenses for Federated Learning


In Federated Learning (FL), a global model is trained by utilizing local data from many clients,
providing a venue for new poisoning attacks from malicious clients. Since this setting allows for
fundamentally different attacks compared to other learning settings (cf. Section 2.7), a number
of application-specific defenses have been developed for FL. These include robust federated
aggregation algorithms, robust federated training protocols, and post-training measures.

4.5.1 Robust Federated Aggregation


Robust federated aggregation algorithms attempt to nullify the effects of attacks while aggregating
client updates. These methods can be broadly classified into two types: one type identifies and
down-weights the malicious updates, while a second type does not attempt to identify malicious
clients and instead computes aggregates in a way that is resistant to poisons. A prototypical idea
for the second type is to estimate a true “center” of the received model updates rather than taking
a weighted average.
Fung et al. (2018) identify sybil-based attacks, including label flipping and backdoors, as
group actions: sybils share a common adversarial objective, so their updates are more similar to
one another than those of honest clients. They propose FoolsGold, which calculates the cosine
similarity of gradient updates across clients and reduces the aggregation weights of clients that
contribute similar updates, thus promoting contribution diversity. Another avenue for defense in
the FL setting is learning-based robust aggregation. Specifically, Li et al. (2020) utilize a variational
autoencoder (VAE) (Kingma and Welling 2013) to detect and remove malicious model updates in
aggregation, where the VAE is trained on clean model updates. The encoder projects model updates
into low-dimensional embeddings in latent space, and the decoder reconstructs the sanitized model
updates and generates a reconstruction error. The idea is that the low-dimensional embeddings
retain essential features, so malicious updates are sanitized and trigger much higher reconstruction
errors while benign updates are unaffected. Updates with higher reconstruction errors are deemed
malicious and are excluded from aggregation. This method assumes that a clean dataset is available
to train the detection model.
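Returning to the similarity-based reweighting idea, the following is a simplified sketch of FoolsGold-style aggregation weights: clients whose accumulated updates are highly similar to some other client's receive small weights. The pardoning and logit-rescaling steps of the full FoolsGold algorithm are omitted here.

import numpy as np

def foolsgold_weights(update_history):
    """Aggregation weights from pairwise cosine similarity of accumulated client updates;
    near-duplicate contributors receive small weights.

    update_history: (n_clients, dim) array of summed historical updates."""
    normalized = update_history / (np.linalg.norm(update_history, axis=1, keepdims=True) + 1e-12)
    cos_sim = normalized @ normalized.T
    np.fill_diagonal(cos_sim, -np.inf)
    max_sim = cos_sim.max(axis=1)                   # similarity to the closest other client
    weights = 1.0 - np.clip(max_sim, 0.0, 1.0)
    return weights / (weights.max() + 1e-12)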
Another approach involves Byzantine-robust aggregation techniques which resist manipulation,
even without identifying malicious clients. Two such algorithms, called Krum and Multi-Krum
(Blanchard et al. 2017), select representative gradient updates which are close to their nearest
neighbor gradients. Another algorithm called Bulyan (Mhamdi et al. 2018) first uses another
aggregation rule such as Krum to iteratively select benign candidates and then aggregates these
candidates by a variant of the trimmed mean (Yin et al. 2018).
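As a concrete example of such an aggregation rule, the sketch below implements the Krum selection step: each flattened client update is scored by the sum of squared distances to its n - f - 2 closest other updates, and the lowest-scoring update is selected (Multi-Krum would instead average several of the best-scoring updates).

import numpy as np

def krum(updates, n_byzantine):
    """Select the client update whose n - f - 2 nearest neighbors are closest to it.

    updates: (n, d) array of flattened client updates; n_byzantine: assumed bound f,
    with n > f + 2."""
    n = len(updates)
    k = n - n_byzantine - 2
    sq_dists = np.linalg.norm(updates[:, None, :] - updates[None, :, :], axis=2) ** 2
    scores = [np.sort(np.delete(sq_dists[i], i))[:k].sum() for i in range(n)]
    return updates[int(np.argmin(scores))]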
Another class of approaches employs the coordinate-wise median gradient (Yin et al. 2018),
geometric median of means (Chen et al. 2017), and approximate geometric median (Pillutla et al.
2019) since median-based computations are more resistant to outliers than mean-based aggregation.
These methods are effective in robust distributed learning where the data of each user comes from
the same distribution. However, they are less effective in FL where local data may be collected in a
non-identically distributed manner across clients. Also, these methods are shown, both theoretically
and empirically, to be less effective in the high-dimensional regime (Mhamdi et al. 2018). Another
Byzantine-robust method called RSA (Li et al. 2019) penalizes parameter updates which move far
away from the previous parameter vector and provides theoretical robustness guarantees. Unlike
earlier methods, RSA does not assume that all workers see i.i.d. data from the same distribution
(Li et al. 2019). Alternatively, Fu et al. (2019) estimate a regression line by a repeated median
estimator (Siegel 1982) for each parameter dimension of the model updates, then dynamically
assign aggregation weights to clients based on the residual of their model updates to that regression
line.

4.5.2 Robust Federated Training


In addition to robust federated aggregation, several FL protocols mitigate poisoning attacks during
training. Sun et al. (2019) show that clipping the norm of model updates and adding Gaussian noise
can mitigate backdoor attacks that are based on the model replacement paradigm (Bagdasaryan
et al. 2020, Bhagoji et al. 2019). This work highlights the success of this method on the realistic
federated EMNIST dataset (Cohen et al. 2017, Caldas et al. 2018).
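That server-side defense can be sketched as follows: clip each client's flattened update to a fixed ℓ2 norm, average, and add Gaussian noise; the clip bound and noise scale are illustrative and would be tuned per task.

import numpy as np

def clipped_noisy_aggregate(client_updates, clip_norm=1.0, noise_std=0.01, seed=0):
    """Clip each flattened client update to a fixed L2 norm, average, and add Gaussian
    noise, bounding and blurring the influence of any single (possibly backdoored) client."""
    rng = np.random.default_rng(seed)
    clipped = [u * min(1.0, clip_norm / (np.linalg.norm(u) + 1e-12)) for u in client_updates]
    aggregate = np.mean(clipped, axis=0)
    return aggregate + rng.normal(0.0, noise_std, size=aggregate.shape)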
Andreina et al. (2020) leverage two specific features of federated learning to defend against
backdoor attacks. Specifically, they utilize global models produced in the previous rounds and the
fact that the attacker does not have access to a substantial amount of training data. They propose
BaFFLe (Andreina et al. 2020), which incorporates an additional validation phase to each round
of FL. That is, a set of randomly chosen clients validate the current global model by computing a
validation function on their private data and report whether the current global model is poisoned.
The server decides to accept or reject the global model based on the feedback from validating clients.
Specifically, the validation function compares the class-specific misclassification rates of the current
global model with those of the accepted global models in the previous rounds and raises a warning
when the misclassification rates differ significantly, which may indicate backdoor poisoning.

4.5.3 Post-Training Defenses


Other defense strategies focus on restoring the poisoned global model after training. Wu et al. (2020)
extend pruning and fine-tuning methods (Liu et al. 2018) to the FL setting to repair backdoored
global models. Their method requires clients to rank the dormant level of neurons in the neural
network using their local data, since the server itself cannot access clean training data. They then
select and remove the most redundant neurons via majority vote (Wu et al. 2020).

4.6 Open Problems


• Defenses beyond image classification: Although data poisoning and backdoor attacks have
been applied in a variety of domains, image classification remains the major focus of defense
research. It is thus crucial that these defenses be applied to other domains in order to understand
their potential for real-world use, as well as any shortcomings.

• Navigating trade-offs between accuracy, security, and data privacy: Modern large-scale ML
systems strive to achieve high accuracy while maintaining user privacy. However, these
goals seem to be at odds in the presence of data poisoning. In fact, most FL defenses rely on
direct access to model updates, which can reveal information about user data (Zhu et al. 2019,
Geiping et al. 2020). Achieving security against poisoning while maintaining accuracy and
privacy appears to be elusive given our current methods.

• Can defenses be bypassed without access to training? Tan and Shokri (2019) show that one
can bypass certain outlier-based defenses by enforcing that the internal representations of
a model on poison examples during training are similar to those corresponding to clean
examples. The open question is whether these defenses can be bypassed without access to the
training protocol.

• Efficient and practical defenses: Many approaches to identifying poisoned models require
producing a set of auxiliary clean and poisoned models to train a detector (Xu et al. 2021,
Huang et al. 2020, Kolouri et al. 2020), but this process is computationally expensive. Moreover,
generating additional models for trigger-agnostic methods or reconstructing possible triggers
in trigger reconstruction methods (Wang et al. 2019, Guo et al. 2019) requires a clean dataset,
which may not be feasible in practice. Therefore, designing efficient and practical defense
methods with lower data and computation requirements is essential for practical application.

• Differential privacy and data poisoning: Hong et al. (2020) and Jagielski et al. (2020) show
that there remains a massive gap between the theoretical worst-case lower bounds provided
by DP mechanisms and the empirical performance of the defenses against data poisoning.
It is however unclear if this gap is due to existing attacks being insufficient or due to the
theoretical bounds being unnecessarily pessimistic.

• Certified defenses against poisoning attacks: Certified defenses against poisoning attacks
are still far from producing meaningful guarantees in realistic, large-scale settings. Moreover,
they are particularly hard to study in a federated learning setting. Instead of analyzing
the influence of training datasets on model predictions in an end-to-end manner like in the
centralized setting (Weber et al. 2020, Rosenfeld et al. 2020), one needs to consider how the
local training datasets influence the local updates, and how the performance of the global
model is influenced by these local updates through aggregation.

• Detection of inconspicuous poison examples: Defense strategies rely on poison examples
or backdoor model behavior being noticeably irregular in the context of the ambient dataset.
Detecting malicious behavior when it does not appear anomalous is a much harder task, and
existing methods often fail. Similarly, anomaly detection is ineffective in federated learning,
where each client may have a dramatically different underlying data distribution. Picking
out malicious clients from benign, yet atypical ones is an important open problem.

5 Conclusion
The expansion of the power and scale of machine learning models has been accompanied by a
corresponding expansion of their vulnerability to attacks. In light of the rapid surge in work
on data poisoning and defenses against it, we provide a bird's-eye perspective by systematically
dissecting the wide range of research directions within this space. The open problems we enumerate
reflect the interests and experience of the authors and are not an exhaustive list of outstanding
research problems. Moving forward, we anticipate work not only on these open problems but
also on datasets and benchmarks for comparing existing methods since controlled comparisons
are currently lacking in the data poisoning and backdoor literature. We hope that the overhead
perspective we offer helps to illuminate both urgent security needs within industry and also a need
to understand security vulnerabilities so that the community can move towards closing them.

Acknowledgements
We thank Jiantao Jiao, Mohammad Mahmoody, and Jacob Steinhardt for helpful pointers to relevant
literature.
Multiple contributors to this work were supported by the Defense Advanced Research Projects
Agency (DARPA) GARD, QED4RML and D3M programs. Additionally, support for Li and Xie
was provided by NSF grant CCF-1910100 and the Amazon research award program. Song and
Chen were supported by NSF grant TWC-1409915, Berkeley DeepDrive, and the Facebook PhD
Fellowship. Madry and Tsipras were supported by NSF grants CCF-1553428, CNS-1815221, and
the Facebook PhD Fellowship. Goldstein, Goldblum, and Schwarzschild were supported by NSF
grant DMS-1912866, the ONR MURI Program, and the Sloan Foundation.

References
[1] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar,
and Li Zhang. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC
Conference on Computer and Communications Security, pages 308–318, 2016.

[2] Yossi Adi, Carsten Baum, Moustapha Cisse, Benny Pinkas, and Joseph Keshet. Turning your
weakness into a strength: Watermarking deep neural networks by backdooring. In 27th
{USENIX} Security Symposium ({USENIX} Security 18), pages 1615–1631, 2018.
[3] Hojjat Aghakhani, Thorsten Eisenhofer, Lea Schönherr, Dorothea Kolossa, Thorsten Holz,
Christopher Kruegel, and Giovanni Vigna. Venomave: Clean-label poisoning against speech
recognition, 2020.

[4] Hojjat Aghakhani, Dongyu Meng, Yu-Xiang Wang, Christopher Kruegel, and Giovanni Vigna.
Bullseye polytope: A scalable clean-label poisoning attack with improved transferability.
arXiv preprint arXiv:2005.00191, 2020.

[5] Sebastien Andreina, Giorgia Azzurra Marson, Helen Möllering, and Ghassan Karame. Baffle:
Backdoor detection via feedback-based federated learning. arXiv preprint arXiv:2011.02167,
2020.

[6] Anonymous. Possible malware found hidden inside images from the ImageNet dataset,
2020. URL https://www.reddit.com/r/MachineLearning/comments/j4jrln/d_possible_
malware_found_hidden_inside_images/.

[7] Anish Athalye, Logan Engstrom, Andrew Ilyas, and Kevin Kwok. Synthesizing robust
adversarial examples. In International conference on machine learning, pages 284–293. PMLR,
2018.

[8] Per Austrin, Kai-Min Chung, Mohammad Mahmoody, Rafael Pass, and Karn Seth. On the
impossibility of cryptography with tamperable randomness. In Annual Cryptology Conference,
pages 462–479. Springer, 2014.

[9] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How
to backdoor federated learning. In International Conference on Artificial Intelligence and Statistics,
pages 2938–2948. PMLR, 2020.

[10] Gilad Baruch, Moran Baruch, and Yoav Goldberg. A little is enough: Circumventing defenses
for distributed learning. In Advances in Neural Information Processing Systems, pages 8635–8645,
2019.

[11] Samyadeep Basu, Philip Pope, and Soheil Feizi. Influence functions in deep learning are
fragile. arXiv preprint arXiv:2006.14651, 2020.

[12] Arjun Nitin Bhagoji, Supriyo Chakraborty, Prateek Mittal, and Seraphin Calo. Analyzing
federated learning through an adversarial lens. In International Conference on Machine Learning,
pages 634–643. PMLR, 2019.

[13] Battista Biggio, Blaine Nelson, and Pavel Laskov. Support vector machines under adversarial
label noise. In Asian conference on machine learning, pages 97–112, 2011.

[14] Battista Biggio, Blaine Nelson, and Pavel Laskov. Poisoning attacks against support vector
machines. arXiv preprint arXiv:1206.6389, 2012.

[15] Battista Biggio, Igino Corona, Davide Maiorca, Blaine Nelson, Nedim Šrndić, Pavel Laskov,
Giorgio Giacinto, and Fabio Roli. Evasion attacks against machine learning at test time. In
Joint European conference on machine learning and knowledge discovery in databases, pages 387–402.
Springer, 2013.

[16] Peva Blanchard, Rachid Guerraoui, Julien Stainer, et al. Machine learning with adversaries:
Byzantine tolerant gradient descent. In Advances in Neural Information Processing Systems,
pages 119–129, 2017.

[17] Eitan Borgnia, Valeriia Cherepanova, Liam Fowl, Amin Ghiasi, Jonas Geiping, Micah Gold-
blum, Tom Goldstein, and Arjun Gupta. Strong data augmentation sanitizes poisoning and
backdoor attacks without an accuracy tradeoff. arXiv preprint arXiv:2011.09527, 2020.

[18] Cody Burkard and Brent Lagesse. Analysis of causative attacks against svms learning from
data streams. In Proceedings of the 3rd ACM on International Workshop on Security And Privacy
Analytics, pages 31–36, 2017.

[19] Sebastian Caldas, Sai Meher Karthik Duddu, Peter Wu, Tian Li, Jakub Konečnỳ, H Brendan
McMahan, Virginia Smith, and Ameet Talwalkar. Leaf: A benchmark for federated settings.
arXiv preprint arXiv:1812.01097, 2018.

[20] Di Cao, Shan Chang, Zhijian Lin, Guohua Liu, and Donghong Sun. Understanding dis-
tributed poisoning attack in federated learning. In 2019 IEEE 25th International Conference on
Parallel and Distributed Systems (ICPADS), pages 233–239. IEEE, 2019.

[21] Xiaoyu Cao, Jinyuan Jia, and Neil Zhenqiang Gong. Data poisoning attacks to local differen-
tial privacy protocols, 2019.

[22] Moses Charikar, Jacob Steinhardt, and Gregory Valiant. Learning from untrusted data. In
Symposium on Theory of Computing (STOC), 2017.

[23] Bryant Chen, Wilka Carvalho, Nathalie Baracaldo, Heiko Ludwig, Benjamin Edwards, Tae-
sung Lee, Ian Molloy, and Biplav Srivastava. Detecting backdoor attacks on deep neural
networks by activation clustering. arXiv preprint arXiv:1811.03728, 2018.

[24] Chien-Lun Chen, Leana Golubchik, and Marco Paolieri. Backdoor attacks on federated
meta-learning. arXiv preprint arXiv:2006.07026, 2020.

[25] Huili Chen, Cheng Fu, Jishen Zhao, and Farinaz Koushanfar. Deepinspect: A black-box
trojan detection and mitigation framework for deep neural networks. In Proceedings of
the Twenty-Eighth International Joint Conference on Artificial Intelligence, IJCAI-19, pages 4658–
4664. International Joint Conferences on Artificial Intelligence Organization, 7 2019. doi:
10.24963/ijcai.2019/647. URL https://doi.org/10.24963/ijcai.2019/647.

[26] Mengjie Chen, Chao Gao, Zhao Ren, et al. Robust covariance and scatter matrix estimation
under huber’s contamination model. The Annals of Statistics, 2018.

[27] Xiaoyi Chen, Ahmed Salem, Michael Backes, Shiqing Ma, and Yang Zhang. BadNL: Backdoor
attacks against nlp models. arXiv preprint arXiv:2006.01043, 2020.

[28] Xinyun Chen, Chang Liu, Bo Li, Kimberly Lu, and Dawn Song. Targeted backdoor attacks
on deep learning systems using data poisoning. arXiv preprint arXiv:1712.05526, 2017.

[29] Xinyun Chen, Wenxiao Wang, Chris Bender, Yiming Ding, Ruoxi Jia, Bo Li, and Dawn Song.
Refit: a unified watermark removal framework for deep learning systems with limited data.
arXiv preprint arXiv:1911.07205, 2019.

[30] Yudong Chen, Lili Su, and Jiaming Xu. Distributed statistical machine learning in adversarial
settings: Byzantine gradient descent. Proceedings of the ACM on Measurement and Analysis of
Computing Systems, 1(2):1–25, 2017.

[31] Valeriia Cherepanova, Micah Goldblum, Harrison Foley, Shiyuan Duan, John P Dickerson,
Gavin Taylor, and Tom Goldstein. Lowkey: Leveraging adversarial attacks to protect social
media users from facial recognition. OpenReview, 2020.

[32] Edward Chou, Florian Tramer, and Giancarlo Pellegrino. Sentinet: Detecting localized
universal attack against deep learning systems. IEEE SPW 2020, 2020.

[33] Gregory Cohen, Saeed Afshar, Jonathan Tapson, and Andre Van Schaik. Emnist: Extending
mnist to handwritten letters. In 2017 International Joint Conference on Neural Networks (IJCNN),
pages 2921–2926. IEEE, 2017.

[34] Jeremy Cohen, Elan Rosenfeld, and Zico Kolter. Certified adversarial robustness via random-
ized smoothing. In International Conference on Machine Learning, pages 1310–1320, 2019.

[35] Jiazhu Dai, Chuanshuai Chen, and Yufeng Li. A backdoor attack against lstm-based text
classification systems. IEEE Access, 7:138872–138878, 2019.

[36] Ilias Diakonikolas and Daniel M Kane. Recent advances in algorithmic high-dimensional
robust statistics. arXiv preprint arXiv:1911.05911, 2019.

[37] Ilias Diakonikolas, Gautam Kamath, Daniel M Kane, Jerry Li, Ankur Moitra, and Alistair
Stewart. Being robust (in high dimensions) can be practical. In International Conference on
Machine Learning (ICML), 2017.

[38] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Ankur Moitra, and Alistair
Stewart. Robust estimators in high-dimensions without the computational intractability.
SIAM Journal on Computing, 48(2):742–864, 2019.

[39] Ilias Diakonikolas, Gautam Kamath, Daniel Kane, Jerry Li, Jacob Steinhardt, and Alistair
Stewart. Sever: A robust meta-algorithm for stochastic optimization. In International Confer-
ence on Machine Learning, pages 1596–1606, 2019.

[40] Shaohua Ding, Yulong Tian, Fengyuan Xu, Qun Li, and Sheng Zhong. Trojan attack on deep
generative models in autonomous driving. In International Conference on Security and Privacy
in Communication Systems, pages 299–318. Springer, 2019.

[41] David L Donoho and Richard C Liu. The "automatic" robustness of minimum distance
functionals. The Annals of Statistics, 1988.

[42] Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to
sensitivity in private data analysis. In Theory of cryptography conference, pages 265–284.
Springer, 2006.

[43] Kevin Eykholt, Ivan Evtimov, Earlence Fernandes, Bo Li, Amir Rahmati, Chaowei Xiao,
Atul Prakash, Tadayoshi Kohno, and Dawn Song. Robust physical-world attacks on deep
learning visual classification. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 1625–1634, 2018.

[44] Minghong Fang, Guolei Yang, Neil Zhenqiang Gong, and Jia Liu. Poisoning attacks to graph-
based recommender systems. In Proceedings of the 34th Annual Computer Security Applications
Conference, pages 381–392, 2018.

[45] Minghong Fang, Xiaoyu Cao, Jinyuan Jia, and Neil Gong. Local model poisoning attacks
to byzantine-robust federated learning. In 29th {USENIX} Security Symposium ({USENIX}
Security 20), pages 1605–1622, 2020.

[46] Minghong Fang, Neil Zhenqiang Gong, and Jia Liu. Influence function based data poisoning
attacks to top-n recommender systems. In Proceedings of The Web Conference 2020, pages
3019–3025, 2020.

[47] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. Model inversion attacks that exploit
confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC
Conference on Computer and Communications Security, pages 1322–1333, 2015.

[48] Shuhao Fu, Chulin Xie, Bo Li, and Qifeng Chen. Attack-resistant federated learning with
residual-based reweighting. arXiv preprint arXiv:1912.11464, 2019.

[49] Clement Fung, Chris JM Yoon, and Ivan Beschastnikh. Mitigating sybils in federated learning
poisoning. arXiv preprint arXiv:1808.04866, 2018.

[50] Chao Gao, Jiyi Liu, Yuan Yao, and Weizhi Zhu. Robust estimation and generative adversarial
nets. arXiv preprint arXiv:1810.02030, 2018.

[51] Chao Gao, Yuan Yao, and Weizhi Zhu. Generative adversarial nets for robust scatter es-
timation: A proper scoring rule perspective. Journal of Machine Learning Research (JMLR),
2020.

[52] Chuhan Gao, Varun Chandrasekaran, Kassem Fawaz, and Somesh Jha. Face-off: Adversarial
face obfuscation. arXiv preprint arXiv:2003.08861, 2020.

[53] Yansong Gao, Change Xu, Derui Wang, Shiping Chen, Damith C Ranasinghe, and Surya
Nepal. Strip: A defence against trojan attacks on deep neural networks. In Proceedings of the
35th Annual Computer Security Applications Conference, pages 113–125, 2019.

[54] Jonas Geiping, Hartmut Bauermeister, Hannah Dröge, and Michael Moeller. Invert-
ing gradients–How easy is it to break privacy in federated learning? arXiv preprint
arXiv:2003.14053, 2020.

[55] Jonas Geiping, Liam Fowl, W Ronny Huang, Wojciech Czaja, Gavin Taylor, Michael Moeller,
and Tom Goldstein. Witches’ brew: Industrial scale data poisoning via gradient matching.
arXiv preprint arXiv:2009.02276, 2020.

[56] Chengyue Gong, Tongzheng Ren, Mao Ye, and Qiang Liu. Maxup: A simple way to improve
generalization of neural network training. arXiv preprint arXiv:2002.09024, 2020.

[57] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil
Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in
neural information processing systems, pages 2672–2680, 2014.

[58] Tianyu Gu, Brendan Dolan-Gavitt, and Siddharth Garg. Badnets: Identifying vulnerabilities
in the machine learning model supply chain. arXiv preprint arXiv:1708.06733, 2017.

[59] Junfeng Guo and Cong Liu. Practical poisoning attacks on neural networks. Proceedings of the
European Conference on Computer Vision, 2020.

[60] Wenbo Guo, Lun Wang, Xinyu Xing, Min Du, and Dawn Song. Tabor: A highly accu-
rate approach to inspecting and restoring trojan backdoors in ai systems. arXiv preprint
arXiv:1908.01763, 2019.

[61] Frank R Hampel, Elvezio M Ronchetti, Peter J Rousseeuw, and Werner A Stahel. Robust
statistics: the approach based on influence functions, volume 196. John Wiley & Sons, 2011.

[62] Sanghyun Hong, Varun Chandrasekaran, Yiğitcan Kaya, Tudor Dumitraş, and Nicolas
Papernot. On the effectiveness of mitigating data poisoning attacks with gradient shaping.
arXiv preprint arXiv:2002.11497, 2020.

[63] Rui Hu, Yuanxiong Guo, Miao Pan, and Yanmin Gong. Targeted poisoning attacks on social
recommender systems. In 2019 IEEE Global Communications Conference (GLOBECOM), pages
1–6. IEEE, 2019.

[64] Shanjiaoyang Huang, Weiqi Peng, Zhiwei Jia, and Zhuowen Tu. One-pixel signature: Char-
acterizing cnn models for backdoor detection. In Proceedings of the European Conference on
Computer Vision (ECCV), 2020.

[65] W Ronny Huang, Jonas Geiping, Liam Fowl, Gavin Taylor, and Tom Goldstein. Metapoison:
Practical general-purpose clean-label data poisoning. arXiv preprint arXiv:2004.00225, 2020.

[66] Peter J Huber. Robust estimation of a location parameter. The Annals of Mathematical Statistics,
pages 73–101, 1964.

[67] Peter J Huber. Robust statistics, volume 523. John Wiley & Sons, 2004.

[68] Matthew Jagielski, Alina Oprea, Battista Biggio, Chang Liu, Cristina Nita-Rotaru, and Bo Li.
Manipulating machine learning: Poisoning attacks and countermeasures for regression
learning. In 2018 IEEE Symposium on Security and Privacy (SP), pages 19–35. IEEE, 2018.

[69] Matthew Jagielski, Giorgio Severi, Niklas Pousette Harger, and Alina Oprea. Subpopulation
data poisoning attacks. arXiv preprint arXiv:2006.14026, 2020.

[70] Matthew Jagielski, Jonathan Ullman, and Alina Oprea. Auditing differentially private
machine learning: How private is private sgd? Advances in Neural Information Processing
Systems, 33, 2020.

[71] Yujie Ji, Xinyang Zhang, Shouling Ji, Xiapu Luo, and Ting Wang. Model-reuse attacks on
deep learning systems. In Proceedings of the 2018 ACM SIGSAC Conference on Computer and
Communications Security, pages 349–363, 2018.

[72] Jinyuan Jia, Xiaoyu Cao, and Neil Zhenqiang Gong. Certified robustness of nearest neighbors
against data poisoning attacks. arXiv preprint arXiv:2012.03765, 2020.

[73] Jinyuan Jia, Xiaoyu Cao, and Neil Zhenqiang Gong. Intrinsic certified robustness of bagging
against data poisoning attacks. arXiv preprint arXiv:2008.04495, 2020.

[74] Michael Kearns and Ming Li. Learning in the presence of malicious errors. SIAM Journal on
Computing, 22(4):807–837, 1993.

[75] Diederik P Kingma and Max Welling. Auto-encoding variational bayes. arXiv preprint
arXiv:1312.6114, 2013.

[76] Panagiota Kiourti, Kacper Wardega, Susmit Jha, and Wenchao Li. Trojdrl: Evaluation of
backdoor attacks on deep reinforcement learning. In 2020 57th ACM/IEEE Design Automation
Conference (DAC), pages 1–6. IEEE, 2020.

[77] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins,
Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska,
et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the national
academy of sciences, 114(13):3521–3526, 2017.

[78] Adam R Klivans, Philip M Long, and Rocco A Servedio. Learning halfspaces with malicious
noise. Journal of Machine Learning Research, 10(12), 2009.

[79] Pang Wei Koh and Percy Liang. Understanding black-box predictions via influence functions.
In International Conference on Machine Learning, pages 1885–1894, 2017.

[80] Pang Wei Koh, Jacob Steinhardt, and Percy Liang. Stronger data poisoning attacks break data
sanitization defenses. arXiv preprint arXiv:1811.00741, 2018.

[81] Soheil Kolouri, Aniruddha Saha, Hamed Pirsiavash, and Heiko Hoffmann. Universal litmus
patterns: Revealing backdoor attacks in cnns. In Proceedings of the IEEE/CVF Conference on
Computer Vision and Pattern Recognition, pages 301–310, 2020.

[82] Alex Krizhevsky. Learning multiple layers of features from tiny images. In Technical report,
2009.

[83] Ram Shankar Siva Kumar, Magnus Nyström, John Lambert, Andrew Marshall, Mario Go-
ertzel, Andi Comissoneru, Matt Swann, and Sharon Xia. Adversarial machine learning–
industry perspectives. arXiv preprint arXiv:2002.05646, 2020.

[84] Kevin A Lai, Anup B Rao, and Santosh Vempala. Agnostic estimation of mean and covariance.
In 2016 IEEE 57th Annual Symposium on Foundations of Computer Science (FOCS), pages 665–674.
IEEE, 2016.

[85] Ya Le and Xuan Yang. Tiny imagenet visual recognition challenge. CS 231N, 7, 2015.

[86] Mathias Lecuyer, Vaggelis Atlidakis, Roxana Geambasu, Daniel Hsu, and Suman Jana. Certi-
fied robustness to adversarial examples with differential privacy. In 2019 IEEE Symposium on
Security and Privacy (SP), pages 656–672. IEEE, 2019.

[87] Alexander Levine and Soheil Feizi. Deep partition aggregation: Provable defense against
general poisoning attacks. arXiv preprint arXiv:2006.14768, 2020.

[88] Bo Li, Yining Wang, Aarti Singh, and Yevgeniy Vorobeychik. Data poisoning attacks on
factorization-based collaborative filtering, 2016.

[89] Jerry Zheng Li. Principled approaches to robust machine learning and beyond. PhD thesis,
Massachusetts Institute of Technology, 2018.

[90] Liping Li, Wei Xu, Tianyi Chen, Georgios B Giannakis, and Qing Ling. Rsa: Byzantine-robust
stochastic aggregation methods for distributed learning from heterogeneous datasets. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 1544–1551, 2019.

[91] Shaofeng Li, Minhui Xue, Benjamin Zhao, Haojin Zhu, and Xinpeng Zhang. Invisible
backdoor attacks on deep neural networks via steganography and regularization. IEEE
Transactions on Dependable and Secure Computing, 2020.

[92] Suyi Li, Yong Cheng, Wei Wang, Yang Liu, and Tianjian Chen. Learning to detect malicious
clients for robust federated learning. arXiv preprint arXiv:2002.00211, 2020.

[93] Fang Liu and Ness Shroff. Data poisoning attacks on stochastic bandits, 2019.

[94] Kang Liu, Brendan Dolan-Gavitt, and Siddharth Garg. Fine-pruning: Defending against
backdooring attacks on deep neural networks. In International Symposium on Research in
Attacks, Intrusions, and Defenses, pages 273–294. Springer, 2018.

[95] Xuankai Liu, Fengting Li, Bihan Wen, and Qi Li. Removing backdoor-based watermarks in
neural networks with limited data. arXiv preprint arXiv:2008.00407, 2020.

[96] Yang Liu, Zhihao Yi, and Tianjian Chen. Backdoor attacks and defenses in feature-partitioned
collaborative learning. arXiv preprint arXiv:2007.03608, 2020.

[97] Yanpei Liu, Xinyun Chen, Chang Liu, and Dawn Song. Delving into transferable adversarial
examples and black-box attacks. In International Conference on Learning Representations, 2017.

[98] Yingqi Liu, Shiqing Ma, Yousra Aafer, W. Lee, Juan Zhai, Weihang Wang, and X. Zhang.
Trojaning attack on neural networks. In NDSS, 2018.

[99] Yuntao Liu, Yang Xie, and Ankur Srivastava. Neural trojans. In 2017 IEEE International
Conference on Computer Design (ICCD), pages 45–48. IEEE, 2017.

[100] Giulio Lovisotto, Simon Eberz, and Ivan Martinovic. Biometric backdoors: A poisoning
attack against unsupervised template updating. arXiv preprint arXiv:1905.09162, 2019.
[101] Shiqing Ma and Yingqi Liu. Nic: Detecting adversarial samples with neural network invariant
checking. In Proceedings of the 26th Network and Distributed System Security Symposium (NDSS
2019), 2019.
[102] Yuzhe Ma, Kwang-Sung Jun, Lihong Li, and Xiaojin Zhu. Data poisoning attacks in contextual
bandits. In International Conference on Decision and Game Theory for Security, pages 186–204.
Springer, 2018.
[103] Yuzhe Ma, Xiaojin Zhu, and Justin Hsu. Data poisoning against differentially-private learners:
attacks and defenses. In Proceedings of the 28th International Joint Conference on Artificial
Intelligence, pages 4732–4738. AAAI Press, 2019.
[104] Saeed Mahloujifar and Mohammad Mahmoody. Blockwise p-tampering attacks on cryp-
tographic primitives, extractors, and learners. In Theory of Cryptography Conference, pages
245–279. Springer, 2017.
[105] Saeed Mahloujifar, Dimitrios I Diochnos, and Mohammad Mahmoody. Learning under
p-tampering attacks. In Algorithmic Learning Theory, pages 572–596, 2018.
[106] Saeed Mahloujifar, Dimitrios I Diochnos, and Mohammad Mahmoody. The curse of concen-
tration in robust learning: Evasion and poisoning attacks from concentration of measure. In
Proceedings of the AAAI Conference on Artificial Intelligence, volume 33, pages 4536–4543, 2019.
[107] Saeed Mahloujifar, Mohammad Mahmoody, and Ameer Mohammed. Data poisoning attacks
in multi-party learning. In International Conference on Machine Learning, pages 4274–4283,
2019.
[108] Shike Mei and Xiaojin Zhu. Using machine teaching to identify optimal training-set attacks
on machine learners. In AAAI, pages 2871–2877, 2015.
[109] Tony A Meyer and Brendon Whateley. Spambayes: Effective open-source, bayesian based,
email classification system. In CEAS. Citeseer, 2004.
[110] El Mahdi El Mhamdi, Rachid Guerraoui, and Sébastien Rouault. The hidden vulnerability of
distributed learning in byzantium. arXiv preprint arXiv:1802.07927, 2018.
[111] Chenglin Miao, Qi Li, Houping Xiao, Wenjun Jiang, Mengdi Huai, and Lu Su. Towards data
poisoning attacks in crowd sensing systems. In Proceedings of the Eighteenth ACM International
Symposium on Mobile Ad Hoc Networking and Computing, pages 111–120, 2018.
[112] Seyed-Mohsen Moosavi-Dezfooli, Alhussein Fawzi, Omar Fawzi, and Pascal Frossard. Uni-
versal adversarial perturbations. In Proceedings of the IEEE conference on computer vision and
pattern recognition, pages 1765–1773, 2017.
[113] Luis Muñoz-González, Battista Biggio, Ambra Demontis, Andrea Paudice, Vasin Wongras-
samee, Emil C Lupu, and Fabio Roli. Towards poisoning of deep learning algorithms with
back-gradient optimization. In Proceedings of the 10th ACM Workshop on Artificial Intelligence
and Security, pages 27–38, 2017.

[114] Luis Muñoz-González, Bjarne Pfitzner, Matteo Russo, Javier Carnerero-Cano, and Emil C
Lupu. Poisoning attacks with generative adversarial nets. arXiv preprint arXiv:1906.07773,
2019.

[115] Blaine Nelson, Marco Barreno, Fuching Jack Chi, Anthony D Joseph, Benjamin IP Rubinstein,
Udam Saini, Charles A Sutton, J Doug Tygar, and Kai Xia. Exploiting machine learning to
subvert your spam filter. LEET, 8:1–9, 2008.

[116] Andrea Paudice, Luis Muñoz-González, Andras Gyorgy, and Emil C Lupu. Detection of
adversarial training examples in poisoning attacks through anomaly detection. arXiv preprint
arXiv:1802.03041, 2018.

[117] Andrea Paudice, Luis Muñoz-González, and Emil C Lupu. Label sanitization against label
flipping poisoning attacks. In Joint European Conference on Machine Learning and Knowledge
Discovery in Databases, pages 5–15. Springer, 2018.

[118] Neehar Peri, Neal Gupta, W. Ronny Huang, Liam Fowl, Chen Zhu, Soheil Feizi, Tom Gold-
stein, and John P. Dickerson. Deep k-nn defense against clean-label data poisoning attacks.
arXiv preprint arXiv:1909.13374, 2019.

[119] Krishna Pillutla, Sham M Kakade, and Zaid Harchaoui. Robust aggregation for federated
learning. arXiv preprint arXiv:1912.13445, 2019.

[120] Adarsh Prasad, Arun Sai Suggala, Sivaraman Balakrishnan, and Pradeep Ravikumar. Robust
estimation via robust gradient estimation. arXiv preprint arXiv:1802.06485, 2018.

[121] Ximing Qiao, Yukun Yang, and Hai Li. Defending neural backdoors via generative distri-
bution modeling. In Advances in Neural Information Processing Systems, pages 14004–14013,
2019.

[122] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, and Ilya Sutskever.
Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019.

[123] Goutham Ramakrishnan and Aws Albarghouthi. Backdoors in neural models of source code.
arXiv preprint arXiv:2006.06841, 2020.

[124] Elan Rosenfeld, Ezra Winston, Pradeep Ravikumar, and J Zico Kolter. Certified robustness to
label-flipping attacks via randomized smoothing. arXiv preprint arXiv:2002.03018, 2020.

[125] Bita Darvish Rouhani, Huili Chen, and Farinaz Koushanfar. Deepsigns: A generic watermark-
ing framework for ip protection of deep learning models. arXiv preprint arXiv:1804.00750,
2018.

[126] Aniruddha Saha, Akshayvarun Subramanya, and Hamed Pirsiavash. Hidden trigger back-
door attacks. arXiv preprint arXiv:1910.00033, 2019.

[127] Ahmed Salem, Yannick Sautter, Michael Backes, Mathias Humbert, and Yang Zhang. Baaan:
Backdoor attacks against autoencoder and gan-based machine learning models. arXiv preprint
arXiv:2010.03007, 2020.

[128] Esha Sarkar, Hadjer Benkraouda, and Michail Maniatakos. Facehack: Triggering backdoored
facial recognition systems using facial characteristics. arXiv preprint arXiv:2006.11623, 2020.

[129] Roei Schuster, Congzheng Song, Eran Tromer, and Vitaly Shmatikov. You autocomplete me:
Poisoning vulnerabilities in neural code completion. arXiv preprint arXiv:2007.02220, 2020.

[130] Avi Schwarzschild, Micah Goldblum, Arjun Gupta, John P Dickerson, and Tom Goldstein.
Just how toxic is data poisoning? A unified benchmark for backdoor and data poisoning
attacks. arXiv preprint arXiv:2006.12557, 2020.

[131] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi
Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-
based localization. In Proceedings of the IEEE international conference on computer vision, pages
618–626, 2017.

[132] Giorgio Severi, Jim Meyer, Scott Coull, and Alina Oprea. Exploring backdoor poisoning
attacks against malware classifiers. arXiv preprint arXiv:2003.01031, 2020.

[133] Ali Shafahi, W Ronny Huang, Mahyar Najibi, Octavian Suciu, Christoph Studer, Tudor
Dumitras, and Tom Goldstein. Poison frogs! Targeted clean-label poisoning attacks on neural
networks. In Advances in Neural Information Processing Systems, pages 6103–6113, 2018.

[134] Shawn Shan, Emily Wenger, Jiayun Zhang, Huiying Li, Haitao Zheng, and Ben Y Zhao.
Fawkes: Protecting personal privacy against unauthorized deep learning models. USENIX
Security Symposium, 2020.

[135] Juncheng Shen, Xiaolei Zhu, and De Ma. Tensorclog: An imperceptible poisoning attack on
deep neural network applications. IEEE Access, 7:41498–41506, 2019.

[136] Andrew F Siegel. Robust regression using repeated medians. Biometrika, 69(1):242–244, 1982.

[137] David Solans, Battista Biggio, and Carlos Castillo. Poisoning attacks on algorithmic fairness.
arXiv preprint arXiv:2004.07401, 2020.

[138] Jacob Steinhardt. Robust learning: Information theory and algorithms. PhD thesis, Stanford
University, 2018.

[139] Jacob Steinhardt, Pang Wei W Koh, and Percy S Liang. Certified defenses for data poisoning
attacks. In Advances in neural information processing systems, pages 3517–3529, 2017.

[140] Jacob Steinhardt, Moses Charikar, and Gregory Valiant. Resilience: A criterion for learning
in the presence of arbitrary outliers. In Innovations in Theoretical Computer Science Conference
(ITCS), 2018.

[141] Gan Sun, Yang Cong, Jiahua Dong, Qiang Wang, and Ji Liu. Data poisoning attacks on
federated machine learning. arXiv preprint arXiv:2004.10020, 2020.

[142] Lichao Sun. Natural backdoor attack on text data. arXiv preprint arXiv:2006.16176, 2020.

[143] Mingjie Sun, Siddhant Agarwal, and J Zico Kolter. Poisoned classifiers are not only back-
doored, they are fundamentally broken. arXiv preprint arXiv:2010.09080, 2020.

[144] Ziteng Sun, Peter Kairouz, Ananda Theertha Suresh, and H Brendan McMahan. Can you
really backdoor federated learning? arXiv preprint arXiv:1911.07963, 2019.

[145] Christian Szegedy, Wojciech Zaremba, Ilya Sutskever, Joan Bruna, Dumitru Erhan, Ian
Goodfellow, and Rob Fergus. Intriguing properties of neural networks. arXiv preprint
arXiv:1312.6199, 2013.

[146] Te Juin Lester Tan and Reza Shokri. Bypassing backdoor detection algorithms in deep
learning. arXiv preprint arXiv:1905.13409, 2019.

[147] Ruixiang Tang, Mengnan Du, Ninghao Liu, Fan Yang, and Xia Hu. An embarrassingly simple
approach for trojan attack in deep neural networks. In Proceedings of the 26th ACM SIGKDD
International Conference on Knowledge Discovery & Data Mining, pages 218–228, 2020.

[148] Vale Tolpegin, Stacey Truex, Mehmet Emre Gursoy, and Ling Liu. Data poisoning attacks
against federated learning systems. In European Symposium on Research in Computer Security,
pages 480–501. Springer, 2020.

[149] Florian Tramèr, Nicolas Papernot, Ian Goodfellow, Dan Boneh, and Patrick McDaniel. The
space of transferable adversarial examples. arXiv preprint arXiv:1704.03453, 2017.

[150] Brandon Tran, Jerry Li, and Aleksander Madry. Spectral signatures in backdoor attacks. In
Advances in Neural Information Processing Systems, pages 8000–8010, 2018.

[151] John W Tukey. A survey of sampling from contaminated distributions. Contributions to
probability and statistics, 1960.

[152] Alexander Turner, Dimitris Tsipras, and Aleksander Madry. Label-consistent backdoor
attacks. arXiv preprint arXiv:1912.02771, 2019.

[153] Yusuke Uchida, Yuki Nagai, Shigeyuki Sakazawa, and Shin’ichi Satoh. Embedding watermarks
into deep neural networks. In Proceedings of the 2017 ACM on International Conference on
Multimedia Retrieval, pages 269–277. ACM, 2017.
[154] Leslie G Valiant. Learning disjunctions of conjunctions. In IJCAI, pages 560–566, 1985.
[155] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting
and composing robust features with denoising autoencoders. In Proceedings of the 25th
international conference on Machine learning, pages 1096–1103, 2008.
[156] Soumya Wadhwa, Saurabh Agrawal, Harsh Chaudhari, Deepthi Sharma, and Kannan Achan.
Data poisoning attacks against differentially private recommender systems. In Proceedings
of the 43rd International ACM SIGIR Conference on Research and Development in Information
Retrieval, pages 1617–1620, 2020.
[157] Jane Wakefield. Microsoft chatbot is taught to swear on Twitter. BBC News, 2016. URL
https://www.bbc.com/news/technology-35890188.
[158] Eric Wallace, Tony Z Zhao, Shi Feng, and Sameer Singh. Customizing triggers with concealed
data poisoning. arXiv preprint arXiv:2010.12563, 2020.
[159] Bolun Wang, Yuanshun Yao, Shawn Shan, Huiying Li, Bimal Viswanath, Haitao Zheng, and
Ben Y Zhao. Neural cleanse: Identifying and mitigating backdoor attacks in neural networks.
In 2019 IEEE Symposium on Security and Privacy (SP), pages 707–723. IEEE, 2019.
[160] Ren Wang, Gaoyuan Zhang, Sijia Liu, Pin-Yu Chen, Jinjun Xiong, and Meng Wang. Practical
detection of trojan neural networks: Data-limited and data-free cases. In Proceedings of the
European Conference on Computer Vision (ECCV), 2020.
[161] Shuo Wang, Surya Nepal, Carsten Rudolph, Marthie Grobler, Shangyu Chen, and Tianle
Chen. Backdoor attacks against transfer learning with pre-trained deep learning models.
arXiv preprint arXiv:2001.03274, 2020.
[162] Yue Wang, Esha Sarkar, Michail Maniatakos, and Saif Eddin Jabari. Stop-and-go: Exploring
backdoor attacks on deep reinforcement learning-based traffic congestion control systems.
arXiv preprint arXiv:2003.07859, 2020.
[163] Maurice Weber, Xiaojun Xu, Bojan Karlas, Ce Zhang, and Bo Li. Rab: Provable robustness
against backdoor attacks. arXiv preprint arXiv:2003.08904, 2020.
[164] Emily Wenger, Josephine Passananti, Yuanshun Yao, Haitao Zheng, and Ben Y Zhao. Backdoor
attacks on facial recognition in the physical world. arXiv preprint arXiv:2006.14580, 2020.
[165] Chen Wu, Xian Yang, Sencun Zhu, and Prasenjit Mitra. Mitigating backdoor attacks in
federated learning. arXiv preprint arXiv:2011.01767, 2020.
[166] Zhaohan Xi, Ren Pang, Shouling Ji, and Ting Wang. Graph backdoor. arXiv preprint
arXiv:2006.11890, 2020.
[167] Huang Xiao, Battista Biggio, Gavin Brown, Giorgio Fumera, Claudia Eckert, and Fabio Roli.
Is feature selection secure against training data poisoning? In International Conference on
Machine Learning, pages 1689–1698, 2015.
[168] Chulin Xie, Keli Huang, Pin-Yu Chen, and Bo Li. DBA: Distributed backdoor attacks against
federated learning. In International Conference on Learning Representations, 2020.
[169] Xiaojun Xu, Qi Wang, Huichen Li, Nikita Borisov, Carl A Gunter, and Bo Li. Detecting AI
trojans using meta neural analysis. In Proceedings of the IEEE Symposium on Security and
Privacy, 2021.
[170] Chaofei Yang, Qing Wu, Hai Li, and Yiran Chen. Generative poisoning attack method against
neural networks. arXiv preprint arXiv:1703.01340, 2017.
[171] Zhaoyuan Yang, Naresh Iyer, Johan Reimann, and Nurali Virani. Design of intentional
backdoors in sequential models. arXiv preprint arXiv:1902.09972, 2019.
[172] Yuanshun Yao, Huiying Li, Haitao Zheng, and Ben Y Zhao. Latent backdoor attacks on
deep neural networks. In Proceedings of the 2019 ACM SIGSAC Conference on Computer and
Communications Security, pages 2041–2055, 2019.
[173] Dong Yin, Yudong Chen, Kannan Ramchandran, and Peter Bartlett. Byzantine-robust distributed
learning: Towards optimal statistical rates. arXiv preprint arXiv:1803.01498, 2018.
[174] Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon
Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In
Proceedings of the IEEE International Conference on Computer Vision, pages 6023–6032, 2019.
[175] Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond
empirical risk minimization. arXiv preprint arXiv:1710.09412, 2017.
[176] Jiale Zhang, Junjun Chen, Di Wu, Bing Chen, and Shui Yu. Poisoning attack in federated
learning using generative adversarial nets. In 2019 18th IEEE International Conference on Trust,
Security and Privacy in Computing and Communications/13th IEEE International Conference on Big
Data Science and Engineering (TrustCom/BigDataSE), pages 374–380. IEEE, 2019.
[177] Jialong Zhang, Zhongshu Gu, Jiyong Jang, Hui Wu, Marc Ph Stoecklin, Heqing Huang, and
Ian Molloy. Protecting intellectual property of deep neural networks with watermarking.
In Proceedings of the 2018 on Asia Conference on Computer and Communications Security, pages
159–172, 2018.
[178] Rui Zhang and Quanyan Zhu. A game-theoretic analysis of label flipping attacks on distributed
support vector machines. In 2017 51st Annual Conference on Information Sciences and Systems
(CISS), pages 1–6. IEEE, 2017.
[179] Xinyang Zhang, Zheng Zhang, and Ting Wang. Trojaning language models for fun and profit.
arXiv preprint arXiv:2008.00312, 2020.
[180] Xuezhou Zhang, Xiaojin Zhu, and Laurent Lessard. Online data poisoning attacks. In
Learning for Dynamics and Control, pages 201–210, 2020.
[181] Zaixi Zhang, Jinyuan Jia, Binghui Wang, and Neil Zhenqiang Gong. Backdoor attacks to
graph neural networks. arXiv preprint arXiv:2006.11165, 2020.
[182] Mengchen Zhao, Bo An, Wei Gao, and Teng Zhang. Efficient label contamination attacks
against black-box learning models. In IJCAI, pages 3945–3951, 2017.
[183] Banghua Zhu, Jiantao Jiao, and Jacob Steinhardt. Generalized resilience and robust statistics.
arXiv preprint arXiv:1909.08755, 2019.
[184] Chen Zhu, W Ronny Huang, Ali Shafahi, Hengduo Li, Gavin Taylor, Christoph Studer, and
Tom Goldstein. Transferable clean-label poisoning attacks on deep neural nets. arXiv preprint
arXiv:1905.05897, 2019.
[185] Ligeng Zhu, Zhijian Liu, and Song Han. Deep leakage from gradients. In Advances in Neural
Information Processing Systems, pages 14774–14784, 2019.
[186] Liuwan Zhu, Rui Ning, Cong Wang, Chunsheng Xin, and Hongyi Wu. Gangsweep: Sweep
out neural backdoors by GAN. In Proceedings of the 28th ACM International Conference on
Multimedia, pages 3173–3181, 2020.
[187] Yijun Zuo and Robert Serfling. General notions of statistical depth function. Annals of
statistics, 2000.