Adversarial Attack and Defense Technologies in Natural Language Processing: A Survey

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Survey paper

Article history:
Received 24 May 2021
Revised 14 March 2022
Accepted 3 April 2022
Available online 7 April 2022
Communicated by Zidong Wang

Keywords:
Textual adversarial example
Adversarial attack
Adversarial defense
Natural language processing
Artificial intelligence

Abstract: Recently, adversarial attack and defense technology has made remarkable achievements and has been widely applied in the computer vision field, promoting its rapid development in other fields, primarily the natural language processing domain. However, discrete semantic texts bring additional restrictions and challenges to successfully implementing adversarial attacks and defenses. This survey systematically summarizes the current progress of adversarial techniques in the natural language processing field. We first briefly introduce the textual adversarial example's particularity, vectorization, and evaluation metrics. More importantly, we categorize textual adversarial attacks according to the combination of semantic granularity and example generation strategy. Next, we present commonly used datasets and adversarial attack applications in diverse natural language processing tasks. Besides, we classify defense strategies as passive and active methods, considering both input data and victim models. Finally, we present several challenging issues and future research directions in this domain.

© 2022 Elsevier B.V. All rights reserved.
1. Introduction

With the rapid progress of high-performance computational equipment and the continuous accumulation of massive data, artificial intelligence technology has been greatly developed and widely used in computer vision (CV) [1–3], natural language processing (NLP) [4–6], voice control [7,8], and other tasks [9–11]. Hence, various intelligent systems are extensively applied to communication, transportation, healthcare, public security, and financial transactions in the real world [12–16].

However, Szegedy et al. [17] proposed the concept of the adversarial example against image classifiers, demonstrating a tremendous security risk in current intelligent systems. They indicated that an adversarial example generated by adding tiny perturbations to a clean image could make a classifier with good performance output a wrong prediction. More noteworthy, deep neural networks trained on different datasets or with distinct structures can produce the same misclassification for the same adversarial example. Since then, adversarial attack technology has become a research highlight in the artificial intelligence security domain. Thus, sign recognition [18], object detection [19,20], audio recognition [21–23], sentiment analysis [24,25], and malware detection systems [26] are apparently vulnerable when facing adversarial threats.

1.1. Development overview of adversarial technology in the NLP field

Based on the paper list¹ collected by Carlini, this survey counts the number of publications related to adversarial techniques in the CV and NLP fields. As shown in Fig. 1, adversarial technology has attracted attention and developed rapidly. Compared with studies in the CV field, publications in the NLP domain are far fewer. However, due to the wide application of NLP technology in text classification [27,28], sentiment analysis [29], text question-answering [30], neural machine translation [31], text generation [32,33], and other tasks, as well as the continuous deepening of adversarial attack and defense technologies, textual adversarial technology has gradually gained researchers' attention.

Papernot et al. [34] were the first to investigate adversarial attacks on texts. Inspired by the idea of generating adversarial images, they crafted adversarial texts through the forward derivative associated with texts' embeddings. Since then, in order to explore the security blind spots in NLP systems and seek corresponding defense strategies, scholars have conducted in-depth research on NLP

* Corresponding author.
E-mail addresses: 742452674@qq.com (S. Qiu), qiheliu@uestc.edu.cn (Q. Liu), sjzhou@uestc.edu.cn (S. Zhou), 562421007@qq.com (W. Huang).
¹ https://nicholas.carlini.com/writing/2019/all-adversarial-example-papers.html
https://doi.org/10.1016/j.neucom.2022.04.020
0925-2312/© 2022 Elsevier B.V. All rights reserved.
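The tiny-perturbation phenomenon described above has a simple numerical intuition for linear score functions: a perturbation bounded by ε per element but aligned with the weight vector shifts the score by ε·‖w‖₁, which grows with the input dimension. A minimal NumPy sketch (the dimensions, seed, and ε are arbitrary illustrative values, not taken from this survey):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01  # per-element perturbation budget

for dim in (10, 10_000):
    w = rng.normal(size=dim)       # weights of a toy linear score function
    x = rng.normal(size=dim)       # a clean input
    eta = eps * np.sign(w)         # worst-case perturbation with |eta_i| <= eps
    shift = w @ (x + eta) - w @ x  # change in the score caused by eta
    # shift equals eps * ||w||_1, so it scales with the input dimension
    print(dim, float(shift))
```

Even though each input element moves by at most ε, the score shift for the 10,000-dimensional input is roughly a thousand times larger than for the 10-dimensional one, which is the linearity argument for why imperceptible perturbations can flip high-dimensional classifiers.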
S. Qiu, Q. Liu, S. Zhou et al. Neurocomputing 492 (2022) 278–307
Fig. 2. Summary of the textual adversarial attack and defense strategies in this survey. The attack methods are classified according to the combination of semantic granularity
and adversarial example generation strategy. The defense strategies are divided into passive and active methods. Moreover, the attacks concerned by each kind of defense
method are marked in the figure.
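To make the character-level operations in this taxonomy concrete, they can be sketched as plain string edits. This is a minimal illustration; the function names and the tiny keyboard-neighbor table are our own choices, not the implementation of any attack cited in this survey:

```python
# Toy keyboard-neighbor table (a small illustrative subset).
KEYBOARD_NEIGHBORS = {"a": "qs", "e": "wr", "i": "uo", "o": "ip"}

def insert_char(word: str, i: int, c: str) -> str:
    """Insert character c before position i."""
    return word[:i] + c + word[i:]

def delete_char(word: str, i: int) -> str:
    """Delete the character at position i."""
    return word[:i] + word[i + 1:]

def swap_adjacent(word: str, i: int) -> str:
    """Swap the characters at positions i and i+1."""
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def replace_with_neighbor(word: str, i: int) -> str:
    """Replace the character at position i with its first listed keyboard neighbor, if any."""
    c = word[i]
    return word[:i] + KEYBOARD_NEIGHBORS.get(c, c)[0] + word[i + 1:]

print(swap_adjacent("morning", 1))       # "mroning"
print(delete_char("morning", 3))         # "moring"
print(replace_with_neighbor("good", 2))  # "goid"
```

Real character-level attacks differ mainly in how they pick which position to edit (gradients, token importance, or random choice), not in the edit operations themselves.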
which refers to the type of modified object, is only appropriate for attacks on texts. Considering that adversarial text generation methods are quite different, we summarize textual adversarial attack methods comprehensively according to the combination of semantic granularity and example generation strategy. Specifically, we categorize adversarial attacks into four classes (the character-level, word-level, sentence-level, and multi-level method) at the top level according to the semantic granularity. Moreover, we further divide each class into subclasses (the gradient-based, optimization-based, importance-based, edit-based, paraphrase-based, and generative model-based method) according to example generation strategies. To the best of our knowledge, we are the first to propose this kind of two-level classification for adversarial attacks.

In character-level attacks, the adversary inserts, deletes, replaces, or swaps characters in a given text. To achieve this purpose, researchers utilized gradient calculation to rank adversarial manipulations [35] or trained a substitute DNN [60]. Besides, DeepWordBug [45] used the importance of tokens to determine which character to change. The edit-based methods in [38,39] used natural and synthetic noises to generate visual and phonetic adversarial examples. Generally speaking, adversarial texts crafted by character-level attacks often have apparent grammar or spelling errors, making it easy for human eyes or misspelling checkers to observe these malicious texts.

In word-level attacks, adversaries modify words in a given text, generally causing fewer grammar and spelling errors than character-level attacks. The gradient-based methods [34,65,119] utilized the sign of the gradient, the magnitude of the gradient, and the gradient itself. Due to the semantic consistency requirement, Michel et al. [70] demonstrated that "adversarial examples should be meaning-preserving on the source side, but meaning-destroying on the target side". Seq2Sick [64] introduced Group Lasso and Gradient Regularization to a projected gradient method. In contrast, optimization-based, importance-based, and edit-based methods are designed specifically for textual adversarial attacks. Therefore, the generated texts are usually more imperceptible in grammar and syntax.

For optimization-based methods, most [40,41,71,72,47,73] are based on Evolutionary Algorithms (such as the Genetic Algorithm (GA) [74] and the Particle Swarm Optimization algorithm (PSO) [75]), which do well in handling discrete data but are relatively time-consuming and computationally expensive. Furthermore, researchers proposed specific methods for certain issues, such as the variability exhibited by second-language speakers and first-language dialect speakers [76], the problem of weighing unlikely substitutions high and limiting the accuracy gain [77], and the inapplicability of massive query operations in the real world [37]. For importance-based methods, to craft semantics-preserving texts with minimum modifications, the methods in [78–80] ranked words according to the class probability changes obtained by removing words one by one. Among them, the method in [78] works best for datasets that have sub-categories within each class. Besides, Ren et al. [81] determined the word replacement order by both word saliency and classification probability, performing well in effectively reducing substitutes. Additionally, Hossam et al. [82] employed cross-domain interpretability to learn word importance, handling the issues of computational complexity and query consumption. Yang et al. [83] proposed a systematic probabilistic framework, which does well in achieving a high success rate and efficiency. Overall, these importance-based methods are more efficient than optimization-based methods. For the edit-based method, Zhang et al. [84] employed a Metropolis-Hastings sampling approach to replace or randomize words, Li et al. [85] introduced a pre-trained masked language model, and Emelin et al. [86] detected word sense disambiguation biases. Overall, word-level attacks usually do better in maintaining semantic consistency and imperceptibility than other attacks, but their generated adversarial examples are less varied.

In sentence-level attacks, the adversary inserts new sentences, replaces words with their paraphrases, or even changes the structure of original sentences. Adversaries generally crafted the universal adversarial perturbation through the iterative projected
1.6. Defenses against textual adversarial attack

Due to the deepening development of adversarial attack technologies and the wide application of textual adversarial attacks, researchers have realized the severity of adversarial attack threats in NLP systems, promoting various defense methods. This survey categorizes current defense strategies as passive and active defenses. The most commonly used passive defense method is checking misspellings and typos in the textual input [45,24,101–103]. However, these passive strategies are only suitable for defending against character-level and word-level attacks. The active defenses principally contain adversarial training and representation learning strategies. For adversarial training, some researchers [25,35,24,58,46,38,47] directly added adversarial examples generated by existing attacks to the training dataset. Considering the massive calculation consumption and low efficiency of these methods, researchers [104,105] proposed GAN-style approaches to train NLP models together with the adversarial example generator. The work in [106] presented a variation of Virtual Adversarial Training (VAT) to generate virtual adversarial examples and virtual labels in the embedding space. Additionally, researchers have paid attention to the issues of out-of-vocabulary (OOV) words [107], distribution difference [107], the diversified adversarial example requirement [108], and offensive language detection in the real world [109]. Although adversarial training can effectively improve the robustness of NLP models and overcome problems like adversarial example preparation and calculation consumption, it is likely to reduce their classification accuracy. For representation learning, researchers improved the input representation ability of NLP models. Some methods [110–112] introduced random perturbations to the input during the training step. Works in [38,41,48,49] improved the generalization of models by encoding inputs and their neighbors with the same representation. Other researchers designed more effective representation approaches by fine-tuning both local and global features [113], introducing disentangled representation learning [114], linking multi-head attention to structured embedding [115], and augmenting input sentences with their corresponding predicate-argument structures [116]. In general, randomizing the input and unifying the input representation are similar to passive defense; by contrast, designing effective representations is more challenging. Furthermore, current defense research is much scarcer than attack research, which reminds researchers that it is necessary to pay more attention to defense strategies against adversarial attacks.

1.7. Contribution and organization

Our contributions. This survey concentrates on adversarial attack and defense technology in the NLP field and provides a thorough and systematic review. The key contributions of this survey can be summarized as follows:

We comprehensively and systematically summarize textual adversarial attack and defense technology, elaborating on textual adversarial examples, adversarial attacks on texts, defenses against textual adversarial attacks, applications in various NLP tasks, and potential development directions in this domain.

We categorize current textual adversarial attacks according to the semantic granularity at the top level and further classify each class into several subclasses depending on the example generation strategy. To the best of our knowledge, we are the first to regard the example generation strategy as a classification criterion and propose this two-level classification for adversarial attacks.

We classify existing defense strategies against textual adversarial attacks as passive and active methods, simultaneously considering the victim model stages and the starting points of defense strategies. Compared with existing surveys, our classification method is the only one that considers both the input data and the victim model.

Organization. In Section 2, we introduce the causes of the adversarial example's emergence, the particularities of adversarial texts, the vectorization of textual data, and the evaluation metrics of adversarial texts. In Section 3, we classify textual adversarial attack methods from four aspects and elaborate on them according to the combination of semantic granularity and example generation strategy. In Section 4, we present benchmark datasets and applications of textual adversarial attacks. In Section 5, we summarize existing defense methods against adversarial attacks on texts. Finally, we conclude several significant challenges and potential development directions in Section 6.

2. Textual adversarial example

Szegedy et al. [17] proposed that adversarial images, generated by adding tiny noises to pixel values, can easily make an image classifier produce wrong predictions but do not affect the perception of human eyes. It was the first study on adversarial attack technology in the CV domain. In the NLP field, Papernot et al. [34] were the first to investigate adversarial attacks. They crafted adversarial texts invisible to human eyes by modifying characters or words in original texts. As shown in Fig. 3, a text correctly categorized by a sentiment analysis model is classified wrongly after replacing one word of it.

To let readers have a more precise and comprehensive understanding of the textual adversarial example, we first present possible reasons for the emergence of adversarial examples in Section 2.1. Then, we conclude the particularities of textual adversarial examples in Section 2.2. Later, we elaborate on the vectorization methods for discrete data in Section 2.3. Finally, we summarize the commonly used metrics for evaluating the effectiveness and imperceptibility of adversarial texts in Section 2.4. Note that the particularities described in this survey are unique to textual examples, and the general characteristics of adversarial examples in all fields are elaborated in detail in the literature [54].

2.1. Causes of adversarial example

Realizing the terrifying attack capability of adversarial examples, researchers have begun to explore the reasons for their existence. Although researchers have presented some assumptions, there is still no widely accepted explanation. In the early stage, researchers thought that the cause was the insufficient generalization ability of DNNs to predict unknown data due to over-fitting or inapposite regularization [67], while Szegedy et al. [17] proposed that the extreme nonlinear characteristic of DNNs leads to the existence of adversarial examples.

Later, Goodfellow et al. [19] verified the above two hypotheses. They input adversarial examples into a regularized DNN and discovered that the model's effectiveness in resisting adversarial examples was not significantly improved. Besides, by adding tiny perturbations to the input of a linear model, they showed that if the input dimension is sufficient, it is possible to construct adversarial examples with high confidence. Therefore, they proposed that the linear behavior of DNNs in high-dimensional space leads to the emergence of adversarial examples. In other words, since
Fig. 3. An instance of an adversarial example. Before the attack, the sentiment analysis model classified the original text into the negative class, in line with human judgment. After replacing the word "lack" with "dearth", the same model produces a wrong prediction.
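The flip in Fig. 3 can be mimicked with a toy black-box loop: try synonym substitutions one word at a time until the classifier's label changes. Both the classifier and the synonym table below are hypothetical stand-ins for illustration, not the model or lexicon used in the figure:

```python
# Hypothetical sentiment classifier: flags texts containing known negative cues.
NEGATIVE_CUES = {"lack", "boring", "bad"}

def toy_classifier(text):
    words = text.lower().split()
    return "negative" if any(w in NEGATIVE_CUES for w in words) else "positive"

# Hypothetical synonym table (the attack assumes such a resource exists).
SYNONYMS = {"lack": ["dearth", "absence"], "bad": ["poor"]}

def greedy_substitution_attack(text):
    """Swap words for synonyms until the predicted label flips."""
    original_label = toy_classifier(text)
    words = text.split()
    for i, w in enumerate(words):
        for syn in SYNONYMS.get(w.lower(), []):
            candidate = " ".join(words[:i] + [syn] + words[i + 1:])
            if toy_classifier(candidate) != original_label:
                return candidate
    return None  # no successful adversarial example found

print(greedy_substitution_attack("the film shows a lack of ideas"))
# -> "the film shows a dearth of ideas" (classified "positive" by the toy model)
```

A real word-level attack replaces the toy pieces with a trained victim model and an embedding- or thesaurus-based substitute set, and typically adds semantic-similarity constraints on the candidates.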
the nonlinear components of a DNN are linear in segments, tiny noise added to the input would persist in the same direction and accumulate at the end of the DNN, finally leading to a different output.

Differently, some hypotheses were proposed from the perspective of data characteristics instead of neural network structures. Tanay et al. [68] proposed a hypothesis called the tilted boundary. They demonstrated that since a DNN is usually unable to fit all data accurately, there is space for adversarial examples near the classification boundary of the DNN. Ilyas et al. [69] proposed that "adversarial examples are features". They thought that the destructiveness of adversarial examples results directly from the model's sensitivity to input features. Hence, they concluded that adversarial examples are not bugs but features that indicate how the DNN visualizes everything. Furthermore, they divided input features into robust and non-robust types and demonstrated that adding tiny perturbations to non-robust features could easily make DNNs produce incorrect outputs.

2.2. Particularities of adversarial text

As mentioned before, publications related to adversarial technology in the NLP field are far fewer than those in the CV field. The reason is that three extra constraints need to be considered when generating textual adversarial examples. Specifically:

Discrete. Unlike images represented by continuous pixel values, symbolic text is discrete. Therefore, finding appropriate perturbations is critical to efficient textual adversarial example generation. Researchers proposed two solutions for handling this issue. One is vectorizing discrete texts into continuous representations first (see specific methods in Section 2.3), and then using perturbation generation approaches for images to craft adversarial texts. The other is carefully designing perturbations directly in the discrete space.

Perceivable. Well-performing adversarial image generation methods are based on the premise that a few pixel value changes in an image are invisible to human eyes. However, a slight modification of a character or word is easily noticed by human eyes and spelling checkers. Hence, finding textual adversarial examples that are hard to observe by human eyes is vital for successful adversarial attacks.

Semantic. Compared with images, whose overall semantics do not change when a few pixel values change, the semantics of a text could be altered by even replacing or adding a character, violating the principle that adversarial examples should be imperceptible to humans. Therefore, keeping the semantics consistent is the key to crafting influential adversarial texts.

2.3. Vectorization of discrete data

Since additional constraints like grammatical legitimacy and semantic consistency need to be taken into account when designing adversarial example generation algorithms directly on discrete texts, researchers proposed to turn original texts into continuous vectors first, and then either apply methods designed for images or design new algorithms on the vectors to craft adversarial ones. Current text vectorization methods mainly include:

One-Hot encoding. The One-Hot encoding represents a character or a word by a vector in which only one element is one and all other elements are zero. When mapping a text to a vector x, the length of x is equal to the size of the vocabulary, and the element set to one is determined by the item's position in the vocabulary. Although this method is simple to implement, the One-Hot vector is usually sparse, semantically independent, and of very high dimension.

Word-count-based encoding. This approach initializes a zero vector with the length of the vocabulary size and then replaces each element with a specific value. The Bag-of-Words ignores the word order, grammar, and syntax of texts and regards a text as a collection of words. Thus, the Bag-of-Words sets each element of the initialized vector to the corresponding word count. The Term Frequency-Inverse Document Frequency considers the word importance determined by the frequencies of the word's appearance in the text and the corpus. Like the One-Hot encoding, the vector obtained by word-count-based encoding is sparse and semantically independent, but with relatively low dimensions.

N-gram encoding. The N-gram language model predicts the next likely word when given a text, based on the assumption that the occurrence of the n-th word is only related to the first n − 1 words. When n = 1, 2, 3, it is called a unigram, bigram, and trigram, respectively. The N-gram takes the word order into account, but with the increase of n, the vocabulary expands rapidly and the vector becomes sparse.

Dense encoding. The dense encoding provides a low-dimensional and distributed vector representation for discrete data. The Word2Vec [66] uses continuous bag-of-words and skip-gram models to produce a dense representation of words, called word embedding. Its basic assumption is that words appearing in similar contexts have similar meanings. Thus, the Word2Vec alleviates the discreteness and data-sparsity problems to some extent. Similarly, the Doc2Vec and Paragraph2Vec [61], two extensions of word embedding, encode sentences and paragraphs into dense vectors.

2.4. Evaluations of adversarial text

It is necessary to evaluate the effectiveness of adversarial examples in all fields. Additionally, the imperceptibility estimation of adversarial text is particularly significant for textual adversarial attacks.

Accuracy rate. It refers to the proportion of examples correctly classified by the victim model among total inputs. The lower the accuracy is, the more effective the attack is.

Euclidean Distance. For two given word vectors $\vec{m}$ and $\vec{n}$, the Euclidean Distance is expressed as:

$$D(\vec{m}, \vec{n}) = \sqrt{\sum_{i=1}^{k} (m_i - n_i)^2} \quad (1)$$

where $k$ is the dimension of the word vector, and $m_i$ and $n_i$ are the $i$-th factors of $\vec{m}$ and $\vec{n}$. The smaller the Euclidean Distance is, the more similar these two vectors are.

Cosine Distance. It is used to calculate the semantic similarity between two vectors. Compared with the Euclidean Distance, the Cosine Distance is more concerned with the difference in direction between two vectors. For two given word vectors $\vec{m}$ and $\vec{n}$, the cosine similarity is expressed as:

$$D(\vec{m}, \vec{n}) = \frac{\vec{m} \cdot \vec{n}}{\|\vec{m}\| \|\vec{n}\|} = \frac{\sum_{i=1}^{k} m_i n_i}{\sqrt{\sum_{i=1}^{k} (m_i)^2} \sqrt{\sum_{i=1}^{k} (n_i)^2}} \quad (2)$$

$$\text{s.t.} \quad \sum_{j=1}^{k} T_{ij} = d_i, \quad \forall i \in \{1, \ldots, k\} \quad (3)$$
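The One-Hot and Bag-of-Words encodings from Section 2.3 and the cosine similarity of Eq. (2) can be sketched in a few lines of plain Python; the toy corpus and function names are our own, not from any particular library:

```python
import math

# Toy corpus; the vocabulary maps each word to a fixed index.
corpus = ["the movie was good", "the plot was thin"]
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}  # word -> position in vocabulary

def one_hot(word):
    """One-Hot: a |V|-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(sentence):
    """Bag-of-Words: word counts, ignoring order, grammar, and syntax."""
    vec = [0] * len(vocab)
    for w in sentence.split():
        vec[index[w]] += 1
    return vec

def cosine_similarity(m, n):
    """Eq. (2): D(m, n) = (m . n) / (||m|| ||n||)."""
    dot = sum(mi * ni for mi, ni in zip(m, n))
    norm_m = math.sqrt(sum(mi * mi for mi in m))
    norm_n = math.sqrt(sum(ni * ni for ni in n))
    return dot / (norm_m * norm_n)

v1 = bag_of_words("the movie was good")
v2 = bag_of_words("the plot was thin")
print(round(cosine_similarity(v1, v1), 3))  # identical vectors -> 1.0
print(round(cosine_similarity(v1, v2), 3))  # only shared words overlap -> 0.5
```

Note how the two sentences score 0.5 purely from the shared words "the" and "was": count-based vectors capture lexical overlap, not meaning, which is exactly the limitation that the dense encodings above address.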
Fig. 4. Categorization of textual adversarial attack methods. There are four criteria: model access, target type, semantic granularity, and example generation strategy.
Non-targeted attack. The adversary hopes to generate an adversarial example x' that makes the victim model f produce a wrong output f(x') ≠ y, where y is the correct label of the input x. Since there is no limit on the target wrong output, this kind of attack is more frequently employed, such as the attacks in [39,119,40].

Targeted attack. In this scenario, the adversary intends to generate adversarial examples that make victim models output a specified wrong prediction. More specifically, the adversary hopes that the generated example x' causes the victim model f to output t = f(x'), where t is the output specified by the adversary. For instance, Ebrahimi et al. [35] considered that some corruptions of a neural machine translation model's output might be much worse than others; translating "good morning" as "attack them" is much worse than translating it as "fish bicycle". Therefore, they proposed a controlled attack that aims to remove a specific word from the translation, and a targeted attack that tries to mute a specific word and replace it with another. Thus, targeted attacks are more aggressive on deep neural networks than non-targeted ones. However, they are more challenging to implement since they contain more constraints than non-targeted ones.

3.1.3. Semantic granularity

Considering the type of objects modified by the adversary, textual adversarial attacks can be divided into four types: character-level, word-level, sentence-level, and multi-level attacks. The character-level attack refers to inserting, removing, flipping, swapping, or replacing characters in a text, and the word-level attack performs the same operations on words. The sentence-level attack usually inserts extra distractor sentences, generates paraphrases, or modifies the original sentence structure while preserving semantic consistency. Besides, the multi-level attack simultaneously employs two or three of the character-level, word-level, and sentence-level attacks.

3.1.4. Example generation strategy

According to the different strategies used in the adversarial example generation process, we divide adversarial attacks into six types: gradient-based, optimization-based, importance-based, edit-based, paraphrase-based, and generative model-based methods. Among them, strategies like the gradient-based method evolved from adversarial image generation methods, and the implementation process of these attacks is usually relatively straightforward. Other methods, like the optimization-based and edit-based methods, are proposed specifically for discrete data; they generally show better performance in maintaining semantic consistency and grammatical correctness, but designing well-tuned algorithms for them is enormously difficult.

Gradient-based. These methods calculate the forward derivative with respect to the input and obtain adversarial perturbations by gradient backpropagation. Therefore, the vectorization of the text needs to be implemented first. Besides, spelling and grammatical errors commonly exist in the generated texts, causing the adversarial examples to be perceived.

Optimization-based. This strategy regards adversarial example generation as a minimax optimization problem, i.e., maximizing the victim model's prediction error while keeping the difference between the adversarial example and the original one within an acceptable range. Currently, researchers craft adversarial texts essentially based on evolutionary algorithms, such as the GA [74] and PSO [75].

Importance-based. Here, which object is to be modified and how to modify it are determined by each object's importance to the victim model: the more critical the changed word is, the easier it is to change the prediction of the victim model, even if the change is small. The adversarial example generated by this strategy generally maintains semantic consistency as well as grammatical and syntactic correctness.

Edit-based. This strategy crafts adversarial examples by operations like inserting, removing, and swapping characters, words, or sentences. These editing operations are also used in other approaches, such as gradient-based, optimization-based, and importance-based methods. Therefore, the edit-based method in this survey refers to attacks that utilize the above editing operations but do not use gradient information, an optimization algorithm, or item importance.

Paraphrase-based. The adversary takes the paraphrase of a sentence as its adversarial example. In the paraphrase generation process, the adversary introduces different extra conditions to fool the victim model without affecting human judgment. Sentence-level attacks commonly use these approaches.

Generative model-based. This method uses generative models like the GAN [95] and encoder-decoder models to generate adversarial texts, and it is frequently used in sentence-level attacks. Since there are gradient backpropagations when training the generative model or crafting adversarial examples, these methods are usually combined with other techniques, such as RL [96].

Since the semantic granularity is unique to textual adversarial attack classification, and the example generation strategy indicates the principal idea of generating adversarial examples, we suggest a novel two-level classification method for the textual adversarial attack taxonomy. We first categorize adversarial attack methods into four classes according to the semantic granularity at the top level. Then, we subdivide each kind of method according to the example generation strategy. Sections 3.2–3.5 show the details.

3.2. Character-level attack

As mentioned before, the main idea of character-level attacks is to insert, delete, flip, replace, or swap individual characters, special symbols, or figures in a text. These attacks generally use three example generation strategies: gradient-based, importance-based, and edit-based methods, as summarized in Table 2. Adversarial texts crafted by these methods often have apparent grammar or spelling errors, making it easy for human eyes or misspelling checkers to observe these malicious texts.

3.2.1. Gradient-based method

The white-box adversary in [35] calculated gradient information to rank adversarial manipulations and used greedy search and beam search to find adversarial examples. Furthermore, it forced the adversary to remove or change a specific word in a translation through two novel loss functions. In contrast, the black-box adversary in [35] just randomly selected characters and made suitable modifications to them. Since the implicit knowledge in an optimization process can be distilled into another, more efficient neural network, DISTFLIP [60] converts a white-box attack into a black-box one by training a neural network that imitates the adversarial example generation process of HotFlip [25]. Owing to its independence from the optimization algorithm, DISTFLIP is ten times faster than HotFlip when crafting adversarial texts.
Table 2
Summary of Character-level Adversarial Attacks in NLP.
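The token-importance idea behind character-level attacks such as DeepWordBug can be sketched with a deletion-based score: remove each token in turn and measure the drop in the victim model's class probability. The toy model below is our own hypothetical stand-in, not DeepWordBug's actual scoring functions:

```python
POSITIVE_CUES = {"good", "great", "excellent"}

def toy_model(text):
    """Hypothetical victim model: probability of the 'positive' class."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w in POSITIVE_CUES for w in words) / len(words)

def rank_tokens_by_importance(text):
    """Score each token by the probability drop its deletion causes."""
    words = text.split()
    base = toy_model(text)
    scored = []
    for i, w in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        scored.append((base - toy_model(reduced), w))
    return sorted(scored, reverse=True)  # most important tokens first

ranking = rank_tokens_by_importance("the movie was good")
print(ranking[0][1])  # the token whose deletion hurts the score most: "good"
```

DeepWordBug's actual scoring functions (Replace-1, Temporal Head, Temporal Tail, and Combination) compare model outputs under token replacement or truncation rather than deletion; the sketch above only conveys the shared idea of ranking tokens by their effect on the prediction before perturbing the top-ranked one.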
However, it needs further verification whether DISTFLIP distills all the knowledge of a white-box attack.

3.2.2. Importance-based method

The importance-based DeepWordBug [45] includes two procedures: determining critical tokens and modifying them slightly. To find important tokens, DeepWordBug uses four token scoring functions: Replace-1 Score, Temporal Head Score, Temporal Tail Score, and Combination Score. In the second phase, character-level transformations, such as swap, substitution, deletion, and insertion, are applied to the highest-ranked token to minimize the edit distance of the perturbation. Although DeepWordBug successfully generates adversarial texts, most of the introduced perturbations are constricted to misspellings.

3.2.3. Edit-based method

Belinkov et al. [38] utilized natural and synthetic noise to replace corresponding correct words in a text. They collected natural noises like typos and misspellings from different datasets. Moreover, they crafted synthetic noises through four operations: exchanging characters, randomizing all but the first and last characters of a word, randomizing all characters of a word, and replacing a character with its neighbor on the keyboard. Furthermore, considering the realization in typical application scenarios such

and semantic changes. TextFool [65] uses the magnitude of the victim model's cost gradient to determine what, where, and how to insert, replace, or remove words in white-box scenarios. Besides, it utilizes three modification strategies (insertion, modification, and removal) and adopts the natural language watermarking technique to dress up a given text elaborately. However, TextFool requires massive manual execution. AdvGen [119] leverages the gradient itself. It considers the final translation loss of the victim model and the distance between a word and its adversarial counterpart. Besides, it applies a language model to identify possible substitutes for a given word to enhance semantic preservation.

Since ensuring that the generated vector maps to a readable word is critical, Michel et al. [70] demonstrated that "adversarial examples should be meaning-preserving on the source side, but meaning-destroying on the target side" for non-targeted attacks. They proposed gradient-based word substitution methods with kNN and CharSwap constraints, respectively. Seq2Sick [64] is a projected gradient method combined with group lasso and gradient regularization. The non-overlapping attack requires adversarial examples to share no overlapping words with the original one, while the targeted keyword attack requires adversarial examples to contain all given targeted keywords. Thus, Seq2Sick applies a hinge-like loss function optimized at the logit layer and an additional mask function m_t for them, as shown below:
as social media, Eger et al. [39] proposed the first large-scale cata-
X
M n o
max ; zt t max zt
log and benchmark of low-level adversarial attack, called Zéroe. It ðs Þ ð yÞ
Lnonov erlapping ¼ ð5Þ
encompasses nine attack modes, including visual and phonetic y–st
t¼1
adversaries: inner-shuffle, full-shuffle, intrude, disemvowel, trun-
cate, segment, typo, natural noise, phonetic, and visual. X
jKj n o
min mt max ; max zt
ð yÞ ðk Þ
Lkeywords ¼ zt i ð6Þ
t2½M y–ki
3.3. Word-level attack i¼1
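As a concrete illustration, the hinge-like non-overlapping loss of Eq. (5) can be sketched in plain Python. The logit matrix, the margin value `eps`, and the function name below are illustrative assumptions for this survey, not the authors' released implementation.

```python
def non_overlapping_loss(logits, orig_ids, eps=1.0):
    """Sketch of Seq2Sick's non-overlapping hinge loss (Eq. (5)).

    logits:   list of M rows, one per decoding step t; each row holds the
              decoder logits z_t over the output vocabulary
    orig_ids: list of M original output tokens s_t
    The sum shrinks (down to -eps per step) as, at every step t, some token
    other than s_t out-scores the original token s_t.
    """
    total = 0.0
    for z_t, s_t in zip(logits, orig_ids):
        # highest logit among all tokens except the original one
        best_other = max(z for y, z in enumerate(z_t) if y != s_t)
        total += max(-eps, z_t[s_t] - best_other)
    return total

# toy run: 2 decoding steps over a 4-token vocabulary; the original tokens
# (ids 0 and 2) still dominate, so the loss is far from its minimum of -2.0
loss = non_overlapping_loss([[3.0, 0.5, 0.1, 0.2],
                             [0.1, 0.2, 2.5, 0.3]], [0, 2])
```

Minimizing this quantity over the input perturbation pushes every decoding step away from its original token, which is the non-overlapping objective; Eq. (6) additionally applies the mask $m_t$ and a minimum over positions for the targeted keywords.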
Table 3
Summary of Word-level Adversarial Attacks in NLP.
the adversarial example to the nearest meaningful word vector. Additionally, Song et al. [123] proposed a three-step adversarial example generation method for optical character recognition models. It first finds words and their antonyms in WordNet [125] and retains only the valid and semantically consistent antonyms satisfying the edit distance threshold. Then, it locates lines containing the above words in the clean image and transforms the target words in these lines to appropriate ones through the $L_2$-norm distance. Finally, it replaces images of the corresponding lines in the text image with adversarial ones.

Later, researchers employed various evolutionary algorithms in the adversarial text generation procedure. Alzantot et al. [40] utilized the GA [74] to select words randomly and find their nearest neighbors. They ranked and substituted the selected words to maximize the target label's probability. Wang et al. [41] improved the work in [40] by allowing the words in a given sentence to be modified multiple times. Maheshwary et al. [71] leveraged a GA-based approach in a hard-label black-box setting. Mathai et al. [72] proposed a GA-based optimization method with a multi-objective strategy. However, randomly selecting words for substitution in the above methods is full of uncertainty, making some changes meaningless for the target label. Differently, Zang et al. [47] leveraged the PSO [75] to determine the word to be modified. They further demonstrated that a substitute found with word embeddings and a language model is not always semantically consistent with the replaced word or suitable for the context. In addition, they proposed a sememe-based word substitution method. The adversarial example generation procedure of this method is shown in Fig. 5.

Fig. 5. The procedure of method [47]. It first uses the sememe-based word replacement method to exclude the invalid or low-quality substitutes. Thus, the remaining ones form the reduced search space. Then, it uses the PSO-based search method to efficiently find adversarial examples in the reduced search space.

Besides, considering that second-language speakers and many first-language dialect speakers frequently exhibit variability in their production of inflectional morphology, MORPHEUS [76] maximally increases the prediction loss by greedily searching for the inflectional form of each noun, verb, or adjective in a given text. Xu et al. [77] showed that optimizing the worst-case loss function over all possible substitutions is prone to weighing unlikely substitutions higher and limiting the accuracy gain. Thus, they proposed a Metric Differential Privacy mechanism, which samples k values from a truncated Poisson distribution as substitution candidates to ensure that nearby words with irrelevant meanings are disregarded. For a given privacy parameter, an irrelevant word could have a similar substitution probability as a relevant word. Hence, this method achieves different degrees of semantic preservation.

More generally, considering the inapplicability of massive queries in the real world, Zang et al. [37] introduced RL to learn from the attack history. They regarded two operations (identifying vital words to be substituted, and selecting an appropriate substitute to replace the identified vital word) as the action. They then used the Policy Gradient to learn the policy under which an adversarial example is crafted by taking a series of actions. This method can theoretically be combined with any candidate substitute method. Yuan et al. [73] systematically studied the transferability of adversarial attacks. They leveraged the GA to find an optimal ensemble with the minimum number of model members to generate adversarial texts that strongly transfer to other victim models. Further, they generalized adversarial examples constructed by the
ensemble method into universal semantics-preserving word replacement rules, which induce adversaries on any text input.

3.3.3. Importance-based method

For generating semantics-preserving texts with minimum modifications, Samanta et al. [78] first ranked words in descending order according to the class probability changes, which were obtained by removing words one by one. Then, they modified the input with a removal-addition-replacement strategy. It works best for datasets such as the Internet Movie Database (IMDB), which has sub-categories within each class. Later, Jin et al. [79] and Maheshwary et al. [80] adopted the keyword ranking method in [78]. The difference is that Jin et al. [79] utilized three strategies (synonym extraction, part-of-speech checking, and semantic similarity checking) to replace words with the most semantically similar and grammatically correct substitutes, while Maheshwary et al. [80] further considered the original word and its surrounding context when searching for substitute candidates. Besides, Ren et al. [81] proposed a synonym substitution based Probabilistic Weighted Word Saliency (PWWS) method, which determines the word replacement order by the word saliency and the classification probability. The former reflects the importance of the original word to the classification probability; the latter indicates the attack performance of the substitute. This method performs well in effectively reducing substitutes.

To handle the problem of computational complexity and query consumption, Explain2Attack [82] employs cross-domain interpretability to obtain word importance in black-box scenarios. It first builds an interpretable substitute model that imitates the victim model's behavior, and then uses the interpretability capability to produce word importance scores. It reduces computational complexity and query consumption to a large extent while ensuring the attack success rate.

More generally, Yang et al. [83] proposed a systematic probabilistic framework, in which critical features are identified first and then perturbed with values chosen from a dictionary. As two instantiations of this framework, Greedy Attack crafts single-feature perturbed inputs that achieve a higher success rate, while Gumbel Attack learns a parametric sampling distribution and requires fewer model evaluations, leading to better efficiency in real-time or large-scale attacks.

3.3.4. Edit-based method

The Metropolis-Hastings Attack (MHA) [84] employs the Metropolis-Hastings sampling approach to replace old words and random words. The only difference between the black-box and white-box MHA is the pre-selection function used for selecting the most likely word to modify. Compared with previous language generation models using the Metropolis-Hastings sampling approach, the black-box MHA's stationary distribution is equipped with a language model term and an adversarial attacking term, making the adversarial text generation fluent and effective.

Furthermore, Li et al. [85] built their model on a pre-trained masked language model and modified the input in a context-aware manner. They proposed three contextualized perturbations (replace, insert, and merge) and used a mask-then-infill procedure to generate fluent and grammatical adversarial texts with varied lengths.

Emelin et al. [86] exploited word sense disambiguation bias in neural machine translation models for model-agnostic adversarial attacks. Word sense disambiguation is a well-known source of translation errors in neural machine translation tasks. They argued that some incorrect disambiguation choices result from models' over-reliance on dataset artifacts found in the training data, specifically superficial word co-occurrences. Besides, they minimally perturbed sentences to elicit disambiguation errors to probe the robustness of translation models. The method does not require access to gradient information or the score distribution of the decoder.

3.4. Sentence-level attack

The sentence-level attack takes the sentence as the object and includes operations like inserting new sentences, generating paraphrases, and even changing the sentence structure. Slightly different from the former two kinds of attacks, the sentence-level attack frequently employs five example generation strategies: gradient-based, optimization-based, edit-based, paraphrase-based, and generative model-based methods, as shown in Table 4. Adversarial examples generated by these methods are semantics-preserving and full of diversity, but some of them have reduced readability caused by adding meaningless token sequences.

Table 4
Summary of Sentence-level Adversarial Attacks in NLP.

3.4.1. Gradient-based method

The universal adversarial perturbation is a particular noise, combined with which any text can fool NLP models with a high probability. Behjati et al. [87] were the first to investigate universal adversarial perturbations in the NLP field. They used an iterative projected gradient-based approach on the embedding space to craft a sequence of words. Then, they applied the generated sequence to any input sequence in the corresponding domain. The Natural Universal Trigger Search method [88] is based on an adversarially regularized autoencoder [126]. To avoid out-of-distribution noise vectors and maintain the naturalness of the generated texts, this method leveraged projected gradient descent with $l_2$ regularization.

Furthermore, several gradient-guided methods based on HotFlip [25] have been proposed to generate universal adversarial triggers. For instance, Wallace et al. [89] first specified the trigger length and initialized a sequence trigger, then replaced tokens of the initialized sequence through HotFlip, and finally added the generated trigger to the beginning or end of the given text. Because a long trigger is more effective but also more noticeable than a shorter one, the trigger length is an important criterion in this method. Atanasova et al. [90] focused on ensuring the semantic validity of adversarial texts. They extended HotFlip by jointly minimizing a fact-checking model's target class loss and an auxiliary natural language inference model's entailment class loss to generate universal triggers. Then, the generated universal triggers were input to a conditional language model trained using a GPT-2 model. Their method effectively crafts semantically valid statements containing at least one trigger.

3.4.2. Optimization-based method

Minervini et al. [118] studied the automatic generation of adversarial examples that violate a set of given First-Order Logic constraints in the natural language inference task. They regarded this issue as a combinatorial optimization problem. They generated linguistically plausible texts by using language models and maximizing a quantity that measures the violation degree of such constraints. Specifically, they maximized the inconsistency loss $\mathcal{J}_I$ to search for the substitution set $S$ (i.e., adversarial examples) using the following language model:

$$\max_{S}\; \mathcal{J}_I(S) = \big[ p(S, \mathrm{body}) - p(S, \mathrm{head}) \big]_{+} \quad \text{s.t.} \quad \log p_L(S) \le \tau \qquad (8)$$

Here, $p_L(S)$ represents the probability of a sentence in $S = \{X_1 \to s_1, \ldots, X_n \to s_n\}$; $\tau$ is a threshold on the perplexity of the generated sequences. $p(S, \mathrm{body})$ and $p(S, \mathrm{head})$ are the probabilities of the given rule, after replacing $X_i$ with the corresponding sentence $S_i$; body and head represent the premise and the conclusion of the natural language inference rules.

3.4.3. Edit-based method

Jia et al. [58] proposed ADDSENT and ADDANY, both of which generate distractor sentences that confuse models but neither contradict the correct answer nor confuse humans. ADDSENT generates a sentence that looks similar to the question but does not contradict the correct answer, and then adds the generated sentence to the end of the given paragraph. Its variant ADDONESENT adds a random human-approved sentence. On the contrary, ADDANY does not consider the sentence's grammar and adds an arbitrary sequence of English words by querying the victim model many times. Unlike ADDSENT and ADDANY, which both try to incorporate words from the question into the adversarial sentence, ADDCOMMON, a variant of ADDANY, only uses common words in the adversarial sentence.

Based on the work in [58], Wang et al. [46] proposed ADDSENTDIVERSE, an improvement of ADDSENT, to craft adversarial examples with a higher variance, where distractors have randomized placements, leading to the expansion of the fake answer set. To address the antonym-style semantic perturbations used in ADDSENT, they added semantic relationship features to enable the model to identify the semantic relationship among question contexts with the help of WordNet. Further, Nizar et al. [59] approximated a black-box victim model via model extraction [127], and then used ADDANY-KBEST, a variant of ADDANY, to craft adversarial examples.

3.4.4. Paraphrase-based method

Some researchers have generated the paraphrase of a given text as its adversarial example. Ribeiro et al. [91] iteratively generated paraphrases for an input sentence and obtained the victim model prediction until the prediction was changed. They proposed a semantic-equivalent rule-based method to generalize these generated examples into semantically equivalent rules for understanding and fixing the most impactful bugs. When generating the paraphrases, controlled perturbations are incorporated. Iyyer et al. [63] proposed an encoder-decoder based SCPNs method. For a
given sentence and a corresponding target syntax structure, SCPNs first encodes the sentence by a bidirectional Long Short-Term Memory (LSTM) model, and then inputs the interpretation and the targeted syntactic tree into the LSTM model for decoding to obtain the targeted paraphrase of the given sentence. In the decoding procedure, the soft attention [92] and copy mechanism [93] are further introduced. Although SCPNs uses the target strategy, it does not specify the target output. Besides, adversarial texts generated by SCPNs can effectively improve the pre-trained model's robustness to syntactic changes.

Some other researchers generated adversarial examples by expanding the original sentence. For example, AdvExpander [42] first uses linguistic rules to determine which constituents (word or phrase) to expand and what types of modifiers to expand with. Then, it expands each constituent by inserting an adversarial modifier searched from a conditional variational autoencoder based generative model [128] that is pre-trained on the Billion Word Benchmark. This method differs from existing substitution-based methods and introduces rich linguistic variations to adversarial texts.

3.4.5. Generative model-based method

Zhao et al. [94] proposed a GAN-based framework to craft efficient and natural adversarial texts. This framework consists of two main components: a GAN for generating fake data, and a converter for mapping the input to its latent representation $z_0$. The two components are trained on the original input by minimizing the reconstruction errors between original and adversarial examples. The perturbation is performed in the dense latent space by identifying a perturbed example $\hat{z}$ in the neighborhood of $z_0$. Two search approaches (iterative stochastic search, and hybrid shrinking search) were used to identify the proper $\hat{z}$. This method is appropriate for both image and textual data, as it intrinsically eliminates the problem raised by the discrete attribute of textual data. However, because the victim model must be queried each time to find a $\hat{z}$ that makes the model produce an incorrect prediction, this method is quite time-consuming. Furthermore, Wong et al. [36] proposed a GAN-based framework that utilizes RL to guide the training of the GAN. Thus, there is no converter but an autoencoder, which judges the semantic similarity between original texts and adversarial ones. However, it is restricted to binary text classifiers.

Different from the above methods, Wang et al. [97] proposed CAT-Gen. Given a text, CAT-Gen generates adversarial texts through controllable attributes known to be irrelevant to the task label. As shown in Fig. 6, regarding the product category as a controllable attribute for the sentiment analysis task, CAT-Gen first pre-trains an encoder and a decoder to copy the input sentence $x$ with the attribute $a$, and pre-trains an attribute classifier using an auxiliary dataset. Then, given a desired attribute $a' \neq a$, it uses the attribute classifier to train the decoder to enable the model to generate an output with the attribute $a'$. Finally, it searches through the whole attribute space of $a' \neq a$ and looks for the $a'$ that maximizes the cross-entropy loss between the prediction and the ground-truth task label.

3.5. Multi-level attack

The objects modified in multi-level attacks are two or three of the character, word, and sentence levels. According to the primary strategy used in multi-level attacks, these approaches are divided into gradient-based, optimization-based, importance-based, edit-based, and generative model-based methods, as shown in Table 5. These methods generate various adversarial examples to a certain extent, but they always have more constraints.

3.5.1. Gradient-based method

HotFlip [25] modifies the characters and words in a given text. It performs an atomic flip operation relying on gradient computation on the one-hot representation. For character-level attacks, the flip operation is represented as:

$$\vec{v}_{ijb} = \big( \vec{0}, \ldots; (0, \ldots, 0, -1, 0, \ldots, 1, 0)_j; \ldots; \vec{0} \big)_i \qquad (9)$$

The above formula means that the j-th character of the i-th word in a text is changed from a to b, which are the characters at the a-th and b-th places in the alphabet (where the $-1$ and $1$ sit, respectively). The change in the directional derivative along this vector is calculated to find the biggest increase in the loss $J(x, y)$. The calculation process is shown below:

$$\max_{ijb} \nabla_x J(x, y)^{T} \cdot \vec{v}_{ijb} = \max_{ijb} \left( \frac{\partial J^{(b)}}{\partial x_{ij}} - \frac{\partial J^{(a)}}{\partial x_{ij}} \right) \qquad (10)$$

where $x_{ij}$ is a one-hot vector that denotes the j-th character of the i-th word. For word-level attacks, HotFlip is further used with a few semantics-preserving constraints like cosine similarity. However, with one or two flips under strict constraints, HotFlip only generates a few successful adversarial examples, making it unsuitable for a large-scale attack.

3.5.2. Optimization-based method

For crafting adversarial texts with sentence-level rewriting, Xu et al. [43] first designed a sampling method, called RewritingSampler, to efficiently rewrite the original sentence in multiple ways. Then, they allowed for both word-level and sentence-level changes. To constrain the semantic similarity and grammatical quality, they employed the word embedding sum and a GPT-2 model [62].

3.5.3. Importance-based method

TextBugger [24] is used in both black-box and white-box scenarios. In white-box scenarios, it is also a gradient-based method, as it first uses the Jacobian matrix $J$ to calculate the importance of each word, as below:

$$C_{x_i} = J_{F(i, y)} = \frac{\partial F_y(x)}{\partial x_i} \qquad (11)$$

Here, $F_y(\cdot)$ represents the confidence score of class $y$, and $C_{x_i}$ is the importance score of the i-th word in the input $x$. Then, TextBugger uses five editing strategies (insertion, deletion, swap, substitution with visually similar words, and substitution with semantically similar words) to generate character-level and word-level adversarial texts.

In black-box scenarios, TextBugger first splits the document into sequences. Then, it queries the victim model to filter out sentences with predictions different from the original labels, sorts these sequences in reverse order according to their confidence, and calculates the word importance score by deletion through the following equation:

$$C_{x_i} = F_y(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - F_y(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \qquad (12)$$

Finally, the same editing operations as in the white-box attack are used to modify texts. In TextBugger, only a few editing operations mislead the classifier.

Since previous work focused exclusively on semantic tasks, Zheng et al. [44] were the first to explore adversarial attacks in syntactic tasks. They focused on the dependency parsing model and constructed syntactic adversarial examples at both the sentence and phrase levels. They followed a two-step procedure: 1) choosing weak spots (or positions) to change; 2) modifying them to maximize the victim model's errors in both black-box and white-box scenarios.
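TextBugger's deletion-based importance score of Eq. (12) is straightforward to sketch in plain Python; the toy scoring function below stands in for the victim model's confidence $F_y$ and is purely an assumption for illustration.

```python
def word_importance(words, score_fn):
    """Black-box word importance in the style of TextBugger's Eq. (12):
    C_{x_i} = F_y(x) - F_y(x with the i-th word deleted), estimated by
    removing each word in turn and re-querying the victim model score_fn."""
    base = score_fn(words)
    return [base - score_fn(words[:i] + words[i + 1:]) for i in range(len(words))]

# hypothetical victim "model": confidence = share of positive words
POSITIVE = {"great", "fun"}
def toy_score(words):
    return sum(w in POSITIVE for w in words) / max(len(words), 1)

scores = word_importance(["a", "great", "movie"], toy_score)
# "great" receives the highest score, so it would be edited first
```

Each score costs one extra model query, which is why the black-box variant first filters and sorts sentences before scoring individual words.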
Fig. 6. Overview of CAT-Gen training process in [97]. The encoder, decoder, projector, and attribute classifier are pre-trained in advance.
Table 5
Summary of Multi-level Adversarial Attacks in NLP.
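Relatedly, the first-order flip selection that HotFlip performs in Eqs. (9) and (10) (Section 3.5.1) amounts to scanning the gradient of the loss with respect to the one-hot character inputs; the nested-list layout and names here are a simplified sketch, not the original implementation.

```python
def best_flip(grad, text_ids):
    """Pick the character flip with the largest first-order loss increase,
    in the spirit of HotFlip's Eq. (10).

    grad:     grad[i][j][c] = dJ/dx for the one-hot slot of character c at
              position j of word i
    text_ids: text_ids[i][j] = alphabet index a of the current character
    Returns ((i, j, b), gain): flipping character (i, j) from a to b is
    estimated to raise the loss by grad[i][j][b] - grad[i][j][a].
    """
    best, best_gain = None, float("-inf")
    for i, word in enumerate(text_ids):
        for j, a in enumerate(word):
            for b, g_b in enumerate(grad[i][j]):
                if b == a:
                    continue  # not a flip
                gain = g_b - grad[i][j][a]
                if gain > best_gain:
                    best, best_gain = (i, j, b), gain
    return best, best_gain

# one word of two characters over a 3-letter alphabet
flip, gain = best_flip([[[0.1, 0.9, 0.0], [0.2, 0.2, 0.8]]], [[0, 1]])
```

A greedy or beam search then applies the chosen flip and repeats, subject to the semantics-preserving constraints mentioned in Section 3.5.1.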
3.5.4. Edit-based method

Niu et al. [98] paid attention to both Should-Not-Change and Should-Change adversarial strategies. The Should-Not-Change strategy includes four edits: transposing neighboring tokens randomly, removing stopwords randomly, replacing words with their paraphrases, and using grammar errors like changing a verb to the wrong tense. The Should-Change strategy contains two methods: negating the root verb of the source input, and changing verbs, adjectives, or adverbs to their antonyms.

Blohm et al. [99] implemented four adversarial text generation methods. For black-box attacks, they performed a word-level attack by manually replacing original words with substitutes from pre-trained GloVe embeddings, and used ADDANY [58] as the sentence-level attack. For white-box attacks, they leveraged the model's sentence-level attention distribution to find the plot sentence receiving the greatest attention. In the word-level attack, the k words receiving the most attention in the plot sentence were exchanged with randomly chosen words. Finally, in the sentence-level attack, the whole plot sentence was removed.

3.5.5. Generative model-based method

Vijayaraghavan et al. [100] developed a hybrid encoder-decoder model, called the Adversarial Examples Generator. It consists of two components: an encoder and a decoder. The encoder, which is a slight variant of Chen et al. [129], maps the input sequence to a representation using word and character-level information. The decoder has two-level Gated Recurrent Units (GRU): a word-GRU and a character-GRU. For training the model, they used the self-critical approach of Rennie et al. [130] as their policy gradient training algorithm. Compared with Wong et al. [36], the most outstanding advantage of this method is introducing both word and character-level perturbations.

3.6. Comparison and analysis of attacks

To give readers a more intuitive understanding of these attack methods, this subsection first shows the adversarial examples generated by several representative attacks, and then compares the attack performance and query time consumption of these methods. Specifically, the BERT model [131] pre-trained on the Stanford Sentiment Treebank (SST) dataset is the victim model, which performs the sentiment analysis task and outputs probabilities for the labels Positive and Negative. We utilize 1 character-level attack (DeepWordBug [45]), 3 word-level attacks (Probabilistic Weighted Word Saliency (PWWS) [81], TextFooler [79], and BERT-based attack [132]), 2 sentence-level attacks (Syntactically Controlled Paraphrase Networks (SCPNs) [63], and GAN-based attack [94]), and 1 multi-level attack (TextBugger [24]) to attack the chosen victim model. Fig. 7 and Table 6 show the generated adversarial examples and the attack performances of these methods, respectively.

From the perspective of example quality, character-level attack methods maintain the semantics of the original texts well. However, they are easily detected by human eyes or spelling check tools. For example, in TextBugger, the words "crahm", "aovids", "obv1ous", and "hmour" in the adversarial example are easily observed. Nevertheless, they are less likely to affect the human eye's judgment of the emotions of the whole text. In contrast, word-level attacks compensate for the vulnerability of adversarial examples to detection but affect the semantics of the text to some extent. For instance, in the text "division of the spell of satin rouge is that it void the obvious with body and weightlessness." generated by PWWS, the
Fig. 7. Adversarial examples generated by several attacks. The original examples are two randomly selected from the SST dataset, and both of them are correctly classified as
Positive by the pre-trained BERT model.
the lowest, but their attack success rates are not satisfactory. As mentioned above, the differences between the adversarial examples generated by sentence-level methods and the original ones are relatively large, so researchers should focus on maintaining the semantic consistency and imperceptibility of texts for sentence-level methods. In addition, the model queries required by the word-level attacks (PWWS, TextFooler, and BERT-based attack) are comparatively numerous. Therefore, it is worthwhile for researchers to investigate how to reduce the number of queries of such methods.

4. Textual adversarial attack application

To explore the potential adversarial attack threats faced by existing NLP intelligent systems and to further provide a basis for developing efficient defense strategies for these NLP models, several researchers have applied various textual adversarial attacks to extensive NLP models. These diverse NLP tasks, such as classification, neural machine translation, text entailment, and dialogue generation, are extremely different and are generally implemented on distinct datasets; thus, the adversarial attack methods applied to these different NLP tasks have their own characteristics. Therefore, this section reviews current works on textual adversarial attacks from the application perspective and summarizes them in Table 7. In detail, Section 4.1 presents popular benchmark datasets in the NLP field, and Section 4.2 elaborates on various applications of textual adversarial attacks.

4.1. Benchmark dataset

Since there are massive benchmark datasets for different NLP tasks, this section gives a brief introduction to the applicable task, dataset size, characteristics, and source of the primary datasets in Table 7.

AG's News:2 used for text classification. It consists of over 1 million news articles collected from more than 2,000 news sources by an academic news search engine called Cometomyhead. In total, it includes 120,000 training examples and 7,600 test examples, which come from four categories of the same scale: World, Sport, Business, and Technology; each category has 30,000 training examples and 1,900 test examples. The provided DB version and XML version are available for download for non-commercial use.

DBPedia Ontology:3 used for text classification. It is a dataset with structured content from the information created in various Wikimedia projects. This dataset contains 560,000 training examples and 70,000 testing examples of 14 high-level classes, such as Company, Building, and Film. It has more than 685 classes represented by a subsumption hierarchy structure and is described by 2,795 different attributes.

Yahoo! Answers:4 used for text classification. It contains 4 million question-answer pairs, and it can be used in question-answering systems. Furthermore, it includes ten classification categories obtained from Yahoo! Answers Comprehensive Questions and Answers 1.0, and each class contains 140,000 training examples and 5,000 test examples.

Stanford Sentiment Treebank (SST):5 used for sentiment analysis. It includes 239,232 sentences and phrases, whose syntax varies greatly. Compared with most other datasets that ignore word order, SST establishes a complete representation based on [...]ing to the phrases of words. However, the average example length of SST is only about 17 words.

Internet Movie Database (IMDB):6 used for sentiment analysis. It is crawled from the Internet, including 50,000 highly polarized movie reviews with tags (positive or negative sentiment) and the URLs from which the reviews came. Among them, 25,000 examples are used for training and 25,000 for testing. The average length of the reviews is nearly 234 words, and the size of this dataset is bigger than most similar datasets. Besides, IMDB also contains additional unlabeled data, original texts, and processed data.

Rotten Tomatoes Movie Review (MR):7 used for sentiment analysis. It is a labeled dataset that concerns sentiment polarity, subjective rating, and sentences with subjectivity status or polarity. It contains 5,331 positive examples and 5,331 negative examples, and the average example length is 32 words. Since MR is labeled by manual work, its size is smaller than the others, with a maximum of dozens of MB.

Amazon Review:8 used for sentiment analysis. It has nearly 35 million reviews spanning from June 1995 to March 2013, including product and user information, ratings, and plain-text reviews. It was collected from over 6 million users on more than 2 million products and is categorized into 33 classes, with sizes ranging from KB to GB.

Multi-Perspective Question Answering (MPQA):9 used for sentiment analysis. It is collected from various news sources and annotated for opinions or other private states. It contains 10,606 examples, and each example is labeled with objective or subjective sentiment. Three different versions are available through the MITRE Corporation. The higher the version, the richer the contents.

Stanford Question Answering Dataset (SQuAD):10 used for machine reading comprehension. It contains 107,785 manually generated reading comprehension questions about more than 500 Wikipedia articles. Each question involves a paragraph of an article, and the corresponding answer is in that paragraph. Compared with SQuAD 1.1, SQuAD 2.0 contains the 100,000 questions of SQuAD 1.1 plus more than 50,000 adversarially written, seemingly answerable but unanswerable questions produced by crowd workers. Thus, SQuAD 2.0 requires machine reading comprehension models to answer questions when possible and to determine when paragraphs do not support any answer and abstain from answering.

MovieQA:11 used for machine reading comprehension. It aims to evaluate the model's automatic story comprehension ability from both video and text perspectives. This dataset contains 14,944 multiple-choice questions for 408 movies collected by human annotators. Its questions, which vary from simple "who" or "when" to more complex "why" or "how", can be answered by a variety of information sources, including film editing, plot, and subtitles. Each question has five reasonable answers, and only one is correct.

Stanford Natural Language Inference Corpus (SNLI):12 used for text entailment. It consists of 570,000 human-written English sentence pairs with a manual label of entailment, contradiction, or neutral. There are 550,152 training examples, 10,000 validation examples, and 10,000 test examples.

Visual Question Answering Dataset (VQA):13 used for visual question answering. It is the most widely used dataset for visual question answering tasks. Its images are divided into two parts:

6 http://ai.stanford.edu/amaas/data/sentiment/.
7 http://www.cs.cornell.edu/people/pabo/movie-review-data/.
the sentence structure. Furthermore, it can judge the mood accord- 8
http://snap.stanford.edu/data/web-Amazon.html.
9
http://mpqa.cs.pitt.edu/.
2 10
https://www.kaggle.com/amananandrai/ag-news-classification-dataset. https://rajpurkar.github.io/SQuAD-explorer/.
3 11
https://dbpedia.org/ontology/. http://movieqa.cs.toronto.edu/home/.
4 12
https://sourceforge.net/projects/yahoodataset/. https://nlp.stanford.edu/projects/snli/.
5 13
https://nlp.stanford.edu/sentiment/code.html. https://visualqa.org/download.html.
S. Qiu, Q. Liu, S. Zhou et al. Neurocomputing 492 (2022) 278–307
Table 7
Summary of Adversarial Attack Applications and Benchmark Datasets in NLP.

Task                        Benchmark Dataset
Scene Text Recognition      Street View Text, ICDAR 2013, IIIT5K [144]
Visual Question Answering   VQA dataset [142,145]
Visual Semantic Embedding   MSCOCO [146]
General V+L task            COCO, Visual Genome, Conceptual Captions, SBU Captions [147]
Speech Recognition          Mozilla Common Voice [21]
Table 8
Summary of Applications in Classification Tasks.
204,721 images that come from the Microsoft COCO dataset (MSCOCO)14 based on real scenes, and 50,000 pictures consisting of human and animal models in abstract scenes. Humans generate these questions and answers. In particular, true-or-false questions account for about 40%, and each picture generally corresponds to several question-answer pairs. At present, there are two versions. In VQA 1.0, the questions mostly concern simple position, quantity, and attribute relationships in the image. In VQA 2.0, besides the simple attribute information in the picture, the questions are fused with more conceptual sense. Thus, more studies tend to use VQA 2.0.

4.2. Application in NLP

Differing from Section 3, which focuses on adversarial attack approaches, this section concentrates on the application scenarios and impact of adversarial attacks in the NLP field. According to different text-processing tasks, this section classifies adversarial attack applications into eight categories: classification, neural machine translation, machine reading comprehension, text entailment, part-of-speech tagging, text summarization, dialogue generation, and cross-modal tasks. Note that most attacks were simultaneously applied in multiple different tasks, indicating the transferability of these methods across datasets and DNN models.

4.2.1. Classification
The classification task is one of the most general scenarios in the NLP field. It can be further divided into seven sub-categories: text classification, sentiment analysis, gender identification, grammatical error detection, toxic comment detection, spam detection, and relation extraction, as shown in Table 8.
Text classification aims to categorize the given texts into several classes, such as the classes "Business" and "Sport" in the AG's News dataset. Sentiment analysis classifies sentiments into two or three classes, for example, in a three-group scheme: neutral, positive, and negative. Furthermore, gender identification, grammatical error detection, toxic comment detection, and spam detection can be framed as binary classification problems. In comparison, relation extraction extracts the corresponding relation of the entity pair in a sentence; thus, it can be treated as a multi-class classification issue judging the relationship of the entity pair.
From the perspective of the victim model, LSTM-based models were attacked by several methods [120,45,63,47,84,72,79,87]. Among them, the "Sememe + PSO" attack [47] experimented on the bidirectional LSTM (BiLSTM) model [156] with the IMDB and SST datasets. Compared with the "Embedding/Language Model + Genetic" [40] and "Synonym + Greedy" [81] approaches, it achieved the highest attack success rate on both datasets; in particular, it attacked BiLSTM/BERT on IMDB with notable 100.00%/98.70% success rates.
Convolutional Neural Network (CNN) based models were the target of several attacks [136,78,65,120,45,25,79,83]. For instance, HotFlip [25] was evaluated on the character-level CNN-LSTM (CharCNN-LSTM) model [149] with AG's News in white-box scenarios, and changed an average of 4.18% of the characters to fool the classifier at confidence 0.5. Liang et al. [65] focused on the character-level (CharCNN) [152] and word-level

14 https://cocodataset.org/#download.
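Word-level attacks of the kind described above typically query the victim model and greedily keep substitutions that reduce its confidence. The following is a minimal sketch of that query-and-swap loop, using a hypothetical keyword-based victim classifier and a toy synonym table; neither comes from the surveyed papers.

```python
import math

# Toy sketch of a word-level, score-guided black-box attack in the
# spirit of greedy synonym-substitution methods. The victim model and
# synonym table below are hypothetical stand-ins for illustration only.

TOY_SYNONYMS = {
    "love": ["like", "fancy"],
    "great": ["fine", "decent"],
    "terrible": ["poor", "weak"],
}

def toy_sentiment(text):
    """Hypothetical keyword 'victim' model: returns P(positive)."""
    pos = {"great", "love", "fine", "fancy"}
    neg = {"terrible", "poor", "weak", "hate"}
    words = text.split()
    score = sum(w in pos for w in words) - sum(w in neg for w in words)
    return 1.0 / (1.0 + math.exp(-score))

def greedy_attack(text, victim):
    """Greedily keep any synonym swap that lowers the victim's confidence
    in its originally predicted class, querying only output probabilities
    (the black-box setting)."""
    words = text.split()
    positive = victim(text) >= 0.5

    def confidence(ws):
        p = victim(" ".join(ws))
        return p if positive else 1.0 - p

    for i, w in enumerate(words):
        for syn in TOY_SYNONYMS.get(w, []):
            candidate = words[:i] + [syn] + words[i + 1:]
            if confidence(candidate) < confidence(words):
                words = candidate  # keep the confidence-reducing swap
    return " ".join(words)
```

Real attacks such as those cited in this section additionally constrain substitutions (e.g., by embedding similarity or part-of-speech agreement) and budget the number of victim queries; this sketch only illustrates the basic greedy loop.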
not change. Even the Google Perspective API15 was attacked by DISTFLIP [60] with the toxic comment dataset16 in black-box scenarios. 42% of the API-predicted labels corresponding to the generated examples were flipped by this attack, while humans maintained high accuracy in predicting the target label.

4.2.2. Neural machine translation
Existing Neural Machine Translation systems are used to translate one natural language into another. Given an adversarial text, the translation obtained from the system is inconsistent with the semantics understood by human beings. Some related works are shown in Table 9.

Table 9
Summary of Applications in Neural Machine Translation Tasks.

Work   Benchmark Dataset   Victim Model
[64]   WMT Corpus          WordLSTM encoder, word-based attention decoder
[70]   WMT Corpus          Transformer, LSTM-based model, ConvS2S

The adversarial attacks in [38,35] attacked character-level neural machine translation models on the TED Talks Parallel Corpus [138]. The difference between them is that the former implemented only black-box attacks, while the latter proposed both black-box and white-box attacks. Further, in [35], the average numbers of character changes and queries in the black-box setting are respectively 3.6 and 4.3 times those in the white-box setting. To attack the Transformer model [166], Cheng et al. [119] used a gradient-based method and achieved improvements of 2.8 and 1.6 Bilingual Evaluation Understudy (BLEU) points on the NIST dataset and the WMT Corpus, respectively. Besides, Emelin et al. [86] elicited word sense disambiguation biases. They demonstrated that disambiguation robustness varies substantially between domains, and that different models trained on the same data are vulnerable to different attacks. Tan et al. [76] perturbed the inflectional morphology of words in given texts. All these attack methods have reduced the performance of DNN models to a large extent.

Table 10
Summary of Applications in Machine Reading Comprehension Tasks.

Work   Benchmark Dataset   Victim Model
[58]   SQuAD               BiDAF [169], Match-LSTM [170]
[89]   SQuAD               QANet [171], BiDAF, BiDAF with CharCNN
[59]   SQuAD               BERT based models
[76]   SQuAD               BiDAF, ELMo-BiDAF [163], SpanBERT
[99]   MovieQA             Multiple CNN based, RNN-LSTM based models

Table 11
Summary of Applications in Text Entailment Tasks.

Work    Benchmark Dataset   Victim Model
[63]    SICK                BiLSTM
[90]    FEVER               RoBERTa
[118]   SNLI, MultiNLI      DAM, ESIM, cBiLSTM
[40]    SNLI                Model with ReLU layers
[84]    SNLI                BiDAF
[43]    SNLI, MultiNLI      BERT
[39]    SNLI                RoBERTa
[47]    SNLI                BiLSTM, BERT
[134]   MultiNLI            BERT, RoBERTa, XLM [172], XLNet [173]
[79]    SNLI, MultiNLI      BiLSTM, ESIM [174], BERT
[71]    SNLI, MultiNLI      BiLSTM, ESIM, BERT
[80]    SNLI                WordLSTM, BERT
tain semantics and syntax, and achieved an attack success rate of 70% with a modification rate of 23%. To improve the validity and fluency of adversarial text, Zhang et al. [84] improved the work in [40] by employing Metropolis-Hastings sampling, and it reduced the perplexity (PPL) by nearly 500 points.
From the perspective of the victim model, more and more attacks [43,39,47,134,79,90] were evaluated on pre-trained language models. Compared with the effective and efficient TextFooler [79], RewritingSampler [43] showed better diversity, semantic similarity, and grammatical quality when attacking BERT. The lightweight Mischief [134] was evaluated on four transformer-based models: BERT [131], RoBERTa [157], XLM [172], and XLNet [173], and significantly reduced the performance of these models by up to 20%. Furthermore, the method in [90] ensured the semantic validity of adversarial texts crafted by universal triggers.

4.2.5. Part-of-speech tagging
Part-of-speech (POS) tagging is the process of marking up words in a text as corresponding to a particular part of speech, such as "noun" or "verb", based on both their definition and context. In particular, Yasunaga et al. [139] added perturbations to the input word or character embeddings, and conducted a series of POS tagging experiments on the Penn Treebank WSJ corpus [175] (English) and the Universal Dependencies [176] (27 languages) dataset. Further, Eger et al. [39] proposed Zéroe, containing nine attack modes, and evaluated it with RoBERTa [157] on the Universal Dependencies dataset. The results indicated that the intrude attack, which randomly inserts unobtrusive symbols (like $%&.- and whitespace) into a given text, is among the most severe attacks for POS tagging, since it decreased the victim model accuracy to around 16%. Besides, Han et al. [140] used a sequence-to-sequence model with feedback from multiple reference models of the same task to generate adversarial sentences with different lengths and structures.

4.2.6. Text summarization
The Text Summarization task summarizes the main content or meaning of a given document or paragraph with a succinct expression. Because the average length of given texts varies greatly, it is challenging to implement an adversarial attack on this task. For example, Seq2Sick [64] was verified on three datasets (DUC2003, DUC2004, and Gigaword). The DUC2003 and DUC2004 datasets are extensively employed in document summarization. Changing only 2 or 3 words on average, Seq2Sick leads to entirely different outputs for more than 80% of sentences.

4.2.7. Dialogue generation
Dialogue Generation is a kind of text generation task, which automatically generates a response according to a given post. Besides, the dialogue generation model is a fundamental component of real-world dialogue systems. Niu et al. [98] verified their approach with the Ubuntu Dialogue Corpus [177] and the Collaborative Communicating Agents dataset (CoCoA) [178]. For the Ubuntu Dialogue Corpus, which contains 1 million two-person, multi-turn dialogues extracted from Ubuntu chat logs used to provide and receive technical support, they focused on the Variational Hierarchical Encoder-Decoder model [179] and the RL model [180]. In contrast, for the CoCoA dialogue dataset, which involves two agents that are asymmetrically primed with a private Knowledge Base (KB) and engage in a natural language conversation to find the unique entry shared by the two KBs, they paid attention to attacking the Dynamic Knowledge Graph Network. Further, adversarial training with generated adversarial examples makes all models more robust to adversarial attacks and improves their performances when evaluated on the original test set.

4.2.8. Cross-modal task
In addition to the tasks dealing with single-modality input, some NLP-related cross-modal tasks face the adversarial attack threat. These cross-modal tasks can be categorized into two main types: text-and-vision and text-and-audio, as shown in Table 12.
Text-and-Vision. The Image Captioning model takes an image as input and generates a textual caption that describes the visual content of the input image. For attacking the CNN-RNN-based image captioning model, Chen et al. [141] proposed Show-and-Fool, which includes a targeted caption method that makes model outputs match the target caption, and a targeted keyword method that makes model outputs contain specific keywords. Experimented on the MSCOCO dataset [181], the former achieved a 95.8% attack success rate, and the latter achieved an even higher success rate, especially at least 96% for the 3-keyword case and at least 97% for the 1-keyword and 2-keyword cases. Following this, Xu et al. [142] proposed an iterative optimization method and investigated adversarial attacks on the DenseCap network [182] with the Visual Genome dataset [183]. Its objective function maximizes the probability of the target answer and down-weights the preference for adversarial examples with a smaller distance to the original image once this distance falls below a threshold. Although it is challenging to train an RNN-based caption generation model to generate exactly matching captions, and the DenseCap network involves randomness, this method reached beyond a 97% success rate.
The Optical Character Recognition model takes an image as input and outputs the recognized text. These tasks approximately include two types: character-based and end-to-end. The former is a traditional approach for recognizing text in "block of text" images; the latter is a segmentation-free technique that recognizes the entire sequence of characters in a variable-sized "block of text" image. Regarding Tesseract17 as the victim model, Song et al. [123] successfully caused over 90% of the words in their list to be misrecognized in the character-based task, and flipped the meaning of a relatively long document by changing only 1–2 words in the end-to-end task. Additionally, Chen et al. [143] proposed WATERMARK, which produces natural distortion in the disguise of watermarks. In white-box and targeted attack scenarios, WATERMARK was performed on a DenseNet + CTC based model trained on a Chinese text image dataset, which includes 3.64 million images and 5,989 unique characters. The adversarial examples crafted by WATERMARK are human-eye friendly and have high success probabilities. Besides, some adversarial examples even work on Tesseract in a black-box manner.
Scene Text Recognition is a standard sequential learning task with a varied-length output. By comparison, Optical Character Recognition is a pipeline process that first segments the word into characters and then recognizes each single character. In Scene Text Recognition tasks, the entire image is directly mapped to a word string. Yuan et al. [144] proposed an adaptive attack to accelerate the adversarial attack through multi-task learning [184], which improves learning efficiency and prediction accuracy by learning multiple objectives from a shared representation. They implemented their attack on the Scene Text Recognition model with the Street View Text [52], ICDAR 2013 [185], and IIIT 5K-Word [186] datasets, and achieved an over 99.9% success rate with 3–6 times speedup compared to the Convolutional Recurrent Neural Network (CRNN) [187].
The Visual Question Answering model provides an accurate answer in natural language when given an image and a natural language question about the image. Xu et al. [142] evaluated their method on two models (the MCB model [188] and the compositional model N2NMN [189]) on the VQA dataset [190]. They evaluated

17 https://github.com/tesseract-ocr/tesseract.
Table 12
Summary of Applications in Cross-modal Tasks.

Category         Task                 Work    Benchmark Dataset                                         Victim Model
text-and-vision  general V+L task     [147]   COCO, Visual Genome, Conceptual Captions, SBU Captions    UNITER
text-and-audio   speech recognition   [21]    Mozilla Common Voice                                      DeepSpeech
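Audio perturbations of the kind summarized in Table 12's text-and-audio row are usually reported on a decibel scale relative to the source waveform. A minimal sketch of that metric, using toy integer samples rather than real audio:

```python
import math

# Sketch of the decibel-scale loudness metric commonly used to report
# audio adversarial perturbations: dB(x) = 20*log10(max_i |x_i|), and
# the distortion of a perturbation delta relative to a source x is
# dB_x(delta) = dB(delta) - dB(x). Sample values here are toy integers.

def db(samples):
    """Peak loudness of a waveform in decibels."""
    return 20.0 * math.log10(max(abs(s) for s in samples))

def distortion_db(source, perturbation):
    """Relative loudness of the perturbation; more negative means
    quieter, i.e., a less perceptible adversarial perturbation."""
    return db(perturbation) - db(source)
```

A tiny perturbation riding on a loud waveform thus yields a strongly negative distortion value, which is why such attacks can remain nearly inaudible.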
the success rate (over 90%) and confidence score (above 0.7) of the victim model in predicting the target answer. Moreover, they concluded that the attention, bounding box localization, and internal compositional structures are vulnerable to adversarial attacks. Besides, Tang et al. [145] leveraged an encoder-decoder neural machine translation framework and iterative FGSM [191] to generate semantically equivalent adversarial examples of both visual and textual data as augmented data, which were then utilized for training a visual question answering model using adversarial learning. The model trained with their method obtained 65.16% accuracy on the clean validation dataset, beating its vanilla-training counterpart by 1.84%. The adversarially trained model significantly increases accuracy on adversarial examples by 21.55%.
The Visual Semantic Embedding task bridges natural language and the underlying visual world. In this task, the embedding spaces of both images and descriptive captions are jointly optimized and aligned. Shi et al. [146] performed adversarial attacks on the textual part through three editing operations (replacing nouns in the caption, changing numerals to different ones, and detecting the relations and shuffling the non-interchangeable noun phrases or replacing the prepositions). The evaluation on the VSE++ model [192] with the MSCOCO dataset showed that, although VSE++ obtains good performance on the original test set, it is vulnerable to caption-specific adversarial attacks.
More generally, Gan et al. [147] proposed VILLA, a large-scale adversarial training strategy for vision-and-language representation learning in tasks including Visual Question Answering [193], Visual Commonsense Reasoning [194], Natural Language Visual Reasoning for Real (NLVR2) [195], Visual Entailment [196], Referring Expression Comprehension [197], and Image-Text Retrieval [198]. VILLA first conducts a task-agnostic adversarial pre-training to lift model performance for all downstream tasks uniformly, then implements a task-specific adversarial fine-tuning to additionally enhance the fine-tuned models. Differing from conventional approaches, VILLA adds adversarial perturbations to word embeddings and extracted image-region features, respectively. To enable large-scale training, it adopts the "free" adversarial training strategy and combines it with KL-divergence-based regularization to promote higher invariance in the embedding space. Relying on standard bottom-up image features only, VILLA improves the single-model performance of UNITER-large from 74.02 to 74.87 on visual question answering tasks and from 62.8 to 65.7 on visual commonsense reasoning tasks. With the ensemble, visual question answering performance is further boosted to 75.85.
Text-and-Audio. The Speech Recognition model recognizes and translates spoken language into text automatically. Carlini et al. [21] crafted targeted audio adversarial examples on automatic speech recognition models. Given any natural waveform x, they constructed a perturbation δ that is almost inaudible so that x + δ was recognized as any desired phrase. They experimented with the Mozilla Common Voice dataset18 on the DeepSpeech model [199]. Furthermore, they generated targeted adversarial examples with a 100% success rate for each source-target pair, with an average perturbation of −31 dB; in particular, the 95% interval for distortion ranged from −15 dB to −45 dB.

4.2.9. Other tasks
Recently, some researchers have proposed adversarial attacks applied to relatively novel NLP tasks. For example, Cheng et al. [148] proposed a framework to generate adversarial agents rather than adversarial examples in an interactive dialogue system under both black-box and white-box settings. Zheng et al. [44] explored the feasibility of generating syntactic adversarial sentences to lead a dependency parser to make mistakes without altering the original syntactic structures. The experiments with a graph-based dependency parser [200] on the English Penn Treebank showed that up to 77% of input examples admit adversarial perturbations.

5. Defense against textual adversarial attack

The wide application of adversarial attacks in the NLP domain makes researchers aware of the potentially severe adversarial threats that NLP intelligent systems face in the real world. To enhance the defensive capability of DNNs in NLP tasks and further improve the security of these intelligent systems against adversarial attacks, researchers have proposed numerous strategies of two types: passive and active defense. The passive method detects adversarial input during the inference procedure, while the active method generally improves the robustness of the model when training it.

5.1. Passive defense

As mentioned above, adversarial texts are perceptible and must preserve semantics. Thus, checking the input is the most straightforward and general passive defense method. According to the type of adversarial attacks that these methods defend against, this survey categorizes these defense strategies into two classes, as shown in Table 13.
For character-level attacks, there are some workable misspelling-checking tools, such as the Python autocorrect 0.3.0 package [45] and a context-aware spelling check service [24]. Additionally, Pruthi et al. [101] designed a word recognition model with three back-off strategies to check misspellings or typos.
For word-level attacks, Mozes et al. [102] observed the frequency differences between words and their substitutes, and then proposed Frequency-guided Word Substitutions (FGWS), which is rule-based and model-agnostic. Besides, Zhou et al. [103] focused on determining whether a particular word is a perturbation and

18 https://commonvoice.mozilla.org/zh-CN/datasets.
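The dictionary-based input checking described above for character-level attacks can be approximated with vocabulary lookup plus fuzzy matching. A minimal sketch using Python's standard difflib as a stand-in for the cited spell-checking tools, with a small hypothetical vocabulary:

```python
import difflib

# Sketch of a passive defense against character-level perturbations:
# out-of-vocabulary tokens are mapped back to their closest vocabulary
# entry before the text reaches the classifier. difflib (stdlib) stands
# in for the dedicated spell-checking tools cited in the survey.

VOCAB = ["the", "movie", "was", "terrible", "great", "acting"]

def restore(text):
    out = []
    for token in text.lower().split():
        if token in VOCAB:
            out.append(token)
        else:
            # closest dictionary word above a similarity cutoff
            match = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.6)
            out.append(match[0] if match else token)  # fall back to raw token
    return " ".join(out)
```

In a deployed pipeline the vocabulary would be the model's training vocabulary, and the similarity cutoff trades off attack recovery against accidentally rewriting legitimate rare words.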
Table 14
Summary of Active Defense Strategies.
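Many active defenses of the kind summarized in Table 14 build on adversarial training, i.e., retraining the model on perturbed copies of the training data (as also noted for dialogue models in Section 4.2). A minimal data-augmentation sketch, using a simple adjacent-character swap as a stand-in for the far more principled perturbations the surveyed methods generate:

```python
import random

# Minimal sketch of adversarial-training-style data augmentation:
# each (text, label) pair is duplicated with a perturbed text so the
# model also sees attack-like inputs during training. The perturbation
# here (one adjacent-character swap per copy) is purely illustrative.

def swap_perturb(text, rng):
    """Swap one adjacent character pair inside a random word,
    keeping the first and last characters fixed."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    p = rng.randrange(1, len(w) - 2)
    words[i] = w[:p] + w[p + 1] + w[p] + w[p + 2:]
    return " ".join(words)

def augment(dataset, seed=0):
    """Return the original (text, label) pairs plus perturbed copies."""
    rng = random.Random(seed)  # fixed seed keeps augmentation reproducible
    return dataset + [(swap_perturb(text, rng), label) for text, label in dataset]
```

Stronger active defenses replace the random swap with perturbations found by an actual attack against the current model, re-generating them as training proceeds.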
Benchmark platform for textual adversarial attack, defense, and evaluation. In the CV field, researchers have proposed several adversarial attack and defense toolboxes, such as CleverHans [207], Foolbox [208], and Advertorch [209]. However, as far as we know, TextAttack [210] and OpenAttack [211] are the only textual adversarial attack toolboxes currently available. Nevertheless, neither of them covers defense and evaluation functions. Since different attack and defense methods are being proposed and applied to various NLP tasks, it is challenging to compare the advantages and disadvantages of these methods. Therefore, it is essential to develop a textual adversarial benchmark platform that integrates attack, defense, and evaluation functions.

Application in emerging tasks. Currently, textual adversarial attack technology is mainly used to attack various text-processing intelligence algorithms, assisting researchers in identifying potential security threats in existing intelligence algorithms and improving them. Researchers can vigorously explore novel tasks where adversarial attack technology can be applied in the future, for example, utilizing adversarial technology in the 3D modelling process to improve the realism of 3D models [212,213], or applying adversarial examples to human-machine verification [214,215] to fool machines without affecting regular users.

CRediT authorship contribution statement

Shilin Qiu: Conceptualization, Investigation, Writing – original draft, Writing – review & editing. Qihe Liu: Writing – review & editing, Funding acquisition, Project administration. Shijie Zhou: Supervision, Project administration. Wen Huang: Investigation, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the Sichuan Science and Technology Program [Grant Nos. 2019YFG0399, 2020YFG0472, 2020YFG0031].

References

[1] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Commun. ACM 60 (6) (2017) 84–90.
[2] L. Qin, N. Yu, D. Zhao, Applying the convolutional neural network deep learning technology to behavioural recognition in intelligent video, Tehnički vjesnik 25 (2) (2018) 528–535.
[3] M.S. Hossain, G. Muhammad, Emotion recognition using deep learning approach from audio–visual emotional big data, Inf. Fusion 49 (2019) 69–78.
[4] A. Chatterjee, U. Gupta, M.K. Chinnakotla, R. Srikanth, M. Galley, P. Agrawal, Understanding emotions in text using deep learning and big data, Comput. Hum. Behav. 93 (2019) 309–317.
[5] W. Guo, H. Gao, J. Shi, B. Long, L. Zhang, B.-C. Chen, D. Agarwal, Deep natural language processing for search and recommender systems, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 3199–3200.
[6] L. Yang, Y. Li, J. Wang, R.S. Sherratt, Sentiment analysis for e-commerce product reviews in chinese based on sentiment lexicon and deep learning, IEEE Access 8 (2020) 23522–23530.
[7] B. Sisman, J. Yamagishi, S. King, H. Li, An overview of voice conversion and its challenges: From statistical modeling to deep learning, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[8] M. Saravanan, B. Selvababu, A. Jayan, A. Anand, A. Raj, Arduino based voice controlled robot vehicle, in: IOP Conference Series: Materials Science and Engineering, Vol. 993, IOP Publishing, 2020, p. 012125.
[9] H. Liu, T. Fang, T. Zhou, Y. Wang, L. Wang, Deep learning-based multimodal control interface for human-robot collaboration, Procedia CIRP 72 (2018) 3–8.
[10] C.-S. Oh, J.-M. Yoon, Hardware acceleration technology for deep-learning in autonomous vehicles, in: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), IEEE, 2019, pp. 1–3.
[11] M. Coccia, Deep learning technology for improving cancer care in society: New directions in cancer imaging driven by artificial intelligence, Technol. Soc. 60 (2020) 101198.
[12] J. Harikrishnan, A. Sudarsan, A. Sadashiv, R.A. Ajai, Vision-face recognition attendance monitoring system for surveillance using deep learning technology and computer vision, in: 2019 International Conference on Vision Towards Emerging Trends in Communication and Networking (ViTECoN), IEEE, 2019, pp. 1–5.
[13] S. So, J. Mun, J. Rho, Simultaneous inverse design of materials and structures via deep learning: demonstration of dipole resonance engineering using core–shell nanoparticles, ACS Appl. Mater. Interfaces 11 (27) (2019) 24264–24268.
[14] H.-P. Chan, L.M. Hadjiiski, R.K. Samala, Computer-aided diagnosis in the era of deep learning, Med. Phys. 47 (5) (2020) e218–e227.
[15] F. Zhang, P.P. Chan, B. Biggio, D.S. Yeung, F. Roli, Adversarial feature selection against evasion attacks, IEEE Trans. Cybern. 46 (3) (2015) 766–777.
[16] K.D. Julian, J. Lopez, J.S. Brush, M.P. Owen, M.J. Kochenderfer, Policy compression for aircraft collision avoidance systems, in: 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), IEEE, 2016, pp. 1–10.
[17] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199.
[18] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, D. Song, Robust physical-world attacks on deep learning visual classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1625–1634.
[19] I.J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572.
[20] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, A. Yuille, Adversarial examples for semantic segmentation and object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1369–1378.
[21] N. Carlini, D. Wagner, Audio adversarial examples: Targeted attacks on speech-to-text, in: 2018 IEEE Security and Privacy Workshops (SPW), IEEE, 2018, pp. 1–7.
[22] H. Yakura, J. Sakuma, Robust audio adversarial example for a physical attack, arXiv preprint arXiv:1810.11793.
[23] R. Taori, A. Kamsetty, B. Chu, N. Vemuri, Targeted adversarial examples for black box audio systems, in: 2019 IEEE Security and Privacy Workshops (SPW), IEEE, 2019, pp. 15–20.
[24] J. Li, S. Ji, T. Du, B. Li, T. Wang, Textbugger: Generating adversarial text against real-world applications, arXiv preprint arXiv:1812.05271.
[25] J. Ebrahimi, A. Rao, D. Lowd, D. Dou, Hotflip: White-box adversarial examples for text classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 31–36, 10.18653/v1/P18-2006.
[26] X. Liu, Y. Lin, H. Li, J. Zhang, Adversarial examples: Attacks on machine learning-based malware visualization detection methods, arXiv preprint arXiv:1808.01546.
[27] J. Chen, Z. Yang, D. Yang, Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification, arXiv preprint arXiv:2004.12239.
[28] D. Mekala, J. Shang, Contextualized weak supervision for text classification, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 323–333.
[29] R.K. Bakshi, N. Kaur, R. Kaur, G. Kaur, Opinion mining and sentiment analysis, in: 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), 2016, pp. 452–455.
[30] P. Gupta, V. Gupta, A survey of text question answering techniques, International Journal of Computer Applications 53 (4).
[31] Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144.
[32] Y. Duan, C. Xu, J. Pei, J. Han, C. Li, Pre-train and plug-in: Flexible conditional text generation with variational auto-encoders, arXiv preprint arXiv:1911.03882.
[33] Y. Tay, D. Bahri, C. Zheng, C. Brunk, D. Metzler, A. Tomkins, Reverse engineering configurations of neural text generation models, arXiv preprint arXiv:2004.06201.
[34] N. Papernot, P. McDaniel, A. Swami, R. Harang, Crafting adversarial input sequences for recurrent neural networks, in: MILCOM 2016 – 2016 IEEE Military Communications Conference, IEEE, 2016, pp. 49–54.
[35] J. Ebrahimi, D. Lowd, D. Dou, On adversarial examples for character-level neural machine translation, in: Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 653–663.
[36] C. Wong, Dancin seq2seq: Fooling text classifiers with adversarial text example generation, arXiv preprint arXiv:1712.05419.
S. Qiu, Q. Liu, S. Zhou et al. Neurocomputing 492 (2022) 278–307
[37] Y. Zang, B. Hou, F. Qi, Z. Liu, X. Meng, M. Sun, Learning to attack: Towards textual adversarial attacking in real-world situations, arXiv preprint arXiv:2009.09192.
[38] Y. Belinkov, Y. Bisk, Synthetic and natural noise both break neural machine translation, arXiv preprint arXiv:1711.02173.
[39] S. Eger, Y. Benz, From hero to zéroe: A benchmark of low-level adversarial attacks, arXiv preprint arXiv:2010.05648.
[40] M. Alzantot, Y. Sharma, A. Elgohary, B.-J. Ho, M. Srivastava, K.-W. Chang, Generating natural language adversarial examples, arXiv preprint arXiv:1804.07998.
[41] X. Wang, H. Jin, K. He, Natural language adversarial attacks and defenses in word level, arXiv preprint arXiv:1909.06723.
[42] Z. Shao, Z. Liu, J. Zhang, Z. Wu, M. Huang, Advexpander: Generating natural language adversarial examples by expanding text, arXiv preprint arXiv:2012.10235.
[43] L. Xu, I. Ramirez, K. Veeramachaneni, Rewriting meaningful sentences via conditional bert sampling and an application on fooling text classifiers, arXiv preprint arXiv:2010.11869.
[44] X. Zheng, J. Zeng, Y. Zhou, C.-J. Hsieh, M. Cheng, X.-J. Huang, Evaluating and enhancing the robustness of neural network-based dependency parsing models with adversarial examples, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 6600–6610.
[45] J. Gao, J. Lanchantin, M.L. Soffa, Y. Qi, Black-box generation of adversarial text sequences to evade deep learning classifiers, in: 2018 IEEE Security and Privacy Workshops (SPW), IEEE, 2018, pp. 50–56.
[46] Y. Wang, M. Bansal, Robust machine comprehension models via adversarial training, arXiv preprint arXiv:1804.06473.
[47] Y. Zang, F. Qi, C. Yang, Z. Liu, M. Zhang, Q. Liu, M. Sun, Word-level textual adversarial attacking as combinatorial optimization, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 6066–6080.
[48] V. Malykh, Robust to noise models in natural language processing tasks, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 2019, pp. 10–16.
[49] E. Jones, R. Jia, A. Raghunathan, P. Liang, Robust encodings: A framework for combating adversarial typos, arXiv preprint arXiv:2005.01229.
[50] J. Gilmer, R.P. Adams, I. Goodfellow, D. Andersen, G.E. Dahl, Motivating the rules of the game for adversarial example research, arXiv preprint arXiv:1807.06732.
[51] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, D. Mukhopadhyay, Adversarial attacks and defences: A survey, arXiv preprint arXiv:1810.00069.
[52] X. Yuan, P. He, Q. Zhu, X. Li, Adversarial examples: Attacks and defenses for deep learning, IEEE Transactions on Neural Networks and Learning Systems 30 (9) (2019) 2805–2824.
[53] J. Zhang, C. Li, Adversarial examples: Opportunities and challenges, IEEE Transactions on Neural Networks and Learning Systems 31 (7) (2019) 2578–2593.
[54] S. Qiu, Q. Liu, S. Zhou, C. Wu, Review of artificial intelligence adversarial attack and defense technologies, Applied Sciences 9 (5) (2019) 909.
[55] W. Wang, L. Wang, R. Wang, Z. Wang, A. Ye, Towards a robust deep neural network in texts: A survey, arXiv preprint arXiv:1902.07285.
[56] W.E. Zhang, Q.Z. Sheng, A. Alhazmi, C. Li, Adversarial attacks on deep-learning models in natural language processing: A survey, ACM Transactions on Intelligent Systems and Technology (TIST) 11 (3) (2020) 1–41.
[57] A. Huq, M.T. Pervin, Adversarial attacks and defense on texts: A survey, arXiv e-prints, 2020, arXiv–2005.
[58] R. Jia, P. Liang, Adversarial examples for evaluating reading comprehension systems, arXiv preprint arXiv:1707.07328.
[59] N.J. Nizar, A. Kobren, Leveraging extracted model adversaries for improved black box attacks, arXiv preprint arXiv:2010.16336.
[60] Y. Gil, Y. Chai, O. Gorodissky, J. Berant, White-to-black: Efficient distillation of black-box adversarial attacks, arXiv preprint arXiv:1904.02405.
[61] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International conference on machine learning, PMLR, 2014, pp. 1188–1196.
[62] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9.
[63] M. Iyyer, J. Wieting, K. Gimpel, L. Zettlemoyer, Adversarial example generation with syntactically controlled paraphrase networks, arXiv preprint arXiv:1804.06059.
[64] M. Cheng, J. Yi, P.-Y. Chen, H. Zhang, C.-J. Hsieh, Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples, AAAI (2020) 3601–3608.
[65] B. Liang, H. Li, M. Su, P. Bian, X. Li, W. Shi, Deep text classification can be fooled, arXiv preprint arXiv:1704.08006.
[66] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in neural information processing systems 26 (2013) 3111–3119.
[67] K. Taga, K. Kameyama, K. Toraichi, Regularization of hidden layer unit response for neural networks, in: 2003 IEEE Pacific Rim Conference on Communications Computers and Signal Processing (PACRIM 2003) (Cat. No. 03CH37490), Vol. 1, IEEE, 2003, pp. 348–351.
[68] T. Tanay, L. Griffin, A boundary tilting perspective on the phenomenon of adversarial examples, arXiv preprint arXiv:1608.07690.
[69] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, A. Madry, Adversarial examples are not bugs, they are features, Advances in Neural Information Processing Systems (2019) 125–136.
[70] P. Michel, X. Li, G. Neubig, J.M. Pino, On evaluation of adversarial perturbations for sequence-to-sequence models, arXiv preprint arXiv:1903.06620.
[71] R. Maheshwary, S. Maheshwary, V. Pudi, Generating natural language attacks in a hard label black box setting, arXiv preprint arXiv:2012.14956.
[72] A. Mathai, S. Khare, S. Tamilselvam, S. Mani, Adversarial black-box attacks on text classifiers using multi-objective genetic optimization guided by deep networks, arXiv preprint arXiv:2011.03901.
[73] L. Yuan, X. Zheng, Y. Zhou, C.-J. Hsieh, K.-W. Chang, X. Huang, Generating universal language adversarial examples by understanding and enhancing the transferability across neural models, arXiv preprint arXiv:2011.08558.
[74] E.J. Anderson, M.C. Ferris, Genetic algorithms for combinatorial optimization: the assembly line balancing problem, ORSA Journal on Computing 6 (2) (1994) 161–173.
[75] J. Kennedy, R. Eberhart, Particle swarm optimization, in: Proceedings of ICNN'95 – International Conference on Neural Networks, Vol. 4, IEEE, 1995, pp. 1942–1948.
[76] S. Tan, S. Joty, M.-Y. Kan, R. Socher, It's morphin' time! combating linguistic discrimination with inflectional perturbations, arXiv preprint arXiv:2005.04364.
[77] N. Xu, O. Feyisetan, A. Aggarwal, Z. Xu, N. Teissier, Differentially private adversarial robustness through randomized perturbations, arXiv preprint arXiv:2009.12718.
[78] S. Samanta, S. Mehta, Towards crafting text adversarial samples, arXiv preprint arXiv:1707.02812.
[79] D. Jin, Z. Jin, J.T. Zhou, P. Szolovits, Is bert really robust? a strong baseline for natural language attack on text classification and entailment, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 34, 2020, pp. 8018–8025.
[80] R. Maheshwary, S. Maheshwary, V. Pudi, A context aware approach for generating natural language attacks, arXiv preprint arXiv:2012.13339.
[81] S. Ren, Y. Deng, K. He, W. Che, Generating natural language adversarial examples through probability weighted word saliency, in: Proceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 1085–1097.
[82] M. Hossam, T. Le, H. Zhao, D. Phung, Explain2attack: Text adversarial attacks via cross-domain interpretability.
[83] P. Yang, J. Chen, C.-J. Hsieh, J.-L. Wang, M.I. Jordan, Greedy attack and gumbel attack: Generating adversarial examples for discrete data, Journal of Machine Learning Research 21 (43) (2020) 1–36.
[84] H. Zhang, H. Zhou, N. Miao, L. Li, Generating fluent adversarial examples for natural languages, arXiv preprint arXiv:2007.06174.
[85] D. Li, Y. Zhang, H. Peng, L. Chen, C. Brockett, M.-T. Sun, B. Dolan, Contextualized perturbation for textual adversarial attack, arXiv preprint arXiv:2009.07502.
[86] D. Emelin, I. Titov, R. Sennrich, Detecting word sense disambiguation biases in machine translation for model-agnostic adversarial attacks, arXiv preprint arXiv:2011.01846.
[87] M. Behjati, S.-M. Moosavi-Dezfooli, M.S. Baghshah, P. Frossard, Universal adversarial attacks on text classifiers, in: ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 7345–7349.
[88] L. Song, X. Yu, H.-T. Peng, K. Narasimhan, Universal adversarial attacks with natural triggers for text classification, arXiv preprint arXiv:2005.00174.
[89] E. Wallace, S. Feng, N. Kandpal, M. Gardner, S. Singh, Universal adversarial triggers for attacking and analyzing nlp, arXiv preprint arXiv:1908.07125.
[90] P. Atanasova, D. Wright, I. Augenstein, Generating label cohesive and well-formed adversarial claims, arXiv preprint arXiv:2009.08205.
[91] M.T. Ribeiro, S. Singh, C. Guestrin, Semantically equivalent adversarial rules for debugging nlp models, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 856–865.
[92] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473.
[93] A. See, P.J. Liu, C.D. Manning, Get to the point: Summarization with pointer-generator networks, arXiv preprint arXiv:1704.04368.
[94] Z. Zhao, D. Dua, S. Singh, Generating natural adversarial examples, arXiv preprint arXiv:1710.11342.
[95] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in neural information processing systems 27.
[96] R.S. Sutton, A.G. Barto, Reinforcement learning: An introduction, 2011.
[97] T. Wang, X. Wang, Y. Qin, B. Packer, K. Li, J. Chen, A. Beutel, E. Chi, Cat-gen: Improving robustness in nlp models via controlled adversarial text generation, arXiv preprint arXiv:2010.02338.
[98] T. Niu, M. Bansal, Adversarial over-sensitivity and over-stability strategies for dialogue models, arXiv preprint arXiv:1809.02079.
[99] M. Blohm, G. Jagfeld, E. Sood, X. Yu, N.T. Vu, Comparing attention-based convolutional and recurrent neural networks: Success and limitations in machine reading comprehension, in: Proceedings of the 22nd Conference on Computational Natural Language Learning, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 108–118, 10.18653/v1/K18-1011.
[100] P. Vijayaraghavan, D. Roy, Generating black-box adversarial examples for text classifiers using a deep reinforced model, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2019, pp. 711–726.
[101] D. Pruthi, B. Dhingra, Z.C. Lipton, Combating adversarial misspellings with robust word recognition, arXiv preprint arXiv:1905.11268.
[102] M. Mozes, P. Stenetorp, B. Kleinberg, L.D. Griffin, Frequency-guided word substitutions for detecting textual adversarial examples, arXiv preprint arXiv:2004.05887.
[103] Y. Zhou, J.-Y. Jiang, K.-W. Chang, W. Wang, Learning to discriminate perturbations for blocking adversarial attacks in text classification, arXiv preprint arXiv:1909.03084.
[104] D. Kang, T. Khot, A. Sabharwal, E. Hovy, Adventure: Adversarial training for textual entailment with knowledge-guided examples, arXiv preprint arXiv:1805.04680.
[105] J. Xu, L. Zhao, H. Yan, Q. Zeng, Y. Liang, S. Xu, Lexicalat: Lexical-based adversarial reinforcement training for robust sentiment classification, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 5521–5530.
[106] L. Li, X. Qiu, Textat: Adversarial training for natural language understanding with token-level perturbation, arXiv preprint arXiv:2004.14543.
[107] H. Liu, Y. Zhang, Y. Wang, Z. Lin, Y. Chen, Joint character-level word embedding and adversarial stability training to defend adversarial text, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 8384–8391.
[108] K. Liu, X. Liu, A. Yang, J. Liu, J. Su, S. Li, Q. She, A robust adversarial training approach to machine reading comprehension, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 8392–8400.
[109] E. Dinan, S. Humeau, B. Chintagunta, J. Weston, Build it break it fix it for dialogue safety: Robustness from adversarial human attack, arXiv preprint arXiv:1908.06083.
[110] Q. Li, S. Shah, X. Liu, A. Nourbakhsh, Data sets: Word embeddings learned from tweets and general data, in: Proceedings of the International AAAI Conference on Web and Social Media, Vol. 11, 2017.
[111] Z. Wang, H. Wang, Defense of word-level adversarial attacks via random substitution encoding, in: International Conference on Knowledge Science, Engineering and Management, Springer, 2020, pp. 312–324.
[112] Y. Zhou, X. Zheng, C.-J. Hsieh, K.-W. Chang, X. Huang, Defense against adversarial attacks in nlp via dirichlet neighborhood ensemble, arXiv preprint arXiv:2006.11627.
[113] B. Wang, S. Wang, Y. Cheng, Z. Gan, R. Jia, B. Li, J. Liu, Infobert: Improving robustness of language models from an information theoretic perspective, arXiv preprint arXiv:2010.02329.
[114] J. Wu, X. Li, X. Ao, Y. Meng, F. Wu, J. Li, Improving robustness and generality of nlp models using disentangled representations, arXiv preprint arXiv:2009.09587.
[115] A.H. Li, A. Sethy, Knowledge enhanced attention for robust natural language inference, arXiv preprint arXiv:1909.00102.
[116] N.S. Moosavi, M. de Boer, P.A. Utama, I. Gurevych, Improving robustness by augmenting training sentences with predicate-argument structures, arXiv preprint arXiv:2010.12510.
[117] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From word embeddings to document distances, in: International conference on machine learning, 2015, pp. 957–966.
[118] P. Minervini, S. Riedel, Adversarially regularising neural nli models to integrate logical background knowledge, arXiv preprint arXiv:1808.08609.
[119] Y. Cheng, L. Jiang, W. Macherey, Robust neural machine translation with doubly adversarial inputs, arXiv preprint arXiv:1906.02443.
[120] V. Kuleshov, S. Thakoor, T. Lau, S. Ermon, Adversarial examples for natural language classification problems.
[121] M. Sato, J. Suzuki, H. Shindo, Y. Matsumoto, Interpretable adversarial perturbation in input embedding space for text, arXiv preprint arXiv:1805.02917.
[122] Z. Gong, W. Wang, B. Li, D. Song, W.-S. Ku, Adversarial texts with gradient methods, arXiv preprint arXiv:1801.07175.
[123] C. Song, V. Shmatikov, Fooling ocr systems with adversarial text images, arXiv preprint arXiv:1802.05385.
[124] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard, Deepfool: a simple and accurate method to fool deep neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2574–2582.
[125] G.A. Miller, Wordnet: a lexical database for english, Commun. ACM 38 (11) (1995) 39–41.
[126] J. Zhao, Y. Kim, K. Zhang, A. Rush, Y. LeCun, Adversarially regularized autoencoders, in: International conference on machine learning, PMLR, 2018, pp. 5902–5911.
[127] K. Krishna, G.S. Tomar, A.P. Parikh, N. Papernot, M. Iyyer, Thieves on sesame street! model extraction of bert-based apis.
[128] K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, Advances in neural information processing systems 28 (2015) 3483–3491.
[129] H. Chen, S. Huang, D. Chiang, X. Dai, J. Chen, Combining character and word information in neural machine translation using a multi-level attention, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 1284–1293.
[130] S.J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, V. Goel, Self-critical sequence training for image captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.
[131] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805.
[132] L. Li, R. Ma, Q. Guo, X. Xue, X. Qiu, Bert-attack: Adversarial attack against bert using bert, arXiv preprint arXiv:2004.09984.
[133] P. Neekhara, S. Hussain, S. Dubnov, F. Koushanfar, Adversarial reprogramming of sequence classification neural networks, CoRR abs/1809.01829.
[134] A. de Wynter, Mischief: A simple black-box attack against transformer architectures, arXiv preprint arXiv:2010.08542.
[135] T. Le, S. Wang, D. Lee, Malcom: Generating malicious comments to attack neural fake news detection models, arXiv preprint arXiv:2009.01048.
[136] Y. Wu, D. Bamman, S. Russell, Adversarial training for relation extraction, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1778–1783.
[137] G. Bekoulis, J. Deleu, T. Demeester, C. Develder, Adversarial training for multi-context joint entity and relation extraction, arXiv preprint arXiv:1808.06876.
[138] M. Cettolo, N. Jan, S. Sebastian, L. Bentivogli, R. Cattoni, M. Federico, The iwslt 2016 evaluation campaign, in: International Workshop on Spoken Language Translation, 2016.
[139] M. Yasunaga, J. Kasai, D. Radev, Robust multilingual part-of-speech tagging via adversarial training, arXiv preprint arXiv:1711.04903.
[140] W. Han, L. Zhang, Y. Jiang, K. Tu, Adversarial attack and defense of structured prediction models, arXiv preprint arXiv:2010.01610.
[141] H. Chen, H. Zhang, P.-Y. Chen, J. Yi, C.-J. Hsieh, Attacking visual language grounding with adversarial examples: A case study on neural image captioning, arXiv preprint arXiv:1712.02051.
[142] X. Xu, X. Chen, C. Liu, A. Rohrbach, T. Darrell, D. Song, Fooling vision and language models despite localization and attention mechanism, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4951–4961.
[143] L. Chen, W. Xu, Attacking optical character recognition (ocr) systems with adversarial watermarks, arXiv preprint arXiv:2002.03095.
[144] X. Yuan, P. He, X. Li, D. Wu, Adaptive adversarial attack on scene text recognition, in: IEEE INFOCOM 2020 – IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), IEEE, 2020, pp. 358–363.
[145] R. Tang, C. Ma, W.E. Zhang, Q. Wu, X. Yang, Semantic equivalent adversarial data augmentation for visual question answering, in: European Conference on Computer Vision, Springer, 2020, pp. 437–453.
[146] H. Shi, J. Mao, T. Xiao, Y. Jiang, J. Sun, Learning visually-grounded semantics from contrastive adversarial samples, arXiv preprint arXiv:1806.10348.
[147] Z. Gan, Y.-C. Chen, L. Li, C. Zhu, Y. Cheng, J. Liu, Large-scale adversarial training for vision-and-language representation learning, arXiv preprint arXiv:2006.06195.
[148] M. Cheng, W. Wei, C.-J. Hsieh, Evaluating and enhancing the robustness of dialogue systems: A case study on a negotiation agent, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 3325–3335.
[149] Y. Kim, Y. Jernite, D. Sontag, A. Rush, Character-aware neural language models, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 30, 2016.
[150] M. Schuster, K.K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (11) (1997) 2673–2681.
[151] K.S. Tai, R. Socher, C.D. Manning, Improved semantic representations from tree-structured long short-term memory networks, arXiv preprint arXiv:1503.00075.
[152] X. Zhang, J. Zhao, Y. LeCun, Character-level convolutional networks for text classification, Advances in neural information processing systems 28 (2015) 649–657.
[153] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882.
[154] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.
[155] T. Miyato, A.M. Dai, I. Goodfellow, Adversarial training methods for semi-supervised text classification, arXiv preprint arXiv:1605.07725.
[156] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised learning of universal sentence representations from natural language inference data, arXiv preprint arXiv:1705.02364.
[157] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692.
[158] K. Shu, L. Cui, S. Wang, D. Lee, H. Liu, defend: Explainable fake news detection, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 395–405.
[159] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation classification via convolutional deep neural network, in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 2335–2344.
[160] K. Cho, B. Van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, arXiv preprint arXiv:1409.1259.
[161] G. Bekoulis, J. Deleu, T. Demeester, C. Develder, Joint entity recognition and relation extraction as a multi-head selection problem, Expert Syst. Appl. 114 (2018) 34–45.
[162] M. Kaneko, Y. Sakaizawa, M. Komachi, Grammatical error detection using error- and grammaticality-specific word embeddings, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 40–48.
[163] M.E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365.
[164] J. Lee, K. Cho, T. Hofmann, Fully character-level neural machine translation without explicit segmentation, Transactions of the Association for Computational Linguistics 5 (2017) 365–378.
[165] R. Sennrich, O. Firat, K. Cho, A. Birch, B. Haddow, J. Hitschler, M. Junczys-Dowmunt, S. Läubli, A.V.M. Barone, J. Mokry, et al., Nematus: a toolkit for neural machine translation, arXiv preprint arXiv:1703.04357.
[166] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in neural information processing systems, 2017, pp. 5998–6008.
[167] M.-T. Luong, H. Pham, C.D. Manning, Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025.
[168] J. Gehring, M. Auli, D. Grangier, Y.N. Dauphin, A convolutional encoder model for neural machine translation, arXiv preprint arXiv:1611.02344.
[169] M. Seo, A. Kembhavi, A. Farhadi, H. Hajishirzi, Bidirectional attention flow for machine comprehension, arXiv preprint arXiv:1611.01603.
[170] S. Wang, J. Jiang, Machine comprehension using match-lstm and answer pointer, arXiv preprint arXiv:1608.07905.
[171] A.W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, Q.V. Le, Qanet: Combining local convolution with global self-attention for reading comprehension, arXiv preprint arXiv:1804.09541.
[172] G. Lample, A. Conneau, Cross-lingual language model pretraining, arXiv preprint arXiv:1901.07291.
[173] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R.R. Salakhutdinov, Q.V. Le, Xlnet: Generalized autoregressive pretraining for language understanding, in: Advances in neural information processing systems, 2019, pp. 5753–5763.
[174] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, D. Inkpen, Enhanced lstm for natural language inference, arXiv preprint arXiv:1609.06038.
[175] M. Marcus, B. Santorini, M.A. Marcinkiewicz, Building a large annotated corpus of english: The penn treebank.
[176] J. Nivre, Ž. Agić, M.J. Aranzabe, M. Asahara, A. Atutxa, M. Ballesteros, J. Bauer, K. Bengoetxea, R.A. Bhat, C. Bosco, et al., Universal dependencies 1.2.
[177] R. Lowe, N. Pow, I. Serban, J. Pineau, The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems, arXiv preprint arXiv:1506.08909.
[178] H. He, A. Balakrishnan, M. Eric, P. Liang, Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings, arXiv preprint arXiv:1704.07130.
[179] I. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, Y. Bengio, A hierarchical latent variable encoder-decoder model for generating dialogues, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, 2017.
[180] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, D. Jurafsky, Deep reinforcement learning for dialogue generation, arXiv preprint arXiv:1606.01541.
[181] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft coco: Common objects in context, in: European conference on computer vision, Springer, 2014, pp. 740–755.
[182] J. Johnson, A. Karpathy, L. Fei-Fei, Densecap: Fully convolutional localization networks for dense captioning, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4565–4574.
[183] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D.A. Shamma, et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International journal of computer vision 123 (1) (2017) 32–73.
[184] A. Kendall, Y. Gal, R. Cipolla, Multi-task learning using uncertainty to weigh losses for scene geometry and semantics, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7482–7491.
[185] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L.G. i Bigorda, S.R. Mestre, J. Mas, D.F. Mota, J.A. Almazan, L.P. De Las Heras, Icdar 2013 robust reading competition, in: 2013 12th International Conference on Document Analysis and Recognition, IEEE, 2013, pp. 1484–1493.
[186] A. Mishra, K. Alahari, C. Jawahar, Scene text recognition using higher order language priors, 2012.
[187] B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE transactions on pattern analysis and machine intelligence 39 (11) (2016) 2298–2304.
[188] A. Fukui, D.H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, Multimodal compact bilinear pooling for visual question answering and visual grounding, arXiv preprint arXiv:1606.01847.
[189] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, K. Saenko, Learning to reason: End-to-end module networks for visual question answering, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 804–813.
[190] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, D. Parikh, Vqa: Visual question answering, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.
[191] A. Kurakin, I. Goodfellow, S. Bengio, et al., Adversarial examples in the physical world (2016).
[192] F. Faghri, D.J. Fleet, J.R. Kiros, S. Fidler, Vse++: Improving visual-semantic embeddings with hard negatives, arXiv preprint arXiv:1707.05612.
[193] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh, Making the v in vqa matter: Elevating the role of image understanding in visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6904–6913.
[194] R. Zellers, Y. Bisk, A. Farhadi, Y. Choi, From recognition to cognition: Visual commonsense reasoning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6720–6731.
[195] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, Y. Artzi, A corpus for reasoning about natural language grounded in photographs, arXiv preprint arXiv:1811.00491.
[196] N. Xie, F. Lai, D. Doran, A. Kadav, Visual entailment: A novel task for fine-grained image understanding, arXiv preprint arXiv:1901.06706.
[197] L. Yu, P. Poirson, S. Yang, A.C. Berg, T.L. Berg, Modeling context in referring expressions, in: European Conference on Computer Vision, Springer, 2016, pp. 69–85.
[198] K.-H. Lee, X. Chen, G. Hua, H. Hu, X. He, Stacked cross attention for image-text matching, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 201–216.
[199] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al., Deep speech: Scaling up end-to-end speech recognition, arXiv preprint arXiv:1412.5567.
[200] T. Dozat, C.D. Manning, Deep biaffine attention for neural dependency parsing, arXiv preprint arXiv:1611.01734.
[201] W. Wang, R. Wang, L. Wang, B. Tang, Adversarial examples generation approach for tendency classification on chinese texts, Ruan Jian Xue Bao/J. Softw. 30 (8) (2019) 2415–2427.
[202] E. La Malfa, M. Wu, L. Laurenti, B. Wang, A. Hartshorn, M. Kwiatkowska, Assessing robustness of text classification through maximal safe radius computation, arXiv preprint arXiv:2010.02004.
[203] T. Miyato, S.-I. Maeda, M. Koyama, S. Ishii, Virtual adversarial training: a regularization method for supervised and semi-supervised learning, IEEE transactions on pattern analysis and machine intelligence 41 (8) (2018) 1979–1993.
[204] A. Dubey, L. van der Maaten, Z. Yalniz, Y. Li, D. Mahajan, Defense against adversarial images using web-scale nearest-neighbor search, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8767–8776.
[205] P. Shi, J. Lin, Simple bert models for relation extraction and semantic role labeling, arXiv preprint arXiv:1904.05255.
[206] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P.J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683.
[207] N. Papernot, F. Faghri, N. Carlini, I. Goodfellow, R. Feinman, A. Kurakin, C. Xie, Y. Sharma, T. Brown, A. Roy, et al., Technical report on the cleverhans v2.1.0 adversarial examples library, arXiv preprint arXiv:1610.00768.
[208] J. Rauber, W. Brendel, M. Bethge, Foolbox: A python toolbox to benchmark the robustness of machine learning models, arXiv preprint arXiv:1707.04131.
[209] G.W. Ding, L. Wang, X. Jin, Advertorch v0.1: An adversarial robustness toolbox based on pytorch, arXiv preprint arXiv:1902.07623.
[210] J.X. Morris, E. Lifland, J.Y. Yoo, Y. Qi, Textattack: A framework for adversarial attacks in natural language processing.
[211] G. Zeng, F. Qi, Q. Zhou, T. Zhang, Z. Ma, B. Hou, Y. Zang, Z. Liu, M. Sun, Openattack: An open-source textual adversarial attack toolkit, arXiv preprint arXiv:2009.09191.
[212] Y. Liang, F. He, X. Zeng, J. Luo, An improved loop subdivision to coordinate the smoothness and the number of faces via multi-objective optimization, Integrated Computer-Aided Engineering (Preprint) (2021) 1–19.
[213] A. Lahav, A. Tal, Meshwalker: Deep mesh understanding by random walks, ACM Transactions on Graphics (TOG) 39 (6) (2020) 1–13.
[214] M.I. Hossen, X. Hei, aaecaptcha: The design and implementation of audio adversarial captcha, arXiv preprint arXiv:2203.02735.
[215] M. Kumar, M. Jindal, M. Kumar, Design of innovative captcha for hindi language, Neural Comput. Appl. (2022) 1–36.

Shilin Qiu received the B.E. degree in software engineering from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2017, where she is currently pursuing the Doctoral degree in software engineering. Her current research interests include deep learning and artificial intelligence adversarial technology.
Qihe Liu received the Ph.D. degree in computer application technology from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2005. He is currently an Associate Professor with the School of Information and Software Engineering, UESTC. His current research interests include communication and security in networks, machine learning, and artificial intelligence adversarial technology.

Wen Huang received the B.E. degree in computer science and technology from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2016, where he is currently pursuing the Doctoral degree in software engineering. His current research interests include cryptography and differential privacy.