Adversarial Attack and Defense Technologies in Natural Language Processing: A Survey

Neurocomputing
journal homepage: www.elsevier.com/locate/neucom

Survey paper

Article history:
Received 24 May 2021
Revised 14 March 2022
Accepted 3 April 2022
Available online 7 April 2022
Communicated by Zidong Wang

Keywords:
Textual adversarial example
Adversarial attack
Adversarial defense
Natural language processing
Artificial intelligence

Abstract: Recently, adversarial attack and defense technology has made remarkable achievements and has been widely applied in the computer vision field, promoting its rapid development in other fields, primarily the natural language processing domain. However, discrete semantic texts bring additional restrictions and challenges to successfully implementing adversarial attacks and defenses. This survey systematically summarizes the current progress of adversarial techniques in the natural language processing field. We first briefly introduce the textual adversarial example's particularity, vectorization, and evaluation metrics. More importantly, we categorize textual adversarial attacks according to the combination of semantic granularity and example generation strategy. Next, we present commonly used datasets and adversarial attack applications in diverse natural language processing tasks. Besides, we classify defense strategies as passive and active methods, considering both input data and victim models. Finally, we present several challenging issues and future research directions in this domain.

© 2022 Elsevier B.V. All rights reserved.
1. Introduction

With the rapid progress of high-performance computational equipment and the continuous accumulation of massive data, artificial intelligence technology has been greatly developed and widely used in computer vision (CV) [1–3], natural language processing (NLP) [4–6], voice control [7,8], and other tasks [9–11]. Hence, various intelligent systems are extensively applied to communication, transportation, healthcare, public security, and financial transactions in the real world [12–16].

However, Szegedy et al. [17] proposed the concept of the adversarial example against image classifiers, demonstrating a tremendous security risk in current intelligent systems. They indicated that an adversarial example generated by adding tiny perturbations to a clean image could make a classifier with good performance output a wrong prediction. More noteworthy, deep neural networks trained on different datasets or with distinct structures can produce the same misclassification for the same adversarial example. Since then, adversarial attack technology has become a research highlight in the artificial intelligence security domain. Thus, sign recognition [18], object detection [19,20], audio recognition [21–23], sentiment analysis [24,25], and malware detection systems [26] are apparently vulnerable when facing adversarial threats.

1.1. Development overview of adversarial technology in the NLP field

Based on the paper list¹ collected by Carlini, this survey counts the number of publications related to adversarial techniques in the CV and NLP fields. As shown in Fig. 1, adversarial technology has attracted attention and developed rapidly. Compared with studies in the CV field, publications in the NLP domain are far fewer. However, due to the wide application of NLP technology in text classification [27,28], sentiment analysis [29], text question-answering [30], neural machine translation [31], text generation [32,33], and other tasks, as well as the continuous deepening of adversarial attack and defense technologies, textual adversarial technology has gradually gained researchers' attention.

Papernot et al. [34] were the first to investigate adversarial attacks on texts. Inspired by the idea of generating adversarial images, they crafted adversarial texts through the forward derivative associated with texts' embeddings. Since then, in order to explore the security blind spots in NLP systems and seek corresponding defense strategies, scholars have conducted in-depth research on NLP

* Corresponding author.
E-mail addresses: 742452674@qq.com (S. Qiu), qiheliu@uestc.edu.cn (Q. Liu), sjzhou@uestc.edu.cn (S. Zhou), 562421007@qq.com (W. Huang).
¹ https://nicholas.carlini.com/writing/2019/all-adversarial-example-papers.html
https://doi.org/10.1016/j.neucom.2022.04.020
0925-2312/© 2022 Elsevier B.V. All rights reserved.
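The tiny-perturbation phenomenon described above has a simple numerical intuition for linear score functions: a perturbation bounded by ε per element but aligned with the weight vector shifts the score by ε·‖w‖₁, which grows with the input dimension. A minimal NumPy sketch (the dimensions, seed, and ε are arbitrary illustrative values, not taken from this survey):

```python
import numpy as np

rng = np.random.default_rng(0)
eps = 0.01  # per-element perturbation budget

for dim in (10, 10_000):
    w = rng.normal(size=dim)       # weights of a toy linear score function
    x = rng.normal(size=dim)       # a clean input
    eta = eps * np.sign(w)         # worst-case perturbation with |eta_i| <= eps
    shift = w @ (x + eta) - w @ x  # change in the score caused by eta
    # shift equals eps * ||w||_1, so it scales with the input dimension
    print(dim, float(shift))
```

Even though each input element moves by at most ε, the score shift for the 10,000-dimensional input is roughly a thousand times larger than for the 10-dimensional one, which is the linearity argument for why imperceptible perturbations can flip high-dimensional classifiers.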
S. Qiu, Q. Liu, S. Zhou et al. Neurocomputing 492 (2022) 278–307
Fig. 2. Summary of the textual adversarial attack and defense strategies in this survey. The attack methods are classified according to the combination of semantic granularity
and adversarial example generation strategy. The defense strategies are divided into passive and active methods. Moreover, the attacks concerned by each kind of defense
method are marked in the figure.
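To make the character-level operations in this taxonomy concrete, they can be sketched as plain string edits. This is a minimal illustration; the function names and the tiny keyboard-neighbor table are our own choices, not the implementation of any attack cited in this survey:

```python
# Toy keyboard-neighbor table (a small illustrative subset).
KEYBOARD_NEIGHBORS = {"a": "qs", "e": "wr", "i": "uo", "o": "ip"}

def insert_char(word: str, i: int, c: str) -> str:
    """Insert character c before position i."""
    return word[:i] + c + word[i:]

def delete_char(word: str, i: int) -> str:
    """Delete the character at position i."""
    return word[:i] + word[i + 1:]

def swap_adjacent(word: str, i: int) -> str:
    """Swap the characters at positions i and i+1."""
    return word[:i] + word[i + 1] + word[i] + word[i + 2:]

def replace_with_neighbor(word: str, i: int) -> str:
    """Replace the character at position i with its first listed keyboard neighbor, if any."""
    c = word[i]
    return word[:i] + KEYBOARD_NEIGHBORS.get(c, c)[0] + word[i + 1:]

print(swap_adjacent("morning", 1))       # "mroning"
print(delete_char("morning", 3))         # "moring"
print(replace_with_neighbor("good", 2))  # "goid"
```

Real character-level attacks differ mainly in how they pick which position to edit (gradients, token importance, or random choice), not in the edit operations themselves.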
which refers to the type of modified object, is only appropriate for attacks on texts. Considering that adversarial text generation methods are quite different, we summarize textual adversarial attack methods comprehensively according to the combination of semantic granularity and example generation strategy. Specifically, we categorize adversarial attacks into four classes (the character-level, word-level, sentence-level, and multi-level method) at the top level according to the semantic granularity. Moreover, we further divide each class into subclasses (the gradient-based, optimization-based, importance-based, edit-based, paraphrase-based, and generative model-based method) according to example generation strategies. To the best of our knowledge, we are the first to propose this kind of two-level classification for adversarial attacks.

In character-level attacks, the adversary inserts, deletes, replaces, or swaps characters in a given text. To achieve this purpose, researchers utilized gradient calculation to rank adversarial manipulations [35] or trained a substitute DNN [60]. Besides, DeepWordBug [45] used the importance of tokens to determine which character to change. The edit-based methods in [38,39] used natural and synthetic noises to generate visual and phonetic adversarial examples. Generally speaking, adversarial texts crafted by character-level attacks often have apparent grammar or spelling errors, making it easy for human eyes or misspelling checkers to observe these malicious texts.

In word-level attacks, adversaries modify words in a given text, generally causing fewer grammar and spelling errors than character-level attacks. The gradient-based methods [34,65,119] utilized the sign of the gradient, the magnitude of the gradient, and the gradient itself. Due to the semantic consistency requirement, Michel et al. [70] demonstrated that "adversarial examples should be meaning-preserving on the source side, but meaning-destroying on the target side". Seq2Sick [64] introduced Group Lasso and Gradient Regularization to a projected gradient method. In contrast, optimization-based, importance-based, and edit-based methods are designed specifically for textual adversarial attacks. Therefore, the generated texts are usually more imperceptible in grammar and syntax.

For optimization-based methods, most [40,41,71,72,47,73] are based on Evolutionary Algorithms (such as the Genetic Algorithm (GA) [74] and the Particle Swarm Optimization algorithm (PSO) [75]), which do well in handling discrete data but are relatively time-consuming and computationally expensive. Furthermore, researchers proposed specific methods for certain issues, such as the variability exhibited by second-language speakers and first-language dialect speakers [76], the problem of weighing unlikely substitutions high and limiting the accuracy gain [77], and the inapplicability of massive query operations in the real world [37]. For importance-based methods, to craft semantics-preserving texts with minimum modifications, the methods in [78–80] ranked words according to the class probability changes obtained by removing words one by one. Among them, the method in [78] works best for datasets that have sub-categories within each class. Besides, Ren et al. [81] determined the word replacement order by both word saliency and classification probability, performing well in effectively reducing substitutes. Additionally, Hossam et al. [82] employed cross-domain interpretability to learn word importance, handling the issues of computational complexity and query consumption. Yang et al. [83] proposed a systematic probabilistic framework, which does well in achieving a high success rate and efficiency. Overall, these importance-based methods are more efficient than optimization-based methods. For the edit-based method, Zhang et al. [84] employed a Metropolis-Hastings sampling approach to replace or randomize words, Li et al. [85] introduced a pre-trained masked language model, and Emelin et al. [86] detected word sense disambiguation biases. Overall, word-level attacks usually do better in maintaining semantic consistency and imperceptibility than other attacks, but their generated adversarial examples are less varied.

In sentence-level attacks, the adversary inserts new sentences, replaces words with their paraphrases, or even changes the structure of original sentences. Adversaries generally crafted the universal adversarial perturbation through the iterative projected
1.6. Defenses against textual adversarial attack

Due to the deepening development of adversarial attack technologies and the wide application of textual adversarial attacks, researchers have realized the severity of adversarial attack threats in NLP systems, promoting various defense methods. This survey categorizes current defense strategies as passive and active defenses. The most commonly used passive defense method is checking misspellings and typos in the textual input [45,24,101–103]. However, these passive strategies are only suitable for defending against character-level and word-level attacks. The active defenses principally contain adversarial training and representation learning strategies. For adversarial training, some researchers [25,35,24,58,46,38,47] directly added adversarial examples generated by existing attacks to the training dataset. Considering the massive calculation consumption and low efficiency of these methods, researchers [104,105] proposed GAN-style approaches to train NLP models together with the adversarial example generator. The work in [106] presented a variation of Virtual Adversarial Training (VAT) to generate virtual adversarial examples and virtual labels in the embedding space. Additionally, researchers have paid attention to the issues of out-of-vocabulary (OOV) words [107], distribution difference [107], the diversified adversarial example requirement [108], and offensive language detection in the real world [109]. Although adversarial training can effectively improve the robustness of NLP models and overcome problems like adversarial example preparation and calculation consumption, it is likely to reduce their classification accuracy. For representation learning, researchers improved the input representation ability of NLP models. Some methods [110–112] introduced random perturbations to the input during the training step. Works in [38,41,48,49] improved the generalization of models by encoding inputs and their neighbors with the same representation. Other researchers designed more effective representation approaches by fine-tuning both local and global features [113], introducing disentangled representation learning [114], linking multi-head attention to structured embedding [115], and augmenting input sentences with their corresponding predicate-argument structures [116]. In general, randomizing the input and unifying the input representation are similar to passive defense; by contrast, designing effective representations is more challenging. Furthermore, current defense research is much scarcer than attack research, which reminds researchers that it is necessary to pay more attention to defense strategies against adversarial attacks.

1.7. Contribution and organization

Our contributions. This survey concentrates on adversarial attack and defense technology in the NLP field and provides a thorough and systematic review. The key contributions of this survey can be summarized as follows:

We comprehensively and systematically summarize textual adversarial attack and defense technology, elaborating on textual adversarial examples, adversarial attacks on texts, defenses against textual adversarial attacks, applications in various NLP tasks, and potential development directions in this domain.

We categorize current textual adversarial attacks according to the semantic granularity at the top level and further classify each class into several subclasses depending on the example generation strategy. To the best of our knowledge, we are the first to regard the example generation strategy as a classification criterion and propose this two-level classification for adversarial attacks.

We classify existing defense strategies against textual adversarial attacks as passive and active methods, simultaneously considering the victim model stages and the starting points of defense strategies. Compared with existing surveys, our classification method is the only one that considers both the input data and the victim model.

Organization. In Section 2, we introduce the causes of the adversarial example's emergence, the particularities of adversarial texts, the vectorization of textual data, and the evaluation metrics of adversarial texts. In Section 3, we classify textual adversarial attack methods from four aspects and elaborate on them according to the combination of semantic granularity and example generation strategy. In Section 4, we present benchmark datasets and applications of textual adversarial attacks. In Section 5, we summarize existing defense methods against adversarial attacks on texts. Finally, we conclude several significant challenges and potential development directions in Section 6.

2. Textual adversarial example

Szegedy et al. [17] proposed that adversarial images, generated by adding tiny noises to pixel values, can easily make an image classifier produce wrong predictions but do not affect the perception of human eyes. It was the first study on adversarial attack technology in the CV domain. In the NLP field, Papernot et al. [34] were the first to investigate adversarial attacks. They crafted adversarial texts invisible to human eyes by modifying characters or words in original texts. As shown in Fig. 3, a text correctly categorized by a sentiment analysis model is classified wrongly after replacing one word of it.

To let readers have a more precise and comprehensive understanding of the textual adversarial example, we first present possible reasons for the emergence of adversarial examples in Section 2.1. Then, we conclude the particularities of textual adversarial examples in Section 2.2. Later, we elaborate on the vectorization methods for discrete data in Section 2.3. Finally, we summarize the commonly used metrics for evaluating the effectiveness and imperceptibility of adversarial texts in Section 2.4. Note that the particularities described in this survey are unique to textual examples, and the general characteristics of adversarial examples in all fields are elaborated in detail in the literature [54].

2.1. Causes of adversarial example

Realizing the terrifying attack capability of adversarial examples, researchers have begun to explore the reasons for their existence. Although researchers have presented some assumptions, there is still no widely accepted explanation. In the early stage, researchers thought that the cause was the insufficient generalization ability of DNNs to predict unknown data due to over-fitting or inapposite regularization [67], while Szegedy et al. [17] proposed that the extreme nonlinear characteristic of DNNs leads to the existence of adversarial examples.

Later, Goodfellow et al. [19] verified the above two hypotheses. They input adversarial examples into a regularized DNN and discovered that the model's effectiveness in resisting adversarial examples was not significantly improved. Besides, by adding tiny perturbations to the input of a linear model, they showed that if the input dimension is sufficient, it is possible to construct adversarial examples with high confidence. Therefore, they proposed that the linear behavior of DNNs in high-dimensional space leads to the emergence of adversarial examples. In other words, since
Fig. 3. An instance of an adversarial example. Before the attack, the sentiment analysis model classified the original text into the negative class, in line with human judgment. After replacing the word "lack" with "dearth", the same model produces a wrong prediction.
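The flip in Fig. 3 can be mimicked with a toy black-box loop: try synonym substitutions one word at a time until the classifier's label changes. Both the classifier and the synonym table below are hypothetical stand-ins for illustration, not the model or lexicon used in the figure:

```python
# Hypothetical sentiment classifier: flags texts containing known negative cues.
NEGATIVE_CUES = {"lack", "boring", "bad"}

def toy_classifier(text):
    words = text.lower().split()
    return "negative" if any(w in NEGATIVE_CUES for w in words) else "positive"

# Hypothetical synonym table (the attack assumes such a resource exists).
SYNONYMS = {"lack": ["dearth", "absence"], "bad": ["poor"]}

def greedy_substitution_attack(text):
    """Swap words for synonyms until the predicted label flips."""
    original_label = toy_classifier(text)
    words = text.split()
    for i, w in enumerate(words):
        for syn in SYNONYMS.get(w.lower(), []):
            candidate = " ".join(words[:i] + [syn] + words[i + 1:])
            if toy_classifier(candidate) != original_label:
                return candidate
    return None  # no successful adversarial example found

print(greedy_substitution_attack("the film shows a lack of ideas"))
# -> "the film shows a dearth of ideas" (classified "positive" by the toy model)
```

A real word-level attack replaces the toy pieces with a trained victim model and an embedding- or thesaurus-based substitute set, and typically adds semantic-similarity constraints on the candidates.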
the nonlinear components of a DNN are linear in segments, tiny noise added to the input would persist in the same direction and accumulate at the end of the DNN, finally leading to a different output.

Differently, some hypotheses were proposed from the perspective of data characteristics instead of neural network structures. Tanay et al. [68] proposed a hypothesis called the tilted boundary. They demonstrated that since a DNN is usually unable to fit all data accurately, there is space for adversarial examples near the classification boundary of the DNN. Ilyas et al. [69] proposed that "adversarial examples are features". They thought that the destructiveness of adversarial examples results directly from the model's sensitivity to input features. Hence, they concluded that adversarial examples are not bugs but features that indicate how the DNN visualizes everything. Furthermore, they divided input features into robust and non-robust types and demonstrated that adding tiny perturbations to non-robust features could easily make DNNs produce incorrect outputs.

2.2. Particularities of adversarial text

As mentioned before, publications related to adversarial technology in the NLP field are far fewer than those in the CV field. The reason is that three extra constraints need to be considered when generating textual adversarial examples. Specifically:

Discrete. Unlike images represented by continuous pixel values, symbolic text is discrete. Therefore, finding appropriate perturbations is critical to efficient textual adversarial example generation. Researchers proposed two solutions for handling this issue. One is vectorizing discrete texts into continuous representations first (see specific methods in Section 2.3), and then using perturbation generation approaches for images to craft adversarial texts. The other is carefully designing perturbations directly in the discrete space.

Perceivable. Well-performing adversarial image generation methods are based on the premise that a few pixel value changes in an image are invisible to human eyes. However, a slight modification of a character or word is easily noticed by human eyes and spelling checkers. Hence, finding textual adversarial examples that are hard to observe by human eyes is vital for successful adversarial attacks.

Semantic. Compared with images, whose overall semantics do not change when a few pixel values change, the semantics of a text could be altered by even replacing or adding a character, violating the principle that adversarial examples should be imperceptible to humans. Therefore, keeping the semantics consistent is the key to crafting influential adversarial texts.

2.3. Vectorization of discrete data

Since additional constraints like grammatical legitimacy and semantic consistency need to be taken into account when designing adversarial example generation algorithms directly on discrete texts, researchers proposed to turn original texts into continuous vectors first, and then either apply methods designed for images or design new algorithms on the vectors to craft adversarial ones. Current text vectorization methods mainly include:

One-Hot encoding. The One-Hot encoding represents a character or a word by a vector in which only one element is one and all other elements are zero. When mapping a text to a vector x, the length of x is equal to the size of the vocabulary, and the element set to one is determined by the item's position in the vocabulary. Although this method is simple to implement, the One-Hot vector is usually sparse, semantically independent, and of very high dimension.

Word-count-based encoding. This approach initializes a zero vector with the length of the vocabulary size and then replaces each element with a specific value. The Bag-of-Words ignores the word order, grammar, and syntax of texts and regards a text as a collection of words. Thus, the Bag-of-Words sets each element of the initialized vector to the corresponding word count. The Term Frequency-Inverse Document Frequency considers the word importance determined by the frequencies of the word's appearance in the text and the corpus. Like the One-Hot encoding, the vector obtained by word-count-based encoding is sparse and semantically independent, but with relatively low dimensions.

N-gram encoding. The N-gram language model predicts the next likely word when given a text, based on the assumption that the occurrence of the n-th word is only related to the first n − 1 words. When n = 1, 2, 3, it is called a unigram, bigram, and trigram, respectively. The N-gram takes the word order into account, but with the increase of n, the vocabulary expands rapidly and the vector becomes sparse.

Dense encoding. The dense encoding provides a low-dimensional and distributed vector representation for discrete data. The Word2Vec [66] uses continuous bag-of-words and skip-gram models to produce a dense representation of words, called word embedding. Its basic assumption is that words appearing in similar contexts have similar meanings. Thus, the Word2Vec alleviates the discreteness and data-sparsity problems to some extent. Similarly, the Doc2Vec and Paragraph2Vec [61], two extensions of word embedding, encode sentences and paragraphs into dense vectors.

2.4. Evaluations of adversarial text

It is necessary to evaluate the effectiveness of adversarial examples in all fields. Additionally, the imperceptibility estimation of adversarial text is particularly significant for textual adversarial attacks.

Accuracy rate. It refers to the proportion of examples correctly classified by the victim model among total inputs. The lower the accuracy is, the more effective the attack is.

Euclidean Distance. For two given word vectors $\vec{m}$ and $\vec{n}$, the Euclidean Distance is expressed as:

$$D(\vec{m}, \vec{n}) = \sqrt{\sum_{i=1}^{k} (m_i - n_i)^2} \quad (1)$$

where $k$ is the dimension of the word vector, and $m_i$ and $n_i$ are the $i$-th factors of $\vec{m}$ and $\vec{n}$. The smaller the Euclidean Distance is, the more similar these two vectors are.

Cosine Distance. It is used to calculate the semantic similarity between two vectors. Compared with the Euclidean Distance, the Cosine Distance is more concerned with the difference in direction between two vectors. For two given word vectors $\vec{m}$ and $\vec{n}$, the cosine similarity is expressed as:

$$D(\vec{m}, \vec{n}) = \frac{\vec{m} \cdot \vec{n}}{\|\vec{m}\| \|\vec{n}\|} = \frac{\sum_{i=1}^{k} m_i n_i}{\sqrt{\sum_{i=1}^{k} (m_i)^2} \sqrt{\sum_{i=1}^{k} (n_i)^2}} \quad (2)$$

$$\text{s.t.} \quad \sum_{j=1}^{k} T_{ij} = d_i, \quad \forall i \in \{1, \ldots, k\} \quad (3)$$
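The One-Hot and Bag-of-Words encodings from Section 2.3 and the cosine similarity of Eq. (2) can be sketched in a few lines of plain Python; the toy corpus and function names are our own, not from any particular library:

```python
import math

# Toy corpus; the vocabulary maps each word to a fixed index.
corpus = ["the movie was good", "the plot was thin"]
vocab = sorted({w for sent in corpus for w in sent.split()})
index = {w: i for i, w in enumerate(vocab)}  # word -> position in vocabulary

def one_hot(word):
    """One-Hot: a |V|-dimensional vector with a single 1 at the word's index."""
    vec = [0] * len(vocab)
    vec[index[word]] = 1
    return vec

def bag_of_words(sentence):
    """Bag-of-Words: word counts, ignoring order, grammar, and syntax."""
    vec = [0] * len(vocab)
    for w in sentence.split():
        vec[index[w]] += 1
    return vec

def cosine_similarity(m, n):
    """Eq. (2): D(m, n) = (m . n) / (||m|| ||n||)."""
    dot = sum(mi * ni for mi, ni in zip(m, n))
    norm_m = math.sqrt(sum(mi * mi for mi in m))
    norm_n = math.sqrt(sum(ni * ni for ni in n))
    return dot / (norm_m * norm_n)

v1 = bag_of_words("the movie was good")
v2 = bag_of_words("the plot was thin")
print(round(cosine_similarity(v1, v1), 3))  # identical vectors -> 1.0
print(round(cosine_similarity(v1, v2), 3))  # only shared words overlap -> 0.5
```

Note how the two sentences score 0.5 purely from the shared words "the" and "was": count-based vectors capture lexical overlap, not meaning, which is exactly the limitation that the dense encodings above address.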
Fig. 4. Categorization of textual adversarial attack methods. There are four criteria: model access, target type, semantic granularity, and example generation strategy.
Non-targeted attack. The adversary hopes to generate an adversarial example x' that makes the victim model f produce a wrong output f(x') ≠ y, where y is the correct label of the input x. Since there is no limit on the target wrong output, this kind of attack is more frequently employed, such as the attacks in [39,119,40].

Targeted attack. In this scenario, the adversary intends to generate adversarial examples that make victim models output a specified wrong prediction. More specifically, the adversary hopes that the generated example x' causes the victim model f to output t = f(x'), where t is the output specified by the adversary. For instance, Ebrahimi et al. [35] considered that some corruptions of a neural machine translation model's output might be much worse than others; translating "good morning" as "attack them" is much worse than translating it as "fish bicycle". Therefore, they proposed a controlled attack that aims to remove a specific word from the translation, and a targeted attack that tries to mute a specific word and replace it with another. Thus, targeted attacks are more aggressive on deep neural networks than non-targeted ones. However, they are more challenging to implement since they contain more constraints than non-targeted ones.

3.1.3. Semantic granularity

Considering the type of objects modified by the adversary, textual adversarial attacks can be divided into four types: character-level, word-level, sentence-level, and multi-level attacks. The character-level attack refers to inserting, removing, flipping, swapping, or replacing characters in a text, and the word-level attack performs the same operations on words. The sentence-level attack usually inserts extra distractor sentences, generates paraphrases, or modifies the original sentence structure while preserving semantic consistency. Besides, the multi-level attack simultaneously employs two or three of the character-level, word-level, and sentence-level attacks.

3.1.4. Example generation strategy

According to the different strategies used in the adversarial example generation process, we divide adversarial attacks into six types: gradient-based, optimization-based, importance-based, edit-based, paraphrase-based, and generative model-based methods. Among them, strategies like the gradient-based method evolved from adversarial image generation methods, and the implementation process of these attacks is usually relatively straightforward. Other methods, like the optimization-based and edit-based methods, are proposed specifically for discrete data; they generally show better performance in maintaining semantic consistency and grammatical correctness, but designing well-tuned algorithms for them is enormously difficult.

Gradient-based. These methods calculate the forward derivative with respect to the input and obtain adversarial perturbations by gradient backpropagation. Therefore, the vectorization of the text needs to be implemented first. Besides, spelling and grammatical errors commonly exist in the generated texts, causing the adversarial examples to be perceived.

Optimization-based. This strategy regards adversarial example generation as a minimax optimization problem, i.e., maximizing the victim model's prediction error while keeping the difference between the adversarial example and the original one within an acceptable range. Currently, researchers craft adversarial texts essentially based on evolutionary algorithms, such as the GA [74] and PSO [75].

Importance-based. Here, which object is to be modified and how to modify it are determined by each object's importance to the victim model: the more critical the changed word is, the easier it is to change the prediction of the victim model, even if the change is small. The adversarial example generated by this strategy generally maintains semantic consistency as well as grammatical and syntactic correctness.

Edit-based. This strategy crafts adversarial examples by operations like inserting, removing, and swapping characters, words, or sentences. These editing operations are also used in other approaches, such as gradient-based, optimization-based, and importance-based methods. Therefore, the edit-based method in this survey refers to attacks that utilize the above editing operations but do not use gradient information, an optimization algorithm, or item importance.

Paraphrase-based. The adversary takes the paraphrase of a sentence as its adversarial example. In the paraphrase generation process, the adversary introduces different extra conditions to fool the victim model without affecting human judgment. Sentence-level attacks commonly use these approaches.

Generative model-based. This method uses generative models like the GAN [95] and encoder-decoder models to generate adversarial texts, and it is frequently used in sentence-level attacks. Since there are gradient backpropagations when training the generative model or crafting adversarial examples, these methods are usually combined with other techniques, such as RL [96].

Since the semantic granularity is unique to textual adversarial attack classification, and the example generation strategy indicates the principal idea of generating adversarial examples, we suggest a novel two-level classification method for the textual adversarial attack taxonomy. We first categorize adversarial attack methods into four classes according to the semantic granularity at the top level. Then, we subdivide each kind of method according to the example generation strategy. Sections 3.2–3.5 show the details.

3.2. Character-level attack

As mentioned before, the main idea of character-level attacks is to insert, delete, flip, replace, or swap individual characters, special symbols, or figures in a text. These attacks generally use three example generation strategies: gradient-based, importance-based, and edit-based methods, as summarized in Table 2. Adversarial texts crafted by these methods often have apparent grammar or spelling errors, making it easy for human eyes or misspelling checkers to observe these malicious texts.

3.2.1. Gradient-based method

The white-box adversary in [35] calculated gradient information to rank adversarial manipulations and used greedy search and beam search to find adversarial examples. Furthermore, it forced the adversary to remove or change a specific word in a translation through two novel loss functions. In contrast, the black-box adversary in [35] just randomly selected characters and made suitable modifications to them. Since the implicit knowledge in an optimization process can be distilled into another, more efficient neural network, DISTFLIP [60] converts a white-box attack into a black-box one by training a neural network that imitates the adversarial example generation process of HotFlip [25]. Owing to its independence from the optimization algorithm, DISTFLIP is ten times faster than HotFlip when crafting adversarial texts.
Table 2
Summary of Character-level Adversarial Attacks in NLP.
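The token-importance idea behind character-level attacks such as DeepWordBug can be sketched with a deletion-based score: remove each token in turn and measure the drop in the victim model's class probability. The toy model below is our own hypothetical stand-in, not DeepWordBug's actual scoring functions:

```python
POSITIVE_CUES = {"good", "great", "excellent"}

def toy_model(text):
    """Hypothetical victim model: probability of the 'positive' class."""
    words = text.split()
    if not words:
        return 0.0
    return sum(w in POSITIVE_CUES for w in words) / len(words)

def rank_tokens_by_importance(text):
    """Score each token by the probability drop its deletion causes."""
    words = text.split()
    base = toy_model(text)
    scored = []
    for i, w in enumerate(words):
        reduced = " ".join(words[:i] + words[i + 1:])
        scored.append((base - toy_model(reduced), w))
    return sorted(scored, reverse=True)  # most important tokens first

ranking = rank_tokens_by_importance("the movie was good")
print(ranking[0][1])  # the token whose deletion hurts the score most: "good"
```

DeepWordBug's actual scoring functions (Replace-1, Temporal Head, Temporal Tail, and Combination) compare model outputs under token replacement or truncation rather than deletion; the sketch above only conveys the shared idea of ranking tokens by their effect on the prediction before perturbing the top-ranked one.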
However, it needs further verification whether DISTFLIP distills all the knowledge of a white-box attack.

3.2.2. Importance-based method

The importance-based DeepWordBug [45] includes two procedures: determining critical tokens and modifying them slightly. To find important tokens, DeepWordBug uses four token scoring functions: Replace-1 Score, Temporal Head Score, Temporal Tail Score, and Combination Score. In the second phase, character-level transformations, such as swap, substitution, deletion, and insertion, are applied to the highest-ranked token to minimize the edit distance of the perturbation. Although DeepWordBug successfully generates adversarial texts, most of the introduced perturbations are constricted to misspellings.

3.2.3. Edit-based method

Belinkov et al. [38] utilized natural and synthetic noise to replace corresponding correct words in a text. They collected natural noises like typos and misspellings from different datasets. Moreover, they crafted synthetic noises through four operations: exchanging characters, randomizing all but the first and last characters of a word, randomizing all characters of a word, and replacing a character with its neighbor on the keyboard. Furthermore, considering the realization in typical application scenarios such

and semantic changes. TextFool [65] uses the magnitude of the victim model's cost gradient to determine what, where, and how to insert, replace, or remove words in white-box scenarios. Besides, it utilizes three modification strategies (insertion, modification, and removal) and adopts the natural language watermarking technique to dress up a given text elaborately. However, TextFool requires massive manual execution. AdvGen [119] leverages the gradient itself. It considers the final translation loss of the victim model and the distance between a word and its adversarial counterpart. Besides, it applies a language model to identify possible substitutes for a given word to enhance semantic preservation.

Since ensuring that the generated vector maps to a readable word is critical, Michel et al. [70] demonstrated that "adversarial examples should be meaning-preserving on the source side, but meaning-destroying on the target side" for non-targeted attacks. They proposed gradient-based word substitution methods with kNN and CharSwap constraints, respectively. Seq2Sick [64] is a projected gradient method combined with group lasso and gradient regularization. The non-overlapping attack requires adversarial examples to share no overlapping words with the original one, while the targeted keyword attack requires adversarial examples to contain all given targeted keywords. Thus, Seq2Sick applies a hinge-like loss function optimized at the logit layer and an additional mask function m_t for them, as shown below:
as social media, Eger et al. [39] proposed the first large-scale cata-
X
M n o
max ; zt t max zt
log and benchmark of low-level adversarial attack, called Zéroe. It ðs Þ ð yÞ
Lnonov erlapping ¼ ð5Þ
encompasses nine attack modes, including visual and phonetic y–st
t¼1
adversaries: inner-shuffle, full-shuffle, intrude, disemvowel, trun-
cate, segment, typo, natural noise, phonetic, and visual. X
jKj n o
min mt max ; max zt
ð yÞ ðk Þ
Lkeywords ¼ zt i ð6Þ
t2½M y–ki
3.3. Word-level attack i¼1
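As a concrete illustration, the hinge-like non-overlapping loss of Eq. (5) can be sketched in plain Python. The logit matrix, the margin value `eps`, and the function name below are illustrative assumptions for this survey, not the authors' released implementation.

```python
def non_overlapping_loss(logits, orig_ids, eps=1.0):
    """Sketch of Seq2Sick's non-overlapping hinge loss (Eq. (5)).

    logits:   list of M rows, one per decoding step t; each row holds the
              decoder logits z_t over the output vocabulary
    orig_ids: list of M original output tokens s_t
    The sum shrinks (down to -eps per step) as, at every step t, some token
    other than s_t out-scores the original token s_t.
    """
    total = 0.0
    for z_t, s_t in zip(logits, orig_ids):
        # highest logit among all tokens except the original one
        best_other = max(z for y, z in enumerate(z_t) if y != s_t)
        total += max(-eps, z_t[s_t] - best_other)
    return total

# toy run: 2 decoding steps over a 4-token vocabulary; the original tokens
# (ids 0 and 2) still dominate, so the loss is far from its minimum of -2.0
loss = non_overlapping_loss([[3.0, 0.5, 0.1, 0.2],
                             [0.1, 0.2, 2.5, 0.3]], [0, 2])
```

Minimizing this quantity over the input perturbation pushes every decoding step away from its original token, which is the non-overlapping objective; Eq. (6) additionally applies the mask $m_t$ and a minimum over positions for the targeted keywords.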
Table 3
Summary of Word-level Adversarial Attacks in NLP.
the adversarial example to the nearest meaningful word vector. Additionally, Song et al. [123] proposed a three-step adversarial example generation method for optical character recognition models. It first finds words and their antonyms in WordNet [125] and retains only the valid and semantically consistent antonyms satisfying the edit distance threshold. Then, it locates lines containing the above words in the clean image and transforms the target words in these lines to appropriate ones through the $L_2$-norm distance. Finally, it replaces images of the corresponding lines in the text image with adversarial ones.

Later, researchers employed various evolutionary algorithms in the adversarial text generation procedure. Alzantot et al. [40] utilized the GA [74] to select words randomly and find their nearest neighbors. They ranked and substituted the selected words to maximize the target label's probability. Wang et al. [41] improved the work in [40] by allowing the words in a given sentence to be modified multiple times. Maheshwary et al. [71] leveraged a GA-based approach in a hard-label black-box setting. Mathai et al. [72] proposed a GA-based optimization method with a multi-objective strategy. However, randomly selecting words for substitution in the above methods is full of uncertainty, making some changes meaningless for the target label. Differently, Zang et al. [47] leveraged the PSO [75] to determine the word to be modified. They further demonstrated that a substitute found with word embeddings and a language model is not always semantically consistent with the replaced word or suitable for the context. In addition, they proposed a sememe-based word substitution method. The adversarial example generation procedure of this method is shown in Fig. 5.

Fig. 5. The procedure of method [47]. It first uses the sememe-based word replacement method to exclude the invalid or low-quality substitutes. Thus, the remaining ones form the reduced search space. Then, it uses the PSO-based search method to efficiently find adversarial examples in the reduced search space.

Besides, considering that second-language speakers and many first-language dialect speakers frequently exhibit variability in their production of inflectional morphology, MORPHEUS [76] maximally increases the prediction loss by greedily searching for the inflectional form of each noun, verb, or adjective in a given text. Xu et al. [77] showed that optimizing the worst-case loss function over all possible substitutions is prone to weighing unlikely substitutions higher and limiting the accuracy gain. Thus, they proposed a Metric Differential Privacy mechanism, which samples k values from a truncated Poisson distribution as substitution candidates to ensure that nearby words with irrelevant meanings are disregarded. For a given privacy parameter, an irrelevant word could have a similar substitution probability as a relevant word. Hence, this method achieves different degrees of semantic preservation.

More generally, considering the inapplicability of massive queries in the real world, Zang et al. [37] introduced RL to learn from the attack history. They regarded two operations (identifying vital words to be substituted, and selecting an appropriate substitute to replace the identified vital word) as the action. They then used the Policy Gradient to learn the policy under which an adversarial example is crafted by taking a series of actions. This method can theoretically be combined with any candidate substitute method. Yuan et al. [73] systematically studied the transferability of adversarial attacks. They leveraged the GA to find an optimal ensemble with the minimum number of model members to generate adversarial texts that strongly transfer to other victim models. Further, they generalized adversarial examples constructed by the
ensemble method into universal semantics-preserving word replacement rules, which induce adversaries on any text input.

3.3.3. Importance-based method

For generating semantics-preserving texts with minimum modifications, Samanta et al. [78] first ranked words in descending order according to the class probability changes, which were obtained by removing words one by one. Then, they modified the input with a removal-addition-replacement strategy. It works best for datasets such as the Internet Movie Database (IMDB), which has sub-categories within each class. Later, Jin et al. [79] and Maheshwary et al. [80] adopted the keyword ranking method in [78]. The difference is that Jin et al. [79] utilized three strategies (synonym extraction, part-of-speech checking, and semantic similarity checking) to replace words with the most semantically similar and grammatically correct substitutes, while Maheshwary et al. [80] further considered the original word and its surrounding context when searching for substitute candidates. Besides, Ren et al. [81] proposed a synonym substitution based Probabilistic Weighted Word Saliency (PWWS) method, which determines the word replacement order by the word saliency and the classification probability. The former reflects the importance of the original word to the classification probability; the latter indicates the attack performance of the substitute. This method performs well in effectively reducing substitutes.

To handle the problem of computational complexity and query consumption, Explain2Attack [82] employs cross-domain interpretability to obtain word importance in black-box scenarios. It first builds an interpretable substitute model that imitates the victim model's behavior, and then uses the interpretability capability to produce word importance scores. It reduces computational complexity and query consumption to a large extent while ensuring the attack success rate.

More generally, Yang et al. [83] proposed a systematic probabilistic framework, in which critical features are identified first and then perturbed with values chosen from a dictionary. As two instantiations of this framework, Greedy Attack crafts single-feature perturbed inputs that achieve a higher success rate, while Gumbel Attack learns a parametric sampling distribution and requires fewer model evaluations, leading to better efficiency in real-time or large-scale attacks.

3.3.4. Edit-based method

The Metropolis-Hastings Attack (MHA) [84] employs the Metropolis-Hastings sampling approach to replace old words and random words. The only difference between the black-box and white-box MHA is the pre-selection function used for selecting the most likely word to modify. Compared with previous language generation models using the Metropolis-Hastings sampling approach, the black-box MHA's stationary distribution is equipped with a language model term and an adversarial attacking term, making the adversarial text generation fluent and effective.

Furthermore, Li et al. [85] built their model on a pre-trained masked language model and modified the input in a context-aware manner. They proposed three contextualized perturbations (replace, insert, and merge) and used a mask-then-infill procedure to generate fluent and grammatical adversarial texts with varied lengths.

Emelin et al. [86] exploited word sense disambiguation bias in neural machine translation models for model-agnostic adversarial attacks. Word sense disambiguation is a well-known source of translation errors in neural machine translation tasks. They argued that some incorrect disambiguation choices result from models' over-reliance on dataset artifacts found in the training data, specifically superficial word co-occurrences. Besides, they minimally perturbed sentences to elicit disambiguation errors to probe the robustness of translation models. The method does not require access to gradient information or the score distribution of the decoder.

3.4. Sentence-level attack

The sentence-level attack takes the sentence as the object and includes operations like inserting new sentences, generating paraphrases, and even changing the sentence structure. Slightly different from the former two kinds of attacks, the sentence-level attack frequently employs five example generation strategies: gradient-based, optimization-based, edit-based, paraphrase-based, and generative model-based methods, as shown in Table 4. Adversarial examples generated by these methods are semantics-preserving and full of diversity, but some of them have reduced readability caused by adding meaningless token sequences.

Table 4
Summary of Sentence-level Adversarial Attacks in NLP.

3.4.1. Gradient-based method

The universal adversarial perturbation is a particular noise, combined with which any text can fool NLP models with a high probability. Behjati et al. [87] were the first to investigate universal adversarial perturbations in the NLP field. They used an iterative projected gradient-based approach on the embedding space to craft a sequence of words. Then, they applied the generated sequence to any input sequence in the corresponding domain. The Natural Universal Trigger Search method [88] is based on an adversarially regularized autoencoder [126]. To avoid out-of-distribution noise vectors and maintain the naturalness of the generated texts, this method leveraged projected gradient descent with $l_2$ regularization.

Furthermore, several gradient-guided methods based on HotFlip [25] have been proposed to generate universal adversarial triggers. For instance, Wallace et al. [89] first specified the trigger length and initialized a sequence trigger, then replaced tokens of the initialized sequence through HotFlip, and finally added the generated trigger to the beginning or end of the given text. Because a long trigger is more effective but also more noticeable than a shorter one, the trigger length is an important criterion in this method. Atanasova et al. [90] focused on ensuring the semantic validity of adversarial texts. They extended HotFlip by jointly minimizing a fact-checking model's target class loss and an auxiliary natural language inference model's entailment class loss to generate universal triggers. Then, the generated universal triggers were input to a conditional language model trained using a GPT-2 model. Their method effectively crafts semantically valid statements containing at least one trigger.

3.4.2. Optimization-based method

Minervini et al. [118] studied the automatic generation of adversarial examples that violate a set of given First-Order Logic constraints in the natural language inference task. They regarded this issue as a combinatorial optimization problem. They generated linguistically plausible texts by using language models and maximizing a quantity that measures the violation degree of such constraints. Specifically, they maximized the inconsistency loss $\mathcal{J}_I$ to search for the substitution set $S$ (i.e., adversarial examples) using the following language model:

$$\max_{S}\; \mathcal{J}_I(S) = \big[ p(S, \mathrm{body}) - p(S, \mathrm{head}) \big]_{+} \quad \text{s.t.} \quad \log p_L(S) \le \tau \qquad (8)$$

Here, $p_L(S)$ represents the probability of a sentence in $S = \{X_1 \to s_1, \ldots, X_n \to s_n\}$; $\tau$ is a threshold on the perplexity of the generated sequences. $p(S, \mathrm{body})$ and $p(S, \mathrm{head})$ are the probabilities of the given rule, after replacing $X_i$ with the corresponding sentence $S_i$; body and head represent the premise and the conclusion of the natural language inference rules.

3.4.3. Edit-based method

Jia et al. [58] proposed ADDSENT and ADDANY, both of which generate distractor sentences that confuse models but neither contradict the correct answer nor confuse humans. ADDSENT generates a sentence that looks similar to the question but does not contradict the correct answer, and then adds the generated sentence to the end of the given paragraph. Its variant ADDONESENT adds a random human-approved sentence. On the contrary, ADDANY does not consider the sentence's grammar and adds an arbitrary sequence of English words by querying the victim model many times. Unlike ADDSENT and ADDANY, which both try to incorporate words from the question into the adversarial sentence, ADDCOMMON, a variant of ADDANY, only uses common words in the adversarial sentence.

Based on the work in [58], Wang et al. [46] proposed ADDSENTDIVERSE, an improvement of ADDSENT, to craft adversarial examples with a higher variance, where distractors have randomized placements, leading to the expansion of the fake answer set. To address the antonym-style semantic perturbations used in ADDSENT, they added semantic relationship features to enable the model to identify the semantic relationship among question contexts with the help of WordNet. Further, Nizar et al. [59] approximated a black-box victim model via model extraction [127], and then used ADDANY-KBEST, a variant of ADDANY, to craft adversarial examples.

3.4.4. Paraphrase-based method

Some researchers have generated the paraphrase of a given text as its adversarial example. Ribeiro et al. [91] iteratively generated paraphrases for an input sentence and obtained the victim model prediction until the prediction was changed. They proposed a semantic-equivalent rule-based method to generalize these generated examples into semantically equivalent rules for understanding and fixing the most impactful bugs. When generating the paraphrases, controlled perturbations are incorporated. Iyyer et al. [63] proposed an encoder-decoder based SCPNs method. For a
given sentence and a corresponding target syntax structure, SCPNs first encodes the sentence by a bidirectional Long Short-Term Memory (LSTM) model, and then inputs the interpretation and the targeted syntactic tree into the LSTM model for decoding to obtain the targeted paraphrase of the given sentence. In the decoding procedure, the soft attention [92] and copy mechanism [93] are further introduced. Although SCPNs uses the target strategy, it does not specify the target output. Besides, adversarial texts generated by SCPNs can effectively improve the pre-trained model's robustness to syntactic changes.

Some other researchers generated adversarial examples by expanding the original sentence. For example, AdvExpander [42] first uses linguistic rules to determine which constituents (word or phrase) to expand and what types of modifiers to expand with. Then, it expands each constituent by inserting an adversarial modifier searched from a conditional variational autoencoder based generative model [128] that is pre-trained on the Billion Word Benchmark. This method differs from existing substitution-based methods and introduces rich linguistic variations to adversarial texts.

3.4.5. Generative model-based method

Zhao et al. [94] proposed a GAN-based framework to craft efficient and natural adversarial texts. This framework consists of two main components: a GAN for generating fake data, and a converter for mapping the input to its latent representation $z_0$. The two components are trained on the original input by minimizing the reconstruction errors between original and adversarial examples. The perturbation is performed in the dense latent space by identifying a perturbed example $\hat{z}$ in the neighborhood of $z_0$. Two search approaches (iterative stochastic search, and hybrid shrinking search) were used to identify the proper $\hat{z}$. This method is appropriate for both image and textual data, as it intrinsically eliminates the problem raised by the discrete attribute of textual data. However, because the victim model must be queried each time to find a $\hat{z}$ that makes the model produce an incorrect prediction, this method is quite time-consuming. Furthermore, Wong et al. [36] proposed a GAN-based framework that utilizes RL to guide the training of the GAN. Thus, there is no converter but an autoencoder, which judges the semantic similarity between original texts and adversarial ones. However, it is restricted to binary text classifiers.

Different from the above methods, Wang et al. [97] proposed CAT-Gen. Given a text, CAT-Gen generates adversarial texts through controllable attributes known to be irrelevant to the task label. As shown in Fig. 6, regarding the product category as a controllable attribute for the sentiment analysis task, CAT-Gen first pre-trains an encoder and a decoder to copy the input sentence $x$ with the attribute $a$, and pre-trains an attribute classifier using an auxiliary dataset. Then, given a desired attribute $a' \neq a$, it uses the attribute classifier to train the decoder to enable the model to generate an output with the attribute $a'$. Finally, it searches through the whole attribute space of $a' \neq a$ and looks for the $a'$ that maximizes the cross-entropy loss between the prediction and the ground-truth task label.

3.5. Multi-level attack

The objects modified in multi-level attacks are two or three of the character, word, and sentence levels. According to the primary strategy used in multi-level attacks, these approaches are divided into gradient-based, optimization-based, importance-based, edit-based, and generative model-based methods, as shown in Table 5. These methods generate various adversarial examples to a certain extent, but they always have more constraints.

3.5.1. Gradient-based method

HotFlip [25] modifies the characters and words in a given text. It performs an atomic flip operation relying on gradient computation on the one-hot representation. For character-level attacks, the flip operation is represented as:

$$\vec{v}_{ijb} = \big( \vec{0}, \ldots; (0, \ldots, 0, -1, 0, \ldots, 1, 0)_j; \ldots; \vec{0} \big)_i \qquad (9)$$

The above formula means that the j-th character of the i-th word in a text is changed from a to b, which are the characters at the a-th and b-th places in the alphabet (where the $-1$ and $1$ sit, respectively). The change in the directional derivative along this vector is calculated to find the biggest increase in the loss $J(x, y)$. The calculation process is shown below:

$$\max_{ijb} \nabla_x J(x, y)^{T} \cdot \vec{v}_{ijb} = \max_{ijb} \left( \frac{\partial J^{(b)}}{\partial x_{ij}} - \frac{\partial J^{(a)}}{\partial x_{ij}} \right) \qquad (10)$$

where $x_{ij}$ is a one-hot vector that denotes the j-th character of the i-th word. For word-level attacks, HotFlip is further used with a few semantics-preserving constraints like cosine similarity. However, with one or two flips under strict constraints, HotFlip only generates a few successful adversarial examples, making it unsuitable for a large-scale attack.

3.5.2. Optimization-based method

For crafting adversarial texts with sentence-level rewriting, Xu et al. [43] first designed a sampling method, called RewritingSampler, to efficiently rewrite the original sentence in multiple ways. Then, they allowed for both word-level and sentence-level changes. To constrain the semantic similarity and grammatical quality, they employed the word embedding sum and a GPT-2 model [62].

3.5.3. Importance-based method

TextBugger [24] is used in both black-box and white-box scenarios. In white-box scenarios, it is also a gradient-based method, as it first uses the Jacobian matrix $J$ to calculate the importance of each word, as below:

$$C_{x_i} = J_{F(i, y)} = \frac{\partial F_y(x)}{\partial x_i} \qquad (11)$$

Here, $F_y(\cdot)$ represents the confidence score of class $y$, and $C_{x_i}$ is the importance score of the i-th word in the input $x$. Then, TextBugger uses five editing strategies (insertion, deletion, swap, substitution with visually similar words, and substitution with semantically similar words) to generate character-level and word-level adversarial texts.

In black-box scenarios, TextBugger first splits the document into sequences. Then, it queries the victim model to filter out sentences with predictions different from the original labels, sorts these sequences in reverse order according to their confidence, and calculates the word importance score by deletion through the following equation:

$$C_{x_i} = F_y(x_1, \ldots, x_{i-1}, x_i, x_{i+1}, \ldots, x_n) - F_y(x_1, \ldots, x_{i-1}, x_{i+1}, \ldots, x_n) \qquad (12)$$

Finally, the same editing operations as in the white-box attack are used to modify texts. In TextBugger, only a few editing operations mislead the classifier.

Since previous work focused exclusively on semantic tasks, Zheng et al. [44] were the first to explore adversarial attacks in syntactic tasks. They focused on the dependency parsing model and constructed syntactic adversarial examples at both the sentence and phrase levels. They followed a two-step procedure: 1) choosing weak spots (or positions) to change; 2) modifying them to maximize the victim model's errors in both black-box and white-box scenarios.
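TextBugger's deletion-based importance score of Eq. (12) is straightforward to sketch in plain Python; the toy scoring function below stands in for the victim model's confidence $F_y$ and is purely an assumption for illustration.

```python
def word_importance(words, score_fn):
    """Black-box word importance in the style of TextBugger's Eq. (12):
    C_{x_i} = F_y(x) - F_y(x with the i-th word deleted), estimated by
    removing each word in turn and re-querying the victim model score_fn."""
    base = score_fn(words)
    return [base - score_fn(words[:i] + words[i + 1:]) for i in range(len(words))]

# hypothetical victim "model": confidence = share of positive words
POSITIVE = {"great", "fun"}
def toy_score(words):
    return sum(w in POSITIVE for w in words) / max(len(words), 1)

scores = word_importance(["a", "great", "movie"], toy_score)
# "great" receives the highest score, so it would be edited first
```

Each score costs one extra model query, which is why the black-box variant first filters and sorts sentences before scoring individual words.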
Fig. 6. Overview of CAT-Gen training process in [97]. The encoder, decoder, projector, and attribute classifier are pre-trained in advance.
Table 5
Summary of Multi-level Adversarial Attacks in NLP.
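Relatedly, the first-order flip selection that HotFlip performs in Eqs. (9) and (10) (Section 3.5.1) amounts to scanning the gradient of the loss with respect to the one-hot character inputs; the nested-list layout and names here are a simplified sketch, not the original implementation.

```python
def best_flip(grad, text_ids):
    """Pick the character flip with the largest first-order loss increase,
    in the spirit of HotFlip's Eq. (10).

    grad:     grad[i][j][c] = dJ/dx for the one-hot slot of character c at
              position j of word i
    text_ids: text_ids[i][j] = alphabet index a of the current character
    Returns ((i, j, b), gain): flipping character (i, j) from a to b is
    estimated to raise the loss by grad[i][j][b] - grad[i][j][a].
    """
    best, best_gain = None, float("-inf")
    for i, word in enumerate(text_ids):
        for j, a in enumerate(word):
            for b, g_b in enumerate(grad[i][j]):
                if b == a:
                    continue  # not a flip
                gain = g_b - grad[i][j][a]
                if gain > best_gain:
                    best, best_gain = (i, j, b), gain
    return best, best_gain

# one word of two characters over a 3-letter alphabet
flip, gain = best_flip([[[0.1, 0.9, 0.0], [0.2, 0.2, 0.8]]], [[0, 1]])
```

A greedy or beam search then applies the chosen flip and repeats, subject to the semantics-preserving constraints mentioned in Section 3.5.1.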
3.5.4. Edit-based method

Niu et al. [98] paid attention to both Should-Not-Change and Should-Change adversarial strategies. The Should-Not-Change strategy includes four edits: transposing neighboring tokens randomly, removing stopwords randomly, replacing words with their paraphrases, and using grammar errors like changing a verb to the wrong tense. The Should-Change strategy contains two methods: negating the root verb of the source input, and changing verbs, adjectives, or adverbs to their antonyms.

Blohm et al. [99] implemented four adversarial text generation methods. For black-box attacks, they performed a word-level attack by manually replacing original words with substitutes from pre-trained GloVe embeddings, and used ADDANY [58] as the sentence-level attack. For white-box attacks, they leveraged the model's sentence-level attention distribution to find the plot sentence receiving the greatest attention. In the word-level attack, the k words receiving the most attention in the plot sentence were exchanged with randomly chosen words. Finally, in the sentence-level attack, the whole plot sentence was removed.

3.5.5. Generative model-based method

Vijayaraghavan et al. [100] developed a hybrid encoder-decoder model, called the Adversarial Examples Generator. It consists of two components: an encoder and a decoder. The encoder, which is a slight variant of Chen et al. [129], maps the input sequence to a representation using word and character-level information. The decoder has two-level Gated Recurrent Units (GRU): a word-GRU and a character-GRU. For training the model, they used the self-critical approach of Rennie et al. [130] as their policy gradient training algorithm. Compared with Wong et al. [36], the most outstanding advantage of this method is introducing both word and character-level perturbations.

3.6. Comparison and analysis of attacks

To give readers a more intuitive understanding of these attack methods, this subsection first shows the adversarial examples generated by several representative attacks, and then compares the attack performance and query time consumption of these methods. Specifically, the BERT model [131] pre-trained on the Stanford Sentiment Treebank (SST) dataset is the victim model, which performs the sentiment analysis task and outputs probabilities for the labels Positive and Negative. We utilize 1 character-level attack (DeepWordBug [45]), 3 word-level attacks (Probabilistic Weighted Word Saliency (PWWS) [81], TextFooler [79], and BERT-based attack [132]), 2 sentence-level attacks (Syntactically Controlled Paraphrase Networks (SCPNs) [63], and GAN-based attack [94]), and 1 multi-level attack (TextBugger [24]) to attack the chosen victim model. Fig. 7 and Table 6 show the generated adversarial examples and the attack performances of these methods, respectively.

From the perspective of example quality, character-level attack methods maintain the semantics of the original texts well. However, they are easily detected by human eyes or spelling check tools. For example, in TextBugger, the words "crahm", "aovids", "obv1ous", and "hmour" in the adversarial example are easily observed. Nevertheless, they are less likely to affect the human eye's judgment of the emotions of the whole text. In contrast, word-level attacks compensate for the vulnerability of adversarial examples to detection but affect the semantics of the text to some extent. For instance, in the text "division of the spell of satin rouge is that it void the obvious with body and weightlessness." generated by PWWS, the
Fig. 7. Adversarial examples generated by several attacks. The original examples are two randomly selected from the SST dataset, and both of them are correctly classified as
Positive by the pre-trained BERT model.
the lowest, but their attack success rates are not satisfactory. As mentioned above, the differences between the adversarial examples generated by sentence-level methods and the original ones are relatively large, so researchers should focus on maintaining the semantic consistency and imperceptibility of texts for sentence-level methods. In addition, the model queries required by the word-level attacks (PWWS, TextFooler, and BERT-based attack) are comparatively numerous. Therefore, it is worthwhile for researchers to investigate how to reduce the number of queries of such methods.

4. Textual adversarial attack application

To explore the potential adversarial attack threats faced by existing NLP intelligent systems and to further provide a basis for developing efficient defense strategies for these NLP models, several researchers have applied various textual adversarial attacks to extensive NLP models. These diverse NLP tasks, such as classification, neural machine translation, text entailment, and dialogue generation, are extremely different and are generally implemented on distinct datasets; thus, the adversarial attack methods applied to these different NLP tasks have their own characteristics. Therefore, this section reviews current works on textual adversarial attacks from the application perspective and summarizes them in Table 7. In detail, Section 4.1 presents popular benchmark datasets in the NLP field, and Section 4.2 elaborates on various applications of textual adversarial attacks.

4.1. Benchmark dataset

Since there are massive benchmark datasets for different NLP tasks, this section gives a brief introduction to the applicable task, dataset size, characteristics, and source of the primary datasets in Table 7.

AG's News:2 used for text classification. It consists of over 1 million news articles collected from more than 2,000 news sources by an academic news search engine called Cometomyhead. In total, it includes 120,000 training examples and 7,600 test examples, which come from four categories of the same scale: World, Sport, Business, and Technology; each category has 30,000 training examples and 1,900 test examples. The provided DB version and XML version are available for download for non-commercial use.

DBPedia Ontology:3 used for text classification. It is a dataset with structured content from the information created in various Wikimedia projects. This dataset contains 560,000 training examples and 70,000 testing examples of 14 high-level classes, such as Company, Building, and Film. It has more than 685 classes represented by a subsumption hierarchy structure and is described by 2,795 different attributes.

Yahoo! Answers:4 used for text classification. It contains 4 million question-answer pairs, and it can be used in question-answering systems. Furthermore, it includes ten classification categories obtained from Yahoo! Answers Comprehensive Questions and Answers 1.0, and each class contains 140,000 training examples and 5,000 test examples.

Stanford Sentiment Treebank (SST):5 used for sentiment analysis. It includes 239,232 sentences and phrases, whose syntax varies greatly. Compared with most other datasets that ignore word order, SST establishes a complete representation based on [...]ing to the phrases of words. However, the average example length of SST is only about 17 words.

Internet Movie Database (IMDB):6 used for sentiment analysis. It is crawled from the Internet, including 50,000 highly polarized movie reviews with tags (positive or negative sentiment) and the URLs from which the reviews came. Among them, 25,000 examples are used for training and 25,000 for testing. The average length of the reviews is nearly 234 words, and the size of this dataset is bigger than most similar datasets. Besides, IMDB also contains additional unlabeled data, original texts, and processed data.

Rotten Tomatoes Movie Review (MR):7 used for sentiment analysis. It is a labeled dataset that concerns sentiment polarity, subjective rating, and sentences with subjectivity status or polarity. It contains 5,331 positive examples and 5,331 negative examples, and the average example length is 32 words. Since MR is labeled by manual work, its size is smaller than the others, with a maximum of dozens of MB.

Amazon Review:8 used for sentiment analysis. It has nearly 35 million reviews spanning from June 1995 to March 2013, including product and user information, ratings, and plain-text reviews. It was collected from over 6 million users on more than 2 million products and is categorized into 33 classes, with sizes ranging from KB to GB.

Multi-Perspective Question Answering (MPQA):9 used for sentiment analysis. It is collected from various news sources and annotated for opinions or other private states. It contains 10,606 examples, and each example is labeled with objective or subjective sentiment. Three different versions are available through the MITRE Corporation. The higher the version, the richer the contents.

Stanford Question Answering Dataset (SQuAD):10 used for machine reading comprehension. It contains 107,785 manually generated reading comprehension questions about more than 500 Wikipedia articles. Each question involves a paragraph of an article, and the corresponding answer is in that paragraph. Compared with SQuAD 1.1, SQuAD 2.0 contains the 100,000 questions of SQuAD 1.1 plus more than 50,000 adversarially written, seemingly answerable but unanswerable questions produced by crowd workers. Thus, SQuAD 2.0 requires machine reading comprehension models to answer questions when possible and to determine when paragraphs do not support any answer and abstain from answering.

MovieQA:11 used for machine reading comprehension. It aims to evaluate the model's automatic story comprehension ability from both video and text perspectives. This dataset contains 14,944 multiple-choice questions for 408 movies collected by human annotators. Its questions, which vary from simple "who" or "when" to more complex "why" or "how", can be answered by a variety of information sources, including film editing, plot, and subtitles. Each question has five reasonable answers, and only one is correct.

Stanford Natural Language Inference Corpus (SNLI):12 used for text entailment. It consists of 570,000 human-written English sentence pairs with a manual label of entailment, contradiction, or neutral. There are 550,152 training examples, 10,000 validation examples, and 10,000 test examples.

Visual Question Answering Dataset (VQA):13 used for visual question answering. It is the most widely used dataset for visual question answering tasks. Its images are divided into two parts:

6 http://ai.stanford.edu/amaas/data/sentiment/.
7 http://www.cs.cornell.edu/people/pabo/movie-review-data/.
the sentence structure. Furthermore, it can judge the mood accord- 8
http://snap.stanford.edu/data/web-Amazon.html.
9
http://mpqa.cs.pitt.edu/.
2 10
https://www.kaggle.com/amananandrai/ag-news-classification-dataset. https://rajpurkar.github.io/SQuAD-explorer/.
3 11
https://dbpedia.org/ontology/. http://movieqa.cs.toronto.edu/home/.
4 12
https://sourceforge.net/projects/yahoodataset/. https://nlp.stanford.edu/projects/snli/.
5 13
https://nlp.stanford.edu/sentiment/code.html. https://visualqa.org/download.html.
S. Qiu, Q. Liu, S. Zhou et al. Neurocomputing 492 (2022) 278–307
Table 7
Summary of Adversarial Attack Applications and Benchmark Datasets in NLP.

Task                        Benchmark Dataset
Scene Text Recognition      Street View Text, ICDAR 2013, IIIT5K [144]
Visual Question Answering   VQA dataset [142,145]
Visual Semantic Embedding   MSCOCO [146]
General V+L task            COCO, Visual Genome, Conceptual Captions, SBU Captions [147]
Speech Recognition          Mozilla Common Voice [21]
Table 8
Summary of Applications in Classification Tasks.
204,721 images that come from the Microsoft COCO dataset (MSCOCO)14 based on real scenes, and 50,000 pictures consisting of human and animal models in abstract scenes. Humans generate these questions and answers. In particular, true-or-false questions account for about 40%, and each picture generally corresponds to several question-answer pairs. At present, there are two versions. In VQA 1.0, the questions mostly concern simple position, quantity, and attribute relationships in the image. In VQA 2.0, besides the simple attribute information in the picture, the questions are fused with more conceptual sense. Thus, more studies tend to use VQA 2.0.

4.2. Application in NLP

Differing from Section 3, which focuses on adversarial attack approaches, this section concentrates on the application scenarios and impact of adversarial attacks in the NLP field. According to different text-processing tasks, this section classifies adversarial attack applications into eight categories: classification, neural machine translation, machine reading comprehension, text entailment, part-of-speech tagging, text summarization, dialogue generation, and cross-modal tasks. Note that most attacks were simultaneously applied in multiple different tasks, indicating the transferability of these methods across datasets and DNN models.

4.2.1. Classification
The classification task is one of the most general scenarios in the NLP field. It can be further divided into seven sub-categories: text classification, sentiment analysis, gender identification, grammatical error detection, toxic comment detection, spam detection, and relation extraction, as shown in Table 8.
Text classification aims to categorize the given texts into several classes, such as the classes "Business" and "Sport" in the AG's News dataset. Sentiment analysis classifies sentiments into two or three classes, for example, in a three-group scheme: neutral, positive, and negative. Furthermore, gender identification, grammatical error detection, toxic comment detection, and spam detection can be framed as binary classification problems. In comparison, relation extraction extracts the corresponding relation of the entity pair in a sentence; thus, it can be treated as a multi-class classification issue judging the relationship of the entity pair.
From the perspective of the victim model, LSTM-based models were attacked by several methods [120,45,63,47,84,72,79,87]. Among them, the "Sememe + PSO" attack [47] experimented on the bidirectional LSTM (BiLSTM) model [156] with the IMDB and SST datasets. Compared with the "Embedding/Language Model + Genetic" [40] and "Synonym + Greedy" [81] approaches, it achieved the highest attack success rate on both datasets; in particular, it attacked BiLSTM/BERT on IMDB with notable 100.00%/98.70% success rates.
Convolutional Neural Network (CNN) based models were the target of several attacks [136,78,65,120,45,25,79,83]. For instance, HotFlip [25] was evaluated on the character-level CNN-LSTM (CharCNN-LSTM) model [149] with AG's News in white-box scenarios, and changed an average of 4.18% of the characters to fool the classifier at confidence 0.5. Liang et al. [65] focused on the character-level (CharCNN) [152] and word-level

14 https://cocodataset.org/#download.
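Word-level attacks of the kind described above typically query the victim model and greedily keep substitutions that reduce its confidence. The following is a minimal sketch of that query-and-swap loop, using a hypothetical keyword-based victim classifier and a toy synonym table; neither comes from the surveyed papers.

```python
import math

# Toy sketch of a word-level, score-guided black-box attack in the
# spirit of greedy synonym-substitution methods. The victim model and
# synonym table below are hypothetical stand-ins for illustration only.

TOY_SYNONYMS = {
    "love": ["like", "fancy"],
    "great": ["fine", "decent"],
    "terrible": ["poor", "weak"],
}

def toy_sentiment(text):
    """Hypothetical keyword 'victim' model: returns P(positive)."""
    pos = {"great", "love", "fine", "fancy"}
    neg = {"terrible", "poor", "weak", "hate"}
    words = text.split()
    score = sum(w in pos for w in words) - sum(w in neg for w in words)
    return 1.0 / (1.0 + math.exp(-score))

def greedy_attack(text, victim):
    """Greedily keep any synonym swap that lowers the victim's confidence
    in its originally predicted class, querying only output probabilities
    (the black-box setting)."""
    words = text.split()
    positive = victim(text) >= 0.5

    def confidence(ws):
        p = victim(" ".join(ws))
        return p if positive else 1.0 - p

    for i, w in enumerate(words):
        for syn in TOY_SYNONYMS.get(w, []):
            candidate = words[:i] + [syn] + words[i + 1:]
            if confidence(candidate) < confidence(words):
                words = candidate  # keep the confidence-reducing swap
    return " ".join(words)
```

Real attacks such as those cited in this section additionally constrain substitutions (e.g., by embedding similarity or part-of-speech agreement) and budget the number of victim queries; this sketch only illustrates the basic greedy loop.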
not change. Even the Google Perspective API15 was attacked by DISTFLIP [60] with the toxic comment dataset16 in black-box scenarios. 42% of the API-predicted labels corresponding to the generated examples were flipped by this attack, while humans maintained high accuracy in predicting the target label.

4.2.2. Neural machine translation
Existing Neural Machine Translation systems are used to translate one natural language into another. Given an adversarial text, the translation obtained from the system is inconsistent with the semantics understood by human beings. Some related works are shown in Table 9.

Table 9
Summary of Applications in Neural Machine Translation Tasks.

Work   Benchmark Dataset   Victim Model
[64]   WMT Corpus          WordLSTM encoder, word-based attention decoder
[70]   WMT Corpus          Transformer, LSTM-based model, ConvS2S

The adversarial attacks in [38,35] attacked character-level neural machine translation models on the TED Talks Parallel Corpus [138]. The difference between them is that the former implemented only black-box attacks, while the latter proposed both black-box and white-box attacks. Further, in [35], the average numbers of character changes and queries in the black-box setting are respectively 3.6 and 4.3 times those in the white-box setting. To attack the Transformer model [166], Cheng et al. [119] used a gradient-based method and achieved improvements of 2.8 and 1.6 Bilingual Evaluation Understudy (BLEU) points on the NIST dataset and the WMT Corpus, respectively. Besides, Emelin et al. [86] elicited word sense disambiguation biases. They demonstrated that disambiguation robustness varies substantially between domains, and that different models trained on the same data are vulnerable to different attacks. Tan et al. [76] perturbed the inflectional morphology of words in given texts. All these attack methods have reduced the performance of DNN models to a large extent.

Table 10
Summary of Applications in Machine Reading Comprehension Tasks.

Work   Benchmark Dataset   Victim Model
[58]   SQuAD               BiDAF [169], Match-LSTM [170]
[89]   SQuAD               QANet [171], BiDAF, BiDAF with CharCNN
[59]   SQuAD               BERT based models
[76]   SQuAD               BiDAF, ELMo-BiDAF [163], SpanBERT
[99]   MovieQA             Multiple CNN based, RNN-LSTM based models

Table 11
Summary of Applications in Text Entailment Tasks.

Work    Benchmark Dataset   Victim Model
[63]    SICK                BiLSTM
[90]    FEVER               RoBERTa
[118]   SNLI, MultiNLI      DAM, ESIM, cBiLSTM
[40]    SNLI                Model with ReLU layers
[84]    SNLI                BiDAF
[43]    SNLI, MultiNLI      BERT
[39]    SNLI                RoBERTa
[47]    SNLI                BiLSTM, BERT
[134]   MultiNLI            BERT, RoBERTa, XLM [172], XLNet [173]
[79]    SNLI, MultiNLI      BiLSTM, ESIM [174], BERT
[71]    SNLI, MultiNLI      BiLSTM, ESIM, BERT
[80]    SNLI                WordLSTM, BERT
tain semantics and syntax, and achieved an attack success rate of 70% with a modification rate of 23%. To improve the validity and fluency of adversarial text, Zhang et al. [84] improved the work in [40] by employing Metropolis-Hastings sampling, and it reduced the perplexity (PPL) by nearly 500 points.
From the perspective of the victim model, more and more attacks [43,39,47,134,79,90] were evaluated on pre-trained language models. Compared with the effective and efficient TextFooler [79], RewritingSampler [43] showed better diversity, semantic similarity, and grammatical quality when attacking BERT. The lightweight Mischief [134] was evaluated on four transformer-based models: BERT [131], RoBERTa [157], XLM [172], and XLNet [173], and significantly reduced the performance of these models by up to 20%. Furthermore, the method in [90] ensured the semantic validity of adversarial texts crafted by universal triggers.

4.2.5. Part-of-speech tagging
Part-of-speech (POS) tagging is the process of marking up words in a text as corresponding to a particular part of speech, such as "noun" or "verb", based on both their definition and context. In particular, Yasunaga et al. [139] added perturbations to the input word or character embeddings, and conducted a series of POS tagging experiments on the Penn Treebank WSJ corpus [175] (English) and the Universal Dependencies [176] (27 languages) dataset. Further, Eger et al. [39] proposed Zéroe, containing nine attack modes, and evaluated it with RoBERTa [157] on the Universal Dependencies dataset. The results indicated that the intrude attack, which randomly inserts unobtrusive symbols (like $%&.- and whitespace) into a given text, is among the most severe attacks for POS tagging, since it decreased the victim model accuracy to around 16%. Besides, Han et al. [140] used a sequence-to-sequence model with feedback from multiple reference models of the same task to generate adversarial sentences with different lengths and structures.

4.2.6. Text summarization
The Text Summarization task summarizes the main content or meaning of a given document or paragraph with a succinct expression. Because the average length of given texts varies greatly, it is challenging to implement an adversarial attack on this task. For example, Seq2Sick [64] was verified on three datasets (DUC2003, DUC2004, and Gigaword). The DUC2003 and DUC2004 datasets are extensively employed in document summarization. Changing only 2 or 3 words on average, Seq2Sick leads to entirely different outputs for more than 80% of sentences.

4.2.7. Dialogue generation
Dialogue Generation is a kind of text generation task, which automatically generates a response according to a given post. Besides, the dialogue generation model is a fundamental component of real-world dialogue systems. Niu et al. [98] verified their approach with the Ubuntu Dialogue Corpus [177] and the Collaborative Communicating Agents dataset (CoCoA) [178]. For the Ubuntu Dialogue Corpus, which contains 1 million two-person, multi-turn dialogues extracted from Ubuntu chat logs used to provide and receive technical support, they focused on the Variational Hierarchical Encoder-Decoder model [179] and the RL model [180]. In contrast, for the CoCoA dialogue dataset, which involves two agents that are asymmetrically primed with a private Knowledge Base (KB) and engage in a natural language conversation to find the unique entry shared by the two KBs, they paid attention to attacking the Dynamic Knowledge Graph Network. Further, adversarial training with generated adversarial examples makes all models more robust to adversarial attacks and improves their performances when evaluated on the original test set.

4.2.8. Cross-modal task
In addition to the tasks dealing with single-modality input, some NLP-related cross-modal tasks face the adversarial attack threat. These cross-modal tasks can be categorized into two main types: text-and-vision and text-and-audio, as shown in Table 12.
Text-and-Vision. The Image Captioning model takes an image as input and generates a textual caption that describes the visual content of the input image. For attacking the CNN-RNN-based image captioning model, Chen et al. [141] proposed Show-and-Fool, which includes a targeted caption method that makes model outputs match the target caption, and a targeted keyword method that makes model outputs contain specific keywords. Experimented on the MSCOCO dataset [181], the former achieved a 95.8% attack success rate, and the latter achieved an even higher success rate, especially at least 96% for the 3-keyword case and at least 97% for the 1-keyword and 2-keyword cases. Following this, Xu et al. [142] proposed an iterative optimization method and investigated adversarial attacks on the DenseCap network [182] with the Visual Genome dataset [183]. Its objective function maximizes the probability of the target answer and down-weights the preference for adversarial examples with a smaller distance to the original image once this distance falls below a threshold. Although it is challenging to train an RNN-based caption generation model to generate exactly matching captions, and the DenseCap network involves randomness, this method reached beyond a 97% success rate.
The Optical Character Recognition model takes an image as input and outputs the recognized text. These tasks approximately include two types: character-based and end-to-end. The former is a traditional approach for recognizing text in "block of text" images; the latter is a segmentation-free technique that recognizes the entire sequence of characters in a variable-sized "block of text" image. Regarding Tesseract17 as the victim model, Song et al. [123] successfully caused over 90% of the words in their list to be misrecognized in the character-based task, and flipped the meaning of a relatively long document by changing only 1–2 words in the end-to-end task. Additionally, Chen et al. [143] proposed WATERMARK, which produces natural distortion in the disguise of watermarks. In white-box and targeted attack scenarios, WATERMARK was performed on a DenseNet + CTC based model trained on a Chinese text image dataset, which includes 3.64 million images and 5,989 unique characters. The adversarial examples crafted by WATERMARK are human-eye friendly and have high success probabilities. Besides, some adversarial examples even work on Tesseract in a black-box manner.
Scene Text Recognition is a standard sequential learning task with a varied-length output. By comparison, Optical Character Recognition is a pipeline process that first segments the word into characters and then recognizes each single character. In Scene Text Recognition tasks, the entire image is directly mapped to a word string. Yuan et al. [144] proposed an adaptive attack to accelerate the adversarial attack through multi-task learning [184], which improves learning efficiency and prediction accuracy by learning multiple objectives from a shared representation. They implemented their attack on the Scene Text Recognition model with the Street View Text [52], ICDAR 2013 [185], and IIIT 5K-Word [186] datasets, and achieved an over 99.9% success rate with 3–6 times speedup compared to the Convolutional Recurrent Neural Network (CRNN) [187].
The Visual Question Answering model provides an accurate answer in natural language when given an image and a natural language question about the image. Xu et al. [142] evaluated their method on two models (the MCB model [188] and the compositional model N2NMN [189]) on the VQA dataset [190]. They evaluated

17 https://github.com/tesseract-ocr/tesseract.
Table 12
Summary of Applications in Cross-modal Tasks.

Category         Task                 Work    Benchmark Dataset                                         Victim Model
text-and-vision  general V+L task     [147]   COCO, Visual Genome, Conceptual Captions, SBU Captions    UNITER
text-and-audio   speech recognition   [21]    Mozilla Common Voice                                      DeepSpeech
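Audio perturbations of the kind summarized in Table 12's text-and-audio row are usually reported on a decibel scale relative to the source waveform. A minimal sketch of that metric, using toy integer samples rather than real audio:

```python
import math

# Sketch of the decibel-scale loudness metric commonly used to report
# audio adversarial perturbations: dB(x) = 20*log10(max_i |x_i|), and
# the distortion of a perturbation delta relative to a source x is
# dB_x(delta) = dB(delta) - dB(x). Sample values here are toy integers.

def db(samples):
    """Peak loudness of a waveform in decibels."""
    return 20.0 * math.log10(max(abs(s) for s in samples))

def distortion_db(source, perturbation):
    """Relative loudness of the perturbation; more negative means
    quieter, i.e., a less perceptible adversarial perturbation."""
    return db(perturbation) - db(source)
```

A tiny perturbation riding on a loud waveform thus yields a strongly negative distortion value, which is why such attacks can remain nearly inaudible.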
the success rate (over 90%) and confidence score (above 0.7) of the victim model in predicting the target answer. Moreover, they concluded that the attention, bounding box localization, and internal compositional structures are vulnerable to adversarial attacks. Besides, Tang et al. [145] leveraged an encoder-decoder neural machine translation framework and iterative FGSM [191] to generate semantically equivalent adversarial examples of both visual and textual data as augmented data, which were then utilized for training a visual question answering model using adversarial learning. The model trained with their method obtained 65.16% accuracy on the clean validation dataset, beating its vanilla-training counterpart by 1.84%. The adversarially trained model significantly increases accuracy on adversarial examples by 21.55%.
The Visual Semantic Embedding task bridges natural language and the underlying visual world. In this task, the embedding spaces of both images and descriptive captions are jointly optimized and aligned. Shi et al. [146] performed adversarial attacks on the textual part through three editing operations (replacing nouns in the caption, changing numerals to different ones, and detecting the relations and shuffling the non-interchangeable noun phrases or replacing the prepositions). The evaluation on the VSE++ model [192] with the MSCOCO dataset showed that, although VSE++ obtains good performance on the original test set, it is vulnerable to caption-specific adversarial attacks.
More generally, Gan et al. [147] proposed VILLA, a large-scale adversarial training strategy for vision-and-language representation learning in tasks including Visual Question Answering [193], Visual Commonsense Reasoning [194], Natural Language Visual Reasoning for Real (NLVR2) [195], Visual Entailment [196], Referring Expression Comprehension [197], and Image-Text Retrieval [198]. VILLA first conducts a task-agnostic adversarial pre-training to lift model performance for all downstream tasks uniformly, then implements a task-specific adversarial fine-tuning to additionally enhance the fine-tuned models. Differing from conventional approaches, VILLA adds adversarial perturbations to word embeddings and extracted image-region features, respectively. To enable large-scale training, it adopts the "free" adversarial training strategy and combines it with KL-divergence-based regularization to promote higher invariance in the embedding space. Relying on standard bottom-up image features only, VILLA improves the single-model performance of UNITER-large from 74.02 to 74.87 on visual question answering tasks and from 62.8 to 65.7 on visual commonsense reasoning tasks. With the ensemble, visual question answering performance is further boosted to 75.85.
Text-and-Audio. The Speech Recognition model recognizes and translates spoken language into text automatically. Carlini et al. [21] crafted targeted audio adversarial examples on automatic speech recognition models. Given any natural waveform x, they constructed a perturbation δ that is almost inaudible so that x + δ was recognized as any desired phrase. They experimented with the Mozilla Common Voice dataset18 on the DeepSpeech model [199]. Furthermore, they generated targeted adversarial examples with a 100% success rate for each source-target pair, with an average perturbation of −31 dB; in particular, the 95% interval for distortion ranged from −15 dB to −45 dB.

4.2.9. Other tasks
Recently, some researchers have proposed adversarial attacks applied to relatively novel NLP tasks. For example, Cheng et al. [148] proposed a framework to generate adversarial agents rather than adversarial examples in an interactive dialogue system under both black-box and white-box settings. Zheng et al. [44] explored the feasibility of generating syntactic adversarial sentences to lead a dependency parser to make mistakes without altering the original syntactic structures. The experiments with a graph-based dependency parser [200] on the English Penn Treebank showed that up to 77% of input examples admit adversarial perturbations.

5. Defense against textual adversarial attack

The wide application of adversarial attacks in the NLP domain makes researchers aware of the potentially severe adversarial threats that NLP intelligent systems face in the real world. To enhance the defensive capability of DNNs in NLP tasks and further improve the security of these intelligent systems against adversarial attacks, researchers have proposed numerous strategies of two types: passive and active defense. The passive method detects adversarial input during the inference procedure, while the active method generally improves the robustness of the model when training it.

5.1. Passive defense

As mentioned above, adversarial texts are perceptible and must preserve semantics. Thus, checking the input is the most straightforward and general passive defense method. According to the type of adversarial attacks that these methods defend against, this survey categorizes these defense strategies into two classes, as shown in Table 13.
For character-level attacks, there are some workable misspelling-checking tools, such as the Python autocorrect 0.3.0 package [45] and a context-aware spelling check service [24]. Additionally, Pruthi et al. [101] designed a word recognition model with three back-off strategies to check misspellings or typos.
For word-level attacks, Mozes et al. [102] observed the frequency differences between words and their substitutes, and then proposed Frequency-guided Word Substitutions (FGWS), which is rule-based and model-agnostic. Besides, Zhou et al. [103] focused on determining whether a particular word is a perturbation and

18 https://commonvoice.mozilla.org/zh-CN/datasets.
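The dictionary-based input checking described above for character-level attacks can be approximated with vocabulary lookup plus fuzzy matching. A minimal sketch using Python's standard difflib as a stand-in for the cited spell-checking tools, with a small hypothetical vocabulary:

```python
import difflib

# Sketch of a passive defense against character-level perturbations:
# out-of-vocabulary tokens are mapped back to their closest vocabulary
# entry before the text reaches the classifier. difflib (stdlib) stands
# in for the dedicated spell-checking tools cited in the survey.

VOCAB = ["the", "movie", "was", "terrible", "great", "acting"]

def restore(text):
    out = []
    for token in text.lower().split():
        if token in VOCAB:
            out.append(token)
        else:
            # closest dictionary word above a similarity cutoff
            match = difflib.get_close_matches(token, VOCAB, n=1, cutoff=0.6)
            out.append(match[0] if match else token)  # fall back to raw token
    return " ".join(out)
```

In a deployed pipeline the vocabulary would be the model's training vocabulary, and the similarity cutoff trades off attack recovery against accidentally rewriting legitimate rare words.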
Table 14
Summary of Active Defense Strategies.
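Many active defenses of the kind summarized in Table 14 build on adversarial training, i.e., retraining the model on perturbed copies of the training data (as also noted for dialogue models in Section 4.2). A minimal data-augmentation sketch, using a simple adjacent-character swap as a stand-in for the far more principled perturbations the surveyed methods generate:

```python
import random

# Minimal sketch of adversarial-training-style data augmentation:
# each (text, label) pair is duplicated with a perturbed text so the
# model also sees attack-like inputs during training. The perturbation
# here (one adjacent-character swap per copy) is purely illustrative.

def swap_perturb(text, rng):
    """Swap one adjacent character pair inside a random word,
    keeping the first and last characters fixed."""
    words = text.split()
    candidates = [i for i, w in enumerate(words) if len(w) >= 4]
    if not candidates:
        return text
    i = rng.choice(candidates)
    w = words[i]
    p = rng.randrange(1, len(w) - 2)
    words[i] = w[:p] + w[p + 1] + w[p] + w[p + 2:]
    return " ".join(words)

def augment(dataset, seed=0):
    """Return the original (text, label) pairs plus perturbed copies."""
    rng = random.Random(seed)  # fixed seed keeps augmentation reproducible
    return dataset + [(swap_perturb(text, rng), label) for text, label in dataset]
```

Stronger active defenses replace the random swap with perturbations found by an actual attack against the current model, re-generating them as training proceeds.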
Benchmark platform for textual adversarial attack, defense, and evaluation. In the CV field, researchers have proposed several adversarial attack and defense toolboxes, such as CleverHans [207], Foolbox [208], and Advertorch [209]. However, as far as we know, TextAttack [210] and OpenAttack [211] are the only textual adversarial attack toolboxes currently available. Nevertheless, neither of them covers defense and evaluation functions. Since different attack and defense methods are being proposed and applied to various NLP tasks, it is challenging to compare the advantages and disadvantages of these methods. Therefore, it is essential to develop a textual adversarial benchmark platform that integrates attack, defense, and evaluation functions.

Application in emerging tasks. Currently, textual adversarial attack technology is mainly used to attack various text-processing intelligence algorithms, assisting researchers in identifying potential security threats in existing intelligence algorithms and improving them. Researchers can vigorously explore novel tasks where adversarial attack technology can be applied in the future, for example, utilizing adversarial technology in the 3D modelling process to improve the realism of 3D models [212,213], or applying adversarial examples to human-machine verification [214,215] to fool machines without affecting regular users.

CRediT authorship contribution statement

Shilin Qiu: Conceptualization, Investigation, Writing – original draft, Writing – review & editing. Qihe Liu: Writing – review & editing, Funding acquisition, Project administration. Shijie Zhou: Supervision, Project administration. Wen Huang: Investigation, Writing – review & editing.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgment

This work was supported by the Sichuan Science and Technology Program [Grant Nos. 2019YFG0399, 2020YFG0472, 2020YFG0031].

References

[1] A. Krizhevsky, I. Sutskever, G.E. Hinton, Imagenet classification with deep convolutional neural networks, Commun. ACM 60 (6) (2017) 84–90.
[2] L. Qin, N. Yu, D. Zhao, Applying the convolutional neural network deep learning technology to behavioural recognition in intelligent video, Tehnički vjesnik 25 (2) (2018) 528–535.
[3] M.S. Hossain, G. Muhammad, Emotion recognition using deep learning approach from audio–visual emotional big data, Inf. Fusion 49 (2019) 69–78.
[4] A. Chatterjee, U. Gupta, M.K. Chinnakotla, R. Srikanth, M. Galley, P. Agrawal, Understanding emotions in text using deep learning and big data, Comput. Hum. Behav. 93 (2019) 309–317.
[5] W. Guo, H. Gao, J. Shi, B. Long, L. Zhang, B.-C. Chen, D. Agarwal, Deep natural language processing for search and recommender systems, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 3199–3200.
[6] L. Yang, Y. Li, J. Wang, R.S. Sherratt, Sentiment analysis for e-commerce product reviews in chinese based on sentiment lexicon and deep learning, IEEE Access 8 (2020) 23522–23530.
[7] B. Sisman, J. Yamagishi, S. King, H. Li, An overview of voice conversion and its challenges: From statistical modeling to deep learning, IEEE/ACM Transactions on Audio, Speech, and Language Processing.
[8] M. Saravanan, B. Selvababu, A. Jayan, A. Anand, A. Raj, Arduino based voice controlled robot vehicle, in: IOP Conference Series: Materials Science and Engineering, Vol. 993, IOP Publishing, 2020, p. 012125.
[9] H. Liu, T. Fang, T. Zhou, Y. Wang, L. Wang, Deep learning-based multimodal control interface for human-robot collaboration, Procedia CIRP 72 (2018) 3–8.
[10] C.-S. Oh, J.-M. Yoon, Hardware acceleration technology for deep-learning in autonomous vehicles, in: 2019 IEEE International Conference on Big Data and Smart Computing (BigComp), IEEE, 2019, pp. 1–3.
[11] M. Coccia, Deep learning technology for improving cancer care in society: New directions in cancer imaging driven by artificial intelligence, Technol. Soc. 60 (2020) 101198.
[12] J. Harikrishnan, A. Sudarsan, A. Sadashiv, R.A. Ajai, Vision-face recognition attendance monitoring system for surveillance using deep learning technology and computer vision, in: 2019 International Conference on Vision Towards Emerging Trends in Communication and Networking (ViTECoN), IEEE, 2019, pp. 1–5.
[13] S. So, J. Mun, J. Rho, Simultaneous inverse design of materials and structures via deep learning: demonstration of dipole resonance engineering using core–shell nanoparticles, ACS Appl. Mater. Interfaces 11 (27) (2019) 24264–24268.
[14] H.-P. Chan, L.M. Hadjiiski, R.K. Samala, Computer-aided diagnosis in the era of deep learning, Med. Phys. 47 (5) (2020) e218–e227.
[15] F. Zhang, P.P. Chan, B. Biggio, D.S. Yeung, F. Roli, Adversarial feature selection against evasion attacks, IEEE Trans. Cybern. 46 (3) (2015) 766–777.
[16] K.D. Julian, J. Lopez, J.S. Brush, M.P. Owen, M.J. Kochenderfer, Policy compression for aircraft collision avoidance systems, in: 2016 IEEE/AIAA 35th Digital Avionics Systems Conference (DASC), IEEE, 2016, pp. 1–10.
[17] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, R. Fergus, Intriguing properties of neural networks, arXiv preprint arXiv:1312.6199.
[18] K. Eykholt, I. Evtimov, E. Fernandes, B. Li, A. Rahmati, C. Xiao, A. Prakash, T. Kohno, D. Song, Robust physical-world attacks on deep learning visual classification, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 1625–1634.
[19] I.J. Goodfellow, J. Shlens, C. Szegedy, Explaining and harnessing adversarial examples, arXiv preprint arXiv:1412.6572.
[20] C. Xie, J. Wang, Z. Zhang, Y. Zhou, L. Xie, A. Yuille, Adversarial examples for semantic segmentation and object detection, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 1369–1378.
[21] N. Carlini, D. Wagner, Audio adversarial examples: Targeted attacks on speech-to-text, in: 2018 IEEE Security and Privacy Workshops (SPW), IEEE, 2018, pp. 1–7.
[22] H. Yakura, J. Sakuma, Robust audio adversarial example for a physical attack, arXiv preprint arXiv:1810.11793.
[23] R. Taori, A. Kamsetty, B. Chu, N. Vemuri, Targeted adversarial examples for black box audio systems, in: 2019 IEEE Security and Privacy Workshops (SPW), IEEE, 2019, pp. 15–20.
[24] J. Li, S. Ji, T. Du, B. Li, T. Wang, Textbugger: Generating adversarial text against real-world applications, arXiv preprint arXiv:1812.05271.
[25] J. Ebrahimi, A. Rao, D. Lowd, D. Dou, Hotflip: White-box adversarial examples for text classification, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers), Association for Computational Linguistics, Melbourne, Australia, 2018, pp. 31–36, 10.18653/v1/P18-2006.
[26] X. Liu, Y. Lin, H. Li, J. Zhang, Adversarial examples: Attacks on machine learning-based malware visualization detection methods, arXiv preprint arXiv:1808.01546.
[27] J. Chen, Z. Yang, D. Yang, Mixtext: Linguistically-informed interpolation of hidden space for semi-supervised text classification, arXiv preprint arXiv:2004.12239.
[28] D. Mekala, J. Shang, Contextualized weak supervision for text classification, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 323–333.
[29] R.K. Bakshi, N. Kaur, R. Kaur, G. Kaur, Opinion mining and sentiment analysis, in: 2016 3rd International Conference on Computing for Sustainable Global Development (INDIACom), 2016, pp. 452–455.
[30] P. Gupta, V. Gupta, A survey of text question answering techniques, International Journal of Computer Applications 53 (4).
[31] Y. Wu, M. Schuster, Z. Chen, Q.V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey, et al., Google's neural machine translation system: Bridging the gap between human and machine translation, arXiv preprint arXiv:1609.08144.
[32] Y. Duan, C. Xu, J. Pei, J. Han, C. Li, Pre-train and plug-in: Flexible conditional text generation with variational auto-encoders, arXiv preprint arXiv:1911.03882.
[33] Y. Tay, D. Bahri, C. Zheng, C. Brunk, D. Metzler, A. Tomkins, Reverse engineering configurations of neural text generation models, arXiv preprint arXiv:2004.06201.
[34] N. Papernot, P. McDaniel, A. Swami, R. Harang, Crafting adversarial input sequences for recurrent neural networks, in: MILCOM 2016 – 2016 IEEE Military Communications Conference, IEEE, 2016, pp. 49–54.
[35] J. Ebrahimi, D. Lowd, D. Dou, On adversarial examples for character-level neural machine translation, in: Proceedings of the 27th International Conference on Computational Linguistics, Association for Computational Linguistics, Santa Fe, New Mexico, USA, 2018, pp. 653–663.
[36] C. Wong, Dancin seq2seq: Fooling text classifiers with adversarial text example generation, arXiv preprint arXiv:1712.05419.
S. Qiu, Q. Liu, S. Zhou et al. Neurocomputing 492 (2022) 278–307
[37] Y. Zang, B. Hou, F. Qi, Z. Liu, X. Meng, M. Sun, Learning to attack: Towards textual adversarial attacking in real-world situations, arXiv preprint arXiv:2009.09192.
[38] Y. Belinkov, Y. Bisk, Synthetic and natural noise both break neural machine translation, arXiv preprint arXiv:1711.02173.
[39] S. Eger, Y. Benz, From hero to zéroe: A benchmark of low-level adversarial attacks, arXiv preprint arXiv:2010.05648.
[40] M. Alzantot, Y. Sharma, A. Elgohary, B.-J. Ho, M. Srivastava, K.-W. Chang, Generating natural language adversarial examples, arXiv preprint arXiv:1804.07998.
[41] X. Wang, H. Jin, K. He, Natural language adversarial attacks and defenses in word level, arXiv preprint arXiv:1909.06723.
[42] Z. Shao, Z. Liu, J. Zhang, Z. Wu, M. Huang, Advexpander: Generating natural language adversarial examples by expanding text, arXiv preprint arXiv:2012.10235.
[43] L. Xu, I. Ramirez, K. Veeramachaneni, Rewriting meaningful sentences via conditional bert sampling and an application on fooling text classifiers, arXiv preprint arXiv:2010.11869.
[44] X. Zheng, J. Zeng, Y. Zhou, C.-J. Hsieh, M. Cheng, X.-J. Huang, Evaluating and enhancing the robustness of neural network-based dependency parsing models with adversarial examples, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 6600–6610.
[45] J. Gao, J. Lanchantin, M.L. Soffa, Y. Qi, Black-box generation of adversarial text sequences to evade deep learning classifiers, in: 2018 IEEE Security and Privacy Workshops (SPW), IEEE, 2018, pp. 50–56.
[46] Y. Wang, M. Bansal, Robust machine comprehension models via adversarial training, arXiv preprint arXiv:1804.06473.
[47] Y. Zang, F. Qi, C. Yang, Z. Liu, M. Zhang, Q. Liu, M. Sun, Word-level textual adversarial attacking as combinatorial optimization, in: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, 2020, pp. 6066–6080.
[48] V. Malykh, Robust to noise models in natural language processing tasks, in: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics: Student Research Workshop, 2019, pp. 10–16.
[49] E. Jones, R. Jia, A. Raghunathan, P. Liang, Robust encodings: A framework for combating adversarial typos, arXiv preprint arXiv:2005.01229.
[50] J. Gilmer, R.P. Adams, I. Goodfellow, D. Andersen, G.E. Dahl, Motivating the rules of the game for adversarial example research, arXiv preprint arXiv:1807.06732.
[51] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, D. Mukhopadhyay, Adversarial attacks and defences: A survey, arXiv preprint arXiv:1810.00069.
[52] X. Yuan, P. He, Q. Zhu, X. Li, Adversarial examples: Attacks and defenses for deep learning, IEEE Transactions on Neural Networks and Learning Systems 30 (9) (2019) 2805–2824.
[53] J. Zhang, C. Li, Adversarial examples: Opportunities and challenges, IEEE Transactions on Neural Networks and Learning Systems 31 (7) (2019) 2578–2593.
[54] S. Qiu, Q. Liu, S. Zhou, C. Wu, Review of artificial intelligence adversarial attack and defense technologies, Applied Sciences 9 (5) (2019) 909.
[55] W. Wang, L. Wang, R. Wang, Z. Wang, A. Ye, Towards a robust deep neural network in texts: A survey, arXiv preprint arXiv:1902.07285.
[56] W.E. Zhang, Q.Z. Sheng, A. Alhazmi, C. Li, Adversarial attacks on deep-learning models in natural language processing: A survey, ACM Transactions on Intelligent Systems and Technology (TIST) 11 (3) (2020) 1–41.
[57] A. Huq, M.T. Pervin, Adversarial attacks and defense on texts: A survey, arXiv e-prints, 2020, arXiv–2005.
[58] R. Jia, P. Liang, Adversarial examples for evaluating reading comprehension systems, arXiv preprint arXiv:1707.07328.
[59] N.J. Nizar, A. Kobren, Leveraging extracted model adversaries for improved black box attacks, arXiv preprint arXiv:2010.16336.
[60] Y. Gil, Y. Chai, O. Gorodissky, J. Berant, White-to-black: Efficient distillation of black-box adversarial attacks, arXiv preprint arXiv:1904.02405.
[61] Q. Le, T. Mikolov, Distributed representations of sentences and documents, in: International conference on machine learning, PMLR, 2014, pp. 1188–1196.
[62] A. Radford, J. Wu, R. Child, D. Luan, D. Amodei, I. Sutskever, Language models are unsupervised multitask learners, OpenAI blog 1 (8) (2019) 9.
[63] M. Iyyer, J. Wieting, K. Gimpel, L. Zettlemoyer, Adversarial example generation with syntactically controlled paraphrase networks, arXiv preprint arXiv:1804.06059.
[64] M. Cheng, J. Yi, P.-Y. Chen, H. Zhang, C.-J. Hsieh, Seq2sick: Evaluating the robustness of sequence-to-sequence models with adversarial examples, AAAI (2020) 3601–3608.
[65] B. Liang, H. Li, M. Su, P. Bian, X. Li, W. Shi, Deep text classification can be fooled, arXiv preprint arXiv:1704.08006.
[66] T. Mikolov, I. Sutskever, K. Chen, G.S. Corrado, J. Dean, Distributed representations of words and phrases and their compositionality, Advances in neural information processing systems 26 (2013) 3111–3119.
[67] K. Taga, K. Kameyama, K. Toraichi, Regularization of hidden layer unit response for neural networks, in: 2003 IEEE Pacific Rim Conference on Communications Computers and Signal Processing (PACRIM 2003) (Cat. No. 03CH37490), Vol. 1, IEEE, 2003, pp. 348–351.
[68] T. Tanay, L. Griffin, A boundary tilting perspective on the phenomenon of adversarial examples, arXiv preprint arXiv:1608.07690.
[69] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, A. Madry, Adversarial examples are not bugs, they are features, Advances in Neural Information Processing Systems (2019) 125–136.
[70] P. Michel, X. Li, G. Neubig, J.M. Pino, On evaluation of adversarial perturbations for sequence-to-sequence models, arXiv preprint arXiv:1903.06620.
[71] R. Maheshwary, S. Maheshwary, V. Pudi, Generating natural language attacks in a hard label black box setting, arXiv preprint arXiv:2012.14956.
[72] A. Mathai, S. Khare, S. Tamilselvam, S. Mani, Adversarial black-box attacks on text classifiers using multi-objective genetic optimization guided by deep networks, arXiv preprint arXiv:2011.03901.
[73] L. Yuan, X. Zheng, Y. Zhou, C.-J. Hsieh, K.-W. Chang, X. Huang, Generating universal language adversarial examples by understanding and enhancing the transferability across neural models, arXiv preprint arXiv:2011.08558.
[74] E.J. Anderson, M.C. Ferris, Genetic algorithms for combinatorial optimization: the assembly line balancing problem, ORSA Journal on Computing 6 (2) (1994) 161–173.
[75] J. Kennedy, R. Eberhart, Particle swarm optimization, in: Proceedings of ICNN'95 – International Conference on Neural Networks, Vol. 4, IEEE, 1995, pp. 1942–1948.
[76] S. Tan, S. Joty, M.-Y. Kan, R. Socher, It's morphin' time! combating linguistic discrimination with inflectional perturbations, arXiv preprint arXiv:2005.04364.
[77] N. Xu, O. Feyisetan, A. Aggarwal, Z. Xu, N. Teissier, Differentially private adversarial robustness through randomized perturbations, arXiv preprint arXiv:2009.12718.
[78] S. Samanta, S. Mehta, Towards crafting text adversarial samples, arXiv preprint arXiv:1707.02812.
[79] D. Jin, Z. Jin, J.T. Zhou, P. Szolovits, Is bert really robust? a strong baseline for natural language attack on text classification and entailment, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 34, 2020, pp. 8018–8025.
[80] R. Maheshwary, S. Maheshwary, V. Pudi, A context aware approach for generating natural language attacks, arXiv preprint arXiv:2012.13339.
[81] S. Ren, Y. Deng, K. He, W. Che, Generating natural language adversarial examples through probability weighted word saliency, in: Proceedings of the 57th annual meeting of the association for computational linguistics, 2019, pp. 1085–1097.
[82] M. Hossam, T. Le, H. Zhao, D. Phung, Explain2attack: Text adversarial attacks via cross-domain interpretability.
[83] P. Yang, J. Chen, C.-J. Hsieh, J.-L. Wang, M.I. Jordan, Greedy attack and gumbel attack: Generating adversarial examples for discrete data, Journal of Machine Learning Research 21 (43) (2020) 1–36.
[84] H. Zhang, H. Zhou, N. Miao, L. Li, Generating fluent adversarial examples for natural languages, arXiv preprint arXiv:2007.06174.
[85] D. Li, Y. Zhang, H. Peng, L. Chen, C. Brockett, M.-T. Sun, B. Dolan, Contextualized perturbation for textual adversarial attack, arXiv preprint arXiv:2009.07502.
[86] D. Emelin, I. Titov, R. Sennrich, Detecting word sense disambiguation biases in machine translation for model-agnostic adversarial attacks, arXiv preprint arXiv:2011.01846.
[87] M. Behjati, S.-M. Moosavi-Dezfooli, M.S. Baghshah, P. Frossard, Universal adversarial attacks on text classifiers, in: ICASSP 2019 – 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), IEEE, 2019, pp. 7345–7349.
[88] L. Song, X. Yu, H.-T. Peng, K. Narasimhan, Universal adversarial attacks with natural triggers for text classification, arXiv preprint arXiv:2005.00174.
[89] E. Wallace, S. Feng, N. Kandpal, M. Gardner, S. Singh, Universal adversarial triggers for attacking and analyzing nlp, arXiv preprint arXiv:1908.07125.
[90] P. Atanasova, D. Wright, I. Augenstein, Generating label cohesive and well-formed adversarial claims, arXiv preprint arXiv:2009.08205.
[91] M.T. Ribeiro, S. Singh, C. Guestrin, Semantically equivalent adversarial rules for debugging nlp models, in: Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2018, pp. 856–865.
[92] D. Bahdanau, K. Cho, Y. Bengio, Neural machine translation by jointly learning to align and translate, arXiv preprint arXiv:1409.0473.
[93] A. See, P.J. Liu, C.D. Manning, Get to the point: Summarization with pointer-generator networks, arXiv preprint arXiv:1704.04368.
[94] Z. Zhao, D. Dua, S. Singh, Generating natural adversarial examples, arXiv preprint arXiv:1710.11342.
[95] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, Y. Bengio, Generative adversarial nets, Advances in neural information processing systems 27.
[96] R.S. Sutton, A.G. Barto, Reinforcement learning: An introduction, 2011.
[97] T. Wang, X. Wang, Y. Qin, B. Packer, K. Li, J. Chen, A. Beutel, E. Chi, Cat-gen: Improving robustness in nlp models via controlled adversarial text generation, arXiv preprint arXiv:2010.02338.
[98] T. Niu, M. Bansal, Adversarial over-sensitivity and over-stability strategies for dialogue models, arXiv preprint arXiv:1809.02079.
[99] M. Blohm, G. Jagfeld, E. Sood, X. Yu, N.T. Vu, Comparing attention-based convolutional and recurrent neural networks: Success and limitations in machine reading comprehension, in: Proceedings of the 22nd Conference on Computational Natural Language Learning, Association for Computational Linguistics, Brussels, Belgium, 2018, pp. 108–118, 10.18653/v1/K18-1011.
[100] P. Vijayaraghavan, D. Roy, Generating black-box adversarial examples for text classifiers using a deep reinforced model, in: Joint European Conference on Machine Learning and Knowledge Discovery in Databases, Springer, 2019, pp. 711–726.
[101] D. Pruthi, B. Dhingra, Z.C. Lipton, Combating adversarial misspellings with robust word recognition, arXiv preprint arXiv:1905.11268.
[102] M. Mozes, P. Stenetorp, B. Kleinberg, L.D. Griffin, Frequency-guided word substitutions for detecting textual adversarial examples, arXiv preprint arXiv:2004.05887.
[103] Y. Zhou, J.-Y. Jiang, K.-W. Chang, W. Wang, Learning to discriminate perturbations for blocking adversarial attacks in text classification, arXiv preprint arXiv:1909.03084.
[104] D. Kang, T. Khot, A. Sabharwal, E. Hovy, Adventure: Adversarial training for textual entailment with knowledge-guided examples, arXiv preprint arXiv:1805.04680.
[105] J. Xu, L. Zhao, H. Yan, Q. Zeng, Y. Liang, S. Xu, Lexicalat: Lexical-based adversarial reinforcement training for robust sentiment classification, in: Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 5521–5530.
[106] L. Li, X. Qiu, Textat: Adversarial training for natural language understanding with token-level perturbation, arXiv preprint arXiv:2004.14543.
[107] H. Liu, Y. Zhang, Y. Wang, Z. Lin, Y. Chen, Joint character-level word embedding and adversarial stability training to defend adversarial text, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 8384–8391.
[108] K. Liu, X. Liu, A. Yang, J. Liu, J. Su, S. Li, Q. She, A robust adversarial training approach to machine reading comprehension, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 34, 2020, pp. 8392–8400.
[109] E. Dinan, S. Humeau, B. Chintagunta, J. Weston, Build it break it fix it for dialogue safety: Robustness from adversarial human attack, arXiv preprint arXiv:1908.06083.
[110] Q. Li, S. Shah, X. Liu, A. Nourbakhsh, Data sets: Word embeddings learned from tweets and general data, in: Proceedings of the International AAAI Conference on Web and Social Media, Vol. 11, 2017.
[111] Z. Wang, H. Wang, Defense of word-level adversarial attacks via random substitution encoding, in: International Conference on Knowledge Science, Engineering and Management, Springer, 2020, pp. 312–324.
[112] Y. Zhou, X. Zheng, C.-J. Hsieh, K.-W. Chang, X. Huang, Defense against adversarial attacks in nlp via dirichlet neighborhood ensemble, arXiv preprint arXiv:2006.11627.
[113] B. Wang, S. Wang, Y. Cheng, Z. Gan, R. Jia, B. Li, J. Liu, Infobert: Improving robustness of language models from an information theoretic perspective, arXiv preprint arXiv:2010.02329.
[114] J. Wu, X. Li, X. Ao, Y. Meng, F. Wu, J. Li, Improving robustness and generality of nlp models using disentangled representations, arXiv preprint arXiv:2009.09587.
[115] A.H. Li, A. Sethy, Knowledge enhanced attention for robust natural language inference, arXiv preprint arXiv:1909.00102.
[116] N.S. Moosavi, M. de Boer, P.A. Utama, I. Gurevych, Improving robustness by augmenting training sentences with predicate-argument structures, arXiv preprint arXiv:2010.12510.
[117] M. Kusner, Y. Sun, N. Kolkin, K. Weinberger, From word embeddings to document distances, in: International conference on machine learning, 2015, pp. 957–966.
[118] P. Minervini, S. Riedel, Adversarially regularising neural nli models to integrate logical background knowledge, arXiv preprint arXiv:1808.08609.
[119] Y. Cheng, L. Jiang, W. Macherey, Robust neural machine translation with doubly adversarial inputs, arXiv preprint arXiv:1906.02443.
[120] V. Kuleshov, S. Thakoor, T. Lau, S. Ermon, Adversarial examples for natural language classification problems.
[121] M. Sato, J. Suzuki, H. Shindo, Y. Matsumoto, Interpretable adversarial perturbation in input embedding space for text, arXiv preprint arXiv:1805.02917.
[122] Z. Gong, W. Wang, B. Li, D. Song, W.-S. Ku, Adversarial texts with gradient methods, arXiv preprint arXiv:1801.07175.
[123] C. Song, V. Shmatikov, Fooling ocr systems with adversarial text images, arXiv preprint arXiv:1802.05385.
[124] S.-M. Moosavi-Dezfooli, A. Fawzi, P. Frossard, Deepfool: a simple and accurate method to fool deep neural networks, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 2574–2582.
[125] G.A. Miller, Wordnet: a lexical database for english, Commun. ACM 38 (11) (1995) 39–41.
[126] J. Zhao, Y. Kim, K. Zhang, A. Rush, Y. LeCun, Adversarially regularized autoencoders, in: International conference on machine learning, PMLR, 2018, pp. 5902–5911.
[127] K. Krishna, G.S. Tomar, A.P. Parikh, N. Papernot, M. Iyyer, Thieves on sesame street! model extraction of bert-based apis.
[128] K. Sohn, H. Lee, X. Yan, Learning structured output representation using deep conditional generative models, Advances in neural information processing systems 28 (2015) 3483–3491.
[129] H. Chen, S. Huang, D. Chiang, X. Dai, J. Chen, Combining character and word information in neural machine translation using a multi-level attention, in: Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), 2018, pp. 1284–1293.
[130] S.J. Rennie, E. Marcheret, Y. Mroueh, J. Ross, V. Goel, Self-critical sequence training for image captioning, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 7008–7024.
[131] J. Devlin, M.-W. Chang, K. Lee, K. Toutanova, Bert: Pre-training of deep bidirectional transformers for language understanding, arXiv preprint arXiv:1810.04805.
[132] L. Li, R. Ma, Q. Guo, X. Xue, X. Qiu, Bert-attack: Adversarial attack against bert using bert, arXiv preprint arXiv:2004.09984.
[133] P. Neekhara, S. Hussain, S. Dubnov, F. Koushanfar, Adversarial reprogramming of sequence classification neural networks, CoRR abs/1809.01829.
[134] A. de Wynter, Mischief: A simple black-box attack against transformer architectures, arXiv preprint arXiv:2010.08542.
[135] T. Le, S. Wang, D. Lee, Malcom: Generating malicious comments to attack neural fake news detection models, arXiv preprint arXiv:2009.01048.
[136] Y. Wu, D. Bamman, S. Russell, Adversarial training for relation extraction, in: Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, 2017, pp. 1778–1783.
[137] G. Bekoulis, J. Deleu, T. Demeester, C. Develder, Adversarial training for multi-context joint entity and relation extraction, arXiv preprint arXiv:1808.06876.
[138] M. Cettolo, N. Jan, S. Sebastian, L. Bentivogli, R. Cattoni, M. Federico, The iwslt 2016 evaluation campaign, in: International Workshop on Spoken Language Translation, 2016.
[139] M. Yasunaga, J. Kasai, D. Radev, Robust multilingual part-of-speech tagging via adversarial training, arXiv preprint arXiv:1711.04903.
[140] W. Han, L. Zhang, Y. Jiang, K. Tu, Adversarial attack and defense of structured prediction models, arXiv preprint arXiv:2010.01610.
[141] H. Chen, H. Zhang, P.-Y. Chen, J. Yi, C.-J. Hsieh, Attacking visual language grounding with adversarial examples: A case study on neural image captioning, arXiv preprint arXiv:1712.02051.
[142] X. Xu, X. Chen, C. Liu, A. Rohrbach, T. Darrell, D. Song, Fooling vision and language models despite localization and attention mechanism, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 4951–4961.
[143] L. Chen, W. Xu, Attacking optical character recognition (ocr) systems with adversarial watermarks, arXiv preprint arXiv:2002.03095.
[144] X. Yuan, P. He, X. Li, D. Wu, Adaptive adversarial attack on scene text recognition, in: IEEE INFOCOM 2020 – IEEE Conference on Computer Communications Workshops (INFOCOM WKSHPS), IEEE, 2020, pp. 358–363.
[145] R. Tang, C. Ma, W.E. Zhang, Q. Wu, X. Yang, Semantic equivalent adversarial data augmentation for visual question answering, in: European Conference on Computer Vision, Springer, 2020, pp. 437–453.
[146] H. Shi, J. Mao, T. Xiao, Y. Jiang, J. Sun, Learning visually-grounded semantics from contrastive adversarial samples, arXiv preprint arXiv:1806.10348.
[147] Z. Gan, Y.-C. Chen, L. Li, C. Zhu, Y. Cheng, J. Liu, Large-scale adversarial training for vision-and-language representation learning, arXiv preprint arXiv:2006.06195.
[148] M. Cheng, W. Wei, C.-J. Hsieh, Evaluating and enhancing the robustness of dialogue systems: A case study on a negotiation agent, in: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), 2019, pp. 3325–3335.
[149] Y. Kim, Y. Jernite, D. Sontag, A. Rush, Character-aware neural language models, in: Proceedings of the AAAI conference on artificial intelligence, Vol. 30, 2016.
[150] M. Schuster, K.K. Paliwal, Bidirectional recurrent neural networks, IEEE Transactions on Signal Processing 45 (11) (1997) 2673–2681.
[151] K.S. Tai, R. Socher, C.D. Manning, Improved semantic representations from tree-structured long short-term memory networks, arXiv preprint arXiv:1503.00075.
[152] X. Zhang, J. Zhao, Y. LeCun, Character-level convolutional networks for text classification, Advances in neural information processing systems 28 (2015) 649–657.
[153] Y. Kim, Convolutional neural networks for sentence classification, arXiv preprint arXiv:1408.5882.
[154] S. Hochreiter, J. Schmidhuber, Long short-term memory, Neural computation 9 (8) (1997) 1735–1780.
[155] T. Miyato, A.M. Dai, I. Goodfellow, Adversarial training methods for semi-supervised text classification, arXiv preprint arXiv:1605.07725.
[156] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, A. Bordes, Supervised learning of universal sentence representations from natural language inference data, arXiv preprint arXiv:1705.02364.
[157] Y. Liu, M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, V. Stoyanov, Roberta: A robustly optimized bert pretraining approach, arXiv preprint arXiv:1907.11692.
[158] K. Shu, L. Cui, S. Wang, D. Lee, H. Liu, defend: Explainable fake news detection, in: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, 2019, pp. 395–405.
[159] D. Zeng, K. Liu, S. Lai, G. Zhou, J. Zhao, Relation classification via convolutional deep neural network, in: Proceedings of COLING 2014, the 25th International Conference on Computational Linguistics: Technical Papers, 2014, pp. 2335–2344.
[160] K. Cho, B. Van Merriënboer, D. Bahdanau, Y. Bengio, On the properties of neural machine translation: Encoder-decoder approaches, arXiv preprint arXiv:1409.1259.
[161] G. Bekoulis, J. Deleu, T. Demeester, C. Develder, Joint entity recognition and relation extraction as a multi-head selection problem, Expert Syst. Appl. 114 (2018) 34–45.
[162] M. Kaneko, Y. Sakaizawa, M. Komachi, Grammatical error detection using error- and grammaticality-specific word embeddings, in: Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), 2017, pp. 40–48.
[163] M.E. Peters, M. Neumann, M. Iyyer, M. Gardner, C. Clark, K. Lee, L. Zettlemoyer, Deep contextualized word representations, arXiv preprint arXiv:1802.05365.
[164] J. Lee, K. Cho, T. Hofmann, Fully character-level neural machine translation without explicit segmentation, Transactions of the Association for Computational Linguistics 5 (2017) 365–378.
[165] R. Sennrich, O. Firat, K. Cho, A. Birch, B. Haddow, J. Hitschler, M. Junczys-Dowmunt, S. Läubli, A.V.M. Barone, J. Mokry, et al., Nematus: a toolkit for neural machine translation, arXiv preprint arXiv:1703.04357.
[166] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A.N. Gomez, Ł. Kaiser, I. Polosukhin, Attention is all you need, in: Advances in neural information processing systems, 2017, pp. 5998–6008.
[167] M.-T. Luong, H. Pham, C.D. Manning, Effective approaches to attention-based neural machine translation, arXiv preprint arXiv:1508.04025.
[168] J. Gehring, M. Auli, D. Grangier, Y.N. Dauphin, A convolutional encoder model for neural machine translation, arXiv preprint arXiv:1611.02344.
[169] M. Seo, A. Kembhavi, A. Farhadi, H. Hajishirzi, Bidirectional attention flow for machine comprehension, arXiv preprint arXiv:1611.01603.
[170] S. Wang, J. Jiang, Machine comprehension using match-lstm and answer pointer, arXiv preprint arXiv:1608.07905.
[171] A.W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, Q.V. Le, Qanet: Combining local convolution with global self-attention for reading comprehension, arXiv preprint arXiv:1804.09541.
[172] G. Lample, A. Conneau, Cross-lingual language model pretraining, arXiv preprint arXiv:1901.07291.
[173] Z. Yang, Z. Dai, Y. Yang, J. Carbonell, R.R. Salakhutdinov, Q.V. Le, Xlnet: Generalized autoregressive pretraining for language understanding, in: Advances in neural information processing systems, 2019, pp. 5753–5763.
[174] Q. Chen, X. Zhu, Z. Ling, S. Wei, H. Jiang, D. Inkpen, Enhanced lstm for natural language inference, arXiv preprint arXiv:1609.06038.
[175] M. Marcus, B. Santorini, M.A. Marcinkiewicz, Building a large annotated corpus of english: The penn treebank.
[176] J. Nivre, Ž. Agić, M.J. Aranzabe, M. Asahara, A. Atutxa, M. Ballesteros, J. Bauer, K. Bengoetxea, R.A. Bhat, C. Bosco, et al., Universal dependencies 1.2.
[177] R. Lowe, N. Pow, I. Serban, J. Pineau, The ubuntu dialogue corpus: A large dataset for research in unstructured multi-turn dialogue systems, arXiv preprint arXiv:1506.08909.
[178] H. He, A. Balakrishnan, M. Eric, P. Liang, Learning symmetric collaborative dialogue agents with dynamic knowledge graph embeddings, arXiv preprint arXiv:1704.07130.
[179] I. Serban, A. Sordoni, R. Lowe, L. Charlin, J. Pineau, A. Courville, Y. Bengio, A hierarchical latent variable encoder-decoder model for generating dialogues, in: Proceedings of the AAAI Conference on Artificial Intelligence, Vol. 31, 2017.
[180] J. Li, W. Monroe, A. Ritter, M. Galley, J. Gao, D. Jurafsky, Deep reinforcement learning for dialogue generation, arXiv preprint arXiv:1606.01541.
[181] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, C.L. Zitnick, Microsoft coco: Common objects in context, in: European conference on computer vision, Springer, 2014, pp. 740–755.
[182] J. Johnson, A. Karpathy, L. Fei-Fei, Densecap: Fully convolutional localization networks for dense captioning, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 4565–4574.
[183] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D.A. Shamma, et al., Visual genome: Connecting language and vision using crowdsourced dense image annotations, International journal of computer vision 123 (1) (2017) 32–73.
[184] A. Kendall, Y. Gal, R. Cipolla, Multi-task learning using uncertainty to weigh losses for scene geometry and semantics, in: Proceedings of the IEEE conference on computer vision and pattern recognition, 2018, pp. 7482–7491.
[185] D. Karatzas, F. Shafait, S. Uchida, M. Iwamura, L.G. i Bigorda, S.R. Mestre, J. Mas, D.F. Mota, J.A. Almazan, L.P. De Las Heras, Icdar 2013 robust reading competition, in: 2013 12th International Conference on Document Analysis and Recognition, IEEE, 2013, pp. 1484–1493.
[186] A. Mishra, K. Alahari, C. Jawahar, Scene text recognition using higher order language priors, 2012.
[187] B. Shi, X. Bai, C. Yao, An end-to-end trainable neural network for image-based sequence recognition and its application to scene text recognition, IEEE transactions on pattern analysis and machine intelligence 39 (11) (2016) 2298–2304.
[188] A. Fukui, D.H. Park, D. Yang, A. Rohrbach, T. Darrell, M. Rohrbach, Multimodal compact bilinear pooling for visual question answering and visual grounding, arXiv preprint arXiv:1606.01847.
[189] R. Hu, J. Andreas, M. Rohrbach, T. Darrell, K. Saenko, Learning to reason: End-to-end module networks for visual question answering, in: Proceedings of the IEEE International Conference on Computer Vision, 2017, pp. 804–813.
[190] S. Antol, A. Agrawal, J. Lu, M. Mitchell, D. Batra, C. Lawrence Zitnick, D. Parikh, Vqa: Visual question answering, in: Proceedings of the IEEE international conference on computer vision, 2015, pp. 2425–2433.
[191] A. Kurakin, I. Goodfellow, S. Bengio, et al., Adversarial examples in the physical world (2016).
[192] F. Faghri, D.J. Fleet, J.R. Kiros, S. Fidler, Vse++: Improving visual-semantic embeddings with hard negatives, arXiv preprint arXiv:1707.05612.
[193] Y. Goyal, T. Khot, D. Summers-Stay, D. Batra, D. Parikh, Making the v in vqa matter: Elevating the role of image understanding in visual question answering, in: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 6904–6913.
[194] R. Zellers, Y. Bisk, A. Farhadi, Y. Choi, From recognition to cognition: Visual commonsense reasoning, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 6720–6731.
[195] A. Suhr, S. Zhou, A. Zhang, I. Zhang, H. Bai, Y. Artzi, A corpus for reasoning about natural language grounded in photographs, arXiv preprint arXiv:1811.00491.
[196] N. Xie, F. Lai, D. Doran, A. Kadav, Visual entailment: A novel task for fine-grained image understanding, arXiv preprint arXiv:1901.06706.
[197] L. Yu, P. Poirson, S. Yang, A.C. Berg, T.L. Berg, Modeling context in referring expressions, in: European Conference on Computer Vision, Springer, 2016, pp. 69–85.
[198] K.-H. Lee, X. Chen, G. Hua, H. Hu, X. He, Stacked cross attention for image-text matching, in: Proceedings of the European Conference on Computer Vision (ECCV), 2018, pp. 201–216.
[199] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates, et al., Deep speech: Scaling up end-to-end speech recognition, arXiv preprint arXiv:1412.5567.
[200] T. Dozat, C.D. Manning, Deep biaffine attention for neural dependency parsing, arXiv preprint arXiv:1611.01734.
[201] W. Wang, R. Wang, L. Wang, B. Tang, Adversarial examples generation approach for tendency classification on chinese texts, Ruan Jian Xue Bao/J. Softw. 30 (8) (2019) 2415–2427.
[202] E. La Malfa, M. Wu, L. Laurenti, B. Wang, A. Hartshorn, M. Kwiatkowska, Assessing robustness of text classification through maximal safe radius computation, arXiv preprint arXiv:2010.02004.
[203] T. Miyato, S.-I. Maeda, M. Koyama, S. Ishii, Virtual adversarial training: a regularization method for supervised and semi-supervised learning, IEEE transactions on pattern analysis and machine intelligence 41 (8) (2018) 1979–1993.
[204] A. Dubey, L. van der Maaten, Z. Yalniz, Y. Li, D. Mahajan, Defense against adversarial images using web-scale nearest-neighbor search, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019, pp. 8767–8776.
[205] P. Shi, J. Lin, Simple bert models for relation extraction and semantic role labeling, arXiv preprint arXiv:1904.05255.
[206] C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y. Zhou, W. Li, P.J. Liu, Exploring the limits of transfer learning with a unified text-to-text transformer, arXiv preprint arXiv:1910.10683.
[207] N. Papernot, F. Faghri, N. Carlini, I. Goodfellow, R. Feinman, A. Kurakin, C. Xie, Y. Sharma, T. Brown, A. Roy, et al., Technical report on the cleverhans v2.1.0 adversarial examples library, arXiv preprint arXiv:1610.00768.
[208] J. Rauber, W. Brendel, M. Bethge, Foolbox: A python toolbox to benchmark the robustness of machine learning models, arXiv preprint arXiv:1707.04131.
[209] G.W. Ding, L. Wang, X. Jin, Advertorch v0.1: An adversarial robustness toolbox based on pytorch, arXiv preprint arXiv:1902.07623.
[210] J.X. Morris, E. Lifland, J.Y. Yoo, Y. Qi, Textattack: A framework for adversarial attacks in natural language processing.
[211] G. Zeng, F. Qi, Q. Zhou, T. Zhang, Z. Ma, B. Hou, Y. Zang, Z. Liu, M. Sun, Openattack: An open-source textual adversarial attack toolkit, arXiv preprint arXiv:2009.09191.
[212] Y. Liang, F. He, X. Zeng, J. Luo, An improved loop subdivision to coordinate the smoothness and the number of faces via multi-objective optimization, Integrated Computer-Aided Engineering (Preprint) (2021) 1–19.
[213] A. Lahav, A. Tal, Meshwalker: Deep mesh understanding by random walks, ACM Transactions on Graphics (TOG) 39 (6) (2020) 1–13.
[214] M.I. Hossen, X. Hei, aaecaptcha: The design and implementation of audio adversarial captcha, arXiv preprint arXiv:2203.02735.
[215] M. Kumar, M. Jindal, M. Kumar, Design of innovative captcha for hindi language, Neural Comput. Appl. (2022) 1–36.

Shilin Qiu received the B.E. degree in software engineering from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2017, where she is currently pursuing the Doctoral degree in software engineering. Her current research interests include deep learning and artificial intelligence adversarial technology.
Qihe Liu received the Ph.D. degree in computer application technology from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2005. He is currently an Associate Professor with the School of Information and Software Engineering, UESTC. His current research interests include communication and security in networks, machine learning, and artificial intelligence adversarial technology.

Wen Huang received the B.E. degree in computer science and technology from the University of Electronic Science and Technology of China (UESTC), Chengdu, China, in 2016, where he is currently pursuing the Doctoral degree in software engineering. His current research interests include cryptography and differential privacy.