Multi-Task Deep Neural Networks for Natural Language Understanding

Xiaodong Liu∗1, Pengcheng He∗2, Weizhu Chen2, Jianfeng Gao1

1 Microsoft Research    2 Microsoft Dynamics 365 AI
{xiaodl,penhe,wzchen,jfgao}@microsoft.com

arXiv:1901.11504v1 [cs.CL] 31 Jan 2019
∗ Equal contribution.

Abstract

In this paper, we present a Multi-Task Deep Neural Network (MT-DNN) for learning representations across multiple natural language understanding (NLU) tasks. MT-DNN not only leverages large amounts of cross-task data, but also benefits from a regularization effect that leads to more general representations, helping the model adapt to new tasks and domains. MT-DNN extends the model proposed in Liu et al. (2015) by incorporating a pre-trained bidirectional transformer language model, known as BERT (Devlin et al., 2018). MT-DNN obtains new state-of-the-art results on ten NLU tasks, including SNLI, SciTail, and eight out of nine GLUE tasks, pushing the GLUE benchmark to 82.2% (1.8% absolute improvement). We also demonstrate, using the SNLI and SciTail datasets, that the representations learned by MT-DNN allow domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations. Our code and pre-trained models will be made publicly available.

1 Introduction

Learning vector-space representations of text, e.g., words and sentences, is fundamental to many natural language understanding (NLU) tasks. Two popular approaches are multi-task learning and language model pre-training. In this paper we strive to combine the strengths of both approaches by proposing a new Multi-Task Deep Neural Network (MT-DNN).

Multi-Task Learning (MTL) is inspired by human learning activities, where people often apply the knowledge learned from previous tasks to help learn a new task (Caruana, 1997; Zhang and Yang, 2017). For example, it is easier for a person who knows how to ski to learn skating than for one who does not. Similarly, it is useful for multiple (related) tasks to be learned jointly so that the knowledge learned in one task can benefit other tasks. Recently, there has been growing interest in applying MTL to representation learning using deep neural networks (DNNs) (Collobert et al., 2011; Liu et al., 2015; Luong et al., 2015; Xu et al., 2018) for two reasons. First, supervised learning of DNNs requires large amounts of task-specific labeled data, which is not always available. MTL provides an effective way of leveraging supervised data from many related tasks. Second, multi-task learning profits from a regularization effect that alleviates overfitting to a specific task, thus making the learned representations universal across tasks.

In contrast to MTL, language model pre-training has been shown to be effective for learning universal language representations by leveraging large amounts of unlabeled data. A recent survey is included in Gao et al. (2018). Some of the most prominent examples are ELMo (Peters et al., 2018), GPT (Radford et al., 2018), and BERT (Devlin et al., 2018). These are neural network language models trained on text data using unsupervised objectives. For example, BERT is based on a multi-layer bidirectional Transformer and is trained on plain text for masked word prediction and next sentence prediction tasks. To apply a pre-trained model to specific NLU tasks, we often need to fine-tune the model, for each task, with additional task-specific layers using task-specific training data. For example, Devlin et al. (2018) show that BERT can be fine-tuned this way to create state-of-the-art models for a range of NLU tasks, such as question answering and natural language inference.

We argue that MTL and language model pre-training are complementary technologies, and can be combined to improve the learning of text representations to boost the performance of various NLU tasks.
To this end, we extend the MT-DNN model originally proposed in Liu et al. (2015) by incorporating BERT as its shared text encoding layers. As shown in Figure 1, the lower layers (i.e., text encoding layers) are shared across all tasks, while the top layers are task-specific, combining different types of NLU tasks such as single-sentence classification, pairwise text classification, text similarity, and relevance ranking. Similar to the BERT model, MT-DNN is trained in two stages: pre-training and fine-tuning. Unlike BERT, MT-DNN uses MTL in the fine-tuning stage with multiple task-specific layers in its model architecture.

MT-DNN obtains new state-of-the-art results on eight out of nine NLU tasks¹ used in the General Language Understanding Evaluation (GLUE) benchmark (Wang et al., 2018), pushing the GLUE benchmark score to 82.2%, amounting to 1.8% absolute improvement over BERT. We further extend the superiority of MT-DNN to the SNLI (Bowman et al., 2015a) and SciTail (Khot et al., 2018) tasks. The representations learned by MT-DNN allow domain adaptation with substantially fewer in-domain labels than the pre-trained BERT representations. For example, our adapted models achieve accuracies of 91.1% on SNLI and 94.1% on SciTail, outperforming the previous state-of-the-art performance by 1.0% and 5.8%, respectively. Even with only 0.1% or 1.0% of the original training data, the performance of MT-DNN on both the SNLI and SciTail datasets is fairly good and much better than that of many existing models. All of these results clearly demonstrate MT-DNN's exceptional generalization capability via multi-task learning.

¹ The only GLUE task where MT-DNN does not create a new state-of-the-art result is WNLI. But as noted on the GLUE webpage (https://gluebenchmark.com/faq), there are issues in the dataset, and none of the submitted systems has ever outperformed the majority-voting baseline, whose accuracy is 65.1.

2 Tasks

The MT-DNN model combines four types of NLU tasks: single-sentence classification, pairwise text classification, text similarity scoring, and relevance ranking. For concreteness, we describe them using the NLU tasks defined in the GLUE benchmark as examples.

Single-Sentence Classification: Given a sentence², the model labels it using one of the pre-defined class labels. For example, the CoLA task is to predict whether an English sentence is grammatically plausible. The SST-2 task is to determine whether the sentiment of a sentence extracted from movie reviews is positive or negative.

Text Similarity: This is a regression task. Given a pair of sentences, the model predicts a real-valued score indicating the semantic similarity of the two sentences. STS-B is the only example of the task in GLUE.

Pairwise Text Classification: Given a pair of sentences, the model determines the relationship of the two sentences based on a set of pre-defined labels. For example, both RTE and MNLI are language inference tasks, where the goal is to predict whether a sentence is an entailment, contradiction, or neutral with respect to the other. QQP and MRPC are paraphrase datasets that consist of sentence pairs. The task is to predict whether the sentences in the pair are semantically equivalent.

Relevance Ranking: Given a query and a list of candidate answers, the model ranks all the candidates in the order of relevance to the query. QNLI is a version of the Stanford Question Answering Dataset (Rajpurkar et al., 2016). The task involves assessing whether a sentence contains the correct answer to a given query. Although QNLI is defined as a binary classification task in GLUE, in this study we formulate it as a pairwise ranking task, where the model is expected to rank the candidate that contains the correct answer higher than the candidate that does not. We will show that this formulation leads to a significant improvement in accuracy over binary classification.

² In this study, a sentence can be an arbitrary span of contiguous text or word sequence, rather than a linguistically plausible sentence.
Figure 1: Architecture of the MT-DNN model for representation learning. The lower layers are shared across
all tasks while the top layers are task-specific. The input X (either a sentence or a pair of sentences) is first
represented as a sequence of embedding vectors, one for each word, in l1 . Then the Transformer encoder captures
the contextual information for each word and generates the shared contextual embedding vectors in l2 . Finally, for
each task, additional task-specific layers generate task-specific representations, followed by operations necessary
for classification, similarity scoring, or relevance ranking.

3 The Proposed MT-DNN Model

The architecture of the MT-DNN model is shown in Figure 1. The lower layers are shared across all tasks, while the top layers represent task-specific outputs. The input X, which is a word sequence (either a sentence or a pair of sentences packed together), is first represented as a sequence of embedding vectors, one for each word, in l1. Then the Transformer encoder captures the contextual information for each word via self-attention, and generates a sequence of contextual embeddings in l2. This is the shared semantic representation that is trained by our multi-task objectives. In what follows, we elaborate on the model in detail.

Lexicon Encoder (l1): The input X = {x1, ..., xm} is a sequence of tokens of length m. Following Devlin et al. (2018), the first token x1 is always the [CLS] token. If X is packed as a sentence pair (X1, X2), we separate the two sentences with a special token [SEP]. The lexicon encoder maps X into a sequence of input embedding vectors, one for each token, constructed by summing the corresponding word, segment, and positional embeddings.

Transformer Encoder (l2): We use a multi-layer bidirectional Transformer encoder (Vaswani et al., 2017) to map the input representation vectors (l1) into a sequence of contextual embedding vectors C ∈ R^{d×m}. This is the shared representation across different tasks. Unlike the BERT model (Devlin et al., 2018), which learns the representation via pre-training and adapts it to each individual task via fine-tuning, MT-DNN learns the representation using multi-task objectives.

Single-Sentence Classification Output: Suppose that x is the contextual embedding (l2) of the token [CLS], which can be viewed as the semantic representation of input sentence X. Take the SST-2 task as an example. The probability that X is labeled as class c (i.e., the sentiment) is predicted by a logistic regression with softmax:

Pr(c|X) = softmax(W_SST^T x),   (1)

where W_SST is the task-specific parameter matrix.

Text Similarity Output: Take the STS-B task as an example. Suppose that x is the contextual embedding (l2) of [CLS], which can be viewed as the semantic representation of the input sentence pair (X1, X2). We introduce a task-specific parameter vector w_STS to compute the similarity score as:

Sim(X1, X2) = g(w_STS^T x),   (2)

where g(z) = 1 / (1 + exp(−z)) is the sigmoid function that maps the score to a real value in the range [0, 1].
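To make these two output modules concrete, here is a minimal PyTorch sketch of Equations 1 and 2 applied to the shared [CLS] embedding. This is an illustrative sketch rather than the released MT-DNN code; the hidden size, class count, and the bias terms added by nn.Linear are assumptions of the sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SingleSentenceClassificationHead(nn.Module):
    """Eq. 1: class probabilities softmax(W_SST^T x) from the [CLS] embedding."""
    def __init__(self, hidden_size: int, num_classes: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, num_classes)  # W_SST (a bias is added for convenience)

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        # cls_embedding: (batch, hidden_size), the l_2 embedding of the [CLS] token
        return F.softmax(self.proj(cls_embedding), dim=-1)

class TextSimilarityHead(nn.Module):
    """Eq. 2: scalar similarity g(w_STS^T x) mapped into [0, 1] by a sigmoid."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.proj = nn.Linear(hidden_size, 1)  # w_STS

    def forward(self, cls_embedding: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.proj(cls_embedding)).squeeze(-1)

# Illustrative usage: BERT-base hidden size 768, SST-2 has two classes.
cls = torch.randn(4, 768)                               # a batch of [CLS] embeddings
probs = SingleSentenceClassificationHead(768, 2)(cls)   # shape (4, 2)
scores = TextSimilarityHead(768)(cls)                   # shape (4,)
```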
sentation via pre-training and adapts it to each in-
dividual task via fine-tuning, MT-DNN learns the Pairwise Text Classification Output: Take nat-
representation using multi-task objectives. ural language inference (NLI) as an example. The
NLI task defined here involves a premise P = 3.1 The Training Procedure
(p1 , ..., pm ) of m words and a hypothesis H = The training procedure of MT-DNN consists of
(h1 , ..., hn ) of n words, and aims to find a log- two stages: pretraining and multi-task fine-tuning.
ical relationship R between P and H. The de- The pretraining stage follows that of the BERT
sign of the output module follows the answer model (Devlin et al., 2018). The parameters of
module of the stochastic answer network (SAN) the lexicon encoder and Transformer encoder are
(Liu et al., 2018a), a state-of-the-art neural NLI learned using two unsupervised prediction tasks:
model. SAN’s answer module uses multi-step rea- masked language modeling and next sentence pre-
soning. Rather than directly predicting the entail- diction.3
ment given the input, it maintains a state and iter- In the multi-task fine-tuning stage, we use mini-
atively refines its predictions. batch based stochastic gradient descent (SGD) to
The SAN answer module works as follows. We learn the parameters of our model (i.e., the pa-
first construct the working memory of premise P rameters of all shared layers and task-specific lay-
by concatenating the contextual embeddings of the ers) as shown in Algorithm 1. In each epoch, a
words in P , which are the output of the trans- mini-batch bt is selected(e.g., among all 9 GLUE
former encoder, denoted as Mp ∈ Rd×m , and sim- tasks), and the model is updated according to the
ilarly the working memory of hypothesis H, de- task-specific objective for the task t. This approx-
noted as Mh ∈ Rd×n . Then, we perform K-step imately optimizes the sum of all multi-task objec-
reasoning on the memory to output the relation la- tives.
bel, where K is a hyperparameter. At the begin-
ning, the initial state s0 is the summary of Mh : Algorithm 1: Training a MT-DNN model.
exp(w1> ·Mh
j)
s0 = h Initialize model parameters Θ randomly.
P
j αj Mj , where αj = exp(w> ·Mh )
.
P
i 1 i
At time step k in the range of {1, 2, , K − 1}, Pre-train the shared layers (i.e., the lexicon
the state is defined by sk = GRU(sk−1 , xk ). encoder and the transformer encoder).
Here, xk is computed from the state sk−1 Set the max number of epoch: epochmax .
p k
P previous
p //Prepare the data for T tasks.
and memory M : x = j βj Mj and βj =
k−1 > p for t in 1, 2, ..., T do
softmax(s W2 M ). A one-layer classifier is Pack the dataset t into mini-batch: Dt .
used to determine the relation at each step k: end
Prk = softmax(W3> [sk ; xk ; |sk − xk |; sk · xk ]). for epoch in 1, 2, ..., epochmax do
(3) 1. Merge all the datasets:
At last, we utilize all of the K outputs by aver- D = D1 ∪ D2 ... ∪ DT
aging the scores: 2. Shuffle D
for bt in D do
Pr = avg([Pr0 , Pr1 , ..., PrK−1 ]). (4) //bt is a mini-batch of task t.
Each Pr is a probability distribution over all 3. Compute loss : L(Θ)
the relations R ∈ R. During training, we apply L(Θ) = Eq. 6 for classification
stochastic prediction dropout (Liu et al., 2018b) L(Θ) = Eq. 7 for regression
before the above averaging operation. During de- L(Θ) = Eq. 8 for ranking
coding, we average all outputs to improve robust- 4. Compute gradient: ∇(Θ)
ness. 5. Update model: Θ = Θ − ∇(Θ)
end
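The K-step reasoning above can be sketched compactly in PyTorch. The following is not the released SAN implementation; the handling of the first step (whether Pr^0 is computed before or after the first GRU update) and the way whole steps are dropped during training are simplifying assumptions of this sketch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SANAnswerModule(nn.Module):
    """Sketch of the K-step SAN answer module (Eqs. 3-4)."""
    def __init__(self, hidden_size: int, num_labels: int, k_steps: int = 5,
                 prediction_dropout: float = 0.1):
        super().__init__()
        self.k_steps = k_steps
        self.pred_dropout = prediction_dropout
        self.w1 = nn.Linear(hidden_size, 1, bias=False)            # scores alpha over M^h
        self.w2 = nn.Linear(hidden_size, hidden_size, bias=False)  # bilinear map for beta
        self.gru = nn.GRUCell(hidden_size, hidden_size)            # s^k = GRU(s^{k-1}, x^k)
        self.classifier = nn.Linear(4 * hidden_size, num_labels)   # W_3

    def forward(self, mem_p: torch.Tensor, mem_h: torch.Tensor) -> torch.Tensor:
        # mem_p: (batch, m, d) premise memory; mem_h: (batch, n, d) hypothesis memory
        alpha = F.softmax(self.w1(mem_h).squeeze(-1), dim=-1)             # (batch, n)
        s = torch.bmm(alpha.unsqueeze(1), mem_h).squeeze(1)               # s^0: (batch, d)
        step_probs = []
        for _ in range(self.k_steps):
            beta = F.softmax(torch.bmm(self.w2(s).unsqueeze(1),
                                       mem_p.transpose(1, 2)).squeeze(1), dim=-1)  # (batch, m)
            x = torch.bmm(beta.unsqueeze(1), mem_p).squeeze(1)            # x^k: (batch, d)
            feats = torch.cat([s, x, torch.abs(s - x), s * x], dim=-1)
            step_probs.append(F.softmax(self.classifier(feats), dim=-1))  # Pr^k (Eq. 3)
            s = self.gru(x, s)
        probs = torch.stack(step_probs)                                    # (K, batch, labels)
        if self.training:
            # stochastic prediction dropout: randomly drop whole steps before averaging
            keep = torch.rand(self.k_steps, device=probs.device) > self.pred_dropout
            if keep.any():
                probs = probs[keep]
        return probs.mean(dim=0)                                           # Eq. 4

# Illustrative usage: batch of 2, premise length 7, hypothesis length 5, hidden size 768.
module = SANAnswerModule(hidden_size=768, num_labels=3)
out = module(torch.randn(2, 7, 768), torch.randn(2, 5, 768))               # shape (2, 3)
```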
Relevance Ranking Output: Take QNLI as an example. Suppose that x is the contextual embedding vector of [CLS], which is the semantic representation of a pair of a question and its candidate answer (Q, A). We compute the relevance score as:

Rel(Q, A) = g(w_QNLI^T x).   (5)

For a given Q, we rank all of its candidate answers based on their relevance scores computed using Equation 5.

3.1 The Training Procedure

The training procedure of MT-DNN consists of two stages: pre-training and multi-task fine-tuning. The pre-training stage follows that of the BERT model (Devlin et al., 2018). The parameters of the lexicon encoder and Transformer encoder are learned using two unsupervised prediction tasks: masked language modeling and next sentence prediction.³

³ In this study we use the pre-trained BERT models released by the authors.

In the multi-task fine-tuning stage, we use mini-batch based stochastic gradient descent (SGD) to learn the parameters of our model (i.e., the parameters of all shared layers and task-specific layers), as shown in Algorithm 1. In each epoch, a mini-batch b_t is selected (e.g., from among all 9 GLUE tasks), and the model is updated according to the task-specific objective for task t. This approximately optimizes the sum of all multi-task objectives.

Algorithm 1: Training a MT-DNN model.
  Initialize model parameters Θ randomly.
  Pre-train the shared layers (i.e., the lexicon encoder and the Transformer encoder).
  Set the max number of epochs: epoch_max.
  // Prepare the data for T tasks.
  for t in 1, 2, ..., T do
    Pack the dataset t into mini-batches: D_t.
  end
  for epoch in 1, 2, ..., epoch_max do
    1. Merge all the datasets: D = D_1 ∪ D_2 ∪ ... ∪ D_T
    2. Shuffle D
    for b_t in D do
      // b_t is a mini-batch of task t.
      3. Compute loss: L(Θ)
         L(Θ) = Eq. 6 for classification
         L(Θ) = Eq. 7 for regression
         L(Θ) = Eq. 8 for ranking
      4. Compute gradient: ∇(Θ)
      5. Update model: Θ = Θ − ε∇(Θ)
    end
  end
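Algorithm 1 amounts to mixing and shuffling mini-batches from all tasks and applying the appropriate task objective to each. The sketch below illustrates that schedule; the per-task data loaders, the model(inputs, task=...) interface, and the loss registry are hypothetical placeholders, not the authors' API.

```python
import random
import torch

def train_mt_dnn(model, task_loaders, task_losses, optimizer, num_epochs=5):
    """task_loaders: dict task_name -> iterable of mini-batches (hypothetical).
    task_losses:  dict task_name -> loss function implementing Eq. 6, 7, or 8."""
    for epoch in range(num_epochs):
        # Steps 1-2 of Algorithm 1: merge all packed mini-batches and shuffle them,
        # so consecutive updates usually come from different tasks.
        mixed = [(name, batch) for name, loader in task_loaders.items() for batch in loader]
        random.shuffle(mixed)
        for task_name, batch in mixed:
            optimizer.zero_grad()
            # Step 3: forward through the shared encoder plus the head of this task.
            outputs = model(batch["inputs"], task=task_name)
            loss = task_losses[task_name](outputs, batch["targets"])
            # Steps 4-5: backpropagate and update shared and task-specific parameters
            # (gradient norm clipped to 1, as described in Section 4.2).
            loss.backward()
            torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            optimizer.step()
```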
For the classification tasks (i.e., single-sentence or pairwise text classification), we use the cross-entropy loss as the objective:

− Σ_c 1(X, c) log(Pr(c|X)),   (6)

where 1(X, c) is the binary indicator (0 or 1) of whether class label c is the correct classification for X, and Pr(·) is defined by, e.g., Equation 1 or 4.

For the text similarity tasks, such as STS-B, where each sentence pair is annotated with a real-valued score y, we use the mean squared error as the objective:

(y − Sim(X1, X2))²,   (7)

where Sim(·) is defined by Equation 2.

The objective for the relevance ranking tasks follows the pairwise learning-to-rank paradigm (Burges et al., 2005; Huang et al., 2013). Take QNLI as an example. Given a query Q, we obtain a list of candidate answers A, which contains a positive example A+ that includes the correct answer, and |A| − 1 negative examples. We then minimize the negative log likelihood of the positive example given queries across the training data:

− Σ_{(Q, A+)} Pr(A+|Q),   (8)

Pr(A+|Q) = exp(γ Rel(Q, A+)) / Σ_{A′∈A} exp(γ Rel(Q, A′)),   (9)

where Rel(·) is defined by Equation 5 and γ is a tuning factor determined on held-out data. In our experiment, we simply set γ to 1.
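For illustration, the three objectives (Eqs. 6-8) could be written as follows in PyTorch. The ranking loss assumes each training example packs the relevance scores of one positive candidate and its negatives into a single tensor, which is an assumption of this sketch rather than a detail specified in the paper.

```python
import torch
import torch.nn.functional as F

def classification_loss(class_probs, target):
    # Eq. 6: cross entropy; class_probs are already normalized by softmax (Eq. 1 or 4).
    return F.nll_loss(torch.log(class_probs), target)

def regression_loss(sim_scores, target):
    # Eq. 7: mean squared error between predicted and gold similarity scores.
    return F.mse_loss(sim_scores, target)

def ranking_loss(rel_scores, positive_index, gamma=1.0):
    # Eqs. 8-9: negative log likelihood of the positive candidate under a softmax
    # over all candidates of the same query, with tuning factor gamma (set to 1 here).
    log_probs = F.log_softmax(gamma * rel_scores, dim=-1)
    return -log_probs.gather(-1, positive_index.unsqueeze(-1)).mean()

# Illustrative usage with toy tensors.
probs = torch.tensor([[0.7, 0.3], [0.2, 0.8]])
print(classification_loss(probs, torch.tensor([0, 1])))
print(regression_loss(torch.tensor([0.4, 0.9]), torch.tensor([0.5, 1.0])))
print(ranking_loss(torch.rand(2, 5), torch.tensor([0, 3])))
```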
4 Experiments

We evaluate the proposed MT-DNN on three popular NLU benchmarks: GLUE (Wang et al., 2018), Stanford Natural Language Inference (SNLI) (Bowman et al., 2015b), and SciTail (Khot et al., 2018). We compare MT-DNN with existing state-of-the-art models including BERT, and demonstrate the effectiveness of MTL for model fine-tuning using GLUE and for domain adaptation using SNLI and SciTail.

4.1 Datasets

This section briefly describes the GLUE, SNLI, and SciTail datasets, as summarized in Table 1. The GLUE benchmark is a collection of nine NLU tasks, including question answering, sentiment analysis, and textual entailment; it is considered well-designed for evaluating the generalization and robustness of NLU models. Both SNLI and SciTail are NLI tasks.

CoLA The Corpus of Linguistic Acceptability is to predict whether an English sentence is linguistically acceptable or not (Warstadt et al., 2018). It uses the Matthews correlation coefficient (Matthews, 1975) as the evaluation metric.

SST-2 The Stanford Sentiment Treebank is to determine the sentiment of sentences. The sentences are extracted from movie reviews with human annotations of their sentiment (Socher et al., 2013). Accuracy is used as the evaluation metric.

STS-B The Semantic Textual Similarity Benchmark is a collection of sentence pairs collected from multiple data resources including news headlines, video and image captions, and NLI data (Cer et al., 2017). Each pair is human-annotated with a similarity score from one to five, indicating how similar the two sentences are. The task is evaluated using two metrics: the Pearson and Spearman correlation coefficients.

QNLI This is derived from the Stanford Question Answering Dataset (Rajpurkar et al., 2016), which has been converted to a binary classification task in GLUE. A query-candidate-answer tuple is labeled as positive if the candidate contains the correct answer to the query, and negative otherwise. In this study, however, we formulate QNLI as a relevance ranking task, where for a given query, its positive candidate answers are considered more relevant, and thus should be ranked higher than its negative candidates.

QQP The Quora Question Pairs dataset is a collection of question pairs extracted from the community question-answering website Quora. The task is to predict whether two questions are semantically equivalent (Chen et al., 2018). As the distribution of positive and negative labels is unbalanced, both accuracy and F1 score are used as evaluation metrics.

MRPC The Microsoft Research Paraphrase Corpus consists of sentence pairs automatically extracted from online news sources, with human annotations denoting whether a sentence pair is semantically equivalent (Dolan and Brockett, 2005). Similar to QQP, both accuracy and F1 score are used as evaluation metrics.
Corpus Task #Train #Dev #Test #Label Metrics
Single-Sentence Classification (GLUE)
CoLA Acceptability 8.5k 1k 1k 2 Matthews corr
SST-2 Sentiment 67k 872 1.8k 2 Accuracy
Pairwise Text Classification (GLUE)
MNLI NLI 393k 20k 20k 3 Accuracy
RTE NLI 2.5k 276 3k 2 Accuracy
WNLI NLI 634 71 146 2 Accuracy
QQP Paraphrase 364k 40k 391k 2 Accuracy/F1
MRPC Paraphrase 3.7k 408 1.7k 2 Accuracy/F1
Text Similarity (GLUE)
STS-B Similarity 7k 1.5k 1.4k 1 Pearson/Spearman corr
Relevance Ranking (GLUE)
QNLI QA/NLI 108k 5.7k 5.7k 2 Accuracy
Pairwise Text Classification
SNLI NLI 549k 9.8k 9.8k 3 Accuracy
SciTail NLI 23.5k 1.3k 2.1k 2 Accuracy

Table 1: Summary of the three benchmarks: GLUE, SNLI and SciTail.

MNLI Multi-Genre Natural Language Inference is a large-scale, crowd-sourced entailment classification task (Nangia et al., 2017). Given a pair of sentences (i.e., a premise-hypothesis pair), the goal is to predict whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise. The test and development sets are split into in-domain (matched) and cross-domain (mismatched) sets. The evaluation metric is accuracy.

RTE The Recognizing Textual Entailment dataset is collected from a series of annual challenges on textual entailment. The task is similar to MNLI, but uses only two labels: entailment and not entailment (Wang et al., 2018).

WNLI The Winograd NLI (WNLI) is a natural language inference dataset derived from the Winograd Schema dataset (Levesque et al., 2012). This is a reading comprehension task. The goal is to select the referent of a pronoun from a list of choices in a given sentence which contains the pronoun.

SNLI The Stanford Natural Language Inference (SNLI) dataset contains 570k human-annotated sentence pairs, in which the premises are drawn from the captions of the Flickr30 corpus and the hypotheses are manually annotated (Bowman et al., 2015b). This is the most widely used entailment dataset for NLI. The dataset is used only for domain adaptation in this study.

SciTail This is a textual entailment dataset derived from a science question answering (SciQ) dataset (Khot et al., 2018). The task involves assessing whether a given premise entails a given hypothesis. In contrast to the other entailment datasets mentioned previously, the hypotheses in SciTail are created from science questions, while the corresponding answer candidates and premises come from relevant web sentences retrieved from a large corpus. As a result, these sentences are linguistically challenging, and the lexical similarity of premise and hypothesis is often high, thus making SciTail particularly difficult. The dataset is used only for domain adaptation in this study.

4.2 Implementation details

Our implementation of MT-DNN is based on the PyTorch implementation of BERT⁴. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over 0.1 was used, unless stated otherwise. Following Liu et al. (2018a), we set the number of reasoning steps K to 5, with a dropout rate of 0.1. To avoid the exploding gradient problem, we clipped the gradient norm to within 1. All the texts were tokenized using WordPiece, and were chopped into spans no longer than 512 tokens.

⁴ https://github.com/huggingface/pytorch-pretrained-BERT
Model CoLA SST-2 MRPC STS-B QQP MNLI-m/mm QNLI RTE WNLI AX Score
8.5k 67k 3.7k 7k 364k 393k 108k 2.5k 634
BiLSTM+ELMo+Attn 36.0 90.4 84.9/77.9 75.1/73.3 64.8/84.7 76.4/76.1 79.9 56.8 65.1 26.5 70.5
Singletask Pretrain Transformer 45.4 91.3 82.3/75.7 82.0/80.0 70.3/88.5 82.1/81.4 88.1 56.0 53.4 29.8 72.8
GPT on STILTs 47.2 93.1 87.7/83.7 85.3/84.8 70.1/88.1 80.8/80.6 87.2 69.1 65.1 29.4 76.9
BERTLARGE 60.5 94.9 89.3/85.4 87.6/86.5 72.1/89.3 86.7/85.9 91.1 70.1 65.1 39.6 80.4
MT-DNN 61.5 95.6 90.0/86.7 88.3/87.7 72.4/89.6 86.7/86.0 98.0 75.5 65.1 40.3 82.2

Table 2: GLUE test set results, which are scored by the GLUE evaluation server. The number below each task
denotes the number of training examples. The state-of-the-art results are in bold. MT-DNN uses BERTLARGE for
its shared layers. All the results are obtained from https://gluebenchmark.com/leaderboard.

Model MNLI-m/mm QQP MRPC RTE QNLI SST-2 CoLA STS-B


BERTBASE 84.5/84.4 90.4/87.4 84.5/89.0 65.0 88.4 92.8 55.4 89.6/89.2
ST-DNN 84.7/84.6 91.0/87.9 86.6/89.1 64.6 94.6 - - -
MT-DNN 85.3/85.0 91.6/88.6 86.8/89.2 79.1 95.7 93.6 59.5 90.6/90.4

Table 3: GLUE dev set results. The best result on each task is in bold. BERTBASE is the base BERT model
released by the authors, and is fine-tuned for each single task. The Single-Task DNN (ST-DNN) uses the same
model architecture as MT-DNN. But instead of fine-tuning one model for all tasks using MTL, we create multiple
ST-DNNs, one for each task using only in-domain data for fine-tuning. ST-DNNs and MT-DNN use BERTBASE
for their shared layers.

4.3 GLUE Results

The test results on GLUE are presented in Table 2.⁵ MT-DNN outperforms all existing systems on all tasks, except WNLI, creating new state-of-the-art results on eight GLUE tasks and pushing the benchmark to 82.2%, which amounts to a 1.8% absolute improvement over BERTLARGE. Since MT-DNN uses BERTLARGE for its shared layers, the gain is solely attributed to the use of MTL in fine-tuning. MTL is particularly useful for tasks with little in-domain training data. As we observe in the table, on the same type of tasks, the improvements over BERT are much more substantial for the tasks with less in-domain training data, e.g., the two NLI tasks: RTE vs. MNLI, and the two paraphrase tasks: MRPC vs. QQP.

⁵ There is an ongoing discussion on revising the QNLI dataset. We will update the results when the new dataset is available.

The gain of MT-DNN is also attributed to its flexible modeling framework, which allows us to incorporate task-specific model structures and training methods that have been developed in the single-task setting, effectively leveraging the existing body of research. Two such examples are the use of the SAN answer module for the pairwise text classification output module, and the pairwise ranking loss for the QNLI task, which by design is a binary classification problem in GLUE. To investigate the relative contributions of these two modeling design choices, we implement different versions of MT-DNN and compare their performance on the development sets. The results are shown in Table 3.

• BERTBASE is the base BERT model released by the authors, which we used as a baseline. We fine-tuned the model for each single task.

• MT-DNN is the proposed model described in Section 3, using the pre-trained BERTBASE as its shared layers. We then fine-tuned the model using MTL on all GLUE tasks. Comparing MT-DNN vs. BERTBASE, we see that the results on the dev sets are consistent with the GLUE test results in Table 2.

• ST-DNN, standing for Single-Task DNN, uses the same model architecture as MT-DNN. But instead of fine-tuning one model for all tasks using MTL, we create multiple ST-DNNs, one for each task, using only its in-domain data for fine-tuning. Thus, for pairwise text classification tasks, the only difference between the ST-DNNs and the BERT models is the design of the task-specific output module. The results show that on three out of four tasks (MNLI, QQP and MRPC) ST-DNNs outperform their BERT counterparts, justifying the effectiveness of the SAN answer module. We also compare the results of ST-DNN and BERT on QNLI. While ST-DNN is fine-tuned using the pairwise ranking loss, BERT views QNLI as binary classification and is fine-tuned using the cross-entropy loss. That ST-DNN significantly outperforms BERT demonstrates clearly the importance of problem formulation.

4.4 SNLI and SciTail Results


In Table 4, we compare our adapted models, using all in-domain training samples, against several strong baselines, including the best results reported in the leaderboards. We see that MT-DNN generates new state-of-the-art results on both datasets, pushing the benchmarks to 91.1% on SNLI (1.0% absolute improvement) and 94.1% on SciTail (5.8% absolute improvement), respectively.

Model                         Dev    Test
SNLI Dataset (Accuracy%)
GPT (Radford et al., 2018)    -      89.9
Kim et al. (2018)∗            -      90.1
BERT                          91.0   90.8
MT-DNN                        91.4   91.1
SciTail Dataset (Accuracy%)
GPT (Radford et al., 2018)∗   -      88.3
BERT                          94.3   92.0
MT-DNN                        95.8   94.1

Table 4: Results on the SNLI and SciTail datasets. Previous state-of-the-art results are marked by ∗, obtained from the official SNLI leaderboard (https://nlp.stanford.edu/projects/snli/) and the official SciTail leaderboard maintained by AI2 (https://leaderboard.allenai.org/scitail/submissions/public). Both MT-DNN and BERT are fine-tuned based on the pre-trained BERTBASE.

Figure 2: Domain adaptation results on the SNLI and SciTail development datasets using the shared embeddings generated by MT-DNN and BERT, respectively. Both MT-DNN and BERT are fine-tuned based on the pre-trained BERTBASE. The X-axis indicates the amount of domain-specific labeled samples used for adaptation.

Model            0.1%   1%     10%     100%
SNLI Dataset (Dev Accuracy%)
#Training Data   549    5,493  54,936  549,367
BERT             52.5   78.1   86.7    91.0
MT-DNN           82.1   85.2   88.4    91.5
SciTail Dataset (Dev Accuracy%)
#Training Data   23     235    2,359   23,596
BERT             51.2   82.2   90.5    94.3
MT-DNN           81.9   88.3   91.1    95.7

Table 5: Domain adaptation results on SNLI and SciTail, as shown in Figure 2.
4.5 Domain Adaptation Results

One of the most important criteria for building practical systems is fast adaptation to new tasks and domains, because it is prohibitively expensive to collect labeled training data for new domains or tasks. Very often, we only have very small training data or even no training data.

To evaluate the models using the above criterion, we perform domain adaptation experiments on two NLI tasks, SNLI and SciTail, using the following procedure:

1. fine-tune the MT-DNN model on eight GLUE tasks, excluding WNLI;
2. create for each new task (SNLI or SciTail) a task-specific model, by adapting the trained MT-DNN using task-specific training data;
3. evaluate the models using task-specific test data.

We denote the two task-specific models as MT-DNN. For comparison, we also apply the same adaptation procedure to the pre-trained BERT model, creating two task-specific BERT models for SNLI and SciTail, respectively, denoted as BERT.
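A minimal sketch of this adaptation procedure, under the assumption of hypothetical helpers load_nli_dataset, fine_tune, and evaluate (they are not part of the released code), looks as follows:

```python
import copy
import random

def domain_adaptation(mt_dnn_model, load_nli_dataset, fine_tune, evaluate,
                      task="SciTail", fractions=(0.001, 0.01, 0.1, 1.0), seed=0):
    """Adapt a GLUE-trained MT-DNN to a new NLI task with varying amounts of data."""
    train, dev = load_nli_dataset(task)                    # hypothetical data loader
    random.Random(seed).shuffle(train)
    results = {}
    for frac in fractions:
        subset = train[: max(1, int(len(train) * frac))]   # 0.1%, 1%, 10%, 100% slices
        model = copy.deepcopy(mt_dnn_model)                # start from the MTL-trained model
        fine_tune(model, subset)                           # task-specific fine-tuning
        results[frac] = evaluate(model, dev)               # dev accuracy, as in Table 5
    return results
```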
We split the training data of SNLI and SciTail, and randomly sample 0.1%, 1%, 10% and 100% of the training data. As a result, we obtain four training sets for SciTail, containing 23, 235, 2.3k, and 23.5k training samples, respectively. Similarly, we obtain four training sets for SNLI, containing 549, 5.5k, 54.9k, and 549.3k training samples.

Results on different amounts of training data from SNLI and SciTail are reported in Figure 2 and Table 5. We observe that our model, pre-trained on GLUE via multi-task learning, consistently outperforms the BERT baseline. The fewer the training data used, the larger the improvement MT-DNN demonstrates over BERT. For example, with only 0.1% (549 samples) of the SNLI training data, MT-DNN achieves 82.1% accuracy while BERT's accuracy is 52.5%; with 1% of the training data, the accuracy of our model is 85.2% while BERT's is 78.1%. We observe similar results on SciTail. These results indicate that the representations learned by MT-DNN are more effective for domain adaptation than those of BERT.

5 Conclusion

In this work we proposed a model called MT-DNN to combine multi-task learning and language model pre-training for language representation learning. MT-DNN obtains new state-of-the-art results on ten NLU tasks across three popular benchmarks: SNLI, SciTail, and GLUE. MT-DNN also demonstrates an exceptional generalization capability in domain adaptation experiments.

There are many future areas to explore to improve MT-DNN, including a deeper understanding of model structure sharing in MTL, a more effective training method that leverages relatedness among multiple tasks, and ways of incorporating the linguistic structure of text in a more explicit and controllable manner.

6 Acknowledgements

We would like to thank Jade Huang from Microsoft for her generous help on this work.
References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015b. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96. ACM.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

J. Gao, M. Galley, and L. Li. 2018. Neural approaches to conversational AI. CoRR, abs/1809.08267.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information & Knowledge Management, pages 2333–2338. ACM.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.

Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Xiaodong Liu, Kevin Duh, and Jianfeng Gao. 2018a. Stochastic answer networks for natural language inference. arXiv preprint arXiv:1804.07888.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 912–921.

Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018b. Stochastic answer networks for machine reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Minh-Thang Luong, Quoc V. Le, Ilya Sutskever, Oriol Vinyals, and Lukasz Kaiser. 2015. Multi-task sequence to sequence learning. arXiv preprint arXiv:1511.06114.

Brian W. Matthews. 1975. Comparison of the predicted and observed secondary structure of T4 phage lysozyme. Biochimica et Biophysica Acta (BBA) - Protein Structure, 405(2):442–451.

N. Nangia, A. Williams, A. Lazaridou, and S. R. Bowman. 2017. The RepEval 2017 shared task: Multi-genre natural language inference with sentence representations. ArXiv e-prints.

Matthew E. Peters, Mark Neumann, Mohit Iyyer, Matt Gardner, Christopher Clark, Kenton Lee, and Luke Zettlemoyer. 2018. Deep contextualized word representations. arXiv preprint arXiv:1802.05365.

Alec Radford, Karthik Narasimhan, Tim Salimans, and Ilya Sutskever. 2018. Improving language understanding by generative pre-training.

Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. 2016. SQuAD: 100,000+ questions for machine comprehension of text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392.

Richard Socher, Alex Perelygin, Jean Wu, Jason Chuang, Christopher D. Manning, Andrew Ng, and Christopher Potts. 2013. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pages 1631–1642.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Yichong Xu, Xiaodong Liu, Yelong Shen, Jingjing Liu, and Jianfeng Gao. 2018. Multi-task learning for machine reading comprehension. arXiv preprint arXiv:1809.06963.

Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114.
