Multi-Task Deep Neural Networks For Natural Language Understanding
of sentences (i.e., a premise-hypothesis pair), the goal is to predict whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise. The test and development sets are split into in-domain (matched) and cross-domain (mismatched) sets. The evaluation metric is accuracy.

RTE The Recognizing Textual Entailment dataset is collected from a series of annual challenges on textual entailment. The task is similar to MNLI, but uses only two labels: entailment and not entailment (Wang et al., 2018).

WNLI The Winograd NLI (WNLI) is a natural language inference dataset derived from the Winograd Schema dataset (Levesque et al., 2012). This is a reading comprehension task. The goal is to select the referent of a pronoun from a list of choices in a given sentence which contains the pronoun.

SNLI The Stanford Natural Language Inference (SNLI) dataset contains 570k human annotated sentence pairs, in which the premises are drawn from the captions of the Flickr30 corpus and hypotheses are manually annotated (Bowman et al., 2015b). This is the most widely used entailment dataset for NLI. The dataset is used only for domain adaptation in this study.

... dataset (Khot et al., 2018). The task involves assessing whether a given premise entails a given hypothesis. In contrast to other entailment datasets mentioned previously, the hypotheses in SciTail are created from science questions while the corresponding answer candidates and premises come from relevant web sentences retrieved from a large corpus. As a result, these sentences are linguistically challenging and the lexical similarity of premise and hypothesis is often high, thus making SciTail particularly difficult. The dataset is used only for domain adaptation in this study.

4.2 Implementation details
Our implementation of MT-DNN is based on the PyTorch implementation of BERT⁴. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32. The maximum number of epochs was set to 5. A linear learning rate decay schedule with a warm-up proportion of 0.1 was used, unless stated otherwise. Following Liu et al. (2018a), we set the number of steps to 5 with a dropout rate of 0.1. To avoid the exploding gradient problem, we clipped the gradient norm to 1. All the texts were tokenized using wordpieces, and were truncated to spans no longer than 512 tokens.
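As an illustration of the schedule and clipping settings described above, the following is a minimal sketch in plain Python rather than the authors' PyTorch code. It assumes that the warm-up value of 0.1 is a fraction of the total number of training steps and that the decay is continuous at the end of warm-up; the function names (`linear_warmup_decay`, `clip_grad_norm`) and the example dataset size are hypothetical.

```python
import math


def linear_warmup_decay(step, total_steps, base_lr=5e-5, warmup=0.1):
    """Learning rate at a given step: linear warm-up, then linear decay to 0.

    Assumption: `warmup` is the fraction of `total_steps` spent warming up.
    """
    progress = step / total_steps
    if progress < warmup:
        # Ramp linearly from 0 up to base_lr during warm-up.
        return base_lr * progress / warmup
    # Decay linearly from base_lr down to 0 over the remaining steps.
    return base_lr * max(0.0, (1.0 - progress) / (1.0 - warmup))


def clip_grad_norm(grads, max_norm=1.0):
    """Rescale a flat list of gradient values so their L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm > max_norm:
        scale = max_norm / total_norm
        return [g * scale for g in grads]
    return list(grads)


# Hypothetical usage: 5 epochs, batch size 32, and an assumed dataset size.
num_examples = 1000  # illustrative value, not from the paper
total_steps = 5 * math.ceil(num_examples / 32)
lrs = [linear_warmup_decay(s, total_steps) for s in range(total_steps)]
clipped = clip_grad_norm([3.0, 4.0])  # original L2 norm is 5, rescaled to norm 1
```

In a PyTorch training loop the same two roles would normally be played by a learning-rate scheduler and a gradient-clipping utility; the pure-Python version is used here only to make the arithmetic explicit.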
4.5 Domain Adaptation Results
One of the most important criteria for building practical systems is fast adaptation to new tasks and domains. This is because it is prohibitively expensive to collect labeled training data for new domains or tasks. Very often, we only have very small training data or even no training data.

To evaluate the models using the above criterion, we perform domain adaptation experiments on two NLI tasks, SNLI and SciTail, using the following procedure:

2. create for each new task (SNLI or SciTail) a task-specific model, by adapting the trained MT-DNN using task-specific training data;
3. evaluate the models using task-specific test data.

We denote the two task-specific models as MT-DNN. For comparison, we also perform the same adaptation procedure to the pre-trained BERT model, creating two task-specific BERT models for SNLI and SciTail, respectively, denoted as
