of sentences (i.e., a premise-hypothesis pair), the goal is to predict whether the hypothesis is an entailment, contradiction, or neutral with respect to the premise. The test and development sets are split into in-domain (matched) and cross-domain (mismatched) sets. The evaluation metric is accuracy.

RTE The Recognizing Textual Entailment dataset is collected from a series of annual challenges on textual entailment. The task is similar to MNLI, but uses only two labels: entailment and not entailment (Wang et al., 2018).

WNLI The Winograd NLI (WNLI) is a natural language inference dataset derived from the Winograd Schema dataset (Levesque et al., 2012). This is a reading comprehension task. The goal is to select the referent of a pronoun from a list of choices in a given sentence which contains the pronoun.

SNLI The Stanford Natural Language Inference (SNLI) dataset contains 570k human-annotated sentence pairs, in which the premises are drawn from the captions of the Flickr30k corpus and the hypotheses are manually annotated (Bowman et al., 2015b). This is the most widely used entailment dataset for NLI. The dataset is used only for domain adaptation in this study.

SciTail This is a textual entailment dataset derived from a science question answering (SciQ) dataset (Khot et al., 2018). The task involves assessing whether a given premise entails a given hypothesis. In contrast to the other entailment datasets mentioned previously, the hypotheses in SciTail are created from science questions, while the corresponding answer candidates and premises come from relevant web sentences retrieved from a large corpus. As a result, these sentences are linguistically challenging, and the lexical similarity between premise and hypothesis is often high, making SciTail particularly difficult. The dataset is used only for domain adaptation in this study.

4.2 Implementation details

Our implementation of MT-DNN is based on the PyTorch implementation of BERT. We used Adamax (Kingma and Ba, 2014) as our optimizer with a learning rate of 5e-5 and a batch size of 32. The maximum number of epochs was set to 5. A linear learning rate decay schedule with warm-up over the first 0.1 of training steps was used, unless stated otherwise. Following Liu et al. (2018a), we set the number of steps to 5 with a dropout rate of 0.1. To avoid the exploding gradient problem, we clipped the gradient norm to 1. All texts were tokenized using WordPieces and were chopped into spans no longer than 512 tokens.
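For concreteness, the optimization setup described above can be sketched in PyTorch as follows. This is an illustrative, self-contained sketch rather than the actual MT-DNN training code: the linear model and random batches are placeholders, while the hyperparameters (Adamax, learning rate 5e-5, batch size 32, 5 epochs, warm-up over 0.1 of the steps, gradient-norm clipping at 1) are the values given in this section.

import torch
from torch import nn

# Placeholder model and data: a stand-in for the BERT-based shared layers plus a
# task-specific head, and for wordpiece-tokenized batches of at most 512 tokens.
model = nn.Linear(768, 3)
batches = [(torch.randn(32, 768), torch.randint(0, 3, (32,))) for _ in range(100)]

epochs = 5                                     # maximum number of epochs
total_steps = epochs * len(batches)
warmup_steps = int(0.1 * total_steps)          # linear warm-up over 0.1 of training

optimizer = torch.optim.Adamax(model.parameters(), lr=5e-5)

def linear_warmup_decay(step):
    # Linear warm-up followed by linear decay of the learning rate.
    if step < warmup_steps:
        return step / max(1, warmup_steps)
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, linear_warmup_decay)
loss_fn = nn.CrossEntropyLoss()

for epoch in range(epochs):
    for x, y in batches:                       # batch size 32
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        # Clip the gradient norm to 1 to avoid exploding gradients.
        torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
        optimizer.step()
        scheduler.step()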
Table 2: GLUE test set results, which are scored by the GLUE evaluation server. The number below each task denotes the number of training examples. The state-of-the-art results are in bold. MT-DNN uses BERT_LARGE for its shared layers. All the results are obtained from https://gluebenchmark.com/leaderboard.

Table 3: GLUE dev set results. The best result on each task is in bold. BERT_BASE is the base BERT model released by the authors, and is fine-tuned for each single task. The Single-Task DNN (ST-DNN) uses the same model architecture as MT-DNN. But instead of fine-tuning one model for all tasks using MTL, we create multiple ST-DNNs, one for each task, using only in-domain data for fine-tuning. ST-DNNs and MT-DNN use BERT_BASE for their shared layers.
4.5 Domain Adaptation Results

One of the most important criteria for building practical systems is fast adaptation to new tasks and domains. This is because it is prohibitively expensive to collect labeled training data for new domains or tasks. Very often, we only have very small amounts of training data, or even no training data at all.

To evaluate the models using the above criterion, we perform domain adaptation experiments on two NLI tasks, SNLI and SciTail, using the following procedure:

1. use the MT-DNN model pre-trained on GLUE via multi-task learning as the initial model;
2. create for each new task (SNLI or SciTail) a task-specific model, by adapting the trained MT-DNN using task-specific training data;
3. evaluate the models using task-specific test data.

We denote the two task-specific models as MT-DNN. For comparison, we also apply the same adaptation procedure to the pre-trained BERT model, creating two task-specific BERT models for SNLI and SciTail, respectively, denoted as BERT.
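The three adaptation steps can be illustrated with a schematic PyTorch sketch. Everything named below is a placeholder used only to make the flow concrete: a small random encoder stands in for the GLUE-trained MT-DNN shared layers, the label counts (3 for SNLI, 2 for SciTail) reflect the task definitions, and random tensors stand in for the task-specific training and test data.

import copy
import torch
from torch import nn

# Step 1: start from the multi-task-trained shared layers. In practice these
# weights come from the GLUE-trained MT-DNN checkpoint; a small random encoder
# keeps the sketch self-contained and runnable.
shared_layers = nn.Sequential(nn.Linear(768, 768), nn.Tanh())

tasks = {"SNLI": 3, "SciTail": 2}       # 3-way NLI vs. binary entailment

adapted = {}
for task, num_labels in tasks.items():
    # Step 2: build a task-specific model from a copy of the shared layers plus
    # a fresh task-specific head, and adapt it on in-domain training data.
    model = nn.Sequential(copy.deepcopy(shared_layers), nn.Linear(768, num_labels))
    optimizer = torch.optim.Adamax(model.parameters(), lr=5e-5)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(50):                 # placeholder in-domain batches
        x, y = torch.randn(32, 768), torch.randint(0, num_labels, (32,))
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()
        optimizer.step()
    adapted[task] = model

# Step 3: evaluate each adapted model on task-specific test data (random here).
for task, model in adapted.items():
    x, y = torch.randn(64, 768), torch.randint(0, tasks[task], (64,))
    with torch.no_grad():
        accuracy = (model(x).argmax(dim=-1) == y).float().mean().item()
    print(task, round(accuracy, 3))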
We split the training data of SNLI and SciTail, and randomly sample 0.1%, 1%, 10% and 100% of their training data. As a result, we obtain four sets of training data for SciTail, which respectively include 23, 235, 2.3k and 23.5k training samples. Similarly, we obtain four sets of training data for SNLI, which respectively include 549, 5.5k, 54.9k and 549.3k training samples.
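These subset sizes follow from the full training-set sizes by simple arithmetic; the short script below reproduces them, assuming the totals stated above (roughly 23.5k examples for SciTail and 549.3k for SNLI) and truncation to whole examples.

import random

# Full training-set sizes as stated above.
full_sizes = {"SciTail": 23_500, "SNLI": 549_300}
fractions = [0.001, 0.01, 0.1, 1.0]          # 0.1%, 1%, 10%, 100%

for task, n in full_sizes.items():
    indices = list(range(n))                 # stand-in for the real training examples
    for f in fractions:
        subset = random.sample(indices, int(f * n))   # randomly sample a fraction
        print(f"{task}: {f:.1%} -> {len(subset)} samples")
# SciTail: 23, 235, 2350, 23500; SNLI: 549, 5493, 54930, 549300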
Results on different amounts of training data of SNLI and SciTail are reported in Figure 2 and Table 5. We observe that our model pre-trained on GLUE via multi-task learning consistently outperforms the BERT baseline. The less training data is used, the larger the improvement MT-DNN demonstrates over BERT. For example, with only 0.1% (549 samples) of the SNLI training data, MT-DNN achieves 82.1% accuracy while BERT's accuracy is 52.5%; with 1% of the training data, the accuracy of our model is 85.2% while BERT's is 78.1%. We observe similar results on SciTail. These results indicate that the representations learned by MT-DNN are more effective for domain adaptation than those of BERT.
5 Conclusion

In this work we proposed a model called MT-DNN to combine multi-task learning and language model pre-training for language representation learning. MT-DNN obtains new state-of-the-art results on ten NLU tasks across three popular benchmarks: SNLI, SciTail, and GLUE. MT-DNN also demonstrates an exceptional generalization capability in domain adaptation experiments.

There are many future areas to explore to improve MT-DNN, including a deeper understanding of model structure sharing in MTL, a more effective training method that leverages relatedness among multiple tasks, and ways of incorporating the linguistic structure of text in a more explicit and controllable manner.

6 Acknowledgements

We would like to thank Jade Huang from Microsoft for her generous help on this work.
References

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015a. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing, pages 632–642.

Samuel R. Bowman, Gabor Angeli, Christopher Potts, and Christopher D. Manning. 2015b. A large annotated corpus for learning natural language inference. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing (EMNLP). Association for Computational Linguistics.

Chris Burges, Tal Shaked, Erin Renshaw, Ari Lazier, Matt Deeds, Nicole Hamilton, and Greg Hullender. 2005. Learning to rank using gradient descent. In Proceedings of the 22nd International Conference on Machine Learning, pages 89–96. ACM.

Rich Caruana. 1997. Multitask learning. Machine Learning, 28(1):41–75.

Daniel Cer, Mona Diab, Eneko Agirre, Inigo Lopez-Gazpio, and Lucia Specia. 2017. SemEval-2017 task 1: Semantic textual similarity multilingual and cross-lingual focused evaluation. arXiv preprint arXiv:1708.00055.

Z. Chen, H. Zhang, X. Zhang, and L. Zhao. 2018. Quora question pairs.

Ronan Collobert, Jason Weston, Léon Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel Kuksa. 2011. Natural language processing (almost) from scratch. Journal of Machine Learning Research, 12(Aug):2493–2537.

Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2018. BERT: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

William B. Dolan and Chris Brockett. 2005. Automatically constructing a corpus of sentential paraphrases. In Proceedings of the Third International Workshop on Paraphrasing (IWP2005).

J. Gao, M. Galley, and L. Li. 2018. Neural approaches to conversational AI. CoRR, abs/1809.08267.

Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck. 2013. Learning deep structured semantic models for web search using clickthrough data. In Proceedings of the 22nd ACM International Conference on Information and Knowledge Management, pages 2333–2338. ACM.

Tushar Khot, Ashish Sabharwal, and Peter Clark. 2018. SciTail: A textual entailment dataset from science question answering. In AAAI.

Seonhoon Kim, Jin-Hyuk Hong, Inho Kang, and Nojun Kwak. 2018. Semantic sentence matching with densely-connected recurrent and co-attentive information. arXiv preprint arXiv:1805.11360.

Diederik Kingma and Jimmy Ba. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Hector Levesque, Ernest Davis, and Leora Morgenstern. 2012. The Winograd schema challenge. In Thirteenth International Conference on the Principles of Knowledge Representation and Reasoning.

Xiaodong Liu, Kevin Duh, and Jianfeng Gao. 2018a. Stochastic answer networks for natural language inference. arXiv preprint arXiv:1804.07888.

Xiaodong Liu, Jianfeng Gao, Xiaodong He, Li Deng, Kevin Duh, and Ye-Yi Wang. 2015. Representation learning using multi-task deep neural networks for semantic classification and information retrieval. In Proceedings of the 2015 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 912–921.

Xiaodong Liu, Yelong Shen, Kevin Duh, and Jianfeng Gao. 2018b. Stochastic answer networks for machine reading comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics.

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. arXiv preprint arXiv:1706.03762.

Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R. Bowman. 2018. GLUE: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461.

Alex Warstadt, Amanpreet Singh, and Samuel R. Bowman. 2018. Neural network acceptability judgments. arXiv preprint arXiv:1805.12471.

Yichong Xu, Xiaodong Liu, Yelong Shen, Jingjing Liu, and Jianfeng Gao. 2018. Multi-task learning for machine reading comprehension. arXiv preprint arXiv:1809.06963.

Yu Zhang and Qiang Yang. 2017. A survey on multi-task learning. arXiv preprint arXiv:1707.08114.