Research article
Open access
DOI: 10.1145/3534678.3539173

Alexa Teacher Model: Pretraining and Distilling Multi-Billion-Parameter Encoders for Natural Language Understanding Systems

Published: 14 August 2022

Abstract

We present results from a large-scale experiment on pretraining encoders with non-embedding parameter counts ranging from 700M to 9.3B, their subsequent distillation into smaller models ranging from 17M to 170M parameters, and their application to the Natural Language Understanding (NLU) component of a virtual assistant system. Though we train using 70% spoken-form data, our teacher models perform comparably to XLM-R and mT5 when evaluated on the written-form Cross-lingual Natural Language Inference (XNLI) corpus. We perform a second stage of pretraining on our teacher models using in-domain data from our system, improving error rates by 3.86% relative for intent classification and 7.01% relative for slot filling. We find that even a 170M-parameter model distilled from our Stage 2 teacher model has 2.88% better intent classification and 7.69% better slot filling error rates when compared to the 2.3B-parameter teacher trained only on public data (Stage 1), emphasizing the importance of in-domain data for pretraining. When evaluated offline using labeled NLU data, our 17M-parameter Stage 2 distilled model outperforms both XLM-R Base (85M params) and DistilBERT (42M params) by 4.23% and 6.14%, respectively. Finally, we present results from a full virtual assistant experimentation platform, where we find that models trained using our pretraining and distillation pipeline outperform models distilled from 85M-parameter teachers by 3.74%-4.91% on an automatic measurement of full-system user dissatisfaction.
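As a rough illustration of the distillation step mentioned in the abstract, the sketch below trains a small student intent classifier to match a larger teacher's softened output distribution while also fitting gold labels. The layer sizes, temperature, and loss weighting are illustrative assumptions, not the configuration used in the paper.

```python
# A minimal teacher-student distillation sketch (illustrative assumptions:
# layer sizes, temperature, and loss weights are not the paper's settings).
import torch
import torch.nn as nn
import torch.nn.functional as F

NUM_INTENTS = 16
TEMPERATURE = 2.0  # assumed softening temperature
ALPHA = 0.5        # assumed weight between soft (teacher) and hard (label) losses

# Stand-ins for a large pretrained teacher and a small student intent head.
teacher = nn.Sequential(nn.Linear(768, 1024), nn.ReLU(), nn.Linear(1024, NUM_INTENTS))
student = nn.Sequential(nn.Linear(768, 128), nn.ReLU(), nn.Linear(128, NUM_INTENTS))

def distillation_loss(student_logits, teacher_logits, labels):
    # Soft targets: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / TEMPERATURE, dim=-1),
        F.softmax(teacher_logits / TEMPERATURE, dim=-1),
        reduction="batchmean",
    ) * TEMPERATURE ** 2
    # Hard targets: ordinary cross-entropy against gold intent labels.
    hard = F.cross_entropy(student_logits, labels)
    return ALPHA * soft + (1.0 - ALPHA) * hard

# Toy batch of pooled utterance encodings and gold intents.
x = torch.randn(8, 768)
labels = torch.randint(0, NUM_INTENTS, (8,))
with torch.no_grad():
    teacher_logits = teacher(x)
loss = distillation_loss(student(x), teacher_logits, labels)
loss.backward()
```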

Supplemental Material

MP4 File
Suppose that you are building a production natural language understanding model with tight latency constraints and that you have a large corpus of labeled data. Is it better to directly fine-tune a small pretrained model, or is it better to pretrain a large teacher with in-domain data, distill, and then fine-tune? In our case, the latter approach yielded the best results.
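At a high level, the comparison in the video is between the two pipelines outlined below. The helper functions (pretrain, distill, finetune) are hypothetical placeholders standing in for full training loops; this is a structural sketch of the comparison, not the authors' implementation.

```python
# Structural outline of the two options (placeholder helpers, not real training code).

def pretrain(model, corpus):
    # Placeholder for self-supervised (e.g., masked language model) pretraining.
    return model

def distill(teacher, student, corpus):
    # Placeholder for training the student to match the teacher's outputs.
    return student

def finetune(model, labeled_data):
    # Placeholder for supervised training on intent and slot labels.
    return model

def option_a(small_model, labeled_nlu_data):
    # Option A: directly fine-tune a small pretrained model.
    return finetune(small_model, labeled_nlu_data)

def option_b(large_model, small_model, public_corpus, in_domain_corpus, labeled_nlu_data):
    # Option B (the better choice in this paper): pretrain a large teacher on
    # public data (Stage 1), continue pretraining on in-domain data (Stage 2),
    # distill into a small student, then fine-tune the student.
    teacher = pretrain(large_model, public_corpus)
    teacher = pretrain(teacher, in_domain_corpus)
    student = distill(teacher, small_model, in_domain_corpus)
    return finetune(student, labeled_nlu_data)
```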



Information

Published In

KDD '22: Proceedings of the 28th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
August 2022
5033 pages
ISBN: 9781450393850
DOI: 10.1145/3534678
This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International License.


Publisher

Association for Computing Machinery

New York, NY, United States


Author Tags

  1. distributed training
  2. knowledge distillation
  3. model pretraining
  4. natural language understanding
  5. self-attention
  6. transformers
  7. virtual assistant
  8. voice a.i.

Qualifiers

  • Research-article

Conference

KDD '22

Acceptance Rates

Overall Acceptance Rate 1,133 of 8,635 submissions, 13%


Bibliometrics & Citations

Article Metrics

  • Downloads (Last 12 months): 436
  • Downloads (Last 6 weeks): 36

Reflects downloads up to 25 Nov 2024

Cited By

  • (2024) A Lightweight and Effective Multi-View Knowledge Distillation Framework for Text-Image Retrieval. 2024 International Joint Conference on Neural Networks (IJCNN), pp. 1-8. https://doi.org/10.1109/IJCNN60899.2024.10650723. Online publication date: 30-Jun-2024.
  • (2023) Optimal Transport Posterior Alignment for Cross-lingual Semantic Parsing. Transactions of the Association for Computational Linguistics, 11, pp. 1432-1450. https://doi.org/10.1162/tacl_a_00611. Online publication date: 16-Nov-2023.
  • (2023) A Mixed-Methods Approach to Understanding User Trust after Voice Assistant Failures. Proceedings of the 2023 CHI Conference on Human Factors in Computing Systems, pp. 1-16. https://doi.org/10.1145/3544548.3581152. Online publication date: 19-Apr-2023.
  • (2023) Initial Development and Performance Evaluation of a Bengali Voice-Operated Virtual Assistant for Personal Computer Control. 2023 IEEE 64th International Scientific Conference on Information Technology and Management Science of Riga Technical University (ITMS), pp. 1-6. https://doi.org/10.1109/ITMS59786.2023.10317730. Online publication date: 5-Oct-2023.
  • (2023) Efficient Fine-Tuning Large Language Models for Knowledge-Aware Response Planning. Machine Learning and Knowledge Discovery in Databases: Research Track, pp. 593-611. https://doi.org/10.1007/978-3-031-43415-0_35. Online publication date: 18-Sep-2023.
