
Meta-learning for compressed language model: A multiple choice question answering study

Published: 28 May 2022

Abstract

Model compression is a promising approach for reducing the size of pretrained language models (PLMs) so that they can run on low-resource edge devices and applications. Unfortunately, compression typically comes at the cost of performance degradation, especially on low-resource downstream tasks such as multiple-choice question answering. To address this degradation, we propose an end-to-end Reptile (ETER) meta-learning approach that improves the performance of compressed PLMs on the low-resource multiple-choice question answering task. Specifically, ETER extends the traditional two-stage meta-learning pipeline into an end-to-end one by integrating the target fine-tuning stage into the meta-training stage. To strengthen generalization, ETER constructs meta-tasks at two levels, the instance level and the domain level, enriching task diversity. Moreover, ETER optimizes meta-learning under parameter constraints, reducing the parameter learning space. Experiments demonstrate that ETER significantly improves the performance of compressed PLMs and achieves large gains over the baselines on different datasets.
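For readers unfamiliar with Reptile, the sketch below illustrates a generic first-order Reptile meta-update in PyTorch: adapt the parameters to a sampled task with a few SGD steps, then move the meta-parameters a small step toward the adapted weights. This is only an illustration under assumptions, not the ETER implementation described in the paper; the `sample_task` callable, the cross-entropy loss, and the hyperparameter values are hypothetical, and ETER's end-to-end integration of target fine-tuning and its parameter constraints are not shown.

```python
# Minimal sketch of a Reptile-style first-order meta-update (Nichol et al., 2018),
# the algorithm family that ETER builds on. NOT the authors' implementation:
# the task sampler, loss, and hyperparameters are illustrative assumptions.
import torch


def reptile_meta_step(model, sample_task, inner_steps=5,
                      inner_lr=1e-3, meta_lr=0.1):
    """Run one Reptile meta-update on `model` in place.

    `sample_task()` is assumed to return an iterable of (inputs, labels)
    batches drawn from a single meta-task, e.g. one MCQA domain or an
    instance-level subset.
    """
    # Snapshot the current meta-parameters theta.
    theta = {name: p.clone() for name, p in model.state_dict().items()}

    # Inner loop: adapt the model to the sampled task with a few SGD steps.
    inner_opt = torch.optim.SGD(model.parameters(), lr=inner_lr)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _, (inputs, labels) in zip(range(inner_steps), sample_task()):
        inner_opt.zero_grad()
        logits = model(inputs)          # e.g. choice logits from an MCQA head
        loss = loss_fn(logits, labels)
        loss.backward()
        inner_opt.step()

    # Outer (meta) update: move theta toward the adapted weights phi,
    # i.e. theta <- theta + meta_lr * (phi - theta).
    phi = model.state_dict()
    new_theta = {}
    for name, t in theta.items():
        if t.is_floating_point():
            new_theta[name] = t + meta_lr * (phi[name] - t)
        else:
            new_theta[name] = phi[name]  # keep non-float buffers as adapted
    model.load_state_dict(new_theta)
```

In ETER, as the abstract states, this meta-training is additionally interleaved with fine-tuning on the target task rather than treated as a separate preliminary stage.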

Cited By

  • (2024) An efficient confusing choices decoupling framework for multi-choice tasks over texts. Neural Computing and Applications 36(1), 259–271. https://doi.org/10.1007/s00521-023-08795-4. Online publication date: 1 Jan 2024.
  • (2023) GACaps-HTC: graph attention capsule network for hierarchical text classification. Applied Intelligence 53(17), 20577–20594. https://doi.org/10.1007/s10489-023-04585-6. Online publication date: 1 Sep 2023.

          Published In

          Neurocomputing, Volume 487, Issue C
          May 2022
          301 pages

          Publisher

          Elsevier Science Publishers B. V.

          Netherlands

          Publication History

          Published: 28 May 2022

          Author Tags

          1. End-to-end reptile
          2. Compressed pretrained-language-model
          3. Meta-learning
          4. Multiple-choice question answering

          Qualifiers

          • Research-article
