DOI: 10.1145/3580305.3599850
Research article. Free access.

JiuZhang 2.0: A Unified Chinese Pre-trained Language Model for Multi-task Mathematical Problem Solving

Published: 04 August 2023

Abstract

Although pre-trained language models (PLMs) have recently advanced research progress in mathematical reasoning, they are not specially designed as capable multi-task solvers: they suffer from the high cost of multi-task deployment (e.g., one model copy per task) and from inferior performance on complex mathematical problems in practical applications. To address these issues, we propose JiuZhang 2.0, a unified Chinese PLM specially designed for multi-task mathematical problem solving. Our idea is to maintain a moderate-sized model and employ cross-task knowledge sharing to improve its capacity in a multi-task setting. Specifically, we construct a Mixture-of-Experts (MoE) architecture for modeling mathematical text, so as to capture the common mathematical knowledge shared across tasks. To optimize the MoE architecture, we design multi-task continual pre-training and multi-task fine-tuning strategies for multi-task adaptation. These training strategies effectively decompose the knowledge in the task data and establish cross-task sharing via the expert networks. To further improve the general capacity for solving different complex tasks, we leverage large language models (LLMs) as complementary models that iteratively refine the solutions generated by our PLM via in-context learning. Extensive experiments demonstrate the effectiveness of our model.
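
As context for the abstract: the following is a minimal sketch of the kind of token-level Mixture-of-Experts (MoE) feed-forward layer described above, written in PyTorch with top-1 routing. The class name, layer sizes, and routing rule are illustrative assumptions only, not the authors' actual JiuZhang 2.0 implementation.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEFeedForward(nn.Module):
    """Token-level mixture-of-experts feed-forward block (illustrative sketch only)."""

    def __init__(self, d_model: int, d_ff: int, num_experts: int = 4):
        super().__init__()
        # One feed-forward expert per slot; each expert can specialize in
        # mathematical knowledge shared by a subset of tasks.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(num_experts)
        )
        # Router that scores every token against every expert.
        self.router = nn.Linear(d_model, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        gate = F.softmax(self.router(x), dim=-1)   # routing probabilities per token
        top_prob, top_idx = gate.max(dim=-1)       # top-1 expert for each token
        out = torch.zeros_like(x)
        for i, expert in enumerate(self.experts):
            mask = top_idx == i                     # tokens routed to expert i
            if mask.any():
                out[mask] = top_prob[mask].unsqueeze(-1) * expert(x[mask])
        return out

# Usage sketch: route a batch of encoded math-text tokens through the MoE block.
layer = MoEFeedForward(d_model=768, d_ff=3072, num_experts=4)
hidden = torch.randn(2, 128, 768)                  # (batch, seq_len, d_model)
output = layer(hidden)                             # same shape as the input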

Supplementary Material

MP4 File (KDD2023-Jiuzhang2.0-talk.mp4)
A short introduction to JiuZhang 2.0: A Unified Chinese Pre-trained Language Model for Multi-task Mathematical Problem Solving.


Cited By

  • (2024) Multi-Task Learning in Natural Language Processing: An Overview. ACM Computing Surveys 56(12), 1-32. https://doi.org/10.1145/3663363. Online publication date: 11 May 2024.
  • (2024) A Behavior-aware Cause Identification Framework for Order Cancellation in Logistics Service. Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, 5119-5126. https://doi.org/10.1145/3627673.3680051. Online publication date: 21 Oct 2024.
  • (2023) Evaluating and improving tool-augmented computation-intensive math reasoning. Proceedings of the 37th International Conference on Neural Information Processing Systems, 23570-23589. https://doi.org/10.5555/3666122.3667145. Online publication date: 10 Dec 2023.



    Published In

    KDD '23: Proceedings of the 29th ACM SIGKDD Conference on Knowledge Discovery and Data Mining
    August 2023
    5996 pages
    ISBN:9798400701030
    DOI:10.1145/3580305

    Publisher

    Association for Computing Machinery

    New York, NY, United States



    Author Tags

    1. Chinese pre-trained language model
    2. mathematical problem solving

    Qualifiers

    • Research-article

    Funding Sources

    • National Natural Science Foundation of China
    • Beijing Natural Science Foundation
    • Beijing Outstanding Young Scientist Program

    Conference

    KDD '23

    Acceptance Rates

    Overall Acceptance Rate 1,133 of 8,635 submissions, 13%

    Article Metrics

    • Downloads (Last 12 months)294
    • Downloads (Last 6 weeks)30
    Reflects downloads up to 25 Nov 2024

