DOI: 10.1145/3366423.3380270

Research Article
Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus

Published: 20 April 2020

Abstract

The ability to ask questions is important in both human and machine intelligence. Learning to ask questions helps knowledge acquisition, improves question answering and machine reading comprehension, and helps a chatbot keep a conversation with a human flowing. Existing question generation models are ineffective at generating large numbers of high-quality question-answer pairs from unstructured text, since, given an answer and an input passage, question generation is inherently a one-to-many mapping. In this paper, we propose Answer-Clue-Style-aware Question Generation (ACS-QG), which automatically generates high-quality, diverse question-answer pairs from unlabeled text corpora at scale by imitating the way a human asks questions. Our system consists of: i) an information extractor, which samples multiple types of assistive information from the text to guide question generation; ii) neural question generators, which leverage the extracted assistive information to generate diverse and controllable questions; and iii) a neural quality controller, which removes low-quality generated data based on text entailment. We compare our question generation models with existing approaches and conduct voluntary human evaluation to assess the quality of the generated question-answer pairs. The results suggest that our system dramatically outperforms state-of-the-art neural question generation models in generation quality while remaining scalable. With models trained on a relatively small amount of data, we can generate 2.8 million quality-assured question-answer pairs from a million sentences found in Wikipedia.
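The abstract describes a three-stage pipeline: an information extractor that samples (answer, clue, style) tuples from text, neural question generators conditioned on those tuples, and an entailment-based quality controller. The control flow can be sketched as a toy pipeline; note that the paper's stages are neural models, so the simple heuristics and all function names below (`extract_acs`, `generate_question`, `quality_filter`) are illustrative stand-ins, not the authors' implementation:

```python
import re

def extract_acs(sentence):
    """Stage 1 (information extractor): sample (answer, clue, style)
    tuples. Toy heuristic: each capitalized token is a candidate
    answer, the remaining text is the clue, the style is fixed."""
    tuples = []
    for match in re.finditer(r"\b[A-Z][a-z]+\b", sentence):
        answer = match.group()
        clue = sentence.replace(answer, "").strip()
        tuples.append((answer, clue, "what"))
    return tuples

def generate_question(sentence, answer, clue, style):
    """Stage 2 (question generator): the paper conditions a neural
    generator on the ACS tuple; this stub just blanks the answer
    to form a cloze-style question."""
    return f"{style.capitalize()} completes: " + sentence.replace(answer, "____")

def quality_filter(qa_pairs, min_len=5):
    """Stage 3 (quality controller): the paper filters with a text
    entailment model; this stub keeps pairs whose question is long
    enough and whose answer is non-empty."""
    return [(q, a) for q, a in qa_pairs
            if a and len(q.split()) >= min_len]

def acs_qg(sentence):
    """Run all three stages on a single input sentence."""
    pairs = [(generate_question(sentence, a, c, s), a)
             for a, c, s in extract_acs(sentence)]
    return quality_filter(pairs)
```

Running `acs_qg` on one sentence yields one question-answer pair per sampled answer, which is how a single corpus sentence can fan out into multiple QA pairs at scale.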




Published In

WWW '20: Proceedings of The Web Conference 2020
April 2020, 3143 pages
ISBN: 9781450370233
DOI: 10.1145/3366423
Publisher

Association for Computing Machinery, New York, NY, United States


          Author Tags

          1. Machine Reading Comprehension
          2. Question Generation
          3. Sequence-to-Sequence

          Qualifiers

          • Research-article
          • Research
          • Refereed limited

Conference

WWW '20: The Web Conference 2020
April 20–24, 2020, Taipei, Taiwan

          Acceptance Rates

          Overall Acceptance Rate 1,899 of 8,196 submissions, 23%

Article Metrics

• Downloads (last 12 months): 181
• Downloads (last 6 weeks): 12

Reflects downloads up to 25 November 2024.

Cited By
• (2024) FAQ-Gen: An automated system to generate domain-specific FAQs to aid content comprehension. Journal of Computer-Assisted Linguistic Research, 8, 23–49. DOI: 10.4995/jclr.2024.21178. Online publication date: 15-Nov-2024.
• (2024) SExpSMA-based T5: Serial exponential-slime mould algorithm based T5 model for question answer and distractor generation. Intelligent Decision Technologies, 18(2), 1447–1462. DOI: 10.3233/IDT-230629. Online publication date: 7-Jun-2024.
• (2024) Towards Vietnamese Question and Answer Generation: An Empirical Study. ACM Transactions on Asian and Low-Resource Language Information Processing, 23(9), 1–28. DOI: 10.1145/3675781. Online publication date: 29-Jun-2024.
• (2024) A Unified Framework for Contextual and Factoid Question Generation. IEEE Transactions on Knowledge and Data Engineering, 36(1), 21–34. DOI: 10.1109/TKDE.2023.3280182. Online publication date: Jan-2024.
• (2024) An Efficient Seq2Seq Model to Predict Question and Answer Response System. 2024 Second International Conference on Advances in Information Technology (ICAIT), 1–6. DOI: 10.1109/ICAIT61638.2024.10690343. Online publication date: 24-Jul-2024.
• (2024) RTRL: Relation-aware Transformer with Reinforcement Learning for Deep Question Generation. Knowledge-Based Systems, 300, 112120. DOI: 10.1016/j.knosys.2024.112120. Online publication date: Sep-2024.
• (2024) Bringing legal knowledge to the public by constructing a legal question bank using large-scale pre-trained language model. Artificial Intelligence and Law, 32(3), 769–805. DOI: 10.1007/s10506-023-09367-6. Online publication date: 1-Sep-2024.
• (2024) Combining Data Generation and Active Learning for Low-Resource Question Answering. Artificial Neural Networks and Machine Learning – ICANN 2024, 131–147. DOI: 10.1007/978-3-031-72350-6_9. Online publication date: 17-Sep-2024.
• (2023) GenQ: Automated Question Generation to Support Caregivers While Reading Stories with Children. Proceedings of the XI Latin American Conference on Human Computer Interaction, 1–11. DOI: 10.1145/3630970.3630984. Online publication date: 30-Oct-2023.
• (2023) KETM: A Knowledge-Enhanced Text Matching method. 2023 International Joint Conference on Neural Networks (IJCNN), 1–8. DOI: 10.1109/IJCNN54540.2023.10191337. Online publication date: 18-Jun-2023.
