QAScore—An Unsupervised Unreferenced Metric for the Question Generation Evaluation
Figure captions:
- The process of scoring a question by RoBERTa, where the context (yellow) contains the passage and the question to be evaluated, 〈eos〉 is the separator token, the score of a single word is the likelihood that RoBERTa can predict the real word (cyan) that is replaced by the mask token 〈mask〉 (green) in the original answer, and the final metric score is the sum of the scores of all words in the answer.
- Full instructions shown to a crowd-sourced human assessor prior to starting HITs.
- The interface shown to human workers, including a passage with highlighted contents and a system-generated question; the worker is then asked to rate the question.
- An example of a Likert statement for an evaluation criterion shown to a human worker.
- The structure of a single HIT in the QG evaluation experiment, where ORD, REPEAT and BADREF respectively represent ordinary, repeat and bad reference questions.
Abstract
1. Introduction
1.1. Contributions
- We propose a pretrained-language-model-based evaluation metric called QAScore, which is unsupervised and reference-free. QAScore utilizes the RoBERTa model [20] and evaluates a system-generated question via the cross-entropy of RoBERTa's predictions for the masked words of the answer to that question (a minimal code sketch of this scoring procedure is given after this list).
- We propose a novel and highly reliable crowd-sourced human evaluation method that can be used as a standard framework for evaluating QG systems. Compared to other human evaluation methods, it is cost-effective and easy to deploy. We further conduct a self-replication experiment showing a Pearson correlation of r = 0.955 between two distinct evaluations of the same set of systems. According to the results of the human evaluation experiment, QAScore outperforms all other metrics without any supervision or fine-tuning, achieving a strong correlation with human assessment.
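A minimal sketch of this scoring idea is shown below, assuming the HuggingFace transformers and PyTorch libraries: the passage and question form the context, separated by the 〈eos〉 token, and each answer token is masked in turn so the log-likelihood RoBERTa assigns to the real token can be accumulated. The checkpoint name, the per-subword (rather than per-word) masking, the omission of input truncation, and the use of a plain sum rather than any normalisation are simplifying assumptions, not the authors' exact implementation.

```python
import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained("roberta-large")
model = RobertaForMaskedLM.from_pretrained("roberta-large")
model.eval()

def qascore(passage: str, question: str, answer: str) -> float:
    """Sum of log-probabilities that RoBERTa recovers each masked answer token."""
    # Context: passage <eos> question, as in the scoring figure.
    ctx_ids = (tokenizer(passage, add_special_tokens=False)["input_ids"]
               + [tokenizer.eos_token_id]
               + tokenizer(" " + question, add_special_tokens=False)["input_ids"])
    ans_ids = tokenizer(" " + answer, add_special_tokens=False)["input_ids"]

    score = 0.0
    for i, target_id in enumerate(ans_ids):
        # Mask the i-th answer token; all other tokens stay intact.
        masked = ans_ids[:i] + [tokenizer.mask_token_id] + ans_ids[i + 1:]
        input_ids = torch.tensor([[tokenizer.bos_token_id] + ctx_ids
                                  + [tokenizer.eos_token_id] + masked
                                  + [tokenizer.eos_token_id]])
        with torch.no_grad():
            logits = model(input_ids).logits            # shape [1, seq_len, vocab]
        mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()
        log_probs = torch.log_softmax(logits[0, mask_pos], dim=-1)
        score += log_probs[target_id].item()            # credit for the real word
    return score
```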
1.2. Paper Structure
2. Background: Question Answering, Question Generation and Evaluation
2.1. Question Answering
2.2. Question Generation
2.3. Automatic Evaluation Metrics
2.3.1. Word-Overlap-Based Metrics
2.3.2. Pretrained-Model-Based Metrics
2.4. Human Evaluation
- A standard human evaluation for QG is still lacking, since the aforementioned examples each use disparate rating options and settings with only a few overlaps. Existing methods for the QG task generally change from one set of experiments to the next, highlighting the absence of a standard approach and making comparisons challenging;
- The vast majority of QG human evaluation methods are either expert-based or volunteer-based; the former are normally expensive, while the latter are likely to incur issues such as shortages of rater availability. Furthermore, the inconvenience of deploying human evaluation at scale can lead to small sample sizes, which may hinder the reliability of evaluation results;
- Much of the time, details of human evaluation experiments are vague, with sample sizes and numbers of raters occasionally omitted from publications. Although expert-based human evaluation can be assumed to achieve a high level of inter-rater agreement, such information is seldom reported, resulting in difficulties interpreting the reliability and validity of experiments, in particular when crowd-sourced human evaluation is employed.
2.5. Comparison with Existing Evaluation Methods
2.5.1. QAScore and Existing Automatic Metrics
2.5.2. Our Human Evaluation Method with Existing Methods
3. QAScore—An Automatic Metric for Evaluating QG Systems Using Cross-Entropy
3.1. Proposed Metric
3.2. Methodology
3.2.1. Pre-Trained Language Model–RoBERTa
3.2.2. Process of Scoring
3.3. Dataset and QG Systems
HotpotQA Dataset
3.4. QG Systems for Evaluation
- T5 (small & base): a model using a text-to-text transfer transformer that is pretrained on a large text corpus [48];
- BART (base & large): a denoising auto-encoder using the standard sequence-to-sequence transformer architecture [49];
- Att-GGNN: an attention-based gated graph neural network model [3];
- Att-GGNN (plus): a variant of Att-GGNN model which is combined with the context switch mechanism [19];
- H-Seq2seq: a hierarchical encoding-decoding model proposed for the QG task [19];
- H-Seq2seq: a variant of H-Seq2seq which utilizes a larger dictionary to avoid generating the unknown token;
- GPT-2: a large transformer-based language model with 1.5 billion parameters [50];
- RNN: a sequence-to-sequence model using the vanilla recurrent neural network (RNN) architecture [51].
3.5. Results
4. New Human Evaluation Methodology
4.1. Experiment Design
4.1.1. Methodology
4.1.2. Experiment User Interface
4.1.3. Evaluation Criteria
4.2. Quality Control
The number of words degraded in a bad reference question depends on the length of the original question: the replaced sequence comprises 1 word for the shortest questions, increasing through 2, 3, 4 and 5 words as questions grow longer, and proportionally more words for the longest questions; a hedged sketch of this degradation is given below.
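The exact length thresholds are not recoverable from this extraction, so the following is only an illustrative sketch under stated assumptions: a bad reference is formed by replacing a sub-sequence of the question with a same-length word span taken from another randomly selected question, and the placeholder mapping `n_replaced` stands in for the paper's actual length rule.

```python
import random

def n_replaced(question_length: int) -> int:
    # Placeholder mapping only: longer questions get longer replaced sequences.
    return max(1, min(5, question_length // 5))

def make_bad_reference(question: str, donor_question: str) -> str:
    """Degrade `question` by splicing in a same-length word span from `donor_question`."""
    words, donor_words = question.split(), donor_question.split()
    k = min(n_replaced(len(words)), len(donor_words))
    start = random.randint(0, max(0, len(words) - k))
    d_start = random.randint(0, max(0, len(donor_words) - k))
    words[start:start + k] = donor_words[d_start:d_start + k]
    return " ".join(words)
```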
Structure of HIT
- 1 original question, 1 repeat and 1 bad reference from the Human system (comprising a total of 3 questions);
- 2 original questions and their repeats from 2 of the 10 neural QG systems (comprising a total of 4 questions);
- 5 original questions and their bad references from the other 5 of the 10 normal systems (comprising a total of 10 questions);
- 3 original questions from the remaining 3 of the 10 normal systems (comprising a total of 3 questions); a sketch of how a full HIT is assembled from these components follows this list.
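The composition above can be made concrete with the following sketch of HIT assembly; `repeat_of` and `bad_reference_of` are hypothetical helper functions standing in for exact duplication and degradation of a question, and shuffling the presentation order within a HIT is an assumption.

```python
import random

def build_hit(human_questions, system_questions, repeat_of, bad_reference_of):
    """system_questions maps each of the 10 neural QG systems to a sampled question."""
    systems = list(system_questions)
    assert len(systems) == 10
    random.shuffle(systems)
    repeat_sys, badref_sys, plain_sys = systems[:2], systems[2:7], systems[7:]

    q_human = random.choice(human_questions)
    hit = [q_human, repeat_of(q_human), bad_reference_of(q_human)]           # 3 questions
    for s in repeat_sys:                                                     # 2 systems -> 4 questions
        hit += [system_questions[s], repeat_of(system_questions[s])]
    for s in badref_sys:                                                     # 5 systems -> 10 questions
        hit += [system_questions[s], bad_reference_of(system_questions[s])]
    hit += [system_questions[s] for s in plain_sys]                          # 3 systems -> 3 questions

    random.shuffle(hit)  # presentation order within the HIT (assumed)
    assert len(hit) == 20
    return hit
```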
4.3. Experiment Results
4.3.1. Workers and HITs
4.3.2. Cost of the Experiment
4.3.3. Human Scores
4.3.4. System Consistency
5. Conclusions and Future Work
Author Contributions
Funding
Institutional Review Board Statement
Data Availability Statement
Acknowledgments
Conflicts of Interest
References
- Du, X.; Shao, J.; Cardie, C. Learning to Ask: Neural Question Generation for Reading Comprehension. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 1342–1352. [Google Scholar] [CrossRef]
- Xie, Y.; Pan, L.; Wang, D.; Kan, M.Y.; Feng, Y. Exploring Question-Specific Rewards for Generating Deep Questions. In Proceedings of the 28th International Conference on Computational Linguistics; International Committee on Computational Linguistics: Barcelona, Spain, 2020; pp. 2534–2546. [Google Scholar] [CrossRef]
- Pan, L.; Xie, Y.; Feng, Y.; Chua, T.S.; Kan, M.Y. Semantic Graphs for Generating Deep Questions. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Barcelona, Spain, 2020; pp. 1463–1475. [Google Scholar] [CrossRef]
- Puri, R.; Spring, R.; Shoeybi, M.; Patwary, M.; Catanzaro, B. Training Question Answering Models From Synthetic Data. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Barcelona, Spain, 2020; pp. 5811–5826. [Google Scholar] [CrossRef]
- Lyu, C.; Shang, L.; Graham, Y.; Foster, J.; Jiang, X.; Liu, Q. Improving Unsupervised Question Answering via Summarization-Informed Question Generation. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Punta Cana, Dominican Republic, 2021; pp. 4134–4148. [Google Scholar]
- Chen, Y.; Wu, L.; Zaki, M.J. Reinforcement Learning Based Graph-to-Sequence Model for Natural Question Generation. In Proceedings of the 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Li, J.; Qu, K.; Yan, J.; Zhou, L.; Cheng, L. TEBC-Net: An Effective Relation Extraction Approach for Simple Question Answering over Knowledge Graphs. In Proceedings of the Knowledge Science, Engineering and Management; Qiu, H., Zhang, C., Fei, Z., Qiu, M., Kung, S.Y., Eds.; Springer International Publishing: Cham, Switzerland, 2021; pp. 154–165. [Google Scholar]
- Kim, Y.; Lee, H.; Shin, J.; Jung, K. Improving Neural Question Generation Using Answer Separation. Proc. AAAI Conf. Artific. Intell. 2019, 33, 6602–6609. [Google Scholar] [CrossRef] [Green Version]
- Wang, L.; Xu, Z.; Lin, Z.; Zheng, H.; Shen, Y. Answer-driven Deep Question Generation based on Reinforcement Learning. In Proceedings of the 28th International Conference on Computational Linguistics; International Committee on Computational Linguistics: Barcelona, Spain, 2020; pp. 5159–5170. [Google Scholar] [CrossRef]
- Cho, W.S.; Zhang, Y.; Rao, S.; Celikyilmaz, A.; Xiong, C.; Gao, J.; Wang, M.; Dolan, B. Contrastive Multi-document Question Generation. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume; Association for Computational Linguistics: Barcelona, Spain, 2021; pp. 12–30. [Google Scholar]
- Papineni, K.; Roukos, S.; Ward, T.; Zhu, W.J. Bleu: A Method for Automatic Evaluation of Machine Translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Philadelphia, PA, USA, 2002; pp. 311–318. [Google Scholar] [CrossRef] [Green Version]
- Nema, P.; Khapra, M.M. Towards a Better Metric for Evaluating Question Generation Systems. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Brussels, Belgium, 2018; pp. 3950–3959. [Google Scholar] [CrossRef]
- Sellam, T.; Das, D.; Parikh, A. BLEURT: Learning Robust Metrics for Text Generation. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Brussels, Belgium, 2020; pp. 7881–7892. [Google Scholar] [CrossRef]
- Zhang, T.; Kishore, V.; Wu, F.; Weinberger, K.Q.; Artzi, Y. BERTScore: Evaluating Text Generation with BERT. In Proceedings of the International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, 26–30 April 2020. [Google Scholar]
- Reiter, E. A Structured Review of the Validity of BLEU. Comput. Linguist. 2018, 44, 393–401. [Google Scholar] [CrossRef]
- Graham, Y. Re-evaluating Automatic Summarization with BLEU and 192 Shades of ROUGE. In Proceedings of the 2015 Conference on Empirical Methods in Natural Language Processing; Association for Computational Linguistics: Lisbon, Portugal, 2015; pp. 128–137. [Google Scholar] [CrossRef] [Green Version]
- Graham, Y.; Liu, Q. Achieving accurate conclusions in evaluation of automatic machine translation metrics. In Proceedings of the 2016 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, San Diego, CA, USA, 12–17 June 2016; pp. 1–10. [Google Scholar]
- Ji, T.; Graham, Y.; Jones, G.J.; Lyu, C.; Liu, Q. Achieving Reliable Human Assessment of Open-Domain Dialogue Systems. arXiv 2022, arXiv:2203.05899. [Google Scholar]
- Ji, T.; Lyu, C.; Cao, Z.; Cheng, P. Multi-Hop Question Generation Using Hierarchical Encoding-Decoding and Context Switch Mechanism. Entropy 2021, 23, 1449. [Google Scholar] [CrossRef] [PubMed]
- Liu, Y.; Ott, M.; Goyal, N.; Du, J.; Joshi, M.; Chen, D.; Levy, O.; Lewis, M.; Zettlemoyer, L.; Stoyanov, V. RoBERTa: A Robustly Optimized BERT Pretraining Approach. CoRR 2019, abs/1907.11692. Available online: http://xxx.lanl.gov/abs/1907.11692 (accessed on 20 October 2022).
- Chen, D.; Fisch, A.; Weston, J.; Bordes, A. Reading Wikipedia to Answer Open-Domain Questions. In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Vancouver, BC, Canada, 30 July–4 August 2017; pp. 1870–1879. [Google Scholar]
- Zhu, F.; Lei, W.; Wang, C.; Zheng, J.; Poria, S.; Chua, T.S. Retrieving and reading: A comprehensive survey on open-domain question answering. arXiv 2021, arXiv:2101.00774. [Google Scholar]
- Rajpurkar, P.; Zhang, J.; Lopyrev, K.; Liang, P. SQuAD: 100,000+ Questions for Machine Comprehension of Text. In Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, Austin, TX, USA, 1–4 November 2016; pp. 2383–2392. [Google Scholar] [CrossRef]
- Saha, A.; Aralikatte, R.; Khapra, M.M.; Sankaranarayanan, K. DuoRC: Towards Complex Language Understanding with Paraphrased Reading Comprehension. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Melbourne, Australia, 2018; pp. 1683–1693. [Google Scholar] [CrossRef] [Green Version]
- Kočiský, T.; Schwarz, J.; Blunsom, P.; Dyer, C.; Hermann, K.M.; Melis, G.; Grefenstette, E. The NarrativeQA Reading Comprehension Challenge. Trans. Assoc. Comput. Linguist. 2018, 6, 317–328. [Google Scholar] [CrossRef] [Green Version]
- Xu, Y.; Wang, D.; Yu, M.; Ritchie, D.; Yao, B.; Wu, T.; Zhang, Z.; Li, T.; Bradford, N.; Sun, B.; et al. Fantastic Questions and Where to Find Them: FairytaleQA—An Authentic Dataset for Narrative Comprehension. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers); Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 447–460. [Google Scholar] [CrossRef]
- Trischler, A.; Wang, T.; Yuan, X.; Harris, J.; Sordoni, A.; Bachman, P.; Suleman, K. NewsQA: A Machine Comprehension Dataset. In Proceedings of the 2nd Workshop on Representation Learning for NLP; Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 191–200. [Google Scholar] [CrossRef] [Green Version]
- Lyu, C.; Foster, J.; Graham, Y. Extending the Scope of Out-of-Domain: Examining QA models in multiple subdomains. In Proceedings of the Third Workshop on Insights from Negative Results in NLP; Association for Computational Linguistics: Dublin, Ireland, 2022; pp. 24–37. [Google Scholar] [CrossRef]
- Lewis, P.; Wu, Y.; Liu, L.; Minervini, P.; Küttler, H.; Piktus, A.; Stenetorp, P.; Riedel, S. PAQ: 65 Million Probably-Asked Questions and What You Can Do with Them. arXiv 2021, arXiv:cs.CL/2102.07033. [Google Scholar] [CrossRef]
- Zhang, Z.; Zhao, H.; Wang, R. Machine Reading Comprehension: The Role of Contextualized Language Models and Beyond. arXiv 2020, arXiv:cs.CL/2005.06249. [Google Scholar]
- Pan, L.; Lei, W.; Chua, T.; Kan, M. Recent Advances in Neural Question Generation. CoRR 2019, abs/1905.08949. Available online: http://xxx.lanl.gov/abs/1905.08949 (accessed on 20 October 2022).
- Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to Sequence Learning with Neural Networks. In Proceedings of the Advances in Neural Information Processing Systems; Ghahramani, Z., Welling, M., Cortes, C., Lawrence, N., Weinberger, K.Q., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2014; Volume 27. [Google Scholar]
- Wu, Y.; Schuster, M.; Chen, Z.; Le, Q.V.; Norouzi, M.; Macherey, W.; Krikun, M.; Cao, Y.; Gao, Q.; Macherey, K.; et al. Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation. CoRR 2016, abs/1609.08144. Available online: http://xxx.lanl.gov/abs/1609.08144 (accessed on 20 October 2022).
- Lin, C.Y. ROUGE: A Package for Automatic Evaluation of Summaries. In Proceedings of the Text Summarization Branches Out; Association for Computational Linguistics: Barcelona, Spain, 2004; pp. 74–81. [Google Scholar]
- Banerjee, S.; Lavie, A. METEOR: An Automatic Metric for MT Evaluation with Improved Correlation with Human Judgments. In Proceedings of the ACL Workshop on Intrinsic and Extrinsic Evaluation Measures for Machine Translation and/or Summarization; Association for Computational Linguistics: Ann Arbor, MI, USA, 2005; pp. 65–72. [Google Scholar]
- Lewis, P.; Perez, E.; Piktus, A.; Petroni, F.; Karpukhin, V.; Goyal, N.; Küttler, H.; Lewis, M.; Yih, W.t.; Rocktäschel, T.; et al. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. In Proceedings of the Advances in Neural Information Processing Systems; Larochelle, H., Ranzato, M., Hadsell, R., Balcan, M., Lin, H., Eds.; Curran Associates, Inc.: Red Hook, NY, USA, 2020; Volume 33, pp. 9459–9474. [Google Scholar]
- Devlin, J.; Chang, M.W.; Lee, K.; Toutanova, K. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers); Association for Computational Linguistics: Minneapolis, MN, USA, 2019; pp. 4171–4186. [Google Scholar] [CrossRef]
- Yuan, X.; Wang, T.; Gulcehre, C.; Sordoni, A.; Bachman, P.; Zhang, S.; Subramanian, S.; Trischler, A. Machine Comprehension by Text-to-Text Neural Question Generation. In Proceedings of the 2nd Workshop on Representation Learning for NLP; Association for Computational Linguistics: Vancouver, BC, Canada, 2017; pp. 15–25. [Google Scholar] [CrossRef] [Green Version]
- Jia, X.; Zhou, W.; Sun, X.; Wu, Y. EQG-RACE: Examination-Type Question Generation. In Proceedings of the AAAI, Palo Alto, CA, USA, 2–9 February 2021. [Google Scholar]
- Ren, S.; Zhu, K.Q. Knowledge-Driven Distractor Generation for Cloze-style Multiple Choice Questions. CoRR 2020, abs/2004.09853. Available online: http://xxx.lanl.gov/abs/2004.09853 (accessed on 20 October 2022).
- Liu, B.; Wei, H.; Niu, D.; Chen, H.; He, Y. Asking Questions the Human Way: Scalable Question-Answer Generation from Text Corpus. In Proceedings of the Web Conference 2020, New York, NY, USA, 20–24 April 2020; pp. 2032–2043. [Google Scholar] [CrossRef]
- Ma, X.; Zhu, Q.; Zhou, Y.; Li, X. Improving Question Generation with Sentence-Level Semantic Matching and Answer Position Inferring. Proc. AAAI Conf. Artific. Intell. 2020, 34, 8464–8471. [Google Scholar] [CrossRef]
- Narayan, S.; Simões, G.; Ma, J.; Craighead, H.; McDonald, R.T. QURIOUS: Question Generation Pretraining for Text Generation. CoRR 2020, abs/2004.11026. Available online: http://xxx.lanl.gov/abs/2004.11026 (accessed on 20 October 2022).
- Zhou, S.; Zhang, Y. DATLMedQA: A Data Augmentation and Transfer Learning Based Solution for Medical Question Answering. Appl. Sci. 2021, 11, 11251. [Google Scholar] [CrossRef]
- Shin, T.; Razeghi, Y.; Logan IV, R.L.; Wallace, E.; Singh, S. AutoPrompt: Eliciting Knowledge from Language Models with Automatically Generated Prompts. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 4222–4235. [Google Scholar] [CrossRef]
- Lyu, C. Knowledge and Pre-Trained Language Models Inside and Out: A Deep-Dive into Datasets and External Knowledge 2022. Available online: https://scholar.google.co.jp/scholar?hl=zh-TW&as_sdt=0%2C5&q=Knowledge+and+Pre-trained+Language+Models+Inside+and+Out%3A+A+deep-dive+++into+datasets+and+external+knowledge&btnG= (accessed on 20 October 2022).
- Yang, Z.; Qi, P.; Zhang, S.; Bengio, Y.; Cohen, W.W.; Salakhutdinov, R.; Manning, C.D. HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering. In Proceedings of the Conference on Empirical Methods in Natural Language Processing (EMNLP), Brussels, Belgium, 31 October–4 November 2018. [Google Scholar]
- Raffel, C.; Shazeer, N.; Roberts, A.; Lee, K.; Narang, S.; Matena, M.; Zhou, Y.; Li, W.; Liu, P.J. Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer. J. Mach. Learn. Res. 2020, 21, 1–67. [Google Scholar]
- Lewis, M.; Liu, Y.; Goyal, N.; Ghazvininejad, M.; Mohamed, A.; Levy, O.; Stoyanov, V.; Zettlemoyer, L. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics; Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 7871–7880. [Google Scholar] [CrossRef]
- Radford, A.; Wu, J.; Child, R.; Luan, D.; Amodei, D.; Sutskever, I. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9. [Google Scholar]
- Cho, K.; van Merriënboer, B.; Gulcehre, C.; Bahdanau, D.; Bougares, F.; Schwenk, H.; Bengio, Y. Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), Doha, Qatar, 25–29 October 2014; pp. 1724–1734. [Google Scholar]
- Ji, T.; Graham, Y.; Jones, G.J. Contrasting Human Opinion of Non-Factoid Question Answering with Automatic Evaluation. In Proceedings of the 2020 Conference on Human Information Interaction and Retrieval; Association for Computing Machinery: New York, NY, USA, 2020; pp. 348–352. [Google Scholar]
- Graham, Y.; Baldwin, T.; Moffat, A.; Zobel, J. Can machine translation systems be evaluated by the crowd alone. Nat. Lang. Eng. 2017, 23, 3–30. [Google Scholar] [CrossRef] [Green Version]
- Graham, Y.; Haddow, B.; Koehn, P. Statistical Power and Translationese in Machine Translation Evaluation. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP); Association for Computational Linguistics: Stroudsburg, PA, USA, 2020; pp. 72–81. [Google Scholar] [CrossRef]
System | Human (z) | QAScore | METEOR | ROUGE-L | BERTScore | BLEURT | Q-BLEU4 | Q-BLEU1 |
---|---|---|---|---|---|---|---|---|
Human | 0.322 | −0.985 | – | – | – | – | – | – |
BART | 0.308 | −1.020 | 30.18 | 47.58 | 90.85 | −0.363 | 43.77 | 51.47 |
BART | 0.290 | −1.030 | 29.66 | 47.13 | 90.74 | −0.381 | 44.14 | 51.65 |
T5 | 0.226 | −1.037 | 27.99 | 41.60 | 88.44 | −0.682 | 37.78 | 44.84 |
RNN | 0.147 | −1.064 | 15.46 | 26.77 | 84.59 | −1.019 | 9.68 | 15.92 |
H-Seq2seq | 0.120 | −1.076 | 17.50 | 29.86 | 85.49 | −0.953 | 10.51 | 17.74 |
T5 | 0.117 | −1.049 | 23.62 | 32.37 | 86.34 | −0.860 | 26.73 | 32.92 |
Att-GGNN | 0.076 | −1.065 | 21.77 | 36.31 | 86.27 | −0.784 | 12.63 | 19.86 |
H-Seq2seq | 0.053 | −1.045 | 18.23 | 31.69 | 85.83 | −0.866 | 11.12 | 18.36 |
Att-GGNN | −0.008 | −1.068 | 20.02 | 33.60 | 86.00 | −0.802 | 11.13 | 18.67 |
GPT-2 | −0.052 | −1.108 | 16.40 | 29.98 | 86.44 | −0.899 | 24.83 | 31.85 |
 | QAScore | METEOR | ROUGE-L | BERTScore | BLEURT | Q-BLEU4 | Q-BLEU1 |
---|---|---|---|---|---|---|---|
Pearson r | 0.864 | 0.801 | 0.770 | 0.761 | 0.739 | 0.725 | 0.724 |
Spearman ρ | 0.827 | 0.612 | 0.503 | 0.430 | 0.503 | 0.467 | 0.467 |
Kendall τ | 0.709 | 0.511 | 0.378 | 0.289 | 0.378 | 0.289 | 0.289 |
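Correlations such as those reported above can be reproduced from system-level scores with scipy. In the sketch below, the example values are simply the first three rows (Human and the two BART systems) of the earlier results table; the variable names and tiny sample are illustrative only, and a full computation would use every system.

```python
from scipy.stats import kendalltau, pearsonr, spearmanr

# First three systems from the results table above (Human, BART, BART).
human_z  = [0.322, 0.308, 0.290]
qa_score = [-0.985, -1.020, -1.030]

r, _   = pearsonr(human_z, qa_score)
rho, _ = spearmanr(human_z, qa_score)
tau, _ = kendalltau(human_z, qa_score)
print(f"Pearson r = {r:.3f}, Spearman rho = {rho:.3f}, Kendall tau = {tau:.3f}")
```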
Label | Likert Statement |
---|---|
Understandability | The question is easy to understand. |
Relevancy | The question is highly relevant to the content of the passage. |
Answerability | The question can be fully answered by the passage. |
Appropriateness | The question word (where, when, how, etc.) is fully appropriate. |
(a) Workers and HITs passing quality control
Experiment | Workers Passed | Workers Total | Worker Pass Rate | HITs Passed | HITs Total | HIT Pass Rate |
---|---|---|---|---|---|---|
Run1 | 123 | 356 | 34.55% | 334 | 786 | 42.49% |
Run2 | 105 | 283 | 37.10% | 282 | 598 | 47.16% |
(b) Elapsed time per HIT (in minutes) and HITs assigned per worker
Experiment | Time per HIT (Passed) | Time per HIT (Failed) | Time per HIT (Total) | HITs per Worker (Passed) | HITs per Worker (Failed) | HITs per Worker (Total) |
---|---|---|---|---|---|---|
Run1 | 33.24 | 26.93 | 29.61 | 2.72 | 1.94 | 2.21 |
Run2 | 38.68 | 25.79 | 31.87 | 2.69 | 1.78 | 2.11 |
System | N | Overall | Understandability | Relevancy | Answerability | Appropriateness |
---|---|---|---|---|---|---|
Human | 668 | 0.322 | 0.164 | 0.262 | 0.435 | 0.429 |
BART | 400 | 0.308 | 0.155 | 0.255 | 0.420 | 0.403 |
BART | 401 | 0.290 | 0.135 | 0.234 | 0.430 | 0.360 |
T5 | 395 | 0.226 | 0.051 | 0.241 | 0.395 | 0.217 |
RNN | 395 | 0.147 | −0.050 | 0.128 | 0.222 | 0.289 |
Seq2Seq | 404 | 0.120 | −0.030 | 0.022 | 0.180 | 0.309 |
T5 | 405 | 0.117 | −0.108 | 0.106 | 0.260 | 0.210 |
Baseline | 408 | 0.076 | −0.133 | 0.076 | 0.196 | 0.165 |
Seq2Seq | 396 | 0.053 | −0.055 | −0.039 | 0.088 | 0.217 |
Baseline | 396 | −0.008 | −0.186 | −0.032 | 0.155 | 0.032 |
GPT-2 | 408 | −0.052 | −0.202 | −0.126 | 0.050 | 0.068 |
System | N | Overall | Understandability | Relevancy | Answerability | Appropriateness |
---|---|---|---|---|---|---|
Human | 564 | 0.316 | 0.188 | 0.279 | 0.386 | 0.410 |
BART | 342 | 0.299 | 0.180 | 0.277 | 0.380 | 0.359 |
BART | 338 | 0.306 | 0.181 | 0.299 | 0.397 | 0.347 |
T5 | 329 | 0.294 | 0.158 | 0.298 | 0.396 | 0.326 |
RNN | 342 | 0.060 | −0.040 | −0.008 | 0.072 | 0.217 |
Seq2Seq | 332 | 0.086 | −0.053 | 0.064 | 0.115 | 0.217 |
T5 | 340 | 0.157 | −0.012 | 0.166 | 0.248 | 0.224 |
Baseline | 341 | 0.069 | −0.094 | 0.081 | 0.134 | 0.157 |
Seq2Seq | 348 | 0.083 | −0.014 | 0.077 | 0.104 | 0.163 |
Baseline | 329 | −0.025 | −0.200 | −0.023 | 0.042 | 0.083 |
GPT-2 | 343 | −0.047 | −0.122 | 0.000 | −0.036 | −0.031 |
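The Overall and per-criterion values in the two tables above are standardized (z) scores rather than raw ratings. The sketch below shows the per-worker standardization commonly used in the direct-assessment style of crowd-sourced evaluation that the paper builds on; the field names are illustrative and the exact averaging scheme is an assumption, not necessarily the authors' procedure.

```python
import statistics
from collections import defaultdict

def average_z_scores(ratings):
    """ratings: list of dicts like {"worker": str, "system": str, "score": float}."""
    ratings = list(ratings)
    by_worker = defaultdict(list)
    for r in ratings:
        by_worker[r["worker"]].append(r["score"])
    # Mean and standard deviation of each worker's raw ratings.
    stats = {w: (statistics.mean(s), statistics.pstdev(s)) for w, s in by_worker.items()}

    by_system = defaultdict(list)
    for r in ratings:
        mean, std = stats[r["worker"]]
        z = (r["score"] - mean) / std if std > 0 else 0.0
        by_system[r["system"]].append(z)
    # System score = mean standardized rating over all of its rated questions.
    return {system: statistics.mean(zs) for system, zs in by_system.items()}
```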
 | Overall | Understandability | Relevancy | Answerability | Appropriateness |
---|---|---|---|---|---|
Pearson r | 0.955 | 0.953 | 0.865 | 0.957 | 0.884 |
Spearman ρ | 0.882 | 0.891 | 0.718 | 0.882 | 0.845 |
Kendall τ | 0.745 | 0.709 | 0.527 | 0.745 | 0.709 |
Publisher’s Note: MDPI stays neutral with regard to jurisdictional claims in published maps and institutional affiliations.
© 2022 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).