MTAS: A Reference-Free Approach for Evaluating Abstractive Summarization Systems

Published: 12 July 2024

Abstract

Abstractive summarization (AS) systems, which aim to generate a concise text that captures the crucial information of the original document, have been widely adopted in recent years. Unfortunately, factually unreliable summaries may still occur, leading to misunderstanding and distortion of information. This calls for methods that can properly evaluate the quality of AS systems. However, the existing reference-based evaluation approach for AS relies on reference summaries and automatic evaluation metrics (e.g., ROUGE), so it is highly restricted by the availability and quality of the reference summaries as well as the capability of the metrics themselves. In this study, we propose MTAS, a novel metamorphic-testing-based approach for evaluating AS in a reference-free way. Our two major contributions are (i) five metamorphic relations for AS, which involve semantic-preserving and focus-preserving transformations at the document level, and (ii) a summary consistency evaluation metric, SCY, which measures the alignment between a pair of summaries by incorporating both semantic and factual consistency. Our experimental results show that the proposed metric SCY has a significantly higher correlation with human judgment than a set of existing metrics. We also demonstrate that MTAS breaks the dependence on reference summaries and successfully reports a large number of summary inconsistencies, revealing various summarization issues in state-of-the-art AS systems.
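To make the metamorphic-testing idea concrete, the following is a minimal sketch of a reference-free consistency check in Python. Everything in it is an illustrative assumption rather than the paper's actual method: the transformation (prepending a content-free lead-in sentence) stands in for MTAS's five document-level metamorphic relations, and plain embedding cosine similarity stands in for the SCY metric, which additionally incorporates factual consistency. The model names and the 0.8 threshold are likewise placeholders.

    # Illustrative sketch only: a toy metamorphic relation and a toy
    # consistency score standing in for MTAS's relations and SCY.
    from transformers import pipeline
    from sentence_transformers import SentenceTransformer, util

    summarizer = pipeline("summarization", model="facebook/bart-large-cnn")
    encoder = SentenceTransformer("all-MiniLM-L6-v2")

    def semantic_preserving_transform(document: str) -> str:
        # Toy semantic-preserving transformation: prepend a content-free
        # sentence, which should not change what the summary says.
        return "The following report was published recently. " + document

    def summarize(document: str) -> str:
        return summarizer(document, max_length=60, min_length=10,
                          do_sample=False)[0]["summary_text"]

    def consistency(summary_a: str, summary_b: str) -> float:
        # Cosine similarity of sentence embeddings; a crude stand-in for
        # SCY, which also checks factual consistency between summaries.
        emb = encoder.encode([summary_a, summary_b], convert_to_tensor=True)
        return util.cos_sim(emb[0], emb[1]).item()

    def metamorphic_test(document: str, threshold: float = 0.8) -> bool:
        # The metamorphic relation: a semantic-preserving change to the
        # source document should yield a consistent summary. A score below
        # the (assumed) threshold flags a potential summarization issue.
        original = summarize(document)
        followup = summarize(semantic_preserving_transform(document))
        return consistency(original, followup) >= threshold

    if __name__ == "__main__":
        doc = ("The city council approved a new budget on Tuesday, allocating "
               "additional funds to public transit and road maintenance. "
               "Officials said the changes will take effect next quarter.")
        print("Summaries consistent under the transformation:",
              metamorphic_test(doc))

Because the check compares the system's outputs on the original and transformed documents against each other, a low consistency score flags a potential issue without requiring any human-written reference summary.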



Published In

Proceedings of the ACM on Software Engineering, Volume 1, Issue FSE
July 2024
2770 pages
EISSN: 2994-970X
DOI: 10.1145/3554322
This work is licensed under a Creative Commons Attribution International 4.0 License.

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 12 July 2024
Published in PACMSE Volume 1, Issue FSE

Author Tags

  1. Abstractive Summarization
  2. Factual Consistency
  3. Metamorphic Relation
  4. Metamorphic Testing
  5. Quality Evaluation

Qualifiers

  • Research-article

Funding Sources

  • National Natural Science Foundation of China

