
Social-sum-Mal: A Dataset for Abstractive Text Summarization in Malayalam

Published: 21 November 2024

Abstract

Abstractive text summarization for the Malayalam language is still in its infancy. The lack of benchmark datasets for this task is one of the main constraints on developing and testing good models. Malayalam has seven nominal case forms, two nominal number forms, and three gender forms, and it undergoes extreme agglutination and inflection. As a result, translating existing text summarization datasets into Malayalam may not capture these case forms effectively, so datasets curated from scratch are needed for Malayalam-specific text-processing applications. This article introduces a novel dataset designed specifically to advance automatic abstractive text summarization in Malayalam. The dataset, called Social-sum-Mal, is curated to address the unique linguistic characteristics of the language and supports three types of summarization tasks: long, extreme, and query-based summarization. In addition, Social-sum-Mal can be extended to other applications such as text classification, multi-document summarization, and question answering. To enhance transparency, a datasheet is provided for Social-sum-Mal. Data accuracy and annotator bias are evaluated using appropriate testing strategies, including Jaccard, cosine, and overlap similarities. The correctness of the dataset is further assessed by comparing it against a deep learning based text summarization model.
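The abstract notes that data accuracy and annotator bias were checked with Jaccard, cosine, and overlap similarities. The following sketch is a minimal, hypothetical illustration of how such token-level scores can be computed between two annotator-written summaries of the same article; it is not the authors' evaluation pipeline, and the whitespace tokenization and function names are assumptions made for illustration only.

```python
# Minimal sketch (assumed, not the authors' pipeline): Jaccard, overlap,
# and bag-of-words cosine similarity between two candidate summaries.
import math
from collections import Counter

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between term-frequency (bag-of-words) vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def compare(summary_1: str, summary_2: str) -> dict:
    # Whitespace tokenization is a simplification; agglutinative Malayalam
    # text would normally be segmented with a proper tokenizer first.
    t1, t2 = summary_1.split(), summary_2.split()
    return {
        "jaccard": jaccard(set(t1), set(t2)),
        "overlap": overlap(set(t1), set(t2)),
        "cosine": cosine(Counter(t1), Counter(t2)),
    }

if __name__ == "__main__":
    print(compare("മലയാളം സംഗ്രഹം ഉദാഹരണം", "മലയാളം സംഗ്രഹം മാതൃക"))
```

In practice, scores like these would be aggregated over all annotator pairs in the dataset to flag documents whose reference summaries diverge strongly, which is one simple way to surface annotator bias.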



Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 11
November 2024
248 pages
EISSN: 2375-4702
DOI: 10.1145/3613714

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 November 2024
Online AM: 16 September 2024
Accepted: 02 September 2024
Revised: 07 July 2024
Received: 26 January 2024
Published in TALLIP Volume 23, Issue 11


Author Tags

  1. NLP
  2. text summarization
  3. abstractive summarization
  4. extractive summarization
  5. dataset
  6. benchmarking
  7. semantic similarity
  8. data diversity

Qualifiers

  • Short-paper
