
Social-sum-Mal: A Dataset for Abstractive Text Summarization in Malayalam

Published: 21 November 2024

Abstract

Abstractive text summarization for the Malayalam language is still in its infancy. The lack of benchmark datasets for this task is one of the main constraints on developing and testing good models. Malayalam has seven nominal case forms, two nominal number forms, and three gender forms, and it undergoes extreme agglutination and inflection. As a result, translating existing text summarization datasets into Malayalam may not capture these case forms effectively, so datasets curated from scratch are needed for Malayalam-specific text-processing applications. This article introduces a novel dataset designed specifically to advance automatic abstractive text summarization in Malayalam. The dataset, called Social-sum-Mal, is curated to address the unique linguistic characteristics of the language and supports three types of summarization tasks: long, extreme, and query-based summarization. In addition, Social-sum-Mal can be extended to other applications such as text classification, multi-document summarization, and question answering. To enhance transparency, a datasheet is provided for Social-sum-Mal. Data accuracy and annotator bias are evaluated using appropriate testing strategies, including Jaccard, cosine, and overlap similarities. The correctness of the dataset is further assessed by comparing it against a deep learning based text summarization model.
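The abstract notes that data accuracy and annotator bias were checked with Jaccard, cosine, and overlap similarities. The following sketch is a minimal, hypothetical illustration of how such token-level scores can be computed between two annotator-written summaries of the same article; it is not the authors' evaluation pipeline, and the whitespace tokenization and function names are assumptions made for illustration only.

```python
# Minimal sketch (assumed, not the authors' pipeline): Jaccard, overlap,
# and bag-of-words cosine similarity between two candidate summaries.
import math
from collections import Counter

def jaccard(a: set, b: set) -> float:
    """Jaccard similarity: |A ∩ B| / |A ∪ B| over the two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def overlap(a: set, b: set) -> float:
    """Overlap coefficient: |A ∩ B| / min(|A|, |B|)."""
    return len(a & b) / min(len(a), len(b)) if a and b else 0.0

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between term-frequency (bag-of-words) vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def compare(summary_1: str, summary_2: str) -> dict:
    # Whitespace tokenization is a simplification; agglutinative Malayalam
    # text would normally be segmented with a proper tokenizer first.
    t1, t2 = summary_1.split(), summary_2.split()
    return {
        "jaccard": jaccard(set(t1), set(t2)),
        "overlap": overlap(set(t1), set(t2)),
        "cosine": cosine(Counter(t1), Counter(t2)),
    }

if __name__ == "__main__":
    print(compare("മലയാളം സംഗ്രഹം ഉദാഹരണം", "മലയാളം സംഗ്രഹം മാതൃക"))
```

In practice, scores like these would be aggregated over all annotator pairs in the dataset to flag documents whose reference summaries diverge strongly, which is one simple way to surface annotator bias.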



Published In

ACM Transactions on Asian and Low-Resource Language Information Processing, Volume 23, Issue 11
November 2024
248 pages
EISSN: 2375-4702
DOI: 10.1145/3613714

Publisher

Association for Computing Machinery

New York, NY, United States

Publication History

Published: 21 November 2024
Online AM: 16 September 2024
Accepted: 02 September 2024
Revised: 07 July 2024
Received: 26 January 2024
Published in TALLIP Volume 23, Issue 11


Author Tags

  1. NLP
  2. text summarization
  3. abstractive summarization
  4. extractive summarization
  5. dataset
  6. benchmarking
  7. semantic similarity
  8. data diversity

Qualifiers

  • Short-paper
