Abstract
Text summarization is the process of condensing an input document into a shorter form while preserving its overall meaning. It is primarily achieved in two ways: abstractive and extractive. Extractive summarizers select a few of the best sentences from the input document, while abstractive methods may modify the sentence structure or introduce new sentences. The proposed approach is an extractive text summarization technique in which we extend topic modeling so that it can be applied to multiple lower-level specialized entities (i.e., groups) embedded in a single document. Our goal is to overcome the lack of coherence found in existing summarization techniques. Topic modeling was initially proposed to model text data at the multi-document and word levels, without considering sentence modeling. It was subsequently applied at the sentence level and used for document summarization, but with certain limitations: topic modeling does not perform as expected when applied to a single document at the sentence level. To address this shortcoming, we propose a summarization approach that operates at the level of the individual document and its clusters (instead of the sentence level). We aim to choose the best sentence from each group (containing sentences of the same kind) found in the given text, and we select the most suitable topic by evaluating the probability distribution of the words and their respective topics at the cluster level. The method is evaluated on two standard datasets and shows significant performance gains over existing text summarization techniques. In automatic evaluation, the ROUGE metrics show a considerable improvement in the F-measure, precision, and recall of the generated summary compared with other text summarization techniques. Furthermore, a manual evaluation has demonstrated that the proposed approach outperforms the current state-of-the-art text summarization approaches.
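The cluster-then-select idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the greedy Jaccard clustering, the smoothed per-cluster unigram distribution (a simple stand-in for a topic model's word distribution), the stopword list, the threshold value, and all function names are assumptions introduced here purely for illustration.

```python
import math
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not taken from the paper).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "are", "on", "for", "with"}


def tokenize(sentence):
    """Lowercase a sentence and keep alphabetic tokens that are not stopwords."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def jaccard(a, b):
    """Jaccard similarity between two token collections."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def cluster_sentences(token_lists, threshold=0.2):
    """Greedy single-pass clustering (stand-in for the paper's clustering step).

    Each sentence joins the most similar existing cluster, compared against the
    cluster's pooled vocabulary, or starts a new cluster if none reaches the threshold.
    """
    clusters = []  # each cluster is a list of sentence indices
    for i, tokens in enumerate(token_lists):
        best, best_sim = None, threshold
        for cluster in clusters:
            pooled = [w for j in cluster for w in token_lists[j]]
            sim = jaccard(tokens, pooled)
            if sim >= best_sim:
                best, best_sim = cluster, sim
        if best is None:
            clusters.append([i])
        else:
            best.append(i)
    return clusters


def pick_representative(cluster, token_lists):
    """Pick the sentence whose words are most probable under the cluster's
    add-one-smoothed unigram distribution (stand-in for a topic's word distribution)."""
    counts = Counter(w for j in cluster for w in token_lists[j])
    total = sum(counts.values())
    vocab = len(counts)

    def score(j):
        tokens = token_lists[j]
        if not tokens:
            return float("-inf")
        log_probs = (math.log((counts[w] + 1) / (total + vocab)) for w in tokens)
        return sum(log_probs) / len(tokens)  # mean log-probability per word

    return max(cluster, key=score)


def summarize(sentences, threshold=0.2):
    """Return one representative sentence per cluster, in original document order."""
    token_lists = [tokenize(s) for s in sentences]
    clusters = cluster_sentences(token_lists, threshold)
    chosen = sorted(pick_representative(c, token_lists) for c in clusters)
    return [sentences[i] for i in chosen]


sentences = [
    "Cats chase mice in the garden.",
    "Cats and mice play in the garden.",
    "Stock markets fell sharply today.",
    "Markets fell as stock prices dropped.",
]
summary = summarize(sentences)
```

Because every output sentence is copied verbatim from the input, the sketch is extractive by construction; a real system would replace the two stand-ins with an actual clustering algorithm and a trained topic model.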
Data availability
Enquiries about data availability should be directed to the authors.
Funding
No funding was received for this work.
Author information
Contributions
RCB and AG conceived this research and designed experiments. SR participated in editing and drafting the article. All authors read and approved the final manuscript.
Ethics declarations
Conflict of interest
There is no conflict of interest associated with this publication. The authors declare that there are no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Ethical approval
Not applicable. This article does not contain any studies with human participants or animals performed by the authors. Formal consent is not required.
Informed consent
Not applicable. No individual or personal data were used.
Rights and permissions
Springer Nature or its licensor holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Belwal, R.C., Rai, S. & Gupta, A. Extractive text summarization using clustering-based topic modeling. Soft Comput 27, 3965–3982 (2023). https://doi.org/10.1007/s00500-022-07534-6
DOI: https://doi.org/10.1007/s00500-022-07534-6