Abstract
The quality of the human-readable summaries available in existing datasets is inadequate, which makes it difficult to build an accurate text summarization model. Although recent work has addressed this issue and laid a strong foundation for further improvement, it still has many limitations. In response, this paper proposes a novel methodology that uses topic modeling and a classification technique to summarize a corpus of documents into a coherent summary. The objectives of the proposed work are highlighted below:
- A novel heuristic approach is introduced to determine the actual number of topics present in a corpus of documents; it handles the stochastic nature of latent Dirichlet allocation.
- A large corpus of documents is handled by reducing the huge set of sentences to a small set without losing the important ones, thus providing a concise and information-rich summary.
- Sentences are arranged in the coherent summary according to their importance.
- The experimental results are compared with state-of-the-art summarization systems.
The empirical results show that the proposed model is more promising than well-known text summarization models.
Notes
For experimental purposes, values in the range 0.2 to 0.8 were tested in steps of 0.05; 0.4 performed the best among them.
Since reduction is being performed, \(2/X < 1\).
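The threshold sweep described in the first note can be sketched as follows. This is a minimal illustration and not the author's code: the function names are hypothetical, and `toy_score` is a stand-in for whatever evaluation measure the experiments actually used, chosen here only so that it peaks at the reported value of 0.4.

```python
from typing import Callable, List, Tuple


def threshold_grid(start: float = 0.2, stop: float = 0.8,
                   step: float = 0.05) -> List[float]:
    """Return start, start + step, ..., stop.

    Values are built from an integer index and rounded to two decimals
    so that repeated float addition does not drift (assumes the step
    has at most two decimal places).
    """
    n_steps = round((stop - start) / step)
    return [round(start + i * step, 2) for i in range(n_steps + 1)]


def best_threshold(score: Callable[[float], float],
                   grid: List[float]) -> Tuple[float, float]:
    """Return the (threshold, score) pair with the highest score.

    On ties, the earliest threshold in the grid is kept.
    """
    best = max(grid, key=score)
    return best, score(best)


def toy_score(t: float) -> float:
    """Hypothetical stand-in for a real summary-quality measure."""
    return -abs(t - 0.4)


grid = threshold_grid()
chosen, chosen_score = best_threshold(toy_score, grid)
print(len(grid), chosen)
```

In a real experiment, `toy_score` would be replaced by a function that runs the summarizer with the given threshold and returns its evaluation score.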
Ethics declarations
Conflict of interest
The author declared that he has no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by the author.
Additional information
Communicated by V. Loia.
Cite this article
Roul, R.K. Topic modeling combined with classification technique for extractive multi-document text summarization. Soft Comput 25, 1113–1127 (2021). https://doi.org/10.1007/s00500-020-05207-w
DOI: https://doi.org/10.1007/s00500-020-05207-w