Abstract
Unsupervised random-walk keyphrase extraction models rely mainly on the global structural information of the word graph, in which nodes represent candidate words and edges capture co-occurrence between candidate words. However, using word embeddings to integrate multiple kinds of useful information into the random-walk model for better keyphrase extraction remains relatively unexplored. In this paper, we propose a random-walk-based ranking method that uses word embeddings to extract keyphrases from text documents. Specifically, we first design a heterogeneous text graph embedding model that integrates the local context information of the word graph (i.e., the local word collocation patterns) with crucial features of the candidate words and edges of the word graph. Then, we design a novel random-walk-based ranking model that scores candidate words using the learned word embeddings. Finally, we propose a new, generic similarity-based phrase scoring model that uses word embeddings to score phrases, and the top-scoring phrases are selected as keyphrases. Experimental results show that the proposed method consistently outperforms eight state-of-the-art unsupervised methods on three real datasets for keyphrase extraction.
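As a rough illustration of the word-scoring step only, the sketch below runs a weighted PageRank-style random walk over a co-occurrence word graph whose edge weights are scaled by embedding similarity. It is not the ranking model proposed in the paper: the edge weighting (co-occurrence count times non-negative cosine similarity), the function names `cosine`, `build_graph` and `rank_words`, the window size, the damping factor and the toy vectors are all assumptions made for this example.

```python
from collections import defaultdict
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is zero."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def build_graph(tokens, embeddings, window=2):
    """Build an undirected co-occurrence graph over the candidate words.

    Assumed weighting (illustrative, not taken from the paper):
    co-occurrence count within `window` tokens, multiplied by the
    cosine similarity of the two word vectors, clipped at zero.
    """
    counts = defaultdict(float)
    for i, word in enumerate(tokens):
        for j in range(i + 1, min(i + window + 1, len(tokens))):
            other = tokens[j]
            if other != word:
                counts[tuple(sorted((word, other)))] += 1.0

    graph = defaultdict(dict)
    for (a, b), count in counts.items():
        weight = count * max(cosine(embeddings[a], embeddings[b]), 0.0)
        if weight > 0.0:
            graph[a][b] = weight
            graph[b][a] = weight
    return graph


def rank_words(graph, damping=0.85, iterations=50):
    """Score words with weighted PageRank computed by power iteration."""
    nodes = sorted(graph)
    n = len(nodes)
    scores = {w: 1.0 / n for w in nodes}
    out_weight = {w: sum(graph[w].values()) for w in nodes}
    for _ in range(iterations):
        updated = {}
        for w in nodes:
            # Each neighbour v passes on a share of its score that is
            # proportional to the weight of the edge (v, w).
            incoming = sum(
                scores[v] * graph[v][w] / out_weight[v] for v in graph[w]
            )
            updated[w] = (1.0 - damping) / n + damping * incoming
        scores = updated
    return scores


if __name__ == "__main__":
    # Toy input: hypothetical tokens and 2-d vectors, for illustration only.
    tokens = ["word", "graph", "embedding", "word", "ranking", "graph"]
    vectors = {
        "word": [1.0, 0.2],
        "graph": [0.8, 0.6],
        "embedding": [0.9, 0.1],
        "ranking": [0.3, 1.0],
    }
    scores = rank_words(build_graph(tokens, vectors))
    for word in sorted(scores, key=scores.get, reverse=True):
        print(word, round(scores[word], 4))
```

Scaling edges by similarity lets the walk favor transitions between words that both co-occur and are close in the embedding space, which is one simple way such learned vectors could inform a random-walk ranker.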
Acknowledgements
This work was partially supported by grants from the National Natural Science Foundation of China (Nos. U1333109, 61632011, 61573231, U1533104), the Department of Industrial and Systems Engineering, Hong Kong Polytechnic University (Project code H-ZG3K), and the Open Project Foundation of Intelligent Information Processing Key Laboratory of Shanxi Province (No. CICIP2018004).
Ethics declarations
Conflict of interest
All authors declare that they have no conflict of interest.
Ethical approval
This article does not contain any studies with human participants or animals performed by any of the authors.
Additional information
Communicated by B. B. Gupta.
Publisher's Note
Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.
About this article
Cite this article
Zhang, Y., Liu, H., Wang, S. et al. Automatic keyphrase extraction using word embeddings. Soft Comput 24, 5593–5608 (2020). https://doi.org/10.1007/s00500-019-03963-y
DOI: https://doi.org/10.1007/s00500-019-03963-y